We briefly discussed model evaluation in our earlier articles on problem definition and data collection and on model training. Today’s article goes into more detail on model evaluation, including how to choose the right metrics, how to calculate them, and how to gain insights from them.
1. The Test Set
Remember the different data sets that we split our data into? If not, don’t worry. Here is a refresher.
- Training Data Set – Data set used to initially train our model.
- Validation Data Set – Data set used to check the model during training and to guide adjustments before the final test.
- Test Data Set – Data set that acts as an unbiased judge of our model’s ability to generalize to unseen data. Evaluating on data the model has never seen is crucial for detecting “overfitting”, where our model performs well on the training data but struggles with new data.
| Data Set | Share of Data | Purpose |
| --- | --- | --- |
| Training Set | 60% – 80% | The majority of our data is used to train our model. This data sets the initial internal parameters of our model’s algorithm. |
| Validation Set | 10% – 20% | The validation set is used to check that our model can process novel data and produce the desired result. |
| Test Set | 10% – 20% | Finally, we keep 10% to 20% of the data as a test set to perform final checks and validations against our model. |
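To make the split concrete, here is a minimal sketch using only Python's standard library. The function name `split_dataset`, the 70/15/15 fractions, and the fixed seed are assumptions chosen for illustration; they fall inside the ranges in the table but are not the only reasonable choice.

```python
import random

def split_dataset(rows, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle rows and split them into train, validation, and test lists."""
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a non-empty test set")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains becomes the test set
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Shuffling before splitting matters when the data is stored in a meaningful order (for example, sorted by class or by date); for time series, a chronological split is usually the safer assumption.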
2. Choosing the Right Metrics
Metrics are key measures that let us explain and understand the performance of our machine learning model so that we can judge whether its outputs succeed or fail. Before we talk about these metrics and how they are calculated, we should align on several definitions.
- True Positives (TP) – Correctly predicted positive cases.
- True Negatives (TN) – Correctly predicted negative cases.
- False Positives (FP) – Negative cases incorrectly predicted as positive (Type 1 error).
- False Negatives (FN) – Positive cases incorrectly predicted as negative (Type 2 error).
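The four definitions above can be counted directly from a list of actual labels and a list of predicted labels. The sketch below assumes binary labels where `1` is the positive class; the function name `confusion_counts` and the sample data are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative (Type 1)
        else:
            if actual == positive:
                fn += 1   # predicted negative, actually positive (Type 2)
            else:
                tn += 1   # predicted negative, actually negative
    return tp, tn, fp, fn

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=3 TN=3 FP=1 FN=1
```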
2.1. Classification: Predicting Categories
Accuracy
Percentage of correct predictions overall. Simple, but can be misleading if class distributions are imbalanced.
Accuracy = (True Positives + True Negatives) / (Total Predictions)
Recall (Sensitivity)
Out of all actual positive cases, how many did the model correctly identify? Focuses on minimizing false negatives.
Recall = True Positives / (True Positives + False Negatives)
Precision
Out of all positive predictions, how many were actually correct? Focuses on avoiding false positives.
Precision = True Positives / (True Positives + False Positives)
F1-Score
The harmonic mean of precision and recall. Balances both type 1 and type 2 errors.
F1-Score = 2 * ((Precision * Recall) / (Precision + Recall))
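The four classification formulas above can be computed from the TP, TN, FP, and FN counts. The sketch below is one way to do it; the helper names and the example counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not a universal rule. The example uses a deliberately imbalanced data set (5 positives out of 100) to show how accuracy can look strong while recall is poor.

```python
def safe_divide(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, recall, precision, and F1 from confusion counts."""
    accuracy = safe_divide(tp + tn, tp + tn + fp + fn)
    recall = safe_divide(tp, tp + fn)
    precision = safe_divide(tp, tp + fp)
    f1 = safe_divide(2 * precision * recall, precision + recall)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}

# Imbalanced example: 5 actual positives, of which the model finds only 1.
metrics = classification_metrics(tp=1, tn=95, fp=0, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here accuracy is 0.96 even though the model misses four of the five positive cases (recall 0.2), which is why F1 is often reported alongside accuracy.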
2.2. Regression: Predicting Continuous Values
Mean Absolute Error
Average absolute difference between predictions and actual values. Lower MAE indicates better performance.
MAE = Σ|yᵢ - ŷᵢ| / n
Mean Squared Error (MSE)
Average squared difference between predictions and actual values. Lower MSE indicates better performance.
MSE = Σ(yᵢ - ŷᵢ)² / n
Root Mean Squared Error (RMSE)
Square root of the Mean Squared Error. Unlike MSE, it is expressed in the same units as the target variable.
RMSE = √(MSE) = √[Σ(yᵢ - ŷᵢ)² / n]
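The three error formulas above translate almost line for line into code. This is a minimal sketch with invented sample values; the function names are assumptions for illustration.

```python
import math

def mean_absolute_error(y_true, y_pred):
    """MAE: mean of |y_i - yhat_i|."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """MSE: mean of (y_i - yhat_i)^2."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    """RMSE: square root of the MSE."""
    return math.sqrt(mean_squared_error(y_true, y_pred))

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)       # 0.75
mse = mean_squared_error(y_true, y_pred)        # 0.875
rmse = root_mean_squared_error(y_true, y_pred)
print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f}")
```

Because the errors are squared before averaging, MSE and RMSE penalize the single large miss (1.5) more heavily than MAE does.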
Coefficient of determination (R-Squared)
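Proportion of variance in the target variable that’s explained by the model. Higher is better, with 1 indicating a perfect fit and 0 indicating a model no better than always predicting the mean. The value can fall below 0 when the model performs worse than that baseline.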
R-Squared = 1 - (Residual Sum of Squares/Total Sum of Squares)
Residual Sum of Squares = Σ(yᵢ - ŷᵢ)²
Total Sum of Squares = Σ(yᵢ - mean(y))²
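The R-Squared formula above can be sketched as follows. The function name `r_squared` and the sample values are made up for illustration, and raising an error when every target value is identical is a choice made here because the total sum of squares is then zero.

```python
def r_squared(y_true, y_pred):
    """R-Squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    rss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    tss = sum((y - mean_y) ** 2 for y in y_true)
    if tss == 0:
        raise ValueError("R-Squared is undefined when all targets are equal")
    return 1 - rss / tss

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
print(f"model:    {r_squared(y_true, y_pred):.3f}")

# Baseline: always predict the mean of the targets, which scores exactly 0.
mean_y = sum(y_true) / len(y_true)
baseline = [mean_y] * len(y_true)
print(f"baseline: {r_squared(y_true, baseline):.3f}")
```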
2.3. Other Metrics
- Area Under the Receiver Operating Characteristic Curve (AUC-ROC) – Used for binary classification. The AUC value ranges from 0 to 1, where 1 indicates perfect classification and 0.5 is no better than random guessing.
- Bilingual Evaluation Understudy (BLEU) – BLEU is a commonly used metric for evaluating the quality of machine translation systems. It scores a translation by how closely its word sequences match one or more reference translations.
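One way to build intuition for AUC-ROC is its pairwise interpretation: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The sketch below computes AUC that way rather than by tracing the ROC curve. The function name `auc_pairwise` and the scores are invented for illustration, and the double loop is only practical for small data sets.

```python
def auc_pairwise(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == positive]
    neg = [s for y, s in zip(y_true, scores) if y != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # model's predicted probability of class 1
auc = auc_pairwise(y_true, scores)
print(f"AUC = {auc:.2f}")  # AUC = 0.75
```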
3. Analyzing Errors: Unveiling Insights
It’s important to look beyond the overall scores when evaluating our model. We should carefully examine the individual cases where our model makes mistakes.
- Types of Errors – Are there specific classes of data that it struggles with? Certain input patterns it consistently gets wrong?
- Error Distribution – Are errors clustered in certain regions of the data or output space?
- Visualizations – Plots, graphs, and heatmaps can help visualize error patterns.
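As a starting point for the questions above, errors can be grouped by actual class to see where the model struggles most. This is a minimal sketch for a classification model; the function name `error_breakdown` and the animal labels are hypothetical.

```python
from collections import Counter

def error_breakdown(y_true, y_pred):
    """Return per-class error rates and counts of (actual, predicted) mistakes."""
    totals = Counter(y_true)   # how many examples of each actual class
    errors = Counter()         # how many of those were misclassified
    confusions = Counter()     # which wrong label was predicted instead
    for actual, predicted in zip(y_true, y_pred):
        if actual != predicted:
            errors[actual] += 1
            confusions[(actual, predicted)] += 1
    error_rates = {label: errors[label] / totals[label] for label in totals}
    return error_rates, confusions

y_true = ["cat", "cat", "dog", "dog", "dog", "bird", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "cat", "cat", "bird"]

error_rates, confusions = error_breakdown(y_true, y_pred)
for label, rate in sorted(error_rates.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {rate:.0%} misclassified")
print("most common mistake:", confusions.most_common(1)[0])
```

In this made-up data the model most often mistakes birds for cats, which is the kind of pattern that overall accuracy alone would hide.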
4. Guiding Improvement
- Data Collection – Identify areas where more or better-quality data is needed.
- Feature Engineering – Create new features that might capture relevant information missing from the original dataset.
- Algorithm Selection/Tuning – Experiment with different algorithms or fine-tune existing ones for specific types of errors.

