In part 6 of our Machine Learning Concepts series we are going to discuss model training. Up until now we’ve identified the goal of our machine learning model, pre-processed our data into a standard format, identified the features of interest in our data, and selected the models that we’ll be running our data against.
1. Data Splitting
Data splitting is a simple, yet important, step in machine learning deployment. The goal of data splitting is to split the available data into three distinct data sets that will be used to train, validate, and test our model to ensure precision and accuracy.
Accuracy is a representation of how often our machine learning model correctly predicts the outcome, across all classes.
Precision is a representation of how often our machine learning model is right when it predicts the positive class: of all the positive predictions it makes, the share that are truly positive.
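To make the difference between the two metrics concrete, here is a minimal sketch that computes both by hand. The function names and the labels in `y_true` and `y_pred` are made up for illustration, and returning 0.0 when there are no positive predictions is simply a convention chosen here.

```python
def accuracy(y_true, y_pred):
    """Fraction of all predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision(y_true, y_pred, positive=1):
    """Fraction of positive predictions that are truly positive."""
    # True labels for every example the model predicted as positive
    predicted_positive = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_positive:
        return 0.0  # assumption: report 0.0 when nothing was predicted positive
    true_positive = sum(1 for t in predicted_positive if t == positive)
    return true_positive / len(predicted_positive)


# Hypothetical labels: 1 is the positive class, 0 the negative class
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 1]

print(accuracy(y_true, y_pred))   # 5 of 8 predictions are correct
print(precision(y_true, y_pred))  # 3 of 5 positive predictions are correct
```

The two numbers differ because accuracy counts every prediction, while precision only looks at the cases where the model said "positive".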
| Data Set | Share of Data | Purpose | 
| --- | --- | --- | 
| Training Set | 60% – 80% | The majority of our data is used to train our model. This data sets the initial internal parameters of our model’s algorithm. | 
| Validation Set | 10% – 20% | The validation set is used during training to check that our model produces the desired result on data it has not been trained on. | 
| Test Set | 10% – 20% | Finally, we keep 10% to 20% of the data to use as a test set for final checks and validation of our model. | 
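As a rough illustration of the split described above, the sketch below shuffles a data set and cuts it into three parts. The function name `split_data`, the 70/15/15 proportions, and the fixed seed are assumptions made for this example; any proportions within the ranges in the table would work the same way.

```python
import random


def split_data(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle rows and split them into train, validation, and test lists."""
    shuffled = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]        # whatever remains is the test set
    return train, val, test


# 100 stand-in records, represented here by the numbers 0..99
train, val, test = split_data(range(100))
print(len(train), len(val), len(test))
```

Shuffling before splitting matters when the data is ordered (for example by date or by class), because otherwise one of the sets could end up with a very different mix of examples than the others.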
2. Model Training
Model training is perhaps the simplest step in deploying our machine learning model, because the algorithm does most of the work and most libraries start from a default set of hyperparameters. We simply need to feed our training data into the model and allow the model’s algorithm to repeatedly process it, adjusting its internal parameters in order to minimize the difference between its predictions and the actual values.
For example, a linear regression algorithm will attempt to find the best-fitting line that minimizes the sum of squared errors between the predicted values on the training set and the actual target values.
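For a single input feature, the best-fitting line can be computed directly. The following is a minimal sketch using the standard least-squares formulas; the function names and the sample points are invented for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) that minimize the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_errors(xs, ys, slope, intercept):
    """Sum of squared differences between predictions and actual values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical training data that roughly follows y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(sum_squared_errors(xs, ys, slope, intercept))
```

Any other line through these points, for example one with a slightly larger slope, gives a larger sum of squared errors than the fitted one.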
2.1. Hyperparameters
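Machine learning models use settings called hyperparameters to control how and what they learn. Unlike the internal parameters that are adjusted during training, hyperparameters (for example, the learning rate or the number of training iterations) are chosen before training begins.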
3. Optimization
Most machine learning algorithms use built-in optimization techniques in order to find the best parameters. Let’s look at common methods for optimizing our machine learning model.
3.1. Gradient-Based Methods
Gradient-based methods rely on calculating the gradient of the loss function, which points in the direction of steepest ascent. These methods iteratively update parameters in the direction of the negative gradient (the direction of steepest descent) to minimize loss.
- Gradient Descent – The most basic method, which updates parameters in proportion to the negative gradient. Can be slow. Types of gradient descent include Batch Gradient Descent and Stochastic Gradient Descent (SGD).
- Variations of Stochastic Gradient Descent (SGD) – Variations of SGD include Mini-Batch Gradient Descent and Momentum.
- Adaptive Learning Rate Methods – Adaptive learning rate methods allow us to adjust the learning rate during training. Methods include Adagrad, RMSprop, and Adaptive Moment Estimation (Adam).
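The basic update rule can be sketched in a few lines. This example fits a line `y = w * x + b` with batch gradient descent on mean squared error. The data, the function name, and the `learning_rate` and `epochs` values are assumptions picked so that this small problem converges; they are not recommended defaults.

```python
def gradient_descent(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0          # internal parameters, adjusted during training
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every training example (the full batch)
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradient of the mean squared error with respect to w and b
        grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / n) * sum(errors)
        # Step in the direction of the negative gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical training data that roughly follows y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

w, b = gradient_descent(xs, ys)
print(w, b)
```

Here `w` and `b` are the parameters the algorithm learns, while `learning_rate` and `epochs` are hyperparameters that we choose up front. A learning rate that is too large makes the updates overshoot and diverge, and one that is too small makes training slow.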
3.2. Second-Order Methods
Second-order methods use the Hessian matrix (the matrix of second derivatives) to approximate the curvature of the loss function. These methods often converge in fewer iterations, but each iteration is more computationally expensive.
- Newton’s Method – Directly uses the Hessian matrix for updates.
- Quasi-Newton Methods – Approximate the Hessian using previous gradient information. A popular quasi-Newton optimization method is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm because it offers good performance and is relatively efficient.
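To show how curvature information is used, here is a minimal one-dimensional sketch of Newton's method. With a single variable, the Hessian reduces to the second derivative. The example function `x**4 - 3*x**3 + 2`, the starting point, and the helper names are assumptions chosen for illustration.

```python
def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Minimize a 1-D function given its first and second derivatives."""
    x = x0
    for _ in range(max_iter):
        # Newton step: gradient scaled by the inverse of the curvature
        step = grad(x) / hess(x)
        x -= step
        if abs(step) < tol:
            break
    return x


# Example loss: f(x) = x**4 - 3*x**3 + 2, which has a minimum at x = 2.25
def grad(x):
    return 4 * x**3 - 9 * x**2   # first derivative of f


def hess(x):
    return 12 * x**2 - 18 * x    # second derivative of f (the 1-D "Hessian")


x_min = newton_minimize(grad, hess, x0=3.0)
print(x_min)
```

This sketch assumes the curvature is positive near the starting point. Where the second derivative is zero or negative, a plain Newton step can fail or head toward a maximum, which is one reason practical implementations add safeguards.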
3.3. Other Optimization Techniques
- Genetic Algorithms – Inspired by biological evolution, these methods use a population of solutions that evolve over generations through processes that include selection, crossover, and mutation.
- Simulated Annealing (SA) – Simulated annealing is a probabilistic method for approximating the global optimum of a given function. In SA, we allow the algorithm to occasionally make “bad” moves, which lets the search move in the “wrong” direction. Accepting some “bad” moves as well as “good” moves, with the chance of accepting a bad move shrinking as the search goes on, helps the search reach a global minimum instead of getting stuck in a local minimum. In other words, the search is able to explore minima across a large search space and not get too focused on a single valley.
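The acceptance of “bad” moves in simulated annealing can be sketched as follows. The test function `bumpy`, the starting temperature, the cooling rate, and the step size are all assumptions picked for this example. The function has a shallow valley near x = 1.35 and a deeper one near x = -1.47, and the search starts on the side of the shallow valley.

```python
import math
import random


def simulated_annealing(f, x0, temp=5.0, cooling=0.999, steps=10_000, seed=0):
    """Search for a minimum of f, sometimes accepting worse candidates."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_fx = x, fx              # remember the best point seen so far
    for _ in range(steps):
        candidate = x + rng.uniform(-1.0, 1.0)
        f_candidate = f(candidate)
        delta = f_candidate - fx
        # Always accept improvements; accept "bad" moves with
        # probability exp(-delta / temp), which shrinks as temp falls
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            x, fx = candidate, f_candidate
            if fx < best_fx:
                best_x, best_fx = x, fx
        temp *= cooling                  # cool down gradually
    return best_x, best_fx


def bumpy(x):
    """Example function with a local and a global minimum."""
    return x**4 - 4 * x**2 + x


best_x, best_fx = simulated_annealing(bumpy, x0=2.0)
print(best_x, best_fx)
```

A plain downhill search started at x = 2.0 would settle in the nearby shallow valley. Because early moves are accepted even when they make things worse, the search can climb over the ridge between the valleys and find the deeper one.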
4. Iteration
Although our machine learning algorithm may perform built-in iteration, manually training our model multiple times, for example with adjusted hyperparameters, may be necessary in order to obtain the appropriate performance and trueness.
5. Monitoring and Evaluation
Our validation set can be used during the training process to monitor and evaluate the overall performance of our model. If accuracy on the validation set begins to decrease while accuracy on the training set keeps improving, it may signal overfitting, suggesting that adjustments like reducing complexity or adding regularization are needed.
Finally, our remaining test data set should be used to perform final confirmation of the trueness of our machine learning model. If our model performs as expected when introduced to the test data set, we can be fairly certain that model training has been performed successfully.
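One common way to act on this signal is to stop training once validation accuracy stops improving. The sketch below is a simplified version of that idea. The function name, the `patience` value, and the accuracy figures are made up for illustration.

```python
def best_epoch_with_early_stopping(val_accuracies, patience=3):
    """Return the index of the best epoch seen before training is stopped.

    Training stops once validation accuracy has failed to improve
    for `patience` consecutive epochs.
    """
    best_accuracy = float("-inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_accuracy:
            best_accuracy, best_epoch = acc, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                     # stop: likely overfitting
    return best_epoch


# Hypothetical validation accuracy per epoch: it peaks, then declines
val_accuracies = [0.70, 0.78, 0.83, 0.85, 0.84, 0.82, 0.80, 0.90]
print(best_epoch_with_early_stopping(val_accuracies))
```

In this made-up run, accuracy peaks at epoch 3 and then falls for three epochs in a row, so training stops before the last value is ever reached. A larger `patience` tolerates longer plateaus at the cost of more training time.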

