In part 6 of our Machine Learning Concepts series we are going to discuss model training. Up until now we’ve identified the goal of our machine learning model, have pre-processed our data into a standard format, identified the features of interest in our data, and selected the necessary models that we’ll be running our data against.
Data splitting is a simple, yet important, step in machine learning deployment. The goal of data splitting is to split the available data into three distinct data sets that will be used to train, validate, and test our model to ensure precision and accuracy.
Accuracy is a representation of how often our machine learning model correctly predicts the outcome, across all classes.
Precision is a representation of how often our machine learning model is right when it predicts the positive class: of all the cases predicted as positive, the share that actually are positive.
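As a minimal sketch of how these two measures can be computed for a binary classifier, the snippet below counts matches by hand. The function names and the label lists are hypothetical and exist only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of all predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision(y_true, y_pred, positive=1):
    """Fraction of positive predictions that are truly positive."""
    true_pos = sum(
        1 for t, p in zip(y_true, y_pred) if p == positive and t == positive
    )
    pred_pos = sum(1 for p in y_pred if p == positive)
    # With no positive predictions precision is undefined; 0.0 is a convention.
    return true_pos / pred_pos if pred_pos else 0.0


# Hypothetical labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

acc = accuracy(y_true, y_pred)    # 5 of 8 predictions are correct
prec = precision(y_true, y_pred)  # 3 of 5 positive predictions are correct
print(f"accuracy={acc:.3f} precision={prec:.3f}")
```

The two numbers differ here because accuracy looks at every prediction, while precision only looks at the cases the model flagged as positive.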
Data Set | Share of Data | Purpose |
Training Set | 60% – 80% | The majority of our data is used to train our model. This data is used to fit the internal parameters of our model’s algorithm. |
Validation Set | 10% – 20% | The validation set is used during training to check that our model produces the desired result on data it has not been trained on. |
Test Set | 10% – 20% | Finally, we keep 10% to 20% of the data to use as a test set to perform final checks against our trained model. |
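The split described in the table can be sketched in a few lines of Python. This is only an illustration under assumed values: the 70/15/15 ratio, the fixed seed, and the `split_data` helper are choices made for this example, not requirements.

```python
import random


def split_data(records, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle records and split them into train, validation, and test lists."""
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test set")
    shuffled = list(records)
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains becomes the test set
    return train, val, test


records = list(range(100))  # stand-in for 100 real samples
train, val, test = split_data(records)
print(len(train), len(val), len(test))
```

Shuffling before splitting matters when the original data is ordered (by date or by class, for example), since otherwise the three sets may not resemble each other.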
Model training is perhaps the simplest of the activities that we’ll perform in the deployment of our machine learning model, because the algorithm does most of the work: it starts from an initial set of internal parameters and adjusts them itself. We simply need to feed our training data into the model and allow the model’s algorithm to repeatedly process the training data, adjusting its internal parameters in order to minimize the difference between its predictions and the actual values.
For example, a linear regression algorithm will attempt to find the best-fitting line that minimizes the sum of squared errors between the predicted values on the training set and the actual target values.
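To make the linear regression example concrete, here is a minimal sketch that fits a line to a handful of points using the standard closed-form least squares formulas. The data points and function names are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_errors(xs, ys, slope, intercept):
    """Sum of squared differences between predictions and targets."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical training data that roughly follows y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
sse = sum_squared_errors(xs, ys, slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f} sse={sse:.4f}")
```

Any other line through the same points, such as one with a slightly steeper slope, gives a larger sum of squared errors than the fitted one.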
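Machine learning models use settings called hyperparameters to help control how and what they learn. Unlike the internal parameters that are adjusted during training, hyperparameters (the learning rate, for example) are chosen before training begins.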
Most machine learning algorithms use built-in optimization techniques in order to find the best parameters. Let’s look at common methods for optimizing our machine learning model.
Gradient-based methods rely on calculating the gradient of the loss function, which points in the direction of steepest ascent. These methods iteratively update parameters in the direction of the negative gradient to minimize loss.
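A minimal sketch of plain gradient descent is shown below, fitting a line by minimizing mean squared error. The data, the learning rate of 0.02, and the step count are assumptions chosen so that this small example converges; real problems usually need tuning.

```python
def gradient_descent(xs, ys, learning_rate=0.02, steps=5000):
    """Fit y = w * x + b by minimizing mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient to reduce the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical training data that roughly follows y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

w, b = gradient_descent(xs, ys)
print(f"w={w:.3f} b={b:.3f}")
```

With a learning rate that is too large the updates overshoot and the loss grows instead of shrinking, which is one reason the learning rate is such a commonly tuned setting.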
Second-order methods use the Hessian matrix (the matrix of second derivatives) to approximate the curvature of the loss function. These methods often converge in fewer iterations; however, each iteration is more computationally expensive.
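As an illustration of a second-order update, the sketch below applies a single Newton step to a line-fitting problem with a mean squared error loss. Because that loss is exactly quadratic, one step is enough here; that is a property of this toy problem, not of second-order methods in general. The data and the `newton_step` helper are hypothetical.

```python
def newton_step(xs, ys, w, b):
    """One Newton update for fitting y = w * x + b under mean squared error."""
    n = len(xs)
    # First derivatives (the gradient).
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # Second derivatives (the entries of the 2x2 Hessian matrix).
    h_ww = sum(2 * x * x for x in xs) / n
    h_wb = sum(2 * x for x in xs) / n
    h_bb = 2.0
    # Solve Hessian * delta = gradient using the 2x2 inverse.
    det = h_ww * h_bb - h_wb * h_wb
    delta_w = (h_bb * grad_w - h_wb * grad_b) / det
    delta_b = (-h_wb * grad_w + h_ww * grad_b) / det
    return w - delta_w, b - delta_b


# Hypothetical training data that roughly follows y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

w, b = newton_step(xs, ys, w=0.0, b=0.0)
print(f"w={w:.3f} b={b:.3f}")
```

The extra cost is visible even at this scale: each step needs the second derivatives and a linear solve, and for a model with many parameters the Hessian grows with the square of the parameter count.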
Although our machine learning algorithm may perform built-in iteration, it’s possible that manually training our model multiple times, with different settings each time, will be necessary in order to obtain the appropriate performance and accuracy.
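One way such manual retraining might look is sketched below: the same small model is trained several times with different learning rates, and the run with the lowest loss on held-out validation data is kept. The candidate learning rates, the data, and the helper names are all assumptions for this example.

```python
def train(xs, ys, learning_rate, steps=200):
    """Fit y = w * x + b with gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error of the line y = w * x + b on the given data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training and validation data.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
val_x, val_y = [5.0, 6.0], [10.1, 11.9]

candidate_rates = (0.0001, 0.001, 0.01, 0.05)
results = {}
for lr in candidate_rates:
    # Retrain from scratch for each hyperparameter setting.
    w, b = train(train_x, train_y, lr)
    results[lr] = mse(val_x, val_y, w, b)

best_lr = min(results, key=results.get)
print(f"best learning rate: {best_lr}, validation MSE: {results[best_lr]:.4f}")
```

The comparison is made on the validation data rather than the training data, so that the chosen setting is the one that generalizes best, not the one that merely fits the training set most closely.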
Our validation set can be used during the training process to monitor and evaluate the overall performance of our model. If accuracy on the validation set begins to decrease while accuracy on the training set keeps improving, it may signal overfitting, suggesting that adjustments like reducing complexity or adding regularization are needed.
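A common way to act on this signal is early stopping: watch the validation loss after each pass over the training data and stop once it has not improved for a set number of passes. The sketch below assumes a list of per-epoch validation losses is already available; the loss values and the `patience` setting are hypothetical.

```python
def early_stopping(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; best_epoch is where the loss was lowest.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Ran out of epochs without triggering the stop condition.
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses: improving at first, then getting worse.
val_losses = [0.90, 0.62, 0.48, 0.41, 0.43, 0.47, 0.55]

stop_epoch, best_epoch = early_stopping(val_losses)
print(f"stopped at epoch {stop_epoch}, best model was from epoch {best_epoch}")
```

In practice the model's parameters from the best epoch would be saved and restored, so that the later, overfitted updates are discarded.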
Finally, our remaining test data set should be used to perform final confirmation of the accuracy of our machine learning model. If the model performs as expected on the test data set, which it has never seen before, we can be fairly certain that model training has been performed successfully.
Written by: Justin
©Copyright roguesecurity.ca 2024. All Rights Reserved.