- Machine Learning Concepts – Part 1 – Deployment Introduction
- Machine Learning Concepts – Part 2 – Problem Definition and Data Collection
- Machine Learning Concepts – Part 3 – Data Preprocessing
- Machine Learning Concepts – Part 4 – Exploratory Data Analysis
- Machine Learning Concepts – Part 5 – Model Selection
- Machine Learning Concepts – Part 6 – Model Training
- Machine Learning Concepts – Part 7 – Hyperparameter Tuning
In Machine Learning Concepts – Part 6 – Model Training we briefly glossed over hyperparameters in large language models. In this article we will continue to explore hyperparameters in large language models, including what they are and how they are used, and finally look at common techniques for hyperparameter tuning.
Hyperparameters
You can think of hyperparameters as the knobs and dials that you can adjust to control how a large language model learns. Hyperparameters are not learned from data; instead, they are chosen and set before training begins.
Examples of Hyperparameters
Hyperparameters can control nearly every aspect of a machine learning model, but that doesn't mean we need to tune all of them. The following are common examples of hyperparameters that you'll see when working with machine learning.
- Learning Rate – Controls the step size during gradient descent updates.
- Number of Hidden Layers (for Neural Networks) – Determines the depth of the model's representation. Deeper models can capture more complex patterns, but they also require more data and computational resources.
- Regularization Strength – Techniques such as dropout (randomly disabling units during training) and weight decay (penalizing large weights) help prevent overfitting and discourage the model from becoming overly complex.
- Training Procedures – Control the number of samples processed during each iteration (i.e. batch size), as well as the number of times the entire dataset is passed through the model (i.e. number of epochs).
- Data Augmentation – Techniques such as back translation and textual paraphrasing expand the training data by introducing slight variations, or by rewording text while preserving the original meaning.
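To make the "set beforehand, not learned" distinction concrete, here is a minimal sketch in plain Python. The quadratic loss and the `train` function are invented for illustration; the point is that the learning rate and epoch count are fixed up front, while the parameter `w` is learned from the loss signal.

```python
# Hypothetical example: minimizing the loss f(w) = (w - 3)^2 with gradient
# descent. The hyperparameters are chosen before training, not learned.

hyperparams = {
    "learning_rate": 0.1,  # step size for each gradient descent update
    "num_epochs": 50,      # number of passes through the training loop
}

def train(learning_rate, num_epochs):
    w = 0.0  # model parameter: this IS learned during training
    for _ in range(num_epochs):
        grad = 2 * (w - 3.0)       # derivative of (w - 3)^2
        w -= learning_rate * grad  # gradient descent update
    return w

w = train(**hyperparams)
print(round(w, 4))  # converges near the optimum w = 3
```

Changing `learning_rate` to a much larger or smaller value would make the same loop diverge or converge very slowly, which is exactly why these settings matter.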
Why Hyperparameters Matter
Different datasets, tasks, and algorithms respond differently to different hyperparameter settings. Hyperparameters allow us to tune our model for both performance and accuracy. However, we must also be aware of any bias that this tuning may introduce.
Techniques for Hyperparameter Tuning
- Grid Search – Systematically explores a predefined grid of hyperparameter values.
- Random Search – Samples hyperparameter values randomly from a distribution.
- Bayesian Optimization – Uses probabilistic models to guide the search for optimal hyperparameters.
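Grid search and random search can be contrasted in a few lines of plain Python. This is a hedged sketch: the `validation_score` function below is a hypothetical stand-in for a real train-and-evaluate run, with its peak placed at lr = 0.1 and batch size 32 purely for illustration.

```python
import itertools
import random

# Hypothetical stand-in for training a model and measuring validation
# accuracy; assumed (for illustration) to peak at lr=0.1, batch_size=32.
def validation_score(lr, batch_size):
    return 1.0 - abs(lr - 0.1) - abs(batch_size - 32) / 100

# Grid search: systematically evaluate every combination in a predefined grid.
grid = {"lr": [0.001, 0.01, 0.1], "batch_size": [16, 32, 64]}
best_grid = max(
    itertools.product(grid["lr"], grid["batch_size"]),
    key=lambda combo: validation_score(*combo),
)
print(best_grid)  # (0.1, 32)

# Random search: sample combinations from distributions instead of a grid
# (here, learning rates drawn log-uniformly between 1e-3 and 1).
random.seed(0)
samples = [(10 ** random.uniform(-3, 0), random.choice([16, 32, 64]))
           for _ in range(20)]
best_random = max(samples, key=lambda combo: validation_score(*combo))
```

Note that the grid evaluates all 9 combinations exhaustively, while random search controls cost directly through the number of samples drawn.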
Challenges
Tuning hyperparameters comes with its own set of challenges. Let's review them.
- The Curse of Dimensionality – As the number of hyperparameters increases, the search space grows exponentially, making an exhaustive search computationally infeasible.
- Computational Cost – Evaluating a model for different hyperparameter combinations can be computationally expensive. This can lead to a limit on the number of trials that can be performed within reasonable costs and timeframes.
- Non-Convex Optimization Landscape – The relationship between hyperparameters and model performance is non-linear and can have many local optima. It can be difficult, and time-consuming, to find the right combination of hyperparameters.
- Data Dependence – The optimal hyperparameters will differ from dataset to dataset. What works well for one dataset might not be suitable for another.
- Lack of Interpretability – Due to the complexity of LLMs, it can often be difficult to understand why a certain hyperparameter setting led to increased or decreased model performance. This makes it harder to identify appropriate tuning strategies.
- Overfitting the Hyperparameters – Just as our model can overfit to the training data, hyperparameter choices can overfit to the validation set. In other words, a setting that works well on the validation set may not generalize well to unseen data.
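The curse of dimensionality above is easy to quantify. As a quick illustration, with just five hyperparameters and five candidate values each, a full grid already requires thousands of complete training runs:

```python
import itertools

# Five hyperparameters with five candidate values each: the grid contains
# 5**5 = 3125 combinations, and each one means a full train-and-evaluate run.
candidate_values = [0, 1, 2, 3, 4]
grid = list(itertools.product(*[candidate_values] * 5))
print(len(grid))  # 3125
```

Adding a sixth hyperparameter multiplies the count by five again, which is why exhaustive search quickly becomes impractical.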
Choosing the Right Technique
The optimal hyperparameter settings are highly dependent on the specific LLM architecture, the dataset we use, and the task that we’re attempting to perform. Here are some strategies to mitigate the challenges of tuning hyperparameters.
- Utilize Bayesian Optimization – Bayesian optimization is often the most efficient method for exploring complex hyperparameter search spaces because it learns from previous evaluations.
- Employ Grid Search or Random Search Strategically – Apply grid search and random search to smaller, more manageable subspaces of the overall hyperparameter space.
- Leverage AutoML Techniques – Automated machine learning platforms can help to automate pieces of the hyperparameter process.
- Prioritize Robustness and Generalization – Focus on settings that lead to our model generalizing well to unseen data.
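One simple way to search smaller subspaces strategically is a coarse-to-fine random search: sample broadly first, then narrow the search space around the best result. The sketch below assumes an invented `validation_score` whose optimum sits near lr = 0.05; in practice each call would be a full train-and-evaluate run.

```python
import random

# Hypothetical stand-in for training a model and measuring validation
# accuracy; assumed (for illustration) to peak at a learning rate of 0.05.
def validation_score(lr):
    return 1.0 - abs(lr - 0.05)

random.seed(0)

# Coarse pass: sample learning rates broadly on a log scale (1e-4 to 1).
coarse = [10 ** random.uniform(-4, 0) for _ in range(30)]
best = max(coarse, key=validation_score)

# Fine pass: shrink the search space to a narrow band around the best
# coarse result, and keep the coarse winner as a fallback candidate.
fine = [best * 10 ** random.uniform(-0.3, 0.3) for _ in range(30)]
best = max(fine + [best], key=validation_score)
```

Because the fine pass always retains the coarse winner as a candidate, the final choice can only match or improve on the coarse result.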