Optimize large language model [LLM] hyperparameters to enhance the performance of generative AI solutions.
When training a large language model [LLM] for generative AI, selecting the right hyperparameters is essential to optimize both the model's inference performance and training efficiency.
This guide outlines 20 key hyperparameters you can consider when pre-training or fine-tuning your model.
The guide has been written in a framework-and-language-agnostic style that distills fundamental deep learning concepts, while describing a generalized set of hyperparameters you can use across a broad set of AI implementation patterns and use cases.
What is a Hyperparameter?
A hyperparameter is a configuration you set before you start training your model.
Unlike weights and biases, which are parameters your model learns through the training process itself, hyperparameters direct model training.
Why Hyperparameters?
Firstly, hyperparameters enable you to optimize model performance. By setting these parameters before you start model training, you can guide your model to generate more accurate and relevant responses.
Additionally, you can leverage hyperparameters to enhance training efficiency. Training models can be demanding, both in terms of resource costs and time. By selecting the right hyperparameters, you can control how efficiently your model uses compute, storage and network resources, so you can optimize your model learning while avoiding wasted resources.
Therefore, carefully select and tune hyperparameters that will enable you to strike the right balance between achieving optimal model performance and doing so in a cost-effective and timely manner that aligns with the practical demands of your business objectives and budget constraints.
Top 20 LLM Hyperparameters
The 20 key hyperparameters are outlined across five categories, namely:
- Optimization and Training Efficiency
- Learning Rate
- Model Complexity and Regularization
- Model Architecture
- Training Data
Optimization and Training Efficiency
#1. Optimizer
Let's start by defining Optimization and Gradient Descent.
Optimization is the process by which a model learns to minimize the Loss — the Error — between its predictions and the training data. The Loss is quantified by a Loss Function, a statistical measure of the difference between predicted and actual values, expressed as a function of your model's parameters.
Through Optimization, your model learns to statistically represent your training data, in a state from which the model can accurately perform predictions of your training data and, optimally, for new inputs that are not part of your training data.
Gradient Descent is the optimization algorithm used during model training to find the Minimum Loss. Gradient Descent works by iteratively taking steps in the direction of steepest descent — along the Negative Gradient — to traverse down the Loss Curve until it reaches the Minimum Loss.
At this point, your model has learned the optimal value — the Weight — for each model parameter that produces the Minimum Loss for your training data.
Image Source: Reducing Loss: Gradient Descent | Machine Learning | Google
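To make this concrete, here is a minimal, framework-free sketch of the Gradient Descent update rule. The one-parameter quadratic loss and the starting values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
# Minimal gradient descent on a hypothetical one-parameter loss: L(w) = (w - 3)^2
# The gradient dL/dw = 2 * (w - 3); the Minimum Loss sits at w = 3.

def gradient(w):
    return 2 * (w - 3)

w = 0.0              # arbitrary starting weight
learning_rate = 0.1  # step size for each iteration

for step in range(50):
    w = w - learning_rate * gradient(w)   # step along the negative gradient

print(round(w, 4))   # converges towards 3.0, the weight that produces the Minimum Loss
```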
The Pragmatic Engineer's article "How does ChatGPT work? As explained by the ChatGPT team" explains Gradient Descent well using the analogy of a hiker. Here is the explanation [paraphrased]:
An analogy of gradient descent is a hiker stuck up a mountain, who is trying to get down. However, he doesn't have a full view of the mountain, due to heavy fog which limits his view to a small area around him.
Gradient descent would mean to look at the steepness of the decline from the hiker's current position, and proceed in the direction of the steepest descent.
It takes time to measure steepness and the hiker wants to get down before sunset.
So, this hiker needs to decide how frequently to stop and measure the steepness, so he can get down before sunset.
So, back to Optimization: the Optimizer is the specific optimization algorithm that implements a variant of Gradient Descent to find the Minimum Loss during model training.
Common examples of the Optimizer hyperparameter include:
- SGD: Stochastic Gradient Descent
- Adam: Adaptive Moment Estimation. Adjusts the Learning Rate [described below] for each parameter based on running averages of its gradients and their squares, which can help with more stable learning.
- AdamW: Variant of Adam that includes Weight Decay [described below].
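If you are working in PyTorch (used here only as an illustrative assumption; the concepts are framework-agnostic), the Optimizer hyperparameter is typically a one-line choice, with the learning rates below being example values:

```python
import torch

model = torch.nn.Linear(10, 1)   # stand-in model for illustration

# Pick one of the common Optimizer settings:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)                        # Stochastic Gradient Descent
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)                      # Adam
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)  # Adam with decoupled Weight Decay
```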
#2. Number of Update Steps for Which to Accumulate Gradients
Let's start by describing a Neuron, a Neural Network and a Large Language Model [LLM].
The concept of a Neuron is inspired by the structure and function of the human brain.
Think of a Neuron as a logical unit of decision-making that takes in a collection of inputs; applies a weight to each input; adds a Bias [described below] to the weighted sum; and then applies an Activation Function [described below] before returning a single output.
A Bias is a learned value that equips the Neuron with the flexibility to learn independently of Weights.
Image Source: Machine Learning Crash Course: Part 3 — Neural Networks | Machine Learning @ Berkeley
A Neural Network is a series of interconnected layers of Neurons and consists of an Input Layer, one or more Hidden Layers and an Output Layer.
Each layer in the Neural Network is linked to another layer and, as information progresses through the network, each progressive layer processes more complex information.
Image Source: Deep Learning Cheatsheet | Stanford University
Image Source: Machine Learning Crash Course: Part 3 — Neural Networks | Machine Learning @ Berkeley
Neural Networks are the foundation of Deep Learning, a subset of Machine Learning, where Neural Networks learn from large amounts of data.
Generative AI, largely based on Large Language Models [LLMs], is a subset of Deep Learning.
A Large Language Model [LLM] is a massive Neural Network that has been trained to learn a statistical representation of the structures and patterns of natural language, including the semantic and contextual relationships between words and sentences.
A Foundational LLM [Foundation Model] is trained on natural language by feeding the model vast amounts of publicly-available data from the web.
Consequently, the LLM learns to simulate an understanding of natural language and leverages such understanding to generate human-like responses to natural language prompts.
During LLM training, the model's Neural Network learns the optimal weights — and biases — for each neuron in the network, to generate accurate and relevant model outputs.
The network learns weights through iterative steps of Gradient Descent. Consequently, model training incrementally updates parameter weights as the model goes through successive Gradient Descent steps.
The Number of Update Steps for Which to Accumulate Gradients hyperparameter is the number of batches (forward and backward passes) over which the model accumulates gradients before applying a single weight update.
This hyperparameter is used to stabilize model updates and to simulate a larger effective batch size than GPU memory would otherwise allow, which is particularly important given that LLMs are trained with large datasets.
Example values for the Number of Update Steps for Which to Accumulate Gradients hyperparameter are: 1, 5, 10, 25.
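The sketch below shows one common way gradient accumulation can be implemented in a PyTorch-style training loop; the model, data and loss function are hypothetical placeholders, and many training frameworks expose the same behaviour directly as a configuration setting.

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

accumulation_steps = 5   # Number of Update Steps for Which to Accumulate Gradients

# Hypothetical stream of (inputs, targets) mini-batches
batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(20)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(batches, start=1):
    loss = loss_fn(model(inputs), targets)
    (loss / accumulation_steps).backward()   # gradients accumulate in .grad across batches
    if step % accumulation_steps == 0:
        optimizer.step()                     # apply one weight update per 5 batches
        optimizer.zero_grad()
```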
#3. Activation Functions
An Activation Function is a mathematical function that transforms the weighted sum of a Neuron to produce the final output.
Commonly used Activation Functions add non-linearity to a Neural Network, which enables the network to learn and model complex relationships that linear equations cannot capture.
Examples of Activation Functions include:
- ReLU [Rectified Linear Unit]
- Sigmoid
- Tanh [Hyperbolic Tangent]
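As a quick illustration (assuming PyTorch, with arbitrary example values), here is how each of these Activation Functions transforms the same weighted sums:

```python
import torch

weighted_sums = torch.tensor([-2.0, 0.0, 2.0])

print(torch.relu(weighted_sums))     # tensor([0., 0., 2.]) -> negative values clipped to zero
print(torch.sigmoid(weighted_sums))  # values squashed into the range (0, 1)
print(torch.tanh(weighted_sums))     # values squashed into the range (-1, 1)
```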
#4. Batch Size per GPU
Let's start by describing Model Generalization, Overfitting and Underfitting.
Model Generalization is a measure of how effectively the concepts learned by a model apply to examples not seen in the training data when the model was learning. The goal of a great model is to generalize well from the training data to data the model has not seen.
Overfitting occurs when a model learns the training data too well: the model picks up noise, errors and outliers in the training data to the extent that they are learned as concepts by the model.
The problem with Overfitting is that such concepts do not apply to new data and they adversely impact the model's ability to generalize to data that is different from the training data.
Think of Overfitting as when the model hugs its training data too closely.
Overfitted models learn too many features / parameters and are consequently too complex.
Underfitting is the opposite of Overfitting. Underfitting is when the model has not learned its training data sufficiently and so the model will make error-prone guesses to fill in the gaps.
Image Source: A Comprehensive Guide of Regularization Techniques in Deep Learning | Towards Data Science | Eugenia Anello
GPUs — Graphics Processing Units — are the logical units of compute hardware across which model training is distributed and parallelized within a cluster.
The Batch Size per GPU hyperparameter specifies the number of training dataset records each GPU processes per batch during model training.
In other words, it sets the number of data samples processed by each GPU before the model's internal parameters are updated.
Larger Batch Sizes: Using larger batch sizes can utilize GPU memory more efficiently. Additionally, larger batch sizes can improve Model Generalization by making weight updates over a larger sample of data.
Smaller Batch Sizes: Conversely, smaller batch sizes send fewer examples per update, potentially increasing the frequency of weight updates. Smaller batch sizes may lead to Overfitting since the model learns on smaller subsets of data.
Example settings for the Batch Size per GPU hyperparameter are: 5, 10, 50, 100.
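In PyTorch-style training (an illustrative assumption; in a multi-GPU cluster each GPU typically receives its own shard of the data), Batch Size per GPU usually maps to the batch size of the data loader feeding each GPU. The dataset below is a hypothetical placeholder.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical training dataset of 1,000 samples with 10 features each
dataset = TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1))

# Batch Size per GPU: each GPU's loader delivers 50 samples per batch
train_loader = DataLoader(dataset, batch_size=50, shuffle=True)

for inputs, targets in train_loader:
    pass  # forward pass, loss, backward pass and weight update happen here
```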
#5. Number of Training Epochs
The Number of Training Epochs hyperparameter specifies the number of complete passes the model makes through the entire training dataset.
More Epochs: A higher number of epochs generally improves the model's accuracy on the training data. This can result in more precise and reliable outputs for inputs similar to those seen during training.
Risk of Overfitting: However, training through more epochs can also cause the model to overfit, meaning your model learns the training data too well, including noise and anomalies, which limits the model's ability to perform well on data that is different from the training data.
Example values for the Number of Training Epochs hyperparameter are: 1, 2, 3.
Image Source: Overfitting and Underfitting | Kaggle | Ryan Holbrook
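A minimal sketch of how the Number of Training Epochs fits into a PyTorch-style training loop (the model, loss function and data are hypothetical placeholders). Tracking a held-out validation loss per epoch is one simple way to watch for the overfitting risk described above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()
train_loader = DataLoader(TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1)), batch_size=50)

num_train_epochs = 3   # Number of Training Epochs

for epoch in range(num_train_epochs):
    for inputs, targets in train_loader:        # one complete pass over the training dataset
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
    # Evaluating on held-out data after each epoch helps detect Overfitting:
    # if training loss keeps falling while validation loss rises, stop training.
```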
#6. Weights Precision
The Weights Precision hyperparameter specifies the numeric precision at which the model stores its weights, i.e., the data type and number of bits used for each weight.
Specify Weights Precision by picking the data type for your weights.
Common Weight Precision values used for model training include:
- FP32: 32-bit full precision floating point. 4 bytes. Highest precision but computationally-expensive.
- FP16: 16-bit half precision float. 2 bytes. Commonly used due to memory efficiency.
- BF16: Brain Float 16, created by Google Brain. 2 bytes. Becoming more common. A "truncated FP32": a hybrid between full-precision FP32 and half-precision FP16 that maintains the dynamic range of FP32 with half its memory footprint. Offers efficiency and speed similar to FP16 with a numeric range similar to FP32. Supported by newer GPUs, such as NVIDIA's A100. BF16 has recently become a popular alternative to FP16.
- INT8: 8-bit integer. 1 byte. May not be suitable for LLM training due to limited precision.
Quantization is the process of compressing the size of model weights by converting higher-precision weight values into lower-precision floating-point or integer values.
For example, if you have a weight in a neural network model that is represented with a floating-point number like 0.6573294, quantizing the weight might mean approximating it to 0.66 or even 0.7.
The benefits of Quantization include faster model training and inference; lower compute costs; as well as reduced memory requirements and storage needs; but at the cost of some loss in precision.
However, in practice, the loss in precision from Quantization — for FP16 or BF16, in particular — causes a loss in model accuracy that is generally immaterial.
Higher Precision: More detailed model weights which can lead to more accurate outputs. However, higher precision weights can bloat model size, which can slow down model training and inference, while increasing costs.
Lower Precision: Quantization to lower-precision floating-point data types generally leads to gains in model efficiency with marginal impact on output accuracy.
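A small PyTorch sketch (an illustrative assumption) showing how Weights Precision maps to data types and memory cost per weight; the INT8 conversion here is a crude illustration, not a real quantization scheme.

```python
import torch

weight = torch.tensor([0.6573294])

fp32 = weight.to(torch.float32)
fp16 = weight.to(torch.float16)
bf16 = weight.to(torch.bfloat16)
int8 = (weight * 127).round().to(torch.int8)   # crude integer quantization, for illustration only

print(fp32.element_size(), fp16.element_size(), bf16.element_size(), int8.element_size())  # 4 2 2 1 bytes per weight
print(fp16.item(), bf16.item())   # slightly less precise approximations of 0.6573294
```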
Learning Rate
#7. Learning Rate
The Learning Rate hyperparameter specifies the step size for each iteration of Gradient Descent Optimization.
Consequently, Learning Rate controls the magnitude of changes to weights that Gradient Descent Optimization can make.
Higher Learning Rate: Model learns more aggressively, which results in larger changes to model weights for each step. Therefore, a higher learning rate can cause model training to complete faster. However, such aggressive learning increases the risk of overshooting the optimal weights.
Lower Learning Rate: Model learns more conservatively and makes smaller, more precise changes to weights per optimization step. However, such conservative learning requires more steps and will cause model training to take longer.
Example settings for the Learning Rate hyperparameter are: 1, 0.1, 0.01, 0.001.
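Reusing the framework-free gradient descent sketch from earlier (the same hypothetical one-parameter loss), this example shows how the Learning Rate controls the step size and therefore how quickly, or whether, training converges:

```python
# Same hypothetical loss as before: L(w) = (w - 3)^2, gradient dL/dw = 2 * (w - 3)
def gradient(w):
    return 2 * (w - 3)

for learning_rate in (1.0, 0.1, 0.01, 0.001):
    w = 0.0
    for step in range(100):
        w = w - learning_rate * gradient(w)
    print(learning_rate, round(w, 4))

# 1.0 overshoots and oscillates around the optimum instead of settling,
# while smaller rates converge towards 3.0 progressively more slowly;
# 0.001 makes only small progress in 100 steps.
```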
#8. Initial Learning Rate
The Initial Learning Rate hyperparameter sets the starting value of the Learning Rate at the beginning of model training.
This hyperparameter works in conjunction with the Learning Rate Scheduler [described below] and Linear Warmup Ratio [described below], for scenarios where the Learning Rate changes over the duration of model training.
Example settings for the Initial Learning Rate hyperparameter are: 0.0001
, 0.001
, 0.01
, 0.1
, 1
#9. Learning Rate Scheduler
The Learning Rate Scheduler hyperparameter specifies the strategy used to adjust Learning Rate during model training.
Common settings for the Learning Rate Scheduler are:
- Constant: Learning Rate stays fixed throughout model training.
- Cosine Decay: Learning Rate follows a cosine curve, starting at the Initial Learning Rate and then gradually decreasing towards zero.
- Linear Decay: Learning Rate decays by a constant factor — as a straight line — over the entirety of training.
- Step-Based Decay: Learning Rate decays by a specified factor at predefined steps during training.
- Exponential Decay: Learning Rate decays by an exponential function of the Number of Epochs.
- Time-Based Decay: Learning Rate decreases as a function of the inverse Number of Epochs.
Image Source: How to Choose a Learning Rate Scheduler for Neural Networks | neptune.ai
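Assuming PyTorch for illustration, several of these strategies map directly onto built-in scheduler classes; the Initial Learning Rate is the lr passed to the optimizer, and the decay factors below are example values:

```python
import torch
from torch.optim import lr_scheduler

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.01)   # Initial Learning Rate

# Pick one Learning Rate Scheduler strategy:
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)      # Cosine Decay over 1,000 steps
scheduler = lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)   # Step-Based Decay: halve every 100 steps
scheduler = lr_scheduler.ExponentialLR(optimizer, gamma=0.95)          # Exponential Decay

# During training, call scheduler.step() after each optimizer.step()
# so the Learning Rate is adjusted according to the chosen strategy.
```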
#10. Warmup Ratio: Ratio of Steps for Linear Warmup
The Warmup Ratio hyperparameter sets the fraction of total training steps during which the Learning Rate will linearly increase from zero to a predefined target value, in alignment with the Learning Rate Scheduler.
For instance, with an Initial Learning Rate of 0.01 and a Warmup Ratio of 0.03, the Learning Rate will start from 0 and linearly increase up to 0.01 over the first 3% of total training steps.
Example settings for the Warmup Ratio hyperparameter are: 0 [no warmup], 0.01, 0.02, 0.03.
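A minimal, framework-free sketch of linear warmup, under the assumption of 1,000 total training steps and a constant Learning Rate after the warmup phase (what happens after warmup depends on your Learning Rate Scheduler):

```python
target_learning_rate = 0.01
warmup_ratio = 0.03
total_steps = 1000
warmup_steps = int(total_steps * warmup_ratio)   # 30 steps, i.e., the first 3% of training

def learning_rate_at(step):
    if step < warmup_steps:
        return target_learning_rate * (step + 1) / warmup_steps   # linear ramp from near 0 to the target
    return target_learning_rate                                    # constant afterwards (scheduler-dependent)

print(learning_rate_at(0), learning_rate_at(14), learning_rate_at(29), learning_rate_at(500))
# 0.000333... 0.005 0.01 0.01
```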
Model Complexity and Regularization
#11. Dropout Ratio
Let's start by describing Regularization.
Regularization refers to steps taken to prevent Overfitting of models. Overfitted models have high complexity because they have overlearned features / parameters.
Dropout is one of several Regularization techniques to prevent Overfitting. Dropout works by randomly and temporarily deactivating neurons from network layers during training.
Such removal of neurons can improve Model Generalization by training the model on different views of the network layers.
The Dropout Ratio hyperparameter is the probability of dropping neurons in the network during training.
Example values for the Dropout Ratio hyperparameter are: 0 [no dropout], 0.01, 0.02, 0.03.
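In PyTorch (an assumption for illustration), the Dropout Ratio is the probability p of a dropout layer placed between network layers; a p of 0.3 is used below only so the effect is easy to see.

```python
import torch

torch.manual_seed(0)

dropout = torch.nn.Dropout(p=0.3)   # Dropout Ratio: each neuron output has a 30% chance of being zeroed
activations = torch.ones(10)

print(dropout(activations))         # training mode: some values zeroed, the rest rescaled by 1 / (1 - p)

dropout.eval()
print(dropout(activations))         # evaluation/inference mode: dropout is disabled, values pass through unchanged
```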
#12. Number of Hidden Layers
The Number of Hidden Layers in a neural network defines how many layers of neurons exist between the input and output layers.
Increasing the Number of Hidden Layers can enhance your model's ability to understand the training data by capturing more complex patterns and relationships.
However, adding more hidden layers to your model will increase the size of the model and will consequently increase the resource costs to train, host and serve the model.
Additionally, increasing the Number of Hidden Layers in your model increases your model's complexity, which can lead to Overfitting.
Image Source: Hidden Layers in a Neural Network | Baeldung | Panagiotis Antoniadis
Example values for the Number of Hidden Layers hyperparameter are: 1, 2, 10, 25.
For context, OpenAI's GPT-4 supposedly has 120 hidden layers.
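A small PyTorch sketch (an illustrative assumption, with hypothetical layer sizes) showing the Number of Hidden Layers as a configurable value when building a network, and how it drives model size:

```python
import torch
from torch import nn

def build_network(input_size, hidden_size, output_size, num_hidden_layers):
    layers = [nn.Linear(input_size, hidden_size), nn.ReLU()]       # Input Layer -> first Hidden Layer
    for _ in range(num_hidden_layers - 1):
        layers += [nn.Linear(hidden_size, hidden_size), nn.ReLU()]  # additional Hidden Layers
    layers.append(nn.Linear(hidden_size, output_size))              # last Hidden Layer -> Output Layer
    return nn.Sequential(*layers)

model = build_network(input_size=10, hidden_size=64, output_size=1, num_hidden_layers=2)
print(sum(p.numel() for p in model.parameters()))   # more hidden layers -> more parameters -> larger model
```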
#13. Regularization Penalty Term
Another Regularization technique to prevent Overfitting is to penalize model complexity.
You penalize model complexity by adding a Regularization Penalty Term to your model's Loss Function. The value of this Penalty Term increases with model complexity, i.e., more parameters.
Consequently, Regularization inflates the Loss for more complex models, so any increase in model complexity must deliver a significant improvement in model performance to offset the added penalty as training optimizes towards the Minimum Loss.
Penalizing model complexity through use of a Regularization Penalty Term encourages the model to learn simpler, more generalizable representations of the training data.
There are two types of Regularization Penalty Terms:
- L1 LASSO: Based on the sum of absolute weights.
- L2 Ridge: Based on the sum of squared weights.
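A PyTorch-style sketch (an illustrative assumption, with a placeholder model and data) of adding an L1 or L2 Regularization Penalty Term to the Loss; the λ value that scales the penalty is the Regularization Term described in the next hyperparameter.

```python
import torch

model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
lambda_reg = 0.01   # Regularization Term (λ), described in the next hyperparameter

inputs, targets = torch.randn(8, 10), torch.randn(8, 1)
base_loss = loss_fn(model(inputs), targets)

l1_penalty = sum(p.abs().sum() for p in model.parameters())   # L1 LASSO: sum of absolute weights
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())  # L2 Ridge: sum of squared weights

loss = base_loss + lambda_reg * l2_penalty   # or: base_loss + lambda_reg * l1_penalty
loss.backward()
```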
#14. Regularization Term
The Regularization Term, denoted by λ (Lambda), is a hyperparameter that controls the strength of the Regularization Penalty.
Lambda (λ) is part of the calculation for L1 LASSO and L2 Ridge, the two types of Regularization Penalty Terms.
A larger value of λ causes a stronger penalty on the Loss Function. So if λ is too high, it can lead to Underfitting.
Conversely, a smaller λ value reduces the impact of Regularization. Therefore, a λ value that is too low can still allow Overfitting.
Example settings for the Regularization Term hyperparameter are: 0.0001, 0.001, 0.01.
#15. Weight Decay
Weight Decay is a form of Regularization that prevents Overfitting by penalizing large weight values.
The Weight Decay hyperparameter applies a penalty directly to each weight, shrinking it at every update step by an amount proportional to the weight's value.
The concepts of Weight Decay and L2 Ridge Regularization are often used interchangeably because of their similarity. However, there is a subtle difference between both concepts.
Weight Decay applies the penalty directly to each model weight. However, L2 Ridge applies the penalty to the overall Loss Function, which covers the entirety of the model's weights.
Like the Regularization Penalty Terms, Weight Decay encourages the model to learn leaner and more stable weights, which can improve Model Generalization.
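Assuming PyTorch, Weight Decay is typically exposed as an optimizer argument; AdamW applies it directly to the weights, decoupled from the Loss Function, which matches the distinction described above.

```python
import torch

model = torch.nn.Linear(10, 1)

# Weight Decay hyperparameter: at each update step, every weight is additionally
# shrunk in proportion to its own value, discouraging large weights.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```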
#16. Maximum Gradient Normalization
Maximum Gradient Normalization is a technique used to stabilize model training by setting an upper limit on gradient values, to prevent exploding gradients.
Gradient Descent optimization repeatedly computes gradients and iteratively updates model weights to find the Minimum Loss. However, in some cases, gradients can become excessively large, which can lead to unstable training.
By limiting the size of gradients, Maximum Gradient Normalization helps prevent large, disruptive updates to the model weights.
The Maximum Gradient Normalization hyperparameter sets an upper limit on the norm of the gradients; when the gradients exceed this limit, they are scaled down (clipped) to match it.
Example values for the Maximum Gradient Normalization hyperparameter are: 0.1, 0.2, 0.3.
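In PyTorch (assumed here for illustration, with a placeholder model and data), Maximum Gradient Normalization is usually applied as gradient-norm clipping between the backward pass and the optimizer step:

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

inputs, targets = torch.randn(8, 10), torch.randn(8, 1)

optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()

# Maximum Gradient Normalization: rescale gradients so their overall norm never exceeds 0.3
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.3)

optimizer.step()
```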
Model Architecture
#17. LoRA Dimensionality
Let's start by describing Model Fine-Tuning and LoRA.
Model Fine-Tuning is the process of further training a large language model on additional data to expand the model's reasoning ability and knowledge beyond the original training data.
Model Fine-Tuning is particularly useful to customize a foundation model — like GPT, Gemini or Llama — with proprietary data and / or for specialized domains.
Full Fine-Tuning [FFT]: Full Fine-Tuning retrains a foundation model to update all its weights. Full Fine-Tuning can be resource-intensive, time-consuming and cost-prohibitive, especially for large models.
Parameter-Efficient Fine-Tuning [PEFT]: Parameter-Efficient Fine-Tuning is an alternative to Full Fine-Tuning that updates a small subset of the model's weights / parameters.
- LoRA [Low-Rank Adaptation]: LoRA is an implementation of Parameter-Efficient Fine-Tuning that freezes the foundation model's weights and trains small, low-dimensionality (low-rank) adapter layers on top of them, whose learned updates are added to the original weights.
LoRA produces a fine-tuned model that is simpler, more computationally-efficient, has lower memory requirements and is faster to train than with Full Fine-Tuning.
The LoRA Dimensionality hyperparameter, also called the rank, is used during Parameter-Efficient Fine-Tuning to specify the size of the low-rank adapter layers and, therefore, the number of trainable parameters they contain.
Example settings for the LoRA Dimensionality hyperparameter are: 64, 128, 256.
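A simplified PyTorch sketch of the LoRA idea (not a full or official implementation, and the layer sizes are hypothetical): the original weight matrix is frozen, and two small matrices of rank r, the LoRA Dimensionality, learn the update.

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """Illustrative LoRA-style layer: frozen base weights plus a trainable low-rank update."""

    def __init__(self, in_features, out_features, r=64):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        for p in self.base.parameters():
            p.requires_grad_(False)                                       # freeze the foundation model's weights
        self.lora_a = nn.Parameter(torch.randn(r, in_features) * 0.01)    # r x in_features
        self.lora_b = nn.Parameter(torch.zeros(out_features, r))          # out_features x r

    def forward(self, x):
        # Output = frozen base projection + low-rank update (B @ A) applied to x
        return self.base(x) + x @ self.lora_a.T @ self.lora_b.T

layer = LoRALinear(in_features=512, out_features=512, r=64)   # r is the LoRA Dimensionality
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only the small A and B matrices train
```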
#18. Embedding Dimensionality
Let's start by describing Tokenization and Vector Embeddings.
Tokenization breaks up training data into tokens. A token typically corresponds to a word, a sub-word or a few characters, depending on the specific tokenizer used.
For the sake of simplicity, let's think about a token as the same thing as a word and use both terms interchangeably.
Image Source: Tokenization | Supervised Machine Learning for Text Analysis in R
Vector Embedding converts each token of the training data into a high-dimensional sequence of numbers: a Vector.
Vector Embedding is performed using an embedding model that has been trained to convert each token into a vector.
Vectors encode the semantic meaning and similarity of tokens within the broader context of a vocabulary's vector embedding space.
Each dimension of the Vector Embedding corresponds to some characteristic or feature of the word. Illustrative examples: living being, feline, human, gender, royalty, verb, plural
Image Source: Embeddings: Meaning, Examples and How To Compute | Arize AI | Francisco Castillo
Semantic Similarity: Words with similar meanings have similar vector representations and so they live close to each other within the vector embedding space.
For example, apple, banana and orange might be close to each other in the embedding space because they share Semantic Similarity since they are fruits.
Image Source: Vector Embedding | Cause Writer AI
The Embedding Dimensionality hyperparameter is the number of dimensions for the vectorized tokens used to train the model.
Use of higher Embedding Dimensionality — more dimensions — enables a richer representation of each word / token and the capture of more complex semantic relationships.
However, higher Embedding Dimensionality increases the model size and, consequently, increases resource usage, cost and training time.
Additionally, excessive Embedding Dimensionality can lead to Overfitting.
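A small PyTorch sketch (an illustrative assumption, with a hypothetical vocabulary size and token IDs) showing Embedding Dimensionality as the length of each token's vector:

```python
import torch
from torch import nn
from torch.nn.functional import cosine_similarity

vocab_size = 50000      # hypothetical vocabulary size
embedding_dim = 768     # Embedding Dimensionality: the length of each token's vector

embedding = nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([101, 2057, 3040])   # hypothetical token IDs for three tokens
vectors = embedding(token_ids)
print(vectors.shape)                           # torch.Size([3, 768]): one 768-dimensional vector per token

# After training, semantically similar tokens end up with similar vectors,
# which can be measured with cosine similarity:
print(cosine_similarity(vectors[0], vectors[1], dim=0))
```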
Training Data
#19. Maximum Sequence Length to Use
The Maximum Sequence Length to Use hyperparameter sets the upper limit on the number of tokens that the model can process in a single training iteration.
The training process will truncate sequences longer than the specified maximum length. For sequences shorter than the specified length, the training process will typically pad the sequences to ensure consistent sequence lengths.
If the Maximum Sequence Length to Use hyperparameter is not explicitly specified, model training may default to using a predetermined sequence length, depending on the model architecture. Alternatively, training may default to using the maximum sequence length from the training data.
Example settings for the Maximum Sequence Length to Use hyperparameter are: 1024, 2048, 4096, 8192.
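A framework-free sketch of the truncation-and-padding behaviour described above, using hypothetical token ID lists, a deliberately tiny maximum length and an assumed padding token ID of 0:

```python
MAX_SEQ_LENGTH = 8   # Maximum Sequence Length to Use (tiny value for illustration)
PAD_TOKEN_ID = 0     # hypothetical padding token

def truncate_or_pad(token_ids):
    if len(token_ids) > MAX_SEQ_LENGTH:
        return token_ids[:MAX_SEQ_LENGTH]                                   # truncate long sequences
    return token_ids + [PAD_TOKEN_ID] * (MAX_SEQ_LENGTH - len(token_ids))   # pad short sequences

print(truncate_or_pad([5, 8, 13, 21, 34, 55, 89, 144, 233, 377]))  # truncated to 8 tokens
print(truncate_or_pad([5, 8, 13]))                                  # padded to 8 tokens
```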
#20. Pack Multiple Short Examples in Same Sequence
Certain model architectures enable combining multiple short sequences into a single sequence during training, up to a Maximum Sequence Length.
Specify this setting to combine sequences using the Pack Multiple Short Examples in Same Sequence hyperparameter.
There are two settings for the Pack Multiple Short Examples in Same Sequence hyperparameter, namely: True and False.
Some model architectures allow for leaving this hyperparameter unspecified or passing in a None value. In such instances, model training may default the setting to True or False, depending on the model architecture.
Packing multiple sequences into the same sequence may boost training efficiency at the expense of potential Underfitting and reduced model accuracy.
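A framework-free sketch of greedy sequence packing, under the assumption of a simple end-of-sequence separator token; real implementations typically also adjust attention masks so packed examples do not attend to each other.

```python
MAX_SEQ_LENGTH = 10
EOS_TOKEN_ID = 2   # hypothetical end-of-sequence separator

def pack_examples(examples):
    packed, current = [], []
    for tokens in examples:
        candidate = tokens + [EOS_TOKEN_ID]
        if len(current) + len(candidate) > MAX_SEQ_LENGTH:
            packed.append(current)   # start a new packed sequence once the limit would be exceeded
            current = []
        current += candidate
    if current:
        packed.append(current)
    return packed

short_examples = [[5, 6], [7, 8, 9], [10], [11, 12, 13, 14]]
print(pack_examples(short_examples))
# [[5, 6, 2, 7, 8, 9, 2, 10, 2], [11, 12, 13, 14, 2]]
```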
Wrapping Up
Hyperparameter tuning is both an art and a science.
There is no one-size-fits-all combination of hyperparameter settings that will produce optimal outcomes across all scenarios.
Continuously monitor model performance and adapt your model architecture as training and inference data distribution evolves over time.
As with most software architecture decisions, it is crucial to understand the tradeoffs involved when tuning hyperparameters. Improving performance in one dimension may occur at the cost of performance in another dimension.
Therefore, clearly define your priority metrics for optimization and embrace experimentation to find the optimal balance of metrics for your specific AI workflows.
Leverage LLMOps, MLOps, DataOps and DevOps to automate the end-to-end data and AI/ML lifecycle, enabling robust logging, monitoring, experimentation, benchmarking and enhancements to your model architecture.
AI optimization is a journey not a destination.
Therefore, implement a continuous experiment-driven AI strategy, powered by automation, to ensure you deploy high-quality AI solutions that deliver high-ROI outcomes for your business.