2. LLM Instruction Fine-Tuning & Evaluation
INSTRUCTION FINE-TUNING | TASK-SPECIFIC FINE-TUNING | MULTI-TASK FINE-TUNING | MODEL EVALUATION
INSTRUCTION FINE-TUNING

In-context learning limitations:
• May be insufficient for very specific tasks.
• Examples take up space in the context window.

Instruction fine-tuning: The LLM is trained to estimate the next-token probability on a carefully curated dataset of high-quality examples for specific tasks.
• The LLM generates better completions for a specific task.
• It has potentially high computing requirements.

Steps:
1. Prepare the training data.
2. Pass examples of training data to the LLM (prompt and ground-truth answer).
3. Compute the cross-entropy loss for each completion token and backpropagate. A sketch of this step follows below.
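A minimal sketch of these steps in PyTorch, assuming a Hugging Face-style causal LM (model, tokenizer, and optimizer are stand-ins): prompt positions are masked with -100 so the cross-entropy loss covers only the completion tokens.

import torch
import torch.nn.functional as F

def fine_tuning_step(model, tokenizer, prompt, target, optimizer):
    # Tokenize the prompt and the ground-truth completion separately.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)

    # Labels: ignore prompt positions (-100), supervise completion tokens only.
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100

    # Shift so each position predicts the next token.
    logits = model(input_ids).logits
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    loss.backward()   # backpropagate through the completion tokens only
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()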
Training flow: a pre-trained LLM is fine-tuned on task-specific examples (prompt-completion pairs from a task-specific dataset, e.g., translation), yielding a fine-tuned LLM with adjusted weights.

TASK-SPECIFIC FINE-TUNING

Task-specific fine-tuning involves training a pre-trained model on a particular task or domain using a dataset tailored for that purpose. Often, good results can be achieved with just a few hundred or thousand examples.

Fine-tuning can significantly increase the performance of a model on a specific task, but can reduce the performance on other tasks ("catastrophic forgetting").

Solutions:
• It might not be an issue if only a single task matters.
• Fine-tune for multiple tasks concurrently (~50K to 100K examples needed).
• Opt for Parameter-Efficient Fine-Tuning (PEFT) instead of full fine-tuning, which involves training only a small number of task-specific adapter layers and parameters.
MULTI-TASK FINE-TUNING

Multi-task fine-tuning diversifies training with examples for multiple tasks (e.g., sentiment analysis, entity recognition, summarization, translation), guiding the model to perform various tasks. The pre-trained LLM is trained on a multi-task dataset, yielding an instruct LLM.
Drawback: It requires a lot of data (around 50K to 100K examples), with many examples of each task needed for training.

Training example (sentiment task): for the prompt "Label this review: Amazing product! Sentiment:", the LLM's completion ("Neutral") is compared against the ground truth ("Positive") to compute the loss used to update the weights.

Model variants differ based on the datasets and tasks used during fine-tuning. Various approaches exist; one example is the FLAN family of models:
• FLAN (Fine-tuned LAnguage Net) provides tailored instructions for refining various models, akin to dessert after the main course of pre-training.
• FLAN-T5 is an instruct fine-tuned version of the T5 foundation model, serving as a versatile model for various tasks. It has been fine-tuned on a total of 473 datasets across 146 task categories; for instance, the SAMSum dataset was used for summarization.
• A specialized variant of this model for chat summarization or for custom company usage could be developed through additional fine-tuning on specialized datasets (e.g., DialogSum or custom internal data).
MODEL EVALUATION

Evaluating LLMs is challenging (e.g., various tasks, non-deterministic outputs, equally valid answers with different wordings). Hence the need for automated and organized performance assessments.

ROUGE / BLEU SCORE
• Purpose: To evaluate LLMs on narrow tasks (summarization, translation) when a reference is available.
• Based on n-grams; rely on precision and recall scores (multiple variants exist).

BERT SCORE
• Purpose: To evaluate LLMs in a task-agnostic manner when a reference is available.
• Based on token-wise comparison: a similarity score is computed between candidate and reference sentences.

LLM-AS-A-JUDGE
• Purpose: To evaluate LLMs in a task-agnostic manner when a reference is available.
• Based on prompting an LLM to assess the equivalence of a generated answer with a ground-truth answer.

BENCHMARKS
To measure and compare LLMs more holistically, use evaluation benchmark datasets specific to model skills, e.g., GLUE, SuperGLUE, MMLU, BIG-bench, HELM.
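For intuition, a toy ROUGE-1 computation in plain Python (unigram precision/recall/F1 against a single reference; real evaluations use library implementations and further variants such as ROUGE-2 and ROUGE-L):

from collections import Counter

def rouge_1(candidate: str, reference: str) -> dict:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())        # matching unigram count
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return {"precision": precision, "recall": recall, "f1": f1}

print(rouge_1("the cat sat on the mat", "the cat is on the mat"))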
3. Parameter-Efficient Fine-Tuning (PEFT) Methods
PEFT | LoRA | SOFT PROMPTS
PEFT

Full fine-tuning of LLMs is challenging: besides the trainable weights, training must hold gradients, activations, optimizer states, and temporary variables in memory, which requires a lot of memory.

PEFT methods only update a small number of model parameters. Examples of PEFT techniques:
• Freeze most model weights, and fine-tune only specific layer parameters.
• Keep existing parameters untouched; add only a few new parameters or layers for fine-tuning.

Main benefits:
• Decreased memory usage, often requiring just 1 GPU.
• Mitigated risk of catastrophic forgetting.
• Storage limited to only the new PEFT weights (the trained parameters can account for only 15%-20% of the original LLM weights).

Multiple methods exist, with trade-offs on parameter or memory efficiency, training speed, model quality, and inference costs. Three classes of PEFT methods appear in the literature:
• Selective: Fine-tune only specific parts of the original LLM.
• Reparameterization: Use low-rank representations to reduce the number of trainable parameters (e.g., LoRA).
• Additive: Augment the pre-trained model with new parameters or layers, training only the additions (adapters, soft prompts).
LoRA

LoRA (Low-Rank Adaptation) reduces the number of trainable parameters during fine-tuning by freezing all original model parameters and injecting a pair of rank-decomposition matrices alongside the original weights.

Steps:
1. Keep the majority of the original LLM weights W0 frozen.
2. Introduce a pair of rank-decomposition matrices A and B of rank r.
3. Train the new matrices A and B.

Model weights update:
1. Matrix multiplication: B * A
2. Add the result to the original weights, so the layer output for inputs x becomes h = W0·x + B·A·x.

Rank choice for LoRA matrices:
Trade-off: A smaller rank reduces parameters and accelerates training but risks lower adaptation quality due to reduced task-specific information capture. In the literature, a rank between 4 and 32 appears to be a good trade-off.

Additional notes:
• No impact on inference latency.
• Fine-tuning only the self-attention layers with LoRA is often enough to enhance performance for a given task.
• Weights can be switched out as needed, allowing for training on many different tasks.
• LoRA can be combined with quantization (QLoRA).
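A minimal PyTorch sketch of a LoRA-wrapped linear layer (illustrative, not any specific library's API; the rank and alpha values are assumptions): W0 is frozen, only A and B are trained, and the output is h = W0·x + B·A·x.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # freeze W0 (and bias)
        # Rank-decomposition matrices: B (out x r) and A (r x in).
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # h = W0.x + B.A.x; B starts at zero, so training begins
        # from the original model's behavior.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

In practice, only the self-attention projection layers are often wrapped this way, and the small A/B weights can be stored per task and swapped at inference.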
SOFT PROMPTS

Prompt tuning: Add trainable tensors to the model input embeddings, commonly known as "soft prompts" (typically 20-100 tokens), optimized directly through gradient descent. Unlike prompt engineering, it is not limited by manual effort or by the length of the context window.

Soft prompt vectors:
• Equal in length to the embedding vectors of the input language tokens.
• Can be seen as virtual tokens which can take any value within the multidimensional embedding space.

In prompt tuning, the LLM weights are frozen:
• Over time, the embedding vectors of the soft prompt are adjusted to optimize the model's completion of the prompt.
• Only a few parameters are updated.
• A different set of soft prompts can be trained for each task and easily swapped out during inference (occupying very little space on disk).

The literature shows that at around 10B parameters, prompt tuning is as efficient as full fine-tuning.
! Interpreting virtual tokens can pose challenges (the nearest-neighbor tokens to the soft prompt location can be used).
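And a matching sketch of prompt tuning (a hypothetical module; the frozen LLM would consume the returned embeddings): k trainable virtual-token vectors are prepended to the input embeddings.

import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, embed_dim: int, n_virtual_tokens: int = 20):
        super().__init__()
        # Virtual tokens: free vectors in the embedding space, optimized
        # by gradient descent while the LLM weights stay frozen.
        self.prompt = nn.Parameter(torch.randn(n_virtual_tokens, embed_dim) * 0.01)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim)
        batch = input_embeds.size(0)
        soft = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([soft, input_embeds], dim=1)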
4. LLM Compute Challenges and Scaling Laws
COMPUTATIONAL CHALLENGES | QUANTIZATION | SCALING LAWS | LARGE LANGUAGE MODEL CHOICE
GENERATIVE AI PROJECT LIFECYCLE

Use case definition & scoping → Model selection → Adapt (prompt engineering, fine-tuning), augment, and evaluate the model → App integration (model optimization, deployment).

LARGE LANGUAGE MODEL CHOICE

Two options for model selection:
• Use a pre-trained LLM.
• Train your own LLM from scratch.

The model choice will depend on the details of the task to carry out. But, in general, develop your application using a pre-trained LLM, except if you work with extremely specific data (e.g., medical, legal).

Hubs: Where you can browse existing models.
Model cards: List of best use cases, training details, and limitations of models.
Example model sizes (number of parameters): BERT 110M, GPT-2 1.5B, YaLM 100B, GPT-3 175B, PaLM 540B.

Model pre-training: Model weights are adjusted to minimize the loss of the training objective. It requires significant computational resources (i.e., GPUs, due to the high computational load).
MEMORY CHALLENGE

LLMs are massive and require plenty of memory for training and inference ("RuntimeError: CUDA out of memory").

To load the model into GPU RAM: 1 parameter at 32-bit precision = 4 bytes, so 1B parameters = 4 × 10^9 bytes = 4 GB of GPU RAM.

Training requires storing additional components beyond the model's parameters:
• Optimizer states (e.g., 2 per parameter for Adam)
• Gradients
• Forward activations
• Temporary variables

This can add 12-20 bytes of memory per model parameter. Hence, training a 1-billion-parameter LLM requires roughly 16 GB to 24 GB of GPU memory, around 4-6x the GPU RAM needed just to store the model weights. That is excessive for consumer hardware, and demanding even for data-center hardware when training on a single processor; for instance, an NVIDIA A100 supports up to 80 GB of RAM.
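The arithmetic above as a quick back-of-the-envelope helper (pure Python; the 12-20 extra bytes per parameter is the range quoted in this section):

def training_memory_gb(n_params: float, extra_bytes_per_param: float = 20.0):
    weight_gb = n_params * 4 / 1e9              # FP32 weights: 4 bytes each
    train_gb = n_params * (4 + extra_bytes_per_param) / 1e9
    return weight_gb, train_gb

weights, training = training_memory_gb(1e9)     # 1B-parameter model
print(f"weights: {weights:.0f} GB, training: ~{training:.0f} GB")
# -> weights: 4 GB, training: ~24 GB (with 12 extra bytes: ~16 GB)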
QUANTIZATION

How can you reduce memory for training? Quantization decreases the memory needed to store the weights of the model by converting their precision from 32-bit floating point to 16-bit floating point or 8-bit integers. It maps the FP32 numbers (range roughly -3 × 10^38 to 3 × 10^38) to a lower-precision space (FP16, BFLOAT16, INT8, INT4) by employing scaling factors determined from the range of the FP32 numbers.

BFLOAT16 is a popular alternative to FP16:
• Developed by Google Brain
• Balances memory efficiency and accuracy
• Wider dynamic range
• Optimized for storage and speed in ML tasks
(e.g., FLAN-T5 was pre-trained using BFLOAT16)

Benefits of quantization:
• Less memory
• Potentially better model performance
• Higher calculation speed
In most cases, quantization strongly reduces memory requirements with a limited loss in prediction quality.
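A toy illustration of the scaling-factor idea for symmetric INT8 quantization (NumPy; real schemes add zero-points, per-channel scales, and calibration data):

import numpy as np

def quantize_int8(x: np.ndarray):
    scale = np.abs(x).max() / 127.0             # scaling factor from the FP32 range
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4).astype(np.float32)
q, s = quantize_int8(w)
print(w, dequantize(q, s))                      # close, with a small rounding error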
SCALING LAWS

How big do the models need to be? The goal is to maximize model performance. Researchers explored the trade-offs between the dataset size (number of tokens), the model size (number of parameters), and the compute budget (the constraint).

It has been empirically shown that, as the compute budget remains fixed:
• Fixed model size: Increasing the training dataset size improves model performance.
• Fixed dataset size: Larger models demonstrate lower test loss, indicating enhanced performance.

Increasing compute may seem ideal for better performance, but practical constraints like hardware, time, and budget limit its feasibility. What's the optimal balance? Once scaling laws have been estimated, we can use the Chinchilla approach: choose the dataset size and the model size to train a compute-optimal model, which maximizes performance for a given compute budget. The compute-optimal training dataset size is ~20x the number of parameters.
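As a trivial helper, the ~20x rule of thumb in code form (the multiplier is the heuristic quoted above, not an exact law):

def chinchilla_optimal_tokens(n_params: float, multiplier: float = 20.0) -> float:
    # Compute-optimal training dataset size ~ 20x the parameter count.
    return multiplier * n_params

print(f"{chinchilla_optimal_tokens(70e9) / 1e12:.1f}T tokens for a 70B model")  # ~1.4T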
5. Preference Fine-Tuning (Part 1)
INTRODUCTION | RLHF PRINCIPLES | COLLECTING HUMAN FEEDBACK | REWARD MODEL
INTRODUCTION

Some models exhibit undesirable behavior:
• Generating toxic language
• Responding aggressively
• Providing harmful information

Example: for the prompt "How to create a bomb?", a misaligned model answers "In order to create a bomb, you have to…", while an aligned model answers "I'm sorry, but I can't assist with that. Creating a bomb is illegal…".

To ensure alignment between LLMs and human values, emphasis should be placed on qualities like helpfulness, honesty, and harmlessness (HHH). Additional training with preference data can boost HHH in completions. This takes place in the "adapt, augment, and evaluate" stage of the generative AI project lifecycle.

Preference data: for each prompt, several answers (Answer A, Answer B) are generated by the model we want to fine-tune and then assessed by human evaluators or an LLM.

Two approaches:
• Reinforcement Learning with Human Feedback (RLHF): Preference data is used to train a reward model that mimics human annotator preferences, which then scores LLM completions for reinforcement learning adjustments.
• Preference optimization (DPO, IPO): Minimize a training loss directly on preference data.
RLHF PRINCIPLES

Reminder on reinforcement learning (RL): a type of ML in which an agent learns to make decisions towards a specific goal by taking actions in an environment, aiming to maximize some cumulative reward. At each step, the agent's RL policy picks an action a_t from the action space (all possible actions given the current environment state s_t) and receives a reward r_t; the objective might be, e.g., to win a game.

In the context of LLMs:
• Agent / RL policy: the LLM. Objective: generate aligned text.
• Environment: the LLM context. State: any text in the current context window.
• Action: text generation. Action space: the token vocabulary.
• The action the model will take depends on the prompt text in the context and the probability distribution across the vocabulary space.
• Reward: provided by a reward model, based on how closely completions align with human preferences.

The reward model assesses the alignment of LLM outputs with human preferences. The reward values obtained are then used to update the LLM weights and train a new human-aligned version, with the specifics determined by the optimization algorithm.
COLLECTING HUMAN FEEDBACK

Steps:
1. Choose a model and use it to curate a dataset for human feedback (prompt samples → model completions).
2. Collect feedback from human labelers (generally, thousands of people):
• Specify the model alignment criterion (e.g., helpfulness).
• Request that the labelers rank the outputs according to that criterion. Detailed instructions improve response quality and consistency, resulting in labeled completions that reflect a consensus.
3. Prepare the data for training: create pairwise training data from the rankings for the training of the reward model. For each pair, place the preferred option first by reordering the completions, and assign the reward [1, 0]: 1 for the preferred response and 0 for the rejected one.

Example (alignment criterion: helpfulness): for the prompt "The coffee is too bitter", three completions are ranked by several labelers; the rankings are converted into completion pairs, each labeled [1, 0] once the preferred completion is placed first.
REWARD MODEL

Objective: to develop a model or system that accepts a text sequence and outputs a scalar reward representing human preference numerically.

Reward model training: The reward model (RM), often a language model (e.g., BERT), is trained using supervised learning on the pairwise comparison data derived from the human assessments of prompts. For a prompt x with preferred completion y_j and rejected completion y_k, it learns to prioritize the human-preferred completion by minimizing the negative log-sigmoid of the reward difference:

loss = -log(sigmoid(r_j - r_k))

Usage of the reward model: use it as a binary classifier to assign reward values to prompt-completion pairs; the reward value equals the logit output by the model for the positive class. Example: for "Samantha enjoys reading books", logits of 3.17 (positive) and -2.6 (negative) give a reward of 3.17.
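A minimal PyTorch sketch of this pairwise loss (reward_model here is a hypothetical callable returning a scalar reward per prompt-completion pair):

import torch
import torch.nn.functional as F

def reward_pair_loss(reward_model, prompt, preferred, rejected):
    r_j = reward_model(prompt, preferred)   # scalar reward for the preferred completion
    r_k = reward_model(prompt, rejected)    # scalar reward for the rejected completion
    # Minimize -log sigmoid(r_j - r_k): pushes r_j above r_k.
    return -F.logsigmoid(r_j - r_k).mean()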
6. Preference Fine-Tuning (Part 2)
FINE-TUNING WITH RL | PPO ALGORITHM FOR LLMS | REWARD HACKING | DIRECT PREFERENCE OPTIMIZATION | RL FROM AI FEEDBACK
FINE-TUNING WITH RL

The LLM weights are updated to create a human-aligned model via reinforcement learning, leveraging the reward model and starting with a high-performing base model.
Goal: to align the LLM with provided instructions and human behavior.

The loop runs for N iterations:
1. Text generation: the LLM produces a completion for a prompt.
2. Scoring: the reward model (RM) scores the prompt-completion pair.
3. Model weights update with reinforcement learning.

As the process advances successfully, the reward gradually increases until it meets the predefined evaluation criteria for helpfulness. The resulting updated model should be more aligned with human preferences.

Example:
Prompt: A tree is...
Iteration 1: ...a plant with a trunk. → Reward: 0.3
…
Iteration 4: ...a provider of shade and oxygen. → Reward: 1.6
…
Iteration n: ...a symbol of strength and resilience. → Reward: 2.9

PPO ALGORITHM FOR LLMS

Reinforcement learning algorithm: proximal policy optimization (PPO) is a popular choice. PPO iteratively updates the policy to maximize the reward, adjusting the LLM weights incrementally to maintain proximity to the previous version within a defined range ("trust region") for stable learning.

The PPO objective is used to update the LLM weights by backpropagation. It combines three terms, weighted by hyperparameters:
• Policy loss: Maximize it to get higher rewards while staying within reliable bounds; it compares the probabilities of the next token under the updated and the initial LLM (the ratio defines the trust region) and weights them by an advantage term.
• Value loss: Minimize it to improve return-prediction accuracy; it compares the value function's estimated future total reward with the actual reward from the reward model.
• Entropy loss: Maximize it to promote and sustain model creativity; the higher the entropy, the more creative the policy. A sketch of these terms follows below.
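A compact sketch of the three terms (schematic tensors for a batch of sampled tokens; real implementations add masking, advantage estimation, and careful per-token bookkeeping, and the coefficients here are illustrative):

import torch

def ppo_losses(logprobs_new, logprobs_old, advantages, values, returns,
               probs_new, clip_eps=0.2):
    # Policy loss: the clipped surrogate objective keeps the updated policy
    # within a "trust region" around the old one.
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value loss: the value head should predict the actual return.
    value_loss = torch.nn.functional.mse_loss(values, returns)

    # Entropy bonus: higher entropy -> a more diverse, "creative" policy.
    entropy = -(probs_new * torch.log(probs_new + 1e-9)).sum(-1).mean()

    # Combined PPO objective (0.5 and 0.01 are hyperparameters).
    return policy_loss + 0.5 * value_loss - 0.01 * entropy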
REWARD HACKING

The agent may learn to cheat the system, maximizing rewards at the expense of alignment with desired behavior. E.g., for the prompt "The movie was...", the RL-updated LLM can drift from "...enjoyable and decent" toward exaggerated completions such as "...an absolute thrill fest that left me breathless!" simply because they please the reward model.

To prevent reward hacking, penalize RL updates if they significantly deviate from the frozen original LLM, using the KL divergence between the two models' token distributions as a shift penalty added to the reward. This keeps the policy in the "trust region" and acts as a guardrail. A small sketch follows below.
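A sketch of the shift penalty (the per-sequence KL here is approximated from sampled-token log-probabilities, a common shortcut; beta is an illustrative coefficient):

def penalized_reward(rm_score, logprobs_updated, logprobs_frozen, beta=0.02):
    # Approximate KL from the log-probabilities of the sampled tokens.
    kl = (logprobs_updated - logprobs_frozen).sum(-1)
    return rm_score - beta * kl   # deviating from the original LLM costs reward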
DIRECT PREFERENCE OPTIMIZATION

An RLHF pipeline is difficult to implement:
• Need to train a reward model
• New completions needed during training
• Instability of the RL algorithm

Direct Preference Optimization (DPO) is a simpler and more stable alternative to RLHF. It solves the same problem by minimizing a training loss directly based on the preference (comparison) data, without reward modeling or RL: comparison data → DPO (or IPO) → fine-tuned LLM. Identity Preference Optimization (IPO) is a variant of DPO less prone to overfitting. A sketch of the DPO loss follows below.
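A sketch of the DPO loss, assuming you already have summed sequence log-probabilities under the trained policy (pi) and a frozen reference model (ref) for the preferred (w) and rejected (l) completions; beta controls how far the policy may drift from the reference:

import torch.nn.functional as F

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit rewards: beta * log(pi/ref) for preferred (w) and rejected (l).
    chosen = beta * (pi_logp_w - ref_logp_w)
    rejected = beta * (pi_logp_l - ref_logp_l)
    # Maximize the margin between preferred and rejected completions.
    return -F.logsigmoid(chosen - rejected).mean()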
RL FROM AI FEEDBACK

Obtaining the reward model is labor-intensive; scaling through AI supervision is more precise and requires fewer human labels.

Constitutional AI (Bai, Yuntao, et al., 2022): an approach that relies on a set of principles governing AI behavior, along with a small number of examples for few-shot prompting, collectively forming the "constitution." Example of a constitutional principle: "Please choose the response that is the most helpful, honest, and harmless."

1. Supervised learning stage:
(1) Fine-tune a pre-trained LLM on helpfulness data to obtain a helpful LLM.
(2) Generate completions for harmful prompts, then critique and revise the responses based on the constitutional principles.
(3) Fine-tune the LLM on the harmful prompts paired with the revised completions.

2. Reinforcement learning stage (RLAIF):
(4) Use the fine-tuned LLM to generate pairs of completions for harmful prompts.
(5) Ask an LLM which response is best based on the constitutional principles, producing AI-generated comparison data (combined with human-feedback helpfulness data).
(6) Train a preference model on this comparison data.
(7) Fine-tune the LLM using RL against the preference model.

Result: a policy trained by Reinforcement Learning with AI Feedback (RLAIF).
7. LLM-Powered Applications
MODEL OPTIMIZATION FOR DEPLOYMENT | LLM-INTEGRATED APPLICATIONS | LLM REASONING WITH CHAIN-OF-THOUGHT PROMPTING | PROGRAM-AIDED LANGUAGE | REACT
MODEL OPTIMIZATION FOR DEPLOYMENT

Inference challenges: high computing and storage demands. Goal: shrink model size while maintaining performance.

Model Distillation
• Scale down model complexity while preserving accuracy.
• Train a small student model to mimic a large, frozen teacher model.
• Soft labels: the teacher's completions serve as ground-truth labels for the distillation loss, which compares them with the student's soft predictions; the student loss compares the student's hard predictions with the hard labels from the labeled training data.
• The student and distillation losses update the student model weights via backpropagation (a loss sketch follows at the end of this subsection).
• The student LLM can then be used for inference.

Post-Training Quantization (PTQ)
• PTQ reduces model weight precision to 16-bit float or 8-bit integer.
• Can target both weights and activation layers for impact.
• May sacrifice some performance, yet is beneficial for cost savings and performance gains.

Model Pruning
• Removes redundant model parameters that contribute little to the model performance.
• Some methods require full model training, while others are in the PEFT category (e.g., LoRA).
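A sketch of the two distillation training signals, assuming classification-style logits (temperature T softens the teacher's soft labels; alpha balances the two losses):

import torch
import torch.nn.functional as F

def distillation_step(student_logits, teacher_logits, hard_labels,
                      T=2.0, alpha=0.5):
    # Distillation loss: student soft predictions vs. teacher soft labels.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    distill = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * T * T

    # Student loss: hard predictions vs. ground-truth hard labels.
    student = F.cross_entropy(student_logits, hard_labels)

    return alpha * distill + (1 - alpha) * student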
LLM-INTEGRATED APPLICATIONS

LLM limitations:
• Knowledge can be out of date.
• LLMs struggle with certain tasks (e.g., math).
• LLMs can confidently provide wrong answers (hallucination).

Solution: leverage external apps or data sources; the LLM should serve as a reasoning engine. The prompt and completion are important! In a typical LLM-integrated application, a frontend talks to an orchestrator (e.g., LangChain), which calls the LLM and external data sources and applications (APIs, Python, etc.).

Retrieval-Augmented Generation (RAG)
AI framework that integrates external data sources and apps (e.g., documents, private databases). Multiple implementations exist; the right one depends on the details of the task and the data format.
• Retrieve the documents most similar to the input query from the external knowledge (user query → query encoder → retriever → external knowledge).
• Combine the retrieved documents with the input query and send the prompt to the LLM to receive the answer.
! The size of the context window can be a limitation: use multiple chunks (e.g., with LangChain).
! Data must be in a format that allows its relevance to be assessed at inference time: use embedding vectors (a vector store). Vector database: stores vectors and associated metadata, enabling efficient nearest-neighbor vector search. A minimal sketch follows below.
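A minimal RAG sketch with a toy in-memory vector store (embed and llm are hypothetical stand-ins for an embedding model and a chat model):

import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=2):
    # Nearest-neighbor search by cosine similarity.
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return [docs[i] for i in np.argsort(-sims)[:k]]

def rag_answer(query, docs, embed, llm):
    doc_vecs = np.stack([embed(d) for d in docs])
    context = "\n".join(retrieve(embed(query), doc_vecs, docs))
    prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"
    return llm(prompt)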
LLM REASONING WITH CHAIN-OF-THOUGHT PROMPTING

Complex reasoning is challenging for LLMs (e.g., problems with multiple steps, mathematical reasoning).

Chain-of-Thought (CoT):
• Prompts the model to break down problems into sequential steps.
• Operates by integrating intermediate reasoning steps into the examples used for one- or few-shot inference.

Prompt:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5+6=11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?

Completion (the whole prompt is included in the completion):
A: The cafeteria had 23 apples. They used 20 to make lunch. 23-20=3. They bought 6 more apples, so 3+6=9. The answer is 9.

CoT improves performance but struggles with precision-demanding tasks like tax computation or discount application.

PROGRAM-AIDED LANGUAGE (PAL)

Solution: allow the LLM to communicate with a proficient math program, such as a Python interpreter.
• The LLM generates scripts: CoT reasoning as comments, calculations as code.
• The completion is handed off to a Python interpreter for PAL execution, so the calculations are accurate and reliable.

Prompt:
Q: Roger has 5 tennis balls. [...]
A:
# Roger started with 5 tennis balls
tennis_balls = 5
# 2 cans of 3 tennis balls each is
bought_balls = 2 * 3
# tennis balls. The answer is
answer = tennis_balls + bought_balls
Q: [...]

REACT

Prompting strategy that combines CoT reasoning and action planning, employing structured examples to guide an LLM in problem-solving and decision-making; ReAct reduces the risk of errors.

Prompt structure:
• Instructions: define the task, what a thought is, and the allowed actions (from a predetermined list).
• Question: the question to be answered.
• Thought: analysis of the current situation and the next steps to take.
• Action: taken from the predetermined list defined in the instructions; the loop ends when the model emits the finish[] action.
• Observation: the result of the previous action. A schematic loop follows below.

LangChain can be used to connect multiple components through agents, tools, etc. Agents interpret the user input and determine which tool to use for the task (LangChain includes agents for PAL and ReAct). Agents: 1. Plan actions (e.g., Step 1: Get customer ID, Step 2: Reset password) 2. Format outputs (formatting is required for applications to understand the actions) 3. Validate actions (collect information that allows validation of an action).
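A schematic ReAct loop (hypothetical llm and tools callables; a real agent framework such as LangChain handles parsing and validation far more robustly):

def react_loop(llm, tools, question, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)                    # model emits Thought + Action
        transcript += step + "\n"
        if "finish[" in step:                     # loop ends on the finish action
            return step.split("finish[", 1)[1].split("]", 1)[0]
        if "Action:" not in step:
            break
        # Parse "Action: tool_name[argument]" from the predetermined list.
        action = step.split("Action:", 1)[1].strip()
        name, arg = action.split("[", 1)
        observation = tools[name.strip()](arg.split("]", 1)[0])
        transcript += f"Observation: {observation}\n"
    return "No answer found."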