Key idea: An LLM is a deep learning model that takes text as input and predicts what text should come next.
LLMs break down text into tokens. Tokens are the basic unit of text in LLMs.
The process of breaking down text into tokens is called tokenization.
Tokenization example
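A minimal sketch of the idea, assuming a toy whitespace tokenizer. Real LLMs use subword tokenizers (e.g. byte-pair encoding), but splitting on whitespace shows the core step: text is broken into discrete units before the model sees it.

```python
def tokenize(text: str) -> list[str]:
    """Toy tokenizer: split text into tokens on whitespace.
    Real LLM tokenizers split into subword units instead."""
    return text.split()

tokens = tokenize("The capital of France is Paris")
print(tokens)  # ['The', 'capital', 'of', 'France', 'is', 'Paris']
```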
Each LLM has a vocabulary of tokens that it can use as input.
The vocabulary is fixed for a given LLM.
The number of tokens in the vocabulary is called the vocabulary size.
Modern LLMs have vocabulary sizes of ~30k-50k tokens.
LLMs learn a numerical vector for each token in their vocabulary.
The vector learned for each token is called an embedding.
The length of the embedding vector is called the embedding dimension.
Modern LLMs have embedding dimensions of ~768-1024.
Embeddings are dense vectors that capture the semantic meaning of input token(s).
The matrix of all embeddings is called an embedding space or embedding matrix.
Each LLM has its own embedding matrix that it learned from the pre-training process.
Each token in the input text is mapped to its index in the vocabulary.
This list of indices forms the token IDs of the input.
The rows of the embedding matrix corresponding to the token IDs are looked up to create the input to the LLM.
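This lookup can be sketched in a few lines of NumPy. The vocabulary, dimensions, and values here are toy assumptions, not from any real model:

```python
import numpy as np

# Toy vocabulary and embedding matrix (vocab_size x embedding_dim).
vocab = {"the": 0, "capital": 1, "of": 2, "france": 3}
embedding_matrix = np.random.rand(len(vocab), 4)  # dim 4 for illustration

# Map input tokens to token IDs, then index the embedding matrix.
tokens = ["the", "capital", "of", "france"]
token_ids = [vocab[t] for t in tokens]
input_embeddings = embedding_matrix[token_ids]

print(input_embeddings.shape)  # (4, 4): one embedding row per input token
```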
Attention is the key to LLMs.
It allows the model to focus on different parts of the input when making predictions about what tokens should come next.
After the attention mechanism is applied, the model creates a “hidden state” matrix that adjusts the input embeddings to incorporate information from the other tokens in the input.
The hidden state is then used to make predictions about what tokens should come next.
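A heavily simplified attention sketch: real transformers use learned query/key/value projections and multiple heads, which are omitted here, so this only illustrates how attention weights mix token embeddings into context-aware hidden states.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X):
    """Simplified self-attention with Q = K = V = X (no learned projections)."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)       # how strongly each token attends to each other
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ X                  # hidden states: context-adjusted embeddings

X = np.random.rand(6, 8)  # 6 input tokens, embedding dim 8
hidden = attention(X)
print(hidden.shape)  # (6, 8): one hidden state per input token
```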
Training models from scratch is a very expensive and time-consuming process.
Inference (i.e. using a trained model to generate text) is much faster.
When we interact with an LLM, the model is performing inference.
Only a few companies and institutions have the resources to train LLMs from scratch.
These models are called “Foundation Models”.
They differ in their architecture, parameters, and training data.
LLMs are one of two types: open weight (the trained weights are publicly released) or closed weight (the weights are proprietary and the model is accessed via an API).
Open weight models can be fine-tuned.
This means that the weights can be adjusted to perform a specific task.
This is a much faster and less expensive process than training a model from scratch.
The input passed to an LLM is called the context.
The context is like “short-term memory” - the model can only remember a limited amount of information at once.
Without being given additional capabilities, LLMs have no “long-term memory” per se.
Each LLM has a fixed number of tokens it can “see” at once (call this number N).
This is called the context window size.
For example, if N=1024, then the model can only see 1024 tokens at once.
If the input is longer than N, the model will only see the last N tokens.
The N tokens currently visible to the model are called the context window.
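The truncation behavior described above can be sketched directly. The function name here is illustrative:

```python
def visible_context(token_ids: list[int], n: int) -> list[int]:
    """Keep only the last n tokens - the model's context window.
    Inputs shorter than n are returned unchanged."""
    return token_ids[-n:]

# A 6-token input with a window of N=4: only the last 4 tokens remain visible.
print(visible_context([10, 11, 12, 13, 14, 15], 4))  # [12, 13, 14, 15]
```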
The prompt is the text input to the LLM.
The attention mechanism allows every token within the context window to influence the output.
The accuracy and usefulness of a response depends on the quality of the prompt.
The prompt generally has two components:
The system prompt is optional, but it is often used to guide the LLM’s behavior.
The user prompt is required.
System prompt:
You are a helpful assistant and expert in world history
and geography.
User prompt:
What is the capital of France?
The whole prompt is passed to the LLM as context:
System: You are a helpful assistant and expert in world history
and geography.
User: What is the capital of France?
The quality of the prompt is critical to the success of the LLM.
The process of creating a good prompt is called prompt engineering.
The idea is to give the LLM enough of the right information to answer the question or perform the task.
When the LLM responds, its output is appended to the context, and the model then awaits the next user input.
The process repeats until the user is satisfied with the response.
User responses are concatenated to the context and the entire context is passed to the LLM as input.
System: You are a helpful assistant and expert in world history
and geography.
User: What is the capital of France?
Assistant: The capital of France is **Paris**. It’s a global
center for art, fashion, gastronomy, and culture, and a hugely
important historical city – it’s been the seat of French power
for centuries! Do you want to know anything more about Paris,
or perhaps another city in France?
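The conversation loop above can be sketched as follows. The role labels and newline formatting are illustrative assumptions; real chat models use model-specific templates:

```python
# Each turn is appended to the message list, and the whole context
# (system + all user and assistant turns) is re-sent to the model.

def build_context(messages: list[tuple[str, str]]) -> str:
    """Flatten (role, text) turns into a single context string."""
    return "\n".join(f"{role}: {text}" for role, text in messages)

messages = [
    ("System", "You are a helpful assistant and expert in world history and geography."),
    ("User", "What is the capital of France?"),
    ("Assistant", "The capital of France is Paris."),
]
messages.append(("User", "Tell me more about Paris."))  # next user turn
print(build_context(messages))
```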
NB: You won’t typically see the role labels (e.g. “System”, “User”, “Assistant”) in chatbots.
“Prompt engineering” has been subsumed into “context engineering”.
Context engineering is the process of creating a good context for an LLM, which includes but is not limited to the user prompt.