Transformers, the tech behind LLMs | Deep Learning Chapter 5

3Blue1Brown's visual introduction to how transformers, the T in GPT, actually work. Grant Sanderson follows one stream of data through the network: text is split into tokens, each token becomes a high dimensional vector via the embedding matrix, attention and multilayer perceptron blocks refine those vectors layer by layer, and a final unembedding plus softmax step turns the last vector into a probability guess for the next token. He grounds every idea in the real GPT-3 numbers, 175 billion parameters, 12,288 embedding dimensions, a 50,257 token vocabulary, and shows how directions in embedding space carry meaning and how temperature reshapes the output. It is the foundation chapter that sets up the later deep dive on attention.

Published Apr 1, 2024. 27:14 video, 23 min read. Added Jul 4, 2026.

At a glance

This is 3Blue1Brown chapter 5 in the Deep Learning series, and it is the visual origin story for the thing behind the letters G, P, and T. Grant Sanderson takes the acronym Generative Pretrained Transformer apart word by word, then follows a single stream of data all the way through a transformer: text goes in, gets chopped into tokens, each token becomes a vector, the vectors talk to each other and get refined layer after layer, and at the very end one operation turns the last vector into a probability guess for the next chunk of text. Run that guess, sample from it, append, repeat, and you get ChatGPT writing one word at a time.

The whole episode is built to make the later chapters on attention feel easy, so it spends its time on foundations: what deep learning actually is, why almost every computation inside these models is a matrix times a vector, how words get turned into geometry through embeddings, how the dot product measures meaning, and how softmax converts raw scores into a distribution. Along the way Sanderson keeps a running tally of where GPT-3 hides its 175 billion parameters, using the concrete numbers as a spine. This page rebuilds the video in order, keeps every number, every analogy, and every aside, and turns his key animations into diagrams you can read on the page.

The deep explanation

What the three letters mean

The name says most of what the model is. Generative means it makes new text. Pretrained means it already went through a huge learning phase on a massive pile of data, with the prefix hinting that you can still fine tune it afterward on narrower tasks. But the load bearing word is the last one. A transformer is a specific kind of neural network, and Sanderson calls it the core invention underlying the current boom in AI. The goal of the video, and the chapters after it, is a visually driven walk through what actually happens inside one, following the data as it flows.

Transformers are not one gadget but a family. Some take in audio and emit a transcript. Some run the other way, generating synthetic speech from text. The image tools that stunned everyone in 2022, DALL-E and Midjourney, turn a text description into a picture and are built on transformers too. The original transformer, introduced in 2017 by Google in the paper Attention Is All You Need, was built for one narrow job: translating text between languages. The variant this series follows, the one under ChatGPT, is trained to do something simpler to state. Take in a piece of text, possibly with some images or sound alongside it, and predict what comes next. That prediction is not a single word but a probability distribution over many possible chunks that might follow.

Predict, sample, repeat: how prediction becomes generation

Predicting the next word sounds like a different task from writing a paragraph, but they are the same task run in a loop. Give the model a starting snippet. Let it produce its distribution over what comes next. Draw a random sample from that distribution. Append the sample to the text. Then feed the whole longer string back in and predict again. Sanderson admits the honest reaction up front: it really does not feel like this should work.
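
As a minimal sketch of that loop, the snippet below uses a made-up stand-in for the model. The tiny vocabulary, the function `toy_next_token_distribution`, and its rule for picking a favored token are all assumptions for illustration; only the predict, sample, append, repeat structure comes from the video.

```python
import random

# Hypothetical five-token vocabulary, not a real tokenizer.
VOCAB = ["once", "upon", "a", "time", "."]

def toy_next_token_distribution(tokens):
    """Return one probability per VOCAB entry (a made-up rule, not a trained model)."""
    # Favor the vocabulary entry that follows the last token, wrapping around.
    last = VOCAB.index(tokens[-1]) if tokens[-1] in VOCAB else -1
    favored = (last + 1) % len(VOCAB)
    probs = [0.05] * len(VOCAB)
    probs[favored] = 1.0 - 0.05 * (len(VOCAB) - 1)
    return probs

def generate(seed_tokens, steps, rng):
    """Run the predict, sample, append loop for a fixed number of steps."""
    tokens = list(seed_tokens)
    for _ in range(steps):
        probs = toy_next_token_distribution(tokens)        # predict
        next_token = rng.choices(VOCAB, weights=probs)[0]  # sample
        tokens.append(next_token)                          # append
    return tokens                                          # the for loop is the "repeat"

print(" ".join(generate(["once"], steps=6, rng=random.Random(0))))
```

A real model would replace `toy_next_token_distribution` with a full forward pass through the network; the surrounding loop would look much the same.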

He shows why size matters by running two versions. GPT-2 running locally on his laptop, sampling one chunk at a time, produces a story that does not really hold together. Swap in API calls to GPT-3, the same basic model just much bigger, and a sensible story appears almost magically, one that even infers a pi creature would live in a land of math and computation. That loop of repeated prediction and sampling is exactly what you are watching when a chatbot streams out one word at a time. Sanderson even wishes aloud for a feature that would let you see the underlying distribution behind each word it picks.

Turning this next word predictor into a chatbot takes one more trick. You prepend a bit of text that sets the scene of a user talking with a helpful AI assistant, which is the system prompt. You drop the user's question in as the first line of dialogue. Then you let the model predict what such a helpful assistant would say next. There is an extra training step needed to make this behave well, covered later, but that is the whole idea.
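
The prompt assembly can be sketched in a few lines. The function name `build_chat_prompt` and the "User:" and "Assistant:" labels are assumptions chosen for readability; production systems typically mark turns with their own special tokens rather than plain labels.

```python
def build_chat_prompt(system_prompt, user_message):
    """Assemble the text a next-token predictor would be asked to continue."""
    return (
        f"{system_prompt}\n\n"        # scene-setting text (the system prompt)
        f"User: {user_message}\n"     # the user's question as the first line of dialogue
        "Assistant:"                  # the model continues from here
    )

prompt = build_chat_prompt(
    "The following is a conversation between a user and a helpful AI assistant.",
    "What is a transformer?",
)
print(prompt)
```

Ending the text on the assistant's turn is the design choice that matters: whatever the model predicts next reads as the assistant's reply.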

A high level tour of the pipeline

Before any detail, Sanderson gives the aerial view of the journey a word takes. First the input is broken into little pieces called tokens. For text these are words, fragments of words, or common character combinations. For images or sound a token might be a patch of the image or a chunk of the audio. Each token is then attached to a vector, a list of numbers meant to encode its meaning. Think of those vectors as coordinates in a very high dimensional space, where words with similar meanings land near each other.

That sequence of vectors flows into an attention block, where the vectors talk to each other and pass information back and forth to update their values. His example: the word "model" means one thing in "a machine learning model" and another in "a fashion model." Attention is the machinery that figures out which words in context should update which other words, and how. Whenever he says meaning, he means it is literally encoded in the numbers inside those vectors.

Next the vectors pass through a multilayer perceptron, also called a feed forward layer. Here the vectors do not talk to each other. Each one goes through the same operation in parallel. It is harder to interpret, but later chapters describe it as asking a long list of questions about each vector and updating it based on the answers. Both blocks are, underneath, a giant pile of matrix multiplications, and the real skill is learning to read the underlying matrices. He glosses over some normalization steps that sit between the blocks, since this is a preview.

Then it repeats. You bounce between attention blocks and multilayer perceptron blocks many times, the hope being that by the end all the essential meaning of the passage has been baked into the very last vector in the sequence. One final operation on that last vector produces the probability distribution over all possible next tokens. And once you can predict what comes next, you feed in seed text and play the predict, sample, append, repeat game forever.

Figure 1. The whole pipeline in one column. Text becomes tokens, tokens become vectors through the embedding matrix W_E, the vectors are refined by alternating attention and multilayer perceptron blocks (GPT-3 stacks 96 such layers), and the final vector is turned into logits by the unembedding matrix W_U and into a probability distribution by softmax. Gray is data flowing through; amber and blue are the learned operations.

What this chapter actually covers

Sanderson is upfront about scope. This chapter expands the very beginning of the network (turning tokens into vectors) and the very end (turning the last vector into a prediction), plus a lot of background that any machine learning engineer would have taken for granted by the time transformers arrived. If you already have that background and are impatient, he says you can skip to the next chapter, which is about attention, the part generally considered the heart of the transformer. After that come chapters on the multilayer perceptron blocks, on how training works, and on the details skipped along the way. These videos sit inside a larger deep learning miniseries, and he says you can watch out of order, but he wants everyone on the same page about the basic premise first.

The premise of deep learning

Machine learning is any approach where you use data to determine how a model behaves. Say you want a function that takes an image and returns a label, or takes a passage and predicts the next word, or any task that needs intuition and pattern recognition. Rather than writing out an explicit procedure in code, the way people did in the earliest days of AI, you set up a very flexible structure with tunable parameters, like a wall of knobs and dials, and then use many examples of correct input and output to tune those knobs until the behavior is mimicked.

The simplest version is linear regression, where input and output are each single numbers, like the square footage of a house and its price, and you fit a line through the data to predict future prices. That line has just two continuous parameters, the slope and the y intercept. Deep learning models get much more complicated. GPT-3 has not two but 175 billion parameters. And it is not obvious you can build a model that big without it either grossly overfitting or being impossible to train. Deep learning is the class of models that, over the last couple decades, has proven to scale remarkably well.
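
To make the two-parameter case concrete, here is a small least-squares fit in plain Python. The house data is invented for illustration, and `fit_line` is a hypothetical helper, not something shown in the video.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of the inputs, and how the inputs and outputs vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up training examples: square footage and sale price.
square_feet = [1000, 1500, 2000, 2500]
prices = [200_000, 290_000, 410_000, 500_000]

slope, intercept = fit_line(square_feet, prices)
print(f"price = {slope:.1f} * sqft + {intercept:.1f}")
print(f"predicted price for 1800 sqft: {slope * 1800 + intercept:.0f}")
```

The whole model is the pair `(slope, intercept)`. GPT-3 follows the same recipe of tuning parameters to match examples, just with 175 billion of them and a different fitting procedure.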

What unifies these models is that they all train with the same algorithm, backpropagation, covered in earlier chapters. The key context is that for backpropagation to work at scale, the models have to follow a specific format, and knowing that format up front explains a lot of choices in a transformer that would otherwise look arbitrary.

The format has a few rules. The input must be formatted as an array of real numbers. That could be a flat list, a two dimensional array, or a higher dimensional array, the general term being a tensor. The data is progressively transformed through many layers, each layer itself an array of real numbers, until a final layer you call the output. In the text model, the final layer is the list of numbers giving the probability distribution over next tokens.

The parameters are almost always called weights, because the only way they touch the data is through weighted sums. You sprinkle in some nonlinear functions too, but those do not depend on parameters. Instead of writing those weighted sums out naked, you package them as matrix vector products, which says the same thing since each component of a matrix vector product is itself a weighted sum. It is cleaner to picture matrices full of tunable parameters transforming vectors drawn from the data.
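
A short sketch shows why the two descriptions coincide. The numbers below are arbitrary; the point is that each output component is one row of weights applied to the data as a weighted sum.

```python
def matvec(matrix, vector):
    """Matrix-vector product: one weighted sum of the input per matrix row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Tunable parameters (the "weights"), arbitrary values for illustration.
weights = [[1.0, 2.0, 0.0],
           [0.5, -1.0, 3.0]]
# The data flowing through on this particular run.
data = [2.0, 1.0, -1.0]

print(matvec(weights, data))  # [4.0, -3.0]
```

The first output is 1.0*2.0 + 2.0*1.0 + 0.0*(-1.0) = 4.0, and the second is 0.5*2.0 - 1.0*1.0 + 3.0*(-1.0) = -3.0.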

Here Sanderson lays down a rule he holds for the whole series. Keep a sharp mental line between the weights, which he always colors blue or red, and the data being processed, which he always colors gray. The weights are the actual brains, the thing learned during training, the thing that determines behavior. The data just encodes whatever specific input is fed in on a given run. Almost all of the computation inside a tool like ChatGPT, once you look under the hood, is matrix vector multiplication.

Tokens and the embedding matrix

Now the first real step: break the input into tokens and turn each into a vector. Tokens can be word fragments or punctuation, but for teaching Sanderson pretends text splits cleanly into words, since humans think in words and it makes the examples easier.

The model has a predefined vocabulary, some fixed list of all possible words, say 50,000 of them. The first matrix, the embedding matrix, labeled W_E, has one column for every word in the vocabulary. Those columns are what decide which vector each word becomes in the first step. Like every matrix in the model, its values start random and get learned from data. Looking up a word's vector is called embedding the word, which invites you to picture the vectors geometrically, as points in a high dimensional space.
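
The lookup can be sketched with a toy matrix. The three-word vocabulary and the two-dimensional values are invented; the sketch only illustrates that picking a column of W_E gives the same result as multiplying W_E by a one-hot vector, which keeps the step inside the matrix-vector format.

```python
vocab = ["the", "king", "queen"]

# Toy W_E: 2 embedding dimensions (rows) by 3 vocabulary words (columns).
# Real values start random and are learned; these are hand-picked.
W_E = [[0.1, 0.9, 0.8],
       [0.0, 0.2, 0.7]]

def embed(word):
    """Look up a word's vector by reading off its column of W_E."""
    col = vocab.index(word)
    return [row[col] for row in W_E]

def embed_via_one_hot(word):
    """Same lookup written as W_E times a one-hot vector."""
    one_hot = [1.0 if w == word else 0.0 for w in vocab]
    return [sum(w * x for w, x in zip(row, one_hot)) for row in W_E]

print(embed("king"))              # [0.9, 0.2]
print(embed_via_one_hot("king"))  # [0.9, 0.2]
```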

Three numbers would be coordinates in 3D space, easy to draw, but real word embeddings are far higher dimensional. In GPT-3 each embedding has 12,288 dimensions, and it matters to work in a space with that many distinct directions. To animate embeddings from a simple model, Sanderson takes a three dimensional slice through the high dimensional space and projects the word vectors onto it, the same way you might project 3D points onto a 2D plane.

Directions carry meaning

The big idea: as the model tunes its weights during training, it tends to settle on embeddings where directions in the space carry semantic meaning. In the simple word2vec style model he runs, searching for the words whose embeddings are closest to "tower" returns words that all give tower-ish vibes. Nearness in the space means similarity in meaning.

The classic demonstration is arithmetic on meaning. Take the vector for "woman" minus the vector for "man." That difference is a little arrow in the space. It turns out to be very close to the vector for "queen" minus the vector for "king." So if you did not know the word for a female monarch, you could compute king plus the (woman minus man) direction and search for the nearest embedding. Kind of. Sanderson is honest that the true embedding of "queen" sits a little farther off than that suggests, presumably because "queen" is not used in training data as merely a feminine "king." Family relations, he found, illustrate the idea more cleanly. The point stands: during training the model found it useful to make one direction in the space encode gender.

More examples in the same vein. Take the embedding of Italy, subtract Germany, add the result to Hitler, and you land near Mussolini, as if one direction encodes Italian-ness and another encodes World War 2 axis leaders. His favorite: in some models, Germany minus Japan added to sushi lands near bratwurst. And while hunting nearest neighbors he was pleased how close "cat" was to both "beast" and "monster."
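
The arithmetic behind these analogies can be sketched with hand-made two-dimensional vectors. Every number below is invented so that one axis loosely tracks royalty and the other gender; real embeddings are learned and have thousands of dimensions. "queen" is deliberately placed slightly off the ideal spot to echo the caveat that the match is only approximate.

```python
import math

# Hand-made toy embeddings: [royalty, gender]. Not learned, purely illustrative.
toy_embeddings = {
    "man":   [0.0, -1.0],
    "woman": [0.0,  1.0],
    "king":  [1.0, -1.0],
    "queen": [1.0,  0.9],
    "apple": [-1.0, 0.0],
}

def nearest_word(target, exclude):
    """Return the word whose embedding is closest (Euclidean) to target."""
    candidates = [w for w in toy_embeddings if w not in exclude]
    return min(candidates, key=lambda w: math.dist(toy_embeddings[w], target))

# The (woman minus man) direction, added to king.
direction = [w - m for w, m in zip(toy_embeddings["woman"], toy_embeddings["man"])]
target = [k + d for k, d in zip(toy_embeddings["king"], direction)]

guess = nearest_word(target, exclude={"king", "man", "woman"})
print(guess)  # queen
```

Excluding the three input words from the search is a common convention in analogy demos, since the nearest neighbor of the result is often one of the inputs themselves.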

Figure 2. Meaning becomes geometry. When embeddings are trained, a single direction in the space can come to encode a concept such as gender. The arrow from man to woman is nearly parallel to the arrow from king to queen, so "king plus the woman minus man direction" lands close to "queen." The same trick recovers Mussolini from Italy minus Germany plus Hitler.

The dot product measures alignment

A piece of intuition that pays off in the next chapter: the dot product of two vectors measures how well they align. Computationally it multiplies corresponding components and adds the results, which fits the theme that everything wants to look like a weighted sum. Geometrically the dot product is positive when the vectors point in similar directions, zero when they are perpendicular, and negative when they point opposite ways.
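
The three cases are easy to check by hand. The vectors below are arbitrary two-dimensional examples chosen to show each sign.

```python
def dot(a, b):
    """Multiply corresponding components and add the results."""
    return sum(x * y for x, y in zip(a, b))

print(dot([1.0, 2.0], [2.0, 1.0]))    # 4.0, similar directions
print(dot([1.0, 0.0], [0.0, 3.0]))    # 0.0, perpendicular
print(dot([1.0, 2.0], [-1.0, -2.0]))  # -5.0, opposite directions
```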

Sanderson tests a hypothesis with it. Suppose "cats" minus "cat" represents a plurality direction. Take that vector and dot it against the embeddings of various singular nouns, then against their plurals. The plural nouns consistently give higher values, meaning they align more with that direction. Even better, dotting the same direction against the embeddings of "one," "two," "three," and so on gives increasing values, as if you can quantitatively measure how plural the model finds a given word.

Counting the first parameters

The embedding matrix is the first pile of weights, and here the running tally begins. Using the real GPT-3 numbers, the vocabulary size is 50,257 tokens and the embedding dimension is 12,288. Multiply them and the embedding matrix holds about 617 million weights. He starts a scoreboard, since by the end it must reach 175 billion.
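
The count is a single multiplication, sketched here with the GPT-3 figures quoted above. The variable names are just labels for this example.

```python
vocab_size = 50_257      # tokens in the GPT-3 vocabulary
embedding_dim = 12_288   # numbers per embedding vector

# One column per token, one entry per dimension.
embedding_params = vocab_size * embedding_dim
print(f"{embedding_params:,}")  # 617,558,016
```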

Two framings to hold onto. First, vectors in the embedding space are not meant to stay as single words. They also encode the word's position, and more importantly they have the capacity to soak in context. Second, that soaking is the entire point of the network. A vector that begins as the embedding of "king" might get tugged and pulled by the network's blocks until it points in a far more specific direction, one that encodes it was a king who lived in Scotland, who took his post after murdering the previous king, and who is being described in Shakespearean language, exactly the story of Macbeth. The meaning of a word is informed by its surroundings, sometimes from far away, and the network exists to let each vector incorporate that context efficiently. In the first step every vector is just plucked from the embedding matrix and knows only its own word. The job of the network is to enrich it.

Context size

The network processes a fixed number of vectors at a time, its context size. GPT-3 was trained with a context size of 2048, so the data always looks like an array of 2048 columns, each with 12,288 dimensions. That limit is how much text the model can weigh when predicting the next word. It is why very long conversations with early versions of ChatGPT felt like the bot was losing the thread. Once you talked past the window, the earliest text simply fell out of view.
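
One simple way an application might handle text longer than the window is to keep only the most recent tokens. That truncation policy, and the helper name `visible_window`, are assumptions for illustration; the video only says that early text falls out of view.

```python
CONTEXT_SIZE = 2048  # GPT-3's context size

def visible_window(tokens, context_size=CONTEXT_SIZE):
    """Keep only the most recent tokens that fit in the context window."""
    return tokens[-context_size:]

# Stand-in for a long conversation: 3000 token ids, numbered 0 to 2999.
conversation = list(range(3000))
window = visible_window(conversation)

print(len(window), window[0])  # 2048 952
```

The first 952 tokens are no longer available to the model, which is the "losing the thread" effect described above.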

Figure 3. Why attention exists. The token "model" enters with one generic embedding, but its true meaning depends on neighbors. Attention, the subject of the next chapter, is the block that lets nearby words tug the vector toward the machine learning sense or the fashion sense. Meaning is not fixed at lookup; it is refined layer by layer.

The unembedding: turning the last vector into a guess

Skipping ahead past attention, Sanderson jumps to the end. The desired output is a probability distribution over all tokens that might come next. His example: if the last word is "Professor," the context mentions Harry Potter, and just before we saw "least favorite teacher," then a well trained network that learned the books would assign a high number to Snape.

Getting there takes two steps. First, another matrix maps the very last vector to a list of roughly 50,000 values, one per token in the vocabulary (50,257 for GPT-3). Then softmax normalizes that list into a probability distribution. It might seem odd to predict using only the last vector while thousands of other context rich vectors sit in the final layer doing nothing. The reason is training efficiency: during training it turns out to be much more efficient to use every vector in the final layer to simultaneously predict what comes right after it, so all of them do useful work.

That second matrix is the unembedding matrix, labeled W_U. Its entries start random and are learned. It has one row per vocabulary word, and each row has as many elements as the embedding dimension, so it is essentially the embedding matrix with the order swapped. That adds another 617 million parameters, pushing the tally to roughly 1.2 billion, a small but not wholly insignificant slice of the eventual 175 billion.
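
Here is a toy version of that readout step. The three-token vocabulary, the values in `W_U`, and `last_vector` are all invented; only the shape (one row per token, one score per row) reflects the description above.

```python
# Hypothetical three-token vocabulary for the "least favorite teacher" example.
vocab = ["Snape", "Lupin", "McGonagall"]

# Toy W_U: one row per vocabulary token, each row as long as the embedding (3 here).
W_U = [[ 1.0, 0.5, -0.5],
       [ 0.2, 0.1,  0.0],
       [-1.0, 0.3,  0.4]]

# Stand-in for the final, context-rich vector at the last position.
last_vector = [2.0, 1.0, 0.0]

# One raw score per token: the dot product of its row with the last vector.
scores = [sum(w * x for w, x in zip(row, last_vector)) for row in W_U]

best = vocab[scores.index(max(scores))]
print(scores)
print(best)  # Snape
```

These raw scores are not yet probabilities; they can be negative and need not sum to 1, which is why a normalization step follows.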

Figure 4. The scoreboard. The two matrices this chapter counts, embedding and unembedding, come to about 1.2 billion parameters, the thin amber sliver, only about 0.7 percent of GPT-3's 175 billion. The vast remainder lives in the attention and multilayer perceptron blocks stacked 96 deep, which later chapters unpack. All 175 billion weights are organized into just under 28,000 matrices falling into 8 categories.

Softmax, with temperature

The last mini lesson is softmax, which returns during attention too. If a list of numbers is to act as a probability distribution, each value must sit between 0 and 1 and they must all add to 1. But in the deep learning game where everything is matrix vector multiplication, the raw outputs obey none of that. They are often negative, sometimes far bigger than 1, and they do not sum to 1.

Softmax is the standard fix. It turns any list of numbers into a valid distribution where the largest inputs end up near 1 and the smallest near 0. The mechanism: raise e to the power of each number, which makes every term positive, then divide each by the sum of all of them so the whole list adds to 1. If one input is meaningfully bigger than the rest, its output term dominates, so sampling would almost always pick it. But it is softer than a hard maximum, because when several inputs are similarly large they all keep meaningful weight, and everything varies continuously as the inputs vary.
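
A minimal softmax following that recipe is sketched below. Subtracting the maximum before exponentiating is a standard numerical safeguard added here, not something the video covers; it leaves the result unchanged because the same factor cancels in the numerator and denominator.

```python
import math

def softmax(values):
    """Turn any list of numbers into a probability distribution."""
    shift = max(values)                           # stability trick, does not change the output
    exps = [math.exp(v - shift) for v in values]  # every term becomes positive
    total = sum(exps)
    return [e / total for e in exps]              # divide so the list sums to 1

raw_scores = [2.5, 0.5, -1.7]
probs = softmax(raw_scores)
print([round(p, 3) for p in probs])
print(sum(probs))
```

With these inputs the first entry takes most of the mass, while the others keep small but nonzero probability, which is the "soft" part of the name.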

When ChatGPT uses this to pick a word, there is room for a knob. You divide each exponent by a constant T before raising e to it, and call T the temperature, because it loosely echoes temperature in thermodynamics. Larger T gives more weight to the lower values, flattening the distribution toward uniform. Smaller T lets the big values dominate harder. In the limit as T goes to zero (T cannot literally be zero, since it is a divisor), all the weight goes to the single maximum.
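
The knob can be sketched as a small extension of softmax. Treating a temperature of exactly zero as "pick the maximum" is a convention assumed here to avoid dividing by zero, and the example logits are arbitrary.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax where each logit is divided by the temperature first."""
    if temperature == 0:
        # Limiting case: all weight on the largest logit (first one if tied).
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # numerical safeguard, does not change the result
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
for t in (0, 0.5, 1, 2):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the top probability shrinking as the temperature rises, with the lower-scoring options picking up the difference.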

He demonstrates with the seed "once upon a time there was a" at different temperatures. At temperature zero it always takes the most predictable word and produces a trite derivative of Goldilocks. Higher temperature lets it reach for less likely words, which is riskier: one run starts more originally, about a young web artist from South Korea, then quickly degenerates into nonsense. The OpenAI API will not let you set temperature above 2, not for any mathematical reason but as an arbitrary guardrail against fully nonsensical output. For the animation itself, Sanderson notes he takes the 20 most probable next tokens GPT-3 will hand him, the maximum it returns, then reweights their probabilities using an exponent of one fifth.

One last piece of jargon. Just as the outputs of softmax are called probabilities, the inputs are called logits. He riffs on the two competing pronunciations of the word, which look identical in print, and picks one. So when text flows through the network and you do the final multiply with the unembedding matrix, the raw unnormalized numbers that come out are the logits for the next word prediction.

Figure 5. Temperature reshapes the same logits. At T = 0.5 the top token takes 86 percent of the mass and the model is predictable. At T = 1 the peak softens to 61 percent. At T = 2 the mass spreads out, so unlikely tokens get a real chance, which is where original phrasing and eventual gibberish both come from. The API caps T at 2.

Why the groundwork matters

Sanderson closes by naming his strategy: this chapter was foundation laying, Karate Kid wax on wax off style. If you have a strong intuition for word embeddings, for softmax, for how the dot product measures similarity, and for the premise that most of the computation is matrix multiplication with matrices full of tunable parameters, then the attention mechanism, the cornerstone of the modern AI boom, should go down smoothly in the next chapter. He mentions that as he published this, a draft of the attention chapter was already available for review by Patreon supporters, with the public version a week or two out.

Key takeaways

GPT stands for Generative Pretrained Transformer, and the transformer, introduced by Google in 2017 for translation, is the invention behind the current AI boom.
Generation is just next token prediction in a loop: predict a distribution, sample, append, repeat. GPT-3 makes this feel magical where GPT-2 does not, purely from scale.
Data flows as tokens turned into vectors, refined by alternating attention blocks (vectors share context) and multilayer perceptron blocks (each vector processed in parallel), stacked many layers deep, then read out at the end.
Almost every computation inside the model is a matrix times a vector. Keep the weights (learned, the brains) mentally separate from the data (the specific input flowing through).
Embeddings turn words into points in a high dimensional space, 12,288 dimensions in GPT-3, where directions carry meaning. Woman minus man is close to queen minus king.
The dot product measures how well two vectors align, which is the tool attention will lean on next.
Softmax converts raw logits into a valid probability distribution. Temperature controls how peaked or flat that distribution is, trading predictability for creativity.
The GPT-3 numbers: 175 billion parameters, just under 28,000 matrices in 8 categories, 50,257 token vocabulary, 12,288 embedding dimension, 2048 context size. Embedding plus unembedding is about 1.2 billion of that, only around 0.7 percent.

Chapters

0:00 Predict, sample, repeat
3:03 Inside a transformer
6:36 Chapter layout
7:20 The premise of Deep Learning
12:27 Word embeddings
18:25 Embeddings beyond words
20:22 Unembedding
22:22 Softmax with temperature
26:03 Up next

Notable quotes

"A transformer is a specific kind of neural network, a machine learning model, and it's the core invention underlying the current boom in AI." (0:22)

"I don't know about you, but it really doesn't feel like this should actually work." (1:26)

"The weights are the actual brains, they are the things learned during training, and they determine how it behaves." (11:38)

"It tends to settle on a set of embeddings where directions in the space have a kind of semantic meaning." (13:55)

"A vector that started its life as the embedding of the word king might progressively get tugged and pulled by various blocks in this network, so that by the end it points in a much more specific and nuanced direction that somehow encodes that it was a king who lived in Scotland, and who had achieved his post after murdering the previous king, and who's being described in Shakespearean language." (18:52)

"Softmax is the standard way to turn an arbitrary list of numbers into a valid distribution in such a way that the largest values end up closest to 1, and the smaller values end up very close to 0." (23:04)

"Some people say logits, some people say logits, I'm gonna say logits." (25:18)

"A lot of the goal with this chapter was to lay the foundations for understanding the attention mechanism, Karate Kid wax-on-wax-off style." (25:44)

Resources mentioned

Attention Is All You Need (Vaswani et al., 2017), the Google paper that introduced the transformer.
Language Models are Few-Shot Learners (2020), the GPT-3 paper behind the numbers used throughout.
ChatGPT, GPT-2, and GPT-3 from OpenAI.
DALL-E and Midjourney, image models built on transformers.
word2vec, the style of word embedding model used for the animations of semantic directions.
Softmax function and logits, the readout math at the end of the network.
Backpropagation, the training algorithm covered in earlier chapters of the series.
3Blue1Brown Neural Networks series, the miniseries this video belongs to.
3blue1brown.com and the channel on Patreon, where the next chapter on attention first appeared in draft.

Full transcript

The initials GPT stand for Generative Pretrained Transformer. So that first word is straightforward enough, these are bots that generate new text. Pretrained refers to how the model went through a process of learning from a massive amount of data, and the prefix insinuates that there's more room to fine-tune it on specific tasks with additional training. But the last word, that's the real key piece. A transformer is a specific kind of neural network, a machine learning model, and it's the core invention underlying the current boom in AI. What I want to do with this video and the following chapters is go through a visually-driven explanation for what actually happens inside a transformer. We're going to follow the data that flows through it and go step by step. There are many different kinds of models that you can build using transformers. Some models take in audio and produce a transcript. This sentence comes from a model going the other way around, producing synthetic speech just from text. All those tools that took the world by storm in 2022 like DALL-E and Midjourney that take in a text description and produce an image are based on transformers. Even if I can't quite get it to understand what a pi creature is supposed to be, I'm still blown away that this kind of thing is even remotely possible. And the original transformer introduced in 2017 by Google was invented for the specific use case of translating text from one language into another. But the variant that you and I will focus on, which is the type that underlies tools like ChatGPT, will be a model that's trained to take in a piece of text, maybe even with some surrounding images or sound accompanying it, and produce a prediction for what comes next in the passage. That prediction takes the form of a probability distribution over many different chunks of text that might follow. At first glance, you might think that predicting the next word feels like a very different goal from generating new text. 
But once you have a prediction model like this, a simple thing you could try to make it generate a longer piece of text is to give it an initial snippet to work with, have it take a random sample from the distribution it just generated, append that sample to the text, and then run the whole process again to make a new prediction based on all the new text, including what it just added. I don't know about you, but it really doesn't feel like this should actually work. In this animation, for example, I'm running GPT-2 on my laptop and having it repeatedly predict and sample the next chunk of text to generate a story based on the seed text. The story just doesn't actually really make that much sense. But if I swap it out for API calls to GPT-3 instead, which is the same basic model, just much bigger, suddenly almost magically we do get a sensible story, one that even seems to infer that a pi creature would live in a land of math and computation. This process here of repeated prediction and sampling is essentially what's happening when you interact with ChatGPT, or any of these other large language models, and you see them producing one word at a time. In fact, one feature that I would very much enjoy is the ability to see the underlying distribution for each new word that it chooses. Let's kick things off with a very high level preview of how data flows through a transformer. We will spend much more time motivating and interpreting and expanding on the details of each step, but in broad strokes, when one of these chatbots generates a given word, here's what's going on under the hood. First, the input is broken up into a bunch of little pieces. These pieces are called tokens, and in the case of text these tend to be words or little pieces of words or other common character combinations. If images or sound are involved, then tokens could be little patches of that image or little chunks of that sound. 
Each one of these tokens is then associated with a vector, meaning some list of numbers, which is meant to somehow encode the meaning of that piece. If you think of these vectors as giving coordinates in some very high dimensional space, words with similar meanings tend to land on vectors that are close to each other in that space. This sequence of vectors then passes through an operation that's known as an attention block, and this allows the vectors to talk to each other and pass information back and forth to update their values. For example, the meaning of the word model in the phrase "a machine learning model" is different from its meaning in the phrase "a fashion model". The attention block is what's responsible for figuring out which words in context are relevant to updating the meanings of which other words, and how exactly those meanings should be updated. And again, whenever I use the word meaning, this is somehow entirely encoded in the entries of those vectors. After that, these vectors pass through a different kind of operation, and depending on the source that you're reading this will be referred to as a multi-layer perceptron or maybe a feed-forward layer. And here the vectors don't talk to each other, they all go through the same operation in parallel. And while this block is a little bit harder to interpret, later on we'll talk about how the step is a little bit like asking a long list of questions about each vector, and then updating them based on the answers to those questions. All of the operations in both of these blocks look like a giant pile of matrix multiplications, and our primary job is going to be to understand how to read the underlying matrices. I'm glossing over some details about some normalization steps that happen in between, but this is after all a high-level preview. 
After that, the process essentially repeats, you go back and forth between attention blocks and multi-layer perceptron blocks, until at the very end the hope is that all of the essential meaning of the passage has somehow been baked into the very last vector in the sequence. We then perform a certain operation on that last vector that produces a probability distribution over all possible tokens, all possible little chunks of text that might come next. And like I said, once you have a tool that predicts what comes next given a snippet of text, you can feed it a little bit of seed text and have it repeatedly play this game of predicting what comes next, sampling from the distribution, appending it, and then repeating over and over. Some of you in the know may remember how, long before ChatGPT came onto the scene, this is what early demos of GPT-3 looked like: you would have it autocomplete stories and essays based on an initial snippet.

To make a tool like this into a chatbot, the easiest starting point is to have a little bit of text that establishes the setting of a user interacting with a helpful AI assistant, what you would call the system prompt, and then you would use the user's initial question or prompt as the first bit of dialogue, and then you have it start predicting what such a helpful AI assistant would say in response. There is more to say about an added step of training that's required to make this work well, but at a high level this is the idea.

In this chapter, you and I are going to expand on the details of what happens at the very beginning of the network, at the very end of the network, and I also want to spend a lot of time reviewing some important bits of background knowledge, things that would have been second nature to any machine learning engineer by the time transformers came around.
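The predict, sample, append, repeat loop described above can be sketched in a few lines. This is only an illustration under loud assumptions: `TOY_MODEL`, `predict_next`, and `generate` are hypothetical names, and the "model" is a hand-written lookup table that looks only at the last token, whereas a real transformer conditions on the whole context.

```python
import random

# Hypothetical toy "model": maps the most recent token to a distribution
# over next tokens. A real transformer would use the entire context.
TOY_MODEL = {
    "once": {"upon": 1.0},
    "upon": {"a": 1.0},
    "a": {"time": 0.7, "pi": 0.3},
    "time": {"there": 1.0},
    "there": {"was": 1.0},
    "was": {"a": 1.0},
    "pi": {"creature": 1.0},
    "creature": {"once": 1.0},
}


def predict_next(tokens):
    """Return a dict mapping each candidate next token to its probability."""
    return TOY_MODEL[tokens[-1]]


def generate(seed, steps, rng):
    """Repeatedly predict a distribution, sample from it, and append."""
    tokens = list(seed)
    for _ in range(steps):
        dist = predict_next(tokens)
        candidates = list(dist.keys())
        weights = list(dist.values())
        # Sample one token according to the predicted probabilities.
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        tokens.append(next_token)
    return tokens


story = generate(["once", "upon"], steps=6, rng=random.Random(0))
print(" ".join(story))
```

Swapping `predict_next` for a call into a real language model is the only change the loop itself would need; the sampling and appending stay the same.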
If you're comfortable with that background knowledge and a little impatient, you could probably feel free to skip to the next chapter, which is going to focus on the attention blocks, generally considered the heart of the transformer. After that, I want to talk more about these multi-layer perceptron blocks, how training works, and a number of other details that will have been skipped up to that point. For broader context, these videos are additions to a mini-series about deep learning, and it's okay if you haven't watched the previous ones, I think you can do it out of order, but before diving into transformers specifically, I do think it's worth making sure that we're on the same page about the basic premise and structure of deep learning. At the risk of stating the obvious, this is one approach to machine learning, which describes any model where you're using data to somehow determine how a model behaves. What I mean by that is, let's say you want a function that takes in an image and it produces a label describing it, or our example of predicting the next word given a passage of text, or any other task that seems to require some element of intuition and pattern recognition. We almost take this for granted these days, but the idea with machine learning is that rather than trying to explicitly define a procedure for how to do that task in code, which is what people would have done in the earliest days of AI, instead you set up a very flexible structure with tunable parameters, like a bunch of knobs and dials, and then, somehow, you use many examples of what the output should look like for a given input to tweak and tune the values of those parameters to mimic this behavior. 
For example, maybe the simplest form of machine learning is linear regression, where your inputs and outputs are each single numbers, something like the square footage of a house and its price, and what you want is to find a line of best fit through this data, you know, to predict future house prices. That line is described by two continuous parameters, say the slope and the y-intercept, and the goal of linear regression is to determine those parameters to closely match the data. Needless to say, deep learning models get much more complicated. GPT-3, for example, has not two, but 175 billion parameters. But here's the thing, it's not a given that you can create some giant model with a huge number of parameters without it either grossly overfitting the training data or being completely intractable to train. Deep learning describes a class of models that in the last couple decades have proven to scale remarkably well. What unifies them is that they all use the same training algorithm, it's called backpropagation, we talked about it in previous chapters, and the context that I want you to have as we go in is that in order for this training algorithm to work well at scale, these models have to follow a certain specific format. And if you know this format going in, it helps to explain many of the choices for how a transformer processes language, which otherwise run the risk of feeling kinda arbitrary. First, whatever kind of model you're making, the input has to be formatted as an array of real numbers. This could simply mean a list of numbers, it could be a two-dimensional array, or very often you deal with higher dimensional arrays, where the general term used is tensor. You often think of that input data as being progressively transformed into many distinct layers, where again, each layer is always structured as some kind of array of real numbers, until you get to a final layer which you consider the output. 
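To make the two-parameter linear regression example from this passage concrete, here is a minimal sketch of fitting a slope and a y-intercept by ordinary least squares. The house data is made up purely for illustration, and `fit_line` is a hypothetical helper name, not something from the video.

```python
def fit_line(xs, ys):
    """Least-squares line of best fit: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: square footage and price in thousands of dollars.
square_feet = [1000, 1500, 2000, 2500]
prices = [200, 290, 410, 500]

slope, intercept = fit_line(square_feet, prices)
predicted = slope * 1800 + intercept  # price estimate for an 1800 sq ft house
print(slope, intercept, predicted)
```

The whole model here is the pair `(slope, intercept)`; the contrast with GPT-3 is that the same "tune parameters to match data" idea is applied to 175 billion numbers instead of two.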
For example, the final layer in our text processing model is a list of numbers representing the probability distribution for all possible next tokens. In deep learning, these model parameters are almost always referred to as weights, and this is because a key feature of these models is that the only way these parameters interact with the data being processed is through weighted sums. You also sprinkle some non-linear functions throughout, but they won't depend on parameters. Typically, though, instead of seeing the weighted sums all naked and written out explicitly like this, you'll instead find them packaged together as various components in a matrix vector product. It amounts to saying the same thing, if you think back to how matrix vector multiplication works, each component in the output looks like a weighted sum. It's just often conceptually cleaner for you and me to think about matrices that are filled with tunable parameters that transform vectors that are drawn from the data being processed.

For example, those 175 billion weights in GPT-3 are organized into just under 28,000 distinct matrices. Those matrices in turn fall into eight different categories, and what you and I are going to do is step through each one of those categories to understand what that type does. As we go through, I think it's kind of fun to reference the specific numbers from GPT-3 to count up exactly where those 175 billion come from. Even if nowadays there are bigger and better models, this one has a certain charm as the first large language model to really capture the world's attention outside of ML communities. Also, practically speaking, companies tend to keep much tighter lips around the specific numbers for more modern networks. I just want to set the scene going in, that as you peek under the hood to see what happens inside a tool like ChatGPT, almost all of the actual computation looks like matrix vector multiplication.
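The claim that a matrix vector product is just a stack of weighted sums can be checked directly. In this small sketch the matrix `W` plays the role of the weights and the vector `x` plays the role of the data being processed; the numbers and the helper names `weighted_sum` and `matvec` are made up for illustration.

```python
def weighted_sum(weights, values):
    """One weighted sum: each data value is scaled by its weight, then added."""
    return sum(w * v for w, v in zip(weights, values))


def matvec(matrix, vector):
    """Matrix-vector product: one weighted sum per row of the matrix."""
    return [weighted_sum(row, vector) for row in matrix]


# Weights (tunable parameters), one row per output component.
W = [
    [1.0, 2.0, 0.0],
    [0.5, -1.0, 3.0],
]
# Data being processed.
x = [2.0, 1.0, -1.0]

y = matvec(W, x)
print(y)
```

Each entry of `y` is computed from a single row of `W`, which is why counting the entries of the matrices is the same thing as counting the model's weights.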
There's a little bit of a risk getting lost in the sea of billions of numbers, but you should draw a very sharp distinction in your mind between the weights of the model, which I'll always color in blue or red, and the data being processed, which I'll always color in gray. The weights are the actual brains, they are the things learned during training, and they determine how it behaves. The data being processed simply encodes whatever specific input is fed into the model for a given run, like an example snippet of text. With all of that as foundation, let's dig into the first step of this text processing example, which is to break up the input into little chunks and turn those chunks into vectors. I mentioned how those chunks are called tokens, which might be pieces of words or punctuation, but every now and then in this chapter and especially in the next one, I'd like to just pretend that it's broken more cleanly into words. Because we humans think in words, this will just make it much easier to reference little examples and clarify each step. The model has a predefined vocabulary, some list of all possible words, say 50,000 of them, and the first matrix that we'll encounter, known as the embedding matrix, has a single column for each one of these words. These columns are what determines what vector each word turns into in that first step. We label it W_E, and like all the matrices we see, its values begin random, but they're going to be learned based on data. Turning words into vectors was common practice in machine learning long before transformers, but it's a little weird if you've never seen it before, and it sets the foundation for everything that follows, so let's take a moment to get familiar with it. We often call this embedding a word, which invites you to think of these vectors very geometrically as points in some high dimensional space. 
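As a rough sketch of what "a single column for each word" means, the snippet below stores a tiny embedding matrix and pulls out the column for a token. The four-word vocabulary, the three dimensions, and all the values are invented for illustration; the second function shows the equivalent view where the lookup is a matrix vector product with a one-hot vector, which is one common way to describe it and not necessarily how any given library implements it.

```python
VOCAB = ["the", "cat", "sat", "mat"]

# W_E has one row per embedding dimension and one column per token,
# so its shape here is 3 x 4. Values are made up; in a real model
# they start random and are learned during training.
W_E = [
    [0.1, 0.9, -0.3, 0.4],
    [0.0, 0.2, 0.8, -0.5],
    [-0.7, 0.3, 0.1, 0.6],
]


def embed(token):
    """Pluck out the column of W_E that belongs to this token."""
    col = VOCAB.index(token)
    return [row[col] for row in W_E]


def embed_via_one_hot(token):
    """Same result, written as W_E times a one-hot vector."""
    one_hot = [1.0 if t == token else 0.0 for t in VOCAB]
    return [sum(w * h for w, h in zip(row, one_hot)) for row in W_E]


print(embed("cat"))
print(embed_via_one_hot("cat"))
```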
Visualizing a list of three numbers as coordinates for points in 3D space would be no problem, but word embeddings tend to be much much higher dimensional. In GPT-3 they have 12,288 dimensions, and as you'll see, it matters to work in a space that has a lot of distinct directions. In the same way that you could take a two-dimensional slice through a 3D space and project all the points onto that slice, for the sake of animating word embeddings that a simple model is giving me, I'm going to do an analogous thing by choosing a three-dimensional slice through this very high dimensional space, and projecting the word vectors down onto that and displaying the results. The big idea here is that as a model tweaks and tunes its weights to determine how exactly words get embedded as vectors during training, it tends to settle on a set of embeddings where directions in the space have a kind of semantic meaning. For the simple word-to-vector model I'm running here, if I run a search for all the words whose embeddings are closest to that of tower, you'll notice how they all seem to give very similar tower-ish vibes. And if you want to pull up some Python and play along at home, this is the specific model that I'm using to make the animations. It's not a transformer, but it's enough to illustrate the idea that directions in the space can carry semantic meaning. A very classic example of this is how if you take the difference between the vectors for woman and man, something you would visualize as a little vector in the space connecting the tip of one to the tip of the other, it's very similar to the difference between king and queen. So let's say you didn't know the word for a female monarch, you could find it by taking king, adding this woman minus man direction, and searching for the embedding closest to that point. At least, kind of. 
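The king plus (woman minus man) search can be sketched with hand-made vectors. Everything here is an assumption for illustration: the three-dimensional embeddings in `EMB` are invented so that the second coordinate loosely tracks gender and the third loosely tracks royalty, "closest" is taken to mean smallest Euclidean distance, and the query words themselves are excluded from the search. Real learned embeddings are far messier.

```python
import math

# Made-up 3D embeddings, NOT taken from any trained model.
EMB = {
    "man": [1.0, 0.0, 0.1],
    "woman": [1.0, 1.0, 0.1],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 0.9, 1.1],
    "tower": [-1.0, 0.2, 0.3],
}


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def nearest(target, exclude=()):
    """Word whose embedding has the smallest Euclidean distance to target."""
    candidates = [w for w in EMB if w not in exclude]
    return min(candidates, key=lambda w: math.dist(EMB[w], target))


# king + (woman - man), then search for the closest embedding.
target = add(EMB["king"], sub(EMB["woman"], EMB["man"]))
answer = nearest(target, exclude=("king", "woman", "man"))
print(answer)
```

In this toy setup the invented vector for queen deliberately sits a little off the exact target point, echoing the observation that the match is only approximate.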
Despite this being a classic example for the model I'm playing with, the true embedding of queen is actually a little farther off than this would suggest, presumably because the way queen is used in training data is not merely a feminine version of king. When I played around, family relations seemed to illustrate the idea much better. The point is, it looks like during training the model found it advantageous to choose embeddings such that one direction in this space encodes gender information. Another example is that if you take the embedding of Italy, and you subtract the embedding of Germany, and add that to the embedding of Hitler, you get something very close to the embedding of Mussolini. It's as if the model learned to associate some directions with Italian-ness, and others with WWII axis leaders. Maybe my favorite example in this vein is how in some models, if you take the difference between Germany and Japan, and add it to sushi, you end up very close to bratwurst. Also in playing this game of finding nearest neighbors, I was very pleased to see how close cat was to both beast and monster. One bit of mathematical intuition that's helpful to have in mind, especially for the next chapter, is how the dot product of two vectors can be thought of as a way to measure how well they align. Computationally, dot products involve multiplying all the corresponding components and then adding the results, which is good, since so much of our computation has to look like weighted sums. Geometrically, the dot product is positive when vectors point in similar directions, it's zero if they're perpendicular, and it's negative whenever they point in opposite directions. For example, let's say you were playing with this model, and you hypothesize that the embedding of cats minus cat might represent a sort of plurality direction in this space. 
To test this, I'm going to take this vector and compute its dot product against the embeddings of certain singular nouns, and compare it to the dot products with the corresponding plural nouns. If you play around with this, you'll notice that the plural ones do indeed seem to consistently give higher values than the singular ones, indicating that they align more with this direction. It's also fun how if you take this dot product with the embeddings of the words one, two, three, and so on, they give increasing values, so it's as if we can quantitatively measure how plural the model finds a given word.

Again, the specifics for how words get embedded are learned using data. This embedding matrix, whose columns tell us what happens to each word, is the first pile of weights in our model. Using the GPT-3 numbers, the vocabulary size specifically is 50,257, and again, technically this consists not of words per se, but of tokens. The embedding dimension is 12,288, and multiplying those tells us this consists of about 617 million weights. Let's go ahead and add this to a running tally, remembering that by the end we should count up to 175 billion.

In the case of transformers, you really want to think of the vectors in this embedding space as not merely representing individual words. For one thing, they also encode information about the position of that word, which we'll talk about later, but more importantly, you should think of them as having the capacity to soak in context. A vector that started its life as the embedding of the word king, for example, might progressively get tugged and pulled by various blocks in this network, so that by the end it points in a much more specific and nuanced direction that somehow encodes that it was a king who lived in Scotland, and who had achieved his post after murdering the previous king, and who's being described in Shakespearean language. Think about your own understanding of a given word.
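The dot product test for a plurality direction can be sketched as follows. The embeddings in `EMB` are invented for illustration (the third coordinate is deliberately set up to behave like "how plural"), so this shows the mechanics of the test and says nothing about what a real model learned. The first three lines after `dot` also check the geometric cases mentioned earlier: positive, zero, and negative.

```python
def dot(u, v):
    """Multiply corresponding components, then add the results."""
    return sum(a * b for a, b in zip(u, v))


# Geometric cases in 2D: aligned, perpendicular, opposite.
aligned = dot([1.0, 0.0], [2.0, 0.0])
perpendicular = dot([1.0, 0.0], [0.0, 3.0])
opposite = dot([1.0, 0.0], [-1.0, 0.0])

# Made-up embeddings, NOT from a trained model.
EMB = {
    "cat": [0.8, 0.1, -0.5],
    "cats": [0.8, 0.1, 0.6],
    "dog": [0.7, 0.3, -0.4],
    "dogs": [0.7, 0.3, 0.5],
    "student": [0.2, 0.9, -0.6],
    "students": [0.2, 0.9, 0.7],
}

# Hypothesized plurality direction: cats minus cat.
plural_dir = [a - b for a, b in zip(EMB["cats"], EMB["cat"])]

for singular, plural in [("dog", "dogs"), ("student", "students")]:
    s = dot(plural_dir, EMB[singular])
    p = dot(plural_dir, EMB[plural])
    print(singular, round(s, 3), plural, round(p, 3))
```

If the hypothesis holds, the plural word in each pair scores higher against `plural_dir` than its singular form, which is exactly the comparison described above.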
The meaning of that word is clearly informed by the surroundings, and sometimes this includes context from a long distance away, so in putting together a model that has the ability to predict what word comes next, the goal is to somehow empower it to incorporate context efficiently. To be clear, in that very first step, when you create the array of vectors based on the input text, each one of those is simply plucked out of the embedding matrix, so initially each one can only encode the meaning of a single word without any input from its surroundings. But you should think of the primary goal of this network that it flows through as being to enable each one of those vectors to soak up a meaning that's much more rich and specific than what mere individual words could represent.

The network can only process a fixed number of vectors at a time, known as its context size. For GPT-3 it was trained with a context size of 2048, so the data flowing through the network always looks like this array of 2048 columns, each of which has 12,288 dimensions. This context size limits how much text the transformer can incorporate when it's making a prediction of the next word. This is why long conversations with certain chatbots, like the early versions of ChatGPT, often gave the feeling of the bot kind of losing the thread of conversation as you continued too long.

We'll go into the details of attention in due time, but skipping ahead I want to talk for a minute about what happens at the very end. Remember, the desired output is a probability distribution over all tokens that might come next. For example, if the very last word is Professor, and the context includes words like Harry Potter, and immediately preceding we see least favorite teacher, and also if you give me some leeway by letting me pretend that tokens simply look like full words, then a well-trained network that had built up knowledge of Harry Potter would presumably assign a high number to the word Snape.
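To put numbers on the context size, the sketch below computes how many values sit in the array of vectors flowing through the network at once, using the GPT-3 figures quoted above. The `clip_to_context` helper is a hypothetical illustration of one simple way a fixed window could lose the start of a long conversation (keeping only the most recent tokens); the video does not say how any particular chatbot handles this.

```python
CONTEXT_SIZE = 2048   # number of vectors processed at a time (GPT-3)
D_MODEL = 12_288      # dimensions per vector (GPT-3)

# Size of the data array flowing through the network: one column per token.
values_in_flight = CONTEXT_SIZE * D_MODEL


def clip_to_context(token_ids, context_size=CONTEXT_SIZE):
    """Assumed strategy: keep only the most recent `context_size` tokens."""
    return token_ids[-context_size:]


# A made-up conversation of 3000 token ids; the first 952 fall out of view.
conversation = list(range(3000))
window = clip_to_context(conversation)

print(values_in_flight)
print(len(window), window[0], window[-1])
```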
This involves two different steps. The first one is to use another matrix that maps the very last vector in that context to a list of roughly 50,000 values, one for each token in the vocabulary. Then there's a function that normalizes this into a probability distribution, it's called softmax and we'll talk more about it in just a second, but before that it might seem a little bit weird to only use this last embedding to make a prediction, when after all in that last step there are thousands of other vectors in the layer just sitting there with their own context-rich meanings. This has to do with the fact that in the training process it turns out to be much more efficient if you use each one of those vectors in the final layer to simultaneously make a prediction for what would come immediately after it. There's a lot more to be said about training later on, but I just want to call that out right now.

This matrix is called the unembedding matrix and we give it the label W_U. Again, like all the weight matrices we see, its entries begin at random, but they are learned during the training process. Keeping score on our total parameter count, this unembedding matrix has one row for each word in the vocabulary, and each row has the same number of elements as the embedding dimension. It's very similar to the embedding matrix, just with the order swapped, so it adds another 617 million parameters to the network, meaning our count so far is a little over a billion, a small but not wholly insignificant fraction of the 175 billion we'll end up with in total.

As the very last mini-lesson for this chapter, I want to talk more about this softmax function, since it makes another appearance for us once we dive into the attention blocks. The idea is that if you want a sequence of numbers to act as a probability distribution, say a distribution over all possible next words, then each value has to be between 0 and 1, and you also need all of them to add up to 1.
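The running parameter tally for the embedding and unembedding matrices is quick to verify. The sketch below uses only the GPT-3 numbers quoted in this chapter (a vocabulary of 50,257 tokens, an embedding dimension of 12,288, and 175 billion parameters in total); the variable names are just for illustration.

```python
VOCAB_SIZE = 50_257
D_MODEL = 12_288
TOTAL_PARAMS = 175_000_000_000  # the rounded GPT-3 total quoted in the chapter

# Embedding matrix W_E: one column per token, D_MODEL entries per column.
embedding_params = D_MODEL * VOCAB_SIZE

# Unembedding matrix W_U: one row per token, D_MODEL entries per row.
unembedding_params = VOCAB_SIZE * D_MODEL

tally = embedding_params + unembedding_params
fraction = tally / TOTAL_PARAMS

print(f"embedding:   {embedding_params:,}")
print(f"unembedding: {unembedding_params:,}")
print(f"tally:       {tally:,} ({fraction:.2%} of the total)")
```

So the two matrices together account for a little over 1.2 billion weights, well under one percent of the full count.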
However, if you're playing the deep learning game where everything you do looks like matrix-vector multiplication, the outputs you get by default don't abide by this at all. The values are often negative, or much bigger than 1, and they almost certainly don't add up to 1. Softmax is the standard way to turn an arbitrary list of numbers into a valid distribution in such a way that the largest values end up closest to 1, and the smaller values end up very close to 0. That's all you really need to know.

But if you're curious, the way it works is to first raise e to the power of each of the numbers, which means you now have a list of positive values, and then you can take the sum of all those positive values and divide each term by that sum, which normalizes it into a list that adds up to 1. You'll notice that if one of the numbers in the input is meaningfully bigger than the rest, then in the output the corresponding term dominates the distribution, so if you were sampling from it you'd almost certainly just be picking the maximizing input. But it's softer than just picking the max in the sense that when other values are similarly large, they also get meaningful weight in the distribution, and everything changes continuously as you continuously vary the inputs.

In some situations, like when ChatGPT is using this distribution to create a next word, there's room for a little bit of extra fun by adding a little extra spice into this function, with a constant T thrown into the denominator of those exponents. We call it the temperature, since it vaguely resembles the role of temperature in certain thermodynamics equations, and the effect is that when T is larger, you give more weight to the lower values, meaning the distribution is a little bit more uniform, and if T is smaller, then the bigger values will dominate more aggressively, where in the extreme, setting T equal to zero means all of the weight goes to the maximum value.
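Here is a minimal sketch of softmax with temperature as described above: exponentiate each input divided by T, then divide by the sum. Two details are implementation choices of this sketch and not claims from the video: subtracting the maximum before exponentiating (a standard trick that avoids overflow and does not change the result), and treating T equal to zero as the limiting case where all weight goes to the first maximum, since dividing by zero is not defined.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn a list of arbitrary numbers into a probability distribution."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Limiting case: all weight on the (first) largest input.
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtracting a constant leaves the output unchanged
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1, -1.0]
for t in (0, 0.5, 1.0, 5.0):
    probs = softmax(logits, temperature=t)
    print(t, [round(p, 3) for p in probs])
```

Running it for a few temperatures shows the pattern described above: the largest input takes more of the weight as T shrinks, and the distribution flattens toward uniform as T grows.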
For example, I'll have GPT-3 generate a story with the seed text, "once upon a time there was a", but I'll use different temperatures in each case. Temperature zero means that it always goes with the most predictable word, and what you get ends up being a trite derivative of Goldilocks. A higher temperature gives it a chance to choose less likely words, but it comes with a risk. In this case, the story starts out more originally, about a young web artist from South Korea, but it quickly degenerates into nonsense. Technically speaking, the API doesn't actually let you pick a temperature bigger than 2. There's no mathematical reason for this, it's just an arbitrary constraint imposed to keep their tool from being seen generating things that are too nonsensical. So if you're curious, the way this animation is actually working is I'm taking the 20 most probable next tokens that GPT-3 generates, which seems to be the maximum they'll give me, and then I tweak the probabilities based on an exponent of 1/5.

As another bit of jargon, in the same way that you might call the components of the output of this function probabilities, people often refer to the inputs as logits (the pronunciation varies from speaker to speaker). So for instance, when you feed in some text, you have all these word embeddings flow through the network, and you do this final multiplication with the unembedding matrix, machine learning people would refer to the components in that raw, unnormalized output as the logits for the next word prediction.

A lot of the goal with this chapter was to lay the foundations for understanding the attention mechanism, Karate Kid wax-on-wax-off style.
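The trick of tweaking already-computed probabilities with an exponent can be sketched like this. The reading used here is an assumption: raising each probability to the power 1/T and renormalizing, with T = 5 standing in for the "exponent of 1/5". That operation gives the same result as applying softmax with temperature T to the logarithms of the probabilities, which is why it mimics a higher temperature. The three probabilities are made up, and `reweight` is a hypothetical helper name.

```python
def reweight(probs, temperature):
    """Raise each probability to 1/temperature, then renormalize to sum to 1."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [x / total for x in powered]


# Made-up probabilities for the top few candidate tokens.
top_probs = [0.5, 0.3, 0.2]

flatter = reweight(top_probs, temperature=5.0)
unchanged = reweight(top_probs, temperature=1.0)

print([round(p, 3) for p in flatter])
print([round(p, 3) for p in unchanged])
```

With an exponent below 1 the gaps between candidates shrink, so less likely tokens get sampled more often while the ranking of the candidates stays the same.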
You see, if you have a strong intuition for word embeddings, for softmax, for how dot products measure similarity, and also the underlying premise that most of the calculations have to look like matrix multiplication with matrices full of tunable parameters, then understanding the attention mechanism, this cornerstone piece in the whole modern boom in AI, should be relatively smooth. For that, come join me in the next chapter. As I'm publishing this, a draft of that next chapter is available for review by Patreon supporters. A final version should be up in public in a week or two, it usually depends on how much I end up changing based on that review. In the meantime, if you want to dive into attention, and if you want to help the channel out a little bit, it's there waiting.