Monday, April 29, 2024

LLM Part 2: Encoding

Welcome to Part 2 of building your own large language model!

Part 1 was about breaking down your input text into smaller subwords, a process called tokenization.  If you don't remember what subwords are, have a look at the post below:

LLM Part 1: Tokenization

(If that post still doesn't make the concept of subwords clear, please leave a comment below!  I'll try to make another post that elaborates further.)


This time we will be covering encoding: the process of converting subwords into a list of numbers called "token IDs."  Machines are good at processing numbers, so encoding enables the machine to "know" which pieces of text are being fed into it.  However, it is important to note that encoding is NOT about understanding the context of the input.  (That will be covered in a future article.)  It's simply about recognizing the subwords in the input.
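
To make the idea concrete, here's a toy sketch of what encoding boils down to.  The tiny vocabulary below is made up purely for illustration (real LLM vocabularies map tens of thousands of subwords to IDs):

```python
# Toy illustration of encoding: every subword in the vocabulary is
# assigned a fixed integer ID, and encoding is just a lookup.
# NOTE: this vocabulary is invented for illustration only.
vocab = {"un": 0, "believ": 1, "able": 2, "!": 3}

subwords = ["un", "believ", "able", "!"]    # the output of tokenization
token_ids = [vocab[sw] for sw in subwords]  # encoding: subword -> token ID

print(token_ids)  # [0, 1, 2, 3]
```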


Here is a Jupyter Notebook that revisits tokenization and then shows you how to encode those subwords.  I've also added an alternative line of code that tokenizes and encodes the input at the same time!

LLM Part 2: Encoding
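
If you'd like a preview before opening the notebook, here's a minimal sketch of both approaches using the Hugging Face transformers library.  (The notebook may use a different tokenizer; the "gpt2" choice and the exact IDs in the comments are assumptions for illustration.)

```python
from transformers import AutoTokenizer

# Load a pretrained tokenizer; "gpt2" is just an example choice.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Hello world!"

# Two-step version: tokenize into subwords, then encode them into IDs.
subwords = tokenizer.tokenize(text)                    # ['Hello', 'Ġworld', '!']
token_ids = tokenizer.convert_tokens_to_ids(subwords)  # [15496, 995, 0]

# One-liner that tokenizes and encodes in a single call.
token_ids_direct = tokenizer.encode(text)              # same IDs as above

print(subwords)
print(token_ids)
print(token_ids_direct)
```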


To summarize:

  • Tokenization:  The process of breaking down the input text into a list of subwords
  • Encoding:  The process of converting the subwords into a list of token IDs
[Figure: step-by-step illustration of tokenization and encoding, using the example text from the code.]


Wrapping Up

How did you find this post?  It may be short, but I hope it makes these concepts easier to digest.  I struggled to know where to begin when I first started studying LLM concepts, so I wanted to explain them as simply as possible.😊  Next time we'll be covering another key process called "decoding."  Then we can finally make LLMs perform simple tasks with just a few lines of code!  Stay tuned for future posts.😎

Update (2024/05/19)

The next part of the LLM series is out now!



LLM Part 3: Decoding
