Welcome to Part 2 of building your own large language model!
Part 1 was about tokenization: breaking your input text down into smaller subwords. If you don't remember what subwords are, have a look at the post below:
(If this post still doesn't make the concept of subwords very clear, please leave a comment below! I'll try and make another post to elaborate further.)
This time we will cover encoding: the process of converting subwords into a list of numbers called "token IDs." Machines are good at processing numbers, so encoding enables the machine to "know" which pieces of text are being fed into it. However, it is important to note that encoding is NOT about understanding the context of the input. (That will be covered in a future article.) Instead, it's simply about recognizing the subwords in the input.
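To make that concrete, here's a toy sketch of what encoding does. The vocabulary and ID numbers below are made up for illustration; a real tokenizer's vocabulary has tens of thousands of entries.

```python
# Toy example: a tiny, made-up vocabulary mapping subwords to token IDs.
vocab = {"encod": 2421, "##ing": 158, "is": 73, "fun": 991}

subwords = ["encod", "##ing", "is", "fun"]   # output of tokenization (Part 1)
token_ids = [vocab[sw] for sw in subwords]   # encoding: look up each subword's ID

print(token_ids)  # [2421, 158, 73, 991]
```

Encoding is really just this lookup step, done with a much bigger vocabulary.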
Here is a Jupyter Notebook that revises tokenization and then shows you how to encode those subwords. I've also added an alternative line of code that tokenizes and encodes the input at the same time!
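If you'd like a rough idea of what the notebook covers before opening it, here is a minimal sketch assuming the Hugging Face `transformers` library; the `bert-base-uncased` tokenizer is just an example and may differ from the one used in the notebook.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Encoding converts subwords into token IDs"

# Step 1 (Part 1): tokenization — break the text into subwords
subwords = tokenizer.tokenize(text)

# Step 2 (this post): encoding — convert each subword into its token ID
token_ids = tokenizer.convert_tokens_to_ids(subwords)

# Alternative: tokenize and encode in a single call
token_ids_one_step = tokenizer.encode(text, add_special_tokens=False)

print(subwords)
print(token_ids)
print(token_ids_one_step)  # same IDs as token_ids
```

The single `encode()` call is the "alternative line" mentioned above: it performs both steps at once.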
To summarize:
- Tokenization: The process of breaking down the input text into a list of subwords
- Encoding: The process of converting the subwords into a list of token IDs