Hi everyone!
As promised, I will present a tutorial on how to build your very own large language model! This part focuses on tokenization.
What is tokenization?
It's the process of converting human text into a sequence of numerical IDs that LLMs can understand. A rough human analogy is translating a sentence from one language to another one word at a time.
For example, a word-by-word translation of "I love you." into Japanese can become "私(I)愛(love, although this word is a noun)君(you)."
(Important: this kind of translation ignores context and meaning entirely.)
Tokenization is a necessary first step: a model can't do what you want if it doesn't even know what words you're saying!
How does tokenization work?
- Prepare text
- Break down the text into smaller chunks. These can be any of the following:
  - Words
  - Characters (individual letters)
  - Sub-words (common letter combinations within words), which is the most common approach
- Convert the chunks into numbers called token IDs
That's it! Tokenization is all about creating a series of token IDs by breaking down your text into smaller chunks.
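To make the three chunk types concrete, here's a minimal Python sketch. The regular expression, the `subword_split` helper, and the tiny `toy_vocab` are all made up for illustration; real sub-word tokenizers learn their vocabulary from a large amount of text rather than using a hand-written one.

```python
import re

text = "Hi! I'm Lukas Fleur. Nice to meet you."

# Word-level: runs of letters/apostrophes are one token,
# and every punctuation mark becomes its own token.
word_tokens = re.findall(r"[A-Za-z']+|[^\sA-Za-z']", text)

# Character-level: every single character (spaces included) is a token.
char_tokens = list(text)


def subword_split(word, vocab):
    """Greedy longest-match split of one word using a known vocabulary.

    Hypothetical helper for illustration only: it is not how any
    particular library does it.
    """
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest possible piece first, then shrink.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Nothing in the vocabulary matched: fall back to one character.
            pieces.append(word[start])
            start += 1
    return pieces


# Assumed toy vocabulary of "common letter combinations".
toy_vocab = {"meet", "ing", "nice", "ly"}

print(word_tokens)
print(char_tokens)
print(subword_split("meeting", toy_vocab))
```

With this sketch, the word-level split gives 11 tokens for the sentence, and "meeting" is split into "meet" and "ing".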
To get a better understanding, I prepared a mini-project that tokenizes and de-tokenizes the text "Hi! I'm Lukas Fleur. Nice to meet you."
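If you just want a feel for the round trip, here's a rough, independent sketch (it is not the mini-project's code). It assumes a very simple scheme: whitespace is kept as its own token so nothing is lost, and each distinct token gets an ID in order of first appearance. The names `tokenize`, `token_to_id`, and `id_to_token` are my own.

```python
import re

text = "Hi! I'm Lukas Fleur. Nice to meet you."


def tokenize(s):
    # Split into words, runs of whitespace, and single punctuation marks.
    # Keeping the whitespace makes de-tokenization lossless.
    return re.findall(r"\w+|\s+|[^\w\s]", s)


tokens = tokenize(text)

# Assign an ID to each distinct token, in order of first appearance.
token_to_id = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}
id_to_token = {i: tok for tok, i in token_to_id.items()}

# Text -> tokens -> token IDs
token_ids = [token_to_id[tok] for tok in tokens]

# Token IDs -> tokens -> text (de-tokenization)
restored = "".join(id_to_token[i] for i in token_ids)

print(tokens)
print(token_ids)
print(restored)
```

Because the spaces are kept as tokens, joining the tokens back together gives exactly the original sentence.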
Ending
How did you find this first part? When I was reading about tokenization for the first time, I felt like I had to go around in circles to get a good understanding of it. Of course, if you want more in-depth knowledge, there are more resources out there, but this is a good first step in understanding tokenization, and thus LLMs.
For now, we'll be focusing on building our own T5 model, but we can revisit the topic of tokenization in the future to see what we can do using this technique!
How did you find this article? If you enjoyed it, or hated it, comment down below!
Update
I recently re-read this page and realized that I was actually confused about tokenization and encoding myself! Apparently, tokenization is simply about breaking down text into smaller units, while assigning a token ID to each token is actually encoding. (I was confused because of the name "token ID" haha... 😅) I've struck out the bits that were actually about encoding, and updated the code in the link to only showcase tokenization.
I'm so sorry if I've caused any confusion! I am currently working on the next article which will be titled "LLM Part 2: Encoding" where I will expand on what encoding actually is. Stay tuned! 👀
Update (2024/05/12)
Here's the next part of the LLM series: