Friday, March 1, 2024

LLM Part 1: Tokenization

Hi everyone!


As promised, I will present a tutorial on how to build your very own large language model!  This part focuses on tokenization.

What is tokenization?

It's the process of converting human text into a sequence of numerical IDs that LLMs can understand.  A similar idea among humans is word-by-word translation from one language to another.

For example, a word-by-word translation of "I love you." into Japanese might become "私(I)愛(love, but in noun form)君(you)."

(Important: this kind of translation doesn't take context or meaning into account at all.)

Tokenization is a necessary first step, because how could a model do what you want if it doesn't even know what words you're saying?

How does tokenization work?

  1. Prepare the text
  2. Break the text down into smaller chunks.  These can be any of the following (see the sketch after this list):
    1. Words
    2. Characters (individual letters)
    3. Sub-words (common letter combinations within words) - the most common approach
  3. Convert the chunks into numbers called token IDs
That's it!  Tokenization is all about breaking your text down into smaller chunks and creating a series of token IDs from them.
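To make the three splitting styles concrete, here is a minimal sketch in plain Python.  It's my own illustration rather than a real tokenizer: word and character splitting are shown directly, and the sub-word example uses a tiny hand-made vocabulary (real tokenizers such as BPE learn theirs from data).

```python
text = "Hi! I'm Lukas Fleur.  Nice to meet you."

# (a) Word-level chunks: split on whitespace
print(text.split())   # ['Hi!', "I'm", 'Lukas', 'Fleur.', 'Nice', 'to', 'meet', 'you.']

# (b) Character-level chunks: every single character becomes a token
print(list("Hi!"))    # ['H', 'i', '!']

# (c) Sub-word chunks: real tokenizers learn a vocabulary from data (e.g. BPE);
#     here we fake one by hand and greedily match the longest known piece
vocab = ["Luk", "as", "me", "et", "Hi", "Nice"]   # hypothetical vocabulary

def greedy_subwords(word, vocab):
    pieces, i = [], 0
    while i < len(word):
        for piece in sorted(vocab, key=len, reverse=True):
            if word.startswith(piece, i):
                pieces.append(piece)
                i += len(piece)
                break
        else:                       # no vocabulary piece fits here
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

print(greedy_subwords("Lukas", vocab))  # ['Luk', 'as']
print(greedy_subwords("meet", vocab))   # ['me', 'et']
```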

To get a better understanding, I prepared a mini-project tokenizing and de-tokenizing the text "Hi! I'm Lukas Fleur.  Nice to meet you."
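Here is a rough sketch of what that mini-project does, in plain Python.  The actual code lives in the linked project; this is just an approximation that uses a simple regular expression to split the sentence into word and punctuation tokens, then joins them back together.

```python
import re

text = "Hi! I'm Lukas Fleur.  Nice to meet you."

# Tokenize: grab runs of word characters/apostrophes, or single punctuation marks
tokens = re.findall(r"[\w']+|[.,!?;]", text)
print(tokens)
# ['Hi', '!', "I'm", 'Lukas', 'Fleur', '.', 'Nice', 'to', 'meet', 'you', '.']

# De-tokenize: join with spaces, then remove the space that ends up before punctuation
detok = " ".join(tokens)
detok = re.sub(r"\s+([.,!?;])", r"\1", detok)
print(detok)
# "Hi! I'm Lukas Fleur. Nice to meet you."  (whitespace is normalized)
```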


Ending

How did you find this first part?  When I was reading about tokenization for the first time, I felt like I had to read around in circles to get a good understanding of it.  Of course, if you want to have more in-depth knowledge there are more resources out there, but this is a good first step in understanding tokenization, and thus LLMs.

For now, we'll be focusing on building our own T5 model, but we can revisit the topic of tokenization in the future to see what we can do with this technique!

How did you find this article?  If you enjoyed it, or hated it, comment down below!

Update

I recently re-read this page and realized that I was actually confused about tokenization and encoding myself!  It turns out tokenization is simply about breaking text down into smaller units, while the assignment of a token ID to each token is actually encoding. (I was confused because of the name "token ID" haha... 😅)  I've struck out the bits that were actually about encoding, and updated the code in the link to only showcase tokenization.
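In other words (my own toy example, with a made-up vocabulary):

```python
tokens = ["Hi", "!", "I", "'m", "Luk", "as"]            # <- tokenization ends here

# Encoding: look each token up in a (hypothetical) vocabulary to get token IDs
vocab = {"Hi": 0, "!": 1, "I": 2, "'m": 3, "Luk": 4, "as": 5}
token_ids = [vocab[t] for t in tokens]                   # <- this part is encoding
print(token_ids)  # [0, 1, 2, 3, 4, 5]
```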

I'm so sorry if I've caused any confusion!  I am currently working on the next article which will be titled "LLM Part 2: Encoding" where I will expand on what encoding actually is.  Stay tuned! 👀

Resources