How to truncate context with transformers and tiktoken
Learn how to truncate context of documents for LLMs with small context windows.
By Jack Gordley
Not all language models are created equal, especially when it comes to their maximum context length. Some of the most powerful models, like GPT-4 Turbo, offer staggering context windows of up to 128K tokens. But most state-of-the-art open-source models offer much smaller ones. For example, Llama 3 and Mixtral 8x7B offer 8K and 32K token context windows, respectively.
Smaller context windows offer more than enough room for quick back-and-forth chats that are perfect for a run-of-the-mill chatbot. But what happens when we want to ask questions about larger pieces of text?
In this blog, we’ll explore some popular methods for truncating files to fit specific context windows and experiment with their effectiveness.
Truncating context for open-source LLMs
At Telnyx, context length is important. It’s extremely relevant to our low-latency summarization endpoint, which allows customers to summarize files in the Mission Control Portal. Sometimes customers upload large files of text that are too big for the context window of the model. In these cases, we first have to truncate the uploaded files to ensure the LLM can completely summarize them.
3 common approaches to truncating context
There are a few ways you can truncate context to make sure an LLM can handle it.
1. Cutting context based on average token length
Let’s start with an intuitive approach that simply cuts context based on the average length of a token, which is four English characters according to OpenAI. To test this approach, we’ll load the Charles Dickens classic “A Tale of Two Cities” from a text file and perform a simple truncation based on the context window offered by GPT-4, which is 8,192 tokens.
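A minimal sketch of that character-based cut might look like this (the filename and variable names are illustrative, not the exact script we use):

```python
# A rough character-based cut: 8,192 tokens x ~4 characters per token.
MAX_TOKENS = 8192          # GPT-4's context window
AVG_CHARS_PER_TOKEN = 4    # OpenAI's rule of thumb for English text

with open("a_tale_of_two_cities.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Keep roughly the first 8,192 tokens' worth of characters.
truncated_text = text[: MAX_TOKENS * AVG_CHARS_PER_TOKEN]
print(len(truncated_text))  # 32768 characters
```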
Now, we can send the first section of Tale of Two Cities to GPT-4 like so:
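Here’s a sketch of that request with the OpenAI Python SDK (v1.x); the prompt wording and error handling are illustrative:

```python
from openai import OpenAI, BadRequestError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

try:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "user", "content": f"Summarize the following text:\n\n{truncated_text}"}
        ],
    )
    print(response.choices[0].message.content)
except BadRequestError as e:
    # If the 4-characters-per-token estimate undercounts, the API rejects the request.
    print(e)
```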
Here's the output: the request is rejected because the text still exceeds the model's context window.
So what happened? Because of the variability of tokens across text, such as the frequency of spaces, punctuation, or longer words split into multiple tokens, this approach of splitting based on the average token length of four characters is unreliable.
As you can see, token counts vary widely across the book when measured per 8,192 x 4-character window:
Token counts across different sections of Tale of Two Cities
Pros of cutting context based on average length
- It can be implemented very quickly using a simple string truncation
- No external libraries are needed
- Great for non-exact cases, such as utilizing roughly 75% of the context window
Cons of cutting context based on average length
- This leaves room for error, potentially resulting in bad requests
- There is less certainty in your results
Bottom line: It’s clear that we will need a more definitive approach to maximize the number of tokens we can send.
2. Truncating context window with OpenAI’s tiktoken counter
OpenAI released an open-source tokenizer called tiktoken and an accompanying OpenAI cookbook tutorial on counting tokens with it. This tokenizer claims to operate at speeds three to six times faster than comparable HuggingFace open-source tokenizers. It also lets you encode and decode tokens for a specific model, such as GPT-4. Let’s see it in action.
Tiktoken makes it relatively easy to tokenize our inputs, truncate them to the proper context length, and then decode them back so we can send them to OpenAI at the proper length.
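A sketch of that flow might look like the following (the filename is illustrative):

```python
import tiktoken

MAX_TOKENS = 8192  # GPT-4's context window

# Get the encoding tiktoken uses for GPT-4.
encoding = tiktoken.encoding_for_model("gpt-4")

with open("a_tale_of_two_cities.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Encode to tokens, keep exactly the first 8,192, then decode back to text.
tokens = encoding.encode(text)
print(len(tokens))  # total token count for the full book

truncated_text = encoding.decode(tokens[:MAX_TOKENS])
```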
As you can see, we get our maximum context usage. But that doesn’t leave any room for the model to respond. Depending on the length of your desired response, a general rule of thumb is to use up only about 75% of the max context window so you can leave 25% for the model to respond.
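For example, continuing the sketch above, you could budget roughly 75% of the window for the input:

```python
# Assumption: reserve about 25% of the context window for the model's reply.
PROMPT_BUDGET = int(MAX_TOKENS * 0.75)  # 6,144 of GPT-4's 8,192 tokens
truncated_text = encoding.decode(tokens[:PROMPT_BUDGET])
```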
Pros of OpenAI tiktoken
- Gives speedy, precise token counts for OpenAI models
- Simple encoding and decoding functions
Cons of OpenAI tiktoken
- No support for open-source models
- Limited support for non-English languages
Bottom line: This approach is great for OpenAI models as it allows precise token counting for them, but it does not support open-source LLMs.
3. HuggingFace FastTokenizers
HuggingFace's tokenizer libraries include both fast and slow tokenizers for open-source LLMs. The fast tokenizers, written in Rust, significantly outperform the slow versions in speed. They also support advanced options such as a truncation strategy and a max_length argument passed directly to the tokenizer.
For models on HuggingFace, you can often find the model's reference context window under the config.json file property max_position_embeddings. For example, in the popular NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO model, the max_position_embeddings value is 32,768 tokens.
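If you'd rather read that value programmatically, a quick sketch with transformers' AutoConfig (using the model ID mentioned above) looks like this:

```python
from transformers import AutoConfig

# Download just the model's config and read its context window.
config = AutoConfig.from_pretrained("NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO")
print(config.max_position_embeddings)  # 32768
```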
Let’s try the AutoTokenizer in a similar way to our previous example, this time with Llama 3, an open-source model:
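A sketch of that truncation with a fast tokenizer might look like this (the Llama 3 model ID is an assumption, and the repo is gated on the HuggingFace Hub, so you may need an access token):

```python
from transformers import AutoTokenizer

MAX_TOKENS = 8192  # Llama 3's context window

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

with open("a_tale_of_two_cities.txt", "r", encoding="utf-8") as f:
    text = f.read()

# The fast tokenizer can truncate while encoding via truncation/max_length.
token_ids = tokenizer.encode(text, truncation=True, max_length=MAX_TOKENS)
truncated_text = tokenizer.decode(token_ids, skip_special_tokens=True)
print(len(token_ids))  # 8192
```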
Now, using the OpenAI SDK with Telnyx Inference, we can try it out and see if our token counter did the trick:
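Here’s an illustrative call; the base URL and model ID below are assumptions, so check the Telnyx docs and LLM library for the exact values:

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["TELNYX_API_KEY"],     # assumption: Telnyx API key in the environment
    base_url="https://api.telnyx.com/v2/ai",  # assumption: Telnyx's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumption: model ID as listed in the LLM library
    messages=[
        {"role": "user", "content": f"Summarize the following text:\n\n{truncated_text}"}
    ],
)
print(response.choices[0].message.content)
```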
Indeed, our truncation using the HuggingFace fast tokenizers library worked. We could send a truncated version of the text that’s within the model’s context window.
You can load any of the HuggingFace tokenizers for the open-source models offered by Telnyx and use them to truncate long bodies of text you wish to use for inference.
See the full list of available models in our LLM library or by hitting the /v2/ai/models endpoint.
Pros of HuggingFace fast tokenizers
- Enhanced flexibility, as it works for any model in the HuggingFace catalog
- Written in Rust for speed and efficiency
- Simple built-in encoding and decoding functions with additional options for truncation strategies
Cons of HuggingFace fast tokenizers
- Dependency on the HuggingFace ecosystem and the models it supports
Choosing the best truncation method
We’ve walked through three simple approaches to truncating context for use with different types of LLMs. The best approach largely depends on your use case, whether that’s summarizing large files as we do for our /v2/ai/summarize endpoint using open-source models or working with large amounts of text with OpenAI models.
Here are some key takeaways to keep in mind while working with LLMs for these use cases:
- Consider the trade-offs between speed, accuracy, and flexibility when choosing a truncation method. While Tiktoken and simple character-based truncation offer speed advantages, they may sacrifice accuracy or flexibility compared to the more feature-rich HuggingFace tokenizers.
- Truncation may inherently lead to information loss, so for scenarios where retaining the full context is crucial, alternative approaches such as Retrieval Augmented Generation (RAG) might be more suitable for your use case.
- Keep in mind that the max context length also includes the LLM response, so it’s generally good practice to keep the context sent to around 75% of the max context tokens to leave room for the model to give a complete answer.
- You can easily find HuggingFace open-source LLM context windows using the max_position_embeddings property in their config.json.
As language models continue to evolve and offer larger context windows, the need for truncation may diminish in the future. However, until then, these truncation methods remain essential tools for working with long texts or large datasets within the constraints of current language model architectures.
Sign up for a free Telnyx account and explore context truncation.