Learn how to truncate context of documents for LLMs with small context windows.
By Jack Gordley
Now, we can send the first section of Tale of Two Cities to GPT-4 like so:
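Something like the sketch below, which assumes the book is saved locally (the filename and prompt are illustrative) and that the openai Python SDK can read an OPENAI_API_KEY from the environment:

```python
from openai import OpenAI

MAX_TOKENS = 8192          # GPT-4's context window
AVG_CHARS_PER_TOKEN = 4    # OpenAI's rule of thumb for English text

# Load the book and keep roughly 8,192 tokens' worth of characters.
with open("a_tale_of_two_cities.txt") as f:
    book = f.read()
first_section = book[: MAX_TOKENS * AVG_CHARS_PER_TOKEN]

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"Summarize this text:\n\n{first_section}"}],
)
print(response.choices[0].message.content)
```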
Here's the output:
Bottom line: It’s clear that we will need a more reliable approach to maximize the number of tokens we can send.
So what happened? Token lengths vary across text: the frequency of spaces and punctuation, and longer words that split into multiple tokens, all shift the characters-per-token ratio, so splitting based on an average token length of four characters is unreliable.
As you can see, token counts vary widely across the sections of A Tale of Two Cities (counted per 8,192 x 4-character window):
Token counts across different sections of Tale of Two Cities
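If you want to measure this yourself, here's a rough sketch (not the original analysis script) that counts the tokens in each 8,192 x 4-character window using OpenAI's tiktoken tokenizer, which we cover in more detail below:

```python
import tiktoken

WINDOW_CHARS = 8192 * 4  # one "average-length" context window in characters
enc = tiktoken.encoding_for_model("gpt-4")

with open("a_tale_of_two_cities.txt") as f:  # illustrative filename
    book = f.read()

# Count how many real tokens land in each fixed-size character window.
for i in range(0, len(book), WINDOW_CHARS):
    window = book[i : i + WINDOW_CHARS]
    print(f"window {i // WINDOW_CHARS}: {len(enc.encode(window))} tokens")
```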
Pros of cutting context based on average length
Cons of cutting context based on average length
Bottom line: This approach is great for OpenAI models, as it allows precise token counting for them, but it does not support open-source LLMs.
Not all language models are created equal, especially when it comes to their maximum context length. Some of the most powerful models, like GPT-4 Turbo, offer staggering context windows of up to 128K tokens. But many state-of-the-art open-source models offer much smaller ones. For example, Llama 3 and Mixtral 8x7B offer 8K and 32K token context windows, respectively.
Smaller context windows offer more than enough room for quick back-and-forth chats that are perfect for a run-of-the-mill chatbot. But what happens when we want to ask questions about larger pieces of text?
In this blog, we’ll explore some popular methods for truncating files to fit specific context windows and experiment with their effectiveness.
At Telnyx, context length is important. It’s extremely relevant to our low-latency summarization endpoint, which allows customers to summarize files in the Mission Control Portal. Sometimes customers upload large files of text that are too big for the context window of the model. In these cases, we first have to truncate the uploaded files to ensure the LLM can completely summarize them.
There are a few ways you can truncate context to make sure an LLM can handle it.
Let’s start with an intuitive approach that simply cuts context based on the average length of a token, which is four English characters according to OpenAI. To test this approach, we’ll load in a text file of the Charles Dickens classic, “A Tale of Two Cities,” and perform a simple truncation based on the context window offered by GPT-4, which is 8,192 tokens.
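As a sketch, the truncation itself is just string slicing (the filename below is illustrative):

```python
MAX_TOKENS = 8192          # GPT-4's context window
AVG_CHARS_PER_TOKEN = 4    # OpenAI's average token length for English text

with open("a_tale_of_two_cities.txt") as f:
    book = f.read()

# 8,192 tokens * 4 characters per token = 32,768 characters of "context"
truncated = book[: MAX_TOKENS * AVG_CHARS_PER_TOKEN]
print(len(book), len(truncated))  # full length vs. 32,768 characters
```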
See the full list of available models in our LLM library or by hitting the /v2/ai/models endpoint.
OpenAI released an open-source tokenizer called tiktoken and an accompanying OpenAI cookbook tutorial on counting tokens with it. This tokenizer claims to operate at speeds three to six times faster than comparable HuggingFace open-source tokenizers. It also lets you encode and decode tokens for a specific model, such as GPT-4. Let’s see it in action.
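Here's a minimal look at tiktoken (a sketch; the sample text is just an illustration):

```python
import tiktoken

# Get the encoding used by a specific model, then encode and decode with it.
enc = tiktoken.encoding_for_model("gpt-4")

sample = "It was the best of times, it was the worst of times."
tokens = enc.encode(sample)

print(len(tokens))         # how many tokens GPT-4 would see for this string
print(tokens[:8])          # the first few token IDs
print(enc.decode(tokens))  # decoding round-trips back to the original text
```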
Tiktoken makes it relatively easy to tokenize our inputs, truncate them to the proper context length, and then decode them back so we can send them to OpenAI at the proper length.
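A sketch of that flow, reusing the same illustrative filename and the openai SDK:

```python
import tiktoken
from openai import OpenAI

MAX_TOKENS = 8192  # GPT-4's context window
enc = tiktoken.encoding_for_model("gpt-4")

with open("a_tale_of_two_cities.txt") as f:
    book = f.read()

# Tokenize, cut to exactly 8,192 tokens, then decode back to text.
tokens = enc.encode(book)
truncated_text = enc.decode(tokens[:MAX_TOKENS])

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": truncated_text}],
)
print(response.choices[0].message.content)
```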
As you can see, we get our maximum context usage. But that doesn’t leave any room for the model to respond. Depending on the length of your desired response, a general rule of thumb is to use up only about 75% of the max context window so you can leave 25% for the model to respond.
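For example, a small helper along these lines applies that 75/25 split (the split is a heuristic, not a hard rule, and the function is a sketch rather than library code):

```python
import tiktoken

def truncate_for_prompt(text: str, context_window: int = 8192, input_fraction: float = 0.75) -> str:
    """Keep ~75% of the context window for input, leaving the rest for the reply."""
    enc = tiktoken.encoding_for_model("gpt-4")
    budget = int(context_window * input_fraction)  # 6,144 tokens for an 8,192-token window
    return enc.decode(enc.encode(text)[:budget])
```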
Pros of OpenAI tiktoken
Cons of OpenAI tiktoken
HuggingFace's tokenizers library includes both fast and slow tokenizers for open-source LLMs. The fast tokenizer, written in Rust, significantly outperforms the slow version in speed. It also supports advanced options such as a truncation strategy and a max_length passed as input to the tokenizer call itself.
For models on HuggingFace, you can often find the model's reference context window under the config.json file property max_position_embeddings. For example, in the popular NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO model, the max_position_embeddings value is 32,768 tokens.
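You can also read that value programmatically with transformers' AutoConfig, which fetches the model's config.json from the Hub:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO")
print(config.max_position_embeddings)  # 32768
```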
Let’s try HuggingFace's AutoTokenizer in a similar way to our previous example, this time with Llama 3, an open-source model:
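A sketch of that, assuming you have access to the gated meta-llama/Meta-Llama-3-8B-Instruct repo (the exact checkpoint and filename are illustrative):

```python
from transformers import AutoTokenizer

MAX_TOKENS = 8192  # Llama 3's context window

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

with open("a_tale_of_two_cities.txt") as f:
    book = f.read()

# The fast tokenizer handles truncation for us via truncation/max_length.
encoded = tokenizer(book, truncation=True, max_length=MAX_TOKENS)
truncated_text = tokenizer.decode(encoded["input_ids"], skip_special_tokens=True)

print(len(encoded["input_ids"]))  # <= 8,192 tokens
```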
Now, using the OpenAI SDK with Telnyx Inference, we can try it out and see if our token counter did the trick:
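A sketch of that call; the base_url and model name below are assumptions based on Telnyx's /v2/ai endpoints, so check the Telnyx docs and LLM library for the exact values:

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["TELNYX_API_KEY"],
    base_url="https://api.telnyx.com/v2/ai",  # assumed OpenAI-compatible endpoint
)

truncated_text = "..."  # the text truncated with the Llama 3 tokenizer above

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed model identifier
    messages=[{"role": "user", "content": truncated_text}],
)
print(response.choices[0].message.content)
```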
Indeed, our truncation using the HuggingFace fast tokenizers library worked. We could send a truncated version of the text that’s within the model’s context window.
You can load any of the HuggingFace tokenizers for the open-source models offered by Telnyx and use them to truncate long bodies of text you wish to use for inference.
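For example, a generic helper along these lines (a sketch, not Telnyx library code) works for any model whose tokenizer and config.json are available on the Hub:

```python
from transformers import AutoConfig, AutoTokenizer

def truncate_to_context(text: str, model_id: str, reserve_fraction: float = 0.25) -> str:
    """Truncate text to fit model_id's context window, leaving room for the response."""
    config = AutoConfig.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    budget = int(config.max_position_embeddings * (1 - reserve_fraction))
    encoded = tokenizer(text, truncation=True, max_length=budget)
    return tokenizer.decode(encoded["input_ids"], skip_special_tokens=True)
```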
Pros of HuggingFace fast tokenizers
Cons of HuggingFace fast tokenizers
We’ve walked through three simple approaches to truncate context for use with different types of LLMs. The best approach largely depends on your use case, whether that’s summarizing large files as we do for our /v2/ai/summarize endpoint using open-source models, or working with large amounts of text with OpenAI models.
Here are some key takeaways to keep in mind while working with LLMs for these use cases:
- Cutting context based on an average of four characters per token is quick but unreliable, because real token counts vary widely across text.
- For OpenAI models, use tiktoken to count and truncate tokens precisely, and leave roughly a quarter of the context window free for the model's response.
- For open-source models, use the HuggingFace fast tokenizers, and look up each model's context window via the max_position_embeddings property in their config.json.
As language models continue to evolve and offer larger context windows, the need for truncation may diminish in the future. However, until then, these truncation methods remain essential tools for working with long texts or large datasets within the constraints of current language model architectures.
Sign up for a free Telnyx account and explore context truncation.