How to truncate context with transformers and tiktoken
Learn how to truncate context of documents for LLMs with small context windows.
By Jack Gordley
Not all language models are created equal, especially when it comes to their maximum context length. Some of the most powerful models, like GPT-4 Turbo, offer staggering context windows of up to 128K tokens. But most state-of-the-art open-source models offer much smaller ones. For example, Llama 3 and Mixtral 8x7B offer 8K and 32K token context windows, respectively.
Smaller context windows offer more than enough room for quick back-and-forth chats that are perfect for a run-of-the-mill chatbot. But what happens when we want to ask questions about larger pieces of text?
In this blog, we’ll explore some popular methods for truncating files to fit specific context windows and experiment with their effectiveness.
Truncating context for open-source LLMs
At Telnyx, context length is important. It’s extremely relevant to our low-latency summarization endpoint, which allows customers to summarize files in the Mission Control Portal. Sometimes customers upload large files of text that are too big for the context window of the model. In these cases, we first have to truncate the uploaded files to ensure the LLM can completely summarize them.
3 common approaches to truncating context
There are a few ways you can truncate context to make sure an LLM can handle it.
1. Cutting context based on average token length
Let’s start with an intuitive approach that simply cuts context based on the average length of a token, which is four English characters according to OpenAI. To test this approach, we’ll load the Charles Dickens classic “A Tale of Two Cities” from a text file and perform a simple truncation based on the context window offered by GPT-4, which is 8,192 tokens.
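A minimal sketch of that character-based cut might look like this (the filename and variable names are illustrative, not the exact script we use):

```python
# A rough character-based cut: 8,192 tokens x ~4 characters per token.
MAX_TOKENS = 8192          # GPT-4's context window
AVG_CHARS_PER_TOKEN = 4    # OpenAI's rule of thumb for English text

with open("a_tale_of_two_cities.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Keep roughly the first 8,192 tokens' worth of characters.
truncated_text = text[: MAX_TOKENS * AVG_CHARS_PER_TOKEN]
print(len(truncated_text))  # 32768 characters
```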
Now, we can send the first section of Tale of Two Cities to GPT-4 like so:
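Here’s a sketch of that request with the OpenAI Python SDK (v1.x); the prompt wording and error handling are illustrative:

```python
from openai import OpenAI, BadRequestError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

try:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "user", "content": f"Summarize the following text:\n\n{truncated_text}"}
        ],
    )
    print(response.choices[0].message.content)
except BadRequestError as e:
    # If the 4-characters-per-token estimate undercounts, the API rejects the request.
    print(e)
```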
Here's the output: the request is rejected because the text still exceeds the model's context window.
So what happened? Because of the variability of tokens across text, such as the frequency of spaces, punctuation, or longer words split into multiple tokens, this approach of splitting based on the average token length of four characters is unreliable.
As you can see, token counts vary widely across the book when measured per 8,192 x 4-character window:
Token counts across different sections of Tale of Two Cities
Pros of cutting context based on average length
- It can be implemented very quickly using a simple string truncation
- No external libraries are needed
- Great for non-exact cases, such as utilizing roughly 75% of the context window
Cons of cutting context based on average length
- This leaves room for error, potentially resulting in bad requests
- There is less certainty in your results
Bottom line: It’s clear that we will need a more definitive approach to maximize the number of tokens we can send.
2. Truncating context window with OpenAI’s tiktoken counter
OpenAI released an open-source tokenizer called tiktoken and an accompanying OpenAI cookbook tutorial on counting tokens with it. This tokenizer claims to operate at speeds three to six times faster than comparable HuggingFace open-source tokenizers. It also lets you encode and decode tokens for a specific model, such as GPT-4. Let’s see it in action.
Tiktoken makes it relatively easy to tokenize our inputs, truncate them to the proper context length, and then decode them back so we can send them to OpenAI at the proper length.
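A sketch of that flow might look like the following (the filename is illustrative):

```python
import tiktoken

MAX_TOKENS = 8192  # GPT-4's context window

# Get the encoding tiktoken uses for GPT-4.
encoding = tiktoken.encoding_for_model("gpt-4")

with open("a_tale_of_two_cities.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Encode to tokens, keep exactly the first 8,192, then decode back to text.
tokens = encoding.encode(text)
print(len(tokens))  # total token count for the full book

truncated_text = encoding.decode(tokens[:MAX_TOKENS])
```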
As you can see, we get our maximum context usage. But that doesn’t leave any room for the model to respond. Depending on the length of your desired response, a general rule of thumb is to use up only about 75% of the max context window so you can leave 25% for the model to respond.
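For example, continuing the sketch above, you could budget roughly 75% of the window for the input:

```python
# Assumption: reserve about 25% of the context window for the model's reply.
PROMPT_BUDGET = int(MAX_TOKENS * 0.75)  # 6,144 of GPT-4's 8,192 tokens
truncated_text = encoding.decode(tokens[:PROMPT_BUDGET])
```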
Pros of OpenAI tiktoken
- Gives speedy, precise token counts for OpenAI models
- Simple encoding and decoding functions
Cons of OpenAI tiktoken
- No support for open-source models
- Limited support for non-English languages
Bottom line: This approach is great for OpenAI models as it allows precise token counting for them, but it does not support open-source LLMs.
3. HuggingFace FastTokenizers
HuggingFace's tokenizer libraries include both fast and slow tokenizers for open-source LLMs. The fast tokenizers, written in Rust, significantly outperform the slow versions in speed. They also support advanced options such as a truncation strategy and a max_length argument passed directly to the tokenizer.
For models on HuggingFace, you can often find the model's reference context window under the config.json file property max_position_embeddings. For example, in the popular NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO model, the max_position_embeddings value is 32,768 tokens.
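If you'd rather read that value programmatically, a quick sketch with transformers' AutoConfig (using the model ID mentioned above) looks like this:

```python
from transformers import AutoConfig

# Download just the model's config and read its context window.
config = AutoConfig.from_pretrained("NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO")
print(config.max_position_embeddings)  # 32768
```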
Let’s try the AutoTokenizer in a similar way to our previous example, this time with Llama 3, an open-source model:
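A sketch of that truncation with a fast tokenizer might look like this (the Llama 3 model ID is an assumption, and the repo is gated on the HuggingFace Hub, so you may need an access token):

```python
from transformers import AutoTokenizer

MAX_TOKENS = 8192  # Llama 3's context window

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

with open("a_tale_of_two_cities.txt", "r", encoding="utf-8") as f:
    text = f.read()

# The fast tokenizer can truncate while encoding via truncation/max_length.
token_ids = tokenizer.encode(text, truncation=True, max_length=MAX_TOKENS)
truncated_text = tokenizer.decode(token_ids, skip_special_tokens=True)
print(len(token_ids))  # 8192
```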
Now, using the OpenAI SDK with Telnyx Inference, we can try it out and see if our token counter did the trick:
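Here’s an illustrative call; the base URL and model ID below are assumptions, so check the Telnyx docs and LLM library for the exact values:

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["TELNYX_API_KEY"],     # assumption: Telnyx API key in the environment
    base_url="https://api.telnyx.com/v2/ai",  # assumption: Telnyx's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumption: model ID as listed in the LLM library
    messages=[
        {"role": "user", "content": f"Summarize the following text:\n\n{truncated_text}"}
    ],
)
print(response.choices[0].message.content)
```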
Indeed, our truncation using the HuggingFace fast tokenizers library worked. We could send a truncated version of the text that’s within the model’s context window.
You can load any of the HuggingFace tokenizers for the open-source models offered by Telnyx and use them to truncate long bodies of text you wish to use for inference.
See the full list of available models in our LLM library or by hitting the /v2/ai/models endpoint.
Pros of HuggingFace fast tokenizers
- Enhanced flexibility, as it works for any model in the HuggingFace catalog
- Written in Rust for speed and efficiency
- Simple built-in encoding and decoding functions with additional options for truncation strategies
Cons of HuggingFace fast tokenizers
- Dependency on the HuggingFace ecosystem and the models it supports
Choosing the best truncation method
We’ve walked through three simple approaches to truncating context for use with different types of LLMs. The best approach largely depends on your use case, whether that’s summarizing large files as we do for our /v2/ai/summarize endpoint using open-source models or working with large amounts of text with OpenAI models.
Here are some key takeaways to keep in mind while working with LLMs for these use cases:
- Consider the trade-offs between speed, accuracy, and flexibility when choosing a truncation method. While Tiktoken and simple character-based truncation offer speed advantages, they may sacrifice accuracy or flexibility compared to the more feature-rich HuggingFace tokenizers.
- Truncation may inherently lead to information loss, so for scenarios where retaining the full context is crucial, alternative approaches such as Retrieval Augmented Generation (RAG) might be more suitable for your use case.
- Keep in mind that the max context length also includes the LLM response, so it’s generally good practice to keep the context sent to around 75% of the max context tokens to leave room for the model to give a complete answer.
- You can easily find HuggingFace open-source LLM context windows using the max_position_embeddings property in their config.json.
As language models continue to evolve and offer larger context windows, the need for truncation may diminish in the future. However, until then, these truncation methods remain essential tools for working with long texts or large datasets within the constraints of current language model architectures.
Sign up for a free Telnyx account and explore context truncation.