Llama-4-Scout-Instruct

Meta's multimodal mixture-of-experts model with 17B active parameters across 16 experts, supporting text and image input with a 10M token context window.

about

Trained on roughly 40 trillion tokens of multimodal data, Scout fuses vision into the transformer backbone from the start of pre-training, using an enhanced MetaCLIP-based encoder rather than attaching vision as a separate module. Its interleaved attention architecture alternates layers with and without positional encodings to reach the 10-million-token context window, and the full 109B-parameter model fits on a single H100 GPU with int4 quantization.
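The single-GPU claim is easy to sanity-check with back-of-envelope arithmetic: int4 weights take half a byte per parameter. The sketch below assumes weight storage only and ignores KV cache, activations, and quantization scales, so real headroom is tighter than the raw numbers suggest.

```python
# Back-of-envelope check: do 109B int4 weights fit in an H100's 80 GB?
# Weight storage only -- KV cache and activation memory are ignored.

TOTAL_PARAMS = 109e9          # Scout's total parameter count
BYTES_PER_PARAM_INT4 = 0.5    # 4 bits per weight
H100_MEMORY_GB = 80

int4_gb = TOTAL_PARAMS * BYTES_PER_PARAM_INT4 / 1e9
bf16_gb = TOTAL_PARAMS * 2 / 1e9   # 2 bytes/param at bf16, for contrast

print(f"int4 weights: {int4_gb:.1f} GB")   # 54.5 GB -> fits
print(f"bf16 weights: {bf16_gb:.1f} GB")   # 218.0 GB -> does not fit

assert int4_gb < H100_MEMORY_GB < bf16_gb
```

At bf16 the same weights would need well over two H100s, which is why the quantized deployment is the headline configuration.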

License: llama4
Context window: 128,000 tokens

Use cases for Llama-4-Scout-Instruct

  1. Extreme-length document processing: The 10-million-token context window accommodates entire codebases, multi-volume legal records, or years of correspondence in a single inference pass.
  2. Native image understanding: Early-fusion multimodal training via MetaCLIP processes images at the transformer backbone level, not as an afterthought, enabling integrated visual reasoning alongside text.
  3. Single-GPU frontier inference: The full 109B-parameter model fits on one H100 with int4 quantization, making frontier multimodal capability accessible without multi-node infrastructure.

Quality

Arena Elo: 1250
MMLU: 79.6
MT Bench: N/A

Llama 4 Scout scores 79.6% on MMLU and 74.3% on MMLU-Pro, placing it between Gemini 2.0 Flash (76.4% MMLU) and GPT-4o mini (82.0% MMLU) on the same benchmark. With only 17B of its 109B parameters active per token across 16 experts, it fits on a single H100 GPU with int4 quantization while supporting a 10-million-token context window.

  • Claude-3-7-Sonnet-Latest: 1268
  • GPT-4 1106 Preview: 1251
  • Llama-4-Scout-Instruct: 1250
  • Llama 3.1 70B Instruct: 1248
  • GPT-4 0125 Preview: 1245

pricing

Running Llama 4 Scout through Telnyx Inference follows the 70B+ pricing tier at $0.0006 per 1,000 tokens, since only 17B of its 109B parameters are active per token. Processing 1,000,000 long-document queries at 5,000 tokens each would cost $3,000, with the 10M-token context window eliminating the need for retrieval augmentation.

What's Twitter saying?

  • Developers praise Llama 4 Scout's 10M context length as a breakthrough for summarization, function calling, and long-context tasks like multi-document processing, calling it best-in-class for cheap, local GPUs.
  • Many tech reviewers criticize its poor coding performance, noting it struggles with basic tasks, underperforms smaller models like Llama 3.3 70B or Gemma 3 27B, and lags in benchmarks.
  • Commentators highlight mixed enterprise results, with strong accuracy in simple info extraction but weaknesses in complex reasoning compared to Maverick or rivals like Claude Haiku.

Explore Our LLM Library

Discover the power and diversity of large language models available with Telnyx. Explore the options below to find the perfect model for your project.

Organization: deepseek-ai
Model Name: DeepSeek-R1-Distill-Qwen-14B
Tasks: text generation
Languages Supported: English
Context Length: 43,000
Parameters: 14.8B
Model Tier: medium
License: deepseek

TRY IT OUT

Chat with an LLM

Powered by our own GPU infrastructure, select a large language model, add a prompt, and chat away. For unlimited chats, sign up for a free account on our Mission Control Portal.

HOW IT WORKS

Selecting LLMs for Voice AI

RESOURCES

Get started

Check out our helpful tools to help get you started.

  • Test in the portal

    Easily browse and select your preferred model in the AI Playground.

  • Explore the docs

    Don’t wait to scale, start today with our public API endpoints.

  • Stay up to date

    Keep an eye on our AI changelog so you don't miss a beat.

Sign up and start building

faqs

llama-4-scout-17b-16e-instruct

What is Llama 4 Scout 17B 16E Instruct?

Llama 4 Scout is Meta's multimodal mixture-of-experts model with 17B active parameters out of 109B total, using 16 specialized experts. It supports text and image input with a 128K context window (up to 10M tokens supported) across 12 languages.

What is Llama 4 Scout good for?

Llama 4 Scout handles assistant-style chat, visual reasoning, code generation, and document analysis. It is optimized for tasks that combine text and image understanding, such as answering questions about images, captioning, and multimodal content generation.
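For mixed text-and-image prompts, multimodal chat models are commonly called with an OpenAI-style messages payload whose `content` field is a list of text and image parts. The sketch below only builds that request body; the image URL is a placeholder, and the exact endpoint and authentication for Telnyx Inference should be taken from the official docs.

```python
# Sketch of an OpenAI-style chat request body for a multimodal prompt.
# No network call is made -- this is just the JSON you would POST to an
# OpenAI-compatible chat-completions endpoint.

payload = {
    "model": "llama-4-scout-17b-16e-instruct",
    "messages": [
        {
            "role": "user",
            # Content is a list mixing text parts and image parts.
            "content": [
                {"type": "text", "text": "What trend does this chart show?"},
                {
                    "type": "image_url",
                    # Placeholder URL -- replace with your own image.
                    "image_url": {"url": "https://example.com/chart.png"},
                },
            ],
        }
    ],
    "max_tokens": 512,
}

print(payload["model"])  # llama-4-scout-17b-16e-instruct
```

The same shape works for captioning or visual Q&A by changing the text part; text-only requests can pass `content` as a plain string instead of a list.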

How much is the token limit for Llama 4 Scout?

Llama 4 Scout has a standard context window of 128K tokens for most deployments, with support for up to 10 million tokens in extended configurations. The large context makes it suitable for processing lengthy documents and maintaining extensive conversation histories.

What GPU is needed for Llama 4 Scout?

Llama 4 Scout requires significant GPU resources due to its 109B total parameters. At minimum, an A100 80GB or H100 GPU is recommended for quantized inference. Full-precision deployment typically requires multi-GPU setups.

Can Llama 4 run on CPU?

Llama 4 Scout can technically run on CPU-only systems using quantized formats, but performance is severely limited with very slow token generation. GPU inference is strongly recommended for practical use.