Google's smallest open model uses multi-query attention rather than the standard multi-head attention found in larger models, an architectural choice optimized for on-device inference on phones and laptops. Trained on 2 trillion tokens using the same data infrastructure as Gemini but built from scratch rather than distilled, it handles text generation, classification, and lightweight reasoning within an 8K context window.
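In multi-query attention, all query heads attend over a single shared key/value head, which shrinks the KV cache that dominates memory during on-device decoding. Below is a schematic sketch of the idea; the shapes and names are illustrative, not Gemma's actual implementation:

```python
import torch
import torch.nn.functional as F

def multi_query_attention(x, w_q, w_k, w_v, num_heads):
    """Multi-query attention sketch: many query heads, one shared K/V head.

    Hypothetical shapes for illustration (not Gemma's real config):
      x:   (batch, seq, d_model)
      w_q: (d_model, num_heads * head_dim)
      w_k: (d_model, head_dim)   # a single shared key projection
      w_v: (d_model, head_dim)   # a single shared value projection
    """
    batch, seq, _ = x.shape
    head_dim = w_k.shape[1]

    # Queries keep a full set of heads...
    q = (x @ w_q).view(batch, seq, num_heads, head_dim).transpose(1, 2)
    # ...but keys and values are projected once and broadcast to every head,
    # cutting the KV cache by a factor of num_heads at inference time.
    k = (x @ w_k).unsqueeze(1)  # (batch, 1, seq, head_dim)
    v = (x @ w_v).unsqueeze(1)

    scores = q @ k.transpose(-2, -1) / head_dim**0.5  # (batch, heads, seq, seq)
    attn = F.softmax(scores, dim=-1)
    out = attn @ v  # broadcasting shares the single K/V head across all query heads
    return out.transpose(1, 2).reshape(batch, seq, num_heads * head_dim)

# Smoke test with made-up dimensions: 8 query heads, one shared K/V head.
x = torch.randn(1, 16, 256)
out = multi_query_attention(
    x,
    w_q=torch.randn(256, 8 * 32),
    w_k=torch.randn(256, 32),
    w_v=torch.randn(256, 32),
    num_heads=8,
)
print(out.shape)  # torch.Size([1, 16, 256])
```

With standard multi-head attention, `k` and `v` would each carry `num_heads` heads; sharing one head trades a small amount of quality for a much smaller inference-time memory footprint.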
Gemma 2B IT is Google DeepMind's smallest instruction-tuned model, designed for on-device and resource-constrained environments. It is available on Telnyx's inference platform and Hugging Face.
Gemma 2B is optimized for on-device deployment with minimal resource requirements, while the 7B variant offers stronger reasoning at higher compute cost. The two belong to the same Google DeepMind model family but differ in attention mechanism (the 2B uses multi-query attention, the 7B multi-head) and target different deployment scenarios.
Gemma 2B IT scores 42.3% on MMLU (5-shot), within about three points of Llama 2 7B Chat (45.3%) at roughly one-third the size. The gap reflects the 2B parameter budget: both models were trained on roughly 2 trillion tokens, but Llama 2 spends 7B parameters on them. For on-device deployment, where footprint matters more than peak benchmark quality, that tradeoff is deliberate.
The cost of running Gemma 2B IT with Telnyx Inference is $0.0002 per 1,000 tokens. At that rate, processing 5,000,000 lightweight classification tasks at 200 tokens each, one billion tokens in total, would cost $200, the lowest total cost of any model in Telnyx's library for high-volume, low-complexity workloads.
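The arithmetic is easy to verify. A quick sketch (the function name is hypothetical; the rate and volumes come from the example above):

```python
def inference_cost_usd(num_requests, tokens_per_request, price_per_1k_tokens=0.0002):
    """Total cost = (total tokens / 1,000) * price per 1K tokens."""
    total_tokens = num_requests * tokens_per_request
    return total_tokens / 1_000 * price_per_1k_tokens

# The example from the text: 5M classification tasks at 200 tokens each,
# i.e. one billion tokens total.
print(inference_cost_usd(5_000_000, 200))  # 200.0 -> $200
```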
Gemma 2B was released by Google in February 2024 as part of the initial Gemma model family. It was designed to bring Google's model technology to edge and mobile devices.
Gemma 2B requires approximately 4GB of memory to hold its weights at their native 16-bit precision (about 2 billion parameters at 2 bytes each), or roughly 2GB with 8-bit quantization; the KV cache and activations add overhead on top. This makes it one of the most resource-efficient models available, suitable for mobile and embedded deployment.
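These figures follow directly from the parameter count. A rough estimator (weights only, ignoring KV cache and activation overhead, and using the nominal 2B parameter count from the text):

```python
def weight_memory_gb(num_params_billion, bits_per_weight):
    """Approximate memory for the model weights alone."""
    bytes_total = num_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1024**3

print(f"{weight_memory_gb(2, 16):.1f} GB at bf16")  # ~3.7 GB
print(f"{weight_memory_gb(2, 8):.1f} GB at int8")   # ~1.9 GB
print(f"{weight_memory_gb(2, 4):.1f} GB at 4-bit")  # ~0.9 GB
```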
Yes. Gemma 2B is released under the Gemma Terms of Use, which permit free research and commercial applications subject to Google's prohibited-use policy. Weights are available on Hugging Face.
Gemma 2B handles basic text generation, classification, and summarization tasks where model size is a constraint. For edge inference and on-device applications, it provides a capable option that runs on consumer hardware.
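For local experimentation, the Hugging Face weights can be run with the transformers library. A minimal sketch, assuming transformers and torch are installed and the Gemma terms have been accepted on Hugging Face; the prompt and generation settings are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it",
    torch_dtype=torch.bfloat16,  # ~4GB of weights, per the sizing above
    device_map="auto",
)

# A lightweight classification prompt of the kind the model is suited to.
messages = [{"role": "user", "content": "Classify the sentiment: 'Great battery life.'"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```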