Google's long-context multimodal model, with a context window of up to 2M tokens.

Key specifications

Intelligence: 75.8
Context window: 2M tokens
Max output: 8.2K tokens
Output speed: 65 tokens/s
Latency (TTFT): 0.90s
Input price (per 1M tokens): $1.25
Output price (per 1M tokens): $5.00
License: Proprietary
Architecture: Transformer (MoE)
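
The price, speed, and latency figures above can be combined into a rough per-request estimate. The following is a minimal sketch, not an official calculator: the function name `estimate_request` is hypothetical, and it assumes flat per-token pricing, that "2M" and "8.2K" mean 2,000,000 and 8,200 tokens, that the context window bounds input tokens only, and that generation runs at the listed average speed once the first token arrives.

```python
# Figures copied from the specification list above.
INPUT_USD_PER_M = 1.25       # input price per 1M tokens
OUTPUT_USD_PER_M = 5.00      # output price per 1M tokens
OUTPUT_TOKENS_PER_S = 65.0   # average output speed
TTFT_S = 0.90                # time to first token, in seconds
CONTEXT_WINDOW = 2_000_000   # assumption: "2M" means 2,000,000 tokens
MAX_OUTPUT = 8_200           # assumption: "8.2K" means about 8,200 tokens


def estimate_request(input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (cost in USD, approximate seconds) for one request.

    Hypothetical helper for illustration; real billing and latency
    may differ (tiered pricing, caching, network overhead).
    """
    # Assumption: only input tokens count against the context window.
    if input_tokens > CONTEXT_WINDOW:
        raise ValueError("input exceeds the context window")
    if output_tokens > MAX_OUTPUT:
        raise ValueError("output exceeds the maximum output length")

    cost = (input_tokens / 1e6) * INPUT_USD_PER_M \
         + (output_tokens / 1e6) * OUTPUT_USD_PER_M
    # Wait for the first token, then stream the rest at the average speed.
    seconds = TTFT_S + output_tokens / OUTPUT_TOKENS_PER_S
    return cost, seconds


if __name__ == "__main__":
    cost, seconds = estimate_request(100_000, 1_000)
    print(f"${cost:.2f}, {seconds:.1f} s")  # prints "$0.13, 16.3 s"
```

For a 100,000-token prompt and a 1,000-token reply, input cost dominates the bill ($0.125 of $0.13), while output length dominates the wait (about 15.4 of 16.3 seconds).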

Capabilities

Function calling, Vision, Audio, Streaming, JSON mode, Fine-tuning

Strengths

  • Very large 2M-token context window
  • Native multimodal input
  • Competitive price ($1.25 input / $5.00 output per 1M tokens)

Limitations

  • Slower time to first token (0.90s TTFT)
  • Closed weights

Benchmark scores

Benchmark            Score
GPQA Diamond         59.1%
MMLU-Pro             75.8%
SWE-Bench Verified   38.0%
HumanEval            84.1%
AIME 2024            11.0%