Benchmark Comparison
Compare AI model performance across standardized tests
Individual Benchmark Results
This table shows performance on specific standardized tests (MMLU, HumanEval, GPQA, etc.) rather than category aggregations. Scores are pass rates or accuracy percentages on each benchmark; a dash (—) means no score is available for that model. The "AA"-prefixed columns (AA Coding Index, AA Math Index) and AAII appear to be Artificial Analysis composite indices rather than individual tests.
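Because many cells in the table are missing (shown as "—"), any aggregation over these scores should skip absent entries rather than treat them as zero. A minimal sketch in Python; the helper name is illustrative, and the sample row is claude-3.5-haiku's scores from the table with missing entries as `None`:

```python
def mean_score(scores):
    """Average benchmark scores, skipping missing entries (None)."""
    present = [s for s in scores if isinstance(s, (int, float))]
    if not present:
        return None  # no benchmark data at all for this model
    return sum(present) / len(present)

# claude-3.5-haiku's row: AIME, AA Coding Index, AAII, AA Math Index,
# DROP, GPQA, HLE, HumanEval, LiveCodeBench, MATH-500
claude_35_haiku = [3.3, None, 20.2, None, 83.1, 41.2, 3.5, 88.1, 31.4, 72.1]
avg = mean_score(claude_35_haiku)
print(round(avg, 1))  # → 42.9
```

Note that an average like this mixes benchmarks with very different score ranges (HLE scores cluster near 5, MATH-500 near 90), so it is only useful for rough comparison.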
The table covers 178 models; the first 10 benchmark columns are shown.
| Model | AIME | AA Coding Index | AAII | AA Math Index | DROP | GPQA | HLE | HumanEval | LiveCodeBench | MATH-500 |
|---|---|---|---|---|---|---|---|---|---|---|
| claude-3-5-sonnet-20240620 | 15.7 | 30.2 | 29.9 | — | 87.1 | 59.7 | 3.9 | 92.0 | 38.1 | 77.1 | |
| claude-3-haiku-20240307 | 1.0 | — | 9.6 | — | 78.4 | 33.3 | — | 75.9 | 15.4 | 39.4 | |
| claude-3-opus-20240229 | 3.3 | 19.5 | 20.6 | — | 83.1 | 49.6 | 3.1 | 84.9 | 27.9 | 64.1 | |
| claude-3.5-haiku | 3.3 | — | 20.2 | — | 83.1 | 41.2 | 3.5 | 88.1 | 31.4 | 72.1 | |
| claude-3.7-sonnet | 48.7 | 35.8 | 49.9 | 56.3 | — | 77.2 | 10.3 | — | 47.3 | 94.7 | |
| claude-haiku-4.5 | — | 37.0 | 41.7 | 39.0 | — | 73.0 | 4.3 | — | 51.1 | — | |
| claude-opus-4 | 75.7 | 44.2 | 54.2 | 73.3 | — | 79.6 | 11.7 | — | 63.6 | 98.2 | |
| claude-opus-4.1 | — | 46.1 | 59.3 | 80.3 | — | 80.9 | 11.9 | — | 65.4 | — | |
| claude-opus-4.5 | — | 60.2 | 69.8 | 91.3 | — | 86.6 | 28.4 | — | 87.1 | — | |
| claude-sonnet-4 | 77.3 | 45.1 | 56.5 | 74.3 | — | 77.7 | 9.6 | — | 65.5 | 99.1 | |
| claude-sonnet-4.5 | — | 49.8 | 62.7 | 88.0 | — | 83.4 | 17.3 | — | 71.4 | — | |
| llama3-1-70b-instruct-v1.0 | 17.3 | 17.6 | 22.6 | 4.0 | 79.6 | 41.3 | 4.6 | 80.5 | 23.2 | 64.9 | |
| llama3-1-8b-instruct-v1.0 | 7.7 | 8.5 | 16.9 | 4.3 | 59.5 | 28.1 | 5.1 | 72.6 | 11.6 | 51.9 | |
| llama3-2-3b-instruct-v1.0 | 6.7 | — | 11.2 | 3.3 | — | 32.8 | 5.2 | — | 8.3 | 48.9 | |
| mistral-7b-instruct-v0.2 | 0.0 | — | 1.0 | — | — | 17.7 | 4.3 | — | 4.6 | 12.1 | |
| DeepSeek-R1 | 89.3 | 44.1 | 52.0 | 76.0 | — | 81.3 | 14.9 | — | 77.0 | 98.3 | |
| DeepSeek-R1-Distill-Llama-70B | 67.0 | 19.7 | 29.9 | 53.7 | — | 65.2 | 6.1 | — | 26.6 | 93.5 | |
| DeepSeek-V3 | 25.3 | 25.9 | 32.5 | 26.0 | 91.6 | 57.4 | 3.6 | — | 35.9 | 88.7 | |
| deepseek-chat-v3-0324 | — | 39.0 | 44.8 | 49.7 | — | 73.5 | 6.3 | — | 57.7 | — | |
| deepseek-chat-v3.1 | — | 39.0 | 44.8 | 49.7 | — | 73.5 | 6.3 | — | 57.7 | — | |
| deepseek-r1-0528 | — | — | — | — | — | 81.0 | — | — | — | — | |
| devstral-small | 0.3 | 18.5 | 27.2 | 29.3 | — | 41.4 | 3.7 | — | 25.4 | 63.5 | |
| devstral-small-2505 | 6.7 | — | 19.6 | — | — | 43.4 | 4.0 | — | 25.8 | 68.4 | |
| gemma-3-12b-it | 22.0 | 10.6 | 20.4 | 18.3 | — | 40.9 | 4.8 | 85.4 | 13.7 | 85.3 | |
| gemma-3-27b-it | 25.3 | 12.8 | 22.1 | 20.7 | — | 42.6 | 4.7 | 87.8 | 13.7 | 88.3 | |
| gemma-3-4b-it | 6.3 | 6.4 | 14.7 | 12.7 | — | 29.9 | 5.2 | 71.3 | 11.2 | 76.6 | |
| gpt-oss-120b | — | 49.6 | 60.5 | 93.4 | — | 79.2 | 18.5 | — | 87.8 | — | |
| gpt-oss-20b | — | 40.7 | 52.4 | 89.3 | — | 70.2 | 9.8 | — | 77.7 | — | |
| hermes-3-llama-3.1-70b | 2.3 | — | 14.7 | — | — | 40.1 | 4.1 | — | 18.8 | 53.8 | |
| llama-3.1-405b-instruct | 21.3 | 22.2 | 28.1 | 3.0 | 84.8 | 51.1 | 4.2 | 89.0 | 30.5 | 70.3 | |
| llama-3.1-70b-instruct | 17.3 | 17.6 | 22.6 | 4.0 | 79.6 | 41.3 | 4.6 | 80.5 | 23.2 | 64.9 | |
| llama-3.1-8b-instruct | 7.7 | 8.5 | 16.9 | 4.3 | 59.5 | 28.1 | 5.1 | 72.6 | 11.6 | 51.9 | |
| llama-3.1-nemotron-70b-instruct | 24.7 | 14.8 | 23.6 | 11.0 | — | 46.5 | 4.6 | — | 16.9 | 73.3 | |
| llama-3.2-3b-instruct | 6.7 | — | 11.2 | 3.3 | — | 32.8 | 5.2 | — | 8.3 | 48.9 | |
| llama-3.3-70b-instruct | 30.0 | 19.2 | 27.9 | 7.7 | — | 50.2 | 4.0 | 88.4 | 28.8 | 77.3 | |
| llama-4-maverick | 39.0 | 26.4 | 35.8 | 19.3 | — | 68.5 | 4.8 | — | 39.7 | 88.9 | |
| llama-4-scout | 28.3 | 16.1 | 28.1 | 14.0 | — | 57.9 | 4.3 | — | 29.9 | 84.4 | |
| mistral-7b-instruct | 0.0 | — | 1.0 | — | — | 17.7 | 4.3 | — | 4.6 | 12.1 | |
| mistral-nemo | 0.3 | — | 5.2 | — | — | 31.4 | 4.4 | — | 5.7 | 39.5 | |
| mistral-small-24b-instruct-2501 | — | — | — | — | — | 45.3 | — | 84.8 | — | — | |
| mistral-small-3.1-24b-instruct | 6.3 | — | 13.0 | — | — | 38.1 | 4.3 | — | 14.1 | 56.3 | |
| mistral-small-3.2-24b-instruct | 6.3 | — | 13.0 | — | — | 38.1 | 4.3 | — | 14.1 | 56.3 | |
| mixtral-8x7b-instruct | 0.0 | — | 2.6 | — | — | 29.2 | 4.5 | — | 6.6 | 29.9 | |
| nemotron-nano-9b-v2 | — | 31.9 | 37.2 | 69.7 | — | 57.0 | 4.6 | — | 72.4 | — | |
| phi-4 | — | — | — | — | — | 65.8 | — | — | — | — | |
| phi-4-multimodal-instruct | 9.3 | — | 12.4 | — | — | 31.5 | 4.4 | — | 13.1 | 69.3 | |
| phi-4-reasoning-plus | — | — | — | — | — | 68.9 | — | — | — | — | |
| qwen-2.5-72b-instruct | 16.0 | 19.5 | 29.0 | 14.0 | — | 49.0 | 4.2 | 86.6 | 27.6 | 85.8 | |
| qwen-2.5-7b-instruct | — | — | — | — | — | 36.4 | — | 84.8 | — | — | |
| qwen3-14b | 28.0 | 19.8 | 29.2 | 58.0 | — | 47.0 | 4.2 | — | 28.0 | 87.1 | |
| qwen3-235b-a22b | 32.7 | 23.3 | 29.9 | 23.7 | — | 61.3 | 4.7 | — | 34.3 | 90.2 | |
| qwen3-235b-a22b-2507 | 94.0 | 44.6 | 57.5 | 91.0 | — | 79.0 | 15.0 | — | 78.8 | 98.4 | |
| qwen3-235b-a22b-thinking-2507 | — | — | — | — | — | 81.1 | — | — | — | — | |
| qwen3-30b-a3b | 72.7 | 29.2 | 37.0 | 66.3 | — | 65.9 | 6.8 | — | 51.5 | 97.5 | |
| qwen3-32b | 80.7 | 30.9 | 38.7 | 73.0 | — | 66.8 | 8.3 | — | 54.6 | 96.1 | |
| qwq-32b | 78.0 | — | 37.9 | 29.0 | — | 65.2 | 8.2 | — | 63.1 | 95.7 | |
| deepseek-chat | — | 39.0 | 44.8 | 49.7 | — | 73.5 | 6.3 | — | 57.7 | — | |
| deepseek-reasoner | — | 47.2 | 54.0 | 89.7 | — | 77.9 | 13.0 | — | 78.4 | — | |
| deepseek-r1-05-28 | 89.3 | 44.1 | 52.0 | 76.0 | — | 81.3 | 14.9 | — | 77.0 | 98.3 | |
| deepseek-v3-03-24 | 25.3 | 25.9 | 32.5 | 26.0 | 91.6 | 57.4 | 3.6 | — | 35.9 | 88.7 | |
| kimi-k2-instruct | — | — | — | — | — | 75.1 | — | 93.3 | — | — | |
| qwen3-coder-480b-a35b-instruct | 47.7 | 37.4 | 42.3 | 39.3 | — | 61.8 | 4.4 | — | 58.5 | 94.2 | |
| gemini-2.0-flash | 33.0 | 23.4 | 33.6 | 21.7 | — | 62.2 | 5.3 | — | 33.4 | 93.0 | |
| gemini-2.0-flash-lite | — | — | — | — | — | 51.5 | — | — | — | — | |
| gemini-2.5-flash | 50.0 | 30.0 | 40.4 | 60.3 | — | 82.8 | 5.1 | — | 49.5 | 93.2 | |
| gemini-2.5-flash-preview | 50.0 | 30.0 | 40.4 | 60.3 | — | 82.8 | 5.1 | — | 49.5 | 93.2 | |
| gemini-2.5-pro | 88.7 | 49.3 | 59.6 | 87.7 | — | 83.7 | 21.1 | — | 80.1 | 96.7 | |
| gemini-2.5-pro-preview | 88.7 | 49.3 | 59.6 | 87.7 | — | 83.7 | 21.1 | — | 80.1 | 96.7 | |
| gemini-3-pro-preview | — | 62.3 | 72.8 | 95.7 | — | 91.3 | 37.2 | — | 91.7 | — | |
| gemma-2-9b-it | 0.0 | — | 7.8 | — | — | 31.1 | 3.9 | 40.2 | 12.6 | 51.7 | |
| kimi-k2-0905 | — | 38.1 | 50.4 | 57.3 | — | 76.3 | 6.3 | 94.5 | 61.0 | — | |
| codestral-2501 | 4.3 | 16.3 | 20.1 | 6.0 | — | 31.2 | 4.5 | — | 24.3 | 60.7 | |
| codestral-2508 | 4.3 | 16.3 | 20.1 | 6.0 | — | 31.2 | 4.5 | — | 24.3 | 60.7 | |
| devstral-medium | 6.7 | 23.9 | 27.9 | 4.7 | — | 49.2 | 3.8 | — | 33.7 | 70.7 | |
| ministral-3b | 0.0 | 5.4 | 10.9 | 0.3 | — | 26.0 | 5.5 | — | 6.9 | 53.7 | |
| mistral-large | 0.0 | — | 11.9 | — | — | 35.1 | 3.4 | — | 17.8 | 52.7 | |
| mistral-large-2.1 | 0.0 | — | 11.9 | — | — | 35.1 | 3.4 | — | 17.8 | 52.7 | |
| mistral-large-2407 | 9.3 | — | 22.3 | 0.0 | — | 47.2 | 3.2 | — | 26.7 | 71.4 | |
| mistral-medium-3 | 44.0 | 25.6 | 33.6 | 30.3 | — | 57.8 | 4.3 | — | 40.0 | 90.7 | |
| mistral-medium-3.1 | — | 28.1 | 35.4 | 38.3 | — | 58.8 | 4.4 | — | 40.6 | — | |
| mistral-saba | 13.0 | — | 19.6 | — | — | 42.4 | 4.1 | — | — | 67.7 | |
| mistral-small | 6.3 | — | 13.0 | — | — | 38.1 | 4.3 | — | 14.1 | 56.3 | |
| mistral-small-3 | 6.3 | — | 13.0 | — | — | 38.1 | 4.3 | — | 14.1 | 56.3 | |
| mistral-small-3.1 | 6.3 | — | 13.0 | — | — | 38.1 | 4.3 | — | 14.1 | 56.3 | |
| mistral-small-3.2 | 6.3 | — | 13.0 | — | — | 38.1 | 4.3 | — | 14.1 | 56.3 | |
| chatgpt-4o-latest | 32.7 | — | 35.6 | 25.7 | — | 65.5 | 5.0 | — | 42.5 | 89.3 | |
| gpt-3.5-turbo | — | 10.7 | 8.3 | — | 70.2 | 30.3 | — | 68.0 | — | 44.1 | |
| gpt-3.5-turbo-0613 | — | — | — | — | — | — | — | — | — | — | |
| gpt-3.5-turbo-instruct | — | 10.7 | 8.3 | — | 70.2 | 30.3 | — | 68.0 | — | 44.1 | |
| gpt-4.1 | 43.7 | 32.2 | 43.4 | 34.7 | — | 66.5 | 4.6 | — | 45.7 | 91.3 | |
| gpt-4.1-mini | 43.0 | 31.9 | 42.5 | 46.3 | — | 66.4 | 4.6 | — | 48.3 | 92.5 | |
| gpt-4.1-nano | 23.7 | 20.7 | 27.3 | 24.0 | — | 51.2 | 3.9 | — | 32.6 | 84.8 | |
| gpt-4o | 15.0 | 24.0 | 27.0 | 6.0 | — | 54.3 | 3.3 | — | 30.9 | 75.9 | |
| gpt-4o-2024-05-13 | 15.0 | 24.0 | 27.0 | 6.0 | 83.4 | 54.0 | 3.3 | 90.2 | 30.9 | 75.9 | |
| gpt-4o-mini | 11.7 | — | 21.2 | 14.7 | — | 42.6 | 4.0 | — | 23.4 | 78.9 | |
| gpt-4o-mini-search-preview | — | — | — | — | 79.7 | 40.2 | — | 87.2 | — | — | |
| gpt-5 | 95.7 | 52.7 | 68.5 | 94.3 | — | 85.4 | 26.5 | — | 84.6 | 99.4 | |
| gpt-5-chat | — | 34.7 | 41.8 | 48.3 | — | 68.6 | 5.8 | — | 54.3 | — | |
| gpt-5-mini | — | 51.4 | 64.3 | 90.7 | — | 82.8 | 19.7 | — | 83.8 | — | |
| gpt-5-nano | — | 42.3 | 51.0 | 83.7 | — | 67.6 | 8.2 | — | 78.9 | — | |
| gpt-5.1 | — | 57.5 | 69.7 | 94.0 | — | 87.3 | 26.5 | — | 86.8 | — | |
| o1 | 72.3 | 38.6 | 47.2 | — | — | 76.4 | 7.7 | 88.1 | 67.9 | 97.0 | |
| o1-mini | 60.3 | — | 39.2 | — | — | 60.1 | 4.9 | 92.4 | 57.6 | 94.4 | |
| o3 | 90.3 | 52.2 | 65.5 | 88.3 | — | 83.0 | 20.0 | — | 80.8 | 99.2 | |
| o3-mini | 77.0 | 39.4 | 48.1 | — | — | 76.0 | 8.7 | — | 71.7 | 97.3 | |
| o4-mini | 94.0 | 48.9 | 59.6 | 90.7 | — | 79.9 | 17.5 | — | 85.9 | 98.9 | |
| command-a | 9.7 | 19.2 | 26.9 | 13.0 | — | 52.7 | 4.6 | — | 28.7 | 81.9 | |
| deephermes-3-mistral-24b-preview | 4.7 | — | 15.5 | — | — | 38.2 | 3.9 | — | 19.5 | 59.5 | |
| deepseek-r1 | 89.3 | 44.1 | 52.0 | 76.0 | — | 81.3 | 14.9 | — | 77.0 | 98.3 | |
| deepseek-r1-distill-llama-70b | 67.0 | 19.7 | 29.9 | 53.7 | — | 65.2 | 6.1 | — | 26.6 | 93.5 | |
| deepseek-r1-distill-qwen-14b | 66.7 | — | 29.7 | 55.7 | — | 59.1 | 4.4 | — | 37.6 | 94.9 | |
| deepseek-r1-distill-qwen-32b | 68.7 | — | 32.7 | 63.0 | — | 61.8 | 5.5 | — | 27.0 | 94.1 | |
| deepseek-v3.1-terminus | — | 49.6 | 57.7 | 89.7 | — | 79.2 | 15.2 | — | 79.8 | — | |
| deepseek-v3.1-terminus:exacto | 25.3 | 25.9 | 32.5 | 26.0 | 91.6 | 57.4 | 3.6 | — | 35.9 | 88.7 | |
| deepseek-v3.2-exp | — | 39.6 | 46.3 | 57.7 | — | 79.9 | 8.6 | — | 55.4 | — | |
| ernie-4.5-300b-a47b | 49.3 | 27.9 | 32.9 | 41.3 | — | 81.1 | 3.5 | — | 46.7 | 93.1 | |
| gemini-2.0-flash-001 | 33.0 | 23.4 | 33.6 | 21.7 | — | 62.2 | 5.3 | — | 33.4 | 93.0 | |
| gemini-2.0-flash-lite-001 | 27.7 | — | 26.8 | — | — | 53.5 | 3.6 | — | 18.5 | 87.3 | |
| gemini-2.5-flash-image | 50.0 | 30.0 | 40.4 | 60.3 | — | 82.8 | 5.1 | — | 49.5 | 93.2 | |
| gemini-2.5-flash-image-preview | 50.0 | 30.0 | 40.4 | 60.3 | — | 82.8 | 5.1 | — | 49.5 | 93.2 | |
| gemini-2.5-flash-lite | 50.0 | 19.9 | 30.1 | 35.3 | — | 64.6 | 3.7 | — | 40.0 | 92.6 | |
| gemini-2.5-flash-lite-preview-06-17 | 50.0 | 30.0 | 40.4 | 60.3 | — | 82.8 | 5.1 | — | 49.5 | 93.2 | |
| gemini-2.5-flash-lite-preview-09-2025 | — | 36.5 | 47.9 | 68.7 | — | 70.9 | 6.6 | — | 68.8 | — | |
| gemini-2.5-flash-preview-09-2025 | — | 42.5 | 54.4 | 78.3 | — | 79.3 | 12.7 | — | 71.3 | — | |
| gemini-2.5-pro-preview-05-06 | 88.7 | 49.3 | 59.6 | 87.7 | — | 83.7 | 21.1 | — | 80.1 | 96.7 | |
| gemma-3n-e4b-it | — | — | — | — | — | 23.7 | — | 75.0 | — | — | |
| gpt-oss-120b:exacto | — | 49.6 | 60.5 | 93.4 | — | 79.2 | 18.5 | — | 87.8 | — | |
| grok-3-mini-beta | 93.3 | 42.2 | 57.1 | 84.7 | — | 79.1 | 11.1 | — | 69.6 | 99.2 | |
| grok-4-fast | — | 48.4 | 60.3 | 89.7 | — | 84.7 | 17.0 | — | 83.2 | — | |
| kimi-k2-thinking | — | 52.2 | 67.0 | 94.7 | — | 83.8 | 22.3 | — | 85.3 | — | |
| kimi-linear-48b-a3b-instruct | — | 22.8 | — | 36.3 | — | 41.2 | 2.7 | — | 37.8 | — | |
| lfm2-8b-a1b | — | 7.3 | 17.4 | 25.3 | — | 34.4 | 4.9 | — | 15.1 | — | |
| ling-1t | — | 37.6 | 44.8 | 71.3 | — | 71.9 | 7.2 | — | 67.7 | — | |
| llama-3.1-405b | 21.3 | 22.2 | 28.1 | 3.0 | 84.8 | 51.1 | 4.2 | 89.0 | 30.5 | 70.3 | |
| llama-3.1-nemotron-ultra-253b-v1 | 74.7 | 33.7 | 38.5 | 63.7 | — | 74.4 | 8.1 | — | 64.1 | 95.2 | |
| llama-3.3-nemotron-super-49b-v1.5 | 19.3 | 17.0 | 25.9 | 7.7 | — | 66.7 | 3.5 | — | 28.0 | 77.5 | |
| minimax-m1 | 81.3 | 35.2 | 40.0 | 13.7 | — | 68.7 | 7.5 | — | 65.7 | 97.2 | |
| minimax-m2 | — | 47.6 | 61.4 | 78.3 | — | 77.8 | 12.5 | — | 82.6 | — | |
| nova-lite-v1 | 10.7 | 10.4 | 21.4 | 7.0 | 80.2 | 42.6 | 4.6 | 85.4 | 16.7 | 76.5 | |
| nova-micro-v1 | 8.0 | 8.3 | 17.7 | 6.0 | 79.3 | 37.9 | 4.7 | 81.1 | 14.0 | 70.3 | |
| nova-premier-v1 | 17.0 | 22.0 | 32.3 | 17.3 | — | 56.9 | 4.7 | — | 31.7 | 83.9 | |
| nova-pro-v1 | 10.7 | 16.6 | 25.0 | 7.0 | 85.4 | 48.4 | 3.4 | 89.0 | 23.3 | 78.6 | |
| qwen-2.5-coder-32b-instruct | 12.0 | — | 21.8 | — | — | 41.7 | 3.8 | 92.7 | 29.5 | 76.7 | |
| qwen-turbo | 12.0 | — | 19.1 | — | — | 41.0 | 4.2 | — | 16.3 | 80.5 | |
| qwen3-8b | 24.3 | 13.0 | 22.9 | 24.3 | — | 45.2 | 2.8 | — | 20.2 | 82.8 | |
| qwen3-max | — | 36.2 | 55.8 | 82.3 | — | 77.6 | 12.0 | — | 53.5 | — | |
| qwen3-next-80b-a3b-instruct | — | 35.4 | 44.8 | 66.3 | — | 73.4 | 7.3 | — | 68.4 | — | |
| qwen3-next-80b-a3b-thinking | — | — | — | — | — | 77.2 | — | — | — | — | |
| qwen3-vl-235b-a22b-instruct | — | 33.9 | 44.1 | 70.7 | — | 71.2 | 6.3 | — | 59.4 | — | |
| qwen3-vl-235b-a22b-thinking | — | — | — | — | — | — | — | — | — | — | |
| qwen3-vl-30b-a3b-instruct | 29.7 | 27.4 | 33.4 | 29.0 | — | 70.4 | 4.0 | — | 40.3 | 89.3 | |
| qwen3-vl-30b-a3b-thinking | — | — | — | — | — | 74.4 | — | — | — | — | |
| qwen3-vl-8b-instruct | — | 17.6 | 27.1 | 27.3 | — | 42.7 | 2.9 | — | 33.2 | — | |
| qwen3-vl-8b-thinking | — | — | — | — | — | 69.9 | — | — | — | — | |
| ring-1t | — | 35.8 | 41.8 | 89.3 | — | 59.5 | 10.2 | — | 64.3 | — | |
| sonar | 48.7 | — | 28.8 | — | — | 47.1 | 7.3 | — | 29.5 | 81.7 | |
| sonar-pro | 29.0 | — | 28.2 | — | — | 57.8 | 7.9 | — | 27.5 | 74.5 | |
| sonar-pro-search | 29.0 | — | 28.2 | — | — | 57.8 | 7.9 | — | 27.5 | 74.5 | |
| sonar-reasoning | 77.0 | — | 34.2 | — | — | 62.3 | — | — | — | 92.1 | |
| sonar-reasoning-pro | 79.0 | — | 46.3 | — | — | — | — | — | — | 95.7 | |
| deepseek-r1-0528-qwen3-8b | 89.3 | 44.1 | 52.0 | 76.0 | — | 81.3 | 14.9 | — | 77.0 | 98.3 | |
| glm-4.5 | 87.3 | 43.3 | 51.3 | 73.7 | — | 78.6 | 12.2 | — | 73.8 | 97.9 | |
| glm-4.5v | — | 20.1 | 26.0 | 15.3 | — | 57.3 | 3.6 | — | 35.2 | — | |
| DeepSeek-R1-Distill-Qwen-1.5B | 17.7 | — | 8.6 | 22.0 | — | 33.8 | 3.3 | — | 7.0 | 68.7 | |
| DeepSeek-R1-Distill-Qwen-14B | 66.7 | — | 29.7 | 55.7 | — | 59.1 | 4.4 | — | 37.6 | 94.9 | |
| Llama-3.1-Nemotron-70B-Instruct-HF | 24.7 | 14.8 | 23.6 | 11.0 | — | 46.5 | 4.6 | — | 16.9 | 73.3 | |
| QwQ-32B-Preview | 45.3 | — | 28.0 | — | — | 65.2 | 4.8 | — | 33.7 | 91.0 | |
| Qwen2-72B-Instruct | 14.7 | — | 18.1 | — | — | 42.4 | 3.7 | 86.0 | 15.9 | 70.1 | |
| gemma-2-27b-it | 29.7 | — | 17.2 | — | — | 35.7 | 3.7 | 51.8 | 27.9 | 54.1 | |
| grok-2 | — | — | — | — | — | 56.0 | — | 88.4 | — | — | |
| grok-3 | — | — | 41.4 | — | — | — | — | — | — | — | |
| grok-3-mini | 93.3 | 42.2 | 57.1 | 84.7 | — | 79.1 | 11.1 | — | 69.6 | 99.2 | |
| grok-4 | 94.3 | 55.1 | 65.3 | 92.7 | — | 87.6 | 23.9 | — | 81.9 | 99.0 | |
| grok-code-fast-1 | — | 39.4 | 48.6 | 43.3 | — | 72.7 | 7.5 | — | 65.7 | — | |
| glm-4.5-air | 67.3 | 39.4 | 48.8 | 80.7 | — | 74.2 | 6.8 | — | 68.4 | 96.5 | |
| glm-4.5-airx | 67.3 | 39.4 | 48.8 | 80.7 | — | 74.2 | 6.8 | — | 68.4 | 96.5 | |
| glm-4.5-x | 87.3 | 43.3 | 51.3 | 73.7 | — | 78.6 | 12.2 | — | 73.8 | 97.9 | |
| glm-4.6 | — | 38.7 | 44.7 | 44.3 | — | 81.0 | 5.2 | — | 56.1 | — |
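The interactive page sorts models by clicking column headers; offline, the same markdown rows can be parsed and ranked with a few lines of Python. A sketch, assuming the rows are available as text (the embedded `TABLE` is a four-row excerpt of the table above, with "—" marking missing scores):

```python
# Parse markdown benchmark rows and rank models by one column.
TABLE = """\
| claude-opus-4.5 | — | 60.2 | 69.8 | 91.3 | — | 86.6 | 28.4 | — | 87.1 | — |
| gpt-5 | 95.7 | 52.7 | 68.5 | 94.3 | — | 85.4 | 26.5 | — | 84.6 | 99.4 |
| gemini-3-pro-preview | — | 62.3 | 72.8 | 95.7 | — | 91.3 | 37.2 | — | 91.7 | — |
| grok-4 | 94.3 | 55.1 | 65.3 | 92.7 | — | 87.6 | 23.9 | — | 81.9 | 99.0 |
"""
COLUMNS = ["AIME", "AA Coding Index", "AAII", "AA Math Index", "DROP",
           "GPQA", "HLE", "HumanEval", "LiveCodeBench", "MATH-500"]

def parse_rows(text):
    """Map model name -> {benchmark: score or None} from markdown rows."""
    rows = {}
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.strip("|").split("|")]
        model, scores = cells[0], cells[1:1 + len(COLUMNS)]
        rows[model] = {col: (None if s == "—" else float(s))
                       for col, s in zip(COLUMNS, scores)}
    return rows

def ranked(rows, column):
    """Model names sorted by one benchmark, best first; missing scores last."""
    def key(model):
        score = rows[model][column]
        return score if score is not None else float("-inf")
    return sorted(rows, key=key, reverse=True)

models = parse_rows(TABLE)
print(ranked(models, "GPQA"))
```

On this excerpt, ranking by GPQA puts gemini-3-pro-preview (91.3) first; swapping in the full 178-row table is just a matter of pasting the remaining rows into `TABLE`.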