Benchmark Comparison
Compare AI model performance across standardized tests
Individual Benchmark Results
This table shows performance on specific standardized tests (MMLU, HumanEval, GPQA, etc.) rather than category aggregations. Scores are pass rates or accuracy percentages on each benchmark; a dash (—) means no score is available for that model. The "AA"-prefixed columns (AA Coding Index, AA Math Index) and AAII appear to be Artificial Analysis composite indices rather than individual tests.
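Because many cells in the table are missing (shown as "—"), any aggregation over these scores should skip absent entries rather than treat them as zero. A minimal sketch in Python; the helper name is illustrative, and the sample row is claude-3.5-haiku's scores from the table with missing entries as `None`:

```python
def mean_score(scores):
    """Average benchmark scores, skipping missing entries (None)."""
    present = [s for s in scores if isinstance(s, (int, float))]
    if not present:
        return None  # no benchmark data at all for this model
    return sum(present) / len(present)

# claude-3.5-haiku's row: AIME, AA Coding Index, AAII, AA Math Index,
# DROP, GPQA, HLE, HumanEval, LiveCodeBench, MATH-500
claude_35_haiku = [3.3, None, 20.2, None, 83.1, 41.2, 3.5, 88.1, 31.4, 72.1]
avg = mean_score(claude_35_haiku)
print(round(avg, 1))  # → 42.9
```

Note that an average like this mixes benchmarks with very different score ranges (HLE scores cluster near 5, MATH-500 near 90), so it is only useful for rough comparison.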
The table covers 178 models; the first 10 benchmark columns are shown.
| Model | AIME | AA Coding Index | AAII | AA Math Index | DROP | GPQA | HLE | HumanEval | LiveCodeBench | MATH-500 |
|---|---|---|---|---|---|---|---|---|---|---|
| claude-3-5-sonnet-20240620 | 15.7 | 30.2 | 29.9 | — | 87.1 | 59.7 | 3.9 | 92.0 | 38.1 | 77.1 | |
| claude-3-haiku-20240307 | 1.0 | — | 9.6 | — | 78.4 | 33.3 | — | 75.9 | 15.4 | 39.4 | |
| claude-3-opus-20240229 | 3.3 | 19.5 | 20.6 | — | 83.1 | 49.6 | 3.1 | 84.9 | 27.9 | 64.1 | |
| claude-3.5-haiku | 3.3 | — | 20.2 | — | 83.1 | 41.2 | 3.5 | 88.1 | 31.4 | 72.1 | |
| claude-3.7-sonnet | 48.7 | 35.8 | 49.9 | 56.3 | — | 77.2 | 10.3 | — | 47.3 | 94.7 | |
| claude-haiku-4.5 | — | 37.0 | 41.7 | 39.0 | — | 73.0 | 4.3 | — | 51.1 | — | |
| claude-opus-4 | 75.7 | 44.2 | 54.2 | 73.3 | — | 79.6 | 11.7 | — | 63.6 | 98.2 | |
| claude-opus-4.1 | — | 46.1 | 59.3 | 80.3 | — | 80.9 | 11.9 | — | 65.4 | — | |
| claude-opus-4.5 | — | 60.2 | 69.8 | 91.3 | — | 86.6 | 28.4 | — | 87.1 | — | |
| claude-sonnet-4 | 77.3 | 45.1 | 56.5 | 74.3 | — | 77.7 | 9.6 | — | 65.5 | 99.1 | |
| claude-sonnet-4.5 | — | 49.8 | 62.7 | 88.0 | — | 83.4 | 17.3 | — | 71.4 | — | |
| llama3-1-70b-instruct-v1.0 | 17.3 | 17.6 | 22.6 | 4.0 | 79.6 | 41.3 | 4.6 | 80.5 | 23.2 | 64.9 | |
| llama3-1-8b-instruct-v1.0 | 7.7 | 8.5 | 16.9 | 4.3 | 59.5 | 28.1 | 5.1 | 72.6 | 11.6 | 51.9 | |
| llama3-2-3b-instruct-v1.0 | 6.7 | — | 11.2 | 3.3 | — | 32.8 | 5.2 | — | 8.3 | 48.9 | |
| mistral-7b-instruct-v0.2 | 0.0 | — | 1.0 | — | — | 17.7 | 4.3 | — | 4.6 | 12.1 | |
| DeepSeek-R1 | 89.3 | 44.1 | 52.0 | 76.0 | — | 81.3 | 14.9 | — | 77.0 | 98.3 | |
| DeepSeek-R1-Distill-Llama-70B | 67.0 | 19.7 | 29.9 | 53.7 | — | 65.2 | 6.1 | — | 26.6 | 93.5 | |
| DeepSeek-V3 | 25.3 | 25.9 | 32.5 | 26.0 | 91.6 | 57.4 | 3.6 | — | 35.9 | 88.7 | |
| deepseek-chat-v3-0324 | — | 39.0 | 44.8 | 49.7 | — | 73.5 | 6.3 | — | 57.7 | — | |
| deepseek-chat-v3.1 | — | 39.0 | 44.8 | 49.7 | — | 73.5 | 6.3 | — | 57.7 | — | |
| deepseek-r1-0528 | — | — | — | — | — | 81.0 | — | — | — | — | |
| devstral-small | 0.3 | 18.5 | 27.2 | 29.3 | — | 41.4 | 3.7 | — | 25.4 | 63.5 | |
| devstral-small-2505 | 6.7 | — | 19.6 | — | — | 43.4 | 4.0 | — | 25.8 | 68.4 | |
| gemma-3-12b-it | 22.0 | 10.6 | 20.4 | 18.3 | — | 40.9 | 4.8 | 85.4 | 13.7 | 85.3 | |
| gemma-3-27b-it | 25.3 | 12.8 | 22.1 | 20.7 | — | 42.6 | 4.7 | 87.8 | 13.7 | 88.3 | |
| gemma-3-4b-it | 6.3 | 6.4 | 14.7 | 12.7 | — | 29.9 | 5.2 | 71.3 | 11.2 | 76.6 | |
| gpt-oss-120b | — | 49.6 | 60.5 | 93.4 | — | 79.2 | 18.5 | — | 87.8 | — | |
| gpt-oss-20b | — | 40.7 | 52.4 | 89.3 | — | 70.2 | 9.8 | — | 77.7 | — | |
| hermes-3-llama-3.1-70b | 2.3 | — | 14.7 | — | — | 40.1 | 4.1 | — | 18.8 | 53.8 | |
| llama-3.1-405b-instruct | 21.3 | 22.2 | 28.1 | 3.0 | 84.8 | 51.1 | 4.2 | 89.0 | 30.5 | 70.3 | |
| llama-3.1-70b-instruct | 17.3 | 17.6 | 22.6 | 4.0 | 79.6 | 41.3 | 4.6 | 80.5 | 23.2 | 64.9 | |
| llama-3.1-8b-instruct | 7.7 | 8.5 | 16.9 | 4.3 | 59.5 | 28.1 | 5.1 | 72.6 | 11.6 | 51.9 | |
| llama-3.1-nemotron-70b-instruct | 24.7 | 14.8 | 23.6 | 11.0 | — | 46.5 | 4.6 | — | 16.9 | 73.3 | |
| llama-3.2-3b-instruct | 6.7 | — | 11.2 | 3.3 | — | 32.8 | 5.2 | — | 8.3 | 48.9 | |
| llama-3.3-70b-instruct | 30.0 | 19.2 | 27.9 | 7.7 | — | 50.2 | 4.0 | 88.4 | 28.8 | 77.3 | |
| llama-4-maverick | 39.0 | 26.4 | 35.8 | 19.3 | — | 68.5 | 4.8 | — | 39.7 | 88.9 | |
| llama-4-scout | 28.3 | 16.1 | 28.1 | 14.0 | — | 57.9 | 4.3 | — | 29.9 | 84.4 | |
| mistral-7b-instruct | 0.0 | — | 1.0 | — | — | 17.7 | 4.3 | — | 4.6 | 12.1 | |
| mistral-nemo | 0.3 | — | 5.2 | — | — | 31.4 | 4.4 | — | 5.7 | 39.5 | |
| mistral-small-24b-instruct-2501 | — | — | — | — | — | 45.3 | — | 84.8 | — | — | |
| mistral-small-3.1-24b-instruct | 6.3 | — | 13.0 | — | — | 38.1 | 4.3 | — | 14.1 | 56.3 | |
| mistral-small-3.2-24b-instruct | 6.3 | — | 13.0 | — | — | 38.1 | 4.3 | — | 14.1 | 56.3 | |
| mixtral-8x7b-instruct | 0.0 | — | 2.6 | — | — | 29.2 | 4.5 | — | 6.6 | 29.9 | |
| nemotron-nano-9b-v2 | — | 31.9 | 37.2 | 69.7 | — | 57.0 | 4.6 | — | 72.4 | — | |
| phi-4 | — | — | — | — | — | 65.8 | — | — | — | — | |
| phi-4-multimodal-instruct | 9.3 | — | 12.4 | — | — | 31.5 | 4.4 | — | 13.1 | 69.3 | |
| phi-4-reasoning-plus | — | — | — | — | — | 68.9 | — | — | — | — | |
| qwen-2.5-72b-instruct | 16.0 | 19.5 | 29.0 | 14.0 | — | 49.0 | 4.2 | 86.6 | 27.6 | 85.8 | |
| qwen-2.5-7b-instruct | — | — | — | — | — | 36.4 | — | 84.8 | — | — | |
| qwen3-14b | 28.0 | 19.8 | 29.2 | 58.0 | — | 47.0 | 4.2 | — | 28.0 | 87.1 | |
| qwen3-235b-a22b | 32.7 | 23.3 | 29.9 | 23.7 | — | 61.3 | 4.7 | — | 34.3 | 90.2 | |
| qwen3-235b-a22b-2507 | 94.0 | 44.6 | 57.5 | 91.0 | — | 79.0 | 15.0 | — | 78.8 | 98.4 | |
| qwen3-235b-a22b-thinking-2507 | — | — | — | — | — | 81.1 | — | — | — | — | |
| qwen3-30b-a3b | 72.7 | 29.2 | 37.0 | 66.3 | — | 65.9 | 6.8 | — | 51.5 | 97.5 | |
| qwen3-32b | 80.7 | 30.9 | 38.7 | 73.0 | — | 66.8 | 8.3 | — | 54.6 | 96.1 | |
| qwq-32b | 78.0 | — | 37.9 | 29.0 | — | 65.2 | 8.2 | — | 63.1 | 95.7 | |
| deepseek-chat | — | 39.0 | 44.8 | 49.7 | — | 73.5 | 6.3 | — | 57.7 | — | |
| deepseek-reasoner | — | 47.2 | 54.0 | 89.7 | — | 77.9 | 13.0 | — | 78.4 | — | |
| deepseek-r1-05-28 | 89.3 | 44.1 | 52.0 | 76.0 | — | 81.3 | 14.9 | — | 77.0 | 98.3 | |
| deepseek-v3-03-24 | 25.3 | 25.9 | 32.5 | 26.0 | 91.6 | 57.4 | 3.6 | — | 35.9 | 88.7 | |
| kimi-k2-instruct | — | — | — | — | — | 75.1 | — | 93.3 | — | — | |
| qwen3-coder-480b-a35b-instruct | 47.7 | 37.4 | 42.3 | 39.3 | — | 61.8 | 4.4 | — | 58.5 | 94.2 | |
| gemini-2.0-flash | 33.0 | 23.4 | 33.6 | 21.7 | — | 62.2 | 5.3 | — | 33.4 | 93.0 | |
| gemini-2.0-flash-lite | — | — | — | — | — | 51.5 | — | — | — | — | |
| gemini-2.5-flash | 50.0 | 30.0 | 40.4 | 60.3 | — | 82.8 | 5.1 | — | 49.5 | 93.2 | |
| gemini-2.5-flash-preview | 50.0 | 30.0 | 40.4 | 60.3 | — | 82.8 | 5.1 | — | 49.5 | 93.2 | |
| gemini-2.5-pro | 88.7 | 49.3 | 59.6 | 87.7 | — | 83.7 | 21.1 | — | 80.1 | 96.7 | |
| gemini-2.5-pro-preview | 88.7 | 49.3 | 59.6 | 87.7 | — | 83.7 | 21.1 | — | 80.1 | 96.7 | |
| gemini-3-pro-preview | — | 62.3 | 72.8 | 95.7 | — | 91.3 | 37.2 | — | 91.7 | — | |
| gemma-2-9b-it | 0.0 | — | 7.8 | — | — | 31.1 | 3.9 | 40.2 | 12.6 | 51.7 | |
| kimi-k2-0905 | — | 38.1 | 50.4 | 57.3 | — | 76.3 | 6.3 | 94.5 | 61.0 | — | |
| codestral-2501 | 4.3 | 16.3 | 20.1 | 6.0 | — | 31.2 | 4.5 | — | 24.3 | 60.7 | |
| codestral-2508 | 4.3 | 16.3 | 20.1 | 6.0 | — | 31.2 | 4.5 | — | 24.3 | 60.7 | |
| devstral-medium | 6.7 | 23.9 | 27.9 | 4.7 | — | 49.2 | 3.8 | — | 33.7 | 70.7 | |
| ministral-3b | 0.0 | 5.4 | 10.9 | 0.3 | — | 26.0 | 5.5 | — | 6.9 | 53.7 | |
| mistral-large | 0.0 | — | 11.9 | — | — | 35.1 | 3.4 | — | 17.8 | 52.7 | |
| mistral-large-2.1 | 0.0 | — | 11.9 | — | — | 35.1 | 3.4 | — | 17.8 | 52.7 | |
| mistral-large-2407 | 9.3 | — | 22.3 | 0.0 | — | 47.2 | 3.2 | — | 26.7 | 71.4 | |
| mistral-medium-3 | 44.0 | 25.6 | 33.6 | 30.3 | — | 57.8 | 4.3 | — | 40.0 | 90.7 | |
| mistral-medium-3.1 | — | 28.1 | 35.4 | 38.3 | — | 58.8 | 4.4 | — | 40.6 | — | |
| mistral-saba | 13.0 | — | 19.6 | — | — | 42.4 | 4.1 | — | — | 67.7 | |
| mistral-small | 6.3 | — | 13.0 | — | — | 38.1 | 4.3 | — | 14.1 | 56.3 | |
| mistral-small-3 | 6.3 | — | 13.0 | — | — | 38.1 | 4.3 | — | 14.1 | 56.3 | |
| mistral-small-3.1 | 6.3 | — | 13.0 | — | — | 38.1 | 4.3 | — | 14.1 | 56.3 | |
| mistral-small-3.2 | 6.3 | — | 13.0 | — | — | 38.1 | 4.3 | — | 14.1 | 56.3 | |
| chatgpt-4o-latest | 32.7 | — | 35.6 | 25.7 | — | 65.5 | 5.0 | — | 42.5 | 89.3 | |
| gpt-3.5-turbo | — | 10.7 | 8.3 | — | 70.2 | 30.3 | — | 68.0 | — | 44.1 | |
| gpt-3.5-turbo-0613 | — | — | — | — | — | — | — | — | — | — | |
| gpt-3.5-turbo-instruct | — | 10.7 | 8.3 | — | 70.2 | 30.3 | — | 68.0 | — | 44.1 | |
| gpt-4.1 | 43.7 | 32.2 | 43.4 | 34.7 | — | 66.5 | 4.6 | — | 45.7 | 91.3 | |
| gpt-4.1-mini | 43.0 | 31.9 | 42.5 | 46.3 | — | 66.4 | 4.6 | — | 48.3 | 92.5 | |
| gpt-4.1-nano | 23.7 | 20.7 | 27.3 | 24.0 | — | 51.2 | 3.9 | — | 32.6 | 84.8 | |
| gpt-4o | 15.0 | 24.0 | 27.0 | 6.0 | — | 54.3 | 3.3 | — | 30.9 | 75.9 | |
| gpt-4o-2024-05-13 | 15.0 | 24.0 | 27.0 | 6.0 | 83.4 | 54.0 | 3.3 | 90.2 | 30.9 | 75.9 | |
| gpt-4o-mini | 11.7 | — | 21.2 | 14.7 | — | 42.6 | 4.0 | — | 23.4 | 78.9 | |
| gpt-4o-mini-search-preview | — | — | — | — | 79.7 | 40.2 | — | 87.2 | — | — | |
| gpt-5 | 95.7 | 52.7 | 68.5 | 94.3 | — | 85.4 | 26.5 | — | 84.6 | 99.4 | |
| gpt-5-chat | — | 34.7 | 41.8 | 48.3 | — | 68.6 | 5.8 | — | 54.3 | — | |
| gpt-5-mini | — | 51.4 | 64.3 | 90.7 | — | 82.8 | 19.7 | — | 83.8 | — | |
| gpt-5-nano | — | 42.3 | 51.0 | 83.7 | — | 67.6 | 8.2 | — | 78.9 | — | |
| gpt-5.1 | — | 57.5 | 69.7 | 94.0 | — | 87.3 | 26.5 | — | 86.8 | — | |
| o1 | 72.3 | 38.6 | 47.2 | — | — | 76.4 | 7.7 | 88.1 | 67.9 | 97.0 | |
| o1-mini | 60.3 | — | 39.2 | — | — | 60.1 | 4.9 | 92.4 | 57.6 | 94.4 | |
| o3 | 90.3 | 52.2 | 65.5 | 88.3 | — | 83.0 | 20.0 | — | 80.8 | 99.2 | |
| o3-mini | 77.0 | 39.4 | 48.1 | — | — | 76.0 | 8.7 | — | 71.7 | 97.3 | |
| o4-mini | 94.0 | 48.9 | 59.6 | 90.7 | — | 79.9 | 17.5 | — | 85.9 | 98.9 | |
| command-a | 9.7 | 19.2 | 26.9 | 13.0 | — | 52.7 | 4.6 | — | 28.7 | 81.9 | |
| deephermes-3-mistral-24b-preview | 4.7 | — | 15.5 | — | — | 38.2 | 3.9 | — | 19.5 | 59.5 | |
| deepseek-r1 | 89.3 | 44.1 | 52.0 | 76.0 | — | 81.3 | 14.9 | — | 77.0 | 98.3 | |
| deepseek-r1-distill-llama-70b | 67.0 | 19.7 | 29.9 | 53.7 | — | 65.2 | 6.1 | — | 26.6 | 93.5 | |
| deepseek-r1-distill-qwen-14b | 66.7 | — | 29.7 | 55.7 | — | 59.1 | 4.4 | — | 37.6 | 94.9 | |
| deepseek-r1-distill-qwen-32b | 68.7 | — | 32.7 | 63.0 | — | 61.8 | 5.5 | — | 27.0 | 94.1 | |
| deepseek-v3.1-terminus | — | 49.6 | 57.7 | 89.7 | — | 79.2 | 15.2 | — | 79.8 | — | |
| deepseek-v3.1-terminus:exacto | 25.3 | 25.9 | 32.5 | 26.0 | 91.6 | 57.4 | 3.6 | — | 35.9 | 88.7 | |
| deepseek-v3.2-exp | — | 39.6 | 46.3 | 57.7 | — | 79.9 | 8.6 | — | 55.4 | — | |
| ernie-4.5-300b-a47b | 49.3 | 27.9 | 32.9 | 41.3 | — | 81.1 | 3.5 | — | 46.7 | 93.1 | |
| gemini-2.0-flash-001 | 33.0 | 23.4 | 33.6 | 21.7 | — | 62.2 | 5.3 | — | 33.4 | 93.0 | |
| gemini-2.0-flash-lite-001 | 27.7 | — | 26.8 | — | — | 53.5 | 3.6 | — | 18.5 | 87.3 | |
| gemini-2.5-flash-image | 50.0 | 30.0 | 40.4 | 60.3 | — | 82.8 | 5.1 | — | 49.5 | 93.2 | |
| gemini-2.5-flash-image-preview | 50.0 | 30.0 | 40.4 | 60.3 | — | 82.8 | 5.1 | — | 49.5 | 93.2 | |
| gemini-2.5-flash-lite | 50.0 | 19.9 | 30.1 | 35.3 | — | 64.6 | 3.7 | — | 40.0 | 92.6 | |
| gemini-2.5-flash-lite-preview-06-17 | 50.0 | 30.0 | 40.4 | 60.3 | — | 82.8 | 5.1 | — | 49.5 | 93.2 | |
| gemini-2.5-flash-lite-preview-09-2025 | — | 36.5 | 47.9 | 68.7 | — | 70.9 | 6.6 | — | 68.8 | — | |
| gemini-2.5-flash-preview-09-2025 | — | 42.5 | 54.4 | 78.3 | — | 79.3 | 12.7 | — | 71.3 | — | |
| gemini-2.5-pro-preview-05-06 | 88.7 | 49.3 | 59.6 | 87.7 | — | 83.7 | 21.1 | — | 80.1 | 96.7 | |
| gemma-3n-e4b-it | — | — | — | — | — | 23.7 | — | 75.0 | — | — | |
| gpt-oss-120b:exacto | — | 49.6 | 60.5 | 93.4 | — | 79.2 | 18.5 | — | 87.8 | — | |
| grok-3-mini-beta | 93.3 | 42.2 | 57.1 | 84.7 | — | 79.1 | 11.1 | — | 69.6 | 99.2 | |
| grok-4-fast | — | 48.4 | 60.3 | 89.7 | — | 84.7 | 17.0 | — | 83.2 | — | |
| kimi-k2-thinking | — | 52.2 | 67.0 | 94.7 | — | 83.8 | 22.3 | — | 85.3 | — | |
| kimi-linear-48b-a3b-instruct | — | 22.8 | — | 36.3 | — | 41.2 | 2.7 | — | 37.8 | — | |
| lfm2-8b-a1b | — | 7.3 | 17.4 | 25.3 | — | 34.4 | 4.9 | — | 15.1 | — | |
| ling-1t | — | 37.6 | 44.8 | 71.3 | — | 71.9 | 7.2 | — | 67.7 | — | |
| llama-3.1-405b | 21.3 | 22.2 | 28.1 | 3.0 | 84.8 | 51.1 | 4.2 | 89.0 | 30.5 | 70.3 | |
| llama-3.1-nemotron-ultra-253b-v1 | 74.7 | 33.7 | 38.5 | 63.7 | — | 74.4 | 8.1 | — | 64.1 | 95.2 | |
| llama-3.3-nemotron-super-49b-v1.5 | 19.3 | 17.0 | 25.9 | 7.7 | — | 66.7 | 3.5 | — | 28.0 | 77.5 | |
| minimax-m1 | 81.3 | 35.2 | 40.0 | 13.7 | — | 68.7 | 7.5 | — | 65.7 | 97.2 | |
| minimax-m2 | — | 47.6 | 61.4 | 78.3 | — | 77.8 | 12.5 | — | 82.6 | — | |
| nova-lite-v1 | 10.7 | 10.4 | 21.4 | 7.0 | 80.2 | 42.6 | 4.6 | 85.4 | 16.7 | 76.5 | |
| nova-micro-v1 | 8.0 | 8.3 | 17.7 | 6.0 | 79.3 | 37.9 | 4.7 | 81.1 | 14.0 | 70.3 | |
| nova-premier-v1 | 17.0 | 22.0 | 32.3 | 17.3 | — | 56.9 | 4.7 | — | 31.7 | 83.9 | |
| nova-pro-v1 | 10.7 | 16.6 | 25.0 | 7.0 | 85.4 | 48.4 | 3.4 | 89.0 | 23.3 | 78.6 | |
| qwen-2.5-coder-32b-instruct | 12.0 | — | 21.8 | — | — | 41.7 | 3.8 | 92.7 | 29.5 | 76.7 | |
| qwen-turbo | 12.0 | — | 19.1 | — | — | 41.0 | 4.2 | — | 16.3 | 80.5 | |
| qwen3-8b | 24.3 | 13.0 | 22.9 | 24.3 | — | 45.2 | 2.8 | — | 20.2 | 82.8 | |
| qwen3-max | — | 36.2 | 55.8 | 82.3 | — | 77.6 | 12.0 | — | 53.5 | — | |
| qwen3-next-80b-a3b-instruct | — | 35.4 | 44.8 | 66.3 | — | 73.4 | 7.3 | — | 68.4 | — | |
| qwen3-next-80b-a3b-thinking | — | — | — | — | — | 77.2 | — | — | — | — | |
| qwen3-vl-235b-a22b-instruct | — | 33.9 | 44.1 | 70.7 | — | 71.2 | 6.3 | — | 59.4 | — | |
| qwen3-vl-235b-a22b-thinking | — | — | — | — | — | — | — | — | — | — | |
| qwen3-vl-30b-a3b-instruct | 29.7 | 27.4 | 33.4 | 29.0 | — | 70.4 | 4.0 | — | 40.3 | 89.3 | |
| qwen3-vl-30b-a3b-thinking | — | — | — | — | — | 74.4 | — | — | — | — | |
| qwen3-vl-8b-instruct | — | 17.6 | 27.1 | 27.3 | — | 42.7 | 2.9 | — | 33.2 | — | |
| qwen3-vl-8b-thinking | — | — | — | — | — | 69.9 | — | — | — | — | |
| ring-1t | — | 35.8 | 41.8 | 89.3 | — | 59.5 | 10.2 | — | 64.3 | — | |
| sonar | 48.7 | — | 28.8 | — | — | 47.1 | 7.3 | — | 29.5 | 81.7 | |
| sonar-pro | 29.0 | — | 28.2 | — | — | 57.8 | 7.9 | — | 27.5 | 74.5 | |
| sonar-pro-search | 29.0 | — | 28.2 | — | — | 57.8 | 7.9 | — | 27.5 | 74.5 | |
| sonar-reasoning | 77.0 | — | 34.2 | — | — | 62.3 | — | — | — | 92.1 | |
| sonar-reasoning-pro | 79.0 | — | 46.3 | — | — | — | — | — | — | 95.7 | |
| deepseek-r1-0528-qwen3-8b | 89.3 | 44.1 | 52.0 | 76.0 | — | 81.3 | 14.9 | — | 77.0 | 98.3 | |
| glm-4.5 | 87.3 | 43.3 | 51.3 | 73.7 | — | 78.6 | 12.2 | — | 73.8 | 97.9 | |
| glm-4.5v | — | 20.1 | 26.0 | 15.3 | — | 57.3 | 3.6 | — | 35.2 | — | |
| DeepSeek-R1-Distill-Qwen-1.5B | 17.7 | — | 8.6 | 22.0 | — | 33.8 | 3.3 | — | 7.0 | 68.7 | |
| DeepSeek-R1-Distill-Qwen-14B | 66.7 | — | 29.7 | 55.7 | — | 59.1 | 4.4 | — | 37.6 | 94.9 | |
| Llama-3.1-Nemotron-70B-Instruct-HF | 24.7 | 14.8 | 23.6 | 11.0 | — | 46.5 | 4.6 | — | 16.9 | 73.3 | |
| QwQ-32B-Preview | 45.3 | — | 28.0 | — | — | 65.2 | 4.8 | — | 33.7 | 91.0 | |
| Qwen2-72B-Instruct | 14.7 | — | 18.1 | — | — | 42.4 | 3.7 | 86.0 | 15.9 | 70.1 | |
| gemma-2-27b-it | 29.7 | — | 17.2 | — | — | 35.7 | 3.7 | 51.8 | 27.9 | 54.1 | |
| grok-2 | — | — | — | — | — | 56.0 | — | 88.4 | — | — | |
| grok-3 | — | — | 41.4 | — | — | — | — | — | — | — | |
| grok-3-mini | 93.3 | 42.2 | 57.1 | 84.7 | — | 79.1 | 11.1 | — | 69.6 | 99.2 | |
| grok-4 | 94.3 | 55.1 | 65.3 | 92.7 | — | 87.6 | 23.9 | — | 81.9 | 99.0 | |
| grok-code-fast-1 | — | 39.4 | 48.6 | 43.3 | — | 72.7 | 7.5 | — | 65.7 | — | |
| glm-4.5-air | 67.3 | 39.4 | 48.8 | 80.7 | — | 74.2 | 6.8 | — | 68.4 | 96.5 | |
| glm-4.5-airx | 67.3 | 39.4 | 48.8 | 80.7 | — | 74.2 | 6.8 | — | 68.4 | 96.5 | |
| glm-4.5-x | 87.3 | 43.3 | 51.3 | 73.7 | — | 78.6 | 12.2 | — | 73.8 | 97.9 | |
| glm-4.6 | — | 38.7 | 44.7 | 44.3 | — | 81.0 | 5.2 | — | 56.1 | — |
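The interactive page sorts models by clicking column headers; offline, the same markdown rows can be parsed and ranked with a few lines of Python. A sketch, assuming the rows are available as text (the embedded `TABLE` is a four-row excerpt of the table above, with "—" marking missing scores):

```python
# Parse markdown benchmark rows and rank models by one column.
TABLE = """\
| claude-opus-4.5 | — | 60.2 | 69.8 | 91.3 | — | 86.6 | 28.4 | — | 87.1 | — |
| gpt-5 | 95.7 | 52.7 | 68.5 | 94.3 | — | 85.4 | 26.5 | — | 84.6 | 99.4 |
| gemini-3-pro-preview | — | 62.3 | 72.8 | 95.7 | — | 91.3 | 37.2 | — | 91.7 | — |
| grok-4 | 94.3 | 55.1 | 65.3 | 92.7 | — | 87.6 | 23.9 | — | 81.9 | 99.0 |
"""
COLUMNS = ["AIME", "AA Coding Index", "AAII", "AA Math Index", "DROP",
           "GPQA", "HLE", "HumanEval", "LiveCodeBench", "MATH-500"]

def parse_rows(text):
    """Map model name -> {benchmark: score or None} from markdown rows."""
    rows = {}
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.strip("|").split("|")]
        model, scores = cells[0], cells[1:1 + len(COLUMNS)]
        rows[model] = {col: (None if s == "—" else float(s))
                       for col, s in zip(COLUMNS, scores)}
    return rows

def ranked(rows, column):
    """Model names sorted by one benchmark, best first; missing scores last."""
    def key(model):
        score = rows[model][column]
        return score if score is not None else float("-inf")
    return sorted(rows, key=key, reverse=True)

models = parse_rows(TABLE)
print(ranked(models, "GPQA"))
```

On this excerpt, ranking by GPQA puts gemini-3-pro-preview (91.3) first; swapping in the full 178-row table is just a matter of pasting the remaining rows into `TABLE`.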