Vision Model Rankings

Top-performing AI models, ranked by vision benchmark score

Total Models

26

Providers

7

Avg Score

67.05

Updated

Sep 8, 2025
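The average score above can be reproduced from the per-model scores in the table below. The following is a minimal Python sketch with the scores copied by hand from the listing; the variable names are illustrative. It suggests that the reported 67.05 includes the score of 0 listed for gpt-3.5-turbo, so the mean over models with a nonzero score is somewhat higher.

```python
# Scores copied from the ranking table, in rank order (#1 to #26).
scores = [
    82.9, 82, 81.6, 79.7, 79.6, 78, 77.6, 74.8, 74.4, 73.4,
    73.4, 72.9, 72.7, 70.7, 69.4, 69.4, 68, 66.1, 61.7, 59.4,
    56.2, 55.4, 55.1, 55.1, 53.7, 0,
]

# Mean over every listed entry, including the 0 score at rank #26.
mean_all = sum(scores) / len(scores)

# Mean over entries with a nonzero score only (an assumption about
# what a "scored" model is; the page does not say why the score is 0).
nonzero = [s for s in scores if s > 0]
mean_nonzero = sum(nonzero) / len(nonzero)

print(f"entries: {len(scores)}")
print(f"mean over all entries: {mean_all:.2f}")
print(f"mean over nonzero scores: {mean_nonzero:.2f}")
```

Because the zero entry pulls the headline average down by a few points, it may be worth checking whether a given model's score is missing or genuinely zero before comparing it to the average.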

Rank
Model
Details
#1
o3

o3 is a well-rounded and powerful model across domains. It sets a new standard for math, science, coding, and visual reasoning tasks. It also excels at technical writing and instruction-following. Use it to think through multi-step problems that involve analysis across text, code, and images.

Provider
openai
Type
completions
Context
200K
Score
82.9
Pricing
In: $2 / 1M tokens
Out: $8 / 1M tokens
#2
gemini-2.5-pro-preview

Gemini 2.5 Pro Experimental is Google's state-of-the-art thinking model, capable of reasoning over complex problems in code, math, and STEM, as well as analyzing large datasets, codebases, and documents using long context.

Provider
gemini
Type
completions
Context
1M
Score
82
Pricing
In: $1.25 / 1M tokens
Out: $10 / 1M tokens
#3
o4-mini

o4-mini is the latest small o-series model. It's optimized for fast, effective reasoning with exceptionally efficient performance in coding and visual tasks.

Provider
openai
Type
completions
Context
200K
Score
81.6
Pricing
In: $1.1 / 1M tokens
Out: $4.4 / 1M tokens
#4
gemini-2.5-flash

Google's best model in terms of price-performance, offering well-rounded capabilities.

Provider
gemini
Type
completions
Context
1M
Score
79.7
Pricing
In: $0.15 / 1M tokens
Out: $0.6 / 1M tokens
#5
gemini-2.5-pro

Gemini 2.5 Pro is Google's most advanced reasoning Gemini model, capable of solving complex problems.

Provider
gemini
Type
completions
Context
1M
Score
79.6
Pricing
In: $1.25 / 1M tokens
Out: $10 / 1M tokens
#6
grok-3

Grok 3 is the latest model from xAI. It's their flagship model that excels at enterprise use cases like data extraction, coding, and text summarization. Possesses deep domain knowledge in finance, healthcare, law, and science.

Provider
openrouter
Type
completions
Context
131072
Score
78
Pricing
In: $4 / 1M tokens
Out: $20 / 1M tokens
#7
o1

The latest and strongest model family from OpenAI, o1 is designed to spend more time thinking before responding. The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. The o1 models are optimized for math, science, programming, and other STEM-related tasks. They consistently exhibit PhD-level accuracy on benchmarks in physics, chemistry, and biology. Learn more in the [launch announcement](https://openai.com/o1).

Provider
openai
Type
completions
Context
200K
Score
77.6
Pricing
In: $15 / 1M tokens
Out: $60 / 1M tokens
#8
gpt-4.1

GPT-4.1 is OpenAI's flagship model for complex tasks. It is well suited for problem solving across domains.

Provider
openai
Type
completions
Context
1047576
Score
74.8
Pricing
In: $2 / 1M tokens
Out: $8 / 1M tokens
#9
claude-sonnet-4

Anthropic's high-performance model with exceptional reasoning and efficiency.

Provider
anthropic
Type
completions
Context
200K
Score
74.4
Pricing
In: $3 / 1M tokens
Out: $15 / 1M tokens
#10
llama-4-maverick

Llama 4 Maverick 17B Instruct (128E) is a high-capacity multimodal language model from Meta, built on a mixture-of-experts (MoE) architecture with 128 experts and 17 billion active parameters per forward pass (400B total). It supports multilingual text and image input, and produces multilingual text and code output across 12 supported languages. Optimized for vision-language tasks, Maverick is instruction-tuned for assistant-like behavior, image reasoning, and general-purpose multimodal interaction. Maverick features early fusion for native multimodality and a 1 million token context window. It was trained on a curated mixture of public, licensed, and Meta-platform data, covering ~22 trillion tokens, with a knowledge cutoff in August 2024. Released on April 5, 2025 under the Llama 4 Community License, Maverick is suited for research and commercial applications requiring advanced multimodal understanding and high model throughput.

Provider
deepinfra
Type
completions
Context
1048576
Score
73.4
Pricing
In: $0.15 / 1M tokens
Out: $0.6 / 1M tokens
#11
llama-4-maverick

Llama 4 Maverick 17B Instruct (128E) is a high-capacity multimodal language model from Meta, built on a mixture-of-experts (MoE) architecture with 128 experts and 17 billion active parameters per forward pass (400B total). It supports multilingual text and image input, and produces multilingual text and code output across 12 supported languages. Optimized for vision-language tasks, Maverick is instruction-tuned for assistant-like behavior, image reasoning, and general-purpose multimodal interaction. Maverick features early fusion for native multimodality and a 1 million token context window. It was trained on a curated mixture of public, licensed, and Meta-platform data, covering ~22 trillion tokens, with a knowledge cutoff in August 2024. Released on April 5, 2025 under the Llama 4 Community License, Maverick is suited for research and commercial applications requiring advanced multimodal understanding and high model throughput.

Provider
parasail
Type
completions
Context
1048576
Score
73.4
Pricing
In: $0.15 / 1M tokens
Out: $0.85 / 1M tokens
#12
gemini-2.5-flash-lite

Gemini 2.5 Flash-Lite is a lightweight reasoning model in the Gemini 2.5 family, optimized for ultra-low latency and cost efficiency. It offers improved throughput, faster token generation, and better performance across common benchmarks compared to earlier Flash models. By default, "thinking" (i.e. multi-pass reasoning) is disabled to prioritize speed, but developers can enable it via the [Reasoning API parameter](https://openrouter.ai/docs/use-cases/reasoning-tokens) to selectively trade off cost for intelligence.

Provider
openrouter
Type
completions
Context
1048576
Score
72.9
Pricing
In: $0.1 / 1M tokens
Out: $0.4 / 1M tokens
#13
gpt-4.1-mini

GPT-4.1 mini provides a balance between intelligence, speed, and cost that makes it an attractive model for many use cases.

Provider
openai
Type
completions
Context
1047576
Score
72.7
Pricing
In: $0.4 / 1M tokens
Out: $1.6 / 1M tokens
#14
gemini-2.0-flash

Google's most capable multi-modal model with great performance across all tasks, with a 1 million token context window, and built for the era of Agents.

Provider
gemini
Type
completions
Context
1M
Score
70.7
Pricing
In: $0.1 / 1M tokens
Out: $0.4 / 1M tokens
#15
llama-4-scout

Llama 4 Scout 17B Instruct (16E) is a mixture-of-experts (MoE) language model developed by Meta, activating 17 billion parameters out of a total of 109B. It supports native multimodal input (text and image) and multilingual output (text and code) across 12 supported languages. Designed for assistant-style interaction and visual reasoning, Scout uses 16 experts per forward pass and features a context length of 10 million tokens, with a training corpus of ~40 trillion tokens. Built for high efficiency and local or commercial deployment, Llama 4 Scout incorporates early fusion for seamless modality integration. It is instruction-tuned for use in multilingual chat, captioning, and image understanding tasks. Released under the Llama 4 Community License, it was last trained on data up to August 2024 and launched publicly on April 5, 2025.

Provider
deepinfra
Type
completions
Context
327680
Score
69.4
Pricing
In: $0.08 / 1M tokens
Out: $0.3 / 1M tokens
#16
llama-4-scout

Llama 4 Scout 17B Instruct (16E) is a mixture-of-experts (MoE) language model developed by Meta, activating 17 billion parameters out of a total of 109B. It supports native multimodal input (text and image) and multilingual output (text and code) across 12 supported languages. Designed for assistant-style interaction and visual reasoning, Scout uses 16 experts per forward pass and features a context length of 10 million tokens, with a training corpus of ~40 trillion tokens. Built for high efficiency and local or commercial deployment, Llama 4 Scout incorporates early fusion for seamless modality integration. It is instruction-tuned for use in multilingual chat, captioning, and image understanding tasks. Released under the Llama 4 Community License, it was last trained on data up to August 2024 and launched publicly on April 5, 2025.

Provider
openrouter
Type
completions
Context
1048576
Score
69.4
Pricing
In: $0.08 / 1M tokens
Out: $0.37 / 1M tokens
#17
gemini-2.0-flash-lite

Google's smallest and most cost-effective model, built for at-scale usage.

Provider
gemini
Type
completions
Context
1M
Score
68
Pricing
In: $0.07 / 1M tokens
Out: $0.3 / 1M tokens
#18
grok-2

Grok-2 is an advanced AI model developed by xAI, designed to provide highly accurate and helpful responses to a wide range of questions, often with a unique perspective on humanity.

Provider
xai
Type
completions
Context
131072
Score
66.1
Pricing
In: $2 / 1M tokens
Out: $10 / 1M tokens
#19
nova-pro-v1

Amazon Nova Pro 1.0 is a capable multimodal model from Amazon focused on providing a combination of accuracy, speed, and cost for a wide range of tasks. As of December 2024, it achieves state-of-the-art performance on key benchmarks including visual question answering (TextVQA) and video understanding (VATEX). Amazon Nova Pro demonstrates strong capabilities in processing both visual and textual information and in analyzing financial documents. **NOTE**: Video input is not supported at this time.

Provider
openrouter
Type
completions
Context
300K
Score
61.7
Pricing
In: $0.8 / 1M tokens
Out: $3.2 / 1M tokens
#20
gpt-4o-mini-search-preview

GPT-4o mini Search Preview is a specialized model trained to understand and execute web search queries with the Chat Completions API.

Provider
openai
Type
completions
Context
128K
Score
59.4
Pricing
In: $0.15 / 1M tokens
Out: $0.6 / 1M tokens
#21
nova-lite-v1

Amazon Nova Lite 1.0 is a very low-cost multimodal model from Amazon focused on fast processing of image, video, and text inputs to generate text output. Amazon Nova Lite can handle real-time customer interactions, document analysis, and visual question-answering tasks with high accuracy. With an input context of 300K tokens, it can analyze multiple images or up to 30 minutes of video in a single input.

Provider
openrouter
Type
completions
Context
300K
Score
56.2
Pricing
In: $0.06 / 1M tokens
Out: $0.24 / 1M tokens
#22
gpt-4.1-nano

GPT-4.1 nano is the fastest, most cost-effective GPT-4.1 model.

Provider
openai
Type
completions
Context
1047576
Score
55.4
Pricing
In: $0.1 / 1M tokens
Out: $0.4 / 1M tokens
#23
phi-4-multimodal-instruct

Phi-4 Multimodal Instruct is a versatile 5.6B parameter foundation model that combines advanced reasoning and instruction-following capabilities across both text and visual inputs, providing accurate text outputs. The unified architecture enables efficient, low-latency inference, suitable for edge and mobile deployments. Phi-4 Multimodal Instruct supports text inputs in multiple languages including Arabic, Chinese, English, French, German, Japanese, Spanish, and more, with visual input optimized primarily for English. It delivers impressive performance on multimodal tasks involving mathematical, scientific, and document reasoning, providing developers and enterprises a powerful yet compact model for sophisticated interactive applications. For more information, see the [Phi-4 Multimodal blog post](https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/).

Provider
deepinfra
Type
completions
Context
131072
Score
55.1
Pricing
In: $0.05 / 1M tokens
Out: $0.1 / 1M tokens
#24
phi-4-multimodal-instruct

Phi-4 Multimodal Instruct is a versatile 5.6B parameter foundation model that combines advanced reasoning and instruction-following capabilities across both text and visual inputs, providing accurate text outputs. The unified architecture enables efficient, low-latency inference, suitable for edge and mobile deployments. Phi-4 Multimodal Instruct supports text inputs in multiple languages including Arabic, Chinese, English, French, German, Japanese, Spanish, and more, with visual input optimized primarily for English. It delivers impressive performance on multimodal tasks involving mathematical, scientific, and document reasoning, providing developers and enterprises a powerful yet compact model for sophisticated interactive applications. For more information, see the [Phi-4 Multimodal blog post](https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/).

Provider
openrouter
Type
completions
Context
131072
Score
55.1
Pricing
In: $0.05 / 1M tokens
Out: $0.1 / 1M tokens
#25
gemini-1.5-flash-8b

A lightweight model that is smaller and faster than 1.5 Flash, with a lower price, higher rate limits, and lower latency on small prompts.

Provider
gemini
Type
completions
Context
1M
Score
53.7
Pricing
In: $0.04 / 1M tokens
Out: $0.15 / 1M tokens
#26
gpt-3.5-turbo

The latest GPT-3.5 Turbo model with higher accuracy at responding in requested formats and a fix for a bug which caused a text encoding issue for non-English language function calls.

Provider
openai
Type
completions
Context
16385
Score
0
Pricing
In: $0.5 / 1M tokens
Out: $1.5 / 1M tokens
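Since every entry quotes prices per 1M input and output tokens, the cost of a single request follows from simple arithmetic. The sketch below is a minimal Python illustration: the function name `request_cost` and the token counts are hypothetical, the prices are copied from the o3 and gemini-2.5-flash entries above, and it ignores how each provider converts images into tokens, which the listing does not describe.

```python
def request_cost(in_price: float, out_price: float,
                 in_tokens: int, out_tokens: int) -> float:
    """Return the cost in dollars of one request.

    in_price and out_price are dollars per 1M tokens, as listed
    under "Pricing" for each model.
    """
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000


# Hypothetical request size: 10,000 input tokens and 2,000 output tokens.
IN_TOKENS, OUT_TOKENS = 10_000, 2_000

# Prices copied from the table (dollars per 1M tokens).
o3_cost = request_cost(2, 8, IN_TOKENS, OUT_TOKENS)
flash_cost = request_cost(0.15, 0.6, IN_TOKENS, OUT_TOKENS)

print(f"o3: ${o3_cost:.4f} per request")
print(f"gemini-2.5-flash: ${flash_cost:.4f} per request")
```

Under these assumed token counts the two models differ in cost by more than a factor of ten while their listed scores differ by about three points, so the price columns are worth reading alongside the rank.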