
Vision Model Rankings

Top-performing AI models, ranked by vision benchmark score

Total Models: 25
Providers: 5
Avg Score: 67.11
Updated: Dec 4, 2025
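The Avg Score figure appears to be a plain arithmetic mean over all 25 listed scores, including the two entries with a score of 0. That is an inference from the numbers on this page, not documented behavior. A minimal Python sketch that reproduces it from the scores in the ranking entries (the `mean` helper and the variable names are illustrative):

```python
# Scores copied from the ranking entries on this page, in rank order.
scores = [
    82.9, 81.6,
    79.7, 79.7, 79.7, 79.7, 79.7,
    79.6, 79.6, 79.6,
    77.6, 74.8, 73.4, 72.9,
    70.7, 70.7,
    69.4, 68, 66.1, 61.7, 59.4, 56.2, 55.1,
    0, 0,
]


def mean(values):
    """Plain arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Mean over every listed model, zeros included.
avg_all = round(mean(scores), 2)

# Assumption: a score of 0 may mean "no vision result recorded",
# so the mean over non-zero scores is shown for comparison.
avg_scored = round(mean([s for s in scores if s > 0]), 2)

print(avg_all)     # 67.11
print(avg_scored)  # 72.95
```

If the zero scores do stand for missing results, the second number is the more representative average for the models that were actually benchmarked.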

Rank
Model
Details
#1
o3

o3 is a well-rounded and powerful model across domains. It sets a new standard for math, science, coding, and visual reasoning tasks. It also excels at technical writing and instruction-following. Use it to think through multi-step problems that involve analysis across text, code, and images.

Provider: openai
Type: completions
Context: 200K
Score: 82.9
Pricing: In $2 / 1M tokens, Out $8 / 1M tokens
#2
o4-mini

o4-mini is the latest small o-series model. It's optimized for fast, effective reasoning with exceptionally efficient performance in coding and visual tasks.

Provider: openai
Type: completions
Context: 200K
Score: 81.6
Pricing: In $1.1 / 1M tokens, Out $4.4 / 1M tokens
#3
gemini-2.5-flash

Google's best model in terms of price-performance, offering well-rounded capabilities.

Provider: gemini
Type: completions
Context: 1M
Score: 79.7
Pricing: In $0.15 / 1M tokens, Out $0.6 / 1M tokens
#4
gemini-2.5-flash-preview

Google's best model in terms of price-performance, offering well-rounded capabilities. Gemini 2.5 Flash rate limits are more restricted since it is an experimental / preview model.

Provider: gemini
Type: completions
Context: 1M
Score: 79.7
Pricing: In $0.15 / 1M tokens, Out $0.6 / 1M tokens
#5
gemini-2.5-flash-image

Gemini 2.5 Flash Image, a.k.a. "Nano Banana," is now generally available. It is a state-of-the-art image generation model with contextual understanding. It is capable of image generation, edits, and multi-turn conversations. Aspect ratios can be controlled with the [image_config API Parameter](https://openrouter.ai/docs/features/multimodal/image-generation#image-aspect-ratio-configuration).

Provider: openrouter
Type: completions
Context: 32768
Score: 79.7
Pricing: In $0.3 / 1M tokens, Out $2.5 / 1M tokens
#6
gemini-2.5-flash-image-preview

Gemini 2.5 Flash Image Preview is a state-of-the-art image generation model with contextual understanding. It is capable of image generation, edits, and multi-turn conversations.

Provider: openrouter
Type: completions
Context: 32768
Score: 79.7
Pricing: In $0.3 / 1M tokens, Out $2.5 / 1M tokens
#7
gemini-2.5-flash-lite-preview-06-17

Gemini 2.5 Flash-Lite is a lightweight reasoning model in the Gemini 2.5 family, optimized for ultra-low latency and cost efficiency. It offers improved throughput, faster token generation, and better performance across common benchmarks compared to earlier Flash models. By default, "thinking" (i.e. multi-pass reasoning) is disabled to prioritize speed, but developers can enable it via the [Reasoning API parameter](https://openrouter.ai/docs/use-cases/reasoning-tokens) to selectively trade off cost for intelligence.

Provider: openrouter
Type: completions
Context: 1048576
Score: 79.7
Pricing: In $0.1 / 1M tokens, Out $0.4 / 1M tokens
#8
gemini-2.5-pro

Gemini 2.5 Pro is Google's most advanced reasoning Gemini model, capable of solving complex problems.

Provider: gemini
Type: completions
Context: 1M
Score: 79.6
Pricing: In $1.25 / 1M tokens, Out $10 / 1M tokens
#9
gemini-2.5-pro-preview

Gemini 2.5 Pro Experimental is Google's state-of-the-art thinking model, capable of reasoning over complex problems in code, math, and STEM, as well as analyzing large datasets, codebases, and documents using long context.

Provider: gemini
Type: completions
Context: 1M
Score: 79.6
Pricing: In $1.25 / 1M tokens, Out $10 / 1M tokens
#10
gemini-2.5-pro-preview-05-06

Gemini 2.5 Pro is Google’s state-of-the-art AI model designed for advanced reasoning, coding, mathematics, and scientific tasks. It employs “thinking” capabilities, enabling it to reason through responses with enhanced accuracy and nuanced context handling. Gemini 2.5 Pro achieves top-tier performance on multiple benchmarks, including first-place positioning on the LMArena leaderboard, reflecting superior human-preference alignment and complex problem-solving abilities.

Provider: openrouter
Type: completions
Context: 1048576
Score: 79.6
Pricing: In $1.25 / 1M tokens, Out $10 / 1M tokens
#11
o1

The latest and strongest model family from OpenAI, o1 is designed to spend more time thinking before responding. The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. The o1 models are optimized for math, science, programming, and other STEM-related tasks. They consistently exhibit PhD-level accuracy on benchmarks in physics, chemistry, and biology. Learn more in the [launch announcement](https://openai.com/o1).

Provider: openai
Type: completions
Context: 200K
Score: 77.6
Pricing: In $15 / 1M tokens, Out $60 / 1M tokens
#12
gpt-4.1

GPT-4.1 is OpenAI's flagship model for complex tasks. It is well suited for problem solving across domains.

Provider: openai
Type: completions
Context: 1047576
Score: 74.8
Pricing: In $2 / 1M tokens, Out $8 / 1M tokens
#13
llama-4-maverick

Llama 4 Maverick 17B Instruct (128E) is a high-capacity multimodal language model from Meta, built on a mixture-of-experts (MoE) architecture with 128 experts and 17 billion active parameters per forward pass (400B total). It supports multilingual text and image input, and produces multilingual text and code output across 12 supported languages. Optimized for vision-language tasks, Maverick is instruction-tuned for assistant-like behavior, image reasoning, and general-purpose multimodal interaction. Maverick features early fusion for native multimodality and a 1 million token context window. It was trained on a curated mixture of public, licensed, and Meta-platform data, covering ~22 trillion tokens, with a knowledge cutoff in August 2024. Released on April 5, 2025 under the Llama 4 Community License, Maverick is suited for research and commercial applications requiring advanced multimodal understanding and high model throughput.

Provider: deepinfra
Type: completions
Context: 1048576
Score: 73.4
Pricing: In $0.15 / 1M tokens, Out $0.6 / 1M tokens
#14
gemini-2.5-flash-lite

Gemini 2.5 Flash-Lite is a lightweight reasoning model in the Gemini 2.5 family, optimized for ultra-low latency and cost efficiency. It offers improved throughput, faster token generation, and better performance across common benchmarks compared to earlier Flash models. By default, "thinking" (i.e. multi-pass reasoning) is disabled to prioritize speed, but developers can enable it via the [Reasoning API parameter](https://openrouter.ai/docs/use-cases/reasoning-tokens) to selectively trade off cost for intelligence.

Provider: openrouter
Type: completions
Context: 1048576
Score: 72.9
Pricing: In $0.1 / 1M tokens, Out $0.4 / 1M tokens
#15
gemini-2.0-flash

Google's most capable multi-modal model with great performance across all tasks, with a 1 million token context window, and built for the era of Agents.

Provider: gemini
Type: completions
Context: 1M
Score: 70.7
Pricing: In $0.1 / 1M tokens, Out $0.4 / 1M tokens
#16
gemini-2.0-flash-001

Gemini Flash 2.0 offers a significantly faster time to first token (TTFT) compared to [Gemini Flash 1.5](/google/gemini-flash-1.5), while maintaining quality on par with larger models like [Gemini Pro 1.5](/google/gemini-pro-1.5). It introduces notable enhancements in multimodal understanding, coding capabilities, complex instruction following, and function calling. These advancements come together to deliver more seamless and robust agentic experiences.

Provider: openrouter
Type: completions
Context: 1048576
Score: 70.7
Pricing: In $0.13 / 1M tokens, Out $0.5 / 1M tokens
#17
llama-4-scout

Llama 4 Scout 17B Instruct (16E) is a mixture-of-experts (MoE) language model developed by Meta, activating 17 billion parameters out of a total of 109B. It supports native multimodal input (text and image) and multilingual output (text and code) across 12 supported languages. Designed for assistant-style interaction and visual reasoning, Scout uses 16 experts per forward pass and features a context length of 10 million tokens, with a training corpus of ~40 trillion tokens. Built for high efficiency and local or commercial deployment, Llama 4 Scout incorporates early fusion for seamless modality integration. It is instruction-tuned for use in multilingual chat, captioning, and image understanding tasks. Released under the Llama 4 Community License, it was last trained on data up to August 2024 and launched publicly on April 5, 2025.

Provider: deepinfra
Type: completions
Context: 327680
Score: 69.4
Pricing: In $0.08 / 1M tokens, Out $0.3 / 1M tokens
#18
gemini-2.0-flash-lite

Google's smallest and most cost-effective model, built for at-scale usage.

Provider: gemini
Type: completions
Context: 1M
Score: 68
Pricing: In $0.07 / 1M tokens, Out $0.3 / 1M tokens
#19
grok-2

Grok-2 is an advanced AI model developed by xAI, designed to provide highly accurate and helpful responses to a wide range of questions, often with a unique perspective on humanity.

Provider: xai
Type: completions
Context: 131072
Score: 66.1
Pricing: In $2 / 1M tokens, Out $10 / 1M tokens
#20
nova-pro-v1

Amazon Nova Pro 1.0 is a capable multimodal model from Amazon focused on providing a combination of accuracy, speed, and cost for a wide range of tasks. As of December 2024, it achieves state-of-the-art performance on key benchmarks including visual question answering (TextVQA) and video understanding (VATEX). Amazon Nova Pro demonstrates strong capabilities in processing both visual and textual information and in analyzing financial documents. **NOTE**: Video input is not supported at this time.

Provider: openrouter
Type: completions
Context: 300K
Score: 61.7
Pricing: In $0.8 / 1M tokens, Out $3.2 / 1M tokens
#21
gpt-4o-mini-search-preview

GPT-4o mini Search Preview is a specialized model trained to understand and execute web search queries with the Chat Completions API.

Provider: openai
Type: completions
Context: 128K
Score: 59.4
Pricing: In $0.15 / 1M tokens, Out $0.6 / 1M tokens
#22
nova-lite-v1

Amazon Nova Lite 1.0 is a very low-cost multimodal model from Amazon focused on fast processing of image, video, and text inputs to generate text output. Amazon Nova Lite can handle real-time customer interactions, document analysis, and visual question-answering tasks with high accuracy. With an input context of 300K tokens, it can analyze multiple images or up to 30 minutes of video in a single input.

Provider: openrouter
Type: completions
Context: 300K
Score: 56.2
Pricing: In $0.06 / 1M tokens, Out $0.24 / 1M tokens
#23
phi-4-multimodal-instruct

Phi-4 Multimodal Instruct is a versatile 5.6B parameter foundation model that combines advanced reasoning and instruction-following capabilities across both text and visual inputs, providing accurate text outputs. The unified architecture enables efficient, low-latency inference, suitable for edge and mobile deployments. Phi-4 Multimodal Instruct supports text inputs in multiple languages including Arabic, Chinese, English, French, German, Japanese, Spanish, and more, with visual input optimized primarily for English. It delivers impressive performance on multimodal tasks involving mathematical, scientific, and document reasoning, providing developers and enterprises a powerful yet compact model for sophisticated interactive applications. For more information, see the [Phi-4 Multimodal blog post](https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/).

Provider: deepinfra
Type: completions
Context: 131072
Score: 55.1
Pricing: In $0.05 / 1M tokens, Out $0.1 / 1M tokens
#24
gpt-3.5-turbo

The latest GPT-3.5 Turbo model with higher accuracy at responding in requested formats and a fix for a bug which caused a text encoding issue for non-English language function calls.

Provider: openai
Type: completions
Context: 16385
Score: 0
Pricing: In $0.5 / 1M tokens, Out $1.5 / 1M tokens
#25
gpt-3.5-turbo-instruct

This model is a variant of GPT-3.5 Turbo tuned for instructional prompts and omitting chat-related optimizations. Training data: up to Sep 2021.

Provider: openai
Type: completions
Context: 4095
Score: 0
Pricing: In $1.5 / 1M tokens, Out $2 / 1M tokens
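
Since every entry quotes prices per 1M tokens, a rough per-request cost can be estimated from token counts. The sketch below is illustrative only: `request_cost` is a hypothetical helper, the token counts are made-up examples, and actual billing may differ (for instance, in how image inputs are counted as tokens or whether a gateway adds fees). The prices are the ones listed for o3 and gemini-2.5-flash on this page.

```python
def request_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Estimate the USD cost of one request from per-1M-token prices.

    Hypothetical helper: it only applies the listed list prices and
    ignores any discounts, caching, or provider-specific surcharges.
    """
    per_million = 1_000_000
    input_cost = input_tokens * in_price_per_m / per_million
    output_cost = output_tokens * out_price_per_m / per_million
    return round(input_cost + output_cost, 6)


# Example request size (an assumption): 10,000 input and 2,000 output tokens.
o3_cost = request_cost(10_000, 2_000, in_price_per_m=2, out_price_per_m=8)
flash_cost = request_cost(10_000, 2_000, in_price_per_m=0.15, out_price_per_m=0.6)

print(f"o3:               ${o3_cost}")
print(f"gemini-2.5-flash: ${flash_cost}")
```

Under these assumptions the same request costs about $0.036 on o3 and about $0.0027 on gemini-2.5-flash, which puts the 3.2-point score gap between them in a price context.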
Vision Model Rankings | LangDB