Vision Model Rankings

#1

qwen3.5-122b-a10b

The Qwen3.5 122B-A10B native vision-language model is built on a hybrid architecture that integrates a linear attention mechanism with a sparse mixture-of-experts model, achieving higher inference efficiency. In terms of overall performance, this model is second only to Qwen3.5-397B-A17B. Its text capabilities significantly outperform those of Qwen3-235B-2507, and its visual capabilities surpass those of Qwen3-VL-235B.

Provider

openrouter

Type

completions

Context

262144

Score

83.9

Pricing

In: $0.32 / 1M tokens

Out: $2.56 / 1M tokens

#2

o3

o3 is a well-rounded and powerful model across domains. It sets a new standard for math, science, coding, and visual reasoning tasks. It also excels at technical writing and instruction-following. Use it to think through multi-step problems that involve analysis across text, code, and images.

Provider

openai

Type

completions

Context

200K

Score

82.9

Pricing

In: $2 / 1M tokens

Out: $8 / 1M tokens

#3

qwen3.5-27b

The Qwen3.5 27B native vision-language Dense model incorporates a linear attention mechanism, delivering fast response times while balancing inference speed and performance. Its overall capabilities are comparable to those of the Qwen3.5-122B-A10B.

Provider

openrouter

Type

completions

Context

262144

Score

82.3

Pricing

In: $0.26 / 1M tokens

Out: $2.04 / 1M tokens

#4

o4-mini

o4-mini is the latest small o-series model. It's optimized for fast, effective reasoning with exceptionally efficient performance in coding and visual tasks.

Provider

openai

Type

completions

Context

200K

Score

81.6

Pricing

In: $1.1 / 1M tokens

Out: $4.4 / 1M tokens

#5

qwen3.5-35b-a3b

The Qwen3.5 Series 35B-A3B is a native vision-language model designed with a hybrid architecture that integrates linear attention mechanisms and a sparse mixture-of-experts model, achieving higher inference efficiency. Its overall performance is comparable to that of the Qwen3.5-27B.

Provider

openrouter

Type

completions

Context

262144

Score

81.4

Pricing

In: $0.21 / 1M tokens

Out: $1.63 / 1M tokens

#6

gemini-2.5-flash

Google's best model in terms of price-performance, offering well-rounded capabilities.

Provider

gemini

Type

completions

Context

1M

Score

79.7

Pricing

In: $0.15 / 1M tokens

Out: $0.6 / 1M tokens

#7

gemini-2.5-flash-preview

Google's best model in terms of price-performance, offering well-rounded capabilities. Gemini 2.5 Flash rate limits are more restricted since it is an experimental / preview model.

Provider

gemini

Type

completions

Context

1M

Score

79.7

Pricing

In: $0.15 / 1M tokens

Out: $0.6 / 1M tokens

#8

gemini-2.5-flash-image

Gemini 2.5 Flash Image, a.k.a. "Nano Banana," is now generally available. It is a state-of-the-art image generation model with contextual understanding. It is capable of image generation, edits, and multi-turn conversations. Aspect ratios can be controlled with the [image_config API Parameter](https://openrouter.ai/docs/features/multimodal/image-generation#image-aspect-ratio-configuration).

Provider

openrouter

Type

completions

Context

32768

Score

79.7

Pricing

In: $0.3 / 1M tokens

Out: $2.5 / 1M tokens

#9

gemini-2.5-pro

Gemini 2.5 Pro is Google's most advanced reasoning Gemini model, capable of solving complex problems.

Provider

gemini

Type

completions

Context

1M

Score

79.6

Pricing

In: $1.25 / 1M tokens

Out: $10 / 1M tokens

#10

gemini-2.5-pro-preview

Gemini 2.5 Pro Experimental is Google's state-of-the-art thinking model, capable of reasoning over complex problems in code, math, and STEM, as well as analyzing large datasets, codebases, and documents using long context.

Provider

gemini

Type

completions

Context

1M

Score

79.6

Pricing

In: $1.25 / 1M tokens

Out: $10 / 1M tokens

#11

gemini-2.5-pro-preview-05-06

Gemini 2.5 Pro is Google’s state-of-the-art AI model designed for advanced reasoning, coding, mathematics, and scientific tasks. It employs “thinking” capabilities, enabling it to reason through responses with enhanced accuracy and nuanced context handling. Gemini 2.5 Pro achieves top-tier performance on multiple benchmarks, including first-place positioning on the LMArena leaderboard, reflecting superior human-preference alignment and complex problem-solving abilities.

Provider

openrouter

Type

completions

Context

1048576

Score

79.6

Pricing

In: $1.25 / 1M tokens

Out: $10 / 1M tokens

#12

o1

The latest and strongest model family from OpenAI, o1 is designed to spend more time thinking before responding. The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. The o1 models are optimized for math, science, programming, and other STEM-related tasks. They consistently exhibit PhD-level accuracy on benchmarks in physics, chemistry, and biology. Learn more in the [launch announcement](https://openai.com/o1).

Provider

openai

Type

completions

Context

200K

Score

77.6

Pricing

In: $15 / 1M tokens

Out: $60 / 1M tokens

#13

gpt-4.1

GPT-4.1 is OpenAI's flagship model for complex tasks. It is well suited for problem solving across domains.

Provider

openai

Type

completions

Context

1047576

Score

74.8

Pricing

In: $2 / 1M tokens

Out: $8 / 1M tokens

#14

llama-4-maverick

Llama 4 Maverick 17B Instruct (128E) is a high-capacity multimodal language model from Meta, built on a mixture-of-experts (MoE) architecture with 128 experts and 17 billion active parameters per forward pass (400B total). It supports multilingual text and image input, and produces multilingual text and code output across 12 supported languages. Optimized for vision-language tasks, Maverick is instruction-tuned for assistant-like behavior, image reasoning, and general-purpose multimodal interaction. Maverick features early fusion for native multimodality and a 1 million token context window. It was trained on a curated mixture of public, licensed, and Meta-platform data, covering ~22 trillion tokens, with a knowledge cutoff in August 2024. Released on April 5, 2025 under the Llama 4 Community License, Maverick is suited for research and commercial applications requiring advanced multimodal understanding and high model throughput.

Provider

deepinfra

Type

completions

Context

1048576

Score

73.4

Pricing

In: $0.15 / 1M tokens

Out: $0.6 / 1M tokens

#15

gemini-2.5-flash-lite

Gemini 2.5 Flash-Lite is a lightweight reasoning model in the Gemini 2.5 family, optimized for ultra-low latency and cost efficiency. It offers improved throughput, faster token generation, and better performance across common benchmarks compared to earlier Flash models. By default, "thinking" (i.e. multi-pass reasoning) is disabled to prioritize speed, but developers can enable it via the [Reasoning API parameter](https://openrouter.ai/docs/use-cases/reasoning-tokens) to selectively trade off cost for intelligence.

Provider

openrouter

Type

completions

Context

1048576

Score

72.9

Pricing

In: $0.1 / 1M tokens

Out: $0.4 / 1M tokens

#16

gemini-2.0-flash

Google's most capable multi-modal model with great performance across all tasks, with a 1 million token context window, and built for the era of Agents.

Provider

gemini

Type

completions

Context

1M

Score

70.7

Pricing

In: $0.1 / 1M tokens

Out: $0.4 / 1M tokens

#17

gemini-2.0-flash-001

Gemini Flash 2.0 offers a significantly faster time to first token (TTFT) compared to [Gemini Flash 1.5](/google/gemini-flash-1.5), while maintaining quality on par with larger models like [Gemini Pro 1.5](/google/gemini-pro-1.5). It introduces notable enhancements in multimodal understanding, coding capabilities, complex instruction following, and function calling. These advancements come together to deliver more seamless and robust agentic experiences.

Provider

openrouter

Type

completions

Context

1048576

Score

70.7

Pricing

In: $0.13 / 1M tokens

Out: $0.5 / 1M tokens

#18

llama-4-scout

Llama 4 Scout 17B Instruct (16E) is a mixture-of-experts (MoE) language model developed by Meta, activating 17 billion parameters out of a total of 109B. It supports native multimodal input (text and image) and multilingual output (text and code) across 12 supported languages. Designed for assistant-style interaction and visual reasoning, Scout uses 16 experts per forward pass and features a context length of 10 million tokens, with a training corpus of ~40 trillion tokens. Built for high efficiency and local or commercial deployment, Llama 4 Scout incorporates early fusion for seamless modality integration. It is instruction-tuned for use in multilingual chat, captioning, and image understanding tasks. Released under the Llama 4 Community License, it was last trained on data up to August 2024 and launched publicly on April 5, 2025.

Provider

deepinfra

Type

completions

Context

327680

Score

69.4

Pricing

In: $0.08 / 1M tokens

Out: $0.3 / 1M tokens

#19

gemini-2.0-flash-lite

Google's smallest and most cost-effective model, built for at-scale usage.

Provider

gemini

Type

completions

Context

1M

Score

68

Pricing

In: $0.07 / 1M tokens

Out: $0.3 / 1M tokens

#20

grok-2

Grok-2 is an advanced AI model developed by xAI, designed to provide highly accurate and helpful responses to a wide range of questions, often with a unique perspective on humanity.

Provider

xai

Type

completions

Context

131072

Score

66.1

Pricing

In: $2 / 1M tokens

Out: $10 / 1M tokens

#21

nova-pro-v1

Amazon Nova Pro 1.0 is a capable multimodal model from Amazon focused on providing a combination of accuracy, speed, and cost for a wide range of tasks. As of December 2024, it achieves state-of-the-art performance on key benchmarks including visual question answering (TextVQA) and video understanding (VATEX). Amazon Nova Pro demonstrates strong capabilities in processing both visual and textual information and in analyzing financial documents. **NOTE**: Video input is not supported at this time.

Provider

openrouter

Type

completions

Context

300K

Score

61.7

Pricing

In: $0.8 / 1M tokens

Out: $3.2 / 1M tokens

#22

gpt-4o-mini-search-preview

GPT-4o mini Search Preview is a specialized model trained to understand and execute web search queries with the Chat Completions API.

Provider

openai

Type

completions

Context

128K

Score

59.4

Pricing

In: $0.15 / 1M tokens

Out: $0.6 / 1M tokens

#23

nova-lite-v1

Amazon Nova Lite 1.0 is a very low-cost multimodal model from Amazon focused on fast processing of image, video, and text inputs to generate text output. Amazon Nova Lite can handle real-time customer interactions, document analysis, and visual question-answering tasks with high accuracy. With an input context of 300K tokens, it can analyze multiple images or up to 30 minutes of video in a single input.

Provider

openrouter

Type

completions

Context

300K

Score

56.2

Pricing

In: $0.06 / 1M tokens

Out: $0.24 / 1M tokens

#24

phi-4-multimodal-instruct

Phi-4 Multimodal Instruct is a versatile 5.6B parameter foundation model that combines advanced reasoning and instruction-following capabilities across both text and visual inputs, providing accurate text outputs. The unified architecture enables efficient, low-latency inference, suitable for edge and mobile deployments. Phi-4 Multimodal Instruct supports text inputs in multiple languages including Arabic, Chinese, English, French, German, Japanese, Spanish, and more, with visual input optimized primarily for English. It delivers impressive performance on multimodal tasks involving mathematical, scientific, and document reasoning, providing developers and enterprises a powerful yet compact model for sophisticated interactive applications. For more information, see the [Phi-4 Multimodal blog post](https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/).

Provider

deepinfra

Type

completions

Context

131072

Score

55.1

Pricing

In: $0.05 / 1M tokens

Out: $0.1 / 1M tokens

#25

gpt-3.5-turbo

The latest GPT-3.5 Turbo model with higher accuracy at responding in requested formats and a fix for a bug which caused a text encoding issue for non-English language function calls.

Provider

openai

Type

completions

Context

16385

Score

0

Pricing

In: $0.5 / 1M tokens

Out: $1.5 / 1M tokens

#26

gpt-3.5-turbo-instruct

This model is a variant of GPT-3.5 Turbo tuned for instructional prompts and omitting chat-related optimizations. Training data: up to Sep 2021.

Provider

openai

Type

completions

Context

4095

Score

0

Pricing

In: $1.5 / 1M tokens

Out: $2 / 1M tokens
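
The prices listed above are quoted per 1M tokens, so the cost of a single request depends on how many input and output tokens it uses. The following is a minimal sketch of that arithmetic, not part of the ranking itself: the `PRICES` table and the `request_cost` helper are hypothetical names introduced for illustration, and the three price pairs are copied from the listings for qwen3.5-122b-a10b, o3, and gemini-2.5-flash.

```python
# Prices in USD per 1M tokens as (input, output), copied from the listings above.
# PRICES and request_cost are illustrative names, not part of any provider API.
PRICES = {
    "qwen3.5-122b-a10b": (0.32, 2.56),
    "o3": (2.00, 8.00),
    "gemini-2.5-flash": (0.15, 0.60),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request from per-1M-token prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    # Compare the same workload (10,000 input and 1,000 output tokens) across models.
    for name in PRICES:
        print(f"{name}: ${request_cost(name, 10_000, 1_000):.6f}")
```

For a request with 10,000 input tokens and 1,000 output tokens, this works out to roughly $0.0058 for qwen3.5-122b-a10b and $0.028 for o3. Because output tokens are priced several times higher than input tokens, the ratio between the two matters as much as the headline input price.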
