AI Model Benchmark

Real-world e-commerce customer service scenarios, blind evaluation by multiple AI judges

19 questions evaluated, 783 evaluations performed. Last updated: Apr 09, 2026
#   Model                    Provider   Overall Score  Evaluations  Avg Response
1   ChatGPT 5.4              OpenAI     73.7           46           7.6s
2   Claude Sonnet 4.6        Anthropic  67.9           67           6.7s
3   ChatGPT 5.4 mini         OpenAI     67.8           68           3.5s
4   Gemini 3.1 Flash-Lite    Google     65.5           64           2.6s
5   ChatGPT 4.1              OpenAI     65.0           67           5.3s
6   Claude Haiku 4.5         Anthropic  64.0           63           4.3s
7   Claude Opus 4.6          Anthropic  63.7           44           9.4s
8   Grok 4.1 Fast            xAI        62.5           62           2.7s
9   ChatGPT 4.1 mini         OpenAI     62.0           68           5.5s
10  Gemini 3.1 Pro Preview   Google     60.8           57           12.4s
11  Grok 4                   xAI        58.1           67           28.7s
12  Gemini 3 Flash           Google     56.4           64           9.7s
13  Grok 4.20                xAI        55.6           46           2.9s

Score Breakdown

Model                   Accuracy (30%)  Relevance (20%)  Completeness (15%)  Helpfulness (15%)  Tone (10%)  Conciseness (10%)
ChatGPT 5.4             66.5            81.9             67.9                74.2               84.7        75.6
Claude Sonnet 4.6       57.7            77.7             60.4                65.6               84.2        77.0
ChatGPT 5.4 mini        63.4            73.7             57.5                63.2               79.7        79.6
Gemini 3.1 Flash-Lite   54.4            76.1             60.6                63.3               82.5        71.2
ChatGPT 4.1             51.2            77.3             59.5                62.1               83.9        75.9
Claude Haiku 4.5        52.6            74.1             57.5                61.2               82.6        73.1
Claude Opus 4.6         50.2            76.3             58.6                59.2               82.7        74.2
Grok 4.1 Fast           49.6            74.4             58.1                60.1               78.4        72.1
ChatGPT 4.1 mini        49.9            73.3             54.6                57.1               81.5        75.0
Gemini 3.1 Pro Preview  58.5            66.9             47.1                54.7               78.1        67.7
Grok 4                  44.3            69.5             54.0                54.3               79.2        67.6
Gemini 3 Flash          47.1            67.6             48.7                52.1               76.3        60.0
Grok 4.20               34.6            72.8             53.8                52.3               77.9        69.7

How It Works

Real Questions

Selected from actual production customer service conversations in e-commerce.

Same Prompt

All models receive the identical system prompt, knowledge base, and question.

Blind Evaluation

Evaluators see only 'Answer A', 'Answer B', and so on; they do not know which model wrote each answer.
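
One way the blinding step might look in practice is sketched below. The function name `blind_answers`, the shuffling of answer order, and the sample model names are assumptions made for illustration; the page states only that evaluators see neutral labels instead of model names.

```python
import random
import string


def blind_answers(answers_by_model, rng):
    """Replace model names with neutral labels ('Answer A', 'Answer B', ...).

    Returns (labelled, key):
      labelled maps each label to the answer text shown to the evaluator,
      key maps each label back to the model and is kept away from the evaluator.
    Shuffling the order is an assumption; the source only says labels are anonymous.
    """
    models = list(answers_by_model)
    if len(models) > len(string.ascii_uppercase):
        raise ValueError("this sketch supports at most 26 answers per question")
    rng.shuffle(models)  # random order so a label carries no hint about the author
    labelled, key = {}, {}
    for letter, model in zip(string.ascii_uppercase, models):
        label = f"Answer {letter}"
        labelled[label] = answers_by_model[model]
        key[label] = model
    return labelled, key


# Hypothetical answers from three placeholder models.
sample = {
    "model-a": "Your order ships within 2 business days.",
    "model-b": "Orders usually leave the warehouse in two days.",
    "model-c": "Shipping takes 2 days.",
}
labelled, key = blind_answers(sample, random.Random(0))
```

Only `labelled` would be passed to an evaluator; `key` is used afterwards to attribute scores to models.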

Cross-Evaluation

Top-tier models from each provider evaluate answers. No model judges its own response.
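
The self-exclusion rule can be expressed as a simple filter over judge and candidate pairs. This is a minimal sketch: `judging_pairs` and the placeholder model names are hypothetical, and the page does not say which models actually serve as judges.

```python
def judging_pairs(judges, candidates):
    """List every (judge, candidate) pair except those where a model would judge itself."""
    return [(judge, cand) for judge in judges for cand in candidates if judge != cand]


# Placeholder names: two judges, both of which also answer questions.
judges = ["model-a", "model-b"]
candidates = ["model-a", "model-b", "model-c"]
pairs = judging_pairs(judges, candidates)
```

Under this rule a model that also acts as a judge receives one fewer evaluation per question than a model that does not.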

Scoring Criteria

Each answer is scored 0-100 on six criteria with the following weights:

Accuracy 30%
Relevance 20%
Completeness 15%
Helpfulness 15%
Tone 10%
Conciseness 10%
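
The weights above can be combined into a single number as a weighted mean. The sketch below assumes that is how the overall score is derived; the page does not give the formula, but the assumption is consistent with the published figures to within rounding. The names `WEIGHTS` and `overall_score` are illustrative.

```python
# Weights as listed in the scoring criteria; they sum to 1.0.
WEIGHTS = {
    "accuracy": 0.30,
    "relevance": 0.20,
    "completeness": 0.15,
    "helpfulness": 0.15,
    "tone": 0.10,
    "conciseness": 0.10,
}


def overall_score(scores: dict[str, float]) -> float:
    """Weighted mean of the six criterion scores, each on a 0-100 scale."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Criterion averages for ChatGPT 5.4, taken from the Score Breakdown table.
chatgpt_5_4 = {
    "accuracy": 66.5, "relevance": 81.9, "completeness": 67.9,
    "helpfulness": 74.2, "tone": 84.7, "conciseness": 75.6,
}
print(f"{overall_score(chatgpt_5_4):.2f}")
```

For ChatGPT 5.4 the weighted mean lands within rounding distance of the 73.7 shown on the leaderboard. Because accuracy carries 30% of the weight, a model with strong tone and conciseness scores can still rank low if its accuracy is weak.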

Copyright © Chaterimo
