AI Model Benchmark

Real-world e-commerce customer service scenarios, blind evaluation by multiple AI judges

No benchmark data available yet

Benchmark results will appear here once evaluations are completed.

How It Works

Real Questions

Questions are selected from actual production customer service conversations in e-commerce.

Same Prompt

All models receive the identical system prompt, knowledge base, and question.

Blind Evaluation

Evaluators see only the labels 'Answer A', 'Answer B', and so on; they do not know which model wrote which answer.
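As an illustration of the blinding step, here is a minimal Python sketch. It assumes answers arrive as a mapping from model name to answer text; the function name `blind_answers` and the model names are hypothetical and are not taken from the benchmark's actual code. The sketch shuffles the answers, relabels them 'Answer A', 'Answer B', and so on, and keeps the label-to-model key separate so judges never see it.

```python
import random
import string


def blind_answers(answers_by_model, rng=None):
    """Shuffle answers and relabel them 'Answer A', 'Answer B', ...

    Returns (blinded, key):
      blinded maps label -> answer text (this is all a judge sees),
      key maps label -> model name (kept private until scoring is done).
    Hypothetical helper; supports up to 26 answers per question.
    """
    if len(answers_by_model) > len(string.ascii_uppercase):
        raise ValueError("this sketch only labels up to 26 answers")
    rng = rng or random.Random()
    models = list(answers_by_model)
    rng.shuffle(models)  # random order so labels carry no hint about the author
    blinded, key = {}, {}
    for letter, model in zip(string.ascii_uppercase, models):
        label = f"Answer {letter}"
        blinded[label] = answers_by_model[model]
        key[label] = model
    return blinded, key


# Example with made-up model names and answers.
blinded, key = blind_answers(
    {"model-x": "You can return it within 30 days.", "model-y": "Returns take 30 days."},
    rng=random.Random(0),
)
```

Only `blinded` would be sent to the evaluators; `key` is used afterwards to attribute scores back to models.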

Cross-Evaluation

Top-tier models from each provider evaluate answers. No model judges its own response.
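The rule that no model judges its own response can be sketched as a simple filter. This is an assumption about how judge assignment might work, not the benchmark's actual implementation; `assign_judges` and the model names are hypothetical.

```python
def assign_judges(answer_models, judge_models):
    """Map each answering model to the judges allowed to score it.

    A judge is allowed for an answer only if it did not write that answer.
    Hypothetical helper for illustration.
    """
    return {
        author: [judge for judge in judge_models if judge != author]
        for author in answer_models
    }


# Example with made-up names: two of the answering models also act as judges.
assignments = assign_judges(
    answer_models=["model-x", "model-y", "model-z"],
    judge_models=["model-x", "model-y"],
)
```

Here `model-x` is scored only by `model-y`, `model-y` only by `model-x`, and `model-z` by both judges.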

Scoring Criteria

Each answer is scored 0-100 on six criteria with the following weights:

Accuracy 30%
Relevance 20%
Completeness 15%
Helpfulness 15%
Tone 10%
Conciseness 10%
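To show how these weights might combine into a single number, here is a short Python sketch. It assumes the overall score is the weighted average of the six per-criterion scores; the page lists the weights but does not spell out the aggregation formula, so treat this as one plausible reading. The names `WEIGHTS` and `overall_score` are hypothetical.

```python
# Weights in percent, as listed above; they sum to 100.
WEIGHTS = {
    "accuracy": 30,
    "relevance": 20,
    "completeness": 15,
    "helpfulness": 15,
    "tone": 10,
    "conciseness": 10,
}


def overall_score(scores):
    """Weighted average of per-criterion scores, each on a 0-100 scale.

    Assumption: the overall score is sum(weight * score) / 100.
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= scores[name] <= 100:
            raise ValueError(f"{name} score must be between 0 and 100")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Made-up scores from one judge for one answer.
example = overall_score({
    "accuracy": 90,
    "relevance": 80,
    "completeness": 70,
    "helpfulness": 70,
    "tone": 100,
    "conciseness": 60,
})
```

With these made-up scores the contributions are 27 + 16 + 10.5 + 10.5 + 10 + 6, so `example` is 80.0. Because Accuracy carries the largest weight, a 10-point drop in accuracy costs 3 points overall, while the same drop in Tone or Conciseness costs only 1.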

Copyright © Chaterimo
