How It Works
Real Questions
Selected from actual production customer service conversations in e-commerce.
Same Prompt
All models receive the identical system prompt, knowledge base, and question.
Blind Evaluation
Evaluators see only 'Answer A' and 'Answer B'; they don't know which model wrote which answer.
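One way to picture the blinding step is the minimal sketch below. It is an illustration under assumptions, not the pipeline actually used here: the function name `blind_answers`, the shuffling strategy, and the model names in the usage note are all hypothetical.

```python
import random
import string


def blind_answers(answers_by_model, rng):
    """Replace model names with neutral labels ('Answer A', 'Answer B', ...).

    Returns (blinded, key):
      blinded maps label -> answer text (this is what an evaluator would see)
      key     maps label -> model name (kept hidden until scoring is done)
    """
    models = list(answers_by_model)
    if len(models) > len(string.ascii_uppercase):
        raise ValueError("this sketch only supports up to 26 answers")
    # Shuffle so that the label position reveals nothing about the model.
    rng.shuffle(models)
    labels = [f"Answer {letter}" for letter in string.ascii_uppercase[:len(models)]]
    key = dict(zip(labels, models))
    blinded = {label: answers_by_model[model] for label, model in key.items()}
    return blinded, key
```

For example, `blind_answers({"model-x": "Refunds take 5 days.", "model-y": "Please allow 5 days."}, random.Random())` returns the two answers under the labels 'Answer A' and 'Answer B' in random order, plus the hidden key needed to map scores back to models afterwards.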
Cross-Evaluation
Top-tier models from each provider evaluate answers. No model judges its own response.
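The rule that no model judges its own response can be sketched as a simple filter. This is a hypothetical illustration: `build_judging_plan` and the model names are assumptions, and the real judge selection may apply further rules not described here.

```python
def build_judging_plan(answering_models, judge_models):
    """Map each answering model to the judges allowed to score its answers.

    A judge is excluded only when it is the author of the answer,
    mirroring the rule that no model judges its own response.
    """
    plan = {}
    for author in answering_models:
        plan[author] = [judge for judge in judge_models if judge != author]
    return plan


# Hypothetical model names, for illustration only.
plan = build_judging_plan(
    answering_models=["model-a", "model-b", "model-c"],
    judge_models=["model-a", "model-b"],
)
print(plan)
```

In this example `model-a` is judged only by `model-b`, `model-b` only by `model-a`, and `model-c`, which is not a judge itself, is scored by both.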
Scoring Criteria
Each answer is scored 0-100 on six criteria with the following weights:
To keep the comparison fair, public scores are calculated only from questions answered by every model included in the selected comparison set. That prevents newer or retired models from benefiting from an easier question mix.
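The common-question rule above can be sketched as a set intersection. This is a minimal illustration under assumptions: the function name `comparable_scores`, the data layout, and the use of a plain mean to aggregate per-question scores are all hypothetical, since the exact aggregation is not stated here.

```python
def comparable_scores(scores, selected_models):
    """Average each model's scores over questions answered by every selected model.

    scores: {model_name: {question_id: score on the 0-100 scale}}
    Returns {model_name: mean score}, or None per model if no question is shared.
    """
    if not selected_models:
        return {}
    # Keep only the questions that every selected model has answered.
    common = set.intersection(*(set(scores[m]) for m in selected_models))
    if not common:
        return {m: None for m in selected_models}
    return {
        m: sum(scores[m][q] for q in common) / len(common)
        for m in selected_models
    }


# Hypothetical data: model-b never answered q3, so q3 is ignored for both models.
scores = {
    "model-a": {"q1": 80, "q2": 60, "q3": 100},
    "model-b": {"q1": 70, "q2": 90},
}
print(comparable_scores(scores, ["model-a", "model-b"]))
# prints {'model-a': 70.0, 'model-b': 80.0}
```

Without the intersection, `model-a` would average 80.0 thanks to its high score on `q3`, a question `model-b` was never asked. Restricting both models to `q1` and `q2` removes that advantage.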