How It Works
Real Questions
Questions are selected from real production customer service conversations in e-commerce.
Same Prompt
All models receive the identical system prompt, knowledge base, and question.
Blind Evaluation
Evaluators see only 'Answer A' and 'Answer B'; they do not know which model wrote each answer.
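One way to picture the blinding step is the minimal sketch below. It is an illustration only, not the benchmark's actual code: the function name `anonymize`, the `seed` parameter, and the model names in the usage note are all hypothetical. It shuffles the answers, relabels them 'Answer A', 'Answer B', and so on, and keeps the label-to-model key separate from what the evaluator sees.

```python
import random


def anonymize(answers, seed=None):
    """Shuffle answers and relabel them 'Answer A', 'Answer B', ...

    `answers` maps a model name to its answer text (hypothetical structure).
    Returns (blinded, key):
      blinded: label -> answer text, the only thing shown to an evaluator
      key:     label -> model name, kept hidden until scoring is finished
    Supports up to 26 answers, since labels are single letters.
    """
    rng = random.Random(seed)
    models = list(answers)
    rng.shuffle(models)  # random order, so the label reveals nothing about the model

    blinded, key = {}, {}
    for i, model in enumerate(models):
        label = f"Answer {chr(ord('A') + i)}"
        blinded[label] = answers[model]
        key[label] = model
    return blinded, key
```

For example, `anonymize({"model-x": "...", "model-y": "..."})` returns a dictionary with the labels 'Answer A' and 'Answer B' for the evaluator, plus a separate key that maps each label back to its model once scores are recorded.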
Cross-Evaluation
Top-tier models from each provider evaluate answers. No model judges its own response.
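The rule that no model judges its own response can be sketched as a simple filter. The helper `judges_for` and the model names are hypothetical, and the page does not say how judge models are chosen per provider, so this only illustrates the exclusion itself.

```python
def judges_for(author, judge_models):
    """Return the judge models allowed to score an answer written by `author`.

    The only rule applied is the one stated above: a model never evaluates
    its own response. Every other judge model remains eligible.
    """
    return [judge for judge in judge_models if judge != author]


# Hypothetical judge pool, one top-tier model per provider.
JUDGE_POOL = ["provider-a-top", "provider-b-top", "provider-c-top"]
```

With this pool, an answer written by `provider-b-top` would be scored by the other two judges, while an answer from a model outside the pool would be scored by all three.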
Scoring Criteria
Each answer is scored 0-100 on six criteria with the following weights:
Accuracy 30%
Relevance 20%
Completeness 15%
Helpfulness 15%
Tone 10%
Conciseness 10%
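The weights above sum to 100%, so a natural reading is that the overall score is the weighted average of the six criterion scores. The sketch below assumes exactly that; the function name `weighted_score` and the sample scores are hypothetical, and the page does not state how scores from several evaluators are combined, so that step is left out.

```python
# Weights in percent, copied from the list above.
WEIGHTS = {
    "accuracy": 30,
    "relevance": 20,
    "completeness": 15,
    "helpfulness": 15,
    "tone": 10,
    "conciseness": 10,
}


def weighted_score(scores):
    """Combine six 0-100 criterion scores into one 0-100 overall score.

    Assumes the overall score is a plain weighted average; this is an
    interpretation of the weights, not a confirmed formula.
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= scores[name] <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores for a single answer.
example = {
    "accuracy": 90,
    "relevance": 80,
    "completeness": 70,
    "helpfulness": 60,
    "tone": 100,
    "conciseness": 50,
}
overall = weighted_score(example)  # 27 + 16 + 10.5 + 9 + 10 + 5 = 77.5
```

Because Accuracy carries 30% of the total, a 10-point drop in accuracy costs 3 points overall, three times the cost of the same drop in Tone or Conciseness.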