AI Customer Service Model Benchmark

How it works

Questions are selected from actual production customer service conversations in e-commerce.

All models receive the identical system prompt, knowledge base, and question.

Evaluators see only 'Answer A' and 'Answer B'; they don't know which model wrote which answer.

Top-tier models from each provider evaluate answers. No model judges its own response.
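The blinding and no-self-judging rules can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the function names blind_answers and eligible_judges, the model names in the usage note, and the random shuffling of labels are all assumptions made for the example.

```python
import random
import string


def blind_answers(answers, rng):
    """Relabel answers as 'Answer A', 'Answer B', ... in shuffled order.

    answers: dict mapping model name -> answer text (hypothetical shape).
    Returns (blinded, key): blinded maps label -> answer text and is what
    an evaluator would see; key maps label -> model name and stays hidden.
    Supports up to 26 answers, one per letter.
    """
    models = list(answers)
    rng.shuffle(models)  # assumed: label order is randomized per question
    labels = [f"Answer {c}" for c in string.ascii_uppercase[:len(models)]]
    blinded = {label: answers[model] for label, model in zip(labels, models)}
    key = dict(zip(labels, models))
    return blinded, key


def eligible_judges(author, judges):
    """Return every judge model except the one that wrote the answer."""
    return [judge for judge in judges if judge != author]
```

For example, with hypothetical judges ["model-x", "model-y", "model-z"], an answer written by "model-x" would be scored only by "model-y" and "model-z".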

Each answer is scored 0-100 on six criteria with the following weights:

Accuracy 30%

Relevance 20%

Completeness 15%

Helpfulness 15%

Tone 10%

Conciseness 10%
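One plausible way to combine the six criteria into a single number is a weighted sum. The sketch below assumes each criterion is scored separately on a 0-100 scale and that the overall score is the weighted average; the source lists the weights but does not spell out the aggregation, so treat the formula, the function name weighted_score, and the sample scores as assumptions.

```python
# Weights in percent, taken from the list above; they sum to 100.
WEIGHTS = {
    "accuracy": 30,
    "relevance": 20,
    "completeness": 15,
    "helpfulness": 15,
    "tone": 10,
    "conciseness": 10,
}


def weighted_score(scores):
    """Combine per-criterion scores (each 0-100) into one 0-100 score.

    Assumes a plain weighted average; the benchmark may aggregate differently.
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= scores[name] <= 100:
            raise ValueError(f"{name} score must be between 0 and 100")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores for one answer from one evaluator.
example = {
    "accuracy": 90,
    "relevance": 80,
    "completeness": 70,
    "helpfulness": 70,
    "tone": 100,
    "conciseness": 60,
}
print(weighted_score(example))  # 80.0
```

Because Accuracy carries 30% of the weight, a 10-point drop in accuracy costs 3 points overall, while the same drop in Tone or Conciseness costs only 1.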