Eval Harness
Simple evaluation harness for testing LLM outputs
Related Tools
- Human Eval Form: Create grading rubrics and forms for human evaluation of LLM outputs
- Latency Benchmark Recorder: Record and visualize latency metrics from your own API tests
- Model A/B Test Evaluator: Analyze results from model A/B tests for statistical significance
- BLEU & ROUGE Calculator: Calculate standard text generation metrics between reference and hypothesis
- Confusion Matrix Visualizer: Generate and analyze confusion matrices for classification models
What is an Eval Harness?
An evaluation harness is a framework for systematically testing LLM outputs against expected results. By defining test cases with inputs and expected outputs, you can measure how well your model or prompt performs across multiple scenarios.
This simple harness lets you define test cases, paste actual outputs, and score them using different matching strategies—perfect for quick prompt iteration and quality checks.
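As a rough illustration, a test case only needs an input, the expected output, and a slot for the actual output you paste in. The Python sketch below is a hypothetical representation of that structure, not the tool's internal format:

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    input: str        # prompt or question sent to the model
    expected: str     # the output you expect
    actual: str = ""  # the model output you paste in after running the prompt yourself

# Example test cases for a quick prompt check
cases = [
    TestCase(input="What is the capital of France?", expected="Paris"),
    TestCase(input="2 + 2 = ?", expected="4"),
]
```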
Scoring Methods
Exact Match
100% if strings match exactly (case-insensitive). Good for factual answers.
Contains
100% if expected appears anywhere in actual. Good for verbose responses.
Word Overlap
Percentage of expected words found in actual. A rough proxy for semantic similarity.
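To make the three strategies concrete, here is a minimal Python sketch of how such scoring could work; the function names and normalization rules are illustrative assumptions, not the tool's actual code:

```python
def exact_match(expected: str, actual: str) -> float:
    # 100 if the strings match exactly, ignoring case and surrounding whitespace
    return 100.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def contains(expected: str, actual: str) -> float:
    # 100 if the expected text appears anywhere in the actual output
    return 100.0 if expected.strip().lower() in actual.lower() else 0.0

def word_overlap(expected: str, actual: str) -> float:
    # Percentage of expected words that also appear in the actual output
    expected_words = set(expected.lower().split())
    actual_words = set(actual.lower().split())
    if not expected_words:
        return 0.0
    return 100.0 * len(expected_words & actual_words) / len(expected_words)

print(exact_match("Paris", "paris"))                            # 100.0
print(contains("Paris", "The capital is Paris."))               # 100.0
print(word_overlap("the quick brown fox", "a quick fox ran"))   # 50.0
```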
Best Practices
- Diverse test cases: Include edge cases, not just happy paths.
- Choose appropriate scoring: Exact for facts, contains for natural language.
- Set baselines: Run before changes to measure improvement.
- Document expectations: Write clear, unambiguous expected outputs.
FAQ
Does this call any LLM APIs?
No. You manually paste the actual outputs. This tool only scores—run your prompts separately.
What does pass rate measure?
Percentage of test cases scoring 80% or higher. Threshold can be customized for your needs.
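For example, with the default 80% threshold, five test cases scoring 100, 95, 82, 60, and 40 give a pass rate of 3/5 = 60%. A minimal sketch of that calculation (the threshold parameter is an assumption for illustration):

```python
def pass_rate(scores: list[float], threshold: float = 80.0) -> float:
    # Percentage of test cases whose score meets or exceeds the threshold
    if not scores:
        return 0.0
    passed = sum(1 for s in scores if s >= threshold)
    return 100.0 * passed / len(scores)

print(pass_rate([100.0, 95.0, 82.0, 60.0, 40.0]))  # 60.0
```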
