Eval Harness
Simple evaluation harness for testing LLM outputs
Related Tools
- Human Eval Form: Create grading rubrics and forms for human evaluation of LLM outputs
- Latency Benchmark Recorder: Record and visualize latency metrics from your own API tests
- Model A/B Test Evaluator: Analyze results from model A/B tests for statistical significance
- BLEU & ROUGE Calculator: Calculate standard text generation metrics between reference and hypothesis
- Confusion Matrix Visualizer: Generate and analyze confusion matrices for classification models
What is an Eval Harness?
An evaluation harness is a framework for systematically testing LLM outputs against expected results. By defining test cases with inputs and expected outputs, you can measure how well your model or prompt performs across multiple scenarios.
This simple harness lets you define test cases, paste actual outputs, and score them using different matching strategies—perfect for quick prompt iteration and quality checks.
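As a rough illustration, a test case only needs an input, the expected output, and a slot for the actual output you paste in. The Python sketch below is a hypothetical representation of that structure, not the tool's internal format:

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    input: str        # prompt or question sent to the model
    expected: str     # the output you expect
    actual: str = ""  # the model output you paste in after running the prompt yourself

# Example test cases for a quick prompt check
cases = [
    TestCase(input="What is the capital of France?", expected="Paris"),
    TestCase(input="2 + 2 = ?", expected="4"),
]
```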
Scoring Methods
Exact Match
100% if strings match exactly (case-insensitive). Good for factual answers.
Contains
100% if expected appears anywhere in actual. Good for verbose responses.
Word Overlap
Percentage of expected words found in actual. A rough proxy for semantic similarity.
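To make the three strategies concrete, here is a minimal Python sketch of how such scoring could work; the function names and normalization rules are illustrative assumptions, not the tool's actual code:

```python
def exact_match(expected: str, actual: str) -> float:
    # 100 if the strings match exactly, ignoring case and surrounding whitespace
    return 100.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def contains(expected: str, actual: str) -> float:
    # 100 if the expected text appears anywhere in the actual output
    return 100.0 if expected.strip().lower() in actual.lower() else 0.0

def word_overlap(expected: str, actual: str) -> float:
    # Percentage of expected words that also appear in the actual output
    expected_words = set(expected.lower().split())
    actual_words = set(actual.lower().split())
    if not expected_words:
        return 0.0
    return 100.0 * len(expected_words & actual_words) / len(expected_words)

print(exact_match("Paris", "paris"))                            # 100.0
print(contains("Paris", "The capital is Paris."))               # 100.0
print(word_overlap("the quick brown fox", "a quick fox ran"))   # 50.0
```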
Best Practices
- Diverse test cases: Include edge cases, not just happy paths.
- Choose appropriate scoring: Exact for facts, contains for natural language.
- Set baselines: Run before changes to measure improvement.
- Document expectations: Write clear, unambiguous expected outputs.
FAQ
Does this call any LLM APIs?
No. You manually paste the actual outputs. This tool only scores—run your prompts separately.
What does pass rate measure?
Percentage of test cases scoring 80% or higher. Threshold can be customized for your needs.
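For example, with the default 80% threshold, five test cases scoring 100, 95, 82, 60, and 40 give a pass rate of 3/5 = 60%. A minimal sketch of that calculation (the threshold parameter is an assumption for illustration):

```python
def pass_rate(scores: list[float], threshold: float = 80.0) -> float:
    # Percentage of test cases whose score meets or exceeds the threshold
    if not scores:
        return 0.0
    passed = sum(1 for s in scores if s >= threshold)
    return 100.0 * passed / len(scores)

print(pass_rate([100.0, 95.0, 82.0, 60.0, 40.0]))  # 60.0
```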
