Eval Harness

Simple evaluation harness for testing LLM outputs

What is an Eval Harness?

An evaluation harness is a framework for systematically testing LLM outputs against expected results. By defining test cases with inputs and expected outputs, you can measure how well your model or prompt performs across multiple scenarios.

This simple harness lets you define test cases, paste actual outputs, and score them using different matching strategies—perfect for quick prompt iteration and quality checks.
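
A test case is just an input, the output you expect, and the actual output you paste in after running your prompt. As a rough sketch in Python (the field names here are illustrative, not the tool's actual schema), a case might look like this:

    # Illustrative test-case structure; field names are assumptions, not the tool's schema.
    from dataclasses import dataclass

    @dataclass
    class TestCase:
        input: str        # the prompt or question you send to the model
        expected: str     # the output you expect back
        actual: str = ""  # the model's output, pasted in after you run the prompt

    cases = [
        TestCase(input="What is the capital of France?", expected="Paris"),
    ]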

Scoring Methods

Exact Match

Scores 100% if the strings are identical ignoring case, otherwise 0%. Good for short factual answers.
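
As a minimal sketch, case-insensitive exact matching could be implemented like this (the whitespace trimming is an assumption and may differ from the tool's normalization):

    # Sketch of exact-match scoring: 100 if the trimmed, lowercased strings
    # are identical, otherwise 0.
    def score_exact(expected: str, actual: str) -> float:
        return 100.0 if expected.strip().lower() == actual.strip().lower() else 0.0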

Contains

Scores 100% if the expected output appears anywhere in the actual output, otherwise 0%. Good for verbose responses.
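
A minimal sketch of contains scoring, assuming a case-insensitive substring check:

    # Sketch of contains scoring: 100 if the expected text appears anywhere
    # in the actual output (case-insensitive), otherwise 0.
    def score_contains(expected: str, actual: str) -> float:
        return 100.0 if expected.strip().lower() in actual.lower() else 0.0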

Word Overlap

Scores the percentage of expected words that appear in the actual output. A rough proxy for semantic similarity.
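
A minimal sketch of word-overlap scoring, assuming whitespace tokenization over unique, lowercased words (the tool may tokenize differently):

    # Sketch of word-overlap scoring: the percentage of unique expected words
    # that also appear in the actual output.
    def score_word_overlap(expected: str, actual: str) -> float:
        expected_words = set(expected.lower().split())
        actual_words = set(actual.lower().split())
        if not expected_words:
            return 0.0
        return 100.0 * len(expected_words & actual_words) / len(expected_words)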

Best Practices

  • Diverse test cases: Include edge cases, not just happy paths.
  • Choose appropriate scoring: Exact Match for facts, Contains for natural language.
  • Set baselines: Run before changes to measure improvement.
  • Document expectations: Write clear, unambiguous expected outputs.

FAQ

Does this call any LLM APIs?

No. You manually paste the actual outputs. This tool only scores—run your prompts separately.

What does pass rate measure?

The percentage of test cases scoring 80% or higher. The threshold can be adjusted to fit your needs.
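
A minimal sketch of the pass-rate calculation, assuming scores on a 0-100 scale and a default threshold of 80:

    # Sketch of pass rate: the share of test cases whose score meets or
    # exceeds the threshold.
    def pass_rate(scores: list[float], threshold: float = 80.0) -> float:
        if not scores:
            return 0.0
        passed = sum(1 for s in scores if s >= threshold)
        return 100.0 * passed / len(scores)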