AI Eval Collection
Evaluate and benchmark AI model performance
Model A/B Test Evaluator
Analyze results from model A/B tests for statistical significance
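A two-proportion z-test is one common way to check such A/B results for significance. The sketch below is illustrative only (the function name and sample numbers are hypothetical, not this tool's API) and compares the success rates of two model variants using just the standard library:

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two success rates."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    # Pooled rate under the null hypothesis that both models perform equally
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical data: model A wins 870/1000 ratings, model B wins 830/1000
z, p = two_proportion_z_test(870, 1000, 830, 1000)
```

With these made-up counts the difference is significant at the usual 0.05 level; with equal counts the p-value is 1.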
BLEU & ROUGE Calculator
Calculate standard text-generation metrics between a reference and a hypothesis
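BLEU and ROUGE come in several variants; as an illustration of the underlying idea, the unigram forms (BLEU-1 modified precision, ROUGE-1 recall) fit in a few lines. Function names here are hypothetical, and full BLEU additionally uses higher-order n-grams and a brevity penalty:

```python
from collections import Counter

def bleu_1(reference, hypothesis):
    """BLEU-1: clipped unigram precision (single reference, no brevity penalty)."""
    ref, hyp = Counter(reference.split()), Counter(hypothesis.split())
    overlap = sum((ref & hyp).values())  # counts clipped by the reference
    return overlap / max(sum(hyp.values()), 1)

def rouge_1(reference, hypothesis):
    """ROUGE-1 recall: fraction of reference unigrams recovered by the hypothesis."""
    ref, hyp = Counter(reference.split()), Counter(hypothesis.split())
    overlap = sum((ref & hyp).values())
    return overlap / max(sum(ref.values()), 1)

print(bleu_1("the cat sat on the mat", "the cat"))   # precise but short
print(rouge_1("the cat sat on the mat", "the cat"))  # recall penalizes brevity
```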
Confusion Matrix Visualizer
Generate and analyze confusion matrices for classification models
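As a sketch of what such analysis involves (helper names are hypothetical), a confusion matrix with true labels as rows and predictions as columns yields per-class precision and recall directly:

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def per_class_precision_recall(matrix, labels):
    """Precision = TP / column sum; recall = TP / row sum, per class."""
    stats = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fp = sum(matrix[r][i] for r in range(len(labels))) - tp
        fn = sum(matrix[i]) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        stats[label] = (precision, recall)
    return stats

m = confusion_matrix(["cat", "cat", "dog", "dog"],
                     ["cat", "dog", "dog", "dog"],
                     ["cat", "dog"])
```

Here "cat" has perfect precision but only 0.5 recall, since one cat was misclassified as a dog.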
Evaluation Harness Config
Generate configuration files for LM Evaluation Harness
Human Eval Form
Create grading rubrics and forms for human evaluation of LLM outputs
Latency Benchmark Recorder
Record and visualize latency metrics from your own API tests
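Latency reports typically summarize tail percentiles (p95, p99) rather than just the mean, since slow outliers dominate user experience. A minimal sketch using the nearest-rank percentile method (function name is illustrative, not this tool's API):

```python
import math

def latency_summary(samples_ms):
    """Summarize latency samples (milliseconds) with mean and tail percentiles."""
    ordered = sorted(samples_ms)

    def pct(p):
        # Nearest-rank method: the ceil(p% * n)-th smallest sample, 1-indexed
        k = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[k - 1]

    return {
        "mean": sum(ordered) / len(ordered),
        "p50": pct(50),
        "p95": pct(95),
        "p99": pct(99),
    }

summary = latency_summary(list(range(1, 101)))  # 1..100 ms dummy samples
```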
Why Use Our AI Tools?
🌐
Free & Online
Use these tools directly in your browser without installation.
🔒
Private
All processing happens locally on your device where possible.
⚡
Efficient
Optimized for speed and productivity.
