ROUGE Score Calculator

Calculate ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores for summarization

Input Texts

Example inputs of 13 and 12 tokens produce the scores below.

ROUGE Scores

Metric               Precision   Recall   F1
ROUGE-1 (unigrams)   54.5%       54.5%    54.5%
ROUGE-2 (bigrams)    27.3%       25.0%    26.1%
ROUGE-L (LCS)        58.3%       53.8%    56.0%

What is ROUGE?

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for evaluating automatic summarization and machine translation. Unlike BLEU, ROUGE is recall-oriented, focusing on how much of the reference is captured.

ROUGE is the standard evaluation metric for summarization tasks in NLP.
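
As a quick sanity check, scores like the ones in the example above can be reproduced in a few lines of Python. The sketch below assumes Google's rouge-score package (pip install rouge-score); the sentences are invented purely for illustration.

```python
# Minimal sketch using the rouge-score package; sentences are made up for illustration.
from rouge_score import rouge_scorer

reference = "the cat sat quietly on the warm mat near the door"
hypothesis = "a cat sat on the mat by the door"

# use_stemmer=True applies Porter stemming before matching, as standard ROUGE does.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, hypothesis)  # score(target, prediction)

for name, s in scores.items():
    print(f"{name}: P={s.precision:.3f}  R={s.recall:.3f}  F1={s.fmeasure:.3f}")
```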

ROUGE Variants

ROUGE-1

Measures unigram (single word) overlap. Good for capturing content words.

ROUGE-2

Measures bigram (word pair) overlap. Better for capturing phrase-level similarity.

ROUGE-L

Uses Longest Common Subsequence. Captures sentence-level structure and word order.
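
To make the differences between the variants concrete, here is a rough from-scratch sketch in plain Python (whitespace tokenization, no stemming, so its numbers will not exactly match a full ROUGE implementation). ROUGE-N counts clipped n-gram overlap; ROUGE-L runs a longest-common-subsequence dynamic program.

```python
# From-scratch sketch of ROUGE-N and ROUGE-L on whitespace tokens (no stemming).
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, hypothesis, n):
    ref, hyp = ngrams(reference.split(), n), ngrams(hypothesis.split(), n)
    overlap = sum((ref & hyp).values())       # clipped n-gram matches
    p = overlap / max(sum(hyp.values()), 1)   # overlap / hypothesis n-grams
    r = overlap / max(sum(ref.values()), 1)   # overlap / reference n-grams
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def rouge_l(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Longest common subsequence length via dynamic programming.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i, rt in enumerate(ref, 1):
        for j, ht in enumerate(hyp, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if rt == ht else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    p, r = lcs / max(len(hyp), 1), lcs / max(len(ref), 1)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Calling rouge_n(ref, hyp, 1), rouge_n(ref, hyp, 2), and rouge_l(ref, hyp) returns (precision, recall, F1) triples corresponding to the ROUGE-1, ROUGE-2, and ROUGE-L rows above.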

Precision vs Recall vs F1

Metric      Formula                 Measures
Precision   overlap / hypothesis    How much of the hypothesis is relevant
Recall      overlap / reference     How much of the reference is captured
F1          2 × P × R / (P + R)     Harmonic mean of P and R
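
For example, the ROUGE-2 values in the example at the top combine as F1 = 2 × 0.273 × 0.250 / (0.273 + 0.250) ≈ 0.261, which is the 26.1% shown.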

Pro Tip: Which ROUGE to Report

For summarization papers, report ROUGE-1, ROUGE-2, and ROUGE-L F1 scores. ROUGE-2 F1 is often considered the most important for comparing systems.

Frequently Asked Questions

ROUGE vs BLEU: When to use which?

Use ROUGE for summarization, where recall matters more (did the summary capture the key information?). Use BLEU for translation, where precision matters more (did the output avoid spurious words?).

What's a good ROUGE score?

It depends on the dataset. For news summarization, ROUGE-2 F1 of 15-20% is common. Always compare to baseline systems on the same dataset.

Related Tools

BLEU Score

Calculate BLEU for translation evaluation.

Perplexity Calculator

Evaluate language model quality.