Dataset Splitter

Split datasets into train/validation/test sets for ML training

What is Dataset Splitting?

Dataset splitting divides your data into separate sets for training, validation, and testing. This fundamental practice prevents overfitting by ensuring you evaluate model performance on data it has never seen during training.

This dataset splitter handles JSONL data with configurable split ratios and reproducible shuffling via seeded random number generation—essential for consistent ML experiments.
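To make the behavior concrete, here is a minimal sketch of such a splitter. The function name, signature, and defaults are illustrative assumptions, not the tool's actual API; it reads JSONL, shuffles with a seeded RNG, and slices by ratio.

```python
import json
import random

def split_jsonl(path, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle JSONL records with a fixed seed and split by ratio.

    Illustrative sketch; the real tool's interface may differ.
    Train is sliced first, then validation; the remainder is test.
    """
    with open(path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    random.Random(seed).shuffle(records)  # seeded => reproducible order
    n = len(records)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (records[:n_train],
            records[n_train:n_train + n_val],
            records[n_train + n_val:])
```

Because the RNG is constructed from the seed, rerunning with the same seed reproduces the exact same three sets.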

Purpose of Each Split

Training Set (60-80%)

Used to train the model. The model learns patterns and adjusts weights based on this data.

Validation Set (10-20%)

Used during training to tune hyperparameters and prevent overfitting. Guides early stopping decisions.

Test Set (10-20%)

Held out completely until final evaluation. Provides an unbiased estimate of real-world performance.

Common Split Ratios

Split      When to Use
80/10/10   Standard split for most ML tasks
70/15/15   When you need more validation data
90/5/5     When data is very limited

Best Practices

  • Always shuffle: Prevents ordering bias if data was collected sequentially.
  • Use fixed seeds: Ensures reproducible experiments. Document seeds in your research.
  • Stratify when possible: For classification, maintain class distribution across splits.
  • Never peek at test: Only evaluate on test set once, at the very end.
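The stratification practice above can be sketched as follows. This assumes each record carries a label under a `label` key; the helper name and signature are hypothetical, not part of the tool.

```python
import random
from collections import defaultdict

def stratified_split(records, label_key="label", train_frac=0.8, seed=42):
    """Split records while preserving per-class proportions (sketch).

    Each class is shuffled and split independently, so the train/rest
    ratio holds within every class, not just overall.
    """
    by_class = defaultdict(list)
    for r in records:
        by_class[r[label_key]].append(r)
    rng = random.Random(seed)
    train, rest = [], []
    for group in by_class.values():
        rng.shuffle(group)
        cut = int(len(group) * train_frac)
        train.extend(group[:cut])
        rest.extend(group[cut:])
    return train, rest
```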

FAQ

What if my percentages don't add to 100?

The tool allocates the training set first, then the validation set; every remaining example goes to the test set, regardless of the test percentage you specified.
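The allocation arithmetic can be illustrated like this (the function is a hypothetical sketch of the rule, not the tool's code):

```python
def allocate(n, train_pct, val_pct):
    """Train is allocated first, then validation; the leftover is test."""
    n_train = int(n * train_pct / 100)
    n_val = int(n * val_pct / 100)
    n_test = n - n_train - n_val  # remainder, whatever the stated test %
    return n_train, n_val, n_test
```

So with 100 examples and percentages of 80/15/10 (which sum to 105), the result is 80 train, 15 validation, and the remaining 5 go to test.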

Why use a seed for shuffling?

Seeds make shuffling deterministic. The same seed always produces the same split, enabling reproducible experiments and debugging.
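This determinism is easy to verify with Python's standard library: seeding a dedicated `random.Random` instance makes the shuffle a pure function of the seed.

```python
import random

def seeded_shuffle(items, seed):
    """Return a shuffled copy of items; equal seeds give equal orders."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return items
```

Calling `seeded_shuffle(data, 42)` twice yields identical orderings, while a different seed yields a different (but equally reproducible) ordering.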