Dataset Splitter

Split datasets into train/validation/test sets for ML training

What is Dataset Splitting?

Dataset splitting divides your data into separate sets for training, validation, and testing. This fundamental practice prevents overfitting by ensuring you evaluate model performance on data it has never seen during training.

This dataset splitter handles JSONL data with configurable split ratios and reproducible shuffling via seeded random number generation—essential for consistent ML experiments.
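To make the behavior concrete, here is a minimal sketch of such a splitter. The function name, signature, and defaults are illustrative assumptions, not the tool's actual API; it reads JSONL, shuffles with a seeded RNG, and slices by ratio.

```python
import json
import random

def split_jsonl(path, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle JSONL records with a fixed seed and split by ratio.

    Illustrative sketch; the real tool's interface may differ.
    Train is sliced first, then validation; the remainder is test.
    """
    with open(path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    random.Random(seed).shuffle(records)  # seeded => reproducible order
    n = len(records)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (records[:n_train],
            records[n_train:n_train + n_val],
            records[n_train + n_val:])
```

Because the RNG is constructed from the seed, rerunning with the same seed reproduces the exact same three sets.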

Purpose of Each Split

Training Set (60-80%)

Used to train the model. The model learns patterns and adjusts weights based on this data.

Validation Set (10-20%)

Used during training to tune hyperparameters and prevent overfitting. Guides early stopping decisions.

Test Set (10-20%)

Held out completely until final evaluation. Provides an unbiased estimate of real-world performance.

Common Split Ratios

Split      When to Use
80/10/10   Standard split for most ML tasks
70/15/15   When you need more validation data
90/5/5     When data is very limited

Best Practices

  • Always shuffle: Prevents ordering bias if data was collected sequentially.
  • Use fixed seeds: Ensures reproducible experiments. Document seeds in your research.
  • Stratify when possible: For classification, maintain class distribution across splits.
  • Never peek at test: Only evaluate on test set once, at the very end.
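The stratification practice above can be sketched as follows. This assumes each record carries a label under a `label` key; the helper name and signature are hypothetical, not part of the tool.

```python
import random
from collections import defaultdict

def stratified_split(records, label_key="label", train_frac=0.8, seed=42):
    """Split records while preserving per-class proportions (sketch).

    Each class is shuffled and split independently, so the train/rest
    ratio holds within every class, not just overall.
    """
    by_class = defaultdict(list)
    for r in records:
        by_class[r[label_key]].append(r)
    rng = random.Random(seed)
    train, rest = [], []
    for group in by_class.values():
        rng.shuffle(group)
        cut = int(len(group) * train_frac)
        train.extend(group[:cut])
        rest.extend(group[cut:])
    return train, rest
```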

FAQ

What if my percentages don't add to 100?

The tool allocates the training set first, then the validation set; every remaining example goes to the test set, regardless of the test percentage you specified.
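The allocation arithmetic can be illustrated like this (the function is a hypothetical sketch of the rule, not the tool's code):

```python
def allocate(n, train_pct, val_pct):
    """Train is allocated first, then validation; the leftover is test."""
    n_train = int(n * train_pct / 100)
    n_val = int(n * val_pct / 100)
    n_test = n - n_train - n_val  # remainder, whatever the stated test %
    return n_train, n_val, n_test
```

So with 100 examples and percentages of 80/15/10 (which sum to 105), the result is 80 train, 15 validation, and the remaining 5 go to test.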

Why use a seed for shuffling?

Seeds make shuffling deterministic. The same seed always produces the same split, enabling reproducible experiments and debugging.
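This determinism is easy to verify with Python's standard library: seeding a dedicated `random.Random` instance makes the shuffle a pure function of the seed.

```python
import random

def seeded_shuffle(items, seed):
    """Return a shuffled copy of items; equal seeds give equal orders."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return items
```

Calling `seeded_shuffle(data, 42)` twice yields identical orderings, while a different seed yields a different (but equally reproducible) ordering.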