Dataset Splitter
Split datasets into train/validation/test sets for ML training
Related Tools
- JSONL Converter: Convert between JSON and JSONL formats for fine-tuning datasets
- PII Detector: Identify and redact personally identifiable information (PII) in datasets, entirely client-side
- Synthetic Data Generator: Generate synthetic examples based on schema instructions
- Training Data Formatter: Format text for various training objectives (fill-in-the-middle, next-token prediction)
- Annotation Converter: Convert between data annotation formats (COCO, YOLO, Pascal VOC)
- Data Augmentation Preview: Visualize image augmentation techniques for training data
What is Dataset Splitting?
Dataset splitting divides your data into separate sets for training, validation, and testing. This fundamental practice prevents overfitting by ensuring you evaluate model performance on data it has never seen during training.
This dataset splitter handles JSONL data with configurable split ratios and reproducible shuffling via seeded random number generation, which is essential for consistent ML experiments.
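As a rough illustration, here is a minimal Python sketch of the same idea: load JSONL records, shuffle them with a seeded RNG, and slice by ratio. The file name, ratios, and seed below are placeholder values, not the tool's exact implementation.

```python
import json
import random

def split_jsonl(path, train=0.8, val=0.1, seed=42):
    """Shuffle JSONL records with a fixed seed, then slice into splits."""
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]

    rng = random.Random(seed)  # seeded RNG makes the shuffle reproducible
    rng.shuffle(records)

    n_train = int(len(records) * train)
    n_val = int(len(records) * val)

    return (
        records[:n_train],                 # training set
        records[n_train:n_train + n_val],  # validation set
        records[n_train + n_val:],         # test set (remainder)
    )

# Example: an 80/10/10 split ("data.jsonl" is a placeholder path)
train_set, val_set, test_set = split_jsonl("data.jsonl")
print(len(train_set), len(val_set), len(test_set))
```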
Purpose of Each Split
Training Set (60-80%)
Used to train the model. The model learns patterns and adjusts weights based on this data.
Validation Set (10-20%)
Used during training to tune hyperparameters and prevent overfitting. Guides early stopping decisions.
Test Set (10-20%)
Held out completely until final evaluation. Provides unbiased estimate of real-world performance.
Common Split Ratios
| Train/Val/Test (%) | When to Use |
|---|---|
| 80/10/10 | Standard split for most ML tasks |
| 70/15/15 | When you need more validation and test data |
| 90/5/5 | When data is very limited |
Best Practices
- Always shuffle: Prevents ordering bias if data was collected sequentially.
- Use fixed seeds: Ensures reproducible experiments. Document seeds in your research.
- Stratify when possible: For classification, maintain the class distribution across splits (see the sketch after this list).
- Never peek at test: Evaluate on the test set only once, at the very end.
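For the stratification point above, here is a minimal sketch using scikit-learn's `train_test_split`; the records, the `label` field name, and the ratios are illustrative assumptions, not part of this tool.

```python
from sklearn.model_selection import train_test_split

# Hypothetical labeled records; "label" is an assumed field name.
records = [{"text": f"example {i}", "label": i % 3} for i in range(100)]
labels = [r["label"] for r in records]

# Carve off the training set first, preserving class proportions.
train, rest, _, rest_labels = train_test_split(
    records, labels, train_size=0.8, stratify=labels, random_state=42
)

# Split the remainder 50/50 into validation and test, still stratified.
val, test = train_test_split(
    rest, test_size=0.5, stratify=rest_labels, random_state=42
)

print(len(train), len(val), len(test))  # 80 10 10
```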
FAQ
What if my percentages don't add to 100?
The tool allocates the train split first, then validation; whatever remains goes to the test set, regardless of the test percentage you set.
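For example, with 1,000 examples and percentages that sum to only 90, the allocation works out like this (a sketch of the behavior described above; the tool's exact rounding may differ):

```python
n = 1000
train_pct, val_pct = 70, 20  # sums to 90, not 100

n_train = n * train_pct // 100  # 700 examples
n_val = n * val_pct // 100      # 200 examples
n_test = n - n_train - n_val    # 100 examples: everything left over

print(n_train, n_val, n_test)  # 700 200 100
```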
Why use a seed for shuffling?
Seeds make shuffling deterministic. The same seed always produces the same split, enabling reproducible experiments and debugging.
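A quick demonstration with Python's built-in `random` module:

```python
import random

items = list(range(10))
a, b, c = items[:], items[:], items[:]

random.Random(42).shuffle(a)  # same seed...
random.Random(42).shuffle(b)  # ...same order
random.Random(7).shuffle(c)   # different seed, different order

print(a == b)  # True: identical splits on every run
print(a == c)  # almost certainly False
```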
