How to Split Your Data
This tutorial explains different data splitting strategies in Nkululeko for supervised machine learning experiments. Based on the blog post by Felix Burkhardt.
Why Split Data?
In supervised machine learning, you typically need three kinds of datasets:
Train data: To teach the model the relation between data and labels
Dev data (development): To tune meta-parameters of your model (e.g., number of neurons, batch size, learning rate)
Test data: To evaluate your model ONCE at the end to check on generalization
All of this is to prevent overfitting on your train and/or dev data. If you have evaluated on the same test data across many experiments, consider finding a new test set: chances are high that your design choices have overfitted to it.
Rules for Good Data Splits
Train and dev can be from the same set, but the test set is ideally from a different database
If you don’t have much data: use a three-way split such as 60/20/20% (train/dev/test)
If you have masses of data: use only as much dev and test data as you need to cover your population
If you have very little data: use k-fold cross-validation for train and dev, but the test set should still be separate
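The three-way partition described by these rules can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Nkululeko code: the function name three_way_split, the sample names, and the fractions are placeholders chosen for the example.

```python
import random


def three_way_split(items, dev_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle items and cut them into disjoint train/dev/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_test = round(len(shuffled) * test_frac)
    n_dev = round(len(shuffled) * dev_frac)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test


# Hypothetical sample names, for illustration only
samples = [f"sample_{i:03d}" for i in range(100)]
train, dev, test = three_way_split(samples)
print(len(train), len(dev), len(test))  # 60 20 20
```

The test list is cut off first and never touched again until the final evaluation; everything else (training, tuning on dev) happens on the remaining data.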
Split Strategies in Nkululeko
Nkululeko offers several split strategies configured via split_strategy in the [DATA] section:
1. Specified Split
Use predefined train and test files. Ideal when you have a standard benchmark dataset with official splits.
Configuration:
[DATA]
emodb.split_strategy = specified
emodb.test_tables = ['emotion.categories.test.gold_standard']
emodb.train_tables = ['emotion.categories.train.gold_standard']
Example: exp_emodb_split_specified.ini
Run:
python -m nkululeko.nkululeko --config examples/exp_emodb_split_specified.ini
When to use:
You have official benchmark splits
You want reproducible comparisons with other research
Dataset provides predefined train/test files
2. Random Split
Randomly assign samples to train and test sets. Simple but doesn’t guarantee speaker independence.
Configuration:
[DATA]
emodb.split_strategy = random
emodb.test_size = 20 # 20% for test
Example: exp_emodb_split_random.ini
Run:
python -m nkululeko.nkululeko --config examples/exp_emodb_split_random.ini
When to use:
Quick experiments
Large datasets where speaker overlap is less critical
When speaker information is not available
Caution: May lead to speaker overlap between train and test, resulting in optimistic performance estimates.
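To see why this caution matters, here is a minimal sketch of a sample-level random split on a toy corpus, followed by a check for speakers that appear on both sides. The corpus, the speaker IDs, and the random_split helper are made up for the example; this is not how Nkululeko implements the strategy internally.

```python
import random

# Hypothetical toy corpus: (file name, speaker id), 4 speakers x 5 files
corpus = [(f"spk{s}_utt{u}.wav", f"spk{s}") for s in range(4) for u in range(5)]


def random_split(rows, test_percent=20, seed=0):
    """Assign individual samples to train/test, ignoring who the speaker is."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = len(rows) * test_percent // 100
    return rows[n_test:], rows[:n_test]


train_rows, test_rows = random_split(corpus)
train_speakers = {spk for _, spk in train_rows}
test_speakers = {spk for _, spk in test_rows}
shared = train_speakers & test_speakers

print(len(train_rows), len(test_rows))  # 16 4
print("speakers in both sets:", sorted(shared))
```

In this toy setup every test speaker also occurs in the training set, because only 4 of each speaker's files can land in test at most. A model can therefore profit from having heard the test speakers before.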
3. Speaker Split
Ensures speakers in train and test are different (speaker-independent evaluation). Critical for real-world generalization.
Configuration:
[DATA]
emodb.split_strategy = speaker_split
emodb.test_size = 20 # 20% for test
Example: exp_emodb_split_speaker.ini
Run:
python -m nkululeko.nkululeko --config examples/exp_emodb_split_speaker.ini
When to use:
Real-world deployment scenarios
You want to test generalization to unseen speakers
Gold standard for speaker-independent evaluation
Why it matters: Prevents the model from memorizing speaker characteristics, forcing it to learn genuine emotion patterns.
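The idea can be sketched by splitting the list of speakers instead of the list of samples. The following is an illustration under assumptions: the toy corpus and the speaker_split function are hypothetical, and the sketch assigns 20% of the speakers to test, which may differ from how Nkululeko interprets test_size.

```python
import random

# Hypothetical toy corpus: (file name, speaker id), 10 speakers x 5 files
corpus = [(f"spk{s}_utt{u}.wav", f"spk{s}") for s in range(10) for u in range(5)]


def speaker_split(rows, test_percent=20, seed=0):
    """Hold out whole speakers so no speaker occurs in both train and test."""
    speakers = sorted({spk for _, spk in rows})
    random.Random(seed).shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_percent / 100))
    held_out = set(speakers[:n_test])
    train = [row for row in rows if row[1] not in held_out]
    test = [row for row in rows if row[1] in held_out]
    return train, test


train_rows, test_rows = speaker_split(corpus)
train_speakers = {spk for _, spk in train_rows}
test_speakers = {spk for _, spk in test_rows}
print(len(train_speakers), len(test_speakers))  # 8 2
print(train_speakers & test_speakers)  # set()
```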
4. LOSO (Leave-One-Speaker-Out)
Cross-validation where each speaker is held out once as the test set. Tests generalization to every speaker.
Configuration:
[DATA]
emodb.split_strategy = speaker_split
emodb.test_size = 10 # Percentage for initial split
[MODEL]
logo = 10 # Number of speakers for LOSO cross-validation
Example: exp_emodb_split_loso.ini
Run:
python -m nkululeko.nkululeko --config examples/exp_emodb_split_loso.ini
When to use:
Small datasets with few speakers
You want robust speaker-independent evaluation
You need per-speaker performance analysis
How it works:
Uses the speaker_split strategy to ensure speaker independence
The logo parameter specifies the number of speakers (folds)
For EmoDB with 10 speakers, logo = 10 means each fold leaves one speaker out (LOSO)
Trains 10 models, each testing on a different speaker
Note: Computationally expensive for datasets with many speakers. The number specified in logo should match the number of speakers in your dataset.
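The fold structure behind LOSO can be sketched as a generator that holds out one speaker at a time. The toy corpus and the function name leave_one_speaker_out are assumptions made for this example; the sketch shows the principle, not Nkululeko's implementation.

```python
from collections import defaultdict

# Hypothetical toy corpus: (file name, speaker id), 10 speakers x 3 files
corpus = [(f"spk{s}_utt{u}.wav", f"spk{s}") for s in range(10) for u in range(3)]


def leave_one_speaker_out(rows):
    """Yield (held-out speaker, train rows, test rows), one fold per speaker."""
    by_speaker = defaultdict(list)
    for row in rows:
        by_speaker[row[1]].append(row)
    for held_out in sorted(by_speaker):
        test = by_speaker[held_out]
        train = [row for spk, items in by_speaker.items()
                 if spk != held_out for row in items]
        yield held_out, train, test


folds = list(leave_one_speaker_out(corpus))
print(len(folds))  # 10, one fold (and one trained model) per speaker
for speaker, train, test in folds[:2]:
    print(speaker, len(train), len(test))
```

The number of folds grows with the number of speakers, which is why this scheme becomes expensive for large speaker sets.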
5. LOGO (Leave-One-Group-Out)
Cross-validation on the training data by leaving out one group at a time. Used for meta-parameter tuning.
Configuration:
[DATA]
emodb.split_strategy = random # First split train/test
emodb.test_size = 20
[MODEL]
logo = 4 # Leave-One-Group-Out with 4 groups
Example: exp_emodb_split_logo.ini
Run:
python -m nkululeko.nkululeko --config examples/exp_emodb_split_logo.ini
When to use:
Tuning model hyperparameters
You want more robust validation than a single train/dev split
Combined with another split strategy for train/test
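A rough sketch of the leave-one-group-out idea follows. It assumes that a group is a set of speakers and that speakers are distributed over the groups at random; both are assumptions of this example, as are the helper names assign_groups and leave_one_group_out. How Nkululeko actually forms its groups is not described here.

```python
import random

# Hypothetical training data: (file name, speaker id), 10 speakers x 3 files
train_rows = [(f"spk{s}_utt{u}.wav", f"spk{s}") for s in range(10) for u in range(3)]


def assign_groups(speakers, n_groups, seed=0):
    """Distribute speakers over n_groups in shuffled round-robin order."""
    speakers = sorted(speakers)
    random.Random(seed).shuffle(speakers)
    return {spk: i % n_groups for i, spk in enumerate(speakers)}


def leave_one_group_out(rows, group_of):
    """Yield (group id, train rows, dev rows), holding out one group per fold."""
    for group in sorted(set(group_of.values())):
        train = [row for row in rows if group_of[row[1]] != group]
        dev = [row for row in rows if group_of[row[1]] == group]
        yield group, train, dev


group_of = assign_groups({spk for _, spk in train_rows}, n_groups=4)
folds = list(leave_one_group_out(train_rows, group_of))
print(len(folds))  # 4
print(sorted(len(dev) for _, _, dev in folds))  # [6, 6, 9, 9]
```

With the number of groups equal to the number of speakers, each group contains exactly one speaker and the scheme reduces to LOSO.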
6. K-Fold Cross-Validation
Splits training data into K folds and trains K models, using each fold as validation once.
Configuration:
[DATA]
emodb.split_strategy = random # First split train/test
emodb.test_size = 20
[MODEL]
k_fold_cross = 5 # 5-fold cross-validation
Example: exp_emodb_split_kfold.ini
Run:
python -m nkululeko.nkululeko --config examples/exp_emodb_split_kfold.ini
When to use:
Small to medium datasets
You want robust performance estimates
Comparing different models or feature sets
Common values: 5-fold or 10-fold
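The mechanics of k-fold cross-validation can be sketched with index arithmetic alone. The function k_fold_indices below is a hypothetical helper written for this tutorial, not part of Nkululeko; it does not shuffle, so in practice you would shuffle the sample order first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train indices, validation indices) for each of k folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


folds = list(k_fold_indices(n_samples=23, k=5))
print([len(val) for _, val in folds])  # [5, 5, 5, 4, 4]
```

Every sample is used for validation exactly once and for training k-1 times; the reported score is usually the mean over the k folds.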
Exercise 1: Compare Split Strategies
Try all split methods with EmoDB using OpenSMILE features and XGBoost:
# 1. Specified split
python -m nkululeko.nkululeko --config examples/exp_emodb_split_specified.ini
# 2. Random split
python -m nkululeko.nkululeko --config examples/exp_emodb_split_random.ini
# 3. Speaker split
python -m nkululeko.nkululeko --config examples/exp_emodb_split_speaker.ini
# 4. LOSO
python -m nkululeko.nkululeko --config examples/exp_emodb_split_loso.ini
# 5. LOGO
python -m nkululeko.nkululeko --config examples/exp_emodb_split_logo.ini
# 6. 5-fold cross-validation
python -m nkululeko.nkululeko --config examples/exp_emodb_split_kfold.ini
Question: Which split strategy gives the best performance? Why?
Expected findings:
Random split typically gives the highest performance (but least realistic)
Speaker split / LOSO give more conservative (realistic) performance
K-fold / LOGO provide robust estimates with confidence intervals
Exercise 2: Detecting Overfitting
Run a neural network experiment to visualize when overfitting starts:
Configuration: exp_emodb_split_overfitting.ini
python -m nkululeko.nkululeko --config examples/exp_emodb_split_overfitting.ini
This configuration:
Uses an MLP with layers {l1: 1024, l2: 64}
Trains for 200 epochs
Plots epoch progression and identifies the best model
What to look for:
Open the epoch progression plot in examples/results/exp_emodb_split_overfitting/images/
Find where training loss continues decreasing but validation loss starts increasing
That’s where overfitting begins!
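If you have the per-epoch losses as numbers rather than a plot, the same check can be done programmatically. The loss values below are invented for illustration, and overfitting_onset is a hypothetical helper, not a Nkululeko function.

```python
# Invented loss curves for illustration: one value per epoch
train_loss = [0.90, 0.70, 0.55, 0.45, 0.38, 0.32, 0.27, 0.23]
val_loss = [0.95, 0.80, 0.68, 0.62, 0.60, 0.63, 0.67, 0.72]


def overfitting_onset(train_loss, val_loss):
    """Return the first epoch where validation loss rises while training
    loss still falls, or None if that never happens."""
    for epoch in range(1, len(val_loss)):
        val_rises = val_loss[epoch] > val_loss[epoch - 1]
        train_falls = train_loss[epoch] < train_loss[epoch - 1]
        if val_rises and train_falls:
            return epoch
    return None


# Epoch with the lowest validation loss (0-based index)
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
onset = overfitting_onset(train_loss, val_loss)
print(best_epoch, onset)  # 4 5
```

Real curves are noisier than this example, so a single upward step in validation loss is not always the true onset; smoothing or a patience window is a common refinement.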
Comparison Table
| Split Strategy | Speaker Independent | Use Case | Computational Cost | Realism |
|---|---|---|---|---|
| Specified | Depends on dataset | Benchmark comparison | Low | Varies |
| Random | ❌ No | Quick experiments | Low | Low |
| Speaker Split | ✅ Yes | Real-world deployment | Low | High |
| LOSO | ✅ Yes | Small datasets, per-speaker analysis | High | Very High |
| LOGO | Configurable | Hyperparameter tuning | Medium | Medium |
| K-Fold | Configurable | Robust evaluation | Medium-High | Medium |
Best Practices
For Small Datasets (< 1000 samples)
Use k-fold cross-validation (k=5 or k=10) on train+dev
Keep a separate test set that you evaluate ONLY ONCE
Consider LOSO if you have fewer than 20 speakers
For Medium Datasets (1000-10,000 samples)
Use speaker split with 80/10/10 (train/dev/test)
Ensure different speakers in each split
Use k-fold on training data for hyperparameter tuning
For Large Datasets (> 10,000 samples)
Simple random split or speaker split works well
Dev and test sets can be smaller (e.g., 5% each)
Focus on ensuring the test set covers the population diversity
General Tips
✅ Always keep test data separate until final evaluation
✅ Use speaker split for realistic performance estimates
✅ Use cross-validation for robust hyperparameter tuning
❌ Never tune on test data
❌ Don’t evaluate on test data multiple times (you’ll overfit!)
Advanced: Balanced Splits
For imbalanced datasets, use stratified or balanced splits:
[DATA]
emodb.split_strategy = balanced
emodb.test_size = 20
# Stratify by multiple variables with weights
balance = {'emotion':2, 'age':1, 'gender':1}
age_bins = 2
size_diff_weight = 1
See exp_emodb_split.ini for a complete example.
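To illustrate what stratification means, here is a simplified sketch that stratifies by a single label only. It is not the weighted multi-variable balancing configured above; the toy data and the stratified_split helper are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, test_percent=20, seed=0):
    """Split each label's samples separately so that train and test keep
    roughly the same label proportions."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_test = round(len(items) * test_percent / 100)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Hypothetical imbalanced toy data: 40 neutral and 10 anger samples
rows = [(f"utt{i}.wav", "neutral") for i in range(40)]
rows += [(f"utt{i}.wav", "anger") for i in range(40, 50)]

train, test = stratified_split(rows)
test_counts = Counter(label for _, label in test)
print(test_counts["neutral"], test_counts["anger"])  # 8 2
```

The test set keeps the 4:1 ratio of the full data, so the rare class is neither missing from nor over-represented in the evaluation.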
References
Blog post: How to Split Your Data
Documentation: Nkululeko INI file reference
Related: How to Evaluate Your Model
Summary
Choosing the right split strategy is crucial for reliable machine learning experiments:
For benchmarking: Use specified splits
For real-world deployment: Use speaker split or LOSO
For quick experiments: Use random split
For small datasets: Use k-fold cross-validation
For hyperparameter tuning: Use LOGO or k-fold
Remember: Your test set performance is only meaningful if it represents the real-world scenario your model will face!