How to Split Your Data

This tutorial explains the data splitting strategies that Nkululeko offers for supervised machine learning experiments. It is based on the blog post by Felix Burkhardt.

Why Split Data?

In supervised machine learning, you typically need three kinds of datasets:

Train data: To teach the model the relation between data and labels
Dev data (development): To tune meta-parameters of your model (e.g., number of neurons, batch size, learning rate)
Test data: To evaluate your model ONCE at the end to check on generalization

All of this is to prevent overfitting on your train and/or dev data. If you have been evaluating on the same test data for a while, you may need to find a new test set, because chances are high that your experiments have overfitted to it.

Rules for Good Data Splits

Train and dev can be from the same set, but the test set is ideally from a different database
If you don’t have much data: use an 80/10/10% split (train/dev/test)
If you have masses of data: use only as much dev and test data as you need to cover your population
If you have very little data: use k-fold cross-validation for train and dev, but still keep the test set separate
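
The 80/10/10 rule can be illustrated with a few lines of plain Python. This is only a sketch of the arithmetic, not Nkululeko code; the function name three_way_split and its parameters are made up for this example.

```python
import random

def three_way_split(items, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle items and cut them into train/dev/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_dev = round(len(shuffled) * dev_frac)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]  # everything that is left
    return train, dev, test

train, dev, test = three_way_split(range(100))
print(len(train), len(dev), len(test))  # 80 10 10
```

Every sample ends up in exactly one of the three lists, which is the property that matters here.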

Split Strategies in Nkululeko

Nkululeko offers several split strategies configured via split_strategy in the [DATA] section:

1. Specified Split

Use predefined train and test files. Ideal when you have a standard benchmark dataset with official splits.

Configuration:

[DATA]
emodb.split_strategy = specified
emodb.test_tables = ['emotion.categories.test.gold_standard']
emodb.train_tables = ['emotion.categories.train.gold_standard']

Example: exp_emodb_split_specified.ini

Run:

python -m nkululeko.nkululeko --config examples/exp_emodb_split_specified.ini

When to use:

You have official benchmark splits
You want reproducible comparisons with other research
Dataset provides predefined train/test files

2. Random Split

Randomly assign samples to train and test sets. Simple but doesn’t guarantee speaker independence.

Configuration:

[DATA]
emodb.split_strategy = random
# 20% for test
emodb.test_size = 20

Example: exp_emodb_split_random.ini

Run:

python -m nkululeko.nkululeko --config examples/exp_emodb_split_random.ini

When to use:

Quick experiments
Large datasets where speaker overlap is less critical
When speaker information is not available

Caution: May lead to speaker overlap between train and test, resulting in optimistic performance estimates.
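
To see why a random split leaks speakers, consider the following toy sketch. It does not use Nkululeko; the data layout (pairs of speaker ID and sample index) and the function random_split are assumptions made for illustration.

```python
import random

# Toy data: (speaker_id, sample_index) pairs, 7 speakers with 20 samples each.
samples = [(f"spk{s:02d}", i) for s in range(7) for i in range(20)]

def random_split(items, test_size=20, seed=0):
    """Assign test_size percent of the items to test, ignoring speakers."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = len(shuffled) * test_size // 100
    return shuffled[n_test:], shuffled[:n_test]

train, test = random_split(samples)
# Speakers that occur on both sides of the split.
shared = {spk for spk, _ in train} & {spk for spk, _ in test}
print(f"{len(train)} train, {len(test)} test, {len(shared)} speakers in both sets")
```

With these sizes the test set cannot consist of complete speakers only, so at least one speaker always appears in both sets.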

3. Speaker Split

Ensures speakers in train and test are different (speaker-independent evaluation). Critical for real-world generalization.

Configuration:

[DATA]
emodb.split_strategy = speaker_split
# 20% for test
emodb.test_size = 20

Example: exp_emodb_split_speaker.ini

Run:

python -m nkululeko.nkululeko --config examples/exp_emodb_split_speaker.ini

When to use:

Real-world deployment scenarios
You want to test generalization to unseen speakers
Gold standard for speaker-independent evaluation

Why it matters: Prevents the model from memorizing speaker characteristics, forcing it to learn genuine emotion patterns.
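
The idea behind a speaker-independent split can be sketched in plain Python as follows. This is not Nkululeko's implementation: the function speaker_split is hypothetical, and it assumes that the test percentage is applied to the number of speakers rather than the number of samples.

```python
import random

# Toy data: (speaker_id, sample_index) pairs, 10 speakers with 20 samples each.
samples = [(f"spk{s:02d}", i) for s in range(10) for i in range(20)]

def speaker_split(items, test_size=20, seed=0):
    """Put test_size percent of the speakers (not samples) into the test set."""
    speakers = sorted({spk for spk, _ in items})
    random.Random(seed).shuffle(speakers)
    n_test = max(1, len(speakers) * test_size // 100)
    test_speakers = set(speakers[:n_test])
    train = [s for s in items if s[0] not in test_speakers]
    test = [s for s in items if s[0] in test_speakers]
    return train, test

train, test = speaker_split(samples)
shared = {spk for spk, _ in train} & {spk for spk, _ in test}
print(f"{len(train)} train, {len(test)} test, {len(shared)} speakers in both sets")
```

Because whole speakers are moved to the test set, no speaker can appear on both sides.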

4. LOSO (Leave-One-Speaker-Out)

Cross-validation where each speaker is held out once as the test set. Tests generalization to every speaker.

Configuration:

[DATA]
emodb.split_strategy = speaker_split
# Percentage for initial split
emodb.test_size = 10

[MODEL]
# Number of speakers for LOSO cross-validation
logo = 10

Example: exp_emodb_split_loso.ini

Run:

python -m nkululeko.nkululeko --config examples/exp_emodb_split_loso.ini

When to use:

Small datasets with few speakers
You want robust speaker-independent evaluation
You need per-speaker performance analysis

How it works:

Uses speaker_split strategy to ensure speaker independence
The logo parameter specifies the number of speakers (folds)
For EmoDB with 10 speakers, logo = 10 means each fold leaves one speaker out (LOSO)
Trains 10 models, each testing on a different speaker

Note: Computationally expensive for datasets with many speakers. The number specified in logo should match the number of speakers in your dataset.
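
The fold structure described above can be sketched like this. It is a conceptual illustration in plain Python, not the code Nkululeko runs; loso_folds and the toy data are assumptions.

```python
# Toy data: (speaker_id, sample_index) pairs, 10 speakers with 5 samples each.
samples = [(f"spk{s:02d}", i) for s in range(10) for i in range(5)]

def loso_folds(items):
    """Yield (held_out_speaker, train, test), one fold per speaker."""
    speakers = sorted({spk for spk, _ in items})
    for held_out in speakers:
        train = [s for s in items if s[0] != held_out]
        test = [s for s in items if s[0] == held_out]
        yield held_out, train, test

folds = list(loso_folds(samples))
print(f"{len(folds)} folds")  # 10 folds, one per speaker
for held_out, train, test in folds[:2]:
    print(held_out, len(train), len(test))
```

One model is trained per fold, which is why the cost grows with the number of speakers.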

5. LOGO (Leave-One-Group-Out)

Cross-validation on the training data by leaving out one group at a time. Used for meta-parameter tuning.

Configuration:

[DATA]
# First split train/test
emodb.split_strategy = random
emodb.test_size = 20

[MODEL]
# Leave-One-Group-Out with 4 groups
logo = 4

Example: exp_emodb_split_logo.ini

Run:

python -m nkululeko.nkululeko --config examples/exp_emodb_split_logo.ini

When to use:

Tuning model hyperparameters
You want more robust validation than single train/dev split
Combined with another split strategy for train/test
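
Conceptually, LOGO works like LOSO with several speakers per fold. The sketch below illustrates this in plain Python; how Nkululeko actually assigns speakers to groups is not stated here, so the round-robin assignment and the function logo_folds are assumptions.

```python
# Toy data: (speaker_id, sample_index) pairs, 10 speakers with 5 samples each.
samples = [(f"spk{s:02d}", i) for s in range(10) for i in range(5)]

def logo_folds(items, n_groups=4):
    """Yield (group_index, train, dev) with one speaker group held out per fold."""
    speakers = sorted({spk for spk, _ in items})
    # Round-robin assignment of speakers to groups (an illustrative choice).
    group_of = {spk: i % n_groups for i, spk in enumerate(speakers)}
    for g in range(n_groups):
        train = [s for s in items if group_of[s[0]] != g]
        dev = [s for s in items if group_of[s[0]] == g]
        yield g, train, dev

folds = list(logo_folds(samples))
for g, train, dev in folds:
    print(f"group {g}: {len(train)} train / {len(dev)} held-out samples")
```

With 10 speakers and 4 groups, two groups contain three speakers and two groups contain two, so the folds are not all the same size.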

6. K-Fold Cross-Validation

Splits training data into K folds and trains K models, using each fold as validation once.

Configuration:

[DATA]
# First split train/test
emodb.split_strategy = random
emodb.test_size = 20

[MODEL]
# 5-fold cross-validation
k_fold_cross = 5

Example: exp_emodb_split_kfold.ini

Run:

python -m nkululeko.nkululeko --config examples/exp_emodb_split_kfold.ini

When to use:

Small to medium datasets
You want robust performance estimates
Comparing different models or feature sets

Common values: 5-fold or 10-fold
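
As a rough illustration of what K-fold does with the training data, here is a plain-Python sketch. It is not Nkululeko's implementation; k_fold_indices is a hypothetical helper that works on sample indices only.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, dev_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        dev = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, dev

splits = list(k_fold_indices(100, k=5))
for train, dev in splits:
    print(len(train), len(dev))  # 80 20 for every fold
```

Each index serves as validation data exactly once, and the K scores are typically averaged.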

Exercise 1: Compare Split Strategies

Try all split methods with EmoDB using openSMILE features and XGBoost:

# 1. Specified split
python -m nkululeko.nkululeko --config examples/exp_emodb_split_specified.ini

# 2. Random split
python -m nkululeko.nkululeko --config examples/exp_emodb_split_random.ini

# 3. Speaker split
python -m nkululeko.nkululeko --config examples/exp_emodb_split_speaker.ini

# 4. LOSO
python -m nkululeko.nkululeko --config examples/exp_emodb_split_loso.ini

# 5. LOGO
python -m nkululeko.nkululeko --config examples/exp_emodb_split_logo.ini

# 6. 5-fold cross-validation
python -m nkululeko.nkululeko --config examples/exp_emodb_split_kfold.ini

Question: Which split strategy gives the best performance? Why?

Expected findings:

Random split typically gives the highest performance (but least realistic)
Speaker split / LOSO give more conservative (realistic) performance
K-fold / LOGO provide robust estimates with confidence intervals

Exercise 2: Detecting Overfitting

Run a neural network experiment to visualize when overfitting starts:

Configuration: exp_emodb_split_overfitting.ini

python -m nkululeko.nkululeko --config examples/exp_emodb_split_overfitting.ini

This configuration:

Uses an MLP with layers {l1: 1024, l2: 64}
Trains for 200 epochs
Plots epoch progression and identifies the best model

What to look for:

Open the epoch progression plot in examples/results/exp_emodb_split_overfitting/images/
Find where training loss continues decreasing but validation loss starts increasing
That’s where overfitting begins!
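
If you have the per-epoch losses as numbers, the same check can be done programmatically. The sketch below uses invented loss values purely for illustration; it does not read Nkululeko's output files, and best_epoch is a hypothetical helper.

```python
def best_epoch(val_losses):
    """Return the 0-based epoch index with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)

# Invented example curves: training loss keeps falling,
# validation loss turns upward after epoch 4.
train_loss = [1.00, 0.80, 0.65, 0.50, 0.40, 0.30, 0.22, 0.15]
val_loss = [1.05, 0.90, 0.78, 0.70, 0.68, 0.72, 0.80, 0.91]

epoch = best_epoch(val_loss)
print(f"best epoch: {epoch}, validation loss: {val_loss[epoch]}")
# Epochs after the best one where training still improves but validation worsens.
overfit = [e for e in range(epoch + 1, len(val_loss))
           if train_loss[e] < train_loss[e - 1] and val_loss[e] > val_loss[e - 1]]
print("overfitting epochs:", overfit)
```

Real loss curves are noisier, so in practice you would smooth them or require several worsening epochs in a row before calling it overfitting.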

Comparison Table

Split Strategy	Speaker Independent	Use Case	Computational Cost	Realism
Specified	Depends on dataset	Benchmark comparison	Low	Varies
Random	❌ No	Quick experiments	Low	Low
Speaker Split	✅ Yes	Real-world deployment	Low	High
LOSO	✅ Yes	Small datasets, per-speaker analysis	High	Very High
LOGO	Configurable	Hyperparameter tuning	Medium	Medium
K-Fold	Configurable	Robust evaluation	Medium-High	Medium

Best Practices

For Small Datasets (< 1,000 samples)

Use k-fold cross-validation (k=5 or k=10) on train+dev
Keep a separate test set that you evaluate ONLY ONCE
Consider LOSO if you have < 20 speakers

For Medium Datasets (1,000-10,000 samples)

Use speaker split with 80/10/10 (train/dev/test)
Ensure different speakers in each split
Use k-fold on training data for hyperparameter tuning

For Large Datasets (> 10,000 samples)

Simple random split or speaker split works well
Dev and test sets can be smaller (e.g., 5% each)
Focus on ensuring the test set covers the population diversity

General Tips

✅ Always keep test data separate until final evaluation
✅ Use speaker split for realistic performance estimates
✅ Use cross-validation for robust hyperparameter tuning
❌ Never tune on test data
❌ Don’t evaluate on test data multiple times (you’ll overfit!)

Advanced: Balanced Splits

For imbalanced datasets, use stratified or balanced splits:

[DATA]
emodb.split_strategy = balanced
emodb.test_size = 20
# Stratify by multiple variables with weights
balance = {'emotion':2, 'age':1, 'gender':1}
age_bins = 2
size_diff_weight = 1

See exp_emodb_split.ini for a complete example.
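
To give an intuition for stratification, the following sketch splits each label separately so that train and test keep the same class proportions. It stratifies by a single variable only and is not the algorithm behind Nkululeko's balanced strategy; stratified_split and the toy labels are assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_split(items, label_of, test_size=20, seed=0):
    """Split each label group separately so class proportions are preserved."""
    by_label = defaultdict(list)
    for item in items:
        by_label[label_of(item)].append(item)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_size / 100)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Imbalanced toy data: 50 "anger" samples, 10 "sadness" samples.
data = [("anger", i) for i in range(50)] + [("sadness", i) for i in range(10)]
train, test = stratified_split(data, label_of=lambda item: item[0])
print(Counter(label for label, _ in test))  # 10 anger, 2 sadness
```

The 5:1 ratio between the two classes is the same in the test set as in the full data.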

References

Blog post: How to Split Your Data
Documentation: Nkululeko INI file reference
Related: How to Evaluate Your Model

Summary

Choosing the right split strategy is crucial for reliable machine learning experiments:

For benchmarking: Use specified splits
For real-world deployment: Use speaker split or LOSO
For quick experiments: Use random split
For small datasets: Use k-fold cross-validation
For hyperparameter tuning: Use LOGO or k-fold

Remember: Your test set performance is only meaningful if it represents the real-world scenario your model will face!