Test Splits

This tutorial explains how to use three-way data splits (train, dev, test) in Nkululeko for proper model evaluation and to avoid overfitting.

Reference: Nkululeko: How to use train/dev/test splits

Why Three Splits?

Supervised machine learning works as follows:

Training phase: A learning algorithm adapts to a training dataset, producing a trained model
Inference phase: The model makes predictions on a test set

The Overfitting Problem

Complex models may memorize training data rather than learning generalizable patterns. This means:

✅ Great performance on training data
❌ Poor performance on new data

This phenomenon is called overfitting.

The Solution: Development Set

To prevent overfitting:

Hyperparameters are optimized using a held-out evaluation set, also called the development (dev) set, which is not used during training
Training stops when performance on the evaluation set stops improving (early stopping)
The best-performing model on the evaluation data is selected
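
The early-stopping rule can be sketched in a few lines of Python. This is a minimal illustration, not Nkululeko's implementation: the function name, the score list, and the convention that higher scores are better are all assumptions made for the example.

```python
def early_stopping_epoch(dev_scores, patience):
    """Return (best_epoch, stop_epoch) for a list of per-epoch dev scores.

    Higher scores are better. Training stops once `patience` consecutive
    epochs have passed without a new best score.
    """
    best_epoch = 0
    best_score = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(dev_scores):
        if score > best_score:
            best_score = score
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Patience never ran out: training used every epoch
    return best_epoch, len(dev_scores) - 1


# Hypothetical dev-set accuracies for six epochs
scores = [0.50, 0.58, 0.63, 0.61, 0.62, 0.60]
best, stopped = early_stopping_epoch(scores, patience=2)
print(best, stopped)  # 2 4
```

The model from epoch 2 is kept, although training ran until epoch 4 before giving up.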

However, this introduces a new problem: the model may now be overfitted to the evaluation data!

The Final Solution: Test Set

A third dataset is needed for final testing: one that has not been used at any stage of model development.

The three splits are:

Train: Used for model training
Dev (Development): Used for hyperparameter tuning and early stopping
Test: Used only for final evaluation
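
The division of labour between dev and test can be illustrated with a toy example. In the sketch below, the "model" is just a decision threshold, the training step is omitted, and all data values are made up. The point is only that candidates are compared on dev and the test set is touched once, at the end.

```python
def accuracy(threshold, samples):
    """Fraction of (value, label) pairs where (value >= threshold) == label."""
    correct = sum((value >= threshold) == label for value, label in samples)
    return correct / len(samples)


# Toy data: (feature value, true label); all values are hypothetical
dev = [(0.2, False), (0.4, False), (0.6, True), (0.9, True)]
test = [(0.1, False), (0.45, True), (0.7, True), (0.3, False)]

# Hyperparameter candidates are compared on the dev set only
candidates = [0.3, 0.5, 0.8]
best_threshold = max(candidates, key=lambda t: accuracy(t, dev))

# The test set is used once, after the choice is final
dev_acc = accuracy(best_threshold, dev)
test_acc = accuracy(best_threshold, test)
print(best_threshold, dev_acc, test_acc)  # 0.5 1.0 0.75
```

The chosen threshold looks perfect on dev because it was picked to do well there. The lower test score is the more honest number.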

Enabling Three-Way Splits

Enable train/dev/test splitting with a single option:

[EXP]
traindevtest = True

Example: EmoDB with MLP

Here’s a complete example using the EmoDB dataset (which has no predefined splits):

[EXP]
root = ./examples/results/
name = exp_emodb_traindevtest
traindevtest = True
epochs = 100

[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = speaker_split
labels = ['anger', 'happiness', 'neutral', 'sadness']
target = emotion

[FEATS]
type = ['os']
scale = standard

[MODEL]
type = mlp
layers = {'l1': 100, 'l2': 16}
patience = 10

[PLOT]
best_model = True
epoch_progression = True

Key Options

traindevtest = True: Enables three-way splitting
emodb.split_strategy = speaker_split: Splits by speaker to avoid data leakage
patience = 10: Early stopping patience (stops if no improvement for 10 epochs)
epoch_progression = True: Plots training progress over epochs
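
Because the configuration is written in INI syntax, the options can be inspected with Python's standard configparser module, for instance to double-check a setting before a long run. The sketch below assumes nothing beyond that syntax. It parses a trimmed copy of the example and does not call Nkululeko itself.

```python
import ast
import configparser

# Trimmed copy of the example configuration above
CONFIG_TEXT = """
[EXP]
root = ./examples/results/
name = exp_emodb_traindevtest
traindevtest = True
epochs = 100

[DATA]
databases = ['emodb']
emodb.split_strategy = speaker_split
labels = ['anger', 'happiness', 'neutral', 'sadness']

[MODEL]
type = mlp
layers = {'l1': 100, 'l2': 16}
patience = 10
"""

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

# Scalar options can be read with the typed getters
three_way = config.getboolean("EXP", "traindevtest", fallback=False)
patience = config.getint("MODEL", "patience")

# List and dict values are stored as text; literal_eval turns them into objects
labels = ast.literal_eval(config["DATA"]["labels"])
layers = ast.literal_eval(config["MODEL"]["layers"])
strategy = config["DATA"]["emodb.split_strategy"]

print(three_way, patience, strategy)
print(labels)
print(layers)
```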

Split Strategies

With traindevtest = True, you can choose among several split strategies:

Automatic Speaker Split

emodb.split_strategy = speaker_split

Automatically divides the speakers into train, dev, and test sets so that no speaker appears in more than one set.
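
What a speaker-disjoint split means can be shown with a short sketch. The function below is hypothetical and is not Nkululeko's algorithm: the 20 % dev and test fractions, the seed, and the shuffling approach are assumptions chosen only for illustration.

```python
import random


def split_speakers(speakers, dev_fraction=0.2, test_fraction=0.2, seed=0):
    """Partition speaker IDs into disjoint train/dev/test lists."""
    shuffled = sorted(set(speakers))
    random.Random(seed).shuffle(shuffled)
    # Reserve at least one speaker each for dev and test
    n_test = max(1, round(len(shuffled) * test_fraction))
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test


# The ten EmoDB speaker IDs
emodb_speakers = [3, 8, 9, 10, 11, 12, 13, 14, 15, 16]
train, dev, test = split_speakers(emodb_speakers)
print(len(train), len(dev), len(test))  # 6 2 2
```

Every utterance then follows its speaker into exactly one set, so the model is never evaluated on a voice it has heard during training.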

Manual Speaker Assignment

emodb.split_strategy = speakers_stated
emodb.train = [3, 9, 10, 11, 13, 16]
emodb.dev = [14, 8]
; Test gets remaining speakers
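
For the assignment above, the speakers left over for the test set can be worked out with plain set arithmetic. This is only a check of the example values, not Nkululeko code.

```python
# The ten EmoDB speaker IDs
emodb_speakers = {3, 8, 9, 10, 11, 12, 13, 14, 15, 16}

# Values taken from the configuration above
train_speakers = {3, 9, 10, 11, 13, 16}
dev_speakers = {14, 8}

# A speaker must not be listed in both train and dev
assert train_speakers.isdisjoint(dev_speakers)

# Everything not assigned to train or dev ends up in test
test_speakers = sorted(emodb_speakers - train_speakers - dev_speakers)
print(test_speakers)  # [12, 15]
```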

Pre-defined Splits

For datasets with existing splits (like MELD):

[DATA]
databases = ['train', 'dev', 'test']
train = ./data/meld/meld_train.csv
train.split_strategy = train
dev = ./data/meld/meld_dev.csv
dev.split_strategy = train
test = ./data/meld/meld_test.csv
test.split_strategy = test

Output and Evaluation

With traindevtest = True, Nkululeko produces three evaluations:

Best model on dev set: Model selected by early stopping
Best model on test set: Same model, evaluated on held-out test data
Last model on dev set: Final epoch model performance

Interpreting Results

Test set performance is typically lower than dev set performance because:

The model and its hyperparameters were selected to perform well on the dev set
The test set is truly unseen data that played no part in model selection

For this reason, the test result is the most realistic estimate of real-world performance.

Example Files

exp_emodb_traindevtest.ini: Basic train/dev/test with XGB
exp_emodb_traindevtest_split.ini: Manual speaker assignment with MLP

Running the Experiment

python -m nkululeko.nkululeko --config examples/exp_emodb_traindevtest.ini

Tips

Always use speaker splits: Avoid having the same speaker in more than one of the train, dev, and test sets
Set patience appropriately: Too low may stop training too early; too high wastes computation
Report test set results: Only the test set gives unbiased performance estimates
Use with neural networks: Train/dev/test splits are most important for models that can overfit (MLP, CNN, Transformers)
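
The first tip can be checked mechanically once each sample's speaker and split are known. The sketch below is a hypothetical helper, not part of Nkululeko, and the rows are made-up (speaker, split) pairs.

```python
from collections import defaultdict


def speakers_in_multiple_splits(rows):
    """Map each leaking speaker to the sorted list of splits it appears in."""
    splits_per_speaker = defaultdict(set)
    for speaker, split in rows:
        splits_per_speaker[speaker].add(split)
    return {
        speaker: sorted(splits)
        for speaker, splits in splits_per_speaker.items()
        if len(splits) > 1
    }


# Hypothetical (speaker, split) pairs; speaker 9 leaks into the test set
rows = [(3, "train"), (9, "train"), (14, "dev"), (12, "test"), (9, "test")]
leaks = speakers_in_multiple_splits(rows)
print(leaks)  # {9: ['test', 'train']}
```

An empty result means the splits are speaker-disjoint.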