# How to Split Your Data This tutorial explains different data splitting strategies in Nkululeko for supervised machine learning experiments. Based on the [blog post by Felix Burkhardt](http://blog.syntheticspeech.de/2022/12/01/how-to-split-you-data/). ## Why Split Data? In supervised machine learning, you typically need three kinds of datasets: 1. **Train data**: To teach the model the relation between data and labels 2. **Dev data** (development): To tune meta-parameters of your model (e.g., number of neurons, batch size, learning rate) 3. **Test data**: To evaluate your model ONCE at the end to check on generalization All of this is to prevent **overfitting** on your train and/or dev data. If you've used your test data for a while, you might need to find a new set, as chances are high that you overfitted on your test during experiments. ## Rules for Good Data Splits - Train and dev can be from the same set, but the **test set is ideally from a different database** - If you don't have much data: use an **80/20/20% split** - If you have masses of data: use only so much dev and test that your population seems covered - If you have really little data: use **k-fold cross-validation** for train and dev, but the test set should still be separate ## Split Strategies in Nkululeko Nkululeko offers several split strategies configured via `split_strategy` in the `[DATA]` section: ### 1. Specified Split Use predefined train and test files. Ideal when you have a standard benchmark dataset with official splits. **Configuration:** ```ini [DATA] emodb.split_strategy = specified emodb.test_tables = ['emotion.categories.test.gold_standard'] emodb.train_tables = ['emotion.categories.train.gold_standard'] ``` **Example:** [exp_emodb_split_specified.ini](../examples/exp_emodb_split_specified.ini) **Run:** ```bash python -m nkululeko.nkululeko --config examples/exp_emodb_split_specified.ini ``` **When to use:** - You have official benchmark splits - You want reproducible comparisons with other research - Dataset provides predefined train/test files --- ### 2. Random Split Randomly assign samples to train and test sets. Simple but doesn't guarantee speaker independence. **Configuration:** ```ini [DATA] emodb.split_strategy = random emodb.test_size = 20 # 20% for test ``` **Example:** [exp_emodb_split_random.ini](../examples/exp_emodb_split_random.ini) **Run:** ```bash python -m nkululeko.nkululeko --config examples/exp_emodb_split_random.ini ``` **When to use:** - Quick experiments - Large datasets where speaker overlap is less critical - When speaker information is not available **Caution:** May lead to speaker overlap between train and test, resulting in optimistic performance estimates. --- ### 3. Speaker Split Ensures speakers in train and test are different (speaker-independent evaluation). Critical for real-world generalization. **Configuration:** ```ini [DATA] emodb.split_strategy = speaker_split emodb.test_size = 20 # 20% for test ``` **Example:** [exp_emodb_split_speaker.ini](../examples/exp_emodb_split_speaker.ini) **Run:** ```bash python -m nkululeko.nkululeko --config examples/exp_emodb_split_speaker.ini ``` **When to use:** - Real-world deployment scenarios - You want to test generalization to unseen speakers - Gold standard for speaker-independent evaluation **Why it matters:** Prevents the model from memorizing speaker characteristics, forcing it to learn genuine emotion patterns. --- ### 4. LOSO (Leave-One-Speaker-Out) Cross-validation where each speaker is held out once as the test set. Tests generalization to every speaker. **Configuration:** ```ini [DATA] emodb.split_strategy = speaker_split emodb.test_size = 10 # Percentage for initial split [MODEL] logo = 10 # Number of speakers for LOSO cross-validation ``` **Example:** [exp_emodb_split_loso.ini](../examples/exp_emodb_split_loso.ini) **Run:** ```bash python -m nkululeko.nkululeko --config examples/exp_emodb_split_loso.ini ``` **When to use:** - Small datasets with few speakers - You want robust speaker-independent evaluation - You need per-speaker performance analysis **How it works:** - Uses `speaker_split` strategy to ensure speaker independence - The `logo` parameter specifies the number of speakers (folds) - For EmoDB with 10 speakers, `logo = 10` means each fold leaves one speaker out (LOSO) - Trains 10 models, each testing on a different speaker **Note:** Computationally expensive for datasets with many speakers. The number specified in `logo` should match the number of speakers in your dataset. --- ### 5. LOGO (Leave-One-Group-Out) Cross-validation on the training data by leaving out one group at a time. Used for meta-parameter tuning. **Configuration:** ```ini [DATA] emodb.split_strategy = random # First split train/test emodb.test_size = 20 [MODEL] logo = 4 # Leave-One-Group-Out with 4 groups ``` **Example:** [exp_emodb_split_logo.ini](../examples/exp_emodb_split_logo.ini) **Run:** ```bash python -m nkululeko.nkululeko --config examples/exp_emodb_split_logo.ini ``` **When to use:** - Tuning model hyperparameters - You want more robust validation than single train/dev split - Combined with another split strategy for train/test --- ### 6. K-Fold Cross-Validation Splits training data into K folds and trains K models, using each fold as validation once. **Configuration:** ```ini [DATA] emodb.split_strategy = random # First split train/test emodb.test_size = 20 [MODEL] k_fold_cross = 5 # 5-fold cross-validation ``` **Example:** [exp_emodb_split_kfold.ini](../examples/exp_emodb_split_kfold.ini) **Run:** ```bash python -m nkululeko.nkululeko --config examples/exp_emodb_split_kfold.ini ``` **When to use:** - Small to medium datasets - You want robust performance estimates - Comparing different models or feature sets **Common values:** 5-fold or 10-fold --- ## Exercise 1: Compare Split Strategies Try all split methods with EmoDB using OpenSMILE features and XGBoost: ```bash # 1. Specified split python -m nkululeko.nkululeko --config examples/exp_emodb_split_specified.ini # 2. Random split python -m nkululeko.nkululeko --config examples/exp_emodb_split_random.ini # 3. Speaker split python -m nkululeko.nkululeko --config examples/exp_emodb_split_speaker.ini # 4. LOSO python -m nkululeko.nkululeko --config examples/exp_emodb_split_loso.ini # 5. LOGO python -m nkululeko.nkululeko --config examples/exp_emodb_split_logo.ini # 6. 5-fold cross-validation python -m nkululeko.nkululeko --config examples/exp_emodb_split_kfold.ini ``` **Question:** Which split strategy gives the best performance? Why? **Expected findings:** - **Random split** typically gives the highest performance (but least realistic) - **Speaker split / LOSO** give more conservative (realistic) performance - **K-fold / LOGO** provide robust estimates with confidence intervals --- ## Exercise 2: Detecting Overfitting Run a neural network experiment to visualize when overfitting starts: **Configuration:** [exp_emodb_split_overfitting.ini](../examples/exp_emodb_split_overfitting.ini) ```bash python -m nkululeko.nkululeko --config examples/exp_emodb_split_overfitting.ini ``` This configuration: - Uses an MLP with layers `{l1: 1024, l2: 64}` - Trains for 200 epochs - Plots epoch progression and identifies the best model **What to look for:** 1. Open the epoch progression plot in `examples/results/exp_emodb_split_overfitting/images/` 2. Find where training loss continues decreasing but validation loss starts increasing 3. That's where overfitting begins! --- ## Comparison Table | Split Strategy | Speaker Independent | Use Case | Computational Cost | Realism | |---------------|-------------------|----------|-------------------|---------| | **Specified** | Depends on dataset | Benchmark comparison | Low | Varies | | **Random** | ❌ No | Quick experiments | Low | Low | | **Speaker Split** | ✅ Yes | Real-world deployment | Low | High | | **LOSO** | ✅ Yes | Small datasets, per-speaker analysis | High | Very High | | **LOGO** | Configurable | Hyperparameter tuning | Medium | Medium | | **K-Fold** | Configurable | Robust evaluation | Medium-High | Medium | --- ## Best Practices ### For Small Datasets (< 1000 samples) 1. Use **k-fold cross-validation** (k=5 or k=10) on train+dev 2. Keep a separate test set that you evaluate ONLY ONCE 3. Consider **LOSO** if you have < 20 speakers ### For Medium Datasets (1000-10,000 samples) 1. Use **speaker split** with 80/10/10 (train/dev/test) 2. Ensure different speakers in each split 3. Use **k-fold** on training data for hyperparameter tuning ### For Large Datasets (> 10,000 samples) 1. Simple **random split** or **speaker split** works well 2. Dev and test sets can be smaller (e.g., 5% each) 3. Focus on ensuring the test set covers the population diversity ### General Tips - ✅ **Always** keep test data separate until final evaluation - ✅ Use **speaker split** for realistic performance estimates - ✅ Use **cross-validation** for robust hyperparameter tuning - ❌ **Never** tune on test data - ❌ Don't evaluate on test data multiple times (you'll overfit!) --- ## Advanced: Balanced Splits For imbalanced datasets, use stratified or balanced splits: ```ini [DATA] emodb.split_strategy = balanced emodb.test_size = 20 # Stratify by multiple variables with weights balance = {'emotion':2, 'age':1, 'gender':1} age_bins = 2 size_diff_weight = 1 ``` See [exp_emodb_split.ini](../examples/exp_emodb_split.ini) for a complete example. --- ## References - Blog post: [How to Split Your Data](http://blog.syntheticspeech.de/2022/12/01/how-to-split-you-data/) - Documentation: [Nkululeko INI file reference](https://github.com/felixbur/nkululeko/blob/main/ini_file.md#data) - Related: [How to Evaluate Your Model](http://blog.syntheticspeech.de/2022/11/28/how-to-evaluate-your-model/) --- ## Summary Choosing the right split strategy is crucial for reliable machine learning experiments: - **For benchmarking**: Use specified splits - **For real-world deployment**: Use speaker split or LOSO - **For quick experiments**: Use random split - **For small datasets**: Use k-fold cross-validation - **For hyperparameter tuning**: Use LOGO or k-fold Remember: Your test set performance is only meaningful if it represents the real-world scenario your model will face!