Data Balancing in Nkululeko
Data imbalance is a common problem in machine learning, particularly in speech emotion recognition and other audio classification tasks. When some classes have significantly more samples than others, models tend to be biased toward the majority classes, resulting in poor performance on minority classes.
Nkululeko provides a comprehensive set of balancing techniques to address this issue through the DataBalancer class, supporting various over-sampling, under-sampling, and combination methods.
Overview
The balancing functionality in nkululeko allows you to:
Automatically detect class imbalance in your datasets
Apply various balancing techniques to improve model performance
Compare different balancing methods using the flags module
Preserve data integrity while addressing imbalance issues
Quick Start
To quickly try balancing techniques, you can use the provided demo example:
# After cloning the repository, navigate into it
cd nkululeko
# Run the balancing demo with SMOTE
python -m nkululeko.nkululeko --config examples/exp_balancing_demo.ini
This demo uses the test dataset included with nkululeko and applies SMOTE balancing to show how class distribution changes.
To use balancing in your own nkululeko experiment, simply add the balancing parameter to your [FEATS] section:
[FEATS]
type = ['os']
balancing = smote
scale = standard
Supported Balancing Methods
Nkululeko supports three categories of balancing techniques:
1. Over-sampling Methods
These methods increase the number of minority class samples:
ros - Random Over-Sampling: Randomly duplicates minority class samples
smote - SMOTE: Generates synthetic samples using k-nearest neighbors
adasyn - ADASYN: Adaptive synthetic sampling with density-based generation
borderlinesmote - Borderline SMOTE: Focuses on borderline samples
svmsmote - SVM SMOTE: Uses SVM to identify support vectors for synthesis
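To make the idea behind SMOTE-style synthesis concrete, here is a minimal standard-library sketch. It is not nkululeko's or imbalanced-learn's implementation; the function name `smote_like_samples` and its parameters are hypothetical. It only illustrates the core step: pick a minority sample, pick one of its k nearest minority neighbours, and place a new point on the line segment between them.

```python
import math
import random


def smote_like_samples(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Illustrative only: assumes at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # The new point lies on the segment between base and neighbour
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Four minority samples on the corners of the unit square
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_samples(minority, n_new=6, k=2)
```

Because every synthetic point is an interpolation between two existing minority samples, it always stays inside the region the minority class already covers. Random over-sampling (ros), by contrast, would only repeat the four original points.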
2. Under-sampling Methods
These methods reduce the number of majority class samples:
randomundersampler - Random Under-Sampling: Randomly removes majority class samples
clustercentroids - Cluster Centroids: Replaces clusters with their centroids
editednearestneighbours - Edited Nearest Neighbours: Removes inconsistent samples
tomeklinks - Tomek Links: Removes Tomek link pairs
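A Tomek link is a pair of samples from different classes that are each other's nearest neighbour. The following sketch finds such pairs with a brute-force search. It is a toy illustration under that definition, not the code nkululeko runs; the helper names and the toy data are made up for the example.

```python
import math


def nearest_index(points, i):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )


def find_tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbours and carry different labels."""
    links = []
    for i in range(len(points)):
        j = nearest_index(points, i)
        if i < j and labels[i] != labels[j] and nearest_index(points, j) == i:
            links.append((i, j))
    return links


points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0), (3.0, 3.0)]
labels = ["neutral", "neutral", "neutral", "angry", "angry"]

links = find_tomek_links(points, labels)

# One common variant drops only the majority-class member of each link
# ("neutral" is the majority class in this toy data).
to_remove = {i for pair in links for i in pair if labels[i] == "neutral"}
```

Here samples 2 and 3 sit very close together but belong to different classes, so they form the only Tomek link. Removing the majority-class member cleans up the class boundary without touching the rest of the data.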
3. Combination Methods
These methods combine over-sampling and under-sampling:
smoteenn - SMOTE + Edited Nearest Neighbours
smotetomek - SMOTE + Tomek Links
Working Examples
Basic SMOTE Balancing
[EXP]
root = ./examples/results/
name = exp_smote_balancing
runs = 1
epochs = 10
[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = specified
emodb.test_tables = ['emotion.categories.test.gold_standard']
emodb.train_tables = ['emotion.categories.train.gold_standard']
target = emotion
labels = ['anger', 'happiness', 'sadness', 'neutral']
[FEATS]
type = ['os']
balancing = smote
scale = standard
[MODEL]
type = xgb
Random Over-Sampling Example
[EXP]
root = ./examples/results/
name = exp_ros_balancing
[DATA]
databases = ['polish_train', 'polish_dev', 'polish_test']
polish_train = ./data/polish/polish_train.csv
polish_train.type = csv
polish_train.split_strategy = train
polish_dev = ./data/polish/polish_dev.csv
polish_dev.type = csv
polish_dev.split_strategy = train
polish_test = ./data/polish/polish_test.csv
polish_test.type = csv
polish_test.split_strategy = test
target = emotion
[FEATS]
type = ['os']
balancing = ros
[MODEL]
type = svm
Under-sampling with Cluster Centroids
[EXP]
root = ./examples/results/
name = exp_clustercentroids_balancing
[DATA]
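databases = ['train', 'dev', 'test']
train = ./data/polish/polish_train.csv
train.type = csv
train.split_strategy = train
dev = ./data/polish/polish_dev.csv
dev.type = csv
dev.split_strategy = train
test = ./data/polish/polish_test.csv
test.type = csv
test.split_strategy = test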
target = emotion
[FEATS]
type = ['praat']
balancing = clustercentroids
scale = robust
[MODEL]
type = mlp
Choosing the Right Balancing Method
When to Use Over-sampling
Use over-sampling when:
You have limited data and don’t want to lose samples
The minority classes have sufficient diversity to generate meaningful synthetic samples
Computational resources allow for larger datasets
Recommended methods:
SMOTE: Good general-purpose choice, works well with most datasets
ADASYN: Better for highly imbalanced datasets
ROS: Simple and fast, good baseline
When to Use Under-sampling
Use under-sampling when:
You have abundant data in majority classes
Computational resources are limited
The majority class contains redundant or noisy samples
Recommended methods:
Cluster Centroids: Preserves data distribution while reducing size
Tomek Links: Removes noisy and borderline samples
Random Under-sampling: Simple and fast baseline
When to Use Combination Methods
Use combination methods when:
You want the benefits of both approaches
The dataset has complex imbalance patterns
You need to clean noisy samples while adding synthetic ones
Recommended methods:
SMOTE + Tomek: Generates synthetic samples then removes noise
SMOTE + ENN: More aggressive noise removal
Comparing Balancing Methods
Use the flags module to systematically compare different balancing techniques:
[EXP]
root = ./examples/results/
name = balancing_comparison
[DATA]
databases = ['mydata']
mydata = ./data/mydata.csv
target = emotion
[FEATS]
type = ['os']
scale = ['standard']
[MODEL]
type = ['xgb']
[FLAGS]
balancing = ['none', 'ros', 'smote', 'adasyn', 'clustercentroids', 'smoteenn']
This will run experiments with all specified balancing methods and show you which performs best on your dataset.
Understanding the Output
When balancing is applied, you’ll see output like this:
Balancing features with: smote
Original dataset size: 1200
Original class distribution: {'happy': 400, 'sad': 300, 'angry': 300, 'neutral': 200}
Balanced dataset size: 1600 (was 1200)
New class distribution: {'happy': 400, 'sad': 400, 'angry': 400, 'neutral': 400}
Class distribution after smote balancing: {'happy': 400, 'sad': 400, 'angry': 400, 'neutral': 400}
Key Information:
Original size: Number of samples before balancing
Original distribution: Number of samples per class before balancing
Balanced size: Number of samples after balancing
New distribution: Number of samples per class after balancing
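The numbers in the sample output can be checked by hand. The sketch below reproduces the arithmetic, assuming the over-sampler raises every class to the size of the largest class, which is what the example log shows. The variable names are illustrative.

```python
from collections import Counter

# Class counts taken from the sample output above
original = Counter({"happy": 400, "sad": 300, "angry": 300, "neutral": 200})

# Over-sampling target: the size of the largest class
target = max(original.values())

# Synthetic samples needed per class to reach the target
added = {label: target - count for label, count in original.items()}

# Class counts after balancing
balanced = {label: count + added[label] for label, count in original.items()}

original_size = sum(original.values())
balanced_size = sum(balanced.values())

# Ratio between the largest and the smallest class before balancing
imbalance_ratio = target / min(original.values())

print(original_size, balanced_size, added, imbalance_ratio)
```

The 400 added samples (0 + 100 + 100 + 200) account for the growth from 1200 to 1600 reported in the log.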
Advanced Configuration
Custom Random State
The balancing process uses a random state for reproducibility. It is derived from your experiment configuration, so no separate balancing option is needed:
# This is handled automatically by nkululeko's experiment setup
# The random state is derived from your experiment configuration
Method-Specific Parameters
Some balancing methods accept additional parameters. While nkululeko uses sensible defaults, you can modify the source code for custom behavior:
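# Example: Custom SMOTE configuration
# (a fragment from inside the balancer class, where self.random_state is defined)
from imblearn.over_sampling import SMOTE

sampler = SMOTE(
    random_state=self.random_state,
    k_neighbors=5,  # Number of neighbors for synthesis
    sampling_strategy='auto'  # Which classes to balance
)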
Best Practices
1. Start with SMOTE
SMOTE is a good default choice for most audio classification tasks:
[FEATS]
balancing = smote
2. Consider Data Size
Small datasets (< 1000 samples): Use over-sampling (ROS, SMOTE)
Large datasets (> 10000 samples): Consider under-sampling (cluster centroids)
Medium datasets: Try combination methods (SMOTE + Tomek)
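The size thresholds above can be written down as a small helper when you script many experiments. This is only a rule-of-thumb encoding of the three bullet points; `suggest_balancing` is a hypothetical function, not part of nkululeko.

```python
def suggest_balancing(n_samples):
    """Return candidate balancing method names for a given training set size.

    Thresholds follow the rule of thumb above; treat the result as a
    starting point, not a guarantee.
    """
    if n_samples < 1000:
        # Small dataset: avoid throwing samples away
        return ["ros", "smote"]
    if n_samples > 10000:
        # Large dataset: shrinking the majority classes is affordable
        return ["clustercentroids"]
    # Medium dataset: combine over-sampling with cleaning
    return ["smotetomek"]


candidates = suggest_balancing(1200)
```

The returned names are the values accepted by the balancing option listed under Supported Balancing Methods.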
3. Validate on Separate Test Set
Balance only the training data and leave the test set at its original class distribution, so that performance estimates stay realistic:
[DATA]
# Training data will be balanced
train.split_strategy = train
# Test data remains unbalanced
test.split_strategy = test
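The principle is easy to see in a toy example. The sketch below implements plain random over-sampling with the standard library and applies it to the training split only. It is an illustration of the idea, not nkululeko's code, and all names and data are invented for the example.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        # Candidates to duplicate: the existing samples of this class
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(label)
    return out_samples, out_labels


# Toy data: 8 'neutral' and 2 'angry' items in train, plus a test split
train_x = list(range(10))
train_y = ["neutral"] * 8 + ["angry"] * 2
test_x = [100, 101, 102, 103]
test_y = ["neutral", "neutral", "neutral", "angry"]

# Only the training split is balanced; the test split is left alone
bal_x, bal_y = random_oversample(train_x, train_y)

print(Counter(bal_y))   # training classes are now equal in size
print(Counter(test_y))  # test distribution is unchanged
```

If the test split were balanced as well, the reported scores would describe a class distribution the model will never see in practice.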
4. Monitor Class Distribution
Check the balancing output to ensure the method is working as expected:
Over-sampling should increase dataset size
Under-sampling should decrease dataset size
Check that target classes are actually balanced
5. Compare Multiple Methods
Use the flags module to systematically compare balancing approaches:
[FLAGS]
balancing = ['none', 'smote', 'adasyn', 'clustercentroids']
models = ['xgb']
features = ['os']
Troubleshooting
Common Issues
“Unknown balancing algorithm” error
Check spelling of the balancing method name
Ensure the method is in the supported list
Memory errors with over-sampling
Try under-sampling methods instead
Reduce feature dimensions before balancing
Poor results after balancing
Try different balancing methods
Check if your features are suitable for the chosen method
Ensure test set remains unbalanced
SMOTE failing with sparse data
Try ADASYN instead of SMOTE
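Increase the number of samples in minority classes (SMOTE needs more samples in each minority class than its k_neighbors setting, 5 in the example above)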
Use ROS as a fallback
Error Messages
# If you see this error:
"Unknown balancing algorithm: invalid_method"
# Available methods: ['ros', 'smote', 'adasyn', ...]
# Solution: use one of the supported methods instead of 'invalid_method'
balancing = smote
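A typo in the method name can be caught before an experiment starts. The sketch below reads a config with Python's `configparser` and checks the `balancing` value against the method names listed under Supported Balancing Methods. The function `check_balancing` is hypothetical and not part of nkululeko, and whether nkululeko reads its configs with the same parser settings is an assumption.

```python
import configparser
import difflib

# Method names copied from the "Supported Balancing Methods" section
SUPPORTED = [
    "ros", "smote", "adasyn", "borderlinesmote", "svmsmote",
    "randomundersampler", "clustercentroids",
    "editednearestneighbours", "tomeklinks",
    "smoteenn", "smotetomek",
]


def check_balancing(config_text):
    """Return (is_valid, hint) for the balancing value in the [FEATS] section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    method = parser.get("FEATS", "balancing", fallback=None)
    if method is None:
        return True, "no balancing requested"
    method = method.strip().lower()
    if method in SUPPORTED:
        return True, method
    # Suggest the closest supported name, if any is similar enough
    guess = difflib.get_close_matches(method, SUPPORTED, n=1)
    if guess:
        return False, f"did you mean '{guess[0]}'?"
    return False, "see the supported list"


good = check_balancing("[FEATS]\nbalancing = smote\n")
typo = check_balancing("[FEATS]\nbalancing = smoot\n")
```

Note that `configparser` does not strip inline comments by default, so a value written as `smote # some note` on one line is read with the comment attached and fails this check. Putting comments on their own line avoids the problem.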
Integration with Other Features
With Feature Scaling
Balancing works well with feature scaling:
[FEATS]
type = ['os']
balancing = smote
# Scaling is applied after balancing
scale = standard
With Multiple Features
Balancing is applied to the combined feature space:
[FEATS]
# Features are combined first
type = ['os', 'praat']
# Then balancing is applied
balancing = adasyn
With Cross-Validation
Balancing is applied within each fold to prevent data leakage:
[EXP]
# Each run applies balancing independently
runs = 5
Performance Considerations
Computational Impact
Over-sampling: Increases dataset size → longer training time
Under-sampling: Decreases dataset size → faster training time
Combination methods: Variable impact depending on data distribution
Memory Usage
SMOTE/ADASYN: May require significant memory for large datasets
Cluster Centroids: Reduces memory requirements
ROS: Minimal additional memory (just duplicates existing samples)
Timing
The balancing process typically adds minimal overhead compared to feature extraction and model training.
Real-World Examples
Emotion Recognition with Imbalanced Data
# Example: EmoDb dataset with balanced emotions
[EXP]
root = ./results/emotion_balanced/
name = emodb_balanced_experiment
[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = specified
target = emotion
labels = ['anger', 'happiness', 'sadness', 'neutral']
[FEATS]
type = ['os']
set = eGeMAPSv02
balancing = smote
scale = standard
[MODEL]
type = xgb
save = True
[PLOT]
name = balanced_emotion_results
Age Detection with Severe Imbalance
# Example: Age groups with severe imbalance
[EXP]
name = age_detection_balanced
[DATA]
databases = ['age_data']
age_data = ./data/age/age_dataset.csv
target = age_group
labels = ['child', 'adult', 'elderly']
[FEATS]
type = ['praat']
# ADASYN works well with severe imbalance
balancing = adasyn
scale = robust
[MODEL]
type = svm
kernel = rbf
Comparing All Methods
# Example: Systematic comparison of all balancing methods
[EXP]
root = ./results/balancing_study/
name = comprehensive_balancing_study
[DATA]
databases = ['study_data']
study_data = ./data/study/dataset.csv
target = label
[FEATS]
type = ['os']
scale = ['standard']
[MODEL]
type = ['xgb']
[FLAGS]
balancing = [
    'none', 'ros', 'smote', 'adasyn', 'borderlinesmote',
    'randomundersampler', 'clustercentroids', 'tomeklinks',
    'smoteenn', 'smotetomek'
    ]
Conclusion
Data balancing is a crucial step in building robust audio classification models. Nkululeko’s comprehensive balancing support allows you to:
Easily apply various balancing techniques with a single configuration parameter
Systematically compare different methods using the flags module
Maintain reproducibility with consistent random states
Monitor results with detailed logging and distribution reporting
Start with SMOTE for most applications, but don’t hesitate to explore other methods based on your specific dataset characteristics and computational constraints. The flags module makes it easy to find the optimal balancing strategy for your particular use case.