# Data Balancing in Nkululeko

Data imbalance is a common problem in machine learning, particularly in speech emotion recognition and other audio classification tasks. When some classes have significantly more samples than others, models tend to be biased toward the majority classes, resulting in poor performance on minority classes.

Nkululeko provides a comprehensive set of balancing techniques to address this issue through the `DataBalancer` class, supporting various over-sampling, under-sampling, and combination methods.

## Overview

The balancing functionality in nkululeko allows you to:

- **Automatically detect** class imbalance in your datasets
- **Apply various balancing techniques** to improve model performance
- **Compare different balancing methods** using the flags module
- **Preserve data integrity** while addressing imbalance issues

## Quick Start

To quickly try balancing techniques, you can use the provided demo example:

```bash
# Clone the repository and navigate to it
cd nkululeko

# Run the balancing demo with SMOTE
python -m nkululeko.nkululeko --config examples/exp_balancing_demo.ini
```

This demo uses the test dataset included with nkululeko and applies SMOTE balancing to show how class distribution changes.

To use balancing in your own nkululeko experiment, simply add the `balancing` parameter to your `[FEATS]` section:

```ini
[FEATS]
type = ['os']
balancing = smote
scale = standard
```

## Supported Balancing Methods

Nkululeko supports three categories of balancing techniques:

### 1. Over-sampling Methods
These methods increase the number of minority class samples:

- **`ros`** - Random Over-Sampling: Randomly duplicates minority class samples
- **`smote`** - SMOTE: Generates synthetic samples using k-nearest neighbors
- **`adasyn`** - ADASYN: Adaptive synthetic sampling with density-based generation
- **`borderlinesmote`** - Borderline SMOTE: Focuses on borderline samples
- **`svmsmote`** - SVM SMOTE: Uses SVM to identify support vectors for synthesis

### 2. Under-sampling Methods
These methods reduce the number of majority class samples:

- **`randomundersampler`** - Random Under-Sampling: Randomly removes majority class samples
- **`clustercentroids`** - Cluster Centroids: Replaces clusters with their centroids
- **`editednearestneighbours`** - Edited Nearest Neighbours: Removes inconsistent samples
- **`tomeklinks`** - Tomek Links: Removes Tomek link pairs

### 3. Combination Methods
These methods combine over-sampling and under-sampling:

- **`smoteenn`** - SMOTE + Edited Nearest Neighbours
- **`smotetomek`** - SMOTE + Tomek Links

## Working Examples

### Basic SMOTE Balancing

```ini
[EXP]
root = ./examples/results/
name = exp_smote_balancing
runs = 1
epochs = 10

[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = specified
emodb.test_tables = ['emotion.categories.test.gold_standard']
emodb.train_tables = ['emotion.categories.train.gold_standard']
target = emotion
labels = ['anger', 'happiness', 'sadness', 'neutral']

[FEATS]
type = ['os']
balancing = smote
scale = standard

[MODEL]
type = xgb
```

### Random Over-Sampling Example

```ini
[EXP]
root = ./examples/results/
name = exp_ros_balancing

[DATA]
databases = ['polish_train', 'polish_dev', 'polish_test']
polish_train = ./data/polish/polish_train.csv
polish_train.type = csv
polish_train.split_strategy = train
polish_dev = ./data/polish/polish_dev.csv
polish_dev.type = csv
polish_dev.split_strategy = train
polish_test = ./data/polish/polish_test.csv
polish_test.type = csv
polish_test.split_strategy = test
target = emotion

[FEATS]
type = ['os']
balancing = ros

[MODEL]
type = svm
```

### Under-sampling with Cluster Centroids

```ini
[EXP]
root = ./examples/results/
name = exp_clustercentroids_balancing

[DATA]
databases = ['train', 'dev', 'test']
train = ./data/polish/polish_train.csv
train.type = csv
train.split_strategy = train
target = emotion

[FEATS]
type = ['praat']
balancing = clustercentroids
scale = robust

[MODEL]
type = mlp
```

## Choosing the Right Balancing Method

### When to Use Over-sampling

**Use over-sampling when:**
- You have limited data and don't want to lose samples
- The minority classes have sufficient diversity to generate meaningful synthetic samples
- Computational resources allow for larger datasets

**Recommended methods:**
- **SMOTE**: Good general-purpose choice, works well with most datasets
- **ADASYN**: Better for highly imbalanced datasets
- **ROS**: Simple and fast, good baseline

### When to Use Under-sampling

**Use under-sampling when:**
- You have abundant data in majority classes
- Computational resources are limited
- The majority class contains redundant or noisy samples

**Recommended methods:**
- **Cluster Centroids**: Preserves data distribution while reducing size
- **Tomek Links**: Removes noisy and borderline samples
- **Random Under-sampling**: Simple and fast baseline

### When to Use Combination Methods

**Use combination methods when:**
- You want the benefits of both approaches
- The dataset has complex imbalance patterns
- You need to clean noisy samples while adding synthetic ones

**Recommended methods:**
- **SMOTE + Tomek**: Generates synthetic samples then removes noise
- **SMOTE + ENN**: More aggressive noise removal

## Comparing Balancing Methods

Use the flags module to systematically compare different balancing techniques:

```ini
[EXP]
root = ./examples/results/
name = balancing_comparison

[DATA]
databases = ['mydata']
mydata = ./data/mydata.csv
target = emotion

[FEATS]
type = ['os']
scale = ['standard']

[MODEL]
type = ['xgb']

[FLAGS]
balancing = ['none', 'ros', 'smote', 'adasyn', 'clustercentroids', 'smoteenn']
```

This will run experiments with all specified balancing methods and show you which performs best on your dataset.

## Understanding the Output

When balancing is applied, you'll see output like this:

```
Balancing features with: smote
Original dataset size: 1200
Original class distribution: {'happy': 400, 'sad': 300, 'angry': 300, 'neutral': 200}
Balanced dataset size: 1600 (was 1200)
New class distribution: {'happy': 400, 'sad': 400, 'angry': 400, 'neutral': 400}
Class distribution after smote balancing: {'happy': 400, 'sad': 400, 'angry': 400, 'neutral': 400}
```

### Key Information:
- **Original size**: Number of samples before balancing
- **Original distribution**: Number of samples per class before balancing
- **Balanced size**: Number of samples after balancing
- **New distribution**: Number of samples per class after balancing

## Advanced Configuration

### Custom Random State

The balancing process uses a random state for reproducibility. You can control this in your experiment configuration:

```python
# This is handled automatically by nkululeko's experiment setup
# The random state is derived from your experiment configuration
```

### Method-Specific Parameters

Some balancing methods accept additional parameters. While nkululeko uses sensible defaults, you can modify the source code for custom behavior:

```python
# Example: Custom SMOTE configuration
sampler = SMOTE(
    random_state=self.random_state,
    k_neighbors=5,  # Number of neighbors for synthesis
    sampling_strategy='auto'  # Which classes to balance
)
```

## Best Practices

### 1. Start with SMOTE
SMOTE is a good default choice for most audio classification tasks:
```ini
[FEATS]
balancing = smote
```

### 2. Consider Data Size
- **Small datasets** (< 1000 samples): Use over-sampling (ROS, SMOTE)
- **Large datasets** (> 10000 samples): Consider under-sampling (cluster centroids)
- **Medium datasets**: Try combination methods (SMOTE + Tomek)

### 3. Validate on Separate Test Set
Always ensure your test set remains unbalanced to get realistic performance estimates:

```ini
[DATA]
# Training data will be balanced
train.split_strategy = train
# Test data remains unbalanced
test.split_strategy = test
```

### 4. Monitor Class Distribution
Check the balancing output to ensure the method is working as expected:
- Over-sampling should increase dataset size
- Under-sampling should decrease dataset size
- Check that target classes are actually balanced

### 5. Compare Multiple Methods
Use the flags module to systematically compare balancing approaches:

```ini
[FLAGS]
balancing = ['none', 'smote', 'adasyn', 'clustercentroids']
models = ['xgb']
features = ['os']
```

## Troubleshooting

### Common Issues

1. **"Unknown balancing algorithm" error**
   - Check spelling of the balancing method name
   - Ensure the method is in the supported list

2. **Memory errors with over-sampling**
   - Try under-sampling methods instead
   - Reduce feature dimensions before balancing

3. **Poor results after balancing**
   - Try different balancing methods
   - Check if your features are suitable for the chosen method
   - Ensure test set remains unbalanced

4. **SMOTE failing with sparse data**
   - Try ADASYN instead of SMOTE
   - Increase the number of samples in minority classes
   - Use ROS as a fallback

### Error Messages

```python
# If you see this error:
"Unknown balancing algorithm: invalid_method"
# Available methods: ['ros', 'smote', 'adasyn', ...]

# Solution: Use one of the supported methods
balancing = smote  # instead of 'invalid_method'
```

## Integration with Other Features

### With Feature Scaling
Balancing works well with feature scaling:

```ini
[FEATS]
type = ['os']
balancing = smote
scale = standard  # Apply scaling after balancing
```

### With Multiple Features
Balancing is applied to the combined feature space:

```ini
[FEATS]
type = ['os', 'praat']  # Features are combined first
balancing = adasyn      # Then balancing is applied
```

### With Cross-Validation
Balancing is applied within each fold to prevent data leakage:

```ini
[EXP]
runs = 5  # Each run applies balancing independently
```

## Performance Considerations

### Computational Impact
- **Over-sampling**: Increases dataset size → longer training time
- **Under-sampling**: Decreases dataset size → faster training time
- **Combination methods**: Variable impact depending on data distribution

### Memory Usage
- **SMOTE/ADASYN**: May require significant memory for large datasets
- **Cluster Centroids**: Reduces memory requirements
- **ROS**: Minimal additional memory (just duplicates existing samples)

### Timing
The balancing process typically adds minimal overhead compared to feature extraction and model training.

## Real-World Examples

### Emotion Recognition with Imbalanced Data

```ini
# Example: EmoDb dataset with balanced emotions
[EXP]
root = ./results/emotion_balanced/
name = emodb_balanced_experiment

[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = specified
target = emotion
labels = ['anger', 'happiness', 'sadness', 'neutral']

[FEATS]
type = ['os']
set = eGeMAPSv02
balancing = smote
scale = standard

[MODEL]
type = xgb
save = True

[PLOT]
name = balanced_emotion_results
```

### Age Detection with Severe Imbalance

```ini
# Example: Age groups with severe imbalance
[EXP]
name = age_detection_balanced

[DATA]
databases = ['age_data']
age_data = ./data/age/age_dataset.csv
target = age_group
labels = ['child', 'adult', 'elderly']

[FEATS]
type = ['praat']
balancing = adasyn  # ADASYN works well with severe imbalance
scale = robust

[MODEL]
type = svm
kernel = rbf
```

### Comparing All Methods

```ini
# Example: Systematic comparison of all balancing methods
[EXP]
root = ./results/balancing_study/
name = comprehensive_balancing_study

[DATA]
databases = ['study_data']
study_data = ./data/study/dataset.csv
target = label

[FEATS]
type = ['os']
scale = ['standard']

[MODEL]
type = ['xgb']

[FLAGS]
balancing = [
    'none', 'ros', 'smote', 'adasyn', 'borderlinesmote',
    'randomundersampler', 'clustercentroids', 'tomeklinks',
    'smoteenn', 'smotetomek'
]
```

## Conclusion

Data balancing is a crucial step in building robust audio classification models. Nkululeko's comprehensive balancing support allows you to:

1. **Easily apply** various balancing techniques with a single configuration parameter
2. **Systematically compare** different methods using the flags module
3. **Maintain reproducibility** with consistent random states
4. **Monitor results** with detailed logging and distribution reporting

Start with SMOTE for most applications, but don't hesitate to explore other methods based on your specific dataset characteristics and computational constraints. The flags module makes it easy to find the optimal balancing strategy for your particular use case.