# Comparing Classifiers, Features, and Databases

This tutorial explains how to use Nkululeko's multiple runs feature to statistically compare different experimental configurations. This is essential for determining if differences between feature sets, classifiers, or databases are statistically significant.

## Overview

Since Nkululeko version 0.98, you can run experiments multiple times and compare outcomes across different configurations with statistical significance testing.

**Use cases:**
- Compare feature extractors (OpenSMILE vs. Praat vs. wav2vec2)
- Compare classifiers (SVM vs. XGBoost vs. MLP)
- Compare databases or training data combinations

## Key Configuration Options

### Multiple Runs

Set the number of experimental runs in `[EXP]`:

```ini
[EXP]
runs = 10
epochs = 100
```

Each run uses a different random seed, producing a distribution of results.

### Statistical Output

Enable detailed statistics in `[EXPL]`:

```ini
[EXPL]
print_stats = True
```

### Comparison Plot

Configure what to compare in `[PLOT]`:

```ini
[PLOT]
runs_compare = features  # Options: 'features', 'models', 'databases'
```

## Example: Comparing Feature Extractors

This example compares OpenSMILE, Praat, and audmodel features for emotion recognition.

### Step 1: Create Base Configuration

All configurations share the same experiment name but differ in feature type.

#### Configuration 1: OpenSMILE (`exp_emodb_compare_os.ini`)

```ini
[EXP]
root = ./examples/results/
name = exp_emodb_compare
runs = 5
epochs = 50
save = True

[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = speaker_split
labels = ['anger', 'happiness', 'neutral', 'sadness']
target = emotion

[FEATS]
type = ['os']
scale = standard

[MODEL]
type = mlp
layers = {'l1':64, 'l2':16}
patience = 5

[EXPL]
print_stats = True

[PLOT]
runs_compare = features
best_model = True
```

#### Configuration 2: Praat (`exp_emodb_compare_praat.ini`)

```ini
[FEATS]
type = ['praat']
scale = standard
```

(Other sections remain the same)

#### Configuration 3: audmodel (`exp_emodb_compare_audmodel.ini`)

```ini
[FEATS]
type = ['audwav2vec2']
scale = standard
```

(Other sections remain the same)

### Step 2: Run All Configurations

```bash
# Run with OpenSMILE features
python -m nkululeko.nkululeko --config examples/exp_emodb_compare_os.ini

# Run with Praat features
python -m nkululeko.nkululeko --config examples/exp_emodb_compare_praat.ini

# Run with audmodel features
python -m nkululeko.nkululeko --config examples/exp_emodb_compare_audmodel.ini
```

### Step 3: View Comparison Results

After running all configurations, Nkululeko generates:
- A comparison plot showing distributions for each feature type
- Statistical significance tests (Mann-Whitney U or t-test)
- Overall and pairwise significance values

The plot shows:
- Box plots or violin plots of accuracy distributions
- Title with overall significance (e.g., p < 0.05)
- Pairwise comparisons between configurations

### Example Output Plots

**Box Plot Comparison:**

![Runs Comparison Box Plot](images/runs_compare_boxplot.png)

**Bar Plot Comparison:**

![Runs Comparison Bar Plot](images/runs_compare_barplot.png)

## Statistical Tests

Nkululeko automatically selects the appropriate test:

| Number of Runs | Test Used |
|----------------|-----------|
| ≤ 30 | Mann-Whitney U test (non-parametric) |
| > 30 | Student's t-test (parametric) |

## Comparing Different Aspects

### Compare Classifiers

```ini
[PLOT]
runs_compare = models
```

Then run experiments with different `[MODEL] type` values:
- `type = svm`
- `type = xgb`
- `type = mlp`

### Compare Databases

```ini
[PLOT]
runs_compare = databases
```

Then run experiments with different database configurations.

## Best Practices

### Number of Runs

| Purpose | Recommended Runs |
|---------|------------------|
| Quick comparison | 5-10 |
| Publication | 10-30 |
| High confidence | 30+ |

More runs = more statistical power, but longer runtime.

### Keep Other Variables Constant

When comparing one aspect, keep everything else the same:
- **Comparing features**: Same model, same data, same splits
- **Comparing models**: Same features, same data, same splits
- **Comparing databases**: Same features, same model

### Use Same Experiment Name

All configurations being compared should use the same `name` in `[EXP]`:

```ini
[EXP]
name = exp_emodb_compare  # Same for all configurations
```

This ensures results are collected in the same folder for comparison.

## Output Files

After running multiple configurations:

```
results/exp_emodb_compare/
├── images/
│   └── runs_comparison.png      # Comparison plot with statistics
├── results/
│   ├── run_results_os.txt       # Results for OpenSMILE
│   ├── run_results_praat.txt    # Results for Praat
│   └── run_results_audmodel.txt # Results for audmodel
└── ...
```

## Interpreting Results

### Significance Levels

| p-value | Interpretation |
|---------|----------------|
| p < 0.001 | Highly significant (***) |
| p < 0.01 | Very significant (**) |
| p < 0.05 | Significant (*) |
| p ≥ 0.05 | Not significant (ns) |

### Example Output

The comparison plot title might show:
```
Overall: p=0.003** | Largest pairwise: os vs audmodel p=0.001***
```

This means:
- Overall difference across all groups is significant (p=0.003)
- The largest difference is between OpenSMILE and audmodel features

## Complete Workflow Example

```bash
# 1. Compare features
python -m nkululeko.nkululeko --config examples/exp_emodb_compare_os.ini
python -m nkululeko.nkululeko --config examples/exp_emodb_compare_praat.ini
python -m nkululeko.nkululeko --config examples/exp_emodb_compare_audmodel.ini

# 2. Check the comparison plot
# Open: examples/results/exp_emodb_compare/images/runs_comparison.png
```

## Tips

1. **Start with fewer runs** (5) for quick exploration, increase for final results
2. **Use speaker-independent splits** (`speaker_split`) for realistic evaluation
3. **Document your configurations** for reproducibility
4. **Consider computational cost**: More runs × more epochs = longer runtime
5. **Check for convergence**: Ensure models have enough epochs to converge

## Limitations

- Statistical comparison assumes runs are independent samples
- Some statisticians debate whether multiple random seeds truly represent independent samples
- Consider this approach as exploratory rather than definitive proof

## Related Tutorials

- [Using Uncertainty](uncertainty.md)
- [Hyperparameter Optimization](optim.md)
- [Multi-Database Experiments](multidb.md)

## Reference

- [Blog: How to compare classifiers, features and databases](http://blog.syntheticspeech.de/2025/09/24/nkululeko-how-to-compare-classifiers-features-and-databases-using-multiple-runs/)