# How to Align Databases

This tutorial explains how to combine and align multiple databases that have different label schemes for related tasks. This is useful when you want to leverage data from one domain (e.g., emotion) to improve prediction in another related domain (e.g., stress).

## Overview

Sometimes you want to combine databases that are similar but don't label exactly the same phenomena. For example:
- You have limited **stress** data but many **emotion** databases
- You want to use **angry** samples as **stressed** and **happy/neutral** as **non-stressed**

Nkululeko provides several configuration options to align databases:
- **Column renaming** (`colnames`)
- **Label mapping** (`mapping`)
- **Sample filtering** (`filter`)
- **Target table selection** (`target_tables`)

## Configuration Options

### Column Renaming: `colnames`

Rename columns to align with your target task:

```ini
emodb.colnames = {"emotion": "stress"}
```

This renames the `emotion` column to `stress`.
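As a rough illustration of the effect (a plain-Python sketch, not Nkululeko's own implementation), renaming can be thought of as swapping the keys of each label row. The rows and the `rename_columns` helper below are hypothetical.

```python
# Hypothetical rows standing in for a database's label table.
rows = [
    {"file": "a.wav", "emotion": "anger"},
    {"file": "b.wav", "emotion": "neutral"},
]
colnames = {"emotion": "stress"}  # same dictionary as in the INI option


def rename_columns(rows, colnames):
    """Return copies of the rows with keys renamed according to colnames."""
    return [
        {colnames.get(key, key): value for key, value in row.items()}
        for row in rows
    ]


renamed = rename_columns(rows, colnames)
print(renamed[0])  # {'file': 'a.wav', 'stress': 'anger'}
```

Only the column name changes; the label values (`anger`, `neutral`) stay as they were until a `mapping` is applied.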

### Label Mapping: `mapping`

Map original labels to new categories:

```ini
emodb.mapping = {"anger": "stress", "disgust": "stress", "neutral": "no stress", "sadness": "no stress"}
```
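The sketch below mimics this mapping on a small, made-up list of labels. It is plain Python for illustration only; how Nkululeko treats labels that have no mapping entry is not covered here, so the sketch merely reports them.

```python
mapping = {
    "anger": "stress",
    "disgust": "stress",
    "neutral": "no stress",
    "sadness": "no stress",
}
original = ["anger", "sadness", "fear", "neutral"]  # hypothetical label column

# Translate every label that has a mapping entry.
mapped = [mapping[label] for label in original if label in mapping]
# Collect labels that the mapping does not cover.
unmapped = sorted(set(original) - set(mapping))

print(mapped)    # ['stress', 'no stress', 'no stress']
print(unmapped)  # ['fear']
```

Listing the uncovered labels this way is a quick check that the mapping and the filter agree on which categories are in play.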

### Sample Filtering: `filter`

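Keep only the samples whose value in a given column is in a list of allowed values. In the examples in this tutorial the filter names the renamed column (`stress`) but lists the original label values, that is, the labels as they are before `mapping` is applied: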

```ini
# Keep only anger, neutral, and happiness samples
emodb.filter = [["stress", ["anger", "neutral", "happiness"]]]
```
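To make the structure of the option value concrete, here is a minimal sketch that parses the list and applies it to a few made-up rows. The `apply_filters` helper and the sample rows are assumptions for illustration and do not reflect Nkululeko's internals.

```python
import ast

# The option value as written in the INI file, parsed as a Python literal.
filter_value = '[["stress", ["anger", "neutral", "happiness"]]]'
filters = ast.literal_eval(filter_value)

# Hypothetical rows after the column has been renamed to "stress".
rows = [
    {"file": "a.wav", "stress": "anger"},
    {"file": "b.wav", "stress": "boredom"},
    {"file": "c.wav", "stress": "happiness"},
]


def apply_filters(rows, filters):
    """Keep rows whose value in each named column is in the allowed list."""
    kept = rows
    for column, allowed in filters:
        kept = [row for row in kept if row[column] in allowed]
    return kept


kept = apply_filters(rows, filters)
print([row["file"] for row in kept])  # ['a.wav', 'c.wav']
```

Each entry of the outer list is a pair of column name and allowed values, so several columns can be filtered by adding more pairs.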

### Target Tables: `target_tables`

Specify which tables contain the target labels:

```ini
emodb.target_tables = ["emotion"]
```

## Example: Emotion to Stress Mapping

This example shows how to convert Berlin EmoDB emotion labels into binary stress labels.

### Configuration: `exp_emodb_stress.ini`

```ini
[EXP]
root = ./examples/results/
name = emodb_stress
save_test = ./examples/results/emodb_stress/test.csv
epochs = 5

[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
# Specify where target values come from
emodb.target_tables = ["emotion"]
# Rename emotion column to stress
emodb.colnames = {"emotion": "stress"}
# Keep only these emotion categories
emodb.filter = [["stress", ["anger", "neutral", "sadness", "disgust"]]]
# Map emotions to stress labels
emodb.mapping = {"anger": "stress", "disgust": "stress", "neutral": "no stress", "sadness": "no stress"}
emodb.split_strategy = speaker_split
# Define final labels
labels = ["stress", "no stress"]
target = stress

[FEATS]
type = ['os']

[MODEL]
type = mlp
layers = [64, 12]
drop = [.3, .4]

[PLOT]
uncertainty_threshold = 0.3
```

### Run the Experiment

```bash
python -m nkululeko.nkululeko --config examples/exp_emodb_stress.ini
```
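Before running, it can help to check that the filter, the mapping, and the final labels are consistent with each other. The following is a hedged sketch using only the Python standard library; `check_alignment` is a hypothetical helper, not part of Nkululeko, and the embedded text is a trimmed copy of the `[DATA]` section above.

```python
import ast
import configparser

CONFIG_TEXT = """
[DATA]
databases = ['emodb']
emodb.colnames = {"emotion": "stress"}
emodb.filter = [["stress", ["anger", "neutral", "sadness", "disgust"]]]
emodb.mapping = {"anger": "stress", "disgust": "stress", "neutral": "no stress", "sadness": "no stress"}
labels = ["stress", "no stress"]
target = stress
"""


def check_alignment(config_text, db):
    """Return a list of inconsistencies between filter, mapping, and labels."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    data = parser["DATA"]
    filters = ast.literal_eval(data[f"{db}.filter"])
    mapping = ast.literal_eval(data[f"{db}.mapping"])
    labels = ast.literal_eval(data["labels"])
    target = data["target"]

    problems = []
    # Every label kept by the filter on the target column needs a mapping.
    for column, allowed in filters:
        if column == target:
            for value in allowed:
                if value not in mapping:
                    problems.append(f"filtered label {value!r} has no mapping")
    # Every mapped label must be one of the final labels.
    for new_label in sorted(set(mapping.values())):
        if new_label not in labels:
            problems.append(f"mapped label {new_label!r} is not in labels")
    return problems


problems = check_alignment(CONFIG_TEXT, "emodb")
print(problems or "config looks consistent")  # config looks consistent
```

The check reads the option values as Python literals, which works for the lists and dictionaries used in this tutorial.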

## Advanced: Combining Multiple Databases

You can combine databases with different label schemes by aligning them to a common target.

### Example: EmoDB + SUSAS for Stress Detection

```ini
[DATA]
databases = ['emodb', 'susas']

# EmoDB configuration
emodb = ./data/emodb/emodb
emodb.target_tables = ["emotion"]
emodb.colnames = {"emotion": "stress"}
emodb.filter = [["stress", ["anger", "neutral", "happiness"]]]
emodb.mapping = {"anger": "stress", "neutral": "no stress", "happiness": "no stress"}
# Use all emodb for training
emodb.split_strategy = train

# SUSAS configuration
susas = ./data/susas/
# Map ternary stress labels to binary
susas.mapping = {'0,1': 'no stress', '2': 'stress'}
susas.split_strategy = speaker_split

target = stress
labels = ["stress", "no stress"]
```

### Key Points

1. **EmoDB is used only for training** (`split_strategy = train`)
2. **SUSAS is split into train/test** (`split_strategy = speaker_split`)
3. **Both databases use the same target labels** (`stress`, `no stress`)
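The sketch below illustrates the third point: after mapping, both databases share one label set. It assumes that a comma-separated key such as `'0,1'` stands for each of the listed original labels, as the "ternary to binary" comment in the configuration suggests. The `expand_mapping` helper and the sample data are hypothetical.

```python
def expand_mapping(mapping):
    """Split comma-separated keys so each original label gets its own entry."""
    expanded = {}
    for key, new_label in mapping.items():
        for part in key.split(","):
            expanded[part.strip()] = new_label
    return expanded


emodb_mapping = {"anger": "stress", "neutral": "no stress", "happiness": "no stress"}
susas_mapping = expand_mapping({"0,1": "no stress", "2": "stress"})

# Hypothetical (database, original label) pairs.
samples = [
    ("emodb", "anger"),
    ("emodb", "happiness"),
    ("susas", "0"),
    ("susas", "1"),
    ("susas", "2"),
]
mappings = {"emodb": emodb_mapping, "susas": susas_mapping}

# Map every sample with the mapping of its own database.
aligned = [(db, mappings[db][label]) for db, label in samples]
print(susas_mapping)  # {'0': 'no stress', '1': 'no stress', '2': 'stress'}
print(sorted({label for _, label in aligned}))  # ['no stress', 'stress']
```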

## Multi-Database Alignment with Root Files

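For complex multi-database setups, keep the per-database settings (paths, split strategies, mappings) in a separate configuration file and point the main configuration to it with `root_folders`.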

### Root Configuration: `data_roots.ini`

```ini
[DATA]
emodb = ./data/emodb/emodb
emodb.split_strategy = specified
emodb.test_tables = ['emotion.categories.test.gold_standard']
emodb.train_tables = ['emotion.categories.train.gold_standard']
emodb.mapping = {'anger':'angry', 'happiness':'happy', 'sadness':'sad', 'neutral':'neutral'}

crema-d = ./data/crema-d/crema-d/1.3.0/
crema-d.split_strategy = specified
crema-d.colnames = {'sex':'gender'}
crema-d.target_tables = ['emotion.categories.desired.test','emotion.categories.desired.train']
crema-d.mapping = {'anger':'angry', 'happiness':'happy', 'sadness':'sad', 'neutral':'neutral'}
```

### Main Configuration

```ini
[EXP]
root = ./examples/results/multidb
databases = ['emodb', 'crema-d']

[DATA]
root_folders = ./examples/data_roots.ini
target = emotion
labels = ['angry', 'happy', 'sad', 'neutral']

[FEATS]
type = ['os']
scale = standard

[MODEL]
type = mlp
```

## Configuration Reference

| Option | Description | Example |
|--------|-------------|---------|
| `colnames` | Rename columns | `{"emotion": "stress"}` |
| `mapping` | Map label values | `{"anger": "stress", "neutral": "no stress"}` |
| `filter` | Filter samples by column values | `[["column", ["val1", "val2"]]]` |
| `target_tables` | Tables containing target labels | `["emotion"]` |
| `split_strategy` | How to split data | `train`, `test`, `speaker_split`, `random`, `specified` |

## Use Cases

1. **Cross-domain transfer**: Use emotion data for stress detection
2. **Label harmonization**: Combine databases with different label schemes
3. **Data augmentation**: Add out-of-domain data to training
4. **Multi-corpus experiments**: Train on multiple databases with aligned labels

## Tips

- **In-domain data usually works better**: Adding out-of-domain data doesn't always help
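- **Use a third database for evaluation**: When combining databases, evaluate on held-out data that was not used for training
- **Check label distributions**: Mapping several labels onto one class can leave the classes unbalanced, so check the class counts after mapping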
- **Document your mappings**: Keep track of how labels were aligned
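
One simple way to check the label distribution after mapping is to count the mapped labels and look at each class's share. The snippet is a minimal sketch with made-up numbers; `class_shares` is a hypothetical helper.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of samples that falls into each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical mapped labels: 3 stressed and 9 non-stressed samples.
mapped_labels = ["stress"] * 3 + ["no stress"] * 9
shares = class_shares(mapped_labels)
for label, share in sorted(shares.items()):
    print(f"{label}: {share:.0%}")
# no stress: 75%
# stress: 25%
```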

## Related Tutorials

- [Multi-Database Experiments](multidb.md)
- [Data Balancing](balance.md)
- [INI File Reference](ini_file.md)

## Reference

- [Blog: How to align databases](http://blog.syntheticspeech.de/2025/08/06/nkululeko-how-to-align-databases/)