# Feature Scaling in Nkululeko Feature scaling is a crucial preprocessing step in machine learning that standardizes the range of features to improve model performance and convergence. The Nkululeko framework provides a comprehensive `Scaler` class that offers multiple scaling strategies to normalize speech features. ## Table of Contents - [Overview](#overview) - [Available Scaling Methods](#available-scaling-methods) - [Configuration](#configuration) - [Usage Examples](#usage-examples) - [Best Practices](#best-practices) - [API Reference](#api-reference) ## Overview The `Scaler` class in Nkululeko (`nkululeko/scaler.py`) handles feature normalization across training, development, and test sets. It ensures that: - Features are scaled consistently across all datasets - The scaling parameters are learned only from the training set - Different scaling strategies can be applied based on data characteristics - Speaker-specific normalization is supported ## Available Scaling Methods ### 1. Standard Scaling (`standard`) **Z-score normalization** - transforms features to have zero mean and unit variance. ```ini [FEATS] scale = standard ``` **Formula:** `(x - mean) / std` **Use case:** Most commonly used method, works well when features follow a normal distribution. ### 2. Robust Scaling (`robust`) **Robust to outliers** - uses median and interquartile range instead of mean and standard deviation. ```ini [FEATS] scale = robust ``` **Formula:** `(x - median) / IQR` **Use case:** Recommended when features contain outliers that could skew standard scaling. ### 3. Min-Max Scaling (`minmax`) **Range normalization** - scales features to a fixed range [0, 1]. ```ini [FEATS] scale = minmax ``` **Formula:** `(x - min) / (max - min)` **Use case:** When you need features bounded to a specific range, especially for neural networks. ### 4. Max-Abs Scaling (`maxabs`) **Absolute maximum scaling** - scales by the maximum absolute value. ```ini [FEATS] scale = maxabs ``` **Formula:** `x / max(|x|)` **Use case:** Preserves sparsity in sparse datasets and handles both positive and negative values. ### 5. Normalizer (`normalizer`) **L2 normalization** - scales individual samples to have unit norm. ```ini [FEATS] scale = normalizer ``` **Use case:** When the direction of the data vector is more important than the magnitude. ### 6. Power Transformer (`powertransformer`) **Gaussian-like transformation** - applies power transformations to make data more Gaussian. ```ini [FEATS] scale = powertransformer ``` **Use case:** When features have skewed distributions and you want to make them more normal. ### 7. Quantile Transformer (`quantiletransformer`) **Uniform/Gaussian mapping** - maps features to uniform or Gaussian distribution. ```ini [FEATS] scale = quantiletransformer ``` **Use case:** When you want to reduce the impact of outliers and enforce a specific distribution. ### 8. Binning (`bins`) **Categorical binning** - converts continuous features into three categorical bins. ```ini [FEATS] scale = bins ``` **Output:** Features are converted to strings: "0" (low), "0.5" (medium), "1" (high) **Thresholds:** 33rd and 66th percentiles of the training data **Use case:** When you want to discretize continuous features for tree-based models or categorical analysis. ### 9. Speaker-wise Scaling (`speaker`) **Per-speaker normalization** - applies standard scaling individually for each speaker. ```ini [FEATS] scale = speaker ``` **Use case:** When speaker-specific characteristics should be normalized, useful for speaker-independent emotion recognition. ## Configuration ### Quick Start Demo To quickly test scaling techniques, you can use the provided demo example: ```bash # Clone the repository and navigate to it cd nkululeko # Run a single scaling demo with standard scaling python -m nkululeko.nkululeko --config examples/exp_scaling_demo.ini # Or run all scaling methods systematically bash scripts/run_scaler_experiments.sh ``` The systematic script will test all 9 scaling methods and provide a comprehensive comparison of their performance on your dataset. ### Basic Configuration Add the scaling configuration to the `[FEATS]` section of your INI file: ```ini [FEATS] type = ['os'] # Feature type set = eGeMAPSv02 # Feature set scale = standard # Scaling method ``` ### Advanced Configuration Examples #### Robust scaling with OpenSMILE features: ```ini [FEATS] type = ['os'] set = ComParE_2016 level = functionals scale = robust ``` #### Min-max scaling for neural networks: ```ini [FEATS] type = ['spectra'] scale = minmax [MODEL] type = cnn ``` #### Speaker-wise normalization: ```ini [FEATS] type = ['os'] scale = speaker [DATA] # Ensure speaker information is available target = emotion ``` #### Binning for tree-based models: ```ini [FEATS] type = ['os'] scale = bins [MODEL] type = xgb ``` ## Usage Examples ### Complete Experiment Configuration ```ini [EXP] root = ./experiments/ name = emotion_recognition_robust_scaling type = classification [DATA] databases = ['emodb'] emodb = /path/to/emodb target = emotion labels = ['anger', 'happiness', 'neutral', 'sadness'] [FEATS] type = ['os'] set = eGeMAPSv02 level = functionals scale = robust # Using robust scaling for outlier resistance [MODEL] type = svm C_val = 1.0 kernel = rbf ``` ### Comparing Different Scaling Methods You can compare different scaling methods using the automated script or manually: #### Automated Comparison (Recommended) ```bash # Run all scaling methods on your dataset bash scripts/run_scaler_experiments.sh ``` This script will: - Test all 9 scaling methods automatically - Generate individual configuration files - Run experiments with consistent settings - Provide a summary comparison of results - Clean up temporary files #### Manual Comparison You can also compare different scaling methods by running separate experiments: **Experiment 1: Standard scaling** ```ini [EXP] name = emotion_standard_scaling [FEATS] scale = standard ``` **Experiment 2: Robust scaling** ```ini [EXP] name = emotion_robust_scaling [FEATS] scale = robust ``` **Experiment 3: Min-max scaling** ```ini [EXP] name = emotion_minmax_scaling [FEATS] scale = minmax ``` #### Using the FLAGS Module for Comparison For systematic comparison within a single run: ```ini [EXP] root = ./results/scaling_comparison/ name = comprehensive_scaling_study [DATA] databases = ['mydata'] mydata = ./data/mydata.csv target = emotion [FEATS] type = ['os'] [MODEL] type = ['xgb'] [FLAGS] scale = ['standard', 'robust', 'minmax', 'maxabs', 'normalizer', 'powertransformer', 'quantiletransformer', 'bins'] ``` ## Understanding Scaling Results When you run the scaling experiments script, you'll see output like this: ``` Starting scaling experiments... =============================== Current directory: /path/to/nkululeko Examples path: ./examples Results path: ./examples/results Checking data availability... ✓ Polish dataset found - using full dataset Running experiment with scaling method: standard ================================================= Config file created: ./examples/results/temp_scaling_configs/exp_scaling_standard.ini Starting experiment... ✓ SUCCESS: standard scaling completed Result: best result: 0.75 Running experiment with scaling method: robust =============================================== ... ======================================== All scaling experiments completed! Success: 9/9 ======================================== Quick Results Comparison: ======================== standard : 0.75 robust : 0.78 minmax : 0.72 maxabs : 0.74 normalizer : 0.69 powertransformer : 0.76 quantiletransformer : 0.77 bins : 0.71 speaker : 0.73 ``` ### Interpreting Results - **Higher scores** indicate better performance (accuracy for classification) - **Robust scaling** often performs well with real-world audio data due to outlier resistance - **Standard scaling** is a reliable baseline - **Bins scaling** may show different results as it converts to categorical features - **Speaker scaling** is useful when speaker variability is a concern ### Result Files The script generates several output files: - `scaling_experiments_summary.txt`: Complete summary with timestamps and method descriptions - Individual log files: `exp_scaling_[method].log` for detailed experiment logs - Result plots (if configured): Visual comparisons of scaling effects ## Best Practices ### 1. Choosing the Right Scaling Method | Data Characteristics | Recommended Scaler | Reason | |---------------------|-------------------|--------| | Normal distribution, few outliers | `standard` | Classical z-score normalization | | Contains outliers | `robust` | Uses median/IQR, less sensitive to outliers | | Need bounded range [0,1] | `minmax` | Explicit range control | | Sparse data | `maxabs` | Preserves sparsity | | Skewed distributions | `powertransformer` | Makes data more Gaussian | | Many outliers | `quantiletransformer` | Robust distribution mapping | | Tree-based models | `bins` | Can improve interpretability | | Speaker variability | `speaker` | Normalizes per-speaker differences | ### 2. Neural Network Considerations For neural networks, consider: - `minmax` for bounded inputs - `standard` for well-behaved distributions - Avoid `bins` as neural networks work better with continuous features ### 3. SVM Considerations SVMs benefit from scaled features: - `standard` or `robust` are typically good choices - `minmax` ensures all features contribute equally ### 4. Tree-based Model Considerations Tree-based models (XGBoost, Random Forest) are generally scale-invariant: - Scaling may not be necessary - `bins` can improve interpretability - Standard scaling doesn't hurt and may help with some implementations ### 5. Cross-database Experiments When working with multiple databases: - Ensure consistent scaling across databases - `robust` or `quantiletransformer` may be more stable across different recording conditions ## API Reference ### Scaler Class ```python class Scaler: """Class to normalize speech features.""" def __init__(self, train_data_df, test_data_df, train_feats, test_feats, scaler_type, dev_x=None, dev_y=None): """ Initialize the scaler. Parameters: ----------- train_data_df : pd.DataFrame Training dataframe with speaker information (needed for speaker scaling) test_data_df : pd.DataFrame Test dataframe with speaker information train_feats : pd.DataFrame Training features dataframe test_feats : pd.DataFrame Test features dataframe scaler_type : str Type of scaling: 'standard', 'robust', 'minmax', 'maxabs', 'normalizer', 'powertransformer', 'quantiletransformer', 'bins', 'speaker' dev_x : pd.DataFrame, optional Development data dataframe dev_y : pd.DataFrame, optional Development features dataframe """ def scale(self): """ Scale features based on the configured scaling method. Returns: -------- tuple (train_scaled, test_scaled) or (train_scaled, dev_scaled, test_scaled) """ def scale_all(self): """Scale all datasets using the configured scaler.""" def speaker_scale(self): """Apply speaker-wise scaling.""" def bin_to_three(self): """Convert features to three bins: low, medium, high.""" ``` ### Key Methods #### `scale()` Main method that applies the selected scaling strategy. #### `scale_all()` Handles scaling for non-speaker-specific methods. #### `speaker_scale()` Applies scaling per speaker for speaker-wise normalization. #### `bin_to_three()` Implements the binning strategy, converting continuous features to categorical bins. ### Return Values The scaler returns scaled DataFrames in the same format as the input: - Same indices as input features - Same column names as input features - Scaled/transformed values according to the selected method For the `bins` method, values are returned as strings: "0", "0.5", "1". ## Error Handling The scaler includes robust error handling: ```python # Invalid scaler type scaler = Scaler(..., scaler_type="invalid") # Raises: ValueError with message about unknown scaler # Missing speaker information for speaker scaling # Will raise appropriate error if speaker column is missing ``` ## Integration with Nkululeko Pipeline The scaler is automatically integrated into the Nkululeko pipeline: 1. Features are extracted according to `[FEATS]` configuration 2. Scaler is applied if `scale` parameter is specified 3. Scaled features are passed to the model for training/testing No manual intervention is required - just specify the scaling method in your INI file. --- ## Script Usage and Examples ### Running the Scaling Experiments Script The `run_scaler_experiments.sh` script provides an automated way to test all scaling methods: ```bash # From nkululeko root directory bash scripts/run_scaler_experiments.sh # From scripts directory cd scripts bash run_scaler_experiments.sh ``` ### Script Features - **Automatic dataset detection**: Uses Polish dataset if available, falls back to test dataset - **Dynamic configuration**: Creates temporary config files for each scaling method - **Comprehensive logging**: Individual log files for each experiment - **Results summary**: Consolidated summary with performance comparison - **Error handling**: Continues with other methods if one fails - **Cleanup**: Removes temporary files after completion ### Script Output Files | File | Description | |------|-------------| | `scaling_experiments_summary.txt` | Main summary with all results and timestamps | | `exp_scaling_[method].log` | Detailed log for each scaling method | | `[method]_scaling_results/` | Model outputs and plots (if save=True) | ### Customizing the Script You can modify the script to: 1. **Change the dataset**: Edit the config creation functions 2. **Add custom scaling methods**: Extend the `scaling_methods` array 3. **Modify experiment parameters**: Update epochs, runs, or model type 4. **Change feature types**: Modify the `[FEATS]` section in config templates Example customization for different features: ```bash # Edit the create_scaling_config function to use different features [FEATS] type = ['praat'] # Instead of ['os'] scale = ${method} ``` For more information about feature extraction and model configuration, see: - [Feature Extraction Documentation](nkululeko.feat_extract.rst) - [INI File Configuration Reference](ini_file.md) - [Model Documentation](nkululeko.models.rst)