Overview of options for the nkululeko framework

Options are specified in an .ini file using Python configparser syntax.
Most values have defaults.

Contents

Overview of options for the nkululeko framework
- Contents
- Sections
  - EXP
  - DATA
  - AUGMENT
  - SEGMENT
  - FEATS
  - MODEL
  - EXPL
  - PREDICT
  - EXPORT
  - CROSSDB
  - PLOT
  - RESAMPLE
  - REPORT
  - OPTIM
  - FLAGS

Sections

EXP

General experiment settings: paths, naming, run count, and output options.

root: experiment root folder
- root = ./results/
type: the kind of experiment
- type = classification
- possible values:
  - classification: supervised learning experiment with a restricted set of categories (e.g., emotion categories).
  - regression: supervised learning experiment with continuous values (e.g., speaker age in years).
store: (relative to root) folder for caches
- store = ./store/
name: a name for debugging output
- name = emodb_exp
fig_dir: (relative to root) folder for plots
- fig_dir = ./images/
res_dir: (relative to root) folder for result output
- res_dir = ./results/
models_dir: (relative to root) folder to save models
- models_dir = ./models/
runs: number of runs (e.g., to average over random initializations)
- runs = 1
epochs: number of epochs for ANN training
- epochs = 1
save: save the experiment as a pickle file to be restored again later (True or False)
- save = False
save_test: save the test predictions as a new database in CSV format (default is False)
- save_test = ./my_saved_test_predictions.csv
databases: name of databases to compare for the multidb module
- databases = ['emodb', 'timit']
use_splits: for the multidb module: use the original split sets when a database serves as train or test database. Otherwise, the whole database is used.
- use_splits = True
traindevtest: set to true if you want to specify an extra dev set, that will be used for early stopping (patience) in neural net experiments.
- traindevtest = False
sample_selection: select the samples to process (e.g. for augmentation, re-sampling, etc.): either train, test, or all
- sample_selection = all
export_onnx: export the best trained model in ONNX format.
- export_onnx = False
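
As a minimal sketch of how such a file can be read, the following parses a small EXP fragment built from the options above with Python's standard configparser. Evaluating list values with ast.literal_eval is an assumption made for illustration; nkululeko itself may parse them differently.

```python
import ast
import configparser

# A small config fragment using options documented above.
INI_TEXT = """
[EXP]
root = ./results/
name = emodb_exp
type = classification
runs = 1
epochs = 1
save = False
databases = ['emodb', 'timit']
"""

config = configparser.ConfigParser()
config.read_string(INI_TEXT)

exp = config["EXP"]
runs = exp.getint("runs")        # integer option
save = exp.getboolean("save")    # True/False option
# fallback= supplies a default when an option is not set in the file
store = exp.get("store", fallback="./store/")
# Assumption: list values are written as Python literals
databases = ast.literal_eval(exp["databases"])

print(runs, save, store, databases)
```

List values only evaluate cleanly when they use straight quotes, which is why the examples in this document should be typed with ' or " rather than typographic quotes.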

DATA

Database loading, label mapping, and train/test split configuration.

type: just a flag now to mark continuous data, so it can be binned to categorical data (using bins and labels)
- type = continuous
databases: list of databases to be used in the experiment
- databases = ['emodb', 'timit']
tests: Datasets to be used as test data for the stored best model. The databases listed here do not have to appear in the databases field. When nkululeko.nkululeko is run with this option set and a saved experiment file already exists on disk, training is skipped entirely: the module loads the stored best model, evaluates it on the listed test databases, and writes a confusion matrix, a per-class text report, and a predictions CSV (with all original test columns plus a predicted column) to the results directory. On the very first run (no saved file yet) the module trains normally and saves the experiment. See test_new_database.md for a step-by-step guide.
- tests = ['emovo']
- tests = ['ravdess', 'cremad'] ; multiple databases are concatenated
root_folders: specify an additional configuration specifically for all entries starting with a dataset name, acting as global defaults.
- root_folders = data_roots.ini
db_name: path to the audformat repository for each database listed in databases. If this path is not absolute, it is treated as relative to the experiment folder.
- emodb = /home/data/audformat/emodb/
db_name.type: type of storage, e.g., audformat database or ‘csv’ (needs header: file, speaker, task)
- emodb.type = audformat
db_name.absolute_path: only for ‘csv’ databases: are the audio file paths relative or absolute? If not absolute, they are treated as relative to the database parent folder, not the experiment root folder.
- my_data.absolute_path = True
db_name.audio_path: only for ‘csv’ databases: are the audio files in a special common folder?
- my_data.audio_path = wav_files/
db_name.mapping: Python dictionary to map between categories for cross-database experiments (format: {'target_emo':'source_emo'})
- emodb.mapping = {'anger':'angry', 'happiness':'happy', 'sadness':'sad', 'neutral':'neutral'}
- can also be used for general mapping
- emodb.mapping = {'gender':{'male':0, 'female':1}, 'emotion':{'anger':'stress', 'neutral':'no stress'}}
db_name.columns: names of the columns to load from the data (only for audformat databases)
- my_data.columns = ["age", "gender", "speaker", "diagnosis"]
db_name.label: name of the target variable for this database (if different from DATA.target)
- my_data.label = "expression"
db_name.colnames: mapping to rename columns to standard names
- my_data.colnames = {'speaker':'Participant ID', 'sex':'gender', 'Age': 'age'}
db_name.split_strategy: How to identify sets for train/development data splits within one database
- emodb.split_strategy = reuse
- Possible values:
  - database: default (task.train, task.dev and task.test)
  - specified: specify the tables (an opportunity to assign multiple or no tables to train or dev set)
    - emodb.train_tables = ['emotion.categories.train.gold_standard']
    - emodb.dev_tables = ['emotion.categories.dev.gold_standard']
    - emodb.test_tables = ['emotion.categories.test.gold_standard']
  - speaker_split: split samples randomly but speaker-disjoint, given a percentage of speakers for the test (and dev) set.
    - emodb.test_size = 50 (default:20)
    - emodb.dev_size = 20 # for train-dev-test experiments
  - list of test speakers: you can simply provide a list of test ids
    - emodb.split_strategy = [12, 14, 15, 16]
  - speakers_stated: explicitly state the speaker names for all splits (test and dev are required)
    - emodb.test = [14, 8]
    - emodb.dev = [12, 15]
    - emodb.train = [3, 9, 10, 11, 13, 16]
  - random: split samples randomly (not speaker-disjoint, e.g., when no speaker information is given or each sample has its own speaker), given a percentage of samples for the test set.
    - emodb.test_size = 50 (default:20)
  - reuse: reuse the splits after a speaker_split run to save time with feature extraction.
  - train: use the entire database for training
  - test: use the entire database for evaluation / testing
  - dev: use the entire database for evaluation / development
  - balanced: stratify the data splits
    - balance = {'emotion':2, 'age':1, 'gender':1}
    - age_bins = 2
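
The speaker_split strategy described above can be sketched as follows. This is an illustration under stated assumptions, not nkululeko's implementation: the function name speaker_split, the seed, and the rounding rule are hypothetical.

```python
import random

def speaker_split(speakers, test_size=20, seed=42):
    """Assign a percentage of the unique speakers to the test set.

    speakers: one speaker id per sample.
    Returns (train_indices, test_indices); no speaker appears in both.
    """
    unique = sorted(set(speakers))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Assumption: round the speaker count, but keep at least one test speaker
    n_test = max(1, round(len(unique) * test_size / 100))
    test_speakers = set(unique[:n_test])
    train_idx = [i for i, s in enumerate(speakers) if s not in test_speakers]
    test_idx = [i for i, s in enumerate(speakers) if s in test_speakers]
    return train_idx, test_idx

# Ten samples from seven speakers; half of the speakers go to the test set.
speakers = [3, 3, 8, 8, 9, 10, 10, 11, 12, 13]
train_idx, test_idx = speaker_split(speakers, test_size=50)
print(len(train_idx), len(test_idx))
```

Because the percentage applies to speakers and not to samples, the resulting sample proportions can deviate from test_size when speakers contribute unequal numbers of samples.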
db_name.target_tables: tables that contain the target / speaker / sex labels
- emodb.target_tables = ['emotion']
target_tables_append: set this to True if the multiple tables should be combined row-wise, else they are combined column-wise
- target_tables_append = False
db_name.files_tables: tables that contain the audio file names
- emodb.files_tables = ['files']
db_name.test_tables: tables that should be used for testing
- emodb.test_tables = ['emotion.categories.test.gold_standard']
db_name.train_tables: tables that should be used for training
- emodb.train_tables = ['emotion.categories.train.gold_standard']
db_name.as_test: use only the test split (for automatic experiments)
- emodb.as_test = False
db_name.as_train: use only the train split (for automatic experiments)
- emodb.as_train = False
db_name.limit_samples: maximum number of random N samples per table (for testing with very large data mainly)
- emodb.limit_samples = 20
db_name.required: force a data set to have a specific feature (for example, filter all sets that have gender labeled in a database where this is not the case for all samples, e.g. MozillaCommonVoice)
- emodb.required = gender
db_name.limit_samples_per_speaker: maximum number of samples per speaker (for leveling data where the same speakers have a large number of samples)
- emodb.limit_samples_per_speaker = 20
db_name.min_duration_of_sample: limit the samples to a minimum length (in seconds)
- emodb.min_duration_of_sample = 0.0
db_name.max_duration_of_sample: limit the samples to a maximum length (in seconds)
- emodb.max_duration_of_sample = 0.0
db_name.rename_speakers: add the database name to the speaker names, e.g., because several databases use the same names
- emodb.rename_speakers = False
db_name.filter: don’t use all the data but only samples with selected values in the given columns, stated as {column: [values]}
- emodb.filter = {'gender': ['female', 'diverse']}
db_name.scale: scale (standard normalize) the target variable (if numeric)
- my_data.scale = True
db_name.reverse: reverse the target variable (if numeric). I.e. f(x) = abs(x-max)
db_name.reverse.max: max value to be used in the formula above. If omitted, the distribution will start with 0.
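
The reverse formula f(x) = abs(x - max) can be illustrated with a short sketch. The helper name reverse_target is hypothetical, and using the largest observed value when reverse.max is omitted is an assumption derived from the remark that the distribution then starts with 0.

```python
def reverse_target(values, max_value=None):
    """Apply f(x) = abs(x - max) to every target value."""
    if max_value is None:
        # Assumption: without reverse.max the largest observed value is used,
        # so the reversed distribution starts at 0.
        max_value = max(values)
    return [abs(v - max_value) for v in values]

ratings = [1, 2, 5]
print(reverse_target(ratings))               # [4, 3, 0]
print(reverse_target(ratings, max_value=7))  # [6, 5, 2]
```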
target: the task name, e.g. age or emotion
- target = emotion
labels: for classification experiments: the names of the categories (is also used for regression when binning the values)
- labels = ['anger', 'boredom', 'disgust', 'fear', 'happiness', 'neutral', 'sadness']
bins: array of integers to be used for binning continuous data
- bins = [-100, 40, 50, 60, 70, 100]
no_reuse: don’t re-use any tables, but start fresh
- no_reuse = False
min_dur_test: specify a minimum duration for test samples (in seconds)
- min_dur_test = 3.5
target_divide_by: divide the target values by some factor, e.g., to make age smaller and encode years from 0.0 to 1
- target_divide_by = 100
limit_samples: maximum number of random N samples per sample selection
- limit_samples = 20
limit_samples_per_speaker: maximum number of samples per speaker per sample selection
- limit_samples_per_speaker = 20
min_duration_of_sample: limit the samples to a minimum length (in seconds) per sample selection
- min_duration_of_sample = 0.0
max_duration_of_sample: limit the samples to a maximum length (in seconds) per sample selection
- max_duration_of_sample = 0.0
check_size: check the filesize of all samples in train and test splits in bytes
- check_size = 1000
check_vad: check if the files contain speech, using silero VAD
- check_vad = True
filter.sample_selection: restrict the filters to either [train, test, all]
- filter.sample_selection=all
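
To show how bins and labels work together for continuous targets, here is a minimal sketch that maps values to category names. The interval convention (closed on the left) and the label names are assumptions for illustration; the framework's own binning may treat edges differently.

```python
from bisect import bisect_right

def bin_values(values, bins, labels):
    """Map each value to the label of the interval it falls into.

    bins holds the interval edges, so len(labels) must be len(bins) - 1.
    Assumption: intervals are closed on the left, [edge_i, edge_i+1).
    """
    if len(labels) != len(bins) - 1:
        raise ValueError("need exactly one label per interval")
    result = []
    for v in values:
        idx = bisect_right(bins, v) - 1
        idx = min(max(idx, 0), len(labels) - 1)  # clamp out-of-range values
        result.append(labels[idx])
    return result

bins = [-100, 40, 50, 60, 70, 100]            # edges from the bins example above
labels = ["<40", "40s", "50s", "60s", "70+"]  # hypothetical category names
print(bin_values([23, 40, 55, 69, 88], bins, labels))
```

Six edges define five intervals, so the labels option needs exactly one entry fewer than bins.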

AUGMENT

Data augmentation options to artificially expand the training set.

augment: select the augmentation methods: traditional, auglib, or random_splice
- augment = ['traditional', 'auglib', 'random_splice']
- choices are:
  - traditional: uses the audiomentations package
  - auglib: uses audEERING’s auglib package
  - random_splice: randomly re-orders short splices (obfuscates the words)
p_reverse: for random_splice: probability of some samples to be in reverse order (default: 0.3)
top_db: for random_splice: top db level for silence to be recognized (default: 12)
result: file name to store the augmented data (can then be added to training)
- result = augmented.csv
augmentations: select the augmentation methods for the audiomentations package. A default is provided.
- augmentations = Compose([AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.05),Shift(p=0.5),BandPassFilter(min_center_freq=100.0, max_center_freq=6000),])
transformations: select the augmentation methods for the auglib package. Defaults to ["room", "music", "noise", "babble", "crop", "cough"]
- transformations = ['music', 'room', 'cough']

SEGMENT

Audio segmentation settings for splitting recordings into smaller chunks (e.g., by silence or fixed duration).

result: name of the segmented data table as a result. Additionally, a segment file with the gaps will be generated: segmented_silence.csv.
- result = segmented.csv
method: select the model
- method = silero
min_length: the minimum length of rest samples (in seconds)
- min_length = 2
max_length: the maximum length of segments; longer ones are cut here. (in seconds)
- max_length = 10 # if not set, original segmentation is used
output_audio: export actual audio files for each detected segment (default: False)
- output_audio = True
audio_format: output audio format when output_audio is True (default: wav)
- audio_format = wav # supported values: wav, flac, mp3
audio_dir: output directory for audio segments, relative to the experiment data directory ({root}/{name}, default: segments)
- audio_dir = segments
sampling_rate: resample exported audio segments to this rate in Hz; omit to preserve the original sample rate
- sampling_rate = 16000
include_silence_borders: for the result file that represents the gaps between speech: include the borders?
- include_silence_borders = False

FEATS

Feature extraction settings. Multiple feature types can be combined by listing them together; they are concatenated column-wise.

type: a comma-separated list of types of features; they will be column-wise concatenated
- type = ['os']
- possible values:
  - import: already computed features
    - import_file = paths to files with features in CSV format
      - import_file = ['path1/file1.csv', 'path2/file2.csv']
    - import_files_append = set this to False if you want the files to be concatenated column-wise, else it’s done row-wise
      - import_files_append = True
  - mld: mid-level-descriptors
    - mld.model = path to the mld sources folder
    - mld.df = MLD class to use for feature extraction (default: Mld)
      - accepted values: Mld, MldSust, MldStruct
      - example: mld.df = MldSust
    - min_syls = minimum number of syllables
  - os: open smile features
    - set = eGeMAPSv02 (features set)
    - level = functionals (or lld: feature level)
    - os.features: list of selected features (disregard others)
  - praat: selected Praat features, based on David R. Feinberg's scripts
    - praat.features: list of selected features (disregard others)
  - spectra: Melspecs for convolutional networks
    - fft_win_dur = 25 (msec analysis frame/window length)
    - fft_hop_dur = 10 (msec hop duration)
    - fft_nbands = 64 (number of frequency bands)
  - ast: audio spectrogram transformer features from MIT
  - wav2vec variants: wav2vec2 embeddings from facebook
    - "wav2vec2-large-robust-ft-swbd-300h"
    - wav2vec2.model = path to the wav2vec2 model folder
    - wav2vec2.layer = which last hidden layer to use
  - bert variants: Bert embeddings
    - bert.model = path to the bert model folder (without the google-bert/)
    - bert.layer = which last hidden layer to use
    - bert.text_column = which column to use for text analysis
      - bert.text_column = text
  - Hubert variants: facebook Hubert models
    - "hubert-base-ls960", "hubert-large-ll60k", "hubert-large-ls960-ft", "hubert-xlarge-ll60k", "hubert-xlarge-ls960-ft"
  - WavLM:
    - "wavlm-base", "wavlm-base-plus", "wavlm-large"
  - Whisper: whisper models
    - "whisper-base", "whisper-large", "whisper-medium", "whisper-tiny"
  - audmodel: generic audmodel format model import
    - audmodel.id = audmodel id
    - audmodel.embeddings_name = hidden_states
  - audwav2vec2: audEERING emotion model embeddings, wav2vec2.0 model finetuned on MSPPodcast emotions, embeddings
    - aud.model = ./audmodel/ (path to the audEERING model folder)
  - auddim: audEERING emotion model dimensions, wav2vec2.0 model finetuned on MSPPodcast arousal, dominance, valence
  - agender: audEERING age and gender model embeddings, wav2vec2.0 model finetuned on several age databases, embeddings
    - agender.model = ./agender/ (path to the audEERING model folder)
  - agender_agender: audEERING age and gender model age and gender predictions, wav2vec2.0 model finetuned on several age and gender databases: age, female, male, child
  - clap: Laion’s Clap embedding
  - xbow: open crossbow features codebook computed from open smile features
    - xbow.model = path to xbow root folder (containing xbow.jar)
    - size = 500 (codebook size, rule of thumb: should grow with datasize)
    - assignments = 10 (number of words in the bag representation where the counter is increased for each input LLD, rule of thumb: should grow/shrink with codebook size)
  - snr: estimated SNR (signal-to-noise ratio)
  - mos: estimated MOS (mean opinion score)
  - pesq: estimated PESQ (Perceptual Evaluation of Speech Quality)
  - sdr: estimated SDR (signal-to-distortion ratio)
  - spkrec: speaker-id: speechbrain embeddings
  - stoi: estimated STOI (Short-Time Objective Intelligibility)
  - squim: TorchAudio SQUIM (Speech Quality and Intelligibility Measures)
  - trill: Google Research TRILL features
  - wav2vec2: Facebook’s wav2vec2 models
  - whisper: OpenAI’s Whisper ASR model
  - xbow: openXBOW processed opensmile features
  - audmodel: audEERING’s models
balancing: balance the data with respect to class distribution
- balancing = smote
- possible values:
  - ros: Random Over Sampler
  - smote: SMOTE
  - adasyn: ADASYN
  - borderlinesmote: Borderline SMOTE
  - svmsmote: SVM SMOTE
  - smoteenn: SMOTE + Edited Nearest Neighbours
  - smotetomek: SMOTE + Tomek links
  - clustercentroids: Cluster Centroids
  - randomundersampler: Random Under Sampler
  - editednearestneighbours: Edited Nearest Neighbours
  - tomeklinks: Tomek Links
scale: scale (standard/normalize) the features
- scale = standard
- possible values:
  - standard: z-transformation (mean of 0 and std of 1) based on the training set
  - robust: robust scaler
  - speaker: like standard but based on individual speaker sets (also for the test)
  - bins: convert feature values into 0, .5 and 1 (for low, mid and high)
  - minmax: rescales the data set such that all feature values are in the range [0, 1]
  - maxabs: similar to MinMaxScaler except that the values are mapped across several ranges depending on whether negative OR positive values are present
  - normalizer: scales each sample (row) individually to have unit norm (e.g., L2 norm)
  - powertransformer: applies a power transformation to each feature to make the data more Gaussian-like in order to stabilize variance and minimize skewness
  - quantiletransformer: applies a non-linear transformation such that the probability density function of each feature will be mapped to a uniform or Gaussian distribution (range [0, 1])
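
The key point of the standard option, that the statistics come from the training set only, can be sketched in a few lines. The helper names are hypothetical, and the population standard deviation is used here as an assumption (it matches the convention of scikit-learn's StandardScaler).

```python
from statistics import mean, pstdev

def fit_standard(train_column):
    """Estimate mean and standard deviation on the training data only."""
    mu = mean(train_column)
    sigma = pstdev(train_column) or 1.0  # avoid division by zero for constant features
    return mu, sigma

def apply_standard(column, mu, sigma):
    """z-transform a column with previously fitted statistics."""
    return [(x - mu) / sigma for x in column]

train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [3.0, 6.0]

mu, sigma = fit_standard(train)
train_scaled = apply_standard(train, mu, sigma)
test_scaled = apply_standard(test, mu, sigma)  # reuses the training statistics
print(train_scaled, test_scaled)
```

The test values are transformed with the training mean and deviation, so the scaled test set does not necessarily have a mean of 0 or a standard deviation of 1.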
set: name of opensmile feature set, e.g. eGeMAPSv02, ComParE_2016, GeMAPSv01a, eGeMAPSv01a
- set = eGeMAPSv02
level: level of opensmile features
- level = functional
- possible values:
  - functional: aggregated over the whole utterance
  - lld: low-level descriptor: framewise
no_reuse: don’t re-use any feature files, but start fresh
- no_reuse = False
features: disregard all other features and only use the ones stated here.
- features = ['speechrate(nsyll / dur)', 'F0semitoneFrom27.5Hz_sma3nz_amean']
needs_feature_extraction: force the features to be freshly extracted
- needs_feature_extraction = False
print_feats: set this to False if you don’t want os and praat feature names to be printed out
- print_feats = True
store_format: in which format to store the feature data frames [pkl | csv]
- store_format = pkl

MODEL

Model and training specifications. In general, default values should work for classification tasks.

type: select the model
- type = xgb
- possible values:
  - xgb: XGBoost
  - xgr: XGBoost for regression
  - svm: Support vector machine
  - svr: Support vector machine for regression
  - knn: k nearest neighbors
  - knn_reg: k nearest neighbors for regression
  - tree: Decision tree
  - tree_reg: Decision tree for regression
  - nb: Naive Bayes
  - mlp: Multi-layer perceptron (neural network)
  - cnn: Convolutional neural network
  - finetune: Fine-tuning for pre-trained models:
    - pretrained_model: Hugging Face (HF) name of the base model
    - push_to_hub: True
    - max_duration: 8 (in seconds, the rest is discarded)
    - balancing: smote (as in FEATS, only for finetune needs to be defined here)
class_weight: add class_weight to the linear classifier (XGB, SVM) fit methods for imbalanced data (True or False)
- class_weight = False
logo: leave-one-speaker group out. Will disregard train/dev splits and split the speakers in logo groups and then do a LOGO evaluation. If you want LOSO (leave one speaker out), simply set the number to the number of speakers.
- logo = 10
k_fold_cross: k-fold-cross validation. Will disregard train/dev splits and do a stratified cross validation (meaning that classes are balanced across folds). speaker id is ignored.
- k_fold_cross = 10
learning_rate: learning rate for neural networks
- learning_rate = 0.0001
optimizer: optimizer type for neural networks (case insensitive)
- optimizer = adam
- possible values:
  - adam: Adam optimizer (default)
  - adamw: AdamW optimizer with weight decay
  - sgd: SGD optimizer with momentum
- related parameters:
  - weight_decay: weight decay for AdamW optimizer (default: 0.01)
    - weight_decay = 0.01
  - momentum: momentum for SGD optimizer (default: 0.9)
    - momentum = 0.9
scheduler: learning rate scheduler for neural networks (case insensitive)
- scheduler = cosine
- possible values:
  - cosine: cosine annealing with linear warmup (default); steps per batch
  - step: step decay — reduces LR by gamma every step_size epochs; steps per epoch
  - exponential: exponential decay — multiplies LR by gamma each epoch; steps per epoch
  - none / false: no scheduler
- related parameters:
  - warmup_epochs: number of warmup epochs for cosine scheduler (default: 5)
    - warmup_epochs = 5
  - scheduler.step_size: epoch interval for step scheduler (default: 10)
    - scheduler.step_size = 10
  - scheduler.gamma: decay factor for step and exponential schedulers
    - step default: 0.5; exponential default: 0.95
    - scheduler.gamma = 0.5
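
The step and exponential schedules described above follow simple closed forms, shown here as a sketch. The function names are hypothetical; the formulas assume the schedulers behave exactly as described (LR multiplied by gamma every step_size epochs, or every epoch).

```python
def step_lr(base_lr, epoch, step_size=10, gamma=0.5):
    """Step decay: multiply the LR by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

def exponential_lr(base_lr, epoch, gamma=0.95):
    """Exponential decay: multiply the LR by gamma after every epoch."""
    return base_lr * gamma ** epoch

base_lr = 0.001
for epoch in (0, 9, 10, 25):
    print(epoch, step_lr(base_lr, epoch), exponential_lr(base_lr, epoch))
```

With the step defaults, the learning rate is halved at epochs 10, 20, 30 and so on, while the exponential schedule shrinks it by 5 % per epoch.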
drop: dropout rate for neural networks (0 to 1)
- drop = 0.1
batch_size: batch size for neural networks
- batch_size = 8
loss: loss function for neural networks
- loss = cross
- possible values:
  - bce: BinaryCrossEntropyLoss (for binary classification)
  - cross: CrossEntropyLoss
  - f1: F1 loss
  - focal: Focal loss (for imbalanced classification)
  - 1-ccc: 1 minus the concordance correlation coefficient (for regression)
  - mse: Mean squared error (for regression)
  - mae: Mean absolute error (for regression)
  - weighted_bce: Weighted BinaryCrossEntropyLoss (for imbalanced binary classification)
label_smoothing: label smoothing for cross-entropy loss. Accepts either a boolean or a float in [0.0, 1.0]. Helps prevent overconfidence and can improve generalization.
- label_smoothing = 0.1
- Set to True to use the default value of 0.1
- Set to a float between 0.0 and 1.0 for a custom smoothing factor
- Invalid or out-of-range values fall back to 0.0 with a warning
- Default: 0.0 (no smoothing)
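
The rules for label_smoothing listed above can be expressed as a small parsing routine. This is a sketch of the documented behaviour, not the framework's code; the function name parse_label_smoothing is hypothetical.

```python
import warnings

def parse_label_smoothing(raw):
    """Turn the config string into a smoothing factor in [0.0, 1.0]."""
    text = str(raw).strip().lower()
    if text == "true":
        return 0.1  # documented default factor when set to True
    if text == "false":
        return 0.0
    try:
        value = float(text)
    except ValueError:
        warnings.warn(f"invalid label_smoothing {raw!r}, using 0.0")
        return 0.0
    if not 0.0 <= value <= 1.0:
        warnings.warn(f"label_smoothing {value} out of range, using 0.0")
        return 0.0
    return value

for raw in ("True", "0.2", "False"):
    print(raw, "->", parse_label_smoothing(raw))
```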
measure: A measure/metric to report progress with experiments. For classification, default is UAR. For regression, default is MSE.
- measure = mse
- possible values:
  - uar: Unweighted Average Recall (default for classification)
  - eer: Equal Error Rate (for binary classification, commonly used in biometric systems and deepfake detection)
  - mse: Mean Squared Error (default for regression)
  - mae: Mean Absolute Error (for regression)
  - ccc: Concordance Correlation Coefficient (for regression)
- Note: When EER is specified, both EER and UAR will be reported
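
For readers unfamiliar with UAR, the following sketch computes it as the mean of the per-class recalls. The function name is hypothetical and the data is made up for illustration.

```python
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """Mean of the per-class recalls; every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)

# Imbalanced toy data: 8 anger samples, 2 neutral samples.
y_true = ["anger"] * 8 + ["neutral"] * 2
y_pred = ["anger"] * 8 + ["anger", "neutral"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(uar, accuracy)
```

Here accuracy is 0.9 while UAR is 0.75, because the recall of the small neutral class (0.5) weighs as much as that of the large anger class (1.0).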
activation: The activation function for MLPs. One of ["relu", "sigmoid", "tanh", "leaky_relu"]
- activation = relu
layers: specify the layer architecture for MLP
- layers = [64, 16]
C_val: regularization value for SVM
- C_val = 1.0
gamma: gamma value for SVM (kernel coefficient)
- gamma = scale
kernel: kernel type for SVM
- kernel = rbf
- possible values: linear, poly, rbf, sigmoid
K_val: number of neighbors for KNN
- K_val = 5
weights: weight function for KNN
- weights = uniform
- possible values: uniform, distance
n_estimators: number of trees for tree-based models (XGBoost, Random Forest)
- n_estimators = 100
max_depth: maximum depth of trees
- max_depth = 6
subsample: subsample ratio for XGBoost
- subsample = 1.0
colsample_bytree: subsample ratio of columns for XGBoost
- colsample_bytree = 1.0
random_seed: random seed for reproducible results
- random_seed = 42 # set this to False if runs > 1
device: device for neural network training
- device = cpu
- possible values: cpu, cuda
patience: early stopping patience for neural networks
- patience = 5
save: set this to False if you don’t want models stored on disk
- save = True

EXPL

Feature exploration and visualisation options, used by the explore module.

feature_distributions: plot distributions for features and analyze importance
- feature_distributions = False
ignore_gender: ignore gender when plotting feature distribution
- ignore_gender = False
model: Which model to use to estimate feature importance.
- model = ['log_reg'] # can be any model from the MODEL section. If several are combined, the mean result is used.
max_feats: Maximal number of important features
- max_feats = 10
permutation: use feature permutation to determine the best features. Make sure to test the models before.
- permutation = True
scatter: make a scatter plot of combined train and test data, colored by label.
- scatter = ['tsne', 'umap', 'pca']
scatter.target: target for the scatter plot (defaults to target value).
- scatter.target = ['age', 'gender', 'likability']
scatter.dim: dimension of reduction, can be 2 or 3.
- scatter.dim = 2
plot_tree: Plot a decision tree for classification (Requires model = tree)
- plot_tree = False
value_counts: plot distributions of target for the samples and speakers (in the image_dir)
- value_counts = [['gender'], ['age'], ['age', 'duration']]
column.bin_reals: If the column variable is real numbers (instead of categories), should it be binned? for any value in value_counts as well as the target variable
- age.bin_reals = True
dist_type: type of plot for value counts, either histogram (hist) or density estimation (kde)
- dist_type = kde
spotlight: open a web-browser window to inspect the data with the spotlight software. Needs package renumics-spotlight to be installed!
- spotlight = False
shap: compute SHAP values, need to run the model first.
- shap = False
print_stats: whether (possibly extensive) results from statistical tests should be printed out on the debug channel
- print_stats = False
print_colvals: print the unique values for all columns in the data
- print_colvals = False
plot_features: plot distributions for these features in any case, irrespective of their importance
- plot_features = ["speechrate", "mean_f0"]
regplot: do scatter plots for two features, and show categories. When two values are given, the target is used as category; otherwise a category column can be stated as the third value.
- regplot = [["feat_a", "feat_b"], ["feat_a", "feat_b", "emotion"], ["feat_a", "feat_b", "age"]]

PREDICT 

Automatic soft-label prediction using pre-trained models (e.g., age, gender, arousal).

targets: Speaker/speech characteristics to be predicted by some models
- targets = ['text', 'translation', 'textclassification', 'speaker', 'gender', 'age', 'snr', 'arousal', 'valence', 'dominance', 'pesq', 'mos']
- textclassifier.candidates = ["sadness", "anger", "neutral"]: for the target textclassification: the category labels to be predicted (using joeddav/xlm-roberta-large-xnli)
target_language: target language for the translation prediction
- target_language = en

EXPORT

Options for exporting the dataset (audio files and annotations) to a new location or format.

target_root: New root directory for the database, will be created
- target_root = ./exported_data/
orig_root: Path to folder that is parent to the original audio files
- orig_root = ../data/emodb/wav
data_name: Name for the CSV file
- data_name = exported_database
segments_as_files: Whether original files should be used, or segments split (resulting potentially in many new files).
- segments_as_files = False
bundle_path: Output directory for the portable model bundle created by python -m nkululeko.bundle. Defaults to <root>/<name>/export. Overridden by the --output CLI flag.
- bundle_path = ./my_polish_bundle

CROSSDB

Cross-database experiment settings for evaluating generalisation across datasets.

train_extra: add an additional training partition to all experiments in the cross-database series. This extra data should be described in a root_folders file
- train_extra = ['addtrain_db_1', 'addtrain_db_2']

PLOT

Plot styling and output options for result figures.

name: special name as a prefix for all plots (stored in img_dir).
- name = my_special_config_within_the_experiment
epochs: whether to make a separate plot for each epoch result.
- epochs = False
anim_progression: generate an animated GIF from the epoch plots
- anim_progression = False
fps: frames per second for the animated GIF
- fps = 1
epoch_progression: plot the progression of test, train and loss results over epochs
- epoch_progression = False
best_model: search for the best performing model and plot its confusion matrix (needs MODEL.store to be turned on)
- best_model = False
combine_per_speaker: print an extra confusion plot where the predictions per speaker are combined, with either the mode or the mean function
- combine_per_speaker = mode
format: format for plots, either png or eps (for scalable graphics)
- format = png
ccc: show concordance correlation coefficient in plot headings
- ccc = False
fill_areas: should areas, e.g. in distribution plots, be filled?
- fill_areas = False
uncertainty_threshold: plot a confusion matrix with samples removed that are less uncertain
- uncertainty_threshold = .6
runs_compare: generate plots to compare the run results: compare features, models or databases
- runs_compare = features
titles: if titles should be added to the plots
- titles = True
kind: kind of plot for EXPL.feature distributions: [violin, bar, box, swarm, strip]
- kind = violin

RESAMPLE

Audio resampling settings for converting sample rates across a dataset.

replace: whether samples should be replaced in place, or copied with a new dataframe created
- replace = False
target: the name of the new dataframe, if replace = False
- target = data_resampled.csv

REPORT

Controls how experiment results are collected, displayed, and persisted.

show: print the report at the end
- show = False
fresh: start a new report
latex: generate a latex and PDF document: name of document
- latex = my_latex_document
title: title for document
author: author for document

OPTIM

Hyperparameter optimisation settings for automated model tuning.

model: the model type to optimize (e.g., ‘mlp’, ‘svm’, ‘xgb’)
- model = mlp
search_strategy: intelligent search strategy for faster optimization
- search_strategy = random
- possible values:
  - grid: exhaustive grid search (default, slowest but thorough)
  - random: random search with n_iter samples (faster, often as good as grid)
  - halving_random: successive halving random search (fastest, requires sklearn >= 0.24)
  - halving_grid: successive halving grid search (compromise between speed and thoroughness)
metric: evaluation metric for optimization
- metric = uar
- possible values:
  - uar: Unweighted Average Recall (balanced accuracy, good for imbalanced datasets)
  - accuracy: Standard accuracy (default)
  - f1: Macro-averaged F1-score (balance of precision and recall)
  - precision: Macro-averaged precision
  - recall: Macro-averaged recall
  - sensitivity: Sensitivity (same as recall)
  - specificity: Specificity (true negative rate)
n_iter: number of parameter combinations to try for random search
- n_iter = 50
cv_folds: number of cross-validation folds for hyperparameter evaluation
- cv_folds = 3
Parameter specifications: Define search spaces for hyperparameters using tuples for ranges and lists for discrete choices
- nlayers: number of hidden layers for neural networks
  - nlayers = (1, 3) # search from 1 to 3 layers
- nnodes: number of nodes per layer for neural networks
  - nnodes = (16, 256) # search powers of 2 from 16 to 256
- lr: learning rate for neural networks
  - lr = [0.0001, 0.001, 0.01, 0.1] # discrete log-scale choices (recommended)
  - lr = (0.0001, 0.01) # or range with automatic log-scale sampling
- bs: batch size for neural networks
  - bs = (2, 256) # search powers of 2 from 2 to 256
- loss: loss function for neural networks
  - loss = ["cross", "f1"] # discrete choices
- do: dropout rate for neural networks
  - do = (0.1, 0.5, 0.1) # search from 0.1 to 0.5 with step 0.1
- Traditional ML parameters: For SVM, XGB, etc., use parameter names from sklearn
  - C = [0.1, 1.0, 10.0] # SVM regularization parameter
  - n_estimators = [50, 100, 200] # XGB number of estimators
  - max_depth = [3, 6, 9] # XGB maximum depth

Parameter specification formats:

(min, max): Range with automatic step selection based on parameter type
- For learning rates: uses logarithmic sampling (5-8 values)
- For dropout: uses linear sampling (5 values)
- For integers: uses linear sampling
(min, max, step): Range with explicit step size
[val1, val2, …]: Discrete list of values to try (recommended for most cases)
value: Single value (equivalent to [value])
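
The specification formats above can be made concrete with a sketch that expands a spec into candidate values. The function name expand_spec is hypothetical, and treating the upper bound as inclusive is an assumption. The (min, max) form is left out deliberately because its sampling rule depends on the parameter type.

```python
def expand_spec(spec):
    """Expand a parameter specification into a list of candidate values."""
    if isinstance(spec, list):
        return list(spec)                    # discrete choices
    if isinstance(spec, tuple) and len(spec) == 3:
        low, high, step = spec               # explicit step size
        n_steps = int(round((high - low) / step))
        # round() removes floating-point noise such as 0.30000000000000004
        return [round(low + i * step, 10) for i in range(n_steps + 1)]
    if isinstance(spec, tuple):
        raise NotImplementedError("(min, max) needs a type-specific sampling rule")
    return [spec]                            # single value

print(expand_spec((0.1, 0.5, 0.1)))
print(expand_spec(["cross", "f1"]))
print(expand_spec(0.001))
```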

Recommended parameter ranges:

Learning rate: [0.0001, 0.001, 0.01, 0.1] (log-scale discrete values)
Dropout: [0.1, 0.3, 0.5, 0.7] (common dropout rates)
SVM C: [0.1, 1.0, 10.0, 100.0] (regularization parameter)
XGB n_estimators: [50, 100, 200] (number of trees)
XGB max_depth: [3, 6, 9, 12] (tree depth)

Usage: Run with python3 -m nkululeko.optim --config exp.ini

FLAGS

Run experiments with different values at once. All listed parameters are combined via Cartesian product, so one experiment is run per combination. Example:

models = ['xgb', 'svm']
features = ['praat', 'os']
balancing = ['none', 'ros', 'smote']
scale = ['none', 'standard', 'robust', 'minmax']
name_target = list of (EXP.name, DATA.target) pairs iterated as a unit
- Each pair sets EXP.name and DATA.target together for one experiment slot, then that slot is combined via product with any other FLAGS parameters.
- Label DataFrames are reloaded per pair; audio features are extracted once and reused across all pairs.
- example:
```
name_target = [("grade", "grade"), ("roughness", "roughness"), ("strain", "strain")]
models = ['xgb', 'mlp']
```
  → runs 3 × 2 = 6 experiments
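
The combination logic can be reproduced with itertools.product. This sketch only enumerates the experiment settings for the example above; the dictionary keys are chosen for illustration and do not reflect internal data structures of the flags module.

```python
from itertools import product

name_target = [("grade", "grade"), ("roughness", "roughness"), ("strain", "strain")]
models = ["xgb", "mlp"]

# Each (name, target) pair stays together and is combined with every model.
experiments = [
    {"name": name, "target": target, "model": model}
    for (name, target), model in product(name_target, models)
]

for settings in experiments:
    print(settings)
print(len(experiments), "experiments")
```

Adding a further parameter with n values multiplies the count by n, so the number of runs grows quickly with each additional FLAGS entry.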
The FLAGS mechanism can also drive the explore module (feature analysis / visualisation) instead of model training. Pass --mod explore on the command line:
```
python -m nkululeko.flags --config exp.ini --mod explore
```
No result score is reported; output plots are stored per experiment under {EXP.root}/{EXP.name}/images/.