# Overview of options for the nkululeko framework * To be specified in a .ini file, [config parser syntax](https://zetcode.com/python/configparser/) * Kind of all (well, most) values have defaults ## Contents - [Overview of options for the nkululeko framework](#overview-of-options-for-the-nkululeko-framework) - [Contents](#contents) - [Sections](#sections) - [EXP](#exp) - [DATA](#data) - [AUGMENT](#augment) - [SEGMENT](#segment) - [FEATS](#feats) - [MODEL](#model) - [EXPL](#expl) - [PREDICT](#predict) - [EXPORT](#export) - [CROSSDB](#crossdb) - [PLOT](#plot) - [RESAMPLE](#resample) - [REPORT](#report) - [OPTIM](#optim) - [FLAGS](#flags) ## Sections ### EXP General experiment settings: paths, naming, run count, and output options. * **root**: experiment root folder * root = ./results/ * **type**: the kind of experiment * type = classification * possible values: * **classification**: supervised learning experiment with a restricted set of categories (e.g., emotion categories). * **regression**: supervised learning experiment with continuous values (e.g., speaker age in years). * **store**: (relative to *root*) folder for caches * store = ./store/ * **name**: a name for debugging output * name = emodb_exp * **fig_dir**: (relative to *root*) folder for plots * fig_dir = ./images/ * **res_dir**: (relative to *root*) folder for result output * res_dir = ./results/ * **models_dir**: (relative to *root*) folder to save models * models_dir = ./models/ * **runs**: number of runs (e.g., to average over random initializations) * runs = 1 * **epochs**: number of epochs for ANN training * epochs = 1 * **save**: save the experiment as a pickle file to be restored again later (True or False) * save = False * **save_test**: save the test predictions as a new database in CSV format (default is False) * save_test = ./my_saved_test_predictions.csv * **databases**: name of databases to compare for the *multidb* module * databases = ['emodb', 'timit'] * **use_splits**: can be used for multidb module to use the orginal split sets when train or test database. Else the whole database is used. * use_splits = True * **traindevtest**: set to true if you want to specify an extra dev set, that will be used for early stopping (patience) in neural net experiments. * traindevtest = False * **sample_selection**: select the samples to process (e.g. for augmentation, re-sampling, etc.): either *train*, *test*, or *all* * sample_selection = all * **export_onnx** = export the best trained model in [ONNX format](https://github.com/onnx/onnx). * export_onnx = False ### DATA Database loading, label mapping, and train/test split configuration. * **type**: just a flag now to mark continuous data, so it can be binned to categorical data (using *bins* and *labels*) * type = continuous * **databases**: list of databases to be used in the experiment * databases = ['emodb', 'timit'] * **tests**: Datasets to be used as test data for the stored best model. The databases listed here do **not** have to appear in the `databases` field. When `nkululeko.nkululeko` is run with this option set **and** a saved experiment file already exists on disk, training is skipped entirely: the module loads the stored best model, evaluates it on the listed test databases, and writes a confusion matrix, a per-class text report, and a predictions CSV (with all original test columns plus a `predicted` column) to the results directory. On the very first run (no saved file yet) the module trains normally and saves the experiment. See [test_new_database.md](test_new_database.md) for a step-by-step guide. * tests = ['emovo'] * tests = ['ravdess', 'cremad'] ; multiple databases are concatenated * **root_folders**: specify an additional configuration specifically for all entries starting with a dataset name, acting as global defaults. * root_folders = data_roots.ini * **db_name**: path with audformatted repository for each database listed in 'databases*. If this path is not absolute, it will be treated relative to the experiment folder. * emodb = /home/data/audformat/emodb/ * **db_name.type**: type of storage, e.g., audformat database or 'csv' (needs header: file, speaker, task) * emodb.type = audformat * **db_name.absolute_path**: only for 'csv' databases: are the audio file paths relative or absolute? If not absolute, they will be treated relative to the database parent folder. NOT the experiment root folder. * my_data.absolute_path = True * **db_name.audio_path**: only for 'csv' databases: are the audio files in a special common folder? * my_data.audio_path = wav_files/ * **db_name.mapping**: mapping python dictionary to map between categories for cross-database experiments (format: {'target_emo':'source_emo'}) * emodb.mapping = {'anger':'angry', 'happiness':'happy', 'sadness':'sad', 'neutral':'neutral'} * can also be used for general mapping * emodb.mapping = {'gender':{'male':0, 'female':1}, 'emotion':{'anger':'stress', 'neutral':'no stress'}} * **db_name.columns**: names of the columns to load from the data (only for audformat databases) * my_data.columns = ["age", "gender", "speaker", "diagnosis"] * **db_name.label**: name of the target variable for this database (if different from DATA.target) * my_data.label = "expression" * **db_name.colnames**: mapping to rename columns to standard names * my_data.colnames = {'speaker':'Participant ID', 'sex':'gender', 'Age': 'age'} * **db_name.split_strategy**: How to identify sets for train/development data splits within one database * emodb.split_strategy = reuse * Possible values: * **database**: default (*task*.train, *task*.dev and *task*.test) * **specified**: specify the tables (an opportunity to assign multiple or no tables to train or dev set) * emodb.train_tables = ['emotion.categories.train.gold_standard'] * emodb.dev_tables = ['emotion.categories.dev.gold_standard'] * emodb.test_tables = ['emotion.categories.test.gold_standard'] * **speaker_split**: split samples randomly but speaker disjunct, given a percentage of speakers for the test (and dev) set. * emodb.test_size = 50 (default:20) * emodb.dev_size = 20 # for train-dev-test experiments * **list of test speakers**: you can simply provide a list of test ids * emodb.split_strategy = [12, 14, 15, 16] * **speakers_stated**: explicitly state the speaker names for all splits (test and dev are required) * emodb.test = [14, 8] * emodb.dev = [12, 15] * emodb.train = [3, 9, 10, 11, 13, 16] * **random**: split samples randomly (but NOT speaker disjunct, e.g., no speaker info given or each sample a speaker), given a percentage of samples for the test set. * emodb.tests_size = 50 (default:20) * **reuse**: reuse the splits after a *speaker_split* run to save time with feature extraction. * **train**: use the entire database for training * **test**: use the entire database for evaluation / testing * **dev**: use the entire database for evaluation / development * **balanced**: [stratify the data splits](https://blog.syntheticspeech.de/2023/11/07/nkululeko-automatically-stratify-your-split-sets/) * balance = {'emotion':2, 'age':1, 'gender':1} * age_bins = 2 * **db_name.target_tables**: tables that contain the target / speaker / sex labels * emodb.target_tables = ['emotion'] * **target_tables_append**: set this to True if the multiple tables should be combined row-wise, else they are combined column-wise * target_tables_append = False * **db_name.files_tables**: tables that contain the audio file names * emodb.files_tables = ['files'] * **db_name.test_tables**: tables that should be used for testing * emodb.test_tables = ['emotion.categories.test.gold_standard'] * **db_name.train_tables**: tables that should be used for training * emodb.train_tables = ['emotion.categories.train.gold_standard'] * **db_name.as_test**: use only the test split (for automatic experiments) * emodb.as_test = False * **db_name.as_train**: use only the train split (for automatic experiments) * emodb.as_train = False * **db_name.limit_samples**: maximum number of random N samples per table (for testing with very large data mainly) * emodb.limit_samples = 20 * **db_name.required**: force a data set to have a specific feature (for example, filter all sets that have gender labeled in a database where this is not the case for all samples, e.g. MozillaCommonVoice) * emodb.required = gender * **db_name.limit_samples_per_speaker**: maximum number of samples per speaker (for leveling data where the same speakers have a large number of samples) * emodb.limit_samples_per_speaker = 20 * **db_name.min_duration_of_sample**: limit the samples to a minimum length (in seconds) * emodb.min_duration_of_sample = 0.0 * **db_name.max_duration_of_sample**: limit the samples to a maximum length (in seconds) * emodb.max_duration_of_sample = 0.0 * **db_name.rename_speakers**: add the database name to the speaker names, e.g., because several databases use the same names * emodb.rename_speakers = False * **db_name.filter**: don't use all the data but only selected values from columns: [col, val]* * emodb.filter = {'gender': ['female', 'diverse']} * **db_name.scale**: [scale (standard normalize) the target variable](http://blog.syntheticspeech.de/2024/03/13/nkululeko-how-to-tweak-the-target-variable-for-database-comparison/) (if numeric) * my_data.scale = True * **db_name.reverse**: reverse the target variable (if numeric). I.e. f(x) = abs(x-max) * **db_name.reverse.max**: max value to be used in the formula above. If omitted, the distribution will start with 0. * **target**: the task name, e.g. *age* or *emotion* * target = emotion * **labels**: for classification experiments: the names of the categories (is also used for regression when binning the values) * labels = ['anger', 'boredom', 'disgust', 'fear', 'happiness', 'neutral', 'sadness'] * **bins**: array of integers to be used for binning continuous data * bins = [-100, 40, 50, 60, 70, 100] * **no_reuse**: don't re-use any tables, but start fresh * no_reuse = False * **min_dur_test**: specify a minimum duration for test samples (in seconds) * min_dur_test = 3.5 * **target_divide_by**: divide the target values by some factor, e.g., to make age smaller and encode years from .0 to 1 * target_divide_by = 100 * **limit_samples**: maximum number of random N samples per sample selection * limit_samples = 20 * **limit_samples_per_speaker**: maximum number of samples per speaker per sample selection * limit_samples_per_speaker = 20 * **min_duration_of_sample**: limit the samples to a minimum length (in seconds) per sample selection * min_duration_of_sample = 0.0 * **max_duration_of_sample**: limit the samples to a maximum length (in seconds) per sample selection * max_duration_of_sample = 0.0 * **check_size**: check the filesize of all samples in train and test splits in bytes * check_size = 1000 * **check_vad**: check if the files contain speech, using [silero VAD](https://github.com/snakers4/silero-vad) * check_vad = True * **filter.sample_selection**: restrict the filters to either [train, test, all] * filter.sample_selection=all ### AUGMENT Data augmentation options to artificially expand the training set. * **augment**: select the methods to augment: either *traditional* or *random_splice* * augment = ['traditional', 'auglib', 'random_splice'] * choices are: * *traditional*: uses the [audiomentations package](https://github.com/iver56/audiomentations) * *auglib*: uses [audEERING's auglib package](https://audeering.github.io/auglib/) * *random_splice*: randomly re-orders short splices (obfuscates the words) * **p_reverse**: for random_splice: probability of some samples to be in reverse order (default: 0.3) * **top_db**: for random_splice: top db level for silence to be recognized (default: 12) * **result**: file name to store the augmented data (can then be added to training) * result = augmented.csv * **augmentations**: select the augmentation methods for the audiomentation module. Default provided. * augmentations = Compose([AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.05),Shift(p=0.5),BandPassFilter(min_center_freq=100.0, max_center_freq=6000),]) * **transformations**: select the augmentation methods for the auglib package. Defaults to ["room", "music", "noise", "babble", "crop", "cough"] * transformations = ['music', 'room', 'cough'] ### SEGMENT Audio segmentation settings for splitting recordings into smaller chunks (e.g., by silence or fixed duration). * **result**: name of the segmented data table as a result. Additionally, a segment file with the gaps will be generated: *segmented_silence.csv*. * result = segmented.csv * **method**: select the model * method = [silero](https://github.com/snakers4/silero-vad) * **min_length**: the minimum length of rest samples (in seconds) * min_length = 2 * **max_length**: the maximum length of segments; longer ones are cut here. (in seconds) * max_length = 10 # if not set, original segmentation is used * **output_audio**: export actual audio files for each detected segment (default: False) * output_audio = True * **audio_format**: output audio format when *output_audio* is True (default: wav) * audio_format = wav # supported values: wav, flac, mp3 * **audio_dir**: output directory for audio segments, relative to the experiment data directory (`{root}/{name}`, default: segments) * audio_dir = segments * **sampling_rate**: resample exported audio segments to this rate in Hz; omit to preserve the original sample rate * sampling_rate = 16000 * **include_silence_borders**: for the result file that represent the gaps between speech: include the borders? * include_silence_borders = False ### FEATS Feature extraction settings. Multiple feature types can be combined by listing them together; they are concatenated column-wise. * **type**: a comma-separated list of types of features; they will be column-wise concatenated * type = ['os'] * possible values: * **import**: [already computed features](http://blog.syntheticspeech.de/2022/10/18/how-to-import-features-from-outside-the-nkululeko-software/) * **import_file** = pathes to files with features in CSV format * import_file = ['path1/file1.csv', 'path2/file1.csv2'] * **import_files_append** = set this to False if you want the files to be concatenated column-wise, else it's done row-wise * import_files_append = True * **mld**: [mid-level-descriptors](http://www.essv.de/paper.php?id=447) * **mld.model** = *path to the mld sources folder* * **mld.df** = *MLD class to use for feature extraction* (default: `Mld`) * accepted values: `Mld`, `MldSust`, `MldStruct` * example: `mld.df = MldSust` * **min_syls** = *minimum number of syllables* * **os**: [open smile features](https://audeering.github.io/opensmile-python/) * **set** = eGeMAPSv02 *(features set)* * **level** = functionals *(or lld: feature level)* * **os.features**: list of selected features (disregard others) * **praat**: Praat selected features thanks to [David R. Feinberg scripts](https://github.com/drfeinberg/PraatScripts) * **praat.features**: list of selected features (disregard others) * **spectra**: Melspecs for convolutional networks * **fft_win_dur** = 25 *(msec analysis frame/window length)* * **fft_hop_dur** = 10 *(msec hop duration)* * **fft_nbands** = 64 *(number of frequency bands)* * **ast**: [audio spectrogram transformer](https://arxiv.org/abs/2104.01778) features from MIT * **wav2vec variants**: [wav2vec2 embeddings](https://huggingface.co/facebook/wav2vec2-large-robust-ft-swbd-300h) from facebook * "wav2vec2-large-robust-ft-swbd-300h" * **wav2vec2.model** = *path to the wav2vec2 model folder* * **wav2vec2.layer** = *which last hidden layer to use* * **bert variants**: [Bert embeddings](https://huggingface.co/transformers/v3.0.2/model_doc/bert.html#bertmodel) * **bert.model** = path to the bert model folder (without the google-bert/) * **bert.layer** = which last hidden layer to use * **bert.text_column** = which column to use for text analysis * *bert.text_column = text* * **Hubert variants**: [facebook Hubert models](https://ai.meta.com/blog/hubert-self-supervised-representation-learning-for-speech-recognition-generation-and-compression/) * "hubert-base-ls960", "hubert-large-ll60k", "hubert-large-ls960-ft", hubert-xlarge-ll60k, "hubert-xlarge-ls960-ft" * **WavLM**: * "wavlm-base", "wavlm-base-plus", "wavlm-large" * **Whisper**: [whisper models](https://huggingface.co/models?other=whisper) * "whisper-base", "whisper-large", "whisper-medium", "whisper-tiny" * **audmodel**: generic [audmodel format model](https://audeering.github.io/audmodel/index.html) import * **audmodel.id** = audmodel id * **audmodel.embeddings_name** = hidden_states * **audwav2vec2**: [audEERING emotion model embeddings](https://arxiv.org/abs/2203.07378), wav2vec2.0 model finetuned on [MSPPodcast](https://www.lab-msp.com/MSP/MSP-Podcast.html) emotions, embeddings * **aud.model** = ./audmodel/ (*path to the audEERING model folder*) * **auddim**: [audEERING emotion model dimensions](https://arxiv.org/abs/2203.07378), wav2vec2.0 model finetuned on [MSPPodcast](https://ecs.utdallas.edu/research/researchlabs/msp-lab/MSP-Podcast.html) arousal, dominance, valence * **agender**: [audEERING age and gender model embeddings](https://arxiv.org/abs/2306.16962), wav2vec2.0 model finetuned on [several age databases](https://github.com/audeering/w2v2-age-gender-how-to), embeddings * **agender.model** = ./agender/ (*path to the audEERING model folder*) * **agender_agender**: [audEERING age and gender model age and gender predictions](https://arxiv.org/abs/2306.16962), wav2vec2.0 model finetuned on [several age and gendeer databases](https://github.com/audeering/w2v2-age-gender-how-to): age, female, male, child * **clap**: [Laion's Clap embedding](https://github.com/LAION-AI/CLAP) * **xbow**: [open crossbow](https://github.com/openXBOW) features codebook computed from open smile features * **xbow.model** = *path to xbow root folder (containing xbow.jar)* * **size** = 500 *(codebook size, rule of thumb: should grow with datasize)* * **assignments** = 10 *(number of words in the bag representation where the counter is increased for each input LLD, rule of thumb: should grow/shrink with codebook size)* * **snr**: estimated SNR (signal-to-noise ratio) * **mos**: estimated [MOS](https://arxiv.org/pdf/2304.01448.pdf) (mean opinion score) * **pesq**: estimated [PESQ](https://arxiv.org/pdf/2304.01448.pdf) (Perceptual Evaluation of Speech Quality) * **sdr**: estimated [SDR](https://arxiv.org/pdf/2304.01448.pdf) (Perceptual Evaluation of Speech Quality) * **spkrec**: speaker-id: [speechbrain embeddings](https://huggingface.co/speechbrain/spkrec-xvect-voxceleb) * **stoi**: estimated [STOI](https://arxiv.org/pdf/2304.01448.pdf) (Perceptual Evaluation of Speech Quality) * **squim**: [TorchAudio SQUIM](https://pytorch.org/audio/stable/tutorials/squim_tutorial.html) (Speech Quality and Intelligibility Measures) * **trill**: [Google Research TRILL features](https://ai.googleblog.com/2020/06/improving-speech-representations-and.html) * **wav2vec2**: [Facebook's wav2vec2 models](https://huggingface.co/docs/transformers/model_doc/wav2vec2) * **whisper**: [OpenAI's Whisper ASR model](https://openai.com/research/whisper) * **xbow**: [openXBOW processed opensmile features](https://github.com/openXBOW/openXBOW) * **audmodel**: [audEERING's models](https://github.com/audeering/audmodel) * **balancing**: [balance the data with respect to class distribution](https://imbalanced-learn.org/stable/) * balancing = smote * possible values: * **ros**: Random Over Sampler * **smote**: SMOTE * **adasyn**: ADASYN * **borderlinesmote**: Borderline SMOTE * **svmsmote**: SVM SMOTE * **smoteenn**: SMOTE + Edited Nearest Neighbours * **smotetomek**: SMOTE + Tomek links * **clustercentroids**: Cluster Centroids * **randomundersampler**: Random Under Sampler * **editednearestneighbours**: Edited Nearest Neighbours * **tomeklinks**: Tomek Links * **scale**: scale (standard/normalize) the features * scale = standard * possible values: * **standard**: z-transformation (mean of 0 and std of 1) based on the training set * **robust**: robust scaler * **speaker**: like *standard* but based on individual speaker sets (also for the test) * **bins**: convert feature values into 0, .5 and 1 (for low, mid and high) * **minmax**: rescales the data set such that all feature values are in the range [0, 1] * **maxabs**: similar to MinMaxScaler except that the values are mapped across several ranges depending on whether negative OR positive values are present * **normalizer**: scales each sample (row) individually to have unit norm (e.g., L2 norm) * **powertransformer**: applies a power transformation to each feature to make the data more Gaussian-like in order to stabilize variance and minimize skewness * **quantiletransformer**: applies a non-linear transformation such that the probability density function of each feature will be mapped to a uniform or Gaussian distribution (range [0, 1]) * **set**: name of opensmile feature set, e.g. eGeMAPSv02, ComParE_2016, GeMAPSv01a, eGeMAPSv01a * set = eGeMAPSv02 * **level**: level of opensmile features * level = functional * possible values: * **functional**: aggregated over the whole utterance * **lld**: low-level descriptor: framewise * **no_reuse**: don't re-use any feature files, but start fresh * no_reuse = False * **features**: disregard all other features and only use these the ones stated here. * features = ['speechrate(nsyll / dur)', 'F0semitoneFrom27.5Hz_sma3nz_amean'] * **needs_feature_extraction**: force the features to be freshly extracted * needs_feature_extraction = False * **print_feats**: set this to False if you don't want os and praat feature names to be printed out * print_feats = True * **store_format**: in which format to store the feature data frames [pkl | csv] * store_format = pkl ### MODEL Model and training specifications. In general, default values should work for classification tasks. * **type**: select the model * type = xgb * possible values: * **xgb**: [XGBoost](http://blog.syntheticspeech.de/2021/07/13/nkululeko-how-to-use-xgboost-for-speech-emotion-recognition/) * **xgr**: XGBoost for regression * **svm**: [Support vector machine](http://blog.syntheticspeech.de/2021/07/08/nkululeko-how-to-use-support-vector-machines-for-speech-emotion-recognition/) * **svr**: Support vector machine for regression * **knn**: [k nearest neighbors](http://blog.syntheticspeech.de/2021/08/19/nkululeko-k-nearest-neighbors/) * **knn_reg**: k nearest neighbors for regression * **tree**: Decision tree * **tree_reg**: Decision tree for regression * **nb**: Naive Bayes * **mlp**: [Multi-layer perceptron](http://blog.syntheticspeech.de/2021/08/30/nkululeko-multi-layer-perceptron/) (neural network) * **cnn**: [Convolutional neural network](http://blog.syntheticspeech.de/2022/01/17/how-to-use-convolutional-neural-networks-with-nkululeko/) * **finetune**: [Fine-tuning](http://blog.syntheticspeech.de/2022/10/07/nkululeko-how-to-fine-tune-a-wav2vec2-model/) for pre-trained models: - pretrained_model: HF for base model - push_to_hub: True - max_duration: 8 (in seconds, resit are disgarded) - balancing: smote (as in FEATS, only for finetune needs to be defined here) * **class_weight**: add class_weight to the linear classifier (XGB, SVM) fit methods for imbalanced data (True or False) * class_weight = False * **logo**: leave-one-speaker group out. Will disregard train/dev splits and split the speakers in *logo* groups and then do a LOGO evaluation. If you want LOSO (leave one speaker out), simply set the number to the number of speakers. * logo = 10 * **k_fold_cross**: k-fold-cross validation. Will disregard train/dev splits and do a stratified cross validation (meaning that classes are balanced across folds). speaker id is ignored. * k_fold_cross = 10 * **learning_rate**: learning rate for neural networks * learning_rate = 0.0001 * **optimizer**: optimizer type for neural networks (case insensitive) * optimizer = adam * possible values: * **adam**: Adam optimizer (default) * **adamw**: AdamW optimizer with weight decay * **sgd**: SGD optimizer with momentum * related parameters: * **weight_decay**: weight decay for AdamW optimizer (default: 0.01) * weight_decay = 0.01 * **momentum**: momentum for SGD optimizer (default: 0.9) * momentum = 0.9 * **scheduler**: learning rate scheduler for neural networks (case insensitive) * scheduler = cosine * possible values: * **cosine**: cosine annealing with linear warmup (default); steps per batch * **step**: step decay — reduces LR by gamma every step_size epochs; steps per epoch * **exponential**: exponential decay — multiplies LR by gamma each epoch; steps per epoch * **none** / **false**: no scheduler * related parameters: * **warmup_epochs**: number of warmup epochs for cosine scheduler (default: 5) * warmup_epochs = 5 * **scheduler.step_size**: epoch interval for step scheduler (default: 10) * scheduler.step_size = 10 * **scheduler.gamma**: decay factor for step and exponential schedulers * step default: 0.5; exponential default: 0.95 * scheduler.gamma = 0.5 * **drop**: dropout rate for neural networks (0 to 1) * drop = 0.1 * **batch_size**: batch size for neural networks * batch_size = 8 * **loss**: loss function for neural networks * loss = cross * possible values: * **bce**: BinaryCrossEntropyLoss (for binary classification) * **cross**: CrossEntropyLoss * **f1**: F1 loss * **focal**: Focal loss (for imbalanced classification) * **1-ccc**: concordance correlation coefficient * **mse**: Mean squared error (for regression) * **mae**: Mean absolute error (for regression) * **weighted_bce**: Weighted BinaryCrossEntropyLoss (for imbalanced binary classification) * **label_smoothing**: label smoothing for cross-entropy loss. Accepts either a boolean or a float in [0.0, 1.0]. Helps prevent overconfidence and can improve generalization. * label_smoothing = 0.1 * Set to `True` to use the default value of 0.1 * Set to a float between 0.0 and 1.0 for a custom smoothing factor * Invalid or out-of-range values fall back to 0.0 with a warning * Default: 0.0 (no smoothing) * **measure**: A measure/metric to report progress with experiments. For classification, default is UAR. For regression, default is MSE. * measure = mse * possible values: * **uar**: Unweighted Average Recall (default for classification) * **eer**: Equal Error Rate (for binary classification, commonly used in biometric systems and deepfake detection) * **mse**: Mean Squared Error (default for regression) * **mae**: Mean Absolute Error (for regression) * **ccc**: Concordance Correlation Coefficient (for regression) * Note: When EER is specified, both EER and UAR will be reported * **activation**: The activation function for MLPs. One of ["relu", "sigmoid", "tanh", "leaky_relu"] * activation = relu * **layers**: specify the layer architecture for MLP * layers = [64, 16] * **C_val**: regularization value for SVM * C_val = 1.0 * **gamma**: gamma value for SVM (kernel coefficient) * gamma = scale * **kernel**: kernel type for SVM * kernel = rbf * possible values: linear, poly, rbf, sigmoid * **K_val**: number of neighbors for KNN * K_val = 5 * **weights**: weight function for KNN * weights = uniform * possible values: uniform, distance * **n_estimators**: number of trees for tree-based models (XGBoost, Random Forest) * n_estimators = 100 * **max_depth**: maximum depth of trees * max_depth = 6 * **subsample**: subsample ratio for XGBoost * subsample = 1.0 * **colsample_bytree**: subsample ratio of columns for XGBoost * colsample_bytree = 1.0 * **random_seed**: random seed for reproducible results * random_seed = 42 # set this to *False* if #run > 1 * **device**: device for neural network training * device = cpu * possible values: cpu, cuda * **patience**: early stopping patience for neural networks * patience = 5 * **save**: set this to *False* if you don't want models stored on disk * save = True ### EXPL Feature exploration and visualisation options, used by the `explore` module. * **feature_distributions**: plot distributions for features and analyze importance * feature_distributions = False * **ignore_gender**: ignore gender when plotting feature distribution * ignore_gender = False * **model**: Which model to use to estimate feature importance. * model = ['log_reg'] # can be all models from the [MODEL](#model) section, If they are combined, the mean result is used. * **max_feats**: Maximal number of important features * max_feats = 10 * **permutation**: use [feature permutation](https://scikit-learn.org/stable/modules/permutation_importance.html) to determine the best features. Make sure to test the models before. * permutation = True * **scatter**: make a scatter plot of combined train and test data, colored by label. * scatter = ['tsne', 'umap', 'pca'] * **scatter.target**: target for the scatter plot (defaults to *target* value). * scatter.target = ['age', 'gender', 'likability'] * **scatter.dim**: dimension of reduction, can be 2 or 3. * scatter.dim = 2 * **plot_tree**: Plot a decision tree for classification (Requires model = tree) * plot_tree = False * **value_counts**: plot distributions of target for the samples and speakers (in the *image_dir*) * value_counts = [['gender'], ['age'], ['age', 'duration']] * **column.bin_reals**: If the column variable is real numbers (instead of categories), should it be binned? for any value in *value_counts* as well as the target variable * age.bin_reals = True * **dist_type**: type of plot for value counts, either histogram (hist) or density estimation (kde) * dist_type = kde * **spotlight**: open a web-browser window to inspect the data with the [spotlight software](https://github.com/Renumics/spotlight). Needs package *renumics-spotlight* to be installed! * spotlight = False * **shap**: compute [SHAP](https://shap.readthedocs.io/en/latest/) values, need to run the model first. * shap = False * **print_stats**: whether (possibly extensive) results from statistical tests should be printed out on the debug channel * print_stats = False * **print_colvals**: print the unique values for all columns in the data * print_colvals = False * **plot_features**: plot distributions for this features in any case, irrespective of their importance * plot_features = ["speechrate", "mean_f0"] * **regplot**: do scatter plots for two features, and show categories. When two values are given, the target is used as category, else one could be stated. * regplot = [["feat_a", "feat_b"], ["feat_a", "feat_b", "emotion"], ["feat_a", "feat_b", "age"]] ### [PREDICT](#predict) Automatic soft-label prediction using pre-trained models (e.g., age, gender, arousal). * **targets**: Speaker/speech characteristics to be predicted by some models * targets = ['text', 'translation', 'textclassification', 'speaker', 'gender', 'age', 'snr', 'arousal', 'valence', 'dominance', 'pesq', 'mos'] * textclassifier.candidates = ["sadness", "anger", "neutral"]: for target *textclassification*: the labels for the categories that should be predicted (using [joeddav/xlm-roberta-large-xnli](https://huggingface.co/joeddav/xlm-roberta-large-xnli)) * **target_language**: target language for the translation prediction * target_language = en ### EXPORT Options for exporting the dataset (audio files and annotations) to a new location or format. * **target_root**: New root directory for the database, will be created * target_root = ./exported_data/ * **orig_root**: Path to folder that is parent to the original audio files * orig_root = ../data/emodb/wav * **data_name**: Name for the CSV file * data_name = exported_database * **segments_as_files**: Whether original files should be used, or segments split (resulting potentially in many new files). * segments_as_files = False * **bundle_path**: Output directory for [the portable model bundle](bundle.md) created by `python -m nkululeko.bundle`. Defaults to `//export`. Overridden by the `--output` CLI flag. * bundle_path = ./my_polish_bundle ### CROSSDB Cross-database experiment settings for evaluating generalisation across datasets. * **train_extra**: add a additional training partition to all experiments in [the cross database series](http://blog.syntheticspeech.de/2024/01/02/nkululeko-compare-several-databases/). This extra data should be described [in a root_folders file](http://blog.syntheticspeech.de/2022/02/21/specifying-database-disk-location-with-nkululeko/) * train_extra = ['addtrain_db_1', 'addtrain_db_2'] ### PLOT Plot styling and output options for result figures. * **name**: special name as a prefix for all plots (stored in *img_dir*). * name = my_special_config_within_the_experiment * **epochs**: whether to make a plot each for every epoch result. * epochs = False * **anim_progression**: generate an **animated** GIF from the epoch plots * anim_progression = False * **fps**: frames per second for the animated GIF * fps = **1** * **epoch_progression**: plot the progression of test, train and loss results over epochs * epoch_progression = False * **best_model**: search for the best performing model and plot conf matrix (needs *MODEL.store* to be turned on) * best_model = False * **combine_per_speaker**: print an extra confusion plot where the predictions per speaker are combined, with either the `mode` or the `mean` function * combine_per_speaker = mode * **format**: format for plots, either *png* or *eps* (for scalable graphics) * format = png * **ccc**: show concordance correlation coefficient in plot headings * ccc = False * **fill_areas**: should areas, e.g. in distribution plots, be filled? * fill_areas = False * **uncertainty_threshold**: plot a confusionmatrix with samples removed that are less uncertain * uncertainty_threshold = .6 * **runs_compare**: generate plots to compare the run results: compare *features*, *models* or *databases* * runs_compare = features * **titles**: if titles should be added to the plots * titles = True * **kind**: kind of plot for EXPL.feature distributions: [violin, bar, box, swarm, strip] * kind = violin ### RESAMPLE Audio resampling settings for converting sample rates across a dataset. * **replace**: whether samples should be replaced right where they are, or copies done and a new dataframe given * replace = False * **target**: the name of the new dataframe, if replace==false * target = data_resampled.csv ### REPORT Controls how experiment results are collected, displayed, and persisted. * **show**: print the report at the end * show = False * **fresh**: start a new report * **latex**: generate a latex and PDF document: name of document * latex = my_latex_document * **title**: title for document * **author**: author for document ### OPTIM Hyperparameter optimisation settings for automated model tuning. * **model**: the model type to optimize (e.g., 'mlp', 'svm', 'xgb') * model = mlp * **search_strategy**: intelligent search strategy for faster optimization * search_strategy = random * possible values: * **grid**: exhaustive grid search (default, slowest but thorough) * **random**: random search with n_iter samples (faster, often as good as grid) * **halving_random**: successive halving random search (fastest, requires sklearn >= 0.24) * **halving_grid**: successive halving grid search (compromise between speed and thoroughness) * **metric**: evaluation metric for optimization * metric = uar * possible values: * **uar**: Unweighted Average Recall (balanced accuracy, good for imbalanced datasets) * **accuracy**: Standard accuracy (default) * **f1**: Macro-averaged F1-score (balance of precision and recall) * **precision**: Macro-averaged precision * **recall**: Macro-averaged recall * **sensitivity**: Sensitivity (same as recall) * **specificity**: Specificity (true negative rate) * **n_iter**: number of parameter combinations to try for random search * n_iter = 50 * **cv_folds**: number of cross-validation folds for hyperparameter evaluation * cv_folds = 3 * **Parameter specifications**: Define search spaces for hyperparameters using tuples for ranges and lists for discrete choices * **nlayers**: number of hidden layers for neural networks * nlayers = (1, 3) # search from 1 to 3 layers * **nnodes**: number of nodes per layer for neural networks * nnodes = (16, 256) # search powers of 2 from 16 to 256 * **lr**: learning rate for neural networks * lr = [0.0001, 0.001, 0.01, 0.1] # discrete log-scale choices (recommended) * lr = (0.0001, 0.01) # or range with automatic log-scale sampling * **bs**: batch size for neural networks * bs = (2, 256) # search powers of 2 from 2 to 256 * **loss**: loss function for neural networks * loss = ["cross", "f1"] # discrete choices * **do**: dropout rate for neural networks * do = (0.1, 0.5, 0.1) # search from 0.1 to 0.5 with step 0.1 * **Traditional ML parameters**: For SVM, XGB, etc., use parameter names from sklearn * C = [0.1, 1.0, 10.0] # SVM regularization parameter * n_estimators = [50, 100, 200] # XGB number of estimators * max_depth = [3, 6, 9] # XGB maximum depth **Parameter specification formats**: * **(min, max)**: Range with automatic step selection based on parameter type * For learning rates: uses logarithmic sampling (5-8 values) * For dropout: uses linear sampling (5 values) * For integers: uses linear sampling * **(min, max, step)**: Range with explicit step size * **[val1, val2, ...]**: Discrete list of values to try (recommended for most cases) * **value**: Single value (equivalent to [value]) **Recommended parameter ranges**: * **Learning rate**: `[0.0001, 0.001, 0.01, 0.1]` (log-scale discrete values) * **Dropout**: `[0.1, 0.3, 0.5, 0.7]` (common dropout rates) * **SVM C**: `[0.1, 1.0, 10.0, 100.0]` (regularization parameter) * **XGB n_estimators**: `[50, 100, 200]` (number of trees) * **XGB max_depth**: `[3, 6, 9, 12]` (tree depth) **Usage**: Run with `python3 -m nkululeko.optim --config exp.ini` ### FLAGS Running different values at once. All listed parameters are combined via Cartesian product — one experiment is run per combination. Example: * **models** = ['xgb', 'svm'] * **features** = ['praat', 'os'] * **balancing** = ['none', 'ros', 'smote'] * **scale** = ['none', 'standard', 'robust', 'minmax'] * **name_target** = *list of (EXP.name, DATA.target) pairs iterated as a unit* * Each pair sets `EXP.name` and `DATA.target` together for one experiment slot, then that slot is combined via product with any other FLAGS parameters. * Label DataFrames are reloaded per pair; audio features are extracted once and reused across all pairs. * example: ```ini name_target = [("grade", "grade"), ("roughness", "roughness"), ("strain", "strain")] models = ['xgb', 'mlp'] ``` → runs 3 × 2 = 6 experiments The FLAGS mechanism can also drive the `explore` module (feature analysis / visualisation) instead of model training. Pass `--mod explore` on the command line: ```bash python -m nkululeko.flags --config exp.ini --mod explore ``` No result score is reported; output plots are stored per experiment under `{EXP.root}/{EXP.name}/images/`.