# Finetuning Transformer Models

This tutorial shows how to finetune pretrained transformer models (like wav2vec2, WavLM, HuBERT) for your specific classification or regression task.

**Reference**: [Nkululeko: How to finetune a transformer model](http://blog.syntheticspeech.de/2024/05/29/nkululeko-how-to-finetune-a-transformer-model/)

## Overview

Since version 0.85.0, Nkululeko supports finetuning transformer models with [HuggingFace](https://huggingface.co/docs/transformers/training).

**Finetuning** means training the entire pretrained transformer with your data labels, as opposed to only using the last layer as embeddings (which is what `type = ['wav2vec2']` does in `[FEATS]`).

## When to Finetune vs Use Embeddings

| Approach | When to Use |
|----------|-------------|
| **Embeddings** (`[FEATS] type = ['wav2vec2']`) | Small datasets, quick experiments, limited GPU |
| **Finetuning** (`[MODEL] type = finetune`) | Large datasets, best performance, GPU available |

## Basic Configuration

To finetune a transformer model:

```ini
[EXP]
root = ./examples/results/
name = wavlm_finetuned
epochs = 5

[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = speaker_split
target = emotion

[FEATS]
; Features should be empty for finetuning
type = []

[MODEL]
type = finetune
```

### Key Points

- `[FEATS] type = []` - Must be empty because the transformer model has its own CNN layers for acoustic feature extraction
- `[MODEL] type = finetune` - Triggers finetuning mode
- Maximum audio duration: 8 seconds by default (rest is ignored)

## Choosing a Pretrained Model

The default model is [facebook/wav2vec2-large-robust-ft-swbd-300h](https://huggingface.co/facebook/wav2vec2-large-robust-ft-swbd-300h).

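
The fallback to this default can be pictured with a small sketch using Python's standard `configparser`. This is an illustration only, not Nkululeko's actual code: the helper name `resolve_pretrained_model` and the constant `DEFAULT_MODEL` are hypothetical, and the only facts taken from this tutorial are the `[MODEL]` section, the `pretrained_model` key, and the default model name.

```python
import configparser

# Default model name as stated in this tutorial (the constant itself is hypothetical).
DEFAULT_MODEL = "facebook/wav2vec2-large-robust-ft-swbd-300h"


def resolve_pretrained_model(ini_text: str) -> str:
    """Return [MODEL] pretrained_model, or the documented default if it is absent.

    Hypothetical helper for illustration; Nkululeko may resolve this differently.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    # `fallback` is returned when either the section or the option is missing.
    return parser.get("MODEL", "pretrained_model", fallback=DEFAULT_MODEL)


basic = "[MODEL]\ntype = finetune\n"
custom = "[MODEL]\ntype = finetune\npretrained_model = microsoft/wavlm-base\n"

print(resolve_pretrained_model(basic))   # no key set, so the default is used
print(resolve_pretrained_model(custom))  # explicit key overrides the default
```
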
Specify a different model:

```ini
[MODEL]
type = finetune
pretrained_model = microsoft/wavlm-base
```

### Popular Pretrained Models

| Model | Description |
|-------|-------------|
| `facebook/wav2vec2-large-robust-ft-swbd-300h` | Default, robust to noise |
| `microsoft/wavlm-base` | Good for speech tasks |
| `microsoft/wavlm-large` | Larger, better performance |
| `facebook/hubert-base-ls960` | HuBERT base model |
| `facebook/wav2vec2-base-960h` | Smaller, faster |

## Training Parameters

Configure deep learning hyperparameters:

```ini
[MODEL]
type = finetune
pretrained_model = microsoft/wavlm-base
learning_rate = 0.0001
batch_size = 16
device = cuda:0
duration = 10.5
```

### Parameter Reference

| Parameter | Default | Description |
|-----------|---------|-------------|
| `pretrained_model` | wav2vec2-large-robust | HuggingFace model name |
| `learning_rate` | 0.00001 | Learning rate |
| `batch_size` | 8 | Batch size (reduce if OOM) |
| `device` | cuda | Device: `cuda`, `cuda:0`, `cpu` |
| `duration` | 8 | Max audio duration in seconds |

## Loss Functions

Loss functions are automatically selected:

- **Classification**: Weighted cross-entropy
- **Regression**: Concordance correlation coefficient (CCC)

## Publishing to HuggingFace

To publish your finetuned model to HuggingFace Hub:

```ini
[MODEL]
type = finetune
push_to_hub = True
```

Make sure you're logged in to HuggingFace CLI first:

```bash
huggingface-cli login
```

## Complete Example

```ini
[EXP]
root = ./examples/results/
name = wavlm_finetuned
runs = 1
epochs = 10
save = True

[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = speaker_split
target = emotion
labels = ['anger', 'happiness', 'neutral', 'sadness']

[FEATS]
type = []

[MODEL]
type = finetune
pretrained_model = microsoft/wavlm-base
batch_size = 4
device = cuda
; push_to_hub = True
```

## Output

The finetuning process produces:

- Best model checkpoint in the project folder
- HuggingFace logs (readable with TensorBoard)
- Training metrics and evaluation results

### Viewing Training Progress

```bash
tensorboard --logdir examples/results/wavlm_finetuned/
```

## Example Files

- [`exp_emodb_finetune.ini`](https://github.com/felixbur/nkululeko/blob/main/examples/exp_emodb_finetune.ini): Finetune WavLM on emoDB

## Running the Experiment

```bash
python -m nkululeko.nkululeko --config examples/exp_emodb_finetune.ini
```

## Tips

1. **GPU Memory**: Reduce `batch_size` if you get out-of-memory errors
2. **Duration**: Long audio files are truncated to `duration` seconds
3. **Epochs**: Start with 5-10 epochs; use early stopping with dev set
4. **Model size**: Use `base` models for limited GPU; `large` for best performance
5. **Learning rate**: Default is usually fine; reduce if training is unstable

## Related Tutorials

- [Train/Dev/Test Splits](traindevtest.md): Proper evaluation with early stopping
- [Comparing Runs](compare_runs.md): Compare finetuned vs embedding approaches
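
As a side note on the regression loss named under Loss Functions: the concordance correlation coefficient (CCC) measures how closely predictions follow the targets in both correlation and scale/offset. The plain-Python sketch below follows the standard definition, `2 * cov / (var_true + var_pred + (mean_true - mean_pred)^2)`. Turning it into a loss as `1 - CCC` is a common convention and is only an assumption here; it is not confirmed to be how Nkululeko implements it. The function names `ccc` and `ccc_loss` are hypothetical.

```python
def ccc(y_true, y_pred):
    """Concordance correlation coefficient of two equal-length sequences."""
    n = len(y_true)
    if n != len(y_pred) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    # Population (biased) variances and covariance.
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n

    denom = var_t + var_p + (mean_t - mean_p) ** 2
    if denom == 0:
        # Both sequences are constant and identical; treating this as perfect
        # agreement is a convention chosen for this sketch.
        return 1.0
    return 2 * cov / denom


def ccc_loss(y_true, y_pred):
    """Assumed loss form: 0 for perfect agreement, up to 2 for perfect disagreement."""
    return 1.0 - ccc(y_true, y_pred)


print(ccc([1, 2, 3], [1, 2, 3]))  # identical predictions
print(ccc([1, 2, 3], [2, 3, 4]))  # perfectly correlated but shifted by 1
```

The second call shows why CCC is stricter than Pearson correlation: a constant offset leaves the correlation perfect but lowers the CCC.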