Finetuning Transformer Models
This tutorial shows how to finetune pretrained transformer models (like wav2vec2, WavLM, HuBERT) for your specific classification or regression task.
Reference: Nkululeko: How to finetune a transformer model
Overview
Since version 0.85.0, Nkululeko supports finetuning transformer models with HuggingFace.
Finetuning means training the entire pretrained transformer on your labeled data, as opposed to only using its last layer as embeddings (which is what type = ['wav2vec2'] in [FEATS] does).
When to Finetune vs Use Embeddings
| Approach | When to Use |
|---|---|
| Embeddings ([FEATS] type = ['wav2vec2']) | Small datasets, quick experiments, limited GPU |
| Finetuning ([MODEL] type = finetune) | Large datasets, best performance, GPU available |
Basic Configuration
To finetune a transformer model:
[EXP]
root = ./examples/results/
name = wavlm_finetuned
epochs = 5
[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = speaker_split
target = emotion
[FEATS]
; Features should be empty for finetuning
type = []
[MODEL]
type = finetune
Key Points
[FEATS] type = []: Must be empty because the transformer model has its own CNN layers for acoustic feature extraction
[MODEL] type = finetune: Triggers finetuning mode
Maximum audio duration: 8 seconds by default (rest is ignored)
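The two configuration requirements above can be checked before starting a long training run. The following is a minimal sketch, assuming the file is standard INI that Python's configparser can read; the helper name check_finetune_config is hypothetical and not part of Nkululeko.

```python
import configparser

# The basic finetuning configuration from this tutorial.
CONFIG_TEXT = """
[EXP]
root = ./examples/results/
name = wavlm_finetuned
epochs = 5
[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = speaker_split
target = emotion
[FEATS]
; Features should be empty for finetuning
type = []
[MODEL]
type = finetune
"""


def check_finetune_config(text):
    """Hypothetical helper: list problems in a finetuning config (empty if none)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    problems = []
    # [MODEL] type = finetune triggers finetuning mode.
    if parser.get("MODEL", "type", fallback="") != "finetune":
        problems.append("[MODEL] type must be 'finetune'")
    # [FEATS] type must be an empty list when finetuning.
    if parser.get("FEATS", "type", fallback="[]").replace(" ", "") != "[]":
        problems.append("[FEATS] type must be empty ([])")
    return problems


print(check_finetune_config(CONFIG_TEXT))  # prints []
```

A config that still lists type = ['wav2vec2'] under [FEATS] would return one problem string instead of an empty list.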
Choosing a Pretrained Model
The default model is facebook/wav2vec2-large-robust-ft-swbd-300h.
Specify a different model:
[MODEL]
type = finetune
pretrained_model = microsoft/wavlm-base
Popular Pretrained Models
| Model | Description |
|---|---|
| facebook/wav2vec2-large-robust-ft-swbd-300h | Default, robust to noise |
|  | Good for speech tasks |
|  | Larger, better performance |
|  | HuBERT base model |
|  | Smaller, faster |
Training Parameters
Configure deep learning hyperparameters:
[MODEL]
type = finetune
pretrained_model = microsoft/wavlm-base
learning_rate = 0.0001
batch_size = 16
device = cuda:0
duration = 10.5
Parameter Reference
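| Parameter | Default | Description |
|---|---|---|
| pretrained_model | facebook/wav2vec2-large-robust-ft-swbd-300h | HuggingFace model name |
| learning_rate | 0.00001 | Learning rate |
| batch_size | 8 | Batch size (reduce if OOM) |
| device | cuda | Device, e.g. cuda or cuda:0 |
| duration | 8 | Max audio duration in seconds |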
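To make the effect of the duration parameter concrete, here is a minimal sketch of truncating a waveform to a maximum length. It assumes 16 kHz mono audio (the sampling rate these pretrained models expect) and that the beginning of the file is kept; the function name truncate_to_duration is hypothetical and this is not Nkululeko's actual implementation.

```python
SAMPLE_RATE = 16_000  # assumption: 16 kHz mono audio


def truncate_to_duration(samples, duration_s=8.0, sample_rate=SAMPLE_RATE):
    """Keep at most duration_s seconds from the start of the signal."""
    max_samples = int(duration_s * sample_rate)
    return samples[:max_samples]


# A silent 12-second clip as a stand-in for real audio.
clip = [0.0] * (12 * SAMPLE_RATE)

kept = truncate_to_duration(clip)  # default of 8 seconds
print(len(kept) / SAMPLE_RATE)  # prints 8.0

longer = truncate_to_duration(clip, duration_s=10.5)
print(len(longer) / SAMPLE_RATE)  # prints 10.5
```

Clips shorter than the limit pass through unchanged, so the setting only matters for recordings longer than duration seconds.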
Loss Functions
Loss functions are automatically selected:
Classification: Weighted cross-entropy
Regression: Concordance correlation coefficient (CCC)
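For intuition, both loss types can be written in a few lines of plain Python. This is only a sketch under stated assumptions: the CCC loss is taken as 1 - CCC, and the weighted cross-entropy is normalized by the sum of the weights (as in PyTorch's default reduction). How Nkululeko derives the class weights is not specified here, so they are passed in explicitly.

```python
import math


def ccc(pred, gold):
    """Concordance correlation coefficient of two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    # Undefined (division by zero) if both sequences are constant and equal.
    return 2 * cov / (var_p + var_g + (mean_p - mean_g) ** 2)


def ccc_loss(pred, gold):
    """Assumed loss form: 0 for perfect agreement, up to 2 for perfect disagreement."""
    return 1.0 - ccc(pred, gold)


def weighted_cross_entropy(logits, labels, class_weights):
    """Cross-entropy where each sample is weighted by the weight of its true class."""
    total, weight_sum = 0.0, 0.0
    for row, label in zip(logits, labels):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += class_weights[label] * (log_norm - row[label])
        weight_sum += class_weights[label]
    return total / weight_sum
```

Weighting by class is what keeps a rare emotion category from being ignored in favor of frequent ones, and CCC penalizes both low correlation and a shift in mean or scale between predictions and labels.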
Publishing to HuggingFace
To publish your finetuned model to HuggingFace Hub:
[MODEL]
type = finetune
push_to_hub = True
Make sure you’re logged in with the HuggingFace CLI first:
huggingface-cli login
Complete Example
[EXP]
root = ./examples/results/
name = wavlm_finetuned
runs = 1
epochs = 10
save = True
[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = speaker_split
target = emotion
labels = ['anger', 'happiness', 'neutral', 'sadness']
[FEATS]
type = []
[MODEL]
type = finetune
pretrained_model = microsoft/wavlm-base
batch_size = 4
device = cuda
; push_to_hub = True
Output
The finetuning process produces:
Best model checkpoint in the project folder
HuggingFace logs (readable with TensorBoard)
Training metrics and evaluation results
Viewing Training Progress
tensorboard --logdir examples/results/wavlm_finetuned/
Example Files
exp_emodb_finetune.ini: Finetune WavLM on emoDB
Running the Experiment
python -m nkululeko.nkululeko --config examples/exp_emodb_finetune.ini
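To compare a few batch sizes or learning rates, the config can be varied programmatically and each variant run with the same command. The following is a sketch, assuming the config is standard INI readable by configparser; the helper write_variants and the generated file and experiment names are hypothetical.

```python
import configparser
import itertools
import tempfile
from pathlib import Path

# Shortened version of the complete example config from this tutorial.
BASE_CONFIG = """
[EXP]
root = ./examples/results/
name = wavlm_finetuned
epochs = 10
[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = speaker_split
target = emotion
[FEATS]
type = []
[MODEL]
type = finetune
pretrained_model = microsoft/wavlm-base
batch_size = 4
device = cuda
"""


def write_variants(base_text, out_dir, batch_sizes, learning_rates):
    """Hypothetical helper: write one config per combination, return run commands."""
    out_dir = Path(out_dir)
    commands = []
    for bs, lr in itertools.product(batch_sizes, learning_rates):
        parser = configparser.ConfigParser()
        parser.read_string(base_text)
        parser["MODEL"]["batch_size"] = str(bs)
        parser["MODEL"]["learning_rate"] = str(lr)
        # Give each run its own experiment name so results do not overwrite each other.
        parser["EXP"]["name"] = f"wavlm_bs{bs}_lr{lr}"
        path = out_dir / f"exp_bs{bs}_lr{lr}.ini"
        with open(path, "w") as fh:
            parser.write(fh)
        commands.append(["python", "-m", "nkululeko.nkululeko", "--config", str(path)])
    return commands


with tempfile.TemporaryDirectory() as tmp:
    for cmd in write_variants(BASE_CONFIG, tmp, [4, 8], ["0.00001", "0.0001"]):
        print(" ".join(cmd))
        # To launch for real: subprocess.run(cmd, check=True)
```

The temporary directory is only used here to keep the sketch side-effect free; for actual runs, write the variants to a persistent folder so the configs stay next to their results.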
Tips
GPU Memory: Reduce batch_size if you get out-of-memory errors
Duration: Long audio files are truncated to duration seconds
Epochs: Start with 5-10 epochs; use early stopping with dev set
Model size: Use base models for limited GPU; large for best performance
Learning rate: Default is usually fine; reduce if training is unstable