Finetuning Transformer Models

This tutorial shows how to finetune pretrained transformer models (like wav2vec2, WavLM, HuBERT) for your specific classification or regression task.

Reference: Nkululeko: How to finetune a transformer model

Overview

Since version 0.85.0, Nkululeko supports finetuning transformer models with HuggingFace.

Finetuning means training the entire pretrained transformer with your data labels, as opposed to only using the last layer as embeddings (which is what type = ['wav2vec2'] does in [FEATS]).

When to Finetune vs Use Embeddings

Approach	When to Use
Embeddings (`[FEATS] type = ['wav2vec2']`)	Small datasets, quick experiments, limited GPU
Finetuning (`[MODEL] type = finetune`)	Large datasets, best performance, GPU available

Basic Configuration

To finetune a transformer model:

[EXP]
root = ./examples/results/
name = wavlm_finetuned
epochs = 5

[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = speaker_split
target = emotion

[FEATS]
; Features should be empty for finetuning
type = []

[MODEL]
type = finetune

Key Points

[FEATS] type = [] - Must be empty because the transformer model has its own CNN layers for acoustic feature extraction
[MODEL] type = finetune - Triggers finetuning mode
Maximum audio duration: 8 seconds by default (rest is ignored)

Choosing a Pretrained Model

The default model is facebook/wav2vec2-large-robust-ft-swbd-300h.

Specify a different model:

[MODEL]
type = finetune
pretrained_model = microsoft/wavlm-base

Popular Pretrained Models

Model	Description
`facebook/wav2vec2-large-robust-ft-swbd-300h`	Default, robust to noise
`microsoft/wavlm-base`	Good for speech tasks
`microsoft/wavlm-large`	Larger, better performance
`facebook/hubert-base-ls960`	HuBERT base model
`facebook/wav2vec2-base-960h`	Smaller, faster

Training Parameters

Configure deep learning hyperparameters:

[MODEL]
type = finetune
pretrained_model = microsoft/wavlm-base
learning_rate = 0.0001
batch_size = 16
device = cuda:0
duration = 10.5

Parameter Reference

Parameter	Default	Description
`pretrained_model`	wav2vec2-large-robust	HuggingFace model name
`learning_rate`	0.00001	Learning rate
`batch_size`	8	Batch size (reduce if OOM)
`device`	cuda	Device: `cuda`, `cuda:0`, `cpu`
`duration`	8	Max audio duration in seconds

Loss Functions

Loss functions are automatically selected:

Classification: Weighted cross-entropy
Regression: Concordance correlation coefficient (CCC)

Publishing to HuggingFace

To publish your finetuned model to HuggingFace Hub:

[MODEL]
type = finetune
push_to_hub = True

Make sure you’re logged in to HuggingFace CLI first:

huggingface-cli login

Complete Example

[EXP]
root = ./examples/results/
name = wavlm_finetuned
runs = 1
epochs = 10
save = True

[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = speaker_split
target = emotion
labels = ['anger', 'happiness', 'neutral', 'sadness']

[FEATS]
type = []

[MODEL]
type = finetune
pretrained_model = microsoft/wavlm-base
batch_size = 4
device = cuda
; push_to_hub = True

Output

The finetuning process produces:

Best model checkpoint in the project folder
HuggingFace logs (readable with TensorBoard)
Training metrics and evaluation results

Viewing Training Progress

tensorboard --logdir examples/results/wavlm_finetuned/

Example Files

exp_emodb_finetune.ini: Finetune WavLM on emoDB

Running the Experiment

python -m nkululeko.nkululeko --config examples/exp_emodb_finetune.ini

Tips

GPU Memory: Reduce batch_size if you get out-of-memory errors
Duration: Long audio files are truncated to duration seconds
Epochs: Start with 5-10 epochs; use early stopping with dev set
Model size: Use base models for limited GPU; large for best performance
Learning rate: Default is usually fine; reduce if training is unstable