Finetuning Transformer Models

This tutorial shows how to finetune pretrained transformer models (such as wav2vec2, WavLM, or HuBERT) for your own classification or regression task.

Reference: Nkululeko: How to finetune a transformer model

Overview

Since version 0.85.0, Nkululeko supports finetuning transformer models with HuggingFace.

Finetuning means training the entire pretrained transformer on your labeled data, rather than only extracting its last-layer output as fixed embeddings (which is what type = ['wav2vec2'] does in [FEATS]).

When to Finetune vs Use Embeddings

| Approach | When to Use |
| --- | --- |
| Embeddings ([FEATS] type = ['wav2vec2']) | Small datasets, quick experiments, limited GPU |
| Finetuning ([MODEL] type = finetune) | Large datasets, best performance, GPU available |

Basic Configuration

To finetune a transformer model:

[EXP]
root = ./examples/results/
name = wavlm_finetuned
epochs = 5

[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = speaker_split
target = emotion

[FEATS]
; Features should be empty for finetuning
type = []

[MODEL]
type = finetune

Key Points

  • [FEATS] type = [] - Must be empty because the transformer model has its own CNN layers for acoustic feature extraction

  • [MODEL] type = finetune - Triggers finetuning mode

  • Maximum audio duration: 8 seconds by default (set with duration); audio beyond that limit is ignored
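
The duration limit above can be sketched in a few lines of Python. This is a minimal illustration, not Nkululeko's actual code: the helper name truncate_to_duration and the 16 kHz sample rate are assumptions made for the example.

```python
def truncate_to_duration(samples, sample_rate, max_seconds=8.0):
    """Keep only the first `max_seconds` of audio; everything after is dropped."""
    max_samples = int(max_seconds * sample_rate)
    return samples[:max_samples]


# Ten seconds of silent placeholder audio at an assumed 16 kHz sample rate.
sample_rate = 16000
audio = [0.0] * (10 * sample_rate)

clipped = truncate_to_duration(audio, sample_rate)
print(len(clipped) / sample_rate)  # 8.0 seconds are kept
```

Files shorter than the limit pass through unchanged, so the limit only affects long recordings.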

Choosing a Pretrained Model

The default model is facebook/wav2vec2-large-robust-ft-swbd-300h.

Specify a different model:

[MODEL]
type = finetune
pretrained_model = microsoft/wavlm-base

Training Parameters

Set the training hyperparameters in the [MODEL] section:

[MODEL]
type = finetune
pretrained_model = microsoft/wavlm-base
learning_rate = 0.0001
batch_size = 16
device = cuda:0
duration = 10.5

Parameter Reference

| Parameter | Default | Description |
| --- | --- | --- |
| pretrained_model | facebook/wav2vec2-large-robust-ft-swbd-300h | HuggingFace model name |
| learning_rate | 0.00001 | Learning rate |
| batch_size | 8 | Batch size (reduce on out-of-memory errors) |
| device | cuda | Device, e.g. cuda, cuda:0, or cpu |
| duration | 8 | Maximum audio duration in seconds |
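
To make the interplay between explicit settings and defaults concrete, here is a hedged sketch that reads a [MODEL] section with Python's standard configparser and falls back to the defaults documented in this tutorial. The function resolve_model_params and the DEFAULTS dictionary are hypothetical names for this example; Nkululeko's own option handling may differ.

```python
import configparser

# Defaults as documented in this tutorial (kept as strings, like INI values).
DEFAULTS = {
    "pretrained_model": "facebook/wav2vec2-large-robust-ft-swbd-300h",
    "learning_rate": "0.00001",
    "batch_size": "8",
    "device": "cuda",
    "duration": "8",
}

INI_TEXT = """
[MODEL]
type = finetune
pretrained_model = microsoft/wavlm-base
batch_size = 16
duration = 10.5
"""


def resolve_model_params(ini_text):
    """Return the [MODEL] settings, using the documented default for any missing key."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    model = config["MODEL"]
    raw = {key: model.get(key, fallback=default) for key, default in DEFAULTS.items()}
    # Convert numeric options from their INI string form.
    return {
        "pretrained_model": raw["pretrained_model"],
        "learning_rate": float(raw["learning_rate"]),
        "batch_size": int(raw["batch_size"]),
        "device": raw["device"],
        "duration": float(raw["duration"]),
    }


params = resolve_model_params(INI_TEXT)
print(params)
```

Here learning_rate and device are not set in the INI text, so they take their documented defaults, while batch_size and duration use the values given.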

Loss Functions

The loss function is selected automatically from the task type:

  • Classification: Weighted cross-entropy

  • Regression: A loss based on the concordance correlation coefficient (CCC)
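
For readers unfamiliar with these two losses, the sketch below shows the underlying quantities in plain Python. It rests on assumptions: 1 - CCC is a common way to turn the coefficient into a loss, and inverse-frequency class weights are one common choice for weighted cross-entropy. The tutorial does not say which exact formulation Nkululeko uses, and all function names here are hypothetical.

```python
from collections import Counter


def ccc(predictions, targets):
    """Concordance correlation coefficient between two equal-length sequences."""
    n = len(targets)
    mean_p = sum(predictions) / n
    mean_t = sum(targets) / n
    var_p = sum((p - mean_p) ** 2 for p in predictions) / n
    var_t = sum((t - mean_t) ** 2 for t in targets) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predictions, targets)) / n
    denominator = var_p + var_t + (mean_p - mean_t) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2 * cov / denominator


def ccc_loss(predictions, targets):
    """Assumed loss form: 0 for perfect agreement, up to 2 for perfect disagreement."""
    return 1.0 - ccc(predictions, targets)


def inverse_frequency_weights(labels):
    """Assumed weighting scheme: rarer classes get proportionally larger weights."""
    counts = Counter(labels)
    n = len(labels)
    return {label: n / (len(counts) * count) for label, count in counts.items()}


print(ccc_loss([2.0, 3.0, 4.0], [1.0, 2.0, 3.0]))
print(inverse_frequency_weights(["anger", "anger", "anger", "neutral"]))
```

Unlike plain correlation, the CCC also penalizes a constant offset between predictions and targets, which is why the first example has a nonzero loss even though the two sequences are perfectly correlated.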

Publishing to HuggingFace

To publish your finetuned model to HuggingFace Hub:

[MODEL]
type = finetune
push_to_hub = True

Make sure you are logged in with the HuggingFace CLI first:

huggingface-cli login

Complete Example

[EXP]
root = ./examples/results/
name = wavlm_finetuned
runs = 1
epochs = 10
save = True

[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = speaker_split
target = emotion
labels = ['anger', 'happiness', 'neutral', 'sadness']

[FEATS]
type = []

[MODEL]
type = finetune
pretrained_model = microsoft/wavlm-base
batch_size = 4
device = cuda
; push_to_hub = True
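
Before starting a long training run, it can help to check that a configuration meets the two requirements described under Key Points. The following is a sketch using only the Python standard library. The function check_finetune_config is hypothetical and not part of Nkululeko, and treating a missing [FEATS] type as empty is an assumption.

```python
import ast
import configparser

EXAMPLE_INI = """
[EXP]
root = ./examples/results/
name = wavlm_finetuned
epochs = 10

[DATA]
databases = ['emodb']
target = emotion

[FEATS]
type = []

[MODEL]
type = finetune
pretrained_model = microsoft/wavlm-base
batch_size = 4
"""


def check_finetune_config(ini_text):
    """Return a list of problems for a finetuning setup (empty if none are found)."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    problems = []
    if config.get("MODEL", "type", fallback="") != "finetune":
        problems.append("[MODEL] type must be 'finetune'")
    # List-valued options are written as Python literals, e.g. [] or ['wav2vec2'].
    feature_types = ast.literal_eval(config.get("FEATS", "type", fallback="[]"))
    if feature_types:
        problems.append("[FEATS] type must be an empty list")
    return problems


print(check_finetune_config(EXAMPLE_INI))  # [] means no problems were found
```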

Output

The finetuning process produces:

  • Best model checkpoint in the project folder

  • HuggingFace logs (readable with TensorBoard)

  • Training metrics and evaluation results

Viewing Training Progress

tensorboard --logdir examples/results/wavlm_finetuned/

Example Files

Running the Experiment

python -m nkululeko.nkululeko --config examples/exp_emodb_finetune.ini

Tips

  1. GPU Memory: Reduce batch_size if you get out-of-memory errors

  2. Duration: Long audio files are truncated to duration seconds

  3. Epochs: Start with 5-10 epochs; use early stopping with a dev set

  4. Model size: Use base models when GPU memory is limited; use large models for best performance

  5. Learning rate: Default is usually fine; reduce if training is unstable