Linguistic Features with BERT

This tutorial shows how to use BERT embeddings to model the linguistic (semantic) content of speech, either alone or combined with acoustic features.

Reference: Nkululeko: How to explicitly model linguistics

Overview

Speech emotion recognition typically relies on acoustic features like pitch, energy, and spectral characteristics. However, what is being said (the linguistic content) can be just as important as how it is said.

Nkululeko supports BERT (Bidirectional Encoder Representations from Transformers) embeddings to capture the semantic meaning of transcribed speech. This is particularly useful when:

You have transcripts available in your dataset
The spoken content is relevant to your classification task
You want to combine linguistic and acoustic information

Requirements

Your dataset must have a text column containing transcripts. If your column has a different name (e.g., “Utterance”, “transcript”), use the colnames option to rename it.
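
Before running an experiment, it can help to confirm that the CSV really exposes a text column once the rename is applied. The sketch below uses only Python's standard library. The helper name rename_columns and the sample rows are hypothetical, and the code mirrors the idea behind colnames rather than Nkululeko's own implementation.

```python
import csv
import io


def rename_columns(rows, mapping):
    """Return copies of the rows with column names replaced according to `mapping`."""
    return [{mapping.get(key, key): value for key, value in row.items()} for row in rows]


# Hypothetical two-row sample standing in for a MELD-style CSV file.
sample_csv = io.StringIO(
    "file,Utterance,emotion\n"
    "a.wav,I can't believe it!,joy\n"
    "b.wav,Leave me alone.,anger\n"
)

rows = list(csv.DictReader(sample_csv))
renamed = rename_columns(rows, {"Utterance": "text"})

# Collect files whose transcript is missing or empty after the rename.
missing = [row["file"] for row in renamed if not row.get("text")]

print("columns:", list(renamed[0].keys()))
print("rows without text:", missing)
```

An empty list of rows without text suggests the transcripts are in place under the expected column name.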

Basic BERT Features

To use only BERT linguistic features:

[EXP]
root = ./
name = exp_meld_bert
; Set language for BERT model
language = en

[DATA]
databases = ['train', 'test']
train = ./data/meld/meld_train.csv
train.type = csv
train.split_strategy = train
test = ./data/meld/meld_test.csv
test.type = csv
test.split_strategy = test
; Rename column to 'text' if needed
colnames = {'Utterance': 'text'}
target = emotion
labels = ['anger', 'joy', 'neutral', 'sadness']

[FEATS]
type = ['bert']
scale = standard

[MODEL]
type = svm
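
As a quick sanity check, a configuration like the one above can be parsed with Python's standard configparser module before an experiment is started. This is a minimal sketch, assuming that list- and dict-valued options are written as Python literals (as they appear in the example). The inlined CONFIG_TEXT is an abbreviated copy of the configuration, and the checks are illustrative rather than part of Nkululeko.

```python
import ast
import configparser

# Abbreviated copy of the configuration above, inlined so the sketch is self-contained.
CONFIG_TEXT = """
[EXP]
root = ./
name = exp_meld_bert
language = en

[DATA]
databases = ['train', 'test']
colnames = {'Utterance': 'text'}
target = emotion
labels = ['anger', 'joy', 'neutral', 'sadness']

[FEATS]
type = ['bert']
scale = standard

[MODEL]
type = svm
"""

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

# Lists and dicts are stored as strings; literal_eval turns them into Python objects.
feature_types = ast.literal_eval(config["FEATS"]["type"])
labels = ast.literal_eval(config["DATA"]["labels"])
colnames = ast.literal_eval(config["DATA"]["colnames"])

if "bert" in feature_types and "text" not in colnames.values():
    print("note: no column is renamed to 'text'; the dataset must already provide one")

print("features:", feature_types)
print("labels:", labels)
print("model:", config["MODEL"]["type"])
```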

Combining BERT with Acoustic Features

To leverage both linguistic and acoustic information:

[FEATS]
; Combine BERT with OpenSMILE acoustic features
type = ['bert', 'os']
os.set = eGeMAPSv02
scale = standard

This creates a feature vector combining:

BERT embeddings (768 dimensions from bert-base-uncased)
OpenSMILE features (88 features from eGeMAPSv02)
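
The arithmetic behind the combined vector can be sketched in a few lines of plain Python. The function name combine_features and the placeholder values are hypothetical. Real vectors come from the feature extractors, and this sketch only illustrates the concatenation and the resulting size.

```python
BERT_DIM = 768     # embedding size of bert-base-uncased
EGEMAPS_DIM = 88   # number of eGeMAPSv02 features


def combine_features(bert_vector, acoustic_vector):
    """Concatenate one utterance's linguistic and acoustic feature vectors."""
    if len(bert_vector) != BERT_DIM:
        raise ValueError(f"expected {BERT_DIM} BERT values, got {len(bert_vector)}")
    if len(acoustic_vector) != EGEMAPS_DIM:
        raise ValueError(f"expected {EGEMAPS_DIM} acoustic values, got {len(acoustic_vector)}")
    return list(bert_vector) + list(acoustic_vector)


# Placeholder values standing in for real extractor output.
bert_vector = [0.0] * BERT_DIM
acoustic_vector = [1.0] * EGEMAPS_DIM

combined = combine_features(bert_vector, acoustic_vector)
print(len(combined))  # 768 + 88 = 856
```

Each utterance is therefore described by 856 values when both feature sets are used with bert-base-uncased.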

BERT Model Selection

By default, Nkululeko uses bert-base-uncased. You can specify a different model:

[FEATS]
type = ['bert']
; Use multilingual BERT
bert.model = bert-base-multilingual-cased

Common BERT models:

bert-base-uncased: English, 110M parameters (default)
bert-base-cased: English, case-sensitive
bert-base-multilingual-cased: 104 languages
bert-large-uncased: English, 340M parameters
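
The parameter counts give a rough idea of how much memory the weights alone occupy. The following back-of-the-envelope sketch assumes 32-bit floating point weights (4 bytes per parameter). Activations and framework overhead come on top, so the figures are best read as lower bounds. The function name weight_memory_mb is hypothetical.

```python
BYTES_PER_FLOAT32 = 4

# Parameter counts as listed above.
PARAMETER_COUNTS = {
    "bert-base-uncased": 110_000_000,
    "bert-large-uncased": 340_000_000,
}


def weight_memory_mb(num_parameters, bytes_per_parameter=BYTES_PER_FLOAT32):
    """Approximate size of the model weights in megabytes (10**6 bytes)."""
    return num_parameters * bytes_per_parameter / 1_000_000


for name, count in PARAMETER_COUNTS.items():
    print(f"{name}: about {weight_memory_mb(count):.0f} MB of weights in float32")
```

Under these assumptions the base model needs about 440 MB for its weights and the large model about 1360 MB.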

Language Setting

The language option in [EXP] helps select appropriate models:

[EXP]
; For German text
language = de

; For English text, use this line instead (set only one language per experiment)
; language = en

Using with Transcription

If you don’t have transcripts, you can first transcribe the audio with Whisper:

[DATA]
; First experiment: transcribe
[PREDICT]
targets = ['text']

Then use the generated text column for BERT features in a subsequent experiment.

Example Files

exp_meld_bert.ini: BERT-only features
exp_meld_bert_os.ini: BERT + OpenSMILE combined

Running the Experiment

python -m nkululeko.nkululeko --config examples/exp_meld_bert.ini

Tips

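Memory: BERT models can require significant GPU memory, particularly bert-large-uncased (340M parameters). Use device = cpu if needed.
Text quality: BERT features are only as good as the transcripts; transcription errors carry over into the embeddings.
Feature scaling: Always use scale = standard when combining different feature types, since BERT embeddings and OpenSMILE features can differ widely in value range.
Combining features: Multi-modal (linguistic + acoustic) features often outperform a single modality.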
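
To illustrate the scaling tip, the sketch below standardizes two feature columns with very different ranges. It assumes that standard scaling means a z-transformation (zero mean, unit variance per feature). The sample values and the function name standardize are hypothetical, and a real experiment would rely on scale = standard rather than on hand-written code.

```python
import statistics


def standardize(column):
    """Scale a list of values to zero mean and unit variance (z-scores)."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]


# Hypothetical values: one embedding dimension and one pitch-like feature in Hz.
embedding_dim = [-0.12, 0.05, 0.31, -0.24]
pitch_hz = [110.0, 220.0, 180.0, 140.0]

scaled_columns = {
    "embedding": standardize(embedding_dim),
    "pitch": standardize(pitch_hz),
}

for name, scaled in scaled_columns.items():
    print(name, [round(value, 2) for value in scaled])
```

After scaling, both columns lie on a comparable scale, so neither dominates a distance- or margin-based classifier such as an SVM purely because of its units.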