Linguistic Features with BERT

This tutorial shows how to use BERT embeddings to model the linguistic (semantic) content of speech, either alone or combined with acoustic features.

Reference: Nkululeko: How to explicitly model linguistics

Overview

Speech emotion recognition typically relies on acoustic features like pitch, energy, and spectral characteristics. However, what is being said (the linguistic content) can be just as important as how it is said.

Nkululeko supports BERT (Bidirectional Encoder Representations from Transformers) embeddings to capture the semantic meaning of transcribed speech. This is particularly useful when:

  • You have transcripts available in your dataset

  • The spoken content is relevant to your classification task

  • You want to combine linguistic and acoustic information

Requirements

Your dataset must have a text column containing transcripts. If your column has a different name (e.g., “Utterance”, “transcript”), use the colnames option to rename it.
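To make the effect of the renaming concrete, here is a minimal sketch using only Python's standard library. It illustrates what a mapping such as {'Utterance': 'text'} does to a CSV header; it is not Nkululeko's implementation. The helper name rename_columns and the sample row are hypothetical.

```python
import csv
import io


def rename_columns(csv_text, colnames):
    """Return CSV text whose header fields are renamed according to colnames.

    Hypothetical helper for illustration; Nkululeko performs the renaming
    itself when the colnames option is set.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    # Replace only the header names listed in the mapping; keep all others.
    header = [colnames.get(name, name) for name in rows[0]]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows[1:])
    return out.getvalue()


# Made-up sample data with a transcript column called 'Utterance'.
sample = "file,Utterance,emotion\na.wav,I am so happy today,joy\n"
renamed = rename_columns(sample, {"Utterance": "text"})
print(renamed)
```

After the renaming, the transcript column is called text, which is the name the BERT feature extractor expects.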

Basic BERT Features

To use only BERT linguistic features:

[EXP]
root = ./
name = exp_meld_bert
; Set language for BERT model
language = en

[DATA]
databases = ['train', 'test']
train = ./data/meld/meld_train.csv
train.type = csv
train.split_strategy = train
test = ./data/meld/meld_test.csv
test.type = csv
test.split_strategy = test
; Rename column to 'text' if needed
colnames = {'Utterance': 'text'}
target = emotion
labels = ['anger', 'joy', 'neutral', 'sadness']

[FEATS]
type = ['bert']
scale = standard

[MODEL]
type = svm
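
Before launching a run, it can help to confirm that the configuration parses and that the list- and dict-valued options are well formed. The following sketch uses only Python's standard library on an abbreviated copy of the configuration above. It assumes the values are written as Python literals, as they appear here, and it is not Nkululeko's own loading code.

```python
import ast
import configparser

# Abbreviated copy of the configuration above (database paths omitted).
CONFIG = """
[EXP]
root = ./
name = exp_meld_bert
language = en

[DATA]
databases = ['train', 'test']
colnames = {'Utterance': 'text'}
target = emotion
labels = ['anger', 'joy', 'neutral', 'sadness']

[FEATS]
type = ['bert']
scale = standard

[MODEL]
type = svm
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG)

# Options written as Python literals can be decoded safely with literal_eval.
feature_types = ast.literal_eval(parser["FEATS"]["type"])
colnames = ast.literal_eval(parser["DATA"]["colnames"])
labels = ast.literal_eval(parser["DATA"]["labels"])

# Sanity checks: BERT features are requested and a 'text' column will exist.
uses_bert = "bert" in feature_types
has_text_column = "text" in colnames.values()
print(feature_types, colnames, labels, uses_bert, has_text_column)
```

If literal_eval raises an error, the offending option is usually missing a quote or a bracket.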

Combining BERT with Acoustic Features

To leverage both linguistic and acoustic information:

[FEATS]
; Combine BERT with OpenSMILE acoustic features
type = ['bert', 'os']
os.set = eGeMAPSv02
scale = standard

This creates a feature vector combining:

  • BERT embeddings (768 dimensions from bert-base-uncased)

  • OpenSMILE features (88 features from eGeMAPSv02)
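
The resulting dimensionality can be checked with a small sketch. It assumes the two feature sets are simply concatenated per utterance, with the BERT part first; the order is an assumption made for illustration, and the vectors below are placeholders rather than real features.

```python
BERT_DIM = 768  # embedding size of bert-base-uncased
OS_DIM = 88     # number of eGeMAPSv02 functionals


def combine(bert_vec, os_vec):
    """Concatenate one utterance's BERT and openSMILE vectors (hypothetical helper)."""
    if len(bert_vec) != BERT_DIM or len(os_vec) != OS_DIM:
        raise ValueError("unexpected feature dimensionality")
    return list(bert_vec) + list(os_vec)


# Placeholder vectors standing in for real features.
bert_vec = [0.0] * BERT_DIM
os_vec = [1.0] * OS_DIM

combined = combine(bert_vec, os_vec)
print(len(combined))  # 856
```

Each utterance is therefore represented by 768 + 88 = 856 values.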

BERT Model Selection

By default, Nkululeko uses bert-base-uncased. You can specify a different model:

[FEATS]
type = ['bert']
; Use multilingual BERT
bert.model = bert-base-multilingual-cased

Common BERT models:

  • bert-base-uncased: English, 110M parameters (default)

  • bert-base-cased: English, case-sensitive

  • bert-base-multilingual-cased: 104 languages

  • bert-large-uncased: English, 340M parameters

Language Setting

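The language option in [EXP] helps select appropriate models. Set it once per experiment, choosing the value that matches your text:

[EXP]
; For German text
language = de

; For English text (use this line instead of the one above)
; language = en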
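
An INI section should contain the language key only once. As a sketch of why, Python's standard configparser rejects a repeated key in its default strict mode; whether Nkululeko reads its configuration in exactly this way is an assumption here.

```python
import configparser

# The same key given twice in one section.
duplicated = "[EXP]\nlanguage = de\nlanguage = en\n"

parser = configparser.ConfigParser()  # strict mode is the default
try:
    parser.read_string(duplicated)
    error = None
except configparser.DuplicateOptionError as exc:
    error = exc

print(type(error).__name__)

# A section with a single value parses without problems.
single = configparser.ConfigParser()
single.read_string("[EXP]\nlanguage = de\n")
language = single["EXP"]["language"]
print(language)
```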

Using with Transcription

If you don’t have transcripts, you can first use Whisper to transcribe:

; First experiment: transcribe the audio into a 'text' column
[DATA]
; (your usual database settings)
[PREDICT]
targets = ['text']

Then use the generated text column for BERT features in a subsequent experiment.

Running the Experiment

python -m nkululeko.nkululeko --config examples/exp_meld_bert.ini

Tips

  1. Memory: BERT models can require significant GPU memory. If you run out of GPU memory, set device = cpu (slower, but less constrained).

  2. Text quality: BERT performance depends on transcript quality.

  3. Feature scaling: Always use scale = standard when combining different feature types.

  4. Combining features: Multimodal features (linguistic + acoustic) often outperform a single modality.
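
To illustrate the scaling tip, the sketch below standardizes each feature column to zero mean and unit variance using only the standard library. It assumes that scale = standard corresponds to this kind of per-column z-scoring; the helper name and the two-column toy data are made up for illustration.

```python
import statistics


def standard_scale(rows):
    """Z-score every column of a list of feature rows (hypothetical helper)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(value - mean) / std for value, mean, std in zip(row, means, stds)]
        for row in rows
    ]


# Toy data: column 0 mimics a small-valued embedding dimension,
# column 1 a large-valued acoustic feature such as pitch in Hz.
rows = [[0.01, 120.0], [0.03, 180.0], [0.02, 150.0]]
scaled = standard_scale(rows)
for row in scaled:
    print(row)
```

Before scaling, the second column is several orders of magnitude larger than the first and would dominate distance-based models such as an SVM. After scaling, both columns cover the same range.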