Linguistic Features with BERT
This tutorial shows how to use BERT embeddings to model the linguistic (semantic) content of speech, either alone or combined with acoustic features.
Reference: Nkululeko: How to explicitly model linguistics
Overview
Speech emotion recognition typically relies on acoustic features like pitch, energy, and spectral characteristics. However, what is being said (the linguistic content) can be just as important as how it is said.
Nkululeko supports BERT (Bidirectional Encoder Representations from Transformers) embeddings to capture the semantic meaning of transcribed speech. This is particularly useful when:
You have transcripts available in your dataset
The spoken content is relevant to your classification task
You want to combine linguistic and acoustic information
Requirements
Your dataset must have a text column containing transcripts. If your column has a different name (e.g., “Utterance”, “transcript”), use the colnames option to rename it.
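Before running an experiment, it can help to confirm that the renamed column really ends up as text and is not empty. The following is a minimal sketch using only the Python standard library. The sample rows and the helper name rename_columns are hypothetical, and this is not how Nkululeko applies colnames internally.

```python
import csv
import io


def rename_columns(csv_text, mapping):
    """Return the CSV rows as dicts, with header names replaced per mapping."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        rows.append({mapping.get(key, key): value for key, value in row.items()})
    return rows


# Hypothetical two-row sample; file names and utterances are made up.
sample = (
    "file,Utterance,emotion\n"
    "a.wav,I am so happy,joy\n"
    "b.wav,Leave me alone,anger\n"
)

# Same mapping as the colnames option: 'Utterance' becomes 'text'.
rows = rename_columns(sample, {"Utterance": "text"})

# True only if every row has a non-empty 'text' value.
has_text = all("text" in row and row["text"].strip() for row in rows)
print(has_text)
```

If has_text is False, some rows lack a transcript and would need to be filtered out or transcribed first.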
Basic BERT Features
To use only BERT linguistic features:
[EXP]
root = ./
name = exp_meld_bert
; Set language for BERT model
language = en
[DATA]
databases = ['train', 'test']
train = ./data/meld/meld_train.csv
train.type = csv
train.split_strategy = train
test = ./data/meld/meld_test.csv
test.type = csv
test.split_strategy = test
; Rename column to 'text' if needed
colnames = {'Utterance': 'text'}
target = emotion
labels = ['anger', 'joy', 'neutral', 'sadness']
[FEATS]
type = ['bert']
scale = standard
[MODEL]
type = svm
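As a quick sanity check, a configuration like the one above can be read with Python's standard configparser module. This sketch assumes that list and dictionary values follow Python literal syntax, as they appear to in the example, and parses them with ast.literal_eval. The helper names load_config and literal are hypothetical, and the config text is an abbreviated copy of the example.

```python
import ast
import configparser

# Abbreviated copy of the example configuration above.
CONFIG_TEXT = """
[EXP]
root = ./
name = exp_meld_bert
language = en

[DATA]
databases = ['train', 'test']
colnames = {'Utterance': 'text'}
target = emotion
labels = ['anger', 'joy', 'neutral', 'sadness']

[FEATS]
type = ['bert']
scale = standard

[MODEL]
type = svm
"""


def load_config(text):
    """Parse INI text; lines starting with ';' are treated as comments."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


def literal(parser, section, option):
    """Interpret an option value as a Python literal (list, dict, ...)."""
    return ast.literal_eval(parser[section][option])


config = load_config(CONFIG_TEXT)
feature_types = literal(config, "FEATS", "type")
colnames = literal(config, "DATA", "colnames")
labels = literal(config, "DATA", "labels")

# BERT features need a 'text' column, here provided by the rename.
uses_bert = "bert" in feature_types
text_column_available = "text" in colnames.values()
print(uses_bert, text_column_available, labels)
```

A check like this catches typos in section names or malformed list values before a long feature-extraction run starts.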
Combining BERT with Acoustic Features
To leverage both linguistic and acoustic information:
[FEATS]
; Combine BERT with OpenSMILE acoustic features
type = ['bert', 'os']
os.set = eGeMAPSv02
scale = standard
This creates a feature vector combining:
BERT embeddings (768 dimensions from bert-base-uncased)
OpenSMILE features (88 features from eGeMAPSv02)
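The dimension arithmetic can be illustrated with a small sketch: joining a 768-dimensional vector with an 88-dimensional one gives 856 values per sample. The placeholder values and the function combine_features are hypothetical, and the concatenation order shown here is an assumption, not a statement about Nkululeko's internal layout.

```python
BERT_DIM = 768  # from the text above: bert-base-uncased embedding size
OS_DIM = 88     # from the text above: eGeMAPSv02 feature count


def combine_features(bert_vector, os_vector):
    """Concatenate one linguistic and one acoustic vector for one sample."""
    if len(bert_vector) != BERT_DIM:
        raise ValueError(f"expected {BERT_DIM} BERT values, got {len(bert_vector)}")
    if len(os_vector) != OS_DIM:
        raise ValueError(f"expected {OS_DIM} OpenSMILE values, got {len(os_vector)}")
    return list(bert_vector) + list(os_vector)


# Placeholder values only; real features come from the extractors.
bert_vector = [0.1] * BERT_DIM
os_vector = [200.0] * OS_DIM

combined = combine_features(bert_vector, os_vector)
print(len(combined))  # 856
```

The placeholder values also hint at why scaling matters: embedding values and acoustic functionals can live on very different numeric ranges.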
BERT Model Selection
By default, Nkululeko uses bert-base-uncased. You can specify a different model:
[FEATS]
type = ['bert']
; Use multilingual BERT
bert.model = bert-base-multilingual-cased
Common BERT models:
bert-base-uncased: English, 110M parameters (default)
bert-base-cased: English, case-sensitive
bert-base-multilingual-cased: 104 languages
bert-large-uncased: English, 340M parameters
Language Setting
The language option in [EXP] helps select appropriate models:
[EXP]
; For German text
language = de
; For English text, use this line instead
; language = en
Using with Transcription
If you don’t have transcripts, you can first use Whisper to transcribe:
[DATA]
; First experiment: transcribe
[PREDICT]
targets = ['text']
Then use the generated text column for BERT features in a subsequent experiment.
Example Files
exp_meld_bert.ini: BERT-only features
exp_meld_bert_os.ini: BERT + OpenSMILE combined
Running the Experiment
python -m nkululeko.nkululeko --config examples/exp_meld_bert.ini
Tips
Memory: BERT models require significant GPU memory. Use device = cpu if needed.
Text quality: BERT performance depends on transcript quality.
Feature scaling: Always use scale = standard when combining different feature types.
Combining features: Multi-modal features (linguistic + acoustic) often outperform a single modality.
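To see why the scaling tip matters, here is a minimal sketch of standard (z-score) scaling in plain Python. It assumes that scale = standard means subtracting each feature's mean and dividing by its standard deviation, which is the usual meaning of the term. The toy values and the function standard_scale are hypothetical.

```python
import statistics


def standard_scale(rows):
    """Z-score each column: subtract its mean, divide by its population std."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (value - mean) / std if std > 0 else 0.0  # constant column -> 0
            for value, mean, std in zip(row, means, stds)
        ])
    return scaled


# Toy samples: column 0 mimics a small embedding value,
# column 1 mimics a much larger acoustic value such as pitch in Hz.
rows = [
    [0.1, 120.0],
    [0.3, 180.0],
    [0.2, 240.0],
]

scaled = standard_scale(rows)
for row in scaled:
    print([round(value, 3) for value in row])
```

After scaling, both columns have zero mean and unit variance, so a distance-based or margin-based model such as an SVM is not dominated by the feature with the larger raw range.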