Predicting Speaker ID

This tutorial shows how to predict speaker identities in your audio data using pyannote speaker diarization.

Reference: Nkululeko: Predict Speaker ID

Overview

Since version 0.93.0, Nkululeko interfaces with pyannote for speaker diarization (as an alternative to silero).

There are two modules for speaker identification:

Module	Scope	Use Case
SEGMENT	Per-file	Find speakers within each audio file (e.g., one long recording)
PREDICT	Whole database	Find speakers across all files in the database

⚠️ Performance Note: Both methods are slow on CPU; run them on a GPU where possible.

Requirements

HuggingFace Token: Required for pyannote models
- Get yours at huggingface.co/settings/tokens
GPU recommended: CPU processing is very slow and shows no progress bar
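Both requirements can be checked before starting a long run. The following is a minimal sketch, not part of Nkululeko: `pick_device` and `read_token` are hypothetical helper names, and reading the token from an `HF_TOKEN` environment variable is an assumption made for illustration (the configuration shown in this tutorial passes the token through `hf_token` in the INI file).

```python
import os
from typing import Optional


def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # optional for this check; falls back to CPU if absent
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def read_token(env_var: str = "HF_TOKEN") -> Optional[str]:
    """Read a token from the environment (env_var is an assumed convention)."""
    token = os.environ.get(env_var, "").strip()
    return token or None


print("device:", pick_device())
print("token found:", read_token() is not None)
```

If the reported device is `cpu`, expect long runtimes for either method.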

Method 1: SEGMENT Module

Use this when you want to find different speakers within each file (e.g., diarizing a long conversation).

[EXP]
root = ./examples/results/
name = exp_emodb_segment_speaker

[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = random
target = emotion

[SEGMENT]
method = pyannote
segment_target = _segmented
sample_selection = all

[MODEL]
hf_token = <your_huggingface_token>
; device may also be set to gpu
device = cuda

[FEATS]
type = ['os']
scale = standard
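A configuration like this can be sanity-checked before launching a slow diarization run. The sketch below assumes the INI file is read with Python's standard `configparser` in its default settings, where a trailing `; comment` on a value line is kept as part of the value rather than stripped. `check_segment_config` is a hypothetical helper, not a Nkululeko function.

```python
import configparser
from typing import List

VALID_METHODS = {"pyannote", "silero"}


def check_segment_config(text: str) -> List[str]:
    """Return a list of human-readable problems found in the INI text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    problems = []

    method = parser.get("SEGMENT", "method", fallback="")
    if method not in VALID_METHODS:
        problems.append(f"unknown SEGMENT method: {method!r}")

    # pyannote needs a real token, not the <...> placeholder
    token = parser.get("MODEL", "hf_token", fallback="")
    if method == "pyannote" and (not token or token.startswith("<")):
        problems.append("hf_token is missing or still a placeholder")

    # default configparser does not strip inline comments from values
    device = parser.get("MODEL", "device", fallback="")
    if ";" in device or "#" in device:
        problems.append(f"device value contains a comment: {device!r}")

    return problems
```

Calling `check_segment_config(open(path).read())` on an experiment file returns an empty list when none of these three problems is present.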

SEGMENT Options

Option	Description
`method`	`pyannote` (speaker diarization) or `silero` (VAD only)
`segment_target`	Suffix for output files (e.g., `_segmented`)
`sample_selection`	Which samples to process: `all`, `train`, or `test`
`min_length`	Minimum segment length in seconds (optional)
`max_length`	Maximum segment length in seconds (optional)

Output

The SEGMENT module produces:

New segmented audio files with speaker labels
A distribution plot of detected speakers in the images/ folder

Method 2: PREDICT Module

Use this when you want to identify speakers across a list of audio files (e.g., clustering utterances by speaker). The unified nkululeko.predict module dispatches --model speaker to the speaker autopredict target.

python -m nkululeko.predict \
    --list ./data/emodb/emodb_files.csv \
    --model speaker \
    --outfile ./emodb_speakers.csv

The output CSV preserves the original columns of emodb_files.csv and adds a speaker_pred column. The same command works with --folder or --file.
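Once the output file exists, a quick way to inspect the result is to count how many files were assigned to each predicted speaker. This is a minimal sketch using only the standard library; it relies on the `speaker_pred` column described above, and `count_predicted_speakers` is a hypothetical helper name.

```python
import csv
from collections import Counter


def count_predicted_speakers(csv_path, column="speaker_pred"):
    """Count rows per predicted speaker label in the prediction CSV."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"column {column!r} not found in {csv_path}")
        return Counter(row[column] for row in reader)


def print_speaker_summary(csv_path):
    """Print labels from most to least frequent."""
    for label, n_files in count_predicted_speakers(csv_path).most_common():
        print(f"{label}: {n_files} file(s)")
```

For the command above, `print_speaker_summary("./emodb_speakers.csv")` lists one line per predicted speaker.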

Available autopredict targets

--model accepts any of the autopredict targets, including:

speaker — speaker identity
gender — male/female
age — age estimation
snr — signal-to-noise ratio
valence, arousal, dominance — emotional dimensions
pesq, mos, sdr, stoi — speech quality metrics
emotion — emotion classification
text, translation, textclassification — text processing

See predict.md for the full list and per-target details.
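Because every target uses the same command shape, several targets can be run over one file list by building one command per target. The sketch below uses only the `--list`, `--model`, and `--outfile` flags shown in this tutorial; the helper names and the output file naming scheme are illustrative assumptions, and each target is written to its own file.

```python
import subprocess
import sys
from pathlib import Path


def build_predict_commands(list_csv, targets, out_dir="."):
    """Build one nkululeko.predict command per autopredict target."""
    commands = []
    for target in targets:
        # hypothetical naming scheme: <list stem>_<target>.csv
        outfile = Path(out_dir) / f"{Path(list_csv).stem}_{target}.csv"
        commands.append([
            sys.executable, "-m", "nkululeko.predict",
            "--list", str(list_csv),
            "--model", target,
            "--outfile", str(outfile),
        ])
    return commands


def run_all(commands):
    """Run the commands one after another, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

For example, `run_all(build_predict_commands("./data/emodb/emodb_files.csv", ["speaker", "gender"]))` would run the speaker and gender targets in sequence.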

Running the Experiments

With the SEGMENT module:

python -m nkululeko.segment --config examples/exp_emodb_segment_speaker.ini

With the PREDICT module:

python -m nkululeko.predict \
    --list ./data/emodb/emodb_files.csv \
    --model speaker \
    --outfile ./emodb_speakers.csv

Example Files

exp_emodb_segment_speaker.ini: SEGMENT-based speaker diarization
exp_androids_segment.ini: Silero VAD segmentation

Tips

Use GPU: Set device = cuda in the [MODEL] section for a 10x or greater speedup
HuggingFace token: Required for pyannote; accept the model license on HuggingFace first
Silero alternative: Use method = silero for faster VAD-only segmentation (no speaker ID)
Long files: Set max_length (in seconds) to split very long segments
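To make the effect of a length cap concrete, here is a sketch of one way a long segment could be cut into pieces no longer than `max_length` seconds. It illustrates the idea only; it is an assumption about the general approach, not Nkululeko's actual splitting logic, which may choose boundaries differently.

```python
def split_segment(start, end, max_length):
    """Cut the interval [start, end] into pieces of at most max_length seconds."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    pieces = []
    cursor = start
    # emit full-length pieces while more than max_length remains
    while end - cursor > max_length:
        pieces.append((cursor, cursor + max_length))
        cursor += max_length
    # keep whatever is left as the final (possibly shorter) piece
    if end > cursor:
        pieces.append((cursor, end))
    return pieces


print(split_segment(0.0, 25.0, 10.0))
# prints [(0.0, 10.0), (10.0, 20.0), (20.0, 25.0)]
```

A segment already shorter than the cap comes back unchanged as a single piece.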