Predicting Speaker ID
This tutorial shows how to predict speaker identities in your audio data using pyannote speaker diarization.
Reference: Nkululeko: Predict Speaker ID
Overview
Since version 0.93.0, Nkululeko interfaces with pyannote for speaker diarization (as an alternative to silero).
There are two modules for speaker identification:
| Module | Scope | Use Case |
|---|---|---|
| SEGMENT | Per-file | Find speakers within each audio file (e.g., one long recording) |
| PREDICT | Whole database | Find speakers across all files in the database |
⚠️ Performance Note: Both methods are slow on CPU. Best run on GPU.
Requirements
- HuggingFace token: required for pyannote models. Get yours at huggingface.co/settings/tokens
- GPU recommended: CPU processing is very slow (no progress bar)
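Before starting a long run, it can help to confirm that a GPU is actually visible. The following is a minimal sketch, assuming PyTorch is installed (pyannote depends on it); `pick_device` is a hypothetical helper and not part of Nkululeko.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # assumption: PyTorch is installed alongside pyannote
    except ImportError:
        # Without PyTorch there is no GPU support to probe.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"device = {pick_device()}")
```

If this prints `device = cpu`, expect long runtimes for both methods described here.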
Method 1: SEGMENT Module
Use this when you want to find different speakers within each file (e.g., diarizing a long conversation).
```ini
[EXP]
root = ./examples/results/
name = exp_emodb_segment_speaker
[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = random
target = emotion
[SEGMENT]
method = pyannote
segment_target = _segmented
sample_selection = all
[MODEL]
hf_token = <your_huggingface_token>
; device can be cuda or gpu
device = cuda
[FEATS]
type = ['os']
scale = standard
```
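To avoid committing the token to version control, the same configuration can be generated from an environment variable. This is a sketch using Python's standard `configparser`; the variable name `HF_TOKEN` and the helper `build_segment_config` are assumptions for illustration, and only the keys from the example above are used.

```python
import configparser
import io
import os


def build_segment_config(hf_token: str, device: str = "cuda") -> configparser.ConfigParser:
    """Build the SEGMENT experiment config from the example, with the token injected."""
    # interpolation=None keeps any '%' characters in values literal.
    config = configparser.ConfigParser(interpolation=None)
    config["EXP"] = {"root": "./examples/results/", "name": "exp_emodb_segment_speaker"}
    config["DATA"] = {
        "databases": "['emodb']",
        "emodb": "./data/emodb/emodb",
        "emodb.split_strategy": "random",
        "target": "emotion",
    }
    config["SEGMENT"] = {
        "method": "pyannote",
        "segment_target": "_segmented",
        "sample_selection": "all",
    }
    config["MODEL"] = {"hf_token": hf_token, "device": device}
    config["FEATS"] = {"type": "['os']", "scale": "standard"}
    return config


if __name__ == "__main__":
    # HF_TOKEN is an assumed variable name; fall back to the placeholder.
    token = os.environ.get("HF_TOKEN", "<your_huggingface_token>")
    buffer = io.StringIO()
    build_segment_config(token).write(buffer)
    print(buffer.getvalue())
```

The printed text can be redirected into an `.ini` file and passed to the SEGMENT module.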
SEGMENT Options
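| Option | Description |
|---|---|
| `method` | Segmentation method: `pyannote` (speaker diarization) or `silero` (VAD only) |
| `segment_target` | Suffix for output files (e.g., `_segmented`) |
| `sample_selection` | Which samples to process (e.g., `all`) |
| `min_length` | Minimum segment length in seconds (optional) |
| `max_length` | Maximum segment length in seconds (optional) |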
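To make the optional length limits concrete, the sketch below applies a maximum and a minimum length to a list of `(start, end)` segments given in seconds. It illustrates the general idea only and is not Nkululeko's implementation; `apply_length_limits` is a hypothetical helper.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def apply_length_limits(
    segments: List[Segment],
    min_length: Optional[float] = None,
    max_length: Optional[float] = None,
) -> List[Segment]:
    """Split segments longer than max_length, then drop pieces shorter than min_length."""
    result: List[Segment] = []
    for start, end in segments:
        pieces: List[Segment] = []
        if max_length is not None and max_length > 0:
            cursor = start
            # Cut off max_length-sized pieces until the remainder fits.
            while end - cursor > max_length:
                pieces.append((cursor, cursor + max_length))
                cursor += max_length
            pieces.append((cursor, end))
        else:
            pieces.append((start, end))
        for piece_start, piece_end in pieces:
            if min_length is None or piece_end - piece_start >= min_length:
                result.append((piece_start, piece_end))
    return result
```

For example, `apply_length_limits([(0.0, 25.0), (30.0, 30.5)], min_length=1.0, max_length=10.0)` returns `[(0.0, 10.0), (10.0, 20.0), (20.0, 25.0)]`: the 25-second segment is cut into three pieces and the half-second segment is dropped.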
Output
The SEGMENT module produces:
- New segmented audio files with speaker labels
- A distribution plot of detected speakers in the `images/` folder
Method 2: PREDICT Module
Use this when you want to identify speakers across a list of audio files
(e.g., clustering utterances by speaker). The unified
nkululeko.predict module dispatches --model speaker to the
speaker autopredict target.
```bash
python -m nkululeko.predict \
  --list ./data/emodb/emodb_files.csv \
  --model speaker \
  --outfile ./emodb_speakers.csv
```
The output CSV preserves the original columns of emodb_files.csv and adds a
speaker_pred column. The same command works with --folder or --file.
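Once the predictions are written, a quick sanity check is to count how many utterances were assigned to each predicted speaker. The following is a minimal sketch using only the standard library; it relies on the `speaker_pred` column described above, while the helper name `count_speakers` is hypothetical.

```python
import csv
from collections import Counter


def count_speakers(csv_path: str, column: str = "speaker_pred") -> Counter:
    """Count how many rows were assigned to each predicted speaker."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return Counter(row[column] for row in reader)
```

Calling `count_speakers("./emodb_speakers.csv").most_common()` lists the predicted speakers from most to least frequent, which makes implausibly small or large clusters easy to spot.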
Available autopredict targets
--model accepts any of the autopredict targets, including:
- `speaker`: speaker identity
- `gender`: male/female
- `age`: age estimation
- `snr`: signal-to-noise ratio
- `valence`, `arousal`, `dominance`: emotional dimensions
- `pesq`, `mos`, `sdr`, `stoi`: speech quality metrics
- `emotion`: emotion classification
- `text`, `translation`, `textclassification`: text processing
See predict.md for the full list and per-target details.
Running the Experiments
With the SEGMENT module:
```bash
python -m nkululeko.segment --config examples/exp_emodb_segment_speaker.ini
```
With the PREDICT module:
```bash
python -m nkululeko.predict \
  --list ./data/emodb/emodb_files.csv \
  --model speaker \
  --outfile ./emodb_speakers.csv
```
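When scripting several runs, the command can be assembled in Python. The sketch below only builds the argument list from the flags shown in this tutorial and does not run anything; `build_predict_command` is a hypothetical helper, and it assumes that `--folder` and `--file` each take a path argument in the same way as `--list`.

```python
import sys
from typing import List


def build_predict_command(
    source_flag: str, source_path: str, model: str, outfile: str
) -> List[str]:
    """Assemble the nkululeko.predict command line without running it."""
    if source_flag not in ("--list", "--folder", "--file"):
        raise ValueError(f"unsupported input flag: {source_flag}")
    return [
        sys.executable, "-m", "nkululeko.predict",
        source_flag, source_path,
        "--model", model,
        "--outfile", outfile,
    ]


if __name__ == "__main__":
    command = build_predict_command(
        "--list", "./data/emodb/emodb_files.csv", "speaker", "./emodb_speakers.csv"
    )
    print(" ".join(command))
    # To execute it: subprocess.run(command, check=True)
```

Passing the list to `subprocess.run` rather than a shell string avoids quoting problems with paths that contain spaces.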
Example Files
- `exp_emodb_segment_speaker.ini`: SEGMENT-based speaker diarization
- `exp_androids_segment.ini`: Silero VAD segmentation
Tips
- Use GPU: Set `device = cuda` in the `[MODEL]` section for 10x+ speedup
- HuggingFace token: Required for pyannote; accept the model license on HuggingFace first
- Silero alternative: Use `method = silero` for faster VAD-only segmentation (no speaker ID)
- Long files: Use `max_length` to split very long segments