Predicting Speaker ID

This tutorial shows how to predict speaker identities in your audio data using pyannote speaker diarization.

Reference: Nkululeko: Predict Speaker ID

Overview

Since version 0.93.0, Nkululeko interfaces with pyannote for speaker diarization (as an alternative to silero).

There are two modules for speaker identification:

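| Module  | Scope          | Use Case                                                        |
|---------|----------------|-----------------------------------------------------------------|
| SEGMENT | Per-file       | Find speakers within each audio file (e.g., one long recording) |
| PREDICT | Whole database | Find speakers across all files in the database                  |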

⚠️ Performance Note: Both methods are slow on CPU; run them on a GPU where possible.

Requirements

  • HuggingFace Token: Required for pyannote models

  • GPU recommended: CPU processing is very slow, and no progress bar is shown

Method 1: SEGMENT Module

Use this when you want to find different speakers within each file (e.g., diarizing a long conversation).

[EXP]
root = ./examples/results/
name = exp_emodb_segment_speaker

[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = random
target = emotion

[SEGMENT]
method = pyannote
segment_target = _segmented
sample_selection = all

[MODEL]
hf_token = <your_huggingface_token>
; alternatively: device = gpu
device = cuda

[FEATS]
type = ['os']
scale = standard
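
Before launching a long run, it can help to sanity-check the configuration. The following is a minimal sketch and not part of Nkululeko: the helper name check_segment_config and the rules it applies are assumptions derived from this page (the documented method and sample_selection values, and the token requirement for pyannote). It also flags inline comments, because Python's configparser keeps them as part of the value under its default settings; whether Nkululeko parses its configuration with those defaults is an assumption.

```python
import configparser

# Excerpt of the configuration above, embedded so the sketch is self-contained.
CONFIG_TEXT = """
[SEGMENT]
method = pyannote
segment_target = _segmented
sample_selection = all

[MODEL]
hf_token = <your_huggingface_token>
device = cuda
"""

# Allowed values as listed on this page.
VALID_METHODS = {"pyannote", "silero"}
VALID_SELECTIONS = {"all", "train", "test"}


def check_segment_config(text):
    """Return a list of problems found in the config text (hypothetical helper)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    problems = []

    method = parser.get("SEGMENT", "method", fallback="")
    if method not in VALID_METHODS:
        problems.append("[SEGMENT] method should be pyannote or silero")

    # Only validate sample_selection when it is actually set.
    selection = parser.get("SEGMENT", "sample_selection", fallback="")
    if selection and selection not in VALID_SELECTIONS:
        problems.append("[SEGMENT] sample_selection should be all, train, or test")

    # The token is only needed for pyannote models.
    token = parser.get("MODEL", "hf_token", fallback="")
    if method == "pyannote" and (not token or token.startswith("<")):
        problems.append("[MODEL] hf_token is missing or still a placeholder")

    # With configparser defaults, "cuda  ; note" is read as one value.
    device = parser.get("MODEL", "device", fallback="")
    if ";" in device or "#" in device:
        problems.append("[MODEL] device contains an inline comment")

    return problems


for problem in check_segment_config(CONFIG_TEXT):
    print(problem)
```

Run against the excerpt as written, the check reports the placeholder token; once a real token is filled in, the list of problems is empty.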

SEGMENT Options

| Option           | Description                                         |
|------------------|-----------------------------------------------------|
| method           | pyannote (speaker diarization) or silero (VAD only) |
| segment_target   | Suffix for output files (e.g., _segmented)          |
| sample_selection | Which samples to process: all, train, or test       |
| min_length       | Minimum segment length in seconds (optional)        |
| max_length       | Maximum segment length in seconds (optional)        |
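
To make the effect of the two length options concrete, here is a small sketch of one plausible reading. It is not taken from Nkululeko's code: the function constrain_segments is hypothetical, splitting long segments into equal parts is an assumption, and so is dropping segments shorter than min_length (the tool may handle short segments differently). Segments are (start, end) pairs in seconds.

```python
import math


def constrain_segments(segments, min_length=None, max_length=None):
    """Split segments longer than max_length and drop those shorter than min_length.

    Hypothetical illustration only; the real module may behave differently.
    """
    result = []
    for start, end in segments:
        duration = end - start
        pieces = [(start, end)]
        if max_length is not None and duration > max_length:
            # Split into the smallest number of equal parts that all fit max_length.
            n_parts = math.ceil(duration / max_length)
            step = duration / n_parts
            pieces = [
                (start + i * step, start + (i + 1) * step) for i in range(n_parts)
            ]
        for piece_start, piece_end in pieces:
            if min_length is None or (piece_end - piece_start) >= min_length:
                result.append((piece_start, piece_end))
    return result


# Made-up segments: one very short, one within limits, one too long.
example = [(0.0, 0.25), (1.0, 4.0), (5.0, 12.5)]
print(constrain_segments(example, min_length=0.5, max_length=3.0))
```

In this example the 0.25 s segment is dropped, the 3 s segment is kept unchanged, and the 7.5 s segment becomes three parts of 2.5 s each.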

Output

The SEGMENT module produces:

  • New segmented audio files with speaker labels

  • A distribution plot of detected speakers in the images/ folder

Method 2: PREDICT Module

Use this when you want to identify speakers across a list of audio files (e.g., clustering utterances by speaker). The unified nkululeko.predict module dispatches --model speaker to the speaker autopredict target.

python -m nkululeko.predict \
    --list ./data/emodb/emodb_files.csv \
    --model speaker \
    --outfile ./emodb_speakers.csv

The output CSV preserves the original columns of emodb_files.csv and adds a speaker_pred column. The same command works with --folder or --file.
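
Once the output file exists, a quick way to inspect the result is to count how many files were assigned to each predicted speaker. The sketch below uses only the Python standard library. The speaker_pred column name comes from the description above; the file column, the file names, and the speaker labels in the sample data are made up for illustration and may not match what the tool writes.

```python
import csv
import io
from collections import Counter

# Made-up stand-in for ./emodb_speakers.csv so the sketch runs on its own.
SAMPLE_CSV = """file,speaker_pred
wav/example_01.wav,speaker_a
wav/example_02.wav,speaker_b
wav/example_03.wav,speaker_a
wav/example_04.wav,speaker_a
"""


def count_speakers(csv_file):
    """Count rows per value of the speaker_pred column in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row["speaker_pred"] for row in reader)


counts = count_speakers(io.StringIO(SAMPLE_CSV))
for speaker, n_files in counts.most_common():
    print(f"{speaker}: {n_files} file(s)")

# For the real output, open the file instead of the in-memory sample:
#   with open("./emodb_speakers.csv", newline="") as f:
#       counts = count_speakers(f)
```

A heavily skewed count, or far more speakers than expected, is a hint that the predictions deserve a closer look before they are used downstream.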

Available autopredict targets

--model accepts any of the autopredict targets, including:

  • speaker — speaker identity

  • gender — male/female

  • age — age estimation

  • snr — signal-to-noise ratio

  • valence, arousal, dominance — emotional dimensions

  • pesq, mos, sdr, stoi — speech quality metrics

  • emotion — emotion classification

  • text, translation, textclassification — text processing

See predict.md for the full list and per-target details.

Running the Experiments

With the SEGMENT module:

python -m nkululeko.segment --config examples/exp_emodb_segment_speaker.ini

With the PREDICT module:

python -m nkululeko.predict \
    --list ./data/emodb/emodb_files.csv \
    --model speaker \
    --outfile ./emodb_speakers.csv

Tips

  1. Use GPU: Set device = cuda in the [MODEL] section for a speedup of 10x or more

  2. HuggingFace token: Required for pyannote; accept the model license on HuggingFace first

  3. Silero alternative: Use method = silero for faster VAD-only segmentation (no speaker ID)

  4. Long files: Use max_length to split very long segments