Text Processing: Transcribe, Translate, and Classify
This tutorial demonstrates how to use Nkululeko’s text processing pipeline to:
Transcribe audio to text using Whisper speech-to-text
Translate text between languages using Google Translate
Classify text topics using zero-shot classification
This is useful when you want to analyze the linguistic content of speech databases, especially for cross-lingual analysis.
Overview
The pipeline consists of three steps, each invoking the unified
nkululeko.predict module with a different autopredict target:
Audio → [--model text] → Text
Text → [--model translation] → English Text
Text → [--model textclassification] → Topic Labels
The output CSV of one step is fed into the --list argument of the next.
Prerequisites
Nkululeko >= 1.6.0
A speech database (we use Berlin EmoDB as an example)
Required dependencies:
openai-whisper, googletrans
Step 1: Transcribe Audio to Text
Use Whisper (via transformers) to transcribe audio to text.
Configuration (exp_emodb_predict_text.ini)
The config carries only the source-language setting; the input list, output file, and model are chosen on the command line.
[EXP]
root = ./examples/results
name = exp_emodb_predict_text
language = de
Run transcription
python -m nkululeko.predict \
--list ./data/emodb/emodb_files.csv \
--model text \
--config examples/exp_emodb_predict_text.ini \
--outfile ./emodb_transcribed.csv
Output
The output CSV preserves the original columns and adds a text column:
file,start,end,emotion,text
./data/emodb/wav/03a01Fa.wav,0 days,,happiness,Der Lappen liegt auf dem Eisschrank.
./data/emodb/wav/03a01Nc.wav,0 days,,neutral,Der Lappen liegt auf dem Eisschrank.
Step 2: Translate Text to English
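Translate the German transcriptions using Google Translate. The INI below sets English as the target language; the example command overrides it with --language es to show how the flag works.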
Configuration (exp_emodb_translate.ini)
[EXP]
root = ./examples/results
name = exp_emodb_translate
language = de
target_language = en
Run translation
The input is the CSV produced in step 1 (it already contains the text
column expected by the translation predictor):
python -m nkululeko.predict \
--list ./emodb_transcribed.csv \
--model translation \
--config examples/exp_emodb_translate.ini \
--language es \
--outfile ./emodb_translated.csv
Note:
--language es overrides both EXP.language and PREDICT.target_language from the INI. For --model translation only the target language matters, so the output column is named after --language (es here). Drop --language to fall back to the INI’s target_language.
Output
file,start,end,emotion,text,es
./data/emodb/wav/03a01Fa.wav,0 days,,happiness,Der Lappen liegt auf dem Eisschrank.,El trapo está sobre la nevera.
Step 3: Classify Text Topics
Zero-shot classification with a multilingual XLM-RoBERTa model.
Configuration (exp_emodb_textclassifier.ini)
[EXP]
root = ./examples/results
name = emodb_textclassifier
[FEATS]
textclassifier.candidates = ["sadness", "anger", "neutral", "happiness", "fear", "disgust", "boredom"]
Run classification
python -m nkululeko.predict \
--list ./emodb_translated.csv \
--model textclassification \
--config examples/exp_emodb_textclassifier.ini \
--outfile ./emodb_classified.csv
Zero-shot classification
The text classifier uses
joeddav/xlm-roberta-large-xnli,
a zero-shot model that can classify text into any categories you define
without further training. Customize the candidates for your use case:
# Sentiment analysis
textclassifier.candidates = ["positive", "negative", "neutral"]
# Topic classification
textclassifier.candidates = ["sports", "politics", "technology", "entertainment"]
# Intent detection
textclassifier.candidates = ["question", "statement", "command", "greeting"]
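If you try several candidate sets, it can be convenient to generate the INI file instead of editing it by hand. The following is a minimal sketch using only the Python standard library. The function name build_classifier_config and the experiment name emodb_sentiment are hypothetical, chosen for illustration; the section and key names are the ones shown in the configuration above.

```python
import configparser
import io
import json


def build_classifier_config(name, candidates, root="./examples/results"):
    """Return INI text for a zero-shot text classification run.

    Hypothetical helper: it only mirrors the [EXP] and [FEATS] keys
    shown in this tutorial.
    """
    config = configparser.ConfigParser()
    config["EXP"] = {"root": root, "name": name}
    # json.dumps yields the same bracketed, double-quoted list format
    # used for textclassifier.candidates in the examples above.
    config["FEATS"] = {
        "textclassifier.candidates": json.dumps(candidates, ensure_ascii=False)
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# "emodb_sentiment" is a placeholder experiment name.
ini_text = build_classifier_config(
    "emodb_sentiment", ["positive", "negative", "neutral"]
)
print(ini_text)
```

Write the returned text to a file and pass that file to --config as in the command above.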
Output
file,classification_winner,sadness,anger,neutral,happiness,fear,disgust,boredom
./data/emodb/wav/03a01Fa.wav,neutral,0.116,0.141,0.359,0.059,0.089,0.121,0.114
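To sanity-check the result, you can confirm that classification_winner matches the highest-scoring candidate column. The sketch below does this for the sample row shown above, using only the standard library. It assumes that every column other than file and classification_winner holds a score, which is true for the sample but may not hold if your output carries extra columns.

```python
import csv
import io

# Sample row copied from the output above.
sample = """file,classification_winner,sadness,anger,neutral,happiness,fear,disgust,boredom
./data/emodb/wav/03a01Fa.wav,neutral,0.116,0.141,0.359,0.059,0.089,0.121,0.114
"""

# Assumption: all remaining columns are candidate scores.
META_COLUMNS = {"file", "classification_winner"}


def top_label(row):
    """Return (label, score) for the highest-scoring candidate column."""
    scores = {k: float(v) for k, v in row.items() if k not in META_COLUMNS}
    label = max(scores, key=scores.get)
    return label, scores[label]


rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    label, score = top_label(row)
    # The last field is True when the CSV's winner agrees with the scores.
    print(row["file"], label, score, label == row["classification_winner"])
```

For a real run, replace the io.StringIO sample with an open file handle on ./emodb_classified.csv.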
Complete pipeline
Run all three steps in sequence, passing each step's output file to the next step via --list:
python -m nkululeko.predict --list ./data/emodb/emodb_files.csv --model text --config examples/exp_emodb_predict_text.ini --outfile transcribed.csv
python -m nkululeko.predict --list transcribed.csv --model translation --config examples/exp_emodb_translate.ini --outfile translated.csv
python -m nkululeko.predict --list translated.csv --model textclassification --config examples/exp_emodb_textclassifier.ini --outfile classified.csv
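The same chain can be driven from a small Python script, which makes it easier to stop at the first failing step. This is a sketch, not part of Nkululeko: predict_command, build_pipeline, and run_pipeline are hypothetical helpers that only assemble the command lines shown above and hand them to subprocess.

```python
import subprocess
import sys


def predict_command(list_path, model, config, outfile):
    """Build the argument list for one nkululeko.predict call."""
    return [
        sys.executable, "-m", "nkululeko.predict",
        "--list", list_path,
        "--model", model,
        "--config", config,
        "--outfile", outfile,
    ]


# (model, config, outfile) for each step, in pipeline order.
STEPS = [
    ("text", "examples/exp_emodb_predict_text.ini", "transcribed.csv"),
    ("translation", "examples/exp_emodb_translate.ini", "translated.csv"),
    ("textclassification", "examples/exp_emodb_textclassifier.ini",
     "classified.csv"),
]


def build_pipeline(first_list):
    """Chain the steps: each outfile becomes the next step's --list."""
    commands = []
    current = first_list
    for model, config, outfile in STEPS:
        commands.append(predict_command(current, model, config, outfile))
        current = outfile
    return commands


def run_pipeline(first_list):
    """Run all steps; check=True aborts as soon as one step fails."""
    for command in build_pipeline(first_list):
        subprocess.run(command, check=True)


# Print the commands without executing them.
commands = build_pipeline("./data/emodb/emodb_files.csv")
for command in commands:
    print(" ".join(command))
```

Call run_pipeline("./data/emodb/emodb_files.csv") to actually execute the three steps.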
Troubleshooting
KeyError: 'text'
The translation step needs a text column in the input CSV. Make sure step 1
finished successfully and that you pass its output to step 2 via --list.
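A quick header check before launching step 2 can catch this early. The sketch below uses only the standard library; missing_columns is a hypothetical helper, and the two in-memory headers stand in for real files.

```python
import csv
import io


def missing_columns(csv_file, required=("text",)):
    """Return the required column names absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return [name for name in required if name not in header]


# In-memory stand-ins; in practice pass
# open("./emodb_transcribed.csv", newline="") instead.
good = io.StringIO("file,start,end,emotion,text\n")
bad = io.StringIO("file,start,end,emotion\n")

good_missing = missing_columns(good)
bad_missing = missing_columns(bad)
print(good_missing)  # []
print(bad_missing)   # ['text']
```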
Slow transcription
Whisper is slow on CPU. Use a GPU if one is available:
[MODEL]
device = cuda
Google Translate rate limits
For large datasets you may hit translation API rate limits. Consider:
Splitting the list into smaller chunks
Adding delays between requests
Switching to an alternative translation service
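The first suggestion, splitting the list into smaller chunks, can be sketched with the standard library as follows. split_list_csv is a hypothetical helper, and the three sample rows are placeholders rather than real EmoDB entries. Each chunk file repeats the header so it can be passed to --list on its own.

```python
import csv
import io
import os
import tempfile


def split_list_csv(csv_file, chunk_size, out_dir, prefix="chunk"):
    """Split one list CSV into numbered chunk files, repeating the header."""
    reader = csv.reader(csv_file)
    header = next(reader)
    rows = list(reader)
    paths = []
    for number, start in enumerate(range(0, len(rows), chunk_size)):
        path = os.path.join(out_dir, f"{prefix}_{number:03d}.csv")
        with open(path, "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(rows[start:start + chunk_size])
        paths.append(path)
    return paths


# Placeholder rows for illustration only.
sample = io.StringIO(
    "file,text\n"
    "a.wav,eins\n"
    "b.wav,zwei\n"
    "c.wav,drei\n"
)
with tempfile.TemporaryDirectory() as tmp:
    chunk_paths = split_list_csv(sample, chunk_size=2, out_dir=tmp)
    chunk_names = [os.path.basename(p) for p in chunk_paths]
    with open(chunk_paths[1], newline="", encoding="utf-8") as f:
        last_chunk = list(csv.reader(f))
print(chunk_names)
```

You can then run the translation step once per chunk, with a time.sleep call between runs to add a delay, and concatenate the resulting CSV files afterwards.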
Use cases
Cross-lingual emotion analysis: analyse emotional content in non-English speech.
Content analysis: extract topics and themes from speech recordings.
Dataset enrichment: add linguistic features to audio datasets.
Multilingual research: compare linguistic patterns across languages.
References
Whisper: OpenAI’s speech recognition model
XLM-RoBERTa-XNLI: zero-shot classification model