Activation Functions in Neural Network Models

Overview

This tutorial demonstrates how to use different activation functions in MLP (Multi-Layer Perceptron) and CNN (Convolutional Neural Network) models in nkululeko. Activation functions are crucial components of neural networks that introduce non-linearity, enabling models to learn complex patterns in your data.

Version: Added in nkululeko 1.1.2
Models: MLP classifier, MLP regression, CNN

What are Activation Functions?

Activation functions determine the output of a neural network node given an input or set of inputs. They introduce non-linear properties to the network, allowing it to learn complex decision boundaries and representations. Choosing the right activation function can significantly impact your model’s performance.

Available Activation Functions

Nkululeko supports four activation functions for MLP models:

Function

Description

Range

Best Used For

relu

Rectified Linear Unit

[0, ∞)

Default choice, general purpose, fast training

leaky_relu

Leaky ReLU (small negative slope)

(-∞, ∞)

Mitigating dead neurons, handling negative values

tanh

Hyperbolic tangent

[-1, 1]

Zero-centered outputs, classification tasks

sigmoid

Logistic function

[0, 1]

Binary classification, probability outputs

Activation Function Characteristics

ReLU (Rectified Linear Unit) - Default

f(x) = max(0, x)
  • Advantages: Fast computation, reduces vanishing gradient problem

  • Disadvantages: Can cause “dead neurons” (neurons that always output 0)

  • Use when: General purpose, first choice for most problems

Leaky ReLU

f(x) = x if x > 0 else 0.01x
  • Advantages: Prevents dead neurons, allows small negative gradients

  • Disadvantages: Slight increase in computation

  • Use when: ReLU shows signs of dead neurons, need better gradient flow

Tanh (Hyperbolic Tangent)

f(x) = (e^x - e^-x) / (e^x + e^-x)
  • Advantages: Zero-centered output, stronger gradients than sigmoid

  • Disadvantages: Can suffer from vanishing gradients with deep networks

  • Use when: Need zero-centered activations, shallow to medium networks

Sigmoid

f(x) = 1 / (1 + e^-x)
  • Advantages: Smooth gradient, output interpretable as probability

  • Disadvantages: Vanishing gradient problem, not zero-centered

  • Use when: Output layer for binary classification, need probability outputs

Configuration

Basic Usage

Add the activation parameter to the [MODEL] section:

[MODEL]
type = mlp
layers = [128, 64, 32]
activation = leaky_relu  # Options: relu, tanh, sigmoid, leaky_relu

Complete Example: Emotion Recognition with Different Activations

1. Using Default ReLU (no activation specified)

[EXP]
root = ./experiments/
name = exp_emodb_relu
runs = 3
epochs = 100

[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = speaker_split
target = emotion
labels = ['anger', 'happiness', 'sadness', 'neutral']

[FEATS]
type = ['os']  # OpenSMILE features
scale = standard

[MODEL]
type = mlp
layers = [128, 64]
# activation = relu  # Default, can be omitted
drop = 0.3
patience = 10
learning_rate = 0.0001

3. Using Tanh (for zero-centered outputs)

[MODEL]
type = mlp
layers = [128, 64]
activation = tanh  # Zero-centered activation
drop = 0.3
patience = 10

4. Using Sigmoid (for specific architectures)

[MODEL]
type = mlp
layers = [128, 64]
activation = sigmoid  # Smooth, probabilistic
drop = 0.3
patience = 10

Practical Examples

Example 1: Speaker Age Prediction (Regression)

For regression tasks, leaky ReLU or tanh often work well:

[EXP]
root = ./experiments/
name = age_prediction
runs = 5
epochs = 200

[DATA]
databases = ['agedb']
agedb = ./data/agedb/
target = age

[FEATS]
type = ['os', 'praat']
scale = standard

[MODEL]
type = mlp
layers = [256, 128, 64]
activation = leaky_relu  # Good for regression
drop = 0.4
patience = 15
learning_rate = 0.00005

Example 2: Binary Emotion Classification

For binary classification, tanh or leaky_relu are good choices:

[EXP]
root = ./experiments/
name = binary_emotion
runs = 3

[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
target = emotion
labels = ['anger', 'neutral']  # Binary classification

[FEATS]
type = ['os']
scale = standard

[MODEL]
type = mlp
layers = [64, 32]
activation = tanh  # Zero-centered for binary classification
drop = 0.2
patience = 5

Example 3: Multi-class Classification with Deep Network

For deeper networks, leaky_relu helps prevent vanishing gradients:

[MODEL]
type = mlp
layers = [256, 128, 64, 32]  # Deep network
activation = leaky_relu  # Prevents dead neurons in deep networks
drop = [0.3, 0.3, 0.2, 0.2]  # Layer-specific dropout
patience = 20
learning_rate = 0.00001
batch_size = 16

CNN Models with List Layer Format

The same update also introduced support for list format in CNN layers:

Old Format (still supported)

[MODEL]
type = cnn
layers = {'l1': 120, 'l2': 84}

Choosing the Right Activation Function

Decision Guide

Start with: relu or leaky_relu
    |
    ├─> If you have dead neurons → Try leaky_relu
    |
    ├─> If gradients vanish in deep networks → Try leaky_relu
    |
    ├─> If you need zero-centered outputs → Try tanh
    |
    └─> If you need probabilistic interpretation → Try sigmoid

Common Use Cases

Task

Recommended Activation

Reason

General classification

relu or leaky_relu

Fast, effective, standard choice

Deep networks (>3 layers)

leaky_relu

Prevents dead neurons, better gradient flow

Regression

leaky_relu

Handles negative values well

Binary classification

tanh or leaky_relu

Zero-centered, good gradients

Multi-class emotion

leaky_relu

Robust, prevents gradient issues

Small networks

relu or tanh

Simple, effective

Performance Comparison Example

Running the same experiment with different activations (emodb dataset, 2 emotions):

# Test all activations
for activation in relu leaky_relu tanh sigmoid; do
    python -m nkululeko.nkululeko --config exp_test_${activation}.ini
done

Typical Results (example)

Activation

UAR

Training Time

Notes

relu

0.645

18.3s

Fast convergence

leaky_relu

0.658

19.1s

Best performance

tanh

0.641

21.2s

Stable training

sigmoid

0.612

24.5s

Slower convergence

Note: Results vary by dataset and configuration

Testing Your Configuration

Quick Test Script

Create a test configuration with fewer epochs to quickly validate your setup:

[EXP]
name = quick_test
runs = 1
epochs = 5  # Quick test

[MODEL]
type = mlp
layers = [64, 16]
activation = leaky_relu  # Test your activation

Verify Activation Function

Check the log output to confirm your activation is being used:

python -m nkululeko.nkululeko --config your_config.ini 2>&1 | grep "activation"

Expected output:

DEBUG: model: using activation function: leaky_relu

Advanced Tips

1. Combining with Dropout

Different activations may work better with different dropout rates:

[MODEL]
type = mlp
layers = [128, 64]
activation = leaky_relu
drop = 0.3  # Start with 0.3 and adjust

# Layer-specific dropout
# drop = [0.4, 0.3]  # Higher dropout in earlier layers

2. Learning Rate Adjustment

Some activations may require different learning rates:

# For relu/leaky_relu
learning_rate = 0.0001  # Standard

# For tanh (sometimes needs lower LR)
learning_rate = 0.00005

# For sigmoid (often needs lower LR)
learning_rate = 0.00003

3. Batch Size Considerations

# Larger batch sizes often work better with relu/leaky_relu
[MODEL]
activation = leaky_relu
batch_size = 32

# Smaller batch sizes may help tanh/sigmoid
[MODEL]
activation = tanh
batch_size = 8

4. Network Depth and Activation

# Shallow networks: relu or tanh work well
[MODEL]
layers = [64, 32]
activation = relu

# Deep networks: prefer leaky_relu
[MODEL]
layers = [256, 128, 64, 32]
activation = leaky_relu  # Better gradient flow

Troubleshooting

Problem: Model not improving

Solutions:

  1. Try leaky_relu instead of relu

  2. Reduce learning rate

  3. Increase patience parameter

  4. Check if dropout is too high

Problem: Loss becomes NaN

Solutions:

  1. Lower the learning rate significantly

  2. Try tanh or leaky_relu instead of relu

  3. Check feature scaling (use scale = standard)

  4. Reduce batch size

Problem: Training is very slow

Solutions:

  1. Use relu or leaky_relu (fastest)

  2. Avoid sigmoid for hidden layers

  3. Increase batch size

  4. Reduce network complexity

Problem: Overfitting

Solutions:

  1. Increase dropout

  2. Try leaky_relu with higher dropout

  3. Reduce network size

  4. Use early stopping (patience parameter)

Complete Working Example

Here’s a complete, tested configuration for emotion recognition:

# File: exp_emotion_leaky_relu.ini
# Description: Emotion recognition with leaky_relu activation

[EXP]
root = ./experiments/results/
name = emotion_leaky_relu
runs = 3
epochs = 100
save = True

[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = specified
emodb.test_tables = ['emotion.categories.test.gold_standard']
emodb.train_tables = ['emotion.categories.train.gold_standard']
target = emotion
labels = ['anger', 'happiness', 'sadness', 'neutral']

[FEATS]
type = ['os']  # OpenSMILE features
scale = standard  # Important: scale features

[MODEL]
type = mlp
layers = [128, 64, 32]
activation = leaky_relu  # NEW: Activation function
drop = 0.3
patience = 10
learning_rate = 0.0001
batch_size = 16

[PLOT]
best_model = True
epoch_progression = True

Run it:

python -m nkululeko.nkululeko --config exp_emotion_leaky_relu.ini

Summary

  • Default: relu - Fast, effective, good starting point

  • Recommended: leaky_relu - More robust, prevents dead neurons

  • For specific needs: tanh (zero-centered) or sigmoid (probabilistic)

  • Always: Combine with proper scaling (scale = standard)

  • Experiment: Test different activations with your specific dataset

References

  • Nkululeko documentation: ini_file.md

  • Added in PR: “added different activation functions” (2026-01-08)

  • Version: 1.1.2+

See Also