# Activation Functions in Neural Network Models

## Overview

This tutorial demonstrates how to use different activation functions in MLP (Multi-Layer Perceptron) and CNN (Convolutional Neural Network) models in nkululeko. Activation functions are crucial components of neural networks that introduce non-linearity, enabling models to learn complex patterns in your data.

**Version**: Added in nkululeko 1.1.2  
**Models**: MLP classifier, MLP regression, CNN

## What are Activation Functions?

Activation functions determine the output of a neural network node given an input or set of inputs. They introduce non-linear properties to the network, allowing it to learn complex decision boundaries and representations. Choosing the right activation function can significantly impact your model's performance.

## Available Activation Functions

Nkululeko supports four activation functions for MLP models:

| Function | Description | Range | Best Used For |
|----------|-------------|-------|---------------|
| **relu** | Rectified Linear Unit | [0, ∞) | Default choice, general purpose, fast training |
| **leaky_relu** | Leaky ReLU (small negative slope) | (-∞, ∞) | Mitigating dead neurons, handling negative values |
| **tanh** | Hyperbolic tangent | (-1, 1) | Zero-centered outputs, classification tasks |
| **sigmoid** | Logistic function | (0, 1) | Binary classification, probability outputs |

### Activation Function Characteristics

#### ReLU (Rectified Linear Unit) - Default
```
f(x) = max(0, x)
```
- **Advantages**: Fast computation, mitigates the vanishing gradient problem for positive inputs
- **Disadvantages**: Can cause "dead neurons" (neurons that always output 0)
- **Use when**: General purpose, first choice for most problems

#### Leaky ReLU
```
f(x) = x if x > 0 else 0.01x
```
- **Advantages**: Mitigates dead neurons by keeping a small non-zero gradient (0.01) for negative inputs
- **Disadvantages**: Slight increase in computation
- **Use when**: ReLU shows signs of dead neurons, need better gradient flow
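
The "dead neuron" effect comes from the derivative rather than the output. The following is a minimal sketch in plain Python, assuming the slope of 0.01 from the formula above; the function names are illustrative and not part of nkululeko.

```python
def relu_grad(x: float) -> float:
    """Derivative of max(0, x); taken as 0 at x == 0 by convention."""
    return 1.0 if x > 0 else 0.0


def leaky_relu_grad(x: float, negative_slope: float = 0.01) -> float:
    """Derivative of leaky ReLU; the slope of 0.01 is assumed from the formula above."""
    return 1.0 if x > 0 else negative_slope


# For negative pre-activations ReLU passes no gradient back,
# while leaky ReLU still passes a small one.
for x in (-3.0, -0.5, 0.5, 3.0):
    print(f"x={x:+.1f}  relu'={relu_grad(x):.2f}  leaky_relu'={leaky_relu_grad(x):.2f}")
```

A unit whose pre-activation stays negative for every training sample receives zero gradient under ReLU and stops updating; with leaky ReLU it keeps receiving a scaled-down update.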

#### Tanh (Hyperbolic Tangent)
```
f(x) = (e^x - e^-x) / (e^x + e^-x)
```
- **Advantages**: Zero-centered output, stronger gradients than sigmoid (peak derivative 1 versus 0.25)
- **Disadvantages**: Can suffer from vanishing gradients in deep networks, because it saturates for large |x|
- **Use when**: Need zero-centered activations, shallow to medium networks
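
To make the gradient comparison with sigmoid concrete, here is a small sketch of both derivatives. It ignores the weight matrices, so `activation_factor_bound` (a hypothetical helper) only bounds the factor contributed by the activations themselves during backpropagation.

```python
import math


def tanh_grad(x: float) -> float:
    """d/dx tanh(x) = 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


def sigmoid_grad(x: float) -> float:
    """d/dx sigmoid(x) = s * (1 - s), with s = sigmoid(x)."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def activation_factor_bound(peak_grad: float, depth: int) -> float:
    """Upper bound on the product of activation derivatives across `depth` layers."""
    return peak_grad ** depth


print(tanh_grad(0.0))     # 1.0, the largest value tanh' can take
print(sigmoid_grad(0.0))  # 0.25, the largest value sigmoid' can take
print(activation_factor_bound(0.25, 4))  # 0.00390625 for four sigmoid layers
print(tanh_grad(4.0), sigmoid_grad(4.0))  # both shrink as the unit saturates
```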

#### Sigmoid
```
f(x) = 1 / (1 + e^-x)
```
- **Advantages**: Smooth gradient, output interpretable as probability
- **Disadvantages**: Vanishing gradient problem, not zero-centered
- **Use when**: Output layer for binary classification, need probability outputs
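
The four formulas in this section can be written out in a few lines of plain Python. This is only an illustrative sketch for checking values and ranges by hand; it is not how nkululeko implements them, and the `ACTIVATIONS` dictionary is a hypothetical name.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    return x if x > 0 else negative_slope * x


def tanh(x: float) -> float:
    return math.tanh(x)


def sigmoid(x: float) -> float:
    # Branch on the sign so math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Hypothetical lookup table keyed by the names used in the config file.
ACTIVATIONS = {"relu": relu, "leaky_relu": leaky_relu, "tanh": tanh, "sigmoid": sigmoid}

for name, fn in ACTIVATIONS.items():
    print(name, [round(fn(x), 4) for x in (-5.0, -1.0, 0.0, 1.0, 5.0)])
```

Printing a few values like this is a quick way to see that `tanh` stays strictly inside (-1, 1) and `sigmoid` strictly inside (0, 1), while `relu` clips negatives to zero and `leaky_relu` only scales them down.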

## Configuration

### Basic Usage

Add the `activation` parameter to the `[MODEL]` section:

```ini
[MODEL]
type = mlp
layers = [128, 64, 32]
activation = leaky_relu  # Options: relu, tanh, sigmoid, leaky_relu
```
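
Before launching a long run, it can help to check a `[MODEL]` section offline. The sketch below uses Python's standard `configparser`; `read_model_settings` and `ALLOWED_ACTIVATIONS` are hypothetical names, and this is not nkululeko's own parsing code. Note that `configparser` strips inline `# ...` comments only when `inline_comment_prefixes` is set, so the sketch enables it explicitly; whether nkululeko does the same is not stated here.

```python
import ast
import configparser

# The four names listed in this tutorial.
ALLOWED_ACTIVATIONS = ("relu", "leaky_relu", "tanh", "sigmoid")


def read_model_settings(ini_text: str):
    """Return (layers, activation) from a [MODEL] section, defaulting to relu."""
    # Assumption: inline comments should be ignored, as in the examples here.
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(ini_text)
    model = parser["MODEL"]
    layers = ast.literal_eval(model["layers"])  # "[128, 64, 32]" -> list of ints
    activation = model.get("activation", "relu").strip()
    if activation not in ALLOWED_ACTIVATIONS:
        raise ValueError(
            f"unknown activation {activation!r}; expected one of {ALLOWED_ACTIVATIONS}"
        )
    return layers, activation


example = """
[MODEL]
type = mlp
layers = [128, 64, 32]
activation = leaky_relu  # Options: relu, tanh, sigmoid, leaky_relu
"""
print(read_model_settings(example))
```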

### Complete Example: Emotion Recognition with Different Activations

#### 1. Using Default ReLU (no activation specified)
```ini
[EXP]
root = ./experiments/
name = exp_emodb_relu
runs = 3
epochs = 100

[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = speaker_split
target = emotion
labels = ['anger', 'happiness', 'sadness', 'neutral']

[FEATS]
type = ['os']  # OpenSMILE features
scale = standard

[MODEL]
type = mlp
layers = [128, 64]
# activation = relu  # Default, can be omitted
drop = 0.3
patience = 10
learning_rate = 0.0001
```

#### 2. Using Leaky ReLU (recommended for most cases)
```ini
[MODEL]
type = mlp
layers = [128, 64]
activation = leaky_relu  # Better gradient flow
drop = 0.3
patience = 10
```

#### 3. Using Tanh (for zero-centered outputs)
```ini
[MODEL]
type = mlp
layers = [128, 64]
activation = tanh  # Zero-centered activation
drop = 0.3
patience = 10
```

#### 4. Using Sigmoid (for specific architectures)
```ini
[MODEL]
type = mlp
layers = [128, 64]
activation = sigmoid  # Smooth, probabilistic
drop = 0.3
patience = 10
```

## Practical Examples

### Example 1: Speaker Age Prediction (Regression)

For regression tasks, `leaky_relu` or `tanh` often works well:

```ini
[EXP]
root = ./experiments/
name = age_prediction
runs = 5
epochs = 200

[DATA]
databases = ['agedb']
agedb = ./data/agedb/
target = age

[FEATS]
type = ['os', 'praat']
scale = standard

[MODEL]
type = mlp
layers = [256, 128, 64]
activation = leaky_relu  # Good for regression
drop = 0.4
patience = 15
learning_rate = 0.00005
```

### Example 2: Binary Emotion Classification

For binary classification, tanh or leaky_relu are good choices:

```ini
[EXP]
root = ./experiments/
name = binary_emotion
runs = 3

[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
target = emotion
labels = ['anger', 'neutral']  # Binary classification

[FEATS]
type = ['os']
scale = standard

[MODEL]
type = mlp
layers = [64, 32]
activation = tanh  # Zero-centered for binary classification
drop = 0.2
patience = 5
```

### Example 3: Multi-class Classification with Deep Network

For deeper networks, `leaky_relu` keeps a non-zero gradient for negative inputs, which helps avoid dead neurons:

```ini
[MODEL]
type = mlp
layers = [256, 128, 64, 32]  # Deep network
activation = leaky_relu  # Prevents dead neurons in deep networks
drop = [0.3, 0.3, 0.2, 0.2]  # Layer-specific dropout
patience = 20
learning_rate = 0.00001
batch_size = 16
```
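
To see how `layers`, `activation`, and a scalar or per-layer `drop` fit together, the sketch below expands them into one row per hidden layer. The helper names, the input size of 88, and the linear, activation, dropout ordering are assumptions for illustration only; the tutorial does not describe how nkululeko builds the network internally.

```python
def expand_dropout(layers, drop):
    """Turn a scalar dropout rate into one rate per layer; validate list lengths."""
    if isinstance(drop, (int, float)):
        return [float(drop)] * len(layers)
    if len(drop) != len(layers):
        raise ValueError(f"got {len(drop)} dropout rates for {len(layers)} layers")
    return [float(rate) for rate in drop]


def describe_mlp(input_dim, layers, activation, drop):
    """Return (in_features, out_features, activation, dropout) for each hidden layer."""
    plan = []
    in_dim = input_dim
    for out_dim, rate in zip(layers, expand_dropout(layers, drop)):
        plan.append((in_dim, out_dim, activation, rate))
        in_dim = out_dim  # the next layer consumes this layer's output
    return plan


# input_dim=88 is a made-up feature count used only for this example.
for row in describe_mlp(88, [256, 128, 64, 32], "leaky_relu", [0.3, 0.3, 0.2, 0.2]):
    print(row)
```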

## CNN Models with List Layer Format

The same update also introduced support for list format in CNN layers:

### Old Format (still supported)
```ini
[MODEL]
type = cnn
layers = {'l1': 120, 'l2': 84}
```

### New Format (recommended)
```ini
[MODEL]
type = cnn
layers = [120, 84]  # Simpler, more intuitive
```
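
If you have older configs, the two formats carry the same information. A minimal sketch of the conversion, assuming the dictionary keys were written in layer order (`normalize_layers` is a hypothetical helper, not a nkululeko function):

```python
import ast


def normalize_layers(value: str) -> list:
    """Accept either the dict or the list form of `layers` and return a list."""
    parsed = ast.literal_eval(value)
    if isinstance(parsed, dict):
        # Python dicts keep insertion order, so 'l1', 'l2', ... stay in written order.
        return list(parsed.values())
    if isinstance(parsed, (list, tuple)):
        return list(parsed)
    raise TypeError(f"unsupported layers value: {value!r}")


print(normalize_layers("{'l1': 120, 'l2': 84}"))
print(normalize_layers("[120, 84]"))
```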

## Choosing the Right Activation Function

### Decision Guide

```
Start with: relu or leaky_relu
    |
    ├─> If you have dead neurons → Try leaky_relu
    |
    ├─> If gradients vanish in deep networks → Try leaky_relu
    |
    ├─> If you need zero-centered outputs → Try tanh
    |
    └─> If you need probabilistic interpretation → Try sigmoid
```

### Common Use Cases

| Task | Recommended Activation | Reason |
|------|----------------------|--------|
| **General classification** | `relu` or `leaky_relu` | Fast, effective, standard choice |
| **Deep networks (>3 layers)** | `leaky_relu` | Prevents dead neurons, better gradient flow |
| **Regression** | `leaky_relu` | Passes negative pre-activations through with a small slope instead of zeroing them |
| **Binary classification** | `tanh` or `leaky_relu` | Zero-centered, good gradients |
| **Multi-class emotion** | `leaky_relu` | Robust, prevents gradient issues |
| **Small networks** | `relu` or `tanh` | Simple, effective |

## Performance Comparison Example

Run the same experiment once per activation (emodb dataset, 2 emotions), using one config file per activation named `exp_test_<activation>.ini`:

```bash
# Test all activations; expects exp_test_<activation>.ini to exist for each name
for activation in relu leaky_relu tanh sigmoid; do
    python -m nkululeko.nkululeko --config "exp_test_${activation}.ini"
done
```
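
The loop expects one config file per activation. One way to generate them from a shared base is sketched below with the standard `configparser`. The base config is deliberately abbreviated (a real run also needs `[DATA]` and `[FEATS]` sections such as those shown earlier), and `write_activation_configs` is a hypothetical helper.

```python
import configparser
import tempfile
from pathlib import Path

ACTIVATIONS = ("relu", "leaky_relu", "tanh", "sigmoid")

# Abbreviated base config; add your own [DATA] and [FEATS] sections for a real run.
BASE_CONFIG = """
[EXP]
root = ./experiments/
name = exp_test
runs = 1
epochs = 5

[MODEL]
type = mlp
layers = [64, 16]
"""


def write_activation_configs(base_text: str, out_dir) -> list:
    """Write exp_test_<activation>.ini for each activation and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for activation in ACTIVATIONS:
        parser = configparser.ConfigParser()
        parser.optionxform = str  # keep option names exactly as written
        parser.read_string(base_text)
        parser["EXP"]["name"] = f"exp_test_{activation}"
        parser["MODEL"]["activation"] = activation
        path = out_dir / f"exp_test_{activation}.ini"
        with path.open("w") as handle:
            parser.write(handle)
        paths.append(path)
    return paths


# Demo: write into a temporary directory and list the generated file names.
with tempfile.TemporaryDirectory() as tmp:
    for path in write_activation_configs(BASE_CONFIG, tmp):
        print(path.name)
```

Giving each file its own `name` keeps the runs from overwriting each other's output under the same experiment root.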

### Typical Results (example)

| Activation | UAR | Training Time | Notes |
|------------|-----|---------------|-------|
| relu | 0.645 | 18.3s | Fast convergence |
| leaky_relu | 0.658 | 19.1s | Best performance |
| tanh | 0.641 | 21.2s | Stable training |
| sigmoid | 0.612 | 24.5s | Slower convergence |

*Note: Results vary by dataset and configuration*
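
UAR in the table stands for unweighted average recall: the recall of each class, averaged with equal weight regardless of how many samples each class has. A small self-contained sketch with made-up labels (not the data behind the table) shows how it differs from plain accuracy:

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred) -> float:
    """Mean of per-class recall, giving every class the same weight."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Made-up, imbalanced example: 4 anger samples and 2 neutral samples.
y_true = ["anger", "anger", "anger", "anger", "neutral", "neutral"]
y_pred = ["anger", "anger", "anger", "neutral", "neutral", "anger"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}")  # 4 of 6 correct
print(f"UAR      = {unweighted_average_recall(y_true, y_pred):.3f}")  # (3/4 + 1/2) / 2 = 0.625
```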

## Testing Your Configuration

### Quick Test Script

Create a test configuration with fewer epochs to quickly validate your setup:

```ini
[EXP]
name = quick_test
runs = 1
epochs = 5  # Quick test

[MODEL]
type = mlp
layers = [64, 16]
activation = leaky_relu  # Test your activation
```

### Verify Activation Function

Check the log output to confirm your activation is being used:

```bash
python -m nkululeko.nkululeko --config your_config.ini 2>&1 | grep "activation"
```

Expected output:
```
DEBUG: model: using activation function: leaky_relu
```

## Advanced Tips

### 1. Combining with Dropout

Different activations may work better with different dropout rates:

```ini
[MODEL]
type = mlp
layers = [128, 64]
activation = leaky_relu
drop = 0.3  # Start with 0.3 and adjust

# Layer-specific dropout
# drop = [0.4, 0.3]  # Higher dropout in earlier layers
```

### 2. Learning Rate Adjustment

Some activations may require different learning rates:

```ini
# For relu/leaky_relu
learning_rate = 0.0001  # Standard

# For tanh (sometimes needs lower LR)
learning_rate = 0.00005

# For sigmoid (often needs lower LR)
learning_rate = 0.00003
```

### 3. Batch Size Considerations

```ini
# Two alternative [MODEL] sections for comparison; keep only one per config file.
# Larger batch sizes often work better with relu/leaky_relu
[MODEL]
activation = leaky_relu
batch_size = 32

# Smaller batch sizes may help tanh/sigmoid
[MODEL]
activation = tanh
batch_size = 8
```

### 4. Network Depth and Activation

```ini
# Alternatives, not a single file: use one [MODEL] section per config.
# Shallow networks: relu or tanh work well
[MODEL]
layers = [64, 32]
activation = relu

# Deep networks: prefer leaky_relu
[MODEL]
layers = [256, 128, 64, 32]
activation = leaky_relu  # Better gradient flow
```

## Troubleshooting

### Problem: Model not improving

**Solutions:**
1. Try `leaky_relu` instead of `relu`
2. Reduce learning rate
3. Increase patience parameter
4. Check if dropout is too high

### Problem: Loss becomes NaN

**Solutions:**
1. Lower the learning rate significantly
2. Try `tanh` or `leaky_relu` instead of `relu`
3. Check feature scaling (use `scale = standard`)
4. Reduce batch size

### Problem: Training is very slow

**Solutions:**
1. Use `relu` or `leaky_relu` (fastest)
2. Avoid `sigmoid` for hidden layers
3. Increase batch size
4. Reduce network complexity

### Problem: Overfitting

**Solutions:**
1. Increase dropout
2. Try `leaky_relu` with higher dropout
3. Reduce network size
4. Use early stopping (patience parameter)
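
For readers unfamiliar with patience-based early stopping, the general idea is sketched below. It assumes that a validation loss is monitored and that any strict decrease counts as an improvement; the exact metric and rule nkululeko applies are not described in this tutorial, and `early_stopping_epoch` is a hypothetical helper.

```python
def early_stopping_epoch(val_losses, patience: int) -> int:
    """Return the 0-based epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # patience exhausted
    return len(val_losses) - 1  # never triggered; ran all epochs


# Made-up validation losses: the best value is reached at epoch 2.
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74]
print(early_stopping_epoch(losses, patience=3))  # stops at epoch 5
```

A smaller `patience` stops sooner and limits overfitting, at the risk of stopping during a temporary plateau.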

## Complete Working Example

Here's a complete, tested configuration for emotion recognition:

```ini
# File: exp_emotion_leaky_relu.ini
# Description: Emotion recognition with leaky_relu activation

[EXP]
root = ./experiments/results/
name = emotion_leaky_relu
runs = 3
epochs = 100
save = True

[DATA]
databases = ['emodb']
emodb = ./data/emodb/emodb
emodb.split_strategy = specified
emodb.test_tables = ['emotion.categories.test.gold_standard']
emodb.train_tables = ['emotion.categories.train.gold_standard']
target = emotion
labels = ['anger', 'happiness', 'sadness', 'neutral']

[FEATS]
type = ['os']  # OpenSMILE features
scale = standard  # Important: scale features

[MODEL]
type = mlp
layers = [128, 64, 32]
activation = leaky_relu  # NEW: Activation function
drop = 0.3
patience = 10
learning_rate = 0.0001
batch_size = 16

[PLOT]
best_model = True
epoch_progression = True
```

Run it:
```bash
python -m nkululeko.nkululeko --config exp_emotion_leaky_relu.ini
```

## Summary

- **Default**: `relu` - Fast, effective, good starting point
- **Recommended**: `leaky_relu` - More robust, prevents dead neurons
- **For specific needs**: `tanh` (zero-centered) or `sigmoid` (probabilistic)
- **Always**: Combine with proper scaling (`scale = standard`)
- **Experiment**: Test different activations with your specific dataset

## References

- Nkululeko documentation: [ini_file.md](../ini_file.md)
- Added in PR: "added different activation functions" (2026-01-08)
- Version: 1.1.2+

## See Also

- [Feature extraction tutorial](tut_regplot_features.md)
- [Model optimization guide](../ini_file.md#model-section)
- [Data preprocessing](../ini_file.md#data-section)