Here is a complete, plain Python implementation covering all the activation functions, mathematical formulas, and data tables detailed in your handwritten notes.
This code uses only standard Python functionality (the built-in math module) and maps directly to the concepts you outlined, from Linear and Sigmoid functions to Tanh and Softmax.
Plain Python Implementation of Activation Functions
Python
import math
# =====================================================================
# 1. LINEAR ACTIVATION FUNCTION
# Notes Mapping: "Linear Activation Function - f(z) = z, Identity function"
# "Output layer for Regression"
# =====================================================================
def linear(z):
    """
    Identity function: returns the input exactly as it is.
    Typically used in the output layer for regression problems.
    """
    return z
# =====================================================================
# 2. SIGMOID ACTIVATION FUNCTION
# Notes Mapping: f(x) = 1 / (1 + e^-x)
# "Outputs probability values. Used for Output layer of Binary Classifier"
# =====================================================================
def sigmoid(x):
    """
    Squashes inputs to a range between 0 and 1.
    Includes a small safeguard to prevent math overflow on extreme negative inputs.
    """
    # Prevent overflow: math.exp(-x) raises OverflowError when x is very negative
    x = max(-700.0, min(700.0, x))
    return 1.0 / (1.0 + math.exp(-x))
# =====================================================================
# 3. TANH (TANGENT HYPERBOLIC) ACTIVATION FUNCTION
# Notes Mapping: f(x) = (e^x - e^-x) / (e^x + e^-x)
# "Output is 0 centered. Used in hidden layers... Helps in learning"
# =====================================================================
def tanh_custom(x):
    """
    Squashes inputs to a range between -1 and 1.
    Uses the exact formula from the notes rather than Python's built-in math.tanh.
    """
    # Prevent overflow: clamp x so math.exp(x) and math.exp(-x) stay within float range
    x = max(-350.0, min(350.0, x))
    numerator = math.exp(x) - math.exp(-x)
    denominator = math.exp(x) + math.exp(-x)
    return numerator / denominator
# =====================================================================
# 4. SOFTMAX ACTIVATION FUNCTION
# Notes Mapping: Transforms "Logits" into Probability Distribution.
# "Squashes values between 0 & 1. Ensures sum of all outputs is exactly 1.0"
# "Op layer of Multiclass classification problem"
# =====================================================================
def softmax(logits):
    """
    Applies the exponential function to each element, then divides by the sum
    of all exponentials to create a valid probability distribution.
    """
    # To prevent math overflow with large exponents, it is standard practice
    # to subtract the maximum logit from all logits. This does not change the result.
    max_logit = max(logits)
    exponentials = [math.exp(z - max_logit) for z in logits]
    sum_exponentials = sum(exponentials)
    probabilities = [e / sum_exponentials for e in exponentials]
    return probabilities
# =====================================================================
# 5. EXECUTION AND VERIFICATION (Mirroring the Notes' Tables)
# =====================================================================
if __name__ == "__main__":
    # --- Verifying Sigmoid Table ---
    print("=== SIGMOID FUNCTION VALUES ===")
    sigmoid_test_values = [-5, -3, -1, 0, 1, 3, 5, -20, 20]
    for x in sigmoid_test_values:
        result = sigmoid(x)
        # Formatting to match the precision used in the notes
        print(f"x = {x:>3} | f(x) = {result:.5f}")
    print("\nObservation match: very negative inputs give values near 0, very positive inputs give values near 1.")
    print("-" * 50)

    # --- Verifying Tanh Table ---
    print("\n=== TANH FUNCTION VALUES ===")
    tanh_test_values = [-5, -3, -1, 0, 1, 3, 5]
    for x in tanh_test_values:
        result = tanh_custom(x)
        print(f"x = {x:>3} | f(x) = {result:.4f}")
    print("\nObservation match: outputs are zero-centered, ranging between -1 and 1.")
    print("-" * 50)

    # --- Verifying Softmax Table ---
    print("\n=== SOFTMAX FUNCTION VALUES (Multiclass Classification) ===")
    # The exact "Output before Activation function" (logits) from the notes
    logits = [1.3, 5.1, 2.2, 0.7, 1.1]
    print(f"Input Logits: {logits}")
    probabilities = softmax(logits)
    print("\nCalculated Probabilities:")
    for i in range(len(logits)):
        print(f"Logit: {logits[i]} -> Probability: {probabilities[i]:.2f}")
    print(f"\nSum of all probabilities: {sum(probabilities):.1f}")
Mathematical Formulations from your Notes
The code above applies the mathematical formulas written on the page. Here is how the written formulas map to their respective functions in the script:
1. Sigmoid Function
This is implemented in sigmoid(x) as f(x) = 1 / (1 + e^(-x)). As noted on your page, the curve rises smoothly from 0 to 1, so a raw score can be read directly as a probability, which is why it is used in the output layer of a binary classifier.
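To tie this to the "output layer of a binary classifier" note, here is a minimal sketch of how the probability would typically be turned into a class label. The raw score and the 0.5 threshold are illustrative assumptions (not from your notes), and it reuses the sigmoid function defined above.
Python
# Minimal sketch (illustrative values): thresholding the sigmoid output of a binary classifier.
# Reuses sigmoid(x) from the script above; raw_score and the 0.5 cut-off are assumptions.
raw_score = 2.0                       # hypothetical weighted sum from the last layer
probability = sigmoid(raw_score)      # roughly 0.88
predicted_class = 1 if probability >= 0.5 else 0
print(f"P(class=1) = {probability:.2f} -> predicted class {predicted_class}")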
2. Tangent Hyperbolic (Tanh)
This is implemented directly in tanh_custom(x) as f(x) = (e^x - e^(-x)) / (e^x + e^(-x)). Because the curve passes through the origin, the hidden-layer activations it produces have a mean close to 0; this zero-centering avoids pushing the inputs of later layers in one direction and helps gradient descent converge faster.
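If you want to see that zero-centering numerically, the small sketch below compares the mean activation produced by sigmoid and tanh_custom on a symmetric batch of inputs. The sample values are an arbitrary choice, and both functions are reused from the script above.
Python
# Sketch: mean activation of sigmoid vs tanh on a symmetric batch (sample values are arbitrary).
sample_inputs = [-3.0, -1.5, -0.5, 0.5, 1.5, 3.0]
sigmoid_mean = sum(sigmoid(x) for x in sample_inputs) / len(sample_inputs)
tanh_mean = sum(tanh_custom(x) for x in sample_inputs) / len(sample_inputs)
print(f"Mean sigmoid output: {sigmoid_mean:.3f}")  # 0.5 -- not zero-centered
print(f"Mean tanh output:    {tanh_mean:.3f}")     # 0.0 -- zero-centered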
3. Softmax Function
This is handled in softmax(logits). For each logit z_i it computes e^(z_i) and divides by the sum of the exponentials of all K logits in the layer, softmax(z_i) = e^(z_i) / (e^(z_1) + ... + e^(z_K)). The verification step reproduces the hand-calculated row [0.02, 0.90, 0.05, 0.01, 0.02] from your bottom table.
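One detail worth making explicit: the max-subtraction inside softmax is only a numerical-stability trick. Because e^(z_i - m) = e^(z_i) / e^(m), the common factor e^(m) cancels when the ratio is taken, so the probabilities are unchanged. The sketch below checks this against a naive softmax without the shift; the helper softmax_naive is mine (not from your notes), the logits are the ones from your table, and softmax is reused from the script above.
Python
import math

def softmax_naive(logits):
    # Same formula without the max-subtraction trick (safe here because the logits are small).
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.3, 5.1, 2.2, 0.7, 1.1]
print([round(p, 6) for p in softmax(logits)])        # stable version defined above
print([round(p, 6) for p in softmax_naive(logits)])  # identical up to floating-point rounding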
Here is the plain Python implementation covering all the concepts from your notes on ReLU, the Dying ReLU problem, and Leaky ReLU.
To make these concepts concrete, I have not only implemented the functions but also built a simulation showing exactly how Sparsity works and how a Dying ReLU neuron completely stops learning during backpropagation compared to a Leaky ReLU neuron.
Plain Python Implementation
Python
# =====================================================================
# 1. ReLU (Rectified Linear Unit) ACTIVATION FUNCTION
# Notes Mapping: f(x) = max(0, x)
# "Introduces sparsity in the network as only fraction of neurons activate"
# "Helps backpropagation as its derivative is 0 or 1"
# =====================================================================
def relu(x):
    """
    Implements f(x) = max(0, x).
    Returns the input if it is positive, otherwise returns 0.
    """
    return max(0.0, float(x))

def relu_derivative(x):
    """
    The derivative of ReLU, crucial for backpropagation.
    It is 1 if x > 0, and 0 if x <= 0.
    """
    return 1.0 if x > 0.0 else 0.0
# =====================================================================
# 2. LEAKY ReLU ACTIVATION FUNCTION
# Notes Mapping: x if x > 0, 0.01 * x if x <= 0
# Solves the Dying ReLU problem by allowing a tiny, non-zero gradient.
# =====================================================================
def leaky_relu(x, alpha=0.01):
    """
    Implements Leaky ReLU.
    Applies a slight slope (default 0.01) to negative values.
    """
    return float(x) if x > 0.0 else alpha * float(x)

def leaky_relu_derivative(x, alpha=0.01):
    """
    The derivative of Leaky ReLU.
    Returns 1 if x is positive, and the small alpha (0.01) otherwise.
    """
    return 1.0 if x > 0.0 else alpha
# =====================================================================
# 3. DEMONSTRATION & EXECUTION
# Demonstrating Sparsity and the "Dying ReLU" Problem
# =====================================================================
if __name__ == "__main__":
    print("=== 1. DEMONSTRATING SPARSITY ===")
    # Imagine a hidden layer receiving these 5 raw inputs (z values) from the previous layer
    raw_inputs = [2.5, -5.0, 1.2, -0.8, 5.0]
    relu_outputs = [relu(x) for x in raw_inputs]
    print(f"Raw Inputs: {raw_inputs}")
    print(f"ReLU Outputs: {relu_outputs}")

    # Calculate sparsity (percentage of neurons that output exactly 0)
    dead_count = sum(1 for out in relu_outputs if out == 0.0)
    sparsity_percentage = (dead_count / len(relu_outputs)) * 100
    print(f"Observation: {sparsity_percentage}% of neurons output 0. This is the 'Sparsity' mentioned in your notes.")
    print("-" * 60)

    print("\n=== 2. THE DYING ReLU PROBLEM VS LEAKY ReLU ===")
    print("Scenario: A neuron receives a large negative input during training.")
    print("Notes mapping: 'Neuron can become inactive... leading to 0 gradient... cannot learn further.'\n")

    # Simulate a single weight-update step in backpropagation
    large_negative_input = -10.0
    error_signal_from_next_layer = 0.5  # A generic error passed backwards
    learning_rate = 0.1
    current_weight = 0.8

    # --- Case A: Using Standard ReLU ---
    print("--- CASE A: Standard ReLU ---")
    output_relu = relu(large_negative_input)
    deriv_relu = relu_derivative(large_negative_input)
    # Gradient = error signal * derivative of the activation function
    gradient_relu = error_signal_from_next_layer * deriv_relu
    weight_update_relu = learning_rate * gradient_relu
    new_weight_relu = current_weight + weight_update_relu
    print(f"Input: {large_negative_input} -> Output: {output_relu}")
    print(f"Derivative (Gradient Multiplier): {deriv_relu}")
    print(f"Calculated Gradient: {gradient_relu}")
    print(f"New Weight: {new_weight_relu} (Changed by: {weight_update_relu})")
    print("Result: The gradient is strictly 0. The weight did not change. The neuron is DEAD.")

    # --- Case B: Using Leaky ReLU ---
    print("\n--- CASE B: Leaky ReLU to the Rescue ---")
    output_leaky = leaky_relu(large_negative_input)
    deriv_leaky = leaky_relu_derivative(large_negative_input)
    # Gradient = error signal * derivative of the activation function
    gradient_leaky = error_signal_from_next_layer * deriv_leaky
    weight_update_leaky = learning_rate * gradient_leaky
    new_weight_leaky = current_weight + weight_update_leaky
    print(f"Input: {large_negative_input} -> Output: {output_leaky}")
    print(f"Derivative (Gradient Multiplier): {deriv_leaky}")
    print(f"Calculated Gradient: {gradient_leaky}")
    print(f"New Weight: {new_weight_leaky:.5f} (Changed by: {weight_update_leaky:.5f})")
    print("Result: The gradient is small but non-zero. The weight updated slightly. The neuron can still LEARN!")
Mathematical Mapping to Your Notes
The code above models the mathematical formulations you wrote down:
- ReLU Equation: The code max(0.0, float(x)) models f(x) = max(0, x) exactly.
- Piecewise ReLU: The derivative 1.0 if x > 0.0 else 0.0 reflects the slope of your piecewise function, f(x) = x for x > 0 and f(x) = 0 for x <= 0.
- Leaky ReLU Piecewise: The implementation float(x) if x > 0.0 else alpha * float(x) directly matches the bracketed piecewise function at the bottom of your page, applying the 0.01 multiplier to negative inputs.
- The Dying ReLU Problem: As simulated in "Case A", when x = -10.0 the derivative collapses to exactly 0. Because the chain rule in backpropagation multiplies by this derivative, the entire gradient becomes 0, the network stops adjusting that weight, and the neuron is permanently frozen. The arithmetic is worked out below.
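Using the exact numbers from the simulation (error signal 0.5, learning rate 0.1, starting weight 0.8), the chain-rule step works out as follows:

ReLU:       gradient = 0.5 * 0    = 0      -> weight update = 0.1 * 0     = 0       -> new weight = 0.8 (frozen)
Leaky ReLU: gradient = 0.5 * 0.01 = 0.005  -> weight update = 0.1 * 0.005 = 0.0005  -> new weight = 0.8005 (still learning)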