I always find that building statistical functions from scratch is the best way to reinforce understanding of statistical concepts. Those of you who have been following my posts recently know that I have built the McCulloch-Pitts Neuron and Perceptron models in this manner. As I continue working through the history of neural networks, it is time to build a feedforward network with backpropagation. Here, we will train a neural network using randomly generated data. I will, in a later post, fit the neural network to empirical data and validate the weights for prediction of out of sample data (i.e., include training and testing).

As you may have noticed in the previous blogs, the purpose of writing this blog is largely for notekeeping purposes. Neural network libraries, like all other statistical libraries, are dangerous to use unless you understand their foundations. In my own experience, I recall a breakage in the rigor of presentation of machine learning methods when analysis moves from clustering, k-folds, and regression with moving windows to neural networks. It seemed like we should just accept that neural networks were magical. Thus, my providing a skeleton of the development of neural networks is intended both to help contextualize the development in light of earlier and later developments and also provide accessible explanation for those who would like to better understand the mechanisms employed by neural networks.Due to the complexity of backpropagation, I will not discuss history except to note that backpropagation is widely viewed as the next pathbreaking innovation after the Perceptron. I intend to follow up with discussion of the history of backpropagation in later posts.

A key aspect of a neural network with backpropagation is the use of the sigmoid activation function. We will start here, present the neural network at a high level, then dig into the feedforward and backpropagation processes.

Sigmoids, not Discrete Activation

The sigmoid function is a smooth, differentiable function that is used to determine whether a gate should be open or closed. The sigmoid function is defined as: $$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Here, the variable $z$ is the weighted sum of the inputs to the neuron. You might remember that, with the Perceptron, that if the sum of the value of the inputs was greater than the threshold value, the gate would open. Here, the sigmoid normalizes the values of $z$. Thus the values are not discretized in the canonical model. The sigmoid function is also differentiable, which is important for backpropagation as the derivative is used to calculate the change in weights. The derivative of the sigmoid function is: $$\sigma'(z) = \sigma(z)(1 - \sigma(z))$$

In the script, we also include the RELU and hyperbolic tangent functions for reference, but we will not use them here.

The feedforward network with backpropagation weights inputs in the first layer across the nodes in that layer. The values are weighted and passed to the sigmoid function. Those values that are greater than 0.5 are passed to the next layer and the process is repeated.

Finally, after progressing through the hidden layer or layers, the weighted sum of the inputs to the last hidden layer are passed to the output layer. The output layer uses the sigmoid function to determine the output of the network. The output is compared to the actual value and the weights are adjusted using the derivative of the sigmoid function, starting from layer $N$ with: $$\delta^{(N)} = (y - a^{(N)})\sigma'(a^{(N)})$$

$\delta^{(N)}$ is employed to reweight the previous layers outputs. The process iterates back to the first layer. Equipped with this knowledge, we can begin to build the feedforward network with backpropagation.

Now that we understand how data is transformed between layers, we can conceptualize the neural network in terms of layers. The first layer consists of the raw data. This is our input layer. Values of each variable are weighted, summed, and then passed to the sigmoid function to be normalized. The values from the first hidden layer are then weighted with weights that are distinct from the weights for the input layer. If there are no more hidden layers, the values are aggregated in the output layer, which may consist of more than one node. We will focus on the situation where the output layer contains a single node.

In [1]:

import networkx as nx
import matplotlib.pyplot as plt

# Create a directed graph
G = nx.DiGraph()
num_nodes = 2

# Add nodes for the input layer and the hidden layer.
# Use LaTeX formatting with proper subscript notation.
for i in range(1, num_nodes + 1):
    G.add_node(f"$X_{i}$")
    G.add_node(f"$H_{i}$")
G.add_node("$Y$")

# Set positions (this helps visualize the layers)
pos = {}
for i in range(1, num_nodes + 1):
    pos[f"$X_{i}$"] = (0, 2 * num_nodes - i * 2.25)
    pos[f"$H_{i}$"] = (1, num_nodes - i)
pos["$Y$"] = (2, 0.5)

# Add edges from the input layer to the hidden layer.
# The weight from the i-th input neuron to the j-th hidden neuron is denoted by w_{ij}^{(1)}
for i in range(1, num_nodes + 1):
    for j in range(1, num_nodes + 1):
        G.add_edge(f"$X_{i}$", f"$H_{j}$", 
                   label = f"$w_{{{i}{j}}}^{{(1)}}$")

# Add edges from the hidden layer to the output.
# The weight from the j-th hidden neuron to the output is denoted by w_{j}^{(2)}
for j in range(1, num_nodes + 1):
    G.add_edge(f"$H_{j}$", "$Y$", label = f"$w_{{{j}}}^{{(2)}}$")

# Plot the graph
fig, ax = plt.subplots(figsize=(20, 12))
nx.draw(G, pos, with_labels=True, node_size=2000, node_color="skyblue", 
        font_size=24, font_weight="bold", edge_color="gray", arrowsize=20,
        ax=ax)
nx.draw_networkx_edge_labels(G, pos, edge_labels=nx.get_edge_attributes(G, 'label'),
                             font_color="red", font_size=24, ax=ax)

# Add layer labels at the bottom of the graph
plt.text(0, -1, "Input\nLayer", fontsize=24, fontweight="bold", ha="center")
plt.text(1, -1, "Hidden\nLayer", fontsize=24, fontweight="bold", ha="center")
plt.text(2, -1, "Output\nLayer", fontsize=24, fontweight="bold", ha="center")
plt.show()

No description has been provided for this image

Neural Nets with the Bird's Eye

Now that we understand the structure of the feedforward network, we can begin to construct the network using Python. We will create a function with arguments for the data of the input layer, the structure of the hidden layer(s), and the number of nodes in the output layer. We will, by default, use only a single node in the final layer.

We will format our functions to use numba. This will make the script a bit more complicated, but will pay dividends by allowing for parallel processing. You will notice njit decorators appearing above some of the functions. A decorator passes the adjacent function to another function. In this case, that function is used to execute multiple iterations of the simulation simultaneously. If you would like, you can also use numba to employ your GPU. This requires additional considerations in your script. Here, we will use parallel processing.

We will first construct the functions that generate the primary components of the neural network. This is the sigmoid function and the derivative of the function that we will use for reweighting in our gradient descent (hill descending instead of climbing...) algorithm.

We create a two dimensional array to create weights that transform values from the one layer to the next. The weights array will hold two-dimensional arrays where the sizes of dimensions correspond to the number of nodes (inputs) in a given layer and the number of nodes in the following layer. Random weights are initially selected. These will be updated in the backpropagation step using the derivative of the sigmoid function.

Often, when activation matrices are weighted, the weights of the activation matrix, $a^{(l)}$, $w^{(l)}$, will be presented separately from the bias term, which is a constant. We include a column of ones in our input matrix so that this constant is generated without requiring additional steps for bias weight creation and inclusion.

In [2]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import concurrent.futures
import multiprocessing
from numba import njit
from numba.typed import List


@njit(fastmath=True, nogil=True)
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

@njit(fastmath=True, nogil=True)
def relu(x):
    # Ensure we use float literals
    return np.maximum(0.0, x)

@njit(fastmath=True, nogil=True)
def hyperbolic_tangent(x):
    return (np.exp(2 * x) - 1.0) / (np.exp(2 * x) + 1.0)

@njit(fastmath=True, nogil=True)
def activation_function(x, activation_type="sigmoid"):
    if activation_type == "sigmoid":
        return 1.0 / (1.0 + np.exp(-x))
    elif activation_type == "relu":
        return np.maximum(0.0, x)
    elif activation_type == "tanh":
        return (np.exp(2 * x) - 1.0) / (np.exp(2 * x) + 1.0)
    else:
        # Default to sigmoid if unrecognized.
        return 1.0 / (1.0 + np.exp(-x))

@njit(fastmath=True, nogil=True)
def sigmoid_deriv(x):
    return x * (1.0 - x)

@njit(fastmath=True, nogil=True)
def relu_deriv(x):
    # Return float values to match the types
    return np.where(x <= 0, 0.0, 1.0)

@njit(fastmath=True, nogil=True)
def tanh_deriv(x):
    return 1.0 - x**2

@njit(fastmath=True, nogil=True)
# function uses if statements instead of eval or macros due restriction (eval())
# and efficeincy (macro) of numba
def activation_derivative(x, activation_type="sigmoid"):
    if activation_type == "sigmoid":
        return x * (1.0 - x)
    elif activation_type == "relu":
        # Use float literals here as well
        return np.where(x <= 0, 0.0, 1.0)
    elif activation_type == "tanh":
        return 1.0 - x**2
    else:
        return x * (1.0 - x)

def initialize_weights(input_size, hidden_sizes, output_size):
    weights_list = List()
    # 1. Input -> first hidden layer
    W = np.random.randn(input_size, hidden_sizes[0]).astype(np.float64)
    weights_list.append(W)
    
    # 2. Hidden layers: each subsequent layer gets an extra bias input.
    for i in range(1, len(hidden_sizes)):
        W = np.random.randn(hidden_sizes[i - 1] + 1, hidden_sizes[i]).astype(np.float64)
        weights_list.append(W)
        
    # 3. Last hidden layer -> output
    W = np.random.randn(hidden_sizes[-1] + 1, output_size).astype(np.float64)
    weights_list.append(W)
    
    return weights_list

The next bit of code is rather complicated. We can help ourselves by identifying the functions that will be called and the order in which they will be called.

$$tune\_nn\_parallel() \rightarrow evaluate\_model() \rightarrow train\_NN()$$

We break down the train_nn() function into feedfoward(), get_loss(), and back_propogate(). This makes the iterations of the training easier to understand.

The train_NN() function begins by ensuring that the array exists within a unified block of memory in order to minimizes processing costs. We initialize the weights for training. We then identify the number of layers by counting the number of weight matrices that are in the weights list. After initializing an array to hold record history across epochs, we being cycling through the epochs.

Feedforward

For each epoch, the first function called is the feed_forward() function. This function calculates the weighted values transmitted from one layer to the next. These weighted values are used to determine if gates are activated. We append $X$ to activations so that we can call the same process in the for loop that cycles through the hidden layers. Thus the $X$ input matrix is $a^{(1)}$. We denote the layer in parentheses as an indication that we are not raising the activation matrix to some power. We will refer to the layer as $l$. The $l^{th}$ element in activations holds the values of the input for that layer. This is the values of the actual input layer in the very first iteration of the loop. We take the product of the inputs and the weights at each layer. $$z^{(l)}=a^{(l)}W^{(l)}$$

The activation matrix is two dimension. Thus, $z^{(l)}$ reweights $a^{(l)}$ using the $w^{(l)}$. $z^{(l)}$ is passed to the sigmoid function to generate $a^{(l+1)}$: $$a^{(l+1)} = \frac{1}{1+e^{-z^{(l)}}}$$

The sigmoid function generates the next series of inputs normalized between 0 and 1. This array lacks a bias term (constant), so we add a column of ones to the bias array. This new two-dimensional array, $a^{(l+1)}$, is appended to activations, and the process repeats for the remaining hidden layers. While it was convient to label the hidden layers activation matrices as $H$ in the network graph, we will refer to layer using the notation $a^{(l)}$ since the X matrix and H matrices are all activation matrices. We call the appended matrix, $a^{(l)}$ and calculate $z^{(l)}=a^{(l)}W^{(l)}$. As before, pass this to the sigmoid function, which normalizes the outputs to create the activation matrix for the next layer.

After cycling through all of the input and hidden layers, we calculate the output and pass it to the sigmoid function. After completing this process, the feed_forward() function returns the list of activations and the sigmoid outputs from the process (a_no_bias_list). The last element in the a_no_bias_list is the predicted output for each observation. Values from this list are subtracted from the corresponding values from the list of observed outputs. This value is the error. Our loss function takes the mean of the sum of squared errors: $$\mathbf{L}=\frac{\Sigma_{i=0}^n{(y - \hat{y})^2}}{n}$$

Notice the output maintains continuous values, though if one is using categorical data, the output can be interpreted as prediction in the form of probability. The value of $L$ scores the overall accuracy of the neural network outputs.

In [3]:

@njit(fastmath=True, nogil=True)
def feed_forward(X, Y, weights, n_layers, activation_type="sigmoid"):
    # --- Forward Pass ---
    activations = List()        # Will hold activations (with bias added when needed)
    a_no_bias_list = List()      # Will hold activations before bias is added
    activations.append(X)
    
    for i in range(n_layers - 1):
        a_prev = activations[i]
        z = a_prev.dot(weights[i])
        a_no_bias = activation_function(z, activation_type)
        a_no_bias = np.ascontiguousarray(a_no_bias)
        a_no_bias_list.append(a_no_bias)
        n_samples = a_no_bias.shape[0]
        bias = np.ones((n_samples, 1), dtype=a_no_bias.dtype)
        bias = np.ascontiguousarray(bias)
        a_with_bias = np.concatenate((a_no_bias, bias), axis=1)
        a_with_bias = np.ascontiguousarray(a_with_bias)
        activations.append(a_with_bias)
    
    # Output layer (no bias appended)
    z = activations[-1].dot(weights[n_layers - 1])
    a_out = activation_function(z, activation_type)
    a_out = np.ascontiguousarray(a_out)
    activations.append(a_out)    
    
    return activations, a_no_bias_list

@njit(fastmath=True, nogil=True)
def get_loss(y, a_out):
    # Compute error and mean squared loss
    error = y - a_out
    loss = np.mean(error * error)
    return loss, error

Backpropagation

Next, we calculate the delta ($\delta$)for each layer. We work backward through the network, starting with the derivative $$ \frac{\partial \mathbf{L}}{\partial \mathbf{w}^{(N)}} $$ where $N$ is the index of the last layer. The partial derivative is used to update the weights of the network. Throughout this process, the weight for the bias term is removed.

In the script, backpropagation begins with: $$\delta^{(N)} =error * activation\_derivative(a\_out, activation\_type)$$

where a_out is the final activation output. This yields the delta for the last layer.

For each preceding layer, we compute the delta as follows. We take the dot product of the most recently computed delta, $\delta^{l+1}$ for layer $i$, and the transposed weight matrix of the layer ahead. After this, we remove the bias component and perform element-wise multiplication with the derivative of the activation function (evaluated on the pre-bias activations) to obtain: $$ \delta^{(l)} = \left( \delta^{(l+1)} \cdot \left(\mathbf{w}^{(l+1)}\right)^T \right)_{\text{without bias}} \odot \phi'\left(z^{(l)}\right) $$ where $\phi'$ is the derivative of the activation function, and $z^{(l)}$ represents the pre-activation values (before the bias is added).

For each layer $l$, the gradient (i.e., the partial derivative of the loss with respect to the weights $\mathbf{w}^{(l)}$ is computed by taking the dot product of the transposed activations from the forward pass and the corresponding delta: $$ \frac{\partial L}{\partial \mathbf{w}^{(l)}} = \left(\mathbf{a}^{(l-1)}\right)^T \delta^{(l)} $$ This gradient is then used to update the weights: $$ \mathbf{w}^{(l)} \leftarrow \mathbf{w}^{(l)} + \eta \, \frac{\partial L}{\partial \mathbf{w}^{(l)}} $$ Here, $\eta$ is the learning rate. (The update is additive because the error is defined as $y - a^N$ subtract the predictor from the observed.)

Digression on the Chain Rule in Backpropagation

It is useful here to clarify the use of the chain rule with regard to backpropagation. This can be found in most texts that provide a detailed explanation of neural networks with backpropagation. (A good start for those with programming background would be Training a Neural Network chpater of Practical Deep Learning by Ronald T. Kneusel.) I do not elborate here, though those who look around may find an Easter egg.

In [4]:

@njit(fastmath=True, nogil=True)
def back_propagate(weights, activations, a_no_bias_list, a_out, n_layers, learning_rate, error, activation_type="sigmoid"):
    # --- Backward Pass ---
    delta = error * activation_derivative(a_out, activation_type)
    delta = np.ascontiguousarray(delta)
    deltas_rev = List()
    deltas_rev.append(delta)
    for i in range(n_layers - 1, 0, -1):
        delta = deltas_rev[-1].dot(weights[i].T)
        delta = np.ascontiguousarray(delta)
        # Remove the bias gradient (last column)
        delta = delta[:, :-1]
        delta = np.ascontiguousarray(delta)
        a_no_bias = a_no_bias_list[i - 1]
        delta = delta * activation_derivative(a_no_bias, activation_type)
        delta = np.ascontiguousarray(delta)
        deltas_rev.append(delta)
    # Reverse the list so that deltas[0] corresponds to the first weight matrix.
    deltas = List()
    for i in range(len(deltas_rev) - 1, -1, -1):
        deltas.append(deltas_rev[i])
        
    # --- Weight Update ---
    for i in range(n_layers):
        grad = activations[i].T.dot(deltas[i])
        weights[i] = weights[i] + learning_rate * grad
    return weights

The feedforward and back_propagate processes are repeated until either the loss is sufficiently low (here this value is $0.0001$) or until the number of epochs executed reaches the maximum number set by the epochs parameter. Since we will be using train_NN() over many iterations, we must record the results. The function returns the final weights and the evolution of the loss function output.

In [5]:

def train_NN(X, y, epochs, learning_rate, hidden_sizes, output_size, activation_type="sigmoid"):

    # Ensure X and y are contiguous.
    input_size = X.shape[1]
    X = np.ascontiguousarray(X)
    y = np.ascontiguousarray(y)
    weights = initialize_weights(input_size, hidden_sizes, output_size)
    n_layers = len(weights)
    loss_history = np.empty(epochs, dtype=np.float64)
    # shuffle to avoid bias from order of data
    np.random.shuffle(X)
    for epoch in range(epochs):
        activations, a_no_bias_list = feed_forward(X, y, weights, n_layers, activation_type)
        a_out = activations[-1]
        loss, error = get_loss(y, a_out)
        weights = back_propagate(weights, activations, a_no_bias_list, a_out, 
                                 n_layers, learning_rate, error, activation_type)
        loss_history[epoch] = loss
        
        # --- Early Stopping ---
        if loss < 1e-4:
            for j in range(epoch + 1, epochs):
                loss_history[j] = loss
            break
    
    return weights, loss_history

The evaluate_model() function calls the train_NN() function, returning a dictionary of the loss history, the sizes of the hidden layers, the learning rate, the number of epochs, the simulation number, the X and y matrices, the number of output nodes (output_size). We run a parameter sweep with 10 iterations of each setting. One interesting feature you will notice in the results is that some iterations get stuck in local minima. These are indicated by the solid flat lines in the Loss by Epoch graph.

Also note that the ranking I have chosen here measures the speed to 100% convergence. We do not know what algorithm would perform the best because we have chosen an arbitrary, but reasonable, loss function value of $10^{-4}$ for exiting the training. Finally, since we are not using the weights, I do not save the weights for the winning netowork(s). The winners will need to be saved for testing the neural network.

From the main loop, we rerun the sweep 10 times with unique RNG seeds in order to compare results. In our case, we generated the best results most consistently using 1 hidden layer (The number 31 indicates the number of variables plus the constant.) Because this is executed using parallel processing with numba, execution is quite fast. If you want to see how slow it would go without parallel processing, limit max_workers to 1.

In [6]:

def evaluate_model(params):

    hidden_sizes, lr, epochs, sim, X, y, input_size, output_size, activation_type = params
    # np.random.seed(sim)  # Uncomment for reproducibility if desired
    weights, loss_history = train_NN(X, y, epochs, lr, hidden_sizes,  output_size, activation_type)

    # print(f"Simulations: {sim}; Final loss: {loss_history[epochs - 1]}")
    loss_dict = {}
    for i in range(epochs):
        loss_dict[i] = loss_history[i]
    return loss_dict, hidden_sizes, lr, epochs


def tune_nn_parallel(X, y, num_sims=10, output_size=1, epochs=[10000], learning_rates=[0.1], num_hidden_layers=[2], activation_type="sigmoid"):
    input_size = X.shape[1]
    # Convert DataFrame values to float64.
    X = X.values.astype(np.float64)
    y = y.values.astype(np.float64)
    
    hidden_sizes_options = [[input_size for _ in range(hidden_layer_length)] for hidden_layer_length in num_hidden_layers]
    epochs_options = epochs
    
    params_list = []
    for sim in range(num_sims):
        for h, hs in enumerate(hidden_sizes_options):
            for l, lr in enumerate(learning_rates):
                for e, epoch in enumerate(epochs_options):
                    sim_num = sim * len(hidden_sizes_options) * len(learning_rates) * len(epochs_options) \
                              + h * len(learning_rates) * len(epochs_options) \
                              + l * len(epochs_options) + e

                    params_list.append((hs, lr, epoch, sim_num, X, y, input_size, output_size, activation_type))
    
    with concurrent.futures.ThreadPoolExecutor(max_workers=multiprocessing.cpu_count()) as executor:
        results = list(executor.map(evaluate_model, params_list))
    
    loss_dict_list = [r[0] for r in results]
    loss_df = pd.DataFrame(loss_dict_list).T
    loss_df.rename(columns={i: f"{hs}{lr}" for i, (loss, hs, lr, epoch) in enumerate(results)}, inplace=True)
    return loss_df


def main():
    best_dict = {}
    activation_type = "sigmoid"
    for seed in range(10):
        np.random.seed(seed)
        # Create a small dataset.
        n = 100
        k = 30  # number of raw features
        x_names = [f"x{i}" for i in range(1, k + 1)]
        X_data = [np.random.randint(0, 2, k) for _ in range(n)]
        y_data = np.random.randint(0, 2, n).reshape(-1, 1)
        learning_rates = [0.1, 0.2, 0.3]
        epochs = 100000
        df = pd.DataFrame(X_data, columns=x_names)
        df['y'] = y_data
        X_df = df[x_names]
        X_df["bias"] = 1
        y_df = df[['y']]
        
        output_size = 1
        num_sims = 10
        num_hidden_layers = [i for i in range(1, 4)]  # Try 1 to 3 hidden layers
        # build kwargs
        kwargs = {"X":X_df, "y":y_df, 
                  "num_sims": num_sims, "output_size": output_size, "epochs": [epochs],
                  "num_hidden_layers": num_hidden_layers, "learning_rates": learning_rates,
                  "activation_type": activation_type}
        loss_df = tune_nn_parallel(**kwargs)  # Change to "relu" or "tanh" to test other activations.
        
        loss_keys = list(loss_df.keys())
        # Extract hidden layer and learning rate info from column names.
        hidden_layers = []
        learning_keys = []
        for key in loss_keys:
            split_index = key.find("]")
            hidden_layers.append(key[:split_index+1])
            learning_keys.append(key[split_index+1:])
        hidden_layers = set(hidden_layers)
        learning_keys = set(learning_keys)
        hidden_layer_colors = {hidden_layer: f"C{i}" for i, hidden_layer in enumerate(hidden_layers)}
        learning_key_alphas = {learning_key: 1 - 0.05 * i for i, learning_key in enumerate(learning_keys)}
        ID = f"Set: {seed}"
        # Fix the list comprehension syntax for the color argument.
        title = f"Loss by Epoch\n {ID}"
        fig, ax = plt.subplots(figsize=(20,10))
        loss_df.plot(legend=False, ls="", marker=".", markersize=0.5, alpha=0.02, 
                     color=[hidden_layer_colors[key[:key.find("]")+1]] for key in loss_keys],
                     fontsize = 28, ax = ax )
        ax.set_title(title, fontsize=28)
        

        fig, ax = plt.subplots(figsize=(20,10))
        converged_to_1 = {}

        for i, hidden_layer in enumerate(hidden_layers):
            # hidden layer is a list transformed to a string. I want to count the number of elements in the list
            num_hl = hidden_layer.count(",") + 1
            converged_to_1[num_hl] = {}
            hidden_layer_df = loss_df[[key for key in loss_keys if hidden_layer in key]]
            
            for j, learning_key in enumerate(learning_keys):
                learning_key_df = hidden_layer_df[[key for key in hidden_layer_df.keys() if learning_key in key]]
                pct_less_than_1e4 = (learning_key_df < 1e-4).mean(axis=1)
                pct_less_than_1e4.plot(ax=ax,
                                       color=f"C{i * len(learning_keys) + j}", 
                                       label=f"HL: {num_hl}, LR: {learning_key}",
                                       legend=False, fontsize=28)
                converged_to_1[num_hl][str(learning_key)] = (pct_less_than_1e4 == 1.0).mean()
        ID = f"Set: {seed}"
        title = f"Percentage of Simulations with Loss < 1e-4 by Epoch\n {ID}"
        ax.set_title(title, fontsize=28)
        ax.legend(ncol=1, fontsize=24, 
                  # set above plot on right side
                    bbox_to_anchor=(0.99, 1.0225), loc='upper left')
                  
        plt.show()

        # find earliest convergence to 1.0
        # find epoch where 100% of simulations have converged for each parameter set
        converged_to_1 = pd.DataFrame(converged_to_1)
        # sort keys
        converged_to_1 = converged_to_1[sorted(converged_to_1.keys())]
        # sort index
        converged_to_1 = converged_to_1.sort_index()
        # find column and index pair with highest value
        max_row_index, max_col_index  = np.unravel_index(converged_to_1.values.argmax(), converged_to_1.shape)
        max_col, max_row = converged_to_1.columns[max_col_index], converged_to_1.index[max_row_index]
        best_dict[seed] = f"{max_col}, {max_row}"
    pd.DataFrame(best_dict, index = ["Best Parameters"]).T.value_counts().plot(kind="bar", figsize=(20,12), fontsize=28)
    plt.title("Hyperparameters Counts: (Hidden Layers, Learning Rate)", fontsize=28)

if __name__ == '__main__':
    try:
        multiprocessing.set_start_method("fork")
    except RuntimeError:
        pass
    multiprocessing.freeze_support()
    main()

Note: This was a significantly more involved presentation than the previous posts on the evolution of neural networks. I anticipate their may be a handful of revisions to the blog in the week following the initial posting. I will not be indexing these revisions.

In [ ]:

InCognito

Building Neural Networks With Backpropagation from Scratch

Building Neural Networks With Backpropagation from Scratch

Related Posts

The Perceptron

The Context and Structure of Early Neural Networks: McCulloch and Pitts

Building Agent-based Models in JavaScript (Schelling Segregation)