In the early years of artificial intelligence and computing, neurons were quickly considered for their role as computing devices. In an earlier post I discussed the mapping of neurons by McCulloch and Pitts (1943). That mapping was part of the same enterprise of envisioning various computing machines. This is not to claim that those engaged in this work viewed these mappings as perfectly realistic. Rather, the mappings captured the idea that nervous systems could be described as rule-based, deterministic systems, a necessary feature of computing machines.
The Perceptron is essentially a device that generates a line for categorizing (separating) observations. If you are familiar with Logit and Probit models, you already have a sense of the Perceptron's goal. The Perceptron arrives at a separating line of good fit by initially guessing the defining parameters. The algorithm weights the input values and sums them, and observations are categorized by the magnitude of the resulting output. When the sum of the weighted values is greater than a threshold value, $\theta$, the observation is assigned the category associated with relatively high output values. Observations whose sums fall below the threshold are assigned the category associated with relatively low output values.
Predictions are marginally adjusted by shifting the weights. When a prediction falls below the threshold but should be above it, the weights marginally increase. When a prediction is incorrectly above the threshold, the weights decrease. This simultaneously shifts the separating line and changes its slope. Through this automatic feedback, the accuracy of prediction tends to improve as the algorithm repeatedly iterates through the observations.
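To make the geometry concrete: in two dimensions the weights and threshold together imply a boundary line. A minimal sketch (the function name `separating_line` and the sample values are my own illustration, not part of the original script):

```python
# The boundary consists of the points where the weighted sum equals
# the threshold: w1*x1 + w2*x2 = theta. Solving for x2 gives a line
# with slope -w1/w2 and intercept theta/w2.
def separating_line(w1, w2, theta):
    """Return (slope, intercept) of the boundary implied by the weights."""
    return -w1 / w2, theta / w2

slope, intercept = separating_line(w1=1.0, w2=2.0, theta=1.0)
# boundary: x2 = 0.5 - 0.5 * x1
```

Shifting the weights therefore moves both the slope and the intercept of the line at once, which is exactly the adjustment described above.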
Adaptive Learning
Before formally outlining and demonstrating the functioning of a Perceptron, it is appropriate to identify its significance. An artificially intelligent agent must be able to interact with an environment in an orderly manner, typically in pursuit of some goal (e.g., accurate classification, which may be a means to some higher-level goal like survival). Learning requires adaptation. There are multiple means by which an agent or system can adapt. In previous work, I have leveraged the power of collective learning to demonstrate that the coordinating powers of markets do not require agents to be particularly intelligent. Rather, some minimum level of experimentation and some mechanism(s) driving adoption of successful innovation are all that is required. Concerning real-world corollaries, successful entrepreneurs might succeed in mentoring other entrepreneurs, or other entrepreneurs may simply copy their most successful competitors. This is sufficient not only for survival: learning in this manner supports innovation capable of increasing the carrying capacity of the economic system. That is, competitive markets naturally generate innovations that improve the productivity of the average entrepreneur.
The approach taken by Rosenblatt focuses on the ability of an individual agent to learn. The metaphor considered is one of vision. How can an observer learn to differentiate classes of observations based on some limited number of features that define those observations? There was a clear sense that the nervous system played an important role in such categorization. In this light, Pitts and McCulloch (1947) open their discussion of the relationship between perception and the nervous system, noting that:
Numerous nets, embodied in special nervous structures, serve to classify information according to useful common characters. In vision they detect the equivalence of apparitions related by similarity and congruence, like those of a single physical thing seen from various places. In audition, they recognize timbre and chord, regardless of pitch. The equivalent apparitions in all cases share a common figure and define a group of transformations that take the equivalents into one another but preserve the figure invariant. . . We seek general methods for designing nervous nets which recognize figures in such a way as to produce the same output for every input belonging to the figure. We endeavor particularly to find those which fit the histology and physiology of the actual structure. (127-128)
Rosenblatt draws from Pitts and McCulloch (1947); he would have been familiar with the passage quoted above and cites the work elsewhere. He makes special effort to identify inspiration from the connectionist approaches of Donald Hebb (1949) and F. A. Hayek (1952). Three years later, Rosenblatt provides a bit more narrative regarding these citations:
Hebb (Ref. 33) and Hayek (Ref. 32), following the tradition of James Stuart Mill and Helmholtz, have attempted to show how an organism can acquire perceptual capabilities through a maturational process. For Hayek, the recognition of the attributes of a stimulus is essentially a problem in classification [emphasis mine], and his point of view has inspired Uttley (Refs. 101, 102) to design a type of classifying automaton which attempts to translate the approach into more rigorous mathematical form. Hebb's model is more detailed in its biological description, and suggests a process by which neurons which are frequently activated together become linked into functional organizations called "cell assemblies" and "phase sequences" which, when stimulated, correspond to the evocation of an elementary idea or percept.
Hayek also follows a similar approach with regard to neurons being activated in groups. In the preface to The Sensory Order, Hayek recognizes the similarity in approach and indicates that his work was generated independently, having discovered Hebb after the fact:
It seems as if the problem discussed here were coming back into favour and some recent contributions have come to my knowledge too late to make full use of them. This applies particularly to Professor D. O. Hebb's Organization of Behaviour which appeared when the final version of the present book was practically finished. That work contains a theory of sensation which in many respects is similar to the one expounded here; and in view of the much greater technical competence of the author I doubted for a while whether publication of the present book was still justified. In the end I decided that the very fullness with which Professor Hebb has worked out the physiological detail has prevented him from bringing out as clearly as might be wished the general principles of the theory; and as I am concerned more with the general significance of a theory of that kind than with its detail, the two books, I hope, are complementary rather than covering the same ground.
Rosenblatt seems to have identified this distinction, as I have indicated with emphasis in his description of the significance of Hayek (1952). This raises a question that I will not pursue here: if Rosenblatt, one of the earliest developers of neural networks, credited Hayek for inspiring his contribution, why does Hayek not receive credit comparable to Donald Hebb, who shows up regularly in the list of early citations in the neural network literature?
Building and Applying a Perceptron
While I have described the structure of a Perceptron, a formal presentation will remove any remaining opaqueness in the description of the processes that define it.
The key idea is that the Perceptron uses weights to predict the identity of an observation. We presume that identity can be divided between two distinct classes; that is, we predict that an observation is either $A$ or $\neg A$.
We can generalize to $n$ inputs as follows.
$$\hat{y} = w_bb + \sum^n_{i=1}{w_ix_i} = w_bb + w_1x_1 + ... + w_nx_n$$
If you would like, recognize that $b$ is a column of $1$ values and call it $x_0$, with $w_b = w_0$, so that
$$\hat{y} = \sum^n_{i=0}{w_ix_i} = w_0x_0 + w_1x_1 + ... + w_nx_n$$
The weighted sum of the inputs is compared to the threshold value, $\theta$. The observation is assigned the high-output category when
$$\sum^n_{i=1}{w_ix_i} > {\theta}$$
Moving $\theta$ to the left-hand side, it is easy to see that $-\theta$ stands in for the bias term $w_bb$, with $b$ playing the role of $x_0$.
$$\sum^n_{i=1}{w_ix_i} - \theta = \sum^n_{i=1}{w_ix_i} + w_bb$$
$$w_bb = -\theta$$
The value $\hat{y}$ is transformed to $\hat{y}'$, a discrete value of either $1$ or $-1$.
$$\hat{y}' = \left\{
\begin{array}{ll}
1 & \text{if } \hat{y} > 0 \\
-1 & \text{otherwise}
\end{array}
\right.
$$
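This mapping from inputs to a $1$ or $-1$ label can be sketched directly in Python. A minimal version, assuming NumPy and my own conventions for the function name `predict` and for storing the bias weight as `w[0]`:

```python
import numpy as np

def predict(X, w):
    """Classify each row of X as 1 or -1 using the weighted sum.

    X is an (m, n) array of observations; w is a length n + 1 weight
    vector whose first entry plays the role of the bias weight w_b
    (equivalently, -theta), with b = 1 for every observation.
    """
    y_hat = w[0] + X @ w[1:]           # y-hat = w_b * b + sum_i(w_i * x_i)
    return np.where(y_hat > 0, 1, -1)  # 1 if y-hat > 0, otherwise -1
```

For example, with `w = np.array([-0.5, 1.0, 1.0])`, the point $(1, 1)$ yields $\hat{y} = 1.5$ and is labeled $1$, while the origin yields $\hat{y} = -0.5$ and is labeled $-1$.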
Updating Weights
After each iteration, the weights are updated. If an observation is incorrectly classified, each weight is adjusted by the product of the learning rate, $\eta$, the prediction error, $(y_j - \hat{y}_j')$, and the corresponding input value.
$$w_{j+1} = w_j + \Delta w_j$$
$$\Delta w_j = \eta (y_j - \hat{y}_j')x_j$$
Notice that if $\hat{y}_j'=y_j$, then $\Delta = 0$ and $w_{j+1} = w_j$.
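Placed in a loop over the observations, the update rule yields the complete learning procedure. A minimal sketch, assuming labels coded as $1$/$-1$ and my own choices for the function name, hyperparameters (`eta`, `n_iter`), and random initialization:

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, n_iter=20, seed=0):
    """Fit Perceptron weights by repeated application of the update rule.

    w[0] serves as the bias weight w_b (with b = 1); eta is the
    learning rate and n_iter the number of passes over the data.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1] + 1)  # initial guess
    for _ in range(n_iter):
        for x_j, y_j in zip(X, y):
            y_hat = 1 if w[0] + x_j @ w[1:] > 0 else -1
            delta = eta * (y_j - y_hat)  # zero when the prediction is correct
            w[0] += delta                # bias update, since b = 1
            w[1:] += delta * x_j         # Delta w_i = eta * (y - y-hat) * x_i
    return w
```

On linearly separable data the loop settles on weights that classify every observation correctly; once every prediction is right, $y_j - \hat{y}_j' = 0$ and the weights stop changing.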
Below, I include the script for visualizing this process. In the final step, only one edge is really needed since $y_j-\hat{y}'_j=0$, but I find the outcome easier to read when the case where $w_{j+1}=w_j$ is kept separate from the case where $w_{j+1}\ne w_j$.