2.3 Multi-layer perceptrons

The feed-forward ANN described above is also known as a Multi-Layer Perceptron (MLP). (The name "perceptron" was coined by Rosenblatt in the late 1950s.) The MLP is capable of approximating any continuous mathematical function to any desired accuracy, if there are a sufficient number of neurons. This refers to the strict definition of a function, which is a rule that specifies the outputs given any inputs within a specified domain. For example, if x and y are inputs and z is an output, then the function z = f(x,y) = sin(x)/exp(y) over the domain where -2 < x < 2, -3 < y < 3 could be approximated arbitrarily closely by an MLP, given sufficient neurons (and given that z is rescaled to lie between 0 and 1, the range of the output sigmoids).

2.3.1 The MLP formulae in matrix notation

We now present the formulae of an MLP using a matrix notation. This section may be skipped without loss of continuity.

Consider the mapping from the MLP inputs to the final inputs to each hidden layer sigmoid. Write the vector of inputs as x, and the vector of hidden layer sigmoid inputs as y1. Note that the mapping from x to y1 is a linear transformation:

y1 = W*x + C

where W is the matrix of weights and C is the vector of offsets (or biases). To distinguish this from the second linear transformation (between the hidden nodes and the output nodes), I will label this one with a 2 and write it as:

y1 = W2*x + C2
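
As a minimal sketch of this linear step in plain Python (no external libraries), the hypothetical helper affine below computes W*x + C for a matrix stored as a list of rows. The helper name and all the numbers are made up purely for illustration; here there are 3 inputs and 2 hidden nodes.

```python
def affine(W, x, C):
    """Return W*x + C, where W is a list of rows and x, C are lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + c
            for row, c in zip(W, C)]

# Illustrative values only: 3 inputs, 2 hidden nodes.
W2 = [[1.0, 0.0, -1.0],
      [0.5, 0.5, 0.5]]
C2 = [0.0, 1.0]
x = [1.0, 2.0, 3.0]

y1 = affine(W2, x, C2)
print(y1)  # prints [-2.0, 4.0]
```

Each row of W2 produces one element of y1, so y1 has one entry per hidden node.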

The next step in the flow through the network is to apply the sigmoid non-linearity to each element of the vector y1. Because this function application is purely element-wise, it is convenient to write this simply as:

f(y1),     where f(x) = 1/(1+exp(-x))       --- the sigmoid function

To make this clear with an example, if y1 = [1, 2]' (a vector containing 2 numbers, where the ' means transpose, i.e. this is actually a column vector rather than a row vector), then

f([1, 2]') means [f(1), f(2)]'

Clearly this type of element-wise notation preserves the dimensions of its input.
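
The element-wise convention can be sketched in plain Python as follows. The names f and f_vec are my own choice for this illustration; the text itself uses f for both the scalar and the vector case.

```python
import math

def f(x):
    """The sigmoid function 1/(1+exp(-x)) for a single number."""
    return 1.0 / (1.0 + math.exp(-x))

def f_vec(v):
    """Apply the sigmoid to every element of a vector (a list)."""
    return [f(vi) for vi in v]

y1 = [1.0, 2.0]
y2 = f_vec(y1)

print(y2)                   # approximately [0.731, 0.881]
print(len(y2) == len(y1))   # prints True: the dimensions are preserved
```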

Returning to the MLP, the output of the hidden nodes is the vector

y2 = f(y1)

This vector then undergoes another linear transformation in the same way as for the input. Thus the input to the output layer sigmoids is

y3 = W1*y2 + C1

Lastly, applying the output sigmoids we reach the output of the MLP

z = f(y3)

In summary, the pipeline can be drawn (view with a fixed-width font):

x -> linear map -> y1 -> sigmoids -> y2 -> linear map -> y3 -> sigmoids -> z
       W2, C2                                W1, C1

where x, y1, y2, y3, and z are all vectors and the sigmoid functions are applied element-wise (element-wise functions are sometimes called universal functions).

Eliminating y1, y2, and y3 results in the reasonably compact notation:

z = f(W1 * f(W2*x + C2) + C1)
           `----y2----'

where, to emphasize the structure, I have marked y2, the output from the hidden layer nodes.
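
The whole pipeline can be sketched in a few lines of plain Python. This is only an illustration under assumptions of my own: the helper names affine and mlp are hypothetical, and the weights are arbitrary numbers (2 inputs, 3 hidden nodes, 1 output), not trained values. The sketch computes z once step by step, using the intermediate vectors y1, y2, y3 from the text, and once using the nested formula.

```python
import math

def f(v):
    """Element-wise sigmoid of a vector (a list of numbers)."""
    return [1.0 / (1.0 + math.exp(-vi)) for vi in v]

def affine(W, x, C):
    """Return W*x + C, where W is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + c
            for row, c in zip(W, C)]

def mlp(x, W2, C2, W1, C1):
    """The compact form: z = f(W1 * f(W2*x + C2) + C1)."""
    return f(affine(W1, f(affine(W2, x, C2)), C1))

# Arbitrary illustrative weights: 2 inputs, 3 hidden nodes, 1 output.
W2 = [[0.5, -0.5], [1.0, 1.0], [-1.0, 2.0]]
C2 = [0.0, -1.0, 0.5]
W1 = [[1.0, -2.0, 0.5]]
C1 = [0.1]

x = [1.0, 2.0]

# Step by step, naming the intermediate vectors as in the text.
y1 = affine(W2, x, C2)   # inputs to the hidden layer sigmoids
y2 = f(y1)               # outputs of the hidden nodes
y3 = affine(W1, y2, C1)  # inputs to the output layer sigmoids
z = f(y3)                # output of the MLP

print(z)
print(z == mlp(x, W2, C2, W1, C1))  # prints True: both forms agree
```

Because the output passes through a sigmoid, every element of z lies strictly between 0 and 1.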

As a final "checksum", if there are n_in inputs, n_hidden hidden nodes, and n_out output nodes, then:

  W2 is a matrix with n_in columns and n_hidden rows
  C2 is a column vector with n_hidden rows
  W1 is a matrix with n_hidden columns and n_out rows
  C1 is a column vector with n_out rows.
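
The same check can be written as a short plain-Python sketch. The function names shape and check_shapes are hypothetical, and the zero-filled arrays are placeholders standing in for real weights.

```python
def shape(M):
    """Return (rows, columns) of a matrix stored as a list of rows."""
    return (len(M), len(M[0]))

def check_shapes(W2, C2, W1, C1, n_in, n_hidden, n_out):
    """Raise AssertionError if any array disagrees with the layer sizes."""
    assert shape(W2) == (n_hidden, n_in), "W2 should be n_hidden x n_in"
    assert len(C2) == n_hidden, "C2 should have n_hidden rows"
    assert shape(W1) == (n_out, n_hidden), "W1 should be n_out x n_hidden"
    assert len(C1) == n_out, "C1 should have n_out rows"
    return True

# Zero-filled placeholders with 2 inputs, 3 hidden nodes, 1 output.
n_in, n_hidden, n_out = 2, 3, 1
W2 = [[0.0] * n_in for _ in range(n_hidden)]
C2 = [0.0] * n_hidden
W1 = [[0.0] * n_hidden for _ in range(n_out)]
C1 = [0.0] * n_out

ok = check_shapes(W2, C2, W1, C1, n_in, n_hidden, n_out)
print(ok)  # prints True
```

Accidentally swapping W1 and W2 would trip the first assertion, which is the kind of mistake this check is meant to catch.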

2.4 Comparison with biological networks (and Turing machines)

It is important to remember that an MLP is a very simple model of a biological neural network. In fact, in many ways it is irrelevant that it has a biologically motivated background; it may simply be thought of as a general purpose mathematical function.

Two important ways in which an MLP is much simpler than a biological network:

  1. no temporal effects,
  2. no maintained state.

Because it lacks these two properties, an MLP cannot have feedback (unlike a biological network). For the same reason, the MLP is not a Turing machine and cannot act as a general purpose computer: it has no RAM! The values of the weights can be thought of as values in ROM.
