Neural networks
Using these equations, it is possible to define a neural network and generate outputs based on inputs. The way in which the output is generated is of course affected by the weights and biases of the network. Generally, their values will be randomized initially, often using a normal distribution to keep values within a certain range and avoid overflows. However, this will not produce accurate results and the output of the network is likely to be garbage.
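As a concrete illustration of this, the following is a minimal sketch of a feed-forward network with normally-distributed initial weights; the layer sizes, the sigmoid activation, and the standard deviation of 0.1 are illustrative choices, not specifics from the essay.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    # Weights drawn from a normal distribution keeps them small and
    # avoids overflow in the activation; biases start at zero.
    return rng.normal(0.0, 0.1, (n_out, n_in)), np.zeros(n_out)

def forward(params, x):
    # Pass the input through each layer with a sigmoid activation.
    for W, b in params:
        x = 1.0 / (1.0 + np.exp(-(W @ x + b)))
    return x

# A tiny 3 -> 4 -> 2 network with untrained (random) parameters:
params = [init_layer(3, 4), init_layer(4, 2)]
output = forward(params, np.array([0.5, -0.2, 0.1]))
```

With the parameters still random, `output` is exactly the kind of "garbage" prediction described above; training is what turns it into something useful.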
We must, therefore, find a way to find an optimal, or at least satisfactory, set of weights and biases to produce the desired results. This is where the main topic of the essay, training, becomes relevant.
The first problem is to find a way to evaluate the performance of a neural network so that we can adjust its parameters based on this. The most common way of achieving this is by using what is known as a loss function to calculate the 'loss' between the value produced by the network and the actual value. (Abdi, 1994)
The function can be anything which compares the two, even something as simple as:
𝑓(𝑝, 𝑞) = |𝑝 − 𝑞|
where p is the expected result, q is the result produced by the network, and |x| denotes the magnitude of the vector x. Generally speaking, a more sophisticated function tailored to the particular problem performs better, but a function such as the one above would be sufficient.
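The loss function above is simple enough to state directly in code; this sketch just computes the magnitude of the difference vector with NumPy.

```python
import numpy as np

def loss(p, q):
    # f(p, q) = |p - q|: the magnitude (Euclidean norm) of the
    # difference between the expected and produced vectors.
    return np.linalg.norm(p - q)

# Expected output [1, 0] against a network prediction of [0.8, 0.1]:
value = loss(np.array([1.0, 0.0]), np.array([0.8, 0.1]))
```

A loss of zero would mean a perfect prediction; the larger the value, the worse the network performed on that sample.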
Option one: gradient descent
Using the loss function, we are able to evaluate the accuracy of the prediction, but now we must use this information to optimize the network by altering its parameters (weights and biases) (Abdi, 1994).
If we define θ as the set of parameters of some function, then the partial derivative of the function with respect to a parameter θᵢ ∈ θ shows us the impact of θᵢ on the value of the function. We can apply this to our situation by taking the partial derivative of the loss function with respect to every single parameter in the network and then adjusting the parameters based on this information. We can take a set of training data accompanied by correct labels and test the network on every item in the set before adjusting the parameters based on its performance across all the tests (Abdi, 1994).
This can be expressed mathematically as follows:
𝜃ⱼ → 𝜃ⱼ − 𝜂 ∑ᵢ ∂𝐽(𝑥⁽ⁱ⁾)/∂𝜃ⱼ,  ∀ 𝜃ⱼ ∈ 𝜃
where 𝑥⁽ⁱ⁾ denotes the iᵗʰ sample and η is the learning rate. The choice of learning rate is incredibly important to the efficacy of the algorithm. A small learning rate allows for smaller, more accurate adjustments but takes longer to approach the optimal setup, whereas a larger learning rate causes the algorithm to converge more quickly but is likely to be less accurate and less stable. Generally, a small learning rate is the superior choice if there is a sufficiently large data set to train the network on. However, in practice obtaining large data sets is often a significant task, and so a larger learning rate is better (Bottou, 2012).
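The update rule can be sketched as follows. For simplicity this sketch estimates each partial derivative ∂J/∂θⱼ with a central finite difference rather than backpropagation (an assumption made purely for illustration), and the toy loss, the sample set, and the learning rate of 0.1 are likewise hypothetical.

```python
import numpy as np

def gradient_descent_step(J, theta, samples, eta=0.1, eps=1e-6):
    # For each parameter theta_j, sum dJ(x_i)/dtheta_j over every
    # sample (estimated by a central finite difference), then move
    # against the summed gradient, scaled by the learning rate eta.
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = sum((J(theta + step, x) - J(theta - step, x)) / (2 * eps)
                      for x in samples)
    return theta - eta * grad

# Toy problem: choose theta so that theta . x matches a target of 1.0.
J = lambda theta, x: (theta @ x - 1.0) ** 2
theta = np.array([0.0, 0.0])
samples = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
for _ in range(100):
    theta = gradient_descent_step(J, theta, samples)
```

After enough steps, theta settles near [1, 1], where the loss on both samples is zero; a larger eta would get there in fewer steps but can overshoot and oscillate, which is the accuracy/stability trade-off described above.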
With the current algorithm, every single item in the training set must be evaluated in every training cycle, which can be quite wasteful in practice. Datasets will often number in the tens of thousands, and