This article will focus on exploring the relationship between probability (or, more specifically, probability distributions) and neural networks in an intuitive manner.
As previously stated, neural networks represent a parameterized family of functions. If the parameters were fixed, a neural network would essentially become a function (or, more generally, a map). However, researchers often state that neural networks represent probability distributions, which implies randomness, a concept that does not sit well when you think of neural networks as deterministic functions.
Upon receiving the same input x multiple times, a given neural network should produce the exact same output in almost all cases (I say almost all cases because random noise is sometimes added within a neural network to regularize it during training). So, where does this randomness come from? And if there is randomness in a neural network, why do we need it? Having built up enough suspense, I will answer these questions by revisiting the basic core concepts and shedding new light on them. I will begin by explaining the need for randomness in deep learning.
Before we begin, I want to redefine the concept of randomness by using two basic examples.
These two examples suggest that randomness is subjective rather than objective. To an observer with no knowledge of or information about the coin, the coin flip appears random, while to an observer who knows the coin's dynamics, the flip's outcome is a function of some given variables. The same contrast applies to the second example, between you and your friend.
If we have complete and sufficient information about a phenomenon (or process), we can theoretically predict its outcome. Randomness arises when we have incomplete information, in which case we use extrapolation techniques to try and determine the outcome (which is just a fancy way of saying 'educated guesses').
Our daily lives are full of random phenomena, and when faced with one, we subconsciously assign a probability to each outcome and choose the outcome with the highest probability. We assign these probabilities by drawing on past experience or by suitably reorganizing what we already know. Regardless of the method, this mapping of outcomes to corresponding probabilities is similar to what a probability distribution does.
The tasks that deep learning methods solve are similar in this sense: the features do not contain sufficient information to determine the outcome with certainty, but deep learning methods learn to assign probabilities to the different outcomes from the information that is given. Hence, randomness is a part of everyday life, and it carries over to deep learning applications as well.
My interpretation of randomness is very loose and non-technical. There are some edge cases that do not follow this interpretation, but it would be impractical to cover them here as they serve no purpose.
Loosely speaking, probability distributions are maps that assign a probability measure to each outcome. (The probability measure of an outcome refers to how likely that outcome is to occur.) Probability distribution functions (PDFs) provide an analytical method for assigning probability measures to outcomes. Some standard distribution functions include
The Gaussian distribution for continuous-valued variables,
The Bernoulli distribution for binary variables,
etc. One will notice that these distribution functions contain unknown values, which we refer to as the parameters of that distribution. These parameters strongly influence the probability distribution function, changing its shape, position, and scale. In the case of the Gaussian distribution, the mean and variance control the location and shape of the bell curve. In the case of the Bernoulli distribution, the single parameter controls how the probability is split between the two outcomes. Thus, probability distributions are characterized by a set of values referred to as their parameters.
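For concreteness, here are the two standard forms written out with their parameters made explicit (these are textbook definitions, not anything specific to this article). The Gaussian density is

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),$$

with the mean $\mu$ and variance $\sigma^2$ as its parameters, while the Bernoulli mass function is

$$\mathrm{Bern}(x \mid p) = p^{x}(1-p)^{1-x}, \qquad x \in \{0, 1\},$$

with the single parameter $p$ giving the probability of $x = 1$.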
Probability distribution functions with variable parameters represent a family of distributions.
Now that we have the basic concepts cleared up, it is easy to define a relationship between the two. However, our initial statement that neural networks represent probability distributions needs a slight modification: technically, neural networks parameterize probability distributions. The output of a neural network is used as the parameters of a given probability distribution. The family of distributions that the neural network parameterizes depends mainly on the architecture of the network and the loss function used to train it.
For example, consider the task of classifying whether a car is present in a given image, and assume the neural network outputs a single scalar value ranging from 0 to 1 (1 -> a car is present in the image, 0 -> a car is not present in the image). Since the two events are mutually exclusive and exhaustive, we can view the neural network as parameterizing a Bernoulli distribution; that is, it outputs the value p. Given p, we can plug it into the Bernoulli distribution function and determine the mode of the distribution, thus obtaining our answer.
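Here is a minimal sketch of this setup, assuming PyTorch; the layer sizes, the feature dimension, and the name CarClassifier are illustrative choices, not details from the article. The network ends in a sigmoid so that its single output lies in (0, 1) and can be read directly as the Bernoulli parameter p.

```python
# Minimal sketch: a classifier whose single sigmoid output is read as the
# parameter p of a Bernoulli distribution. Architecture and names are
# illustrative assumptions.
import torch
import torch.nn as nn

class CarClassifier(nn.Module):
    def __init__(self, in_features=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid(),  # squashes the single output into (0, 1)
        )

    def forward(self, x):
        # The scalar output is read as p = P(car is present | x),
        # i.e. the parameter of a Bernoulli distribution over {0, 1}.
        return self.net(x)

model = CarClassifier()
x = torch.randn(1, 64)    # stand-in for image features
p = model(x)              # the Bernoulli parameter p

# The prediction is the mode of Bernoulli(p): 1 if p > 0.5, else 0.
prediction = (p > 0.5).long()
print(p.item(), prediction.item())
```

Training such a model with binary cross-entropy amounts to maximizing the likelihood of the labels under this Bernoulli distribution, which is why reading the sigmoid output as p is natural.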
In the event that the output is a real number, the neural network would most likely parameterize a Gaussian distribution. Since a Gaussian's mean, median, and mode coincide, it is enough for the network to output only the mean as a parameter. Similarly, by imposing suitable constraints on the outputs, we can make the neural network explicitly parameterize whichever distribution we want.
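One standard way to see this connection (assuming the variance $\sigma^2$ is held fixed rather than predicted by the network): if the network output $\hat{y}$ is taken as the mean of a Gaussian over the target $y$, the negative log-likelihood is

$$-\log \mathcal{N}(y \mid \hat{y}, \sigma^2) = \frac{(y - \hat{y})^2}{2\sigma^2} + \frac{1}{2}\log\left(2\pi\sigma^2\right),$$

so, up to an additive constant and a scale factor, minimizing the familiar mean squared error is the same as maximizing the likelihood of $y$ under the Gaussian parameterized by the network.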
Thus, this article has explored the relationship between neural networks and probability, as promised. As a bonus, we also gained a new perspective on randomness and on why incorporating it into deep learning is so important.