
 

Lecture 10.1 — Why it helps to combine models



Lecture 10.1 — Why it helps to combine models [Neural Networks for Machine Learning]

In this video, I will discuss the importance of combining multiple models for making predictions. When using a single model, we face the challenge of choosing the right capacity for it. If the capacity is too low, the model won't capture the regularities in the training data. On the other hand, if the capacity is too high, the model will overfit the sampling error in the specific training set. By combining multiple models, we can strike a better balance between fitting the true regularities and avoiding overfitting. Averaging the models together often leads to better results compared to using any single model. This effect is particularly significant when the models make diverse predictions. Encouraging the models to make different predictions can be achieved through various techniques.

When training data is limited, overfitting is a common issue, but considering the predictions of multiple models mitigates it, especially when the models make dissimilar predictions. In regression, we can decompose the expected squared error into a bias term and a variance term. The bias term measures how badly the model approximates the true function, while the variance term measures how much the model's fit varies with the sampling error of the particular training set it was given. By averaging models we can reduce the variance while keeping the bias low, since high-capacity models tend to have low bias but high variance. This lets us exploit the benefits of averaging to reduce error.
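
For reference, here is that decomposition written out, with y the prediction of a model trained on a randomly drawn training set D and t the target value:

    E_D\big[(y - t)^2\big] \;=\; \underbrace{\big(E_D[y] - t\big)^2}_{\text{bias}^2} \;+\; \underbrace{E_D\big[(y - E_D[y])^2\big]}_{\text{variance}}

High-capacity models tend to have a small bias term and a large variance term, and it is the variance term that averaging attacks.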

When comparing an individual model to the average of models on a specific test case, it's possible for some individual predictors to outperform the combined predictor. However, different individual predictors excel on different cases. Additionally, when individual predictors significantly disagree with each other, the combined predictor generally outperforms all individual predictors on average. Thus, the goal is to have individual predictors that make distinct errors from one another while remaining accurate.

Mathematically, when combining networks we compare two expected squared errors: the error we expect if we pick one of the predictors at random, and the error of the average of all the predictors' outputs. The expected squared error from picking a predictor at random is greater than the squared error of the average, which shows the advantage of averaging. The gap between the two is exactly the variance of the predictors' outputs, and it is this term that averaging removes.
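
In symbols, with individual predictors y_i, their average \bar{y} = \frac{1}{N}\sum_i y_i, and target t, the comparison is the identity:

    \frac{1}{N}\sum_i (t - y_i)^2 \;=\; (t - \bar{y})^2 \;+\; \frac{1}{N}\sum_i (y_i - \bar{y})^2 .

The left-hand side is the expected squared error of picking one predictor at random; it exceeds the squared error of the average by the variance of the predictors' outputs, which is why averaging never hurts the expected squared error and helps most when the predictors disagree.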

To achieve diverse predictions among models, various approaches can be employed. This includes using different types of models, altering model architectures, employing different learning algorithms, and training models on different subsets of the data. Techniques like bagging and boosting are also effective in creating diverse models. Bagging involves training different models on different subsets of the data, while boosting weights the training cases differently for each model. These methods contribute to improved performance when combining models.
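
As a concrete illustration of bagging, here is a minimal Python sketch; the data, the choice of cubic-polynomial base models, and all constants are assumptions made purely for illustration, not part of the lecture.

    # Bagging sketch: fit each base model to a different bootstrap sample,
    # then average the predictions of all the models.
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy regression data (assumed): y = sin(x) + noise.
    x = np.linspace(0.0, 6.0, 40)
    y = np.sin(x) + rng.normal(scale=0.3, size=x.shape)

    n_models = 25
    x_test = np.linspace(0.0, 6.0, 200)
    predictions = []

    for _ in range(n_models):
        # Bootstrap sample: resample training cases with replacement.
        idx = rng.integers(0, len(x), size=len(x))
        coeffs = np.polyfit(x[idx], y[idx], deg=3)   # one "model" per subset
        predictions.append(np.polyval(coeffs, x_test))

    # The combined predictor is the average of the individual models.
    bagged = np.mean(predictions, axis=0)
    print(bagged[:5])

Each base model sees a different bootstrap resample, so the models disagree, and their average is typically smoother and more accurate than any single fit.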

Combining multiple models is beneficial for prediction tasks. By averaging the models, we can strike a balance between capturing regularities and avoiding overfitting. Diverse predictions among the models enhance the performance of the combined predictor. Various techniques can be applied to encourage diverse predictions, leading to better overall results.


Lecture 10.2 — Mixtures of Experts



Lecture 10.2 — Mixtures of Experts [Neural Networks for Machine Learning]

The mixture of experts model, developed in the early 1990s, trains multiple neural networks, each specializing in a different part of the data. The idea is to have one neural net per data regime, with a manager (gating) network deciding which specialist to use for each input. This approach becomes more effective with larger data sets, since it can exploit extensive data to improve predictions. During training, each expert's weighting on a training case is increased for the cases where it already does better than the other experts, so its learning concentrates on those cases. This specialization leads to individual models excelling in certain regions of the input space while performing poorly elsewhere. The key is to make each expert focus on predicting the right answer for the cases where it outperforms the other experts.

In the spectrum of models, there are local and global models. Local models, like nearest neighbors, focus on specific training cases and store their values for prediction. Global models, like fitting one polynomial to all data, are more complex and can be unstable. In between, there are intermediate complexity models that are useful for data sets with different regimes and varying input-output relationships.

To fit different models to different regimes, the training data needs to be partitioned into subsets representing each regime. Clustering based on input vectors alone is not ideal. Instead, similarity in input-output mappings should be considered. Partitioning based on input-output mapping allows models to better capture the relationships within each regime.

There are two error functions: one that encourages the models to cooperate and one that encourages them to specialize. Encouraging cooperation means comparing the average of all the predictors with the target and training all the predictors together to reduce that difference. This can overfit badly, because the combined model is much more powerful than training each predictor separately on its own subset of the data. In contrast, the error function that promotes specialization compares the output of each model with the target separately, with a manager determining the weight assigned to each model, which represents the probability of selecting that model for the case. Most experts end up ignoring most targets, focusing only on the subset of training cases where they perform well.
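
Written out, with expert outputs y_i, target t, and manager-assigned probabilities p_i, the two error functions being contrasted are roughly:

    E_{\text{cooperate}} = \Big(t - \tfrac{1}{N}\textstyle\sum_i y_i\Big)^2
    \qquad\text{versus}\qquad
    E_{\text{specialize}} = \sum_i p_i \,(t - y_i)^2 .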

The architecture of the mixture of experts model consists of multiple experts, a manager, and a softmax layer: the manager looks at the input and, through the softmax, produces the probability of selecting each expert. The error function is computed from the experts' outputs and the manager's probabilities. Differentiating this error with respect to the experts' outputs gives the gradients for training the experts: an expert that receives a low probability for a particular case gets a very small gradient, so its parameters are left largely undisturbed by that case. Differentiating with respect to the inputs of the gating network gives the gradient for training the manager, which raises the probability of the experts that currently do better than average on the case, and this is what produces specialization.
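
Differentiating the specialization error E = \sum_i p_i (t - y_i)^2, with the manager's probabilities p_i = e^{x_i} / \sum_j e^{x_j} produced by a softmax over its logits x_i, gives the gradients described above:

    \frac{\partial E}{\partial y_i} = -2\, p_i \,(t - y_i),
    \qquad
    \frac{\partial E}{\partial x_i} = p_i \big( (t - y_i)^2 - E \big).

The first gradient vanishes for experts the manager gives low probability to, so their weights are left alone; the second raises the probability of experts whose squared error is below the current weighted average, which drives specialization.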

There is a more complicated cost function based on mixture models, which involves Gaussian predictions and maximum likelihood estimation. This function maximizes the log probability of the target value under the mixture of experts' predictive distribution. The goal is to minimize the negative log probability as the cost function.
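
Under that view each expert i outputs the mean y_i of a unit-variance Gaussian, the manager's outputs p_i act as mixing proportions, and (dropping the Gaussian normalizing constant) the cost for one training case with target t is:

    C = -\log \sum_i p_i \, e^{-\frac{1}{2}(t - y_i)^2}.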

The mixture of experts model leverages specialized neural networks for different data regimes and effectively utilizes large data sets for improved predictions.


Lecture 10.3 — The idea of full Bayesian learning



Lecture 10.3 — The idea of full Bayesian learning [Neural Networks for Machine Learning]

In this video, I'll discuss full Bayesian learning and how it works. In full Bayesian learning, we aim to find the complete posterior distribution over all possible parameter settings, rather than searching for a single optimal setting. However, computing this distribution is computationally intensive for complex models like neural nets. Once we have the posterior distribution, we can make predictions by averaging the predictions from different parameter settings weighted by their posterior probabilities. While this approach is computationally demanding, it allows us to use complex models even with limited data.

Overfitting is a common problem when fitting complicated models to small datasets. However, by obtaining the full posterior distribution over parameters, we can avoid overfitting. A frequentist approach suggests using simpler models when there's limited data, assuming that fitting a model means finding the best parameter setting. But with the full posterior distribution, even with little data, predictions may be vague due to different parameter settings having significant posterior probabilities. As we gather more data, the posterior distribution becomes more focused on specific parameter settings, leading to sharper predictions.

The example of overfitting involves fitting a fifth-order polynomial to six data points, which appears to fit the data perfectly, whereas a straight line with only two degrees of freedom doesn't fit the data well. However, if we start with a reasonable prior on fifth-order polynomials and compute the full posterior distribution, we get vaguer but more sensible predictions. The different models that carry posterior probability make diverse predictions at a given input value, and on average those predictions lie close to those of the green line in the lecture slide.

From a Bayesian perspective, the amount of data collected shouldn't influence prior beliefs about model complexity. For a neural net with only a few parameters, we can approximate full Bayesian learning with a grid-based approach: we place a grid over the parameter space, allowing each parameter a few alternative values, and the cross-product of these values gives the grid points. By evaluating how well each grid point predicts the data and taking its prior probability into account, we assign it a posterior probability. Despite being computationally expensive, this method involves no gradient descent and no local-optimum issues, and it performs better than maximum likelihood or maximum a posteriori when there is little data.

To make predictions on test data, we compute the probability of a test output given a test input by summing over all grid points. Each grid point's prediction is weighted by the probability of that grid point given the training data and the prior; the prediction itself is the probability of the test output given the input and that grid point's weights, which may involve, for example, adding Gaussian noise to the net's output.
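
In symbols, with training data D, grid points W_g, and a test case (x, t), the prediction is:

    p(t \mid x, D) \;=\; \sum_g p(W_g \mid D)\; p(t \mid x, W_g).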

In the lecture's illustration of full Bayesian learning, a small net with four weights and two biases is shown. If we allow nine possible values for each weight and bias, the parameter space has 9^6 = 531,441 grid points. For each grid point, we compute the probability of the observed outputs for all training cases, multiplied by that grid point's prior probability. Normalizing these products gives the posterior probability over all grid points. Finally, we make predictions by combining the grid points' predictions, weighting each one by its posterior probability.
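
Here is a minimal Python sketch of the grid approach for an even smaller net (one input, one logistic hidden unit, one linear output, four parameters, five values per parameter); the data, prior, noise level, and grid range are all illustrative assumptions rather than the lecture's exact example.

    # Grid-based full Bayesian learning on a tiny net.
    import itertools
    import numpy as np

    def net(params, x):
        w1, b1, w2, b2 = params
        h = 1.0 / (1.0 + np.exp(-(w1 * x + b1)))   # logistic hidden unit
        return w2 * h + b2                          # linear output

    # Toy training data (assumed).
    x_train = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
    t_train = np.array([-0.9, -0.6, 0.1, 0.7, 0.8])
    sigma = 0.3                                     # assumed output noise

    grid_values = np.linspace(-2.0, 2.0, 5)         # 5 values per parameter
    grid = list(itertools.product(grid_values, repeat=4))   # 5^4 = 625 points

    log_posterior = []
    for params in grid:
        pred = net(params, x_train)
        log_lik = -0.5 * np.sum((t_train - pred) ** 2) / sigma ** 2
        log_prior = -0.5 * np.sum(np.square(params))        # unit Gaussian prior
        log_posterior.append(log_lik + log_prior)

    log_posterior = np.array(log_posterior)
    posterior = np.exp(log_posterior - log_posterior.max())
    posterior /= posterior.sum()                     # normalize over grid points

    # Predict by averaging each grid point's prediction,
    # weighted by its posterior probability.
    x_test = 1.5
    prediction = sum(p * net(params, x_test) for p, params in zip(posterior, grid))
    print(prediction)

With more parameters the number of grid points grows exponentially, which is exactly why the next lecture turns to sampling methods.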


Lecture 10.4 — Making full Bayesian learning practical



Lecture 10.4 — Making full Bayesian learning practical [Neural Networks for Machine Learning]

In this video, I will explain how to make full Bayesian learning practical for large neural networks with thousands or even millions of weights. The technique used is a Monte Carlo method, which may seem peculiar at first. We employ a random number generator to explore the space of weight vectors in a random manner, but with a bias towards descending the cost function. When done correctly, this approach has a remarkable property: it samples weight vectors in proportion to their probabilities in the posterior distribution. By sampling a large number of weight vectors, we can obtain a good approximation of the full Bayesian method.

As the number of parameters increases, the number of grid points in the parameter space becomes exponentially large. Therefore, creating a grid for more than a few parameters is not feasible when there is enough data to render most parameter vectors highly improbable. Instead, we can focus on evaluating a small fraction of the grid points that make a significant contribution to the predictions. An idea that makes Bayesian learning feasible is to sample weight vectors according to their posterior probabilities. Instead of summing up all the terms in the equation, we can sample terms from the sum. We assign a weight of one or zero to each weight vector, depending on whether it is sampled or not. The probability of being sampled corresponds to the weight vector's posterior probability, resulting in the correct expected value.
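
In symbols, the full sum over grid points is replaced by an average over S sampled weight vectors W_s, each drawn with probability equal to its posterior probability:

    p(t \mid x, D) = \sum_g p(W_g \mid D)\, p(t \mid x, W_g)
    \;\approx\; \frac{1}{S} \sum_{s=1}^{S} p(t \mid x, W_s),
    \qquad W_s \sim p(W \mid D).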

In standard backpropagation (shown on one side of the lecture slide), we follow a single path from an initial point to a final point, moving along the gradient and descending the cost function. In contrast, a sampling method introduces Gaussian noise at each weight update, so the weight vector never settles: it keeps wandering around the weight space, favoring low-cost regions and tending to move downhill whenever it can. An essential question is how often the weights will visit each point in the space. The samples taken while wandering (the red dots in the slide) need not lie in the lowest-cost regions because of the injected noise. However, after enough exploration a remarkable property of Markov chain Monte Carlo emerges: the weight vectors become unbiased samples from the true posterior distribution, so weight vectors that are highly probable under the posterior are far more likely to be sampled than improbable ones. This technique, known as Markov chain Monte Carlo, makes it feasible to use Bayesian learning with thousands of parameters.

The method mentioned above, adding Gaussian noise to the gradient updates, is known as the Langevin method. It works, but it is not the most efficient approach: more sophisticated methods let the weight vectors explore the space in less time before reliable samples are obtained. One important development is the use of mini-batches in full Bayesian learning. When we compute the gradient of the cost function on a random mini-batch, we get an unbiased estimate with sampling noise, and this sampling noise can supply the noise that the Markov chain Monte Carlo method needs. A clever idea by Welling and his collaborators exploits this to sample efficiently from the posterior distribution over weights using mini-batch methods, which should make full Bayesian learning feasible for much larger networks that have to be trained with mini-batches just to get through all the training data.
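
The following is a minimal Python sketch of that idea in the spirit of stochastic gradient Langevin dynamics, applied to Bayesian linear regression on made-up data; the model, constants, and fixed step size are illustrative assumptions, not the exact procedure of Welling and collaborators.

    # Mini-batch Langevin-style sampling: gradient steps plus injected Gaussian
    # noise give approximate samples from the posterior over the weights.
    import numpy as np

    rng = np.random.default_rng(1)

    # Toy data (assumed): t = X @ w_true + noise.
    N, D = 1000, 3
    X = rng.normal(size=(N, D))
    w_true = np.array([1.0, -2.0, 0.5])
    t = X @ w_true + rng.normal(scale=0.5, size=N)

    w = np.zeros(D)
    eps = 1e-4            # step size (would normally be decayed)
    batch = 32
    samples = []

    for step in range(5000):
        idx = rng.integers(0, N, size=batch)
        # Mini-batch log-likelihood gradient, rescaled to the full data set,
        # plus the gradient of a unit Gaussian log-prior.
        grad_lik = (N / batch) * (X[idx].T @ (t[idx] - X[idx] @ w)) / 0.5 ** 2
        grad_prior = -w
        noise = rng.normal(scale=np.sqrt(eps), size=D)   # Langevin noise
        w = w + 0.5 * eps * (grad_lik + grad_prior) + noise
        if step > 1000:
            samples.append(w.copy())

    print(np.mean(samples, axis=0))   # posterior-mean estimate, near w_true

The update is an ordinary mini-batch gradient step (rescaled to the full data set, plus the prior gradient) with Gaussian noise of variance equal to the step size injected at every step; the visited weight vectors are then treated as approximate posterior samples.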

Using mini-batches in full Bayesian learning offers several advantages. When computing the gradient of the cost function on a random mini-batch, we not only obtain an unbiased estimate with sampling noise, but we also leverage the efficiency of mini-batch methods. This means that we can train much larger networks that would be otherwise infeasible to train with full Bayesian learning.

The breakthrough achieved by Welling and his collaborators allows for efficient sampling from the posterior distribution over weights using mini-batch methods. Their clever idea utilizes the sampling noise inherent in mini-batch gradient estimation to serve as the noise required by the Markov chain Monte Carlo method. By appropriately incorporating this noise, they have successfully obtained reliable samples from the posterior distribution, making full Bayesian learning practical for larger networks.

With this advancement, it becomes possible to train neural networks with thousands or even millions of weights using mini-batches and obtain samples from the posterior distribution over weights. This is particularly beneficial when dealing with large-scale problems that require extensive computational resources. The ability to incorporate uncertainty through full Bayesian learning provides a more comprehensive understanding of model predictions and can lead to improved decision-making.

Full Bayesian learning can be made practical for large neural networks by leveraging Monte Carlo methods such as Markov chain Monte Carlo. By sampling weight vectors according to their posterior probabilities, we can approximate the full Bayesian method and obtain valuable insights into the uncertainty of our models. With the introduction of mini-batch methods, efficient sampling from the posterior distribution over weights is now achievable, enabling the application of full Bayesian learning to much larger networks.


Lecture 10.5 — Dropout



Lecture 10.5 — Dropout [Neural Networks for Machine Learning]

Dropout is a successful method for combining a large number of neural network models without separately training each model. In this approach, random subsets of hidden units are dropped out for each training case, resulting in different architectures for each case. This creates a unique model for every training case, raising questions about how to train and efficiently average these models during testing.

Two ways to combine the outputs of multiple models are to average their output probabilities or to take the geometric mean of their probabilities. Weight sharing plays a crucial role in dropout, which provides an efficient way to average neural networks, although it may not perform as well as the correct Bayesian approach. During training, hidden units are randomly dropped out with a probability of 0.5, so each training case is effectively processed by a different architecture, and this vast number of architectures shares weights. Dropout can thus be seen as model averaging in which most models are never sampled and each sampled model receives only one training example; the weight sharing among the models is what regularizes them strongly. At test time, all hidden units are used but their outgoing weights are halved, which approximately computes the geometric mean of the predictions of all possible models. Dropout extends to multiple hidden layers by applying a dropout probability of 0.5 in each layer; this approximation is much faster than averaging many separate dropout models and works well.
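
A minimal Python sketch of those two regimes, with assumed layer sizes and randomly initialized weights purely for illustration:

    # Dropout sketch: mask half of the hidden units at training time,
    # use all units with halved outgoing weights at test time.
    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative parameters for a 20-10-1 network (sizes are assumptions).
    W1, b1 = 0.1 * rng.normal(size=(20, 10)), np.zeros(10)
    W2, b2 = 0.1 * rng.normal(size=(10, 1)), np.zeros(1)

    def hidden(x):
        return 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))    # logistic hidden units

    def forward_train(x, p_keep=0.5):
        h = hidden(x)
        mask = rng.random(h.shape) < p_keep            # keep each unit with prob 0.5
        return (h * mask) @ W2 + b2                    # dropped units contribute nothing

    def forward_test(x, p_keep=0.5):
        # "Mean net": use all hidden units but halve their outgoing weights.
        return hidden(x) @ (W2 * p_keep) + b2

    x = rng.normal(size=(4, 20))                       # a batch of 4 inputs
    print(forward_train(x).ravel(), forward_test(x).ravel())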

Furthermore, dropout can be applied to input layers with a higher probability of keeping inputs. This technique is already used in denoising autoencoders and has shown good results. Dropout has been shown to be effective in reducing errors and preventing overfitting in deep neural networks. It encourages specialization of hidden units and prevents complex co-adaptations that may lead to poor generalization on new test data. By forcing hidden units to work with different combinations of other hidden units, dropout promotes individually useful behavior and discourages reliance on specific collaborations. This approach improves the performance of dropout networks by allowing each unit to contribute in a unique and marginally useful way, leading to excellent results.

Dropout is a powerful technique for training and combining neural network models. It addresses the challenge of overfitting by regularizing the models through weight sharing and random dropout of hidden units. By creating diverse architectures for each training case, dropout encourages individual unit specialization and reduces complex co-adaptations. The process of averaging the models' output probabilities or using the geometric mean provides an ensemble-like effect, improving the overall performance of the network. Although dropout may not achieve the same level of performance as the correct Bayesian approach, it offers a practical and efficient alternative. When applied to multiple hidden layers, dropout can be used in each layer with a dropout probability of 0.5. This approximation, known as the "mean net," effectively combines the benefits of dropout with faster computation. It is particularly useful when computational resources are limited.

Furthermore, dropout can be extended to the input layer by applying dropout with a higher probability of retaining inputs. This technique helps prevent overfitting and has shown success in various applications. It is important to note that dropout does not only improve performance on training data but also enhances generalization to unseen test data. By encouraging individual unit behavior and reducing complex co-adaptations, dropout models tend to perform well on new and unseen examples.

Dropout is a practical and effective method for combining neural network models. By randomly dropping out hidden units and encouraging individual unit behavior, dropout mitigates overfitting and improves generalization. Its simplicity and efficiency make it a valuable tool for training deep neural networks.


Lecture 11.1 — Hopfield Nets



Lecture 11.1 — Hopfield Nets [Neural Networks for Machine Learning]

In this video, the presenter introduces Hopfield networks and their role in the resurgence of interest in neural networks in the 1980s. Hopfield networks are simple devices used for storing memories as distributed patterns of activity. They are energy-based models with binary threshold units and recurrent connections.

Analyzing networks with nonlinear units and recurrent connections can be challenging due to their various behaviors such as settling to stable states, oscillating, or even being chaotic. However, Hopfield and other researchers realized that if the connections are symmetric, a global energy function can be defined for each binary configuration of the network. The binary threshold decision rule, combined with the right energy function, causes the network to move downhill in energy, eventually reaching an energy minimum. The energy function consists of local contributions representing the product of connection weights and the binary states of connected neurons.
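
With binary states s_i, biases b_i, and symmetric weights w_{ij}, the global energy and the "energy gap" of unit i (how much the global energy drops when unit i turns on, given the states of the others) are:

    E = -\sum_i s_i b_i \;-\; \sum_{i<j} s_i s_j w_{ij},
    \qquad
    \Delta E_i = E(s_i{=}0) - E(s_i{=}1) = b_i + \sum_j s_j w_{ij}.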

To find an energy minimum, the units in a Hopfield net are updated sequentially, one at a time, in random order. Each unit looks at the current states of the other units and adopts whichever of its two states gives the lower global energy. This sequential updating prevents units from making simultaneous decisions that could increase the energy and cause oscillations. Hopfield networks are suitable for storing memories because memories correspond to energy minima of the network: a partial or corrupted memory can be cleaned up by the binary threshold decision rule and restored to the full memory. This content-addressable memory lets us access a stored item from partial information about its content.

Hopfield nets have properties that make them robust against hardware damage, as they can still function properly even with a few units removed. The weights in the network provide information about how states of neurons fit together, similar to reconstructing a dinosaur from a few bones. The storage rule for memories in a Hopfield net is simple. By incrementing the weights between units based on the product of their activities, a binary state vector can be stored. This rule only requires one pass through the data, making it an online rule. However, it is not an error correction rule, which has both advantages and disadvantages.
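
A minimal Python sketch of that storage rule and of sequential recall, using +/-1 states and omitting biases for simplicity (the pattern count, network size, and amount of corruption are illustrative assumptions):

    # Hopfield storage and recall sketch.
    import numpy as np

    rng = np.random.default_rng(0)
    N = 100
    patterns = rng.choice([-1, 1], size=(5, N))      # 5 random memories

    # Storage: increment w_ij by the product of the two units' activities.
    W = np.zeros((N, N))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0.0)                         # no self-connections

    def recall(state, sweeps=10):
        state = state.copy()
        for _ in range(sweeps):
            for i in rng.permutation(N):             # update one unit at a time
                # Adopt whichever state gives the lower global energy,
                # i.e. the sign of the unit's total input (zero bias assumed).
                state[i] = 1 if W[i] @ state >= 0 else -1
        return state

    # Corrupt a stored pattern and let the net clean it up.
    probe = patterns[0].copy()
    flip = rng.choice(N, size=15, replace=False)
    probe[flip] *= -1
    print(np.mean(recall(probe) == patterns[0]))     # fraction of bits restored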

Hopfield networks offer a straightforward approach to storing memories and have interesting properties that make them valuable for various applications.


Lecture 11.2 — Dealing with spurious minima



Lecture 11.2 — Dealing with spurious minima [Neural Networks for Machine Learning]

In this video, I'll discuss the storage capacity of Hopfield nets and how it is limited by spurious memories. Spurious memories occur when nearby energy minima combine, creating a new minimum in the wrong place. Efforts to eliminate these spurious minima led to an interesting method of learning in more complex systems than basic Hopfield nets.

I will also touch upon a historical rediscovery related to increasing the capacity of Hopfield nets. Physicists attempting to enhance their storage capacity ended up rediscovering the perceptron convergence procedure, which had been developed long before Hopfield proposed Hopfield nets as memory storage devices.

The capacity of a Hopfield net using the Hopfield storage rule is, for a fully connected network, approximately 0.15N memories, where N is the number of binary threshold units. This is the number of memories that can be stored without confusion. Each memory is a random configuration of the N units and so contributes N bits of information, which puts the total information stored in a Hopfield net at about 0.15N^2 bits.

However, this storage method does not make efficient use of the bits needed to store the weights. The number of bits of computer memory required to store the weights exceeds 0.15N^2 bits, demonstrating that distributed storage in local energy minima is not an efficient use of the hardware.

To improve the capacity of a Hopfield net, we need to address the merging of energy minima, which limits its capability. Each time a binary configuration is memorized, we hope to create a new energy minimum. However, nearby patterns can lead to the merging of minima, making it impossible to distinguish between separate memories. This merging phenomenon is what restricts the capacity of a Hopfield net.

An intriguing idea that emerged from improving the capacity of Hopfield nets is the concept of unlearning. Unlearning involves allowing the net to settle from a random initial state and then applying the opposite of the storage rule to eliminate spurious minima. Hopfield, Feinstein, and Palmer demonstrated that unlearning effectively increases memory capacity, and Crick and Mitchison proposed that unlearning might occur during REM sleep.

The challenge lies in determining how much unlearning should be done. Ideally, unlearning should be part of the process of fitting a model to data. Maximum likelihood fitting of the model can automatically incorporate unlearning, providing precise guidance on the amount of unlearning required.

Physicists made efforts to enhance the capacity of Hopfield nets, driven by the desire to find connections between familiar mathematical concepts and brain functionality. Elizabeth Gardner proposed a more efficient storage rule that utilized the full capacity of the weights. This rule involved cycling through the training set multiple times and employing the perceptron convergence procedure to train each unit's correct state.

This technique is similar to the pseudo-likelihood method used in statistics, where you aim to get one dimension right given the values on all other dimensions. The perceptron convergence procedure, with some adjustments for the symmetric weights in Hopfield nets, allows for more efficient memory storage.


This enhanced storage rule presented by Gardner represents a significant advancement in maximizing the capacity of Hopfield nets. By cycling through the training set and iteratively adjusting the weights based on the perceptron convergence procedure, the network can store a greater number of memories.

It's worth noting that this approach sacrifices the online property of Hopfield nets, which allows processing of data in a single pass. However, the trade-off is justified by the improved storage efficiency achieved through the utilization of the full capacity of the weights.

The incorporation of unlearning, as proposed by Hopfield, Feinstein, and Palmer, provides a means to remove spurious minima and further increase memory capacity. Unlearning allows for the separation of merged minima, ensuring better recall of individual memories.

Interestingly, Crick and Mitchison suggested a functional explanation for unlearning during REM sleep. They proposed that the purpose of dreaming is to facilitate the removal of spurious minima, effectively resetting the network to a random state and unlearning previous patterns.

To address the mathematical challenge of determining the optimal amount of unlearning, a potential solution lies in treating unlearning as part of the model-fitting process. By employing maximum likelihood fitting, unlearning can be automatically incorporated, providing precise guidance on the extent of unlearning required to optimize the model's performance.

The quest to improve the capacity of Hopfield nets has yielded valuable insights into memory storage and learning processes. The development of the perceptron convergence procedure, along with the exploration of unlearning, has brought us closer to harnessing the full potential of Hopfield nets for effective memory storage and retrieval.


Lecture 11.3 — Hopfield nets with hidden units



Lecture 11.3 — Hopfield nets with hidden units [Neural Networks for Machine Learning]

In this video, I will introduce a novel approach to utilizing Hopfield nets and their energy function. By incorporating hidden units into the network, we aim to derive interpretations of perceptual input based on the states of these hidden units. The key concept is that the weights between units impose constraints on favorable interpretations, and by finding states with low energy, we can discover good interpretations of the input data.

Hopfield nets combine two fundamental ideas: the ability to find local energy minima using symmetrically connected binary threshold units and the notion that these local energy minima may correspond to memories. However, there is an alternative way to leverage the capability of finding local minima. Instead of using the network solely for memory storage, we can employ it to construct interpretations of sensory input. To illustrate this idea, let's delve into the details of inferring information from a 2D line in an image about the three-dimensional world. When we observe a 2D line, it can originate from various three-dimensional edges in the world. Due to the loss of depth information in the image, multiple 3D edges can lead to the same appearance on the retina. This ambiguity arises because we lack knowledge about the depth at each end of the line.

We assume that a straight 2D line in the image was caused by a straight 3D edge in the world, but the projection still loses two degrees of freedom: the depth at each end of the edge. Consequently, a whole family of 3D edges corresponds to the same 2D line, yet we can only perceive one of them at a time. Now, let's consider an example that demonstrates how the ability to find low energy states in a network of binary units can help interpret sensory input. Suppose we have a line drawing and want to interpret it as a three-dimensional object. For each potential 2D line, we allocate a corresponding neuron. Only a few of these neurons will be active in any given image, representing the lines that are present.

To construct interpretations, we introduce a set of 3D line units, one for each possible 3D edge. Since each 2D line unit can correspond to multiple 3D lines, we need to excite all the relevant 3D lines while ensuring competition among them, as only one 3D line should be active at a time. To achieve this, we establish excitatory connections from the 2D line unit to all the candidate 3D lines, along with inhibitory connections to enable competition.

However, the wiring of the neural network is not yet complete. We need to incorporate information about how 3D edges connect. For example, when two 2D lines converge in the image, it is highly likely that they correspond to edges with the same depth at the junction point. We can represent this expectation by introducing additional connections that support such coinciding 3D edges.

Furthermore, we can exploit the common occurrence of 3D edges joining at right angles. By establishing stronger connections between two 3D edges that agree in depth and form a right angle, we can indicate their cohesive relationship. These connections, represented by thicker green lines, provide information about how edges in the world connect and contribute to the formation of a coherent 3D object. Now, our network contains knowledge about the arrangement of edges in the world and how they project to create lines in the image. When we feed an image into this network, it should generate an interpretation. In the case of the image I am presenting, there are two distinct interpretations, known as the Necker cube. The network would exhibit two energy minima, each corresponding to one of the possible interpretations of the Necker cube.

Please note that this example serves as an analogy to grasp the concept of using low energy states as interpretations of perceptual data. Constructing a comprehensive model that accurately accounts for the flipping of the Necker cube would be considerably more complex than the simplified scenario described here. If we decide to use low energy states to represent sound perceptual interpretations, two key challenges emerge. First, we need to address the issue of search—how to prevent hidden units from becoming trapped in poor local energy minima. Poor minima reflect suboptimal interpretations based on our current model and weights. Is there a better approach than simply descending in energy from a random starting state?

The second challenge is even more daunting—how to learn the weights of the connections between hidden units and between visible and hidden units. Is there a straightforward learning algorithm to adjust these weights, considering that there is no external supervisor guiding the learning process? Our goal is for the network to receive input and construct meaningful patterns of activity in the hidden units that represent sensible interpretations. This poses a considerable challenge.

In summary, utilizing Hopfield nets and their energy function in a novel way involves incorporating hidden units to derive interpretations of perceptual input. The weights between units represent constraints on good interpretations, and finding low energy states allows us to discover favorable interpretations.

However, there are challenges to overcome. The first challenge is the search problem, which involves avoiding getting trapped in poor local energy minima. These minima represent suboptimal interpretations, and finding an efficient search method is crucial. The second challenge is learning the weights on the connections between hidden units and between visible and hidden units. This task is complicated by the absence of a supervisor or external guidance. A suitable learning algorithm is needed to adjust the weights, enabling the network to construct meaningful interpretations of sensory input. It's important to note that the example provided, involving the interpretation of a 2D line drawing as a 3D object, is an analogy to illustrate the concept of using low energy states for interpretations. Constructing a comprehensive model to handle more complex perceptual phenomena would require more intricate approaches.

In the next video, we will delve into the search problem and explore potential solutions to avoid getting trapped in poor local minima of the energy function.


Lecture 11.4 — Using stochastic units to improve search



Lecture 11.4 — Using stochastic units to improve search [Neural Networks for Machine Learning]

In this video, I will explain how adding noise to systems can help them escape local minima. Specifically, I will show you how to incorporate noise into the units of a Hopfield net in an appropriate manner.

A Hopfield net always makes decisions that reduce the energy, which makes it impossible for the net to climb out of a local minimum: once trapped, it cannot cross the energy barrier separating it from a better, deeper minimum.

However, by adding random noise, we can escape from poor minima, especially those that are shallow and lack significant energy barriers. The most effective strategy is to start with a high level of noise, allowing exploration of the space on a coarse scale and finding generally good regions. As the noise level decreases, the focus shifts to the best nearby minima.

Simulated annealing is a technique that starts with a high noise level and gradually reduces it, guiding the system towards a deep minimum. It exploits the effect of temperature in a physical or simulated system: high temperature effectively flattens the energy landscape, making it easy to cross barriers in either direction, while low temperature makes crossings slower but improves the ratio of the probability of crossing in the good direction (into the deeper minimum) to the probability of crossing back out.

To introduce noise into a Hopfield net, we replace the binary threshold units with binary stochastic units that make biased random decisions. The noise level is controlled by a parameter called temperature. Raising the noise level corresponds to decreasing the energy gaps between configurations.

The concept of thermal equilibrium is important to understand in the context of Boltzmann machines. At a fixed temperature, thermal equilibrium refers to the probability distribution settling into a stationary distribution, determined by the energy function. The probability of a configuration in thermal equilibrium is proportional to e to the power of minus its energy.
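
Concretely, a binary stochastic unit with energy gap \Delta E_i turns on with a logistic probability controlled by the temperature T, and at thermal equilibrium the probability of a whole configuration c follows the Boltzmann distribution:

    p(s_i = 1) = \frac{1}{1 + e^{-\Delta E_i / T}},
    \qquad
    P(c) = \frac{e^{-E(c)/T}}{\sum_{c'} e^{-E(c')/T}}.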

Reaching thermal equilibrium involves running multiple stochastic systems with the same weights and applying stochastic update rules. Although individual systems keep changing configurations, the fraction of systems in each configuration remains constant. This is analogous to shuffling card packs in a large casino until the initial order becomes irrelevant, and an equal number of packs are in each possible order.

Simulated annealing is a powerful method for overcoming local optima, but it will not be further discussed in this course, as it can be a distraction from understanding Boltzmann machines. Instead, binary stochastic units with a temperature of one (standard logistic function) will be used.

Adding noise to systems, such as Hopfield nets and Boltzmann machines, can help escape local minima and explore more favorable regions. The noise level is controlled by temperature, and reaching thermal equilibrium involves the fraction of systems in each configuration remaining constant while individual systems keep changing their states.


Lecture 11.5 — How a Boltzmann machine models data



Lecture 11.5 — How a Boltzmann machine models data [Neural Networks for Machine Learning]

In this video, I will explain how a Boltzmann machine models binary data vectors. Firstly, I will discuss the reasons for modeling binary data vectors and the potential applications of such a model. Then, I will delve into how the probabilities assigned to binary data vectors are determined by the weights in a Boltzmann machine.

Boltzmann machines, also known as stochastic Hopfield nets with hidden units, are effective at modeling binary data. By utilizing hidden units, these machines can fit a model to a set of binary training vectors and assign a probability to every possible binary vector.

There are several practical applications for modeling binary data. For instance, if you have different distributions of binary vectors, you may want to determine which distribution a new binary vector belongs to. By using hidden units to model the distribution for each document type, you can identify the most likely document class for a given binary vector.

Furthermore, Boltzmann machines can be useful for monitoring complex systems and detecting unusual behavior. Suppose you have a nuclear power station with binary readings from various dials indicating the state of the station. Instead of relying on supervised learning, which requires examples of dangerous states, you can build a model of normal states and detect deviations from the norm. This way, you can identify unusual states without prior knowledge of such states.

To compute the posterior probability that a particular distribution generated the observed data, you can employ Bayes' theorem. Given the observed data, the probability of it coming from a specific model is the probability that the model would generate that data divided by the sum of the same quantity for all models.
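
With equal prior probabilities over the candidate models M_i (the assumption implicit in the sentence above), Bayes' theorem reduces to:

    p(M_i \mid \text{data}) = \frac{p(\text{data} \mid M_i)}{\sum_j p(\text{data} \mid M_j)}.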

There are two main approaches to generating models of data, specifically binary vectors. The causal model approach involves generating the states of latent variables first, which are then used to generate the binary vector. This model relies on weighted connections and biases between the latent and visible units. Factor analysis is an example of a causal model that uses continuous variables.

On the other hand, a Boltzmann machine is an energy-based model that does not generate data causally. Instead, it defines everything in terms of the energies of joint configurations of visible and hidden units. The probability of a joint configuration is either defined directly in terms of the energy or computed procedurally after updating the stochastic binary units until thermal equilibrium is reached. The energy of a joint configuration consists of bias terms, visible-visible interactions, visible-hidden interactions, and hidden-hidden interactions.
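
In symbols, with visible states v_i, hidden states h_k, biases b, symmetric weights w, and partition function Z, the energy of a joint configuration and the resulting probabilities are:

    -E(v, h) = \sum_i v_i b_i + \sum_k h_k b_k + \sum_{i<j} v_i v_j w_{ij} + \sum_{i,k} v_i h_k w_{ik} + \sum_{k<l} h_k h_l w_{kl},

    p(v, h) = \frac{e^{-E(v,h)}}{Z}, \qquad
    p(v) = \frac{\sum_h e^{-E(v,h)}}{Z}, \qquad
    Z = \sum_{u,g} e^{-E(u,g)}.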

To compute the probabilities of different visible vectors, we can work through an example. By writing down all possible states of the visible units, computing their negative energies, exponentiating those energies, and normalizing the probabilities, we can determine the probabilities of joint configurations and individual visible vectors.
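
A minimal Python sketch of that brute-force computation for a tiny Boltzmann machine with two visible units and one hidden unit; the weights and biases are made-up values chosen only to make the example concrete.

    # Enumerate every joint configuration, compute e^(-E), normalize,
    # and sum over hidden states to get the probability of each visible vector.
    import itertools
    import numpy as np

    # Symmetric weights among the three units, and their biases (assumed values).
    W = np.array([[0.0, 2.0, -1.0],
                  [2.0, 0.0, 1.0],
                  [-1.0, 1.0, 0.0]])
    b = np.array([0.0, 0.0, 0.5])

    def energy(s):
        s = np.array(s, dtype=float)
        return -s @ b - 0.5 * s @ W @ s              # each pair counted once

    configs = list(itertools.product([0, 1], repeat=3))
    unnorm = np.array([np.exp(-energy(s)) for s in configs])
    Z = unnorm.sum()                                 # the partition function

    p_joint = unnorm / Z
    p_visible = {}
    for s, p in zip(configs, p_joint):
        v = s[:2]                                    # first two units are visible
        p_visible[v] = p_visible.get(v, 0.0) + p     # marginalize out the hidden unit

    for v, p in p_visible.items():
        print(v, round(p, 3))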

For larger networks, computing the partition function becomes infeasible due to the exponential number of terms. In such cases, we can use Markov chain Monte Carlo methods to obtain samples from the model. By updating units stochastically based on their energy gaps, we can reach the stationary distribution and obtain a sample from the model. The probability of a sample is proportional to e to the power of the negative energy.

Additionally, we might be interested in obtaining samples from the posterior distribution over hidden configurations given a data vector, which is necessary for learning. By using Markov chain Monte Carlo with the visible units clamped to the data vector, we can update only the hidden units and obtain samples from the posterior distribution. This information is crucial for finding good explanations for observed data and guiding learning processes.

The process of learning in a Boltzmann machine involves adjusting the weights and biases to improve the model's fit to the training data. This learning process is typically achieved through an algorithm called contrastive divergence.

Contrastive divergence is an approximation method used to estimate the gradient of the log-likelihood function. It allows us to update the weights and biases in the Boltzmann machine efficiently. The basic idea behind contrastive divergence is to perform a few steps of Gibbs sampling, which involves alternating between updating the hidden units and updating the visible units, starting from a data vector.

To update the hidden units, we can sample their states based on the probabilities determined by the energy of the joint configuration of visible and hidden units. Then, we update the visible units based on the new states of the hidden units. This process is repeated for a few iterations until the Markov chain reaches approximate equilibrium.

Once we have obtained a sample from the posterior distribution over hidden configurations, we can use it to compute the positive and negative associations between the visible and hidden units. The positive associations are calculated by taking the outer product of the data vector and the sampled hidden configuration. The negative associations are computed in a similar manner, but using the visible units obtained from the Gibbs sampling process.

By taking the difference between the positive and negative associations, we can compute the gradient of the log-likelihood function. This gradient is then used to update the weights and biases of the Boltzmann machine through a learning rule, such as stochastic gradient descent.
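
Below is a minimal Python sketch of a single contrastive-divergence (CD-1) update. Note that CD is most commonly stated for a restricted Boltzmann machine, i.e. one with no visible-visible or hidden-hidden connections, and that restriction is assumed here; the sizes, data, and learning rate are illustrative.

    # CD-1 sketch for a restricted Boltzmann machine.
    import numpy as np

    rng = np.random.default_rng(0)

    n_visible, n_hidden = 6, 3
    W = 0.01 * rng.normal(size=(n_visible, n_hidden))
    b_v = np.zeros(n_visible)
    b_h = np.zeros(n_hidden)
    lr = 0.1

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_update(v0):
        global W, b_v, b_h
        # Positive phase: hidden probabilities and a sample, given the data.
        ph0 = sigmoid(v0 @ W + b_h)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        # One Gibbs step: reconstruct the visible units, then the hidden probabilities.
        pv1 = sigmoid(h0 @ W.T + b_v)
        v1 = (rng.random(pv1.shape) < pv1).astype(float)
        ph1 = sigmoid(v1 @ W + b_h)
        # Positive minus negative associations approximate the likelihood gradient.
        W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
        b_v += lr * (v0 - v1)
        b_h += lr * (ph0 - ph1)

    # A few binary training vectors (assumed data).
    data = np.array([[1, 1, 1, 0, 0, 0],
                     [1, 0, 1, 0, 0, 0],
                     [0, 0, 0, 1, 1, 1],
                     [0, 0, 1, 1, 0, 1]], dtype=float)

    for epoch in range(1000):
        for v in data:
            cd1_update(v)
    print(np.round(W, 2))

The positive associations come from the data and the hidden probabilities it produces; the negative associations come from the one-step reconstruction, standing in for samples from the model's own distribution.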

The learning process continues by repeatedly applying contrastive divergence, sampling from the posterior distribution, and updating the weights and biases. Over time, the Boltzmann machine learns to capture the underlying patterns and distributions present in the training data.

It's important to note that training a Boltzmann machine can be a challenging task, especially for large networks with many hidden units. The computational complexity increases exponentially with the number of units, making it difficult to perform exact computations. However, approximate methods like contrastive divergence provide a practical solution for learning in Boltzmann machines.

Learning in a Boltzmann machine involves adjusting the weights and biases through contrastive divergence, which approximates the gradient of the log-likelihood function. By iteratively sampling from the posterior distribution and updating the model parameters, the Boltzmann machine can learn to model the underlying patterns in the binary data.
