
CS480/680 Lecture 15: Deep neural networks

This video covers the basics of deep learning, including the concepts of deep neural networks, the vanishing gradient problem, and the evolution of deep neural networks in image recognition tasks. The lecturer explains how deep neural networks can be used to represent functions more succinctly and how they compute features that become increasingly higher-level as the network becomes deeper. Solutions to the vanishing gradient problem are addressed, including the use of rectified linear units (ReLU) and batch normalization. The lecture also covers max-out units and their advantages as a generalization of ReLUs that allows for multiple linear parts.

The lecture on deep neural networks discusses two problems that must be resolved for effective deep learning: overfitting, caused by the expressivity of multiple-layer networks, and the high computational power required to train complex networks. The lecturer proposes solutions such as regularization and dropout during training, as well as parallel computation on GPUs or distributed hardware. The lecture also details how dropout is handled at test time by rescaling the magnitudes of the input and hidden units. Lastly, the lecture concludes by introducing some breakthrough applications of deep neural networks in speech recognition, image recognition, and machine translation.

  • 00:00:00 In this section, we learn about the basics of deep learning, specifically what a deep neural network is and how it differs from a regular neural network. We find out that the term "deep learning" is mostly used for marketing purposes, as the concept of neural networks with many hidden layers was first proposed in the 1980s. However, the advantage of using deep neural networks is that they tend to be highly expressive, allowing them to fit the data well. The challenge lies in training them effectively, which is where the vanishing gradient problem comes in.

  • 00:05:00 In this section, the lecturer discusses the issues of training large neural networks and the problem of overfitting due to the high number of weights and parameters. Researchers used to favor single-hidden-layer neural networks because they can approximate any function given enough hidden units. However, neural networks with multiple hidden layers have the advantage of reducing the overall size of the network, sometimes exponentially, as demonstrated by an example with the parity function. The lecturer shows a neural network architecture that encodes the parity function, where each hidden unit is a thresholding perceptron that encodes a logical "and", while the output unit is a logical "or".

  • 00:10:00 In this section, the lecturer explains how a neural network can be set up to detect whether the number of inputs turned on is odd or even. Each hidden unit in the fully connected network is responsible for checking one specific pattern in which an odd number of inputs is on, and the output unit is simply the OR of the hidden units. With 4 inputs there are 8 such odd patterns, so 8 hidden units are needed, one per pattern. However, the lecturer notes that with n inputs this construction requires exponentially many hidden units, making the approach not scalable and suggesting an alternative approach.
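
To make the construction concrete, here is a minimal NumPy sketch (not the lecturer's exact network) of a single-hidden-layer parity detector for 4 inputs: one threshold unit per odd pattern acting as a logical AND, and an OR output unit. All weights and thresholds are chosen by hand purely for illustration.

```python
import itertools
import numpy as np

def threshold(z):
    return (z >= 0).astype(float)   # perceptron-style step activation

# All 8 input patterns of 4 bits with an odd number of ones.
odd_patterns = [p for p in itertools.product([0, 1], repeat=4) if sum(p) % 2 == 1]

# Hidden layer: one AND-like unit per odd pattern.
# A unit fires only when the input matches its pattern exactly.
W_hidden = np.array([[1.0 if bit == 1 else -1.0 for bit in p] for p in odd_patterns])
b_hidden = np.array([-sum(p) + 0.5 for p in odd_patterns])

# Output layer: OR of the hidden units.
w_out, b_out = np.ones(len(odd_patterns)), -0.5

def parity_net(x):
    h = threshold(W_hidden @ x + b_hidden)
    return threshold(w_out @ h + b_out)

for x in itertools.product([0, 1], repeat=4):
    assert parity_net(np.array(x, float)) == sum(x) % 2
print("shallow network reproduces the parity of 4 inputs with 8 hidden units")
```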

  • 00:15:00 In this section, the lecturer talks about the concept of deep neural networks, which involve multiple layers and can be used to represent functions more succinctly. The lecture provides an example of a function, the parity function, which can only be represented by an exponentially larger network with just one hidden layer or a linearly sized network with multiple hidden layers. The lecturer then discusses how deep neural networks can be used in practice for computer vision tasks, such as facial recognition, where inputs (such as pixel intensities) are fed into the network and intermediate values are computed to produce a classification at the output.

  • 00:20:00 In this section, the video discusses how deep neural networks compute features that are simple at the beginning of the network and become progressively higher-level as we go deeper. In computer vision, before deep learning, practitioners would manually design features for their tasks. However, deep learning allows for features to be learned as part of the network, making it possible to work with raw data. This breakthrough was pioneered by Geoff Hinton in 2006, who designed the first effective deep neural network.

  • 00:25:00 In this section, the history of deep neural networks and their breakthroughs in speech recognition and image classification are discussed. The first breakthrough came in 2009 when Geoff Hinton developed a way to train deep neural networks layer by layer using restricted Boltzmann machines, leading to a significant improvement in speech recognition benchmarks. Recurrent neural networks then replaced the restricted Boltzmann machines around 2013, leading to even better results. The second breakthrough came in image classification when the ImageNet Large Scale Visual Recognition Challenge was proposed in 2010. Despite years of research, computers could not accurately classify images among 1000 categories. However, by 2012 deep learning algorithms had reduced the error rate from 26% to 15% and by 2016 Microsoft had achieved an error rate of 3.1%, beating human performance.

  • 00:30:00 In this section, the speaker discusses the history and evolution of deep neural networks, particularly in image recognition tasks. The error rate for image classification tasks was significantly reduced in 2012 with the introduction of a convolutional neural network called AlexNet by Geoff Hinton's group. This led to the understanding that neural networks can achieve remarkable results, and more sophisticated architectures were designed to further improve the error rate. Over time, the depth of networks increased, and there was a clear trend towards deeper networks. The ability to apply and use deep neural networks for image recognition tasks was a result of various innovations, including better training techniques and preventing overfitting.

  • 00:35:00 In this section, the problem of vanishing gradients in deep neural networks is addressed: the partial derivatives with respect to weights in earlier layers become smaller in magnitude, eventually becoming negligible as the network gets deeper. This made it difficult for researchers to train neural networks with multiple layers, because the bottom layers were not getting trained and therefore did not provide meaningful features to improve the network's predictions. The issue was due in part to the activation functions used, such as the sigmoid or hyperbolic tangent, whose gradients are always less than 1 in magnitude, making it hard to optimize the weights in the early layers.

  • 00:40:00 In this section, the lecturer explains the problem of the gradient vanishing in a deep neural network. He creates a toy neural network with a sigmoid activation unit and shows how the gradient consists of partial derivatives that are products of factors, each factor being either the partial derivative of the sigmoid or a weight. As the partial derivatives of the sigmoid are always less than one and the weights are initialized to magnitudes less than one, multiplying these factors tends to make the partial derivatives smaller and smaller. This results in weights having less and less impact as we go back into the layers, giving rise to the gradient vanishing problem. The lecturer then introduces some common solutions such as pre-training, different activation functions, skip connections, and batch normalization, and focuses on rectified linear units and max out units as possible solutions.
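
The effect described here can be reproduced with a tiny numerical experiment. The sketch below is only illustrative and not from the lecture: it chains single-unit sigmoid layers with small random weights and shows the gradient factor shrinking as it is backpropagated through more layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
depth = 30
weights = rng.uniform(-0.5, 0.5, size=depth)      # small initial weights, |w| < 1

# Forward pass through a chain of one-unit sigmoid layers: a_k = sigmoid(w_k * a_{k-1}).
a = 1.0
activations = []
for w in weights:
    a = sigmoid(w * a)
    activations.append(a)

# Backpropagate: each layer contributes a factor sigmoid'(z_k) * w_k, and
# sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) <= 0.25, so the product shrinks fast.
grad = 1.0
for layer, (w, a) in enumerate(zip(reversed(weights), reversed(activations)), 1):
    grad *= a * (1.0 - a) * w
    if layer % 10 == 0:
        print(f"gradient magnitude after backpropagating {layer} layers: {abs(grad):.3e}")
```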

  • 00:45:00 In this section, the lecturer discusses solutions to the vanishing gradient problem that arises from problematic activation functions. One possible solution is to use activation functions whose gradient does not decay, such as the rectified linear unit (ReLU), which returns the linear combination of its inputs when it is positive and zero otherwise, so its derivative is exactly one on the linear part. Another solution is batch normalization, which ensures that the data effectively stays in a range where the gradient tends to be close to one. These solutions tolerate some paths with vanishing gradients as long as enough paths have gradients of one, which propagates the gradient through the neural network.

  • 00:50:00 In this section, the lecturer discusses rectified linear units (ReLUs) and their advantages and disadvantages. ReLUs were initially criticized because they have a non-differentiable kink at zero, which in principle causes issues when computing gradients for gradient descent. However, this issue is not significant in practice since the numerical values are rarely exactly zero. In contrast, the softplus function, a smooth approximation of the ReLU, is continuous and differentiable everywhere, but its gradient is less than one everywhere; hence, smoothing the ReLU does not help eliminate the vanishing gradient problem. Despite ReLUs having a flat zero region where a unit's output is ignored, they are still useful because for some inputs each unit will operate in its linear part.

  • 00:55:00 In this section, the speaker discusses the advantages of rectified linear units (ReLUs) and introduces the concept of max-out units. He explains that ReLUs have become popular because, in cases where the gradient does not vanish, they can be trained faster, requiring fewer gradient descent steps. The speaker then introduces max-out units as a generalization of ReLUs that allows multiple linear parts, rather than just a zero part and a linear part, and demonstrates how they are constructed by taking the max of several different linear combinations. The shape of a max-out unit thus consists of multiple linear pieces, each corresponding to a line, and the unit can be seen as a hidden layer of identity units aggregated by a max unit.
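
As a rough illustration of the idea (with made-up weights, not the lecturer's example), the snippet below implements a max-out unit as the max of several linear combinations and checks that a ReLU is recovered as the special case with one linear piece and one constant-zero piece.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def maxout(x, W, b):
    """Max-out unit: the max of several linear combinations w_k . x + b_k."""
    return np.max(W @ x + b, axis=0)

x = np.array([0.7, -1.2])

# A ReLU is the special case with two pieces: w.x + b and the constant 0.
w, bias = np.array([1.5, -0.5]), 0.1
W_relu_like = np.stack([w, np.zeros_like(w)])
b_relu_like = np.array([bias, 0.0])
assert np.isclose(maxout(x, W_relu_like, b_relu_like), relu(w @ x + bias))

# A general max-out unit with three linear pieces gives a piecewise-linear shape.
W = np.array([[1.0, 0.0], [-1.0, 2.0], [0.5, 0.5]])
b = np.array([0.0, 0.3, -0.2])
print("max-out output:", maxout(x, W, b))
```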

  • 01:00:00 In this section of the lecture, the professor discusses two problems that need to be resolved for deep learning to be effective. The first problem is the issue of overfitting, which arises because of the high expressivity of multiple-layer networks. Regularization is one solution that involves minimizing the magnitude of weights to keep them small and constrained. Another solution is dropout, where some network units are randomly dropped during training to force the network to be robust and prevent overfitting. The second problem is the need for high computational power to train complex networks, which can be achieved through parallel computing using GPUs or distributed computing.

  • 01:05:00 In this section, the speaker discusses the use of dropout during testing time for deep neural networks. During training, dropout is a technique where some of the input or hidden units are randomly dropped from the network to prevent overfitting. However, during testing, the entire network is used, which can cause the magnitudes of the linear combinations to be higher. To solve this problem, the input units are rescaled by multiplying them by 1 minus the probability of dropping them, and the same is done for the hidden units. The speaker provides an example of a fully connected network with three inputs, four hidden units, and one output, and explains the use of a random number generator to drop some of the input and hidden units during training.

  • 01:10:00 In this section, the instructor discusses what happens if all the input or hidden units are removed in a neural network and how dropout regularization addresses this issue. Although it is unlikely that all units are removed, it could impact accuracy if they are. Dropout regularization helps prevent overfitting and forces the network to become robust with respect to dropped features. The dropout algorithm involves sampling Bernoulli variables to create a "mutilated" network in which some units are dropped during training, and, at test time, multiplying the units' magnitudes by 1 minus the probability of being dropped. During training, the gradient is computed with respect to the mutilated network.
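
A minimal sketch of the two phases described here, assuming a drop probability of 0.5 chosen only for illustration: units are zeroed out at random during training, and at test time every unit is kept but rescaled by 1 minus the drop probability.

```python
import numpy as np

rng = np.random.default_rng(1)
p_drop = 0.5                       # probability of dropping a unit (assumed value)

def dropout_train(h, p=p_drop):
    """Training: sample Bernoulli variables and zero out ('mutilate') some units."""
    keep_mask = rng.binomial(1, 1.0 - p, size=h.shape)
    return h * keep_mask

def dropout_test(h, p=p_drop):
    """Testing: keep every unit but rescale by (1 - p) so expected magnitudes match."""
    return h * (1.0 - p)

h = np.array([0.8, 1.5, -0.3, 2.1])
print("mutilated (training):", dropout_train(h))
print("rescaled  (testing): ", dropout_test(h))
```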

  • 01:15:00 In this section, the presenter discusses the dropout technique used in deep neural networks to make the network robust and to prevent overfitting. Dropout can be viewed as a form of approximate ensemble learning: each iteration samples a mutilated network by dropping out specific nodes, yielding one hypothesis or function that could encode what is being learned, and the full network can be thought of as an average of all these mutilated networks, with an adjustment to what is being computed. This perspective is similar to Bayesian learning, and dropout has been shown to approximate Bayesian inference in a deep Gaussian process, which helps justify why it works well in practice. The presenter concludes by introducing some applications where deep neural networks have produced breakthroughs, including speech recognition, image recognition, and machine translation.

  • 01:20:00 In this section, the speaker describes the historical state-of-the-art method for speech recognition, which was a hidden Markov model that used a mixture of Gaussians. However, in 2009, Geoff Hinton and his research group proposed replacing the Gaussian mixture with a deep neural network that used a stacked restricted Boltzmann machine. This hybrid model between a probabilistic model and a deep neural network led to a significant reduction in the error rate, which was observed across several benchmarks. Due to this breakthrough, several companies, including Google and Microsoft, started leveraging deep neural networks, ultimately leading to a renaissance in the field of deep learning.

  • 01:25:00 In this section, the lecturer discusses the breakthroughs in neural networks, starting with the image recognition breakthrough that occurred in 2012. The breakthrough was due to the development of convolutional neural networks which take 2D arrays of pixel intensities as input, have convolution layers that compute features at different granularities, and dense layers that are fully connected. Data augmentation was also used to improve recognition by making it invariant to rotation and other factors. The result was a significant reduction in error rate from 26.2% to 16.4% for the top entry in a competition. Though 16% is still relatively high, it is difficult to classify images accurately among thousands of classes, and the top five prediction accuracy was measured rather than the top one.

  • 01:30:00 In this section, the lecturer discusses the performance of a deep neural network using an image of a mite as an example. The algorithm returns five potential classes and assigns a confidence score to each one to indicate the probability of it being the correct class. The network generally performs well, correctly recognizing objects such as a container ship and a motor scooter with high confidence, but there are instances where it misclassifies an object.
 

CS480/680 Lecture 16: Convolutional neural networks

This video introduces convolutional neural networks (CNNs) and explains their importance in image processing as a specific type of neural network with key properties. The lecturer discusses how convolution can be used for image processing, such as in edge detection, and how CNNs can detect features in a similar way. The concept of convolutional layers and their parameters is explained, along with the process of training CNNs using backpropagation and gradient descent with shared weights. The lecturer also provides design principles for creating effective CNN architectures, such as using smaller filters and nonlinear activation after every convolution.

In this lecture on Convolutional Neural Networks (CNNs), the speaker discusses the concept of residual connections as a solution to the vanishing gradient problem faced by deep neural networks. These skip connections allow for shortening of network paths and ignoring of useless layers while still being able to use them if needed to avoid producing outputs close to zero. The use of batch normalization techniques is also introduced to mitigate the problem of vanishing gradients. Furthermore, the speaker notes that CNNs can be applied to sequential data and tensors with more than two dimensions, such as in video sequences, and that 3D CNNs are also a possibility for certain applications. The TensorFlow framework is highlighted as being designed for computation with multi-dimensional arrays.

  • 00:00:00 In this section, the presenter introduces convolutional neural networks (CNNs) and explains their importance in image processing as a specific type of neural network with key properties. The lecture goes on to discuss how CNNs can scale to handle large datasets and sequences. The presenter explains that CNNs are named after the mathematical operation of convolution, which modifies two functions to produce a third function, with an example of using convolution for smoothing. The lecture notes also make use of Gaussians as weighting functions for the convolution operation.

  • 00:05:00 In this section, the concept of convolution in both continuous and discrete cases is discussed, where the output, Y, is a weighted combination of X's in a neighborhood. When applied to images, this is a 2-dimensional function, where each pixel is a measurement of that function at a specific coordinate in the x and y directions. The weights applied to each pixel intensity can produce a new image, Y. As an example, a simple convolution can be used for edge detection in a grayscale image to detect vertical edges.
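
The edge-detection example can be sketched in a few lines of NumPy. The image and filter values below are made up for illustration, and, as in most deep learning libraries, the operation implemented is really a cross-correlation that is conventionally called convolution.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Discrete 2-D convolution (cross-correlation) with no padding."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A toy grayscale image: dark on the left half, bright on the right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A simple vertical-edge filter: responds where intensity changes left to right.
vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0]])

print(conv2d_valid(image, vertical_edge))   # large values along the vertical edge
```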

  • 00:10:00 In this section, the speaker discusses how convolutions can be used to detect features in neural networks. A convolution is essentially a linear combination of a subset of units based on a specific pattern of weights, which can help detect features like edges or other patterns that might be important for a given task. The speaker also explains that the pattern of weights defines the filter used to detect a feature in a neighborhood, and a nonlinear activation function amplifies the output. Gabor filters are a popular class of filters that correspond to common feature maps and are inspired by how the human visual cortex works.

  • 00:15:00 In this section, the lecturer explains how convolutional neural networks work. The idea is to detect small edges in an image by applying patches of weights that correspond to a particular feature, where each weight's magnitude is depicted by its color. These patches are applied to an image by alternating between convolution and pooling layers. The convolutional layer computes a convolution by sliding a filter of a fixed size, with the same weights, across the input to produce another vector. The key elements of a convolutional neural network are these alternating convolution and pooling layers that detect different features in an image.

  • 00:20:00 In this section, the concept of convolutional layers in neural networks is explained. Convolutional layers use a fixed-size window, or patch, with a set of weights, or filter, applied to it. This filter is reused across each window in the layer, generating a much sparser representation of connections between inputs and outputs in comparison to a fully connected layer. In a 1D example, a patch of size 3 by 1 is taken and a filter is applied to each window of inputs. Similarly, in a 2D example, a patch of size 3 by 3 is taken, with the same set of weights applied across sliding windows to detect specific features such as edges. By reusing the same filter across instances of the window, convolutional layers allow for more compact and efficient network design.

  • 00:25:00 In this section, the lecturer explains how convolutional neural networks work with image and audio signals by using the same set of weights for every patch of the image or signal. The network then applies a pooling filter, which provides local translation invariance, allowing it to recognize features regardless of their exact location. This method can be used for digit recognition, with a bitmap image as input and a label from 0 to 9 as output. The lecturer notes that backpropagation and automatic differentiation handle the shared weights, updating together the weights of edges that share the same weight.

  • 00:30:00 In this section of the video, the lecturer explains how convolutional neural networks (CNNs) work. The first step is to apply a 5x5 convolution to the input image using a filter, which allows detecting larger features than smaller filters. This produces a feature map of size 28x28, which can be used to check for the presence or absence of features in different locations. Next, a max pooling layer is applied to reduce the size of the feature map to 14x14 by taking the max of each 2x2 patch. Another convolution is then applied using a 5x5 filter to detect higher-level features, which produces 12 feature maps that undergo max pooling again. The intuition behind max pooling is that the exact location of some features, such as eyes or nose in face recognition, might vary slightly.

  • 00:35:00 In this section, the lecturer discusses the second part of the network, which is designed for classification. The common approach is to take a fully connected layer, flatten the features, and construct a vector of nodes that computes the classes, with the weights adjusted through backpropagation. The beauty of convolutional neural networks is that the weights of the convolutional filters are not designed by humans; they are initialized randomly and updated as the network is trained, allowing the network to learn to extract relevant features. The network is able to optimize and come up with features that work better in practice through a data-driven solution.

  • 00:40:00 In this section, the lecturer discusses the concept of sparse connections in convolutional neural networks, which refers to the fact that nodes only have a few connections rather than being fully connected. This allows for a much smaller number of weights and a sparser computation. The lecturer also explains how parameters such as the number of filters, kernel size, stride, and padding are specified in the convolutional layer of a neural network. The examples provided help to further clarify how these parameters are used in defining convolutional layers.

  • 00:45:00 In this section, the lecturer explains how convolutional neural networks work. The lecturer demonstrates how a convolutional layer processes an input image by applying a kernel to it. The size of the kernel determines the size of the output and the stride determines how much the kernel moves across the input. Padding can also be used to maintain the input's original size. The lecturer provides examples of how different kernel sizes and strides affect the output size of the convolutional layer.
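
The relationship between kernel size, stride, padding, and output size follows a simple formula; the sketch below uses hypothetical settings rather than the lecturer's exact examples.

```python
def conv_output_size(n_in, kernel, stride=1, padding=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n_in + 2 * padding - kernel) // stride + 1

# A few illustrative settings (made-up numbers, not the lecture's examples):
print(conv_output_size(32, kernel=5, stride=1, padding=0))   # 28
print(conv_output_size(32, kernel=5, stride=1, padding=2))   # 32 ("same" padding)
print(conv_output_size(32, kernel=3, stride=2, padding=1))   # 16
```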

  • 00:50:00 In this section, the lecturer discusses the process of training convolutional neural networks (CNNs) using backpropagation and gradient descent, with the weights being shared among the variables. The process of computing the partial derivative is no different if a variable appears multiple times in the function, and algorithms such as Adam and RMSprop can be used for training. When it comes to designing a neural network architecture, it is problem-dependent and an art more than a science. However, some rules of thumb have shown good results, such as using a stack of small filters instead of a single large filter for fewer parameters and a deeper network.

  • 00:55:00 In this section of the video, the instructor explains a rule of thumb for designing convolutional neural network (CNN) architectures. He suggests that using smaller filters tends to work better and produces fewer parameters than using larger filters. By using a stack of smaller filters in place of a larger filter, the receptive field remains the same while the number of parameters needed is reduced. Additionally, adding a nonlinear activation after every convolution can improve the performance of CNNs. These design principles can be useful for creating effective architectures for various applications.
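
A quick parameter count illustrates the rule of thumb. The numbers below (64 input and output channels, one 7x7 filter versus three stacked 3x3 filters) are assumptions chosen for illustration, not figures from the lecture.

```python
def conv_params(kernel, channels_in, channels_out):
    """Number of weights in one 2-D convolution layer (ignoring biases)."""
    return kernel * kernel * channels_in * channels_out

c = 64  # assume 64 channels in and out, purely for illustration

# One 7x7 layer vs. a stack of three 3x3 layers: same 7x7 receptive field,
# but far fewer parameters and two extra nonlinearities in between.
single_7x7 = conv_params(7, c, c)
stacked_3x3 = 3 * conv_params(3, c, c)
print(f"one 7x7 layer   : {single_7x7:,} weights")
print(f"three 3x3 layers: {stacked_3x3:,} weights")
```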

  • 01:00:00 In this section, the use of residual layers in convolutional neural networks is discussed. Residual layers were proposed in 2015 as a way to avoid the degradation in the quality of networks caused by adding too many layers. The idea is to create skip connections to shorten the paths into the network, effectively reducing the depth and propagating the gradient more effectively. The residual connection skips some layers and adds the input X to the output of the skipped layers. This way, if the additional layers are not useful, they can be ignored without hurting the network's performance.

  • 01:05:00 In this section, the speaker introduces the concept of residual connections in convolutional neural networks (CNNs), and explains how they can solve the problem of vanishing gradients. By using skip connections, which essentially add the identity function to the output of a layer, the network is given the option to ignore certain layers that are not useful, while still being able to use them if it wishes to. This avoids the problem of layers producing outputs close to zero, which can result in the network ignoring those layers altogether. The speaker also mentions that the skip connections do not affect the gradient size, and suggests the use of batch normalization as another approach to mitigate the problem of vanishing gradients.
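
A minimal sketch of a residual block, assuming a toy two-layer function F with near-zero weights to show that the block then behaves close to the identity; the shapes and weight scales are made up for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2):
    """Two small layers plus a skip connection: output = relu(F(x) + x).
    If the layers learn nothing useful (F(x) ~ 0), the block is close to the identity."""
    f = W2 @ relu(W1 @ x)
    return relu(f + x)

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)
W1 = 0.01 * rng.normal(size=(d, d))   # near-zero weights -> F(x) is tiny
W2 = 0.01 * rng.normal(size=(d, d))
print(np.allclose(residual_block(x, W1, W2), relu(x), atol=1e-2))   # ~identity path
```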

  • 01:10:00 In this section of the video, the speaker discusses techniques for dealing with issues like the vanishing gradient problem and normalization in convolutional neural networks. Batch normalization is a commonly used heuristic in which values are normalized over the current batch of data to have variance 1 and mean 0, separately for each dimension. Additionally, skip connections help propagate gradients faster, as they provide shorter paths for backpropagation to take. Finally, the speaker notes that convolutional neural networks can be used for more than just computer vision, including sequential data and tensors with more than two dimensions, as seen in applications like video sequences. The TensorFlow framework is designed to perform computations over multi-dimensional arrays, rather than being restricted to just vectors or matrices.
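
A simple sketch of the batch normalization heuristic described here, normalizing each dimension over a batch to mean 0 and variance 1; the learnable scale and shift (often called gamma and beta) are included with default values.

```python
import numpy as np

def batch_norm(X, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature dimension over the batch to mean 0 and variance 1,
    then apply a learnable scale (gamma) and shift (beta)."""
    mean = X.mean(axis=0)            # per-dimension statistics over the batch
    var = X.var(axis=0)
    X_hat = (X - mean) / np.sqrt(var + eps)
    return gamma * X_hat + beta

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(32, 4))    # a batch of 32 examples, 4 features
Xn = batch_norm(X)
print(Xn.mean(axis=0).round(3), Xn.var(axis=0).round(3))   # ~0 and ~1 per dimension
```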

  • 01:15:00 In this section, it is mentioned that 3D convolutional neural networks exist and although they're not as common, there are some applications where they can be used.
 

CS480/680 Lecture 17: Hidden Markov Models

The lecture introduces Hidden Markov Models (HMMs), a type of probabilistic graphical model used to exploit correlations in sequence data, which can improve accuracy. The model assumptions are a stationary process and a Markovian process, whereby a hidden state depends only on the previous state. The three distributions in an HMM are the initial state distribution, the transition distribution, and the emission distribution, with the type of emission distribution chosen according to the type of data. The model can be used for monitoring, prediction, filtering, smoothing, and most-likely-explanation tasks. HMMs have been used for speech recognition and for activity recognition, such as predicting the most likely sequence of activities from sensor measurements collected from the walkers that older people use for stability. An experiment with a walker instrumented with sensors and cameras was conducted in a retirement facility to automatically recognize the activities performed by older adults. Supervised and unsupervised learning in the context of activity recognition were also discussed, along with a demonstration.

The lecture focuses on the use of Gaussian emission distributions in Hidden Markov Models (HMMs), which are common in practical applications where the collected data is continuous. The lecturer explains that this involves estimating mean and variance parameters that correspond to the empirical mean and variance of the data associated with each hidden state. The solutions for the initial and transition distributions correspond to relative frequency counts, and maximum likelihood is used to obtain all of these estimates. This approach is similar to the solution for mixtures of Gaussians, which also involve an initial distribution and an emission distribution.

  • 00:00:00 In this section, the lecturer introduces the concept of Hidden Markov Models (HMM) which are different from the neural networks that have been discussed so far. The lecturer explains that HMM can be used when the data is coming from sequences as opposed to independent data points, and the predictions for one data point are correlated with the predictions for the next data point. The lecturer provides the example of speech recognition where the prediction of a phoneme or word is correlated with the next phoneme or word. Exploiting these correlations can improve the accuracy of predictions. The lecturer also explains that HMM can be generalized into a recurrent neural network (RNN) that can deal with sequence data and propagate information between different points in a sequence, which will be discussed later.

  • 00:05:00 In this section of the lecture, the speaker introduces hidden Markov models as a generalization of mixtures of Gaussians. He explains that hidden Markov models exploit correlations in sequential data to boost accuracy, and express a distribution over the hidden variable y together with a conditional distribution of x given y. This differs from a mixture of Gaussians, where a class-conditional distribution over the input x is used after y is sampled from a multinomial distribution. The speaker also draws a comparison between this model and conditional random fields and recurrent neural networks.

  • 00:10:00 In this section, the lecturer explains the assumptions made when designing a hidden Markov model. The first assumption is that the process is stationary, meaning that the transition and emission distributions are independent of time. The second assumption is that the process is Markovian, meaning that a given hidden state only depends on the previous hidden state. These assumptions create a probabilistic graphical model with an initial distribution, a transition distribution, and an emission distribution, which together form a joint distribution. The initial distribution describes the distribution for the first hidden state and is typically a multinomial.

  • 00:15:00 In this section, we learn about the three distributions in Hidden Markov Models: initial state distribution, transition distribution, and emission distribution. The Gaussian emission distribution is used for continuous data, while the multinomial emission distribution is useful for discrete data, such as sequences of words for natural language processing. By multiplying these distributions together, we can derive the joint distribution, which can be used for various applications like robot localization.

  • 00:20:00 In this section, we learn about the problem of a robot getting lost due to the drift and inaccuracies in the odometer readings. A solution to this problem is the use of a hidden Markov model, where the Y's, the hidden state, correspond to the location coordinates of the robot and the inputs correspond to some measurements by sensors. The transition distribution captures the probability that the robot might end up in different locations due to uncertainties in motion, while the emission distribution has a distribution over the measurements obtained by sensors to account for measurement inaccuracies. The hidden Markov model can be used for localization, which involves computing the probability of the robot's location at any given time step.

  • 00:25:00 In this section, the speaker explains the four broad categories into which tasks related to Hidden Markov Models (HMMs) can be classified: monitoring, prediction, hindsight reasoning (smoothing), and most likely explanation. For the monitoring task, the algorithm used is known as the forward algorithm. It involves a recursive decomposition of the query in terms of the probability of the previous hidden state given all the previous measurements, which allows computing the probability of the current hidden state Y given the measurements X so far. The algorithm works by computing the first hidden state given the first measurement, then computing the next hidden state given the measurements up to that time step, and keeps extending the sequence by going forward in time.
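
A compact sketch of the forward algorithm on a toy HMM with two hidden states and three discrete observations; all probabilities below are invented for illustration and are not the lecture's example.

```python
import numpy as np

# A tiny HMM with 2 hidden states and 3 possible discrete observations.
pi = np.array([0.6, 0.4])                 # initial state distribution  P(y_1)
T = np.array([[0.7, 0.3],                 # transition  P(y_t | y_{t-1})
              [0.2, 0.8]])
E = np.array([[0.5, 0.4, 0.1],            # emission    P(x_t | y_t)
              [0.1, 0.3, 0.6]])

def forward(observations):
    """Forward algorithm: returns P(y_t | x_1..x_t) for each t (monitoring/filtering)."""
    alpha = pi * E[:, observations[0]]
    alpha /= alpha.sum()
    beliefs = [alpha]
    for x in observations[1:]:
        alpha = (T.T @ alpha) * E[:, x]   # predict one step, then weight by the evidence
        alpha /= alpha.sum()              # normalize to get a conditional distribution
        beliefs.append(alpha)
    return np.array(beliefs)

print(forward([0, 2, 1, 2]))
```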

  • 00:30:00 In this section, the lecturer discusses the prediction task using Hidden Markov Models (HMMs), which involves predicting the future state of a system given the current state. Examples of this task include weather and stock market prediction. The computation is done similarly to monitoring, using a forward algorithm with two phases: monitoring and prediction. In the example provided, the lecturer shows how to compute the probability of Y4 given X1 and X2 only. The lecturer also mentions that HMMs with prediction could be used for text generation, where the model predicts the next observable text given the current text.

  • 00:35:00 In this section, the lecturer discusses the tasks of Hidden Markov Models (HMMs) which include filtering, smoothing, and hindsight reasoning. Filtering refers to predicting the current state of a system based on past observations, while smoothing refers to predicting earlier states using observations both before and after that state. Hindsight reasoning involves computing the property of a state in the past given observations before and after that state. The lecturer highlights that HMMs are no longer the state of the art for these tasks, but they are a precursor to recurrent neural networks that tend to be more effective. The computation for these tasks is done in a recursive manner, leading to the creation of the forward-backward algorithm.

  • 00:40:00 In this section, the speaker discusses the use of Hidden Markov Models (HMMs) for speech recognition and machine translation. HMMs are used to compute the most likely sequence of hidden states given a sequence of inputs, and the Viterbi algorithm, a dynamic programming procedure, carries out this maximization. An application to activity recognition is also discussed, using sensor measurements from the walker devices that older people use to walk. Inferring the activities of a person with a walker helps determine the maneuvers most likely to trigger a fall, since falls have been observed to occur in some situations even though walkers are used for stability.
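
For comparison with the forward algorithm, here is a small sketch of the Viterbi algorithm on the same toy HMM parameters (again invented numbers, worked in log space for numerical stability).

```python
import numpy as np

def viterbi(observations, pi, T, E):
    """Most likely hidden state sequence given the observations (computed in log space)."""
    log_pi, log_T, log_E = np.log(pi), np.log(T), np.log(E)

    delta = log_pi + log_E[:, observations[0]]        # best log-prob ending in each state
    backpointers = []
    for x in observations[1:]:
        scores = delta[:, None] + log_T                # scores[i, j]: come from i, go to j
        backpointers.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + log_E[:, x]

    # Trace the best path backwards.
    state = int(delta.argmax())
    path = [state]
    for bp in reversed(backpointers):
        state = int(bp[state])
        path.append(state)
    return list(reversed(path))

# Reusing the toy HMM parameters sketched above for the forward algorithm.
pi = np.array([0.6, 0.4])
T = np.array([[0.7, 0.3], [0.2, 0.8]])
E = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 2, 2, 1], pi, T, E))
```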

  • 00:45:00 In this section, the speaker discusses a study where a modified walker with sensors and cameras was used to collect data on the activities of older adults in a retirement facility. The walker had sensors such as a 3D accelerometer and load sensors that measured the weight on each leg of the walker, and a camera that looked backward at the legs. The experiment involved having the participants go through an obstacle course that simulated common daily activities. The data collected was used to develop a Hidden Markov model that automatically recognized the activities performed by the participants. The model had eight channels for the sensors and used machine learning to estimate the parameters of the initial transition and emission distributions.

  • 00:50:00 In this section, the speaker discusses a demonstration of an algorithm that predicts a person's activity based on sensor measurements. The algorithm uses a Hidden Markov Model or Conditional Random Field to track the person's activity and output predictions, which are then compared to manually labeled correct behaviors. The person's activity is visually represented as fluctuating curves, and the video's right panel displays 13 separate activities indicated by a red square for the correct behavior and a blue square for the algorithm's prediction. The speaker explains that, while theoretically possible, having the person wearing the sensors indicate their activity is not practical as the person may not always be a reliable judge of their own movements, and it can be awkward to have someone continuously announce their actions. Additionally, if unsupervised learning were used, the algorithm would infer an activity but would not be able to name it accurately.

  • 00:55:00 In this section, the speaker discusses the approach taken to both supervised and unsupervised learning in the context of activity recognition. For supervised learning, the Y's are known and the objective is to maximize the likelihood of the data. The approach discussed is to compute the derivative, set it to zero, isolate the parameters, and obtain estimates for the parameters pi, theta, and phi. In the case of two activities and binary measurements, it is possible to expand the joint distribution of the model and set the derivative to zero. The resulting answers are natural and amount to relative frequency counts, such as the ratio of the number of data points in each class.

  • 01:00:00 In this section, the lecturer discusses the use of Gaussian emission distributions, which are common in practical applications because the data collected is often continuous. This involves using mean and variance parameters that correspond to the empirical mean and variance of the collected data for each hidden state. The solutions for the initial and transition distributions are the same as before and correspond to relative frequency counts, and maximum likelihood is used to obtain all of these estimates. This technique is similar to the solution for mixtures of Gaussians, where we also have an initial and an emission distribution.
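
A minimal sketch of these supervised maximum-likelihood estimates, assuming fully labelled toy sequences: relative frequency counts for the initial and transition distributions and per-state empirical means and variances for the Gaussian emissions. The data and the function name are made up for illustration.

```python
import numpy as np

def fit_supervised_hmm(state_seqs, obs_seqs, n_states):
    """Maximum-likelihood estimates when the hidden states are observed (supervised case):
    relative frequency counts for the initial and transition distributions, and the
    empirical mean/variance per state for a Gaussian emission distribution."""
    pi = np.zeros(n_states)
    trans = np.zeros((n_states, n_states))
    per_state = [[] for _ in range(n_states)]

    for states, obs in zip(state_seqs, obs_seqs):
        pi[states[0]] += 1
        for a, b in zip(states[:-1], states[1:]):
            trans[a, b] += 1
        for s, x in zip(states, obs):
            per_state[s].append(x)

    pi /= pi.sum()
    trans /= trans.sum(axis=1, keepdims=True)
    means = np.array([np.mean(v) for v in per_state])
    variances = np.array([np.var(v) for v in per_state])
    return pi, trans, means, variances

# Tiny made-up labelled data: two activities with different sensor levels.
states = [[0, 0, 1, 1, 1], [1, 1, 0, 0, 0]]
obs = [[0.1, 0.3, 2.1, 1.9, 2.2], [2.0, 1.8, 0.2, 0.0, 0.4]]
print(fit_supervised_hmm(states, obs, n_states=2))
```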
 

CS480/680 Lecture 18: Recurrent and recursive neural networks

In this lecture, the speaker introduces recurrent and recursive neural networks as models suitable for sequential data without a fixed length. Recurrent neural networks can handle sequences of any length because certain nodes have outputs that are fed back as inputs, and the hidden state H at every time step is computed with the same function f, which implies weight sharing. However, they can suffer from limitations such as not remembering information from early inputs and prediction drift. The lecturer also explains the bidirectional recurrent neural network (BRNN) architecture and the encoder-decoder model, which uses two RNNs, an encoder and a decoder, for applications where the input and output sequences do not naturally align. Additionally, the lecturer describes the benefits of Long Short-Term Memory (LSTM) units, which mitigate the vanishing gradient problem, facilitate long-range dependencies, and selectively allow or block the flow of information.

This lecture on recurrent and recursive neural networks covers a range of topics, including the use of Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) units to prevent gradient problems, as well as the importance of attention mechanisms in machine translation for preserving sentence meaning and word alignment. The lecturer also discusses how recurrent neural networks can be generalized to recursive neural networks for sequences, graphs, and trees, and how to parse sentences and produce sentence embeddings using parse trees.

  • 00:00:00 In this section of the video, the speaker introduces recurrent and recursive neural networks as models suitable for sequential data without a fixed length. Feed-forward neural networks, previously discussed, assume a fixed-length input, which poses issues when dealing with variable-length data, such as time series data or machine translation. Recurrent neural networks, which have certain nodes with outputs fed back as inputs, can handle sequences of any length. The speaker explains this using a template and an unrolled version of the network. Recursive neural networks, which generalize to trees or graphs, are also discussed.

  • 00:05:00 In this section, the speaker discusses how recurrent neural networks connect across different time steps and how they are trained. To train RNNs, gradient descent is used along with a technique known as backpropagation through time, which involves unrolling the network over time into a feed-forward neural network. The speaker also notes that the hidden state H at every time step is computed with the same function f, which implies weight sharing: the function f takes as input both the previous H and the current X, and the same weights are used at every time step.

  • 00:10:00 In this section, the lecturer explains recurrent neural networks (RNNs) and weight sharing. RNNs apply the same function, with the same weights, at every time step. This weight sharing changes how the gradient is derived during backpropagation. The lecturer also mentions that H is generally a vector, with f being a function that outputs a vector. This setup creates challenges for training, namely the gradient vanishing and explosion problem: repeatedly multiplying factors smaller than one leads to a vanishing gradient, while factors greater than one lead to an exploding gradient.
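
A minimal sketch of the recurrence and the weight sharing it implies: one set of weights defines the function f, and unrolling the same step over a sequence of any length yields a feed-forward computation. The sizes and random weights below are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 3, 5

# One set of weights, reused at every time step (this is the weight sharing).
W_hh = 0.1 * rng.normal(size=(d_hidden, d_hidden))
W_xh = 0.1 * rng.normal(size=(d_hidden, d_in))
b_h = np.zeros(d_hidden)

def rnn_step(h_prev, x):
    """h_t = f(h_{t-1}, x_t) with the same f (same weights) at every time step."""
    return np.tanh(W_hh @ h_prev + W_xh @ x + b_h)

def unroll(sequence):
    """Unrolling over time turns the recurrence into a feed-forward computation."""
    h = np.zeros(d_hidden)
    hidden_states = []
    for x in sequence:
        h = rnn_step(h, x)
        hidden_states.append(h)
    return hidden_states

sequence = [rng.normal(size=d_in) for _ in range(7)]   # works for any sequence length
print(len(unroll(sequence)), unroll(sequence)[-1].round(3))
```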

  • 00:15:00 In this section of the lecture, the speaker discusses the limitations of recurrent neural networks (RNNs) and how they may not remember information from early inputs. This can be problematic for applications such as machine translation, where the first word is just as important as the last word. However, for activities like activity recognition, it may be okay if the RNN forgets sensor measurements that happened a while ago because recent measurements are more important. Another problem with RNNs is prediction drift, where errors in predictions build up over time, causing predictions to drift. The speaker also compares RNNs to hidden Markov models (HMMs) and explains how RNNs can be used to generalize HMMs.

  • 00:20:00 In this section, the speaker explains the difference between a hidden Markov model and a recurrent neural network. In a hidden Markov model, the arrows indicate probabilistic dependencies, while in a recurrent neural network, the arrows indicate functional dependencies. The speaker introduces hidden states and outputs in a recurrent neural network and explains that the graph corresponds to the computation being done. The hidden state is computed using a function that takes the previous hidden state and the input, and the output is obtained using another function that takes the hidden state as input. Ultimately, the goal is to use this computation to compute probabilities or recognize activities.

  • 00:25:00 In this section, the concept of using recurrent neural networks to emulate a hidden Markov model in the context of classification, specifically activity recognition, is discussed. The RNN decouples the hidden state from the output, meaning the output depends only on the hidden state transformed through some function. An example of this is shown using a nonlinear activation function applied to HT and a different set of weights to transform the output. The forward pass of the RNN can compute y1 based on X1, y2 based on X1 and X2, and so on, similar to the hidden Markov model; however, the RNN has a problem when computing y2, which is addressed later in the lecture.

  • 00:30:00 In this section, the lecturer discusses the limitations of the unidirectional recurrent neural network architecture that only allows forward computation and introduces the bidirectional recurrent neural network (BRNN) architecture as a solution to this problem. The lecturer draws a diagram of the BRNN architecture, which includes forward and backward hidden states, inputs, and outputs. By aggregating information from before and after through the forward and backward hidden states, the BRNN architecture allows for bidirectional computation and can compute predictions based on inputs in both directions.

  • 00:35:00 In this section of the video, the lecturer discusses how recurrent neural networks can be used in applications where the input and output sequences do not match naturally, such as machine translation, question answering, and conversational agents. To tackle these problems, a different architecture known as the encoder decoder model, or sequence to sequence model, is often used. This architecture utilizes two RNNs - an encoder and a decoder. The encoder encodes the input sequence into a context vector, which is an embedding of the input, and the decoder uses the context vector to produce the corresponding output sequence. This approach allows for input and output sequences of different lengths and no synchronization between words in the input and output.

  • 00:40:00 In this section of the lecture, the instructor describes the architecture of a sequence-to-sequence model in machine translation, which uses a recurrent neural network to summarize input sentences into a context vector (C) that serves as the model's memory. The context vector is used to decode and produce a sequence of translated words, with each word corresponding to a different output. The model also uses hidden states to keep track of the progress of the translation and to ensure that information from the context vector is not forgotten over time. The instructor explains that it is useful to feed both the context vector and the previous hidden state into each step of the decoding process to ensure the coherence of the translated sentence.

  • 00:45:00 In this section of the video, the professor discusses the use of redundancy in information flow in neural networks. The vector used to encode information is typically high-dimensional and can have 500-1000 values, making it ideal for encoding entire sentences. The video also shows examples of translations achieved using a model that uses a recurrent neural network. The model was trained on a large corpus of data and was able to match the state-of-the-art in machine translation without needing a lot of knowledge about linguistics or the intricacies of machine translation, making it a significant advance. Additionally, the Long Short-Term Memory (LSTM) unit was proposed in the 1990s to improve long-range dependencies in neural networks.

  • 00:50:00 In this section, the lecturer discusses the benefits of Long Short-Term Memory (LSTM) units, which can mitigate the vanishing gradient problem and facilitate the learning of long-range dependencies due to their ability to remember information for extended periods of time. The key to the LSTM unit is the introduction of gates, including input, forget, and output gates. These gates regulate the flow of information by taking a value between 0 and 1 and multiplying it with the input, the hidden state, or the output. The lecturer also unrolls the LSTM cell architecture and introduces gates to each link to regulate the connections between them. These modifications allow the LSTM unit to selectively allow or block the flow of information and facilitate long-term memory in tasks such as machine translation.

  • 00:55:00 In this section, the lecturer explains the structure and variations of Long Short-Term Memory (LSTM) units, a type of recurrent neural network. LSTM units are built using a combination of several gates that regulate information flow, such as the input gate, output gate, forget gate, and memory gate. These gates take both the current X and the previous hidden state as input and output a value between 0 and 1 that decides whether to let new information in or forget old information. The lecturer also mentions that newer LSTM units use cell states instead of hidden states for memory storage and have H as the output instead of Y. The lecture concludes by describing specific equations that govern the LSTM unit's different gates.
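
A sketch of a single LSTM step in the standard formulation, with input, forget, and output gates plus a candidate memory content; the sizes and random parameters below are placeholders, and the exact equations on the lecture's slides may differ slightly in notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step: gates with values in (0, 1) regulate what enters,
    what stays in, and what leaves the cell state."""
    Wi, Wf, Wo, Wg, bi, bf, bo, bg = params
    z = np.concatenate([h_prev, x])           # gates look at the previous h and current x
    i = sigmoid(Wi @ z + bi)                  # input gate
    f = sigmoid(Wf @ z + bf)                  # forget gate
    o = sigmoid(Wo @ z + bo)                  # output gate
    g = np.tanh(Wg @ z + bg)                  # candidate memory content
    c = f * c_prev + i * g                    # cell state: additive path helps gradients
    h = o * np.tanh(c)                        # hidden state / output of the unit
    return h, c

rng = np.random.default_rng(0)
d_in, d_hidden = 4, 6
shape = (d_hidden, d_hidden + d_in)
params = [0.1 * rng.normal(size=shape) for _ in range(4)] + [np.zeros(d_hidden)] * 4

h, c = np.zeros(d_hidden), np.zeros(d_hidden)
for x in rng.normal(size=(5, d_in)):          # run the same cell over a short sequence
    h, c = lstm_step(x, h, c, params)
print(h.round(3))
```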

  • 01:00:00 In this section, the instructor explains how the Long Short-Term Memory (LSTM) units work and how they are useful in preventing gradient problems such as vanishing and exploding gradients. It is explained that gates are used to determine what can influence the cell state, which carries the memory of the network. The instructor also notes that gated units known as the Gated Recurrent Unit (GRU) were proposed in 2014 as a simplified version of the LSTM units. The GRU removes one of the gates used in the LSTM units.

  • 01:05:00 In this section, the speaker introduces the gated recurrent unit (GRU), which simplifies the long short-term memory (LSTM) unit by having only two gates: the reset gate and the update gate. The update gate determines whether the new input goes into the hidden state or preserves what was already in it. This reduces the complexity of the unit and makes it more efficient, resulting in better performance. However, even with the use of GRU, there is still some memory that gets perturbed at every step, so attention mechanisms were developed, particularly useful in machine translation, to align each output word with some words in the input sequence, allowing the model to preserve the original sentence's meaning and check the word-to-word alignment.

  • 01:10:00 In this section, the idea of context vectors is introduced for decoding a sequence of words. The context vector is based on a weighted combination of all the hidden states associated with each time step in the encoding process. The weights are obtained through a softmax, which produces higher probability when there is an alignment between the intended output and an input word. The alignment is computed using a dot product and turned into a probability through the softmax, which allows computing a weighted combination of the possible inputs. By doing so, we create a context vector that summarizes the context that matters for the next few words we want to produce, rather than summarizing the entire sentence.

  • 01:15:00 In this section, the lecturer discusses the use of attention mechanisms in machine translation. The attention mechanism involves taking a convex combination of hidden states computed at each time step, instead of just using the last hidden state as the context vector. The weights used for the combination are probabilities obtained from a softmax, and they are used to compute alignments between the previous hidden state and all the previous inputs. This allows the machine translation model to align the concepts it is about to translate with the right part of the input. The use of attention has improved machine translation, and the lecturer presents some results obtained by authors who used it in 2015.
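
The attention computation described here can be sketched as follows, using a plain dot product as the alignment score; the lecture also mentions learned projections, which are omitted, and the dimensions below are arbitrary.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_context(query, encoder_states):
    """Score each encoder hidden state against the decoder's previous hidden state
    (here with a plain dot product), turn the scores into probabilities with a softmax,
    and return the resulting convex combination as the context vector."""
    scores = encoder_states @ query              # one alignment score per input position
    weights = softmax(scores)
    context = weights @ encoder_states           # weighted combination of hidden states
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))         # 6 input words, hidden size 8
decoder_state = rng.normal(size=8)
context, weights = attention_context(decoder_state, encoder_states)
print(weights.round(3), context.shape)
```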

  • 01:20:00 In this section of the lecture, the speaker discusses the issue of long sentences in machine translation and the importance of having a mechanism that allows for looking back during the translation process. The researcher compares the accuracy of a recurrent neural network with and without attention, and measures the differences in accuracy using the bilingual evaluation understudy (BLEU) score. The top curve, which uses attention, shows a consistent level of accuracy even as sentence length increases. This can be attributed to the attention mechanism allowing all words in the input sequence to influence the context vector for the next step in decoding, regardless of their position.

  • 01:25:00 In this section, the lecturer discusses the limitations of recurrent neural networks when working with long sentences and the importance of attention mechanisms for solving this issue. Recurrent neural networks tend to overwrite early words with subsequent words, resulting in degraded translation quality when dealing with long sequences. Attention mechanisms solve this problem by focusing on specific words, allowing the neural network to deal with longer sequences of arbitrary length. Attention mechanisms also help with processing different languages, where word alignment is not necessarily one-to-one. The lecturer provides examples of how attention mechanisms work in producing translation maps that show the alignment of words in different languages.

  • 01:30:00 In this section, the speaker explains how recurrent neural networks can be generalized to recursive neural networks, which can be used for sequences, graphs, and trees. The key is to transform inputs and combine them recursively in a way that produces an output or embedding that captures the meaning of the input. To deal with varying lengths of inputs, the speaker emphasizes the importance of weight sharing between the different applications of rules for combining different nodes in the graph. The speaker also suggests using parse trees or dependency graphs to build a graph that reflects syntax and can be useful in computing and embedding.

  • 01:35:00 In this section, the lecturer discusses how to parse a sentence using constituency parse trees and how to produce embeddings for entire sentences. The idea is to come up with parts of speech tags and combine them into phrases and parse trees to understand the sentence structure. By associating rules with each transformation and sharing weights across all applications of the same rule, we can produce embeddings that are more promising and consistent with how humans understand sentences. Some researchers have shown that by building embeddings in this way, we can obtain very good results.

  • 01:40:00 In this section of the video, the speaker discusses the potential for obtaining a better sentence embedding through the use of a correct parse tree. They conclude the previous set of slides and move on to the next.
 

CS480/680 Lecture 19: Attention and Transformer Networks

In this lecture, the concept of attention in neural networks is introduced, and its role in the development of transformer networks is discussed. Attention was initially studied in computer vision, allowing for the identification of crucial regions similar to how humans naturally focus on specific areas. Applying attention to machine translation led to the creation of transformer networks, which use solely attention mechanisms and produce results as good as traditional neural networks. Transformer networks have advantages over recurrent neural networks, solving problems associated with long-range dependencies, vanishing and exploding gradients, and parallel computation. The lecture explores the multi-head attention in transformer networks, which ensures each output position attends to the input. The use of masks, normalization layers, and the add-and-norm layer in transformer networks is discussed, and the concept of using attention as a building block is explored.

In this lecture on attention and transformer networks, the speaker explains the importance of normalization for decoupling gradients in different layers, as well as the significance of positional embedding to retain word order in sentences. The speaker compares the complexity estimates of transformer networks to recurrent and convolutional neural networks, highlighting the transformer network's ability to capture long-range dependencies and process words simultaneously. The advantages of transformer networks in improving scalability and reducing computation are also discussed, along with the introduction of transformer networks like GPT, BERT, and XLNet, which have shown impressive performance in accuracy and speed, raising questions about the future of recurrent neural networks.

  • 00:00:00 In this section, the lecturer introduces the concept of attention in neural networks and its role in the development of transformer networks. Attention was first studied in computer vision, with the idea that an attention mechanism could identify regions of interest in an image similar to how humans naturally focus on specific regions. This concept was then applied to machine translation and eventually led to the creation of transformer networks, which consist solely of attention mechanisms and have shown to produce results at least as good as those of traditional neural networks. Attention can also be used to highlight important features in an image that contribute to the desired output, such as the location of objects in object detection.

  • 00:05:00 In this section, the lecturer discusses how attention can be used as a building block in the recognition process, as seen in the breakthrough machine translation work of 2015, where the decoder was able to look back at the input sentence. In 2017, researchers demonstrated the use of attention to develop general language modeling techniques, allowing for the prediction and recovery of missing words in a sequence. The transformer network, which exclusively uses attention blocks, became the state of the art for natural language processing and surpassed recurrent neural networks due to its ability to deal with long-range dependencies and optimize parallel computation on GPUs. Transformer networks are, therefore, an efficient choice for natural language processing tasks.

  • 00:10:00 In this section, the speaker explains the advantages of attention and transformer networks over the traditional recurrent neural networks. Attention blocks help in drawing connections between any part of the sequence, avoiding the problem of long-range dependencies. Additionally, transformer networks do computation simultaneously for the entire sequence, allowing for more parallelization and fewer steps to train, and solving the issue of vanishing and exploding gradients. The speaker also reviews attention as a form of approximation for database retrieval and introduces the equation used in attention mechanisms for neural networks.

  • 00:15:00 In this section, the speaker explains how the similarity function computes a distribution and how the attention mechanism can be generalized to a neural architecture. The speaker suggests various functions that could be used to measure similarity, including dot product and scaled dot product, and explains how they could be applied to compute the similarity between keys and the query. The speaker also introduces the idea of a weighted combination of values with high similarity in the retrieval process, which corresponds to the attention mechanism.

  • 00:20:00 In this section of the lecture, the professor explains the first layer of the attention mechanism in detail. The layer computes similarity between a query and each key in the memory. The most common way to compute the similarity is through a dot product or scaling the dot product by dividing by the square root of the dimensionality. Another way is to project the query into a new space using a weight matrix and then taking a dot product. This step will allow the neural network to learn a mapping W to compare the similarity between the query and the key more directly.
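
A minimal numpy sketch of this first layer, assuming a single query compared against a small set of keys; the optional projection matrix W is an illustrative stand-in for the learned mapping described above, not the lecture's exact notation.

```python
import numpy as np

def scaled_dot_product_attention(q, K, V, W=None):
    """Compare a query against keys and take a weighted combination of values.

    q: (d,) query, K: (n, d) keys, V: (n, d_v) values.
    W: optional (d, d) learned projection of the query (hypothetical name).
    """
    if W is not None:
        q = W @ q                              # project the query first
    scores = K @ q / np.sqrt(K.shape[1])       # scaled dot-product similarities
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax -> distribution over keys
    return weights @ V                         # attention value

# toy usage
K = np.random.randn(5, 8)
V = np.random.randn(5, 8)
q = np.random.randn(8)
out = scaled_dot_product_attention(q, K, V)
```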

  • 00:25:00 In this section, we discuss how attention values are computed in a fully connected network that uses the softmax function. The weights are computed using an expression that compares a query with various keys to obtain a similarity measure, and this is used to assign a weight to every key. The attention value is then computed using a linear combination of the values associated with every key. The weights, represented by the matrix W, are learned by the neural network through backpropagation, optimizing the projection of Q into the space spanned by W. The resulting weights are used to produce an output, with one weight per output word and the hidden vectors associated with each input word used as the values v_i.

  • 00:30:00 In this section, the lecture discusses the attention mechanism and transformer networks. The attention mechanism is a way to combine hidden vectors for an output word with hidden vectors for input words, allowing for the production of a context vector. The transformer network, presented in 2017, eliminates recurrence in sequential data, which speeds up optimization and parallelizes operations. The transformer network in machine translation has two parts: an encoder and a decoder. The encoder processes the entire sequence of words in parallel via multi-head attention and a feedforward neural network, with the addition of positional encoding to account for word positioning.

  • 00:35:00 In this section, the lecture describes the multi-head attention mechanism, which computes attention between every position and every other position. The multi-head attention takes every word and combines it with some of the other words in the sentence through an attention mechanism, producing a better embedding that merges together information from pairs of words. The lecture also discusses an add-and-norm layer, which adds a residual connection from the original input to the output of the multi-head attention and then normalizes the result. The block is repeated several times so that the model can combine pairs of words, pairs of pairs, and so on. The output of this process is a sequence of embeddings, and there is one embedding per position in the sentence. The lecture then explores the decoder, which produces some output using a softmax that produces probabilities for outputting a label in each position. The decoder also includes two layers of attention, the first of which is self-attention between the output words, and the second of which combines output words with input words.

  • 00:40:00 In this section, the speaker discusses the multi-head attention mechanism in Transformer Networks, which is used to ensure that each position in the output is attending to positions in the input. The multi-head attention works by projecting the queries, keys, and values, comparing each query to the keys to obtain weights, and taking a weighted combination of the corresponding values to produce the output. This process is repeated multiple times with different linear combinations to compute different projections and improve the embeddings until a distribution over the words in the dictionary is produced.

  • 00:45:00 In this section of the lecture, the professor discusses the concept of multi-head attention and how it can be compared to feature maps in convolutional neural networks. The different linear combinations in multi-head attention can be thought of as different filters, projecting or changing the space in which the values reside. This results in multiple scaled dot-product attentions, which correspond to multiple feature maps in CNNs. A concatenation layer combines these different attentions, and in the end, a linear combination of them results in the multi-head attention. Additionally, the professor explains the masked multi-head attention, which nullifies or removes links that would create dependencies on future words, making it suitable for machine translation tasks.
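
A rough sketch of multi-head self-attention under these assumptions: each head applies its own projections, computes scaled dot-product attention, and the heads are then concatenated and mixed by a final linear layer. The matrix names and sizes are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """Self-attention over a sequence X of shape (n, d) with several heads.

    Wq, Wk, Wv: lists of per-head projection matrices of shape (d, d_head).
    Wo: final (num_heads * d_head, d) matrix mixing the concatenated heads.
    """
    heads = []
    for wq, wk, wv in zip(Wq, Wk, Wv):
        Q, K, V = X @ wq, X @ wk, X @ wv           # per-head projections
        scores = Q @ K.T / np.sqrt(wk.shape[1])    # scaled dot products
        heads.append(softmax(scores) @ V)          # weighted values per head
    return np.concatenate(heads, axis=-1) @ Wo     # concatenate, then mix

n, d, h, dh = 6, 16, 4, 4
X = np.random.randn(n, d)
Wq = [np.random.randn(d, dh) for _ in range(h)]
Wk = [np.random.randn(d, dh) for _ in range(h)]
Wv = [np.random.randn(d, dh) for _ in range(h)]
Wo = np.random.randn(h * dh, d)
out = multi_head_attention(X, Wq, Wk, Wv, Wo)      # shape (n, d)
```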

  • 00:50:00 This section of the video discusses the use of masks in the context of the Transformer network. The presenter explains how masks are used to nullify certain connections in the softmax function, and how using masks with values of minus infinity ensures that a proper distribution is maintained. The presenter also discusses how the use of masks allows for parallel computation during training, and how the technique of teacher forcing decouples input and output during training.
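
A small sketch of the masking idea: setting the scores of future positions to minus infinity before the softmax drives their weights to zero, while the remaining weights still form a proper distribution.

```python
import numpy as np

def causal_mask(n):
    """Mask where position i may not attend to positions j > i."""
    mask = np.zeros((n, n))
    mask[np.triu_indices(n, k=1)] = -np.inf
    return mask

def masked_softmax(scores):
    scores = scores + causal_mask(scores.shape[0])     # -inf kills future links
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)           # each row still sums to 1

weights = masked_softmax(np.random.randn(4, 4))        # lower-triangular weights
```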

  • 00:55:00 In this section of the video, the importance of the normalization layer in Transformer Networks is discussed. The normalization layer helps to reduce the number of steps needed by gradient descent to optimize the network, as it ensures that the output of each layer, regardless of how the weights are set, will have a mean of 0 and variance of 1. By doing this, the scale of the outputs is the same, which reduces the gradient competition between layers and makes convergence faster. It is noted that layer normalization is different from batch normalization as it normalizes at the level of a layer rather than a single hidden unit, making it suitable for smaller batches or even one data point at a time in an online or streaming setting.
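
A minimal sketch of layer normalization for a single example, assuming the usual learnable scale and shift parameters; the statistics are computed across the units of the layer rather than across a batch.

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one example's layer activations to mean 0 and variance 1.

    Unlike batch normalization, the statistics are taken across the units of
    the layer, so it works even with a single data point (online setting).
    """
    mean, var = x.mean(), x.var()
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.randn(16) * 5 + 3     # arbitrary scale and shift
y = layer_norm(x)                   # mean ~ 0, variance ~ 1
```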

  • 01:00:00 In this section of the video, the speaker discusses the importance of normalization for decoupling how gradients evolve in different layers. They also delve into the topic of positional embedding, which is added after the input embedding in the transformer network. The positional embedding ensures that the attention mechanism can capture positional information, which is important for retaining the ordering of words in a sentence. The speaker explains that the positional embedding is an engineering hack and discusses the formula used to compute it, although they note that there may be different ways to approach this aspect of the network.
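
The formula the lecturer refers to is presumably the standard sinusoidal encoding from the original transformer paper; a sketch under that assumption:

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal positional embedding:
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    """
    pos = np.arange(n_positions)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = positional_encoding(50, 16)    # added to the input word embeddings
```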

  • 01:05:00 In this section of the lecture, the speaker compares the complexity estimates of a transformer network with those of a recurrent neural network or convolutional neural network. The transformer network, also known as a self-attention network, has a complexity of order n squared because the attention mechanism attends to every other position for each position in one layer, while also computing their embeddings. However, the transformer network does not lose information from the first word and allows information to flow between pairs of words immediately, making it effective in capturing long-range dependencies. Additionally, there are no sequential operations in a transformer network, meaning that all of the words can be processed simultaneously and in parallel. In contrast, a recurrent neural network has sequential operations and path length that can be up to n.

  • 01:10:00 In this section of the lecture, the speaker discusses the advantages of transformer networks, specifically their ability to reduce computation and improve scalability. The speaker then goes on to compare different models for machine translation, specifically English to German and English to French, and shows that while the transformer models did not necessarily produce outstanding results, they drastically reduced computation time, making them a more efficient option for training. The speaker also discusses other types of transformer networks, like GPT and GPT-2, which were proposed in 2018 for unsupervised language modeling.

  • 01:15:00 In this section, the video introduces two types of transformer networks called GPT and BERT. GPT is a language model that can be used for a variety of tasks including reading comprehension, translation, summarization, and question answering. The model attends to the previous outputs to generate a sequence of words without attending to future outputs. The researchers applied this to different tasks without tailoring the network to the specific task and found that in a completely unsupervised fashion, they managed to come close to the state of the art. BERT stands for Bidirectional Encoder Representations from Transformers, and its main advance is that it predicts a word based on both the previous words and the future words, making it better than GPT.

  • 01:20:00 In this section, the lecturer discusses the advancements made in transformer networks, specifically BERT and XLNet. BERT boasts the ability to fine-tune models with task-specific data, resulting in a major improvement in the state of the art on eleven tasks. However, XLNet has put forth an even more impressive performance, beating BERT across most tasks because it allows for missing inputs, which leads to better generalization. These transformer networks have proven to perform well in terms of accuracy and speed, raising questions about the future of recurrent neural networks.
 

CS480/680 Lecture 20: Autoencoders



CS480/680 Lecture 20: Autoencoders

Autoencoders refer to a family of networks closely related to encoder-decoders, with the difference being that autoencoders take an input and produce the same output. They are important for compression, denoising, obtaining a sparse representation, and data generation. Linear autoencoders achieve compression by mapping high dimensional vectors to smaller representations, while ensuring that no information is lost, and use weight matrices to compute a linear transformation from input to compressed representation and back. Additionally, deep autoencoders allow for sophisticated mappings, while probabilistic autoencoders produce conditional distributions over the intermediate representation and input, which can be used for data generation. By using nonlinear functions, autoencoders can exploit a nonlinear manifold, a projection onto a lower-dimensional space that captures the intrinsic dimensionality of the data, leading to lossless compression of the input.

  • 00:00:00 In this section of the lecture on Autoencoders, the presenter explains that they are a family of networks closely related to encoder-decoders, with the difference being that Autoencoders take an input and produce the same output. Autoencoders are important for tasks such as compression, denoising, obtaining a sparse representation, and data generation. Compression involves mapping high dimensional vectors to smaller representations, while ensuring that no information is lost. To achieve this, the input is fed to an encoder that produces a smaller representation, which is then decoded back to the input to ensure that the compressed representation has all the information of the input. Linear Autoencoders use weight matrices to compute a linear transformation from the input to the compressed representation and back to the input.
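
A toy sketch of such a linear autoencoder, trained by plain gradient descent on the squared reconstruction error; the dimensions, learning rate, and iteration count are arbitrary choices for illustration, not the lecture's.

```python
import numpy as np

# Linear autoencoder: encoder h = Wf x compresses x in R^d to h in R^k,
# decoder x_hat = Wg h maps it back to the input space.
d, k, n = 10, 3, 200
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))

Wf = rng.standard_normal((k, d)) * 0.1      # encoder weights
Wg = rng.standard_normal((d, k)) * 0.1      # decoder weights
lr = 0.05

for _ in range(1000):
    H = X @ Wf.T                            # compressed representations
    X_hat = H @ Wg.T                        # reconstructions
    err = X_hat - X
    grad_Wg = err.T @ H / n                 # gradient of squared error wrt Wg
    grad_Wf = (err @ Wg).T @ X / n          # gradient of squared error wrt Wf
    Wg -= lr * grad_Wg
    Wf -= lr * grad_Wf

loss = np.mean((X @ Wf.T @ Wg.T - X) ** 2)  # reconstruction error
```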

  • 00:05:00 In this section, the lecturer explains the connection between autoencoders and principal component analysis (PCA). He notes that the typical use of PCA is to project data into a lower dimensional hyperplane while preserving the variation in the data. However, he also explains that when an autoencoder (with linear mappings) is used to minimize the Euclidean distance, it yields the same solution as PCA, making it a useful tool for dimensionality reduction. The lecturer highlights that the matrices WF and WG in the autoencoder are essentially inverses (or pseudo-inverses) of each other, since applying WG to WF x recovers x (i.e., WG WF is approximately the identity).

  • 00:10:00 In this section, the lecturer explains the beauty of autoencoders, which is that they do not restrict themselves to linear mappings unlike PCA. Instead, autoencoders can use nonlinear functions to find the hidden representation of data, which can be projected onto a lower dimensional space through a nonlinear manifold. This manifold can capture the intrinsic dimensionality of the data, which can lead to a lossless compression of the input. However, determining the optimal dimensionality of H would require specific techniques for structure learning.

  • 00:15:00 In this section, the video introduces deep autoencoders and sparse representations. Deep autoencoders have multiple layers before reaching the hidden layer, allowing for sophisticated mappings, while sparse representations impose structure onto intermediate representations by minimizing the number of non-zero entries in the vector produced by F. This can be done through non-convex optimization or by using l1 regularization to minimize the l1 norm of the output. Additionally, the video provides an example of using an autoencoder for denoising by feeding in a corrupted version of the input and attempting to recover the original X.

  • 00:20:00 In this section, the lecturer describes probabilistic or stochastic autoencoders, which are different from deterministic ones because they focus on conditional distributions. In a deterministic autoencoder, the encoder produces an intermediate representation that the decoder can directly use to reconstruct the input, whereas a probabilistic autoencoder produces conditional distributions over the intermediate representation and input. By designing a neural network with appropriate last activation functions, the last layer can be used to produce patterns that can be interpreted as distributions. Linear units in the output layer can be used to encode conditional distributions for real data, whereas sigmoid units can work with binary data. The lecturer emphasizes that these probabilistic autoencoders allow for the generation of data, which is a significant difference from deterministic ones.

  • 00:25:00 In this section of the lecture, the speaker explains the probabilistic graphical model of an autoencoder. The input X is considered a random variable and the output X tilde is an approximate version of the input. H is another random variable representing the hidden layer and the arrows indicate conditional dependencies. The weights are represented by conditional distributions and the decoder is a conditional distribution. Different activation functions are used to produce different types of output. The speaker also discusses how to compute a distribution over X based on a distribution over H for both binary and Gaussian vectors.

  • 00:30:00 In this section, the lecturer explains how an architecture like a probabilistic autoencoder can be used to generate data. With a deterministic autoencoder, the decoder takes some embedding and generates a data point. However, by having a distribution, we could sample from some distribution over the intermediate representation and use it to generate a data point. For instance, if we train the probabilistic autoencoder with faces, we could easily sample from the hidden representation and then produce a new face that is different but similar to the ones in the data set. By sampling from the distribution over images, we obtain an image.

  • 00:35:00 In this section, the speaker discusses the generation of new images using probabilistic autoencoders. The speaker explains how the autoencoder can generate new images by mapping input data points into embeddings in a space where nearby points can be decoded into new images. However, the speaker notes that in order to generate truly new images, there needs to be a distribution that allows for the sampling of proper embeddings. The distribution used in the autoencoder is conditioned on the input data point X, which can lead to the generation of similar images. To overcome this limitation, the next set of slides will discuss mechanisms for sampling directly with an H and generating new images.
 

CS480/680 Lecture 21: Generative networks (variational autoencoders and GANs)



CS480/680 Lecture 21: Generative networks (variational autoencoders and GANs)

This lecture focuses on generative networks, which allow for the production of data as output via networks like variational autoencoders (VAEs) and generative adversarial networks (GANs). VAEs use an encoder to map data from the original space to a new space and then a decoder to recover the original space. The lecturer explains the concept behind VAEs and challenges with computing the integral of the distributions needed in training. GANs consist of two networks - a generator and a discriminator - where the generator network creates new data points, and the discriminator network tries to distinguish between the generated and real ones. The challenges in GAN implementation are discussed, including ensuring a balance between strengths of the networks and achieving global convergence. The lecture ends with examples of generated images and a preview for the next lecture.

  • 00:00:00 In this section of the lecture, the focus is on generative networks and how they can be used for data generation. While classification and regression have been the main techniques covered in the course so far, generative networks allow for the production of data as output. This is particularly useful for natural language generation, speech synthesis, and image and video generation. Variational auto-encoders and generative adversarial networks are among the most popular networks currently being used for data generation. These networks are used to produce realistic data that is similar to that found in a dataset.

  • 00:05:00 In this section, the lecturer discusses the idea of probabilistic autoencoders, where instead of a deterministic encoder, we have a probabilistic encoder that encodes a conditional distribution. Similarly, the decoder is also a conditional distribution and can be thought of as a generator that creates a distribution over data, making it possible to generate new data points. A variational autoencoder is used to sample a hidden vector, H, from a fixed distribution, a Gaussian with mean 0 and variance 1, and then construct an objective that tries to make the encoder's distribution over H conditioned on X as close as possible to this fixed distribution, ensuring good sample results.

  • 00:10:00 In this section, the lecturer explains the concept behind variational autoencoders (VAEs). VAEs use an encoder to map data from the original space to a new space and then a decoder to recover the original space. The encoder produces a distribution that can be used to sample new points, which can be mapped back to the original space by the decoder. However, the distribution of the encoder needs to be as close as possible to a fixed distribution to ensure the generated data points are of the same type as the original data. The lecture covers the objective function for VAEs and how to optimize the network to achieve this goal.
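
The objective described here corresponds, up to sign conventions, to the standard evidence lower bound, written below for reference; the lecture may present it in a slightly different form.

```latex
\log p(x) \;\ge\; \mathbb{E}_{q(h \mid x)}\big[\log p(x \mid h)\big]
\;-\; \mathrm{KL}\big(q(h \mid x)\,\|\,p(h)\big),
\qquad p(h) = \mathcal{N}(0, I)
```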

  • 00:15:00 In this section, the lecturer discusses the challenges with computing the integral of the distribution of the encoder over H and the distribution over X for every H. This integral cannot be computed in closed form as the encoder and decoder are complex neural networks. To address this, the lecturer proposes the use of a single sample to approximate the integral and produce an H by sampling from the encoder and then approximate the resulting distribution by the distribution of the decoder. The approximation is made in training, and the lecturer highlights that this is different from regular autoencoders as there is a sampling step that requires careful consideration to still compute a gradient.

  • 00:20:00 In this section of the video, the speaker explains the reparameterization trick used in training generative networks like variational autoencoders. The encoder and decoder network architectures involve sampling steps, which makes computation of gradients difficult during optimization. To address this, a fixed Gaussian distribution is introduced to enable sampling of a new variable, H tilde, which is scaled and shifted by the variance and mean produced by the encoder to obtain a latent sample H with the desired distribution. The transformed H is then used in the decoder network to generate the reconstructed output X tilde.

  • 00:25:00 In this section, the speaker explains a trick called "reparameterization" that allows neural networks to generate samples from a data distribution without impeding the backpropagation of gradients. The trick involves sampling from a different but fixed distribution (such as a standard Gaussian) and then using some mathematical operations to transform the sample into a sample from the desired distribution. This way, the sample is an input to the network, which allows gradients to pass through it during backpropagation. The speaker then explains how this trick is used in training a generative network and generating new data points from the trained network.
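
A minimal sketch of the reparameterization trick for a Gaussian latent variable; mu and log_var stand in for the encoder's outputs and the values below are made up for illustration.

```python
import numpy as np

def reparameterize(mu, log_var, rng=np.random.default_rng(0)):
    """Sample h ~ N(mu, sigma^2) as a deterministic transform of a sample from
    a fixed N(0, I), so gradients can flow through mu and log_var while the
    randomness stays an external input to the network.
    """
    h_tilde = rng.standard_normal(mu.shape)   # sample from the fixed Gaussian
    sigma = np.exp(0.5 * log_var)
    return mu + sigma * h_tilde               # sample with the desired mean/variance

# mu and log_var would come from the encoder network (hypothetical values)
mu = np.array([0.5, -1.0])
log_var = np.array([-0.2, 0.3])
h = reparameterize(mu, log_var)
```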

  • 00:30:00 In this section, the speaker discusses the use of the Kullback-Leibler (KL) divergence, a distance measure used to minimize the difference between a fixed distribution and the encoder's distribution in generative networks. The speaker uses a Gaussian with mean zero and unit variance as the fixed distribution and trains the encoder to produce a distribution close to it. By using this regularization term, the decoder can generate a data point similar to what is in the training set, which in this case are images of faces. Examples of images generated by a variational autoencoder are shown, which are slightly blurry due to the probabilistic nature of the autoencoder. The speaker then introduces generative adversarial networks (GANs), which use two networks - a generator and a discriminator - to produce sharper, more realistic images that are not probabilistically built.

  • 00:35:00 In this section, the lecturer explains how Generative Adversarial Networks (GANs) work. GANs consist of two networks: a generator network and a discriminator network. The generator network creates new data points, while the discriminator network tries to distinguish between the generated data points and the real ones. The discriminator acts as a tutor by providing feedback to the generator, helping it generate more realistic data points. The training is done by optimizing an objective function, where the discriminator network tries to maximize the probability of recognizing real data points and fake ones, while the generator network tries to minimize these probabilities and fool the discriminator. The objective function can be rewritten as the probability of a data point being fake.
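
The objective being described appears to be the standard GAN minimax game, reproduced below for reference; D is the discriminator, G the generator, and p(z) the source distribution for the generator's input.

```latex
\min_G \max_D \;
\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
+ \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]
```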

  • 00:40:00 In this section, the instructor explains the architecture of Generative Adversarial Networks (GANs) which consist of a generator and a discriminator. The generator takes in a sample vector and produces simulated data while the discriminator is a classifier that takes in both real and generated data in order to classify them as real or fake. The GAN objective is to optimize these two networks using backpropagation with different sets of weights for the generator (WG) and the discriminator (WD). The instructor goes on to explain that the weights are updated by taking steps in the direction of the gradient in order to minimize the GAN objective.

  • 00:45:00 In this section, the speaker discusses an algorithm for training a generative adversarial network. The algorithm involves an outer loop where weights are optimized for the discriminator and then K steps are taken to optimize the objective. After that, a single step is taken to optimize the generator. The goal is for the generator to learn the distribution used to generate the training set so that it can produce data indistinguishable from the real data. If successful, the discriminator will have a 50% error rate and it will be impossible to tell whether a data point is real or fake.

  • 00:50:00 In this section of the video, the lecturer discusses the challenges that arise in the implementation of Generative Adversarial Networks (GANs), an approach to generative modeling that utilizes two networks called generator and discriminator that work in an adversarial setting to generate new data. One key issue is ensuring a balance between the strengths of both networks, as one could dominate the other. Another difficulty is achieving global convergence during optimization since non-convex optimization may lead to local optima that aren't optimal. Despite these challenges, some aspects of GANs work well in practice, as the generated images of digits and faces resemble real data points in their training set, although some fine-tuning may still be needed.

  • 00:55:00 In this section of the video, the speaker talks about generative adversarial networks (GANs) and how they can generate faces that are similar yet different. He provides examples of generated images, including a horse, a dog, and a blurry image. The speaker also mentions that the next class will cover a different topic in machine learning.
 

CS480/680 Lecture 22: Ensemble learning (bagging and boosting)



CS480/680 Lecture 22: Ensemble learning (bagging and boosting)

The lecture discusses ensemble learning, where multiple algorithms combine to improve learning results. The two main techniques reviewed are bagging and boosting, and the speaker emphasizes the importance of combining hypotheses to obtain a richer hypothesis. The lecture breaks down the process of weighted majority voting and its probability of error, as well as how boosting works to improve classification accuracy. The speaker also covers the advantages of boosting and ensemble learning, noting the applicability of ensemble learning to many types of problems. Finally, the video follows the example of the Netflix challenge to demonstrate the use of ensemble learning in data science competitions.

In this lecture on ensemble learning, the speaker emphasizes the value of combining hypotheses from different models to obtain a boost in accuracy, an approach that can be particularly useful when starting with already fairly good solutions. He discusses the importance of taking a weighted combination of predictions, noting that care must be taken as the average of two hypotheses could sometimes be worse than the individual hypotheses alone. The speaker also explains that normalization of weights may be necessary, depending on whether the task is classification or regression.

  • 00:00:00 The importance of ensemble learning is introduced, which is the process of combining multiple algorithms and hypotheses to improve learning results. The lecture discusses bagging and boosting techniques and highlights the difficulty of determining which individual algorithm is best suited for a specific problem. It is often a matter of trial and error, but combining imperfect hypotheses can lead to a better overall result, similar to how elections combine voters' choices or committees combine expert opinions. By combining multiple algorithms, the goal is to obtain a more robust and accurate prediction or classification.

  • 00:05:00 The lecturer discusses ensemble learning and how it can be used to improve the accuracy of machine learning models. Ensemble learning involves combining multiple imperfect hypotheses to obtain a richer hypothesis that is potentially better. The lecture mentions two methods of ensemble learning: bagging and boosting. The bagging technique involves taking a bag of hypotheses produced by different algorithms and combining them through voting, while boosting involves adjusting the weights of the hypotheses to give more weight to the ones that perform well. The lecturer explains how these techniques are used to generalize linear separators to obtain nonlinear boundaries and provides an example of a polytope.

  • 00:10:00 The concept of majority voting for classification is introduced, in which multiple hypotheses make predictions and the class that gets the most votes is chosen. The larger the number of hypotheses, the more unlikely it is for the majority to be incorrect. When hypotheses are independent, the majority voting becomes more robust. A mathematical equation is introduced to calculate the probability of the majority making an error based on the number of hypotheses and the probability of error. An example is provided where five hypotheses that make 10% errors provide a less than 1% probability of the majority vote being incorrect, demonstrating the robustness of the majority voting method.
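
The quoted numbers can be checked directly: with five independent hypotheses that each err with probability 0.1, the majority is wrong only when at least three err at once.

```python
from math import comb

# Five independent hypotheses, each wrong with probability 0.1: the majority
# vote is wrong only if at least three of them err at the same time.
p, n = 0.1, 5
p_majority_wrong = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(3, n + 1))
print(p_majority_wrong)   # ~0.0086, i.e. below 1% as claimed
```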

  • 00:15:00 The video discusses the limitations of basic ensemble learning techniques, such as the assumption of independent hypotheses. To address these limitations, a weighted majority vote can be used to adjust for correlations and give higher weights to better hypotheses. This technique is known as boosting and is done by using a base learner that produces classifiers, which are then pooled to obtain a higher accuracy. The boosting framework has been able to overcome the belief that bad algorithms should be abandoned in favor of designing better ones by combining their hypotheses to improve the overall accuracy.

  • 00:20:00 The lecturer discusses the concept of boosting in ensemble learning, which involves using a base learner to produce hypotheses and then perturbing the training set weights to obtain a different hypothesis. By increasing the weights of misclassified instances, there is a better chance of obtaining a more accurate hypothesis. The lecturer explains that supervised learning techniques can be adjusted to work with a weighted training set, and this can be done simply by changing the objective and introducing a weight for every data point. This method allows for the creation of a weighted combination of the loss function of every data point.

  • 00:25:00 The lecturer explains the concept of boosting in ensemble learning. Boosting involves learning with a weighted training set where instances with high weights are biased towards correct classification. The boosting framework includes a loop where a hypothesis is repeatedly learned from the dataset with corresponding weights, instances are checked for misclassification and their weights are increased, and in the end, the ensemble hypothesis is a weighted majority of the generated hypotheses using weights that are proportional to their accuracy. There are two types of weights, those for the data points and those for the hypotheses. The lecturer emphasizes that the idea is to improve classification accuracy and that any algorithm that works with weighted datasets can be used as the base learner for boosting.

  • 00:30:00 The speaker discusses the concept of increasing the weights of misclassified data points in boosting algorithms. They explain that this has the effect of implicitly decreasing the weights of correctly classified data points, but it is the relative magnitude of the weights that matter. The algorithm then minimizes loss and tries to classify correctly to avoid paying a higher price for misclassification. The speaker also notes that if the training set does not follow the same distribution as the test set, weights can be used to perturb the distribution. However, boosting is typically not used for this purpose, as increasing the weights of imperfect hypotheses can prevent overfitting and improve generalization.

  • 00:35:00 The instructor explains the workings of the adaptive boosting algorithm with a visual example of generating multiple hypotheses using a simple data set. Using weighted majority votes, the algorithm assigns weights that are proportional to the accuracy of each hypothesis, and these are used to compute a weighted combination of the best-performing hypotheses. The ensemble formed from this combination is then used to make predictions.

  • 00:40:00 The lecturer explains the concept of combining multiple hypotheses to prevent overfitting. They argue that even if we have a perfect hypothesis, it is still better to combine multiple hypotheses to prevent overfitting. The lecturer notes that a deep neural network may lead to perfect accuracy on the training set but it is not simple and quick, which is what we want in a base learner used in conjunction with ensemble learning. The lecturer also describes the Adaboost algorithm and how it works to assign weights to hypotheses and data instances.

  • 00:45:00 The speaker explains the theory behind boosting and its advantages. Boosting works well with weak learners, which are algorithms that produce hypotheses that are at least as good as a random classifier. The goal is to improve accuracy and performance. The speaker explains how to calculate the weights for data instances and hypotheses, and how to normalize them. Boosting tends to be robust to overfitting and is simple to implement, making it applicable to many problems. Additionally, boosting generates multiple hypotheses, not just one, which leads to better accuracy.
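
A sketch of one round of the standard AdaBoost-style weight update, for reference; the exact normalization used in the lecture may differ slightly, and the toy labels below are made up.

```python
import numpy as np

def adaboost_round(w, y_true, y_pred):
    """One round of the standard AdaBoost weight update.

    w: current data-point weights; y_true, y_pred: arrays of +/-1 labels.
    Returns the hypothesis weight alpha and the renormalized data weights.
    """
    miss = (y_pred != y_true).astype(float)
    err = np.sum(w * miss) / np.sum(w)                  # weighted error
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))   # hypothesis weight
    new_w = w * np.exp(alpha * (2 * miss - 1))          # up-weight the mistakes
    return alpha, new_w / new_w.sum()                   # normalize

y = np.array([1, 1, -1, -1, 1])
pred = np.array([1, -1, -1, 1, 1])          # a weak hypothesis with 2 mistakes
alpha, w = adaboost_round(np.ones(5) / 5, y, pred)
```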

  • 00:50:00 We learn about boosting and ensemble learning, which is a technique used to combine multiple models' predictions. Boosting is a method of generating multiple hypotheses with different weights, combining them all, and selecting the best one. As an approximation to Bayesian learning, it is a tractable way of generating one hypothesis at a time while being selective in combining multiple hypotheses for generalization. Boosting has several industrial applications, including the Kinect produced by Microsoft and the Netflix challenge, where it was used to improve their recommender system by 10%. Boosting is generally very good for combining expert predictions, unlike other heuristics, which might not always work and come without any theory.

  • 00:55:00 The speaker discusses the origins of Kaggle and how they started organizing data science competitions. He goes back to 2006 when Netflix launched a competition to improve accuracy by 10%. The first team, BellKor, achieved an improvement of 8.43% but didn't meet the threshold. The speaker then describes how, over the years, teams started to collaborate, using ensemble learning, and how the grand prize team was formed. The teams joined forces to share one million dollars of the grand prize, proportional to the improvement in the team score that each algorithm contributes. The grand prize team managed to get to 9.46% by forming a large ensemble built by many researchers, and on the last day, the merged team BellKor's Pragmatic Chaos submitted, winning the prize.

  • 01:00:00 The speaker discusses the importance and value of ensemble learning, particularly in the context of winning competitions. He uses the example of the BellKor's Pragmatic Chaos team winning the Netflix Prize by utilizing ensemble learning techniques to improve their accuracy by a few percentage points. He notes that ensemble learning is particularly useful when starting with already fairly good solutions rather than weak learners and that by combining hypotheses from different models, it is possible to obtain a boost in accuracy. Additionally, he mentions that ensemble learning lends itself well to distributed computing and can be achieved through multiple machines or cores.

  • 01:05:00 The instructor explains the concept of taking a weighted combination of predictions rather than hypotheses in order to avoid incurring a higher cost. The idea is that every hypothesis will make a prediction and those predictions will be combined according to weights. However, care must be taken when combining hypotheses as sometimes the average of two hypotheses could actually be worse than the individual hypotheses on their own. The instructor also mentions that the weights may need to be normalized depending on whether the task is classification or regression.
 

CS480/680 Lecture 23: Normalizing flows (Priyank Jaini)



CS480/680 Lecture 23: Normalizing flows (Priyank Jaini)

In this lecture, Priyank Jaini discusses normalizing flows as a method for density estimation and introduces how they differ from other generative models, such as GANs and VAEs. Jaini explains the concept of conservation of probability mass and how it is used to derive the change of variables formula in normalizing flows. He further explains the process of building the triangular structure in normalizing flows by using families of transformations and the concept of permutation matrices. Jaini also introduces the concept of sum of squares (SOS) flows, which use higher order polynomials and can capture any target density, making them universal. Lastly, Jaini discusses the latent space and its benefits in flow-based methods for image generation and asks the audience to reflect on the potential drawbacks of flow-based models.

In this lecture on normalizing flows by Priyank Jaini, he discusses the challenges of capturing high-dimensional transformations with a large number of parameters. Normalizing flows require both dimensions to be the same to achieve an exact representation, unlike GANs which use bottlenecks to overcome such issues. Jaini highlights that learning the associated parameters with high-dimensional datasets in normalizing flows experiments can be difficult. He also addresses questions about how normalizing flows can capture multimodal distributions and offers a code for implementing linear affine transformations.

  • 00:00:00 PhD student Priyank Jaini discusses normalizing flows as a family of deep generative models for solving the problem of density estimation, which forms a core problem in unsupervised learning. Jaini explains that density estimation has a wide range of applications in machine learning, such as importance sampling, Bayesian inference, and image synthesis. Jaini also gives a brief introduction on how normalizing flows are different from variational autoencoders (VAEs) and generative adversarial networks (GANs), which were discussed in earlier lectures. He proposes that normalizing flows are useful for conditional generative models and can be used for density estimation.

  • 00:05:00 The speaker discusses the framework for generative models, including Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), and introduces normalizing flows as an alternative approach. Both GANs and VAEs use a source distribution and a transformation to generate synthetic examples or reconstruct data, but they represent the density functions implicitly rather than explicitly. In contrast, normalizing flows give an explicit representation of density functions and work on the principle of conservation of probability mass. The goal is to learn a transformation that transforms a simple source distribution (e.g., Gaussian) onto a more complicated target distribution to approximate the true data distribution.

  • 00:10:00 Priyank Jaini introduces the concept of conservation of probability mass and how it is used to derive the change of variables formula. He gives an example of a random variable on the interval 0-1 and applies the function T of Z, which results in a uniform random variable with probability density 1/3. He explains that the change of variables formula is used to find the density of a target random variable X in terms of the source random variable Z and the function T. He extends the formula to the multivariate case, where the function T is learned from Rd to Rd, and the density becomes q_X(x) equal to p_Z(z) multiplied by the inverse of the absolute value of the determinant of the Jacobian of T, as written out below.
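
Written out, the multivariate change-of-variables formula described here is:

```latex
q_X(x) \;=\; p_Z(z)\,\bigl|\det \nabla_z T(z)\bigr|^{-1},
\qquad x = T(z), \quad z = T^{-1}(x)
```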

  • 00:15:00 The speaker explains the concept of normalizing flows, which involves learning a function that maps a given input vector, X, to another vector, Z. The function, a map from R^d to R^d, is composed of univariate functions, T1 to TD, that take in the components of X and output the components of Z. The goal is to approximate the density of the input data set, QX, using a simple source density PZ, and maximizing the likelihood of the data points using the change of variables formula. However, certain problems arise, including the function needing to be invertible and bijective.

  • 00:20:00 The lecturer discusses how to calculate the latent space given only the observed data. To do this, the inverse function of the mapping function is needed. However, computing the determinant in practice is expensive, so the lecturer introduced the concept of triangular maps, where the computation of the determinant is easy. The lecture then explains that normalizing flow research is mainly focused on building these transformations that are triangular, so that density estimation can be done, and how these transformations can be used in different normalizing flows.

  • 00:25:00 The lecturer explains the process of building a triangular structure for normalizing flows. The structure involves choosing a simple density, P(Z), to approximate a given density, Q(X). The density P(Z) can be any probability distribution, such as a normal or uniform distribution. Initially, one transformation t1 is used to get X1 from Z1. Then, as iterations continue, transformation t2 takes both Z1 and Z2 as input, giving X2. The process continues until tD takes Z1, Z2, ..., ZD as input and provides XD as output. The objective is to maximize the likelihood by minimizing a negative log-likelihood, which involves the sum of the logs of the Jacobian's diagonal elements, as written out below. The lecturer provides examples of families of transformations that can be used to build the triangular structure and explains how the joint density can be written as a product of the marginals and conditional distributions.
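
For a triangular map, the log-determinant reduces to a sum over the diagonal of the Jacobian, so the log-likelihood being optimized takes (up to the exact sign and parameterization used in the lecture) the form:

```latex
\log q_X(x) \;=\; \log p_Z(z) \;-\; \sum_{i=1}^{d}
\log\left|\frac{\partial T_i}{\partial z_i}(z)\right|,
\qquad x = T(z)
```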

  • 00:30:00 The lecturer discusses the concept of normalizing flows. Normal distributions are conditioned on data and are functions of the data. A transformation is learned from a standard Gaussian to this normal distribution. The transformation is done iteratively, and the resulting function is triangular. By stacking these transformations, a masked autoregressive flow is formed, allowing for a more complex transformation with multiple random variables. The determinant of each transformation and the final transformation can be easily computed by taking the Jacobian and the inverse. The parameters that define the transformation are trained by minimizing a negative log-likelihood.

  • 00:35:00 The presenter explains how to use a permutation matrix to switch up the ordering of random variables and break correlations to create a more complex transformation in density estimation. By stacking multiple transformations, the complexity of the transformation is increased, allowing for the ability to capture any density in real life, even if it does not follow a nice form. However, once the permutation is applied, the transformation is no longer triangular, making taking the Jacobian computationally expensive. The method of using a permutation matrix saves time and approximates the full transformation.

  • 00:40:00 The speaker discusses the various transformation methods used in normalizing flows. He explains that Real NVP is a linear transformation method that splits the input into two parts, applies a linear transformation to one part, and leaves the other part unchanged. They then stack multiple layers of this to build more complicated transformations. The speaker also mentions that neural autoregressive flows use deep neural networks instead of linear transformations and are universal. Further, he talks about his paper that proposes the use of sum of squares of polynomials instead of linear transformations or neural networks. This method uses high degree polynomials with coefficients that come from another neural network and is also universal.

  • 00:45:00 The lecturer discusses the properties of sum of squares (SOS) flows, which are a generalization of previously explored sums of squares of polynomials in computer science and optimization. Unlike other methods, SOS flows use higher-order polynomials that can control higher-order moments of the target distribution, such as kurtosis and skewness, without any constraints on the coefficients. SOS flows are easier to train and can capture any target density, making them universal, with applications in stochastic simulation. The lecturer also introduces an architecture called "Glow" that uses invertible 1x1 convolutions and affine coupling layers to produce images that can interpolate faces to an older version.

  • 00:50:00 Priyank Jaini explains the architecture of normalizing flows and how they can be used for image generation. The algorithm works by using an affine coupling layer with multiple expressions and a random rotation matrix, W. The determinant of this matrix is handled efficiently by using an LU decomposition. Using this, they can interpolate between images of old and young people by transforming an input image into a latent representation, then moving in a specific direction within the latent space to achieve the desired outcome. Results show that the generated images are sharp, contradicting previous assumptions that images generated with log-likelihood would be blurry.

  • 00:55:00 The lecturer discusses the concept of the latent space, which captures certain properties of the input and is a hidden distribution used in flow-based methods for image generation. The lecturer provides an example of linear interpolation using the latent space to create an image of a person getting older. The lecturer also highlights the benefits of normalizing flow models, such as their explicit representation of densities and the use of efficient triangular transformations to compute the Jacobian determinant. However, the lecturer also poses a question to the audience regarding the potential drawbacks of flow-based methods, with one of them being computational complexity.

  • 01:00:00 The lecturer discusses the challenges of capturing high-dimensional transformations with a large number of parameters in normalizing flows. While GANs use a bottleneck to overcome this problem, normalizing flows require both dimensions to be the same to achieve the exact representation. The lecturer highlights that the dimensions of datasets used in normalizing flows experiments are high, and this makes it difficult to learn the associated parameters. The lecturer also answers questions regarding how normalizing flows can capture multimodal distributions and how training on the weights of neural networks implicitly trains on the network parameters.

  • 01:05:00 Priyank Jaini explains that he has provided about a hundred lines of code for implementing linear affine transformations, which he learned from a tutorial by Eric Jang. He mentions that it is a simple process to train these networks, and offers the code for those interested.
 

CS480/680 Lecture 24: Gradient boosting, bagging, decision forests



CS480/680 Lecture 24: Gradient boosting, bagging, decision forests

This lecture covers gradient boosting, bagging, and decision forests in machine learning. Gradient boosting involves adding new predictors based on the negative gradient of the loss function to the previous predictor, leading to increased accuracy in regression tasks. The lecture also explores how to prevent overfitting and optimize performance using regularization and stopping training processes early. Additionally, the lecture covers bagging, which involves sub-sampling and combining different base learners to obtain a final prediction. The use of decision trees as base learners and the creation of random forests is also discussed, and a real-life example of the Microsoft Kinect using random forests for motion recognition is given. The benefits of ensemble methods for parallel computing are discussed, and the importance of understanding weight updates in machine learning systems is emphasized. This lecture covers the potential issues with averaging weights in combining predictors within neural networks or hidden Markov models, recommending instead the combining of predictions through a majority vote or averaging method. The professor also suggests various related courses available at the University of Waterloo, several graduate-level courses in optimization and linear algebra, and an undergraduate data science program focused on AI, machine learning, data systems, statistics, and optimization topics. The lecture emphasizes the importance of algorithmic approaches over overlap with statistics and the specialization in data science topics in comparison to general computer science degrees.

  • 00:00:00 The instructor discusses gradient boosting. He mentions that the adaboost algorithm is excellent for classification, but not for regression. He introduces gradient boosting, where the negative gradient of the loss function is computed, and the next predictor is fit to this gradient. This is a bit counterintuitive as it is not fitting the predictor to the desired output, but rather to the negative gradient. This will emulate a step of gradient descent, and by repeatedly applying it, the final predictor will be the sum of all the predictors. This method is particularly useful for regression. The instructor explains that this algorithm can be used with a wide range of loss functions, and it is a solution for boosting in regression.

  • 00:05:00 The concept of gradient boosting is explained: at each step of the algorithm, the loss function measures the difference between the target and the predicted value, the negative gradient is taken to approximate the residuals, and the next predictor is trained on this residual dataset. The goal is to reduce the error by adding this new predictor to the previous one. The pseudocode of the algorithm is then given, where initially the first predictor is set as a constant by minimizing the losses for each data point.

  • 00:10:00 The professor explains gradient boosting, a powerful concept in machine learning that combines several weak learners into a single strong learner. The idea is to start with a simple predictor that is just a constant and then compute a new predictor at each iteration by computing a pseudo-residual for every data point, forming a new residual data set, training a new base learner with respect to that data set, and adding the new hypothesis multiplied by some step length eta_k to the predictor. The step length is selected by minimizing an optimization expression to take a step in the direction of the negative gradient to reduce error. The weight updating happens when the negative gradient is computed, but it is not a weight update per se.
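
A toy sketch of this loop for regression with squared loss, where the pseudo-residuals are simply the current errors; the regression-stump base learner and the fixed step length are illustrative choices, not the lecture's.

```python
import numpy as np

def fit_stump(x, r):
    """Fit a regression stump (one threshold, two constant leaves) to residuals r."""
    best = None
    for t in np.unique(x):
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= t, left.mean(), right.mean())
        sse = np.sum((r - pred) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda xs: np.where(xs <= t, lv, rv)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(100)

pred = np.full_like(y, y.mean())     # f_0: the constant minimizing squared loss
eta = 0.5                            # step length (held fixed here)
for k in range(50):
    residual = y - pred              # negative gradient of the squared loss
    h = fit_stump(x, residual)       # base learner fit to the pseudo-residuals
    pred = pred + eta * h(x)         # f_k = f_{k-1} + eta * h_k

mse = np.mean((y - pred) ** 2)       # training error shrinks over iterations
```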

  • 00:15:00 The speaker explains the weight update process during the training phase of a base learner, which could be a neural network, decision tree, or any other type of regressor. They clarify that when optimizing the predictor, there is no update of weights, as all of the functions, i.e., f_{K-1}, h_K, and eta_K, are already optimized and set to fixed weights. The combination of the predictions from these functions leads to a predictor that gradually improves at each step, leading to a lower loss function. However, the process may not lead to a loss of zero in the long run.

  • 00:20:00 The instructor discusses the potential of reducing error gradually with gradient boosting, but notes that this could lead to overfitting, depending on the space of predictors and the amount of noise present in the data. The algorithm involves adding more hypotheses together to create a larger ensemble without changing the weights. The instructor poses a question to the class about the risk of overfitting with gradient boosting and concludes that there is a risk of overfitting, but it is possible to prevent this occurrence by using techniques such as regularization or early stopping.

  • 00:25:00 The lecturer discusses ways to reduce overfitting, including introducing randomization and stopping the training process early by using a validation set. The lecture then introduces the technique of gradient boosting and mentions the popular package XGBoost, which has been optimized for performance and accuracy. The lecturer also outlines the main differences between bagging and boosting, including the use of independent hypotheses and a majority vote in bagging compared to the sequential creation of hypotheses and their combination in boosting.

  • 00:30:00 The speaker discusses boosting and bagging techniques in machine learning. Boosting involves weighted predictions, which allow for some correlated hypotheses and hypotheses with imbalanced accuracy. Boosting is flexible and can determine the weights of different hypotheses to counter the issue of correlation. In contrast, bagging involves bootstrap sampling, which involves training a base learner on a subset of data to reduce correlation between hypotheses. The speaker indicates that these techniques offer a practical way to engineer some setup where assumptions regarding hypothesis independence can hold or approximately hold, reducing arbitrary restrictions and making the model more reliable.

  • 00:35:00 The speaker discusses the idea of obtaining a simple predictor that is better than random in the paradigm of ensemble learning by sub-sampling the features to reduce correlation. By sub-sampling both data points and features, a smaller data set is obtained, which is fed to the base learner, and the process is repeated for each predictor. The resulting hypotheses are less correlated, which makes bagging a better option. The bagging algorithm consists of a loop where K predictors are created, and for each predictor, data is sub-sampled, and the base learner produces different hypotheses depending on the overlap.

  • 00:40:00 We learn about bagging, which is a technique that works by extracting multiple random samples from the training data to build multiple models. The idea is to generate a hypothesis from each of the base learners and then combine them to make a final prediction. If the objective is classification, the prediction is made by taking the majority vote, whereas for regression, the decision is made by taking the average of the predictions. The popular practice in the literature is to use a decision tree as a base learner, and once multiple decision trees are trained on various subsets of data, we call them a random forest. Random forests can also be used for distributed computing. The real-life example of Microsoft Kinect using random forests for posture and motion recognition is provided.
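
A rough sketch of bagging for classification under these assumptions: each base learner is trained on a bootstrap sample and the final prediction is a majority vote. The nearest-centroid base learner and the toy data are placeholders, not anything from the lecture.

```python
import numpy as np

def bagging_predict(X_train, y_train, X_test, base_fit, n_models=25, seed=0):
    """Train each base learner on a bootstrap sample; combine by majority vote.

    base_fit(X, y) is any function returning a predict(X) -> labels callable.
    """
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)        # bootstrap: sample with replacement
        model = base_fit(X_train[idx], y_train[idx])
        votes.append(model(X_test))
    votes = np.stack(votes)                     # (n_models, n_test)
    return np.array([np.bincount(col).argmax() for col in votes.T])

def nearest_centroid_fit(X, y):
    """Placeholder base learner: classify by the nearest class centroid."""
    classes = np.unique(y)
    C = np.stack([X[y == c].mean(axis=0) for c in classes])
    return lambda Xq: classes[np.argmin(((Xq[:, None, :] - C[None]) ** 2).sum(-1), axis=1)]

X = np.vstack([np.random.randn(50, 2) + [2, 2], np.random.randn(50, 2) - [2, 2]])
y = np.array([0] * 50 + [1] * 50)
preds = bagging_predict(X, y, X, nearest_centroid_fit)
```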

  • 00:45:00 The video discusses the Kinect and how it produces a depth map by projecting a cloud of points in the infrared spectrum and using an infrared camera to perceive the points. Microsoft built in some hardware to allow for real-time inference of the depth information based on the distribution of points. The Kinect has the ability to label pixels to identify body parts and motions with a random forest approach, where adjacent pixels are compared to the depth value of the current pixel. The subsampling technique is used to simplify the neighboring pixels, and comparing distances based on the body part's size gives clues to classify the current pixel, although this method is considered weak.

  • 00:50:00 The speaker discusses the benefits of bagging, boosting, and other ensemble methods, which allow for multiple lightweight classifiers to be distributed and utilized in parallel, thus scaling well for large data. GPUs have become key to parallelizing computation, and several frameworks exist to manipulate vectors, matrices, and tensors without worrying about parallelization. However, the speaker warns against the intuitive but unreliable method of taking the average of the weights of classifiers or predictors, as hidden layers and variables can cause issues with this approach.

  • 00:55:00 The presenter explains how taking the average of individual systems in an architecture can be problematic. The presenter draws an example on the board using boolean variables that take on values of 0 and 1 to encode an exclusive-or. Weights are set up so that each hidden thresholding unit detects one of the two patterns, and as long as one of them is triggered, the exclusive-or is obtained by combining the hidden units with an OR through another thresholding unit. The presenter goes on to explain how changing the weights can affect the output of the system.

  • 01:00:00 The speaker discusses the dangers of averaging weights when combining predictors in neural networks or hidden Markov models. The danger lies in the fact that there may be symmetric solutions that don't compute the same thing, and taking the average of the weights could result in a predictor that doesn't compute the correct thing. Instead, the safe thing to do is to combine the predictions, which can be done through a majority vote for classification or taking the average for regression. The speaker also recommends other courses related to machine learning offered at the University of Waterloo for those interested in learning more.

  • 01:05:00 The professor discusses other courses that would complement the current course on machine learning. Firstly, he suggests taking the Computational Linear Algebra course before taking the current course as linear algebra is a crucial foundation for machine learning. Additionally, he mentions the course called Theoretical Foundations of Machine Learning which focuses on an important factor in machine learning, namely, data complexity. He explains how determining the level of achievable accuracy with a certain amount of data is a complex matter hence, the course aims to derive principles that determine the amount of data one needs to achieve a desired level of accuracy. Lastly, the professor mentions other courses at the graduate level such as Optimization for Data Science and Fundamentals of Optimization, which are beneficial for understanding machine learning algorithms.

  • 01:10:00 The lecturer discusses the available courses and programs related to data science that students can take. These courses range from 800 level courses that are not regularly offered to data science programs at the undergraduate and graduate levels. The lecturer points out that while there may be some overlap between this course and courses in statistics, the approach here is more algorithmic. The data science programs cover topics at the intersection of AI, machine learning, data systems, statistics, and optimization. The courses students take in these programs emphasize specialization in data science topics, while a general computer science master's degree requires breadth across different topics.