CS480/680 Lecture 6: Sum-product networks (Pranav Subramani)



The lecture discusses sum-product networks (SPNs), networks composed of sums and products that are used for tractable probabilistic modelling: they yield non-exponential runtimes and offer advantages such as interpretability and easy marginal density computation. The video also mentions SPNs' excellent performance in combination with convolutional neural networks, their potential for building better generative models when combined with models like GANs and variational autoencoders, and the untapped research areas for SPNs, including adversarial robustness, reinforcement learning scenarios, and modelling expected utilities in games. The theoretical guarantees that make the model interpretable, and the opportunity for academics to make significant contributions in this area of machine learning, are also highlighted.

  • 00:00:00 In this section, the speaker discusses tractable probabilistic modelling using sum-product networks, which are networks composed of sums and products - hence 'sum-product' - and are a tractable way of modelling probability distributions that yields non-exponential runtimes. Despite their compact size, sum-product networks are very useful in terms of expressivity, interpretability, easy marginal density computation, MAP query computation and likelihood computation, and they also show excellent performance in combination with convolutional neural networks. These models have been shown to outperform the state of the art by about 10% and can be combined with other models like GANs and variational autoencoders to create better generative models (a small sketch of such a network follows this list).

  • 00:05:00 In this section, the speaker discusses potential research areas for sum-product networks (SPNs). The speaker first introduces some theoretical properties that allow for the interpretation of models and of datasets such as MNIST. Unlike neural networks, this model provides a theoretical guarantee that allows one to interpret what the model is doing to some extent. Potential research areas for SPNs include building features on top of the primary library for SPNs, adversarial robustness, reinforcement learning scenarios with sum-product-max networks, and modelling expected utilities in games. These research areas are mostly untapped, offering an opportunity for academics to make significant contributions in the field of machine learning.
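
As a concrete illustration of the tractability claims above (a toy Python sketch, not code from the presentation), the following builds a tiny sum-product network over two binary variables; a marginal query is answered by the same single bottom-up pass as a joint query, simply by setting a variable's evidence to None. The class and variable names are made up for illustration.

    class Leaf:
        """Univariate Bernoulli leaf over one variable."""
        def __init__(self, var, p):
            self.var, self.p = var, p
        def value(self, evidence):
            x = evidence.get(self.var)       # None means "marginalize this variable"
            if x is None:
                return 1.0                   # a Bernoulli sums to 1 over its two states
            return self.p if x == 1 else 1.0 - self.p

    class Product:
        def __init__(self, children):
            self.children = children
        def value(self, evidence):
            out = 1.0
            for c in self.children:
                out *= c.value(evidence)
            return out

    class Sum:
        def __init__(self, weights, children):
            self.weights, self.children = weights, children
        def value(self, evidence):
            return sum(w * c.value(evidence) for w, c in zip(self.weights, self.children))

    # A toy SPN over binary variables X1, X2: a mixture of two fully factorized components.
    spn = Sum([0.6, 0.4],
              [Product([Leaf("X1", 0.9), Leaf("X2", 0.2)]),
               Product([Leaf("X1", 0.1), Leaf("X2", 0.7)])])

    joint = spn.value({"X1": 1, "X2": 0})        # P(X1=1, X2=0)
    marginal = spn.value({"X1": 1, "X2": None})  # P(X1=1), same single pass
    print(joint, marginal)

Because every node is either a weighted sum or a product of its children's values, any such query costs one pass that is linear in the size of the network, which is the tractability property emphasized in the talk.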
 

CS480/680 Lecture 6: EM and mixture models (Guojun Zhang)



In CS480/680 Lecture 6, Guojun Zhang discusses the basics of unsupervised learning and clustering, focusing on mixture models and their use in clustering data. The lecture centers on the Expectation-Maximization (EM) algorithm and its E-step and M-step, as well as gradient descent as an optimization method. The proposed project involves studying how EM and gradient descent behave when learning mixture models, with the ultimate goal of proposing a better algorithm that avoids bad local minima. A mathematical background is noted as necessary for the project.

  • 00:00:00 In this section, the speaker introduces the basics of unsupervised learning and clustering, and how they relate to mixture models. A mixture model describes a probability distribution as a convex combination of conditional distributions. For example, mixtures of Gaussians and mixtures of Bernoulli distributions can be used to cluster data. To fit a mixture model, we need to formulate an objective function to optimize, and the classic algorithm for doing so is the Expectation-Maximization algorithm.

  • 00:05:00 In this section, the lecturer talks about the E-step and M-step, which evaluate the posterior distribution and maximize the Q function when optimizing mixture models. Gradient descent is discussed as an alternative optimization algorithm, and it is noted that some clusters may not be recovered during optimization. The proposed project is to study how EM and gradient descent behave when learning mixture models and whether there is a way to avoid bad local minima, with the ultimate goal of proposing a better algorithm (a minimal E-step/M-step sketch follows below). The lecturer notes that a mathematical background is necessary for this project.
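
A minimal numpy sketch of the E-step/M-step alternation for a one-dimensional mixture of Gaussians, to make the algorithm above concrete; this is a generic textbook version under illustrative initialization choices, not the presenter's code.

    import numpy as np

    def em_gmm_1d(x, k=2, iters=50, seed=0):
        rng = np.random.default_rng(seed)
        n = len(x)
        pi = np.full(k, 1.0 / k)                    # mixing weights
        mu = rng.choice(x, size=k, replace=False)   # initial means picked from the data
        var = np.full(k, np.var(x))                 # initial variances
        for _ in range(iters):
            # E-step: responsibilities r[n, j] = posterior of component j for point n
            dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
            r = pi * dens
            r /= r.sum(axis=1, keepdims=True)
            # M-step: re-estimate parameters from the responsibilities
            nk = r.sum(axis=0)
            pi = nk / n
            mu = (r * x[:, None]).sum(axis=0) / nk
            var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        return pi, mu, var

    data = np.concatenate([np.random.normal(-2, 1, 200), np.random.normal(3, 0.5, 200)])
    print(em_gmm_1d(data))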
 

CS480/680 Lecture 6: Model compression for NLP (Ashutosh Adhikari)



In this video, the presenter discusses model compression for NLP and the growing processing time and memory requirements as deep neural networks become larger and deeper. Model compression techniques are categorized, and the oldest method, parameter pruning and sharing, is introduced. The speaker further elaborates on the student-teacher setup for model compression in NLP and how an objective function is used to compress a larger model into a smaller student model while retaining accuracy. Finally, the potential importance of compressing models in the context of recent work on developing large-scale NLP models is highlighted.

  • 00:00:00 In this section, the video presenter discusses the issue of model compression as the number and depth of deep neural networks increase, along with their processing time and memory requirements. The aim is to reduce the number of parameters required in neural networks while retaining accuracy and knowledge to enable faster, more efficient deployment in online applications. Model compression techniques are categorized, and the presenter delves into the oldest method: parameter pruning and sharing, developed by Yann LeCun in 1990. The presentation also touches on channel pruning techniques, which have had success with convolutional neural networks in computer vision, but less explored with NLP models. Finally, the presenter highlights the potential importance of compressing models in the context of recent work on developing large-scale models for NLP tasks.

  • 00:05:00 In this section, the speaker introduces the concept of a student-teacher system for model compression in NLP. The teacher model is a larger model that is used to extract representations from and compress into a smaller student model. The objective function is used to help the student network capture all the representations learned by the teacher network along with the classification objective. While pruning and sharing methods have not been explored in detail, attention mechanisms and transformers will be covered in future lectures. The speaker notes that these huge models are basically transformers at the core, repeatedly applied.
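
To make the student-teacher objective concrete, here is a small numpy sketch of a generic distillation-style loss: a weighted sum of the usual hard-label cross-entropy and a term that pulls the student's softened outputs toward the teacher's. The temperature, weighting, and function names are illustrative assumptions, not details taken from the talk.

    import numpy as np

    def softmax(logits, temperature=1.0):
        z = logits / temperature
        z = z - z.max(axis=-1, keepdims=True)   # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
        """alpha weighs the hard-label loss against the soft teacher-matching loss."""
        n = student_logits.shape[0]
        # Hard-label cross-entropy on the true classes.
        p_student = softmax(student_logits)
        hard = -np.log(p_student[np.arange(n), labels] + 1e-12).mean()
        # Cross-entropy between softened teacher and softened student distributions.
        p_teacher_t = softmax(teacher_logits, temperature)
        p_student_t = softmax(student_logits, temperature)
        soft = -(p_teacher_t * np.log(p_student_t + 1e-12)).sum(axis=1).mean()
        return alpha * hard + (1 - alpha) * soft

    # Toy example: 4 examples, 3 classes.
    rng = np.random.default_rng(0)
    print(distillation_loss(rng.normal(size=(4, 3)), rng.normal(size=(4, 3)),
                            labels=np.array([0, 2, 1, 0])))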
 

CS480/680 Lecture 7: Mixture of Gaussians



In this lecture about mixtures of Gaussians, the speaker explains how the model can be used for classification by constructing a prior distribution for each class, which enables the construction of a probabilistic model using Bayes' theorem to estimate the probability of a class for a given data point. The lecture also covers the process of calculating the likelihood of a data point belonging to a certain class and how this is used to determine the class prediction. The lecture notes explore the relationship between the softmax function and the argmax distribution, and how the shape of the Gaussians and the class boundaries are determined by the covariance matrices. Finally, the lecture details the process of maximum likelihood learning and how it can be used to estimate the means and covariance matrix of a mixture of Gaussians model.
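
For reference, the key identity behind the 00:15:00-00:30:00 segments below is the standard result for two classes with shared-covariance Gaussian class conditionals (restated here from the usual textbook derivation rather than quoted verbatim from the lecture):

    \[
    P(C_1 \mid x)
      = \frac{P(x \mid C_1)\,P(C_1)}{P(x \mid C_1)\,P(C_1) + P(x \mid C_2)\,P(C_2)}
      = \sigma(w^\top x + w_0), \qquad \sigma(a) = \frac{1}{1 + e^{-a}},
    \]
    \[
    w = \Sigma^{-1}(\mu_1 - \mu_2), \qquad
    w_0 = -\tfrac{1}{2}\mu_1^\top \Sigma^{-1}\mu_1
          + \tfrac{1}{2}\mu_2^\top \Sigma^{-1}\mu_2
          + \ln\frac{P(C_1)}{P(C_2)}.
    \]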

  • 00:00:00 In this section, the lecturer discusses using mixtures of Gaussians for classification, which is a statistical model in the family of generative models. They explain how Gaussian distributions are used to model inaccuracy and noise in data, which can be used to simulate the creation of a dataset for applications such as text and image generation. The lecture provides an example of linear regression and how it can also be converted into a generative model through the use of Gaussian distributions.

  • 00:05:00 In this section, the lecturer discusses the possibility of generating similar images of people through a model that can generate data similar to the training set. The lecturer uses linear regression as an example and then moves on to classification, where a prior distribution for each class is constructed. Based on this idea, a probabilistic model can be constructed using Bayes' theorem to estimate the probability of the class for a given data point. The lecture emphasizes that this is not Bayesian learning, but rather Bayesian inference.

  • 00:10:00 In this section, the instructor discusses the assumptions made in the mixture of Gaussians model and how to use Bayesian inference to compute the posterior probability of a class. The model assumes there are a finite number of categorical classes, which can be represented using a multinomial distribution. The class-conditional distribution is assumed to be a Gaussian with the same covariance matrix for each class. The posterior is proportional to the product of the prior and the class-conditional distribution, and the expression can be simplified by cancelling out a term that does not depend on the class. This simplification is possible due to the assumption of the same covariance matrix for every class, though it may not always hold in general.

  • 00:15:00 In this section, the speaker explains how the posterior can be expressed as a logistic sigmoid function, which is particularly popular in neural networks because it takes any real number and produces an output between 0 and 1. They derive the expression for the posterior and show that its argument can be written as W^T X + W_0, where W is the coefficient of X and W_0 is the constant part that does not depend on X. The logistic function has a particular definition and is used to produce an output that can be interpreted as a probability.

  • 00:20:00 In this section, the lecturer discusses the use of the logistic function in the context of mixtures of Gaussians. The logistic function is used to squash the output between 0 and 1, and its argument is obtained by treating the coefficients of X as the parameters W and the constant part as W_0. The means and covariance matrix, as well as the class probabilities, can be combined to give these parameters. This helps in computing the posterior probability of a data point belonging to a particular class. The lecturer also explains the use of class conditionals, represented by Gaussian distributions, in finding the class probabilities of the data points. These class conditionals may have different bell shapes, and the posterior gives a higher probability to the class that better matches the data point.

  • 00:25:00 In this section of the video, the lecturer explains how the likelihood of a data point belonging to a certain class is calculated for a mixture of Gaussians model. If the means of the two Gaussians are different and we assume they have the same covariance matrix, depending on the location of the point, it will naturally have a higher likelihood to belong to the class whose Gaussian has a mean closer to the point. A formula is given for the class conditional distributions, and once the posterior is computed, a class prediction can be made based on the probability of that class being greater than 0.5. The lecturer also shows the boundaries between the two classes, which is of interest when making predictions.

  • 00:30:00 In this section, the lecture explores what the class boundary looks like when using mixtures of Gaussians, assuming there are two Gaussians with the same covariance matrix. The boundary occurs where the probability of each class is the same (0.5). This simplifies to W^T X̄ = 0, which means the separator is linear. This is a simple model with a linear separator, and it is used when there are two classes. When there are more than two classes, the same computation yields the softmax function, which is also commonly used in neural networks and has its roots in mixture-of-Gaussians computations.

  • 00:35:00 In this section, the lecturer explains the relationship between the softmax function and the argmax distribution and why it is called softmax. The argmax distribution assigns probability one to the class with the highest value and zero to all other classes, while the softmax function gives a softer version of this by assigning non-zero probabilities to all classes. The exponential function arises when considering mixtures of Gaussians and computing the posterior distribution for multiple classes. The boundaries between the different classes can also be shown in the posterior distribution. The lecture notes explain that the softmax function is widely used in neural networks to determine the output class.

  • 00:40:00 In this section, the lecturer explains how the shape of the Gaussians and the resulting class boundaries are determined by the covariance matrices and how this affects the separation of classes. By using different covariance matrices, non-linear boundaries can be created, whereas using the same one results in linear boundaries. The lecturer also discusses how to estimate the parameters of the mixture of Gaussians model, namely pi, mu 1, mu 2, and Sigma, which represent the probability of each class, the means of the Gaussians, and the noise covariance matrix, respectively. The maximum likelihood method is used for this purpose.

  • 00:45:00 In this section, the lecturer explains the process of maximum likelihood learning, where the main problem is to find the parameters of the model that maximize the likelihood of the data. To solve this optimization problem, the lecturer takes the log of the expression to simplify it. The resulting expression looks complicated, but it is actually concave and therefore has a single global optimum. This method allows the class conditionals for the two classes to be combined into one expression by using the convenient labels 0 and 1 for the classes.

  • 00:50:00 In this section of the lecture, the speaker discusses how maximizing the log-likelihood, which is a concave function, yields closed-form estimates such as the empirical class proportions and the empirical mean of the data for each class in a mixture of Gaussians model. The probability of each class can be estimated by taking the fraction of data belonging to that class, an intuitive approach that is confirmed by the principle of maximum likelihood. Similarly, the mean of the inputs for each class can be estimated by summing the data points in that class and dividing by the number of points in that class. These estimations provide a formal justification for the intuitive approach to estimating these parameters.

  • 00:55:00 In this section, the speaker discusses the process of estimating the mean and covariance matrix for a mixture of Gaussians model using maximum likelihood learning. The target output data is given and the likelihood function is maximized to determine the correct values for the mean and covariance matrix. When estimating the covariance matrix, a linear combination of the empirical covariance matrices for each class is taken, which is weighted by the number of data points belonging to each class. The speaker clarifies that while Bayesian inference and maximum likelihood learning are both used in this process, the first part of the discussion was not learning but rather inference using Bayes theorem.

  • 01:00:00 In this section, the speaker explains that the machine learning part of mixtures of Gaussians involves determining the parameters of the Gaussian model. They use maximum likelihood learning to do this, but Bayesian learning is also possible; the lecture, however, only covers maximum likelihood learning. The speaker then concludes the section and states that the next class will cover an extension of this topic for classification.
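
The closed-form maximum likelihood estimates discussed from 00:45:00 onward, together with the sigmoid posterior derived earlier in the lecture, fit in a few lines of numpy. This is a sketch of the standard two-class, shared-covariance case with illustrative names, not code shown in class.

    import numpy as np

    def fit_gaussian_classifier(X, y):
        """Two classes (y in {0, 1}), Gaussian class conditionals with a shared covariance."""
        n = len(y)
        pi1 = y.mean()                               # P(C=1): fraction of class-1 points
        mu0 = X[y == 0].mean(axis=0)                 # per-class empirical means
        mu1 = X[y == 1].mean(axis=0)
        # Shared covariance: combination of per-class covariances, weighted by class counts.
        s0 = np.cov(X[y == 0].T, bias=True)
        s1 = np.cov(X[y == 1].T, bias=True)
        sigma = ((y == 0).sum() * s0 + (y == 1).sum() * s1) / n
        # Posterior P(C=1 | x) = sigmoid(w^T x + w0), as in the derivation above.
        prec = np.linalg.inv(sigma)
        w = prec @ (mu1 - mu0)
        w0 = -0.5 * mu1 @ prec @ mu1 + 0.5 * mu0 @ prec @ mu0 + np.log(pi1 / (1 - pi1))
        return w, w0

    def posterior(X, w, w0):
        return 1.0 / (1.0 + np.exp(-(X @ w + w0)))

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal([-1, -1], 1.0, size=(100, 2)),
                   rng.normal([2, 1], 1.0, size=(100, 2))])
    y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])
    w, w0 = fit_gaussian_classifier(X, y)
    print((posterior(X, w, w0) > 0.5).astype(int)[:5], y[:5])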
 

CS480/680 Lecture 8: Logistic regression and generalized linear models



This first part of the lecture on "CS480/680: Logistic Regression and Generalized Linear Models" introduces the idea of the exponential family of distributions and its relation to logistic regression, a powerful technique used for classification problems. The lecture explains that logistic regression aims to fit the best logistic function that models the posterior for a given dataset, and for problems with a few dimensions and weights, Newton's method can be used to find the minimum of the objective function, which is a convex function. The instructor also highlights the importance of logistic regression in recommender systems and ad placement, where the simplicity and efficiency of the technique make it ideal for making personalized recommendations based on user characteristics and behaviors.

The lecture also covers logistic regression and generalized linear models in more depth. The instructor discusses the limitations of Newton's method for logistic regression, such as overfitting caused by arbitrarily large weights and singularity problems in the Hessian matrix. To prevent overfitting, regularization is suggested. The instructor introduces generalized linear models (GLMs), which can be used to work with non-linear separators efficiently. GLMs involve mapping the inputs to a new space in which linear regression and classification correspond to non-linear regression and classification in the original space, as long as the mapping is non-linear. The lecture also covers basis functions and the types that can be used to perform nonlinear regression and classification.
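
A minimal numpy sketch of the Newton (iteratively reweighted least squares) update for logistic regression described in the segments below, with a small ridge penalty added as the regularization the lecture recommends; the function names and hyperparameter values are illustrative assumptions, not lecture code.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def logistic_newton(X, y, iters=10, lam=1e-3):
        """X: (n, d) inputs, y: (n,) labels in {0, 1}. lam is a small ridge penalty
        that also keeps the Hessian invertible (the singularity issue discussed below)."""
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a column of ones
        w = np.zeros(Xb.shape[1])
        for _ in range(iters):
            p = sigmoid(Xb @ w)                          # predicted probabilities
            grad = Xb.T @ (p - y) + lam * w              # gradient of the penalized loss
            R = p * (1 - p)                              # diagonal of the weight matrix
            H = Xb.T @ (Xb * R[:, None]) + lam * np.eye(Xb.shape[1])   # Hessian
            w -= np.linalg.solve(H, grad)                # Newton step
        return w

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
    print(logistic_newton(X, y))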

  • 00:00:00 In this section of the video, the lecture discusses the limitations of the statistical model for classification based on mixtures of Gaussians, which assumes a Gaussian distribution. To address this limitation, they introduce a broad class of distributions known as the exponential family, leading to the development of a powerful and flexible technique called logistic regression. The lecture draws on the board to illustrate mixtures of Gaussians and explains that this method is suitable when the data forms clusters of a certain shape. However, if the data does not have this shape, the assumption of a Gaussian distribution needs to be relaxed. The lecture introduces the idea of the exponential family of distributions and explains its significance in the development of logistic regression.

  • 00:05:00 In this section, the speaker discusses the exponential family, which includes many famous distributions such as the Bernoulli, Poisson, and Gamma. The family is called exponential because the probability density function has an exponential form: the exponent has a term that is linear in theta, some terms in X, and other terms in theta and X. The key to the different distributions in the family is the precise choice of the functions T(X), a(theta), and b(X). The beauty of this family is that the resulting posterior can be rewritten in the form of a logistic sigmoid function. This characteristic enables the speaker to introduce probabilistic discriminative models, where the aim is to estimate the parameters of the logistic function directly, instead of making assumptions about the data being perturbed with some noise and estimating the parameters of the associated distribution.

  • 00:10:00 In this section, we learn about logistic regression, which is a technique used to fit or find the best logistic function that models the posterior for a given data set. The posterior follows the softmax distribution whenever there are multiple classes. We would like to find the W that maximizes the posterior given some data. This optimization problem is converted into a minimization problem by introducing a negative sign. The goal is to find the best W that ensures the probability of the correct class Y is as high as possible for most data points.

  • 00:15:00 In this section, the instructor discusses logistic regression and how it can be used for classification problems. The goal is to find the W that minimizes the objective, and it is important to note that even though this technique is called logistic regression, it is really solving a classification problem. The name arises because logistic regression is a form of regression in the sense that we are estimating the posterior probability of the class given X, which is a numerical value. The instructor goes on to explain that an iterative method is necessary to solve this optimization problem because there is no way to isolate the variable in the expression in closed form.

  • 00:20:00 In this section of the lecture, the instructor discusses how to deal with the non-linear equation in logistic regression. The objective function of logistic regression is shown to be a convex function, making it easier to find the global optimum. The instructor explains that iterative methods, such as gradient descent or Newton's method, can be used to find the minimum of the objective function. While gradient descent can be used, it is not efficient, and determining the right step size is difficult. Newton's method is much faster and requires fewer steps, making it a popular choice for optimization.

  • 00:25:00 In this section of the lecture, the speaker discusses Newton's method for logistic regression, which is an improvement over gradient descent. Newton's method involves starting with an initial guess for W and then subtracting from W the inverse of the Hessian times the gradient of the loss function. This method essentially amounts to iteratively reweighted least squares and approximates the objective with a quadratic function instead of a line, allowing for a better approximation of the curve and faster convergence. The beauty of this method is that each time a quadratic approximation is made, its minimum can be found optimally in closed form, eliminating the need to compute a step length.

  • 00:30:00 In this section, the speaker introduces Newton's method, a second-order optimization method that approximates the function with a quadratic at each step, finds the minimum of that quadratic, and then refits. This differs from gradient descent, which relies only on first-order (linear) information. Newton's method can be much faster and more accurate since the quadratic approximation fits the curve better than a linear one, but it requires computing the Hessian, which includes all second-order derivatives and can be expensive in high-dimensional problems. It is therefore better suited to problems with few dimensions and weights.

  • 00:35:00 In this section, the instructor explains the rationale behind approximating the complicated non-linear function of logistic regression with a quadratic function using Newton's method. While gradient descent is cheaper, approximating with a quadratic function is a better fit, allowing for better steps. It is also computationally doable to find the minimum of a quadratic function. Newton's method is not guaranteed to find the global optimum for non-convex objectives, but since the function of logistic regression is convex, there is a single global optimum, and Newton's method can start anywhere. The main thing that needs to be computed to apply Newton's method is the Hessian, which can be obtained through a derivation resulting in an expression involving the data set with a row of ones and a diagonal matrix of Sigmas.

  • 00:40:00 In this section, the instructor discusses logistic regression and its importance as a machine learning technique, particularly for recommender systems and ad placement. Logistic regression is used to make recommendations to users, such as product recommendations or ad recommendations. The idea is to show ads that have a high probability of being clicked on by the user, and this can be modeled as a classification problem. The instructor also presents the structure of the Hessian and how to obtain it using the formula, which is important for programming logistic regression and applying Newton's method. While some students may find the mathematics overwhelming, it is essential to understand these methods to see how they arise and why they work.

  • 00:45:00 In this section, the concept of logistic regression is explained as a method for making recommendations to users, such as for products or apps, based on their characteristics and behaviors. Logistic regression is often used for these types of problems because it is simple, flexible, and efficient to implement, with predictions relying on computing a dot product. Examples of features that can be taken into account for making recommendations include whether certain apps have already been downloaded and installed, age, gender, location, and any other relevant information that the company or smartphone has about the user.

  • 00:50:00 In this section, the lecturer explains how logistic regression can be used for classification problems with two classes, where a point is assigned to the class whose probability is greater than or equal to 0.5. If there are multiple classes, a softmax distribution can be used, with a vector W_k for every class k. The lecturer highlights that logistic regression makes prediction simple, as it only involves computing a dot product, and this can be made efficient by exploiting sparsity and parallelizing some of the computation.

  • 00:55:00 In this section, the speaker discusses the efficiency of logistic regression and how it can run on low-resource devices by exploiting sparsity and parallelizing the computation. The dot product computation can ignore zero entries, making long vectors containing millions of entries faster to process. The learning can also be parallelized on a GPU, which is ideal for systems like recommender systems that require quick and scalable predictions. Furthermore, expanding the feature set is easy and only requires scaling up rather than redesigning everything.

  • 01:00:00 In this section, the professor discusses the limitations of Newton's method for logistic regression and the issue of overfitting. While Newton's method is a fast optimization technique, it is not scalable to large datasets with millions of features. Logistic regression also tends to overfit easily: because the optimization is convex, it finds the global optimum, which can fit the data too well. Overfitting can cause singularities in the Hessian matrix, making it impossible to apply Newton's method. The sigmoid function of logistic regression approaches zero and one only asymptotically, so to achieve a probability close to one, W transpose X bar has to become arbitrarily large, which results in overfitting.

  • 01:05:00 In this section, the lecturer discusses the issue of overfitting in logistic regression models. They explain that as W transpose X bar goes to infinity, the magnitude of W also goes to infinity, which can result in weights becoming arbitrarily large. Additionally, the Hessian will tend to zero due to the sigmoid function, making it difficult to apply Newton's method because computing the inverse of the Hessian will not be possible numerically. To prevent overfitting, the lecturer suggests using regularization, where a penalty term is added to minimize the magnitude of the weights. This also helps to prevent singularity issues.

  • 01:10:00 In this section of the video, the instructor discusses how to prevent overfitting in logistic regression and generalized linear models by adding a penalty term using Newton's method. However, a limitation of logistic regression is that the boundary between classes is always linear. To overcome this limitation and work with non-linear separators, the instructor introduces generalized linear models, which involve mapping the inputs to a new space where linear regression and classification can be done in a non-linear way as long as the mapping is non-linear. This simple approach allows for the generalization of linear models to work in non-linear settings and serves as the basis for kernel methods that are covered later in the course.

  • 01:15:00 In this section, the speaker discusses how to approximate a function using nonlinear regression and generalized linear models. The idea is to map the data from the original space to a new space using a mapping function, denoted Phi, which maps each input X into a new set of features. Each component of the mapping is a basis function, and as long as the mapping is non-linear it can capture non-linearities in the original space. The goal is then to find coefficients, i.e. weights, that give the best function in that hypothesis space. Ultimately, this technique lets us perform linear regression or classification in the new space while implicitly capturing the non-linearities of the original space.

  • 01:20:00 In this section, the instructor explains how to use logistic regression and generalized linear models to classify data points. The process involves mapping the input space into a higher-dimensional space using basis functions and then searching for the optimal hyperplane to separate the data in this higher dimension. The instructor highlights that prior knowledge about the possible space of functions is needed to choose appropriate basis functions but there are techniques available for learning basis functions. Additionally, the instructor explains how the weights in the model define the vector that is normal to the hyperplane.

  • 01:25:00 In this section, the lecturer discusses basis functions and the types that can be used for logistic regression and generalized linear models. The lecturer first introduces polynomial basis functions, which span polynomial functions by taking all the powers of X up to a certain degree. The lecturer then presents two examples of nonlinear basis functions: Gaussian and sigmoid basis functions. Gaussian basis functions are parameterized by mu and s, where mu indicates the position of the bump on the x-axis and s controls the width of the bump. Sigmoid basis functions are nonlinear functions but not probability distributions, and take the form sigma((X - mu_j) / s). Other nonlinear functions that can be used as basis functions include wavelets, sines, and cosines.

  • 01:30:00 In this section of the lecture, the speaker discusses how to generalize linear models in order to implicitly perform nonlinear regression and classification. By replacing the input variable X with Phi of X, which is the input into a new space, various nonlinear functions can be utilized. The Phi function can be applied to different parts of the original input X and can be used to capture the underlying function using a set of basis functions, such as polynomials or Gaussians. This concludes the topic, which provides a fundamental understanding of nonlinear logistic regression and generalized linear models.
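
As a concrete illustration of the basis-function idea in the last few segments, the following sketch maps one-dimensional inputs through a handful of Gaussian bumps and then does ordinary linear least squares in that feature space; the nine equally spaced centres and the width s are arbitrary illustrative values, not ones prescribed in the lecture.

    import numpy as np

    def gaussian_features(x, centres, s=0.5):
        """Map scalar inputs to Gaussian bumps phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)),
        plus a constant feature for the bias."""
        phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
        return np.hstack([np.ones((len(x), 1)), phi])

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 100)
    y = np.sin(x) + 0.1 * rng.normal(size=x.shape)     # noisy nonlinear target

    centres = np.linspace(-3, 3, 9)                    # positions of the bumps (mu_j)
    Phi = gaussian_features(x, centres)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)        # linear regression in feature space
    y_hat = Phi @ w                                    # nonlinear fit in the original space
    print(np.mean((y_hat - y) ** 2))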
 

CS480/680 Lecture 9: Perceptrons and single layer neural nets



This lecture introduces neural networks with a focus on the most elementary type, the perceptron, which produces a linear separator for classification. The lecture explores how weights are used to compute a linear combination of inputs that passes through an activation function to produce outputs, and how different weights can be used to emulate logic gates such as AND, OR, and NOT gates. The lecturer discusses the feedforward neural network, how the perceptron learning algorithm is used for binary classification, and how gradient descent can optimize the weights. Limitations of using a line to separate data are discussed, and the logistic sigmoid activation function is introduced as a possible alternative, with a focus on how the weights can be trained when using it.

This lecture on perceptrons and single-layer neural nets covers the use of logistic sigmoid activation functions to minimize squared error and the introduction of the learning rate as a crucial parameter in sequential gradient descent. The lecturer also demonstrates how neural networks with multiple layers can be composed to approximate any function arbitrarily closely using thresholding functions, and how backpropagation can be used to train a network to learn arbitrary functions. The instructor emphasizes the versatility and efficiency of neural networks, citing their widespread use in solving various problems such as speech recognition, computer vision, machine translation, and word embeddings.
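
One possible choice of weights for threshold units that implement the AND, OR, and NOT gates mentioned above (an illustrative sketch of the construction; the lecture may use different but equivalent weights):

    def threshold_unit(weights, bias):
        """A single unit: fires (outputs 1) when the weighted sum plus bias is >= 0."""
        return lambda *inputs: int(sum(w * x for w, x in zip(weights, inputs)) + bias >= 0)

    AND = threshold_unit([1, 1], -1.5)   # fires only when both inputs are 1
    OR  = threshold_unit([1, 1], -0.5)   # fires when at least one input is 1
    NOT = threshold_unit([-1], 0.5)      # fires only when the input is 0

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, AND(a, b), OR(a, b))
    print(NOT(0), NOT(1))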

  • 00:00:00 In this section of the lecture, the focus is on a brief introduction to neural networks, with a particular emphasis on the perceptron, which is an elementary type of neural network that does not have hidden layers. It produces a linear separator for classification and plays a crucial role in the history of neural networks. Later on, more complex forms of neural networks become more generalized. The lecture also touches on how the brain works and how it could be emulated through computation. While the brain consists of neurons, the computer functions using logic gates that communicate through an electrical signal, which makes the computation sequential. However, the brain's signals propagate parallelly, making it more robust.

  • 00:05:00 In this section, the speaker discusses the fragility of computers as compared to the human brain, and how neural networks try to emulate the organization of the brain. Neural networks consist of nodes called units, which correspond to the neurons in a real neural network, and links that correspond to synapses. Computation is done through numerical signals, which are transmitted between the units. The goal is to activate neurons when the right pattern is inputted, allowing for a more robust computation that can handle the deletion of some neurons by using techniques for regularization. The inspiration for neural networks comes from the organization and propagation of chemical signals in actual biological neural networks.

  • 00:10:00 In this section, the lecturer explains how weights are used in neural networks to compute a linear combination of inputs and produce a new signal. This new signal is then passed through an activation function, which applies some nonlinearity to produce the output. Each node in the network receives inputs, rescales them with weights, and applies the activation function to produce an output that is then passed along to the next nodes in the network. The lecturer emphasizes that the weights are crucial to the behavior of the network and can be adjusted during the learning process to improve performance.

  • 00:15:00 In this section, the lecturer discusses how units in a neural network compute a nonlinear function of a linear combination of the inputs based on weights assigned to each input. The inputs themselves can be previous nodes that have gone through a nonlinear activation function. Rather than creating basis functions to map inputs to a new space, neural networks allow part of the network to learn how to remap inputs to a new space. Nonlinear activation functions are necessary to make the network more expressive, and the lecturer explains two popular examples of such functions: the threshold activation function and the sigmoid function.

  • 00:20:00 In this section of the lecture, the professor discusses the use of activation functions in neural networks, specifically the thresholding function and the sigmoid function. He explains that while the thresholding function is useful for outputting 0s and 1s, it is not smooth and continuous, which can make it difficult to use with gradient-based methods. The sigmoid function is a smooth version of the thresholding function with the same shape, but its slope can be adjusted. The professor then explores the design of units in neural networks that can emulate basic gates like the AND, OR, and NOT gates. He demonstrates an example of a unit with a thresholding activation function that can emulate the NAND gate and asks whether it is possible to come up with weights whose output corresponds to the AND of the inputs.

  • 00:25:00 In this section, the lecturer discusses how different weights can be used in a perceptron to emulate logic gates such as the AND, OR, and NOT gates. By tweaking the weights used in the perceptron, the network can be designed to produce the desired truth table for each of these gates. The lecturer provides examples of weights that emulate each of the logic gates, including weights for the AND gate, the OR gate, and the NOT gate.

  • 00:30:00 In this section, the lecturer discusses two broad classes of networks: feed-forward neural networks, which consist of a directed graph of nodes that flows in one direction, and recurrent neural networks, which are cyclic in nature and are useful for handling inputs of varying lengths, making them popular in natural language processing. The lecturer focuses on feed-forward neural networks and draws a simple example with two input units, one hidden layer with two units, and one output unit. By changing the weights of the connections between the layers, it is possible to emulate the AND, OR, and NOT units, allowing the approximation of any Boolean function.

  • 00:35:00 In this section, the lecturer explains the concept of a perceptron, which is essentially a simple single-layer feedforward neural network used for binary classification. The algorithm for training a perceptron is discussed, where each output unit is trained separately by looping through the data set for each X Y pair and adjusting the weights based on whether the produced output is correct or not. The lecture also discusses the use of matrix representation for weights in neural networks.

  • 00:40:00 In this section, the teacher explains the perceptron learning algorithm, which is used for units that pass their input through a threshold function. The algorithm applies a very simple rule: if the computation of the network is correct, the weights are kept the same, but if the output is incorrect, the weights are adjusted by simply adding the input X to them or subtracting it, depending on the output. The goal is to increase the linear combination of inputs and weights if the output is supposed to be positive, or decrease it if it is supposed to be negative, so that the perceptron computes an output that is closer to the correct answer (a short sketch of this update rule follows this list). The key is to take advantage of the fact that the thresholding function returns 1 when the linear combination is positive and 0 when it is negative.

  • 00:45:00 In this section, the speaker discusses the use of gradient descent to optimize the weights of a perceptron. A loss function is defined as the misclassification error: a data point (X, Y) is considered misclassified when the product Y · W^T X is negative. A point is expected to be positive if it belongs to class 1 and negative if it belongs to class -1. The misclassified points are added up to obtain an objective that can be minimized. The gradient of this objective is then computed, and a step is taken in the opposite direction of the gradient.

  • 00:50:00 In this section of the lecture on perceptrons and single layer neural networks, the professor discusses the use of gradient descent with sequential processing to update weights in the perceptron algorithm. The algorithm relies on linearly separable data to eventually correctly classify all training instances. A theorem is presented stating that the threshold perceptron learning algorithm will converge if and only if the data is linearly separable. The section ends with an explanation and visualization of linearly separable data versus non-linearly separable data.

  • 00:55:00 In this section of the lecture, the professor discusses the limitations of attempting to separate a dataset with a line and introduces the possibility of using a logistic sigmoid activation function instead of the threshold activation function. The logistic sigmoid may be nonlinear, but it still produces a linear separator at the point where the probability is 0.5 for each class. Therefore, using the logistic sigmoid still gives us a linear separator and a hypothesis space that is the same as logistic regression. The professor then addresses the question of how to train the weights of the perceptron with the logistic sigmoid activation function.

  • 01:00:00 In this section, the speaker discusses the approach for defining an objective and minimizing the squared error in perceptrons with logistic sigmoid activation functions. They explain that the algorithm for maximum likelihood is essentially the same as logistic regression, while minimizing squared error requires finding the gradient and taking steps in the direction of it. They also introduce the idea of using a learning rate to define the step size in sequential gradient descent and mention that it is a critical parameter that often needs to be adjusted. The speaker suggests that taking steps with respect to mini-batches of data points or just one data point is common in practice.

  • 01:05:00 In this section, the lecturer explains how neural networks with multiple layers can approximate any function arbitrarily closely. By composing different neurons together, he demonstrates the creation of a 2D ridge by adding two parallel sigmoid units with opposite slopes, and then shows how two ridges can be composed to form a bump when intersected perpendicularly. This technique allows the creation of classifiers that can assign points to one class in one small region and the other class everywhere else. The lecturer illustrates the corresponding network, which includes four sigmoid units and a ridge with the identity activation function.

  • 01:10:00 In this section of the lecture on perceptrons and single-layer neural nets, the professor discusses the construction of bumps using thresholding functions or sigmoids, and how they can be tiled and added together to approximate any curve arbitrarily closely. He explains that this approach can be used for regression, and that it is possible to train a neural net to learn an arbitrary function using algorithms such as backpropagation. Backpropagation is essentially a form of gradient descent that exploits the structure of the network to compute all partial derivatives simultaneously.

  • 01:15:00 In this section, the instructor explains how partial derivatives for all weights in a neural network can be obtained simultaneously in a constant number of passes through the network using the backpropagation algorithm. The instructor emphasizes that neural networks have gained widespread popularity due to their versatility and power in solving various problems such as speech recognition and computer vision. The state of the art in machine translation and word embeddings also employs neural networks, and their popularity is partly due to their efficiency.
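
A minimal sketch of the perceptron update rule from the 00:35:00-00:50:00 segments, using labels in {-1, +1} and a bias folded into the weight vector; the data and names are illustrative, not taken from the course.

    import numpy as np

    def train_perceptron(X, y, epochs=20):
        """X: (n, d) inputs, y: (n,) labels in {-1, +1}."""
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # fold the bias into the weights
        w = np.zeros(Xb.shape[1])
        for _ in range(epochs):
            for xi, yi in zip(Xb, y):
                if yi * (w @ xi) <= 0:                  # misclassified (or on the boundary)
                    w += yi * xi                        # add or subtract the input
        return w

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)          # linearly separable labels
    w = train_perceptron(X, y)
    print(np.mean(np.sign(np.hstack([np.ones((100, 1)), X]) @ w) == y))

With linearly separable data, as the convergence theorem in the 00:50:00 segment states, this loop eventually classifies every training point correctly.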
 

CS480/680 Lecture 10: Multi-layer neural networks and backpropagation



This lecture on multi-layer neural networks and backpropagation explains the limitations of linear models and the need for non-linear models such as multi-layer neural networks. The lecturer discusses the different activation functions that can be used in neural networks and how they allow for non-linear basis functions. The lecture goes on to explain how the backpropagation algorithm is used to compute the gradient of the error with respect to every weight in a neural network. Automatic differentiation tools are also discussed as a way to efficiently compute the deltas and gradients in a neural network. Overall, the lecture emphasizes the flexibility and power of neural networks in approximating a wide range of functions.

The lecturer in this video discusses issues in optimizing neural networks, such as slow convergence, local optima, non-convex optimization, and overfitting. Overfitting can be addressed with techniques such as regularization and dropout. The speaker then explains the behaviour of gradient descent, highlighting the need to adapt the step size to improve its efficiency. The AdaGrad algorithm is presented as one solution, adjusting the learning rate of each dimension separately. The speaker also introduces RMSProp, which keeps a weighted moving average of the previous squared gradients, and finally Adam, which additionally takes a weighted moving average of the gradient itself and is shown to outperform other techniques such as SGD with Nesterov momentum. (A sketch of these three update rules follows this paragraph.)
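
A sketch of the three update rules named above, applied to a toy quadratic with very different curvature in each dimension; the hyperparameter values are the commonly used defaults, assumed here for illustration rather than taken from the lecture.

    import numpy as np

    def optimize(grad_fn, w0, rule="adam", lr=0.1, steps=200,
                 beta1=0.9, beta2=0.999, eps=1e-8):
        w = w0.astype(float).copy()
        s = np.zeros_like(w)       # accumulated squared gradients (AdaGrad/RMSProp/Adam)
        m = np.zeros_like(w)       # moving average of gradients (Adam only)
        for t in range(1, steps + 1):
            g = grad_fn(w)
            if rule == "adagrad":
                s += g ** 2                              # sum of all squared gradients so far
                w -= lr * g / (np.sqrt(s) + eps)
            elif rule == "rmsprop":
                s = beta2 * s + (1 - beta2) * g ** 2     # exponentially decayed average
                w -= lr * g / (np.sqrt(s) + eps)
            elif rule == "adam":
                m = beta1 * m + (1 - beta1) * g          # moving average of the gradient
                s = beta2 * s + (1 - beta2) * g ** 2
                m_hat = m / (1 - beta1 ** t)             # bias corrections
                s_hat = s / (1 - beta2 ** t)
                w -= lr * m_hat / (np.sqrt(s_hat) + eps)
        return w

    # Toy objective with very different curvature per dimension: f(w) = 50*w0^2 + 0.5*w1^2.
    grad = lambda w: np.array([100.0 * w[0], 1.0 * w[1]])
    for rule in ("adagrad", "rmsprop", "adam"):
        print(rule, optimize(grad, np.array([1.0, 1.0]), rule=rule))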

  • 00:00:00 In this section, the lecturer provides a quick recap of linear regression and three models for linear classification. However, the problem with these models is that they still give us a linear separator. Thus, the lecture shifts the discussion to non-linear models and introduces the need for multi-layer neural networks.

  • 00:05:00 In this section, the instructor reviews linear models, including the perceptron and its threshold activation function, and the sigmoid activation function. The instructor explains that linear models can be extended to non-linear models to accommodate functions that are not straight lines but rather are curved. To achieve this, nonlinear regression is introduced, which uses a mapping function, Phi of X, to shift data into a new space. The instructor also introduces multi-layer neural networks, which provide adaptive basis functions for nonlinear regression, and then relates these back to the generalized linear regression model. Finally, the instructor discusses generalized nonlinear classification.

  • 00:10:00 In this section of the lecture, the speaker discusses how to work with unrestricted nonlinear models. The problem with the linear models with basis functions that we have seen so far is that we have to choose the basis functions a priori, and we might not have enough domain knowledge to do that. The solution is to choose basis functions that depend on the data and allow a very large number, or even an infinite number, of basis functions without paying a price. This idea was initially the approach in kernel methods and was the dominant set of techniques until around 2010.

  • 00:15:00 In this section, the video discusses the introduction of multi-layer neural networks in deep learning, which led to many of the successes we see today. Specifically, the video focuses on a two-layer neural network with fully connected nodes, where the connections into each layer have weights that can be represented in a matrix. The hidden units and output units are computed using linear combinations followed by activation functions, with each layer having its own set of weights. By adjusting the parameters inside the basis functions, it is possible to adapt them to the training set, leading to a more successful deep learning model.

  • 00:20:00 In this section of the lecture, the professor explains how neural networks are essentially mathematical functions composed of multiple layers and weights. They use activation functions, such as the sigmoid or hyperbolic tangent, to add nonlinearity. These activation functions can act as basis functions for the next layer and can be used in nonlinear regression. By using nonlinear activation functions in the first layer and an identity function in the output layer, a neural network can be represented as a linear combination of nonlinear basis functions.

  • 00:25:00 In this section, the speaker discusses two-layer neural networks for non-linear regression and classification. The mathematical formula for the two-layer neural network involves hidden units with a sigmoid activation function and output units with the identity activation function. The sigma acts as a non-linear basis function that is parameterized by some weights, allowing the basis functions to adapt as the model is trained. This approach is the main difference between non-linear and linear regression. Similarly, for classification, the speaker shows how the same formula applies by computing basis functions that are non-linear via the first layer.

  • 00:30:00 In this section, the lecturer explains how multi-layer neural networks differ from logistic regression, despite having a similar interpretation. The neural network allows for more adaptive basis functions by using weights that are updated during training. The non-linearity comes from the use of the sigma function, which can be replaced by other functions like the Gaussian or tanh function. The neural network can be used for both classification and regression by adjusting the activation function. The lecturer also mentions that multiple classes can be handled by replacing the sigma function with another suitable function.

  • 00:35:00 In this section, the lecture discusses the optimization of weights for multi-layer neural networks, which includes both the weights of the final linear combination and the weights that define the non-linear basis functions. The most common approach is error minimization, which compares the output of the neural network to a target and computes the difference. Backpropagation is the standard algorithm that allows these errors to be propagated back through the network to compute the gradient with respect to every weight, and this gradient drives the update rule that optimizes the weights. The backpropagation computation can be derived by hand, but packages like TensorFlow and PyTorch offer tools for automatic differentiation.

  • 00:40:00 In this section, the speaker explains the backpropagation algorithm used to compute the gradient, i.e., the partial derivative of the error with respect to every weight in a neural network. The algorithm is divided into two phases: a forward phase, where the output of the network is computed from the inputs, and a backward phase, where delta, a measure of error, is backpropagated to compute the partial derivative of the error with respect to every weight. Each partial derivative is computed in two steps using the chain rule, as a product of delta_j and z_i (a compact sketch of the two phases appears after this list). The speaker illustrates the algorithm with a fully connected network consisting of two inputs, two hidden units, and two output units, and shows how the algorithm computes the output of each unit and backpropagates the errors.

  • 00:45:00 In this section of the video, the speaker discusses how to obtain partial derivatives in multi-layer neural networks using the backpropagation algorithm. The speaker explains that starting at the output layer, one can compute the partial derivative of the error with respect to every output unit J using a recursive formula that depends on the deltas of the output units. The speaker then demonstrates a simple example of using the forward and backward phase to compute the output of hidden and output units in a neural network.

  • 00:50:00 In this section, the speaker explains how to compute the deltas and gradients in a neural network and how automatic differentiation tools can help do this efficiently. They provide equations for computing the deltas for hidden layers and output layers and show how to use these to calculate gradients. The speaker emphasizes that automatic differentiation tools can save time and effort in manually computing gradients when working with different architectures and functions. The section concludes with examples of how with just three hidden units, a neural network can approximate arbitrary functions such as x-squared, absolute value of x, and sine of x.

  • 00:55:00 In this section, the lecturer discusses the ability of neural networks to approximate different functions. The network can converge to non-linear basis functions that can approximate smooth functions, such as quadratic and sine functions, quite well. However, for non-smooth functions, like the absolute function, the neural network struggles to approximate it without enough hidden units. Nevertheless, even for discontinuous functions like the step function, the network can still provide a reasonable approximation. The lecturer then moves on to the discussion of optimizing neural networks, which involves computing the gradient using automatic differentiation and performing stochastic gradient descent. While this is a general optimization technique, convergence can be slow without additional optimization methods.

  • 01:00:00 In this section, the lecturer discusses the issues that can arise when optimizing neural networks, including slow convergence, local optima, non-convex optimization, and overfitting. Overfitting can be addressed through techniques such as regularization and dropout. To illustrate the concept of slow convergence, the lecturer draws a picture of a bowl-shaped surface representing the error function. Gradient descent may converge slowly when starting far from the minimum, and modern techniques such as momentum and adaptive learning rates can accelerate convergence.

  • 01:05:00 In this section, the lecturer discusses the behavior of gradient descent for optimization. The direction of the gradient is generally perpendicular to the contour line, and the problem with taking one step in its direction is that it may overshoot the minimum. On the other hand, if the gradient is small, taking many small steps may be necessary to get to the minimum. Therefore, there are regions where bigger steps should be taken and regions where smaller steps are more appropriate. This behavior highlights the need for optimizing the size of steps to improve the efficiency of gradient descent.

  • 01:10:00 In this section, the speaker discusses the potential issues with relying on the size of the gradient to determine the step size in a neural network. Since the size of the gradient may not be consistent across different dimensions, one solution, proposed by the AdaGrad algorithm, is to adjust the learning rate of each dimension separately by accumulating the sum of the squared gradients seen so far and dividing the step size by the square root of that value. This allows the step size to adjust to the magnitude of the gradient in each dimension. However, the learning rate may decay too quickly in some applications, hindering progress.

  • 01:15:00 In this section, the speaker discusses the problems with gradient descent in neural networks and how adjusting the learning rate can help when doing stochastic gradient descent. The speaker introduces the notion of a "dimension" in a neural net, where there is one dimension per weight. They explain the problem with the accumulation of large sums and the resulting shrinking of the step sizes. The speaker suggests a solution to this problem in the form of RMSProp, which keeps a weighted moving average of the previous squared gradients with an exponential decay that forgets older gradients. However, this method is not perfect, and the speaker acknowledges its limitations.

  • 01:20:00 In this section, the lecturer discusses the issue of the gradient lacking momentum even in a region where its direction is stable, which motivates a way of increasing the step size when the direction stays the same. This leads to the heuristic known as Adam, which takes a weighted moving average of the gradient itself and stores it in s_t; when doing the update, instead of taking a step in the direction of the raw gradient, one takes a step in the direction of that moving average. The technique is a heuristic that was published at ICLR in 2015, and the main difference from its predecessors is that it came with some theory and proofs of convergence about its properties. However, after publication, issues were found with some of the proofs, which led to modifications and further proofs aimed at something more principled.

  • 01:25:00 In this section, the speaker explains the trade-off between making a few good but expensive steps and making many cheap steps that are individually not very good but overall still end up closer to the minimum. He also discusses optimization techniques that do not scale well, such as second-order techniques like Newton's method. In practice, these heuristics tend to work well despite their lack of strong theory. The speaker then provides empirical comparisons between Adam and other techniques such as SGD with Nesterov momentum and shows that Adam tends to perform quite well.
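
A compact numpy sketch of the forward and backward phases for a two-layer network with sigmoid hidden units, identity outputs, and squared error, in the spirit of the 00:40:00-00:50:00 segments; bias terms are omitted and all names and sizes are illustrative, not the lecture's own example.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def backprop_step(x, t, W1, W2, lr=0.1):
        """One gradient step on squared error for a single example x with target t."""
        # Forward phase: compute activations layer by layer.
        z = sigmoid(W1 @ x)            # hidden activations
        y = W2 @ z                     # identity output units
        # Backward phase: propagate deltas from the outputs toward the inputs.
        delta2 = y - t                             # output deltas (squared error, identity output)
        delta1 = (W2.T @ delta2) * z * (1 - z)     # hidden deltas via the chain rule
        # Gradients are outer products of deltas and the activations feeding each layer.
        W2 -= lr * np.outer(delta2, z)
        W1 -= lr * np.outer(delta1, x)
        return W1, W2, 0.5 * np.sum((y - t) ** 2)

    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.5, size=(3, 2))    # 2 inputs -> 3 hidden units
    W2 = rng.normal(scale=0.5, size=(1, 3))    # 3 hidden units -> 1 output
    for _ in range(2000):
        x = rng.uniform(-1, 1, size=2)
        W1, W2, err = backprop_step(x, np.array([x[0] ** 2]), W1, W2)
    print(err)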
 

CS480/680 Lecture 11: Kernel Methods



CS480/680 Lecture 11: Kernel Methods

In this lecture, the concept of kernel methods is introduced as a way to scale up generalized linear models by mapping data from one space into a new space using a nonlinear function. The dual trick or kernel trick is explained as a technique that enables working in high-dimensional spaces without paying additional costs, leading to the use of a kernel function that computes the dot product of pairs of points in the new space. Various methods for constructing kernels are discussed, including the polynomial and Gaussian kernels, which can be used to measure similarity between data points and are useful in classification tasks. Rules for composing kernels are also introduced to construct new kernels and control their complexity. The lecture emphasizes the importance of choosing functions that correspond to Phi transpose Phi, since the Gram matrix must be positive semi-definite, i.e., have eigenvalues greater than or equal to zero.

In this lecture on kernel methods, the speaker defines kernels as positive semi-definite functions that can be decomposed into a matrix times its transpose. Various types of kernels, such as polynomial and Gaussian, and their applications are discussed for comparing different types of data such as strings, sets, and graphs. The speaker also explains how substring kernels can quickly compute similarity between words by increasing the length of substrings and using dynamic programming. Additionally, support vector machines are shown to be effective in performing document classification using news articles from Reuters.

  • 00:00:00 In this section, the speaker introduces kernel methods, which are useful for scaling up generalized linear models. A quick recap of the similarities and differences between generalized linear models and neural networks is given, highlighting that fixed nonlinear basis functions are used in linear models and that the optimization tends to be easier and typically convex, while adaptive basis functions are used in neural networks, and the optimization tends to be harder. The introduction of kernels leads to a trick that avoids paying a price for the larger space when working with models that involve nonlinear mappings.

  • 00:05:00 In this section, the lecturer explains the evolution of machine learning paradigms, highlighting how the limited hypothesis space was not a significant concern when the amount of data was not ample. However, the neural networks era starting from 2009 brought forth a lot of data and computational power, making having a richer hypothesis space essential. The lecturer introduces the dual trick or kernel trick, a computational technique that enables working in high-dimensional spaces without paying additional costs, by mapping data into a new space using nonlinear functions. He explains how this trick, together with a kernel function, allows us to consider a large or infinite number of basis functions, without having to compute them explicitly.

  • 00:10:00 In this section, the lecturer focuses on kernel methods, which aim to compute the dot product between pairs of points in a new space and to make the cost of computing those dot products much cheaper so that algorithms scale better. The dot products are therefore renamed kernel functions; if we can evaluate these kernels for every pair of points, we never need to compute the underlying feature space defined by Phi of x, which is the key to defining kernels that are fast to evaluate and require no computation with respect to Phi of x. Linear regression is used as an example: the lecturer shows that W is really a linear combination of the mapped data points (coefficients times Phi of x_n) and substitutes W with the expression Phi times a, where Phi is the matrix of all the points mapped into the new space.

  • 00:15:00 In this section, the speaker introduces the concept of kernel methods, which involves mapping data from one space into a new space using a mapping function. He shows how the optimization of a linear regression problem in the new space can be done using the coefficients (a) of a linear combination of the mapped points rather than the weight matrix (W). This leads to the use of a kernel function that computes the dot product of pairs of points in the new space, which is defined as the Gram matrix. The result is an alternative way of finding the solution to the regression problem by optimizing the coefficients using the kernel function.

  • 00:20:00 In this section, the lecturer discusses how to do predictions using the solution in the dual space, which results in a different complexity for computation than in the primal space. In the primal space, the complexity is dependent on the number of basis functions, but in the dual space, it depends on the amount of data, allowing for high-dimensional spaces without an increase in complexity. The key is to compute the kernel function without referring to the points in the new space, and there are various ways to define kernel functions that implicitly correspond to dot products. It is important to choose functions that correspond to Phi transpose Phi, as the Gram matrix must be positive semi-definite and have eigenvalues greater than or equal to zero. The lecturer provides an example of how to define a kernel directly and then figure out the corresponding mapping.
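
A minimal sketch of this dual regression, using a small regularization term so the linear system is well-posed (a detail not necessarily spelled out in the lecture); the kernel and data are made up.

```python
import numpy as np

def fit_dual_regression(K, y, lam=1e-3):
    """Solve for dual coefficients a = (K + lam*I)^-1 y, where K is the n x n Gram
    matrix; the cost depends on the number of data points, not on Phi's dimension."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)

def predict_dual(kernel, X_train, a, x_new):
    """Prediction needs only kernel values between x_new and the training points."""
    k_vec = np.array([kernel(x_n, x_new) for x_n in X_train])
    return k_vec @ a

# Linear kernel example, which recovers ordinary (regularized) linear regression.
kernel = lambda u, v: u @ v
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 1.9, 4.2, 5.8])
K = np.array([[kernel(u, v) for v in X] for u in X])
a = fit_dual_regression(K, y, lam=0.1)
print(predict_dual(kernel, X, a, np.array([1.5])))
```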

  • 00:25:00 In this section, the lecturer defines a kernel function as the dot product of two vectors in the original space squared. The question is raised whether this is a valid kernel function that can be computed without referring to Phi, the space transformation function. By expanding the function, the lecturer is able to define the mapping of Phi without explicitly computing it and arrives at a valid kernel function with basis functions. While typically kernel functions are computed by first defining Phi and then conducting a dot product, this method allows for direct computation of the kernel function in the original space.
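
A small numerical check of this expansion for two-dimensional inputs: computing (x · z)² directly matches the dot product of an explicit three-dimensional feature map. The particular feature map shown is one valid choice, written here for illustration.

```python
import numpy as np

def phi(x):
    """One explicit feature map whose dot product equals (x . z)^2 in 2-D."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k_quadratic(x, z):
    """Kernel evaluated directly in the original 2-D space."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(k_quadratic(x, z))    # (1*3 + 2*(-1))^2 = 1.0
print(phi(x) @ phi(z))      # same value, via the explicit 3-D mapping
```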

  • 00:30:00 In this section, the lecturer discusses methods for constructing kernels. The idea is to construct new kernels whose cost of evaluation does not depend on the dimensionality of the new space, while still controlling their complexity. The lecturer presents ten rules for composing kernels into new valid kernels, so that if a function is not obviously a valid kernel, basic building blocks can be composed to obtain more complex valid kernels. The lecture further introduces common kernels used in practice, such as the polynomial kernel, where the dot product in the original space is raised to some power M, resulting in a feature space of all degree-M products of the entries of X. The lecture will continue with the discussion of the Gaussian kernel in the next class.

  • 00:35:00 In this section, the lecturer explains that to achieve flexibility in regression or classification models without paying a computational price, high-dimensionality is required, which can be a problem. To avoid this issue, kernels are used, which specify a function that tells us the dot product between pairs of points in the new space. The polynomial kernel is then introduced as a common kernel, which takes the dot product in the original space raised to a power M. The lecturer provides a concrete example of the kernel in a 2D space and expands it to demonstrate the corresponding dot product in a 3D space.

  • 00:40:00 In this section, the lecturer explains kernel methods that are used to implicitly convert the input space into a higher-dimensional space where the classes can be linearly separable, even if they are not in the original space. The lecture explains how this method generalizes to an arbitrarily high power M, where it creates new features that are essentially all products of M of the original features. However, this leads to an exponentially large feature space, which would be computationally impossible for images. To work around this, a constant C can be added inside the kernel so that it considers all features of degrees up to M.

  • 00:45:00 In this section, the concepts of the polynomial kernel and the Gaussian kernel were explained. The polynomial kernel raises the dot product of two vectors to a power; in the degree-two example it measures similarity between two data points using all products of pairs of features. The Gaussian kernel, given by a formula that computes similarity between two data points, is a popular kernel used in machine learning. Kernels are essentially a shortcut to computing the dot product in a new space and can be interpreted as a measure of similarity between data points, which is useful in classification tasks.
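
A short sketch of both kernels as similarity measures; parameter values are arbitrary.

```python
import numpy as np

def poly_kernel(x, z, degree=2, c=1.0):
    """Polynomial kernel: (x . z + c)^M, covering all feature products up to degree M."""
    return (x @ z + c) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - z||^2 / (2 sigma^2)); close to 1 for nearly
    identical points and close to 0 for points that are far apart."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([1.5, 1.0])
print(poly_kernel(x, z))        # similarity via implicit degree-2 features
print(gaussian_kernel(x, z))    # similarity that decays with distance
```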

  • 00:50:00 In this section, the lecturer explains how the Gaussian kernel can be seen as a measure of similarity between two points, with a high value if the points are identical and a low value if they are far apart. However, proving that the Gaussian kernel is a valid kernel is challenging as the feature space is infinite. Instead, the lecturer uses the rules from the previous lecture to justify the validity of the kernel, specifically rule number four, which states that taking the exponential of a kernel results in another valid kernel, and further examines the other rules to express the Gaussian kernel as a combination of valid kernels.

  • 00:55:00 In this section of the video, the lecturer demonstrates the use of various rules to show that k(x, x′) = exp(−‖x − x′‖² / (2σ²)) is a valid kernel. The lecturer expands ‖x − x′‖² and separates the terms into different exponentials before using rules 1, 2, 4, and 8 to show that it is a valid kernel. The rules used include replacing the matrix A with the identity matrix and showing that xᵀx′/σ² and exp(xᵀx′/σ²) are valid kernels.

  • 01:00:00 In this section, the speaker explains that kernels are positive semi-definite functions, which can be decomposed into a matrix times its transpose. He also explains that using a polynomial kernel, for example, would require constructing all the monomials up to a certain degree, resulting in exponential dimensionality. However, by working directly with the kernel, all that is needed is to compute similarity between each pair of data points, making it more computationally efficient. The Gaussian kernel is also discussed, which has an infinite feature space, making it powerful in representing arbitrary functions. Additionally, the speaker notes that while kernels are defined with respect to vectors, they can also be applied to other types of objects, such as sets, strings, or graphs.

  • 01:05:00 In this section, the lecturer discusses the idea of mapping strings and documents using kernel methods. The technique involves defining a kernel that measures the similarity between two documents or strings as the weighted sum of all the non-contiguous substrings that appear in both documents. However, enumerating all these features can be time-consuming and resource-intensive, which is where non-vectorial kernels come into play. These kernels are useful when comparing documents that may contain new or invented words and can map every string or document into a new feature space corresponding to whether the string contains a particular substring.

  • 01:10:00 In this section, the speaker explains the concept of the substring kernel, which is used to determine the similarity between two words. The substring kernel weights each shared substring by a value lambda raised to a power representing the length of the match (including any gaps), so that with lambda less than one, compact, more important matches receive higher weight and long, gappy matches receive lower weight. The kernel can efficiently compute dot products in feature spaces that consist of substrings of various lengths present in the two words. To compute these kernels efficiently, the paper proposes gradually increasing the length of the substrings using dynamic programming. This allows the kernels to be computed in time roughly linear in the lengths of the strings, whereas naive enumeration would be exponential; a simplified sketch of the weighting idea follows.
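
As a simplified sketch of the weighting idea only: the toy kernel below counts contiguous shared substrings weighted by lambda raised to the match length. The kernel in the lecture (gap-weighted, computed with dynamic programming) is more involved; this is not that algorithm.

```python
def contiguous_substring_kernel(s, t, lam=0.5, max_len=3):
    """Toy kernel: count contiguous substrings shared by s and t, each pair of
    occurrences weighted by lam ** length (a simplification of the gap-weighted
    subsequence kernel discussed in the lecture)."""
    total = 0.0
    for length in range(1, max_len + 1):
        subs_s = [s[i:i + length] for i in range(len(s) - length + 1)]
        subs_t = [t[j:j + length] for j in range(len(t) - length + 1)]
        for u in set(subs_s):
            total += subs_s.count(u) * subs_t.count(u) * (lam ** length)
    return total

print(contiguous_substring_kernel("cat", "car"))   # shares "c", "a", and "ca"
print(contiguous_substring_kernel("cat", "dog"))   # nothing shared -> 0.0
```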

  • 01:15:00 In this section, the speaker discusses how support vector machines can be used with kernels in order to work in a much richer space. The speaker cites a paper that performs document classification using news articles from Reuters and shows results using this technique. The approach can be quite powerful and will be further discussed in the next class.
 

CS480/680 Lecture 13: Support vector machines



CS480/680 Lecture 13: Support vector machines

This lecture is an introduction to support vector machines (SVMs) as a type of kernel method used for classification. SVMs remain popular for problems with little data and are considered sparse because they can work with a subset of the data and ignore the rest. The speaker explains the concept of support vectors, which are the data points closest to the decision boundary, and gives a visual example of SVMs finding a linear separator that separates the classes while maximizing the margin. The differences between SVMs and perceptrons are discussed, with SVMs employing a unique max-margin linear separator and being less prone to overfitting. The optimization problem for SVMs can be rewritten using the Lagrangian, resulting in an equivalent problem without constraints. The solution obtained from the Lagrangian can be substituted back to obtain an expression that involves the kernel function, leading to a dual optimization problem. The benefits of working in the dual space with a kernel function that computes the similarity between pairs of data points are also explained. SVMs calculate the degree of similarity between a query point and all support vectors to determine the most similar ones, and the discussion also covers the number of support vectors and how it affects the classification of points.

This video discusses the concept of support vector machines (SVMs) in text categorization, where documents are represented as vectors of word counts. SVMs are effective in minimizing the worst-case loss, making the classifier suitable for any possible sample, even for different datasets. Researchers used SVMs with dual representation and kernel mapping to map data into an even higher dimensional space, without losing accuracy or sacrificing scalability. The lecture also covers the use of SVMs in retrieving relevant documents from a dataset and balancing precision and recall. The video concludes with a discussion on SVMs' ability to provide linear or nonlinear separators for data and the challenges associated with multi-class classification and non-linearly separable data.

  • 00:00:00 In this section, the speaker introduces support vector machines (SVMs), which are a type of kernel method used for classification. Historically, SVMs were the most important and popular classification technique in machine learning until neural networks took over after 2010. However, SVMs still perform well for problems with little data and are considered sparse as they can work with a subset of the data and ignore the rest. The speaker then provides a visual example of two classes of data and how SVMs find a linear separator to separate these classes while maximizing the margin, which is the smallest distance to the closest point in each class.

  • 00:05:00 In this section, the concept of support vectors in support vector machines (SVMs) is explained. Support vectors are the important data points that lie closest to the decision boundary, and they essentially determine where the linear separator will go. The final linear separator in SVMs, which maximizes the distance to the closest data points (otherwise called the margin), is obtained by solving an optimization problem. The intuition behind maximizing the margin is to ensure that data points, which could be noisy, won't get misclassified by the decision boundary.

  • 00:10:00 In this section, the concept of the maximum margin in support vector machines (SVMs) is explained as a way to achieve better classification. A maximum margin ensures that the classification is more robust to noise and can generalize better to future examples. The distance of a point to the separator is computed using the dot product between the weight vector and the feature vector for that data point, normalized by the norm of the weight vector; a sketch of this computation follows. The formula for computing the distance of any point to the separator is given, and the objective being optimized in SVMs is discussed. It is highlighted that the maximum-margin line is unique: any two distinct lines with equal margins cannot both be the maximum-margin line.
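
A tiny sketch of the distance computation just described; the weight vector, bias, and point are made-up values.

```python
import numpy as np

def distance_to_separator(w, b, x):
    """Signed distance of x to the hyperplane w . x + b = 0: the dot product
    (plus bias) normalized by the norm of the weight vector."""
    return (w @ x + b) / np.linalg.norm(w)

w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.0, 1.0])
print(distance_to_separator(w, b, x))   # positive on one side, negative on the other
```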

  • 00:15:00 In this section, the differences between support vector machines (SVMs) and perceptrons are discussed. Perceptrons find a linear separator, but that separator depends on the starting values used to initialize the weights. Perceptrons also use a simple update rule for training and rely on the sign of the label to check which side of the linear separator a data point falls on. In contrast, SVMs use a quadratic optimization problem to find the maximum-margin linear separator, which is less dependent on initialization. SVMs also introduce the concept of slack to allow for soft-margin classification and have a kernel trick for nonlinear classification. Overall, SVMs tend to have higher classification accuracy compared to perceptrons.

  • 00:20:00 In this section, the lecturer contrasts standard perceptrons with support vector machines (SVMs). While the former lacks robustness and can quickly overfit, the latter uses a unique max margin linear separator and is less prone to overfitting. SVMs are optimized through convex quadratic optimization to minimize weights, under the constraint that all data points are at least one unit of distance away from the linear separator. While this optimization may seem complex, it is actually quite easy to perform computationally with many optimization packages available.

  • 00:25:00 In this section, the speaker introduces a more convenient optimization problem for support vector machines, where the distance of the points to the separator is constrained to be at least one and the scale of W is minimized. The speaker demonstrates that this problem is equivalent to the previous optimization problem. This new formulation allows for a dual representation, where computations in the new feature space can be done in terms of dot products that can be replaced by a kernel function, similar to what was done with Gaussian processes.

  • 00:30:00 In this section, the speaker explains how the optimization problem for support vector machines can be rewritten using the Lagrangian, resulting in an equivalent problem without constraints. The new objective includes a penalty term for each constraint, weighted by a new variable a (a Lagrange multiplier) that is non-negative and becomes strictly positive when the constraint is violated. By choosing a to maximize the minimum of the Lagrangian, the new problem is mathematically equivalent to the original problem with constraints. This technique helps simplify the optimization process and make it more efficient. A sketch of the formulation follows.
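
A sketch of this construction in standard textbook notation (w for the weights, b for the bias, a_n for the multipliers), which may differ slightly from the lecture's slides:

```latex
% Hard-margin SVM primal:
\min_{w,\,b} \ \tfrac{1}{2}\|w\|^2
\quad \text{s.t.} \quad y_n\,(w^\top \phi(x_n) + b) \ge 1 \ \ \forall n

% Lagrangian, with one multiplier a_n >= 0 per constraint (a_n > 0 when the
% constraint is active or violated):
L(w, b, a) = \tfrac{1}{2}\|w\|^2 - \sum_n a_n \big[\, y_n (w^\top \phi(x_n) + b) - 1 \,\big]
```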

  • 00:35:00 In this section, the lecturer discusses the use of penalty terms and constraints in the optimization problem for support vector machines. They explain that the constraint requiring each point to be at least a unit distance from the separator can be replaced with a penalty term whose coefficient is optimized. This, however, yields a max-min problem that is not easy to solve directly. To address it, the lecturer shows how to compute the inner minimization problem in closed form, arriving at a solution where W is a linear combination of the data points in the new feature space. The coefficients that are different from zero identify the support vectors and determine the value of W.

  • 00:40:00 In this section, the lecturer explains how the solution obtained from the Lagrangian can be substituted back to obtain an expression that involves the kernel function. This kernel function allows us to work in a high dimensional space without worrying about dimensionality, as we can compute the kernel function directly between every pair of points. This leads to a dual problem optimization, where we optimize a different set of variables to obtain the coefficients. The majority of these coefficients will end up being zero, making the optimization problem sparse and reducing computational complexity. Once we have the coefficients, we can use them to classify data points by taking the dot product of the features and the coefficients, with a positive or negative result corresponding to different classes.

  • 00:45:00 In this section, the instructor explains the concept of support vector machines (SVMs) in the linearly separable case. They show that a linear separator in a two-dimensional space can be represented by a dot product of a normal vector and the input features, and the points on the linear separator correspond to the dot product being equal to 0. They then explain the benefits of working in the dual space, which involves replacing the weights with a kernel function that computes the similarity between pairs of data points. The resulting sum only depends on the number of support vectors and allows classification based on the sign of a linear combination of kernels between the query point and every point in the data set; a sketch of this prediction rule follows.
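
A minimal scikit-learn sketch (assuming scikit-learn is available) showing that the decision value really is a kernel-weighted sum over the support vectors; the data is made up, and `dual_coef_` in scikit-learn already stores the product of the coefficients and the labels.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Toy two-class data.
X = np.array([[0, 0], [1, 1], [1, 0], [3, 3], [4, 4], [3, 4]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

# Reproduce the decision value by hand: sum_n a_n y_n k(x_n, x) + b over support vectors.
x_query = np.array([[2.0, 2.0]])
K = rbf_kernel(clf.support_vectors_, x_query, gamma=0.5)   # similarity to each support vector
manual = (clf.dual_coef_ @ K).ravel() + clf.intercept_
print(manual, clf.decision_function(x_query))              # should match
print(np.sign(manual))                                     # the sign gives the class
```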

  • 00:50:00 In this section, the lecturer explains that the SVM algorithm calculates the degree of similarity between a query point and all support vectors in order to determine the most similar ones. The class of these most similar support vectors will then "vote" for the predicted class of the query point. This is similar to a weighted nearest-neighbor approach, with the weights dictated by the kernel function. However, the number of support vectors may not necessarily be equal for each class, and may vary based on the dimensionality of the space.

  • 00:55:00 In this section, the discussion revolves around the number of support vectors and how it affects the classification of points. Despite having more support vectors in one class, the number of support vectors does not influence the propensity to classify points in that class. The reason for this is that every support vector contributes to the sum whether positive or negative, indicating whether a point belongs to the same class as the support vector. Additionally, support vector machines are known to generalize well and are less prone to overfitting, as maximizing the margin is equivalent to minimizing an upper bound on the worst-case loss for any underlying input distribution.

  • 01:00:00 In this section of the lecture, the speaker explains the concept of support vector machines and how they work in text categorization. Support vector machines are effective at minimizing the worst-case loss and ensuring the classifier is good with respect to any possible sample that could correspond to different datasets. The lecture provides a case study of text categorization where classifiers are trained with an archive of news articles that have already been classified. A popular approach was to transform every document into a vector of word counts using the vector space model, where the ordering of words is ignored and a vector is created that is the length of the dictionary. This approach helped automate the categorization of articles and improve scalability.
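
A minimal sketch of the bag-of-words pipeline described above, using scikit-learn on a tiny invented corpus; the Reuters experiments from the lecture are not reproduced here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus standing in for the news-article archive.
docs = [
    "stocks rally as markets climb",
    "central bank raises interest rates",
    "team wins championship final match",
    "player scores twice in cup game",
]
labels = ["finance", "finance", "sports", "sports"]

# Vector space model: each document becomes a vector of word counts (word order
# ignored), and a linear SVM is trained on those vectors.
model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(docs, labels)
print(model.predict(["bank cuts rates amid market rally"]))
```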

  • 01:05:00 In this section of the lecture, the professor explains how documents can be represented as high-dimensional vectors, with each feature corresponding to a word in the document's dictionary. While it is natural to try to reduce the dimensionality of these vectors through feature extraction, this can lead to a loss of information since most words carry some level of relevance. To address this issue, researchers used support vector machines with a dual representation and kernel mapping to map the data to an even higher dimensional space. This approach scales well with the number of dimensions, making it a useful tool for analyzing high-dimensional data.

  • 01:10:00 In this section, the speaker discusses the use of support vector machines in retrieving documents from a dataset. Precision estimates the percentage of retrieved documents that are relevant, while recall estimates the percentage of the relevant documents in the dataset that are retrieved. The goal is to balance precision and recall, and support vector machines were found to be the best algorithm for this purpose. They were able to keep all the features and map them into a higher-dimensional space without losing accuracy or sacrificing scalability. The number of support vectors required in a higher-dimensional space may increase, but there is no additional cost associated with working in that space.
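
A small sketch of the two measures on hypothetical retrieval results (the document ids are made up):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = [1, 2, 3, 4, 5, 6, 7, 8]               # 8 documents returned
relevant = [1, 2, 3, 4, 5, 6, 11, 12, 13, 14]      # 10 relevant documents exist
print(precision_recall(retrieved, relevant))        # (0.75, 0.6)
```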

  • 01:15:00 In this section, we learn about support vector machines (SVMs) and how they can give us a linear or nonlinear separator for our data. SVMs use a unique hyperplane to maximize a margin for good generalization, and we can use convex quadratic optimization to ensure global optimality. However, there are two important questions to address: can we do multi-class classification, and what do we do if our data is not linearly separable? The next set of slides will address these issues.
 

CS480/680 Lecture 14: Support vector machines (continued)



CS480/680 Lecture 14: Support vector machines (continued)

This section of the lecture is focused on handling non-linearly separable data and overlapping classes when using support vector machines (SVMs) by introducing slack variables and considering a soft margin. The speaker explains how slack variables allow points within the margin to be classified without introducing a classification error. A penalty term is added to the optimization problem to regulate the use of slack variables, controlled by the weight C, which adjusts the trade-off between error minimization and model complexity. The speaker also discusses different approaches to using SVMs for multi-class classification problems, including one-against-all, pairwise comparison, and continuous ranking, with the latter being the de-facto approach for SVMs with multiple classes. Additionally, the concept of multi-class margin is introduced, which involves a buffer around the linear separator, defined by the difference of weight vectors for each pair of classes.

  • 00:00:00 In this section, the lecturer discusses how to deal with non-linearly separable data and overlapping classes when using support vector machines (SVMs). The solution is to introduce slack variables and consider what is known as a soft margin, which relaxes the assumption that all points should be at least one unit away from the separator. Slack variables allow the margin to be less than one, so that even the points within the margin can be classified without introducing a classification error.

  • 00:05:00 In this section, the concept of soft margin is introduced as a way of allowing for misclassified points and points within the margin by introducing slack variables. A penalty term is also added to the optimization problem to regulate the use of slack variables and ensure that the slack variable penalty is minimized. This is controlled by the weight C, which also controls the trade-off between error minimization and model complexity. The sum of the slack variables is generally an upper bound on the number of misclassifications. The weight C can be thought of as a regularization coefficient that adjusts the trade-off between error minimization and model complexity, and when C goes to infinity, the original hard margin classifier is recovered.
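
A minimal scikit-learn sketch of this trade-off on made-up overlapping data: small C tolerates more slack (typically more support vectors), while large C approaches the hard-margin behaviour.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs, so some slack is unavoidable.
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, len(clf.support_))   # number of support vectors at each setting
```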

  • 00:10:00 In this section, the speaker continues to discuss support vector machines and how to handle misclassifications and outliers. Soft margins can handle minor misclassifications but are still sensitive to outliers. Support vectors correspond to active constraints, i.e., constraints satisfied with equality; points whose distance is already greater than one correspond to inactive constraints, and their slack variables are zero. The speaker also touches on how to extend support vector machines to work with multiple classes, where historically three approaches have been considered, one of them being “one against all,” where each support vector machine distinguishes between a class and all other classes.

  • 00:15:00 In this section of the lecture, the speaker explains different approaches to using support vector machines to classify data with multiple classes. The first approach, one-against-all, involves training a support vector machine for each class versus the rest, but it can lead to conflicts in classification. The second approach, pairwise comparison, requires training support vector machines for every pair of classes, which can be computationally expensive. The third approach, continuous ranking, trains a single support vector machine to return a continuous value for ranking classes based on those values. The speaker illustrates these approaches using examples and concludes that pairwise comparison is not ideal due to its computational cost, leaving one-against-all as the least favorable and continuous ranking as the de-facto approach for using support vector machines with multiple classes.

  • 00:20:00 In this section, the lecturer discusses different approaches to using support vector machines for multi-class classification problems. They explain how using multiple linear separators to distinguish between different classes leads to ambiguous points, and describe an alternative approach, continuous ranking. The idea behind this approach is to use separate weight vectors for each class and compare the magnitude of the dot products of the input data with each class's weight vector, selecting the class with the largest dot product. This approach generalizes the margin concept to compare the dot products of different classes and ensures that the correct class has a dot product that is greater than all incorrect classes by at least one.

  • 00:25:00 In this section of the lecture, the presenter explains the concept of multi-class margin in Support Vector Machines (SVMs). The multi-class margin corresponds to having a buffer around the linear separator, which is defined by the difference of weight vectors for each pair of classes. The optimization problem remains the same as that of binary SVMs, with only the constraints being replaced. With overlapping classes and multiple classes, slack variables and a penalty term can be introduced to handle multi-class classification with a soft margin. The multi-class SVM is now a general formulation that works with multiple classes and overlapping classes.
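
A sketch of this multi-class soft-margin formulation in standard notation (one weight vector w_k per class), following the usual textbook presentation rather than the lecture's exact slides:

```latex
% Multi-class soft-margin SVM:
\min_{\{w_k\},\,\{\xi_n\}} \ \tfrac{1}{2}\sum_k \|w_k\|^2 + C \sum_n \xi_n
\quad \text{s.t.} \quad
w_{y_n}^\top \phi(x_n) \ \ge\ w_k^\top \phi(x_n) + 1 - \xi_n
\quad \forall n,\ \forall k \ne y_n, \qquad \xi_n \ge 0 \ \ \forall n
```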