Machine Learning and Neural Networks

 

Lecture 24. Linear Programming and Two-Person Games



24. Linear Programming and Two-Person Games

This YouTube video covers the topic of linear programming and two-person games. Linear programming is the process of optimizing a linear cost function subject to a set of linear constraints, and it is used in fields such as economics and engineering. The video explains the algorithms used in linear programming, including the simplex method and interior point methods, and the concept of duality, where the primal problem and its dual problem are closely connected and can both be solved using the simplex method. The video also works through the max flow–min cut example, where duality gives an upper bound on the maximum flow in a network, and shows how linear programming can be applied to two-person games by solving a game given its payoff matrix. Finally, the video briefly discusses the limitations of applying these techniques to games with three or more players and mentions that the next lecture will cover stochastic gradient descent.

  • 00:00:00 In this section, the lecturer introduces the topic of linear programming as a part of optimization and explains what it is and how it works. He defines linear programming as the process of optimizing a linear cost function subject to a set of linear constraints. He notes that the cost vector and the constraint equations are both linear. However, once inequality constraints are involved, the problem is no longer a simple linear-algebra problem, because the inequalities add extra structure. Despite this, linear programming is an important part of optimization and is often used in fields like economics and engineering.

  • 00:05:00 In this section, the speaker discusses linear programming and two-person games. They explain the concept of the feasible set X, which is the constraint set expressed in linear algebra language, and draw a visualization of it. They use an example of minimizing a linear cost over the triangle where a plane intersects the positive octant, and explain why the winner is one of the three corners of that triangle (or a whole edge, when the cost takes equal values along two corners). In the example given, the corner (3, 0, 0) is the winner; a sketch of such a problem follows below.
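
A minimal sketch (with a hypothetical cost vector, not the exact numbers from the lecture) of this kind of problem in Python: minimize a linear cost over the triangle where the plane x1 + x2 + x3 = 3 meets the positive octant. The optimum lands on a corner.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical linear cost; the plane x1 + x2 + x3 = 3 cut by x >= 0 is a triangle.
c = np.array([3.0, 5.0, 7.0])
A_eq = np.array([[1.0, 1.0, 1.0]])
b_eq = np.array([3.0])

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 3)
print(res.x)     # the winning corner (3, 0, 0), since the first cost coefficient is cheapest
print(res.fun)   # minimum cost 9
```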

  • 00:10:00 In this section, the video explains the two main algorithms used in linear programming: the simplex method and interior point methods. The simplex method travels along the edges of the feasible set to reach the optimal corner, while interior point methods travel inside the feasible set, where derivatives can be used to minimize. The interior-point idea is the newer one, and although the exact algorithm proposed by Karmarkar has not survived, the idea is still being used and researched today. The two approaches are still locked in competition with each other.

  • 00:15:00 In this section, the speaker places linear programming alongside its relatives, such as nonlinear programming, quadratic programming, semi-definite programming, and interior point methods. The speaker introduces the idea of duality, where a dual problem is created from the linear programming problem: the dual is a maximization problem with a linear cost and linear inequality constraints. The speaker then explains that the primal problem and its dual are closely connected and can both be solved with the simplex method. The key statement of weak duality is that any feasible value of the dual maximization problem is less than or equal to any feasible value of the primal minimization problem. Finally, the speaker gives a one-line proof of the inequality b^T y <= c^T x, sketched below.
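
Under the usual conventions (primal: minimize c^T x subject to Ax = b, x >= 0; dual: maximize b^T y subject to A^T y <= c), the one-line weak duality argument referred to here is:

\[
b^{T} y = (Ax)^{T} y = x^{T}\,(A^{T} y) \;\le\; x^{T} c = c^{T} x,
\]

using Ax = b in the first step and x >= 0 together with A^T y <= c in the inequality.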

  • 00:20:00 In this section, the speaker discusses the importance of X greater than or equal to zero in linear programming and its role in achieving weak duality. The fact that X is greater than or equal to zero ensures that the desired inequalities are satisfied and that the result obtained from the system is correct. The speaker mentions the concept of duality and how it relates to linear programming and two-person games, emphasizing the importance of paying attention to the algorithm in both cases. The speaker also provides an example of max flow equal min cut to demonstrate the concepts discussed.

  • 00:25:00 In this section, the speaker discusses linear programming and two-person games in the context of maximizing the flow through a network with constraints on edge capacities. They explain that the goal is to maximize the flow arriving at the sink while keeping the flow on each edge within that edge's capacity. The problem can be posed as an integer program but can safely be relaxed to allow non-integer flow variables. The speaker works through examples of how to solve this problem and discusses how the choice of edge capacities shapes the answer. A small numerical sketch follows.
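
A minimal sketch, using networkx on a small made-up network (not the lecture's example), of computing a maximum flow subject to edge capacities and checking it against the minimum cut:

```python
import networkx as nx

# Hypothetical directed network with capacities; 's' is the source, 't' is the sink.
G = nx.DiGraph()
G.add_edge("s", "a", capacity=7)
G.add_edge("s", "b", capacity=9)
G.add_edge("a", "b", capacity=3)
G.add_edge("a", "t", capacity=5)
G.add_edge("b", "t", capacity=8)

flow_value, flow_dict = nx.maximum_flow(G, "s", "t")
cut_value, (source_side, sink_side) = nx.minimum_cut(G, "s", "t")
print(flow_value, cut_value)   # equal, by max flow = min cut
```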

  • 00:30:00 In this section, the lecturer discusses linear programming and two-person games. Specifically, he explores finding an upper bound on the maximum flow in a network by looking at a cut that separates the nodes into a source side and a sink side; the total capacity of the edges crossing the cut bounds any flow. The maximum flow for his example is 14, which matches the minimum cut. The concept of duality is introduced as the general way to find such an upper bound when optimizing.

  • 00:35:00 In this section, the speaker continues with linear programming and two-person games. The max flow problem on a big network can be solved quickly with linear programming, even though the minimum cut may not be visible by inspection when there are thousands of edges. The simplex method almost always behaves like its average case, which is polynomial in time. The speaker restates duality for this problem: any flow cannot exceed the capacity of any cut. Finally, the speaker introduces two-person games and the payoff matrix used to make decisions, with one player minimizing and the other maximizing. The game is a zero-sum game, meaning everything X pays goes to Y.

  • 00:40:00 In this section, the video discusses linear programming and two-person games using an example where X wants the payoff to be small and Y wants it to be big. The first example is a simple game with a saddle point: Y chooses column two every time and X chooses row one every time. When the example is changed, there is no saddle point, so X has to adopt a mixed strategy; Y also adopts a mixed strategy, which X eventually anticipates, leading to a competition over the best mixing probability between 0 and 1.

  • 00:45:00 In this section, the speaker discusses the process of solving a two-person game using linear programming and provides an example of solving a game given its payoff matrix. The optimal strategy for Y comes out to be 2/3 on the first column and 1/3 on the second column. Given this optimal Y strategy, the best mixed strategy for X is then determined. The speaker explains that there can be other columns or rows that do not enter the optimal mixed strategies at all. A sketch of solving such a game as a linear program appears below.
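
As an illustration (with a hypothetical 2x2 payoff matrix, not the lecture's numbers), here is a sketch of computing the maximizing player's optimal mixed strategy by linear programming with scipy:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical payoff matrix: entry A[i, j] is what X (row player, minimizer)
# pays Y (column player, maximizer) when X plays row i and Y plays column j.
A = np.array([[1.0, 4.0],
              [3.0, 2.0]])
m, n = A.shape

# Y's problem: maximize v subject to (A @ y)_i >= v for every row i, sum(y) = 1, y >= 0.
# Variables are [y_1, ..., y_n, v]; linprog minimizes, so the objective is -v.
c = np.concatenate([np.zeros(n), [-1.0]])
A_ub = np.hstack([-A, np.ones((m, 1))])                     # v - (A y)_i <= 0
b_ub = np.zeros(m)
A_eq = np.concatenate([np.ones(n), [0.0]]).reshape(1, -1)   # probabilities sum to 1
b_eq = np.array([1.0])
bounds = [(0, None)] * n + [(None, None)]                   # y >= 0, v free

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
y_opt, value = res.x[:n], res.x[-1]
print("Y's mixed strategy:", y_opt, " game value:", value)
```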

  • 00:50:00 In this section, the speaker discusses the connections between linear programming and two-person games. He notes the importance of the duality theorem and how it relates to solving optimization problems, as well as the limitations of extending these techniques to games with three or more players. He also briefly recounts the story of John Nash and his contributions to the field, including his illness, his later recovery, and his tragic death. Finally, the speaker mentions that the next lecture will cover stochastic gradient descent.
24. Linear Programming and Two-Person Games
  • 2019.05.16
  • www.youtube.com
MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Gilbert Strang. View the complete course: https://o...
 

Lecture 25. Stochastic Gradient Descent



25. Stochastic Gradient Descent

In this video, the concept of stochastic gradient descent (SGD) is introduced as an optimization method for solving large-scale machine learning problems often posed in the form of a finite sum problem. The speaker explains how SGD selects random data points to compute the gradient to speed up the computation and how it behaves differently from batch gradient descent as it approaches the optimum due to the fluctuating nature of the method. The key property of SGD is that the stochastic gradient estimate is an unbiased version of the true gradient in expectation and the variance of the stochastic gradient must be controlled to reduce the noise. The use of mini-batches is discussed as a means of cheap parallelism in deep learning GPU training, but selecting the right mini-batch size is still an open question that can impact the robustness of the solution in the presence of unseen data. Challenges in optimizing SGD include determining the mini-batch size and computing stochastic gradients, but researchers are trying to understand the efficacy of SGD in neural networks through developing a theory of generalization.

  • 00:00:00 In this section, the speaker introduces stochastic gradient descent as an old optimization method that is still used to train large-scale machine learning systems. They explain that solving optimization problems is crucial in data science, and these problems are often very large. The speaker shows an implementation of gradient descent in MATLAB and notes that only one line needs to be changed to turn it into the method that drives the deep learning toolboxes and large-scale machine learning. The speaker then describes the optimization problems in machine learning, which involve finding an x that minimizes a cost function written as a sum over the training data. These are called finite-sum problems, and they are typically solved using stochastic optimization methods.

  • 00:05:00 In this section, the speaker discusses large-scale machine learning, which means that both the number of training data points (n) and the dimensionality of the vectors (d) can be large. Large n may reach millions or billions, and large d could mean up to a billion features. This drives a lot of research in optimization methods for large-scale machine learning, including the search for sub-linear time algorithms, data structures, and hashing tricks to handle such big data. The speaker gives as examples the most classic question in linear algebra, the least squares regression problem, and another widely used method, the Lasso, both of which are written as a loss over the training data in finite-sum form. Finally, the speaker notes that deep neural networks are yet another example of this finite-sum problem with n training data points. The finite-sum form is written out below.
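
For reference, the finite-sum form mentioned here, with data pairs (a_i, b_i) and a penalty weight \lambda for the Lasso, is

\[
\min_{x}\; f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x),
\qquad
\text{least squares: } f_i(x) = \bigl(a_i^{T}x - b_i\bigr)^2,
\qquad
\text{Lasso: } \frac{1}{n}\sum_{i=1}^{n}\bigl(a_i^{T}x - b_i\bigr)^2 + \lambda \lVert x\rVert_1 .
\]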

  • 00:10:00 In this section, the speaker discusses how optimization procedures are necessary to solve the finite-sum problems that arise in machine learning and statistics. Most problems in this field can be expressed as a finite-sum problem, and specialized optimization procedures are needed to solve them. The speaker introduces the gradient descent method but notes that computing even a single full gradient over a large data set can take hours or days, which is a major drawback. The speaker asks the audience for suggestions to counter this drawback, and the ideas raised include using stochastic gradient descent and sampling a subset of the full data set.

  • 00:15:00 In this section, the speaker discusses the concept of stochastic gradient descent, which involves randomly selecting a data point (or a few) at each iteration and computing the gradient of that single point, making each step much cheaper. However, the key question is whether this idea makes mathematical sense. Stochastic gradient descent was first proposed by Robbins and Monro in 1951, and it is compared to the gradient descent method. The speaker notes that stochastic gradient descent is more sensitive to step sizes and shows a simulation on a toy problem to illustrate how the iterates fluctuate. The method still makes progress towards the optimum, despite the fluctuations.

  • 00:20:00 In this section, the speaker discusses the mechanics of stochastic gradient descent (SGD), which takes a step of size alpha along the negative gradient of a randomly chosen data point. The process is very sensitive to the step size parameter, more so than gradient descent. As he varies the parameter, the speaker observes the progress towards the solution and explains the typical behaviour of SGD: people love it because it makes rapid progress in the beginning on large datasets, giving quick and dirty progress while helping avoid overfitting. However, close to the solution it fluctuates more, and finding the exact optimum is difficult because of this near-chaotic behaviour. A minimal sketch of the update appears below.
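
A minimal sketch in numpy of mini-batch SGD on a least squares finite-sum problem (hypothetical data; batch_size = 1 gives the single-sample version described here):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.01 * rng.standard_normal(n)

def sgd_least_squares(A, b, alpha=0.01, batch_size=1, steps=5000):
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch_size)      # pick random sample(s), with replacement
        Ai, bi = A[idx], b[idx]
        grad = 2 * Ai.T @ (Ai @ x - bi) / batch_size   # stochastic gradient of the mini-batch
        x -= alpha * grad                              # step of size alpha
    return x

x_hat = sgd_least_squares(A, b)
print(np.linalg.norm(x_hat - x_true))   # small, but the iterates fluctuate near the optimum
```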

  • 00:25:00 In this section, the speaker examines how stochastic gradient methods behave on a simple one-dimensional problem where the objective is a sum of quadratic functions. The goal is to minimize this sum, and the speaker shows how the gradients of the individual components are used. The method works well in the beginning, but once it gets close to the optimum anything can happen and the iterates become confused. The speaker also writes down the closed-form solution and notes that the true minimizer lies between the smallest and the largest of the individual components' minimizers.

  • 00:30:00 In this section, the speaker explains the behavior of stochastic gradient descent (SGD) when the scalar X is outside the region of confusion, which means that the point is very far from where the solution is. In this far away regime, a stochastic gradient of some component has exactly the same sign as the full gradient, which is the direction to walk in to decrease the loss function. The speaker uses this to explain why SGD can make solid progress far away and how it can provide an awesome initial speed, allowing for millions of stochastic steps in the time it would take to do a single iteration of batch gradient descent. Once inside the region of confusion, stochastic gradient descent becomes less effective in optimization, but in machine learning, the fluctuations can make the method more robust and better for generalization. The speaker notes that this is an idea that is prevalent in machine learning, theoretical computer science, and statistics, where randomization is used to speed up the computation of expensive quantities.

  • 00:35:00 In this section, the speaker discusses the key property of stochastic gradient descent (SGD). The main idea behind SGD is to use a randomly chosen, cheaply computed gradient estimate to save on computation. The key property is that, in expectation, the stochastic gradient estimate is an unbiased version of the true gradient. Beyond this unbiasedness, the amount of noise, that is, the amount of stochasticity, has to be controlled so that the variance of the stochastic gradient stays small. The smaller the variance, the better the stochastic gradient serves as a replacement for the true gradient, and the faster the convergence. These two conditions are written out below.
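
Written out under the usual finite-sum conventions, with the index i drawn uniformly at random, the two conditions are

\[
\mathbb{E}_i\bigl[\nabla f_i(x)\bigr] = \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(x) = \nabla f(x)
\quad\text{(unbiasedness)},
\qquad
\mathbb{E}_i\bigl[\lVert \nabla f_i(x) - \nabla f(x)\rVert^{2}\bigr] \ \text{kept small (variance control)}.
\]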

  • 00:40:00 In this section, the speaker discusses the stochastic gradient descent method and its behavior on both convex and non-convex problems. The speaker also mentions variants of the method, with and without constraints, and the choice of picking a random training data point with or without replacement. The speaker explains that although the method has been around since 1951 and is widely used in the deep learning toolkits, there are still gaps between theory and practice: the toolkits use the without-replacement version, while the version we know how to analyze is the uniformly random with-replacement version, and closing this gap is a major open problem in the field of stochastic gradient methods. The speaker also mentions the mini-batch idea, which uses a batch of points to reduce the variance, resulting in less noise.

  • 00:45:00 In this section of the video, the speaker discusses the concept of mini-batches and how they are exploited by people to give a cheap version of parallelism in deep learning GPU style training. The larger the mini-batch, the more things can be done in parallel. However, there is also a conundrum in that using very large mini-batches implies that the stochastic gradient starts looking more like batch gradient descent which decreases noise to a point where the region of confusion shrinks too much. This is harmful to machine learning since it may cause overfitting of the neural network, making it difficult to predict unseen data. Therefore, selecting the right mini-batch size is still an open question in the optimisation process of the deep neural networks.

  • 00:50:00 In this section, the speaker discusses the challenges associated with optimizing stochastic gradient descent (SGD), including determining which mini-batch to use and how to compute stochastic gradients. The back propagation algorithm is introduced as a popular method of computing a single stochastic gradient, and machine learning toolkits may have different ways of automating the computation of a gradient. Theoretical challenges in proving SGD's efficacy are discussed, including the question of why SGD works so well for neural networks despite its supposed suboptimal qualities. Researchers are currently trying to understand this mystery through developing a theory of generalization.
25. Stochastic Gradient Descent
  • 2019.05.16
  • www.youtube.com
MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Suvrit Sra. View the complete course: https://ocw.m...
 

Lecture 26. Structure of Neural Nets for Deep Learning



26. Structure of Neural Nets for Deep Learning

This video discusses the structure of neural networks for deep learning. The goal is to classify data in a binary way by constructing a neural network with feature vectors that have m features, creating a learning function that can classify data as one of two categories. Non-linearity is essential in creating these functions, as linear classifiers are unable to separate non-linear data. The video also discusses the importance of the number of weights and layers in the neural net, and provides resources such as the TensorFlow playground for users to practice creating functions. Finally, the video discusses the recursion used to prove the formula for the number of flat pieces obtained by cutting a cake and how it relates to the optimization problem of minimizing total loss in deep learning.

  • 00:00:00 In this section, the professor introduces the central structure of deep neural nets, which is the construction of the learning function f that learns from the training data and can then be applied to the test data. The goal is to classify data in a binary way by constructing a neural network whose inputs are feature vectors with m features. The network creates a learning function that can classify data as one of two categories, such as boy or girl, cat or dog, or truck or car. The professor also mentions that this material has been available for months on the math.mit.edu/learningfromdata site for the course textbook and will be added to the Stellar platform.

  • 00:05:00 In this section, the lecturer explains how to create a function f of X that returns the correct answer for two-class classification. The lecturer states that the function should be negative for classification minus one and positive for classification plus one. However, the lecturer acknowledges that we don't need to get every sample right, as overfitting may occur, and the rule we discover should cover almost all cases but not every "weird" case. The lecturer then recommends visiting the site playground.tensorflow.org, where a simple model problem can help individuals learn about deep learning. The playground offers four examples, and one of them involves finding a function that is positive on some points and negative on other points.

  • 00:10:00 In this section, the speaker discusses the importance of non-linearity in neural networks, pointing out that if linear classifiers like support vector machines were used, it wouldn't be possible to create some non-linear function that could separate the data. He then shows an example of a 2D classification problem with a spiral where the system is trying to find a function that's positive on one spiral and negative on the other spiral, and this takes quite a bit of time, many epochs. The speaker also explains what an epoch is and mentions the difference between mini-batches with replacement and mini-batches without replacement in stochastic gradient descent.

  • 00:15:00 In this section, the speaker discusses a website called TensorFlow's playground, which allows users to create a function f of X using a nonlinear function. The website plots the zero set for the function, which separates the positive and negative sets, with zero being in between. The website allows users to decide the number of layers and neurons in each layer, as these are essential in finding a function f that learns the data. The speaker also notes the importance of linear functions in this process and requests recommendations for good convolutional neural net websites to practice on. The function f has the form of a vector of X with five components, layer one with six neurons, and an output layer of one number.

  • 00:20:00 In this section, the speaker discusses the structure of neural nets for deep learning. They begin by explaining the basic structure of a neural net, which involves a matrix of weights used to compute the output Y. The process becomes richer when multiple layers are added for deep learning. Each layer is supposed to learn more about the data, with the first layer learning basic facts and each subsequent layer learning more detail. Finally, the speaker explains that each layer applies an affine map and then a nonlinear function on each component to produce that layer's output.

  • 00:25:00 In this section, the speaker discusses the structure of neural nets in deep learning. They explain that a neural net is a learning function that depends on the weights and the inputs, created through a chain, or composition, of functions, each consisting of a linear or affine map followed by a non-linear function. The result is a complicated function that is continuous and piecewise linear. The speaker notes that such a function is built from matrices and vectors and depends on the number of weights in the model; a minimal sketch of this composition is given below.
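
A minimal numpy sketch of this chain of affine maps and ReLU nonlinearities, with hypothetical layer sizes and random weights, just to show the structure of f rather than a trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical sizes: input v0 has 3 features, hidden layers of 6 and 4 neurons, one output.
sizes = [3, 6, 4, 1]
A = [rng.standard_normal((sizes[k + 1], sizes[k])) for k in range(len(sizes) - 1)]
b = [rng.standard_normal(sizes[k + 1]) for k in range(len(sizes) - 1)]

def f(v0):
    """Composition of affine maps A_k v + b_k followed by ReLU; the last layer stays linear."""
    v = v0
    for k in range(len(A) - 1):
        v = relu(A[k] @ v + b[k])
    return A[-1] @ v + b[-1]

print(f(np.array([1.0, -2.0, 0.5])))   # continuous, piecewise linear in the input
```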

  • 00:30:00 In this section, the speaker talks about the structure of neural nets for deep learning, specifically the idea of a "chain" of linear (affine) maps followed by ReLU functions. They discuss the question of which functions can be obtained in this way, and conclude that only continuous piecewise linear functions are possible. The speaker also uses the idea of origami to help visualize the graph of a piecewise linear function of two variables, which consists of flat pieces joined along straight edges (folds). The question of counting the number of flat pieces is raised as an aid to visualization.

  • 00:35:00 In this section, the speaker discusses the problem of how many flat pieces can be obtained by folding a plane with n folds. This problem is essential in understanding the freedom of the function, f, and whether it can approximate any continuous function by taking enough folds. The speaker notes that the answer is yes, and this class of functions is universal. Additionally, the section touches on the importance of understanding this concept in the broader field of computer science, particularly in neural networks.

  • 00:40:00 In this section, the speaker works on a counting problem: how many flat pieces are produced when a plane is folded along more and more lines. They ask how many pieces appear after each additional fold and look for a recursion that solves the problem. The speaker presents the numbers found so far and explains that the goal is a formula for the number of flat pieces produced by N folds in m dimensions, after which the recursion will be filled in.

  • 00:45:00 In this section, the speaker uses a visual example to explain the formula for the number of pieces created by making cuts in higher-dimensional spaces. Using binomial coefficients, the formula applies to any m and N. The speaker works the example N = 3 and m = 2 to show how to use it. The formula for r(N, m), the number of flat pieces produced by N folds in m dimensions, is the sum of the binomial coefficients C(N, 0) through C(N, m), as written out below.
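
Written out, this is the standard count of the regions cut out by N hyperplanes in general position in m dimensions, with the N = 3, m = 2 example from this section:

\[
r(N, m) = \binom{N}{0} + \binom{N}{1} + \cdots + \binom{N}{m},
\qquad
r(3, 2) = \binom{3}{0} + \binom{3}{1} + \binom{3}{2} = 1 + 3 + 3 = 7 .
\]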

  • 00:50:00 In this section, the speaker discusses the recursion used to prove the formula for the flat pieces, using the analogy of cutting a cake. The count after a new cut is the previous count of flat pieces plus the number of pieces that the new cut is itself divided into, which is a lower-dimensional count. The recursion is proved in section 7.1, following the paper by Kleinberg and others. The next step after establishing this family of functions is to choose the A's and b's, the weights, which leads to the problem of minimizing the total loss; that problem is solved using gradient descent and backpropagation.
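
The recursion in question, together with the one-line check that the binomial-sum formula satisfies it (Pascal's rule combines the two sums):

\[
r(N, m) = r(N-1, m) + r(N-1, m-1),
\qquad
\sum_{i=0}^{m}\binom{N-1}{i} + \sum_{i=0}^{m-1}\binom{N-1}{i}
= \sum_{i=0}^{m}\left[\binom{N-1}{i} + \binom{N-1}{i-1}\right]
= \sum_{i=0}^{m}\binom{N}{i}.
\]
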
26. Structure of Neural Nets for Deep Learning
  • 2019.05.16
  • www.youtube.com
MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Gilbert Strang. View the complete course: https://o...
 

Lecture 27. Backpropagation: Find Partial Derivatives



27. Backpropagation: Find Partial Derivatives

This video covers several topics related to backpropagation and finding partial derivatives. The speaker demonstrates the use of the chain rule for partial derivatives and emphasizes the importance of the order of calculations in matrix multiplication. Backpropagation is highlighted as an efficient algorithm for computing gradients, and various examples are given to demonstrate its effectiveness. The convergence of stochastic gradient descent is briefly discussed, along with a project idea related to the use of a random order of loss function samples in stochastic gradient descent. Overall, the video provides a comprehensive overview of backpropagation and its applications.

  • 00:00:00 In this section, the speaker discusses two topics of interest. Firstly, the convergence of stochastic gradient descent is discussed, with the focus being more on the logic and assumptions of the algorithm than on the proof itself. Secondly, the speaker suggests a project idea related to the use of a random order of loss function samples in stochastic gradient descent. Specifically, the project would involve computing the averages of a list of 100 random numbers using both replacement and without replacement methods to determine the difference in approach.

  • 00:05:00 In this section, the speaker discusses backpropagation as the way to compute the gradient in steepest descent algorithms. Backpropagation is the key calculation that made neural nets popular, and it computes gradients and derivatives quickly using automatic differentiation in reverse mode. The speaker also suggests project examples: exploring the convergence of the running average with and without replacement, and the typical good start and bad finish of stochastic gradient descent, for which the magic words are early stopping.

  • 00:10:00 In this section, the speaker discusses backpropagation and its use for finding partial derivatives. Backpropagation had been previously studied under the name automatic differentiation, and the speaker credits the leader in developing deep neural nets for realizing its effectiveness. The speaker provides a simple example of a function to illustrate the computation of f(x) and the derivatives, and emphasizes the use of the chain rule for finding partial derivatives. The section also mentions a blog by Christopher Olah, who provides clear explanations on the topic.

  • 00:15:00 In this section, the presenter discusses the computation of partial derivatives using the chain rule. They use a two-variable function example to demonstrate how to compute the function's partial derivatives, starting with finding F and creating a computational graph. They explain that to use the chain rule, one needs to differentiate each of the factors found in the computation of F and evaluate them appropriately. This computational graph is used to demonstrate the computation of partial derivatives for deep learning, for which many variables are evaluated.

  • 00:20:00 In this section, the speaker discusses the process of finding partial derivatives using forward-mode automatic differentiation. They begin by calculating the partial derivative of F with respect to x, breaking the computation into simple pieces and replacing each intermediate step with its derivative. They use the fact that the derivative of x cubed with respect to x is 3x squared, which gives the value 12 when x equals 2. They then recognize that the forward method is wasteful, since a whole second graph would be needed for the y derivative. The speaker also uses the product rule in differentiating the product. The process requires a little organization, but the whole point is to break the computation into simple pieces so that the derivatives stay simple.

  • 00:25:00 In this section, the speaker explains how to use the product rule to find partial derivatives on a computational graph. Using the example of the x derivative of a product, he assigns names to the two factors, computes the values needed for the product rule, and combines them to get the derivative. However, he notes that he would need to redo the computation in order to find the partial derivative with respect to y. The speaker suggests that the reverse mode is more efficient, because it computes both partial derivatives at once; a small sketch of this idea follows.
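
A minimal sketch of the reverse-mode idea on a hypothetical function in the same spirit as the lecture's example, f(x, y) = x^3 (x + y): one forward pass through the simple pieces, then one backward sweep that delivers both partial derivatives.

```python
# Forward pass: break f(x, y) = x^3 * (x + y) into simple pieces.
x, y = 2.0, 3.0
c = x ** 3          # c = x^3
s = x + y           # s = x + y
f = c * s           # f = c * s

# Backward (reverse-mode) pass: propagate df/d(node) back through the graph.
df_df = 1.0
df_dc = s * df_df                     # product rule: d(c*s)/dc = s
df_ds = c * df_df                     # d(c*s)/ds = c
df_dx = 3 * x ** 2 * df_dc + df_ds    # x feeds both c and s, so the two pieces add
df_dy = df_ds                         # y only feeds s

print(f, df_dx, df_dy)                # 40.0 68.0 8.0 -- both partials from one sweep
```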

  • 00:30:00 In this section, the speaker explains how backpropagation computes gradients efficiently by following all the paths backward. The chain rule, applied to a few derivatives that have already been worked out in detail, produces all the derivatives at once. The speaker notes that calculus tends to look simple in hindsight. The reverse-mode AD approach computes n first derivatives for only about four or five times the cost of one function evaluation, which the speaker finds amazing. The speaker also gives an example of how the order in which computations are done can make a large difference in cost, using the multiplication of matrices as the example.

  • 00:35:00 In this section of the video, the speaker discusses the importance of the order of computations in matrix multiplication as it can significantly affect the speed of calculations. He then moves on to example one of backpropagation and demonstrates how to use the chain rule and various other derivative rules to find partial derivatives while going backwards through a computational graph. He highlights the fact that by reusing the pieces in the chain, a wider chain can be created without significant costs, which results in faster computations even when the function depends on hundreds of variables.

  • 00:40:00 In this section of the video, the speaker explains how to use backpropagation to find partial derivatives. They demonstrate an example where they find partial derivatives with respect to X and Y using the chain rule, and emphasize that backpropagation allows all the derivatives to be found from one chain, rather than separate chains for each variable. The speaker notes that this process can be applied to a system of any size, and briefly mentions the convergence of stochastic gradient descent, which they will cover in future lectures.

  • 00:45:00 In this section, the speaker compares the two ways to multiply three matrices A, B, and C, and the number of operations each requires. With A of size m by n, B of size n by p, and C of size p by q, computing A(BC) costs npq + mnq multiplications, while computing (AB)C costs mnp + mpq. The speaker emphasizes that it is important to be mindful of this count, especially when C is a column vector (q = 1): forming AB first creates a potentially very large m by p matrix, whereas BC is just a vector. A quick numerical illustration follows.
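
A small illustration, with arbitrary sizes, of how much the ordering matters when the rightmost factor is a column vector:

```python
import numpy as np
import time

m, n, p = 1000, 1000, 1000
A = np.random.rand(m, n)
B = np.random.rand(n, p)
c = np.random.rand(p)        # a column vector (q = 1)

t0 = time.perf_counter()
v1 = (A @ B) @ c             # forms the m-by-p matrix A @ B first: about m*n*p work
t1 = time.perf_counter()
v2 = A @ (B @ c)             # only matrix-vector products: about n*p + m*n work
t2 = time.perf_counter()

print(np.allclose(v1, v2))                              # same answer
print("(AB)c:", t1 - t0, "s   A(Bc):", t2 - t1, "s")    # the second is far faster
```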

  • 00:50:00 In this section, the speaker ties the ordering question back to backpropagation. Backpropagation chooses the good order for the chain of partial-derivative matrices: starting from the output end, each step multiplies a matrix by a vector to get another vector, so the computation never has to form the product of two large matrices. This ordering is what makes the calculation an order of magnitude faster.
27. Backpropagation: Find Partial Derivatives
  • 2019.05.16
  • www.youtube.com
MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Gilbert Strang. View the complete course: https://o...
 

Lecture 30: Completing a Rank-One Matrix, Circulants!



Lecture 30: Completing a Rank-One Matrix, Circulants!

In Lecture 30, the lecturer discusses completing a rank-one matrix and circulant matrices. They begin with a 2x2 determinant and use it to decide which prescribed entries allow a matrix to be completed to rank one. The lecturer then turns the 4x4 case into a combinatorial problem and introduces circulant matrices, whose diagonals wrap around cyclically, so a four-by-four circulant is determined by just four numbers. The lecture also covers cyclic convolution and the eigenvalues and eigenvectors of circulant matrices, which are important in signal processing.

  • 00:00:00 In this section, the lecturer gives an example question from a previous lab session about completing a matrix to a rank one matrix. The question is focused on which positions are okay to be filled in and which ones can't be filled in to achieve a rank one matrix. The lecturer explains how to choose non-zero numbers and poses a question about whether it's possible to complete a matrix with five non-zero numbers to a rank one matrix.

  • 00:05:00 In this section, the lecturer discusses completing a rank-one matrix and circulants. They begin with the 2x2 case: in a rank-one matrix every 2x2 submatrix must have determinant 0, and this condition pins down what a missing entry must be and how the rest of the values get filled in. The lecturer then moves on to a 4x4 example and poses a combinatorial problem: which choices of 5 prescribed positions allow completion and which do not. Finally, they introduce circulants, matrices with a cyclic pattern in which each row is the previous row shifted one place to the right, and explain how to create circulant matrices and their properties, including diagonalization.

  • 00:10:00 In this section, the lecturer discusses completing a rank-one matrix using bipartite graphs. They start by prescribing some entries of a 4x4 matrix and drawing a bipartite graph whose two sets of nodes are the rows and the columns, with an edge for each prescribed entry. The trouble comes when all four entries of some 2x2 submatrix are prescribed: a zero determinant generically cannot be achieved, so the matrix cannot be completed to rank one. The lecturer converts the bipartite graph back into a matrix picture to show which entries can be filled in to create a rank-one matrix.

  • 00:15:00 In this section, the professor discusses completing a rank-one matrix, specifically addressing whether or not it's always possible to complete it if there are no 2x2s in the way. He demonstrates with examples that two by twos are not always the problem and that there could be longer cycles that would hinder completion. The key takeaway is that a matrix can only be completed to rank one if there are no cycles, which can be identified in the corresponding bipartite graph.

  • 00:20:00 In this section, the lecturer discusses completing a cycle with six edges and how it relates to the idea of cycles in matrices. He converts a drawn picture of a cycle into a matrix and explains how cycles in matrices show that certain requirements must be satisfied by non-zero values. He poses a question about completing a rank 2 matrix and discusses the importance of convolutions in machine learning.

  • 00:25:00 In this section, the lecturer introduces circulant matrices, a special type of convolution matrix whose constant diagonals wrap around cyclically. Circulant matrices are an essential part of signal processing, and their algebraic properties make them an efficient way to connect a set of weights. The key matrix here is the cyclic shift P: a circulant is built from I, P, P², and so on. By specifying only the first column of a circulant matrix, MATLAB, for example, can produce all the other columns as cyclic shifts, which means only four numbers are needed to define a four-by-four circulant matrix.

  • 00:30:00 In this section of the lecture, the concept of circulant matrices is developed further. It is shown that every circulant matrix is a polynomial in P, where P is the single cyclic shift. It follows that the product of two circulant matrices is again circulant, the identity matrix is circulant, and the square of a circulant is circulant. When circulant matrices are multiplied, the degrees of the polynomials add, but powers P^n and beyond fold back because P^n = I, so the product stays a polynomial of degree below n. A small numerical check of these facts appears below.
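
A minimal numerical check with numpy and scipy (hypothetical first column): a circulant really is a polynomial in the shift matrix P, and circulants commute.

```python
import numpy as np
from scipy.linalg import circulant

n = 4
# Cyclic shift matrix P: it sends (x0, x1, x2, x3) to (x3, x0, x1, x2).
P = np.roll(np.eye(n), 1, axis=0)

c = np.array([5.0, 1.0, 2.0, 3.0])      # hypothetical first column
C = circulant(c)                        # 4x4 circulant defined by its first column

# Every circulant is a polynomial in P:  C = c0*I + c1*P + c2*P^2 + c3*P^3.
C_poly = sum(c[k] * np.linalg.matrix_power(P, k) for k in range(n))
print(np.allclose(C, C_poly))           # True

# Products of circulants are circulant, and circulants commute (all are polynomials in P).
D = circulant(np.array([1.0, 0.0, 4.0, 2.0]))
print(np.allclose(C @ D, D @ C))        # True
```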

  • 00:35:00 In this section, the lecturer returns to circulants. When two 4x4 circulants, each a polynomial of degree three in the shift P, are multiplied, one might expect degree six; the key is that the P⁴ term is really a P⁰ term, so the product corresponds to a cyclic convolution. The lecturer then explains the difference between ordinary convolution and cyclic convolution, working through a convolution of two vectors, and reminds viewers that cyclic convolution is written with a circle symbol while ordinary convolution is not.

  • 00:40:00 In this section, the lecturer discusses cyclic convolution and how it corresponds to multiplying polynomials cyclically, which in turn corresponds to multiplying circulant matrices. A useful check: the sum of the entries of one factor times the sum of the entries of the other factor equals the sum of the entries of the convolution. The lecturer also begins on the eigenvalues and eigenvectors of these matrices: the vector of all ones is an eigenvector of every circulant, with eigenvalue equal to the sum c0 + c1 + ... + c_{n-1}, the polynomial evaluated at 1. The lecture concludes with a look ahead to more advanced topics. A small convolution example is sketched below.
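
A small sketch with hypothetical vectors showing ordinary versus cyclic convolution (the cyclic one computed through the FFT, anticipating the convolution rule of Lecture 32) and the sum check mentioned above:

```python
import numpy as np

c = np.array([1, 2, 3])
d = np.array([5, 0, 4])

full = np.convolve(c, d)       # ordinary convolution: length 3 + 3 - 1 = 5
cyclic = np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(d)))   # cyclic convolution, length 3

print(full)                    # [ 5 10 19  8 12]
print(np.round(cyclic))        # [13. 22. 19.]  -- entries 3 and 4 folded back onto 0 and 1

# Sum check: (sum of c) * (sum of d) = sum of the convolution.
print(c.sum() * d.sum(), full.sum(), int(round(cyclic.sum())))   # 54 54 54
```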

  • 00:45:00 In this section of the lecture, the speaker explains that the eigenvectors of the circulant matrix C are the same as the eigenvectors of the shift matrix P. For the 4x4 case the eigenvalues of P are 1, -1, i, and -i. Every circulant shares this one set of eigenvectors, with its own eigenvalues, and these facts are important rules in signal processing.
Lecture 30: Completing a Rank-One Matrix, Circulants!
  • 2019.05.16
  • www.youtube.com
MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Gilbert Strang. View the complete course: https://o...
 

Lecture 31. Eigenvectors of Circulant Matrices: Fourier Matrix



31. Eigenvectors of Circulant Matrices: Fourier Matrix

In this video on eigenvectors of circulant matrices, the speaker discusses how circulant matrices relate to image processing and machine learning, as well as their connection to the Fourier matrix. The speaker emphasizes the importance of understanding convolution and circulant matrices in relation to the discrete Fourier transform (DFT) and Fourier transforms. The speaker shows that the eigenvector matrix of every circulant is the Fourier matrix and that, in the 8x8 case, everything is built from the same eight numbers, the eighth roots of unity, which are also the eigenvalues of the shift matrix. The speaker also covers the properties of the Fourier matrix, including how its columns are orthogonal but not orthonormal, and how the components of every eigenvector other than the all-ones vector add up to zero, which makes those eigenvectors orthogonal to the all-ones vector. Finally, the speaker works through a 3x3 example built from a cube root of unity to demonstrate this orthogonality.

  • 00:00:00 In this section, the professor introduces the topic of circulant matrices and provides updates on project deadlines and grading. He also mentions the connection of circulant matrices to the discrete Fourier transform, which is an important algorithm in engineering and mathematics. The special form of circulant matrices, with only n entries needed to define a matrix of size n by n, is useful in many applications, including machine learning for images.

  • 00:05:00 In this section, the speaker explains that images are typically described by their pixels and can have feature vectors with millions of components, which would make it impossible to compute all the weights required for deep learning with gradient descent. However, the matrices used for images in deep learning are special and do not depend on the number of features, much like circulant matrices with their cyclic structure. These matrices are called linear shift-invariant or linear time-invariant matrices, convolution matrices, Toeplitz matrices, or constant-diagonal matrices, and they are used in machine learning and image processing. Essentially, they make deep learning tractable by drastically reducing the number of weights to be computed for each layer of the deep network.

  • 00:10:00 In this section, the speaker discusses the use of circulant matrices in image processing and machine learning. He explains that in order to operate on a large image with numerous pixels, we can use max pooling to reduce the size of the system. However, for convolutional operations, we need to choose weights to highlight important points. Therefore, we use filters to simplify the image, such as a low-pass filter. The speaker notes that wider neural nets in machine learning are used when dealing with image samples, as a constant diagonal matrix is more natural and efficient to use.

  • 00:15:00 In this section, the presenter talks about eigenvalues and eigenvectors of circulant matrices, starting with the permutation matrix P that performs a cyclic shift. The singular values of a permutation matrix are all one, and the eigenvalues can be found by taking P minus lambda I and setting the determinant to zero, which gives the equation lambda to the fourth power equals one. The presenter also emphasizes the importance of understanding convolution and circulant matrices in relation to the DFT and Fourier transforms.

  • 00:20:00 In this section, the speaker turns to the eigenvectors of circulant matrices and their connection to the Fourier matrix. The eigenvalues of the 4x4 shift matrix, found by setting the determinant to zero, are the fourth roots of one. For an 8x8 circulant the corresponding numbers are the eight solutions of lambda to the 8th power equals one: the eighth roots of unity, which include 1, -1, and ±i. These numbers matter because the eigenvectors are built from them. Orthogonal matrices are also introduced as one of the families of matrices that have orthogonal eigenvectors.

  • 00:25:00 In this section, the speaker lists families of matrices that have orthogonal eigenvectors. Symmetric matrices have orthogonal eigenvectors and real eigenvalues, while diagonal matrices have the columns of the identity as eigenvectors. Orthogonal matrices have eigenvalues of magnitude 1, and the eigenvectors of permutation matrices are orthogonal. Anti-symmetric matrices have purely imaginary eigenvalues, so they cannot have nonzero real eigenvalues.

  • 00:30:00 In this section, the speaker talks about matrices with orthogonal eigenvectors and how they are characterized as normal matrices. Such matrices can have any complex eigenvalues, and the speaker writes the factorization with a diagonal matrix of eigenvalues between the eigenvector matrix and its conjugate transpose. He then sets up the matrix equation for recognizing normal matrices, which are actually quite rare: one tests whether the matrix commutes with its conjugate transpose.

  • 00:35:00 In this section, the speaker discusses eigenvectors of circulant matrices, particularly the Fourier matrix. The permutation P is orthogonal, so its eigenvectors are orthogonal, but these circulant matrices also commute, making them normal matrices. This means that once we find the eigenvectors of P, we've found the eigenvectors of any circulant matrix, and they are all special because they are connected to Fourier. The eigenvectors are found for various eigenvalues, including lambda being 1, -1, i, and -i.

  • 00:40:00 In this section, the speaker discusses the eigenvectors of circulant matrices and emphasizes that all the eigenvectors are constructed from the same set of eight numbers, which are also the eigenvalues. The eigenvector matrix for all circulant matrices of size n is the Fourier matrix, which is an important complex matrix allowing the fast Fourier transform. All entries in the matrix are powers of a complex number W on the unit circle at one of its eight points. The first eigenvector is all ones, while the rest are powers of W, such that the matrix is 8x8 in size. Overall, circulant matrices have similar properties thanks to their common eigenvector matrix.

  • 00:45:00 In this section of the video, the speaker explains the properties of the Fourier matrix, the eigenvector matrix of every circulant, whose entries are powers of the eighth root of one. The columns of the matrix are orthogonal but not orthonormal; they need to be divided by the square root of eight to become orthonormal. The Fourier matrix is itself a normal matrix. The components of every eigenvector other than the all-ones vector add up to zero, which is exactly why those eigenvectors are orthogonal to the all-ones vector. The speaker demonstrates this with a three-by-three example in which the components of an eigenvector sum to zero. A small numerical check follows.
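
A minimal numerical check of the facts in this section, using a random first column and the standard conventions: the columns of the Fourier matrix are eigenvectors of the circulant, the eigenvalues are the DFT of the first column, and the columns are orthogonal with squared length n rather than 1.

```python
import numpy as np
from scipy.linalg import circulant

n = 8
c = np.random.rand(n)
C = circulant(c)                                  # circulant with first column c

w = np.exp(2j * np.pi / n)                        # primitive n-th root of unity
F = w ** np.outer(np.arange(n), np.arange(n))     # Fourier matrix, F[j, k] = w^(j*k)

lam = np.fft.fft(c)                               # eigenvalues = DFT of the first column
for k in range(n):
    assert np.allclose(C @ F[:, k], lam[k] * F[:, k])   # column k is an eigenvector of C

print(np.allclose(F.conj().T @ F, n * np.eye(n)))  # orthogonal columns, each of length sqrt(n)
```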

  • 00:50:00 In this section, the speaker works through the 3x3 example in more detail, using the eigenvector built from powers of ω = e^(2πi/3). He shows that 1 + ω + ω² = 0, so the components of the eigenvector (1, ω, ω²) sum to zero, which is exactly why it is orthogonal to the all-ones eigenvector; the computation illustrates the general fact that the eigenvectors of a circulant matrix are orthogonal. The speaker concludes by mentioning that the Fourier matrix will continue in the next lecture and that only about a week and a half of class remains in 18.065.
31. Eigenvectors of Circulant Matrices: Fourier Matrix
  • 2019.05.16
  • www.youtube.com
MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Gilbert Strang. View the complete course: https://o...
 

Lecture 32: ImageNet is a Convolutional Neural Network (CNN), The Convolution Rule



Lecture 32: ImageNet is a Convolutional Neural Network (CNN), The Convolution Rule

In Lecture 32 of a deep learning course, the power of convolutional neural networks (CNNs) in image classification is discussed, with the example of the ImageNet competition won by a large deep CNN featuring convolution layers, normalization layers, and max pooling layers. The lecture then focuses on the convolution rule, which connects multiplication and convolution, with examples of two-dimensional convolutions, the use of the Kronecker product for a two-dimensional Fourier transform and in signal processing, and the difference between the periodic and non-periodic cases of convolution. The lecturer also discusses the eigenvectors and eigenvalues of a circulant matrix and the Kronecker sum operation.

  • 00:00:00 In this section of the video, the importance of convolutional neural networks (CNNs) is discussed in relation to deep learning and image classification. The ImageNet paper by Krizhevsky, Sutskever, and Hinton is mentioned, which trained a large deep CNN to classify 1.2 million high-resolution images. The competition was won with a top-5 test error rate of 15%, compared to 26% for the second-place team. The CNN had convolution layers, normalization layers, and max pooling layers, with the network split across two GPUs. Dropout was also used in the fully connected layers to reduce overfitting. This demonstrates the power of CNNs in handling the enormous computational problem of classifying images.

  • 00:05:00 In this section of the video, the speaker discusses the convolution rule, which is an essential aspect of convolutional neural networks (CNNs). He explains that convolution arises from multiplying polynomials and that the formula for the coefficients of the convolution c * d can be written in a different way to see how convolution operates. He then gives an example of the convolution of two functions and explains how this relates to the convolution of two vectors in a CNN. Understanding convolution is crucial to understanding the inner workings of a CNN such as the ImageNet winner, a network with 60 million parameters used for image recognition tasks.

  • 00:10:00 In this section, the lecturer explains the convolution rule for functions and how it connects to the Fourier transforms of the two functions. He mentions that if f is 2π-periodic and g is 2π-periodic, then one wants a periodic convolution whose answer also has period 2π. He explains how making the convolution cyclic changes the multiplication, and that the variable w, satisfying w^N = 1, is used in place of x in the cyclic case.

  • 00:15:00 In this section of the video, the lecturer discusses the difference between the periodic and non-periodic cases with regards to convolution. In the periodic case, the factor W is defined to have the property that W to the N is 1, and vectors greater than n can be folded back to a vector of length n. The cyclic case only considers K going from 0 to n-1, and the sums only go from 0 to n-1. In the non-periodic case, the convolution has P plus Q minus 1 components, and this number is calculated in the first lab.

  • 00:20:00 In this section, the lecturer discusses the eigenvectors and eigenvalues of a circulant matrix, starting from the permutation (shift) matrix. The eigenvectors are the columns of the eigenvector matrix F, and in the 4x4 case the four eigenvalues appear when C multiplies those columns. The lecturer demonstrates the key formula: if C is a combination of powers of P, then the same combination of the eigenvalues of P gives the eigenvalues of C, while the eigenvectors stay the same.

  • 00:25:00 In this section, the lecturer states the convolution rule, the connection between multiplication and convolution. Multiplying the circulant matrix C by the circulant matrix D gives another circulant matrix, whose defining vector is the cyclic convolution c ⊛ d: that convolution supplies the first column, and hence the diagonals, of CD. The eigenvalues of CD are the eigenvalues of C times the eigenvalues of D, because C and D commute and share the same eigenvectors. The eigenvalues multiply component by component, which is exactly the convolution rule, written out below.
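
In symbols, with F the Fourier (eigenvector) matrix and the products on the right taken component by component, the rule reads

\[
F\,(c \circledast d) = (F c)\odot(F d),
\qquad\text{equivalently}\qquad
\lambda(CD) = \lambda(C)\odot\lambda(D),
\]

where \odot denotes the component-by-component product.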

  • 00:30:00 In this section of the video, the lecturer discusses the convolution rule, which states that one can either convolve two signals directly, or take the Fourier transform of each, multiply point-wise, and transform back. This rule is useful because of the fast Fourier transform (FFT), which is highly efficient. The lecturer then compares the costs of the two routes: direct convolution takes about N² multiplications, while the transforms take only on the order of 2N log N operations.

  • 00:35:00 In this section, the speaker discusses two-dimensional convolutions and the operation needed to convolve two functions in two dimensions. In MATLAB, the relevant command is kron, the Kronecker product, which builds the big matrix acting on an image with N squared pixels from two one-dimensional N by N matrices A and B. The speaker also notes that if one wants to multiply two very long integers, as in cryptography, the convolution rule gives a faster and more efficient method.

  • 00:40:00 In this section, the use of the Kronecker product to produce the big matrix for a two-dimensional Fourier transform is discussed. The Kronecker product takes two N by N matrices and produces an N squared by N squared matrix, and with the right choice of factors it gives the matrix of the two-dimensional Fourier transform. The Laplacian, which is commonly used in differential equations, is also discussed: the two-dimensional matrix for the five-point scheme, with five weights at each grid point, can be produced from the one-dimensional version using Kronecker products. A small sketch follows.
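
A minimal sketch in numpy, for a small n and the standard constructions, of the Kronecker product for the 2-D transform and the Kronecker sum that builds the five-point Laplacian:

```python
import numpy as np

n = 4
I = np.eye(n)

# 1-D second-difference matrix (the 1-D Laplacian stencil with weights -1, 2, -1).
K = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

# Kronecker sum: the 2-D five-point Laplacian on an n-by-n grid, an n^2-by-n^2 matrix.
L2 = np.kron(I, K) + np.kron(K, I)
print(L2.shape)      # (16, 16): diagonal entries 4, with -1's for the grid neighbours

# Kronecker product for the 2-D transform: the 2-D DFT acting on a flattened n-by-n image.
F1 = np.fft.fft(np.eye(n))   # 1-D DFT matrix
F2 = np.kron(F1, F1)         # applies the DFT along both directions of the image
print(F2.shape)              # (16, 16)
```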

  • 00:45:00 In this section, the speaker continues with the Kronecker product and how it is used in signal processing. He explains that the Kronecker product extends the one-dimensional operator to act on the two-dimensional data, and the vertical derivative is then added on; together these form the Kronecker sum, an important operation in signal processing. He also encourages students to email him if they would like to volunteer for a project presentation, in which they can discuss what they have learned and get feedback from the audience.
Lecture 32: ImageNet is a Convolutional Neural Network (CNN), The Convolution Rule
  • 2019.05.16
  • www.youtube.com
MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Gilbert Strang. View the complete course: https://o...
 

Lecture 33. Neural Nets and the Learning Function



33. Neural Nets and the Learning Function

In this video, the speaker discusses the construction of the learning function f for neural nets, which is optimized by gradient descent or stochastic gradient descent and applied to training data to minimize the loss. He explains the use of a hand-drawn picture to illustrate the concept of neural nets and the learning function, as well as various loss functions used in machine learning, including cross-entropy loss. The speaker also talks about the problem of finding the positions of points given their distances, which is a classic problem with various applications, such as in determining the shapes of molecules using nuclear magnetic resonance. He concludes by discussing the construction of X, the final step in obtaining the structure of a neural network, and mentions a call for volunteers to discuss a project on Friday.

  • 00:00:00 In this section, the speaker discusses the construction of the learning function f for neural nets, which is optimized by gradient descent or stochastic gradient descent and applied to the training data to minimize the loss. The learning function is a function of two sets of variables, x and v, where x holds the weights and v is a feature vector from the training data. The structure of the neural net alternates linear and nonlinear steps, layer after layer, until the desired output is reached. The linear step takes the current vector, starting from the input v0, multiplies it by a matrix A_k, and adds a bias vector b_k to shift the origin; the recursion is written below.
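
In symbols, using ReLU as the componentwise nonlinearity as in the earlier lectures, the recursion is

\[
v_k = \mathrm{ReLU}\bigl(A_k v_{k-1} + b_k\bigr),\quad k = 1,\dots,L-1,
\qquad
f(x, v_0) = A_L v_{L-1} + b_L,
\]

where x collects all of the weights A_k and b_k.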

  • 00:05:00 In this section, the speaker discusses how neural networks work by taking a set of inputs, applying weights (which are chosen using gradient descent in chapter 6), and taking a non-linear step to produce a new output. This process is repeated through many layers until the final output, which is the neural network's prediction for the input. The number of weights can often greatly exceed the number of features in the input, creating an underdetermined situation.

  • 00:10:00 In this section, the speaker explains the use of a hand-drawn picture to illustrate the concept of neural nets and the learning function. He draws a picture in which the components of a training sample are multiplied to produce v1, the first layer, which can have a different number of neurons than the input, and each entry comes from the A's and b's. The loss function is the function we want to minimize by choosing x, which collects all the A's and b's. The loss is typically a finite sum of terms over all samples i; it could be computed for every i, but stochastic gradient descent instead picks just one sample or a small batch at each step. For square loss, each term is the output for sample i minus the true result for sample i, squared, and summing these gives the sum of squared errors over all the samples.

  • 00:15:00 In this section, the speaker discusses various loss functions that are used in machine learning, specifically in neural networks. The loss function is a measure of how well the neural net's prediction matches the true value. The speaker provides four popular loss functions, including square loss, L1 loss, hinge loss, and cross-entropy loss. Cross-entropy loss is the most important for neural nets and is the most commonly used loss function. The speaker also briefly touches on distance matrices and the process of determining the positions of points in space using measured distances between them.

  • 00:20:00 In this section, the speaker introduces a math question: find positions in space given the distances between points. The question is easy to state and has applications in various fields; it takes up only two pages in the book, but the solution there is detailed and complete. The speaker encourages students to ask questions about their projects and suggests emailing him directly. He also raises the question of what courses to take after this one and asks the students whether they plan on taking more courses in this area, acknowledging that there are relevant courses in other departments but that he only found a list for Course 6.

  • 00:25:00 In this section, the speaker talks about the MIT Operations Research Center and its course offerings, including optimization, data analytics, statistics, and operations research. The speaker also refers to a lecture by Sir Tim Berners-Lee, the creator of the World Wide Web, and his responsibility for the excessive letters in URLs. Then the speaker discusses distance matrices and the problem of finding the position matrix from the given distances. The speaker mentions several applications, including wireless sensor networks, where distances can be measured between sensors, and GPS systems, which can calculate location using a similar principle.

  • 00:30:00 In this section, the speaker discusses the problem of finding the positions of points given their distances, which is a classical problem with a neat solution. The positions are not unique, since they can undergo translations and rotations, but the speaker suggests removing the translations by putting the centroid at the origin. The problem of finding the positions arises in various situations, such as determining the shapes of molecules using nuclear magnetic resonance. Machine learning can also be described as finding a low-dimensional surface in high-dimensional space, which is mathematically the problem of finding a curved manifold that best fits the given points. This involves discovering the true dimensionality of the problem and linearizing, reducing from the original high-dimensional space down to that true dimension.

  • 00:35:00 In this section, the speaker explains how to find a matrix X given a dot-product matrix G. They start from two rank-one matrices, one constant along rows and the other constant along columns, which carry the squared lengths that appear in every entry of the squared-distance matrix. They then introduce the diagonal matrix with the inner products of each x_i with itself on the diagonal and note how it relates to the given D matrix. From there they derive the equation for the dot-product matrix and explain that once G is known, X can be found (the reconstruction sketch after this list follows this recipe). However, X is not unique, because it can be rotated without changing the dot products, so the next step is to figure out how to factor out the rotation.

  • 00:40:00 In this section, the speaker works through the equation that recovers the matrix of dot products from the distance matrix. The squared-distance matrix combines the rank-one matrix with all rows the same, the rank-one matrix with all columns the same (both built from the squared lengths on the diagonal of the dot-product matrix), and the cross-product term minus two X transpose X. Going through the equation component by component shows that X transpose X comes from those rank-one pieces and the cross products: X transpose X equals one half of (d 1 transpose plus 1 d transpose minus D), where d is the vector of squared lengths. The speaker then examines the factor of one half and concludes that it is exactly what is needed to obtain the correct result.

  • 00:45:00 In this section, the speaker discusses how to write a given equation in matrix language and how to ultimately find the matrix X given X transpose X. They use linear algebra to find the solution and note that X can be found up to an orthogonal transformation. The two leading methods discussed are using eigenvalues or using elimination on X transpose X. The speaker emphasizes the importance of these methods in the field of neural networks and machine learning.

  • 00:50:00 In this section, the speaker discusses how to construct a matrix X whose dot products reproduce the symmetric positive semi-definite matrix X transpose X, and gives two ways of finding it. The first is the eigenvalue construction: compute the eigenvalues and eigenvectors of X transpose X, keep the eigenvectors, and take the square roots of the eigenvalues. The second is the Cholesky factorization: perform elimination on the symmetric positive definite matrix to get a lower triangular L and a diagonal D, and take X to be the transpose of L times the square root of D, so that X transpose X equals L D L transpose. The Cholesky factorization is faster and easier to compute than the eigenvalue construction, making it a practical option when the matrix is positive definite.

  • 00:55:00 In this section, the speaker concludes the discussion of distance matrices, the last topic of the lecture after the structure of a neural network and the separation of the sample vectors from the weights. The speaker also points to the two recurring pieces of linear algebra: elimination, which reduces matrices to triangular form, and the eigenvalue approach, which works with symmetric matrices. Lastly, the speaker repeats the call for volunteers to discuss a project on Friday.
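
As a companion to the forward-pass and loss-function notes above (00:00:00 and 00:15:00), here is a minimal numpy sketch of the structure described: each layer multiplies by a matrix A_k, adds a bias vector b_k, and applies the nonlinear ReLU step, and a square loss or cross-entropy loss compares the final output with the true result. The layer widths, random weights, and sample vector are illustrative assumptions, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative layer widths: input v0 has 3 features, two hidden layers, 2 outputs
sizes = [3, 5, 5, 2]
A = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [rng.standard_normal(m) for m in sizes[1:]]

def relu(z):
    return np.maximum(z, 0.0)

def f(A, b, v0):
    """Forward pass: linear step A_k v + b_k, then ReLU, repeated layer by layer.
    The last layer is left linear so it can feed a loss function."""
    v = v0
    for k in range(len(A) - 1):
        v = relu(A[k] @ v + b[k])
    return A[-1] @ v + b[-1]

def square_loss(output, truth):
    return np.sum((output - truth) ** 2)

def cross_entropy_loss(output, true_class):
    # softmax followed by the negative log-likelihood of the true class
    p = np.exp(output - output.max())
    p /= p.sum()
    return -np.log(p[true_class])

v0 = rng.standard_normal(3)            # one sample's feature vector
print(square_loss(f(A, b, v0), np.array([1.0, 0.0])))
print(cross_entropy_loss(f(A, b, v0), true_class=0))
```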
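
For the distance-matrix discussion (00:30:00 through 00:50:00), the sketch below follows the same recipe: center the points, recover the dot-product matrix X transpose X from the squared distances, and factor it by the eigenvalue construction (keep the eigenvectors, take square roots of the eigenvalues). It uses the standard centering-matrix form of the lecture's formula, and the random test points are an assumption used only to check the reconstruction. The Cholesky route mentioned at 00:50:00 would factor G as L D L transpose by elimination, but it needs G to be strictly positive definite, so the eigenvalue construction is used here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical test data: n points in d dimensions, centered at the origin
d, n = 2, 6
X_true = rng.standard_normal((d, n))
X_true -= X_true.mean(axis=1, keepdims=True)      # remove translations

# Squared-distance matrix D_ij = |x_i - x_j|^2
sq = np.sum(X_true**2, axis=0)
D = sq[:, None] + sq[None, :] - 2 * X_true.T @ X_true

# Recover the dot-product (Gram) matrix G = X^T X from D.
# With the centroid at the origin, G = -1/2 * J D J, where J = I - (1/n) * ones.
J = np.eye(n) - np.ones((n, n)) / n
G = -0.5 * J @ D @ J

# If the distances are genuine, G is positive semi-definite (no negative eigenvalues)
evals, evecs = np.linalg.eigh(G)
print(np.round(evals, 8))

# Eigenvalue construction: keep eigenvectors, take square roots of the eigenvalues
keep = evals > 1e-10
X_rec = np.diag(np.sqrt(evals[keep])) @ evecs[:, keep].T

# X is only determined up to rotation/reflection, so compare distance matrices instead
sq_rec = np.sum(X_rec**2, axis=0)
D_rec = sq_rec[:, None] + sq_rec[None, :] - 2 * X_rec.T @ X_rec
print(np.allclose(D, D_rec))                      # True
```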

Lecture 34. Distance Matrices, Procrustes Problem



34. Distance Matrices, Procrustes Problem

The speaker discusses the Procrustes problem, which involves finding the best orthogonal transformation that takes one set of vectors as close as possible to another set of vectors. They explain different expressions for the Frobenius norm of a matrix and its connection to the Procrustes problem. The speaker also introduces the trace of a matrix and uses it to find the correct Q in the Procrustes problem. Additionally, they raise the question of whether deep learning actually works and present the solution to the best-orthogonal-matrix problem, which comes from computing the SVD of the product of the two data matrices and keeping the orthogonal factors from that SVD.

  • 00:00:00 In this section, the speaker addresses a question raised in a previous discussion about finding points that satisfy a given distance matrix, and what goes wrong when the triangle inequality fails. The speaker explains that when the distances are genuine, the matrix of dot products that comes directly from the distance matrix is positive semi-definite, but if the triangle inequality fails, the matrix of dot products will not come out positive semi-definite. Moving to a higher dimension cannot fix this, because the triangle inequality must hold in every dimension.

  • 00:05:00 In this section, the lecturer talks about the Procrustes problem, which involves making one thing fit another. The problem is named after the Greek myth of Procrustes, who had a bed of a certain length and stretched or trimmed his visitors to fit it. The mathematical problem is to fit two sets of data together, and the lecturer explains that if the numbers in the distance matrix satisfy the triangle inequality, the matrix that comes out of the equation is positive semi-definite. If the triangle inequality is violated, however, that matrix has negative eigenvalues and is not positive semi-definite, and it is impossible to find the points. The lecturer also hints at a big question, whether deep learning actually works, which he will address later.

  • 00:10:00 In this section, the Procrustes problem is discussed: find the best orthogonal transformation that takes one set of vectors as close as possible to another set of vectors. If the two sets of vectors were both orthogonal bases, an orthogonal matrix Q would carry one exactly into the other, but that is not usually the case. The problem is therefore to minimize the difference over all orthogonal matrices Q in the squared Frobenius norm, which treats a matrix like one long vector. One way to compute that norm is to form A transpose A and take its trace, which gives the sum of the squares of all the entries, the Frobenius norm squared.

  • 00:15:00 In this section, the lecturer discusses different expressions for the Frobenius norm of a matrix. They show that the Frobenius norm squared can be expressed as the sum of the squares of all the singular values, as the trace of the matrix times its transpose, or as the trace of the transpose times the matrix. They then explain how these expressions connect to each other and note that solving the Procrustes problem rests on a few key facts, such as the fact that multiplying a matrix by an orthogonal Q (which multiplies every column by Q) does not change the Frobenius norm and does not change the singular values.

  • 00:20:00 In this section, the speaker discusses properties of the Frobenius norm, including the fact that it remains unchanged when multiplied by an orthogonal factor or when multiplied on the other side by the same or a different factor. The speaker also introduces the concept of the trace of matrices, highlighting the fact that the trace does not change when the order of the matrices is reversed. The speaker then explains the steps to obtain the correct Q in the Procrustes problem.

  • 00:25:00 In this section, the speaker discusses the question of whether deep learning actually works and suggests that it is an important question to address. They mention that while there has been a lot of publicity and hype around deep learning and neural nets, it is not automatic that the structure of the network will be successful, even with multiple layers. The speaker then presents the solution to the Procrustes problem of finding the best orthogonal matrix: compute the SVD of the product of the two data matrices and build Q from the orthogonal factors of that SVD (a short sketch follows this list).
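
A minimal numpy sketch of the solution summarized at 00:25:00, posed here as minimizing the Frobenius norm of Q Y minus X over orthogonal Q with the vectors stored as columns: take the SVD of X times Y transpose and keep only its orthogonal factors, Q = U V transpose. The small matrices, the rotation angle, and the noise level are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: two sets of n vectors in d dimensions, stored as columns
d, n = 3, 5
Y = rng.standard_normal((d, n))

# Build X as a rotated (and slightly perturbed) copy of Y, so a good Q exists
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0,              0,             1]])
X = R @ Y + 0.01 * rng.standard_normal((d, n))

# Procrustes: minimize ||Q Y - X||_F over orthogonal Q.
# Take the SVD of X Y^T and keep only the orthogonal factors: Q = U V^T.
U, s, Vt = np.linalg.svd(X @ Y.T)
Q = U @ Vt

print(np.allclose(Q.T @ Q, np.eye(d)))     # Q is orthogonal
print(np.linalg.norm(Q @ Y - X, "fro"))    # small residual
print(np.linalg.norm(R @ Y - X, "fro"))    # comparable to the true rotation's residual
```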

Lecture 35. Finding Clusters in Graphs



35. Finding Clusters in Graphs

This video discusses clustering in graphs and how to find clusters using different algorithms such as K-means and spectral clustering. The Laplacian matrix is used in spectral clustering, and its eigenvectors provide information about clusters in the graph. The Fiedler eigenvector, which is the eigenvector for the smallest positive eigenvalue, is the important one for clustering. The speaker also emphasizes how the orthogonality of the eigenvectors helps identify the two clusters. Additionally, there is a brief preview of the next lecture, which will cover backpropagation using Julia and linear algebra. Students are encouraged to submit their projects online or outside the instructor's office.

  • 00:00:00 In this section, the speaker discusses clustering in graphs, which is the process of subdividing a large graph into smaller, more manageable clusters. The problem is to find two clusters that are reasonably equal in size, and to do this, an algorithm must be utilized to determine the position of center points X and Y. The objective is to minimize the sum of distances between the center points and the nodes in the graph, while ensuring that the number of nodes in each cluster is reasonably close. There are several algorithms that can be used to accomplish this, including some that use matrices associated with the graph.

  • 00:05:00 In this section, the speaker discusses the K-means clustering algorithm for dividing a set of points, some labeled A and others labeled B, into clusters or groups. The algorithm begins by identifying the centroids, the middle points of the A and B groups, and then forms the best clusters it can around those centroids. This process is repeated until the algorithm converges on the best clusters it can find for the data (a minimal K-means sketch follows this list). The speaker also introduces the concept of the centroid, the point that minimizes the sum of squared distances to all the points in its group.

  • 00:10:00 In this section, the instructor discusses two methods for solving the problem of finding clusters in graphs. The first method is called K-means, which involves finding the closest cluster centroid for each point, reassigning points to their respective clusters, and repeating the process until convergence. The second method is called spectral clustering and involves using the eigenvalues of a matrix to group similar points together. The term "spectral" refers to the eigenvalues of the matrix and the spectral theorem in linear algebra. The instructor emphasizes that the spectral theorem applies to symmetric matrices, and it states that the eigenvalues are real and the eigenvectors are orthogonal.

  • 00:15:00 In this section, the speaker discusses the graph Laplacian matrix, which is the key connection between linear algebra and graph theory. They describe this matrix as symmetric positive semi-definite and list four matrices associated with any graph: the incidence matrix, the degree matrix, the adjacency matrix, and the Laplacian matrix. The speaker works through a simple example graph to explain each of these matrices. The Laplacian matrix is the one used in spectral clustering: because it is symmetric, the spectral theorem guarantees a full set of orthogonal eigenvectors, even for eigenvalues with multiplicity greater than one.

  • 00:20:00 In this section, the speaker explains the concept of graph clustering by finding clusters in a given graph using the Laplacian matrix. The Laplacian matrix is obtained by subtracting the adjacency matrix from the degree matrix. The resulting matrix is positive semi-definite, and its eigenvectors provide information about the clusters in the graph. The first eigenvalue is always zero, and the next (smallest positive) eigenvalue is the important one for clustering. The speaker emphasizes the Fiedler vector, the eigenvector for that smallest positive eigenvalue, and explains its significance in graph clustering (a spectral-clustering sketch follows this list).

  • 00:25:00 In this section, the speaker explains why the matrix is named the Laplacian: its diagonal holds the node degrees (4 for each node of a grid graph), just as the five-point finite-difference Laplacian has 4's on its diagonal. Its eigenvectors are what locate the clusters; in particular, the positive and negative components of the Fiedler eigenvector partition the graph into two clusters. This gives a concrete way of deciding which nodes belong in which cluster using the graph Laplacian.

  • 00:30:00 In this section, the speaker reviews clustering in graphs and how to find clusters using different algorithms such as K-means and spectral clustering. He explains that eigenvectors of a symmetric matrix are orthogonal; since the Fiedler eigenvector is orthogonal to the all-ones eigenvector for eigenvalue zero, its components add up to zero, and their positive and negative signs identify the two clusters (see the sketches after this list). He also mentions that other algorithms have been proposed for the same problem and gives a brief preview of the next lecture, which will cover backpropagation using Julia and linear algebra. The speaker encourages students to submit their projects online or outside his office.
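
A minimal K-means sketch for two clusters, following the alternating steps in the 00:05:00 and 00:10:00 notes: pick starting centroids, assign each point to the nearest centroid, recompute the centroids as cluster means, and repeat until the centroids stop moving. The two random blobs of points are a made-up example, not data from the lecture.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two illustrative blobs of points in the plane
points = np.vstack([rng.normal(loc=(0, 0), scale=0.5, size=(20, 2)),
                    rng.normal(loc=(3, 3), scale=0.5, size=(20, 2))])

# Start from two arbitrary points as centroids
centroids = points[[0, -1]].copy()

for _ in range(100):
    # Assignment step: each point joins the cluster of its nearest centroid
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid moves to the mean of its cluster
    new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)             # close to (0, 0) and (3, 3)
print(np.bincount(labels))   # roughly 20 points in each cluster
```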
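
And a sketch of the spectral approach from the 00:15:00 through 00:30:00 notes: build the graph Laplacian as the degree matrix minus the adjacency matrix, take the eigenvector for the smallest positive eigenvalue (the Fiedler vector), and split the nodes by the signs of its components. The small two-community graph is a made-up example.

```python
import numpy as np

# A small made-up graph: nodes 0-2 form one triangle, nodes 3-5 another,
# with a single edge (2, 3) bridging them.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6

A = np.zeros((n, n))            # adjacency matrix
for i, j in edges:
    A[i, j] = A[j, i] = 1
D = np.diag(A.sum(axis=1))      # degree matrix
L = D - A                       # graph Laplacian (symmetric, positive semi-definite)

evals, evecs = np.linalg.eigh(L)
print(np.round(evals, 4))       # the first eigenvalue is 0 (eigenvector all-ones)

fiedler = evecs[:, 1]           # eigenvector for the smallest positive eigenvalue
print(np.round(fiedler, 3))

# The signs of the Fiedler vector's components partition the nodes into two clusters
cluster_plus = np.where(fiedler >= 0)[0]
cluster_minus = np.where(fiedler < 0)[0]
print(cluster_plus, cluster_minus)   # {0, 1, 2} and {3, 4, 5}, in one order or the other
```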