
 

Lecture 14. Low Rank Changes in A and Its Inverse



14. Low Rank Changes in A and Its Inverse

The video discusses the concept of low-rank matrices and their importance, particularly the matrix inversion formula, which expresses the inverse of an n by n matrix after a low-rank change in terms of a much smaller inverse (a 1 by 1 number in the rank-1 case). The formula is useful for finding the inverses of matrices that have low-rank perturbations and can greatly simplify that computation. The speaker shows how the formula works by verifying it for a first matrix and then applying the same logic to a second one. The video also discusses practical applications of this formula, particularly in least squares problems and the Kalman filter.

  • 00:00:00 In this section, the professor discusses the concept of low-rank matrices and their importance. The focus is on a famous formula, the matrix inversion formula, also known as the formula for low-rank changes in A and its inverse. In its rank-1 form it finds the inverse of the n by n matrix I - u v^T in terms of a single number: the inverse is I plus u v^T divided by 1 - v^T u. The formula is useful in finding the inverses of matrices that have low-rank perturbations and can be used to simplify the process of finding inverses. The professor explains how the formula works and its practical applications.
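
    As a quick check of the rank-1 version described above, here is a minimal NumPy sketch (not from the lecture; the vectors u and v are arbitrary illustrative choices) verifying that the inverse of I - u v^T is I + u v^T / (1 - v^T u):

        import numpy as np

        np.random.seed(0)
        n = 5
        u = np.random.randn(n)
        v = np.random.randn(n)
        assert abs(1 - v @ u) > 1e-8              # the formula needs 1 - v^T u nonzero

        I = np.eye(n)
        A = I - np.outer(u, v)                    # a rank-1 change of the identity
        claimed_inverse = I + np.outer(u, v) / (1 - v @ u)

        print(np.allclose(claimed_inverse @ A, I))   # True: the product is the identity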

  • 00:05:00 In this section, the speaker discusses how changing a matrix by rank 1 changes its inverse by rank 1 as well. The formula he presents computes an n by n inverse in terms of a 1 by 1 inverse, which is very useful. The speaker then demonstrates how to check the formula by multiplying the claimed inverse by the original matrix and confirming that the product is the identity matrix. He then presents the formula for a second matrix and shows how the same logic leads to its inverse.

  • 00:10:00 In this section, the speaker presents a formula for a low-rank change in a matrix A and its inverse. The formula replaces the inverse of an n by n matrix by the inverse of a k by k matrix: the large matrix is a rank-k perturbation of the identity, and only the small k by k perturbation of the identity has to be inverted. The formula is verified by a direct check and is useful whenever a matrix A is perturbed by something of low rank. The names attached to the formula, Sherman, Morrison, and Woodbury, are also listed.
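
    A small sketch of the n-to-k switch in the rank-k case (my own illustrative matrices, not the lecture's example): the n by n inverse of I - U V^T equals I + U (I_k - V^T U)^{-1} V^T, so only a k by k matrix has to be inverted.

        import numpy as np

        np.random.seed(1)
        n, k = 8, 2
        U = np.random.randn(n, k)
        V = np.random.randn(n, k)

        big   = np.eye(n) - U @ V.T                            # n by n matrix to invert
        small = np.eye(k) - V.T @ U                            # only this k by k matrix is inverted
        claimed = np.eye(n) + U @ np.linalg.solve(small, V.T)  # I + U (I_k - V^T U)^{-1} V^T

        print(np.allclose(claimed @ big, np.eye(n)))           # True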

  • 00:15:00 In this section, the speaker is discussing what happens to the inverse when a matrix A receives a low-rank change. They use algebraic manipulations to show that certain terms cancel, leading to a simplified expression. The speaker notes that while the formula can be proved by checking that it produces the identity matrix, it is worth asking how the formula could be derived in the first place. They suggest using the formula to update the solution of a linear system when a new measurement or observation arrives in the least squares method.

  • 00:20:00 In this section, the speaker explains how to deal with new measurements when solving least squares problems. With a rectangular matrix A, adding one more measurement or data point produces a new matrix and a new right-hand side to solve. However, instead of recomputing the matrix product A^T A from scratch, the speaker describes how to append the new measurement as an extra row, transpose, and use the resulting rank-1 update to compute the new solution. Reusing what has already been computed makes the least squares solve much more efficient.
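
    A sketch of that update with made-up data (this is the standard recursive least squares step, written with my own variable names): after solving the normal equations once, a new row a with measurement beta changes A^T A by the rank-1 term a a^T, and Sherman-Morrison updates the solution without refactoring:

        import numpy as np

        np.random.seed(2)
        m, n = 20, 3
        A = np.random.randn(m, n)
        b = np.random.randn(m)

        P = np.linalg.inv(A.T @ A)            # (A^T A)^{-1}, already computed
        x_old = P @ (A.T @ b)                 # old least squares solution

        a    = np.random.randn(n)             # one new measurement row
        beta = np.random.randn()              # its right-hand side value

        # Rank-1 update: x_new = x_old + P a (beta - a^T x_old) / (1 + a^T P a)
        gain  = P @ a / (1 + a @ P @ a)
        x_new = x_old + gain * (beta - a @ x_old)

        # Compare with solving the enlarged problem from scratch.
        A2 = np.vstack([A, a]); b2 = np.append(b, beta)
        x_direct, *_ = np.linalg.lstsq(A2, b2, rcond=None)
        print(np.allclose(x_new, x_direct))   # True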

  • 00:25:00 In this section, the speaker discusses perturbing A and its inverse with new data, which provides a rank 1 change in A transpose A. This concept is applicable to least squares problems, and the Kalman filter is an example of a recursive least squares method that uses this approach. The Kalman filter is utilized in guiding missiles and satellites by tracking new data and updating the solution, which is an important application of this concept in practice.

  • 00:30:00 In this section of the video, the speaker explains how to apply the Sherman-Morrison-Woodbury formula to compute low rank changes in A and its inverse. They mention that the Kalman filter, which is used for dynamic least squares, takes two additional factors into account: the covariance matrix and the state equation. The covariance matrix describes how the errors are correlated, and the state equation tells how the satellite (in the example) should be moving. The Kalman filter is an improved version of recursive least squares that handles new measurements while leaving a large part of the computation unchanged.

  • 00:35:00 In this section, the speaker discusses the use of the low rank update formula in solving linear systems. The formula involves perturbing the matrix of a solved problem by rank one, and using the inverse of the original matrix to quickly solve the new problem. This approach can significantly reduce the time required to solve a new problem and is especially useful for large matrices where traditional elimination methods would be time-consuming.

  • 00:40:00 In this section, the speaker explains how to find the solution of a perturbed system by combining solutions to different problems. By factoring A into L times U, all the hard work is done once on the left side, and solving for different right-hand sides only requires back substitution. Using the Sherman-Morrison-Woodbury formula, the answer x is obtained by combining the solutions w (for the right-hand side b) and z (for the right-hand side u). The formula corrects the solution w by a term that comes from Sherman-Morrison-Woodbury, and that correction is a multiple of z.
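
    A sketch of the two-solve idea on a made-up system, using SciPy's LU factorization: factor A once, solve A w = b and A z = u cheaply, then combine w and z by the Sherman-Morrison formula to solve the rank-1 perturbed system.

        import numpy as np
        from scipy.linalg import lu_factor, lu_solve

        np.random.seed(3)
        n = 6
        A = np.random.randn(n, n)
        b = np.random.randn(n)
        u = np.random.randn(n)
        v = np.random.randn(n)

        lu_piv = lu_factor(A)                 # the expensive elimination, done once
        w = lu_solve(lu_piv, b)               # solves A w = b
        z = lu_solve(lu_piv, u)               # solves A z = u

        # Sherman-Morrison: x = w + z (v^T w) / (1 - v^T z)
        x = w + z * (v @ w) / (1 - v @ z)

        print(np.allclose((A - np.outer(u, v)) @ x, b))   # True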

  • 00:45:00 In this section, the speaker discusses how low rank changes in a matrix A can affect its inverse, and provides a formula for inverting an N by N matrix by switching and inverting a K by K matrix. The formula involves subtracting a copy of the inverse and adding a few other pieces, and ultimately results in a rank K change to the original inverse. The speaker notes that this formula has practical applications and encourages viewers to write it down for future reference.

  • 00:50:00 In this section, the speaker discusses the inverse of a K by K matrix and acknowledges the abundance of formulas covered in the lecture. The section concludes by stating that the notes cover some applications and will move on to address other aspects of low rank.
Source: MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Gilbert Strang (www.youtube.com, 2019.05.16).
 

Lecture 15. Matrices A(t) Depending on t, Derivative = dA/dt



15. Matrices A(t) Depending on t, Derivative = dA/dt

This video covers various topics related to matrices, including changes in matrices and their inverse, as well as changes in eigenvalues and singular values over time. The speaker explains key formulas for calculating these changes and emphasizes the importance of understanding calculus in linear algebra. Additionally, the lecture discusses the importance of normalization and explores interlacing theorems for eigenvalues in both symmetric and rank 1 matrices. Lastly, the video concludes with a review of the topics covered and a promise to expand on them in future lectures.

  • 00:00:00 In this section, the speaker discusses the changes in matrices, eigenvalues, and singular values when a matrix changes. The focus is on understanding the formulas for the change in the inverse matrix, the derivative of the inverse, and changes in eigenvalues and singular values when a matrix changes. The speaker explains that while an exact formula for the change in eigenvalues and singular values may not be
    possible, inequalities can still be derived to understand how big the change could be. The lecture also covers the setup: a matrix A(t) that depends on time t, and its inverse as a function of t.

  • 00:05:00 In this section, the speaker presents an identity from calculus that complements the previous discussion of matrix inverses. The formula states that the derivative of the inverse matrix is minus the inverse, times the derivative of the matrix, times the inverse again: d(A inverse)/dt = -A^{-1} (dA/dt) A^{-1}. The speaker derives it by writing an identity for the change in the inverse, dividing both sides by delta t, and letting delta t go to zero. The speaker also expresses an opinion on the emphasis on calculus in college mathematics, stating that it overshadows linear algebra.
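
    A finite-difference sketch of that identity, d(A inverse)/dt = -A^{-1} (dA/dt) A^{-1}, on an arbitrary matrix family A(t) = A0 + t B of my own choosing:

        import numpy as np

        np.random.seed(4)
        n = 4
        A0 = np.random.randn(n, n) + n * np.eye(n)   # keep A(t) safely invertible near t = 0
        B  = np.random.randn(n, n)                   # dA/dt for the family A(t) = A0 + t B

        def inv_at(t):
            return np.linalg.inv(A0 + t * B)

        h = 1e-6
        numerical = (inv_at(h) - inv_at(-h)) / (2 * h)   # centered difference at t = 0
        formula   = -inv_at(0) @ B @ inv_at(0)           # -A^{-1} (dA/dt) A^{-1}

        print(np.max(np.abs(numerical - formula)))       # close to zero (finite-difference accuracy)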

  • 00:10:00 In this section, the speaker explains the meaning of the derivative dA/dt of a matrix A that depends on time t: the ratio delta A divided by delta t has a meaning, and as delta t approaches zero the identity from the previous section becomes the formula for the derivative of the inverse. In the 1 by 1 case this is just the familiar fact that the derivative of 1/x is -1/x squared. These derivative formulas run parallel to the earlier formulas where delta A is full-sized but of low rank. The focus of the lecture then shifts to the eigenvalues lambda and how they change when a matrix changes, with two possibilities: a small (infinitesimal) change and a full-size, order-one change. The lecture ends with facts about eigenvalues and eigenvectors.

  • 00:15:00 In this section, the concept of eigenvectors and eigenvalues for matrices that depend on a parameter is explained. For a matrix A there is the usual eigenvector x with Ax = lambda x, and also a left eigenvector y with y^T A = lambda y^T, which is an eigenvector of A transpose with the same eigenvalue; when A is symmetric the two coincide. The importance of the normalization y transpose times x equals one is emphasized. The speaker then takes the derivative of the resulting formula for lambda and discusses how to work the equation into this new context.

  • 00:20:00 In this section, the speaker explains how the derivative of a matrix can be used to find the derivative of its eigenvalues and eigenvectors as time changes. Using the product rule, they derive a formula for the derivative of the product of three terms that depend on time. By rearranging the terms and applying the diagonalization formula, they arrive at a simple formula for the derivative of the eigenvalue. The speaker notes that while this is a classical technique, it may not always be widely known or taught in courses.

  • 00:25:00 In this section, the speaker discusses the formula for the derivative of an eigenvalue: the rate of change of the matrix sandwiched between the left and right eigenvectors, d lambda/dt = y^T (dA/dt) x. They simplify the expression to show that two terms cancel each other and the remaining term is the correct answer for the derivative, using the fact that the derivative of y^T x = 1 is zero to prove the cancellation. The speaker also mentions that this formula does not involve the derivative of the eigenvector, and that similar reasoning can be pushed to higher-order derivatives.
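
    A numerical sketch of d lambda/dt = y^T (dA/dt) x; for the symmetric family used here the left and right eigenvectors coincide, so the formula becomes x^T (dA/dt) x (the matrices are my own illustration):

        import numpy as np

        np.random.seed(5)
        n = 5
        S0 = np.random.randn(n, n); S0 = (S0 + S0.T) / 2     # symmetric A(0)
        B  = np.random.randn(n, n); B  = (B + B.T) / 2       # symmetric dA/dt

        def top_eig(t):
            vals, vecs = np.linalg.eigh(S0 + t * B)          # eigenvalues in ascending order
            return vals[-1], vecs[:, -1]

        lam0, x = top_eig(0.0)
        h = 1e-6
        numerical = (top_eig(h)[0] - top_eig(-h)[0]) / (2 * h)
        formula   = x @ B @ x                                # x^T (dA/dt) x, with x^T x = 1

        print(numerical, formula)                            # the two numbers agree closely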

  • 00:30:00 In this section, the speaker discusses the change in eigenvalues after a rank-1 change to a symmetric matrix. He notes that the change is now a genuine order-one change, not an infinitesimal one, so there is no exact formula for the new eigenvalues. However, he shares some known facts, such as that the eigenvalues are listed in descending order and that the rank-1 change u u transpose is positive semi-definite. He also asks the audience about the eigenvectors of the u u transpose matrix and points out that it is a full n by n matrix, a column times a row. He concludes by noting that the quadratic form it produces is never negative.

  • 00:35:00 In this section, the speaker discusses a symmetric matrix and what happens when a rank-1 positive semi-definite matrix is added to it. They conclude that the new eigenvalues (lambdas) are at least as large as the original eigenvalues (gammas). However, the eigenvalues cannot jump arbitrarily far: a theorem called "interlacing" guarantees that the new eigenvalues do not pass the old ones above them. Specifically, lambda 1 is bigger than gamma 1, but lambda 2 cannot pass gamma 1. This is a useful theorem that pins down the order of the eigenvalues when a positive rank-1 matrix is added to a symmetric matrix.
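
    A quick numerical illustration of that interlacing, lambda_1 >= gamma_1 >= lambda_2 >= gamma_2 >= ..., for a random symmetric matrix plus u u^T (my own example, not the lecture's):

        import numpy as np

        np.random.seed(6)
        n = 6
        S = np.random.randn(n, n); S = (S + S.T) / 2
        u = np.random.randn(n)

        gamma = np.sort(np.linalg.eigvalsh(S))[::-1]                   # old eigenvalues, descending
        lam   = np.sort(np.linalg.eigvalsh(S + np.outer(u, u)))[::-1]  # new eigenvalues, descending

        print(np.all(lam >= gamma - 1e-12))            # every eigenvalue moves up (or stays put)
        print(np.all(lam[1:] <= gamma[:-1] + 1e-12))   # but lambda_{k+1} never passes gamma_k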

  • 00:40:00 In this section, the professor discusses the eigenvalues after a rank-2 positive semi-definite change to a symmetric matrix. He explains that the change matrix has rank 2, so it has two nonzero eigenvalues, and because it is positive semi-definite, adding it can only push the eigenvalues up. However, the interlacing theorem limits how far they can climb: the new eigenvalues cannot pass the original eigenvalues two positions higher. Applying this to the new alpha values and comparing them to the lambdas, alpha 3 cannot pass lambda 1, while the position of alpha 2 relative to lambda 1 is not pinned down.

  • 00:45:00 In this section, the lecturer explains interlacing of eigenvalues with an example of a symmetric matrix. The reduced version of the matrix (a principal submatrix) also has eigenvalues, and they interlace with the original matrix's eigenvalues. However, the lecturer raises a concern about interlacing when a rank-one change is made: if the rank-one term built from an eigenvector is multiplied by a large number, it can apparently move an eigenvalue way up, which seems to contradict the interlacing theorem. The lecturer leaves this as a question to answer in the next lecture.

  • 00:50:00 In this section, the lecturer discusses eigenvalues and eigenvectors and why one particular eigenvector having an eigenvalue lambda 2 plus 20 does not invalidate the previous statements made. The lecture is concluded with a review of the topics covered and a note to continue the discussion in the next class.
Source: MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Gilbert Strang (www.youtube.com, 2019.05.16).
 

Lecture 16. Derivatives of Inverse and Singular Values


16. Derivatives of Inverse and Singular Values

This video covers a variety of topics including the derivative of the inverse and singular values of a matrix, interlacing, and the nuclear norm of a matrix. The speaker presents a formula for the derivative of singular values, using the SVD, to understand how a matrix changes over time, while establishing bounds for changes in eigenvalues of symmetric matrices. Weyl's inequality is introduced as a way to estimate the eigenvalues of a sum of matrices, and nuclear-norm minimization (the matrix counterpart of basis pursuit) is used in matrix completion problems. The speaker also discusses the idea that the nuclear norm of a matrix comes from a norm that is not quite a norm and introduces the concepts of Lasso and compressed sensing to be discussed in the next lecture.

  • 00:00:00 In this section, the instructor discusses various topics including finding the derivative of the inverse of a matrix, the derivative of an eigenvalue, and the derivative of a singular value. The instructor shares a formula for the derivative of the singular value, which he recently discovered, and mentions that the formula for the derivative of the inverse involves more than just the derivative of the original matrix. He also talks about lab homework, asks for advice on a project, and mentions the upcoming lecture from Professor Townsend on Applied Linear Algebra. The instructor goes on to explain how to systematically find the derivative of the square of a matrix, and why the commonly assumed formula 2A dA/dt is incorrect, because A and dA/dt need not commute.

  • 00:05:00 In this section, the speaker discusses the derivative of singular values, which is similar to the derivative of eigenvalues. The formula is d sigma/dt = u^T (dA/dt) v, where u and v are the left and right singular vectors belonging to sigma. The formula relies on the SVD, which says that A times v equals sigma times u. By using these facts and manipulating the equation, it is possible to derive the formula for the derivative of singular values. This formula is useful in understanding how a matrix changes over time and can be applied in various fields such as physics and engineering.
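
    A numerical sketch of d sigma/dt = u^T (dA/dt) v for the largest singular value, with u and v the corresponding left and right singular vectors (arbitrary illustrative matrices):

        import numpy as np

        np.random.seed(7)
        m, n = 6, 4
        A0 = np.random.randn(m, n)
        B  = np.random.randn(m, n)                    # dA/dt for the family A(t) = A0 + t B

        def top_sv(t):
            U, s, Vt = np.linalg.svd(A0 + t * B)
            return s[0], U[:, 0], Vt[0, :]

        sigma0, u, v = top_sv(0.0)
        h = 1e-6
        numerical = (top_sv(h)[0] - top_sv(-h)[0]) / (2 * h)
        formula   = u @ B @ v                         # u^T (dA/dt) v

        print(numerical, formula)                     # the two values agree closely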

  • 00:10:00 In this section, the speaker derives the formula for the derivative of a singular value. They start from the expression for the singular value in terms of the SVD of the matrix, sigma = u^T A v, and take the derivative of the equation. Using the product rule, they identify the term that gives the answer and then show that the other two terms are zero, which proves that the chosen term is the correct one. The key fact is that u^T u = 1, so the dot product of u with du/dt is zero (and likewise for v).

  • 00:15:00 In this section, the speaker discusses derivatives of singular values and eigenvalues of a symmetric matrix. While an exact formula for a full-size change in singular values or eigenvalues cannot be written down, bounds can be established by recognizing that adding a positive semi-definite change never decreases the eigenvalues. The interlacing of the old and new values is illustrated by the facts that the second new eigenvalue will not exceed the first old eigenvalue, and the first new eigenvalue will not be less than the first old eigenvalue, making these concepts useful in understanding the SVD.

  • 00:20:00 In this section of the video, the speaker poses a puzzle question regarding the effect of hyping up the second eigenvector: add theta times the rank-1 matrix built from it. If theta is large enough, the eigenvalue in that direction can eventually surpass the first eigenvalue, which seems to pose a problem for interlacing. He then explains his thought process and shows that this isn't actually an issue: the first eigenvalue and its eigenvector are unchanged, while the eigenvalue in the second eigenvector's direction becomes exactly lambda 2 plus theta; once it passes lambda 1, it simply becomes the new top eigenvalue and lambda 1 becomes the second one, so the interlacing statement still holds.

  • 00:25:00 In this section, the speaker discusses interlacing and Weyl's inequality. Weyl's inequality is a way of estimating the eigenvalues of a sum of matrices, with the eigenvalues ordered from largest to smallest. The inequality is true for any symmetric matrices and, in its simplest case, states that the largest eigenvalue of the sum of two symmetric matrices is less than or equal to the sum of the largest eigenvalues of each matrix individually; the general statement bounds the eigenvalue with index i + j - 1 of S + T by lambda i of S plus lambda j of T. This property holds not just for rank-1 perturbations, but for perturbations of other ranks as well. The speaker uses the example of adding a positive semi-definite matrix T onto S and explains how this relates to Weyl's inequality.
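
    A numerical check of the simplest cases of Weyl's inequality, lambda_{i+j-1}(S + T) <= lambda_i(S) + lambda_j(T) with eigenvalues ordered from largest to smallest (random symmetric matrices of my own):

        import numpy as np

        np.random.seed(8)
        n = 6
        S = np.random.randn(n, n); S = (S + S.T) / 2
        T = np.random.randn(n, n); T = (T + T.T) / 2

        def eig_desc(M):
            return np.sort(np.linalg.eigvalsh(M))[::-1]

        lS, lT, lSum = eig_desc(S), eig_desc(T), eig_desc(S + T)

        print(lSum[0] <= lS[0] + lT[0] + 1e-12)   # i = j = 1
        print(lSum[1] <= lS[0] + lT[1] + 1e-12)   # i = 1, j = 2
        print(lSum[1] <= lS[1] + lT[0] + 1e-12)   # i = 2, j = 1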

  • 00:30:00 In this section, the speaker discusses Weyl's inequality and how it relates to interlacing. Weyl's inequality gives a bound on how much an eigenvalue can increase, and this fact is crucial for understanding the interlacing phenomenon. The speaker mentions that there are two ways to prove interlacing, one via Weyl's inequality and another method involving a graph. The section also introduces compressed sensing, which will be discussed in the next part of the video.

  • 00:35:00 In this section, the concept of the nuclear norm of a matrix is introduced, which is the sum of the singular values of the matrix. It can be thought of as the matrix analogue of the L1 norm of a vector. It has a special property, similar to the L1 norm, where minimizing the nuclear norm subject to a constraint tends to produce a sparse (low-rank) solution. This property is useful in matrix completion problems, where missing data in a matrix needs to be filled in; the numbers that minimize the nuclear norm are a good choice for the missing entries. The zero norm of a vector, which counts the number of non-zeros, is not a true norm, but it can be relaxed to the nearest true norm, which is the L1 norm, the sum of the absolute values of the components. Minimizing that norm subject to conditions is called basis pursuit, and its matrix analogue is what is used in matrix completion problems.
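
    A small sketch relating the nuclear norm to the singular values, next to the L1 norm of a vector for comparison (the matrix and vector are arbitrary examples):

        import numpy as np

        np.random.seed(9)
        A = np.random.randn(5, 3)

        sing_vals = np.linalg.svd(A, compute_uv=False)
        nuclear   = sing_vals.sum()                           # sum of singular values

        print(np.isclose(nuclear, np.linalg.norm(A, 'nuc')))  # True: matches NumPy's nuclear norm
        # Compare with the L1 norm of a vector: the sum of absolute values.
        x = np.random.randn(5)
        print(np.abs(x).sum(), np.linalg.norm(x, 1))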

  • 00:40:00 In this section, the speaker discusses the idea that the nuclear norm of a matrix comes from a quantity that is not quite a norm. The rank of the matrix plays that role, but it fails to be a norm because it does not scale: doubling the matrix does not double its rank. The speaker goes on to describe the conjecture that the gradient descent algorithm used in deep learning finds the solution of minimum nuclear norm, and introduces the concepts of Lasso and compressed sensing that will be discussed further in the next lecture.
Source: MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Gilbert Strang (www.youtube.com, 2019.05.16).
 

Lecture 17: Rapidly Decreasing Singular Values



Lecture 17: Rapidly Decreasing Singular Values

The lecture focuses on matrices and their ranks, and how rapidly decreasing singular values are prevalent in computational mathematics. The lecturer examines low rank matrices and demonstrates how they have a lot of zeros in their sequence of singular values, making it more efficient to send the matrix to a friend in low rank form than in full rank form. They also introduce the numerical rank of a matrix, which is defined by allowing some wiggle room to define the tolerance of singular values of a matrix. By sampling smooth functions, which can be well-approximated by polynomials, the numerical rank can be low, resulting in a low-rank approximation of the matrix X. The lecture also includes examples of Gaussian and Vandermonde matrices to explain how they can lead to matrices of low rank, and discusses the usefulness of Zolotarev numbers in bounding singular values.

  • 00:00:00 In this section a professor explains why low rank matrices are so prevalent in the world of computational mathematics. He discusses the importance of singular values, which tell us about the rank of a matrix and how well it can be approximated by a low rank matrix. He goes on to explain that a matrix X can be decomposed into a sum of K rank one matrices if it has K nonzero singular values. Additionally, X's column space and row space both have dimension K. The singular value sequence is unique to a matrix, and the focus is on identifying the properties of X that make low rank matrices appear in various mathematical problems.

  • 00:05:00 In this section, the lecturer discusses low rank matrices and how they have a lot of zeros in their sequence of singular values. A low rank matrix is one where it is more efficient to send the matrix to a friend in low rank form than in full rank form. The lecture uses different flags to demonstrate the concept of low rank matrices, with extremely low ranks being highly aligned with the coordinates of the rows and columns. As the rank increases, the alignment gets blurry, and it becomes harder to see if the matrix is low rank. High rank matrices are inefficient to send in low rank form.

  • 00:10:00 In this section, the lecturer examines the triangular flag matrix to understand why diagonal patterns are not good for low rank compression. The matrix of all ones has a property that is similar to Gil's favorite matrix when its inverse is taken. By examining the singular values of this matrix, the lecturer shows that triangular patterns are not amenable to low rank compression. However, the circle case and the Japanese flag pattern are convenient for low rank compression.

  • 00:15:00 In this section, the lecturer discusses the rank of a circular pattern, in particular the Japanese flag. By decomposing the flag into a circle, a rank-one piece in the middle, and a square, the rank can be determined by adding the ranks of the pieces. The lecturer shows that the middle piece has rank at most one, and then uses symmetry to bound the rank of the remaining piece, which depends on the radius of the circle. After some calculations with trigonometry, the lecturer concludes that the rank is roughly half the radius, making it efficient to represent the Japanese flag in low-rank form. However, most matrices in computational math are not of low exact rank but of low numerical rank, which is similar to rank but allows for some approximation.

  • 00:20:00 In this section, we learn about the numerical rank of a matrix, which is defined by allowing some wiggle room: a tolerance epsilon on the singular values. The numerical rank is k if sigma_k is the last singular value above the tolerance and sigma_{k+1} is the first one below it (with epsilon equal to zero this reduces to the ordinary rank). Numerically low-rank matrices include not only genuinely low-rank matrices but also full-rank matrices with rapidly decreasing singular values. This allows us to compress matrices using low-rank approximation while accepting a reasonable tolerance in practice. The Hilbert matrix is an example of a full-rank matrix with low numerical rank.
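
    A sketch of the numerical-rank idea on the Hilbert matrix H_ij = 1/(i + j - 1), which has full rank in exact arithmetic but rapidly decreasing singular values; the tolerance below is an arbitrary illustrative choice:

        import numpy as np

        n = 100
        i = np.arange(1, n + 1)
        H = 1.0 / (i[:, None] + i[None, :] - 1)        # Hilbert matrix, full rank in exact arithmetic

        s = np.linalg.svd(H, compute_uv=False)         # singular values, largest first
        eps = 1e-10                                    # relative tolerance (illustrative)
        numerical_rank = int(np.sum(s > eps * s[0]))

        print(n, numerical_rank)                       # full size 100 versus a much smaller numerical rank
        print(s[:5] / s[0], s[numerical_rank] / s[0])  # rapidly decreasing singular values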

  • 00:25:00 In this section, the lecturer discusses how matrices can be of low numerical rank but not necessarily low rank in general. The Vandermonde matrix is used as a classic example of this. This matrix comes up in polynomial interpolation at real points and is often numerically low rank, making it difficult to invert. However, numerical low rank is not always desirable, particularly when trying to find the inverse. The lecturer explains that the reason why there are so many low-rank matrices is that the world is smooth, which means matrices are numerically low rank. An example is given where a polynomial in two variables is sampled, and it is shown that the resulting matrix is of low rank mathematically with epsilon equals zero.

  • 00:30:00 In this section, the speaker discusses how to get a low-rank approximation for a matrix X by sampling a function and approximating that function by a polynomial. If a polynomial of two variables, with degree m in both x and y, is written down and then sampled, the resulting X has low rank with epsilon equal to zero, since each term samples to a rank-one matrix and there are only on the order of m squared terms. By sampling smooth functions, which can be well approximated by polynomials, the numerical rank can be made low, giving a low-rank approximation of the matrix X. However, this reasoning does not give a satisfactory explanation for the Hilbert matrix, which is full rank.
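
    A sketch of the sampling argument with my own polynomial and grid: sampling a polynomial of degree m in x and y on a large grid gives a matrix whose exact rank stays small, because grouping the terms writes it as a short sum of rank-1 pieces:

        import numpy as np

        np.random.seed(10)
        m = 3                                          # degree in each variable
        C = np.random.randn(m + 1, m + 1)              # coefficients c_ij of sum c_ij x^i y^j

        x = np.linspace(0, 1, 200)
        y = np.linspace(0, 1, 200)
        Vx = np.vander(x, m + 1, increasing=True)      # columns 1, x, x^2, ..., x^m
        Vy = np.vander(y, m + 1, increasing=True)

        X = Vx @ C @ Vy.T                              # X[k, l] = p(x_k, y_l), a 200 by 200 matrix

        print(np.linalg.matrix_rank(X))                # at most m + 1 = 4, far below 200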

  • 00:35:00 In this section, the lecturer discusses how to find a satisfactory reason for bounding the numerical rank of a matrix. Attempts to explain it purely through polynomial approximation have been unsatisfactory. The lecturer introduces matrices that satisfy a Sylvester equation of the form AX - XB = C. If A, B, and C can be found so that the equation holds with a low-rank C (and with A and B suitably separated), the matrix X can be shown to be of low numerical rank. The lecturer gives the Hilbert matrix as an example, multiplying on the left and right by diagonal matrices with half-integer entries so that the Sylvester equation holds with a rank-1 right-hand side.
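
    A small check of a Sylvester-type equation for the Hilbert matrix (the exact normalization used in the lecture may differ): with D = diag(1/2, 3/2, 5/2, ...), the combination D H + H D is the all-ones matrix, whose rank is 1:

        import numpy as np

        n = 8
        i = np.arange(1, n + 1)
        H = 1.0 / (i[:, None] + i[None, :] - 1)        # Hilbert matrix
        D = np.diag(i - 0.5)                           # the "half" shifts on the left and right

        C = D @ H + H @ D                              # Sylvester-style combination
        print(np.allclose(C, np.ones((n, n))))         # True: the right-hand side has rank 1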

  • 00:40:00 In this section, the lecture provides examples of Gaussian and Vandermonde matrices to explain how satisfying such equations leads to matrices of low numerical rank. If X satisfies a Sylvester equation with a low-rank right-hand side, then a bound can be placed on the singular values of X in terms of its norm (for example the Frobenius norm). The resulting bound is used to demonstrate numerical low rank in matrices, with examples given to show the connection between satisfying certain equations and the appearance of these low-rank matrices in practice.

  • 00:45:00 In this section, the lecturer discusses how the abstract problem of singular values being bounded by Zolotarev numbers is useful because many people have previously studied these numbers. The key reason why this is useful is that the sets E and F are separated, and this is what makes the Zolotarev number get small extremely quickly with k. The lecturer uses the Hilbert matrix as an example to show how the Zolotarev number can give a bound on the numerical rank, indicating why there are so many low-rank matrices in computational math. The lecturer also mentions the unofficial curse surrounding the two key people who worked on the Zolotarev problem; both died at the age of 31, which is why there is a question mark next to Pencil's name.
Source: MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Alex Townsend (www.youtube.com, 2019.05.16).
 

Lecture 18: Counting Parameters in SVD, LU, QR, Saddle Points



Lecture 18: Counting Parameters in SVD, LU, QR, Saddle Points

In this lecture, the speaker reviews the big matrix factorizations such as LU, QR, and the eigenvalue-eigenvector factorization, and counts the number of free parameters in each of these matrices. They also discuss the polar decomposition QS versus the SVD and count the number of parameters in the SVD of a rank-R matrix. The lecturer also explains the concept of saddle points in matrices and how to find them using optimization techniques and Lagrange multipliers. Lastly, the lecturer discusses the signs of the eigenvalues of a symmetric matrix and how the Rayleigh quotient can help determine the maximum value and the corresponding eigenvector of the matrix.

  • 00:00:00 In this section, the speaker reviews the big factorizations of a matrix, such as A = LU, A = QR, and the eigenvalue-eigenvector factorization, and counts the number of free parameters in each factor. The speaker notes that the number of free parameters in LU or QR should agree with the number of parameters in the original matrix, and that the eigenvalue and eigenvector matrices' free parameters add up to N squared. The speaker notes that this exercise is not often seen in textbooks but is a useful review for understanding linear algebra.

  • 00:05:00 In this section, the speaker discusses the number of free parameters in different matrix factorizations, including the SVD, LU, QR, and the polar decomposition. The speaker notes that the number of free parameters in an n by n orthogonal matrix Q is n-1 for the first column, n-2 for the second, and so on, because of the normalization and orthogonality conditions. They also count the free parameters in a symmetric matrix S: n(n-1)/2 above the diagonal plus the n diagonal entries. They then show how these counts add up to n squared for the factorizations L times U, Q times R, and Q times S. Finally, they note that Q times S is the polar decomposition, an orthogonal matrix times a symmetric matrix.
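
    A small arithmetic sketch of these counts (my own helper functions; the rank-r SVD count anticipates the bullets below):

        def triangular(n):  return n * (n + 1) // 2      # U or R: diagonal included
        def unit_lower(n):  return n * (n - 1) // 2      # L with 1's on the diagonal
        def symmetric(n):   return n * (n + 1) // 2      # S = S^T
        def orthogonal(n):  return n * (n - 1) // 2      # Q: (n-1) + (n-2) + ... + 1

        n = 7
        print(unit_lower(n) + triangular(n) == n * n)    # A = L U
        print(orthogonal(n) + triangular(n) == n * n)    # A = Q R
        print(orthogonal(n) + symmetric(n)  == n * n)    # A = Q S (polar decomposition)
        print(orthogonal(n) + n == symmetric(n))         # S = Q Lambda Q^T

        m, p, r = 9, 7, 3                                # an m by p matrix of rank r
        svd_count = r + (m * r - r * (r + 1) // 2) + (p * r - r * (r + 1) // 2)
        print(svd_count == r * (m + p - r))              # Sigma + U part + V part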

  • 00:10:00 In this section, the lecturer compares the polar decomposition QS with the SVD and then counts the parameters in the SVD. For a rectangular m by n matrix with m less than or equal to n, the largest possible rank is m, and the count is expected to add up to mn, the number of parameters in the original matrix. The count for Sigma is m, the count for the m by m orthogonal matrix U is (m squared minus m)/2, and the rest of the mn parameters must come from V.

  • 00:15:00 In this section, the speaker finishes counting the parameters in the singular value decomposition. Only the m columns of V that correspond to nonzero singular values matter. To count their parameters, the speaker notes that each orthonormal column of V costs a decreasing number of parameters: n-1 for the first, n-2 for the second, down to n-m for the m-th column. Adding these to the m singular values and the (m squared minus m)/2 parameters of U gives exactly mn, the count for the original matrix.

  • 00:20:00 In this section, the speaker discusses matrices of rank R and the number of parameters they have. Matrices of rank R do not form a subspace, since sums of rank-R matrices can have different ranks; the set is more like a surface with different pieces. The speaker then counts the parameters of a rank-R, M by N matrix: R for Sigma, (N-1) + (N-2) + ... + (N-R) for V, and (M-1) + (M-2) + ... + (M-R) for U, which adds up to R(M + N - R).

  • 00:25:00 In this section of the lecture, the instructor discusses the concept of saddle points in matrices, which are different from maxima and minima. Saddle points arise when optimizing a quadratic cost function subject to linear constraints using Lagrange multipliers. The instructor introduces lambda and shows how it is used in the Lagrangian to form a function that depends on both X and lambda. This function can then be optimized to find any saddle points that may arise. The instructor also mentions another source of saddle points, which arise in matrices that are not positive definite or negative definite.

  • 00:30:00 In this section, the speaker discusses how to find saddle points of a function and shows how they arise in an important class of problems represented by a block matrix. The constrained function has saddle points, not a maximum. Lagrange's contribution to this problem is taking the derivatives of the Lagrangian with respect to x and lambda, producing n and m equations, respectively. The resulting block matrix is not positive definite, and this information can be used to locate saddle points.
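
    A sketch of that block (KKT) matrix with an arbitrary positive definite S and full-row-rank constraint matrix A of my own, showing the mixed signs of its eigenvalues:

        import numpy as np

        np.random.seed(11)
        n, m = 5, 2
        M = np.random.randn(n, n)
        S = M @ M.T + np.eye(n)                        # positive definite n by n block
        A = np.random.randn(m, n)                      # m constraints on n unknowns

        K = np.block([[S, A.T],
                      [A, np.zeros((m, m))]])          # the saddle-point / KKT matrix

        eigs = np.linalg.eigvalsh(K)
        print(np.sum(eigs > 0), np.sum(eigs < 0))      # n positives and m negatives: indefinite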

  • 00:35:00 In this section, the lecturer discusses how the determinant of a matrix can help determine the signs of its eigenvalues. Using a simple example, he shows that if the determinant is negative, there must be eigenvalues of both signs. He then relates this to the KKT matrices used in optimization and argues that they are generally indefinite, but they have a positive definite block associated with them. He demonstrates that, when using block elimination on this positive definite block, the first n pivots are all positive, while the remaining pivots produced by the rest of the elimination are negative, which leads to the conclusion that the KKT matrices have both positive and negative eigenvalues.

  • 00:40:00 In this section, the lecturer discusses saddle points and how they relate to constraints. He explains how to determine the signs of the eigenvalues of a symmetric matrix from the signs of its pivots. The lecturer also defines the Rayleigh quotient and reviews how it can be used to determine the maximum value and the corresponding eigenvector of a symmetric matrix. The lecture ends with the observation that any vector plugged into the Rayleigh quotient gives a value no larger than the maximum eigenvalue.
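
    A short sketch of the Rayleigh quotient bound for a random symmetric matrix: R(x) = x^T S x / x^T x always lies between the smallest and largest eigenvalues, and reaches the maximum at the top eigenvector:

        import numpy as np

        np.random.seed(12)
        n = 6
        S = np.random.randn(n, n); S = (S + S.T) / 2

        def rayleigh(x):
            return (x @ S @ x) / (x @ x)

        vals, vecs = np.linalg.eigh(S)                 # ascending eigenvalues
        samples = [rayleigh(np.random.randn(n)) for _ in range(1000)]

        print(vals[0] <= min(samples), max(samples) <= vals[-1])   # True True
        print(np.isclose(rayleigh(vecs[:, -1]), vals[-1]))         # maximum attained at top eigenvector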

  • 00:45:00 In this section, the speaker discusses the concept of saddle points in the Rayleigh quotient. The intermediate lambdas between the minimum and the maximum are the difficult ones to handle; at the maximum and the minimum, the quotient values are easy to identify. If any vector in n dimensions is selected, we can compute R of x, which lies between the minimum and the maximum. The speaker says that the details of saddle points will be saved for the next lecture, but before that, the third lab will be given, which teaches about overfitting and deep learning and is due after the break.
Source: MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Gilbert Strang (www.youtube.com, 2019.05.16).
 

Lecture 19. Saddle Points Continued, Maxmin Principle



19. Saddle Points Continued, Maxmin Principle

In this video, the speaker continues discussing saddle points and how to find minimum and maximum values using the Rayleigh quotient over two-dimensional subspaces. The maxmin idea is explained, which involves writing saddle-point values as the maximum of a minimum in order to bring back the tools for maxima and minima, and it leads to the interlacing theorem. The speaker also warns against overfitting when fitting data with a high-degree polynomial and discusses two open-ended labs for the class, involving saddle points and a simple neural network. The concepts of mean and variance in statistics and of sample variance and covariance are explained, with the speaker noting that the covariance matrix for totally dependent outputs is singular (not invertible), and that in polling scenarios with multiple people living in one house some covariance is expected, since the responses are not entirely independent.

  • 00:00:00 In this section, the speaker discusses the importance of understanding saddle points in relation to finding the minimum of the cost total function in deep learning. They provide an example of a Rayleigh quotient and a simple matrix S to illustrate the main facts of saddle points, the maximum and minimum values of the function, and the presence of a saddle point. The speaker also mentions their plans to discuss lab three, projects, and basic statistics, particularly the covariance matrix.

  • 00:05:00 In this section, the speaker discusses saddle points and how to find the minimum and maximum values by loading everything onto one variable and computing the derivatives to find where they are equal to zero. They demonstrate how to find the minimum value and show that the matrix's eigenvectors and eigenvalues help find the saddle point's location and value. The speaker also talks about how to compute the second derivatives and the symmetric matrix. They emphasize the importance of computing the saddle point values and suggest working with codes and being mindful of the process.

  • 00:10:00 In this section, the speaker discusses the idea of saddle points and how to write them as the maximum of a minimum in order to get back to maxima and minima quickly. He explains that this leads to the interlacing theorem and gives an example of taking the minimum of the Rayleigh quotient over a 2-dimensional subspace. By taking the maximum of that minimum over all 2-dimensional subspaces, he obtains lambda 2, the saddle-point value.

  • 00:15:00 In this section, the speaker explains how to find the maxmin value using the Rayleigh quotient over two-dimensional subspaces. He demonstrates that the answer is three by exhibiting a particular 2D subspace whose minimum is three, and by showing that for any other subspace the minimum is at most three, so the maximum of the minimums is exactly three. The concept of saddle points is also discussed, with the speaker noting that these points occur as the highest points of certain regions: they can be maxima of minima or minima of maxima. The video concludes with a discussion of projects and an invitation for viewers to ask questions about them.

  • 00:20:00 In this section, the speaker explains a model of overfitting in which a polynomial of degree 5 is used to fit 6 points. The speaker points out that the 5th degree polynomial is an exact fit to the data points, but it is a flawed model because it swings wildly between the points and follows the noise rather than the underlying trend. This example serves as a warning against overfitting, which occurs when a model is too complex and too closely fit to the training data.

  • 00:25:00 In this section, the speaker discusses the problem of fitting data with a polynomial of high degree. While fitting a straight line may result in underfitting, fitting a high-degree polynomial can lead to overfitting, since it creates a perfect fit to all the given data points, noise included. The perfect fit comes from inverting the Vandermonde matrix, whose inverse has huge entries and therefore produces a giant coefficient vector. The matrix has a wide range of singular values, with tiny values occurring alongside ordinary-sized ones. It can therefore be challenging to find the right degree of polynomial that strikes a balance between underfitting and overfitting.
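
    A sketch of the overfitting example with my own data (roughly a line plus noise): the degree-5 polynomial hits all six points exactly, but the Vandermonde matrix behind it is badly conditioned and the coefficients blow up, while the straight-line fit stays tame:

        import numpy as np

        np.random.seed(13)
        x = np.linspace(0, 1, 6)
        y = 2 * x + 1 + 0.2 * np.random.randn(6)       # a line plus noise

        line  = np.polyfit(x, y, 1)                    # underfits at worst, but stable
        exact = np.polyfit(x, y, 5)                    # degree 5 through 6 points: a perfect fit

        V = np.vander(x, 6)                            # the Vandermonde matrix behind the exact fit
        print(np.linalg.cond(V))                       # noticeably large condition number
        print(np.abs(line).max(), np.abs(exact).max()) # modest vs. typically much larger coefficients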

  • 00:30:00 In this section, the speaker describes two examples of open-ended labs for his class, one involving saddle points and the other involving a simple neural network. For the saddle point example, the speaker suggests submitting plots and tables of data to Gradescope, and drawing conclusions about the safety and risk of increasing K. Regarding the neural network example, the speaker outlines a basic classification problem and encourages students to modify the model as they see fit, while still using linear algebra. The speaker also mentions a forthcoming faculty meeting about MIT's plans for courses in computational thinking, of which this course is one example. Finally, the speaker invites students to email him with rough project ideas and group preferences.

  • 00:35:00 In this section, the professor discusses the idea of a project for the class and clarifies its scope. He mentions that the project wouldn't be too large, maybe equivalent to three homework assignments, but not trivial either. He asks the students for their questions and input on the project, suggesting the possibility of including topics such as convolutional neural nets. The professor also mentions that some students had initiated a meeting at the Media Lab, and it had taken place successfully. He asks if people would be interested in such meetings again after the spring break.

  • 00:40:00 In this section, the speaker introduces the concepts of mean and variance in statistics, how they relate to actual output and expected output, and the difference between the sample mean and the expected mean. The sample mean is calculated from the actual outputs of an experiment, while the expected mean is calculated from the probabilities of the possible outcomes. The variance is also discussed, with the sample variance and the expected variance being distinguished. The speaker explains that the sample mean and sample variance approach the expected values as the number of samples increases.

  • 00:45:00 In this section, the concept of sample variance is discussed, which measures the average squared distance from the mean of a set of n samples. In statistics the sum is divided by n minus one rather than n, because the distances are measured from the sample mean rather than from the true mean; when n is large, the difference between n and n minus one is not significant. Covariance, on the other hand, is a deeper idea that involves matrices when multiple experiments are performed at once, and the joint probability of two separate outcomes is what enters.

  • 00:50:00 In this section, the speaker discusses the two extremes of covariance: independent outputs and totally dependent outputs. Independent outputs have covariance 0, while totally dependent outputs have the maximum possible covariance, where one output is entirely determined by the other. The speaker uses the example of flipping coins glued together to explain this concept. The covariance matrix is always symmetric positive semi-definite; in the glued-together (totally dependent) case it is singular, so it is only semi-definite and not positive definite. The speaker mentions that in polling scenarios where multiple people live in one house, some covariance is expected, since their answers are not entirely independent.
Source: MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Gilbert Strang (www.youtube.com, 2019.05.16).
 

Lecture 20. Definitions and Inequalities



20. Definitions and Inequalities

In this section of the video, the speaker discusses various concepts in probability theory including expected value, variance, and covariance matrices. Markov's inequality and Chebyshev's inequality are introduced as fundamental tools for estimating probabilities, and the speaker explains the relationship between them, showing how Markov's inequality leads to Chebyshev's. The concept of covariance and the covariance matrix, a fundamental tool in probability theory, is also introduced. The video also explores joint probabilities and tensors, explaining how gluing coins together adds dependence and alters the probabilities. Finally, the speaker discusses the properties of the covariance matrix, emphasizing that it is always positive semi-definite and is a combination of rank-1 positive semi-definite matrices.

  • 00:00:00 In this section, the lecturer discusses expected value, variance, and covariance matrix. Expected value, symbolized as 'e,' is defined as the weighted average of all possible outcomes based on their probabilities. Variance, on the other hand, is the expected value of the square of the distance between the mean and each data point. The covariance matrix can also be expressed in a similar manner. The lecturer then explores a second expression for variance by writing out the squares and combining them differently, resulting in a more efficient way to compute variance.

  • 00:05:00 In this section, the speaker works through the algebra of simplifying the expression for the variance. He shows that the expected value of (x minus m) squared is the same as the expected value of x squared minus m squared, that is, the sum of p_i times x_i squared minus the mean squared. The speaker then introduces the Markov inequality, a statistical inequality that involves probabilities and expectations. He notes that Markov was a great Russian mathematician and that Markov chains and processes will appear later in the book.

  • 00:10:00 In this section, the speaker explains Markov's inequality, which helps estimate the probability that X is greater than or equal to a certain number. The inequality states that the probability of X being greater than or equal to a is at most the mean of X divided by a. The speaker gives an example with mean one and a equal to three, showing that the probability of X being greater than or equal to three is at most 1/3. However, the speaker notes that this inequality only applies to non-negative random variables, and cannot be used when the outputs range from negative to positive infinity.
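
    A quick simulation sketch of Markov's inequality for a nonnegative variable with mean 1 (the exponential distribution is my own choice, not the lecture's example):

        import numpy as np

        np.random.seed(14)
        X = np.random.exponential(scale=1.0, size=1_000_000)   # nonnegative, mean 1

        a = 3.0
        observed = np.mean(X >= a)
        bound    = X.mean() / a                                 # Markov: P(X >= a) <= E[X] / a

        print(observed, bound)     # roughly 0.05 observed, comfortably below the bound near 1/3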

  • 00:15:00 In this section of the video, the speaker talks about using a special case to demonstrate the probability of being greater than or equal to 3. They use the definition of the mean to write out a specific equation and then make assumptions about the values of X1 through X5 to satisfy the Markov inequality. They state the fact that the probabilities add up to 1 and are all greater than or equal to 0. The speaker then proceeds to manipulate the equation to show that the probability of being greater than or equal to 3 is less than or equal to 1/3 by subtracting certain values from the equation. They conclude by showing that the equation satisfies the Markov inequality.

  • 00:20:00 In this section, the speaker discusses the Markov and Chebyshev inequalities in probability. Markov's inequality involves estimating the probability of a variable being greater than or equal to a certain value, and it only applies when the variables are all greater than or equal to zero. Chebyshev's inequality, on the other hand, deals with the probability of a variable being a certain distance away from the mean, and it does not make any assumptions about the inputs. These two inequalities are fundamental tools for estimating probabilities in probability theory.

  • 00:25:00 In this section, the speaker explains the relationship between Markov's inequality and Chebyshev's inequality. He introduces a new variable Y, equal to (X minus m) squared, and explains how to calculate its mean. Applying Markov's inequality to Y then gives exactly Chebyshev's inequality for X. Finally, he introduces the concept of covariance and covariance matrices.

  • 00:30:00 In this section, the speaker introduces the concept of covariance and covariance matrix, which is an M by M matrix where M is the number of experiments being done at once. To illustrate this concept, the speaker uses the example of flipping two coins with one output (X) per coin. If the two coins are flipped independently, then there is no correlation between the outputs, but if they are glued together, then the outputs are correlated and the joint probabilities are put into a 2x2 matrix.

  • 00:35:00 In this section, the speaker discusses the concept of joint probabilities and matrices for experimental setups involving independent coins. They explore the idea of a three-way structure, or tensor, in cases where there are three experiments with independent fair coins or when coins are glued together. The resulting entries in the tensor would be the joint probabilities, which can be used to calculate the probability of different outcomes. The speaker notes that while the entries in a simple case of an unglued experiment are one eighth, gluing the coins together adds dependence and alters the probabilities.

  • 00:40:00 In this section of the video, the speaker discusses the joint probability of flipping three coins and how it can be represented in a 3-way matrix. He mentions the concept of tensors and covariance matrices, defining the latter as the variance of the joint outcome of two experiments, X and Y, expressed as a summation of all possible outcomes. The speaker also explains the symbol P IJ and how it relates to gluing and ungluing coins together in different configurations.

  • 00:45:00 In this section of the video, the speaker discusses the joint probability of two events - X and Y - and how to calculate this probability for different pairs of values. The speaker provides examples of how to use joint probability, including calculating the probability of a certain age and height. The speaker also defines the marginal probabilities, which are the individual probabilities of each event, and explains how to add up the probabilities along rows or columns in a matrix. The speaker then goes on to define the covariance matrix and explains how to calculate its entries.

  • 00:50:00 In this section, the speaker talks about the covariance matrix and its properties. The variance of the X experiment is obtained by summing the joint probabilities P_ij times the squared distance of X from its mean, and the variance of the Y experiment gives the sigma Y squared entry in the same way. The covariance between X and Y is the sum of the P_ij's times the distance of X from its mean times the distance of Y from its mean. For independent coins the covariance is zero, whereas for glued coins it equals sigma X times sigma Y. The determinant of the matrix is zero in the glued-coins case, which shows that the squared covariance equals sigma X squared times sigma Y squared. The covariance matrix is always positive semi-definite, because it is a combination of rank-1 positive semi-definite matrices; in non-degenerate cases it is positive definite.
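
    A sketch of the two extreme cases, computed directly from the joint probability table; the 0/1 coin outputs and the helper function are my own illustration:

        import numpy as np

        def covariance_matrix(P, xvals, yvals):
            """Covariance matrix of (X, Y) from a joint probability table P[i, j]."""
            x, y = np.array(xvals, float), np.array(yvals, float)
            mx = (P.sum(axis=1) * x).sum()
            my = (P.sum(axis=0) * y).sum()
            vx = (P.sum(axis=1) * (x - mx) ** 2).sum()
            vy = (P.sum(axis=0) * (y - my) ** 2).sum()
            cxy = (P * np.outer(x - mx, y - my)).sum()
            return np.array([[vx, cxy], [cxy, vy]])

        independent = np.array([[0.25, 0.25], [0.25, 0.25]])   # two separate fair coins
        glued       = np.array([[0.5, 0.0], [0.0, 0.5]])       # glued coins: always the same output

        for P in (independent, glued):
            V = covariance_matrix(P, [0, 1], [0, 1])
            # determinant 0 and a zero eigenvalue in the glued case; always semi-definite
            print(V, np.linalg.det(V), np.linalg.eigvalsh(V).min())
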
Source: MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Gilbert Strang (www.youtube.com, 2019.05.16).
 

Lecture 21: Minimizing a Function Step by Step



Lecture 21: Minimizing a Function Step by Step

This video lecture discusses the basic algorithms used for minimizing a function and their convergence rates, particularly Newton's method and steepest descent. It also highlights the importance of convexity, which ensures that the function has one minimum, and introduces the concept of convex sets and convex functions. The lecturer explains how to test for convexity in a function, which determines whether it has saddle points or local minimums, as opposed to a global minimum. The video concludes with a discussion of Levenberg Marquardt, a cheaper version of Newton's method that is not fully second-order.

  • 00:00:00 In this section, the lecturer discusses the basics of optimization, which is the fundamental algorithm that goes into deep learning. The lecture starts by explaining the Taylor series and moves on to show how to extend the Taylor series when the function is of more than one variable. The lecturer then introduces the gradient of F, which is the partial derivatives of F with respect to each X variable. Finally, the quadratic term is explained, and the lecture ends by discussing the second derivatives and how they change with more variables.

  • 00:05:00 In this section of the lecture, the concept of the Hessian matrix is introduced, which is the matrix of second derivatives of a function. The Hessian matrix is symmetric and its computation is feasible for small to moderately large values of n. There is a parallel picture for a vector-valued function: the Jacobian matrix, whose entries are the derivatives of each component of the function with respect to each variable. These are facts of multivariable calculus that are used to solve equations in optimization problems.

  • 00:10:00 In this section, the lecturer discusses Newton's method for solving systems of n equations in n unknowns, which also arises when minimizing a given function. Newton's method is the best way to solve n equations in n unknowns, written as F(x) = 0, meaning f_1(x) = 0 through f_n(x) = 0. The lecturer shows how to use Newton's method on the equation x squared minus 9 equals 0, written as a function, and demonstrates how to apply the method step by step.

  • 00:15:00 In this section, the lecturer discusses how Newton's method is used to minimize a function and how to determine how quickly it converges. They start by simplifying the formula that determines x_{k+1} and show that if x_k is exactly 3, then x_{k+1} will also be 3. They then focus on how fast the error approaches zero: subtracting 3 from both sides and factoring out 1/(2 x_k) shows that the error at step k+1 is proportional to the square of the error at step k. This squaring of the error at every step is why Newton's method is fantastic if it is started near enough to the solution.
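
    A sketch of that example: Newton's method for f(x) = x^2 - 9 is x_{k+1} = x_k - (x_k^2 - 9) / (2 x_k), and the error roughly squares at every step (the starting point is an arbitrary choice):

        x = 4.0                      # arbitrary starting point near the root x* = 3
        for k in range(6):
            error = x - 3.0
            print(k, x, error)       # watch the number of correct digits roughly double
            x = x - (x * x - 9.0) / (2.0 * x)    # Newton step; equivalently x = (x + 9/x) / 2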

  • 00:20:00 In this section, the lecturer discusses using Newton's method for optimization and how it applies to very complicated loss functions with thousands or even hundreds of thousands of variables. The lecture covers two methods, steepest descent and Newton's method. Steepest descent moves in the direction of the negative gradient of F, with freedom to decide the step size, while Newton's method takes the second derivative of F into account and allows for faster convergence, but it can also converge to undesirable solutions or blow up for certain starting points. This leads to the concept of regions of attraction, where certain starting points lead to the desired solution, while others lead to undesirable ones or to infinity.

  • 00:25:00 In this section, the lecturer discusses two methods for minimizing a function step by step: steepest descent and Newton's method. Both involve iteratively choosing a direction in n-dimensional space and moving a certain distance along that direction, but steepest descent uses the gradient of the function to choose the direction, while Newton's method uses the Hessian, or second derivative. The lecture also explains the concept of an exact line search and the importance of choosing an appropriate learning rate in these methods.

  • 00:30:00 In this section, the lecturer discusses the basic algorithms used for minimizing a function and their convergence rates. The lecturer explains that Newton's method has a quadratic convergence rate, making it super fast if started close enough. In contrast, the steepest descent algorithm has a linear convergence rate, making it less efficient. The lecturer emphasizes that the starting point for solving these problems should be convexity, which ensures that the function has one minimum. The lecturer defines convex sets and functions and explains their significance in minimizing a function for points in a convex set. The lecture concludes with a discussion of Levenberg Marquardt, a cheaper version of Newton's method that is not fully second-order.

  • 00:35:00 In this section of the video, the speaker discusses how to minimize a function. The constraints for the function are defined by a convex set, which means that any line drawn between two points within the set must stay within the set. The speaker gives the example of two overlapping triangles, which do not form a convex set when combined.

  • 00:40:00 In this section, the concepts of convex sets and convex functions are introduced. It is noted that the intersection of two convex sets is always convex, and the empty set counts as a convex set. The notes in the video highlight the importance of these concepts when minimizing functions, since the prototype problem is to minimize a function whose picture is convex. The video also connects the definition of a convex function to the definition of a convex set: the graph of a convex function resembles a bowl, and while the set of points on that bowl-shaped surface is not a convex set, the set of points on or above the graph is convex.

  • 00:45:00 In this section of the lecture, the speaker discusses a test for convex functions. Starting from two convex functions, one can form their pointwise minimum and their pointwise maximum; the maximum is convex, while the minimum is not, because it picks up a kink where the two functions cross. The speaker also mentions that this extends to a maximum of, say, 1500 functions: if all 1500 are convex, their pointwise maximum is also convex.
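
A tiny numerical illustration of that test (my own example functions): the pointwise maximum of two convex functions passes the midpoint convexity inequality, while the pointwise minimum picks up a kink and fails it.

```python
f1 = lambda x: (x - 1.0)**2          # convex
f2 = lambda x: (x + 1.0)**2          # convex
fmax = lambda x: max(f1(x), f2(x))   # still convex
fmin = lambda x: min(f1(x), f2(x))   # has a kink at x = 0, not convex

def violates_convexity(f, a=-1.0, b=1.0):
    """Check the midpoint inequality f((a+b)/2) <= (f(a) + f(b))/2 on one chord."""
    return f((a + b) / 2) > (f(a) + f(b)) / 2 + 1e-12

print(violates_convexity(fmax))   # False: the maximum passes the test on this chord
print(violates_convexity(fmin))   # True : the minimum fails it (the kink at x = 0)
```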

  • 00:50:00 In this section, the speaker explains how to test a function for convexity. For a function of one variable, convexity can be checked in calculus by verifying that the second derivative is positive or zero. For a function of several variables, the second derivatives form a symmetric matrix, the Hessian, and the test becomes that the Hessian must be positive semidefinite. Convex problems have no saddle points and no spurious local minima, only the global minimum, which is what makes them desirable.
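
That second-derivative test can be checked numerically: build the Hessian and look at its eigenvalues. A short sketch with two made-up quadratics whose Hessians are written down directly:

```python
import numpy as np

def hessian_eigs(hess):
    """Eigenvalues of a symmetric Hessian; all >= 0 means positive semidefinite (convex)."""
    return np.linalg.eigvalsh(hess)

# f(x, y) = x^2 + x*y + y^2 : Hessian [[2, 1], [1, 2]], eigenvalues 1 and 3 -> convex
print(hessian_eigs(np.array([[2.0, 1.0], [1.0, 2.0]])))

# g(x, y) = x^2 - y^2 : Hessian [[2, 0], [0, -2]] has a negative eigenvalue -> saddle, not convex
print(hessian_eigs(np.array([[2.0, 0.0], [0.0, -2.0]])))
```
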
Lecture 21: Minimizing a Function Step by Step
  • 2019.05.16
  • www.youtube.com
MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Gilbert Strang. View the complete course: https://o...
 

Lecture 22. Gradient Descent: Downhill to a Minimum



22. Gradient Descent: Downhill to a Minimum

In the video, "Gradient Descent: Downhill to a Minimum," the speaker discusses the importance of gradient descent in optimization and deep learning, where the goal is to minimize a function. The speaker introduces the gradient and the Hessian, and illustrates the steps of steepest descent using a quadratic function. The speaker also discusses how to interpret the gradient and the Hessian, as well as their role in measuring convexity. The speaker delves into choosing the appropriate learning rate, stressing the importance of the condition number in controlling the speed of convergence. The video also provides practical examples and formulas to help understand the concept of gradient descent, including the heavy ball method.

  • 00:00:00 In this section, the speaker discusses gradient descent as the central algorithm in neural nets, deep learning, machine learning, and optimization in general. The goal is to minimize a function, and if there are too many variables to take second derivatives, the focus is on first derivatives of the function. The speaker introduces the idea of gradient and Hessian, and the role of convexity before diving into a crucial example of a pure quadratic function with two unknowns. Through the example, the speaker demonstrates the steps of steepest descent, and how quickly they converge to the answer, which is the minimum point. The speaker also explains the importance of the condition number in the speed of convergence, and how to interpret and compute the gradient of a function.

  • 00:05:00 In this section, the speaker explains how to interpret the gradient and the Hessian of a surface. Using the example of a surface whose gradient is constant and whose Hessian is zero because all second derivatives vanish, the speaker illustrates how to visualize the surface and how to read the gradient and Hessian in terms of the direction of steepest ascent or descent and the level sets. The speaker emphasizes that the Hessian matrix of second derivatives tells us about the shape of a surface and how quickly it changes in different directions.

  • 00:10:00 In this section, the concept of the Hessian is introduced as a tool for measuring the convexity of a function. The Hessian tells us whether a surface is convex or not, with a positive semidefinite or positive definite Hessian indicating convexity. A linear function is convex but not strictly convex, while a strictly convex function bends upwards. An example of a strictly convex function is given, namely f(x) = 1/2 x transpose S x with a positive definite matrix S; its gradient is Sx, and its minimum is at the point where that gradient is zero.

  • 00:15:00 In this section, the speaker discusses finding the minimum of a quadratic function using gradient descent. The minimum is reached at the point where the gradient is zero, and that point is denoted argmin. The speaker emphasizes that the argmin, the point where the minimum is reached, is different from the minimum value of the function, and that the focus is usually on finding the argmin rather than the minimum value itself. In this particular example the minimum value is zero because there is no linear term.

  • 00:20:00 In this section, the speaker discusses the fundamental minimization question of finding the minimum of a quadratic function. The function passes through zero and bottoms out at a certain point, and plugging that point back in gives its lowest level. The speaker then mentions a remarkable convex function, a function of a matrix with n squared variables, and notes that the convexity is what really makes things work.

  • 00:25:00 In this section, the speaker discusses a convex function obtained by taking the determinant of a matrix and then its logarithm with a negative sign. The resulting function, minus log det X, is convex, and the partial derivatives of log det X with respect to the entries of X turn out to be the entries of X inverse. The speaker then delves into the derivative of the determinant with respect to the matrix entries, stressing the importance of being able to calculate such derivatives in gradient descent algorithms.
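
That derivative fact is easy to sanity-check with finite differences on a random positive definite matrix; a minimal sketch, not from the lecture (for minus log det X the signs simply flip):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
X = A @ A.T + 3 * np.eye(3)              # a positive definite test matrix

logdet = lambda M: np.linalg.slogdet(M)[1]

h = 1e-6
G = np.zeros_like(X)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(X); E[i, j] = h
        G[i, j] = (logdet(X + E) - logdet(X - E)) / (2 * h)   # d(log det X)/dX_ij

print(np.round(G, 6))
print(np.round(np.linalg.inv(X).T, 6))   # matches: the gradient is X inverse (transposed)
```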

  • 00:30:00 In this section, the speaker explains the determinant and its fundamental property: it is linear in row 1 when the other rows are held fixed. He also goes through the cofactor expansion of the determinant and connects it to the fact that the gradient of log det X is given by the entries of X inverse. The speaker then introduces gradient descent and writes its formula, x_{k+1} = x_k minus s_k times the gradient of f at x_k; the only input left to decide is the step size s_k.

  • 00:35:00 In this section, the instructor discusses the importance of choosing the appropriate learning rate in gradient descent. If the learning rate is too large, the function will oscillate and be difficult to optimize. On the other hand, if the learning rate is too small, it will take too much time for the algorithm to converge. One way to choose the optimal learning rate is through an exact line search, but this can be time-consuming for large problems. Instead, people typically estimate a suitable learning rate and adjust it as needed through backtracking line search. The instructor emphasizes the importance of the condition number in controlling the speed of convergence and asks the question of how much an exact line search would reduce the function.
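
A compact sketch of the backtracking idea described here (Armijo-style sufficient decrease, with my own parameter choices and a made-up quadratic): start from an optimistic step and halve it until the function actually drops enough.

```python
import numpy as np

def backtracking_step(f, grad_f, x, s0=1.0, shrink=0.5, c=1e-4):
    """Shrink the step size until the Armijo sufficient-decrease condition holds."""
    g = grad_f(x)
    s = s0
    while f(x - s * g) > f(x) - c * s * (g @ g):
        s *= shrink                       # step too large: the function did not drop enough
    return x - s * g, s

# Illustrative quadratic (not the lecture's numbers): f(x) = 1/2 x^T S x
S = np.diag([1.0, 100.0])
f = lambda x: 0.5 * x @ S @ x
grad_f = lambda x: S @ x

x = np.array([1.0, 1.0])
for k in range(5):
    x, s = backtracking_step(f, grad_f, x)
    print(f"accepted step size {s:.4f}, new point {x}")
```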

  • 00:40:00 In this section, the speaker works through an example to get a better understanding of gradient descent. A particular function is introduced for which the exact answers are known, so the computed iterations can be compared with the true ones. Starting at a point on the surface of this function, the speaker applies the gradient descent formula and computes the iterations, arriving at a clean formula that is taken as the best possible example for understanding gradient descent.

  • 00:45:00 In this section, the speaker discusses how the ratio (1-B)/(1+B) is crucial in determining the speed of convergence during gradient descent. If B is close to zero, the ratio is close to one, which leads to slow convergence, and if B is close to one, the ratio is close to zero, which leads to fast convergence. The speaker uses the example of level sets and ellipses to explain how the narrow valley can cause slow convergence when approaching the minimum. The speaker emphasizes the importance of a good condition number for optimization.
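
A small numerical check of that ratio (a sketch of mine, using the exact line search that has a closed form for a quadratic): on f(x, y) = 1/2 (x^2 + b y^2) the steepest-descent error shrinks by roughly (1 - b)/(1 + b) per step, so b near 1 (round level sets) is fast and b near 0 (a narrow valley) is painfully slow.

```python
import numpy as np

def steepest_descent(b, x0, steps=20):
    """Exact-line-search steepest descent on f(x, y) = 0.5*(x^2 + b*y^2)."""
    S = np.diag([1.0, b])
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        g = S @ x                          # gradient
        s = (g @ g) / (g @ S @ g)          # exact line-search step for a quadratic
        x = x - s * g
    return x

for b in [0.9, 0.1, 0.01]:
    x = steepest_descent(b, [b, 1.0], steps=20)
    print(f"b = {b:5.2f}: ratio (1-b)/(1+b) = {(1 - b)/(1 + b):.3f}, "
          f"size of x after 20 steps = {np.linalg.norm(x):.2e}")
```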

  • 00:50:00 In this section, the speaker discusses how gradient descent approaches the minimum along a zigzag trajectory. He emphasizes that the multiplier (1 - b)/(1 + b) plays the critical role: for a convex function, this quantity determines the convergence of steepest descent. The next lecture will discuss momentum, or the heavy ball method, which adds an extra term so the movement can accelerate instead of being redirected by steepest descent at every point; the idea is to let the momentum of a heavy ball take over and roll it down the valley, as it would in real life.
22. Gradient Descent: Downhill to a Minimum
  • 2019.05.16
  • www.youtube.com
MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Gilbert Strang. View the complete course: https://o...
 

Lecture 23. Accelerating Gradient Descent (Use Momentum)



23. Accelerating Gradient Descent (Use Momentum)

This video discusses the concept of momentum in accelerating gradient descent. The presenter explains the basic gradient descent formula and shows how adding momentum can result in faster descent than the ordinary method, ultimately yielding significant improvements. They also discuss a continuous model of steepest descent and explain how it can be analyzed as a second-order differential equation with a momentum term. The presenter emphasizes choosing the step size s and the momentum coefficient beta so that both eigenvalues of the iteration matrix are as small as possible, since the largest one controls convergence. They also discuss Nesterov's method and suggest that further improvements might come from going back two or three steps or more.

  • 00:00:00 In this section, the speaker discusses the basic gradient descent formula: the new point is the old point minus the step size times the gradient at x_k, so the step is taken in the direction of the negative gradient, the direction of descent. Adding momentum to damp the zigzags of gradient descent results in faster descent than the ordinary method. There is also an alternative acceleration developed by the Russian mathematician Nesterov. For machine learning problems with hundreds of thousands of variables, stochastic gradient descent is used, where a mini-batch of training data is chosen randomly or systematically and one step is taken per batch of training samples.

  • 00:05:00 In this section, the speaker discusses the steepest-descent direction and the level sets for a model problem, where setting the function of x and y equal to a constant gives ellipses. They explain that the optimal stopping point along a search direction is where that line is tangent to a level-set ellipse, just before the function starts going up again. The speaker introduces the momentum term to improve the steepest descent formula and tracks the descent with its zigzag pattern, showing the improvement in the relevant eigenvalues. The speaker concludes that the expression with momentum is a miracle and yields significant improvements.

  • 00:10:00 In this section of the video, the speaker discusses the use of momentum in accelerating gradient descent. The factor that controls how fast the error decays improves from (1 - b)/(1 + b) without momentum to (1 - square root of b)/(1 + square root of b) with momentum; for the speaker's example of b = 1/100, that is roughly 0.98 versus 0.82 per step. The new x is the old x minus the step size times a quantity z that gives the method a little memory: instead of taking z to be just the gradient, which would be steepest descent, the speaker adds on a multiple beta of the previous z, the previous search direction.
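
Writing that update out as code (a sketch of the heavy-ball form described here; the quadratic, the starting point, and the parameter recipe below are my own illustration): keep a running search direction z that remembers a multiple beta of the previous direction.

```python
import numpy as np

def heavy_ball(grad, x0, s, beta, steps=100):
    """Momentum form described here: z_{k+1} = grad(x_k) + beta*z_k, x_{k+1} = x_k - s*z_{k+1}."""
    x = np.array(x0, dtype=float)
    z = np.zeros_like(x)
    for _ in range(steps):
        z = grad(x) + beta * z        # new search direction keeps a multiple of the old one
        x = x - s * z                 # step along the combined direction
    return x

# Illustrative quadratic with eigenvalues m = 1/100 and M = 1 (the lecture's b = 1/100 flavor)
m, M = 0.01, 1.0
S = np.diag([M, m])
grad = lambda x: S @ x

s    = (2 / (np.sqrt(M) + np.sqrt(m)))**2                          # classical heavy-ball step
beta = ((np.sqrt(M) - np.sqrt(m)) / (np.sqrt(M) + np.sqrt(m)))**2  # classical momentum coefficient

print(heavy_ball(grad, [1.0, 1.0], s, beta, steps=100))          # many orders of magnitude closer to 0
print(heavy_ball(grad, [1.0, 1.0], s=1.0, beta=0.0, steps=100))  # plain descent: the slow component lingers
```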

  • 00:15:00 In this section, the speaker discusses the concept of momentum in accelerating gradient descent. Rather than using a point to represent the function, the speaker suggests using a heavy ball that moves faster down the valley of the cost function. This is achieved by involving the previous step in the calculations, resulting in a three-level method instead of a two-level method. The speaker then relates this to a continuous model of steepest descent and explains how it can be analyzed as a second-order differential equation with a momentum term. They then show how to write this as a system of two first-order equations, which can be used to create a more efficient and faster gradient descent algorithm.

  • 00:20:00 In this section, the speaker discusses how to analyze what happens as k travels forward in the accelerated gradient descent algorithm. They explain that at every step, there is a constant coefficient problem since the XZ variable gets multiplied by a matrix. The speaker also mentions that in order to track each eigenvector of s, they follow each eigenvalue which allows them to rewrite the formula in terms of scalars rather than vectors.

  • 00:25:00 In this section, the speaker discusses how to track one eigenvector and use it to make the whole problem scalar. For each eigenvalue lambda, the step size and momentum coefficient determine a small matrix that multiplies the coefficients of that eigenvector at every step. By choosing s and beta so that the eigenvalues of this matrix are as small as possible over the whole range of possible lambdas, the algorithm is made to converge as fast as possible.
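
Following that reduction to scalars (a sketch under the assumption that the update has the heavy-ball form used above; the lecture's exact matrix may be written slightly differently), each eigenvalue lambda of S gets its own 2-by-2 iteration matrix, and the quantity to control is the largest eigenvalue magnitude of that matrix over all lambda between m and M.

```python
import numpy as np

def iteration_matrix(lam, s, beta):
    """2x2 matrix multiplying (x_k, z_k) for one eigenvalue lam of S, heavy-ball form above."""
    # z_{k+1} = lam*x_k + beta*z_k,  x_{k+1} = x_k - s*z_{k+1} = (1 - s*lam)*x_k - s*beta*z_k
    return np.array([[1 - s * lam, -s * beta],
                     [lam,          beta    ]])

def worst_rate(s, beta, m, M, samples=200):
    """Largest eigenvalue magnitude of the iteration matrix over lam in [m, M]."""
    lams = np.linspace(m, M, samples)
    return max(np.abs(np.linalg.eigvals(iteration_matrix(l, s, beta))).max() for l in lams)

m, M = 0.01, 1.0
print(worst_rate(s=1.0, beta=0.0, m=m, M=M))   # no momentum: close to 0.99
print(worst_rate(s=(2 / (np.sqrt(M) + np.sqrt(m)))**2,
                 beta=((np.sqrt(M) - np.sqrt(m)) / (np.sqrt(M) + np.sqrt(m)))**2,
                 m=m, M=M))                    # with the classical momentum choice: about 0.82
```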

  • 00:30:00 In this section, the speaker explains the concept of the condition number, the ratio of the largest eigenvalue to the smallest eigenvalue of a symmetric positive definite matrix. A higher condition number means a harder problem, and a lower one an easier problem. The speaker explains how to use momentum to accelerate gradient descent: choose values of s and beta that make the eigenvalues of the iteration matrix as small as possible, paying attention to the largest one. The speaker emphasizes that both eigenvalues must be kept small, because having one small eigenvalue while the other stays large can prove deadly.

  • 00:35:00 In this section of the video, the speaker works through the problem of finding the optimal parameters s and beta for the two-by-two matrix whose eigenvalues depend on lambda, little m, and capital M. The goal is to choose the parameters that give the smallest possible larger eigenvalue, which leads to faster convergence. The speaker presents the formulas for the optimal s and beta, which depend on the ratio between little m and big M, and shows how to calculate the resulting minimum eigenvalue from them. Ultimately, this optimal choice of s and beta keeps the eigenvalues below a certain number, leading to faster convergence.
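
For reference, the classical heavy-ball choice (Polyak's formulas; the lecture's s and beta play the same role, though its normalization may differ slightly) and the resulting bound on the eigenvalue magnitudes are

```latex
s = \left(\frac{2}{\sqrt{M}+\sqrt{m}}\right)^{2},
\qquad
\beta = \left(\frac{\sqrt{M}-\sqrt{m}}{\sqrt{M}+\sqrt{m}}\right)^{2},
\qquad
|\lambda_{\text{iteration}}| \le \frac{\sqrt{M}-\sqrt{m}}{\sqrt{M}+\sqrt{m}}
    = \frac{1-\sqrt{m/M}}{1+\sqrt{m/M}},
```

so the ratio that governed plain steepest descent, (1 - m/M)/(1 + m/M), improves to the same expression with square roots, which is the 1 minus root b over 1 plus root b improvement quoted earlier.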

  • 00:40:00 In this section, the speaker talks about using momentum to improve the convergence rate in machine learning. They mention Nesterov's method, a slightly different idea that involves the previous time value and evaluates the gradient at a different point. The speaker notes that very popular methods now in use for machine learning, such as AdaGrad, add in previous values with a simple formula. They also suggest that further improvements might come from going back two or three steps or more, as is done in the backward difference formulas used in MATLAB software and in planetary calculations.

  • 00:45:00 In this section, the presenter talks about the momentum term and Nesterov's method, which evaluates the gradient at a point determined by x_k and x_{k-1}. That evaluation point for the gradient of f is a non-integer point, which is unexpected and a little weird because it is not a mesh point. Since the update involves x_{k+1}, x_k, and x_{k-1}, it is a second-order method, and to analyze it the same process is followed: write it as two first-order steps, a coupled one-step system with a matrix, find the matrix, find its eigenvalues, and choose the coefficients in Nesterov's method to make those eigenvalues as small as possible.
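
A hedged sketch of a Nesterov-style update (the standard accelerated-gradient form, written with a generic momentum coefficient gamma; the lecture's exact coefficients may differ): the gradient is evaluated at a shifted point on the line through x_{k-1} and x_k rather than at x_k itself.

```python
import numpy as np

def nesterov(grad, x0, s, gamma, steps=100):
    """Accelerated gradient: evaluate the gradient at y_k = x_k + gamma*(x_k - x_{k-1})."""
    x_prev = np.array(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(steps):
        y = x + gamma * (x - x_prev)      # look-ahead point on the line through x_{k-1} and x_k
        x_prev, x = x, y - s * grad(y)    # gradient step taken from the shifted point
    return x

# Illustrative quadratic, eigenvalues between m = 0.01 and M = 1 (my example)
m, M = 0.01, 1.0
S = np.diag([M, m])
grad = lambda x: S @ x

kappa = M / m
gamma = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)    # a common choice of momentum coefficient
print(nesterov(grad, [1.0, 1.0], s=1 / M, gamma=gamma))  # converges far faster than plain descent
```
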
23. Accelerating Gradient Descent (Use Momentum)
  • 2019.05.16
  • www.youtube.com
MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Gilbert Strang. View the complete course: https://o...