
 

Lecture 9 -- Monte Carlo -- Philipp Hennig



Numerics of ML 9 -- Monte Carlo -- Philipp Hennig

In this video on the topic of Monte Carlo, Philipp Hennig explains how integration is a fundamental problem in machine learning when it comes to Bayesian inference using Bayes' Theorem. He introduces the Monte Carlo algorithm as a specific way of doing integration and provides a brief history of the method. He also discusses the properties of Monte Carlo algorithms, such as unbiased estimation and variance reduction with an increase in the number of samples. Additionally, Hennig delves into the Metropolis-Hastings algorithm, Markov Chain Monte Carlo, and Hamiltonian Monte Carlo, providing an overview of each algorithm's properties and how they work when sampling from a probability distribution. Ultimately, Hennig notes the importance of understanding why algorithms are used, rather than blindly applying them, to achieve optimal and efficient results.

In the second part of the video, Philipp Hennig discusses Monte Carlo methods for high-dimensional distributions, specifically the No-U-Turn Sampler (NUTS) algorithm, which overcomes the problem that a naive U-turn stopping rule breaks detailed balance. Hennig emphasizes that while these algorithms are complex and tricky to implement, understanding them is crucial for using them effectively. He also questions the knee-jerk approach of computing expected values using Monte Carlo methods and suggests there may be other ways to approximate them without randomness. Hennig discusses the concept and limitations of randomness, the lack of convergence rates for Monte Carlo methods, and proposes the need to consider other methods for machine learning rather than relying by default on deterministic pseudo-randomness.

  • 00:00:00 In this section, the instructor introduces the topic of integration, which is a fundamental problem in machine learning when doing Bayesian inference to compute conditional (posterior) distributions using Bayes' theorem. He explains that this process contains an integral, the marginal, which is computed as an expected value of some conditional distribution. The instructor emphasizes the importance of knowing how to perform integration correctly and introduces the Monte Carlo algorithm as one specific way of doing integration. He gives a brief history of Monte Carlo and reflects on why it's important to understand why algorithms are used, rather than just applying them blindly.

  • 00:05:00 In this section, Philipp Hennig tells the story of how Monte Carlo simulations were developed in the 1940s to assist in designing a nuclear bomb. The problem was to optimize the geometry to achieve an explosion, and the solution was to approximate the required integrals with sums over simulated neutron paths. The FERMIAC, Fermi's analog computer, was invented for this purpose: it consists of two wheels and a pen and simulates the path of a neutron using random numbers drawn from a die. Although this process seems simple, it was the first step toward Monte Carlo simulations in many fields.

  • 00:10:00 In this section, the concept of Monte Carlo simulation is explained as a way to estimate an expected value by replacing the integral with a sum over evaluations of a function at points drawn from a distribution. This is an unbiased estimator whose variance decreases as the number of samples increases, so the error drops like one over the square root of the number of samples. While statisticians argue that this is the optimal rate for unbiased estimators, numerical mathematicians consider this rate quite slow, with polynomial rates being preferred. However, the method has its advantages, such as a rate that does not depend on the dimensionality of the underlying distribution.
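
To make this concrete, here is a minimal Python sketch (not from the lecture; the standard-normal target and the test function are illustrative assumptions) that estimates an expectation by averaging function values at sampled points, with the error shrinking roughly like one over the square root of the number of samples:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_expectation(f, sampler, n):
    """Monte Carlo estimate of E_p[f(x)] from n samples drawn from p."""
    x = sampler(n)
    return np.mean(f(x))

# Illustrative example: E[x^2] under a standard normal is exactly 1.
f = lambda x: x**2
sampler = rng.standard_normal

for n in [10**2, 10**4, 10**6]:
    est = mc_expectation(f, sampler, n)
    print(n, est, abs(est - 1.0))   # the error shrinks roughly like 1/sqrt(n)
```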

  • 00:15:00 In this section, Philipp Hennig addresses the debate surrounding the dimensionality of the Monte Carlo problem. Although there is a variance of f under p, which could be related to the dimensionality of the problem, the argument is that it does not depend on dimensionality. However, in certain structured problems, the variance can explode exponentially fast as a function of dimensionality. Nevertheless, most interesting applications of Monte Carlo sampling are insensitive to the dimensionality of the problem, allowing for the computation of high-dimensional problems. Hennig also discusses the classic example of computing Pi using Monte Carlo sampling, where it converges towards the truth with a rate given by the inverse square root of the number of samples.
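
The pi example mentioned above fits in a few lines; the following is a hedged illustration rather than the lecture's code, using uniform samples in the unit square and the fraction that falls inside the quarter circle:

```python
import numpy as np

rng = np.random.default_rng(1)

def estimate_pi(n):
    # Draw n points uniformly in the unit square [0, 1]^2.
    x, y = rng.random(n), rng.random(n)
    inside = (x**2 + y**2) <= 1.0      # indicator of the quarter circle
    return 4.0 * inside.mean()         # 4 times the area ratio approximates pi

for n in [10**3, 10**5, 10**7]:
    est = estimate_pi(n)
    print(n, est, abs(est - np.pi))    # the error again drops like 1/sqrt(n)
```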

  • 00:20:00 In this section, Philipp Hennig discusses Monte Carlo methods for approximating integrals. He explains how the method works by drawing a large number of samples from a distribution and computing the expected value under those samples. This can be a good solution when a rough estimate is needed, but it is not practical for highly precise answers. Hennig also talks about ways to construct samples from distributions that are difficult to work with, such as rejection sampling and importance sampling, but notes that these methods do not scale well in high dimensions.

  • 00:25:00 In this section, the idea of generating random variables from a high-dimensional distribution is discussed. The standard tool for this is Markov chain Monte Carlo, which is based on a structure that moves iteratively forward with finite memory. One method of this type is the Metropolis-Hastings algorithm, which constructs a Markov chain by proposing a new location from a proposal distribution and accepting it with a probability based on the ratio of the target density at the proposed and current points. This algorithm was invented by a group of nuclear physicists in the 1950s, who were working on optimizing the geometries of nuclear weapons, and is still widely used today.

  • 00:30:00 In this section, Philipp Hennig discusses the Metropolis-Hastings algorithm, which is a type of Markov chain Monte Carlo algorithm used to sample from a probability distribution. He demonstrates how the algorithm generates points by drawing from a proposal distribution and accepting or rejecting them based on their probability density. Hennig also highlights the importance of using a properly adapted proposal distribution in order to effectively explore the distribution being sampled. The Metropolis-Hastings algorithm has two important properties, detailed balance and ergodicity, which ensure that the process of running the algorithm for a long time produces a stationary distribution given by the distribution being sampled.
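
A minimal random-walk Metropolis-Hastings sketch in Python (an illustrative reconstruction, not the lecture's implementation; the Gaussian proposal and the standard-normal target are assumptions) looks like this:

```python
import numpy as np

rng = np.random.default_rng(2)

def metropolis_hastings(log_p, x0, n_samples, proposal_scale=1.0):
    """Random-walk Metropolis-Hastings for an unnormalized log-density log_p."""
    x = x0
    samples = []
    for _ in range(n_samples):
        x_prop = x + proposal_scale * rng.standard_normal()   # symmetric Gaussian proposal
        log_accept = log_p(x_prop) - log_p(x)                 # log of the density ratio
        if np.log(rng.random()) < log_accept:                 # accept with probability min(1, ratio)
            x = x_prop
        samples.append(x)                                     # a rejection repeats the old point
    return np.array(samples)

# Illustrative target: a standard normal, known only up to normalization.
log_p = lambda x: -0.5 * x**2
chain = metropolis_hastings(log_p, x0=0.0, n_samples=5000)
print(chain.mean(), chain.std())   # should be roughly 0 and 1
```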

  • 00:35:00 In this section, Philipp Hennig discusses the properties an algorithm needs so that it has a stationary distribution and converges to it: the chain must be aperiodic and have positive recurrence, meaning there is a non-zero probability of returning to any point at a later time, and it must not have structure that lets it get stuck in another stationary distribution. Metropolis-Hastings is an algorithm that fulfills these two properties. However, it has a worse rate than simple Monte Carlo and can show local random-walk behavior. The number of effectively independent samples the algorithm produces is related to the number of steps needed to travel between two points at completely opposite ends of the distribution.

  • 00:40:00 In this section, the speaker discusses Monte Carlo methods and how to evaluate them. He explains that to travel from one end of the distribution to the other, the sampler needs a number of steps proportional to the square of the ratio between the long and short length scales, so the convergence rate is still O(1/√T), but with a huge constant in front. He notes that a challenge with Monte Carlo is that if you only look at the statistics of the blue dots (the samples), without knowing the shape of the distribution and without the red dots as references, it is not obvious how you would notice that this is the case. Finally, he introduces Hamiltonian Monte Carlo, which he presents as the workhorse of Markov chain Monte Carlo and the common algorithm used to draw from a probability distribution p(x).

  • 00:45:00 In this section, Philipp Hennig explains the concept of Hamiltonian Monte Carlo (HMC), a method used to draw samples from a probability distribution. In HMC, the number of variables is doubled: a new variable represents the momentum of the existing variable. The pair is then evolved according to an ordinary differential equation defined by the Hamiltonian H, the total energy, which is the sum of a potential energy and the kinetic energy K. The time derivative of X is given by the partial derivative of H with respect to P, and the time derivative of P is given by minus the partial derivative of H with respect to X. If the algorithm manages to draw samples from the joint distribution over X and P, it marginally draws from the distribution over X.
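
Written out in standard HMC notation (a reconstruction, not a quote from the slides), the construction is:

```latex
H(x, p) = \underbrace{-\log p(x)}_{\text{potential energy}}
        + \underbrace{\tfrac{1}{2}\, p^\top p}_{\text{kinetic energy } K(p)},
\qquad
\frac{dx}{dt} = \frac{\partial H}{\partial p},
\qquad
\frac{dp}{dt} = -\frac{\partial H}{\partial x},
```

and sampling (X, P) jointly from the distribution proportional to exp(-H(x, p)) and then discarding P yields samples from p(x).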

  • 00:50:00 In this section, Philipp Hennig discusses implementing an ordinary differential equation (ODE) solver for the Hamiltonian dynamics using Heun's method, which has convergence rate of order two. He then compares this to using a software library and shows how the solver simulates the dynamics of a Hamiltonian system, a particle of mass 1 moving in a potential given by the negative logarithm of the target density, ultimately producing nice samples. Although each sample requires a roughly constant number of simulation steps, Hennig notes that the Metropolis-Hastings correction essentially always accepts, and the number of steps needed to cross the distribution scales with the ratio of long to short length scales rather than with its square, making it a more efficient algorithm.
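
A compact leapfrog-based HMC transition (an illustrative Python sketch; the standard-normal target, step size, and number of leapfrog steps are assumptions, not the lecture's settings) could look like this:

```python
import numpy as np

rng = np.random.default_rng(3)

def hmc_step(x, log_p, grad_log_p, step_size=0.1, n_leapfrog=20):
    """One Hamiltonian Monte Carlo transition using the leapfrog integrator."""
    p = rng.standard_normal(x.shape)                    # resample the momentum
    x_new, p_new = x.copy(), p.copy()

    # Leapfrog integration of dx/dt = p, dp/dt = grad log p(x).
    p_new += 0.5 * step_size * grad_log_p(x_new)
    for _ in range(n_leapfrog - 1):
        x_new += step_size * p_new
        p_new += step_size * grad_log_p(x_new)
    x_new += step_size * p_new
    p_new += 0.5 * step_size * grad_log_p(x_new)

    # Metropolis correction for the discretization error of the integrator.
    h_old = -log_p(x) + 0.5 * p @ p
    h_new = -log_p(x_new) + 0.5 * p_new @ p_new
    return x_new if np.log(rng.random()) < h_old - h_new else x

# Illustrative target: a standard normal in 5 dimensions.
log_p = lambda x: -0.5 * x @ x
grad_log_p = lambda x: -x
x, samples = np.zeros(5), []
for _ in range(2000):
    x = hmc_step(x, log_p, grad_log_p)
    samples.append(x.copy())
print(np.mean(samples, axis=0))   # should be close to zero
```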

  • 00:55:00 In this section, Philipp Hennig explains how the Hamiltonian Monte Carlo algorithm works. The algorithm draws from the joint distribution over X and P by simulating along a line of constant energy; which level line it explores is determined by the initial momentum, and at each step the momentum is resampled to move to a different level. Hennig compares the algorithm to an optimization problem and notes that it has two parameters, the number of leapfrog steps and the step size delta T, which must be chosen properly for the algorithm to work effectively. If the parameters are set incorrectly, the simulation can waste computational resources by moving back and forth without actually travelling anywhere.

  • 01:00:00 In this section, Philipp Hennig discusses the U-turn idea and the No-U-Turn Sampler (NUTS) algorithm in Monte Carlo methods for high-dimensional distributions. The problem with a naive U-turn stopping rule is that it breaks detailed balance. The NUTS algorithm overcomes this by running the simulation in both directions, waiting until one end starts turning around, and then choosing a point at random. This satisfies detailed balance and is a key component of many Markov chain Monte Carlo implementations. Hennig emphasizes that while these algorithms are complex and tricky to implement, understanding them is crucial for using them effectively.

  • 01:05:00 In this section, the speaker discusses the knee-jerk approach of computing expected values in Bayesian inference using Monte Carlo methods, and highlights the slow convergence rate and the emphasis placed on unbiased estimators. However, the speaker questions the need for unbiased estimators and randomness in the first place, and suggests that there may be other ways to approximate the quantity of interest without randomness. The speaker also touches on the concept of randomness and its relationship to sequences and finite sequences computed on a Turing machine.

  • 01:10:00 In this section, Philipp Hennig discusses the concept of randomness through different sequences of numbers. He argues that some sequences, such as those produced by dice, have been culturally accepted as random even though they are not truly random. On the other hand, the digits of irrational numbers like pi are entirely deterministic, yet show no obvious structure. Furthermore, Hennig explains how a seed determines the sequence produced by a pseudo-random number generator. Finally, he discusses how physical machines that produced random numbers were tested for randomness, but ultimately failed the Diehard tests of randomness.

  • 01:15:00 In this section, Philipp Hennig discusses randomness and how it relates to machine learning, specifically Monte Carlo methods. He explains that randomness has to do with a lack of information, which is why it is meaningful in areas like cryptography, where what someone else knows matters. For the kinds of random numbers used in contemporary machine learning, it is misguided to talk about this lack of information. When using a Monte Carlo method, authors of scientific papers are effectively hiding information from their readers. They use it because it is easy to use and implement, not because of its unbiasedness.

  • 01:20:00 In this section, Philipp Hennig explains how Markov chain Monte Carlo (MCMC) runs and notes that it works relatively well for problems of high dimensionality, even though we do not know its convergence rates. Its theoretical guarantees rely on using random numbers, yet it is accepted that samples produced by this approach are useful in the absence of other methods to compare to. Hennig also notes that MCMC is fundamentally slow and laborious and that there may be better ways of approximating integrals. He warns that the algorithms covered next week typically only work for low-dimensional problems, and proposes considering other methods for machine learning rather than relying by default on deterministic pseudo-randomness.
Numerics of ML 9 -- Monte Carlo -- Philipp Hennig
  • 2023.02.02
  • www.youtube.com
The ninth lecture of the Master class on Numerics of Machine Learning at the University of Tübingen in the Winter Term of 2022/23. This class discusses both ...
 

Lecture 10 -- Bayesian Quadrature -- Philipp Hennig



Numerics of ML 10 -- Bayesian Quadrature -- Philipp Hennig

In this video, Philipp Hennig discusses Bayesian Quadrature as an efficient method for the computational problem of integration in machine learning. He explains how a real-valued function can be uniquely identified by a short formula, yet questions about it, such as the value of a definite integral, can still be difficult to answer directly. Bayesian Quadrature is an inference method that treats the problem of finding an integral as an inference problem: it puts a prior over the unknown object and the quantities that can be computed, and then performs Bayesian inference. Hennig also compares this approach to Monte Carlo, rejection and importance sampling, showing how Bayesian Quadrature can outperform classical quadrature rules. The lecture covers the Kalman filter algorithm for Bayesian Quadrature and its connection to classic integration algorithms, with a discussion on using uncertainty estimates in numerical methods. Finally, Hennig explores how the social structure of numerical computation affects algorithm design, discusses a method for designing computational methods for specific problems, and how probabilistic machine learning can estimate the error in real time.

In the second part of the video, Philipp Hennig discusses Bayesian quadrature, which involves putting prior distributions over the quantities we care about, such as integrals and algorithm values, to compute something in a Bayesian fashion. The method assigns both a posterior estimate and an uncertainty estimate around the estimates, which can be identified with classic methods. Hennig explains how the algorithm adapts to the observed function and uses an active learning procedure to determine where to evaluate next. This algorithm can work in higher dimensions and has some non-trivially smart convergence rates. He also discusses limitations of classic algorithms and quadrature rules and proposes a workaround through adaptive reasoning.

  • 00:00:00 In this section, Philipp Hennig discusses the computational problem of integration in machine learning with a focus on Bayesian Quadrature as an efficient method. He describes a real-valued function f(x), a product of two terms, exp(-sin²(3x)) and exp(-x²), which can be uniquely identified by writing down a short string of characters. Hennig explains that while we know everything about this function, it is difficult to answer every question about it directly, such as the value of the definite integral from -3 to +3, which cannot be found in books full of integrals or in a standard numerical library.

  • 00:05:00 In this section, Philipp Hennig discusses Bayesian Quadrature, an inference method that treats the problem of finding an integral as an inference problem by putting a prior over the unknown object and the quantities that can be computed, and then performing Bayesian inference. By putting a prior, we begin with finite uncertainty, which leads to a narrow range of possible results of the computation — a situation that is typical for computations. The approach is contrasted with Monte Carlo, rejection and importance sampling, which are less efficient. The estimation error can be plotted as a function of the number of evaluations, suggesting that Bayesian Quadrature is a viable option for solving integrals.

  • 00:10:00 In this section of Philipp Hennig's talk, he discusses Bayesian quadrature as a way to estimate the integral of a function using probabilistic machine learning. He compares this approach to the Monte Carlo method, and explains that a Gaussian process is used as a prior over the function. By evaluating the function at specific x-values, we can estimate the latent variable, which is the integral of the function. Hennig also shows how this approach can outperform classical quadrature rules.

  • 00:15:00 In this section, Philipp Hennig explains how to compute integrals over the kernel in order to approximate integrals over any function we're trying to learn. By choosing a prior mean function and a prior covariance function, we can embed the problem of computing an integral in the reproducing kernel Hilbert space. Through computations involving evaluations of the function at various points, we end up with the kernel mean embedding, which involves computing integrals over the kernel. Therefore, we must choose kernels for which we can compute integrals in closed form, and Hennig chooses the Wiener process kernel as an example.
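
As a hedged sketch of this construction, the following assumes a Wiener-process kernel k(x, x') = min(x, x') on an interval [0, T], for which the kernel mean has a simple closed form; the integrand is an illustrative choice (with f(0) = 0 to match the prior), not the lecture's example:

```python
import numpy as np

def bq_wiener(f, nodes, T):
    """Bayesian quadrature for int_0^T f(x) dx under a Wiener-process prior on f.

    Prior: f ~ GP(0, k) with k(x, x') = min(x, x'), which implies f(0) = 0.
    """
    X = np.asarray(nodes, dtype=float)
    fX = f(X)
    K = np.minimum.outer(X, X)                  # Gram matrix k(x_i, x_j)
    kF = X * T - 0.5 * X**2                     # kernel mean: int_0^T k(x, x_i) dx
    kFF = T**3 / 3.0                            # double integral of the kernel
    weights = np.linalg.solve(K, kF)            # BQ weights k_F^T K^{-1}
    mean = weights @ fX                         # posterior mean of the integral
    var = kFF - kF @ weights                    # posterior variance of the integral
    return mean, var

# Illustrative integrand with f(0) = 0, matching the prior mean at the origin.
f = lambda x: np.sin(3 * x)
mean, var = bq_wiener(f, nodes=np.linspace(0.1, 2.0, 20), T=2.0)
print(mean, np.sqrt(var), (1 - np.cos(6.0)) / 3.0)   # estimate, error bar, exact value
```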

  • 00:20:00 In this section, Philipp Hennig discusses the process of Bayesian Quadrature. The process involves using a Wiener process prior, a Gaussian process that is asymmetric and non-stationary, and conditioning on a set of function values to get a posterior Gaussian process. By using this process, it is possible to achieve a much better result than Monte Carlo integration. For example, to achieve a 10^-7 relative error, Bayesian Quadrature would need fewer than 200 evaluations, while Monte Carlo integration would require more than 10^11 evaluations.

  • 00:25:00 In this section, the speaker discusses the speed of Bayesian Quadrature compared to Monte Carlo simulations. While Monte Carlo simulations are cheap and easy to implement, Bayesian Quadrature is also relatively fast and can be implemented as a Kalman filter, making it feasible for use in machine learning models. The speaker explains the linear map between the two states of the process and how it can encode integration, making it possible to discretize the stochastic differential equation and compute updates to the integral. The lecture then moves on to discussing the properties of Bayesian Quadrature in more detail.

  • 00:30:00 In this section, the speaker introduces a Kalman filter algorithm for Bayesian quadrature to evaluate integrals of a function. The algorithm involves defining matrices A and Q to represent the deterministic and stochastic parts of the linear time-invariant system, and H and R to represent the observation model. The posterior mean is a weighted sum of kernel functions, and the Kalman filter updates the estimate of the integral, with the uncertainty of the integral increasing with the cubed step length. The algorithm runs in linear time, and the posterior mean is a piecewise linear function that interpolates the function values. The estimate for the integral is the sum over the average values in each block.
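
A minimal state-space version of this filter (Python; the transition and process-noise matrices follow the once-integrated Wiener process, an assumption consistent with the description above rather than the lecture's exact code) might look like this:

```python
import numpy as np

def bq_kalman(f, nodes, sigma2=1.0):
    """Kalman-filter form of Bayesian quadrature with a Wiener-process prior on f.

    State s = (F, f): F is the running integral of f, and f itself is a Wiener process.
    The function is observed exactly at the nodes; the posterior over F at the last
    node is the estimate of the integral from 0 to nodes[-1].
    """
    m = np.zeros(2)                              # prior mean of (F(0), f(0))
    P = np.zeros((2, 2))                         # prior covariance (the process starts at 0)
    H = np.array([[0.0, 1.0]])                   # we observe the function value f, not F
    x_prev = 0.0
    for x in nodes:
        d = x - x_prev
        A = np.array([[1.0, d], [0.0, 1.0]])     # deterministic part: F grows by d * f
        Q = sigma2 * np.array([[d**3 / 3, d**2 / 2],
                               [d**2 / 2, d]])   # noise of the integrated Wiener process
        m, P = A @ m, A @ P @ A.T + Q            # predict
        S = (H @ P @ H.T).item()                 # innovation variance (noise-free observations)
        K = (P @ H.T) / S                        # Kalman gain
        resid = f(x) - (H @ m).item()
        m = m + K.ravel() * resid                # update
        P = P - K @ H @ P
        x_prev = x
    return m[0], P[0, 0]                         # posterior mean and variance of the integral

f = lambda x: np.sin(3 * x)                      # illustrative integrand with f(0) = 0
mean, var = bq_kalman(f, np.linspace(0.1, 2.0, 20))
print(mean, np.sqrt(var), (1 - np.cos(6.0)) / 3.0)
```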

  • 00:35:00 In this section, Hennig explains the concept of Bayesian quadrature and its connection to the trapezoid rule, which is a classic integration algorithm. He notes that the trapezoid rule can be seen as the posterior mean of a complex Gaussian process inference scheme and that this particular insight is an essential and common result. Hennig further discusses how various classic algorithms, whether for numerical computation, optimization, linear algebra or solving differential equations, all have connections to Bayesian posterior estimates. Additionally, he emphasizes that numerical computation should be considered as Gaussian inference since it involves least-squares estimates for numerical quantities with uncertainty, and suggests that using uncertainty estimates can be advantageous when dealing with numerical methods.

  • 00:40:00 In this section, Philipp Hennig discusses the decision-making aspect of numerical algorithms and how a numerical method is like an AI agent because it gets to decide which computations to perform. One question that arises is where to put evaluation points, and the answer can be found in the Bayesian inference formulation. By defining a probability distribution that should converge towards certainty, we obtain a quantity that describes certainty or uncertainty and can be manipulated. For the variance of the posterior distribution over the integral, the objective is to minimize it, which is achieved by setting all the step sizes Delta_j equal to each other, indicating a regular grid of integration nodes. Additionally, the necessity of having integration nodes on both ends of the integration domain is discussed.

  • 00:45:00 In this section, the speaker explains how the Bayesian Quadrature framework can be used to obtain a design for where to put evaluation nodes based on a Gaussian process prior. The framework can provide different designs depending on the prior used, and the evaluation nodes can be chosen according to a simple policy of maximum information gain. The trapezoid rule can be thought of as a Bayesian estimate, where the posterior mean arises from a specific Gaussian process prior over the integrand. The framework also provides an error estimate, but this estimate is not accurate, and there is a significant gap between the actual and estimated error. However, the trapezoid rule has been around for hundreds of years, and the algorithm is not necessarily flawed; rather, some of the assumptions behind it need to be questioned.

  • 00:50:00 In this section, Philipp Hennig discusses variance estimates and their relation to Bayesian Quadrature. He explains that the error estimate is the standard deviation, the square root of the expected squared error. Using a constant step size makes the sum easy to calculate, as nothing inside the sum depends on the index i. The resulting theorem states that the convergence rate for this trapezoid rule is O(1/N²). However, there are hidden assumptions in the math: sample paths drawn from a Wiener process are extremely rough and non-differentiable almost everywhere, so this prior assumption is violated by the smooth functions the rule is usually applied to.

  • 00:55:00 In this section, Philipp Hennig discusses the problem of integrating rough, non-differentiable functions using numerical algorithms. He explains that algorithms designed to operate on super rough functions, such as the trapezoid rule, may not be as efficient as they could be if the function they are integrating is much smoother. Hennig suggests that the social structure of numerical computation, where algorithms are designed to work on a large class of problems, can lead to overly general methods that don't work particularly well on any individual one of them. However, he notes that it is possible to design a computational method for a particular problem if it is sufficiently important, once you understand how these algorithms work. He also discusses how the scale of the error in the algorithm can be estimated while it runs, using ideas from probabilistic machine learning.

  • 01:00:00 In this section, Philipp Hennig discusses how to estimate the scale of an unknown constant in the covariance matrix given some data, and introduces the concept of conjugate priors. He explains that for exponential-family probability distributions there is always a conjugate prior, such as the Gamma prior, which can be used to estimate the variance of a Gaussian distribution. Hennig tells the story of William Sealy Gosset, who came up with this kind of method while working as a brewer for Guinness and had to estimate the distribution of samples from a beer barrel. The method involves multiplying the prior and the likelihood together and normalizing, which gives the same algebraic form as the Gamma distribution, with new parameters based on the observations or function values.
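
Concretely, the conjugate update described here has the standard Gamma-Gaussian form (a reconstruction in symbols, with tau the precision, i.e. the inverse variance):

```latex
p(\tau) = \mathrm{Gamma}(\tau;\ \alpha_0, \beta_0), \qquad
p(y_{1:n} \mid \tau) = \prod_{i=1}^{n} \mathcal{N}\!\left(y_i;\ m_i,\ \tau^{-1}\right)
\;\Rightarrow\;
p(\tau \mid y_{1:n}) = \mathrm{Gamma}\!\left(\tau;\ \alpha_0 + \tfrac{n}{2},\
\beta_0 + \tfrac{1}{2}\textstyle\sum_{i=1}^{n}(y_i - m_i)^2\right),
```

and marginalizing the Gaussian over this Gamma posterior yields the Student-t distribution mentioned in the next section.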

  • 01:05:00 In this section, Philipp Hennig explains how to estimate the posterior over the scale parameter, which leads to a Student-t distribution. In this form of Bayesian Quadrature, the scale estimate starts out wide and becomes more concentrated as more observations are collected. The results are shown in a plot where the distribution contracts as observations come in. Hennig points out that the prior assumptions are way too conservative for this smooth problem, and that there are much smarter algorithms for integration, such as Gaussian quadrature with feature sets built from Legendre polynomials, that work very well.

  • 01:10:00 In this section, Hennig discusses Gaussian quadrature, a classic way of doing integrals on bounded domains, such as the domain from -1 to 1. He explains that the corresponding quadrature rules converge extremely fast, with a super-polynomial rate of convergence, but this only works for functions that are actually smooth. The green line seen in the right-hand graph can also correspond to some posterior mean estimate under certain kinds of Gaussian prior assumptions. While this result is mostly of theoretical interest, clarifying the relationship between the two different approaches to numerical integration, there are classic algorithms which are very good for this kind of problem and come with lots of structure, with different bases for different kinds of integration problems. These quadrature rules approximate the integral by assuming the integrand can be written in a particular form using orthogonal polynomials and a weighting function, and there are specific choices for the polynomials Phi depending on the weight W and the integration domain.
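
For the bounded domain [-1, 1] discussed here, a Gauss-Legendre rule takes a few lines of NumPy; the integrand below is an illustrative assumption, loosely matching the running example of the lecture:

```python
import numpy as np

def gauss_legendre_integral(f, degree):
    """Approximate the integral of f over [-1, 1] with a Gauss-Legendre rule."""
    nodes, weights = np.polynomial.legendre.leggauss(degree)
    return weights @ f(nodes)

# Illustrative smooth integrand; for such functions the rule converges super-polynomially.
f = lambda x: np.exp(-np.sin(3 * x) ** 2 - x**2)
for deg in [2, 4, 8, 16, 32]:
    print(deg, gauss_legendre_integral(f, deg))
```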

  • 01:15:00 In this section, the speaker discusses the different types of Chebyshev polynomials and their use in computing numerical integrals for univariate functions. The speaker also explains why it is important to consider the integration domain, the shape of the function, and the prior when specifying a prior for a Bayesian inference rule. The speaker notes that classic integration algorithms and quadrature rules can be thought of as some form of Gaussian posterior mean estimate, and that choices made by these algorithms can be motivated by information-theoretic arguments. The speaker concludes by stating that while classic quadrature rules work well for one-dimensional integrals, higher-dimensional problems require different approaches, such as Monte Carlo algorithms.

  • 01:20:00 In this section, the speaker discusses the limitations of the methods shown in the previous section when it comes to scaling in dimensionality. These methods tend to have a performance decay that is exponential in dimensionality because a mesh of evaluations must be produced, meaning that they have to cover the domain with points. This is problematic because Gaussian processes are being used as priors, and their posterior uncertainty does not depend on the numbers seen, only where evaluations have been made. As a result, these integration methods are non-adaptive, limiting their scalability in higher dimensions. To overcome this issue, new algorithms are needed that can reason about the fact that some points are more informative than others through adaptive reasoning.

  • 01:25:00 In this section, Philipp Hennig discusses the limitations of Gaussian processes for encoding non-negative values and proposes a workaround: model a new function whose square is the actual (non-negative) function. The resulting distribution over the integrand is not Gaussian, but it can be approximated by a Gaussian process. The resulting algorithm is called WSABI, which stands for warped sequential active Bayesian integration. It is a probabilistic formulation that adaptively adds uncertainty where large function values are expected, allowing approximate numerical algorithms to be built. The utility function in blue represents the posterior uncertainty over function values.

  • 01:30:00 In this section, Philipp Hennig discusses the concept of Bayesian Quadrature, an algorithm for numerical integration. Hennig explains how the algorithm adapts to the observed function and uses an Active Learning procedure to determine where to evaluate next. This algorithm can work in higher dimensions and has some non-trivially smart convergence rates. Hennig also compares this algorithm to Monte Carlo algorithms and argues that prior knowledge can improve the algorithm's performance. Furthermore, he hints at the possibility of an even better algorithm beyond Monte Carlo, which will be discussed after Christmas.

  • 01:35:00 In this section, Philipp Hennig discusses Bayesian quadrature, which involves putting prior distributions over the quantities we care about, such as integrals and algorithm values, to compute something in a Bayesian fashion. The method assigns both a posterior estimate and an uncertainty estimate around the estimates, which can be identified with classic methods. If the error estimates are bad, it doesn't necessarily mean that the probabilistic view on computation is wrong, but rather that the set of prior assumptions is bad. By using more prior knowledge and treating numerical algorithms as autonomous agents, we can extract more information and make the algorithms faster and work better.
Numerics of ML 10 -- Bayesian Quadrature -- Philipp Hennig
  • 2023.02.02
  • www.youtube.com
The tenth lecture of the Master class on Numerics of Machine Learning at the University of Tübingen in the Winter Term of 2022/23. This class discusses both ...
 

Lecture 11 -- Optimization for Deep Learning -- Frank Schneider



Numerics of ML 11 -- Optimization for Deep Learning -- Frank Schneider

Frank Schneider discusses the challenges of optimization for deep learning, emphasizing the complexity of training neural networks and the importance of selecting the right optimization methods and algorithms. He notes the overwhelming number of available methods and the difficulty in comparing and benchmarking different algorithms. Schneider provides real-world examples of training large language models, where non-default learning rate schedules and mid-flight changes were needed to get the model to train successfully. Schneider highlights the importance of providing users with more insight into how to use these methods and how hyperparameters affect the training process, as well as the creation of benchmarking exercises to help practitioners select the best method for their specific use case. He also discusses newer diagnostic quantities such as alpha and how they can be leveraged to steer the training process of a neural network.

In the second part of the video on the numerics of optimization for deep learning, Frank Schneider introduces the "deep debugger" tool Cockpit, which provides additional instruments to detect and fix issues in the training process, such as data bugs and model bugs. He explains the importance of normalizing data for optimal hyperparameters, the relationship between learning rates and test accuracy, and the challenges of training neural networks with stochasticity. Schneider encourages students to work towards improving the training of neural networks by considering the gradient as a distribution and developing better autonomous methods in the long run.

  • 00:00:00 In this section, Frank Schneider introduces the topic of deep learning optimization and provides an overview of the challenges involved in training neural networks. He explains that while it may seem like a simple question of how to train neural networks, there are actually multiple ways to answer it, including considerations of hardware and software. The main focus of the lecture, however, is on the methods and algorithms used to train neural networks, and Schneider emphasizes that there is no one-size-fits-all solution. He provides a real-world example of a group at Meta training a large language model, showing that a non-default learning rate schedule and mid-flight changes to the learning rate were needed to get the model to train successfully. Overall, Schneider's lecture highlights the complexity of training neural networks and the importance of carefully selecting the right optimization methods and algorithms.

  • 00:05:00 In this section, the speaker discusses the challenges of training a neural network efficiently, citing the example of the logbook provided by OpenAI dedicated to the struggle of training a large language model. The speaker mentions that currently, there are no efficient methods to train neural networks, although there are some guidelines and intuitions available. The lecture will focus on understanding why training a neural network is so challenging and what can be done to improve the situation. The speaker notes that this will be different from their usual lecture structure, as there are numerous current state-of-the-art methods, and it's unclear which of these methods is the most efficient.

  • 00:10:00 In this section, the speaker discusses the misconceptions around machine learning being primarily optimization. While optimization involves searching for a minimum in a loss landscape, the goal of machine learning is to find a function that best fits the training data and generalizes well to new data. This is accomplished through the use of a loss function that quantifies the difference between the model's predictions and the true outputs. Since the true data distribution is often unknown, the model is trained on a finite sample of data, and the optimization process operates on the empirical loss. The speaker emphasizes that deep learning involves more complexity due to higher dimensional landscapes and expressive hypotheses.

  • 00:15:00 In this section, Frank Schneider explains that machine learning is not just optimization, as the quantity being optimized (the empirical loss) is not the same as the quantity the algorithm actually cares about (the true loss). Overfitting and generalization are more complicated than just going from train to test: in translation tasks, for example, models are trained on a cross-entropy loss but evaluated on the quality of the translation. As a result, people have developed various methods, such as stochastic gradient descent, momentum variants, RMSProp, and Adam, which take previous gradients into account to decide how to behave in the future. In total, there are over 150 methods available for optimizing and training algorithms for deep learning.
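
As a rough sketch of how two of these update rules differ (plain Python/NumPy, not tied to any library; all hyperparameter values are illustrative assumptions), here are SGD with momentum and Adam applied to a generic gradient estimate:

```python
import numpy as np

def sgd_momentum_step(theta, grad, state, lr=0.1, beta=0.9):
    """SGD with (heavy-ball) momentum: accumulate a decaying sum of past gradients."""
    state["v"] = beta * state.get("v", 0.0) + grad
    return theta - lr * state["v"], state

def adam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: exponential moving averages of the gradient and its square, with bias correction."""
    t = state.get("t", 0) + 1
    m = beta1 * state.get("m", 0.0) + (1 - beta1) * grad
    v = beta2 * state.get("v", 0.0) + (1 - beta2) * grad**2
    m_hat, v_hat = m / (1 - beta1**t), v / (1 - beta2**t)
    state.update(t=t, m=m, v=v)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), state

# Illustrative use on a 1D quadratic loss L(theta) = theta^2 with gradient 2 * theta.
theta, state = 5.0, {}
for _ in range(5000):
    theta, state = adam_step(theta, 2 * theta, state)
print(theta)
```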

  • 00:20:00 In this section, the speaker discusses the overwhelming number of optimization methods available for neural network training, with over 100 methods to choose from. The issue is not just picking a method, but also how to use it effectively. For example, even if we choose an optimization method like SGD or Adam, we still need to decide on hyperparameters like the learning rate and epsilon, which can be difficult to tune. The speaker suggests that we need proper benchmarks to understand which methods actually constitute improvements, and that a current challenge is defining what "better" means in the context of deep learning. Overall, the focus should be on providing users with more insight into how to use these methods and how hyperparameters affect the training process.

  • 00:25:00 In this section, Frank Schneider discusses the challenges that arise when comparing deep learning training algorithms, such as optimization for reinforcement problems, GANs, and large language models. It becomes difficult to determine whether the differences in performance are significant, as one may need to run these methods several times to account for stochasticity. Testing all cases can be expensive and time-consuming, as training must be repeated multiple times for all general-purpose methods. The method used to train must be analyzed when testing multiple problems, requiring changes to hyper-parameters, which make it even more expensive. Moreover, Schneider stresses that SGD and Adam are families of algorithms that cannot be compared directly without specifying the exact set of parameters.

  • 00:30:00 In this section, Frank Schneider discusses the process of identifying the state-of-the-art training methods for deep learning. Due to the large number of optimization methods available, they had to limit themselves to testing 15 optimization methods on 8 different types of problems, ranging from simple quadratic problems to larger-scale image classification and recurrent neural network models. To simulate various scenarios, they tested these optimization methods in four different settings with different budgets for hyperparameter tuning, from one-shot tuning with the default hyperparameters to larger budgets for industry practitioners who have more resources available. The goal was to determine which optimization methods performed best under different scenarios to help practitioners select the best method for their specific use case.

  • 00:35:00 In this section, Frank Schneider discusses the optimization process for deep learning models. He explains that to find the best optimization method, they had to conduct over 50,000 individual runs since there were 15 optimization methods and four learning rate schedules. Schneider notes that there was no clear state-of-the-art training method for deep learning since several methods performed well on different test problems. However, Adam showed consistently good results, and other methods that derived from Adam did not improve performance significantly. Overall, the benchmarking exercise showed that currently, there is no clear optimization method that works for all deep learning models.

  • 00:40:00 In this section, the speaker discusses the difficulty of determining the most effective method for training a neural network, due to the many different methods available and the lack of a clear training protocol. The speaker describes the creation of the MLCommons benchmark by their algorithms working group, a competition that measures neural network training speed-ups due solely to algorithmic changes. The aim is to build more efficient algorithms that speed up neural network training. The speaker also discusses the lack of available information on how to use these methods and suggests that additional information could be used to create debugging tools to help users in the meantime, in the hope of eventually building a better method that can do everything automatically.

  • 00:45:00 In this section, the speaker discusses how most machine learning models approximate the gradient by choosing an individual sample (or mini-batch) of the training dataset before taking a step. The mini-batch gradient is thus a sample-based estimate of the true gradient, obtained by averaging the individual per-sample gradients, although the variance of this estimator is not exposed in PyTorch. However, by using packages like BackPACK, users can access the individual gradients and their variance. This additional information can be leveraged to steer the training process of a neural network, for example to decide whether to increase or decrease the learning rate. The speaker provides an example where two loss curves look the same, but the optimization trajectories in the loss landscape show two completely different things happening.
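
A hedged PyTorch sketch of the idea (the tiny model and data are illustrative; BackPACK or vectorized autodiff would compute these per-sample gradients far more efficiently than this explicit loop):

```python
import torch

# Tiny illustrative model and batch.
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
X, y = torch.randn(32, 10), torch.randn(32, 1)

per_sample_grads = []
for i in range(X.shape[0]):
    model.zero_grad()
    loss_fn(model(X[i:i + 1]), y[i:i + 1]).backward()
    per_sample_grads.append(model.weight.grad.flatten().clone())

G = torch.stack(per_sample_grads)   # shape: (batch size, number of weights)
mean_grad = G.mean(dim=0)           # the usual mini-batch gradient estimate
grad_var = G.var(dim=0)             # element-wise variance across the batch
print(mean_grad.norm().item(), grad_var.mean().item())
```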

  • 00:50:00 In this section, the speaker discusses how the loss curve can show whether a neural network is training or not but does not explain why or what to do to improve it. The loss landscape has tens of millions of dimensions, making it nearly impossible to look into. However, the speaker introduces a quantity that helps to characterize the neural network's optimization procedure, called alpha. The alpha value determines whether the network is understepping, minimizing, or overshooting by observing the slope in the direction that the network is stepping, which shows whether the loss landscape is going up or down.

  • 00:55:00 In this section, Frank Schneider explains how Alpha is calculated while optimizing the neural network. Alpha is a scalar value that was explained in the previous section as the direction the model moves to optimize the neural network. Schneider explains that the Alpha scalar quantity is based on the size of the step in comparison to the observed loss in that direction. Negative Alpha values imply under-stepping, whereas positive values imply overstepping, and one means switching directly to the other side of the valley. Schneider also explains how by condensing information into meaningful reports, developers can create debugging tools for deep learning similar to that of classical programming.
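
The exact alpha used in the lecture (and in the Cockpit tool) is based on fitting a parabola to the loss along the step; the sketch below is a much-simplified stand-in for the idea, classifying a step only by the sign of the directional slope at the new point, and all names are hypothetical:

```python
import numpy as np

def step_diagnostic(grad_new, step):
    """Much-simplified stand-in for alpha: look at the loss slope along the step
    direction after the step has been taken. A negative slope means the step could
    have gone further (understepping); a positive slope means it went past the
    valley floor (overshooting)."""
    direction = step / np.linalg.norm(step)
    slope_after = grad_new @ direction
    if slope_after < 0:
        return "understepping"
    if slope_after > 0:
        return "overshooting"
    return "stepped to the minimum along this direction"

print(step_diagnostic(grad_new=np.array([0.1, -0.2]), step=np.array([1.0, 0.5])))
```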

  • 01:00:00 In this section, Frank Schneider introduces the concept of the "Deep Debugger" with the tool "Cockpit," which augments a viewer's training process with additional instruments, like a pilot in an airplane. Schneider shows how Cockpit can provide new viewpoints in training a neural network, such as step size, distance, gradient norm, and gradient tests, that can help to detect and fix issues like data bugs in the training process. With the additional instruments, Cockpit can provide users with relevant information and complement the essential performance plot.

  • 01:05:00 In this section, the speaker discusses how using normalized versus raw data in deep learning affects the neural network's performance and optimal hyperparameters. Raw data, with pixel values ranging from 0 to 255, can lead to a less well-behaved gradient element histogram and therefore less optimal hyperparameters. However, a missing normalization step is easy to overlook because visually the data looks the same. Another issue that can affect training is a model bug in which one network trains well while another doesn't, even though they have similar gradient element histograms. By using Cockpit, one can look at the histogram for each layer of the network, revealing any degeneracies throughout the model. This helps identify model bugs that are hard to find through trial and error. Lastly, the use of Cockpit for hyperparameter tuning can lead to new research and a better understanding of methods.

  • 01:10:00 In this section, Frank Schneider discusses optimization for deep learning and the relationship between learning rates, Alpha values, and test accuracy. He explains that while larger learning rates tend to result in larger Alpha values, which means overshooting and potentially taking too large of steps, the best-performing runs are typically in the positive Alpha region. This tells us that in neural network training, it may not always be best to minimize at each step and that overshooting is necessary to get the best performance. Schneider also shares examples from papers by the University of Toronto that illustrate the importance of finding a balance between taking local and global steps to achieve optimal results.

  • 01:15:00 In this section, Frank Schneider acknowledges that training neural networks is a challenging task that lacks a clear protocol to follow. Furthermore, he believes that the stochasticity in deep learning is a primary source of this challenge, which leads to training and optimizing being two different things. However, he suggests that thinking about the gradient as a distribution, accounting for standard deviation, variances, and confidences, can allow for better tools to be built and for better autonomous methods to develop in the long run. Schneider encourages interested students to help in improving the training of neural networks.
Numerics of ML 11 -- Optimization for Deep Learning -- Frank Schneider
  • 2023.02.06
  • www.youtube.com
The eleventh lecture of the Master class on Numerics of Machine Learning at the University of Tübingen in the Winter Term of 2022/23. This class discusses bo...
 

Lecture 12 -- Second-Order Optimization for Deep Learning -- Lukas Tatzel



Numerics of ML 12 -- Second-Order Optimization for Deep Learning -- Lukas Tatzel

In this video, Lukas Tatzel explains second-order optimization methods for deep learning and their potential benefits. He compares the trajectories and convergence rates of three optimization methods - SGD, Adam, and LBFGS - using the example of the Rosenbrock function in 2D. Tatzel notes that the jumpy behavior of SGD leads to slower convergence compared to the well-informed steps of LBFGS. He introduces the Newton step as a faster method for optimization that removes the dependence on the condition number, and discusses its limitations. Tatzel also explains the concept of the Generalized Gauss-Newton matrix (GGN) as an approximation to the Hessian for dealing with non-convex problems. Additionally, he discusses the trust region problem, how to deal with non-convex objective functions, and the Hessian-free approach that uses CG for minimizing quadratic functions.

This second part of the video explores second-order optimization techniques for deep learning, including BFGS and LBFGS, Hessian-free optimization, and K-FAC. The speaker explains that the Hessian-free approach linearizes the model using Jacobian-vector products, while K-FAC is an approximate curvature based on the Fisher information matrix. However, stochasticity and biases can occur with these methods, and damping is recommended to address these issues. The speaker proposes the use of specialized algorithms that can use richer quantities like distributions to make updates and notes that the fundamental problem of stochasticity remains unsolved. Overall, second-order optimization methods offer a partial solution to the challenges of deep learning.

  • 00:00:00 In this section, Lukas Tatzel introduces second-order optimization methods as a potential solution to the expensive and tedious optimization process of deep learning. He uses the example of the Rosenbrock function in 2D to compare the trajectories and convergence rates of three optimizers - SGD, Adam, and LBFGS. He notes that the jumpy behavior of SGD leads to slower convergence compared to the well-informed steps of LBFGS, which requires fewer than 10 steps to reach a tolerance of 10^-8, making it not only faster in terms of steps but also in runtime compared to Adam and SGD. Tatzel raises the question of whether these methods can be applied to deep learning and explores how they work and what their potential is.
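
For reference, a comparison in this spirit can be reproduced with SciPy's built-in Rosenbrock helpers and L-BFGS (an illustrative sketch, not the lecture's code; the plain gradient-descent loop is a crude noise-free stand-in for SGD and its learning rate is an assumption):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0])                   # classic Rosenbrock starting point

# Deterministic L-BFGS with exact gradients converges in a handful of iterations.
res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B", tol=1e-8)
print(res.x, res.nit, res.fun)

# A plain gradient-descent loop needs orders of magnitude more steps and may
# still not have fully converged by the time it stops.
x = x0.copy()
for _ in range(50000):
    x -= 1e-3 * rosen_der(x)
print(x, rosen(x))
```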

  • 00:05:00 In this section, Lukas Tatzel explains the basics of deep learning optimization, which involves predicting a C-dimensional output vector and comparing it with the actual label to compute the loss. The goal in deep learning is to find a configuration of the network parameter vector Theta that minimizes the empirical risk. The numerical methods used for this include stochastic gradient descent (SGD), which computes an estimate of the gradient on finite data using a Monte Carlo estimator. However, gradient-based methods are sensitive to the condition number, which is the ratio of the maximum and minimum directional curvature.

  • 00:10:00 In this section, Lukas Tatzel discusses how gradient-based methods are sensitive to ill-conditioned problems in deep learning. He explains that a large condition number leads to slow convergence for gradient-based methods. To improve the updates, Tatzel suggests rescaling the gradient in both large- and small-curvature directions with the respective inverse curvatures. By doing this, second-order methods can be introduced that reduce or eliminate the dependency on the condition number.

  • 00:15:00 In this section, Lukas Tatzel discusses second-order optimization in deep learning and introduces the concept of the Newton step. This method approximates the loss function at the current iterate with a quadratic function whose Hessian is assumed to be positive definite. By computing the gradient of this quadratic model and setting it to zero, the Newton step can be derived and used for minimization. This method can be much faster than gradient-based methods in certain situations, achieving local quadratic convergence if the target function is twice differentiable and the Hessian is Lipschitz continuous. Tatzel compares linear and quadratic convergence visually, showing that Newton methods can be really fast in certain situations, as they are robust against ill-conditioned problems.
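
In symbols, the quadratic model and the resulting Newton step described here are (a standard reconstruction):

```latex
L(\theta) \approx L(\theta_t) + \nabla L(\theta_t)^\top (\theta - \theta_t)
            + \tfrac{1}{2}\, (\theta - \theta_t)^\top H_t\, (\theta - \theta_t),
\qquad
\theta_{t+1} = \theta_t - H_t^{-1}\, \nabla L(\theta_t),
```

obtained by setting the gradient of the quadratic model to zero, which yields a minimizer only when H_t is positive definite.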

  • 00:20:00 In this section, Lukas Tatzel discusses second-order optimization methods for deep learning and the reasons why they are not commonly used. Second-order methods can be faster than gradient-based methods, but they require access to the Hessian matrix, which can be difficult to compute and store for large, non-convex problems. Additionally, handling stochasticity in the computation of the Hessian can affect the performance of these methods. Tatzel goes on to explain how these challenges can be addressed and gives an overview of the concepts behind the different methods.

  • 00:25:00 In this section, Lukas Tatzel explains second-order optimization for deep learning and the limitations of the Newton update. He demonstrates the computation of the second-order derivative of the function with respect to Tau, which gives a quadratic with constant curvature Lambda: the curvature along an eigenvector is the corresponding eigenvalue, and if the curvature is negative, the quadratic is unbounded from below, rendering the Newton update meaningless. To address this problem, Tatzel introduces the Generalized Gauss-Newton matrix (GGN), a positive semi-definite approximation to the Hessian that can serve as a replacement for it. He derives the GGN from the loss function by applying the chain rule to the split between the loss and the model output.
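
Written out in standard notation (a reconstruction rather than a quote), the split and the resulting matrix are:

```latex
L(\theta) = \frac{1}{N}\sum_{n=1}^{N} \ell\big(f(x_n; \theta),\, y_n\big),
\qquad
\mathrm{GGN}(\theta) = \frac{1}{N}\sum_{n=1}^{N} J_n^\top \big[\nabla^2_{f}\,\ell\big]_n\, J_n,
\qquad
J_n = \frac{\partial f(x_n; \theta)}{\partial \theta},
```

where the terms involving second derivatives of the network itself are dropped; this is what makes the GGN positive semi-definite whenever the loss Hessian with respect to the network output is.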

  • 00:30:00 In this section, Lukas Tatzel discusses the concept of second-order optimization for deep learning models. He explains the product rule and how it works, and how to compute the derivative of a matrix while applying the chain rule. Tatzel then talks about the GGN, a positive semi-definite matrix that neglects the curvature coming from the model, and the Hessian, which contains second derivatives of the model with respect to Theta. He compares the GGN and the Hessian and shows that the GGN is symmetric and positive semi-definite, making it a useful tool for optimization in deep learning models.

  • 00:35:00 In this section, Lukas Tatzel discusses how the Hessian of the loss with respect to the model output determines whether the GGN (Generalized Gauss-Newton) matrix is positive semi-definite. For all relevant loss functions, this loss Hessian is positive semi-definite. In the case where the loss is the squared norm between the outputs of the model and the true label, the loss Hessian is a scalar times the identity matrix, making it positive definite. Lukas also discusses the Fisher information matrix, which can be used to define a well-defined GGN step. In this case, the GGN step is steepest descent in distribution space, where distances in parameter space are measured by the distance between the corresponding distributions.

  • 00:40:00 In this section, Lukas Tatzel explains the trust-region problem in second-order optimization for deep learning. Even in the convex case, the quadratic model can be arbitrarily bad, which leads to the need for damping and for restricting the update to lie within some trust radius. By adding Delta times the identity to the curvature matrix, a modified Newton step is created, and with damping it is possible to control how conservative the updates are. Rather than choosing the radius directly, it is easier to work with the damping itself, using the Levenberg-Marquardt heuristic based on the reduction ratio between the expected and actual loss decrease.

  • 00:45:00 In this section of the video, Lukas Tatzel discusses how to deal with non-convex objective functions in deep learning by computing positive semi-definite curvature matrices such as the GGN and the Fisher. It is possible to interpret these matrices and to provide unbiased estimators of them on finite data. Damping heuristics, such as Levenberg-Marquardt, can be used to control how conservative the updates should be. However, inverting these huge curvature matrices is a problem due to storage limitations. To solve this problem, ideas from numerical linear algebra, such as low-rank approximations, iterative methods, and structured approximations, can be borrowed. Tatzel then discusses the core idea of BFGS, which gradually learns an approximation to the inverse Hessian from gradient observations, with the goal of deducing from those observations what the inverse Hessian looks like.

  • 00:50:00 In this section, Lukas Tatzel explains the idea behind quasi-Newton methods for deep learning. In one dimension, the second derivative can be obtained from a difference approximation to the gradient, and this is transferred to the multi-dimensional case via the secant equation. The goal is to approximate the inverse Hessian, so properties of the actual inverse Hessian are required of the approximation. The update only involves the previous approximation and the vectors s_k and y_k. In L-BFGS, the approximation is stored using only a window of some fixed size l of recent pairs, and with this a good curvature estimate can still be obtained.
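
In standard notation (a reconstruction), the secant equation and the BFGS update of the inverse-Hessian approximation referred to here read:

```latex
s_k = \theta_{k+1} - \theta_k, \qquad
y_k = \nabla L(\theta_{k+1}) - \nabla L(\theta_k), \qquad
B_{k+1} s_k = y_k \ \ (\text{secant equation}),
```
```latex
H_{k+1} = \left(I - \rho_k\, s_k y_k^\top\right) H_k \left(I - \rho_k\, y_k s_k^\top\right)
        + \rho_k\, s_k s_k^\top,
\qquad \rho_k = \frac{1}{y_k^\top s_k},
```

where H_k denotes the approximation to the inverse Hessian; L-BFGS keeps only the last l pairs (s_k, y_k) instead of the full matrix.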

  • 00:55:00 In this section, Lukas Tatzel introduces second-order optimization methods for deep learning, specifically focusing on the Hessian-free approach. This approach uses CG to minimize the quadratic model and only requires matrix-vector products, allowing for efficient computation without explicitly storing the curvature matrix. The GGN is used as the curvature matrix, and using Monte Carlo estimation the required quantities can be computed for a given input-output pair. To efficiently multiply the Jacobian with a vector, the core idea is to replace the Jacobian-vector product with a directional derivative. This allows the product to be computed efficiently without explicitly constructing the matrices.
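
The matrix-vector products that make this approach feasible can be obtained from automatic differentiation without ever forming the matrix. Below is a hedged PyTorch sketch of a Hessian-vector product via double backpropagation (the model and data are illustrative; the GGN-vector product additionally needs Jacobian-vector products, which is the directional-derivative trick mentioned above):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
X, y = torch.randn(64, 10), torch.randn(64, 1)
loss = torch.nn.functional.mse_loss(model(X), y)

params = list(model.parameters())
grads = torch.autograd.grad(loss, params, create_graph=True)   # keep the graph for a second backward

# Hessian-vector product H v: differentiate the scalar (grad . v) with respect to the parameters.
v = [torch.randn_like(p) for p in params]
gv = sum((g * vi).sum() for g, vi in zip(grads, v))
Hv = torch.autograd.grad(gv, params)
print([h.shape for h in Hv])
```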

  • 01:00:00 In this section, the speaker discusses second-order optimization for deep learning, specifically the Hessian-free optimization and K-FAC techniques. Hessian-free optimization involves linearizing the model, approximating F at theta plus Delta Theta by F of theta plus the Jacobian times Delta Theta, and using Jacobian-vector products. However, a naive implementation of this product can be numerically unstable, so an approximation is used instead. K-FAC, on the other hand, is an approximate curvature based on the Fisher information matrix that involves two approximations: a block-diagonal approximation and exchanging the expectation and Kronecker-product operations. The block-diagonal structure makes inverting the matrix trivial, and the approximation of the expectation is reasonable because computing the expectation of a Kronecker product exactly would be difficult.

  • 01:05:00 In this section, Lukas Tatzel discusses three approaches to accessing and inverting the curvature matrix used in second-order optimization for deep learning. The first is BFGS and L-BFGS, which use a dynamically updated low-rank approximation of the inverse Hessian and are the default choice for small deterministic problems. The second is the Hessian-free optimizer, which is similar to Newton steps but requires little memory and more sequential work; however, it has trouble with larger mini-batch sizes and with models that use BatchNorm layers. The last method is K-FAC, which is a lightweight representation of the curvature based on the Fisher information matrix and is widely used in uncertainty quantification. The K-FAC optimizer is recommended when dealing with limited memory, as storing and inverting the smaller blocks is easier and faster than doing the same with the entire matrix.

  • 01:10:00 In this section, Lukas Tatzel discusses the issue of stochasticity when computing the Newton step, which involves inverting the Hessian and applying it to the gradient. Because we only have estimates of the Hessian and the gradient, even if they are unbiased, the resulting Newton step is still biased. Tatzel gives an intuitive 1D example in which the expectation of 1/Ĥ is not the same as 1/H, showing that even with an unbiased estimate of the curvature, mapping it through the inversion introduces error. This highlights the challenge of dealing with stochasticity in second-order optimization for deep learning.

  • 01:15:00 In this section, the speaker discusses the biases and instabilities that can occur in second-order optimization for deep learning. When estimating the inverse curvature, heavy tails can arise, which push the expectation of the inverse above its noise-free value, so the Newton step becomes too large in expectation. Such biases and instabilities can also appear when a stochastic curvature estimate happens to land close to zero. These issues can be mitigated by applying damping, which moves the distribution away from zero; a small numerical illustration follows below.
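
A quick numerical illustration of both points, using a made-up positive noise model for a 1-d curvature estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 2.0                                             # true 1-d curvature
# Toy model: positive, unbiased curvature estimates with mean H
H_hat = rng.gamma(shape=8.0, scale=H / 8.0, size=1_000_000)

print(H_hat.mean())             # ~2.0  : the curvature estimate itself is unbiased ...
print((1.0 / H_hat).mean())     # ~0.57 : ... but its inverse is biased upward (1/H = 0.5)

# Damping pushes the distribution away from zero and shrinks the relative bias:
lam = 1.0
print((1.0 / (H_hat + lam)).mean(), 1.0 / (H + lam))   # close to 1/(H + lam)
```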

  • 01:20:00 In this section, Lukas Tatzel discusses the limitations of using damping as an outer-loop optimization process, which treats all directions equally and may not be a suitable way to address the complexity of the training process. He proposes specialized algorithms that use richer quantities, like distributions, to make updates, and notes that the fundamental problem of stochasticity remains unsolved. Overall, Tatzel suggests that second-order optimization methods such as BFGS, L-BFGS, the Hessian-free optimizer, and K-FAC offer a partial solution to the challenges of deep learning, including the issue of ill-conditioning.
Numerics of ML 12 -- Second-Order Optimization for Deep Learning -- Lukas Tatzel
  • 2023.02.06
  • www.youtube.com
The twelfth lecture of the Master class on Numerics of Machine Learning at the University of Tübingen in the Winter Term of 2022/23. This class discusses bot...
 

Lecture 13 -- Uncertainty in Deep Learning -- Agustinus Kristiadi



Numerics of ML 13 -- Uncertainty in Deep Learning -- Agustinus Kristiadi

The video discusses uncertainty in deep learning, particularly in the weights of neural networks, and the importance of incorporating uncertainty due to the problem of asymptotic overconfidence, where neural networks give high-confidence predictions for out-of-distribution examples that should not be classified with certainty. The video provides insights on how to use second-order quantities, specifically curvature estimates, to get uncertainty into deep neural networks, using a Gaussian distribution to approximate the last layer's weights and the Hessian matrix to estimate the curvature of the neural network. The video also discusses the Bayesian formalism and Laplace approximations for selecting models and parameters of neural networks.

In the second part of the lecture Agustinus Kristiadi discusses various ways to introduce uncertainty in deep learning models in this video. One technique involves using linearized Laplace approximations to turn a neural network into a Gaussian model. Another approach is out-of-distribution training, where uncertainty is added in regions that are not covered by the original training set. Kristiadi emphasizes the importance of adding uncertainty to prevent overconfidence in the model and suggests using probabilistic measures to avoid the cost of finding the ideal posterior. These techniques will be explored further in an upcoming course on probabilistic machine learning.

  • 00:00:00 In this section, the speaker explains the topic of the lecture, which is about getting uncertainty into machine learning and how to do the computations needed to achieve that. The lecture uses insights from previous lectures, particularly on solving integrals and using Bayesian deep learning to get uncertainties. The speaker then discusses the importance of uncertainties in deep neural networks and the problem of asymptotic overconfidence, where the neural network gives high-confidence predictions for out-of-distribution examples that should not be classified with such certainty. The lecture aims to provide insights on how to use second-order quantities, specifically curvature estimates, to get uncertainty into deep neural networks.

  • 00:05:00 In this section, Agustinus Kristiadi discusses uncertainty in deep learning, specifically in classification networks that use ReLU non-linearities. He presents a fundamental property of ReLU classifiers: if the logit layer is a linear combination of previous layers with ReLU non-linearities, the output of the network is a piecewise linear function of the input. Moving away from the training data in this space eventually leads to a region where the classifier is linear in its input to the softmax output, and with probability one the gain of each linear output function differs. As a result, moving far enough in these regions will lead to arbitrarily high confidence for one class, which can be visually observed in the plot of three linear output features in red.
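
The effect can be reproduced with a toy calculation: far from the data a ReLU network is exactly linear, so scaling an input by a factor alpha scales the logits, and the softmax confidence of the winning class tends to one (random weights and input, purely for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 2)), rng.normal(size=3)   # 3 classes, 2-d inputs
x = rng.normal(size=2)

for alpha in [1, 10, 100, 1000]:        # move further and further from the data
    p = softmax(W @ (alpha * x) + b)    # in a linear region the logits are W x + b
    print(alpha, p.max().round(4))      # the winning class's confidence approaches 1.0
```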

  • 00:10:00 In this section, Agustinus Kristiadi explains the fundamental property of ReLU classifiers that creates high confidence in certain classes, and why it can't be fixed simply by retraining weights. The solution is to add uncertainty to the neural network weights, and to do that we need a Bayesian interpretation of the neural network, which can be obtained by viewing training as maximizing the exponential of the function being minimized. This means that deep learning is already doing Bayesian inference, but only the mode of the posterior is being computed, which can be problematic. A common setting for supervised problems with continuous outputs is the quadratic loss with a weight-decay regularizer, which is equivalent to putting a Gaussian prior on the weights and a Gaussian likelihood on the data.

  • 00:15:00 In this section, the speaker discusses uncertainty in deep learning and the Bayesian interpretation of deep neural networks. The speaker notes that the full posterior distribution needed for predictions is intractable. While Monte Carlo approaches are theoretically well-founded, they are time-consuming and put those doing Bayesian inference at a disadvantage. Thus, the speaker argues for the cheapest possible way to do integrals: automatic differentiation coupled with linear algebra. The speaker shares the surprising result that any Gaussian approximate measure on even just the last layer's weights of the network already partially solves the problem of overconfidence, as demonstrated in a theorem. The speaker emphasizes that it doesn't matter if the probability distribution on the weights is correct; adding any probability measure on the weights can heal the confidence problem.

  • 00:20:00 In this section, the speaker explains how a Gaussian distribution can be applied to the last layer's weights of a deep neural network's classification layer to address the problem of overconfidence. The speaker assumes that any covariance of the Gaussian can be used because it does not matter much, and the mean of the distribution is given by the trained weights of the network. The Gaussian is then used to approximate the integral of the softmax over f_θ(x*); David MacKay's approximation is used to compute the softmax over the derived variable that has the mean prediction the network would otherwise output. The blue lines in the visualization depicting this approximation are bounded away from one, which provides a solution to overconfidence in classification.
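
A sketch of the kind of approximation referred to here, the probit approximation usually attributed to MacKay, written for a diagonal logit covariance (the numbers are illustrative): each logit is shrunk by a variance-dependent factor before the softmax, which bounds the confidence away from one.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mackay_softmax(mean, var):
    """Approximate E[softmax(f)] for f ~ N(mean, diag(var)) by rescaling each
    logit with the probit factor 1 / sqrt(1 + pi * var / 8) before the softmax."""
    return softmax(mean / np.sqrt(1.0 + np.pi * var / 8.0))

mean = np.array([5.0, 0.0, -2.0])                          # mean network output
print(softmax(mean).max())                                 # ~0.99: plain softmax is very confident
print(mackay_softmax(mean, var=np.full(3, 50.0)).max())    # large variance: bounded away from 1
```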

  • 00:25:00 In this section, Agustinus Kristiadi discusses the importance of uncertainty in deep learning, specifically with regards to the weights of neural networks. He argues that it is crucial to take into account that we don't quite know the weights and to avoid assuming that we know something if we don't, as it can create issues. Mathematical approximations such as linearizing and using a Gaussian distribution on the weights can be made, and it has been proven that as long as we're ever so slightly uncertain, it will be fine. The choice of Sigma can be made with automatic differentiation with curvature estimates, which is the fastest and cheapest method.

  • 00:30:00 In this section, Agustinus Kristiadi explains how we can use the Hessian matrix to form a Gaussian approximation after finding the mode of the loss function through deep learning. The Hessian matrix, which contains the second-order derivative of the loss function, is used to construct approximations. Although the Gaussian approximation is local and not perfect, it is totally analytic, making it a favorable approximation. To utilize this approximation, we need a trained neural network, and once the network is trained, we can get the Hessian at that point using AutoDiff, which is a closed-form process that just works.
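
A minimal sketch of that recipe on a toy one-parameter problem, using PyTorch's autodiff to get the Hessian at the mode (the model, data, and hyperparameters are made up for illustration):

```python
import torch

# Toy problem: 1-parameter linear model, quadratic loss + weight decay,
# standing in for a trained network's loss around its mode.
X = torch.tensor([0.0, 1.0, 2.0, 3.0])
y = torch.tensor([0.1, 0.9, 2.1, 2.9])

def loss(w):                                  # negative log posterior (up to a constant)
    return ((w[0] * X - y) ** 2).sum() + 0.1 * w[0] ** 2

w_map = torch.ones(1, requires_grad=True)
opt = torch.optim.Adam([w_map], lr=0.05)
for _ in range(500):                          # "training": find the mode of the loss
    opt.zero_grad()
    loss(w_map).backward()
    opt.step()

# Laplace approximation: the curvature at the mode is the Gaussian's precision
H = torch.autograd.functional.hessian(loss, w_map.detach())   # shape (1, 1)
print(w_map.item(), "approximate posterior variance:", (1.0 / H[0, 0]).item())
```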

  • 00:35:00 In this section, the speaker discusses the concept of uncertainty in deep learning and how to evaluate it using the Hessian matrix. The Hessian matrix can be computed after training the deep neural network and provides a way to estimate uncertainty without adding cost to the training of the network. The speaker also notes that this approach allows for keeping the point estimate, which can be useful for practical applications. However, there are downsides, such as the Hessian being expensive to compute, and approximations are needed to make it tractable. The generalized Gauss-Newton matrix is one such approximation that can be used in practice.

  • 00:40:00 In this section, Agustinus Kristiadi discusses uncertainty in deep learning and how the generalized Gauss-Newton (GGN) matrix can be used to estimate the curvature of a neural network. He explains that the GGN is positive semi-definite and has a nice connection to linearization, which can result in a tractable model when combined with the Laplace approximation. This model can be used for regression and produces a Gaussian process with its mean function given by the output of the neural network.

  • 00:45:00 In this section, the speaker discusses uncertainty in deep learning, particularly in neural networks. They note that the covariance function is built from the Jacobian of the network at the mode of the loss, with an inner product through the inverse of the Hessian. The speaker mentions that this can be carried over to classification using a simple approximation developed by David MacKay. The process involves defining the loss function, computing the Hessian of the loss, and the Jacobian of the trained network with respect to the weights; combining the two in a product gives a predictive function for f(x*) that is still non-linear in x but linear in weight space. The speaker highlights that this process helps avoid overconfidence, particularly in classification.
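
The quantity described here can be sketched in a few lines: with J(x*) the Jacobian of the network output with respect to the weights at the mode and H the (approximate) Hessian there, the linearized-Laplace predictive variance is J(x*) H⁻¹ J(x*)ᵀ. The numbers below are illustrative.

```python
import numpy as np

def laplace_predictive_variance(jac_x, H):
    """Predictive variance of the linearized model at one input:
    Var[f(x*)] = J(x*) @ H^{-1} @ J(x*).T."""
    return jac_x @ np.linalg.solve(H, jac_x.T)

rng = np.random.default_rng(0)
H = np.eye(5) * 10.0                        # 5 weights, sharp loss at the mode
jac_near = rng.normal(size=(1, 5)) * 0.1    # small Jacobian near the training data
jac_far = rng.normal(size=(1, 5)) * 5.0     # large Jacobian far away (illustrative)
print(laplace_predictive_variance(jac_near, H))   # small variance
print(laplace_predictive_variance(jac_far, H))    # much larger variance
```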

  • 00:50:00 In this section, Agustinus Kristiadi discusses the Bayesian formalism and how it can be useful in deep learning. By linearizing the network in its weights and using Laplace approximation, we can reduce the intractable integral over the posterior to a simplified form of the posterior and the loss function. This process can provide us with a measure of how well our model fits the data, which is useful in adapting parameters or aspects of the model. By computing the evidence for the data, we can pick whichever model has the highest evidence and choose the one that is closer to the data.

  • 00:55:00 In this section, the speaker discusses how to use Laplace approximations to select models and parameters of a neural network. The speaker explains that the Hessian depends on the shape of the loss function and that as you add more layers, the loss function might become narrower, leading to a better fit; a plot suggests that around two to four layers is probably the best choice. The speaker also notes that the Occam factor is not as straightforward as it is for Gaussian processes, since the Hessian has a non-trivial effect on how well the model can explain the data. The speaker then shows a visualization of a deep neural network with a linearized Laplace approximation for a classification problem and explains how a prior precision parameter affects the model's confidence. Finally, the speaker discusses how Laplace approximations can be used to select discrete choices like the number of layers, or a parameter like the prior precision using gradient descent.
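
A sketch of the quantity being compared (the standard Laplace approximation to the log evidence; the two "models" below are just made-up numbers): the fit term rewards a low loss at the mode, while the determinant term, the Occam factor, penalizes models with many sharply determined parameters.

```python
import numpy as np

def laplace_log_evidence(loss_at_mode, H):
    """Laplace approximation to the log model evidence:
    log p(D) ~ -L(theta*) + D/2 * log(2*pi) - 1/2 * log det H,
    with L the training loss (negative log joint) at its mode and H its Hessian there."""
    D = H.shape[0]
    _, logdet = np.linalg.slogdet(H)
    return -loss_at_mode + 0.5 * D * np.log(2 * np.pi) - 0.5 * logdet

# Hypothetical comparison: model B fits a bit better but pays a larger Occam penalty
H_A = np.eye(3) * 50.0                     # 3 sharply determined parameters
H_B = np.eye(30) * 50.0                    # 30 sharply determined parameters
print(laplace_log_evidence(10.0, H_A))
print(laplace_log_evidence(9.5, H_B))      # lower evidence despite the better fit
```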

  • 01:00:00 In this section, the speaker discusses uncertainty in deep learning and how it can be addressed using linearized Laplace approximations. This method uses a probabilistic approach to determine the prior precision of the layers; however, while it works well for selecting a prior precision, it may not work as well for other choices, such as the number of layers. The speaker then discusses the linearized Laplace approximation and how it can be used as a black-box tool to turn a deep neural network into a Gaussian model to deal with uncertainty. Finally, the speaker discusses a way to fix the problem of models not having uncertainty on their weights, which involves adding a simple fix to the network.

  • 01:05:00 In this section, Agustinus Kristiadi discusses the idea of adding an unbounded number of weights to account for the infinite complexity of data in deep neural networks. He explains that adding an infinite number of features addresses the problem, and shows how keeping track of these infinitely many features does not have to be costly. Asymptotically, the uncertainty approaches the maximum-entropy value of 1/C (for C classes), without adding more complexity to the model.

  • 01:10:00 In this section, the speaker explains how uncertainty can be added to deep learning to improve predictions, particularly in areas where there is little training data or there are adversarial inputs. The approach involves training the mean of the network and then adding units that don't change the point prediction but add uncertainty, which can be moved and scaled. This technique is called out-of-distribution training and can be achieved using a length scale based on the breadth of the data to define an approximate Gaussian process. The cost of adding uncertainty is negligible, and it only adds a backstop that reduces confidence if the data is far from the training data.

  • 01:15:00 In this section, the speaker discusses how to introduce uncertainty into a deep learning model. One way to do this is through out-of-distribution training, where a new data set is created from images that do not contain the objects used in the original training set, and the network is trained to be uncertain in those regions. By defining a loss function that includes an out-of-distribution term, the Hessian, i.e. the curvature estimate of the loss at its mode, can be adjusted to produce the desired amount of uncertainty; a minimal sketch of such a loss follows below. The speaker also notes that introducing uncertainty is important in deep learning as it can help prevent pathologies and overconfidence in the model.
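
A minimal sketch of such a combined loss (one simple variant, not necessarily the exact formulation used in the lecture): the usual cross-entropy on in-distribution data plus a term pushing the predictions on out-of-distribution inputs toward the uniform, maximum-entropy distribution.

```python
import torch
import torch.nn.functional as F

def ood_training_loss(logits_in, targets_in, logits_out, ood_weight=1.0):
    """Cross-entropy on in-distribution data plus a cross-entropy against
    uniform targets on out-of-distribution data."""
    ce_in = F.cross_entropy(logits_in, targets_in)
    log_probs_out = F.log_softmax(logits_out, dim=-1)
    uniform = torch.full_like(log_probs_out, 1.0 / log_probs_out.shape[-1])
    ce_out = -(uniform * log_probs_out).sum(dim=-1).mean()
    return ce_in + ood_weight * ce_out

# Toy batch: 4 in-distribution and 4 OOD examples, 3 classes
logits_in = torch.randn(4, 3)
targets_in = torch.tensor([0, 1, 2, 0])
logits_out = torch.randn(4, 3)
print(ood_training_loss(logits_in, targets_in, logits_out))
```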

  • 01:20:00 In this section, the speaker discusses the concept of adding uncertainty to a classifier without changing its fundamental structure. The linearization of the network in the weight space can allow this to happen, and by computing the Jacobian and Hessian of the loss function, we can turn a deep neural network into a Gaussian process. Adding functionality to the network such as asymptotic calibrated confidence can be done with this technique. The speaker emphasizes the importance of probabilistic training and the use of probability measures in machine learning without the need for full posterior tracking. This approach can solve problems such as overconfidence while avoiding the cost of finding the ideal posterior. Finally, the speaker suggests that the use of these techniques will be explored further in the upcoming course on probabilistic machine learning.
Numerics of ML 13 -- Uncertainty in Deep Learning -- Agustinus Kristiadi
  • 2023.02.06
  • www.youtube.com
The thirteenth lecture of the Master class on Numerics of Machine Learning at the University of Tübingen in the Winter Term of 2022/23. This class discusses ...
 

Lecture 14 -- Conclusion -- Philipp Hennig



Numerics of ML 14 -- Conclusion -- Philipp Hennig

Philipp Hennig gives a summary of the "Numerics of Machine Learning" course, emphasizing the importance of solving mathematical problems in machine learning related to numerical analysis, such as integration, optimization, differential equations, and linear algebra. He discusses the complexity of performing linear algebra on a data set and how it relates to the processing unit and disk. Hennig also covers topics such as handling data sets of non-trivial sizes, algorithms for solving linear systems, solving partial differential equations, and estimating integrals. He concludes by acknowledging the difficulty in training deep neural networks and the need for solutions to overcome the stochasticity problem.

In the conclusion of his lecture series, Philipp Hennig emphasizes the importance of going beyond just training machine learning models and knowing how much the model knows and what it doesn't know. He talks about estimating the curvature of the loss function to construct uncertainty estimates for deep neural networks and the importance of being probabilistic but not necessarily applying Bayes' theorem in every case due to computational complexity. Hennig also emphasizes the importance of numerical computation in machine learning and the need to develop new data-centric ways of computation. Finally, he invites feedback about the course and discusses the upcoming exam.

  • 00:00:00 In this section, Philipp Hennig provides a summary of the entire Numerics of Machine Learning course, which he believes is crucial due to the variation of content from various lecturers. He explains that machine learning essentially involves solving mathematical problems that don't have closed-form solutions as opposed to classic AI, which involves algorithms. The problems in machine learning are related to numerical analysis and include integration, optimization, differential equations, and linear algebra. Hennig emphasizes the importance of understanding the complexity of doing linear algebra on a data set and how it is relevant to the processing unit and disk.

  • 00:05:00 In this section, Philipp Hennig discusses linear algebra's role in machine learning and specifically in Gaussian process regression. He explains that to learn a predictive distribution, which has a mean and covariance, we need to solve a linear system of equations involving inverting a matrix times a vector. There are many algorithms for solving such linear systems, including the classic algorithm called the Cholesky decomposition, which can be viewed as an iterative procedure constructing the inverse of the matrix. Hennig notes that this approximation can be used as an estimate for the inverse of the matrix, but its quality may vary depending on the data order.

  • 00:10:00 In this section, Philipp Hennig explains that it is only linearly expensive to go through a data set in some random order, loading parts of it from disk while ignoring the rest. He compares this to what students learn in a probabilistic machine learning class, namely solving two different linear optimization problems to solve one equation. He also highlights that two sources of uncertainty arise, the finite data set and the limited computation, neither of which gives the full solution.

  • 00:15:00 In this section of the video, Philipp Hennig explains the complexity of solving linear problems in Bayesian inference and Gaussian process regression. The level of expense is much more subtle than what most people might have learned. The four main takeaways are that you may opt not to look at the entire data set; that you can use a Cholesky-like algorithm that gives an estimate at cost linear in the data-set size and quadratic in the number of iterations; that you can use a more efficient algorithm that converges quickly but is quadratically expensive in each iteration; or that you can opt for the full Cholesky decomposition, which costs cubically in the number of data points.

  • 00:20:00 In this section, Hennig discusses the importance of properly handling data sets of non-trivial sizes, and the decision of how to operate on them efficiently. He also goes on to explain how to handle infinite dimensional data sets, specifically in regards to systems that evolve through time, as well as the algorithm used for linear time dependent and time invariant problems, known as Kalman filtering and smoothing. Hennig highlights that this type of algorithm is both easily written down and linearly expensive in the number of time steps. He also emphasizes the importance of understanding the low levels of the computational hierarchy, as it can be used to speed up performance in higher level algorithms.
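
For reference, a minimal linear-Gaussian Kalman filter, showing the one-predict-one-update structure that makes the cost linear in the number of time steps (the random-walk model and noise levels are illustrative):

```python
import numpy as np

def kalman_filter(ys, A, Q, H, R, m0, P0):
    """Linear-Gaussian Kalman filter: one predict and one update per observation."""
    m, P = m0, P0
    means = []
    for y in ys:
        m, P = A @ m, A @ P @ A.T + Q                 # predict
        S = H @ P @ H.T + R                           # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)                # Kalman gain
        m = m + K @ (y - H @ m)                       # update mean
        P = (np.eye(len(m)) - K @ H) @ P              # update covariance
        means.append(m.copy())
    return np.array(means)

# Toy 1-d random walk observed with noise
rng = np.random.default_rng(0)
truth = np.cumsum(rng.normal(size=50))
ys = (truth + rng.normal(scale=0.5, size=50)).reshape(-1, 1)
means = kalman_filter(ys, A=np.eye(1), Q=np.eye(1), H=np.eye(1), R=0.25 * np.eye(1),
                      m0=np.zeros(1), P0=np.eye(1))
print(np.abs(means[:, 0] - truth).mean())             # filtered estimate tracks the truth
```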

  • 00:25:00 In this section of the video, Philipp Hennig discusses the smoother algorithm, which serves as a bookkeeping algorithm that informs all earlier variables in the chain about the observations it has made in the future. He also talks about how fast algorithms can be applied to settings where observations are not a linear Gaussian transformation of the state space, and for the dynamics of the extended Kalman filter. Hennig also touches on the algorithmic landscapes and structure of this framework, which is very flexible and can be used to construct a powerful algorithm to solve differential equations.

  • 00:30:00 In this section, Philipp Hennig discusses how algebraic implicit equations, continuous group symmetries, and partial differential equations can all be included in the same algorithmic language as ordinary differential equations in machine learning. He also mentions the value of incorporating observations of a system, such as measuring the path it took or knowing where it started and ended, in determining unknown values in parts of the state space. Hennig notes that as simulation packages become more diverse, it becomes less necessary to have an extensive knowledge of simulation methods, as the simulation method can essentially be seen as a filter.

  • 00:35:00 In this section of the video, Philipp Hennig discusses how these methods manage information, stating that there isn't really a difference between information that comes from a disk or a sensor attached to the computer and information that comes from the programmer who has written it down as an algebraic equation. He also mentions that the information operator acts as an interface between the user and the algorithm designer. He then explains how to solve partial differential equations, which is essentially the same thing as filtering simulation methods, using Gaussian process regression. However, he notes that if the partial differential equation is not linear, it can't be solved using a filter.

  • 00:40:00 In this section, Philipp Hennig summarizes the conclusion of the "Numerics of ML" series, which covers differential equations and integration in machine learning. He first talks about Gaussian process inference with functions, which can be complex due to the nature of function spaces. However, by observing nonlinear functions and applying various sources of information, such as partial differential equations and boundary values, they can be combined in a large Gaussian process inference scheme, resulting in a quantified representation of the dynamical system. Hennig then moves on to integration in probabilistic inference, where he introduces the Monte Carlo algorithm, which is an unbiased estimator that converges slowly, but works on any integrable function.

  • 00:45:00 In this section, Philipp Hennig discusses the best approaches to estimating integrals for machine learning. He suggests that the rate at which an estimate of the integral converges to the true value of the integral is 1 over the square root of the number of samples, which is dependent on the algorithm used. However, Bayesian quadrature, an algorithm that spends a lot of time modeling the integrand, can perform really well, especially in low-dimensional problems, and can converge much faster than Monte Carlo, even super polynomially fast. Hennig suggests that building algorithms that work well only for a small class of problems can work better for each instance of that problem, but may break badly outside of that class. Ultimately, the best algorithm will depend on the nature of the problem being solved.
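
The contrast can be illustrated with a classical quadrature rule standing in for a model-based integrator (Bayesian quadrature with a suitable kernel behaves similarly on smooth, low-dimensional integrands): the Monte Carlo error shrinks like 1/sqrt(n), while the rule that exploits smoothness is essentially exact with a handful of nodes. The integrand is a toy example.

```python
import numpy as np
from math import erf, sqrt, pi

f = lambda x: np.exp(-x ** 2)               # smooth 1-d integrand on [0, 1]
exact = sqrt(pi) / 2 * erf(1.0)             # its exact integral

rng = np.random.default_rng(0)
for n in [10, 100, 1000, 10000]:
    mc = f(rng.uniform(size=n)).mean()                  # Monte Carlo estimate
    nodes, w = np.polynomial.legendre.leggauss(min(n, 30))
    quad = 0.5 * np.dot(w, f(0.5 * (nodes + 1)))        # Gauss quadrature mapped to [0, 1]
    print(n, abs(mc - exact), abs(quad - exact))
```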

  • 00:50:00 In this section, Philipp Hennig explores the challenges of contemporary machine learning numerical problems, specifically the issue of training deep neural networks. Although there are many optimizers available, they are fundamentally frustrating and inefficient, requiring constant babysitting and hyperparameter tuning, and not always working. While optimization used to be about hitting a button and watching the algorithm work perfectly, now machine learning requires a team of over 100 people to manage large language models, making it an inefficient use of resources. The main issue is stochasticity, and there is no known elegant solution for this problem yet, even though it is driving the entire machine learning community.

  • 00:55:00 In this section, Philipp Hennig concludes the lecture course on uncertainty in computation by emphasizing the difficulty in training deep neural networks. Although mini-batch gradients are evaluated due to finite data and compute, the significant noise introduced through this process actually reduces the performance of optimization algorithms. Hennig states that the solution to this problem would make training deep neural networks much faster and change the future of machine learning. In the meantime, we can still use available resources, like curvature estimates, to construct new algorithms and techniques.

  • 01:00:00 In this section, Philipp Hennig discusses the need to do more than just train deep networks in machine learning, and the importance of knowing how much the model knows and what it does not know. Hennig explains that estimating the curvature of the loss function can help construct uncertainty estimates for deep neural networks in lightweight ways, using the Laplace approximation. This can serve different use cases and can be combined with a linearization of the network in weight space to turn any deep neural network approximately into a Gaussian process, i.e. a parametric Gaussian regression algorithm. Hennig emphasizes that while being probabilistic is important, it is not necessary to apply Bayes' theorem in every case, as it can be too computationally intensive; instead, finding fast solutions that add value without being too computationally expensive is a better approach.

  • 01:05:00 In this section, Philipp Hennig emphasizes the importance of numerical computation in machine learning. He explains that numerical computations are active agents that interact with a data source and must actively decide how to use the data they receive. By taking this connection seriously, new data-centric ways of doing computation can be developed, which may be more flexible, easier to use, and easier to generalize to different settings. Hennig also highlights the importance of understanding how numerical algorithms work to become a better machine learning engineer. Finally, he invites feedback about the course and discusses the upcoming exam.
Numerics of ML 14 -- Conclusion -- Philipp Hennig
  • 2023.02.13
  • www.youtube.com
The fourteenth and final lecture of the Master class on Numerics of Machine Learning at the University of Tübingen in the Winter Term of 2022/23. This class ...
 

Support Vector Machine (SVM) in 7 minutes - Fun Machine Learning



Support Vector Machine (SVM) in 7 minutes - Fun Machine Learning

The video explains Support Vector Machines (SVM), a classification algorithm used for data sets with two classes that draws a decision boundary, or hyperplane, based on the extremes of the data set. It also discusses how SVM can be used for non-linearly separable data sets by transforming them into higher dimensional feature spaces using a kernel trick. The video identifies the advantages of SVM such as effectiveness in high-dimensional spaces, memory efficiency, and the ability to use different kernels for custom functions. However, the video also identifies the algorithm's disadvantages, such as poor performance when the number of features is greater than the number of samples and the lack of direct probability estimates, which require expensive cross-validation.

  • 00:00:00 In this section, we learn about support vector machines (SVMs) and how they can be used to classify data sets with two classes. The SVM algorithm looks at the extremes of the data set and draws a decision boundary, or hyperplane, near those extreme points; essentially, the support vector machine is the frontier that best separates the two classes. We then learn about non-linearly separable data sets and how SVMs can transform them into higher-dimensional feature spaces with a kernel trick. Popular kernel types include the polynomial kernel, radial basis function (RBF) kernel, and sigmoid kernel. However, choosing the correct kernel is a non-trivial task and may depend on the specific task at hand.
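
As a hands-on illustration of the kernel trick (a generic scikit-learn example, not from the video): on two interleaving half-moons, which are not linearly separable, an RBF-kernel SVM separates the classes where a linear one cannot.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable in the input space
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf"]:       # the kernel trick in action
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test), "support vectors:", clf.n_support_.sum())
```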

  • 00:05:00 In this section, the advantages and disadvantages of support vector machines (SVM) are discussed. SVM is effective in high-dimensional spaces and uses a subset of training points in the decision function, making it memory efficient. Different kernels can be specified for the decision function, including custom kernels, and SVM can be used in various applications, such as medical imaging, financial industry, and pattern recognition. However, the disadvantages of SVM include poor performance if the number of features is greater than the number of samples and the lack of direct probability estimates, which require expensive cross-validation.
Support Vector Machine (SVM) in 7 minutes - Fun Machine Learning
  • 2017.08.15
  • www.youtube.com
Want to learn what make Support Vector Machine (SVM) so powerful. Click here to watch the full tutorial.⭐6-in-1 AI MEGA Course - https://augmentedstartups.in...
 

'The Deep Learning Revolution' - Geoffrey Hinton - RSE President's Lecture 2019



'The Deep Learning Revolution' - Geoffrey Hinton - RSE President's Lecture 2019

Geoffrey Hinton, known as the "Godfather of Deep Learning," discusses the history and evolution of deep learning and neural networks, the challenges and exciting possibilities of using deep learning to create machines that can learn in the same way as human brains, and the tricks and techniques that have made backpropagation more effective. He also describes the success of neural networks in speech recognition and computer vision, the evolution of neural networks for computer vision and unsupervised pre-training, and their effectiveness in language modeling and machine translation. He finishes by highlighting the value of reasoning by analogy and discusses his theory of "capsules" and wiring knowledge into a model that predicts parts from the whole.

Geoffrey Hinton, a pioneer in deep learning, delivers a lecture advocating for the integration of associative memories, fast-weight memories, and multiple timescales into neural networks to allow for long-term knowledge and temporary storage, which is necessary for real reasoning. Additionally, he discusses the balancing act between prior beliefs and data, the potential of unsupervised learning, the efficiency of convolutional nets in recognizing objects when viewpoint knowledge and translational equivariance are incorporated, and the need to combine symbolic reasoning with connectionist networks, like transformer networks. He also addresses the issue of unconscious biases in machine learning and believes that they can be fixed more easily than human bias by identifying and correcting for biases. Lastly, he stresses the need for more funding and support for young researchers in the field of AI.

  • 00:00:00 If you are familiar with deep learning, you owe a lot to Professor Geoffrey Hinton, known as the "Godfather of Deep Learning", who got his PhD in artificial intelligence in Edinburgh in 1978 and has won numerous prizes for his contributions to machine learning. In the first part of his lecture, he discusses the history of deep learning and neural networks, and how they have evolved over the years. He also talks about the challenges and exciting possibilities of using deep learning to create machines that can learn in the same way as human brains.

  • 00:05:00 In this section, Geoffrey Hinton talks about the two paradigms of artificial intelligence that existed since the early 1950s. One was the logic-inspired approach which viewed intelligence as manipulating symbolic expressions using symbolic rules. The other approach, on the other hand, believed that the essence of intelligence was learning the strengths of connections in a neural network. This approach focused more on learning and perception, compared to the other approach's focus on reasoning. These different approaches led to different views of internal representations, and corresponding ways of making a computer do what you want. Hinton compares the intelligent design method with the training or learning strategy, which involves showing a computer lots of examples.

  • 00:10:00 In this section of the video, Geoffrey Hinton explains how the deep learning revolution came about, starting with training neural networks with many layers to learn complex features. Idealized neurons are used, which model linear and nonlinear functions. Networks can be trained with different methods, including supervised and unsupervised training, with backpropagation being the most efficient algorithm for supervised training. Lastly, he points out that deep learning involves perturbing the network, measuring the effect, and then changing the network accordingly, which is far more efficient than an evolutionary approach of perturbing in the face of unknown variables.

  • 00:15:00 In this section of the lecture, Dr. Hinton discusses the optimization technique of backpropagation, which calculates the gradient of weights based on the discrepancy between the actual answer and the correct answer for a small batch of training examples. He explains the process of updating the weights based on the gradient and the use of stochastic gradient descent to optimize the process. Dr. Hinton then goes on to discuss the tricks and techniques that have made backpropagation more effective, including the use of momentum and smaller learning rates for the larger gradients, ultimately concluding that using these tricks is as good as anything despite hundreds of published journal papers on more sophisticated methods. Finally, he notes that in the 1990s, the lack of proper initialization techniques for neural networks and smaller-sized datasets led to the temporary abandonment of neural nets in the machine learning community.

  • 00:20:00 In this section, Geoffrey Hinton, a leading figure in deep learning, discusses the history of deep learning research and the challenges faced by researchers in the field. He describes how, in the early days of back propagation, many papers were rejected or criticized because they focused on unsupervised learning, which did not fit with the prevailing paradigm of computer vision. However, Hinton argues that unsupervised learning, in combination with techniques like dropout, was a key factor in making back propagation work for deep networks, and has since helped to revolutionize the field of deep learning.

  • 00:25:00 In this section, Hinton explains the success of neural networks in speech recognition and computer vision. The first big application of deep learning was in speech recognition, in which a front end does acoustic modeling by taking the middle frame of a spectrogram and identifying which phoneme a person is trying to express. The first commercially relevant application of deep learning on a big scale was in speech recognition, where a front end neural network outperformed highly tuned techniques from IBM and other places. Another significant event was the ImageNet competition in 2012, where a deep neural network achieved significantly lower error rates than traditional computer vision techniques.

  • 00:30:00 In this section, Professor Geoffrey Hinton discusses the evolution of neural networks for computer vision, machine translation and unsupervised pre-training, and how the computer vision community was sceptical at first about the success of these neural networks. He goes on to discuss soft attention and transformers, and how the latter is better suited for covariances, making it more sensitive to things like eyes being the same as each other, and how unsupervised pre-training can force the neural networks to capture information about what the words around a word can tell you about what that word must mean.

  • 00:35:00 In this section, Hinton explains the difference between using convolutional neural nets and transformers for natural language processing tasks such as disambiguating word meaning based on context. While convolutional neural nets use the words around the target word to change its representation, transformers learn, by backpropagating derivatives, to turn a word vector into a query, key, and value, which are used to attend to other words and activate the corresponding representations. Transformers have proven very effective in language modeling and machine translation and have been used to develop methods such as BERT, which uses unsupervised learning to learn word embeddings through the probability of the next word fragment.

  • 00:40:00 In this section of the lecture, Hinton discusses an experiment called "GPT-2" which can generate text that seems like it was written by a human. The GPT-2 model, which contains one and a half billion parameters, was trained on billions of words of text and can produce coherent and intelligible stories. Hinton speculates that this type of reasoning is not a proper logic-based reasoning but rather an intuitive reasoning. He also points out that it is difficult to know how much the model really understands, and he questions whether the model is just doing massive amounts of association or if it understands a bit more than that.

  • 00:45:00 In this section, Geoffrey Hinton highlights the value of reasoning by analogy and its role in improving reasoning capabilities. He compares sequential reasoning to reasoning by intuition in the context of the game AlphaGo, explaining that both intuition and logical reasoning are necessary to make well-informed decisions. Hinton also discusses how convolutional neural nets have improved efficiency but fail to recognize objects in the same way humans do, leading to the conclusion that humans use coordinate frames and understand relationships between parts and the whole of an object to recognize it. This highlights the need for insights into neural net architecture to improve how they recognize objects.

  • 00:50:00 In this section, Hinton uses a task to illustrate the dependence of spatial understanding on coordinate frames. He presents a wireframe cube and asks the viewer to point to where the corners are without using a coordinate frame, revealing that people tend to think of cubes relative to their coordinate system. Hinton then discusses his theory of "capsules," which groups neurons that learn to represent fragments of shapes, and imposes a coordinate frame on each fragment to capture intrinsic geometry. He plans to train these capsules unsupervised to capture shape knowledge.

  • 00:55:00 In this section, Hinton discusses wiring knowledge into a model that predicts parts from the whole. The model is trained by a transformer that looks at the parts already extracted, takes these parts, and tries to predict what wholes would explain those parts. The transformer is good at finding correlations between things and can predict what objects might be there and what their poses are. Hinton gives an example where the model is taught about squares and triangles and can later recognize them in new images. The model can also be trained to recognize house numbers without ever being shown labels.

  • 01:00:00 In this section, we learn about the potential of unsupervised learning and the various types of neurons that could work better than the scalar non-linearity that's currently in use. The speaker urges students not to believe everything they hear and encourages the redirection of 50 years of acquired knowledge towards figuring out how to get the right substrate to do specific processing. The Q&A portion discusses the possibility of relying solely on the fastest systems for intelligence and the coherence of a transformer's memory.

  • 01:05:00 In this section, Hinton responds to a question about unconscious biases in machine learning and compares it to biases in humans. He believes that while machine learning can be biased, it is much easier to fix than human bias because biases in machine learning can be identified and corrected for by freezing the weights and measuring who the biases are against. Furthermore, he talks about explainability in machine learning and argues against legislating that systems must be explainable before they can be used, as these big neural nets have learned billions of weights that cannot be succinctly explained. However, he admits that researchers do want to understand these systems better and encourages older researchers to provide funding for younger researchers.

  • 01:10:00 In this section, Geoffrey Hinton discusses the idea that if we wire translational equivariance and more viewpoint knowledge into convolutional nets, they could be more efficient at object recognition and generalization. Additionally, he talks about the need to combine symbolic reasoning with connectionist networks, like transformer networks. Hinton believes that implementing associative memories and fast-weight memories, and having each synapse carry several timescales, can allow for long-term knowledge and temporary storage, which is necessary for real reasoning.

  • 01:15:00 In this section, the speaker responds to a question about how neural networks update based on past or current experiences. He suggests using an associative memory that is activated by the current state, rather than engaging in back propagation through time. He clarifies that every synapse should have multiple time scales to store temporaries. The discussion then moves to the topic of hallucination in systems with prior beliefs. The speaker believes that getting the balance right between prior beliefs and data is key for such systems. Finally, he discusses his ambivalence towards backpropagation, stating that while it is the right thing to do, he is surprised that only a billion weights can do quite good translation, with the human brain containing much more.

  • 01:20:00 In this section of the video, the speaker discusses how our current AI technology may not be as intelligent as we think and that the focus should be on solving this issue. They also touch on the Human Brain Project, which was funded by European funding, and question whether it will help or hinder AI development. The speaker also compliments the lecturer for being able to explain complex concepts in a way that is easy for non-experts to understand and for promoting more funding and support for young researchers in the field of AI.
'The Deep Learning Revolution' - Geoffrey Hinton - RSE President's Lecture 2019
  • 2019.07.26
  • www.youtube.com
"There have been two very different paradigms for Artificial Intelligence: the logic-inspired paradigm focused on reasoning and language, and assumed that th...
 

How ChatGPT actually works



How ChatGPT actually works

ChatGPT is a machine learning model that is able to correctly identify harmful content in chat conversations. Its architecture is based on human input, and its shortcomings are outlined. Recommended readings are also provided.

  • 00:00:00 ChatGPT is a chatbot that is designed to mitigate the model's misalignment issues. It uses reinforcement learning from human feedback to fine-tune a pre-trained model.

  • 00:05:00 ChatGPT is a machine learning model that is able to correctly identify harmful content in chat conversations. Its architecture is based on human input, and its shortcomings are outlined. Recommended readings are also provided.
How ChatGPT actually works
  • 2023.01.23
  • www.youtube.com
Since its release, the public has been playing with ChatGPT and seeing what it can do, but how does ChatGPT actually work? While the details of its inner wor...
 

Machine Learning From Scratch Full course



Machine Learning From Scratch Full course

Implementing machine learning models yourself is one of the best ways to master them. Despite seeming like a challenging task, it's often easier than you might imagine for most algorithms. Over the next 10 days, we'll be using Python and occasionally Numpy for specific calculations to implement one machine learning algorithm each day.

You can find the code in our GitHub repository: https://github.com/AssemblyAI-Examples/Machine-Learning-From-Scratch

Machine Learning From Scratch Full course
  • 2022.09.12
  • www.youtube.com
To master machine learning models, one of the best things you can do is to implement them yourself. Although it might seem like a difficult task, for most al...