Machine Learning and Neural Networks - page 41

 

Reinforcement Learning in 3 Hours | Full Course using Python

Code: https://github.com/nicknochnack/ReinforcementLearningCourse




00:00:00 - 01:00:00 The "Reinforcement Learning in 3 Hours" video course covers a range of topics in reinforcement learning, including practical implementation and bridging the theory-practice gap. The course covers everything from setting up the RL environment to building custom environments, with a focus on training reinforcement learning agents and evaluating them using different algorithms and architectures. Popular RL applications such as robotics and gaming are discussed, as well as the limitations of RL, such as its assumption that environments are Markovian and the potential for unstable training. The course uses Stable Baselines, an open-source RL library, and OpenAI Gym to build simulated environments. The instructor explains the different types of spaces used to represent actions and values that agents can take in an environment, as well as different RL algorithms such as A2C and PPO. The importance of understanding the environment before implementing algorithms is emphasized, and users are guided through setting up the compute platform for reinforcement learning, choosing appropriate RL algorithms, and training and testing the model.

01:00:00 - 02:00:00 This YouTube video provides a three-hour course on reinforcement learning using Python. The instructor explains the core components of reinforcement learning, including the agent, environment, action, and reward. The section discusses how to define an environment, train a model using reinforcement learning, and view training logs using TensorBoard to monitor the training process. The lecturer also covers other topics, such as saving and reloading a trained model, testing and improving model performance, defining a network architecture for a custom actor and value function in a neural network, and using reinforcement learning to play the Atari game Breakout. Additionally, the course includes three projects that learners will build using reinforcement learning techniques, including Breakout game in Atari, building a racing car for autonomous driving, and creating custom environments using the OpenAI Gym spaces.

02:00:00 - 03:00:00 This YouTube video titled "Reinforcement Learning in 3 Hours | Full Course using Python" covers various topics related to reinforcement learning. The instructor demonstrates how to train a reinforcement learning agent for Atari games and autonomous driving using the racing car environment. They also introduce the OpenAI Gym dependencies, helper libraries, and Stable Baselines components, as well as different types of spaces for reinforcement learning. Additionally, the video covers how to create a custom environment for reinforcement learning, defining the state of the environment, its observation and action spaces, testing and training the model, and saving the trained model after learning. The instructor also discusses the importance of training models for longer periods for better performance and encourages viewers to reach out if they encounter any difficulties.

  • 00:00:00 In this section of the video, the presenter introduces the reinforcement learning course and outlines the different topics that will be covered throughout the course. He explains that the course is designed to bridge the gap between theory and practical implementation, and covers everything from setting up the RL environment to building custom environments. The presenter gives a high-level overview of reinforcement learning, its applications, and some of its limitations. The course will provide hands-on experience in training reinforcement learning agents and testing and evaluating them using different algorithms and architectures, as well as cover three different projects focused on the breakout environment, self-driving environment, and custom environments.

  • 00:05:00 In this section of the video on "Reinforcement Learning in 3 Hours", the instructor explains the fundamental concepts of reinforcement learning. A reinforcement learning agent learns based on the rewards it gets from the environment by taking different actions. The agent observes the environment to maximize its rewards over time by making certain decisions. The instructor also discusses some practical applications of reinforcement learning, such as autonomous driving, securities trading, and neural network architecture search.

  • 00:10:00 In this section, the video discusses some of the popular applications of reinforcement learning, including robotics, where simulated environments can be used to train robots to perform specific tasks. The video also mentions gaming as another popular application where the reward function can differ each time, making it a suitable environment for reinforcement learning. The limitations of reinforcement learning are also discussed, including the assumption that the environment is Markovian and the fact that training can be time-consuming and unstable. The setup for the reinforcement learning model is discussed, which includes installing the Stable Baselines library, which builds on OpenAI Baselines, and utilizing its helpful guides and documentation.

  • 00:15:00 In this section, the instructor introduces the course and outlines the 10 different steps that will be covered. The first step is to import and load the necessary dependencies, including Stable Baselines, an open-source library for reinforcement learning. The instructor explains the different algorithms available within the library, and the benefits of using PPO (Proximal Policy Optimization). The dependencies also include os, for operating system functionality, and gym, for building and working with environments. Overall, the process is straightforward and requires only a few lines of code to get started with Stable Baselines.

  • 00:20:00 In this section, the instructor discusses the dependencies and environments required for reinforcement learning. They introduce Stable Baselines, which allows for faster training through the vectorization of environments, and the DummyVecEnv wrapper. They also explain how OpenAI Gym can be used to build simulated environments, which can reduce costs and allow for faster model production. They provide examples of real environments, such as a robot, as well as simulated environments, like those in OpenAI Gym, which has a lot of documentation and support.

  • 00:25:00 In this section of the video, the instructor discusses the standard for creating reinforcement learning environments, which is OpenAI Gym. He explains that OpenAI Gym provides pre-built environments, including those based on actual robots such as Fetch Robot and Shadow Hand Robot. He further explains the different types of spaces that OpenAI Gym supports, including Box, Discrete, Tuple, Dict, MultiBinary, and MultiDiscrete. He notes that these spaces are used to represent the different types of values or actions that agents can take in the environment. The instructor then introduces the Classic Control environment, specifically the CartPole problem, as an example that he will use to train a reinforcement learning agent. The goal is to keep a pole balanced on a cart by pushing the cart to the left or right using two actions.

  • 00:30:00 In this section, the instructor explains how to load and test an environment using OpenAI Gym. They start by instantiating the CartPole-v0 environment using two lines of code. Then, they demonstrate how to test the environment by looping through multiple episodes and using env.reset() to get the initial set of observations. These observations will later be passed to a reinforcement learning agent to determine the best action to maximize the reward. The instructor notes the importance of understanding the environment before implementing any algorithms.

  • 00:35:00 In this section, the instructor explains the code used for sampling an environment in reinforcement learning. The code sets a maximum number of steps for the environment, sets up a score counter, and generates a random action based on the action space defined by the environment. The observations returned by the environment after each action are accompanied by a reward and a value indicating whether the episode is done. The results are printed out and the environment is closed after testing. The instructor also explains the concept of an observation space and demonstrates how it can be sampled.
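The sampling loop described above can be sketched roughly as follows. This is a minimal sketch, not the instructor's exact code: it assumes the classic Gym API in use at the time of the course (`reset()` returns an observation and `step()` returns a 4-tuple), and `StubEnv` is a hypothetical stand-in so that the snippet runs without Gym installed. With Gym available, `gym.make('CartPole-v0')` would take its place.

```python
import random


class StubEnv:
    """Hypothetical stand-in exposing the classic Gym reset/step/close interface."""

    class _ActionSpace:
        def sample(self):
            return random.choice([0, 1])  # two discrete actions, as in CartPole

    def __init__(self, max_steps=20):
        self.action_space = self._ActionSpace()
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        self.steps = 0
        return [0.0, 0.0, 0.0, 0.0]  # dummy observation with four values

    def step(self, action):
        self.steps += 1
        done = self.steps >= self.max_steps  # episode ends after max_steps
        return [0.0, 0.0, 0.0, 0.0], 1.0, done, {}

    def close(self):
        pass


def run_random_episodes(env, episodes=5):
    """Play `episodes` episodes with random actions and return each episode's score."""
    scores = []
    for episode in range(1, episodes + 1):
        obs = env.reset()  # initial observation
        done = False
        score = 0
        while not done:
            action = env.action_space.sample()          # random action
            obs, reward, done, info = env.step(action)  # apply it to the environment
            score += reward
        print(f"Episode: {episode} Score: {score}")
        scores.append(score)
    env.close()
    return scores


scores = run_random_episodes(StubEnv(max_steps=20), episodes=3)
```

Newer Gym and Gymnasium releases changed the `reset()` and `step()` return values, so the unpacking would need adjusting there.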

  • 00:40:00 In this section, the instructor explains the two parts of the environment, the action space and observation space, and how they are represented in the OpenAI Gym documentation. The observation space consists of four values representing the cart position, cart velocity, pole angle, and pole angular velocity. The action space has two possible actions, zero or one, where zero pushes the cart to the left and one pushes the cart to the right. The section also highlights the different types of algorithms in reinforcement learning, model-based and model-free, and how they differ. The instructor focuses on model-free reinforcement learning and delves into the A2C and PPO algorithms, which will be used in the training stage.

  • 00:45:00 In this section of the video, the instructor explains how to choose the appropriate reinforcement learning algorithm based on the action space of the environment being used. He goes on to explain the different types of algorithms available in Stable Baselines, such as A2C, DDPG, DQN, HER, PPO, SAC, and TD3, and which action spaces they work best with. The instructor also discusses the training metrics that should be considered during training, such as evaluation metrics, time metrics, loss metrics, and other metrics. He reminds users that Stable Baselines can be installed with or without GPU acceleration and provides instructions for installing PyTorch if GPU acceleration is desired.

  • 00:50:00 In this section, the instructor discusses how to set up the compute platform for reinforcement learning, which is crucial for those who want to leverage GPU acceleration. CUDA and cuDNN are only supported on NVIDIA GPUs, so users need an NVIDIA GPU to take advantage of CUDA-based GPU acceleration. AMD GPUs, on the other hand, are supported by ROCm, a beta package only available on Linux. The instructor also emphasizes that traditional deep learning may see more performance improvement from using a GPU than reinforcement learning. Finally, the instructor defines the log path and instantiates the algorithm and agent.

  • 00:55:00 In this section, the instructor demonstrates how to wrap a non-vectorized environment inside of a dummy vectorized environment using a lambda function. Then, they define the model, which is the agent that will be trained, as PPO and pass through the policy, environment, verbose, and tensorboard log path as arguments. The instructor goes on to explain the various hyperparameters that can be passed through to the PPO algorithm. Finally, they demonstrate how to train the model using the model.learn function and passing through the number of time steps to train it for, which in this example is set at 20,000. After the model is trained, the instructor tests it out and checks the training metrics.
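The training setup described above might look roughly like the sketch below. It assumes Stable Baselines3 with a compatible Gym version; the `Training/Logs` path and the function name `train_ppo_cartpole` are illustrative choices, not taken from the video. The imports are deferred into the function so the snippet can be loaded even where the libraries are not installed.

```python
import os


def train_ppo_cartpole(total_timesteps=20000,
                       log_path=os.path.join("Training", "Logs")):
    """Train a PPO agent on CartPole and return the trained model.

    Assumes Stable Baselines3 and Gym are installed; imports are deferred
    so that merely defining this function does not require them.
    """
    import gym
    from stable_baselines3 import PPO
    from stable_baselines3.common.vec_env import DummyVecEnv

    base_env = gym.make("CartPole-v0")
    # Wrap the single, non-vectorized environment in a dummy vectorized one.
    vec_env = DummyVecEnv([lambda: base_env])

    # Policy, environment, verbosity and TensorBoard log path, as described above.
    model = PPO("MlpPolicy", vec_env, verbose=1, tensorboard_log=log_path)
    model.learn(total_timesteps=total_timesteps)
    return model
```

Calling `train_ppo_cartpole()` would train for the 20,000 time steps mentioned above; other PPO hyperparameters are left at the library defaults here.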


Part 2

  • 01:00:00 In this section of the video, the instructor shows how to save and reload a trained model. The model is saved using the `model.save()` function and a path is defined to locate the saved model. The instructor then demonstrates how to delete the model from memory and reload it using the `PPO.load()` function. The next step is to test the trained model to see how it performs. The instructor explains that rollout metrics depend on the algorithm used for training and shows that the `A2C` algorithm provides these metrics during training, whereas the `PPO` algorithm requires an explicit command to generate these metrics.

  • 01:05:00 In this section, the video explains how to use the evaluate_policy method to test the performance of the model and determine whether the PPO model is considered 'solved' in this particular case. The model is considered solved if it scores an average of 200 or more. The method is passed the model, the environment, the number of episodes to test, and whether rendering is required. It returns the average reward and the standard deviation of that reward, and the environment is then closed using env.close(). Finally, the video highlights how to deploy the model in an encapsulated function.
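A minimal sketch of the save, reload, and evaluate steps, assuming Stable Baselines3. The save path and the function name are hypothetical, and rendering is switched off here so the sketch does not need a display.

```python
import os


def save_reload_and_evaluate(model, env,
                             save_path=os.path.join("Training", "Saved Models", "PPO_CartPole"),
                             n_eval_episodes=10):
    """Save `model`, reload it from disk, and return (mean_reward, std_reward).

    Assumes Stable Baselines3 is installed; imports are deferred so that
    defining this function does not require the library.
    """
    from stable_baselines3 import PPO
    from stable_baselines3.common.evaluation import evaluate_policy

    model.save(save_path)                  # write the trained model to disk
    del model                              # drop the in-memory copy
    model = PPO.load(save_path, env=env)   # reload it from the saved file

    # Average reward and its standard deviation over the test episodes.
    mean_reward, std_reward = evaluate_policy(
        model, env, n_eval_episodes=n_eval_episodes, render=False
    )
    env.close()
    return mean_reward, std_reward
```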

  • 01:10:00 In this section, the instructor demonstrates how to use the observations from the environment to predict the best action using the agent, in order to maximize rewards. The code block shows the key change: using model.predict instead of env.action_space.sample() so that actions come from the model. The instructor shows that the agent performs better than random steps and balances the pole. The code also shows the observations being passed to the model.predict function, which returns two values: the model action and the next state. The first value is used here to determine the best action for the agent.
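The change from random actions to model-chosen actions can be sketched as below. `AlwaysRightModel` and `TenStepEnv` are hypothetical stand-ins so the snippet runs on its own; a trained Stable Baselines model and a Gym environment (classic 4-tuple `step()` API assumed) would replace them in practice.

```python
class AlwaysRightModel:
    """Hypothetical stand-in for a trained model: always picks action 1."""

    def predict(self, obs):
        # Mirrors the two return values described above: (action, next state).
        return 1, None


class TenStepEnv:
    """Hypothetical stand-in environment that ends every episode after 10 steps."""

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        self.t += 1
        return [0.0, 0.0, 0.0, 0.0], 1.0, self.t >= 10, {}


def run_agent_episode(model, env):
    """Run one episode with `model` choosing every action; return the total reward."""
    obs = env.reset()
    done = False
    score = 0
    while not done:
        action, _states = model.predict(obs)        # replaces env.action_space.sample()
        obs, reward, done, info = env.step(action)  # only the action is used
        score += reward
    return score


total = run_agent_episode(AlwaysRightModel(), TenStepEnv())
```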

  • 01:15:00 In this section, the instructor explains the core components of reinforcement learning: the agent, environment, action, and reward. He demonstrates how to define an environment and train a model using reinforcement learning to accumulate a value of one every time by keeping the pole in an upright position and not falling down. The instructor also shows how to view the training logs using TensorBoard to monitor the training process.

  • 01:20:00 In this section of the video, the instructor explains how to use TensorBoard in a Jupyter notebook to view the training metrics of a reinforcement learning model. He demonstrates how to run the TensorBoard command using a magic command and shows how to specify the training log path. The instructor also shows how to view the training metrics, such as frames per second, entropy loss, learning rate, and policy gradient loss, in TensorBoard. He emphasizes that the average reward is the most important metric to monitor when tuning the model's performance. He concludes by inviting feedback and comments from viewers.

  • 01:25:00 In this section, the video discusses two key metrics to determine the performance of a reinforcement learning model - the reward metrics and the average episode length. The video also provides three strategies to improve model performance if it is not performing well, including training for a longer period, hyperparameter tuning, and exploring different algorithms. The section then delves into callbacks, alternate algorithms, and architectures, specifically discussing how to set up a callback to stop training once a reward threshold is met and explore different neural network architectures and algorithms. The video also highlights the importance of using callbacks for large models that require a longer training time.

  • 01:30:00 In this section, the instructor explains how to use callbacks in reinforcement learning for more flexible and efficient training. Two callbacks are used in the example: the stop callback and the eval callback. The stop callback specifies the average reward after which the training should stop, while the eval callback evaluates the best new model and checks whether it has passed the reward threshold. The instructor also demonstrates how to change the policy by specifying a new neural network architecture. Overall, callbacks provide greater control over reinforcement learning models, allowing for more customized and effective training.
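A sketch of how the two callbacks might be wired together, assuming Stable Baselines3. The threshold of 200, the evaluation frequency, the save path, and the function name are illustrative assumptions rather than values confirmed by the video.

```python
import os


def train_with_early_stopping(env, reward_threshold=200, total_timesteps=20000,
                              save_path=os.path.join("Training", "Saved Models")):
    """Train PPO, stopping early once the evaluated mean reward passes the threshold.

    Assumes Stable Baselines3 is installed; imports are deferred so that
    defining this function does not require the library.
    """
    from stable_baselines3 import PPO
    from stable_baselines3.common.callbacks import (
        EvalCallback,
        StopTrainingOnRewardThreshold,
    )

    # Stop callback: ends training once the mean reward reaches the threshold.
    stop_callback = StopTrainingOnRewardThreshold(
        reward_threshold=reward_threshold, verbose=1
    )
    # Eval callback: evaluates periodically, saves the best model, and
    # triggers the stop callback whenever a new best model is found.
    eval_callback = EvalCallback(
        env,
        callback_on_new_best=stop_callback,
        eval_freq=10000,
        best_model_save_path=save_path,
        verbose=1,
    )

    model = PPO("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=total_timesteps, callback=eval_callback)
    return model
```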

  • 01:35:00 In this section, the speaker discusses the process of specifying a network architecture for a custom actor and value function in a neural network. This can be done simply by changing the number of units and layers and passing it onto the model. The speaker also emphasizes that the custom feature extractors can be defined and shows how to use an alternate algorithm such as DQN instead of PPO, and highlights other algorithms available in Stable Baselines. The speaker concludes by showcasing the trained DQN model.
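As an illustration of specifying a custom actor and value network, the configuration could look like the fragment below. The layer counts and sizes are assumptions chosen for the example, and the exact `net_arch` format depends on the Stable Baselines3 version.

```python
# Separate layer sizes for the policy/actor (pi) and value function (vf) networks.
# Four hidden layers of 128 units each is an illustrative choice.
net_arch = dict(pi=[128, 128, 128, 128], vf=[128, 128, 128, 128])
policy_kwargs = {"net_arch": net_arch}

# With Stable Baselines3 installed, this would be passed to the algorithm as:
#   model = PPO("MlpPolicy", env, verbose=1, policy_kwargs=policy_kwargs)
# Some older SB3 releases expect the dict wrapped in a list instead:
#   net_arch = [dict(pi=[...], vf=[...])]
```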

  • 01:40:00 In this section, the instructor discusses the projects that learners will be building using reinforcement learning techniques. They will start with Project One, which is the Breakout game in Atari. Then, they will also tackle Project Two, where they will use reinforcement learning to build a racing car to simulate autonomous driving. Lastly, they will work on Project Three, which involves creating custom environments using the OpenAI Gym spaces. The instructor also explains how importing necessary libraries and dependencies for the projects are similar to those in the main course, and they will only need to use different algorithms depending on the project.

  • 01:45:00 In this section, the video instructor explains how to set up the Atari environment for reinforcement learning in Python. Due to recent changes, users must download the ROM files from atarimania.com and extract them into a folder in order to use the environment. After installing the necessary packages and dependencies, users can test the environment using "env.reset()" and "env.action_space". The observation space is a box representing an image with the dimensions 210x160x3. The instructor also demonstrates how to test a model within the environment.

  • 01:50:00 In this section, the instructor shows the code for playing Breakout using random actions and points out that training the model can take a long time. To speed up the training, the instructor vectorizes the environment and trains four different environments at the same time. The environment used is the image-based environment, as opposed to the RAM version of Breakout, because the CNN policy will be used. The code for setting up the model is shown, including specifying the log path and the A2C algorithm with the CNN policy.

  • 01:55:00 In this section, the video instructor uses reinforcement learning to train a model to play the Atari game "Breakout". The model uses a convolutional neural network (CNN) policy, which is faster to train than a multi-layer perceptron policy. The environment is defined using the make_atari_env helper from Stable Baselines, and vectorization is used to speed up the training process. The model is trained for 100,000 steps, and after saving and reloading the model, it is evaluated using the evaluate_policy method. The final model achieves an average episode reward of 6.1 with a standard deviation of 1.9, a significant improvement over a random agent. The instructor also provides information on a pre-trained model that has been trained for 300,000 steps and how to load and test it.


Part 3

  • 02:00:00 In this section, the instructor discusses how to handle freezing issues when working with the environment, specifically Atari. If the environment freezes, the notebook should be restarted, and the kernel should be restarted after saving the model. The instructor then demonstrates how to train a reinforcement learning agent for Breakout, by walking through the process of importing dependencies, installing Atari ROMs, vectorizing the environment to train on four Atari environments simultaneously, training the agent, and finally evaluating and saving the model. The instructor also shows the impact of training the model for longer, and makes the trained models available in the GitHub repository for learners to try on their own.

  • 02:05:00 In this section of the video on reinforcement learning in three hours, the instructor begins by showing the results of Project 1, which involved training a model to play a game using reinforcement learning. The model performed significantly better than previous models, with a mean reward over 50 episodes of 22.22 and a standard deviation of 9.1. The instructor then introduces Project 2, which involves using reinforcement learning for autonomous driving using the racing car environment. To set up the environment, the instructor explains that SWIG must be installed, along with two new dependencies, Box2D and pyglet. The instructor then goes through the process of testing the environment and importing necessary dependencies.

  • 02:10:00 In this section, the video discusses the observation and action space of the car racing environment for reinforcement learning. The observation space is a 96 by 96 by 3 image with values between 0 and 255, while the action space is between minus one and one for three different values. The reward function is negative 0.1 for every frame and plus 1000 divided by n for every track tile visited. The game is considered solved when the agent can consistently get 900 or more points, which can take some time to achieve with training. The video then goes on to train a model using the PPO algorithm and shows how to test the racing environment using the trained model.
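To make the reward arithmetic above concrete, here is a small sketch of the episode return. It assumes that n is the total number of tiles on the track; the function name `car_racing_return` is hypothetical and not part of Gym.

```python
def car_racing_return(frames, tiles_visited, total_tiles):
    """Episode return: +1000/N per tile visited, minus 0.1 per frame elapsed.

    `total_tiles` is N, assumed here to be the number of tiles on the track.
    """
    tile_reward = tiles_visited * (1000.0 / total_tiles)
    frame_penalty = 0.1 * frames
    return tile_reward - frame_penalty


# Visiting every tile of a 300-tile track in 732 frames: 1000 - 73.2
fast_lap = car_racing_return(frames=732, tiles_visited=300, total_tiles=300)

# Visiting every tile in 1000 frames lands on the "solved" threshold: 1000 - 100
slow_lap = car_racing_return(frames=1000, tiles_visited=300, total_tiles=300)
```

Under these assumptions, an agent that spins in place without reaching new tiles only accumulates the per-frame penalty, which is why the score threshold of 900 requires completing nearly the whole track quickly.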

  • 02:15:00 In this section, the instructor sets up the environment for the self-driving car using OpenAI Gym and wraps it inside a DummyVecEnv wrapper. Then, the agent and model are specified using the PPO algorithm, and the model is trained for 100,000 steps. The saved model is loaded and evaluated in the environment, and because the high-powered car has little traction, it does not drive forward but spins out and does donuts. Finally, the environment is closed, and the instructor loads up a model trained for 438,000 steps to test.

  • 02:20:00 In this section, the instructor loads up a self-driving car model that was trained for 438,000 steps and tests it on the track. Though it is slower, it follows the track and gets a much higher score than the previous model trained for 100,000 steps. The instructor explains that training reinforcement learning agents for a longer period of time can produce much better models, and ideally, this model should have been trained for 1-2 million steps to perform optimally. He demonstrates how to test the model using a code snippet from the main tutorial, which shows that, even when trained only on images, the model can successfully navigate around the track. Ultimately, the instructor trained this model for two million additional steps, improving its performance and reaching a reward estimate of around 700.

  • 02:25:00 In this section, the instructor loads and runs a model that performs significantly better than the previous models he trained, despite occasionally spinning out on corners. He shows the model's evaluation score, which reached up to 800 points, a significant improvement from the previous models. He notes that this model was trained for a longer duration and had a high standard deviation. The instructor then introduces the last project, which involves using Stable Baselines for reinforcement learning in custom environments. He imports necessary dependencies and encourages viewers to reach out if they encounter any difficulties.

  • 02:30:00 In this section of the video, the instructor goes through the OpenAI Gym dependencies, helper libraries, and Stable Baselines components that will be used in the reinforcement learning course. They import gym, which is the standard import, the Env class from gym, and the different types of spaces: Discrete, Box, Dict, Tuple, MultiBinary, and MultiDiscrete. The instructor goes through how to use each of these spaces and how they can be used for different purposes. The instructor also goes through the different helpers that were imported, such as numpy, random, and os, and the Stable Baselines imports, including PPO, DummyVecEnv from common.vec_env, and the evaluate_policy function.

  • 02:35:00 In this section of the video, the presenter discusses the different types of spaces available in OpenAI Gym for reinforcement learning. These spaces include discrete, box, tuple, dict, multi-binary, and multi-discrete. The presenter provides examples and explanations for each of these spaces. The video then goes on to discuss building a simulated environment for training an agent to regulate the temperature of a shower. The ultimate goal is to achieve a temperature between 37 and 39 degrees, but the agent does not know this a priori and must learn through trial and error.

  • 02:40:00 In this section, the instructor builds a shell for a shower environment by implementing the four key functions. These functions are init, step, render, and reset. The init function initializes the environment by defining the action space, observation space, and initial state. The step function takes an action and applies it to the environment. The render function displays the environment. The reset function resets the environment to its initial state. The instructor also sets an episode length of 60 seconds for the environment.

  • 02:45:00 In this section, the instructor defines the step function for the shower environment, which contains six code blocks. The first block applies the impact of the action on the state, with zero, one, and two as the three possible actions. Zero decreases the temperature by one degree, one leaves the temperature the same, and two increases the temperature by one degree. The second block decreases the shower time by one second. The third block defines the reward, with a reward of one if the temperature is between 37 and 39 degrees and -1 if it is outside that range. The fourth block checks if the shower is done and sets done to true if the shower time is less than or equal to zero. The fifth block creates a blank info dictionary, and the final block returns the state, reward, whether the shower is done, and the dictionary. The reset function resets the initial temperature to its default value and resets the shower time to 60 seconds.
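The six blocks of the step function, together with reset, can be sketched as below. This is a simplified sketch written without Gym so that it runs anywhere: it mirrors the Gym interface but omits the action and observation space definitions. The starting temperature of 38 degrees is an assumption, since the summary only refers to a default value.

```python
class ShowerEnv:
    """Gym-style shower environment, written without Gym for illustration."""

    def __init__(self, start_temp=38, episode_length=60):
        self.start_temp = start_temp          # assumed default temperature
        self.episode_length = episode_length  # seconds per episode
        self.reset()

    def step(self, action):
        # 1. Apply the action: 0 -> -1 degree, 1 -> no change, 2 -> +1 degree
        self.state += action - 1
        # 2. One second of shower time passes
        self.shower_length -= 1
        # 3. Reward of 1 inside the 37-39 degree band, -1 outside it
        reward = 1 if 37 <= self.state <= 39 else -1
        # 4. The episode is done once the shower time runs out
        done = self.shower_length <= 0
        # 5. Blank info dictionary
        info = {}
        # 6. Return the state, reward, done flag and info
        return self.state, reward, done, info

    def render(self):
        pass  # nothing to display in this sketch

    def reset(self):
        self.state = self.start_temp
        self.shower_length = self.episode_length
        return self.state


# Holding the temperature steady (action 1) for a full 60-second episode.
env = ShowerEnv()
state = env.reset()
total_reward, done = 0, False
while not done:
    state, reward, done, info = env.step(1)
    total_reward += reward
```

Because the temperature starts inside the target band and never changes in this run, every one of the 60 steps earns a reward of 1.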

  • 02:50:00 In this section, the instructor explains how to create a custom environment for reinforcement learning using Python. He demonstrates how to define the state of the environment and its observation and action spaces. The instructor also shows how to test and train the model using the defined environment, and how to save the trained model after learning. He mentions that gaming environments will take longer to train compared to simple environments, and encourages keeping this in mind for planning projects and committing to clients.

  • 02:55:00 In this section, the instructor demonstrates how to test and save the trained model. They use the evaluate_policy method to test the model's performance, and then save the model using the model.save method. Additionally, they provide a brief summary of the course, which covers a range of topics from setting up the environment using Stable Baselines to training models with different algorithms, including PPO, A2C, and DQN. They also discuss creating custom environments and building projects, such as training a model to play Breakout or race a car around a track.

  • 03:00:00 In this section, the instructor recommends additional resources for further learning, including David Silver's Reinforcement Learning course and the book Reinforcement Learning: An Introduction by Richard Sutton and Andrew Barto. He also suggests exploring hyperparameter tuning, building custom environments, and implementing end-to-end solutions, such as building a cart pole robot and training it in a simulated environment before deploying it in a real environment using a Raspberry Pi. The instructor encourages feedback and questions from viewers and thanks them for tuning in.
  • 2021.06.06
  • www.youtube.com
Want to get started with Reinforcement Learning? This is the course for you! This course will take you through all of the fundamentals required to get started ...
 

Sign Language Detection using ACTION RECOGNITION with Python | LSTM Deep Learning Model

Code: https://github.com/nicknochnack/ActionDetectionforSignLanguage




In this YouTube video titled "Sign Language Detection using ACTION RECOGNITION with Python | LSTM Deep Learning Model", the presenter explains how to build a real-time sign language detection flow using action detection and key models. The presenter uses OpenCV and MediaPipe Holistic to extract key points from hands, face, and body, and then TensorFlow and Keras to build an LSTM model that predicts the action being demonstrated in a sequence of frames. The presenter goes through the process of accessing and extracting key points from the webcam, sets up a loop to access the webcam, and makes sign language detection easier by applying the landmarks to the last captured frame from the webcam. They also demonstrate how to modify the code to handle missing key points and add error handling to the pose model and face landmark detection. Finally, the presenter explains the keypoint extraction function for sign language detection using action recognition with Python.
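The keypoint extraction with handling for missing landmarks can be sketched as follows. This is an assumption-laden illustration rather than the presenter's code: it uses plain Python lists instead of arrays, and it relies on the Holistic model's usual output layout (33 pose landmarks with x, y, z and visibility; 468 face landmarks and 21 landmarks per hand with x, y, z). The helper name `flatten_landmarks` is hypothetical.

```python
from types import SimpleNamespace


def flatten_landmarks(landmark_list, count, fields):
    """Flatten one landmark group to a list of floats, or zeros if it was not detected."""
    if landmark_list is None:
        return [0.0] * (count * len(fields))
    values = []
    for lm in landmark_list.landmark:
        values.extend(getattr(lm, f) for f in fields)
    return values


def extract_keypoints(results):
    """Concatenate pose, face and hand keypoints into one flat feature vector."""
    pose = flatten_landmarks(results.pose_landmarks, 33, ("x", "y", "z", "visibility"))
    face = flatten_landmarks(results.face_landmarks, 468, ("x", "y", "z"))
    left = flatten_landmarks(results.left_hand_landmarks, 21, ("x", "y", "z"))
    right = flatten_landmarks(results.right_hand_landmarks, 21, ("x", "y", "z"))
    return pose + face + left + right


# A frame where nothing was detected still yields a full-length vector of zeros,
# so every frame has the same shape when fed to the LSTM.
no_detection = SimpleNamespace(
    pose_landmarks=None,
    face_landmarks=None,
    left_hand_landmarks=None,
    right_hand_landmarks=None,
)
keypoints = extract_keypoints(no_detection)
```

Padding missing groups with zeros keeps the feature length constant (132 + 1404 + 63 + 63 values under the assumptions above), which matters because a hand is frequently out of frame.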

The video provides a detailed explanation of how to create a sign language detection model using action recognition with Python. To collect the data, the presenter creates folders for each action and sequence and modifies the MediaPipe loop to collect 30 key point values per video for each action. The data is pre-processed by creating labels and features for the LSTM deep learning model, and the model is trained using TensorFlow and Keras. The trained model is evaluated using a multi-label confusion matrix and accuracy score function. Finally, real-time detection is established by creating new variables for detection, concatenating frames, and applying prediction logic, with a threshold variable implemented to render results above a certain confidence metric.
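The real-time prediction logic described above (keep the most recent frames, predict on them, and only report results above a confidence threshold) can be sketched as below. The action names, the threshold of 0.5, and `dummy_predict` are hypothetical placeholders; in the project itself the prediction would come from the trained LSTM model.

```python
from collections import deque


def detect_actions(frames, predict, actions, window=30, threshold=0.5):
    """Return the action detected at each step once `window` frames are available.

    `predict` takes a list of `window` keypoint vectors and returns one
    probability per action; results at or below `threshold` are ignored.
    """
    sequence = deque(maxlen=window)  # keeps only the most recent frames
    detected = []
    for keypoints in frames:
        sequence.append(keypoints)
        if len(sequence) < window:
            continue  # not enough frames collected yet
        probs = predict(list(sequence))
        best = max(range(len(probs)), key=probs.__getitem__)  # most likely action
        if probs[best] > threshold:
            detected.append(actions[best])
    return detected


def dummy_predict(sequence):
    """Hypothetical stand-in for the trained model's prediction call."""
    return [0.1, 0.8, 0.1]


actions = ["hello", "thanks", "iloveyou"]  # hypothetical action labels
frames = [[0.0]] * 32                      # 32 dummy keypoint vectors
detected = detect_actions(frames, dummy_predict, actions)
```

Raising `threshold` trades missed detections for fewer false positives, which is the adjustment the video describes when tuning the on-screen results.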

The video tutorial showcases how to use Python and an LSTM Deep Learning model for sign language detection using action recognition. The speaker walked through the prediction logic and explained the code, making it easy to understand. They also showed viewers how to adjust the code by using the append method, increasing the detection threshold, and adding probability visualization to make the detection visually compelling. The speaker also covered how to check if the result is above the threshold, how to manipulate probabilities, and how to extend and modify the project by adding additional actions or visualizations. Finally, the speaker presented the model's additional logic, which minimizes false detections and improves the model's accuracy, along with an invitation to support the video and the channel.

  • 00:00:00 In this section of the video, the creator explains their goal to produce a real-time sign language detection flow using action detection and key models. They will be using MediaPipe Holistic to extract key points from hands, face, and body, and then use TensorFlow and Keras to build an LSTM model that predicts the action being demonstrated in a sequence of frames. The process includes collecting data on key points, training a neural network, evaluating accuracy, and testing the model in real-time using OpenCV and a webcam. The creator outlines 11 steps to achieve this process and begins by installing and importing dependencies.

  • 00:05:00 In this section, the presenter discusses the different dependencies that will be used in the project, including OpenCV, MediaPipe, scikit-learn, NumPy, Matplotlib, and TensorFlow. After importing these dependencies, the presenter goes through the process of accessing and extracting key points from the webcam using OpenCV and MediaPipe Holistic. The presenter then sets up a loop to access the webcam and render multiple frames to the screen, allowing for real-time testing of the project. This loop will be used multiple times throughout the project, including when extracting frames and testing the project. All the code used in the project will be made available on GitHub, including the final trained weights.

  • 00:10:00 In this section of the video, the presenter explains how to access the webcam using OpenCV and start looping through all the frames. The presenter uses the "video capture" function to read the feed from the webcam and initiates a loop that will read, display, and wait for a keypress to break out of the loop. The presenter also explains how to break the loop gracefully, and how to troubleshoot device number issues if the webcam doesn't appear. Finally, the presenter introduces MediaPipe Holistic and MediaPipe Drawing Utilities, two Python modules used to download and leverage the Holistic model for pose detection and draw the pose landmarks on an image.

  • 00:15:00 In this section of the transcript, the speaker sets up a function to make sign language detection easier. The function takes in an image and a MediaPipe Holistic model and goes through a series of steps: color conversion from BGR to RGB, making the detection, and converting the image back to BGR, before returning the image and results to the loop for rendering. The steps are symmetric: the image is set to non-writable before detection and back to writable afterward. The speaker also explains the cvtColor function used to convert image color and shows how to call the MediaPipe detection function in the loop.

  • 00:20:00 In this section, the presenter explains how to access the MediaPipe model using a "with statement" and sets the initial and tracking confidence. They also show how to access and visualize the different types of landmarks: face, left and right hand, and pose landmarks. The presenter then demonstrates how to use MediaPipe Holistic to detect landmarks and displays the results on the frame. Finally, they show how to render the landmarks to the screen by writing a function.

  • 00:25:00 In this section, the YouTuber sets up a new function called "draw_landmarks" that renders landmark data onto an image so that the landmarks detected by the MediaPipe models can be visualized. The function uses the "mp_drawing" helper provided by MediaPipe to draw the landmarks, and takes the image and landmark data as inputs. It also allows connection maps and formatting options to be specified. The YouTuber then demonstrates how to use the "plt.imshow" function from matplotlib to display the last captured frame from the webcam.

  • 00:30:00 In this section, the speaker corrects the color of the image and applies the landmarks to the image by passing them through the "draw landmarks" function. The last frame and the last MediaPipe detection results remain available after the loop has run. The "mp_drawing.draw_landmarks" method is applied to the current frame to render all hand, pose, and face connections. The speaker then adds the "draw landmarks" function to the real-time loop and, before rendering, applies formatting using the "landmark drawing spec" and "connection drawing spec" to style the dots and connections respectively. Finally, the speaker creates a new function called "draw style landmarks" to customize the "draw landmarks" function if desired.

  • 00:35:00 In this section, the speaker updates the formatting for the draw landmarks function, passing additional parameters, such as color and circle radius, to mp_drawing.DrawingSpec. They demonstrate the changes for the face landmarks and explain that the first drawing spec styles the landmarks and the second styles the connections. The speaker then copies the changes to the function for each of the pose and hand models, giving each model unique colors. The changes are purely cosmetic and won't affect performance, but they demonstrate the different models in action.

  • 00:40:00 In this section, the video tutorial explains how to extract key point values from the results variable of the MediaPipe Holistic model in a resilient way by concatenating them into a numpy array and handling errors. The tutorial walks through how to extract values for one landmark and update it for all landmarks using a loop and a list comprehension. The final array with all landmarks is then flattened to have all values in one array rather than multiple sets of landmarks.

  • 00:45:00 In this section, the presenter explains how to modify the code to handle frames with no key points, for example when a hand is out of frame. They start by showing that the left-hand landmarks have three values each and there are 21 landmarks, for a total of 63 values. They then apply an if statement that replaces missing values with a zero-filled numpy array of the same size. The same modification is applied to the right-hand landmarks, which also have 63 values. The code extracts the different key points by concatenating the x, y, and z values together in one big array and then flattening it into the proper format for use in the LSTM model.

  • 00:50:00 In this section, the speaker discusses how to add error handling to the pose model and face landmark detection, and creates a function called "extract_key_points" to extract the key points needed for landmark detection and action detection. The function uses numpy arrays and loops through results to extract x, y, and z values for each landmark and then flattens them into an array. The speaker also mentions that the code will be available in the video description and invites viewers to ask questions in the comments.
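
The extraction step with its fallback for missing detections can be sketched as follows. This is a minimal illustration rather than the presenter's exact code: the helper names and the fake result object are hypothetical, and the pose (33 landmarks with x, y, z, visibility) and face (468 landmarks) sizes are assumptions based on MediaPipe Holistic's outputs, chosen because together with the two 63-value hands they give the 1662 values per frame mentioned in the summary.

```python
from types import SimpleNamespace

import numpy as np

# Assumed landmark counts per MediaPipe Holistic component.
N_POSE, N_FACE, N_HAND = 33, 468, 21


def flatten_landmarks(landmark_list, n_landmarks, fields):
    """Return a flat array of the requested fields, or zeros if nothing was detected."""
    if landmark_list is None:
        return np.zeros(n_landmarks * len(fields))
    return np.array(
        [[getattr(lm, f) for f in fields] for lm in landmark_list.landmark]
    ).flatten()


def extract_keypoints(results):
    """Concatenate pose, face, and both hands into one feature vector per frame."""
    pose = flatten_landmarks(results.pose_landmarks, N_POSE, ("x", "y", "z", "visibility"))
    face = flatten_landmarks(results.face_landmarks, N_FACE, ("x", "y", "z"))
    lh = flatten_landmarks(results.left_hand_landmarks, N_HAND, ("x", "y", "z"))
    rh = flatten_landmarks(results.right_hand_landmarks, N_HAND, ("x", "y", "z"))
    return np.concatenate([pose, face, lh, rh])


def fake_landmarks(n, value):
    """Hypothetical stand-in for a MediaPipe landmark list."""
    point = SimpleNamespace(x=value, y=value, z=value, visibility=1.0)
    return SimpleNamespace(landmark=[point] * n)


# Simulate a frame in which only the left hand was detected.
fake_results = SimpleNamespace(
    pose_landmarks=None,
    face_landmarks=None,
    left_hand_landmarks=fake_landmarks(N_HAND, 0.5),
    right_hand_landmarks=None,
)
keypoints = extract_keypoints(fake_results)
print(keypoints.shape)
```

Returning zeros of the right length keeps every frame the same size, so frames can later be stacked into fixed-shape sequences even when a hand leaves the picture.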

  • 00:55:00 In this section of the video, the speaker explains the keypoint extraction function for sign language detection using action recognition with Python. The function extracts the key points for pose, face, left-hand, and right-hand and concatenates them into a flattened numpy array. These key points form the frame values used for the detection of sign language using human action detection. The speaker also sets up variables for the exported data path and the actions to be detected- hello, thanks, and I love you- using 30 frames of data for each action.

  • 01:00:00 In this section of the video, the presenter explains the data collection process for detecting sign language using action recognition with Python. They explain that for each of the three actions they want to detect, they will be collecting 30 videos worth of data, with each video being 30 frames in length. This amounts to a fair bit of data, but the presenter reassures viewers that they will take it step by step. They go on to create folders for each action and each sequence, in which the key points from each of the 30 frames of that sequence will be stored as numpy arrays. The presenter also mentions that in step 11, they will show viewers how to concatenate words together to form sentences.

  • 01:05:00 In this section, the instructor shows how to collect data for the sign language recognition model. He starts by creating folders for the three different classes of signs - hello, thanks, and I love you - and their corresponding individual sequence folders. Then, he modifies the MediaPipe loop to loop through each action, video, and frame for collecting the data. To ensure that the frames are not collected too quickly, he adds a break between each video using a logic statement. By doing this, the loop will collect 30 frames of key point values per video, effectively building a stacked set of three actions, each with 30 videos per action, and 30 frames per video.

  • 01:10:00 In this section, the video creator provides an explanation of the collection logic in a Python script that uses OpenCV and deep learning models to detect sign language gestures from video. The logic involves outputting text onto the screen and taking a break at the first frame of each video. The video creator also demonstrates the use of np.save to save the key points of each frame as numpy arrays, which are stored in an mp_data folder. They then provide the final code block for the keypoint extraction and saving of the frames into the correct folders.
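
The folder layout and the np.save step described above might look roughly like this. It is a sketch under stated assumptions: the folder name, the action labels, and the use of a temporary directory are illustrative choices, and a zero vector stands in for the key points that the real loop would extract from each webcam frame.

```python
import os
import tempfile

import numpy as np

actions = ["hello", "thanks", "iloveyou"]   # assumed label spellings
no_sequences = 30      # videos per action
sequence_length = 30   # frames per video

# A temporary directory stands in for the real data folder.
data_path = os.path.join(tempfile.mkdtemp(), "MP_Data")

# One folder per action, and inside it one folder per video.
for action in actions:
    for sequence in range(no_sequences):
        os.makedirs(os.path.join(data_path, action, str(sequence)), exist_ok=True)

# Save one frame of placeholder key points; in the real loop this array would
# come from the keypoint extraction step for every frame of every video.
keypoints = np.zeros(1662)
frame_path = os.path.join(data_path, "hello", "0", "0.npy")
np.save(frame_path, keypoints)

restored = np.load(frame_path)
print(restored.shape)
```

Storing one small array per frame keeps collection restartable: a single video can be re-recorded without touching the rest of the dataset.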

  • 01:15:00 In this section of the transcript, the speaker explains the logic applied to loop through their actions and sequences (videos), apply media pipe detection, draw styled landmarks, and extract different key points to be saved into specific folders. They will use three actions (hello, thank you, and I love you) and collect 30 frames per action for 30 sequences. Before running this code, they double-check for errors and adjust the line width to ensure the font isn't obscured. Once the code runs, the pop-up says "Starting Collection," they have two seconds to get into position to perform the action, and they have 30 frames to do so. The code should be able to loop and collect data indefinitely.

  • 01:20:00 In this section of the video, the presenter demonstrates how to collect sign language data using MediaPipe and OpenCV libraries in Python. The presenter suggests capturing various angles of hand signs for better model performance and also mentions that having 30 sequences of 30 frames each tends to work well. The key points collected by MediaPipe are used instead of the image to make the model more resilient in various scenarios. The presenter also explains that the collected data is stored as numpy arrays for future use.

  • 01:25:00 In this section, the video focuses on pre-processing the data and creating labels and features for the LSTM deep learning model. The video starts by importing dependencies and then creating a label map, which is basically a dictionary representing each one of the different actions. Next, the data is read in and brought together to structure it into one big array with 90 arrays, each containing 30 frames with 1662 values representing the key points. Two blank arrays are created, sequences, and labels, where sequences represent the feature data, and labels represent the label data. The code then loops through each of the actions and sequences, creating a blank array for windows to represent all the different frames for that particular sequence. Finally, numpy.load is used to load up each frame.

  • 01:30:00 In this section, the speaker goes through the process of pre-processing the data by storing it in a numpy array and converting the labels to a one hot encoded representation using the "to_categorical" function. They then use the "train_test_split" function to partition the data into training and testing sets. The training set consists of 85 sequences, and the test set has five sequences. The speaker also imports the necessary dependencies, including the sequential model, LSTM layer, and dense layer, as they prepare to train an LSTM neural network using TensorFlow and Keras.
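
The shapes involved in this pre-processing step can be checked with a NumPy-only sketch. Note the substitutions: np.eye is used here in place of Keras' to_categorical and a manual shuffle in place of scikit-learn's train_test_split, purely so the example has no heavy dependencies, and zero vectors stand in for the frames that would be read with np.load.

```python
import numpy as np

actions = ["hello", "thanks", "iloveyou"]   # assumed label spellings
no_sequences, sequence_length, n_features = 30, 30, 1662

# Map each action name to an integer class index.
label_map = {label: index for index, label in enumerate(actions)}

sequences, labels = [], []
for action in actions:
    for sequence in range(no_sequences):
        # Placeholder for loading the 30 saved frames of this video with np.load.
        window = [np.zeros(n_features, dtype=np.float32) for _ in range(sequence_length)]
        sequences.append(window)
        labels.append(label_map[action])

X = np.array(sequences)                        # shape (90, 30, 1662)
y = np.eye(len(actions), dtype=int)[labels]    # one-hot rows, shape (90, 3)

# Shuffle once, then hold out 5 of the 90 sequences for testing.
rng = np.random.default_rng(0)
order = rng.permutation(len(X))
test_idx, train_idx = order[:5], order[5:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
print(X_train.shape, X_test.shape, y_train.shape)
```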

  • 01:35:00 In this section, the speaker discusses the necessary dependencies needed for building a neural network, including sequential, lstm, dense, and tensorboard. They explain that sequential makes it easy to add layers to the model, and lstm is used for temporal analysis and action detection. The speaker also demonstrates how to set up tensorboard callbacks to monitor training progress. They then proceed to build the neural network architecture, which includes adding three sets of lstm layers and specifying the activation functions and input shape. Finally, they mention the importance of not returning sequences to the dense layer and recommend further learning resources.

  • 01:40:00 In this section of the video, the presenter goes through the layers added after the LSTM layers, specifying that the next three layers are all dense layers. The first two dense layers use 64 and 32 fully connected units with ReLU activations. The final layer is the actions layer, which has three units, one per action, and uses a softmax activation. The model takes an input of 30 frames by 1,662 key point values and predicts an action such as "Hello". The presenter explains how they arrived at using MediaPipe and the LSTM layer by discussing the research and development they conducted. Finally, the model is compiled using categorical cross-entropy as the loss function, and it is fit on x train and y train for the specified number of epochs.

  • 01:45:00 In this section, the video demonstrates the process of training an LSTM deep learning model for sign language detection using action recognition with Python. Because the model is trained on MediaPipe Holistic key points rather than images, the data fits into memory and no data generator is required. The video shows how to set up the training process with TensorBoard and how to monitor the model's accuracy and loss during training in TensorBoard. The video also includes how to stop the training process once a reasonable level of accuracy has been achieved and how to inspect the model's structure using the model.summary function.

  • 01:50:00 In this section, the speaker trains a model and makes predictions with it. The model uses an LSTM deep learning technique to identify the actions from sign language. Its softmax output gives a probability for each action, and the action with the highest probability is taken as the prediction. The predicted value is compared with the actual value to evaluate the performance of the model. The speaker saves the trained model as action.h5 and reloads it with a compile and load function. Importing metrics from sklearn, the speaker then evaluates the model using a multi-label confusion matrix.

  • 01:55:00 In this section of the video, the presenter explains how to evaluate the performance of the trained model using a multi-label confusion matrix and accuracy score function. The confusion matrix represents the true positives and true negatives, and the higher the number of these values in the top left and bottom right corner, the better the model performs. The accuracy score function gives the percentage of correct predictions of the model. The trained model is then evaluated using the testing data, and the loop is re-established for real-time detection, which involves creating new variables for detection, concatenating frames, and applying prediction logic. A threshold variable is also implemented to render results above a certain confidence metric.
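
To make the layout of the per-class confusion matrices concrete, here is a small NumPy sketch. It is a hand-written stand-in, not the sklearn functions used in the video, although it follows the same [[TN, FP], [FN, TP]] arrangement, which is why good models concentrate counts in the top-left and bottom-right cells. The labels below are illustrative and are not results from the video.

```python
import numpy as np


def one_vs_rest_confusion(y_true, y_pred, n_classes):
    """Per-class 2x2 matrices laid out as [[TN, FP], [FN, TP]]."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    matrices = np.zeros((n_classes, 2, 2), dtype=int)
    for c in range(n_classes):
        true_c, pred_c = y_true == c, y_pred == c
        matrices[c, 0, 0] = np.sum(~true_c & ~pred_c)  # true negatives
        matrices[c, 0, 1] = np.sum(~true_c & pred_c)   # false positives
        matrices[c, 1, 0] = np.sum(true_c & ~pred_c)   # false negatives
        matrices[c, 1, 1] = np.sum(true_c & pred_c)    # true positives
    return matrices


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


# Illustrative class indices for five test sequences.
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]
cm = one_vs_rest_confusion(y_true, y_pred, n_classes=3)
print(cm[1])
print(accuracy(y_true, y_pred))
```

With one-hot labels and softmax outputs, the class indices passed in would first be obtained with np.argmax along the class axis.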

  • 02:00:00 In this section, the speaker explains their prediction logic for sign language detection using action recognition with Python and LSTM deep learning model. They start by extracting key points and appending them to a sequence of 30 frames. If the length of the sequence equals 30, they run the prediction by calling model.predict with the expanded dimensions of the sequence. They then print the predicted class using np.argmax and the actions defined earlier. The speaker also inserts key points at the beginning of the sequence to sort out the logic and proceeds to add visualization logic to the code.

  • 02:05:00 In this section, the speaker breaks down the code, explaining how the program checks whether the result is above the threshold by extracting the highest score using "np.argmax" and indexing into "res". The program then checks whether the last word in the sentence matches the current prediction. If it does, the program does not append to the sentence. If it does not match, the program appends the currently detected action to the sentence array. The sentence is limited to five words so that the program does not try to render a giant array. Finally, the program puts text and a rectangle on the image to display the sentence.
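
The threshold check and sentence-building rule described above can be sketched as a small function. The function name, the threshold value of 0.7, and the example probability vectors are assumptions made for illustration; only the overall logic (argmax, threshold, skip repeats, keep five words) comes from the summary.

```python
import numpy as np

actions = np.array(["hello", "thanks", "iloveyou"])   # assumed label spellings
threshold = 0.7   # assumed value; the summary only says a threshold is used


def update_sentence(sentence, res, threshold=threshold, max_words=5):
    """Append the predicted action if it is confident and not a repeat."""
    best = int(np.argmax(res))
    if res[best] > threshold:
        word = str(actions[best])
        # Only append when the sentence is empty or the word has changed.
        if not sentence or sentence[-1] != word:
            sentence.append(word)
    # Keep only the most recent words so the on-screen text stays short.
    return sentence[-max_words:]


sentence = []
# Hypothetical softmax outputs for a few consecutive predictions.
for res in ([0.9, 0.05, 0.05], [0.8, 0.1, 0.1], [0.4, 0.35, 0.25], [0.1, 0.85, 0.05]):
    sentence = update_sentence(sentence, np.array(res))
print(sentence)
```

In the real-time loop, res would be the first row of the model's prediction for the current 30-frame window.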

  • 02:10:00 In this section, the video demonstrates how to use Python and an LSTM Deep Learning model to detect sign language using action recognition. The tutorial walks through adjusting the code to use the append method and increasing the detection threshold for better accuracy. The video also shows how to render the probabilities with a quick function to create a probability visualization, making the detection output look visually compelling. Overall, the tutorial provides an excellent example of using deep learning models to detect sign language with Python, making it an excellent resource for anyone looking to build similar projects.

  • 02:15:00 In this section of the video, the presenter demonstrates a probability visualization function that shows the different actions and their probabilities in real-time. The function uses cv2.rectangle to place a dynamic rectangle on the output frame, positioning and filling it based on the action currently being processed, and then uses the cv2.putText method to output the text values. The presenter shows how this function can be brought into the loop to visualize it in real-time, and we can see that it is detecting and recognizing different actions such as hello, thanks, and I love you based on the probability values. The presenter highlights that there are many applications for this function and demonstrates how the code can be used to detect and recognize different actions in various scenarios.

  • 02:20:00 In this section of the video, the presenter recaps the process of their sign language detection project. They set up folders for collection, collected key points, pre-processed the data, built an LSTM neural network, made predictions, and tested it in real-time. They were able to effectively get detections running in real-time, including the probability visualization. They provide a demonstration of how they fixed the code to grab the right frames and append them to the sequence by moving a colon and adding a negative. They also added a line to append their predictions to a new prediction array to ensure more stability when predicting actions. Finally, they offer that users can extend and modify the project by adding additional actions, colors, and/or visualizations as they see fit.

  • 02:25:00 In this section, the speaker demonstrates how the action detection model has become more stable and resilient by implementing additional logic. The model is now able to minimize false detections and hold the detection for longer periods, improving its accuracy. The speaker also mentions that the trained weights for the model will be available in the GitHub repository for users to leverage. The video concludes with an invitation to give the video a thumbs up and subscribe to the channel.
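
The two fixes mentioned in the recap, keeping only the latest frames with a negative slice and requiring recent predictions to agree, can be sketched as below. The window of 10 agreeing predictions and the helper name are assumptions, since the summary does not give the exact rule, and tiny placeholder arrays stand in for real key points.

```python
import numpy as np

SEQUENCE_LENGTH = 30
STABLE_WINDOW = 10   # assumed; the summary does not state the window size


def is_stable(predictions, current, window=STABLE_WINDOW):
    """True when the last `window` predicted class indices all equal `current`."""
    recent = predictions[-window:]
    return len(recent) == window and all(p == current for p in recent)


# Rolling buffer: append each new frame, then keep only the latest 30.
sequence = []
for frame_number in range(45):
    keypoints = np.full(4, frame_number)   # tiny placeholder for the 1662 values
    sequence.append(keypoints)
    sequence = sequence[-SEQUENCE_LENGTH:]

# Hypothetical per-frame class indices containing one spurious detection.
stream = [0, 0, 0, 1] + [0] * 12
predictions, accepted_steps = [], []
for step, current in enumerate(stream):
    predictions.append(current)
    if is_stable(predictions, current):
        accepted_steps.append(step)

print(len(sequence), accepted_steps)
```

A single stray prediction resets the agreement window, which is what suppresses brief false detections at the cost of a short delay before a new action is accepted.
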
Sign Language Detection using ACTION RECOGNITION with Python | LSTM Deep Learning Model
  • 2021.06.18
  • www.youtube.com
Want to take your sign language model a little further? In this video, you'll learn how to leverage action detection to do so! You'll be able to leverage a key...
 

Numerics of Machine Learning at the University of Tübingen in the Winter Term of 2022/23. Lecture 1 - Introduction -- Philipp Hennig



Numerics of ML 1 -- Introduction -- Philipp Hennig

In this video, Philipp Hennig discusses the importance of understanding numerical algorithms in machine learning and introduces the course content for the term. The first numerical algorithm covered is Linear Algebra, with an application in Gaussian Process Regression. Hennig also discusses the role of simulation, differential equations, integration, and optimization in machine learning. He introduces new developments in numerical algorithms, such as algorithmic spines, observables, and probabilistic numerical algorithms. Throughout the video, Hennig emphasizes the importance of updating classic algorithms used in machine learning to solve complex problems and highlights the role of writing code in this computer science class.

Philipp Hennig is introducing his course on Numerics of Machine Learning, which aims to explore how machine learning algorithms function inside the box and how they can be adapted or changed to improve learning machines. The highly technical knowledge in numerical algorithms and machine learning algorithms is highly sought after by researchers and industry professionals. The course will consist of theory and coding work, with assignments graded on a binary system. Hennig emphasizes the importance of numerical algorithms in machine learning and invites students to join this unique teaching experiment with nine different instructors.

  • 00:00:00 In this section, Philipp Hennig introduces the importance of understanding numerical algorithms in machine learning. While machine learning algorithms take data as input and produce models that predict or act in the world, the actual learning process involves numerical computation. Unlike classic AI algorithms, contemporary machine learning algorithms use numerical algorithms such as linear algebra, simulation, integration, and optimization methods as the primitives for these computations. Philipp defines numerical algorithms as methods that estimate a mathematical quantity that doesn't have a closed form solution and can go wrong unlike atomic operations that always work. Since numerical algorithms are central to machine learning, it is important to understand them to ensure they work correctly.

  • 00:05:00 In this section, the speaker discusses the difference between regular functions and numerical algorithms, noting that the latter tend to have their own libraries and several subroutines to choose from. He then provides an example of a prototypical numerical algorithm written in 1993 in the language Fortran, implementing an algorithm invented by two mathematicians in 1975. This highlights the fact that numerical algorithms are old and have precise interfaces, making them difficult to modify. Machine learning engineers frequently encounter numerical tasks and have been able to utilize these old algorithms developed by other fields, but this can be problematic if the task at hand does not precisely match the capabilities of the method. The speaker suggests that this may become an issue in machine learning when trying to solve problems for which the existing numerical methods are not sufficient.

  • 00:10:00 In this section, Philipp Hennig introduces the topic of numerical algorithms and the course content for the term. Linear Algebra, the base layer of machine learning, is the first numerical algorithm they cover. An example of its application is in Gaussian Process Regression, where two functions are used for inference: Posterior mean and Posterior Covariance Function. These functions are defined using kernel methods, and their implementation involves the Cholesky decomposition method rather than computing the inverse of a matrix. Hennig also introduces a Python code snippet and explains why one should use Cholesky decomposition instead of computing the inverse of a matrix.
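
As a rough illustration of the Cholesky-based computation of the posterior mean and covariance, consider the NumPy sketch below. It is not the snippet shown in the lecture: the RBF kernel, the noise level, and the toy data are assumptions, and np.linalg.solve is used for the triangular systems only to keep the example dependency-free (a dedicated triangular solver such as the ones in scipy.linalg would exploit the structure of L).

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel between two sets of 1-D inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)


def gp_posterior(x_train, y_train, x_test, noise=1e-2):
    """Posterior mean and covariance of GP regression via a Cholesky factor."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_test, x_train)
    K_ss = rbf_kernel(x_test, x_test)

    L = np.linalg.cholesky(K)   # K = L @ L.T
    # alpha = K^{-1} y from two solves with L, without forming K^{-1}.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, K_s.T)   # L^{-1} K_s^T

    mean = K_s @ alpha
    cov = K_ss - v.T @ v
    return mean, cov


# Toy data: noisy-free samples of a sine curve (illustrative only).
x_train = np.linspace(0.0, 5.0, 8)
y_train = np.sin(x_train)
x_test = np.linspace(0.0, 5.0, 50)
mean, cov = gp_posterior(x_train, y_train, x_test)
print(mean.shape, cov.shape)
```

Factorizing once and reusing L for both the mean and the covariance is cheaper and numerically better behaved than computing an explicit inverse and multiplying by it.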

  • 00:15:00 In this section of the video, speaker Philipp Hennig discusses the problem with kernel machines, particularly with regards to their inability to scale well to large amounts of data. He explains that the expensive computations required for kernel machines make them difficult to use in contemporary machine learning. However, Hennig also suggests that there are other linear algebra algorithms that can be used to speed up computations by taking advantage of data set structure and approximations, ultimately leading to solutions with Gaussian process regression that scale to large data sets.

  • 00:20:00 In this section, Philipp Hennig introduces simulation algorithms and their role in machine learning. Simulation methods simulate the trajectory of a dynamical system through time, that is, they estimate its state x over time. They show up in machine learning when building agents such as a self-driving car or when creating a machine learning algorithm that makes use of physical insight, as in scientific machine learning. Differential equations, such as Schrödinger's equation, are typically used to encode knowledge of nature. Furthermore, Hennig provides an example of a simple prediction problem, the COVID-19 cases in Germany over one and a half years, to explain why deep neural networks and Gaussian processes do not work in solving this problem.

  • 00:25:00 In this section, Philipp Hennig discusses the use of differential equations in modelling systems, specifically the SIR models which are commonly used in simulations, and the challenge of incorporating real-world dynamics, such as lockdowns, into these models. He suggests using a neural network to make the coefficient beta time-dependent, but notes the difficulty in doing so due to the lack of derivatives in the code. However, he highlights the recent development of an algorithm in JAX that solves this problem.
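
For readers unfamiliar with SIR models, a minimal simulation with a time-dependent beta might look like this. It is a sketch under loud assumptions: a plain forward Euler integrator stands in for whatever solver the lecture uses, a hypothetical step function stands in for the neural network that would model beta, and all parameter values are illustrative.

```python
import numpy as np


def beta_schedule(t, beta_open=0.3, beta_lockdown=0.1, lockdown_start=30.0):
    """Hypothetical contact rate that drops when a lockdown begins."""
    return beta_lockdown if t >= lockdown_start else beta_open


def simulate_sir(s0, i0, r0, gamma=0.1, dt=0.1, t_end=100.0, beta=beta_schedule):
    """Forward Euler integration of the SIR equations on population fractions."""
    n_steps = int(round(t_end / dt))
    traj = np.empty((n_steps + 1, 3))
    traj[0] = (s0, i0, r0)
    for k in range(n_steps):
        s, i, r = traj[k]
        new_infections = beta(k * dt) * s * i   # flow from S to I
        recoveries = gamma * i                  # flow from I to R
        traj[k + 1] = (
            s - dt * new_infections,
            i + dt * (new_infections - recoveries),
            r + dt * recoveries,
        )
    return traj


trajectory = simulate_sir(0.99, 0.01, 0.0)
print(trajectory[-1])
```

Because every flow leaves one compartment and enters another, the three fractions always sum to one, which is a useful sanity check on any SIR implementation.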

  • 00:30:00 In this section, Philipp Hennig discusses an algorithm called simulation-based inference, which is a current way of solving complex problems. This algorithm involves a nested for loop that evaluates the function f multiple times and returns the gradient and does a gradient descent step. Hennig explains that to create a more flexible and faster algorithm than this primitive code, we can build our own method that constructs a list of numbers inside the Python code in a procedural way and adapts them. This method involves a spine of a Markov chain that can hang operators onto it, such as probability distribution and information operators, to inform the algorithm about unknown factors. By doing this, we can solve these problems without calling a for loop over and over again in an outer loop, which would be time-consuming.

  • 00:35:00 In this section, Philipp Hennig discusses the importance of updating classic algorithms used in machine learning, which are over 100 years old. He introduces the idea of algorithmic spines that can operate on different information operators and can create new functionality. Hennig then goes on to discuss the role of integration in machine learning, which is an elementary operation of Bayesian inference. The elementary operation for probabilistic machine learning is computing a posterior distribution by taking a joint distribution and dividing it by a marginal, which involves integration. Finally, Hennig discusses the importance of optimization, which is the foundational operation in machine learning, involving computing values that minimize loss functions. These algorithms form the basis for differentiable programs, for which the gradient of the function can be computed automatically.
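
The statement that a posterior is a joint distribution divided by a marginal, and that the marginal is an integral, can be made concrete with a one-dimensional example. The coin-toss model, the uniform prior, and the simple trapezoid rule below are illustrative assumptions and are not taken from the lecture.

```python
import numpy as np


def trapezoid(values, grid):
    """Composite trapezoid rule for samples `values` on the points `grid`."""
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid)))


# Unknown coin bias theta with a uniform prior on [0, 1].
theta = np.linspace(0.0, 1.0, 2001)
prior = np.ones_like(theta)

# Observed data: 7 heads in 10 tosses.
heads, tosses = 7, 10
likelihood = theta**heads * (1.0 - theta) ** (tosses - heads)

joint = likelihood * prior               # p(data, theta)
evidence = trapezoid(joint, theta)       # marginal p(data), an integral over theta
posterior = joint / evidence             # p(theta | data)
posterior_mean = trapezoid(theta * posterior, theta)
print(evidence, posterior_mean)
```

In one dimension a grid is enough; the difficulty the lecture points to arises because the same integral has no such cheap approximation in high-dimensional models.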

  • 00:40:00 In this section, Philipp Hennig discusses optimization algorithms and their importance in machine learning. While classic methods like BFGS are available in scipy.optimize through its minimize function, new methods like SGD and Adam are now the norm in machine learning. However, these methods often require a learning rate and lots of supervision, unlike the older methods, which can converge to a minimum and work on any differentiable problem. To cope with large datasets with millions of data points, the gradient is computed on a mini-batch, a much smaller sum that is an unbiased estimator of the full gradient. Although these new methods are more efficient and effective, they are still based on the same principles as the old algorithms, which may cause issues for certain applications.

  • 00:45:00 In this section of the video, the speaker discusses the possibility of computing variance in addition to gradient in deep learning algorithms. He argues that the omission of variance computation from the optimization process is because optimization is still viewed as a gradient computation problem rather than a problem of using random variables to find points that generalize well. However, he highlights the importance of including uncertainty arising from randomness in computations, noting that it is essential to building better training setups for deep neural networks. He concludes by mentioning upcoming lectures that will delve deeper into this topic.
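
The claims that a mini-batch gradient is an unbiased estimate of the full gradient, and that its variance can be computed alongside it, can be illustrated on a least-squares problem. The synthetic data, batch size, and squared-error loss are assumptions chosen for the sketch, not details from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch_size = 1000, 3, 50

# Synthetic regression data (illustrative only).
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
w = np.zeros(d)

# Per-example gradients of the squared error 0.5 * (x @ w - y) ** 2.
residuals = X @ w - y
per_example_grads = residuals[:, None] * X       # shape (n, d)
full_gradient = per_example_grads.mean(axis=0)

# Split a random permutation into equally sized mini-batches.
batches = rng.permutation(n).reshape(-1, batch_size)
batch_grads = per_example_grads[batches].mean(axis=1)   # shape (20, d)

# Spread of the mini-batch gradients around the full gradient.
batch_variance = batch_grads.var(axis=0, ddof=1)
print(full_gradient, batch_variance)
```

Averaging the mini-batch gradients over a full pass recovers the full gradient, while the variance quantifies how noisy each individual step is, which is the extra information the speaker argues is usually discarded.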

  • 00:50:00 In this section, Philipp Hennig discusses the use of observables to add new functionality to deep neural networks, such as uncertainty or making them into a Bayesian deep neural network without using the expensive Markov chain Monte Carlo algorithms. He also explains how numerical algorithms used to train machine learning algorithms are actually machine learning algorithms themselves, as they estimate an unknown quantity or latent variable while observing tractable, observable data. This is similar to the process of inference, where a latent quantity is estimated based on observed results from a computation.

  • 00:55:00 In this section, Philipp Hennig introduces the concept of numerical algorithms as learning machines and discusses the idea behind building numerical algorithms from the ground up as probabilistic numerical algorithms. These are algorithms that take a probability distribution describing their task and use the CPU or the GPU as a data source to refine their estimate of what the solution to the numerical task is. Hennig emphasizes that the class is not a typical numerical analysis class, as the focus is on understanding machines inside as learning machines and building new algorithms in the language of machine learning. Students can expect to write a lot of code in this computer science class.

  • 01:00:00 In this section, Philipp Hennig introduces his course on Numerics of Machine Learning, which he claims is the first dedicated course of its kind in the world. The course aims to delve into the workings of machine learning algorithms, specifically how they function inside the box and how they can be changed or adapted to improve learning machines. The highly technical nature of numerical algorithms and machine learning algorithms means that knowledge in this area is highly sought after by both researchers and industry professionals. The lectures will be taught by his team of highly-experienced PhD students, who have spent years researching and thinking about the inner workings of these algorithms, and are thus more equipped to discuss the finer technical details than a professor.

  • 01:05:00 In this section, Philipp Hennig discusses the structure of the course and the course requirements. The course will include both theoretical and coding work, as students will be expected to solve numerical problems using either Python or Julia code. The exercises will be submitted as a PDF, with solutions graded on a binary basis: a tick mark will be given for a good solution, and a cross for an unsatisfactory one. The students will get a bonus point for each tick mark, which will count towards the final exam result. The exam will take place on the 13th of February or the 31st of March next year, and passing the first exam is encouraged as a resit may not be available. Finally, students interested in achieving a higher degree in numerical algorithms in machine learning or data-centric computation are encouraged to take this course as it offers ample opportunities for applied research in various fields.

  • 01:10:00 In this section, Philipp Hennig emphasizes the importance of numerical algorithms in machine learning, stating that they are the engines that drive the learning machine. He describes how understanding these algorithms and their Bayesian inference language can lead to faster, more reliable, and easier to use machine learning solutions. Hennig stresses that while classic numerical algorithms are important, they should be viewed through the lens of machine learning, adopting the perspective of learning machines as a means to integrate simulation and deep learning in a more holistic way. He invites students to join this exciting experiment in teaching machine learning with a unique setup of nine different instructors.
Numerics of ML 1 -- Introduction -- Philipp Hennig
  • 2023.01.16
  • www.youtube.com
The first lecture of the Master class on Numerics of Machine Learning at the University of Tübingen in the Winter Term of 2022/23. This class discusses both ...
 

Lecture 2 -- Numerical Linear Algebra -- Marvin Pförtner



Numerics of ML 2 -- Numerical Linear Algebra -- Marvin Pförtner

Numerical linear algebra is fundamental to machine learning, Gaussian processes and other non-parametric regression methods. The lecture covers various aspects of numerical linear algebra, including the importance of understanding the structure of a matrix for more efficient multiplication, the optimization of machine learning algorithms through solving hyperparameter selection problems and computing kernel matrices, and the solution of a linear system using the LU decomposition, among others. The lecture also emphasizes the importance of implementing algorithms properly, as the algorithm used for a mathematical operation has a significant impact on performance, stability, and memory consumption.
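
To illustrate the point about matrix structure and efficient multiplication, here is a minimal sketch in plain Python. It assumes a matrix that is known to factor as U V^T with only k columns in U and V; the names `matvec` and `lowrank_matvec` and the example numbers are hypothetical and not taken from the lecture.

```python
def matvec(A, x):
    """Dense matrix-vector product; costs O(rows * cols) multiplications."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def lowrank_matvec(U, V, x):
    """Compute (U V^T) x as U (V^T x), never forming the n x n matrix."""
    Vt = [list(col) for col in zip(*V)]  # k x n transpose of V
    t = matvec(Vt, x)                    # k numbers, O(n * k) work
    return matvec(U, t)                  # n numbers, O(n * k) work


# Hypothetical rank-2 example with n = 4.
U = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0], [1.0, 1.0]]
V = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
x = [1.0, -1.0, 2.0, 0.5]

# Naive route: form A = U V^T explicitly (O(n^2) memory), then multiply.
A = [[sum(u * v for u, v in zip(u_row, v_row)) for v_row in V] for u_row in U]
dense_result = matvec(A, x)

# Structured route: same result with O(n * k) work and memory.
structured_result = lowrank_matvec(U, V, x)
```

Both routes give the same vector, but the structured one never stores the full matrix, which is what makes the difference once n is large and k stays small.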

In the second part of the video, Marvin Pförtner discusses the importance of numerical linear algebra in machine learning algorithms. He covers various topics including LU decomposition, Cholesky decomposition, matrix inversion lemma, and Gaussian process regression. Pförtner emphasizes the importance of utilizing structure to make algorithms more efficient and highlights the importance of numerical stability in solving large systems of equations in Gaussian process regression. He also discusses techniques such as active learning and low rank approximations to handle large datasets and the potential memory limitations of kernel matrices. Overall, the video showcases the crucial role that numerical linear algebra plays in many aspects of machine learning.
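
The LU-based workflow mentioned above (factorize once, then reuse the factors for several right-hand sides and for the log-determinant) can be sketched as follows. This is a plain-Python illustration under the usual textbook formulation with partial pivoting, not the lecturer's implementation; the function names and the 3 x 3 example are assumptions made for illustration.

```python
import math


def lu_inplace(A):
    """LU decomposition with partial pivoting on a copy of A.

    Returns (LU, perm): the strict lower triangle of LU holds L (its unit
    diagonal is implied), the upper triangle holds U, and perm[i] is the
    original row index that ended up in position i.
    """
    n = len(A)
    LU = [row[:] for row in A]
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: bring the largest entry of column k to the diagonal.
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        if LU[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            LU[k], LU[p] = LU[p], LU[k]
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]                 # multiplier, stored in L
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]  # outer-product update
    return LU, perm


def lu_solve(LU, perm, b):
    """Solve A x = b in O(n^2) using an existing factorization."""
    n = len(LU)
    y = [b[p] for p in perm]           # apply the row permutation to b
    for i in range(n):                 # forward substitution with L
        for j in range(i):
            y[i] -= LU[i][j] * y[j]
    for i in reversed(range(n)):       # backward substitution with U
        for j in range(i + 1, n):
            y[i] -= LU[i][j] * y[j]
        y[i] /= LU[i][i]
    return y


def log_abs_det(LU):
    """log|det A| is the sum of log|U_ii|; the permutation only flips the sign."""
    return sum(math.log(abs(LU[i][i])) for i in range(len(LU)))


A = [[4.0, 3.0, 2.0], [2.0, 1.0, 3.0], [3.0, 4.0, 1.0]]
LU, perm = lu_inplace(A)                     # O(n^3), done once
x1 = lu_solve(LU, perm, [16.0, 13.0, 14.0])  # O(n^2) per right-hand side
x2 = lu_solve(LU, perm, [4.0, 2.0, 3.0])
logdet = log_abs_det(LU)
```

For a symmetric positive definite kernel Gram matrix, a Cholesky factorization would play the same role at roughly half the cost and without pivoting.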

  • 00:00:00 In this section, Marvin Pförtner discusses the importance of numerical linear algebra in machine learning and Gaussian processes. Numerical linear algebra is fundamental to machine learning: it is the set of tools required to implement algorithms. The lecture covers fundamental tasks in numerical linear algebra important for machine learning, exploiting structure to make numerical linear algebra algorithms fast and reliable, and implementing Gaussian process regression properly. The lecture also cites examples of applications of numerical linear algebra like basic probability theory, general linear models, principal component analysis, and matrix-vector products that perform dimensionality reduction.

  • 00:05:00 In this section, the speaker discusses numerical linear algebra in the context of machine learning. He explains how Gaussian processes, a non-parametric regression method in machine learning, rely on a Gaussian process prior, whose kernel gives rise to a symmetric and positive definite kernel Gram matrix. The structure in this matrix allows for efficient and reliable algorithms. The speaker also mentions how similar equations apply to a larger class of models, including kernel methods and ridge regression. He also briefly discusses how numerical linear algebra is used to solve linear partial differential equations and in optimization methods for local optimization of loss functions.

  • 00:10:00 In this section, the speaker discusses the importance of linear algebra in machine learning and provides examples to illustrate this importance. Linear algebra operations like matrix-vector multiplication, linear system solves, and matrix decompositions are fundamental to many machine learning models. Furthermore, he notes that many machine learning problems are actually noisy, since they only have a noisy estimate of the matrix with which they aim to solve linear systems. Finally, he emphasizes that log-determinants are essential in the Gaussian density case and in GP regression to obtain maximum a posteriori estimates.

  • 00:15:00 In this section, the speaker emphasizes the importance of efficient matrix-vector multiplication in numerical linear algebra and machine learning. He gives an example of how even simple tasks can become computationally infeasible if the mathematical expression is not transformed into an algorithm properly. The speaker also highlights the importance of identifying the structure in the matrix for more efficient multiplication. He concludes by stating that the algorithm implementing a mathematical operation has a significant impact on performance, stability, and memory consumption.

  • 00:20:00 In this section, the speaker emphasizes the importance of understanding the structure of a matrix for optimizing machine learning algorithms. He explains that if you know there is low-rank structure within a matrix, then you should use methods specialized to low-rank matrices and work with the factors, rather than multiplying out the complete matrix. He explains that low rank is just one type of structure, and that there are various other matrix structures, such as sparse matrices, which depend on the number of non-zero entries, and kernel matrices, which depend on the input dimensions of the regressor. The speaker also touches on how to store kernel matrices in order to get memory savings.

  • 00:25:00 In this section, the speaker discusses how to store and evaluate kernel matrices for Gaussian processes efficiently. If the number of data points exceeds a certain limit, the naive approach of storing the full kernel matrix is no longer feasible due to memory issues. There are libraries available that write very efficient CUDA kernels and use GPUs to compute Gaussian processes on a laptop using hundreds of thousands of data points. The speaker also talks about matrices with a general functional form, like auto-diff graphs, which have the same time and space requirements. Lastly, the speaker delves into a concrete algorithm, Gaussian process regression as a form of Bayesian regression, where the kernel of the Gaussian measure is the covariance of the unknown function. The speaker presents a plot of the posterior measure over the function together with the observed data and shows that the uncertainty quantification works well. However, the problem arises when computing the inverse, which scales cubically in the number of data points, making the naive approach of computing and inverting a kernel Gram matrix from n data points infeasible for large n.

  • 00:30:00 In this section, the speaker discusses the numerical complexity of computing kernel matrices in Gaussian processes, which can become prohibitively expensive. Additionally, there are hyperparameters that need to be tuned for the kernel, such as output scale and length scale, in order to optimize the prior to explain the observed data set. The speaker describes a Bayesian approach to solving this model selection problem by computing the log marginal likelihood and minimizing a loss function consisting of a trade-off between model fit and complexity represented by the normalization factor of the Gaussian distribution. The speaker shows examples of severe underfitting and overfitting and explains how the trade-off between these two terms can be found to achieve the best model performance.

  • 00:35:00 In this section, Marvin Pförtner discusses the solution of a linear system. The solution requires M plus one solves, where M is the number of data points at which we want to evaluate our regressor. The system is symmetric and positive definite in the most general case, but there might be additional structure to leverage, as the system is typically huge and we usually can't solve it for very large data sets. One very important matrix decomposition is the LU decomposition. The algorithm used for solving a lower triangular system is forward substitution, which splits the matrix into four parts: a scalar in the lower right corner, a zero column above it, a row vector to its left, and the remaining upper-left block, which is again lower triangular.

  • 00:40:00 In this section, Marvin Pförtner discusses how to solve systems where the system matrix is lower triangular. By splitting off the last row, the problem reduces to a lower triangular system of dimension n minus one, which can be solved with a simple algorithm. Applying this recursively solves a system of any given dimension. Pförtner also explains how to split a matrix into lower and upper triangular parts using the LU decomposition, which has a recursive definition based on divide-and-conquer techniques. This technique is useful for inverting matrices and makes solving linear systems less expensive: once the factorization is available, each solve is O(N^2) instead of O(N^3).

  • 00:45:00 In this section, the LU decomposition method for solving linear systems of equations is explained. This method decomposes a matrix into a lower triangular matrix and an upper triangular matrix, allowing for faster computation of solutions to linear systems. The process involves setting the diagonal entries of the lower triangular factor to one and using partial pivoting to ensure stability and robustness. Despite the method's efficiency, the cost of computing the decomposition, which is O(n^3), must be considered.

  • 00:50:00 In this section, Marvin Pförtner discusses the computational cost of the LU decomposition and demonstrates how to implement it in place. He explains that the biggest part of each recursion step is the calculation of the outer product and the subtraction, which results in a summation over two times (n-1) squared. Using a strategy known as Gaussian elimination, the algorithm efficiently computes the upper triangular matrix. Pförtner shows how to perform an example computation with a small matrix, demonstrating that the non-trivial part of L is contained in the three entries below the diagonal, and the upper triangular part will contain the non-zero parts of U. By keeping everything in memory, Pförtner presents an implementation that cleverly stores L and U in the same matrix.

  • 00:55:00 In this section, the speaker explains the process of LU decomposition in numerical linear algebra. He shows how to compute the algorithm step by step and how to use it to solve linear systems. Once we have the LU decomposition of a matrix, we can use it to efficiently solve linear systems with multiple right-hand sides, at a cost of only about 2N squared operations per right-hand side for one forward and one backward substitution. The inverse of a permutation matrix is just its transpose, which is cheap to compute, making it possible to perform K solves with the same system matrix in Gaussian process regression.

  • 01:00:00 In this section, the speaker discusses how to efficiently solve multiple linear systems with the same matrix by reusing a single LU decomposition. Additionally, a method for computing the log determinant from an LU decomposition is presented, so that one factorization serves as an efficient representation of the linear system for various linear algebra tasks. The speaker emphasizes the importance of utilizing structure to make algorithms more efficient and notes that the Cholesky decomposition is a specialized version of the LU decomposition that takes advantage of the symmetric and positive-definite nature of the kernel Gram matrix.

  • 01:05:00 In this section, the speaker discusses the computation of the posterior mean and covariance in Gaussian processes. To obtain the posterior mean, one needs to solve one system by forward substitution and another by backward substitution. The speaker notes that with the structure of the Cholesky factors of the covariance matrix, one can get a good low-rank approximation to the matrix. Furthermore, he talks about the problem of potentially not being able to fit the large kernel matrix into memory and presents two approaches to solving this problem: using structure in the kernels employed or using sparse approximations.

  • 01:10:00 In this section, the speaker discusses how to efficiently invert matrices in machine learning algorithms. He uses a data set generated from a sinusoidal function as an example and shows that by knowing the generative structure of the data set, one can choose kernels that reflect that knowledge and are computationally efficient. The Matrix Inversion Lemma is a tool for efficiently inverting a matrix that consists of an easily invertible matrix perturbed by a low-rank term. By using this lemma, one can compute expressions very efficiently without even forming the entire matrix in memory. The speaker emphasizes that there are many different approaches to using structure in machine learning algorithms.

  • 01:15:00 In this section, the lecturer discusses numerical linear algebra methods used in Gaussian inference and hyperparameter optimization in machine learning. One method for scaling GP (Gaussian process) regression to large datasets is approximate inversion, which involves iteratively constructing low-rank approximations to the system matrix, here the kernel matrix. The lecturer demonstrates this method using the Cholesky algorithm as an example and shows how a low-rank approximation to the matrix can be obtained on the fly without computing the whole Cholesky factorization. The quality of the approximation depends on the kernel matrix and the order in which the data points are processed. Overall, this section highlights the importance of numerical linear algebra in various aspects of machine learning.

  • 01:20:00 In this section, Marvin Pförtner discusses how to choose the order in which the Cholesky algorithm processes the data points when approximating the kernel matrix. He explains that pre-multiplying the Gram matrix with a permutation matrix, also known as full pivoting or pivoted Cholesky decomposition, can lead to a good low-rank approximation with fewer iterations. The idea is to observe the predictor for the data points after one iteration of Cholesky and then use the information gathered to select the data point to observe in the next iteration. This technique is considered an active learning problem and can yield a clever way to process rows and columns simultaneously and thus explore the generative structure of the matrix in an online fashion.

  • 01:25:00 In this section, the speaker discusses the singular value decomposition (SVD) and how it solves an optimization problem to get the best factors for a matrix approximation. However, truncating an SVD could be arbitrarily bad, so a heuristic approach is used to approximate the SVD and compute an eigen decomposition. There is also a need for a matrix square root, which can be achieved through the Cholesky decomposition. It is important to take into account structure when implementing numerical linear algebra algorithms in practice, as this can significantly speed up the process.

  • 01:30:00 In this section, Marvin Pförtner discusses how the structure of numerical linear algebra affects Gaussian process regression. Gaussian process regression is computationally intensive and requires solving large systems of equations, which can be done using numerical linear algebra techniques. The speaker emphasizes the importance of numerical stability in solving these systems of equations to avoid losing accuracy in the final results.
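
The pivoted (partial) Cholesky idea from the 01:15:00 and 01:20:00 sections can be sketched in plain Python as below. This is a generic greedy variant that always picks the data point with the largest remaining diagonal entry, which is an assumption about the pivoting rule rather than a transcription of the lecture; the RBF kernel, the helper names, and the input points are likewise hypothetical.

```python
import math


def rbf_kernel_matrix(xs, lengthscale=1.0):
    """Kernel Gram matrix of a squared-exponential kernel on 1-D inputs."""
    return [[math.exp(-0.5 * ((a - b) / lengthscale) ** 2) for b in xs]
            for a in xs]


def pivoted_cholesky(K, rank):
    """Greedy partial Cholesky: returns (L, pivots) with K ~= L L^T.

    L is n x rank. Each step processes the data point whose residual
    variance (diagonal of K - L L^T) is currently largest.
    """
    n = len(K)
    d = [K[i][i] for i in range(n)]          # residual diagonal
    L = [[0.0] * rank for _ in range(n)]
    pivots = []
    for k in range(rank):
        p = max(range(n), key=lambda i: d[i])
        if d[p] <= 1e-12:                    # nothing left to explain
            break
        pivots.append(p)
        root = math.sqrt(d[p])
        for i in range(n):
            s = sum(L[i][j] * L[p][j] for j in range(k))
            L[i][k] = (K[i][p] - s) / root   # new column from column p of K
        for i in range(n):
            d[i] -= L[i][k] ** 2             # update the residual diagonal
    return L, pivots


def residual_trace(K, L):
    """Trace of K - L L^T, a simple measure of the approximation error."""
    return sum(K[i][i] - sum(v * v for v in L[i]) for i in range(len(K)))


xs = [0.0, 0.7, 1.5, 2.1, 3.0, 4.2]
K = rbf_kernel_matrix(xs)
L1, _ = pivoted_cholesky(K, 1)
L3, pivots3 = pivoted_cholesky(K, 3)
Lfull, _ = pivoted_cholesky(K, len(xs))
err1, err3 = residual_trace(K, L1), residual_trace(K, L3)
```

Only one column of K is touched per iteration, so a rank-r approximation never needs the full matrix in memory; how quickly the error drops depends on the kernel and on the pivot order.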
Numerics of ML 2 -- Numerical Linear Algebra -- Marvin Pförtner
  • 2023.01.16
  • www.youtube.com
The second lecture of the Master class on Numerics of Machine Learning at the University of Tübingen in the Winter Term of 2022/23. This class discusses both...
 

Lecture 3 -- Scaling Gaussian Processes -- Jonathan Wenger



Numerics of ML 3 -- Scaling Gaussian Processes -- Jonathan Wenger

Jonathan Wenger discusses techniques for scaling Gaussian processes for large datasets in the "Numerics of ML 3" video. He explores iterative methods to solve linear systems and learn the matrix inverse, with the primary goal of achieving generalization, simplicity/interpretability, uncertainty estimates, and speed. Wenger introduces low-rank approximations to the kernel matrix such as the iterative Cholesky decomposition, partial Cholesky, and conjugate gradient methods. He also discusses preconditioning to accelerate convergence and improve stability when dealing with large datasets. Finally, he proposes using an orthogonal matrix Z to rewrite the trace of a matrix, which could potentially lead to quadratic time for scaling Gaussian processes.
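
As a concrete reference for the conjugate gradient method mentioned above, here is a minimal plain-Python sketch of the textbook algorithm for a symmetric positive definite system. It only needs a function that multiplies the matrix with a vector; the function names, tolerance, and the 2 x 2 example are assumptions for illustration and not taken from the lecture.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradients(matvec, b, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive definite A.

    `matvec(v)` must return A v; A itself is never formed or factorized.
    Returns the solution estimate and the number of iterations used.
    """
    n = len(b)
    max_iter = 10 * n if max_iter is None else max_iter
    x = [0.0] * n
    r = list(b)             # residual b - A x for the start x = 0
    d = list(r)             # first direction: steepest descent
    rs = dot(r, r)
    iterations = 0
    while iterations < max_iter and rs ** 0.5 >= tol:
        Ad = matvec(d)
        alpha = rs / dot(d, Ad)                        # exact line search
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * adi for ri, adi in zip(r, Ad)]
        rs_new = dot(r, r)
        # Next direction: new residual made conjugate to the previous ones.
        d = [ri + (rs_new / rs) * di for ri, di in zip(r, d)]
        rs = rs_new
        iterations += 1
    return x, iterations


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
solution, iterations = conjugate_gradients(
    lambda v: [dot(row, v) for row in A], b)
```

Each iteration costs one matrix-vector product, which is where the quadratic cost per iteration for a dense kernel matrix comes from.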

In the second part of the lecture, Jonathan Wenger continues with scaling Gaussian processes (GPs) to large datasets. He presents various strategies for improving the convergence rate of Monte Carlo estimates for GP regression, including reusing the preconditioners from the linear system solve to estimate quantities involving the kernel matrix and its inverse. He also introduces the idea of linear-time GPs through variational approximation and addresses uncertainty quantification with the inducing point method. With these strategies, scaling up to datasets with up to a million data points is possible on a GPU, making it easier to optimize hyperparameters quickly.
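
The Monte Carlo trace estimates referred to here can be illustrated with the standard stochastic (Hutchinson-style) estimator sketched below in plain Python. It assumes random sign vectors, whose covariance is the identity; the lecture may use a different distribution for the random vectors, and the function name, seed handling, and example matrix are hypothetical.

```python
import random


def hutchinson_trace(matvec, n, num_samples, seed=0):
    """Estimate trace(A) as the average of z^T A z over random sign vectors z.

    Because E[z z^T] is the identity, E[z^T A z] = trace(A). Only
    matrix-vector products with A are needed.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        Az = matvec(z)
        total += sum(zi * azi for zi, azi in zip(z, Az))
    return total / num_samples


A = [[2.0, 0.5, 0.5], [0.5, 3.0, 0.5], [0.5, 0.5, 4.0]]  # exact trace: 9


def apply_A(v):
    return [sum(a * vi for a, vi in zip(row, v)) for row in A]


estimate = hutchinson_trace(apply_A, n=3, num_samples=2000)
```

The error of such an estimate shrinks only with the square root of the number of samples, which is why the lecture looks at preconditioning as a way to reduce the variance.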

  • 00:00:00 In this section of the video, Jonathan Wenger discusses how to scale Gaussian processes for large datasets using iterative methods to solve linear systems. He explains that these methods can be viewed as learning algorithms for the matrix inverse, which is the primary object needed to compute the GP posterior. Wenger also outlines the main goals for regression, including generalization, simplicity/interpretability, uncertainty estimates, and speed. He notes that GPs are prime examples of models that can achieve all of these goals, but they are expensive to train and to do inference with. However, by developing modern methods to solve linear systems with kernel matrices, inference for GPs can be done in quadratic time, which is faster than cubic time. Wenger also hints that there is a way to do this even faster, in linear time, but acknowledges that there may be some drawbacks that he will discuss further in the next lecture.

  • 00:05:00 In this section, the speaker discusses the limitations of the Cholesky decomposition for Gaussian processes when dealing with large datasets, as it becomes prohibitive in terms of time and space complexity. He proposes iterative methods to reduce the complexity to quadratic in the number of data points, showing how iterative Cholesky is used for low-rank approximation of the kernel matrix. However, the goal is not to approximate the kernel matrix itself, since GP regression requires an approximation of the inverse of the kernel matrix, the precision matrix. The question is therefore whether the iterative formulation of the Cholesky decomposition can be interpreted as an approximation to the precision matrix for linear solves.

  • 00:10:00 In this section, the speaker explores an iterative form of the Cholesky decomposition, which can be used for low-rank approximations to kernel matrices. By tracking additional quantities, it is possible to get an inverse approximation to the matrix, which is also low-rank, similar to the Cholesky. The speaker demonstrates how to compute this inverse approximation recursively, in terms of the Cholesky factors and the residual. This iterative method can be used as an approximate matrix inversion algorithm for positive-definite matrices, such as kernel matrices, and is a useful tool for scaling Gaussian processes.

  • 00:15:00 In this section, the speaker discusses the use of the partial Cholesky method for scaling Gaussian processes. The method involves modifying the Cholesky decomposition with a factor and multiplying it with a vector. This results in an iterative process that produces an inverse approximation by adding outer products of vectors. The complexity analysis shows that it is equally expensive as approximating the matrix itself. The speaker also compares the partial Cholesky method with GP regression and highlights the importance of selecting the right data points or unit vectors to improve the learning process.

  • 00:20:00 In this section, Jonathan Wenger discusses the importance of selecting the right data points when approximating the kernel matrix for Gaussian processes (GPs). He illustrates how a random selection of data points to condition on can result in a slower learning process. He introduces the "method of conjugate gradients," originally designed to solve linear systems, and applies it to GP regression. This method rephrases the problem Ax = b, where A is a kernel matrix and b is a vector of size n, as a quadratic optimization problem whose minimizer is the solution of the linear system Ax = b. Taking the gradient of the quadratic function and setting it to zero recovers Ax = b, and the residual can be defined as b minus Ax, which can be used to find a better and more efficient way to select data points to speed up the learning process.

  • 00:25:00 In this section, Jonathan Wenger discusses the use of conjugate directions for optimization in Gaussian processes. He explains that by modifying the direction we walk in, we can converge in at most n steps when using conjugate directions. To start, he uses the negative gradient, the direction of steepest descent, as the first step and modifies the subsequent steps to satisfy the conjugacy condition. He presents the algorithm and explains its high-level parts, including the stopping criterion based on the gradient norm.

  • 00:30:00 In this section, Jonathan Wenger discusses the method of conjugate gradients, which is a method for approximating the inverse when solving multiple linear systems for the posterior covariance. The conjugate gradients method constructs an approximation for the inverse, which is low-rank in the same way as the partial Cholesky. The update for the solution estimate involves a conjugate direction d_i, and the matrix C_i approximates the inverse using all previous search directions stacked into columns. This method allows for quickly solving the linear system, and its low-rank structure makes it an efficient method for scaling Gaussian processes.

  • 00:35:00 In this section, the speaker compares the partial Cholesky method to the conjugate gradient method for Gaussian process inference. The conjugate gradient method converges much faster, and the speaker explains that the "actions" used in the conjugate gradient method probe the matrix in a different way, which allows for better convergence. However, the speaker notes that it is important to analyze how fast the method converges, which requires an understanding of numerics, specifically machine precision and the condition number. The condition number is the maximum eigenvalue divided by the minimum eigenvalue in absolute terms and measures the unavoidable error amplification when implementing inversion algorithms.

  • 00:40:00 In this section, Jonathan Wenger discusses the stability and convergence behavior of methods for solving linear systems with kernel matrices, such as the conjugate gradient method or the Cholesky decomposition. The stability is determined by the condition number of the matrix, which depends on its eigenvalues, and the larger the condition number, the more unstable the method is. The convergence behavior is also determined by the condition number, that is, the largest eigenvalue divided by the smallest eigenvalue. The closer the condition number is to one, the faster the convergence. Despite the moderately large condition number of the kernel matrix with a thousand data points, Wenger shows that the conjugate gradient method still converges quickly, in a few hundred iterations, which is small relative to the problem size.

  • 00:45:00 In this section, Jonathan Wenger discusses scaling Gaussian processes and the impact of observation noise on convergence. As observation noise decreases, the convergence of CG slows down due to the blow-up of the condition number of the kernel matrix. The condition number is the largest eigenvalue divided by the smallest eigenvalue, and as data points get closer to each other, the condition number blows up. To solve this problem, preconditioning can be used: a preconditioner approximates the kernel matrix, under the assumption that the approximation is cheap to store relative to the actual matrix. If the inverse of the approximation can be evaluated efficiently, the preconditioner replaces the original problem with an easier-to-solve one, resulting in faster convergence of CG.

  • 00:50:00 In this section, Jonathan Wenger discusses the concept of preconditioning in scaling Gaussian processes for more efficient linear system solving. He uses the example of probabilistic learning methods to explain how prior knowledge of a problem can make it easier to solve, and similarly, preconditioning transforms a problem to be closer to the identity matrix and therefore easier to solve. By using a preconditioner, the condition number of the system gets lowered, which accelerates CG and makes it more stable. Wenger demonstrates the efficiency of preconditioning by using a low-rank plus diagonal preconditioner and partial SVD to solve a large-scale linear system with 100,000 data points in seven minutes.

  • 00:55:00 In this section, the speaker discusses the use of preconditioned conjugate gradients (CG) for solving linear systems during hyperparameter optimization for Gaussian processes. In order to evaluate the loss and compute its gradient, we need to solve linear systems and compute traces. However, computing the trace involves n matrix-vector multiplies, which is too expensive for large datasets. To solve this, the speaker proposes using an orthogonal matrix Z such that Z x Z(transpose) = identity matrix, allowing us to rewrite the trace of A as the trace of Z(transpose) x A x Z. This approximation method could potentially lead to quadratic time for scaling Gaussian processes.

  • 01:00:00 In this section, the presenter discusses the challenge of scaling up the calculation of the trace of the kernel matrix, which involves performing many matrix-vector multiplications. One potential solution is to randomize the calculation by drawing random vectors, scaled with the square root of the dimension, whose covariance is the identity. Because the covariance of the random vectors is the identity, the trace can be estimated from matrix-vector products with these vectors, and in expectation the estimate equals the trace computed without random vectors. However, using plain Monte Carlo estimators in this way is insufficient for large datasets, as it requires tens of thousands of random vectors, making the hyperparameter optimization slow.

  • 01:05:00 In this section, Jonathan Wenger discusses scaling Gaussian processes (GPs) for large datasets. He explains that existing preconditioners for the linear system solve can be reused to estimate quantities involving the kernel matrix and its inverse, which helps with the data scaling issue. Combining the preconditioner, for example a partial Cholesky, with the stochastic trace estimate helps in estimating the trace. Using the same information, one can estimate the gradient of the log determinant too. By using these strategies, scaling up to datasets with up to a million data points is possible with the GPU. Wenger notes that pre-training involves using a small dataset as a springboard to optimize the hyperparameters.

  • 01:10:00 In this section, the speaker discusses different strategies to improve the convergence rate of Monte Carlo estimates for Gaussian process regression. By inheriting the convergence rate of the preconditioner, the estimate can converge to the true value exponentially or polynomially fast. The choice of actions used to observe the kernel matrix through matrix-vector multiplies can also affect how fast convergence can be achieved. Therefore, in order to develop fast numerical algorithms for Gaussian processes, domain expertise is needed, which can be provided through preconditioners or the choice of actions. Additionally, the idea of linear-time GPs through variational approximation is introduced, which involves compressing the data into a smaller summary training dataset.

  • 01:15:00 In this section, Wenger discusses the use of Gaussian processes and how they can be scaled effectively. The idea is to summarize the training data to provide a direct approximation to the posterior, which only costs on the order of I squared times n, where I is the number of inducing inputs and n is the size of the training data. However, iterative methods require hyperparameter optimization, which also needs to be considered. Stochastic methods like batched optimization or SGD can be used in this case, so the objective can be optimized quickly using a preferred optimizer. All the essential operations are I cubed or I squared times n, except for evaluating the kernel matrix, which is the most costly operation.

  • 01:20:00 In this section, the speaker discusses the issue of uncertainty quantification with scaling Gaussian processes using the inducing point method, which requires setting the number of inducing points a priori for the data set. As the optimizer searches for better summary data points, the resulting uncertainty quantification becomes significantly different from the true Gaussian process. While iterative methods can control the accuracy of the approximation until time runs out, the inducing point method requires controlling the fidelity of the approximation before optimization. The speaker poses the question of whether a method can be designed where the uncertainty quantification can be trusted at any point of the approximation, regardless of computation time.
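
The earlier remarks in this lecture about the condition number blowing up as observation noise shrinks or data points move closer together can be checked on the smallest possible case. The sketch below uses a two-point squared-exponential kernel matrix with unit prior variance, for which the eigenvalues are known in closed form; the kernel choice and the function name are assumptions made for illustration.

```python
import math


def two_point_condition_number(distance, noise_variance, lengthscale=1.0):
    """Condition number of [[1 + s, k], [k, 1 + s]] with s = noise variance.

    k is the squared-exponential kernel value between the two inputs.
    The eigenvalues of this 2 x 2 matrix are (1 + s + k) and (1 + s - k).
    """
    k = math.exp(-0.5 * (distance / lengthscale) ** 2)
    lam_max = 1.0 + noise_variance + k
    lam_min = 1.0 + noise_variance - k
    return lam_max / lam_min


far_apart = two_point_condition_number(distance=10.0, noise_variance=0.0)
close_noisy = two_point_condition_number(distance=0.1, noise_variance=1e-2)
close_clean = two_point_condition_number(distance=0.1, noise_variance=1e-6)
moderate = two_point_condition_number(distance=1.0, noise_variance=1e-2)
```

Well-separated points give a condition number near one, while nearly coincident points with little noise give a very large one, which is the regime where CG slows down and a preconditioner helps most.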
 

Lecture 4 -- Computation-Aware Gaussian Processes -- Jonathan Wenger



Numerics of ML 4 -- Computation-Aware Gaussian Processes -- Jonathan Wenger

In this video on Numerics of ML, Jonathan Wenger discusses computation-aware Gaussian processes and their ability to quantify the approximation error and uncertainty in predictions. He explores the importance of choosing the right actions and how conjugate gradients can significantly reduce uncertainty and speed up learning. Wenger also talks about using linear time GP approximations based on inducing points but highlights the issues that arise from such approximations. Finally, he discusses updating beliefs about representative weights and using probabilistic learning algorithms to solve for the error in the representative weights. Overall, the video demonstrates the effectiveness of computation-aware Gaussian processes in improving the accuracy of predictions by accounting for computational uncertainties.

Jonathan Wenger also discusses the computation-aware Gaussian process and its complexity in this video. He explains that it is only necessary to compute and store the upper quadrant of the kernel matrix, and the computational cost of the algorithm is proportional to the size of this quadrant. The Gaussian process can be used on datasets of arbitrary size, as long as computations target only certain data points, blurring the line between data and computation. Wenger argues that the GP can be modeled to account for this situation by conditioning on projected data. He introduces a new theorem that allows for exact uncertainty quantification with an approximate model. Finally, he previews next week's lecture on extending the GP model to cases where a physical law partially governs the function being learned.
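
The idea of conditioning on projected data can be illustrated with ordinary Gaussian conditioning on a single linear projection of the observations. The plain-Python sketch below assumes a zero-mean prior f ~ N(0, K), noisy data y = f + noise, and one "action" vector s, so that only the number s^T y is observed. It is standard Gaussian algebra and not necessarily the exact algorithm from the lecture; all names and numbers are hypothetical.

```python
def condition_on_projection(K, noise_variance, s, y):
    """Posterior over f after observing only the projection s^T y.

    Prior: f ~ N(0, K). Data: y = f + eps, eps ~ N(0, noise_variance * I).
    Cov(f, s^T y) = K s and Var(s^T y) = s^T (K + noise_variance * I) s,
    so the usual Gaussian conditioning formulas apply.
    """
    n = len(K)
    Ks = [sum(K[i][j] * s[j] for j in range(n)) for i in range(n)]
    gram = (sum(si * ksi for si, ksi in zip(s, Ks))
            + noise_variance * sum(si * si for si in s))
    observed = sum(si * yi for si, yi in zip(s, y))
    mean = [ksi * observed / gram for ksi in Ks]
    cov = [[K[i][j] - Ks[i] * Ks[j] / gram for j in range(n)]
           for i in range(n)]
    return mean, cov


K = [[1.0, 0.5], [0.5, 1.0]]
y = [2.0, 1.0]

# Action e_1: equivalent to conditioning on the first data point only.
mean_a, cov_a = condition_on_projection(K, 0.25, [1.0, 0.0], y)

# Action (1, 1): a single computation that touches both data points.
mean_b, cov_b = condition_on_projection(K, 0.25, [1.0, 1.0], y)
```

With a unit-vector action the result matches conditioning on one data point, which is one way to see why the line between "data" and "computation" blurs; whatever has not been projected onto remains as posterior uncertainty.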

  • 00:00:00 In this section, Jonathan Wenger talks about the final culmination of their Gaussian processes lectures, where he demonstrates how to conduct exact uncertainty quantification in arbitrary time. He explains that this approach allows users to always quantify how far they are from the function they are trying to learn, no matter how much computation they put in, or whatever their budget is. By reinterpreting the algorithms from the previous lectures as learning agents, they are able to quantify the approximation error, which is introduced into the prediction posterior. Additionally, they discuss what it means to observe data through a computer and the philosophical debate surrounding it.

  • 00:05:00 In this section, Jonathan Wenger discusses the importance of choosing the right actions when dealing with Computation-Aware Gaussian Processes. He shows that the choice of actions can significantly reduce uncertainty and speed up the process of learning about the phenomena being predicted. Furthermore, he explores the method of conjugate gradients as a way of finding better actions when solving linear systems or minimizing quadratic functions. By taking into account the geometry of the problem, conjugate gradients can converge to a solution in a small number of steps.

  • 00:10:00 In this section of the video, Jonathan Wenger discusses computation-aware Gaussian processes and how they differ from other approximation methods. He notes that the most expensive operation in both the conjugate gradient and the partial Cholesky approximation methods is the matrix-vector multiplication. He then introduces linear-time GP approximations that are based on inducing points as summary data points, and he discusses the issues that arise from a linear-time approximation. Wenger then introduces computation-aware GP inference, which addresses the issue of exact uncertainty quantification, and says that it is cutting-edge research that will be presented at NeurIPS later this year.

  • 00:15:00 In this section, Jonathan Wenger discusses the computation-aware Gaussian process and how to quantify the approximation error that arises from using iterative methods to solve the linear system for the representer weights. He explains that the kernel functions in the GP model encode assumptions about what the true function looks like, and iterative solvers approximate these weights to construct a posterior mean prediction. By quantifying this approximation error probabilistically, it is possible to add the additional uncertainty to the prediction, so that the reported uncertainty also reflects the limited computation. Wenger also gives a brief recap of the linear algebra of Gaussian distributions and how they make calculations in probability theory easier, particularly conditioning on observations.

  • 00:20:00 In this section, Jonathan Wenger discusses the properties of Gaussian distributions and how they can be used to determine the posterior distribution over a variable X given observations Y. By combining the properties of scaling and marginalization, Gaussian distributions can be used to quantify the approximation error in estimates of the representer weights. Wenger explains how a prior Gaussian distribution can be updated and used to learn the true representer weights, which cannot be directly observed. The spread and orientation of a Gaussian bell curve can be used to determine the direction in which to look for the true representer weights.

  • 00:25:00 In this section, Jonathan Wenger explains how to indirectly observe a black dot in the lecture's plot of a computation-aware Gaussian process by using a residual and a vector transformation. He shows how to apply the affine Gaussian inference theorem to calculate the distance between the true representer weights and the estimated weights. The process involves collapsing the belief onto an orthogonal line and developing a one-dimensional probability belief, which is used to find the representer weights. Wenger also discusses how to select a more informative red line that aligns with the prior belief to reach a more accurate solution.

  • 00:30:00 In this section, Jonathan Wenger discusses an algorithm for updating a belief about the representer weights in computation-aware Gaussian processes through an observation given by an action times a residual. He explains that the update is an affine Gaussian inference step, and points out the key elements in the update process. While the algorithm is similar to CG and partial Cholesky, he notes that the choice of prior is still an issue that needs to be addressed, as it has to be related to where the true representer weights lie to obtain a good error estimate. Wenger proposes that the GP prior and its assumptions are related to the representer weights, because the weights involve the inverse of the kernel matrix, which makes them significant in the GP prior.

  • 00:35:00 In this section, Jonathan Wenger discusses how to understand what distribution the data was generated from before making any observations with a Gaussian Process (GP). Assuming a distribution over f, Wenger explains that the labels have zero mean when a zero-mean Gaussian prior is used, and covary according to the kernel matrix plus independent noise, which is part of the observation model. Wenger then discusses finding the representer weights using a probabilistic learning algorithm that updates the prior by projecting onto actions. Finally, Wenger explains how to solve the issue of needing the calibrated prior K hat inverse by computing a distribution of mu star evaluated at a data point, which is a linear function of V star.

  • 00:40:00 In this section, Jonathan Wenger explains computation-aware Gaussian processes and how to account for computational uncertainty. He discusses the idea of marginalization, where multiple options for a random variable are considered, and a posterior mean prediction is computed that takes all possible representer weight estimates into account. He explains how linear marginalization works and how it adds additional uncertainty to the covariance. Wenger then goes on to discuss the interpretation of the uncertainty of a GP as a mean error estimate and how the computational uncertainty can be considered an error estimate as well. Overall, the section explains how the error to the true function and the error in the representer weights are combined into one single uncertainty estimate.

  • 00:45:00 In this section, the speaker discusses computation-aware Gaussian processes, which combine the error resulting from not having enough observed data with the error from not having performed enough computations to learn the prediction. The speaker demonstrates two examples of this process in action, with partial Cholesky actions and with CG actions. The proposed computation-aware GP method computes the posterior by combining the belief over the representer weights with its initialization, and it tracks the remaining uncertainty alongside the prediction. The method is straightforward and effective, as seen in the plotted graphs, where the computational uncertainty shrinks and the approximation moves closer to the true posterior mean.

  • 00:50:00 In this section, the speaker discusses computation-aware Gaussian processes and how the belief can be used without needing to invert the kernel matrix. They choose an action in a specific direction and observe how close they are to the true representer weights in the chosen subspace, which affects how quickly they converge to the representer weights. To update the estimate of the representer weights, they observe the projected residual and compute the direction to walk along. They also compute a low-rank approximation and update their estimate of the representer weights and of the precision matrix. They recover partial Cholesky and CG from the same quantities, with unit vector actions recovering the former, and design a method similar to the linear-time method that weighs data points according to the kernel function centered at an inducing point.

  • 00:55:00 In this section, Jonathan Wenger discusses computation-aware Gaussian Processes (GP) and compares them with the fully independent training conditional GP (FITC-GP). He introduces kernel vector actions, which solve some of the problems with FITC-GP but are dense, resulting in a complexity of N squared, so they are not cost-effective. Wenger shows that by taking specific actions that target only part of the data points, the cost of computing the kernel matrix can be reduced. In the end, the computation-aware GP performs better, and such actions prove to be a useful approach for scalable computation with high accuracy.

  • 01:00:00 In this section, Jonathan Wenger discusses the computation-aware Gaussian process and its complexity. He shows that it is only necessary to compute and store the upper quadrant of the kernel matrix, and as a result, the computational cost of the algorithm is only proportional to the size of this quadrant. Additionally, he highlights that the algorithm can be used on datasets of arbitrary size, as long as actions that have zeros in the lower quadrant are chosen to target only certain data points with computation. Wenger argues that this blurs the distinction between data and computation because only observations targeted for computation are considered data. Finally, he notes that the Gaussian process can be modeled to account for this situation by conditioning on projected data.

  • 01:05:00 In this section, Jonathan Wenger explains that Gaussian Processes (GPs) can be thought of in two ways: as a more accurate model of what is happening or as a probabilistic numeric tool that quantifies the error introduced through approximation and takes it into account in predictions. He then goes on to discuss the interpretation of squared errors as probabilistic measures and how combined posterior can be used as a prediction tool. Wenger also introduces a new theorem that allows for exact uncertainty quantification with an approximate model, allowing users to trust their uncertainty quantification in the same way they trust Gaussian processes.

  • 01:10:00 In this section, Jonathan Wenger explains that Gaussian Processes (GPs) can be approximated by devising a learning algorithm, which can probabilistically quantify the error of the algorithm and push the error onto the GP posterior used to make predictions, allowing for exact uncertainty quantification regardless of the computational power used. Wenger also notes that while different variants of the method exist, they provide exact uncertainty quantification as long as the actions are linearly independent. Finally, Wenger previews next week's lecture, in which Jonathan will discuss extending the GP model to cases where a physical law partially governs the function being learned.
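
The iteration summarized in the bullets above (observe a projected residual, compute a search direction, update the weight estimate and a low-rank approximation of the precision matrix) can be sketched as follows. This is a minimal sketch under stated assumptions, not the lecture's implementation: the names `rbf_kernel` and `computation_aware_gp`, the toy data, and the choice of unit vector actions are hypothetical. Here `v` estimates the solution of `K_hat v = y`, and `C` approximates the inverse of `K_hat`.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel on 1-D inputs (an illustrative choice)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def computation_aware_gp(K_hat, y, actions):
    """Run one update per action; assumes linearly independent actions."""
    n = len(y)
    v = np.zeros(n)           # current estimate of the weights K_hat^{-1} y
    C = np.zeros((n, n))      # current low-rank approximation of K_hat^{-1}
    for s in actions:
        residual = y - K_hat @ v
        alpha = s @ residual              # projected residual (the observation)
        z = K_hat @ s
        d = s - C @ z                     # search direction
        eta = z @ d                       # normalization constant
        v = v + d * (alpha / eta)
        C = C + np.outer(d, d) / eta
    return v, C

# Toy regression problem (hypothetical data).
X = np.linspace(0.0, 3.0, 8)
y = np.sin(2.0 * X)
K_hat = rbf_kernel(X, X) + 0.1 * np.eye(8)    # kernel matrix plus noise
x_test = np.array([0.5, 1.7])
k_star = rbf_kernel(x_test, X)

# Only three unit vector actions: limited computation.
v, C = computation_aware_gp(K_hat, y, np.eye(8)[:3])
mean = k_star @ v
# Combined variance: prior variance (1 for this kernel) minus what C explains.
combined_var = 1.0 - np.sum((k_star @ C) * k_star, axis=1)
```

With fewer actions than data points, `combined_var` is never smaller than the exact GP posterior variance; the gap is the computational uncertainty. Running all eight unit vector actions recovers the exact weights.
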
Numerics of ML 4 -- Computation-Aware Gaussian Processes -- Jonathan Wenger
  • 2023.01.17
  • www.youtube.com
The fourth lecture of the Master class on Numerics of Machine Learning at the University of Tübingen in the Winter Term of 2022/23. This class discusses both...
 

Lecture 5 -- State-Space Models -- Jonathan Schmidt



Numerics of ML 5 -- State-Space Models -- Jonathan Schmidt

In this section, Jonathan Schmidt introduces state-space models and their application to machine learning. He explains that state-space models are used to model complex dynamical systems, which are only partially observable and involve highly non-linear interactions. The lecture covers the graphical representation of state-space models and their two key properties: the Markov property and conditionally independent measurements. Schmidt presents different algorithms for computing the prediction, filtering, and smoothing distributions, which are used to estimate the state of a system from measurements obtained at different points in time. The lecture also covers the implementation of Kalman filter algorithms in Julia and the computation of smoothing estimates in linear Gaussian state-space models. Finally, Schmidt discusses the extended Kalman filter, which allows for state estimation with nonlinear dynamics and measurements in state-space models.
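
The Kalman filter mentioned above alternates a prediction and a correction step. The lecture implements it in Julia; the following is a minimal Python sketch under assumptions, with a hypothetical function name `kalman_filter`, a hypothetical 1-D constant-velocity model, and the convention that each measurement is preceded by one prediction step.

```python
import numpy as np

def kalman_filter(ys, A, Q, H, R, m0, P0):
    """Return filtering means and covariances for a linear Gaussian model."""
    m, P = m0, P0
    means, covs = [], []
    for y in ys:
        # Prediction step: push the belief through the dynamics.
        m = A @ m
        P = A @ P @ A.T + Q
        # Correction step: condition on the new measurement.
        S = H @ P @ H.T + R                   # innovation covariance
        K = np.linalg.solve(S, H @ P).T       # Kalman gain P H^T S^{-1}
        m = m + K @ (y - H @ m)
        P = P - K @ S @ K.T
        means.append(m)
        covs.append(P)
    return np.array(means), np.array(covs)

# Hypothetical constant-velocity model: state = (position, velocity).
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
Q = np.array([[dt**3 / 3, dt**2 / 2], [dt**2 / 2, dt]])  # polynomial in dt
H = np.array([[1.0, 0.0]])                                # observe position only
R = np.array([[0.25]])
rng = np.random.default_rng(0)
ys = (np.sin(dt * np.arange(50)) + 0.5 * rng.normal(size=50)).reshape(-1, 1)
means, covs = kalman_filter(ys, A, Q, H, R, np.zeros(2), np.eye(2))
```

The loop only runs forward in time, so each estimate uses past and present measurements but no future ones.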

Jonathan Schmidt also discusses state-space models and their implementation using code, specifically focusing on non-linear dynamics and the extended Kalman filter. He also demonstrates smoothing algorithms and alternative Bayesian filtering methods, highlighting their pros and cons. The lecture concludes with a recommendation for further learning and anticipation for the next lecture, where Nathanael will introduce probabilistic numerics for simulating dynamical systems.
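
One step of an extended Kalman filter can be sketched as follows. This is an illustrative sketch, not the lecture's Julia code: the Jacobians are approximated here with central finite differences as a stand-in for automatic differentiation, and the pendulum discretization, the names `pendulum_dynamics` and `measure_angle`, and all numerical values are assumptions.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian of f at x."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        J[:, j] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return J

def ekf_step(m, P, y, f, h, Q, R):
    """Predict with linearized dynamics f, then correct with linearized measurement h."""
    F = numerical_jacobian(f, m)          # linearize at the previous filtering mean
    m_pred = f(m)
    P_pred = F @ P @ F.T + Q
    H = numerical_jacobian(h, m_pred)     # linearize at the predicted mean
    S = H @ P_pred @ H.T + R
    K = np.linalg.solve(S, H @ P_pred).T
    m_new = m_pred + K @ (y - h(m_pred))
    P_new = P_pred - K @ S @ K.T
    return m_new, P_new

def pendulum_dynamics(x, dt=0.01, g=9.81):
    """Euler-discretized pendulum: state = (angle, angular velocity)."""
    return np.array([x[0] + dt * x[1], x[1] - g * dt * np.sin(x[0])])

def measure_angle(x):
    """Scalar measurement of the angle (an assumed measurement model)."""
    return np.array([x[0]])

m, P = np.array([0.3, 0.0]), 0.1 * np.eye(2)
m, P = ekf_step(m, P, np.array([0.31]), pendulum_dynamics, measure_angle,
                1e-4 * np.eye(2), np.array([[0.01]]))
```

For linear `f` and `h`, the step reduces to the standard Kalman filter update.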

  • 00:00:00 In this section, Jonathan Schmidt introduces state-space models and dynamical systems as a new focus for the numerics of machine learning lecture course. He explains that dynamical systems evolve over time and can only be partially observed, making them challenging to model. Schmidt provides examples such as COVID-19 case counts and smartphone orientation estimation to illustrate the temporal structure and hidden components of dynamical systems. The ultimate goal is to use probabilistic methods to simulate these systems, but first, a language and algorithmic framework must be established to discover latent components from observable data.

  • 00:05:00 In this section, the speaker discusses state-space models, which involve an online estimation task where the goal is to quickly update the estimate of a complex dynamical system as new data comes in. These models are often only partially observable and involve highly non-linear functions and interactions. To achieve this, an algorithmic framework is needed to update the belief accordingly. The speaker discusses the graphical representation of the modeling language used in state-space models, where the sequence of white nodes represents random variables modeling the system state, and the red box represents the observed data. The state of a dynamic system is a set of physical quantities that determine the evolution of the system, which are tracked and interact with each other. The observed data, y, depends on the present state and is often only available for some states in the trajectory, but not others.

  • 00:10:00 In this section, Jonathan Schmidt introduces state-space models as a probabilistic framework for modeling dynamical systems. He emphasizes two important properties of state-space models: the Markov property and conditionally independent measurements. Using these properties, he defines a state-space model as a Bayesian model that includes an initial distribution for the first state, a dynamics model for subsequent states, and a measurement model for observations. Schmidt notes that these distilled components will form the basis for the rest of the lecture series.

  • 00:15:00 In this section, the speaker explains how to analyze systems using state-space models and compute four different conditional probability distributions. These are the prediction distribution, the filtering distribution, the data likelihood, and the smoothing distribution, which are computed for every step in an ongoing sequence. The derivation involves introducing the quantity being calculated and building a joint distribution based on what is already known. The Chapman-Kolmogorov equation is used to predict into the future given past measurements, and the correction step using Bayes' theorem is used to integrate new data into the estimate.

  • 00:20:00 In this section, the speaker explains the concept of a state-space model and the prediction and update scheme used in it. The predicted distribution is computed with the Chapman-Kolmogorov equation, and the prediction is then updated with Bayes' theorem. The speaker then presents pseudocode for the algorithm, which is a single loop that runs in linear time without going backward. The speaker emphasizes the importance of producing a sequence of distributions for the current state given all previous measurements. Lastly, the speaker introduces the linear Gaussian state-space model and the distributions it produces.

  • 00:25:00 In this section, the speaker introduces state-space models for a linear Gaussian system with a process noise covariance matrix Q and a measurement model with a measurement matrix H and a measurement covariance matrix R. The lecture explains how the prediction and filtering moments of the model can be computed using Gaussian inference, with the posterior distribution being a complicated collection of terms. The speaker then introduces the Kalman filter, named after the Hungarian-born scientist Rudolf Kálmán, which allows for the computation of prediction and filtering moments in closed form. The prediction and correction equations of the Kalman filter are presented, with the Kalman gain being an important quantity that translates information gained in the measurement space to the state space for updating the filtering mean.

  • 00:30:00 In this section of the video, Jonathan Schmidt introduces state-space models and explains how to use them for filtering trajectories based on noisy measurements. He provides an example of tracking a car in a 2D plane using GPS measurements and writes the code in Julia. Schmidt explains that the dynamics model is a linear Gaussian model, and the process noise covariance involves polynomial terms of the time step. He also emphasizes that the filtering trajectory only uses previous and present data points and is not informed by the future.

  • 00:35:00 In this section, the speaker explains the implementation of the Kalman filter for state-space models using Julia code. He explains how to set up the transition and measurement models, predict the mean and covariance, and correct the estimate using the measurement model. The speaker then demonstrates how to run the Kalman filter and provides a visualization of the resulting estimate and the corresponding uncertainty.

  • 00:40:00 In this section, Jonathan Schmidt explains how state-space models are used to describe dynamical systems and how they can be built using linear Gaussian models that allow for the computation of interesting quantities using linear algebra. He also introduces the concept of smoothing posteriors, which provide the best estimate of a trajectory given all available data points, and relies on filtering distributions to compute them in a backward recursive algorithm. While the derivation of smoothing equations involves probability theory and the Markov property, the resulting collection of Gaussian random variables makes it easy to compute the smoothing distribution at each time step.

  • 00:45:00 In this section, the speaker explains the process of computing smoothing estimates in linear Gaussian state-space models. This involves matrix-vector products and marginalizing over the next time step to compute the smoothing posterior from the filtering posterior. The smoothing algorithm is implemented as a for loop over a fixed range, since it only works if there is a complete data set or a fixed window of time steps to consider. The process starts from the end of the time series and goes backwards to the beginning, computing the smoothing gain at each step and using it to compute the smoothed moments. The speaker also mentions that the filtering estimate coincides with the smoothing estimate at the end of the time series. The smoothing algorithm ultimately provides a Gaussian process posterior as the smoothing posterior.

  • 00:50:00 In this section, the speaker explains how to compute Gaussian process posteriors in linear time by making assumptions that include linear transitions, linear measurements, additive Gaussian noise for both dynamics and measurements, and the Markov property. However, not all Gaussian process posteriors can be computed using Gaussian filtering and smoothing. The speaker also discusses the possibility of dropping the Gaussian assumption, but this would require an entirely new class of algorithms. The next step involves looking at non-linear models, using a first-order Taylor approximation to linearize the functions and then applying Kalman filtering.

  • 00:55:00 In this section, Jonathan Schmidt discusses state-space models and the extended Kalman filter, which is an extension of the Kalman filter for nonlinear dynamics and measurements. The linearization of nonlinear dynamics and measurement models is achieved through the use of Jacobian matrices, allowing for the use of the standard Kalman filter equations with some modifications. The predicted mean is obtained by evaluating the dynamics at the previous filtering mean, and the Jacobian at that point allows for easy computation of the predicted covariance matrix. The measurement model is similarly linearized, and the extended Kalman filter equations are derived. Schmidt notes that the extended Kalman filter relies on differentiating the nonlinear functions, which raises the question of what to do when that is not possible or desirable.

  • 01:00:00 In this section, Jonathan Schmidt discusses what happens if we cannot differentiate our function and how to work around it. One possible solution is to use a finite-difference scheme, where the Jacobian is replaced by a standard finite-difference approximation and the rest of the algorithm stays the same. Schmidt also builds the extended Rauch-Tung-Striebel (RTS) smoother by taking the smoothing equations and inserting, in place of the transposed transition matrix, the Jacobian matrix of the nonlinear function evaluated at the filtering mean. Schmidt provides a code example using a non-linear state-space model of a pendulum, where the state dimension is 2 and the measurements are scalar. He sets up the dynamics model using a non-linear transformation and discusses the process noise covariance.

  • 01:05:00 In this section, Jonathan Schmidt discusses state-space models and how to implement them using code. He explains the non-linear dynamics of the system and the simple linear measurement model used for measurements. He also demonstrates how to implement an extended Kalman filter to estimate the trajectory of a pendulum. The filter uses automatic differentiation to compute the Jacobian matrix for the non-linear dynamics function and the gradient for the measurement function. The resulting animation shows the predicted trajectory and the noisy measurements.

  • 01:10:00 In this section, Jonathan Schmidt discusses the filtering estimate and extended smoothing in state-space models. The filtering estimate shows the uncertainty estimate in the shaded area, while the smoothing algorithm tidies up the filtering estimate using automatic differentiation, computing the smoothing gain, smooth mean, and smooth covariance. The smoother returns a Gaussian process posterior marginal, which covers the ground-truth trajectory well in its uncertainty. Schmidt also mentions alternative methods for Bayesian filtering, such as the unscented Kalman filter for approximating distributions, and the particle filter, which approximates the actual true posterior. While these methods have their pros and cons and may be harder to implement, they can be effective for non-linear or non-Gaussian models. Schmidt recommends the book "Bayesian Filtering and Smoothing" by Simo Särkkä for those interested in learning about these methods.

  • 01:15:00 In this section, the speaker summarizes what was learned about state-space models, the linear Gaussian model, and the Kalman and extended Kalman filters, the latter of which handles non-linear dynamics and measurements. In the next lecture, Nathanael will introduce a powerful language for capturing laws of nature, which will then be combined with this lecture's material to simulate these dynamical systems using probabilistic numerics through Bayesian filtering and smoothing. The speaker concludes by asking for feedback and thanking listeners for their time.
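
The backward smoothing recursion described in the bullets above can be sketched for the linear Gaussian case as follows. This is a minimal sketch under assumptions: the function name `filter_and_smooth`, the scalar random-walk model, and the data are hypothetical, and the forward pass is a plain Kalman filter that also stores its predictions.

```python
import numpy as np

def filter_and_smooth(ys, A, Q, H, R, m0, P0):
    """Kalman filter forward pass followed by a backward smoothing pass."""
    m, P = m0, P0
    mf, Pf, mp, Pp = [], [], [], []
    for y in ys:
        m_pred = A @ m
        P_pred = A @ P @ A.T + Q
        S = H @ P_pred @ H.T + R
        K = np.linalg.solve(S, H @ P_pred).T
        m = m_pred + K @ (y - H @ m_pred)
        P = P_pred - K @ S @ K.T
        mp.append(m_pred); Pp.append(P_pred)
        mf.append(m); Pf.append(P)

    T = len(ys)
    ms, Ps = [None] * T, [None] * T
    ms[-1], Ps[-1] = mf[-1], Pf[-1]      # smoothing equals filtering at the end
    for t in range(T - 2, -1, -1):       # walk backwards through time
        G = np.linalg.solve(Pp[t + 1], A @ Pf[t]).T   # smoothing gain
        ms[t] = mf[t] + G @ (ms[t + 1] - mp[t + 1])
        Ps[t] = Pf[t] + G @ (Ps[t + 1] - Pp[t + 1]) @ G.T
    return np.array(mf), np.array(Pf), np.array(ms), np.array(Ps)

# Hypothetical scalar random walk observed with noise.
one = np.eye(1)
ys = np.sin(0.3 * np.arange(30)).reshape(-1, 1)
mf, Pf, ms, Ps = filter_and_smooth(ys, one, 0.1 * one, one, one, np.zeros(1), one)
```

Because every smoothed estimate also uses later measurements, the smoothed variances are never larger than the filtered ones.
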
Numerics of ML 5 -- State-Space Models -- Jonathan Schmidt
  • 2023.01.24
  • www.youtube.com
The fifth lecture of the Master class on Numerics of Machine Learning at the University of Tübingen in the Winter Term of 2022/23. This class discusses both ...
 

Lecture 6 -- Solving Ordinary Differential Equations -- Nathanael Bosch



Numerics of ML 6 -- Solving Ordinary Differential Equations -- Nathanael Bosch

Nathanael Bosch covers the concept of ODEs in machine learning, which describe the derivative of a function given its input and model systems that evolve over time. He discusses the challenges of solving ODEs and introduces numerical methods, such as forward Euler and backward Euler, and their stability properties. Bosch explores different numerical methods and their trade-offs in accuracy and complexity, such as explicit midpoint and classic fourth-order methods. He emphasizes the importance of local error, order, and understanding stability to avoid issues in using libraries to solve ODEs.
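
The methods named above can be compared on the logistic ODE x' = x(1 - x), whose exact solution is known. The sketch below is illustrative only: the function names, step sizes, and the linear test problem x' = λx with λ = -10 are assumptions, and backward Euler is written only for the linear test problem, where the implicit equation can be solved by hand.

```python
import numpy as np

def logistic(x, t):
    return x * (1.0 - x)

def logistic_exact(t, x0):
    return x0 * np.exp(t) / (1.0 - x0 + x0 * np.exp(t))

def forward_euler_step(f, x, t, h):
    return x + h * f(x, t)

def explicit_midpoint_step(f, x, t, h):
    x_mid = x + 0.5 * h * f(x, t)            # half an Euler step
    return x + h * f(x_mid, t + 0.5 * h)     # full step with the midpoint slope

def rk4_step(f, x, t, h):
    k1 = f(x, t)
    k2 = f(x + 0.5 * h * k1, t + 0.5 * h)
    k3 = f(x + 0.5 * h * k2, t + 0.5 * h)
    k4 = f(x + h * k3, t + h)
    return x + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def solve(step, f, x0, t_end, n_steps):
    """Apply a one-step method n_steps times with a fixed step size."""
    h = t_end / n_steps
    x, t = x0, 0.0
    for _ in range(n_steps):
        x = step(f, x, t, h)
        t += h
    return x

def backward_euler_linear(lam, x0, h, n_steps):
    """Backward Euler for x' = lam * x: x_new = x + h*lam*x_new, solved for x_new."""
    x = x0
    for _ in range(n_steps):
        x = x / (1.0 - h * lam)
    return x

# Accuracy: same step size, increasing order.
x_true = logistic_exact(2.0, 0.1)
errors = {
    step.__name__: abs(solve(step, logistic, 0.1, 2.0, 20) - x_true)
    for step in (forward_euler_step, explicit_midpoint_step, rk4_step)
}

# Stability: with h = 0.25 and lam = -10, |1 + h*lam| = 1.5 > 1.
lam, h, n = -10.0, 0.25, 20
x_fe = solve(forward_euler_step, lambda x, t: lam * x, 1.0, h * n, n)
x_be = backward_euler_linear(lam, 1.0, h, n)
```

With this step size, forward Euler grows in magnitude on the decaying test problem, while backward Euler decays like the true solution.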

This second part of the video discusses the problem of estimating the vector field and initial value of an ordinary differential equation (ODE) using machine learning techniques. The speaker explains the importance of writing down the generative model and observation model for the states of the ODE to solve the inference problem. The likelihood function is maximized by minimizing the negative log likelihood, which yields a parameter estimate. The speaker demonstrates this approach using an SIR-D model and discusses using neural networks to improve the estimation of the contact rate. The importance of ODEs in machine learning research and their role in solving real-world problems is also highlighted.
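
The maximum-likelihood approach described above can be sketched on a much smaller problem than the SIR-D model. In this hedged example, a logistic ODE x' = r x (1 - x) is swapped in for the epidemiological model, the initial value is assumed known, the noise is assumed Gaussian with known standard deviation, and a grid search stands in for a gradient-based optimizer. All names and values are hypothetical.

```python
import numpy as np

def simulate(r, x0, ts, substeps=10):
    """Numerical (RK4) solution of x' = r x (1 - x) at the times ts, starting at ts[0]."""
    f = lambda x: r * x * (1.0 - x)
    x, xs = x0, [x0]
    for t_prev, t_next in zip(ts[:-1], ts[1:]):
        h = (t_next - t_prev) / substeps
        for _ in range(substeps):
            k1 = f(x)
            k2 = f(x + 0.5 * h * k1)
            k3 = f(x + 0.5 * h * k2)
            k4 = f(x + h * k3)
            x = x + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        xs.append(x)
    return np.array(xs)

def negative_log_likelihood(r, x0, ts, ys, sigma):
    """Gaussian negative log likelihood, up to a constant, with the solver
    output plugged in for the unknown true solution."""
    x_hat = simulate(r, x0, ts)
    return 0.5 * np.sum((ys - x_hat) ** 2) / sigma**2

# Synthetic data from a known growth rate (hypothetical values).
rng = np.random.default_rng(0)
ts = np.linspace(0.0, 4.0, 21)
sigma, r_true, x0 = 0.02, 1.5, 0.05
ys = simulate(r_true, x0, ts) + sigma * rng.normal(size=ts.size)

# Minimize the loss over a grid of candidate growth rates.
grid = np.linspace(0.5, 3.0, 251)
losses = [negative_log_likelihood(r, x0, ts, ys, sigma) for r in grid]
r_hat = grid[int(np.argmin(losses))]
```

The estimate `r_hat` is only as good as the numerical solver and the noise model; a neural network or additional unknowns such as the initial value would enter as further parameters of `simulate`.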

  • 00:00:00 In this section of the lecture, Nathanael Bosch introduces the concept of ordinary differential equations (ODEs) and how they are used in machine learning. He defines an ODE as a way to describe the derivative of a function given its input, and explains that often in machine learning, ODEs are used to model systems that evolve over time. He provides examples of where ODEs appear in machine learning, including in diffusion models and in optimization problems. Bosch also discusses the challenges of solving ODEs, which require complex numerical solvers due to the impracticality of solving them perfectly.

  • 00:05:00 In this section, the speaker discusses how ODEs are used to transform noise into data for modeling complex distributions, which is done through normalizing flows. He also explains the concept of neural ODEs, which sparked a lot of research and reinterprets residual neural networks as discretizations of a more continuous thing. Additionally, the speaker relates ODEs to optimization, specifically, gradient flow, which is easier to write a theorem about than discrete gradient descent. Lastly, the speaker discusses how parameter inference is an example of using ODEs to learn something unknown, and in the next lecture, will interpret numerical ODE solutions as machine learning algorithms. The speaker concludes that while we can write down a solution for an ODE, it is not helpful due to the integration problem and unknown variables.

  • 00:10:00 In this section, the narrator introduces ordinary differential equations (ODEs) and initial value problems, which are crucial in understanding many algorithms in machine learning. ODEs represent the rate of change of a system over time, and the initial value is required to solve the problem. The solution to an ODE is given by a function that depends on the initial value, and numerical solutions to ODEs require extrapolating step by step. The narrator presents a logistic ODE problem for population growth, and the solution is given. The narrator emphasizes the goal of solving an initial value problem is to find the solution for a specific starting point given the vector field of the ODEs. The difficulty in solving ODEs is both solving the integral and handling the differential term. The narrator suggests small steps sizes for numerical solutions of ODEs to approximate the true solution accurately.

  • 00:15:00 In this section, Nathanael Bosch explains different numerical methods for solving ordinary differential equations. The first method he presents is the zeroth order Taylor series approximation, where only the function value at the current time step is considered for the approximation. This leads to the Forward Euler method, which is a simple, explicit formula for computing the next point in time. Bosch notes that while this method is a bad approximation, it is still widely used in software and dynamical simulations.

  • 00:20:00 In this section, the video discusses two methods for solving ordinary differential equations (ODEs): the forward Euler method and the backward Euler method. The forward Euler method uses the slope at the current point to approximate the value at the next point, while the backward Euler method uses a Taylor series approximation around Tau equals t plus h. The video provides code examples for both methods using the logistic ODE, which produce reasonable solutions. However, the video cautions that more complex differential equations may require additional consideration when choosing a numerical solver. Additionally, the video touches on the complexity of numerical methods and the importance of being aware of the underlying algorithms when using numerical packages.

  • 00:25:00 In this section, the speaker discusses the difference between explicit and implicit methods in solving ordinary differential equations (ODEs) and the importance of stability in choosing the appropriate algorithm. The speaker compares the forward Euler and backward Euler methods for a simple scalar ODE, x' = λx, where λ is less than zero. The forward Euler method is only stable for step sizes h where |1 + hλ| is less than one, while the backward Euler method is stable for all step sizes. The speaker demonstrates that choosing an inappropriate step size can lead to divergent behavior, emphasizing the importance of stability in selecting an appropriate method for solving ODEs.

  • 00:30:00 In this section, Nathanael Bosch discusses the differences between forward Euler and backward Euler methods for solving ordinary differential equations (ODEs). While both methods use similar math, backward Euler places far weaker requirements on the step size and can handle stiff regions of the ODE that forward Euler cannot. Numerical quadrature is necessary, and there are many ways to do it. Additionally, constructing X hat, the approximation of the function at a given time, is another problem for which different methods yield different answers. Overall, the choice of method depends on factors such as computation time and the expected stiffness of the ODE.

  • 00:35:00 In this section, Nathanael Bosch explains the general formulation of numerical methods for solving ordinary differential equations (ODEs), which involves three ingredients: the weights b_i, the nodes c_i, and the X hats. He also introduces Butcher tableaus as a way to make talking about the different methods more compact and readable, and points out that the different ways to compute the b_i and c_i, as well as how to construct the X hats, are what makes each method unique. Bosch gives examples of different numerical methods, including the simplest one, forward Euler, which satisfies the general equation and has a Butcher tableau that contains zeros but is still a sufficiently useful method. He also introduces backward Euler as an implicit method that lacks a zero and is computed slightly differently than forward Euler.

  • 00:40:00 In this section, the video explores the different strategies that can be used to solve Ordinary Differential Equations (ODEs). One suggestion from a listener was to split the integral into different terms and take steps in between each term, but the presenter explains that this would result in a different algorithm with different properties. The video goes on to demonstrate the explicit midpoint rule, which is close to doing two Euler steps, but not quite the same. The presenter explains that the midpoint rule takes the half step that forward Euler would take and uses the slope at that point to get a better extrapolation. Additionally, the video explores the classic fourth-order method, which is called this because it was the original method developed by Runge and Kutta. Finally, the video notes that while there is some freedom in choosing the coefficients for solving ODEs, there are already hundreds of known methods on Wikipedia.

  • 00:45:00 In this section, the speaker looks at the Dormand-Prince method, whose Butcher tableau has two lines at the end because it gives two solutions at each step. This method is complicated because it satisfies multiple properties, and it becomes more intricate as the tableau gets bigger. The goal should not be to understand every detail of how the method works, but rather to focus on the properties the coefficients need to satisfy. The method was motivated by quadrature rules, and while there may not be a direct mapping from quadrature rules to ODE solvers, the solvers are still strongly motivated by them.

  • 00:50:00 In this section, the video discusses how solving differential equations can be complicated due to the methods that aim for efficiency by providing two methods at once with different degrees of accuracy. One is more accurate than the other, and using the more accurate one can help estimate the error of the less accurate one, which can be helpful in adjusting the step size when solving the ODE while satisfying some local error. The video also mentions that there are different types of methods with different properties, and stability is also a factor to consider when choosing a method to solve a problem. Lastly, the video briefly touches on the importance of order in solving differential equations.

  • 00:55:00 In this section, Nathanael Bosch discusses the different methods for solving ordinary differential equations (ODEs) and the trade-off between accuracy and complexity. He highlights the importance of the local error, which measures the error made in a single step of the estimation, and how it can be reduced by making the step size smaller. Different methods such as the forward Euler and explicit midpoint methods are then discussed, each with its own order and error convergence rate. Bosch also touches on the bells and whistles that come with using libraries to solve ODEs, such as step size selection and automatic solver selection, but cautions that it is still important to understand stability and order to avoid potential issues when things break.
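
The relation between local error, step size, and order can be summarized as follows. This is the standard textbook statement; the symbols (step size h, order p, constants C and C') are notation chosen here, not taken from the slides.

```latex
% Local error: error of one step started from the exact solution at time t
\| x(t + h) - \hat{x}(t + h) \| \le C \, h^{p+1}
% Global error: error accumulated over a fixed interval [t_0, T]
\max_{n} \| x(t_n) - \hat{x}(t_n) \| \le C' \, h^{p}
% Forward Euler: p = 1.  Explicit midpoint: p = 2.  Classic Runge-Kutta: p = 4.
```

Halving the step size therefore roughly halves the global error of forward Euler but quarters that of the explicit midpoint rule.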

  • 01:00:00 In this section of the video, the speaker discusses the problem of estimating the vector field and initial value of an ordinary differential equation (ODE) from data using machine learning techniques. He gives an example of an epidemiological model where the goal is to estimate the parameters beta, gamma, and lambda that fit the ODE to the observed data. The speaker explains that writing down the generative model and observation model for the states of the ODE is essential for solving the inference problem. He notes that estimating the parameters allows for a better understanding of the process that generated the data and cross-checking the inferred parameters against the literature can provide additional insight.

  • 01:05:00 In this section, the speaker discusses the problem of parameter inference and how to compute the maximum likelihood estimate for solving ordinary differential equations (ODEs). The likelihood function is a product of Gaussians that cannot be evaluated due to the assumption that the true X cannot be obtained, hence an approximation is required. By assuming the solver is good enough, the speaker demonstrates that plugging in an estimated solution for the true solution produces an evaluatable term. The likelihood function is then maximized by minimizing the negative log likelihood and the resulting loss function yields a parameter estimate. The speaker concludes with an example using an SIR-D model where the number of infected individuals in the beginning is unknown and needs to be estimated.
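
Written out, the argument above might look as follows. The notation is an assumption made for illustration: observations y_i at times t_i, a linear measurement matrix H, Gaussian noise with variance sigma^2, the true ODE solution x_theta, and the numerical solution x-hat_theta.

```latex
% Likelihood with the (unavailable) true solution x_\theta
p(y_{1:N} \mid \theta) = \prod_{i=1}^{N} \mathcal{N}\!\left(y_i ;\, H x_\theta(t_i),\, \sigma^2 I\right)
% Plug in the numerical solution \hat{x}_\theta, assuming the solver is accurate enough
\approx \prod_{i=1}^{N} \mathcal{N}\!\left(y_i ;\, H \hat{x}_\theta(t_i),\, \sigma^2 I\right)
% Negative log-likelihood, i.e. the loss that is minimized over \theta
-\log p(y_{1:N} \mid \theta) \approx \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left\| y_i - H \hat{x}_\theta(t_i) \right\|^2 + \text{const}
```

With fixed noise variance, maximizing the likelihood is thus the same as minimizing a sum of squared residuals between the data and the simulated solution.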

  • 01:10:00 In this section, the speaker discusses how to perform parameter inference on a model of ordinary differential equations (ODEs). Data are generated by simulating the ODE model and taking noisy samples of its solution. Two parameters enter the loss function, which is computed by comparing the simulated trajectories (the lines in the plot) with the actual data (the scatter points). Starting from an initial guess, an L-BFGS optimizer iterates over the parameters to minimize this loss. The resulting estimates can be used to interpret the model and its parameters, which can be compared with the literature. The model is then improved by making the contact rate time-varying, which makes it slightly more complex, and the entire process of parameter inference is done again.
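
A minimal sketch of this workflow is shown below. It departs from the lecture in several labeled ways: it uses a plain SIR model rather than SIR-D, fixes the recovery rate, integrates with a hand-written RK4 solver, and replaces the L-BFGS optimizer with a coarse grid search so that the example needs only the standard library. All numerical values are made up for illustration.

```python
import random

GAMMA = 0.1          # recovery rate, held fixed here (assumption)
POPULATION = 1000.0  # total population (assumption)

def sir_rhs(state, beta):
    s, i, r = state
    new_infections = beta * s * i / POPULATION
    return (-new_infections, new_infections - GAMMA * i, GAMMA * i)

def rk4_step(state, h, beta):
    def shifted(k, c):
        return tuple(x + c * dx for x, dx in zip(state, k))
    k1 = sir_rhs(state, beta)
    k2 = sir_rhs(shifted(k1, 0.5 * h), beta)
    k3 = sir_rhs(shifted(k2, 0.5 * h), beta)
    k4 = sir_rhs(shifted(k3, h), beta)
    return tuple(x + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for x, a, b, c, d in zip(state, k1, k2, k3, k4))

def simulate_infected(beta, i0, days=30, substeps=4):
    # Returns the number of infected individuals on days 0, 1, ..., days.
    state = (POPULATION - i0, i0, 0.0)
    infected = [state[1]]
    for _ in range(days):
        for _ in range(substeps):
            state = rk4_step(state, 1.0 / substeps, beta)
        infected.append(state[1])
    return infected

def loss(beta, i0, data):
    # Sum of squared residuals = negative log-likelihood up to constants.
    return sum((y - x) ** 2 for y, x in zip(data, simulate_infected(beta, i0)))

# Synthetic noisy data from "true" parameters beta = 0.3, i0 = 5.
rng = random.Random(0)
data = [x + rng.gauss(0.0, 1.0) for x in simulate_infected(0.3, 5.0)]

# Coarse grid search over the contact rate and the unknown initial infections.
candidates = [(0.20 + 0.01 * k, float(i0)) for k in range(21) for i0 in range(1, 11)]
beta_hat, i0_hat = min(candidates, key=lambda p: loss(p[0], p[1], data))
```

In practice a gradient-based optimizer such as L-BFGS would replace the grid search; the loss function itself stays the same.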

  • 01:15:00 In this section, Nathanael Bosch discusses the challenges of estimating beta of t, which describes a time-varying estimate of a contact rate in ODEs, and emphasizes the need for better tools to solve the estimation problem. To address this, he proposes using a neural network to model beta of t and minimize an L2 loss function in parameter inference. While the neural network approach is less interpretable and does not provide good uncertainty estimates, it does provide a point estimate for the contact rate. Additionally, the results suggest that the neural network approach still needs significant improvement to match the fit of the GP model, and uncertainties in the results should be taken into account.

  • 01:20:00 In this section, the speaker discusses the approach of using neural networks to solve ODEs and mentions that although uncertainty quantification is not readily available using this method, it is still a valid conceptual approach. Maximum likelihood estimates are discussed and the potential to add priors and sampling in order to provide uncertainty quantification is mentioned. The speaker also discusses the upcoming topic of probabilistic numerical ODE solvers and highlights the importance of ODEs in machine learning research and its role in solving real-world problems. Neural ODEs are also briefly mentioned as a more general and structure-free approach, but with similarities in loss function and training procedures.
Numerics of ML 6 -- Solving Ordinary Differential Equations -- Nathanael Bosch
  • 2023.01.24
  • www.youtube.com
The sixth lecture of the Master class on Numerics of Machine Learning at the University of Tübingen in the Winter Term of 2022/23. This class discusses both ...
 

Lecture 7 -- Probabilistic Numerical ODE Solvers -- Nathanael Bosch



Numerics of ML 7 -- Probabilistic Numerical ODE Solvers -- Nathanael Bosch

In this video, Nathanael Bosch presents the concept of probabilistic numerical ODE solvers, which combine state estimation and numerical ODE solvers to provide distributions over the states or ODE solutions. Bosch explains how a Q times integrated Wiener process can be used to model the true solution, and how this process allows for quantifying and propagating uncertainties in the system. He then demonstrates how to use extended Kalman filters to solve ODEs, and how step sizes affect the error estimates. The video ends with a discussion on uncertainty calibration and using the extended Kalman filter to estimate parameters in non-linear state space models.

In the second part of the lecture Nathanael Bosch talks about the benefits of using probabilistic methods to solve ODEs, including obtaining meaningful uncertainty estimates and the flexibility of including additional model features such as initial values. He demonstrates this approach with examples such as the harmonic oscillator and differential algebraic equations. Bosch also shows how including additional information and using probabilistic techniques can lead to more meaningful results, using an example of an epidemic model that failed to accurately represent the data using traditional scalar methods. He uses extended Kalman filters and smoothers to solve ODEs through state estimation, treating the estimation as a probabilistic problem, and highlights the importance of being Bayesian in decision-making.

  • 00:00:00 In this section, Nathanael Bosch introduces the concept of probabilistic numerical ODE solvers. He starts by summarizing the previous lectures, including state space models and common filters/smoothers for state estimation, and numerical ODE solvers. He explains that the challenge is to estimate the state of an ODE solution given a differential equation, and that numerical ODE solvers only provide an approximation. Bosch then proposes combining the two concepts by interpreting the solution of ODEs as a state estimation problem and solving it with state estimation methods. The resulting algorithms provide distributions over the states or ODE solutions, creating probabilistic numerical solvers that offer richer output than classic solvers.

  • 00:05:00 In this section, the concept of probabilistic numerical ODE solvers is discussed. Classical numerical solvers estimate the true solution by providing a single estimate x hat: they evaluate the vector field to extend the estimate to a future time point, with an error that depends on the step size. The discussion then moves on to Bayesian state estimation as a tool for solving numerical ODE problems. The filtering distribution, the smoothing posterior, and the predict step that estimates future states given current information are then explained, with algorithms like the extended Kalman filter and extended Kalman smoother mentioned as simple methods for computing these quantities. The section concludes with the idea that numerical ODE solutions can be phrased as an inference problem rather than trying to compute the actual true solution, and that the goal is to find the posterior of x(t) given that it satisfies the initial condition and the ODE on a discrete set of points.

  • 00:10:00 In this section, we dive into the construction of a state space model for probabilistic numerical ODE solvers. The state we consider is the Q times integrated Wiener process. This state is a stochastic process that describes the dynamical system and tracks the derivatives up to Q. By tracking a limited number of derivatives, we can obtain a probabilistic state model that allows us to quantify and propagate the uncertainty in the system. The main goal is to define a prior, a likelihood, and a data model that, once solved, will give us an estimate of the output. This is necessary to do Gaussian filtering and smoothing, which is a fast algorithm for inference.

  • 00:15:00 In this section, Nathanael Bosch explains the q-times integrated Wiener process, the stochastic process used to model the true solution. Its transitions are Gaussian, with a transition matrix A(h) and a covariance matrix Q(h) that both have closed-form formulas. Extracting an entry of the state is a linear operation, which makes it convenient to access the solution and its first and second derivatives. The process is Markovian and is a Gaussian process. Bosch also shows plots of different samples of the process, which illustrate why it is called a two-times integrated Wiener process.
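
As an illustration, the well-known closed-form transition matrices of a q-times integrated Wiener process for a scalar solution can be written in a few lines of Python. The function name, the state ordering [x, x', ..., x^(q)], and the diffusion parameter `sigma2` are assumptions made for this sketch.

```python
from math import factorial

def iwp_transition(q, h, sigma2=1.0):
    """Transition matrix A(h) and process noise covariance Q(h) of a
    q-times integrated Wiener process with state [x, x', ..., x^(q)]."""
    d = q + 1
    # A(h)[i][j] = h^(j-i) / (j-i)!  for j >= i, else 0 (a Taylor expansion).
    A = [[h ** (j - i) / factorial(j - i) if j >= i else 0.0
          for j in range(d)] for i in range(d)]
    # Q(h)[i][j] = sigma2 * h^(2q+1-i-j) / ((2q+1-i-j) (q-i)! (q-j)!)
    Q = [[sigma2 * h ** (2 * q + 1 - i - j)
          / ((2 * q + 1 - i - j) * factorial(q - i) * factorial(q - j))
          for j in range(d)] for i in range(d)]
    return A, Q

# Example: the two-times integrated Wiener process with step size 0.1.
A, Q = iwp_transition(q=2, h=0.1)
```

The transition density is then N(A(h) x, Q(h)), which is exactly the form needed for the predict step of a Kalman filter.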

  • 00:20:00 In this section, the speaker discusses the Q times Integrated Ornstein-Uhlenbeck prior and why it is convenient: its transition densities, which are needed for Gaussian filtering and smoothing later, can be written down explicitly. The likelihood and data part is also important because it is what makes the prior do the desired thing, namely satisfy the ODE. The speaker shows how to use the language of the ODE to define a measurement function, or information operator, that would be zero everywhere in a perfect world with infinite compute. They also introduce an observation model and explain why it helps enforce the desired property during inference. Finally, the noiseless likelihood model is a Dirac likelihood, which is convenient with the Kalman filter updates in mind.
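
In symbols, the measurement function and the noiseless likelihood might be written as follows. The selection matrices E_0 and E_1, which pick the solution and its first derivative out of the state X(t), are notation assumed here for illustration.

```latex
% Information operator: zero exactly when the state satisfies x'(t) = f(x(t), t)
z(t) := E_1 X(t) - f\big(E_0 X(t),\, t\big)
% Observation model on a discrete time grid t_1, ..., t_N: we "observe" z(t_i) = 0
p\big(z_i \mid X(t_i)\big) = \delta\Big(z_i - \big[E_1 X(t_i) - f\big(E_0 X(t_i),\, t_i\big)\big]\Big), \qquad z_i = 0
```

Conditioning the prior on z_i = 0 at every grid point is what turns the ODE into a state estimation problem.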

  • 00:25:00 In this section, Nathanael Bosch discusses the generative model for a Z, which is a concrete example of the logistic ODE, and how it relates to the inference process. The generative model allows for simulation of solutions, computation of derivatives, and generation of a posterior, which collapses around the Z. This generative model, in addition to the likelihood model that encodes the differential equation, enables the state space model to be solved and provides estimates for the X, which relate to the solution. Inference allows for the establishment of a relationship between the prior and the desired end result, and allows for the solving of the state space model.

  • 00:30:00 In this section, Nathanael Bosch discusses the importance of including the initial value when solving an ordinary differential equation through probabilistic numerical methods. He explains that adding another measurement that depends only on the initial value to the observation model is a more general way to include the initial value. He then provides pseudocode for the extended Kalman filter and the ODE filter building blocks needed to implement the algorithm, and describes the standard filtering loop of prediction and update steps. The algorithm first conditions on the initial value and then, in each step, uses the transition model A and Q for the given step size.

  • 00:35:00 In this section, Nathanael Bosch demonstrates the code necessary to solve an ordinary differential equation (ODE) using probabilistic numerical methods in Julia. He notes that while the formulas may seem complicated, the 10 lines of code needed to set up the model correctly are straightforward. Bosch shows how the extended Kalman filter is implemented with only two lines of code and the standard notation for multiplying with the inverse is replaced with a numerically stable solution that solves a linear system. He defines the vector field, initial time span, and true solution for the logistic ODE and demonstrates how to define the prior using the two times integrated Wiener process. Bosch's implementation of the extended Kalman filter algorithm closely matches the pseudocode from the slides, and the initial distribution he uses is arbitrarily set to zero mean and standard covariance.
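
The lecture's implementation is in Julia; the following is an independent NumPy sketch of the same idea, not a transcription of that code. It assumes a two-times integrated Wiener process prior with unit diffusion, the logistic ODE x' = x(1 - x) with an illustrative initial value of 0.1, a fixed step size, and a zero-mean, identity-covariance initial distribution.

```python
import numpy as np

def f(x):            # logistic vector field
    return x * (1.0 - x)

def df(x):           # its derivative, used to linearize the measurement
    return 1.0 - 2.0 * x

def iwp2_transition(h):
    # Closed-form A(h), Q(h) of the two-times integrated Wiener process.
    A = np.array([[1.0, h, h**2 / 2], [0.0, 1.0, h], [0.0, 0.0, 1.0]])
    Q = np.array([[h**5 / 20, h**4 / 8, h**3 / 6],
                  [h**4 / 8, h**3 / 3, h**2 / 2],
                  [h**3 / 6, h**2 / 2, h]])
    return A, Q

E0 = np.array([[1.0, 0.0, 0.0]])   # selects x from the state
E1 = np.array([[0.0, 1.0, 0.0]])   # selects x' from the state

def update(m, P, z, H):
    # Noise-free Kalman update: condition on the residual z being zero.
    S = H @ P @ H.T
    K = np.linalg.solve(S, H @ P).T   # K = P H^T S^{-1} via a linear solve
    return m - K @ z, P - K @ S @ K.T

def ode_filter(x0, t_end, h):
    A, Q = iwp2_transition(h)
    m, P = np.zeros(3), np.eye(3)
    m, P = update(m, P, E0 @ m - x0, E0)          # condition on the initial value
    means, stds = [m[0]], [np.sqrt(max(P[0, 0], 0.0))]
    for _ in range(int(round(t_end / h))):
        m, P = A @ m, A @ P @ A.T + Q             # predict
        x = (E0 @ m)[0]
        z = E1 @ m - f(x)                         # ODE residual
        H = E1 - df(x) * E0                       # its Jacobian
        m, P = update(m, P, z, H)                 # update
        means.append(m[0])
        stds.append(np.sqrt(max(P[0, 0], 0.0)))
    return np.array(means), np.array(stds)

x0, t_end, h = 0.1, 10.0, 0.1
means, stds = ode_filter(x0, t_end, h)
ts = h * np.arange(len(means))
truth = x0 * np.exp(ts) / (1.0 - x0 + x0 * np.exp(ts))
max_err = np.max(np.abs(means - truth))
```

The arrays `means` and `stds` give the filter estimate and its (uncalibrated) standard deviation at each grid point; `max_err` compares the mean against the closed-form logistic solution.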

  • 00:40:00 In this section, Nathanael Bosch demonstrates how to use extended Kalman filters to solve ODEs and plots the filter estimates. He then plays around with step sizes, showcasing how smaller step sizes decrease uncertainties and how larger ones increase them. He explains that the uncertainty doesn't just grow over time and the error estimates are a model of the error that is happening. Finally, he demonstrates that smoothing generally improves the results of the trajectories, which matches the motivation from two lectures ago. However, the error estimates could be made even better, but he asks the audience for input on how to do so.

  • 00:45:00 In this section, we learn that the error estimate for the probabilistic numerical ODE solver is too large and needs to be fixed through uncertainty calibration. The hyperparameter sigma squared directly influences the uncertainties and needs to be set properly in order to get actual uncertainty estimates that are meaningful. The motivation for setting the hyperparameters is similar to that in Gaussian processes, where the hyperparameters are estimated by maximizing the likelihood of the data given the parameter. The probability of the data can be decomposed, making it convenient to express and optimize.

  • 00:50:00 In this section, Nathanael Bosch discusses the use of the extended Kalman filter to estimate the parameters in a non-linear state space model. The P of z K given Z1 until K minus 1 is estimated using Gaussian estimates, and the Sigma hat is computed as the argmax of the quasi maximum likelihood estimate. In ODE filters, it is possible to compute the maximum likelihood estimate in closed form using a rescaled way of recalibrating parameter estimates. This method produces better estimates and corresponds to the maximum likelihood estimate Sigma. Bosch explains how this can be implemented using an update function with a calibration suffix.

  • 00:55:00 In this section, Nathanael Bosch discusses the extended Kalman filter (EKF) for probabilistic numerical ordinary differential equation (ODE) solvers. He mentions that the update has been modified to increment sigma hat, so that the sum is computed in a running way and finally divided by N, which gives the quantity they want to compute. The EKF approximates something as Gaussian that might not be Gaussian, and the aim is to get uncertainty estimates that are as informative as possible. The result is an algorithm that provides useful error estimates that meaningfully describe the numerical error of the ODE solver. The obtained algorithm is fast and provides imperfect but still useful uncertainty estimates.
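
For a scalar ODE, the running sum described above corresponds to the following commonly used quasi maximum likelihood estimate. The symbols z_n (the residual in the n-th update step) and S_n (its predicted covariance, computed with unit diffusion) are notation assumed here.

```latex
% Quasi maximum likelihood estimate of the diffusion, accumulated over N steps
\hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} z_n^\top S_n^{-1} z_n
% Calibration: rescale all covariances computed with \sigma^2 = 1
P_n \;\leftarrow\; \hat{\sigma}^2 \, P_n
```

Because the posterior means do not depend on the diffusion, this rescaling changes only the reported uncertainties, not the estimated solution.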

  • 01:00:00 In this section, Nathanael Bosch explains the motivation for using probabilistic methods to solve ODEs. Beyond simply quantifying uncertainty and obtaining meaningful uncertainty estimates and plots, Bosch believes that formulating ODE solvers in a probabilistic way is flexible and convenient, enabling the inclusion of additional model features such as initial values. By defining a state space model and running an extended Kalman filter, it is possible to solve not only numerical problems with initial value but also higher-order ODEs with additional pieces of information.

  • 01:05:00 In this section, Nathanael Bosch explains a different approach to initial values for ODE solvers. He defines a new quantity to ensure that X1 equals the given initial derivative, which can then be used to run an extended Kalman filter with predict and update steps. He shows the example of the harmonic oscillator and how only two lines needed to be changed from before to include an update on the first derivative. Calibration is applied again for meaningful results, and the error in this case doesn't tend towards zero as there is no attractor to tend towards, but instead adjusts depending on the problem setting. Bosch also discusses differential algebraic equations, which are differential equations that cannot be moved from the left to the right due to a singular matrix.

  • 01:10:00 In this section, the speaker discusses the concept of differential algebraic equations (DAE), which are equations that don't describe a derivative and have a constant value at some point. The speaker suggests a modification to the ODE likelihood algorithm to create a DAE likelihood algorithm that can solve DAE in a probabilistic way. The speaker then gives an example of a problem where an ODE has additional information and suggests a modification to the state-space model to introduce an additional observation model so that the algorithm can apply both observation models to satisfy g on the discrete grid. The speaker provides a video example that illustrates the importance of conservation quantities in solving problems with ODEs and additional information.

  • 01:15:00 In this section of the video, Nathanael Bosch discusses the use of probabilistic numerical ODE solvers and the benefits of including additional information to improve the results of ODE models. He presents an example of an epidemic model, where the traditional scalar model failed to accurately represent the data, and shows how a Gaussian process can be used to improve the model. Adding in more information and using probabilistic techniques can ultimately lead to a more meaningful result.

  • 01:20:00 In this section, Bosch discusses probabilistic numerical ODE solvers, which involve using a linear measurement operator to measure certain dimensions of a solution to an ODE, represented as a four-dimensional object (s-i-r-n-d). After creating a state space model, the ODE solution is solved, with the addition of a beta state, and the likelihood models of the ODE solution, initial value, and data are considered. The inference task involves using an extended Kalman filter to determine what the white dots are, given the black dots of the observed data. It is also suggested that X and beta be merged for a simpler reformulation.

  • 01:25:00 In this section, the speaker explains how probabilistic numerical ODE solvers work: they solve ODEs through state estimation, treating the estimation as a probabilistic problem. He defines a method for solving ODEs using extended Kalman filters and smoothers, which leads to a range of solvers sometimes referred to as "ODE filters." The speaker highlights the importance of being Bayesian in decision-making and the utility of uncertainty estimates, as well as the convenience of using Bayesian algorithms that can be applied to a range of problems, including solving ODEs.

  • 01:30:00 In this section, the speaker talks about using extended Kalman filters in a non-standard way to solve numerical problems and perform inference from data in a way that combines physics and general external observations. According to the speaker, Bayesian filtering and smoothing are the best way to model or formulate dynamical systems, as they allow for flexible addition of information and factorization of the inference algorithm. The audience is encouraged to scan QR codes for feedback, and questions for the speaker are welcome.
Numerics of ML 7 -- Probabilistic Numerical ODE Solvers -- Nathanael Bosch
  • 2023.01.24
  • www.youtube.com
The seventh lecture of the Master class on Numerics of Machine Learning at the University of Tübingen in the Winter Term of 2022/23. This class discusses bot...
 

Lecture 8 -- Partial Differential Equations -- Marvin Pförtner



Numerics of ML 8 -- Partial Differential Equations -- Marvin Pförtner

Marvin Pförtner discusses partial differential equations (PDEs) and their significance in modeling various real-world systems. He explains how PDEs represent a system's mechanism with an unknown function and a linear differential operator, but require parameters that are often unknown. Gaussian process inference can be used to analyze PDE models and inject mechanistic knowledge into statistical models. Pförtner examines the heat distribution in a central processing unit in a computer by restricting the model to a 1-dimensional heat distribution and presenting the assumptions made for the model. The lecture also covers using Gaussian processes to solve PDEs and adding realistic boundary conditions for modeling uncertainty. Overall, the GP approach combined with the notion of an information operator allows us to incorporate prior knowledge about the system's behavior, inject mechanistic knowledge in the form of a linear PDE, and handle boundary conditions and right-hand sides.

In the second part of this video, Marvin Pförtner discusses using Gaussian processes to solve partial differential equations (PDEs) by estimating a probability measure over functions rather than a point estimate. He explains the benefits of uncertainty quantification and notes that this approach is more honest because it acknowledges the uncertainty in the estimation of the right-hand side function of the PDE. Pförtner also explains the Matern kernel, which is useful in practice and can control the differentiability of the GP, and provides a formula to compute the parameter P for the Matern kernel. He further explains how to construct a d-dimensional kernel for PDEs by taking products of one-dimensional Matern kernels over the dimensions, and the importance of being mathematically careful in the model construction.

  • 00:00:00 In this section of the lecture, Marvin Pförtner introduces partial differential equations (PDEs) and their importance in describing mechanistic models that generate data in the real world, including financial markets, fluids such as climate and weather, and wave mechanics. Despite being challenging to solve, linear PDEs continue to be a powerful modeling language, as they accurately describe many physical processes such as thermal conduction, electromagnetism, and particle velocities in Brownian motion. The lecture focuses specifically on integrating PDE-based models into probabilistic machine learning models through a practical modeling example.

  • 00:05:00 In this section, Marvin Pförtner discusses the use of partial differential equations (PDEs) to model various systems, including physical and financial models. He emphasizes the importance of understanding the behavior of a system's mechanism and inferring its behavior with the use of PDE models. However, PDEs often require system parameters that are unknown, and the goal is to use Bayesian statistical estimation to fuse the mechanistic knowledge of the system with measurement data to find these unknown parameters and gain confidence in predictions. Marvin also explains linear PDEs and how they relate to physical systems with spatial extent.

  • 00:10:00 In this section, Marvin Pförtner discusses partial differential equations (PDEs), which are commonly used to describe physical systems such as temperature distributions or the force generated by a set of electrical charges. The unknown function in a PDE represents the system being simulated, and the mechanistic knowledge is given by a linear differential operator. However, a challenge with PDEs is that they usually do not have an analytic solution and require numerical solvers that introduce discretization errors. Material parameters and the right-hand side function are two of the parameters that cannot be known exactly, causing difficulties in propagating uncertainties through classical solvers. Additionally, PDEs usually do not uniquely identify their solution, requiring additional conditions to be imposed.

  • 00:15:00 In this section, the speaker discusses partial differential equations (PDEs) and their relationship to functions, which are infinite-dimensional objects. The second-order differential operator in the Poisson equation maps linear functions to zero, so linear functions lie in its kernel: a linear term can be added to any solution of the Poisson equation and the result is still a solution. Boundary conditions are necessary to model interactions with the world outside of the simulation domain, which are summarized by how the outside interacts with the simulation at the boundary. PDEs are statements about functions that belong to function spaces, which are sets of functions that have a vector space structure similar to that of Rn, where linear operators can be represented by matrices. Linear operators are maps between function spaces that have a linearity property; a differential operator, which maps a function to its derivative, is one example.

  • 00:20:00 In this section, Pförtner explains that linear PDEs are essentially linear systems in an infinite-dimensional vector space and stresses the importance of defining norms on vector spaces and understanding convergence. He then introduces a mathematical model of the heat distribution in a central processing unit in a computer and restricts the model to a 1-dimensional heat distribution on a line slicing through the chip. The lecture discusses the assumptions made for this model and why it is a good model for this particular case.

  • 00:25:00 In this section, the speaker discusses modeling the heat sources and heat sinks in a chip and how it can be represented using partial differential equations (PDEs). They explain the heat equation, which is a linear PDE of the second order and how it can be applied to model the temperature distribution in the chip. The speaker also explains how mechanistic knowledge from the differential equation can be injected into statistical models by interpreting the PDEs as an observation of the unknown function and the image under the differential operator. The PDEs are compared to fundamental laws in physics that describe the conservation of fundamental quantities like energy and mass.

  • 00:30:00 In this section, Marvin Pförtner discusses the relationship between temperature and heat energy and how they are proportional to one another through material parameters. He explains that every change in heat energy can be explained by either a known value of heat entering the system or by heat flowing into a certain point from surroundings via heat conduction. He then introduces the information operator as a mathematical concept that can be used to express any piece of information, including that of a differential equation. He further explains how a Gaussian process prior can be used to model an unknown function U, and how the posterior can be computed using closures of Gaussian processes under linear observations. However, because solving PDEs requires an infinite set of observations, it is computationally impossible for most cases, unless analytical information is known about the problem being solved.

  • 00:35:00 In this section, the speaker discusses using Gaussian processes (GPs) to solve partial differential equations (PDEs), which is similar to the approach used in ordinary differential equations (ODEs). The GP is seen as a probability measure on function spaces and a linear operator maps the sample paths of that GP onto RN. The prior predictive of this process is found to be a normal distribution, with the mean given by the image of the GP mean function through the linear operator, and the covariance matrix being very similar to the covariance matrix found in the finite-dimensional case. The posterior of this event turns out to actually have a similar structure to it. The speaker notes a lot of theoretical detail is involved and caution is necessary due to the infinities involved in solving PDEs using GPs.
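
In formulas, the statement above reads roughly as follows. The notation is an assumption made for illustration: a prior u with mean function m and kernel k, a linear operator L mapping sample paths to R^n, and an observed value y.

```latex
% Prior and its image under a linear operator L
u \sim \mathcal{GP}(m, k), \qquad L u \sim \mathcal{N}\big(L m,\; L k L^{*}\big)
% Posterior after observing L u = y
u \mid (L u = y) \sim \mathcal{GP}\Big(m + (k L^{*}) (L k L^{*})^{-1} (y - L m),\;\; k - (k L^{*}) (L k L^{*})^{-1} (L k)\Big)
```

Here L k L^* denotes applying the operator to both arguments of the kernel, which plays the role of the matrix product A Sigma A^T in the finite-dimensional case.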

  • 00:40:00 In this section, Marvin Pförtner explains how to compute a specific choice of a linear operator and the difficulties in expressing it in standard linear operator notation. He also discusses how to differentiate the one argument, differentiate the other argument, and build a matrix of all pairwise derivatives between two points. He then talks about how to use the same theorem to apply it to the problem and compute the posterior Gaussian process, and how to define the set of collocation points.
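
As a concrete example of differentiating a kernel in one or both arguments, the sketch below uses a one-dimensional squared-exponential kernel and the second-derivative operator. The kernel choice, the length scale `ell`, and all function names are assumptions for illustration; the lecture's own kernel and operator may differ.

```python
import math

def rbf(x, y, ell=1.0):
    # Squared-exponential kernel k(x, y) = exp(-(x - y)^2 / (2 ell^2)).
    return math.exp(-0.5 * ((x - y) / ell) ** 2)

def d2_dy2(x, y, ell=1.0):
    # Second derivative with respect to the second argument only.
    r = x - y
    return (r**2 / ell**4 - 1.0 / ell**2) * rbf(x, y, ell)

def d4_dx2_dy2(x, y, ell=1.0):
    # Second derivative in the first AND second argument.
    r = x - y
    return (r**4 / ell**8 - 6.0 * r**2 / ell**6 + 3.0 / ell**4) * rbf(x, y, ell)

def pairwise(kernel_fn, xs, ys, ell=1.0):
    # Matrix of kernel_fn evaluated at all pairs of points.
    return [[kernel_fn(x, y, ell) for y in ys] for x in xs]

points = [0.0, 0.25, 0.5, 0.75, 1.0]
gram = pairwise(d4_dx2_dy2, points, points, ell=0.5)
```

The matrix `gram` is the covariance of the second derivative of the GP at the chosen points, which is the kind of object needed when the differential operator is applied at a set of collocation points.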

  • 00:45:00 In this section, the speaker explains how a generalized form of Gaussian Process inference can solve a boundary value problem. They outline how the observations can be represented using a black function that matches the right-hand side of the Partial Differential Equation (PDE), and how the information learned from this can be propagated back to the original Gaussian Process. The degree of freedom in the PDE that the boundary conditions do not fix can cause uncertainty, but by imposing Dirichlet boundary conditions, the posterior becomes a normal Gaussian Process regression problem, which works if the two boundary values are observed. The speaker emphasizes the importance of noting that boundary values in deployment are usually not known, and it would be helpful to add uncertainty to both the boundary values and the heat source distribution.
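
The following NumPy sketch shows what such a boundary value problem can look like in the simplest setting. Everything specific in it is an assumption for illustration: a 1D Poisson problem -u'' = f on [0, 1] with zero Dirichlet boundary values, a squared-exponential prior with a hand-picked length scale, ten collocation points, and a right-hand side chosen so that the exact solution sin(pi x) is known.

```python
import numpy as np

ELL = 0.3  # kernel length scale (assumption)

def k(x, y):
    r = x[:, None] - y[None, :]
    return np.exp(-0.5 * r**2 / ELL**2)

def Lk(x, y):
    # Operator L = -d^2/dx^2 applied to ONE argument of the kernel.
    r = x[:, None] - y[None, :]
    return -(r**2 / ELL**4 - 1.0 / ELL**2) * k(x, y)

def LkL(x, y):
    # Operator applied to BOTH arguments of the kernel.
    r = x[:, None] - y[None, :]
    return (r**4 / ELL**8 - 6.0 * r**2 / ELL**6 + 3.0 / ELL**4) * k(x, y)

f = lambda x: np.pi**2 * np.sin(np.pi * x)   # right-hand side of -u'' = f
x_col = np.linspace(0.05, 0.95, 10)          # collocation points
x_bnd = np.array([0.0, 1.0])                 # Dirichlet boundary points

# Joint Gram matrix of the PDE observations and the boundary observations.
G = np.block([[LkL(x_col, x_col), Lk(x_col, x_bnd)],
              [Lk(x_bnd, x_col), k(x_bnd, x_bnd)]])
G += 1e-8 * np.eye(len(G))                   # small jitter for numerical stability
y = np.concatenate([f(x_col), np.zeros(2)])  # PDE values, then u(0) = u(1) = 0

# Posterior mean and variance of u on a test grid.
x_test = np.linspace(0.0, 1.0, 101)
cross = np.hstack([Lk(x_test, x_col), k(x_test, x_bnd)])
post_mean = cross @ np.linalg.solve(G, y)
post_var = 1.0 - np.sum(cross * np.linalg.solve(G, cross.T).T, axis=1)

max_err = np.max(np.abs(post_mean - np.sin(np.pi * x_test)))
```

Dropping the two boundary rows from `G` would leave the solution undetermined up to functions in the kernel of the operator, which shows up as large posterior variance.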

  • 00:50:00 In this section, the speaker discusses more realistic boundary conditions for partial differential equations. He states that heat is extracted uniformly over the entire surface of the CPU and this information can be modeled as Neumann boundary conditions where the first derivative of a boundary point is set instead of the value of the boundary point. By doing so, we can add uncertainty to the model and use a Gaussian distribution to model the derivative. An additional information operator is used to describe this boundary condition. The speaker further explains how the absolute scale of the system is determined by using thermometers within the CPU, and also how uncertain estimates of the function can be obtained by modeling a prior belief using another Gaussian process.

  • 00:55:00 In this section, Marvin Pförtner discusses how to integrate prior knowledge about a system's behavior into the model, with the help of Gaussian processes and information operators. He mentions that it's essential to choose the right-hand side function for the model integrable to zero to avoid the system from just continuously heating up. Pförtner then proceeds to discuss the challenges of ensuring that the GP has area one in all of its samples and how they can be solved by adding additional constraints, including the boundary effects, which takes into account the heat leaving via the boundary. Finally, Pförtner concludes that this GP approach combined with the notion of an information operator allows us to incorporate prior knowledge about the system's behavior, inject mechanistic knowledge in the form of a linear PDE, and handle boundary conditions and right-hand sides.

  • 01:00:00 In this section, Marvin Pförtner discusses using Gaussian processes to solve partial differential equations (PDEs) by estimating a probability measure over functions instead of a point estimate, which can give confidence intervals and samples that fulfill the conditions of the PDE. He explains that this approach is more honest because it acknowledges the uncertainty in the estimation of the right-hand side function of the PDE, and that it can be applied to 2D simulations, as well as simulations with time as another spatial dimension. Pförtner notes that the posterior mean of this method assuming no uncertainty is equivalent to a classical method called symmetric collocation. Finally, he explains that other methods for solving PDEs, such as weighted residual, finite volume, and spectral methods, can also be realized as posterior means of a Gaussian process, just without the uncertainty quantification.

  • 01:05:00 In this section, the speaker explains how Gaussian processes (GPs) can be used to solve linear partial differential equations (PDEs) and can also realize regression for function estimation. They emphasize the importance of choosing the right functions and prior to work with, as well as the benefits of uncertainty quantification. The speaker also notes the failure cases, such as when the sample paths of GPs are not differentiable, and the need to verify important conditions in order to make everything rigorous. The section concludes with a teaser of an upcoming publication from the speaker’s group that will delve into the formal details of these theorems.

  • 01:10:00 In this section, the speaker discusses how Gaussian processes (GPs) are defined and used to model unknown functions. A GP is a collection of real-valued random variables, one for each point in its domain, such that every finite collection of them is jointly Gaussian. GPs are used to represent functions, but in practice we only ever work with finitely many evaluations of the GP. To obtain a sample path of a GP, we fix an outcome ω and evaluate all of the random variables at that ω, which yields a single function on the domain. The sample paths must be sufficiently differentiable so that applying a differential operator to them is well defined. To compute Lf, the image of a GP f under a linear operator L, we fix an ω and apply L to the corresponding sample path.
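
In symbols, the construction described above can be written roughly as follows. The notation (the domain \mathcal{X}, probability space \Omega, mean m, kernel k) is assumed here for illustration, and the last line is the standard closure property of GPs under linear operators, which holds only under regularity conditions of the kind the speaker says must be verified.

```latex
% A GP as a family of random variables indexed by the domain
f : \mathcal{X} \times \Omega \to \mathbb{R}, \qquad
f(x, \cdot) \text{ is a real-valued random variable for every } x \in \mathcal{X}

% Sample path: fix \omega and vary x
f(\cdot, \omega) : \mathcal{X} \to \mathbb{R}, \qquad x \mapsto f(x, \omega)

% Image under a linear operator, defined path by path
(\mathcal{L} f)(\cdot, \omega) := \mathcal{L}\bigl[ f(\cdot, \omega) \bigr]

% Closure property (under suitable regularity assumptions);
% \mathcal{L} k \mathcal{L}^{*} denotes \mathcal{L} applied to both arguments of k
f \sim \mathcal{GP}(m, k) \;\Longrightarrow\;
\mathcal{L} f \sim \mathcal{GP}\bigl( \mathcal{L} m,\; \mathcal{L} k \mathcal{L}^{*} \bigr)
```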

  • 01:15:00 In this section, the speaker explains how a sample path can be mapped through a linear operator, which produces another infinite-dimensional object that must be shown to be a measurable random variable. They note that the sample paths of GPs can be placed in a reproducing kernel Hilbert space by choosing an appropriate kernel; however, the reproducing kernel Hilbert space of the GP's own kernel is not the space from which the samples come, so a larger space that contains the samples has to be chosen. The speaker goes on to discuss the Matérn kernel, which is useful in practice because it controls the differentiability of the GP, and provides a formula for computing the parameter p of the Matérn kernel, which helps generalize the process.
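
For reference, a small sketch of the Matérn kernel in its common half-integer parametrization, with smoothness ν = p + 1/2 for a non-negative integer p, follows. Whether this is exactly the parametrization and the meaning of p used in the talk is an assumption; the closed form below is the standard textbook one, and larger p corresponds to smoother sample paths.

```python
import math

def matern_half_integer(r, p, lengthscale=1.0):
    """Matérn kernel at distance r with smoothness nu = p + 1/2, unit variance."""
    nu = p + 0.5
    s = math.sqrt(2.0 * nu) * abs(r) / lengthscale
    # Polynomial factor of the closed form for half-integer smoothness.
    poly = sum(
        math.factorial(p + i)
        / (math.factorial(i) * math.factorial(p - i))
        * (2.0 * s) ** (p - i)
        for i in range(p + 1)
    )
    return math.factorial(p) / math.factorial(2 * p) * math.exp(-s) * poly

# p = 0, 1, 2 give the familiar Matérn-1/2, -3/2 and -5/2 kernels.
for p in (0, 1, 2):
    print(p, matern_half_integer(0.5, p))
```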

  • 01:20:00 In this section, the speaker explains how to construct a d-dimensional kernel for partial differential equations (PDEs) by taking products of one-dimensional Matérn kernels over the dimensions, which is especially useful when the derivatives have mixed orders. This lets the prior adapt to the concrete equation that users are trying to solve. Additionally, GPs provide a framework for combining various information sources into a single regression model using affine information operators. The speaker stresses the importance of being mathematically careful in the model construction, particularly when constructing the prior for a specific equation.
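
The product construction can be sketched as follows. The per-dimension orders and lengthscales, and the example of a (time, space) input with lower smoothness in time than in space, are hypothetical choices for illustration; only the Matérn-1/2, -3/2 and -5/2 closed forms are implemented here.

```python
import math

def matern_1d(r, p, lengthscale):
    """1D Matérn kernel with smoothness nu = p + 1/2, for p in {0, 1, 2}."""
    s = math.sqrt(2 * p + 1) * abs(r) / lengthscale
    if p == 0:
        return math.exp(-s)
    if p == 1:
        return (1.0 + s) * math.exp(-s)
    if p == 2:
        return (1.0 + s + s * s / 3.0) * math.exp(-s)
    raise ValueError("only p in {0, 1, 2} is implemented in this sketch")

def product_matern(x, y, orders, lengthscales):
    """Product over dimensions of 1D Matérn kernels, one order per dimension."""
    value = 1.0
    for xi, yi, p, ell in zip(x, y, orders, lengthscales):
        value *= matern_1d(xi - yi, p, ell)
    return value

# Hypothetical (t, x) inputs: smoothness p = 1 in time, p = 2 in space.
orders = (1, 2)
lengthscales = (0.5, 0.2)
points = [(0.0, 0.0), (0.1, 0.3), (0.4, 0.1)]
gram = [[product_matern(a, b, orders, lengthscales) for b in points] for a in points]
for row in gram:
    print(row)
```

Choosing the order per dimension is what allows the prior to be differentiable often enough in each variable for the operator of the specific equation to be applied.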