Machine Learning and Neural Networks

 

Gail Weiss: Thinking Like Transformers

Gail Weiss discusses the concept of transformer encoders in this video, explaining their ability to process sequences and encode them into vectors. Weiss highlights several studies exploring the strengths and limitations of transformer encoders and introduces a programming language called the restricted access sequence processing language (RASP) to represent transformer encoders' abilities. She also discusses multi-headed attention, selection patterns, and the challenges of softmax under certain conditions, before delving into the use of sequence operators and library functions to compute the inverse and the flip selector. Weiss provides insight into creating an optimal program for a transformer and the insights from the Universal and Sandwich Transformers, ultimately discussing the select predicate and binary vs. order-three relations.

Weiss also talks about the potential benefits and drawbacks of using higher-order attention in transformer models, as well as the importance of residual connections in maintaining information throughout the layers. She also discusses potential issues with very deep transformers deviating from the RASP model and suggests the use of longer embeddings to overcome fuzziness in the information.

  • 00:00:00 In this section of the video, Gail Weiss introduces the concept of transformer encoders, which are part of a neural network designed to process sequences. She explains that the transformer encoder takes a given sequence and encodes it into a set of vectors, with some popular examples including BERT. Weiss then highlights several studies that have explored the strengths and limitations of transformer encoders, including their ability to recognize formal languages and perform calculations. While these studies provide insights into the abilities of transformer encoders, Weiss notes that they don't offer a clear intuition about how the transformers actually process tasks.

  • 00:05:00 In this section, Gail Weiss explains that recurrent neural networks (RNNs) are similar to a finite state machine, which defines transitions between states based on input vectors and outputs a final classification according to which state is reached. By analyzing their expressive power through our understanding of deterministic finite automata, researchers gain insight into what can be done with RNNs, such as converting an RNN into a weighted finite automaton, a deterministic finite automaton, or a weighted deterministic finite automaton. However, this intuition does not exist for Transformers, and there is a need for a model that would serve a similar purpose for the Transformer encoder. Unlike RNNs, Transformer encoders take their entire input at once and process all input tokens in parallel through a deep network that applies its processing a fixed number of times. Though there are no states in Transformers, there is still a sense of processing and propagating information through a set of operations.

  • 00:10:00 In this section, Gail Weiss explains how a transformer encoder can be represented in a programming language called the restricted access sequence processing language (RASP), which describes the computation as sequence operators (s-ops). The s-ops correspond to the layers of the transformer encoder, which apply operations to the sequence to produce increasingly processed information. RASP provides two constants, tokens and indices, that create new sequences of characters and indices from any input sequence, and element-wise operations correspond to the feed-forward network applied to each of the vectors entering a transformer encoder layer. The programmer is responsible for keeping the applied operations reasonable, since they can be any element-wise computation.
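
To make the idea concrete, here is a minimal Python sketch of what RASP-style sequence operators could look like; the helper names (tokens, indices, elementwise) and the representation of an s-op as a plain function over sequences are assumptions made for this illustration, not the actual RASP implementation.

    # Minimal sketch of RASP-style sequence operators (s-ops) in plain Python.
    # An s-op maps an input sequence to an output sequence of the same length.

    def tokens(seq):
        # The constant s-op "tokens": returns the input characters unchanged.
        return list(seq)

    def indices(seq):
        # The constant s-op "indices": returns 0, 1, 2, ... for each position.
        return list(range(len(seq)))

    def elementwise(f, *sops):
        # Apply a position-wise (feed-forward-like) function to one or more s-ops.
        return lambda seq: [f(*vals) for vals in zip(*(s(seq) for s in sops))]

    # Example: an s-op marking which positions hold the token "a".
    is_a = elementwise(lambda t: t == "a", tokens)
    print(tokens("hello"))    # ['h', 'e', 'l', 'l', 'o']
    print(indices("hello"))   # [0, 1, 2, 3, 4]
    print(is_a("banana"))     # [False, True, False, True, False, True]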

  • 00:15:00 In this section of the video, Gail Weiss introduces the concept of multi-headed attention, which is the part of the transformer encoder layer that isn't element-wise. She starts by explaining single-head attention and how it works by applying two linear transformations to the sequence of vectors, creating a set of queries and a set of keys. The key describes the information a position has to offer, and the query describes the information that is wanted at each position. Through the scalar product of the query with each of the keys, attention weights are obtained and normalized, and the output at each position is computed as a weighted combination of the values. Weiss also answers a question regarding the intermediate transformations of the keys and values, which allow the model to have more parameters.
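
A minimal numpy sketch of the single-head attention just described follows; the matrix shapes, the scaling by the square root of the key width, and the random test data are assumptions made for the example rather than details given in the talk.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def single_head_attention(X, Wq, Wk, Wv):
        # X: (n, d) sequence of n input vectors of width d.
        Q = X @ Wq          # queries: what each position is looking for
        K = X @ Wk          # keys: what each position has to offer
        V = X @ Wv          # values: the information that gets mixed
        scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise query-key match
        weights = softmax(scores, axis=-1)        # normalize per query position
        return weights @ V                        # weighted average of values

    rng = np.random.default_rng(0)
    n, d = 5, 8
    X = rng.normal(size=(n, d))
    out = single_head_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
    print(out.shape)   # (5, 8)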

  • 00:20:00 In this section, the speaker discusses multi-headed self-attention and how it can be used to compress multiple operations into a single layer. The process involves splitting the input vectors into equal-length blocks and passing each block into a different head. Once the multiple heads have been applied, their outputs are concatenated to form the output of the multi-headed self-attention. Although there is no dependence between the different heads, a single layer can perform multiple operations because it has multiple heads. This lets the programmer think in terms of individual attention heads rather than multi-headed self-attention, leaving the packing of heads into layers to the compiler.
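
Following that description literally (split the input into equal blocks, run one head per block, concatenate), a hedged numpy sketch might look like this; real transformer implementations usually project the full input per head instead, so treat the splitting here purely as an illustration of the text above.

    import numpy as np

    def attn(X, Wq, Wk, Wv):
        # One attention head (same computation as in the previous sketch).
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        S = Q @ K.T / np.sqrt(K.shape[-1])
        W = np.exp(S - S.max(-1, keepdims=True))
        return (W / W.sum(-1, keepdims=True)) @ V

    def multi_head(X, heads):
        # Split the width-d input into equal blocks, run one head per block,
        # then concatenate the head outputs back to the original width.
        blocks = np.split(X, len(heads), axis=-1)
        return np.concatenate([attn(b, *w) for b, w in zip(blocks, heads)], axis=-1)

    rng = np.random.default_rng(1)
    n, d, n_heads = 5, 8, 2
    X = rng.normal(size=(n, d))
    heads = [tuple(rng.normal(size=(d // n_heads, d // n_heads)) for _ in range(3))
             for _ in range(n_heads)]
    print(multi_head(X, heads).shape)   # (5, 8): two heads, each of width 4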

  • 00:25:00 In this section, Gail Weiss provides examples of how to encode certain selection patterns for a transformer. Using the example of positions that each describe themselves with a zero, a one, or a two, Weiss shows a pattern in which the first position selects the last two positions, the second position selects nothing, and the third position selects only the first. Weiss also explains that various types of comparison can be used, such as greater than or equal, less than or equal, or unequal, and that selectors can be composed together. However, Weiss notes that mapping selection patterns onto a transformer depends on the reasonableness of the operation being performed, which is up to the programmer to ensure.
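
As a rough illustration (not Weiss's exact notation), a select step can be thought of as building a boolean attention pattern from keys, queries and a pairwise comparison; the values and the "strictly smaller" predicate below are made up for this sketch.

    # Sketch of a RASP-style "select": build an n-by-n boolean pattern from a
    # key sequence, a query sequence, and a pairwise predicate.

    def select(keys, queries, predicate):
        return [[predicate(k, q) for k in keys] for q in queries]

    values = [0, 1, 2]
    pattern = select(values, values, lambda k, q: k < q)
    for row in pattern:
        print(row)
    # value 0 selects nothing                 -> [False, False, False]
    # value 1 selects the position holding 0  -> [True, False, False]
    # value 2 selects the 0 and the 1         -> [True, True, False]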

  • 00:30:00 In this section, the speaker discusses a challenge of using softmax under certain conditions, namely that it still places a small amount of attention on positions that do not meet the selection criterion. To address this issue, the speaker notes that if the size of the embeddings is increased, softmax will begin to approximate a hard selection between the positions that satisfy a comparison such as "less than or equal" and those that do not. The presentation then moves on to the idea of aggregation and how the selected tokens can be combined using a weighted average. The speaker walks through the example of token reversal, creating a reversing selector and aggregating the tokens to obtain the reversed input sequence, turning the word "hello" into "olleh".
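
Below is a small Python sketch of the select-and-aggregate pattern applied to reversal; using Python's len for the sequence length is a shortcut for the illustration, since the talk later shows how a transformer must compute the length itself (see the 00:40:00 item).

    # Sketch of RASP-style aggregate: average the selected values per position.
    # With a one-hot selection pattern the "average" is just a copy, which is
    # how the reversing example works.

    def select(keys, queries, predicate):
        return [[predicate(k, q) for k in keys] for q in queries]

    def aggregate(pattern, values):
        out = []
        for row in pattern:
            chosen = [v for v, keep in zip(values, row) if keep]
            if len(chosen) == 1:
                out.append(chosen[0])                  # hard selection: a copy
            elif chosen:
                out.append(sum(chosen) / len(chosen))  # soft average (numeric values)
            else:
                out.append(None)
        return out

    def reverse(seq):
        n = len(seq)                 # shortcut; see the length computation below
        idx = list(range(n))
        flip = select(idx, idx, lambda k, q: k == n - 1 - q)   # position i picks n-1-i
        return aggregate(flip, list(seq))                      # copy the picked token

    print(reverse("hello"))   # ['o', 'l', 'l', 'e', 'h']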

  • 00:35:00 In this section, the speaker discusses how the length operator can be used in transformers even though the length is not part of the input, and how select decisions are kept pairwise so that the selection operation cannot hide arbitrary computational power. The speaker also talks about other transformer components such as the skip or residual connection, which adds the value of the embeddings back in, and layer norm, which is not modeled in RASP. Additionally, the speaker mentions the use of functions and composed selectors as conveniences to avoid repeating code.

  • 00:40:00 In this section, the speaker discusses the use of sequence operators such as val, min, and max to check which indices are within a certain range. They also mention the utility of library functions such as selector width in implementing functions like in-place histograms and counting tokens in input sequences. Additionally, the speaker describes how the length of a sequence can be computed using a selector and an indicator function, noting that layer normalization could complicate such computations. Finally, they suggest an alternative solution involving a feed-forward network to compute the inverse, but acknowledge that it may still be affected by layer normalization.
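
One way to read the length computation described here: attend uniformly to every position, average an indicator that is 1 only at the first position (which yields 1/n at every position), and then invert that value element-wise. The Python sketch below imitates that recipe; the explicit lists and divisions are stand-ins for what attention averaging and the feed-forward layer would do in a real transformer.

    # Sketch of computing the sequence length with a selector and an indicator:
    # select every position, average an indicator that is 1 only at position 0
    # (giving 1/n everywhere), then invert element-wise to recover n.

    def length_sop(seq):
        n = len(seq)                                    # used only to build the toy structures
        indicator = [1 if i == 0 else 0 for i in range(n)]
        select_all = [[True] * n for _ in range(n)]     # every query selects all keys
        one_over_n = [sum(v for v, keep in zip(indicator, row) if keep) / sum(row)
                      for row in select_all]            # attention-style averaging
        return [1 / x for x in one_over_n]              # element-wise inverse -> n

    print(length_sop("hello"))   # [5.0, 5.0, 5.0, 5.0, 5.0]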

  • 00:45:00 In this section, Gail Weiss discusses how transformers can learn to compute the inverse and the flip selector, even when dealing with relatively short sequences. She further explains that RASP program analysis can be used to determine where computing the inverse and applying the flip selector need to happen, since each select-and-aggregate pair is just an attention head. Finally, she explores how an actual transformer could perform the reverse computation in two layers, which was supported by the finding that a trained transformer attained 99.6 percent accuracy when reversing input sequences of up to length 100 after 20 epochs.

  • 00:50:00 In this section, Gail Weiss discusses the creation of an optimal program for a transformer, which requires two layers to perform the reversal task effectively. Even if the program is reduced to one layer and twice the parameters are added, or two heads are used, it won't manage to get high accuracy because it doesn't know how to deal with the changing sequence length. Additionally, the attention patterns in the trained model match the program, with the first layer using uniform attention to compute the sequence length and the second layer using the expected reversing attention pattern. Further on, Weiss demonstrates an in-place histogram, in which each position outputs the number of times its token appears in the input sequence. By having each token focus on the positions holding the same token as itself, or on the beginning-of-sequence token in the first position, only one attention head is needed, making it much easier to train a transformer to compute the histogram and related quantities.
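
A hedged sketch of the in-place histogram: each position selects the positions holding the same token, and the output is simply how many positions were selected (the selector width). The select and selector_width helpers below are illustrative stand-ins, not RASP's own functions.

    def select(keys, queries, predicate):
        return [[predicate(k, q) for k in keys] for q in queries]

    def selector_width(pattern):
        # Number of selected positions for each query row.
        return [sum(row) for row in pattern]

    def histogram(seq):
        same_token = select(list(seq), list(seq), lambda k, q: k == q)
        return selector_width(same_token)

    print(histogram("hello"))   # [1, 1, 2, 2, 1]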

  • 00:55:00 In this section, Gail Weiss talks about the insights that come from the Universal Transformer and the Sandwich Transformer, both of which affect the way of thinking about and utilizing the transformer model. She explains that every sequence operator in RASP is the result of a finite number of operations, so the language is not infinitely powerful; each program corresponds to a finite number of O(n^2) computations. She also touches on the computational cost of an attention head in the transformer encoder and whether it is always worth spending the n^2 computations. The conversation then shifts to a discussion of the select predicate and binary versus order-three relations.

  • 01:00:00 In this section, Gail Weiss discusses the potential increase in power of a transformer model with higher-order attention, such as third-order instead of second-order. She explains that such a model could potentially compute functions requiring more than O(n^2) operations, increasing its power, though it would also be more expensive computationally. Weiss also emphasizes the importance of residual connections in the transformer model as a way to retain and reuse information throughout the layers, and notes that removing them can drastically decrease performance even though they may not appear as a distinct operation.

  • 01:05:00 In this section of the video, Gail Weiss discusses the potential issues that could arise with very deep transformers and how they may deviate from the RASP model. She mentions a paper she read at ICML about the significance of removing certain information from a transformer, which could cause it to lose that information quickly; however, she also notes that information may be kept if it is recognized as important. Weiss also discusses the idea of having a very long embedding to overcome fuzziness in the information as the transformer goes deeper.
Gail Weiss: Thinking Like Transformers
  • 2022.02.25
  • www.youtube.com
Paper presented by Gail Weiss to the Neural Sequence Model Theory discord on the 24th of February 2022. Gail's references: On Transformers and their components...
 

Visualizing and Understanding Deep Neural Networks by Matt Zeiler

Matt Zeiler discusses visualizing and understanding convolutional neural networks (CNNs) for object recognition in images and videos. He describes how deep neural networks perform compared to humans and primates in recognizing objects and shows how CNNs learn to identify objects layer by layer. Zeiler explains the process of improving CNN architecture and discusses the limitations of training with limited data. Lastly, he answers questions about whether lower-layer features can be used closer to the classification outputs and about how convolutions are applied in neural networks.

  • 00:00:00 In this section, Matt Zeiler describes a technique for visualizing convolutional networks used to recognize objects in images and videos, which allows researchers to understand what each layer is learning and gain insights to improve performance. Convolutional neural networks have been around since the late 80s, and new approaches use very much the same architecture as before. The breakthrough in the field came from Geoff Hinton's team, whose neural network brought the error rate on the common ImageNet benchmark down by roughly ten percentage points from the usual twenty-six percent, leading to much better performance on recognition tasks.

  • 00:05:00 In this section, Matt Zeiler discusses recent studies that compare the performance of deep neural networks to that of primates and humans in recognizing objects. One study recorded, via implanted electrodes, the neural firing in a monkey's brain when it was presented with images and compared its recognition accuracy to that of deep neural networks and humans. The results showed that humans, deep neural networks, and the IT cortex of the monkey performed almost equally when images were presented for less than 100 milliseconds. Additionally, Zeiler discusses deconvolutional networks, unsupervised learning models used to reconstruct images while keeping top-level features sparse, with the goal of learning what a convolutional network is actually learning.

  • 00:10:00 In this section, Matt Zeiler explains the importance of making the operations in convolutional networks reversible in order to get good reconstructions, especially when dealing with multiple layers of information. He demonstrates how the higher layers of the network are visualized (using a validation set of 50,000 images): the single feature map with the strongest activation is selected at a time and passed back down through the deconvolutional network to reconstruct a visualization in pixel space. The visualization of the first-layer feature maps shows filters consisting of oriented edges and color edges at varying orientations and frequencies, which is what researchers previously expected. However, the visualization of higher layers provides new insights into how the network learns and classifies different objects by showing the strongest activations and invariances across multiple images.

  • 00:15:00 In this section, Zeiler explains what the second layer of the network learns, which is a far more complex set of patterns than the first: combinations of edges, parallel lines, curves, circles, and colorful blocks, among other structures. Through pooling, it has a broader scope of the image that it can process. Looking at the third layer, Zeiler shows how it learns object parts that are crucial for building a representation of an object, such as a dog's face or a human face. Grouping structures remain present in the third layer, but as more semantically relevant groupings, such as grids or specific face structures.

  • 00:20:00 In this section, it is explained how the neural network learns to identify specific objects as information passes through the layers. In the fourth layer of the network, the features become more object-specific, and categories that aren't explicit in the task, like grass, emerge as features. The model also learns to recognize variations such as different breeds of dogs or different types of keyboards at different orientations. The last convolutional layer gets bigger because of boundary effects on the convolutions as it gets closer to the classification layer. The content of this last layer becomes highly object-specific, since the model has to decide which class the image belongs to, and only 256 features exist in this layer.

  • 00:25:00 In this section, Matt Zeiler discusses an experiment to confirm that the visualizations are triggering on relevant parts of the image. They slid a gray block (set to the mean pixel value of 128) over the image and recorded the activations and output probabilities of the model. By blocking out the face of a Pomeranian dog, they found that the probability of "Pomeranian" drops significantly, and the most probable class becomes "tennis ball" when the face is blocked. Interestingly, they found that the fifth layer has learned a text detector, as the corresponding feature drops significantly when any text in an image is blocked, suggesting that the layer can associate text with certain classes. Lastly, they examined the Toronto group's model, which won the ImageNet challenge in 2012, and found a huge disparity in the normalization of filters in the first layer.
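
A hedged sketch of that occlusion experiment is below: slide a gray patch over the image and record how much the probability of the class of interest drops. The predict function, patch size and stride are placeholders invented for this illustration; the talk does not specify those details.

    import numpy as np

    def occlusion_map(image, predict, class_index, patch=32, stride=16, fill=128):
        # image: (H, W, 3) uint8 array; predict: function returning class probabilities.
        # Slides a gray patch (the mean pixel value) over the image and records how
        # much the probability of `class_index` drops at each patch location.
        H, W, _ = image.shape
        base = predict(image)[class_index]
        heat = np.zeros(((H - patch) // stride + 1, (W - patch) // stride + 1))
        for i, y in enumerate(range(0, H - patch + 1, stride)):
            for j, x in enumerate(range(0, W - patch + 1, stride)):
                occluded = image.copy()
                occluded[y:y + patch, x:x + patch, :] = fill   # block out a region
                heat[i, j] = base - predict(occluded)[class_index]
        return heat   # large values mark regions the classifier relied on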

  • 00:30:00 In this section of the video, Matt Zeiler discusses the process of improving the architecture of deep neural networks. He explains that after fixing the normalization issue, it became clear that the first-layer filters were too large, resulting in dead filters. The second layer also had a lot of blocking artifacts, causing it to lose information, which led them to make the strides in the convolution smaller, removing the blocking artifacts and increasing the flexibility of the second layer. These modifications helped them win the 2013 ImageNet competition, and the same approaches were used again in later competitions with good results. Zeiler also discusses the generalization capabilities of these neural networks and their use in determining saliency.

  • 00:35:00 In this section, Zeiler discusses the limitations of deep models when only a small amount of training data is used, stating that the models struggle to learn the features properly. He explains that these models are adept at recognizing features that are important in general for object recognition, and this can be transferred to other tasks with only a few examples, as displayed through various tables and graphs. Additionally, Zeiler examines how important it is to train a deep model by looking at all intermediate layers and different types of classifiers. Finally, Zeiler suggests that utilizing a trained model to clean up gathered label data is possible and could potentially improve training models.

  • 00:40:00 In this section, Zeiler responds to a question about whether the lower layers of a neural network, which have shown decent performance in classification, can be used in higher layers or near the classification outputs. He explains that there may be more information in higher layers due to repeated extraction, but different types of information could also help. The conversation then shifts to the performance of different layers and hardware considerations for training large neural networks. Zeiler also discusses the ability of neural networks to recognize less concrete classes, such as subtle emotions or gestures, and the mapping of different layer sizes.

  • 00:45:00 In this section, the speaker explains how convolutions are applied over an image and other layers in neural networks. The application of convolutions depends on two parameters: the size of the filter and the stride between where the filter is applied. In lower layers, the speaker explains that strides of two are used because there is too much spatial content and the computation at every location is too expensive. However, doing so can cause a loss of information. The speaker also mentions that there is no unsupervised learning in the first few layers of the neural network, and that descriptive words like "abandoned" are already baked into the vocabulary.
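
Since the output positions of a convolution are determined by just the filter size and the stride (plus any padding), a small helper makes the relationship concrete; the 224-pixel input and the 7x7, stride-2 filter below are illustrative numbers, not figures quoted in the talk.

    def conv_output_size(input_size, filter_size, stride, padding=0):
        # Standard formula for how many positions a filter visits along one axis.
        return (input_size + 2 * padding - filter_size) // stride + 1

    # e.g. a 7x7 filter applied with stride 2 to a 224-pixel-wide input:
    print(conv_output_size(224, 7, 2, padding=3))   # 112 positions per row
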
Visualizing and Understanding Deep Neural Networks by Matt Zeiler
  • 2015.02.02
  • www.youtube.com
Matthew Zeiler, PhD, Founder and CEO of Clarifai Inc, speaks about large convolutional neural networks. These networks have recently demonstrated impressive ...
 

How ChatGPT is Trained

ChatGPT is a machine learning system that is designed to mimic human conversation. It is first trained using a generative pre-training approach that relies on massive amounts of unstructured text data, and then fine-tuned using reinforcement learning to better align with human preferences.

  • 00:00:00 ChatGPT is a machine learning system that is designed to mimic human conversation. It is trained using a generative pre-training approach that relies on massive amounts of unstructured text data.

  • 00:05:00 ChatGPT is a chatbot that is trained to respond to user requests in a human-like way. It does this by first conditioning the model on manually constructed examples illustrating the desired behavior, then using reinforcement learning to tune the model to human preferences.

  • 00:10:00 ChatGPT's reward model is trained using a ranking over K outputs for a given input. The reward model assigns a scalar score to each member of a pair of responses, and these scores are treated as logits, or unnormalized log probabilities: the greater the score, the greater the probability the model places on that response being preferred. Standard cross-entropy is used for the loss, treating the reward model as a binary classifier. Once trained, the scalar scores can be used as rewards, which enables more interactive training than the purely supervised setting. During the reinforcement learning stage, the policy model (the chatbot) is fine-tuned from the final supervised model. It emits actions, i.e. sequences of tokens, when responding to a human in a conversational environment; given a particular state (a conversation history) and a corresponding action, the reward model returns a numerical reward. The developers elect to use proximal policy optimization (PPO) as the reinforcement learning algorithm, which the video does not cover in detail but which has been a popular choice across different domains. The learned reward model being optimized against is a decent approximation to the true objective we care about, but it is still only an approximation, a proxy objective.
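
To make the pairwise objective concrete, here is a minimal sketch of the loss for a single comparison, assuming the standard reading of "cross-entropy over a pair of scalar scores" (the difference of the two scores is the logit of the preferred response winning); the video describes the idea but not this exact code.

    import numpy as np

    def pairwise_reward_loss(score_preferred, score_rejected):
        # Treat the reward model as a binary classifier over a pair of responses:
        # the logit is the difference of the two scalar scores, and the "label"
        # is that the preferred response should win, so standard cross-entropy
        # becomes -log(sigmoid(r_preferred - r_rejected)).
        diff = score_preferred - score_rejected
        return -np.log(1.0 / (1.0 + np.exp(-diff)))

    print(pairwise_reward_loss(2.0, 0.5))   # small loss: ranking already correct
    print(pairwise_reward_loss(0.5, 2.0))   # larger loss: ranking is wrong
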
How ChatGPT is Trained
  • 2023.01.24
  • www.youtube.com
This short tutorial explains the training objectives used to develop ChatGPT, the new chatbot language model from OpenAI. Timestamps: 0:00 - Non-intro 0:24 - Tr...
 

The REAL potential of generative AI

Generative AI has the potential to revolutionize the way products are created, by helping developers with prototyping, evaluation, and customization. However, the technology is still in its early stages, and more research is needed to ensure that it is used ethically and safely.

  • 00:00:00 The video discusses the potential benefits and challenges of using large language models, and goes on to explain how Humanloop can help developers build differentiated applications on top of these models.

  • 00:05:00 The video discusses how generative AI can help developers with prototyping, evaluation, and customization of their applications. It notes that the job of a developer is likely to change in the future, as AI technology helps augment their workflow.

  • 00:10:00 The video discusses the potential of generative AI, and discusses some of the obstacles to its widespread adoption. It notes that while the technology has great potential, it is still in its early stages, and more research is needed to ensure that it is used ethically and safely.

  • 00:15:00 The potential for generative AI is vast, with many potential uses in the near future. Startups should be prepared for a Cambrian explosion of new applications, some of which may be difficult to predict.

  • 00:20:00 This video discusses the potential of generative AI, and how it can be used to create new and innovative products.
The REAL potential of generative AI
  • 2023.02.28
  • www.youtube.com
What is a large language model? How can it be used to enhance your business? In this conversation, Ali Rowghani, Managing Director of YC Continuity, talks wi...
 

Vrije Universiteit Amsterdam Machine Learning 2019 - 1 Introduction to Machine Learning (MLVU2019)

This video provides an introduction to machine learning and covers various topics related to it. The instructor explains how to prepare for the course and addresses common concerns about machine learning being intimidating. He introduces the different types of machine learning and distinguishes it from traditional rule-based programming. The video also covers the basics of supervised learning and provides examples of how machine learning can be used for classification and regression problems. The concepts of feature space, loss function, and residuals are explained as well.

The second part of the video provides an introduction to machine learning and explains its main goal of finding patterns and creating accurate models to predict outcomes from a dataset. The speaker discusses the importance of using specific algorithms and data splitting to avoid overfitting and achieve generalization. He also introduces the concept of density estimation and its difficulties with complex data. The speaker clarifies the difference between machine learning and other fields and alludes to a strategy for breaking down big data sets in order to make accurate predictions. The video also mentions the increase of people working in machine learning with the development of deep learning and provides tips for beginners to get started in the field.

  • 00:00:00 In this section, the speaker talks about how to prepare for the machine learning course. They suggest that students read the main course materials carefully and focus on what is necessary, and point to a quiz students can use to test their understanding of what the instructor has covered. Students will be given homework and are allowed to use a printed sheet with formulas, with the remaining space available for handwritten notes in pen.

  • 00:05:00 In this section, the speaker addresses concerns about machine learning being scary and intimidating, especially for those without a background in computer science. He explains that the purpose of the project is to help individuals become comfortable with machine learning by providing datasets and resources to explore and experiment with. The speaker emphasizes the importance of collaboration and encourages the use of the provided worksheets and computing tools to facilitate learning.

  • 00:10:00 In this section, the speaker discusses the importance of group dynamics and communication skills in the field of machine learning. He emphasizes that being able to effectively work and communicate in groups is just as important as technical writing skills. The speaker also encourages participants to register for group sessions and reach out to others in the program to form effective working relationships. He advises participants to use the available resources, such as online discussion forums, to connect with other members in the program and create productive, collaborative relationships.

  • 00:15:00 In this section, the speaker introduces the different types of machine learning, starting with supervised machine learning. They explain that they will go over two types of supervised machine learning, classification and regression, with regression being discussed after the break. The speaker also mentions that they will briefly discuss unsupervised machine learning and explain why machine learning differs from traditional rule-based programming.

  • 00:20:00 In this section, the speaker distinguishes between traditional rule-based programming, which essentially follows a set of predetermined instructions, and machine learning, which is a process of using large sets of data to build predictive models that can be used to make decisions based on new data. Machine learning is useful in situations where decision-making needs to be fast, reliable, and incorruptible. However, it is important to remember that machine learning models are not perfect and can fail unexpectedly, so human input is still necessary to make final decisions. Clinical decision support is one example of how machine learning can be used to provide doctors with additional information to aid in their decision-making.

  • 00:25:00 In this section, the speaker explains the concept of online or incremental learning in machine learning. They state that online learning applies in situations where there is a constant stream of data and the model needs to keep updating while predicting new information, which is a difficult task; for that reason, they recommend simplifying the problem by separating learning from prediction, training on a fixed dataset and then using the resulting model to make predictions. Additionally, the speaker discusses how scientists in the 1950s and 60s used simple artificial brains called perceptrons to explore how the brain learns, using examples like training a perceptron to recognize the difference between males and females.

  • 00:30:00 In this section of the video, the speaker discusses the basics of machine learning and introduces the concept of supervised learning, where a machine is trained to classify data into specific categories based on input features. An example is given of classifying emails as either spam or not spam by measuring features such as the frequency of certain words. The goal is to feed this data to a learning algorithm that creates a model, which can then accurately predict the class of new, unseen examples. There are many different classification algorithms that can be used for this type of problem.
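
As an illustration of the spam example (not the lecture's own code), one could turn each email into word-frequency features and score them with a linear rule; the vocabulary, emails and hand-picked weights below are invented for the sketch, whereas a learning algorithm would fit the weights from data.

    from collections import Counter
    import numpy as np

    VOCAB = ["free", "winner", "meeting", "report"]

    def featurize(text):
        # Count how often each vocabulary word appears in the email.
        counts = Counter(text.lower().split())
        return np.array([counts[w] for w in VOCAB], dtype=float)

    emails = ["free winner free prize", "meeting report attached", "free meeting"]
    labels = np.array([1, 0, 0])                  # 1 = spam, 0 = not spam
    X = np.stack([featurize(e) for e in emails])  # one feature vector per email

    # A linear scoring rule w . x + b; the weights here are hand-picked rather
    # than learned, just to show what a learned model would consist of.
    w, b = np.array([1.0, 1.0, -1.0, -1.0]), -0.5
    print((X @ w + b > 0).astype(int))            # [1 0 0] -> matches the labels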

  • 00:35:00 In this section, the speaker gives two examples of how machine learning can be used for classification problems. The first example involves recognizing handwritten digits using image classification: 28x28 pixel images of the digits serve as the features, and the goal is to predict which digit is in the image. The second example involves using machine learning to teach a car how to drive, where data is collected from a camera and from sensors in the steering wheel; each frame is broken down into 960 pixel features used to classify the steering direction of the car.

  • 00:40:00 In this section, the speaker discusses how to build an algorithm to solve a regression problem. The example given is predicting the duration of a bus ride based on the number of passengers. The speaker also mentions that there is a page with a full schedule for the course, which is important due to the time changes between groups and occasional visuals that may change. Lastly, the speaker talks about using two features to predict the height of a person, which is an example of a supervised learning problem.

  • 00:45:00 In this section, the speaker introduces the concept of representing data in a feature space, with one axis per feature, which allows instances and their relationships to be visualized as points. By drawing a line in this space, a classifier can be created that divides the space into two areas, where one area represents everything above the line and the other everything below it. The linear classifier is the natural choice when using lines, and each line can be described by three numbers, the two weights and the bias that define it. A loss function, which is a computable function, counts how many examples a model gets wrong, and a lower value means a better model fit.

  • 00:50:00 In this section, the speaker provides examples of spaces and how they can be used to create models. He explains the concept of decision trees and how they can be complicated in a large space. He also demonstrates how the process of classification can be made simple and powerful, using a few variations on specification and diversification. Finally, the speaker touches on multi-class and multi-label classification and how they can be useful in instances where objects are not mutually exclusive.

  • 00:55:00 In this section, the speaker explains how to choose a sensible output space and class probability scores by creating features based on the important data. To evaluate a candidate line, a loss function based on residuals is used: a residual measures the distance between the model's predicted value and the actual output value. By plotting the residuals and calculating the sum of squared residuals, predictive accuracy can be improved, because minimizing the squared distances pulls the line toward the data.

  • 01:00:00 In this section, the speaker discusses the importance of using specific algorithms, such as the multiple linear regression, to analyze data and create models. He explains that these models are not always accurate due to overfitting, which is why the data should be split into different chunks and analyzed accordingly. The speaker also emphasizes that generalization is the most important aspect when creating machine learning algorithms to ensure that the model is able to accurately predict outcomes with new data.

  • 01:05:00 In this section, the video discusses machine learning and how it involves learning from a large amount of data. Machine learning models are built by putting data into a set of features and labels, with the goal of finding patterns and creating a model that can accurately predict a label based on the features. Techniques like k-means clustering can be used to group data points with similar features, which can help build more accurate models. Additionally, it's important to understand that finding an optimal model requires a lot of trial and error, and there is no straightforward way to know what will work best beforehand.

  • 01:10:00 In this section, the speaker introduces the concept of density estimation and how it helps in identifying the probability distribution of the data. Density estimation is done by assuming a distribution of interest and fitting it to the sample data; the model then predicts a probability density for every point in feature space, assigning a number that represents how likely such a point is. However, for complex data such as pictures of human beings, density estimation becomes difficult due to the high-dimensional features, and an alternative approach is needed that instead produces new, similar samples.
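
A minimal sketch of the idea, assuming a one-dimensional normal distribution (the data values below are invented): fit the distribution's parameters to the sample, then read off a density for any new point.

    import numpy as np

    data = np.array([1.7, 1.8, 1.65, 1.75, 1.9])     # e.g. heights in metres
    mu, sigma = data.mean(), data.std()              # fit the assumed distribution

    def density(x):
        # Probability density of x under the fitted normal distribution.
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    print(density(1.75))   # relatively high density: a plausible height
    print(density(2.50))   # close to zero: an unlikely height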

  • 01:15:00 In this section, the speaker mentions that there are fields other than machine learning that may confuse people into thinking they involve machine learning, such as city planning or path planning. However, these fields do not necessarily require much spending or time. The speaker also alludes to a strategy that will be discussed more in-depth the next week, which involves breaking down big data sets into smaller groups in order to make accurate predictions. This strategy is often used in fields such as voice recognition or character recognition.

  • 01:20:00 In this section, the speaker discusses the different ways of thinking about machine learning and the existing techniques and models that can be used for it. He also touches on how deep learning has contributed to the growing number and variety of people who work on machine learning. Additionally, he provides tips for beginners who want to get started with machine learning and mentions the availability of resources to help with their learning journey.
1 Introduction to Machine Learning (MLVU2019)
  • 2019.02.06
  • www.youtube.com
slides: https://mlvu.github.io/lectures/11.Introduction.annotated.pdf; course materials: https://mlvu.github.io; The first lecture in the 2019 Machine learning c...
 

2 Linear Models 1: Hyperplanes, Random Search, Gradient Descent (MLVU2019)

This video covers the basics of linear models, search methods, and optimization algorithms. Linear models are explained in both 2 dimensions and multiple dimensions, and the process of searching for a good model through methods such as random search and gradient descent is discussed. The importance of convexity in machine learning is explained, and the drawbacks of random search in non-convex landscapes are addressed. The video also introduces evolutionary methods and branching search as search methods. Finally, the use of calculus and gradient descent to optimize the loss function is explained, including the process of finding the direction of steepest descent for a hyperplane.

The second part discusses gradient descent and its application to linear models, where the algorithm updates the parameters by taking steps in the direction of the negative gradient of the loss function. The learning rate is crucial in determining how quickly the algorithm converges to the minimum, and linear functions allow one to work out the optimal model without having to search. However, more complex models require using gradient descent. The video also introduces classification and decision boundaries, where the goal is to separate blue points from red points by finding a line that does so optimally. Limitations of linear models include their inability to classify non-linearly separable datasets, but they are computationally cheap and work well in high dimensional feature spaces. The instructor also previews future topics that will be discussed, such as machine learning methodology.

  • 00:00:00 In this section, the speaker explains the basic recipe for machine learning, which involves abstracting a problem, choosing instances and features, selecting a model class, and searching for a good model. They then introduce linear models as the selected model class and discuss how to write them in mathematical language. They talk about search methods, including gradient descent, and emphasize that these methods are not specific to linear models and will come up in other contexts. The notation for describing datasets is also introduced, using superscripts to match instances and corresponding values. Finally, a simple regression dataset is used as a running example throughout the lecture.

  • 00:05:00 In this section, the speaker discusses linear models and how they can be used to map one space to another space. A linear model uses a function that describes a line to achieve this. The line function has two parameters, W and B, which represent the slope and bias respectively. The speaker explains that the number of features in a dataset can be arbitrary, and the model has to work with any number of features. For multiple features, each instance is represented as a vector using bold letter notation, and each of these vectors maps to a single value.

  • 00:10:00 In this section, the speaker explains how to extend the linear model from a line to a plane or hyperplane by assigning a weight to every feature and keeping a single bias value B. This function can be expressed as the dot product of W and X plus B, a simple operation on two vectors of the same length. The dot product can also be expressed as the product of the lengths of the two vectors times the cosine of the angle between them. The speaker also mentions an interesting principle: by adding simple features to a model, it can become more powerful. Finally, to find a good model, a loss function is needed, together with a way to search the space of all models for a value that minimizes that loss function.
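
A tiny sketch of that dot-product form of the linear model; the weights, bias and instance below are made-up numbers for the illustration.

    import numpy as np

    def linear_model(w, b, x):
        # One weight per feature plus a single bias, combined with a dot product.
        return np.dot(w, x) + b

    w = np.array([0.5, -1.0, 2.0])   # one weight per feature
    b = 0.1
    x = np.array([1.0, 2.0, 3.0])    # one instance with three features
    print(linear_model(w, b, x))     # 0.5 - 2.0 + 6.0 + 0.1 = 4.6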

  • 00:15:00 In this section, the speaker discusses the mean squared error loss function used in linear regression. The function measures the distance between the model prediction and the actual value, squares that distance, and sums up all the squared residuals to determine the loss; the lower the value, the better the model. The speaker explains why the function squares the values instead of using absolute values: to keep positive and negative errors from canceling out. The square also puts an extra penalty on outliers, making them weigh more heavily in the loss function. The section also briefly discusses model and feature spaces and how searching for low loss values in the loss landscape amounts to fitting a model to the data.
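
A minimal sketch of that loss; whether the squared residuals are summed or averaged does not change which model is best, and the toy data below is invented.

    import numpy as np

    def mse_loss(w, b, X, y):
        # Mean squared error: squared residuals keep positive and negative errors
        # from cancelling out and give outliers extra weight.
        residuals = X @ w + b - y
        return np.mean(residuals ** 2)

    X = np.array([[1.0], [2.0], [3.0]])
    y = np.array([1.2, 1.9, 3.2])
    print(mse_loss(np.array([1.0]), 0.0, X, y))   # 0.03: a reasonably good fit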

  • 00:20:00 In this section, the speaker explains that for a simple model, random search can be used to find good parameter values: start with a random point, then loop, picking another point very close to it, computing the loss for both points, and switching to the new point if its loss is better, continuing until good parameter values are reached. This is similar to a hiker navigating down a mountain in a snowstorm, taking small steps in every direction to feel where the slope goes down and stepping that way until reaching the valley. In machine learning settings, where the space is multi-dimensional, it is not possible to see the entire picture at once, so like the hiker in the snowstorm, the search takes small steps of a fixed size in random directions until it reaches good values.
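
The loop just described, as a short sketch; the bowl-shaped test loss, the step size and the iteration count are arbitrary choices for the illustration.

    import numpy as np

    def random_search(loss, w0, step=0.1, iters=1000, seed=0):
        # Start somewhere, repeatedly try a nearby point, and keep it only if
        # the loss improves (the "hiker in a snowstorm" picture).
        rng = np.random.default_rng(seed)
        w, best = w0, loss(w0)
        for _ in range(iters):
            candidate = w + step * rng.normal(size=w.shape)
            c_loss = loss(candidate)
            if c_loss < best:
                w, best = candidate, c_loss
        return w, best

    # Example: minimize a simple convex bowl.
    w, best = random_search(lambda w: float(np.sum(w ** 2)), np.array([3.0, -2.0]))
    print(w, best)   # close to the minimum at (0, 0)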

  • 00:25:00 In this section, the video discusses the concept of convexity in machine learning and its impact on using random search as a model search method. A convex loss surface, or one shaped like a bowl when graphed mathematically, has only one minimum, making it possible to find a global minimum. However, when a loss surface is not convex and has multiple local minima, random search can get stuck and converge on a local minimum. To address this, simulated annealing as a search method is introduced, which allows for a probability of moving uphill, allowing for the potential to escape local minima and find the global minimum.

  • 00:30:00 In this section, the video discusses the use of black-box optimization methods, such as random search and simulated annealing, to optimize a continuous or discrete model space by treating the loss function as a black box, which requires no knowledge of the internal workings of the model. It is noted that these methods can also be parallelized, running multiple searches simultaneously to increase the chances of finding the global optimum. Additionally, the video mentions that these optimization methods are often inspired by natural phenomena, such as evolution, particle swarms, and ant colonies.

  • 00:35:00 In this section, the speaker introduces the basic algorithm for an evolutionary search method that takes inspiration from evolution. This method starts with a population of models, computes their losses, ranks them, discards the worse half of the population, and breeds the other half to make a new population; the new models are selected based on the properties of the old ones, and some variation is added to the population through mutation. The speaker also explains a branching search method, a variation of random search where, instead of picking one random direction, K random directions are chosen and the direction with the lowest loss is selected. The speaker concludes by noting the flexibility and power of evolutionary methods, but cautions that they are computationally expensive and require parameter tuning.
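
A hedged sketch of that evolutionary loop, using mutation only (the breeding/recombination step mentioned above is omitted); the population size, mutation scale and toy loss are arbitrary choices for the illustration.

    import numpy as np

    def evolutionary_search(loss, dim=2, pop_size=20, generations=100, seed=0):
        # Keep a population of models, rank them by loss, keep the better half,
        # and refill the population with mutated copies of the survivors.
        rng = np.random.default_rng(seed)
        pop = rng.normal(size=(pop_size, dim))
        for _ in range(generations):
            ranked = pop[np.argsort([loss(p) for p in pop])]
            survivors = ranked[: pop_size // 2]
            children = survivors + 0.1 * rng.normal(size=survivors.shape)  # mutation
            pop = np.concatenate([survivors, children])
        return min(pop, key=loss)

    best = evolutionary_search(lambda w: float(np.sum(w ** 2)))
    print(best)   # near the optimum at (0, 0)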

  • 00:40:00 In this section, the speaker discusses different search methods for finding the optimal model for a given problem. As the number of directions examined increases, more time is spent exploring the local curvature, which leads to a more direct path towards the optimum: instead of taking a random step, one can spend more time understanding the local neighborhood and figuring out the best direction before moving. The speaker then introduces gradient descent, which looks at the loss function and uses calculus to calculate the direction in which the function decreases the quickest. This method requires the loss function to be differentiable, smooth and continuous, and it is no longer a black-box approach.

  • 00:45:00 In this section, the speaker discusses slopes and tangent lines in relation to the loss function. The loss surface is not a linear function, but the slope of the tangent line, which represents the derivative of the loss function, can give an indication of the direction and speed at which the function is decreasing. In higher dimensions, the equivalent of the tangent line is the tangent hyperplane, which can also give us the direction in which the loss surface is decreasing the quickest. The lecture also touches on the interpretation of vectors as a point in space or a direction, which is useful when dealing with linear functions such as hyperplanes.

  • 00:50:00 In this section, the speaker discusses how to generalize the derivative to multiple dimensions and how to find the direction of steepest ascent of a hyperplane. The equivalent of taking the derivative in multiple dimensions is computing the gradient, a vector consisting of the partial derivatives with respect to each of the inputs; these values are exactly the parameters that define the tangent hyperplane. The direction of steepest ascent follows from the fact that the dot product of W with a unit vector X equals the norm of W times the cosine of the angle between them, which is maximized when the angle is zero, that is, when X points in the same direction as W. Thus, the direction of steepest ascent is W itself, and the direction of steepest descent is the opposite direction.

  • 00:55:00 In this section, the speaker explains a simple algorithm for finding the minimum of a loss function: gradient descent. The algorithm starts with a random point in model space, computes the gradient of the loss at that point, multiplies it by a small value called eta (the learning rate), and then subtracts the result from the model. There is no randomness, only purely deterministic steps, and the gradient provides both the direction and, together with the learning rate, the step size. The speaker then calculates the gradient for a loss landscape using calculus, applying the sum and chain rules, and ends up with the two-dimensional vector of the derivatives of the loss function with respect to W and B.
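
Putting the pieces together, here is a short sketch of gradient descent for the one-feature linear model with a mean-squared-error loss; the learning rate, step count and toy data are arbitrary, and the lecture's exact derivation may differ by a constant factor.

    import numpy as np

    def gradient_descent(x, y, lr=0.01, steps=2000):
        # Fit y ~ w * x + b by repeatedly stepping against the gradient of the
        # mean squared error; grad_w and grad_b are the two partial derivatives.
        w, b = 0.0, 0.0
        n = len(x)
        for _ in range(steps):
            err = w * x + b - y                  # residuals at the current model
            grad_w = (2 / n) * np.sum(err * x)   # dL/dw
            grad_b = (2 / n) * np.sum(err)       # dL/db
            w -= lr * grad_w                     # step against the gradient
            b -= lr * grad_b
        return w, b

    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = 2.0 * x + 1.0                            # data generated by w=2, b=1
    print(gradient_descent(x, y))                # approximately (2.0, 1.0)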

  • 01:00:00 In this section, the speaker discusses the implementation of gradient descent in Python and how it allows for a step in the direction of the vector, following the curvature of the surface, to find the minimum and stay there. To demonstrate this, they introduce a website called playground.tensorflow.org, which allows users to experiment with a simple linear model using gradient descent. However, the speaker also points out that gradient descent has some limitations, such as the need to pick the learning rate and the potential for being stuck in a local minimum.

  • 01:05:00 In this section, the video discusses gradient descent in more detail and its application to linear models. With gradient descent, the algorithm updates the parameters by taking steps in the direction of the negative gradient of the loss function, and this process repeats until it reaches a minimum. The learning rate determines how big each step is, and it is crucial to find a learning rate that is not too big or too small, as it affects how quickly the algorithm converges to the minimum. Linear functions allow one to work out the optimal model without having to search. However, more complex models require using gradient descent. Gradient descent is fast, low memory, and accurate but does not escape local minima and only works on continuous model spaces with smooth loss functions. Finally, the video introduces classification and decision boundaries, where the goal is to separate blue points from red points by finding a line that does so optimally in feature space.

  • 01:10:00 In this section, the speaker discusses the process of finding a classifier for a simple classification dataset consisting of six instances. To do this, they need a loss function that can be used to evaluate potential linear models, or planes, for the dataset, with the aim of minimizing the number of misclassified points. However, the loss function they initially use is not suitable for searching for the optimal model because it is flat almost everywhere, making random search and gradient descent ineffective. The speaker then states that sometimes the loss function should be different from the evaluation function, and presents a loss function that has its minimum around the desired point but is smooth everywhere.

  • 01:15:00 In this section, the lecturer demonstrates how the least-squares principle used in regression can be applied to classification by assigning numeric values to the classes and treating the problem as a regression problem. This approach works well for classifying linearly separable points, but there is no guarantee that it will separate clusters that are not linearly separable. They show how the gradient descent algorithm works by taking deterministic steps, which can be followed in feature space as the decision boundary moves, to minimize the loss function. The example used is a dataset with linearly separable points, and the lecturer also highlights how limited linear models are in what they can express, as shown by the example of a dataset with complex boundaries.
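
A toy sketch of the least-squares-as-classification trick on a linearly separable one-dimensional dataset; the data and the plus/minus-one coding are invented for the illustration.

    import numpy as np

    # Treat a two-class problem as regression: label the classes +1 and -1, fit
    # a least-squares line to those targets, and classify by the sign of the
    # prediction.
    x = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
    y = np.array([-1, -1, -1, 1, 1, 1])           # two classes as regression targets

    X = np.hstack([x, np.ones((len(x), 1))])      # add a column of ones for the bias
    w, *_ = np.linalg.lstsq(X, y, rcond=None)     # least-squares fit
    predictions = np.sign(X @ w)
    print(predictions)                            # [-1. -1. -1.  1.  1.  1.]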

  • 01:20:00 In this section, the instructor discusses the limitations of linear models and how they may fail to classify non-linearly separable datasets, such as a dataset that has a spiral pattern. However, linear models can work well in high dimensional feature spaces and are also computationally cheap. The instructor explains that stochastic gradient descent is a powerful optimization tool but requires a smooth loss function to be used as a proxy for discrete loss functions. The instructor concludes by previewing future topics that will be discussed, such as machine learning methodology.
2 Linear Models 1: Hyperplanes, Random Search, Gradient Descent (MLVU2019)
  • 2019.02.07
  • www.youtube.com
slides: https://mlvu.github.io/lectures/12.LinearModels1.annotated.pdf; course materials: https://mlvu.github.io; In this lecture, we discuss the linear models: ...
 

3 Methodology 1: Area-under-the-curve, bias and variance, no free lunch (MLVU2019)

The video covers the use of the area-under-the-curve (AUC) metric in evaluating machine learning models, as well as introducing the concepts of bias and variance, and the "no free lunch" theorem. The AUC metric measures the classification model's performance by calculating the area under the ROC curve. Additionally, bias and variance are discussed as they play a crucial role in how well the model fits the training data and generalizes to new data. Also, the "no free lunch" theorem highlights the need to select the appropriate algorithm for each specific problem since there is no universally applicable algorithm for all machine learning problems.

This video covers three important machine learning concepts: AUC (area-under-the-curve), bias and variance, and the "no free lunch" theorem. AUC is a metric used to evaluate binary classification models, while bias and variance refer to differences between a model's predicted values and the true values in a dataset. The "no free lunch" theorem highlights the importance of selecting the appropriate algorithm for a given problem, as there is no single algorithm that can perform optimally on all possible problems and datasets.

  • 00:20:00 In this section, the speaker discusses the first methodology for evaluating machine learning models, the area-under-the-curve (AUC) metric. The AUC measures the performance of classification models by calculating the area under the receiver operating characteristic (ROC) curve. The speaker also introduces the concepts of bias and variance, which measure how well a model fits the training data and how well it generalizes to new data, respectively. Finally, the speaker explains the "no free lunch" theorem, which states that there is no one-size-fits-all algorithm for all machine learning problems and emphasizes the importance of selecting the appropriate algorithm for each specific problem.

  • 01:10:00 In this section, the speaker introduces three key concepts in machine learning methodology: area-under-the-curve (AUC), bias and variance, and the "no free lunch" theorem. AUC is a metric used to evaluate the performance of binary classification models and represents the probability that a model will rank a randomly chosen positive example higher than a randomly chosen negative example. Bias refers to the difference between the expected value of a model's predictions and the true values in the dataset, while variance refers to the variance in a model's predictions when trained on different datasets. The "no free lunch" theorem states that there is no one algorithm that can perform best on all possible problems and datasets, underscoring the importance of selecting the appropriate algorithm for a given problem.
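
The rank-based reading of AUC in this item translates directly into a few lines of code; the scores below are invented for the illustration.

    def auc(scores_pos, scores_neg):
        # AUC as the probability that a randomly chosen positive example is
        # ranked above a randomly chosen negative one (ties count as half).
        wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
        return wins / (len(scores_pos) * len(scores_neg))

    print(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))   # 8 of 9 pairs ranked correctly
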
3 Methodology 1: Area-under-the-curve, bias and variance, no free lunch (MLVU2019)
  • 2019.02.12
  • www.youtube.com
slides: https://mlvu.github.io/lectures/21.Methodology1.annotated.pdf; course materials: https://mlvu.github.io; In this lecture, we discuss the practicalities t...
 

4 Methodology 2: Data cleaning, Principal Component Analysis, Eigenfaces (MLVU2019)

This first part of the video covers various important aspects of data pre-processing and cleaning before applying machine learning algorithms, starting with the crucial importance of understanding data biases and skew. The speaker then discusses methods for dealing with missing data, outliers, class imbalance, feature selection, and normalization, ranging from manipulating the training set to using imputation methods. The video goes on to discuss the concept of a basis and the multivariate normal (MVN) distribution, explaining how whitening can be used to transform data into a standard normal distribution for normalization, and concludes with the use of principal component analysis (PCA) for dimensionality reduction, which projects data down to a lower-dimensional space while retaining as much information from the original data as possible.

This second part of the video discusses the use of Principal Component Analysis (PCA) in data cleaning and dimensionality reduction for machine learning. The method involves mean centering the data, computing the sample covariance, and decomposing it using eigen decomposition to obtain the eigenvectors aligned with the axis that capture the most variance. Using the first K principal components provides a good data reconstruction, allowing for better machine learning performance. The concept of Eigenfaces is also introduced, and PCA is shown to be effective in compressing the data to 30 dimensions while maintaining most of the required information for machine learning. Various applications of PCA are discussed, including its use in anthropology and in the study of complex datasets such as DNA and faces.

  • 00:00:00 In this section of the video, the presenter discusses the basics of data cleaning and pre-processing before applying machine learning algorithms. The importance of not taking data at face value is emphasized by discussing survivorship bias, where only focusing on the surviving population can lead to skewed results. The presenter then discusses techniques such as dealing with missing data, outliers, class imbalance, feature selection, and normalization. Finally, the second half of the video is focused on discussing dimensionality reduction using the principal component analysis algorithm.

  • 00:05:00 In this section, the video introduces practical tips for data cleaning and handling missing data in a data set, including removing missing features or instances that are not significant and making sure that the removal doesn't change the data distribution. Rather than removing missing values, it might be more useful to keep them for the training data and test the model's responses. To maximize the amount of training data, an imputation method that fills in guesses is available for the missing data, like using the mode or mean value. The guiding principle for dealing with missing data is considering the real-world use case, or the production environment, to prepare the model to deal with expected missing data in the most relevant and practical way.

  • 00:10:00 In this section, the speaker discusses two types of outliers in data: mechanical and natural outliers. Mechanical outliers occur due to errors such as missing data or mistakes in data entry, and should be treated as missing data to be cleaned up. On the other hand, natural outliers occur due to non-normal distribution of certain variables and should be kept in the dataset to ensure a better fit. The speaker provides examples of both types of outliers, including unusual face features in a dataset of faces and extremely high incomes in a dataset of income distribution.

  • 00:15:00 In this section, the importance of checking for assumptions of normality in data is discussed. Linear regression, for example, is based on these assumptions, so it is important to check for normality and be aware that assumptions can hide in models without being known. Outliers should also be considered when modeling and validating data, and it is important to test models with a training set that represents production situations to ensure that the models can handle outliers appropriately. Additionally, the importance of transforming data into categorical or numeric features for machine learning algorithms and the potential loss of information involved in such transformations is discussed.

  • 00:20:00 In this section, the speaker discusses the importance of selecting the right features for machine learning algorithms and how to extract meaningful information from data. They explain that simply interpreting numbers like phone numbers as numeric values is not useful, and instead suggest looking for categorical features such as area codes or mobile vs. landline status. In cases where a machine learning algorithm only accepts numeric features, the speaker recommends using one-hot encoding instead of integer encoding to avoid imposing an arbitrary order on the data. The goal is to extract the necessary information without losing essential details, selecting features that accurately and effectively convey the information needed for the task at hand.
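
A small sketch of the difference between integer coding and one-hot coding, using an invented phone-line feature in the spirit of the example above:

    import numpy as np

    categories = ["landline", "mobile", "voip"]
    values = ["mobile", "landline", "mobile", "voip"]

    # Integer coding imposes an arbitrary order (landline < mobile < voip).
    integer_coded = np.array([categories.index(v) for v in values])

    # One-hot coding gives each category its own 0/1 column, with no implied order.
    one_hot = np.eye(len(categories))[integer_coded]
    print(one_hot)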

  • 00:25:00 In this section, the speaker discusses the value of expanding features to make a model more powerful. Using the example of a dataset for email spam classification, the speaker explains how neither of two interrelated features can be interpreted without knowing the value of the other, making it impossible for a linear classifier to draw a boundary between the classes. To address this limitation, the speaker discusses adding a cross-product feature, which multiplies the values of the existing features, allowing a classification boundary to be drawn in the higher-dimensional feature space even though the data is not linearly separable in the original space. The speaker then gives an example of a class of points with a circular decision boundary to further illustrate the importance of expanding features.

  • 00:30:00 In this section, the speaker explains how adding extra features can help a linear classifier solve classification problems. By adding the square of the x and y coordinates as features to a decision boundary problem, a linear classifier can be used to distinguish between two classes of points. The speaker shows how, using the TensorFlow Playground, training the classifier results in a decision boundary that to the human eye appears to be circular. The weights of the features are also shown and it is demonstrated that only one feature is necessary to solve this classification problem.
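
A minimal sketch of the same idea outside the playground, on invented data with a circular class boundary (the data generation and the simple least-squares fit are illustrative, not the lecture's exact setup):

    import numpy as np

    rng = np.random.default_rng(0)

    # Class 0 lies inside a circle, class 1 outside it.
    r = np.concatenate([rng.uniform(0.0, 0.8, 100), rng.uniform(1.2, 2.0, 100)])
    angle = rng.uniform(0.0, 2 * np.pi, 200)
    X = np.column_stack([r * np.cos(angle), r * np.sin(angle)])
    y = np.concatenate([np.zeros(100), np.ones(100)])

    # Expand the feature space with the squared coordinates.
    X_expanded = np.column_stack([X, X[:, 0] ** 2, X[:, 1] ** 2])

    # A linear model on the expanded features can separate the classes, because
    # x^2 + y^2 is a linear function of the new features even though the boundary
    # is a circle in the original space.
    Xb = np.column_stack([X_expanded, np.ones(len(X_expanded))])   # add a bias column
    w, *_ = np.linalg.lstsq(Xb, 2 * y - 1, rcond=None)
    predictions = (Xb @ w > 0).astype(float)
    print("accuracy:", (predictions == y).mean())   # should be (close to) 1.0 here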

  • 00:35:00 In this section of the video, the speaker discusses how expanding the feature space can lead to a more powerful model, even for regression. They illustrate this point by showing how adding a squared variable to a linear regression model results in a parabola that fits the data better. The speaker also advises on dealing with class imbalance, suggesting manipulating the training set through techniques such as oversampling or data augmentation. Finally, they introduce the topic of normalization and provide a motivating example of how differences in units can affect the performance of a K nearest neighbor classification model.

  • 00:40:00 In this section of the video, the speaker discusses the importance of normalizing data for machine learning algorithms. They explain three rescaling methods: normalization, standardization, and whitening. Normalization squeezes each feature into the range between zero and one, whereas standardization makes the mean of each feature zero and its variance one. The third method, whitening, is a slightly nicer version that also takes the correlations in the data into account, reducing the data to a sphere in feature space. The speaker explains that whitening is also a useful stepping stone towards dimensionality reduction.
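
A short sketch of the three rescaling methods on invented correlated data (the whitening step here uses an eigendecomposition of the covariance matrix; the lecture's exact recipe may differ):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 2.0]]) + np.array([10.0, -5.0])

    # Normalization: squeeze each feature into the range [0, 1].
    X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

    # Standardization: zero mean and unit variance per feature (correlations remain).
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # Whitening: also remove the correlations, so the data becomes a sphere.
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    X_white = (Xc @ eigvecs) / np.sqrt(eigvals)
    print(np.cov(X_white, rowvar=False).round(3))   # approximately the identity matrix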

  • 00:45:00 In this section, the speaker explains the concept of whitening data, which involves transforming the data into an uncorrelated feature set. The speaker uses linear algebra to demonstrate how to choose a different basis for the data by picking two other vectors for a new system of axes. The blue point, originally represented as (3,2) in the standard coordinate system, is recalculated with respect to the new basis system and has new coordinates of (2.5, 0.5). This leads to the generalized notation of sticking the basis vectors into a matrix as columns.
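
A small numerical check of the change of basis. The basis vectors (1, 1) and (1, -1) are an assumption chosen because they reproduce the coordinates mentioned above: 2.5 * (1, 1) + 0.5 * (1, -1) = (3, 2).

    import numpy as np

    B = np.array([[1.0,  1.0],
                  [1.0, -1.0]])         # assumed basis vectors as columns: (1, 1) and (1, -1)
    point = np.array([3.0, 2.0])        # coordinates in the standard basis

    coords = np.linalg.solve(B, point)  # solve B @ coords = point
    print(coords)                       # [2.5 0.5] with respect to the new basis
    print(B @ coords)                   # back to [3. 2.] in the standard basis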

  • 00:50:00 In this section, the speaker discusses the concept of a basis and how to transform between different bases. Because the matrix inverse is expensive and numerically imprecise, an orthonormal basis is preferred: its basis vectors have length one and are orthogonal to each other, so the inverse is simply the matrix transpose. The speaker then explains how the multivariate normal distribution generalizes the normal distribution to multiple dimensions and can help in interpreting data: the mean becomes a vector and the variance becomes a covariance matrix. The speaker also briefly explains the formula for the sample covariance used to fit a multivariate normal distribution to the data.

  • 00:55:00 In this section, the standard multivariate normal (MVN) distribution is introduced: it has mean zero, variance one in every direction, and no correlations, and any other MVN distribution can be obtained from it by a linear transformation. Whitening reverses this transformation, mapping the data back to (approximately) the standard normal form. The section then turns to reducing the dimensionality of high-dimensional data through principal component analysis (PCA), a method that performs both whitening and dimensionality reduction. By finding new features derived from the original features that retain as much relevant information as possible, PCA projects the data down to a lower-dimensional space while retaining the essential information from the original data.

  • 01:00:00 In this section of the video, the presenter discusses Principal Component Analysis (PCA) and how it orders dimensions by the variance they capture, allowing for useful data reconstruction and dimensionality reduction. The presenter explains that eigenvectors are special vectors whose direction doesn't change under a transformation, and how they can be used to find the directions of maximum variance in the original data. The presenter also explains how to find the eigenvectors of a diagonal matrix, and how a rotation can be used to align the eigenvectors with the axes.

  • 01:05:00 In this section, we learn how to use principal component analysis (PCA) to preprocess data for machine learning algorithms. We first mean-center the data to remove translation, then compute the sample covariance and decompose it with an eigendecomposition. Transforming the data to the basis of eigenvectors aligns the directions of greatest variance with the axes, so we can keep only the first K dimensions, the ones that capture the most variance, and discard the rest. This results in a significant reduction in dimensionality, allowing for better machine learning performance.
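
A minimal PCA sketch following these steps; the helper name and toy data are invented for illustration:

    import numpy as np

    def pca_reduce(X, k):
        # Mean-center, eigendecompose the sample covariance, and keep the
        # k directions that capture the most variance.
        mean = X.mean(axis=0)
        Xc = X - mean
        eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
        order = np.argsort(eigvals)[::-1]            # sort by variance, descending
        components = eigvecs[:, order[:k]]           # first k principal directions
        Z = Xc @ components                          # reduced representation
        X_reconstructed = Z @ components.T + mean    # map back to the original space
        return Z, X_reconstructed

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
    Z, X_rec = pca_reduce(X, k=2)
    print(Z.shape, np.mean((X - X_rec) ** 2))   # (100, 2) and the reconstruction error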

  • 01:10:00 In this section, the presenter explains the idea behind dimensionality reduction using principal component analysis (PCA). The goal is to use far fewer dimensions while retaining as much of the variance in the data as possible. Maximizing the variance of the projection is the same as minimizing the reconstruction error, the loss function that measures the difference between the original and the projected data. The first principal component is the line that captures the most variance, and each following component captures as much as possible of the remaining variance. Using the first K principal components provides a good reconstruction of the data.

  • 01:15:00 In this section, the speaker discusses using principal component analysis (PCA) in research applications. One such application is in the field of anthropology, where it can be used to quantify and demonstrate the characteristics of fossilized bones. By taking measurements of different aspects of the bone and creating a high-dimensional space of features for comparison, PCA can then be used to reduce the dimensions of the data down to two principal components, allowing for visual clustering and outlier identification. Additionally, PCA has been applied to the study of DNA in European populations, wherein the DNA is transformed into a high-dimensional feature vector, and PCA can be used to reveal patterns and clusters in the data.

  • 01:20:00 In this section, the speaker discusses how principal component analysis (PCA) can be applied to a dataset of DNA features and how it can be used to recover the rough shape of Europe. By looking at the two principal components of a DNA dataset colored by country of origin, one can tell roughly how far north or east/west a person or their ancestors lived. PCA is often seen as a magical method because of its ability to provide insights into complex datasets, as with the eigenvectors of a dataset of faces used in eigenfaces. By computing the mean of a dataset of faces and looking at the eigenvectors of its covariance, PCA provides directions in the high-dimensional space of face images.

  • 01:25:00 In this section, the speaker discusses the concept of Eigenfaces and how Principal Component Analysis (PCA) helps in data cleaning. By adding a tiny amount of the first eigenvector to the mean face, the speaker demonstrates how this direction corresponds to age in facial features. The second and fourth eigenvectors correspond to lighting and gender, respectively, and the fifth eigenvector indicates how open or closed the mouth is. The eigenvectors act as the basis for the new space, and compressing the data to 30 dimensions provides a good representation of the original face. The inflection point occurs around 30 eigenvectors, after which the remaining details can be discarded while maintaining most of the information required for machine learning.
4 Methodology 2: Data cleaning, Principal Component Analysis, Eigenfaces (MLVU2019)
  • 2019.02.14
  • www.youtube.com
slides: https://mlvu.github.io/lectures/22.Methodology2.annotated.pdf
course materials: https://mlvu.github.io
In this lecture we discuss how to prepare your d...
 

Lecture 5 Probability 1: Entropy, (Naive) Bayes, Cross-entropy loss (MLVU2019)



5 Probability 1: Entropy, (Naive) Bayes, Cross-entropy loss (MLVU2019)

The video covers various aspects of probability theory and its application in machine learning. The speaker introduces entropy, which measures the amount of uncertainty in a system, and explains how it is related to naive Bayes and cross-entropy loss. The concepts of sample space, event space, random variables, and conditional probability are also discussed. Bayes' theorem is explained and considered a fundamental concept in machine learning. The video also covers maximum likelihood estimation principle and Bayesian probability, as well as the use of prefix-free code to simulate probability distributions. Lastly, the speaker discusses discriminative versus generative classifiers for binary classification, including the Naive Bayes classifier.

The second part explains how to compute the probability that a new point belongs to a particular class using a multivariate normal distribution model. It discusses assuming conditional independence of the features so the probability distributions for a classifier can be fitted efficiently, and the need for smoothing (adding pseudo-observations) to handle features with zero observed instances. The speaker also introduces the cross-entropy loss as a more effective loss function for linear classifiers than accuracy, discusses its ability to measure the difference between what the model predicts and what the data says, and shows how the symmetries of the sigmoid function collapse the loss into a simpler form. Finally, the video hints that the next lecture will cover the SVM loss as the final loss function.

  • 00:00:00 In this section of the video on probability, the speaker starts by advising students to join a group project if they haven't already, and to not worry too much about finding a perfect group but instead to make the best of what they get. The speaker then introduces probability theory and entropy, which is closely related and useful in machine learning. He explains that entropy, in this context, means measuring the amount of uncertainty or randomness in a system. The concept of entropy is important in machine learning and is used to explain naive Bayes and cross-entropy loss, which will be discussed later in the lecture. The lecture will also cover the basics of classification and linear classifiers.

  • 00:05:00 In this section, the speaker discusses loss functions and introduces the cross-entropy loss, which is considered to be a very good loss function. They present an example involving a teenager's online gambling and explain how probabilities work in this scenario. The speaker also touches on the concept of frequency and probability and how it applies in real-life situations.

  • 00:10:00 In this section, the speaker discusses the difference between subjective and objective probabilities. They explain that subjective probability is based on personal beliefs and experiences, while objective probability is based on frequentist probability, which is derived from experiments and observations. The speaker notes that in machine learning, the focus is on minimizing loss on the test set based on the training set, and that probability theory is used as a mathematical framework to describe probabilities. The speaker also introduces the concept of random variables and sample space.

  • 00:15:00 In this section, the video explains the concepts of sample space and event space in probability theory. The sample space contains all possible outcomes; in a discrete sample space, no two outcomes have another outcome in between them. The event space is a set of subsets of the sample space, making it possible to assign probabilities to events such as getting an odd or even number on a die roll. Probabilities can be assigned to both discrete and continuous sample spaces. Additionally, the video discusses using random variables and features to model probabilistic data sets, which helps explain the likelihood of different outcomes.

  • 00:20:00 In this section, the speaker introduces the basic concepts of probability, including random variables and their representation as functions. The speaker explains that although a random variable is formally a function, it is written as a single letter and takes on a specific value when instantiated. They also discuss the equals notation, and how a random variable can be referred to either as the function itself or by a specific value it takes. The speaker then gives an example of an event space defined by two random variables, X and Y, and introduces the concept of conditional probability.

  • 00:25:00 In this section, the speaker discusses probabilities and how they can be rewritten and projected to determine the probability of different events. They explain that if two variables are independent, knowing the value of one will not change the probability of the other. The speaker then uses the example of two people living in different parts of a city to illustrate how the probability of one person being on time for work does not affect the probability of the other person being on time. However, they note that there is one rare possibility where the two people's probabilities could be connected.

  • 00:30:00 In this section, the speaker discusses probability and Bayes' theorem, which is a fundamental concept in machine learning. The speaker uses an example of a traffic jam to explain conditional independence and how knowing Alice is late for work slightly increases the belief that Bob is late too. Bayes' theorem is considered the most important formula in the field and explains how to turn the conditional probability around. Finally, the speaker explains how machine learning fits a probability distribution to data and how the frequentist approach determines the best parameters given the information available.
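
For reference, Bayes' theorem ("turning the conditional probability around") in LaTeX notation:

    P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}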

  • 00:35:00 In this section, the speaker discusses the maximum likelihood estimation principle and Bayesian probability. Maximum likelihood estimation assumes the observed data points are independent and chooses the parameters that maximize the probability (likelihood) of those points. Bayesian probability, on the other hand, involves updating one's beliefs based on prior knowledge and observed data, expressing belief about the parameters as a distribution. In practice, machine learning often uses a compromise between the frequentist and Bayesian views, which works well.

  • 00:40:00 In this section, the speaker discusses probability distributions and how to simulate them using only coin flips. A prefix-free code, or prefix tree, of coin flips can be used to generate a wide range of probability distributions. The speaker explains that this approach can be used for communication and for working out the probability of certain outcomes in various scenarios. The example of using a coin to simulate a three-sided die and achieve a uniform distribution is also provided.

  • 00:45:00 In this section, the speaker discusses a family of probability distributions that can be described using prefix-free codes. This coding view is efficient for data and provides a useful connection between description methods and probability distributions; its main use here is to explain entropy, the measure of uncertainty in a random variable. The speaker explains how such a code can be used to encode data from a certain probability distribution, and how a code corresponds to a probability distribution that is a good fit for the given data.

  • 00:50:00 In this section, the speaker discusses entropy and cross-entropy as measures on probability distributions. Entropy measures how spread out a distribution is: it is largest for a uniform distribution and small for a strongly peaked one. Cross-entropy represents the expected code length when a code optimized for a different distribution is used; it is always equal to or larger than the entropy, and the difference is zero only when the two distributions are the same. These measures quantify the distance between two probability distributions and provide a theoretical basis for analyzing a data set as a sequence of random variables.
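
For reference, the standard definitions behind this discussion, in LaTeX notation (the base-2 logarithm matches the expected-code-length reading; the gap between cross-entropy and entropy is zero only when the two distributions are equal):

    H(p) = -\sum_x p(x) \log_2 p(x)
    H(p, q) = -\sum_x p(x) \log_2 q(x) \;\ge\; H(p)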

  • 00:55:00 In this section, the speaker explains the concepts of discriminative and generative classifiers for binary classification. Discriminative classifiers learn only how to discriminate between instances of the classes, while generative classifiers model the probability of the data given each class. Generative classifiers range from the Bayes optimal classifier to the Naive Bayes classifier, which makes a conditional independence assumption that is not strictly correct but still works very well and is cheap to compute.

  • 01:00:00 In this section, the speaker explains how to compute the probability that a new point belongs to a particular class using a multivariate normal distribution model. By estimating the probability distributions and filling them in, we can assign each class a probability and pick the one with the highest likelihood. However, in high dimensions there may not be enough data to fit the full model accurately, in which case each feature can instead be modeled separately with a categorical distribution (a Bernoulli distribution for binary features).

  • 01:05:00 In this section, the concept of conditional independence of features is explained, which allows the probability distributions for a classifier to be fitted efficiently. However, a single zero probability can greatly impact the accuracy of the classifier, which is resolved by smoothing: adding pseudo-observations so that there is at least one observation for every feature value. This ensures that the probability never becomes zero and the classifier's accuracy is not negatively impacted.

  • 01:10:00 In this section, the speaker discusses ways to avoid skewed results in machine learning models by ensuring that there is at least one observation for every possible class and feature value. They summarize generative classifiers as making independence assumptions that work well with big and high-dimensional datasets, but requiring Laplace smoothing to handle zero counts. The speaker then introduces the cross-entropy loss as a more effective loss function for linear classifiers than accuracy.
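
A minimal sketch of a categorical naive Bayes classifier with Laplace smoothing (pseudo-observations); the function names and the tiny spam example are invented for illustration:

    import numpy as np

    def train_naive_bayes(X, y, pseudo=1.0):
        # X: integer feature matrix (n_samples, n_features); y: integer class labels.
        classes = np.unique(y)
        n_values = X.max(axis=0) + 1                  # number of possible values per feature
        priors = {c: np.mean(y == c) for c in classes}
        likelihoods = {}
        for c in classes:
            Xc = X[y == c]
            likelihoods[c] = [
                (np.bincount(Xc[:, j], minlength=n_values[j]) + pseudo)
                / (len(Xc) + pseudo * n_values[j])    # smoothing: no zero probabilities
                for j in range(X.shape[1])
            ]
        return priors, likelihoods

    def predict(x, priors, likelihoods):
        # Work in log space and pick the class with the highest posterior score.
        scores = {c: np.log(priors[c]) + sum(np.log(likelihoods[c][j][v])
                                             for j, v in enumerate(x))
                  for c in priors}
        return max(scores, key=scores.get)

    # Toy spam example: features are (contains "offer", contains a link).
    X = np.array([[1, 1], [1, 0], [0, 0], [0, 1]])
    y = np.array([1, 1, 0, 0])                        # 1 = spam, 0 = ham
    priors, likelihoods = train_naive_bayes(X, y)
    print(predict(np.array([1, 1]), priors, likelihoods))   # 1 (spam)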

  • 01:15:00 In this section, the speaker explains how instead of assigning values to classifier models, probabilities can be assigned by using the logistic sigmoid function. The linear model is still used, but it is squeezed into the range between 0 and 1. This method allows for a more accurate interpretation of positive and negative instances.

  • 01:20:00 In this section, the presenter explains the cross-entropy loss function, which measures the difference between what a machine learning model predicts and what the data says. The loss maximizes the probability the model assigns to the observed classes, pictured in the lecture as blue lines: training pushes those lines up by minimizing the sum of their negative logarithms.
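
A minimal sketch of the binary cross-entropy (log) loss on top of a linear model and a sigmoid; the names and toy numbers are invented for illustration:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cross_entropy_loss(w, b, X, y):
        # Negative log-probability of the observed 0/1 labels under the model.
        p = sigmoid(X @ w + b)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    X = np.array([[2.0], [-2.0]])
    y = np.array([1.0, 0.0])
    print(cross_entropy_loss(np.array([3.0]), 0.0, X, y))    # small: confidently correct
    print(cross_entropy_loss(np.array([-3.0]), 0.0, X, y))   # large: confidently wrong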

  • 01:25:00 In this section, the speaker discusses how the cross-entropy loss punishes large residuals much more than small ones: the plot of the loss against the predicted probability shows that small probabilities (small bars) contribute a lot to the loss, playing a role similar to squaring the residuals in earlier models. The speaker then discusses the derivative of the logarithm and the constant multiplier it introduces; to simplify the math, the constant can be disregarded, or the binary logarithm can be defined in terms of the natural logarithm.

  • 01:30:00 In this section, the speaker discusses cross-entropy loss and the role the sigmoid function plays in simplifying it. The symmetries of the sigmoid function allow for the collapse of the loss function, ultimately making it simpler. The logistic sigmoid, when applied to logistic regression, can handle points far away from the decision boundary without issue. The logistic regression can result in multiple good solutions in the region of uncertainty.

  • 01:35:00 In this section, the lecturer explains the concept of probability and classifies points as either blue or red based on their probability values. He further hints that the next lecture will cover SVM loss as the final loss function.
5 Probability 1: Entropy, (Naive) Bayes, Cross-entropy loss (MLVU2019)
  • 2019.02.19
  • www.youtube.com
slides: https://mlvu.github.io/lectures/31.ProbabilisticModels1.annotated.pdf
course materials: https://mlvu.github.io
Apologies for the bad audio (and missing...
 

Lecture 6 Linear Models 2: Neural Networks, Backpropagation, SVMs and Kernel methods (MLVU2019)



6 Linear Models 2: Neural Networks, Backpropagation, SVMs and Kernel methods (MLVU2019)

This first part of the video on linear models focuses on introducing non-linearity to linear models and explores two models that rely on expanding the feature space: neural networks and support vector machines (SVMs). For neural networks, the speaker explains how to set up a network for regression and classification problems using activation functions such as sigmoid or softmax. The lecture then delves into backpropagation, a method used to compute gradients used in neural networks. For SVMs, the speaker introduces the concept of maximizing the margin to the nearest points of each class and demonstrates how it can be expressed as a constrained optimization problem. The video provides a clear introduction to the principles of neural networks and SVMs, recommending students focus on the first half of the lecture as a starting point for the rest of the course.

The second part of the video covers the topics of support vector machines (SVMs), soft margin SVMs, kernel tricks, and differences between SVMs and neural networks. The soft margin SVMs are introduced as a way to handle non-linearly separable data, allowing for a penalty value to be added to points that do not comply with classification constraints. The kernel trick allows for the computation of the dot product in a higher-dimensional space, expanding the feature space to significantly increase the model's power. The differences between SVMs and neural networks are explained, and the shift towards neural networks due to their ability to perform more advanced types of classification, even if not fully understood, is discussed.

  • 00:00:00 In this section, the speaker discusses how to learn nonlinear functions with linear models by adding extra features that are themselves functions of the existing features, as explained in the previous lecture. The speaker then focuses on two models, neural networks and support vector machines, which rely on expanding the feature space: neural networks use a learnable feature extractor, while support vector machines use the kernel trick to blow up to a larger feature space. The lecture explains backpropagation, a specific method for computing the gradients used in neural networks, as well as the hinge loss function used in support vector machines. The speaker recommends focusing on the first half of the lecture for a better understanding of linear models, as it serves as a starting point for the rest of the course.

  • 00:05:00 In this section, the speaker discusses the history of neural networks, tracing back to the late 50s and early 60s when researchers started to take inspiration from the human brain to develop AI systems. They created a simplified version of a neuron called the perceptron which worked as a linear model and was used for classification. However, the interesting thing about the brain is the way a big bunch of neurons work together, so researchers started to chain these perceptrons together to build a network.

  • 00:10:00 In this section of the lecture on linear models, the speaker explains how to introduce non-linearity to a network of perceptrons in order to have the power to learn normally non-linear functions and more interesting models. One way to do this is by using a sigmoid function, which takes a range of numbers and squeezes them into the range of 0 to 1. By chaining together perceptrons with nonlinear activation functions into a feed-forward network or multi-layer perceptron, one can turn it into a regression or classification model, with each line representing a parameter of the network that needs tuning. The process of adapting these numbers to solve a learning problem is called backpropagation, which will be discussed later in the lecture.

  • 00:15:00 In this section of the video titled "6 Linear Models 2: Neural Networks, Backpropagation, SVMs and Kernel methods (MLVU2019)", the speaker explains how to set up a neural network for regression and classification problems. For regression, a network with one hidden layer and no activation on the output layer is set up, followed by the application of a regression loss function. For binary classification, a sigmoid activation is added to the output layer, and the probabilities obtained can be interpreted as the probability of the input being positive. For multi-class classification, a softmax activation is added, which creates one output node for each class and normalizes the probabilities so that they add up to one. The loss function is used to train the weights of the network until the cross-entropy loss is minimized.

  • 00:20:00 In this section, the speaker discusses the basic principle of training neural networks, which is gradient descent. Since computing the loss over the entire dataset is expensive, stochastic gradient descent is used, where the loss is computed on a single example and the model is optimized for that example; the added randomness helps escape local minima. The speaker then adds a hidden layer in the TensorFlow Playground for a classification problem, where probabilistic classification is shown, although the model doesn't seem to perform well on this particular problem.

  • 00:25:00 In this section of the video, the speaker discusses activation functions for linear models, comparing the sigmoid and ReLU activation functions. The ReLU function fits data faster, and its decision boundary is piecewise linear, while the sigmoid creates a curvy decision boundary. The speaker recommends experimenting with additional layers to make the model more powerful, although the added complexity makes it more difficult to train. The video then delves into backpropagation, which allows computers to efficiently compute gradients using symbolic differentiation without the exponential cost. The speaker explains that the basic idea is to describe the function as a composition of modules and to repeatedly apply the chain rule.

  • 00:30:00 In this section, the back propagation algorithm is explained as a method to take any given model and break it down into a chain of modules in order to compute the global gradient for a particular input by multiplying the gradients of each submodule together. This process begins by working out the derivative of each module with respect to its input symbolically using pen and paper, then moving onto numerical computation. A simple example is given to illustrate the idea of composing a function as a sequence of modules, using local derivatives and repeatedly applying the chain rule to derive the global derivative. The resulting factors are referred to as the global and local derivatives, respectively.

  • 00:35:00 In this section, the video discusses backpropagation by breaking the system down into modules and applying it to a two-layer neural network with sigmoid activation. The focus is on finding the derivative of the loss function with respect to the weights, not the input. The first module is the loss function, followed by Y, the output, which is a linear function of the hidden layer. Each hidden value gets a module with its own activation function, in this case a sigmoid, applied to it, and H2 prime is the linear input to that activation function. Finally, the video stresses that it is important to recognize the difference between the derivative of the model with respect to its input and the derivative of the loss function with respect to the weights.

  • 00:40:00 In this section, the speaker works out the local gradients of each module, specifically the derivative of the loss with respect to the weight V2 and of Y with respect to V2. Using the chain rule, the derivative of the squared-error loss with respect to Y simplifies to 2 times (Y minus T), and since Y is a linear function, the derivative of Y with respect to V2 is simply H2. When applying gradient descent to the parameter V2, it is updated by subtracting the error times the activation H2. The speaker provides an analogy of a neural network as a government with the Prime Minister at the top, ministers in the second layer, and civil servants in the first layer. The ministers listen to the civil servants and shout louder for certain decisions, interpreted as positive trust, while staying silent means negative trust. The Prime Minister adjusts the level of trust based on the error and backpropagates it down the network for updates.

  • 00:45:00 In this section, the speaker explains how backpropagation works by assigning responsibility to all the weights for the error in the model's output. He uses a contrived analogy to demonstrate that the global error is computed and multiplied by the level of trust in the ministers who contributed to the problem. The speaker then shows how the activation function needs to be accounted for when updating the level of trust. Backpropagation essentially propagates the error back down the network to update the weights of the model. The speaker summarizes that neural networks are a combination of linear and non-linear functions and the simplest version is a feed-forward network.
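
A minimal numpy sketch of these ideas for a network with one sigmoid hidden layer, a linear output, and squared-error loss (the variable names are illustrative, not the lecture's notation):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)           # one input vector
    t = 1.5                          # target output
    W = rng.normal(size=(4, 3))      # input-to-hidden weights
    v = rng.normal(size=4)           # hidden-to-output weights

    # Forward pass: keep the intermediate values, the backward pass needs them.
    h_pre = W @ x                    # linear input to the hidden units
    h = sigmoid(h_pre)               # hidden activations
    y = v @ h                        # linear output
    loss = (y - t) ** 2

    # Backward pass: apply the chain rule module by module.
    dL_dy = 2 * (y - t)              # derivative of the squared error
    dL_dv = dL_dy * h                # output weights: error times hidden activation
    dL_dh = dL_dy * v                # propagate the error down to the hidden layer
    dL_dh_pre = dL_dh * h * (1 - h)  # through the sigmoid: sigma' = sigma * (1 - sigma)
    dL_dW = np.outer(dL_dh_pre, x)   # hidden weights: local error times input

    # One gradient descent step.
    lr = 0.1
    v -= lr * dL_dv
    W -= lr * dL_dW
    print("loss before the step:", loss)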

  • 00:50:00 In this section, the video discusses the history of and challenges with neural networks, and how interest in them decreased because they were difficult to train and tweaking their parameters involved a lot of uncertainty. Support vector machines, which have a convex loss surface and therefore give immediate feedback on whether the model is working, became more popular because of that lack of uncertainty. The video then introduces support vector machines as a solution to the problem of multiple models performing differently on similar data, using the idea of maximizing the margin to the nearest points, which are called the support vectors.

  • 00:55:00 In this section, the concept of support vector machines (SVMs) is introduced as a method for finding a decision boundary for a binary classification problem. The SVM aims to find the line that maximizes the margin, the distance between the decision boundary and the nearest points of each class, which are called the support vectors. The objective can be expressed as a constrained optimization problem: maximize the margin while satisfying constraints that push the output of the model to at least +1 for positive points and at most -1 for negative points. By introducing a label parameter that encodes whether a point is positive or negative, the two sets of constraints can be reduced to a single constraint written entirely in terms of the hyperplane parameters.
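
In standard notation, the hard-margin objective implied by this description, with labels y_i in {-1, +1} acting as the label parameter, is:

    \min_{w, b} \; \tfrac{1}{2}\lVert w \rVert^2
    \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 \ \text{ for all } i

The resulting margin is 2 / ||w||, which is why maximizing the margin amounts to keeping the weight vector short.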

  • 01:00:00 In this section, the speaker discusses maximizing the margin around the decision boundary in support vector machines (SVMs). The size of the margin is determined by the length of the weight vector, so the objective becomes keeping the weight vector as short as possible while still satisfying the constraints. However, if the data is not linearly separable, the model needs to be slackened by adding slack parameters, which allow it to violate individual constraints in order to find a better overall fit. Each data point has its own slack parameter, which can either be zero or a positive value.

  • 01:05:00 In this section, the lecturer discusses the concept of soft-margin SVMs, which handle data sets that are not linearly separable by adding a penalty for points that do not comply with the classification constraints. This penalty is expressed through a loss function, the hinge loss, which can be minimized with gradient descent. The lecturer also notes the alternative of keeping the constrained formulation and rewriting it in terms of the support vectors, which is what later enables the kernel trick.

  • 01:10:00 In this section, the instructor reviews different loss functions in machine learning, such as accuracy, least squares, the cross-entropy loss, and the soft-margin SVM loss. The soft-margin SVM works by maximizing the margin between the decision boundary and the nearest points, with penalties for violations. Because this optimization problem has constraints and a saddle point, it cannot be solved effectively by plain gradient descent, so the instructor introduces the method of Lagrange multipliers, which rewrites the constrained optimization problem into a much simpler form. Using this method, the soft-margin SVM objective can be rewritten in a way that allows the kernel trick to be applied.
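
A minimal sketch of the resulting soft-margin (hinge) loss, using the corrected form noted under the video, max(0, 1 - y_i(w^T x_i + b)); the helper name, the toy points, and the choice C = 1 are invented for illustration:

    import numpy as np

    def soft_margin_svm_loss(w, b, X, y, C=1.0):
        # y holds -1/+1 labels; C trades off margin size against violations.
        margins = y * (X @ w + b)
        hinge = np.maximum(0.0, 1.0 - margins)        # penalty for points inside the margin
        return 0.5 * np.dot(w, w) + C * hinge.sum()

    X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
    y = np.array([1.0, -1.0, 1.0])
    print(soft_margin_svm_loss(np.array([1.0, 0.0]), 0.0, X, y))
    # The third point sits inside the margin, so it contributes a positive hinge term.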

  • 01:15:00 In this section, the speaker discusses support vector machines (SVMs) and the kernel trick, which is a way of substituting the dot products of pairs of points in a dataset with other dot products. SVMs work by penalizing the size of alphas, indicating which points are support vectors, and summing over all pairs of points in the dataset. The kernel trick allows for the computation of the dot product in a higher-dimensional space, leading to a much more powerful model for a similar cost as computing a linear model. An example is given where the features are expanded by adding all cross products, which vastly increases the feature space and allows for much more powerful models.

  • 01:20:00 In this section, the concept of using kernel functions to achieve high-dimensional feature spaces for classification is discussed. By taking the dot product and raising it to higher powers, the feature space can be expanded to include cross-products, and even infinite-dimensional feature spaces, all while keeping the cost low. This method is, however, prone to overfitting and can be complicated to implement. Kernel functions can also be extended to non-numerical data, such as text or protein sequences, where direct feature extraction is not straightforward. While kernel methods may not be trendy at the moment, they can still be useful in certain cases.
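
A small numerical check of the kernel trick for the quadratic kernel (x . z)^2, which corresponds to a feature map containing all cross-products x_i x_j (the toy vectors are invented):

    import numpy as np

    def phi(x):
        # Explicit expanded feature map: all cross-products x_i * x_j.
        return np.outer(x, x).ravel()

    x = np.array([1.0, 2.0, 3.0])
    z = np.array([0.5, -1.0, 2.0])
    print(np.dot(x, z) ** 2)         # kernel evaluated in the original space
    print(np.dot(phi(x), phi(z)))    # the same value via the expanded feature space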

  • 01:25:00 In this section, the differences between support vector machines (SVMs) and neural networks are discussed. SVMs are limited in that their training time is quadratic, whereas neural networks only require a certain number of passes over the data. However, SVMs can still be trained with gradient descent, but this method loses sight of the kernel trick. Around 2005, training SVMs became increasingly difficult due to the amount of data involved, leading to the resurgence of neural networks. Furthermore, the culture within machine learning shifted to accepting that neural networks work, even if the reasoning behind their success is not yet entirely understood. Ultimately, this shift allowed for the use of neural network models to perform more advanced types of classification, which will be discussed in the following section.
6 Linear Models 2: Neural Networks, Backpropagation, SVMs and Kernel methods (MLVU2019)
  • 2019.02.27
  • www.youtube.com
NB: There is a mistake in slide 59. It should be max(0, 1 - y^i(w^T x + b)) (one minus the error instead of the other way around).
slides: https://mlvu.githu...