Machine Learning and Neural Networks

 

CS 198-126: Modern Computer Vision and Deep Learning, Fall 2022 (University of California, Berkeley)



CS 198-126: Lecture 1 - Intro to Machine Learning

In this lecture on machine learning, the instructor covers a broad range of topics: an introduction to the course, an overview of machine learning, the different types of machine learning, the machine learning pipeline, labeling data, and loss functions. The concepts of the bias-variance trade-off, overfitting, and underfitting are also discussed. The instructor emphasizes the importance of choosing the right function template during the process of machine learning and the role of hyperparameters in that process. The overall goal of machine learning is to accurately predict new data, not just fit the training data. The lecturer encourages students to attend class and make an effort to learn about machine learning and deep learning.

  • 00:00:00 In this section there is no substantive content to summarize; the transcript excerpt is a conversation between the speaker and the audience about the microphone situation in the room.

  • 00:05:00 In this section, we are introduced to the lecture series on deep learning for computer vision, presented by Jake and his colleagues. As the class gets started, Jake goes over the logistics of the course and outlines what they will be discussing in the first lecture, which is an overview of machine learning and how it is approached. Despite some technical difficulties with the recording equipment, Jake is excited to teach the class and begins with an introduction to himself and his colleagues.

  • 00:10:00 In this section, the instructor introduces himself and the course, which aims to provide an introductory boot camp in computer vision and deep learning for freshmen who haven't had much exposure to the material before. The course will cover topics such as computer vision tasks, learning from large datasets of images, 3D vision, and generative art. The instructor emphasizes that the course is supposed to be fun and interactive, and he goes over logistics such as accessing recordings, slides, and assignments on the website and using Edstem to interact with students and the course staff. The syllabus is also available on the website, and the first quiz will be due at the end of next weekend.

  • 00:15:00 In this section, the instructor provides an introduction to machine learning (ML). He explains that ML is the process of using data to figure out what a function looks like rather than coding it yourself: the data shapes the function, and a function that identifies the digit 7 in an image is much easier to create with ML than to write by hand. ML involves template creation, where one fixes the structure of a function and leaves a few parameters blank that determine how the function behaves; those parameters are then learned from data. The importance of creating the right function template is discussed, as it determines the success of the ML model.

  • 00:20:00 In this section of the lecture, the speaker explains that the key to machine learning is figuring out the format of the function that the model will follow. This template is sometimes referred to as a model class; certain parts of it are left blank and called parameters, the values we are allowed to learn. The speaker emphasizes that the choice of function is crucially important for achieving accurate results. The speaker also gives a brief overview and categorization of the different types of machine learning, including supervised learning, unsupervised learning, and reinforcement learning, and introduces the vocabulary associated with machine learning: function, parameters, weights, biases, hyperparameters, and features.

  • 00:25:00 In this section, the lecturer walks through the ML pipeline. First, it is essential to define the problem, prepare the data, and select the model and loss function. It is also essential to label the data with a one-hot labeling scheme to convert labels into numbers that the model can recognize. The lecturer emphasizes the importance of good data, noting that crummy input data will produce equally poor output. Additionally, he discusses the importance of vectorizing the data, ensuring that all features are on the same scale and represented in a consistent order.

  • 00:30:00 In this section, the lecturer explains how to label data for machine learning applications and why it is important to represent data correctly. One-hot labeling is used, where each label is represented by a vector with a one in the corresponding position and zeros elsewhere. Defining a model and loss function is crucial when training a machine learning model. The model should be tailored to the specific problem and data, and different models can be tested to see which one works best. The loss function measures the effectiveness of the model, and a low value is desirable.
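
    As a concrete illustration of the one-hot scheme described above, here is a minimal sketch in Python (the label set is made up for the example):

      import numpy as np

      labels = ["cat", "dog", "bird"]   # hypothetical label set
      label_to_index = {name: i for i, name in enumerate(labels)}

      def one_hot(name):
          # A vector of zeros with a single one at the label's position.
          vec = np.zeros(len(labels))
          vec[label_to_index[name]] = 1.0
          return vec

      print(one_hot("dog"))   # [0. 1. 0.]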

  • 00:35:00 In this section, the lecturer discusses the importance of having a metric to optimize in machine learning and introduces the loss function, which measures the difference between the model's output and the correct label. The mean squared error is given as an example of a loss function that computes the distance between the model's output and the correct label. The choice of loss function is a hyperparameter that must be selected beforehand. Additionally, the lecturer talks about the training phase, where the algorithm selects the optimal values for the model's parameters using the loss function, and about the testing set, data the model has not seen before, which is used to evaluate the model's performance and determine how well it generalizes to new data. Finally, he emphasizes that the model must not just memorize the training data but generalize well to new data.
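
    The mean squared error mentioned above is simple to write down; a minimal sketch (the vectors are illustrative):

      import numpy as np

      def mse(prediction, target):
          # Average of the squared component-wise differences.
          return np.mean((prediction - target) ** 2)

      # e.g. a model's output vs. the one-hot label for class 1
      print(mse(np.array([0.2, 0.7, 0.1]), np.array([0.0, 1.0, 0.0])))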

  • 00:40:00 In this section, the bias-variance trade-off is introduced as a crucial concept that appears in deep learning and every other ML class students will take. Bias refers to a model's tendency toward certain predictions regardless of the data, while variance refers to how sensitive a model is to the particular training data it sees. Examples of high bias and high variance are given: high bias is the case where a model is not expressive enough to learn a dataset accurately, while high variance corresponds to overfitting. The ideal model captures the complexity of the data without memorizing it, which is known as the Goldilocks zone.

  • 00:45:00 In this section, the speaker discusses the concept of overfitting and underfitting in machine learning models. The goal of machine learning is to accurately model and predict new data, not just fit the training data. If a model is able to match the training data too closely but fails to predict new data accurately, then it is overfitting. On the other hand, if a model is too simple and cannot capture the patterns in the data, it is underfitting. The best approach is to find a balance between fitting enough to the training data while still being able to generalize to new data. This involves trial and error with hyperparameter tuning and the selection of appropriate loss functions.

  • 00:50:00 In this section, the instructor covers the concept of bias and variance in machine learning models. A model with high bias will have similar accuracy between training and testing data because it consistently spits out the same output regardless of the input. On the other hand, a high variance model fits too tightly to the data resulting in a large loss when tested on new data. The instructor emphasizes the trade-off between model complexity and generalization, which is important to understand when selecting a model for a specific task. Finally, the instructor encourages students to attend class and put in effort to learn about machine learning and deep learning, even though it may not be their top priority.

CS 198-126: Lecture 2 - Intro to Deep Learning, Part 1

In this lecture on Intro to Deep Learning, the instructor discusses the basics of deep learning models and how to train them using gradient descent, covering the different building blocks of neural networks and why deep learning has become such a prevalent technology. The lecture introduces the perceptron and shows how stacking multiple perceptrons creates a more complex and sophisticated neural network, with the output computed by matrix multiplications and a final addition and the middle layer using a ReLU activation function. The speaker covers the softmax function and the ReLU activation function, the use of loss functions as metrics for evaluating how well the model is performing, and the concept of gradient descent optimization. The instructor then discusses the idea of deep learning and the empirical finding that big neural networks reach low loss without simply memorizing the data. The lecturer also introduces hyperparameter tuning as a way to improve a network's performance on a specific dataset, noting that there are no universal values for hyperparameters and suggesting exploring options such as the number of layers and the activation functions. Due to time constraints the lecture ends abruptly, but the lecturer assures students that the upcoming quiz will not be overly difficult and will be accessible on Gradescope.

  • 00:00:00 In this section, the instructor asks the audience if they have any questions from the previous lecture, specifically regarding the bias-variance trade-off and one-hot encodings. They briefly touch on exploratory data analysis and k-fold cross-validation, stating that these topics will not be covered much in the course. The instructor elaborates on how k-fold cross-validation works before beginning the second lecture on deep learning.

  • 00:05:00 In this section of the lecture, the instructor provides a brief overview of what students can expect to learn, including the basics of deep learning models and how to train them using gradient descent. The lecture also covers different building blocks for neural networks and why deep learning is such a prevalent technology. Before diving into the specifics of deep learning, the instructor reviews some fundamental linear algebra concepts such as vector dot products and matrix vector multiplication. These concepts are important for understanding the inner workings of deep learning models. The instructor provides examples and notation to help students fully comprehend the material.

  • 00:10:00 In this section of the video, the speaker discusses the notation and indexing in deep learning, where vectors and matrices are used to represent the parameters of the black box function. The speaker explains that each element in the matrix is a parameter, and when indexing within a vector, it refers to a single scalar. They emphasize the ease of taking partial derivatives with respect to inputs and explain the motivation behind neural networks.

  • 00:15:00 In this section, the lecturer discusses the different types of problems that deep learning can solve, such as regression, classification, and reinforcement learning. The goal is to create a universal model that can be used for all kinds of tasks, similar to the human brain's capabilities. To build toward this model, the lecturer introduces the perceptron, a mathematical formulation loosely modeled on a neuron: it takes in inputs, weights them, sums them, and passes the result through a step activation function. The goal is to create a mini-brain capable of solving complex problems, such as non-linear regression on a complicated-looking polynomial.

  • 00:20:00 In this section, the video introduces the concept of stacking multiple perceptrons to create a more complex and sophisticated neural network. The perceptron operation can be written more compactly as a dot product between the inputs and the weights, plus a bias term; repeating this operation for multiple perceptrons yields a single-layer neural network. The weights for each perceptron are learned depending on the dataset and the problem at hand.

  • 00:25:00 In this section, the lecturer explains how the dot product computes a perceptron's output, with each perceptron having its own unique weights and bias. Each element of the layer's output vector is one perceptron's output: a dot product plus a scalar bias. The lecture then introduces a toy example, asking the audience to turn to their neighbor and work through the math to predict the output.

  • 00:30:00 In this section, the speaker introduces a basic neural network consisting of an input layer with two inputs, a hidden layer with three nodes, and an output layer with one output. The speaker demonstrates how to compute the output by matrix multiplication and a final addition, with the middle layer using a ReLU activation function. The speaker then explains how to compact the forward pass notation and addresses a question about the ordering of the weights. Finally, the speaker briefly introduces classification using neural networks.
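
    A minimal sketch of that toy forward pass, two inputs, three hidden units with ReLU, one output, using placeholder random weights (the lecture's actual numbers are not reproduced here):

      import numpy as np

      rng = np.random.default_rng(0)
      W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # input (2) -> hidden (3)
      W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # hidden (3) -> output (1)

      def relu(x):
          return np.maximum(0, x)

      def forward(x):
          h = relu(W1 @ x + b1)   # hidden layer: matrix multiply, bias, ReLU
          return W2 @ h + b2      # output layer: matrix multiply and final addition

      print(forward(np.array([1.0, -2.0])))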

  • 00:35:00 In this section of the lecture, the speaker talks about interpreting the output of a deep learning model. The speaker uses the example of classifying a digit and representing the labels as a one-hot representation. To constrain the output of the model and ensure that the model's output can be interpreted as the probability that the model thinks it belongs to a certain class, the speaker introduces the softmax operation at the very end of the neural network. The softmax function takes the exponential of all the output values and then normalizes the results, ensuring that all the elements are between 0 and 1 and sum up to 1. This function guarantees that the model never has perfect confidence unless it outputs negative infinity for almost every single value, hence the meme that there will always be uncertainty in deep learning.
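
    The softmax operation described above can be written in a few lines; a sketch (subtracting the max before exponentiating is a standard numerical-stability trick, not something from the lecture itself):

      import numpy as np

      def softmax(z):
          # Exponentiate every value, then normalize so the outputs
          # lie in (0, 1) and sum to 1.
          e = np.exp(z - np.max(z))
          return e / np.sum(e)

      print(softmax(np.array([2.0, 1.0, 0.1])))   # approx. [0.66, 0.24, 0.10]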

  • 00:40:00 In this section, the lecturer discusses the Softmax function, which takes in a vector of values and outputs a vector of values that sum to one and are all greater than zero. This function is used at the end of a neural network if one wants to interpret the network's outputs as probabilities for different classes. The lecture also explains that the reason for adding the ReLU activation function, which adds complexity to the network, is to enable modeling complex functions that a simple matrix multiplication, in the absence of ReLU, may not capture. The lecturer also touches on using loss functions as metrics for evaluating how well the model is performing, such as the mean squared error. Finally, the metaphor of being on a hill and wanting to go to the bottom while only seeing one foot around is introduced to explain the concept of gradient descent optimization.

  • 00:45:00 In this section, the lecturer introduces the hill-descent picture of optimization and how it applies to deep learning. The neural network is a vector function, and the direction of steepest ascent is the gradient of that function: multi-variable calculus tells us that taking the gradient of any function of multiple variables with respect to its inputs gives the steepest direction. In deep learning, finding the weights that exactly minimize the loss is incredibly hard, so the lecturer proposes initializing all the weights and biases to random values and taking small steps in the direction of steepest descent. Repeated over many steps, this process should lead to a good selection of weights and biases.

  • 00:50:00 In this section, the speaker discusses how to step down the hill in a loss function by taking the partial derivative of the loss function with respect to all of the individual parameters. The speaker also mentions that the loss function depends on all parameters, such as weights and biases. To update these weights and biases, one takes the derivative of the loss function with respect to those parameters evaluated at their current value and takes a small step in the opposite direction of the greatest change, scaled by a learning rate. The hope is that after the update, the loss function's value will decrease significantly.
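
    The update described above, stepping each parameter slightly opposite its gradient, is one line per parameter; a sketch with illustrative values:

      import numpy as np

      learning_rate = 0.01
      params = [np.array([0.5, -0.3]), np.array([0.1])]   # illustrative weights/biases
      grads  = [np.array([0.2,  0.4]), np.array([-0.1])]  # dLoss/dParam at current values

      for param, grad in zip(params, grads):
          # Small step opposite the direction of greatest increase in the loss.
          param -= learning_rate * grad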

  • 00:55:00 In this section, the lecturer explains how to step the weights to decrease the loss on one example, with the goal of decreasing the loss over the entire dataset. The gradient is computed for every training example, and the average of these gradients is the step that yields the greatest loss decrease over the whole dataset. For computational efficiency, the data can be chunked up into small batches, giving (mini-)batch gradient descent. The lecturer reviews network building blocks such as activation functions and loss functions, emphasizing that at the end of the day all that matters is that the loss function quantifies how well we're doing, and notes that there is no closed-form way to find the best values for all the weights and biases in a neural network, hence these "lazy" iterative methods. Finally, he touches on the idea of deep learning and the empirical finding that big neural networks reach low loss despite having the capacity to memorize the data.
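
    A sketch of the chunked version described above; grad_fn is an assumed helper that returns the gradient of the loss on a single example:

      import numpy as np

      def batched_epoch(params, data, grad_fn, lr=0.01, batch_size=32):
          # One pass over the data set in little chunks: average the
          # per-example gradients within each chunk, then step once.
          for start in range(0, len(data), batch_size):
              batch = data[start:start + batch_size]
              avg_grad = np.mean([grad_fn(params, ex) for ex in batch], axis=0)
              params = params - lr * avg_grad
          return params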

  • 01:00:00 In this section, the lecturer discusses tuning the hyperparameters of a neural network to optimize its performance on a given dataset. He acknowledges that there is no fixed set of hyperparameters that works for all datasets and instead recommends testing different values for parameters such as the number of layers and the activation functions. The lecturer had to rush through the end of the lecture and mentions that a quiz will be released soon; it won't be too challenging and will be available on Gradescope.

CS 198-126: Lecture 3 - Intro to Deep Learning, Part 2

In this section of the lecture, the concept of backpropagation is explained, which is a faster way to get all partial derivatives required for the gradient descent algorithm without performing redundant operations. The lecturer also discusses how to improve upon vanilla gradient descent for deep learning optimization and introduces momentum, RMSprop, and Adam as optimization methods. The importance of keeping track of a model's training history, the use of batch normalization, and ensembling as a technique to improve model performance are also discussed, as well as techniques commonly used in deep learning to help decrease overfitting such as dropout and skip connections. Finally, the lecturer briefly touches on the ease of use of PyTorch and opens the floor to questions.

  • 00:00:00 In this section of the lecture, the speaker makes some quick announcements about the upcoming deadline for the coding assignment and the first quiz. The first assignment is a chance for students to learn the tools needed for the rest of the course, and the quiz is meant to serve as a comprehension check. The speaker then outlines the topics that will be covered in the lecture, including backpropagation and modern deep learning tools, and assures students that if they don't understand the mathematical details of backpropagation, it's okay as long as they understand the high-level idea. The second half of the lecture is important, covering the tools that make modern deep learning work well.

  • 00:05:00 In this section of the lecture, the concept of creating a computational graph for functions and using the chain rule to calculate partial derivatives is discussed. The computational graph allows for efficient computation of derivatives with respect to individual nodes. This concept is then applied to backpropagation in a toy neural network example where the chain rule is used to calculate the partial derivatives of the loss with respect to each weight and bias parameter. By multiplying all the partial derivatives along the path from each parameter to the loss node, redundant computation can be avoided.

  • 00:10:00 In this section, the concept of backpropagation is explained: a faster way to get all the partial derivatives required for the gradient descent algorithm without performing redundant operations. As a network's depth increases, many computations become repeated and redundant, making the naive approach unsuitable for training deep networks. Backpropagation works by caching values during the forward pass and reusing them during the backward pass when computing partial derivatives. Because the partial derivatives now involve matrix-matrix products, caching becomes even more important, since it saves multiply operations, which are typically expensive. The video explains that tools like PyTorch automatically cache the required values for us.
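
    A minimal PyTorch sketch of that caching in action: the forward pass records intermediate values, and a single backward() call fills in every partial derivative without redundant work:

      import torch

      x = torch.tensor([1.0, 2.0])
      w = torch.tensor([[0.5, -0.3], [0.2, 0.1]], requires_grad=True)
      b = torch.tensor([0.1, -0.2], requires_grad=True)

      loss = ((w @ x + b) ** 2).sum()   # forward pass; intermediates are cached
      loss.backward()                   # backward pass reuses them

      print(w.grad)   # dLoss/dw for every entry
      print(b.grad)   # dLoss/db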

  • 00:15:00 In this section, the lecturer discusses how to improve upon vanilla gradient descent for deep learning optimization. One issue with vanilla gradient descent is that it struggles at local minima or flat spots where the gradient is zero, preventing the algorithm from finding better solutions. To solve this, the lecturer introduces the concept of momentum, inspired by a ball rolling down a hill. By taking the weighted average of past gradients and adding it to the current gradient, momentum can help push past small local minima and flat spots. While technically not true gradient descent, momentum can enable the algorithm to power through these obstacles and hopefully find better solutions. The lecturer also discusses how to scale the weighted average of past gradients to not shrink the current gradient too much.

  • 00:20:00 In this section, the concept of momentum in gradient descent is discussed. The lecture explains that the beta coefficient controls the step sizes so that they don't become too big and inconsistent. Momentum lets the step sizes stay steady when rolling downhill while keeping the optimizer moving in the direction the gradient has historically pointed. The lecture then introduces the RMSProp optimization method, which stores a weighted average of the squared components of previous gradients.
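
    A sketch of the momentum update as described, a weighted average of past gradients, with the (1 - beta) factor keeping the current gradient from being shrunk too much:

      import numpy as np

      def momentum_step(param, grad, velocity, lr=0.01, beta=0.9):
          # "Velocity" is an exponentially weighted average of past gradients.
          velocity = beta * velocity + (1 - beta) * grad
          return param - lr * velocity, velocity

      p, v = np.array([1.0]), np.zeros(1)
      p, v = momentum_step(p, np.array([0.5]), v)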

  • 00:25:00 In this section, the instructor explains RMSProp, a variant of gradient descent, and how it compares to traditional methods. RMSProp divides the gradients by the square root of the moving average of the squared gradients, which he demonstrates using examples of small and large gradients. By doing this, the algorithm adaptively adjusts the learning rate, which is known as an adaptive learning rate. He ultimately concludes that Adam is the best form of gradient descent, as it combines the benefits of RMSProp and momentum.

  • 00:30:00 In this section, the lecturer introduces Adam, a combination of RMSProp and momentum, as the preferred optimization method for gradient descent in deep learning models. Adam keeps the benefit of momentum for rolling past local minima while using the RMSProp scaling to boost through flat spots; it does not change the direction of the gradient, only its scaling. The lecture suggests model checkpointing as a way to combat any erratic behavior that may arise with Adam or RMSProp after they have reached a local minimum. Second-order optimization methods can also be used, but they require more computing power and are less common.
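
    A simplified sketch of the Adam update, combining the momentum average of gradients with the RMSProp average of squared gradients (the bias-correction terms are part of the standard algorithm, not mentioned in the summary above):

      import numpy as np

      def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
          m = b1 * m + (1 - b1) * grad        # momentum: average of gradients
          v = b2 * v + (1 - b2) * grad ** 2   # RMSProp: average of squared gradients
          m_hat = m / (1 - b1 ** t)           # bias correction for early steps
          v_hat = v / (1 - b2 ** t)
          return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

      p, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
      p, m, v = adam_step(p, np.array([0.5]), m, v, t=1)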

  • 00:35:00 In this section, the instructor explains the importance of keeping track of a model's training history and how well it performs on new data it hasn't seen before, in order to determine which checkpoint is best. A normalization technique called batch normalization is also discussed, which involves subtracting the mean and dividing by the standard deviation for every activation in the network, then allowing the network to rescale those values as it sees fit by multiplying by a learned value gamma and adding a learned bias beta. This technique normalizes the data and produces regular-looking loss surfaces that are far easier to descend with gradient descent, making life a lot easier.

  • 00:40:00 In this section, we learn about batch normalization, which is a method used to normalize the activations of the neurons of a neural network by computing the mean and standard deviation of outputs from a certain layer. This normalization makes the default behavior of the neural network to have normalized activations, making them well-behaved. While this method does not add expressivity to the model, it allows for better gradients and a more normalized range of values as inputs across all layers of the network. Additionally, we learn about ensembling as a technique used to improve model performance by training multiple models and averaging their predictions.
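
    A sketch of the normalize-then-rescale computation described in the last two sections (eps is the usual small constant to avoid dividing by zero):

      import numpy as np

      def batch_norm(acts, gamma, beta, eps=1e-5):
          # acts: (batch_size, num_features). Normalize each feature over
          # the batch, then let the network rescale via gamma and beta.
          mean = acts.mean(axis=0)
          std = acts.std(axis=0)
          return gamma * (acts - mean) / (std + eps) + beta

      acts = np.random.default_rng(0).normal(5.0, 3.0, size=(4, 2))
      print(batch_norm(acts, gamma=np.ones(2), beta=np.zeros(2)))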

  • 00:45:00 In this section, the lecture discusses two techniques commonly used in deep learning to help decrease overfitting: dropout and skip connections. Dropout randomly removes a fraction of neurons during each training pass, which forces every neuron to learn to use all of the features that came before it rather than over-relying on a few. Skip connections, in contrast, make it easy to learn an identity function that propagates information forward without adding noise or confusion: if a block learns zeros for all its weights, the input passes through unchanged, trivially allowing good information to reach the final layers for classification. Both techniques, along with others discussed in this lecture, increase performance by decreasing overfitting and allowing for arbitrarily deep networks.
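
    A sketch of dropout in the "inverted" form most libraries use (the 1/(1-p) rescaling keeps the expected activation unchanged; that detail is an implementation convention, not something from the lecture summary):

      import numpy as np

      def dropout(acts, p=0.5, training=True):
          # Randomly zero a fraction p of activations during training only.
          if not training:
              return acts
          mask = np.random.default_rng().random(acts.shape) > p
          return acts * mask / (1 - p)

      print(dropout(np.ones(8)))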

  • 00:50:00 In this section, the lecturer explains how skip connections can be a useful tool when building neural networks: they can be added to boost performance and make the network better. The lecturer didn't have time to discuss PyTorch fully, but it is explained in the homework. PyTorch can be really easy to use if you already know numpy: you write ordinary functions that compute a value, and PyTorch can then compute the gradient of that value at any given input. The lecturer concludes by opening the floor to questions.

CS 198-126: Lecture 4 - Intro to Pretraining and Augmentations

In this lecture, the speaker explains the evolution of feature extraction in machine learning, the advantages of deep learning, and how transfer learning can be used to improve the accuracy and speed of models. They also discuss the concept of freezing and fine-tuning layers in neural networks and the importance of embeddings in reducing the dimensionality of categorical variables. The lecture introduces self-supervised learning and its different tasks, including the jigsaw, rotation, and masked word prediction tasks, which can be used to pretrain models and transfer learned representations to downstream tasks. Finally, the renewed interest in self-supervised learning in computer vision is discussed, and the lecture encourages students to complete the homework on the high Crush notebook.

  • 00:05:00 In this section of the lecture, the lecturer discusses representation learning and shallow learning. With shallow learning, the machine learning pipeline starts with an input X, features are extracted from it using a feature extractor, and the extracted features are passed into a machine learning algorithm to get an output Y. The lecturer explains that feature extraction is dependent on the data: it can be straightforward for tabular data but complex for data like text, audio, or images. For images, however, there are specialized feature extractors available from classical computer vision.

  • 00:10:00 In this section, the lecturer explains the concept of feature extraction and how it has evolved in machine learning. In classical machine learning, a hand-programmed feature extractor such as HOG (histogram of oriented gradients), which captures edge information in an image, is used to create models. This process is challenging because feature extractors vary across tasks. Deep learning provides an end-to-end process by learning both the feature extraction and the output prediction. This allows the learning of abstract representations of the input data, which is passed through layers of learned feature extractors in a neural network, resulting in hierarchical representations. The lecture gives an example of how deep neural networks learn representations of images of cars.

  • 00:15:00 In this section, the speaker explains how depth in neural networks helps refine representations. Early layers of the network detect low-level details such as edges, while later layers focus on more concrete features like doors or windows in an image. The final layers try to determine whether the input image is actually what the model has learned to recognize, creating an abstract mental model. The speaker then discusses transfer learning as a way to leverage pre-trained models and avoid having to train models from scratch, which can be costly in terms of time, compute, and data.

  • 00:20:00 In this section, the speaker discusses the concept of layering in neural networks and how pre-training and transfer learning can be used to improve the accuracy and speed of models. The speaker explains how earlier layers capture general features such as shapes and patterns, while later layers capture more abstract features such as objects and humans. The concept of freezing, in which certain layers are preserved and used in subsequent models, is also discussed as a way to customize models for specific tasks. The freezing technique can speed up model training and improve accuracy, but care should be taken to ensure that layers are frozen at the appropriate level.

  • 00:25:00 In this section, the instructor discusses transfer learning in neural networks, specifically the fine-tuning technique, where the pre-trained model is trained further on the output layers and any non-frozen layers. They emphasize the importance of considering the size of the new dataset and its similarity to the original dataset when deciding whether to freeze or fine-tune the pre-trained model. Additionally, they explain the importance of embeddings in neural networks and how they can reduce the dimensionality of categorical variables, making them easier to represent in a transformed space. The use of embeddings is illustrated through an example mapping book genres to a lower-dimensional vector space.
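
    A minimal PyTorch sketch of freeze-then-fine-tune, using a torchvision ResNet-18 as a stand-in pretrained model and a hypothetical 5-class downstream task:

      import torch.nn as nn
      from torchvision import models

      model = models.resnet18(weights="IMAGENET1K_V1")   # pretrained backbone

      # Freeze the pretrained layers so their general features are preserved.
      for param in model.parameters():
          param.requires_grad = False

      # Swap in a new output layer for the downstream task; only this layer
      # (and anything deliberately left unfrozen) gets fine-tuned.
      model.fc = nn.Linear(model.fc.in_features, 5)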

  • 00:30:00 In this section of the lecture, the professor talks about high-dimensional data and the difficulties that arise when trying to represent it. The professor introduces the concept of a lower-dimensional latent space, which involves encoding all the important information in the high-dimensional data into a lower-dimensional space. The goal is to capture this information in a latent space of features, and in many cases this can be achieved through embeddings. The professor gives an example of how a one-dimensional structure can be represented using just one variable instead of three variables in a 3D space, so that the data is not sparsely spread out in a high-dimensional space. Finally, the professor explains how embeddings can be learned by training a model to classify images in the MNIST dataset using the softmax loss and taking the output of an intermediate layer as a representation of the image.
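
    A sketch of that idea: train a classifier, then read an intermediate layer's output as the embedding (the layer sizes here are illustrative, not the lecture's):

      import torch
      import torch.nn as nn

      model = nn.Sequential(
          nn.Flatten(),
          nn.Linear(28 * 28, 128), nn.ReLU(),   # embedding layer (128-dim)
          nn.Linear(128, 10),                   # classification head
      )
      embedder = model[:3]                      # everything up to the 128-dim layer

      image = torch.randn(1, 1, 28, 28)         # stand-in for an MNIST digit
      embedding = embedder(image)               # 128-dimensional representation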

  • 00:35:00 In this section, the speaker discusses the advantages of pre-trained networks and transfer learning, which can save time and computational power while achieving better results. Pre-trained networks can be trained on larger datasets, which can lead to better representations. Transfer learning allows the application of knowledge learned from one pre-trained network to another task, making it especially useful in natural language processing. Self-supervised pre-training is then introduced, which allows for learning without supervision from labels, by learning from raw data.

  • 00:40:00 In this section, the lecturer discusses unsupervised learning, which is a type of learning where no labels are provided, but the model still learns patterns and relationships within the dataset. Examples of unsupervised learning include principal component analysis (PCA) and clustering. The lecturer then talks about self-supervised learning, which involves providing supervision from the data itself, rather than from external labels. The technique involves predicting hidden parts or properties of the data from the observed parts. Self-supervised learning is beneficial in situations where labeled data is scarce or expensive to gather.

  • 00:45:00 In this section of the lecture, the speaker discusses self-supervised learning and the different tasks involved such as the pretext task and downstream task. These tasks can be used in various domains such as computer vision, NLP, and RL. The speaker then gives examples of self-supervised learning tasks such as the jigsaw task, where an image is divided into nine patches, shuffled, and the model is asked to predict the original order. Another task is the rotation task, where an image is rotated by some angle and the model is asked to predict the angle of rotation. These tasks can be used to pretrain models and transfer the learned representations to downstream tasks such as image classification and object detection.
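
    A sketch of the rotation pretext task described above; the "label" is manufactured from the data itself, so no human annotation is needed:

      import numpy as np

      def make_rotation_example(image, rng):
          # Rotate by a random multiple of 90 degrees; the model must
          # predict which rotation was applied (label 0-3).
          k = rng.integers(0, 4)
          return np.rot90(image, k), int(k)

      rng = np.random.default_rng(0)
      img = rng.random((32, 32, 3))             # stand-in for a real image
      rotated, label = make_rotation_example(img, rng)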

  • 00:50:00 In this section of the lecture, the concept of pretraining models using self-supervised learning (SSL) is introduced. One example of SSL in computer vision is training a model to predict the rotation angle of an image and focus on object orientation, location, pose, and type instead of low-level details. This idea is not restricted to CV as SSL can also be applied to NLP and audio, such as predicting a single or multiple words from sentences. A famous model in NLP called BERT uses a Transformer model to predict masked words from two sentences simultaneously and learns a word-level and sentence-level embedding. BERT was a huge success in NLP.

  • 00:55:00 In this section of the lecture, the speaker discusses the renewed interest in self-supervised learning (SSL) in computer vision (CV) after the success of BERT in natural language processing (NLP). The current state-of-the-art in CV is said to be similar to BERT. The lecture provides an overview of representation learning, transfer learning, and SSL, and introduces different concepts and methodologies. While this lecture does not have homework, there is a homework for the entire cluster on the high Crush notebook that is due next Tuesday, and a future lecture on Advanced SSL for CV will have a homework. The slide deck can be accessed on the website for review.

CS 198-126: Lecture 5 - Intro to Computer Vision

This lecture on computer vision covers various topics, including the history of computer vision and its development over the years. The instructor also explains deep learning and how it improves on classical computer vision methods. The lecture delves into the concept of convolutions and how they are used as feature extractors, leading to convolutional neural networks (CNNs), and discusses the role of receptive fields, introducing pooling layers as a method to increase the receptive field of a CNN. Overall, the first part provides an overview of computer vision as a field and the techniques used to extract information from images.

In the second part of the lecture, techniques for preserving the size of an image during convolutions are discussed, including padding and same padding. The concept of stride in convolutional layers is also covered, showing how it can mimic the effect of a pooling layer. The anatomy of a CNN and its hyperparameters, including kernel size, stride, padding, and pooling layers, are explained, with emphasis on how a convolutional layer acts as a feature extractor that passes low-dimensional blocks of features to a fully connected network for classification. The lecture also covers the LeNet architecture for classifying handwritten digits and the importance of normalizing image data before passing it through a neural network. Finally, data augmentation is discussed as a technique for creating additional training data, and the importance of model checkpointing during training is emphasized.

  • 00:00:00 In this section, the instructor introduces computer vision as a field of AI that deals with extracting information from an image at a semantic level. They outline classification, detection, and segmentation as tasks that a machine learning model can perform. To enable a machine to understand an image and perform these tasks, it must have a higher-level understanding of the image's contents, so models are designed to extract features such as edges to classify images. The instructor explains that the field of computer vision has its roots in cognitive science and psychology, with early feature extractors such as HOG tracing their inspiration back to a famous experiment performed on cats in 1959 (Hubel and Wiesel).

  • 00:05:00 In this section of the lecture, the instructor discusses how deep learning has replaced classical computer vision methods, where feature extractors were hand-programmed. Deep learning allows models to learn not only the mappings from features to outputs but also the feature extractors themselves, and this breakthrough came in 2012 with Alex Krizhevsky's neural network. The instructor talks about the ImageNet visual recognition challenge and how AlexNet drastically reduced the error rate, becoming a turning point for deep learning. The lecture then goes on to discuss how images are represented digitally as matrices and the use of brightness values to represent grayscale images.

  • 00:10:00 In this section of the lecture, the instructor discusses the concept of color dimensions and channels in images. Each colored pixel can be broken down into three different values, which means that an RGB image can be represented by three matrices for each component. These matrices can then be stacked on top of each other to form a 3D matrix called a tensor. The instructor notes that this is an important concept for understanding convolutional neural networks (CNNs) because regular neural networks or multi-layer perceptrons are not helpful for processing large images due to the need to convert the 3D tensor into a vector, resulting in a massive number of elements.

  • 00:15:00 In this section, the speaker discusses the number of parameters for a fully connected layer in computer vision. The layer takes a 120,000 dimensional input and outputs a 10-dimensional vector, meaning the weight matrix needs to have dimensions of 10 by 120,000, resulting in 1.2 million weights and 10 parameters from the bias vector. This number of parameters would make the network too large and difficult to train, particularly if a higher-dimensional output is desired. Additionally, treating each pixel as a separate feature is unorthodox in image classification, as humans tend to break down images into different parts and use this information to construct a mental model. The speaker suggests looking at local regions of the image instead of individual pixels to make more sense of the data.
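
    The count quoted above, worked out (assuming the 120,000-dimensional input comes from something like a 200x200 RGB image, since 200 * 200 * 3 = 120,000):

      inputs = 200 * 200 * 3      # 120,000-dimensional flattened input
      outputs = 10                # 10-dimensional output vector
      weights = outputs * inputs  # 10 x 120,000 weight matrix -> 1,200,000
      biases = outputs            # 10 bias parameters
      print(weights + biases)     # 1,200,010 parameters for one dense layer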

  • 00:20:00 In this section, the lecturer discusses the importance of looking at neighboring pixels in an image to gather information and how this poses a challenge for neural networks, which typically treat each pixel separately. He introduces the concept of local regions, which is relevant to computer vision and concerned with the structure of an image. The lecture also covers the need to extract hierarchical representations from an input and how these representations depend on each other, allowing the model to learn abstract concepts like what a face looks like. Finally, the lecture explains the concept of translational equivariance, where the representations of an image should translate along with its pixels to stay consistent.

  • 00:25:00 In this section, the concept of translational invariance and local region processing is discussed in computer vision. Traditional network architecture cannot accommodate these requirements, leading researchers to develop convolutional neural networks (CNNs) instead. The convolution operation involved in CNNs is explained by using a weight filter that can slide over an image and compute dot products to create new outputs. The weight sharing technique is also introduced, where each patch is passed through a layer with the same weights and biases to yield the same representation, making CNNs capable of satisfying the criteria laid out for computer vision.

  • 00:30:00 In this section of the lecture, the speaker explains the process of convolutions, which involves taking an element-wise product of a filter on an input patch and summing the results. This process enables computer vision algorithms to focus on single patches of the input image rather than the entire image, and use the same filter for each patch, sharing weights. By strategically designing the filters, the algorithms can extract different kinds of information, such as edge detection. The speaker provides an example of a filter designed to detect vertical edges by highlighting the high activations along the middle of the convolved output, with low activations on the edges.
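
    A sketch of that patch-wise multiply-and-sum, with one common vertical-edge filter (the lecture's exact filter may differ); the output activates strongly along the dark-to-bright boundary:

      import numpy as np

      def convolve2d(image, kernel):
          # Slide the filter over the image; at each location, take the
          # element-wise product with the patch and sum the results.
          kh, kw = kernel.shape
          oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
          out = np.zeros((oh, ow))
          for i in range(oh):
              for j in range(ow):
                  out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
          return out

      vertical_edge = np.array([[-1, 0, 1],
                                [-1, 0, 1],
                                [-1, 0, 1]])

      img = np.zeros((5, 6))
      img[:, 3:] = 1.0                          # bright right half -> vertical edge
      print(convolve2d(img, vertical_edge))     # high values along the edge only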

  • 00:35:00 In this section of the lecture, the instructor explains the concept of convolutions in computer vision and how they are used as feature extractors. Convolutional Neural Networks (CNNs) use filters that can extract different features from an image, and these filters can be learned through the process of deep learning. By using more filters, CNNs can extract different kinds of features from an input image and use information about all of them. The instructor also discusses how to generalize the convolution process to an input image with multiple channels and the output from this process is called an activation map, which represents the activation of different features.

  • 00:40:00 In this section, the speaker discusses the concept of representing images in an RGB format and how the activation map can have a 3D structure. The process involves extracting different features and convolving them to get a 3D output. This convolution operation is general and can be applied to any 3D input, which allows stacking convolutional layers on top of each other, leading to deep neural networks. Additionally, the speaker goes into implementation details concerning the receptive field concept, which is not limited to just convolutional neural networks.

  • 00:45:00 In this section, the concept of receptive fields in activation maps is discussed. Receptive field refers to the region of the input that influences each element of an activation map. This section explains how receptive fields work and how increasing the receptive field size can impact network performance. It is also noted that receptive fields can be influenced by different convolutional filters and that having an overly large or small receptive field can lead to missing important information in the input.

  • 00:50:00 In this section of the lecture, the professor explains how the receptive field size affects the ability of a convolutional neural network (CNN) to classify an image. Having a small receptive field can leave the network unable to process enough information for the classification task, while a large receptive field can lead to overfitting. The lecture also touches on the importance of introducing non-linearities through activation functions in a CNN: while two different convolutions may have the same receptive field, their outputs will not be the same due to the non-linearities introduced by the activation functions.

  • 00:55:00 In this section, the lecturer introduces pooling layers as a method to increase the receptive field of convolutional neural networks without adding so many layers that the model becomes too large. Pooling layers look at square regions of the input and apply either a max or an average operation. For instance, a two-by-two max pooling keeps only one value from each chunk of four pixels, reducing the input's spatial dimensions by a factor of two. The lecturer also explains how max pooling preserves the crucial information from a local area, making it a common way to reduce the spatial dimensions of activation maps.
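
    A sketch of the two-by-two max pooling described above, halving each spatial dimension while keeping the strongest activation in every block:

      import numpy as np

      def max_pool_2x2(x):
          h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2   # trim odd edges
          x = x[:h, :w]
          # Group into 2x2 blocks and keep only each block's maximum.
          return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

      a = np.array([[1, 3, 2, 0],
                    [4, 2, 1, 1],
                    [0, 1, 5, 6],
                    [2, 2, 7, 8]])
      print(max_pool_2x2(a))   # [[4 2]
                               #  [2 8]]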

  • 01:00:00 In this section, the speaker discusses different techniques for preserving the height and width of an image during convolutions, including the process of padding and same padding. Same padding is when you artificially increase the size of the input by surrounding it with zeros or other constants to maintain the spatial dimensions of the activation map. This is a preference in the deep learning community, but there is no empirical evidence that it leads to better performance over regular padding. Additionally, the speaker discusses the concept of stride in convolutions and how it can have the same effect as a pooling layer.

  • 01:05:00 In this section, the lecturer discusses how stride in convolutional layers can act as an approximation to a pooling layer. He explains that using a stride bigger than one is akin to combining convolution and pooling layers together, though it offers no particular advantage. He also presents a formula for the dimensions of the output activation, which depend on the original input dimensions, the filter size, the padding size, and the stride. The lecturer then explains how gradients can be backpropagated through convolutional layers, stressing that PyTorch makes it easy to define convolutional layers with these parameters.
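
    The formula referred to above is, in its standard form, floor((input + 2 * padding - filter) / stride) + 1 per spatial dimension; a one-line sketch:

      def conv_output_size(n, f, p, s):
          # n: input size, f: filter size, p: padding, s: stride.
          return (n + 2 * p - f) // s + 1

      print(conv_output_size(32, 5, 0, 1))   # 32x32 input, 5x5 filter -> 28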

  • 01:10:00 In this section, the lecturer discusses the hyperparameters of convolutional layers and how they make up the anatomy of a CNN: the kernel size, stride, padding, and pooling layers. The lecturer explains that the convolutional layers can be viewed as a feature extractor that converts a high-dimensional input into a low-dimensional block of features, which is then passed into a fully connected network, the MLP classifier shown in the red box on the slide, for the classification task. Finally, the lecturer notes that there are different types of pooling layers, but the norm in the deep learning community is to use max pooling rather than average pooling.

  • 01:15:00 In this section, the video explains the structure of a convolutional neural network (CNN) using a real-life example: LeNet, developed by Yann LeCun for classifying handwritten digits. The LeNet network takes an input, convolves it into feature maps, pools those maps down to smaller sizes, and applies another convolution and pooling until it reaches a small representation of the input, before passing it through fully connected layers to get an output representing one of the ten possible digits. The video goes on to explain design choices for CNN architectures, such as stacking convolution, ReLU, and pooling layers, and the use of batch normalization layers to make training more stable. Finally, the video discusses some commonly used datasets in computer vision, such as the MNIST handwritten digit classification dataset and the CIFAR-10 dataset.
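
    A LeNet-style sketch in PyTorch (the sizes follow the classic LeNet-5 on 32x32 inputs; the lecture's exact variant may differ):

      import torch.nn as nn

      lenet = nn.Sequential(
          nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(),    # 32x32 -> 28x28, 6 maps
          nn.MaxPool2d(2),                              # 28x28 -> 14x14
          nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(),   # 14x14 -> 10x10, 16 maps
          nn.MaxPool2d(2),                              # 10x10 -> 5x5
          nn.Flatten(),
          nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
          nn.Linear(120, 84), nn.ReLU(),
          nn.Linear(84, 10),                            # one score per digit
      )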

  • 01:20:00 In this section of the lecture, the instructor discusses several popular computer vision datasets, including MNIST, CIFAR-10, and ImageNet. The ImageNet dataset, in particular, is a million-image dataset that has been widely used as a benchmark for evaluating computer vision algorithms. The instructor also emphasizes the importance of normalizing image data before passing them into a neural network and the challenge of collecting and labeling data, which requires careful consideration to ensure that the data comes from a similar distribution. Additionally, more data can help to prevent overfitting, but collecting large datasets can be costly and time-consuming.

  • 01:25:00 In this section, the lecture covers data augmentation, where one can artificially create more data from a single image by making slight changes to its brightness, contrast, color, cropping, flipping, or rotation, and assigning the altered copies the same label to create new training examples. This is a very cheap and easy way of creating new data from pre-existing datasets. The lecture also emphasizes the importance of model checkpointing while training convolutional neural networks: training usually takes hours to days or even weeks, and losing progress to a sudden interruption like a machine crash or accidental shutdown can be costly. It is essential to store model weight snapshots at different points in the training process so training can continue from the latest snapshot if it gets interrupted.
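
    A sketch of both ideas in PyTorch/torchvision, with illustrative augmentation parameters and a toy model standing in for a real CNN:

      import torch
      import torch.nn as nn
      from torchvision import transforms

      # Each epoch sees a slightly altered copy of every image, with the
      # original label kept, effectively enlarging the training set.
      augment = transforms.Compose([
          transforms.RandomHorizontalFlip(),
          transforms.RandomRotation(10),
          transforms.ColorJitter(brightness=0.2, contrast=0.2),
      ])

      # Checkpointing: periodically save weight snapshots so an interrupted
      # run can resume from the latest one instead of starting over.
      model = nn.Linear(10, 2)                  # stand-in for a real CNN
      optimizer = torch.optim.Adam(model.parameters())
      torch.save({"epoch": 0,
                  "model_state": model.state_dict(),
                  "optimizer_state": optimizer.state_dict()},
                 "checkpoint_epoch_0.pt")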

CS 198-126: Lecture 6 - Advanced Computer Vision Architectures

This lecture on advanced computer vision architectures focuses on convolutional neural networks (CNNs) and the techniques built around them. The lecturer explains the architecture of AlexNet and VGG before delving into more advanced ideas, such as residual connections that preserve information from earlier layers, yielding higher accuracy with simpler architectures. The use of bottlenecks and one-by-one convolutions is discussed, as is the importance of being able to learn the identity in computer vision architectures. The lecture also covers the problem of vanishing gradients in neural networks and how it can be alleviated with batch normalization and residual networks. Techniques such as global average pooling and depthwise separable convolution are explained in depth, followed by a discussion of the MobileNet architecture and its benefits.

The lecturer then examines how to optimize convolutional neural network models using depthwise convolutions and one-by-one convolutions, emphasizing that understanding these optimizations, and the problems certain optimizations run into, helps in building future networks efficiently. The lecture concludes with a discussion of the trade-off between accuracy, performance, and model size, highlighted by a comparison of the EfficientNet model to other networks. Students are informed of an upcoming quiz and a homework assignment due the following Friday.

  • 00:05:00 In this section, the speaker begins by recapping the previous lecture before delving into more advanced CNN architectures. They apologize for the rough introduction in the last lecture and make some last minute edits before beginning. There is a brief exchange about a microphone, but then the speaker jumps into the lecture.

  • 00:10:00 In this section, the speaker reviews the architecture of a convolutional neural network (CNN) and how it differs from a standard dense neural network. The speaker clarifies that the convolutional layer in a CNN is similar to a layer in a dense neural network, with learned parameters such as filters and bias terms. The speaker explains how a filter generates an output map for every location in the input, and multiple filters generate different output channels. The speaker also explains how a pooling layer can be used to decrease the size of the output volume. Overall, the speaker emphasizes that the mechanics of a CNN are similar to a dense neural network, with convolutions replacing matrix multiplication.

  • 00:15:00 In this section, the speaker explains the use of max pooling in convolutional neural networks to reduce the size of the feature volume and speed up convolutions. This technique involves taking the maximum value in each little square of the feature volume and using that as the output. The speaker also touches on the concept of segmentation, which involves labeling each pixel in an image with a specific classification, and notes that for this task, the output size will be the same as the input size. The section ends with a brief introduction to advanced CNN architectures, with a focus on ResNet as the most significant one to take away from the lecture.

  • 00:20:00 In this section, the lecturer discusses various computer vision architectures, with a focus on convolutional neural networks (CNNs). The lecture begins by discussing the motivation behind CNNs, which involves stacking convolutional layers and pooling layers to synthesize information from low-level features and work its way up to higher features. The lecture then goes on to discuss the architecture of AlexNet, which was a groundbreaking feat that achieved an error rate of around 17% on ImageNet in 2012. However, the lecturer notes that these architectures are no longer state-of-the-art, as there have been advancements in transformer architectures that will be discussed in later lectures.

  • 00:25:00 In this section, the speaker discusses the architecture of AlexNet and VGG, two widely used computer vision neural networks. AlexNet involves five convolutional layers, with the final output being a flattened one-dimensional vector passed through three dense layers and a softmax function to produce a predicted class. VGG, on the other hand, has thirteen convolutional layers and three dense layers. Additionally, the speaker highlights the use of one-by-one convolutions as a way to add or reduce channel dimensionality while preserving spatial size.
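
    A sketch of a one-by-one convolution doing channel reduction: it mixes channels at each pixel without looking at spatial neighbors, so it is a cheap way to shrink (or grow) the channel dimension:

      import torch
      import torch.nn as nn

      reduce = nn.Conv2d(256, 64, kernel_size=1)   # 256 channels -> 64 channels
      x = torch.randn(1, 256, 28, 28)
      print(reduce(x).shape)                       # torch.Size([1, 64, 28, 28])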

  • 00:30:00 In this section, the lecturer discusses advanced computer vision architectures, focusing on convolutional neural networks (CNNs). The lecture emphasizes the use of one-by-one convolutions to maintain input size, as well as the combination of depthwise and pointwise convolutions to increase computational efficiency. The lecture also highlights the importance of learning low-level features in the earlier stages of the classifier and describes the issues with blindly stacking layers. To address these issues, the lecture explains the use of residual connections, which preserve information from earlier layers and lead to higher accuracy with simpler architectures.

  • 00:35:00 In this section, the lecture discusses the concept of residuals in deep convolutional neural networks. In principle, adding more layers should not decrease accuracy, because the extra layers could simply learn the identity transform; in practice, however, each added transformation distorts the earlier representations, leading to vanishing, exploding, and shattering gradients. Residuals address this problem by carrying information from previous stages forward into later computation, making the identity transform easy to learn. The lecture also discusses bottlenecks in residual networks: adding residuals can increase the time to convergence without necessarily improving results, and a remedy is to adjust the size and frequency of the bottleneck.
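
A minimal residual block in the spirit of this description (a sketch, not the exact block from the lecture): because the input is added back to the transformed output, the block computes x + F(x), so learning the identity only requires driving F toward zero.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv2(self.relu(self.conv1(x)))  # F(x), the learned part
        return self.relu(out + x)                   # the skip carries x forward

block = ResidualBlock(64)
x = torch.randn(1, 64, 56, 56)
print(block(x).shape)  # torch.Size([1, 64, 56, 56])
```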

  • 00:40:00 In this section of the lecture, the speaker discusses the importance of being able to learn the identity in computer vision architectures. They explain that if the weights and biases of a residual block are all zero, the block outputs exactly what it takes in; this makes it easy for the network to recognize when it has enough information to make a good classification and to stop learning more complicated features. The speaker also touches on the issue of choosing the number of layers per block, with two being a common choice in the ResNet architecture.

  • 00:45:00 In this section of the lecture, the presenter discusses the issue of vanishing gradients in neural networks and how it affects weight updates. The vanishing gradient problem occurs when the partial derivatives of individual steps in a long multiplication chain become very small (or, in the exploding case, very large), which makes it difficult to update weights consistently. The presentation also covers how batch normalization and residual networks help alleviate the vanishing gradient issue. The lecture then moves on to global average pooling, which is used to replace fully connected layers in convolutional neural networks (CNNs) and relies on generating one feature map for each category in classification tasks.

  • 00:50:00 In this section of the lecture, the speaker discusses how the use of dense layers in neural networks often results in overfitting and reduced performance. To prevent this, they suggest using global average pooling (GAP), which takes the final feature maps, averages each one down to a single value, and feeds the results into a softmax function without introducing any tunable parameters. The speaker also introduces depthwise separable convolutions, which apply a separate spatial filter to each channel before combining the channels intelligently with a pointwise convolution, reducing computation while retaining data from each channel. This technique is particularly important for scaling computation over a wide range of filters in deep neural networks.
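
A sketch of GAP as described: the network's last convolutional layer is assumed to emit one feature map per class, each map is averaged to a single number, and those numbers go straight into the softmax with no new parameters.

```python
import torch
import torch.nn as nn

num_classes = 10
feature_maps = torch.randn(1, num_classes, 7, 7)  # one 7x7 map per class

gap = nn.AdaptiveAvgPool2d(1)            # average each map down to one value
logits = gap(feature_maps).flatten(1)    # shape (1, num_classes)
probs = logits.softmax(dim=1)            # no dense layer, nothing to tune
print(probs.shape)                       # torch.Size([1, 10])
```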

  • 00:55:00 In this section of the lecture, the speaker discusses the MobileNet architecture, which uses depthwise and pointwise convolutions to reduce the number of computations needed for an image. By applying a separate spatial filter to each channel of the image, concatenating the results, and then applying a small pointwise convolution, the same kind of output is achieved while the number of computations drops significantly. The MobileNet architecture has fewer parameters and converges faster while matching the accuracy of Inception V3. The speaker also goes on to discuss squeeze-and-excitation networks, where feature maps are compressed and re-expanded using dense layers and rescaling, which is less computationally intensive.
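
A sketch of the depthwise-plus-pointwise factorization MobileNet relies on (layer sizes here are illustrative). The printed parameter counts show where the savings come from:

```python
import torch
import torch.nn as nn

in_ch, out_ch = 32, 64

# Standard convolution: every one of the 64 filters spans all 32 channels.
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

# Depthwise separable: one 3x3 filter per channel (groups=in_ch), then a
# 1x1 pointwise convolution to mix the channels back together.
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

x = torch.randn(1, in_ch, 56, 56)
assert standard(x).shape == pointwise(depthwise(x)).shape  # same output shape

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))                      # 18496
print(count(depthwise) + count(pointwise))  # 2432 -- roughly an 8x reduction
```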

  • 01:00:00 In this section, the lecturer discusses how the convolution variants covered in the lecture, such as depthwise and one-by-one convolutions, can be used to optimize a convolutional neural network (CNN) model. He also mentions how understanding these optimizations, and the problems certain optimizations face, can help build future networks more efficiently. The lecture concludes with a comparison of the EfficientNet model's accuracy, performance, and model size against other networks, highlighting that there is always a trade-off between these metrics.

CS 198-126: Lecture 7 - Object Detection

The lecture discusses object detection: adding localization to a simple classification CNN, the IoU metric for evaluating detections, the R-CNN system, and optimizing object detection algorithms to minimize processing time with YOLO. The video explains YOLO's approach of chopping an image into a grid, and discusses the challenges with YOLO object detection, including using anchor boxes to eliminate ambiguity. Finally, the YOLO architecture is explored, which is a fully convolutional neural network for object detection, and handling a large number of classes for classification is presented as an ongoing research question. The speaker recommends reading the YOLO paper while advising against the R-CNN paper due to its unreadability.

  • 00:00:00 In this section, the lecturer discusses the process of adding localization to a simple classification CNN for landmark detection. By adding an x and y output, the network can predict the exact location of a specific feature in an image, like an animal's nose. The lecturer then explains how to expand this network by adding more outputs to produce a bounding box for the cat as a whole, explores different ideas for this expansion, and describes the process of training the network on the expanded task.

  • 00:05:00 In this section, the lecturer discusses the IoU (intersection over union) metric for object detection. This approach aims to maximize the overlap between the predicted bounding box and the real bounding box by calculating the area of the intersection and dividing it by the area of the union; the closer this value is to 1, the better the detection. Additionally, the lecturer touches on the challenge of detecting multiple objects in the same image, mentioning the basic solution of an exhaustive search with sliding windows. However, this approach has significant problems, including inefficiency and excessive processing requirements.
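
A minimal IoU computation, assuming boxes are given as (x1, y1, x2, y2) corner coordinates (a common convention; the lecture does not pin one down):

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    # Width and height of the intersection rectangle (zero if disjoint).
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```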

  • 00:10:00 In this section, the speaker discusses R-CNN, a system proposed to address the inefficiency of exhaustive sliding-window search. Its basic idea is to guess likely bounding boxes and run classification only on those, using classical, non-machine-learning algorithms to segment an image and propose a set of candidate boxes for objects. This approach works because edges in an image are likely to be the bounds of a bounding box. The algorithm also uses non-max suppression to remove the redundancy caused by classifying the same object multiple times. However, the system is still slow, because most images yield thousands of different segmentation regions, depending on how the classical algorithm is defined.

  • 00:15:00 In this section, the lecturer explains how to optimize object detection algorithms to minimize processing time. One way is by creating a feature map that extracts key information from the image and then performing classification only on the section of the feature map needed for each object detection, eliminating the need to rerun the full convolutional neural network each time. The lecturer then introduces YOLO, an object detection algorithm that utilizes a single convolutional neural network to output the location and bounding boxes of multiple objects in an image. The architecture of YOLO consists of convolutional layers and a classification layer, allowing for faster processing time and detection of multiple objects at once.

  • 00:20:00 In this section, the video explains how YOLO (You Only Look Once) works by chopping an image into a grid, with each grid cell responsible for one classification vector and bounding box. In theory, this means the number of objects that can be detected equals the number of cells in the grid. Each bounding box is parameterized by x, y, width, and height, where the (x, y) coordinate is the midpoint of the box and the width and height give its size. The video then explains non-max suppression, a process that eliminates overlaps and keeps the bounding box with the highest confidence for each detected object.
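
A sketch of non-max suppression as described here: keep the most confident box, discard boxes that overlap it heavily, and repeat (the 0.5 threshold and the iou helper are illustrative choices, not details from the lecture).

```python
def iou(a, b):
    # Overlap area divided by union area of two (x1, y1, x2, y2) boxes.
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # Visit candidate boxes from most to least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop remaining boxes that overlap the kept one too much -- they
        # are likely duplicate detections of the same object.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2] -- box 1 is suppressed
```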

  • 00:25:00 In this section, we learn about the challenges with YOLO object detection, including the issue of multiple objects being centered in the same cell and how to output multiple classifications and bounding boxes in one cell. The solution to this is to use anchor boxes, where generic bounding boxes are defined before classification, and the dataset is classified based on similarity to these anchor boxes. This allows for a deterministic way to determine which object should be classified into which vector, and eliminates the ambiguity of duplicated bounding boxes.

  • 00:30:00 In this section, the YOLO architecture is discussed, which is a fully convolutional neural network for object detection. The YOLO network performs one pass over the image and is simple in design, eliminating classical components such as sliding windows. By utilizing anchor boxes and other techniques, YOLO matches the accuracy of R-CNN while drastically improving its speed. The concept of anchor boxes is explored further: they are generic geometric shapes matched to objects in an image, and overlapping objects whose best-fitting anchors share the same shape and size are difficult to separate, though there are algorithms that can find mathematically optimal anchor boxes for such cases. Finally, the discussion addresses handling a large number of classes for classification, which is still an open question being explored by researchers.

  • 00:35:00 In this section, the speaker recommends the YOLO paper for those interested in reading more about object detection; on the other hand, the speaker advises against reading the R-CNN paper due to its unreadability. The speaker invites the audience to ask any questions before concluding the lecture.

CS 198-126: Lecture 8 - Semantic Segmentation

The lecture discusses image segmentation, including semantic segmentation and instance segmentation. The main goal of segmentation is to detect all objects in an image and separate them out. The lecturer explains how a convolutional neural network (CNN) can be used for semantic segmentation and how downsampling can help with computationally expensive full resolution images. Different approaches to transform a small volume back into an image size are also discussed. The lecture introduces the U-Net, a model for semantic segmentation that combines previous improvements with skip connections, and explains how it can be expanded to instance segmentation using the Mask R-CNN approach. A pre-trained semantic segmentation model is demonstrated, and the speaker talks about pre-training and upcoming course assignments.

  • 00:00:00 In this section, the lecture covers image segmentation, specifically semantic segmentation and instance segmentation. Semantic segmentation involves finding the specific pixels where an object is present in the image, while instance segmentation identifies where each instance of each class is in the image. The eventual goal of segmentation is to detect all objects in an image and separate them out. This technique is useful because humans perceive objects through a combination of individual components, and being able to identify and classify these components more specifically is essential. Segmentation offers useful applications, such as object detection and identifying object relationships.

  • 00:05:00 In this section, the lecturer discusses the concept of segmentation and how to approach it. The idea is to create connected segments by grouping based on some sort of similarity criteria. The classical approach to this was to define a function to group pixels together based on their similarity in some metric, such as intensity. However, the lecturer mentions newer deep learning approaches, such as sliding windows, which can learn how to perform segmentation instead of using a fixed algorithm.

  • 00:10:00 In this section, the speaker explains how a convolutional neural network (CNN) can be used for semantic segmentation. Instead of running the CNN sliding window approach multiple times, a convolutional operation can be used instead. This allows for the desired effects without the inefficiency of recomputing shared features. The convolutional layer can be run as a filter over the image, and the output layer will map one-to-one with the original image, regardless of size. Padding can also be used to handle cases where the input size is smaller than the filter size.

  • 00:15:00 In this section, the lecturer discusses the issue of downsampling large images to make semantic segmentation more feasible, as processing full resolution images can be computationally expensive. The solution is to gradually downsample the image with each convolution layer and remove redundant information, creating a smaller volume to work with. This downsampled image is then upsampled at the end to create a segmentation map of the original image, with each pixel segmented into a particular class based on the maximum classification output. The lecture also briefly discusses different approaches to padding and treating the edges of images.

  • 00:20:00 In this section, the instructor discusses different approaches to transform a small volume, such as a segmentation map, back into an image the size of the original input image. The classical approach involves scaling up the image and using interpolation functions such as nearest-neighbor or linear interpolation to fill in the extra space, but this can lose some detail. The instructor suggests a learned approach that utilizes deconvolution, which flips the roles of the input and output layers: instead of sliding a filter over the input to produce one output value, each input value is projected through the filter onto a patch of the output. The instructor provides a brief review of convolutions to set up this explanation.
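
A one-layer sketch of this learned upsampling using a transposed ("de") convolution; the kernel size and stride here are illustrative:

```python
import torch
import torch.nn as nn

# A transposed convolution projects each input value, scaled by a learned
# filter, onto a patch of a larger output -- the reverse of how a strided
# convolution collapses a patch of the input into one output value.
deconv = nn.ConvTranspose2d(in_channels=64, out_channels=32,
                            kernel_size=2, stride=2)

x = torch.randn(1, 64, 16, 16)
print(deconv(x).shape)  # torch.Size([1, 32, 32, 32]) -- spatial size doubled
```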

  • 00:25:00 In this section, the lecturer draws an example of a simple input image and walks through the deconvolution method, which looks at each pixel of the input image and uses a filter to project it onto the output image. The lecturer notes a limitation of this method: it can overwrite values that have already been written. To address this, the lecturer introduces an encoder-decoder approach that uses convolutions to downsample the input image into a low-resolution representation and then deconvolutions to upsample it back to its original size. The lecturer notes that this method is beneficial because the upsampling is learned rather than being a fixed classical algorithm, which allows for more adaptability and refinement of shapes.

  • 00:30:00 In this section, we learn about the U-Net, a model for semantic segmentation that combines the previous improvements with skip connections. Skip connections allow information from different levels of the downsampling path to be pulled in while upsampling. There are also different variations on the idea, such as the DeepLab family of models and Transformer-based models like SegFormer. The U-Net approach can also be extended to instance segmentation via Mask R-CNN, which predicts a mask for the object within each detected region.
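
A sketch of a single skip connection in the U-Net style (channel sizes are illustrative): a feature map saved on the downsampling path is concatenated with the upsampled map at the matching resolution, so detail lost during downsampling is available again on the way back up.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 64, 64)

down = nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1)  # 64x64 -> 32x32
up = nn.ConvTranspose2d(32, 32, kernel_size=2, stride=2)     # 32x32 -> 64x64
fuse = nn.Conv2d(32 + 3, 16, kernel_size=3, padding=1)

skip = x                                       # saved on the way down
high = up(down(x))                             # back at input resolution
merged = fuse(torch.cat([high, skip], dim=1))  # the skip connection itself
print(merged.shape)                            # torch.Size([1, 16, 64, 64])
```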

  • 00:35:00 In this section, the lecturer discusses full instance segmentation using Mask R-CNN, which is useful for self-driving cars detecting the outlines of objects such as cars, pedestrians, or backpacks. The lecturer explains how each potential bounding box is squished down into a fixed-size image and classified. The labels for the objects are collected through a variety of methods, such as human annotation or assistive labeling tools. The deconvolution filter is learned in the same way as a convolution filter, and the lecturer explains that deconvolving an image is a projection, where the filter is multiplied by each corresponding pixel value and iterated over the image; it is not actually inverting a convolution.

  • 00:40:00 In this section, the lecturer and a student demonstrate a pre-trained semantic segmentation model that was trained on a cloud GPU. The model is run on an image from Japan, and the results show that it can detect multiple objects with varying accuracy. The student also mentions that a score threshold can be applied to the model to filter out detections with low confidence scores. Overall, the demo serves as an example of how semantic segmentation can be applied to real-world images.

  • 00:45:00 In this section, the speaker talks about pre-training and how the model can be pre-trained on AWS and other offloaded services. They also mention a mandatory ResNet assignment and a recommended optional U-Net assignment that involves segmentation. The due dates and links are on the course website, and students are encouraged to come to office hours with any questions. Overall, this section provides logistical information about the course and upcoming assignments.

CS 198-126: Lecture 9 - Autoencoders, VAEs, Generative Modeling

In this lecture, the concept of generative modeling is introduced, which involves using machine learning to create new images based on a dataset. Autoencoders, a type of neural network used for feature learning, are explained, focusing on their structure and how they learn features of input data through compression and reconstruction. The lecture also covers variational autoencoders and their benefits, as well as the use of structured latent spaces in autoencoders to interpolate between images. The importance of vector quantization for working with discrete data is discussed, and the loss function for a vector-quantized variational autoencoder is explained, which includes a reconstruction loss and a commitment loss to prevent hardcoding of the input data. The lecture ends with a recap of the topics covered.

  • 00:00:00 In this section, the lecture introduces the topic of generative modeling, which involves using machine learning to generate new images based on a dataset. However, it can be a difficult problem for machines to understand what distinguishes different objects, such as cats, from each other. The lecture introduces the concept of autoencoders and how they can be used to compress and decompress images, as seen in the example of JPEG compression. The lecture also touches on the topic of variational autoencoders, which will be discussed in the next lecture.

  • 00:05:00 In this section, the lecturer discusses image compression and how it relies on understanding the data at some level. Compression can save space and reduce the number of bits needed to send an image over a network. The JPEG algorithm works by throwing out some of the higher frequency information and pixel-to-pixel relationships that are not crucial to human perception. The lecturer then suggests that for specific types of images, such as cat images, more advanced compression schemes could be developed with deeper knowledge of how the image is structured beyond just pixel correlations. Overall, compression algorithms highlight the importance of understanding the data in machine learning.

  • 00:10:00 In this section, the lecturer discusses the concept of autoencoders, a type of neural network used for feature learning that uses an encoder-decoder structure to compress the input data and later reconstruct it. He compares it to earlier techniques such as eigenfaces that used PCA for feature extraction, and shows the structure of an autoencoder network. The encoder part reduces the input data into a bottleneck layer, while the decoder part reconstructs it back to its original form. The goal is to minimize the difference between the original and reconstructed data. The lecturer raises the question of why a network that produces the same output as input would be useful, and explains that the key is in the learned features of the data that can be used for other tasks.
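
A minimal autoencoder sketch matching this structure (layer widths are illustrative): the encoder squeezes the input through a small bottleneck and the decoder reconstructs it, trained to minimize the reconstruction difference.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim: int = 784, code_dim: int = 32):
        super().__init__()
        # Encoder: compress the input down to a small bottleneck code.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(), nn.Linear(128, code_dim))
        # Decoder: reconstruct the input from the code alone.
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 128), nn.ReLU(), nn.Linear(128, input_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.randn(16, 784)                    # a batch of flattened images
loss = nn.functional.mse_loss(model(x), x)  # minimize reconstruction error
```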

  • 00:15:00 In this section of the lecture, the instructor explains the concept of the bottleneck layer in autoencoders and how it forces the network to compress the input data and thereby learn some features of the data. He also discusses the limitations of this network structure and the desirable properties of the code, such as its small size and the similarity of codes for similar images. The instructor introduces variational autoencoders, which build on autoencoders but provide the seemingly magical property that sensible results emerge when different operations are applied to the latent vectors. He then discusses the generative framework for image and text generation, which involves sampling the latent vector containing the information of the image or text to be generated.

  • 00:20:00 In this section of the lecture, the speaker discusses the use of latent vectors as a way to represent traits or "genes" in a dataset of faces. The latent vector acts as a sort of probability distribution over possible sets of genes for an individual, and the facial structure is a function of the genes in that set. The encoder takes an input image and produces a latent vector, and the decoder takes that latent vector and produces an estimate of the original image. To impose structure on the latent space, probabilistic encodings are used.

  • 00:25:00 In this section of the lecture, the speaker explains the use of probability distributions to map inputs to a region of possible outputs, forcing nearby vectors in latent space to decode to similar images. This concept is central to variational autoencoders (VAEs), where the encoder outputs the parameters of a Gaussian, describing a disk in latent space from which a code is sampled. The VAE loss includes a reconstruction term and a term that forces the encoder output to look like a normal distribution, preventing the encoder from mapping each input to a lone point and instead encouraging whole regions of latent space to decode to the same output. The speaker notes the contrasting objectives in this loss: the autoencoder would prefer to map every input to a single point, but the additional term forces points to be close to the center of the plane and have some variance, resulting in disks instead of individual points.
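
Under the usual diagonal-Gaussian assumptions, the objective described here can be sketched as follows: the encoder emits a mean and log-variance, a code is sampled via the reparameterization trick, and the loss pairs reconstruction with a KL term that pulls each encoding distribution toward the standard normal.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # Sample z ~ N(mu, sigma^2) in a way gradients can flow through.
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def vae_loss(x, x_recon, mu, logvar):
    # Reconstruction term: the decoded output should match the input.
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # KL term: push each N(mu, sigma^2) toward N(0, 1), so codes form
    # overlapping disks near the origin rather than isolated points.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```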

  • 00:30:00 In this section of the lecture, the speaker discusses the benefits of using structured latent spaces in autoencoders. By enforcing structure in the latent space, autoencoders can effectively interpolate between images, which is shown in a popular online demo where users can slide a slider between two celebrity faces and see the slider interpolate between the two faces in a sensible way. The speaker explains that this is made possible by using a variational autoencoder, which forces the latent vectors to live together in the same space and decode nearby points from nearby vectors. The speaker notes that while the training details of variational autoencoders can be tricky due to the sampling involved, the approach is arbitrary and can be modified to fit various applications.

  • 00:35:00 In this section of the lecture, the speaker discusses how using discrete tokens is necessary for certain fields, such as natural language processing (NLP), since it is difficult to define what it means to alter a word by a certain percentage. As a result, he discusses the use of vector quantization as a hack to extend variational autoencoders (VAEs) to work with discrete tokens. In vector quantization, a code book of valid tokens is used to round any output vectors from the VAE to the nearest token, allowing for better representation of discrete data. However, the selection of the code book remains a challenge.

  • 00:40:00 In this section, the speaker discusses the loss function for a vector-quantized variational autoencoder (VQ-VAE), which is used to learn the locations of code words corresponding to different clusters within a data distribution. The loss function includes a reconstruction loss, which ensures that the output from the decoder is similar to the input, and a commitment loss, which ensures that the vectors output by the encoder stay close to the code words representing the centers of these clusters. To prevent the network from hardcoding the input data, the encoder produces multiple code words for each input, which results in a larger set of code words and allows the network to generate a greater diversity of outputs.
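
A sketch of the quantization step and these loss terms, following the standard VQ-VAE formulation (the stop-gradient placement and the beta weight are conventional choices, not details from the lecture):

```python
import torch
import torch.nn.functional as F

def quantize(z_e, codebook):
    # Round each encoder output vector to its nearest code word.
    dists = torch.cdist(z_e, codebook)        # (num_vectors, num_codes)
    return codebook[dists.argmin(dim=1)]

def vq_vae_loss(x, x_recon, z_e, z_q, beta=0.25):
    recon = F.mse_loss(x_recon, x)                 # decoder output vs. input
    codebook_loss = F.mse_loss(z_q, z_e.detach())  # pull code words to encodings
    commitment = F.mse_loss(z_e, z_q.detach())     # keep encodings near code words
    return recon + codebook_loss + beta * commitment
```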

  • 00:45:00 In this section of the video, the presenter discusses a method for generating new images with a VQ-VAE by sampling a code c and passing it through the decoder, which results in a novel image that has not been seen before. Additionally, the presenter explains that uniform sampling of the codebook elements may not be effective, because some code words occur more often than others in the true data distribution; learning the prior over the code words is therefore useful when generating new data. Lastly, the presenter provides a recap of the lecture, starting with autoencoders, moving to variational autoencoders, and ending with vector-quantized variational autoencoders.

CS 198-126: Lecture 10 - GANs

The lecture on GANs introduces the concept of two networks, the discriminator and the generator, competing against each other in a game theory-esque setup. The generator's input is random noise, to which it assigns meaning in order to generate real-looking images, and the discriminator's job is to judge whether an image is real or fake. GANs use a loss function that corresponds to negative cross-entropy loss, with the generator wanting to minimize it and the discriminator wanting to maximize it. The value function represents how well the generator is doing and must be maximized by the discriminator by correctly classifying fake and real data. The lecture also covers issues with training GANs and the non-saturating loss, which gives the generator more agency to change.

  • 00:00:00 In this section, the lecturer provides a review of latent variables and codes used to compress and map images into more compressed representations. The idea of using autoencoders to generate new images from a latent vector is introduced as well. The lecturer notes the challenge of judging what makes a good, realistic image, which is where GANs (Generative Adversarial Networks) come in to play. With two networks, one generating data and the other trying to determine if it's real or fake, the networks compete against each other in a game theory-esque setup. The discriminator wins when it correctly classifies images, and the generator wins when it fools the discriminator.

  • 00:05:00 In this section, the instructor explains the high-level concept behind GANs, which involves two networks - the discriminator and the generator - competing against one another. Unlike autoencoders, where the bottleneck is in the middle, what sits between the generator and discriminator in a GAN is much higher dimensional. The input to the generator is a random noise vector sampled from a multivariate Gaussian; the generator is fed this latent noise variable and learns to assign meaning to it so that it can generate a whole host of real-looking images. The discriminator and generator networks are trained jointly via gradient descent, alternating between the two, each trying to beat the other.

  • 00:10:00 In this section, the lecturer explains how GANs work by giving a network both real and fake data, training the generator to figure out the patterns that make images look real. The discriminator is the one that judges whether an image is real or fake, and as it learns, it starts to notice patterns and updates its judgment. The hope is that the generator improves itself in response, creating outputs with more coherent shapes and objects that make sense in the context of the scene. The loss function for GANs consists only of a classification loss from the discriminator, and the generator's score is the opposite of it. To train the generator, the discriminator needs to be good at judging images so it can provide useful feedback.

  • 00:15:00 In this section, the lecturer explains the importance of a discriminator being able to classify images accurately to improve the generator. The discriminator may need to be updated more than the generator, so it can discern a meaningful difference between the real and generated images. The lecturer then breaks down the loss function, which corresponds to negative cross-entropy loss, with the generator wanting to minimize it and the discriminator wanting to maximize it. The generator wins when its data looks real, and the discriminator wins when it correctly differentiates the real and fake images. The two networks are in a game-theory scenario where they are competing against each other to progress and get better.

  • 00:20:00 In this section of the video, the presenters explain the concept of a value function in GANs, which is the opposite of the loss function used in traditional machine learning models. The value function represents how well the generator is doing and needs to be maximized by the discriminator by correctly classifying fake and real data. The generator's weights are frozen during the first step so that the discriminator can be trained on batches of real and fake data. During the second step, the discriminator is frozen, and the generator's weights are updated to generate slightly better fake images. This process is repeated until the generator produces realistic images that even the discriminator cannot classify as fake.
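The alternating two-step procedure can be sketched as follows (a minimal sketch assuming netD ends in a sigmoid so its output is a probability, and that netD, netG, and their optimizers are defined elsewhere; labels 1 and 0 mean real and fake). The discriminator is trained to maximize the value function V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], and the generator to work against it:

```python
import torch
import torch.nn.functional as F

def gan_step(netD, netG, optD, optG, real, latent_dim=100):
    batch = real.size(0)

    # Step 1: update the discriminator with the generator frozen
    # (detach() stops gradients from reaching the generator's weights).
    fake = netG(torch.randn(batch, latent_dim)).detach()
    d_loss = (F.binary_cross_entropy(netD(real), torch.ones(batch, 1)) +
              F.binary_cross_entropy(netD(fake), torch.zeros(batch, 1)))
    optD.zero_grad(); d_loss.backward(); optD.step()

    # Step 2: update the generator while the discriminator's weights are
    # held fixed; labeling the fakes as "real" here is the standard
    # implementation trick (the non-saturating objective covered later).
    fake = netG(torch.randn(batch, latent_dim))
    g_loss = F.binary_cross_entropy(netD(fake), torch.ones(batch, 1))
    optG.zero_grad(); g_loss.backward(); optG.step()
    return d_loss.item(), g_loss.item()
```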

  • 00:25:00 In this section, the speaker discusses conditional GANs, which provide a solution for generating images with more control over the classes being generated. The current GAN setup requires repeatedly feeding the generator randomly until the desired object or image is generated, but for data sets with more classes, this approach is not ideal. By appending a one-hot vector to the random noise vector, this allows the generator to have more control over the class being generated. The one-hot vector corresponds to the desired class, and the generator is trained to generate an image with that specific class.

  • 00:30:00 In this section of the lecture, the speaker discusses the idea of incentivizing the generator to use a specific feature in the conditional GAN model. The speaker explains that just telling the generator to generate a specific image isn't enough, since the generator has no incentive to use the given information. The solution is to also provide the discriminator with the same label, creating a strategy for it to identify whether or not a generated image corresponds to its label. This forces the generator to pay attention to the label since it wants to avoid detection by the discriminator, resulting in output that matches the given label. The architecture of both the generator and discriminator is also discussed.

  • 00:35:00 In this section, the lecturer notes that in this scenario the generator's weights would eventually become zero. The generator may also get caught in mode collapse, in which it outputs only a small set of examples that fool the discriminator well. This issue arises because the discriminator learns very sharp decision boundaries, and the generator is incentivized to output those fooling examples repeatedly. Lastly, there are also training-procedure issues with GANs, as their vanilla setup doesn't converge and their loss function becomes flat.

  • 00:40:00 In this section, the lecturer discusses some common issues with GANs, which can make them difficult to train. One issue is that there will always be a trade-off between the generator and discriminator, with the discriminator trying to overfit to specific features in real images, and there is no clear way to know when the GAN is done training. The lecturer then goes over a non-saturating loss, which is a simple reformulation of the generator's objective, and addresses the problem that the generator only gets a small partial derivative when the discriminator recognizes generated images as fake. The non-saturating loss maximizes an alternative term and allows the generator to have more agency to change.

  • 00:45:00 In this section, the lecturer explains the mathematical trick behind the cross-entropy loss used in GANs. Instead of blindly trying to minimize the negative cross-entropy loss, the generator's objective becomes maximizing the probability of its samples being classified as real, using a binary cross-entropy loss. This non-saturating loss gives larger generator gradients, allowing training to proceed more quickly when the discriminator is shutting the generator down. The lecturer notes that this is advanced material, with no quiz or homework attached, but that the staff are available to talk more about advanced GAN training techniques.
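
The difference in gradient magnitude can be checked in a few lines: both objectives want D(G(z)) near 1, but when the discriminator confidently rejects a fake (score near 0), only the non-saturating form still produces a large gradient.

```python
import torch

d_fake = torch.tensor([0.01], requires_grad=True)  # D's score on a fake image

# Saturating (minimax) generator objective: minimize log(1 - D(G(z))).
torch.log(1 - d_fake).backward()
print(d_fake.grad)  # ≈ -1.01: a weak signal when D shuts the generator down

d_fake.grad = None
# Non-saturating objective: maximize log D(G(z)), i.e. minimize -log D(G(z)).
(-torch.log(d_fake)).backward()
print(d_fake.grad)  # ≈ -100: a much larger gradient, so G can keep learning
```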