
 

CS 198-126: Lecture 11 - Advanced GANs

This lecture on Advanced GANs covers techniques for improving the stability and quality of GAN models, including bilinear upsampling, transposed convolution, conditional GANs, StyleGAN, and CycleGAN. The lecture also discusses the use of controlled random noise, adaptive instance normalization, and extending GANs to video. For better stability and results, the lecturer recommends using bigger batch sizes and truncating the range of the random noise at test time, while cautioning against weakening ("nerfing") the discriminator too much. He also suggests sampling latents from a sufficiently broad distribution so the generator produces a variety of images. Finally, the lecture touches on BigGAN, which scales GAN training to very large models and batch sizes.

  • 00:00:00 In this section, the speaker introduces GANs in the context of computer vision and discusses how to construct a GAN architecture for vision tasks. The speaker focuses on the discriminator, which is a standard classification CNN, and the generator, which is more challenging because it must upsample from a latent vector to a full image. The speaker also discusses downsampling and upsampling techniques, including nearest-neighbor upsampling, a naive approach that duplicates every cell of the existing feature map and produces blurry images.

  • 00:05:00 In this section, the lecturer discusses ways to upsample feature maps in GAN generators. He first describes bilinear upsampling, where a larger, empty feature map is filled in by giving each new cell a weighted average of its nearest neighbors in the original map. He then introduces transposed convolution, which pads the input feature map so heavily that, by the time the convolution window has been slid across it, the output feature map ends up larger than the input. The lecturer notes that these are the most common ways to upsample feature maps and are usually sufficient for making feature maps larger inside a generator; a short sketch of both routes follows below.
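
Below is a minimal PyTorch sketch (not taken from the lecture) contrasting interpolation-based upsampling with transposed convolution; the tensor sizes and layer parameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 8, 8)  # a small feature map: batch=1, 64 channels, 8x8

# Interpolation-based upsampling: nearest duplicates cells, bilinear averages neighbors.
nearest = nn.Upsample(scale_factor=2, mode="nearest")(x)                          # (1, 64, 16, 16)
bilinear = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)(x)   # (1, 64, 16, 16)

# Transposed convolution: a learned upsampling whose output is larger than its input.
tconv = nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1)
y = tconv(x)                                                                      # (1, 32, 16, 16)

print(nearest.shape, bilinear.shape, y.shape)
```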

  • 00:10:00 In this section of the lecture, the speaker discusses conditional GANs and how to handle the conditioning in the generator. The inputs to the generator now include both a latent vector and a conditional vector that tells it what to generate. The speaker suggests either concatenating the two vectors directly or processing them separately before concatenating them, and briefly touches on passing multiple inputs to the discriminator as well. The lecture then transitions to StyleGAN, a new generator architecture that injects style information and pre-processes the latent vector before the convolution operations.
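
As a hedged sketch of the two input-handling options just described (concatenate directly, or embed the condition first), with dimensions and layer names chosen purely for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

z = torch.randn(16, 128)                                   # latent vectors
c = F.one_hot(torch.randint(0, 10, (16,)), 10).float()     # class condition (one-hot)

# Option 1: concatenate latent and condition, feed the result to the generator stem.
gen_input = torch.cat([z, c], dim=1)                       # (16, 138)

# Option 2: process the condition separately (e.g. a small embedding), then concatenate.
cond_embed = nn.Linear(10, 32)
gen_input_2 = torch.cat([z, cond_embed(c)], dim=1)         # (16, 160)
```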

  • 00:15:00 In this section, the speaker discusses the need to feed the latent into every layer so that all of the convolutions have access to the style encoded in it, which produces better textures. Texture is largely random noise, so it is much easier to simply give the model explicit sources of randomness. This section introduces the StyleGAN architecture, which splits the generator into two components: a mapping network that pre-processes the latent vector, and a synthesis network that the processed latent feeds into. The mapping network exists to solve the problem of awkward, hard-to-use latent spaces that make certain images difficult to generate: the latent is passed through several dense layers until we have a new, modified latent called W.

  • 00:20:00 In this section, the video discusses the concept of adaptive instance normalization (AdaIN) and how it introduces style into the network. AdaIN replaces batch norm and uses a style vector to dictate how much to rescale and rebias, allowing for more meaningful results. The style vector is passed through one fully connected layer, which is used to rescale and rebias all activations. The final goal is to increase access to randomness by generating a large number of feature maps that are purely random noise.
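
A minimal AdaIN sketch consistent with the description above, assuming a style vector w has already been produced by the mapping network; the single affine (fully connected) layer predicting a per-channel scale and bias is an illustrative implementation choice.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance norm: normalize each feature map, then rescale and re-bias it
    with a scale and bias predicted from the style vector by one linear layer."""
    def __init__(self, style_dim, num_channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        self.affine = nn.Linear(style_dim, num_channels * 2)

    def forward(self, x, w):
        scale, bias = self.affine(w).chunk(2, dim=1)   # each (batch, channels)
        scale = scale[:, :, None, None]
        bias = bias[:, :, None, None]
        return scale * self.norm(x) + bias

x = torch.randn(4, 64, 16, 16)   # activations after a convolution
w = torch.randn(4, 512)          # style vector from the mapping network
out = AdaIN(512, 64)(x, w)       # same shape as x, modulated by the style
```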

  • 00:25:00 In this section, the lecturer discusses the addition of controlled random noise to each feature map, where the network can scale the amount of noise up or down through learned B values. This controlled random noise helps generate better textures and imperfections, enabling the generation of individual hairs and wrinkles. The noise is added after every convolution, and the network controls its magnitude. The lecture also recaps the new innovations in StyleGAN, including the latent vector that is integrated into every layer and the use of adaptive instance normalization to gradually introduce the style.

  • 00:30:00 In this section, the lecturer discusses two advanced GAN techniques: StyleGAN and CycleGAN. StyleGAN generates random faces with vastly improved texture thanks to the injected random noise, while CycleGAN transfers images from one dataset to another. One loss term is dedicated to the realism of converted images, and the other measures whether the image can be restored back to its original state; a sketch of both terms follows below. CycleGAN can take realistic photos and transform them into Monet paintings, turn zebras into horses, and alter the seasons in a picture. While there is no consistency between frames when it is applied to video, the technique can still produce decent results.
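
A hedged sketch of those two loss terms for one translation direction, with hypothetical module names (G_xy and G_yx for the two generators, D_y for the discriminator on the target domain):

```python
import torch
import torch.nn.functional as F

def cyclegan_generator_loss(G_xy, G_yx, D_y, real_x, lambda_cyc=10.0):
    """One direction of the CycleGAN objective: make the translated image fool D_y
    (realism term) and require that translating it back recovers the input
    (cycle-consistency term)."""
    fake_y = G_xy(real_x)
    d_out = D_y(fake_y)
    realism = F.mse_loss(d_out, torch.ones_like(d_out))   # least-squares adversarial loss
    cycle = F.l1_loss(G_yx(fake_y), real_x)               # translate back, compare to input
    return realism + lambda_cyc * cycle
```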

  • 00:35:00 In this section, the speaker explains that videos can be used to train a discriminator to tell real and fake videos apart, but this requires significantly more computation than processing images. Video should be processed frame by frame, and some frames can be skipped to make the process more efficient; a discriminator can also be used to enforce consistency between one generated frame and the next. The speaker advises caution with some GAN demonstrations, such as converting a monkey into a horse, since they do not always work well and the results may not be what is expected. Finally, the speaker discusses how GANs scale when bigger batch sizes and more data are thrown at bigger models, and explains some trade-offs between stability, reliability, variety, and quality.

  • 00:40:00 In this section, the lecturer discusses some tricks for getting better stability and results from GANs. One key to better stability is using bigger batch sizes, which is especially helpful for difficult tasks like GAN training. Another tip is truncating the range of the random noise at test time so the generator never sees inputs outside of its training experience, at the cost of limiting the variety of generated images (a sketch follows below). The lecturer also emphasizes that accepting some instability during training is necessary to achieve good results, and warns against weakening ("nerfing") the discriminator too much.
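
A small sketch of the truncation trick as described, assuming standard-normal latents; the threshold value is an illustrative assumption.

```python
import torch

def truncated_noise(shape, threshold=0.7):
    """Resample any latent entry whose magnitude exceeds the threshold, so the generator
    only sees noise values it was likely to encounter during training. Smaller thresholds
    trade sample variety for more reliable quality."""
    z = torch.randn(shape)
    mask = z.abs() > threshold
    while mask.any():
        z[mask] = torch.randn_like(z[mask])
        mask = z.abs() > threshold
    return z

z = truncated_noise((8, 128))   # 8 truncated latent vectors of dimension 128
```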

  • 00:45:00 In this section, the speaker cautions against using too narrow a distribution for the latent space, since the generator may then produce very similar images repeatedly, making it harder to generate a variety of outputs. The speaker suggests starting from a suitably broad latent distribution to give the model a good initial basis for generating diverse images. They also note that the discriminator can be useful in several ways beyond its role in producing a single random image. Finally, they introduce BigGAN, which scales GAN training to very large models and batch sizes.

CS 198-126: Lecture 12 - Diffusion Models

In this lecture on diffusion models, the speaker discusses the intuition behind diffusion models: predicting the noise added to an image and denoising it to recover the original image. The lecture covers the training process, architectural improvements, and examples of diffusion models generating images and videos. It also goes into depth on latent diffusion models, which compress the data into a latent space so that diffusion runs on the semantic part of the image. The speaker also gives an overview of related models such as DALL-E 2, Google's Imagen, and Facebook's Make-A-Video, as well as the ability to generate 3D models from text.

  • 00:00:00 In this section of the video, the speaker introduces diffusion models, a new class of generative models. They explain that the goal of generative models is to learn the underlying distribution of a given dataset so that new data can be generated from the same distribution. The speaker also mentions two main methods for learning distributions: maximizing likelihood or minimizing a divergence metric. The lecture dives into the math behind diffusion models, and the speaker notes that it will be more mathematically involved than previous ones.

  • 00:05:00 In this section of the lecture on diffusion models, the speaker discusses how both Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) model distributions that mimic the data distribution. Both models work by taking a sample of random noise and converting it into something that looks like it came from the data distribution. Diffusion models, however, take many tiny steps instead of one large step, which creates a Markov chain that is easier to analyze. A diffusion model has a forward process in which noise is added to an image repeatedly, producing progressively noisier versions of it, and a reverse process in which the image is denoised to return to the original image.
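
As a hedged sketch of that forward process, using the standard closed form that jumps straight to any step t; the number of steps and the linear schedule values are illustrative, not the lecture's exact choices.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule (linear, DDPM-style)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product of (1 - beta_t)

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    noise = torch.randn_like(x0) if noise is None else noise
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)      # t: (batch,) of integer timesteps
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```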

  • 00:10:00 In this section of the video, the lecturer explains the reverse process in diffusion models, where a new image can be generated by reversing the sequence of noising steps. The challenge lies in finding the reverse distribution, which is hard to compute exactly, so an approximation is made through the Q function and the P function. The P function is represented by a neural network that tries to learn the mean and variance of the reverse distribution, which is assumed to be Gaussian. The video also covers the training process for a diffusion model, which requires a loss function to minimize (equivalently, a lower bound to maximize).

  • 00:15:00 In this section of the lecture, the speaker applies the variational lower bound to diffusion models, which yields a loss that decomposes into a sum of smaller loss terms. They explain that the terms L_0 through L_{T-1} contribute to the loss, and that the analysis focuses on L_t, defined for t from 1 to T-1. The speaker then shows that the KL divergence between q(x_{t-1} | x_t, x_0) and the distribution the neural network tries to predict reduces to an L2 loss between the learned mean and the mean of the conditional distribution. The authors of the diffusion papers suggest parameterizing μ_θ, the learned mean, in the same form as the mean of q(x_{t-1} | x_t, x_0), which simplifies the expression so that only a single term (the noise) has to be predicted instead of everything inside the red box.
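
For reference, here is a reconstruction of the decomposition being described, in the notation standard in the DDPM literature (this follows the usual convention rather than the lecture slide itself):

```latex
L_{\mathrm{VLB}}
  = \mathbb{E}_q\Big[
      \underbrace{D_{\mathrm{KL}}\!\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{L_T}
      + \sum_{t=2}^{T}
        \underbrace{D_{\mathrm{KL}}\!\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}}
      \;-\;
      \underbrace{\log p_\theta(x_0 \mid x_1)}_{L_0}
    \Big],
\qquad
L_{t-1} \;\propto\; \big\lVert \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \big\rVert^2 .
```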

  • 00:20:00 In this section, the lecturer explains the main intuition behind diffusion models: predict the noise that was added to an image and then denoise it to get the original image back. The objective is to minimize the difference between the actual noise and the predicted noise, and training consists of adding noise to images from the dataset, passing them through the model, predicting that noise, and minimizing the distance between the predicted and actual noise. The model can then synthesize new images by starting from pure random noise and repeatedly denoising it using the predicted noise. The lecturer also notes that x_t, the variable being diffused, does not have to be an image.
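
A sketch of that training step, reusing the q_sample helper and schedule from the forward-process sketch above; eps_model is a hypothetical noise-prediction network that takes the noisy image and the timestep.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(eps_model, x0):
    """Simplified DDPM objective: L_simple = || eps - eps_theta(x_t, t) ||^2."""
    t = torch.randint(0, T, (x0.shape[0],))        # a random timestep for each image
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)                   # add noise in closed form
    return F.mse_loss(eps_model(x_t, t), noise)    # predict the noise, match it
```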

  • 00:25:00 In this section, the speaker discusses the model that diffusion models use to predict the noise, which must produce an output with the same dimensions as the input. One architecture with this property is the U-Net that appeared in the segmentation lecture. The authors of the paper added many modern CV tricks to it, including ResNet blocks, attention modules, group normalization, and swish activations, and showed that it worked very well; a later paper used more time steps to improve the quality further. The speaker also provides an image of the architecture and a link to the slide.

  • 00:30:00 In this section, it is explained that researchers have found a way to improve the results of diffusion models for image generation by modifying the beta parameters that control how much noise is added in the forward process. Instead of a linear schedule, they suggested a cosine schedule that adds noise slowly at first and ramps up later, converting the images to noise more gradually and helping the model learn the reverse process better. Additionally, by learning the covariance matrix with the neural network, it is possible to improve the log-likelihood, which can be viewed as a measure of diversity.
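
A hedged sketch of that cosine schedule, following the improved-DDPM formulation; the small offset s is the commonly used value and is an assumption here.

```python
import torch

def cosine_alphas_bar(T, s=0.008):
    """Cosine schedule for the cumulative alpha_bar_t: noise is added slowly at first,
    then ramps up, instead of following a linear beta schedule."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T) + s) / (1 + s) * torch.pi / 2) ** 2
    alphas_bar = f / f[0]
    betas = 1 - (alphas_bar[1:] / alphas_bar[:-1])   # recover per-step betas from ratios
    return alphas_bar[1:], betas.clamp(max=0.999)

alphas_bar, betas = cosine_alphas_bar(T=1000)
```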

  • 00:35:00 In this section of the lecture, the speaker discusses some architectural improvements that can be made to the U-Net model commonly used across these papers. These improvements include increasing the model size, using attention modules, and adaptive normalization. The speaker also introduces the idea of classifier guidance, which involves training a classifier to predict class labels from both original and noisy images and using its gradient to steer the diffusion model. Finally, the speaker mentions metrics such as FID and precision/recall for measuring the quality of generative models.

  • 00:40:00 In this section, the speaker discusses how the diffusion model has overtaken GAN models in image modeling due to its ability to capture better fidelity and diversity of the data distribution. They show images of the flamingos where the GAN images look very similar, whereas diffusion images show more diversity in their output, indicating better image modeling capabilities. The speaker also mentions that researchers have come up with better ways of guiding the diffusion model through a process called classifier-free guidance, where a conditional diffusion model is trained to avoid trading diversity for increased quality, which is inherent when conditioning the model on some class label.
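
At sampling time, classifier-free guidance as described above amounts to blending a conditional and an unconditional noise prediction; a minimal sketch, assuming a hypothetical eps_model that treats label=None as the unconditional case:

```python
import torch

def cfg_noise(eps_model, x_t, t, label, guidance_scale=3.0):
    """Classifier-free guidance: push the prediction toward the conditional direction.
    Both the model interface and the guidance scale are illustrative assumptions."""
    eps_uncond = eps_model(x_t, t, None)     # unconditional prediction
    eps_cond = eps_model(x_t, t, label)      # class-conditional prediction
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```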

  • 00:45:00 In this section, the lecturer discusses latent diffusion models, another class of diffusion models used when training on high-dimensional images, where it is not feasible to train a large pixel-space diffusion model. The lecturer explains that researchers found most of an image's bits go toward pixel-level detail rather than its semantic content, and that pixel-level detail is not what matters most for generation. To generate images efficiently, the generative model should therefore run on the semantic part of the image. The lecturer gives an overview of how this is achieved: a latent space is learned with an encoder and decoder, the image is compressed into that latent space, and diffusion is run on the latents, so an image can be converted to a latent and back again.

  • 00:50:00 In this section, the speaker discusses several diffusion-related models, including DALL-E 2, image generation through Google's Imagen model, and video generation through Facebook's Make-A-Video; Google has also extended Imagen to generate videos. The speaker mentions the ability to generate 3D models from text, as well as applying diffusion to RL, which achieves state-of-the-art results in offline RL according to a paper released earlier this year. The speaker provides links to papers and resources for further learning.

CS 198-126: Lecture 13 - Intro to Sequence Modeling

In this lecture on sequence modeling, the speaker introduces the importance of representing sequence data and achieving a reasonable number of time steps without losing too much information. Recurrent neural networks (RNNs) are discussed as a first attempt at solving these challenges, which have the ability to handle varying lengths of inputs and outputs. However, issues with RNNs prevent them from performing optimally. Text embedding is introduced as a more efficient way to represent text data, rather than using a high dimensional one-hot vector. Additionally, the concept of positional encoding is discussed as a way to represent the order of elements in a sequence using continuous values, rather than binary ones.

  • 00:00:00 In this section, the speaker introduces sequence models and explains the motivation behind why they are important. In particular, they mention various types of sequence data, such as time series data, audio, and text, and how they are commonly used in computer vision and natural language processing models. The speaker also discusses the importance of representing sequence data and achieving a reasonable number of time steps without losing too much information. Ultimately, the goal is to create language models that can be trained on massive amounts of text data scraped from the internet, which is represented as a tokenized sequence of one-hot vectors.

  • 00:05:00 In this section, the instructor discusses the challenges of representing text data as one-hot vectors and the inefficiency of having one for every single word in a dictionary. The goal of sequence modeling is to handle arbitrarily long data and varying lengths of inputs and outputs. The instructor provides examples of different paradigms, including sentiment analysis and translation, which need to handle variable lengths of outputs. Additionally, long-distance relationships between words in a sentence must be considered when analyzing text data.

  • 00:10:00 In this section, the video discusses the challenges of sequence modeling, which require connecting ideas from various parts of a sentence and handling long-distance relationships across sequences. Recurrent neural networks (RNNs) are introduced as a first attempt at solving these challenges; they do work, but not particularly well, due to issues that prevent them from performing optimally. The video explains that an RNN uses a single cell whose weights are shared across every sequence element, so the exact same weights process each input in turn. Additionally, the output generated by the RNN can be interpreted as anything from a probability to a translation.

  • 00:15:00 In this section, we learn the basic form of a Recurrent Neural Network (RNN): at each time step we take in a sequence element, apply a linear layer to it, apply another linear layer to the output from the previous time step, and combine the two (by stacking or adding them) to produce the output for this step. The tanh function is used to keep the outputs in range and prevent values from blowing up or shrinking to zero during forward or backward propagation. By stacking multiple layers, we can start learning more complex functions; a sketch of a single cell follows below.
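
A minimal sketch of the cell just described; the input and hidden sizes are chosen purely for illustration.

```python
import torch
import torch.nn as nn

class SimpleRNNCell(nn.Module):
    """h_t = tanh(W_x x_t + W_h h_{t-1} + b); the same weights are reused at every step."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.w_x = nn.Linear(input_dim, hidden_dim)
        self.w_h = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, x_t, h_prev):
        return torch.tanh(self.w_x(x_t) + self.w_h(h_prev))

cell = SimpleRNNCell(32, 64)
h = torch.zeros(1, 64)                 # initial hidden state
for x_t in torch.randn(10, 1, 32):     # a length-10 sequence
    h = cell(x_t, h)                   # hidden state carried across time steps
```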

  • 00:20:00 In this section of the lecture, the instructor discusses the challenges and solutions involved in building a sequence model. By applying a tanh function to the output of each cell, the values are kept between -1 and 1, which avoids the large values that can cause problems during repeated matrix multiplications. The model can handle arbitrary input sizes, variable output lengths, and long-distance relationships. The instructor then introduces embeddings as a more efficient way to represent text data than a 100,000-dimensional one-hot vector, and explores ideas such as binary and ternary encoding as possible solutions.

  • 00:25:00 In this section, the speaker introduces the concept of text embedding and how it can be utilized in sequence modeling. Instead of using one-hot vectors for each word in the dictionary, a smaller vector representing the word is learned and fed into the model. This compression of the representation allows for a reduction in dimensionality and creates an embedded vector that resembles a code book. The hope is that these embeddings allow for an intelligent representation of the words, with similar words such as "cat" and "dog" being relatively close, while words with little correlation such as "cat" and "grass" are further apart. Although there is no guarantee that this proximity relationship exists, it can be utilized to make understanding how sentiment analysis and other models are affected by specific word choices easier.

  • 00:30:00 In this section, the lecturer discusses using gradient descent on a code book of embedded vectors to group semantically similar words together. He also mentions the concept of positional encoding, where time elapsed or position in a sequence can be important for certain domains, and discusses a few methods for representing position with one-hot vectors before moving on to what works well, known as positional encoding.

  • 00:35:00 In this section of the lecture, the instructor discusses the idea of using a time stamp in sequence modeling to indicate how far along in the sequence we are. However, using a binary encoding as a time stamp can become limited for larger sequence lengths since it can only represent a limited number of unique time steps. To address this issue, the instructor suggests using a continuous analog by replacing the binary encoding with sine and cosine waves of different frequencies. This way, we can still use a smaller vector to represent a larger number of unique time steps.

  • 00:40:00 In this section, the concept of positional encoding is discussed, which is a way to represent the order of elements in a sequence using continuous values rather than binary values. The process involves evaluating sine and cosine functions at different frequencies for each sequence element and then graphing them to create a continuous analog of binary positional encoding. The resulting graph alternates between high and low values, similar to the binary version, and can be appended to each element in the sequence. The positional encoding can be a bit confusing, but the lecture suggests reviewing the slide decks and experimenting with the concept for a better understanding.
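
A small sketch of the sinusoidal positional encoding described above; the base constant 10000 follows the standard Transformer formulation and, like the assumption of an even model dimension, is not taken from the lecture itself.

```python
import torch

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d)):
    sine and cosine waves of different frequencies act as a continuous analog of a
    binary counter. Assumes d_model is even."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    angles = pos / torch.pow(torch.tensor(10000.0), i / d_model)    # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe   # appended (or added) to each sequence element's embedding

pe = positional_encoding(seq_len=50, d_model=64)
```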

CS 198-126: Lecture 14 - Transformers and Attention

This video lecture on Transformers and Attention covers the concept and motivation behind attention, its relation to Transformers, and its application in NLP and vision. The lecturer discusses soft and hard attention, self-attention, local attention, and multi-head attention, and how they are used in the Transformer architecture. They also explain the query-key-value system, the importance of residual connections and layer normalization, and the process of applying linear layers to the input embeddings to obtain the queries, keys, and values. Lastly, the lecture covers the use of position embeddings and the CLS token in sequence-to-vector examples, while highlighting the computational efficiency and scalability of the attention mechanism.

  • 00:00:00 In this section of the video lecture, the goal is to explain the motivation behind attention and how it relates to Transformer models. Attention is the cornerstone of modern vision Transformers: it lets a model focus its effort on particular locations in the input. The lecturer explains that attention uses a query-key-value system to make more informed decisions about what to pay attention to. The modern attention mechanism is loosely based on how humans read, focusing on specific words in sequence and blurring out everything else.

  • 00:05:00 In this section, the lecturer discusses the concept of attention in machine learning models, specifically in the context of NLP and RNNs. Attention allows models to focus on the important parts of an input, making inferences using a specific subset of data instead of taking in everything as a whole. There are two types of attention: hard attention, which predicts which indices are relevant at a certain time step, and soft attention, which creates a set of soft weights with the softmax function to create a probability distribution based on the input tokens that indicate their importance. Soft attention is generally used and combines the representations of different features. The lecture also discusses the process of translating from French to English as an example of using attention.

  • 00:10:00 In this section, the speaker explains the process of encoding each word and creating a latent representation of the words using a traditional encoder-decoder network that involves sequential processing of the inputs and a context vector for decoding. They then introduce the concept of soft attention, which uses a context vector that takes information from each latent representation to decode based on the previously decoded information. The process involves creating a score function to determine similarities between the previous decoding and the encoding, and using different metrics to come up with a relative importance, providing a probabilistic representation of the relatedness of a query with a bunch of keys.

  • 00:15:00 In this section, the lecturer explains the concept of local attention, which allows for the attention model to query only a certain window of input tokens, rather than all of them, in order to save computational resources. The lecture also delves into using attention for vision, including the use of squeeze and excite networks for channel-wise attention and spatial attention for images. Additionally, the lecture briefly touches on using attention for generating sentences that describe images, such as using convolutions to extract key features and long short-term memory networks to maintain connections between words.

  • 00:20:00 In this section, the lecturer discusses the use of attention in various architectures, including spatial attention and self-attention. Self-attention involves looking up tokens from the same input while paying attention to the relationships between words in a sentence, allowing for a better prediction of the next word based on previous words. The lecturer also introduces Transformers, which use the query-key-value attention system to weight features by different amounts of similarity when selecting what to attend to.

  • 00:25:00 In this section of the video, the lecturer introduces the concept of self-attention and soft attention, which are used in the Transformer model. The idea is to create a probability distribution that focuses on certain features while ignoring others, in order to predict certain relationships. The lecturer then explains how matrices are used instead of one-to-one comparison of queries and keys in Transformer models. The lecture also discusses the limitations of RNNs such as their inability to parallelize and capture long sequences, and how attention can help solve these problems.

  • 00:30:00 In this section of the lecture, the presenter discusses the Transformer architecture and how it uses self-attention to model sequences or groups of tokens. The inputs include a sequence of token embeddings plus positional embeddings, and the goal is to come up with a representation that can be passed through the Transformer model. Multi-head attention calculates the importance of each token based on the queries and keys, and the feed-forward step is applied to every position in parallel, which is one of the Transformer's main advantages. The architecture combines residual connections and layer norms to alleviate vanishing gradients and stabilize the representation. Finally, a linear layer at the end computes the output based on the queries, keys, and values of the different representations.

  • 00:35:00 In this section, the speaker explains the process of applying linear layers to the input embedding of each token to obtain the keys, queries, and values (KQV), using different weight matrices for each, joined through matrix multiplication. A dot product is then taken between the queries and the keys, so each token directly attends to every other token, making the connections between inputs highly scalable. A softmax distribution is computed from the dot-product scores, and the values are re-weighted according to this distribution to produce a final value for each token. The attention scores are scaled by 1/√d to standardize their magnitude and keep the softmax from saturating into very small gradients, and multi-head attention projects each key, query, and value H times. Lastly, dropout is used to prevent overfitting, and a transformation is applied to the resulting vectors before sending them to a feed-forward network; a single-head sketch follows below.
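
A single-head sketch of the scaled dot-product self-attention just described; the dimensions are illustrative, and dropout and the multi-head projection are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionHead(nn.Module):
    def __init__(self, d_model, d_head):
        super().__init__()
        self.q = nn.Linear(d_model, d_head)
        self.k = nn.Linear(d_model, d_head)
        self.v = nn.Linear(d_model, d_head)

    def forward(self, x):                                        # x: (batch, seq_len, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)  # (batch, seq_len, seq_len)
        weights = F.softmax(scores, dim=-1)                      # every token attends to every token
        return weights @ v                                       # weighted sum of the values

out = SelfAttentionHead(d_model=64, d_head=16)(torch.randn(2, 10, 64))  # (2, 10, 16)
```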

  • 00:40:00 In this section of the video, the lecturer explains the attention mechanism in Transformers and the importance of adding residual connections to handle vanishing gradients in deep networks. They also discuss the differences between batch normalization and layer normalization, with layer normalization being used in the attention block to normalize across the feature dimension. The lecturer also explains how the weighted sums of the values produce multiple vectors, which are then passed through a weight matrix to get a single vector that is passed into the feed-forward network. Overall, the lecture gives an in-depth explanation of the attention mechanism and its various components in Transformers.

  • 00:45:00 In this section of the lecture on Transformers and Attention, the speaker explains the implementation of the Transformer architecture, which consists of residual and layer-norm operations as well as a one-by-one convolution. Each multi-layer perceptron is applied in parallel, and the input position embeddings let the model focus on specific windows based on position information. A dummy token is also used in certain NLP tasks to reduce a sequence to a single vector representation.

  • 00:50:00 In this section, the lecture discusses sequence-to-vector examples and the use of CLS tokens. The lecture explains the math behind the attention mechanism, which involves matrix multiplication between the query, key, and value inputs; the result is a weighted sum that represents the attention output. This method is computationally efficient, making it suitable for parallelization on GPUs, and it scales even for large inputs. The lecture concludes by discussing the Transformer architecture and position embeddings, and by noting that Transformers introduce little inductive bias, unlike sequential models.

CS 198-126: Lecture 15 - Vision Transformers

In this lecture, the speaker discusses the use of Vision Transformers (ViTs) for image processing tasks. The ViT architecture downsamples images into discrete patches, which are projected into input embeddings with a linear layer before being passed through a Transformer. The model is pre-trained on a large, labeled dataset before fine-tuning on the target dataset, resulting in excellent performance with less compute than previous state-of-the-art methods. The differences between ViTs and Convolutional Neural Networks (CNNs) are discussed, with ViTs having a global receptive field and more flexibility than CNNs. The use of self-supervised and unsupervised learning with Transformers for vision tasks is also highlighted.

  • 00:00:00 In this section, the speaker discusses the use of Vision Transformers and how they can be applied to images. They explain the concept of tokens, embeddings, and Transformers, providing a concrete example of how they can be used for natural language processing tasks. They then explain how the same architecture can be applied to computer vision tasks by preprocessing the image as a string of tokens and using the Transformer's scalability, computational efficiency, and global receptive fields to process it effectively. The speaker also touches upon the pre-processing of text through tokenization and mapping each word to a vocabulary.

  • 00:05:00 In this section of the lecture, the lecturer discusses how to carry the tokenization and embedding methods used in natural language processing (NLP) over to image processing. Tokenization converts words or phrases into a numerical format that is used to look up embedding vectors, but this is not straightforward for images because color values are continuous, making it difficult to build a lookup table for them. This challenge can be addressed by quantizing the values so that they are discrete, which makes it possible to treat each pixel as a token. The problem of time complexity is addressed by using smaller images and training them similarly to language models.

  • 00:10:00 In this section, the speaker discusses measuring the success of the Vision Transformer model through semi-supervised classification using a limited set of labeled samples. The model is pre-trained on unlabeled samples and then passed through a linear classifier with the output image representations as input. The output embeddings need to be good enough for the classifier to perform well. This technique resulted in competitive accuracy without using labels, and it was also used for image generation. While the model is successful, it requires a significant amount of compute and can only work on 64 by 64 resolution images. The appeal of the Transformer model is its scalability relative to compute, but more efficient means of implementation will be necessary for downstream applications.

  • 00:15:00 In this section, the speaker discusses the architecture of Vision Transformers, a more efficient and general approach to image classification. Instead of quantizing pixels, images are downsampled into patches, which are projected directly into input embeddings using a linear layer; position embeddings and the CLS token are then added before the Transformer. Pre-training is done on a large, labeled dataset before fine-tuning on the target dataset, resulting in excellent performance with much less compute than the previous state of the art. The approach is more general because it has fewer inductive biases; a sketch of the patch-embedding step follows below.
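
A hedged sketch of that patchify-and-project step; the patch size, channel count, and embedding dimension are common ViT defaults and are assumptions here, not values taken from the lecture.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Linear projection of flattened image patches, implemented as a strided convolution."""
    def __init__(self, patch=16, in_ch=3, d_model=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, d_model, kernel_size=patch, stride=patch)

    def forward(self, img):                        # img: (B, 3, H, W)
        x = self.proj(img)                         # (B, d_model, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)        # (B, num_patches, d_model)

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))  # (1, 196, 768) patch tokens
```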

  • 00:20:00 In this section, the differences between Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are discussed. The two main differences between CNNs and ViTs are locality and two-dimensional neighborhood structure. CNNs tend to be biased towards features that are located close to each other due to limitations in the kernel size used for interactions between pixels. On the other hand, ViTs project every pixel to an embedding and allow every token to attend to every other token, regardless of its position in the image, making them less biased towards local features. ViTs also have unique representations for each token and positional embeddings, which affect the resulting representations, making them more flexible and capable of interpolating during fine-tuning.

  • 00:25:00 In this section, we learn about some of the advantages of Vision Transformers (ViTs) over traditional Convolutional Neural Networks (CNNs). ViTs are able to learn better image representations with larger datasets because they don't have biases towards processing images at the beginning, meaning they don't assume a mode of data, unlike engineered biases in CNNs. This is also the reason why ViTs have a trade-off with data, performing worse when there is less data and better with more data. Additionally, ViTs have a global receptive field, allowing for interactions across the entire image, which is not possible with CNNs. Some ViT features like position embeddings and attention representations make it more interpretable in some ways.

  • 00:30:00 In this section, the differences between convolutional neural networks (CNNs) and Vision Transformers are explained. CNNs use convolutional layers with small kernels, which limits how far information can travel at each layer, so interactions between distant tokens happen only toward the end of the network. In contrast, Vision Transformers use a global receptive field in which each token interacts with every other token from the beginning, allowing them to attend to everything. However, Vision Transformers have downsides, such as less fine-grained output due to the use of patches, which leads to issues in fine-grained image classification and segmentation. The goal of having more general models is emphasized, where models learn from data instead of being hand-engineered for specific domains, allowing for easier combination across domains.

  • 00:35:00 In this section, the speaker discusses the advantages of using self-supervised and unsupervised learning with Transformers, particularly in the context of vision tasks. With access to large amounts of unlabeled data from the internet, self-supervised and unsupervised objectives allow for efficient training without the need for annotation. The resulting model can produce representations that retain scene layout and object boundary information, and can be used for image classification and video segmentation tasks. The speaker also highlights the successful use of Vision Transformers in various image classification tasks, demonstrating their ability to scale well with large amounts of data.

  • 00:40:00 In this section, the lecturer discusses how to get from the initial Transformer architectures to the top entries on the leaderboard. They found that representation quality scales with compute time, model size, and dataset size, and that large models are more sample-efficient, meaning they need fewer training samples to reach the same performance. The lecturer also talks about hybrid architectures that combine Vision Transformers and CNNs, and about adding inductive biases back into Vision Transformers by making weight values depend on relative position, to compensate for the translational equivariance that Transformers lack when there isn't enough data.

  • 00:45:00 In this section, the lecturer discusses the use of a learned weight vector in Transformer models for images. This learned weight vector allows for an easier encoding of features that depend only on relative positioning rather than absolute positioning. Additionally, the lecturer presents solutions to the issue of quadratic time with respect to spatial size in Transformers, such as pooling and combining convolutional blocks with Transformer blocks. The Vision Transformer model with its self-supervised training schemes is seen as the next step in transitioning from hand-engineered features to more general models, and it requires a lot of data as Transformers tend to do. The BTS model is scalable and performs well on compute hardware. The lecturer confirms that it is a supervised learning algorithm.

CS 198-126: Lecture 16 - Advanced Object Detection and Semantic Segmentation

In this advanced object detection and semantic segmentation lecture, the lecturer discusses the advantages and disadvantages of convolutional neural networks (CNNs) and Transformers, particularly in natural language processing (NLP) and computer vision. While CNNs exhibit a strong bias toward texture, Transformers handle both NLP and computer vision tasks efficiently by using self-attention layers to tie important concepts together and focus on specific parts of the input. The lecture then delves into Vision Transformers, which prioritize shape over texture, making them resilient against distortion. He further explains the advantages and limitations of the Swin Transformer, an improved version of the Vision Transformer that excels at image classification, semantic segmentation, and object detection. The lecture emphasizes the importance of generalizability in models that can handle any kind of data, and potential applications in fields like self-driving cars.

  • 00:00:00 In this section, the lecturer outlines the plan for the day's lecture, which includes a review of CNNs and Transformers and their advantages and disadvantages. The lecture will also cover NLP contexts, such as BERT, and how embeddings are generated, then move on to Vision Transformers and compare them to CNNs. The Swin Transformer, an improvement upon Vision Transformers for computer vision applications, will be discussed, including window attention, patch merging, and shifted window attention with positional embeddings. The lecture may also cover advanced segmentation methods, time permitting.

  • 00:05:00 In this section of the lecture, the speaker discusses CNNs and their translational equivariance, meaning that they adhere to a two-dimensional neighborhood structure and capture information at different points depending on the stride distance. The speaker also points out that CNNs have shown a propensity for textural bias over shape and that texture augmentation can affect their performance. The speaker then transitions to Transformers in NLP and how attention allows important things in a sentence to be tied together and certain parts of the input to be focused on. Self-attention in Transformers allows this within a sentence, emphasizing the importance of prior words encountered.

  • 00:10:00 In this section, the video discusses how self-attention layers use queries, keys, and values to calculate attention and weight information based on similarity or difference. The section also introduces Vision Transformers, which apply the Transformer model to both NLP and computer vision by flattening images into 16x16 patches and passing them through a linear layer to generate embeddings. The positional information is learned by the model, and a multi-layer perceptron classifies the output. The section compares Vision Transformers to CNNs and points out that the self-attention layers are global, while only the MLP compares neighboring pixels. The Transformer inside the Vision Transformer does not differentiate between image and word inputs and is generalizable across a range of tasks.

  • 00:15:00 In this section of the lecture, the concept of inductive bias in machine learning models is discussed. Inductive bias refers to the assumptions that a model makes about the data it has been trained on and reducing this bias allows for a model to be more generalizable. It is important to have models that can be applied to multiple tasks without assuming prior knowledge. While CNNs outperform Transformers on smaller data sets, the Vision Transformer model (ViT) performs better on larger and more complex data sets as it models human eyesight better by prioritizing shape over texture. Adversarial robustness is also introduced as a metric where images are distorted by introducing noise so that certain classifiers are no longer able to classify them.

  • 00:20:00 In this section, the limitations of Vision Transformers in image restoration and semantic segmentation are discussed. When patches are passed and processed one at a time, border information can be lost, and fine-grained pixel analysis within a patch is weak, as information that belongs to one patch is treated as the same. However, unlike CNNs that prioritize texture over shape, Vision Transformers prioritize shape over texture, making them naturally robust against visual distortions, even when targeted noise is added to an image. The extraction of patches is a problem unique to images, and for larger images, the number of image tokens generated will rapidly increase.

  • 00:25:00 In this section, the lecturer discusses the problems with using typical vision Transformers for object detection and segmentation, particularly when processing larger images as it requires a lot of processing power. However, a solution was introduced with the shifted window Transformer, which uses non-overlapping windows to perform self-attention within groups and then combines them together to perform cross attention. This allows for cross-window attention connections, resulting in a linear computational complexity instead of N-squared, as the size of the patches remains the same while they are combined. This method of image segmentation is commonly used in self-driving technologies.

  • 00:30:00 In this section, the Swin Transformer is introduced, a model that excels at image classification, object detection, and semantic segmentation. The Swin large-patch model has a patch size of 4, an embedding capacity of 192, and a window size of 7, and is trained on ImageNet-22k and fine-tuned on ImageNet-1k. The model uses a windowed multi-head attention layer and a shifted window attention layer, along with an MLP whose hidden layers use a GELU activation function. The output of the window MSA is passed through a layer norm to normalize the intermediate layers' distributions before entering the MLP.

  • 00:35:00 In this section, the speaker discusses the benefits of using layer norm when training models for object detection and semantic segmentation. Layer norm applies a smoothing effect to the gradient surface, resulting in faster training and better generalization accuracy. The speaker compares layer norm to other techniques like batch norm and explains how it operates on the intermediate layers of the process. The discussion then shifts to windowed multi-head self-attention (W-MSA) blocks, which perform self-attention within each window of an image; because the number of patch vectors in each window is fixed, complexity is linear in the image size, unlike the quadratic complexity of ViT. Stage two involves a patch-merging step in which neighboring blocks of patches are concatenated into a smaller grid, creating new patch borders and remade windows. A sketch of the window partitioning follows below.
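
A hedged sketch of the window partitioning behind W-MSA: the feature map is cut into fixed-size, non-overlapping windows and self-attention is restricted to each window, so the cost grows linearly with image size. The shapes and window size here are illustrative.

```python
import torch

def window_partition(x, window=7):
    """(B, H, W, C) -> (num_windows * B, window * window, C): tokens grouped per window,
    ready for self-attention that is restricted to each window."""
    B, H, W, C = x.shape
    x = x.view(B, H // window, window, W // window, window, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)

windows = window_partition(torch.randn(1, 56, 56, 96))  # (64 windows, 49 tokens, 96 channels)
```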

  • 00:40:00 In this section of the lecture, the presenter explains the Swin Transformer's solution to the increase in the number of windows created after the windows are shifted. Swin Transformer cleverly combines these windows by rearranging blocks so that there are only four windows, reducing the total number of elements from 64 to 16 while keeping the total amount of information consistent. The optimization technique involves a cyclic shift, and a linear layer is used to increase the depth, or the "C" dimension of the embedding size, after the patches are merged. This technique saves compute and avoids the naive solution of zero-padding before performing attention.

  • 00:45:00 In this section, the speaker discusses two optimizations proposed by the authors to improve the efficiency of image processing. The first optimization involves shifting an image to a certain part before calculating the attention, then moving it back while marking that it has already been calculated. This optimizes compute power by avoiding the need to perform an entirely new operation to get the desired values. The second optimization is through positional embeddings that learn patch position information instead of being provided explicitly, limiting the scope of attention that needs to be calculated. These optimizations, along with the use of bias vectors and channel size manipulations, help in the performance of self-attention calculations in image processing.

  • 00:50:00 In this section, the lecture discusses the process of merging patches in stages two, three, and four of the Swin transformer model. By reducing the dimensionality of the patches, they are reduced by one-fourth to reach 3136 patches, and the encoding size is doubled to get 384 encodings. The process is repeated in stages three and four and the last component in the process is an average-pooling layer, followed by a classification head. The lecture raises concern over the reintroduction of inductive bias through the use of similar approaches to CNNs, but studies have shown that Swin models perform well in terms of corruption robustness and have a lower shape bias than Vision Transformers. The genericness of the Transformer architecture allows capturing patterns accurately regardless of data type or domain, and more data results in better performance.

  • 00:55:00 In this section, the lecturer explains the benefits and drawbacks of having a model that can take in any kind of data, process it, and pull out patterns, known as generalizability. The idea of a general artificial intelligence model that can handle any input/output is discussed, and the potential applications in fields such as self-driving cars are explored. The lecturer also notes that the field of adversarial robustness is still developing and that further testing is needed to determine the efficacy of models such as Swin against more advanced adversarial attacks.

CS 198-126: Lecture 17 - 3-D Vision Survey, Part 1

The video discusses different 3D visual representations and their pros and cons, including point clouds, meshes, voxels, and radiance fields. The lecture also covers raycasting, both forward and backward, as well as colorizing and rendering images for objects that intersect with each other, with different approaches for solids and transparencies. The lecturer touches on the limitations of differentiable rendering and on how radiance fields define a function that gives each XYZ point a density and a color, making the representation more learnable.

  • 00:00:00 In this section, the lecturer discusses the need to extend computer vision to 3D, since the real world is three-dimensional. There are limitless applications for 3D, such as self-driving, shape optimization, virtual environments, avatar generation, and more. Different methods for 3D representation are then presented, including 2.5D, point clouds, meshes, voxel grids, and radiance fields. The lecture then delves into the pinhole camera model, which is important for understanding how imaging works and, subsequently, how to render 3D objects in space for simulation.

  • 00:05:00 In this section of the lecture, the concept of forward tracing and backtracing is introduced as a means to determine the position of a camera in a scene. The lecturer also discusses RGB-D (2.5D) images and how they contain depth information that can be used to generate point clouds, which can then be used to create meshes of a surface. The benefits and limitations of using point clouds for mesh creation are also explored.

  • 00:10:00 In this section, the lecturer describes different representations for 3D objects. They start by discussing mesh structures and how they are difficult to work with in machine learning settings due to the lack of techniques for working with graphs. The lecture then introduces voxels as a discrete 3D space structure made up of small cubes or "Legos" that can represent objects in a binary or translucent way. However, using voxels at high resolutions can be prohibitive due to computational complexity. The lecture concludes by presenting radiance fields, a function that outputs RGB colors and density at specific XYZ coordinates, as a solution for representing high-frequency details in 3D objects.

  • 00:15:00 In this section, the lecturer discusses different 3D representations, including point clouds, meshes, voxels, and radiance fields. Each type has its pros and cons, and it's essential to choose the right representation for a particular task. After discussing 3D representations, the lecture moves on to raycasting and the two types of raycasting: forward and backward. Forward raycasting is useful for rendering point clouds since it allows us to see every point in the scene. Conversely, backward raycasting is more suited for rendering meshes or voxel grids since it allows us to see the surface that intersects the ray first.

  • 00:20:00 In this section of the video, the speaker discusses the process of colorizing and rendering images for different objects that intersect with one another. This is done by computing ray-triangle intersections for every ray, which can be made efficient. If objects are translucent, the process involves considering not just the color of the first point intersected but also the densities of the first and second points. For regions with no surfaces, such as smoke, ray sampling is used: different points along the ray are sampled and the radiance field, a function that outputs an RGB color and a density for each point, is queried at each of them. These sets of colors and densities are then aggregated using volumetric rendering to produce a single pixel value.

  • 00:25:00 In this section, the lecturer discusses differentiable rendering and its limitations. While everything discussed in rendering is differentiable, it is only differentiable for the visible surfaces we see in the rendered image. Radiance fields solve this problem, since every sampled point has an impact on the final color and therefore receives some gradient. The lecturer also mentions that radiance fields have existed for a while as a way to assign every XYZ point a density and a physical color. Next, the lecturer will discuss modeling this function as a neural network to make radiance fields learnable.

  • 00:30:00 In this section, the speaker briefly mentions a delay in the Transformers homework by one week, but does not provide any context or explanation.

CS 198-126: Lecture 18 - 3-D Vision Survey, Part 2

In this lecture on 3D vision, the instructor discusses radiance fields, specifically Neural Radiance Fields (NeRFs), which take in a position in space and output a color and a density. The speaker explains the rendering process, which involves querying from the camera's perspective and using this black-box function to figure out what the image will look like. The lecture discusses the challenges of representing consistent perspectives of objects in 3D vision and the use of MLPs that take in an object's XYZ position and view direction and output density and RGB information. The lecture also covers the challenges of volumetric rendering and using NeRF derivatives to improve computer vision. The instructor ends by demonstrating the use of space contraction to generate realistic 3D images using a neural network.

  • 00:00:00 In this section of the lecture, the instructors discuss radiance fields, specifically NeRFs (Neural Radiance Fields), which take in a position in space and output a color and a density. Rendering involves querying from the camera's perspective and using this black-box function to figure out what the image will look like. The color of a pixel is a weighted average over all the samples along a ray, where each sample's weight is proportional to its density and inversely related to how much material sits between it and the camera. The instructors give examples to explain the intuition behind radiance fields, including how the closest object to the camera contributes the most to the color and how density affects the weights; a sketch of this weighting follows below.
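
A minimal sketch of that weighting, following the standard volume-rendering formula; the sample colors, densities, and spacings below are placeholders rather than values from the lecture.

```python
import torch

def render_ray(colors, densities, deltas):
    """colors: (N, 3), densities: (N,), deltas: (N,) distances between samples along a ray.
    Standard volume rendering: weight_i = T_i * (1 - exp(-sigma_i * delta_i)), where T_i is
    the transmittance (how little material lies in front of sample i)."""
    alphas = 1.0 - torch.exp(-densities * deltas)                            # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alphas + 1e-10])[:-1], dim=0)
    weights = trans * alphas                                                 # visibility-weighted opacity
    return (weights.unsqueeze(-1) * colors).sum(dim=0)                       # final pixel color

pixel = render_ray(torch.rand(64, 3), torch.rand(64) * 5.0, torch.full((64,), 0.02))
```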

  • 00:05:00 In this section, the speaker explains how to fit a neural radiance field that generates new views of an object from multiple images of that object. The goal is a neural radiance field that can be queried at points in the scene to render new images. However, obtaining the ground-truth camera positions and directions required for this can be difficult and time-consuming. There are programs that can help with this process, but the speaker notes that relying solely on these tools may be considered cheating.

  • 00:10:00 In this section, the lecturer discusses the use of 3D vision for generating new views of a scene. They explain that learning a neural Radiance field allows for shape consistency across different views, which is important for rendering new views of an object with deep learning. Without this bottleneck, it is difficult to ensure consistency, as shown in an example with StyleGAN that produced inconsistent shapes across different views. The lecturer argues that learning a 3D representation of an object is necessary to generate new views of the object with consistent shape.

  • 00:15:00 In this section, the speaker discusses the challenges of representing consistent perspectives of objects in 3D vision. Radiance fields are explained as a way of capturing fine, view-dependent details in an object's appearance, such as glare and reflections from different angles, which would be difficult to capture otherwise. The speaker details how the model takes in both a position and a viewing direction to produce a more accurate representation of the object being observed, and explains the use of separate density and color outputs of the MLP to represent these different aspects of the object.

  • 00:20:00 In this section, the speaker discusses the use of MLPs (dense neural networks) that take in an object's XYZ coordinates and view direction and output density and RGB information. The network uses positional encoding to allow sharp decision boundaries, which improves the crispness of the reconstructed image. The speaker uses an analogy to binary representations and logic gates to explain how this encoding enables sharp changes and high-frequency detail in the reconstruction (see the positional encoding sketch after this list), and offers a more in-depth explanation of positional encoding if needed.

  • 00:25:00 In this section, the speaker goes into more detail about implementing a NeRF (neural radiance field) model, including positional encoding for sharp boundaries and view dependence for effects like glare and reflection. The speaker also discusses optimizing the sampling process in two rounds, with a separate MLP learning the finer details near edges. Additionally, the speaker explains the loss function used for training, which compares rendered RGB values against ground-truth images over a limited number of rays per batch due to GPU limitations. There is no direct loss on density, but the network still learns correct densities through the indirect relationship between density and color correctness.

  • 00:30:00 In this section of the lecture, the speaker talks about how volumetric rendering requires the correct color and density to produce accurate predictions. With enough cameras, different points on the object can be triangulated, and the easiest way for the network to achieve low loss is to output the correct color and a high density at the point of intersection. The speaker also showcases a project they are working on that uses pre-processing scripts and a library called nerfacto for real-time rendering during training, noting that pre-processing is difficult and can sometimes produce incorrect camera directions.

  • 00:35:00 In this section, the speaker discusses the challenges of capturing scenes that extend in all directions. The video focuses on NeRF derivatives that contract the space around a scene, making it easier for the network to learn good values. A bounding box around the scene constrains the space so that the network only receives coordinates between -1 and 1, and a contraction formula maps any point in space onto a unit ball, making unbounded scenes easier for the network to learn (see the contraction sketch after this list).

  • 00:40:00 In this section of the video, the speaker demonstrates the use of space contraction to generate realistic 3D renderings with a neural network. He showcases a reconstruction of the Campanile and explains that the network gets progressively worse near the edge of the training data. The speaker also mentions recent advances that cut the time to generate these 3D reconstructions from days to seconds. Although he did not have enough time to discuss why the density function is learnable, he offers to discuss it with the audience after the lecture.
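The 00:00:00 bullet above describes the rendered color as a weighted average of samples along a ray, with weights driven by density and by how much matter sits in front of each sample. Below is a minimal NumPy sketch of that volumetric compositing step; the sample spacing, variable names, and exact discretization follow the standard NeRF formulation and are assumptions rather than something quoted from the lecture.

```python
import numpy as np

def composite_ray(colors, densities, deltas):
    """Standard NeRF-style compositing of per-sample (RGB, density) into one pixel.

    colors:    (N, 3) RGB at each sample along the ray
    densities: (N,)   sigma at each sample
    deltas:    (N,)   spacing between consecutive samples
    """
    alpha = 1.0 - np.exp(-densities * deltas)              # opacity of each segment
    # Transmittance: how much light survives everything in front of sample i.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha                                 # contribution of each sample
    rgb = (weights[:, None] * colors).sum(axis=0)           # weighted average -> pixel color
    return rgb, weights

# A dense red sample close to the camera dominates a farther green one.
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
densities = np.array([5.0, 5.0])
deltas = np.array([0.5, 0.5])
print(composite_ray(colors, densities, deltas))
```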
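The 00:20:00 bullet mentions positional encoding as the trick that lets the MLP produce sharp, high-frequency detail. Here is a small sketch of the sinusoidal encoding used in the original NeRF paper; the number of frequency bands is an illustrative choice, not a value quoted from the lecture.

```python
import numpy as np

def positional_encoding(x, num_bands=10):
    """Map each coordinate to sines/cosines at exponentially growing frequencies,
    so nearby inputs can still produce very different (high-frequency) features."""
    x = np.atleast_1d(np.asarray(x, dtype=np.float64))
    freqs = 2.0 ** np.arange(num_bands) * np.pi            # pi, 2*pi, 4*pi, ...
    angles = x[..., None] * freqs                           # (..., num_bands) per coordinate
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)                   # flatten coords x (2 * num_bands)

xyz = np.array([0.1, -0.4, 0.75])
print(positional_encoding(xyz, num_bands=4).shape)          # (24,) = 3 coords * 2 * 4 bands
```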
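The 00:35:00 bullet describes a formula that maps any point in space onto a unit ball so that unbounded scenes stay in a range the network can handle. The lecture does not spell out its exact formula, so as an assumption the sketch below uses one widely used contraction of this kind (the mip-NeRF 360 style contraction, which squashes all of space into a ball of radius 2; rescaling by 1/2 would give the unit ball described in the bullet).

```python
import numpy as np

def contract(x, eps=1e-9):
    """mip-NeRF 360 style contraction: points with |x| <= 1 are unchanged,
    everything farther away is squashed into the shell between radius 1 and 2."""
    x = np.asarray(x, dtype=np.float64)
    norm = np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)
    return np.where(norm <= 1.0, x, (2.0 - 1.0 / norm) * x / norm)

print(contract(np.array([0.3, 0.0, 0.0])))    # unchanged: already inside the unit ball
print(contract(np.array([100.0, 0.0, 0.0])))  # ~[1.99, 0, 0]: far points approach radius 2
```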
CS 198-126: Lecture 18 - 3-D Vision Survey, Part 2
  • 2022.12.03
  • www.youtube.com

CS 198-126: Lecture 19 - Advanced Vision Pretraining



CS 198-126: Lecture 19 - Advanced Vision Pretraining

This video covers various techniques used for self-supervised pretraining in advanced vision, including contrastive learning, denoising autoencoders, context encoders, BYOL, and the MAE network. The speaker provides an overview of each method, discussing its strengths and weaknesses, and highlights the benefits of combining contrastive and reconstruction losses in a single objective, which outperforms either approach on its own. The video provides useful insights into the latest research trends in self-supervised learning and their potential to improve the performance of computer vision models.

  • 00:00:00 In this section, the instructor introduces the concept of self-supervised learning (SSL), a branch of unsupervised learning that creates labels from datasets that have no labels associated with them. This approach is useful when working with small datasets or when pre-training models on large and diverse datasets to extract representations that can be transferred to downstream tasks. The instructor also uses an analogy from Yann LeCun to explain how SSL provides more supervision than unsupervised learning and less than supervised learning, making it a valuable approach for many tasks in computer vision.

  • 00:05:00 In this section, unsupervised learning is framed as the base of intelligence in the context of computer vision: self-supervised learning, which creates labels from the data itself, is presented as the main form of learning, while supervised learning and reinforcement learning are only small parts of the process. Contrastive learning is introduced as a popular self-supervised approach that uses similarity as its optimization goal, and the objective of the loss function is explained: pull the embedding of the positive sample as close as possible to the embedding of the input, while pushing the embedding of the negative sample farther away from the input embedding.

  • 00:10:00 In this section, the video explains the triplet loss used to train face-recognition networks and how it can be improved with a contrastive loss function. The contrastive loss tackles the problem that pushing the input away from every possible negative sample is not feasible, since there are far too many negatives. The implementation of this loss resembles a classification problem in which the positive sample serves as the correct label and all negative samples serve as incorrect labels (see the contrastive-loss sketch after this list). The video then introduces the MoCo algorithm, which frames contrastive learning as a differentiable dictionary look-up, allowing all keys and queries to be collected in one place.

  • 00:15:00 In this section, the presenter explains the mechanics of contrastive learning and how similarity is defined through neural networks. Similarity is defined by passing augmented views of the same sample through the same network, known as instance discrimination. To create good representations for downstream tasks, the key and query should come from the same network, so using several independent networks is not very useful; instead, a huge pool of negatives is needed to encourage better representations. However, keeping a huge pool of negatives in every batch is computationally impractical, which limits the batch size. The presenter then discusses the idea of pre-computing all the keys and queries from a single model.

  • 00:20:00 In this section of the lecture, the speaker discusses pre-computing embeddings and storing them in a queue while training a single network that is updated over time. The queue helps maintain consistency across time and avoids storing embeddings from very far back in training. However, this only solves the problem of computing embeddings in the forward pass, not the backward pass. The speaker suggests updating the key encoder with a moving average of the query and key encoders' weights, so that the key encoder's weights change slowly while staying consistent (see the momentum-update sketch after this list).

  • 00:25:00 In this section of the video, the presenter discusses the MoCo and SimCLR models, both contrastive learning methods for producing good image representations without labels. MoCo uses a key encoder that is updated slowly over the course of training to produce representations that can be used for downstream tasks. SimCLR simplifies this by using a single encoder and passing the embeddings through a small MLP projection head, yielding even better results. This eliminates the need to maintain moving averages or separate networks, and it has become a popular contrastive learning method in deep learning research.

  • 00:30:00 In this section, we learn about the SimCLR model, a self-supervised method for training image representations. The model uses a contrastive loss with temperature scaling on the computed embeddings, defining similarity so that two augmented views of the same image are similar and views of different images are not. The data augmentation techniques used in the model are shown, and, surprisingly, color-based augmentations produce the best results. Longer training and larger batches also give better results. SimCLR was the first self-supervised method to beat a fully supervised baseline on image classification, and it achieves strong results when fine-tuned with just 1% or 10% of the ImageNet labels.

  • 00:35:00 In this section, the BYOL method for pre-training vision models is covered. The method applies different data augmentations to an input image to generate different views, passes them through encoder networks, and projects the resulting representations through a small network to obtain projections c and c'. The method is not strictly a contrastive learning method like SimCLR, but rather combines elements of SimCLR and MoCo into a single objective. The approach relies on bootstrapping: it maintains two different networks and fits one model on targets produced by the other, instead of using targets derived directly from the data set.

  • 00:40:00 In this section, we learn about bootstrapping as it is used in deep Q-learning and deep reinforcement learning more broadly, which was the inspiration for BYOL: the second network drives the supervision for the first network and vice versa. Through this bootstrapping process, the network builds up better and better representations, and because it is not contrastive, it is robust to changes in batch size and augmentation types. BYOL works well even with smaller batch sizes and beats SimCLR on the same benchmarks. We then move on to the second class of methods, in which the input is destroyed and the model must reconstruct the original image; these methods work well with an autoencoder-based structure. The presentation introduces the denoising autoencoder, where noise is added to an image and the goal is to predict the clean image. The stacked denoising autoencoder was very popular because it works well, and the network learns something meaningful even from corrupted inputs.

  • 00:45:00 In this section, the speaker discusses the difficulties of training neural networks in the past and how denoising autoencoders (DAEs) were used as a workaround. The lecture then moves on to masking out part of an image and predicting the hidden region, known as the context encoder. The method, introduced in 2016 by a Berkeley lab, achieved good results on detection and segmentation but not on classification. The speaker reviews the implementation of the context encoder and how adding a discriminator to the objective led to better representations.

  • 00:50:00 In this section, the MAE (masked autoencoder) network is discussed, which uses a Transformer backbone in contrast to the CNN backbones of the other methods. The CNN is replaced with a ViT, and the objective is the same as the context encoder's: patches of the image are masked out and only the unmasked patches are passed to the encoder (see the masking sketch after this list). The encoded embeddings are then passed to a decoder whose goal is to reconstruct the original image, which forces the model to learn meaningful features. The network is illustrated with several examples from the MAE paper, and the class token, which captures information about the entire sequence, can be used for classification.

  • 00:55:00 This section focuses on self-supervised pretraining that mixes contrastive learning with autoencoder-based reconstruction, which outperforms either strategy on its own. The methods are combined through a new loss function that balances the contrastive and reconstruction losses. It is a promising approach that shows the potential for improving self-supervised methods, and understanding the underlying reasons for these results is an active area of research.

  • 01:00:00 In this section, the speaker discusses a recently released model, referred to as MasS, that combines image reconstruction and contrastive learning in a single model. It generates two views of the same image, masks out different regions of each view, and adds noise to them, thereby incorporating a denoising objective as well. The loss it uses combines the contrastive, reconstruction, and denoising losses, resulting in better performance than previous models. The speaker notes that there are many other representation-learning models that work well and that the field is currently a hot area of research.
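The 00:10:00 bullet above frames the contrastive loss as a classification problem in which the positive key is the "correct label" among many negatives. Here is a minimal PyTorch sketch of that InfoNCE-style loss; the temperature value and tensor shapes are illustrative assumptions rather than details from the lecture.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query, pos_key, neg_keys, temperature=0.07):
    """Contrastive loss posed as a (1 + K)-way classification problem.

    query:    (B, D) embeddings of the anchor views
    pos_key:  (B, D) embedding of the positive key for each query
    neg_keys: (K, D) a pool of negative keys shared across the batch
    """
    query = F.normalize(query, dim=-1)
    pos_key = F.normalize(pos_key, dim=-1)
    neg_keys = F.normalize(neg_keys, dim=-1)

    pos_logit = (query * pos_key).sum(dim=-1, keepdim=True)  # (B, 1)
    neg_logits = query @ neg_keys.T                           # (B, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature

    # The "correct class" is always index 0, i.e. the positive key.
    labels = torch.zeros(query.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

loss = info_nce_loss(torch.randn(8, 128), torch.randn(8, 128), torch.randn(4096, 128))
print(loss.item())
```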
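The 00:20:00 bullet describes updating the key encoder with a moving average of the query encoder's weights so that it changes slowly. A minimal PyTorch sketch of that momentum update follows; the momentum coefficient and the tiny stand-in encoder are placeholders, not values from the lecture.

```python
import copy
import torch
import torch.nn as nn

# Placeholder encoders; in MoCo-style training these would be full CNN/ViT backbones.
query_encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
key_encoder = copy.deepcopy(query_encoder)   # start from the same weights
for p in key_encoder.parameters():
    p.requires_grad_(False)                  # the key encoder is never updated by gradients

@torch.no_grad()
def momentum_update(query_enc, key_enc, m=0.999):
    """key_weights <- m * key_weights + (1 - m) * query_weights."""
    for q_param, k_param in zip(query_enc.parameters(), key_enc.parameters()):
        k_param.mul_(m).add_(q_param, alpha=1.0 - m)

# Called once per training step, after the optimizer updates the query encoder.
momentum_update(query_encoder, key_encoder)
```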
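The 00:50:00 bullet describes masking out patches and passing only the unmasked ones to the encoder. Below is a small sketch of that random-masking step on a sequence of patch tokens; the 75% mask ratio matches the MAE paper's default, while the tensor shapes and function name are illustrative assumptions.

```python
import torch

def random_masking(patch_tokens, mask_ratio=0.75):
    """Keep a random subset of patch tokens; the rest are hidden from the encoder.

    patch_tokens: (B, N, D) sequence of embedded image patches
    returns:      visible tokens (B, N_keep, D) and the indices that were kept
    """
    B, N, D = patch_tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                              # random score per patch
    keep_idx = noise.argsort(dim=1)[:, :n_keep]           # lowest scores are kept
    visible = torch.gather(
        patch_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D)
    )
    return visible, keep_idx

tokens = torch.randn(2, 196, 768)                         # e.g. 14x14 patches of a 224px image
visible, keep_idx = random_masking(tokens)
print(visible.shape)                                      # torch.Size([2, 49, 768])
```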
CS 198-126: Lecture 19 - Advanced Vision Pretraining
  • 2022.12.03
  • www.youtube.com

CS 198-126: Lecture 20 - Stylizing Images



CS 198-126: Lecture 20 - Stylizing Images

The video discusses various techniques for image stylization, including neural style transfer and GAN-based approaches such as Pix2Pix, which requires paired data, and CycleGAN, which uses unpaired data for image-to-image translation. The limitations of CycleGAN can be addressed by StarGAN, which uses information from multiple domains to train a generator for multi-domain image translation tasks. The speaker also discusses multimodal unsupervised image-to-image translation, which uses domain information and low-dimensional latent codes to produce diverse outputs, exemplified by the BicycleGAN model. Lastly, the potential benefits of combining Vision Transformers with GANs for image translation tasks are mentioned, and the lecture concludes with fun image examples and an opportunity for questions and discussion.

  • 00:00:00 In this section, the speaker discusses image to image translation and specifically neural style transfer. The task involves transforming images from the source domain into the corresponding image in the target domain while preserving the content of the original image. Neural style transfer is a technique used to blend two images together by optimizing the output image to match the content of one image and the style reference of another. Convolutional Nets are used to extract relevant information from both images and create a new image with the desired style. The speaker goes into detail about the inputs required and the architecture used for this technique.

  • 00:05:00 In this section, the lecture discusses how deep CNNs can represent both the content and the style of images. Starting with low-level features such as edges and textures, the CNN builds up higher-level features before producing object-level representations. The lecture then explores how the similarity of style across different feature maps can be measured with a Gram matrix calculation (see the Gram matrix sketch after this list), and explains how content and style are extracted from the CNN and how a loss is computed for each so that the output image can be optimized toward the desired result.

  • 00:10:00 In this section of the lecture, the speaker covers a couple of different techniques for image processing. First, they describe generating an output image by optimizing a combination of the content loss and style loss, showing an example where a content image and a style image are combined into a final image that takes its higher-level structure from the content image and its lower-level texture from the style image. Next, they briefly review GANs, focusing on the discriminator and generator, and mention StyleGAN and its ability to separate higher- and lower-level attributes of an image. Finally, they discuss Pix2Pix, which uses a conditional GAN to generate output images based on additional information provided by the user.

  • 00:15:00 In this section, the video discusses various techniques for image stylization, including GANs and Pix2Pix, which require paired data, and CycleGAN, which uses unpaired data for image-to-image translation (see the cycle-consistency sketch after this list). CycleGAN has limitations, however, which StarGAN addresses: it takes information from multiple domains to train a generator, allowing multi-domain image translation tasks. The key idea behind StarGAN is to learn a flexible translation that uses both the image and the domain information as input.

  • 00:20:00 In this section of the lecture, the speaker discusses multimodal unsupervised image-to-image translation and how it can produce multiple realistic and diverse outputs from a single input image. The paper being discussed incorporates domain information and low-dimensional latent codes to produce more accurate and faithful outputs. BicycleGAN is presented as an example of how this approach minimizes mode collapse and achieves diverse outputs; the model also learns an encoder that maps outputs back into the latent space, minimizing the chance that two different codes generate the same style or output.

  • 00:25:00 In this section of the lecture, the speaker discusses the challenges of using Vision Transformers for tasks like image to image translation and the potential benefits of using them in combination with GANs. They mention recent techniques that leverage the benefits of Vision Transformers with GANs to tackle image translation tasks, although it is not as straightforward as using GANs alone for these tasks. The speaker concludes by sharing some fun images showcasing the abilities of these techniques and opening up the floor for questions and discussion.
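The 00:05:00 bullet above mentions measuring style similarity across feature maps with a Gram matrix. A minimal PyTorch sketch of the Gram matrix and the resulting style loss is shown below; the feature-map shapes and the normalization constant are illustrative assumptions, not values taken from the lecture.

```python
import torch
import torch.nn.functional as F

def gram_matrix(features):
    """Channel-by-channel correlations of a feature map: the 'style' statistics.

    features: (B, C, H, W) activations from some CNN layer
    returns:  (B, C, C) Gram matrices, normalized by the layer size
    """
    B, C, H, W = features.shape
    flat = features.view(B, C, H * W)
    return flat @ flat.transpose(1, 2) / (C * H * W)

def style_loss(generated_feats, style_feats):
    """Match the Gram matrix of the generated image to that of the style image."""
    return F.mse_loss(gram_matrix(generated_feats), gram_matrix(style_feats))

gen = torch.randn(1, 64, 56, 56, requires_grad=True)   # stand-in for a CNN layer's activations
sty = torch.randn(1, 64, 56, 56)
print(style_loss(gen, sty).item())
```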
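The 00:15:00 bullet notes that CycleGAN trains on unpaired data. The constraint that makes this possible is a cycle-consistency loss: translating an image to the other domain and back should recover the original. Here is a minimal sketch of that loss; the generator stubs and the weighting factor are placeholder assumptions rather than the lecture's actual networks.

```python
import torch
import torch.nn as nn

# Placeholder generators; in CycleGAN these are full image-to-image networks.
G_xy = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # domain X -> domain Y
G_yx = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # domain Y -> domain X
l1 = nn.L1Loss()

def cycle_consistency_loss(real_x, real_y, lam=10.0):
    """|| G_yx(G_xy(x)) - x ||_1 + || G_xy(G_yx(y)) - y ||_1, weighted by lambda."""
    reconstructed_x = G_yx(G_xy(real_x))
    reconstructed_y = G_xy(G_yx(real_y))
    return lam * (l1(reconstructed_x, real_x) + l1(reconstructed_y, real_y))

x = torch.randn(4, 3, 64, 64)   # unpaired batches from each domain
y = torch.randn(4, 3, 64, 64)
print(cycle_consistency_loss(x, y).item())
```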
CS 198-126: Lecture 20 - Stylizing Images
  • 2022.12.03
  • www.youtube.com