Machine Learning and Neural Networks

 

Vanishing (or Exploding) Gradients (DL 11)

As we delve into scaling up neural networks to solve larger problems, adding more layers becomes necessary. However, deeper networks can encounter issues during training caused by vanishing or exploding gradients. Let's consider a deep neural network with sigmoid activations for the hidden layers. Visualizing such a network with numerous nodes and layers becomes impractical. Instead, we can represent it with a block diagram, where each column represents a layer, and activation functions are indicated within each block.

Another way to visualize the network is through a computational graph, showing the sequence of operations applied to each batch of data. Starting with input matrices, we perform matrix multiplications, addition of biases, and apply activation functions at each layer. This process continues through the hidden layers until we reach the output layer, where the activation function changes to softmax. The loss is computed from the activations and targets.

Expressing the computations mathematically, we multiply weight matrices by input matrices, add biases, and apply activation functions. The expressions continue through the hidden layers, ultimately reaching the output layer where softmax activation is applied. The output activations and targets are used to calculate the loss.

When computing derivatives for gradient descent updates, the chain rule is applied repeatedly. Starting from the output layer, we calculate deltas by multiplying with the transpose of weight matrices and element-wise multiplying with the derivative of the activation function. This process propagates deltas backward through the hidden layers.

The vanishing gradient problem arises when sigmoid activation functions are used for the hidden layers. The derivative of a sigmoid is at most 0.25 and is usually much smaller, so each backward step through a layer scales the deltas down. As a result, the gradients become smaller and smaller, making it difficult to update the weights effectively, especially in the early layers.
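
To make this concrete, here is a small NumPy sketch of the delta recursion described above (the layer sizes and random weights are made up for illustration). Each backward step multiplies by a transposed weight matrix and by the sigmoid derivative, which is at most 0.25, so the delta norms collapse toward zero:

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    width, depth = 50, 20                       # made-up sizes for illustration
    Ws = [rng.normal(0, 0.1, (width, width)) for _ in range(depth)]

    # Forward pass on one input vector, keeping the pre-activations z of each layer.
    a, zs = rng.normal(size=width), []
    for W in Ws:
        z = W @ a
        zs.append(z)
        a = sigmoid(z)

    # Backward pass: delta[l] = (Ws[l+1].T @ delta[l+1]) * sigmoid'(z[l]).
    delta = rng.normal(size=width)              # stand-in for the output-layer delta
    for l in range(depth - 2, -1, -1):
        s = sigmoid(zs[l])
        delta = (Ws[l + 1].T @ delta) * s * (1.0 - s)   # sigmoid' = s * (1 - s) <= 0.25
        if l % 5 == 0:
            print(f"layer {l:2d}: ||delta|| = {np.linalg.norm(delta):.2e}")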

Deep learning faced difficulties in training deep neural networks due to the vanishing gradient problem. However, around a decade ago, approaches were devised to overcome this challenge. One method is to change the weight matrix initialization, generating larger initial random weights to counteract the decreasing deltas caused by sigmoid derivatives.

The most significant breakthrough came with the adoption of rectified linear units (ReLU) as activation functions. Unlike sigmoid derivatives, ReLU derivatives (which are either 0 or 1) tend not to shrink the deltas. This property made ReLU activations popular because they make deep neural networks much easier to train.

However, using ReLU activations introduces the risk of exploding gradients, where deltas can become larger as we propagate backward. To mitigate this, it is advisable to choose smaller initial weights compared to sigmoid activations.

ReLU neurons are preferred for hidden layers due to their training ease and computational efficiency. The initialization of weights depends on the activation function employed, and the deep learning community has made substantial progress in determining appropriate weight initialization methods for different activation types. Modern deep learning libraries often handle weight initialization automatically based on the specified activations.
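
As a concrete (and hedged) illustration of activation-dependent initialization, here is a sketch of the widely used Glorot/Xavier and He scaling rules. The lecture does not name a specific scheme, so these are standard community choices rather than the course's prescription:

    import numpy as np

    rng = np.random.default_rng(0)

    def glorot_uniform(fan_in, fan_out):
        """Common choice for sigmoid/tanh layers: variance about 2 / (fan_in + fan_out)."""
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        return rng.uniform(-limit, limit, size=(fan_out, fan_in))

    def he_normal(fan_in, fan_out):
        """Common choice for ReLU layers: variance about 2 / fan_in."""
        return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

    W_for_sigmoid_layer = glorot_uniform(fan_in=256, fan_out=128)
    W_for_relu_layer = he_normal(fan_in=256, fan_out=128)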

Video: Vanishing (or Exploding) Gradients (DL 11), Davidson CSC 381: Deep Learning, Fall 2022, www.youtube.com, 2022.09.30
 

Avoiding Neural Network Overfitting (DL 12)

As we work with larger neural networks for deep learning, the risk of overfitting increases significantly. It is crucial to understand the causes of overfitting and how to detect and prevent it. Overfitting occurs when a machine learning model becomes too specific to the training set and fails to generalize to new data. The primary cause is when a model has excessive parameter freedom compared to the amount of training data, making models with high degrees of freedom or small training sets more susceptible.

In polynomial regression, for instance, increasing the degree of the polynomial provides more parameters to fine-tune, allowing the model to fit the training data more precisely. However, this can hinder its ability to generalize to examples beyond the training set.

In the context of neural networks, the weights and biases serve as parameters. As neural networks become larger with more weights, they have greater freedom in choosing their parameters. Hence, when training a large neural network, it is important to be vigilant for potential overfitting, with the primary method of identification being monitoring the validation set.

Splitting the dataset into training, validation, and testing sets helps assess the network's generalization. When overfitting occurs, there is a noticeable discrepancy in loss or accuracy between the training set and the validation set. Ideally, the training set loss should decrease over epochs, but if it starts to increase, it indicates a problem. Similarly, the validation set loss should decrease in line with the training set loss, and if it starts to increase while the training set loss continues to decrease, it signifies strong overfitting. The accuracy of the model on both sets can also reveal overfitting in classification problems.

To address overfitting, one approach is to directly tackle its causes. Insufficient data can be mitigated by acquiring more data, as seen in large-scale deep learning successes using vast datasets. However, if obtaining more data is not feasible, scaling down the model can help combat overfitting and improve efficiency. The general guideline is to choose a neural network architecture that is adequately sized for the specific problem at hand.

If overfitting concerns persist, there are advanced techniques to consider. One such technique is early stopping, where training is halted when a separation between the training and validation sets is observed, even before reaching the maximum number of epochs. Additionally, methods like Dropout and weight regularization can be employed to prevent overfitting.

Dropout involves randomly zeroing out some activations in the network during training to prevent specific neurons from having an excessive impact. By dropping out neurons, subsequent layers of the network are compelled to learn functions that do not overly rely on those neurons, thereby reducing overfitting. Adjustments are made during testing to account for the absence of Dropout.
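
A minimal NumPy sketch of the idea, using the common "inverted dropout" formulation that rescales the surviving activations during training so that nothing needs to change at test time (individual libraries may handle this bookkeeping differently):

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(activations, drop_prob, training):
        """Randomly zero activations during training; leave them untouched at test time."""
        if not training or drop_prob == 0.0:
            return activations
        keep_prob = 1.0 - drop_prob
        mask = rng.random(activations.shape) < keep_prob
        # Scale up the survivors so the expected activation value is unchanged.
        return activations * mask / keep_prob

    h = rng.normal(size=(4, 8))                    # a batch of hidden-layer activations
    h_train = dropout(h, drop_prob=0.5, training=True)
    h_test = dropout(h, drop_prob=0.5, training=False)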

Weight regularization combats overfitting by discouraging weights from becoming too large. This is achieved by incorporating a penalty term into the loss function that discourages large weights. One common form of weight regularization is L2 regularization, where the sum of the squares of all weights is added as a quadratic penalty term. This regularization term, controlled by a hyperparameter, balances the emphasis on regularization versus the original loss function.
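
A sketch of how such a penalty can be folded into the loss; the function, the lambda_ hyperparameter name, and the example values are placeholders rather than anything prescribed by the lecture:

    import numpy as np

    def l2_regularized_loss(base_loss, weight_matrices, lambda_):
        """Add a quadratic penalty on all weights to the original loss."""
        penalty = sum(np.sum(W ** 2) for W in weight_matrices)
        return base_loss + lambda_ * penalty

    # Example: a cross-entropy loss of 0.42 on a batch, two weight matrices, lambda = 1e-4.
    rng = np.random.default_rng(0)
    Ws = [rng.normal(size=(64, 32)), rng.normal(size=(32, 10))]
    total_loss = l2_regularized_loss(0.42, Ws, lambda_=1e-4)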

It is crucial to monitor for overfitting when training neural networks. Consider the model's size and the available data, and employ techniques like early stopping, Dropout, and regularization to address overfitting when necessary.

Video: Avoiding Neural Network Overfitting (DL 12), Davidson CSC 381: Deep Learning, Fall 2022, www.youtube.com, 2022.10.01
 

Convolutional Layers (DL 13)

The neural networks we have considered so far have been densely connected, where each layer is connected to the next layer. Dense networks are a good starting point as they are general and versatile. However, for specific applications, we can choose alternative architectures that are more effective. In this video, we explore the first alternative architecture called convolutional layers.

Convolutional networks are well-suited for image processing tasks. Instead of treating the input image as a flat vector, convolutional layers preserve the spatial information of the image. Each neuron in a convolutional layer is connected only to a small region of the image, capturing the spatial proximity of pixels. By using this architecture, the network gains an advantage in learning image processing tasks.

Convolutional layers have two key ideas: local connectivity and weight tying. Local connectivity means that neurons are connected to a small sub-region of the image, enabling them to learn specific features. Weight tying ensures that the same function is applied to different regions of the image. By sharing weights, the network can learn to apply the same function across multiple regions.

Convolutional layers introduce new hyperparameters to consider. These include the kernel size (determining the sub-region size), stride (how much the window slides), number of output channels (number of functions applied to each window), padding (handling image edges), and pooling (aggregating neuron results to reduce dimensionality).
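
These hyperparameters jointly determine the spatial size of a convolutional layer's output. A small helper using the standard output-size formula (the function and argument names are mine) makes the relationship explicit:

    def conv_output_size(input_size, kernel_size, stride=1, padding=0):
        """Spatial output size of a convolution along one dimension."""
        return (input_size + 2 * padding - kernel_size) // stride + 1

    # A 32x32 image with a 3x3 kernel, stride 1, and padding 1 keeps its spatial size of 32.
    assert conv_output_size(32, kernel_size=3, stride=1, padding=1) == 32
    # The same image with a 5x5 kernel, stride 2, and no padding shrinks to 14x14.
    assert conv_output_size(32, kernel_size=5, stride=2, padding=0) == 14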

Pooling helps reduce the number of parameters in the network by aggregating the results of neurons in a region. This can be done by averaging or taking the maximum value. Pooling is useful when we don't need precise localization of features but rather the overall presence of features in a region.

Convolutional networks provide a more efficient way to process images compared to dense networks. They leverage the spatial information and reduce the number of parameters, making them easier to train.

Pooling helps to reduce the dimensionality of the feature maps and the number of parameters in the subsequent layers. By aggregating the results of neighboring neurons, pooling retains the most important information while discarding some spatial details.

There are different types of pooling operations, such as max pooling and average pooling. In max pooling, the maximum value within each pooling window is selected as the representative value for that region. This helps to capture the most prominent features present in the window. On the other hand, average pooling takes the average value of the window, providing a smoother representation of the features.
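
A NumPy sketch of 2x2 pooling on a single-channel feature map (non-overlapping windows and an even height and width are assumed, purely for illustration):

    import numpy as np

    def pool2x2(feature_map, mode="max"):
        """Non-overlapping 2x2 pooling over a 2D feature map with even height and width."""
        h, w = feature_map.shape
        windows = feature_map.reshape(h // 2, 2, w // 2, 2)
        if mode == "max":
            return windows.max(axis=(1, 3))
        return windows.mean(axis=(1, 3))

    fm = np.array([[1, 3, 2, 0],
                   [4, 2, 1, 1],
                   [0, 1, 5, 6],
                   [2, 2, 7, 8]], dtype=float)
    print(pool2x2(fm, "max"))   # [[4. 2.] [2. 8.]]
    print(pool2x2(fm, "avg"))   # [[2.5 1.] [1.25 6.5]]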

After pooling, we can further stack additional convolutional layers to learn more complex and abstract features from the previous layer's output. Each subsequent layer captures higher-level features by combining the information from multiple smaller receptive fields.

To summarize, convolutional neural networks (CNNs) with convolutional and pooling layers are well-suited for image processing tasks. The convolutional layers capture spatial proximity and exploit weight sharing, enabling the network to learn local features efficiently. Pooling reduces the dimensionality and extracts important information, allowing subsequent layers to learn more abstract representations. This hierarchical feature learning makes CNNs powerful for various computer vision applications, including image classification, object detection, and image segmentation.

Video: Convolutional Layers (DL 13), Davidson CSC 381: Deep Learning, Fall 2022, www.youtube.com, 2022.10.15
 

Training large networks with little data: transfer learning and data augmentation (DL 14)

In deep learning, it is common to encounter problems where we want to leverage the power of deep learning but lack sufficient data to train a deep model effectively. This issue arises across various domains and neural network architectures. Let's focus on the scenario of an image processing task using a convolutional network with a small image dataset. However, the concepts discussed here can be applied to other domains as well.

Deep convolutional networks are known for their effectiveness in image processing. However, training a deep convolutional network on a small image dataset would typically lead to extreme overfitting, where the network merely memorizes the input data. In such cases, we need to find ways to make better use of our data or explore alternative data sources.

One approach to overcome the data scarcity problem is through data augmentation and transfer learning. Transfer learning is a fundamental concept in modern deep learning, and it is surprisingly simple to explain. The idea is to train a deep neural network on a related but more general problem and then reuse that pre-trained network with additional training on our specific dataset to solve our problem.

For an image processing task, we can train a network on large image datasets collected from the web or machine learning competitions. The pre-trained network would have a last layer dedicated to classifying images from those datasets. When working on a different image processing task with a distinct output layer, we can discard the pre-trained network's output layer and add our own output layer that matches the requirements of our problem. This involves adding new weights that connect the new output layer to the last layer of the pre-trained network, which can be trained using our small dataset.

The expectation behind transfer learning's effectiveness lies in the assumption that if the pre-training problem is similar enough to our specific problem, the functionality learned by the pre-trained network will transfer over, benefiting our problem. We can think of the pre-trained network as having learned generic image processing functions, and we can utilize this learned transformation when training our network on the small dataset.

When applying transfer learning, we have several options for utilizing the pre-trained model. We need to discard the output layer to match our problem, but we can also remove other layers if we believe that the useful pre-processing has already been performed. Additionally, we can add multiple layers to perform more sophisticated processing for our specific problem. To preserve any useful processing done by the early layers, we can freeze their weights during retraining, especially if the pre-trained model was trained on a large dataset, and our problem has a small dataset.
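
As one concrete sketch of this recipe, using torchvision's pre-trained ResNet-18: the choice of model, the decision to freeze everything except a new head, and the 10-class output layer are illustrative assumptions, and the string-valued weights argument assumes a reasonably recent torchvision version:

    import torch.nn as nn
    from torchvision import models

    # Load a network pre-trained on ImageNet (the related, more general problem).
    backbone = models.resnet18(weights="IMAGENET1K_V1")

    # Freeze the pre-trained weights so the small dataset only has to train the new head.
    for param in backbone.parameters():
        param.requires_grad = False

    # Discard the original 1000-class output layer and attach one matching our 10-class task.
    backbone.fc = nn.Linear(backbone.fc.in_features, 10)

    # Only the new layer's parameters (which default to requires_grad=True) get optimized.
    trainable = [p for p in backbone.parameters() if p.requires_grad]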

Deep learning libraries often provide model zoos, which are collections of pre-trained models for different problem types. These models serve as starting points for transfer learning, making deep learning accessible for solving a wide range of problems.

However, even with the aid of transfer learning, our dataset might still be too small to train a network effectively, even on the last few layers. In such cases, we need to extract as much information as possible from our dataset, which brings us to the idea of data augmentation.

Data augmentation involves applying transformations to the dataset that appear different to the neural network but retain the same meaning to humans or other systems utilizing the learned model. In the case of image processing, various transformations can be applied without altering the human perception of the represented image. For example, rotating or zooming in on an image does not change its underlying content. These transformations introduce substantial differences in the input data seen by the neural network, making it challenging for the network to memorize specific examples or rely on fine details of the input.

However, we must ensure that the transformations do not change the meaning of the data while still appearing distinct from the network's perspective. For instance, translating an image may add little value for a convolutional network, since convolutions are largely translation invariant and a shifted image looks nearly the same to the network anyway.

Data augmentation techniques include adding random noise, slight blurring, and other modifications that do not distort the human perception of the image. These transformations can be easily computed and randomly applied to each batch of data during training. By training the network for multiple epochs on the augmented dataset, we prevent it from simply memorizing the exact input examples and encourage it to generalize better.
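
A sketch of what such a random augmentation pipeline might look like with torchvision.transforms; the particular transforms and parameter values are illustrative and should be checked against what actually preserves meaning for the task at hand:

    from torchvision import transforms

    # Each training batch gets a fresh random draw of these transformations.
    augment = transforms.Compose([
        transforms.RandomRotation(degrees=15),                  # small rotations
        transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),    # mild zoom and crop
        transforms.ColorJitter(brightness=0.2, contrast=0.2),   # lighting changes
        transforms.GaussianBlur(kernel_size=3),                 # slight blurring
        transforms.ToTensor(),
    ])

    # Typically attached to the training dataset, e.g.:
    # train_set = torchvision.datasets.ImageFolder("data/train", transform=augment)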

It is important to note that data augmentation techniques are not as effective as having more data to train on, including related data for pre-training. However, when combined with transfer learning, data augmentation allows us to tackle a wider range of problems using deep neural networks.

In summary, when faced with a problem that requires deep learning but lacks sufficient data, transfer learning and data augmentation are valuable strategies. Transfer learning involves training a network on a related problem and reusing it, with additional training, for our specific problem. Data augmentation entails applying transformations to the dataset that retain their meaning while introducing variations for better generalization. Although these techniques are not substitutes for more data, they offer practical solutions for leveraging deep learning in scenarios with limited data availability.

Video: Training large networks with little data: transfer learning and data augmentation (DL 14), Davidson CSC 381: Deep Learning, Fall 2022, www.youtube.com, 2022.10.21
 

Residual Networks and Skip Connections (DL 15)

Deep neural networks are powerful but challenging to train due to the need for more data as the number of parameters increases. Training deep networks often shows slow progress in decreasing loss compared to shallow networks. This is because the input data, passing through multiple layers with randomly initialized weights, gets scrambled into random noise, making it difficult for meaningful gradients to propagate.

To address this issue, skip connections are introduced. Skip connections involve grouping layers into blocks and providing two paths for data flow: one through each block and one around it. The output of a block is combined with its input using addition or concatenation, so the original input passes through alongside whatever the block computes, which keeps the inputs to later layers, and the gradient updates they produce, meaningful.

Residual blocks, which incorporate skip connections, have several advantages. First, they simplify the learning task for each block by focusing on augmenting the existing data rather than figuring out everything about the input. Second, they facilitate the flow of gradients by providing shorter paths for updating each layer in the network. These advantages lead to faster training progress and better performance compared to shallow networks.

When using residual blocks, it is crucial to address shape compatibility between input and output tensors, especially when using convolutional layers. Special consideration should be given to matching the shapes and avoiding an explosion in the number of parameters, particularly when using concatenation. Typically, addition is preferred over concatenation for most skip connections in large residual networks.

One-by-one convolutions preserve the height and width of a convolutional block's output while letting us adjust its channel dimension: specifying the number of filters in the one-by-one convolution sets the depth of the output tensor.
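
A minimal PyTorch sketch of a residual block along these lines (the kernel sizes, channel counts, and the omission of normalization layers are simplifying assumptions for illustration):

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, in_channels, out_channels):
            super().__init__()
            # 3x3 convolutions with stride 1 and padding 1 preserve height and width.
            self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            self.relu = nn.ReLU()
            # A 1x1 convolution adjusts the channel depth of the skip path when needed.
            self.skip = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, kernel_size=1))

        def forward(self, x):
            out = self.relu(self.conv1(x))
            out = self.conv2(out)
            return self.relu(out + self.skip(x))   # addition-based skip connection

    block = ResidualBlock(in_channels=32, out_channels=64)
    y = block(torch.randn(1, 32, 56, 56))          # output shape: (1, 64, 56, 56)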

While there are various variations and architectures of residual networks, the key idea remains consistent—improving the training of deep neural networks by leveraging skip connections and residual blocks. These techniques enable better information flow, faster training, and increased model performance. Exploring different residual architectures and their specific implementations is recommended for further understanding and application.

Additionally, it is important to consider some practical concerns when setting up a residual network. One such concern is managing the shape compatibility between input and output tensors when using skip connections. This becomes more complex when convolutional layers are involved, as the height, width, and channel dimensions need to align properly.

To simplify convolutional blocks, one-by-one strides and appropriate padding can be used to preserve the height and width of the input image. This ensures that at least the spatial dimensions match up when adding the input and output tensors of a block. To address the channel dimension, one-by-one convolutions can be employed. Although these convolutions might seem trivial since they receive input from a single pixel, they effectively allow us to adjust the depth of the output layer. By specifying the number of filters in the one-by-one convolution, we can increase or decrease the depth of the output tensor, making the shapes compatible.

When working with large residual networks, it is essential to strike a balance between the number of skip connections and the explosion of parameters. Excessive use of concatenation can lead to a substantial increase in the size of the activation tensor and the number of parameters. Therefore, it is advisable to limit the number of concatenation-based skip connections and prefer addition for most of them.

Modularity is another advantage offered by residual networks. The uniform structure of residual blocks and the ability to easily add more blocks facilitate the construction of deeper and more powerful networks. By incrementally increasing the number of blocks, one can create a network that suits the desired trade-off between computational resources and model capacity.

While residual networks have proven to be highly effective, it is worth noting that there are various other types of residual architectures with different design choices, such as incorporating normalization layers or multiple paths within a block. Exploring these variations can provide further insights and possibilities for improving the training of deep neural networks.

Overall, residual networks provide a valuable approach to training deep neural networks by leveraging skip connections and residual blocks. They simplify learning tasks, accelerate gradient propagation, and offer modularity for building powerful network architectures. Understanding the concepts and considerations behind residual networks contributes to advancements in deep learning research and practical applications.

Video: Residual Networks and Skip Connections (DL 15), Davidson CSC 381: Deep Learning, Fall 2022, www.youtube.com, 2022.10.23
 

Word Embeddings (DL 16)

The majority of the data we have worked with in neural networks has been image data. However, we can also use neural networks for other types of problems, such as text data. Representing text data as input to a neural network is not as straightforward as with images.

In image data, we can use standard digital storage formats, which represent images as arrays of red, green, and blue pixels. This representation is convenient because it captures spatial relationships between pixels and relevant color intensities.

For text data, the standard digital representation, where characters are converted to ASCII or other digital values, is not directly relevant to how neural networks learn. Various methods can be considered to convert ASCII values into valid input for a neural network, such as using the binary representation of ASCII values or normalizing the range of characters to lie between 0 and 1. However, these representations do not capture the semantics of the words in the same way an array represents an image.

One approach is to create giant vectors using one-hot encodings over the entire vocabulary. Each word gets its own unique vector, which avoids unrelated words ending up with similar representations. However, this results in a massive expansion of dimensions, and because every pair of one-hot vectors is equally dissimilar, the representation carries no information about which words are similar in meaning.

To address this, we aim for a representation of text data that achieves several goals. Firstly, we want a per-word representation that is not excessively high-dimensional. Secondly, we want the representation to carry semantic information, where similar words have similar vector representations. This has been a challenging problem in natural language processing.

In recent years, neural networks have been successfully used to generate appropriate input representations for text data. One approach involves extracting n-grams, which are sequences of n words, from the text data. These n-grams provide contextual information for a specific point in a sentence or document.

The idea is to train a network that takes the one-hot dictionary encoding of a single word as input and predicts the one-hot encodings of the surrounding words in the n-gram. For example, with 5-grams we can feed in the middle word and predict the other four words around it. Because words that appear in similar contexts receive similar gradient feedback during this training, we expect semantically similar words to end up with similar representations.

By discarding the output layer of the network, we can use the vector of activations in the last hidden layer as a numerical encoding of the input word. This representation is known as a word embedding, which captures the word's context in actual text. Various approaches exist for producing word embeddings, such as Word2Vec.

Rather than training our own word embeddings, we can utilize pre-trained embeddings generated by others with more data and computational resources. We can easily generate a lookup table to translate an arbitrary text document into the word embedding. This approach allows us to use word embeddings as the input to our neural network for machine learning on text data.
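
A toy sketch of the lookup-table idea; the three-dimensional vectors below are made up, whereas real pre-trained embeddings (Word2Vec, GloVe, and similar) have hundreds of dimensions and vocabularies of hundreds of thousands of words:

    import numpy as np

    # A tiny stand-in for a pre-trained embedding table: word -> dense vector.
    embedding = {
        "the":   np.array([0.1, -0.3, 0.2]),
        "movie": np.array([0.7,  0.5, -0.1]),
        "was":   np.array([0.0, -0.2, 0.4]),
        "great": np.array([0.9,  0.6,  0.3]),
    }
    unknown = np.zeros(3)   # fallback vector for out-of-vocabulary words

    def embed_document(text):
        """Translate a document into a matrix with one embedding vector per word."""
        return np.stack([embedding.get(w, unknown) for w in text.lower().split()])

    X = embed_document("The movie was great")   # shape (4, 3); the network's input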

Using word embeddings as input to our neural network for machine learning on text data offers several advantages. These pre-trained embeddings have been generated by models with extensive data and computational resources, resulting in rich and meaningful representations of words.

By passing a document through an existing word embedding, we can obtain a vectorized representation of the text. This vector representation captures the contextual information of the words and can be used as input to our neural network.

The use of word embeddings enables transfer learning, where knowledge gained from one task (e.g., training the word embedding model) can be applied to another related task (e.g., our specific machine learning problem with text data). Instead of training our own embeddings from scratch, we can leverage the existing embeddings, benefiting from their generalization capabilities.

Once we have the word embedding representation of the text, we can proceed with training our neural network. The neural network can take the word embedding vectors as input and learn to make predictions based on the semantic information encoded in the embeddings.

The specific architecture of the neural network will depend on the task at hand. It could be a recurrent neural network (RNN) that considers the sequential nature of the text, a convolutional neural network (CNN) that captures local patterns, or a combination of both. The network can be designed to perform tasks such as sentiment analysis, text classification, language generation, or machine translation, among others.

During the training process, the neural network learns to recognize patterns and make predictions based on the input word embeddings. The gradients propagated through the network update the weights, optimizing the network's ability to make accurate predictions.

By utilizing word embeddings, we address the challenges of representing text data in a meaningful way for neural networks. These embeddings capture semantic relationships between words, allowing the network to learn from the context and make informed predictions. Additionally, leveraging pre-trained embeddings saves computational resources and improves the efficiency of our machine learning pipeline.

By using word embeddings as input to our neural network, we can harness the power of transfer learning and semantic representations. This approach significantly enhances the ability of neural networks to process and understand text data, opening the door to various natural language processing tasks and applications.

Video: Word Embeddings (DL 16), Davidson CSC 381: Deep Learning, F'20, F'22, www.youtube.com, 2020.10.20
 

Recurrent Neural Networks (DL 17)

In our previous lecture, we discussed the use of word embeddings, which are trainable representations of words as vectors with a moderate number of dimensions. These embeddings can serve as a foundation for building machine learning systems that operate on text data. For simple tasks like sentiment classification of product reviews, breaking down the document into words, embedding each word, and passing the sequence of embeddings as input to a neural network may suffice. However, for more complex tasks such as conversational replies or machine translation, a more sophisticated approach is required.

To illustrate this, we used the example of predicting the next word in a sentence. This task is more challenging than sentiment classification but easier than machine translation. When setting up neural networks to operate on text data, we face two broad approaches. One extreme is providing the entire document as input to the network, while the other extreme is providing a single word as input. However, both approaches have drawbacks: operating on the entire document limits training examples and deals with varying document sizes, while operating on one word at a time ignores the surrounding context necessary for understanding word meaning and representing concepts that don't directly map to words.

To find a compromise between these extremes, we introduced a method that operates on one word at a time but incorporates the network's memory of previous inputs to retain important context. The basic idea is to feed the network's output back to its input, allowing it to use its previous activations as a summary of the words seen so far. This approach gives rise to recurrent neural networks (RNNs), which can be visualized by unrolling them over time, representing the network at different points in time as words are inputted and the network's output is fed back in.

For the next word prediction task, the output of the RNN's hidden layer serves as a summary of the previous words in the sentence. The RNN learns to predict the next word based on this context. The inputs to the RNN are embedding vectors, while the outputs are in a one-hot dictionary encoding to allow for expressing uncertainty over different possible outputs.
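
A minimal NumPy sketch of this recurrence; the dimensions are made up, and in practice the embeddings and weight matrices are learned rather than random:

    import numpy as np

    rng = np.random.default_rng(0)
    embed_dim, hidden_dim, vocab_size = 8, 16, 1000      # made-up sizes

    W_xh = rng.normal(0, 0.1, (hidden_dim, embed_dim))   # input-to-hidden weights
    W_hh = rng.normal(0, 0.1, (hidden_dim, hidden_dim))  # hidden-to-hidden (recurrent) weights
    W_hy = rng.normal(0, 0.1, (vocab_size, hidden_dim))  # hidden-to-output weights
    b_h = np.zeros(hidden_dim)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    sentence = [rng.normal(size=embed_dim) for _ in range(5)]   # one embedding per word

    h = np.zeros(hidden_dim)           # the hidden state summarizes the words seen so far
    for x_t in sentence:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        y_t = softmax(W_hy @ h)        # distribution over the next word (one-hot dictionary)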

Training RNNs involves computing gradients to update the network's weights. The challenge is that the same weights influence the loss not only through their application at the current time step but also through their applications at every earlier time step, whose effects are carried forward through the hidden state. To compute the effect of the weights on the loss at a particular time step, we therefore have to sum contributions from the current step and from all earlier steps, propagating gradients backward through time.

Recurrent neural networks typically employ sigmoid or tanh activation functions, which makes them prone to the vanishing gradient problem. This problem arises when gradients cannot propagate far backward in the network, limiting the ability to capture long-term dependencies. Consequently, plain RNNs are not effective at tasks requiring extensive context and long-term memory, which is why we focused on sentences rather than documents.

In the next lecture, we will explore a variant of recurrent neural networks specifically designed to address the long-term memory problem and achieve better performance in text and language processing tasks.

Video: Recurrent Neural Networks (DL 17), Davidson CSC 381: Deep Learning, F'20, F'22, www.youtube.com, 2020.10.22
 

LSTMs (DL 18)

The goal of this lecture is to demonstrate the practical use of recurrent neural networks (RNNs) for language modeling. Previously, we discussed using RNNs to predict the next word in a sentence, which serves as a common pre-training task for RNNs. For more complex tasks like question answering or machine translation, we can employ a transfer learning approach. First, we pre-train the RNN on the next word prediction task and then fine-tune it for the specific task we're interested in.

To obtain more meaningful outputs from an RNN, we focus on the hidden activations or states that are passed through the network in either the forward or backward direction. These hidden states represent the overall text input. For example, when translating a sentence, each word is sequentially fed into the RNN, and the hidden state produced at the last time step becomes a representation of the entire text. We can then pass this hidden state into additional neural network layers to solve the desired task, such as classification or text generation.

This process of feeding text into an RNN to encode it into a hidden layer state, and then using another RNN as a decoder, allows us to generate output text. By training this pair of RNNs on input-output pairs, we can translate sentences or generate responses.

However, regular RNNs with tanh activations face difficulties when dealing with longer sequences due to vanishing gradients. To address this issue, we can employ an architecture called Long Short-Term Memory (LSTM). LSTMs offer multiple paths for activations to flow, allowing gradients to propagate more efficiently through the network.

Like a plain RNN layer, an LSTM layer has an input and an output, and we can use it to train the network on tasks like predicting the next word. The input is concatenated with the previous hidden state (h), while an additional cell state (c) is passed from the network to itself at each time step. This c state enables gradient propagation without the limitations imposed by tanh activations, because its updates are mostly additive. Sigmoid-activated gates control what information is retained or forgotten from the previous states, and these gates are learned during training.
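
A NumPy sketch of one LSTM time step following the standard textbook gate equations; this is the conventional formulation rather than the exact notation used in the lecture:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        """One LSTM step. W maps [h_prev, x_t] to four stacked gate pre-activations."""
        n = h_prev.size
        z = W @ np.concatenate([h_prev, x_t]) + b
        f = sigmoid(z[0 * n:1 * n])        # forget gate: what to drop from c_prev
        i = sigmoid(z[1 * n:2 * n])        # input gate: what new information to write
        o = sigmoid(z[2 * n:3 * n])        # output gate: what to expose in h
        g = np.tanh(z[3 * n:4 * n])        # candidate cell contents
        c = f * c_prev + i * g             # the c path: mostly additive updates
        h = o * np.tanh(c)                 # the h path passed along to the next time step
        return h, c

    rng = np.random.default_rng(0)
    embed_dim, n = 8, 16
    W = rng.normal(0, 0.1, (4 * n, n + embed_dim))
    b = np.zeros(4 * n)
    h, c = np.zeros(n), np.zeros(n)
    for x_t in [rng.normal(size=embed_dim) for _ in range(5)]:
        h, c = lstm_step(x_t, h, c, W, b)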

LSTMs incorporate both the h and c paths, allowing for more complex computations within each time step and facilitating rapid gradient propagation through multiple applications of the LSTM network. While we don't have complete knowledge of the specific functions learned by each component, the LSTM architecture has proven effective in practice compared to other types of RNNs.

The practical effectiveness of LSTM architectures lies in their ability to address the vanishing gradient problem and capture long-term dependencies in sequential data. By incorporating gating mechanisms and multiple paths for information flow, LSTMs have shown significant improvements over traditional RNNs in various natural language processing tasks.

The gated nature of LSTMs allows them to selectively remember and forget information from previous time steps, making them well-suited for modeling and generating sequences. The sigmoid activations in the LSTM gates control the flow of information, determining what to retain and what to discard. These gates learn from the training data and adaptively decide which parts of the previous hidden state and current input are relevant for the current time step.

The LSTM's ability to remember long-term dependencies is particularly crucial in language modeling. In language translation, for example, understanding the context of a sentence requires considering the entire input sequence. The hidden state at the last time step of the encoding LSTM captures the overall meaning of the sentence, enabling accurate translation or other downstream tasks.

Additionally, LSTMs facilitate efficient gradient propagation during both forward and backward passes. By preserving relevant information and mitigating the impact of vanishing gradients, LSTMs enable the effective training of deep recurrent networks on long sequences. This is accomplished through the use of parallel paths that allow gradients to flow uninterrupted, preventing them from vanishing or exploding as they traverse through the network.

The success of LSTMs in language modeling has made them a fundamental building block in many state-of-the-art models. Researchers and practitioners have extended LSTM architectures with additional features such as attention mechanisms, multi-head attention, and transformer-based models. These advancements further enhance the modeling capabilities of LSTMs, enabling them to handle even more complex tasks, including document summarization, sentiment analysis, and dialogue generation.

In summary, LSTMs have revolutionized language modeling by addressing the limitations of traditional RNNs. Their ability to capture long-term dependencies, handle vanishing gradients, and selectively retain relevant information has made them an indispensable tool in natural language processing. By leveraging LSTM architectures, researchers and developers have achieved significant advancements in various language-related tasks, leading to improved machine translation, question answering systems, and text generation models.

Video: LSTMs (DL 18), Davidson CSC 381: Deep Learning, F'20, F'22, www.youtube.com, 2020.10.25
 

Transformers and Self-Attention (DL 19)

The Transformer architecture, based on neural networks, has achieved state-of-the-art performance in language modeling and various other tasks. Let's explore the core ideas behind Transformers, including their construction from self-attention blocks and the integration of recurrent and residual network features.

Recurrent neural networks (RNNs) excel in text processing by gradually building a hidden state that represents the information content of a document. They receive word embeddings as input and can be trained on unsupervised tasks like predicting the next word in a sentence. However, RNNs, including LSTM variants, struggle with long inputs due to the need for repeated processing through layers.

On the other hand, residual networks are effective at handling deep models with many layers by utilizing residual connections. These connections simplify training by allowing each block to enhance the input and enable gradients to propagate more efficiently.

Residual networks have additional advantages in image processing, such as leveraging convolution within residual blocks, which aligns well with image-related functions. To combine the strengths of recurrent networks for text processing and residual networks for learning deep models, the Transformer was introduced.

Similar to RNNs, a Transformer operates on word embeddings. However, instead of receiving words one at a time, it processes all the embeddings for an entire document concatenated into a matrix. Transformers can be trained on unsupervised tasks that predict missing words, resulting in an encoding of the document usable for various natural language processing tasks.

From residual networks, Transformers inherit skip connections that allow each block to augment its predecessors, simplifying training even in large networks. To facilitate text processing, the architecture within the blocks incorporates a key idea called self-attention.

Self-attention addresses the need to pay attention to distant words in a sentence to understand the meaning of a specific word. Rather than explicitly engineering an attention function, the Transformer's architecture is designed to facilitate learning such a function.

Within a self-attention encoder block, each word's embedding undergoes three dense layers: query (q), key (k), and value (v). These layers share weights across all words but are applied to different elements of the input sentence. By calculating the dot product between query and key vectors, the model can assess similarity.

The dot product between the query and key vectors of the same word indicates self-similarity. Additionally, dot products are computed between the query vector of a specific word and the key vectors of all other words. Softmax is applied to convert the similarity scores into weights between 0 and 1, emphasizing the most similar vectors.

By multiplying the softmax weights with the value vectors of each word, attention is applied to different parts of the document. This weighted sum produces an output vector computed from the entire document. This process is executed in parallel for all words, resulting in a matrix that encodes the document based on attention.
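
A NumPy sketch of a single attention head operating on a whole document matrix at once; the dimensions are made up, and the division by the square root of the key dimension follows the standard Transformer formulation:

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax_rows(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    n_words, embed_dim, d_k = 6, 32, 16            # made-up sizes
    X = rng.normal(size=(n_words, embed_dim))      # one word embedding per row

    # The query, key, and value dense layers share weights across all word positions.
    W_q = rng.normal(0, 0.1, (embed_dim, d_k))
    W_k = rng.normal(0, 0.1, (embed_dim, d_k))
    W_v = rng.normal(0, 0.1, (embed_dim, d_k))

    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d_k)    # dot-product similarity of every query with every key
    weights = softmax_rows(scores)     # each row: attention weights over all words, summing to 1
    attended = weights @ V             # weighted sum of value vectors for every word, in parallel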

The original word embeddings are augmented with the information derived from the entire document, weighted by attention. A regular dense layer of matching shape is then applied. Multiple attention heads can be utilized within an encoder block to learn different attention patterns. The output of all attention heads is summed and combined with the skip connection, resulting in the block's output.

The self-attention mechanism allows the network to learn what to pay attention to in each attention head. Multiple attention heads enable the model to focus on different aspects under various circumstances, enhancing the input representation into a useful encoding of the text document.

This encoding can be further processed for classification or used as input to another neural network for tasks like machine translation. Training Transformers initially focused on language encoding in one language and decoding in another language. Unsupervised training, similar to RNNs, can also be conducted by providing documents with randomly blanked-out words and training the model to predict the missing words.

Transformers have revolutionized various natural language processing tasks and have become the state-of-the-art architecture for language modeling and many other applications. Let's delve deeper into the core concepts of Transformers and explore how they combine the best aspects of recurrent and residual networks.

Recurrent neural networks (RNNs), such as LSTM, are effective for text processing because they process word embeddings sequentially and build up a hidden state that represents the information content of a document. RNNs can be trained on unsupervised tasks like predicting the next word in a sentence using readily available data. However, RNNs tend to struggle with long inputs due to the need for passing data through multiple layers repeatedly.

On the other hand, residual networks excel at handling deep models by utilizing residual connections, which simplify training and enable gradients to propagate efficiently. In image processing, residual networks leverage convolution within residual blocks, offering an advantage for functions relevant to image analysis. The goal is to combine the advantages of recurrent networks in processing text with the benefits of learning deep models from residual networks.

This brings us to the Transformer architecture. Like recurrent networks, Transformers operate on word embeddings. However, unlike recurrent networks that process words one at a time, Transformers receive the embeddings of an entire document concatenated into a matrix, with each row representing the embedding of a different word. Transformers can be trained on unsupervised tasks, such as predicting missing words, to generate document encodings for various natural language processing tasks.

From residual networks, Transformers inherit skip connections, ensuring that each block only needs to augment its predecessors and allowing gradients to propagate effectively even in large networks. To facilitate text processing, Transformers employ a distinct architecture within the blocks, known as self-attention.

Self-attention is the idea that to understand one word in a sentence, we need to pay attention to other words that may be distant in the sentence. The architecture is not explicitly engineered with a specific attention function; instead, it is designed to facilitate learning such functions.

In a self-attention encoder block, each word's embedding undergoes three dense layers called query, key, and value. These layers are shared across all words but applied to different elements of the input sentence. By taking the dot product between query and key vectors, we can assess the similarity. Larger dot products indicate vectors pointing in similar directions, while smaller dot products indicate vectors pointing in different directions.

For a given word, we compute the dot product between its query vector and the key vectors of all other words. This produces a vector of similarity scores, representing how similar the query vector is to each key vector. Applying softmax to these scores converts them into values between 0 and 1, emphasizing the most similar vectors. The resulting softmax weights serve as multipliers on the value vectors for all words in the document.

Each value vector is multiplied element-wise with its corresponding softmax weight, creating a weighted sum that represents the word's attention to other words. This process is applied in parallel for each word, generating an output vector calculated from the entire document, weighted according to the attention given to each word. This information is then added to the original word embedding.

To produce the output of one attention head, a regular dense layer of matching shape is applied. Multiple attention heads can be used within an encoder block, allowing the network to learn different attention patterns in different contexts. The output of all attention heads is combined and added to the skip connection, resulting in the output of the block.

Similar to convolutional layers using multiple channels, Transformers often employ multiple attention heads within an encoder block to capture different attention patterns. This enables the network to learn and combine various attention calculations, augmenting the input representation into a useful encoding of the text document.

Once the encoding is produced, it can be utilized for various tasks. For instance, additional layers can be applied for classification, or the encoding can serve as input to another neural network for tasks like machine translation. Initially, Transformer training focused on encoding in one language and decoding in another. Unsupervised training can also be conducted by randomly blanking out words in documents and training the model to predict the missing words.

To account for word ordering and proximity, Transformers incorporate positional encoding. This additional information is added to the word embeddings and enables the model to understand the relative positions of words in the document.
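
A sketch of the sinusoidal positional encoding used in the original Transformer paper; this is one common choice (learned position embeddings are another), and an even embedding dimension is assumed:

    import numpy as np

    def positional_encoding(n_positions, embed_dim):
        """Sinusoidal position signals added to the word embeddings, one row per position."""
        positions = np.arange(n_positions)[:, None]        # shape (n_positions, 1)
        dims = np.arange(0, embed_dim, 2)[None, :]         # the even embedding dimensions
        angles = positions / (10000 ** (dims / embed_dim))
        pe = np.zeros((n_positions, embed_dim))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    # X is the (n_words, embed_dim) matrix of word embeddings for a document:
    # X = X + positional_encoding(X.shape[0], X.shape[1])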

Transformers are a powerful architecture for natural language processing tasks. By combining the strengths of recurrent and residual networks, they have achieved state-of-the-art results in various applications. The self-attention mechanism allows the model to learn which words to pay attention to, and multiple attention heads capture different attention patterns. Transformers have significantly advanced the field of language modeling and continue to be an active area of research and development.

Video: Transformers and Self-Attention (DL 19), Davidson CSC 381: Deep Learning, Fall 2022, www.youtube.com, 2022.11.05
 

Other Metrics and the ROC Curve (DL 20)

This is a short lecture on alternative metrics for measuring success in binary classification tasks when using neural networks.

In a binary classification task, we typically have two output nodes in our neural network, and our target vectors are either [1, 0] or [0, 1]. When decoding the network's output to a category label, there are four possible outcomes:

  1. True Positive: The target is [1, 0], and the decoded output agrees.
  2. False Negative: The target is [1, 0], but the decoded output incorrectly labels it as [0, 1].
  3. True Negative: The target is [0, 1], and the decoded output agrees.
  4. False Positive: The target is [0, 1], but the decoded output incorrectly labels it as [1, 0].

These outcomes can be used to calculate different metrics for evaluating the model's performance in binary classification. Here are some alternative metrics to consider (a small computation sketch follows the list):

  1. Precision: Of all the data points the model labels positive, the fraction that truly are positive: TP / (TP + FP).
  2. Sensitivity or Recall: Of all the data points that truly belong to the first (positive) category, the fraction the model correctly identifies: TP / (TP + FN).
  3. Specificity: Of all the data points that truly belong to the second (negative) category, the fraction the model correctly identifies: TN / (TN + FP).
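
A small sketch computing these metrics directly from the four outcome counts, with the first category treated as the positive class (the example labels are made up):

    import numpy as np

    def binary_metrics(y_true, y_pred):
        """y_true, y_pred: arrays of 0/1 labels, where 1 is the positive (first) category."""
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        return {
            "accuracy":    (tp + tn) / (tp + tn + fp + fn),
            "precision":   tp / (tp + fp),
            "recall":      tp / (tp + fn),      # also called sensitivity
            "specificity": tn / (tn + fp),
        }

    y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
    y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 0])
    print(binary_metrics(y_true, y_pred))   # precision 2/3, recall 2/3, specificity 4/5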

Accuracy, which measures the overall fraction of correct labels, may not always be the most informative metric. Different situations, such as the importance of false positives or false negatives, may require a focus on specific metrics. Additionally, the distribution of positive and negative labels in the dataset can heavily influence accuracy.

To understand the trade-offs between metrics, it is common to visualize them using techniques like the Receiver Operating Characteristic (ROC) curve. The ROC curve plots the true positive rate against the false positive rate as the classification threshold varies.

By considering the trade-offs between false positives and false negatives and analyzing the ROC curve, we can choose the most suitable model and evaluation metric based on the specific requirements of our problem.

Instead of solely relying on accuracy, it is important to consider the relative importance of false positives and false negatives, and how different models perform in this regard. So when working on your project, it is advisable to evaluate the trade-offs between metrics and consider the implications for your specific problem rather than relying solely on overall accuracy.

Understanding the trade-offs between different metrics is crucial when evaluating machine learning models. In certain scenarios, accuracy may not provide a comprehensive picture of a model's performance, especially when false positives and false negatives carry different levels of importance. Let's explore some cases where alternative metrics are more appropriate:

  1. Importance of False Positives and False Negatives: In domains like medical diagnosis, the consequences of false positives and false negatives can vary significantly. For instance, in cancer detection, a false negative (missing a positive case) can have severe implications, while a false positive (incorrectly diagnosing a negative case) may lead to unnecessary treatments. In such cases, metrics like precision and recall/sensitivity can offer valuable insights into a model's performance.

  2. Imbalanced Data: When the positive and negative labels are unevenly distributed in the dataset, accuracy can be misleading. If 95% of the data points belong to the positive class, a model that simply predicts everything as positive achieves 95% accuracy without learning anything about the underlying patterns. Metrics like precision and recall counteract this imbalance by focusing on the model's performance on each class.

  3. Precision-Recall Trade-off: Machine learning models often exhibit a trade-off between precision and recall. Precision measures the ability to correctly identify positive examples, while recall measures the ability to capture all positive examples. By adjusting the model's threshold or decision boundary, we can prioritize precision or recall. However, changing the threshold to improve one metric often comes at the expense of the other. Understanding this trade-off is important when selecting the appropriate metric for a given problem.

  4. Receiver Operating Characteristic (ROC) Curve: The ROC curve gives a graphical picture of a binary classifier's performance by plotting the true positive rate against the false positive rate at various classification thresholds (a plotting sketch follows this list). A model that achieves high true positive rates with low false positive rates has a curve closer to the top left corner, indicating better performance. The area under the ROC curve (AUC-ROC) is commonly used as a summary metric, with values closer to 1 indicating better performance.
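
A sketch of plotting an ROC curve from a model's predicted scores using scikit-learn's metric helpers and matplotlib; the labels and scores below are placeholder data standing in for the network's predicted positive-class probabilities:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, roc_auc_score

    # Placeholder data: true 0/1 labels and the model's predicted positive-class probabilities.
    y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
    y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.65, 0.3, 0.9, 0.55, 0.45])

    fpr, tpr, thresholds = roc_curve(y_true, y_scores)   # one point per classification threshold
    auc = roc_auc_score(y_true, y_scores)

    plt.plot(fpr, tpr, label=f"model (AUC = {auc:.2f})")
    plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()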

Different machine learning models may have different trade-offs between sensitivity and specificity or precision and recall. It's important to consider the problem's specific requirements and the relative importance of different metrics. By evaluating these trade-offs and understanding how models perform across various metrics, we can make more informed decisions and choose the most suitable model for our application.

In your project, consider the trade-offs between false positives and false negatives and select metrics that align with the problem's objectives. Instead of solely relying on accuracy, take into account the specific needs and implications of your task to evaluate and compare the performance of different models accurately.

Video: Other Metrics and the ROC Curve (DL 20), Davidson CSC 381: Deep Learning, F'20, F'22, www.youtube.com, 2020.10.12