12.5 Extending Binary Metric to Multiclass Problems (L12 Model Eval 5: Performance Metrics)
In this class, we discussed various classifiers that can be extended to work with multi-class settings. Decision trees, k-nearest neighbors, gradient boosting, random forests, and other classifiers naturally handle multi-class problems. However, you may come across classifiers like logistic regression or support vector machines that are more suited for binary classification. In such cases, we need to find ways to extend these binary classifiers to handle multiple classes.
One approach is the "one versus rest" or "one versus all" strategy, also known as OvR or OvA. This approach involves dividing the multi-class problem into separate binary classification problems. Each class is treated as the positive class in one binary classifier, while the remaining classes are treated as the negative class. For example, if we have three classes (yellow circles, red squares, and blue triangles), we create three binary classifiers: one for classifying yellow circles against the rest, one for red squares against the rest, and one for blue triangles against the rest. During training, we fit all three classifiers, and during prediction, we run all three classifiers and choose the one with the highest confidence score.
Another approach is the "one versus one" strategy, where we fit a binary classifier for each pair of classes. If we have three classes, we would have three binary classifiers: one for classifying yellow circles against red squares, one for yellow circles against blue triangles, and one for red squares against blue triangles. During prediction, we run all classifiers and use majority voting to determine the final class label.
Both OvR and OvO strategies allow us to extend binary classifiers to handle multi-class problems. However, OvO can be computationally expensive, especially when the number of classes is large, as it requires fitting multiple classifiers.
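To make the two strategies concrete, here is a minimal sketch using scikit-learn's OneVsRestClassifier and OneVsOneClassifier wrappers around a binary-style estimator; the dataset, estimator, and settings are placeholder choices rather than anything taken from the lecture.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Three-class toy dataset, used purely as a placeholder
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

base = LogisticRegression(max_iter=1000, random_state=1)

# One-vs-rest: one binary classifier per class (3 classifiers here)
ovr = OneVsRestClassifier(base).fit(X_train, y_train)
print("OvR accuracy:", ovr.score(X_test, y_test))

# One-vs-one: one binary classifier per pair of classes (3 pairs here)
ovo = OneVsOneClassifier(base).fit(X_train, y_train)
print("OvO accuracy:", ovo.score(X_test, y_test))
```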
When evaluating the performance of multi-class classifiers, we need to extend binary classification metrics to handle multiple classes. Two common approaches for doing this are micro and macro averaging. Micro averaging involves computing the precision, recall, and F1 score by aggregating the true positives and false positives over all classes. Macro averaging involves computing the precision, recall, and F1 score for each class separately and then averaging them. Micro averaging treats each instance or prediction equally, while macro averaging weights all classes equally. Additionally, there is the weighted approach, which considers the class imbalance by accounting for the number of true instances for each label.
In scikit-learn, you can specify the averaging method (micro, macro, or weighted) when using classification metrics like precision, recall, and F1 score. For example, you can pass average='micro' or average='macro' to compute the micro- or macro-averaged metric, respectively. There is also the area under the receiver operating characteristic curve (ROC AUC), which can be computed using the roc_auc_score function; its default averaging method is macro.
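As a small, hedged illustration of these averaging options (the label arrays below are made up for demonstration only):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical multiclass labels, invented for illustration only
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0]

for avg in ("micro", "macro", "weighted"):
    print(f"{avg:>8}: "
          f"precision={precision_score(y_true, y_pred, average=avg):.3f}, "
          f"recall={recall_score(y_true, y_pred, average=avg):.3f}, "
          f"f1={f1_score(y_true, y_pred, average=avg):.3f}")
```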
Dealing with class imbalance is another challenge in multi-class classification. Techniques like over-sampling and under-sampling can be used to address this issue. The imbalanced-learn library provides additional methods for handling class imbalance and is compatible with scikit-learn.
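As a rough sketch of how that library plugs in (assuming imbalanced-learn is installed; the synthetic dataset below is only for illustration):

```python
from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Synthetic, deliberately imbalanced three-class dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           weights=[0.8, 0.15, 0.05], random_state=1)
print("original class counts:", Counter(y))

# Over-sampling duplicates minority-class examples
X_over, y_over = RandomOverSampler(random_state=1).fit_resample(X, y)
print("after over-sampling:", Counter(y_over))

# Under-sampling discards majority-class examples
X_under, y_under = RandomUnderSampler(random_state=1).fit_resample(X, y)
print("after under-sampling:", Counter(y_under))
```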
Overall, model evaluation in multi-class classification involves extending binary classifiers, choosing appropriate averaging methods for evaluation metrics, and considering class imbalance. While we couldn't cover all the details in this class, there are resources like the imbalanced-learn library documentation that provide more information on these topics.
13.0 Introduction to Feature Selection (L13: Feature Selection)
Hello everyone! I hope you all had a productive semester and gained valuable knowledge from this class. I understand that this semester has been quite intense for most of us, so I didn't want to add more stress by overwhelming you with additional content. However, I apologize for not being able to cover certain topics as promised in the syllabus. To make up for it, I have prepared some bonus lectures during the winter break, starting today.
In this series of videos, I will focus on dimensionality reduction, specifically two methods: feature selection and feature extraction. These techniques are incredibly useful and important to understand. In today's lecture, we will dive into feature selection, exploring how it works, why it is essential, and its practical applications. In the next set of videos, we will cover feature extraction as an alternative approach.
Before we delve into the specifics of feature selection and feature extraction, let's briefly discuss the concept of dimensionality reduction and why it is significant. Dimensionality reduction aims to reduce the number of features in a dataset. For instance, consider the well-known Iris dataset, which consists of four features: sepal length, sepal width, petal length, and petal width. In feature selection, we choose a subset of these features, such as sepal width and petal length, to use in our machine learning algorithms. On the other hand, feature extraction involves creating new features through techniques like linear transformations. Principal Component Analysis (PCA) is one such method that combines multiple features into a smaller feature space.
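As a quick sketch of that distinction on the Iris data (the column indices are chosen just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
print(X.shape)  # (150, 4): the four original measurements

# Feature selection: keep a subset of the original columns,
# e.g., sepal width (index 1) and petal length (index 2)
X_selected = X[:, [1, 2]]
print(X_selected.shape)  # (150, 2), still the original features

# Feature extraction: create new features as combinations of all four,
# here via principal component analysis
X_extracted = PCA(n_components=2).fit_transform(X)
print(X_extracted.shape)  # (150, 2), but on new, transformed axes
```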
In this lecture series, our primary focus is feature selection, where we select original features from the dataset. Dimensionality reduction, in general, involves creating smaller feature spaces. Now, why do we care about these smaller dimensional feature spaces? Let's explore some reasons:
Curse of dimensionality: Machine learning classifiers often struggle as the number of features increases. They may suffer from overfitting, particularly algorithms like K-nearest neighbors and decision trees. Reducing the feature set can improve the performance of such algorithms.
Computational efficiency: Large datasets with numerous features can be computationally expensive to process. By reducing the feature set, we can enhance computational performance without sacrificing predictive performance.
Easier data collection: Sometimes, a simpler feature subset can yield similar performance compared to a larger feature subset. This can make data collection easier and more cost-effective, as collecting certain features may be cheaper or more accessible.
Storage space: Storing vast amounts of data can be challenging and expensive. Feature extraction and selection techniques help reduce the data's dimensionality, making storage more feasible and efficient.
Interpretability: Understanding the features can aid in interpreting complex machine learning algorithms. It is especially crucial in domains where explanations are required, such as customer credit card applications, where customers have the right to know the basis of decisions made by automated systems.
To summarize, dimensionality reduction can be divided into two subproblems: feature selection and feature extraction. In today's lecture, we will focus on feature selection. In the next lecture, we will discuss feature extraction in more detail.
To illustrate the importance of feature selection, let me share an example from a collaborative project with biochemists studying sea lamprey fish. The goal was to find a pheromone receptor inhibitor to control the sea lamprey population in the Great Lakes. We used feature selection to understand which molecular features were crucial. By evaluating the performance of different feature subsets using a machine learning classifier, we discovered that certain simple features, such as the number of sulfur oxygens, were highly informative. Adding more features did not significantly improve the performance, indicating that these simple features were the most important for our classification task. This knowledge guided us in screening millions of molecules and eventually finding a compound that showed promising inhibition, reducing the pheromone signal by a significant percentage using just the selected features. This example demonstrates the power of feature selection in identifying key components or characteristics that contribute to the desired outcome.
Now, let's delve deeper into feature selection and its relevance. Feature selection is a subcategory of dimensionality reduction, which aims to reduce the number of features or variables in a dataset while retaining relevant and meaningful information. The ultimate goal is to simplify the dataset and enhance computational efficiency, interpretability, and predictive performance.
There are several reasons why feature selection is important and beneficial. Firstly, the curse of dimensionality poses a challenge in machine learning. As the number of features increases, certain algorithms may experience diminishing performance or become more prone to overfitting. By reducing the feature set, we can mitigate these issues and improve the accuracy and reliability of the models, especially for algorithms like K-nearest neighbors and decision trees.
Secondly, computational efficiency is a significant consideration when dealing with large datasets. The computational cost of training and testing machine learning models increases with the number of features. By selecting a subset of relevant features, we can reduce the computational burden and expedite the process without sacrificing performance.
Moreover, feature selection allows for easier data collection. Sometimes, a simplified subset of features can provide similar predictive performance compared to a larger feature set. This is particularly valuable when collecting data becomes challenging or costly. For example, in the context of medical diagnosis, identifying easily obtainable features that still yield accurate results can save resources and make the process more accessible to patients.
Additionally, feature selection aids in storage space optimization. With the exponential growth of data generation, storing and managing large datasets becomes a significant concern. By selecting relevant features, we can reduce the storage requirements without compromising the overall performance or insights gained from the data.
Furthermore, interpretability plays a crucial role, especially when dealing with automated systems or regulated domains. Feature selection helps identify the most influential and interpretable features, enabling better understanding and explanation of the decision-making process. In contexts where legal or ethical requirements demand explanations for decisions, feature selection can facilitate compliance and accountability.
To summarize, dimensionality reduction, particularly feature selection, offers numerous advantages in various domains. By selecting the most informative features, we can mitigate the curse of dimensionality, improve computational efficiency, simplify data collection, optimize storage space, and enhance interpretability. These benefits contribute to more accurate and efficient machine learning models and enable a deeper understanding of complex problems.
In the upcoming lectures, we will explore feature extraction as another dimensionality reduction technique. Feature extraction involves transforming the original features into a new set of features through methods like principal component analysis (PCA). This process allows us to capture relevant information while reducing dimensionality. By understanding both feature selection and feature extraction, we can leverage the appropriate technique based on the specific characteristics and requirements of the dataset and problem at hand.
So, in the next lecture, we will delve into feature extraction and explore its techniques, advantages, and applications. Stay tuned as we continue our journey through the fascinating world of dimensionality reduction and its impact on machine learning.
13.1 The Different Categories of Feature Selection (L13: Feature Selection)
In the previous video, we explored the concept of feature selection as a subcategory of dimensionality reduction. Feature selection involves selecting subsets of features from a dataset to improve the performance of machine learning models. We discussed various motivations for feature selection, such as enhancing predictive performance and computational efficiency, optimizing storage space, and gaining insights into the data.
Now, let's delve deeper into the different categories of feature selection algorithms: filter methods, embedded methods, and wrapper methods. Filter methods focus on the intrinsic properties of the features themselves and do not involve a model or classifier. They analyze features based on their individual characteristics, such as variance or pairwise correlations. For example, computing the variance of a feature helps determine its usefulness for distinguishing between different training examples: if the feature values are well spread out along the axis, the feature is more likely to be informative. On the other hand, highly correlated features suggest redundancy, and one of them can often be eliminated without losing much information. Filter methods are typically based on univariate or bivariate statistics, as they analyze single variables or pairs of variables.
Embedded methods, as the name implies, incorporate feature selection within the learning algorithm. These methods are embedded in the model optimization process and aim to optimize the objective function. One example is decision trees, where features are selected internally while growing the tree. The decision tree chooses the feature that maximizes information gain at each split, resulting in the selection of important features. Unused features in the final decision tree can be considered less important.
Wrapper methods are closely aligned with the goal of optimizing predictive performance. These methods involve fitting a model to different feature subsets and selecting or eliminating features based on the model's performance. By comparing the performance of models trained on different feature subsets, we can determine the importance of each feature. For instance, if removing a feature leads to a significant drop in accuracy, it suggests that the feature is important for the model's performance. Wrapper methods provide valuable insights into feature importance by directly using the model's accuracy.
While wrapper methods offer accurate feature selection, they can be computationally expensive, especially when dealing with large feature sets. The process of fitting models to different subsets and evaluating their performance can be time-consuming. In contrast, filter methods are computationally more efficient, but they may not provide as accurate results as wrapper methods. The trade-off between accuracy and computational efficiency is a crucial consideration in feature selection.
In the upcoming videos, we will delve deeper into each category of feature selection algorithms. We will explore filter methods in more detail, followed by embedded methods and wrapper methods. By understanding these techniques, we can gain a comprehensive understanding of feature selection and how it can be applied to improve machine learning models.
Stay tuned for the next video, where we will discuss filter methods in depth.
13.2 Filter Methods for Feature Selection -- Variance Threshold (L13: Feature Selection)
Yeah, so in the previous video, we discussed the three different categories of feature selection: filter methods, embedded methods, and wrapper methods. Now, let's delve deeper into one of the categories, the filter methods. In the upcoming videos, we will also explore the embedded methods and the wrapper methods. However, for now, let's focus on the filter methods as the main topic.
Filter methods are feature selection techniques that primarily consider the intrinsic properties of the features themselves. They do not rely on a specific model for feature selection. One example of a filter method is the variance threshold. Let's take a closer look at how the variance threshold works.
When using a variance threshold for feature selection, we compute the variance of each feature. The assumption is that features with higher variances may contain more useful information for training a classifier or regression model. But why is this true? To understand this, let's consider a feature called X1. On the left-hand side, we have a feature with high variance, and the feature values are well spread out. On the right-hand side, we have a feature with low variance, and the feature values are less spread out. A higher variance gives us more to work with when constructing decision boundaries based on that feature, which is crucial for making accurate predictions. Even in the worst-case scenario where the classes overlap, well-spread features can still help construct decision boundaries.
To illustrate this concept further, let's consider a binary classification case. Suppose we have two classes, class square and class star. In the best-case scenario, all the data points from one class are on one side, and all the data points from the other class are on the other side. This makes it easy to construct a decision boundary that separates the classes perfectly. However, in real-world scenarios, perfect separation is not always achievable. Even when the classes overlap, a feature with high variance can still aid in constructing decision boundaries. For example, a decision tree can classify data points accurately based on well-spread features, as demonstrated in the coding example.
Now that we understand the importance of variance, let's discuss how we can use it as a measure for feature selection. The variance of a discrete random variable can be calculated using a specific formula, but in practice, we often work with datasets where we don't know the probability distribution. Therefore, we assume uniform weights and calculate the variance based on the observed data points. For example, when dealing with categorical features, we perform one-hot encoding to create binary variables. In this case, the variance of a Bernoulli variable can be computed as p * (1 - p), where p is the probability of observing a value of 1. This variance calculation is particularly useful for feature selection in categorical feature scenarios.
To implement variance-based feature selection, scikit-learn provides the VarianceThreshold class. This class allows us to remove features with low variances. By specifying a variance threshold, we can eliminate feature columns where a large percentage of the values are identical. For example, if we want to remove binary features where more than 80% of the values are the same, we can set the variance threshold to 0.16 (calculated as 0.8 * (1 - 0.8)). This threshold ensures that features with little discriminatory power are discarded.
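A minimal sketch of that usage, with a toy one-hot-style feature matrix invented for illustration:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy binary feature matrix: column 0 takes the same value in more than
# 80% of the rows, while columns 1 and 2 are more balanced
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [1, 0, 0],
              [1, 1, 1],
              [1, 1, 0],
              [0, 0, 1]])

# Remove binary features where more than 80% of values are the same:
# threshold = p * (1 - p) with p = 0.8
selector = VarianceThreshold(threshold=0.8 * (1 - 0.8))
X_reduced = selector.fit_transform(X)

print(selector.variances_)  # per-column variances
print(X_reduced.shape)      # (6, 2): column 0 was dropped
```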
In summary, filter methods like the variance threshold are valuable for feature selection because they consider the intrinsic properties of features. By analyzing the variance of features, we can identify and remove those that provide limited information for classification or regression tasks.
13.3.1 L1-regularized Logistic Regression as Embedded Feature Selection (L13: Feature Selection)
In the previous video, we discussed different methods for feature selection, specifically focusing on filter methods that are based on the properties of the features. Now, let's delve into two different categories of feature selection: embedded methods and wrapper methods. Both of these categories involve using a model, such as a classifier, for feature selection. In this video, we will focus on embedded methods, where feature selection happens implicitly as part of the model training or optimization process.
Embedded methods integrate feature selection into the model training process. We will explore this concept in the context of L1-regularized logistic regression, also known as Lasso regression. Before we proceed, it's important to note that this video assumes basic familiarity with logistic regression. However, we will only cover the essential concepts to avoid getting too sidetracked.
Let's start by considering a binary logistic regression model with two classes, using the Iris dataset with two features: petal length and petal width. Logistic regression produces a linear decision boundary to separate the two classes. The decision boundary is determined by applying a threshold to the weighted sum of the inputs, which undergo a nonlinear transformation.
To better understand logistic regression, let's examine a graphical representation of the model. In this diagram, we have the weights (w) on the left-hand side, with w1 and w2 representing the weights for the two features. Additionally, we have the bias unit (B) acting as an intercept term. The weighted sum is computed as the sum of the product of each weight and its corresponding feature, plus the bias term. This weighted sum is then passed through a sigmoidal function, also known as the logistic sigmoid, which outputs a value between 0 and 1. This value represents the class membership probability, indicating the probability that a data point belongs to class 1 given the observed features. By applying a threshold (typically 0.5), we can make binary predictions, classifying the data point as either class 0 or class 1.
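As a tiny numerical sketch of that computation (the weights, bias, and feature values below are made-up numbers, not fitted parameters):

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid: maps the net input to a value in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and one data point
w = np.array([0.8, -1.2])   # weights for petal length and petal width
b = 0.5                     # bias (intercept) term
x = np.array([1.4, 0.2])    # one flower's feature values

z = np.dot(w, x) + b        # weighted sum (net input)
prob = sigmoid(z)           # class-1 membership probability
label = int(prob >= 0.5)    # apply the 0.5 threshold
print(z, prob, label)
```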
Now that we have a basic understanding of logistic regression, let's focus on L1-regularized logistic regression. The key aspect of L1-regularization is the inclusion of an L1 norm term, which measures the magnitude of the weights. This term is added to the loss function, effectively penalizing complex models with large weights. In logistic regression, we aim to minimize the loss function while also minimizing the weights.
To visualize this, imagine contour lines representing the loss function. The outer contours correspond to large loss values, while the contours closer to the center represent smaller loss values. The global minimum of the loss function without regularization occurs at the center, indicating the optimal weights for minimizing loss. However, the L1 penalty term prefers smaller weights and encourages simplicity. By introducing this penalty term, we seek a balance between minimizing the loss and minimizing the penalty. Remarkably, L1-regularized logistic regression tends to produce sparse weights, with some weights being exactly zero. This feature selection aspect is what makes L1-regularization attractive.
To demonstrate L1-regularized logistic regression in practice, we will use the wine dataset. This dataset contains 13 different features related to various wine characteristics, and the task is to classify the wines into different types. We start by splitting the data into training and test sets, a common practice in machine learning.
Please note that the detailed code examples and further explanations can be found in the notebook accompanying this video, which will be provided below.
Now let's move on to the feature selection part using the L1 regularization approach, also known as Lasso. We'll use the Logistic Regression model from scikit-learn, which allows us to apply the L1 regularization penalty.
By setting the penalty parameter to 'l1', we specify that we want to use L1 regularization. The solver parameter is set to 'liblinear', which is suitable for small datasets like the one we're working with. After fitting the model on the training data, we can access the learned coefficients, which represent the weights assigned to each feature. Let's print the coefficients:
The coef_ attribute of the model contains the coefficients. We iterate over the coefficients and print them out, associating each coefficient with its corresponding feature. Next, we can identify the features that have non-zero coefficients, as these are the selected features. Let's find the selected features and print them:
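A sketch covering both of the steps just described, on the wine dataset; the regularization strength C=0.1 is only an example value, not necessarily the one used in the notebook.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=1, stratify=data.target)

# Standardize so the L1 penalty treats all features on a comparable scale
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)

# L1-penalized logistic regression; 'liblinear' supports the L1 penalty
lr = LogisticRegression(penalty="l1", C=0.1, solver="liblinear", random_state=1)
lr.fit(X_train_std, y_train)

# One row of coefficients per class (one-vs-rest), printed per feature
for name, coefs in zip(data.feature_names, lr.coef_.T):
    print(f"{name:>30}: {coefs}")

# Features with at least one non-zero coefficient are the selected ones
selected = [name for name, coefs in zip(data.feature_names, lr.coef_.T)
            if (coefs != 0).any()]
print("Selected features:", selected)
```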
By applying L1 regularization, the Logistic Regression model implicitly performs feature selection by driving some coefficients to zero. The features with non-zero coefficients are considered important for the model's predictions.
It's important to note that the choice of the regularization parameter C influences the degree of regularization applied. A smaller C value results in stronger regularization, potentially leading to more features with zero coefficients.
Now you have an understanding of the embedded feature selection method using L1 regularization in Logistic Regression. In the next video, we will explore feature selection with decision trees and random forests.
Decision trees are simple yet powerful models that make predictions by partitioning the feature space into regions and assigning a label to each region. Random forests, on the other hand, are an ensemble of decision trees where each tree is trained on a random subset of the data and features.
Let's start by using the Random Forest Classifier from scikit-learn for feature selection:
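A possible sketch of that snippet (the hyperparameters and random seed are placeholder choices):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=1, stratify=data.target)

# Random forests do not require feature scaling
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)

# Impurity-based importances: one value per feature, summing to 1
importances = forest.feature_importances_
for name, imp in zip(data.feature_names, importances):
    print(f"{name:>30}: {imp:.3f}")
```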
Next, we can rank the features based on their importances and select the top k features. Let's find the top k features and print them:
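Continuing the sketch above (the names feature_ranks and k are illustrative, not necessarily the ones used in the notebook):

```python
import numpy as np

k = 5  # number of top features to keep (an arbitrary example value)

# Indices of the features sorted by importance, largest first
feature_ranks = np.argsort(importances)[::-1]
top_k = feature_ranks[:k]

print(f"Top {k} features:")
for idx in top_k:
    print(f"{data.feature_names[idx]:>30}: {importances[idx]:.3f}")
```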
We sort the indices of the features based on their importances, in descending order. Then, we select the top k features by slicing the feature_ranks list. Finally, we print out the top features. Random forests consider the average contribution of each feature across all the decision trees in the ensemble. The higher the importance, the more influential the feature is in making predictions.
In this way, random forests provide a straightforward way to perform feature selection based on the importance scores.
Now you have an understanding of feature selection using decision trees and random forests. In the next video, we will cover the recursive feature elimination method.
13.3.2 Decision Trees & Random Forest Feature Importance (L13: Feature Selection)
Greetings, viewers! In our previous video, we began our discussion on embedded methods for feature selection, focusing on an example of regularized logistic regression. Today, we will delve into another example of embedded methods, namely decision trees, and examine how they select features at each node. We will also explore their relationship with random forest feature importance, a concept that you might already be familiar with. So, let's jump right in!
But before we proceed, I have a small announcement to make. Unfortunately, my iPad pencil or pen is no longer functional, so I have switched to using a pen tablet. However, I must admit that it has been a bit trickier to get used to than I anticipated. It might take me a few more videos to become truly comfortable and proficient with it. Additionally, I am trying out new screen annotation software, so please bear with me if there are any hiccups. I hope the annotation process becomes smoother in upcoming videos. Now, let's refocus on our topic.
To recap, feature selection can be broadly categorized into three main methods: filter methods, wrapper methods, and embedded methods. In our previous video, we explored Lasso or L1-regularized logistic regression as an example of embedded methods. Today, we shift our attention to decision trees and random forests.
First, let's discuss decision trees and how they perform feature selection. To illustrate this, let's consider a dataset we've previously examined. It consists of two features, x1 and x2, and two classes: squares (class 0) and triangles (class 1). Our goal is to classify the data points, and we can visualize the decision boundary as a dividing line that separates the two classes. Multiple decision boundaries can achieve this, as I will demonstrate shortly.
Now, let's take a closer look at how a decision tree splits the dataset. I've trained a decision tree using scikit-learn and plotted it for you. This tree exhibits two splits. The first split occurs on feature x1 at a cutoff value of 5.5, dividing the data into two groups. The second split takes place on feature x2 at a cutoff value of 10.5, further partitioning the data. By making these splits, the decision tree successfully classifies the dataset. We can evaluate the effectiveness of these splits by examining the entropy, which indicates the level of mixing or disorder in the classes. Our goal is to reduce the entropy as much as possible, ideally reaching a value of zero, which signifies perfect classification. In our example, we observe that the entropy decreases at each split, eventually reaching zero.
What's intriguing to note is that decision trees inherently perform feature selection. At each node, the tree decides which feature to use for the split. This decision is based on the feature that maximizes the decrease in entropy or maximizes the information gain. Consequently, the decision tree automatically selects the most informative features to construct the classification model.
Now, let's shift our focus to random forests, which are ensembles of decision trees. Random forests provide a means to estimate feature importance. To demonstrate this, let's turn to the wine dataset, which comprises 13 different features related to various characteristics of wine, such as alcohol content, malic acid, ash, and more. On the right-hand side, you can see a feature importance plot generated by the random forest. The feature importances range from 0 to 1 and sum up to 1, representing the relative importance of each feature. The plot is sorted in descending order, with the most important feature on the left.
To generate this plot, I utilized scikit-learn's feature_importances_ attribute, which calculates feature importances based on the random forest model. As you can observe, the most important feature in this dataset is proline, followed by flavanoids and color intensity.
The feature importance values are determined by measuring the total reduction in impurity (often measured by Gini impurity or entropy) achieved by splits on each feature across all the decision trees in the random forest. Features that consistently lead to a greater reduction in impurity are considered more important.
It's important to note that feature importance is a relative measure within the context of the random forest model. The values are specific to the random forest you've trained and may not generalize to other models or datasets. However, it can still provide valuable insights into which features are most influential in making predictions.
Now that we've covered decision trees and random forests, let's summarize what we've learned so far. Decision trees perform feature selection implicitly by selecting the most informative feature at each split, aiming to decrease the entropy and improve classification. On the other hand, random forests, as an ensemble of decision trees, provide a measure of feature importance by assessing the total reduction in impurity achieved by each feature across all trees.
Understanding feature importance can be beneficial in various ways. It helps identify the most relevant features for predicting the target variable, allows for dimensionality reduction by focusing on the most informative features, and provides insights into the underlying relationships between features and the target variable.
Now, let's dive deeper into the process of assessing feature importance in random forests. We'll explore a method called permutation importance. But before we do that, let's briefly revisit bootstrap sampling and the concept of out-of-bag samples.
Bootstrap sampling involves randomly sampling the original dataset with replacement, resulting in duplicated data points. As a result, some examples are not included in the bootstrap sample, creating what we call out-of-bag samples. These samples serve as validation or evaluation sets since the trees do not see them during training.
Now, let's focus on method B, which is permutation importance. It utilizes the out-of-bag samples we discussed earlier. Firstly, we can assess the predictive performance of decision trees in the random forest during training. For each tree, predictions can be made for the out-of-bag samples, which act as validation or test data points exclusive to that tree.
To compute permutation importance, we start with a feature matrix containing the original feature values for the out-of-bag examples. For each decision tree in the random forest, we permute the values of feature J in the out-of-bag examples. This means we randomly shuffle the feature values while keeping the class labels unchanged.
Next, we use the permuted feature matrix to make predictions on the out-of-bag examples using the current decision tree. Remember, these predictions are based on the permuted feature values, so they represent the model's performance when the feature J is randomized.
We compare the permuted predictions with the original predictions for each out-of-bag example and count how many times the correct class prediction changes due to the permutation of feature J. This count reflects the impact of feature J on the model's accuracy. If feature J is important, permuting its values should lead to a significant decrease in prediction accuracy.
We repeat this process for each feature in the dataset, computing the impact of each feature on the model's accuracy. The more a feature's permutation affects the predictions, the more important it is considered.
To quantify the feature importance, we calculate the decrease in accuracy caused by permuting each feature. This is done by subtracting the permuted accuracy from the original accuracy and averaging this difference over all decision trees in the random forest.
Finally, we normalize the feature importance values so that they sum up to one, providing a relative measure of importance among the features. This normalization ensures that the importance values are comparable and interpretable.
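The procedure above is the out-of-bag variant; a simpler version of the same idea, evaluated on a single held-out set, is available in scikit-learn as permutation_importance. Note that, unlike the description above, it reports raw drops in the score rather than values normalized to sum to one. A hedged sketch:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=1, stratify=data.target)

forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)

# Shuffle each feature column of the held-out set several times and record
# the average drop in accuracy relative to the unshuffled baseline
result = permutation_importance(forest, X_test, y_test,
                                n_repeats=10, random_state=1,
                                scoring="accuracy")

for name, mean, std in zip(data.feature_names,
                           result.importances_mean,
                           result.importances_std):
    print(f"{name:>30}: {mean:.3f} +/- {std:.3f}")
```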
However, it's essential to be aware that the permutation importance method has some limitations and considerations.
Firstly, permutation importance may underestimate the importance of correlated features. When permuting one feature, it can lead to changes in predictions for other correlated features. As a result, the importance of those correlated features may not be accurately reflected in the feature importance plot. It's important to consider the correlation between features when interpreting their importance.
Secondly, permutation importance assumes that the feature importance is solely based on the predictive accuracy of the model. While predictive accuracy is a crucial factor, it may not capture all aspects of feature importance. There could be other dimensions of importance, such as the interpretability or domain knowledge relevance of a feature.
Despite these limitations, permutation importance provides a valuable quantitative measure of feature importance. It allows researchers and practitioners to understand which features have the most influence on the model's predictions and can guide decisions related to feature selection, model interpretation, and dimensionality reduction.
In the next video, we will explore another category of feature selection methods called wrapper methods. Wrapper methods involve evaluating different subsets of features using a specific machine learning model. We will delve into recursive feature elimination and forward/backward feature selection. These methods can be particularly useful when the number of features is large and selecting the most relevant subset becomes crucial for model performance.
To recap, we have covered embedded methods, specifically decision trees and random forests, as techniques for feature selection. Decision trees perform feature selection implicitly by selecting the most informative feature at each split, aiming to decrease entropy and improve classification. Random forests, as an ensemble of decision trees, provide a measure of feature importance by assessing the total reduction in impurity achieved by each feature across all trees. We have also discussed the permutation importance method, which quantifies feature importance by permuting feature values and measuring their impact on the model's accuracy.
Understanding feature importance empowers data scientists and practitioners to make informed decisions about feature selection, interpret models, and gain insights into the underlying relationships between features and the target variable. It is a valuable tool in the machine learning toolkit that can contribute to better model performance and understanding.
In our previous videos, we have covered different methods for feature selection, including filter methods, wrapper methods, and embedded methods. In this video, we will focus on wrapper methods, specifically recursive feature elimination (RFE) and forward/backward feature selection.
Wrapper methods are feature selection techniques that involve evaluating different subsets of features using a specific machine learning model. Unlike filter methods, which rely on statistical measures, and embedded methods, which integrate feature selection into the model training process, wrapper methods use a model's performance as the criterion for selecting features.
Let's start by discussing recursive feature elimination (RFE). RFE is an iterative feature selection approach that works by recursively eliminating features and building models on the remaining features. It starts by training a model on the full feature set and ranks the features based on their importance. Then, it eliminates the least important feature(s) and repeats the process with the remaining features. This iterative process continues until a specified number of features is reached or a predefined performance threshold is achieved.
The idea behind RFE is that by recursively removing less important features, it focuses on the most informative features that contribute the most to the model's performance. RFE can be used with any machine learning model that provides a measure of feature importance or feature weights. Popular models used with RFE include logistic regression, support vector machines, and random forests.
Now, let's move on to forward/backward feature selection. These are two related wrapper methods that search for an optimal subset of features by iteratively adding or removing features based on their contribution to the model's performance.
Forward feature selection starts with an empty feature set and iteratively adds one feature at a time. At each iteration, it evaluates the model's performance using cross-validation or another evaluation metric and selects the feature that improves the performance the most. The process continues until a predefined stopping criterion is met, such as reaching a desired number of features or a plateau in performance improvement.
Backward feature selection, on the other hand, starts with the full feature set and iteratively removes one feature at a time. At each iteration, it evaluates the model's performance and removes the feature that has the least impact on the performance. The process continues until a stopping criterion is met.
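A minimal sketch of both directions using scikit-learn's SequentialFeatureSelector; the estimator, the number of features to select, and the cross-validation setting are example choices only.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)  # k-NN benefits from scaling
knn = KNeighborsClassifier(n_neighbors=5)

for direction in ("forward", "backward"):
    # Add (or remove) one feature at a time, scored by 5-fold cross-validation
    sfs = SequentialFeatureSelector(knn, n_features_to_select=5,
                                    direction=direction, cv=5)
    sfs.fit(X_std, y)
    selected = [i for i, keep in enumerate(sfs.get_support()) if keep]
    print(f"{direction} selection kept feature indices: {selected}")
```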
Both forward and backward feature selection can be computationally expensive, especially when dealing with a large number of features. To mitigate this, various strategies can be employed, such as using heuristics or approximations to speed up the search process.
It's worth noting that wrapper methods, including RFE, forward selection, and backward selection, can be sensitive to the choice of the evaluation metric and the machine learning model used. Different evaluation metrics may lead to different subsets of selected features, and the performance of the selected features may vary across different models.
In practice, it is recommended to perform cross-validation or use an external validation set to obtain a robust estimate of the model's performance with different feature subsets. This helps to avoid overfitting and select the features that generalize well to unseen data.
To summarize, wrapper methods, such as recursive feature elimination (RFE), forward feature selection, and backward feature selection, are iterative techniques for feature selection that evaluate different subsets of features based on a model's performance. These methods can help identify the most relevant features for a specific machine learning task, improve model interpretability, and reduce the dimensionality of the feature space.
In the next video, we will explore other advanced techniques for feature selection, including genetic algorithms and principal component analysis (PCA). These methods offer additional options for selecting features based on different optimization principles and statistical techniques. Stay tuned for that!
Feature selection is a critical step in the machine learning pipeline, and the choice of the right feature selection method depends on the specific dataset, the machine learning task, and the desired trade-offs between model performance, interpretability, and computational efficiency.
13.4.1 Recursive Feature Elimination (L13: Feature Selection)
In this section, we will explore the topic of Wrapper methods for feature selection, building upon our previous discussions on filter methods and embedded methods. Wrapper methods employ models explicitly for selecting features. One popular example of a wrapper method is recursive feature elimination (RFE), which we will focus on in this video. Additionally, we will also delve into other feature selection methods using wrapper techniques in upcoming videos.
To provide an overview, there are three main methods for feature selection: filter methods, embedded methods, and wrapper methods. Today, our focus is on wrapper methods. The core idea behind RFE can be summarized in three steps.
First, we fit a model to the dataset, typically using linear models such as linear regression or logistic regression. This step is nothing out of the ordinary.
Next, we examine the model and specifically look at the model coefficients, which we will discuss in more detail shortly. Based on the magnitudes of these coefficients, we eliminate the feature with the smallest coefficient. By considering the feature with the smallest coefficient as the least important, we can remove it from further consideration. It's worth noting that normalization or standardization of features is important for this process, ensuring that they are on a comparable scale. We will see concrete examples of this later.
The final step is to repeat steps one and two until we reach the desired number of features. In essence, we continuously fit the model and eliminate the least important feature until we have the desired set of features. This simple yet effective method provides a straightforward approach to feature selection.
One critical aspect of recursive feature elimination lies in the elimination of model coefficients or weights. To illustrate this, let's consider linear regression and logistic regression models. Linear regression is used for modeling continuous targets, while logistic regression is a classifier for discrete or categorical labels. We won't delve into the details of these models here, as they have been covered in previous lectures.
In both linear and logistic regression, the models have coefficients or weights. In linear regression, these weights represent the slopes, while in logistic regression, they are associated with the influence of each feature on the classification outcome. By examining the magnitudes of these weights, we can determine the importance of each feature. Eliminating the feature with the smallest weight or coefficient effectively removes it from consideration. Alternatively, setting the weight to zero achieves the same outcome, as the weighted sum computation excludes the feature's contribution.
To better understand how feature elimination works, let's walk through an example using logistic regression. We have a binary classification problem with two features, x1 and x2, and we want to determine the class membership probability. By computing a weighted sum using the feature values and model weights, we obtain the net input. Applying a logistic sigmoid function to the net input, we derive the class membership probability. Comparing this probability to a threshold, typically 0.5, allows us to assign class labels.
The key takeaway is that the weights in these models reflect the importance of each feature. Larger weights indicate greater importance, as they contribute more significantly to the net input and subsequently affect the classification outcome. Standardizing or normalizing the weights ensures they are on the same scale, facilitating a better interpretation of their importance.
Moving on, let's explore an example of using recursive feature elimination in scikit-learn with the wine dataset. The code presented here demonstrates the process. We first prepare the dataset by splitting it into training and test sets, followed by standardizing the features. Then, we instantiate an RFE object from the RFE class in scikit-learn. We pass a logistic regression estimator to the RFE object and specify the desired number of features to select (e.g., 5 in this case).
Once we have instantiated the RFE object, we can fit it to our training data using the fit method. This will start the recursive feature elimination process. The RFE object will train the logistic regression model on the training data and then eliminate the feature with the smallest coefficient. It will repeat this process iteratively until the desired number of features is reached.
After fitting the RFE object, we can access the selected features using the support_ attribute. This attribute returns a Boolean mask indicating which features were selected. We can also obtain the ranking of the features based on their importance using the ranking_ attribute. The lower the rank, the more important the feature.
In the next step, we can transform our original training data and test data to include only the selected features using the transform method of the RFE object. This will create new feature sets with the selected features only.
Finally, we can train a logistic regression model on the transformed training data and evaluate its performance on the transformed test data. This will allow us to assess the effectiveness of the feature selection process and determine if it improved the model's predictive accuracy.
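A sketch of the pipeline just described; the choice of five features and the logistic regression settings are only examples.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=1, stratify=data.target)

# Standardize so the coefficient magnitudes are comparable across features
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Recursively drop the feature with the smallest coefficient until 5 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000, random_state=1),
          n_features_to_select=5, step=1)
rfe.fit(X_train_std, y_train)

print("Selected:", [n for n, keep in zip(data.feature_names, rfe.support_) if keep])
print("Ranking:", rfe.ranking_)  # 1 = selected; larger = eliminated earlier

# Retrain on the selected features only and evaluate on the test set
lr = LogisticRegression(max_iter=1000, random_state=1)
lr.fit(rfe.transform(X_train_std), y_train)
print("Test accuracy:", lr.score(rfe.transform(X_test_std), y_test))
```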
It's worth noting that the number of features to select and the step size are hyperparameters that can be tuned to find the optimal configuration for a specific dataset and model. Grid search or other hyperparameter optimization techniques can be employed to find the best combination of these parameters.
Overall, recursive feature elimination is a wrapper method for feature selection that relies on training a model and iteratively eliminating the least important features. It can be applied to both regression and classification problems and can be used with different types of models. The selection of features is based on the coefficients or weights assigned to the features by the model. By iteratively removing the least important features, RFE aims to improve model performance by focusing on the most informative features.
13.4.2 Feature Permutation Importance (L13: Feature Selection)
Welcome to this video where we will delve into the topic of permutation importance. Permutation importance is a part of wrapper methods for feature selection, which we briefly discussed in the previous video. Wrapper methods involve using a model to perform feature selection or estimate feature importance. In a previous lecture, we explored recursive feature elimination as an example of a wrapper method. Now, we will shift our focus to permutation importance. In upcoming videos, we will also explore another method called sequential feature selection.
Before diving into the nitty-gritty details of how permutation importance works, let me provide you with a concise overview of the method. In essence, permutation importance involves shuffling each feature column in a dataset. Then, using an already trained model, we evaluate the model's performance on the shuffled dataset and compare it to the original performance. Typically, we observe a drop in performance when a feature column is shuffled. This drop in performance serves as an indicator of the importance of the feature. Of course, summarizing the method in just two steps may seem a bit complicated, so in the upcoming slides, I will walk you through the process in a more detailed and slower manner.
By applying permutation importance to each column in the dataset, we can generate a bar plot illustrating the importance of each feature. Additionally, we can optionally include the standard deviation of the importance values in the plot. In the upcoming video, I will provide a code example on how to create such a plot.
Now, before delving into the detailed explanation of permutation importance and the algorithm behind it, let's go over some noteworthy facts. Permutation importance often yields similar results to random forest feature importance based on impurity. However, the advantage of permutation importance is that it is model-agnostic, meaning it can be used with any type of machine learning algorithm or model. It is important to note that while permutation importance is not strictly a feature selection method, it does provide insights into the features that a model relies on the most. Consequently, we can use feature importance measures as a basis for selecting features.
If you recall our previous discussion on random forest feature importance, you can think of permutation importance as a generalization of one of the methods, specifically Method B, in that video. However, instead of using out-of-bag samples, permutation importance employs the holdout set. If you need a refresher on out-of-bag examples, feel free to revisit the previous video.
Now, let's dive into the step-by-step algorithm of permutation importance. First, we start with a model that has been fitted to the training set. This model can be any machine learning model or algorithm. As an example, let's consider a random forest classifier. We train the random forest on the training set, which is a standard step.
Next, we estimate the model's predictive performance on an independent dataset, such as the validation set or the test set. We record this performance as the baseline performance. For instance, let's say we achieve 99% accuracy on the validation set using our fitted random forest model. We consider this as the baseline performance.
For each feature column in the dataset, we randomly shuffle that specific column while keeping the other columns and class labels unchanged. This shuffling process is illustrated with an example dataset. Suppose we have a dataset with three feature columns and four training examples. We focus on shuffling column one, represented by a different color in the example. After shuffling, the order of values in that column changes. We randomly permute the values while maintaining the original values in columns two and three.
Another advantage of permutation importance is that it can handle correlated features well. Since it evaluates the importance of each feature individually by shuffling its values, it captures the unique contribution of each feature to the model's performance, regardless of correlations with other features. This is particularly useful in scenarios where there are high-dimensional datasets with interrelated features.
Permutation importance also provides a measure of feature importance that is more reliable than the inherent feature importance provided by some models. For example, in decision trees or random forests, the importance of a feature is based on the impurity reduction it achieves when splitting the data. However, this measure can be biased towards features with many possible splits or those that appear higher in the tree structure. Permutation importance provides a more direct and unbiased estimate of feature importance by directly evaluating the impact of shuffling each feature.
On the downside, permutation importance can be computationally expensive, especially if the model training process is time-consuming or if there are a large number of features. Since the permutation process requires re-evaluating the model's performance multiple times, it can add significant overhead. However, there are optimization techniques and parallelization strategies that can help mitigate this issue, such as using parallel computing or reducing the number of permutations.
It's worth noting that permutation importance is not a silver bullet for feature selection or model interpretation. While it provides valuable insights into the importance of individual features, it should be used in conjunction with other techniques and domain knowledge. Feature importance alone does not guarantee the predictive power or relevance of a feature. It's essential to consider the context, the specific problem, and the limitations of the model.
In summary, permutation importance is a powerful and model-agnostic method for assessing the importance of features in a machine learning model. By shuffling feature values and comparing the model's performance before and after the shuffling, it provides a reliable measure of feature importance. It is easy to understand, handles correlated features well, and is not susceptible to overfitting. However, it can be computationally expensive and should be used alongside other techniques for comprehensive feature selection and model interpretation.
13.4.3 Feature Permutation Importance Code Examples (L13: Feature Selection)
Alright, so now that we have covered the basic introduction to permutation importance, let's take a look at some code examples to see how we can use permutation importance in practice. Yeah, and as always, I also have the code examples in Jupyter Notebooks linked below the video. And also, like always, we will be working with the wine dataset again, just to keep things simple.
So the wine dataset, again, is a dataset consisting of 13 columns. And here is an overview of how the first five rows look. So there are three classes, class one, two, and three. And there are 13 columns, but not all the columns are shown here due to space constraints. But yeah, we won't be discussing this wine dataset in too much detail because we have seen it so many times before.
Yeah, and then also, like always, we will be splitting the dataset into a training and a test set. So here, what we're doing is taking the dataset, except the first column, which is the label column, and splitting it into a training and a test set where 30% of the data is used for testing and 70% is used for training. Notice here that we are not creating any validation set. It's just my personal opinion, but I don't think we necessarily need a validation set if we compute the permutation importance, because usually we should keep our test set independent. But if you think back to how permutation importance works, based on the previous video, we are here only looking at the drop in performance when we permute a feature column. So we are not really recomputing the test accuracy; we are just using the test set to look at how much the performance will drop if we shuffle a column.
Yeah, we are still in the setup stage here. So here in this slide, we are preparing our model, and in fact, it's a random forest classifier. In the previous video, we learned that permutation importance is a model-agnostic method, which means we can compute it for any type of model. However, we are using a random forest here so that we can then compare the permutation importance to the random forest's impurity-based importance, which might be an interesting comparison. So here, we are setting up a random forest classifier with 100 trees and fitting it to the training set. And here is just the accuracy computation. We can see that the training accuracy is 100% and the test accuracy is also 100%, which indicates that this is actually a pretty good model, or it can also just indicate that the dataset is pretty easy to classify.
One thing I wanted to note here also is that when we compute the permutation importance, it's kind of important to have a very well-performing model if we want to interpret the feature importance as a general feature importance. Because if we don't have a model that performs well, we might find out what features the model relies on the most, but it doesn't really tell us, let's say, how important the feature is in the context of the target variable if the model is not very accurate. So, before we look at the permutation importance, just for reference, again here is the impurity-based importance. So, this is the random forest impurity-based importance that we have discussed already in a previous video. Again, this is just for reference, where we access this feature importance attribute after fitting the model.
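Here is a hedged sketch of the impurity-based reference plot, combining the attribute access with the sorting and plotting described in the next paragraph; figure size and labels are illustrative choices, not taken from the lecture slides.

```python
import matplotlib.pyplot as plt

# Impurity-based importances are available after fitting the forest.
impurity_importances = forest.feature_importances_

# Sort from largest to smallest so the most important features come first.
sorted_idx = np.argsort(impurity_importances)[::-1]

plt.figure(figsize=(8, 4))
plt.bar(range(X_train.shape[1]), impurity_importances[sorted_idx])
plt.xticks(range(X_train.shape[1]),
           np.array(feature_names)[sorted_idx], rotation=90)
plt.ylabel("Impurity-based importance")
plt.title("Random forest impurity-based feature importance")
plt.tight_layout()
plt.show()
```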
Then, we apply argsort to obtain the sorting order of the importance values from largest to smallest, and we create a bar plot to visualize the impurity-based importance, with the feature names on the x-axis and the corresponding importance values on the y-axis. By sorting the importance values in descending order, the most important features are plotted first.
Next, the code moves on to computing the permutation importance. The permutation importance is calculated by randomly shuffling the values of each feature in the test set and measuring the drop in the model's performance. The larger the drop in performance, the more important the feature is considered to be. The code uses a for loop to iterate over each feature in the dataset.
Inside the loop, the feature values in the test set are shuffled using np.random.permutation(). Then, the shuffled test set is passed through the trained random forest classifier to obtain the predicted labels. The accuracy of the model on the shuffled test set is computed using the accuracy_score() function. The difference between the original test accuracy and the shuffled test accuracy represents the drop in performance caused by permuting the feature.
The drop in performance for each feature is stored in a list called importance_vals. After iterating over all the features, the importance_vals list contains the drop in performance values for each feature.
Finally, a bar plot is created to visualize the permutation importance. The feature names are plotted on the x-axis, and the corresponding drop in performance values are plotted on the y-axis. Again, the importance values are sorted in descending order to highlight the most important features.
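Below is a minimal sketch of this manual permutation loop, reusing the model, split data, and feature_names from the earlier sketches. It performs a single permutation round per feature, as described above; averaging over several rounds would give a more stable estimate.

```python
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(123)

# Baseline accuracy on the untouched test set.
baseline_acc = accuracy_score(y_test, forest.predict(X_test))

importance_vals = []
for col in range(X_test.shape[1]):
    X_test_permuted = X_test.copy()
    # Shuffle one feature column while leaving all others intact.
    X_test_permuted[:, col] = rng.permutation(X_test_permuted[:, col])
    permuted_acc = accuracy_score(y_test, forest.predict(X_test_permuted))
    # Importance = drop in accuracy caused by destroying this feature.
    importance_vals.append(baseline_acc - permuted_acc)

importance_vals = np.array(importance_vals)
sorted_idx = np.argsort(importance_vals)[::-1]

plt.figure(figsize=(8, 4))
plt.bar(range(X_test.shape[1]), importance_vals[sorted_idx])
plt.xticks(range(X_test.shape[1]),
           np.array(feature_names)[sorted_idx], rotation=90)
plt.ylabel("Drop in test accuracy")
plt.title("Permutation importance (single permutation round)")
plt.tight_layout()
plt.show()
```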
This code provides a comparison between the impurity-based importance and the permutation importance. By comparing the two plots, you can observe if there are any differences in the ranking of feature importance between the two methods.
Make sure to have the necessary libraries imported, such as matplotlib, numpy, sklearn.ensemble.RandomForestClassifier, sklearn.datasets.load_wine, and sklearn.metrics.accuracy_score.
The resulting permutation importances are stored in the importances variable. We use np.argsort to obtain the indices that would sort the importances in ascending order. This helps in plotting the importances in the correct order.
Finally, we create a horizontal bar plot using plt.barh to display the permutation importances. The y-axis represents the features, while the x-axis represents the importance values. The plt.xlabel, plt.ylabel, and plt.title functions are used to add labels and a title to the plot.
Please make sure to have the necessary libraries imported, such as matplotlib, numpy, sklearn.ensemble.RandomForestClassifier, sklearn.datasets.load_wine, sklearn.inspection.permutation_importance, and sklearn.model_selection.train_test_split.
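The sketch below shows how this variant could look using scikit-learn's permutation_importance, again reusing the fitted forest and split data from the earlier sketches; the choice of n_repeats=10 is an assumption rather than a value from the lecture.

```python
from sklearn.inspection import permutation_importance

# Repeat the shuffling several times per feature for a more stable estimate.
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=123
)
importances = result.importances_mean

# np.argsort returns ascending order, which suits a horizontal bar plot
# where the most important feature ends up at the top.
sorted_idx = np.argsort(importances)

plt.figure(figsize=(6, 5))
plt.barh(np.array(feature_names)[sorted_idx], importances[sorted_idx])
plt.xlabel("Mean drop in accuracy")
plt.ylabel("Feature")
plt.title("Permutation importance via sklearn.inspection")
plt.tight_layout()
plt.show()
```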
13.4.4 Sequential Feature Selection (L13: Feature Selection)
In the previous videos, I introduced the concept of feature importance using permutation importance as my favorite technique. In this video, I want to discuss another important technique called sequential feature selection, which is also part of the wrapper methods we previously talked about.
Before diving into sequential feature selection, let's briefly recap the different types of feature selection methods we have discussed so far. We started with filter methods, then moved on to embedded methods like recursive feature elimination, and now we are focusing on wrapper methods.
Wrapper methods evaluate candidate feature subsets using the performance of the model itself. The most straightforward wrapper approach is to try out all possible feature combinations, which is known as exhaustive feature selection. To understand how it works, let's consider the example of the Iris dataset, which has four features: sepal length, sepal width, petal length, and petal width. To find the best combination of features for our model, we would need to try out all possible non-empty subsets, ranging from single features to the full feature set.
For the Iris dataset, this would result in 15 possible combinations, including subsets of one, two, three, and four features. However, exhaustive feature selection can be computationally expensive and prone to overfitting. To mitigate these issues, we can use a validation set or K-fold cross-validation to evaluate the performance of different feature subsets.
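As a rough illustration of exhaustive feature selection, the following sketch enumerates all 15 non-empty feature subsets of the Iris dataset and scores each one with 5-fold cross-validation; the choice of a k-nearest neighbors classifier is an assumption for illustration, not the classifier used in the lecture.

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=3)

best_score, best_subset = -1.0, None
n_features = X_iris.shape[1]

# Enumerate all 2^4 - 1 = 15 non-empty feature subsets and
# evaluate each with 5-fold cross-validation.
for k in range(1, n_features + 1):
    for subset in combinations(range(n_features), k):
        score = cross_val_score(clf, X_iris[:, list(subset)], y_iris, cv=5).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print(f"Best subset: {best_subset}, CV accuracy: {best_score:.3f}")
```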
Despite its simplicity, exhaustive feature selection has a limitation when applied to datasets with a large number of features. The number of possible feature subsets grows exponentially with the number of features, making it impractical for large datasets. This limitation motivates the use of sequential feature selection, which is an approximation technique that explores a subset of feature combinations instead of evaluating all possible combinations.
Sequential feature selection is an iterative process that starts from either the full or the empty feature set and gradually removes or adds features based on their performance. One popular approach is sequential backward selection, where we start with the full feature set and iteratively remove one feature at a time. In each iteration, we try removing each of the remaining features in turn, evaluate the resulting subsets, and keep the subset with the highest performance. This process continues until only a single feature is left.
The sequential backward selection algorithm can be summarized as follows:
1. Initialize the current subset with all d features.
2. For each feature in the current subset, evaluate the performance of the subset with that feature removed.
3. Remove the feature whose removal results in the best (or least degraded) performance.
4. If more than one feature remains, go back to step 2; otherwise, stop.
By repeating steps 2-4, we gradually reduce the feature set until we reach the optimal subset. The final subset is selected based on the highest evaluation score, and in case of a tie, the smaller subset is preferred for computational efficiency. The number of iterations in sequential backward selection is equal to the number of features minus one.
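Here is a minimal sketch of the backward selection loop just outlined, using the Iris dataset with a k-nearest neighbors classifier and 5-fold cross-validation as an assumed evaluation criterion (the lecture's own notebooks may use a different estimator and scoring setup).

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=3)

def cv_score(features):
    # Evaluate a candidate subset via 5-fold cross-validation.
    return cross_val_score(clf, X[:, list(features)], y, cv=5).mean()

# Step 1: start with the full feature set.
current = tuple(range(X.shape[1]))
history = [(current, cv_score(current))]

# Steps 2-4: repeatedly drop the feature whose removal hurts least.
while len(current) > 1:
    candidates = list(combinations(current, len(current) - 1))
    scores = [cv_score(c) for c in candidates]
    best = int(np.argmax(scores))
    current = candidates[best]
    history.append((current, scores[best]))

# Pick the best-scoring subset seen during the search;
# ties are broken in favor of the smaller subset.
best_subset, best_score = max(history, key=lambda t: (t[1], -len(t[0])))
print(f"Selected features: {best_subset}, CV accuracy: {best_score:.3f}")
```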
Sequential forward selection is another variation of sequential feature selection. Instead of removing features, sequential forward selection starts with an empty feature set and adds one feature at a time. In each round, a classifier is trained with each remaining candidate feature added to the current subset, the resulting subsets are evaluated, and the feature that yields the highest performance is added. This process continues until the desired number of features is reached.
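For reference, scikit-learn ships a SequentialFeatureSelector (in sklearn.feature_selection, available since version 0.24) that supports both directions; the sketch below shows forward selection, although the lecture's own code may use a different implementation, and the target of two selected features is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Greedily add features one at a time until two features are selected.
sfs = SequentialFeatureSelector(
    knn, n_features_to_select=2, direction="forward", cv=5
)
sfs.fit(X, y)
print("Selected feature mask:", sfs.get_support())
```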
In summary, sequential feature selection is a useful technique for finding an optimal subset of features. It offers a trade-off between computational efficiency and finding a good feature combination. Sequential backward selection and sequential forward selection are two common variations of sequential feature selection, each with its own advantages and use cases. In the next video, we will explore how to implement sequential feature selection programmatically and address the limitations of exhaustive feature selection.