9.7 The .632 and .632+ Bootstrap methods (L09 Model Eval 2: Confidence Intervals)

In this video, we build on the previous one, in which we covered the bootstrap method, specifically the out-of-bag bootstrap, and used it to construct empirical confidence intervals. Here, we explore two related techniques, the .632 bootstrap and the .632+ bootstrap, and explain where the number 0.632 comes from.

To briefly recap the bootstrap procedure, we start with a dataset and create bootstrap samples by sampling with replacement. For each bootstrap sample, we fit a model and evaluate its performance on the out-of-bag samples. In the previous video, we also demonstrated how to implement this procedure in Python, using an object-oriented approach.

In the current video, I introduce a code implementation that simplifies this process: a class called "BootstrapOutOfBag" that takes the number of bootstrap rounds and a random seed as input. The class provides a "split" method that divides the dataset into training and test subsets, where the training subsets correspond to the bootstrap samples and the test subsets are the out-of-bag samples. By iterating over these splits, we can carry out the bootstrap procedure and evaluate the model's performance.
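
To make this concrete, here is a minimal sketch of how such a splitter can be used, assuming an interface like mlxtend's BootstrapOutOfBag; the dataset and classifier are arbitrary choices for illustration:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier
    from mlxtend.evaluate import BootstrapOutOfBag

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(random_state=123)

    # Each round: the training indices form the bootstrap sample (drawn with
    # replacement), and the test indices are the out-of-bag examples.
    oob = BootstrapOutOfBag(n_splits=200, random_seed=123)

    accuracies = []
    for train_idx, test_idx in oob.split(X):
        clf.fit(X[train_idx], y[train_idx])
        accuracies.append(clf.score(X[test_idx], y[test_idx]))

    print(f"Mean OOB accuracy: {np.mean(accuracies):.3f}")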

I then introduce another implementation, a function called "bootstrap_point632_score", which conveniently computes the out-of-bag or bootstrap scores. By providing the classifier, the training set, the number of splits, and a random seed, users can calculate the mean accuracy and obtain confidence intervals using the percentile method.
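
A sketch of that workflow, assuming mlxtend's bootstrap_point632_score function, where method='oob' gives the plain out-of-bag estimate:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier
    from mlxtend.evaluate import bootstrap_point632_score

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(random_state=123)

    # method='oob' -> plain out-of-bag bootstrap; method='.632' or '.632+'
    # are the variants discussed later in this video.
    scores = bootstrap_point632_score(clf, X, y, n_splits=200,
                                      method='oob', random_seed=123)

    mean_acc = np.mean(scores)
    lower, upper = np.percentile(scores, [2.5, 97.5])  # 95% percentile interval
    print(f"Accuracy: {mean_acc:.3f}, 95% CI: [{lower:.3f}, {upper:.3f}]")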

Next, the video addresses a shortcoming of the out-of-bag bootstrap method: its pessimistic bias. Bradley Efron proposed the 0.632 estimate as a way to address this bias. The pessimistic bias arises because each bootstrap sample contains fewer unique data points than the original dataset; on average, only about 63.2% of the original data points appear in a given bootstrap sample. I explain the probability calculation behind this figure and provide a visualization of how it behaves for different sample sizes.
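
The reasoning in brief: a given data point is missed in a single draw with probability 1 - 1/n, so it is missing from all n draws with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows; hence roughly 63.2% of the unique points end up in each bootstrap sample. A quick numerical check:

    import numpy as np

    for n in [10, 100, 1000, 100000]:
        p_not_chosen = (1 - 1/n) ** n          # P(point never drawn in n draws)
        print(f"n={n:>6}: P(out-of-bag)={p_not_chosen:.4f}, "
              f"P(in bootstrap sample)={1 - p_not_chosen:.4f}")

    print(f"Limit: 1 - 1/e = {1 - 1/np.e:.4f}")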

To overcome the pessimistic bias, the video introduces the 0.632 bootstrap method, which combines the out-of-bag accuracy and the resubstitution accuracy in each round. Specifically, the accuracy in each round is computed as the weighted sum 0.632 * ACC_oob + 0.368 * ACC_resub. The out-of-bag accuracy measures performance on the samples that were not included in the bootstrap sample, while the resubstitution accuracy measures performance on the same data used to fit the model.
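
A small sketch of how this per-round combination can be computed, assuming we have already collected the per-round out-of-bag and resubstitution accuracies (the values below are made up):

    import numpy as np

    def point632_estimate(acc_oob, acc_resub, weight=0.632):
        """Combine out-of-bag and resubstitution accuracy per bootstrap round."""
        acc_oob = np.asarray(acc_oob)
        acc_resub = np.asarray(acc_resub)
        return weight * acc_oob + (1.0 - weight) * acc_resub

    # Hypothetical per-round accuracies from five bootstrap rounds:
    acc_oob = [0.90, 0.92, 0.89, 0.91, 0.93]     # evaluated on out-of-bag samples
    acc_resub = [1.00, 0.99, 1.00, 0.98, 1.00]   # evaluated on the bootstrap (training) sample

    per_round = point632_estimate(acc_oob, acc_resub)
    print(f".632 bootstrap accuracy: {per_round.mean():.3f}")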

By combining these two terms, the 0.632 bootstrap aims to provide a less biased estimate of the model's performance: the optimistic resubstitution accuracy counterbalances the pessimistic out-of-bag accuracy, and the 0.632/0.368 weighting keeps that optimism in check.

In conclusion, this video builds upon the previous one by introducing two advanced bootstrapping techniques: the 0.632 bootstrap and the 0.632+ bootstrap. Both aim to mitigate the pessimistic bias of the out-of-bag bootstrap by combining the out-of-bag accuracy with the resubstitution accuracy. The video provides code implementations and explanations to facilitate understanding and application of these techniques.

9.7 The .632 and .632+ Bootstrap methods (L09 Model Eval 2: Confidence Intervals)
  • 2020.11.13
  • www.youtube.com
 

10.1 Cross-validation Lecture Overview (L10: Model Evaluation 3)

Hello everyone! Last week, we delved into the important topic of model evaluation, where we discussed various aspects such as evaluating model performance and constructing confidence intervals. However, our exploration of model evaluation is not yet complete, as there are other essential concepts that we need to cover. In practice, it's not only about evaluating a specific model; we also need to find a good model in the first place that we can evaluate.

In this lecture, we will focus on cross-validation techniques, which include methods for tuning hyperparameters and comparing models resulting from different hyperparameter settings. This process is known as model selection. Our main emphasis today will be on cross-validation.

We have a lot of topics to cover this week, but don't worry, each topic is relatively short. Let me provide an overview of what we will discuss in this lecture and the next:

  1. Cross-validation techniques for model evaluation: We will explore K-fold cross-validation and other related techniques for evaluating model performance. I will demonstrate code examples using Python and scikit-learn.

  2. Cross-validation for model selection: We will discuss how to use cross-validation for selecting the best model, including hyperparameter tuning. I will show you how to perform model selection using grid search and randomized search in scikit-learn.

  3. The law of parsimony: We will explore the concept of the one standard error method, which combines the idea of K-fold cross-validation with the principle of keeping models simple. I will also provide code examples for the one standard error method and repeated K-fold cross-validation, which is similar to the repeated holdout method discussed in the previous lectures.

Before we delve into cross-validation, let's have a quick reintroduction to hyperparameters and clarify their difference from model parameters. Then we will proceed to discuss K-fold cross-validation for model evaluation and other related techniques. We will examine the practical implementation of these techniques using Python and scikit-learn. Finally, we will extend our discussion to cross-validation for model selection, highlighting the distinction between model evaluation and model selection.

I have also prepared an overview based on extensive research and reading, categorizing different techniques based on specific tasks and problems. This categorization will help us navigate the different techniques and understand when to use each one. It's important to note that the recommendations provided in the overview are subject to further discussion, which we will engage in during the upcoming lectures.

That summarizes the lecture overview. Now, let's proceed with a reintroduction to hyperparameters, followed by a detailed exploration of cross-validation.

10.1 Cross-validation Lecture Overview (L10: Model Evaluation 3)
  • 2020.11.18
  • www.youtube.com
 

10.2 Hyperparameters (L10: Model Evaluation 3)

Before delving into cross-validation, let's take a moment to discuss hyperparameters. You may already be familiar with the concept, but if not, this will serve as a useful recap. Hyperparameters can be thought of as the tuning parameters or settings of a model or algorithm. They are the options that you manually adjust to optimize the performance of your model. To illustrate this, let's consider the K-nearest neighbor classifier, a nonparametric model.

Nonparametric models, unlike parametric models, do not have a predefined structure. Instead, they rely on the training set to define the model's structure. For instance, in K-nearest neighbors, the parameters of the model are essentially the training examples themselves. Thus, altering the training set, such as by adding or removing examples, can significantly impact the model's structure. Another example of a nonparametric model is the decision tree, where the number of splits in the tree depends on the training examples, rather than a predefined structure.

Now, let's focus specifically on the hyperparameters of the K-nearest neighbor algorithm. These hyperparameters include options like the number of neighbors (K) and the distance metric used (e.g., Manhattan or Euclidean distance). These options need to be set before running the model and are not learned from the data. In this course, we will explore techniques such as grid search or randomized search to assist with hyperparameter tuning. However, it's important to note that trying out different values for hyperparameters is not a process of fitting them to the data but rather an iterative experimentation to find the best settings.
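
For illustration (not the lecture's code), this is how those options appear in scikit-learn's KNeighborsClassifier; the particular values are arbitrary:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=123, stratify=y)

    # n_neighbors (K) and the distance metric are hyperparameters: they are
    # set before training and are not learned from the data.
    knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
    knn.fit(X_train, y_train)
    print(f"Test accuracy: {knn.score(X_test, y_test):.3f}")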

To provide more examples, let's refer to the definitions of hyperparameters in scikit-learn. When initializing a decision tree classifier, hyperparameters can include the impurity measure (e.g., Gini or entropy), the depth of the tree for pre-pruning, and the minimum number of samples per leaf, among others. These are all considered hyperparameters.
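
A corresponding sketch for scikit-learn's DecisionTreeClassifier, again with arbitrary values:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # criterion, max_depth, and min_samples_leaf are hyperparameters set at
    # initialization; random_state is an option but, as noted below, not a
    # hyperparameter we would tune for performance.
    tree = DecisionTreeClassifier(criterion='entropy',
                                  max_depth=4,
                                  min_samples_leaf=5,
                                  random_state=123)
    tree.fit(X, y)
    print("Tree depth after fitting:", tree.get_depth())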

Notably, not all options are hyperparameters, but all hyperparameters are options. For instance, the random state or random seed, which determines the randomness in the model, is not a hyperparameter. It is something that should not be manipulated to improve the model since changing the random seed for better performance would be considered unfair.

Now, let's contrast hyperparameters with model parameters. As an example, take a brief look at logistic regression, which can be seen as a linear model and serves as a bridge between classic machine learning and deep learning. In logistic regression, the inputs are the feature values plus an intercept term to account for the bias. The model weights, one per feature, form the parameters of the model. Initially, these weights can be set to zero or to small random values, and they are then updated iteratively to minimize a loss function (for instance, mean squared error in linear regression, or the logistic loss in logistic regression).

In logistic regression, a nonlinear function, typically the logistic function or sigmoid function, is applied to the net input (the weighted sum of inputs) to squash it into a range between zero and one. This output can be interpreted as the class membership probability in binary classification. The weights are adjusted to minimize the loss, which is computed by comparing the predicted class membership probability with the true class label (either 0 or 1). Logistic regression also employs regularization techniques, such as L1 or L2 regularization, which add a penalty term based on the size of the weights to prevent overfitting. The regularization strength (lambda) is a hyperparameter that needs to be set by the user.
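
As a concrete illustration (a sketch, not the lecture's code): in scikit-learn's LogisticRegression, the weights are learned from the data, while the regularization strength is a user-set hyperparameter; note that scikit-learn parameterizes it as C, the inverse of lambda:

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=123, stratify=y)

    # penalty and C (the inverse of the regularization strength lambda) are
    # hyperparameters chosen by the user; the weights are model parameters
    # learned from the training data.
    clf = make_pipeline(StandardScaler(),
                        LogisticRegression(penalty='l2', C=1.0))
    clf.fit(X_train, y_train)

    logreg = clf[-1]  # the fitted LogisticRegression step
    print("Learned weight matrix shape (model parameters):", logreg.coef_.shape)
    print("Class-membership probabilities:", clf.predict_proba(X_test[:3]).round(3))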

To summarize, model parameters like the weights (W) in logistic regression are learned from the training data, whereas hyperparameters such as the regularization strength (lambda) are determined by the user and are not learned from the data. Model parameters are the internal variables of the model that are updated during the training process to optimize performance, while hyperparameters are external settings that control the behavior of the model and need to be set before training.

The process of finding the optimal values for hyperparameters is known as hyperparameter tuning. It is an important step in machine learning as it can greatly impact the performance of a model. However, finding the best hyperparameter values is not a straightforward task and often requires experimentation and evaluation of different combinations.

One common approach to hyperparameter tuning is grid search, where a predefined set of values is specified for each hyperparameter, and all possible combinations are evaluated using cross-validation. Cross-validation is a technique used to assess the performance of a model by splitting the data into multiple subsets (folds), training the model on some folds, and evaluating it on the remaining fold. This helps to estimate the model's performance on unseen data and reduces the risk of overfitting.

Another approach is randomized search, where random combinations of hyperparameter values are sampled from specified distributions. This can be useful when the search space for hyperparameters is large, as it allows exploring a broader range of values without exhaustively evaluating all possible combinations.

In addition to grid search and randomized search, there are more advanced techniques for hyperparameter tuning, such as Bayesian optimization, which uses probabilistic models to guide the search process, and genetic algorithms, which mimic the process of natural selection to evolve the best set of hyperparameters.

It's worth noting that hyperparameter tuning can be computationally expensive, especially for complex models or large datasets. Therefore, it is often done in conjunction with techniques like cross-validation to make the most efficient use of the available data.

Hyperparameters are the settings or options of a model that need to be set before training, while model parameters are the internal variables that are learned from the data during training. Hyperparameter tuning is the process of finding the best values for these settings, and it is crucial for optimizing model performance. Techniques such as grid search, randomized search, Bayesian optimization, and genetic algorithms are commonly used for hyperparameter tuning.

10.2 Hyperparameters (L10: Model Evaluation 3)
  • 2020.11.18
  • www.youtube.com
 

10.3 K-fold CV for Model Evaluation (L10: Model Evaluation 3)

In this video, we will delve into the topic of cross-validation for model evaluation. Cross-validation is commonly used in conjunction with hyperparameter tuning and model selection. However, to facilitate better understanding, let's first explore how k-fold cross-validation works in the context of model evaluation alone, before discussing its application in model selection.

To begin, k-fold cross-validation for model evaluation involves splitting the dataset into a validation fold and the remaining folds used for training. Consider five-fold cross-validation as a typical example: the dataset is divided into one validation fold and four training folds. The model is trained on the training folds and evaluated on the validation fold, yielding a performance metric. Unlike the holdout method, where only one validation set is used, in k-fold cross-validation the validation fold rotates through different segments of the data, so every data point is used for evaluation exactly once. With five-fold cross-validation, there are five distinct validation folds and five iterations, each producing a performance measure. When reporting the overall performance, the typical approach is to average the performance values across all iterations.

It's important to note that in this discussion, we are focusing on k-fold cross-validation for model evaluation, without considering hyperparameter tuning. In this scenario, the performance estimate obtained through cross-validation can be considered an estimate of the model's generalization performance. By training a new model on the entire dataset using fixed hyperparameters, we can obtain a final model for practical use. While an independent test set can be used to further evaluate the model's performance, it is often unnecessary when no hyperparameter tuning is involved, as the cross-validation performance already provides a reliable estimate of generalization performance.

Now, let's explore some key properties of k-fold cross-validation. The validation folds are non-overlapping, meaning there is no overlap between the data points in the validation fold across different iterations. All the data points are utilized for testing, ensuring comprehensive evaluation. Some researchers may refer to the validation folds as test folds, as the terms can be used interchangeably.

On the other hand, the training folds are overlapping, which means they are not independent of each other. In a given iteration, the training data may have overlapping samples with the training data from other iterations. This characteristic makes it challenging to estimate the variance based on different training sets, which is important for understanding the model's performance variability.

Another noteworthy aspect is that reducing the value of k (the number of folds) makes the performance estimate more pessimistic. This is because with fewer data points available for training in each fold, the model's fitting capabilities are constrained. The performance estimate becomes more pessimistic due to the withheld data, as discussed in our previous explanation on performance pessimism.

Let's explore two special cases of k-fold cross-validation. When k equals 2, we have two-fold cross-validation, which is distinct from the holdout method. In two-fold cross-validation, the dataset is split exactly in half, and each half is used for training in one iteration and for validation in the other. In contrast, the holdout method allows arbitrary splitting proportions and does not rotate between iterations; each round of two-fold cross-validation can, however, be seen as a special case of the holdout method in which the dataset is split exactly in half.

Another special case is when k equals n, resulting in leave-one-out cross-validation (LOOCV). In each of the n iterations, a single data point is held out as the validation set while the remaining n-1 data points are used for training.
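
A minimal LOOCV sketch with scikit-learn (the dataset and classifier are arbitrary choices):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(random_state=123)

    # One iteration per data point: train on n-1 examples, test on the one left out.
    loo = LeaveOneOut()
    scores = cross_val_score(clf, X, y, cv=loo)

    print(f"Number of iterations: {len(scores)}")    # equals n
    print(f"LOOCV accuracy: {np.mean(scores):.3f}")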

A study conducted by Hawkins et al. (2003) examined the performance of different model evaluation methods, including leave-one-out cross-validation (LOOCV), and found that LOOCV tends to have high variance compared to other cross-validation methods. This high variance can be attributed to the fact that each validation fold in LOOCV consists of only one data point, resulting in a limited sample size for evaluation. Consequently, the performance estimates obtained from LOOCV can be highly sensitive to the specific data points chosen for validation in each iteration.

Despite its high variance, LOOCV has some advantages. Since each iteration involves training on n-1 data points, where n is the total number of data points, LOOCV tends to provide an unbiased estimate of the model's performance. Additionally, LOOCV utilizes all available data for training, which can be beneficial when the dataset is small or when a more precise performance estimate is desired.

However, due to its computational complexity, LOOCV may not be feasible for large datasets. The training process needs to be repeated n times, resulting in a significant computational burden. In such cases, k-fold cross-validation with a moderate value of k is often preferred.

Now that we have explored k-fold cross-validation for model evaluation, let's briefly discuss its application in model selection. In the context of model selection, the goal is to identify the best model from a set of candidate models, typically with different hyperparameter settings. Cross-validation can be used to estimate the performance of each model and facilitate the selection process.

The typical approach is to perform k-fold cross-validation for each model, calculate the average performance across all iterations, and compare the results. The model with the highest average performance is considered the best choice. This approach helps to mitigate the impact of data variability and provides a more robust evaluation of the models.

To summarize, cross-validation is a valuable technique for model evaluation and selection. By systematically rotating the validation fold through different segments of the data, it allows for comprehensive evaluation and provides estimates of the model's performance. Whether used solely for model evaluation or in combination with model selection, cross-validation helps researchers and practitioners make informed decisions about the generalization capabilities of their models.

10.3 K-fold CV for Model Evaluation (L10: Model Evaluation 3)
  • 2020.11.19
  • www.youtube.com
 

10.4 K-fold CV for Model Evaluation -- Code Examples (L10: Model Evaluation 3)

In the previous video, we discussed k-fold cross-validation as a method for evaluating machine learning models. In this video, we will explore how to implement k-fold cross-validation in Python using the scikit-learn library. I have uploaded the code notebook to GitHub, and you can find the link here.

Let's start by loading the necessary libraries and checking their versions. We will import NumPy and matplotlib, which are commonly used libraries. Next, we will demonstrate k-fold cross-validation using the KFold class from the model_selection submodule of scikit-learn.

To ensure reproducibility, we set a random seed using a random number generator object. We then create a simple label array with five labels from class 0 and five from class 1, along with a random feature matrix of 10 examples and four features. This is just a random dataset for illustration purposes; you can use any dataset you prefer, such as the iris dataset.

Next, we initialize a KFold object, which we name cv (short for cross-validation). We set the number of splits, n_splits, to five, indicating that we will perform five-fold cross-validation. Let's examine the behavior of this KFold object using the split method. When we execute this method, we obtain five results, each consisting of a tuple containing two arrays. The first array represents the training fold, and the second array represents the validation fold.

The numbers within these arrays correspond to the indices of the samples in the dataset. For example, if we want to obtain the actual labels corresponding to the first fold's training set, we can use these indices as an index array to select the labels. Similarly, we can select the corresponding features. It's important to note that the labels in the training and validation folds might be imbalanced, as we observed in this case.
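
A sketch of this setup; the toy arrays mirror the description above, and the exact values are arbitrary:

    import numpy as np
    from sklearn.model_selection import KFold

    rng = np.random.RandomState(123)
    y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # five labels per class
    X = rng.random_sample((10, 4))                  # 10 examples, 4 features

    cv = KFold(n_splits=5)
    for train_idx, valid_idx in cv.split(X, y):
        # The arrays contain indices; use them to select labels (or features).
        print("train labels:", y[train_idx], "| valid labels:", y[valid_idx])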

To address this issue, it is recommended to shuffle the dataset before performing k-fold cross-validation. We can achieve this by shuffling the dataset directly within the k-fold object during initialization. By setting a random state and shuffling, we obtain a better mix of labels in the training and validation folds.

Furthermore, it is generally advised to stratify the splits, ensuring that the proportion of class labels remains consistent across each fold. We can achieve this by using the StratifiedKFold class instead of the regular KFold class. When we use StratifiedKFold, the proportion of labels in each fold matches that of the original dataset.
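
Continuing the toy example, shuffling and stratification could look like this (a sketch):

    import numpy as np
    from sklearn.model_selection import KFold, StratifiedKFold

    rng = np.random.RandomState(123)
    y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
    X = rng.random_sample((10, 4))

    # Shuffling mixes the examples before splitting into folds.
    cv = KFold(n_splits=5, shuffle=True, random_state=123)

    # StratifiedKFold additionally keeps the class proportions constant per fold.
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
    for train_idx, valid_idx in skf.split(X, y):
        print("validation labels:", y[valid_idx])   # one label from each class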

Having discussed the general behavior of the KFold and StratifiedKFold objects, let's see how to apply them in practice. We will use a decision tree classifier and the iris dataset as an example. First, we split the iris dataset into 85% training data and 15% test data using the train_test_split function, passing the labels to its stratify argument so that the split preserves the class proportions.

Next, we initialize a StratifiedKFold object with k=10, following the recommendation from Ron Kohavi's classic study of cross-validation and bootstrap for accuracy estimation. We then take a manual approach to k-fold cross-validation, iterating over the training and validation indices produced by the split method. Within each iteration, we fit a new decision tree classifier on the training fold and predict the labels of the validation fold. We compute the accuracy for each iteration and store the results in a placeholder variable.

After iterating through all the folds, we calculate the average k-fold cross-validation accuracy by dividing the sum of the accuracies by the number of iterations. Finally, to evaluate the model on unseen data, we fit a new decision tree classifier using all the training data and compute the accuracy on the test set.
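
A sketch of this manual loop (not the exact notebook code):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import StratifiedKFold, train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.15, random_state=123, stratify=y)

    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)

    cv_acc = []
    for train_idx, valid_idx in skf.split(X_train, y_train):
        clf = DecisionTreeClassifier(random_state=123)
        clf.fit(X_train[train_idx], y_train[train_idx])
        cv_acc.append(clf.score(X_train[valid_idx], y_train[valid_idx]))

    print(f"Mean 10-fold CV accuracy: {np.mean(cv_acc):.3f}")

    # Refit on the complete training set and evaluate on the held-out test set.
    clf = DecisionTreeClassifier(random_state=123)
    clf.fit(X_train, y_train)
    print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")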

In this case, we obtained a k-fold cross-validation accuracy of 95.3% and a test set accuracy of 95%. These results suggest that our model performs well on both the cross-validation folds and the unseen test data.

However, manually iterating over the folds and fitting models can be a bit cumbersome. Fortunately, scikit-learn provides a more convenient way to perform k-fold cross-validation using the cross_val_score function. This function takes the model, the dataset, and the number of folds as inputs, and automatically performs k-fold cross-validation, returning the scores for each fold.

Let's see how this is done in practice. We start by importing the necessary libraries and loading the iris dataset. Next, we create an instance of the decision tree classifier and initialize a stratified k-fold object with k=10.

We then use the cross_val_score function, passing in the classifier, the dataset, and the k-fold object. This function automatically performs the k-fold cross-validation, fits the model, and computes the scores for each fold. By default, the cross_val_score function uses the accuracy metric, but you can specify other metrics if desired.

Finally, we print the cross-validation scores for each fold and calculate the average score. In this case, we obtained a mean cross-validation accuracy of 95.3%, which matches the accuracy we obtained manually.
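
The equivalent workflow with cross_val_score, again as a sketch:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(random_state=123)
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)

    # cross_val_score handles fitting and scoring for every fold; for a
    # classifier it uses accuracy by default (set scoring= for other metrics).
    scores = cross_val_score(clf, X, y, cv=skf)
    print("Per-fold accuracies:", np.round(scores, 3))
    print(f"Mean CV accuracy: {scores.mean():.3f}")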

Using cross_val_score is a more concise and efficient way to perform k-fold cross-validation, as it handles the entire process automatically. It also allows us to easily change the number of folds or switch to a different model without modifying the code significantly.

10.4 K-fold CV for Model Evaluation -- Code Examples (L10: Model Evaluation 3)
  • 2020.11.19
  • www.youtube.com
 

10.5 K-fold CV for Model Selection (L10: Model Evaluation 3)

In the previous two videos, we discussed k-fold cross-validation for model evaluation and examined some code examples. Now, we will focus on k-fold cross-validation for model selection. Model selection is often the common use case for k-fold cross-validation as it allows us to tune hyperparameters and select the best performing hyperparameter settings.

The overall process can be summarized in five steps. However, due to the limited space on the slide, I will zoom in on each step in the following slides to provide more detail. The five steps are similar to the three-fold holdout method for model selection that we previously discussed.

Step 1: Split the data into training and test sets. This step is the same as before, where we divide the dataset into two parts, one for training and the other for testing. We will focus on the training set for now.

Step 2: Apply the learning algorithm with different hyperparameter settings using k-fold cross-validation. Each hyperparameter setting, such as the maximum depth of a decision tree algorithm, is evaluated using k-fold cross-validation. For example, we can use k-fold cross-validation with k=10, as recommended by Ron Kohavi. This step provides us with different performance estimates for each hyperparameter setting.

Step 3: Select the best performing model. Based on the performance estimates obtained from k-fold cross-validation, we can identify the hyperparameter setting that performs the best. For example, we may find that a max depth of five performs the best among the tested values. We select this hyperparameter setting as the best.

Step 4: Fit the model with the best hyperparameter values to the training data. After identifying the best hyperparameter setting, we retrain the model using the entire training dataset and the selected hyperparameters. This ensures that we have a single model with the best hyperparameter values.

Step 5: Evaluate the model on an independent test set. To estimate the generalization performance of the model, we evaluate it on a separate test set that was not used during the training or hyperparameter selection process. This provides an unbiased assessment of the model's performance.

Optionally, we can perform an additional step where we fit the model with the best hyperparameter values on the entire dataset. This step is based on the assumption that the model might perform even better when trained on more data.

Having an independent test set is important to avoid selection bias. Sometimes, a hyperparameter setting may perform well on k-fold cross-validation by chance, leading to an overly optimistic estimate. By using an independent test set, we can obtain a more reliable assessment of the model's performance.

This procedure summarizes k-fold cross-validation for model selection. Now, let's explore some techniques for selecting hyperparameters during the model selection or hyperparameter tuning step.

One common method is grid search, which remains widely used. Grid search is an exhaustive search method where you define a list of hyperparameter values to consider. For example, in the case of k-nearest neighbors, you can tune the value of k by specifying a list of values such as 3, 5, 6, 7, 8, and 9. Grid search evaluates the performance of the model for each hyperparameter combination using k-fold cross-validation.
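
For example, a grid search over K for k-nearest neighbors could be sketched as follows (the value list mirrors the example above):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    param_grid = {'n_neighbors': [3, 5, 6, 7, 8, 9]}

    gs = GridSearchCV(estimator=KNeighborsClassifier(),
                      param_grid=param_grid,
                      cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=123),
                      scoring='accuracy')
    gs.fit(X, y)

    print("Best K:", gs.best_params_)
    print(f"Best mean CV accuracy: {gs.best_score_:.3f}")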

Grid search can be performed in parallel, allowing multiple hyperparameter combinations to be evaluated simultaneously. However, it can suffer from poor coverage if not all relevant hyperparameter values are included in the predefined grid. This is especially problematic for continuous hyperparameters or when certain values are skipped.

To address the coverage issue, randomized search is an alternative approach that samples hyperparameter values from distributions. Instead of specifying a fixed grid, you can define distributions, such as uniform, normal, exponential, beta, or binomial, to sample hyperparameter values. Randomized search provides more flexibility in exploring the hyperparameter space and can potentially cover a wider range of values. By sampling from distributions, randomized search allows for a more efficient exploration of the hyperparameter space.

Compared to grid search, randomized search is often more computationally efficient because it doesn't evaluate all possible combinations. Instead, it randomly samples a subset of hyperparameter values and evaluates them using k-fold cross-validation. The number of iterations or samples can be specified in advance.
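
A sketch with scikit-learn's RandomizedSearchCV, sampling K from a discrete uniform distribution; the distribution and ranges here are illustrative, not from the lecture:

    from scipy.stats import randint
    from sklearn.datasets import load_iris
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # Sample n_neighbors from a discrete uniform distribution on [1, 30).
    param_distributions = {'n_neighbors': randint(1, 30)}

    rs = RandomizedSearchCV(estimator=KNeighborsClassifier(),
                            param_distributions=param_distributions,
                            n_iter=20,            # number of sampled settings
                            cv=10,
                            random_state=123)
    rs.fit(X, y)

    print("Best setting:", rs.best_params_, f"CV accuracy: {rs.best_score_:.3f}")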

The advantage of randomized search is that it can efficiently search a large hyperparameter space, especially when some hyperparameters are less important than others. It can also handle continuous and discrete hyperparameters without the need to define a specific grid.

Both grid search and randomized search have their pros and cons. Grid search guarantees to cover all combinations within the defined grid, but it can be computationally expensive and may not be suitable for large hyperparameter spaces. Randomized search, on the other hand, is more efficient but does not guarantee exhaustive coverage.

In practice, the choice between grid search and randomized search depends on the size of the hyperparameter space, available computational resources, and the specific problem at hand.

Another technique for hyperparameter tuning is Bayesian optimization. Bayesian optimization uses a probabilistic model to model the relationship between hyperparameters and the objective function (e.g., model performance). It employs a surrogate model, such as Gaussian Processes, to approximate the objective function and uses an acquisition function to determine the next hyperparameter values to evaluate.

Bayesian optimization iteratively samples hyperparameter values based on the surrogate model and updates the model based on the evaluated performance. It focuses the search on promising regions of the hyperparameter space, leading to more efficient exploration.

The advantage of Bayesian optimization is its ability to handle both continuous and discrete hyperparameters, as well as non-convex and non-linear objective functions. It adapts to the observed performance and intelligently selects the next hyperparameter values to evaluate, potentially converging to the optimal solution with fewer evaluations compared to grid search or randomized search.
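
Purely as an illustration (the lecture does not prescribe a particular library), tools such as Optuna implement this kind of adaptive search; a minimal sketch, assuming Optuna is installed and using its default TPE sampler rather than a Gaussian-process surrogate:

    import optuna
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    def objective(trial):
        # Sample hyperparameters; the sampler focuses on promising regions
        # as more trials are observed.
        c = trial.suggest_float('C', 1e-3, 1e3, log=True)
        gamma = trial.suggest_float('gamma', 1e-4, 1e1, log=True)
        model = SVC(C=c, gamma=gamma)
        return cross_val_score(model, X, y, cv=10).mean()

    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=50)
    print("Best hyperparameters:", study.best_params)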

However, Bayesian optimization can be more computationally expensive, especially for large datasets or complex models. It requires evaluating the objective function multiple times to update the surrogate model and determine the next hyperparameter values to evaluate.

Overall, Bayesian optimization is a powerful technique for hyperparameter tuning, especially when the hyperparameter space is complex and the objective function is expensive to evaluate.

In summary, k-fold cross-validation is a valuable tool for both model evaluation and model selection. It allows us to estimate the performance of different models and select the best hyperparameter settings. Techniques like grid search, randomized search, and Bayesian optimization can be used to tune the hyperparameters and improve model performance. The choice of method depends on factors such as the size of the hyperparameter space, computational resources, and the specific problem at hand.

10.5 K-fold CV for Model Selection (L10: Model Evaluation 3)
  • 2020.11.20
  • www.youtube.com
 

10.6 K-fold CV for Model Selection -- Code Examples (L10: Model Evaluation 3)

In the last videos, we talked about k-fold cross-validation for model evaluation. Let's now take a closer look at some code examples for k-fold cross-validation, this time for model selection. I'll provide code examples on GitHub, and I will also include the link on Piazza and Canvas so that you can download the notebook and experiment with it later.

Alright, let's dive into the code notebook. As usual, we start with the watermark for checking the version numbers of the packages we are using. In this notebook, we will focus on grid search, which is really useful for hyperparameter tuning and model selection. For this demonstration, we will use the decision tree classifier on the iris dataset. Although the iris dataset may not be the most exciting, it allows us to keep things simple. Moreover, it will serve as a good practice before starting your class projects, where you will work with more complex datasets.

We begin by splitting our dataset into training and test sets. We use 85% of the data for training and 15% for testing, following the usual practice. Moving on to the grid search, we define two hyperparameter options: max depth and criterion. Max depth represents the maximum depth of the decision tree, and we set it to either 1, 2, 3, 4, 5, or None (no restriction on maximum depth). The criterion represents the function to measure the quality of a split, and we evaluate both "gini" and "entropy". In practice, the choice between gini and entropy makes little difference, but we include it for demonstration purposes.

Next, we create a parameter grid: a dictionary mapping each hyperparameter name to the list of values to try. Instead of a single dictionary, we can also pass a list of dictionaries to specify different scenarios, for example hardcoding a specific value for one hyperparameter in one grid while exploring all values of another hyperparameter in a second grid. This can be helpful when certain parameter choices conflict with each other. In this case, there is no conflict, so a single dictionary suffices.

We set the number of cross-validation folds (cv) to 10, indicating 10-fold cross-validation. For classifiers, the grid search uses stratified k-fold splitting by default, which keeps the label proportions constant in each fold. The scoring metric used for selecting the best hyperparameter settings defaults to accuracy for classifiers and the R-squared score for regressors. We also set n_jobs=-1, which uses all available CPU cores to evaluate the candidate settings in parallel.

After specifying all the necessary details, we fit the grid search object to our data. It performs an exhaustive search over the parameter grid, evaluating the performance of each hyperparameter combination using cross-validation. Once the grid search is complete, we can access the best score and the corresponding parameters using the best_score_ and best_params_ attributes, respectively. In this case, the best model has a max depth of 3 and criterion "gini", achieving an accuracy of 96% on average across the validation folds.

If we are interested, we can manually inspect the results stored in a dictionary, which contains all the information. Here, we focus on the mean test score, which represents the average performance over the validation folds for each hyperparameter setting. We print the scores together with the parameter settings for better readability.
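
A sketch covering the steps just described, i.e. defining the grid, fitting GridSearchCV, reading off the best setting, and inspecting the mean test scores (not the exact notebook code):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.15, random_state=123, stratify=y)

    param_grid = {'max_depth': [1, 2, 3, 4, 5, None],
                  'criterion': ['gini', 'entropy']}

    gs = GridSearchCV(estimator=DecisionTreeClassifier(random_state=123),
                      param_grid=param_grid,
                      cv=10,          # stratified 10-fold CV for classifiers
                      n_jobs=-1)
    gs.fit(X_train, y_train)

    print("Best CV accuracy:", round(gs.best_score_, 3))
    print("Best parameters:", gs.best_params_)

    # Mean validation-fold accuracy for every hyperparameter setting:
    for params, score in zip(gs.cv_results_['params'],
                             gs.cv_results_['mean_test_score']):
        print(f"{score:.3f} {params}")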

Optionally, we can summarize the results in a heatmap using a plotting function from the mlxtend library. The heatmap provides a visual representation of the performance for different hyperparameter settings. In this case, the choice between "gini" and "entropy" has almost no effect on performance, as indicated by the similar scores. The best performance is achieved with a max depth of 3 and the "gini" criterion.

After obtaining the best hyperparameter settings, we can use them to train the final model on the entire training dataset. This ensures that we utilize all available data for model training. We create a new decision tree classifier object, set the hyperparameters to the best values found during grid search, and fit the model to the training data.

Once the model is trained, we can make predictions on the test dataset and evaluate its performance. In this example, we calculate the accuracy score, which measures the proportion of correctly classified instances. We print the accuracy score, and in this case, we achieve an accuracy of 93% on the test set.

Overall, grid search allows us to systematically explore different hyperparameter combinations and select the best configuration for our model. It automates the process of hyperparameter tuning and helps in finding optimal settings for improved performance.

That's the basic idea of using grid search for model selection and hyperparameter tuning. Of course, this is just one approach, and there are other techniques available, such as randomized search, Bayesian optimization, and more. The choice of method depends on the specific problem and the resources available.

10.6 K-fold CV for Model Selection -- Code Examples (L10: Model Evaluation 3)
  • 2020.11.20
  • www.youtube.com
 

10.7 K-fold CV 1-Standard Error Method (L10: Model Evaluation 3)

In the previous discussion, we covered the concepts of k-fold cross-validation and model selection using grid search. However, there is another important topic to consider: the one standard error method. This method is relevant when we encounter situations where multiple hyperparameter settings perform equally well, and we need to choose the most appropriate one.

When we have multiple hyperparameter settings with similar or identical performance, it becomes crucial to decide which one to select. By default, scikit-learn chooses the first setting from the list if there is a tie. However, the one standard error method offers an alternative approach based on the principle of parsimony or Occam's razor. According to Occam's razor, when competing hypotheses perform equally well, the one with the fewest assumptions should be preferred.

To apply the one standard error method, we consider the numerically optimal estimate and its standard error. After performing model selection via k-fold cross-validation, we obtain performance estimates for different hyperparameter settings. Among these settings, we select the model whose performance is within one standard error of the best-performing model obtained in the previous step.

To illustrate this method, let's consider a binary classification dataset generated with scikit-learn, consisting of squares and triangles. We'll focus on an RBF kernel Support Vector Machine (SVM) for simplicity. The SVM has a hyperparameter called gamma, which controls the influence of each training example. We find that various gamma values result in accuracies ranging from 60% to 90%, with some settings showing similar performance.
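
To make this concrete, a sketch of evaluating an RBF-kernel SVM over a range of gamma values; the synthetic dataset and the specific values are illustrative:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                               n_informative=2, random_state=123)

    # gamma controls the influence radius of each training example.
    for gamma in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]:
        scores = cross_val_score(SVC(kernel='rbf', gamma=gamma), X, y, cv=10)
        print(f"gamma={gamma:>7}: mean CV accuracy = {scores.mean():.3f}")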

In the case of SVM, the complexity of the decision boundary depends on the gamma value. A higher gamma leads to a more complex decision boundary, while a lower gamma results in a simpler decision boundary. We can observe this by plotting the decision boundaries for different gamma values. The simpler models have decision boundaries that are closer to linear, while the more complex ones exhibit more intricate shapes.

However, when multiple hyperparameter settings yield similar accuracies, we want to select the simplest model whose performance is within one standard error of the best-performing model. For example, if the best-performing setting is gamma = 0.1, we would consider all settings whose accuracy lies within one standard error of that best accuracy and, among those, choose the one with the lowest complexity.

It's worth noting that I am not aware of a dedicated paper or publication on the one standard error method; it is a practical heuristic based on the principle of simplicity that has been widely adopted by practitioners. If there are published studies on this method, they would be valuable for further exploring its effectiveness and implications.

In the next video, we will delve into a code example that demonstrates how to implement the one standard error method in practice.

10.7 K-fold CV 1-Standard Error Method (L10: Model Evaluation 3)
  • 2020.11.20
  • www.youtube.com
 

10.8 K-fold CV 1-Standard Error Method -- Code Example (L10: Model Evaluation 3)

In this video, I will provide a detailed explanation of how I implemented the one standard error method discussed in the previous video. To follow along with the code examples, you can find them under this link, which I will also post on Canvas for easy access.

Let's step through the notebook together. First, we have the conventional imports. Then, I generate my own toy dataset using the make_circles function from scikit-learn, which lets you specify the number of examples and the amount of noise. The generated dataset is then split into training and test sets. This approach is excellent for conducting simulation studies on arbitrarily large datasets, since you can observe how learning curves and model behavior change as you vary parameters such as the noise level and the number of training examples. It serves as a useful testbed for experimentation.

Next, I use a support vector machine (SVM) as an example. You don't need to fully understand how SVMs work for this demonstration; I simply chose it as a clear example. The following steps take a manual approach: I define a list of hyperparameter settings and iterate over these values. For more complicated settings, you could use the ParameterSampler discussed in the previous video.

For this demonstration, I'm using a single hyperparameter, so a manual approach using a list and a for loop suffices. I initialize a list of parameters and then iterate over each value. In each iteration, I initialize the SVM model with the chosen hyperparameter setting. Then, I perform k-fold cross-validation to evaluate the model's accuracy. The accuracy values are collected, and I compute the average, standard deviation, and standard error. Please note that the naive approach I use to compute the standard error by dividing the standard deviation by the square root of the sample size might not be the best method because the rounds in k-fold cross-validation are not completely independent. However, for the purpose of obtaining some measure of similarity or error bars to compare different methods, this approach suffices.
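
A sketch of this loop together with the one-standard-error selection rule (my own reconstruction, not the notebook code):

    import numpy as np
    from sklearn.datasets import make_circles
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=500, noise=0.25, factor=0.5, random_state=123)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=123, stratify=y)

    gammas = [10.0**i for i in range(-4, 4)]   # exponentially spaced settings
    means, ses = [], []
    for gamma in gammas:
        scores = cross_val_score(SVC(kernel='rbf', gamma=gamma),
                                 X_train, y_train, cv=10)
        means.append(scores.mean())
        # Naive standard error; note the CV rounds are not fully independent.
        ses.append(scores.std(ddof=1) / np.sqrt(len(scores)))

    means, ses = np.array(means), np.array(ses)
    best = means.argmax()
    threshold = means[best] - ses[best]

    # Simplest model (smallest gamma) whose accuracy is within one SE of the best.
    within_1se = [g for g, m in zip(gammas, means) if m >= threshold]
    print("Best gamma:", gammas[best], "| selected by 1-SE rule:", min(within_1se))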

After collecting the accuracy values, I plot them on a log scale because the sampling is done exponentially. The resulting plot displays the performance of the SVM model for different hyperparameter settings. This is consistent with what we have seen in the lecture slides.

To demonstrate the applicability of this method to other classifiers, I also provide code for decision tree classification on the iris dataset. In this case, I vary the maximum depth parameter of the decision tree from 1 to 10. Similar steps are followed: initializing the model with a hyperparameter setting, fitting the model, making predictions, collecting k-fold cross-validation scores, computing the standard error, and so on. By analyzing the decision boundaries for different max depths, we can observe the trade-off between model complexity and performance. In this specific example, a decision tree with a maximum depth of three is selected using the one standard error method.

Finally, I briefly mention the topics we will cover in the next lecture, which include cross-validation for algorithm selection, statistical tests, and evaluation metrics. These topics are closely related to the concepts discussed in the previous lectures.

I hope you find this explanation helpful. Have a great weekend!

10.8 K-fold CV 1-Standard Error Method -- Code Example (L10: Model Evaluation 3)
  • 2020.11.20
  • www.youtube.com
 

11.1 Lecture Overview (L11 Model Eval. Part 4)

Hello everyone, and welcome! In our previous session, we delved into the topic of hyperparameter tuning and model selection. Our focus was on k-fold cross-validation, a technique used to rank different models with various hyperparameter settings in order to select the best one. We explored practical methods such as grid search and randomized search, which facilitate the process of model comparison.

Today, we will delve further into the aspect of model comparison. Suppose you come across a research paper that shares the predictions of one model on a test set. You may want to compare these predictions to those of your own model and determine if there is a statistically significant difference in their performance. Although this practice is not very common, it can be useful. One statistical test that can be employed in such cases is the McNemar test. Additionally, we will discuss algorithm comparisons, allowing us to compare different models and algorithms more equitably.

However, please note that today's lecture will be shorter than usual due to the Thanksgiving week. For those interested, the lecture notes provide more detailed explanations. They also cover additional statistical tests, such as the five-times-two F-test and various t-test procedures. While these topics are not examinable, they serve to satisfy your intellectual curiosity.

To optimize our time, we will not delve deeply into these methods since a lecture on performance metrics awaits us next week. If time permits, we may also touch upon feature selection and feature extraction. For your convenience, I have shared supplementary materials on these subjects via Canvas.

Now, let's get started with the main lecture on model evaluation, beginning with statistical tests for model comparisons. We will then address the challenges associated with multiple pairwise comparisons and explore methods to address them. Subsequently, we will delve into algorithm selection and examine a concrete code example related to the technique of nested cross-validation. This lecture overview sets the stage for our discussion today.

Before we proceed, let's recap the topics we covered in the previous lectures on model evaluation. We began with the fundamentals, including the bias-variance tradeoff, underfitting and overfitting, and the simple holdout method. We then delved into confidence intervals and introduced the bootstrap method for constructing empirical confidence intervals. We explored the repeated holdout method, which offers insights into model stability, although it is not commonly used in practice. It served as a good introduction to resampling methods.

Last week, we ventured into the realm of cross-validation, which added more depth to our exploration. We discussed hyperparameter tuning using grid search and randomized search and employed these techniques for model selection. Our primary focus was on the three-way holdout method, which involves splitting the dataset into training, validation, and test sets. We used the validation set to rank different models and the test set to estimate their final performance. For smaller datasets, we turned to k-fold cross-validation and leave-one-out cross-validation.

Today's lecture will introduce model and algorithm comparisons. While these concepts relate to model selection, our aim here is to compare different algorithms, seeking to determine which performs better across a range of related tasks. Ideally, we would have a collection of disjoint training and test sets for each algorithm. For example, when comparing image classification methods, we would employ various image datasets to train different models using different algorithms. We would then compare their performances on multiple datasets. However, practical constraints often limit our ability to follow this ideal approach. We encounter issues with violations of independence and other nuisances within datasets. This problem is reminiscent of the challenges discussed in the CIFAR-10 paper.

Moreover, how can we compare the performance of a model we have trained with that of a model published in a research paper or found on the internet? To address this, we can examine the actual difference in performance between the two models using statistical tests. One such test is the McNemar test, which is commonly used for comparing the predictive performance of two models on a binary outcome.

The McNemar test is suitable when we have paired data, meaning that each instance in the dataset is classified by both models, and the results are recorded as a contingency table. The contingency table has four cells representing the four possible outcomes:

                       Model 2: Positive    Model 2: Negative
  Model 1: Positive            a                    b
  Model 1: Negative            c                    d

To apply the McNemar test, we count the number of instances in each cell of the contingency table. Let's denote these counts as follows:

  • a: The number of instances where both models 1 and 2 predict positive.
  • b: The number of instances where model 1 predicts positive and model 2 predicts negative.
  • c: The number of instances where model 1 predicts negative and model 2 predicts positive.
  • d: The number of instances where both models 1 and 2 predict negative.

With these counts, we can perform the McNemar test to determine if there is a significant difference in the models' performances. The null hypothesis (H0) is that the two models have the same performance, while the alternative hypothesis (H1) is that there is a difference.

The McNemar test statistic follows a chi-square distribution with 1 degree of freedom. We compute the test statistic using the formula:

chi2 = ((|b - c| - 1)^2) / (b + c)

If the test statistic chi2 exceeds a critical value from the chi-square distribution (with 1 degree of freedom) at a chosen significance level (e.g., 0.05), we reject the null hypothesis and conclude that there is a significant difference in performance between the two models.
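
A sketch of the computation with made-up counts, using scipy only for the chi-square tail probability:

    from scipy.stats import chi2

    # Hypothetical discordant counts from the contingency table above:
    b = 30   # model 1 positive, model 2 negative
    c = 14   # model 1 negative, model 2 positive

    # Continuity-corrected McNemar statistic.
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = chi2.sf(stat, df=1)

    print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")
    # Reject H0 at alpha = 0.05 if p < 0.05.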

It's important to note that the McNemar test assumes that the paired instances are independent and identically distributed. This assumption may not hold if the paired instances are not truly independent or if there is some form of dependency between them. Additionally, the McNemar test is primarily applicable to binary outcomes. If the outcome is multi-class, alternative tests such as the Cochran's Q test or the Stuart-Maxwell test may be more appropriate.

Now, let's move on to discussing the challenges of multiple pairwise comparisons. When comparing multiple models or algorithms, it becomes increasingly likely to find significant differences by chance alone. This phenomenon is known as the multiple comparison problem or the problem of multiple testing.

When conducting multiple pairwise comparisons, the probability of at least one significant result increases as the number of comparisons grows. This inflation of the Type I error rate can lead to false positive findings, where we reject the null hypothesis erroneously.

To address the multiple comparison problem, we need to adjust the significance level of our statistical tests. One common approach is the Bonferroni correction, which involves dividing the desired significance level (e.g., 0.05) by the number of comparisons being made. For example, if we are comparing three models, we would adjust the significance level to 0.05/3 = 0.0167 for each individual test.

The Bonferroni correction is a conservative method that controls the familywise error rate, ensuring that the overall Type I error rate across all comparisons remains below a specified threshold. However, it can be overly stringent, leading to a loss of power to detect true differences.

Other methods for adjusting the significance level include the Holm-Bonferroni method, the Benjamini-Hochberg procedure, and the false discovery rate (FDR) control. These methods provide less conservative alternatives to the Bonferroni correction and may be more appropriate in certain situations.
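
For example, statsmodels bundles several of these corrections in a single function; the p-values below are made up for illustration:

    from statsmodels.stats.multitest import multipletests

    p_values = [0.012, 0.030, 0.045]   # hypothetical p-values from three comparisons

    for method in ['bonferroni', 'holm', 'fdr_bh']:
        reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method=method)
        print(method, "adjusted p-values:", p_adj.round(4), "reject:", reject)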

In summary, the McNemar test is a statistical test that can be used to compare the performance of two models on a binary outcome. However, when conducting multiple pairwise comparisons, it's important to account for the multiple comparison problem by adjusting the significance level.

11.1 Lecture Overview (L11 Model Eval. Part 4)
  • 2020.11.24
  • www.youtube.com