Python in algorithmic trading

 

Backtest Your Dollar Cost Average Strategy easily in Python

In the next 20 minutes or so, we will be implementing a dollar cost averaging strategy in Python. This strategy will allow you to assess the performance of dollar cost averaging for a specific asset or index over a certain period of time. We will be using a tool called backtesting.py to implement this strategy. Backtesting.py is a user-friendly framework in Python that is less intimidating than other libraries like Vectorbt or Backtrader. If you're new to Python, this will be a great option for you.

The dollar cost averaging strategy we will be implementing is relatively simple, but I will also show you how to extend it. Our strategy involves buying a fixed dollar amount of a particular asset on the same day every week (we'll use Tuesday) and repeating this process until we run out of data. To get started, open up a terminal and set up a new virtual environment to ensure a clean environment for our implementation. Once you have set up the virtual environment, install the backtesting package using pip:

pip install backtesting

After installing the package, we can proceed with our Python file. We will need to import some necessary modules and data. From backtesting, import the Backtest and Strategy classes. Additionally, import some dummy data from backtesting.test, specifically the GOOG dataset. We will also need the pandas module for data manipulation.

Now, let's define our strategy class. Create a class called DCA (Dollar Cost Average) that inherits from the Strategy class. Inside this class, we will set a class variable called amount_to_invest, which represents the fixed dollar amount we want to invest each week. Initially, set it to 10.

Next, we need to define two methods within this class: init and next. The init method is called once during initialization and is used to pre-compute any values we may need later. In our case, we will create an indicator that gives us the day of the week. To do this, we will use the self.I method provided by backtesting.py. We can define our indicator as self.day_of_week = self.I(lambda x: x, self.data.Close.s.index.dayofweek). This indicator returns an array of day-of-week values (0-6, where Monday is 0 and Sunday is 6) for our data.
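
A minimal sketch of this setup, assuming the GOOG test data and imports described above (variable names are ours):

import math  # used in the next() sketch below

from backtesting import Backtest, Strategy
from backtesting.test import GOOG


class DCA(Strategy):
    amount_to_invest = 10  # fixed dollar amount to invest each week

    def init(self):
        # Day of the week (0 = Monday ... 6 = Sunday) for every bar, taken from the datetime index
        self.day_of_week = self.I(
            lambda x: x,
            self.data.Close.s.index.dayofweek,
        )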

Now, let's move on to the next method, which is where we implement our trading logic. This method is called for each bar of data and allows us to make decisions based on the current data. In our case, we check whether the current day of the week equals 1 (Tuesday) using if self.day_of_week[-1] == 1:. If it is Tuesday, we trigger a buy. To execute the buy order, we use the self.buy method provided by backtesting.py, with a size calculated by dividing amount_to_invest by the current closing price. Because the order size must be a whole number of shares, we use math.floor to round the result down.
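
Continuing the DCA class above, a sketch of the next method with the Tuesday buy logic (the shares > 0 guard is our addition, to skip bars where the fixed dollar amount buys less than one whole share):

    def next(self):
        # Runs once per bar: buy a fixed dollar amount every Tuesday (day value 1)
        if self.day_of_week[-1] == 1:
            shares = math.floor(self.amount_to_invest / self.data.Close[-1])
            if shares > 0:
                self.buy(size=shares)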

To handle fractional shares, we can scale the price data by multiplying it by a small number, such as 10 ** -6. This effectively splits each share into microshares; the position sizes can later be converted back to the actual number of shares bought by multiplying by the same small number.
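
As a rough illustration of this scaling trick, one way to build a microshare version of the test data (the variable name scaled is ours) is:

scaled = GOOG.copy()
scaled[["Open", "High", "Low", "Close"]] *= 10 ** -6  # each unit of the scaled asset is one microshare

The backtest would then be run on scaled instead of GOOG, and the resulting position sizes multiplied by 10 ** -6 to recover real share counts.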

Finally, we need to set up the backtest, run it, and extract the statistics. Create a Backtest instance from our data and the DCA strategy class, call bt.run(), and assign the result to a variable called stats. We can also plot the results using bt.plot().
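
Putting it together, a minimal run might look like this; the cash level and the larger amount_to_invest override are illustrative (bt.run accepts strategy class variables as keyword arguments):

bt = Backtest(GOOG, DCA, cash=10_000)
stats = bt.run(amount_to_invest=1000)  # override the class default so whole shares can actually be bought
print(stats)
bt.plot()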

Since we haven't implemented the sell logic yet, the plot appears as a continuous line without any selling points. We'll fix that soon. But before we do, let's extract some statistics from the backtest results.

To do this, we'll use the stats variable we defined earlier. We can print out various statistics like the total return, annualized return, maximum drawdown, and more.

Feel free to add more statistics if you're interested in exploring additional performance metrics.

Now let's move on to implementing the sell logic. Since we're using a dollar-cost averaging strategy, we'll sell the same fixed dollar amount every week. In our case, we'll sell on Fridays.

Here, we check if the day of the week is 4 (Friday) using the day_of_week indicator we created earlier. If it is Friday, we sell the same dollar amount we bought earlier by dividing amount_to_invest by the current closing price. This ensures we sell the appropriate number of shares to match our investment amount.
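
Continuing the next method sketched earlier, the Friday sell leg could look like this (again a sketch; note that with no open long position, self.sell would open a short instead):

        # Sell the same fixed dollar amount every Friday (day value 4)
        elif self.day_of_week[-1] == 4:
            shares = math.floor(self.amount_to_invest / self.data.Close[-1])
            if shares > 0:
                self.sell(size=shares)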

Now, when we run the backtest, we should see selling points on the plot, indicating the Fridays when we sell our position.

Feel free to experiment with different variations of this strategy, such as adjusting the buy/sell days or implementing additional conditions based on price movements. This framework allows you to easily extend and customize your strategy according to your requirements.

Remember to adjust the amount_to_invest variable and explore different asset data to see how the strategy performs.

I hope this helps you in implementing and exploring the dollar-cost averaging strategy using the backtesting.py library in Python. Let me know if you have any further questions!

Backtest Your Dollar Cost Average Strategy easily in Python
  • 2022.06.29
  • www.youtube.com
We backtest a simple dollar cost averaging strategy in backtesting.py. Backtesting.py is a super easy to use framework for beginners to python or to backtest...
 

Custom Indicators In Backtesting.py - Python Deep Dive

In this video, we are going to explore the process of creating custom indicators in the backtesting.py library. This feature will enable us to easily backtest any trading strategy by creating indicators and translating Python functions into a format compatible with the backtesting.py ecosystem.

Before we delve into the details of indicator creation, it is recommended to check out a freely available course on YouTube that covers most aspects of backtesting.py. This course will provide a high-level understanding of the library, which will be beneficial when exploring indicator creation in this video.

In this video, we will focus on three different examples to cover various indicator ideas. The first example involves using signals generated in an external Python program and integrating them into backtesting.py. This approach is useful when you already have buy and sell signals from an external source and want to incorporate them into your backtesting process.

The second example will demonstrate the use of pandas-ta library to return multiple values for each indicator. Specifically, we will work with the Bollinger Bands indicator and showcase how to return a data frame containing both the lower and upper bands, instead of just a simple numpy array. This example will highlight the versatility of creating indicators with multiple values.

Finally, we will hand code a momentum strategy to demonstrate how custom indicators can be created using pure Python. This example will showcase the flexibility of creating indicators using Python programming, allowing for limitless possibilities in indicator design.

To follow along with the examples, ensure that you have the necessary libraries installed, including backtesting, pandas, and pandas-ta. Once you have installed these libraries, create a Python file for the code examples.

The initial part of the code sets up the necessary boilerplate when using backtesting.py. It imports the required classes, Backtest and Strategy, and imports the GOOG sample data (daily Google stock prices) from backtesting.test. The imported data is a pandas data frame containing daily price data, including open, high, low, close, and volume, with a datetime index.

For the first example, we assume that you have already generated some signals in an external program and want to transfer them to backtesting.py. To demonstrate this, we create random signals using numpy and add them to the Google data frame. These signals could represent any indicator you have programmed in Python, where -1 denotes a sell signal, 0 indicates no action, and 1 represents a buy signal.
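
A minimal sketch of this setup, using random signals as a stand-in for an external program's output (the column name signal and the seed are our choices):

import numpy as np
from backtesting import Backtest, Strategy
from backtesting.test import GOOG

data = GOOG.copy()
np.random.seed(42)
# -1 = sell, 0 = do nothing, 1 = buy
data["signal"] = np.random.choice([-1, 0, 1], size=len(data))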

Next, we define a strategy class called "SignalStrategy" that inherits from the "Strategy" class imported earlier. This class will be responsible for implementing the buying and selling logic based on the signals. The class includes the initialization function "init" and the "next" function.

In the "init" function, we don't have much to do in this particular example, but it is good practice to include it. The "next" function is where the buying and selling logic will be implemented based on the signals.

To execute the backtest, we create an instance of the Backtest class, passing the Google data frame and the SignalStrategy class. We also set the cash value to 10,000. Then, we run the backtest and store the results in the "stats" variable. Finally, we print out the statistics to see the performance of the strategy.

Running the code at this point won't yield any trades because we haven't implemented the buying and selling logic yet. However, we can access the signal values by using "self.data.signal" within the "next" function, which will give us the latest signal value.

To implement the buying and selling logic, we check the current signal value and the current position. If the signal is 1 (buy signal) and there is no existing position, we execute a buy order using "self.buy". If the signal is -1 (sell signal) and there is an existing long position, we execute a sell order using "self.sell".
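
Here is a sketch of that logic; closing the long position with self.position.close() is a slight variation on calling self.sell directly:

class SignalStrategy(Strategy):
    def init(self):
        pass  # nothing to pre-compute; the signals already live in the data frame

    def next(self):
        signal = self.data.signal[-1]
        if signal == 1 and not self.position:
            self.buy()
        elif signal == -1 and self.position.is_long:
            self.position.close()

bt = Backtest(data, SignalStrategy, cash=10_000)
stats = bt.run()
print(stats)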

  1. External Signal Strategy:

    • Generate random signals or obtain signals from an external program.
    • Define a SignalStrategy class inheriting from Strategy.
    • Implement the next method to execute buy or sell orders based on the signals.
    • Use the self.buy() and self.sell() methods to execute the orders.
    • Instantiate a Backtest object with the data, strategy, initial capital, and commission.
    • Run the backtest using bt.run() and analyze the results.

  2. Using pandas-ta for Custom Indicators (see the sketch after this list):

    • Import the pandas_ta library (install it with pip install pandas_ta).
    • Use the desired indicator function from pandas_ta to calculate the indicator.
    • Append the calculated indicator to the data frame.
    • Define a strategy class inheriting from Strategy.
    • Implement the next method to execute buy or sell orders based on the indicator values.
    • Use the desired conditions to determine when to buy or sell.
    • Instantiate a Backtest object with the data, strategy, initial capital, and commission.
    • Run the backtest using bt.run() and analyze the results.
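
For the pandas-ta item above, a minimal sketch might look like the following; the entry and exit conditions are our own, and the band indexing assumes pandas-ta's usual bbands column order (lower, mid, upper, bandwidth, percent):

import pandas_ta as ta
from backtesting import Backtest, Strategy
from backtesting.test import GOOG


def bollinger_bands(close_series, length=20, std=2):
    bb = ta.bbands(close=close_series, length=length, std=std)
    # Return shape (n_lines, n_bars) so backtesting.py treats each row as one indicator line
    return bb.to_numpy().T


class BollingerStrategy(Strategy):
    def init(self):
        # self.data.Close.s exposes the Close column as a pandas Series
        self.bbands = self.I(bollinger_bands, self.data.Close.s)

    def next(self):
        lower = self.bbands[0]  # lower band
        upper = self.bbands[2]  # upper band
        price = self.data.Close[-1]
        if not self.position and price < lower[-1]:
            self.buy()
        elif self.position and price > upper[-1]:
            self.position.close()


bt = Backtest(GOOG, BollingerStrategy, cash=10_000)
print(bt.run())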

Remember to replace placeholders like GOOG with your actual data and customize the strategies according to your specific requirements.

Custom Indicators In Backtesting.py - Python Deep Dive
  • 2022.07.30
  • www.youtube.com
Learn how to make your own custom indicators in backtesting.py. We show how you can integrate libraries like pandas-ta, ta-lib, etc. as well as write your ow...
 

Stop Losses in Backtesting.py

In this video, we are going to explore the concept of stop losses in the "backtesting.py" library. The video will cover three examples of increasing complexity and depth, providing a comprehensive understanding of stop losses in "backtesting.py". The presenter assumes some prior knowledge of "backtesting.py" and recommends watching a free course on YouTube for beginners before diving into this advanced topic.

To get started, open a terminal and ensure that "backtesting.py" is installed by running the command "pip install backtesting". This will install all the necessary packages. Then, create a new Python file, let's call it "example.py", and import the required modules: Backtest and Strategy from "backtesting", and the GOOG dataset from "backtesting.test". GOOG is a daily Google stock test dataset that comes with "backtesting.py".

Next, define the strategy class by creating a class called "Strats" that inherits from the Strategy class. Implement the two required methods: init and next. At this point, we are ready to run our backtest. Initialize a new backtest object, "bt", using the Backtest class. Pass in the GOOG data and the strategy class we just defined. Set the initial cash value to $10,000. Finally, run the backtest using the "bt.run" method and plot the results using "bt.plot".
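
A minimal sketch of that boilerplate (the strategy does nothing yet):

from backtesting import Backtest, Strategy
from backtesting.test import GOOG


class Strats(Strategy):
    def init(self):
        pass

    def next(self):
        pass


bt = Backtest(GOOG, Strats, cash=10_000)
stats = bt.run()
bt.plot()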

Initially, the strategy class does not perform any trading actions. To demonstrate a simple stop-loss example, we will add some basic buying and selling logic. If we have an existing position, we won't take any action. However, if we don't have a position, we will place a buy order using the "self.buy" method, specifying the size of the position (e.g., 1 share). Additionally, we will add a stop loss and take profit. The stop loss will be set at 10 units below the current closing price, while the take profit will be set at 20 units above the current closing price.
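
Filling in the next method, a hedged sketch of this buy-with-stop-loss logic (the 1-share size and the 10/20 unit offsets mirror the description above):

    def next(self):
        if not self.position:
            price = self.data.Close[-1]
            # sl and tp are absolute price levels in backtesting.py
            self.buy(size=1, sl=price - 10, tp=price + 20)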

Running the backtest will generate a large number of trades. As soon as a trade is closed, a new trade will be opened on the next bar unless the stop loss or take profit is triggered. It's important to understand how "backtesting.py" handles stop losses and take profits. In cases where both the stop loss and take profit are triggered in the same bar, the library assumes that the stop loss is triggered first. This behavior can lead to unexpected outcomes, especially when dealing with daily data that may have significant gaps.

To manage stop losses more effectively, we can extend the strategy class and use the TrailingStrategy class provided by "backtesting.py". Import the necessary modules, including crossover and TrailingStrategy from "backtesting.lib". In the new strategy class, inherit from TrailingStrategy instead of the base Strategy class. Override the init function to call the parent class's init function using super. Then, use the set_trailing_sl function from the parent class to set a trailing stop loss value.

In the next section of the video, the presenter explains in more detail how TrailingStrategy works and how to customize it for specific requirements. In this section, however, the focus is on using TrailingStrategy in our code. By calling the parent class's init function and then calling set_trailing_sl, we can leverage the trailing stop loss functionality in our backtest.

Overall, the video provides a step-by-step explanation of implementing stop losses in "backtesting.py". It covers simple examples as well as more advanced concepts like trailing stop losses; here we pass in a value of 10, which means our stop loss will trail the price by 10 units.

Now that we have set up our initialization function, let's move on to the next function. This is where the bulk of our trading logic will be implemented. Inside the next function, we'll first call the parent class's next function using super().next(). This ensures that the trailing stop loss functionality is executed along with the other trading logic.

Next, we'll add some code to adjust our trailing stop loss. We'll use a conditional statement to check whether we have an open position (if self.position: evaluates to true when a position is open). If we have a position, we'll update the trailing stop loss using the update_trailing_sl method provided by the trailing strategy class. This method takes the current price as an argument and updates the stop loss accordingly.
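
A condensed sketch of the trailing-stop variant, assuming the built-in TrailingStrategy helper from backtesting.lib; note that its set_trailing_sl method expresses the trailing distance in ATR multiples rather than raw price units, and super().next() lets it move the stops each bar:

from backtesting import Backtest
from backtesting.lib import TrailingStrategy
from backtesting.test import GOOG


class TrailingStrat(TrailingStrategy):
    def init(self):
        super().init()
        self.set_trailing_sl(10)  # trail the stop 10 ATRs behind the price

    def next(self):
        super().next()  # updates the trailing stop losses for open trades
        if not self.position:
            self.buy(size=1)


bt = Backtest(GOOG, TrailingStrat, cash=10_000)
print(bt.run())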

Stop Losses in Backtesting.py
  • 2022.08.19
  • www.youtube.com
In this video we go in-depth on how to use stop-losses in backtesting.py. We cover both static and trailing stop losses and how backtesting.py executes them ...
 

Backtest Validation in Python (Fooled By Randomness)

We've all been in that situation where we create a trading strategy, backtest it, and when we finally implement it, it fails to perform as expected. One of the main reasons for this disappointment is overfitting the strategy to a specific set of historical data used in the backtest. In this video, I will demonstrate a strategy to combat overfitting and ensure that you don't rely on strategies that lack a solid foundation or get fooled by randomness.

Let's dive into a specific example. I conducted a backtest on a simple RSI-based strategy using Bitcoin as the asset. The strategy involves selling when the RSI is high and buying when the RSI is low. The backtest results showed a modest return of about three percent, despite Bitcoin experiencing a 15 percent decline in the tested period. At first glance, it may seem like a promising strategy for bear markets.

However, it is crucial to examine the strategy's performance over various time frames to determine if it consistently identifies profitable opportunities or if it simply got lucky with the chosen parameter values during the backtest. To achieve this, I conducted multiple 30-day backtests, covering different periods throughout the year.

By plotting the distribution of returns from these backtests, we can gain insights into the strategy's effectiveness. The plot shows each 30-day window as a dot, representing the returns obtained during that period. The accompanying box plot displays the median return, quartiles, maximum, and minimum values. Analyzing the plot, it becomes evident that the median return over a 30-day period is -8.5 percent. Furthermore, the distribution of returns appears to be random, similar to the results one would expect from a random number generator set between -35 and 15. These findings strongly indicate that the strategy is not unique or effective beyond the specific historical data used in the backtest.

To validate the strategy and mitigate the influence of overfitting, we need to conduct backtests on a broader range of data. For this purpose, I downloaded multiple data files covering the entire year, from the beginning of 2022 to the end of 2022. I combined these files into a master CSV containing one-minute candle data for the entire period.
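
A small sketch of that merge step (the file names and paths are hypothetical):

import glob

import pandas as pd

files = sorted(glob.glob("data/BTCUSDT-1m-2022-*.csv"))
master = pd.concat(pd.read_csv(f, index_col=0, parse_dates=True) for f in files)
master.to_csv("btc_2022_1m.csv")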

In the validation code, I made some minor adjustments to accommodate the extended dataset. The core strategy remains the same, focusing on RSI-based trading logic. However, I introduced a loop to conduct backtests on 30-day windows throughout the data. Each backtest calculates returns, which are then added to a list for further analysis.

By generating a box plot using the collected returns, we can visualize the distribution of strategy performance across various 30-day windows. This plot reveals the variability of returns and provides a clearer picture of how the strategy performs over different time intervals. In this specific example, the plot indicates predominantly negative returns for almost every month, suggesting that the strategy lacks consistent profitability.
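
Sketching this loop-and-plot workflow, assuming the master CSV from the previous step and an RsiStrategy class (defined elsewhere) that holds the RSI logic:

import matplotlib.pyplot as plt
import pandas as pd
from backtesting import Backtest

data = pd.read_csv("btc_2022_1m.csv", index_col=0, parse_dates=True)

window = pd.Timedelta(days=30)
returns = []
start = data.index[0]
while start + window <= data.index[-1]:
    chunk = data.loc[start:start + window]
    stats = Backtest(chunk, RsiStrategy, cash=10_000).run()
    returns.append(stats["Return [%]"])  # per-window return in percent
    start += window

plt.boxplot(returns)
plt.ylabel("30-day return [%]")
plt.show()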

These techniques for validating and verifying trading strategies can be applied to any backtesting framework of your choice. The provided code utilizes the backtesting.py library, but you can adapt it to other libraries like vectorbt or backtrader. The key idea is to ensure that your strategy demonstrates robustness across diverse time frames and is not simply a product of overfitting to a specific set of historical data.

By following these validation steps, you can reduce the risk of relying on strategies that are not grounded in reality or falling victim to random outcomes. It is essential to go beyond backtest performance and consider the strategy's effectiveness in different market conditions to make informed decisions when implementing trading strategies.

After analyzing the backtest results and the distribution of returns across different timeframes, we discovered that the strategy's performance was essentially random. It did not provide consistent profitability outside of the specific time period used for backtesting. This indicates that the strategy suffered from overfitting and lacked robustness.

To avoid falling into the overfitting trap and increase the chances of developing reliable trading strategies, here are a few recommendations:

  1. Use Sufficient and Diverse Data: Ensure that your backtest incorporates a significant amount of historical data to cover various market conditions. This helps to capture a broader range of scenarios and reduces the likelihood of overfitting to specific market conditions.

  2. Validate Across Multiple Timeframes: Instead of relying solely on a single time period for backtesting, test your strategy across different timeframes. This provides insights into its performance under various market conditions and helps identify if the strategy has consistent profitability or if the observed results were due to randomness.

  3. Implement Out-of-Sample Testing: Reserve a portion of your historical data for out-of-sample testing. After conducting your primary backtest on the initial dataset, validate the strategy on the reserved data that the model has not seen before. This helps assess the strategy's ability to adapt to unseen market conditions and provides a more realistic evaluation of its performance.

  4. Beware of Curve Fitting: Avoid excessive optimization or parameter tuning to fit the strategy too closely to historical data. Strategies that are too tailored to specific data patterns are more likely to fail in real-world trading. Aim for robustness rather than chasing exceptional performance on historical data alone.

  5. Consider Walk-Forward Analysis: Instead of relying solely on static backtests, consider using walk-forward analysis. This involves periodically re-optimizing and retesting your strategy as new data becomes available. It allows you to adapt and fine-tune your strategy continuously, improving its performance in changing market conditions.

  6. Use Statistical Significance Tests: Apply statistical tests to evaluate the significance of your strategy's performance. This helps determine if the observed results are statistically meaningful or merely due to chance. Common statistical tests used in backtesting include t-tests, bootstrap tests, and Monte Carlo simulations.

By following these guidelines, you can reduce the risk of developing strategies that are overly fitted to historical data and increase the likelihood of creating robust and reliable trading approaches.

Remember, the goal is to develop trading strategies that demonstrate consistent profitability across different market conditions, rather than strategies that merely perform well on historical data.

Backtest Validation in Python (Fooled By Randomness)
  • 2022.09.14
  • www.youtube.com
In this video we go through a method that I've found helpful for validating my backtests before I go live with a strategy. Looking at the distribution of ret...
 

A Fast Track Introduction to Python for Machine Learning Engineers

The course instructor begins by introducing the concept of predictive modeling and its significance in the industry. Predictive modeling focuses on developing models that can make accurate predictions, even if they may not provide an explanation for why those predictions are made. The instructor emphasizes that the course will specifically focus on tabular data, such as spreadsheets or databases. The goal is to guide the students from being developers interested in machine learning in Python to becoming proficient in working with new datasets, developing end-to-end predictive models, and leveraging Python and the SciPy library for machine learning tasks.

To start, the instructor provides a crash course in Python syntax. They cover fundamental concepts like variables and assignments, clarifying the distinction between the "equals" sign used for assignment and the "double equals" sign used for equality comparisons. The instructor demonstrates how to use Jupyter Notebook for Python coding and provides tips for navigation, such as creating a new notebook, using aliases for libraries, executing cells, and copying or moving cells. They also explain the auto-save feature and manual saving of notebooks. Finally, the video briefly touches on stopping the execution of the kernel.

Moving on, the instructor explains how to use the toolbar in Jupyter Notebook for Python engine navigation and how to annotate notebooks using Markdown. The video covers essential flow control statements, including if-then-else conditions, for loops, and while loops. These statements allow for decision-making and repetition within Python code. The instructor then introduces three crucial data structures for machine learning: tuples, lists, and dictionaries. These data structures provide efficient ways to store and manipulate data. Additionally, the video includes a crash course on NumPy, a library that enables numerical operations in Python. It covers creating arrays, accessing data, and performing arithmetic operations with arrays.

The video proceeds to discuss two essential libraries, Matplotlib and Pandas, which are commonly used in machine learning for data analysis and visualization. Matplotlib allows users to create various plots and charts, facilitating data visualization. Pandas, on the other hand, provides data structures and functions for data manipulation and analysis, particularly through series and data frame structures. The video highlights the significance of Pandas' read_csv function for loading CSV files, the most common format in machine learning applications. It also emphasizes the usefulness of Pandas functions for summarizing and plotting data to gain insights and prepare data for machine learning tasks. Descriptive statistics in Python are mentioned as a crucial tool for understanding data characteristics and nature.

The video dives into specific data visualization techniques that can aid data analysis before applying machine learning techniques. Histograms, density plots, and box plots are introduced as ways to observe the distribution of attributes and identify potential outliers. Correlation matrices and scatter plot matrices are presented as methods to identify relationships between pairs of attributes. The video emphasizes the importance of rescaling, standardizing, normalizing, and binarizing data as necessary preprocessing steps to prepare data for machine learning algorithms. The fit and transform method is explained as a common approach for data preprocessing.
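
For instance, the fit and transform pattern with scikit-learn's MinMaxScaler on a toy matrix looks like this (the data values are made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # toy feature matrix
scaler = MinMaxScaler()
scaler.fit(X)               # learn per-column min and max
print(scaler.transform(X))  # each attribute rescaled to the 0-1 range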

The next topic discussed is data preprocessing techniques in machine learning. The video covers normalization and standardization as two important techniques. Normalization involves rescaling attributes to have the same scale, while standardization involves transforming attributes to have a mean of zero and a standard deviation of one. Binarization, which thresholds data to create binary attributes or crisp values, is also explained. The importance of feature selection is emphasized, as irrelevant or partially irrelevant features can negatively impact model performance. The video introduces univariate selection as one statistical approach to feature selection and highlights the use of recursive feature elimination and feature importance methods that utilize decision tree ensembles like random forest or extra trees. Principal component analysis (PCA) is also discussed as a data reduction technique that can compress the dataset into a smaller number of dimensions using linear algebra.
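
A compact sketch of univariate selection and PCA with scikit-learn, using the bundled iris data as a stand-in dataset:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Univariate selection: keep the 2 features with the highest ANOVA F-scores
X_best = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# PCA: compress the 4 original features into 2 principal components
X_pca = PCA(n_components=2).fit_transform(X)
print(X_best.shape, X_pca.shape)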

The video emphasizes the significance of resampling methods for evaluating machine learning algorithms' performance on unseen data. It warns against evaluating algorithms on the same dataset used for training, as it can lead to overfitting and poor generalization to new data. Techniques such as train-test split sets, k-fold cross-validation, leave one out cross-validation, and repeated random test splits are explained as ways to obtain reliable estimates of algorithm performance. The video concludes with a discussion of various performance metrics for machine learning algorithms, such as classification accuracy, logarithmic loss, area under the curve, confusion matrix, and classification report.
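
As an illustration of these resampling ideas (the dataset and model choices are ours, not the course's):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Hold-out split: evaluate on data the model never saw during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

# 10-fold cross-validation: average accuracy over ten different train/test partitions
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=7))
print(scores.mean(), scores.std())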

The video delves into performance metrics used to evaluate the predictions made by machine learning models. It covers classification accuracy, log loss (for evaluating probabilities), area under the receiver operating characteristic (ROC) curve (for binary classification problems), confusion matrix (for evaluating model accuracy with multiple classes), and the classification report (which provides precision, recall, F1 score, and support for each class). Additionally, the video explains three common regression metrics: mean absolute error, mean squared error, and R-squared. Practical examples are demonstrated to illustrate how to calculate these metrics using Python.
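
A hedged sketch of computing several of these classification metrics with scikit-learn (the dataset and model are stand-ins):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, log_loss, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]

print(accuracy_score(y_test, pred))         # classification accuracy
print(log_loss(y_test, proba))              # log loss on predicted probabilities
print(roc_auc_score(y_test, proba))         # area under the ROC curve
print(confusion_matrix(y_test, pred))       # confusion matrix
print(classification_report(y_test, pred))  # precision, recall, F1, support per class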

The speaker introduces the concept of spot checking to determine which machine learning algorithms perform well for a specific problem. Spot checking involves evaluating multiple algorithms and comparing their performances. The video demonstrates spot checking for six different machine learning models, including both linear and non-linear algorithms, using Python with the scikit-learn library. The speaker emphasizes that results may vary due to the stochastic nature of the models. The section concludes with an introduction to regression machine learning models, preparing viewers for the upcoming section on spot checking those models.

Next, the speaker introduces linear and nonlinear machine learning models using the Boston house price dataset as an example. A test harness with 10-fold cross-validation is employed to demonstrate how to spot check each model, and mean squared error is used as a performance indicator (inverted due to a quirk in scikit-learn's cross_val_score function). The linear regression model, assuming a Gaussian distribution for input variables and their relevance to the output variable, is discussed. Ridge regression, a modification of linear regression that minimizes model complexity, is also explained. The speaker highlights the importance of understanding the pipeline or process rather than getting caught up in the specific code implementation at this stage.

The video explores the process of understanding and visualizing input variables for a machine learning problem. It suggests using univariate plots such as box and whisker plots and histograms to understand the distribution of input variables. For multivariate analysis, scatter plots can help identify structural relationships between input variables and reveal high correlations between specific attribute pairs. The video also discusses the evaluation process, using a test harness with 10-fold cross-validation to assess model performance. The importance of creating a validation dataset to independently evaluate the accuracy of the best model is emphasized. Six different machine learning models are evaluated, and the most accurate one is selected for making predictions. The classification report, confusion matrix, and accuracy estimation are used to evaluate the predictions. Finally, the video touches on regularization regression, highlighting the construction of Lasso and Elastic Net models to reduce the complexity of regression models.

The video introduces a binary classification problem in machine learning, aiming to predict metal from rock using the Sonar Mines versus Rocks dataset. The dataset contains 208 instances with 61 attributes, including the class attribute. Descriptive statistics are analyzed, indicating that although the data is in the same range, differing means suggest that standardizing the data might be beneficial. Unimodal and multimodal data visualizations, such as histograms, density plots, and correlation visualizations, are explored to gain insights into the data. A validation dataset is created, and a baseline for model performance is established by testing various models, including linear regression, logistic regression, linear discriminant analysis, classification and regression trees (CART), support vector machines (SVM), naive Bayes, and k-nearest neighbors (KNN). The accuracy of each algorithm is calculated using 10-fold cross-validation and compared.

In the following segment, the video discusses how to evaluate different machine learning algorithms using standardized data and tuning. Standardization involves transforming the data, so each attribute has a mean of 0 and a standard deviation of 1, which can improve the performance of certain models. To prevent data leakage during the transformation process, a pipeline that standardizes the data and builds the model for each fold in the cross-validation test harness is recommended. The video demonstrates tuning techniques for k-nearest neighbors (KNN) and support vector machines (SVM) using a grid search with 10-fold cross-validation on the standardized copy of the training dataset. The optimal configurations for KNN and SVM are identified, and the accuracy of the models is evaluated. Finally, the video briefly discusses KNN, decision tree regression, and SVM as nonlinear machine learning models.
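
A sketch of that pipeline-plus-grid-search pattern; the breast cancer dataset bundled with scikit-learn stands in here for the Sonar data:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

# The pipeline standardizes inside each cross-validation fold, avoiding data leakage
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])
grid = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]},
    cv=KFold(n_splits=10, shuffle=True, random_state=7),
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)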

  • 00:00:00 The course instructor introduces the concept of predictive modeling and its relevance to the industry. Unlike statistical modeling, which attempts to understand data, predictive modeling focuses on developing models that make more accurate predictions at the expense of explaining why predictions are made. The course focuses on tabular data, such as spreadsheets or databases. The instructor leads the students from being developers interested in machine learning in Python to having the resources and capabilities to work through new datasets end-to-end using Python and develop accurate predictive models. Students learn how to complete all subtasks of a predictive modeling problem with Python, integrate new and different techniques in Python and SciPy, learn Python, and get help with machine learning. The lecture also covers a crash course in Python highlighting key details about the language syntax, including assignment, flow control, data structures, and functions.

  • 00:05:00 The video covers the basic syntax of Python, including variables and assignments, and explains the difference between "equals" for assignment and "double equals" for equality. The lecturer then demonstrates how to use Jupyter Notebook for Python coding and explains some basic navigation tips, such as how to create a new notebook, use alias for libraries, execute cells, and move or copy cells. The lecturer also explains the auto-save feature and how to save the notebook manually. The video concludes with a brief explanation of how to stop the execution of the kernel.

  • 00:10:00 The instructor explains how to use the toolbar to navigate through the Python engine and how to annotate notebooks with markdown. Then, the video covers flow control statements such as if-then-else conditions, for loops, and while loops. Afterward, the instructor explains the three main data structures necessary for machine learning: tuples, lists, and dictionaries. Lastly, the video provides a crash course on NumPy, which includes creating arrays, accessing data, and using arrays in arithmetic.

  • 00:15:00 The crash course on Matplotlib and Pandas for machine learning is discussed. Matplotlib is a library that can be utilized for creating plots and charts. Pandas provides data structures and functionality for manipulating and analyzing data through its Series and DataFrame structures, which are important for loading CSV files, the most common format used in machine learning applications. Flexible functions like Pandas' read_csv can load data and return a data frame ready for summarizing and plotting, drawing insights and seeding ideas for preprocessing and handling data in machine learning tasks. Finally, gaining insights from data through descriptive statistics in Python is irreplaceable and helps in understanding the characteristics and nature of the data.

  • 00:20:00 Pandas DataFrame makes it easy to create histograms with the hist() function. Another way to visualize data is using density plots, which can be created with the plot() function. Density plots show the probability density function of the data and can provide insight into the shape of the distribution. Box and whisker plots are also useful in visualizing the distribution of data and identifying outliers. Overall, data visualization is a key step in understanding the characteristics of a dataset and can aid in selecting appropriate machine learning algorithms.

  • 00:25:00 The video explains various plots and diagrams that can be useful for data analysis before applying machine learning techniques. Histograms, density plots, and box plots are used to observe the distribution of attributes while correlation matrices and scatter plot matrices are used to identify the relationship between pairs of attributes. Rescaling, standardizing, normalizing, and binarizing data are also discussed as necessary preprocessing steps to prepare data for machine learning algorithms, and the video explains the fit and transform method as a common approach for data preprocessing.

  • 00:30:00 The video discusses techniques for preprocessing data in machine learning. First, normalization and standardization techniques are explained. Normalization involves rescaling attributes to have the same scale, and standardization involves changing attributes to have a mean of zero and a standard deviation of one. Binarization, or thresholding data, is another technique discussed, which can be useful for adding new binary attributes or turning probabilities into crisp values. Then, the importance of feature selection is explained, as irrelevant or partially irrelevant features can negatively impact model performance. Univariate selection is one statistical approach to feature selection, using the scores of statistical tests. Recursive feature elimination and feature importance methods that use ensembles of decision trees like random forest or extra trees can also be useful for feature selection. Finally, principal component analysis (PCA) is a data reduction technique using linear algebra to transform the dataset into a compressed form with a smaller number of dimensions.

  • 00:35:00 The importance of resampling methods to evaluate the performance of machine learning algorithms on unseen data is explained. It is highlighted that evaluating the algorithm on the same data set used for training can lead to overfitting, resulting in perfect score on the training data set but poor predictions on new data. Techniques such as train-test split sets, k-fold cross-validation, leave one out cross-validation, and repeated random test splits are then presented as ways to create useful estimates of algorithm performance. The section ends with a discussion of various machine learning algorithm performance metrics, such as classification accuracy, logarithmic loss, area under the curve, confusion matrix, and classification report.

  • 00:40:00 The video covers several performance metrics for evaluating the predictions made by a machine learning model. These include classification accuracy, log loss for evaluating probabilities, area under the ROC curve for binary classification problems, the confusion matrix for evaluating model accuracy with two or more classes, and the classification report for evaluating the precision, recall, F1 score, and support for each class. Additionally, the video covers three common regression metrics: mean absolute error, mean squared error, and R-squared, and demonstrates examples of how to calculate these metrics using Python.

  • 00:45:00 The speaker explains the concept of using spot checking to discover which machine learning algorithms perform well for a given problem. He demonstrates spot checking for six different machine learning models, including linear and non-linear algorithms, using python with the scikit-learn library. He also emphasizes that the results may vary due to the stochastic nature of the model. Lastly, the speaker introduces regression machine learning models and prepares the viewers for the upcoming section on how to spot check those models.

  • 00:50:00 The speaker introduces linear and nonlinear machine learning models using the Boston house price dataset. A test harness with 10-fold cross-validation is used to demonstrate how to spot check each model, and mean squared error numbers are used to indicate performance (inverted due to a quirk in the cross_val_score function). The linear regression model assumes a Gaussian distribution for the input variables and that they are relevant to the output variable and not highly correlated with each other. Ridge regression, a modification of linear regression, minimizes the complexity of the model measured by the sum squared value of the coefficients, or L2 norm. The speaker emphasizes understanding the pipeline or process and not getting caught up in understanding the code at this point.

  • 00:55:00 The video discusses the process of understanding and visualizing the input variables for a machine learning problem. The video suggests using univariate plots like box and whisker plots and histograms to understand the distribution of the input variables. For multivariate plots, scatter plots can help spot structural relationships between the input variables and identify high correlation between certain pairs of attributes. The video then goes on to discuss the process of evaluating models through a test harness using 10-fold cross-validation, where the data set is split into 10 parts and trained and tested on different splits. The video highlights the importance of creating a validation data set to get a second and independent idea of how accurate the best model might be. The video evaluates six different machine learning models and selects the most accurate one for making predictions, evaluating the predictions through the classification report, confusion matrix, and accuracy estimation. The section ends with a discussion on regularization regression and constructing Lasso and Elastic Net models to minimize the complexity of the regression model.

  • 01:00:00 We are introduced to a binary classification problem in machine learning where the goal is to predict metal from rock using the sonar mines versus rocks dataset. The dataset contains 208 instances with 61 attributes including the class attribute. After analyzing the data and looking at descriptive statistics, we see that the data is in the same range but differing means indicate that standardizing the data may be beneficial. We also take a look at unimodal and multimodal data visualizations including histograms, density plots, and visualizations of correlation between attributes. We then prepare a validation data set and create a baseline for performance of different models, including linear regression, logistic regression, linear discriminant analysis, classification regression trees, SVMs, naive Bayes, and k-nearest neighbors. We compare the accuracy of each algorithm calculated through 10-fold cross-validation.

  • 01:05:00 The video discusses how to evaluate different machine learning algorithms using standardized data and tuning. Standardization transforms data so that each attribute has a mean of 0 and a standard deviation of 1, which can improve the skill of some models. To avoid data leakage during the transformation process, a pipeline that standardizes the data and builds the model for each fold in the cross-validation test harness is recommended. The video demonstrates how to tune k-nearest neighbors (KNN) and support vector machines (SVM) using a grid search with 10-fold cross-validation on the standardized copy of the training data set. The optimal configurations for KNN and SVM are identified, and the accuracy of the models is evaluated. Finally, the video briefly discusses KNN, decision tree regression, and SVM as nonlinear machine learning models.
A Fast Track Introduction to Python for Machine Learning Engineers
  • 2022.03.23
  • www.youtube.com
Complete Course on Machine Learning with Python
 

Applied Statistics for Machine Learning Engineers

The instructor in the video introduces the field of statistics and highlights its significance in working with predictive modeling problems in machine learning. They explain that statistics offers a range of techniques, starting from simple summary statistics to hypothesis tests and estimation statistics. The course is designed to provide a step-by-step foundation in statistical methods, with practical examples in Python. It covers six core aspects of statistics for machine learning and focuses on real-world applications, making it suitable for machine learning engineers.

The instructor emphasizes the close relationship between machine learning and statistics and suggests that programmers can benefit from improving their statistical skills through this course. They classify the field of statistics into two categories: descriptive statistics and inferential statistics. Descriptive statistics involve summarizing and describing data using measurements such as averages and graphical representations. Inferential statistics, on the other hand, are used to make inferences about a larger population based on sample data.

The importance of proper data treatment is also highlighted, including addressing data loss, corruption, and errors. The video then delves into the various steps involved in data preparation for machine learning models. This includes data cleansing, data selection, data sampling, and data transformation using statistical methods such as standardization and normalization. Data evaluation is also emphasized, and the video discusses experimental design, resampling data, and model selection to estimate the skill of a model. For predicting new data, the video recommends using estimation statistics.

The video explains the different measurement scales used in statistics, namely nominal, ordinal, interval, and ratio scales. It discusses the statistical techniques applicable to each scale and how they can be implemented in machine learning. The importance of understanding and reporting uncertainty in modeling is emphasized, especially when working with sample sets. The video then focuses on the normal distribution, which is commonly observed in various datasets. It demonstrates how to generate sample data and visually evaluate its fit to a Gaussian distribution using a histogram. While most datasets do not have a perfect Gaussian distribution, they often exhibit Gaussian-like properties.

The importance of selecting a granular way of splitting data to expose the underlying Gaussian distribution is highlighted. Measures of central tendency, such as the mean and median, are explored, along with the variance and standard deviation as measures of the distribution's spread. Randomness is discussed as an essential tool in machine learning, helping algorithms become more robust and accurate. Various sources of randomness, including data errors and noise, are explained.

The video explains that machine learning algorithms often leverage randomness to achieve better performance and generate more optimal models. Randomness enables algorithms to explore different possibilities and find better mappings of data. Controllable and uncontrollable sources of randomness are discussed, and the use of the seed function to make randomness consistent within a model is explained. The video provides an example using the Python random module for generating random numbers and highlights the difference between the numpy library's pseudorandom number generator and the standard library's pseudorandom number generator. Two cases for when to seed the random number generator are also discussed, namely during data preparation and data splits.
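
For example, seeding the two generators separately (the seed value is arbitrary):

import random

import numpy as np

# The standard library and NumPy keep separate pseudorandom generators,
# so each must be seeded on its own for reproducible results.
random.seed(1)
np.random.seed(1)

print(random.random())    # same value on every run with this seed
print(np.random.rand(3))  # same three values on every run with this seed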

Consistently splitting the data and using pseudorandom number generators when evaluating an algorithm are emphasized. The video recommends evaluating the model in a way that incorporates measured uncertainty and the algorithm's performance. Evaluating an algorithm on multiple splits of the data provides insight into how its performance varies with different training and testing data. Evaluating an algorithm multiple times on the same data splits helps understand how its performance varies on its own. The video also introduces the law of large numbers and the central limit theorem, highlighting that having more data improves the model's performance and that as the sample size increases, the distribution of the mean approaches a Gaussian distribution.

The video demonstrates the central limit theorem using dice rolls and code, showing how sample means approximate a Gaussian distribution as the sample size increases.
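
A small simulation in the same spirit (the sample sizes and counts are arbitrary):

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
# 1,000 samples, each the mean of 50 dice rolls; the histogram looks roughly Gaussian
means = [rng.integers(1, 7, size=50).mean() for _ in range(1000)]
plt.hist(means, bins=20)
plt.xlabel("mean of 50 dice rolls")
plt.show()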

The video emphasizes the importance of evaluating machine learning models and understanding the uncertainty involved in their predictions. It introduces evaluation metrics such as accuracy, precision, recall, and F1 score, which are commonly used to assess the performance of classification models. The video explains that accuracy measures the overall correctness of predictions, precision measures the proportion of true positive predictions out of all positive predictions, recall measures the proportion of true positive predictions out of all actual positive instances, and the F1 score combines precision and recall into a single metric. It also discusses the concept of a confusion matrix, which provides a more detailed view of the performance of a classification model by showing the number of true positive, true negative, false positive, and false negative predictions.

The speaker demonstrates how to calculate these evaluation metrics using Python's scikit-learn library. It shows how to import the necessary modules, split the data into training and testing sets, train a classification model, make predictions on the test set, and evaluate the model's performance using accuracy, precision, recall, and F1 score. The video highlights the importance of evaluating models on unseen data to ensure their generalization capabilities.

Furthermore, the video introduces the concept of receiver operating characteristic (ROC) curves and area under the curve (AUC) as evaluation metrics for binary classification models. ROC curves plot the true positive rate against the false positive rate at various classification thresholds, providing a visual representation of the model's performance across different threshold values. The AUC represents the area under the ROC curve and provides a single metric to compare the performance of different models. The video explains how to plot an ROC curve and calculate the AUC using Python's scikit-learn library.

The concept of overfitting is discussed as a common problem in machine learning, where a model performs well on the training data but fails to generalize to new, unseen data. The video explains that overfitting occurs when a model becomes too complex and learns patterns specific to the training data that do not hold in the general population. The video demonstrates how overfitting can be visualized by comparing the training and testing performance of a model. It explains that an overfit model will have low training error but high testing error, indicating poor generalization. The video suggests regularization techniques such as ridge regression and Lasso regression as ways to mitigate overfitting by adding a penalty term to the model's objective function.
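
A toy illustration of how the penalty term can help on noisy, high-dimensional data (synthetic data, arbitrary alpha values):

import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))            # many noisy features invite overfitting
y = X[:, 0] * 3.0 + rng.normal(size=100)  # only the first feature actually matters

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    score = cross_val_score(model, X, y, cv=5).mean()  # R-squared averaged over 5 folds
    print(type(model).__name__, round(score, 3))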

The concept of cross-validation is introduced as a technique to assess the performance and generalization of machine learning models. The video explains that cross-validation involves splitting the data into multiple subsets, training the model on a portion of the data, and evaluating its performance on the remaining portion. This process is repeated multiple times, with different subsets used for training and testing, and the results are averaged to provide an estimate of the model's performance. The video demonstrates how to perform cross-validation using Python's scikit-learn library, specifically the K-fold cross-validation method.

Next, the video discusses the concept of feature selection and importance in machine learning. It explains that feature selection involves identifying the most relevant features or variables that contribute to the model's performance. The video highlights the importance of selecting informative features to improve the model's accuracy, reduce overfitting, and enhance interpretability. It introduces different feature selection techniques, such as univariate selection, recursive feature elimination, and feature importance scores. The video demonstrates how to implement feature selection using Python's scikit-learn library.

The concept of dimensionality reduction is also discussed as a technique to address the curse of dimensionality in machine learning. The video explains that dimensionality reduction involves reducing the number of features or variables in a dataset while preserving most of the relevant information. It introduces principal component analysis (PCA) as a commonly used dimensionality reduction technique. PCA aims to transform the data into a lower-dimensional space by identifying the directions of maximum variance in the data. The video explains that PCA creates new features, called principal components, which are linear combinations of the original features. These principal components capture the most important information in the data and can be used as input for machine learning models.

The video demonstrates how to perform PCA using Python's scikit-learn library. It shows how to import the necessary modules, standardize the data, initialize a PCA object, fit the PCA model to the data, and transform the data into the lower-dimensional space. The video also explains how to determine the optimal number of principal components to retain based on the explained variance ratio.
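
A brief sketch of those steps (the component count is arbitrary):

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)  # standardize before PCA

pca = PCA(n_components=5).fit(X_std)
print(pca.explained_variance_ratio_)       # share of variance captured by each component
X_reduced = pca.transform(X_std)           # project onto the 5 principal components
print(X_reduced.shape)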

The concept of ensemble learning is introduced as a technique to improve the performance of machine learning models by combining multiple individual models. The video explains that ensemble learning leverages the wisdom of crowds, where each individual model contributes its own predictions, and the final prediction is determined based on a voting or averaging mechanism. The video discusses two popular ensemble learning methods: bagging and boosting. Bagging involves training multiple models on different subsets of the data and aggregating their predictions, while boosting focuses on training models sequentially, with each model giving more importance to instances that were misclassified by previous models.

The video demonstrates how to implement ensemble learning using Python's scikit-learn library. It shows how to import the necessary modules for bagging and boosting, initialize the ensemble models, fit them to the data, and make predictions using the ensemble models. The video emphasizes that ensemble learning can often improve the overall performance and robustness of machine learning models.
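
A short sketch comparing a bagging and a boosting ensemble with scikit-learn (the dataset and estimator counts are stand-ins):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: decision trees fit on bootstrap samples, predictions combined by voting
bagging = BaggingClassifier(n_estimators=100, random_state=7)
# Boosting: estimators fit sequentially, each weighting previously misclassified samples more
boosting = AdaBoostClassifier(n_estimators=100, random_state=7)

for name, model in (("bagging", bagging), ("boosting", boosting)):
    print(name, cross_val_score(model, X, y, cv=5).mean())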

Finally, the video briefly touches on advanced topics in machine learning, such as deep learning and natural language processing (NLP). It mentions that deep learning involves training deep neural networks with multiple layers to learn complex patterns in data. NLP focuses on developing models and techniques to understand and process human language, enabling applications such as text classification, sentiment analysis, and machine translation. The video concludes by highlighting that machine learning is a vast and rapidly evolving field with numerous applications and opportunities for further exploration and learning.

The video provides a comprehensive overview of essential concepts and techniques in machine learning, including model evaluation, overfitting, regularization, cross-validation, feature selection, dimensionality reduction, ensemble learning, and an introduction to deep learning and NLP. It demonstrates practical implementations using Python and the scikit-learn library, making it a valuable resource for beginners and those looking to enhance their understanding of machine learning.

  • 00:00:00 The instructor introduces the field of statistics and its significance in working through predictive modeling problems with machine learning. He explains the range of statistical techniques available from simple summary statistics to hypothesis tests and estimation statistics. The course is designed to provide a step-by-step basis for statistical methods with executable examples in Python, covering six core aspects of statistics for machine learning. The instructor also emphasizes that the course is application-focused and provides real-world usage examples, making it suitable for machine learning engineers.

  • 00:05:00 It is emphasized that machine learning is closely related to statistics, and the course is presented as a good option for programmers who wish to improve their statistical skills. The field of statistics is divided into two categories: descriptive statistics and inferential statistics. Descriptive statistics is used to describe data using measurements such as averages and graphical representations, while inferential statistics is used to make inferences from a data sample to a larger population. Lastly, the importance of the treatment of data, including data loss, corruption, and errors, is emphasized.

  • 00:10:00 The video discusses the various steps involved in data preparation for machine learning models. This includes data cleansing, data selection, data sampling, and data transformation using statistical methods such as standardization and normalization. Data evaluation is also important, and experimental design, including resampling data and model selection, must be performed to estimate a model's skill. For predicting new data, the video recommends the estimation statistics approach. Additionally, the video explains different measurement scales used in statistics and the characteristics of the normal distribution.

  • 00:15:00 The speaker explains the different scales of measurement: nominal, ordinal, interval, and ratio. They go on to discuss statistics that are applicable to each scale and how they can be implemented in machine learning. Given that we almost always work with sample sets, the author emphasizes that we need to understand and report uncertainty involved in modeling. The discussion then moves to a sample normal distribution which is very common in various data sets. Finally, they demonstrate how we can generate sample data and apply it to a histogram to see if it fits the Gaussian distribution. The author explains that while most data sets will not have a perfect Gaussian distribution, they will have Gaussian-like properties.

  • 00:20:00 The importance of selecting a more granular way of splitting data to better expose the underlying Gaussian distribution is highlighted, and measures of central tendency such as the mean and median are explored, with the variance and standard deviation also discussed as a measure of the spread of the distribution. Randomness is an essential tool in machine learning and is used to help algorithms be more robust and accurate. Various sources of randomness, such as errors in the data and noise that can obscure relationships, are explained.

  • 00:25:00 It is explained that machine learning algorithms often use randomness to achieve a better performing mapping of data. Randomness allows algorithms to generate a more optimal model. This section discusses sources of randomness, both controllable and uncontrollable, and how the seed function can make the randomness in a model reproducible. An example is given using the Python random module for generating random numbers, and the NumPy library for working efficiently with vectors and matrices of numbers. The NumPy pseudorandom number generator is separate from the Python standard library generator and must be seeded independently. Finally, two cases for when to seed the random number generator are discussed: data preparation and data splits.

  • 00:30:00 The importance of consistently splitting the data and the use of pseudorandom number generators when evaluating an algorithm is discussed. It is recommended to evaluate the model in such a way that the reported performance reflects both the algorithm's skill and the measured uncertainty around it. Evaluating an algorithm on multiple splits of the data gives insight into how its performance varies with changes to the training and testing data, while evaluating an algorithm multiple times on the same splits of data gives insight into how its performance varies due to the stochastic nature of the algorithm itself. The law of large numbers and the central limit theorem are also discussed, highlighting that the more data we have, the better it is for our model's performance, and that as the size of a sample increases, the distribution of the mean will approximate a Gaussian distribution.

  • 00:35:00 The central limit theorem is demonstrated using dice rolls and code (a sketch in this spirit appears after this list). The demonstration shows that as the sample size increases, the distribution of the sample means approaches a Gaussian distribution. Data interpretation is crucial in statistics to discover meaning. Statistical hypothesis tests, or significance tests, are used in machine learning to make claims about data distributions or to compare two samples. The null hypothesis (H0) is the default assumption that nothing has changed, and a statistical hypothesis test may return either a p-value or a critical value. The p-value is used to interpret the result of a hypothesis test and to either reject or fail to reject the null hypothesis, while the critical value is compared against the test statistic under its sampling distribution to determine whether there is enough evidence to reject the null hypothesis.

  • 00:40:00 The concept of rejecting the null hypothesis is clarified, with the null hypothesis stating that there is no statistically significant difference. If the result of a statistical test rejects the null hypothesis, it means that something is statistically significant. The Gaussian distribution, which describes the grouping or density of observations and is often referred to as the normal distribution, is also discussed. The distribution is a mathematical function that describes the relationship of observations to a sample space. Density functions, including probability density functions and cumulative distribution functions, are used to describe the likelihood of observations in a distribution. Finally, the importance of checking whether a sample of data is random is emphasized and the characteristics of normally distributed samples are given.

  • 00:45:00 The speaker discussed the normal (Gaussian) distribution, its properties including the probability density function (pdf) and cumulative distribution function (cdf), and the 68–95–99.7 rule associated with standard deviations. The speaker also introduced the t-distribution, which is similar to the normal distribution but is used for small samples. Subsequently, the speaker introduced the chi-squared distribution, its use for goodness-of-fit testing, and its relation to the t-distribution. Lastly, the speaker demonstrated the use of the chi2 functions in scipy.stats for calculating statistics of a chi-squared distribution.

  • 00:50:00 The concept of critical values in statistical hypothesis testing is explained. A critical value is a threshold used to determine whether a null hypothesis is accepted or rejected. It assumes a normal or Gaussian distribution and has an acceptance region and a rejection region. The line separating those regions is the critical value. One-tailed tests determine if the mean is greater or less than another mean but not both, while two-tailed tests determine if the two means are different from one another. The critical value allows for the quantification of the uncertainty of estimated statistics or intervals, such as confidence and tolerance intervals.

  • 00:55:00 The use of two-tailed tests is discussed, which take into account both the positive and negative effects of a product. The example given is that of a generic drug against a name brand product, where a two-tailed test can determine if the generic product is equivalent or worse than the name brand product. The use of percent point functions, or quantile functions, is also explained and demonstrated with examples using three commonly used distributions: the standard Gaussian distribution, the standard Student's t distribution, and the standard chi-squared distribution. Finally, the concept of correlation and its importance in determining the relationship between two variables is discussed, as well as the potential issue of multicollinearity and how it can affect algorithm performance.

  • 01:00:00 The video walks through a quick demo showing a strong positive correlation between two variables, using a contrived dataset in which each variable is drawn from a Gaussian distribution and the two are linearly related (a sketch in this spirit appears after this list). The demo calculates and prints the covariance matrix; a positive covariance between the two variables suggests they change in the same direction. However, covariance alone is hard to interpret, which motivates Pearson's correlation coefficient. The video explains that Pearson's r summarizes the strength of the linear relationship between two data samples by dividing the covariance of the two variables by the product of their standard deviations, and that correlation coefficients can be computed pairwise to evaluate relationships among more than two variables.

  • 01:05:00 The video discusses the use of parametric statistical significance tests, which assume that the data was drawn from a Gaussian distribution with the same mean and standard deviation. A test dataset is defined and used to demonstrate the Student's t-test for independent and paired samples, as well as the analysis of variance test. The video shows how these tests can be implemented in Python using the appropriate Scipy functions. The examples illustrate how these tests can be used to determine if different data samples have the same distribution.

  • 01:10:00 The concept of effect size in statistics is discussed as a way of quantifying the magnitude of differences between groups or associations between variables, which can complement the results obtained from statistical hypothesis tests. The effect size methods are divided into association and difference, and can be standardized, original unit, or unit-free, depending on the purpose of interpretation and the statistical measure used. Pearson's correlation coefficient is a commonly used standardized measure for determining linear associations, which is unit-free and can be calculated using the pearsonr() function from scipy.stats. Statistical power, which is influenced by effect size, sample size, significance, and power level, is also explained as a crucial factor in experimental design, and can be estimated through power analysis to determine the minimum sample size needed for an experiment.

  • 01:15:00 The video discusses the importance of data sampling and data resampling in predictive modeling, explaining that sampling involves selecting a subset of a population, whereas resampling involves estimating population parameters multiple times from a data sample to improve accuracy and quantify uncertainty. The video describes common methods of sampling, categorized as either probability sampling or non-probability sampling, and highlights three types of sampling that machine learning engineers are likely to encounter: simple random sampling, systematic sampling, and stratified sampling. Additionally, the video warns of the potential errors that can be introduced in the sampling process and emphasizes the need for statistical inference and care when drawing conclusions about a population. The video goes on to explain the commonly used sampling methods in machine learning, namely k-fold cross-validation and bootstrap, with the latter being computationally expensive, though providing robust estimates of the population.

  • 01:20:00 The bootstrap method is discussed as a tool for estimating quantities about a population by averaging estimates from multiple small data samples. The samples are constructed by drawing observations from a dataset one at a time and returning them to the original sample after being picked; this approach is called sampling with replacement. The resample function, provided in scikit-learn, can be used to create a single bootstrap sample, and although it offers no mechanism to directly gather the out-of-bag observations that could be used for evaluating fit models, these can still be collected with a Python list comprehension (a sketch appears after this list). Additionally, the process of k-fold cross-validation is explained as a resampling procedure used for evaluating machine learning models on limited data. The KFold class in scikit-learn can be used for this procedure, and four commonly used variations of k-fold cross-validation are mentioned.

  • 01:25:00 The speaker discusses two approaches to resampling in machine learning: k-fold cross-validation and train-test split. While k-fold cross-validation is the gold standard, train-test split may be easier to understand and implement. The speaker demonstrates how to use the train-test split approach in Python and also mentions the use of estimation statistics, which aim to quantify the size and uncertainty of a finding and are becoming more popular in research literature. The three main classes of estimation statistics include effect size, interval estimation, and meta-analysis. The shift to estimation statistics is occurring because they are easier to analyze and interpret in the context of research questions.

  • 01:30:00 The different methods for measuring effect size and interval estimation are discussed. Effect size can be measured through association, which is the degree to which samples change together, or difference, which is the degree to which samples differ. Meanwhile, interval estimation allows quantification of uncertainty in observations, and can be done through tolerance intervals, confidence intervals, or prediction intervals. Tolerance intervals specify the upper and lower bounds within which a certain percentage of the process output falls, while confidence intervals provide bounds for the estimates of a population parameter. Finally, a demonstration is given on how to calculate tolerance intervals for a sample of observations drawn from a Gaussian distribution.

  • 01:35:00 The concept of a confidence interval is discussed, which is the interval statistic used to quantify the uncertainty of an estimate. A confidence interval provides bounds on a population parameter, such as a mean, standard deviation, or similar. The value of a confidence interval lies in its ability to quantify the uncertainty of an estimate: a narrower confidence interval indicates a more precise estimate, while a wider one indicates a less precise estimate. Further, the concept of classification accuracy or error is discussed, which is a proportion or ratio describing the fraction of correct or incorrect predictions made by the model. The classification error or accuracy can be used to easily calculate a confidence interval by assuming a Gaussian distribution of the proportion.

  • 01:40:00 The concept of confidence intervals is explained, from calculating a confidence interval for the classification error of a model to using bootstrap resampling as a non-parametric method for estimating confidence intervals. The bootstrap resampling method involves drawing samples with replacement from a fixed finite data set to estimate population parameters. Additionally, the concept of prediction intervals is introduced as an estimate of the interval in which future observations will fall with a certain confidence level, which is useful in making predictions or forecasts with regression models.

  • 01:45:00 The concept of prediction intervals is explained as an estimate of the range in which a future observation will fall with a certain confidence level, given previously observed data. It differs from a confidence interval, which quantifies uncertainty related to a population parameter. Prediction intervals are typically used in prediction or forecasting models. The video presents a simple example of linear regression on a two-variable dataset, in which the relationship between the variables is visible from a scatter plot. The linear regression model is then used to make a single prediction with a 95% prediction interval, which is compared to the known expected value. The difference between prediction intervals and confidence intervals is emphasized, as well as the fact that prediction intervals are usually wider than confidence intervals because they also account for the error of individual observations.
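
As a rough sketch in the spirit of the dice-roll demonstration mentioned at 00:35:00 (the exact numbers and plotting choices are assumptions, not taken from the video):

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)
# each sample mean is the average of 50 simulated six-sided die rolls
means = [np.random.randint(1, 7, 50).mean() for _ in range(1000)]

print(np.mean(means))      # close to the true die mean of 3.5

plt.hist(means, bins=30)   # the histogram of sample means looks increasingly Gaussian
plt.show()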
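
The correlation demo described at 01:00:00 can be sketched roughly as follows; the constants used to generate the contrived Gaussian samples are illustrative assumptions.

import numpy as np
from scipy.stats import pearsonr

np.random.seed(1)
data1 = 20 * np.random.randn(1000) + 100          # Gaussian sample
data2 = data1 + 10 * np.random.randn(1000) + 50   # linearly related to data1, plus noise

print(np.cov(data1, data2))         # positive off-diagonal entries: the variables move together
corr, p = pearsonr(data1, data2)    # covariance normalized by the product of standard deviations
print(corr, p)                      # a value near +0.9 indicates a strong positive relationship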
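
The bootstrap procedure from 01:20:00 might look roughly like this, assuming scikit-learn's resample utility; the tiny dataset, the sample size, and the percentile-based confidence interval at the end are illustrative additions rather than the video's exact code.

import numpy as np
from sklearn.utils import resample

data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

# one bootstrap sample: draw with replacement
boot = resample(data, replace=True, n_samples=4, random_state=1)
oob = [x for x in data if x not in boot]   # out-of-bag observations, gathered by hand
print(boot, oob)

# repeating the draw many times gives a distribution for a statistic such as the mean,
# from which a percentile-based confidence interval can be read off
means = [np.mean(resample(data, replace=True)) for _ in range(1000)]
lower, upper = np.percentile(means, [2.5, 97.5])
print(lower, upper)    # approximate 95% bootstrap confidence interval for the mean
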
Applied Statistics for Machine Learning Engineers
  • 2022.03.24
  • www.youtube.com
Complete Course on Applied Statistics for Machine Learning Engineers. It's all the statistics you'll need to know for a career in machine learning.
 

Applied Linear Algebra for Machine Learning Engineers



Applied Linear Algebra for Machine Learning Engineers

The video emphasizes the importance of learning linear algebra for machine learning engineers, as it serves as a fundamental building block for understanding calculus and statistics, which are essential in machine learning. Having a deeper understanding of linear algebra provides practitioners with a better intuition of how machine learning methods work, enabling them to customize algorithms and develop new ones.

The course takes a top-down approach to teach the basics of linear algebra, using concrete examples and data structures to demonstrate operations on matrices and vectors. Linear algebra is described as the mathematics of matrices and vectors, providing a language for data manipulation and allowing the creation of new columns or arrays of numbers through operations on these data structures. Initially developed in the late 1800s to solve systems of linear equations, linear algebra has become a key prerequisite for understanding machine learning.

The speaker introduces the concept of numerical linear algebra, which involves the application of linear algebra in computers. This includes implementing linear algebra operations and addressing the challenges that arise when working with limited floating-point precision in digital computers. Numerical linear algebra plays a crucial role in machine learning, particularly in deep learning algorithms that heavily rely on graphical processing units (GPUs) to perform linear algebra computations efficiently. Various open-source numerical linear algebra libraries, with Fortran-based libraries as their foundation, are commonly used to calculate linear algebra operations, often in conjunction with programming languages like Python.

Linear algebra's significance in statistics is highlighted, particularly in multivariate statistical analysis, principal component analysis, and solving linear regression problems. The video also mentions the broad range of applications for linear algebra in fields such as signal processing, computer graphics, and even physics, with examples like Albert Einstein's theory of relativity utilizing tensors and tensor calculus, a type of linear algebra.

The video further explores the practical application of linear algebra in machine learning tasks. It introduces the concept of using linear algebra operations, such as cropping, scaling, and shearing, to manipulate images, demonstrating how the notation and operations of linear algebra can be employed in this context. Additionally, the video explains the popular encoding technique called one-hot encoding for categorical variables. The main data structure used in machine learning, N-dimensional arrays or N-D arrays, is introduced, with the NumPy library in Python discussed as a powerful tool for creating and manipulating these arrays. The video covers important functions, such as vstack (vertical stacking) and hstack (horizontal stacking), which enable the creation of new arrays from existing arrays.

Manipulating and accessing data in NumPy arrays, commonly used to represent machine learning data, is explained. The video demonstrates how to convert one-dimensional lists to arrays using the array function and create two-dimensional data arrays using lists of lists. It also covers indexing and slicing operations in NumPy arrays, including the use of the colon operator for slicing and negative indexing. The importance of slicing in specifying input and output variables in machine learning is highlighted.

Techniques for working with multi-dimensional datasets in machine learning are discussed in the video. It begins with one-dimensional slicing and progresses to two-dimensional slicing, along with separating data into input and output values for training and testing. Array reshaping is covered, explaining how to reshape one-dimensional arrays into two-dimensional arrays with one column and transform two-dimensional data into three-dimensional arrays for algorithms that require multiple samples of one or more time steps and features. The concept of array broadcasting is introduced, which allows arrays with different sizes to be used in arithmetic operations, enabling data sets with varying sizes to be processed effectively.

The video also touches on the limitations of array arithmetic in NumPy, specifically that arithmetic operations can only be performed on arrays with the same number of dimensions and dimensions of the same size. However, this limitation is overcome by NumPy's built-in broadcasting feature, which replicates the smaller array along the last mismatched dimension, enabling arithmetic between arrays with different shapes and sizes. The video provides three examples of broadcasting: a scalar with a one-dimensional array, a scalar with a two-dimensional array, and a one-dimensional array with a two-dimensional array. It is noted that broadcasting follows a strict rule: arithmetic can only be performed when the shape of each dimension in the arrays is equal, or one of them has a dimension size of one.
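
A small sketch of those broadcasting cases (illustrative arrays, not the video's exact examples):

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3)

# scalar with a two-dimensional array: the scalar is stretched across every element
print(A + 5)

# one-dimensional array with a two-dimensional array:
# b has shape (3,), matching A's last dimension, so it is replicated across the rows
b = np.array([10, 20, 30])
print(A + b)

# shapes that violate the rule (neither equal nor one) raise an error
c = np.array([1, 2])               # shape (2,) does not match A's last dimension of 3
# A + c                            # uncommenting this line would raise a ValueError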

Moving on, the speaker introduces the concept of vectors, which are tuples of one or more values called scalars. Vectors are often represented using lowercase characters such as "v" and can be seen as points or coordinates in an n-dimensional space, where "n" represents the number of dimensions. The creation of vectors as NumPy arrays in Python is explained. The video also covers vector arithmetic operations, such as vector addition and subtraction, which are performed element-wise for vectors of equal length, resulting in a new vector of the same length. Furthermore, the speaker explains how vectors can be multiplied by scalars to scale their magnitude, and demonstrates how to perform these operations using NumPy arrays in Python. The dot product of two vectors is also discussed, which yields a scalar and can be used to calculate the weighted sum of a vector.
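
A brief sketch of those vector operations with NumPy (the values are arbitrary):

import numpy as np

v = np.array([1, 2, 3])
w = np.array([4, 5, 6])

print(v + w)        # element-wise addition     -> [5 7 9]
print(v - w)        # element-wise subtraction  -> [-3 -3 -3]
print(0.5 * v)      # scalar multiplication scales the magnitude -> [0.5 1.  1.5]
print(v.dot(w))     # dot product: 1*4 + 2*5 + 3*6 = 32 (a scalar, i.e. a weighted sum)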

The focus then shifts to vector norms and their importance in machine learning. Vector norms refer to the size or length of a vector and are calculated using a measure that summarizes the distance of the vector from the origin of the vector space. It is emphasized that vector norms are always positive, except for a vector of all zero values. The video introduces four common vector norm calculations used in machine learning. It starts with the vector L1 norm, followed by the L2 norm (Euclidean norm), and the max norm. The section also defines matrices and explains how to manipulate them in Python. Matrix arithmetic, including matrix-matrix multiplication (dot product), matrix-vector multiplication, and scalar multiplication, is discussed. A matrix is described as a two-dimensional array of scalars with one or more columns and one or more rows, typically represented by uppercase letters such as "A".
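
The norm calculations can be sketched with NumPy's norm function (illustrative vector):

import numpy as np
from numpy.linalg import norm

v = np.array([1, -2, 3])

print(norm(v, 1))        # L1 norm: sum of absolute values -> 6.0
print(norm(v))           # L2 (Euclidean) norm: sqrt(1 + 4 + 9) -> ~3.742
print(norm(v, np.inf))   # max norm: largest absolute value -> 3.0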

Next, the concept of matrix operations for machine learning is introduced. This includes matrix multiplication, matrix division, and matrix scalar multiplication. Matrix multiplication, also known as the matrix dot product, requires the number of columns in the first matrix to be equal to the number of rows in the second matrix. The video mentions that the dot function in NumPy can be used to implement this operation. The concept of matrix transpose is also explained, where a new matrix is created by flipping the number of rows and columns of the original matrix. Finally, the process of matrix inversion is discussed, which involves finding another matrix that, when multiplied with the original matrix, results in an identity matrix.

Continuing from the discussion of matrix inversion, the video further explores this concept. Inverting a matrix is indicated by a negative 1 superscript next to the matrix. The video explains that, in practice, matrix inversion is carried out using a suite of efficient numerical methods rather than computed by hand. The trace operation of a square matrix is introduced, which calculates the sum of the diagonal elements and can be computed using the trace function in NumPy. The determinant of a square matrix is defined as a scalar representation of the volume of the matrix and can be calculated using the det function in NumPy. The rank of a matrix is briefly mentioned, which estimates the number of linearly independent rows or columns in the matrix and is commonly computed using singular value decomposition. Lastly, the concept of sparse matrices is explained, highlighting that they predominantly contain zero values and can be computationally expensive to represent and work with.
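
A compact sketch of these matrix operations in NumPy (the 2×2 matrices are arbitrary examples):

import numpy as np
from numpy.linalg import inv, det, matrix_rank

A = np.array([[1., 2.],
              [3., 4.]])
B = np.array([[5., 6.],
              [7., 8.]])

print(A.dot(B))           # matrix multiplication (columns of A must equal rows of B)
print(A.T)                # transpose: rows and columns flipped
print(inv(A))             # inverse: A.dot(inv(A)) is (numerically) the identity matrix
print(np.trace(A))        # trace: sum of the diagonal elements -> 5.0
print(det(A))             # determinant -> -2.0
print(matrix_rank(A))     # rank: number of linearly independent rows/columns -> 2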

The video then delves into sparse matrices, which are matrices primarily composed of zero values and differ from dense matrices that mostly have non-zero values. Sparsity is quantified by calculating the sparsity score, which is the number of zero values divided by the total number of elements in the matrix. The video emphasizes two main problems associated with sparsity: space complexity and time complexity. It is noted that representing and working with sparse matrices can be computationally expensive.

To address these challenges, the video mentions that Scipy provides tools for creating and manipulating sparse matrices. Additionally, it highlights that many linear algebra functions in NumPy and Scipy can operate on sparse matrices, enabling efficient computations and operations on sparse data.

Sparse matrices are commonly used in applied machine learning for data observations and data preparation. Their sparsity allows for more efficient storage and processing of large datasets with a significant number of zero values. By leveraging the sparsity structure, machine learning algorithms can benefit from reduced memory usage and faster computations.
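
A minimal sketch of creating a sparse matrix and computing a sparsity score, assuming SciPy's csr_matrix (the example matrix is arbitrary):

import numpy as np
from scipy.sparse import csr_matrix

A = np.array([[1, 0, 0, 1, 0, 0],
              [0, 0, 2, 0, 0, 1],
              [0, 0, 0, 2, 0, 0]])

# sparsity score: fraction of elements that are zero
sparsity = 1.0 - np.count_nonzero(A) / A.size
print(sparsity)                      # -> ~0.72

S = csr_matrix(A)                    # compressed sparse row representation
print(S)                             # only the non-zero entries are stored
print(S.todense())                   # convert back to a dense array when needed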

Moving on, the video discusses different types of matrices commonly used in linear algebra, particularly those relevant to machine learning. Square matrices are introduced, where the number of rows equals the number of columns. Rectangular matrices, which have different numbers of rows and columns, are also mentioned. The video explains the main diagonal of a square matrix, which consists of elements with the same row and column indices. The order of a square matrix, defined as the number of rows or columns, is also covered.

Furthermore, the video introduces symmetric matrices, which are square matrices that are equal to their transpose. Triangular matrices, including upper and lower triangular matrices, are explained. Diagonal matrices, where all the non-diagonal elements are zero, are discussed as well. Identity matrices, which are square matrices with ones on the main diagonal and zeros elsewhere, are explained in the context of their role as multiplicative identities. Orthogonal matrices, whose rows and columns are orthonormal vectors (so any two distinct columns have a dot product of zero), are also introduced.

The video proceeds by discussing orthogonal matrices and tensors. An orthogonal matrix is a specific type of square matrix where the columns and rows are orthogonal unit vectors. These matrices are computationally efficient and stable for calculating their inverse, making them useful in various applications, including deep learning models. The video further mentions that in TensorFlow, tensors are a fundamental data structure and a generalization of vectors and matrices. Tensors are represented as multi-dimensional arrays and can be manipulated in Python using n-dimensional arrays, similar to matrices. The video highlights that element-wise tensor operations, such as addition and subtraction, can be performed on tensors, matrices, and vectors, providing an intuition for higher dimensions.

Next, the video introduces matrix decomposition, which is a method to break down a matrix into its constituent parts. Matrix decomposition simplifies complex matrix operations and enables efficient computations. Two widely used matrix decomposition techniques are covered: LU (Lower-Upper) decomposition for square matrices and QR (QR-factorization) decomposition for rectangular matrices.

The LU decomposition can simplify linear equations in the context of linear regression problems and facilitate calculations such as determinant and inverse of a matrix. The QR decomposition has applications in solving systems of linear equations. Both decomposition methods can be implemented using built-in functions in the NumPy package in Python, providing efficient and reliable solutions for various linear algebra problems.

Additionally, the video discusses the Cholesky decomposition, which is specifically used for symmetric and positive definite matrices. The Cholesky decomposition is represented by a lower triangular matrix, and it is considered nearly twice as efficient as the LU decomposition for decomposing symmetric matrices.
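
The three decompositions can be sketched as follows, using scipy.linalg.lu together with NumPy's qr and cholesky functions; the small symmetric positive definite matrix is an illustrative assumption.

import numpy as np
from scipy.linalg import lu
from numpy.linalg import qr, cholesky

A = np.array([[2., 1.],
              [1., 2.]])             # symmetric and positive definite

P, L, U = lu(A)                      # LU decomposition (with a permutation matrix P)
print(P.dot(L).dot(U))               # reconstructs A

Q, R = qr(A)                         # QR decomposition (works for rectangular matrices too)
print(Q.dot(R))                      # reconstructs A

C = cholesky(A)                      # Cholesky: A = C.dot(C.T) with C lower triangular
print(C.dot(C.T))                    # reconstructs A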

The video briefly mentions that matrix decomposition methods, including the eigendecomposition, are employed to simplify complex operations. The eigendecomposition decomposes a square matrix into its eigenvectors and eigenvalues. Eigenvectors are unit vectors that represent directions, while eigenvalues are the scalar coefficients applied along those directions. Both eigenvectors and eigenvalues have practical applications, such as dimensionality reduction and performing complex matrix operations.
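
A short sketch of the eigendecomposition with NumPy (the 3×3 matrix is a common illustrative example, not necessarily the one used in the video):

import numpy as np
from numpy.linalg import eig, inv

A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])

values, vectors = eig(A)             # eigenvalues and (column) eigenvectors

# confirm the defining property A·v = λ·v for the first pair
print(A.dot(vectors[:, 0]))
print(values[0] * vectors[:, 0])

# reconstruct the original matrix: A = Q · diag(λ) · Q⁻¹
Q = vectors
print(Q.dot(np.diag(values)).dot(inv(Q)))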

Lastly, the video touches upon the concept of singular value decomposition (SVD) and its applications in machine learning. SVD is used in various matrix operations and data reduction methods in machine learning. It plays a crucial role in calculations such as least squares linear regression, image compression, and denoising data.

The video explains that SVD allows a matrix to be decomposed into three separate matrices: U, Σ, and V. The U matrix contains the left singular vectors, Σ is a diagonal matrix containing the singular values, and V contains the right singular vectors. By reconstructing the original matrix from these components, one can obtain an approximation of the original data while reducing its dimensionality.
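
A minimal SVD sketch with NumPy, including the reconstruction step described above (the 3×2 matrix is an arbitrary example):

import numpy as np
from numpy.linalg import svd

A = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])             # a 3x2 rectangular matrix

U, s, VT = svd(A)                    # s holds the singular values; VT is V transposed

# rebuild the diagonal Sigma matrix with the right shape, then reconstruct A
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)
print(U.dot(Sigma).dot(VT))          # reconstructs the original matrix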

One of the main applications of SVD is dimensionality reduction. By selecting a subset of the most significant singular values and their corresponding singular vectors, it is possible to represent the data in a lower-dimensional space without losing crucial information. This technique is particularly useful in cases where the data has a high dimensionality, as it allows for more efficient storage and computation.

The video highlights that SVD has been successfully applied in natural language processing using a technique called latent semantic analysis (LSA) or latent semantic indexing (LSI). By representing text documents as matrices and performing SVD, LSA can capture the underlying semantic structure of the documents, enabling tasks such as document similarity and topic modeling.

Moreover, the video introduces the truncated SVD class, which directly implements the capability to reduce the dimensionality of a matrix. With the truncated SVD, it becomes possible to transform the original matrix into a lower-dimensional representation while preserving the most important information. This technique is particularly beneficial when dealing with large datasets, as it allows for more efficient processing and analysis.
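
A rough sketch of dimensionality reduction with scikit-learn's TruncatedSVD (the matrix and the choice of two components are illustrative assumptions):

import numpy as np
from sklearn.decomposition import TruncatedSVD

A = np.array([[1., 2., 3., 4., 5.],
              [6., 7., 8., 9., 10.],
              [11., 12., 13., 14., 15.]])

svd = TruncatedSVD(n_components=2)     # keep only the two largest singular values
A_reduced = svd.fit_transform(A)       # each row is now described by two features
print(A_reduced.shape)                 # (3, 2)
print(svd.explained_variance_ratio_)   # how much information the kept components retain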

In summary, the video has covered various topics related to linear algebra for machine learning. It has emphasized the importance of learning linear algebra as a fundamental building block for understanding calculus and statistics in the context of machine learning. The video has discussed the applications of linear algebra in machine learning, such as customization and development of algorithms, numerical linear algebra, statistical analysis, and various other fields like signal processing and computer graphics.

Furthermore, the video has explored key concepts in linear algebra, including vectors, matrices, matrix operations, vector norms, matrix decomposition techniques, and sparse matrices. It has explained how these concepts are used in machine learning and provided insights into their practical applications.

By understanding linear algebra, machine learning practitioners can gain a deeper intuition of the underlying mathematical foundations of machine learning algorithms and effectively apply them to real-world problems. Linear algebra serves as a powerful tool for data manipulation, dimensionality reduction, and optimization, enabling efficient and effective machine learning solutions.

  • 00:00:00 The importance of learning linear algebra for machine learning engineers is highlighted, as it is considered a building block for understanding calculus and statistics required in machine learning. A deeper understanding of linear algebra provides machine learning practitioners with a better intuition of how the methods work, allowing them to customize algorithms and devise new ones. The basics of linear algebra are taught in a top-down approach in this course, using concrete examples and data structures to demonstrate operations on matrices and vectors. Linear algebra is the mathematics of matrices and vectors, providing a language for data and allowing the creation of new columns or arrays of numbers using operations on these data structures. Linear algebra was developed in the late 1800s to solve unknown systems of linear equations and is now a key prerequisite for understanding machine learning.

  • 00:05:00 The speaker discusses numerical linear algebra, which is the application of linear algebra in computers. This includes implementing linear algebra operations as well as handling the potential issues that arise when working with limited floating point precision in digital computers. Numerical linear algebra is an essential tool in machine learning since many deep learning algorithms rely on graphical processing units' ability to compute linear algebra operations quickly. Several popular open source numerical linear algebra libraries are used to calculate linear algebra operations, with Fortran-based linear algebra libraries providing the basis for most modern implementations using programming languages such as Python. Linear algebra is also essential in statistics, especially in multivariate statistical analysis, principal component analysis, and solving linear regression problems. Additionally, the speaker discusses linear algebra's various applications in fields such as signal processing, computer graphics, and even physics, with Albert Einstein's theory of relativity using tensors and tensor calculus, a type of linear algebra.

  • 00:10:00 The concept of using linear algebra operations on images, such as cropping, scaling, and shearing, using the notation and operations of linear algebra is introduced. Additionally, the popular encoding technique for categorical variables called one-hot encoding is explained. Moreover, the main data structure used in machine learning, N-dimensional arrays or N-D arrays, and how to create and manipulate them using the NumPy library in Python is discussed. Finally, two of the most popular functions for creating new arrays from existing arrays, vstack (vertical stacking) and hstack (horizontal stacking), are explained.

  • 00:15:00 The speaker discusses how to manipulate and access data in NumPy arrays, which are typically used to represent machine learning data. A one-dimensional list can be converted to an array using the array function, and a two-dimensional data array can be created using a list of lists. Accessing data via indexing is similar to other programming languages, but NumPy arrays can also be sliced using the colon operator. Negative indexing is also possible, and slicing can be used to specify input and output variables in machine learning.

  • 00:20:00 The video covers techniques for working with multi-dimensional data sets common in machine learning. It starts with one-dimensional slicing and moves on to two-dimensional slicing and separating data into input and output values for training and testing. The video then covers array reshaping, including reshaping one-dimensional arrays into two-dimensional arrays with one column and reshaping two-dimensional data into a three-dimensional array for algorithms that expect multiple samples of one or more time steps and one or more features. Finally, the video covers array broadcasting, which allows for arrays with different sizes to be added, subtracted, or used in arithmetic, which is useful for data sets with varying sizes.

  • 00:25:00 The limitations of array arithmetic in NumPy are discussed, namely that arithmetic can only be performed on arrays with the same number of dimensions and dimensions of the same size. However, this limitation is overcome by NumPy's built-in broadcasting feature, which replicates the smaller array along the last mismatched dimension. This method allows for arithmetic between arrays with different shapes and sizes. Three examples of broadcasting are given: a scalar with a one-dimensional array, a scalar with a two-dimensional array, and a one-dimensional array with a two-dimensional array. The limitations of broadcasting are also noted, including the strict rule that must be satisfied: arithmetic can only be performed when the shape of each dimension in the arrays is equal, or one of them has a dimension size of one.

  • 00:30:00 The speaker introduces the concept of vectors, which are tuples of one or more values called scalars and are often represented using a lowercase character such as "v". Vectors can be thought of as points or coordinates in an n-dimensional space where n is the number of dimensions, and can be created as a numpy array in Python. The speaker also discusses vector arithmetic operations such as vector addition and subtraction, which are done element-wise for vectors of equal length resulting in a new vector of the same length. Furthermore, the speaker explains that vectors can be multiplied by a scalar to scale its magnitude and how to perform these operations using numpy arrays in Python. Lastly, the speaker talks about the dot product of two vectors, which gives a scalar, and how it can be used to calculate the weighted sum of a vector.

  • 00:35:00 The focus is on vector norms and their importance in machine learning. Vector norms refer to the size or length of a vector, calculated using some measure that summarizes the distance of the vector from the origin of the vector space. The norm is always a positive number, except for a vector of all zero values. Four common vector norm calculations used in machine learning are introduced, starting with the vector L1 norm, which is then followed by the L2 and max norms. The section also defines matrices and how to manipulate them in Python, discussing matrix arithmetic, matrix-matrix multiplication (dot product), matrix-vector multiplication, and scalar multiplication. A matrix is a two-dimensional array of scalars with one or more columns and one or more rows, and it is often represented by an uppercase letter, such as A.

  • 00:40:00 The concept of matrix operations for machine learning is introduced, including matrix multiplication, matrix division, and matrix scalar multiplication. Matrix multiplication, also known as the matrix dot product, requires the number of columns in the first matrix to be equal to the number of rows in the second matrix. The dot function in NumPy can be used to implement this operation. The concept of matrix transpose is also introduced, where a new matrix is created by flipping the rows and columns of the original matrix. Lastly, the process of matrix inversion is discussed, which finds another matrix that, when multiplied with the original matrix, results in an identity matrix.

  • 00:45:00 The concept of matrix inversion is covered, where inverting a matrix is indicated by a negative 1 superscript next to the matrix. In practice, matrix inversion relies on a suite of efficient numerical methods. The trace operation of a square matrix is also discussed, which can be calculated using the trace function in NumPy. The determinant of a square matrix is defined as a scalar representation of the volume of a matrix and can be calculated using the det function in NumPy. Additionally, the rank of a matrix is introduced, which estimates the number of linearly independent rows or columns in the matrix and is commonly calculated using singular value decomposition. Finally, the concept of sparse matrices is explained, which are matrices containing mostly zero values and are computationally expensive to represent and work with.

  • 00:50:00 We learn about sparse matrices, which are matrices composed mostly of zero values, in contrast to dense matrices that have mostly non-zero values. Sparsity can be quantified by calculating the sparsity score, which is the number of zero values divided by the total number of elements in the matrix. We also learn about the two big problems with sparsity: space complexity and time complexity. SciPy provides tools for creating sparse matrices, and many linear algebra functions in NumPy and SciPy can operate on them. Sparse matrices are commonly used in applied machine learning for data observations and data preparation.

  • 00:55:00 The different types of matrices commonly used in linear algebra, particularly relevant to machine learning, are discussed. Square matrices are introduced, where the number of rows equals the number of columns, along with rectangular matrices. The main diagonal and order of a square matrix are also covered. Additionally, symmetric matrices, triangular matrices (including upper and lower), diagonal matrices, identity matrices, and orthogonal matrices are explained. It is noted that an orthogonal matrix is one whose rows and columns are orthonormal vectors, so any two distinct columns have a dot product of zero.

  • 01:00:00 We learn about orthogonal matrices and tensors. An orthogonal matrix is a type of square matrix whose columns and rows are orthogonal unit vectors. These matrices are computationally cheap and stable to calculate their inverse and can be used in deep learning models. In Tensorflow, tensors are a cornerstone data structure and a generalization of vectors and matrices, represented as multi-dimensional arrays. Tensors can be manipulated in Python using n-dimensional arrays, similar to matrices, with element-wise tensor operations such as addition and subtraction. Additionally, tensor product operations can be performed on tensors, matrices, and vectors, allowing for an intuition of higher dimensions.

  • 01:05:00 The video introduces matrix decomposition, a method to reduce a matrix into its constituent parts and simplify more complex matrix operations that can be performed on the decomposition matrix. Two widely used matrix decomposition techniques that are covered in the upcoming lessons are the LU matrix decomposition for square matrices and the QR matrix decomposition for rectangular matrices. LU decomposition can be used to simplify linear equations in the linear regression problem and calculating the determinant and inverse of a matrix, while QR decomposition has applications in solving systems of linear equations. Both decompositions can be implemented using built-in functions in the NumPy package in Python.

  • 01:10:00 The video discusses the Cholesky decomposition, which is used for symmetric and positive definite matrices. This method is nearly twice as efficient as the LU decomposition and is preferred for decomposing symmetric matrices. The Cholesky decomposition is represented by a lower triangular matrix, which can be computed easily through the cholesky function in NumPy. The video also mentions that matrix decomposition methods, including eigendecomposition, are used to simplify complex operations, and the eigendecomposition decomposes a matrix into eigenvectors and eigenvalues. Finally, the video notes that eigenvectors are unit vectors, while eigenvalues are scalars, and both are useful for reducing dimensionality and performing complex matrix operations.

  • 01:15:00 The concept of eigen decomposition and its calculation using an efficient iterative algorithm is discussed. The eigen decomposition is a method of decomposing a square matrix into its eigenvalues and eigenvectors, which are coefficients and directions respectively. The eigendecomposition can be calculated in NumPy using the eig function, and tests can be carried out to confirm that a vector is indeed an eigenvector of a matrix. The original matrix can also be reconstructed from the eigenvalues and eigenvectors. The section also briefly introduces singular value decomposition (SVD) as a matrix decomposition method for reducing the matrix into its constituent parts to make certain subsequent matrix calculations simpler, and its applications in various fields such as compression, denoising, and data reduction.

  • 01:20:00 The concept of singular value decomposition (SVD) and its applications in machine learning are discussed. SVD is used in the calculation of other matrix operations and in data reduction methods in machine learning, such as least squares linear regression, image compression, and denoising data. The original matrix can be reconstructed from the U, Sigma, and V elements returned from the SVD. A popular application of SVD is dimensionality reduction, where data can be reduced to a smaller subset of features that are the most relevant to the prediction problem. This has been successfully applied in natural language processing using a technique called latent semantic analysis or latent semantic indexing. The TruncatedSVD class that directly implements this capability is discussed, and its application is demonstrated using a defined matrix followed by a transformed version.
Applied Linear Algebra for Machine Learning Engineers
  • 2022.03.26
  • www.youtube.com
This course will cover everything you need to know about linear algebra for your career as a machine learning engineer.
 

A Complete Introduction to XGBoost for Machine Learning Engineers


A Complete Introduction to XGBoost for Machine Learning Engineers

In the video, the instructor provides a comprehensive introduction to XGBoost for machine learning engineers. They explain that XGBoost is an open-source machine learning library known for its ability to quickly build highly accurate classification and regression models. It has gained popularity as a top choice for building real-world models, particularly when dealing with highly structured datasets. XGBoost was authored by Tianqi Chen and is based on the gradient-boosted decision trees technique, which enables fast and efficient model building.

The instructor highlights that XGBoost supports multiple interfaces, including Python and scikit-learn implementations. They proceed to give a demonstration of XGBoost, showcasing various modules for loading data and building models.

The video then focuses on preparing the dataset for training an XGBoost model. The instructor emphasizes the importance of separating the data into training and testing sets. They identify the target variable as a binary classification problem and explain the process of setting the necessary hyperparameters for the XGBoost model. Once the model is trained on the training data, they evaluate its accuracy on the testing data using the accuracy score as a metric.

To provide a better understanding of XGBoost, the instructor delves into the concept of gradient boosting and its role in the broader category of traditional machine learning models. They explain that gradient boosting is a technique that combines a weak model with other models of the same type to create a more accurate model. In this process, each successive tree is built for the prediction residuals of the preceding tree. The instructor emphasizes that decision trees are used in gradient boosting, as they provide a graphical representation of possible decision solutions based on given conditions. They also mention that designing a decision tree requires a well-documented thought process to identify potential solutions effectively.

The video further explores the creation of binary decision trees using recursive binary splitting. This process involves evaluating all input variables and split points in a greedy manner to minimize a cost function that measures the proximity of predicted values to the actual values. The instructor explains that the split with the lowest cost is chosen, and the resulting groups can be further subdivided recursively. They emphasize that the algorithm used is greedy, as it focuses on making the best decision at each step. However, it is preferred to have decision trees with fewer splits to ensure better understandability and reduce the risk of overfitting the data. The instructor highlights that XGBoost provides mechanisms to prevent overfitting, such as limiting the maximum depth of each tree and pruning irrelevant branches. Additionally, they cover label encoding and demonstrate loading the iris dataset using scikit-learn.

Moving on, the video covers the process of encoding the target label as a numerical variable using the label encoder method. After splitting the data into training and testing datasets, the instructor defines and trains the XGBoost classifier on the training data. They then use the trained model to make predictions on the testing dataset, achieving an accuracy of 90%. The concept of ensemble learning is introduced as a method for combining multiple models to improve prediction accuracy, ultimately enhancing the learning algorithm's efficiency. The instructor emphasizes the importance of selecting the right model for classification or regression problems to achieve optimal results.
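
A minimal sketch of that pipeline, assuming the xgboost package and its scikit-learn-style XGBClassifier; the hyperparameter values and the train/test split are illustrative choices, and the resulting accuracy will not necessarily match the 90% quoted above.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

iris = load_iris()
X = iris.data
y_names = iris.target_names[iris.target]        # string class labels, e.g. 'setosa'

y = LabelEncoder().fit_transform(y_names)       # encode the labels as integers 0..2

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)

model = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(accuracy_score(y_test, predictions))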

The video dives into the concept of bias and variance in machine learning models and emphasizes the need for a balance between the two. Ensemble learning is presented as a technique for addressing this balance by combining groups of weak learners to create more complex models. Two ensemble techniques, bagging and boosting, are introduced. Bagging aims to reduce variance by creating subsets of data to train decision trees and create an ensemble of models with high variance and low bias. Boosting, on the other hand, involves sequentially learning models with decision trees, allowing for the correction of errors made by previous models. The instructor highlights that gradient boosting is a specific type of boosting that optimizes a differentiable loss function using weak learners in the form of regression trees.

The video explains the concept of gradient boosting in detail, outlining its three core elements: a loss function to be optimized, weak learners (typically decision trees) that make predictions, and an additive model in which trees are added sequentially, each one reducing the error left by those before it. To demonstrate the process, the video showcases the use of k-fold cross-validation to segment the data, with XGBoost producing a score for each fold. The instructor chooses decision trees as the weak learners, ensuring a shallow depth to avoid overfitting. Finally, a loss function is defined as a measure of how well the machine learning model fits the data.

The core steps of gradient boosting are explained, which include optimizing the loss function, utilizing weak learners (often decision trees), and combining multiple weak learners in an additive manner through ensemble learning. The video also covers practical aspects of using XGBoost, such as handling missing values, saving models to disk, and employing early stopping. Demonstrations using Python code are provided to illustrate various use cases of XGBoost. Additionally, the video emphasizes the importance of data cleansing, including techniques for handling missing values, such as mean value imputation.
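
A short sketch of scoring an XGBoost model with k-fold cross-validation; the synthetic dataset and parameter values are placeholders rather than the video's example.

from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=7)   # toy data

model = XGBClassifier(max_depth=3, n_estimators=100)   # shallow trees as weak learners
cv = KFold(n_splits=10, shuffle=True, random_state=7)

scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
print(scores)            # one accuracy score per fold
print(mean(scores))      # average performance across folds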

The speaker discusses the importance of cleaning data properly rather than relying solely on algorithms to do the work. They demonstrate how dropping empty values can improve model accuracy and caution against relying on algorithms to handle empty values automatically. The concept of pickling, which involves saving trained models to disk for later use, is introduced using the pickle library in Python. The speaker demonstrates how to save and load models. They also show how to plot the importance of each attribute in a dataset using the plot_importance function in XGBoost and the matplotlib library.
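
A minimal sketch of pickling a trained model and plotting feature importance; the synthetic training data and file name are illustrative assumptions.

import pickle
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from xgboost import XGBClassifier, plot_importance

X, y = make_classification(n_samples=200, n_features=8, random_state=7)
model = XGBClassifier(n_estimators=50).fit(X, y)

with open('xgb_model.pkl', 'wb') as f:        # save the trained model to disk
    pickle.dump(model, f)

with open('xgb_model.pkl', 'rb') as f:        # load it back later without retraining
    loaded_model = pickle.load(f)

plot_importance(loaded_model)                 # bar chart of per-feature importance scores
plt.show()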

The speaker discusses the importance of analyzing and testing different scenarios when building machine learning models, emphasizing that feature importance scores from XGBoost may not always reflect the actual impact of a feature on the model's accuracy. They use the example of the Titanic dataset to demonstrate how adding the "sex" attribute improves model accuracy, despite being ranked low in feature importance scores. The speaker emphasizes the importance of testing various scenarios and not solely relying on feature importance scores. They also mention that XGBoost can evaluate and report the performance of a test set during training.

The video explains how to monitor the performance of an XGBoost model during training by specifying an evaluation metric and passing an array of x and y pairs. The model's performance on each evaluation set is stored and made available after training. The video covers learning curves, which provide insight into the model's behavior and help prevent overfitting by stopping learning early. Early stopping is introduced as a technique to halt training after a fixed number of epochs if no improvement is observed in the validation score.
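
A rough sketch of monitoring an evaluation set with early stopping; note that where the metric and early-stopping settings are passed has changed across XGBoost releases, so this assumes a recent version (1.6 or later) and an illustrative dataset.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)

# recent XGBoost versions take the metric and early-stopping setting in the constructor
model = XGBClassifier(n_estimators=500, eval_metric='logloss', early_stopping_rounds=10)

# the evaluation set is scored every boosting round; training stops once the
# validation score has not improved for 10 consecutive rounds
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

results = model.evals_result()                   # stored per-round metric values
print(model.best_iteration)                      # the round at which the best score was reached
print(results['validation_0']['logloss'][:5])    # first few points of the learning curve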

The video covers the use of early stopping rounds in XGBoost and demonstrates building a regression model to evaluate home prices in Boston. The benefits of parallelism in gradient boosting are discussed, focusing on the construction of individual trees and the efficient preparation of input data. The video provides a demonstration of multithreading support, which utilizes all the cores of the system to execute computations simultaneously, resulting in faster program execution. Although XGBoost is primarily geared towards classification problems, the video highlights its capability to excel at building regression models as well.

The speaker creates a list of thread counts for an example and uses a for loop to time the model build for each count. They print the build time for each iteration and plot the results, showing how the training time decreases as the number of threads increases. The speaker then discusses hyperparameter tuning, which involves adjusting parameters in a model to enhance its performance. They explore the default parameters for XGBoost and scikit-learn and mention that tuning hyperparameters is essential to optimize the performance of an XGBoost model. The video explains that hyperparameters are settings that are not learned from the data but are set manually by the user. Tuning hyperparameters involves systematically searching for the best combination of parameter values that result in the highest model performance.
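
A small sketch of that timing experiment; the dataset size and the list of thread counts are illustrative assumptions.

import time
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=50, random_state=7)

# time the same model build with an increasing number of threads
for n_threads in [1, 2, 4, 8]:
    start = time.time()
    XGBClassifier(n_estimators=100, n_jobs=n_threads).fit(X, y)
    print(n_threads, 'threads:', round(time.time() - start, 2), 'seconds')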

To perform hyperparameter tuning, the video introduces two common approaches: grid search and random search. Grid search involves defining a grid of hyperparameter values and exhaustively evaluating each combination. Random search, on the other hand, randomly samples hyperparameter combinations from a predefined search space. The video recommends using random search when the search space is large or the number of hyperparameters is high.

The video demonstrates hyperparameter tuning using the RandomizedSearchCV class from scikit-learn. They define a parameter grid containing different values for hyperparameters such as learning rate, maximum depth, and subsample ratio. The RandomizedSearchCV class performs random search with cross-validation, evaluating the performance of each parameter combination. After tuning, the best hyperparameters are selected, and the model is trained with these optimal values.
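
A compact sketch of random search with scikit-learn's RandomizedSearchCV; the parameter ranges and the synthetic dataset are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=7)

# candidate values for a few common hyperparameters (illustrative ranges)
param_distributions = {
    'learning_rate': [0.01, 0.05, 0.1, 0.3],
    'max_depth': [3, 4, 5, 6],
    'subsample': [0.6, 0.8, 1.0],
    'n_estimators': [100, 200, 300],
}

search = RandomizedSearchCV(
    XGBClassifier(),
    param_distributions=param_distributions,
    n_iter=10,                       # number of random combinations to try
    scoring='accuracy',
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=7),
    random_state=7,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)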

The speaker explains that hyperparameter tuning helps to find the best trade-off between underfitting and overfitting. It is important to strike a balance and avoid overfitting by carefully selecting hyperparameters based on the specific dataset and problem at hand.

In addition to hyperparameter tuning, the video discusses feature importance in XGBoost models. Feature importance provides insights into which features have the most significant impact on the model's predictions. The speaker explains that feature importance is determined by the average gain, which measures the improvement in the loss function brought by a feature when it is used in a decision tree. Higher average gain indicates higher importance.

The video demonstrates how to extract and visualize feature importance using the XGBoost library. They plot a bar chart showing the top features and their corresponding importance scores. The speaker notes that feature importance can help in feature selection, dimensionality reduction, and gaining insights into the underlying problem.
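
For reference, a minimal sketch of pulling importance scores out of a trained booster and plotting them; importance_type="gain" matches the average-gain definition above, and the synthetic data is a placeholder.

from matplotlib import pyplot as plt
from sklearn.datasets import make_classification
from xgboost import XGBClassifier, plot_importance

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)
model = XGBClassifier(n_estimators=100).fit(X, y)

# Bar chart of the top features ranked by average gain.
plot_importance(model, importance_type="gain", max_num_features=10)
plt.show()

# The raw scores are also available as a dictionary.
print(model.get_booster().get_score(importance_type="gain"))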

Towards the end of the video, the speaker briefly mentions other advanced topics related to XGBoost. They touch upon handling imbalanced datasets by adjusting the scale_pos_weight hyperparameter, dealing with missing values using XGBoost's built-in capability, and handling categorical variables through one-hot encoding or using the built-in support for categorical features in XGBoost.
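
Those options can be sketched in a few lines. The tiny DataFrame below is purely illustrative, and enable_categorical requires a recent XGBoost release (used here together with tree_method="hist").

import numpy as np
import pandas as pd
from xgboost import XGBClassifier

# Toy frame with a missing value and a pandas categorical column.
df = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0],
    "sex": pd.Categorical(["male", "female", "female", "male", "male", "female"]),
    "survived": [0, 1, 1, 0, 0, 1],
})
X, y = df.drop(columns="survived"), df["survived"]

# A common starting point for imbalanced data: negative count / positive count.
ratio = float((y == 0).sum()) / float((y == 1).sum())

# NaNs are handled natively; enable_categorical covers the categorical column.
model = XGBClassifier(scale_pos_weight=ratio, tree_method="hist", enable_categorical=True)
model.fit(X, y)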

The video provides a comprehensive overview of XGBoost, covering its key concepts, implementation, hyperparameter tuning, and feature importance analysis. The demonstrations and code examples help illustrate the practical aspects of working with XGBoost in Python. It serves as a valuable resource for machine learning engineers looking to utilize XGBoost for their classification and regression tasks.

  • 00:00:00 The instructor provides an introduction to XGBoost for machine learning engineers. XGBoost is an open-source machine learning library used to build highly accurate classification and regression models quickly, making it a top choice for building real-world models against highly structured datasets. The author of XGBoost is Tianqi Chen, and it is an implementation of gradient boosted decision trees designed for speed and performance. The instructor also highlights that XGBoost supports several interfaces, such as the Python and scikit-learn implementations, and provides a demo of XGBoost using several modules to load data and build models.

  • 00:05:00 The instructor explains how to prepare the dataset for training an XGBoost model, focusing on separating the data into training and testing sets. The target variable is identified as a binary classification problem, and the necessary hyperparameters are set for the XGBoost model. The model is trained on the training data and the accuracy of the model is evaluated on the testing data using the accuracy score as a metric. The instructor also gives an overview of gradient boosting, the concept behind XGBoost, and how it fits into the broader category of traditional machine learning models.

  • 00:10:00 We learn about recursive binary splitting and ensemble learning, which combines multiple weak models to improve the accuracy of predictions. Gradient boosting is a technique for building predictive models by combining a weak model with other models of the same type to produce a more accurate model. Each successive tree is built for the prediction residuals of the preceding tree. Decision trees are used in gradient boosting and entail a graphical representation of all the possible solutions to a decision based on certain conditions. The design of a decision tree requires a well-documented thought process that helps formalize the brainstorming process so that we can identify more potential solutions.

  • 00:15:00 The video explains how binary decision trees are created. The process is called recursive binary splitting and it involves evaluating all input variables and split points in a greedy manner to minimize a cost function that measures how close predicted values are to their corresponding real values. The split with the lowest cost is chosen, and the resulting groups can be subdivided recursively. The algorithm is a greedy one that focuses on making the best decision at each step. Decision trees with fewer splits are preferred, as they are easier to understand and less likely to overfit the data. To prevent overfitting, the XGBoost algorithm allows for a mechanism to stop the growth of trees, such as limiting the maximum depth of each tree, and pruning irrelevant branches. The video also covers label encoding and loading the iris dataset using scikit-learn.

  • 00:20:00 The video covers the process of encoding a target label as a numerical variable, using the label encoder method. Once the data has been split into training and testing datasets, the XGBoost classifier is defined and trained on the training data. The model is then used to make predictions on the testing dataset with 90% accuracy achieved. Ensemble learning is then introduced as a method for combining multiple models to improve the accuracy of predictions, allowing for a more efficient learning algorithm. The video emphasizes the importance of choosing the right model for classification or regression problems when trying to achieve the best results.

  • 00:25:00 The concept of bias and variance in machine learning models is discussed, and the need for a balance between the two is emphasized. Ensemble learning is introduced as a technique for addressing this balance by combining groups of weak learners to create more complex models. Bagging and boosting are two ensemble techniques: bagging reduces variance by training decision trees on several subsets of the data and averaging an ensemble of high-variance, low-bias learners, while boosting learns models sequentially, with each decision tree correcting the errors of the previous models so the weak learners work together to classify the input correctly. Gradient boosting is a specific type of boosting that involves optimizing a differentiable loss function and using weak learners in the form of regression trees.

  • 00:30:00 The concept of gradient boosting was introduced and its three-step process was explained. The first step involves adding weak learners like decision trees in an iterative process to minimize loss. The second step is the sequential addition of trees, while the final step aims to reduce model error through more iterations. The demonstration involved the use of k-fold cross-validation to segment data, and through XGBoost, scores were obtained for each fold. The decision tree was used as the weak learner of choice, with a shallow depth to avoid overfitting. Lastly, a loss function was defined as a measure of how well the machine learning model fits the data of a particular phenomenon.

  • 00:35:00 The core steps of gradient boosting are explained, which include optimizing the loss function, using a weak learner (usually a decision tree), and combining many weak learners in an additive fashion through ensemble learning. The section also covers various practical aspects of using XGBoost, such as handling missing values, saving models to disk, and using early stopping. A code-based approach is taken in this section, with numerous demos given to show various uses of XGBoost. Additionally, the section explores the importance of data cleansing, including how to replace missing values with mean value imputation.

  • 00:40:00 The speaker discusses the importance of cleaning your own data and not relying on algorithms to do the work for you. They demonstrate how dropping empty values can improve model accuracy and caution against allowing algorithms to handle empty values. The speaker also introduces the concept of pickling, which is a way to save trained models to disk for later use, and demonstrates how to use the pickle library to save and load models in Python (a minimal save-and-load sketch follows this list). Finally, they show how to plot the importance of each attribute in a dataset using the plot importance function in XGBoost and matplotlib.

  • 00:45:00 The speaker discusses the feature importance scores as determined by XGBoost and the importance of analyzing and testing different scenarios when building machine learning models. They use the example of the Titanic dataset and show how adding the "sex" attribute improves the accuracy of the model, despite it being ranked low in feature importance scores. The speaker emphasizes the importance of testing various scenarios and not relying solely on feature importance scores. They also mention the ability of XGBoost to evaluate and report the performance of a test set during training.

  • 00:50:00 The video discusses how to monitor the performance of the XGBoost model during training by specifying an evaluation metric and passing in an array of x and y pairs. The model's performance on each evaluation set is stored and made available after training. Using these performance measures, learning curves can be created to gain further insight into the model's behavior and potentially stop learning early to prevent overfitting. The video also covers early stopping, a technique where training is stopped after a fixed number of epochs if no improvement is observed in the validation score.

  • 00:55:00 The video covers the use of early stopping rounds in XGBoost and building a regression model to evaluate home prices in Boston. The benefits of parallelism in gradient boosting are also discussed, with a focus on the construction of individual trees and the efficient preparation of input data. A demonstration of multithreading support is provided, which allows for faster program execution by performing several computations at the same time, making use of all the cores of your system. The video also mentions that although XGBoost is geared towards classification problems, it can also excel at building regression models.

  • 01:00:00 The speaker creates a list to hold the number of iterations for the example and uses a for loop to test the execution speed of the model based on the number of threads. The build time for each iteration is printed, and the results are plotted to show how the training time decreases as the number of threads increases. Then, the speaker discusses hyperparameter tuning, which means adjusting the parameters passed into a model to enhance its performance. They explore the default parameters for XGBoost and scikit-learn, and note that tweaking the hyperparameters can take some work to squeeze out the model's best performance. Finally, they delve into how many trees (weak learners, or estimators) are needed to configure a gradient boosting model and how big each tree should be.

  • 01:05:00 The video teaches about tuning hyperparameters in order to optimize the XGBoost model. It showcases an example of grid search for the n_estimators model parameter, which evaluates a series of values to test the number of estimators on a given model. It also covers different subsample techniques and how row sampling can be specified through the scikit-learn wrapper of the XGBoost class. Additionally, the video highlights the importance of configuring the learning rate, which is done through trial and error. The learning rate interacts with many other aspects of the optimization process, and smaller learning rates require more training epochs. Finally, diagnostic plots are useful for investigating how the learning rate impacts the rate of learning and the learning dynamics of the model.

  • 01:10:00 The presenter demonstrates how to create a high-scoring XGBoost model on the Titanic dataset. The presenter uses pandas and train_test_split to preprocess the data and XGBoost to train the model. The accuracy of the model is above 80 percent, which makes it resume-worthy. The presenter also warns against people who upload fake scores to the Kaggle leaderboard by either overfitting the model or doctoring the results. Finally, the presenter walks through the code line by line, demonstrating data cleaning, label encoding, handling null values, defining the X and Y variables, and splitting the data for training and testing the model.

  • 01:15:00 The importance of handling missing data correctly was reiterated, as applied machine learning is primarily about data and not so much about modeling. Additionally, the results of monitoring a model's performance were explained, and early stopping was presented as an approach to training complex machine learning models to avoid overfitting. The section also included a discussion of configuring multi-threading support for XGBoost and the default hyperparameters for XGBoost and Scikit-learn.
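
Following up on the pickling step mentioned at 00:40:00 above, a minimal save-and-load sketch; the model, file name, and data are placeholders rather than the exact code from the video.

import pickle
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=7)
model = XGBClassifier(n_estimators=50).fit(X, y)

# Serialize the trained model to disk...
with open("xgb_model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...and load it back later to make predictions without retraining.
with open("xgb_model.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded.predict(X[:5]))
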
A Complete Introduction to XGBoost for Machine Learning Engineers
  • 2022.03.28
  • www.youtube.com
This course will cover all the core aspects of the most well-known gradient booster used in the real-world.
 

Feature Engineering Case Study in Python for Machine Learning Engineers



Feature Engineering Case Study in Python for Machine Learning Engineers

The instructor begins the course by introducing the concept of feature engineering and its crucial role in extracting value from the vast amount of data generated every day. They emphasize the importance of feature engineering in maximizing the value extracted from messy data. Learners are assumed to have entry-level Python knowledge, along with experience using NumPy, Pandas, and Scikit-Learn.

The instructor highlights the significance of exploratory data analysis and data cleansing in the process of building a machine learning model. They explain that these phases will be the main focus of the course. While the learners will go through the entire pipeline in the final chapter, the primary emphasis will be on feature engineering.

The instructor emphasizes that feature engineering is essential for improving model performance. They explain that feature engineering involves converting raw data into features that better represent the underlying signal for machine learning models. The quality of the features directly impacts the model's performance, as good features can make even simple models powerful. The instructor advises using common sense when selecting features, removing irrelevant ones, and including factors relevant to the problem under analysis.

Various techniques for cleaning and engineering features are covered in the video. Outliers are removed, data is normalized and transformed to address skewness, features are combined to create more useful ones, and categorical variables are created from continuous ones. These techniques aim to obtain features that accurately capture important trends in the data while discarding irrelevant information. The Titanic dataset is introduced as an example, containing information about the passengers aboard the ship.
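
A few of those transformations can be sketched in pandas. The column names below follow the usual Kaggle Titanic conventions and are assumptions rather than code shown in the video.

import numpy as np
import pandas as pd

# Tiny illustrative slice of a Titanic-style frame.
df = pd.DataFrame({"Age": [22, 38, 26, 80], "Fare": [7.25, 71.28, 8.05, 512.33],
                   "SibSp": [1, 1, 0, 0], "Parch": [0, 0, 0, 0]})

df["Fare"] = df["Fare"].clip(upper=df["Fare"].quantile(0.99))  # cap extreme outliers
df["LogFare"] = np.log1p(df["Fare"])                           # tame right-skew
df["FamilyCount"] = df["SibSp"] + df["Parch"]                  # combine related features
df["IsChild"] = (df["Age"] < 16).astype(int)                   # categorical from continuous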

The instructor discusses the class imbalance problem in machine learning, where positive cases are significantly fewer than negative cases. They suggest adjusting the model to better detect the signal in both cases, such as through downsampling the negative class. However, since the dataset used in the example is not heavily imbalanced, the instructor proceeds with exploring the data features. Basic exploratory data analysis is conducted on continuous features, and non-numeric features like name, ticket, sex, cabin, and embarked are dropped. The cleaned dataset is displayed, and the distribution and correlation of features are examined. It is discovered that the "Pclass" and "Fare" features exhibit the strongest correlation with the survival column, indicating their potential usefulness in making predictions.

Further exploratory data analysis is conducted on the continuous features. Non-numeric features like name and ticket are dropped, and the first five rows of the dataset are printed. The data is described using pandas functions, revealing missing values and a binary target variable called "Survived." The correlation matrix is analyzed to determine the correlations between features and their relationship with "Survived." The importance of looking at the full distribution of data is emphasized, as relying solely on mean or median values may lead to inaccurate conclusions. Plots and visualizations are used to explore the relationship between categorical features and the survival rate, uncovering trends such as higher survival rates among first-class passengers and those with fewer family members.
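
The corresponding exploratory steps look roughly like this in pandas, assuming the Titanic data sits in a local CSV with the usual Kaggle column names:

import pandas as pd

df = pd.read_csv("titanic.csv")  # assumed local copy of the dataset

# Keep the continuous features plus the target for a first look.
continuous = df.drop(columns=["Name", "Ticket", "Sex", "Cabin", "Embarked"], errors="ignore")

print(continuous.head())                              # first five rows
print(continuous.describe())                          # counts reveal missing values
print(continuous.corr()["Survived"].sort_values())    # correlation of each feature with the target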

The instructor highlights the importance of feature engineering and advises against condensing features excessively without proper testing. They discuss the process of exploring and engineering categorical features, including identifying missing values and the number of unique values in each feature. Grouping features and analyzing the average value for the target variable in each group is suggested as a helpful approach for better understanding the dataset. The relationship between the missing cabin feature and the survival rate is explored, leading to the discovery of a strong indicator of survival rate despite the seemingly low value of the feature.

Feature exploration reveals that titles, cabin indicators, and sex have a strong correlation with survival, while the embarked feature is redundant. The relationship between cabin and survival rate is explained by the observation that more people who boarded in Cherbourg had cabins, resulting in a higher survival rate. The number of immediate family members on board is combined into a single feature, and using either passenger class or fare (rather than both) is suggested because the two are correlated.

The instructor explains that the next step is to engineer the features based on the insights gained from exploratory data analysis. They start by creating a new feature called "Title" from the "Name" feature. The "Title" feature extracts the title from each passenger's name (e.g., Mr., Mrs., Miss) as it may provide additional information related to social status and survival rate. The "Title" feature is then mapped to numerical values for simplicity.
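
A hedged sketch of that extraction using a regular expression on a couple of illustrative names; the exact title mapping in the course may differ.

import pandas as pd

df = pd.DataFrame({"Name": ["Braund, Mr. Owen Harris",
                            "Cumings, Mrs. John Bradley",
                            "Heikkinen, Miss. Laina"]})

# Grab the token between the comma and the period, e.g. "Mr", "Mrs", "Miss".
df["Title"] = df["Name"].str.extract(r",\s*([^\.]+)\.", expand=False)

# Map common titles to small integers; anything unseen falls back to 0.
title_map = {"Mr": 1, "Mrs": 2, "Miss": 3, "Master": 4}
df["Title"] = df["Title"].map(title_map).fillna(0).astype(int)
print(df)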

Next, the instructor focuses on the "Cabin" feature, which initially had many missing values. However, by analyzing the survival rate of passengers with and without cabin information, it was discovered that having a recorded cabin number had a higher survival rate. Based on this insight, a new binary feature called "HasCabin" is created to indicate whether a passenger has a recorded cabin or not.
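
That indicator amounts to a one-liner in pandas; the frame below is illustrative.

import numpy as np
import pandas as pd

df = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", np.nan]})

# 1 when a cabin number was recorded, 0 when the field is missing.
df["HasCabin"] = df["Cabin"].notna().astype(int)
print(df)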

Moving on, the instructor tackles the "Sex" feature. Since machine learning models typically work better with numerical data, the "Sex" feature is mapped to binary values, with 0 representing male and 1 representing female.

After engineering the "Sex" feature, the instructor addresses the "Embarked" feature, which indicates the port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton). However, it was previously determined that the "Embarked" feature is redundant and does not contribute significantly to the prediction of survival. Therefore, it is dropped from the dataset.
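
Both of these steps come down to a couple of pandas calls; again, the data is a stand-in.

import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "female"], "Embarked": ["S", "C", "Q"]})

df["Sex"] = df["Sex"].map({"male": 0, "female": 1})  # numeric encoding for the model
df = df.drop(columns=["Embarked"])                   # redundant feature, dropped
print(df)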

The instructor then focuses on the "Pclass" and "Fare" features, which exhibited strong correlations with survival during exploratory data analysis. These features are left as they are since they are already in a suitable format for the machine learning model.

At this stage, the instructor emphasizes the importance of data preprocessing and preparing the features for the model. The dataset is split into training and testing sets to evaluate the model's performance accurately. Missing values in the "Age" feature are imputed using the median age of passengers, and all the features are standardized to have zero mean and unit variance using Scikit-Learn's preprocessing functions.
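
A minimal sketch of that preprocessing, assuming a hypothetical engineered feature frame X and target y; the imputation and scaling statistics are learned on the training split only so the test set stays unseen.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical engineered features and target.
X = pd.DataFrame({"Age": [22, np.nan, 26, 35, 54, 2, 27, 14],
                  "Fare": [7.25, 71.28, 8.05, 53.1, 51.86, 21.08, 11.13, 30.07],
                  "Pclass": [3, 1, 3, 1, 1, 3, 3, 2]})
y = pd.Series([0, 1, 1, 1, 0, 0, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Impute the median age computed from the training split only.
median_age = X_train["Age"].median()
X_train, X_test = X_train.fillna({"Age": median_age}), X_test.fillna({"Age": median_age})

# Standardize to zero mean and unit variance, fitting the scaler on the training split.
scaler = StandardScaler().fit(X_train)
X_train_scaled, X_test_scaled = scaler.transform(X_train), scaler.transform(X_test)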

Finally, the instructor briefly discusses the concept of one-hot encoding for categorical features and mentions that it will be covered in more detail in the next video. One-hot encoding is a common technique used to represent categorical variables as binary vectors, enabling the model to interpret them correctly.
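
As a preview of that technique, pandas can one-hot encode a categorical column directly; the column and values are illustrative.

import pandas as pd

df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Each category becomes its own 0/1 column: Embarked_C, Embarked_Q, Embarked_S.
print(pd.get_dummies(df, columns=["Embarked"]))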

To summarize, in this part of the course, the instructor introduced the concept of feature engineering and explained its significance in machine learning. They conducted exploratory data analysis, cleaned the dataset, and engineered features based on the insights gained. The instructor demonstrated how to create new features, map categorical features to numerical values, and remove redundant features. The next steps involved data preprocessing and preparing the features for the machine learning model.

Please note that the above summary is a hypothetical continuation based on the general topics typically covered in a feature engineering course. The actual content and examples may vary depending on the specific course and instructor.

  • 00:00:00 The instructor introduces the course on feature engineering and its importance in extracting value from the vast amount of data generated every day, giving learners the toolkit they need to be able to extract maximum value from that messy data. Learners are assumed to have some entry-level Python knowledge, as well as experience using NumPy, Pandas, and Scikit-Learn. The instructor also goes through the process of building a machine learning model at a high level, highlighting the importance of the exploratory data analysis and data cleansing, which are the critical phases that will be focused on exclusively in this course. Learners will go through the entire pipeline in the final chapter, but the focus will primarily be on feature engineering.

  • 00:05:00 The importance of feature engineering and its impact on model performance are discussed. Feature engineering is the process of converting raw data into features that better represent the underlying signal for machine learning models to improve their accuracy on unseen data. It is the unsung hero in machine learning since without good quality data, machine learning models are essentially worthless. However, with great features, even simple models can be quite powerful. Additionally, it is important to use common sense when selecting features – irrelevant features must be removed, and factors relevant to the problem under analysis must be included. Ultimately, the quality of the features fed into the model is the primary limiting factor on the model's performance.

  • 00:10:00 The video covers various techniques to clean and engineer features to ensure that machine learning models only use useful signals. These include removing outliers, normalizing data, transforming skewed data, combining features into more useful ones, and creating categorical variables from continuous ones. The aim of these techniques is to obtain features that accurately capture important trends in the data and discard those that are not representative. The video also introduces the Titanic dataset, which contains information about the passengers aboard the ship, including their name, age, class, ticket number, cabin number, and embarkation port. The video goes on to explore the distribution of the target variable, which is whether an individual on the ship survives or not.

  • 00:15:00 The speaker discusses the class imbalance problem in machine learning, where the number of positive cases is significantly less than the negative majority cases, making it difficult for the model to detect the signal in the positive cases. The speaker suggests adjusting the model to better pick up the signal in both cases, such as by downsampling the negative class. However, since the dataset used in the example is not terribly imbalanced, the speaker proceeds with exploring the data features, starting with basic exploratory data analysis on continuous features only. The speaker drops non-numeric features like name, ticket, sex, cabin, and embarked, and prints the first five rows of the cleaned dataset. The speaker then examines the distribution and correlation of features, and finds that Pclass and Fare have the strongest correlation with the survival column and thus may be useful in making predictions.

  • 00:20:00 The video goes over some basic exploratory data analysis on the continuous features of the data. The video drops non-numeric features like name and ticket and prints the first 5 rows. The data is described using the built-in pandas function, and it is noted that there are missing values and a binary target variable called "Survived." The correlation matrix is analyzed, and it is important to note how correlated each feature is with "Survived" and with the other features. It is stated that a strong negative correlation can be just as useful as a positive one. It is observed that the "Pclass" and "Fare" features have the strongest correlation with "Survived," but "Fare" and "Pclass" are also highly correlated with each other, which could confuse the model.

  • 00:25:00 The instructor discusses a method for identifying features that may be useful predictors for the target variable. The method involves grouping by the target variable and analyzing the distributions of each feature for people who survived versus people who did not, as well as running a t-test to determine statistical significance. The instructor highlights two features, fare and class, that stand out as potentially good indicators of survival, but cautions about the impact of correlation on interpretation. The instructor also discusses the missing values for the age feature and uses group by to determine whether it is missing at random or not. Additionally, the instructor explains how to plot continuous features to visualize their distributions and relationships with the target variable.

  • 00:30:00 The video discusses the importance of looking at the full distribution of data instead of relying on mean or median values for continuous features in determining survivorship rate. The video provides an example of plotting an overlaid histogram of age and fare for those that survived and those that didn't, highlighting the caution one needs to take when relying solely on averages. Additionally, the video uses seaborn's categorical plot to plot survival rate percentage for each level of different categorical features such as passenger class and family count, which shows a trend that first class passengers, as well as those with fewer family members, are more likely to survive. The video also explores combining sibling, spouses, parents, and children features into a single feature and discusses the importance of using sound logic when creating indicator variables for models to generalize effectively.

  • 00:35:00 The speaker emphasizes the importance of feature engineering in machine learning. The speaker advises against condensing features down too much without testing, as sometimes separate features may be more effective than a single feature. Moving on to categorical features, the speaker advises looking for missing values and the number of unique values in each feature. They discuss how grouping features and looking at the average value for the target variable in each group can be helpful in understanding the dataset better. In particular, the speaker spends time exploring the relationship between the missing cabin feature and survival rate. They explain how exploring data in this way led them to find a strong indicator of survival rate despite the seemingly low value of the feature.

  • 00:40:00 The speaker discusses the process of exploring and engineering features for machine learning. Features explored include cabin, ticket, and name. The cabin variable is used to create an indicator variable for the presence of a cabin, which is hypothesized to affect survival rates. The ticket variable is determined to be assigned randomly and will be dropped as a feature. The name variable is explored for titles, which may represent social status and correlate with survival rates. A pivot table is used to examine the survival rates of each title, with an outlier being the “master” title for young boys. Lastly, the speaker discusses plotting categorical features in order to explore the relationship between different levels of those features and the survival rate.

  • 00:45:00 Further feature exploration showed that the title, cabin indicator, and sex have a strong correlation with survival, while the embarked feature does not provide much information and is redundant. Using a pivot table, it was discovered that more people who boarded in Cherbourg had cabins relative to those who boarded in Queenstown or Southampton, explaining the higher survival rate in Cherbourg. Lastly, the number of immediate family members on board was combined into one feature, and the use of either passenger class or fare was suggested due to their correlation.

  • 00:50:00 The instructor discusses the process of feature engineering and how to handle missing values in machine learning. Three common approaches to replacing missing values are discussed: filling with the mean or median value of the feature, building a model to predict a reasonable value, or assigning a default value. The instructor decides to replace missing age values with the average value, which satisfies the model while avoiding bias. The embarked feature, a categorical variable, is also cleaned by adding another value to indicate missing values. Additionally, the process of capping is introduced as a way to remove outliers in the data, which is important for ensuring that the model fits the actual trend of the data instead of chasing down outliers.

  • 00:55:00 The presenter discusses identifying outliers using various thresholds. The function to detect outliers is defined, and thresholds for each feature are set and adjusted based on the distribution of values. The max values for siblings, spouses, parents, children, and age are reasonable, so there is no need to cap them, but the fare feature is capped at the 99th percentile. The age and fare features are transformed using the "clip" method to set upper boundaries on the features. The presenter then moves on to discuss skewed data and its potential problems, including the model chasing the long tail. The presenter visualizes the distribution of the continuous features age and fare and will transform them to create a more compact and easy-to-understand distribution.

  • 01:00:00 The video explores the process of transforming data to make it more well-behaved and compact in order to improve machine learning models. The specific transformation being used is the Box-Cox power transformation, where an exponent is applied to each data point in a certain feature. The video explains the process of testing out the transformation with different exponents and using criteria such as QQ plots and histograms to determine which transformation yielded the best-behaved data. The end result is a more compact distribution for the feature of interest that will not distract the machine learning model with long tails and outliers. The transformed data is then stored as a feature in the data frame for future use.

  • 01:05:00 The video explores how to create a new feature from the existing text data. After analyzing the name feature, the speaker shows how to parse out the person's title and create a new title feature. The title feature is found to be a strong indicator of whether someone survived and is added to the data in preparation for modeling. The video also covers creating a binary indicator for the cabin feature, indicating whether a passenger had a cabin or not. Lastly, the speaker demonstrates how to combine existing features, such as the number of siblings, spouses, parents, and children aboard into a new feature that indicates the number of immediate family members aboard, preparing the data for modeling.

  • 01:10:00 The presenter discusses the importance of converting categorical features into numeric features for machine learning models. They explain that this is necessary because models can only understand numerical values, not string values, and give an example of label encoding from the scikit-learn package. They then walk through a loop to apply this conversion to the non-numeric features in the Titanic dataset. Finally, the presenter discusses the importance of splitting up data into training, validation, and test sets for evaluating the performance of the machine learning model on unseen data. They demonstrate how to use train_test_split from scikit-learn to split the data into these sets.

  • 01:15:00 The video covers how to use the train_test_split method to split data into training, validation, and test sets. The method can only split one data set into two, so two passes through the method are necessary to get three separate data sets (a minimal sketch of this two-pass split follows this list). The video also discusses the importance of standardizing data, or converting values to the number of standard deviations above or below the mean, in order to normalize features that are on different scales. A StandardScaler is imported and used to scale the data, with examples provided to illustrate the process.

  • 01:20:00 The instructor discusses the importance of scaling data for some machine learning models and compares the performance of four different sets of features in predicting survival on the Titanic. While some algorithms like random forest don't necessarily require scaled data, scaling can help other algorithms train more quickly and even perform better. Additionally, the instructor defines four sets of features; original features, cleaned original features, new features plus cleaned original features, and reduced features, to build a model on each and compare performance to understand the value of cleansing, transforming, and creating features. Finally, the instructor writes out the data to CSV files to ensure the same examples are used in the training, validation, and testing sets.

  • 01:25:00 The process for fitting the model on the raw original features is discussed, using packages such as joblib, matplotlib, seaborn, numpy, and pandas, along with RandomForestClassifier and GridSearchCV. The correlations between the features are visualized using a heatmap created from the correlation matrix, and it is found that passenger class and cabin have a high correlation of 0.7. GridSearchCV is used to find the best hyperparameters for the model, such as the number of estimators and the max depth of the trees. The best model is found to have about 512 estimators with a max depth of 8, resulting in an average score of about 84.5 percent, allowing for the move to the next set of data.

  • 01:30:00 The video explores feature importance in a random forest model and the benefits of using grid search cv. The feature importances for the model show that sex is the most important feature, while age is more important than passenger class which was previously believed to be a strong indicator of survival. However, passenger class can be highly correlated with other features, such as whether someone had a cabin or the fare they paid, resulting in the model being confused about what truly drives the relationship with the target variable. Once the model is fit with the best hyperparameter settings on 100% of the training data, it is ready to be evaluated on a validation set. The model is then fit on clean features to determine if the missing values and outliers significantly affected its ability to pick up on underlying trends. The best hyperparameter settings for this model are simpler than the model on raw features, and feature importance is nearly identical to the previous model. Lastly, the model is fit on all the features, including transformed features, to see how much value they provide in addition to the simple features.

  • 01:35:00 The video explores the process of evaluating the best models generated by each feature set on a validation data set to select the best model based on performance. The video discusses the importance of considering model latency when deciding the best model and mentions the packages being used for accuracy, precision, and recall score calculations. The models previously saved are read in using a loop and stored as a dictionary with the model name as the key and the model object as the value. The best model is selected based on the performance of the validation set, and its performance on a holdout test set is evaluated for an unbiased view of its performance.

  • 01:40:00 The presenter discusses how to load models stored in a model dictionary and evaluate their performance using the "evaluate model" function. The presenter explains that in this case study, the best performing model on the validation set is the one built on all features, while the one built on reduced features is the simplest with the lowest latency. The presenter highlights the trade-offs between precision and recall depending on the problem they are solving. Lastly, the presenter states that since they do not have any prediction time requirements, they will deploy the model built on all features and evaluate it on the test set.

  • 01:45:00 The speaker explains how the test set was not used for model selection and is an unbiased way to evaluate the performance of the final selected model. The chosen model was built on four different features, with 64 estimators and a max depth of eight. The accuracy was robustly tested and evaluated on unseen data, generating an 83.7 percent accuracy on cross-validation, 83 percent on the validation set, and 81 percent on the test set. With this information, the speaker is confident in proposing this model as the best model for making predictions on whether the people aboard the Titanic would survive or not. The speaker also notes that the skills learned in this course can be generalized to any new feature set to extract every last ounce of value and build the most powerful machine learning model.
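
Picking up the two-pass split mentioned at 01:15:00 above, a minimal sketch with synthetic data; the 60/20/20 proportions are an assumption for illustration, not necessarily the course's split.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Pass 1 carves off 40% for validation + test; pass 2 splits that 40% in half,
# giving a 60/20/20 train / validation / test split.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Standardize: fit on the training set only, then apply the same scaling to all three sets.
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = (scaler.transform(s) for s in (X_train, X_val, X_test))
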
Feature Engineering Case Study in Python for Machine Learning Engineers
  • 2022.04.06
  • www.youtube.com
Another free course to help you become a machine learning engineer in the real world. LogikBot - Affordable, Real-World and Comprehensive - https://www.logikb...
 

Machine Learning with BigQuery on Google's Cloud Platform



Machine Learning with BigQuery on Google's Cloud Platform

The video discusses the content of a course that focuses on using BigQuery for machine learning. BigQuery is an enterprise data warehouse that was initially used internally at Google and later became a cloud service. It is highly scalable and serverless, capable of accommodating petabytes of data and providing fast query results. The course instruction is based on real-world case studies, guiding learners through the process of building machine learning models from data sourcing to model creation. Throughout the course, learners utilize BigQuery to construct their models, requiring them to set up a Google Cloud Platform (GCP) account specific to BigQuery.

The video explains Google's guiding principles for scaling hardware resources, emphasizing the decision to scale out rather than up. Google recognizes that hardware can fail at any time, so designs should account for potential failures. Additionally, Google utilizes commodity hardware, which is affordable and allows for vendor flexibility. Scaling out is preferred over scaling up due to the high cost of hardware. Google has developed technologies such as GFS, MapReduce, and Bigtable, which have led to a scaled-out hardware architecture. Colossus has replaced GFS and serves as the underlying distributed subsystem for Google's technologies, including BigQuery.

The lecturer provides an overview of Google's database solution, Spanner, which is distributed globally and relies on Colossus for managing distributed transactions. The video also demonstrates the process of signing up for and managing billing accounts within the Google Cloud Platform. Users can create a GCP account by visiting the platform's website, agreeing to the terms, and providing the necessary information. New users are granted a $300 credit to use on GCP, which can be monitored through the billing section. The lecturer advises setting up budget alerts to receive notifications when certain billing targets are reached.

The creation and purpose of BigQuery are discussed in detail. Google's exponential data growth necessitated the development of BigQuery, which allows for interactive queries over large data sets. BigQuery can handle queries regardless of whether they involve 50 rows or 50 billion rows. Its non-standard SQL dialect facilitates a short learning curve, and it can parallelize SQL execution across thousands of machines. While BigQuery stores structured data, it differs from relational databases by supporting nested record types within tables, enabling the storage of nested structures.

The architecture of BigQuery is explained, highlighting its approach to parallelization. Unlike most relational database systems that execute one query per core, BigQuery is designed to run a single query across thousands of cores, significantly improving performance compared to traditional approaches. The Dremel engine enables query pipelining, allowing other queries to utilize available cores while some are waiting on I/O. BigQuery employs a multi-tenancy approach, enabling multiple customers to run queries simultaneously on the same hardware without impacting other locations. The BigQuery interface comprises three core panes, including query history, saved queries, job history, and resource sections for organizing access to tables and views.

The video provides a detailed explanation of the screens and panels within the Google Cloud Console specific to BigQuery. The navigation menu displays BigQuery resources, such as data sets and tables, while the SQL workspace section allows users to create queries, work with tables, and view their job history. The Explorer panel lists current projects and their resources, while the Details panel provides information on selected resources and allows for modifications to table schemas, data exports, and other functions. It is clarified that BigQuery is not suitable for OLTP applications due to its lack of support for frequent small row-level updates. While not a NoSQL database, BigQuery uses a dialect of SQL and is closer to an OLAP database, providing similar benefits and suitability for many OLAP use cases.

The definition of Google's BigQuery is further discussed, emphasizing that it is a fully managed, highly scalable, cost-effective, and fast cloud data warehouse for analytics with built-in machine learning.

Here are additional points discussed in the video:

  1. BigQuery's storage format: BigQuery uses a columnar storage format, which is optimized for query performance. It stores data in a compressed and columnar manner, allowing for efficient processing of specific columns in a query without accessing unnecessary data. This format is especially beneficial for analytical workloads that involve aggregations and filtering.

  2. Data ingestion: BigQuery supports various methods of data ingestion. It can directly load data from sources like Google Cloud Storage, Google Sheets, and Google Cloud Bigtable. It also offers integrations with other data processing tools, such as Dataflow and Dataprep, for ETL (Extract, Transform, Load) operations.

  3. Data partitioning and clustering: To optimize query performance, BigQuery provides features like partitioning and clustering. Partitioning involves dividing large datasets into smaller, manageable parts based on a chosen column (e.g., date). Clustering further organizes the data within each partition, based on one or more columns, to improve query performance by reducing the amount of data scanned.

  4. Data access controls and security: BigQuery offers robust access controls to manage data security. It integrates with Google Cloud Identity and Access Management (IAM), allowing users to define fine-grained access permissions at the project, dataset, and table levels. BigQuery also supports encryption at rest and in transit, ensuring the protection of sensitive data.

  5. Data pricing and cost optimization: The video briefly touches on BigQuery's pricing model. It operates on a pay-as-you-go basis, charging users based on the amount of data processed by queries. BigQuery offers features like query caching, which can reduce costs by avoiding redundant data processing. It's important to optimize queries and avoid unnecessary data scanning to minimize costs.

  6. Machine learning with BigQuery: The course covers using BigQuery for machine learning tasks. BigQuery integrates with Google Cloud's machine learning services, such as AutoML and TensorFlow, allowing users to leverage the power of BigQuery for data preparation and feature engineering before training machine learning models.

  7. Use cases and examples: The lecturer mentions various real-world use cases where BigQuery excels, such as analyzing large volumes of log data, conducting market research, performing customer segmentation, and running complex analytical queries on massive datasets.

Overall, the video provides an overview of BigQuery's capabilities, architecture, and key features, highlighting its suitability for large-scale data analytics and machine learning tasks. It emphasizes the benefits of using a fully managed and highly scalable cloud-based solution like BigQuery for handling vast amounts of data efficiently.
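
To make that workflow concrete, here is a hedged sketch of driving BigQuery (and BigQuery ML) from Python with the google-cloud-bigquery client. It assumes a GCP project with billing enabled, application-default credentials, and placeholder dataset and table names; none of this is the exact code from the course.

from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project id

# Train a logistic regression directly inside BigQuery ML (standard SQL).
client.query("""
    CREATE OR REPLACE MODEL `my_dataset.titanic_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['survived']) AS
    SELECT pclass, sex, age, fare, survived
    FROM `my_dataset.titanic`
""").result()

# Pull the evaluation metrics back into a pandas DataFrame.
metrics = client.query(
    "SELECT * FROM ML.EVALUATE(MODEL `my_dataset.titanic_model`)"
).to_dataframe()
print(metrics)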

  • 00:00:00 The video discusses the course content, which is focused on using BigQuery for machine learning. BigQuery is a highly scalable and serverless enterprise data warehouse that was originally used internally at Google before becoming a cloud service. It can accommodate petabytes of data and returns results in mere seconds, making it a valuable resource for supervised machine learning, particularly with large data sets. The instruction in this course is based on real-world case studies, and learners will be walked through the process of building their machine learning models, from sourcing the data to modeling it to create a highly predictive model. Throughout the course, learners will leverage BigQuery to build their models, which requires setting up a GCP account that will be used specifically for BigQuery.

  • 00:05:00 The video explains the guiding principles behind scaling hardware resources at Google, specifically the decision to move towards scaling out, rather than up. Google's guiding principles when it comes to hardware is that anything can fail at any time and that designs should account for that. The second principle has to do with using commodity hardware, which is affordable and easy to obtain, thereby allowing Google to switch vendors without incurring any penalties. Lastly, hardware is expensive, hence the goal is to scale out rather than up. Google has designed key technologies such as GFS, MapReduce, and Bigtable to move them towards scaled-out hardware architecture. Furthermore, Colossus replaced GFS and is the underlying distributed subsystem on which much of Google's technology is built, including BigQuery, which relies on Colossus.

  • 00:10:00 The lecturer provides an overview of Google's database solution, Spanner, which is distributed globally and uses Colossus to manage distributed transactions, while also demonstrating how to sign up for and manage billing accounts within Google Cloud Platform. To begin using Google Cloud services, users need to create an account on GCP, which can be done by navigating to the browser and typing in "GCP" or "Google Cloud Platform." After agreeing to the terms and providing the appropriate information, new users are given a $300 credit to use on GCP, which can be monitored through the overview and budget features in the billing section. The lecturer encourages users to set up budget alerts to receive notifications when certain billing targets are hit, which can be accomplished by clicking on "create a budget" and specifying the total dollar amount to be spent, as well as selecting the project and budget alerts to be enabled.

  • 00:15:00 The creation and purpose of BigQuery is discussed. Google's exponential data growth caused problems, leading to the development of a tool that enabled interactive queries over large data sets: BigQuery. It operates the same way irrespective of whether one is querying 50 rows or 50 billion rows. Thanks to its non-standard dialect based on SQL, it has a short learning curve, and it can parallelize SQL execution across thousands of machines. BigQuery stores structured data, but unlike a relational database, its fields can hold record types, including nested records within tables. These nested structures are essentially pre-joined tables.

  • 00:20:00 The video explains the architecture of BigQuery and its approach to parallelization. While most relational database systems can only execute one query per core, BigQuery is architected to run a single query across thousands of cores, significantly improving performance compared to the traditional one-query-per-core approach. This is possible due to the Dremel engine, which can pipeline queries, allowing other queries to use available cores while some are waiting on I/O. This multi-tenancy approach means that many customers can run queries at the same time on the same hardware, and BigQuery takes advantage of varied data usage patterns, so heavy usage in one geographic location doesn't impact other locations. The video also explains the three core panes of the BigQuery interface, with query history specific to each project, and saved queries, job history, and resource sections available to organize access to tables and views.

  • 00:25:00 The speaker explains the various screens and panels that make up the Google Cloud Console specific to BigQuery. The navigation menu displays BigQuery resources such as data sets and tables, while the SQL workspace section allows users to create queries, work with tables, and view their job history. The Explorer panel displays a list of current projects and their resources, and the Details panel provides information on the selected resource and allows users to modify table schemas, export data, and perform other functions. The speaker also discusses what BigQuery is not, explaining that it is not well-suited for OLTP applications due to its lack of support for frequent, small row-level updates, and that it is not a NoSQL database because it uses a dialect of SQL. Instead, BigQuery is closer to an OLAP database and provides many of the same benefits, making it appropriate for many OLAP use cases.

  • 00:30:00 The definition of Google's BigQuery was discussed. It is a fully managed, highly scalable, cost-effective, and fast cloud data warehouse for analytics with built-in machine learning. Additionally, BigQuery is built on many other components, such as Megastore and Colossus. BigQuery has its own format for storing data, ColumnIO, which stores data in columns, improving performance, with users charged based on the data their queries process. Google's network is fast due to their high attention to detail; much of their network architecture remains a mystery. Lastly, BigQuery released support for standard SQL with the launch of BigQuery 2.0, renaming BigQuery SQL to legacy SQL and making standard SQL the preferred dialect for queries and data stored in BigQuery.

  • 00:35:00 The video covers the process for saving and opening queries in BigQuery, as well as creating and querying views. The narrator explains that a view is a virtual table and demonstrates how to create and save a view in a new data set. The video also discusses the different options within the query editor, such as formatting the query and accessing query settings. Additionally, the video covers the explosion of machine learning and data science careers and discusses the differences between roles such as data analyst and data scientist. Finally, the narrator explains that the focus of the course will be on supervised machine learning using Python, which is considered the gold standard in the field.

  • 00:40:00 The different roles within the field of machine learning are discussed, including the data scientist, machine learning engineer, and data engineer. The focus is on applied machine learning, which is the real-world application of machine learning principles to solve problems, as opposed to purely academic or research applications. The importance of structured data sets, particularly those found in relational databases, is also emphasized, as traditional models such as gradient boosters have been shown to excel at modeling highly structured data sets and have won many competitions over artificial neural networks.

  • 00:45:00 The machine learning process is covered, which is highly process-oriented. The video explains how machine learning engineers must follow the same core steps when given a problem to solve. The first step is to look at the data, followed by sourcing data. Since most applied machine learning is supervised, the data must first be cleaned (or "wrangled"), which involves massaging the data into a numerically supported format. This requires the machine learning engineer to spend most of their time doing data wrangling. Once the data has been cleaned, the modeling stage begins. At this stage, models or algorithms are developed that learn patterns from the cleaned data set. The goal of machine learning is to be able to make highly accurate predictions against fresh data. Once the models have been tuned and tested on new data, they are put into production for consumers to use.

  • 00:50:00 The video discusses the installation process of Python 3.7 version using the Anaconda distribution on a Mac. The Anaconda distribution is available for both Windows and Mac and has a graphical installer. After downloading the installer and entering the password, the default installation type is recommended, and the installation process may take a couple of minutes. Once the installation is completed, the Anaconda Navigator can be launched and a new Python 3 notebook can be opened to begin coding.

  • 00:55:00 The instructor explains how to navigate the Jupyter Notebook IDE, which is used for machine learning with BigQuery on Google's Cloud Platform. The first step is to locate the Notebook on the laptop by typing CMD and accessing the Anaconda command prompt. From there, typing "jupyter notebook" will load the Python engine on the local computer. Once loaded, navigating the Notebook is explained, including how to close out of a page.

  • 01:00:00 A step-by-step guide on using Jupyter Notebook is presented, starting with navigating to "New notebook" and selecting Python 3. The tutorial also shows how to import libraries, create, execute, and rename cells, change cell order, auto-save a notebook, insert, copy, cut, paste, execute all and restart the kernel, and use Markdown to annotate a notebook. Additionally, the notebook's simplicity is emphasized and seen as sufficient for working with the machine learning pipeline.

  • 01:05:00 The video covers the foundation of working with data in BigQuery, including data sets and tables. It explains how important it is for machine learning engineers to be able to create, upload and wrangle data in BigQuery, as scale can be a major problem when creating real-world models. With BigQuery ML, it only requires knowledge of SQL, making it simple and accessible for those well-versed in SQL, and it provides seasoned machine learning professionals the ability to build their models at any scale. Additionally, the video covers the core machine learning libraries used in applied machine learning in python such as Pandas, which is a library for data wrangling and manipulation, Numpy, which is a fundamental package for scientific computing with python, Matplotlib for creating 2D graphs, and Scikit-Learn, which is a library used to build traditional models.

  • 01:10:00 The video tutorial explores the basics of data wrangling and data manipulation for machine learning using two core libraries: pandas and numpy. The pandas library is used to load a famous toy dataset for machine learning called the Titanic dataset and create an alias. An array is created to enable the model to work with the data, and the necessary attributes for the model, such as passenger class, sex, age, and survived, are identified. The target variable, which is to be predicted, is the survived attribute, which is either 1 or 0; survived means 1 while did not survive means 0. The next step is converting values in the attributes to numbers using Python code, which can be understood by the machine. All the observations with null or NaN values are removed, and the survived attribute is dropped from the feature set (X) to prevent the model from cheating. Finally, the dataset is divided into testing and training sets using the general-purpose library for machine learning called scikit-learn.

  • 01:15:00 The video discusses using machine learning with the Titanic data set and notes that, in real-world scenarios, most model data is sourced from relational databases. The SQL Server Management Studio interface is introduced, as it is commonly used to manage SQL Server databases. A hypothetical scenario is presented in which the task is to create a dataset that can be used to predict future sales. The video walks through how to craft a query and join tables to create an order history for the celebrities, and how to save this information as a view so it can be easily queried and exported as a CSV file to share with the rest of the team.

  • 01:20:00 The video walks through exporting data from a cloud SQL Server database to a CSV file. It explains that exploratory data analysis (EDA) plays a crucial role in machine learning and introduces the Matplotlib and Seaborn libraries for data visualization. The video then shows how to use these libraries to explore the Titanic data set, calculate the percentage of missing values, and create histograms and bar plots, noting that Seaborn is often preferred for its simplicity.
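A minimal sketch of that exploratory analysis, assuming the same hypothetical titanic.csv file and column names used earlier:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("titanic.csv")                     # hypothetical path

# Percentage of missing values per column.
print((df.isnull().mean() * 100).sort_values(ascending=False))

# Histogram of ages with matplotlib.
df["Age"].plot(kind="hist", bins=30, title="Age distribution")
plt.show()

# Bar plot of survival counts by sex with seaborn.
sns.countplot(data=df, x="Sex", hue="Survived")
plt.show()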

  • 01:25:00 The speaker explores the different types of machine learning models and their applications. Deep learning models excel at image and speech recognition, but they are not the best fit for most supervised machine learning, which is based on highly structured datasets. Traditional models, such as gradient boosters, are more accurate on such data, less computationally intensive, easier to explain, and well suited to classification and regression problems. The speaker then walks through building a traditional model using Python, Pandas for data wrangling, and XGBoost, a gradient-boosting library that has won many modeling competitions. The model achieved an 83% score on the dataset, and the speaker explains how to save it using the pickle library.
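The workflow described above, training a gradient booster and pickling it, could look like the minimal sketch below; the file names, column names, and split ratio are assumptions, not details from the video:

import pickle
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("titanic_wrangled.csv")             # hypothetical, already-numeric dataset
X = df.drop(columns=["Survived"])
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBClassifier()                               # gradient-boosted trees
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))

with open("titanic_xgb.pkl", "wb") as f:              # save the trained model with pickle
    pickle.dump(model, f)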

  • 01:30:00 The video explains what classification is and how it separates observations into groups based on characteristics such as grades, test scores, and experience, and covers binary classification, where data is split into two groups with a yes-or-no output. It then introduces artificial neural networks and deep learning models, defining linear regression as predicting a value along a fitted line and explaining how it is used in forecasting quantities such as cancer outcomes or stock prices. The linear regression demonstration in Python uses pandas to massage the data, NumPy to hold it in an optimized array container, and Matplotlib for visualization. The video plots a graph showing the positive linear relationship between hours studied and scores achieved, and finally imports the regression model used in the Python script.
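A minimal sketch of that linear regression demo; the hours/scores numbers below are made up for illustration:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Made-up study-time data in the spirit of the demo.
df = pd.DataFrame({"hours": [1, 2, 3, 4, 5, 6, 7, 8],
                   "score": [35, 45, 50, 58, 66, 72, 80, 88]})

plt.scatter(df["hours"], df["score"])                 # visualize the positive linear relationship
plt.xlabel("Hours studied")
plt.ylabel("Score achieved")
plt.show()

X = df[["hours"]].to_numpy()                          # NumPy array container
y = df["score"].to_numpy()
model = LinearRegression().fit(X, y)
print(model.predict(np.array([[9.5]])))               # predict the score for 9.5 hours of study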

  • 01:35:00 The speaker goes over the fundamentals of classification as a supervised machine learning technique and provides a simplified definition that separates observations into groups based on their characteristics. The example given is spam detection, where emails are divided into two categories: spam and not spam. A more complex example is the Titanic machine learning project, which is a binary classification problem where the output of the model is either a one for survived or a zero for not survived. The next part of the section goes over how to build a classification model with high accuracy, including importing libraries, using the iris dataset, converting textual values to numbers using label encoding, training a random forest classifier model, and testing the completed model against the training data to achieve 97% accuracy.
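A sketch of that iris classification model, assuming scikit-learn's built-in copy of the dataset (the video may load it differently); the label-encoding step is shown explicitly even though the built-in targets are already numeric:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

iris = load_iris()
X = iris.data
species = [iris.target_names[i] for i in iris.target]   # textual species labels
y = LabelEncoder().fit_transform(species)               # convert text labels to numbers

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))            # iris is an easy dataset, so scores run high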

  • 01:40:00 The foundation of working with data using BigQuery is discussed, including data sets and tables. As a machine learning engineer, it's crucial to be able to create, upload, and wrangle data in BigQuery. The section goes over wrangling data in BigQuery, including how it can handle petabytes of data and the benefits of using Google's cloud Jupyter notebook called Cloud Datalab. BigQuery ML is also covered, which requires no programming knowledge other than SQL, making it easier for data professionals to create machine learning models. Finally, the section covers the nuances of data sets and tables, including how to create a data set and add tables to it in BigQuery.

  • 01:45:00 The speaker discusses the source options available when creating tables in BigQuery, including an empty table, external data sources, and uploading data in readable formats such as CSV, JSON, Avro, Parquet, and ORC. While most machine learning engineers prefer CSV files, Avro is faster to load and easier to parse with no encoding issues, while Parquet and ORC are widely used in the Apache Hadoop ecosystem. The speaker then introduces Google's Cloud Datalab, a virtual machine (VM) hosted on GCP that provides a Jupyter Notebook-like interface. Users can take code from a local Jupyter Notebook and run it on GCP, and when creating a new Datalab instance they choose a storage region and may be prompted to create an SSH key.

  • 01:50:00 The instructor demonstrates how to create a connection to BigQuery and import the wrangled Titanic dataset into a Cloud Datalab instance. By importing BigQuery and creating a connection to it, users can write SQL code to query data. With pre-packaged tools such as pandas and scikit-learn's DecisionTreeClassifier and train_test_split, users can segment their data, fit a model to the training data, and score it. Users can also edit the query directly inside a cell and execute it to create a new pandas DataFrame holding the query results. Finally, the instructor shows how to upload and query another dataset, the iris dataset, in the Cloud Datalab instance using BigQuery.
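The video does this with Datalab's bundled BigQuery helpers; the sketch below uses the standalone google-cloud-bigquery client instead, with a hypothetical project, dataset, and table name:

from google.cloud import bigquery
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

client = bigquery.Client(project="my-project")          # hypothetical project id

# Query the wrangled Titanic table straight into a pandas DataFrame.
sql = "SELECT * FROM `my-project.titanic.wrangled`"     # hypothetical table
df = client.query(sql).to_dataframe()

X = df.drop(columns=["survived"])
y = df["survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)  # fit to the training data
print("score:", model.score(X_test, y_test))            # score the model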

  • 01:55:00 The presenter demonstrates how to import data from BigQuery into a Jupyter notebook on Google's Cloud Platform. The iris dataset is imported and split into training and testing sets, a random forest classifier is trained, and the predicted values are printed. The presenter also shows how to upgrade the resources of a Cloud Datalab instance by opening it from the Google Cloud home page and clicking "edit".

  • 02:00:00 The speaker explains BigQuery ML, a tool that enables SQL practitioners to build large-scale machine learning models using existing SQL skills and tools, thus democratizing machine learning. BigQuery ML currently supports three types of models: linear regression, binary logistic regression, and multi-class logistic regression. The speaker also explains how to create a binary logistic regression model in BigQuery using SQL language. The creation of the model involves defining the model, specifying options and passing the target variable using SQL statements. The model can be evaluated and accuracy metrics presented through SQL as well. Finally, the speaker explains the prediction phase where the model is passed fresh data that it has never seen before.
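As a sketch of those three SQL steps (create, evaluate, predict), issued here through the Python client rather than the BigQuery console; the project, dataset, table, and model names are hypothetical:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")          # hypothetical project id

# Define and train a binary logistic regression model with BigQuery ML.
client.query("""
    CREATE OR REPLACE MODEL `my-project.titanic.survival_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['survived']) AS
    SELECT * FROM `my-project.titanic.wrangled`
""").result()

# Evaluate the model; accuracy and other metrics come back as a result set.
print(client.query(
    "SELECT * FROM ML.EVALUATE(MODEL `my-project.titanic.survival_model`)"
).to_dataframe())

# Predict against fresh data the model has never seen before.
print(client.query("""
    SELECT * FROM ML.PREDICT(MODEL `my-project.titanic.survival_model`,
        (SELECT * FROM `my-project.titanic.family_members`))
""").to_dataframe())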

  • 02:05:00 The speaker discusses how to use BigQuery ML to build a binary classification model and evaluate it. The data is uploaded from a CSV file to BigQuery, and the model is passed all columns except the target variable. Once evaluation is complete, the model makes a prediction for each family member, with the first column of the output predicting survival (one for survived, zero for did not survive). The speaker then installs gsutil, a command-line tool for working with Google Cloud Storage on GCP; Cloud Storage offers three storage classes with different accessibility and pricing.

  • 02:10:00 The speaker demonstrates how to upload and manage files in Google Cloud Storage using gsutil. First, the user sets the project to work within and creates a bucket using gsutil mb, keeping in mind that every bucket name must be globally unique. The speaker then explains how to copy a file to a bucket and grant public access to it, using access control lists (ACLs) to control who can read and write the data, how to download files and copy them to another bucket with gsutil, and how to use the -m switch to speed up uploads. The speaker concludes by exporting data from a relational database to two files and uploading them to GCP using Cloud Storage.
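The video uses the gsutil CLI for these steps; the equivalent operations through the google-cloud-storage Python client look roughly like this sketch (project, bucket, and file names are hypothetical):

from google.cloud import storage

client = storage.Client(project="my-project")          # hypothetical project id

bucket = client.create_bucket("my-unique-ml-bucket")   # bucket names must be globally unique

blob = bucket.blob("orders.csv")                       # copy a local file into the bucket
blob.upload_from_filename("orders.csv")
blob.make_public()                                     # grant public read access

blob.download_to_filename("orders_copy.csv")           # download the file again
other = client.bucket("my-other-bucket")               # copy it to another (existing) bucket
bucket.copy_blob(blob, other, "orders.csv")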

  • 02:15:00 The speaker demonstrates how to upload two data sets to Google Cloud Platform's BigQuery, join them using SQL, and create a view for building machine learning models. After exporting the data from SQL Server as CSV files, the speaker uploads them to a GCP Cloud Storage bucket, loads them into BigQuery, and combines them with a simple join statement. Finally, the speaker shows how to create a view over this larger data set for use in machine learning models.
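A sketch of that join-and-view step as a single SQL statement run through the Python client; the dataset, table, and column names are made up for illustration:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")          # hypothetical project id

# Join the two uploaded tables and save the result as a view
# that later machine learning models can query directly.
client.query("""
    CREATE OR REPLACE VIEW `my-project.sales.order_history` AS
    SELECT c.*, o.order_date, o.amount
    FROM `my-project.sales.customers` AS c
    JOIN `my-project.sales.orders`    AS o
      ON c.customer_id = o.customer_id
""").result()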

  • 02:20:00 The speaker walks through creating a table in Google BigQuery for the Titanic project dataset. They upload the dataset from a local source, auto-detect the schema from the CSV file, and skip the first row because it contains header information. After successfully creating the table, they query it and confirm that the data and headers appear correctly; the dataset is now ready for the next steps of the project.
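The same table creation can be scripted; below is a minimal sketch with the Python client (the local file name, project, and table id are assumptions), mirroring the auto-detect and skip-header settings described above:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")          # hypothetical project id

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,            # infer the schema from the CSV
    skip_leading_rows=1,        # the first row holds the headers
)

with open("titanic.csv", "rb") as f:                    # hypothetical local file
    client.load_table_from_file(
        f, "my-project.titanic.raw", job_config=job_config
    ).result()

# Confirm the data and headers look right.
print(client.query("SELECT * FROM `my-project.titanic.raw` LIMIT 5").to_dataframe())
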
Machine Learning with BigQuery on Google's Cloud Platform
  • 2022.04.25
  • www.youtube.com
A complete look at BigQuery for machine learning. LogikBot - Affordable, Real-World and Comprehensive - https://www.logikbot.com