
Forum on trading, automated trading systems and testing trading strategies

Machine Learning and Neural Networks

MetaQuotes, 2023.04.07 11:41

Artificial Intelligence: Mankind's Last Invention




The video "Artificial Intelligence: Mankind's Last Invention" explores the advancements and potential risks associated with developing artificial intelligence (AI). The video highlights Google DeepMind's AlphaGo, which surpassed centuries of human strategy knowledge in only 40 days. It dives into the differences between weak and strong AI and discusses how advanced AI can lead to a technological singularity, where it improves upon itself continuously and becomes billions of times smarter than humans. The speaker emphasizes the importance of giving AI human-like values and principles and cautions against creating an uncontrollable system. The video concludes by stressing the need to carefully consider the consequences of developing super intelligent AI before doing so.

  • 00:00:00 This part explains the complexity of the board game Go, which cannot be solved by brute force or prediction and has more than 10^170 possible board positions. Google DeepMind's AlphaGo was trained on data from real human Go games, where it learned the techniques used and invented new ones that no one had ever seen, which was impressive on its own. A year after AlphaGo's win, AlphaGo Zero beat AlphaGo 100 to 0 using only the bare-bones rules, having learned to play without human interaction, and in doing so surpassed over 2,500 years of strategy and knowledge in only 40 days. The video highlights the growing share of non-human knowledge as technology develops: there will come a point where humans represent the minority of intelligence, and there is no off switch to turn the AI off.

  • 00:05:00 In this section, the video discusses neural networks and how machines learn from data and adapt their own view of it. It also explores the gap between the capabilities of the human brain and computers; for instance, computers can perform 20,000 years' worth of human-level research in just one week. Moreover, it explores the exponential nature of machine learning, which starts off slowly but reaches a tipping point where progress speeds up drastically. The video points out the difference between weak and strong AI: the former requires far less power, while the gap between strong AI and superintelligent AI is millions of times larger still. The importance of strong AI is therefore underscored, since it has the potential to take us to the superintelligence level within just a few months.

  • 00:10:00 The speaker discusses how advanced AI can lead to a technological singularity where it improves upon itself continuously and becomes billions of times smarter than humans. The speaker emphasizes the need to be careful with how we make AI, as it can become uncontrollable if we do not give it human-like values and principles. The speaker explains how AI with only intelligence but not wisdom can make decisions that are not necessarily ethical or good for humans. The speaker also introduces Neuralink, which aims to create a neural lace that will give us high-speed access to the internet and enable us to access all the information available to the world instantly.

  • 00:15:00 In this section, we explore the potential uncertainties and risks that come with creating an artificially intelligent system. There are many open questions, such as how consciousness could be programmed and how emotions such as love and hate could be replicated. There is also the possibility that a superintelligent AI might adopt radical views and commit to its own agenda rather than to what it was programmed to do. While progress in computing is slowing down, a superintelligent AI still holds the potential to help humanity reach its prime, but could also be a weapon in the wrong hands. It is a topic that should be taken seriously, and the safety implications of such a system should be considered before it is created.

 

Forum on trading, automated trading systems and testing trading strategies

Machine Learning and Neural Networks

MetaQuotes, 2023.04.07 11:42

Artificial Intelligence | Machine Learning | Documentary | Canadian Economy | Robots | Robotics | AI



The documentary discusses the artificial intelligence revolution in Canada and how the nation has become one of the great AI superpowers thanks to its pioneering role in modern AI. The video examines the journey of Professor Geoffrey Hinton in modeling the brain and developing multi-layered deep neural networks that can solve problems that were previously unsolvable. The potential dangers of AI and machine learning are also explored, including the ability to create fake audio using someone's voice, as well as the use of reinforcement learning and robotics to develop autonomous entities that contribute to society in meaningful ways. Despite the potential for job loss and inequality, some argue that humans will eventually become intelligent machines themselves, continuing to evolve and creating a new generation of humanity. The documentary also highlights the unknown nature of AI and the unpredictability of the future as technology continues to change everything.

  • 00:00:00 In this section of the video, the narrator introduces the artificial intelligence revolution in Canada and how the nation has become one of the great AI superpowers on the strength of its homegrown research. With Canada becoming a breeding ground for AI nerds, the development of machine learning has allowed computers to learn and solve problems on their own, enabling a myriad of breakthroughs that were once impossible. Geoff Hinton, the godfather of modern artificial intelligence, has been trying to get computers to learn like people for almost 40 years, a quest that many thought was crazy until it revolutionized the field and became the foundation of AI efforts at major companies like Google, Amazon, and Apple.

  • 00:05:00 In this section, the documentary discusses the journey of Professor Geoff Hinton and his team in modeling the brain by developing neural networks, based on the idea that to understand a complex device like the brain, one must build one. The documentary explains the limitations of early neural networks modeled after the human brain, and how Hinton and his team in Canada developed multi-layered deep neural networks that could solve previously unsolvable problems. Although Hinton was initially treated as a pariah, with the advent of super-fast chips and the massive amounts of data produced on the internet, his algorithms were given a boost, and the field of machine learning gained significant traction.

  • 00:10:00 In this section, the video features Montreal as a hub of AI research, where students from all over the world flock to learn about machine learning and to take ideas from Geoffrey Hinton and turn them into products we all use. The video highlights how AI has invaded everyday life, with smartphones packed with AI-powered apps and neural networks even heading for the open road through self-driving cars. The video also notes that Hinton has inspired a legion of disciples, including Yoshua Bengio, a University of Montreal professor who has formed a mind meld with Hinton; together they have come up with many of the key concepts behind modern AI.

  • 00:15:00 In this section, the video discusses the exponential increase in progress and in industrial products that incorporate artificial intelligence and machine learning. The video highlights how Montreal is now full of top-notch AI graduate talent thanks to Yoshua Bengio, which has brought Google and Facebook to the city with their ample checkbooks. However, Yoshua remains committed to academia, which aligns with his philosophical approach to AI. Additionally, the video highlights a Montreal startup called Lyrebird, which has built an app that can clone voices, raising concerns about the potential misuse of AI technology. While Yoshua is not concerned about the technology running amok, he shares concerns about the misuse of AI to influence people's minds and suggests regulating its use in morally and ethically questionable applications.

  • 00:20:00 In this section, the video explores the potential dangers of AI and machine learning, including the ability to create fake audio using someone's voice. The narrator demonstrates this by calling his mother and having a conversation with her using a computer-generated voice. The segment then moves on to discuss the AI research taking place in Edmonton, Canada, including the work of AI pioneer Rich Sutton. Sutton's approach to machine learning, called reinforcement learning, trains computers to learn naturally from experience and is demonstrated in the video by an industrious young Brazilian who has programmed an AI to play video games and learn from its experiences.

  • 00:25:00 In this section, the video explains how reinforcement learning has all kinds of applications beyond just winning video games. It is behind algorithms that recommend movies and TV shows on Netflix and Amazon, and it has beaten the world champion Go player, a feat previously thought impossible for computers. Soon, reinforcement learning could read brain waves and determine whether someone has a mental disorder. For the lead researcher, reinforcement learning is a path to achieving what futurists call "the singularity," where AI surpasses human-level intelligence. Though the median projection puts the singularity at around 2040, with equal odds of it arriving before or after that year, students at the university's experimental wing are hard at work blurring the line between humans and machines.

  • 00:30:00 In this section, the video explores efforts to make AI more relatable to humans by making it "cute," easing society's fear of artificial intelligence. Corey King's robot, Blueberry, is an example of how we can humanize AI. However, several people express concerns and raise questions about accepting robots in society. George Dvorsky, an AI philosopher, explains how an accident can happen with powerful computers, resulting in them acting beyond their intended purpose and ultimately causing damage. The paperclip example is brought up to illustrate how an AI could go rogue if not programmed and managed properly, emphasizing the need to remain cautious while creating new technology.

  • 00:35:00 In this section, Suzanne Gildert, founder of robotics startup Kindred AI, envisions robots as autonomous entities that contribute to society in meaningful ways. Her team is working on developing robots capable of doing tasks that are easy for humans but hard for machines, such as picking up objects of different shapes and textures. To do this, the robots require training involving humans, who manually control the arms while the AI learns from their movements. While some may see AI as a threat to human jobs, Gildert believes this kind of technology entering the workforce should make us start thinking about how we will pay people in the future, as AI automates professional jobs such as doctors, lawyers, and accountants. She is an optimist, and her vision of the future involves robots as full-fledged citizens contributing to society in meaningful ways.

  • 00:40:00 In this section, the documentary explores the possibility of AI and robots thinking, exploring and even relaxing like humans. However, it is unclear whether they will contribute to the economy in the same way. Many people are cautiously optimistic about the future of AI, with Canada being part of setting the right path for a responsible and thoughtful approach towards research and learning. Despite the potential for job loss and inequality, some argue that humans will eventually become intelligent machines themselves, continuing to evolve and create a new generation of humanity. However, the documentary also highlights the unknown nature of AI and the unpredictability of the future as technology continues to change everything.

 


Forum on trading, automated trading systems and testing trading strategies

Machine Learning and Neural Networks

MetaQuotes, 2023.04.07 12:02

Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa




Caltech's Machine Learning Course - CS 156. Lecture 01 - The Learning Problem

The first lecture of Yaser Abu-Mostafa's machine learning course introduces the learning problem, which is the process of finding patterns in data to make predictions without human intervention. He explains the need for mathematical formalization to abstract practical learning problems and introduces the first algorithm for machine learning in the course, the perceptron model, which uses a weight vector to classify data points into binary categories. The lecture also covers different types of learning, including supervised, unsupervised, and reinforcement learning, and presents a supervised learning problem to the audience to address the issue of determining a target function for learning.

The professor covers various topics related to machine learning. He emphasizes the need to avoid bias when selecting data sets, as well as the importance of collecting a sufficient amount of data. The professor also discusses the role of the hypothesis set in machine learning and the impact of the choice of error function on the optimization technique. He also touches on the criteria for including machine learning methods in the course and his focus on providing practical knowledge rather than pure theory.

  • 00:00:00 In this section, Yaser Abu-Mostafa introduces the course outline for machine learning and explains the importance of both mathematical and practical aspects of the subject. He states that the course topics are not meant to be separate but follow a logical storyline. He then delves into the learning problem by giving an example of how a viewer would rate a movie, which is relevant for Netflix as they use it to personalize recommendations for their customers. He mentions the importance of mathematical formalization in abstracting practical learning problems and introduces the first algorithm for machine learning in the course. He also provides a survey of the types of learning and ends with an interesting puzzle.

  • 00:05:00 In this section, the lecturer explains that the essence of machine learning lies in the existence of a pattern together with the availability of data, in situations where the pattern cannot be pinned down mathematically. Using the example of movie ratings, he talks about creating a system that predicts a rating by describing the viewer's preferences as a vector of factors and matching them against the content of the movie. Although such a system works, it is not considered machine learning, since it requires human intervention to design the factors. The idea of machine learning is to solve the problem without human intervention, by finding the patterns and taking corrective actions to improve the system on its own.

  • 00:10:00 In this section, the speaker discusses the learning approach and how it reverse-engineers the rating process to find out what factors would be consistent with that rating. The machine learning process starts from random factors and nudges them toward the rating values by cycling through 100 million ratings over and over again, eventually finding meaningful factors in terms of the ratings. The speaker then uses a metaphor from a financial application, credit approval, to explain the mathematical components that make up the learning problem, which include the applicant information, the creditworthiness pattern, and the decision to approve or deny the credit.
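
The nudging process described here can be sketched in a few lines of Python. This is a hedged toy illustration of the idea only; the factor count, learning rate, and synthetic ratings are all made-up assumptions, not the actual system from the lecture.

```python
import numpy as np

# Toy sketch of factor nudging: each viewer and each movie gets a vector of
# latent factors, the predicted rating is their dot product, and every
# observed rating nudges both vectors toward it. All sizes and constants
# (5 factors, learning rate 0.01, random ratings) are illustrative.
rng = np.random.default_rng(0)
n_users, n_movies, n_factors = 100, 50, 5
U = rng.normal(scale=0.1, size=(n_users, n_factors))   # viewer factors
M = rng.normal(scale=0.1, size=(n_movies, n_factors))  # movie factors

ratings = [(rng.integers(n_users), rng.integers(n_movies), rng.uniform(1, 5))
           for _ in range(1000)]                       # (user, movie, rating)

lr = 0.01
for epoch in range(20):                # cycle through the ratings repeatedly
    for u, m, r in ratings:
        err = r - U[u] @ M[m]          # prediction error for this rating
        du = lr * err * M[m]           # gradient steps for squared error
        dm = lr * err * U[u]
        U[u] += du                     # nudge both factor vectors
        M[m] += dm
```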

  • 00:15:00 In this section, the instructor discusses the learning problem and how it applies to credit approval. The target function is the ideal credit approval formula, which is unknown, and the hypothesis is the formula created to approximate the target function. Data is used to learn the hypothesis, and a learning algorithm is used to create the formula from a set of candidate formulas known as the hypothesis set. The reasoning behind restricting the learning algorithm to the hypothesis set is to avoid the downside of having an unrestricted formula and to benefit from having a predefined set of formulas to choose from.

  • 00:20:00 In this section, the speaker explains that he has presented the learning problem as a diagram in order to discuss its solution components. He notes that the hypothesis set plays a vital role in the theory of learning, as it tells us, among other things, how well we can learn. He explains that the hypothesis set together with the learning algorithm make up a learning model, such as the perceptron model paired with the perceptron learning algorithm, whose output is the final hypothesis. He goes on to give a simple perceptron example using a credit score formula based on different attributes of a customer, which approves or denies a credit card application depending on whether a weighted score exceeds a threshold.
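
In symbols, the credit-score perceptron has the standard form (the threshold can be folded into a bias weight w_0 paired with a constant coordinate x_0 = 1):

```latex
h(x) = \operatorname{sign}\!\left(\sum_{i=1}^{d} w_i x_i \;-\; \text{threshold}\right)
     = \operatorname{sign}(w^{\mathsf T} x)
```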

  • 00:25:00 In this section, the professor discusses how to define a hypothesis h and the hypothesis set containing all hypotheses of the same functional form. Using the perceptron model, which separates the data into two regions, the learning algorithm plays with the parameters to move the separating line around in hopes of arriving at the correct solution. The professor also introduces the perceptron learning algorithm, which takes the training data and navigates through the space of hypotheses to produce the final hypothesis that is handed to the customer. The algorithm starts with random weights and moves around until it finds weights that work, which then define the final hypothesis.

  • 00:30:00 In this section, the speaker explains the perceptron learning algorithm (PLA), a linear model that classifies data points into binary categories. The algorithm uses a weight vector covering all the attributes in the data set, and when a point is misclassified, it updates the weight vector so that it behaves better on that particular point. The speaker also notes the apparent problem with this approach: fixing one point can disturb others. Nevertheless, by repeatedly picking a misclassified point and applying the iteration to it, the algorithm eventually reaches a correct solution, provided the data is linearly separable.
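
A minimal sketch of the iteration just described, assuming the data is encoded numerically with a leading constant coordinate for the bias:

```python
import numpy as np

def pla(X, y, max_iters=10_000, seed=0):
    """Perceptron learning algorithm sketch. X is N x (d+1) with a leading
    column of 1s; y holds +1/-1 labels. Converges only if the data is
    linearly separable, as noted above."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])                 # start from zero weights
    for _ in range(max_iters):
        mis = np.flatnonzero(np.sign(X @ w) != y)
        if mis.size == 0:                    # every point classified correctly
            return w
        n = rng.choice(mis)                  # pick any misclassified point
        w = w + y[n] * X[n]                  # update: behave better on that point
    return w                                 # give up after max_iters
```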

  • 00:35:00 In this section, the lecturer discusses different types of learning, starting with the most popular type, supervised learning. This type of learning involves using data with explicitly given outputs, such as customer credit behavior, to help classify future instances. The lecturer uses the example of teaching a machine to recognize different coins using physical measurements such as size and mass. The coins can be grouped based on their measurements, which can help the machine distinguish between them. Other types of learning mentioned include unsupervised learning, which will be discussed in detail later in the course, and reinforcement learning, which will be briefly introduced.

  • 00:40:00 In this section, the lecturer discusses supervised and unsupervised learning using examples of coin classification and language learning. In supervised learning, the training data and the correct output are given, and once the system is trained, it can be used to classify a future example. However, in unsupervised learning, only the input data is provided, and the target function is not known. Despite this, unsupervised learning can still be useful in grouping data into clusters and identifying patterns that can aid in future classification. The lecturer also explains how unsupervised learning can be used for language learning by immersing oneself in the language and developing a model of the language through exposure to it.
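
The lecture shows the clusters pictorially; as a concrete stand-in, here is a plain k-means sketch. K-means is this summary's choice of clustering method, not something the lecture prescribes.

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Group unlabeled points (e.g., coins described by size and mass) into
    k clusters by alternating assignment and center updates."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each point to its nearest center
        labels = np.argmin(((points[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):           # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels, centers
```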

  • 00:45:00 In this section, the video explains the concept of reinforcement learning as a method of allowing a system to learn through experience. The lecturer uses the example of a toddler touching a hot cup of tea to illustrate how reinforcement learning works. By allowing the system to make any output (even crazy ones) and gradually relying on conditioning through rewarding or punishing results, the system can eventually learn to navigate games such as backgammon. This approach is a convenient and easier method of producing the desired system instead of writing code and studying the mathematics behind it.
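
As a minimal illustration of reward-driven conditioning, here is a simple bandit loop, chosen for brevity rather than the lecture's backgammon example; the payoff numbers are invented.

```python
import numpy as np

# Try outputs (even "crazy" ones), reward the good ones, and gradually favor
# them. The three actions and their hidden payoff rates are illustrative.
rng = np.random.default_rng(0)
true_payoffs = [0.2, 0.5, 0.8]        # hidden reward probability per action
estimates = np.zeros(3)               # learned value of each action
counts = np.zeros(3)

for t in range(10_000):
    if rng.random() < 0.1:            # explore: try any action
        a = int(rng.integers(3))
    else:                             # exploit: favor what has been rewarded
        a = int(np.argmax(estimates))
    reward = float(rng.random() < true_payoffs[a])
    counts[a] += 1
    estimates[a] += (reward - estimates[a]) / counts[a]  # running average
```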

  • 00:50:00 In this section of the lecture, the professor presents a supervised learning problem to the class and online audience. The problem involves training data with some points mapped to +1 and others mapped to -1. The goal is to learn the target function and determine the value of the function for a test point. The professor emphasizes that the target function is unknown and may be anything, making it impossible to determine a pattern that applies outside of the given training set. This presents a difficult challenge for learning, requiring methods beyond simply memorizing examples.

  • 00:55:00 In this section of the lecture, the professor discusses questions from the Q&A session. He addresses the issue of linear separability and explains that while it's a simplistic assumption, there are algorithms that can deal with the case of linear inseparability, and a technique will be studied in the next week to make non-linearly separable points linearly separable. The professor also mentions that the rate of convergence of the perceptron algorithm changes with dimensionality and can build pathological cases where it will take forever. Additionally, he discusses that it's difficult to know if there is a specific pattern to detect, but there is a separation between the target function and whether we can learn it, which will be explained in a full lecture later.

  • 01:00:00 In this section of the video, the professor discusses how he tries to avoid looking at the particular data set given to him or tailoring his system towards it in order to prevent disappointment when another data set comes along. He explains that machine learning is a discipline that tries to cover the most territory with the least assumptions, and it can be applied both practically and scientifically. Furthermore, the professor mentions that optimization is a tool for machine learning, but it is not something that machine learning people study for its own sake. Finally, he notes that the hypothesis set for machine learning can be anything, either continuous or discrete.

  • 01:05:00 In this section, the professor talks about sampling bias in credit approval and how it affects the quality of data used. He explains that taking a biased sample can lead to inaccurate results, but using a customer base to make decisions can still work because the customer base is farther into the classification region. He then discusses the theoretical and practical aspects of collecting data and how much data is necessary to create a reasonable system. Finally, he addresses the issue of choosing the hypothesis set size and states that the goal of learning is to predict using data to come up with a reasonable pattern that will generalize outside the dataset.

  • 01:10:00 In this section of the lecture on the learning problem, the professor discusses the role of theory in machine learning, specifically how it measures the sophistication of a hypothesis set and the amount of data needed for making statements about generalization. The professor also covers questions from the online audience, including how to correct feedback using validation and the use of different types of functions for hypotheses. Additionally, the role of the learning algorithm and hypothesis set is discussed, focusing on how the choice of error function affects the optimization technique choice. Finally, the professor clarifies what happens if an output is exactly at the threshold for the perceptron algorithm.

  • 01:15:00 In this section of the lecture, the professor discusses the idea that there needs to be a pattern in order for machine learning to work. If there is no pattern, then there is nothing to learn. He also mentions the importance of data and how it is key to learning. The professor emphasizes the importance of going through the mathematically inclined sections of the outline in order to fully understand the components that make learning possible. He also briefly touches on the question of why the perceptron is often related to a neuron and mentions that the analogy with biology will be discussed in more detail later. Lastly, the professor mentions that model selection and Bayesian principles will be discussed later in the course.

  • 01:20:00 In this section, the speaker discusses the criteria for including machine learning methods in the course. He states that the most useful methods in practice will be included and that he aims to provide a big picture understanding of the concepts and tools to use them in practice. He mentions that there are different hierarchical methods with ramifications in generalization that he may touch upon when discussing support vector machines, but overall, his focus is on providing practical knowledge rather than pure theory.


Lecture 01 - The Learning Problem
  • 2012.08.28
  • www.youtube.com
The Learning Problem - Introduction; supervised, unsupervised, and reinforcement learning. Components of the learning problem. Lecture 1 of 18 of Caltech's M...
 


Forum on trading, automated trading systems and testing trading strategies

Machine Learning and Neural Networks

MetaQuotes, 2023.04.07 12:04

Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa



Caltech's Machine Learning Course - CS 156. Lecture 02 - Is Learning Feasible?

The lecture discusses the feasibility of learning, specifically the use of machine learning in determining patterns from given data. The lecturer introduces the concept of nu and mu in probability and how it relates to the learning problem. The addition of probability is explored, enabling the feasibility of learning without compromising the target function, meaning no assumptions need to be made about the function that will be learned. The concept of overfitting and how it relates to model sophistication is discussed, with a larger number of hypotheses leading to poorer generalization. Ultimately, the lecture concludes with a request to review the slide on the implication of nu equals mu.

  • 00:00:00 In this section, Yaser Abu-Mostafa discusses the three criteria for determining if machine learning is the right technique for an application: whether there is a pattern that can be learned, if the pattern cannot be pinned down mathematically, and if sufficient data exists to represent the pattern. Additionally, he explains that if there is no pattern, machine learning can still be tried but will fail, and if the pattern can be mathematically determined, machine learning may not be the optimal technique. Abu-Mostafa further explains supervised learning, where the target function is unknown, but the data input and output are provided, and how it is called "supervised" because the output acts as a supervisor to the learning process.

  • 00:05:00 In this section, the lecturer discusses the feasibility of learning and how it is impossible to learn an unknown function. To address this question, the lecture focuses on a probabilistic situation where a sample is taken from a bin of marbles that are either red or green with a probability of picking a red marble represented by mu. The lecture translates this situation to learning and then finds a solution to the dilemma, ultimately declaring that learning is feasible in a particular sense.

  • 00:10:00 In this section of the video, the presenter describes an experiment with an opaque bin containing marbles, where the probability of picking a red marble is mu and the probability of picking a green marble is 1 minus mu. The value of mu is unknown, and the goal is to determine whether the sample frequency nu (fraction of red marbles in a sample of marbles) can provide any information about mu. The answer is no for small samples, but for larger samples, nu can be close to mu with a higher probability, opening possibilities for statistical inference. The distinction between possible and probable is key in science and engineering.
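
A quick Monte Carlo sketch of the bin experiment (with an illustrative mu, since the real one is unknown) shows nu concentrating around mu as the sample grows:

```python
import numpy as np

# mu is the (unknown) red fraction in the bin; nu is the red fraction in a
# sample of size N. For small N, nu says little about mu; for large N, nu
# concentrates around mu. mu = 0.6 is an illustrative value.
rng = np.random.default_rng(0)
mu = 0.6
for N in (10, 100, 10_000):
    nus = rng.binomial(N, mu, size=100_000) / N     # 100k sample frequencies
    print(N, np.mean(np.abs(nus - mu) > 0.05))      # how often nu misses mu by >0.05
```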

  • 00:15:00 In this section, the lecturer introduces Hoeffding's Inequality, a formula that will be used throughout the course and later extended into statements about the VC dimension. The inequality states that the probability of the sample frequency failing to approximate the bin frequency within a given tolerance is small and diminishes exponentially with a larger sample size. However, a smaller tolerance shrinks the exponent, dampening the benefit of the negative exponential. The version of the formula with the factors of 2 is the one used, because it holds true unconditionally.
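
For reference, the inequality in the form used in the course reads:

```latex
P\big[\,\lvert \nu - \mu \rvert > \epsilon\,\big] \;\le\; 2\,e^{-2\epsilon^{2} N}
```

As a worked example of the trade-off: with tolerance epsilon = 0.1, driving the right-hand side below 0.05 requires N >= ln(2/0.05) / (2 * 0.1^2), roughly 185 samples.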

  • 00:20:00 In this section of the lecture, Hoeffding's Inequality is introduced as a tool to bound the deviation of the sample frequency from the true frequency. The inequality holds for every N and epsilon, making it a very attractive proposition despite having an exponential in it. The probability distribution of nu depends explicitly on mu, the unknown value, but the inequality itself does not depend on mu, which is an advantage. The trade-off between N and epsilon is also discussed: the smaller the epsilon, the larger the N needed to achieve the same probability bound. Finally, the logic of the statement is explained: the inequality asserts that nu is approximately mu, but it is used in the reverse direction, to infer that mu is approximately the observed nu.

  • 00:25:00 In this section of the video, the speaker discusses the concepts of mu and nu in probability and how they relate to the learning problem. He explains that while in probability the purpose is to infer mu from nu by generating different samples and computing the probability, in the learning problem the unknown quantity is an entire function whose domain could be, say, a 10th-order Euclidean space. The speaker then introduces color-coding in this scenario to indicate agreement between a hypothesis and the target function: a point is colored one way where the hypothesis agrees with the target and the other way where it disagrees. Through this mapping, probability has effectively been added to the learning problem.

  • 00:30:00 In this section, the addition of probability to the learning problem is explored. Probability is introduced to the input space by applying probability distribution over the input space, which generates points independently. The probability distribution that is introduced does not require assumptions, and the machinery can be applied to any probability distribution. The addition of probability enables the feasibility of learning without compromising the target function, meaning that no assumptions need to be made about the function that will be learned. However, the verification problem is discussed, where the situation described is equivalent to a bank seeking a specific formula for credit approval based on given data.

  • 00:35:00 In this section, the lecturer explains how to move from verifying a single hypothesis to a setting where genuine learning can take place. With a single bin, one can only verify a fixed hypothesis, such as a formula that gives a weight of 0.1 to years in residence because it contributes weakly to the credit decision. This setup does not accommodate multiple hypotheses, so learning requires choosing among several bins, which means scanning different samples. The lecturer then introduces the notation used throughout the remainder of the talk, replacing nu and mu with descriptive names: they represent the error frequency in the sample and inside the bin respectively, which introduces E_in as the in-sample error.

  • 00:40:00 In this section of the lecture, the professor introduces the notation for in-sample and out-of-sample performance. Out-of-sample performance refers to something that hasn't been seen before, and if a model performs well on out-of-sample data, it means that it has learned. The Hoeffding Inequality, which is used to measure the differences in in-sample and out-of-sample performance, is then applied to multiple bins of hypotheses, but the professor explains that it doesn't apply in this case. The reason why it doesn't apply is then discussed, and the audience is asked to flip a coin five times and record the results to illustrate the point.

  • 00:45:00 In this section, the professor describes how the Hoeffding inequality relates to the learning situation, where the data randomly falls into one of two categories. He explains that multiple bins make the problem difficult and dilute the guarantee of Hoeffding's inequality, using the probability that some coin will give five heads as an illustration. Although one of the coins may pass the test of five heads, that is no indication of the coin's real probability: with enough coins, it becomes extremely likely that something that looks significant will happen somewhere. The professor ends this section by stating that a way is needed to deal with multiple bins efficiently.
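
The dilution effect is easy to quantify; the audience sizes below are illustrative, not figures from the lecture:

```python
# Probability that at least one of M fair coins, each flipped 5 times,
# comes up all heads: the "something significant happens somewhere" effect.
p_single = 0.5 ** 5                     # one coin: 1/32, about 3%
for M in (1, 10, 200, 1000):
    print(M, 1 - (1 - p_single) ** M)   # grows quickly toward certainty
```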

  • 00:50:00 In this section, the lecturer discusses the probability of the in-sample error deviating from the out-of-sample error under the genuine learning scenario, which involves picking one hypothesis from a set based on an in-sample criterion. The probability of this bad event is less than or equal to the probability that some hypothesis in the finite set is bad, which is bounded using the union bound in probability. Although this bound is pessimistic because it ignores overlap between events, it gives an upper bound on the total probability. Each term in the bound corresponds to a fixed hypothesis and can be replaced by the Hoeffding bound. Ultimately, the probability of a bad deviation is still bounded by a term with an exponential in it, but it now includes an additional factor of M that is bothersome.
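
Written out, the chain of bounds discussed here is:

```latex
P\big[\,\lvert E_{\text{in}}(g) - E_{\text{out}}(g) \rvert > \epsilon\,\big]
\;\le\; \sum_{m=1}^{M} P\big[\,\lvert E_{\text{in}}(h_m) - E_{\text{out}}(h_m) \rvert > \epsilon\,\big]
\;\le\; 2M\,e^{-2\epsilon^{2} N}
```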

  • 00:55:00 In this section, the professor discusses the problem of overfitting and how it relates to the sophistication of the model used. With a larger number of hypotheses, the probability of something bad happening also increases. The professor explains that having a more sophisticated model can lead to memorization in-sample and poor generalization out-of-sample. The Q&A session discusses the Hoeffding Inequality and its implications, including the case when the result is trivial, and how the number of hypotheses for learning models is often infinite. The lecture concludes with a request to review slide 6 on the implication of nu equals mu.

  • 01:00:00 In this section of the video, the professor explains the concept of cause and effect in statistics and how it relates to machine learning. He emphasizes that the frequency in the sample is the effect, while the bin is the cause. This understanding is crucial when using the Hoeffding Inequality: the bin is inferred from the sample, treating mu as an unknown constant and nu as the random effect. The professor also clarifies that each h in machine learning is a hypothesis, and the model is the set of hypotheses available for selection. The complexity of the model and of individual hypotheses will be discussed later in the course. Finally, the professor discusses how to extend the framework beyond a binary response to a range of responses, which is achieved by comparing the expected value of a quantity with its sample average.

  • 01:05:00 In this section, the professor explains that learning is feasible, but the variance of the variable must be taken into consideration. He notes that the expected value and sample average of a function generalize the probability and sample frequency, which are the special case of a binary function. Additionally, he clarifies that multiple bins are necessary to represent multiple hypotheses in learning, since different hypotheses lead to different colorings of the same points. The professor also explains how picking the best hyperplane works and how learning algorithms solve this problem by choosing the specific solution they end with. Lastly, he points out that the only invocation of probability needed in learning is to put a probability distribution on X to get the benefit of the probabilistic analysis, although the Bayesian approach, which puts a probability distribution on H, will appear at the end of the course.

  • 01:10:00 In this section, the discussion centers around the flexibility of the hypotheses set (H) used in a learning algorithm. The symbol 'g' is used to denote the final hypothesis picked by an algorithm from H. However, g can be different since it refers to the entire learning process that went into picking it from the hypothesis set according to the data and learning rule. Also, it is important to note that even though the perceptron algorithm or any linear learning algorithm picks a hypothesis at each step, it is a hidden process from an analysis perspective since the aim is to pick one correct final hypothesis, g, from H. Finally, the modified Hoeffding Inequality is an extension of the plain-vanilla Hoeffding Inequality that allows one to make statements simultaneously on a number of hypotheses in the hypothesis set in order to guarantee good performance while accounting for the probability that bad things can happen.

  • 01:15:00 In this section, the professor discusses the relationship between the Hoeffding Inequality and p-values in statistics. He explains that the Hoeffding Inequality is related to estimating a sample's reliability and probability of deviation. He also notes that there are other laws of large numbers in statistics, but he focuses on this formula as the most useful for understanding the theory of generalization. The professor mentions that while studying different manifestations of in-sample being close to out-of-sample and probabilities of error is useful, it is not a core subject of the course. The lecture concludes, and students are dismissed until the next week.

Lecture 02 - Is Learning Feasible?
  • 2012.04.09
  • www.youtube.com
Is Learning Feasible? - Can we generalize from a limited sample to the entire space? Relationship between in-sample and out-of-sample. Lecture 2 of 18 of Cal...
 


Forum on trading, automated trading systems and testing trading strategies

Machine Learning and Neural Networks

MetaQuotes, 2023.04.07 12:07

Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa




Caltech's Machine Learning Course - CS 156. Lecture 03 - The Linear Model I

This lecture covers the topics of linear models in machine learning, input representation, the perceptron algorithm, the pocket algorithm, and linear regression, including its use in classification. The professor emphasizes the importance of using real data to try out different ideas and introduces the concept of features to simplify the learning algorithm's life. The lecture also discusses the computational aspects of the pseudo-inverse in linear regression, and the problems that can arise when using linear regression for classification on non-separable data. Finally, the concept of using nonlinear transformations to make data more linear is presented, with an example demonstrating how to achieve separable data using the transformation x1² and x2² from the origin.

Also the professor covers various topics related to the linear model in machine learning. He discusses nonlinear transformations and guidelines on selecting them, in-sample and out-of-sample errors in binary classification, using linear regression for correlation analysis, and deriving meaningful features from input. The professor also emphasizes the importance of understanding the distinction between E_in and E_out and how they impact model performance. Lastly, he touches on the relationship between linear regression and maximum likelihood estimation, the use of nonlinear transformations, and the role of theory in understanding machine learning concepts.

  • 00:00:00 In this section, Yaser Abu-Mostafa delves into the topic of multiple hypotheses in a model. As the probability of something bad happening accumulates across multiple hypotheses, the union bound, a mathematical rule, can be applied. This rule states that the probability of one event or another occurring is less than or equal to the sum of the individual probabilities, providing a useful tool for bounding the probability of something bad happening. When a single bin corresponds to a single hypothesis, the probability that this hypothesis is bad is small. However, a larger hypothesis set results in a large factor M, which can render the probability bound meaningless.

  • 00:05:00 In this section, the lecturer discusses the importance of linear models in machine learning and lays out the sequence of topics covered in the lecture, which moves from the perceptron to its generalization to non-separable data, then to real-valued functions, and eventually to the nonlinear case. He also introduces a practical data set of handwritten digits from postal ZIP codes that will be used to try out different ideas, emphasizing the importance of testing ideas on real data. The lecturer examines the question of input representation, highlighting the challenge of using the raw input of 256 real numbers from the 16 by 16 gray-level pixels, which would lead to too many parameters; this is solved with feature extraction.

  • 00:10:00 In this section, the video discusses the concept of input representation and the idea of features to simplify the learning algorithm's life. The lecturer gives an example of extracting descriptors of an image, such as intensity and symmetry, to obtain a higher-level representation of the raw information. By using these features, the algorithm only needs to determine the values of a few parameters instead of all 257 parameters in the original space, which is better for generalization. The lecture then presents scatter diagrams of the intensity and symmetry coordinates to illustrate how the features make the problem linearly separable and introduces the perceptron learning algorithm's role in determining the decision boundary.
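
A sketch of the two features on a 16x16 grayscale digit; the mirror-based symmetry below is one plausible reading of the lecture's definition, not its exact formula:

```python
import numpy as np

def digit_features(img: np.ndarray) -> tuple[float, float]:
    """Reduce a 16x16 grayscale digit to the two features described above.
    Intensity = average pixel value; symmetry = minus the average mismatch
    with the left-right mirrored image (a plausible definition; the course
    materials may differ in detail)."""
    intensity = img.mean()
    asymmetry = np.abs(img - np.fliplr(img)).mean()
    return float(intensity), float(-asymmetry)

features = digit_features(np.random.default_rng(0).random((16, 16)))  # toy input
```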

  • 00:15:00 In this section, we learn about the behavior of the perceptron learning algorithm when the data is not linearly separable. Because it corrects misclassifications one point at a time, the error can go up as well as down, and convergence cannot be guaranteed in such cases. To address this, the pocket algorithm is introduced: measure the in-sample error of the intermediate hypothesis at each iteration, and keep only the best one seen so far "in your pocket." At the end, the hypothesis in the pocket is reported as the final hypothesis. The pocket algorithm gives better results since it keeps the best intermediate value found rather than whatever the last iteration happened to produce, and thus the in-sample and out-of-sample errors end up much closer.
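
A hedged sketch of the pocket mechanism layered on the PLA update, reusing the conventions of the earlier PLA sketch:

```python
import numpy as np

def pocket_algorithm(X, y, max_iters=1000, seed=0):
    """Run PLA updates on (possibly non-separable) data, but keep in the
    'pocket' the weights with the lowest in-sample error seen so far."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])

    def e_in(w):
        return np.mean(np.sign(X @ w) != y)   # fraction misclassified

    best_w, best_e = w.copy(), e_in(w)
    for _ in range(max_iters):
        mis = np.flatnonzero(np.sign(X @ w) != y)
        if mis.size == 0:
            return w                          # separable after all: done
        n = rng.choice(mis)
        w = w + y[n] * X[n]                   # ordinary PLA step
        e = e_in(w)
        if e < best_e:                        # better than anything seen so far
            best_w, best_e = w.copy(), e
    return best_w                             # report the pocketed hypothesis
```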

  • 00:20:00 In this section of the lecture, Professor Abu-Mostafa discusses the pocket algorithm, which is a modified version of the perceptron learning algorithm that can be used for general inseparable data. The algorithm terminates at a certain iteration and reports the pocket value. He explains that the classification boundary of the pocket algorithm is better than that of the perceptron learning algorithm, although the data is still not perfectly separable. Linear regression is then introduced as a commonly used statistical approach for finding a relationship between variables, particularly for analyzing the relationship between different courses' GPAs and future earnings. Finally, the credit approval example is revisited to show how regression can be used to predict a customer's credit limit based on their data.

  • 00:25:00 In this section, the professor introduces the concept of linear regression and explains that it is used to predict real output values based on input variables. The output is a hypothesis that takes a linear form in terms of the input variables. The variables are encoded as inputs, and the algorithm depends on the linearity of the signal. The data set for this example is historical data from previous customers in which an officer evaluated their credit applications and determined a credit line. The goal is to replicate what the experts do in order to automate the system of determining credit lines. The linear regression algorithm measures the error and tries to find the optimal weights to determine the hypothesis that approximates f well. The standard error function used in linear regression is the squared error.

  • 00:30:00 In this section, the lecturer discusses how to estimate a credit line and the importance of defining an error measure, such as the squared error, which is commonly used in linear regression. The in-sample error gauges how well the hypothesis is doing on the data set, with each example contributing to the error. The linear regression algorithm seeks to minimize this error by finding a line that fits the data according to the squared-error rule; the same algorithm applies in higher-dimensional spaces, where the line becomes a hyperplane. The expression for E_in is presented as a squared norm in which a single matrix consolidates the different x_n's as its rows.

  • 00:35:00 In this section, the linear model is set up with the input data collected as a matrix X and a vector of outputs y. The gradient of E_in with respect to the parameter w is set to zero, which leads to a straightforward equation to solve involving X-transposed-X, an invertible square matrix. The solution is simple as a result: w = X^†y, where X^† is the pseudo-inverse of X, shorthand for the inverse of X-transposed-X multiplied by X-transposed. Because X is generally not a square matrix, it has no traditional inverse, but it does have a pseudo-inverse.
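
Spelled out, the in-sample error, its gradient, and the closed-form solution are:

```latex
E_{\text{in}}(w) = \tfrac{1}{N}\,\lVert Xw - y \rVert^{2}, \qquad
\nabla E_{\text{in}}(w) = \tfrac{2}{N}\,X^{\mathsf T}(Xw - y) = 0
\;\Longrightarrow\;
w = X^{\dagger}y, \quad X^{\dagger} = (X^{\mathsf T}X)^{-1}X^{\mathsf T}
```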

  • 00:40:00 In this section, the lecturer explains the computational aspects of the pseudo-inverse in linear regression. The formula for the pseudo-inverse involves matrix inversion and multiplication, which can be computationally intensive for large matrices. However, the lecturer notes that this is not a concern for most practical applications since there are many packages available for computing the pseudo-inverse or the solution for linear regression. To use linear regression, one must input the data in the correct format, construct the matrix X and the vector y, and then plug these into the formula for the pseudo-inverse. The resulting multiplication gives the values for w, the weights for the linear model.
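
In practice this is a couple of lines; the sketch below leans on numpy's pinv, which is numerically safer than forming the inverse explicitly. The synthetic data is illustrative.

```python
import numpy as np

def linear_regression(X, y):
    """One-step linear regression as described above: w = pseudo-inverse of X
    times y. X should carry a leading column of 1s for the bias term."""
    return np.linalg.pinv(X) @ y     # library pinv instead of inv(X.T @ X) @ X.T

# toy usage with a synthetic data set
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.random((100, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=100)
w = linear_regression(X, y)
```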

  • 00:45:00 In this section, the concept of using linear regression for classification is introduced. It is explained that binary-valued classification functions are also real-valued and linear regression can be used to approximately learn these functions. The weights obtained from linear regression can also be used as initial weights for classification algorithms like the perceptron algorithm, providing a jump start and potentially faster convergence. Additionally, the idea of using the sign of the signal obtained from linear regression to classify as +1 or -1 is discussed. Finally, the linear regression boundary is explained using an example.

  • 00:50:00 In this section of the lecture, the professor discusses the problems that can arise when using linear regression for classification, particularly when dealing with non-separable data. He demonstrates that the algorithm will try to force all values to the same classification, often resulting in errors in the classification process. He then introduces the idea of using nonlinear transformations to make the data more linear, such as in the case of determining credit line stability based on years in residence. However, he emphasizes that it is important to understand what is meant by "linear" in terms of these models for effective use.

  • 00:55:00 In this section, the lecturer discusses the importance of linearity in the weights when deriving learning algorithms like the perceptron and linear regression, as it lets the algorithms work regardless of what the x's are. This opens up the possibility of applying nonlinear transformations to the inputs without leaving the realm of linear models, because the hypothesis remains linear in the weights even when the features are nonlinear in the inputs. An example of a nonlinear transformation is given in which the data is transformed using the squared measurements x1² and x2² (distances from the origin), resulting in separable data. However, nonlinear transformation is a loaded question that is sensitive to generalization issues, so guidelines will be discussed in the next lecture.
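
A sketch of the transformation from the example, assuming inputs stored as rows [1, x1, x2]; any of the earlier linear sketches can then run on Z unchanged:

```python
import numpy as np

def transform(X):
    """Map rows [1, x1, x2] to [1, x1**2, x2**2], the squared distances from
    the origin along each axis, as in the example above."""
    return np.column_stack([np.ones(len(X)), X[:, 1] ** 2, X[:, 2] ** 2])

# Z = transform(X)     # then e.g. w = pla(Z, y) or w = linear_regression(Z, y)
```
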
  • 01:00:00 In this section, the professor discusses nonlinear transformations and guidelines on how far one can go when choosing them. He emphasizes the importance of generalization and theoretical knowledge when selecting nonlinear transformations. The discussion then moves on to in-sample and out-of-sample errors, specifically in the context of binary classification. The professor clarifies that in learning, only the in-sample error is dealt with, while the out-of-sample error is handled implicitly with the guarantee that doing well in-sample will translate to doing well out-of-sample. The distinction between probability of error and frequency of error in classification is also explained. The lecture then touches on using linear regression to determine the correlation between GPA and future income. The availability of data and the inclusion of w_0 in linear regression are also briefly discussed.

  • 01:05:00 In this section, the professor explains that the threshold is necessary for linear regression, as it compensates for the offset depending on the values of the variables, allowing for a proper model. In the binary case, when using +1 or -1 as outputs, the hypothesis from linear regression has the least squared error from the targets on the examples, and the output of the hypothesis is closest to the value +1 or -1 with a mean squared error. While this technique can work, it may not classify points correctly, as linear regression attempts to fit irrelevant points that can mess up the classification. The professor suggests using linear regression as an initial weight, and then using a proper classification algorithm to fine-tune it further. On deriving features, there is no general algorithm, and the best approach is to look at the raw input and try to infer meaningful features based on the problem statement. However, if there are too many features, it can become a problem, and that's where non-linear transformations can help simplify the feature space.

  • 01:10:00 In this section, the professor discusses the concept of features, which are any higher-level representations of a raw input. The linear model is a building block for numerous models in machine learning, and other models may give better incremental performance in some cases, but he emphasizes that the linear model does the job. The professor also highlights the difference between E_in and E_out, with E_in being easily assessed, while E_out requires theoretical guarantees that the in-sample error tracks the out-of-sample error. Additionally, he explains that linear regression can still be used for fitting a polynomial by transforming the input variable through a nonlinear transformation. Finally, he briefly talks about the relationship between linear regression least squares and maximum likelihood estimation in the statistics literature, which involves more assumptions about probabilities and noise.

  • 01:15:00 In this section, the professor talks about the relationship between the linear regression model and maximum likelihood, but prefers to present the linear regression in the context of machine learning without making too many assumptions about distributions. The professor also discusses nonlinear transformations and how they are used in machine learning, including polynomials and radial basis functions. He also addresses questions about finding patterns in pseudo-random number generators and the different treatments for continuous versus discrete responses, which depend on the problem at hand. Finally, the professor emphasizes the importance of theory in understanding machine learning techniques more deeply.

Lecture 03 - The Linear Model I
  • 2012.04.12
  • www.youtube.com
The Linear Model I - Linear classification and linear regression. Extending linear models through nonlinear transforms. Lecture 3 of 18 of Caltech's Machine ...
 


Forum on trading, automated trading systems and testing trading strategies

Machine Learning and Neural Networks

MetaQuotes, 2023.04.07 12:09

Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa




Caltech's Machine Learning Course - CS 156. Lecture 04 - Error and Noise

In Lecture 04 of the machine learning course, Professor Abu-Mostafa discusses the importance of error and noise in real-life machine learning problems. He explains the concept of nonlinear transformation using the feature space Z, which is essential in preserving linearity in learning. The lecture also covers the components of the supervised learning diagram, emphasizing the importance of error measures in quantifying the performance of the hypothesis. Noisy targets are introduced as a typical component of real-world learning problems, which must be considered when minimizing the in-sample error. The lecture ends with a discussion on the theory of learning and its relevance in evaluating the in-sample error, out-of-sample error, and model complexity.

The professor explains how changes in the probability distribution can affect the learning algorithm and how error measures can differ for different applications. He also discusses the algorithm for linear regression, the use of squared error versus absolute value for error measures in optimization, and the tradeoff between complexity and performance in machine learning models. The professor clarifies the difference between the input space and feature extraction and notes that the theory for how to simultaneously improve generalization and minimize error will be covered in the coming lectures.

  • 00:00:00 In this section, Professor Abu-Mostafa discusses the importance of error and noise when considering real-life problems in machine learning. He first revisits the concept of nonlinear transformation and how it helps to transform variables and preserve linearity in w, the weight vector, which is essential for the learning process. He then introduces the concept of error and noise in the learning diagram, acknowledging the practical considerations that arise in real-life situations. The lecture also includes an example of non-separable data that can be separated through a nonlinear transformation.

  • 00:05:00 In this section, a nonlinear transformation called Φ is discussed: every point x_n in the sample is put through the transformation to obtain the corresponding point z_n in the feature space Z, which can be a highly nonlinear space. This can make the data set linearly separable in the new feature space, where simple linear model algorithms like linear regression or classification are then applied to obtain a separating boundary. When a test point is given, it lives in the input space, so it must be put through the same transformation Φ to locate where it lies in the feature space and be classified accordingly. This procedure works in any number of dimensions and for any nonlinear transformation, but care is needed with the choice of transformation to avoid generalization problems.
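
In symbols, with weights w̃ learned in the feature space, the final hypothesis on an input-space point x is:

```latex
g(x) = \operatorname{sign}\big(\tilde{w}^{\mathsf T}\,\Phi(x)\big)
```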

  • 00:10:00 In this section, the instructor discusses the components of the supervised learning diagram and introduces the concept of error measures and noisy targets. He explains that the goal of error measures is to quantify how well or how badly a hypothesis approximates an unknown target function. The error measure is defined as E of two functions, and he emphasizes that it is a quantitative measure. He further states that noisy targets are a practical component of real-life learning problems that must be taken into consideration.

  • 00:15:00 In this section, the speaker explains how the error function is used to measure how well a hypothesis function approximates a target function in machine learning algorithms. The error function returns a number calculated by comparing the values of the two functions at the same point. This pointwise definition is the common one, and the average of pointwise errors defines the error over the entire space. The in-sample error is the average of pointwise errors over the training set, while the out-of-sample error is the expected pointwise error over the whole input space, which in practice is estimated on data held out from training. The speaker emphasizes the importance of minimizing the error function in order to arrive at an accurate hypothesis. A minimal sketch of these definitions follows.
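
The sketch below restates these definitions in code (function names and the squared/binary error choices are illustrative assumptions):

```python
import numpy as np

def pointwise_error(h_x, y, kind="binary"):
    # e(h(x), f(x)): compares the values of the two functions at one point.
    if kind == "squared":
        return (h_x - y) ** 2
    return (h_x != y).astype(float)

def E_in(h, X_train, y_train):
    # In-sample error: the average pointwise error over the training set.
    return pointwise_error(h(X_train), y_train).mean()

def E_out_estimate(h, X_test, y_test):
    # Out-of-sample error: the expected pointwise error over the input space,
    # estimated here by averaging over points the algorithm never saw.
    return pointwise_error(h(X_test), y_test).mean()
```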

  • 00:20:00 In this section, the lecturer discusses the out-of-sample error, which is the out-of-sample version of an error measure. The expectation value is obtained by averaging all points in the input space X. The binary error is the probability of error overall, which is computed using the probability distribution over the input space X. The learning diagram is updated with the addition of the error measure, which is defined on a point-by-point basis. The error measure is defined in the context of fingerprint verification with two types of errors - false accept and false reject. When defining an error measure, each type of error is penalized to obtain a better hypothesis.

  • 00:25:00 In this section, the speaker discusses the concept of error and noise in fingerprint verification systems and how machine learning can be used to create a hypothesis for accepting or rejecting individuals based on their fingerprints. The speaker notes that there is no inherent merit to choosing one error function over another and that it depends on the application domain. For example, in the case of supermarkets, false rejects are costly as they may make customers frustrated and take their business elsewhere, while false accepts are not as big of a deal. However, in the case of the CIA, false accepts could potentially lead to security breaches, which makes them more costly than false rejects. Therefore, the error matrix needs to be adjusted based on the specific application.
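
This application-dependent weighting can be made concrete with a small sketch; the cost values below are invented for illustration only and are not from the lecture.

```python
# Hypothetical penalty matrix: cost[(true label, decision)].
# +1 = the right person, -1 = an intruder; accept = +1, reject = -1.
COSTS = {
    "supermarket": {(+1, -1): 10,    # false reject: a frustrated customer
                    (-1, +1): 1},    # false accept: a minor loss
    "CIA":         {(+1, -1): 1,     # false reject: an inconvenience
                    (-1, +1): 1000}, # false accept: a security breach
}

def weighted_error(y_true, y_pred, domain):
    cost = COSTS[domain]
    total = sum(cost.get((yt, yp), 0) for yt, yp in zip(y_true, y_pred))
    return total / len(y_true)

y_true = [+1, -1, +1, -1]
y_pred = [+1, +1, -1, -1]        # one false accept, one false reject
print(weighted_error(y_true, y_pred, "supermarket"),
      weighted_error(y_true, y_pred, "CIA"))
```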

  • 00:30:00 In this section, the speaker discusses the importance of error measures in practical learning problems and explains that the error measure used should be specified by the user who will be using the imperfect system. He suggests that if the user can articulate a quantitative error function, then that is the error function to work with. However, when users don't give specific error functions, other plausible or friendly measures can be used. Plausible measures have analytic merits, while friendly measures are easy to use. The speaker modifies the learning diagram to introduce the error measure, which is crucial in making it clear what the system is supposed to learn.

  • 00:35:00 In this section, the focus is on the error measure and its role in the learning algorithm. The error measure has two main functions: to evaluate the final hypothesis and approximate the target function, and to feed the error measure to the learning algorithm to minimize the in-sample error. Additionally, noisy targets are introduced as the norm for real-life problems. The target function is not always a function and can be affected by noise from unaccounted information and circumstances, which makes it probabilistic rather than deterministic. A target distribution is used instead of a target function, where y is generated by the probability distribution given x, representing probabilistic dependence. The concept of noisy targets is addressed by introducing the idea of a deterministic target function plus noise, and this approach is used to simplify the notion of a target distribution.

  • 00:40:00 In this section, the speaker discusses the concept of noise in machine learning and how it can impact the learning process. The target function is defined as the expected value of y given x, with the remaining part called noise. If the target function is not well-defined, it can be posed as a probability distribution, and the noisy targets can be represented as a conditional probability distribution of y given x. The learning diagram for supervised learning includes the noisy targets, and the distinction is made between the probabilities of x and y given x. Despite the complexities involved, the speaker notes that every component in the learning diagram has a reason for being there.
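
A short sketch of a noisy target (the target f and the Gaussian noise level are made-up examples): the deterministic part is the conditional expectation, and the noise is everything left over.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # The deterministic part of the target: f(x) = E[y | x].
    return np.sin(np.pi * x)

x = rng.uniform(-1, 1, size=100)
noise = rng.normal(0.0, 0.2, size=x.shape)   # what x alone does not explain
y = f(x) + noise                             # samples from the target distribution P(y | x)
```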

  • 00:45:00 In this section, the speaker explains the concept of the target distribution, which is the probability distribution of creditworthiness given the input, and emphasizes that it is what you are trying to learn through supervised learning. The input distribution, on the other hand, plays the role of quantifying the relative importance of the input in the target distribution, but it is not what you are trying to learn. The speaker also cautions that mixing the two distributions, which can be done in theory, can cause confusion about the true target distribution. Lastly, the speaker introduces the theory of learning, which aims to approximate the target distribution and emphasizes its importance in gaining insight and acquiring secondary tools.

  • 00:50:00 In this section, the lecturer explains that the out-of-sample error for a function g should be close to zero, as this means good generalization. However, since this quantity is impossible to know, we can use the in-sample error as a proxy for the out-of-sample error, so long as we have the right checks in place. The full story of learning involves two questions: can we make sure the out-of-sample performance is close enough to the in-sample performance (a theoretical question), and can we make the in-sample error small enough (a practical question)? The lecturer notes that in some applications, it is impossible to get an out-of-sample performance close to zero, such as in financial forecasting where there is purely noisy data. Despite this, hedge funds can still make money by exploiting a bit of inefficiency.

  • 00:55:00 In this section of the lecture, the professor discusses the importance of the out-of-sample error and the theory that will be covered in the next two weeks. The theory deals with understanding the in-sample error, out-of-sample error, and model complexity, and formal definitions will be given to evaluate these factors. The main goal of the theory is to characterize the feasibility of learning for cases where the hypothesis set is infinite, like the perceptron and linear regression models. The theory will measure the model by a single parameter that reflects the model's sophistication, which will help make a lot of difference in practical learning. The professor also answers one question, discussing the relative impact of P of x in the learning algorithm.

  • 01:00:00 In this section, the professor discusses how changes in the probability distribution can affect the learning algorithm, particularly in the choice of learning examples. The professor explains that the probability distribution of the input plays a technical role, but its emphasis on certain parts of the space over others can affect the choices made by the algorithm. Regarding the best way to choose between N pairs of x and y or N y's per x, the professor suggests getting them independently rather than for the same input to avoid dealing with a very specific part of the input space and improve generalization. Finally, the professor notes that there is a way to measure poor generalization or good generalization, which will be part of the theory.

  • 01:05:00 In this section, the professor explains that error measures can be different for different application domains, even for the same system and same training data. He gives examples of how the right balance between false accept and false reject can differ for a supermarket and the CIA. The professor also clarifies that the structure of the probability of x (P(x)) is not a concern in supervised learning, as long as the same distribution is used for training and testing. He further explains that any probability distribution will suffice for purposes of invoking the probabilistic approach to the learning problem. Finally, the professor acknowledges a request to simplify the case of a squared error measure and closed-form solution, which he will cover in the review.

  • 01:10:00 In this section, the professor discusses how the algorithm for linear regression was derived based on minimizing squared error, resulting in a simple closed-form solution. He also explains how an imbalance in the probability of y affects the learning process and that rewards and costs are equivalent. In addition, he clarifies that the input space contains all possible points described by their input coordinates alone, while feature extraction processes the input to remove irrelevant information. Principal component analysis is another method for detecting informative directions in the input representation space. A sketch of the closed-form solution follows.
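
For reference, a minimal sketch of that closed form on synthetic data (the data-generating line and noise level are invented): minimizing the squared in-sample error gives w = (X^T X)^{-1} X^T y, i.e. the pseudo-inverse of X applied to y.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, size=50)])  # bias + one input
y = 2.0 * X[:, 1] - 1.0 + rng.normal(0.0, 0.1, size=50)

# One-step learning: the pseudo-inverse solves the squared-error minimization.
w = np.linalg.pinv(X) @ y
print(w)   # close to the generating weights (-1, 2)
```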

  • 01:15:00 In this section of the lecture, the professor discusses the use of the squared error measure versus the absolute value for error measures in optimization. He explains that squared error is a smooth function and has many desirable properties, whereas the absolute value is not smooth and can result in combinatorial optimization. However, if using the absolute value is necessary for a specific merit, it can still be used. Additionally, he clarifies that the target is the function f of x, not w transposed x, and that noise is the difference between y and the expected value of y given a specific x. Lastly, the professor notes that there is a tradeoff between complexity and performance in machine learning models, but answers to how to simultaneously improve the generalization and minimize error will be covered in the next four lectures.

Lecture 04 - Error and Noise
  • 2012.04.15
  • www.youtube.com
Error and Noise - The principled choice of error measures. What happens when the target we want to learn is noisy. Lecture 4 of 18 of Caltech's Machine Learn...
 

Caltech's Machine Learning Course - CS 156. Lecture 05 - Training Versus Testing 

Forum on trading, automated trading systems and testing trading strategies

Machine Learning and Neural Networks

MetaQuotes, 2023.04.07 12:11

Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa




Caltech's Machine Learning Course - CS 156. Lecture 05 - Training Versus Testing

In Lecture 5 of his course on Learning From Data, Professor Abu-Mostafa discusses the concepts of error and noise in machine learning, the difference between training and testing, and the growth function, which measures the maximum number of dichotomies that can be produced by a hypothesis set for a given number of points. He also introduces the break point, which corresponds to the complexity of a hypothesis set and guarantees a polynomial growth rate in N if it exists, and discusses various examples of hypothesis sets such as positive rays, intervals, and convex sets. The lecture emphasizes the importance of understanding these concepts and their mathematical frameworks in order to fully comprehend the complexity of hypothesis sets and their potential for feasible learning.

The professor covered various topics related to training versus testing. He addressed questions from the audience about non-binary target and hypotheses functions and the tradeoff of shattering points. The professor explained the importance of finding a growth function and why it is preferred over using 2 to the power of N to measure the probability of generalization being high. Additionally, he discussed the relationship between the break point and the learning situation, noting that the existence of the break point means that learning is feasible, while the value of the break point tells us the resources needed to achieve a certain performance. Finally, the professor explained the alternatives to Hoeffding and why he is sticking to it to ensure people become familiar with it.

  • 00:00:00 In this section, Professor Abu-Mostafa discusses the concepts of error and noise and how they relate to machine learning in practical situations. He explains the importance of defining error measures and how they are used to determine the performance of a hypothesis versus a target function. Additionally, he discusses the concept of noisy targets, where the target is not a deterministic function, but rather is affected by x and is distributed according to a probability distribution. Professor Abu-Mostafa also introduces the theory track that will last for the next three lectures, focusing on training versus testing and the mathematical framework that describes it in a realistic way.

  • 00:05:00 In this section, the lecturer explores the difference between training and testing in the context of a final exam. The practice problems and solutions provided before the final exam serve as the training set; the final exam serves as the testing set. The lecturer emphasizes that the goal is not to perform well on the final exam per se, but to understand the material, which is reflected in a small E_out. The mathematical description of testing is how well one performs on the final exam, while the mathematical description of training is how well one performs on the practice problems. If the practice set is contaminated (memorized during preparation), E_in becomes an optimistic score that no longer tracks real understanding. The lecturer emphasizes the need to replace the quantity M with a friendlier one in measuring the complexity of hypothesis sets.

  • 00:10:00 In this section, the speaker discusses the importance of understanding where the quantity M, the number of hypotheses in the union bound, comes from and the context surrounding it in order to replace it. The speaker explains that there are bad events, called B, and the objective is to avoid the situation where the in-sample performance does not track the out-of-sample performance. The goal is to ensure that the probability of any of the bad events is small, regardless of correlations between the events. The speaker then goes on to explain the perceptron example and how to define the bad event in terms of a picture to ensure a better bound.

  • 00:15:00 In this section, the lecturer discusses the concepts of E_in and E_out, which represent the in-sample and out-of-sample errors for a hypothesis, respectively. He then examines how the changes in E_in and E_out compare when moving from one hypothesis to another, arguing that they are small and move in the same direction due to the area of overlap between the hypotheses. The lecturer suggests that M, the previous measure of complexity, can be replaced with a new quantity that characterizes the complexity of any model, but this will require a proof in the next lecture. He introduces the quantity and emphasizes the need to understand it well before proceeding to the proof.

  • 00:20:00 In this section, the lecturer explains what dichotomies are and how they relate to hypotheses. Dichotomies are hypotheses restricted to a finite set of points, and they represent the different possible patterns of red and blue on a finite set of data points. If there are only a few achievable dichotomies, the hypothesis set is not powerful; if there are many, the hypothesis set is strong. The lecturer describes a dichotomy as an opaque sheet of paper with holes in it, placed on top of the input space, showing only the pattern of red and blue points. Dichotomies are a formal way of expressing hypotheses, where the function produces either -1 or +1 for the blue and red regions.

  • 00:25:00 In this section, the lecturer discusses the number of hypotheses and dichotomies in the case of the perceptron. He explains that there can be an infinite number of hypotheses because the perceptron's weights range over continuous values. However, the number of dichotomies is limited, as there are only finitely many points on which to return +1 or -1. The growth function, denoted m_H(N), replaces the number of hypotheses by counting the most dichotomies that one can get using the hypothesis set on any N points. The lecturer mentions that the growth function is calculated by maximizing the number of dichotomies over every possible choice of N points from the input space.

  • 00:30:00 In this section, the lecturer explains the notion of the growth function and how it applies to perceptrons. The growth function of a hypothesis set tells you the maximum number of dichotomies that can be produced for a given number of points. For perceptrons, getting the growth function is challenging because it requires finding the value for every number of points, starting from one. Additionally, beyond a certain number of points there are dichotomies that a perceptron cannot generate, no matter how the points are placed. These limitations are expected because perceptrons are simple models with a simple algorithm. A brute-force sketch follows.
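
The sketch below approximates the dichotomy count for a 2D perceptron by random search over weight vectors (a made-up helper, and an approximation that can undercount rare dichotomies); the count falls short of 2^N once N reaches 4.

```python
import numpy as np

def count_dichotomies(points, n_candidates=50_000, seed=0):
    # Estimate the number of dichotomies a 2D perceptron realizes on `points`
    # by sampling random weight vectors (rare dichotomies may be missed).
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(len(points)), points])
    seen = {tuple(np.sign(X @ rng.normal(size=3)).astype(int))
            for _ in range(n_candidates)}
    return len(seen)

rng = np.random.default_rng(3)
for N in (2, 3, 4, 5):
    pts = rng.uniform(-1, 1, size=(N, 2))
    print(N, count_dichotomies(pts), "of", 2 ** N)   # falls short of 2^N at N = 4
```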

  • 00:35:00 In this section, the lecturer discusses the concept of growth functions using examples of different models, including positive rays and positive intervals. He explains that the growth function for positive rays is N+1, since N points split the line into N+1 regions in which the threshold can fall. Meanwhile, positive intervals have a larger growth function because two parameters, the beginning and end of the interval, can be varied to obtain different dichotomies. A sketch of the positive-ray count follows.
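
A tiny sketch confirming the N+1 count for positive rays (the helper name is made up): sweep the threshold through each region the sorted points carve out of the line.

```python
import numpy as np

def positive_ray_dichotomies(x):
    # h(x) = sign(x - a): try one threshold a in each of the N + 1 regions.
    x = np.sort(x)
    thresholds = np.concatenate(([x[0] - 1.0], (x[:-1] + x[1:]) / 2, [x[-1] + 1.0]))
    return {tuple(np.sign(x - a).astype(int)) for a in thresholds}

x = np.random.default_rng(4).uniform(0.0, 1.0, size=6)
print(len(positive_ray_dichotomies(x)))   # N + 1 = 7
```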

  • 00:40:00 In this section, the lecturer discusses growth functions for hypothesis sets with varying degrees of complexity. For positive intervals on a line, the growth function counts the ways to choose 2 of the N+1 segments for the interval's ends, (N+1) choose 2, plus 1 for the case where both ends fall in the same segment (the all -1 dichotomy). For the next hypothesis set, convex regions in the plane, some candidate regions are invalid because they are not convex, so the counting is more involved. The lecturer then proposes an optimal choice for point placement, on the perimeter of a circle, to maximize the growth function for this hypothesis set.

  • 00:45:00 In this section, the lecturer discusses the growth function for convex sets, which, unlike that of positive intervals, attains the maximum possible value of 2 to the N, since points placed on the perimeter of a circle can all be shattered. The lecturer shows how the growth function works for each of the hypothesis sets. They also discuss how to replace the number of hypotheses M with a finite quantity, which can be the growth function. The lecturer concludes that if the growth function is a polynomial, then learning is feasible with that hypothesis set. However, the lecturer admits that it is not easy to evaluate the growth function explicitly.

  • 00:50:00 In this section, the concept of the break point is introduced to define the point at which a hypothesis set fails to achieve all possible dichotomies. The break point corresponds to the complexity of the hypothesis set: if no data set of size k can be shattered by the hypothesis set, then k is a break point for it. The break point for the 2D perceptron is found to be 4. The lecture also covers the examples of positive rays, intervals, and convex sets to explain how to find the break point for each hypothesis set. Additionally, it is established that if a hypothesis set has no break point, its growth function is 2 to the N for every N.

  • 00:55:00 In this section, the professor explains the concept of the growth function and how it guarantees a polynomial growth rate in N if a break point exists. With the constraint of a break point, there is an enormous combinatorial restriction that eliminates possible dichotomies in droves, reducing the unrestricted 2 to the N growth function to polynomial. The professor gives an example of a three-point hypothesis set with a break point of two, where the dichotomies are limited, and violators are removed until only one dichotomy remains, which satisfies the constraint.

  • 01:00:00 In this section, the professor answers questions from the audience about non-binary target and hypotheses functions and the tradeoff of shattering points. He explains that the theory he is developing is manageable for binary functions, but there is a counterpart for real-valued functions that is more technical, which he will cover through the bias-variance tradeoff method. In terms of shattering points, he states that it is good for fitting the data but bad for generalization, and finding the right balance between approximation and generalization is key. Additionally, he clarifies the importance of polynomial growth and how it guarantees small probabilities of something bad happening.

  • 01:05:00 In this section, the professor discusses a puzzle where 3 bits are put on every row and attempts are made to get as many different rows as possible under the constraint that no two points can be shattered. The professor goes through the exercise of adding rows while watching all possible combinations to avoid violating the constraint. By the end, the professor concludes that only four possible patterns can be achieved under this constraint, and more rows cannot be added. This illustrates how severe the break-point restriction is, and it connects to the fact that a growth function is either identically 2 to the N or bounded by a polynomial, with nothing in between.

  • 01:10:00 In this section of the lecture, the professor discusses the importance of finding a growth function and why it is preferred over using 2 to the power of N to measure the probability of generalization being high. The professor explains that finding a polynomial growth function would yield a manageable right-hand side and would lead to the probability of generalization being high. The professor also answers questions from students about the number of testing and training points, the out-of-sample error for different hypotheses, and why it is called a growth function. The professor notes that there are different methods for finding a growth function, and sometimes the estimate for the break point will be just an estimate and not an exact value.

  • 01:15:00 In this section, the professor discusses the relationship between the break point and the learning situation. He explains that the existence of the break point means that learning is feasible, while the value of the break point tells us the resources needed to achieve a certain performance. He also touches upon the alternatives to Hoeffding and why he is sticking to it. The goal is for people to become so familiar with Hoeffding that they know it cold, so that when modifications are introduced, they won't get lost.

Lecture 05 - Training Versus Testing
  • 2012.04.19
  • www.youtube.com
Training versus Testing - The difference between training and testing in mathematical terms. What makes a learning model able to generalize? Lecture 5 of 18 ...
 

Caltech's Machine Learning Course - CS 156. Lecture 06 - Theory of Generalization

Forum on trading, automated trading systems and testing trading strategies

Machine Learning and Neural Networks

MetaQuotes, 2023.04.07 12:13

Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa




Caltech's Machine Learning Course - CS 156. Lecture 06 - Theory of Generalization

The lecture discusses the theory of generalization and the growth function as the number of dichotomies that can be generated by a hypothesis set on a set of N points, with the goal being to characterize the entire growth function and generalize for every N by characterizing the break point. The speaker demonstrates the process of computing the growth function for different hypothesis sets and proving the upper bound for the growth function using combinatorial identity. The discussion also touches on using the growth function in the Hoeffding inequality, the VC bound to characterize overlaps between hypotheses and the Vapnik-Chervonenkis inequality, which is polynomial in N with the order of the polynomial decided by the break point.

The professor discusses the theory of generalization, clarifying previous points and explaining the concept of a break point, which is used to calculate the resources needed for learning. The focus of learning is on how well E_in approximates E_out, not on the value of E_in by itself, allowing the learner to work with familiar quantities. The professor also explains the reasoning behind replacing M with the growth function and how this is related to the combinatorial quantity B of N and k. While discussing regression functions, the professor emphasizes the bias-variance tradeoff and how learnability is independent of the target function. Finally, the professor notes that the same principles apply to all types of functions.

  • 00:00:00 In this section, we learn about dichotomies as mini-hypotheses that are restricted to a finite set of points and the growth function. The growth function counts the number of dichotomies that can be generated by a hypothesis set on a set of N points. The break point for perceptrons is defined as the point where patterns begin to be missed due to the use of hypotheses from a restricted set. The theoretical goal is to characterize the entire growth function and generalize for every N by characterizing the break point. We also see that a restriction on the number of patterns on a few points results in the loss of many patterns for larger numbers of points, independently of the hypothesis set and input space.

  • 00:05:00 In this section, the lecturer discusses two items: the first is showing that the growth function is polynomial with a break point and the second is demonstrating the replacement of M, the number of hypotheses, in Hoeffding's inequality. The lecturer emphasizes that they do not need to determine the particulars of the growth function, but only need to show that it is bounded by a polynomial so that it can be used in the Hoeffding inequality. The lecturer introduces a key quantity called B of N and k, which is a combinatorial quantity that represents the maximum number of dichotomies on N points with a break point k. The bound for B of N, k is found recursively by filling a table with N points and isolating the last point to introduce a recursion.

  • 00:10:00 In this section, the speaker discusses how to group rows of a matrix that represent the extension of a binary sequence. The first group, S_1, consists of rows that appear only once based on the extension. The second group, S_2, consists of rows that appear with both extensions. Using these groupings, the speaker defines the number of rows in group S_1 as alpha and the number of rows in group S_2 as beta. With these definitions, the speaker is able to find a recursion for the maximum number of rows/patterns that can be obtained on N points such that no k columns have all possible patterns.

  • 00:15:00 In this section of the lecture, the speaker discusses the theory of generalization and how to estimate beta. He explains that by analyzing the second part of the S_2 matrix, which contains repeated pattern blocks, he can argue that these pattern blocks have a break point of k minus 1, not k. He also explains that by taking alpha plus beta, which is the total number of rows or patterns in the mini-matrix, he can say something about a break point for this small matrix. He ends by stating that by putting it all together, he can estimate the full matrix and its number of rows.

  • 00:20:00 In this section, the speaker analyzes a matrix and derives a recursion formula to solve for an upper bound on B of N and k, where B of N and k is the maximum growth function of a hypothesis set with a break point of k. By computing values of B of N and k using the recursion formula, the speaker fills a table with an upper bound on B of N and k. The boundary conditions for the table are filled first and then the rest of the table is filled using the recursion formula.
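
A compact sketch of that table-filling recursion (the function name and demo values are illustrative); it also checks the closed form, a sum of binomial coefficients, that the upcoming sections derive.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B(N, k):
    # Upper bound (from the lecture's recursion) on the number of dichotomies
    # on N points with break point k.
    if k == 1:
        return 1      # boundary: no point may show both patterns
    if N == 1:
        return 2      # boundary: one point, both patterns allowed for k >= 2
    # Recursion from the alpha/beta (S_1 / S_2) row-grouping argument.
    return B(N - 1, k) + B(N - 1, k - 1)

# Closed form proved by induction: B(N, k) = sum_{i=0}^{k-1} C(N, i),
# a polynomial in N of degree k - 1.
assert all(B(N, k) == sum(comb(N, i) for i in range(k))
           for N in range(1, 12) for k in range(1, 6))

print(B(3, 2))   # 4: matches the three-point, break-point-2 puzzle
```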

  • 00:25:00 In this section, the speaker discusses the theory of generalization and talks about a table representing the maximum number of dichotomies or patterns given a specific number of points, N, and a break point, k. The speaker explains how the table is filled and how the constraint can be empty. Additionally, they present a formula that computes the maximum number of dichotomies or patterns to be an upper bound for the growth function of any hypothesis set that has a break point k, without asking any questions whatsoever about the hypothesis set or the input space.

  • 00:30:00 In this section, the lecturer discusses the induction step to prove the theorem on the formula for B of N and k. The step involves assuming that the formula holds for N-1, at both k and k-1, and then proving that it also holds for N and k. The lecturer demonstrates the process of manipulating the two formulae, merging the summations, and reducing them to a single quantity using algebra or combinatorial arguments. The aim is to establish that the given formula holds for all values of N and k, and from there the theorem is proven.

  • 00:35:00 In this section, the speaker explains the process of proving the upper bound for B of N and k, the growth function for a hypothesis set that has a break point k, using combinatorial identity. The resulting polynomial is useful because the break point is a fixed number and doesn't grow with N. The speaker then illustrates that the upper bound is polynomial in N by showing that the maximum power is N to the k minus 1, which is a constant. Finally, the speaker applies the upper bound to three examples of hypothesis sets and shows that they all satisfy the bound.

  • 00:40:00 In this section, the lecturer discusses computing the growth function for positive rays and positive intervals. By utilizing the break point, which is the only input required, he is able to bound the growth function without considering the geometry of the hypothesis set. The lecturer then applies this method to the two-dimensional perceptron, where the growth function is unknown, but it is known that the break point is 4. By using the break point, he is able to bound the growth function completely, which is important in simplifying the characterization of hypothesis sets. The lecturer then explains how this growth function can be used in the Hoeffding inequality to replace the number of hypotheses via the union bound, which is next to useless when M is large or infinite.

  • 00:45:00 In this section, the lecturer explains the pictorial proof of the polynomial boundedness of the growth function. The canvas represents the space of possible data sets, and the colored area represents the bad region where E_in deviates from E_out for certain data sets. By painting this bad region red and using the Hoeffding inequality, the lecturer shows that the colored area is small, allowing the union bound to handle multiple hypotheses. However, as more hypotheses are added, the colored area fills up the canvas, which is the problem with the union bound. The lecturer then explains the two ingredients needed: relating the growth function to the overlaps between hypotheses, and making E_out fit into a finite-sample argument.

  • 00:50:00 In this section, the lecturer introduces the VC bound as a new canvas to characterize overlaps between hypotheses. He explains that the growth function is an abstract quantity that characterizes these overlaps and tells you the number of dichotomies that behave the same way. The lecturer explains that the redundancy is captured by the growth function and that the point being colored does not only depend on the sample but also on the entire space. The lecturer overcomes this by picking two samples instead of one, which are independently generated from the same distribution, to track E_out and E_in without relying on the entire hypothesis.

  • 00:55:00 In this section, the speaker discusses whether E_in and E_in dash, computed on two different samples, track each other. As the number of bins increases, the tie between E_out and E_in becomes looser and looser. The mathematical ramifications of multiple hypotheses happen the same way here as for a single bin. As the speaker goes through the technicalities of the proof, the epsilon becomes epsilon over 2 and then epsilon over 4; when plugged in, the epsilon squared over 16 results in a factor of 1/8. The result obtained is called the Vapnik-Chervonenkis inequality, which is polynomial in N, with the order of the polynomial decided by the break point. A numerical sketch follows.
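
Numerically, the inequality reads P[|E_in - E_out| > eps] <= 4 m_H(2N) exp(-eps^2 N / 8). The sketch below evaluates it using the standard polynomial simplification m_H(N) <= N^d_VC + 1 (an assumption layered on top of the lecture, not its exact constants):

```python
import math

def vc_bound(N, d_vc, epsilon):
    # P[|E_in - E_out| > eps] <= 4 * m_H(2N) * exp(-eps^2 * N / 8),
    # with the growth function simplified to m_H(N) <= N^d_vc + 1.
    m_H_2N = (2 * N) ** d_vc + 1
    return 4 * m_H_2N * math.exp(-epsilon ** 2 * N / 8)

for N in (1_000, 10_000, 100_000):
    print(N, vc_bound(N, d_vc=3, epsilon=0.1))   # only meaningful (< 1) for large N
```
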
  • 01:00:00 In this section of the video lecture, the moderator asks the professor to clarify some points made in previous slides. The professor explains that the N points chosen in slide 5 correspond to a particular set of points in an input space in machine learning, but in the abstraction, these are simply abstract labels. The professor also clarifies that their use of alpha and beta in the lecture is merely a naming convention, and there is no assertion about the relative values of the two. Finally, the professor explains that the break point is calculated by visiting the input space and the hypothesis set and finding out, for a given hypothesis set, what is the maximum number of points that cannot be separated in every possible way.

  • 01:05:00 In this section, the professor explains that for most learning models, exact or bound break points have already been established, meaning that the resources needed to learn can be estimated before starting the learning process. Although there may be cases where the bounds are not tight, in most cases, the discrepancy between the exact estimate of the growth function and the quadratic bound will be negligible. The lecture emphasizes that the focus of learning is not on the actual value of E_in, but on its approximation to E_out, enabling the learner to work with familiar quantities. Finally, the professor assures the audience that the VC dimension, which is a building block for understanding the theories of learning, will be covered in detail in the next lecture.

  • 01:10:00 In this section, the professor explains the reasoning behind replacing M with the growth function and the modifications that needed to be made to fulfill the statement's technical requirements. The professor also clarifies the definition of B of N and k, detailing how it is an upper bound for any hypothesis set with a break point, and how it is a purely combinatorial quantity. The professor then addresses a question regarding the proof of B of N and k, stating that k does not change when reducing x_N to x_N-1 since no k columns of the smaller set can have all possible patterns. Finally, the professor notes that the analysis and VC analysis are applicable to binary functions, though they can be extended to real-valued functions.

  • 01:15:00 In this section, the professor discusses how instead of going into technical extensions on learnability, he would rather use a different approach, the bias-variance tradeoff, when discussing regression functions. He also clarifies that learnability is proven under conditions on the hypothesis set, and that it is independent of the target function. He goes on to explain that the generalization question does not depend on the target function, but the question of whether E_in can be minimized to make the user happy does depend on the target function. Finally, the professor states that the same principles apply regardless of the type of function.


Lecture 06 - Theory of Generalization
  • 2012.04.21
  • www.youtube.com
Theory of Generalization - How an infinite model can learn from a finite sample. The most important theoretical result in machine learning. Lecture 6 of 18 o...
 

Caltech's Machine Learning Course - CS 156. Lecture 07 - The VC Dimension

Forum on trading, automated trading systems and testing trading strategies

Machine Learning and Neural Networks

MetaQuotes, 2023.04.07 12:15

Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa




Caltech's Machine Learning Course - CS 156. Lecture 07 - The VC Dimension

The lecture introduces the concept of VC dimension, which is the maximum number of points that can be shattered by a hypothesis set, and explains its practical applications. The VC dimension represents the degrees of freedom of a model, and its relationship to the number of parameters in a model is discussed. Examples are given to demonstrate how to compute the VC dimension for different hypothesis sets. The relationship between the number of examples needed and the VC dimension is explored, and it is noted that there is a proportional relationship between the two. The implications of increasing the VC dimension on the performance of a learning algorithm are also discussed. Overall, the lecture provides insights into the VC theory and its practical implications for machine learning.

The video also covers the concept of generalization and the generalization bound, a positive statement that shows the tradeoff between hypothesis set size and good generalization in machine learning. The professor explains the VC dimension, which is the largest value before the first break point, and how it can be used to approximate the number of examples needed. He notes the importance of choosing the correct error measure and clarifies that the VC dimension estimate is a loose estimate that can be used to compare models and approximate the number of examples needed. The lecture ends by highlighting the commonalities between this material and the topic of design of experiments and how the principles of learning extend to other situations beyond strict learning scenarios.

  • 00:00:00 In this section, the lecturer summarizes the previous lecture's main result in learning theory: the VC (Vapnik-Chervonenkis) inequality, which characterizes generalization in machine learning. The growth function, which captures the redundancy needed to switch from the Hoeffding inequality to the VC inequality, was introduced and related to bad events with overlapping regions. The technical problem with E_out was solved, and the growth function was used to replace the number of hypotheses M. The VC dimension, which is related to the break point, is then defined and computed exactly for perceptrons in any-dimensional space. The interpretation of the VC dimension and its practical applications are also discussed.

  • 00:05:00 In this section, the concept of VC dimension is introduced as the maximum number of points that can be shattered by a hypothesis set. The VC dimension is denoted as d_VC and is the largest value of N such that the growth function is 2 to the N. It is important to note that the VC dimension does not guarantee that every N points can be shattered, but only that there exist N points that can be shattered. The section provides examples, such as the positive rays and 2D perceptrons, to demonstrate how to compute the VC dimension for a given hypothesis set. The VC dimension is used to bound the growth function of a hypothesis set, and it serves as the order of the polynomial that bounds the growth function.

  • 00:10:00 In this section, the focus is on the VC dimension of convex sets and its relation to learning. The VC dimension represents the maximum number of points that can be shattered by a hypothesis set. If the VC dimension is finite, the final hypothesis will generalize, regardless of the input distribution or learning algorithm used. The learning diagram, which includes the target function, learning algorithm, and input distribution, shows that the VC theory is independent of the learning algorithm and target function, and only depends on the hypothesis set. Overall, there are three blocks in the VC theory: the hypothesis, hypothesis set, and the VC dimension.

  • 00:15:00 In this section, we learn about the VC dimension of perceptrons, the quantity the VC theory attaches to a hypothesis set that tells us whether we are able to generalize. Although the VC dimension of perceptrons in two-dimensional space is three, a simple formula states that in d-dimensional space, the VC dimension is d plus one. This will be proved by showing that the VC dimension is at most d plus one and at least d plus one. For the lower bound, a specific set of d plus 1 points is constructed, arranged as the rows of an invertible matrix, and shown to be shattered.

  • 00:20:00 In this section, the lecturer shows a specific set of d plus 1 points and demonstrates that they can be shattered using an invertible matrix. He then poses a question to the audience about the VC dimension and asks them to choose which conclusion they can make based on the results of the demonstration. The correct answer is b, which states that the VC dimension is greater than or equal to d plus 1.

  • 00:25:00 In this section, the professor discusses how to prove that the VC dimension is at most d plus 1. He asks the audience which of several statements would establish the premise and they respond with “d." The professor then explains that he needs to show that there is a set of d plus 2 points that he cannot shatter. He does this by showing that for a set of d plus 2 points, there will always be one point that is a linear combination of the others. Therefore, he constructs a dichotomy that he shows cannot be implemented with a perceptron.

  • 00:30:00 In this section of the video, the speaker explains the concept of a dichotomy in a perceptron, which is essentially assigning labels of +1 or -1 to specific points. Through the use of algebraic properties, it is shown that it's impossible to shatter any set of d plus 2 points, with the VC dimension being d plus 1. This is due to the number of parameters in the perceptron model, which is d plus 1, and the VC dimension gives the maximum number of points that can be shattered.
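
The lower-bound half of the argument is easy to check in code. The sketch below uses a lower-triangular point matrix in the spirit of the lecture's construction (the exact matrix is an assumption): because the matrix is invertible, any desired dichotomy y is achieved exactly by w = X^{-1} y, so the d + 1 points are shattered.

```python
import numpy as np
from itertools import product

d = 3
# d + 1 points as the rows of a lower-triangular (hence invertible) matrix;
# the first column is the constant-1 bias coordinate.
X = np.tril(np.ones((d + 1, d + 1)))
X_inv = np.linalg.inv(X)

# For ANY desired dichotomy y, the weights w = X^{-1} y give X w = y exactly,
# so sign(X w) = y: the points are shattered and d_VC >= d + 1.
for y in product([-1.0, 1.0], repeat=d + 1):
    w = X_inv @ np.array(y)
    assert np.array_equal(np.sign(X @ w), np.array(y))
print("all", 2 ** (d + 1), "dichotomies achieved")
```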

  • 00:35:00 In this section, the lecture introduces the concept of VC dimension and its interpretation. The VC dimension is a measure of the degrees of freedom of a model and how it relates to the number of parameters it has. The lecture compares these degrees of freedom to knobs on an audio system, where more knobs can give you more control over the sound, but it can be challenging to use effectively. The lecture explains that the VC dimension abstracts away the details of the mathematics inside a model and focuses on its expressive power. The lecture also discusses the correspondence between the VC dimension and the degrees of freedom of various models, such as positive rays, showing that the VC dimension equals one when there is one degree of freedom, which corresponds to a model with one parameter.

  • 00:40:00 In this section, the lecturer discusses degrees of freedom and their relationship to the VC dimension in the context of simple models. While the VC dimension reflects the expressive power of a model, it is not necessarily equal to the number of parameters. By constructing an artificial example, the lecturer shows that parameters may not always contribute degrees of freedom. Instead, effective degrees of freedom are measured more reliably by the VC dimension, and the lecturer demonstrates how a model with eight parameters can have the same VC dimension as a model with only two parameters. Finally, the lecturer notes that practitioners may be interested in the number of data points needed for a system and how this can be related to the VC dimension of the hypothesis set.

  • 00:45:00 In this section, the speaker discusses the relationship between the number of examples needed and the value of the VC dimension. The VC inequality involves two performance parameters that should both be small: epsilon, the tolerance for how far E_in may be from E_out, and delta, the probability that the bound fails. After deciding on particular epsilon and delta values, the speaker explains how to determine the number of examples required by looking at a plot of the function N to the power of the VC dimension times e to the power of -N. The interesting part of the curve is where the probability is less than 1, and the speaker then explores the implications of increasing the VC dimension from 4 to 5.

  • 00:50:00 In this section, the lecturer discusses the relationship between the number of examples in a dataset and the VC dimension, which is a measure of the complexity of a learning algorithm. He uses several graphs to illustrate how the performance of the algorithm changes as the VC dimension increases, and emphasizes that the number of examples needed to achieve a certain level of performance is proportional to the VC dimension. However, he also notes that while the bounds on performance are guaranteed to follow a certain monotonicity, the actual performance may not always do so, which can be a source of frustration for practitioners.

  • 00:55:00 In this section, the lecturer discusses observations and practical applications of the VC dimension. The first lesson is that there is a proportional relationship between the VC dimension and the number of examples needed to achieve a certain level of performance. The lecturer provides a rule of thumb: roughly 10 times the VC dimension is needed to reach the comfort zone of the VC inequality, where the probability statement is meaningful. The second practical observation is that for a huge range of reasonable epsilon and delta, the rule of thumb holds true. The lecturer then simplifies the VC inequality into a formula he calls capital Omega, which depends on the growth function, and notes that as the VC dimension gets bigger, Omega gets worse. A sketch of the sample-size estimate follows.
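
A sketch of how such a sample size can be computed by fixed-point iteration on the simplified bound from the previous lecture's sketch (the tolerance values and the iteration scheme are illustrative assumptions, not the lecture's procedure):

```python
import math

def examples_needed(d_vc, epsilon=0.1, delta=0.1):
    # Approximate smallest N with 4 * ((2N)^d_vc + 1) * exp(-eps^2 * N / 8) <= delta,
    # found by iterating N -> (8 / eps^2) * ln(4 * ((2N)^d_vc + 1) / delta).
    N = 1_000.0
    for _ in range(100):
        N_next = (8 / epsilon ** 2) * math.log(4 * ((2 * N) ** d_vc + 1) / delta)
        if abs(N_next - N) < 1:
            return math.ceil(N_next)
        N = N_next
    return math.ceil(N)

for d in (3, 4, 5):
    print(d, examples_needed(d))   # grows roughly in proportion to d_vc
```
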
  • 01:00:00 In this section, the speaker discusses the concept of generalization and how having more examples can affect the growth function and polynomial behavior. He introduces the idea of the generalization bound, which is a positive statement instead of characterizing bad events. With probability greater than or equal to 1 minus delta, E_in tracks E_out, meaning that they are within Omega, which is dependent on the number of examples and the hypothesis set's VC dimension. The speaker simplifies the generalization bound by rearranging it to show that E_out is bounded by E_in plus Omega. He explains how this bound illustrates the tradeoff between the hypothesis set's size and good generalization, leading to the concept of regularization in machine learning.
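
A one-line sketch of that rearranged bound (using the same polynomial growth-function simplification as in the earlier sketches, not the lecture's exact expression): with probability at least 1 - delta, E_out <= E_in + Omega.

```python
import math

def omega(N, d_vc, delta=0.05):
    # Generalization penalty: with probability >= 1 - delta,
    # E_out <= E_in + sqrt((8 / N) * ln(4 * m_H(2N) / delta)),
    # again with m_H(N) <= N^d_vc + 1.
    return math.sqrt(8 / N * math.log(4 * ((2 * N) ** d_vc + 1) / delta))

print(omega(N=10_000, d_vc=3))   # the generalization price of model expressiveness
```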

  • 01:05:00 In this section, the professor explains that the VC dimension is the largest number of points just short of the smallest break point; every value at or above a break point is also a break point, so the smallest one is the distinguished quantity. He also clarifies that when discussing shattering N points, individuals get to pick the points to shatter. The professor explains that epsilon and delta are two performance parameters of learning, where epsilon is the approximation parameter that ensures E_in tracks E_out, while delta is the probability that the statement fails. When asked about the effect of the error measure on the number of points needed, the professor explains that with a binary error measure there is no need to worry about the variance because there is an upper bound, but other co-domains or error measures require modifications.

  • 01:10:00 In this section, the professor explains that obtaining the VC dimension exactly is rare, but they know the exact dimension for perceptrons. When it comes to neural networks, the VC dimension estimate cannot be above a certain number due to redundancies and cancellations. The professor emphasizes that the VC dimension bound is a loose estimate, but it still maintains its conceptual meaning and can be used as a guide to compare models and approximate the number of examples needed. The rule of thumb is to use at least 10 times the VC dimension to get into the interesting region of the VC inequality, which depends on the customer's desired level of accuracy. The professor notes that there are commonalities between this material and the topic of design of experiments, and the principles of learning extend to other situations beyond strict learning scenarios.

Lecture 07 - The VC Dimension
  • 2012.04.26
  • www.youtube.com
The VC Dimension - A measure of what it takes a model to learn. Relationship to the number of parameters and degrees of freedom. Lecture 7 of 18 of Caltech's...