
 

Hypothesis Testing: Example

Today, we'll be going through an example of hypothesis testing for the mean. Before diving into the specific example, let's review the general procedure. It always starts with setting up hypotheses, including the null hypothesis, which represents the idea we want to gather evidence against, and the alternative hypothesis, which we seek to support. Assuming the null hypothesis is true, we examine where our sample mean (X bar) falls among all possible sample means under this assumption.

To do this, we calculate a z-score, which measures how far our result deviates from what the null hypothesis predicts. For a one-sided alternative hypothesis that the population mean (μ) is less than a specific value, we compute the probability of obtaining a z-score less than or equal to the one we observed; for the alternative that μ is greater than the value, we compute the probability of a z-score greater than or equal to it. For a two-sided alternative hypothesis, we calculate the appropriate tail probability and then double it.

In the most compact form, the two-sided p-value is twice the probability of getting a z-score less than or equal to the negative absolute value of the one we obtained; doubling that left-tail value from the cumulative distribution function accounts for both the left and right tails. Once we have the p-value, we compare it to the chosen significance level (alpha). If the p-value is less than alpha, we reject the null hypothesis and conclude that the alternative hypothesis is supported.
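The three cases can be collected into one small helper. Below is an illustrative Python sketch (the function name and interface are my own, not from the original), using the standard normal CDF from the standard library:

```python
from statistics import NormalDist

def z_test_p_value(z, alternative="two-sided"):
    """p-value for a z statistic under the null hypothesis."""
    Phi = NormalDist().cdf          # standard normal CDF
    if alternative == "less":       # H1: mu < mu0
        return Phi(z)
    if alternative == "greater":    # H1: mu > mu0
        return 1 - Phi(z)
    return 2 * Phi(-abs(z))         # two-sided: double the tail beyond |z|

print(round(z_test_p_value(-2.39), 4))  # 0.0168
```

The two-sided branch is exactly the "negative absolute value" formulation described above.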

Now let's apply this to an actual example. A consumer advocacy group tests the vitamin C content of an organic supplement, which claims to have an average of 1000 milligrams of vitamin C per tablet. With a sample size of 32, they find a sample mean of 1008.9 milligrams. The population standard deviation (σ) is given as 21 milligrams. Our task is to determine if there is enough evidence to reject the product's claim. The significance level (alpha) is set at 0.05.

Following the general procedure, we start by setting up the hypotheses. The null hypothesis is that the product's claim of an average vitamin C content of 1000 milligrams is true, while the alternative hypothesis is that the true mean differs from 1000 milligrams. Since there is no specific indication to consider only values less than or greater than 1000, we opt for a two-sided alternative hypothesis.

Next, we calculate the z-score using the formula (sample mean - expected value) / (standard deviation of the sample mean). Assuming the null hypothesis, we use a mean value of 1000 milligrams and compute the standard deviation of the sample mean as σ / √n, where n is the sample size. Consequently, the z-score is found to be 2.39, indicating that our sample mean of 1008.9 milligrams lies 2.39 standard deviations above the mean expected under the null hypothesis.

To determine the p-value, we need to find the probability of obtaining a z-score as extreme as the one we have (either positive or negative). In this case, we calculate P(Z ≤ -2.39), which yields 0.0084. Since this is a two-sided test, we double the probability to obtain 0.0168.
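As a sanity check, the whole calculation can be reproduced in a few lines of Python (standard library only; tiny differences from the 2.39 and 0.0168 quoted above come from rounding intermediate steps):

```python
from math import sqrt
from statistics import NormalDist

xbar, mu0, sigma, n = 1008.9, 1000, 21, 32
se = sigma / sqrt(n)                  # standard deviation of the sample mean
z = (xbar - mu0) / se                 # ≈ 2.40
p = 2 * NormalDist().cdf(-abs(z))     # two-sided p-value, ≈ 0.0165
print(z, p)
```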

Comparing the p-value to the significance level, we find that 0.0168 is indeed less than 0.05. Therefore, we have sufficient evidence to reject the null hypothesis and conclude that the mean vitamin C content differs from the claimed 1000 milligrams.

Hypothesis Testing: Example
  • 2020.03.25
  • www.youtube.com
Another example of a two-sided hypothesis test for the mean when the population standard deviation is known. If this vid helps you, please help me a tiny bit...
 

Type I and Type II Errors in Significance Tests

Today, we'll discuss situations when significance testing doesn't go as planned. Let's cover it all in just three minutes. Let's begin.

In hypothesis testing, we encounter two possible states for H naught (the null hypothesis): it can be true or false. At the end of the test, we have two potential decisions: either rejecting H naught or not rejecting it. This gives us a total of four possible outcomes, one for each combination of a state with a decision. I have a table summarizing these outcomes, and two of them bring us satisfaction: rejecting H naught when it's false and not rejecting H naught when it's true. However, the other two situations are not desirable.

As we delve into this topic, it's important to note that we usually don't have prior information about whether H naught is true or false at the beginning. If we obtain such information, it typically comes much later. Now, let's discuss the two unfavorable outcomes. The first one is called a Type I error or false positive. This occurs when we reject the null hypothesis despite it being true. It happens when a random event takes place, and we mistakenly interpret it as significant. The second situation is a Type II error or false negative. This occurs when we fail to reject the null hypothesis, even though it's actually false. In this case, there is something significant happening, but our test fails to detect it.

The terms "false positive" and "false negative" originate from medical testing, where the logical framework is similar to significance testing. In medical tests, you could be testing for a disease, and the test may indicate its presence or absence. The Type I and Type II errors are summarized in the table provided, with the desired outcomes highlighted by check marks.

Let's quickly go through a couple of examples. Suppose a chocolate bar manufacturer claims that, on average, their bars weigh 350 grams. I suspect they might be overestimating, so I gather a sample and reject their claim with a p-value of 0.0089. However, if the manufacturer's claim was actually true, and their bars do have an average weight of 350 grams, I would have committed a Type I error or false positive.

Here's another example: A restaurant asserts that the mean sodium content of one of its sandwiches is 920 milligrams. I analyze a sample but find insufficient evidence to reject the claim with an alpha level of 0.01. If the restaurant's claim had been false, say the mean sodium content was actually 950 milligrams, I would have made a Type II error by not rejecting the claim.
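One way to make the Type I error rate concrete is to simulate many experiments in which the null hypothesis really is true and count how often a two-sided z-test (wrongly) rejects it. This is an illustrative Python sketch, not from the video; with alpha = 0.05 the long-run false-positive rate should sit near 5%:

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # ≈ 1.96 for a two-sided test
n, trials = 30, 20_000
rejections = 0
for _ in range(trials):
    # H0 is TRUE here: the data really come from N(mu0 = 0, sigma = 1)
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(sample) / n) / (1 / sqrt(n))
    if abs(z) > z_crit:
        rejections += 1                        # a Type I error (false positive)
print(rejections / trials)                     # close to alpha
```

Every rejection counted here is a false positive, since the simulated data always satisfy the null hypothesis.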

Type I and Type II Errors in Significance Tests
  • 2020.03.28
  • www.youtube.com
When hypothesis testing goes wrong, explained in under three minutes.
 

Hypothesis testing using critical regions

Hello everyone, today we will discuss hypothesis testing using critical regions. While this approach may be considered old-school, it still appears in the theory we will cover, so it's worth having a basic understanding of it.

In the past, computing p-values was more challenging than it is today. It involved relying on tables for calculations, such as those for the normal distribution, which had limited accuracy and finite entries. To minimize the need for these calculations, the concept of critical regions or rejection regions was commonly used.

The typical process for hypothesis testing today involves computing a p-value based on sample data and comparing it to the chosen significance level (alpha). However, with critical regions, we reverse this process. We start by selecting a significance level (alpha), which then defines a cutoff value for the test statistic, denoted as Z-star or T-star. If the sample data yields a sample statistic more extreme than this cutoff value, it leads us to reject the null hypothesis.

Let's consider an example to illustrate this. Suppose we have a two-sided alternative hypothesis and are conducting a test with a normal distribution and a significance level of alpha equals 0.05. In this case, alpha equals 0.05 corresponds to a shaded area of 0.05 in the distribution (0.025 on each side). By performing an inverse normal calculation (using the command qnorm in R), we find the critical value Z-star to be 1.96. Therefore, if the absolute value of the sample z-statistic exceeds 1.96, we should reject the null hypothesis.
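The same inverse-normal lookup can be done in Python with the standard library (a small sketch mirroring R's qnorm):

```python
from statistics import NormalDist

alpha = 0.05
# two-sided test: 0.025 in each tail, so take the 0.975 quantile
z_star = NormalDist().inv_cdf(1 - alpha / 2)
print(round(z_star, 2))  # 1.96
```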

For another example, let's consider a t-distribution with 8 degrees of freedom and a one-sided (right-sided) alternative. Suppose we choose alpha equals 0.01 as the significance level. In this case, there is an area of 0.01 to the right of T-star, corresponding to an area of 0.99 to the left. Using the inverse t CDF in R (qt(0.99, 8)), we find T-star to be approximately 2.9. If the sample's t-statistic is greater than 2.9, it falls within the shaded region, leading us to reject the null hypothesis.

In the case of the normal distribution, we can translate the critical Z value into a statement about a critical sample mean. Consider the following example: The contents of cans of a certain brand of Cola are normally distributed with a standard deviation of 0.2 ounces. We wish to use a sample of size 15 to test the null hypothesis that the mean contents of the cans are 12 ounces against an alternative hypothesis that they are actually less than 12 ounces. With a one-sided alternative and alpha equals 0.05, the critical Z value is -1.645. Thus, if the sample mean (X-bar) is more than 1.645 standard deviations below the mean, we should reject the null hypothesis. Specifically, if the sample mean is less than 11.92 ounces, we would reject the null hypothesis.
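The cola example's translation from a critical z to a critical sample mean can be sketched in Python (standard library only; the variable names are mine):

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, alpha = 12, 0.2, 15, 0.05
z_star = NormalDist().inv_cdf(alpha)   # ≈ -1.645 for a left-tailed test
se = sigma / sqrt(n)                   # standard deviation of x-bar
xbar_crit = mu0 + z_star * se          # reject H0 if x-bar falls below this
print(round(xbar_crit, 2))  # 11.92
```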

Hypothesis testing using critical regions
  • 2020.03.29
  • www.youtube.com
A formerly very practical idea, now mostly of theoretical interest. If this vid helps you, please help me a tiny bit by mashing that 'like' button. For more ...
 

Hypothesis Testing with the t-Distribution

Hello everyone, today we will discuss hypothesis testing using the t-distribution. In this scenario, we are dealing with situations where the standard deviation of the population is unknown. Previously, we performed hypothesis testing using Z statistics, assuming we knew the population standard deviation (Sigma). However, in statistical inference, the goal is to use sample information to gain insights about the population, so it's common not to know Sigma. In such cases, we estimate the population standard deviation using the sample standard deviation (s) and proceed with similar calculations.

The challenge arises because, when Sigma is replaced with s, the expression (X-bar - mu)/(s/sqrt(n)) no longer follows a normal distribution. Because s, like X-bar, varies with each new sample, the statistic instead follows a t-distribution with (n-1) degrees of freedom. Fortunately, once we account for this adjustment, the calculations remain largely the same.

To perform a hypothesis test when Sigma is unknown, we start with the null and alternative hypotheses. Assuming the null hypothesis is true, we compute the t-statistic for the actual sample data: (X-bar - mu_naught)/(s/sqrt(n)). We then calculate p-values based on the alternative hypothesis.

For a left-sided alternative hypothesis, where we suspect mu is less than a given value, we find the probability of obtaining a t-value less than or equal to the one we obtained when the null hypothesis is true. This corresponds to the shaded area in the first picture.

Similarly, for a right-sided alternative hypothesis, where mu is greater than a given value, we determine the probability of obtaining a t-value greater than the one we obtained. This corresponds to the area to the right of the t-value.

In the case of a two-sided test, we consider both areas. We calculate the probability of obtaining a t-value larger (in absolute value) than the one we obtained and then double it.

Once we have the p-value, we compare it to the chosen significance level (alpha) to make a decision. If the p-value is less than alpha, we reject the null hypothesis. However, when performing calculations manually, obtaining the p-value from the t-statistic can be tricky, so using technology, such as statistical software or calculators, is recommended. In R, for instance, the command pt(t, n-1) calculates the area to the left of a given t-value in a t-distribution with (n-1) degrees of freedom.

Let's consider an example to demonstrate this process. Suppose we have the weight losses of seven mice during an experiment. We want to determine if there is sufficient evidence to conclude that the mice lose weight during the experiment, with a significance level of alpha equals 0.05. Since we are not given the population standard deviation, we are dealing with a t-test situation.

To begin the test, we set the null hypothesis, assuming that the data is due to random chance, and the alternative hypothesis, which asserts that mice lose weight on average during the experiment. In this case, we choose a one-sided alternative hypothesis, focusing on weight loss rather than weight gain.

Next, we compute the t-statistic using the sample mean and sample standard deviation. With the obtained t-value, we calculate the p-value, which represents the probability of obtaining a t-value greater than or equal to the observed value by chance alone.

To evaluate this probability, we refer to a t-distribution with (n-1) degrees of freedom. We calculate the area to the right of the t-value by subtracting the area to the left from 1. In R, this can be done using the PT function. If the p-value is greater than the chosen significance level (alpha), we fail to reject the null hypothesis.

In our example, the calculated p-value is 0.059. Since 0.059 is greater than the significance level of 0.05, we do not have sufficient evidence to reject the null hypothesis. Therefore, we cannot conclude that the experiment causes mice to lose weight on average.

It's important to note that failing to reject the null hypothesis does not mean the null hypothesis is true. It simply means that the evidence is not strong enough to support the alternative hypothesis.

In summary, when dealing with hypothesis testing and the population standard deviation is unknown, we can use the t-distribution and estimate the standard deviation using the sample standard deviation. We then calculate the t-statistic, compute the p-value based on the alternative hypothesis, and compare it to the significance level to make a decision. Utilizing statistical software or tables can simplify the calculations and provide more accurate results.

Hypothesis Testing with the t-Distribution
  • 2020.04.04
  • www.youtube.com
How can we run a significance test when the population standard deviation is unknown? Simple: use the sample standard deviation as an estimate. If this vid h...
 

Significance Testing with the t-Distribution: Example

Hey everyone, today I'd like to walk you through another example of a hypothesis test using the t-distribution. This example focuses on carbon uptake rates in a specific grass species. The conventional wisdom suggests that the mean uptake rate is 34.0 micromoles per square meter per second. However, a group of researchers has their doubts. They conducted a study and obtained a sample mean of 30.6 with a sample standard deviation of 9.7. Now, at a significance level of 0.05, they want to determine if this data provides strong evidence against the conventional wisdom.

As with any significance test, let's start by stating our hypotheses explicitly. The null hypothesis, which we aim to challenge, assumes that our sample data is merely a result of random chance, and the conventional wisdom holds true. On the other hand, the alternative hypothesis seeks to establish the possibility that the true mean uptake rate is either greater or less than 34.0. In this case, we'll consider a two-sided alternative hypothesis to encompass both scenarios.

Next, we want to assess how extreme our sample mean (x-bar) is compared to what we would expect under the null hypothesis. We calculate the test statistic (T) by subtracting the expected mean under the null hypothesis (mu-naught) from the sample mean and dividing it by the sample standard deviation (s) divided by the square root of the sample size (n). This calculation yields T = -2.27.

To determine the probability of obtaining a test statistic as extreme as -2.27 due to random chance alone, we need to consider both sides of the distribution. We calculate the combined shaded area to the left of -2.27 and to the right of 2.27, which gives us the p-value of the test. In R, we can use the pt command to calculate the leftmost area, which represents the probability of T being less than or equal to -2.27, and then double it to account for both sides of the distribution.

Applying pt in R with -2.27 and degrees of freedom (df) equal to the sample size minus one (41), we find that the left shaded area is about 0.014. Doubling this value gives the total shaded area, which corresponds to the p-value of the test.
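Python's standard library has no t-distribution CDF, but the area that R's pt returns can be approximated by numerically integrating the t density. This is a rough illustrative stand-in of my own, not the video's code:

```python
from math import gamma, pi, sqrt

def t_cdf(t, df, steps=10_000):
    """Approximate P(T <= t) for a t-distribution with df degrees of
    freedom by Simpson's-rule integration of the density (like R's pt)."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    pdf = lambda x: c * (1 + x * x / df) ** (-(df + 1) / 2)
    h = abs(t) / steps
    area = pdf(0) + pdf(abs(t))
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3                       # area between 0 and |t|
    return 0.5 - area if t < 0 else 0.5 + area

p = 2 * t_cdf(-2.27, 41)                # two-sided p-value
print(round(p, 3))                      # ≈ 0.029
```

In practice you would simply call pt(-2.27, 41) in R, or scipy.stats.t.cdf in Python.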

The computed p-value is 0.029, which is smaller than our significance level (alpha) of 0.05. Therefore, we reject the null hypothesis and conclude that the mean carbon dioxide uptake rate in this grass species is not actually 34.0 micromoles per square meter per second.

In conclusion, hypothesis testing using the t-distribution allows us to evaluate the strength of evidence against the null hypothesis when the population standard deviation is unknown. By calculating the test statistic, computing the p-value based on the alternative hypothesis, and comparing it to the significance level, we can make informed decisions regarding the validity of the null hypothesis.

Significance Testing with the t-Distribution: Example
  • 2020.04.07
  • www.youtube.com
A two-sided test with unknown population standard deviation. If this vid helps you, please help me a tiny bit by mashing that 'like' button. For more stats j...
 

Hypothesis testing in R

Hello everyone! Today, we'll be conducting hypothesis testing in R using the t.test command. We'll work on a couple of problems related to the built-in air quality dataset, which we'll consider as a simple random sample of air quality measurements from New York City.

Let's switch over to R, where I've already loaded the tidyverse package, which I usually do at the beginning of my R sessions. I've also pulled up the help file for the air quality dataset. This dataset was collected in 1973, so it's not the most recent data. We can use the view command to take a look at the dataset. It consists of 153 observations on six variables, including wind and solar radiation, the two variables we're interested in.

Before conducting any statistical tests, it's good practice to visualize the data. So let's create a histogram using the qplot command. We'll focus on the wind variable and specify that we want a histogram.

Now let's move on to problem one. An official claims that the average wind speed in the city is nine miles per hour. We want to determine if this claim is plausible based on the data. We'll use a t-test with the null hypothesis that the mean wind speed is nine miles per hour. Looking at the histogram, the claim seems plausible, although the data is centered slightly to the right of that value. We'll perform the t-test using the t.test command: we pass the wind variable to it and specify the null hypothesis as mu = 9. By default, R assumes a two-sided alternative hypothesis. The t.test command provides us with the sample mean, t-statistic, and p-value. The sample mean is 9.96, and the computed t-statistic is 3.36, which corresponds to a p-value of about 0.001. With such a small p-value, it's not plausible that a deviation this large from the null hypothesis arose by random chance alone. Therefore, we reject the null hypothesis and conclude that the mean wind speed in New York is not nine miles per hour.

Moving on to problem two, we want to assess whether a certain solar array would be cost-effective if the mean solar radiation is over 175 langleys. We'll use a one-sided alternative hypothesis, where the null hypothesis is that the mean solar radiation is 175 langleys and the alternative hypothesis is that it's greater. We'll visualize the data by creating a histogram of the solar radiation variable. Again, the null hypothesis seems plausible based on the histogram. We'll perform the t-test using the t.test command, passing the solar radiation variable and specifying the null hypothesis as mu = 175. Additionally, we need to indicate the one-sided alternative hypothesis using the alternative = "greater" argument. The t.test command provides us with the sample mean, t-statistic, and p-value. The sample mean is 185.9, and the computed t-statistic is 1.47, resulting in a p-value of 0.07. With a p-value of 0.07, we do not have compelling evidence that the mean solar radiation in New York is over 175 langleys, the threshold for justifying the purchase of the solar array. Therefore, we should refrain from buying the array until further study can pin down the mean solar radiation more accurately.
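Collected in one place, the session described above looks roughly like this in R (qplot is loaded with the tidyverse via ggplot2; this is my reconstruction of the commands named in the transcript, not a verbatim listing):

```r
library(tidyverse)

# Look at the data before testing
View(airquality)
qplot(Wind, data = airquality, geom = "histogram")

# Problem 1: test of H0: mu = 9 (two-sided is the default alternative)
t.test(airquality$Wind, mu = 9)

# Problem 2: one-sided test of H0: mu = 175 vs H1: mu > 175
qplot(Solar.R, data = airquality, geom = "histogram")
t.test(airquality$Solar.R, mu = 175, alternative = "greater")
```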

In summary, hypothesis testing using the t-test allows us to evaluate the plausibility of claims or hypotheses based on sample data. By specifying the null and alternative hypotheses, performing the test, and examining the resulting p-value, we can make informed decisions about rejecting or failing to reject hypotheses. Visualizing the data with histograms or other graphs can provide additional insight during the analysis.

Hypothesis testing in R
  • 2022.03.30
  • www.youtube.com
Hypothesis testing in R is easy with the t.test command! If this vid helps you, please help me a tiny bit by mashing that 'like' button. For more #rstats joy,...
 

Hypothesis Testing for Proportions

Hello everyone! Today, we will continue our exploration of hypothesis testing, this time focusing on proportions. We'll approach this topic by examining an example to understand the key concepts involved.

Let's dive right in. A commentator claims that 30% of six-year-olds in the United States have a zinc deficiency. We want to evaluate this claim by collecting a sample and conducting a hypothesis test at a significance level of α = 0.05. To investigate further, we gather data by surveying 36 six-year-olds and find that 5 of them have zinc deficiencies, which is less than 30%. However, we need to determine if this difference could be attributed to random chance alone. Our main question is: How unlikely is it to obtain a sample like this?

To address this question, we compare the sample proportion (P-hat) we obtained (5 out of 36) with the proportion claimed under the null hypothesis. Let's denote the population proportion as P₀ or P-naught. Our null hypothesis assumes that the population proportion is 0.30 (30%). The alternative hypothesis, in this case, is simply that the population proportion is not equal to 0.30. We don't have a specific reason to assume it's greater or less than 30%, so we consider both possibilities. By default, we opt for a two-sided alternative unless there is a compelling reason for a one-sided alternative.

The sample proportion (P-hat) we calculated is 0.139, well below 30%. But is this difference statistically significant? To evaluate this, we analyze the sampling distribution of P-hat. We imagine obtaining samples of the same size repeatedly and calculating the proportion of zinc deficiencies each time. Assuming the sample size (n) is large (which is the case here with n = 36), the sampling distribution will have a bell-shaped curve. We can determine its center and spread. The mean of the sample proportion (P-hat) will be the same as the population proportion (P), while the standard deviation of P-hat will be the square root of P(1-P)/n. If you need a more detailed explanation, I recommend watching my video on confidence intervals for proportions.

Now that we know the sampling distribution follows a bell-shaped curve with known mean and standard deviation, we can compute a z-score. We calculate the difference between the observed value (P-hat) and the expected value (P-naught) and divide it by the standard deviation. Plugging in the values (P-hat = 0.139, P-naught = 0.30, n = 36) yields a z-score of -2.11.

To assess the probability of obtaining a P-hat as extreme as the one we observed (or even more extreme), we examine the corresponding z-scores. In this case, we are interested in the probability of getting a z-score less than -2.11 or greater than 2.11. We can calculate this by evaluating the cumulative distribution function (CDF) of the standard normal distribution. Using statistical software or web apps, we find that the probability of obtaining a z-score less than -2.11 is approximately 0.017. However, since we are considering both tails of the distribution, we need to double this value, resulting in a p-value of approximately 0.035.
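The z-score and p-value above can be checked with a short Python sketch (standard library only):

```python
from math import sqrt
from statistics import NormalDist

p_hat, p0, n = 5 / 36, 0.30, 36
se = sqrt(p0 * (1 - p0) / n)             # sd of p-hat under H0
z = (p_hat - p0) / se                    # ≈ -2.11
p_value = 2 * NormalDist().cdf(-abs(z))  # two-sided, ≈ 0.035
print(round(z, 2), round(p_value, 3))
```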

Comparing the p-value to our chosen significance level (α = 0.05), we find that the p-value is less than α. Therefore, we reject the null hypothesis and conclude that the commentator's claim is likely false. The proportion of six-year-olds in the United States with zinc deficiencies is not 30%.

When it comes to sample size and the normal approximation, there are a couple of rules of thumb to keep in mind. The normal approximation tends to work well when we can expect at least five successes and five failures. In a hypothesis test, these expected counts are computed from the hypothesized proportion: n·p₀ and n·(1 − p₀) should both be at least five.

In our case, n·p₀ = 36 × 0.30 = 10.8 and n·(1 − p₀) = 36 × 0.70 = 25.2, both comfortably above five, so the conditions for the normal approximation are satisfied. Therefore, we can confidently rely on the normal distribution for our statistical inference.

It's also worth noting that, in general, larger sample sizes tend to yield better results with the normal approximation. As the sample size increases, the normal distribution becomes a more accurate representation of the sampling distribution of P-hat.

So, in summary, we can conclude that the sample size of 36 in our example is sufficiently large for us to utilize the normal approximation in our hypothesis testing.

I hope this clarifies the role of sample size in the normal approximation and provides a comprehensive explanation of the hypothesis testing process for proportions.

Hypothesis Testing for Proportions
  • 2020.05.09
  • www.youtube.com
How should we run a hypothesis test when we have data involving percentages, proportions, or fractions? Using a normal approximation, of course, at least whe...
 

Hypothesis Testing for Proportions: Example

Hello everyone! Today, we'll work on an example of a hypothesis test for proportions. Let's dive into the problem. A university claims that 65% of its students graduate in four years or less. However, there are doubts about the accuracy of this claim. To investigate further, a simple random sample of 120 students is taken, and it is found that only 68 out of the 120 students (about 56.7%) graduated within the specified time frame. As this proportion is less than the claimed 65%, it provides evidence against the university's assertion. Now, the question is whether this evidence is strong enough to suggest that the claim is unlikely or if the discrepancy could be attributed to random chance. To determine this, we'll calculate a p-value and make a decision using a significance level (α) of 0.05.

Firstly, we need to formulate the null and alternative hypotheses. The null hypothesis states that the results are solely due to random chance and that the true proportion of students graduating in four years or less is indeed 0.65. On the other hand, the alternative hypothesis suggests that the university is overestimating its graduation rate, and the population proportion is less than 0.65. In this case, a one-sided alternative hypothesis is appropriate since we are solely interested in the possibility of the graduation rate being lower than 65%.

Assuming the null hypothesis is true, we can apply the central limit theorem, which states that when the sample size (n) is large enough, the sampling distribution of the proportion (P-hat) will be approximately normal. The mean of this distribution is equal to the population proportion (P), and the standard deviation is given by the square root of P times (1 minus P) divided by n. In our case, since we assumed the null hypothesis is true, the population proportion (P) is 0.65.

Now, let's calculate the z-score to determine the probability of obtaining a result as extreme as or more extreme than the observed proportion by random chance alone. By plugging in the values, we find a z-score of -1.91. To find the probability associated with this z-score, which represents the likelihood of obtaining a proportion less than or equal to the observed one, we use the normal cumulative distribution function (CDF). This can be done using various tools like tables, web apps, or statistical software. For instance, in R, the command pnorm(-1.91) yields a value of 0.028.
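The same arithmetic in Python (a standard-library equivalent of the pnorm call):

```python
from math import sqrt
from statistics import NormalDist

p_hat, p0, n = 68 / 120, 0.65, 120
se = sqrt(p0 * (1 - p0) / n)       # sd of p-hat under H0
z = (p_hat - p0) / se              # ≈ -1.91
p_value = NormalDist().cdf(z)      # one-sided (left-tail) p-value, ≈ 0.028
print(round(z, 2), round(p_value, 3))
```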

Comparing this p-value with the significance level (α) of 0.05, we observe that the p-value is less than α. Therefore, we reject the null hypothesis, indicating that it is reasonable to conclude that the university has been overestimating its four-year graduation rate.

Hypothesis Testing for Proportions: Example
  • 2020.05.10
  • www.youtube.com
A complete example of a hypothesis test for a proportion using the normal approximation.
 

Introduction to Scatterplots

Hello everyone! Today, we'll delve into scatter plots, which are visual displays of data that involve multiple variables collected simultaneously. Scatter plots are crucial as they frequently arise in real-world data collection scenarios. Often, we gather more than one piece of information. For instance, we might have SAT math and verbal scores for a group of students, heights and weights of individuals in a medical study, or data on engine size and gas mileage for various cars. In each case, the data is paired, meaning each value of one variable is recorded together with a corresponding value of the other. When such paired data exists, we can construct scatter plots.

Let's consider an example using a table. Each column in the table represents a scientific or engineering field, with the number on top indicating the number of PhDs awarded to women in that field in 2005, and the number at the bottom indicating the number of PhDs awarded to men in the same year. By plotting this data, where women's PhDs are represented by the x-values and men's PhDs by the y-values, we obtain a set of points. Some points are labeled, such as (2168, 2227), which corresponds to the second data column in the table. It represents a scientific field where 2168 PhDs were awarded to women and 2227 were awarded to men in 2005.

When examining scatter plots, it is valuable to describe them qualitatively. In this example, we observe a general downward trend in the data, although there are instances where values increase as we move from left to right. Overall, the shape of the data tends to slope downward, indicating a negative association between the two variables. However, it is important to note that we should refrain from using the term "negative correlation" unless the association is linear, meaning the graph follows a straight line. In this case, the data does not exhibit a linear relationship.

Another noteworthy aspect of this plot is the outlier in the upper right corner. Outliers can fall into various categories, such as data entry errors, unusual cases that impact analysis, or interesting phenomena that require further investigation. Lastly, it is crucial to consider which variable to place on the horizontal axis and which one on the vertical axis. If one variable naturally explains or influences the other in a study, it should be placed on the horizontal axis as the explanatory variable. Conversely, the variable being explained or influenced should be on the vertical axis as the response variable. For instance, in the example of gas mileage, it makes sense to view mileage as being explained by engine size (displacement), so we place mileage on the vertical axis. However, this choice may involve some subjectivity, and there may be scenarios where the roles are reversed, depending on the study's context.
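The gas mileage convention can be sketched with R's built-in mtcars data set, which happens to contain both engine displacement (disp) and mileage (mpg):

```r
# Engine displacement explains gas mileage, so displacement goes on the
# horizontal (explanatory) axis and mpg on the vertical (response) axis.
plot(mpg ~ disp, data = mtcars,
     xlab = "Engine displacement (cubic inches)",
     ylab = "Gas mileage (mpg)")
```

The formula interface `y ~ x` makes the explanatory/response roles explicit: the variable on the right of the tilde lands on the horizontal axis.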

Introduction to Scatterplots
  • 2020.04.11
  • www.youtube.com
What is a scatterplot? How do we construct them? How do we describe them?
 

Scatterplots and Correlation

Hello everyone! Today, we'll provide a brief introduction to correlation. We'll cover this topic in just three minutes. Let's get started!

When we examine a scatter plot, sometimes we observe a linear relationship where the data roughly follows a straight line. In such cases, we can discuss the correlation between the variables. However, it's important to resist the temptation of using the term "correlation" when variables have a relationship other than a linear one. Correlations can be weak or strong and can be positive or negative.

A positive correlation indicates that as we move from left to right on the graph, the general shape of the data points inclines upward. Conversely, a negative correlation implies that the general shape of the data points descends as we read from left to right. Stronger correlations are characterized by data points clustering more tightly around the imagined line, while weaker correlations display more scattered data points.

To quantify correlation, we use a statistic called the correlation coefficient (often denoted as "r"). It ranges between -1 and 1. Values closer to 0 indicate cloudier or more dispersed data. In the examples provided, a correlation of 0.4 or -0.4 represents a moderate correlation, while 0.9 or -0.9 signifies a stronger correlation. A correlation of 1 or -1 indicates a perfect linear relationship, where all the data points lie precisely on the line.

It's important to note that the correlation coefficient "r" should not be confused with the slope of the line. The sign of "r" indicates whether the slope is positive or negative, but "r" itself does not specifically represent the slope. Instead, the coefficient of correlation reflects how spread out the data is from the line that is imagined to pass through the center of the data.

When variables do not exhibit a linear relationship, we say they are uncorrelated. Take caution when interpreting the coefficient of correlation in such cases. Even if there is a clear association between the variables, as in a parabolic shape, computing the correlation would yield a value close to zero.
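The parabola warning can be demonstrated in one line. Here y is a perfect function of x, yet the correlation comes out as zero (the x-values are made up for illustration):

```r
# A clear but non-linear association: y is exactly a parabola in x.
x <- -5:5
y <- x^2
cor(x, y)   # essentially 0, despite the perfect pattern
```

The lesson: a near-zero r means "no linear relationship," not "no relationship."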

Now, let's discuss computing correlation. In short, it is not recommended to calculate it manually. Fortunately, we have tools like software packages to help us. In R, for example, the command is "cor". By providing the X and Y values (the two variables we want to correlate), we can immediately obtain the correlation coefficient. With the given table, if we assign the first row as X and the second row as Y, we can simply use the command "cor(X, Y)" to obtain the correlation value. In this example, we get a correlation of 0.787, indicating a moderate positive correlation.

Scatterplots and Correlation
  • 2020.04.14
  • www.youtube.com
Let's talk about relationships between quantitative variables!