Figure 1: Screenshot from Association Between Two Quantitative Variables web app using the Animals dataset
Figure 2: Screenshot from Association Between Two Quantitative Variables web app, showing the initial screen for building the bootstrap distribution
Figure 10: After more than 10,000 random permutations, the app displays a clear picture of the permutation distribution under independence
Figure 11: Descriptive statistics of the permutation distribution, including finding a specific quantile and how many (and what percent) of permutations resulted in a value for the correlation that is as large or larger (or as small or smaller) than the observed value.
THE ART & SCIENCE OF LEARNING FROM DATA
AGRESTI · FRANKLIN · KLINGENBERG
From this output, one could point out that sampling distributions are not necessarily normal and can be skewed. One could also compute a permutation P-value for a significance test of the null hypothesis of independence against the alternative of a positive linear relationship.
If, from the initial scatterplot, the relationship didn't seem to be linear but still monotone, you can repeat the entire exercise using Spearman's correlation coefficient.
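To make the switch to Spearman's coefficient concrete: it is Pearson's correlation computed on the ranks of the values, so it equals 1 for any strictly increasing relationship, linear or not. The following is a minimal sketch using only the Python standard library. The function names and the toy data are assumptions made for illustration; nothing here is taken from the app or from the Animals dataset.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data (made up): monotone but clearly not linear.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

print("Pearson: ", round(pearson(x, y), 3))
print("Spearman:", round(spearman(x, y), 3))
```

On this toy example Spearman's coefficient is 1 while Pearson's falls short of 1, which is the situation the paragraph above describes.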
Figure 8: Screenshot after generating a single permutation
The app shows the original dataset and corresponding scatterplot. With a permutation approach, we assume there is no relationship between the gestational period and longevity. Then, taking the longevity of one animal and pairing it with the gestational period of another animal is permissible because there is no tie between these two variables per our assumption. In effect, we could take all the longevity values, shuffle them and randomly reassign them to the animals. One such random shuffle of the longevity values is known as a permutation, and the dataset resulting from it as the permutation dataset. The app illustrates this shuffle when you click on the "Generate Permutation(s)" button once, as the following screenshot shows:
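The shuffle itself takes only a few lines of code. This is a minimal sketch, not the app's implementation: the x and y values are made up for illustration (they are not the Animals data) and the variable names are assumptions.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Made-up (x, y) pairs for illustration only.
x = [30, 60, 110, 150, 200, 240, 280, 330]
y = [4, 7, 9, 12, 10, 15, 18, 22]

# One permutation: keep the x column fixed and shuffle a copy of the y column.
y_perm = y[:]
rng.shuffle(y_perm)

# The permutation dataset pairs each x with a randomly reassigned y.
permuted = list(zip(x, y_perm))
for original_pair, permuted_pair in zip(zip(x, y), permuted):
    print(original_pair, "->", permuted_pair)
```

The shuffled column contains exactly the same values as before; only the pairing with x has changed, which is what breaks any association between the two variables.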
Figure 6: Showing summary statistics and the 2.5th percentile of the bootstrap distribution. (Note: We zoomed into the distribution to more clearly show its range.)
Data entry is straightforward: I entered variable names for the x and y variables and (optionally) an ID variable and then copied and pasted the data over from a spreadsheet I had opened in another window. You can also enter data manually. Once I clicked the "Submit Data" button, the app created a scatterplot and computed descriptive statistics for each variable as well as Pearson's correlation coefficient between the two variables to measure the strength of the relationship. You can immediately check whether this measure is appropriate by seeing whether the scatterplot reveals a roughly linear relationship. To help with this, you can optionally overlay a smooth fit, a linear regression fit or both using the checkboxes on the left. From there, you can also choose other measures for the association, such as the Spearman correlation coefficient (for instance, when the relationship is not linear but still monotonic) or Kendall's tau.
The scatterplot is interactive. Placing the mouse over a point displays its x and y values (and the ID) and highlights the corresponding row in the dataset to the left. (You may have to scroll down to see the highlighted row.) In Figure 1 above, the cursor was moved over one point to identify it as the Hippopotamus with a gestational period of 240 days and a longevity of 30 years.
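For readers who want to check the correlation the app reports on this first screen, Pearson's coefficient is easy to compute by hand. The sketch below uses only the Python standard library. The data are made-up illustrative values (not the Animals dataset) and the function name is an assumption.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative values, NOT the Animals dataset.
x = [30, 60, 110, 150, 200, 240, 280, 330]
y = [4, 7, 9, 12, 10, 15, 18, 22]

r = pearson(x, y)
print(round(r, 3))
```

A value close to 1 or -1 indicates a strong linear relationship and a value near 0 indicates none, which is why the scatterplot check for linearity matters before interpreting the number.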
You can also click on a point in the scatterplot to delete it from the dataset and see how this affects the correlation coefficient. Conversely, you can click anywhere in the scatterplot to add a point at that location and observe how this changes the correlation coefficient.
The Pearson correlation coefficient of 0.86 indicates a rather strong relationship between the gestational period and the longevity of animals. We say rather strong because 0.86 seems sufficiently removed from 0, the value we would expect to see if there were no (linear) relationship. To show this more convincingly, we can use resampling ideas to approximate the sampling distribution of the Pearson correlation coefficient using the app. We will first demonstrate this using the bootstrap, where the app visualizes what a bootstrap sample is (a sample taken with replacement from the original sample) and shows how the bootstrap distribution builds up, one step at a time. Then, we do the same with the permutation distribution.
Click on the second tab "Build Bootstrap Distribution" in the app and you get the following picture:
Figure 7: Initial screen to generate permutation samples and construct the permutation distribution
This blog entry demonstrates how to use the new Association Between Two Quantitative Variables web app. Here is a screenshot of the app in action, after having entered data on the gestational period (measured in days) and the longevity (measured in years) of various animals. (The Animals dataset from Chapter 2.)
Figure 5: Bootstrap distribution after 5,020 randomly generated bootstrap samples.
We get a clearer picture of the sampling distribution of the correlation coefficient when computing it from several thousand permutations of the original dataset. To get there, select the option to generate 1,000 permutations per click of the "Generate Permutation(s)" button and then press it a couple of times.
The original table now highlights those observations that were selected in the bootstrap sample, and the scatterplot updates to highlight those same points. The empty circles are the ones not selected. The higher the intensity of the red filling of a point, the more frequently the corresponding observation was selected for inclusion in the bootstrap sample. A second data table to the right now shows the first few rows of the generated bootstrap sample. It is easy to see that, e.g., the Deer was selected twice, the Ass and others once, while the Bear was not part of this bootstrap sample. The scatterplot below the bootstrap sample shows just those animals included in the bootstrap sample and the correlation coefficient computed from them. Comparing the original and bootstrap sample and scatterplot side-by-side helps to visualize the idea behind the bootstrap.
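The same idea can be sketched in code: draw as many rows as the original dataset has, with replacement, and count how often each row was picked. This is a minimal illustration, not the app's implementation. The row labels are hypothetical placeholders rather than the animals in the dataset.

```python
import random
from collections import Counter

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical row labels standing in for the rows of a dataset.
rows = ["A", "B", "C", "D", "E", "F", "G", "H"]

# A bootstrap sample has the same size as the original
# and is drawn WITH replacement, so repeats are allowed.
indices = [rng.randrange(len(rows)) for _ in rows]
sample = [rows[i] for i in indices]

# How often each original row appears in the bootstrap sample.
counts = Counter(sample)
for label in rows:
    print(label, counts[label])  # 0 means the row was not selected
```

Rows with a count of 2 or more correspond to the more intensely colored points in the app, and rows with a count of 0 correspond to the empty circles.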
The plot at the bottom keeps track of the correlation coefficient generated from each bootstrap sample. Right now, it only shows a marker (in yellow, labeled "obs") of the originally observed correlation and the one computed from the bootstrap sample (in blue, labeled "last"). As we generate more bootstrap samples, this plot will build up the histogram that approximates the sampling distribution of the correlation coefficient. Let's keep generating a single bootstrap sample so we can observe just that. Click the "Generate Bootstrap Sample(s)" button a couple more times (still with the default of just generating 1 sample each time the button is pressed), to see how the bootstrap sample, the corresponding scatterplot and the histogram for the sampling distribution update. After several clicks, you might get something like this:
Now that we understand what bootstrap resampling means, let's generate thousands of bootstrap samples to get a more accurate representation of the sampling distribution of the correlation coefficient. From the buttons to the left, select the option to generate 1,000 bootstrap samples in one run and press "Generate Bootstrap Sample(s)" several times to get 5,000 additional samples, in addition to the 20 we already generated. You will end up with a screen that looks like this:
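Building the bootstrap distribution amounts to repeating the resampling step many times and recording the correlation each time. The following is a minimal sketch under stated assumptions: the data are made up (not the Animals dataset), the function names are hypothetical, and resamples in which a variable has no variation are skipped because the correlation is undefined there. How the app handles that case is not documented here.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def bootstrap_correlations(x, y, n_samples, rng):
    """Correlations from n_samples bootstrap resamples of the (x, y) pairs."""
    n = len(x)
    results = []
    while len(results) < n_samples:
        # Resample whole rows so that each x stays paired with its own y.
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        # Assumption: skip resamples with no variation (correlation undefined).
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue
        results.append(pearson(xs, ys))
    return results

# Made-up illustrative values, NOT the Animals dataset.
x = [30, 60, 110, 150, 200, 240, 280, 330]
y = [4, 7, 9, 12, 10, 15, 18, 22]

rng = random.Random(2024)
boot = bootstrap_correlations(x, y, 5020, rng)
print("smallest:", round(min(boot), 3), "largest:", round(max(boot), 3))
```

A histogram of `boot` plays the role of the plot at the bottom of the app's bootstrap tab.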
Figure 3: Screenshot after generating one bootstrap sample
We now have a good idea about the distribution of Pearson's correlation coefficient and its range. While the correlation may be as low as about 0.6, the entire distribution clearly sits well above 0. Consequently, there is evidence of an association between the gestational period and longevity of animals, and one that is actually fairly strong.
Through the checkboxes on the left, you can get more information on the sampling distribution of the correlation coefficient, such as its mean, standard deviation or quartiles. You can also compute a specific quantile. For instance, as the next screenshot shows (after zooming into the distribution), the 2.5th percentile is equal to 0.63 and the 97.5th percentile (not shown) is equal to 0.97. This means that the interval from 0.63 to 0.97, which covers the middle 95% of the bootstrap distribution, describes a range of plausible values for the true correlation coefficient.
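Reading off the 2.5th and 97.5th percentiles can be sketched with Python's `statistics.quantiles`. The function name `percentile_interval_95` is an assumption, and the quantile method chosen here ("inclusive" linear interpolation) may differ from the one the app uses, so small differences in the last digit would not be surprising. The stand-in values only demonstrate the call; in practice you would pass the list of bootstrap correlations.

```python
import statistics

def percentile_interval_95(values):
    """Return the 2.5th and 97.5th percentiles of a list of values."""
    # With n=40 the 39 cut points are the 2.5%, 5%, ..., 97.5% quantiles.
    cuts = statistics.quantiles(values, n=40, method="inclusive")
    return cuts[0], cuts[-1]

# Stand-in values (evenly spaced from 0 to 1), used only to show the call.
stand_in = [i / 1000 for i in range(1001)]

lower, upper = percentile_interval_95(stand_in)
print(round(lower, 3), round(upper, 3))
```

The two returned values bracket the middle 95% of whatever distribution is passed in, which is how the interval quoted above is obtained from the bootstrap distribution.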
The sampling distribution of the correlation coefficient looks skewed to the right and extends to about 0.7. That means for a dataset with just 21 animals, we might actually see a correlation coefficient between gestational period and longevity as high as 0.7 even though there is no relationship. However, our observed value of 0.86 clearly stands out even further and seems at odds with the distribution under independence. Therefore, there is evidence for a relationship between gestational period and longevity: if there were none, a correlation coefficient as large as the one we observed would be nearly impossible.
You can get more information on the permutation distribution by clicking on the various options on the left. For instance, its 97.5th percentile equals 0.498, and anything above that value can be considered extreme. We also see that the observed value of 0.86 is almost 4 standard deviations above the mean of 0 of the permutation distribution, all pointing to a significant relationship between gestational period and longevity in animals.
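The quantities discussed here (the share of permutations with a correlation at least as large as the observed one, and the distance of the observed value from the permutation mean in standard deviations) can be sketched as follows. This is an illustration under stated assumptions, not the app's code: the data are made up (not the Animals dataset), the function names are hypothetical, and the P-value is the plain proportion for a one-sided test in the positive direction.

```python
import math
import random
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_test(x, y, n_perm, rng):
    """Observed r, one-sided permutation P-value, and standardized distance."""
    observed = pearson(x, y)
    y_perm = list(y)  # copy, so the caller's data stay untouched
    perm_r = []
    for _ in range(n_perm):
        rng.shuffle(y_perm)               # break the x-y pairing
        perm_r.append(pearson(x, y_perm))
    # Proportion of permutations with a correlation as large or larger.
    p_value = sum(r >= observed for r in perm_r) / n_perm
    # How many permutation standard deviations the observed value lies above the mean.
    z = (observed - statistics.mean(perm_r)) / statistics.stdev(perm_r)
    return observed, p_value, z

# Made-up illustrative values, NOT the Animals dataset.
x = [30, 60, 110, 150, 200, 240, 280, 330]
y = [4, 7, 9, 12, 10, 15, 18, 22]

rng = random.Random(11)
observed, p_value, z = permutation_test(x, y, 10000, rng)
print(round(observed, 3), p_value, round(z, 2))
```

A small proportion together with a large standardized distance is the pattern that points to a significant positive relationship.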
The web app features three tabs. The first tab lets users enter data seamlessly (e.g., through copy and paste from a spreadsheet), displays an interactive scatterplot with mouse-over events and allows for adding or deleting points to check the influence of outliers. It also allows for overlaying a smooth trend line, a regression line, or both and gives the option to compute Pearson's, Spearman's or Kendall's correlation coefficient and displays a basic statistical summary of the two variables.
The second tab visualizes what is meant by a bootstrap sample and, step-by-step, builds the bootstrap distribution. Students can follow along to see how the correlation varies from one bootstrap sample to the next and understand why this is so by comparing the scatterplots of the original and bootstrapped sample. Finally, after thousands of bootstrap samples are generated with ease, the student can inspect and explore the sampling distribution of the correlation coefficient (as approximated through the bootstrap) and use it to understand the magnitude of the observed correlation and ultimately find a (confidence) interval of plausible values for the true correlation.
The third tab uses the permutation idea for constructing resamples. Starting from the original dataset, the app shows how a permutation of the original dataset is created. It displays scatterplots of the original and permuted dataset side-by-side, showing how the original (x,y) pairs change through permutation. (You can select to either permute the x- or the y-values.) Clicking through several permutations, students see how the permutation distribution assuming independence builds up in real time and they can compare the correlation values computed assuming independence to the actually observed one. After generating thousands of permutations, students can analyze the permutation distribution visually and through summary statistics and see where the observed value falls in comparison, ultimately leading to a permutation P-value.
Figure 4: Screenshot after generating 20 bootstrap samples
Figure 9: Clicking the "Generate Permutation(s)" button twenty times already gives us an idea about the range of the sampling distribution of the correlation coefficient under independence
The first few rows of the original dataset are displayed, with the scatterplot underneath it. The menu to the left now gives the option to generate one (or more) bootstrap samples. Click on the button "Generate Bootstrap Sample(s)" (not changing the default of 1 for the number of samples to generate) and you get the following:
One can also approach the issue of checking whether the observed correlation of 0.86 is sufficiently larger than 0 (i.e., is significant) using a permutation approach. Click on the third tab of the app, titled "Build Permutation Distribution" to get the start screen:
The original dataset and scatterplot stay the same, but now to their right the app displays the dataset resulting from one permutation (i.e., a random shuffle) of the longevity column and the resulting scatterplot. It is easy to point out how the longevity of certain animals got perturbed. The rightmost observation in the original scatterplot (the elephant), which had the longest gestational period and largest longevity at 30 years, now doesn't have a longevity that is so extreme (in fact, only 12 years, which you can find by scrolling down in the table that shows the dataset after the permutation). Also, note the different shape of the scatterplot for the perturbed data and the corresponding correlation coefficient of -0.05.
The plot at the bottom keeps track of the correlation coefficient resulting from this permutation (the blue triangle labeled "last") and also shows the correlation from the original dataset (the yellow triangle labeled "obs"). When clicking a couple more times on the "Generate Permutation(s)" button, as shown below, the (sampling) distribution of the correlation coefficient under the assumption of independence starts to build up and we can judge whether the observed correlation falls right in line with that distribution or stands out.