The Art & Science of Learning from Data
Agresti, Franklin, Klingenberg
Figure 6: Screenshot of the Analyzing Multivariate Relationships app for the preloaded SAT Score dataset
Figure 5: Screenshot of the app when clicking on the second tab “Scatterplot Matrix”
Figure 1: Screenshot of the Analyzing Multivariate Relationships app on start-up, displaying a scatterplot
SAT Score and Student Expenditure
A different dataset preloaded with the app explores average SAT scores and per-student expenditures across the 50 US states (click “More Info on Dataset” to learn more). At first glance, it seems that the more a state spends per student, the lower its average SAT score, which may lead some to believe that increased student spending has an effect opposite to the one hoped for.
One can clearly see a pattern in the colored points. Points with lighter colors (yellow and green) tend to be counties with both high education levels and high crime rates, while points with darker colors tend to be counties with lower education levels and lower crime rates. You can create a Scatterplot Matrix for all three variables (Education, Crime Rate, Urbanization) on the second tab of the app:
However, the participation rate in the SAT varies widely by state. In some states, between 60% and 80% of students take the SAT, while in others fewer than 20% participate. This is important because in states with low SAT participation, those taking the SAT may be a select group of just the top students who want to go on to college, and hence their average SAT score tends to be higher than the average SAT score in a state where 80% of students participate. Analyzing the relationship between SAT score and expenditure would be incomplete without accounting for the percentage of students taking the SAT. In the app, you can do this by selecting the variable SAT Participation as the grouping variable, where a state’s SAT participation is categorized as “low”, “medium” or “high”. This displays the relationship for each group separately, giving a more nuanced picture of the relationship.
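To see how a grouping variable can reverse an apparent trend, the following Python sketch compares the least-squares slope for all states combined with the slope within each participation group. The numbers are invented purely for illustration (they are not the app's SAT dataset), and the helper name `slope` is an assumption of this sketch, not something taken from the app.

```python
def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical data: (expenditure in $1000s per student, average SAT score),
# keyed by SAT participation category. Values are made up for illustration.
groups = {
    "low":    ([4.0, 5.0, 6.0],  [1050.0, 1060.0, 1070.0]),
    "medium": ([6.0, 7.0, 8.0],  [950.0, 960.0, 970.0]),
    "high":   ([8.0, 9.0, 10.0], [880.0, 890.0, 900.0]),
}

# Slope when all states are pooled and participation is ignored.
all_x = [x for xs, _ in groups.values() for x in xs]
all_y = [y for _, ys in groups.values() for y in ys]
overall_slope = slope(all_x, all_y)

# Slope within each participation group.
group_slopes = {name: slope(xs, ys) for name, (xs, ys) in groups.items()}

print("overall slope:", overall_slope)
for name, s in group_slopes.items():
    print(f"slope within {name} participation group:", s)
```

In this made-up example the pooled slope is negative while the slope within every group is positive, which is the same kind of reversal the app reveals once SAT Participation is chosen as the grouping variable.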
Figure 4: Screenshot of the app when selecting a quantitative grouping variable
The scatterplot shows a positive relationship: the higher the percentage of residents with a high school education in a county, the higher the county’s crime rate. You can use your mouse to hover over points in the scatterplot to see which county they represent. (This information comes from the ID variable selected in the app.) Try to identify the county with the highest crime rate. (See the next screenshot.)
Click on “Linear Regression Fit” under “Show Trend” to display a linear trend line in the scatterplot.
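Outside the app, a linear trend line of this kind can be computed by ordinary least squares. The Python sketch below is a minimal illustration: the function name `fit_line` and the five data points are hypothetical and are not taken from the app's dataset, and the app's own fitting code may differ.

```python
def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical values for five counties (made up for illustration):
# percentage of residents with a high school education, and crime rate.
education = [55.0, 60.0, 65.0, 70.0, 75.0]
crime_rate = [20.0, 35.0, 30.0, 55.0, 60.0]

slope, intercept = fit_line(education, crime_rate)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
# prints: slope = 2.00, intercept = -90.00
```

A positive slope corresponds to a trend line that rises from left to right in the scatterplot.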
This positive relationship between education and crime rate seems counterintuitive. There are probably other factors contributing to the crime rate that we need to consider to get a more complete picture. Let’s consider the urbanization level of the counties, which is another variable recorded in the dataset.
Go to the drop-down menu for the Grouping Variable and select “Urbanization (Categorical)” as a grouping variable. This updates the scatterplot to look like this:
Figure 2: Screenshot of the app when adding a linear regression fit to indicate the trend
Figure 3: Screenshot of the app when selecting a grouping variable (here: Urbanization)