The Art & Science of Learning from Data

Agresti    Franklin    Klingenberg

SAT Score and Student Expenditure

A different dataset preloaded with the app explores average SAT scores and per-student expenditures across the 50 US states (click “More Info on Dataset” to learn more). At first glance, it seems that the more a state spends, the lower its SAT scores are, on average. This may lead some to believe that increased student spending has the opposite of the intended effect.

One can clearly see a pattern in the colored points. Those with lighter colors (yellow and green) tend to be counties with both high education levels and high crime rates, while those with darker colors tend to be counties with lower education levels and lower crime rates. You can create a Scatterplot Matrix for all three variables (Education, Crime Rate, Urbanization) on the second tab of the app:

Figure 2: Screenshot of the app when adding a linear regression fit to indicate the trend

Figure 6: Screenshot of the Analyzing Multivariate Relationships app for the preloaded SAT Score dataset

The points in the scatterplot are now color- and shape-coded to represent three types of counties: rural (the dataset description defines these as counties with no urbanization), mixed (counties with 15% - 50% of residents living in metropolitan areas), and urban (counties with more than 50% of residents living in metropolitan areas).

With the “Linear Regression Fit” checkbox still selected, the plot now also shows a separate trend line for each of the three groups. Within a given group, the relationship between education and crime rate is now much weaker (or non-existent). For the group of urban counties, it might actually be slightly negative. This shows that analyzing the association between two variables while adjusting for a third may lead to a different conclusion than analyzing the two variables alone. The relationship of both education and crime rate with urbanization was the driving force behind the relationship between the first two. The scatterplot matrix available from the second tab of the app illustrates this further (see below). Example 14 in Chapter 3 of the textbook “Statistics: The Art and Science of Learning from Data” includes a more detailed discussion of this topic in the context of this example.
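To make the idea of separate trend lines concrete, here is a minimal Python sketch that compares the slope of one pooled least-squares line with the slopes fitted within each group. The numbers are synthetic and chosen only to mimic the pattern described above; they are not the Florida county data, and the helper names `ols_slope` and `slope_for` are hypothetical. The app's own fitting code is not shown in this text, so this is only an illustration of the concept.

```python
# Synthetic illustration only: these values are made up and are NOT the
# Florida county data that ships with the app.

def ols_slope(xs, ys):
    """Return the least-squares slope of y regressed on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Each record: (education in percent, crime rate, urbanization group).
# Hypothetical values in which both variables rise with urbanization.
records = [
    (55, 20, "rural"), (58, 22, "rural"), (61, 19, "rural"), (64, 21, "rural"),
    (65, 50, "mixed"), (68, 52, "mixed"), (71, 49, "mixed"), (74, 51, "mixed"),
    (75, 82, "urban"), (78, 80, "urban"), (81, 79, "urban"), (84, 77, "urban"),
]

def slope_for(rows):
    """Fit crime rate on education for the given subset of records."""
    return ols_slope([r[0] for r in rows], [r[1] for r in rows])

# One trend line through all points, ignoring the grouping variable.
pooled_slope = slope_for(records)

# One trend line per group, i.e. adjusting for the grouping variable.
group_slopes = {
    group: slope_for([r for r in records if r[2] == group])
    for group in ("rural", "mixed", "urban")
}

print(f"pooled: {pooled_slope:.2f}")
for group, slope in group_slopes.items():
    print(f"{group}: {slope:.2f}")
```

In this made-up example the pooled slope is clearly positive, while the slopes within the rural and mixed groups are flat and the slope within the urban group is slightly negative. The pooled trend comes entirely from differences between the groups.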

The pre-loaded dataset also includes the actual urbanization values (in percent) as a variable, named “Urbanization (Percent)”. We can also use this quantitative variable as a “grouping” variable. Selecting it from the drop-down menu for the grouping variable results in this plot:

The scatterplot shows a positive relationship: The higher the percentage of residents with a high school education in a county, the higher a county’s crime rate. You can use your mouse to hover over points in the scatterplot to see which county they represent. (This info is coming from the ID variable selected in the app.) Try to identify the county with the largest crime rate. (See the next screenshot.)

Click on “Linear Regression Fit” under “Show Trend” to display a linear trend line in the scatterplot.

Figure 3: Screenshot of the app when selecting a grouping variable (here: Urbanization)

Figure 5: Screenshot of the app when clicking on the second tab “Scatterplot Matrix”

Figure 1: Screenshot of the Analyzing Multivariate Relationships app on start-up, displaying a scatterplot

However, the participation rate in the SAT varies widely by state. In some states, between 60% and 80% of students take the SAT, while in others less than 20% participate. This is important because in states with low SAT participation, those taking the SAT may be a select group of top students who plan to go on to college, and hence their average SAT score tends to be higher than the average in a state where 80% participate. Analyzing the relationship between SAT score and expenditure would be incomplete without accounting for the percentage of students taking the SAT. In the app, you can do this by selecting the variable SAT Participation as the grouping variable, where a state’s SAT participation is categorized as “low”, “medium” or “high”. This displays the relationship for each group separately, giving a more nuanced picture of the relationship.
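The following Python sketch illustrates the two steps involved: turning a quantitative participation rate into the categories “low”, “medium” and “high”, and then measuring the association between expenditure and SAT score within each category. The cut points of 20% and 60% are assumptions made for this sketch (the cutoffs used by the app are not stated here), the state values are invented rather than taken from the preloaded dataset, and the function names `categorize` and `pearson_r` are hypothetical.

```python
# Synthetic illustration only: these values are invented and are NOT the
# SAT dataset preloaded in the app.

def categorize(participation_pct, low_cut=20.0, high_cut=60.0):
    """Bin a participation rate (percent). The cut points are assumptions."""
    if participation_pct < low_cut:
        return "low"
    if participation_pct < high_cut:
        return "medium"
    return "high"

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Each record: (expenditure in $1000s, SAT participation in percent, mean SAT).
states = [
    (4.0, 5, 1050), (4.5, 8, 1060), (5.0, 10, 1075),
    (5.5, 30, 960), (6.0, 45, 975), (6.5, 50, 985),
    (7.0, 65, 890), (7.5, 70, 900), (8.0, 78, 915),
]

# Association when the participation rate is ignored.
pooled_r = pearson_r([s[0] for s in states], [s[2] for s in states])

# Association within each participation category.
group_r = {}
for level in ("low", "medium", "high"):
    rows = [s for s in states if categorize(s[1]) == level]
    group_r[level] = pearson_r([s[0] for s in rows], [s[2] for s in rows])

print(f"pooled r: {pooled_r:.2f}")
for level, r in group_r.items():
    print(f"{level} participation r: {r:.2f}")
```

With these invented values, the pooled correlation is negative, yet the correlation within every participation category is positive. Real data will be noisier, and different cut points can change the picture.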

Figure 4: Screenshot of the app when selecting a quantitative grouping variable

This positive relationship between education and crime rate seems counterintuitive. Probably, there are other factors that contribute to the crime rate that we need to consider to get a more complete picture. Let’s consider the urbanization level of the counties, which is another variable recorded in the dataset.

Go to the drop-down menu for the Grouping Variable and select “Urbanization (Categorical)” as a grouping variable. This updates the scatterplot to look like this:

Exploring Multivariate Relationships

A central recommendation of the new Guidelines for Assessment and Instruction in Statistics Education (GAISE) report is to “give students experience with multivariable thinking”. The new Exploring Multivariate Relationships app was designed with this in mind. It allows visualizing relationships between two quantitative variables while including (or adjusting for) a third (categorical or quantitative) grouping variable. Several datasets are pre-loaded to illustrate the concepts using real examples and you can also upload your own data.

Relationship between Crime and Education
Consider the relationship between crime rates and educational levels using data from all Florida counties. The Crime Rate and Education dataset (click on “More info on dataset” to learn more) is preloaded when you start the app, showing the following interactive scatterplot: