Thursday, March 3, 2016

Regressing the Democratic Primary (Part 2)

Yesterday I posted a regression analysis that correctly predicted 13 of the 15 Democratic primaries/caucuses held so far this year. One of the most interesting aspects of the model is that it used no polling data or historical voting patterns, just three economic variables: median earnings for 2014, the cost of living for 2015, and unemployment for December 2015 (the latest available data for each measure). Following a Facebook conversation that ensued, I added education and race to the model, specifically the percent of residents in each state with a bachelor's degree or higher and the percent Black population. On a specific recommendation, I also tried one interaction term: race with unemployment. The outcome variable is the difference between Sanders' vote share and Clinton's vote share. For example, in Vermont, Sanders beat Clinton by 72.5 points, while in Alabama, Clinton beat Sanders by 58.6 points.
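
The model described above is ordinary least squares with the state-level vote margin as the outcome. A minimal sketch of that setup, in Python rather than the R used in the post, and with synthetic placeholder data instead of the actual state figures:

```python
# Minimal OLS sketch via the normal equations X'X b = X'y.
# The data below is synthetic and illustrative only; the predictor names
# (cost of living, unemployment, percent college) mirror the post's variables
# but the numbers are NOT the real state data.

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    """Fit y = b0 + b1*x1 + ... by least squares; returns coefficients."""
    Xd = [[1.0] + row for row in X]          # prepend intercept column
    k = len(Xd[0])
    XtX = [[sum(r[a] * r[b] for r in Xd) for b in range(k)] for a in range(k)]
    Xty = [sum(Xd[i][a] * y[i] for i in range(len(Xd))) for a in range(k)]
    return solve(XtX, Xty)

# Synthetic rows of (cost_of_living, unemployment, pct_college):
X = [[1, 2, 3], [2, 1, 0], [3, 3, 1], [0, 1, 2], [4, 0, 5], [1, 5, 2]]
# Outcome built from known coefficients, so the fit should recover them:
y = [5 + 2 * a - 3 * b + 0.5 * c for a, b, c in X]
beta = ols(X, y)  # ≈ [5.0, 2.0, -3.0, 0.5]
```

Each state contributes one row; a positive fitted margin corresponds to one candidate and a negative margin to the other, matching the sign convention described below.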

I tried 36 different combinations of these six variables. Race was an important predictor; used by itself, it correctly predicted 12 of the 15 races. However, in several models it dropped out (it failed to reach statistical significance), and in others it did not improve predictive accuracy. Education proved more useful. On its own it was one of the worst predictors, missing almost half of the races, but when combined with two of the economic variables, cost of living and unemployment, it produced the only model that missed just one state, Oklahoma. Every other model missed two or more states.

In addition to prediction accuracy, I also computed AIC, BIC, and residuals for each model. The 36 models, the p-value significance of each variable, and the AIC/BIC/residual data are in the image below. The column labeled B*U is the interaction term, percent Black population × unemployment. The last column is the number of states incorrectly predicted, and the table is sorted first by states correctly predicted and then by lowest AIC. In statistics, AIC and BIC can be used to compare different regression models: the lower the value, the stronger the case that the model is the better one (lowest values highlighted in green). Similarly, lower residuals also tend to indicate a better model. As the chart shows, the top model does not have the best AIC/BIC/residuals, despite having the best prediction record. In models where I did not use a specific variable, the cell is marked with an "x" and highlighted in red. Where a variable failed to reach statistical significance (p ≥ 0.05), the value is crossed out and shown in red.
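
For a least-squares fit with Gaussian errors, both criteria can be computed (up to an additive constant that cancels when comparing models on the same data) from the residual sum of squares, the sample size, and the parameter count. A sketch with hypothetical numbers, not the post's actual models:

```python
# AIC/BIC for an OLS model, up to a constant shared by all models on the
# same data: AIC = n*ln(RSS/n) + 2k, BIC = n*ln(RSS/n) + k*ln(n).
# Both penalize extra parameters; BIC penalizes them more heavily.
import math

def aic_bic(rss, n, k):
    """Return (AIC, BIC) from residual sum of squares rss, sample size n,
    and number of estimated parameters k."""
    ll_term = n * math.log(rss / n)
    return ll_term + 2 * k, ll_term + k * math.log(n)

# Two hypothetical models fit to the same 15 contests:
aic_a, bic_a = aic_bic(1000.0, 15, 4)  # model A: worse fit, fewer parameters
aic_b, bic_b = aic_bic(900.0, 15, 6)   # model B: better fit, more parameters
# Model A scores lower (better) on both criteria here: B's improved fit
# does not buy back the penalty for its two extra parameters.
```

This is why the table can rank one model best on prediction while another model has the lowest AIC/BIC: the criteria trade goodness of fit against model complexity rather than counting correct calls.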

This image shows the predicted values of Sanders' wins in each state based on this model (cost of living, unemployment, and college education). The only state it missed is Oklahoma. A positive value is a win for Clinton (highlighted in red), and a negative value is a win for Sanders (highlighted in blue). In the "model prediction" column, the correct predictions are highlighted in green.

This image shows the actual R output for this model: the p-values, overall model significance, adjusted R-squared, the coefficients for each variable, and so on.

(Addendum--Predictions)

This final image is a list of all 50 states plus DC, with the original data used to calculate the models and predictions for the outcomes of the remaining primaries/caucuses. Model 1 is just the two economic variables plus education. Model 2 is those same three variables, plus median earnings and the percent of the state population that self-identified as Black in the American Community Survey five-year estimate (2010-2014). It is arguably the second-best model; one of its problems is that both race and education drop out of statistical significance. However, removing them from the analysis produces an inferior model, so for the purpose of comparison, I left it intact alongside Model 1.
