Multiple regression--predicting achievement index score for elementary schools

From WikiEducator

This activity provides independent practice in the use of multiple regression.

Research question

In a study to determine what factors are related to school performance, 400 California elementary schools were randomly sampled from the California Department of Education's API dataset for the year 2000.[1] The measures collected include API 2000, the academic performance index: a composite score on a scale of 200 to 1000 that indicates a school's overall academic performance, based on statewide testing. Other attributes of elementary schools thought to be related to school performance were also collected, such as class size, enrollment, and percent of students receiving free lunch.

Description of variables

Variable     Description
snum         School number
dnum         District number
api00        API score for the year 2000
api99        API score for the year 1999
growth       Change in API score from 1999 to 2000
meals        Percent of students receiving free meals
ell          Number of students who are English language learners
yr_rnd       Year round school (0=No, 1=Yes)
mobility     Percent first year in school
acs_k3       Average class size for grades K-3
acs_46       Average class size for grades 4-6
not_hsg      Percent of parents who did not complete high school
hsg          Percent of parents whose highest education level is high school graduate
some_coll    Percent of parents whose highest education level is some college
coll_grad    Percent of parents whose highest education level is college graduate
grad_sch     Percent of parents whose highest education level is graduate school study
avg_ed       Average parent education (on a 1-5 scale, corresponding to levels in hsg to grad_sch variables)
full         Percent of teachers with a full teaching credential
emer         Percent of teachers with an emergency teaching credential
enroll       Number of students enrolled in the school
mealcat      Percent of students receiving free meals, grouped in 3 categories (1=0-46% free meals, 2=47-80% free meals, 3=81-100% free meals)
collcat      unknown


Dataset

Obtain the dataset from one of the following:

Analyses

Response variable: api00

Explanatory variables: Choose at least 3 quantitative variables which you feel will most contribute to overall academic performance in elementary schools. Choose more if you wish, but dumping all of the possible variables into the prediction of api00 is inappropriate. Choose your variables BEFORE examination of descriptive statistics or correlations.

The following sections provide guiding questions to help step you through the process of multiple regression analysis. Copy and paste the following sections into a word processor. Create a summary or interpretation for each section as indicated.

Preliminary analyses

For all of the variables to be included in your regression analyses:

  1. Use SPSS to create descriptive statistics, frequency distributions (for variables with limited values), and histograms (for variables with many different values).
    • Evaluate the results for reasonableness. Consider the following questions:
      • Do any of the variables exhibit "suspicious" values?
      • Do any of the distributions seem unreasonable given what you know about the measurement scale or appear "extreme" such that the variable would be unreasonable to use as a predictor in the regression?
  2. Use SPSS to create pairwise correlations and scatterplots (for each explanatory variable with the response, as well as for each pair of explanatory variables).
    • Evaluate the results for reasonableness. Consider the following questions:
      • Do the correlations of explanatory variables with api00 support their use in a prediction equation?
      • How might the correlations among the explanatory variables impact the individual contribution of each?
      • Is there a linear relationship between each explanatory variable and the response?
  3. Summarize the results of your evaluation of the preliminary analyses.
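
For readers who want to cross-check the SPSS descriptives and correlations by hand, the preliminary steps above can be sketched in Python using only the standard library. This is an illustrative sketch, not part of the activity: the function names (describe, pearson_r, correlation_matrix) are hypothetical, and the six rows of data are made-up toy values, NOT rows from the API dataset.

```python
import math
from statistics import mean, stdev

# Made-up toy values for illustration only; NOT taken from the API dataset.
schools = {
    "api00": [650.0, 700.0, 580.0, 820.0, 760.0, 610.0],
    "meals": [70.0, 55.0, 90.0, 15.0, 30.0, 85.0],
    "acs_k3": [20.0, 19.0, 21.0, 18.0, 19.0, 20.0],
}


def describe(values):
    """Basic descriptive statistics for one quantitative variable."""
    return {
        "n": len(values),
        "mean": mean(values),
        "sd": stdev(values),  # sample standard deviation (n - 1 denominator)
        "min": min(values),
        "max": max(values),
    }


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(data):
    """Pairwise correlations for every pair of variables in the dict."""
    names = list(data)
    return {(a, b): pearson_r(data[a], data[b]) for a in names for b in names}


for name, values in schools.items():
    print(name, describe(values))

for (a, b), r in correlation_matrix(schools).items():
    if a < b:  # print each distinct pair once
        print(f"r({a}, {b}) = {r:.3f}")
```

Minimum and maximum values outside the documented scale (for example, an api00 below 200) are the kind of "suspicious" value the first question asks about. Large correlations among explanatory variables are what the second question about individual contributions refers to.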

Full regression analysis

  1. Use SPSS to create a regression analysis, including all of your chosen explanatory variables.
    • Evaluate the regression results, including the overall F test and contributions of each of the explanatory variables. Consider the following questions:
      • Do the explanatory variables (as a group) significantly predict the response variable?
      • Does each of the explanatory variables contribute to the prediction of the response beyond the contribution of the other variables?
      • Does the collection of explanatory variables provide a "useful/practical" explanation of the response?
      • What refinements, if any, will you make?
  2. Summarize the results of your evaluation of the regression analysis output.
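
The quantities examined in the regression output (coefficients, R squared, and the overall F statistic) can be reproduced with a short least-squares sketch. The code below is a minimal illustration in standard-library Python, not SPSS's implementation: it solves the normal equations directly, the helper names (solve, fit_ols) are hypothetical, and the data are made-up toy values rather than rows from the API dataset. It does not compute p-values, which require the F distribution.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(y, predictors):
    """Ordinary least squares with an intercept.

    y: list of responses; predictors: list of columns (one list per variable).
    """
    n, p = len(y), len(predictors)
    rows = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    k = p + 1
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    y_bar = sum(y) / n
    sse = sum(e ** 2 for e in residuals)       # unexplained variation
    sst = sum((yi - y_bar) ** 2 for yi in y)   # total variation
    r2 = 1 - sse / sst
    f_stat = ((sst - sse) / p) / (sse / (n - p - 1))
    return {"coef": beta, "fitted": fitted, "residuals": residuals,
            "r2": r2, "f": f_stat, "df": (p, n - p - 1)}


# Made-up toy values for illustration only; NOT taken from the API dataset.
api00 = [650.0, 700.0, 580.0, 820.0, 760.0, 610.0]
meals = [70.0, 55.0, 90.0, 15.0, 30.0, 85.0]
acs_k3 = [20.0, 19.0, 21.0, 18.0, 19.0, 20.0]

result = fit_ols(api00, [meals, acs_k3])
print("coefficients (intercept first):", result["coef"])
print("R squared:", result["r2"])
print("F statistic:", result["f"], "with df", result["df"])
```

The F statistic compares the variation explained per explanatory variable with the unexplained variation per residual degree of freedom, which is what the overall F test in the SPSS ANOVA table reports.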

Refine the model

  1. Decide which variable to delete from the model.
  2. Use SPSS to create a second regression analysis with the remaining explanatory variables entered.
    • Evaluate the regression results. Consider the following questions:
      • Do the remaining explanatory variables (as a group) significantly predict the response variable?
      • Does each of the explanatory variables contribute to the prediction of the response beyond the contribution of the other variables?
      • Does the collection of explanatory variables provide a "useful/practical" explanation of the response?
      • Is the reduction in "explanation power" due to the removal of one explanatory variable acceptable?
      • What additional refinements, if any, will you make?
  3. Summarize the results of your evaluation of the regression analysis output.
  4. Repeat this step if there are additional variables to be removed from the model.
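
One way to judge whether the loss of "explanation power" is acceptable is to compare adjusted R squared between the two models and to compute a partial F statistic for the removed variable(s). The sketch below uses the standard textbook formulas; the function names are hypothetical, and the R squared values in the example are invented for illustration, not results from the API dataset. The sample size of 400 assumes no cases are dropped for missing values.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R squared for a model with p explanatory variables and n cases."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def partial_f(r2_full, r2_reduced, n, p_full, q):
    """F statistic for dropping q variables from a full model with p_full variables.

    Numerator df = q, denominator df = n - p_full - 1.
    """
    numerator = (r2_full - r2_reduced) / q
    denominator = (1 - r2_full) / (n - p_full - 1)
    return numerator / denominator


# Hypothetical R squared values, invented for illustration only.
n = 400
r2_full, p_full = 0.84, 4        # model with 4 explanatory variables
r2_reduced, p_reduced = 0.83, 3  # same model after deleting 1 variable

print("adjusted R squared (full):   ", adjusted_r2(r2_full, n, p_full))
print("adjusted R squared (reduced):", adjusted_r2(r2_reduced, n, p_reduced))
print("partial F for removed variable:", partial_f(r2_full, r2_reduced, n, p_full, 1))
```

When a single variable is removed, the partial F statistic equals the square of that variable's t statistic in the full model, so it adds no new information beyond the coefficient table; it becomes useful when several variables are removed at once.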

Residuals

  1. Use SPSS to specify a final run of your refined model. Choose the "Save" option and under "Predicted Values" select "Unstandardized" and under "Residuals" select "Unstandardized". Run the regression as before. Two new variables will be created in your dataset containing the unstandardized predicted value and the unstandardized residual for each observation. You will use these variables to create plots to study the residuals.
  2. Create a normal quantile plot (Q-Q plot) of the unstandardized residuals.
    • Use the plot to evaluate the assumption that the errors (residuals) are normally distributed.
  3. Create plots of the unstandardized residuals versus the unstandardized predicted values and each of the explanatory variables entered in the model.
    • Evaluate each plot. Consider the following questions:
      • Are the residuals more or less randomly dispersed around zero?
      • Is there any evidence to suggest that the explanatory and response variables have a non-linear relationship?
      • Is there any evidence that the errors do not have a common standard deviation?
      • Are there any unusual patterns?
  4. Summarize the results of your evaluation of the residuals.
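
To make explicit what a normal quantile plot displays, the sketch below computes its coordinates: each sorted residual is paired with the standard normal quantile for its rank. This is an illustration in standard-library Python (3.8 or later, for statistics.NormalDist), not the SPSS procedure. The plotting positions (i - 0.5) / n are one common convention and may differ from the one SPSS uses, the function name qq_points is hypothetical, and the residuals are made-up values.

```python
from statistics import NormalDist


def qq_points(residuals):
    """Return (theoretical quantile, observed residual) pairs for a normal Q-Q plot."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting position (i - 0.5) / n for the i-th smallest residual (assumed convention).
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Made-up residuals for illustration only; NOT output from the API dataset.
toy_residuals = [-12.0, 5.5, 3.0, -1.5, 8.0, -4.0, 1.0]

for z, e in qq_points(toy_residuals):
    print(f"normal quantile {z:7.3f}   residual {e:7.2f}")
```

If the errors are approximately normal, the points fall close to a straight line when the residuals are plotted against the normal quantiles; systematic curvature at the ends suggests skewness or heavy tails.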

Conclusion

  1. Interpret the results of your regression analyses in the context of the research question. Be sure to include:
    • A description of the model, including regression equation and what the equation suggests about the relationship among the variables.
    • To what population the model applies
    • Results of significance testing
    • The model's usefulness (explanatory power)
    • Variables rejected from the model and why
  2. Describe any limitations to your study, e.g.,
    • Issues with generalization (both to a population as well as limitations of any proxy variables--indicators of other less measurable characteristics)
    • Model specification
    • Regression assumptions

Resources

Regression with SPSS, by Xiao Chen, Phil Ender, Michael Mitchell and Christine Wells (in alphabetical order), provides helpful guidance in analyzing the "elemapi" datasets.

References

  1. Chen, X., Ender, P., Mitchell, M. and Wells, C. (2003). Regression with SPSS. Retrieved from http://www.ats.ucla.edu/stat/spss/webbooks/reg/default.htm