GSE Stat Methods II - Review Notes

From WikiEducator


The following review is based on the indicated chapter and section of

Moore, D. S., McCabe, G. P., & Craig, B. A. (2009). Introduction to the practice of statistics (6th ed). New York: W. H. Freeman.

The questions and items for display are organized as a slide show. The sub-bullets for each point support discussion and content to be written out on the board.


Two-Way Analysis of Variance (13.1/13.2)

Significance of main effects   Neither   Neither       1 factor   Both factors   Both factors
Significance of interaction    None      Significant   None       Significant    None

Comparing means (12.2)

Planned comparisons (contrasts)

Post-hoc analyses & multiple comparisons

Probabilities for two independent comparisons, each tested at the .05 level:

            (c1) .95   (c1) .05
(c2) .95      .9025      .0475
(c2) .05      .0475      .0025
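Under independence the cell probabilities simply multiply, which also gives the familywise error rate for the pair of comparisons. A quick arithmetic check:

```python
# Probability outcomes for two independent comparisons, each at alpha = .05
p_ok = 0.95   # one comparison: no Type I error
p_err = 0.05  # one comparison: Type I error

both_ok = p_ok * p_ok      # both comparisons correct
one_err = p_ok * p_err     # each of the two mixed cells
both_err = p_err * p_err   # note: .05 x .05 = .0025

# Familywise error rate: at least one Type I error across the two comparisons
familywise = 1 - both_ok
print(both_ok, one_err, both_err, familywise)
```

This is why multiple comparisons inflate the overall error rate: two tests at .05 already give a nearly 10% chance of at least one false rejection.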

Inference for one-way ANOVA (12.1)

Multiple regression (11.1/11.2)

Subtopic: Causation

Simple linear regression (10.2)

Before beginning, draw scatterplot on the board, for ongoing reference. Include a least squares line and a line for y-bar.

Simple linear regression (10.1)

Before beginning, draw scatterplot on the board, for ongoing reference

s = \sqrt{\frac{\sum residual^2}{n-2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}}
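The regression standard error above can be computed directly from the residuals; a minimal sketch with made-up (x, y) data:

```python
import numpy as np

# Hypothetical small data set to illustrate the formula above
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares slope and intercept
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# s estimates the spread of responses about the true regression line;
# n - 2 degrees of freedom because two parameters were estimated
residuals = y - (b0 + b1 * x)
n = len(x)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))
print(b1, s)
```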

Analysis of two-way tables (9.1/9.2)

        4th    8th
Pass     80     80
Fail     20     40
Total   100    120
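The chi-square statistic for this table can be built by hand from the expected counts under independence; a sketch using only numpy:

```python
import numpy as np

# Pass/fail counts from the table above (columns: 4th grade, 8th grade)
observed = np.array([[80.0, 80.0],
                     [20.0, 40.0]])

row_totals = observed.sum(axis=1, keepdims=True)   # 160, 60
col_totals = observed.sum(axis=0, keepdims=True)   # 100, 120
n = observed.sum()                                 # 220

# Expected count in each cell = (row total)(column total) / n
expected = row_totals * col_totals / n
chi2 = ((observed - expected) ** 2 / expected).sum()
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(chi2, df)
```

Compare chi2 against the chi-square distribution with df = 1 to get the P-value (software or Table F).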

Inference for proportions (8.1/8.2)

when we have a categorical response variable, such that we are counting membership (successes) in each category. Example: what proportion of students bring their lunch to school?
the sample proportion of successes \hat{p} = \frac{X}{n}
the count X has mean np and standard deviation \sqrt{np(1-p)}; divide by n to get \mu_\hat{p} = p and \sigma_\hat{p} = \frac{\sqrt{np(1-p)}}{n} = \sqrt{\frac{p(1-p)}{n}}
since p is unknown, replace it with \hat{p} and change the name to standard error... SE_\hat{p} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}

confidence interval for single proportion

estimate +/- margin of error
precision is the narrowness of the interval; if we increase our percent confidence, the margin of error grows and the interval widens
to narrow the interval at a fixed confidence level, increase the sample size
solve for n, such that n = {\left ( \frac{z*}{m} \right )}^2 \hat{p}(1-\hat{p})
formula uses \hat{p}, but that's what we want to estimate with the sample... use a guessed value p^* from a pilot study or prior data, or the conservative choice p^* = 0.5, which maximizes p^*(1-p^*)
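The sample-size formula is easy to wrap in a small helper (the function name here is mine, chosen for illustration); rounding up is essential since n must be a whole number at least as large as the formula's value:

```python
import math

# Sample size needed for margin of error m at confidence level z*.
# p_hat = 0.5 is the conservative (largest-n) guess.
def n_for_margin(z_star, m, p_hat=0.5):
    n = (z_star / m) ** 2 * p_hat * (1 - p_hat)
    return math.ceil(n)  # always round up

print(n_for_margin(1.96, 0.03))  # 95% confidence, margin 0.03
```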

significance test for single proportion

z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}
Ha: p < p0, p > p0, or p ≠ p0 (show image of the corresponding tail area, P(Z ≤ z), P(Z ≥ z), or 2P(Z ≥ |z|), for each case)
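A minimal sketch of the one-proportion z test using only the standard library; the counts (120 of 200 students bringing lunch, Ho: p = 0.5) are made up for illustration:

```python
from math import sqrt
from statistics import NormalDist

# Illustrative data: 120 of 200 students bring lunch; test Ho: p = 0.5
x, n, p0 = 120, 200, 0.5
p_hat = x / n

# z statistic uses the NULL value p0 in the standard deviation
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# two-sided P-value: 2 P(Z >= |z|)
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
print(z, p_two_sided)
```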

comparing two proportions

Fill out the table below
Population   Pop. proportion   Sample size   Count of successes   Sample proportion
    1              p_1             n_1              X_1            \hat{p}_1 = X_1/n_1
    2              p_2             n_2              X_2            \hat{p}_2 = X_2/n_2
D = \hat{p}_1 - \hat{p}_2
when both samples are large, distribution of D is approximately Normal
\mu_D = \mu_\hat{p_1} - \mu_\hat{p_2} = p_1 - p_2
\sigma_D = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}
SE_D = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}
D \pm m, where m = z^* SE_D
Ho: p1 = p2
Devise a pooled estimate of p, which we'll call \hat{p} = \frac{X_1 + X_2}{n_1 + n_2}
So SE_{Dp} = \sqrt{\hat{p}(1-\hat{p}) ((1/n_1) + (1/n_2))}
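A sketch of the two-proportion z test with the pooled standard error, using the pass/fail counts from the two-way-table example earlier (80 of 100 fourth graders vs. 80 of 120 eighth graders):

```python
from math import sqrt
from statistics import NormalDist

# Counts from the pass/fail table: 4th graders vs. 8th graders
x1, n1 = 80, 100
x2, n2 = 80, 120

p1_hat, p2_hat = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)   # pooled estimate under Ho: p1 = p2
se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

z = (p1_hat - p2_hat) / se_pool
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
print(z, p_two_sided)
```

For a 2x2 table, z squared equals the chi-square statistic, so the two-sided z test and the chi-square test agree.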
both are categorical -- explanatory defines the two populations, response is a yes/no on a particular question
create a two way table as a prelude to X2


a. When designing a study to determine this population proportion, what is the minimum number you would need to survey to be 95% confident that the population proportion is estimated to within 0.03?
n=\frac{1}{4} \left (\frac{z^*}{m} \right)^2 = \frac{1}{4} \left (\frac{1.96}{.03} \right)^2 = 1067.11; round up to n = 1068
b. If it was later determined that it was important to be more than 95% confident and a new survey was commissioned, how would that affect the minimum number you would need to survey? Why?
Need an even larger sample size. z* increases, but everything else stays the same.
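The effect of raising the confidence level can be checked numerically; here 2.576 is assumed as the z* for 99% confidence, with the conservative \hat{p} = 0.5:

```python
import math

# Part (b): raising confidence raises z*, so the required n grows.
# z* = 1.960 for 95%, 2.576 for 99% (standard Normal critical values)
def min_n(z_star, m=0.03):
    return math.ceil((z_star / m) ** 2 / 4)  # conservative p-hat = 0.5

n_95 = min_n(1.960)
n_99 = min_n(2.576)
print(n_95, n_99)
```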


  1. Dean, S., & Illowsky, B. (2009, February 18). Confidence Intervals: Homework and Comparing Two Independent Population Proportions. Retrieved from the Connexions web site on 5 Oct 2010.

Matched pairs (part of 7.1)

observations are paired by subject--two measurements per subject, test-retest
observations are natural pairs--twins, spouses, siblings, matching on ability
the between-subjects variation is controlled by using the differences within subjects. Each subject serves as their own control. This eliminates other confounding factors (ability, age, knowledge...) which differ between subjects.
Draw the two populations for independent groups leading to sampling distribution of mean differences, \bar{x}_1 - \bar{x}_2, compared to one population of mean differences (matched pairs) leading to sampling distribution of differences, \bar{x}_d.
We can take the difference between the two measures for each individual; this difference is then compared with no difference.
We have one standard deviation, s, and one standard error, s / \sqrt{n}.
the mean of the differences between paired observations in sample 1 and sample 2... x_1 - y_1, x_2 - y_2, ... (display OLI picture showing each pair of observations converted to differences)
check a histogram and/or Normal quantile plot (convert each difference to a percentile, determine z-score for that percentile, plot the difference score against the z-score, should result in a straight line, p. 68 in text)
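The matched-pairs analysis reduces to a one-sample t test on the differences; a sketch with made-up pre/post scores for five subjects:

```python
from math import sqrt
from statistics import mean, stdev

# Made-up pre/post scores for 5 subjects (test-retest design)
pre  = [72, 68, 80, 75, 70]
post = [75, 70, 84, 78, 71]

d = [b - a for a, b in zip(pre, post)]   # within-subject differences
n = len(d)
d_bar = mean(d)
s_d = stdev(d)                           # one standard deviation, s
se = s_d / sqrt(n)                       # one standard error, s / sqrt(n)

t = (d_bar - 0) / se                     # test Ho: mean difference = 0
df = n - 1
print(d_bar, t, df)
```

Compare t against the t(n-1) distribution for the P-value, exactly as in the one-sample case.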

Additional topics: type I and type II errors and power

7.2 Comparing two means

The two samples must be independent
  • Explanatory variable which is categorical (a grouping variable)
  • Response variable which is quantitative (provides scores/data which are summarized as a mean)
Ho: μ1 - μ2 = 0
Ho: μ1 = μ2
Ha: μ1 - μ2 ≠ 0 ...(Ha: μ1 ≠ μ2)
Ha: μ1 - μ2 < 0 ...(Ha: μ1 < μ2)
Ha: μ1 - μ2 > 0 ...(Ha: μ1 > μ2)
(discuss which mean is greater for the one-sided alternatives)
the difference between the means, μ1 - μ2
this means that we have a sampling distribution of differences...if both population distributions are normal, then sampling distribution of differences is also normal
draw sampling distribution of differences
(put up one-sample t-test formula, if needed: \frac{\bar{x} - \mu_0}{s/\sqrt{n}})
\frac{sample \ estimate - null \ value}{standard \ error}
t = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{{s_1^2 \over n_1} + {s_2^2  \over n_2}}}
review why this statistic makes sense....element by element,
  • y1 and y2 estimate μ1 and μ2, so \bar{y}_1 - \bar{y}_2 estimates μ1 - μ2
  • the null value is missing from the equation.
  • the denominator is the standard error of \bar{y}_1 - \bar{y}_2
measures (in standard errors) the difference between what the data tell me about the parameter of interest μ1 - μ2 (sample estimate) and what the null hypothesis claims that it is (null value).
The null distribution of this statistic is approximated by a t distribution with the appropriate degrees of freedom. It's not exact, but good enough for our purposes. Let statistical software calculate the df.
the p-value indicates amount of evidence against Ho; p-values less than the alpha threshold provide strong evidence against Ho and in favor of the specified alternative.
easier to reject Ho with a one-sided alternative, but it would be wrong to choose the direction after seeing which way the data lean. Contributes to error. Which error--any thoughts?
This is the alpha level...Type I error.
What if obtained 20 different samples and did 20 t-tests using .05 alpha, when in fact Ho is true. For how many might we reject Ho, according to probability? (1)
we are 95% confident that the actual value of μ1 - μ2 occurs in this range.
when Ho is rejected, the confidence interval quantifies the supposed effect of the explanatory variable on the response variable.
\bar{y}_1 - \bar{y}_2 \pm t^* \sqrt{{s_1^2 \over n_1} + {s_2^2  \over n_2}}
  1. The method described here does NOT assume the variances are equal. The pooled t test, which does assume equal variances, is described later in the chapter; we won't be using that method.
  2. The general method described here is robust to violations of Normality when
    • sample sizes are large (n1+n2 > 40)
    • sample size in each group is equal and shape of population distributions for each group are similar
  3. the routine will ask us to label one sample as "group 1" and the other as "group 2", how do we decide? Doesn't matter as long as Ha for one-sided test matches.
  4. small samples may be useful when the effect size is large. If the result is borderline, the study can't say much; there is not enough power.
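The two-sample procedure above can be sketched end to end; the scores are made up, and the df formula is the Satterthwaite approximation that statistical software computes for us:

```python
from math import sqrt
from statistics import mean, stdev

# Two independent groups (hypothetical scores); Ho: mu1 = mu2
g1 = [23, 27, 25, 22, 26, 28]
g2 = [20, 21, 24, 19, 22, 23]

n1, n2 = len(g1), len(g2)
y1, y2 = mean(g1), mean(g2)
v1, v2 = stdev(g1) ** 2, stdev(g2) ** 2

# unpooled standard error: variances are NOT assumed equal
se = sqrt(v1 / n1 + v2 / n2)
t = (y1 - y2) / se

# Satterthwaite approximation for the degrees of freedom
df = (v1 / n1 + v2 / n2) ** 2 / (
    (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
)
print(t, df)
```

Note that df here is not a whole number; software uses it directly when looking up the P-value.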

7.1 Inference for the population mean

  • Click on the link for Sampling Distribution applet. Create a crazy population distribution -- highly skewed with significant outliers. Set samples to N=5 and N=25, run simulation.
  • Discuss idea that when we look up the p-value for a z test we are assuming that the distribution of means is shaped like the z distribution.
  • Draw a normal distribution and shade a possible p-value. Compare this area to the area for the distribution created for N=5.
z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}
  • note there are two population values μ0 and σ.
  • the distribution of this statistic is normal and is derived from the sampling distribution of \bar{X}
  • standard error
  • draw a normal distribution of \bar{x}'s; the standard error is the standard deviation of the distribution of sample means
standard deviation of a statistic uses the population value...SD_\bar{x}= \frac{\sigma}{\sqrt{n}}
standard error of a statistic uses the value calculated from the sample...SE_\bar{x}= \frac{s}{\sqrt{n}}
NO!! When s replaces σ, the statistic is no longer z; we now have a t statistic
t= \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
  • the t statistic has a t distribution with n-1 degrees of freedom
  • degrees of freedom can be a difficult concept and difficult to determine; basically it's the number of independent pieces of information that go into the estimate of a parameter (in this case the t statistic)
t(k), where k = degrees of freedom
  • symmetric, centered at 0, covers -\infty to \infty
  • show figure comparing t(2), t(5), and z (note that t(30) ~ z)
  • show figure comparing a z score and t score -- review differences that result in larger spread
  • the t statistic is the standardized score for \bar{x} assuming Ho is true, μ = μ0.
  • the t statistic follows the t distribution, so we can calculate the t statistic and then use the distribution to determine the p-value (the likelihood of obtaining that value, or a larger one, of t)
  • the sample is random
  • population distribution is Normal; this is hard to know for sure.
show table of sample size vs. normality of population distribution
Look at the data for evidence.
  • explain that the population with the unknown mean is the one from which the sample is actually drawn, NOT the one that is the usual case, the one with μ0. We are testing to see if the population from which the sample is drawn is different from the usual (null) population.
  • review p-value probability formulas and pictures of t-distributions with p-values shaded for each version of Ha. example graphics
\bar{x} \pm t^* \frac{s}{\sqrt{n}}
t^* \frac{s}{\sqrt{n}}
  • Table D in textbook lists these values for a selection of t distributions
  • Review how to use Table D to obtain t*
  • Note that as df gets larger (n gets larger), the t values approach z.
  • The t test is fairly robust: small deviations from Normality will not affect the results much. Factors that strongly matter:
    1. Random sampling: the data must be a random sample from the population
    2. outliers and skewness: strongly influence the mean and therefore the t procedures. However, their impact diminishes as the sample size gets larger because of the Central Limit Theorem.
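The one-sample t statistic and confidence interval can be sketched together; the data are made up, and t* = 2.262 is the Table D value for 95% confidence with df = 9:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical measurements; test Ho: mu = 10.0
x = [10.2, 9.8, 11.1, 10.5, 9.9, 10.8, 10.1, 10.4, 9.7, 10.6]
mu0 = 10.0

n = len(x)
x_bar = mean(x)
s = stdev(x)

t = (x_bar - mu0) / (s / sqrt(n))   # t statistic with df = n - 1
t_star = 2.262                      # Table D: df = 9, 95% confidence
margin = t_star * s / sqrt(n)
ci = (x_bar - margin, x_bar + margin)
print(t, ci)
```

Here |t| < t*, so Ho is not rejected at the .05 level, and consistently the 95% confidence interval contains mu0 = 10.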