From WikiEducator
The following review is based on the indicated chapter and section of
 Moore, D. S., McCabe, G. P., & Craig, B. A. (2012). Introduction to the practice of statistics (7th ed). New York: W. H. Freeman.
The questions and items for display are organized as a slide show. The subbullets for each point support discussion and content to be written out on the board.
Tests of Significance (6.2)
 Consider a situation where a student is brought before an academic committee with a claim that she cheated. The committee assumes that the student is innocent until proven guilty. The instructor presents convincing evidence of cheating. What should the committee decide as to whether or not the student cheated?
 committee should find evidence convincing and decide that student cheated.
 Tests of significance work similarly. Identify the following in the cheating story.
 Identify two opposing claims
 student claims innocence (claim 1); instructor claims she cheated (claim 2).
 claim 1 is challenged by claim 2
 begin with assumption that claim 1 is true.
 Collect evidence
 instructor provides evidence against claim 1
 observations in sample will serve as evidence against claim 1
 Assess evidence
 committee evaluates evidence: how likely (probability based) to observe this evidence if student is innocent
 evaluate sample statistics in context of sampling distribution; determine how likely to observe this result if it were to have occurred by chance.
 Make a decision
 If it would be very unlikely to observe this evidence were the student innocent (strong evidence against claim 1), then reject claim 1 and decide for claim 2
 If it would not be unusual to observe this evidence were the student innocent (weak evidence against claim 1), then stay with claim 1 (cannot reject claim 1 in favor of claim 2). Note: we do not say we accept claim 1; we just don't have enough evidence to conclude anything else.
 What do we call the two claims in tests of significance?
 Null hypothesis (H_{0}), claim 1
 typically statement of no effect, no difference; the assumed usual state
 Alternative hypothesis (H_{a}), claim 2
 statement that disagrees with H_{0}, specifying what we think might be going on; written as an "opposite" of null hypothesis.
 also called hypothesis testing
 Example: Traditional practice suggests that college students should study 2 hours for every 1 hour of classroom time. Using this rule, a student with 15 hours of classroom time per week (i.e., 15 credit hours, denoted full-time) should study on average 30 hours per week. A researcher is interested in whether this rule applies at Rutgers University. What are the null and alternative hypotheses for this study? (Step 1)
 H_{0}: The average time full-time Rutgers students study outside of class is 30 hours per week.
 H_{a}: The average time full-time Rutgers students study outside of class is not 30 hours per week.
 When wording the hypotheses (claims) who are they about?
 the population
 the population in our example is the university students
 If we suspected that full-time Rutgers students study less than 30 hours per week, how would we have stated H_{a}?
 H_{a}: The average time full-time Rutgers students study outside of class is less than 30 hours per week.
 One-sided alternative, could be greater than or less than
 Two-sided alternative, population could differ in either direction.
 Must have a specific direction firmly in mind (without looking at the data) to choose one-sided
 Example (continued): The researcher obtains a random sample of 50 college students currently taking 15 credits and collects the number of hours they study per week: xbar = 28 and sd = 5 hours per week. What evidence was collected? How might we summarize the evidence against H_{0}? (Step 2)
 evidence is the sample mean and sd.
 we can compare the sample results to the hypothesized value
 For this example, we will employ a z statistic to compare the sample mean to the hypothesized value. How does this work? What do we call this type of statistic, generally speaking?
 z = (estimate − hypothesized value) / (standard deviation of the estimate) = (28 − 30) / .7 = −2.86
 assume standard deviation of estimate is .7 (5/√50 ≈ .7)
 called a test statistic....this is a very typical form
 To assess the evidence we ask the question: how likely is it to get data like that observed when H_{0} is true? What do we need to answer this question? (Step 3)
 the probability of obtaining this value or one more extreme, if the population parameter in H_{0} is true.
 called the p-value
 if very small, then unlikely to observe this value or one more extreme if H_{0} is true.
 if large, then not surprising to see a value like this, if H_{0} is true. Could have happened by chance
 What can we use to give us this probability....of observing a particular value or one more extreme given the population parameter provided in H_{0}.
 sampling distribution (ask if it is safe to use the sampling distribution of the mean for this example.....yes, n>40)
 (draw sampling distribution of xbar for μ=30, σ=.7)
 the z statistic, −2.86, is from a Normal distribution representing the sampling distribution of the mean.
 (draw Normal distribution, shade areas outside ±2.86)
 for a two-sided H_{a}, P(Z ≤ −2.86 or Z ≥ 2.86)...sum the lower and upper tails: .0021 × 2 = .0042
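The tail areas above can be checked numerically. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist` as a stand-in for a Normal table or calculator; the variable names are illustrative only.

```python
from statistics import NormalDist

z = 2.86  # magnitude of the z statistic from the study-hours example

std_normal = NormalDist()  # standard Normal distribution: mean 0, sd 1

lower_tail = std_normal.cdf(-z)      # P(Z <= -2.86)
upper_tail = 1 - std_normal.cdf(z)   # P(Z >= 2.86)
two_sided_p = lower_tail + upper_tail

print(f"lower tail  = {lower_tail:.4f}")
print(f"upper tail  = {upper_tail:.4f}")
print(f"two-sided p = {two_sided_p:.4f}")
```

Because the Normal curve is symmetric, the two tails are equal, which is why the slide simply doubles the lower-tail area.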
 A p-value of p=.0042 is pretty small, but is it small enough to decide against H_{0} (that the population mean is 30)? How can we decide? (Step 4)
 compare our result to a threshold value
 predetermined
 called significance level, α.
 What are some common significance levels? What do they mean?
 α=.05, α=.01, α=.1
 if we find a result that would occur less than 5% (1%, 10%) of the time, when H_{0} is true, then we decide to reject H_{0} and accept H_{a}.
 our result is statistically significant at level α.
 What do we conclude if the p-value is not smaller than α?
 we decide that our data do not provide enough evidence to reject H_{0}
 we can also say that the data do not provide enough evidence to accept H_{a}.
 we cannot say that the data support H_{0}, or that we accept H_{0}.
 What do we conclude in our example for p=.0042?
 H_{0}: The average time full-time Rutgers students study outside of class is 30 hours per week.
 H_{a}: The average time full-time Rutgers students study outside of class is not 30 hours per week.
 assume we set α=.05 before the data were collected
 p=.0042
 our result is statistically significant
 we reject H_{0} and conclude that the average time full-time Rutgers students study outside of class is not 30 hours per week.
 What if, before we collected the data, we suspected that Rutgers students, on average, study less than 30 hours per week? How does this change how we assess the evidence and what we conclude?
 (draw Normal distribution with z = −2.86 and shade only area below)
 our p-value is smaller....p=.0021....more powerful test because we have prior information about direction of difference
 our conclusion doesn't change.
 Example: In 2011, the SAT Critical Reading (SATCR) test had a mean of 496 and a standard deviation of 114. As XYZ College has a liberal arts focus, the academic dean suspects that XYZ students score higher than the national average. A random sample of 40 XYZ students had an average SATCR score of 522. Assume that the population standard deviation is the standard deviation of scores at XYZ College. Does the sample data support the dean's claim that XYZ College students have a higher average score?
 (draw the population, with sample n=40 removed, label)
 Step 1: What are the null and alternative hypotheses? What significance level should we use?
 H_{0}: μ=496; The average SATCR score for students at XYZ College is 496.
 H_{a}: μ>496; The average SATCR score for students at XYZ College is greater than 496.
 α=.05
 Step 2: What is the evidence against H_{0}?
 sampling distribution? xbar
 is it safe to use the distribution to determine area under the curve? yes n=40
 test statistic?
 xbar = 522
 the sd for the sampling distribution of xbar is σ/√n = 114/√40 ≈ 18
 z = (xbar − μ_{0}) / (σ/√n) = (522 − 496) / 18 = 1.44
 Step 3: What is the probability of obtaining a sample result this extreme or more extreme, if H_{0} is true (if mean of SATCR for XYZ College students is 496)?
 (draw Normal distribution, label z=1.44, shade area above)
 P(Z > 1.44) = .075
 p=.075
 Step 4: What do we conclude?
 p=.075 is not less than α=.05, we fail to reject H_{0}
 there is not enough evidence to conclude that the mean SATCR score at XYZ College is greater than 496.
 We have just performed a particular test of significance. What is this test called?
 z test for a population mean
 with known population standard deviation σ
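The four steps of the SATCR example can be sketched in code. This is a minimal illustration, assuming Python's standard-library `statistics.NormalDist` in place of a Normal table; the function name `z_test_mean` and its parameters are hypothetical, not from the text.

```python
from math import sqrt
from statistics import NormalDist


def z_test_mean(xbar, mu0, sigma, n, alternative="two-sided"):
    """Return (z, p_value) for a z test of H0: mu = mu0 with known sigma."""
    z = (xbar - mu0) / (sigma / sqrt(n))  # test statistic
    cdf = NormalDist().cdf                # standard Normal cumulative probability
    if alternative == "greater":
        p_value = 1 - cdf(z)              # area above z
    elif alternative == "less":
        p_value = cdf(z)                  # area below z
    elif alternative == "two-sided":
        p_value = 2 * (1 - cdf(abs(z)))   # both tails
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return z, p_value


# SATCR example: H0: mu = 496, Ha: mu > 496, sigma = 114, n = 40, xbar = 522
z, p_value = z_test_mean(xbar=522, mu0=496, sigma=114, n=40, alternative="greater")
alpha = 0.05
reject_h0 = p_value < alpha

print(f"z = {z:.2f}, p = {p_value:.3f}, reject H0 at alpha = {alpha}: {reject_h0}")
```

The code keeps full precision in the standard deviation of xbar, so the p-value can differ in the third decimal place from the hand calculation that rounds σ/√n to 18.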
Estimating with Confidence (6.1)
 What is the best estimate of the population mean, μ? Why?
 the sample mean, xbar, of a random sample is an unbiased estimate of μ.
 law of large numbers says that as the sample size increases, xbar will approach value of μ.
 What can we use to help us better understand the possible values of xbar?
 sampling distribution of xbar.
 shows the variability
 Example: Suppose you want to know the average height of undergraduate women at Rutgers. We know that the standard deviation of heights of young women in the US is 2.5 inches. We obtain a random sample of 100 women currently enrolled as undergraduates at Rutgers. Their mean height is 64.8 inches with a standard deviation of 2.7 inches. What do we know about the sampling distribution of xbar, the sample mean height of Rutgers undergraduate women?
 CLT says the sampling distribution is N(μ, σ/√n) = N(μ, .25)...σ/√n = 2.5/10 = .25
 draw and label distribution
 note: σ = 2.5 is not the sd for Rutgers undergrad women; in most cases we will not know the population standard deviation
 How might we construct a 95% confidence interval for the mean of this distribution?
 use the standard deviation rule for 95%
 draw ±2 standard deviations (μ − .50, μ + .50) onto the distribution
 95% chance that xbar occurs between μ − .50 and μ + .50
 as the distance between xbar and μ is the same from either direction, we can flip this statement to say
 we are 95% confident that the true population parameter, μ, falls within the interval and
 in context: we are 95% confident that the average height of Rutgers undergraduate women is between 64.3 inches and 65.3 inches
 of course our method could be wrong...which we estimate to be the case 5% of the time.
 What name do we use to refer to the value .50 in our example?
 margin of error
 quantifies the variability of the estimate in relation to our level of confidence
 for 95% confidence, margin of error ≈ 2(σ/√n)
 What is the general form for a confidence interval?
 estimate ± margin of error
 for a sample mean.... xbar ± 2(σ/√n)
 example, we are 95% confident that the average height of Rutgers undergraduate women is 64.8 ± .5 inches.
 How does the following image support "the population parameter, μ, must be within roughly 2 standard deviations from the sample average, xbar, in 95% of all samples."
 (display image of confidence intervals for a sample of xbars...from text)
 out of the 25 confidence intervals displayed, only one range does not include μ.
 Using the standard deviation rule, we said the general form of the confidence interval for the population mean is xbar ± 2(σ/√n). How can we make this more precise?
 note that there is a 95% chance that a Normal random variable will take a value within 1.96 standard deviations of its mean.
 z-score = ±1.96 bounds the middle 95%.
 xbar ± 1.96(σ/√n)
 How do we adjust the method if we want to be 99% confident or 90% confident?
 C is used to indicate the confidence level: C=.90, C=.95, C=.99
 adjust the margin of error to be larger or smaller...to be more or less confident
 (display image of N dist, show C% under Normal curve, with ± margin of error)
 What are the values which bound the middle 99% and 90% of the distribution of xbar?
 the corresponding z-score × σ/√n
 z-score for .90 = 1.645, z-score for .99 = 2.576
 The heights of Rutgers undergraduate women have an unknown mean (μ) and known standard deviation σ = 2.5. A simple random sample of 100 women is found to have a sample mean height xbar=64.8. Estimate μ with a 90%, 95%, and 99% confidence interval.
 What do we notice about the size of these intervals?
 (draw them on a normal distribution with mean 64.8)
 the more confident, the wider the interval for μ...the less precise the estimate
 There is a tradeoff between the level of confidence and the precision with which the parameter is estimated.
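The three intervals for the heights example can be computed as follows. This is a sketch under the example's assumptions (known σ = 2.5, n = 100, xbar = 64.8), using Python's standard-library `statistics.NormalDist` to look up z*; the helper name `z_confidence_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(xbar, sigma, n, confidence):
    """Return (low, high) for xbar +/- z* sigma/sqrt(n), with sigma known."""
    # z* bounds the middle `confidence` share of the standard Normal curve,
    # so it is the quantile at (1 + C) / 2.
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z_star * sigma / sqrt(n)
    return xbar - margin, xbar + margin


intervals = {
    c: z_confidence_interval(xbar=64.8, sigma=2.5, n=100, confidence=c)
    for c in (0.90, 0.95, 0.99)
}

for c, (low, high) in intervals.items():
    print(f"{c:.0%} CI: ({low:.2f}, {high:.2f}), width = {high - low:.2f}")
```

Printing the widths side by side makes the tradeoff visible: each step up in confidence widens the interval around the same estimate.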
 What are the general formulas for the confidence interval and the margin of error?
 confidence interval: xbar ± z*(σ/√n)
 z* is the ± z-score which bounds the middle C% of Normal dist
 (label z* as confidence multiplier and σ/√n as st. dev. of estimate)
 margin of error: m = z*(σ/√n)
 estimate ± margin of error
 (draw line with m in each direction of estimate...confidence interval is length 2m)
 m tells us how precise the confidence interval is....the estimate tells us the location
 How can we use m to make the confidence interval more precise? m = z*(σ/√n)
 a larger n will result in a smaller margin of error
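 How is the margin of error impacted if we increase the sample size of Rutgers undergraduate women to 400? xbar = 64.8, population standard deviation σ = 2.5, 95% confidence for n=100 is (64.31, 65.29)
 95%: 64.8 ± 1.96(2.5/√400) = 64.8 ± .245 = (64.555, 65.045)...four times the sample size cuts the margin of error in half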
 a sampling distribution of Xbar based on larger sample size has a smaller SD...less spread out.
 larger sample size means we are more confident in our estimate being close to μ.
 sometimes a larger sample size is too costly, or simply not available.
 An educational researcher is interested in estimating μ, the mean score on the math part of the SAT (SATM) of all community college students in his state with a margin of error of 5, at the 95% confidence level. What is the sample size needed to achieve this? (σ is assumed to be 100).
 If the answer were 1600.2, how would you decide what n to use?
 round up to next person to be more conservative....larger sample will get you slightly smaller m
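The sample-size question can be worked by solving m = z*(σ/√n) for n, which gives n = (z*σ/m)^2, and then rounding up. A minimal sketch, assuming Python's standard-library `statistics.NormalDist` for z*; the function name `sample_size_for_margin` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_margin(sigma, margin, confidence):
    """Smallest n whose margin of error z* sigma/sqrt(n) is at most `margin`."""
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    # Solve margin = z* sigma / sqrt(n) for n, then round up to a whole person.
    return ceil((z_star * sigma / margin) ** 2)


# SATM example: margin of error 5 at 95% confidence, sigma assumed to be 100
n_needed = sample_size_for_margin(sigma=100, margin=5, confidence=0.95)
print(n_needed)
```

Rounding up rather than to the nearest integer is the conservative choice described above: the slightly larger sample gives a margin of error no bigger than the target.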
 What is the most important requirement underpinning the accuracy of confidence intervals?
 the data are a random sample from the population....for this method the data must be an SRS
 the margin of "error" includes only random sampling error....the differences between one random sample and another.
Introduction to inference (intro)
 What is statistical inference?
 Display OLI big picture image
 inferring something about the population based on what is measured in the sample
 What role does probability play in statistical inference?
 tells us what might happen by chance alone
 What role does a sampling distribution play in statistical inference?
 gives us information about the variability of samples, if we were to use the inference method many times
 more theoretical than practical as we rarely know the truth about a population
 What requirement must underlie the sampling distribution for us to safely use it for making statistical inference?
 The data come from a random sample or a randomized experiment
 In a recent poll of a random sample of 1,200 undergraduates, the average amount of time spent on the internet was 19 hours per week. We are 95% confident that μ, the mean amount of time U.S. undergraduates spent on the internet per week, is between 18.36 hours and 19.64 hours. What do we call this type of interval?
 confidence interval....95% confident
 any thoughts on how this works?
 It was claimed that among all U.S. adults, about half are in favor of instituting national standards in K-12 schools and about half are against it. In a recent poll of a random sample of 1,200 U.S. adults, 40% were in favor of instituting national standards. These data, therefore, provide some evidence against the claim. What statistical method is employed in this statement?
 test of significance
 hypothesis testing
 any thoughts on how this works?
Sampling distributions for counts and percents (5.2)
 A survey of 100 high school students randomly selected from the NYC school district asked the following. What is the difference between these two random variables:
 How many minutes of study hall did you have today? (X)....quantitative, continuous
 useful to calculate the sample mean
 Did you buy food in your school's cafeteria today? (Y)....categorical, discrete
 useful to calculate the count of yes's (or no's)
 counts are a common variable in statistics
 The shape of the sampling distribution of sample means is Normal. What is the shape of the sampling distribution of counts?
 Binomial
 we will not be discussing the specifics of this distribution, but if you are going on in statistics, suggest studying this distribution.
 What else do we often report along with an overall count, which helps us interpret the meaning of the count?
 sample proportion
 Let's say that Y = 56....phat = 56/100 = .56
 if we had a different sample size, we would get a different phat.
 What does the sampling distribution of phat look like?
 eg, the distribution of all possible phats calculated from all possible 100 student samples from the NYC high schools.
 (display sampling distributions for n=100 and n=2500 from ips7e, chapt 3.3)
 center: mean of phats should equal population proportion, p
 spread: the larger the n, the less spread out the possible phats around p
 shape: appears Normal
 in fact the sample proportion is a special case of the mean.....the mean of the data if coded 1 and 0.
 What is the mean of the sampling distribution of the sample proportion?
 μ_{phat} = p
 What is the standard deviation of the sampling distribution of the sample proportion?
 σ_{phat} = √(p(1 − p)/n)
 in fact this formula is not exactly right when the sample is an SRS (drawn without replacement)
 it is a good approximation when the population is at least 20 times larger than the sample.
 What can we say about phat as it relates to its estimation of p?
 phat is an unbiased estimate of p
 How does the formula for σ_{phat} confirm that the variability will decrease as the sample size increases?
 sample size is in the denominator....as n gets bigger, σ_{phat} gets smaller
 If the sample proportion is in fact a mean, what can we use to support the idea that the sampling distribution of the sample proportion is Normal?
 the central limit theorem says that phat is approximately Normal when the sample size is large
 phat ~ N(p, √(p(1 − p)/n))
 draw Normal distribution and label μ = p and σ = √(p(1 − p)/n)
 How large does the sample size need to be in order for phat to be Normally distributed?
 this is the case when the sample size is large.
 rule of thumb: np ≥ 10 and n(1 − p) ≥ 10
 The rule takes into account that Normal approximation is most accurate for any fixed n when p is close to 0.5, and least accurate when p is near 0 or near 1 (draw Normal distribution around .90, needs to be low variability).
 The frequency of color blindness (dyschromatopsia) in the Caucasian American male population is about 8%. We take a random sample of size 125 from this population. What is the probability that six individuals or fewer in the sample are color blind?
 Step 1: Is it OK to use the Normal distribution to calculate the probability?
 np = (125)(.08) = 10; n(1 − p) = (125)(.92) = 115
 125 is the smallest sample allowable given p = .08; if p were closer to .5 a smaller sample size would work
 Step 2: Calculate phat: phat = 6/125 = .048
 Step 3: Calculate z-score: z = (phat − p) / √(p(1 − p)/n) = (.048 − .08) / √((.08)(.92)/125) = −.032/.0243 = −1.32
 Step 4: Determine P(Z < −1.32)
 Use Normal calculator, P(Z < −1.32) = .0934
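The color-blindness steps can be reproduced in a few lines. This is a sketch of the Normal approximation only, assuming Python's standard-library `statistics.NormalDist` as the "Normal calculator"; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p = 0.08    # population proportion of color-blind males
n = 125     # sample size
count = 6   # "six individuals or fewer"

# Step 1: rule of thumb for using the Normal approximation
normal_ok = n * p >= 10 and n * (1 - p) >= 10

# Step 2: sample proportion at the cutoff
phat = count / n

# Step 3: standardize using the sd of the sampling distribution of phat
sd_phat = sqrt(p * (1 - p) / n)
z = (phat - p) / sd_phat

# Step 4: area under the standard Normal curve below z
prob = NormalDist().cdf(z)

print(f"rule of thumb met: {normal_ok}")
print(f"phat = {phat:.3f}, z = {z:.2f}, P(phat <= {phat:.3f}) = {prob:.4f}")
```

The code does not round z before looking up the area, so the probability can differ slightly in the fourth decimal place from the value obtained with z = −1.32.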
The sampling distribution of a sample mean (5.1)
 Let's again try the sampling distribution simulation, this time using Normal population distribution. What can we conclude about the distribution of sample means?
 demonstrate with sampling applet, Normal population distribution, mean, N=5 vs. N=20
 the spread (variability) of distributions of sample means depends on the sample size
 distributions of sample means tend to be Normal in shape.
 What determines the sampling distribution of xbar?
 design used to produce the data
 the sample size n
 the population distribution
 Assuming a sampling distribution of means from repeated random samples of size n, what is the mean of all of the sample means (xbars)?
 (draw sampling distribution of xbars)
 μ, the mean of the population
 text explains the theory for why this is true
 We know that a sampling distribution of sample means based on a larger sample size has a smaller spread. What is the standard deviation of the sampling distribution of means?
 σ_{xbar} = σ/√n
 note that with n in the denominator, as n increases, the σ_{xbar} decreases.
 the distribution of means is less variable than the original population (averages less variable than individual observations)
 xbar is an unbiased estimator of the population mean, μ; it will be correct on average. How can we improve the accuracy of xbar in estimating μ?
 increase the sample size
 reduces the spread of the sampling distribution
 A population is distributed N(μ, σ). How do we denote the sampling distribution of the sample means?
 (display image of sampling distribution overlaid on population distribution)
 N(μ, σ/√n)
 draw Normal distribution and label μ_{xbar} = μ and σ_{xbar} = σ/√n
 Let's assume the population of SAT Math scores is Normally distributed with a mean of 514 and a standard deviation of 114. Based on random samples of size 30, what are the mean and standard deviation of the sampling distribution?
 mean = 514; standard deviation = 114/√30 = 20.8
 What is the probability that a sample of 30 students has a mean of less than 555?
 (draw normal distribution....area under the curve less than 555)
 z = (555 − 514) / 20.8 = 1.97
 P(Z < 1.97) = .9756
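The SAT Math calculation can be checked by building the sampling distribution N(μ, σ/√n) directly. A minimal sketch, assuming Python's standard-library `statistics.NormalDist`; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 514, 114, 30       # population mean, population sd, sample size

sd_xbar = sigma / sqrt(n)         # sd of the sampling distribution of xbar
xbar_dist = NormalDist(mu, sd_xbar)

prob_below_555 = xbar_dist.cdf(555)   # P(xbar < 555)
z = (555 - mu) / sd_xbar              # the same cutoff as a z-score

print(f"sd of xbar = {sd_xbar:.2f}")
print(f"z = {z:.2f}, P(xbar < 555) = {prob_below_555:.4f}")
```

Working with `NormalDist(mu, sd_xbar)` skips the standardizing step; the z-score is computed only to show that both routes describe the same area.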
 But many populations are not Normal...more uniform in density, or strongly skewed to the right or left. In these situations, is xbar (the mean of the sample) a good estimator for μ (the mean of the population distribution)? Why or why not?
 yes, the central limit theorem allows us to apply the ideas for sampling distributions from Normal populations to sampling distributions from non-Normal populations
 CLT: When randomly sampling from any population with mean μ and standard deviation σ, when n is large enough, the sampling distribution of xbar is approximately Normal: ~ N(μ, σ/√n).
 (demonstrate with clt applet for skewed and custom distributions)
 What does the central limit theorem require?
 How large does the sample size need to be to result in a Normal enough sampling distribution?
 depends on how far from Normal the population distribution is.
 25-30 is good enough for strongly skewed distributions or ones with mild outliers
 40 is usually large enough to overcome extreme skewness and outliers
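In place of the applet, the central limit theorem can be illustrated with a small simulation. This is only a sketch: the exponential population (mean 1, sd 1, strongly right-skewed), the sample size of 40, the 5,000 repetitions, and the fixed seed are all assumptions chosen for illustration, not values from the text.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)   # fixed seed so the run is repeatable

n = 40           # size of each sample
reps = 5000      # number of repeated samples

# Population: exponential with rate 1 (right-skewed, mean 1, sd 1).
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(reps)
]

center = mean(sample_means)    # should be close to the population mean, 1
spread = stdev(sample_means)   # should be close to sigma / sqrt(n)

print(f"mean of sample means = {center:.3f} (population mean is 1)")
print(f"sd of sample means   = {spread:.3f} (sigma/sqrt(n) = {1 / sqrt(n):.3f})")
```

A histogram of `sample_means` would look roughly Normal even though the population is skewed; rerunning with a smaller n, such as 5, leaves visible skew in the sample means.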
 Household size in the United States has a mean of 2.6 people and standard deviation of 1.4 people.
 What is the probability that a randomly chosen household has more than 3 people?
 the first thing to consider is the shape of the population distribution.....probably skewed.
 not appropriate to use Normal distribution to obtain this probability because distribution is skewed right.
 What is the probability that the mean size of a random sample of 10 households is larger than 3?
 sample size is too small to assume sampling distribution would be Normal
 What is the probability that the mean size of a random sample of 100 households is larger than 3?
 we can now use the CLT to treat the sampling distribution of the mean as approximately N(2.6, 1.4/√100) = N(2.6, .14)
 we can use standardized scores to calculate probability
 (draw normal distribution....area under the curve greater than 3)
 z = (3 − 2.6) / .14 = 2.86
 P(Xbar > 3) = P(Z > 2.86) = P(Z < −2.86) = .0021
 How does the central limit theorem generalize to linear combinations of independent Normal random variables? to the sum or average of many small random quantities?
 if X and Y are independent Normal random variables, aX + bY is also Normally distributed. (a and b are fixed numbers)
 applies even if not independent (but has to have low correlation) and have different distributions (as long as none overwhelms others in size)
 example height...an average of many small events: genes, nutrition, illness, etc.
Sampling distributions (intro)
 What do we call the probability distribution of a statistic calculated for a random sample?
 (display image from sampling applet)
 sampling distribution
 the statistic calculated from a random sample is a random variable
 the sampling distribution is the probability distribution of that random variable (the statistic being measured)
 How is the population distribution different from the sampling distribution?
 it's the density curve for the population of individuals
 sampling dist for sampling 1 individual at a time
 What is the population from which data was sampled?
 data on crime rates in Detroit from 1961-1973
 commute time for 13 Stat Methods I students
 ASK test scores for 75 7th graders from XYZ NJ public school
 nutrition data for 63 US cereals
Means and variances of random variables (4.4)
 What descriptive statistics could we calculate to help us describe a probability distribution or a density curve?
 mean and standard deviation
 similar to idea that descriptive statistics help describe a histogram or other type of graph.
 How do we calculate the mean (average) of the data values given only a frequency table?
 (display simple frequency table: value=0, 1, 2, 3, 4; count=3, 3, 1, 2, 1)
 (add up all of the scores represented in the table, divide by number of scores: xbar = (0 + 0 + 0 + 1 + 1 + 1 + 2 + 3 + 3 + 4) / 10 = 15 / 10 = 1.5)
 (rewrite the calculation to use frequencies: xbar = [0(3) + 1(3) + 2(1) + 3(2) + 4(1)] / 10 = 15/10 = 1.5)
 (distribute the denominator to each of the frequencies: xbar = [0(3/10) + 1(3/10) + 2(1/10) + 3(2/10) + 4(1/10)] = 0/10 + 3/10 + 2/10 + 6/10 + 4/10 = 15/10 = 1.5)
 this formula is a weighted average....take each value and weight it by its relative frequency (or probability) of occurring
 (display new frequency table with probabilities for each value)
 A probability distribution for a discrete random variable describes the long-run outcomes of a random phenomenon. What symbol should we use to denote the mean?
 μ....because it's in the long run....represents the population distribution
 we will write μ_{X} to denote the mean of random variable X.
 What is the formula for the mean of a discrete random variable?
 (display probability distribution from OLI133)
 μ_{X} = x_{1}p_{1} + x_{2}p_{2} + ... + x_{k}p_{k} = ∑x_{i}p_{i}
 note how each value of X is weighted by its probability....called a weighted average
 Does μ_{X} have to be a possible value of X?
 No, it can be any value between the min and max possible value of X
 What is the average family size in the US, given the following probability distribution?
 μ_{X} = 2(.44) + 3(.22) + 4(.20) + 5(.09) + 6(.03) + 7(.02)
 = .88 + .66 + .80 + .45 + .18 + .14 = 3.11
 Number of persons   2      3      4      5      6      7
 Probability         0.44   0.22   0.20   0.09   0.03   0.02

 Another term used to refer to the mean of a random variable is expected value. What is the expected value for family size in the US? Why do we "expect" this value?
 3.11
 Actually, we don't expect it to occur ever. It's the average we would expect in the long run....after counting all or nearly all of the families in the US.
 What is the mean payout of a state lottery which pays $500 for one 3-digit number chosen out of 1000 (i.e., 000 to 999)?
 probability distribution?
 Payoff X      $0      $500
 Probability   0.999   0.001

 mean? μ_{X} = 0(.999) + 500(.001) = 0 + .50 = $0.50, or 50 cents. Assume tickets cost $1...state makes half the money wagered....in the long run.
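Both expected values above (family size and lottery payoff) follow the same weighted-average recipe, which can be sketched as a small helper. The function name `expected_value` and the dictionary layout are illustrative assumptions; the probabilities are the ones given in the two tables.

```python
def expected_value(distribution):
    """Mean of a discrete random variable given as {value: probability}."""
    total = sum(distribution.values())
    if abs(total - 1) > 1e-9:
        raise ValueError("probabilities must add to 1")
    # Weight each value by its probability and add up the products.
    return sum(x * p for x, p in distribution.items())


family_size = {2: 0.44, 3: 0.22, 4: 0.20, 5: 0.09, 6: 0.03, 7: 0.02}
lottery_payoff = {0: 0.999, 500: 0.001}

mean_family_size = expected_value(family_size)
mean_payoff = expected_value(lottery_payoff)

print(f"expected family size = {mean_family_size:.2f}")
print(f"expected payoff      = ${mean_payoff:.2f}")
```

Neither result is a value the variable can actually take (no family has 3.11 members and no ticket pays 50 cents), which matches the point that the mean describes the long run rather than any single outcome.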
 The probability distribution of a continuous random variable is described by a density curve. Where is the mean for a symmetric distribution? ....for a skewed distribution?
 mean for a symmetric distribution is in the center
 mean for a skewed distribution is at the balance point (if we were to assume the density curve is made of a solid material)
 The law of large numbers says that as the number of randomly drawn observations (n) in a sample increases, the mean of the sample (xbar) gets closer and closer to the population mean μ. How do we interpret the following graph showing how the mean changes as we add more observations into our sample?
 (display graph showing comparison of means of larger and larger samples of young women, derived from N(64.5, 2.5))
 as the sample sizes get larger and larger the mean of the sample gets closer and closer to the mean of the population
 the sample mean is very close at a sample of 1000 and larger....estimated population mean can be quite wrong for smaller samples
 What statistic, in addition to μ, would be useful for describing a probability distribution?
 What do we call the squared standard deviation, σ_{X}^{2}?
 variance
 the rules we will study in this section use the variance, rather than the sd.
 What is the variance for family size in the US, given the following probability distribution?
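 find the weighted average of squared deviations from the mean
 σ_{X}^{2} = (2 − 3.11)^{2}(.44) + (3 − 3.11)^{2}(.22) + (4 − 3.11)^{2}(.20) + (5 − 3.11)^{2}(.09) + (6 − 3.11)^{2}(.03) + (7 − 3.11)^{2}(.02) = 1.578
 converting back to the stand dev: σ_{X} = √1.578 = 1.256
 Number of persons   2      3      4      5      6      7
 Probability         0.44   0.22   0.20   0.09   0.03   0.02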

 What is the formula for the variance of a discrete random variable?
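 (display probability distribution from OLI136)
 σ_{X}^{2} = (x_{1} − μ_{X})^{2}p_{1} + (x_{2} − μ_{X})^{2}p_{2} + ... + (x_{k} − μ_{X})^{2}p_{k} = ∑(x_{i} − μ_{X})^{2}p_{i}
 σ_{X} = √(σ_{X}^{2})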
 the larger the variance, generally speaking the more scattered the values of X
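The variance of family size can be worked through as a weighted average of squared deviations. A minimal sketch using the family-size probabilities given earlier; the variable names are illustrative.

```python
from math import sqrt

family_size = {2: 0.44, 3: 0.22, 4: 0.20, 5: 0.09, 6: 0.03, 7: 0.02}

# Mean: each value weighted by its probability.
mu = sum(x * p for x, p in family_size.items())

# Variance: each squared deviation from the mean weighted by its probability.
variance = sum((x - mu) ** 2 * p for x, p in family_size.items())

# Standard deviation: back on the original scale (persons).
sd = sqrt(variance)

print(f"mean = {mu:.2f}, variance = {variance:.3f}, sd = {sd:.3f}")
```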
 Let's say we have a random variable measured in inches and we want to convert to centimeters. Can we apply a linear transformation to the mean and variance?
 yes
 μ_{a + bX} = a + bμ_{X}
 σ_{a + bX}^{2} = b^{2}σ_{X}^{2}
 Let's say we have two random variables which we want to add together: the number of girls in a class and the number of boys in a class to get the total number of children in a class. Can we just add the means and variances?
 yes (for means) and not necessarily (for variances)
 μ_{X + Y} = μ_{X} + μ_{Y}
 In order to add variances we need to know whether or not the two variables are independent of each other. In the example, do you think the number of girls is independent of the number of boys?
 no, knowing the size of one tells a lot about the size of the other.
 If X and Y are two independent random variables, then σ_{X + Y}^{2} = σ_{X}^{2} + σ_{Y}^{2}
 If X and Y are two dependent (NOT independent) random variables, then σ_{X + Y}^{2} = σ_{X}^{2} + σ_{Y}^{2} + 2ρσ_{X}σ_{Y}
 ρ (rho) is the correlation in the population
 the correlation for two independent random variables is 0.
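 Consider the population of SAT math and reading scores. The means and sds are in the table below. The correlation is .68. What are the mean and standard deviation of the combined math and reading scores?
             Math   Reading
 Mean        514    496
 Stand dev   117    114
 μ_{M + CR} = μ_{M} + μ_{CR} = 514 + 496 = 1010
 σ_{M + CR}^{2} = σ_{M}^{2} + σ_{CR}^{2} + 2ρσ_{M}σ_{CR} = 117^{2} + 114^{2} + 2(.68)(117)(114) = 44,824.68
 σ_{M + CR} = √44,824.68 = 211.7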
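The addition rules for means and variances can be sketched as a helper and applied to the SAT figures. The function name `sum_mean_sd` is hypothetical; the rule it implements is the one stated above for two random variables with correlation ρ (ρ = 0 for independent variables).

```python
from math import sqrt


def sum_mean_sd(mu_x, sd_x, mu_y, sd_y, rho=0.0):
    """Mean and standard deviation of X + Y given the correlation rho."""
    mean = mu_x + mu_y   # means always add
    # Variances add, plus a correction term when X and Y are correlated.
    variance = sd_x ** 2 + sd_y ** 2 + 2 * rho * sd_x * sd_y
    return mean, sqrt(variance)


# SAT math (514, 117) and reading (496, 114) with correlation .68
total_mean, total_sd = sum_mean_sd(514, 117, 496, 114, rho=0.68)

# For comparison only: the sd we would get by wrongly assuming independence
_, sd_if_independent = sum_mean_sd(514, 117, 496, 114)

print(f"mean = {total_mean}, sd = {total_sd:.1f}")
print(f"sd if independence were assumed = {sd_if_independent:.1f}")
```

The comparison shows why the correlation matters: ignoring a positive correlation understates the spread of the combined score.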


Random variables (4.3)
 When we discussed the rules of probability, we mostly considered variables like m&m color, whether a coin toss resulted in heads or tails, the arrangement of boys and girls in a three child family. What kind of variables are these?
 categorical
 also result of a random phenomenon....not all variables are random
 What do we call a quantitative variable which results from a random phenomenon?
 a random variable
 a variable whose values are numerical outcomes of a random phenomenon
 we use capital letters at the end of the alphabet to denote random variables, e.g., X
 How do the following two examples differ?
 The number of people in a family (2 or more, live together, related by blood), chosen at random from all families living in the US. X can only take the values 2, 3, 4, 5, 6, 7,....max. What is the probability that the family has more than 6 members?
 The exact finish time of a randomly chosen 2011 Philadelphia marathon racer. Y is the race time and can take any value between 2:19:16 and 7:50:13 (first and last place). What is the probability that the racer finished in under 3 hours?
 The first is discrete: it has a finite number of possible values (although sometimes we can't enumerate them.....what is the max value for family size?)
 The second is continuous: it can take any value in an interval
 Often when we have a continuous variable, we will round it during the measurement process, e.g., daily high temperature, weight of an infant at birth, time spent commuting to class. The result is a finite number of possible values. Are these discrete or continuous variables?
 continuous variables in disguise
 treat as continuous
 What about variables with a lot of possible values, e.g., combined math, reading, and writing SAT scores, or the number of views for a youtube video? Are these discrete or continuous?
 discrete, but often we will treat them as continuous
 We can use the number of views for a youtube video as an indicator of popularity. What continuous variable could we use to measure popularity?
 Think back to the sampling distribution we created from the population. What were we measuring in each sample? Why is this measure a random variable? Is the mean discrete or continuous?
 (display applet image)
 mean of randomly drawn samples of n=10 size
 the mean results from the random selection of the sample
 can be any value within the range of the distribution (0-32)
 continuous
 Let's consider the random phenomenon of tossing a coin twice.
 sample space? S={HH, HT, TH, TT}
 probabilities of each? equally likely, mult rule=1/2*1/2=1/4
 Now let's consider the random variable number of tails in two coin tosses.
 sample space? S={0, 1, 2}
 probabilities? create the following table
 note use of add rule for disjoint events
 Outcomes | HH | HT, TH | TT
 Value of X | 0 | 1 | 2
 Probability | P(X=0) = 1/4 | P(X=1) = 1/4 + 1/4 = 1/2 | P(X=2) = 1/4

 What do we call this table of probabilities for a discrete random variable?
 the probability distribution of X
 What properties must all probability distributions satisfy?
 every probability is a number between 0 and 1, 0≤P(X=x)≤1
 the probabilities in the distribution add to 1, ∑ P(X=x) = 1 (summing over all values x)
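 The two-toss example above can be sketched in Python. This is only an illustration, assuming a fair coin; the names outcomes, p_outcome, and dist are made up for the sketch. It lists the sample space, adds the probabilities of the disjoint outcomes that share a value of X, and checks both properties of a probability distribution.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Sample space for two tosses of a fair coin: HH, HT, TH, TT (each 1/4).
outcomes = ["".join(p) for p in product("HT", repeat=2)]
p_outcome = Fraction(1, len(outcomes))

# X = number of tails; add the probabilities of disjoint outcomes with the same X.
dist = defaultdict(Fraction)
for outcome in outcomes:
    dist[outcome.count("T")] += p_outcome

for x in sorted(dist):
    print(f"P(X={x}) = {dist[x]}")

# Property 1: every probability is between 0 and 1.
assert all(0 <= p <= 1 for p in dist.values())
# Property 2: the probabilities add to 1.
assert sum(dist.values()) == 1
```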
 A young couple decides that they will continue to have children until they have a boy, or they have three children, whether they have a boy or not. (Let's assume that having a boy or a girl is equally likely, and that the child's gender in each birth is independent of the gender in the other births.) Let the random variable X be the number of children the couple has. What is the probability distribution of X?
 sample space? S={B, GB, GGB, GGG}
 probabilities for each outcome?
 P(B) = 1/2
 P(GB) = 1/4
 P(GGB) = 1/8
 P(GGG) = 1/8
 probability distribution?
 Outcomes | B | GB | GGB, GGG
 Value of X | 1 | 2 | 3
 Probability | P(X=1) = 1/2 | P(X=2) = 1/4 | P(X=3) = 1/8 + 1/8 = 1/4

 What do we use to visually display the distribution of a quantitative variable?
 a histogram
 as a discrete random variable is quantitative, we can use a histogram to display a probability distribution
 (draw histogram for previous example)
 What is the total area of the histogram? Why?
 1
 width is 1 unit and height is proportion....the sum of the bar heights is 1.
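 A short Python sketch of the couple example, under the same assumptions as the text (boy and girl equally likely, births independent). The variable names are illustrative only. It builds the distribution of X from the four outcomes and confirms that the histogram bars, each 1 unit wide, have a total area of 1.

```python
from fractions import Fraction

# Outcomes of the rule "stop at the first boy, or at three children".
outcomes = ["B", "GB", "GGB", "GGG"]

# Independent births with P(B) = P(G) = 1/2: multiply 1/2 once per child.
p_outcome = {o: Fraction(1, 2) ** len(o) for o in outcomes}

# X = number of children = length of the outcome string.
dist = {}
for o, p in p_outcome.items():
    dist[len(o)] = dist.get(len(o), Fraction(0)) + p

# Each histogram bar has width 1, so the total area is the sum of the heights.
total_area = sum(1 * height for height in dist.values())
print(dist, total_area)
```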
 The table below shows the distribution of family size in the US. What is the probability that a family has 5 or more members?
 P(X>=5) = P(5) + P(6) + P(7) = .09 + .03 + .02 = .14
 Number of persons | 2 | 3 | 4 | 5 | 6 | 7
 Probability | 0.44 | 0.22 | 0.20 | 0.09 | 0.03 | 0.02
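 The same calculation as a small Python sketch, using the probabilities from the family-size table. The dictionary name family_size is just a label chosen for this illustration.

```python
# Family-size distribution from the table (number of persons -> probability).
family_size = {2: 0.44, 3: 0.22, 4: 0.20, 5: 0.09, 6: 0.03, 7: 0.02}

# The probabilities should add to 1 (allowing for floating-point rounding).
assert abs(sum(family_size.values()) - 1) < 1e-9

# P(X >= 5): add the probabilities of the disjoint values 5, 6, and 7.
p_five_or_more = sum(p for size, p in family_size.items() if size >= 5)
print(round(p_five_or_more, 2))
```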

 Turning to a consideration of continuous variables, let's consider the spreadsheet random function...rand() which returns a random number between 0 and 1. What is the sample space?
 S={all numbers x such that 0 <= x <= 1}
 How can we graph this sample space?
 use a density curve
 as each possible number is equally likely, we use a uniform distribution
 (draw density curve)
 area under the curve is 1
 What is the probability distribution for a continuous variable? How do we use the probability distribution to find probabilities?
 the density curve
 the area under the curve for the values of X that make up the event.
 What is the probability that the spreadsheet random number will be between .3 and .7?
 (display density curve with shaded area)
 the area which corresponds to the event is length x height.
 length=region in event, height = 1
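 The area calculation can be sketched in Python. The helper uniform_prob is a hypothetical function written for this illustration, assuming a uniform density on the interval from 0 to 1 (height 1), as with the spreadsheet rand() function.

```python
def uniform_prob(a, b, low=0.0, high=1.0):
    """P(a < X < b) for X uniform on [low, high]: area = length x height."""
    height = 1.0 / (high - low)
    # Keep only the part of the event that lies inside the sample space.
    length = max(0.0, min(b, high) - max(a, low))
    return length * height

# Event: the random number falls between .3 and .7.
p = uniform_prob(0.3, 0.7)
print(round(p, 4))
```

 Note that uniform_prob(0.5, 0.5) returns 0, since a single value has zero length and therefore zero area.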
 What is the probability of a single event, e.g.,
 it's meaningless, assigned to 0
 as a continuous random variable has infinitely many possible values, the probability of any single value occurring is zero
 only intervals of values have positive probability
 How should we think about <= vs. < when working with discrete and continuous random variables?
 matters for discrete random variables....whether to include the probability of a particular outcome or not
 irrelevant for continuous variables
 In section 1.3 we looked at the density curve which would represent the heights of young women in the population, with mean of 64.5 in and stand dev of 2.5 in. What does this density curve look like? How do we denote this curve mathematically?
 (draw normal curve, indicate mean and sd)
 Normal distributions are probability distributions
 N(μ, σ) = N(64.5, 2.5)
 to fit with our new understanding, the area under the curve has to be 1
 now we can use it to find probabilities
 What is the probability that a randomly chosen 18-24 year old woman has a height between 62 and 67 inches?
 display standard deviation rule using probability notation:
 P(μ−σ<X<μ+σ)= 0.68
 P(μ−2σ<X<μ+2σ) = 0.95
 P(μ−3σ<X<μ+3σ) = 0.997
 using the standard deviation rule, P(62 < X < 67) = .68
 What is the probability that a randomly chosen 18-24 year old woman has a height greater than 69.5 in?
 (shade above 69.5 on normal dist)
 69.5 is μ+2σ, so probability is half of .05
 P(X > 69.5) = .025
 in fact, the area under curves is calculated using calculus...integration.
 a number of density curves are used in statistics routinely, in all cases we use tables or software to calculate the areas.
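 As one example of using software for these areas, here is a sketch with NormalDist from Python's standard statistics module (Python 3.8 or later). It assumes the N(64.5, 2.5) model for the heights of young women used above; the variable names are illustrative.

```python
from statistics import NormalDist

heights = NormalDist(mu=64.5, sigma=2.5)  # heights of young women, in inches

# P(62 < X < 67): within one standard deviation of the mean.
p_within_1sd = heights.cdf(67) - heights.cdf(62)

# P(X > 69.5): more than two standard deviations above the mean.
p_above_2sd = 1 - heights.cdf(69.5)

print(round(p_within_1sd, 4), round(p_above_2sd, 4))
```

 The software areas (about .683 and .023) are close to, but not exactly, the .68 and .025 given by the standard deviation rule, which is an approximation.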
Last bit of section 1.3
 Issue: the SD rule provides probabilities for only a limited set of values. How can we generalize this idea of finding the percent of data in a section of the curve so we can find any percentage, no matter the particular mean and sd of the curve?
 We can standardize the observations
 Look at the position of the value relative to μ and σ.
 Calculate the distance of the value from the μ, in standard deviations.
 What is the formula for standardizing a score?

 draw N dist, with μ and σ, then show how z is the number of standard deviations a point is from the mean.
 z = (value - mean)/stand dev
 If the heights of young men (20-29) are distributed N(69.3, 2.8), what is the z-score for a man who is: 64 inches tall? 79 in (6' 7") tall? What is the z-score for Yao Ming, who is 7' 6" tall?
 64 in: z = (64 - 69.3)/2.8 = -1.89
 79 in: z = (79 - 69.3)/2.8 = 3.46
 Yao Ming (90 in): z = (90 - 69.3)/2.8 = 7.39
 values above the mean are positive, values below the mean are negative
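 The standardizing formula as a small Python sketch, assuming the N(69.3, 2.8) model for the heights of young men. The function name z_score is chosen for this illustration.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

MEAN, SD = 69.3, 2.8  # heights of young men, in inches

# Yao Ming is 7' 6", which is 7 * 12 + 6 = 90 inches.
z_short = z_score(64, MEAN, SD)
z_tall = z_score(79, MEAN, SD)
z_yao = z_score(90, MEAN, SD)

print(round(z_short, 2), round(z_tall, 2), round(z_yao, 2))
```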
 If we standardized all of the values in the distribution, we create the standard Normal distribution. What is its μ and σ?
 μ=0 and σ=1, written N(0,1)
 How does the standard Normal distribution relate to random variables?
 density curve for the continuous random variable Z, where Z = (X − μ)/σ
 When we convert the data points in a distribution from the actual scale to the standardized scale, what kind of transformation are we making?
 How does Yao Ming at 7' 6" compare to the current tallest WNBA player Liz Cambage at 6' 8"? What can we do to compare their relative heights?
 compare their standard scores calculated
 Yao Ming: z = 7.39
 liz cambage:
 How can we use standard scores (z-scores) to help us find probabilities of events?
 find the area under the normal curve corresponding to the interval of interest.
 traditionally, we've used tables....turn to Table A
 Table A provides area below given z value
 z scores in first column, use columns to the right to refine
 What is the probability of a normal random variable taking a value less than 2.8 standard deviations above its mean?
 (display section of table from oli152)
 P(Z < 2.8) = .9974 or 99.74%.
 We said that a young man who is 64 in (5' 4") tall has a z-score of -1.89. What is the probability of being 64 in or taller?
 Find z-score -1.89 in table
 P(Z < -1.89) = .0294
 we want the area greater than -1.89, use the complement rule: P(Z > -1.89) = 1 - .0294 = .9706....97%
 Let's consider the population of SAT verbal scores, which are approximately Normal, N(505, 110). What is the proportion of students who have scores less than 600? Greater than 600?
 (include link to Normal calculator)
 (Sketch the distribution)
 X < 600
 z = (600 - 505)/110 = .8636
 P(Z < .8636) = .8051 (table), = .8061 (Normal calculator)
 P(Z > .8636) = 1 - .8061 = .1939 (complement rule)
 When we find the proportion of students who have scores above 600, how could we use the fact that the distribution is symmetric to find this value?
 same as P(Z < -.8636)....check in table
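 The SAT calculation can be checked with a short Python sketch, again using NormalDist from the standard statistics module and assuming the N(505, 110) model. It standardizes 600, then finds the area above it two ways: with the complement rule and with symmetry.

```python
from statistics import NormalDist

sat = NormalDist(mu=505, sigma=110)
std_normal = NormalDist()          # the standard Normal distribution, N(0, 1)

z = (600 - 505) / 110              # standardized score, about 0.8636

p_below = std_normal.cdf(z)        # P(Z < 0.8636)
p_above = 1 - p_below              # complement rule
p_above_sym = std_normal.cdf(-z)   # symmetry: P(Z > z) = P(Z < -z)

print(round(p_below, 4), round(p_above, 4), round(p_above_sym, 4))
```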
 How can we use the Normal distribution to find an x value if given a probability (inverse Normal calculations)?
 Use the table or software to find the z-score, then unstandardize to find the value of x.
 How high must a student score on the SAT verbal to be in the top 10%?
 (include link to inverse Normal calculator)
 use inverse Normal calculator....x = 645.97
 using table, find z-score for .90 in the table...closest is z=1.28 (note z for .1003 is -1.28)
 unstandardize: x = μ + zσ = 505 + (1.28)(110) = 645.8
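 A sketch of the inverse Normal calculation in Python, using the inv_cdf method of NormalDist from the standard statistics module and assuming the N(505, 110) model. It finds the z-score with area .90 below it, then unstandardizes.

```python
from statistics import NormalDist

sat = NormalDist(mu=505, sigma=110)

# z-score with 90% of the area below it (top 10% above it), about 1.28.
z_90 = NormalDist().inv_cdf(0.90)

# Unstandardize: x = mu + z * sigma.
cutoff = sat.mean + z_90 * sat.stdev

print(round(z_90, 2), round(cutoff, 2))
```

 Calling sat.inv_cdf(0.90) directly gives the same cutoff without the separate standardizing step.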
 Between what two z-scores is the probability .95? What rule does this result support?
 (draw distribution with .95 shaded and .025 on either side)
 using inverse Normal calculator...# of sd is +/- 1.96
 using table....prob below .025 is z = -1.96
 supports 2sd rule (bounds 95% of distribution).....z scores are sd units
 The z-score for Yao Ming's height is 7.39. What is the probability that a young man is taller? Is shorter?
 P(Z > 7.39) is approximately 0...it is never exactly 0.
 P(Z < 7.39) is approximately 1...it is never exactly 1.
 demonstrate on calculator
Probability models (4.2)
 If we want to describe the probability of a random phenomenon mathematically, we need to know 1) a list of possible outcomes (sample space), and 2) the probability for each outcome. What is the list of all possible outcomes for flipping a two-sided coin? What is the probability for each outcome?
 sample space S = {H, T}
 theoretically, P(H) = 1/2 and P(T) = 1/2
 We observe the number of baskets made for a basketball player who shoots three free throws. What is the sample space? What is the probability of each outcome?
 S = {0, 1, 2, 3}
 can't know....depends on how good the player is, and many other factors. Would need to observe the player shooting many free throws to establish a likelihood
 would be developed empirically, via observation
 What is the sample space for how many hours a randomly selected student studies in a day (rounded to the hour)? What is the probability for each outcome? What is the probability that a student studies more than 22 hours?
 S = {0, 1, 2,...22, 23, 24}
 would need to develop the probabilities for each of these empirically
 may intuit some of the probabilities, e.g., P(hours studied is more than 22) = 0,
 We could rewrite P(hours studied is more than 22), as P(A). What is A?
 an event...an outcome or set of outcomes of a random phenomenon, a subset of the sample space.
 note that "hours studied is more than 22" combines more than one possible outcome
 For situations in which we need to establish a probability empirically, how can we estimate P(A)?
 P(A) = (the number of times A occurred) / (the total number of repetitions (trials))
 this is the relative frequency (for example in a frequency table)
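 The relative-frequency idea can be sketched as a simulation in Python. Everything specific here is an assumption for illustration: the player is hypothetical and is taken to make each free throw independently with probability 0.7, and the function names are invented for the sketch. The estimate of P(A), where A is "makes all three", is the number of times A occurred divided by the number of repetitions.

```python
import random

def relative_frequency(event, trial, repetitions, seed=0):
    """Estimate P(A) as (times A occurred) / (total repetitions)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(repetitions) if event(trial(rng)))
    return hits / repetitions

def three_free_throws(rng, p_make=0.7):
    """Baskets made in three shots by a hypothetical 70% shooter."""
    return sum(rng.random() < p_make for _ in range(3))

# Event A: the player makes all three free throws.
estimate = relative_frequency(lambda made: made == 3, three_free_throws, 100_000)
print(estimate)  # should land near 0.7 ** 3 = 0.343
```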
 Probabilities are proportions; when rolling a die, P(1) = 1/6 (.166667). What is the possible range of values for a probability?
 Rule #1: probability is a number between 0 and 1, 0 <= P(A) <= 1
 as we saw, an event with P(A) = 0 never occurs and an event with P(A) = 1 always occurs
 What does P(sample space) equal (that one of the outcomes in the sample space definitely occurs in a trial)?
 Rule #2: P(S) = 1, where S is the sample space
 implies that the sum of the probabilities for all possible outcomes = 1.
 in a coin toss, P(heads) + P(tails) = .5 + .5 = 1
 If two events have no outcomes in common (disjoint), what is the probability that one or the other will occur?
 (display image of Venn diagram of A and B disjoint and not disjoint)
 the sum of the two individual probabilities
 Rule #3: If A and B are disjoint, P(A or B) = P(A) + P(B)...the addition rule for disjoint events
 What is the sample space of outcomes for flipping two fair coins? What is the probability of the event only heads or only tails?
 S = {HH, HT, TH, TT}. The probability of each of these events is 1/4, or 0.25.
 P(HH or TT) = P(HH) + P(TT) = 0.25 + 0.25 = 0.50
 What is the probability that an event does not occur? If P(A) = .6, what is P(not A)?
 (display venn diagram of complement)
 the probability that an event does not occur is 1 - the probability that the event does occur.
 Rule #4: P(not A) = P(A^{c}) = 1 − P(A)
 note that A^{c} stands for the complement of A (everything that is not in A)
 If we roll a six-sided die, what is the probability of the face on top not having 1 dot?
 P(not 1) = 1 - P(1) = 1 - 1/6 = 5/6
 When rolling a six-sided die, what allows us to use the addition rule for disjoint events for combining the probabilities of individual outcomes? Rolling a die: P(even) = P(2 dots) + P(4 dots) + P(6 dots) = ? vs. Tossing a coin repeatedly: P(last coin toss is a T) = P(T) + P(HT) + P(HHT) + P(HHHT)..... = ?
 the possible outcomes for rolling a die are finite (can be counted)
 also, the individual outcomes are disjoint (only one face can land on top in a single roll)
 If you draw an M&M candy at random from a bag, the candy will have one of six colors. The probability of drawing each color depends on the proportions manufactured, (insert table of probabilities, with blue missing). What is the probability that an M&M chosen at random is blue?
 (display probabilities for all except blue)
 What do we know?
 S = {brown, red, yellow, green, orange, blue}
 P(S) = P(brown) + P(red) + P(yellow) + P(green) + P(orange) + P(blue) = 1 (rule #2)
 What process should we use.....complement of blue....solve for blue
 P(blue)= 1 – [P(brown) + P(red) + P(yellow) + P(green) + P(orange)] = 1 – [0.3 + 0.2 + 0.2 + 0.1 + 0.1] = 0.1
 What is the probability that a random M&M is either red, yellow, or orange?
 (display frequency table)
 P(red or yellow or orange) = P(red) + P(yellow) + P(orange) = 0.2 + 0.2 + 0.1 = 0.5
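 The two M&M calculations as a Python sketch, using the color probabilities given in the worked answer above. The dictionary name known is just a label for this illustration.

```python
# Known color probabilities (blue is the missing one).
known = {"brown": 0.3, "red": 0.2, "yellow": 0.2, "green": 0.1, "orange": 0.1}

# Rule #2 and the complement rule: all six probabilities must add to 1.
p_blue = 1 - sum(known.values())

# Addition rule for disjoint events: a candy has exactly one color.
p_red_yellow_orange = known["red"] + known["yellow"] + known["orange"]

print(round(p_blue, 2), round(p_red_yellow_orange, 2))
```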
 Sometimes a random phenomenon produces outcomes which are all equally likely. An example that fits this model is rolling a six-sided fair die. What does it mean that each outcome is equally likely? What is the probability of rolling an even number (let's call this event E)?
 S = {1, 2, 3, 4, 5, 6}
 all 6 possible outcomes have the same probability of occurring (1/6)
 P(E) = P(2) + P(4) + P(6) = 1/6 + 1/6 + 1/6 = 3/6 = 1/2
 this is an instance in which we can use a theoretical understanding of the phenomenon
 What rule can we make for this situation? For a sample space of events which are equally likely how do we determine the value for P(A)?
 P(A) = (Number of possible outcomes in which EVENT A occurs) / (Number of possible outcomes in the sample space)
 for a sample space with k equally likely outcomes, each individual outcome has probability 1/k
 If we were to toss 2 six-sided dice, what is the probability that the two dice sum to 5?
 (display the sample space: grid of two-dice outcomes)
 How many possible outcomes: 36
 Are they all equally likely: yes, assuming fair dice
 What is probability for each individual outcome: 1/36
 P(the roll of two dice sums to 5) = P(1,4) + P(2,3) + P(3,2) + P(4,1) = 4 / 36 = 0.111
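 The counting rule for equally likely outcomes can be sketched in Python by listing the 36 ordered pairs, assuming two fair dice. The variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All ordered (first die, second die) pairs: 6 x 6 = 36 equally likely outcomes.
sample_space = list(product(range(1, 7), repeat=2))

# Event: the two dice sum to 5.
event = [roll for roll in sample_space if sum(roll) == 5]

# P(A) = (outcomes in A) / (outcomes in the sample space).
p_sum_5 = Fraction(len(event), len(sample_space))
print(event, p_sum_5, round(float(p_sum_5), 3))
```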
 In the dice example on the previous screen, does order matter?
 yes
 the grid of options lists the pairs such that order matters...there is a first die and a second die
 sometimes order doesn't matter, for example if you have 5 equally qualified people for two job openings and you want to randomly choose two people for the job.
 If we toss a coin twice, how are the two individual outcomes for each coin related?
 they are unrelated...we say they are independent
 Two events A and B are independent if knowing that one occurs does not change the probability that the other occurs.
 What does a Venn diagram look like for two events which are not disjoint? Could these be independent events?
 overlapping...(draw an example for A and B)
 yes, they could be independent
 example: A = {first coin toss is a head}; B = {second coin toss is a head}
 Consider the activity of tossing a coin twice. What is the probability that both coin tosses are heads?
 we looked at this previously: S = {HH, HT, TH, TT}. The probability of each of these events is 1/4, or 0.25.
 we can also say that the first coin will turn up heads half the time, and with the first coin heads, the second will turn up half of those times.....1/2 x 1/2 = 1/4 (draw successive partitioning of sample space)
 What general rule applies to finding the probability that two independent events will occur?
 Rule #5: If A and B are independent, P(A and B) = P(A)P(B)
 multiplication rule for independent events
 A couple wants three children. Genetics tells us that the probability that a baby is a boy (B) or a girl (G) is the same, 0.5.
 Sample space? S = {BBB, BBG, BGB, GBB, GGB, GBG, BGG, GGG}
 Equally likely? yes
 Probability of each? 1/8
 Independent events? yes
 Does the multiplication rule for independent events support that P(BBB) = 1/8? yes P(BBB) = P(B)* P(B)* P(B) = (1/2)*(1/2)*(1/2) = 1/8
 Want 2 or more girls, what is the probability?
 use the addition rule for disjoint events to calculate the probabilities for X.
 P(2 or 3 girls) = P(2 girls) + P(3 girls) = P(GGB or GBG or BGG) + P(GGG) = P(GGB) + P(GBG) + P(BGG) + P(GGG) = 1/8 + 1/8 + 1/8 + 1/8 = 4/8
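 The three-child example as a Python sketch, under the text's assumptions (boy and girl equally likely, births independent). It uses the multiplication rule for the probability of each outcome and the addition rule for the event "2 or more girls". The names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 8 boy/girl sequences for three children.
sample_space = ["".join(s) for s in product("BG", repeat=3)]

# Multiplication rule for independent events: (1/2) * (1/2) * (1/2) = 1/8.
p_each = Fraction(1, 2) ** 3

# Addition rule for disjoint events: add over outcomes with 2 or 3 girls.
p_two_or_more_girls = sum(p_each for s in sample_space if s.count("G") >= 2)
print(p_each, p_two_or_more_girls)
```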
 A child in a classroom is chosen at random. Event A = child is male; Event B = child is female. Are these two events disjoint or not? Are these two events independent or dependent?
 Disjoint, it's either one or the other
 Dependent, because if the child is a male, then probability that it's a female is 0
 Can two events be disjoint and independent?
 No, not when both events have positive probability
 Disjoint means that if A occurs, then B is not possible....the probability of B drops to 0, so knowing A changes the probability of B.
 If two events A and B are independent, what can we say about their complements?
 A^{c} and B^{c} are independent
 all combinations of A, B, A^{c} and B^{c} are independent
Randomness (4.1)
 In a study, why do we collect data? What is the goal of the study?
 we want to learn something about the population
 we want to answer the research question as it relates to the population (from which the sample was obtained)
 What does using a random sample help to control for?
 bias: eliminates bias in selecting a sample from the list of available individuals
 variability: we can use the sampling distribution and the laws of probability to control for variability
 What role does probability play in helping us make conclusions about populations from information about samples?
 (display oli big picture)
 Helps us quantify how random samples might differ.
 probability describes what will happen in the long run
 also called "chance"
 probability is a way to measure or quantify uncertainty; likelihood that something will happen
 What does it mean for something to be random?
 can't predict the outcome
 in a large number of repetitions (called trials), there is a pattern of results...a regular distribution
 (display results of random sampling from 3.3)
 Example: results of two series of 5000 tosses
 (display image from ips7e)....explain
 What is the probability that the result of a coin toss is heads?
 probability is .5
 each individual coin toss is random (uncertain).
 but, probability over many tosses is predictable.
 How does this graph help us understand this probability?
 The probability of heads is 0.5: the proportion of times you get heads in many repeated trials.
 What is required in order for the outcome to be predictable?
 the trials are independent (i.e., the outcome of a new coin flip is not influenced by the result of any previous flip).
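 The long-run behavior of the proportion of heads can be sketched with a simulation in Python. This assumes a fair coin and independent tosses; the seed and the checkpoints are arbitrary choices for the illustration, not the data behind the textbook graph.

```python
import random

rng = random.Random(1)          # fixed seed so the run is repeatable
checkpoints = {10, 100, 1000, 5000}
proportions = {}

heads = 0
for n in range(1, 5001):
    heads += rng.random() < 0.5  # one independent toss; True counts as 1
    if n in checkpoints:
        proportions[n] = heads / n

# The proportion wanders early on and settles near 0.5 as n grows.
for n in sorted(proportions):
    print(n, round(proportions[n], 3))
```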
Ethics (3.4)
 In studies with human subjects and which receive federal funds, what standard procedures must be implemented?
 must be under the supervision of an institutional review board
 all subjects must give their informed consent
 all individual data must be kept confidential
 The institutional review board
 reviews the plan of study
 can require changes
 reviews the consent form
 monitors progress at least once a year
 There is a shorter review process for studies with minimal risk; risks which are no greater than "those ordinarily encountered in daily life or during the performance of routine physical or psychological examinations or tests."
 which procedures would be minimal risk: ips7e 3.96
 Prior to participating in a study, subjects must give informed consent in writing. What must they be informed about?
 about the nature of a study
 any risk of harm it might bring.
 must balance providing information with biasing results; telling prospective subjects that
 they will be involved in something emotionally or physically difficult could scare them off;
 this survey was paid for by a particular candidate could influence responses (response bias)
 Who cannot give informed consent?
 prison inmates
 very young children
 people with mental disorders
 All individual data must be kept confidential. Only statistical summaries may be made public. How is confidentiality different from anonymity?
 anonymity means the researchers do not know the identity of the subject
 anonymity prevents followups to improve nonresponse or to inform subjects of results
 To protect confidentiality most organizations strip off the personal identifying information from the data files used for statistical work or for research. Each individual has only an ID.
 Clinical trials are experiments which study the effectiveness of medical treatments on actual patients – these treatments can harm as well as heal. What is controversial about the following?
 Randomized comparative experiments are the only way to see the true effects of new treatments.
 without them how do we know what is a useful treatment vs. risk to subjects in clinical trial
 Most benefits of clinical trials go to future patients.
 need to be sure there is some sort of benefit for subjects in trial
 "...the interests of the subject must always prevail over the interests of science and society." (1964 Helsinki Declaration of the World Medical Association)
 the best situation is when it's not known which is better treatment vs. placebo
 balancing the risk between taking an unproven drug (treatment) and not taking a promising drug (control)
 How are behavioral and social science experiments different from clinical trials, generally speaking?
 Not as much risk (or benefit) to subject
 May rely on hiding the true purpose of the study.
 Subjects would change their behavior if told in advance what investigators were looking for.
 require consent unless a study merely observes behavior in a public space
Toward statistical inference (3.3)
 In April 2012, CASA Columbia conducted a phone survey of 1,003 12- to 17-year-olds (493 males, 510 females) randomly selected from among all US households. 44% of respondents said they knew a classmate who sells drugs at school, and 60% said that drugs are available on school grounds. For whom are these statistics completely true? How would we like to apply the results?
 the percentages are true for the 1003 teens included in the survey.
 we would like to know the percentages for all US teens....this sample provides an estimate
 What do we call the use of a sample statistic to estimate the corresponding value in the population?
 statistical inference
 the estimate of the population is only as good as your sampling design
 the bigger the sample the better
 What terms do we use to distinguish between the numbers about the population which we would like to know, but which are too difficult to measure exactly, and the numbers we use to describe a sample?
 parameter is a number which describes a population; statistic is a number which describes a sample
 the value of the statistic is different for different samples
 we use the value of the statistic to estimate the unknown population parameter.
 If 862 teens out of 1003 said they know someone who is abusing substances during the school day, what is the estimated proportion for the sample? What is the corresponding population parameter?
 estimated proportion: phat = 862/1003 = .859
 population parameter is p.
 we use phat to estimate p.
 What might be the sample proportion if we asked a different 1003 teens from around the US if they know someone who is abusing substances during the school day?
 not likely to be the same, but unlikely to be much different than .859.
 random sampling helps to avoid bias in the sample, which could result in wildly different results (e.g., systematically favoring certain teens over other teens to be in the sample)
 to avoid bias, work hard to obtain a fully random sample from the population
 How can we use the idea that samples vary (sampling variability) to help us better understand the conclusions we can make about a population?
 the statistic calculated for all of the possible samples will have a certain distribution (the sampling distribution)
 we can model this distribution using simulation (the idea that we can use a computer to create 1,000's of pretend samples from a population with a given distribution)
 demonstrate with sampling applet, with uniform distribution, n=10
 What do we call the resulting histogram of calculated statistics from all possible samples of the same size from the same population?
 (display image from ips7e showing repeated sample from population with p=.6)
 sampling distribution
 What can we say about the histogram which results from graphing the mean of 10,000 samples from the uniform population in the applet?
 (use the applet to generate 10,000 samples)
 (display histogram of 10,000 samples of n=10)
 shape: seems normal (apply the normal fit)
 center: the mean of the sampling distribution is very similar to the population mean of 16
 spread: the sd of the sampling distribution is smaller than the sd of the population...rerun with n=5....note that sd is larger...for n=20, sd is smaller still.
 What can we conclude from the idea that the mean of the sampling distribution is very similar to the population mean?
 the statistic (xbar) appears to be an unbiased estimate of the population parameter (μ)
 bias is about the center of the sampling distribution
 the statistic is unbiased if the mean of the sampling distribution equals the value of the population parameter
 What can we conclude from the idea that the larger the sample size, the smaller the spread?
 (display histogram of 10,000 samples of n=20)
 that our sample statistic is likely to be a better estimate of the population parameter as the sample size gets larger.
 the variability of a statistic is described by the spread of the sampling distribution
 determined by the sampling design and the sample size (n)
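 The applet demonstration can be approximated with a Python simulation. This is a sketch under assumptions: the population is taken to be continuous uniform on 0 to 32 (so its mean is 16), which may not match the applet exactly, and sample_means is a name invented for the illustration. It draws 10,000 samples for each of n = 5, 10, and 20 and reports the center and spread of the resulting sample means.

```python
import random
from statistics import mean, pstdev

def sample_means(n, samples=10_000, low=0.0, high=32.0, seed=0):
    """Means of `samples` random samples of size n from a uniform population."""
    rng = random.Random(seed)
    return [sum(rng.uniform(low, high) for _ in range(n)) / n
            for _ in range(samples)]

summary = {}
for n in (5, 10, 20):
    means = sample_means(n)
    # Center and spread of the simulated sampling distribution.
    summary[n] = (mean(means), pstdev(means))
    print(n, round(summary[n][0], 2), round(summary[n][1], 2))
```

 In each case the center should sit near the population mean of 16, and the spread should shrink as n increases.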
 How can we use the idea of shooting arrows at a target to explain bias and variability?
 (display ips7e image of targets showing bias and variability)
 the arrows are the samples, the bull's eye is the population parameter
 bias means the aim is off and the arrows land consistently in one general area away from the bull's eye
 large variability means the arrows hit in a widely scattered pattern
 How do we reduce bias in a sample?
 use random sampling, e.g., SRS from entire population
 How do we reduce variability in a sample?
 What term do we use to report how far off a sample statistic might be from the true population parameter?
 margin of error
 44% +/- 3% creates a band which quantifies how much error there is in our estimate of the population parameter
 Let's review, what is the goal of statistical inference?
 to estimate the population parameter (using a sample statistic)
 Why is randomization an essential element in statistical inference?
 helps us create a sample from which we will get the best estimate of the population parameter
 How can we use random sampling to help us control for
 Bias?
 Randomization helps us eliminate bias in selecting a sample from the list of available individuals.
 All of the individuals have an equal chance of being in the sample.
 A random sample offers the best opportunity to obtain an unbiased estimate of the population parameter
 Variability?
 We can use the sampling distribution and the laws of probability to
 control variability....the larger the sample size, the closer the statistic tends to be to the population parameter
 calculate a margin of error, which quantifies how much error there is in our estimate of the population parameter
 What are some of the real world problems which can get in the way of using the sampling distribution and the laws of probability?
 a sample which does not represent all parts of the population (undercoverage)
 lack of realism
 nonresponse in a sample
 not trivial problems
 (display table A.1 in casa columbia showing extensive attempts to create a random sample from the whole US)
Sampling design (3.2)
 Let's look at the big picture image again. What do we call the data that is collected for analysis in the study?
 (display big picture image from oli)
 the sample, the part of the population that we analyze in order to gather information.
 Why do research studies rely on a sample, rather than collecting data on the whole population?
 time, cost and inconvenience preclude including the whole population
 How should we think about the population in relation to a study?
 it's the group we are interested in, that we want to learn about, to which we would like to generalize our conclusions.
 the sample is the part from which we draw conclusions about the whole.
 we will want to consider the best method for choosing a sample, that is the sample design
 Obtaining a sample is often harder than it looks. What do we need to be careful about when choosing a sample?
 bias, that is choosing the sample in some systematic way which favors one part of the population over others.
 What should a researcher do if he/she suspects that there are issues with the sample?
 Do whatever is possible to remedy potential sources of bias
 Report what was done as part of the procedures followed.
 What is one concern when doing a sample survey?
 response rate: the percent of the original sample who provide usable data.
 Example: “Man on the street” survey, where the researcher asks whoever happens to come along. What do we call this type of sample?
 convenience sample
 cheap, convenient, often quite opinionated, or emotional
 How does using this method result in a biased sample?
 different locations or timing of the sampling could result in different conclusions.
 survey about gun control following a public shooting, or in a rural town vs. in an urban city struggling with gun violence
 Example: an online poll which invites anyone who comes along to participate. What do we call this type of sample?
 voluntary response sample, the people choose to be in the sample by responding to the invitation
 How does this method result in a biased sample?
 some people are more likely to respond than others, in particular people with negative opinions
 example: Ann Landers reported that 70% of the (10,000) parents who wrote in said that having kids was not worth it; if they had to do it over again, they wouldn't. But in a random sample of parents, 91% reported that they WOULD have kids again.
 What simple method can we use to avoid the biases inherent in having the researcher choose, or the people volunteer?
 What do we call a sample when each individual in the population has an equal chance of being chosen for the sample?
 simple random sample (SRS)
 How is the method of selecting an SRS similar to assigning subjects to treatment groups?
 Everyone in the population gets a label, then assign each subject to either be in the sample or not
 in fact, not only does each individual have an equal chance of being in the sample, each possible sample has an equal chance of being created.
 How can we use a spreadsheet to select an SRS?
 (open a spreadsheet with a population represented in the rows)
 assign a random number in a new column for each row of the population
 sort the data by the random number column
 select the first n rows of the sorted listing
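 The spreadsheet steps above can be sketched in Python. The population list and the function name simple_random_sample are hypothetical, made up for this illustration. The function attaches a random number to each individual, sorts by that number, and keeps the first n.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Select an SRS of size n by sorting on a random-number column."""
    rng = random.Random(seed)
    # Step 1: assign a random number to each row of the population.
    keyed = [(rng.random(), individual) for individual in population]
    # Step 2: sort the rows by the random number.
    keyed.sort(key=lambda pair: pair[0])
    # Step 3: select the first n rows of the sorted listing.
    return [individual for _, individual in keyed[:n]]

population = [f"person_{i}" for i in range(1, 101)]  # hypothetical population
sample = simple_random_sample(population, 10, seed=42)
print(sample)
```

 Python's built-in random.sample(population, n) gives an SRS directly; the longer version is shown only because it mirrors the spreadsheet steps.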
 What is the name for when we use chance to create the sample?
 What if we'd like to be sure that important groups within the population will be correctly represented in the sample? What sampling method should we use?
 stratified sampling
 first divide the population into groups (or strata), then choose a separate SRS from each stratum. Combine to create the full sample.
 similar to the idea of a block design in an experiment
 Do the samples from each stratum need to be the same size?
 No, e.g., you may want to represent the different strata proportionately in your sample (a university has a 60/40 ratio of women to men)
 a sample of people from a town which represents proportionately the various ethnic groups residing in the town.
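A proportional stratified sample can be sketched as follows. The roster of 600 women and 400 men, the function name `stratified_sample`, and the total sample size of 50 are hypothetical values chosen to match the 60/40 example.

```python
import random

def stratified_sample(strata, total_n, seed=None):
    """Draw a separate SRS from each stratum, sized in proportion to the stratum."""
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional allocation; in general, rounding can shift the total slightly.
        n_stratum = round(total_n * len(members) / population_size)
        sample[name] = rng.sample(members, n_stratum)
    return sample

# Hypothetical university roster: 600 women and 400 men (a 60/40 split).
strata = {
    "women": [f"W{i}" for i in range(600)],
    "men": [f"M{i}" for i in range(400)],
}
chosen = stratified_sample(strata, total_n=50, seed=2)
print({name: len(members) for name, members in chosen.items()})
```

With these numbers the sample holds 30 women and 20 men, so the strata appear in the sample in the same 60/40 proportion as in the population.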
 What sampling design would we use if we wanted to do a statewide survey of hospital patients, but wanted to limit the survey to a handful of hospitals, to reduce the time and costs involved in the survey process? How does this sample process work?
 a multistage sample
 determine the primary sampling unit (in this case hospital), randomly select the number needed.
 identify any additional strata (gender, age, ethnicity), randomly select patients from within each stratum
 Even with a strong sampling design, there are a number of ways in which bias may be introduced. Explain the following four issues:
 undercoverage: some groups in the population are left out of the process of choosing the sample
 e.g., the population record for a city may not include homeless people
 nonresponse: an individual chosen for the sample can't be contacted or doesn't respond
 e.g., some people choose not to respond to a telephone survey, no matter what.
 response bias: when a respondent does not respond in a fully truthful way
 e.g., respondent may lie if asked about illegal or immoral behaviors
 e.g., characteristic of interviewer may influence a response
 e.g., if asked about past events, memory of respondent may not be accurate
 wording of question: questions must be written in a clear and nonleading manner
 e.g., do you oppose a ban on smoking?....double negative
 e.g., do you agree with most people that....leading
Design of experiments (3.1)
 Let's consider the example from ips7e, "Are smaller classes better?" In the 1980s a study was conducted in Tennessee where 6385 children were assigned to 3 different types of classes for kindergarten through 3rd grade: a regular class with 22 to 25 students and one teacher, a regular class with 22 to 25 students, a teacher and a teacher's aide, and a small class (13 to 17 students). In later years, student outcomes were measured using standardized tests, whether or not a student failed a grade, high school GPA, etc. What aspects of this study correspond to the following terms?
 experimental units: individuals in experiment (students)
 subjects: individuals in experiment when human beings (students)
 treatments: specific experimental condition applied to the units (3 different types of classes)
 factor(s): explanatory variable(s) (class type)
 factor levels: values of explanatory variable (regular class, regular class + aide, small class)
 The small class size study found that in later years students from small classes did better on many of the measures. What can we conclude about these results?
 Because other variables were controlled by the experiment (e.g., differences in schools and families), we can be confident that the class size made the difference
 in an observational study, class size would be confounded with many other variables which could influence the results.
 How can we include more than one explanatory variable in an experiment?
 Include a condition for each combination of levels of each factor
 (draw a two way factor table...3 class size conditions and 3 levels of teacher experience)
 Sometimes just the act of being part of an experiment (with the hope of getting better, or in response to personal treatment) a subject will have an improved outcome. What do we call this?
 The placebo effect
 a placebo is a fake pill, used in medical studies, so that individuals don't know whether or not they are getting the drug being studied
 it is estimated that the placebo effect can improve the outcome for as much as 35% of individuals
 not understood; used with children...kiss it better?
 How can we control for the placebo effect?
 Include a control group, which does not receive the treatment, in such a way that the subject does not realize they are not receiving the treatment
 What is the control group in the small classes study?
 There isn't one, although some might say the regular class is the control
 experiments compare one treatment with another and include a group that receives no treatment when possible.
 What issue might arise with an experiment if there is no control group included in the design?
 bias: a result which systematically favors a certain outcome
 What method is the best choice for assigning individuals in the sample to the treatment groups? Why?
 randomization
 we want to make the groups as equal as possible, so as to control for other possible confounding variables.
 Instead of randomization, why don't researchers balance the groups for the possible confounding variables?
 some can't be measured, others won't be considered
 the remedy is to use chance to assign the individuals
 What issue might arise if we only had a few individuals in our experiment, e.g., 4 individuals with 2 each assigned to 2 groups?
 outcome may reflect suitability of chosen individual to treatment; this individual is more or less likely to respond to treatment
 the chance variation in the individuals aligns with the treatments
 How do we control for this issue?
 Repeat each treatment on many individuals to reduce chance variation
 have enough experimental units in each group such that the chance variation of particular individuals averages out
 Let's consider an example: the average SAT math score over the last 10 years for students in a school who have taken an SAT prep class is 540. The school changes the format to an online course. The average SAT math score for the students who took the online prep course is 610. What is the issue with this study? How can we use a randomized comparative experiment to study the question?
 There is no way to directly compare the outcomes for the two classes: the students are from different years, and there may be confounding variables which account for the difference; it doesn't make sense to compare a 10-year average to a 1-year average
 (over a multiyear period or across multiple schools) randomly assign the students taking the SAT prep class to either the classroom or online version of the class; after they take the prep class, obtain their SAT math scores.
 What do we call an effect (outcome from an experiment) so large that it is unlikely to be due to chance?
 statistically significant
 it means there was good evidence for a result
 One of the requirements of a randomized comparative study is to randomize the assignment of each experimental unit (case, subject) to a treatment group. What is the intuitively obvious way to do this?
 Give all of the subjects an ID, write the IDs on a slip of paper and put them in a hat; choose the number of slips for the first group, the second group and so forth.
 another way to randomize is to use a table of random numbers: Table B in ips7e. The text includes instructions on how to use the table.
 How can we use software to randomly assign subjects to groups?
 use the random function in a spreadsheet.
 (demonstrate how to randomly assign the students in the class to 4 groups)
 in fact the table of random numbers and the spreadsheet function are only pseudorandom. Visit random.org for a true random number generator and to learn about why it is truly random.
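Random assignment to groups can also be done in a few lines of code. The sketch below assumes a hypothetical class of 22 students and uses Python's pseudorandom generator; the function name `assign_to_groups` and the seed are made up for the illustration.

```python
import random

def assign_to_groups(subjects, n_groups, seed=None):
    """Shuffle the subjects, then deal them out to the groups like cards."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    # Dealing in turn keeps the group sizes within one of each other.
    return [shuffled[i::n_groups] for i in range(n_groups)]

# Hypothetical class of 22 students, each with an ID.
students = [f"student_{i:02d}" for i in range(1, 23)]
groups = assign_to_groups(students, n_groups=4, seed=3)
for number, members in enumerate(groups, start=1):
    print(number, members)
```

This is the hat method in software: the shuffle plays the role of mixing the slips, and each group takes its share of the mixed slips.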
 What do we call an experiment in which experimental units are randomly assigned to treatment groups?
 completely randomized design
 What property must be fulfilled for experimental units to be considered randomly assigned?
 Each experimental unit has the same chance of being in any of the treatments
 Use a two step process, randomly assign experimental units to groups, then randomly assign groups to treatments
 (display diagram of process)
 Example of the 1969 draft lottery into the US Army: candidates' birthdates were drawn from a jar into which the birthdates had been entered month by month. It was noticed that birthdates in Nov and Dec tended to have a lower draft number than birthdates in other months. The jar was not fully mixed.
 How can we control for any effects an experimenter might have on the experimental units which could bias the results, e.g., smiling differently to subjects dependent on which group they are in?
 double-blind study: neither the subjects nor the experimenter knows which subject is in which treatment.
 but there are many things that are hard to control in a study, e.g., study of sugary soda....how to make the sugar-free soda taste like it is a typical sugar-based soda
 A well-designed experiment concludes that changes in the explanatory variable cause a variation in the response, but in studying the design of the experiment we notice a few issues. What can we do to provide further support for the conclusion?
 replicate the study with a new sample and a new situation
 What is one problem with the following studies: 1) studying behavior of rats on diets with varying levels of sugar to help us understand how diets high in sugar affect the behavior of children 2) studying the effect of smiling on a person's judgments using college students as the experimental units?
 lack of realism
 the idea behind an experiment is to be able to generalize the conclusions to individuals outside of the direct experiment, even though the experiment does not by itself provide the ability to do so.
 Sometimes we can control for confounding variables by having each of the individuals in the study participate in both treatments. What do we call this study design?
 Matched pairs
 could also choose matched pairs of subjects (matched on gender, age,...) and randomly assign to treatments....more open to confounding.
 Might also see this design used in twin studies, which assumes that each twin is a replicate of the same individual
 if there are more than 2 treatments, would be called repeated measures
 With a matched pairs design with each individual participating in both treatments (e.g., evaluating two shampoo products), how would we want to organize the treatments to control for timing effects?
 randomize the order in which the two shampoos are evaluated.
 randomly assign individuals to two groups; one group evaluates shampoo A first and then shampoo B, and the other group evaluates the shampoos in reverse order.
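Randomizing the order of the two treatments can be sketched as below. The eight subject labels, the function name `counterbalance`, and the seed are hypothetical; the shampoo names A and B follow the example.

```python
import random

def counterbalance(subjects, seed=None):
    """Randomly split subjects in half; one half gets order A-B, the other B-A."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    orders = {}
    for person in shuffled[:half]:
        orders[person] = ("A", "B")  # evaluates shampoo A first
    for person in shuffled[half:]:
        orders[person] = ("B", "A")  # evaluates shampoo B first
    return orders

# Hypothetical set of 8 subjects.
subjects = [f"subject_{i}" for i in range(1, 9)]
orders = counterbalance(subjects, seed=4)
for person in subjects:
    print(person, orders[person])
```

Any effect of going first or second is then spread evenly over both shampoos rather than lining up with one of them.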
 What if we suspect that gender will influence the results of the experiment. What design could we use to specifically control for this variable?
 Block design
 Randomly assign experimental units to treatments inside each block.
 (display example design for men and women)
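A block design separates the randomization by block. The sketch below assumes two hypothetical blocks (6 women, 4 men) and two treatments; all labels, the function name `randomize_within_blocks`, and the seed are illustrative assumptions.

```python
import random

def randomize_within_blocks(blocks, treatments, seed=None):
    """Shuffle the units inside each block, then deal treatments within that block."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, units in blocks.items():
        shuffled = list(units)
        rng.shuffle(shuffled)
        for position, unit in enumerate(shuffled):
            # Cycling through the treatments balances them within the block.
            assignment[unit] = (block_name, treatments[position % len(treatments)])
    return assignment

# Hypothetical blocks defined by gender.
blocks = {
    "women": [f"W{i}" for i in range(1, 7)],
    "men": [f"M{i}" for i in range(1, 5)],
}
assignment = randomize_within_blocks(blocks, ["treatment", "control"], seed=5)
for unit, (block_name, treatment) in sorted(assignment.items()):
    print(unit, block_name, treatment)
```

Each treatment group ends up with the same mix of women and men, so gender cannot be confounded with the treatment.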
Producing data (Chapt 3 intro)
 What do we call data which represents only a select individual situation or story (i.e., case) which comes to our attention because it is interesting or compelling?
 anecdotal evidence or data
 What's an example of anecdotal evidence?
 My car has over 150,000 miles on it and has never had anything major go wrong. Should other owners of this same kind of car expect the same results?
 What is the problem with anecdotal evidence?
 may not be representative of a larger population, may be an anomaly/outlier
 What use can we make of "already available" data, produced for some other purpose (fedstats.gov, nces.ed.gov, census.gov)?
 may be used to answer some questions, although not all. It is important to use a trustworthy source, and to ask questions which may be answered using the data.
 and some questions require data designed specifically to address that issue
 What if we wanted to collect our own data. What are two ways we could survey individuals? How do these ways differ?
 sample and census
 a sample uses a selection of individuals to represent the whole population, while a census attempts to contact every individual in the population.
 What is a sample survey?
 having a selected group from a population answer a set of questions; everyone answers the same questions; the researchers work hard to not influence the individuals or their responses.
 A sample survey is considered what type of data collection method?
 Observational
 observe individuals, measure variables, are careful not to influence the responses
 What type of data collection method deliberately influences the individuals, and then observes/measures the responses?
 Let's consider an example of an experimental study: We want to know if drinking sugary soda leads to being overweight in children. One group of children is assigned to drink a sugar-based soda every day, another group is given a sugar-free soda. The weight of each individual is measured every month. Why is this an experiment?
 The researcher is implementing a program (type of soda) in hopes of impacting the response (weight).
 What is one advantage of experimental over observational studies?
 provides evidence for a cause and effect relationship
 What terms do we use to describe the change that the researcher imposes on the individuals in an experiment?
 treatment, intervention, condition
 In what ways are observational studies at risk of false conclusions?
 lurking variable: a variable, other than the explanatory or response variables, which may influence the results
 confounding: when two or more variables are associated such that their influence on the response variable cannot be untangled
 How might we study the relationship between sugar-based soda and weight using an observational study design?
 Researchers identify a sample of children and measure how much sugar-based soda they drink and their weight each week for 18 months.
 Display oli "big picture" image and explain where we are
 producing data in specific ways helps lay the foundation for what we can conclude about the data with a given degree of confidence (this is statistical inference)
The question of causation (2.6)
 After observing that children with more books in their home have higher achievement in school, why should we NOT conclude that giving children more books will result in greater school achievement?
 We are making a causal conclusion based on an observed association
 Mantra: Association does not imply causation....we must be very careful how we word our conclusions.
 For the "more books at home associated with greater school achievement" relationship, what might be the cause of the higher achievement?
 the books by themselves are not doing anything.
 likely a lurking variable which causes both results....parents' education, parents' motivation
 (write example on board...draw model of assoc (dashed arrows) and causation (solid arrows))
 What might be responsible for the observed positive relationship between years of education and amount of annual income?
 it could be causal...more education means you can get a better job
 it could also be something else about the person, upbringing, social class, work ethic, which causes a person to get more education and to work harder at work.
 (write example on board...draw model of assoc (dashed arrows) and causation (solid arrows))
 Mantra: Association does not imply causation
 What might be responsible for the observed association in the cocaine study...that use of desipramine is associated with a smaller rate of relapse?
 seems causal...that the drug was effective in helping people not use cocaine as compared to Lithium and placebo
 (write example on board...draw model of assoc (dashed arrows) and causation (solid arrows))
 What is it about the cocaine study that allows us to suggest that the drug caused the improved results?
 the researchers ran an experiment in which they had direct control of the variable of interest (drug)
 the participants were randomly assigned to the treatments which in effect equalizes the effects of other variables across groups.
 What terms do we use to describe each of these models?
 causal, common response, confounding
 display diagrams for ips7e
 What is the cause of the association in the common response model?
 a lurking variable.
 another example of this is the number of firefighters is positively associated with the size of the fire
 a lurking variable (seriousness of the fire) explains both measures
 (display oli image of model)
 What is the cause of the association in the confounding model?
 Causation is shared among one or more lurking and explanatory variables.
 Whenever there are uncontrolled variables which may be related to a response variable, consider whether confounding may be an issue.
 The results of the nightlight study suggest that leaving a light on when a young child is sleeping may result in nearsightedness. What is the evidence for a causal relationship? How might we model the relationship?
 The evidence is weak as the study was observational; it did not attempt to control other variables by assigning children to sleep with a type of light.
 (draw a causation model with a lurking variable....parents' nearsightedness)
 Mantra: Association does not imply causation
 (Display xkcd.com correlation cartoon)
 How can we design a study to establish direct causation?
 design an experiment in which possible lurking variables are controlled
 we will discuss how to do this in the next section (producing data)
 But there are many pressing problems for which we cannot carry out an experiment in which we randomly assign people to different groups in order to control for other variables. Then how is it that we have concluded that smoking causes cancer?
 It could be that some other genetic factor causes nicotine addiction and lung cancer (lurking variable) or that smokers live unhealthy lives which reacts with the smoking to heighten their risk for cancer (confounding).
 (display 5 criteria for establishing causation, when an experiment is not possible.)
 the evidence linking smoking with cancer is strong, but a well-designed experiment, if it weren't unethical, would provide stronger evidence.
 Class assignment: In pairs, identify an example of a possible or tempting causation statement that does not rely on adequate evidence, and may well have lurking variables influencing the results.
Data analysis for two-way tables (2.5)
 What kind of graph and summary statistics do we use when we have one quantitative variable and one categorical variable?
 boxplots or back to back histogram or stemplot
 mean and sd, if symmetric and no outliers, median and quartiles otherwise
 What kind of graph and summary statistics do we use when we have two quantitative variables?
 Scatterplot
 correlation, linear regression equation
 What combination of two variables have we not discussed?
 two categorical variables
 How do we summarize the data for one categorical variable?
 counts and percents for each category
 How can we summarize the data for two categorical variables together?
 a two-way table that displays counts of observations for each combination of values for the two variables
 (display example)
 How should we position explanatory and response variables in the table?
 text: explanatory goes in the columns (horizontal axis) and response in rows (vertical axis)
 oli: display is opposite
 What do we call the variable whose values are in the rows? in the columns?
 The row variable
 The column variable
 What do we call the place where a row and a column category intersects?
 Example: binge drinking by college students (ips7e, p. 137) how many women are there who are nonfrequent binge drinkers?
 (display two-way table)
 8232
 What do we call this table given it has only two rows and two columns?
 What else would be useful to add to our table?
 total for each row and column...the margins
 (display expanded table)
 What would the dataset look like that would create this table?
 (draw data set rows and columns)
 how many variables: at least two, gender and frequent binge drinker
 what might the first row look like: pick two values from two-way table
 how many of the observations will have this pattern? (see cell in two-way table)
 what is a second possible observation: pick two different values from two-way table
 how many of the observations will have this pattern? (see cell in two-way table)
 # of total observations is: 17,096
 What else besides counts would be useful to add to our table?
 proportions or percents
 cell percents, row percents and column percents
 What do we call the collection of cell proportions?
 (display table with proportions)
 joint distribution...provides proportion of observations for each combination of values
 what proportion of the total are women who are not binge drinkers: .482
 How are these proportions calculated?
 The number in the cell divided by the total number of observations
 How do the proportions of women in each category compare to those of the men?
 proportion of women-yes slightly larger than men-yes
 proportion of women-no noticeably larger than men-no
 but there are more women in the sample, so we would expect this.
 Where in the two-way table is the distribution of each of the individual variables?
 in the margins....called the marginal distribution
 there are two...one for the row variable and one for the column variable
 this distribution can be counts, proportions or percents
 (display graphs with different options for both variables)
 How could we graphically display the marginal distributions?
 with a pie chart or a bar graph.
 (draw a bar graph for the distribution of gender)
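The joint and marginal distributions can be computed directly from the four cell counts. The counts below come from the figures quoted in this section (the men-no count is the men's total minus men-yes, 7180 − 1630 = 5550); the dictionary layout and variable names are choices made for this sketch.

```python
# Cell counts from the binge-drinking two-way table: (gender, frequent binge drinker).
counts = {
    ("men", "yes"): 1630,
    ("men", "no"): 5550,
    ("women", "yes"): 1684,
    ("women", "no"): 8232,
}
total = sum(counts.values())  # 17,096 observations

# Joint distribution: each cell count divided by the total number of observations.
joint = {cell: n / total for cell, n in counts.items()}

# Marginal distributions: add the joint proportions over the other variable.
gender_marginal = {}
binge_marginal = {}
for (gender, binge), proportion in joint.items():
    gender_marginal[gender] = gender_marginal.get(gender, 0) + proportion
    binge_marginal[binge] = binge_marginal.get(binge, 0) + proportion

print(round(joint[("women", "no")], 3))  # 0.482
print({name: round(p, 3) for name, p in gender_marginal.items()})
print({name: round(p, 3) for name, p in binge_marginal.items()})
```

The marginal proportions are what a bar graph of gender (or of binge drinking) alone would display.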
 So far we have not addressed the relationship between these two variables. What percents should we calculate to address the relationship?
 percent of women in the total sample who are binge drinkers: 1684/9916 = .170 = 17.0%
 percent of men in the total sample who are binge drinkers: 1630/7180 = .227 = 22.7%
 how does this compare with cell percents?....very different: with cell percents, women and men looked similar and the percents were smaller
 What do we call percents when they are calculated within the category of a second variable?
 What do we mean by conditional?
 Given only the data in one category of one variable (the explanatory variable) what percent of observations are in each of the categories in the other variable (the response variable)
 What is a conditional distribution?
 When we condition on one value of one variable (the explanatory variable) and calculate the distribution of the other variable
 (display full table conditioning on column variable)
 (display full table conditioning on both row and column variables)
 How do we calculate a conditional percent?
 For one of the categories in the explanatory variable, take the cell count corresponding to one of the values of the response variable and divide by the total for that category of the explanatory variable (*100).
 What is the calculation for the percent of men who are frequent binge drinkers: 1630/7180 (*100) = 22.7%
 What is the calculation for the percent of women who are frequent binge drinkers: 1684/9916 (*100) = 17.0%
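The conditional percents can be checked with a short calculation. The cell counts are those quoted in this section; the function name `conditional_percents` and the dictionary layout are assumptions made for the sketch.

```python
# Cell counts from the binge-drinking two-way table: (gender, frequent binge drinker).
counts = {
    ("men", "yes"): 1630,
    ("men", "no"): 5550,
    ("women", "yes"): 1684,
    ("women", "no"): 8232,
}

def conditional_percents(counts, given_gender):
    """Distribution of the response within one category of the explanatory variable."""
    group_total = sum(n for (gender, _), n in counts.items() if gender == given_gender)
    return {
        binge: 100 * n / group_total
        for (gender, binge), n in counts.items()
        if gender == given_gender
    }

men = conditional_percents(counts, "men")
women = conditional_percents(counts, "women")
print(round(men["yes"], 1), round(women["yes"], 1))  # 22.7 17.0
```

Within each gender the conditional percents add to 100, which is what makes the two groups directly comparable even though the sample contains more women than men.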
 What graphs might be useful to help us understand the relationship between two categorical variables?
 bar graph conditioned on categories of explanatory variable
 What statistic could we calculate to help summarize the relationship?
 there are only advanced techniques (which you will learn in 532)
 we must use well-chosen percents to help us understand the relationship (23% of men vs. 17% of women are frequent binge drinkers)
 Sometimes results can be influenced by a lurking variable....let's take a look at an example: Table of patient outcomes following surgery for two hospitals. Which hospital appears to be better? What else could be influencing the outcomes at the two hospitals?
 (display table of patient outcome by hospital, with calculated death rate for each)
 Hospital B
 The seriousness of the patients' ill health
 Here is the same data split out according to patients' condition: good or poor, with calculated death rates for each. What do we notice?
 Paradoxically, hospital A appears to do better in both categories
 What do we call this?
 Simpson's Paradox: the presence of a lurking variable which causes our understanding of the relationship to reverse direction when it is included in the analysis.
 What did we do to discover the effect of the lurking variable?
 created a three-way table which included the lurking variable
 If we have a three-way table and want to get back to a two-way table, what do we need to do?
 aggregate the data over the levels of the third variable
 when we have aggregated data, we may not be seeing the whole picture.....
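The reversal can be reproduced with a small calculation. The counts below are illustrative assumptions, chosen so that Hospital B looks better overall while Hospital A does better within each patient condition; they are not necessarily the counts in the table displayed in class.

```python
# Illustrative (assumed) counts: (died, total patients) by hospital and condition.
surgery = {
    "A": {"good": (6, 600), "poor": (57, 1500)},
    "B": {"good": (8, 600), "poor": (8, 200)},
}

def death_rate(died, total):
    """Death rate as a percent."""
    return 100 * died / total

# Three-way view: death rate within each level of the lurking variable.
by_condition = {
    hospital: {condition: death_rate(*cell) for condition, cell in cells.items()}
    for hospital, cells in surgery.items()
}

# Two-way view: aggregate the counts over the levels of the third variable.
aggregated = {}
for hospital, cells in surgery.items():
    died = sum(d for d, _ in cells.values())
    total = sum(t for _, t in cells.values())
    aggregated[hospital] = death_rate(died, total)

print(aggregated)    # Hospital B has the lower overall death rate
print(by_condition)  # Hospital A has the lower rate within each condition
```

The reversal happens because Hospital A treats far more patients in poor condition, and those patients have a higher death rate at either hospital.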
Cautions about correlation and regression (2.4)
 When do we use a best fit line to help describe the relationship between two variables?
 when both variables are quantitative
 when the x variable is explanatory and the y is response
 when their relationship appears to be linear
 How do we determine the best-fit line for a set of observations?
 the line which minimizes the sum of the squares of the vertical distances between the observed data points and the line.
 (draw a graph with 6 points and a best fit line....indicate the vertical distances which are minimized)
 If we consider the least-squares regression line as the "fit" to the data, how should we refer to the vertical distances between the data points and the line?
 the part that didn't fit....the error
 the residual
 (label the residual on the graph)
 Each datapoint (observation) has a residual. How do we calculate them?
 for each y value, residual = observed y − predicted y
 What could we do with all of the residuals, one for each datapoint, to better understand them, to look for any interesting patterns which could tell us about the fit of the regression line?
 graph them.
 treat the residuals as a new variable
 How do we graph them?
 as these are y distances, we leave the explanatory variable on the x axis and put the new residuals variable on the y.
 (display an example image)
 notice how the residual plot magnifies the distances...easier to study the fit
 What does the mean of the least-squares residuals equal?
 What might we find by studying the residuals?
 (display examples of randomly scattered and curvilinear residuals)
 if the regression line is working well (accurately portrays the pattern of the data), we will see no pattern in the plot of the residuals.
 a curvilinear plot suggests a nonlinear relationship between the explanatory and response variables
 a change in variability along the x-axis means that predictions made in areas of larger variability will not be as good as those made in areas of smaller variability.
 the residual plot is a very useful tool when exploring relationships. There is much more about residual plots in the second half of this course.
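Residuals can be computed by fitting the line and subtracting the predictions from the observations. The six data points below are hypothetical, and the helper `least_squares` is a hand-written sketch rather than a library routine.

```python
from statistics import mean

def least_squares(x, y):
    """Return (b0, b1) for the line minimizing the sum of squared vertical distances."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data: six (x, y) observations, made up for illustration.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

b0, b1 = least_squares(x, y)
predicted = [b0 + b1 * xi for xi in x]
# residual = observed y - predicted y, one per observation
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]

# Points for a residual plot: explanatory variable on x, residual on y.
residual_plot_points = list(zip(x, residuals))
print(residual_plot_points)
print(mean(residuals))  # zero, up to floating-point rounding
```

The residuals from a least-squares line always average to zero, so the residual plot is centered on the horizontal line at 0 and any pattern shows up as a departure from it.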
 Residual plots look at deviations as a group...at the pattern of deviation. What do we call individual points which deviate substantially from the overall pattern of the data?
 Why are outliers of concern in linear regression?
 The point(s) may affect the determination of the best-fit line, such that the line is not effective in its representation of the data.
 What do we call points which unduly affect the regression line?
 An outlier may deviate substantially in the y- or x-directions. What is the impact in each direction?
 In the y-direction, the point may pull the line toward it, but if there are many other points for similar x-values, it may not pull it that much.
 In the x-direction, a lone point could be very influential, as it could set the direction of the line.
 Influential outliers are not always obvious on residual plots, because they may draw the line toward them. Always plot the data.
 (display example of always plot the data)
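The difference between the two kinds of outlier can be seen by refitting the line with one extra point. The five base points and both outliers are hypothetical values chosen for the illustration, and `least_squares` is a hand-written helper.

```python
from statistics import mean

def least_squares(x, y):
    """Return (b0, b1) for the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Hypothetical base data with a clear upward linear trend.
x = [1, 2, 3, 4, 5]
y = [1.1, 1.9, 3.2, 3.8, 5.0]

base = least_squares(x, y)
# Outlier in the y-direction, placed in the middle of the x-values.
y_outlier = least_squares(x + [3], y + [8.0])
# Outlier in the x-direction: a lone point far to the right.
x_outlier = least_squares(x + [15], y + [2.0])

print("base        (b0, b1):", base)
print("y-direction (b0, b1):", y_outlier)
print("x-direction (b0, b1):", x_outlier)
```

In this example the y-direction outlier lifts the line but leaves its slope unchanged, while the single x-direction outlier flattens the slope to nearly zero. That lone point sits close to the refitted line, which is why it would not stand out on a residual plot.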
 How might a correlation between two variables be misleading?
 It could suggest a false conclusion which if more were known about the situation we would not consider.
 What might we look for if we are suspicious about a conclusion?
 lurking variable
 a variable not included in the study that does have an effect on the variable studied.
 examples:
 percent of students receiving free lunch is correlated to school achievement level
 number of fire fighters at a fire is correlated with the amount of fire damage
 number of books in a child's house is correlated to achievement level in school
 correlation (association), even if it's a very strong correlation, does not imply causation
 variables involved in these correlations are sometimes called "proxies" to convey the idea that there is another variable(s) which accounts for relationship.
 Why should we be concerned about a variable which is an average of many individuals?
 Because averages reduce the spread of the data, which increases the correlation
 (display example for boys' heights at different ages)
 (display growth chart, showing ranges of normal growth)
 always be sure to know the exact definition of a variable and how it was measured.
 What is the concern about range restriction as it relates to correlation?
 When you are only looking at a section of the range of the explanatory variable, the correlation may differ from what it is when looking at the full range.
 (display simulation of range restriction)
 (display overview of problems discussed)
Least-squares regression (2.3)
 The correlation coefficient describes the direction and strength of a linear relationship between two variables. What can we use to further describe the form of the relationship?
 a best-fit line
 describes how a variable y changes given changes in variable x.
 How might a best fit line help us predict the value of the y variable?
 (display scatterplot with a line)
 show how to predict college GPA from a few HS GPA values
 In order to use a best fit line to predict values of the y variable, what must be true about the relationship between the two variables?
 x is explanatory and y is response
 their relationship appears to be linear
 How should the line be positioned among the points?
 drawn so that it comes as close as possible to all the points
 How do we describe a line mathematically?
 y = b_{0} + b_{1}x
 note that the line may be written y = b_{1}x + b_{0} also
 What is b_{1} in the equation?
 slope
 describes how much y will change given a change of one unit in x
 What is b_{0} in the equation?
 intercept
 the value of y when x=0
 where the line crosses the y axis.
 show how when x=0 the term b_{1}x drops out of the equation
 Example: For a sample of 105 graduates of a university who majored in computer science, researchers obtained university GPA (ugpa) and high school GPA (hsgpa). The equation for the best-fit line is^{[1]}:
 \hat{y} = 1.0955 + 0.6752x, where x is hsgpa and \hat{y} is the predicted ugpa
 (display the scatterplot)
 notice that I've put a little hat (caret) on top of the y variable. The hat indicates that this variable is now the predicted value of y.
 calculate the predicted ugpa:
 for (rounds to 2.7)
 for (rounds to 3.7)
 for (rounds to 3.3)
 for (rounds to 1.8)
 What is the term used when we apply a best fit line to predict a response value far outside the range of the explanatory variable x which was used to obtain the line?
 Is extrapolation a useful practice?
 No, it is often not accurate; should be avoided
 What is one mathematical technique we can use to determine a specific linear relationship between the explanatory and response variables?
 least-squares regression
 describes the dependence of the response variable on the explanatory variable.
 use linear regression when the dependence is linear
 How does the least-squares technique work?
 minimizes squared vertical distance from points to the line
 (draw scatter plot with 6 points....draw best fit line....draw squares to show squared vertical distance to line)
 Why do we choose to minimize the vertical distance?
 minimizes the error in predicting y.
 Where are the errors in predicting y?
 the distance from the actual point to the line
 some are positive and some negative
 when we use least squares method we are minimizing the errors
 In order to obtain the least-squares line, \hat{y} = b_{0} + b_{1}x, for a set of data, what do we need to calculate?
 The formulas are slope: b_{1} = r(s_{y}/s_{x}) and intercept: b_{0} = \bar{y} − b_{1}\bar{x}. What is noteworthy about these formulas?
 use only basic descriptive statistics: mean of x and y, sd of x and y and r
 when calculating these values, use as many decimal places as your calculator (or spreadsheet) can accommodate; we will generally use stat software to calculate the values
 Let's calculate the leastsquares regression line for the explanation of university GPA given high school GPA.
 (display output showing means, sd's, and r for sat data)

 b_{0} = (3.1729) − (.6752)(3.0767) = 1.0955
 How would you draw the regression line onto the graph by hand?
 use the equation to find two points, locate the points on the graph, and draw a straight line.
 What y value is predicted for the value corresponding to the mean of x?
 (display the scatterplot for ugpa and hsgpa, with regression line)
 find the mean of x (3.0767), the predicted value is (3.1729)
 the means for x and y are always on the regression line
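 A quick check of the intercept calculation and of the claim that the point of means lies on the line. The means and slope are the values quoted in these notes; the function name predict is illustrative.

```python
# Values quoted in the notes for the GPA example.
x_bar = 3.0767   # mean high school GPA
y_bar = 3.1729   # mean university GPA
b1 = 0.6752      # slope

# Intercept: b0 = y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar


def predict(hsgpa):
    # Predicted university GPA for a given high school GPA.
    return b0 + b1 * hsgpa


print(round(b0, 4))              # 1.0955
print(round(predict(x_bar), 4))  # 3.1729, the mean of y
```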
 What would happen to the correlation between hsgpa and ugpa and the linear regression equation if we changed the units of measure (e.g., ugpa is measured on a 0 to 10 scale)?
 no change in correlation (change in units does not impact direction and strength of relationship)
 the slope and intercept are based on the particular scale chosen, so these would change (e.g., the slope for predicting ugpa from hsgpa would likely be much bigger)
 Find the values for b_{0} and b_{1} in the output below?
 (display regression output for sat data)
 you will need to ignore the many parts of the output which you do not yet understand
 How does r influence the slope in the formula b_{1} = r(s_{y}/s_{x})?
 It moderates the change in y, given a change in x.
 If sdy = 2 and sdx = 1 and r = .5, the slope is b_{1} = (.5)(2/1) = 1
 for every 1 unit change in x, we get 2 units change in y, except r=.5 moderates and we only get half of that....b_{1} = 1
 note that when r is 1 or -1, the result is the full relationship
 Would we get the same regression line if we reversed x and y?
 No...the line only works for the particular explanatory-response relationship
 How can we use the correlation coefficient to help us understand how well the linear regression explains (predicts) the response variable?
 look at r^{2}, which tells us the proportion of the variation in the values of y which is explained by the least-squares regression of y on x.
 (display regression output...find r^{2})
 61% of the variation in ugpa is explained by knowing (is dependent on) hsgpa....39% of the variation is due to other factors
 How would the points be organized on a scatterplot for a relationship where r = 1 (or r = -1)?
 exactly on the regression line, r^{2} = 1
 all of the variation in one variable is accounted for by the linear relationship with the other variable.
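 To see why r^{2} can be read as the proportion of variation explained, here is a minimal sketch with hypothetical data. It compares r squared with 1 − SSE/SST, where SSE is the sum of squared errors around the least-squares line and SST is the total squared variation of y around its mean. The slope is computed as sxy/sxx, which is algebraically the same as r times (sd of y / sd of x).

```python
import math

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

sxx = sum((a - x_bar) ** 2 for a in x)                        # variation in x
syy = sum((b - y_bar) ** 2 for b in y)                        # total variation in y (SST)
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))    # co-variation

r = sxy / math.sqrt(sxx * syy)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Variation left over after using the line to predict y (SSE).
sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
explained = 1 - sse / syy

print("r squared:          ", round(r ** 2, 4))
print("proportion explained:", round(explained, 4))
```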
Correlation (2.2)
 What do we examine in a scatterplot to better understand the data?
 Pattern: form, direction, strength
 Deviations: outliers
 What types of variables are plotted with a scatterplot?
 two quantitative variables
 Given the two scatterplots below, which has the stronger relationship?
 (display two scatterplots of the same data but with different scales)
 Point out that they are both the same
 What form best fits the scatterplot?
 (display scatterplot)
 linear
 we will now focus on linear relationships to explore some of the numerical measures which can help us further understand the data.
 What numerical measure can we use to measure the strength and direction of a linear relationship between two quantitative variables?
 correlation coefficient (r)
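 formula: r = \frac{1}{n-1} \sum \left(\frac{x_{i} - \bar{x}}{s_{x}}\right)\left(\frac{y_{i} - \bar{y}}{s_{y}}\right)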
 discuss some of the ideas in the formula.
 we will not focus on the formula, but rather on understanding how to interpret r.
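 For anyone who wants to see the ideas in the formula at work, here is a small sketch: standardize each x and each y, multiply the pairs, add them up, and divide by n − 1. The data (hours studied and test score) are hypothetical, and the function name correlation is illustrative.

```python
import math


def correlation(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in xs) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ys) / (n - 1))
    # Each term is (standardized x) * (standardized y).
    products = (((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
                for a, b in zip(xs, ys))
    return sum(products) / (n - 1)


# Hypothetical data: hours studied and test score for five students.
hours = [1, 2, 3, 4, 5]
score = [52, 60, 61, 70, 77]

r = correlation(hours, score)
print("r =", round(r, 3))
# Swapping the roles of the two variables gives the same r.
print("r with x and y swapped =", round(correlation(score, hours), 3))
```

 Because each term is a product of two standardized values, swapping x and y cannot change the result.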
 What is the range of possible values for correlation?
 -1 to 1
 (draw a horizontal line from -1 to 1.)
 How does the value of r tell us about the direction of the linear relationship?
 Negative values indicate a negative relationship
 Positive values indicate a positive relationship
 (indicate direction info on line)
 How does the value of r tell us about the strength of the linear relationship?
 Values at or very near zero suggest no relationship between the variables.
 Values nearish to 0, but not 0, either positive or negative, indicate a weak relationship.
 Values near -1 and 1 indicate a strong relationship.
 (indicate strength info on line)
 (display image of various scatterplots, with calculated r)
 A correlation requires two quantitative variables. Does it matter which is explanatory and which is response? Why?
 NO...a correlation measures the relationship between two quantitative variables
 Can we measure a correlation between the chemistry test scores for a group of students and what class they were in? Why?
 NO...a correlation can only be calculated between two quantitative variables....requires arithmetic in the calculation
 If we change the units of measurement for one of the variables, will the value of r change?
 No...the pattern of the relationship and the correlation remain the same.
 In fact the correlation has no units.
 (display example)
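 One possible example, sketched with hypothetical heights and weights: converting height from inches to centimeters leaves r unchanged. The function name correlation is illustrative.

```python
import math


def correlation(xs, ys):
    # Pearson correlation written with sums of squares and cross-products.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data for five people.
height_in = [60, 62, 65, 68, 70]          # inches
weight_lb = [110, 120, 140, 155, 170]     # pounds

height_cm = [h * 2.54 for h in height_in]  # same heights, new units

r_inches = correlation(height_in, weight_lb)
r_cm = correlation(height_cm, weight_lb)
print(round(r_inches, 4), round(r_cm, 4))
```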
 Does a large correlation indicate that the relationship is linear?
 NO
 the scatterplot must be assessed to determine if a linear relationship exists, before using a correlation to describe the data.
 (display example)
 Which measures of center and spread are used with the correlation? Why?
 mean and standard deviation
 correlation is not a resistant measure
 What is the effect of outliers on correlations?
 r is strongly affected by a few outlying observations
 use the applet to show how an outlying point affects correlation
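 If the applet is not available, the same point can be made with a few lines of code. The data below are hypothetical: six points with a strong positive pattern, plus one outlying point. The function name correlation is illustrative.

```python
import math


def correlation(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data with a clear positive linear pattern.
x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 5, 4, 6, 7]
r_without = correlation(x, y)

# Add a single outlying observation far from the pattern.
r_with = correlation(x + [12], y + [1])

print("r without outlier:", round(r_without, 3))
print("r with outlier:   ", round(r_with, 3))
```

 In this made-up dataset a single point is enough to change the sign of r.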
 Why doesn’t a tight fit to a horizontal line imply a strong correlation?
 (draw example.)
 the value of x is irrelevant to the value of y, the variables are not related.
Scatterplots (2.1)
 What is the most common way to display the relationship between two quantitative variables?
 a scatterplot
 display dataset and have student explain how to make one for two quantitative variables...draw on the board
 On which axis should the explanatory variable, if there is one, be plotted? the response variable?
 expl goes on the x-axis
 resp goes on the y-axis
 What do we look at when we examine a scatterplot?
 overall pattern and striking deviations
 pattern: form, direction, strength
 What types of form might a scatterplot show?
 (display examples...without labels)
 linear
 curvilinear
 clustered
 no relationship
 What types of direction might a scatterplot show?
 (display examples...without labels)
 positive
 negative
 How do we determine the strength of a relationship displayed in a scatterplot?
 (display examples...without labels)
 by how closely the points follow the form of the relationship
 how well we can predict y given x
 What deviations might be evident in the data?
 (display a graph with outliers)
 outliers
 What does this plot show?
 (display an example scatterplot)
 discuss issue of scale (both x and y axis should provide similar variability to points).
 Why might it be useful to add a categorical variable to a scatterplot?
 (display examples)
 might show that points are clustered in an important way
 could even be that relationship is not what it appears
 What sort of graph would we create if we had a categorical explanatory variable and a quantitative response?
 side-by-side boxplots
 draw on board...Actress/Actor Oscar winners explain ages
Examining relationships (Chapt 2 intro)
 What more can we know about the data if we look at two variables together?
 whether or not the two variables are related, are associated, have a relationship.
 examples:
 Are years of education related to personal income?
 How do SAT scores relate to freshman year grades?
 Think back to the research questions we came up with on the first day, are any of them proposed two variable relationships?
 If we are measuring two variables, do we collect a different sample of individuals for each variable?
 NO, the variables must be measured on the same individuals
 (display dataset) one dataset of observations, with many measurements about each individual
 Two variables may be considered associated if knowing a value of one of the variables.....
 tells me something about the values of the other variable.
 Example: a student got an A in Math in one year....what grade would you predict for math in the following year?
 What do we need to know about the dataset and the variables before we can begin examining the relationship?
 What population and sample are the data obtained from?
 How are each of the variables measured?
 Which variables are categorical and which are quantitative.
 Sometimes categorical measures may be combined to create a quantitative index
 Other times a quantitative measure may be divided into categories. (e.g., age, proficiency level on a statewide test)
 How might we label the two variables to show the nature of the relationship?
 response variable - measures an outcome of the study
 explanatory variable - explains (or causes) changes in the response variable
 example: years of education explains personal income (may or may not cause it)
 temperature explains hours of sleep (and may even cause it to vary)
 What other terms are used to describe explanatory and response variables?
 independent and dependent
 We will use the same approach for two variables that we used with one variable. What are the three steps?
 graphical display
 examine graph for overall pattern and deviations
 use numerical summaries to describe specifics of data
Density curves and normal distributions (1.3, through 68-95-99.7 rule)
 What are the first steps of exploratory data analysis, when first presented with a collection of data?
 plot the data, especially if quantitative (histogram, boxplot)
 Examine the graph for pattern, and outliers
 Calculate appropriate summary statistics to describe center and spread
 When we graph a dataset as a histogram, we often see that the shape resembles a smooth mathematical function. What do we call that function?
 a density curve
 (display images of variety of density curves)
 What properties of a density curve are noteworthy?
 it is always on or above the horizontal axis
 its area under the curve (and above the x-axis) is exactly 1
 it can be used to approximate an observed histogram created from an actual dataset
 (display a histogram with curve; demonstrate how area = proportion)
 How can we think about the median of a density curve? the mean?
 the median is the point with equal area above and below.
 as the mean is an arithmetic average, we can think of the mean as the point at which the curve would balance
 easy to see the balance point in a normal curve, more difficult to find the balance point in a skewed curve. (draw one of each)
 As a density curve is idealized, what notation do we use to indicate mean and standard deviation, so we don't get confused with xbar and sd?
 Which density curve will we focus on going forward?
 the Normal curve
 the Normal curve describes Normal distributions
 as for all density curves, there is a formula which we can use to create the curve, but how this works is beyond this class.
 There are many variations of Normal curves. What do they have in common? How do they differ?
 in common: symmetric, unimodal, bell-shaped
 differ: μ and σ (draw a few Normal curves with different μ and σ)
 How do I know how far to draw the distance to represent σ?
 from the mean to the inflection point on either side of the mean.
 Why are Normal distributions important in statistics?
 Effective at modeling some distributions of real data (e.g., test scores, repeated measures of the same quantity, characteristics of biological populations), although there are many instances of non-Normal data
 Good approximations of chance events
 Statistical inference procedures based on Normal distributions often work well for other roughly symmetric distributions.
 How do we denote a Normal distribution with a particular μ and σ?
 What property of the Normal distributions is represented in this graph? How do we use the property?
 (display example Normal distribution, heights?, with +/- 3 standard deviations delineated)
 68-95-99.7 rule
 about 68%, 95%, and 99.7% of the data fall within 1, 2, and 3 standard deviations of the mean, respectively; approximately true for actual data
 In the distribution of heights of young women, what percentage of young women have heights between 62 and 67 inches? What are the height of the middle 95%? What percent of young women have heights greater than 72 inches?
 btwn 62 and 67: 68%
 middle 95%: 59.5-69.5
 above 72: 0.15% (72 is 3 standard deviations above the mean, so half of the remaining 0.3%)
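 The rule-of-thumb answers can be compared with exact Normal areas. This sketch assumes a mean of 64.5 inches and a standard deviation of 2.5 inches, values inferred from the answers above (62 to 67 covering 68%, 59.5 to 69.5 covering 95%) rather than stated in these notes. It uses NormalDist from Python's standard statistics module.

```python
from statistics import NormalDist

# Assumed parameters, inferred from the answers in the notes.
heights = NormalDist(mu=64.5, sigma=2.5)

# Proportion between 62 and 67 inches (within 1 sd of the mean).
between_62_67 = heights.cdf(67) - heights.cdf(62)

# Middle 95% by the rule: mean plus or minus 2 sd.
middle_95 = (heights.mean - 2 * heights.stdev, heights.mean + 2 * heights.stdev)

# Proportion above 72 inches (3 sd above the mean).
above_72 = 1 - heights.cdf(72)

print("between 62 and 67:", round(100 * between_62_67, 1), "%")
print("middle 95% (rule):", middle_95)
print("above 72:", round(100 * above_72, 2), "%")
```

 The exact areas differ slightly from the rule because 68, 95, and 99.7 are rounded values.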
Displaying distributions with numbers (1.2)
 What aspects of a histogram do we examine to better understand our quantitative data?
 overall pattern: shape, center, spread
 deviations: outliers
 In this section, we will examine numerical descriptions which may help us better understand our data. What statistics do we use to measure the center?
 How do these two values differ?
 mean is average value
 median is middle value
 How do we calculate the mean?
 \bar{x} = (x_{1} + x_{2} + ... + x_{n}) / n = (1/n) \sum x_{i}
 carefully review the notations (xbar, summation, subscripts...no order implied), emphasizing that this is an average
 (display best actress Oscar winner calculation)
 How do we calculate the median?
 Find the midpoint of the distribution, such that half of the observations are below and half are above
 If the number of observations is odd, the median is the observation at the center of the ordered list of observations
 If the number of observations is even, the median is the halfway point between the two center observations in the ordered list.
 (Display the best actress Oscar winner calculation)
 What is the median of 9, 4, 2, 3, 5, 8, 1?
 4
 order the numbers into an ordered list to show the concept
 What is the median of 9, 4, 2, 3, 5, 8, 1, 8?
 4.5
 add the additional 8 into the ordered list.
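 The two median questions above can be checked with Python's standard statistics module; sorting first makes the middle position visible.

```python
from statistics import median

odd_count = [9, 4, 2, 3, 5, 8, 1]   # 7 observations
even_count = odd_count + [8]        # 8 observations

# Ordered lists make the center observation(s) easy to see.
print(sorted(odd_count), "median:", median(odd_count))
print(sorted(even_count), "median:", median(even_count))
```

 With 7 observations the median is the 4th value in the ordered list; with 8 it is halfway between the 4th and 5th.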
 How will an outlier in the data affect the mean and median?
 (Display two simple datasets, with the second having an obvious outlier/data entry error. Provide mean/median for each xbar_{A}=68.14, xbar_{B}=162)
 It will result in a mean which is closer to the outlier, than if the outlier is not there.
 It will not change the median, except that it is a data point in the upper or lower half.
 How will the mean and median compare for a skewed distribution?
 (Display images of distributions with mean and median indicated.)
 The mean will be farther along the tail of the distribution, while the median is closer to the bulk of the data.
 What can we conclude about the mean, as a measurement tool?
 It is not a resistant measure, as it is strongly influenced by extreme values
 With what kinds of data should we use the mean as a measure of center?
 Symmetric distributions with no outliers
 In our examination of the overall pattern in our data, we have discussed shape and center. What's left?
 spread or variability
 (display image of two distributions with different variability, but the same center)
 What are some ways we might quantify spread?
 range (max - min), interquartile range (IQR, distance between 25th and 75th percentiles), standard deviation
 What is the range?
 Exact difference between largest and smallest observations
 Range = Max − Min
 (display best actress Oscar winner calculation)
 If the median breaks the data into halves, how do the quartiles divide the data?
 into quarters or fourths.
 How do we find the quartiles for a dataset?
 After locating the median, find the center observation in the lower and upper halves.
 The center of the lower half is Q1
 The center of the upper half is Q3
 (draw a line representing range of data, from min to max; label M, Q1, Q3, 25% in each section, middle 50%, IQR)
 (display IQR calculations for best actress Oscar winners dataset)
 There are a number of other ways to calculate percentiles; different software apps do it differently.
 What are Q1 and Q3 in terms of percentiles?
 What do we mean by percentile?
 The percent of observations which occur at or below that value.
 How can we use the IQR to help us identify outliers?
 Calculate Q1 - (1.5 * IQR), Q3 + (1.5 * IQR); any values which lie outside these upper and lower thresholds may be considered outliers.
 (display histogram for best actress Oscar winners, with potential outliers noted)
 calculate outlier thresholds
 Q1=32 and Q3=41.5 ⇒ IQR=9.5
 Q1 − 1.5(IQR) = 32 − (1.5)(9.5) = 17.75
 Q3 + 1.5(IQR) = 41.5 + (1.5)(9.5) = 55.75
 The three largest values in the dataset may be considered outliers.
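 The threshold arithmetic can be sketched as follows. Q1 and Q3 are the values given above; the short list of ages contains only values mentioned elsewhere in these notes for the best actress data (it is not the full dataset), and the function name is_outlier is illustrative.

```python
q1, q3 = 32, 41.5
iqr = q3 - q1                     # interquartile range

lower = q1 - 1.5 * iqr            # lower threshold
upper = q3 + 1.5 * iqr            # upper threshold


def is_outlier(value):
    # Flag values outside the 1.5 * IQR thresholds.
    return value < lower or value > upper


# A few ages mentioned in the notes (not the complete dataset).
ages = [21, 35, 49, 61, 74, 80]
flagged = [age for age in ages if is_outlier(age)]

print("IQR:", iqr)
print("thresholds:", lower, upper)
print("potential outliers:", flagged)
```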
 What are the three options for how to handle an outlier?
 Keep it....the observation belongs to the population we are studying...example: outliers in best actress Oscar winners
 Drop it....the observation is fundamentally different from the other observations; didn't realize when data was collected...example: studying typical third graders, but dataset includes observation for student who is two grades ahead. Student could well be an outlier on many physical characteristics (low), as well as cognitive measures (high), and likely does not represent typical third graders
 Fix it...check the original data collection process to see if the observation is a data error.
 What 5 numbers, of those which we have talked about so far, would fit nicely together to make a 5-number summary?
 Min, Q1, M, Q3, Max
 the five number summary for best actress Oscar winners is: 21, 32, 35, 41.5, 80. (compare to histogram)
 What graph can we use to visually display the five-number summary?
 a boxplot
 construct a box plot using the best actress data, y axis Age 20-80:
 5-number summary: 21, 32, 35, 41.5, 80
 outliers: 61, 74, 80
 largest observation that is not an outlier: 49
 How might we organize our boxplots if we had both actress and actor data?
 side-by-side
 (display side-by-side boxplot for oscar winners; interpret what the graph says about the data)
 We've discussed range and IQR, which are both useful when our measure of center is a median. What measure of spread is useful when the measure of center is the mean?
 How does the standard deviation show the spread of the data?
 It quantifies how far the observations are from the mean.
 many notations: SD, s, Sd, StDev
 If we have a small set of data, no. of people who enter a pet store in 8 consecutive hours: 7, 9, 5, 13, 3, 11, 15, 9, and we've calculated the mean: 9, what's the first step in calculating an SD?
 write out the data on a line, show the mean
 subtract the mean from each observation to get the deviations
 What do we do next?
 square each deviation
 if we sum the deviations, we get 0; we could use the absolute deviation...
 Once we have the squared deviations, what could we do to summarize them...to find the average squared deviation?
 sum them up, and divide by the number of observations
 but in fact we divide by n - 1, because that's how many unique pieces of information we have. It's called degrees of freedom...
 sum of squared deviations = 112, divided by 7 is 16
 Now we have the average squared deviation. What could we do to help us better understand the number?
 Take the square root, as now the number is on the same scale as the original data....same units

 The formula is s = \sqrt{\frac{1}{n-1} \sum (x_{i} - \bar{x})^{2}}
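 The pet store calculation, step by step, as a short sketch. The data are the eight hourly counts from the example; the variable names are illustrative, and the result is compared with stdev from Python's standard statistics module.

```python
import math
from statistics import stdev

visitors = [7, 9, 5, 13, 3, 11, 15, 9]   # people entering the store each hour
n = len(visitors)

mean = sum(visitors) / n                   # 9.0
deviations = [v - mean for v in visitors]  # observation minus mean
squared = [d ** 2 for d in deviations]     # squared deviations

variance = sum(squared) / (n - 1)          # 112 / 7 = 16
s = math.sqrt(variance)                    # back on the original scale

print("sum of deviations:", sum(deviations))
print("sum of squared deviations:", sum(squared))
print("standard deviation:", s)
print("statistics.stdev agrees:", stdev(visitors))
```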
 When should we use sd?
 it goes with the mean, best used with symmetric distributions with no outliers
 When is sd = 0?
 when all of the observations are the same value...there is no spread
 Is sd a resistant measure?
 no, large deviations from the mean will result in a large sd, larger than the other values would suggest
 What is the single best way to describe a dataset?
 a graph, numerical summaries don't provide as much depth of information (as they summarize the data)
 Is it OK to make a linear change to a measurement scale for a variable?
 yes, always.
 you can easily apply a linear transformation in SPSS and in spreadsheets, e.g., if you need to change the unit of measurement
 Will the shape, center and spread of the distribution remain the same for a transformed variable?
 shape: it will have the same basic shape: skewed, symmetric, unimodal, etc.
 center and spread: no, but they will change in a systematic way
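 A sketch of the "systematic way" in which center and spread change, using hypothetical temperatures converted from Celsius to Fahrenheit (new = 32 + 1.8 * old). The mean is transformed exactly like an individual value, while the standard deviation is only multiplied by the scale factor; adding 32 shifts every value but does not change the spread.

```python
from statistics import mean, stdev

# Hypothetical temperatures in degrees Celsius.
celsius = [18.0, 20.0, 21.5, 23.0, 27.5]

# Linear transformation: new = a + b * old, with a = 32 and b = 1.8.
fahrenheit = [32 + 1.8 * c for c in celsius]

print("mean:", mean(celsius), "->", round(mean(fahrenheit), 2))
print("sd:  ", round(stdev(celsius), 3), "->", round(stdev(fahrenheit), 3))
print("32 + 1.8 * mean:", round(32 + 1.8 * mean(celsius), 2))
print("1.8 * sd:       ", round(1.8 * stdev(celsius), 3))
```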
Displaying distributions with graphs (1.1)
 Let's examine the data from a survey which asked 1200 US college students about how they perceive their body: overweight, underweight, or about right. Here is what the data would look like in a spreadsheet. (display printscreen of body image spreadsheet) What would be a good starting point for organizing our data?
 count how many individuals are in each group
 note that with 1200 individuals in the dataset, there's too much data to make any sense of it in the spreadsheet format.
 be sure to review the set up of the spreadsheet: how many variables, which is the variable we are focused on?
 What do we call a data display which shows the values and counts (or percents) for a variable?
 frequency distribution or frequency table
 display freq dist for body image data...note that percents don't add to 100%
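 Counting individuals in each group is easy to sketch in code. The eight responses below are hypothetical stand-ins, not the survey data, and the column layout is just one possible way to print a frequency table.

```python
from collections import Counter

# Hypothetical responses; the real survey has 1200 of these.
responses = [
    "about right", "overweight", "about right", "underweight",
    "overweight", "about right", "about right", "overweight",
]

counts = Counter(responses)      # value -> count
total = sum(counts.values())

# Print value, count, and percent for each category.
for category, count in counts.most_common():
    percent = 100 * count / total
    print(f"{category:12s} {count:3d} {percent:5.1f}%")
```

 When percents are rounded for display, they may not add to exactly 100%.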
 What graph(s) could we use to visualize the data?
 bar graph and pie chart
 display body image graphs
 What if we collected data as to what kind of pet a person has and the options are: dog, cat, fish, reptile. Does it make sense to graph this data on a bar graph? on a pie chart?
 Yes for a bar graph, because you don't need to account for all of the options, you could display the number of people who have each kind of pet
 No for a pie chart, because the categories don't cover everyone. Some people will have other kinds of pets (e.g., pigs) and some will have no pets. The categories in a pie chart must be all of the options which make up the whole amount.
 Could you transform a table of the percent of children living in poverty for 35 economically advanced countries into a bar graph? into a pie chart? why?
 yes for a bar chart, because it's reasonable to compare the percents across countries. (see example on next slide)
 note that countries are organized by increasing percent....would the graph be as useful if the countries were listed alphabetically?
 no for a pie chart, not because we don't have everything...we have all 35 countries, but because the percents are within each country...for children living in Finland, what percent are living in poverty? These are conditional percents, which we will discuss in chapter 2.
 When we have quantitative (numerical) data, what can we do to help us better understand the data?
 assuming we already understand the context of the data, graph it and look for general patterns and anomalies.
 What kinds of graphs are useful when we want to visualize quantitative data?
 histograms and stemplots
 use these for visualizing one variable at a time, to look at the pattern of spread in the data
 time plots (a type of line graph)
 use these when the data are sequenced, e.g., in time
 What kind of graph is the standard for use in visualizing the data for one quantitative variable?
 The histogram
 It breaks the range of data into classes (intervals) and displays the count or percent of observations in each class.
 (display example of histogram from ips6e)
 discuss axes, classes/bins/intervals, area of bars represents how much data is present in class, bars together, shape, anomalies
 How many classes (also called bins) should be included in a histogram? (include link to applet)
 No standard.
 show how changes in class width change our interpretation of the data
 Software will use defaults, which you can change; as you get more practiced at creating these, you will have a better sense of what will help you best visualize the data.
 What kind of graph is useful when you have a quantitative variable, which has positive values, and a small number of observations?
 stemplot (stem and leaf plot)
 Let's make a stemplot
 Collect the age of the youngest person living in the household of each student
 Create stems
 put on leaves
 Display back to back stemplot; useful, but we will study boxplots next week which are more useful still
 Which is more useful, generally speaking, histogram or stemplot?
 histogram
 stemplot is useful when your only tool is pencil and paper; not used in research journals
 What is the purpose of making a statistical graph?
 to better understand the data
 What do we look for when we examine a graph?
 overall pattern: shape, center, spread
 deviations from pattern: outliers
 Put up a few examples to discuss the following terms:
 modes, unimodal, bimodal
 symmetric....unimodal, bimodal, uniform
 tails
 skewness: skewed right (salary), skewed left (age at death - natural causes)
 center: midpoint
 spread: variability (min, max)
 outliers
 Why should we pay particular attention to outliers?
 because that particular observation may be systematically different from the others: data entry error, equipment failure, different unit of measure, an individual which could be considered different than all of the others....does not belong in the same population.
 display graph showing outlier
 Why would we want to make a time plot for a quantitative variable?
 When the data collected is sequenced in some way, including the sequencing variable in the graph often changes our interpretation of the data.
 Show example timeplot, time is on the horizontal axis, lines connect the data points, may also include a trend line.
Chapt 1. Introduction
 What is/are statistics? (ask a number of students for their ideas)
 wp: the study of the collection, organization, analysis, interpretation, and presentation of data...often large amounts of data.
 numbers derived from data which help us better understand the data....which provide us useful information
 What is data?
 numerical facts about individuals, cases, or subjects
 but just looking at all the individual numbers won't help us understand
 we need to examine the data within the context of a research question
 Example research questions
 Have students pair up and devise a research question, on any topic
 Each group presents their question (write it on the board, using the term population where possible)
 Go back through the questions and identify what numerical facts could be obtained to create the data.
 The big picture of statistics.
 present the 4 stages described in OLI statistics
 We want to study a population, but too big
 Choose a sample and collect data (called producing data)
 Once we have the data, we need to begin to make sense of or summarize the data (exploratory data analysis)
 In order to make a conclusion about the population we have to explore how the sample compares to it (using probability)
 Finally we can use our sample to make a conclusion about the population (inference)
 We will explore steps 1-3 in Stat Methods I; step 4, inference, is tackled in Stat Methods II
 We will begin our study of statistics with exploratory data analysis (EDA), chapters 1 and 2 in the text.
 Looking back at our research questions, I see that we want to obtain data about a number of characteristics for each individual in our population or sample. What math term do we use to name these characteristics?
 variables
 a variable is any characteristic of an individual
 What do we call a particular collection of numerical facts about individuals?
 a dataset, identified with particular circumstances
 A dataset may be displayed as a grid of rows and columns. In the example where are the individuals? Where are the variables?
 As we established, a data set is identified with particular circumstances. What are some questions we should know about this example dataset?
 What questions are we looking to answer with this data?
 What population are we interested in?
 Who are the individuals in the dataset?
 How many are there?
 How many variables do the data contain?
 What is the definition of each of the variables?
 How were the values in these variables obtained?
 How are the variables for gender (M, F) and test score (0-100 pts) different?
 gender classifies each individual into a category -> Categorical variable
 test score provides a numerical value or measurement for each individual -> Quantitative variable
 Label the variables in the dataset below as either categorical or quantitative
 What problem do we have if we only know the names and values of the variables in a dataset?
 Display new example and discuss what each of the variables might mean.
 Discuss issues with measurement and assigning categories.
 reliability/validity of measurement instrument
 assigning meaningful categories
 categories which are coded using numbers
 counts vs. rate of occurrence
 Let's get started with exploratory data analysis; two types
 distributions of individual variables
 relationships between two variables
 In each case we will look at
 visual displays of data (graphs/charts)
 numerical summaries/measures