# Examine linear relationship with SPSS

As athletic performances in many areas continue to improve incrementally over time, we would expect the winning times for the men's 1,500 meter race, run at the Olympic Games since 1896, to fit this pattern. Regression techniques offer an opportunity to study this relationship.[1]

The dataset contains 23 observations and two variables:

• Year, the year of the Olympic Games (from 1896 to 2000)
• Time, winning time (in seconds)

### Dataset

• olympics.xls
• an SPSS version of the dataset is available on your class website: olympics.sav

Open the dataset in the SPSS data editor.

The following instructions are based on the student version of PASW (SPSS) version 18.

## Create a scatterplot of year vs. time

Create a scatterplot (instructions) of year and winning time and determine if the relationship can be considered linear.

If so, then the least squares regression line is a useful tool to help us further describe the relationship between 1,500 meter winning time and year. Continue with the instructions that follow to calculate the least squares regression line and add it to the scatterplot.

## Plot the least squares regression line

Double-click the graph displayed in the output window to open the Chart Editor.

To add the least squares regression line:

• Select Elements > Fit Line at Total.

The Properties dialog box opens, with the Fit Line tab highlighted.

• Confirm that Linear is chosen.

The line is automatically added to the graph. Close the Chart Editor window. The regression line displays among the data points along with the $R^2$ value.

In Version 18, SPSS does not offer an option to add the equation for the line to the graph. Instead, the equation must be obtained from a regression analysis, so it is provided below.
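For readers curious how a regression analysis arrives at a slope and an intercept, the following is a minimal Python sketch of the standard closed-form least squares calculation. It is not part of the SPSS workflow, and the function name `least_squares_line` and the toy points are illustrative assumptions; the toy points are not the Olympic data.

```python
from statistics import mean


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares line for paired data."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-deviations and sum of squared x-deviations
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # The least squares line always passes through (x_bar, y_bar)
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Toy points (hypothetical, not the Olympic data) lying exactly on y = 12 - 2x
toy_x = [1, 2, 3, 4, 5]
toy_y = [10, 8, 6, 4, 2]

slope, intercept = least_squares_line(toy_x, toy_y)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

Applied to the Year and Time columns, the same calculation should give a slope and intercept close to the rounded values reported by SPSS.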

Interpret the equation for the least squares regression line
The equation for the least squares line for the scatterplot of the winning time for the men's 1,500 meter race and year of the Olympic Games is:

$Time = (-0.39 * Year) + 994$

What can we know about the relationship of Time and Year from the equation? Interpret the line within the context of the data.

The scatterplot shows one obvious point that sits well outside the other data points. Using the graph, we can determine that this is the winning time for the 1896 race. Let's explore how the least squares regression line would be affected if this point were removed.

## Remove the outlier (1896 winning time) from the plot and calculations

To remove a data point, we can simply delete it from the dataset.

• Observe that row 1 contains the 1896 winning time.
• Click on the row header 1 to select the entire row of data.
• Choose Edit > Cut.

The 1896 data is removed from the dataset.

• Create a new scatterplot, without the outlier.
• Add a title and the least squares regression line.

Note that the $R^2$ value has changed; it is larger because, without the 1896 value, the line is a better fit to the data.
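The effect of a single outlying point on $R^2$ can be illustrated outside SPSS with a small Python sketch. It assumes simple linear regression, where $R^2$ equals the squared correlation between the two variables. The function name `r_squared` and the toy values are hypothetical and are not the Olympic data.

```python
from statistics import mean


def r_squared(xs, ys):
    """R^2 of the least squares line (the squared correlation of xs and ys)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


# Toy data (hypothetical): the first point sits well above the line
# y = 12 - 2x that the remaining five points follow exactly.
toy_x = [0, 1, 2, 3, 4, 5]
toy_y = [20, 10, 8, 6, 4, 2]

r2_with_outlier = r_squared(toy_x, toy_y)
r2_without_outlier = r_squared(toy_x[1:], toy_y[1:])

print(f"R^2 with the outlier:    {r2_with_outlier:.3f}")
print(f"R^2 without the outlier: {r2_without_outlier:.3f}")
```

In this toy case, dropping the first point leaves data that fall exactly on a line, so $R^2$ rises to 1. Real data such as the Olympic times would show a smaller improvement.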

Interpret the revised equation for the least squares regression line
With the outlying 1896 data point removed, the equation for the least squares line for the scatterplot of the winning time for the men's 1,500 meter race and year of the Olympic Games is:

$Time = (-0.33 * Year) + 872$

How does deleting the 1896 outlier affect the equation for the least squares regression line?

Predict the 2008 winning time
A least squares regression line may be used to predict the value of the response variable given a value of the explanatory variable. We could use the revised equation to predict a winning time in the men's 1,500 meter race for a particular year. Although it is often unwise to use a regression equation to predict values outside the range of values in the explanatory variable, let's practice using the equation by predicting a time for a more recent Olympic Games.

The least squares regression line for predicting a new winning time is (notice the "hat" on the response variable):

$\hat{Time} = (-0.33 * Year) + 872$

Use the regression line to predict the winning time for the Beijing Olympics held in 2008. How does the predicted 2008 winning time compare to the actual winning time of 212.94 seconds earned by Rashid Ramzi of Bahrain?[2]

What is it called when an established linear regression equation is used to predict a response for a value outside the range of explanatory variable values? Why are we concerned about using our revised equation to predict the winning time in 2008?
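The arithmetic of a prediction can be sketched in a few lines of Python. The slope, intercept, and final dataset year (2000) come from the text above; the function names `predict_time`, `beyond_last_observed_year`, and `as_minutes` are hypothetical helpers introduced only for illustration. The example evaluates the line at 2000, the last year in the dataset, and leaves 2008 to the reader.

```python
SLOPE = -0.33              # seconds per year, from the revised equation
INTERCEPT = 872.0          # seconds, from the revised equation
LAST_OBSERVED_YEAR = 2000  # final year in the dataset


def predict_time(year):
    """Predicted winning time in seconds from the revised regression line."""
    return SLOPE * year + INTERCEPT


def beyond_last_observed_year(year):
    """True when the year lies past the last year used to fit the line."""
    return year > LAST_OBSERVED_YEAR


def as_minutes(seconds):
    """Format a time in seconds as m:ss.ss, the way race times are reported."""
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes)}:{secs:05.2f}"


year = 2000
predicted = predict_time(year)
print(f"{year}: {predicted:.2f} s ({as_minutes(predicted)})")
print(f"Past the observed years? {beyond_last_observed_year(year)}")
```

Changing `year` to 2008 gives the prediction requested in the exercise, and the second function reports whether that year falls past the data used to fit the line.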

## Notes

1. Adapted from Open Learning Initiative. Probability and Statistics: Linear relationships to provide instructions for doing the analyses using SPSS.