User:ASnieckus/StatisticsContent/Learn by doing/Gender and monthly auto premium (Question 1)

Description of activity and Excel instructions: StatTutor Gender and monthly auto premium (Question 1)

Dataset: Direct link for auto_premium.xls not available.

Exploratory Analyses
This exercise begins by taking a look at the data.

Calculate descriptive statistics for each value of the categorical variable
Before beginning on the table of descriptive statistics, let's sort the data to make it easier to check the tabulation.
 * Type Ctrl+A to select all of the data.
 * Select Data > Sort....

The Sort dialog displays.


 * Click the Options tab.
 * Check Range contains column labels.
 * Cleck the Sort Criteria tab
 * Under Sort by, in the drop down menu, select the relevant categorical variable.
 * Click OK.

Now that the data are sorted, pick an empty area in your worksheet and create a table where the row headers are "Min", "Q1", "Median", "Q3", "Max", "n", "Mean", "St Dev" and the column headers are the different values of the categorical variable.

For each cell in the table, enter the formula to calculate the statistic listed in the row header. The data range to include in each formula is the range of data in the quantitative variable that corresponds to the categorical variable value listed in the column header.

The formulas you will need are:
 * Min: "=min([range])"
 * Q1: "=quartile([range],1)" [Note that in some operating systems the formula uses a semi-colon instead of a colon between the range and the numeral "1".]
 * Median: "=median([range])"
 * Q3: "=quartile([range],3)" [Note that in some operating systems the formula uses a semi-colon instead of a colon, between the range and the number "3".]
 * Max: "=max([range])"
 * n: "=count([range])"
 * Mean: "=average([range])"
 * St Dev: "=stdev([range])"

Create side-by-side boxplots
Unfortunately, Calc can't create boxplots directly, so we have to create a column graph and alter it to display as side-by-side boxplots. We start by using the calculated Descriptive Statistics to generate new values that will be the basis for our column graph.

Create the necessary statistics
The values in these new rows, for each value of the categorical variable, are calculated based on the 5-number summary values. Enter the necessary formulas, for each group, to obtain:
 * In each successive cell under St Dev, type "25th Percentile", "50th Percentile", "75th Percentile", "Lower limit", and finally "Upper limit".
 * 25th Percentile = Q1 (Remember to copy the numbers from Q1 without copying the formulas.)
 * 50th Percentile = Median - Q1
 * 75th Percentile = Q3 - Median
 * Lower limit = Q1 - min
 * Upper limit values = Max - Q3

Create the boxplots
Now you are ready to create the boxplots.
 * Highlight the values in the 25th Percentile, 50th Percentile, and 75th Percentile rows
 * Choose Insert > Chart...

The Chart Wizard displays.
 * In 1. Chart Type', choose the stacked column option.
 * In 2. Data Range, ensure that the data is correctly specified. (Select the 25th, 50th, and 75th values for the values of the categories. The data series is in rows, and has no headers.)
 * In 3. Data Series, use the Categories window to specify the category labels.
 * In 4. Chart Elements, label and select appropriate options for your chart.

A stacked column graph displays. To create the boxplots, add error bars (to the min and max values) and adjust the color and outlines as needed.
 * Use the Insert Y Error Bars... option to add the Upper Limit error bar.
 * Use the Insert Y Error Bars... option to add the Lower Limit error bar. (Remember that to add this error bar you have to select the "lowest portion of the bar.")
 * For each section of the bars, adjust the color and border as necessary.
 * If needed, adjust the y-axis scale.

Return to the StatTutor exercise for Question 1 to provide a comparison of the data in the two groups: describe the key features of the data display and support your description with numerical measures. (Be sure to include the numerical results in your description.)

Calculate p-value
Before calculating the p-value for the t-test, it is important to review the summary statistics for the two samples of data.
 * Review the mean and standard deviation for each group.

Calc provides a formula for calculating the p-value which can be used with a few types of t-tests. Type=3 is for use in this situation: unequal samples sizes and unequal variances. Note that Mode=2 indicates a two-sided test:
 * To calculate the p-value, enter the formula "=TTEST([range y1],[range y2],2,3)".

For a p-value of "0", convert the cell format to scientific notation to display the actual value.

Return to the StatTutor exercise for Question 1 to report results and draw conclusions.