Exploratory data analysis project

From WikiEducator

This project offers students a chance to apply their knowledge of exploratory data analysis to create and interpret analyses for 3 variables from a chosen dataset. Students are encouraged to use a dataset related to their area of study (e.g., a dataset available at a place of work). Lists of potential datasets are also provided.

Schedule

Oct 17: Submit written proposal

Oct 31: Submit analyses and report

Nov 14: Present analysis results in class

Guidelines for proposal

In the proposal you will provide information about your chosen dataset and about the analyses to be run. You will submit the proposal for review and feedback, which gives you the opportunity to improve your plans before implementing them.

Step 1: Find a suitable dataset

Criteria:

  • Dataset has at least 3 variables
  • Background information is available about the research or data collection effort that generated the data, along with detailed definitions for each of the variables

Note: if you are using SPSS to do the analyses, version 18 (which came with the book) allows at most 1500 observations. Feel free to use a subset of the observations in a dataset for your project.
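
If your dataset exceeds that limit, a simple random subset is one reasonable way to trim it. The following is a minimal sketch in Python using only the standard library. The function names `sample_rows` and `subset_csv` are hypothetical, it assumes the data is stored as a CSV file with a header row, and the default of 1500 is taken from the note above.

```python
import csv
import random

def sample_rows(rows, limit=1500, seed=0):
    """Return a simple random subset of at most `limit` rows, kept in original order."""
    if len(rows) <= limit:
        return list(rows)
    rng = random.Random(seed)  # fixed seed so the same subset can be reproduced
    keep = sorted(rng.sample(range(len(rows)), limit))
    return [rows[i] for i in keep]

def subset_csv(in_path, out_path, limit=1500, seed=0):
    """Copy a CSV file, keeping the header row and a random subset of the data rows."""
    with open(in_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(sample_rows(rows, limit, seed))
```

If you take a subset this way, mention in your proposal how it was drawn, since it changes the number of observations you report.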

Data sources

You are encouraged to use a dataset to which you have access through your work, research or studies. If you need to find a dataset elsewhere, here are a few options:

  1. Datasets (often related to a research study) made available for use in studying statistics:
  2. Large public datasets

Step 2: Provide background and details for dataset and variables

Your proposal will have two sections. The first provides information about the dataset and variables.

  • Provide the context for which the data in the dataset was originally collected. Use the following questions to guide your description of the dataset:
    • What questions is the data designed to address?
    • Who/What are the observations in the dataset?
    • To what population do the observations belong?
    • How many observations are there?
    • If the dataset was obtained on the internet, cite the specific URL, the institution which now maintains it, and the original authors (if available).
    • If the dataset was obtained from another source, describe the source of the data.
  • Provide details about the 3 variables which you will analyze. For each variable provide:
    • name of variable
    • a detailed definition
    • description of how the variable was measured (i.e., specifics on how the data for a variable was obtained)
    • indication of whether the variable is categorical or quantitative (and explanation of any issues with this classification)
  • For each of the 3 pairs of variables, indicate which variable is the explanatory and which the response, or indicate that these designations don't apply (there is only a potential association).

Step 3: Provide details for graphs and analyses to be run

This is the second section of the proposal.

  • For each of the variables, specify the one-variable graphs and summary statistics to be created.
  • For each pair of variables (there are three pairs), specify the two-variable graphs and summary statistics to be created.
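
The choice of graph and statistics usually follows from whether each variable is categorical or quantitative. The sketch below records one common set of textbook pairings as a Python lookup. The names `ONE_VARIABLE`, `TWO_VARIABLE` and `plan_pair` are hypothetical, and the pairings are an assumption about typical practice, not a requirement of this project.

```python
# Common textbook pairings (an assumption, not a course requirement):
# variable type -> (graph, summary statistics)
ONE_VARIABLE = {
    "categorical": ("bar chart or pie chart", "counts and percentages"),
    "quantitative": ("histogram or boxplot",
                     "mean, median, standard deviation, five-number summary"),
}

# (explanatory type, response type) -> (graph, summary statistics)
TWO_VARIABLE = {
    ("categorical", "quantitative"): ("side-by-side boxplots",
                                      "summary statistics by group"),
    ("categorical", "categorical"): ("stacked bar chart",
                                     "two-way table with conditional percentages"),
    ("quantitative", "quantitative"): ("scatterplot", "correlation coefficient"),
}

def plan_pair(explanatory_type, response_type):
    """Look up a suggested (graph, statistics) pair, or None if not listed."""
    # A quantitative explanatory variable with a categorical response
    # is deliberately not covered by this sketch.
    return TWO_VARIABLE.get((explanatory_type, response_type))

print(plan_pair("categorical", "quantitative"))
```

For a pair where neither variable is clearly the explanatory one, the same lookup can be used with the variables in either order.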

Step 4: Submit your proposal for review; revise as needed

You will receive written feedback on your proposal, offering ideas for improvement where relevant. You are encouraged to revise your plans as needed.

Implementing your analyses

Once you have received feedback on your proposal and are satisfied with your plans for the data analysis, you can go ahead with running the analyses, as specified in your proposal. However, you don't need to be confined to the exact specifics of the proposal plans; feel free to make changes as needed to accommodate unforeseen outcomes.

Guidelines for final report

Note that the background section is from the proposal. Adapt it as needed to reflect any improvements or changes made to your plan.

Background

Include the following from the proposal (updated as needed to reflect the data analyses you performed):

  • Context for which the data in the dataset was originally collected.
  • Details about the 3 variables which you analyzed.

Individual Variables

For each of the individual variables analyzed:

  • Present the relevant descriptive statistics (may be organized in a table).
  • Display an appropriate graph, including descriptive titles and axis labels.
  • Briefly interpret the statistics and graphs for that variable.
  • Describe any issues encountered in analyzing this variable.

We suggest organizing this section one variable at a time. For each variable you might do the following:

  • Create a table of statistics (just the ones which are relevant for the variable, not necessarily everything that appears in the analysis output).
  • Copy and paste in the relevant graph.
  • Write about the statistics and graph in the context of the purpose underlying the data.
  • Write about any issues you encountered doing the analyses.
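
As an illustration of the statistics such a table might contain, here is a minimal sketch in Python using only the standard library. The function names `summarize_quantitative` and `summarize_categorical` are hypothetical. Quartile conventions differ between software packages, so values from SPSS or other tools may differ slightly from these.

```python
import statistics
from collections import Counter

def summarize_quantitative(values):
    """Return common descriptive statistics for a quantitative variable."""
    # statistics.quantiles uses the 'exclusive' method by default
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }

def summarize_categorical(values):
    """Return (count, percentage) for each category."""
    counts = Counter(values)
    total = len(values)
    return {category: (n, 100 * n / total) for category, n in counts.items()}
```

Report only the entries that are meaningful for the variable. For a strongly skewed variable, for example, the median and quartiles are usually more informative than the mean and standard deviation.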

Two-variable analyses

For each pair of variables analyzed:

  • Specify which variable is the explanatory and which is the response.
  • Present any relevant descriptive statistics (may be organized in a table).
  • Display an appropriate graph, including descriptive titles and axis labels.
  • Briefly interpret the statistics and graphs to describe the relationship between the two variables and suggest conclusions.
  • Describe any issues encountered or extra measures taken in analyzing this pair of variables.

Again, the suggestion is to organize this information for each pair of variables separately, similar to how the one-variable analyses and interpretations were organized.
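
The descriptive statistics for a pair depend on the types of the two variables. The sketch below, in Python with only the standard library, shows one common statistic for each combination: the correlation coefficient for two quantitative variables, group means for a categorical explanatory variable with a quantitative response, and conditional percentages for two categorical variables. All function names are hypothetical, and these are common choices, not the only acceptable ones.

```python
import math
from collections import Counter, defaultdict

def pearson_r(x, y):
    """Correlation coefficient r for two quantitative variables."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def group_means(groups, values):
    """Mean of a quantitative response within each category of the explanatory variable."""
    by_group = defaultdict(list)
    for g, v in zip(groups, values):
        by_group[g].append(v)
    return {g: sum(v) / len(v) for g, v in by_group.items()}

def row_percentages(explanatory, response):
    """Percentage of each response category within each explanatory category."""
    cell_counts = Counter(zip(explanatory, response))
    row_totals = Counter(explanatory)
    return {(e, r): 100 * n / row_totals[e] for (e, r), n in cell_counts.items()}
```

Computing the percentages within each explanatory category, and not out of the overall total, is what allows the response distributions to be compared across groups of different sizes.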

Summary of results

Write a short summary to synthesize the results and conclusions from the one- and two-variable analyses.

  • Note how information obtained in the one-variable analyses impacted decisions in the two-variable analyses.
  • Explain any limitations related to your results (e.g., the scatterplot may not show a linear pattern, but you went ahead and calculated the linear regression line anyway).
  • Include any possible implications which might be useful in future research/analysis.
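
To illustrate the regression limitation mentioned above, the following minimal Python sketch fits a least-squares line and then inspects the residuals. The function names and the small curved dataset are hypothetical and used for illustration only.

```python
def least_squares_line(x, y):
    """Return (slope, intercept) of the least-squares regression line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residuals(x, y):
    """Observed y minus the value predicted by the fitted line."""
    slope, intercept = least_squares_line(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

# Hypothetical curved data (y equals x squared), not taken from any real dataset
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(least_squares_line(x, y))  # (6.0, -7.0)
print(residuals(x, y))           # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals are positive at both ends and negative in the middle. A systematic pattern like this suggests that a straight line does not describe the relationship well, which is the kind of limitation worth stating in your summary.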