User:ASnieckus/Statistics/VizMath notes
From WikiEducator
Notes for VizMath presentation Visualizing data through graphs: A beginner's guide
- intro
- GSE
- PLC
- WikiEducator projects
- statistics....resources for stats I class are created in WE
- helping with creation of Art Appreciation course
- general gardening
- copyright
- special effort made to include images that are openly licensed so content can be considered OER
- Building blocks
- data: numerical facts about individuals, cases, or subjects....organized into a dataset
- data are the basis for information
- variables: any measurable characteristic of an individual
- measured characteristics: temperature, age, test score
- attributes: M/F, years of education
- individuals: objects being measured
- may be people, other living things (trees, animal populations) or objects (furnaces, chairs, coffee cups)
- Dataset
- dataset includes the values of the variables for each individual included in the data
- example....not terribly interesting, but provides the context for the data
- variables...columns
- individuals...rows
- review variables/values for individual 1
- dataset available in R....open-source statistical software....very powerful, used by working statisticians
- graphs used in exploratory data analysis, to help us understand the data...it's too hard to study it in the spreadsheet format
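A minimal sketch of the rows/columns idea, in plain Python rather than R. The variable names and every value here are invented for illustration; they are not the dataset shown in the presentation.

```python
# Hypothetical dataset: every value below is invented for illustration.
# Each dict is one individual (a row); each key is one variable (a column).
dataset = [
    {"sex": "F", "age": 19, "height_cm": 165.0, "pulse": 72},
    {"sex": "M", "age": 22, "height_cm": 180.5, "pulse": 68},
    {"sex": "F", "age": 20, "height_cm": 158.0, "pulse": 75},
]

# Variables are the columns (the keys shared by every row).
variables = list(dataset[0].keys())

# Individuals are the rows.
n_individuals = len(dataset)

# Review the variables/values for individual 1.
for name in variables:
    print(name, "=", dataset[0][name])

# One column: the values of a single variable across all individuals.
ages = [row["age"] for row in dataset]
```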
- Data always exists within a real-world context
- What questions are we looking to answer with this data?
- What population are we interested in?
- Individuals
- Who are the individuals in the dataset?
- How were they chosen?
- How many are there?
- Variables
- What variables are included in the data?
- What is the definition for each of the variables?
- How were the values in these variables obtained? That is, how were the variables measured?
- Categorical vs. quantitative variables
- categorical: classifies each individual into a category
- quantitative: numerical value or measurement for each individual
- review each variable
- sometimes categorical variables are coded with numbers in the dataset...1=male, 2=female....still categorical
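A small sketch of why a numerically coded variable is still categorical. The 1=male, 2=female coding comes from the note above; the list of codes itself is made up.

```python
from collections import Counter

# Hypothetical coded column: 1 = male, 2 = female.
sex_codes = [1, 2, 2, 1, 2]
labels = {1: "male", 2: "female"}

# Decode before summarizing so the variable is treated as categories.
sex = [labels[code] for code in sex_codes]
counts = Counter(sex)

# The mean of the codes can be computed, but it has no useful meaning:
# the numbers are only names for the categories.
mean_code = sum(sex_codes) / len(sex_codes)

print(counts)
print(mean_code)
```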
- Frequency distribution
- useful to summarize responses for categorical variables
- not a graph
- provides count and percent for each value in the variable
- organized in sequence, if one exists
- if no logical sequence in the values, organize by largest to smallest count/percent
- example poverty rates in countries...not useful to organize alphabetically
- clear display, communicates well...better than a graph?
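The frequency distribution described above can be sketched as follows. The helper name `frequency_distribution` and the list of majors are hypothetical; the ordering rule (largest to smallest count when the values have no logical sequence) follows the notes.

```python
from collections import Counter

def frequency_distribution(values):
    """Return (value, count, percent) rows, largest count first."""
    counts = Counter(values)
    total = len(values)
    rows = [(v, c, 100.0 * c / total) for v, c in counts.items()]
    # No logical sequence assumed, so order by largest to smallest count;
    # ties are broken alphabetically only to make the output stable.
    rows.sort(key=lambda r: (-r[1], r[0]))
    return rows

# Hypothetical categorical variable (invented values).
majors = ["Biology", "Art", "Biology", "Nursing",
          "Biology", "Nursing", "Art", "Biology"]

table = frequency_distribution(majors)
for value, count, percent in table:
    print(f"{value:<10}{count:>5}{percent:>8.1f}%")
```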
- Bar graph
- used to display count or percent in each category
- Strengths
- this bar chart is clear and concise....easy to read....pleasing color scheme
- includes a title to provide context for information
- count on bar...could also be percent of total
- Weaknesses
- title not fully descriptive
- x-axis and value labels not spelled out
- order of bars....alphabetical
- Pie chart
- used to display count or percent in each category
- only useful when all of the categories which make up the whole are included
- can't use it to show portions of students in 4 majors...have to include all of the students
- Strengths
- more descriptive title
- informative: includes counts and percents for each slice
- colors clearly distinguish different slices
- not too many slices
- 3D vs 2D
- issue with 3D is that it distorts the areas of the sectors: the red and blue sectors are drawn with more area, which may make them seem larger than they actually are....our brains have to compensate for this
- 3D is widely criticized for unnecessarily complicating the data display...avoid
- not used in research to communicate results; widely used in business and media
- for 3 categories, the frequency distribution may be clearest....does the visual display add anything?
- Pie charts: some issues and advice
- Issue: humans are less able to judge the size of angles and area in a pie chart than heights in a bar chart.
- Advice:
- If there are more than 6 categories, consider using a bar graph.
- Order the slices from largest to smallest, starting at the top.
- Histogram
- used to display quantitative variable
- provides visual impression of data....shape, center, spread
- breaks the range of data into classes (intervals) and displays the count or percent of observations in each class
- Pulse
- intervals are 10 beats wide; a bell-shaped, approximately Normal distribution, centered around 70 beats per minute
- n is only 192....lots of missing data for this variable....would want to investigate as to why.
- Age
- intervals are 5 years, a right-skewed distribution (data piled up on the left, with a few individuals out to the right), two outliers above 70
- designed such that area across all of the bars adds to one; bars show proportion of data in that interval
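A rough sketch of how a histogram breaks data into classes, and of the version where the bar areas add to one (bar height = proportion / class width). The pulse values and the function name `histogram_classes` are invented; values outside the chosen range are simply ignored here.

```python
def histogram_classes(values, start, width, n_classes):
    """Count observations in classes [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_classes
    for x in values:
        k = int((x - start) // width)
        if 0 <= k < n_classes:
            counts[k] += 1
    n = sum(counts)
    rows = []
    for k, count in enumerate(counts):
        low = start + k * width
        proportion = count / n
        density = proportion / width  # bar height when total bar area is 1
        rows.append((low, low + width, count, proportion, density))
    return rows

# Hypothetical pulse rates in beats per minute (invented values).
pulse = [58, 62, 64, 66, 68, 70, 71, 72, 74, 75, 78, 81, 84, 92]

rows = histogram_classes(pulse, start=50, width=10, n_classes=5)
for low, high, count, proportion, density in rows:
    # A text bar stands in for the drawn bar.
    print(f"{low}-{high}: {'#' * count} ({proportion:.0%})")

# Area of each bar is height * width; together the areas add to one.
total_area = sum(density * (high - low) for low, high, _, _, density in rows)
```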
- Stem-and-leaf plot
- shows the shape of the distribution with the actual values
- stem is the tens place and leaves are the ones place....review top/bottom of pulse
- pulse
- Normal distribution
- displays values...min/max
- age
- the big pile-up between 17 and 24 is too large to show on the plot (it's cut off)...the graph doesn't work
- note that two outliers are 70 and 73, taking stats I with a bunch of undergrads...need to be investigated
- most useful for getting quick idea of shape using pencil/paper for small dataset
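A pencil-and-paper style stem-and-leaf plot can be sketched in a few lines. This assumes non-negative whole-number values, with tens as stems and ones as leaves; the pulse values and the name `stem_and_leaf` are invented for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: [leaves]} with tens as stems and ones as leaves."""
    plot = defaultdict(list)
    for x in sorted(values):
        plot[x // 10].append(x % 10)
    return dict(plot)

# Hypothetical pulse rates in beats per minute (invented values).
pulse = [58, 62, 64, 66, 68, 70, 71, 72, 74, 75, 78, 81, 84, 92]

plot = stem_and_leaf(pulse)
# Print every stem from min to max so gaps in the data stay visible.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```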
- Boxplot
- used to display quantitative variable
- provides visual display of median, quartiles, min/max, and outliers...review how each is displayed
- pulse
- mean and median are close and reasonably centered
- program classifies a few outer points as outliers...not visually apparent in histogram
- age
- the skewed distribution means that the bottom 50% are all jammed up together at the bottom
- displays lots of outliers
- mean is larger than median
- not very useful to help us visualize data
- can be useful, but for one variable histogram is preferred.
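A sketch of the numbers a boxplot displays. Two assumptions to flag: quartiles are taken as medians of the lower and upper halves, and outliers are flagged with the common 1.5 × IQR rule; the notes do not say which conventions the program used, and software packages differ. Apart from the two outliers (70 and 73) mentioned in the notes, the ages are invented.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + n % 2:]  # skip the middle value when n is odd
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])

def flag_outliers(values):
    """Flag points beyond 1.5 * IQR from the quartiles (assumed rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]

# Hypothetical right-skewed ages; only 70 and 73 come from the notes.
ages = [18, 18, 19, 19, 19, 20, 20, 21, 22, 24, 27, 70, 73]

summary = five_number_summary(ages)
outliers = flag_outliers(ages)
print(summary, outliers)
# With a right skew, the mean is pulled above the median.
print(statistics.mean(ages), statistics.median(ages))
```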
- Exploring relationships
- Graph two variables at once
- Must be measures of two variables for same set of individuals
- May suggest an association....knowing value of one variable tells something about the other
- Scatterplot
- used with two quantitative variables
- often consider whether one variable explains the other....explanatory/response relationship (also called independent/dependent)
- thought that height might explain pulse rate...put explanatory variable on x axis and response on y
- does not show much association....appears to be a ball of points....the best-fit line is nearly horizontal (no association)
- Scatterplot - hand spans
- two quantitative variables....doesn't really have explanatory/response relationship
- shows strong positive linear association, as we'd expect; centered on y=x line
- not all that interesting, variation could well be measurement error
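One way to put numbers on what a scatterplot shows is the correlation r and the least-squares (best-fit) line. The sketch below computes both by hand; the hand-span measurements and the function name are invented, and the notes do not state how the plotted line was actually fitted.

```python
import math

def correlation_and_line(xs, ys):
    """Return (r, slope, intercept) for the least-squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return r, slope, intercept

# Hypothetical hand spans in cm for the same six individuals (invented).
right_span = [18.0, 19.5, 20.0, 21.0, 22.5, 23.0]
left_span = [18.5, 19.0, 20.5, 21.0, 22.0, 23.5]

r, slope, intercept = correlation_and_line(right_span, left_span)
print(f"r = {r:.2f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```

An r near 1 with a slope near 1 matches the "centered on y=x" pattern; an r near 0 with a nearly horizontal line matches the "ball of points" pattern.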
- Side-by-side boxplots
- one categorical and one quantitative variable
- categorical on x axis, quantitative on y axis
- compare heights for males and females
- differences are very clear
- males have much larger range of height than females
- forgot to label that height is in cm (important)
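The comparison behind side-by-side boxplots amounts to splitting the quantitative variable by the categorical one and summarizing each group. A minimal sketch with invented heights (units kept in the key names, since the notes stress labeling cm):

```python
import statistics
from collections import defaultdict

# Hypothetical (sex, height in cm) pairs; values invented for illustration.
records = [("F", 158), ("F", 162), ("F", 165), ("F", 168), ("F", 171),
           ("M", 166), ("M", 174), ("M", 179), ("M", 185), ("M", 193)]

# Split the quantitative variable by category.
groups = defaultdict(list)
for sex, height_cm in records:
    groups[sex].append(height_cm)

# Summarize each group: center (median) and spread (range).
summary = {
    sex: {"median_cm": statistics.median(h), "range_cm": max(h) - min(h)}
    for sex, h in groups.items()
}
for sex, stats in summary.items():
    print(sex, stats)
```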
- Two-way table
- two categorical variables
- use conditional percents to describe association
- conditioned on explanatory variable....of the individuals in each explanatory category, what percent are in each response category
- explain handedness vs. clapping
- of those who are right handed, 66% clap right on left
- of those who are left handed, 22% clap right on left
- the conditional percents suggest there is an association between handedness and how a person claps
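Conditional percents are row percents when the explanatory variable defines the rows. In the sketch below the counts are invented, chosen only so the results land near the 66% and 22% quoted above; they are not the counts from the presentation's table.

```python
# Hypothetical counts: rows are the explanatory variable (handedness),
# columns are the response variable (clapping style).
table = {
    "right-handed": {"right on left": 33, "left on right": 17},
    "left-handed": {"right on left": 2, "left on right": 7},
}

def conditional_percents(two_way):
    """Percent in each response category, within each explanatory row."""
    result = {}
    for row, cells in two_way.items():
        row_total = sum(cells.values())
        result[row] = {col: 100.0 * n / row_total for col, n in cells.items()}
    return result

percents = conditional_percents(table)
for row, cells in percents.items():
    for col, pct in cells.items():
        print(f"of those who are {row}, {pct:.0f}% clap {col}")
```

If the row percents were about the same for both rows, the table would suggest no association.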
- Time plot (or time series)
- quantitative variable(s) on y-axis; time on x-axis
- US TB cases per 100,000...shows two related variables (% change on right and rate on left)
- clearly shows the trends, nice having both variables together...can see the patterns
- titling is lacking
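A sketch of the two related series such a time plot shows together: a rate for each period and the percent change from the previous period. The rates below are invented and are not actual US TB figures; the percent-change formula (change divided by the previous value) is the usual one and is assumed here.

```python
def percent_changes(rates):
    """Percent change from each period to the next."""
    changes = []
    for prev, curr in zip(rates, rates[1:]):
        changes.append(100.0 * (curr - prev) / prev)
    return changes

# Hypothetical rates per 100,000 for four consecutive periods (invented).
rates = [6.0, 5.4, 5.4, 4.86]
changes = percent_changes(rates)

# Text version of the plot: time runs down the page, bar length is the rate.
for period, rate in enumerate(rates, start=1):
    print(f"period {period}: {'#' * round(rate * 5)} {rate}")
print([round(c, 1) for c in changes])
```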