Title: | Datasets and Supplemental Functions from 'OpenIntro' Textbooks and Labs |
Version: | 2.5.0 |
Description: | Supplemental functions and data for 'OpenIntro' resources, which include open-source textbooks and resources for introductory statistics (https://www.openintro.org/). The package contains datasets used in our open-source textbooks along with custom plotting functions for reproducing book figures. Note that many functions and examples include color transparency; some plotting elements may not show up properly (or at all) when run in some versions of the Windows operating system. |
License: | GPL-3 |
Encoding: | UTF-8 |
LazyData: | true |
LazyDataCompression: | xz |
RoxygenNote: | 7.3.1 |
Suggests: | broom, dplyr, forcats, knitr, lubridate, scales, testthat (≥ 3.0.0), tidyr, tidytext, stringr, maps |
Imports: | ggplot2 (≥ 2.2.1), graphics, readr, rmarkdown, tibble |
Depends: | R (≥ 2.10), airports, cherryblossom, usdata |
URL: | http://openintrostat.github.io/openintro/, https://github.com/OpenIntroStat/openintro/ |
BugReports: | https://github.com/OpenIntroStat/openintro/issues |
VignetteBuilder: | knitr |
Config/testthat/edition: | 3 |
NeedsCompilation: | no |
Packaged: | 2024-05-28 12:59:17 UTC; mine |
Author: | Mine Çetinkaya-Rundel |
Maintainer: | Mine Çetinkaya-Rundel <cetinkaya.mine@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2024-05-31 13:40:02 UTC |
openintro: Datasets and Supplemental Functions from 'OpenIntro' Textbooks and Labs
Description
Supplemental functions and data for 'OpenIntro' resources, which include open-source textbooks and resources for introductory statistics (https://www.openintro.org/). The package contains datasets used in our open-source textbooks along with custom plotting functions for reproducing book figures. Note that many functions and examples include color transparency; some plotting elements may not show up properly (or at all) when run in some versions of the Windows operating system.
Author(s)
Maintainer: Mine Çetinkaya-Rundel cetinkaya.mine@gmail.com (ORCID)
Authors:
David Diez david.m.diez@gmail.com
Andrew Bray andrew.bray@gmail.com
Albert Y. Kim albert.ys.kim@gmail.com (ORCID)
Ben Baumer ben.baumer@gmail.com
Chester Ismay chester.ismay@gmail.com
Nick Paterno nicholas.paterno@gmail.com
Christopher Barr
See Also
Useful links:
Report bugs at https://github.com/OpenIntroStat/openintro/issues
Create a Line That may have Arrows on the Ends
Description
Similar to lines, this function will include endpoints that are solid points, open points, or arrows (mix-and-match ready).
Usage
ArrowLines(
x,
y,
lty = 1,
lwd = 2.5,
col = 1,
length = 0.1,
af = 3,
cex.pch = 1.2,
ends = c("a", "a"),
...
)
Arguments
x |
A vector of the x-coordinates of the line to be drawn. |
y |
A vector of the y-coordinates of the line to be drawn. This vector should have the same length as that of x. |
lty |
The line type. |
lwd |
The line width. |
col |
The line and endpoint color. |
length |
If an end point is an arrow, then this specifies the sizing of
the arrow. See the |
af |
A tuning parameter for creating the arrow. Usually the default
( |
cex.pch |
Plotting character size (if open or closed point at the end). |
ends |
A character vector of length 2, where the first value
corresponds to the start of the line and the second to the end of the line.
A value of |
... |
All additional arguments are passed to the
|
Author(s)
David Diez
Examples
CCP(xlim = c(-6, 6), ylim = c(-6, 6), ticklabs = 2)
x <- c(-2, 0, 2, 4)
y <- c(0, 3, 0, 3)
ArrowLines(x, y, col = COL[1], ends = c("c", "c"))
points(x, y, col = COL[1], pch = 19, cex = 1.2)
CCP(xlim = c(-6, 6), ylim = c(-6, 6), ticklabs = 2)
x <- c(-3, 0, 1, 3)
y <- c(2, 1, -2, 1)
ArrowLines(x, y, col = COL[1], ends = c("c", "c"))
points(x, y, col = COL[1], pch = 19, cex = 1.2)
CCP(xlim = c(-6, 6), ylim = c(-6, 6), ticklabs = 2)
x <- seq(-2, 2, 0.01)
y <- x^2 - 3
ArrowLines(x, y, col = COL[1], ends = c("c", "c"))
x <- seq(-2, 2, 1)
y <- x^2 - 3
points(x, y, col = COL[1], pch = 19, cex = 1.2)
Build Better Looking Axis Labels for US Dollars
Description
Convert and simplify axis labels that are in US Dollars.
Usage
AxisInDollars(side, at, include.symbol = TRUE, simplify = TRUE, ...)
Arguments
side |
An integer specifying which side of the plot the axis is to be drawn on. The axis is placed as follows: 1 = below, 2 = left, 3 = above and 4 = right. |
at |
The points at which tick-marks are to be drawn. |
include.symbol |
Whether to include a dollar or percent symbol, where the symbol chosen depends on the function. |
simplify |
For dollars, simplify the amount to use abbreviations of
|
... |
Arguments passed to |
Value
The numeric locations on the axis scale at which tick marks were drawn when the plot was first drawn.
Author(s)
David Diez
See Also
buildAxis
AxisInDollars
AxisInPercent
Examples
x <- sample(50e6, 100)
hist(x, axes = FALSE)
AxisInDollars(1, pretty(x))
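For comparison, similar labels can be built by hand with the base graphics axis function. The sketch below is not the package's implementation: the helper name dollar_labels_m and the choice to abbreviate every value in millions are assumptions made purely for illustration.

```r
# Hypothetical helper (not part of openintro): label tick values in
# millions of US dollars, e.g. 10000000 becomes "$10m".
dollar_labels_m <- function(at) {
  paste0("$", at / 1e6, "m")
}

set.seed(1)
x <- sample(50e6, 100)
at <- pretty(x)

pdf(NULL) # draw on a null device so that no file is written
hist(x, axes = FALSE)
axis(1, at = at, labels = dollar_labels_m(at))
dev.off()
```

AxisInDollars wraps this kind of bookkeeping in a single call, so the manual version is mainly useful when a different abbreviation scheme is needed.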
Build Better Looking Axis Labels for Percentages
Description
Convert and simplify axis labels that are in percentages.
Usage
AxisInPercent(side, at, include.symbol = TRUE, simplify = TRUE, ...)
Arguments
side |
An integer specifying which side of the plot the axis is to be drawn on. The axis is placed as follows: 1 = below, 2 = left, 3 = above and 4 = right. |
at |
The points at which tick-marks are to be drawn. |
include.symbol |
Whether to include a dollar or percent symbol, where the symbol chosen depends on the function. |
simplify |
For dollars, simplify the amount to use abbreviations of
|
... |
Arguments passed to |
Value
The numeric locations on the axis scale at which tick marks were drawn when the plot was first drawn.
Author(s)
David Diez
See Also
buildAxis
AxisInDollars
Examples
x <- sample(50e6, 100)
hist(x, axes = FALSE)
AxisInPercent(1, pretty(x))
Add background color to a plot
Description
Overlays a colored rectangle over the entire plotting region.
Usage
BG(col = openintro::COL[5, 9])
Arguments
col |
Color to overlay. |
Examples
Test <- function(col) {
plot(1:7,
col = COL[1:7], pch = 19, cex = 5,
xlim = c(0, 8),
ylim = c(0, 9)
)
BG(col)
points(2:8, col = COL[1:7], pch = 19, cex = 5)
text(2, 6, "Correct Color")
text(6, 2, "Affected Color")
}
# Works well since the black color is almost fully transparent
Test(COL[5, 9])
# Works less well since transparency isn't as significant
Test(COL[5, 6])
# Pretty ugly due to overlay
Test(COL[5, 3])
# Basically useless due to heavy color gradient
Test(COL[4, 2])
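The overlay idea can be sketched in base graphics by drawing a semi-transparent rectangle across the current plotting region. This is only an illustration under stated assumptions: the function name overlay_bg and the default color are hypothetical, and the package's BG may be implemented differently.

```r
# Hypothetical stand-in for BG(); not the package's own code.
overlay_bg <- function(col = "#00000011") {
  usr <- par("usr") # c(x1, x2, y1, y2) of the current plotting region
  rect(usr[1], usr[3], usr[2], usr[4], col = col, border = NA)
  invisible(usr)
}

pdf(NULL) # null device: nothing is written to disk
plot(1:7, pch = 19, cex = 5, xlim = c(0, 8), ylim = c(0, 9))
usr <- overlay_bg()
points(2:8, pch = 19, cex = 5) # drawn after the overlay, so unaffected by it
dev.off()
```

Because the rectangle is painted over anything already drawn, points plotted before the overlay pick up its tint, which is the effect the Test function above demonstrates.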
Plot a Braces Symbol
Description
This function is not yet very flexible.
Usage
Braces(x, y, face.radians = 0, long = 1, short = 0.2, ...)
Arguments
x |
x-coordinate of the center of the braces. |
y |
y-coordinate of the center of the braces. |
face.radians |
Radians of where the braces should face. For example,
the default with |
long |
The units for the long dimension of the braces. |
short |
The units for the short dimension of the braces. This must be less than or equal to half of the long dimension. |
... |
Arguments passed to |
Author(s)
David Diez
Examples
plot(0:1, 0:1, type = "n")
Braces(0.5, 0.5, face.radians = 3 * pi / 2)
Plot a Cartesian Coordinate Plane
Description
Create a Cartesian Coordinate Plane.
Usage
CCP(
xlim = c(-4, 4),
ylim = c(-4, 4),
mar = rep(0, 4),
length = 0.1,
tcl = 0.007,
xylab = FALSE,
ticks = 1,
ticklabs = 1,
xpos = 1,
ypos = 2,
cex.coord = 1,
cex.xylab = 1.5,
add = FALSE
)
Arguments
xlim |
The x-limits for the plane (vector of length 2). |
ylim |
The y-limits for the plane (vector of length 2). |
mar |
Plotting margins. |
length |
The |
tcl |
Tick size. |
xylab |
Whether x and y should be shown next to the labels. |
ticks |
How frequently tick marks should be shown on the axes. If a vector of length 2, the first argument will correspond to the x-axis and the second to the y-axis. |
ticklabs |
How frequently tick labels should be shown on the axes. If a vector of length 2, the first argument will correspond to the x-axis and the second to the y-axis. |
xpos |
The position of the labels on the x-axis. See the |
ypos |
The position of the labels on the y-axis. See the |
cex.coord |
Inflation factor for font size of the coordinates, where
any value larger than zero is acceptable and |
cex.xylab |
Inflation factor for font size of the x and y labels, where
any value larger than zero is acceptable and |
add |
Indicate whether a new plot should be created ( |
Author(s)
David Diez
See Also
lsegments, dlsegments, ArrowLines
Examples
CCP()
CCP(xylab = TRUE, ylim = c(-3.5, 2), xpos = 3, cex.coord = 1)
CCP(xlim = c(-8, 8), ylim = c(-10, 6), ticklabs = c(2, 2), cex.xylab = 0.8)
OpenIntro Statistics colors
Description
These are the core colors used for the OpenIntro Statistics textbook. The blue, green, yellow, and red colors are also gray-scaled, meaning no changes are required when printing black and white copies.
Usage
COL
Format
A 7-by-13 matrix of 7 colors with thirteen fading scales: blue, green, yellow, red, black, gray, and light gray.
Source
Colors selected by OpenIntro's in-house graphic designer, Meenal Patel.
Examples
plot(1:7, 7:1,
col = COL, pch = 19, cex = 6, xlab = "", ylab = "",
xlim = c(0.5, 7.5), ylim = c(-2.5, 8), axes = FALSE
)
text(1:7, 7:1 + 0.7, paste("COL[", 1:7, "]", sep = ""), cex = 0.9)
points(1:7, 7:1 - 0.7, col = COL[, 2], pch = 19, cex = 6)
points(1:7, 7:1 - 1.4, col = COL[, 3], pch = 19, cex = 6)
points(1:7, 7:1 - 2.1, col = COL[, 4], pch = 19, cex = 6)
Contingency Table to Data Frame
Description
Take a 2D contingency table and create a data frame representing the individual cases.
Usage
CT2DF(x, rn = row.names(x), cn = colnames(x), dfn = c("row.var", "col.var"))
Arguments
x |
Contingency table as a matrix. |
rn |
Character vector of the row names. |
cn |
Character vector of the column names. |
dfn |
Character vector with 2 values for the variable representing the rows and columns. |
Value
A data frame with two columns.
Author(s)
David Diez
Examples
a <- matrix(
c(459, 727, 854, 385, 99, 4198, 6245, 4821, 1634, 578),
2,
byrow = TRUE
)
b <-
CT2DF(
a,
c("No", "Yes"),
c("Excellent", "Very good", "Good", "Fair", "Poor"),
c("coverage", "health_status")
)
table(b)
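To make the conversion concrete, the same expansion can be sketched in base R by repeating each cell's row and column label as many times as its count. The function name table_to_cases is hypothetical and the package's CT2DF may work differently; the sketch only shows the idea.

```r
# Hypothetical stand-in for CT2DF(); the package's own code may differ.
table_to_cases <- function(x, rn, cn, dfn = c("row.var", "col.var")) {
  counts <- as.vector(x) # cell counts in column-major order
  rows <- rep(rn, times = ncol(x)) # row label of each cell
  cols <- rep(cn, each = nrow(x)) # column label of each cell
  out <- data.frame(
    factor(rep(rows, times = counts), levels = rn),
    factor(rep(cols, times = counts), levels = cn)
  )
  names(out) <- dfn
  out
}

a <- matrix(
  c(459, 727, 854, 385, 99, 4198, 6245, 4821, 1634, 578),
  2,
  byrow = TRUE
)
b <- table_to_cases(
  a,
  c("No", "Yes"),
  c("Excellent", "Very good", "Good", "Fair", "Poor"),
  c("coverage", "health_status")
)
# Tabulating the cases recovers the original counts
table(b)
```

Storing both columns as factors with explicit levels keeps the row and column order of the original table when the cases are tabulated again.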
Plot upper tail in chi-square distribution
Description
Plot a chi-square distribution and shade the upper tail.
Usage
ChiSquareTail(
U,
df,
xlim = c(0, 10),
col = fadeColor("black", "22"),
axes = TRUE,
...
)
Arguments
U |
Cut off for the upper tail. |
df |
Degrees of freedom. |
xlim |
Limits for the plot. |
col |
Color of the shading. |
axes |
Whether to plot an x-axis. |
... |
Currently ignored. |
Value
Nothing is returned from the function.
Author(s)
David Diez
Examples
data(COL)
ChiSquareTail(11.7,
7,
c(0, 25),
col = COL[1]
)
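The area of the shaded region is the upper-tail probability of the chi-square distribution, which base R computes with pchisq. The short sketch below reuses the cutoff and degrees of freedom from the example above; the variable names are chosen here for illustration.

```r
U <- 11.7 # cutoff for the upper tail
df <- 7 # degrees of freedom

# P(X > U) for X following a chi-square distribution with df degrees of freedom
upper_tail <- pchisq(U, df = df, lower.tail = FALSE)
upper_tail
```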
Introduction to Modern Statistics (IMS) Colors
Description
These are the core colors used for the Introduction to Modern Statistics textbook. The blue, green, pink, yellow, and red colors are also gray-scaled, meaning no changes are required when printing black and white copies.
Usage
IMSCOL
Format
An 8-by-13 matrix of 8 colors with thirteen fading scales: blue, green, pink, yellow, red, black, gray, and light gray.
Examples
plot(1:7, 7:1,
col = IMSCOL, pch = 19, cex = 6, xlab = "", ylab = "",
xlim = c(0.5, 7.5), ylim = c(-2.5, 8), axes = FALSE
)
text(1:7, 7:1 + 0.7, paste("IMSCOL[", 1:7, "]", sep = ""), cex = 0.9)
points(1:7, 7:1 - 0.7, col = IMSCOL[, 2], pch = 19, cex = 6)
points(1:7, 7:1 - 1.4, col = IMSCOL[, 3], pch = 19, cex = 6)
points(1:7, 7:1 - 2.1, col = IMSCOL[, 4], pch = 19, cex = 6)
LAhomes
Description
Data collected by Andrew Bray at Reed College on characteristics of LA Homes in 2010.
Usage
LAhomes
Format
A data frame with 1594 observations on the following 8 variables.
- city
City where the home is located.
- type
Type of home with levels Condo/Twh (condo or townhouse), SFR (single family residence), and NA.
- bed
Number of bedrooms in the home.
- bath
Number of bathrooms in the home.
- garage
Number of cars that can be parked in the garage. Note that a value of 4 refers to 4 or more garage spaces.
- sqft
Square footage of the home.
- pool
Indicates if the home has a pool.
- price
Listing price of the home.
Examples
library(ggplot2)
ggplot(LAhomes, aes(sqft, price)) +
geom_point(alpha = 0.2) +
theme_minimal() +
labs(
title = "Can we predict list price from square footage?",
subtitle = "Homes in the Los Angeles area",
x = "Square feet",
y = "List price"
)
Custom Mosaic Plot
Description
Plot a mosaic plot custom built for a particular figure.
Usage
MosaicPlot(
formula,
data,
col = "#00000022",
border = 1,
dir = c("v", "h"),
off = 0.01,
cex.axis = 0.7,
col.dir = "v",
flip = c("v"),
...
)
Arguments
formula |
Formula describing the variable relationship. |
data |
Data frame for the variables, optional. |
col |
Colors for plotting. |
border |
Ignored. |
dir |
Ignored. |
off |
Fraction of white space between each box in the plot. |
cex.axis |
Axis label size. |
col.dir |
Direction to lay out colors. |
flip |
Whether to flip the ordering of the vertical ( |
... |
Ignored. |
Author(s)
David Diez
Examples
data(email)
data(COL)
email$spam <- ifelse(email$spam == 0, "not\nspam", "spam")
MosaicPlot(number ~ spam, email, col = COL[1:3], off = 0.02)
Plot data and add a regression line
Description
Plot data and add a regression line.
Usage
PlotWLine(
x,
y,
xlab = "",
ylab = "",
col = fadeColor(4, "88"),
cex = 1.2,
pch = 20,
n = 4,
nMax = 4,
yR = 0.1,
axes = TRUE,
...
)
Arguments
x |
Predictor variable. |
y |
Outcome variable. |
xlab |
x-axis label. |
ylab |
y-axis label. |
col |
Color of points. |
cex |
Size of points. |
pch |
Plotting character. |
n |
The preferred number of axis labels. |
nMax |
The maximum number of axis labels. |
yR |
y-limit buffer factor. |
axes |
Boolean to indicate whether or not to include axes. |
... |
Passed to |
Examples
PlotWLine(1:10, seq(-5, -2, length.out = 10) + rnorm(10))
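A rough base R equivalent is a scatterplot followed by abline on a fitted model. The sketch assumes the regression line is an ordinary least-squares fit, which this page does not state explicitly, and it does not reproduce the axis-label handling controlled by n, nMax, and yR.

```r
set.seed(42)
x <- 1:10
y <- seq(-5, -2, length.out = 10) + rnorm(10)

# Assumption: a simple least-squares line of y on x
fit <- lm(y ~ x)

pdf(NULL) # null device: nothing is written to disk
plot(x, y, pch = 20, cex = 1.2, xlab = "", ylab = "")
abline(fit)
dev.off()

coef(fit) # intercept and slope of the fitted line
```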
Absenteeism from school in New South Wales
Description
Researchers interested in the relationship between absenteeism from school and certain demographic characteristics of children collected data from 146 randomly sampled students in rural New South Wales, Australia, in a particular school year.
Usage
absenteeism
Format
A data frame with 146 observations on the following 5 variables.
- eth
Ethnicity, representing Aboriginal (A) or not (N).
- sex
Gender.
- age
Age bucket.
- lrn
Learner status, with average learner (AL) and slow learner (SL).
- days
Number of days absent.
Source
Venables WN, Ripley BD. 2002. Modern Applied Statistics with S. Fourth Edition. New York: Springer.
Data can also be found in the R MASS package under the dataset name quine.
Examples
library(ggplot2)
ggplot(absenteeism, aes(x = eth, y = days)) +
geom_boxplot() +
coord_flip()
American Community Survey, 2012
Description
Results from the US Census American Community Survey, 2012.
Usage
acs12
Format
A data frame with 2000 observations on the following 13 variables.
- income
Annual income.
- employment
Employment status.
- hrs_work
Hours worked per week.
- race
Race.
- age
Age, in years.
- gender
Gender.
- citizen
Whether the person is a U.S. citizen.
- time_to_work
Travel time to work, in minutes.
- lang
Language spoken at home.
- married
Whether the person is married.
- edu
Education level.
- disability
Whether the person is disabled.
- birth_qrtr
The quarter of the year that the person was born, e.g. Jan thru Mar.
Source
https://www.census.gov/programs-surveys/acs
Examples
library(dplyr)
library(ggplot2)
library(broom)
# employed only
acs12_emp <- acs12 |>
filter(
age >= 30, age <= 60,
employment == "employed",
income > 0
)
# linear model
ggplot(acs12_emp, mapping = aes(x = age, y = income)) +
geom_point() +
geom_smooth(method = "lm")
lm(income ~ age, data = acs12_emp) |>
tidy()
# log-transformed model
ggplot(acs12_emp, mapping = aes(x = age, y = log(income))) +
geom_point() +
geom_smooth(method = "lm")
lm(log(income) ~ age, data = acs12_emp) |>
tidy()
Age at first marriage of 5,534 US women.
Description
Age at first marriage of 5,534 US women who responded to the National Survey of Family Growth (NSFG) conducted by the CDC in the 2006-2010 cycle.
Usage
age_at_mar
Format
A data frame with 5,534 observations and 1 variable.
- age
Age at first marriage.
Source
National Survey of Family Growth, 2006-2010 cycle, https://www.cdc.gov/nchs/nsfg/nsfg_2006_2010_puf.htm.
Examples
library(ggplot2)
ggplot(age_at_mar, mapping = aes(x = age)) +
geom_histogram(binwidth = 3) +
labs(
x = "Age", y = "Count", title = "Age at first marriage, US Women",
subtitle = "Source: National Survey of Family Growth, 2006 - 2010"
)
Housing prices in Ames, Iowa
Description
Data set contains information from the Ames Assessor's Office used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010. See here for detailed variable descriptions.
Usage
ames
Format
A tbl_df with 2930 rows and 82 variables:
- Order
Observation number.
- PID
Parcel identification number - can be used with city web site for parcel review.
- area
Above grade (ground) living area square feet.
- price
Sale price in USD.
- MS.SubClass
Identifies the type of dwelling involved in the sale.
- MS.Zoning
Identifies the general zoning classification of the sale.
- Lot.Frontage
Linear feet of street connected to property.
- Lot.Area
Lot size in square feet.
- Street
Type of road access to property.
- Alley
Type of alley access to property.
- Lot.Shape
General shape of property.
- Land.Contour
Flatness of the property.
- Utilities
Type of utilities available.
- Lot.Config
Lot configuration.
- Land.Slope
Slope of property.
- Neighborhood
Physical locations within Ames city limits (map available).
- Condition.1
Proximity to various conditions.
- Condition.2
Proximity to various conditions (if more than one is present).
- Bldg.Type
Type of dwelling.
- House.Style
Style of dwelling.
- Overall.Qual
Rates the overall material and finish of the house.
- Overall.Cond
Rates the overall condition of the house.
- Year.Built
Original construction date.
- Year.Remod.Add
Remodel date (same as construction date if no remodeling or additions).
- Roof.Style
Type of roof.
- Roof.Matl
Roof material.
- Exterior.1st
Exterior covering on house.
- Exterior.2nd
Exterior covering on house (if more than one material).
- Mas.Vnr.Type
Masonry veneer type.
- Mas.Vnr.Area
Masonry veneer area in square feet.
- Exter.Qual
Evaluates the quality of the material on the exterior.
- Exter.Cond
Evaluates the present condition of the material on the exterior.
- Foundation
Type of foundation.
- Bsmt.Qual
Evaluates the height of the basement.
- Bsmt.Cond
Evaluates the general condition of the basement.
- Bsmt.Exposure
Refers to walkout or garden level walls.
- BsmtFin.Type.1
Rating of basement finished area.
- BsmtFin.SF.1
Type 1 finished square feet.
- BsmtFin.Type.2
Rating of basement finished area (if multiple types).
- BsmtFin.SF.2
Type 2 finished square feet.
- Bsmt.Unf.SF
Unfinished square feet of basement area.
- Total.Bsmt.SF
Total square feet of basement area.
- Heating
Type of heating.
- Heating.QC
Heating quality and condition.
- Central.Air
Central air conditioning.
- Electrical
Electrical system.
- X1st.Flr.SF
First Floor square feet.
- X2nd.Flr.SF
Second floor square feet.
- Low.Qual.Fin.SF
Low quality finished square feet (all floors).
- Bsmt.Full.Bath
Basement full bathrooms.
- Bsmt.Half.Bath
Basement half bathrooms.
- Full.Bath
Full bathrooms above grade.
- Half.Bath
Half baths above grade.
- Bedroom.AbvGr
Bedrooms above grade (does NOT include basement bedrooms).
- Kitchen.AbvGr
Kitchens above grade.
- Kitchen.Qual
Kitchen quality.
- TotRms.AbvGrd
Total rooms above grade (does not include bathrooms).
- Functional
Home functionality (Assume typical unless deductions are warranted).
- Fireplaces
Number of fireplaces.
- Fireplace.Qu
Fireplace quality.
- Garage.Type
Garage location.
- Garage.Yr.Blt
Year garage was built.
- Garage.Finish
Interior finish of the garage.
- Garage.Cars
Size of garage in car capacity.
- Garage.Area
Size of garage in square feet.
- Garage.Qual
Garage quality.
- Garage.Cond
Garage condition.
- Paved.Drive
Paved driveway.
- Wood.Deck.SF
Wood deck area in square feet.
- Open.Porch.SF
Open porch area in square feet.
- Enclosed.Porch
Enclosed porch area in square feet.
- X3Ssn.Porch
Three season porch area in square feet.
- Screen.Porch
Screen porch area in square feet.
- Pool.Area
Pool area in square feet.
- Pool.QC
Pool quality.
- Fence
Fence quality.
- Misc.Feature
Miscellaneous feature not covered in other categories.
- Misc.Val
Dollar value of miscellaneous feature.
- Mo.Sold
Month Sold (MM).
- Yr.Sold
Year Sold (YYYY).
- Sale.Type
Type of sale.
- Sale.Condition
Condition of sale.
Source
De Cock, Dean. "Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project." Journal of Statistics Education 19.3 (2011).
Acute Myocardial Infarction (Heart Attack) Events
Description
This dataset is simulated but contains realistic occurrences of AMI in NY City.
Usage
ami_occurrences
Format
A data frame with 365 observations on the following variable.
- ami
Number of daily occurrences of heart attacks in NY City.
Examples
library(ggplot2)
ggplot(ami_occurrences, mapping = aes(x = ami)) +
geom_bar() +
labs(
x = "Acute Myocardial Infarction events",
y = "Count",
title = "Acute Myocardial Infarction events in NYC"
)
Pre-existing conditions in 92 children
Description
Pre-existing medical conditions of 92 children involved in a study on the optimal duration of antibiotic use in treatment of tracheitis, which is an upper respiratory infection.
Usage
antibiotics
Format
A data frame with 92 observations, each representing a child, on the following variable.
- condition
Pre-existing medical condition.
Examples
library(ggplot2)
ggplot(antibiotics, aes(x = condition)) +
geom_bar() +
labs(
x = "Condition", y = "Count",
title = "Pre-existing conditions of children",
subtitle = "in antibiotic use study"
) +
coord_flip()
Male and female births in London
Description
Arbuthnot's data describes male and female christenings (births) for London from 1629-1710.
Usage
arbuthnot
Format
A tbl_df with 82 rows and 3 variables:
- year
year, ranging from 1629 to 1710
- boys
number of male christenings (births)
- girls
number of female christenings (births)
Details
John Arbuthnot (1710) used these time series data to carry out the first known significance test. During every one of the 82 years, there were more male christenings than female christenings. As Arbuthnot wondered, we might also wonder if this could be due to chance, or whether it meant the birth ratio was not actually 1:1.
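The chance calculation behind this argument can be sketched in a few lines. The sketch assumes that, if the birth ratio were 1:1, each year would independently have a 50% chance of showing more male than female christenings (ties are ignored); this is a simplified reading of the test, not a reproduction of Arbuthnot's own computation.

```r
n_years <- 82

# Probability that all 82 years favor boys if each year were a fair coin flip
p_all_boys <- 0.5^n_years
p_all_boys

# The same quantity as a one-sided exact binomial (sign) test
sign_test <- binom.test(x = n_years, n = n_years, p = 0.5, alternative = "greater")
sign_test$p.value
```

A probability this small is why chance alone is an implausible explanation for the pattern.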
Source
These data are excerpted from the Arbuthnot dataset in the HistData package.
Examples
library(ggplot2)
library(tidyr)
# All births
ggplot(arbuthnot, aes(x = year, y = boys + girls, group = 1)) +
geom_line()
# Boys and girls
arbuthnot |>
pivot_longer(cols = -year, names_to = "sex", values_to = "n") |>
ggplot(aes(x = year, y = n, color = sex, group = sex)) +
geom_line()
How important is it to ask pointed questions?
Description
In this experiment, each individual was asked to be a seller of an iPod (a product commonly used to store music before smartphones). The participant received $10 + 5% of the sale price for participating. The iPod they were selling had frozen twice in the past inexplicably but otherwise worked fine. The prospective buyer starts off and then asks one of three final questions, depending on the seller's treatment group.
Usage
ask
Format
A data frame with 219 observations on the following 3 variables.
- question_class
The type of question: general, pos_assumption, and neg_assumption.
- question
The question corresponding to the question_class.
- response
The classified response from the seller, either disclose or hide.
Details
The three possible questions:
General: What can you tell me about it?
Positive Assumption: It doesn't have any problems, does it?
Negative Assumption: What problems does it have?
The outcome variable is whether or not the participant discloses or hides the problem with the iPod.
Source
Minson JA, Ruedy NE, Schweitzer ME. There is such a thing as a stupid question: Question disclosure in strategic communication.
Examples
library(dplyr)
library(ggplot2)
# Distribution of responses based on question type
ask |>
count(question_class, response)
# Visualize relative frequencies of responses based on question type
ggplot(ask, aes(x = question_class, fill = response)) +
geom_bar(position = "fill")
# Perform chi-square test
(test <- chisq.test(table(ask$question_class, ask$response)))
# Check the test's assumption around sufficient expected observations
# per table cell.
test$expected
Simulated data for association plots
Description
Simulated dataset.
Usage
association
Format
A data frame with 121 observations on the following 15 variables.
- x1
a numeric vector
- x2
a numeric vector
- x3
a numeric vector
- y1
a numeric vector
- y2
a numeric vector
- y3
a numeric vector
- y4
a numeric vector
- y5
a numeric vector
- y6
a numeric vector
- y7
a numeric vector
- y8
a numeric vector
- y9
a numeric vector
- y10
a numeric vector
- y11
a numeric vector
- y12
a numeric vector
Examples
library(ggplot2)
ggplot(association, aes(x = x1, y = y1)) +
geom_point()
ggplot(association, aes(x = x2, y = y4)) +
geom_point()
ggplot(association, aes(x = x3, y = y7)) +
geom_point()
Eye color of couples
Description
Eye colors of male and female partners.
Usage
assortative_mating
Format
A data frame with 204 observations on the following 2 variables.
- self_male
a factor with levels blue, brown, and green
- partner_female
a factor with levels blue, brown, and green
Source
B. Laeng et al. Why do blue-eyed men prefer women with the same eye color? In: Behavioral Ecology and Sociobiology 61.3 (2007), pp. 371-384.
Examples
data(assortative_mating)
table(assortative_mating)
Cardiovascular problems for two types of Diabetes medicines
Description
A comparison of cardiovascular problems for Rosiglitazone and Pioglitazone.
Usage
avandia
Format
A data frame with 227571 observations on the following 2 variables.
- treatment
a factor with levels Pioglitazone and Rosiglitazone
- cardiovascular_problems
a factor with levels no and yes
Source
D.J. Graham et al. Risk of acute myocardial infarction, stroke, heart failure, and death in elderly Medicare patients treated with rosiglitazone or pioglitazone. In: JAMA 304.4 (2010), p. 411. issn: 0098-7484.
Examples
table(avandia)
The Child Health and Development Studies
Description
The Child Health and Development Studies investigate a range of topics. One study, in particular, considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. We do not have ideal provenance for these data. For a better documented and more recent dataset on a similar topic with similar variables, see births14. Additionally, the Gestation dataset in the mosaicData package contains similar data.
Usage
babies
Format
A data frame with 1236 rows and 8 variables:
- case
id number
- bwt
birthweight, in ounces
- gestation
length of gestation, in days
- parity
binary indicator for a first pregnancy (0 = first pregnancy)
- age
mother's age in years
- height
mother's height in inches
- weight
mother's weight in pounds
- smoke
binary indicator for whether the mother smokes
Source
These data come from Child Health and Development Studies.
Crawling age
Description
Crawling age of babies along with the average outdoor temperature at 6 months of age.
Usage
babies_crawl
Format
A data frame with 12 observations on the following 5 variables.
- birth_month
A factor with levels corresponding to months
- avg_crawling_age
a numeric vector
- sd
a numeric vector
- n
a numeric vector
- temperature
a numeric vector
Source
J.B. Benson. Season of birth and onset of locomotion: Theoretical and methodological implications. In: Infant behavior and development 16.1 (1993), pp. 69-81. issn: 0163-6383.
Examples
library(ggplot2)
ggplot(babies_crawl, aes(x = temperature, y = avg_crawling_age)) +
geom_point() +
labs(x = "Temperature", y = "Average crawling age")
Beer and blood alcohol content
Description
Here we examine data from sixteen student volunteers at Ohio State University who each drank a randomly assigned number of cans of beer.
Usage
bac
Format
A data frame with 16 observations on the following 3 variables.
- student
a numeric vector
- beers
a numeric vector
- bac
a numeric vector
Source
J. Malkevitch and L.M. Lesser. For All Practical Purposes: Mathematical Literacy in Today's World. WH Freeman & Co, 2008. The data origin is given in the Electronic Encyclopedia of Statistical Examples and Exercises, 1992.
Examples
library(ggplot2)
ggplot(bac, aes(x = beers, y = bac)) +
geom_point() +
labs(x = "Number of beers", y = "Blood alcohol content")
Lifespan of ball bearings
Description
A simulated dataset on lifespan of ball bearings.
Usage
ball_bearing
Format
A data frame with 75 observations on the following variable.
- life_span
Lifespan of ball bearings (in hours).
Source
Simulated data.
Examples
library(ggplot2)
ggplot(ball_bearing, aes(x = life_span)) +
geom_histogram(binwidth = 1)
qqnorm(ball_bearing$life_span)
Body measurements of 507 physically active individuals.
Description
Body girth measurements and skeletal diameter measurements, as well as age, weight, height and gender, are given for 507 physically active individuals - 247 men and 260 women. These data can be used to provide statistics students practice in the art of data analysis. Such analyses range from simple descriptive displays to more complicated multivariate analyses such as multiple regression and discriminant analysis.
Usage
bdims
Format
A data frame with 507 observations on the following 25 variables.
- bia_di
A numerical vector, respondent's biacromial diameter in centimeters.
- bii_di
A numerical vector, respondent's biiliac diameter (pelvic breadth) in centimeters.
- bit_di
A numerical vector, respondent's bitrochanteric diameter in centimeters.
- che_de
A numerical vector, respondent's chest depth in centimeters, measured between spine and sternum at nipple level, mid-expiration.
- che_di
A numerical vector, respondent's chest diameter in centimeters, measured at nipple level, mid-expiration.
- elb_di
A numerical vector, respondent's elbow diameter in centimeters, measured as sum of two elbows.
- wri_di
A numerical vector, respondent's wrist diameter in centimeters, measured as sum of two wrists.
- kne_di
A numerical vector, respondent's knee diameter in centimeters, measured as sum of two knees.
- ank_di
A numerical vector, respondent's ankle diameter in centimeters, measured as sum of two ankles.
- sho_gi
A numerical vector, respondent's shoulder girth in centimeters, measured over deltoid muscles.
- che_gi
A numerical vector, respondent's chest girth in centimeters, measured at nipple line in males and just above breast tissue in females, mid-expiration.
- wai_gi
A numerical vector, respondent's waist girth in centimeters, measured at the narrowest part of torso below the rib cage as average of contracted and relaxed position.
- nav_gi
A numerical vector, respondent's navel (abdominal) girth in centimeters, measured at umbilicus and iliac crest using iliac crest as a landmark.
- hip_gi
A numerical vector, respondent's hip girth in centimeters, measured at the level of bitrochanteric diameter.
- thi_gi
A numerical vector, respondent's thigh girth in centimeters, measured below gluteal fold as the average of right and left girths.
- bic_gi
A numerical vector, respondent's bicep girth in centimeters, measured when flexed as the average of right and left girths.
- for_gi
A numerical vector, respondent's forearm girth in centimeters, measured when extended, palm up as the average of right and left girths.
- kne_gi
A numerical vector, respondent's knee girth in centimeters, measured as average of right and left girths.
- cal_gi
A numerical vector, respondent's calf maximum girth in centimeters, measured as average of right and left girths.
- ank_gi
A numerical vector, respondent's ankle minimum girth in centimeters, measured as average of right and left girths.
- wri_gi
A numerical vector, respondent's wrist minimum girth in centimeters, measured as average of right and left girths.
- age
A numerical vector, respondent's age in years.
- wgt
A numerical vector, respondent's weight in kilograms.
- hgt
A numerical vector, respondent's height in centimeters.
- sex
A categorical vector, 1 if the respondent is male, 0 if female.
Source
Heinz G, Peterson LJ, Johnson RW, Kerk CJ. 2003. Exploring Relationships in Body Dimensions. Journal of Statistics Education 11(2).
Examples
library(ggplot2)
ggplot(bdims, aes(x = hgt)) +
geom_histogram(binwidth = 5)
ggplot(bdims, aes(x = hgt, y = wgt)) +
geom_point() +
labs(x = "Height", y = "Weight")
ggplot(bdims, aes(x = hgt, y = sho_gi)) +
geom_point() +
labs(x = "Height", y = "Shoulder girth")
ggplot(bdims, aes(x = hgt, y = hip_gi)) +
geom_point() +
labs(x = "Height", y = "Hip girth")
Efficacy of Pfizer-BioNTech COVID-19 vaccine on adolescents
Description
On March 31, 2021, Pfizer and BioNTech announced that "in a Phase 3 trial in adolescents 12 to 15 years of age with or without prior evidence of SARS-CoV-2 infection, the Pfizer-BioNTech COVID-19 vaccine BNT162b2 demonstrated 100% efficacy and robust antibody responses, exceeding those recorded earlier in vaccinated participants aged 16 to 25 years old, and was well tolerated." These results are from a Phase 3 trial in 2,260 adolescents 12 to 15 years of age in the United States. In the trial, 18 cases of COVID-19 were observed in the placebo group (n = 1,129) versus none in the vaccinated group (n = 1,131).
Usage
biontech_adolescents
Format
A data frame with 2260 observations on the following 2 variables.
- group
Study group: vaccine (Pfizer-BioNTech COVID-19 vaccine administered) or placebo.
- outcome
Study outcome: COVID-19 or no COVID-19.
Source
"Pfizer-BioNTech Announce Positive Topline Results Of Pivotal Covid-19 Vaccine Study In Adolescents". March 31, 2021. (Retrieved April 25, 2021.)
Examples
library(dplyr)
library(ggplot2)
biontech_adolescents |>
count(group, outcome)
ggplot(biontech_adolescents, aes(y = group, fill = outcome)) +
geom_bar()
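The counts quoted in the Description can be turned into an efficacy estimate. The sketch below assumes the common definition of vaccine efficacy as one minus the ratio of attack rates (vaccine over placebo). The object names are illustrative, and the counts are typed in by hand rather than computed from biontech_adolescents.

```r
# Counts as reported in the Description (entered by hand for illustration)
cases <- c(vaccine = 0, placebo = 18)
n     <- c(vaccine = 1131, placebo = 1129)

# Attack rate (proportion with COVID-19) in each group
attack_rate <- cases / n

# Efficacy, assuming the definition 1 - (vaccine rate / placebo rate)
efficacy <- 1 - attack_rate[["vaccine"]] / attack_rate[["placebo"]]
efficacy
```

With zero cases in the vaccine group the ratio of attack rates is zero, which is how the announced figure of 100% arises.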
Aircraft-Wildlife Collisions
Description
A collection of all collisions between aircraft and wildlife that were reported to the US Federal Aviation Administration between 1990 and 1997, with details on the circumstances of the collision.
Usage
birds
Format
A data frame with 19302 observations on the following 17 variables.
- opid
Three letter identification code for the operator (carrier) of the aircraft.
- operator
Name of the aircraft operator.
- atype
Make and model of aircraft.
- remarks
Verbal remarks regarding the collision.
- phase_of_flt
Phase of the flight during which the collision occurred: Approach, Climb, Descent, En Route, Landing Roll, Parked, Take-off run, Taxi.
- ac_mass
Mass of the aircraft classified as 2250 kg or less (1), 2251-5700 kg (2), 5701-27000 kg (3), 27001-272000 kg (4), above 272000 kg (5).
- num_engs
Number of engines on the aircraft.
- date
Date of the collision (MM/DD/YYYY).
- time_of_day
Light conditions: Dawn, Day, Dusk, Night.
- state
Two letter abbreviation of the US state in which the collision occurred.
- height
Feet above ground level.
- speed
Knots (indicated air speed).
- effect
Effect on flight: Aborted Take-off, Engine Shut Down, None, Other, Precautionary Landing.
- sky
Type of cloud cover, if any: No Cloud, Overcast, Some Cloud.
- species
Common name for bird or other wildlife.
- birds_seen
Number of birds/wildlife seen by pilot: 1, 2-10, 11-100, Over 100.
- birds_struck
Number of birds/wildlife struck: 0, 1, 2-10, 11-100, Over 100.
Details
The FAA National Wildlife Strike Database contains strike reports that are voluntarily reported to the FAA by pilots, airlines, airports and others. Current research indicates that only about 20% of strikes are reported. Wildlife strike reporting is not uniform as some organizations have more robust voluntary reporting procedures. Because of variations in reporting, users are cautioned that comparisons between individual airports or airlines may be misleading.
Source
Aircraft Wildlife Strike Data: Search Tool - FAA Wildlife Strike Database. Available at https://datahub.transportation.gov/Aviation/Aircraft-Wildlife-Strike-Data-Search-Tool-FAA-Wild/jhay-dgxy. Retrieval date: Feb 4, 2012.
Examples
library(dplyr)
library(ggplot2)
library(forcats)
library(tidyr)
# Phase of the flight during which the collision occurred, tabular
birds |>
count(phase_of_flt, sort = TRUE)
# Phase of the flight during which the collision occurred, barplot
ggplot(birds, aes(y = fct_infreq(phase_of_flt))) +
  geom_bar() +
  labs(x = "Count", y = "Phase of flight")
# Height summary statistics
summary(birds$height)
# Phase of flight vs. effect of crash
birds |>
drop_na(phase_of_flt, effect) |>
ggplot(aes(y = phase_of_flt, fill = effect)) +
geom_bar(position = "fill") +
labs(x = "Proportion", y = "Phase of flight", fill = "Effect")
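The ac_mass variable stores a numeric code for the mass class. A minimal sketch of turning those codes into a labelled factor follows. It uses a short toy vector in place of birds$ac_mass, and the label strings are simply copied from the Format description above.

```r
# Toy codes standing in for birds$ac_mass (values are illustrative)
ac_mass <- c(1, 4, 3, 4, 2, 5, NA)

# Map codes 1-5 to the mass classes listed in the Format section
ac_mass_f <- factor(
  ac_mass,
  levels = 1:5,
  labels = c(
    "2250 kg or less", "2251-5700 kg", "5701-27000 kg",
    "27001-272000 kg", "above 272000 kg"
  )
)

# Tabulate, keeping missing values visible
table(ac_mass_f, useNA = "ifany")
```

Declaring all five levels keeps empty classes in tables and plots, which can be useful when comparing subsets of the data.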
North Carolina births, 100 cases
Description
Data on a random sample of 100 births for babies in North Carolina where the mother was not a smoker and another 50 where the mother was a smoker.
Usage
births
Format
A data frame with 150 observations on the following 9 variables.
- f_age
Father's age.
- m_age
Mother's age.
- weeks
Weeks at which the mother gave birth.
- premature
Indicates whether the baby was premature or not.
- visits
Number of hospital visits.
- gained
Weight gained by mother.
- weight
Birth weight of the baby.
- sex_baby
Gender of the baby.
- smoke
Whether or not the mother was a smoker.
Source
Birth records released by North Carolina in 2004.
See Also
We do not have ideal provenance for these data. For a better documented and more recent dataset on a similar topic with similar variables, see births14. Additionally, ncbirths also contains similar data.
Examples
library(ggplot2)
ggplot(births, aes(x = smoke, y = weight)) +
geom_boxplot()
US births
Description
Every year, the US releases to the public a large dataset containing information on births recorded in the country. This dataset has been of interest to medical researchers who are studying the relation between habits and practices of expectant mothers and the birth of their children. This is a random sample of 1,000 cases from the dataset released in 2014.
Usage
births14
Format
A data frame with 1,000 observations on the following 13 variables.
- fage
Father's age in years.
- mage
Mother's age in years.
- mature
Maturity status of mother.
- weeks
Length of pregnancy in weeks.
- premie
Whether the birth was classified as premature (premie) or full-term.
- visits
Number of hospital visits during pregnancy.
- gained
Weight gained by mother during pregnancy in pounds.
- weight
Weight of the baby at birth in pounds.
- lowbirthweight
Whether baby was classified as low birthweight (low) or not (not low).
- sex
Sex of the baby, female or male.
- habit
Status of the mother as a nonsmoker or a smoker.
- marital
Whether mother is married or not married at birth.
- whitemom
Whether mom is white or not white.
Source
United States Department of Health and Human Services. Centers for Disease Control and Prevention. National Center for Health Statistics. Natality Detail File, 2014 United States. Inter-university Consortium for Political and Social Research, 2016-10-07. doi:10.3886/ICPSR36461.v1.
Examples
library(ggplot2)
ggplot(births14, aes(x = habit, y = weight)) +
geom_boxplot() +
labs(x = "Smoking status of mother", y = "Birth weight of baby (in lbs)")
ggplot(births14, aes(x = whitemom, y = visits)) +
geom_boxplot() +
labs(x = "Mother's race", y = "Number of doctor visits during pregnancy")
ggplot(births14, aes(x = mature, y = gained)) +
geom_boxplot() +
labs(x = "Mother's age category", y = "Weight gained during pregnancy")
Blizzard Employee Voluntary Salary Info.
Description
Employee-generated anonymous survey of salary information.
Usage
blizzard_salary
Format
A data frame with 466 rows and 9 variables.
- timestamp
Time data was entered.
- status
Specifies employment status.
- current_title
Current job title.
- current_salary
Current salary (in USD).
- salary_type
Frequency with levels year, hour, week.
- percent_incr
Raise given July 2020.
- other_info
Other information submitted by employee.
- location
Current office of employment.
- performance_rating
Most recent review performance rating.
Source
Bloomberg - Blizzard workers share salaries in revolt over wage disparities.
Examples
library(ggplot2)
library(dplyr)
plot_data <- blizzard_salary |>
  mutate(annual_salary = case_when(
    salary_type == "week" ~ current_salary * 52,
    salary_type == "hour" ~ current_salary * 40 * 52,
    TRUE ~ current_salary
  ))
ggplot(plot_data, aes(annual_salary)) +
  geom_histogram(binwidth = 25000, color = "white") +
  labs(
    title = "Current Salary of Blizzard Employees",
    x = "Salary",
    y = "Number of Employees"
  )
Sample of books on a shelf
Description
Simulated dataset.
Usage
books
Format
A data frame with 95 observations on the following 2 variables.
- type
a factor with levels fiction and nonfiction
- format
a factor with levels hardcover and paperback
Examples
table(books)
Box plot
Description
An alternative to boxplot. Equations are not accepted. Instead, the second argument, fact, is used to split the data.
Usage
boxPlot(
x,
fact = NULL,
horiz = FALSE,
width = 2/3,
lwd = 1,
lcol = "black",
medianLwd = 2,
pch = 20,
pchCex = 1.8,
col = grDevices::rgb(0, 0, 0, 0.25),
add = FALSE,
key = NULL,
axes = TRUE,
xlab = "",
ylab = "",
xlim = NULL,
ylim = NULL,
na.rm = TRUE,
...
)
Arguments
x |
A numerical vector. |
fact |
A character or factor vector defining the grouping for side-by-side box plots. |
horiz |
If |
width |
The width of the boxes in the plot. Value between |
lwd |
Width of lines used in box and whiskers. |
lcol |
Color of the box, median, and whiskers. |
medianLwd |
Width of the line marking the median. |
pch |
Plotting character of outliers. |
pchCex |
Size of outlier character. |
col |
Color of outliers. |
add |
If |
key |
The order in which to display the side-by-side boxplots. If
locations are specified in |
axes |
Whether to plot the axes. |
xlab |
Label for the x axis. |
ylab |
Label for the y axis. |
xlim |
Limits for the x axis. |
ylim |
Limits for the y axis. |
na.rm |
Indicate whether |
... |
Additional arguments to plot. |
Author(s)
David Diez
See Also
histPlot, dotPlot, densityPlot
Examples
# univariate
boxPlot(email$num_char, ylab = "Number of characters in emails")
# bivariate
boxPlot(email$num_char, email$spam,
xlab = "Spam",
ylab = "Number of characters in emails"
)
# faded outliers
boxPlot(email$num_char, email$spam,
xlab = "Spam",
ylab = "Number of characters in emails",
col = fadeColor("black", 18)
)
# horizontal plots
boxPlot(email$num_char, email$spam,
horiz = TRUE,
xlab = "Spam",
ylab = "Number of characters in emails",
col = fadeColor("black", 18)
)
# bivariate relationships where categorical data have more than 2 levels
boxPlot(email$num_char, email$image,
horiz = TRUE,
xlab = "Number of attached images",
ylab = "Number of characters in emails",
col = fadeColor("black", 18)
)
# key can be used to restrict to only the desired groups
boxPlot(email$num_char, email$image,
horiz = TRUE, key = c(0, 1, 2),
xlab = "Number of attached images (limited to 0, 1, 2)",
ylab = "Number of characters in emails",
col = fadeColor("black", 18)
)
# combine boxPlot and dotPlot
boxPlot(tips$tip, tips$day,
horiz = TRUE, key = c("Tuesday", "Friday")
)
dotPlot(tips$tip, tips$day,
add = TRUE, at = 1:2 + 0.05,
key = c("Tuesday", "Friday")
)
# adding a box
boxPlot(email$num_char[email$spam == 0], xlim = c(0, 3))
boxPlot(email$num_char[email$spam == 1], add = 2, axes = FALSE)
axis(1, at = 1:2, labels = c(0, 1))
boxPlot(email$num_char[email$spam == 0], ylim = c(0, 3), horiz = TRUE)
boxPlot(email$num_char[email$spam == 1], add = 2, horiz = TRUE, axes = FALSE)
axis(2, at = 1:2, labels = c(0, 1))
Axis function substitute
Description
The function buildAxis is built to provide more control of the number of labels on the axis. This function is still under development.
Usage
buildAxis(side, limits, n, nMin = 2, nMax = 10, extend = 2, eps = 10^-12, ...)
Arguments
side |
The side of the plot where to add the axis. |
limits |
Either lower and upper limits on the axis or a dataset. |
n |
The preferred number of axis labels. |
nMin |
The minimum number of axis labels. |
nMax |
The maximum number of axis labels. |
extend |
How far the axis may extend beyond |
eps |
The smallest increment allowed. |
... |
Arguments passed to |
Details
The primary reason behind building this function was to allow a plot to be created with similar features but with different datasets. For instance, if a set of code was written for one dataset and the function axis had been utilized with pre-specified values, the axis may not match the plot of a new set of data. The function buildAxis addresses this problem by allowing the number of axis labels to be specified and controlled.
The axis is built by assigning penalties to a variety of potential axis setups, ranking them based on these penalties and then selecting the axis with the best score.
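The penalty terms that buildAxis actually uses are not spelled out here, so the following is only a sketch of the general idea of scoring candidate axes and keeping the best one. The helper pick_ticks and its penalty formula are hypothetical and not part of the package; candidates are generated with base R's pretty().

```r
# Hypothetical helper, not part of openintro: choose tick marks by penalty.
pick_ticks <- function(limits, n, n_min = 2, n_max = 10) {
  rng <- range(limits, na.rm = TRUE)
  # Candidate setups: ask pretty() for different numbers of intervals
  candidates <- lapply(seq_len(2 * n_max), function(k) pretty(rng, n = k))
  candidates <- unique(candidates)
  # Penalty (an assumption for illustration): distance from the preferred
  # count, plus a large cost for falling outside [n_min, n_max]
  penalty <- vapply(candidates, function(ticks) {
    m <- length(ticks)
    abs(m - n) + 100 * (m < n_min || m > n_max)
  }, numeric(1))
  # Rank by penalty and keep the best-scoring candidate
  candidates[[which.min(penalty)]]
}

ticks <- pick_ticks(c(0, 500), n = 5)
ticks
```

A real implementation would likely also weigh how far the ticks extend beyond the data and how small the increment is, which is what arguments such as extend and eps suggest.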
Value
A vector of the axis plotted.
Author(s)
David Diez
See Also
histPlot, dotPlot, boxPlot, densityPlot
Examples
# ===> 0 <===#
limits <- rnorm(100, 605490, 10)
hist(limits, axes = FALSE)
buildAxis(1, limits, 2, nMax = 4)
# ===> 1 <===#
x <- seq(0, 500, 10)
y <- 8 * x + rnorm(length(x), mean = 6000, sd = 200)
plot(x, y, axes = FALSE)
buildAxis(1, limits = x, n = 5)
buildAxis(2, limits = y, n = 3)
# ===> 2 <===#
x <- 9528412 + seq(0, 200, 10)
y <- 8 * x + rnorm(length(x), mean = 6000, sd = 200)
plot(x, y, axes = FALSE)
temp <- buildAxis(1, limits = x, n = 4)
buildAxis(2, y, 3)
# ===> 3 <===#
x <- seq(367, 1251, 10)
y <- 7.5 * x + rnorm(length(x), mean = 6000, sd = 800)
plot(x, y, axes = FALSE)
buildAxis(1, limits = x, n = 4, nMin = 3, nMax = 3)
buildAxis(2, limits = y, n = 4, nMin = 3, nMax = 5)
# ===> 4 <===#
x <- seq(367, 367.1, 0.001)
y <- 7.5 * x + rnorm(length(x), mean = 6000, sd = 0.01)
plot(x, y, axes = FALSE)
buildAxis(1, limits = x, n = 4, nMin = 5, nMax = 6)
buildAxis(2, limits = y, n = 2, nMin = 3, nMax = 4)
# ===> 5 <===#
x <- seq(-0.05, -0.003, 0.0001)
y <- 50 + 20 * x + rnorm(length(x), sd = 0.1)
plot(x, y, axes = FALSE)
buildAxis(1, limits = x, n = 4, nMin = 5, nMax = 6)
buildAxis(2, limits = y, n = 4, nMax = 5)
abline(lm(y ~ x))
# ===> 6 <===#
x <- seq(-0.0097, -0.008, 0.0001)
y <- 50 + 20 * x + rnorm(length(x), sd = 0.1)
plot(x, y, axes = FALSE)
buildAxis(1, limits = x, n = 4, nMin = 2, nMax = 5)
buildAxis(2, limits = y, n = 4, nMax = 5)
abline(lm(y ~ x))
# ===> 7 <===#
x <- seq(0.03, -0.003099, -0.00001)
y <- 50 + 20 * x + rnorm(length(x), sd = 0.1)
plot(x, y, axes = FALSE)
buildAxis(1, limits = x, n = 4, nMin = 2, nMax = 5)
buildAxis(2, limits = y, n = 4, nMax = 6)
abline(lm(y ~ x))
# ===> 8 - repeat <===#
m <- runif(1) / runif(1) +
rgamma(1, runif(1) / runif(1), runif(1) / runif(1))
s <- rgamma(1, runif(1) / runif(1), runif(1) / runif(1))
x <- rnorm(50, m, s)
hist(x, axes = FALSE)
buildAxis(1, limits = x, n = 5, nMin = 4, nMax = 6, eps = 10^-12)
if (diff(range(x)) < 10^-12) {
cat("too small\n")
}
Burger preferences
Description
Sample burger place preferences versus gender.
Usage
burger
Format
A data frame with 500 observations on the following 2 variables.
- best_burger_place
Burger place.
- gender
a factor with levels Female and Male
Source
SurveyUSA, Results of SurveyUSA News Poll #17718, data collected on December 2, 2010.
Examples
table(burger)
Calculate hit streaks
Description
Calculate hit streaks
Usage
calc_streak(x)
Arguments
x |
A character vector of hits ( |
Value
A data frame with one column, length, containing the length of each hit streak.
Examples
data(kobe_basket)
calc_streak(kobe_basket$shot)
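To make the returned value concrete, here is a small sketch of one way to compute streak lengths. The function streak_lengths is a hypothetical re-implementation for illustration, not the package source. It assumes hits are coded as "H" and that a miss following another miss counts as a streak of length 0.

```r
# Hypothetical re-implementation for illustration; not the package source.
streak_lengths <- function(x) {
  hit <- as.integer(x == "H")      # 1 for a hit, 0 for anything else
  padded <- c(0L, hit, 0L)         # pad so leading/trailing streaks are closed
  breaks <- which(padded == 0L)    # positions of misses (and the two pads)
  # Hits between consecutive breaks form one streak (possibly of length 0)
  data.frame(length = diff(breaks) - 1L)
}

shots <- c("H", "M", "M", "H", "H", "M", "H", "H", "H")
streak_lengths(shots)$length
# 1 0 2 3
```

Counting zero-length streaks keeps the number of streaks tied to the number of misses, which matters when comparing the streak distribution against simulated independent shots.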
Cancer in dogs
Description
A study in 1994 examined 491 dogs that had developed cancer and 945 dogs as a control group to determine whether there is an increased risk of cancer in dogs that are exposed to the herbicide 2,4-Dichlorophenoxyacetic acid (2,4-D).
Usage
cancer_in_dogs
Format
A data frame with 1436 observations on the following 2 variables.
- order
a factor with levels 2,4-D and no 2,4-D
- response
a factor with levels cancer and no cancer
Source
Hayes HM, Tarone RE, Cantor KP, Jessen CR, McCurnin DM, and Richardson RC. 1991. Case-Control Study of Canine Malignant Lymphoma: Positive Association With Dog Owner's Use of 2,4-Dichlorophenoxyacetic Acid Herbicides. Journal of the National Cancer Institute 83(17):1226-1231.
Examples
table(cancer_in_dogs)
Deck of cards
Description
All the cards in a standard deck.
Usage
cards
Format
A data frame with 52 observations on the following 4 variables.
- value
a factor with levels 10, 2, 3, 4, 5, 6, 7, 8, 9, A, J, K, Q
- color
a factor with levels black, red
- suit
a factor with levels Club, Diamond, Heart, Spade
- face
a logical vector
Examples
table(cards$value)
table(cards$color)
table(cards$suit)
table(cards$face)
table(cards$suit, cards$face)
cars04
Description
A data frame with 428 rows and 19 columns. This is a record of characteristics on all of the new models of cars for sale in the US in the year 2004.
Usage
cars04
Format
A data frame with 428 observations on the following 19 variables.
- name
The name of the vehicle including manufacturer and model.
- sports_car
Logical variable indicating if the vehicle is a sports car.
- suv
Logical variable indicating if the vehicle is an SUV.
- wagon
Logical variable indicating if the vehicle is a wagon.
- minivan
Logical variable indicating if the vehicle is a minivan.
- pickup
Logical variable indicating if the vehicle is a pickup.
- all_wheel
Logical variable indicating if the vehicle is all-wheel drive.
- rear_wheel
Logical variable indicating if the vehicle is rear-wheel drive.
- msrp
Manufacturer suggested retail price of the vehicle.
- dealer_cost
Amount of money the dealer paid for the vehicle.
- eng_size
Displacement of the engine - the total volume of all the cylinders, measured in liters.
- ncyl
Number of cylinders in the engine.
- horsepwr
Amount of horsepower produced by the engine.
- city_mpg
Gas mileage for city driving, measured in miles per gallon.
- hwy_mpg
Gas mileage for highway driving, measured in miles per gallon.
- weight
Total weight of the vehicle, measured in pounds.
- wheel_base
Distance between the center of the front wheels and the center of the rear wheels, measured in inches.
- length
Total length of the vehicle, measured in inches.
- width
Total width of the vehicle, measured in inches.
Examples
library(ggplot2)
# Highway gas mileage
ggplot(cars04, aes(x = hwy_mpg)) +
  geom_histogram(
    bins = 15, color = "white",
    fill = openintro::IMSCOL["green", "full"]
  ) +
  theme_minimal() +
  labs(
    title = "Highway gas mileage for cars from 2004",
    x = "Gas Mileage (miles per gallon)",
    y = "Number of cars"
  )
cars93
Description
A data frame with 54 rows and 6 columns. This data is a subset of the Cars93 dataset from the MASS package.
Usage
cars93
Format
A data frame with 54 observations on the following 6 variables.
- type
The vehicle type with levels large, midsize, and small.
- price
Vehicle price (USD).
- mpg_city
Vehicle mileage in city (miles per gallon).
- drive_train
Vehicle drive train with levels 4WD, front, and rear.
- passengers
The vehicle passenger capacity.
- weight
Vehicle weight (lbs).
Details
These cars represent a random sample for 1993 models that were in both Consumer Reports and PACE Buying Guide. Only vehicles of type small, midsize, and large were included.
Further description can be found in Lock (1993). Use the URL http://jse.amstat.org/v1n1/datasets.lock.html.
Source
Lock, R. H. (1993) 1993 New Car Data. Journal of Statistics Education 1(1).
Examples
library(ggplot2)
# Vehicle price by type
ggplot(cars93, aes(x = price)) +
geom_histogram(binwidth = 5) +
facet_wrap(~type)
# Vehicle price vs. weight
ggplot(cars93, aes(x = weight, y = price)) +
geom_point()
# Mileage vs. weight
ggplot(cars93, aes(x = weight, y = mpg_city)) +
geom_point() +
geom_smooth()
Community college housing (simulated data)
Description
These are simulated data intended to represent housing prices for students at a community college.
Usage
cchousing
Format
A data frame with 75 observations on the following variable.
- price
Monthly housing price, simulated.
Examples
hist(cchousing$price)
Random sample of 2000 U.S. Census Data
Description
A random sample of 500 observations from the 2000 U.S. Census Data.
Usage
census
Format
A data frame with 500 observations on the following 8 variables.
- census_year
Census Year.
- state_fips_code
Name of state.
- total_family_income
Total family income (in U.S. dollars).
- age
Age.
- sex
Sex with levels Female and Male.
- race_general
Race with levels American Indian or Alaska Native, Black, Chinese, Japanese, Other Asian or Pacific Islander, Two major races, White and Other.
- marital_status
Marital status with levels Divorced, Married/spouse absent, Married/spouse present, Never married/single, Separated and Widowed.
- total_personal_income
Total personal income (in U.S. dollars).
Source
https://data.census.gov/cedsci
Examples
library(dplyr)
library(ggplot2)
census |>
filter(total_family_income > 0) |>
ggplot(aes(x = total_family_income)) +
geom_histogram(binwidth = 25000)
Summary information for 31 cherry trees
Description
Researchers wanting to understand the relationship between these variables for black cherry trees collected data from 31 trees in the Allegheny National Forest, Pennsylvania.
Usage
cherry
Format
A data frame with 31 observations on the following 3 variables.
- diam
diameter in inches (at 54 inches above ground)
- height
height in feet
- volume
volume in cubic feet
Source
D.J. Hand. A handbook of small data sets. Chapman & Hall/CRC, 1994.
Examples
library(ggplot2)
library(broom)
ggplot(cherry, aes(x = diam, y = volume)) +
geom_point() +
geom_smooth(method = "lm")
mod <- lm(volume ~ diam + height, cherry)
tidy(mod)
Gender Stereotypes in 5-7 year old Children
Description
Stereotypes are common, but at what age do they start? This study investigates stereotypes in young children aged 5-7 years old. There are four studies reported in the paper, and all four datasets are provided here.
Usage
children_gender_stereo
Format
This data object is more unusual than most. It is a list of 4 data frames. The four data frames correspond to the data used in Studies 1-4 of the referenced paper, and these data frames each have variables (columns) that are among the following:
- subject
Subject ID. Note that Subject 1 in the first data frame (dataset) does not correspond to Subject 1 in the second data frame.
- gender
Gender of the subject.
- age
Age of the subject, in years.
- trait
The trait that the children were making a judgement about, which was either nice or smart.
- target
The age group of the people the children were making judgements about (as being either nice or smart): children or adults.
- stereotype
The proportion of trials where the child picked a target of the child's own gender as matching the trait. For example, suppose we had 18 pictures, where each picture showed 2 men and 2 women (and a different set of people in each photo). Then if we asked a boy to pick the person in each picture who they believed to be really smart, this stereotype variable would report the fraction of pictures where the boy picked a man. When a girl reviews the photos, then this stereotype variable reports the fraction of photos where she picked a woman. That is, this variable differs in meaning depending on the gender of the child. (This variable design is a little confusing, but it is useful when analyzing the data.)
- high_achieve_caution
The proportion of trials where the child said that children of their own gender were high-achieving in school.
- interest
Average score that measured the interest of the child in the game.
- difference
A difference score between the interest of the child in the “smart” game and their interest in the “try-hard” game.
Details
The structure of the data object is a little unusual, so we recommend reviewing the Examples section before starting your analysis.
Thank you to Nicholas Horton for pointing us to this study and the data!
Most of the results in the paper can be reproduced using the data provided here.
Source
Bian L, Leslie SJ, Cimpian A. 2017. "Gender stereotypes about intellectual ability emerge early and influence children's interests". Science 355:6323 (389-391). https://www.science.org/doi/10.1126/science.aah6524.
The original data may be found here.
Examples
# This dataset is a little funny to work with.
# If wanting to review the data for a study, we
# recommend first assigning the corresponding
# data frame to a new variable. For instance,
# below we assign the second study's data to an
# object called `d` (d is for data!).
d <- children_gender_stereo[[2]]
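Because the object is a list of data frames, it can help to summarise every element before picking one. The sketch below uses a small stand-in list with made-up values so that it runs without the package; the same calls should apply to children_gender_stereo itself.

```r
# Stand-in list with the same shape as children_gender_stereo
# (toy values; the real object holds four data frames)
studies <- list(
  data.frame(subject = 1:3, age = c(5, 6, 7)),
  data.frame(subject = 1:2, age = c(6, 7))
)

# Number of rows in each study's data frame
n_rows <- vapply(studies, nrow, integer(1))
n_rows

# Column names available in each study
lapply(studies, names)
```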
Child care hours
Description
The China Health and Nutrition Survey aims to examine the effects of the health, nutrition, and family planning policies and programs implemented by national and local governments.
Usage
china
Format
A data frame with 9788 observations on the following 3 variables.
- gender
a numeric vector
- edu
a numeric vector
- child_care
a numeric vector
Source
UNC Carolina Population Center, China Health and Nutrition Survey, 2006.
Examples
summary(china)
CIA Factbook Details on Countries
Description
Country-level statistics from the US Central Intelligence Agency (CIA).
Usage
cia_factbook
Format
A data frame with 259 observations on the following 11 variables.
- country
Country name.
- area
Land area, in square kilometers. (1 square kilometer is 0.386 square miles.)
- birth_rate
Birth rate, in births per 1,000 people.
- death_rate
Death rate, in deaths per 1,000 people.
- infant_mortality_rate
Infant mortality, in deaths per 1,000 live births.
- internet_users
Total number of internet users.
- life_exp_at_birth
Life expectancy at birth, in years.
- maternal_mortality_rate
Number of female deaths per 100,000 live births where the death is related to pregnancy or birth.
- net_migration_rate
Net migration rate.
- population
Total population.
- population_growth_rate
Population growth rate.
Source
CIA Factbook, Country Comparisons, 2014. https://www.cia.gov/the-world-factbook/references/guide-to-country-comparisons/
Examples
library(dplyr)
library(ggplot2)
cia_factbook_iup <- cia_factbook |>
mutate(internet_users_percent = 100 * internet_users / population)
ggplot(cia_factbook_iup, aes(x = internet_users_percent, y = life_exp_at_birth)) +
geom_point() +
labs(x = "Percentage of internet users", y = "Life expectancy at birth")
Simulated class data
Description
This data is simulated and is meant to represent student scores from three different lectures that were all given the same exam.
Usage
classdata
Format
A data frame with 164 observations on the following 2 variables.
- m1
Represents a first midterm score.
- lecture
Three classes: a, b, and c.
References
OpenIntro Statistics, Chapter 8.
Examples
anova(lm(m1 ~ lecture, classdata))
Cleveland and Sacramento
Description
Data on a sample of 500 people from the Cleveland, OH and Sacramento, CA metro areas.
Usage
cle_sac
Format
A data frame with 500 observations representing people on the following 8 variables.
- year
Year the data was collected.
- state
State where person resides.
- city
City.
- age
Age.
- sex
Sex.
- race
Race.
- marital_status
Marital status.
- personal_income
Personal income.
Examples
library(ggplot2)
ggplot(cle_sac, aes(x = personal_income)) +
geom_histogram(binwidth = 20000) +
facet_wrap(~city)
Temperature Summary Data, Geography Limited
Description
A random set of monitoring locations were taken from NOAA data that had both years of interest (1948 and 2018) as well as data for both summary metrics of interest (dx70 and dx90, which are described below).
Usage
climate70
Format
A data frame with 197 observations on the following 7 variables.
- station
Station ID.
- latitude
Latitude of the station.
- longitude
Longitude of the station.
- dx70_1948
Number of days above 70 degrees in 1948.
- dx70_2018
Number of days above 70 degrees in 2018.
- dx90_1948
Number of days above 90 degrees in 1948.
- dx90_2018
Number of days above 90 degrees in 2018.
Details
Please keep in mind that these are two annual snapshots, and a complete analysis would consider much more than two years of data and much additional information for those years.
Source
https://www.ncdc.noaa.gov/cdo-web, retrieved 2019-04-24.
Examples
# Data sampled are from the US, Europe, and Australia.
# This geographic limitation may be due to the particular
# years considered, since locations without both 1948 and
# 2018 were discarded for this (simple) dataset.
plot(climate70$longitude, climate70$latitude)
plot(climate70$dx70_1948, climate70$dx70_2018)
abline(0, 1, lty = 2)
plot(climate70$dx90_1948, climate70$dx90_2018)
abline(0, 1, lty = 2)
hist(climate70$dx70_2018 - climate70$dx70_1948)
hist(climate70$dx90_2018 - climate70$dx90_1948)
t.test(climate70$dx70_2018 - climate70$dx70_1948)
t.test(climate70$dx90_2018 - climate70$dx90_1948)
Climber Drugs Data.
Description
Anonymous data was collected from urine samples at huts along the climb of Mont Blanc. Several types of drugs were tested, and proportions were reported.
Usage
climber_drugs
Format
A data frame with 211 rows and 6 variables.
- positive_sample
Identification number of a specific urine sample.
- hut
Location where the sample was taken.
- substance
Substance detected to be present in the urine sample.
- concentration
Amount of substance found measured in ng/ml.
- screening_analysis
Indicates that the concentration was determined by screening analysis.
- concomitant
Indicates that this substance was always detected concomitantly with the previous one, within the same urine sample.
Source
PLOS One - Drug Use on Mont Blanc: A Study Using Automated Urine Collection
Examples
library(dplyr)
# Calculate the average concentration of each substance and number of occurrences.
climber_drugs |>
group_by(substance) |>
summarize(count = n(), mean_con = mean(concentration))
# Proportion of samples in which each substance was detected.
climber_drugs |>
group_by(substance) |>
summarize(prop = n() / 154)
Coast Starlight Amtrak train
Description
Travel times and distances.
Usage
coast_starlight
Format
A data frame with 16 observations on the following 3 variables.
- station
Station.
- dist
Distance.
- travel_time
Travel time.
Examples
library(ggplot2)
ggplot(coast_starlight, aes(x = dist, y = travel_time)) +
geom_point()
comics
Description
A data frame containing information about comic book characters from Marvel Comics and DC Comics.
Usage
comics
Format
A data frame with 21821 observations on the following 11 variables.
- name
Name of the character. May include: Real name, hero or villain name, alias(es) and/or which universe they live in (i.e. Earth-616 in Marvel's multiverse).
- id
Status of the character's identity with levels Secret, Public, No Dual and Unknown.
- align
Character's alignment with levels Good, Bad, Neutral and Reformed Criminals.
- eye
Character's eye color.
- hair
Character's hair color.
- gender
Character's gender.
- gsm
Character's classification as a gender or sexual minority.
- alive
Is the character dead or alive?
- appearances
Number of comic books the character appears in.
- first_appear
Date of publication for the comic book the character first appeared in.
- publisher
Publisher of the comic with levels Marvel and DC.
Examples
library(ggplot2)
library(dplyr)
# Good v Bad
plot_data <- comics |>
filter(align == "Good" | align == "Bad")
ggplot(plot_data, aes(x = align, fill = align)) +
geom_bar() +
facet_wrap(~publisher) +
scale_fill_manual(values = c(IMSCOL["red", "full"], IMSCOL["blue", "full"])) +
theme_minimal() +
labs(
title = "Is there a balance of power",
x = "",
y = "Number of characters",
fill = ""
)
Generate Contingency Tables for LaTeX
Description
Input a data frame or a table, and the LaTeX output will be returned. Options exist for row and column proportions as well as for showing work.
Usage
contTable(
x,
prop = c("none", "row", "col"),
show = FALSE,
digits = 3,
caption = NULL,
label = NULL
)
Arguments
x |
A data frame (with two columns) or a table. |
prop |
Indicate whether row ("row") or column ("col") proportions should be used. The default is "none". |
show |
If row or column proportions are specified, indicate whether work should be shown. |
digits |
The number of digits after the decimal that should be shown for row or column proportions. |
caption |
A string that contains the table caption. The default value is NULL. |
label |
The LaTeX table label. The default value is NULL. |
Details
The contTable function makes substantial use of the cat function.
Author(s)
David Diez
See Also
email, cars93, possum, mariokart
Examples
data(email)
table(email[, c("spam", "sent_email")])
contTable(email[, c("spam", "sent_email")])
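The row and column proportions that the prop argument refers to can be sketched with base R alone. The snippet below is a minimal illustration using prop.table(), not the internals of contTable(); the small data frame toy is made up for the example.

```r
# Hypothetical two-column data frame (made-up values, for illustration only)
toy <- data.frame(
  spam = c(0, 0, 0, 1, 1, 0, 1, 0),
  sent_email = c(1, 0, 1, 0, 0, 1, 0, 0)
)
tab <- table(toy)

# Row proportions: each row sums to 1
row_prop <- prop.table(tab, margin = 1)

# Column proportions: each column sums to 1
col_prop <- prop.table(tab, margin = 2)

# Round to 3 digits, mirroring the default of the digits argument
round(row_prop, 3)
round(col_prop, 3)
```

Row proportions answer "of the messages in this row, what share falls in each column?", while column proportions condition on the column instead.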
Sample datasets for correlation problems
Description
Simulated data.
Usage
corr_match
Format
A data frame with 121 observations on the following 9 variables.
- x
a numeric vector
- y1
a numeric vector
- y2
a numeric vector
- y3
a numeric vector
- y4
a numeric vector
- y5
a numeric vector
- y6
a numeric vector
- y7
a numeric vector
- y8
a numeric vector
Source
Simulated dataset.
Examples
library(ggplot2)
ggplot(corr_match, aes(x = x, y = y1)) +
geom_point()
cor(corr_match$x, corr_match$y1)
Country ISO information
Description
Country International Organization for Standardization (ISO) information.
Usage
country_iso
Format
A data frame with 249 observations on the following 4 variables.
- country_code
Two-letter ISO country code.
- country_name
Country name.
- year
Year the two-letter ISO country code was assigned.
- top_level_domain
Top-level domain name.
Source
Wikipedia, retrieved 2018-11-18. https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
Examples
country_iso
CPR dataset
Description
These patients were randomly divided into a treatment group where they received a blood thinner or the control group where they did not receive a blood thinner. The outcome variable of interest was whether the patients survived for at least 24 hours.
Usage
cpr
Format
A data frame with 90 observations on the following 2 variables.
- group
a factor with levels control and treatment.
- outcome
a factor with levels died and survived.
Source
Efficacy and safety of thrombolytic therapy after initially unsuccessful cardiopulmonary resuscitation: a prospective clinical trial, by Bottiger et al., The Lancet, 2001.
Examples
table(cpr)
CPUs Released between 2010 and 2020.
Description
Data on computer processors released between 2010 and 2020.
Usage
cpu
Format
A data frame with 875 rows and 12 variables.
- company
Manufacturer of the CPU.
- name
Model name of the processor.
- codename
Name given by manufacturer to all chips with this architecture.
- cores
Number of compute cores per processor.
- threads
The number of threads represents the number of simultaneous calculations that can be ongoing in the processor.
- base_clock
Base speed for the CPU in GHz.
- boost_clock
Single-core max speed for the CPU in GHz.
- socket
Specifies the type of connection to the motherboard.
- process
Size of the process node used in production in nm.
- l3_cache
Size of the level 3 cache on the processor in MB.
- tdp
Thermal design power of the processor.
- released
Date on which the processor was released to the public.
Source
Examples
library(ggplot2)
# CPU base speed
ggplot(cpu, aes(x = company, y = base_clock)) +
geom_boxplot() +
labs(
x = "Company",
y = "Base Clock (GHz)",
title = "CPU base speed"
)
# Process node size vs. boost speed
ggplot(cpu, aes(x = process, y = boost_clock)) +
geom_point() +
labs(
x = "Process node size (nm)",
y = "Boost Clock (GHz)",
title = "Process node size vs. boost speed"
)
College credits.
Description
A simulated dataset of number of credits taken by college students each semester.
Usage
credits
Format
A data frame with 100 observations on the following variable.
- credits
Number of credits.
Source
Simulated data.
Examples
library(ggplot2)
ggplot(credits, aes(x = credits)) +
geom_histogram(binwidth = 1)
Daycare fines
Description
Researchers tested the deterrence hypothesis, which predicts that introducing a penalty will reduce the occurrence of the penalized behavior, provided the penalty leaves everything else unchanged. They did so by instituting a fine for late pickup at daycare centers. For this study, they worked with 10 volunteer daycare centers that did not originally impose a fine on parents for picking up their kids late. They randomly selected 6 of these daycare centers and instituted a monetary fine (of a considerable amount) for picking up children late, and later removed it. In the remaining 4 daycare centers no fine was introduced. The study period was divided into four: before the fine (weeks 1–4), the first 4 weeks with the fine (weeks 5–8), the entire period with the fine (weeks 5–16), and the after fine period (weeks 17–20). Throughout the study, the number of kids who were picked up late was recorded each week for each daycare. The study found that the number of late-coming parents increased significantly when the fine was introduced, and no reduction occurred after the fine was removed.
Usage
daycare_fines
Format
A data frame with 200 observations on the following 7 variables.
- center
Daycare center id.
- group
Study group: test (fine instituted) or control (no fine).
- children
Number of children at daycare center.
- week
Week of study.
- late_pickups
Number of late pickups for a given week and daycare center.
- study_period_4
Period of study, divided into 4 periods: before fine, first 4 weeks with fine, last 8 weeks with fine, after fine.
- study_period_3
Period of study, divided into 3 periods: before fine, with fine, after fine.
Source
Gneezy, Uri, and Aldo Rustichini. "A fine is a price." The Journal of Legal Studies 29, no. 1 (2000): 1-17.
Examples
library(dplyr)
library(tidyr)
library(ggplot2)
# The following tables roughly match results presented in Table 2 of the source article
# The results are only off by rounding for some of the weeks
daycare_fines |>
group_by(center, study_period_4) |>
summarise(avg_late_pickups = mean(late_pickups), .groups = "drop") |>
pivot_wider(names_from = study_period_4, values_from = avg_late_pickups)
daycare_fines |>
group_by(center, study_period_3) |>
summarise(avg_late_pickups = mean(late_pickups), .groups = "drop") |>
pivot_wider(names_from = study_period_3, values_from = avg_late_pickups)
# The following plot matches Figure 1 of the source article
daycare_fines |>
group_by(week, group) |>
summarise(avg_late_pickups = mean(late_pickups), .groups = "drop") |>
ggplot(aes(x = week, y = avg_late_pickups, group = group, color = group)) +
geom_point() +
geom_line()
Density plot
Description
Compute kernel density plots, written in the same structure as boxPlot. Histograms can be automatically added for teaching purposes.
Usage
densityPlot(
x,
fact = NULL,
bw = "nrd0",
histo = c("none", "faded", "hollow"),
breaks = "Sturges",
fading = "0E",
fadingBorder = "25",
lty = NULL,
lwd = 1,
col = c("black", "red", "blue"),
key = NULL,
add = FALSE,
adjust = 1,
kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight",
"cosine", "optcosine"),
weights = NULL,
n = 512,
from,
to,
na.rm = FALSE,
xlim = NULL,
ylim = NULL,
main = "",
...
)
Arguments
x |
A numerical vector. |
fact |
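A character or factor vector defining the grouping for data in x. |
bw |
Bandwidth. See density. |
histo |
Whether to plot a faded histogram ("faded") or hollow histogram ("hollow") in the background. By default, no histogram is plotted. |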
breaks |
The |
fading |
Character value of hexadecimal, e.g. |
fadingBorder |
Character value of hexadecimal, e.g. |
lty |
Numerical vector describing the line type for the density curve(s). Each element corresponds to a different level of the argument fact. |
lwd |
Numerical vector describing the line width for the density curve(s). Each element corresponds to a different level of the argument fact. |
col |
Numerical vector describing the line color for the density curve(s). Each element corresponds to a different level of the argument fact. |
key |
An argument to specify ordering of the factor levels. |
add |
If |
adjust |
Argument passed to density. |
kernel |
Argument passed to density. |
weights |
Argument passed to density. |
n |
Argument passed to density. |
from |
Argument passed to density. |
to |
Argument passed to density. |
na.rm |
Argument passed to density. |
xlim |
x-axis limits. |
ylim |
y-axis limits. |
main |
Title for the plot. |
... |
If |
Author(s)
David Diez
See Also
Examples
# hollow histograms
histPlot(tips$tip[tips$day == "Tuesday"],
hollow = TRUE, xlim = c(0, 30),
lty = 1, main = "Tips by day"
)
histPlot(tips$tip[tips$day == "Friday"],
hollow = TRUE, border = "red",
add = TRUE, main = "Tips by day"
)
legend("topright",
col = c("black", "red"),
lty = 1:2, legend = c("Tuesday", "Friday")
)
# density plots
densityPlot(tips$tip, tips$day,
col = c("black", "red"), main = "Tips by day"
)
legend("topright",
col = c("black", "red"),
lty = 1:2, legend = c("Tuesday", "Friday")
)
densityPlot(tips$tip,
histo = "faded",
breaks = 15, main = "Tips by day"
)
densityPlot(tips$tip,
histo = "hollow",
breaks = 30, fadingBorder = "66",
lty = 1, main = "Tips by day"
)
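The idea behind overlaying a kernel density estimate on a histogram can be sketched in base graphics. The snippet below is a minimal sketch that calls density() and hist() directly rather than densityPlot(); the data in x are simulated for the illustration, and the styling choices are assumptions.

```r
# Simulated data (made up for illustration)
set.seed(1)
x <- rexp(200, rate = 1 / 5)

# Kernel density estimate, using the same defaults listed in Usage above
d <- density(x, bw = "nrd0", kernel = "gaussian", n = 512)

# Histogram computed without plotting, so both layers can share a y-axis
h <- hist(x, breaks = "Sturges", plot = FALSE)

# Draw the histogram on the density scale with hollow bars,
# then add the density curve on top
plot(h,
  freq = FALSE, border = "grey60", col = NA,
  ylim = c(0, max(h$density, d$y)),
  main = "Density with hollow histogram", xlab = "x"
)
lines(d, lwd = 2)
```

Plotting the histogram with freq = FALSE puts the bars on the density scale, so the bar areas sum to 1 and are directly comparable with the curve.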
Type 2 Diabetes Clinical Trial for Patients 10-17 Years Old
Description
Three treatments were compared to test their relative efficacy (effectiveness) in treating Type 2 Diabetes in patients aged 10-17 who were being treated with metformin. The primary outcome was lack of glycemic control (or not); lacking glycemic control means the patient still needed insulin, which is not the preferred outcome for a patient.
Usage
diabetes2
Format
A data frame with 699 observations on the following 2 variables.
- treatment
The treatment the patient received.
- outcome
Whether the patient still needs insulin (failure) or met a basic positive outcome bar (success).
Details
Each of the 699 patients in the experiment was randomized to one of the following treatments: (1) continued treatment with metformin (coded as met), (2) metformin combined with rosiglitazone (coded as rosi), or (3) a lifestyle-intervention program (coded as lifestyle).
Source
Zeitler P, et al. 2012. A Clinical Trial to Maintain Glycemic Control in Youth with Type 2 Diabetes. N Engl J Med.
Examples
lapply(diabetes2, table)
(cont.table <- table(diabetes2))
(m <- chisq.test(cont.table))
m$expected
Create a Double Line Segment Plot
Description
Create a plot showing two line segments. The union or intersection of those line segments can also be generated by utilizing the type argument.
Usage
dlsegments(
x1 = c(3, 7),
x2 = c(5, 9),
l = c("o", "o"),
r = c("c", "c"),
type = c("n", "u", "i"),
COL = 2,
lwd = 2.224,
ylim = c(-0.35, 2),
mar = rep(0, 4),
hideOrig = FALSE
)
Arguments
x1 |
The endpoints of the first interval. Values larger (smaller) than 999 (-999) will be interpreted as (negative) infinity. |
x2 |
The endpoints of the second interval. Values larger (smaller) than 999 (-999) will be interpreted as (negative) infinity. |
l |
A vector of length 2, where the values correspond to the left end
point of each interval. A value of |
r |
A vector of length 2, where the values correspond to the right end
point of each interval. A value of |
type |
By default, no intersection or union of the two lines will be
shown (value of |
COL |
If the union or intersection is to be shown (see the |
lwd |
If the union or intersection is to be shown (see the |
ylim |
A vector of length 2 specifying the vertical plotting limits,
which may be useful for fine-tuning plots. The default is |
mar |
A vector of length 4 that represents the plotting margins. |
hideOrig |
An optional argument that to specify that the two line
segments should be shown ( |
Author(s)
David Diez
See Also
Examples
dlsegments(c(-3, 3), c(1, 1000),
r = c("o", "o"), l = c("c", "o"), COL = COL[4]
)
dlsegments(c(-3, 3), c(1, 1000),
r = c("o", "o"), l = c("c", "o"), type = "un", COL = COL[4]
)
dlsegments(c(-3, 3), c(1, 1000),
r = c("o", "o"), l = c("c", "o"), type = "in", COL = COL[4]
)
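The union and intersection that the type argument refers to can be computed directly for two intervals. The helpers below (interval_intersection, interval_union, to_inf) are hypothetical names written for this sketch and are not part of openintro; open versus closed endpoints are ignored for simplicity.

```r
# Hypothetical helper: intersection of two intervals given as c(lower, upper).
# Returns NULL when the intervals do not overlap.
interval_intersection <- function(x1, x2) {
  lower <- max(x1[1], x2[1])
  upper <- min(x1[2], x2[2])
  if (lower > upper) NULL else c(lower, upper)
}

# Hypothetical helper: union of two intervals, returned as a list of intervals.
# Overlapping intervals merge into one; disjoint intervals stay separate.
interval_union <- function(x1, x2) {
  if (is.null(interval_intersection(x1, x2))) {
    list(x1, x2)
  } else {
    list(c(min(x1[1], x2[1]), max(x1[2], x2[2])))
  }
}

# Values beyond 999 in absolute value stand in for infinity,
# following the convention described for x1 and x2
to_inf <- function(x) ifelse(abs(x) > 999, sign(x) * Inf, x)

x1 <- to_inf(c(-3, 3))
x2 <- to_inf(c(1, 1000))

interval_intersection(x1, x2) # the overlap, from 1 to 3
interval_union(x1, x2) # a single interval from -3 to Inf
```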
Dot plot
Description
Plot observations as dots.
Usage
dotPlot(
x,
fact = NULL,
vertical = FALSE,
at = 1,
key = NULL,
pch = 20,
col = fadeColor("black", "66"),
cex = 1.5,
add = FALSE,
axes = TRUE,
xlim = NULL,
ylim = NULL,
...
)
Arguments
x |
A numerical vector. |
fact |
A character or factor vector defining the grouping for data in x. |
vertical |
If |
at |
The vertical coordinate of the points, or the horizontal
coordinate if |
key |
The factor levels corresponding to |
pch |
Plotting character. If |
col |
Plotting character color. If |
cex |
Plotting character size. If |
add |
If |
axes |
If |
xlim |
Limits for the x axis. |
ylim |
Limits for the y axis. |
... |
Additional arguments to be passed to |
Author(s)
David Diez
See Also
histPlot, densityPlot, boxPlot
Examples
library(dplyr)
# Price by type
dotPlot(cars93$price,
cars93$type,
key = c("large", "midsize", "small"),
cex = 1:3
)
# Hours worked by educational attainment or degree
gss2010_nona <- gss2010 |>
filter(!is.na(hrs1) & !is.na(degree))
dotPlot(gss2010_nona$hrs1,
gss2010_nona$degree,
col = fadeColor("black", "11")
)
# levels reordered
dotPlot(gss2010_nona$hrs1,
gss2010_nona$degree,
col = fadeColor("black", "11"),
key = c("LT HIGH SCHOOL", "HIGH SCHOOL", "BACHELOR", "JUNIOR COLLEGE", "GRADUATE")
)
# with boxPlot() overlaid
dotPlot(mariokart$total_pr,
mariokart$cond,
ylim = c(0.5, 2.5), xlim = c(25, 80), cex = 1
)
boxPlot(mariokart$total_pr,
mariokart$cond,
add = 1:2 + 0.1,
key = c("new", "used"), horiz = TRUE, axes = FALSE
)
Add a Stacked Dot Plot to an Existing Plot
Description
Add a stacked dot plot to an existing plot. The locations for the points in the dot plot are returned from the function in a list.
Usage
dotPlotStack(x, radius = 1, seed = 1, addDots = TRUE, ...)
Arguments
x |
A vector of numerical observations for the dot plot. |
radius |
The approximate distance that should separate each point. |
seed |
A random seed (integer). Different values will produce different variations. |
addDots |
Indicate whether the points should be added to the plot. |
... |
Additional arguments are passed to
|
Value
Returns a list with a height that can be used as the upper bound of ylim for a plot, then also the x- and y-coordinates of the points in the stacked dot plot.
Author(s)
David Diez
See Also
Examples
#
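The stacking idea can be sketched in base R as follows. This is a simplified stand-in, not the algorithm used by dotPlotStack(): it stacks only exact ties and ignores the radius and seed arguments. The data are simulated for the illustration.

```r
# Made-up observations, rounded so that ties stack visibly
set.seed(1)
x <- round(rnorm(40, mean = 10, sd = 2))

# Stack height for each point: its running count among equal values
y <- ave(x, x, FUN = seq_along)

# Draw the stacked dots; the tallest stack gives a natural upper bound for ylim
plot(x, y,
  pch = 20, ylim = c(0, max(y) + 1),
  xlab = "x", ylab = "Count", axes = FALSE
)
axis(1)
```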
Survey on views of the DREAM Act
Description
A SurveyUSA poll.
Usage
dream
Format
A data frame with 910 observations on the following 2 variables.
- ideology
a factor with levels Conservative, Liberal, Moderate
- stance
a factor with levels No, Not sure, Yes
Source
SurveyUSA, News Poll #18927, data collected Jan 27-29, 2012.
Examples
table(dream)
Quadcopter Drone Blades
Description
A quality control dataset for quadcopter drone blades. These data were made up for an example.
Usage
drone_blades
Format
A data frame with 2000 observations on the following 2 variables.
- supplier
The supplier for the blade.
- inspection
The inspection conclusion.
References
OpenIntro Statistics, Third Edition and Fourth Edition.
Examples
library(dplyr)
drone_blades |>
count(supplier, inspection)
Drug use of students and parents
Description
Summary of 445 student-parent pairs.
Usage
drug_use
Format
A data frame with 445 observations on the following 2 variables.
- student
a factor with levels not, uses
- parents
a factor with levels not, used
Source
Ellis GJ and Stone LH. 1979. Marijuana Use in College: An Evaluation of a Modeling Explanation. Youth and Society 10:323-334.
Examples
table(drug_use)
Sale prices of houses in Duke Forest, Durham, NC
Description
Data on houses that were recently sold in the Duke Forest neighborhood of Durham, NC in November 2020.
Usage
duke_forest
Format
A data frame with 98 rows and 13 variables.
- address
Address of house.
- price
Sale price, in USD.
- bed
Number of bedrooms.
- bath
Number of bathrooms.
- area
Area of home, in square feet.
- type
Type of home (all are Single Family).
- year_built
Year the home was built.
- heating
Heating system.
- cooling
Cooling system (other or central).
- parking
Type of parking available and number of parking spaces.
- lot
Area of the entire property, in acres.
- hoa
If the home belongs to a Home Owners Association, the associated fee (NA otherwise).
- url
URL of the listing.
Source
Data were collected from Zillow in November 2020.
Examples
library(ggplot2)
# Number of bedrooms and price
ggplot(duke_forest, aes(x = as.factor(bed), y = price)) +
geom_boxplot() +
labs(
x = "Number of bedrooms",
y = "Sale price (USD)",
title = "Homes for sale in Duke Forest, Durham, NC",
subtitle = "Data are from November 2020"
)
# Area and price
ggplot(duke_forest, aes(x = area, y = price)) +
geom_point() +
labs(
x = "Area (square feet)",
y = "Sale price (USD)",
title = "Homes for sale in Duke Forest, Durham, NC",
subtitle = "Data are from November 2020"
)
Earthquakes
Description
A select set of notable earthquakes from 1900 to 1999.
Usage
earthquakes
Format
A data frame with 123 rows and 7 variables.
- year
Year the earthquake took place.
- month
Month the earthquake took place.
- day
Day the earthquake took place.
- richter
Magnitude of earthquake using the Richter Scale.
- area
City or geographic location of earthquakes.
- region
Country or countries if the earthquake occurred on a border.
- deaths
Approximate number of deaths caused by the earthquake.
Source
World Almanac and Book of Facts: 2011.
Examples
library(ggplot2)
ggplot(earthquakes, aes(x = richter, y = deaths)) +
geom_point()
ggplot(earthquakes, aes(x = log(deaths))) +
geom_histogram()
Survey on Ebola quarantine
Description
In New York City on October 23rd, 2014, a doctor who had recently been treating Ebola patients in Guinea went to the hospital with a slight fever and was subsequently diagnosed with Ebola. Soon thereafter, an NBC 4 New York/The Wall Street Journal/Marist Poll asked New Yorkers whether they favored a "mandatory 21-day quarantine for anyone who has come in contact with an Ebola patient". This poll included responses of 1,042 New York adults between October 26th and 28th, 2014.
Usage
ebola_survey
Format
A data frame with 1042 observations on the following variable.
- quarantine
Indicates whether the respondent is in favor or against the mandatory quarantine.
Source
Poll ID NY141026 on maristpoll.marist.edu.
Examples
table(ebola_survey)
Exploratory data analysis plot
Description
Explore different plotting methods using a click interface.
Usage
edaPlot(
dataFrame,
Col = c("#888888", "#FF0000", "#222222", "#FFFFFF", "#CCCCCC", "#3377AA")
)
Arguments
dataFrame |
A data frame. |
Col |
A vector containing six colors. The colors may be given in any form. |
Author(s)
David Diez
See Also
histPlot, densityPlot, boxPlot, dotPlot
Examples
data(mlbbat10)
bat <- mlbbat10[mlbbat10$at_bat > 200, ]
# edaPlot(bat)
data(mariokart)
mk <- mariokart[mariokart$total_pr < 100, ]
# edaPlot(mk)
Elmhurst College gift aid
Description
Gift aid for a random sample of 50 students at Elmhurst College.
Usage
elmhurst
Format
A data frame with 50 observations on the following 3 variables.
- family_income
Family income of the student.
- gift_aid
Gift aid, in $1000s.
- price_paid
Price paid by the student (tuition - gift aid).
Source
These data were sampled from a table of data for all freshmen from the 2011 class at Elmhurst College that accompanied an article titled What Students Really Pay to Go to College published online by The Chronicle of Higher Education: https://www.chronicle.com/article/what-students-really-pay-to-go-to-college/?sra=true.
Examples
library(ggplot2)
library(broom)
ggplot(elmhurst, aes(x = family_income, y = gift_aid)) +
geom_point() +
geom_smooth(method = "lm")
mod <- lm(gift_aid ~ family_income, data = elmhurst)
tidy(mod)
Data frame representing information about a collection of emails
Description
These data represent incoming emails for the first three months of 2012 for an email account (see Source).
Usage
email
Format
The email (email_sent) data frame has 3921 (1252) observations on the following 21 variables.
- spam
Indicator for whether the email was spam.
- to_multiple
Indicator for whether the email was addressed to more than one recipient.
- from
Whether the message was listed as from anyone (this is usually set by default for regular outgoing email).
- cc
Number of people cc'ed.
- sent_email
Indicator for whether the sender had been sent an email in the last 30 days.
- time
Time at which email was sent.
- image
The number of images attached.
- attach
The number of attached files.
- dollar
The number of times a dollar sign or the word “dollar” appeared in the email.
- winner
Indicates whether “winner” appeared in the email.
- inherit
The number of times “inherit” (or an extension, such as “inheritance”) appeared in the email.
- viagra
The number of times “viagra” appeared in the email.
- password
The number of times “password” appeared in the email.
- num_char
The number of characters in the email, in thousands.
- line_breaks
The number of line breaks in the email (does not count text wrapping).
- format
Indicates whether the email was written using HTML (e.g. may have included bolding or active links).
- re_subj
Whether the subject started with “Re:”, “RE:”, “re:”, or “rE:”
- exclaim_subj
Whether there was an exclamation point in the subject.
- urgent_subj
Whether the word “urgent” was in the email subject.
- exclaim_mess
The number of exclamation points in the email message.
- number
Factor variable saying whether there was no number, a small number (under 1 million), or a big number.
Source
David Diez's Gmail Account, early months of 2012. All personally identifiable information has been removed.
See Also
Examples
e <- email
# ______ Variables For Logistic Regression ______#
# Variables are modified to match
# OpenIntro Statistics, Second Edition
# As Is (6): spam, to_multiple, winner, format,
# re_subj, exclaim_subj
# Omitted (7): from, sent_email, time, image,
# viagra, urgent_subj, number
# Become Indicators (5): cc, attach, dollar,
# inherit, password
e$cc <- ifelse(email$cc > 0, 1, 0)
e$attach <- ifelse(email$attach > 0, 1, 0)
e$dollar <- ifelse(email$dollar > 0, 1, 0)
e$inherit <- ifelse(email$inherit > 0, 1, 0)
e$password <- ifelse(email$password > 0, 1, 0)
# Transform (3): num_char, line_breaks, exclaim_mess
# e$num_char <- cut(email$num_char, c(0,1,5,10,20,1000))
# e$line_breaks <- cut(email$line_breaks, c(0,10,100,500,10000))
# e$exclaim_mess <- cut(email$exclaim_mess, c(-1,0,1,5,10000))
g <- glm(
spam ~ to_multiple + winner + format +
re_subj + exclaim_subj +
cc + attach + dollar +
inherit + password, # +
# num_char + line_breaks + exclaim_mess,
data = e, family = binomial
)
summary(g)
# ______ Variable Selection Via AIC ______#
g. <- step(g)
plot(predict(g., type = "response"), e$spam)
# ______ Splitting num_char by html ______#
x <- log(email$num_char)
bw <- 0.4
R <- range(x) + c(-1, 1)
wt <- sum(email$format == 1) / nrow(email)
htmlAll <- density(x, bw = bw, from = R[1], to = R[2])
htmlNo <- density(x[email$format != 1],
bw = bw,
from = R[1], to = R[2]
)
htmlYes <- density(x[email$format == 1],
bw = bw,
from = R[1], to = R[2]
)
htmlNo$y <- htmlNo$y #* (1-wt)
htmlYes$y <- htmlYes$y #* wt + htmlNo$y
plot(htmlAll, xlim = c(-4, 6), ylim = c(0, 0.4))
lines(htmlNo, col = 4)
lines(htmlYes, lwd = 2, col = 2)
Sample of 50 emails
Description
This is a subsample of the email dataset.
Usage
email50
Format
A data frame with 50 observations on the following 21 variables.
- spam
Indicator for whether the email was spam.
- to_multiple
Indicator for whether the email was addressed to more than one recipient.
- from
Whether the message was listed as from anyone (this is usually set by default for regular outgoing email).
- cc
Number of people cc'ed.
- sent_email
Indicator for whether the sender had been sent an email in the last 30 days.
- time
Time at which email was sent.
- image
The number of images attached.
- attach
The number of attached files.
- dollar
The number of times a dollar sign or the word “dollar” appeared in the email.
- winner
Indicates whether “winner” appeared in the email.
- inherit
The number of times “inherit” (or an extension, such as “inheritance”) appeared in the email.
- viagra
The number of times “viagra” appeared in the email.
- password
The number of times “password” appeared in the email.
- num_char
The number of characters in the email, in thousands.
- line_breaks
The number of line breaks in the email (does not count text wrapping).
- format
Indicates whether the email was written using HTML (e.g. may have included bolding or active links).
- re_subj
Whether the subject started with “Re:”, “RE:”, “re:”, or “rE:”
- exclaim_subj
Whether there was an exclamation point in the subject.
- urgent_subj
Whether the word “urgent” was in the email subject.
- exclaim_mess
The number of exclamation points in the email message.
- number
Factor variable saying whether there was no number, a small number (under 1 million), or a big number.
Source
David Diez's Gmail Account, early months of 2012. All personally identifiable information has been removed.
See Also
Examples
index <- c(
101, 105, 116, 162, 194, 211, 263, 308, 361, 374,
375, 465, 509, 513, 571, 691, 785, 842, 966, 968,
1051, 1201, 1251, 1433, 1519, 1727, 1760, 1777, 1899, 1920,
1943, 2013, 2052, 2252, 2515, 2629, 2634, 2710, 2823, 2835,
2944, 3098, 3227, 3360, 3452, 3496, 3530, 3665, 3786, 3877
)
order <- c(
3, 33, 12, 1, 21, 15, 43, 49, 8, 6,
34, 25, 24, 35, 41, 9, 22, 50, 4, 48,
7, 14, 46, 10, 38, 32, 26, 18, 23, 45,
30, 16, 17, 20, 40, 47, 31, 37, 27, 11,
5, 44, 29, 19, 13, 36, 39, 42, 28, 2
)
d <- email[index, ][order, ]
identical(d, email50)
American Adults on Regulation and Renewable Energy
Description
Pew Research conducted a poll to find whether American adults support regulation or believe the private market will move the American economy towards renewable energy.
Usage
env_regulation
Format
A data frame with 705 observations on the following variable.
- statement
There were three possible outcomes for each person: "Regulations necessary", "Private marketplace will ensure", and "Don't know".
Details
The exact statements being selected were: (1) Government regulations are necessary to encourage businesses and consumers to rely more on renewable energy sources. (2) The private marketplace will ensure that businesses and consumers rely more on renewable energy sources, even without government regulations.
The actual sample size was 1012. However, the original data were not from a simple random sample; after accounting for the design, the equivalent sample size was about 705, which was what was used for the dataset here to keep things simpler for intro stat analyses.
Source
Examples
table(env_regulation)
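The consequence of using the equivalent sample size of 705 rather than the actual 1012 can be illustrated with a little arithmetic. The sketch below computes the implied design effect and compares standard errors for a sample proportion; the value p = 0.5 is a placeholder chosen for the illustration, not an estimate from the poll.

```r
n_actual <- 1012 # respondents in the poll
n_effective <- 705 # equivalent simple-random-sample size used in the dataset

# Implied design effect: the factor by which the survey design inflates variance
design_effect <- n_actual / n_effective

# Standard error of a sample proportion under each sample size.
# p = 0.5 is a placeholder value, not an estimate from the poll.
p <- 0.5
se_actual <- sqrt(p * (1 - p) / n_actual)
se_effective <- sqrt(p * (1 - p) / n_effective)

round(c(
  design_effect = design_effect,
  se_actual = se_actual,
  se_effective = se_effective
), 4)
```

Treating the 1012 responses as a simple random sample would understate the uncertainty; working with 705 observations gives the larger, more appropriate standard error while keeping the analysis at an introductory level.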
Vehicle info from the EPA for 2012
Description
Details from the EPA.
Usage
epa2012
Format
A data frame with 1129 observations on the following 28 variables.
- model_yr
a numeric vector
- mfr_name
Manufacturer name.
- division
Vehicle division.
- carline
Vehicle line.
- mfr_code
Manufacturer code.
- model_type_index
Model type index.
- engine_displacement
Engine displacement.
- no_cylinders
Number of cylinders.
- transmission_speed
Transmission speed.
- city_mpg
City mileage.
- hwy_mpg
Highway mileage.
- comb_mpg
Combined mileage.
- guzzler
Whether the car is considered a "guzzler" or not, a factor with levels N and Y.
- air_aspir_method
Air aspiration method.
- air_aspir_method_desc
Air aspiration method description.
- transmission
Transmission type.
- transmission_desc
Transmission type description.
- no_gears
Number of gears.
- trans_lockup
Whether transmission locks up, a factor with levels N and Y.
- trans_creeper_gear
A factor with level N only.
- drive_sys
Drive system, a factor with levels.
- drive_desc
Drive system description.
- fuel_usage
Fuel usage, a factor with levels.
- fuel_usage_desc
Fuel usage description.
- class
Class of car.
- car_truck
Car or truck, a factor with levels car, 1, 2.
- release_date
Date of vehicle release.
- fuel_cell
Whether the car has a fuel cell or not, a factor with levels N, Y.
Source
Fueleconomy.gov, Shared MPG Estimates: Toyota Prius 2012.
See Also
epa2021
Examples
library(ggplot2)
library(dplyr)
# Variable descriptions
distinct(epa2012, air_aspir_method_desc, air_aspir_method)
distinct(epa2012, transmission_desc, transmission)
distinct(epa2012, drive_desc, drive_sys)
distinct(epa2012, fuel_usage_desc, fuel_usage)
# Guzzlers and their mileages
ggplot(epa2012, aes(x = city_mpg, y = hwy_mpg, color = guzzler)) +
geom_point() +
facet_wrap(~guzzler, ncol = 1)
Vehicle info from the EPA for 2021
Description
Details from the EPA.
Usage
epa2021
Format
A data frame with 1108 observations on the following 28 variables.
- model_yr
a numeric vector
- mfr_name
Manufacturer name.
- division
Vehicle division.
- carline
Vehicle line.
- mfr_code
Manufacturer code.
- model_type_index
Model type index.
- engine_displacement
Engine displacement.
- no_cylinders
Number of cylinders.
- transmission_speed
Transmission speed.
- city_mpg
City mileage.
- hwy_mpg
Highway mileage.
- comb_mpg
Combined mileage.
- guzzler
Whether the car is considered a "guzzler" or not, a factor with levels N and Y.
- air_aspir_method
Air aspiration method.
- air_aspir_method_desc
Air aspiration method description.
- transmission
Transmission type.
- transmission_desc
Transmission type description.
- no_gears
Number of gears.
- trans_lockup
Whether transmission locks up, a factor with levels N and Y.
- trans_creeper_gear
A factor with level N only.
- drive_sys
Drive system, a factor with levels.
- drive_desc
Drive system description.
- fuel_usage
Fuel usage, a factor with levels.
- fuel_usage_desc
Fuel usage description.
- class
Class of car.
- car_truck
Car or truck, a factor with levels car, 1, ??, 1.
- release_date
Date of vehicle release.
- fuel_cell
Whether the car has a fuel cell or not, a factor with levels N, NA.
Source
Fuel Economy Data from fueleconomy.gov. Retrieved 6 May, 2021.
See Also
epa2012
Examples
library(ggplot2)
library(dplyr)
# Variable descriptions
distinct(epa2021, air_aspir_method_desc, air_aspir_method)
distinct(epa2021, transmission_desc, transmission)
distinct(epa2021, drive_desc, drive_sys)
distinct(epa2021, fuel_usage_desc, fuel_usage)
# Guzzlers and their mileages
ggplot(epa2021, aes(x = city_mpg, y = hwy_mpg, color = guzzler)) +
geom_point() +
facet_wrap(~guzzler, ncol = 1)
# Compare to 2012
epa2021 |>
bind_rows(epa2012) |>
group_by(model_yr) |>
summarise(
mean_city = mean(city_mpg),
mean_hwy = mean(hwy_mpg)
)
Environmental Sustainability Index 2005
Description
This dataset comes from the 2005 Environmental Sustainability Index: Benchmarking National Environmental Stewardship. Countries are given an overall sustainability score as well as scores in each of several different environmental areas.
Usage
esi
Format
A data frame with 146 observations on the following 29 variables.
- code
ISO3 country code.
- country
Country.
- esi
Environmental Sustainability Index.
- system
ESI core component: systems
- stress
ESI core component: stresses
- vulner
ESI core component: vulnerability
- cap
ESI core component: capacity
- global
ESI core component: global stewardship
- sys_air
Air quality.
- sys_bio
Biodiversity.
- sys_lan
Land.
- sys_wql
Water quality.
- sys_wqn
Water quantity.
- str_air
Reducing air pollution.
- str_eco
Reducing ecosystem stress.
- str_pop
Reducing population pressure.
- str_was
Reducing waste and consumption pressures.
- str_wat
Reducing water stress.
- str_nrm
Natural resource management.
- vul_hea
Environmental health.
- vul_sus
Basic human sustenance.
- vul_dis
Exposure to natural disasters.
- cap_gov
Environmental governance.
- cap_eff
Eco-efficiency.
- cap_pri
Private sector responsiveness.
- cap_st
Science and technology.
- glo_col
Participation in international collaboration efforts.
- glo_ghg
Greenhouse gas emissions.
- glo_tbp
Reducing transboundary environmental pressures.
Details
ESI and Component scores are presented as standard normal percentiles. Indicator scores are in the form of z-scores. See Appendix A of the report for information on the methodology and Appendix C for more detail on original data sources.
For more information on how each of the indices was calculated, see the documentation linked below.
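As a rough illustration of the two scales described above, a z-score can be mapped to a standard normal percentile with the normal CDF. This is a minimal sketch in base R: the z values are made up for illustration and are not taken from esi, and the report's exact procedure (see its Appendix A) may differ.

```r
# Illustrative z-scores (made up, not taken from the esi data)
z <- c(-1.96, 0, 1)

# Standard normal percentile: 100 * P(Z <= z)
percentile <- 100 * pnorm(z)
round(percentile, 1) # 2.5, 50.0, 84.1
```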
Source
ESI Component Indicators. 2005 Environmental Sustainability Index: Benchmarking National Environmental Stewardship, Yale Center for Environmental Law and Policy, Yale University & Center for International Earth Science Information Network (CIESIN), Columbia University
In collaboration with: World Economic Forum, Geneva, Switzerland; Joint Research Centre of the European Commission, Ispra, Italy.
Available at https://www.earth.columbia.edu/news/2005/images/ESI2005_policysummary.pdf.
References
Esty, Daniel C., Marc Levy, Tanja Srebotnjak, and Alexander de Sherbinin (2005). 2005 Environmental Sustainability Index: Benchmarking National Environmental Stewardship. New Haven: Yale Center for Environmental Law and Policy
Examples
library(ggplot2)
ggplot(esi, aes(x = cap_st, y = glo_col)) +
geom_point(color = ifelse(esi$code == "USA", "red", "black")) +
geom_text(
aes(label = ifelse(code == "USA", as.character(code), "")),
hjust = 1.2, color = "red"
) +
labs(x = "Science and technology", y = "Participation in international collaboration efforts")
ggplot(esi, aes(x = vulner, y = cap)) +
geom_point(color = ifelse(esi$code == "USA", "red", "black")) +
geom_text(
aes(label = ifelse(code == "USA", as.character(code), "")),
hjust = 1.2, color = "red"
) +
labs(x = "Vulnerability", y = "Capacity")
Ethanol Treatment for Tumors Experiment
Description
Experiment in which 3 different ethanol treatments were tested on oral cancer tumors in hamsters.
Usage
ethanol
Format
A data frame with 24 observations, each representing one hamster, on the following 2 variables.
- treatment
Treatment the hamster received.
- regress
a factor with levels
no
yes
Details
The ethyl_cellulose and pure_ethanol treatments consisted of about a quarter of the volume of the tumors, while the pure_ethanol_16x treatment was 16x that, so about 4 times the size of the tumors.
Source
Morhard R, et al. 2017. Development of enhanced ethanol ablation as an alternative to surgery in treatment of superficial solid tumors. Scientific Reports 7:8750.
Examples
table(ethanol)
fisher.test(table(ethanol))
Professor evaluations and beauty
Description
The data are gathered from end of semester student evaluations for 463 courses taught by a sample of 94 professors from the University of Texas at Austin. In addition, six students rate the professors' physical appearance. The result is a data frame where each row contains a different course and each column has information on the course and the professor who taught that course.
Usage
evals
Format
A data frame with 463 observations on the following 23 variables.
- course_id
Variable identifying the course (out of 463 courses).
- prof_id
Variable identifying the professor who taught the course (out of 94 professors).
- score
Average professor evaluation score: (1) very unsatisfactory - (5) excellent.
- rank
Rank of professor: teaching, tenure track, tenured.
- ethnicity
Ethnicity of professor: not minority, minority.
- gender
Gender of professor: female, male.
- language
Language of school where professor received education: English or non-English.
- age
Age of professor.
- cls_perc_eval
Percent of students in class who completed evaluation.
- cls_did_eval
Number of students in class who completed evaluation.
- cls_students
Total number of students in class.
- cls_level
Class level: lower, upper.
- cls_profs
Number of professors teaching sections in course in sample: single, multiple.
- cls_credits
Number of credits of class: one credit (lab, PE, etc.), multi credit.
- bty_f1lower
Beauty rating of professor from lower level female: (1) lowest - (10) highest.
- bty_f1upper
Beauty rating of professor from upper level female: (1) lowest - (10) highest.
- bty_f2upper
Beauty rating of professor from second upper level female: (1) lowest - (10) highest.
- bty_m1lower
Beauty rating of professor from lower level male: (1) lowest - (10) highest.
- bty_m1upper
Beauty rating of professor from upper level male: (1) lowest - (10) highest.
- bty_m2upper
Beauty rating of professor from second upper level male: (1) lowest - (10) highest.
- bty_avg
Average beauty rating of professor.
- pic_outfit
Outfit of professor in picture: not formal, formal.
- pic_color
Color of professor's picture: color, black & white.
Source
Daniel S. Hamermesh, Amy Parker, Beauty in the classroom: instructors’ pulchritude and putative pedagogical productivity, Economics of Education Review, Volume 24, Issue 4, 2005. doi:10.1016/j.econedurev.2004.07.013.
Examples
evals
Exam and course grades for statistics students
Description
Grades on three exams and overall course grade for 233 students during several years for a statistics course at a university.
Usage
exam_grades
Format
A data frame with 233 observations, each representing a student.
- semester
Semester when grades were recorded.
- sex
Sex of the student as recorded on the university registration system: Man or Woman.
- exam1
Exam 1 grade.
- exam2
Exam 2 grade.
- exam3
Exam 3 grade.
- course_grade
Overall course grade.
Examples
library(ggplot2)
library(dplyr)
# Course grade vs. each exam
ggplot(exam_grades, aes(x = exam1, y = course_grade)) +
geom_point()
ggplot(exam_grades, aes(x = exam2, y = course_grade)) +
geom_point()
ggplot(exam_grades, aes(x = exam3, y = course_grade)) +
geom_point()
# Semester averages
exam_grades |>
group_by(semester) |>
summarise(across(exam1:course_grade, \(x) mean(x, na.rm = TRUE)))
Exam scores
Description
Exam scores from a class of 19 students.
Usage
exams
Format
A data frame with 19 observations on the following variable.
- scores
a numeric vector
Examples
hist(exams$scores)
Number of Exclusive Relationships
Description
A survey conducted on a reasonably random sample of 203 undergraduates asked, among many other questions, about the number of exclusive relationships these students have been in.
Usage
exclusive_relationship
Format
A data frame with 218 observations on the following variable.
- num
Number of exclusive relationships.
Examples
summary(exclusive_relationship$num)
table(exclusive_relationship$num)
hist(exclusive_relationship$num)
Can Americans categorize facts and opinions?
Description
Pew Research Center conducted a survey in 2018, asking a sample of U.S. adults to categorize five factual and five opinion statements. This dataset provides data from this survey, with information on the age group of the participant as well as the number of factual and opinion statements they classified correctly (out of 5).
Usage
fact_opinion
Format
A data frame with 5,035 rows and 3 variables.
- age_group
Age group of survey participant.
- fact_correct
Number of factual statements classified correctly (out of 5).
- opinion_correct
Number of opinion statements classified correctly (out of 5).
Source
Younger Americans are better than older Americans at telling factual news statements from opinions, Pew Research Center, October 23, 2018.
Examples
library(ggplot2)
library(dplyr)
library(tidyr)
library(forcats)
# Distribution of fact_correct by age group
ggplot(fact_opinion, aes(x = age_group, y = fact_correct)) +
geom_boxplot() +
labs(
x = "Age group",
y = "Number correct (factual)",
title = "Number of factual statements classified correctly by age group"
)
# Distribution of opinion_correct by age group
ggplot(fact_opinion, aes(x = age_group, y = opinion_correct)) +
geom_boxplot() +
labs(
x = "Age group",
y = "Number correct (opinion)",
title = "Number of opinion statements classified correctly by age group"
)
# Replicating the figure from Pew report (see source for link)
fact_opinion |>
mutate(
facts = case_when(
fact_correct <= 2 ~ "Two or fewer",
fact_correct %in% c(3, 4) ~ "Three or four",
fact_correct == 5 ~ "All five"
),
facts = fct_relevel(facts, "Two or fewer", "Three or four", "All five"),
opinions = case_when(
opinion_correct <= 2 ~ "Two or fewer",
opinion_correct %in% c(3, 4) ~ "Three or four",
opinion_correct == 5 ~ "All five"
),
opinions = fct_relevel(opinions, "Two or fewer", "Three or four", "All five")
) |>
select(-fact_correct, -opinion_correct) |>
pivot_longer(cols = -age_group, names_to = "question_type", values_to = "n_correct") |>
ggplot(aes(y = fct_rev(age_group), fill = n_correct)) +
geom_bar(position = "fill") +
facet_wrap(~question_type, ncol = 1) +
scale_fill_viridis_d(guide = guide_legend(reverse = TRUE)) +
labs(
x = "Proportion",
y = "Age group",
fill = "Number of\ncorrect\nclassifications"
)
Fade colors
Description
Fade colors so they are transparent.
Usage
fadeColor(col, fade = "FF")
Arguments
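col | An integer, color name, or RGB hexadecimal. |
fade | The amount to fade col, as a two-character hexadecimal string; lower values are more transparent, and the default "FF" applies no fading. |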
Author(s)
David Diez
Examples
data(mariokart)
new <- mariokart$cond == "new"
used <- mariokart$cond == "used"
# ===> color numbers <===#
dotPlot(mariokart$total_pr[new],
ylim = c(0, 3), xlim = c(25, 80), pch = 20,
col = 2, cex = 2, main = "using regular colors"
)
dotPlot(mariokart$total_pr[used], at = 2, add = TRUE, col = 4, pch = 20, cex = 2)
dotPlot(mariokart$total_pr[new],
ylim = c(0, 3), xlim = c(25, 80),
col = fadeColor(2, "22"), pch = 20, cex = 2,
main = "fading the colors first"
)
dotPlot(mariokart$total_pr[used],
at = 2, add = TRUE,
col = fadeColor(4, "22"), pch = 20, cex = 2
)
# ===> color names <===#
dotPlot(mariokart$total_pr[new],
ylim = c(0, 3), xlim = c(25, 80), pch = 20,
col = "red", cex = 2, main = "using regular colors"
)
dotPlot(mariokart$total_pr[used], at = 2, add = TRUE, col = "blue", pch = 20, cex = 2)
dotPlot(mariokart$total_pr[new],
ylim = c(0, 3), xlim = c(25, 80),
col = fadeColor("red", "22"), pch = 20, cex = 2,
main = "fading the colors first"
)
dotPlot(mariokart$total_pr[used],
at = 2, add = TRUE,
col = fadeColor("blue", "22"), pch = 20, cex = 2
)
# ===> hexadecimal <===#
dotPlot(mariokart$total_pr[new],
ylim = c(0, 3), xlim = c(25, 80), pch = 20,
col = "#FF0000", cex = 2, main = "using regular colors"
)
dotPlot(mariokart$total_pr[used],
at = 2, add = TRUE, col = "#0000FF", pch = 20,
cex = 2
)
dotPlot(mariokart$total_pr[new],
ylim = c(0, 3), xlim = c(25, 80),
col = fadeColor("#FF0000", "22"), pch = 20, cex = 2,
main = "fading the colors first"
)
dotPlot(mariokart$total_pr[used],
at = 2, add = TRUE,
col = fadeColor("#0000FF", "22"), pch = 20, cex = 2
)
# ===> alternative: rgb function <===#
dotPlot(mariokart$total_pr[new],
ylim = c(0, 3), xlim = c(25, 80), pch = 20,
col = rgb(1, 0, 0), cex = 2, main = "using regular colors"
)
dotPlot(mariokart$total_pr[used],
at = 2, add = TRUE, col = rgb(0, 0, 1),
pch = 20, cex = 2
)
dotPlot(mariokart$total_pr[new],
ylim = c(0, 3), xlim = c(25, 80),
col = rgb(1, 0, 0, 1 / 8), pch = 20, cex = 2,
main = "fading the colors first"
)
dotPlot(mariokart$total_pr[used],
at = 2, add = TRUE,
col = rgb(0, 0, 1, 1 / 8), pch = 20, cex = 2
)
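The fading in these examples amounts to attaching a two-digit hexadecimal alpha channel to a color. The sketch below shows that idea in base R. `fade_hex` is a hypothetical helper written for illustration only; it is not part of openintro and is not necessarily how fadeColor is implemented.

```r
# Hypothetical helper (not part of openintro): attach a hexadecimal
# alpha channel to a color given by name or hexadecimal string.
fade_hex <- function(col, fade = "FF") {
  stopifnot(is.character(fade), nchar(fade) == 2)
  channels <- grDevices::col2rgb(col) # rows: red, green, blue (0-255)
  grDevices::rgb(
    channels["red", ], channels["green", ], channels["blue", ],
    alpha = strtoi(fade, base = 16L), # e.g. "22" becomes 34
    maxColorValue = 255
  )
}

fade_hex("red", "22") # "#FF000022"
```

The last two hexadecimal digits of the result carry the transparency, which is why a small value such as "22" gives a strongly faded color.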
Simulated sample of parent / teen college attendance
Description
A simulated dataset based on real population summaries.
Usage
family_college
Format
A data frame with 792 observations on the following 2 variables.
- teen
Whether the teen goes to college or not.
- parents
Whether the parent holds a college degree or not.
Source
Simulation based off of summary information provided at https://eric.ed.gov/?id=ED460660.
Examples
library(dplyr)
family_college |>
count(teen, parents)
Nutrition in fast food
Description
Nutrition amounts in 515 fast food items. The author of the data scraped only entrees (not sides, drinks, desserts, etc.).
Usage
fastfood
Format
A data frame with 515 observations on the following 17 variables.
- restaurant
Name of restaurant
- item
Name of item
- calories
Number of calories
- cal_fat
Calories from fat
- total_fat
Total fat
- sat_fat
Saturated fat
- trans_fat
Trans fat
- cholesterol
Cholesterol
- sodium
Sodium
- total_carb
Total carbs
- fiber
Fiber
- sugar
Sugar
- protein
Protein
- vit_a
Vitamin A
- vit_c
Vitamin C
- calcium
Calcium
- salad
Salad or not
Source
Retrieved from Tidy Tuesday Fast food entree data.
Summary of male heights from USDA Food Commodity Intake Database
Description
Sample of heights based on the weighted sample in the survey.
Usage
fcid
Format
A data frame with 100 observations on the following 2 variables.
- height
a numeric vector
- num_of_adults
a numeric vector
Examples
fcid
Female college student heights, in inches
Description
24 sample observations.
Usage
fheights
Format
A data frame with 24 observations on the following variable.
- heights
height, in inches
Examples
hist(fheights$heights)
Young fish in the North Sea.
Description
Samples of 50 Tobis fish, or Sand Eels, were collected at three different locations in the North Sea and the number of one-year-old fish was counted.
Usage
fish_age
Format
A data frame with 300 rows and 3 variables:
- year
Year the fish was caught with levels 1997 and 1998.
- location
Site where the fish was caught, with levels A, B and C.
- one_year_old
Is the fish one-year-old, yes or no?
Source
Henrik Madsen, Paul Thyregod. 2011. Introduction to General and Generalized Linear Models. CRC Press, Boca Raton, FL. ISBN: 978-1-4200-9155-7.
Examples
library(dplyr)
library(tidyr)
# Count the number of one-year-old fish at each location.
fish_age |>
filter(one_year_old == "yes") |>
count(year, location) |>
pivot_wider(names_from = location, values_from = n)
Findings on n-3 Fatty Acid Supplement Health Benefits
Description
The results summarize each of the health outcomes for an experiment where 12,933 subjects received a 1 g fish oil supplement daily and 12,938 received a placebo daily. The experiment's duration was 5 years.
Usage
fish_oil_18
Format
The format is a list of 24 matrices. Each matrix is a 2x2 table, and below are the named items in the list, which also represent the outcomes.
- major_cardio_event
Major cardiovascular event. (Primary end point.)
- cardio_event_expanded
Cardiovascular event in expanded composite endpoint.
- myocardioal_infarction
Total myocardial infarction. (Heart attack.)
- stroke
Total stroke.
- cardio_death
Death from cardiovascular causes.
- PCI
Percutaneous coronary intervention.
- CABG
Coronary artery bypass graft.
- total_coronary_heart_disease
Total coronary heart disease.
- ischemic_stroke
Ischemic stroke.
- hemorrhagic_stroke
Hemorrhagic stroke.
- chd_death
Death from coronary heart disease.
- myocardial_infarction_death
Death from myocardial infarction.
- stroke_death
Death from stroke.
- invasive_cancer
Invasive cancer of any type. (Primary end point.)
- breast_cancer
Breast cancer.
- prostate_cancer
Prostate cancer.
- colorectal_cancer
Colorectal cancer.
- cancer_death
Death from cancer.
- death
Death from any cause.
- major_cardio_event_after_2y
Major cardiovascular event, excluding the first 2 years of follow-up.
- myocardial_infarction_after_2y
Total myocardial infarction, excluding the first 2 years of follow-up.
- invasive_cancer_after_2y
Invasive cancer of any type, excluding the first 2 years of follow-up.
- cancer_death_after_2y
Death from cancer, excluding the first 2 years of follow-up.
- death_after_2y
Death from any cause, excluding the first 2 years of follow-up.
Source
Manson JE, et al. 2018. Marine n-3 Fatty Acids and Prevention of Cardiovascular Disease and Cancer. NEJMoa1811403. doi:10.1056/NEJMoa1811403.
Examples
names(fish_oil_18)
(tab <- fish_oil_18[["major_cardio_event"]])
chisq.test(tab)
fisher.test(tab)
(tab <- fish_oil_18[["myocardioal_infarction"]])
chisq.test(tab)
fisher.test(tab)
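Because every element of fish_oil_18 is a 2x2 table, the same test can be applied to all outcomes in one pass. The sketch below uses a made-up list, `toy_outcomes`, with hypothetical names, labels, and counts in the same shape as the dataset. With the package loaded, passing fish_oil_18 in its place would be the analogous call.

```r
# Hypothetical stand-in for fish_oil_18: a named list of 2x2 count tables.
# Names, labels, and counts are made up for illustration.
make_tab <- function(a, b, c, d) {
  matrix(c(a, b, c, d),
    nrow = 2, byrow = TRUE,
    dimnames = list(
      group = c("treatment", "placebo"),
      outcome = c("event", "no_event")
    )
  )
}
toy_outcomes <- list(
  outcome_a = make_tab(40, 960, 60, 940),
  outcome_b = make_tab(15, 985, 14, 986)
)

# Apply Fisher's exact test to every table and collect the p-values
p_values <- vapply(
  toy_outcomes,
  function(tab) fisher.test(tab)$p.value,
  numeric(1)
)
round(p_values, 3)
```

When many outcomes are tested this way, an adjustment for multiple comparisons (for example with `p.adjust()`) is worth considering before interpreting individual p-values.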
River flow data
Description
Flow rates (measured in cubic feet per second) of Clarks Creek, Leach Creek, Silver Creek, and Wildwood Creek Spring collected by volunteers of the Pierce Conservation District in the State of Washington in the US.
Usage
flow_rates
Format
A data frame with 31 rows and 3 variables.
- site
Location where measurements were taken.
- date
Date measurements were taken.
- flow
Flow rate of the river in cubic feet per second.
Source
Pierce County Water Data Viewer.
Examples
library(ggplot2)
# River flow rates by site
ggplot(flow_rates, aes(x = site, y = flow)) +
geom_boxplot() +
labs(
title = "River flow rates by site",
x = "Site",
y = expression(paste("Flow (ft"^3 * "/s)"))
)
# River flow rates over time
ggplot(flow_rates, aes(x = date, y = flow, color = site, shape = site)) +
geom_point(size = 2) +
labs(
title = "River flow rates over time",
x = "Date",
y = expression(paste("Flow (ft"^3 * "/s)")),
color = "Site", shape = "Site"
)
Friday the 13th
Description
This dataset addresses issues of how superstitions regarding Friday the 13th affect human behavior, and whether Friday the 13th is an unlucky day. Scanlon, et al. collected data on traffic and shopping patterns and accident frequency for Fridays the 6th and 13th between October of 1989 and November of 1992.
Usage
friday
Format
A data frame with 61 observations and 6 variables.
- type
Type of observation: traffic, shopping, or accident.
- date
Year and month of observation.
- sixth
Counts on the 6th of the month.
- thirteenth
Counts on the 13th of the month.
- diff
Difference between the sixth and the thirteenth.
- location
Location where data is collected.
Details
There are three types of observations: traffic, shopping, and accident. For traffic, the researchers obtained information from the British Department of Transport regarding the traffic flows between junctions 7 to 8 and junctions 9 to 10 of the M25 motorway. For shopping, they collected the numbers of shoppers in nine different supermarkets in southeast England. For accidents, they collected numbers of emergency admissions to hospitals due to transport accidents.
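Since each row pairs a count on the 6th with a count on the 13th of the same month, a paired comparison is a natural analysis. The sketch below uses made-up counts rather than values from friday. The vector names mirror the dataset's columns, and the sign convention for the difference (sixth minus thirteenth) is an assumption.

```r
# Made-up paired counts for illustration (not values from friday)
sixth <- c(120, 135, 128, 140, 131)
thirteenth <- c(118, 130, 129, 133, 127)

# Assumed sign convention: sixth minus thirteenth
d <- sixth - thirteenth

# A paired t-test is equivalent to a one-sample t-test on the differences
paired <- t.test(sixth, thirteenth, paired = TRUE)
one_sample <- t.test(d, mu = 0)
all.equal(unname(paired$statistic), unname(one_sample$statistic))
```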
Source
Scanlon, T.J., Luben, R.N., Scanlon, F.L., Singleton, N. (1993), "Is Friday the 13th Bad For Your Health?," BMJ, 307, 1584-1586. https://dasl.datadescription.com/datafile/friday-the-13th-traffic and https://dasl.datadescription.com/datafile/friday-the-13th-accidents.
Examples
library(dplyr)
library(ggplot2)
friday |>
filter(type == "traffic") |>
ggplot(aes(x = sixth)) +
geom_histogram(binwidth = 2000) +
xlim(110000, 140000)
friday |>
filter(type == "traffic") |>
ggplot(aes(x = thirteenth)) +
geom_histogram(binwidth = 2000) +
xlim(110000, 140000)
Poll about use of full-body airport scanners
Description
Poll about use of full-body airport scanners, where about 4-in-5 people supported the use of the scanners.
Usage
full_body_scan
Format
A data frame with 1137 observations on the following 2 variables.
- answer
a factor with levels
do not know / no answer
should
should not
- party.affiliation
a factor with levels
Democrat
Independent
Republican
Source
S. Condon. Poll: 4 in 5 Support Full-Body Airport Scanners. In: CBS News (2010).
Examples
full_body_scan
GDP Countries Data.
Description
From World Bank, GDP in current U.S. dollars 1960-2020 by decade
Usage
gdp_countries
Format
A data frame with 659 rows and 9 variables.
- country
Name of country.
- description
description of data: GDP (in current US$), GDP growth (annual %), GDP per capita (in current US$)
- year_1960
value in 1960
- year_1970
value in 1970
- year_1980
value in 1980
- year_1990
value in 1990
- year_2000
value in 2000
- year_2010
value in 2010
- year_2020
value in 2020
Source
Examples
library(dplyr)
# don't use scientific notation
options(scipen = 999)
# List the top 10 countries by GDP (There is a row for World)
gdp_countries |>
  filter(description == "GDP") |>
  # sort on the numeric values before formatting them as text
  arrange(desc(year_2020)) |>
  slice_head(n = 11) |>
  mutate(year2020 = format(year_2020, big.mark = ",")) |>
  select(country, year2020)
# List the 10 countries with the biggest GDP per capita change from 1960 to 2020
gdp_countries |>
  filter(description == "GDP per capita") |>
  mutate(change = round(year_2020 - year_1960, 0)) |>
  select(country, change, year_1960, year_2020) |>
  na.omit() |>
  # rank by the numeric change, then format it for display
  arrange(desc(change)) |>
  slice_head(n = 10) |>
  mutate(change = format(change, big.mark = ","))
Fake data for a gear company example
Description
Made-up data for whether a sample of two gear companies' parts pass inspection.
Usage
gear_company
Format
A data frame with 2000 observations on the following 2 variables.
- company
a factor with levels
current
prospective
- outcome
a factor with levels
not
pass
Examples
gear_company
Bank manager recommendations based on gender
Description
Study from the 1970s about whether gender influences hiring recommendations.
Usage
gender_discrimination
Format
A data frame with 48 observations on the following 2 variables.
- gender
a factor with levels female and male
- decision
a factor with levels not promoted and promoted
Source
Rosen B and Jerdee T. 1974. Influence of sex role stereotypes on personnel decisions. Journal of Applied Psychology 59(1):9-14.
Examples
library(ggplot2)
table(gender_discrimination)
ggplot(gender_discrimination, aes(y = gender, fill = decision)) +
geom_bar(position = "fill")
Get it Dunn Run, Race Times
Description
Get it Dunn is a small regional run that got extra attention when a runner, Nichole Porath, made the Guinness Book of World Records for the fastest time pushing a double stroller in a half marathon. This dataset contains results from the 2017 and 2018 races.
Usage
get_it_dunn_run
Format
A data frame with 978 observations on the following 10 variables.
- date
Date of the run.
- race
Run distance.
- bib_num
Bib number of the runner.
- first_name
First name of the runner.
- last_initial
Initial of the runner's last name.
- sex
Sex of the runner.
- age
Age of the runner.
- city
City of residence.
- state
State of residence.
- run_time_minutes
Run time, in minutes.
Source
Data were collected from GSE Timing: 2018 data, 2017 race data.
Examples
d <- subset(
get_it_dunn_run,
race == "5k" & date == "2018-05-12" &
!is.na(age) & state %in% c("MN", "WI")
)
head(d)
m <- lm(run_time_minutes ~ sex + age + state, d)
summary(m)
plot(m$fitted, m$residuals)
boxplot(m$residuals ~ d$sex)
plot(m$residuals ~ d$age)
hist(m$residuals)
Analytical skills of young gifted children
Description
An investigator is interested in understanding the relationship, if any, between the analytical skills of young gifted children and the following variables: father's IQ, mother's IQ, age in months when the child first said "mummy" or "daddy", age in months when the child first counted to 10 successfully, average number of hours per week the child's mother or father reads to the child, average number of hours per week the child watched an educational program on TV during the past three months, average number of hours per week the child watched cartoons on TV during the past three months. The analytical skills are evaluated using a standard testing procedure, and the score on this test is used as the response variable.
Usage
gifted
Format
A data frame with 36 observations and 8 variables.
- score
Score in test of analytical skills.
- fatheriq
Father's IQ.
- motheriq
Mother's IQ.
- speak
Age in months when the child first said "mummy" or "daddy".
- count
Age in months when the child first counted to 10 successfully.
- read
Average number of hours per week the child's mother or father reads to the child.
- edutv
Average number of hours per week the child watched an educational program on TV during the past three months.
- cartoons
Average number of hours per week the child watched cartoons on TV during the past three months.
Details
Data were collected from schools in a large city on a set of thirty-six children who were identified as gifted children soon after they reached the age of four.
Source
Graybill, F.A. & Iyer, H.K., (1994) Regression Analysis: Concepts and Applications, Duxbury, p. 511-6.
Examples
gifted
Pew survey on global warming
Description
A 2010 Pew Research poll asked 1,306 Americans, "From what you've read and heard, is there solid evidence that the average temperature on earth has been getting warmer over the past few decades, or not?"
Usage
global_warming_pew
Format
A data frame with 2253 observations on the following 2 variables.
- party_or_ideology
a factor with levels
Conservative Republican
Liberal Democrat
Mod/Cons Democrat
Mod/Lib Republican
- response
Response.
Source
Pew Research Center, Majority of Republicans No Longer See Evidence of Global Warming, data collected on October 27, 2010.
Examples
global_warming_pew
Google stock data
Description
Google stock data from 2006 to early 2014, where data from the first day each month was collected.
Usage
goog
Format
A data frame with 98 observations on the following 7 variables.
- date
a factor with levels 2006-01-03, 2006-02-01, and so on
- open
a numeric vector
- high
a numeric vector
- low
a numeric vector
- close
a numeric vector
- volume
a numeric vector
- adj_close
a numeric vector
Source
Yahoo! Finance.
Examples
goog
Pew Research poll on government approval ratings
Description
The poll's focus is on Obama and then Democrats and Republicans in Congress.
Usage
gov_poll
Format
A data frame with 4223 observations on the following 2 variables.
- poll
a factor with levels
approve
disapprove
- eval
a factor with levels
Democrats
Obama
Republicans
Source
See the Pew Research website: www.people-press.org/2012/03/14/romney-leads-gop-contest-trails-in-matchup-with-obama. The counts in Table 6.19 are approximate.
Examples
gov_poll
Survey of Duke students on GPA, studying, and more
Description
A survey of 55 Duke University students asked about their GPA, number of hours they study at night, number of nights they go out, and their gender.
Usage
gpa
Format
A data frame with 55 observations on the following 5 variables.
- gpa
a numeric vector
- studyweek
a numeric vector
- sleepnight
a numeric vector
- out
a numeric vector
- gender
a factor with levels
female
male
Examples
gpa
Sample of students and their GPA and IQ
Description
Data on 78 students including GPA, IQ, and gender.
Usage
gpa_iq
Format
A data frame with 78 observations representing students on the following 5 variables.
- obs
a numeric vector
- gpa
Grade point average (GPA).
- iq
IQ.
- gender
Gender.
- concept
a numeric vector
Examples
gpa_iq
gpa_study_hours
Description
A data frame with 193 rows and 2 columns. The columns represent the variables gpa and study_hours for a sample of 193 undergraduate students who took an introductory statistics course in 2012 at a private US university.
Usage
gpa_study_hours
Format
A data frame with 193 observations on the following 2 variables.
- gpa
Grade point average (GPA) of student.
- study_hours
Number of hours students study per week.
Details
GPA ranges from 0 to 4 points; however, one student reported a GPA > 4. This is a data error, but the observation has been left in the dataset because it is used to illustrate issues with real survey data. Both variables are self-reported and hence may not be accurate.
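One way to handle an out-of-range value like this is to flag it rather than silently drop it. The sketch below uses a small made-up data frame, `survey`, whose column names mirror gpa_study_hours; the values and the `gpa_valid` column are hypothetical.

```r
# Made-up survey responses for illustration (not values from gpa_study_hours)
survey <- data.frame(
  gpa = c(3.2, 3.9, 4.3, 2.8),
  study_hours = c(10, 25, 12, 8)
)

# Flag GPAs outside the valid 0 to 4 range instead of removing them
survey$gpa_valid <- survey$gpa >= 0 & survey$gpa <= 4

# Inspect the flagged rows before deciding how to treat them
survey[!survey$gpa_valid, ]
```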
Source
Collected at a private US university as part of an anonymous survey in an introductory statistics course.
Examples
library(ggplot2)
ggplot(gpa_study_hours, aes(x = study_hours, y = gpa)) +
geom_point(alpha = 0.5) +
labs(x = "Study hours/week", y = "GPA")
Simulated data for analyzing the relationship between watching TV and grades
Description
This is a simulated dataset to be used to estimate the relationship between number of hours per week students watch TV and the grade they got in a statistics class.
Usage
gradestv
Format
A data frame with 25 observations on the following 2 variables.
- tv
Number of hours per week students watch TV.
- grades
Grades students got in a statistics class (out of 100).
Details
There are a few potential outliers in this dataset. When analyzing the data one should consider how (if at all) these outliers may affect the estimates of correlation coefficient and regression parameters.
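To see how much a single outlier can move these estimates, the correlation and slope can be compared with and without it. The sketch below simulates its own data in base R, so it does not reproduce gradestv; the variable names mirror the dataset's columns, and the added outlier is made up.

```r
set.seed(1) # reproducible simulated data, unrelated to gradestv
tv <- runif(24, min = 0, max = 30)
grades <- 90 - 0.8 * tv + rnorm(24, sd = 4)

# Add one made-up outlier: heavy TV watching but a very high grade
tv_out <- c(tv, 40)
grades_out <- c(grades, 100)

# Correlation without and with the outlier (closer to zero with it)
cor(tv, grades)
cor(tv_out, grades_out)

# Regression slope without and with the outlier
coef(lm(grades ~ tv))["tv"]
coef(lm(grades_out ~ tv_out))["tv_out"]
```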
Source
Simulated data
Examples
library(ggplot2)
ggplot(gradestv, aes(x = tv, y = grades)) +
geom_point() +
geom_smooth(method = "lm")
Simulated Google search experiment
Description
The data were simulated to look like sample results from a Google search experiment.
Usage
gsearch
Format
A data frame with 10000 observations on the following 2 variables.
- type
a factor with levels
new search
no new search
- outcome
a factor with levels
current
test 1
test 2
Examples
library(ggplot2)
table(gsearch$type, gsearch$outcome)
ggplot(gsearch, aes(x = type, fill = outcome)) +
geom_bar(position = "fill") +
labs(y = "proportion")
2010 General Social Survey
Description
Data from the 2010 General Social Survey.
Usage
gss2010
Format
A data frame with 2044 observations on the following 5 variables.
- hrsrelax
After an average work day, about how many hours do you have to relax or pursue activities that you enjoy
- mntlhlth
For how many days during the past 30 days was your mental health, which includes stress, depression, and problems with emotions, not good?
- hrs1
Hours worked each week.
- degree
Educational attainment or degree.
- grass
Do you think the use of marijuana should be made legal, or not?
Source
US 2010 General Social Survey.
Examples
gss2010
gss_wordsum_class
Description
A data frame containing data from the General Social Survey.
Usage
gss_wordsum_class
Format
A data frame with 795 observations on the following 2 variables.
- wordsum
A vocabulary score calculated based on a ten question vocabulary test, where a higher score means better vocabulary. Scores range from 1 to 10.
- class
Self-identified social class has 4 levels: lower, working, middle, and upper class.
Examples
library(dplyr)
gss_wordsum_class |>
group_by(class) |>
summarize(mean_wordsum = mean(wordsum))
Health Coverage and Health Status
Description
Survey responses from 20,000 respondents to the Behavioral Risk Factor Surveillance System.
Usage
health_coverage
Format
A data frame with 20000 observations on the following 2 variables.
- coverage
Whether the person had health coverage or not.
- health_status
The person's health status.
Source
Office of Surveillance, Epidemiology, and Laboratory Services Behavioral Risk Factor Surveillance System, BRFSS 2010 Survey Data.
Examples
table(health_coverage)
Pew Research Center poll on health care, including question variants
Description
Pew Research Center conducted a survey with the following question: "As you may know, by 2014 nearly all Americans will be required to have health insurance. People who do not buy insurance will pay a penalty while people who cannot afford it will receive financial help from the government. Do you approve or disapprove of this policy?" For each randomly sampled respondent, the order of the two statements about the penalty and about financial help was randomized: either they were kept in the order given above, or the two statements were reversed.
Usage
healthcare_law_survey
Format
A data frame with 1503 observations on the following 2 variables.
- order
a factor with levels
cannot_afford_second
penalty_second
- response
a factor with levels
approve
disapprove
other
Source
www.people-press.org/2012/03/26/public-remains-split-on-health-care-bill-opposed-to-mandate/. Sample sizes for each polling group are approximate.
Examples
healthcare_law_survey
Heart Transplant Data
Description
The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was officially designated a heart transplant candidate, meaning that he was gravely ill and would most likely benefit from a new heart. The actual heart transplant then occurred a few weeks to several months later, depending on the availability of a donor. Very few candidates showed improvement during this waiting period and were deselected as heart transplant candidates, but for the purposes of this experiment those patients were kept in the data as continuing candidates.
Usage
heart_transplant
Format
A data frame with 103 observations on the following 8 variables.
- id
ID number of the patient.
- acceptyear
Year of acceptance as a heart transplant candidate.
- age
Age of the patient at the beginning of the study.
- survived
Survival status with levels alive and dead.
- survtime
Number of days patients were alive after the date they were determined to be a candidate for a heart transplant until the termination date of the study.
- prior
Whether or not the patient had prior surgery, with levels yes and no.
- transplant
Transplant status with levels control (did not receive a transplant) and treatment (received a transplant).
- wait
Waiting time for transplant.
Source
http://www.stat.ucla.edu/~jsanchez/data/stanford.txt
References
Turnbull B, Brown B, and Hu M (1974). "Survivorship of heart transplant data." Journal of the American Statistical Association, vol. 69, pp. 74-80.
Examples
library(ggplot2)
ggplot(heart_transplant, aes(x = transplant, y = survtime)) +
geom_boxplot() +
labs(x = "Transplant", y = "Survival time (days)")
ggplot(heart_transplant, aes(x = transplant, fill = survived)) +
geom_bar(position = "fill") +
labs(x = "Transplant", y = "Proportion", fill = "Outcome")
Helium football
Description
At the 1976 Pro Bowl, Ray Guy, a punter for the Oakland Raiders, punted a ball that hung mid-air long enough for officials to question whether the pigskin was filled with helium. The ball was found to be filled with air, but since then many have tossed around the idea that a helium-filled football would outdistance an air-filled one. Students at Ohio State University conducted an experiment to test this myth. They used two identical footballs, one filled with air and one filled with helium. Each football was kicked 39 times and the two footballs were alternated with each kick.
Usage
helium
Format
A data frame with 39 observations on the following 3 variables.
- trial
Trial number.
- air
Distance in yards for air-filled football.
- helium
Distance in yards for helium-filled football.
Details
Lafferty, M. B. (1993), "OSU scientists get a kick out of sports controversy," The Columbus Dispatch (November 21, 1993), B7.
Source
Previously part of the Data and Story Library, https://dasl.datadescription.com. Removed as of 2020.
Examples
boxPlot(helium$air, xlab = "air")
boxPlot(helium$helium, xlab = "helium")
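Since each trial records one air kick and one helium kick, the per-trial difference is a compact way to compare the two footballs. The sketch below assumes the openintro package is installed and that treating the two kicks in a trial as a pair is reasonable for this design; that pairing is an assumption, not something the documentation prescribes.

```r
library(openintro)

# Per-trial difference in distance: helium minus air
diff_dist <- helium$helium - helium$air
summary(diff_dist)

# Paired t-test of whether the mean difference is zero
t.test(helium$helium, helium$air, paired = TRUE)
```

A positive mean difference would favor the helium-filled ball, though fatigue or learning effects over the trials could also influence the distances.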
Socioeconomic status and reduced-fee school lunches
Description
Data for examining the relationship between socioeconomic status, measured as the percentage of children in a neighborhood receiving reduced-fee lunches at school (lunch), and the percentage of bike riders in the neighborhood wearing helmets (helmet).
Usage
helmet
Format
A data frame with 12 observations representing neighborhoods on the following 2 variables.
- lunch
Percent of students receiving reduced-fee school lunches.
- helmet
Percent of bike riders wearing helmets.
Examples
library(ggplot2)
ggplot(helmet, aes(x = lunch, y = helmet)) +
geom_point()
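To put a number on the relationship shown in the scatterplot, one option is a correlation and a simple linear fit. This is a minimal sketch assuming the openintro package is installed; with only 12 neighborhoods the estimates are rough.

```r
library(openintro)

# Correlation between reduced-fee lunch percentage and helmet use
r <- cor(helmet$lunch, helmet$helmet)
r

# Simple linear regression of helmet use on lunch percentage
fit <- lm(helmet ~ lunch, data = helmet)
coef(fit)
```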
Human Freedom Index
Description
The Human Freedom Index is a report that attempts to summarize the idea of "freedom" through a number of variables for many countries around the globe. It serves as a rough objective measure for the relationships between the different types of freedom (political, religious, economic, or personal) and other social and economic circumstances. The Human Freedom Index is an annual report co-published by the Cato Institute, the Fraser Institute, and the Liberales Institut at the Friedrich Naumann Foundation for Freedom.
Usage
hfi
Format
A data frame with 1458 observations on the following 123 variables.
- year
Year
- ISO_code
ISO code of country
- countries
Name of country
- region
Region where country is located
- pf_rol_procedural
Procedural justice
- pf_rol_civil
Civil justice
- pf_rol_criminal
Criminal justice
- pf_rol
Rule of law
- pf_ss_homicide
Homicide
- pf_ss_disappearances_disap
Disappearances
- pf_ss_disappearances_violent
Violent conflicts
- pf_ss_disappearances_organized
Organized conflicts
- pf_ss_disappearances_fatalities
Terrorism fatalities
- pf_ss_disappearances_injuries
Terrorism injuries
- pf_ss_disappearances
Disappearances, conflict, and terrorism
- pf_ss_women_fgm
Female genital mutilation
- pf_ss_women_missing
Missing women
- pf_ss_women_inheritance_widows
Inheritance rights for widows
- pf_ss_women_inheritance_daughters
Inheritance rights for daughters
- pf_ss_women_inheritance
Inheritance
- pf_ss_women
Women's security
- pf_ss
Security and safety
- pf_movement_domestic
Freedom of domestic movement
- pf_movement_foreign
Freedom of foreign movement
- pf_movement_women
Women's movement
- pf_movement
Freedom of movement
- pf_religion_estop_establish
Freedom to establish religious organizations
- pf_religion_estop_operate
Freedom to operate religious organizations
- pf_religion_estop
Freedom to establish and operate religious organizations
- pf_religion_harassment
Harassment and physical hostilities
- pf_religion_restrictions
Legal and regulatory restrictions
- pf_religion
Religious freedom
- pf_association_association
Freedom of association
- pf_association_assembly
Freedom of assembly
- pf_association_political_establish
Freedom to establish political parties
- pf_association_political_operate
Freedom to operate political parties
- pf_association_political
Freedom to establish and operate political parties
- pf_association_prof_establish
Freedom to establish professional organizations
- pf_association_prof_operate
Freedom to operate professional organizations
- pf_association_prof
Freedom to establish and operate professional organizations
- pf_association_sport_establish
Freedom to establish educational, sporting, and cultural organizations
- pf_association_sport_operate
Freedom to operate educational, sporting, and cultural organizations
- pf_association_sport
Freedom to establish and operate educational, sporting, and cultural organizations
- pf_association
Freedom to associate and assemble with peaceful individuals or organizations
- pf_expression_killed
Press killed
- pf_expression_jailed
Press jailed
- pf_expression_influence
Laws and regulations that influence media content
- pf_expression_control
Political pressures and controls on media content
- pf_expression_cable
Access to cable/satellite
- pf_expression_newspapers
Access to foreign newspapers
- pf_expression_internet
State control over internet access
- pf_expression
Freedom of expression
- pf_identity_legal
Legal gender
- pf_identity_parental_marriage
Parental rights in marriage
- pf_identity_parental_divorce
Parental rights after divorce
- pf_identity_parental
Parental rights
- pf_identity_sex_male
Male-to-male relationships
- pf_identity_sex_female
Female-to-female relationships
- pf_identity_sex
Same-sex relationships
- pf_identity_divorce
Divorce
- pf_identity
Identity and relationships
- pf_score
Personal Freedom (score)
- pf_rank
Personal Freedom (rank)
- ef_government_consumption
Government consumption
- ef_government_transfers
Transfers and subsidies
- ef_government_enterprises
Government enterprises and investments
- ef_government_tax_income
Top marginal income tax rate - Top marginal income tax rates
- ef_government_tax_payroll
Top marginal income tax rate - Top marginal income and payroll tax rate
- ef_government_tax
Top marginal tax rate
- ef_government
Size of government
- ef_legal_judicial
Judicial independence
- ef_legal_courts
Impartial courts
- ef_legal_protection
Protection of property rights
- ef_legal_military
Military interference in rule of law and politics
- ef_legal_integrity
Integrity of the legal system
- ef_legal_enforcement
Legal enforcement of contracts
- ef_legal_restrictions
Regulatory restrictions on the sale of real property
- ef_legal_police
Reliability of police
- ef_legal_crime
Business costs of crime
- ef_legal_gender
Gender adjustment
- ef_legal
Legal system and property rights
- ef_money_growth
Money growth
- ef_money_sd
Standard deviation of inflation
- ef_money_inflation
Inflation - most recent year
- ef_money_currency
Freedom to own foreign currency bank account
- ef_money
Sound money
- ef_trade_tariffs_revenue
Tariffs - Revenue from trade taxes (percentage of trade sector)
- ef_trade_tariffs_mean
Tariffs - Mean tariff rate
- ef_trade_tariffs_sd
Tariffs - Standard deviation of tariffs rates
- ef_trade_tariffs
Tariffs
- ef_trade_regulatory_nontariff
Regulatory trade barriers - Nontariff trade barriers
- ef_trade_regulatory_compliance
Regulatory trade barriers - Compliance costs of importing and exporting
- ef_trade_regulatory
Regulatory trade barriers
- ef_trade_black
Black-market exchange rates
- ef_trade_movement_foreign
Controls of the movement of capital and people - Foreign ownership/investment restrictions
- ef_trade_movement_capital
Controls of the movement of capital and people - Capital controls
- ef_trade_movement_visit
Controls of the movement of capital and people - Freedom of foreigners to visit
- ef_trade_movement
Controls of the movement of capital and people
- ef_trade
Freedom to trade internationally
- ef_regulation_credit_ownership
Credit market regulations - Ownership of banks
- ef_regulation_credit_private
Credit market regulations - Private sector credit
- ef_regulation_credit_interest
Credit market regulations - Interest rate controls/negative real interest rates
- ef_regulation_credit
Credit market regulation
- ef_regulation_labor_minwage
Labor market regulations - Hiring regulations and minimum wage
- ef_regulation_labor_firing
Labor market regulations - Hiring and firing regulations
- ef_regulation_labor_bargain
Labor market regulations - Centralized collective bargaining
- ef_regulation_labor_hours
Labor market regulations - Hours regulations
- ef_regulation_labor_dismissal
Labor market regulations - Dismissal regulations
- ef_regulation_labor_conscription
Labor market regulations - Conscription
- ef_regulation_labor
Labor market regulation
- ef_regulation_business_adm
Business regulations - Administrative requirements
- ef_regulation_business_bureaucracy
Business regulations - Bureaucracy costs
- ef_regulation_business_start
Business regulations - Starting a business
- ef_regulation_business_bribes
Business regulations - Extra payments/bribes/favoritism
- ef_regulation_business_licensing
Business regulations - Licensing restrictions
- ef_regulation_business_compliance
Business regulations - Cost of tax compliance
- ef_regulation_business
Business regulation
- ef_regulation
Economic freedom regulation score
- ef_score
Economic freedom score
- ef_rank
Economic freedom rank
- hf_score
Human freedom score
- hf_rank
Human freedom rank
- hf_quartile
Human freedom quartile
Details
This dataset contains information from Human Freedom Index reports from 2008-2016.
Source
Ian Vasquez and Tanja Porcnik, The Human Freedom Index 2018: A Global Measurement of Personal, Civil, and Economic Freedom (Washington: Cato Institute, Fraser Institute, and the Friedrich Naumann Foundation for Freedom, 2018). https://www.cato.org/sites/cato.org/files/human-freedom-index-files/human-freedom-index-2016.pdf. https://www.kaggle.com/gsutters/the-human-freedom-index.
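Examples

With 123 variables spanning several years, it often helps to restrict to a single year before plotting. The following is a minimal sketch, assuming the openintro and ggplot2 packages are installed; the object name hfi_latest is introduced here for illustration only.

```r
library(openintro)
library(ggplot2)

# Keep only the most recent year present in the data
hfi_latest <- subset(hfi, year == max(year, na.rm = TRUE))

# Personal freedom score against economic freedom score, one point per country
ggplot(hfi_latest, aes(x = ef_score, y = pf_score)) +
  geom_point(na.rm = TRUE) +
  labs(x = "Economic freedom score", y = "Personal freedom score")
```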
Histogram or hollow histogram
Description
Create histograms and hollow histograms. This function permits easy color and appearance customization.
Usage
histPlot(
x,
col = fadeColor("black", "22"),
border = "black",
breaks = "default",
probability = FALSE,
hollow = FALSE,
add = FALSE,
lty = 2,
lwd = 1,
freqTable = FALSE,
right = TRUE,
axes = TRUE,
xlab = NULL,
ylab = NULL,
xlim = NULL,
ylim = NULL,
...
)
Arguments
x |
Numerical vector or a frequency table (matrix) where the first
column represents the observed values and the second column the frequencies.
See also |
col |
Shading of the histogram bins. |
border |
Color of histogram bin borders. |
breaks |
A vector for the bin boundaries or an approximate number of bins. |
probability |
If |
hollow |
If |
add |
If |
lty |
Line type. Applies only if |
lwd |
Line width. Applies only if |
freqTable |
Set to |
right |
Set to |
axes |
If |
xlab |
Label for the x axis. |
ylab |
Label for the y axis. |
xlim |
Limits for the x axis. |
ylim |
Limits for the y axis. |
... |
Additional arguments to |
Author(s)
David Diez
See Also
Examples
histPlot(tips$tip, main = "Tips")
# overlaid hollow histograms
histPlot(tips$tip[tips$day == "Tuesday"],
probability = TRUE,
hollow = TRUE,
main = "Tips by day"
)
histPlot(tips$tip[tips$day == "Friday"],
probability = TRUE,
hollow = TRUE,
add = TRUE,
lty = 3,
border = "red"
)
legend("topright",
col = c("black", "red"),
lty = 2:3,
legend = c("Tuesday", "Friday")
)
# breaks and colors
histPlot(tips$tip,
col = fadeColor("yellow", "33"),
border = "darkblue",
probability = TRUE,
breaks = 30,
lwd = 3
)
# custom breaks
brks <- c(-1, 0, 1, 2, 3, 4, seq(5, 20, 5), 22, 24, 26)
histPlot(tips$tip,
probability = TRUE,
breaks = brks,
col = fadeColor("darkgoldenrod4", "33"),
xlim = c(0, 26)
)
United States House of Representatives historical make-up
Description
The make-up of the United States House of Representatives every two years since 1789. The last Congress included is the 112th Congress, which completed its term in 2013.
Usage
house
Format
A data frame with 112 observations on the following 12 variables.
- congress
The number of that year's Congress
- year_start
Starting year
- year_end
Ending year
- seats
Total number of seats
- p1
Name of the first political party
- np1
Number of seats held by the first political party
- p2
Name of the second political party
- np2
Number of seats held by the second political party
- other
Other
- vac
Vacancy
- del
Delegate
- res
Resident commissioner
Source
Party Divisions of the House of Representatives, 1789 to Present. https://history.house.gov/Institution/Party-Divisions/Party-Divisions.
Examples
library(dplyr)
library(ggplot2)
library(forcats)
# Examine two-party relationship since 1855
house_since_1855 <- house |>
filter(year_start >= 1855) |>
mutate(
p1_perc = 100 * np1 / seats,
p2_perc = 100 * np2 / seats,
era = case_when(
between(year_start, 1861, 1865) ~ "Civil War",
between(year_start, 1914, 1918) ~ "World War I",
between(year_start, 1929, 1939) ~ "Great Depression",
between(year_start, 1940, 1945) ~ "World War II",
between(year_start, 1960, 1965) ~ "Vietnam War Start",
between(year_start, 1965, 1975) ~ "Vietnam War Escalated",
TRUE ~ NA_character_
),
era = fct_relevel(
era, "Civil War", "World War I",
"Great Depression", "World War II",
"Vietnam War Start", "Vietnam War Escalated"
)
)
ggplot(house_since_1855, aes(x = year_start)) +
geom_rect(aes(
xmin = year_start, xmax = lead(year_start),
ymin = -Inf, ymax = Inf, fill = era
)) +
geom_line(aes(y = p1_perc, color = "Democrats")) + # Democrats
geom_line(aes(y = p2_perc, color = "Republicans")) + # Republicans
scale_fill_brewer(palette = "Pastel1", na.translate = FALSE) +
scale_color_manual(
name = "Party",
values = c("Democrats" = "blue", "Republicans" = "red"),
labels = c("Democrats", "Republicans")
) +
theme_minimal() +
ylim(0, 100) +
labs(x = "Year", y = "Percentage of seats", fill = "Era")
Simulated dataset on student housing
Description
Each observation represents a simulated rent price for a student.
Usage
housing
Format
A data frame with 75 observations on the following variable.
- cost
a numeric vector
Examples
housing
High School and Beyond survey
Description
Two hundred observations were randomly sampled from the High School and Beyond survey, a survey conducted on high school seniors by the National Center for Education Statistics.
Usage
hsb2
Format
A data frame with 200 observations and 11 variables.
- id
Student ID.
- gender
Student's gender, with levels female and male.
- race
Student's race, with levels african american, asian, hispanic, and white.
- ses
Socio economic status of student's family, with levels low, middle, and high.
- schtyp
Type of school, with levels public and private.
- prog
Type of program, with levels general, academic, and vocational.
- read
Standardized reading score.
- write
Standardized writing score.
- math
Standardized math score.
- science
Standardized science score.
- socst
Standardized social studies score.
Source
UCLA Institute for Digital Research & Education - Statistical Consulting.
Examples
library(ggplot2)
ggplot(hsb2, aes(x = read - write, y = ses)) +
geom_boxplot() +
labs(
x = "Difference between reading and writing scores",
y = "Socio-economic status"
)
Great Britain: husband and wife pairs
Description
The Great Britain Office of Population Census and Surveys once collected data on a random sample of 170 married couples in Britain, recording the age (in years) and heights of the husbands and wives.
Usage
husbands_wives
Format
A data frame with 199 observations on the following 8 variables.
- age_husband
Age of husband.
- age_wife
Age of wife.
- ht_husband
Height of husband (mm).
- ht_wife
Height of wife (mm).
- age_husb_at_marriage
Age of husband at the time they married.
- age_wife_at_marriage
Age of wife at the time they married.
- years_married
Number of years married.
Source
Hand DJ. 1994. A handbook of small data sets. Chapman & Hall/CRC.
Examples
library(ggplot2)
ggplot(husbands_wives, aes(x = ht_husband, y = ht_wife)) +
geom_point()
Poll on illegal workers in the US
Description
910 randomly sampled registered voters in Tampa, FL were asked about their political ideology and whether they thought workers who have illegally entered the US should be (i) allowed to keep their jobs and apply for US citizenship, (ii) allowed to keep their jobs as temporary guest workers but not allowed to apply for US citizenship, or (iii) lose their jobs and have to leave the country.
Usage
immigration
Format
A data frame with 910 observations on the following 2 variables.
- response
a factor with levels
Apply for citizenship
Guest worker
Leave the country
Not sure
- political
a factor with levels
conservative
liberal
moderate
Source
SurveyUSA, News Poll #18927, data collected Jan 27-29, 2012.
Examples
immigration
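A contingency table makes the relationship between political ideology and response easier to read than the raw rows. The sketch below assumes the openintro package is installed; the chi-square test is offered as one possible follow-up and may produce a warning if some cells have small expected counts.

```r
library(openintro)

# Counts of each response within each political ideology
tab <- table(immigration$political, immigration$response)
tab

# Share of each response within each ideology (rows sum to 1)
round(prop.table(tab, margin = 1), 2)

# Chi-square test of independence between ideology and response
chisq.test(tab)
```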
Infant Mortality Rates, 2012
Description
This entry gives the number of deaths of infants under one year old in 2012 per 1,000 live births in the same year. This rate is often used as an indicator of the level of health in a country.
Usage
infmortrate
Format
A data frame with 222 observations on the following 2 variables.
- country
Name of country.
- inf_mort_rate
Infant mortality rate per 1,000 live births.
Details
The data is given in decreasing order of infant mortality rates. There are a few potential outliers.
Source
CIA World Factbook, https://www.cia.gov/the-world-factbook/field/infant-mortality-rate/country-comparison.
Examples
library(ggplot2)
ggplot(infmortrate, aes(x = inf_mort_rate)) +
geom_histogram(binwidth = 10)
ggplot(infmortrate, aes(x = inf_mort_rate)) +
geom_density()
iowa
Description
A data frame containing information about the 2016 US Presidential Election for the state of Iowa.
Usage
iowa
Format
A data frame with 1386 observations on the following 5 variables.
- office
The office that the candidates were running for.
- candidate
President/Vice President pairs who were running for office.
- party
Political party of the candidate.
- county
County in Iowa where the votes were cast.
- votes
Number of votes received by the candidate.
Examples
library(ggplot2)
library(dplyr)
plot_data <- iowa |>
filter(candidate != "Total") |>
group_by(candidate) |>
summarize(total_votes = sum(votes) / 1000)
ggplot(plot_data, aes(total_votes, candidate)) +
geom_col() +
theme_minimal() +
labs(
title = "2016 Presidential Election in Iowa",
subtitle = "Popular vote",
y = "",
x = "Number of Votes (in thousands)"
)
Facebook, Google, and LinkedIn IPO filings
Description
On Feb 1st, 2012, Facebook Inc. filed an S-1 form with the Securities and Exchange Commission as part of their initial public offering (IPO). This dataset includes the text of that document as well as text from the IPOs of two competing companies: Google and LinkedIn.
Usage
ipo
Format
The format is a list of three character vectors. Each vector contains the line-by-line text of the IPO Prospectus of Facebook, Google, and LinkedIn, respectively.
Details
Each of the three prospectuses is encoded in UTF-8 format and contains some non-word characters related to the layout of the original documents. For analysis on the words, it is recommended that the data be processed with packages such as tidytext. See examples below.
Source
All IPO prospectuses are available from the U.S. Securities and Exchange Commission: Facebook, Google, LinkedIn.
References
Zweig, J., 2020. Mark Zuckerberg: CEO For Life?. WSJ.
Examples
library(tidytext)
library(tibble)
library(dplyr)
library(ggplot2)
library(forcats)
# Analyzing Facebook IPO text
facebook <- tibble(text = ipo$facebook, company = "Facebook")
facebook |>
unnest_tokens(word, text) |>
anti_join(stop_words, by = "word") |>
count(word, sort = TRUE) |>
slice_head(n = 20) |>
ggplot(aes(y = fct_reorder(word, n), x = n, fill = n)) +
geom_col() +
labs(
title = "Top 20 most common words in Facebook IPO",
x = "Frequency",
y = "Word"
)
# Comparisons to Google and LinkedIn IPO texts
google <- tibble(text = ipo$google, company = "Google")
linkedin <- tibble(text = ipo$linkedin, company = "LinkedIn")
ipo_texts <- bind_rows(facebook, google, linkedin)
ipo_texts |>
unnest_tokens(word, text) |>
count(company, word, sort = TRUE) |>
bind_tf_idf(word, company, n) |>
arrange(desc(tf_idf)) |>
group_by(company) |>
slice_max(tf_idf, n = 15) |>
ungroup() |>
ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = company)) +
geom_col(show.legend = FALSE) +
facet_wrap(~company, ncol = 3, scales = "free") +
labs(x = "tf-idf", y = NULL)
Length of songs on an iPod
Description
A simulated dataset on lengths of songs on an iPod.
Usage
ipod
Format
A data frame with 3000 observations on the following variable.
- song_length
Length of song (in minutes).
Source
Simulated data.
Examples
library(ggplot2)
ggplot(ipod, aes(x = song_length)) +
geom_histogram(binwidth = 0.5)
iran
Description
A data frame containing information about the 2009 Presidential Election in Iran. There were widespread claims of election fraud in this election both internationally and within Iran.
Usage
iran
Format
A data frame with 366 observations on the following 9 variables.
- province
Iranian province where votes were cast.
- city
City within province where votes were cast.
- ahmadinejad
Number of votes received by Ahmadinejad.
- rezai
Number of votes received by Rezai.
- karrubi
Number of votes received by Karrubi.
- mousavi
Number of votes received by Mousavi.
- total_votes_cast
Total number of votes cast.
- voided_votes
Number of votes that were not counted.
- legitimate_votes
Number of votes that were counted.
Examples
library(dplyr)
library(ggplot2)
library(tidyr)
library(stringr)
plot_data <- iran |>
summarize(
ahmadinejad = sum(ahmadinejad) / 1000,
rezai = sum(rezai) / 1000,
karrubi = sum(karrubi) / 1000,
mousavi = sum(mousavi) / 1000
) |>
pivot_longer(
cols = c(ahmadinejad, rezai, karrubi, mousavi),
names_to = "candidate",
values_to = "votes"
) |>
mutate(candidate = str_to_title(candidate))
ggplot(plot_data, aes(votes, candidate)) +
geom_col() +
theme_minimal() +
labs(
title = "2009 Iranian Presidential Election",
x = "Number of votes (in thousands)",
y = ""
)
Simulated juror dataset
Description
Simulated dataset of registered voters proportions and representation on juries.
Usage
jury
Format
A data frame with 275 observations on the following variable.
- race
a factor with levels
black
hispanic
other
white
Examples
jury
Kobe Bryant basketball performance
Description
Data from the five games the Los Angeles Lakers played against the Orlando Magic in the 2009 NBA finals.
Usage
kobe_basket
Format
A data frame with 133 rows and 6 variables:
- vs
A categorical vector, ORL if the Los Angeles Lakers played against Orlando
- game
A numerical vector, game in the 2009 NBA finals
- quarter
A categorical vector, quarter in the game, OT stands for overtime
- time
A character vector, time at which Kobe took a shot
- description
A character vector, description of the shot
- shot
A categorical vector, H if the shot was a hit, M if the shot was a miss
Details
Each row represents a shot Kobe Bryant took during the five games of the 2009 NBA finals. Kobe Bryant's performance earned him the title of Most Valuable Player and many spectators commented on how he appeared to show a hot hand.
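Examples

One way to explore the hot hand idea is to look at how long runs of consecutive hits tend to be. The sketch below defines hit_run_lengths(), a hypothetical helper written for this illustration (it is not a documented function and may use a different definition of a streak than other materials). It relies only on base R, and the last step runs only if the openintro package is installed.

```r
# Hypothetical helper: lengths of runs of consecutive hits ("H")
hit_run_lengths <- function(shots) {
  runs <- rle(shots == "H")   # run-length encode hit / not-hit
  runs$lengths[runs$values]   # keep only the runs of hits
}

# Toy sequence: runs of hits have lengths 2, 1, and 3
hit_run_lengths(c("H", "H", "M", "H", "M", "M", "H", "H", "H"))

# Distribution of hit-run lengths in the package data
if (requireNamespace("openintro", quietly = TRUE)) {
  table(hit_run_lengths(as.character(openintro::kobe_basket$shot)))
}
```

Note that this treats all rows as one continuous sequence, so a run that spans the end of one game and the start of the next is counted as a single run.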
lab_report
Description
Acts as a simplified template to common parameters passed to rmarkdown::html_document().
Usage
lab_report(
highlight = "pygments",
theme = "spacelab",
toc = TRUE,
toc_float = TRUE,
code_download = TRUE,
code_folding = "show"
)
Arguments
highlight |
Syntax highlighting style. Supported styles include
"default", "tango", "pygments", "kate", "monochrome", "espresso", "zenburn",
"haddock", and "textmate". Pass |
theme |
Visual theme ("default", "cerulean", "journal", "flatly",
"readable", "spacelab", "united", "cosmo", "lumen", "paper", "sandstone",
"simplex", or "yeti"). Pass |
toc |
|
toc_float |
|
code_download |
Embed the Rmd source code within the document and provide a link that can be used by readers to download the code. |
code_folding |
Enable document readers to toggle the display of R code
chunks. Specify |
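Examples

The following sketch shows one way lab_report() might be used when rendering from R. It assumes the function returns an output format object that rmarkdown::render() accepts, which is consistent with the description above but is not confirmed here. The temporary file and its contents are placeholders.

```r
library(openintro)

# Write a tiny R Markdown source file to a temporary location.
# The file name and its contents are placeholders for illustration.
rmd <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Lab report sketch\"",
  "---",
  "",
  "The mean of 1 to 10 is `r mean(1:10)`."
), rmd)

# Rendering needs pandoc, so skip when it is unavailable
if (rmarkdown::pandoc_available()) {
  out <- rmarkdown::render(
    rmd,
    output_format = lab_report(toc = FALSE, toc_float = FALSE, code_folding = "hide"),
    quiet = TRUE
  )
  out # path to the rendered HTML file
}
```

In an R Markdown document, the same format would more commonly be requested in the YAML header rather than through a call to render().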
Are Emily and Greg More Employable Than Lakisha and Jamal?
Description
Original data from the experiment run by Bertrand and Mullainathan (2004).
Usage
labor_market_discrimination
Format
A tibble with 4870 observations of 63 variables.
- education
Highest education, with levels of 0 = not reported; 1 = high school diploma; 2 = high school graduate; 3 = some college; 4 = college or more.
- n_jobs
Number of jobs listed on resume.
- years_exp
Number of years of work experience on the resume.
- honors
Indicator variable for which 1 = resume mentions some honors.
- volunteer
Indicator variable for which 1 = resume mentions some volunteering experience.
- military
Indicator variable for which 1 = resume mentions some military experience.
- emp_holes
Indicator variable for which 1 = resume mentions some employment holes.
- occup_specific
1990 Census Occupation Code. See sources for a key.
- occup_broad
Occupation broad with levels 1 = executives and managerial occupations, 2 = administrative supervisors, 3 = sales representatives, 4 = sales workers, 5 = secretaries and legal assistants, 6 = clerical occupations
- work_in_school
Indicator variable for which 1 = resume mentions some work experience while at school
Indicator variable for which 1 = email address on applicant's resume.
- computer_skills
Indicator variable for which 1 = resume mentions some computer skills.
- special_skills
Indicator variable for which 1 = resume mentions some special skills.
- first_name
Applicant's first name.
- sex
Sex, with levels of 'f' = female; 'm' = male.
- race
Race, with levels of 'b' = black; 'w' = white.
- h
Indicator variable for which 1 = high quality resume.
- l
Indicator variable for which 1 = low quality resume.
- call
Indicator variable for which 1 = applicant was called back.
- city
City, with levels of 'c' = chicago; 'b' = boston.
- kind
Kind, with levels of 'a' = administrative; 's' = sales.
- ad_id
Employment ad identifier.
- frac_black
Fraction of blacks in applicant's zip.
- frac_white
Fraction of whites in applicant's zip.
- l_med_hh_inc
Log median household income in applicant's zip.
- frac_dropout
Fraction of high-school dropouts in applicant's zip.
- frac_colp
Fraction of college degree or more in applicant's zip
- l_inc
Log per capita income in applicant's zip.
- col
Indicator variable for which 1 = applicant has college degree or more.
- expminreq
Minimum experience required, if any (in years when numeric).
- school_req
Specific education requirement, if any. 'hsg' = high school graduate, 'somcol' = some college, 'colp' = four year degree or higher
- eoe
Indicator variable for which 1 = ad mentions employer is 'Equal Opportunity Employer'.
- parent_sales
Sales of parent company (in millions of US $).
- parent_emp
Number of parent company employees.
- branch_sales
Sales of branch (in millions of US $).
- branch_emp
Number of branch employees.
- fed
Indicator variable for which 1 = employer is a federal contractor.
- frac_black_emp_zip
Fraction of blacks in employer's zipcode.
- frac_white_emp_zip
Fraction of whites in employer's zipcode.
- l_med_hh_inc_emp_zip
Log median household income in employer's zipcode.
- frac_dropout_emp_zip
Fraction of high-school dropouts in employer's zipcode.
- frac_colp_emp_zip
Fraction of college degree or more in employer's zipcode.
- l_inc_emp_zip
Log per capita income in employer's zipcode.
- manager
Indicator variable for which 1 = executives or managers wanted.
- supervisor
Indicator variable for which 1 = administrative supervisors wanted.
- secretary
Indicator variable for which 1 = secretaries or legal assistants wanted.
- off_support
Indicator variable for which 1 = clerical workers wanted.
- sales_rep
Indicator variable for which 1 = sales representative wanted.
- retail_sales
Indicator variable for which 1 = retail sales worker wanted.
- req
Indicator variable for which 1 = ad mentions any requirement for job.
- exp_req
Indicator variable for which 1 = ad mentions some experience requirement.
- com_req
Indicator variable for which 1 = ad mentions some communication skills requirement.
- educ_req
Indicator variable for which 1 = ad mentions some educational requirement.
- comp_req
Indicator variable for which 1 = ad mentions some computer skill requirement.
- org_req
Indicator variable for which 1 = ad mentions some organizational skills requirement.
- manuf
Indicator variable for which 1 = employer industry is manufacturing.
- trans_com
Indicator variable for which 1 = employer industry is transport or communication.
- bank_real
Indicator variable for which 1 = employer industry is finance, insurance or real estate.
- trade
Indicator variable for which 1 = employer industry is wholesale or retail trade.
- bus_service
Indicator variable for which 1 = employer industry is business or personal services.
- oth_service
Indicator variable for which 1 = employer industry is health, education or social services.
- miss_ind
Indicator variable for which 1 = employer industry is other or unknown.
- ownership
Ownership status of employer, with levels of 'non-profit'; 'private'; 'public'
Details
From the summary: "We study race in the labor market by sending fictitious resumes to help-wanted ads in Boston and Chicago newspapers. To manipulate perceived race, resumes are randomly assigned African-American- or White-sounding names. White names receive 50 percent more callbacks for interviews. Callbacks are also more responsive to resume quality for White names than for African-American ones. The racial gap is uniform across occupation, industry, and employer size. We also find little evidence that employers are inferring social class from the names. Differential treatment by race still appears to be prominent in the U. S. labor market."
Source
Bertrand, Marianne, and Mullainathan, Sendhil. Replication data for: Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination. Nashville, TN: American Economic Association [publisher], 2004. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2019-12-06. doi:10.3886/E116023V1.
Note: The description of the variables follows closely the labels provided in the original dataset, with small edits for clarity.
Examples
library(dplyr)
# Percent callback for typical White names and typical African-American names (table 1, p. 997)
labor_market_discrimination |>
group_by(race) |>
summarise(call_back = mean(call))
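The summary quoted in Details states that White names receive about 50 percent more callbacks. A quick way to check that figure against the data is to take the ratio of the two callback rates. This is a minimal base R sketch, assuming the openintro package is installed and that call is stored as a numeric 0/1 indicator as described above.

```r
library(openintro)

# Callback rate by race ('b' = black, 'w' = white)
rates <- tapply(
  labor_market_discrimination$call,
  labor_market_discrimination$race,
  mean
)
rates

# Ratio of callback rates: White-sounding to African-American-sounding names
unname(rates["w"] / rates["b"])
```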
Gender, Socioeconomic Class, and Interview Invites
Description
Resumes were sent out to 316 top law firms in the United States, and there were two randomized characteristics of each resume. First, the gender associated with the resume was randomized by assigning a first name of either James or Julia. Second, the socioeconomic class of the candidate was randomly assigned and represented through five minor changes associated with personal interests and other minor details (e.g., an extracurricular activity of sailing team vs. track and field). The outcome variable was whether the candidate received an interview.
Usage
law_resume
Format
A data frame with 316 observations on the following 3 variables. Each row represents a resume sent to a top law firm for this experiment.
- class
The resume represented irrelevant details suggesting either "low" or "high" socioeconomic class.
- gender
The resume implied the candidate was either "male" or "female".
- outcome
If the candidate received an invitation for an "interview" or "not".
Source
For a casual overview, see https://hbr.org/2016/12/research-how-subtle-class-cues-can-backfire-on-your-resume.
For the academic paper, see Tilcsik A, Rivera LA. 2016. Class Advantage, Commitment Penalty: The Gendered Effect of Social Class Signals in an Elite Labor Market. American Sociological Review 81:6 p1097-1131. doi:10.1177/0003122416668154.
Examples
tapply(law_resume$outcome == "interview", law_resume[, c("class", "gender")], mean)
m <- glm(I(outcome == "interview") ~ gender * class, data = law_resume, family = binomial)
summary(m)
predict(m, type = "response")
Lecture Delivery Method and Learning Outcomes
Description
Data was collected from 276 students in a university psychology course to determine the effect of lecture delivery method on learning. Students were presented a live lecture by the professor on one day and a pre-recorded lecture on a different topic by the same professor on a different day. Survey data was collected during the lectures to determine mind wandering, interest, and motivation. Students were also ultimately asked about the preferred lecture delivery method. Finally, students completed an assessment at the end of the lecture to determine memory recall.
Usage
lecture_learning
Format
A data frame with 552 rows and 8 variables.
- student
Identification number of a specific student. Each identification number appears twice because the same student heard both lecture delivery methods.
- gender
Gender of student. Recorded as a binary variable with levels Male and Female in the study.
- method
Delivery method of the lecture: either in-person (Live) or pre-recorded (Video).
- mindwander
An indicator of distraction during the lecture. It is a proportion of six mind wandering probes during the lecture when a student answered yes that mind wandering had just occurred.
- memory
An indicator of recall of information provided during the lecture. It is the proportion of correct answers in a six question assessment given at the end of the lecture presentation.
- interest
A Likert scale that gauged student interest level concerning the lecture.
- motivation_both
After experiencing both lecture delivery methods, students were asked which method most motivated them to remain attentive.
- motivation_single
After a single lecture delivery experience, this Likert scale was used to gauge motivation to remain attentive during the lecture.
Source
Examples
library(dplyr)
library(ggplot2)
# Calculate the average memory test proportion by lecture delivery method
# and gender.
lecture_learning |>
group_by(method, gender) |>
summarize(average_memory = mean(memory), count = n(), .groups = "drop")
# Compare visually the differences in memory test proportions by delivery
# method and gender.
ggplot(lecture_learning, aes(x = method, y = memory, fill = gender)) +
geom_boxplot() +
theme_minimal() +
labs(
title = "Difference in memory test proportions",
x = "Method",
y = "Memory",
fill = "Gender"
)
# Use a paired t-test to determine whether memory test proportion score
# differed by delivery method. Note that paired t-tests are identical
# to one sample t-test on the difference between the Live and Video methods.
learning_diff <- lecture_learning |>
tidyr::pivot_wider(id_cols = student, names_from = method, values_from = memory) |>
mutate(time_diff = Live - Video)
t.test(time_diff ~ 1, data = learning_diff)
# Calculating the proportion of students who were most motivated to remain
# attentive in each delivery method.
lecture_learning |>
count(motivation_both) |>
mutate(proportion = n / sum(n))
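The equivalence mentioned in the paired t-test comment above can be checked directly. The following is a minimal sketch on simulated scores, not the lecture_learning data; the vectors live and video, the sample size, and the seed are illustrative assumptions.

```r
# Simulated paired measurements; these are NOT the lecture_learning data.
set.seed(1)
n <- 30
live <- round(runif(n), 2)
video <- pmin(pmax(live + rnorm(n, mean = -0.05, sd = 0.1), 0), 1)

# Paired t-test on the two sets of scores
paired <- t.test(live, video, paired = TRUE)

# One-sample t-test on the within-student differences
one_sample <- t.test(live - video, mu = 0)

# Both approaches give the same test statistic and p-value
all.equal(unname(paired$statistic), unname(one_sample$statistic))
all.equal(paired$p.value, one_sample$p.value)
```

Working with the differences, as the example above does through pivot_wider(), makes the pairing explicit and keeps each student's two scores on one row.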
Legalization of Marijuana Support in 2010 California Survey
Description
In a 2010 Survey USA poll, 70% of the 119 respondents between the ages of 18 and 34 said they would vote in the 2010 general election for Prop 19, which would change California law to legalize marijuana and allow it to be regulated and taxed.
Usage
leg_mari
Format
A data frame with 119 observations on the following variable.
- response
One of two values: oppose or support.
Source
Survey USA, Election Poll #16804, data collected July 8-11, 2010.
Examples
table(leg_mari)
Population of Lego Sets for Sale between Jan. 1, 2018 and Sept. 11, 2020.
Description
Data about Lego sets for sale. Based on a JSDSE article by Anna Peterson and Laura Ziegler. Data from their article was scraped from multiple sources, including brickset.com.
Usage
lego_population
Format
A data frame with 1304 rows and 14 variables.
- item_number
Set Item number
- set_name
Name of the set.
- theme
Set theme: Duplo, City or Friends.
- pieces
Number of pieces in the set.
- price
Recommended retail price from LEGO.
- amazon_price
Price of the set at Amazon.
- year
Year that it was produced.
- ages
LEGO's recommended ages of children for the set
- pages
Pages in the instruction booklet.
- minifigures
Number of LEGO people in the set; if unknown, "NA" was recorded.
- packaging
Type of packaging: bag, box, etc.
- weight
Weight of the set of LEGOS in pounds and kilograms.
- unique_pieces
Number of pieces classified as unique in the instruction manual.
- size
Size of the lego pieces: Large if safe for small children and Small for older children.
Source
Peterson, A. D., & Ziegler, L. (2021). Building a multiple linear regression model with LEGO brick data. Journal of Statistics and Data Science Education, 29(3),1-7. doi:10.1080/26939169.2021.1946450
BrickInstructions.com. (n.d.). Retrieved February 2, 2021 from
Brickset. (n.d.). BRICKSET: Your LEGO® set guide. Retrieved February 2, 2021 from
Examples
library(ggplot2)
library(dplyr)
lego_population |>
filter(theme == "Friends" | theme == "City") |>
ggplot(aes(x = pieces, y = amazon_price)) +
geom_point(alpha = 0.3) +
labs(
x = "Pieces in the Set",
y = "Amazon Price",
title = "Amazon Price vs Number of Pieces in Lego Sets",
subtitle = "Friends and City Themes"
)
Sample of Lego Sets
Description
Data about Lego sets for sale. Based on a JSDSE article by Anna Peterson and Laura Ziegler. Data from their article was scraped from multiple sources, including brickset.com.
Usage
lego_sample
Format
A data frame with 75 rows and 15 variables.
- item_number
Set Item number
- set_name
Name of the set.
- theme
Set theme: Duplo, City or Friends.
- pieces
Number of pieces in the set.
- price
Recommended retail price from LEGO.
- amazon_price
Price of the set at Amazon.
- year
Year that it was produced.
- ages
LEGO's recommended ages of children for the set
- pages
Pages in the instruction booklet.
- minifigures
Number of LEGO people in the set; if unknown, "NA" was recorded.
- packaging
Type of packaging: bag, box, etc.
- weight
Weight of the set of LEGOS in pounds and kilograms.
- unique_pieces
Number of pieces classified as unique in the instruction manual.
- size
Size of the lego pieces: Large if safe for small children and Small for older children.
Source
Peterson, A. D., & Ziegler, L. (2021). Building a multiple linear regression model with LEGO brick data. Journal of Statistics and Data Science Education, 29(3),1-7. doi:10.1080/26939169.2021.1946450
BrickInstructions.com. (n.d.). Retrieved February 2, 2021 from
Brickset. (n.d.). BRICKSET: Your LEGO® set guide. Retrieved February 2, 2021 from
Examples
library(ggplot2)
library(dplyr)
lego_sample |>
filter(theme == "Friends" | theme == "City") |>
ggplot(aes(x = pieces, y = amazon_price)) +
geom_point(alpha = 0.3) +
labs(
x = "Pieces in the Set",
y = "Amazon Price",
title = "Amazon Price vs Number of Pieces in Lego Sets",
subtitle = "Friends and City Themes"
)
life_exp
Description
A data frame with 3142 rows and 4 columns. County level data for life expectancy and median income in the United States.
Usage
life_exp
Format
A data frame with 3142 observations on the following 4 variables.
- state
Name of the state.
- county
Name of the county.
- expectancy
Life expectancy in the county.
- income
Median income in the county, measured in US $.
Examples
library(ggplot2)
# Income V Expectancy
ggplot(life_exp, aes(x = income, y = expectancy)) +
geom_point(color = openintro::IMSCOL["green", "full"], alpha = 0.2) +
theme_minimal() +
labs(
title = "Is there a relationship between median income and life expectancy?",
x = "Median income (US $)",
y = "Life Expectancy (year)"
)
Create simple regression plot with residual plot
Description
Create a simple regression plot with residual plot.
Usage
linResPlot(
x,
y,
axes = FALSE,
wBox = TRUE,
wLine = TRUE,
lCol = "#00000088",
lty = 1,
lwd = 1,
main = "",
xlab = "",
ylab = "",
marRes = NULL,
col = fadeColor(4, "88"),
pch = 20,
cex = 1.5,
yR = 0.1,
ylim = NULL,
subset = NULL,
...
)
Arguments
x |
Predictor variable. |
y |
Outcome variable. |
axes |
Whether to plot axis labels. |
wBox |
Whether to plot boxes around each plot. |
wLine |
Add a regression line. |
lCol |
Line color. |
lty |
Line type. |
lwd |
Line width. |
main |
Title for the top plot. |
xlab |
x-label. |
ylab |
y-label. |
marRes |
Margin for the residuals plot. |
col |
Color of the points. |
pch |
Plotting character of points. |
cex |
Size of points. |
yR |
An additional vertical stretch factor on the plot. |
ylim |
y-limits. |
subset |
Boolean vector, if wanting a subset of the data. |
... |
Additional arguments passed to both plots. |
See Also
Examples
# Currently seems broken for this example.
n <- 25
x <- runif(n)
y <- 5 * x + rnorm(n)
myMat <- rbind(matrix(1:2, 2))
myW <- 1
myH <- c(1, 0.45)
par(mar = c(0.35, 0.654, 0.35, 0.654))
layout(myMat, myW, myH)
linResPlot(x, y, col = COL[1, 2])
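Since the example above is flagged as possibly broken, the same idea (a scatterplot with its regression line stacked over a residual plot) can be sketched with base graphics only. This is an illustration under assumed settings, not the implementation of linResPlot(); the margins, seed, and axis label are assumptions.

```r
# Simulated data, following the example above
set.seed(1)
n <- 25
x <- runif(n)
y <- 5 * x + rnorm(n)

fit <- lm(y ~ x)
res <- residuals(fit)

# Stack two panels: main plot on top, residual plot below (heights 1 : 0.45)
old_par <- par(mar = c(4, 4, 1, 1))
layout(matrix(1:2, nrow = 2), widths = 1, heights = c(1, 0.45))

# Top panel: data with the least squares line
plot(x, y, pch = 20, cex = 1.5)
abline(fit, col = "#00000088")

# Bottom panel: residuals against the predictor, with a reference line at 0
plot(x, res, pch = 20, cex = 1.5, ylab = "residuals")
abline(h = 0, lty = 2)

# Restore the single-panel layout and the previous margins
layout(1)
par(old_par)
```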
Field data on lizards observed in their natural habitat
Description
Data on where each lizard was observed and the level of sunlight. The data were collected on Sceloporus occidentalis (western fence lizards) by Stephen C. Adolph in 1983 (in desert and mountain sites) and by Dee Asbury in 2002-3 (in valley site).
Usage
lizard_habitat
Format
A data frame with 332 observations on the following 2 variables.
- site
Site of lizard observation: desert, mountain, or valley.
- sunlight
Sunlight level at time of observation: sun (lizard was observed perching in full sunlight), partial (lizard was observed perching with part of its body in the sun, part in the shade), shade (lizard was observed perching in the shade).
Source
Adolph, S. C. 1990. Influence of behavioral thermoregulation on microhabitat use by two Sceloporus lizards. Ecology 71: 315-327.
Asbury, D. A., and S. C. Adolph. 2007. Behavioral plasticity in an ecological generalist: microhabitat use by western fence lizards. Evolutionary Ecology Research 9: 801-815.
Examples
library(ggplot2)
# Frequencies
table(lizard_habitat)
# Stacked bar plots
ggplot(lizard_habitat, aes(y = site, fill = sunlight)) +
geom_bar(position = "fill") +
labs(x = "Proportion")
Lizard speeds
Description
Data on top speeds measured on a laboratory race track for two species of lizards: Western fence lizard (Sceloporus occidentalis) and Sagebrush lizard (Sceloporus graciosus).
Usage
lizard_run
Format
A data frame with 48 observations on the following 3 variables.
- top_speed
Top speed of lizard, meters per second.
- common_name
Common name: Western fence lizard and Sagebrush lizard.
- scientific_name
Scientific name (Genus and species): Sceloporus occidentalis and Sceloporus graciosus.
Source
Adolph, S. C. 1987. Physiological and behavioral ecology of the lizards Sceloporus occidentalis and Sceloporus graciosus. Dissertation. University of Washington, Seattle, Washington, USA.
Examples
library(ggplot2)
library(dplyr)
# Top speed by species
ggplot(lizard_run, aes(x = top_speed, color = common_name, fill = common_name)) +
geom_density(alpha = 0.5)
# Top speed summary statistics by species
lizard_run |>
group_by(common_name) |>
summarise(
n = n(),
mean = mean(top_speed),
sd = sd(top_speed)
)
Linear regression plot with residual plot
Description
Plot data, the linear model, and a residual plot simultaneously.
Usage
lmPlot(
x,
y,
xAxis = 0,
yAxis = 4,
resAxis = 3,
resSymm = TRUE,
wBox = TRUE,
wLine = TRUE,
lCol = "#00000088",
lty = 1,
lwd = 1,
xlab = "",
ylab = "",
marRes = NULL,
col = "#22558888",
pch = 20,
cex = 1.5,
xR = 0.02,
yR = 0.1,
xlim = NULL,
ylim = NULL,
subset = NULL,
parCustom = FALSE,
myHeight = c(1, 0.45),
plots = c("both", "mainOnly", "resOnly"),
highlight = NULL,
hlCol = NULL,
hlCex = 1.5,
hlPch = 20,
na.rm = TRUE,
...
)
Arguments
x |
The x coordinates of points in the plot. |
y |
The y coordinates of points in the plot. |
xAxis |
The maximum number of x axis labels. |
yAxis |
The maximum number of y axis labels. |
resAxis |
The maximum number of y axis labels in the residual plot. |
resSymm |
Boolean determining whether the range of the residual plot should be symmetric about zero. |
wBox |
Boolean determining whether a box should be added around each plot. |
wLine |
Boolean determining whether to add a regression line to the plot. |
lCol |
The color of the regression line to be added. |
lty |
The line type of the regression line to be added. |
lwd |
The line width of the regression line to be added. |
xlab |
A label for the x axis. |
ylab |
A label for the y axis |
marRes |
Margin specified for the residuals. |
col |
Color of points. |
pch |
Plotting character. |
cex |
Plotting character size. |
xR |
Scaling the limits of the x axis. Ignored if |
yR |
Scaling the limits of the y axis. Ignored if |
xlim |
Limits for the x axis. |
ylim |
Limits for the y axis. |
subset |
A subset of the data to be used for the linear model. |
parCustom |
If |
myHeight |
A numerical vector of length 2 representing the ratio of the primary plot to the residual plot, in height. |
plots |
Not currently utilized. |
highlight |
Numerical vector specifying particular points to highlight. |
hlCol |
Color of highlighted points. |
hlCex |
Size of highlighted points. |
hlPch |
Plotting characters of highlighted points. |
na.rm |
Remove cases with |
... |
Additional arguments to |
Author(s)
David Diez
See Also
Examples
lmPlot(satgpa$sat_sum, satgpa$fy_gpa)
lmPlot(gradestv$tv, gradestv$grades,
xAxis = 4,
xlab = "time watching TV", yR = 0.2, highlight = c(1, 15, 20)
)
Loan data from Lending Club
Description
This dataset represents thousands of loans made through the Lending Club platform, which is a platform that allows individuals to lend to other individuals. Of course, not all loans are created equal. Someone who is essentially a sure bet to pay back a loan will have an easier time getting a loan with a low interest rate than someone who appears to be riskier. And for people who are very risky? They may not even get a loan offer, or they may not have accepted the loan offer due to a high interest rate. It is important to keep that last part in mind, since this dataset only represents loans actually made, i.e. do not mistake this data for loan applications!
Usage
loans_full_schema
Format
A data frame with 10,000 observations on the following 55 variables.
- emp_title
Job title.
- emp_length
Number of years in the job, rounded down. If longer than 10 years, then this is represented by the value 10.
- state
Two-letter state code.
- homeownership
The ownership status of the applicant's residence.
- annual_income
Annual income.
- verified_income
Type of verification of the applicant's income.
- debt_to_income
Debt-to-income ratio.
- annual_income_joint
If this is a joint application, then the annual income of the two parties applying.
- verification_income_joint
Type of verification of the joint income.
- debt_to_income_joint
Debt-to-income ratio for the two parties.
- delinq_2y
Delinquencies on lines of credit in the last 2 years.
- months_since_last_delinq
Months since the last delinquency.
- earliest_credit_line
Year of the applicant's earliest line of credit
- inquiries_last_12m
Inquiries into the applicant's credit during the last 12 months.
- total_credit_lines
Total number of credit lines in this applicant's credit history.
- open_credit_lines
Number of currently open lines of credit.
- total_credit_limit
Total available credit, e.g. if only credit cards, then the total of all the credit limits. This excludes a mortgage.
- total_credit_utilized
Total credit balance, excluding a mortgage.
- num_collections_last_12m
Number of collections in the last 12 months. This excludes medical collections.
- num_historical_failed_to_pay
The number of derogatory public records, which roughly means the number of times the applicant failed to pay.
- months_since_90d_late
Months since the last time the applicant was 90 days late on a payment.
- current_accounts_delinq
Number of accounts where the applicant is currently delinquent.
- total_collection_amount_ever
The total amount that the applicant has had against them in collections.
- current_installment_accounts
Number of installment accounts, which are (roughly) accounts with a fixed payment amount and period. A typical example might be a 36-month car loan.
- accounts_opened_24m
Number of new lines of credit opened in the last 24 months.
- months_since_last_credit_inquiry
Number of months since the last credit inquiry on this applicant.
- num_satisfactory_accounts
Number of satisfactory accounts.
- num_accounts_120d_past_due
Number of current accounts that are 120 days past due.
- num_accounts_30d_past_due
Number of current accounts that are 30 days past due.
- num_active_debit_accounts
Number of currently active bank cards.
- total_debit_limit
Total of all bank card limits.
- num_total_cc_accounts
Total number of credit card accounts in the applicant's history.
- num_open_cc_accounts
Total number of currently open credit card accounts.
- num_cc_carrying_balance
Number of credit cards that are carrying a balance.
- num_mort_accounts
Number of mortgage accounts.
- account_never_delinq_percent
Percent of all lines of credit where the applicant was never delinquent.
- tax_liens
a numeric vector
- public_record_bankrupt
Number of bankruptcies listed in the public record for this applicant.
- loan_purpose
The category for the purpose of the loan.
- application_type
The type of application: either individual or joint.
- loan_amount
The amount of the loan the applicant received.
- term
The number of months of the loan the applicant received.
- interest_rate
Interest rate of the loan the applicant received.
- installment
Monthly payment for the loan the applicant received.
- grade
Grade associated with the loan.
- sub_grade
Detailed grade associated with the loan.
- issue_month
Month the loan was issued.
- loan_status
Status of the loan.
- initial_listing_status
Initial listing status of the loan. (I think this has to do with whether the lender provided the entire loan or if the loan is across multiple lenders.)
- disbursement_method
Disbursement method of the loan.
- balance
Current balance on the loan.
- paid_total
Total that has been paid on the loan by the applicant.
- paid_principal
The difference between the original loan amount and the current balance on the loan.
- paid_interest
The amount of interest paid so far by the applicant.
- paid_late_fees
Late fees paid by the applicant.
Source
This data comes from Lending Club (https://www.lendingclub.com/info/statistics.action), which provides a very large, open set of data on the people who received loans through their platform.
Examples
loans_full_schema
London Borough Boundaries
Description
This dataset contains the coordinates of the boundaries of all 32 boroughs of the Greater London area.
Usage
london_boroughs
Format
A data frame with 45341 observations on the following 3 variables.
- borough
Name of the borough.
- x
The "easting" component of the coordinate, see details.
- y
The "northing" component of the coordinate, see details.
Details
Map data was made available through the Ordnance Survey Open Data initiative. The data use the National Grid coordinate system, based upon eastings (x) and northings (y) instead of longitude and latitude.
The borough variable covers all 32 boroughs in Greater London: Barking & Dagenham, Barnet, Bexley, Brent, Bromley, Camden, Croydon, Ealing, Enfield, Greenwich, Hackney, Hammersmith & Fulham, Haringey, Harrow, Havering, Hillingdon, Hounslow, Islington, Kensington & Chelsea, Kingston, Lambeth, Lewisham, Merton, Newham, Redbridge, Richmond, Southwark, Sutton, Tower Hamlets, Waltham Forest, Wandsworth, Westminster
Source
https://data.london.gov.uk/dataset/ordnance-survey-code-point
Contains Ordnance Survey data released under the Open Government License, OGL v2.
See Also
london_murders
Examples
library(dplyr)
library(ggplot2)
# Calculate number of murders by borough
london_murders_counts <- london_murders |>
count(borough)
london_murders_counts
## Not run:
# Add number of murders to geographic boundary data
london_boroughs_murders <- inner_join(london_boroughs, london_murders_counts, by = "borough")
# Map murders
ggplot(london_boroughs_murders) +
geom_polygon(aes(x = x, y = y, group = borough, fill = n), colour = "white") +
scale_fill_distiller(direction = 1) +
labs(x = "Easting", y = "Northing", fill = "Number of murders")
## End(Not run)
London Murders, 2006-2011
Description
This dataset contains the victim name, age, and location of every murder recorded in the Greater London area by the Metropolitan Police from January 1, 2006 to September 7, 2011.
Usage
london_murders
Format
A data frame with 838 observations on the following 5 variables.
- forename
First name(s) of the victim.
- age
Age of the victim.
- date
Date of the murder (YYYY-MM-DD).
- year
Year of the murder.
- borough
The London borough in which the murder took place. See the Details section for a list of all the boroughs.
Details
To visualize this dataset using a map, see the london_boroughs dataset, which contains the eastings and northings of polygons that define the boundaries of the 32 boroughs of Greater London.
The borough variable covers all 32 boroughs in Greater London: Barking & Dagenham, Barnet, Bexley, Brent, Bromley, Camden, Croydon, Ealing, Enfield, Greenwich, Hackney, Hammersmith & Fulham, Haringey, Harrow, Havering, Hillingdon, Hounslow, Islington, Kensington & Chelsea, Kingston, Lambeth, Lewisham, Merton, Newham, Redbridge, Richmond, Southwark, Sutton, Tower Hamlets, Waltham Forest, Wandsworth, Westminster
Source
https://www.theguardian.com/news/datablog/2011/oct/05/murder-london-list#data
References
Inspired by The Guardian Datablog.
Examples
library(dplyr)
library(ggplot2)
library(lubridate)
london_murders |>
mutate(
day_count = as.numeric(date - ymd("2006-01-01")),
date_cut = cut(day_count, seq(0, 2160, 90))
) |>
count(date_cut) |>
ggplot(aes(x = date_cut, y = n)) +
geom_col() +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
labs(x = "Date from 01/2006 - 09/2011", y = "Number of deaths per 90 days")
Output a message while inside a loop
Description
NOTE: utils::txtProgressBar() and utils::setTxtProgressBar() are better.
Output a message while inside a for loop to update the user on progress. This
function is useful in tracking progress when the number of iterations is large
or the procedures in each iteration take a long time.
Usage
loop(i, n = NULL, every = 1, extra = NULL)
Arguments
i |
The index value used in the loop. |
n |
The last entry in the loop. |
every |
The number of loops between messages. |
extra |
Additional information to print. |
Author(s)
David Diez
See Also
Examples
for (i in 1:160) {
loop(i, 160, 20, paste("iter", i))
}
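For comparison, the alternative recommended in the note above can be sketched with the utils progress bar functions. The loop body is a placeholder, and the variable total is an illustrative assumption standing in for real work.

```r
n <- 160
total <- 0

# style = 3 draws a percentage bar; min and max define the range of i
pb <- utils::txtProgressBar(min = 0, max = n, style = 3)
for (i in seq_len(n)) {
  total <- total + i                 # placeholder for the real work
  utils::setTxtProgressBar(pb, i)    # advance the bar to iteration i
}
close(pb)
```

Unlike loop(), the progress bar is created once before the loop and closed after it, so no per-iteration message formatting is needed.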
Create a Line Segment Plot
Description
Create a simple plot showing a line segment.
Usage
lsegments(
x = c(3, 7),
l = "o",
r = "c",
ticks = TRUE,
labs = 1,
add = 0,
ylim = c(-0.75, 0.25)
)
Arguments
x |
The endpoints of the interval. Values larger (smaller) than 999 (-999) will be interpreted as (negative) infinity. |
l |
Indicate whether the left end point should be open ( |
r |
Indicate whether the right end point should be open ( |
ticks |
Indicate whether to show tick marks ( |
labs |
The position for the point labels. Set to |
add |
Indicate whether the line segment should be added to an existing
plot ( |
ylim |
A vector of length 2 specifying the vertical plotting limits,
which may be useful for fine-tuning plots. The default is
|
Author(s)
David Diez
See Also
Examples
lsegments(c(2, 7), "o", "c", ylim = c(-0.3, 0.2))
lsegments(c(5, 7), "c", "c", ylim = c(-0.3, 0.2))
lsegments(c(4, 1000), "o", "o", ylim = c(-0.3, 0.2))
Influence of a Good Mood on Helpfulness
Description
This study investigated whether finding a coin influenced a person's likelihood of mailing a sealed but addressed letter that appeared to have been accidentally left in a conspicuous place. Several variables were collected during the experiment, including two randomized variables of whether there was a coin to be found and whether the letter already had a stamp on it.
Usage
mail_me
Format
A data frame with 42 observations on the following 4 variables.
- stamped
A factor with levels no and yes.
- found_coin
A factor with levels coin and no_coin.
- gender
A factor with levels female and male.
- mailed_letter
A factor with levels no and yes.
Details
The precise context was in a phone booth (this study is from the 1970s!), where a person who entered a phone booth would find a dime in the phone tray, which would be sufficient to pay for their phone call. There was also a letter next to the phone, which sometimes had a stamp on it.
Source
Levin PF, Isen AM. 1975. Studies on the Effect of Feeling Good on Helping. Sociometry 31(1), p141-147.
Examples
table(mail_me)
(x <- table(mail_me[, c("mailed_letter", "found_coin")]))
chisq.test(x)
(x <- table(mail_me[, c("mailed_letter", "stamped")]))
chisq.test(x)
m <- glm(mailed_letter ~ stamped + found_coin + gender,
data = mail_me,
family = binomial
)
summary(m)
Survey of Duke students and the area of their major
Description
Survey of 218 students, collecting information on their GPAs and their academic major.
Usage
major_survey
Format
A data frame with 218 observations on the following 2 variables.
- gpa
Grade point average (GPA).
- major
Area of academic major.
Examples
library(ggplot2)
ggplot(major_survey, aes(x = major, y = gpa)) +
geom_boxplot()
Regression tube
Description
Produce a linear, quadratic, or nonparametric tube for regression data.
Usage
makeTube(
x,
y,
Z = 2,
R = 1,
col = "#00000022",
border = "#00000000",
type = c("lin", "quad", "robust"),
stDev = c("constant", "linear", "other"),
length.out = 99,
bw = "default",
plotTube = TRUE,
addLine = TRUE,
...
)
Arguments
x |
|
y |
|
Z |
Number of standard deviations out from the regression line to extend the tube. |
R |
Control of how far the tube extends to the left and right. |
col |
Fill color of the tube. |
border |
Border color of the tube. |
type |
The type of model fit to the data. Here |
stDev |
Choices are constant variance ( |
length.out |
The number of observations used to build the regression model. This argument may be increased to increase the smoothing of a quadratic or nonparametric curve. |
bw |
Bandwidth used if |
plotTube |
Whether the tube should be plotted. |
addLine |
Whether the linear model should be plotted. |
... |
Additional arguments passed to the |
Value
X |
|
Y |
|
tubeX |
|
tubeY |
|
Author(s)
David Diez
See Also
Examples
# possum example
plot(possum$total_l, possum$head_l)
makeTube(possum$total_l, possum$head_l, 1)
makeTube(possum$total_l, possum$head_l, 2)
makeTube(possum$total_l, possum$head_l, 3)
# grades and TV example
plot(gradestv)
makeTube(gradestv$tv, gradestv$grades, 1.5)
plot(gradestv)
makeTube(gradestv$tv, gradestv$grades, 1.5, stDev = "o")
plot(gradestv)
makeTube(gradestv$tv, gradestv$grades, 1.5, type = "robust")
plot(gradestv)
makeTube(gradestv$tv, gradestv$grades, 1.5, type = "robust", stDev = "o")
# what can go wrong with a basic least squares model
# 1
x <- runif(100)
y <- 25 * x - 20 * x^2 + rnorm(length(x), sd = 1.5)
plot(x, y)
makeTube(x, y, type = "q")
# 2
x <- c(-0.6, -0.46, -0.091, runif(97))
y <- 25 * x + rnorm(length(x))
y[2] <- y[2] + 8
y[1] <- y[1] + 1
plot(x, y, ylim = range(y) + c(-10, 5))
makeTube(x, y)
# 3
x <- runif(100)
y <- 5 * x + rnorm(length(x), sd = x)
plot(x, y)
makeTube(x, y, stDev = "l", bw = 0.03)
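To make the idea of a tube concrete, the simplest case (a linear fit with constant variance) can be sketched in base R as a band extending Z residual standard deviations on either side of the fitted line. This is an illustration on simulated data and is not the code used by makeTube(); the names grid, center, upper, and lower are assumptions.

```r
# Simulated data; not one of the package datasets
set.seed(1)
x <- runif(100)
y <- 5 * x + rnorm(length(x), sd = 0.5)

Z <- 2                               # half-width in residual standard deviations
fit <- lm(y ~ x)
s <- summary(fit)$sigma              # residual standard deviation

# Evaluate the fitted line on a grid spanning the observed x range
grid <- data.frame(x = seq(min(x), max(x), length.out = 99))
center <- predict(fit, newdata = grid)
upper <- center + Z * s
lower <- center - Z * s

plot(x, y, pch = 20)
# Trace the upper edge left to right, then the lower edge right to left
polygon(c(grid$x, rev(grid$x)), c(upper, rev(lower)),
  col = "#00000022", border = NA
)
lines(grid$x, center)
```

With a constant standard deviation the band has the same vertical width everywhere, which is why it fits poorly in the third example above, where the spread of y grows with x.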
Malaria Vaccine Trial
Description
Volunteer patients were randomized into one of two experiment groups where they would receive an experimental vaccine or a placebo. They were subsequently exposed to a drug-sensitive strain of malaria and observed to see whether they came down with an infection.
Usage
malaria
Format
A data frame with 20 observations on the following 2 variables.
- treatment
Whether a person was given the experimental vaccine or a placebo.
- outcome
Whether the person got an infection or no infection.
Details
In this study, volunteer patients were randomized into one of two experiment groups: 14 patients received an experimental vaccine and 6 patients received a placebo vaccine. Nineteen weeks later, all 20 patients were exposed to a drug-sensitive strain of the malaria parasite; a drug-sensitive strain was used for ethical reasons, since it allows any infections to be treated effectively.
Source
Lyke et al. 2017. PfSPZ vaccine induces strain-transcending T cells and durable protection against heterologous controlled human malaria infection. PNAS 114(10):2711-2716. doi:10.1073/pnas.1615324114.
Examples
library(dplyr)
# Calculate conditional probabilities of infection after vaccine/placebo
malaria |>
count(treatment, outcome) |>
group_by(treatment) |>
mutate(prop = n / sum(n))
# Fisher's exact test
fisher.test(table(malaria))
Sample of 100 male heights
Description
Random sample based on Food Commodity Intake Database distribution
Usage
male_heights
Format
A data frame with 100 observations on the following variable.
- heights
a numeric vector
References
What We Eat In America - Food Commodity Intake Database. Available at https://fcid.foodrisk.org/.
Examples
male_heights
Random sample of adult male heights
Description
This sample is based on data from the USDA Food Commodity Intake Database.
Usage
male_heights_fcid
Format
A data frame with 100 observations on the following variable.
- height_inch
Height, in inches.
Source
Simulated based on data from USDA.
Examples
data(male_heights_fcid)
histPlot(male_heights_fcid$height_inch)
Sleep in Mammals
Description
This dataset includes data for 39 species of mammals distributed over 13 orders. The data were used for analyzing the relationship between constitutional and ecological factors and sleeping in mammals. Two qualitatively different sleep variables (dreaming and non dreaming) were recorded. Constitutional variables such as life span, body weight, brain weight and gestation time were evaluated. Ecological variables such as severity of predation, safety of sleeping place and overall danger were inferred from field observations in the literature.
Usage
mammals
Format
A data frame with 62 observations on the following 11 variables.
- species
Species of mammals
- body_wt
Total body weight of the mammal (in kg)
- brain_wt
Brain weight of the mammal (in kg)
- non_dreaming
Number of hours of non dreaming sleep
- dreaming
Number of hours of dreaming sleep
- total_sleep
Total number of hours of sleep
- life_span
Life span (in years)
- gestation
Gestation time (in days)
- predation
An index of how likely the mammal is to be preyed upon. 1 = least likely to be preyed upon. 5 = most likely to be preyed upon.
- exposure
An index of how exposed the mammal is during sleep. 1 = least exposed (e.g., sleeps in a well-protected den). 5 = most exposed.
- danger
An index of how much danger the mammal faces from other animals. This index is based upon Predation and Exposure. 1 = least danger from other animals. 5 = most danger from other animals.
Source
http://www.statsci.org/data/general/sleep.txt
References
T. Allison and D. Cicchetti, "Sleep in mammals: ecological and constitutional correlates," Science, vol. 194, pp. 732-734, 1976.
Examples
library(ggplot2)
ggplot(mammals, aes(x = log(body_wt), y = log(brain_wt))) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Log of body weight", y = "Log of brain weight")
Experiment with Mammogram Randomized
Description
An experiment where 89,835 women were randomized to either get a mammogram or a non-mammogram breast screening. The response measured was whether they had died from breast cancer within 25 years.
Usage
mammogram
Format
A data frame with 89835 observations on the following 2 variables.
- treatment
a factor with levels
control
mammogram
- breast_cancer_death
a factor with levels
no
yes
Source
Miller AB. 2014. Twenty five year follow-up for breast cancer incidence and mortality of the Canadian National Breast Screening Study: randomised screening trial. BMJ 2014;348:g366.
Examples
table(mammogram)
chisq.test(table(mammogram))
manhattan
Description
A data frame containing data on apartment rentals in Manhattan.
Usage
manhattan
Format
A data frame with 20 observations on the following 1 variable.
- rent
Monthly rent for a 1 bedroom apartment listed as "For rent by owner".
Examples
library(ggplot2)
ggplot(manhattan, aes(rent)) +
geom_histogram(color = "white", binwidth = 300) +
theme_minimal() +
labs(
title = "Rent in Manhattan",
subtitle = "1 Bedroom Apartments",
x = "Rent (in US$)",
caption = "Source: Craigslist"
)
New York City Marathon Times (outdated)
Description
Marathon times of male and female winners of the New York City Marathon 1970-1999. See nyc_marathon for a more updated dataset. We recommend not using this dataset since the data source has been taken off the web.
Usage
marathon
Format
A data frame with 60 observations on the following 3 variables.
- year
Year
- gender
Gender
- time
Running time (in hours)
Source
Data source has been removed.
Examples
library(ggplot2)
ggplot(marathon, aes(x = time)) +
geom_histogram(binwidth = 0.15)
ggplot(marathon, aes(y = time, x = gender)) +
geom_boxplot()
Wii Mario Kart auctions from Ebay
Description
Auction data from Ebay for the game Mario Kart for the Nintendo Wii. This data was collected in early October 2009.
Usage
mariokart
Format
A data frame with 143 observations on the following 12 variables. All prices are in US dollars.
- id
Auction ID assigned by Ebay.
- duration
Auction length, in days.
- n_bids
Number of bids.
- cond
Game condition, either new or used.
- start_pr
Start price of the auction.
- ship_pr
Shipping price.
- total_pr
Total price, which equals the auction price plus the shipping price.
- ship_sp
Shipping speed or method.
- seller_rate
The seller's rating on Ebay. This is the number of positive ratings minus the number of negative ratings for the seller.
- stock_photo
Whether the auction feature photo was a stock photo or not. If the picture was used in many auctions, then it was called a stock photo.
- wheels
Number of Wii wheels included in the auction. These are steering wheel attachments to make it seem as though you are actually driving in the game. When used with the controller, turning the wheel actually causes the character on screen to turn.
- title
The title of the auctions.
Details
There are several interesting features in the data. First off, note that there are two outliers in the data. These serve as a nice example of what one should do when encountering an outlier: examine the data point and remove it only if there is a good reason. In these two cases, we can see from the auction titles that they included other items in their auctions besides the game, which justifies removing them from the dataset.
This dataset includes all auctions for a full week in October 2009. Auctions were included in the dataset if they satisfied a number of conditions. (1) They were included in a search for "wii mario kart" on ebay.com, (2) items were in the Video Games > Games > Nintendo Wii section of Ebay, (3) the listing was an auction and not exclusively a "Buy it Now" listing (sellers sometimes offer an optional higher price for a buyer to end bidding and win the auction immediately, which is an optional Buy it Now auction), (4) the item listed was the actual game, (5) the item was being sold from the US, (6) the item had at least one bidder, (7) there were no other items included in the auction with the exception of racing wheels, either generic or brand-name being acceptable, and (8) the auction did not end with a Buy It Now option.
Source
Ebay.
Examples
library(ggplot2)
library(broom)
library(dplyr)
# Identify outliers
ggplot(mariokart, aes(x = total_pr, y = cond)) +
geom_boxplot()
# Replot without the outliers
mariokart |>
filter(total_pr < 80) |>
ggplot(aes(x = total_pr, y = cond)) +
geom_boxplot()
# Fit multiple regression models
mariokart_no <- mariokart |> filter(total_pr < 80)
m1 <- lm(total_pr ~ cond + stock_photo + duration + wheels, data = mariokart_no)
tidy(m1)
m2 <- lm(total_pr ~ cond + stock_photo + wheels, data = mariokart_no)
tidy(m2)
m3 <- lm(total_pr ~ cond + wheels, data = mariokart_no)
tidy(m3)
# Fit diagnostics
aug_m3 <- augment(m3)
ggplot(aug_m3, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(x = "Fitted values", y = "Residuals")
ggplot(aug_m3, aes(x = .fitted, y = abs(.resid))) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(x = "Fitted values", y = "Absolute value of residuals")
ggplot(aug_m3, aes(x = 1:nrow(aug_m3), y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(x = "Order of data collection", y = "Residuals")
ggplot(aug_m3, aes(x = cond, y = .resid)) +
geom_boxplot() +
labs(x = "Condition", y = "Residuals")
ggplot(aug_m3, aes(x = wheels, y = .resid)) +
geom_point() +
labs(
x = "Number of wheels", y = "Residuals",
title = "Notice curvature"
)
Marvel Cinematic Universe films
Description
A list of Marvel Cinematic Universe films through the Infinity saga. The Infinity saga is a 23-movie storyline spanning from Iron Man in 2008 to Endgame in 2019.
Usage
mcu_films
Format
A data frame with 23 rows and 7 variables.
- movie
Title of the movie.
- length_hrs
Length of the movie: hours portion.
- length_min
Length of the movie: minutes portion.
- release_date
Date the movie was released in the US.
- opening_weekend_us
Box office totals for opening weekend in the US.
- gross_us
All box office totals in US.
- gross_world
All box office totals world wide.
Details
Box office figures are not adjusted to a specific year. They are from the year the film was released.
Source
Examples
library(ggplot2)
library(scales)
ggplot(mcu_films, aes(x = opening_weekend_us, y = gross_us)) +
geom_point() +
labs(
title = "MCU Box Office Totals: Opening weekend vs. all-time",
x = "Opening weekend totals (USD in millions)",
y = "All-time totals (USD in millions)"
) +
scale_x_continuous(labels = label_dollar(scale = 1 / 1000000)) +
scale_y_continuous(labels = label_dollar(scale = 1 / 1000000))
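Because running time is stored in two columns (length_hrs and length_min), a single total has to be assembled before films can be compared by length. The following is a minimal sketch using a small made-up data frame; the titles and values are hypothetical and are not taken from mcu_films, and only the column names mirror the Format section.

```r
# Hypothetical rows mirroring the documented column names; not real mcu_films data.
films <- data.frame(
  movie = c("Film A", "Film B", "Film C"),
  length_hrs = c(2, 1, 3),
  length_min = c(6, 52, 1)
)

# Total running time in minutes: 60 minutes per hour plus the minutes portion.
films$length_total_min <- 60 * films$length_hrs + films$length_min

# Sort from longest to shortest.
films[order(films$length_total_min, decreasing = TRUE), ]
```

The same arithmetic should carry over to the real dataset, assuming both columns are stored as numbers.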
President's party performance and unemployment rate
Description
Covers midterm elections.
Usage
midterms_house
Format
A data frame with 29 observations on the following 5 variables.
- year
Year.
- potus
The president in office.
- party
President's party: Democrat or Republican.
- unemp
Unemployment rate.
- house_change
Change in House seats for the President's party.
Details
An older version of this data is at unemploy_pres.
Source
Wikipedia.
Examples
library(ggplot2)
ggplot(midterms_house, aes(x = unemp, y = house_change)) +
geom_point()
Migraines and acupuncture
Description
Experiment involving acupuncture and sham acupuncture (as placebo) in the treatment of migraines.
Usage
migraine
Format
A data frame with 89 observations on the following 2 variables.
- group
a factor with levels
control
treatment
- pain_free
a factor with levels
no
yes
Source
G. Allais et al. Ear acupuncture in the treatment of migraine attacks: a randomized trial on the efficacy of appropriate versus inappropriate acupoints. In: Neurological Sci. 32.1 (2011), pp. 173-175.
Examples
migraine
US Military Demographics
Description
This dataset contains demographic information on every member of the US armed forces including gender, race, and rank.
Usage
military
Format
A data frame with 1,414,593 observations on the following 6 variables.
- grade
The status of the service member as enlisted, officer, or warrant officer.
- branch
The branch of the armed forces: air force, army, marine corps, navy.
- gender
Whether the service member is female or male.
- race
The race identified by the service member: ami/aln (american indian/alaskan native), asian, black, multi (multi-ethnic), p/i (pacific islander), unk (unknown), or white.
- hisp
Whether a service member identifies with being hispanic (TRUE) or not (FALSE).
- rank
The numeric rank of the service member (higher number indicates higher rank).
Details
The branches covered by this dataset include the Army, Navy, Air Force, and Marine Corps. Demographic information on the Coast Guard is contained in the original dataset but has not been included here.
Source
Data provided by the Department of Defense and made available at https://catalog.data.gov/dataset/personnel-trends-by-gender-race, retrieved 2012-02-20.
Examples
## Not run:
library(dplyr)
library(ggplot2)
library(forcats)
# Proportion of females in military branches
military |>
ggplot(aes(x = branch, fill = gender)) +
geom_bar(position = "fill") +
labs(
x = "Branch", y = "Proportion", fill = "Gender",
title = "Proportion of females in military branches"
)
# Proportion of army officer females across ranks
military |>
filter(
grade == "officer",
branch == "army"
) |>
ggplot(aes(x = factor(rank), fill = fct_rev(gender))) +
geom_bar(position = "fill") +
labs(
x = "Rank", y = "Proportion", fill = "Gender",
title = "Proportion of army officer females across ranks"
)
## End(Not run)
Salary data for Major League Baseball (2010)
Description
Salary data for Major League Baseball players in the year 2010.
Usage
mlb
Format
A data frame with 828 observations on the following 4 variables.
- player
Player name
- team
Team
- position
Field position
- salary
Salary (in $1000s)
Source
https://databases.usatoday.com/mlb-salaries/, retrieved 2011-02-23.
Examples
# _____ Basic Histogram _____ #
hist(mlb$salary / 1000,
breaks = 15,
main = "", xlab = "Salary (millions of dollars)", ylab = "",
axes = FALSE,
col = "#22558844"
)
axis(1, seq(0, 40, 10))
axis(2, c(0, 500))
axis(2, seq(100, 400, 100), rep("", 4), tcl = -0.2)
# _____ Histogram on Log Scale _____ #
hist(log(mlb$salary / 1000),
breaks = 15,
main = "", xlab = "log(Salary)", ylab = "",
axes = FALSE, col = "#22558844"
)
axis(1)
axis(2, seq(0, 300, 100))
# _____ Box plot of log(salary) against position _____ #
boxPlot(log(mlb$salary / 1000), mlb$position, horiz = TRUE, ylab = "")
Batter Statistics for 2018 Major League Baseball (MLB) Season
Description
Batter statistics for 2018 Major League Baseball season.
Usage
mlb_players_18
Format
A data frame with 1270 observations on the following 19 variables.
- name
Player name
- team
Team abbreviation
- position
Position abbreviation: 1B = first base, 2B = second base, 3B = third base, C = catcher, CF = center field (outfield), DH = designated hitter, LF = left field (outfield), P = pitcher, RF = right field (outfield), SS = shortstop.
- games
Number of games played.
- AB
At bats.
- R
Runs.
- H
Hits.
- doubles
Doubles.
- triples
Triples.
- HR
Home runs.
- RBI
Runs batted in.
- walks
Walks.
- strike_outs
Strike outs.
- stolen_bases
Stolen bases.
- caught_stealing_base
Number of times caught stealing a base.
- AVG
Batting average.
- OBP
On-base percentage.
- SLG
Slugging percentage.
- OPS
On-base percentage plus slugging percentage.
Source
See Also
Examples
d <- subset(mlb_players_18, !position %in% c("P", "DH") & AB >= 100)
dim(d)
# _____ Per Position, No Further Grouping _____ #
plot(d$OBP ~ as.factor(d$position))
model <- lm(OBP ~ as.factor(position), d)
summary(model)
anova(model)
# _____ Simplified Analysis, Fewer Positions _____ #
pos <- list(
c("LF", "CF", "RF"),
c("1B", "2B", "3B", "SS"),
"C"
)
POS <- c("OF", "IF", "C")
table(d$position)
# _____ On-Base Percentage Across Positions _____ #
out <- c()
gp <- c()
for (i in seq_along(pos)) {
these <- which(d$position %in% pos[[i]])
out <- c(out, d$OBP[these])
gp <- c(gp, rep(POS[i], length(these)))
}
plot(out ~ as.factor(gp))
summary(lm(out ~ as.factor(gp)))
anova(lm(out ~ as.factor(gp)))
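The Format section describes OPS as on-base percentage plus slugging percentage. As a quick illustration of that definition, the sketch below recomputes it on a made-up data frame; the players and values are hypothetical, and only the column names OBP, SLG and OPS follow the documentation.

```r
# Hypothetical batting lines; column names follow the Format section.
batters <- data.frame(
  name = c("Player A", "Player B"),
  OBP = c(0.350, 0.310),
  SLG = c(0.480, 0.405)
)

# OPS is documented as on-base percentage plus slugging percentage.
batters$OPS_check <- batters$OBP + batters$SLG

batters
```

Comparing a column like OPS_check against the stored OPS is one way to sanity-check the data, bearing in mind that published values are usually rounded.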
Major League Baseball Teams Data.
Description
A subset of data on Major League Baseball teams from Lahman's Baseball Database. The full dataset is available in the Lahman R package.
Usage
mlb_teams
Format
A data frame with 2784 rows and 41 variables.
- year
Year of play.
- league_id
League the team plays in with levels AL (American League) and NL (National League).
- division_id
Division the team plays in with levels W (west), E (east) and C (central).
- rank
Team's rank in their division at the end of the regular season.
- games_played
Games played.
- home_games
Games played at home.
- wins
Number of games won.
- losses
Number of games lost.
- division_winner
Did the team win their division? Levels of Y (yes) and N (no).
- wild_card_winner
Was the team a wild card winner? Levels of Y (yes) and N (no).
- league_winner
Did the team win their league? Levels of Y (yes) and N (no).
- world_series_winner
Did the team win the World Series? Levels of Y (yes) and N (no).
- runs_scored
Number of runs scored during the season.
- at_bats
Number of at bats during the season.
- hits
Number of hits during the season. Includes singles, doubles, triples and homeruns.
- doubles
Number of doubles hit.
- triples
Number of triples hit.
- homeruns
Homeruns by batters.
- walks
Number of walks.
- strikeouts_by_batters
Number of batters struck out.
- stolen_bases
Number of stolen bases.
- caught_stealing
Number of base runners caught stealing.
- batters_hit_by_pitch
Number of batters hit by a pitch.
- sacrifice_flies
Number of sacrifice flies.
- opponents_runs_scored
Number of runs scored by opponents.
- earned_runs_allowed
Number of earned runs allowed.
- earned_run_average
Earned run average.
- complete_games
Number of games where a single pitcher played the entire game.
- shutouts
Number of shutouts.
- saves
Number of saves.
- outs_pitches
Number of outs pitched for the season (number of innings pitched times 3).
- hits_allowed
Number of hits made by opponents.
- homeruns_allowed
Number of homeruns hit by opponents.
- walks_allowed
Number of opponents who were walked.
- strikeouts_by_pitchers
Number of opponents who were struck out.
- errors
Number of errors.
- double_plays
Number of double plays.
- fielding_percentage
Team's fielding percentage.
- team_name
Full name of team.
- ball_park
Home ballpark name.
- home_attendance
Home attendance total.
Source
Lahman's Baseball Database
Examples
library(dplyr)
# List the World Series winning teams for each year
mlb_teams |>
filter(world_series_winner == "Y") |>
select(year, team_name, ball_park)
# List the teams with their average number of wins and losses
mlb_teams |>
group_by(team_name) |>
summarize(mean_wins = mean(wins), mean_losses = mean(losses)) |>
arrange(team_name)
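The outs_pitches column is described as innings pitched times 3, so innings pitched can be recovered by dividing by 3. The sketch below does that on made-up numbers and also applies the conventional definition of earned run average (earned runs per nine innings). The team rows are hypothetical, and whether the stored earned_run_average column was computed exactly this way is an assumption.

```r
# Hypothetical team seasons; column names follow the Format section.
teams <- data.frame(
  team_name = c("Team A", "Team B"),
  wins = c(90, 72),
  losses = c(72, 90),
  earned_runs_allowed = c(600, 700),
  outs_pitches = c(4350, 4320)
)

# Outs pitched is innings pitched times 3, so divide by 3 to get innings.
teams$innings_pitched <- teams$outs_pitches / 3

# Conventional ERA: earned runs allowed per nine innings pitched.
teams$era_check <- 9 * teams$earned_runs_allowed / teams$innings_pitched

# Proportion of decided games that were won.
teams$win_prop <- teams$wins / (teams$wins + teams$losses)

teams
```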
Major League Baseball Player Hitting Statistics for 2010
Description
Major League Baseball Player Hitting Statistics for 2010.
Usage
mlbbat10
Format
A data frame with 1199 observations on the following 19 variables.
- name
Player name
- team
Team abbreviation
- position
Player position
- game
Number of games
- at_bat
Number of at bats
- run
Number of runs
- hit
Number of hits
- double
Number of doubles
- triple
Number of triples
- home_run
Number of home runs
- rbi
Number of runs batted in
- total_base
Total bases, computed as 3*HR + 2*3B + 1*2B + H
- walk
Number of walks
- strike_out
Number of strikeouts
- stolen_base
Number of stolen bases
- caught_stealing
Number of times caught stealing
- obp
On base percentage
- slg
Slugging percentage (total_base / at_bat)
- bat_avg
Batting average
Source
https://www.mlb.com, retrieved 2011-04-22.
Examples
library(ggplot2)
library(dplyr)
library(scales)
mlbbat10_200 <- mlbbat10 |>
filter(at_bat > 200)
# On-base percentage across positions
ggplot(mlbbat10_200, aes(x = position, y = obp, fill = position)) +
geom_boxplot(show.legend = FALSE) +
scale_y_continuous(labels = label_number(suffix = "%", accuracy = 0.01)) +
labs(
title = "On-base percentage across positions",
y = "On-base percentage",
x = "Position"
)
# Batting average across positions
ggplot(mlbbat10_200, aes(x = bat_avg, fill = position)) +
geom_density(alpha = 0.5) +
labs(
title = "Batting average across positions",
fill = NULL,
y = "Density",
x = "Batting average"
)
# Mean number of home runs across positions
mlbbat10_200 |>
group_by(position) |>
summarise(mean_home_run = mean(home_run)) |>
ggplot(aes(x = position, y = mean_home_run, fill = position)) +
geom_col(show.legend = FALSE) +
labs(
title = "Mean number of home runs across positions",
y = "Home runs",
x = "Position"
)
# Runs batted in across positions
ggplot(mlbbat10_200, aes(x = position, y = rbi, fill = position)) +
geom_boxplot(show.legend = FALSE) +
labs(
title = "Runs batted in across positions",
y = "Runs batted in",
x = "Position"
)
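The Format section defines total bases from hits, doubles, triples and home runs, and slugging percentage as total bases divided by at bats. The sketch below works through both formulas on a made-up data frame; the players and counts are hypothetical, and only the column names follow the documentation. Since a hit already counts one base, a double adds one extra base, a triple two, and a home run three.

```r
# Hypothetical hitting lines; column names follow the Format section.
hitters <- data.frame(
  name = c("Player A", "Player B"),
  at_bat = c(500, 420),
  hit = c(150, 105),
  double = c(30, 20),
  triple = c(5, 1),
  home_run = c(25, 8)
)

# Total bases: every hit is worth one base, plus the extra bases
# for doubles (1), triples (2) and home runs (3).
hitters$total_base_check <- 3 * hitters$home_run + 2 * hitters$triple +
  1 * hitters$double + hitters$hit

# Slugging percentage: total bases per at bat.
hitters$slg_check <- hitters$total_base_check / hitters$at_bat

hitters
```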
Minneapolis police use of force data.
Description
Police use of force data from Minneapolis, covering 2016 through August 2021.
Usage
mn_police_use_of_force
Format
A data frame with 12925 rows and 13 variables.
- response_datetime
DateTime of police response.
- problem
Problem that required police response.
- is_911_call
Whether response was initiated by call to 911.
- primary_offense
Offense of subject.
- subject_injury
Whether subject was injured: Yes, No, or null.
- force_type
Type of police force used.
- force_type_action
Detail of police force used.
- race
Race of subject.
- sex
Gender of subject.
- age
Age of subject.
- type_resistance
Resistance to police by subject.
- precinct
Precinct where response occurred.
- neighborhood
Neighborhood where response occurred.
Source
Examples
library(dplyr)
library(ggplot2)
# List percent of total for each race
mn_police_use_of_force |>
count(race) |>
mutate(percent = round(n / sum(n) * 100, 2)) |>
arrange(desc(percent))
# Display use of force count by three races
race_sub <- c("Asian", "White", "Black")
ggplot(
mn_police_use_of_force |> filter(race %in% race_sub),
aes(force_type, ..count..)
) +
geom_point(stat = "count", size = 4) +
coord_flip() +
facet_grid(race ~ .) +
labs(
x = "Force Type",
y = "Number of Incidents"
)
movies
Description
A dataset with information about movies released in 2003.
Usage
movies
Format
A data frame with 140 observations on the following 5 variables.
- movie
Title of the movie.
- genre
Genre of the movie.
- score
Critics score of the movie on a 0 to 100 scale.
- rating
MPAA rating of the film.
- box_office
Millions of dollars earned at the box office in the US and Canada.
Source
Investigating Statistical Concepts, Applications and Methods
Examples
library(ggplot2)
ggplot(movies, aes(score, box_office, color = genre)) +
geom_point() +
theme_minimal() +
labs(
title = "Does a critic score predict box office earnings?",
x = "Critic rating",
y = "Box office earnings (millions US$)",
color = "Genre"
)
Medial temporal lobe (MTL) and other data for 35 participants
Description
The data are from a convenience sample of 25 women and 10 men who were middle-aged or older. The purpose of the study was to understand the relationship between sedentary behavior and thickness of the medial temporal lobe (MTL) in the brain.
Usage
mtl
Format
A data frame with 35 observations on the following 23 variables.
- subject
ID for the individual.
- sex
Gender, which takes values F (female) or M (male).
- ethnic
Ethnicity, simplified to Caucasian and Other.
- educ
Years of education.
- e4grp
APOE-4 status, taking a value of E4 or Non-E4.
- age
Age, in years.
- mmse
Score from the Mini-Mental State Examination, which is a global cognition evaluation.
- ham_a
Score on the Hamilton Rating Scale for anxiety.
- ham_d
Score on the Hamilton Rating Scale for depression.
- dig_sym
We (the authors of this R package) are unsure as to the meaning of this variable.
- delay_vp
We (the authors of this R package) are unsure as to the meaning of this variable.
- bfr_selective_reminding_delayed
We (the authors of this R package) are unsure as to the meaning of this variable.
- sitting
Self-reported time sitting per day, averaged to the nearest hour.
- met_minwk
Metabolic equivalent units score (activity level). A score of 0 means "no activity" while 3000 is considered "high activity".
- ipa_qgrp
Classification of met_minwk into Low or High.
- aca1
Thickness of the CA1 subregion of the MTL.
- aca23dg
Thickness of the CA23DG subregion of the MTL.
- ae_cort
Thickness of a subregion of the MTL.
- a_fusi_cort
Thickness of the fusiform gyrus subregion of the MTL.
- a_ph_cort
Thickness of the perirhinal cortex subregion of the MTL.
- a_pe_cort
Thickness of the entorhinal cortex subregion of the MTL.
- asubic
Thickness of the subiculum subregion of the MTL.
- total
Total MTL thickness.
Source
Siddarth P, Burggren AC, Eyre HA, Small GW, Merrill DA. 2018. Sedentary behavior associated with reduced medial temporal lobe thickness in middle-aged and older adults. PLoS ONE 13(4): e0195549. doi:10.1371/journal.pone.0195549.
Thank you to Professor Silas Bergen of Winona State University for pointing us to this dataset!
References
A New York Times article references this study. https://www.nytimes.com/2018/04/19/opinion/standing-up-at-your-desk-could-make-you-smarter.html
Examples
library(ggplot2)
ggplot(mtl, aes(x = ipa_qgrp, y = met_minwk)) +
geom_boxplot()
Data for 20 metropolitan areas
Description
Population, percent in poverty, percent unemployment, and murder rate.
Usage
murders
Format
A data frame with 20 metropolitan areas on the following 4 variables.
- population
Population.
- perc_pov
Percent in poverty.
- perc_unemp
Percent unemployed.
- annual_murders_per_mil
Number of murders per year per million people.
Source
We do not have provenance for these data, and hence recommend not using them for analysis.
Examples
library(ggplot2)
ggplot(murders, aes(x = perc_pov, y = annual_murders_per_mil)) +
geom_point() +
labs(
x = "Percent in poverty",
y = "Number of murders per year per million people"
)
Custom PDF function
Description
A similar function to pdf and png, except that different defaults are provided, including for the plotting parameters.
Usage
myPDF(
fileName,
width = 5,
height = 3,
mar = c(3.9, 3.9, 1, 1),
mgp = c(2.8, 0.55, 0),
las = 1,
tcl = -0.3,
...
)
Arguments
fileName |
File name for the image to be output. The name should end in
|
width |
The width of the image file (inches). Default: 5. |
height |
The height of the image file (inches). Default: 3. |
mar |
Plotting margins. To change, input a numerical vector of length 4. |
mgp |
Margin graphing parameters. To change, input a numerical vector of length 3. The first argument specifies where x and y labels are placed; the second specifies where the axis labels are placed; and the third specifies how far to pull the entire axis from the plot. |
las |
Orientation of axis labels. Input |
tcl |
The tick mark length as a proportion of text height. The default is -0.3. |
... |
Additional arguments to |
Author(s)
David Diez
See Also
Examples
# save a plot to a PDF
# myPDF("myPlot.pdf")
histPlot(mariokart$total_pr)
# dev.off()
# save a plot to a PNG
# myPNG("myPlot.png")
histPlot(mariokart$total_pr)
# dev.off()
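The Description says this function behaves like pdf but with different defaults, including for the plotting parameters. To make that idea concrete, here is a rough sketch of how such a wrapper could be written with base R. The function my_pdf_sketch is a hypothetical stand-in and is not the package's actual source; forwarding the extra arguments to par() is an illustrative choice, not documented behavior. Only the default values are copied from the Usage section.

```r
# Hypothetical stand-in for illustration; this is NOT the source of myPDF().
my_pdf_sketch <- function(file_name, width = 5, height = 3,
                          mar = c(3.9, 3.9, 1, 1),
                          mgp = c(2.8, 0.55, 0),
                          las = 1, tcl = -0.3, ...) {
  # Open a PDF graphics device with the requested size (in inches).
  grDevices::pdf(file_name, width = width, height = height)
  # Apply the plotting parameters to the newly opened device.
  graphics::par(mar = mar, mgp = mgp, las = las, tcl = tcl, ...)
  invisible(file_name)
}

# Write a small histogram to a temporary file.
out_file <- file.path(tempdir(), "my_plot_sketch.pdf")
my_pdf_sketch(out_file)
graphics::hist(c(1, 2, 2, 3, 3, 3, 4, 4, 5), main = "", xlab = "Value")
grDevices::dev.off() # close the device so the file is written
```

As with the examples above, the plot only reaches the file once dev.off() closes the device.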
NBA Finals History
Description
This dataset contains information about the teams who played in the NBA Finals from 1950 - 2022.
Usage
nba_finals
Format
A data frame with 73 rows and 9 variables:
- year
The year in which the Finals took place.
- winner
The team who won the series.
- western_wins
Number of series wins by the Western Conference Champions.
- eastern_wins
Number of series wins by the Eastern Conference Champions.
- western_champions
Team that won the Western Conference title and played in the Finals.
- eastern_champions
Team that won the Eastern Conference title and played in the Finals.
- western_coach
Coach of the Western Conference champions.
- eastern_coach
Coach of the Eastern Conference champions.
- home_court
Which conference held home court advantage for the series.
Source
Wikipedia: List of NBA Champions
Examples
library(dplyr)
library(ggplot2)
library(tidyr)
# Top 5 Appearing Coaches
nba_finals |>
pivot_longer(
cols = c("western_coach", "eastern_coach"),
names_to = "conference", values_to = "coach"
) |>
count(coach, sort = TRUE) |>
slice_head(n = 5)
# Top 5 Winning Coaches
nba_finals |>
mutate(
winning_coach = case_when(
western_wins == 4 ~ western_coach,
eastern_wins == 4 ~ eastern_coach
)
) |>
count(winning_coach, sort = TRUE) |>
slice_head(n = 5)
NBA Finals Team Summary
Description
A dataset with individual team summaries for the NBA Finals series from 1950 to 2022. To win the Finals, a team must win 4 games. The maximum number of games in a series is 7.
Usage
nba_finals_teams
Format
A data frame with 33 rows and 7 variables:
- team
Team name.
- win
Number of NBA Championships won.
- loss
Number of NBA Championships lost.
- apps
Number of NBA Finals appearances.
- pct
Win percentage.
- years_won
Years in which the team won a Championship.
- years_lost
Years in which the team lost a Championship.
Details
Notes:
The Chicago Stags folded in 1950, the Washington Capitols in 1951 and the Baltimore Bullets in 1954.
This list uses current team names. For example, the Seattle SuperSonics are not on the list as that team moved and became the Oklahoma City Thunder.
Source
Examples
library(ggplot2)
library(dplyr)
library(openintro)
teams_with_apps <- nba_finals_teams |>
filter(apps != 0)
ggplot(teams_with_apps, aes(x = win)) +
geom_histogram(binwidth = 2) +
labs(
title = "Number of NBA Finals series wins",
x = "Number of wins",
y = "Number of teams"
)
ggplot(teams_with_apps, aes(x = apps, y = win)) +
geom_point(alpha = 0.3) +
labs(
title = "Can we predict how many NBA Championships a
team has based on the number of appearances?",
x = "Number of NBA Finals appearances",
y = "Number of NBA Finals series wins"
)
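The win, loss, apps and pct columns are related: appearances are the sum of series won and lost, and a win percentage follows from wins over appearances. The sketch below illustrates those relationships on made-up rows. The team names and counts are hypothetical, and whether the stored pct column is a proportion or is scaled to 0-100 is not stated in the documentation, so the proportion scale used here is an assumption.

```r
# Hypothetical team summaries; column names follow the Format section.
teams <- data.frame(
  team = c("Team A", "Team B", "Team C", "Team D"),
  win = c(3, 0, 1, 0),
  loss = c(1, 2, 1, 0)
)

# Appearances are the number of Finals series won plus the number lost.
teams$apps <- teams$win + teams$loss

# Win proportion; teams with no appearances get NA instead of 0 / 0.
teams$pct_check <- ifelse(teams$apps > 0, teams$win / teams$apps, NA_real_)

teams
```

Teams with zero appearances are the reason the examples above filter on apps != 0 before plotting.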
NBA Player heights from 2008-9
Description
Heights of all NBA players from the 2008-9 season.
Usage
nba_heights
Format
A data frame with 435 observations (players) on the following 4 variables.
- last_name
Last name.
- first_name
First name.
- h_meters
Height, in meters.
- h_in
Height, in inches.
Source
Collected from NBA.
Examples
qqnorm(nba_heights$h_meters)
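The dataset stores each height twice, in meters and in inches. The two scales are linked by the exact conversion 1 inch = 0.0254 meters, which the short sketch below applies to a few made-up heights (the values are hypothetical, not rows from nba_heights).

```r
# Hypothetical heights in inches.
h_in <- c(72, 78, 84)

# One inch is exactly 0.0254 meters.
h_meters <- h_in * 0.0254

round(h_meters, 2)
```

Stored values may be rounded, so an exact match between the two columns in the real data should not be expected.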
NBA Players for the 2018-2019 season
Description
Summary information from the NBA players for the 2018-2019 season.
Usage
nba_players_19
Format
A data frame with 494 observations on the following 7 variables.
- first_name
First name.
- last_name
Last name.
- team
Team name
- team_abbr
3-letter team abbreviation.
- position
Player position.
- number
Jersey number.
- height
Height, in inches.
Source
Examples
hist(nba_players_19$height, 20)
table(nba_players_19$team)
North Carolina births, 1000 cases
Description
In 2004, the state of North Carolina released to the public a large dataset containing information on births recorded in this state. This dataset has been of interest to medical researchers who are studying the relation between habits and practices of expectant mothers and the birth of their children. This is a random sample of 1,000 cases from this dataset.
Usage
ncbirths
Format
A data frame with 1000 observations on the following 13 variables.
- fage
Father's age in years.
- mage
Mother's age in years.
- mature
Maturity status of mother.
- weeks
Length of pregnancy in weeks.
- premie
Whether the birth was classified as premature (premie) or full-term.
- visits
Number of hospital visits during pregnancy.
- gained
Weight gained by mother during pregnancy in pounds.
- weight
Weight of the baby at birth in pounds.
- lowbirthweight
Whether baby was classified as low birthweight (low) or not (not low).
- gender
Gender of the baby, female or male.
- habit
Status of the mother as a nonsmoker or a smoker.
- marital
Whether mother is married or not married at birth.
- whitemom
Whether mom is white or not white.
See Also
We do not have ideal provenance for these data. For a better documented and more recent dataset on a similar topic with similar variables, see births14.
Examples
library(ggplot2)
ggplot(ncbirths, aes(x = habit, y = weight)) +
geom_boxplot() +
labs(x = "Smoking status of mother", y = "Birth weight of baby (in lbs)")
ggplot(ncbirths, aes(x = whitemom, y = visits)) +
geom_boxplot() +
labs(x = "Mother's race", y = "Number of doctor visits during pregnancy")
ggplot(ncbirths, aes(x = mature, y = gained)) +
geom_boxplot() +
labs(x = "Mother's age category", y = "Weight gained during pregnancy")
Normal distribution tails
Description
Produce a normal (or t) distribution and shaded tail.
Usage
normTail(
m = 0,
s = 1,
L = NULL,
U = NULL,
M = NULL,
df = 1000,
curveColor = 1,
border = 1,
col = "#CCCCCC",
xlim = NULL,
ylim = NULL,
xlab = "",
ylab = "",
digits = 2,
axes = 1,
detail = 999,
xLab = c("number", "symbol"),
cex.axis = 1,
xAxisIncr = 1,
add = FALSE,
...
)
Arguments
m |
Numerical value for the distribution mean. |
s |
Numerical value for the distribution standard deviation. |
L |
Numerical value representing the cutoff for a shaded lower tail. |
U |
Numerical value representing the cutoff for a shaded upper tail. |
M |
Numerical value representing the cutoff for a shaded central region. |
df |
Numerical value describing the degrees of freedom. Default is 1000. |
curveColor |
The color for the distribution curve. |
border |
The color for the border of the shaded area. |
col |
The color for filling the shaded area. |
xlim |
Limits for the x axis. |
ylim |
Limits for the y axis. |
xlab |
A title for the x axis. |
ylab |
A title for the y axis. |
digits |
The maximum number of digits past the decimal to use in axes values. |
axes |
A numeric value denoting whether to draw both axes ( |
detail |
A number describing the number of points to use in drawing the normal curve. Smaller values correspond to a less smooth curve but reduced memory usage in the final file. |
xLab |
If |
cex.axis |
Numerical value controlling the size of the axis labels. |
xAxisIncr |
A number describing how often axis labels are placed,
scaled by standard deviations. This argument is ignored if |
add |
Boolean indicating whether to add this normal curve to the existing plot. |
... |
Additional arguments to |
Author(s)
David Diez
See Also
Examples
normTail(3, 2, 5)
normTail(3, 2, 1, xLab = "symbol")
normTail(3, 2, M = 1:2, xLab = "symbol", cex.axis = 0.8)
normTail(3, 2, U = 5, axes = FALSE)
normTail(L = -1, U = 2, M = c(0, 1), axes = 3, xAxisIncr = 2)
normTail(
L = -1, U = 2, M = c(0, 1),
xLab = "symbol", cex.axis = 0.8, xAxisIncr = 2
)
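The plots above shade a lower tail, an upper tail, or a central region of the curve. The sizes of those shaded areas can be computed separately with base R's pnorm(). The sketch below does this for a normal distribution with mean 3 and standard deviation 2, matching the first few calls above; it assumes the normal case and does not call normTail itself.

```r
m <- 3 # distribution mean
s <- 2 # distribution standard deviation

# Area of the lower tail below the cutoff 5 (compare L = 5).
lower_area <- pnorm(5, mean = m, sd = s)

# Area of the upper tail above the cutoff 5 (compare U = 5).
upper_area <- pnorm(5, mean = m, sd = s, lower.tail = FALSE)

# Area of the central region between 1 and 2 (compare M = 1:2).
central_area <- pnorm(2, mean = m, sd = s) - pnorm(1, mean = m, sd = s)

c(lower = lower_area, upper = upper_area, central = central_area)
```

For a t distribution with few degrees of freedom, pt() applied to standardized cutoffs would be the analogous tool.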
Nuclear Arms Reduction Survey
Description
A simple random sample of 1,028 US adults in March 2013 found that 56% support nuclear arms reduction.
Usage
nuclear_survey
Format
A data frame with 1028 observations on the following variable.
- arms_reduction
Responses of favor or against.
Source
Gallup report: In U.S., 56 percent Favor U.S.-Russian Nuclear Arms Reductions. Available at https://news.gallup.com/poll/161198/favor-russian-nuclear-arms-reductions.aspx.
Examples
table(nuclear_survey)
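Given the figures in the Description (1,028 respondents, 56% in favor), a normal-approximation confidence interval for the population proportion can be worked out by hand. The sketch below uses only those two published numbers rather than the dataset itself; the 95% level is an arbitrary choice for illustration.

```r
n <- 1028     # sample size from the Description
p_hat <- 0.56 # reported proportion in favor

# Standard error of a sample proportion.
se <- sqrt(p_hat * (1 - p_hat) / n)

# Critical value for a 95% confidence level.
z <- qnorm(0.975)

# Normal-approximation confidence interval.
ci <- p_hat + c(-1, 1) * z * se

round(ci, 3)
```

Because 56% is itself a rounded figure, an interval computed from the raw responses could differ slightly.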
nyc
Description
Zagat is a public survey where anyone can provide scores to a restaurant. The scores from the general public are then gathered to produce ratings. This dataset contains a list of 168 NYC restaurants and their Zagat Ratings.
Usage
nyc
Format
A data frame with 168 observations on the following 6 variables.
- restaurant
Name of the restaurant.
- price
Price of a meal for two, with drinks, in US $.
- food
Zagat rating for food.
- decor
Zagat rating for decor.
- service
Zagat rating for service.
- east
Indicator variable for location of the restaurant. 0 = west of 5th Avenue, 1 = east of 5th Avenue.
Details
For each category the scales are as follows:
0 - 9: poor to fair
10 - 15: fair to good
16 - 19: good to very good
20 - 25: very good to excellent
25 - 30: extraordinary to perfection
Examples
library(dplyr)
library(ggplot2)
location_labs <- c("West", "East")
names(location_labs) <- c(0, 1)
ggplot(nyc, mapping = aes(x = price, group = east, fill = east)) +
geom_boxplot(alpha = 0.5) +
facet_grid(east ~ ., labeller = labeller(east = location_labs)) +
labs(
title = "Is food more expensive east of 5th Avenue?",
x = "Price (US$)"
) +
guides(fill = "none") +
theme_minimal() +
theme(axis.text.y = element_blank())
New York City Marathon Times
Description
Marathon times of runners in the Men and Women divisions of the New York City Marathon, 1970 - 2023.
Usage
nyc_marathon
Format
A data frame with 108 observations on the following 7 variables.
- year
Year of marathon.
- name
Name of winner.
- country
Country of winner.
- time
Running time (HH:MM:SS).
- time_hrs
Running time (in hours).
- division
Division: Men or Women.
- note
Note about the race or the winning time.
Source
Wikipedia, List of winners of the New York City Marathon. Retrieved 6 November, 2023.
Examples
library(ggplot2)
ggplot(nyc_marathon, aes(x = year, y = time_hrs, color = division, shape = division)) +
geom_point()
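The time_hrs variable expresses the HH:MM:SS running time in decimal hours. A minimal sketch of that conversion follows; hms_to_hours() is a hypothetical helper, not a function in this package, and it assumes the times are available as character strings. The input times are made up.

```r
# Hypothetical helper (not part of openintro): convert "HH:MM:SS" strings
# to decimal hours.
hms_to_hours <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    p[1] + p[2] / 60 + p[3] / 3600
  }, numeric(1))
}

# Made-up times for illustration.
hms_to_hours(c("02:30:00", "2:08:15"))
```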
Flights data
Description
On-time data for a random sample of flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.
Usage
nycflights
Format
A tbl_df with 32,735 rows and 16 variables:
- year, month, day
Date of departure.
- dep_time, arr_time
Departure and arrival times, local tz.
- dep_delay, arr_delay
Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
- hour, minute
Time of departure broken into hour and minutes.
- carrier
Two-letter carrier abbreviation. See airlines in the nycflights13 package for more information, or google the airline code.
- tailnum
Plane tail number.
- flight
Flight number.
- origin, dest
Origin and destination. See airports in the nycflights13 package for more information, or google the airport code.
- air_time
Amount of time spent in the air.
- distance
Distance flown.
Source
Hadley Wickham (2014). nycflights13: Data about flights departing NYC in 2013. R package version 0.1.
Examples
library(dplyr)
# Longest departure delays
nycflights |>
select(flight, origin, dest, dep_delay, arr_delay) |>
arrange(desc(dep_delay))
# Longest arrival delays
nycflights |>
select(flight, origin, dest, dep_delay, arr_delay) |>
arrange(desc(arr_delay))
California poll on drilling off the California coast
Description
A 2010 survey asking a random sample of registered voters in California for their position on drilling for oil and natural gas off the coast of California.
Usage
offshore_drilling
Format
A data frame with 827 observations on the following 2 variables.
- position
A factor with levels do not know, oppose, and support.
- college_grad
A factor with levels no and yes.
Source
Survey USA, Election Poll #16804, data collected July 8-11, 2010.
Examples
offshore_drilling
OpenIntro colors
Description
A character string of full colors from IMSCOL[,1]
Usage
openintro_colors
Format
A named character string with 9 elements: "blue", "green", "pink", "yellow", "red", "black", "gray", "lgray
Examples
openintro_colors
openintro_colors["blue"]
Function to extract OpenIntro IMS colors as hex codes
Description
Uses full colors from IMSCOL
Usage
openintro_cols(...)
Arguments
... |
Character names of openintro_colors |
Examples
openintro_cols("blue")
openintro_cols("red")
Return function to interpolate an OpenIntro IMS color palette
Description
Not exported
Usage
openintro_pal(palette = "main", reverse = FALSE, ...)
Arguments
palette |
Character name of palette in openintro_palettes |
reverse |
Boolean indicating whether the palette should be reversed |
... |
Additional arguments to pass to |
OpenIntro palettes
Description
A list with OpenIntro color palettes
Usage
openintro_palettes
Format
A list with 8 color palettes: main, two, three, four, five, cool, hot, gray
Examples
openintro_palettes
openintro_palettes$main
openintro_palettes$three
openintro_palettes$cool
openintro_palettes$hot
Opportunity cost of purchases
Description
In a study on opportunity cost, 150 students were given the following statement: "Imagine that you have been saving some extra money on the side to make some purchases, and on your most recent visit to the video store you come across a special sale on a new video. This video is one with your favorite actor or actress, and your favorite type of movie (such as a comedy, drama, thriller, etc.). This particular video that you are considering is one you have been thinking about buying for a long time. It is available for a special sale price of $14.99. What would you do in this situation? Please circle one of the options below." Half of the students were given the following two options: (A) Buy this entertaining video. (B) Not buy this entertaining video. The other half were given the following two options (note the modified option B): (A) Buy this entertaining video. (B) Not buy this entertaining video. Keep the $14.99 for other purchases. The results of this study are in this dataset.
Usage
opportunity_cost
Format
A data frame with 150 observations on the following 2 variables.
- group
A factor with levels control and treatment.
- decision
A factor with levels buy video and not buy video.
Source
Frederick S, Novemsky N, Wang J, Dhar R, Nowlis S. 2009. Opportunity Cost Neglect. Journal of Consumer Research 36: 553-561.
Examples
library(ggplot2)
table(opportunity_cost)
ggplot(opportunity_cost, aes(y = group, fill = decision)) +
geom_bar(position = "fill")
1986 Challenger disaster and O-rings
Description
On January 28, 1986, a routine launch was anticipated for the Challenger space shuttle. Seventy-three seconds into the flight, disaster happened: the shuttle broke apart, killing all seven crew members on board. An investigation into the cause of the disaster focused on a critical seal called an O-ring, and it is believed that damage to these O-rings during a shuttle launch may be related to the ambient temperature during the launch. The table below summarizes observational data on O-rings for 23 shuttle missions, where the mission order is based on the temperature at the time of the launch.
Usage
orings
Format
A data frame with 23 observations on the following 4 variables.
- mission
Shuttle mission number.
- temperature
Temperature, in Fahrenheit.
- damaged
Number of damaged O-rings (out of 6).
- undamaged
Number of undamaged O-rings (out of 6).
Source
https://archive.ics.uci.edu/dataset/92/challenger+usa+space+shuttle+o+ring
Examples
library(dplyr)
library(forcats)
library(tidyr)
library(broom)
# This is a wide data frame. You can convert it to a long
# data frame to predict probability of O-ring damage based
# on temperature using logistic regression.
orings_long <- orings |>
pivot_longer(cols = c(damaged, undamaged), names_to = "outcome", values_to = "n") |>
uncount(n) |>
mutate(outcome = fct_relevel(outcome, "undamaged", "damaged"))
orings_mod <- glm(outcome ~ temperature, data = orings_long, family = "binomial")
tidy(orings_mod)
Oscar winners, 1929 to 2018
Description
Best actor and actress Oscar winners from 1929 to 2018
Usage
oscars
Format
A data frame with 182 observations on the following 11 variables.
- oscar_no
Oscar ceremony number.
- oscar_yr
Year the Oscar ceremony was held.
- award
Best actress or Best actor.
- name
Name of winning actor or actress.
- movie
Name of the movie for which the actor or actress won the Oscar.
- age
Age at which the actor or actress won the Oscar.
- birth_pl
US State where the actor or actress was born, country if foreign.
- birth_date
Birth date of actor or actress.
- birth_mo
Birth month of actor or actress.
- birth_d
Birth day of actor or actress.
- birth_y
Birth year of actor or actress.
Details
Although there have been only 84 Oscar ceremonies until 2012, there are 85 male winners and 85 female winners because ties happened on two occasions (1933 for the best actor and 1969 for the best actress).
Source
Journal of Statistics Education, http://jse.amstat.org/datasets/oscars.dat.txt, updated through 2019 using information from Oscars.org and Wikipedia.org.
Examples
library(ggplot2)
library(dplyr)
ggplot(oscars, aes(x = award, y = age)) +
geom_boxplot()
ggplot(oscars, aes(x = factor(birth_mo))) +
geom_bar()
oscars |>
count(birth_pl, sort = TRUE)
Simulated datasets for different types of outliers
Description
Data sets for showing different types of outliers
Usage
outliers
Format
A data frame with 50 observations on the following 5 variables.
- x
a numeric vector
- y
a numeric vector
- x_inf
a numeric vector
- y_lev
a numeric vector
- y_out
a numeric vector
Examples
outliers
Race time for Olympic and Paralympic 1500m.
Description
Compiled gold medal times for the 1500m race in the Olympic Games and the Paralympic Games. The times given for contestants competing in the Paralympic Games are for athletes with different visual impairments; T11 indicates fully blind (with an option to race with a guide-runner) with T12 and T13 as lower levels of visual impairment.
Usage
paralympic_1500
Format
A data frame with 83 rows and 10 variables.
- year
Year the games took place.
- city
City of the games.
- country_of_games
Country of the games.
- division
Division: Men or Women.
- type
Type.
- name
Name of the athlete.
- country_of_athlete
Country of athlete.
- time
Time of gold medal race, in m:s.
- time_min
Time of gold medal race, in decimal minutes (min + sec/60).
Source
https://www.paralympic.org/ and https://en.wikipedia.org/wiki/1500_metres_at_the_Olympics.
Examples
library(ggplot2)
library(dplyr)
paralympic_1500 |>
mutate(
sight_level = case_when(
type == "T11" ~ "total impairment",
type == "T12" ~ "some impairment",
type == "T13" ~ "some impairment",
type == "Olympic" ~ "no impairment"
)
) |>
filter(division == "Men", year > 1920) |>
filter(type == "Olympic" | type == "T11") |>
ggplot(aes(x = year, y = time_min, color = sight_level, shape = sight_level)) +
geom_point() +
scale_x_continuous(breaks = seq(1924, 2020, by = 8)) +
labs(
title = "Men's Olympic and Paralympic 1500m race times",
x = "Year",
y = "Time of Race (minutes)",
color = "Sight level",
shape = "Sight level"
)
Guesses at the weight of Penelope (a cow)
Description
The data was collected by the Planet Money podcast to test a theory about crowd-sourcing. Penelope's actual weight was 1,355 pounds.
Usage
penelope
Format
A data frame with 17,184 observations on the following variable.
- weight
Guesses of Penelope's weight, in pounds.
Source
Examples
library(ggplot2)
ggplot(penelope, aes(x = weight)) +
geom_histogram(binwidth = 250)
summary(penelope$weight)
What's the best way to loosen a rusty bolt?
Description
The channel Project Farm on YouTube investigated penetrating oils and other options for loosening rusty bolts. Eight options were evaluated, including a control group, to determine which was most effective.
Usage
penetrating_oil
Format
A data frame with 30 observations on the following 2 variables.
- treatment
The different treatments tried: none (control), Heat (via blow torch), Acetone/ATF, AeroKroil, Liquid Wrench, PB Blaster, Royal Purple, and WD-40.
- torque
Torque required to loosen the rusty bolt, measured in foot-pounds.
Source
https://www.youtube.com/watch?v=xUEob2oAKVs
Examples
m <- lm(torque ~ treatment, data = penetrating_oil)
anova(m)
# There are 28 pairwise comparisons to be made.
xbar <- tapply(penetrating_oil$torque, penetrating_oil$treatment, mean)
n <- tapply(penetrating_oil$torque, penetrating_oil$treatment, length)
s <- summary(m)$sigma
df <- summary(m)$df[2] # residual degrees of freedom
diff <- c()
se <- c()
k <- 0
N <- length(n)
K <- N * (N - 1) / 2
for (i in 1:(N - 1)) {
for (j in (i + 1):N) {
k <- k + 1
diff[k] <- xbar[i] - xbar[j]
se[k] <- s * sqrt(1 / n[i] + 1 / n[j])
if (2 * K * pt(-abs(diff[k] / se[k]), df) < 0.05) {
cat("0.05 - ", names(n)[c(i, j)], "\n")
} else if (2 * K * pt(-abs(diff[k] / se[k]), df) < 0.1) {
cat("0.1 - ", names(n)[c(i, j)], "\n")
} else if (2 * K * pt(-abs(diff[k] / se[k]), df) < 0.2) {
cat("0.2 - ", names(n)[c(i, j)], "\n")
} else if (2 * K * pt(-abs(diff[k] / se[k]), df) < 0.3) {
cat("0.3 - ", names(n)[c(i, j)], "\n")
}
}
}
# Smallest p-value using Bonferroni
min(2 * K * pt(-abs(diff / se), df))
# Better pairwise comparison method.
anova(m1 <- aov(torque ~ treatment, data = penetrating_oil))
TukeyHSD(m1)
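Base R's pairwise.t.test() offers a compact way to run Bonferroni-adjusted pairwise comparisons with a pooled standard deviation. The sketch below uses simulated stand-in data so that it runs without the package; the treatment labels, group sizes, and means are all invented for illustration.

```r
# Simulated stand-in data: 4 hypothetical treatments, 5 bolts each.
set.seed(42)
sim <- data.frame(
  treatment = factor(rep(c("A", "B", "C", "D"), each = 5)),
  torque = rnorm(20, mean = rep(c(100, 95, 90, 70), each = 5), sd = 8)
)

# Pairwise t-tests with a pooled SD and Bonferroni-adjusted p-values.
res <- pairwise.t.test(sim$torque, sim$treatment,
  p.adjust.method = "bonferroni", pool.sd = TRUE
)
res$p.value

# Manual check for one pair (A vs. B): pooled SE, residual degrees of
# freedom, and the two-sided p-value multiplied by the 6 comparisons.
fit <- lm(torque ~ treatment, data = sim)
xbar <- tapply(sim$torque, sim$treatment, mean)
se_ab <- summary(fit)$sigma * sqrt(1 / 5 + 1 / 5)
t_ab <- (xbar["A"] - xbar["B"]) / se_ab
p_ab <- min(1, 6 * 2 * pt(-abs(t_ab), fit$df.residual))
```

The manual value p_ab should agree with res$p.value["B", "A"], since both use the pooled standard deviation and the residual degrees of freedom from the one-way model.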
Penny Ages
Description
Sample of pennies and their ages. Taken in 2004.
Usage
penny_ages
Format
A data frame with 648 observations on the following 2 variables.
- year
Penny's year.
- age
Age as of 2004.
Examples
hist(penny_ages$year)
Pew Survey on Energy Sources in 2018
Description
US-based survey on support for expanding six different sources of energy, including solar, wind, offshore drilling, hydraulic fracturing ("fracking"), coal, and nuclear.
Usage
pew_energy_2018
Format
The format is: List of 6
$ solar_panel_farms: List of responses on solar farms.
$ wind_turbine_farms: List of responses on wind turbine farms.
$ offshore_drilling: List of responses on offshore drilling.
$ hydrolic_fracturing: List of responses on hydraulic fracturing.
$ coal_mining: List of responses on coal mining.
$ nuclear_power_plants: List of responses on nuclear.
Details
We did not have access to individual responses in the original dataset, so we took the published percentages and backed out the breakdown.
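Backing out individual responses from published percentages amounts to repeating each response level according to its share of the sample. The sketch below illustrates the idea only; the sample size, response labels, and shares are invented and are not the survey's figures.

```r
# Hypothetical inputs: invented sample size and response shares.
n <- 200
shares <- c(favor = 0.70, oppose = 0.25, `no answer` = 0.05)

# Convert shares to whole-number counts, then expand to one entry per response.
counts <- round(n * shares)
responses <- rep(names(shares), times = counts)

# The reconstructed vector reproduces the published shares.
table(responses) / length(responses)
```

When the shares do not multiply out to whole numbers, rounding can leave the total slightly off from the sample size, so the reconstructed proportions match the published ones only approximately.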
Source
Examples
data(pew_energy_2018)
lapply(pew_energy_2018, head)
lapply(pew_energy_2018, length)
lapply(pew_energy_2018, table)
Prop <- function(x) {
table(x) / length(x)
}
lapply(pew_energy_2018, Prop)
Photo classifications: fashion or not
Description
This is a simulated dataset for photo classifications based on a machine learning algorithm versus what the true classification is for those photos. While the data are not real, they resemble performance that would be reasonable to expect in a well-built classifier.
Usage
photo_classify
Format
A data frame with 1822 observations on the following 2 variables.
- mach_learn
The prediction by the machine learning system as to whether the photo is about fashion or not.
- truth
The actual classification of the photo by a team of humans.
Details
The hypothetical ML algorithm has a precision of 90%, meaning that of the photos it claims are fashion, about 90% are actually about fashion. The recall of the ML algorithm is about 64%, meaning that of the photos that are about fashion, it correctly predicts that they are about fashion about 64% of the time.
Source
The data are simulated / hypothetical.
Examples
data(photo_classify)
table(photo_classify)
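Precision and recall can be read directly off a two-way table of predictions against the truth. The sketch below uses an illustrative confusion matrix whose counts are assumed, chosen to total 1822 and to be consistent with the rates described in Details; the row and column labels are hypothetical as well.

```r
# Illustrative counts (assumed); rows are predictions, columns are the truth.
confusion <- matrix(
  c(197, 112, 22, 1491),
  nrow = 2,
  dimnames = list(
    mach_learn = c("pred_fashion", "pred_not"),
    truth = c("fashion", "not")
  )
)

# Precision: of the photos predicted to be fashion, the share that truly are.
precision <- confusion["pred_fashion", "fashion"] / sum(confusion["pred_fashion", ])

# Recall: of the photos that truly are fashion, the share predicted as fashion.
recall <- confusion["pred_fashion", "fashion"] / sum(confusion[, "fashion"])

round(c(precision = precision, recall = recall), 3)
```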
Piracy and PIPA/SOPA
Description
This dataset contains observations on all 100 US Senators and 434 of the 435 US Congressional Representatives related to their support of anti-piracy legislation that was introduced at the end of 2011.
Usage
piracy
Format
A data frame with 534 observations on the following 8 variables.
- name
Name of legislator.
- party
Party affiliation as Democrat (D), Republican (R), or Independent (I).
- state
Two-letter state abbreviation.
- money_pro
Amount of money in dollars contributed to the legislator's campaign in 2010 by groups generally thought to be supportive of PIPA/SOPA: movie and TV studios, record labels.
- money_con
Amount of money in dollars contributed to the legislator's campaign in 2010 by groups generally thought to be opposed to PIPA/SOPA: computer and internet companies.
- years
Number of years of service in Congress.
- stance
Degree of support for PIPA/SOPA with levels Leaning No, No, Undecided, Unknown, and Yes.
- chamber
Whether the legislator is a member of the house or senate.
Details
The Stop Online Piracy Act (SOPA) and the Protect Intellectual Property Act (PIPA) were two bills introduced in the US House of Representatives and the US Senate, respectively, to curtail copyright infringement. The bills were controversial because of concerns that they limited free speech rights. ProPublica, the independent and non-profit news organization, compiled this dataset to compare the stance of legislators towards the bills with the amount of campaign funds that they received from groups considered to be supportive of or in opposition to the legislation.
For more background on the legislation and the formulation of money_pro and money_con, read the documentation on ProPublica, linked below.
Source
https://projects.propublica.org/sopa
The list may be slightly out of date since many politicians' perspectives on the legislation were in flux at the time of data collection.
Examples
library(dplyr)
library(ggplot2)
pipa <- filter(piracy, chamber == "senate")
pipa |>
group_by(stance) |>
summarise(money_pro_mean = mean(money_pro, na.rm = TRUE)) |>
ggplot(aes(x = stance, y = money_pro_mean)) +
geom_col() +
labs(
x = "Stance", y = "Average contribution, in $",
title = "Average contribution to the legislator's campaign in 2010",
subtitle = "by groups supportive of PIPA/SOPA (movie and TV studios, record labels)"
)
ggplot(pipa, aes(x = stance, y = money_pro)) +
geom_boxplot() +
labs(
x = "Stance", y = "Contribution, in $",
title = "Contribution by groups supportive of PIPA/SOPA",
subtitle = "Movie and TV studios, record labels"
)
ggplot(pipa, aes(x = stance, y = money_con)) +
geom_boxplot() +
labs(
x = "Stance", y = "Contribution, in $",
title = "Contribution by groups opposed to PIPA/SOPA",
subtitle = "Computer and internet companies"
)
pipa |>
filter(
money_pro > 0,
money_con > 0
) |>
mutate(for_pipa = ifelse(stance == "Yes", "yes", "no")) |>
ggplot(aes(x = money_pro, y = money_con, color = for_pipa)) +
geom_point() +
scale_color_manual(values = c("gray", "red")) +
scale_y_log10() +
scale_x_log10() +
labs(
x = "Contribution by pro-PIPA groups",
y = "Contribution by anti-PIPA groups",
color = "For PIPA"
)
Table of Playing Cards in 52-Card Deck
Description
A table describing each of the 52 cards in a deck.
Usage
playing_cards
Format
A data frame with 52 observations on the following 3 variables.
- number
The number or card type.
- suit
Card suit, which takes one of four values: Club, Diamond, Heart, or Spade.
- face_card
Whether the card counts as a face card.
Source
This extremely complex dataset was generated from scratch.
Examples
playing_cards <- data.frame(
number = rep(c(2:10, "J", "Q", "K", "A"), 4),
suit = rep(c("Spade", "Diamond", "Club", "Heart"), rep(13, 4))
)
playing_cards$face_card <-
ifelse(playing_cards$number %in% c(2:10, "A"), "no", "yes")
Air quality for Durham, NC
Description
Daily air quality is measured by the air quality index (AQI) reported by the Environmental Protection Agency in 2011.
Usage
pm25_2011_durham
Format
A data frame with 449 observations on the following 20 variables.
- date
Date
- aqs_site_id
The numeric site ID.
- poc
A numeric vector, the Parameter Occurrence Code.
- daily_mean_pm2_5_concentration
A numeric vector with the average daily concentration of fine particulates, or particulate matter 2.5.
- units
A character vector with value ug/m3 LC.
- daily_aqi_value
A numeric vector with the daily air quality index.
- daily_obs_count
A numeric vector.
- percent_complete
A numeric vector.
- aqs_parameter_code
A numeric vector.
- aqs_parameter_desc
A factor with levels PM2.5 - Local Conditions and Acceptable PM2.5 AQI & Speciation Mass.
- cbsa_code
A numeric vector.
- cbsa_name
A character vector with value Durham, NC.
- state_code
A numeric vector.
- state
A character vector with value North Carolina.
- county_code
A numeric vector.
- county
A character vector with value Durham.
- site_latitude
A numeric vector of the latitude.
- site_longitude
A numeric vector of the longitude.
- csa_code
A numeric vector.
- csa_name
A factor with level Raleigh-Durham-Cary, NC.
Source
US Environmental Protection Agency, AirData, 2011. http://www3.epa.gov/airdata/ad_data_daily.html
Examples
library(ggplot2)
ggplot(pm25_2011_durham, aes(x = date, y = daily_mean_pm2_5_concentration, group = 1)) +
geom_line()
Air quality for Durham, NC
Description
Daily air quality is measured by the air quality index (AQI) reported by the Environmental Protection Agency in 2022.
Usage
pm25_2022_durham
Format
A data frame with 356 observations on the following 20 variables.
- date
Date.
- aqs_site_id
The numeric site ID.
- poc
A numeric vector, the Parameter Occurrence Code.
- daily_mean_pm2_5_concentration
A numeric vector with the average daily concentration of fine particulates, or particulate matter 2.5.
- units
A character vector with value ug/m3 LC.
- daily_aqi_value
A numeric vector with the daily air quality index.
- daily_obs_count
A numeric vector.
- percent_complete
A numeric vector.
- aqs_parameter_code
A numeric vector.
- aqs_parameter_desc
A factor vector with level PM2.5 - Local Conditions.
- cbsa_code
A numeric vector.
- cbsa_name
A character vector with value Durham-Chapel Hill, NC.
- state_code
A numeric vector.
- state
A character vector with value North Carolina.
- county_code
A numeric vector.
- county
A character vector with value Durham.
- site_latitude
A numeric vector of the latitude.
- site_longitude
A numeric vector of the longitude.
- site_name
A character vector with value Durham Armory.
Source
US Environmental Protection Agency, AirData, 2022. http://www3.epa.gov/airdata/ad_data_daily.html
Examples
library(ggplot2)
ggplot(pm25_2022_durham, aes(x = date, y = daily_mean_pm2_5_concentration, group = 1)) +
geom_line()
Poker winnings during 50 sessions
Description
Poker winnings (and losses) for 50 days by a professional poker player.
Usage
poker
Format
A data frame with 50 observations on the following variable.
- winnings
Poker winnings and losses, in US dollars.
Source
Anonymity has been requested by the player.
Examples
library(ggplot2)
ggplot(poker, aes(x = winnings)) +
geom_histogram(binwidth = 250)
Possums in Australia and New Guinea
Description
Data representing possums in Australia and New Guinea. This is a copy of the dataset by the same name in the DAAG package; however, the dataset included here has fewer variables.
Usage
possum
Format
A data frame with 104 observations on the following 8 variables.
- site
The site number where the possum was trapped.
- pop
Population, either Vic (Victoria) or other (New South Wales or Queensland).
- sex
Gender, either m (male) or f (female).
- age
Age.
- head_l
Head length, in mm.
- skull_w
Skull width, in mm.
- total_l
Total length, in cm.
- tail_l
Tail length, in cm.
Source
Lindenmayer, D. B., Viggers, K. L., Cunningham, R. B., and Donnelly, C. F. 1995. Morphological variation among populations of the mountain brushtail possum, Trichosurus caninus Ogilby (Phalangeridae: Marsupialia). Australian Journal of Zoology 43: 449-458.
Examples
library(ggplot2)
# Skull width vs. head length
ggplot(possum, aes(x = head_l, y = skull_w)) +
geom_point()
# Total length vs. sex
ggplot(possum, aes(x = total_l, fill = sex)) +
geom_density(alpha = 0.5)
US Poll on who it is better to raise taxes on
Description
A poll of 691 people, with party affiliation collected, asked whether they think it's better to raise taxes on the rich or raise taxes on the poor.
Usage
ppp_201503
Format
A data frame with 691 observations on the following 2 variables.
- party
Political party affiliation.
- taxes
Support for who to raise taxes on.
Source
Public Policy Polling, Americans on College Degrees, Classic Literature, the Seasons, and More, data collected Feb 20-22, 2015.
Examples
library(ggplot2)
ggplot(ppp_201503, aes(x = party, fill = taxes)) +
geom_bar(position = "fill") +
labs(x = "Party", y = "Proportion", fill = "Taxes")
Birth counts
Description
An updated version of the historical Arbuthnot dataset. Numbers of boys and girls born in the United States between 1940 and 2002.
Usage
present
Format
A data frame with 63 observations on the following 3 variables.
- year
Year.
- boys
Number of boys born.
- girls
Number of girls born.
Source
Mathews, T. J., and Brady E. Hamilton. "Trend analysis of the sex ratio at birth in the United States." National vital statistics reports 53.20 (2005): 1-17.
Examples
library(ggplot2)
ggplot(present, mapping = aes(x = year, y = boys / girls)) +
geom_line()
United States Presidential History
Description
Summary of the changes in the president and vice president for the United States of America.
Usage
president
Format
A data frame with 67 observations on the following 5 variables.
- potus
President of the United States
- party
Political party of the president
- start
Start year
- end
End year
- vpotus
Vice President of the United States
Source
Presidents of the United States (table) – infoplease.com (visited: Nov 2nd, 2010)
https://www.infoplease.com/us/government/executive-branch/presidents and https://www.infoplease.com/us/government/executive-branch/vice-presidents
Examples
president
Prison isolation experiment
Description
Subjects from Central Prison in Raleigh, NC, volunteered for an experiment involving an "isolation" experience. The goal of the experiment was to find a treatment that reduces subjects' psychopathic deviant T scores. This score measures a person's need for control or their rebellion against control, and it is part of a commonly used mental health test called the Minnesota Multiphasic Personality Inventory (MMPI) test.
Usage
prison
Format
A data frame with 14 observations on the following 6 variables.
- pre_trt1
Pre-treatment 1.
- post_trt1
Post-treatment 1.
- pre_trt2
Pre-treatment 2.
- post_trt2
Post-treatment 2.
- pre_trt3
Pre-treatment 3.
- post_trt3
Post-treatment 3.
Source
https://stat.duke.edu/datasets/prison-isolation
Examples
prison
User reported fuel efficiency for 2017 Toyota Prius Prime
Description
Fueleconomy.gov, the official US government source for fuel economy information, allows users to share gas mileage information on their vehicles. These data come from 19 users sharing gas mileage on their 2017 Toyota Prius Prime. Note that these data are user estimates, and since the source data cannot be verified, the accuracy of these estimates is not guaranteed.
Usage
prius_mpg
Format
A data frame with 19 observations on the following 10 variables.
- average_mpg
Average mileage as estimated by the user.
- state
US State the user lives in.
- stop_and_go
Proportion of stop and go driving.
- highway
Proportion of highway driving.
- last_updated
Date estimate was last updated.
Source
Fueleconomy.gov, https://www.fueleconomy.gov/mpg/MPG.do?action=mpgData&vehicleID=38531&browser=true&details=on, retrieved 2019-04-14.
Examples
library(ggplot2)
library(dplyr)
ggplot(prius_mpg, aes(x = average_mpg)) +
geom_histogram(binwidth = 25)
Generate simulated QQ plots
Description
Create a 3 x 3 grid of quantile-quantile plots, the first of which corresponds to the input data. The other eight plots arise from simulating random normal data with the same mean, standard deviation, and length as the data. For use in comparing known-normal qqplots to an observed qqplot to assess normality.
Usage
qqnormsim(sample, data)
Arguments
sample |
the variable to be plotted. |
data |
data frame to use. |
Value
A 3 x 3 grid of qqplots.
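The idea described above can be sketched in base R graphics. This is an illustration of the technique under stated assumptions, not the package's implementation: simulate_normal_like() is a hypothetical helper, and the observed data are made up.

```r
# Hypothetical helper (not part of openintro): simulate normal samples
# matching the mean, standard deviation, and length of `x`.
simulate_normal_like <- function(x, n_sims = 8) {
  x <- x[!is.na(x)]
  replicate(n_sims, rnorm(length(x), mean = mean(x), sd = sd(x)),
    simplify = FALSE
  )
}

set.seed(1)
observed <- rexp(50) # made-up right-skewed data
sims <- simulate_normal_like(observed)

# Draw the observed sample first, then the eight simulated samples.
old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
qqnorm(observed, main = "Data")
qqline(observed)
for (i in seq_along(sims)) {
  qqnorm(sims[[i]], main = paste("Sim", i))
  qqline(sims[[i]])
}
par(old_par)
```

If the first panel looks no more curved than the simulated panels, the departure from normality is within what sampling variation alone would produce.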
Yahoo! News Race and Justice poll results
Description
Results from a Yahoo! News poll conducted by YouGov on May 29-31, 2020. In total 1060 U.S. adults were asked a series of questions regarding race and justice in the wake of the killing of George Floyd by a police officer. Results in this dataset are percentages for the question, "Do you think Blacks and Whites receive equal treatment from the police?" For this particular question there were 1059 respondents.
Usage
race_justice
Format
A data frame with 1,059 rows and 2 variables.
- race_eth
Race/ethnicity of respondent, with levels White, Black, Hispanic, and Other.
- response
Response to the question "Do you think Black and White people receive equal treatment from the police?", with levels Yes, No, and Not sure.
Source
Yahoo! News Race and Justice - May 31, 2020.
Examples
library(ggplot2)
library(dplyr)
# Conditional probabilities of response for each race/ethnicity
race_justice |>
count(race_eth, response) |>
group_by(race_eth) |>
mutate(prop = n / sum(n))
# Stacked bar plot of counts
ggplot(race_justice, aes(x = race_eth, fill = response)) +
geom_bar() +
labs(
x = "Race / ethnicity",
y = "Count",
title = "Do you think Black and White people receive
equal treatment from the police?",
fill = "Response"
)
# Stacked bar plot of proportions
ggplot(race_justice, aes(x = race_eth, fill = response)) +
geom_bar(position = "fill") +
labs(
x = "Race / ethnicity",
y = "Proportion",
title = "Do you think Black and White people receive
equal treatment from the police?",
fill = "Response"
)
Reddit Survey on Financial Independence.
Description
A reduced set of the official results of the 2020 FI Survey from Reddit (r/financialindependence). Only responses that represent the respondent (not other contributors in the household) are listed. Does not include retired individuals. As instructed, respondents gave dollar values in their native currency.
Usage
reddit_finance
Format
A data frame with 1998 rows and 65 variables.
- num_incomes
How many individuals contribute to your household income?
- pan_inc_chg
As a result of the pandemic, did your earned income increase, decrease, or remain the same?
- pan_inc_chg_pct
By how much did your earned income change?
- pan_exp_chg
As a result of the pandemic, did your expenses increase, decrease, or remain the same?
- pan_exp_chg_pct
By how much did your expenses change?
- pan_fi_chg
As a result of the pandemic, did your FI (financially independent) number...
- pan_ret_date_chg
As a result of the pandemic, did your planned RE (retirement) date...
- pan_financial_impact
Overall, how would you characterize the pandemic's impact on your finances?
- political
With which political party do you most closely identify? You do not need to be registered with a party to select it, answer based on your personal views.
- race_eth
What is your race/ethnicity? Select all that apply.
- gender
What is your gender?
- age
What is your age?
- edu
What is the highest level of education you have completed?
- rel_status
What is your relationship status?
- children
Do you have children?
- country
What country are you in?
- fin_indy
Are you financially independent? Meaning you do not need to work for money, regardless of whether you work for money.
- fin_indy_num
At what amount invested will you consider yourself Financially Independent? (What is your FI number?)
- fin_indy_pct
What percent FI are you? (What percent of your FI number do you currently have?)
- retire_invst_num
At what amount invested do you intend to retire? (What is your RE number?)
- tgt_sf_wthdrw_rt
What is your target safe withdrawal rate? (If your answer is 3.5%, enter it as 3.5)
- max_retire_sup
How much annual income do you expect to have from the sources you selected in question T5 at the point where you are utilizing all of them (or a majority if you do not intend to use all at the same time)? Enter your answer as a dollar amount.
- retire_exp
How much money (from your savings and other sources) do you intend to spend each year once you are retired? Enter your answer as a dollar amount.
- whn_fin_indy_num
At what amount invested did you consider yourself Financially Independent? (AKA what was your "FI number")
- fin_indy_lvl
Which of the following would you have considered yourself at the time you reached Financial Independence:
- retire_age
At what age do you intend to retire?
- stp_whn_fin_indy
Do you intend to stop working for money when you reach financial independence?
- industry
Which of the following best describes the industry in which you currently or most recently work(ed)?
- employer
Which of the following best describes your current or most recent employer?
- role
Which of the following best describes your current or most recent job role?
- ft_status
What is your current employment status? - Full Time
- pt_status
What is your current employment status? - Part Time, Regular
- gig_status
What is your current employment status? - Side Gig, Intermittent
- ne_status
What is your current employment status? - Not Employed
- edu_status
What is your current educational status?
- housing
What is your current housing situation?
- home_value
Primary residence value.
- brokerage_accts_tax
Brokerage accounts (Taxable).
- retirement_accts_tax
Retirement accounts (Tax Advantaged).
- cash
Cash / cash equivalents (Savings, Checking, C.D.s, Money Market).
- invst_accts
Dedicated Savings/Investment Accounts (Healthcare, Education).
- spec_crypto
Speculation (Crypto, P2P Lending, Gold, etc.).
- invst_prop_bus_own
Investment properties / owned business(es).
- other_val
Other assets.
- student_loans
Outstanding student loans.
- mortgage
Outstanding mortgage / HELOC.
- auto_loan
Outstanding auto loans.
- credit_personal_loan
Outstanding credit cards / personal loans.
- medical_debt
Outstanding medical debt.
- invst_prop_bus_own_debt
Debt from investment properties / owned business.
- other_debt
Debt from other sources.
- 2020_gross_inc
What was your 2020 gross (pre-tax, pre-deductions) annual household income?
- 2020_housing_exp
Housing expenses (rent, mortgage, insurance, taxes, upkeep).
- 2020_utilities_exp
Utilities expenses (phone, internet, gas, electric, water, sewer).
- 2020_transp_exp
Transportation expenses (car payment, bus / subway tickets, gas, insurance, maintenance).
- 2020_necessities_exp
Necessities expenses (groceries, clothing, personal care, household supplies).
- 2020_lux_exp
Luxury expenses (restaurants/dining, entertainment, hobbies, travel, pets, gifts).
- 2020_child_exp
Children expenses (child care, soccer team, etc.).
- 2020_debt_repay
Debt repayment (excluding mortgage/auto).
- 2020_invst_save
Investments / savings.
- 2020_charity
Charity / Tithing.
- 2020_healthcare_exp
Healthcare expenses (direct costs, co-pays, insurance you pay).
- 2020_taxes
Taxes (the sum of all taxes paid, including amounts deducted from paychecks).
- 2020_edu_exp
Education expenses.
- 2020_other_exp
Other expenses.
Source
Reddit Official 2020 FI Survey Results, https://www.reddit.com/r/financialindependence/comments/m1q8ia/official_2020_fi_survey_results.
Examples
library(ggplot2)
# Bar plot of expected retirement age (age brackets).
ggplot(reddit_finance, aes(retire_age)) +
geom_bar(na.rm = TRUE) +
labs(
title = "At what age do you expect to retire?",
x = "Age Bracket",
y = "Number of Respondents"
)
# Histogram of Dollar Amount at Which FI was reached.
ggplot(reddit_finance, aes(whn_fin_indy_num)) +
geom_histogram(na.rm = TRUE, bins = 20) +
labs(
title = "At what amount invested did you consider\nyourself Financially Independent?",
x = "Dollar Amount (in local currency)",
y = "Number of Respondents"
)
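Several of the survey fields above are related by simple arithmetic. The sketch below assumes the common convention that an FI number is planned annual spending divided by the safe withdrawal rate; the survey itself does not state this relationship, so treat it as an assumption. The helper names and the respondent's figures are hypothetical.

```r
# Hypothetical helper: FI number implied by annual spending and a
# withdrawal rate entered as a percentage (3.5 means 3.5%), matching
# how tgt_sf_wthdrw_rt is recorded in the survey.
fi_number <- function(annual_spending, withdrawal_rate_pct) {
  annual_spending / (withdrawal_rate_pct / 100)
}

# Hypothetical helper: percent of the FI number currently invested,
# analogous to fin_indy_pct.
fi_percent <- function(invested, fi_num) {
  100 * invested / fi_num
}

# Made-up respondent: spends 40,000 per year, targets a 4% withdrawal
# rate, and has 250,000 invested.
target <- fi_number(40000, 4)
target
fi_percent(250000, target)
```

Keeping the rate as a percentage in the function signature mirrors the survey's data entry, so values from tgt_sf_wthdrw_rt could be passed in without rescaling.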
Simulated data for regression
Description
Simulated data for regression
Usage
res_demo_1
Format
A data frame with 100 observations on the following 3 variables.
- x
a numeric vector
- y_lin
a numeric vector
- y_fan_back
a numeric vector
Examples
res_demo_1
Simulated data for regression
Description
Simulated data for regression
Usage
res_demo_2
Format
A data frame with 300 observations on the following 3 variables.
- x
a numeric vector
- y_fan
a numeric vector
- y_log
a numeric vector
Examples
res_demo_2
Which resume attributes drive job callbacks?
Description
This experiment data comes from a study that sought to understand the influence of race and gender on job application callback rates. The study monitored job postings in Boston and Chicago for several months during 2001 and 2002 and used this to build up a set of test cases. Over this time period, the researchers randomly generated resume attributes, such as years of experience and education details, to create a realistic-looking resume to send to each job posting. They then randomly assigned a name to the resume that would communicate the applicant's gender and race. The first names chosen for the study were selected so that the names would predominantly be recognized as belonging to black or white individuals. For example, Lakisha was a name that their survey indicated would be interpreted as a black woman, while Greg was a name that would generally be interpreted to be associated with a white male.
Usage
resume
Format
A data frame with 4870 observations, representing 4870 resumes, over 30 different variables that describe the job details, the outcome (received_callback), and attributes of the resume.
- job_ad_id
Unique ID associated with the advertisement.
- job_city
City where the job was located.
- job_industry
Industry of the job.
- job_type
Type of role.
- job_fed_contractor
Indicator for if the employer is a federal contractor.
- job_equal_opp_employer
Indicator for if the employer is an Equal Opportunity Employer.
- job_ownership
The type of company, e.g. a nonprofit or a private company.
- job_req_any
Indicator for if any job requirements are listed. If so, the other job_req_* fields give more detail.
- job_req_communication
Indicator for if communication skills are required.
- job_req_education
Indicator for if some level of education is required.
- job_req_min_experience
Amount of experience required.
- job_req_computer
Indicator for if computer skills are required.
- job_req_organization
Indicator for if organization skills are required.
- job_req_school
Level of education required.
- received_callback
Indicator for if there was a callback from the job posting for the person listed on this resume.
- firstname
The first name used on the resume.
- race
Inferred race associated with the first name on the resume.
- gender
Inferred gender associated with the first name on the resume.
- years_college
Years of college education listed on the resume.
- college_degree
Indicator for if the resume listed a college degree.
- honors
Indicator for if the resume listed that the candidate has been awarded some honors.
- worked_during_school
Indicator for if the resume listed working while in school.
- years_experience
Years of experience listed on the resume.
- computer_skills
Indicator for if computer skills were listed on the resume. These skills were adapted for listings, though the skills were assigned independently of other details on the resume.
- special_skills
Indicator for if any special skills were listed on the resume.
- volunteer
Indicator for if volunteering was listed on the resume.
- military
Indicator for if military experience was listed on the resume.
- employment_holes
Indicator for if there were holes in the person's employment history.
- has_email_address
Indicator for if the resume lists an email address.
- resume_quality
Each resume was generally classified as either lower or higher quality.
Details
Because this is an experiment, where the race and gender attributes are being randomly assigned to the resumes, we can conclude that any statistically significant difference in callback rates is causally linked to these attributes.
Do you think it's reasonable to make a causal conclusion? You may have some healthy skepticism. However, do take care to appreciate that this was an experiment: the first name (and so the inferred race and gender) was randomly assigned to the resumes, and the quality and attributes of a resume were assigned independently of the race and gender. This means that any effects we observe are in fact causal, and the effects related to race are both statistically significant and very large: white applicants had about a 50% higher callback rate than black applicants.
Do you still have doubts lingering in the back of your mind about the validity of this study? Maybe a counterargument about why the standard conclusions from this study may not apply? The article summarizing the results was exceptionally well-written, and it addresses many potential concerns about the study's approach. So if you're feeling skeptical about the conclusions, please find the link below and explore!
Source
Bertrand M, Mullainathan S. 2004. "Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination". The American Economic Review 94:4 (991-1013). doi:10.3386/w9873.
See Also
Examples
head(resume, 5)
# Some checks to confirm balance between race and
# other attributes of a resume. There should be
# some minor differences due to randomness, but
# each variable should be (and is) generally
# well-balanced.
table(resume$race, resume$years_college)
table(resume$race, resume$college_degree)
table(resume$race, resume$honors)
table(resume$race, resume$worked_during_school)
table(resume$race, resume$years_experience)
table(resume$race, resume$computer_skills)
table(resume$race, resume$special_skills)
table(resume$race, resume$volunteer)
table(resume$race, resume$military)
table(resume$race, resume$employment_holes)
table(resume$race, resume$has_email_address)
table(resume$race, resume$resume_quality)
# Regarding the callback outcome for race,
# we observe a very large difference.
tapply(
resume$received_callback,
resume[c("race", "gender")],
mean
)
# Natural question: is this statistically significant?
# A proper analysis would take into account the
# paired nature of the data. For each ad, let's
# compute the following statistic:
# <callback rate for white candidates>
# - <callback rate for black candidates>
# First construct the callbacks for white and
# black candidates by ad ID:
table(resume$race)
cb_white <- with(
subset(resume, race == "white"),
tapply(received_callback, job_ad_id, mean)
)
cb_black <- with(
subset(resume, race == "black"),
tapply(received_callback, job_ad_id, mean)
)
# Next, compute the differences, where the
# names(cb_white) part ensures we matched up the
# job ad IDs.
diff <- cb_white - cb_black[names(cb_white)]
# Finally, we can apply a t-test on the differences:
t.test(diff)
# There is very strong evidence of an effect.
# Here's a similar check with gender. There are
# more female-inferred candidates used on the resumes.
table(resume$gender)
cb_male <- with(
subset(resume, gender == "m"),
tapply(received_callback, job_ad_id, mean)
)
cb_female <- with(
subset(resume, gender == "f"),
tapply(received_callback, job_ad_id, mean)
)
diff <- cb_female - cb_male[names(cb_female)]
# t.test() drops missing differences, which limits
# the test to jobs where both a male and female
# resume were sent.
t.test(diff)
# There is no statistically significant difference.
# Was that the best analysis? Absolutely not!
# However, the analysis was unbiased. To get more
# precision on the estimates, we could build a
# multivariate model that includes many characteristics
# of the resumes sent, e.g. years of experience.
# Since those other characteristics were assigned
# independently of the race characteristics, this
# means the race finding will almost certainly
# hold. However, it is possible that we'll find
# more interesting results with the gender investigation.
Sample Responses to Two Public Health Questions
Description
Public health has improved and evolved, but has the public's knowledge changed with it? This dataset explores sample responses for two survey questions posed by Hans Rosling during lectures to a wide array of well-educated audiences.
Usage
rosling_responses
Format
A data frame with 278 rows and 3 variables:
- question
ID for the question being posed.
- response
Noting whether the response was correct or incorrect.
- prob_random_correct
The probability the person would have guessed the answer correctly if they were guessing completely randomly.
Source
The samples we describe are plausible based on the exact rates observed in larger samples. For more info on the actual rates observed, visit https://www.gapminder.org.
Another relevant reference is a book by Hans Rosling, Anna Rosling Ronnlund, and Ola Rosling called Factfulness.
Examples
frac_correct <- tapply(
rosling_responses$response == "correct",
rosling_responses$question,
mean
)
frac_correct
n <- table(rosling_responses$question)
n
expected <- tapply(
rosling_responses$prob_random_correct,
rosling_responses$question,
mean
)
# Construct confidence intervals.
se <- sqrt(frac_correct * (1 - frac_correct) / n)
# Lower bounds.
frac_correct - 1.96 * se
# Upper bounds.
frac_correct + 1.96 * se
# Construct Z-scores and p-values.
z <- (frac_correct - expected) / se
pt(z, df = n - 1)
Russians' Opinions on US Election Influence in 2016
Description
Survey of Russian citizens on whether they believed their government tried to influence the 2016 US election. The survey was taken in Spring 2018 by Pew Research.
Usage
russian_influence_on_us_election_2016
Format
A data frame with 506 observations on the following variable.
- influence_2016
Response of the Russian survey participant to the question of whether their government tried to influence the 2016 election in the United States.
Details
The actual sample size was 1000. However, the original data were not from a simple random sample; after accounting for the design, the equivalent sample size was 506, which was what was used for the dataset here to keep things simpler for intro stat analyses.
Source
Examples
table(russian_influence_on_us_election_2016)
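The Details section reports an actual sample of 1000 and an equivalent sample size of 506. A minimal sketch of what that implies for precision, assuming the usual definition of a design effect as the ratio of actual to effective sample size. The proportion p_hat is a made-up value for illustration, not a survey result.

```r
n_actual <- 1000    # respondents actually surveyed (from Details)
n_effective <- 506  # equivalent simple-random-sample size (from Details)

# Design effect: how much the sampling design inflates the variance
# relative to a simple random sample of the same size.
design_effect <- n_actual / n_effective

# Standard error of a proportion under each sample size.
p_hat <- 0.15  # hypothetical proportion, for illustration only
se_naive <- sqrt(p_hat * (1 - p_hat) / n_actual)
se_design <- sqrt(p_hat * (1 - p_hat) / n_effective)

design_effect
se_design / se_naive  # equals sqrt(design_effect)
```

Because the dataset already has 506 rows, standard intro-stat formulas applied to it directly give standard errors on the se_design scale.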
Sustainability and Economic Indicators for South Africa.
Description
Includes yearly data on GDP, GNI, CO2 emissions, and business startup costs.
Usage
sa_gdp_elec
Format
A data frame with 16 rows and 7 variables.
- year
Year data collected.
- access_elec
Access to electricity as a percentage of the population.
- startup
Cost of business startup procedures as a percent of GNI.
- co2
CO2 emission in kt (kiloton).
- gdp
GDP per capita, PPP in constant 2017 international dollars.
- gni
GNI per capita, PPP in constant 2017 international dollars.
- co2_kg_ppp
CO2 emissions in kg per 2017 PPP dollar of GDP.
Source
Examples
library(ggplot2)
ggplot(sa_gdp_elec, aes(year, access_elec)) +
geom_point(alpha = 0.3) +
labs(
x = "Year",
y = "Percent of Population",
title = "Access to Electricity in South Africa 2003 - 2018"
)
Salinity in Bimini Lagoon, Bahamas
Description
Data collected at three different water masses in the Bimini Lagoon, Bahamas.
Usage
salinity
Format
A data frame with 30 rows and 2 variables.
- site_number
Location where measurements were taken.
- salinity_ppt
Salinity value in parts per thousand.
Source
Till, R. (1974) Statistical Methods for the Earth Scientist: An Introduction. London: Macmillan, 104.
Examples
library(ggplot2)
library(broom)
ggplot(salinity, aes(x = salinity_ppt)) +
geom_dotplot() +
facet_wrap(~site_number, ncol = 1)
tidy(aov(salinity_ppt ~ site_number, data = salinity))
Simulated data for SAT score improvement
Description
Fake data for score improvements from students who took a course from an SAT score improvement company.
Usage
sat_improve
Format
A data frame with 30 observations on the following variable.
- sat_improve
a numeric vector
Examples
sat_improve
SAT and GPA data
Description
SAT and GPA data for 1000 students at an unnamed college.
Usage
satgpa
Format
A data frame with 1000 observations on the following 6 variables.
- sex
Gender of the student.
- sat_v
Verbal SAT percentile.
- sat_m
Math SAT percentile.
- sat_sum
Total of verbal and math SAT percentiles.
- hs_gpa
High school grade point average.
- fy_gpa
First year (college) grade point average.
Source
Educational Testing Service originally collected the data.
References
https://chance.dartmouth.edu/course/Syllabi/Princeton96/ETSValidation.html
Examples
library(ggplot2)
library(broom)
# Verbal scores
ggplot(satgpa, aes(x = sat_v, y = fy_gpa)) +
geom_point() +
geom_smooth(method = "lm") +
labs(
x = "Verbal SAT percentile",
y = "First year (college) grade point average"
)
mod <- lm(fy_gpa ~ sat_v, data = satgpa)
tidy(mod)
# Math scores
ggplot(satgpa, aes(x = sat_m, y = fy_gpa)) +
geom_point() +
geom_smooth(method = "lm") +
labs(
x = "Math SAT percentile",
y = "First year (college) grade point average"
)
mod <- lm(fy_gpa ~ sat_m, data = satgpa)
tidy(mod)
Color scale constructor for OpenIntro IMS colors
Description
Color scale constructor for OpenIntro IMS colors
Usage
scale_color_openintro(palette = "main", discrete = TRUE, reverse = FALSE, ...)
Arguments
palette |
Character name of palette in openintro_palettes |
discrete |
Boolean indicating whether color aesthetic is discrete or not |
reverse |
Boolean indicating whether the palette should be reversed |
... |
Additional arguments passed to |
Examples
library(ggplot2)
# Categorical variable with three levels
ggplot(evals, aes(
x = bty_avg, y = score,
color = rank, shape = rank
)) +
geom_jitter(size = 2, alpha = 0.6) +
scale_color_openintro("three")
# Categorical variable with two levels
ggplot(evals, aes(
x = bty_avg, y = score,
color = language, shape = language
)) +
geom_jitter(size = 2, alpha = 0.6) +
scale_color_openintro("two")
# Continuous variable
# Generates a palette, but not recommended
ggplot(evals, aes(
x = bty_avg, y = score,
color = score
)) +
geom_jitter(size = 2, alpha = 0.8) +
scale_color_openintro(discrete = FALSE)
# For continuous palettes
# use scale_color_gradient instead
ggplot(evals, aes(
x = bty_avg, y = score,
color = score
)) +
geom_jitter(size = 2) +
scale_color_gradient(low = IMSCOL["blue", "full"], high = IMSCOL["blue", "f6"])
ggplot(evals, aes(
x = bty_avg, y = score,
color = cls_perc_eval
)) +
geom_jitter(size = 2) +
scale_color_gradient(low = COL["red", "full"], high = COL["red", "f8"])
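To make the palette and reverse arguments more concrete, here is a base-R sketch of how a palette lookup of this kind could work. It is not the package's implementation: demo_palettes, demo_pal, and the hex colors are hypothetical stand-ins for openintro_palettes and its contents.

```r
# Hypothetical palette list standing in for openintro_palettes.
demo_palettes <- list(
  two = c("#1f77b4", "#d62728"),
  three = c("#1f77b4", "#d62728", "#2ca02c")
)

# Return a function that produces n colors from the named palette,
# optionally reversed, interpolating when n exceeds the palette length.
demo_pal <- function(palette = "three", reverse = FALSE) {
  cols <- demo_palettes[[palette]]
  if (is.null(cols)) stop("Unknown palette: ", palette)
  if (reverse) cols <- rev(cols)
  grDevices::colorRampPalette(cols)
}

two_cols <- demo_pal("two")(2)
five_cols <- demo_pal("three", reverse = TRUE)(5)
two_cols
five_cols
```

A function that maps n to n colors is the general shape a discrete scale needs, which is one reason an interpolated palette can be generated for a continuous variable even when it is not the recommended choice.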
Fill scale constructor for OpenIntro IMS colors
Description
Fill scale constructor for OpenIntro IMS colors
Usage
scale_fill_openintro(palette = "main", discrete = TRUE, reverse = FALSE, ...)
Arguments
palette |
Character name of palette in openintro_palettes |
discrete |
Boolean indicating whether color aesthetic is discrete or not |
reverse |
Boolean indicating whether the palette should be reversed |
... |
Additional arguments passed to |
Examples
library(ggplot2)
library(dplyr)
# Categorical variable with two levels
ggplot(evals, aes(x = ethnicity, fill = ethnicity)) +
geom_bar() +
scale_fill_openintro("two")
# Categorical variable with three levels
ggplot(evals, aes(x = rank, fill = rank)) +
geom_bar() +
scale_fill_openintro("three")
# Continuous variable with levels
# Generates a palette, but may not be the best palette
# in terms of color-blind and grayscale friendliness
ggplot(diamonds, aes(x = clarity, fill = clarity)) +
geom_bar() +
scale_fill_openintro()
# For continuous palettes
# use scale_color_gradient instead
ggplot(evals, aes(
x = bty_avg, y = score,
color = score
)) +
geom_jitter(size = 2) +
scale_color_gradient(low = IMSCOL["blue", "full"], high = IMSCOL["blue", "f6"])
ggplot(evals, aes(
x = bty_avg, y = score,
color = cls_perc_eval
)) +
geom_jitter(size = 2) +
scale_color_gradient(low = IMSCOL["green", "full"], high = IMSCOL["green", "f6"])
Public Opinion with SCOTUS ruling on American Healthcare Act
Description
On June 28, 2012 the U.S. Supreme Court upheld the much debated 2010 healthcare law, declaring it constitutional. A Gallup poll released the day after this decision indicates that 46% of 1,012 Americans agree with this decision.
Usage
scotus_healthcare
Format
A data frame with 1012 observations on the following variable.
- response
Response values reported are agree and other.
Source
Gallup, Americans Issue Split Decision on Healthcare Ruling, retrieved 2012-06-28.
Examples
table(scotus_healthcare)
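The Description gives enough to work out the precision of the poll by hand. The sketch below builds a 95% confidence interval for the proportion agreeing, using only the two figures quoted there (46% of 1,012) and the normal approximation; it does not touch the dataset itself.

```r
n <- 1012      # respondents (from the Description)
p_hat <- 0.46  # share agreeing with the decision (from the Description)

# Standard error of a sample proportion.
se <- sqrt(p_hat * (1 - p_hat) / n)

# 95% confidence interval under the normal approximation.
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se
round(ci, 3)
```

The interval sits entirely below 0.5, which is the kind of conclusion this dataset is typically used to illustrate.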
Names of pets in Seattle
Description
Names of registered pets in Seattle, WA, between 2003 and 2018, provided by the city's Open Data Portal.
Usage
seattlepets
Format
A data frame with 52,519 rows and 7 variables:
- license_issue_date
Date the animal was registered with Seattle
- license_number
Unique license number
- animal_name
Animal's name
- species
Animal's species (dog, cat, goat, etc.)
- primary_breed
Primary breed of the animal
- secondary_breed
Secondary breed if mixed
- zip_code
Zip code animal is registered in
Source
These data come from Seattle's Open Data Portal, https://data.seattle.gov/Community/Seattle-Pet-Licenses/jguv-t9rb
Bank manager recommendations based on sex
Description
Study from the 1970s about whether sex influences hiring recommendations.
Usage
sex_discrimination
Format
A data frame with 48 observations on the following 2 variables.
- sex
a factor with levels female and male
- decision
a factor with levels not promoted and promoted
Source
Rosen B and Jerdee T. 1974. Influence of sex role stereotypes on personnel decisions. Journal of Applied Psychology 59(1):9-14.
Examples
library(ggplot2)
table(sex_discrimination)
ggplot(sex_discrimination, aes(y = sex, fill = decision)) +
geom_bar(position = "fill")
Simpson's Paradox: Covid
Description
A dataset on Delta Variant Covid-19 cases in the UK. This dataset gives a great example of Simpson's Paradox. When aggregating results without regard to age group, the death rate for vaccinated individuals is higher – but they have a much higher risk population. Once we look at populations with more comparable risks (breakout age groups), we see that the vaccinated group tends to be lower risk in each risk-bucketed group and that many of the higher risk patients had gotten vaccinated. The dataset was brought to OpenIntro's attention by Matthew T. Brenneman of Embry-Riddle Aeronautical University. Note: some totals in the original source differ as there were some cases that did not have ages associated with them.
Usage
simpsons_paradox_covid
Format
A data frame with 286,166 rows and 3 variables:
- age_group
Age of the person. Levels: under 50, 50 +.
- vaccine_status
Vaccination status of the person. Note: the vaccinated group includes those who were only partially vaccinated. Levels: vaccinated, unvaccinated.
- outcome
Did the person die from the Delta variant? Levels: death and survived.
Source
Public Health England: Technical briefing 20
Examples
library(dplyr)
library(scales)
# Calculate the mortality rate for all cases by vaccination status
simpsons_paradox_covid |>
group_by(vaccine_status, outcome) |>
summarize(count = n()) |>
ungroup() |>
group_by(vaccine_status) |>
mutate(total = sum(count)) |>
filter(outcome == "death") |>
select(c(vaccine_status, count, total)) |>
mutate(mortality_rate = label_percent(accuracy = 0.01)(round(count / total, 4))) |>
select(-c(count, total))
# Calculate mortality rate by age group and vaccination status
simpsons_paradox_covid |>
group_by(age_group, vaccine_status, outcome) |>
summarize(count = n()) |>
ungroup() |>
group_by(age_group, vaccine_status) |>
mutate(total = sum(count)) |>
filter(outcome == "death") |>
select(c(age_group, vaccine_status, count, total)) |>
mutate(mortality_rate = label_percent(accuracy = 0.01)(round(count / total, 4))) |>
select(-c(count, total))
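The reversal described above can also be reproduced with a handful of numbers and base R alone. The counts below are hypothetical, chosen only so that the vaccinated group is concentrated in the higher-risk age group; they are not the Public Health England figures.

```r
# Hypothetical counts, chosen only to illustrate the reversal.
cases <- data.frame(
  age_group = c("under 50", "under 50", "50 +", "50 +"),
  vaccine_status = c("unvaccinated", "vaccinated", "unvaccinated", "vaccinated"),
  deaths = c(20, 1, 60, 400),
  total = c(100000, 10000, 1000, 20000)
)

# Within each age group, the vaccinated mortality rate is lower.
cases$rate <- cases$deaths / cases$total
cases

# Aggregated over age groups, the ordering flips because most
# vaccinated cases fall in the higher-risk 50 + group.
overall <- aggregate(cbind(deaths, total) ~ vaccine_status, data = cases, FUN = sum)
overall$rate <- overall$deaths / overall$total
overall
```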
Simulated datasets, not necessarily drawn from a normal distribution.
Description
Data were simulated in R, and some of the simulations do not represent data from actual normal distributions.
Usage
simulated_dist
Format
The format is: List of 4 $ d1: dataset of 100 observations. $ d2: dataset of 50 observations. $ d3: num dataset of 500 observations. $ d4: dataset of 15 observations. $ d5: num dataset of 25 observations. $ d6: dataset of 50 observations.
Examples
data(simulated_dist)
lapply(simulated_dist, qqnorm)
Simulated datasets, drawn from a normal distribution.
Description
Data were simulated using rnorm.
Usage
simulated_normal
Format
The format is: List of 3 $ n40 : 40 observations from a standard normal distribution. $ n100: 100 observations from a standard normal distribution. $ n400: 400 observations from a standard normal distribution.
Examples
data(simulated_normal)
lapply(simulated_normal, qqnorm)
Simulated data for sample scatterplots
Description
Fake data.
Usage
simulated_scatter
Format
A data frame with 500 observations on the following 3 variables.
- group
Group, representing data for a specific plot.
- x
x-value.
- y
y-value.
Examples
library(ggplot2)
ggplot(simulated_scatter, aes(x = x, y = y)) +
geom_point() +
facet_wrap(~group)
Sinusitis and antibiotic experiment
Description
Researchers studying the effect of antibiotic treatment for acute sinusitis randomly assigned participants to one of two groups: treatment or control.
Usage
sinusitis
Format
A data frame with 166 observations on the following 2 variables.
- group
a factor with levels control and treatment
- self_reported_improvement
a factor with levels no and yes
Source
J.M. Garbutt et al. Amoxicillin for Acute Rhinosinusitis: A Randomized Controlled Trial. In: JAMA: The Journal of the American Medical Association 307.7 (2012), pp. 685-692.
Examples
sinusitis
Survey on sleep deprivation and transportation workers
Description
The National Sleep Foundation conducted a survey on the sleep habits of randomly sampled transportation workers and a control sample of non-transportation workers.
Usage
sleep_deprivation
Format
A data frame with 1087 observations on the following 2 variables.
- sleep
a factor with levels <6, 6-8, and >8
- profession
a factor with levels bus / taxi / limo drivers, control, pilots, train operators, truck drivers
Source
National Sleep Foundation, 2012 Sleep in America Poll: Transportation Workers' Sleep, 2012. https://www.sleepfoundation.org/professionals/sleep-americar-polls/2012-sleep-america-poll-transportation-workers-sleep
Examples
sleep_deprivation
Smallpox vaccine results
Description
A sample of 6,224 individuals from the year 1721 who were exposed to smallpox in Boston. Some of them had received a vaccine (inoculated) while others had not. Doctors at the time believed that inoculation, which involves exposing a person to the disease in a controlled form, could reduce the likelihood of death.
Usage
smallpox
Format
A data frame with 6224 observations on the following 2 variables.
- result
Whether the person died or lived.
- inoculated
Whether the person was inoculated.
Source
Fenner F. 1988. Smallpox and Its Eradication (History of International Public Health, No. 6). Geneva: World Health Organization. ISBN 92-4-156110-6.
Examples
data(smallpox)
table(smallpox)
UK Smoking Data
Description
Survey data on smoking habits from the UK. The dataset can be used for analyzing the demographic characteristics of smokers and types of tobacco consumed.
Usage
smoking
Format
A data frame with 1691 observations on the following 12 variables.
- gender
Gender with levels Female and Male.
- age
Age.
- marital_status
Marital status with levels Divorced, Married, Separated, Single and Widowed.
- highest_qualification
Highest education level with levels A Levels, Degree, GCSE/CSE, GCSE/O Level, Higher/Sub Degree, No Qualification, ONC/BTEC and Other/Sub Degree.
- nationality
Nationality with levels British, English, Irish, Scottish, Welsh, Other, Refused and Unknown.
- ethnicity
Ethnicity with levels Asian, Black, Chinese, Mixed, White, Refused and Unknown.
- gross_income
Gross income with levels Under 2,600; 2,600 to 5,200; 5,200 to 10,400; 10,400 to 15,600; 15,600 to 20,800; 20,800 to 28,600; 28,600 to 36,400; Above 36,400; Refused; and Unknown.
- region
Region with levels London, Midlands & East Anglia, Scotland, South East, South West, The North and Wales.
- smoke
Smoking status with levels No and Yes.
- amt_weekends
Number of cigarettes smoked per day on weekends.
- amt_weekdays
Number of cigarettes smoked per day on weekdays.
- type
Type of cigarettes smoked with levels Packets, Hand-Rolled, Both/Mainly Packets and Both/Mainly Hand-Rolled.
Source
National STEM Centre, Large Datasets from stats4schools, https://www.stem.org.uk/resources/elibrary/resource/28452/large-datasets-stats4schools.
Examples
library(ggplot2)
ggplot(smoking, aes(x = amt_weekends)) +
geom_histogram(binwidth = 5)
ggplot(smoking, aes(x = amt_weekdays)) +
geom_histogram(binwidth = 5)
ggplot(smoking, aes(x = gender, fill = smoke)) +
geom_bar(position = "fill")
ggplot(smoking, aes(x = marital_status, fill = smoke)) +
geom_bar(position = "fill")
Snowfall at Paradise, Mt. Rainier National Park
Description
Annual snowfall data for Paradise, Mt. Rainier National Park. To include a full winter season, snowfall is recorded from July 1 to June 30. Data from 1943-1946 not available due to road closure during World War II. Records also unavailable from 1948-1954.
Usage
snowfall
Format
A data frame with 100 rows and 3 variables.
- year_start
The year snowfall measurement began on July 1.
- year_end
The year snowfall measurement ended on June 30.
- total_snow
Snowfall measured in inches.
Source
Examples
library(ggplot2)
ggplot(snowfall, aes(x = total_snow)) +
geom_histogram(binwidth = 50) +
labs(
title = "Annual Snowfall",
subtitle = "Paradise, Mt. Rainier National Park",
x = "Snowfall (in.)",
y = "Number of Years",
caption = "Source: National Park Service"
)
ggplot(snowfall, aes(x = year_start, y = total_snow, group = 1)) +
geom_line() +
labs(
title = "Annual Snowfall",
subtitle = "Paradise, Mt. Rainier National Park",
y = "Snowfall (in.)",
x = "Year",
caption = "Source: National Park Service"
)
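Because a season runs from July 1 to June 30, a calendar date maps to a year_start that is not always its own calendar year. A small sketch of that mapping, using a hypothetical helper (season_start_year is not part of the package) and made-up dates:

```r
# Hypothetical helper: the year in which the snowfall season containing
# `date` began, given that a season runs from July 1 to June 30.
season_start_year <- function(date) {
  date <- as.Date(date)
  year <- as.integer(format(date, "%Y"))
  month <- as.integer(format(date, "%m"))
  # July through December belong to the season starting that year;
  # January through June belong to the season that started the year before.
  ifelse(month >= 7L, year, year - 1L)
}

dates <- as.Date(c("2015-12-25", "2016-03-01", "2016-07-01"))
seasons <- data.frame(
  date = dates,
  year_start = season_start_year(dates),
  year_end = season_start_year(dates) + 1L
)
seasons
```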
Social experiment
Description
A "social experiment" conducted by a TV program questioned what people do when they see a very obviously bruised woman getting picked on by her boyfriend. On two different occasions at the same restaurant, the same couple was depicted. In one scenario the woman was dressed "provocatively" and in the other scenario the woman was dressed "conservatively". The table below shows how many restaurant diners were present under each scenario, and whether or not they intervened.
Usage
socialexp
Format
A data frame with 45 observations on the following 2 variables.
- intervene
Whether other diners intervened or not.
- scenario
How the woman was dressed.
Examples
table(socialexp)
soda
Description
A randomly generated dataset of soda preference (cola or orange) based on location.
Usage
soda
Format
A data frame with 60 observations on the following 2 variables.
- drink
Soda preference, cola or orange.
- location
Is the person from the West coast or East coast?
Examples
library(dplyr)
soda |>
count(location, drink)
Energy Output From Two Solar Arrays in San Francisco
Description
The data provide the energy output for several months from two roof-top solar arrays in San Francisco. This city is known for having highly variable weather, so while these two arrays are only about 1 mile apart from each other, the Inner Sunset location tends to have more fog.
Usage
solar
Format
A data frame with 284 observations on the following 3 variables. Each row represents a single day for one of the arrays.
- location
Location for the array.
- date
Date.
- kwh
Number of kWh
Details
The Haight-Ashbury array is a 10.4 kWh array, while the Inner Sunset array is a 2.8 kWh array. The kWh unit stands for kilowatt-hours, the unit of energy typically used on electricity bills. The cost per kWh in San Francisco was about $0.25 in 2016.
Source
These data were provided by Larry Rosenfeld, a resident in San Francisco.
Examples
solar.is <- subset(solar, location == "Inner_Sunset")
solar.ha <- subset(solar, location == "Haight_Ashbury")
plot(solar.is$date, solar.is$kwh, type = "l", ylim = c(0, max(solar$kwh)))
lines(solar.ha$date, solar.ha$kwh, col = 4)
d <- merge(solar.ha, solar.is, by = "date")
plot(d$date, d$kwh.x / d$kwh.y, type = "l")
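Since the two arrays differ in size, raw output is not directly comparable between locations. The sketch below scales output by the array sizes given in Details and converts it to an approximate dollar value at the quoted 2016 rate. The daily output values are made up for illustration and are not taken from the dataset.

```r
# Array sizes as stated in Details.
array_size <- c(Haight_Ashbury = 10.4, Inner_Sunset = 2.8)

# Made-up output (kWh) for a single day at each location.
daily_kwh <- c(Haight_Ashbury = 41.6, Inner_Sunset = 8.4)

# Output per unit of array size, so the two locations are comparable.
per_size <- daily_kwh / array_size
per_size

# Approximate value of the day's output at about $0.25 per kWh (2016).
value_usd <- daily_kwh * 0.25
value_usd
```

The same scaling could be applied to the kwh column after splitting by location, as an alternative to the raw ratio plotted above.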
SOWC Child Mortality Data.
Description
Child mortality data from UNICEF's State of the World's Children 2019 Statistical Tables.
Usage
sowc_child_mortality
Format
A data frame with 195 rows and 19 variables.
- countries_and_areas
Country or area name.
- under5_mortality_1990
Under-5 mortality rate (deaths per 1,000 live births) in 1990.
- under5_mortality_2000
Under-5 mortality rate (deaths per 1,000 live births) in 2000.
- under5_mortality_2018
Under-5 mortality rate (deaths per 1,000 live births) in 2018.
- under5_reduction
Annual rate of reduction in under-5 mortality rate (%), 2000–2018.
- under5_mortality_2018_male
Under-5 mortality rate male (deaths per 1,000 live births) 2018.
- under5_mortality_2018_female
Under-5 mortality rate female (deaths per 1,000 live births) 2018.
- infant_mortality_1990
Infant mortality rate (deaths per 1,000 live births) 1990.
- infant_mortality_2018
Infant mortality rate (deaths per 1,000 live births) 2018.
- neonatal_mortality_1990
Neonatal mortality rate (deaths per 1,000 live births) 1990.
- neonatal_mortality_2000
Neonatal mortality rate (deaths per 1,000 live births) 2000.
- neonatal_mortality_2018
Neonatal mortality rate (deaths per 1,000 live births) 2018.
- prob_dying_age5to14_1990
Probability of dying among children aged 5–14 (deaths per 1,000 children aged 5) 1990.
- prob_dying_age5to14_2018
Probability of dying among children aged 5–14 (deaths per 1,000 children aged 5) 2018.
- under5_deaths_2018
Annual number of under-5 deaths (thousands) 2018.
- neonatal_deaths_2018
Annual number of neonatal deaths (thousands) 2018.
- neonatal_deaths_percent_under5
Neonatal deaths as proportion of all under-5 deaths (%) 2018.
- age5to14_deaths_2018
Number of deaths among children aged 5–14 (thousands) 2018.
Source
United Nations Children's Emergency Fund (UNICEF)
Examples
library(dplyr)
library(ggplot2)
# List countries and areas where children aged 5 to 14 had a higher probability of dying in
# 2018 than in 1990
sowc_child_mortality |>
mutate(decrease_prob_dying = prob_dying_age5to14_1990 - prob_dying_age5to14_2018) |>
select(countries_and_areas, decrease_prob_dying) |>
filter(decrease_prob_dying < 0) |>
arrange(decrease_prob_dying)
# List countries and areas and their relative rank for neonatal mortality in 2018
sowc_child_mortality |>
mutate(rank = round(rank(-neonatal_mortality_2018))) |>
select(countries_and_areas, rank, neonatal_mortality_2018) |>
arrange(rank)
SOWC Demographics Data.
Description
Demographic data from UNICEF's State of the World's Children 2019 Statistical Tables.
Usage
sowc_demographics
Format
A data frame with 202 rows and 18 variables.
- countries_and_areas
Country or area name.
- total_pop_2018
Population in 2018 in thousands.
- under18_pop_2018
Population under age 18 in 2018 in thousands.
- under5_pop_2018
Population under age 5 in 2018 in thousands.
- pop_growth_rate_2018
Rate at which population is growing in 2018.
- pop_growth_rate_2030
Rate at which population is estimated to grow in 2030.
- births_2018
Number of births in 2018 in thousands.
- fertility_2018
Number of live births per woman in 2018. A total fertility level of 2.1 is called replacement level and represents a level at which the population would remain the same size.
- life_expectancy_1970
Life expectancy at birth in 1970.
- life_expectancy_2000
Life expectancy at birth in 2000.
- life_expectancy_2018
Life expectancy at birth in 2018.
- dependency_ratio_total
The ratio of the not-working-age population to the working-age population of 15 - 64 years.
- dependency_ratio_child
The ratio of the under 15 population to the working-age population of 15 - 64 years.
- dependency_ratio_oldage
The ratio of the over 64 population to the working-age population of 15 - 64 years.
- percent_urban_2018
Percent of population living in urban areas.
- pop_urban_growth_rate_2018
Annual urban population growth rate from 2000 to 2018.
- pop_urban_growth_rate_2030
Estimated annual urban population growth rate from 2018 to 2030.
- migration_rate
Net migration rate per 1000 population from 2015 to 2020.
Source
United Nations Children's Emergency Fund (UNICEF)
Examples
library(dplyr)
library(ggplot2)
# List countries and areas' life expectancy, ordered by rank of life expectancy in 2018
sowc_demographics |>
mutate(life_expectancy_change = life_expectancy_2018 - life_expectancy_1970) |>
mutate(rank_life_expectancy = round(rank(-life_expectancy_2018), 0)) |>
select(
countries_and_areas, rank_life_expectancy, life_expectancy_2018,
life_expectancy_change
) |>
arrange(rank_life_expectancy)
# List countries and areas' migration rate and population, ordered by rank of migration rate
sowc_demographics |>
mutate(rank = round(rank(migration_rate))) |>
mutate(population_millions = total_pop_2018 / 1000) |>
select(countries_and_areas, rank, migration_rate, population_millions) |>
arrange(rank)
# Scatterplot of life expectancy in 1970 v. 2018, with point size showing 2018 population
ggplot(sowc_demographics, aes(life_expectancy_1970, life_expectancy_2018, size = total_pop_2018)) +
geom_point(alpha = 0.5) +
labs(
title = "Life Expectancy",
subtitle = "1970 v. 2018",
x = "Life Expectancy in 1970",
y = "Life Expectancy in 2018",
size = "2018 Total Population"
)
SOWC Maternal and Newborn Health Data.
Description
Data from UNICEF's State of the World's Children 2019 Statistical Tables.
Usage
sowc_maternal_newborn
Format
A data frame with 202 rows and 18 variables.
- countries_and_areas
Country or area name.
- life_expectancy_female
Life expectancy: female in 2018.
- family_planning_1549
Demand for family planning satisfied with modern methods (%) 2013–2018 Women aged 15 to 49.
- family_planning_1519
Demand for family planning satisfied with modern methods (%) 2013–2018 Women aged 15 to 19.
- adolescent_birth_rate
Adolescent birth rate 2013 to 2018.
- births_age_18
Births by age 18 (%) 2013 to 2018.
- antenatal_care_1
Antenatal care (%) 2013 to 2018 At least one visit.
- antenatal_care_4_1549
Antenatal care (%) 2013 to 2018 At least four visits Women aged 15 to 49.
- antenatal_care_4_1519
Antenatal care (%) 2013 to 2018 At least four visits Women aged 15 to 19.
- delivery_care_attendant_1549
Delivery care (%) 2013 to 2018 Skilled birth attendant Women aged 15 to 49.
- delivery_care_attendant_1519
Delivery care (%) 2013 to 2018 Skilled birth attendant Women aged 15 to 19.
- delivery_care_institutional
Delivery care (%) 2013 to 2018 Institutional delivery.
- c_section
Delivery care (%) 2013–2018 C-section.
- postnatal_health_newborns
Postnatal health check (%) 2013 to 2018 For newborns.
- postnatal_health_mothers
Postnatal health check (%) 2013 to 2018 For mothers.
- maternal_deaths_2017
Maternal mortality 2017 Number of maternal deaths.
- maternal_mortality_ratio_2017
Maternal mortality 2017 Maternal Mortality Ratio.
- risk_maternal_death_2017
Maternal mortality 2017 Lifetime risk of maternal death (1 in X).
Source
United Nations Children's Emergency Fund (UNICEF)
Examples
library(dplyr)
library(ggplot2)
# List countries and lifetime risk of maternal death (1 in X), ranked
sowc_maternal_newborn |>
mutate(rank = round(rank(risk_maternal_death_2017), 0)) |>
select(countries_and_areas, rank, risk_maternal_death_2017) |>
arrange(rank)
# Graph scatterplot of Maternal Mortality Ratio 2017 and Antenatal Care 4+ Visits %
sowc_maternal_newborn |>
select(antenatal_care_4_1549, maternal_mortality_ratio_2017) |>
remove_missing(na.rm = TRUE) |>
ggplot(aes(antenatal_care_4_1549, maternal_mortality_ratio_2017)) +
geom_point(alpha = 0.5) +
labs(
title = "Antenatal Care and Mortality",
x = "Antenatal Care 4+ visits %",
y = "Maternal Mortality Ratio"
)
Financial information for 50 S&P 500 companies
Description
Fifty companies were randomly sampled from the 500 companies in the S&P 500, and their financial information was collected on March 8, 2012.
Usage
sp500
Format
A data frame with 50 observations on the following 12 variables.
- market_cap
Total value of all company shares, in millions of dollars.
- stock
The name of the stock (e.g. AAPL for Apple).
- ent_value
Enterprise value, which is an alternative to market cap that also accounts for things like cash and debt, in millions of dollars.
- trail_pe
The market cap divided by the earnings (profits) over the last year.
- forward_pe
The market cap divided by the forecasted earnings (profits) over the next year.
- ev_over_rev
Enterprise value divided by the company's revenue.
- profit_margin
Percent of earnings that are profits.
- revenue
Revenue, in millions of dollars.
- growth
Quarterly revenue growth (year over year), in millions of dollars.
- earn_before
Earnings before interest, taxes, depreciation, and amortization, in millions of dollars.
- cash
Total cash, in millions of dollars.
- debt
Total debt, in millions of dollars.
Source
Yahoo! Finance, retrieved 2012-03-08.
Examples
library(ggplot2)
ggplot(sp500, aes(x = ent_value, y = earn_before)) +
geom_point() +
labs(x = "Enterprise value", y = "Earnings")
ggplot(sp500, aes(x = ev_over_rev, y = forward_pe)) +
geom_point() +
labs(
x = "Enterprise value / revenue",
y = "Market cap / forecasted earnings"
)
ggplot(sp500, aes(x = ent_value, y = earn_before)) +
geom_point() +
scale_x_log10() +
scale_y_log10() +
labs(x = "Enterprise value", y = "Earnings")
ggplot(sp500, aes(x = ev_over_rev, y = forward_pe)) +
geom_point() +
scale_x_log10() +
scale_y_log10() +
labs(
x = "Enterprise value / revenue, logged",
y = "Market cap / forecasted earnings, logged"
)
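Several of the variables above are ratios of other columns. As a minimal sketch, the block below recomputes enterprise value over revenue and its base-10 logarithm, which is the transformation applied by scale_x_log10() in the logged plots. The tickers and amounts are made up for illustration and are not taken from sp500.

```r
# Made-up values in millions of dollars, shaped like a few `sp500` columns.
toy_sp500 <- data.frame(
  stock = c("AAA", "BBB", "CCC"), # hypothetical tickers
  ent_value = c(12000, 4500, 800),
  revenue = c(6000, 1500, 1600)
)

# Enterprise value divided by revenue, as described for `ev_over_rev`.
toy_sp500$ev_over_rev <- toy_sp500$ent_value / toy_sp500$revenue

# Base-10 log of the ratio, the scale used on the logged axes.
toy_sp500$log10_ev_over_rev <- log10(toy_sp500$ev_over_rev)
toy_sp500
```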
Daily observations for the S&P 500
Description
Data runs from 1950 to near the end of 2018.
Usage
sp500_1950_2018
Format
A data frame with 17346 observations on the following 7 variables.
- Date
Date of the form "YYYY-MM-DD".
- Open
Opening price.
- High
Highest price of the day.
- Low
Lowest price of the day.
- Close
Closing price of the day.
- Adj.Close
Adjusted price at close after accounting for dividends paid out.
- Volume
Trading volume.
Source
Yahoo! Finance
Examples
data(sp500_1950_2018)
sp500.ten.years <- subset(
sp500_1950_2018,
"2009-01-01" <= as.Date(Date) & as.Date(Date) <= "2018-12-31"
)
d <- diff(sp500.ten.years$Adj.Close)
mean(d > 0)
S&P 500 stock data
Description
Daily stock returns from the S&P 500 for 1990-2011 can be used to assess whether stock activity each day is independent of the stock's behavior on previous days. We label each day as Up (U) or Down (D) depending on whether the market was up or down that day. For example, consider the following changes in price, their new labels of up and down, and then the number of days that must be observed before each Up day.
Usage
sp500_seq
Format
A data frame with 2948 observations on the following variable.
- race
a factor with levels 1, 2, 3, 4, 5, 6, and 7+
Source
Examples
sp500_seq
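The Description outlines how daily price changes are labeled Up or Down and then converted into the number of days observed until each Up day. The sketch below shows one way this could be computed. The helper name wait_until_up, the made-up price changes, and the conventions used (an Up day counts toward its own wait, and a zero change is treated as Down) are assumptions, not necessarily how sp500_seq was built.

```r
# Hypothetical helper: number of days observed until each Up day,
# counting the Up day itself. A change of exactly 0 is treated as Down.
wait_until_up <- function(change) {
  up_days <- which(change > 0)
  diff(c(0, up_days))
}

# Made-up daily price changes (not real S&P 500 data).
changes <- c(2.52, -5.17, -1.90, 0.43, 4.09, -3.05, 1.21)
waits <- wait_until_up(changes)
waits

# Collapse long waits into a "7+" level, matching the factor levels above.
wait_labels <- ifelse(waits >= 7, "7+", as.character(waits))
wait_factor <- factor(wait_labels, levels = c(1:6, "7+"))
table(wait_factor)
```

If days were independent with a constant chance of an Up day, these waits would follow a geometric distribution, which is the comparison the dataset is meant to support.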
Speed, gender, and height of 1325 students
Description
1,325 UCLA students were asked to fill out a survey where they were asked about their height, fastest speed they have ever driven, and gender.
Usage
speed_gender_height
Format
A data frame with 1325 observations on the following 3 variables.
- speed
a numeric vector
- gender
a factor with levels female and male
- height
a numeric vector
Examples
speed_gender_height
SSD read and write speeds
Description
User submitted data on 1TB solid state drives (SSD).
Usage
ssd_speed
Format
A data frame with 54 rows and 7 variables.
- brand
Brand name of the drive.
- model
Model name of the drive.
- samples
Number of user submitted benchmarks.
- form_factor
Physical form of the drive with levels 2.5, m.2, and mSATA.
- nvme
1 if the drive uses the NVMe protocol, 0 if it does not.
- read
Average read speed from user benchmarks in MB/s.
- write
Average write speed from user benchmarks in MB/s.
Source
UserBenchmark, retrieved September 1, 2020.
Examples
library(ggplot2)
library(dplyr)
ssd_speed |>
count(form_factor)
ssd_speed |>
filter(form_factor != "mSATA") |>
ggplot(aes(x = read, y = write, color = form_factor)) +
geom_point() +
labs(
title = "Average read vs. write speed of SSDs",
x = "Read speed (MB/s)",
y = "Write speed (MB/s)"
) +
facet_wrap(~form_factor, ncol = 1, scales = "free") +
guides(color = "none")
Starbucks nutrition
Description
Nutrition facts for several Starbucks food items
Usage
starbucks
Format
A data frame with 77 observations on the following 7 variables.
- item
Food item.
- calories
Calories.
- fat
a numeric vector
- carb
a numeric vector
- fiber
a numeric vector
- protein
a numeric vector
- type
a factor with levels bakery, bistro box, hot breakfast, parfait, petite, salad, and sandwich
Source
https://www.starbucks.com/menu, retrieved 2011-03-10.
Examples
starbucks
Final exam scores for twenty students
Description
Scores range from 57 to 94.
Usage
stats_scores
Format
A data frame with 20 observations on the following variable.
- scores
a numeric vector
Examples
stats_scores
Embryonic stem cells to treat heart attack (in sheep)
Description
Does treatment using embryonic stem cells (ESCs) help improve heart function following a heart attack? Each sheep in the study was randomly assigned to the ESC or control group, and the change in their hearts' pumping capacity was measured in the study. A positive value corresponds to increased pumping capacity, which generally suggests a stronger recovery.
Usage
stem_cell
Format
A data frame with 18 observations on the following 3 variables.
- trmt
a factor with levels ctrl and esc
- before
a numeric vector
- after
a numeric vector
Source
doi:10.1016/S0140-6736(05)67380-1
Examples
stem_cell
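The Description refers to the change in pumping capacity, which can be obtained as after minus before and then compared between groups. The following is a minimal sketch on made-up values shaped like stem_cell; the numbers are illustrative only and do not come from the study.

```r
# Made-up rows shaped like `stem_cell` (not the study's measurements).
toy_stem <- data.frame(
  trmt = factor(c("ctrl", "ctrl", "ctrl", "esc", "esc", "esc")),
  before = c(30, 28, 32, 29, 31, 30),
  after = c(27, 26, 28, 33, 34, 32)
)

# Positive values correspond to increased pumping capacity.
toy_stem$change <- toy_stem$after - toy_stem$before

# Mean change within each treatment group, and the difference between groups.
mean_change <- tapply(toy_stem$change, toy_stem$trmt, mean)
diff_means <- mean_change[["esc"]] - mean_change[["ctrl"]]
mean_change
diff_means
```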
Stents for the treatment of stroke
Description
An experiment that studies effectiveness of stents in treating patients at risk of stroke with some unexpected results. stent30 represents the results 30 days after stroke and stent365 represents the results 365 days after stroke.
Usage
stent30
Format
A data frame with 451 observations on the following 2 variables.
- group
a factor with levels control and treatment
- outcome
a factor with levels no event and stroke
Source
Chimowitz MI, Lynn MJ, Derdeyn CP, et al. 2011. Stenting versus Aggressive Medical Therapy for Intracranial Arterial Stenosis. New England Journal of Medicine 365:993-1003. doi:10.1056/NEJMoa1105335. NY Times article reporting on the study: https://www.nytimes.com/2011/09/08/health/research/08stent.html.
Examples
# 30-day results
table(stent30)
# 365-day results
table(stent365)
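The contingency tables above are often summarized as the stroke rate within each group and the difference between those rates. A minimal sketch follows, using a made-up data frame with the same two factors as stent30; the counts are illustrative only.

```r
# Made-up data shaped like `stent30` (10 patients per group, not real counts).
toy_stent <- data.frame(
  group = factor(rep(c("control", "treatment"), times = c(10, 10))),
  outcome = factor(c(
    rep(c("no event", "stroke"), times = c(9, 1)),
    rep(c("no event", "stroke"), times = c(7, 3))
  ))
)

# Two-way table of counts, then row proportions (outcome rates within group).
counts <- table(toy_stent)
props <- prop.table(counts, margin = 1)
props

# Difference in stroke rates: treatment minus control.
diff_stroke <- props["treatment", "stroke"] - props["control", "stroke"]
diff_stroke
```

With the package loaded, replacing toy_stent with stent30 or stent365 would give the corresponding rates for the study data.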
Monthly Returns for a few stocks
Description
Monthly return data for a few stocks, which covers stock prices from November 2015 through October 2018.
Usage
stocks_18
Format
A data frame with 36 observations on the following 4 variables.
- date
First day of the month corresponding to the returns.
- goog
Google stock price change.
- cat
Caterpillar stock price change.
- xom
Exxon Mobil stock price change.
Source
Yahoo! Finance, direct download.
Examples
d <- stocks_18
dim(d)
apply(d[, 2:4], 2, mean)
apply(d[, 2:4], 2, sd)
Community college housing (simulated data, 2015)
Description
These are simulated data intended to represent the housing prices of students at a college.
Usage
student_housing
Format
A data frame with 175 observations on the following variable.
- price
Monthly housing price, simulated.
Examples
set.seed(5)
generate_student_housing <- data.frame(
price = round(rnorm(175, 515, 65) + exp(rnorm(175, 4.2, 1)))
)
hist(student_housing$price, 20)
t.test(student_housing$price)
mean(student_housing$price)
sd(student_housing$price)
identical(student_housing, generate_student_housing)
Sleep for 110 students (simulated)
Description
A simulated dataset for how much 110 college students each slept in a single night.
Usage
student_sleep
Format
A data frame with 110 observations on the following variable.
- hours
Number of hours slept by this student (simulated).
Source
Simulated data.
Examples
set.seed(2)
x <- exp(c(
rnorm(100, log(7.5), 0.15),
rnorm(10, log(10), 0.196)
))
x <- round(x - mean(x) + 7.42, 2)
identical(x, student_sleep$hours)
Treating heart attacks
Description
Experiment data for studying the efficacy of treating patients who have had a heart attack with Sulphinpyrazone.
Usage
sulphinpyrazone
Format
A data frame with 1475 observations on the following 2 variables.
- group
a factor with levels control and treatment
- outcome
a factor with levels died and lived
Source
Anturane Reinfarction Trial Research Group. 1980. Sulfinpyrazone in the prevention of sudden death after myocardial infarction. New England Journal of Medicine 302(5):250-256.
Examples
sulphinpyrazone
Supreme Court approval rating
Description
Summary of a random survey of 976 people.
Usage
supreme_court
Format
A data frame with 976 observations on the following variable.
- answer
a factor with levels approve and not
Source
Examples
supreme_court
Teacher Salaries in St. Louis, Michigan
Description
This dataset contains teacher salaries from 2009-2010 for 71 teachers employed by the St. Louis Public School in Michigan, as well as several covariates.
Usage
teacher
Format
A data frame with 71 observations on the following 8 variables.
- id
Identification code for each teacher, assigned randomly.
- degree
Highest educational degree attained: BA (bachelor's degree) or MA (master's degree).
- fte
Full-time enrollment status: full-time 1 or part-time 0.5.
- years
Number of years employed by the school district.
- base
Base annual salary, in dollars.
- fica
Amount paid into Social Security and Medicare per year through the Federal Insurance Contribution Act (FICA), in dollars.
- retirement
Amount paid into the retirement fund of the teacher per year, in dollars.
- total
Total annual salary of the teacher, resulting from the sum of base salary + fica + retirement, in dollars.
Source
Originally posted on SODA Developers (dev.socrata.com/data), removed in 2020.
Examples
library(ggplot2)
# Salary and education level
ggplot(teacher, aes(x = degree, y = base)) +
geom_boxplot() +
labs(
x = "Highest educational degree attained",
y = "Base annual salary, in $",
color = "Degree",
title = "Salary and education level"
)
# Salary and years of employment
ggplot(teacher, aes(x = years, y = base, color = degree)) +
geom_point() +
labs(
x = "Number of years employed by the school district",
y = "Base annual salary, in $",
color = "Degree",
title = "Salary and years of employment"
)
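The total column is described as the sum of base salary, FICA, and retirement contributions. The sketch below rebuilds that sum and the share of total compensation made up by base salary, using made-up rows shaped like teacher; the dollar amounts are illustrative only.

```r
# Made-up rows shaped like `teacher` (dollar amounts are illustrative).
toy_teacher <- data.frame(
  id = 1:3,
  base = c(40000, 52000, 61000),
  fica = c(3060, 3978, 4666.5),
  retirement = c(6800, 8840, 10370)
)

# Total annual salary as described: base + fica + retirement.
toy_teacher$total <- with(toy_teacher, base + fica + retirement)

# Share of the total that is base salary.
toy_teacher$base_share <- toy_teacher$base / toy_teacher$total
toy_teacher
```

On the real data, a check such as all.equal(teacher$total, teacher$base + teacher$fica + teacher$retirement) could confirm whether the documented relationship holds exactly.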
Textbook data for UCLA Bookstore and Amazon
Description
A random sample was taken of nearly 10% of UCLA courses. The most expensive textbook for each course was identified, and its new price at the UCLA Bookstore and on Amazon.com was recorded.
Usage
textbooks
Format
A data frame with 73 observations on the following 7 variables.
- dept_abbr
Course department (abbreviated).
- course
Course number.
- isbn
Book ISBN.
- ucla_new
New price at the UCLA Bookstore.
- amaz_new
New price on Amazon.com.
- more
Whether additional books were required for the course (Y means "yes, additional books were required").
- diff
The UCLA Bookstore price minus the Amazon.com price for each book.
Details
The sample represents only courses where textbooks were listed online through UCLA Bookstore's website. The most expensive textbook was selected based on the UCLA Bookstore price, which may insert bias into the data; for this reason, it may be beneficial to analyze only the data where more is "N".
Source
Collected by David Diez.
Examples
library(ggplot2)
ggplot(textbooks, aes(x = diff)) +
geom_histogram(binwidth = 5)
t.test(textbooks$diff)
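Because diff is the bookstore price minus the Amazon price for the same book, a one-sample t-test on diff is equivalent to a paired t-test on the two price columns. The sketch below illustrates this on made-up prices (not rows from textbooks).

```r
# Made-up prices shaped like the `ucla_new` and `amaz_new` columns.
toy_books <- data.frame(
  ucla_new = c(30, 45, 20, 60, 25),
  amaz_new = c(28, 40, 21, 52, 25)
)

# Bookstore price minus Amazon price, as described for `diff`.
toy_books$diff <- toy_books$ucla_new - toy_books$amaz_new

# One-sample test on the differences and paired test on the two columns.
one_sample <- t.test(toy_books$diff)
paired <- t.test(toy_books$ucla_new, toy_books$amaz_new, paired = TRUE)

# The two approaches give the same test statistic and p-value.
all.equal(unname(one_sample$statistic), unname(paired$statistic))
all.equal(one_sample$p.value, paired$p.value)
```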
Thanksgiving spending, simulated based on Gallup poll.
Description
This entry gives simulated spending data for Americans during Thanksgiving in 2009 based on findings of a Gallup poll.
Usage
thanksgiving_spend
Format
A data frame with 436 observations on the following 1 variable.
- spending
Amount of spending, in US dollars.
Examples
library(ggplot2)
ggplot(thanksgiving_spend, aes(x = spending)) +
geom_histogram(binwidth = 20)
Tip data
Description
A simulated dataset of tips over a few weeks on a couple of days per week. Each tip is associated with a single group, which may include several bills and tables (i.e. groups paid in one lump sum in simulations).
Usage
tips
Format
A data frame with 95 observations on the following 5 variables.
- week
Week number.
- day
Day, either Friday or Tuesday.
- n_peop
Number of people associated with the group.
- bill
Total bill for the group.
- tip
Total tip from the group.
Details
This dataset was built using simulations of tables, then bills, then tips based on the bills. Large groups were assumed to only pay the gratuity, which is evident in the data. Tips were set to be plausible round values; they were often (but not always) rounded to dollars, quarters, etc.
Source
Simulated dataset.
Examples
library(ggplot2)
ggplot(tips, aes(x = day, y = tip)) +
geom_boxplot()
ggplot(tips, aes(x = tip, fill = factor(week))) +
geom_density(alpha = 0.5) +
labs(x = "Tip", y = "Density", fill = "Week")
ggplot(tips, aes(x = tip)) +
geom_dotplot()
ggplot(tips, aes(x = tip, fill = factor(day))) +
geom_density(alpha = 0.5) +
labs(x = "Tip", y = "Density", fill = "Day")
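Since tips in this dataset are based on the bills, it can be more informative to look at the tip as a percentage of the bill. A minimal sketch on made-up rows shaped like tips follows; the column tip_pct is an illustrative addition, not part of the dataset.

```r
# Made-up rows shaped like `tips` (values are illustrative).
toy_tips <- data.frame(
  day = c("Friday", "Friday", "Tuesday", "Tuesday"),
  n_peop = c(2, 6, 3, 1),
  bill = c(40, 150, 60, 20),
  tip = c(8, 27, 9, 4)
)

# Tip as a percentage of the bill.
toy_tips$tip_pct <- 100 * toy_tips$tip / toy_tips$bill

# Average tip percentage by day.
pct_by_day <- aggregate(tip_pct ~ day, data = toy_tips, FUN = mean)
pct_by_day
```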
Simulated polling dataset
Description
Simulated data for a fake political candidate.
Usage
toohey
Format
A data frame with 500 observations on the following variable.
- vote_for
a factor with levels no and yes
Examples
toohey
Turkey tourism
Description
Summary of tourism in Turkey.
Usage
tourism
Format
A data frame with 47 observations on the following 3 variables.
- year
a numeric vector
- visitor_count_tho
a numeric vector
- tourist_spending
a numeric vector
Source
Association of Turkish Travel Agencies, Foreign Visitors Figure & Tourist Spendings By Years. http://www.tursab.org.tr/en/statistics/foreign-visitors-figure-tourist-spendings-by-years_1083.html
Examples
tourism
Simulated dataset for ANOVA
Description
Simulated dataset for building intuition about the ideas that ANOVA is based on.
Usage
toy_anova
Format
A data frame with 70 observations on the following 3 variables.
- group
a factor with levels I, II, and III
- outcome
a numeric vector
Examples
toy_anova
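The intuition behind ANOVA is a comparison of the variability between group means with the variability within groups. The sketch below computes the F statistic by hand and checks it against anova() on a linear model. It simulates its own three-group data so it runs without the package; the group means, sample sizes, and seed are arbitrary assumptions.

```r
set.seed(1)

# Simulated data with the same column names as `toy_anova` (values are arbitrary).
toy <- data.frame(
  group = factor(rep(c("I", "II", "III"), each = 10)),
  outcome = rnorm(30, mean = rep(c(0, 0.5, 1), each = 10))
)

group_means <- tapply(toy$outcome, toy$group, mean)
group_sizes <- tapply(toy$outcome, toy$group, length)
grand_mean <- mean(toy$outcome)
k <- nlevels(toy$group)
n <- nrow(toy)

# Mean square between groups: variability of group means around the grand mean.
msg <- sum(group_sizes * (group_means - grand_mean)^2) / (k - 1)

# Mean square error: variability of observations around their own group mean.
mse <- sum((toy$outcome - group_means[as.character(toy$group)])^2) / (n - k)

f_by_hand <- msg / mse
f_from_lm <- anova(lm(outcome ~ group, data = toy))[["F value"]][1]

all.equal(f_by_hand, f_from_lm)
```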
Transplant consultant success rate (fake data)
Description
Summarizing whether there was or was not a complication for 62 patients who used a particular medical consultant.
Usage
transplant
Format
A data frame with 62 observations on the following variable.
- outcome
a factor with levels complications and okay
Examples
transplant
Construct tree diagrams
Description
Construct beautiful tree diagrams
Usage
treeDiag(
main,
p1,
p2,
out1 = c("Yes", "No"),
out2 = c("Yes", "No"),
textwd = 0.15,
solwd = 0.2,
SBS = c(TRUE, TRUE),
showSol = TRUE,
solSub = NULL,
digits = 4,
textadj = 0.015,
cex.main = 1.3,
col.main = "#999999",
showWork = FALSE
)
Arguments
main |
Character vector with two variable names, descriptions, or questions |
p1 |
Vector of probabilities for the primary branches |
p2 |
List for the secondary branches, where each list item should be a
numerical vector of probabilities corresponding to the primary branches of
|
out1 |
Character vector of the outcomes corresponding to the primary branches |
out2 |
Character vector of the outcomes corresponding to the secondary branches |
textwd |
The width provided for text, with a default of 0.15 |
solwd |
The width provided for the solution, with a default of 0.2 |
SBS |
A boolean vector indicating whether to place text and probability side-by-side for the primary and secondary branches |
showSol |
Boolean indicating whether to show the solution in the tree diagram |
solSub |
An optional list of vectors corresponding to |
digits |
The number of digits to show in the solution |
textadj |
Vertical adjustment of text |
cex.main |
Size of main |
col.main |
Color of main |
showWork |
Whether work should be shown for the solutions |
Author(s)
David Diez, Christopher Barr
Examples
treeDiag(
c("Flight on time?", "Luggage on time?"),
c(0.8, 0.2), list(c(0.97, 0.03), c(0.15, 0.85))
)
treeDiag(c("Breakfast?", "Go to class"), c(.4, .6),
list(c(0.4, 0.36, 0.24), c(0.6, 0.3, 0.1)), c("Yes", "No"),
c("Statistics", "English", "Sociology"),
showWork = TRUE
)
treeDiag(
c("Breakfast?", "Go to class"), c(0.4, 0.11, 0.49),
list(c(0.4, 0.36, 0.24), c(0.6, 0.3, 0.1), c(0.1, 0.4, 0.5)),
c("one", "two", "three"), c("Statistics", "English", "Sociology")
)
treeDiag(c("Dow Jones rise?", "NASDAQ rise?"),
c(0.53, 0.47), list(c(0.75, 0.25), c(0.72, 0.28)),
solSub = list(c("(a)", "(b)"), c("(c)", "(d)")), solwd = 0.08
)
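The probability at the end of each path in a tree diagram is the product of the probabilities along that path. As a minimal sketch of that arithmetic, the block below recomputes the joint probabilities for the first example (flight and luggage) in base R. It does not call treeDiag and is not a description of the function's internals; the outcome labels are added for illustration.

```r
# Primary and secondary branch probabilities from the first example above.
p1 <- c(0.8, 0.2)
p2 <- list(c(0.97, 0.03), c(0.15, 0.85))

# Joint probability of each path: primary probability times secondary probability.
joint <- unlist(Map(function(primary, secondary) primary * secondary, p1, p2))
names(joint) <- c("Yes & Yes", "Yes & No", "No & Yes", "No & No")
joint

# The paths are exhaustive, so the joint probabilities sum to 1.
sum(joint)

# Marginal probability that luggage is on time, summing over both flight outcomes.
p_luggage_on_time <- joint[["Yes & Yes"]] + joint[["No & Yes"]]
p_luggage_on_time
```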
twins
Description
A data frame containing data collected in the mid 20th century by Cyril Burt from a study that tracked down identical twins who were separated at birth: one child was raised in the home of their biological parents and the other in a foster home. In an attempt to answer the question of whether intelligence is the result of nature or nurture, both children were given IQ tests.
Usage
twins
Format
A data frame with 27 observations on the following 2 variables.
- foster
IQ score of the twin raised by foster parents.
- biological
IQ score of the twin raised by biological parents.
Examples
library(ggplot2)
library(dplyr)
library(tidyr)
plot_data <- twins |>
pivot_longer(cols = c(foster, biological), names_to = "twin", values_to = "iq")
ggplot(plot_data, aes(iq, fill = twin)) +
geom_histogram(color = "white", binwidth = 5) +
facet_wrap(~twin) +
theme_minimal() +
labs(
title = "IQ of identical twins",
subtitle = "Separated at birth",
x = "IQ",
y = "Count",
fill = ""
)
UCLA courses in Fall 2018
Description
List of all courses at UCLA during Fall 2018.
Usage
ucla_f18
Format
A data frame with 3950 observations on the following 14 variables.
- year
Year the course was offered
- term
Term the course was offered
- subject
Subject
- subject_abbr
Subject abbreviation, if any
- course
Course name
- course_num
Course number, complete
- course_numeric
Course number, numeric only
- seminar
Boolean for if this is a seminar course
- ind_study
Boolean for if this is some form of independent study
- apprenticeship
Boolean for if this is an apprenticeship
- internship
Boolean for if this is an internship
- honors_contracts
Boolean for if this is an honors contracts course
- laboratory
Boolean for if this is a lab
- special_topic
Boolean for if this is any of the special types of courses listed
Source
https://sa.ucla.edu/ro/public/soc, retrieved 2018-11-22.
Examples
nrow(ucla_f18)
table(ucla_f18$special_topic)
subset(ucla_f18, is.na(course_numeric))
table(subset(ucla_f18, !special_topic)$course_numeric < 100)
elig_courses <-
subset(ucla_f18, !special_topic & course_numeric < 100)
set.seed(1)
ucla_textbooks_f18 <-
elig_courses[sample(nrow(elig_courses), 100), ]
tmp <- order(
ucla_textbooks_f18$subject,
ucla_textbooks_f18$course_numeric
)
ucla_textbooks_f18 <- ucla_textbooks_f18[tmp, ]
rownames(ucla_textbooks_f18) <- NULL
head(ucla_textbooks_f18)
Sample of UCLA course textbooks for Fall 2018
Description
A sample of courses were collected from UCLA from Fall 2018, and the corresponding textbook prices were collected from the UCLA bookstore and also from Amazon.
Usage
ucla_textbooks_f18
Format
A data frame with 201 observations on the following 20 variables.
- year
Year the course was offered
- term
Term the course was offered
- subject
Subject
- subject_abbr
Subject abbreviation, if any
- course
Course name
- course_num
Course number, complete
- course_numeric
Course number, numeric only
- seminar
Boolean for if this is a seminar course.
- ind_study
Boolean for if this is some form of independent study
- apprenticeship
Boolean for if this is an apprenticeship
- internship
Boolean for if this is an internship
- honors_contracts
Boolean for if this is an honors contracts course
- laboratory
Boolean for if this is a lab
- special_topic
Boolean for if this is any of the special types of courses listed
- textbook_isbn
Textbook ISBN
- bookstore_new
New price at the UCLA bookstore
- bookstore_used
Used price at the UCLA bookstore
- amazon_new
New price sold by Amazon
- amazon_used
Used price sold by Amazon
- notes
Any relevant notes
Details
A past dataset was collected from UCLA courses in Spring 2010, and Amazon prices at that time were found to be almost uniformly lower than those of the UCLA bookstore. Now in 2018, the UCLA bookstore is about even with Amazon on the vast majority of titles, and there is no statistical difference in the sample data.
The most expensive book required for the course was generally used.
We advocate for using raw amount differences instead of percent differences because a 20% price difference on low-priced books would balance numerically (but not in a practical sense) a moderate but important price difference on more expensive books. So while this tends to result in a bit less sensitivity in detecting some effect, we believe the absolute difference compares prices in a more meaningful way.
Used prices contain the shipping cost but do not contain tax. The used prices are a more nuanced comparison, since these are all 3rd party sellers. Amazon is often more a marketplace than a retail site at this point, and many people buy from 3rd party sellers on Amazon now without realizing it. The relationship Amazon has with 3rd party sellers is also challenging. Given the frequently changing dynamics in this space, we don't think any analysis here will be very reliable for long term insights since products from these sellers change frequently in quantity and price. For this reason, we focus only on new books sold directly by Amazon in our comparison. In a future round of data collection, it may be interesting to explore whether the dynamics have changed in the used market.
Source
https://sa.ucla.edu/ro/public/soc
See Also
Examples
library(ggplot2)
library(dplyr)
ggplot(ucla_textbooks_f18, aes(x = bookstore_new, y = amazon_new)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, color = "orange") +
labs(
x = "UCLA Bookstore price", y = "Amazon price",
title = "Amazon vs. UCLA Bookstore prices of new textbooks",
subtitle = "Orange line represents y = x"
)
# The following outliers were double checked for accuracy
ucla_textbooks_f18_with_diff <- ucla_textbooks_f18 |>
mutate(diff = bookstore_new - amazon_new)
ucla_textbooks_f18_with_diff |>
filter(diff > 20 | diff < -20)
# Distribution of price differences
ggplot(ucla_textbooks_f18_with_diff, aes(x = diff)) +
geom_histogram(binwidth = 5)
# t-test of price differences
t.test(ucla_textbooks_f18_with_diff$diff)
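The Details argue for raw price differences over percent differences. The sketch below contrasts the two on made-up prices: two cheap books with small dollar differences in opposite directions and one expensive book with a larger dollar difference. The column pct_diff and all values are illustrative assumptions.

```r
# Made-up new prices in dollars (not rows from `ucla_textbooks_f18`).
toy_prices <- data.frame(
  bookstore_new = c(10, 12, 200),
  amazon_new = c(12, 10, 180)
)

# Raw difference, as used in the examples above.
toy_prices$diff <- toy_prices$bookstore_new - toy_prices$amazon_new

# Percent difference relative to the Amazon price.
toy_prices$pct_diff <- 100 * toy_prices$diff / toy_prices$amazon_new
toy_prices

# The $20 gap on the expensive book dominates the raw mean, while in percent
# terms it counts for less than a $2 gap on a cheap book.
mean(toy_prices$diff)
mean(toy_prices$pct_diff)
```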
United Kingdom Demographic Data
Description
This dataset comes from the Guardian's Data Blog and includes five financial demographic variables.
Usage
ukdemo
Format
A data frame with 12 observations on the following 6 variables.
- region
Region in the United Kingdom
- debt
Average regional debt, not including mortgages, in pounds
- unemployment
Percent unemployment
- house
Average house price, in pounds
- pay
Average hourly pay, in pounds
- rpi
Retail price index, which is standardized to 100 for the entire UK, and lower index scores correspond to lower prices
Source
The data was described in the Guardian Data Blog: https://www.theguardian.com/news/datablog/interactive/2011/oct/27/debt-money-expert-facts, retrieved 2011-11-01.
References
Guardian Data Blog
Examples
library(ggplot2)
ggplot(ukdemo, aes(x = pay, y = rpi)) +
geom_point() +
labs(x = "Average hourly pay", y = "Retail price index")
Annual unemployment since 1890
Description
A compilation of two datasets that provides an estimate of unemployment from 1890 to 2010.
Usage
unempl
Format
A data frame with 121 observations on the following 3 variables.
- year
Year
- unemp
Unemployment rate, in percent
- us_data
1 if from the Bureau of Labor Statistics, 0 otherwise
Source
The data are from Wikipedia at the following URL accessed on November 1st, 2010:
https://en.wikipedia.org/wiki/File:US_Unemployment_1890-2009.gif
Below is a direct quote from Wikipedia describing the sources of the data:
Own work by Peace01234 Complete raw data are on Peace01234. 1930-2009 data are from Bureau of Labor Statistics (BLS), Employment status of the civilian noninstitutional population, 1940 to date retrieved on March 6, 2009 and February 12, 2010 from the BLS' FTP server. Data prior to 1948 are for persons age 14 and over. Data beginning in 1948 are for persons age 16 and over. See also "Historical Comparability" under the Household Data section of the Explanatory Notes at https://www.bls.gov/cps/eetech_methods.pdf. 1890-1930 data are from Christina Romer (1986). "Spurious Volatility in Historical Unemployment Data", The Journal of Political Economy, 94(1): 1-37. 1930-1940 data are from Robert M. Coen (1973). "Labor Force and Unemployment in the 1920's and 1930's: A Re-Examination Based on Postwar Experience", The Review of Economics and Statistics, 55(1): 46-55. Unemployment data was only surveyed once each decade until 1940 when yearly surveys were begun. The yearly data estimates before 1940 are based on the decade surveys combined with other relevant surveys that were collected during those years. The methods are described in detail by Coen and Romer.
Examples
# =====> Time Series Plot of Data <=====#
COL <- c("#DDEEBB", "#EEDDBB", "#BBDDEE", "#FFD5DD", "#FFC5CC")
plot(unempl$year, unempl$unemp, type = "n")
rect(0, -50, 3000, 100, col = "#E2E2E2")
rect(1914.5, -1000, 1918.9, 1000, col = COL[1], border = "#E2E2E2")
rect(1929, -1000, 1939, 1000, col = COL[2], border = "#E2E2E2")
rect(1939.7, -1000, 1945.6, 1000, col = COL[3], border = "#E2E2E2")
rect(1955.8, -1000, 1965.3, 1000, col = COL[4], border = "#E2E2E2")
rect(1965.3, -1000, 1975.4, 1000, col = COL[5], border = "#E2E2E2")
abline(h = seq(0, 50, 5), col = "#F8F8F8", lwd = 2)
abline(v = seq(1900, 2000, 20), col = "#FFFFFF", lwd = 1.3)
lines(unempl$year, unempl$unemp)
points(unempl$year, unempl$unemp, pch = 20)
legend("topright",
fill = COL,
c(
"World War I", "Great Depression", "World War II",
"Vietnam War Start", "Vietnam War Escalated"
),
bg = "#FFFFFF", border = "#FFFFFF"
)
President's party performance and unemployment rate
Description
Unemployment rate and change in House seats for the president's party in midterm election years.
Usage
unemploy_pres
Format
A data frame with 29 observations on the following 5 variables.
- year
Year.
- potus
The president in office.
- party
President's party.
- unemp
Unemployment rate.
- change
Change in House seats for the president's party.
Source
Wikipedia.
Examples
unemploy_pres
US temperatures in 1950 and 2022
Description
A representative set of monitoring locations was taken from NOAA data in 1950 and 2022 such that the locations are spread roughly geographically across the continental US (the observations do not represent a random sample of geographical locations).
Usage
us_temperature
Format
A data frame with 18759 observations on the following 9 variables.
- location
Location of the NOAA weather station.
- station
Formal ID of the NOAA weather station.
- latitude
Latitude of the NOAA weather station.
- longitude
Longitude of the NOAA weather station.
- elevation
Elevation of the NOAA weather station.
- date
Date the measurement was taken (Y-m-d).
- tmax
Maximum daily temperature (Fahrenheit).
- tmin
Minimum daily temperature (Fahrenheit).
- year
Year of the measurement.
Details
Please keep in mind that the data represent two annual snapshots, and a complete analysis would consider more than two years of data and a random or more complete sampling of weather stations across the US.
Source
NOAA Climate Data Online. Retrieved 23 September, 2023.
Examples
library(dplyr)
library(ggplot2)
library(maps)
summarized_temp <- us_temperature |>
group_by(station, year, latitude, longitude) |>
summarize(tmax_med = median(tmax, na.rm = TRUE)) |>
mutate(plot_shift = ifelse(year == "1950", 0, 1)) |>
mutate(year = as.factor(year))
usa <- map_data("state")
ggplot(data = usa, aes(x = long, y = lat)) +
geom_polygon(aes(group = group), color = "black", fill = "white") +
geom_point(
data = summarized_temp,
aes(
x = longitude + plot_shift, y = latitude,
color = tmax_med, shape = year
)
) +
scale_color_gradient(high = IMSCOL["red", 1], low = IMSCOL["yellow", 1]) +
ggtitle("Median of the daily high temp, 1950 & 2022") +
labs(
x = "longitude",
y = "latitude",
color = "median high temp"
) +
guides(shape = guide_legend(override.aes = list(color = "black")))
ucb_admit
Description
Data from a study carried out by the Graduate Division of the University of California, Berkeley in the early 1970s to evaluate whether there was a sex bias in graduate admissions.
Usage
ucb_admit
Format
A data frame with 4526 observations on the following 3 variables.
- admit
Was the applicant admitted to the university?
- gender
Whether the applicant identified as male or female.
- dept
What department did the applicant apply to, noted as A through F for confidentiality.
Examples
library(ggplot2)
library(dplyr)
plot_data <- ucb_admit |>
count(dept, gender, admit)
ggplot(plot_data, aes(dept, n, fill = gender)) +
geom_col(position = "dodge") +
facet_wrap(~admit) +
theme_minimal() +
labs(
title = "Does gender discrimination play a role in college admittance?",
x = "Department",
y = "Number of Students",
fill = "Gender",
caption = "Source: UC Berkeley, 1970's"
)
Time Between Gondola Cars at Sterling Winery
Description
These times represent times between gondolas at Sterling Winery. The main take-away: there are 7 cars, as evidenced by the somewhat regular increases in splits between every 7 cars. The reason the times are slightly non-constant is that the gondolas come off the tracks, so times will change a little between each period.
Usage
winery_cars
Format
A data frame with 52 observations on the following 2 variables.
- obs_number
The observation number, e.g. observation 3 was immediately preceded by observation 2.
- time_until_next
Time until this gondola car arrived since the last car had left.
Details
Important context: the line was long enough that people were leaving the winery.
So why is this data valuable? It indicates that the winery should add one more car since it has a lot of time wasted every 7th car. By adding another car, fewer visitors are likely to be turned away, resulting in increased revenue.
Source
In-person data collection by David Diez (OpenIntro) on 2013-07-04.
Examples
winery_cars$car_number <- rep(1:7, 10)[1:nrow(winery_cars)]
col <- COL[ifelse(winery_cars$car_number == 3, 4, 1)]
plot(winery_cars[, c("obs_number", "time_until_next")],
col = col, pch = 19
)
plot(winery_cars$car_number, winery_cars$time_until_next,
col = fadeColor(col, "88"), pch = 19
)
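The car labels used in the example above come from recycling the numbers 1 through 7 across the observations. The following is a minimal base R sketch of that labeling only; it assumes, as the Description suggests, that observations cycle through 7 cars in order, and it does not touch the real `winery_cars` data.

```r
# Number of observations in winery_cars (from the Format section)
n_obs <- 52
# Number of gondola cars suggested by the Description
n_cars <- 7

# Recycle 1..7 until every observation has a label
car_number <- rep(seq_len(n_cars), length.out = n_obs)

# Same labels as the construction used in the example above
same_as_example <- identical(car_number, rep(1:7, 10)[1:n_obs])

# Number of observations assigned to each car label
car_counts <- table(car_number)
```

With labels like these, a call such as `tapply(winery_cars$time_until_next, car_number, mean)` would summarize the waiting time per car position.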
World Population Data.
Description
From World Bank, population 1960-2020
Usage
world_pop
Format
A data frame with 216 rows and 62 variables.
- country
Name of country.
- year_1960
population in 1960.
- year_1961
population in 1961.
- year_1962
population in 1962.
- year_1963
population in 1963.
- year_1964
population in 1964.
- year_1965
population in 1965.
- year_1966
population in 1966.
- year_1967
population in 1967.
- year_1968
population in 1968.
- year_1969
population in 1969.
- year_1970
population in 1970.
- year_1971
population in 1971.
- year_1972
population in 1972.
- year_1973
population in 1973.
- year_1974
population in 1974.
- year_1975
population in 1975.
- year_1976
population in 1976.
- year_1977
population in 1977.
- year_1978
population in 1978.
- year_1979
population in 1979.
- year_1980
population in 1980.
- year_1981
population in 1981.
- year_1982
population in 1982.
- year_1983
population in 1983.
- year_1984
population in 1984.
- year_1985
population in 1985.
- year_1986
population in 1986.
- year_1987
population in 1987.
- year_1988
population in 1988.
- year_1989
population in 1989.
- year_1990
population in 1990.
- year_1991
population in 1991.
- year_1992
population in 1992.
- year_1993
population in 1993.
- year_1994
population in 1994.
- year_1995
population in 1995.
- year_1996
population in 1996.
- year_1997
population in 1997.
- year_1998
population in 1998.
- year_1999
population in 1999.
- year_2000
population in 2000.
- year_2001
population in 2001.
- year_2002
population in 2002.
- year_2003
population in 2003.
- year_2004
population in 2004.
- year_2005
population in 2005.
- year_2006
population in 2006.
- year_2007
population in 2007.
- year_2008
population in 2008.
- year_2009
population in 2009.
- year_2010
population in 2010.
- year_2011
population in 2011.
- year_2012
population in 2012.
- year_2013
population in 2013.
- year_2014
population in 2014.
- year_2015
population in 2015.
- year_2016
population in 2016.
- year_2017
population in 2017.
- year_2018
population in 2018.
- year_2019
population in 2019.
- year_2020
population in 2020.
Source
Examples
library(dplyr)
library(ggplot2)
library(tidyr)
# List percentage of population change from 1960 to 2020
world_pop |>
mutate(percent_change = round((year_2020 - year_1960) / year_1960 * 100, 2)) |>
mutate(rank_pop_change = round(rank(-percent_change), 0)) |>
select(rank_pop_change, country, percent_change) |>
arrange(rank_pop_change)
# Graph population in millions by decade for specified countries
world_pop |>
select(
country, year_1960, year_1970, year_1980, year_1990,
year_2000, year_2010, year_2020
) |>
filter(country %in% c("China", "India", "United States")) |>
pivot_longer(
cols = c(year_1960, year_1970, year_1980, year_1990, year_2000, year_2010, year_2020),
names_to = "year",
values_to = "population"
) |>
mutate(
year = as.numeric(gsub("year_", "", year)),
population = population / 1e6 # convert raw counts to millions to match the axis label
) |>
ggplot(aes(year, population, color = country)) +
geom_point() +
geom_smooth(method = "loess", formula = "y ~ x") +
labs(
title = "Population",
subtitle = "by Decade",
x = "Year",
y = "Population (in millions)",
color = "Country"
)
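Percent change is conventionally measured relative to the starting value. The sketch below works through that calculation and the ranking step in base R. The countries and populations are made-up toy values, not taken from `world_pop`; only the column naming mirrors the dataset.

```r
# Hypothetical toy data with the same column layout as world_pop
toy_pop <- data.frame(
  country = c("A", "B", "C"),
  year_1960 = c(100, 200, 50),
  year_2020 = c(150, 300, 200)
)

# Percent change relative to the 1960 baseline
toy_pop$percent_change <- round(
  (toy_pop$year_2020 - toy_pop$year_1960) / toy_pop$year_1960 * 100, 2
)

# Rank 1 = largest percent increase; tied values share an averaged rank
toy_pop$rank_pop_change <- rank(-toy_pop$percent_change)

# Show the result ordered by rank
toy_pop[order(toy_pop$rank_pop_change),
        c("rank_pop_change", "country", "percent_change")]
```

Because `rank()` averages ties by default, countries A and B here both receive rank 2.5; passing `ties.method = "min"` is one way to get integer ranks instead.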
Create a CSV variant of .rda files
Description
The function should be run with a path to a package directory. It will then look through the data directory of the package, and for all datasets that are data frames, create CSV variants in a data-csv directory.
Usage
write_pkg_data(
pkg,
dir = paste0("data-", out_type),
overwrite = FALSE,
out_type = c("csv", "tab", "R")
)
Arguments
pkg |
The R package where we'd like to generate CSVs of any data frames. |
dir |
A character string representing the path to the folder where the CSV files should be written. If no such directory exists, one will be created (recursively). |
overwrite |
Boolean indicating whether to overwrite any existing files that have conflicting names in the directory specified. |
out_type |
Format for the type of output as a CSV ( |
Examples
## Not run:
write_pkg_data("openintro")
list.files("data-csv")
## End(Not run)
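The behavior described above (keep only data frames, create the output directory if needed, respect `overwrite`) can be illustrated with a short base R sketch. The helper `write_df_list_csv()` is hypothetical and is not the implementation of `write_pkg_data()`; it works on a named list of objects rather than on a package's data directory, and it handles only the CSV case.

```r
# Hypothetical helper, not the package's implementation: write every data
# frame in a named list to <dir>/<name>.csv
write_df_list_csv <- function(objs, dir, overwrite = FALSE) {
  # Create the output directory (and any parents) if it does not exist
  if (!dir.exists(dir)) dir.create(dir, recursive = TRUE)
  written <- character(0)
  for (nm in names(objs)) {
    # Skip anything that is not a data frame
    if (!is.data.frame(objs[[nm]])) next
    path <- file.path(dir, paste0(nm, ".csv"))
    # Leave existing files alone unless overwrite = TRUE
    if (file.exists(path) && !overwrite) next
    utils::write.csv(objs[[nm]], path, row.names = FALSE)
    written <- c(written, path)
  }
  # Return the paths that were actually written
  invisible(written)
}

# Usage on toy objects in a fresh temporary directory
out_dir <- file.path(tempdir(), "data-csv-demo")
unlink(out_dir, recursive = TRUE)
objs <- list(
  small_df = data.frame(x = 1:3, y = c("a", "b", "c")),
  not_df = 1:10
)
first_run <- write_df_list_csv(objs, out_dir)
second_run <- write_df_list_csv(objs, out_dir) # file exists, so nothing is rewritten
```

Returning the written paths invisibly makes it easy to check what a call changed without printing anything by default.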
Exxon Mobil stock data
Description
Monthly data covering 2006 through early 2014.
Usage
xom
Format
A data frame with 98 observations on the following 7 variables.
- date
Date.
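- open
Opening price, a numeric vector.
- high
Highest price in the period, a numeric vector.
- low
Lowest price in the period, a numeric vector.
- close
Closing price, a numeric vector.
- volume
Trading volume, a numeric vector.
- adj_close
Adjusted closing price, a numeric vector.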
Source
Yahoo! Finance.
Examples
xom
Contagiousness of yawning
Description
An experiment conducted by the MythBusters, a science entertainment TV program on the Discovery Channel, tested if a person can be subconsciously influenced into yawning if another person near them yawns. 50 people were randomly assigned to two groups: 34 to a group where a person near them yawned (treatment) and 16 to a group where there wasn't a person yawning near them (control).
Usage
yawn
Format
A data frame with 50 observations on the following 2 variables.
- result
a factor with levels not yawn and yawn
- group
a factor with levels ctrl and trmt
Source
MythBusters, Season 3, Episode 28.
Examples
yawn
Youth Risk Behavior Surveillance System (YRBSS)
Description
Select variables from YRBSS.
Usage
yrbss
Format
A data frame with 13583 observations on the following 13 variables.
- age
Age, in years.
- gender
Gender.
- grade
School grade.
- hispanic
Hispanic or not.
- race
Race / ethnicity.
- height
Height, in meters (3.28 feet per meter).
- weight
Weight, in kilograms (2.2 pounds per kilogram).
- helmet_12m
How often did you wear a helmet when biking in the last 12 months?
- text_while_driving_30d
How many days did you text while driving in the last 30 days?
- physically_active_7d
How many days were you physically active for 60+ minutes in the last 7 days?
- hours_tv_per_school_day
How many hours of TV do you typically watch on a school night?
- strength_training_7d
How many days did you do strength training (e.g. lift weights) in the last 7 days?
- school_night_hours_sleep
How many hours of sleep do you typically get on a school night?
Source
CDC's Youth Risk Behavior Surveillance System (YRBSS)
Examples
table(yrbss$physically_active_7d)
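The Format section quotes conversion factors for height and weight. A minimal sketch of applying them in base R follows; the input values are hypothetical and are not rows from `yrbss`.

```r
# Conversion factors quoted in the Format section
feet_per_meter <- 3.28
pounds_per_kilogram <- 2.2

# Hypothetical values, not taken from yrbss
height_m <- c(1.50, 1.75)
weight_kg <- c(50, 80)

# Convert metric measurements to feet and pounds
height_ft <- height_m * feet_per_meter
weight_lb <- weight_kg * pounds_per_kilogram
```

The same multiplications can be applied to `yrbss$height` and `yrbss$weight` directly; missing values stay `NA`.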
Sample of Youth Risk Behavior Surveillance System (YRBSS)
Description
A sample of the yrbss dataset.
Usage
yrbss_samp
Format
A data frame with 100 observations on the following 13 variables.
- age
Age, in years.
- gender
Gender.
- grade
School grade.
- hispanic
Hispanic or not.
- race
Race / ethnicity.
- height
Height, in meters (3.28 feet per meter).
- weight
Weight, in kilograms (2.2 pounds per kilogram).
- helmet_12m
How often did you wear a helmet when biking in the last 12 months?
- text_while_driving_30d
How many days did you text while driving in the last 30 days?
- physically_active_7d
How many days were you physically active for 60+ minutes in the last 7 days?
- hours_tv_per_school_day
How many hours of TV do you typically watch on a school night?
- strength_training_7d
How many days did you do strength training (e.g. lift weights) in the last 7 days?
- school_night_hours_sleep
How many hours of sleep do you typically get on a school night?
Source
CDC's Youth Risk Behavior Surveillance System (YRBSS)
Examples
table(yrbss_samp$physically_active_7d)