- An introduction to the Grammar of Graphics
- An introduction to ggplot
A ggplot graphic has at least three key components: the data, a set of aesthetic mappings, and at least one layer (a geom).
Example: displ (displacement) by hwy (highway miles per gallon)
    library(tidyverse)
    ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
      geom_point()
The same plot with the argument names omitted:
    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point()
Variables used in the next plot:
- displ: displacement
- hwy: highway miles per gallon
- drv: f = front-wheel drive, r = rear-wheel drive, 4 = 4wd
- cyl: number of cylinders
    ggplot(mpg, aes(x = displ, y = hwy, color = drv, size = cyl)) +
      geom_point()
The ggplot() function
The main function is ggplot(). It takes two arguments:
- data: A data frame
- mapping: Aesthetic mappings provided with the aes() function
Additional layers are added with a + sign.
    ggplot(mpg, aes(x = displ, y = hwy, color = drv, size = cyl)) +
      geom_point()
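Because layers are added with +, a plot can be stored in a variable and extended step by step. The following is a minimal sketch, assuming ggplot2 is installed; the names p_base and p_points are illustrative only.

```r
library(ggplot2)

# Base plot: data and aesthetic mappings only, no layer yet
p_base <- ggplot(mpg, aes(x = displ, y = hwy))

# Adding a geom with + returns a new plot object; p_base itself is unchanged
p_points <- p_base + geom_point()

# Printing a plot object draws it
print(p_points)
```

Keeping the base plot in a variable makes it easy to try several geoms on the same data and mappings.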
Exercise:
- Use the mpg data frame.
- Display cty and hwy on the axes.
- Map colour to the variable class, and map shape (shape = drv) to the variable drv.
- Add a geom_point() layer.
    ggplot(mpg, aes(cty, hwy, colour = class, shape = drv)) +
      geom_point()
Aesthetics can also be set to fixed values, as arguments to the geom function:
    ggplot(diamonds, aes(carat, price)) +
      geom_point(colour = "red", shape = "+", size = 1)
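The difference between setting an aesthetic in the geom and mapping it inside aes() is easy to miss. The sketch below contrasts the two, using the mpg data; p_fixed and p_mapped are illustrative names, and layer_data() is used only to inspect the colours ggplot2 ends up drawing.

```r
library(ggplot2)

# Fixed aesthetic: every point is drawn in red
p_fixed <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(colour = "red")

# Mapped aesthetic: "red" is treated as a data value, so ggplot2 builds
# a one-level discrete colour scale (with a legend) and picks its own colour
p_mapped <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = "red"))

# layer_data() returns the values actually used for drawing each layer
unique(layer_data(p_fixed)$colour)
unique(layer_data(p_mapped)$colour)
```

As a rule of thumb, anything inside aes() refers to data, and anything outside it is a constant.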
Commonly used geoms:
- geom_point(): Dots for each data point
- geom_line(): Lines connecting the data points in order of their x values
- geom_bar(): Bars
- geom_text(): Text at x and y positions
- geom_smooth(): Smoothed conditional means
Exercise:
- Use the economics data frame.
- Plot unemployment (unemploy) by date as a line (geom_line()).
- Add red points on top (geom_point()).
    ggplot(economics, aes(date, unemploy)) +
      geom_line() +
      geom_point(color = "red", size = 1)
geom_bar()
By default, geom_bar() draws bars whose heights are the number of cases at each value of the x variable.
    # Number of cars in each class:
    ggplot(mpg, aes(class)) +
      geom_bar()
Exercise:
- Use the mpg data frame.
- Draw a bar plot of the drv variable.
- Colour the bars red with the fill argument.
- Use width = 0.8 to resize the bar width.
    ggplot(mpg, aes(drv)) +
      geom_bar(fill = "red", width = 0.8)
When geom_bar() is called with stat = "identity", bar categories are taken from the x variable and bar heights from the y variable:
    df <- data.frame(
      type = c("A", "B", "C"),
      mean = c(2.5, 4.4, 6.3)
    )
    ggplot(df, aes(x = type, y = mean)) +
      geom_bar(stat = "identity")
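To make the two modes concrete, the sketch below checks that the default bar heights equal a plain frequency table, and draws pre-computed heights with geom_col(), which uses the identity stat and so plays the role of geom_bar(stat = "identity"). The names p_count, bar_counts, table_counts and p_col are illustrative only.

```r
library(ggplot2)

# Default stat = "count": bar heights are the number of rows per class
p_count <- ggplot(mpg, aes(class)) +
  geom_bar()

# The counts computed by the layer match a plain frequency table
bar_counts <- layer_data(p_count)$count
table_counts <- as.vector(table(mpg$class))

# Pre-computed heights: geom_col() takes the heights directly from y
df <- data.frame(
  type = c("A", "B", "C"),
  mean = c(2.5, 4.4, 6.3)
)
p_col <- ggplot(df, aes(x = type, y = mean)) +
  geom_col()
```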
Exercise:
- Use the starwars data frame.
- Compute the body mass index (mutate(bmi = mass / (height / 100)^2)).
- Compute the median BMI per species (summarise(median_bmi = median(bmi, na.rm = TRUE))).
- Draw the result as bars.
    starwars %>%
      mutate(bmi = mass / (height / 100)^2) %>%
      group_by(species) %>%
      summarise(median_bmi = median(bmi, na.rm = TRUE)) %>%
      ggplot(aes(species, median_bmi)) +
      geom_bar(stat = "identity")
geom_smooth()
geom_smooth() is used to add smoothed conditional means in scatterplots.
    ggplot(mpg, aes(displ, hwy)) +
      geom_point() +
      geom_smooth()
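With its default settings, geom_smooth() chooses a smoother based on the size of the data (loess for fewer than 1,000 observations) and draws a confidence band around it. The sketch below shows one common variation, a straight-line fit without the band; p_lm, fit and smooth_data are illustrative names, and the comparison with lm() is included only to show what the layer computes.

```r
library(ggplot2)

p_lm <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  # Linear fit instead of the default smoother; se = FALSE hides the band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# The same model fitted directly, for comparison with the drawn line
fit <- lm(hwy ~ displ, data = mpg)

# x/y coordinates of the fitted line as computed by the smooth layer
smooth_data <- layer_data(p_lm, 2)
```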
Exercise:
- Use the economics data frame.
- Plot unemploy by population pop.
- Add a geom_smooth() layer.
    economics %>%
      ggplot(aes(pop, unemploy)) +
      geom_point() +
      geom_smooth()
Exercise:
- Load the dslabs package.
- Use the gapminder data frame.
- Group the data by year and continent (group_by(year, continent)).
- Use the summarize() function to calculate the mean of infant_mortality.
- Plot year on the x-axis, the mean of infant_mortality on the y-axis, and continent as line/dot colours.
- Add a smooth layer.
    library(dslabs)
    gapminder %>%
      group_by(year, continent) %>%
      summarize(m_infant_mortality = mean(infant_mortality, na.rm = TRUE)) %>%
      ggplot(aes(x = year, y = m_infant_mortality, color = continent)) +
      geom_line() +
      geom_point() +
      geom_smooth()
When many observations share the same value of a categorical variable, simple plots become messy because the points overlap:
    ggplot(mpg, aes(drv, hwy)) +
      geom_point()
Solutions
- geom_jitter(): Adds a little random jitter to each data point
- geom_boxplot(): Draws a boxplot
- geom_violin(): Draws a violin plot
Applied to the mpg dataset:
    ggplot(mpg, aes(drv, hwy)) + geom_jitter(width = 0.2)
    ggplot(mpg, aes(drv, hwy)) + geom_boxplot()
    ggplot(mpg, aes(drv, hwy)) + geom_violin()
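These geoms can also be layered in a single plot, so that the summary and the raw data are visible together. The sketch below is one possible combination, not the only sensible one; p_combo is an illustrative name, and the alpha value is an arbitrary choice.

```r
library(ggplot2)

p_combo <- ggplot(mpg, aes(drv, hwy)) +
  # outlier.shape = NA hides the boxplot outliers, which the jitter layer already shows
  geom_boxplot(outlier.shape = NA) +
  # Semi-transparent jittered points drawn on top of the boxes
  geom_jitter(width = 0.2, alpha = 0.4)

print(p_combo)
```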
    ggplot(mpg, aes(cty, hwy)) +
      geom_point() +
      facet_wrap(~class)
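    # Per-class summary: number of cars and mean highway mileage
    class <- mpg %>%
      group_by(class) %>%
      summarise(n = n(), hwy = mean(hwy))
    ggplot(mpg, aes(class, hwy)) +
      geom_jitter(width = 0.2) +
      # Overlay the class means as large red points
      geom_point(data = class, mapping = aes(class, hwy), colour = "red", size = 6) +
      # Print the group sizes along the bottom of the plot
      geom_text(data = class, aes(class, 10, label = paste0("n=", n))) +
      ylim(10, 45)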