- An introduction to the Grammar of Graphics
- An introduction to ggplot
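A ggplot graphic has at least three key components: the data, a set of aesthetic mappings between variables and visual properties, and at least one layer (a geom) that draws the observations.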
displ (displacement) by hwy (highway miles per gallon):
library(tidyverse)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()
The argument names can be omitted:
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
- displ: displacement, and hwy: highway miles per gallon
- drv: f = front-wheel drive, r = rear-wheel drive, 4 = 4wd
- cyl: number of cylinders
ggplot(mpg, aes(x = displ, y = hwy, color = drv, size = cyl)) +
  geom_point()

The ggplot() function
The main function is ggplot(). It takes two arguments:
- data: A data frame
- mapping: Aesthetic mappings provided with the aes() function
Additional layers are added with a + sign.
ggplot(mpg, aes(x = displ, y = hwy, color = drv, size = cyl)) +
  geom_point()
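Because layers are added with `+`, a plot can be stored in an object and extended step by step. The following is a minimal sketch of that idea; the object names `p`, `p_points`, and `p_both` are illustrative and not part of the original slides, and `library(ggplot2)` is loaded here instead of the full tidyverse only to keep the example small.

```r
library(ggplot2)

# Data and aesthetic mappings only: no layer yet, so this draws empty axes.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# Each `+` returns a new plot object with one more layer.
p_points <- p + geom_point()
p_both <- p_points + geom_smooth()

# Number of layers in each plot object
length(p$layers)
length(p_points$layers)
length(p_both$layers)
```

Printing `p_both` (or typing its name at the console) renders the plot with both layers.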
- Use the mpg data frame.
- Map cty and hwy to the axes.
- Map colour to the variable class and shape to the variable drv (shape = drv).
- Add a geom_point() layer.
ggplot(mpg, aes(cty, hwy, colour = class, shape = drv)) +
  geom_point()
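Aesthetics can also be set to fixed values, rather than mapped to a variable, by passing them as arguments to the geom function.
ggplot(diamonds, aes(carat, price)) +
  geom_point(colour = "red", shape = "+", size = 1)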
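The difference between setting an aesthetic in the geom function and mapping it inside aes() can be checked on the built plot data. This is a small sketch under the assumption that `ggplot_build()` is used only for inspection; `p_set` and `p_map` are illustrative names.

```r
library(ggplot2)

# Setting: colour is a fixed value outside aes(), so every point is red.
p_set <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(colour = "red")

# Mapping: inside aes(), "red" is treated as a variable with a single value,
# so ggplot2 assigns a colour from its default scale and adds a legend.
p_map <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = "red"))

# Colours actually used to draw the points in each plot
set_colours <- unique(ggplot_build(p_set)$data[[1]]$colour)
map_colours <- unique(ggplot_build(p_map)$data[[1]]$colour)
```

In the second plot the points are not drawn in the literal colour red, which is a common source of confusion.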
- geom_point(): Dots for each data point
- geom_line(): Lines connecting the data points in order of their x values
- geom_bar(): Bars
- geom_text(): Text at x and y positions
- geom_smooth(): Smoothed conditional means

- Use the economics data frame.
- Plot a line of date and unemployment (geom_line()).
- Add red points (geom_point()).
ggplot(economics, aes(date, unemploy)) +
  geom_line() +
  geom_point(color = "red", size = 1)

geom_bar()
geom_bar() draws bars. By default, the height of each bar is the number of cases in each category of the x variable.
# Number of cars in each class:
ggplot(mpg, aes(class)) +
  geom_bar()
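To see that the default bar heights really are counts, the heights computed by geom_bar() can be compared with a plain frequency table. A minimal sketch, assuming `ggplot_build()` is used only to inspect the computed values; `bar_counts` and `class_counts` are illustrative names.

```r
library(ggplot2)

p <- ggplot(mpg, aes(class)) +
  geom_bar()

# Counts computed by the default stat of geom_bar()
bar_counts <- ggplot_build(p)$data[[1]]$count

# The same counts from base R
class_counts <- as.vector(table(mpg$class))
```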
- Use the mpg data frame.
- Plot the count of the drv variable.
- Colour the bars red with the fill argument.
- Use width = 0.8 to resize the bar width.
ggplot(mpg, aes(drv)) +
  geom_bar(fill = "red", width = 0.8)

geom_bar()
With stat = "identity", bar categories and bar heights are taken from the x and y variables:
df <- data.frame(
  type = c("A", "B", "C"),
  mean = c(2.5, 4.4, 6.3)
)
ggplot(df, aes(x = type, y = mean)) +
  geom_bar(stat = "identity")
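ggplot2 also provides geom_col(), which takes bar heights from the y variable in the same way as geom_bar(stat = "identity"). The sketch below rebuilds the small data frame from the example and compares the two; `p_bar`, `p_col`, `bar_heights`, and `col_heights` are illustrative names.

```r
library(ggplot2)

df <- data.frame(
  type = c("A", "B", "C"),
  mean = c(2.5, 4.4, 6.3)
)

p_bar <- ggplot(df, aes(x = type, y = mean)) +
  geom_bar(stat = "identity")

# geom_col() uses the y values as bar heights without a stat argument
p_col <- ggplot(df, aes(x = type, y = mean)) +
  geom_col()

# Bar heights computed for each plot
bar_heights <- ggplot_build(p_bar)$data[[1]]$y
col_heights <- ggplot_build(p_col)$data[[1]]$y
```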
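- Use the starwars data set.
- Calculate the BMI (mutate(bmi = mass / (height / 100)^2)).
- Group by species and summarise the BMI (summarise(mean_bmi = median(bmi, na.rm = TRUE))).
- Plot the result as bars.
starwars %>%
  mutate(bmi = mass / (height / 100)^2) %>%
  group_by(species) %>%
  summarise(
    # Note: despite its name, mean_bmi holds the median BMI per species
    mean_bmi = median(bmi, na.rm = TRUE)
  ) %>%
  ggplot(aes(species, mean_bmi)) +
  geom_bar(stat = "identity")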

geom_smooth()
geom_smooth() is used to add smoothed conditional means in scatterplots.
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth()
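The smoother can be changed through the arguments of geom_smooth(). As a sketch that goes slightly beyond the slide, `method = "lm"` fits a straight regression line instead of the default smoother, and `se = FALSE` hides the confidence band; `p_lm` is an illustrative name.

```r
library(ggplot2)

# Scatterplot with a linear regression line and no confidence band
p_lm <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

p_lm
```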
- Use the economics data frame.
- Plot unemploy by population pop.
- Add a geom_smooth layer.
economics %>%
  ggplot(aes(pop, unemploy)) +
  geom_point() +
  geom_smooth()
- Load the package dslabs.
- Use the data frame gapminder.
- Group by year and continent (group_by(year, continent)).
- Use the summarize() function to calculate the mean of infant_mortality.
- Plot year on the x-axis, the mean of infant_mortality on the y-axis, and continent as line/dot colours.
- Add a smooth layer.
library(dslabs)
gapminder %>%
  group_by(year, continent) %>%
  summarize(m_infant_mortality = mean(infant_mortality, na.rm = TRUE)) %>%
  ggplot(aes(x = year, y = m_infant_mortality, color = continent)) +
  geom_line() +
  geom_point() +
  geom_smooth()

When a categorical variable has many observations in each category, simple plots become messy because the points overlap:
ggplot(mpg, aes(drv, hwy)) +
  geom_point()
Solutions
- geom_jitter(): Adds a little random jitter to each data point
- geom_boxplot(): Draws a boxplot
- geom_violin(): Draws a violin plot
Using the mpg dataset:
ggplot(mpg, aes(drv, hwy)) + geom_jitter(width = 0.2)
ggplot(mpg, aes(drv, hwy)) + geom_boxplot()
ggplot(mpg, aes(drv, hwy)) + geom_violin()
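These geoms can also be combined in one plot, since each is just another layer. The sketch below is one possible combination, not something taken from the slides: the boxplot's own outlier points are hidden with `outlier.shape = NA` so they are not drawn twice, and `alpha` makes the jittered points semi-transparent. `p_combined` is an illustrative name.

```r
library(ggplot2)

# Boxplot summary with the raw observations jittered on top
p_combined <- ggplot(mpg, aes(drv, hwy)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.4)

p_combined
```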
ggplot(mpg, aes(cty, hwy)) + geom_point() + facet_wrap(~class)
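# Per-class summary: number of cars and mean highway mileage
class_summary <- mpg %>%
  group_by(class) %>%
  summarise(n = n(), hwy = mean(hwy))
ggplot(mpg, aes(class, hwy)) +
  geom_jitter(width = 0.2) +
  # Class means drawn as large red points
  geom_point(data = class_summary, mapping = aes(class, hwy), colour = "red", size = 6) +
  # Sample size of each class printed at y = 10
  geom_text(data = class_summary, aes(class, 10, label = paste0("n=", n))) +
  ylim(10, 45)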