This article briefly summarizes what I’ve learned about the *apply functions.
split() and lapply()
We’ll work on the Air Quality *.csv file from the Data Science Specialization course:
> airquality <- read.csv("rcourse/hw1_data.csv")
> head(airquality)
We want to create a matrix that contains the monthly means of each column:
> s <- split(airquality, airquality$Month) # This creates a list of 5 data frames, one for each of the 5 months in which data was collected (May to September).
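Before applying anything to the list, it can help to look at what split() actually returned. The sketch below assumes the built-in airquality data frame from the datasets package as a stand-in for the course's CSV file; the names aq, month_names and rows_per_month are only illustrative.

```r
library(datasets)

# Assumption: the built-in airquality data stands in for hw1_data.csv.
aq <- datasets::airquality

# split() returns a named list with one data frame per value of Month.
s <- split(aq, aq$Month)

# The list names are the month numbers, stored as character strings.
month_names <- names(s)

# Number of observations in each monthly data frame.
rows_per_month <- sapply(s, nrow)

print(month_names)
print(rows_per_month)
```

Each element keeps all the original columns, so the same function can be applied to every month.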
Since every element of the list has the same columns, colMeans() returns a vector of the same length for each month, so we can use sapply() to build a matrix from these data. sapply() simplifies the result of lapply() whenever it can. The colMeans() function is a shortcut for apply(x, 2, mean), applied here to each data frame x in the list. Because the dataset contains some NA values, we remove them by setting na.rm to TRUE.
> sapply(s, colMeans, na.rm = TRUE)
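As a rough illustration of the two claims above (sapply() simplifying what lapply() returns, and colMeans() matching apply() over columns), here is a minimal sketch. It again assumes the built-in airquality data as a stand-in for the CSV file, and the variable names are hypothetical.

```r
library(datasets)

# Assumption: the built-in airquality data stands in for hw1_data.csv.
aq <- datasets::airquality
s <- split(aq, aq$Month)

# lapply() always returns a list: one vector of column means per month.
as_list <- lapply(s, colMeans, na.rm = TRUE)

# sapply() binds those equal-length vectors into a matrix
# (one row per column of the data, one column per month).
as_matrix <- sapply(s, colMeans, na.rm = TRUE)

# For a single month, colMeans() and apply(x, 2, mean) agree.
may <- s[["5"]]
same_result <- isTRUE(all.equal(
  colMeans(may, na.rm = TRUE),
  apply(may, 2, mean, na.rm = TRUE)
))

print(class(as_list))
print(dim(as_matrix))
print(same_result)
```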
If we only need some of the columns, we can use an anonymous function:
> sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE))
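One way to sanity-check the anonymous-function version is to compare it with the matching rows of the full matrix. This is only a sketch, assuming the built-in airquality data in place of the CSV file; cols, all_means and some_means are illustrative names.

```r
library(datasets)

# Assumption: the built-in airquality data stands in for hw1_data.csv.
aq <- datasets::airquality
s <- split(aq, aq$Month)

cols <- c("Ozone", "Solar.R", "Wind")

# Means of every column, per month.
all_means <- sapply(s, colMeans, na.rm = TRUE)

# Means of the selected columns only, via an anonymous function.
some_means <- sapply(s, function(x) colMeans(x[, cols], na.rm = TRUE))

# The selected rows of the full matrix should match the restricted result.
same <- isTRUE(all.equal(all_means[cols, ], some_means))
print(same)
```

Selecting the columns inside the function avoids computing means that are thrown away afterwards, which matters more on wide data than it does here.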
---
Some exercises
> library(datasets)
> data(iris)
“This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.”
In this dataset, what is the mean of ‘Sepal.Length’ for the species virginica?
> virginica <- subset(iris, iris[, "Species"] == "virginica")
> mean(virginica[, 1]) # Sepal.Length is column 1
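The same question can also be answered with the split() and sapply() pattern from the first section, which gives the mean for every species at once. This is a minimal sketch rather than the course's expected solution; by_species, species_means and virginica_mean are illustrative names.

```r
library(datasets)
data(iris)

# Split the Sepal.Length vector into one numeric vector per species.
by_species <- split(iris$Sepal.Length, iris$Species)

# Apply mean() to each piece; sapply() returns a named numeric vector.
species_means <- sapply(by_species, mean)
print(species_means)

# Pull out the value for virginica by name.
virginica_mean <- species_means[["virginica"]]
print(virginica_mean)
```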
> split(mtcars$mpg, mtcars$cyl)
$`4`
 [1] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26.0 30.4 21.4

$`6`
[1] 21.0 21.0 21.4 18.1 19.2 17.8 19.7

$`8`
 [1] 18.7 14.3 16.4 17.3 15.2 10.4 10.4 14.7 15.5 15.2 13.3 19.2 15.8 15.0

> sapply(split(mtcars$mpg, mtcars$cyl), mean)
       4        6        8
26.66364 19.74286 15.10000