I am taking the R Programming course from the Data Science Specialization offered by Johns Hopkins University on Coursera. This blog post is a set of personal notes in which you can follow my reasoning through the exercises.
Today I am trying to complete Part 1 of Assignment 1, “Air Pollution”. We are given a .zip file that contains 332 *.csv files with pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor, and the ID number for each monitor is contained in the file name. Here is my walkthrough.
Part 1: pollutantmean()
Part 1 is about writing the pollutantmean(directory, pollutant, id = 1:332) function, which returns the mean of the specified pollutant across one or more CSV files (selected by id) in the specified directory.
The results should be:
> pollutantmean("specdata", "sulfate", 1:10)
[1] 4.064
> pollutantmean("specdata", "nitrate", 70:72)
[1] 1.706
> pollutantmean("specdata", "nitrate", 23)
[1] 1.281
My attempt:
I see two cases: id refers to a single monitor, or id refers to several consecutive monitors.
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files <- list.files(directory, full.names = TRUE)
  if (length(files[id]) == 1) {
    # Case where id indicates 1 file
    mean(read.csv(files[id])[, pollutant], na.rm = TRUE)
  } else {
    # Case where id indicates many files in a row
    datas <- data.frame()
    for (i in 1:length(files[id])) {
      datas <- rbind(datas, read.csv(files[i]))
    }
    mean(datas[, pollutant], na.rm = TRUE)
  }
}
Results are:
> pollutantmean("specdata", "sulfate", 1:10)
[1] 4.064128
> pollutantmean("specdata", "nitrate", 70:72)
[1] 0.8599547
> pollutantmean("specdata", "nitrate", 23)
[1] 1.280833
The first and the third requests work, but not the second one. The mistake is that the loop always starts at i = 1 and runs up to the number of requested files, instead of iterating over the requested IDs. That is why 1:10 returns the right answer, while 70:72 actually reads files 1 to 3 (length(files[70:72]) is 3, so the loop runs over 1:3). After fixing the loop to iterate over id directly, all the results are right:
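To make the indexing mistake visible without the course data, here is a minimal sketch. The file names are made up for illustration (a stand-in for what list.files() might return); only the index arithmetic matters.

```r
# Hypothetical file names standing in for the output of list.files()
files <- sprintf("specdata/%03d.csv", 1:332)
id <- 70:72

# Original loop bounds: they count the requested files, so they restart at 1
buggy_index <- 1:length(files[id])
buggy_files <- files[buggy_index]

# Fixed loop bounds: iterate over the requested ids themselves
fixed_files <- files[id]

print(buggy_files)  # the first three files, not the requested ones
print(fixed_files)  # files 070, 071 and 072
```

With id = 1:10 both versions happen to select the same files, which is why the first test passed.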
## Fixed loop
for (i in id) {
  datas <- rbind(datas, read.csv(files[i]))
}
> pollutantmean("specdata", "sulfate", 1:10)
[1] 4.064128
> pollutantmean("specdata", "nitrate", 70:72)
[1] 1.706047
> pollutantmean("specdata", "nitrate", 23)
[1] 1.280833
What I try next is to check whether the function works when the given IDs are not consecutive. To do so, I:
– Read the list of monitor files into the files vector, then bind files 23 and 26 into the bind23_26 data frame (this appends the data from file 26 right after the data from file 23 in a single data.frame) and compute the mean by hand.
– Create a vector containing id=23 and id=26 and pass it to the pollutantmean() function.
> files <- list.files("specdata", full.names = TRUE)
> bind23_26 <- read.csv(files[23])
> bind23_26 <- rbind(bind23_26, read.csv(files[26]))
> mean(bind23_26[, "nitrate"], na.rm = TRUE)
[1] 4.169054
> v <- c(23, 26)
> pollutantmean("specdata", "nitrate", v)
[1] 4.169054
To my surprise, it works without any further change to the loop. I learned that a for loop can iterate over any vector, not just a range, for example (i in c(1, 4, 5, …)).
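As a quick illustration of that behaviour, the sketch below loops over a non-consecutive vector and over a single value. The variable names are mine and not part of the assignment.

```r
# A for loop visits each element of the vector, whatever the values are
visited <- c()
for (i in c(23, 26)) {
  visited <- c(visited, i)
}
print(visited)

# A single number is a vector of length 1, so the loop body runs once
count <- 0
for (i in 23) {
  count <- count + 1
}
print(count)
```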
Next, I thought I would have to display the results to three decimal places (10^-3), just like the example, but the assignment asks us not to round the values, so I leave the output of the function untouched.
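If you only want to compare your numbers with the expected output, one option (my own suggestion, not something the assignment requires) is to round a copy for display and keep the value returned by the function as it is:

```r
# Values returned by the fixed function in the console session above
results <- c(4.064128, 1.706047, 1.280833)

# Round a copy for comparison; `results` itself is left unchanged
rounded <- round(results, 3)
print(rounded)

# Or limit the number of significant digits at print time only
print(results[2], digits = 4)
```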
Finally, I can remove the special case where id is a single element, since a for loop can just as well iterate over a vector of length 1.
## pollutantmean.R
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files <- list.files(directory, full.names = TRUE)
  datas <- data.frame()
  for (i in id) {
    datas <- rbind(datas, read.csv(files[i]))
  }
  mean(datas[, pollutant], na.rm = TRUE)
}
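For comparison, here is a sketch of a variant that reads all requested files with lapply() and binds them once, instead of growing the data frame inside the loop. Two things here are assumptions of mine: the name pollutantmean2 is hypothetical, and I assume the files are named after the zero-padded monitor ID (001.csv, 002.csv, and so on), so the path is built from id rather than from the position in list.files(). The demo data is made up so the block runs without the course files.

```r
# Hypothetical variant: read every requested file, then bind once
pollutantmean2 <- function(directory, pollutant, id = 1:332) {
  # Assumption: files are named 001.csv, 002.csv, ... after the monitor ID
  paths <- file.path(directory, sprintf("%03d.csv", id))
  frames <- lapply(paths, read.csv)
  datas <- do.call(rbind, frames)
  mean(datas[[pollutant]], na.rm = TRUE)
}

# Tiny made-up data set so the sketch can be run on its own
demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, NA, 3), nitrate = c(2, 4, NA), ID = 1),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(5, 7), nitrate = c(NA, 6), ID = 2),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

# Mean of the non-missing sulfate values 1, 3, 5 and 7
demo_mean <- pollutantmean2(demo_dir, "sulfate", 1:2)
print(demo_mean)
```

Building the path from the ID avoids depending on the order in which list.files() returns the files, at the cost of hard-coding the naming pattern.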
This was a great help for a full time working dad that is strapped for time, but still finds time to learn new languages during his downtime. Thanks.
Thank you for sharing this, so helpful for a non-talent in coding like me! Really appreciate it!