R Programming Course – Assignment 1 : Air Pollution Part 2

Part 2 : complete()

Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases.
Example output is shown below.

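complete <- function(directory, id = 1:332) {
  # Full paths of all monitor files in the directory
  files <- list.files(directory, full.names = TRUE)

  # One-row data frame (NA, NA) that the loop fills in
  complete_files <- data.frame(id = NA, nobs = NA)

  for (i in id) {
    # Row i holds monitor i and its number of complete cases
    complete_files[i, 1] <- i
    complete_files[i, 2] <- sum(complete.cases(read.csv(files[i])))
  }

  complete_files
}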

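This is relatively simple: the function lists the files in the given directory, creates a data.frame with the columns id and nobs (initialized with a single NA row), then loops over the requested ids and stores, for each one, the number of complete cases in that monitor's file.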

Console output:

> complete("specdata", 1)
 id nobs
1 1 117

> complete("specdata", c(2, 4, 8, 10, 12))
 id nobs
1 NA NA
2 2 1041
3 NA NA
4 4 474
5 NA NA
6 NA NA
7 NA NA
8 8 192
9 NA NA
10 10 148
11 NA NA
12 12 96

> complete("specdata", 30:25)
 id nobs
(...)
22 NA NA
23 NA NA
24 NA NA
25 25 463
26 26 586
27 27 338
28 28 475
29 29 711
30 30 932

> complete("specdata", 3)
 id nobs
1 NA NA
2 NA NA
3 3 243

The code runs, but there is an issue: the result contains every row from 1 up to the largest requested id, not just the requested ids.
After stepping through the function with debug(complete), my guess is that because the loop writes to row i of the data.frame, R creates every row from 1 to i and fills the ones that are never written with NA.
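The row-padding behaviour can be reproduced without any data files. A minimal sketch, reusing the values from the complete("specdata", 3) call above:

```r
# Same starting point as in complete(): a one-row data frame of NAs
df <- data.frame(id = NA, nobs = NA)

# Writing to row 3 makes R create rows 1 to 3; row 2 is never written,
# so it is padded with NA (row 1 keeps its initial NA values)
df[3, 1] <- 3
df[3, 2] <- 243

print(df)
n_rows <- nrow(df)
```

Here n_rows is 3 even though only one monitor was requested.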

A partial fix:

complete <- function(directory, id = 1:332) {
  files <- list.files(directory, full.names = TRUE)

  # Empty data frame with typed columns
  complete_files <- data.frame(id = integer(), nobs = integer())

  for (i in id) {
    complete_files[i, 1] <- i
    complete_files[i, 2] <- sum(complete.cases(read.csv(files[i])))
  }

  # Drop the NA rows created by the row indexing
  complete_files[complete.cases(complete_files), ]
}

> complete("specdata", 25:30)
 id nobs
25 25 463
26 26 586
27 27 338
28 28 475
29 29 711
30 30 932

Better, but still not correct: we have only dropped the NA rows, so the row names still run from 25 to 30 instead of starting at 1. A descending request such as 30:25 also comes back in ascending order.

What we need is to index the rows by position in the id vector (1 to length(id)), not by the id itself.

My solution :

complete <- function(directory, id = 1:332) {
  files <- list.files(directory, full.names = TRUE)

  complete_files <- data.frame(id = integer(), nobs = integer())

  # i is the position in id (1 to length(id)), so rows are filled from 1
  for (i in seq_along(id)) {
    complete_files[i, 1] <- id[i]
    complete_files[i, 2] <- sum(complete.cases(read.csv(files[id[i]])))
  }

  complete_files
}
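As a possible alternative, the same table can be built without growing a data.frame row by row. This is only a sketch: complete_sketch and count_complete are hypothetical names, and it assumes, like the code above, that the files sort in monitor order so that files[k] belongs to monitor k.

```r
complete_sketch <- function(directory, id = 1:332) {
  files <- list.files(directory, full.names = TRUE)

  # Hypothetical helper: number of fully observed rows in one file
  count_complete <- function(path) sum(complete.cases(read.csv(path)))

  # One count per requested id, in the order the ids were given
  nobs <- vapply(files[id], count_complete, integer(1), USE.NAMES = FALSE)

  data.frame(id = id, nobs = nobs)
}
```

Because the rows are built from the id vector directly, they are numbered from 1 and keep the requested order.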

Although this works and the fix is simple, it took me a while to get there: my first attempt nested two loops (for i in id, then for j in 1:length(id)) to track both the current monitor ID and the current row of the data.frame. Predictably, for every monitor ID the inner loop wrote to all length(id) rows, overwriting the earlier data.

R Programming Course – Assignment 1 : Air Pollution Part 1

I am taking the R Programming course from the Data Science Specialization offered by Johns Hopkins University on Coursera. This blog post is my personal notes, written so that the reasoning behind each exercise can be followed.

Today I try to complete Part 1 of Assignment 1, “Air Pollution”. We are given a .zip file containing 332 *.csv files with pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor, and the ID number of each monitor is contained in the file name. Here is my walkthrough.

Part 1 : pollutantmean()

Part 1 is about writing the function pollutantmean(directory, pollutant, id = 1:332), which returns the mean of a specified pollutant across one or more CSV files (selected by id) in the specified directory.

The results should be:

> pollutantmean("specdata", "sulfate", 1:10)
[1] 4.064
> pollutantmean("specdata", "nitrate", 70:72)
[1] 1.706
> pollutantmean("specdata", "nitrate", 23)
[1] 1.281

My try :

There are 2 cases: id refers to a single monitor, or id refers to several consecutive monitors.

pollutantmean <- function(directory, pollutant, id = 1:332) {
  files <- list.files(directory, full.names = TRUE)

  if (length(files[id]) == 1) {
    # Case where id indicates 1 file
    mean(read.csv(files[id])[, pollutant], na.rm = TRUE)
  } else {
    # Case where id indicates many files in a row
    datas <- data.frame()
    for (i in 1:length(files[id])) {
      datas <- rbind(datas, read.csv(files[i]))
    }
    mean(datas[, pollutant], na.rm = TRUE)
  }
}

Results are:

> pollutantmean("specdata", "sulfate", 1:10)
[1] 4.064128
> pollutantmean("specdata", "nitrate", 70:72)
[1] 0.8599547
> pollutantmean("specdata", "nitrate", 23)
[1] 1.280833

The first and the third requests work, but not the second one… The mistake is that the loop always starts at i=1 and counts up to the number of requested files, instead of iterating over the given ids. That is why 1:10 returns the right answer, but 70:72 actually returns the result for files 1:3. After simply fixing the loop, all the results are correct:

## Fixed loop
for (i in id){
 datas <- rbind(datas, read.csv(files[i]))
}
> pollutantmean("specdata", "sulfate", 1:10)
[1] 4.064128
> pollutantmean("specdata", "nitrate", 70:72)
[1] 1.706047
> pollutantmean("specdata", "nitrate", 23)
[1] 1.280833
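The difference between the two loops can be checked without reading any data. In this small sketch, the zero-padded file names are a hypothetical stand-in for what list.files() returns:

```r
# Hypothetical stand-in for list.files(): 332 zero-padded file names
files <- sprintf("%03d.csv", 1:332)
id <- 70:72

# First version: counts from 1 up to the number of requested files
wrong_idx <- 1:length(files[id])

# Fixed version: iterates over the requested ids themselves
right_idx <- id

files[wrong_idx]  # "001.csv" "002.csv" "003.csv"
files[right_idx]  # "070.csv" "071.csv" "072.csv"
```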

Next, I check whether the function works when the given IDs are not consecutive. I do:
– Read the list of monitor files into the files vector, then bind files 23 and 26 into the bind23_26 data.frame (rbind() appends the data from file 26 right after the data from file 23).
– Create a vector containing 23 and 26 and pass it as id to pollutantmean().

> files <- list.files("specdata", full.names=1)
> bind23_26 <- read.csv(files[23])
> bind23_26 <- rbind(bind23_26, read.csv(files[26]))
> mean(bind23_26[,"nitrate"], na.rm=1)
[1] 4.169054
> v <- c(23,26)
> pollutantmean("specdata", "nitrate", v)
[1] 4.169054

It works without any further change to the loop. I learned that a for loop iterates over the elements of any vector, for example (i in c(1, 4, 5, …) ).
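A tiny sketch of that behaviour, where visited is a hypothetical name used only to record which values the loop sees:

```r
ids <- c(23, 26)
visited <- c()

# The loop variable takes each element of the vector in turn,
# whether or not the values are consecutive
for (i in ids) {
  visited <- c(visited, i)
}

visited  # 23 26
```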

Next, I thought I had to display the results to three decimal places (10^-3) like in the example, but the assignment asks not to round the values…
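If a three-decimal display is wanted anyway, one option is to round only when showing the value and leave the returned number untouched. A minimal sketch, assuming the sulfate result from above:

```r
x <- 4.064128  # value returned by pollutantmean("specdata", "sulfate", 1:10)

# Round only for display; x itself keeps full precision
shown <- format(round(x, 3), nsmall = 3)
shown  # "4.064"
```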

Finally, I can remove the special case where id is a single element, since a for loop handles a vector of length 1 just as well.

## pollutantmean.R
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files <- list.files(directory, full.names = TRUE)

  # Stack the requested files into one data frame
  datas <- data.frame()
  for (i in id) {
    datas <- rbind(datas, read.csv(files[i]))
  }

  # Mean of the pollutant column, ignoring missing values
  mean(datas[, pollutant], na.rm = TRUE)
}
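As a possible variation, the files can be read with lapply() so that only the pollutant column is kept, instead of growing a data.frame with rbind(). This is a sketch rather than the assignment's reference solution; pollutantmean_sketch is a hypothetical name, and it makes the same assumption that files[k] belongs to monitor k.

```r
pollutantmean_sketch <- function(directory, pollutant, id = 1:332) {
  files <- list.files(directory, full.names = TRUE)

  # Read every requested file, then keep only the pollutant column
  values <- lapply(files[id], function(path) read.csv(path)[[pollutant]])

  # Flatten the list into one vector and average, ignoring NAs
  mean(unlist(values), na.rm = TRUE)
}
```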
