Part 2: complete()
Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases.
The returned data is shown below.
    complete <- function(directory, id = 1:332) {
      files <- list.files(directory, full.names = TRUE)
      complete_files <- data.frame(id = NA, nobs = NA)
      for (i in id) {
        complete_files[i, 1] <- i
        complete_files[i, 2] <- sum(complete.cases(read.csv(files[i])))
      }
      complete_files
    }
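This is relatively simple: the function starts by listing the files in the given directory and creating a data.frame with the column names id and nobs, initialized with a single row of NA values. It then loops over id and stores, for each monitor, its id and the number of complete cases in its file.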
Console output:
    > complete("specdata", 1)
      id nobs
    1  1  117
    > complete("specdata", c(2, 4, 8, 10, 12))
       id nobs
    1  NA   NA
    2   2 1041
    3  NA   NA
    4   4  474
    5  NA   NA
    6  NA   NA
    7  NA   NA
    8   8  192
    9  NA   NA
    10 10  148
    11 NA   NA
    12 12   96
    > complete("specdata", 30:25)
       id nobs
    (...)
    22 NA   NA
    23 NA   NA
    24 NA   NA
    25 25  463
    26 26  586
    27 27  338
    28 28  475
    29 29  711
    30 30  932
    > complete("specdata", 3)
      id nobs
    1 NA   NA
    2 NA   NA
    3  3  243
The counts are correct, but there is an issue: the result has a row for every number from 1 to the largest id, not just for the requested ids.
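After stepping through the function with debug(complete) to see the trace, my guess is that, because the result is a data.frame indexed by row number, assigning to row i creates every missing row from 1 up to i and fills it with NA, so the values land in rows #01, #02… #id.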
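The row-padding behaviour can be reproduced without any data files. The snippet below is a minimal sketch, not part of the assignment; the variable name padded is made up for illustration, and the values 3 and 243 are taken from the complete("specdata", 3) output above.

```r
# Start from the same one-row NA data.frame used in the first version.
padded <- data.frame(id = NA, nobs = NA)

# Assigning to row 3 forces R to create row 2 as well, filled with NA.
padded[3, 1] <- 3
padded[3, 2] <- 243

nrow(padded)   # number of rows after the assignment
padded$id      # ids stored in each row, including the NA padding
```

Printing padded should show the same shape as the complete("specdata", 3) output: two NA rows followed by the requested one.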
A partial solution:
    complete <- function(directory, id = 1:332) {
      files <- list.files(directory, full.names = TRUE)
      complete_files <- data.frame(id = integer(), nobs = integer())
      for (i in id) {
        complete_files[i, 1] <- i
        complete_files[i, 2] <- sum(complete.cases(read.csv(files[i])))
      }
      complete_files[complete.cases(complete_files), ]
    }

    > complete("specdata", 25:30)
       id nobs
    25 25  463
    26 26  586
    27 27  338
    28 28  475
    29 29  711
    30 30  932
Better, but still not correct: we have only removed the NA rows from the data.frame. The row names still match the ids (25 to 30), whereas the results should start at row 1 of the data.frame, and the rows come back sorted by id rather than in the requested order.
What we need to do is index the rows by position in the id vector (1 to length(id)), not by the id value itself.
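To see the idea in isolation, here is a small sketch that fills a data.frame by position rather than by id value. It does not read any files: nobs_lookup is a hypothetical stand-in for the per-file counts, using two values from the console output earlier in this post, and by_position is an illustrative name.

```r
# Requested ids, deliberately in descending order.
id <- c(30, 25)

# Hypothetical stand-in for sum(complete.cases(read.csv(...))).
nobs_lookup <- c("25" = 463, "30" = 932)

by_position <- data.frame(id = integer(), nobs = integer())
for (i in seq_along(id)) {
  # Row i holds the i-th requested id, whatever its value is.
  by_position[i, 1] <- id[i]
  by_position[i, 2] <- nobs_lookup[[as.character(id[i])]]
}

by_position
```

The rows are numbered 1 and 2 and keep the requested order (30 first, then 25). The sketch uses seq_along(id) instead of 1:length(id) as a small precaution, since 1:length(id) would iterate over 1 and 0 if id were empty.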
My solution:
    complete <- function(directory, id = 1:332) {
      files <- list.files(directory, full.names = TRUE)
      complete_files <- data.frame(id = integer(), nobs = integer())
      for (i in 1:length(id)) {
        complete_files[i, 1] <- id[i]
        complete_files[i, 2] <- sum(complete.cases(read.csv(files[id[i]])))
      }
      complete_files
    }
Although this works and the fix is simple, I hit a slight difficulty along the way: my first attempt nested two loops (for i in id > for j in length(id)) to track both the current id from the vector and the current row of the data.frame. Predictably, for every id in the vector it wrote into the data.frame length(id) times, overwriting the data already stored.
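The overwriting effect of the nested loops can be shown with a toy example. This is only a guess at what that first attempt looked like, assuming the inner loop ran over every row position; the name nested is made up for illustration and no files are read.

```r
id <- c(25, 26, 27)

nested <- data.frame(id = integer(), nobs = integer())
for (i in id) {
  for (j in seq_along(id)) {
    # Every row j is rewritten with the current i,
    # so earlier ids are lost on each pass of the outer loop.
    nested[j, 1] <- i
  }
}

nested$id   # every row ends up holding the last id in the vector
```

A single loop over positions avoids this, because each row is written exactly once.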