Here’s a script that I use to query the CRAN package download logs and figure out which packages are the “top” packages being used/downloaded. It borrows heavily from this post, which is badass… Be wary: it downloads all of the logs for the specified date range and loads them into a data.table (if you specify a large timeframe, this thing will be HUGE). To work around that, I randomly sample the data.table for half of its entries, and at the end I’ve got a section to install these “top” packages. However, it’s currently commented out.
Geek+1
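Before running the full script, you can get a feel for how many daily log files a given window implies. This is just a sketch that pulls out the same seq() and weekdays() filtering the script uses below; nothing new is assumed:

# Rough sizing check (sketch): how many daily logs does this date range imply?
days <- seq(as.Date('2014-09-01'), as.Date('2015-02-28'), by = 'day')
length(days)                                                 # every day in the window
sum(weekdays(days) %in% c("Monday", "Wednesday", "Friday"))  # just the days the script keeps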
## Inspired by and heavily dependent on the code of Felix Schönbrodt:
## http://www.nicebread.de/finally-tracking-cran-packages-downloads/

## ======================================================================
## Step 1: Parameterize the script with dates and the number of packages
## that we're interested in...
## ======================================================================

# My advice would be to set this to a day or a week. If you do six months
# like I've done here, you'd better have the memory to support it!
start <- as.Date('2014-09-01')
end   <- as.Date('2015-02-28')

# How many "top" packages are we interested in?
top.x <- 20

## ======================================================================
## Step 2: Download the daily log files for the date range
## ======================================================================

# Here's an easy way to get all the URLs in R
all_days <- seq(start, end, by = 'day')

# If we were to look, we'd see a strong weekly pattern in the downloads,
# with Saturday and Sunday having much fewer downloads than other days.
# Not surprising, since fewer people are at work on weekends. Let's just
# look at MWF to be safe...
weekdays(all_days)  # day-of-week mix before filtering
days.To.Keep <- c("Monday", "Wednesday", "Friday")
all_days <- subset(all_days, weekdays(all_days) %in% days.To.Keep)
weekdays(all_days)  # ...and after

year <- as.POSIXlt(all_days)$year + 1900
urls <- paste0('http://cran-logs.rstudio.com/', year, '/', all_days, '.csv.gz')
names(urls) <- as.character(all_days)  # so we can look URLs up by day below

# Only download the files you don't already have:
dir.create("CRANlogs", showWarnings = FALSE)
missing_files <- setdiff(as.character(all_days),
                         tools::file_path_sans_ext(dir("CRANlogs"), compression = TRUE))

for (i in seq_along(missing_files)) {
  print(paste0(i, "/", length(missing_files)))
  download.file(urls[missing_files[i]],
                paste0('CRANlogs/', missing_files[i], '.csv.gz'))
}

## ======================================================================
## Step 3: Load the single data files into one big data.table and then
## clean up the files (delete them) once we're done
## ======================================================================

file_list <- list.files("CRANlogs", full.names = TRUE)

logs <- list()
for (file in file_list) {
  print(paste("Reading", file, "..."))
  logs[[file]] <- read.table(file, header = TRUE, sep = ",", quote = "\"",
                             dec = ".", fill = TRUE, comment.char = "",
                             as.is = TRUE)
}

# rbind all of the files together
library(data.table)
dat <- rbindlist(logs)

# logs will likely be huge, so unless you have memory for days, we'd best
# delete it and free up that memory
#rm(logs); gc(verbose = TRUE)

# Let's make this data.table smaller, to save memory, by randomly sampling
# half of its rows
dat <- dat[sample(nrow(dat), ceiling(0.5 * nrow(dat))), ]

# define the remaining variable types
dat[, date := as.Date(date)]
dat[, package := factor(package)]
dat[, week := strftime(as.POSIXlt(date), format = "%Y-%W")]

# set the key
setkey(dat, package, date, week)

# Delete the files and their directory (gots to keep our shit clean!!!!)
# Just comment this out if you don't want to delete the files (i.e., you
# might want them for later use)
#unlink("CRANlogs", recursive = TRUE)

## ======================================================================
## Step 4: Analyze it!
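Before moving on to the analysis, it can be worth checking how much memory dat is actually using after the 50% sample. A minimal sketch using base R's object.size(); nothing beyond base R is assumed:

# Footprint check (sketch): how big is the sampled table?
print(object.size(dat), units = "Mb")  # approximate in-memory size of dat
nrow(dat)                              # rows that survived the 50% sample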
## ======================================================================

library(ggplot2)
library(plyr)

# Overall downloads of packages; remember these counts come from the
# 50% sample taken above
d1 <- dat[, length(week), by = package]
d1 <- d1[order(-V1), ]

# Build a vector of package names, to be used later for install.packages()
package.names <- as.character(d1$package[1:top.x])

# Plot 1: Compare downloads of the "top" packages on a weekly basis.
# V1 counts unique downloading IPs per week; it's doubled as a rough
# correction for the 50% sample taken above.
agg1 <- dat[J(package.names), length(unique(ip_id)), by = c("week", "package")]

ggplot(agg1, aes(x = week, y = V1 * 2, color = package, group = package)) +
  geom_line(size = 1) +
  ylab("Downloads") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

## ======================================================================
## Step 5: Install them all (plus their dependencies)!
## ======================================================================

# Uncomment this line if you want to install all of the "top" packages
# install.packages(package.names, dependencies = TRUE)
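If you'd rather not reinstall packages you already have, here's a minimal variation on Step 5. It assumes only base R's installed.packages(), and the to.install name is just illustrative:

# Install only the "top" packages that aren't already on this machine
to.install <- setdiff(package.names, rownames(installed.packages()))
if (length(to.install) > 0) install.packages(to.install, dependencies = TRUE)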