My script to install the “top” R packages

Here’s a script that I use to query the CRAN package download logs and figure out which packages are the “top” packages being downloaded/used. It borrows heavily from this post, which is badass… Be wary, though: it downloads all of the logs for the specified date range and creates a data.table to house them (if you specify a large timeframe, this thing will be HUGE). To get around that, I randomly sample the data.table for half of the entries. At the end, there’s a section to install the top packages, but it’s currently commented out.

Geek+1
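
Before committing to a big date range, it can help to peek at a single day’s log first. Here’s a minimal sketch, assuming the RStudio CRAN mirror’s URL pattern and column layout (date, time, size, r_version, r_arch, r_os, package, version, country, ip_id) are unchanged:

# Peek at one day's log before running the full script
one_day <- as.Date('2014-09-01')
one_url <- paste0('http://cran-logs.rstudio.com/', format(one_day, '%Y'), '/', one_day, '.csv.gz')
tmp <- tempfile(fileext = '.csv.gz')
download.file(one_url, tmp)
# read.table() decompresses the .gz file transparently; nrows keeps the peek cheap
str(read.table(tmp, header = TRUE, sep = ',', nrows = 5, as.is = TRUE))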

## Inspired by and heavily dependent on code from Felix Schönbrodt
## http://www.nicebread.de/finally-tracking-cran-packages-downloads/

## ======================================================================
## Step 1: Parameterize the script with dates and the number of packages
## that we're interested in...
## ======================================================================

# My advice would be to set this to a day or a week. If you do six months like
# I've done here, you'd better have the memory to support it!
start <- as.Date('2014-09-01')
end <- as.Date('2015-02-28')

# How many "top" packages are we interested in?
top.x <- 20

## ======================================================================
## Step 2: Download all log files for each week
## ======================================================================

# Here's an easy way to get all the URLs in R
all_days <- seq(start, end, by = 'day')

# If we looked, we'd see a strong weekly pattern in the downloads, with
# Saturday and Sunday having far fewer downloads than the other days. Not
# surprising, since most R users aren't at work on weekends. Let's just look
# at Monday/Wednesday/Friday to be safe...
weekdays(all_days)
days.To.Keep <- c("Monday", "Wednesday", "Friday")
all_days <- subset(all_days, weekdays(all_days) %in% days.To.Keep)
weekdays(all_days)

year <- as.POSIXlt(all_days)$year + 1900
urls <- paste0('http://cran-logs.rstudio.com/', year, '/', all_days, '.csv.gz')

# make sure the download directory exists
dir.create("CRANlogs", showWarnings = FALSE)

# only download the files you don't have:
missing_files <- setdiff(as.character(all_days),
                         tools::file_path_sans_ext(dir("CRANlogs"), compression = TRUE))
# pick out the URLs for just the missing days, so the loop indices line up
missing_urls <- urls[as.character(all_days) %in% missing_files]

for (i in seq_along(missing_files)) {
  print(paste0(i, "/", length(missing_files)))
  download.file(missing_urls[i], paste0('CRANlogs/', missing_files[i], '.csv.gz'))
}
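
# Note: the occasional daily log is missing from the server, and one failed
# download.file() call will stop the loop above. A hedged alternative wraps
# it in tryCatch() so a bad day gets skipped instead:
# for (i in seq_along(missing_files)) {
#   print(paste0(i, "/", length(missing_files)))
#   tryCatch(
#     download.file(missing_urls[i], paste0('CRANlogs/', missing_files[i], '.csv.gz')),
#     error = function(e) message("Skipping ", missing_files[i], ": ", conditionMessage(e))
#   )
# }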

## ======================================================================
## Step 3: Load single data files into one big data.table and then clean
## up the files (delete them) once we're done
## ======================================================================

file_list <- list.files("CRANlogs", full.names=TRUE)

logs <- list()
for (file in file_list) {
  print(paste("Reading", file, "..."))
  logs[[file]] <- read.table(file, header = TRUE, sep = ",", quote = "\"",
                             dec = ".", fill = TRUE, comment.char = "", as.is=TRUE)
}
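
# A hedged speed-up, assuming data.table >= 1.12 and the R.utils package are
# installed: fread() reads the .csv.gz files directly and is far faster than
# read.table() on logs this size.
# logs <- lapply(file_list, data.table::fread)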

# rbind all of the files together
library(data.table)
dat <- rbindlist(logs)
# logs will likely be huge, so unless you have memory for days, we'd best
# delete it and free up that memory
#rm(logs); gc(verbose=TRUE);

# Let's make this data.table smaller, to save memory, by randomly sampling half of it
dat <- dat[sample(nrow(dat), ceiling(0.5 * nrow(dat))), ]
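# Note: sample() draws a different half each run. If you want the counts and
# plot below to be reproducible, call set.seed() before the sampling line
# above -- the seed value itself is arbitrary, e.g. set.seed(42)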

# define the remaining variable types
dat[, date:=as.Date(date)]
dat[, package:=factor(package)]
dat[, week:=strftime(as.POSIXlt(date),format="%Y-%W")]

# set the key so the J() join on package names in Step 4 is fast
setkey(dat, package, date, week)

# Delete the files and their directory (gots to keep our shit clean!!!!)
# Leave this line commented out if you want to keep the files for later use
#unlink("CRANlogs", recursive = TRUE)

## ======================================================================
## Step 4: Analyze it!
## ======================================================================

library(ggplot2)

# Overall downloads of packages
d1 <- dat[, length(week), by=package]
d1 <- d1[order(-V1), ]

# Build a vector of package names, to be used later for install.packages
package.names <- as.character(d1$package[1:top.x])

# plot 1: Compare downloads of "top" packages on a weekly basis
agg1 <- dat[J(package.names), length(unique(ip_id)), by=c("week", "package")]

# scale by 2, since we randomly kept only half of the log entries earlier
ggplot(agg1, aes(x=week, y=V1*2, color=package, group=package)) + geom_line(size=1) +
  ylab("Downloads") + theme_bw() +
  theme(axis.text.x  = element_text(angle=90, vjust=0.5))

## ======================================================================
## Step 5: Install them all (plus their dependencies)!
## ======================================================================

# Uncomment this line if you want to install all of the "top" packages
# install.packages(package.names, dependencies = TRUE)
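
# A hedged variant, assuming you'd rather not reinstall packages you already
# have: only install the top packages that are missing from your library.
# new.packages <- setdiff(package.names, rownames(installed.packages()))
# if (length(new.packages)) install.packages(new.packages, dependencies = TRUE)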