Alt + E S V

Top Package Downloads in the R Ecosystem


In September, pandas creator Wes McKinney shared this chart about the growth in Stack Overflow traffic for various Python packages.

The original plot appeared in Stack Overflow’s analysis of the reasons driving Python’s growth, as implied by traffic directed to tags for various Python packages. Their assertion is that more visits to questions about a particular library imply more interest in that library, and thus this traffic can provide directional evidence about how Python as a whole is being used in practice.

It’s notable that of these selected packages, the libraries that extend Python’s data structures/analysis, plotting, and mathematical functionality have experienced more growth in the Stack Overflow community than the web framework packages. Stack Overflow’s conclusion was, “this suggests that much of Python’s growth may be due to data science, rather than to web development.”

Stack Overflow is far from the only outlet singing the praises of pandas and Python. In the delightfully named article “Python explosion blamed on pandas”, The Register states,

“Love of pandas reflects a generalized ardor for data science, as measured by visits to specific topics or tags on Stack Overflow. ‘Python’s popularity in data science and machine learning is probably the main driver of its fast growth,’ said [David] Robinson.”

(Before we move on, I would like the record to show I sincerely could not love that article’s title any more.)

Given that both R and Python are acknowledged as the languages of data science, we were similarly curious about which packages are gaining traction in R.

Methodology

RedMonk doesn’t have access to Stack Overflow’s site traffic data, and thus I cannot replicate their exact analysis. I was, however, able to look at data about package downloads from CRAN as well as the number of Stack Overflow questions asked.

I downloaded all daily package download logs available from CRAN, dating from October 2012 through September 2017. The top five packages were determined by summing each package’s daily downloads over the full period. The results for those packages were then aggregated to monthly downloads and plotted.
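The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the post: it assumes the daily logs are the CSV files published by RStudio's CRAN mirror, that each row is one download with `date` and `package` columns, and that the URL pattern in `LOG_URL` is correct. The sample rows are made up so the sketch runs without network access.

```python
import csv
import io
from collections import Counter

# Assumption: daily logs from RStudio's CRAN mirror follow this URL pattern.
LOG_URL = "http://cran-logs.rstudio.com/{year}/{date}.csv.gz"


def log_url(date):
    """Build the assumed log URL for one day, given a 'YYYY-MM-DD' string."""
    return LOG_URL.format(year=date[:4], date=date)


def top_monthly(rows, n=5):
    """Rank packages by total downloads, then return monthly counts for the top n.

    `rows` is an iterable of dicts with 'date' and 'package' keys,
    one dict per download (assumed log layout).
    """
    totals = Counter()   # package -> downloads over the whole period
    monthly = Counter()  # (YYYY-MM, package) -> downloads in that month
    for row in rows:
        totals[row["package"]] += 1
        monthly[(row["date"][:7], row["package"])] += 1

    series = {pkg: {} for pkg, _ in totals.most_common(n)}
    for (month, pkg), count in monthly.items():
        if pkg in series:
            series[pkg][month] = count
    return series


# Made-up sample rows standing in for the real logs.
SAMPLE = """date,package
2017-08-31,Rcpp
2017-09-01,Rcpp
2017-09-01,ggplot2
2017-09-02,Rcpp
2017-09-02,ggplot2
2017-09-02,stringr
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
series = top_monthly(rows, n=2)
print(series)
```

For the real logs, each day's file would be fetched from `log_url(...)`, decompressed, and streamed through the same counters rather than loaded into memory at once.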

From this subset of popular packages I then pulled Stack Overflow’s tags to determine how many questions have been asked about each topic.

Results

Top 5 CRAN downloads 2012 – 2017
Rcpp – “seamless R and C++ integration”
ggplot2 – “elegant data visualizations”
stringr – “simple, consistent wrappers for common string operations”
digest – “create compact hash digests of R objects”
plyr – “tools for splitting, applying and combining data”

For each of these packages, I then pulled the total number of Stack Overflow questions carrying the package’s tag.

(Note that the ‘digest’ tag is not unique to R and includes hashing questions across multiple programming languages.)
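One way to pull per-tag question counts is sketched below. The post does not say how the counts were gathered, so this assumes the public Stack Exchange API's tag info endpoint, whose response lists each tag's `name` and question `count`. The `SHARED_TAGS` set and the counts in the sample payload are hypothetical placeholders, not real figures.

```python
import json
from urllib.parse import quote

# Assumption: Stack Exchange API tag info endpoint; tags are separated by ';'.
API = "https://api.stackexchange.com/2.3/tags/{tags}/info?site=stackoverflow"

# Hypothetical helper set: tags that are not specific to R (see the note on 'digest').
SHARED_TAGS = {"digest"}


def tag_info_url(tags):
    """Build the request URL for a list of tag names."""
    return API.format(tags=quote(";".join(tags), safe=";"))


def question_counts(payload):
    """Map tag name -> (question count, whether the tag is R-specific)."""
    return {
        item["name"]: (item["count"], item["name"] not in SHARED_TAGS)
        for item in payload["items"]
    }


# Placeholder payload in the assumed response shape; counts are invented.
sample = json.loads('{"items": [{"name": "ggplot2", "count": 100},'
                    ' {"name": "digest", "count": 10}]}')

url = tag_info_url(["rcpp", "ggplot2", "stringr", "digest", "plyr"])
counts = question_counts(sample)
print(url)
print(counts)
```

Flagging shared tags separately keeps a cross-language tag such as `digest` from being read as a pure measure of R activity.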

Observations:

  • Performance vs. packages: R is a comparatively slow language, and developers looking for performance gains will often drop into C++ to speed up their code. The Rcpp package integrates R with C++, and the fact that it is the top downloaded package indicates how many R developers and data scientists are looking for solutions to the language’s speed constraints.
    Per a post on R-bloggers, “its [R’s] strong suit really isn’t speed but rather the comparative advantage is the 4,284 packages on CRAN. We accept the slower speed for the time saved from not having to re-invent the wheel every time we want to do something new.” That was written in 2013; if the language’s extensibility was an acceptable tradeoff for its lack of speed with 4,000 packages, then it’s easy to imagine that this is even more true as of this writing, when the number of packages on CRAN tops 11,500.
    The sustained heavy use of the Rcpp package indicates that the language’s performance does matter to users, though it also shows how the package ecosystem is crucial and can in fact help compensate for the language’s shortcomings.
  • The impressive influence of Hadley Wickham: 3 of the top 5 downloaded packages and 5 of the packages in the top 10 are authored by Hadley Wickham, chief scientist at RStudio. Many of Wickham’s most popular packages are aggregated into the tidyverse, “an opinionated collection of R packages designed for data science.” The tidyverse approach shapes how many people perform data manipulation in R, and users of ggplot2, stringr, plyr, scales, and reshape2 (among many other popular packages not in the top 10) owe Wickham a high five for his contributions to the R community. His vision for data manipulation and plotting in R is a key driver of how the language as a whole is used for data science.
  • Downloads vs. activity: Though Rcpp is the most downloaded package in CRAN’s logs, of this subset of packages ggplot2 is by far the most active tag on Stack Overflow. As with our programming language rankings, a common objection to extrapolating developer interest from Stack Overflow questions is that question volume may indicate difficulty of use rather than traction. We continue to believe that Stack Overflow activity is a reasonable gauge of developer interest, and it is particularly notable in this instance to see the magnitude of the disparity between ggplot2 and the other R packages. While total downloads would indicate that packages that help work around R’s speed constraints are a developer priority, Stack Overflow questions show high use around data visualization and plotting.

The latest iteration of the RedMonk programming language rankings ranks Python at #3 and R at #14 (though we worry less about individual rankings and more about tiers of rankings). Through this lens, Python is more firmly established as a leading programming language choice for developers. Python has wider use cases than R, though Stack Overflow’s findings indicate that it is the language’s relevance in data science that is propelling its use. The pandas library and its “easy-to-use data structures and data analysis tools” have helped establish this position.

R is a popular though more niche language. Many of R’s top downloaded packages, particularly those in the tidyverse, are similarly designed to optimize R for data science by helping developers with data modeling and analysis. However, it is notable that the top downloaded package is designed to help overcome performance shortcomings in the language.

6 comments

  1. […] of the R package ecosystem and the most-mentioned packages in Stack Overflow Q&A's. A recent RedMonk post also analyzes the top packages in the R ecosystem, with similar […]

  4. I would suspect that the leading position of Rcpp is also due to the fact that this package is a dependency of many other packages and therefore gets downloaded automatically when I select another package. Did you analyse package dependencies in order to get a ranking of “toplevel” CRAN packages (by toplevel I mean packages like dplyr, ggplot2, etc. in contrast to “support” packages like Rcpp, magrittr, etc.)?

    1. I think you’re right; Hadley also pointed out on Twitter that the dependencies of ggplot are a factor in the results. We’d love to extend the analysis in the next iteration to take dependencies into account.
