In an article entitled “Python Displacing R As The Programming Language For Data Science,” MongoDB’s Matt Asay made an argument that has been circulating for some time now. As Python has steadily improved its data science credentials, from NumPy to pandas, with even R’s dominant ggplot2 charting library having been ported, its viability as a real data science platform improves daily. More than any other language in fact, save perhaps Java, Python is rapidly becoming a lingua franca, with footholds in every technology arena from the desktop to the server.
The question, per yesterday’s piece, is what this means for R specifically. Not surprisingly, as a debate between programming languages, the question is not without controversy. Advocates of one platform or the other have taken to Twitter to argue for or against the hypothesis, sometimes heatedly.
Python advocates point to the flaws in R’s runtime, primarily performance, and its idiosyncratic syntax. These are valid complaints, speaking as a regular R user. They are less than persuasive, given that clear, clean syntax and a fast runtime correlate only weakly with actual language usage, but they certainly represent legitimate arguments. More broadly, and more convincingly, others assert that over a long enough horizon, general purpose tools typically see wider adoption than specialized alternatives. Which is, again, a substantive point.
R advocates, meanwhile, point to R’s anecdotal but widely accepted traction within academic communities. As an open source, data-science focused runtime with a huge number of libraries behind it, R has been replacing tools like MATLAB, SAS, and SPSS within academic settings, both in statistics departments and outside of them. R’s packaging system (CRAN), in fact, is so extensive that it contains not only libraries for operating on data, but datasets themselves. Not only does it contain datasets for individual textbooks used in academic courses, it stores different datasets for different editions of those textbooks. An entire generation of researchers is being trained to use R for their analysis.
Typically this is the type of subjective debate which can be examined via objective data sources, but comparing the trajectories is problematic and potentially not possible without further comparative research. RStudio’s Hadley Wickham, creator of many of the most important R libraries, examined GitHub and Stack Overflow data in an attempt to apply metrics to the debate, but all the data really tells us is that a) both languages are growing and that b) Python is more popular – which we knew already. Searches of package popularity are likewise unrevealing; besides the difficulty of comparing runtimes due to the package-per-version protocol, there is the contextual difficulty of comparing Python to R. Python represents a superset of R use cases. We know Python is more versatile and applicable in a much wider range of applications. We also know that in spite of Python’s recent gains, R has a wider selection of data science libraries available to it.
My colleague Donnie Berkholz points to this survey, which at least is context-specific in its focus on languages employed for analytics, data mining, and data science. It indicates that R remains the most popular language for data science, at 60.9% to Python’s 38.8%. And for those who would argue that current status is less important than trajectory, it further suggests that R actually grew at a higher rate this year than Python – 15.1% to 14.2%. But without knowing more about the composition and sampling of the survey audience, it’s difficult to attribute too much importance to this survey. Granted, it’s context-specific, but we have no way of knowing whether the audience surveyed is representative or skewed in one direction or another.
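For readers who want to see what those numbers imply, here is a rough sketch in Python. It rests on an assumption the survey figures quoted here do not confirm: that the 15.1% and 14.2% growth numbers are relative year-over-year changes in each language's share of respondents. The variable and function names are purely illustrative.

```python
# Shares and growth rates as quoted from the survey above.
# ASSUMPTION: growth is a relative year-over-year change in share,
# not a change in percentage points.
shares = {"R": 60.9, "Python": 38.8}      # percent of respondents
growth = {"R": 0.151, "Python": 0.142}    # assumed relative growth


def implied_prior_share(current, rate):
    """Back out last year's share from this year's share and growth rate."""
    return current / (1.0 + rate)


prior = {lang: implied_prior_share(shares[lang], growth[lang]) for lang in shares}

gap_now = shares["R"] - shares["Python"]      # gap in percentage points today
gap_before = prior["R"] - prior["Python"]     # implied gap a year earlier

for lang in shares:
    print(f"{lang}: implied prior share {prior[lang]:.1f}%, now {shares[lang]:.1f}%")
print(f"R's lead: {gap_before:.1f} points before, {gap_now:.1f} points now")
```

Under that reading, R's lead in this survey would have widened rather than narrowed over the year, which is the opposite of what the displacement argument predicts. The sampling caveat still applies in full: the arithmetic is only as good as the survey audience behind it.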
Ultimately, it’s not clear that the question is answerable with data at the present time. Still, a few things seem clear. Both languages are growing, and both can be used for data science. Python is more versatile and widely used, R more specialized and capable. And while the gap has been narrowing as Python has become more data science capable, there’s a long way to go before it matches the library strength of R – which continues to progress in the meantime.
How you assess the future path depends on how you answer a few questions. At RedMonk, we typically bet on the bigger community, but that’s not as easy here. Python’s total community is obviously much larger, but it seems probable that R’s community, which is more or less strictly focused on data science, is substantially larger than the subset of the Python community specifically focused on data. Which community do you bet on then? The easy answer is general purpose, but that undervalues the specialization of the R community on a discipline that is difficult to master.
While the original argument is certainly defensible, then, I find it ultimately unpersuasive. The evidence isn’t there, yet at least, to convince me that R is being replaced by Python on a volume basis. With key packages like ggplot2 being ported, however, it will be interesting to watch for any future shift.
In the meantime, the good news is that users do not need to concern themselves with this question. Both runtimes are viable as data science platforms for the foreseeable future, both are under active development and both bring unique strengths to the table. More to the point, language usage here does not need to be a zero-sum game. Users who wish to leverage both, in fact, may do so via the numerous R<==>Python bridges available. Wherever you come down on this issue, then, rest assured that you’re not going to make a bad choice.
Disclosure: I use R daily and Python approximately monthly.