RedMonk

What does “expressiveness” via LOC per commit measure in practice?

Yesterday’s post ranking the “expressiveness” of programming languages was quite popular. It got more than 30,000 readers in the first 24 hours; it’s at 31,302 as I write this. For this blog, that qualifies as a great audience. After a day’s worth of feedback, thought, and discussion on Twitter, Hacker News, and the post’s comments, I wanted to sum up some of my thoughts, others’ contributions, and things I left out of the initial post.

What are we really measuring here?

As I mentioned as a major caveat in the initial post, lines of code (LOC) per commit is an imperfect metric as a window into expressiveness. It’s measuring something, but what does it mean? My take on these results is that it’s a useful metric when painting with broad strokes, and the results seem to generally bear that out. It’s more helpful in comparing large-scale trends than arguing over whether Ruby should be #27 or #22, which is likely below the noise level. I think the reason some placements seem so weird is that it’s measuring expressiveness in practice rather than in theory. That brings in factors like:

  • The standard library and library ecosystem. Is there a weak standard library? Is there a small or nonexistent community of add-on library developers? In both cases, constructing a commit-worthy chunk of code could require additional lines.
  • The development culture and its norms. Is copy-and-pasting common for this language? Are imported libraries often committed to the project repository (JavaScript is a prime candidate here)? Are autogenerated files committed (e.g., minified JavaScript, autotools configure scripts)?
  • The developer population using it. Especially for third-tier languages, the number of developers is small enough that these results could reflect those developers more than the properties of the language itself. Some of the least-popular third-tier languages have fewer than 10 developers committing during a given month. I would generally disregard anything but the largest differences between third-tier languages, and treat even those with skepticism. Some languages are also more popular for beginning programmers, which could influence the results if the beginners make up a significant chunk of the language’s total userbase.
  • The time frame of its initial popularity. This can result in time-based influences upon the tools and methodologies in use. For example, newer languages popularized in the agile and GitHub eras may tend to bias toward smaller, more frequent commits. Languages that grew up alongside waterfall development and slower, centralized version control may be biased more toward larger, monolithic commits. It even carries as far as things like line length: today, wide-screen monitors are common, and many developers no longer restrict their column width to 80 characters or fewer. This could have a language-specific impact, where older languages with a great deal of inertia change more slowly to a new “standard” of development. For example, perhaps fixed-format Fortran wasn’t typically maintained in version control at all, and full files were just committed wholesale? That could explain its similarity to JavaScript.
  • Differences in project types by language. If a language is more likely to be used in larger, enterprise projects, this could influence the types of commits it receives. For example, it could get more small bugfixes than new features because it’s a long-lived codebase and requires additional stability. It could also see a different level of refactoring.
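To make the effect of these factors concrete, here is a minimal sketch of how LOC per commit could be tallied with and without vendored or generated files. It is not the method used for the ranking. It assumes the input is text produced by `git log --numstat --format='commit %H'`, that "LOC" means lines added plus lines deleted, and that the exclusion patterns below are reasonable; the function name and patterns are hypothetical.

```python
import re

# Hypothetical patterns for files that are usually vendored or generated
# rather than hand-written; adjust them for the repository at hand.
EXCLUDE = re.compile(r"(^|/)(vendor|node_modules)/|\.min\.js$|(^|/)configure$")


def loc_per_commit(numstat_text, exclude=EXCLUDE):
    """Return {commit_id: lines added + deleted}, skipping excluded paths."""
    totals = {}
    current = None
    for line in numstat_text.splitlines():
        if line.startswith("commit "):
            # Header line of the assumed form "commit <hash>".
            current = line.split()[1]
            totals[current] = 0
            continue
        parts = line.split("\t")
        if current is None or len(parts) != 3:
            continue  # blank line or anything that is not a numstat row
        added, deleted, path = parts
        if added == "-" or exclude.search(path):
            continue  # binary file, or a vendored/generated path
        totals[current] += int(added) + int(deleted)
    return totals


# Made-up sample in the assumed format: one commit imports a library,
# another touches a binary file and a minified bundle.
SAMPLE = (
    "commit aaa111\n"
    "\n"
    "3\t1\tsrc/app.js\n"
    "9000\t0\tvendor/jquery.js\n"
    "commit bbb222\n"
    "\n"
    "12\t4\tlib/util.py\n"
    "-\t-\tlogo.png\n"
    "1\t1\tdist/app.min.js\n"
)

filtered = loc_per_commit(SAMPLE)
unfiltered = loc_per_commit(SAMPLE, exclude=re.compile(r"(?!)"))  # never matches
```

In this made-up sample, the first commit counts as 4 lines with the filter and 9,004 without it, which is the kind of gap that committing an imported library can open up.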

So … what should you get out of the results, then?

Frankly, given all the possible variables involved, the biggest surprise here is that the results look as reasonable as they do, at the level of broad, multi-language or cross-tier trends. Here’s what I would tend to believe, and what I would be skeptical about.

  • Believe: multi-language trends
  • Believe: cross-tier trends
  • Believe: large differences between individual languages, but investigate why
  • Believe: highly-ranked languages
  • Be skeptical: anything involving third-tier languages
  • Be skeptical: small differences between individual languages
  • Be skeptical: individual languages that don’t fit into a group of similar ones
  • Be skeptical: low-ranked languages, until investigated

Why do I suggest believing high ranks but not low ones? It’s the Anna Karenina principle, as Tolstoy wrote:

Happy families are all alike; every unhappy family is unhappy in its own way.

While there are a large number of ways to have a high median or high IQR, it seems to me that low values of both would indicate a number of good development practices in addition to a good language.
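For readers who want to see the two statistics side by side, here is a small sketch using Python's standard `statistics` module. The commit sizes are invented for illustration, the function name is hypothetical, and the quartile convention is the module's default ("exclusive") rather than anything taken from the original analysis.

```python
import statistics


def median_and_iqr(loc_per_commit):
    """Return (median, interquartile range) for a list of commit sizes."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, q2, q3 = statistics.quantiles(loc_per_commit, n=4)
    return q2, q3 - q1


# Invented data: mostly small commits plus one very large one
# (for example, an imported library committed wholesale).
sizes = [2, 3, 4, 5, 6, 8, 400]

median, iqr = median_and_iqr(sizes)  # median 5.0, IQR 5.0
mean = statistics.mean(sizes)        # pulled far upward by the 400-line commit
```

A single huge commit barely moves the median or the IQR, so a language only ends up with high values of either when large commits are routine, and there are many different ways for that to happen.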

To wrap things up, I think this is measuring, with a fair amount of noise, a form of expressiveness in practice rather than in theory — a form that includes all the ways code is incorporated into a repository. That makes it an interesting window into a number of potential problems with how specific languages as well as language classes are typically used.

by-sa

Categories: adoption, data-science, employment.

  • Pingback: Programming languages ranked by expressiveness – Donnie Berkholz's Story of Data

  • http://twitter.com/akuhn Adrian Kuhn

    Actually, research has shown that using Git leads to larger commits. Why? A common practice is to commit many small changes to a local feature branch and then merge the stable feature as one large commit to the public branch. So while Git encourages small commits, it also encourages branching and merging, and thus the commit size on observable public branches increases.

    More thoughts after lunch break…

    • http://redmonk.com/dberkholz Donnie Berkholz

      It’s really a shame that using Git leads to larger commits because those people are totally screwing up git-bisect’s ability to be completely awesome for development.

      • http://borasky-research.net/about-data-journalism-developer-studio-pricing-survey/ M. Edward (Ed) Borasky

        Most of us probably will not live long enough to learn everything that’s available in git, let alone *use* it. ;-)

        Somebody remarked a couple years ago that git was actually a NoSQL database, and I’m really agreeing with that the more I learn.

  • http://borasky-research.net/about-data-journalism-developer-studio-pricing-survey/ M. Edward (Ed) Borasky

    I’d add “Be skeptical of languages less than ten years old.”

    • http://redmonk.com/dberkholz Donnie Berkholz

      Why, what does age tell you? I would see total volume of use (commit velocity * age) as more relevant.

      • http://borasky-research.net/about-data-journalism-developer-studio-pricing-survey/ M. Edward (Ed) Borasky

        It allows for evolution of the language. I’m struggling with the “expressiveness” of CoffeeScript. Syntactically and semantically I don’t find it much different from Python or Ruby.

        I’m sure there are metrics one can use to measure maturity of a language, though in the case of COBOL, FORTRAN and Lisp I’d say there should be a “senility” metric as well. ;-)

    • http://www.facebook.com/i3enhamin Ben Racine

      It’s a self-selecting set of programmers; those willing to try new things have normally improved their practices using old things.

      • http://borasky-research.net/about-data-journalism-developer-studio-pricing-survey/ M. Edward (Ed) Borasky

        I am always interested in new programming *paradigms* but I am willing to learn a new programming *language* only if it either embodies a new paradigm or I am being *paid* to use it.

        • http://www.facebook.com/i3enhamin Ben Racine

          ^^ Kind of depressing if taken out of context. I’m sure you’re still willing to learn in other arenas.

          • http://borasky-research.net/about-data-journalism-developer-studio-pricing-survey/ M. Edward (Ed) Borasky

            It’s diminishing returns in spades where programming languages are concerned. Time spent designing new languages is time that *can’t* be spent building what paying customers want done.

            Most problems in business and scientific computing fall into classes that lend themselves to well-established programming paradigms, embodied long ago by FORTRAN, COBOL, Algol, Lisp, APL, SQL, FORTH and regular expressions.

            As programmers, we’re constrained by unsolvable and NP-complete problems in automata theory, the CAP theorem and the software engineering realities as noted by Brooks and quantified by Putnam (http://en.wikipedia.org/wiki/Putnam_model). Having to learn a new language when an old one will do the job is a waste if it doesn’t lead to an increase in revenue or a reduction in costs.

          • http://redmonk.com/dberkholz Donnie Berkholz

            The biggest problem with this approach is developers who don’t take a sufficiently long-term view. If you don’t bother learning the current standards in terms of languages, frameworks, methodologies, etc., you’ll have an increasingly hard time finding work, or at least interesting work. You may also find yourself inadvertently excluded from some tech communities because you don’t have enough in common to have a good discussion. You’ll be losing productivity in many cases because you’re writing in a language where it takes you 10x as long to get the same amount done.

  • Aaron Bohannon

    I’m not inclined to believe that any of the factors you listed would have a significant impact on the data…except for the last one, project type, which I imagine could have a huge impact, unfortunately.

    • http://redmonk.com/dberkholz Donnie Berkholz

      The development culture and tools will be pretty major, I think. Importing of e.g. jQuery and other dependencies into a repo will have a significant impact on these metrics. The difference between JavaScript and other high-level languages, even its cousin ActionScript, reflects this. IDEs that generate boilerplate code, do major refactoring automatically, and so on will also likely show a serious effect.

  • Pingback: Are we getting better at designing programming languages? – Donnie Berkholz's Story of Data