AI: The Difference Between Open and Open Source


There is a pattern in AI that has become clear. Every few weeks – or more accurately at this point, days – there is a release of a new AI model. The majority of these models are not released under an actual open source license, but instead under licensing that imposes varying restrictions based on intended use cases, revenue limits, user counts or other limiting factors. In most cases the project authors explicitly and counterfactually describe the released artifacts as open source, and a credulous press accepts these statements at face value with no pushback. End users, for their part, having been assured by the authors and by the press covering the releases that the project is open source, assume that it is. Up to this point, after all, projects that were described as open source generally were open source, and therefore carried no arbitrary restrictions, so users did not have to pay much attention.

Each of these incidents acts to dilute the term open source, and thus weaken it.

Some would excuse, if not actively condone, this behavior because when it comes to the question of what open source AI is, the answer is that we don’t know yet. It is not clear, at present, precisely what the term open source means in the context of AI. There is no industry consensus, and the primary, underfunded defender of the term, the Open Source Initiative (OSI), is still working on a definition.

Those who defend describing assets as open source when they objectively are not are implicitly asserting that the blame should go not to the bad-actor authors, but rather to the OSI. If only the OSI’s definition had been available, the reasoning goes, the parties deliberately and willfully misusing the term open source would be more respectful.

This position ignores some obvious challenges. Most obviously, defining open source with respect to AI is an enormous industry challenge. It is not clear, for example, that copyright – the fundamental intellectual property mechanism open source licensing is based on – can be applied to embeddings and other abstruse, numerical portions of released projects. And while the open source definition was designed in an era when the source code was all that mattered, source code is but one small piece of an AI model. What, then, should a definition in an AI era require of project authors to ensure the same rights to an end user? How encompassing should it be? And what are the downstream implications of that? A project trained on massive datasets stretching across the internet, as but one example, is clearly not going to be able to convey that data as part of its release.

But it’s not just that defining open source is difficult. Those who would blame the OSI for the repeated misuse of the term open source with respect to AI models are ignoring a simple truth: that while we can’t yet say what open source is, precisely, with respect to AI, it’s easy to tell what it is not.

It is true that we do not yet understand what the scope of an open source AI license might be, and whether it touches on training data or whether weights, parameters and embeddings are sufficient. We can say with confidence, however, that licenses that impose artificial use restrictions based on the user counts and revenue limits mentioned above will not qualify under any such definition.

It is possible, therefore, to be respectful of the term open source and its specific meaning even in the absence of a definition that applies to models. And it’s possible to do so in a manner in which full credit is still received for making assets open rather than keeping them private and proprietary. We know this is possible because this is precisely what Google has done with Gemma.

Released last week, Gemma is a pair of small but high-performing models from Google intended to compete with the likes of Meta’s LLaMa. Like LLaMa, Gemma is an open model. Unlike Meta, however, which falsely claimed that LLaMa was open source, Google was careful to state that while Gemma is open, it is not open source.

Their reasoning is as follows:

“We’re precise about the language we’re using to describe Gemma models because we’re proud to enable responsible AI access and innovation, and we’re equally proud supporters of open source. The definition of ‘Open Source’ has been invaluable to computing and innovation because of requirements for redistribution and derived works, and against discrimination. These requirements enable cross-industry collaboration, individual innovation and entrepreneurship, and shared research to happen with exponential effects.

“However, existing open-source concepts can’t always be directly applied to AI systems, which raises questions on how to use open-source licenses with AI. It’s important that we carry forward open principles that have made the sea-change we’re experiencing with AI possible while clarifying the concept of open-source AI and addressing concepts like derived work and author attribution.”

The gist, in other words, is that while we don’t yet know what open source AI is, we do know what it isn’t.

This articulation and branding is important – vitally so – for the long term health of the term open source, and thereby, the industry. But note that it comes at no cost to Google. There is no ambiguity or uncertainty about whether the model is open and available: it has been described and received as such. “Open model” conveys precisely what it needs to, and makes no promises it cannot fulfill. Unfortunately, the press has not yet internalized the difference between open and open source that Google so clearly articulated, and has taken it upon itself to apply to Gemma the very term, open source, that Google so assiduously declined to use.

Unfortunate as that may be, Google should be commended for its behavior here, for doing the right thing by open source and for providing a clear path that, with luck, others may follow.

Open is good. The industry succeeds and is driven forward when groundbreaking new models are released and made available. But for the health of open source and the industry as a whole, it’s important to choose our words carefully and to understand that while open is good, open is not open source.

Disclosure: Google is a RedMonk customer. Meta and the OSI are not currently RedMonk customers.
