tecosystems

The Narrows Bridge: From Open Source to AI


In November of 1938, construction began on the Tacoma Narrows bridge in Washington state. Twenty months later, in July of 1940, it opened to traffic. Connecting Tacoma with the Kitsap Peninsula west of Seattle, it was at the time the third longest suspension bridge in the world. From the first day of construction it was buffeted by high winds, winds that introduced substantial vertical movement into a structure engineered to avoid exactly that. Multiple efforts were made to mitigate these forces and keep them in check. It was an inauspicious beginning for an expensive and complex endeavor.


Some sixty years after construction on that bridge began, a fraught and contentious debate in a then obscure corner of the technology industry resulted in both the term open source and the ten point definition that encapsulates it. While it was accorded little importance at the time, with the benefit of hindsight, this discussion amongst passionate but largely unrecognized technology advocates was of monumental historical importance. Over the nearly three decades since its inception, open source has grown from a seemingly utopian academic curiosity to an industry default relied upon by the largest of capital markets.

For all of its success, however, open source has been besieged in recent years on multiple fronts. Most notably, a group of vendors and investors seeking to commercialize open source have attempted to blur that definition to the point of meaninglessness and irrelevance – a conflict that continues today. More recently, however, open source has come under intense pressure as reasonable and not so reasonable actors alike try to understand how the term applies to Artificial Intelligence (AI) systems and projects.

The term open source, of course, was originally coined to describe – and its corresponding definition was designed to apply to – source code. AI projects, however, are vastly broader in their scope. Source code is a component of AI projects, to be sure, but one among many, and in most cases not the most important. The varied other components, from data to parameters and weights, are functionally and legally quite distinct from source code. It is not clear, as but one example, whether copyright – the standard legal mechanism underpinning open source licenses – can be applied to embeddings, which are very long strings of numbers representing multidimensional transformations of various types of data.

Source code, in other words, is a precisely and narrowly bounded subject area. AI projects are not. Their scope blends software, data, techniques, biases and more. AI is inarguably a fundamentally different asset than software alone.

And yet, largely because vendors have been cavalierly throwing around the term open source to describe obviously non-open source projects that contain use restrictions that open source explicitly forbids, the Open Source Initiative (OSI) – stewards and defenders of the Open Source Definition (OSD) since 1998 – has been compelled to respond. The most egregious of the offenders in terms of misuse has been Meta, who has repeatedly described its Llama model as open source while omitting the fact that it imposes restrictions on usage – usage by competitors and so on – restrictions that open source does not permit. Their behavior stands in stark contrast with their counterparts at Google, who have to their credit attempted to hold the line by being explicit that their Gemma model is open but that it does not meet the OSI’s definition of open source and therefore should not be considered such. Generally speaking, however, Meta’s careless approach is far more common than Google’s, which has resulted in pressure building on the OSI to reconsider its definition of what is and what is not open source within the new, arcane and rapidly evolving world of AI.

The entire debate about an open source AI definition (OSAID), then, has been driven by misuse and misrepresentation. It was at the same time implicitly predicated on a single core assumption: that open source and AI are, or can be made to be, compatible. Simply stated, the current process assumes that it is possible to achieve a definition of open source that is both consistent with long held open source ideals and community norms while being equally applicable and relevant to fast emerging AI projects and their various interested parties.

After months of observation and consideration of nascent AI projects, vendor efforts in the space and conversations with experts in the field as well as interested third party organizations, I no longer share that core assumption.

I do not believe the term open source can or should be extended into the AI world.

There are several problems with the application of open source to AI. These are arguably the most pressing.

One of These Things is Not Like The Other

As discussed above, software and AI are the proverbial apples and oranges – or perhaps more accurately, apples versus an apple pie. This by itself is, or should be, cause for concern. At its heart, the current deliberation around an open source definition for AI is an attempt to drag a term defined over two decades ago to describe a narrowly defined asset into the present to instead cover a brand new, far more complicated set of artifacts – the risks of which are substantial. First, trying to bend the original open source definition and its principles to apply to AI has the potential to fall well short of fully circumscribing the new project assets in all of their complexity, which is bad. Worse, however, is the prospect of perceived shortcomings in the OSAID bleeding into the trust of and faith in the tried and true original OSD. The implications of that are far reaching and highly concerning.

To properly address the greater complexities of AI projects, the new OSAID would need to grapple with far more complicated and nuanced issues than are involved with mere source code, and in so doing it almost certainly would have to resort to compromise. Which the release candidates to date, in fact, have.

AI and source code are simply too different to be neatly managed side by side. The complexity of AI demands complexity of licensing, which brings us to the problem of nuance.

If You’re Explaining, You’re Losing

An open source AI definition will inevitably have key areas of contention, principally around data sharing and availability. Idealists seeking to preserve and protect the bedrock principles of open source, for example, argue that any definition that does not require the release of training data compromises the four key freedoms that the original open source definition satisfies. The OSI, for its part, contends that in discussions with various AI researchers, their consensus opinion is that the weights are more important than the original training data. That position may or may not be correct. What is definitely true is that even if that assertion is correct, it is a nuanced position that is counterintuitive and requires lengthy explanation.

Similarly, data brings with it a host of legal complications – complications that are without precedent in the world of pure source code. Source code, for example, is inarguably less conflicted from a legal standpoint than, as but one example, medical data used to train AI models intended to assist in the early detection of cancer. The OSI’s approach to this – similar to what the Linux Foundation is trying to do with the Open Data Product Specification – is to categorize the various types of data. In the OSI’s case, there are four buckets of data: open, public, obtainable and unshareable. These are superficially self-descriptive, but also subtle, slippery and legalistic distinctions. Which means, ultimately, that they require nuance to be understood.

In simpler terms, both with respect to the availability of data and the types of data being made available, or not, the OSI is trying to thread the needle between open source’s legacy of making everything available and the messy reality that is data availability. The idealists want training data required for obvious reasons. The pragmatic path, unfortunately, involves substantial compromise and, more problematically, requires explanation to be understood. And as the old political adage advises: “If you’re explaining, you’re losing.”

This is particularly true in this case, because in contrast to the OSI’s complicated position, critics have a simple and easy case to make – one made for black and white headlines and stories on Hacker News: if you’re not fully satisfying the four freedoms, you’re not open source. The reality might be that, at least in the case of large foundational models, even if all of the training data were made available, the number of entities that could leverage it and build their own replacements could be counted on one hand – two or three at the most. But reality does not determine perception.

To illustrate this, consider these two potential headlines:

  • “The OSI’s Open Source Definition (OSD) mandates the release of all source code. Their Open Source AI Definition (OSAID), on the other hand, does not require the release of all the training data.”

Versus

  • “The OSI says that they want all training data released, but that requiring it would be problematic because of legal complexity and difficult to actually leverage given the dataset size. To clarify, they’ve created four different categories of data which they go into in detail here…”

Optically, then, the pragmatic path is a minefield. One of the things that most RedMonk clients have heard at some point is: “the market generally has no ability to appreciate nuance.” In a world in which professional technology industry reporters are unable to distinguish between genuine open source – is it on this list of approved licenses or not? – and objectively non-open source, as in the cases of licenses which are open except when they are not, there is essentially no chance that licenses which depend on levels and shifting definitions will be correctly interpreted.

The need to rely on nuance, then, to explain the OSAID seems inherently problematic.

An Open Source AI Definition is…Mandatory?

As described above, implicit in this entire multi-year process is the assumption that an open source definition capable of satisfying the required consensus of parties is possible. There is, however, a second assumption underlying that one: that AI requires a revised and updated open source definition at all. The most straightforward articulation of this idea arrives courtesy of Mark Collier, who said:

This brings me to the Open Source AI Definition (OSAID), an effort organized by OSI over the past two years, which I have participated in alongside others from both traditional open source backgrounds and AI experts. It is often said that naming things is the hardest problem in software engineering, but in this case we started with a name, “Open Source AI,” and set out to define it.

This position is certainly understandable. Rogue actors such as Meta have been abusing the term open source, when they are well aware that the arbitrary use restrictions (competitors can’t use it, you can’t use it to do certain things, etc) attached to the license make it clearly and unambiguously not open source. Meta’s actions are bad enough, but arguably worse are the columnists who have enabled Meta’s behavior by victim blaming. In the absence of an OSI AI definition – which as argued above may not actually be achievable – these writers bafflingly hold the OSI responsible for Meta’s behavior.

In the face of this continuous assault on the OSD, then, the obvious response is for the OSI to respond to this willful misuse by way of an updated definition with more clarity and industry buy in.

Or is it?

Given that Meta paid no attention whatsoever to the original definition, which had been industry consensus for years, it’s not clear that an updated definition – again assuming, potentially counterfactually, that one is achievable – would change their behavior. That assumption seems questionable, particularly given the downsides discussed above if the effort is unsuccessful.

The default path as described above was an updated open source definition, because it was implicitly assumed that that was the only option on the table. But what if it wasn’t?

The Road Not Traveled

What if, instead of trying to bend and reshape a decades old definition intended to describe one asset to encompass another, wildly different asset, we instead abandoned that approach entirely?

On the one hand, for parties that have been fighting for the better part of two years to thread the needle between idealism and capitalism to arrive at an ideologically sound and yet commercially acceptable definition of open source AI, the idea of abandoning the effort will presumably seem horrifying and a non-starter.

This is not the first time that outside parties have sought to reshape or redefine open source, however.

For several years, there has been a desire on the part of some to bring open source into greater alignment with commercial interests – even at the expense of core open source principles. The response from the wider open source community, however, was rejection. The belief was then and is now that trying to shoehorn specific commercial protections into the term open source would fatally compromise it. Instead, just as with prior efforts to bend open source to other new and ultimately incompatible goals such as ethical source, those who sought to change open source were told to find a new home, a new term and a new definition for what they were building. The end result of which was “Fair Source,” a new, from scratch term that borrowed some ideals from open source but is entirely its own new and unique brand.

What if AI followed that path? The industry’s massive cumulative efforts to date to define what and how open source principles might apply to AI need not be wasted. They could instead be repurposed behind a new, clean slate term of choice – one that accurately conveys the portions of a model that are open, while not falsely advertising its features by applying the open source brand to non-open source assets.

Naming is, as Collier mentions above, the hardest problem in software engineering, and so coming up with a new, alternative term for open source AI would not be a simple exercise. It seems likely, however, that it would be both simpler and more achievable than coming up with a definition of open source AI that might minimally satisfy both the idealists and the pragmatists.

The only way that this would work, of course, would be if sufficient momentum could be assembled behind it. This would require the support of multiple, conflicting parties. Here are a few arguments in favor of a new term for each:

Idealists:

  • Assuming some industry consensus could be achieved around a new brand – greater consensus, at least, than is behind the current OSAID release candidate – pressure on the term open source would immediately begin to decline. One “defense” of Meta’s behavior at present is that there is no accepted definition for open projects. A new definition, even without the open source branding, eliminates that argument. Further, and perhaps more importantly for idealists, if open source and the four freedoms are no longer explicitly invoked by the license, there’s a lessened need to be so strict about what’s in and what’s out. Idealists instead could center around protecting the original, source code derived definition of open source while attempting to clearly differentiate it from the new, AI-centric term of choice.

Pragmatists:

  • Those who have been most willing to compromise in this process to accommodate the vagaries of, as but one example, data licensing would no longer be fighting against the reputation and legacy of the OSD and the four freedoms. It would instead be an opportunity to start fresh, informed by open source but not beholden to it. Pragmatists would have more room to maneuver in their efforts to find a license that balances openness with a desire to maximize uptake of the license, avoiding the worst case scenario of a strict definition that few or no models satisfy.

Vendors:

  • Assuming again that some level of industry consensus could be achieved, they could receive a similar if not exactly equivalent level of marketing benefit from the new definition, without the corresponding costs of constant criticism from open source communities for their willful misuse of that term. In a world in which the various large, AI players are able to agree on a) both a clear and understandable model for achieving a certain level of openness as determined by the OSI and b) a willingness to put their marketing weight behind the new brand, marketing becomes at once a simpler and less fraught exercise.

The OSI:

  • The OSI, its members and various participants have been tearing each other apart for well over a year trying to achieve an outcome that is, in all likelihood, not achievable. While redirecting current efforts into the creation of a new AI-centric brand might seem like surrender, it would more accurately be described as being flexible and adaptive rather than rigid and hidebound. Given a landscape in which the forces arrayed against the new definition seem to be stronger than those supporting it, preemptively eliminating an entire line of attack while creating the space to introduce a new brand for a new technology area – one that would protect the OSI’s existing brand from fallout – is nothing more or less than the most logical course of action moving forward.

As mentioned above, the Tacoma Narrows bridge opened to the public in July of 1940. A mere four months later, systemic forces – 40+ MPH winds in this case – stressed the structure to the point that its concrete and metal surface began to ripple and twist like a ribbon. A little over an hour later, the bridge collapsed entirely. Its spectacular destruction, and the engineering failure to account for the forces that destroyed it, have made it an object lesson for engineers to this day.

If the OSI chooses to stay the course with the OSAID, I will personally do everything in my power to help it succeed. But I fear that, as with the Narrows bridge, the fault lines are already on display, and it will not survive the high winds that are sure to be in its future.

Better to choose a new bespoke name and brand, one specifically tailored to suit the unique and dynamic challenges of the new technology. As for what that name might be, I’m relatively indifferent. Public AI was one option floated on a recent call, and that has promise; some flavor of open model / open weights might also work – though as one participant in the current process pointed out, the potential for overlap and confusion might make those less than ideal. This industry is generally bad at naming, and this exercise is likely to be no exception. The name, however, is less important than consensus. As Abraham Lincoln said, “Public sentiment is everything. With public sentiment, nothing can fail.”

However they proceed, I wish the OSI luck in their thankless task and hope they find a solid bridge with which to bring the spirit of open source forward into the world of AI.