InstructLab: What if Contributing to Models Was Easy?


One of the unfortunate but unavoidable truths about tech conferences in 2024 is that they are, of necessity, AI conferences. While we can and should debate the opportunity costs of redirecting so much time and attention from events to AI-related subjects, the reality is that the sheer relevance, interest and capabilities of these technologies – not to mention their still accelerating rates of innovation – make them inevitable headline acts. The question facing events, then, is not whether to feature AI prominently, but whether they’re talking about genuinely novel and interesting problems within AI or whether they’re talking about AI for the sake of talking about AI.

The good news for Red Hat, following its annual Summit last week, is that it was the former. The company and its parent IBM announced two open source projects: the Granite family of models and InstructLab, a project intended to lower the barriers to model contributions.

The Granite models are easy to understand and contextualize. We’re in the middle of a model boom – of questionable sustainability, but that’s a subject for another day. Suffice it to say that there is no shortage of models and no real impediments to understanding them.

InstructLab, on the other hand, requires some introduction and explanation. To understand InstructLab, it’s helpful to remember what open source was like in the early days. Source code was exploding in availability, and developers inhaled it, tinkered with it and compiled it on whatever hardware they had to hand. A subset of those developers had the ability, incentives and willingness to share their improvements back to the upstream project, and the software moved forward on the backs of these collective inputs.

Today, open source moves forward in much this same fashion – though not without its existential challenges. AI, for its part, has embraced this same ethos of community-driven development within platforms like Hugging Face. But the barriers to actually contributing back to those models open enough to accept contributions are, by Red Hat’s estimation at least, too high – in terms of the required hardware, technical ability and available training data. InstructLab, therefore, is explicitly intended to make it possible for experts in non-technical fields to contribute back to existing models.

The simplest way to contribute to a model – the model equivalent of a pull request – is a “skill,” which requires two things: a YAML file and a second text file detailing attribution of the content – who created it, where it came from and so on. YAML doesn’t have many fans, but from the perspective of contributors it’s really just a text file with structured formatting. InstructLab then uses a limited number of these skills to generate a larger corpus of related synthetic data, which is used to update the model.
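To make that concrete, here is a rough sketch of what such a skill file might look like. This is illustrative only – the field names below follow the general shape of InstructLab’s public taxonomy conventions, but the exact schema may differ by project version, and the task, examples and contributor handle are invented for the example:

```yaml
# qna.yaml — illustrative sketch of an InstructLab skill contribution.
# Field names approximate the project's taxonomy conventions; consult the
# project documentation for the authoritative schema.
version: 2
created_by: jane-doe    # contributor's handle (hypothetical)
task_description: Convert dates between written and numeric formats
seed_examples:
  - question: Write "March 5, 2024" as an ISO 8601 date.
    answer: "2024-03-05"
  - question: Write "2023-11-20" as a written date.
    answer: November 20, 2023
```

The companion attribution text file would then record where the content came from – its source, creator and license – giving downstream consumers the provenance trail discussed below. The key design point is that nothing here requires a data scientist: a domain expert can express a skill as a handful of question-and-answer pairs in plain text.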

The obvious question for those following this space is whether it’s intended as a replacement for traditional Retrieval-Augmented Generation (RAG) model updates, and the short answer is no. It’s best characterized as a complement, one that opens a pipeline for other forms of model input unsuitable for RAG processes.

While there are many more questions to be asked about InstructLab, from how it works to how it compares to RAG to its overall efficiency and efficacy, perhaps the most interesting question is: what if it does work?

If we assume that InstructLab accomplishes exactly what it set out to, which is to say dramatically lowering the barrier to model contributions and updates, what would that mean? On the one hand, decreased friction resulting in an explosion of contributions would suggest models that dramatically accelerate their abilities, coverage and breadth.

On the other hand, however, the question is what governance and provenance would and will be required to manage new influxes of content contributions. In the source code world, the industry has decades of experience in understanding the implications of intellectual property and the scaffolding needed to manage it properly; see, for example, the OpenTofu project’s response to allegations of copyright infringement. Code is also a relatively narrowly circumscribed domain with clear boundaries.

Relevant model content, however, may take a myriad of forms. It’s also not clear that the industry currently has the intellectual property governance mechanisms – both in terms of the licenses and the processes to manage them – in place to handle a world in which anyone, not just the data scientists, is a potential input for models. Developers and other individual practitioners who are typically less than obsessed with licensing and other compliance concerns are likely to appreciate the lowered barriers to entry. It remains to be seen, however, whether their employers will feel the same way.

It is clear that Red Hat, the long-term standard bearer for open source, is trying to bring more of the community contribution and enthusiasm it knows well to the world of AI. What is less clear is whether enterprise AI customers are ready for it.

But the same was true of open source software once upon a time. It took years, but enterprises evolved the ability to consume open source at scale. It may take a similar period of learning and acclimation before enterprises are ready to embrace democratized model updates, but the potential benefits are obvious if they do.

Disclosure: IBM and Red Hat are RedMonk customers.