The Cloud Native Computing Foundation (CNCF) has just announced the general availability of its community compute cluster. This will allow many open source projects to access a cluster of 1,000 compute and storage nodes donated by Intel and hosted by Supernap.
Close observers of the CNCF will have noticed the ongoing discussion around the cluster over the past few months. Resources such as this compute cluster are immensely valuable, and we often see organisations endowed with such resources jealously guarding them with a view to cementing what they see as a specific strategic advantage.
We had the opportunity to speak with Dan Kohn, Executive Director of the CNCF, at DockerCon in Seattle earlier this year. During our conversation Dan made it very clear that he does not want to see the jealous guardian approach to resources, both in the physical sense and in terms of mindshare, occurring within the CNCF.
Rather, he wants the CNCF to be a relatively broad church, ultimately helping drive the adoption and trajectory of cloud native approaches in enterprise software. In an era when enterprises are desperately trying to understand what digital transformation means beyond four-quadrant presentations from consultancy firms, this kind of leadership is very, very welcome.
There are two significant points arising from the cluster announcement itself. Firstly, the fact that the CNCF will open the cluster up to projects beyond just those that fall directly under the CNCF umbrella is extremely welcome. For example, within the current queue there are requests for access for Apache ZooKeeper and another where OpenStack will be used as part of the underlying infrastructure for testing Mesos, Docker and Kubernetes functionality.
The second significant point is around the cross pollination of ideas that will come out of having such a useful central resource. Users of the CNCF cluster have to commit to publicly publishing and sharing their findings and results. The various communities using the CNCF cluster will invariably have learnings to share as they start to use an infrastructure that can be succinctly and consistently described. The discussions will gradually become about the software, not the underlying hardware infrastructure.
It is this second aspect that we find truly exciting. Encouraging collaboration between communities and creating a common basis for discourse can only be a good thing.
The Difficulties of Testing at Scale
On the GitHub page for the CNCF cluster there is an explicit note about the rare value the cluster represents – size and bare metal. This is key to understanding the value of opening the cluster up to the wider cloud native community.
Testing and benchmarking software at scale is both tricky and rewarding. It is resource intensive, time consuming and, particularly when a realistic workload is put under sustained stress, leads to the discovery of some of the most complex bugs and subtle interplays you can imagine. Seemingly minor code changes can have ripple effects through a distributed system that may not show up until you are running at almost full utilization on three or four hundred nodes. Reducing the number of possible variables is always helpful, and the ability to test on bare metal removes a significant number of factors that are normally outside of a project's control.
Currently, we have far too many benchmarks in the cloud native space that consist of use cases such as spinning up thousands of containers running essentially no applications and referring to that as scale. There is no enterprise on the planet that has gone through the pain of scaling a real world application, with a real world workload, that takes metrics such as these seriously.
As we start to see results emerging from the ongoing usage of the cluster we hope to see some realistic workloads and metrics being shared.
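To illustrate the distinction we are drawing, here is a minimal sketch of what a more meaningful measurement looks like: driving a sustained stream of requests and reporting latency percentiles, rather than simply counting how many idle containers can be started. The `handle_request` function below is a hypothetical stand-in for a real service endpoint; a genuine benchmark would exercise the actual system under test, over the network, for far longer periods.

```python
import time
import random
import statistics

def handle_request(payload: bytes) -> bytes:
    # Hypothetical stand-in for a real service under test; simulates
    # variable per-request work so the sketch is self-contained.
    time.sleep(random.uniform(0.0005, 0.002))
    return payload[::-1]

def sustained_benchmark(duration_s: float = 1.0) -> dict:
    """Drive a steady stream of requests and record per-request latency.

    Percentiles measured over a sustained run expose tail behaviour
    that a simple 'containers started per second' count never would.
    """
    latencies = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        start = time.monotonic()
        handle_request(b"example-payload")
        latencies.append(time.monotonic() - start)
    latencies.sort()
    return {
        "requests": len(latencies),
        "p50_ms": statistics.median(latencies) * 1000,
        "p99_ms": latencies[int(len(latencies) * 0.99)] * 1000,
    }

if __name__ == "__main__":
    print(sustained_benchmark())
```

The point of the sketch is the shape of the metric, not the numbers: sustained duration, a real (or realistic) workload, and tail latencies are what make a benchmark relevant to enterprises.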
It is very early days for the CNCF cluster, but we are positive about the direction that has been set and the communications around it.
The two most obvious concerns we see are around the criteria for access to the cluster, including the associated approval process for granting access, and poor utilization of resources (we have already noted requests for blocking small numbers of machines for periods of several months, which would be a very poor use of resources and run contrary to one of the core benefits of the cluster – that of scale).
In the first case, the criteria are pretty clear, and we expect to see some minor refinements. The key to success will be in the consistent and transparent application of the criteria. Measuring utilization of resources will be a somewhat trickier proposition, but the community will, in our opinion, decide upon some reasonable metrics.
Disclaimers: Docker, IBM, RedHat, Apprenda, Cloudsoft, Treasure Data, CoreOS, Huawei, Exoscale, Univa, Cisco and Mirantis are all members of the CNCF and current RedMonk clients. We have had briefings, updates and discussions with the vast majority of CNCF members in the last six months.
- Earlier in my career I worked on performance benchmarking for several years. Big numbers are easy (many benchmarks are gamed or just plain irrelevant); big numbers with real workloads, run over prolonged periods of time, that are relevant to customers and enterprises are useful, hard, and uncover significant issues.