DeepSquare builds a sustainable supercomputing ecosystem using k0s
High-Performance Computing (HPC) service provider DeepSquare needed a scalable infrastructure platform to enable a decentralized, sustainable HPC ecosystem. Using the lightweight Kubernetes distribution k0s, DeepSquare built a fully declarative, modern HPC cluster manager on hybrid clouds that delivers HPC-as-a-Service for AI training, 3D game rendering, and other GPU-intensive end-user workloads.
Decentralized, sustainable HPC
Based in Switzerland, DeepSquare aims to support innovation in Europe and beyond and to offer a true alternative to the hyperscalers by providing world-class, decentralized, sustainable, and managed High-Performance Computing as an Ecosystem. The platform distributes idle bare metal resources from a virtuous community of supercomputing facilities, so users can spin up production-grade HPC clusters on demand. It uses blockchain technology to ensure fair access for all users and to accurately measure allocated compute power so that infrastructure providers can be rewarded.
DeepSquare wanted to streamline operations by automating HPC cluster deployment and operations. Initially, they tried a hyper-converged infrastructure platform running on bare metal. VMs hosted the HPC tooling, including cluster and workload managers. However, the platform frequently overcommitted resources, causing clusters to go down.
The infrastructure team wanted a lighter and more scalable platform that could extend to public clouds during high demand. They decided to switch to Kubernetes and began testing lightweight deployment solutions, including the minimal distributions k0s and k3s, and the kubeadm bootstrapping tool.
Simple, flexible, and easy to deploy
“k0s was the most appealing Kubernetes distribution for us, because it didn't include all the things we didn't want to add. It had only the real basic components, and we could shape it to our needs,” said Christophe Lillo, Head of Infrastructure at DeepSquare. k0s is a certified Kubernetes distribution that is 100% open source and integrates easily with tooling from the cloud native ecosystem.
DeepSquare also chose k0s because it provides the easiest way to deploy clusters, thanks to k0sctl, an open source command-line bootstrapping and management tool, which can provision a cluster with just one command and one configuration file.
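To make the "one command and one configuration file" idea concrete, here is a minimal Python sketch that generates a k0sctl-style cluster configuration. The host addresses, cluster name, SSH user, and key path are placeholders, and the field names follow the k0sctl configuration format as commonly documented; check them against the k0sctl version in use. This is not DeepSquare's actual configuration.

```python
import json


def build_k0sctl_config(name, controllers, workers, user="root", key_path="~/.ssh/id_rsa"):
    """Return a dict shaped like a k0sctl cluster configuration.

    All argument values are illustrative placeholders.
    """

    def host(address, role):
        # One entry per machine: its role in the cluster and how to reach it over SSH.
        return {
            "role": role,
            "ssh": {"address": address, "user": user, "port": 22, "keyPath": key_path},
        }

    return {
        "apiVersion": "k0sctl.k0sproject.io/v1beta1",
        "kind": "Cluster",
        "metadata": {"name": name},
        "spec": {
            "hosts": [host(a, "controller") for a in controllers]
            + [host(a, "worker") for a in workers],
        },
    }


# Hypothetical three-node cluster: one controller, two workers.
config = build_k0sctl_config("hpc-demo", ["10.0.0.1"], ["10.0.0.2", "10.0.0.3"])

# JSON is a subset of YAML, so this output can be saved as k0sctl.yaml.
print(json.dumps(config, indent=2))
```

With the output saved to a file, the single provisioning command would be `k0sctl apply --config k0sctl.yaml`.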
The HPC tooling DeepSquare originally used was difficult to containerize, so they created a new HPC stack with more modern, cloud native tooling. DeepSquare integrated k0s into ClusterFactory, an open source infrastructure orchestrator that combines best-in-class HPC, cloud, and DevOps tooling to manage Kubernetes clusters in a declarative, GitOps way.
“The idea behind ClusterFactory is to automate the process of deploying the full HPC stack, using Kubernetes. Our goal was to modernize how we could orchestrate all those services,” Lillo explained. In the HPC industry, there is a great opportunity for modern GitOps automation, as many systems engineers still use Perl scripts to deploy infrastructure.
The ClusterFactory control plane runs on k0s clusters and includes a provisioning stack (Grendel, Slurm, CernVM-FS), network stack (Traefik, Multus, MetalLB), and GitOps stack (ArgoCD, cert-manager, sealed-secrets). After deploying k0s to run ClusterFactory at two sites in Switzerland, DeepSquare extended one of its clusters onto Exoscale and OVH public clouds, providing fallbacks to ensure production uptime.
Making HPC more accessible and sustainable
With the AI sector experiencing rapid growth, fueled by advancements in machine learning, language models, and AI training and inference, there's a pressing need to make High-Performance Computing more accessible and sustainable. DeepSquare addresses this challenge by simplifying access to HPC resources. They developed a streamlined workflow system that allows users to easily submit their AI and computational tasks to the DeepSquare grid.
A DeepSquare workflow file is essentially a blueprint of a project: it outlines the resources an application needs and the step-by-step process it follows. The platform relies on this file to execute the task.
A DeepSquare workflow
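The blueprint idea can be illustrated with a small Python sketch: a structure that declares resources first and then an ordered list of steps. The class and field names here (`Resources`, `Step`, `Workflow`, `mem_per_cpu_mb`, and so on) are hypothetical and chosen for illustration; they are not the actual DeepSquare workflow schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resources:
    # Hypothetical resource request; names are illustrative only.
    tasks: int = 1
    cpus_per_task: int = 1
    mem_per_cpu_mb: int = 1024
    gpus: int = 0


@dataclass
class Step:
    name: str
    command: str


@dataclass
class Workflow:
    resources: Resources
    steps: List[Step] = field(default_factory=list)

    def validate(self) -> None:
        """Reject workflows that request no compute or define no steps."""
        if self.resources.tasks < 1 or self.resources.cpus_per_task < 1:
            raise ValueError("a workflow needs at least one task and one CPU per task")
        if not self.steps:
            raise ValueError("a workflow needs at least one step")

    def total_cpus(self) -> int:
        """Total CPUs requested across all tasks."""
        return self.resources.tasks * self.resources.cpus_per_task


workflow = Workflow(
    resources=Resources(tasks=2, cpus_per_task=4, mem_per_cpu_mb=2048, gpus=1),
    steps=[Step("prepare", "echo preparing data"), Step("train", "python train.py")],
)
workflow.validate()
print(workflow.total_cpus(), [s.name for s in workflow.steps])
```

Keeping resources and steps in one declarative object means a scheduler can size the allocation before any step runs.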
Users can manage and monitor their tasks using the DeepSquare Command Line Interface (CLI) and Text-based User Interface (TUI), as shown below. These tools let users control their spending through token credits and manage their jobs, including options to cancel a job or add resources to it as needed.
DeepSquare TUI used to deploy and manage jobs on a decentralized HPC
Below is a high-level architecture diagram of the DeepSquare Platform.
A user requests new compute resources (1) by submitting a job to the blockchain through the SDK (3) together with its job definition (workflow file) (2).
The method RequestJob is called on the DeepSquare Meta-scheduler smart contracts (4).
The Meta-scheduler receives a RequestNewJob event and uses its meta-scheduling algorithm to select the best Grid Partners for the job (6).
The Supervisor running on the selected Grid Partner's clusters calls the ClaimJob method to inform the network that the job has been added to the local queue (7), and gets the job definition (workflow file) (8).
The job is finally submitted to the local Slurm queue (9).
DeepSquare’s Job Submission Architecture: Clusters are deployed using ClusterFactory
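The job submission steps above can be sketched as a small in-memory Python simulation. Everything here is a stand-in: the `MetaScheduler` class replaces the smart contracts and the scheduling service, the status names are invented, and the "most free GPUs" selection rule is an assumption, since the text does not describe the actual meta-scheduling algorithm.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Job:
    job_id: int
    workflow: str  # job definition (workflow file contents)
    gpus: int
    status: str = "PENDING"  # hypothetical status names
    provider: Optional[str] = None


@dataclass
class GridPartner:
    name: str
    free_gpus: int
    slurm_queue: List[int] = field(default_factory=list)  # stand-in for the local Slurm queue


class MetaScheduler:
    """In-memory stand-in for the meta-scheduler smart contracts and service."""

    def __init__(self, partners: List[GridPartner]):
        self.partners = {p.name: p for p in partners}
        self.jobs: Dict[int, Job] = {}

    def request_job(self, workflow: str, gpus: int) -> Job:
        # The user submits a job; this triggers the "new job" event handler.
        job = Job(len(self.jobs) + 1, workflow, gpus)
        self.jobs[job.job_id] = job
        self._on_request_new_job(job)
        return job

    def _on_request_new_job(self, job: Job) -> None:
        # Assumed policy: choose the partner with the most free GPUs that fits the job.
        candidates = [p for p in self.partners.values() if p.free_gpus >= job.gpus]
        if not candidates:
            return  # job stays PENDING
        best = max(candidates, key=lambda p: p.free_gpus)
        job.provider, job.status = best.name, "META_SCHEDULED"

    def claim_job(self, partner_name: str, job_id: int) -> str:
        # The selected partner's supervisor claims the job and gets its definition.
        job = self.jobs[job_id]
        if job.provider != partner_name or job.status != "META_SCHEDULED":
            raise ValueError("job is not assigned to this partner")
        job.status = "SCHEDULED"
        return job.workflow


def supervisor_poll(scheduler: MetaScheduler, partner: GridPartner) -> None:
    """Claim jobs assigned to this partner and push them to its local queue."""
    for job in scheduler.jobs.values():
        if job.provider == partner.name and job.status == "META_SCHEDULED":
            scheduler.claim_job(partner.name, job.job_id)
            partner.free_gpus -= job.gpus
            partner.slurm_queue.append(job.job_id)


partners = [GridPartner("site-a", free_gpus=4), GridPartner("site-b", free_gpus=8)]
scheduler = MetaScheduler(partners)
job = scheduler.request_job("echo hello", gpus=2)
for partner in partners:
    supervisor_poll(scheduler, partner)
print(job.status, job.provider, partners[1].slurm_queue)
```

In the real platform the claim happens on-chain and the final submission goes to Slurm; the sketch only mirrors the order of the hand-offs.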
Now that ClusterFactory control planes run on k0s clusters, DeepSquare has the scalability and resiliency it needs to empower more developers to access HPC capabilities on demand and push the boundaries of digital innovation.
For DeepSquare, the benefits of using k0s include:
Greater scalability and reliability, with no failures during the past 2 years
Faster and easier cluster provisioning using k0sctl, reduced from 1 day to only 5 minutes
Faster deployment and configuration of the full stack (ClusterFactory), reduced from 2-4 weeks to only 3 days
To learn more or download k0s, visit: https://k0sproject.io/
For more information about DeepSquare, visit: https://deepsquare.io/