How to Leverage Managed Services to Get out of the Data Center Business and Focus on Core Innovation
Need a fully-managed private cloud? Here’s how Mirantis OpsCare Plus experts architect, build, and operate modern, multi-modal private clouds at scale
Modern private clouds take many forms to serve many use-cases and requirements. If you need to build and operate a completely future-proof enterprise private cloud, then the potential technology stack is often fiercely complicated, encompassing:
Data center and physical server management
Centralized deployment, lifecycle management, and observability for the cloud layers
(Optimally) A fully-built-out enterprise IaaS framework for running virtual machines: essential for hosting legacy conventional workloads, container engines, and container orchestration (i.e., Kubernetes)
(Optimally) A primary Kubernetes underlay, which makes everything on top agile, resilient, and updateable
Fully-built-out production Kubernetes cluster model(s) that can be deployed and lifecycle managed by (or on behalf of) developers and operators
Once the cloud is built, tested, and handed over, it needs to be continually updated to maintain security, compliance, availability, and high performance. Things like software development workflows, CI/CD, etc., need to be architected, built on top, and then maintained. Only then can the real work begin: starting with efforts to deliver new and valuable innovation, while also often simultaneously migrating and modernizing workloads to exploit the new cloud platform.
It’s a big, hairy deal. It requires a ton of labor by people with special skills (like: how to build and tune Ceph clusters to provide block and object storage under Kubernetes). So it’s no wonder that cloud projects go slow, stall out, or produce lackluster ROI after many months or years of effort.
OpsCare Plus and ZeroOps
OpsCare Plus – a professional services offering within Mirantis’ ZeroOps for Cloud On Prem portfolio – evolved out of more than a decade’s experience helping technically-demanding organizations design, source, build, and operate some of the world’s biggest private clouds. This article explores exactly how that process works: the steps, how long each takes, and how they can move your organization – efficiently and predictably – into a ZeroOps state, where you’re consuming your new, on-premises private cloud pretty much “as a service.” We spoke at length with Mirantis’ Christian Huebner, Principal Architect and Director of Architecture Services, to hear the scoop.
Architecture Design Assessment
“The first step,” says Huebner, “is what we call an Architecture Design Assessment, or ADA. This is a discrete, technical- and human-centered process that’s about gathering information: what are the use-cases in detail? Current technical situation? Requirements and constraints? Utilization and scaling projections, etc. Generally, we start by identifying all the stakeholders and subject-matter experts on the customer side, and go onsite with them to begin the discovery and planning process.”
“If the customer is coming from another platform like VMware or AWS, we need to expose all the details, logic, dependencies, and pain-points around their current solution,” Huebner says. “Partly to help design a better, new solution, and partly because the customer will, in most cases, need to migrate workloads and processes before a new solution can deliver ROI.”
“A full ADA – say for a big, multi-mode IaaS and Kubernetes cloud, or several clouds in different locations – is very far-ranging, taking two to four weeks. For very simple engagements, typically with existing customers we’re already familiar with, we have a shorter – several-day-long – process called an ADC that mostly looks at workloads, for example. So the ADA scales with complexity, and we respect the customer’s time, resources, and budget. The goal is to give a lot of value with the ADA: a comprehensive report that captures both the big picture and the scope and engineering details of what this customer will need to consider in the cloud project they’re contemplating.”
“Normally,” says Huebner, “an ADA takes somewhere between three and four weeks. In simple cases, three weeks on the dot. Obviously, more complex projects take a little more time. If we’re talking about building multiple clouds in different locations, we need to explore each. But this rarely involves actually going onsite, and is quite time-efficient because we use automation.”
“In fact, logical discovery usually doesn’t require any access to the customer’s facility by our team. We have a sophisticated script we can give the customer – they can audit it and know exactly what it’s doing. They run this on their site, and the software picks up and packages every environmental detail, which they can then send to us. We have an analyzer that unpacks the compressed data and helps us see what’s going on and how to proceed.”
“This tends to turn up a lot of detail that lets us directly address challenges the customer is experiencing with their current cloud or infrastructure,” says Huebner. “For example, we can see that a CPU is undersized for what they need to do with it, or they have too few placement groups. And of course, we report on these findings and then account for them in our recommendations. This would be impossible without software analytics – no human can review and understand tens of megabytes of text data. You’d go crazy, and it would take weeks. In our case, it’s more like days – our automation saves maybe 50 or 60 hours of engineering time, each time it’s used.”
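To make the placement-group example concrete, here is a minimal sketch of the kind of check an analyzer could run. It is not Mirantis' analyzer: the function names are hypothetical, and the sizing rule is an assumption based on a commonly cited Ceph rule of thumb (roughly 100 placement groups per OSD, divided by the replica count, rounded up to a power of two), applied as if the cluster held a single pool.

```python
from typing import Optional


def suggested_pg_count(num_osds: int, replica_size: int,
                       target_pgs_per_osd: int = 100) -> int:
    """Rule-of-thumb PG count, rounded up to the next power of two.

    Assumption: a single pool and a target of about 100 PGs per OSD.
    """
    if num_osds < 1 or replica_size < 1:
        raise ValueError("num_osds and replica_size must be positive")
    # Ceiling division without floating point.
    raw = -((-num_osds * target_pgs_per_osd) // replica_size)
    pg_count = 1
    while pg_count < raw:
        pg_count *= 2
    return pg_count


def pg_finding(pool_name: str, current_pgs: int, num_osds: int,
               replica_size: int) -> Optional[str]:
    """Return a human-readable finding if the pool looks under-provisioned."""
    suggested = suggested_pg_count(num_osds, replica_size)
    if current_pgs < suggested:
        return (f"pool {pool_name!r}: {current_pgs} PGs configured, "
                f"rule of thumb suggests about {suggested}")
    return None


if __name__ == "__main__":
    # Hypothetical cluster: 30 OSDs, 3 replicas, a pool with 128 PGs.
    print(pg_finding("volumes", current_pgs=128, num_osds=30, replica_size=3))
```

A real analysis would weigh many more inputs (multiple pools, autoscaler settings, hardware limits), so treat this only as an illustration of turning raw environment data into a reportable finding.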
“We discover a lot,” Huebner says. “In fact, in the course of ADA discovery, we always, always discover constraints and dependencies the customer forgot existed and requirements they didn’t quite know they had. Required integrations left out of the initial scope. Cloud builds with no backup capability. Or someone stands up in a meeting and says they absolutely need metal for a critical database. Sometimes, they really want a particular user experience – they want a certain kind of workload to start very fast – so we need to analyze, understand, and recommend an architecture that will deliver this specific aspect of performance. I can’t remember an ADA – and we’ve done many hundreds – that ended up looking exactly like a customer’s initial description of what they thought they needed.”
“In all this, our goal is to hopefully prevent the customer from making any decisions – in fact, including the decision to proceed with the project with us as their partner – until all information is collected and can be considered,” Huebner says.
“Reality often works against us,” he continues. “In fact, here’s a good example of something customers do, pretty frequently, that we wish they wouldn’t do until the ADA is complete – which is to commit to buying data center hardware.”
“We are really good at specifying hardware under clouds,” says Huebner. “And when customers let us fully work the technology and math, we can often save them a lot of money. In one current example – we’re talking about building a 120-node OpenStack-on-Kubernetes cloud. And I’m right now doing calculations, and it’s looking like we’ll be able to save them about half a million USD by specifying the hardware correctly.”
“To do this responsibly, of course,” Huebner says, “you have to factor in every detail about the data center physical plant, environment, HA requirements, storage infrastructure, the software stack, workloads, performance, and so on. I have engineering applications that I’ve built and tested over years and continually refined, that help me determine variations that work and satisfy requirements, enabling me to fully explain my rationale for every component recommendation.”
“Sometimes, the customer will have a bill of materials for us to review, based on earlier work,” he continues. “Sometimes, we start from scratch, which is often better because supply chains for infrastructure are very dynamic and prices shift and change quite quickly. We analyze every way of fulfilling the technical requirements of the software stack and the customer’s needs for capacity, performance, resilience, and other variables. And it always turns out there are thousands of ways to solve each problem – and we tend to look at all of them. What would this cloud look like on 18 physical nodes? On 120? Can we host control plane elements one way and compute nodes another?”
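The idea of comparing many hardware variations against the same requirements can be sketched in a few lines. This is a toy illustration, not the engineering applications Huebner describes: the node types, prices, and requirement figures below are invented, and a real sizing exercise would also account for storage, networking, power, HA overhead, and more.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class NodeType:
    """A hypothetical server configuration with an illustrative price."""
    name: str
    cores: int
    ram_gib: int
    price_usd: int


def nodes_needed(node: NodeType, need_cores: int, need_ram_gib: int,
                 min_nodes: int = 3) -> int:
    """Smallest node count that covers cores, RAM, and a minimum cluster size."""
    by_cores = math.ceil(need_cores / node.cores)
    by_ram = math.ceil(need_ram_gib / node.ram_gib)
    return max(by_cores, by_ram, min_nodes)


def rank_options(candidates: List[NodeType], need_cores: int,
                 need_ram_gib: int) -> List[Tuple[int, int, str]]:
    """Return (total_cost, node_count, node_name) tuples, cheapest first."""
    options = []
    for node in candidates:
        count = nodes_needed(node, need_cores, need_ram_gib)
        options.append((count * node.price_usd, count, node.name))
    return sorted(options)


if __name__ == "__main__":
    # Invented figures, for illustration only.
    candidates = [
        NodeType("small", cores=32, ram_gib=256, price_usd=9_000),
        NodeType("large", cores=64, ram_gib=512, price_usd=16_000),
    ]
    for cost, count, name in rank_options(candidates, 1920, 12_288):
        print(f"{count:>4} x {name:<6} = ${cost:,}")
```

Even this simplified model shows why the cheapest per-node option is not always the cheapest cloud: the binding constraint (cores, RAM, or minimum cluster size) changes with each configuration.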
“We look very carefully at cost – sometimes, even despite a customer’s stated preferences – just so we know we’re not letting the customer miss out on savings. For example, in one recent case, the customer preferred Intel CPUs. But AMD announced a discount on comparable CPUs and we worked out all the pricing impacts so we could share this information and let the customer make the call.”
“When we submit the final ADA report, including a bill of materials and a full specification for networking, storage, and other aspects of the solution, the customer is in a significantly improved situation to understand and take action on their anticipated cloud journey. They can, of course, do whatever they like with this information. Though we hope the process continues – as it usually does – with pre-deployment.”
Pre-Deployment
“In our ideal universe,” says Huebner, “this is where you buy and take delivery of the hardware. This is highly variable in terms of time-cost, both in terms of dealing with the supply chains and sources and often in internal procurement processes within organizations. That process is Byzantine and outmoded in some companies. You fill out multiple forms for the different hardware types. This gets steamed and steeped in the inbox of someone in New Zealand. Someone will get the required number of quotes from different vendors. In the process, someone (not us!) will fill out a form wrong and order the wrong CPU and then this will be discovered and that part of the RFQ will need to be reissued.”
“In other cases, thankfully, organizations have mostly fixed this. Internal processes are efficient. So you only need to plan around possible supply chain delays,” Huebner says.
“We try to provide a full hardware specification as early as we can,” he continues. “But we also try to plan things to maximize efficient use of delays during the pre-deployment period. One way is to encourage the customer to begin the process of application migration and modernization – moving workloads and databases, converting legacy monoliths to efficient containerized forms. Our Application Migration and Modernization Platform helps automate application discovery and dependency analysis, as well as many steps in a modernization process, like generating Docker manifests. All this work can begin before the new cloud is available, and starting right after the ADA can really speed up the customer’s ROI for the cloud transformation project. A migration and modernization plan can usually be executed entirely remotely, with our teams delivering modernized container workloads at intervals for the customer to test, validate, and put in service. Ideally, delays can be made useful, and when the new cloud is in place, it can be running important workloads very quickly, since the migration and modernization process is well established and underway.”
Integrate/Build/Deploy
“So the hardware arrives,” continues Huebner, “and we’ve given the customer a specification that lets them rack, stack, and network everything together. We provide the customer with tools to validate and test the hardware configuration. We can get involved in details of the physical plant, of course – how facilities systems integrate, power and cooling, network, how to manage ongoing procurement, inventory spares, establish MTBF baselines, and so on – but often, our customers have experienced IT people who can manage this efficiently. Either way, however, we generally do all the bare metal management – a Mirantis deployment engineer installs an operating system on one node and we create what we call a ‘seed node’ there. And from that point, all the customer needs to provide is credentials for IPMI, so we can manage controllers and power things on and off, and so forth.”
“Thereafter, our process is fast and flexible,” says Huebner. “Say, two working weeks for a simple cloud deployment. In more complex cases, though – for example, where the customer requires a particular storage solution – there’s preliminary custom integration work to do and test. We will develop required integrations, deal with vendors, and stand behind everything.”
“Non-custom integration work is more straightforward,” Huebner says. “We’ll connect our monitoring system with their identity provider (IDP) to simplify local and remote observability and alerting, for example. And we’ll integrate data center access more generally so that their developers and others can access the cloud via their corporate SSO.”
“Scale is also a factor,” Huebner says. “Because of networking, it becomes inefficient to provision more than six or eight nodes in parallel. So very large cloud builds and multi-location builds take more time: up to a maximum of perhaps six weeks.”
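The effect of that parallelism limit on a large build is simple arithmetic, sketched below. The helper name and node naming scheme are hypothetical, and the sketch says nothing about how long each wave takes; it only shows how a cap of six or eight parallel nodes turns a large node list into sequential provisioning waves.

```python
from typing import List


def provisioning_waves(nodes: List[str], max_parallel: int = 8) -> List[List[str]]:
    """Split a node list into sequential waves of at most max_parallel nodes."""
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    return [nodes[i:i + max_parallel]
            for i in range(0, len(nodes), max_parallel)]


if __name__ == "__main__":
    # Hypothetical 120-node build.
    nodes = [f"node-{i:03d}" for i in range(120)]
    for limit in (6, 8):
        waves = provisioning_waves(nodes, max_parallel=limit)
        print(f"limit {limit}: {len(waves)} waves")
```

With a cap of eight, a 120-node build needs 15 waves; with a cap of six, it needs 20, which is one reason build time grows with cloud size.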
Testing and Handoff
“Once the deployment is complete, of course, we test exhaustively,” says Huebner. “Some of this is implementation-dependent – if we’ve integrated a storage solution, we test every piece of block storage, for example. We have automated testing that checks out every aspect of cloud functionality, networking, and other details. We have acceptance testing. We do performance testing. By the time we say ‘done,’ we have very high assurance that all the customer’s requirements have been met, everything is working as advertised, and there are no surprises in store.”
“Acceptance testing is typically a week to ten days,” says Huebner, “depending on customer requirements. Performance testing can take longer. Then the handoff happens. This is a very distinct and important step: the team that has built the cloud assemblies has all the information that there is to know about the cloud; and we have a large meeting with the customer team that includes the deployment engineers, and also our support, customer success, and other functions. We introduce them to the cloud. Everything is described in detail. We hand over the documentation – and it should be noted that the docs are mostly specific to the customer and the implementation: all crafted as part of the project.”
Operate
“And then … more and more often, we operate the cloud for the customer. The service we call OpsCare Plus provides 24/7/365 remote observation and rapid response. Typically, just outbound permissions are granted, so that our StackLight observability platform – part of the cloud build – can share data with our NOC. We and the customer are notified simultaneously of any critical issues, and a procedure is in place and rehearsed for communicating with customer engineers onsite, quickly determining what’s wrong, and sometimes granting temporary remote access to us, to fix what’s broken. Other customers may grant us perpetual access – via a bastion server or similar highly-secure mechanism – so that we can fix issues entirely unattended.”
“More routine maintenance – adding nodes, for example, or doing minor updates – is carefully scheduled to happen with minimum disruption. This stuff is mostly automated: the customer generally experiences no downtime. Our IaaS cloud architecture, Mirantis OpenStack for Kubernetes (MOSK), is built on a Kubernetes substrate, so updating the OpenStack control plane, for example, happens by revising config files to point to new control plane component containers. This is nothing like legacy OpenStack upgrades, which could be forbiddingly complex, all-hands-on-deck affairs.”
“The result is that more frequent updates become practical,” Huebner says. “Some customers are leery of this – they’re accustomed to viewing cloud-framework updates as risky, time-consuming endeavors. But the technology has caught up. And with a little effort – and support from Mirantis – private clouds can actually consume innovation faster, stay closer to state-of-the-art, and of course, be as secure as possible because updates aren’t being delayed unconscionably long.”
“We test updates, of course, before recommending that customers move to them,” Huebner says. “We have multiple internal test clouds that we use to exercise updates in different scenarios. So by the time our protocols start nudging customers about an update, we’re very sure that it’s a good idea. Major version upgrades, of course, happen less frequently, and require more testing and preparation. They may also require minor downtime: typically, every node in a cloud will need to be rebooted. But we consistently improve our best practices and at this point, have largely eliminated the risks.”
“That’s the ultimate vision,” Huebner concludes. “Where the customer might find out about a critical issue when they read a morning email, explaining how it was mitigated. Our goal with OpsCare Plus is to get the customer as close to ZeroOps as possible: they can consume their on-premises cloud as a service. An increasing range of customers are discovering that this is optimal – most customers don’t know how to run an OpenStack cloud or a large-scale production Kubernetes implementation on bare metal, consistently and reliably. And OpsCare Plus lets them rely on Mirantis expertise to do this – eliminating the need to hire and retain many highly-skilled and expensive engineers.”
For more information on Mirantis cloud architectural design assessment, cloud build services, OpsCare support, and/or Application Migration and Modernization, please contact us.