What is the difference between hybrid and multi-cloud? It’s really a matter of orchestrating multiple environments. The terms hybrid and multi-cloud are merging to mean the integration of compute, storage, and networking across provider boundaries. Organizations are even beginning to look at their own data centers as cloud-like environments that need to be integrated with native cloud platforms. For solutions and applications to run in this world, however, they need to be able to do two things: run where they are required and take full advantage of the environment they are in. This holds true even when the components or services that comprise the solution span multiple clouds.
Unfortunately, we see too many software packages that are merely ‘hostable’, meaning they are written to a common standard execution environment such as Kubernetes or another cloud-native compute environment. It’s often touted that these solutions can ‘run anywhere’. While that’s true, it doesn’t mean the solution can take advantage of the environment in which it’s running. You therefore need to balance portability against constructing the solution to exploit cloud capabilities. Porting an app into the cloud via containers is not always the end goal; oftentimes it’s just a milestone, a necessary step before expanding the solution to take full advantage of the compute services in its environment.
Before we all knew what hybrid meant, you’d badge into the on-premises data center, where there might have been some code running that had a cloud connection. That code might have used the best methods available at the time, yet in retrospect, much of it could have been implemented more efficiently by leveraging more of the cloud. Nowadays, componentizing the solution to determine which workloads should run where is the value-added activity that architects must consider. In some cases, when solutions span environments (cloud, on-premises, or other clouds), the app (or the application domain) lives in two or more places. Some components are best run on-premises while the rest run best in cloud environments optimized for those components. These concepts of application or solution domains and component hosting are blurring the definition of, and the distinction between, the hybrid cloud and multi-cloud descriptors.
Interestingly, the scenarios for optimization and their respective parameters are becoming more sophisticated. For instance, you may want to run components in a telco cloud during the day for reduced latency and better performance, but at night run those same components on a different, less expensive cloud for cost optimization. Or you may want to follow the sun. We call this dynamic optimization “vertical scaling”, and it’s a new pattern that we’re seeing emerge.
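As a concrete (and purely hypothetical) sketch, a placement policy of this kind can be as simple as a function that maps time of day to a deployment target; the target names and latency/cost figures below are illustrative only:

```python
from datetime import datetime, timezone

# Hypothetical deployment targets; names and figures are illustrative.
TARGETS = {
    "telco_edge": {"latency_ms": 5, "cost_per_hour": 2.40},
    "commodity_cloud": {"latency_ms": 45, "cost_per_hour": 0.30},
}

def choose_target(now: datetime) -> str:
    """Pick a deployment target based on time of day.

    Daytime traffic is latency-sensitive, so it goes to the telco edge;
    overnight traffic is cost-sensitive, so it moves to the cheaper cloud.
    """
    hour = now.astimezone(timezone.utc).hour
    return "telco_edge" if 8 <= hour < 20 else "commodity_cloud"

print(choose_target(datetime.now(timezone.utc)))
```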
The multi-cloud concept is often rooted in an organizational mandate. In many cases, that mandate exists simply because organizations don’t want to be “beholden” to a single cloud provider. That’s still a real concern for many customers, but it may not be the best or only reason to drive strategic partnerships and implement solutions across multiple clouds. The technical debt incurred by spanning providers without a tangible reason can prove unjustifiable over the long term. For some, understanding and justifying the technical debt of extending an application across multiple clouds is the right choice.
A better way to think about using multiple clouds is to exploit a particular cloud’s advantages for a certain part of your application. That requires patterns for deploying solution components to a specific cloud in order to utilize cloud-specific functionality. The architect just needs to recognize that these components may need to be refactored if they are ever ported to a different cloud.
There are three operational aspects that determine the tools and techniques that should be resident in a solution, especially when it spans multiple cloud environments: logging and monitoring; security (authentication and authorization); and data ingestion and schematization, along with other common foundational capabilities such as support, troubleshooting, and resolution. Many of these capabilities are simple in a single-cloud deployment but quickly become complicated when the app must account for functional and nonfunctional behavior across cloud environments. Cloud environments may have separate control towers and normally require native or proprietary tools to manage and configure deployed solution components. Architects deploying solutions to multiple cloud environments should strive to establish a “single pane of glass” where telemetry, configuration, and statistics from all environments are aggregated and contextualized. Even when this is accomplished, areas like troubleshooting and issue resolution will most likely still require cloud-native tooling across the individual clouds.
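One way to approach that aggregation, sketched below under the assumption of a custom normalization layer, is to map each cloud’s native log entries into a single cloud-neutral record format; all field names here are hypothetical, and real mappings depend on the native tooling in use:

```python
from dataclasses import dataclass

@dataclass
class TelemetryRecord:
    """A cloud-neutral telemetry envelope for the aggregation layer."""
    cloud: str        # e.g. "azure", "aws", "gcp"
    component: str    # logical solution component that emitted the record
    severity: str     # normalized: "info" | "warning" | "error"
    message: str
    timestamp_utc: str

def normalize_azure(entry: dict) -> TelemetryRecord:
    # Field names are illustrative; a real adapter would map the schema
    # of the source system (Azure Monitor, CloudWatch, Cloud Logging, ...).
    return TelemetryRecord(
        cloud="azure",
        component=entry["resource"],
        severity=entry["level"].lower(),
        message=entry["msg"],
        timestamp_utc=entry["time"],
    )
```

With one adapter per cloud, the “single pane of glass” only ever has to reason about `TelemetryRecord` objects.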
Building for multi-cloud environments
Understanding and organizing around these operational models is an important first step, but it’s certainly not easy to do. Many organizations underestimate the complexity and organizational dynamics of managing multiple clouds, especially when one solution depends on all of them at once.
Building solutions that take advantage of multiple environments requires a standard set of protocols that all cloud and infrastructure providers adhere to. Microsoft and other technology leaders have organized around and follow the Cloud Native Computing Foundation (CNCF) model. Most people use containers that conform to the Open Container Initiative (OCI) specifications, which the Linux Foundation drives alongside CNCF. CNCF also hosts Kubernetes, an orchestration environment that brings multiple servers together into a cluster. A service mesh is a networking layer that runs within your cluster, handling concerns such as traffic routing and isolation between services. Microsoft refers to CNCF support as “Cloud Native”.
CNCF is primarily about infrastructure, but what about building the software itself? Microsoft has an abstraction layer called DAPR, the Distributed Application Runtime. You write your code to the DAPR API, which provides building blocks such as state management, and the community has built connectors that map each building block to its native implementation in Cloud A, Cloud B, and so on. DAPR is modular rather than a set of monolithic capabilities: you can choose to work with just one of DAPR’s building blocks. It’s also adaptable to multiple environments, both Kubernetes and non-Kubernetes (think IoT).
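As a minimal sketch, here is what the state-management building block looks like through DAPR’s HTTP API, assuming a sidecar on the default port and a configured state component named `statestore`:

```python
import requests

# The DAPR sidecar listens on localhost (default HTTP port 3500).
# "statestore" names a configured state component; behind it could be
# Redis, Azure Cosmos DB, AWS DynamoDB, etc. The application code does
# not change when the backing store does.
DAPR_URL = "http://localhost:3500/v1.0/state/statestore"

# Save state through the building-block API.
requests.post(DAPR_URL, json=[{"key": "order-42", "value": {"status": "shipped"}}])

# Read it back.
resp = requests.get(f"{DAPR_URL}/order-42")
print(resp.json())
```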
You can also think of DAPR as a productivity enhancement: it abstracts away a lot of underlying complexity by using services supported across a variety of environments. It lets you build and deploy quickly, and spend your time writing code that adds value rather than code for plumbing.
We’ve seen a lot of success with DAPR and plan to bring it into CNCF as well. If you’re interested in learning more, go to dapr.io.
Another important evolution is that capabilities that were once cloud-specific are becoming portable. Microsoft is taking Azure Functions, Azure Event Grid, and others and making them available on top of Kubernetes. Now you can write your app on Azure and, conceptually speaking, package it up and run it on GKE on Google Cloud Platform.
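To illustrate the idea, here is a sketch of an HTTP-triggered Azure Function in Python (it assumes the `azure-functions` package and the usual `function.json` binding configuration). The point is that the application code targets the Functions programming model rather than a specific host:

```python
import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    # The code targets the Functions programming model, not a host:
    # the same function runs on Azure or, packaged with the Functions
    # runtime container, on a Kubernetes cluster elsewhere.
    name = req.params.get("name", "world")
    return func.HttpResponse(f"Hello, {name}!")
```

Deploying this to a Kubernetes cluster typically means packaging it with the Functions runtime container image (with KEDA handling event-driven scaling), while the code itself stays unchanged.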
There are three things that make it feasible to run apps in a multi-environment mode: 1) CNCF-based services, 2) facilities such as DAPR, and 3) portable, cloud-consistent capabilities that run on multiple clouds.
Operating solutions in multi-cloud environments
Observability is critical to operating solutions in a multi-cloud environment. In the quest to build resilient cloud apps, we need to create telemetry that indicates whether an app is operating correctly. The biggest challenge is orchestrating and exposing telemetry from the solution’s components. For example, is it an anti-pattern to monitor an app solely from within the solution itself? (It is.) The “watchdog” pattern is a good idea because it’s not always possible for the app to communicate how it’s doing. In these circumstances, you should adopt an “outside-in” approach where external systems monitor the solution across clouds.
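A minimal sketch of such an outside-in watchdog, assuming hypothetical health endpoints for each component, could look like this:

```python
import requests

# Hypothetical health endpoints for solution components across clouds.
ENDPOINTS = {
    "frontend (Azure)": "https://frontend.example.com/healthz",
    "scoring (GCP)": "https://scoring.example.com/healthz",
}

def probe() -> dict:
    """Outside-in watchdog: runs outside every cloud hosting the solution."""
    results = {}
    for name, url in ENDPOINTS.items():
        try:
            ok = requests.get(url, timeout=5).status_code == 200
        except requests.RequestException:
            ok = False
        results[name] = "healthy" if ok else "unreachable"
    return results

if __name__ == "__main__":
    print(probe())
```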
Many teams build application monitoring, health, and metric components as an afterthought, and often seem to believe that applications will simply conform to the service-level agreements of the cloud environments where they are deployed. Building on these assumptions is harmful: an application constructed as if it will execute without some degree of constant failure usually produces undesirable outcomes.
We need an observability pattern that explains how to implement a capability that can profile, alert on, and provide visibility into an application and its components across clouds, from outside those components. Start by generating and collecting logs; then you can determine where to transport them and how to use them from a control and monitoring plane that sits outside the component implementation of your solution.
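A simple way to start, sketched here with Python’s standard logging module, is to emit structured (JSON) log records to stdout so an external pipeline can transport and index them without per-cloud parsing rules:

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit logs as JSON so an external pipeline can transport and
    index them without per-cloud parsing rules."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "component": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)  # stdout -> log collector/forwarder
handler.setFormatter(JsonFormatter())
log = logging.getLogger("order-service")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("order accepted")
```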
So, which is better, a push or a pull mechanism? Should the app ‘tell’ you everything is okay, or should you ask every so often? Both Uli and Eric agree that it’s better for the app to ‘tell’ you its status, but there are times that you’ll want to ping the service, such as in IoT scenarios. IoT devices also follow a lifecycle that gives off a ‘heartbeat’ in the form of a log or event, which is more of a push mechanism.
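A minimal heartbeat sketch, assuming a hypothetical collection endpoint outside the device’s environment, might look like this; note that the monitoring plane treats a missed heartbeat, not an error response, as the failure signal:

```python
import time
import requests

# Hypothetical collection endpoint outside the device/app environment.
HEARTBEAT_URL = "https://monitor.example.com/heartbeat"
DEVICE_ID = "sensor-007"

def heartbeat_loop(interval_s: int = 60) -> None:
    """Push a periodic 'I'm alive' event; the monitoring plane alerts
    when a device misses several consecutive heartbeats."""
    while True:
        try:
            requests.post(HEARTBEAT_URL,
                          json={"device": DEVICE_ID, "status": "ok"},
                          timeout=5)
        except requests.RequestException:
            pass  # the *absence* of the heartbeat is itself the signal
        time.sleep(interval_s)
```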
Observability fabrics are processes orthogonal to the app that let you monitor multiple apps externally. Logs and metrics lend themselves to algorithmic analysis of application performance: if I’m seeing logs at a certain volume, I can ask this system of intelligence, “is this weird?”, and if it is, have it let me know. We can train these systems; the more data we have, the more accurate they become. They could become intelligent enough to make suggestions.
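The “is this weird?” check doesn’t have to start with machine learning; a simple statistical baseline already captures the idea. This sketch flags a log volume that sits several standard deviations away from the recent history:

```python
from statistics import mean, stdev

def is_weird(history: list[int], current: int, threshold: float = 3.0) -> bool:
    """Flag the current log volume if it is more than `threshold`
    standard deviations away from the recent baseline."""
    if len(history) < 2:
        return False
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold

# Logs-per-minute over the recent baseline vs. the current minute.
baseline = [120, 118, 125, 130, 119, 122]
print(is_weird(baseline, 640))  # True: roughly 5x the usual volume
```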
How do you monitor a multi-layered, multi-service infrastructure like Kubernetes? You could implement a correlation ID across services, then collect and store logs in a centralized place. You can then track a transaction through the call stack to figure out which components the application is interacting with. You also need to define what your application actually is, especially once you adopt microservices and other architectural patterns; it can get very confusing otherwise. Once you have that definition, you can track the transaction through the stack.
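A minimal sketch of correlation-ID propagation, here using Flask and a hypothetical downstream service, reuses the caller’s ID when present and mints one at the edge otherwise (the header name is a common convention, not a standard):

```python
import logging
import uuid

import requests
from flask import Flask, g, request

logging.basicConfig(level=logging.INFO)
app = Flask(__name__)
HEADER = "X-Correlation-ID"

@app.before_request
def attach_correlation_id():
    # Reuse the caller's ID if present; mint one at the edge otherwise.
    g.correlation_id = request.headers.get(HEADER, str(uuid.uuid4()))

@app.route("/checkout")
def checkout():
    logging.info("checkout started correlation_id=%s", g.correlation_id)
    # Propagate the same ID to downstream services so the transaction
    # can be stitched together in the centralized log store.
    requests.get("https://inventory.example.com/reserve",
                 headers={HEADER: g.correlation_id}, timeout=5)
    return "ok"
```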
Conclusion
We’ve just scratched the surface in this article. To learn more, check out the three videos below for many more insights about building and operating apps in multi-cloud environments. You can always ping us at @echarran or @benbrauer. And don’t forget to check out all the great supporting documentation about Well-Architected workloads in Docs!