While writing this post, I realized that the topic I wanted to discuss was too broad for a single post. Therefore, I have decided to split it into two parts.
The first part is presented here. The second part is here: Disaster Recovery Strategies – Part 2
Introduction
Disaster Recovery (DR) and Business Continuity (BC) are widely discussed and documented, and details are easy to find on the Internet. The goal of this post is instead to show the value Aviatrix adds when it comes to DR/BC. I will focus on Region failures/outages because that is the area where Aviatrix shines the most.
But first, let’s see how the Aviatrix Control-Plane and Data-Plane can survive a regional outage.
Aviatrix Control-Plane and Data-Plane Separation
The Aviatrix Controller is the brain of the cloud network platform. It is centralized and it controls the whole multi-cloud / multi-region environment.
The Controller is an instance that is deployed within a VPC/VNET. It provides a central configuration point for the environment (either via GUI or Terraform). Please keep in mind that monitoring is delivered by Aviatrix CoPilot.
Aviatrix CoPilot provides a global operational view of your multi-cloud network not available from AWS, Azure, or any other cloud provider. Though Aviatrix CoPilot is an important part of the environment, the post will focus only on Aviatrix Controller from a control-plane perspective.
Additionally, there is a distributed data-plane built from Aviatrix Gateways, where all packet forwarding and encryption take place.
Since the Aviatrix Controller is not in the data-plane, a temporary loss of the Aviatrix Controller does not affect the existing tunnels or packet forwarding. The data-plane operates normally, and there is no packet loss or traffic interruption if the Aviatrix Controller fails. The only impact is on control-plane-related tasks, e.g. making routing changes or updates.
This loosely coupled relationship between the Aviatrix Controller and the Aviatrix Gateways limits the impact of a Controller failure and simplifies the infrastructure.
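The separation described above can be illustrated with a small toy model. This is only a conceptual sketch, not Aviatrix code: the `Gateway` and `Controller` classes, the `push_route` method, and the route values are all hypothetical names made up for this example. It assumes that a Gateway forwards traffic purely from state it already holds, and that only changes require the Controller.

```python
from dataclasses import dataclass, field


@dataclass
class Gateway:
    """Toy data-plane node (hypothetical): forwards using routes it already holds."""
    name: str
    routes: dict = field(default_factory=dict)  # destination prefix -> next hop

    def forward(self, prefix: str) -> str:
        # Forwarding relies only on local state, never on the Controller.
        return self.routes[prefix]


@dataclass
class Controller:
    """Toy control-plane node (hypothetical): the only place where routes change."""
    up: bool = True

    def push_route(self, gw: Gateway, prefix: str, next_hop: str) -> None:
        if not self.up:
            raise RuntimeError("Controller unavailable: routing changes must wait")
        gw.routes[prefix] = next_hop


controller = Controller()
spoke = Gateway("spoke-gw")
controller.push_route(spoke, "10.1.0.0/16", "transit-gw")

controller.up = False  # simulate a Controller outage
still_forwarding = spoke.forward("10.1.0.0/16")  # the existing route keeps working
```

With the Controller marked as down, `forward` still succeeds for routes that were programmed earlier, while a new `push_route` call raises an error. That mirrors the behavior described above: forwarding continues, and only configuration changes have to wait.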
Where to deploy the Aviatrix Controller?
As you already know, the data-plane can remain functional without a Controller. However, the best practice is to go one step further and deploy the control-plane in a different Region than the one used by the data-plane. Why is this so important?
Let’s say that the customer deploys the Aviatrix Controller in the same Region as the data-plane Aviatrix Gateways (or at least some part of the data-plane), as shown in the diagram below.
What happens if there is a Region outage? Needless to say, the outage blast radius would be huge. Both the control-plane and the data-plane stop functioning. Besides bringing up the workloads/applications/databases, the customer must now take care of the following as well:
- Aviatrix control-plane (Controller)
- Aviatrix data-plane (Gateways)
Furthermore, the customer must restore the Controller first, before being able to activate the data-plane in a different Region, because the Controller is required to deploy the Gateways. Though setting up the Controller and restoring the configuration from backup is an easy task, it does take some time.
So what is the best practice? The recommended approach is to deploy the Controller in a completely separate Region as shown in the diagram above.
What does this approach bring to the table? With the Controller in its own Region, an outage in a single Region does not take down the data-plane and the control-plane at the same time:
- If the Region where the Aviatrix Controller is deployed fails, the customer only has to bring up the Aviatrix Controller in a different Region and restore the configuration from backup. The data-plane is not affected.
- If a Region where the Spoke/Transit Gateways are deployed fails, the data-plane is of course affected, but the control-plane keeps working because the Controller is placed in a different Region. The customer must deploy or activate the Gateways in a secondary Region, but since the Controller is up, this is an easy task. The Gateways may already be deployed but shut down (for cost reasons), or the customer may prefer to deploy them during an outage, e.g. using Terraform configuration files prepared in advance.
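The two placement options can be compared with a short sketch. It is a simplified illustration rather than an Aviatrix tool: the `recovery_steps` function and the Region names are hypothetical, and it only encodes the reasoning from this post, including the rule that the Controller must be restored before Gateways can be deployed.

```python
def recovery_steps(failed_region, controller_region, gateway_regions):
    """Return the Aviatrix-related recovery tasks, in order, for a Region outage.

    Hypothetical helper: it encodes only the reasoning from this post.
    """
    steps = []
    if failed_region == controller_region:
        # The Controller comes first because it is required to deploy Gateways.
        steps.append("restore Controller in another Region from backup")
    if failed_region in gateway_regions:
        steps.append("deploy or activate Gateways in a secondary Region")
    return steps


# Controller co-located with the Gateways: one outage hits both planes.
colocated = recovery_steps("region-a", "region-a", {"region-a"})

# Controller in its own Region: an outage hits one plane at a time.
controller_lost = recovery_steps("region-c", "region-c", {"region-a", "region-b"})
gateways_lost = recovery_steps("region-a", "region-c", {"region-a", "region-b"})

for label, plan in [("co-located", colocated),
                    ("separate, Controller Region fails", controller_lost),
                    ("separate, Gateway Region fails", gateways_lost)]:
    print(label, "->", plan)
```

In the co-located case the plan contains both tasks, and the Gateways have to wait for the Controller. With a separate Controller Region, each outage produces a single task.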
To be continued in Post #2
The other part of this topic is presented here: Disaster Recovery Strategies – Part 2