Disaster Recovery Strategies

I have covered the topic of Disaster Recovery and Business Continuity in two parts.

Part #1 focused on the separation of the Aviatrix control plane and data plane, and on Aviatrix Controller deployment recommendations (see the post “Disaster Recovery Strategies – Part 1”).

Part #2 (this post) focuses on Disaster Recovery / Business Continuity strategies.

Aviatrix and Disaster Recovery

How does Aviatrix help customers with their Disaster Recovery / Business Continuity strategies?

As you might already know, Aviatrix makes it easy for customers to connect environments (e.g. their Primary and DR locations) across different Clouds and Regions.

The detailed strategies are presented as four scenarios that customers can leverage. The choice of strategy is a trade-off between cost and restoration time, and it depends on multiple factors, including: How important is DR to the company? What is the cost of an outage? Could the company cope with an outage? What are the DR requirements of the applications/databases?

The strategies are presented from a cloud network perspective, focusing on how Aviatrix Gateways can be deployed to achieve the desired DR/BC scenario.

At the same time, it is important to mention that application/database requirements and capabilities must be taken into account whenever the DR strategy is discussed and implemented.

Please keep in mind that all the drawings use Azure as an example, but the strategies apply to AWS as well.


JIT Data-Plane deployment in DR Region

The JIT (Just In Time) data-plane deployment approach assumes that no Aviatrix Gateways are pre-built in the secondary Region before the primary Region outage. Only the VPCs/VNETs are created beforehand.

The Aviatrix Gateways in the secondary Region are spun up at the time of the outage. To shorten the deployment time at the secondary location, the use of Terraform is recommended.

This strategy does not introduce any additional AWS/Azure costs for compute resources or licenses. The VPCs/VNETs are free of charge. There is also no Aviatrix-related cost.

The customer must be aware of the following trade-offs:

Long Recovery Time, as the whole environment (Aviatrix Gateways and the connections between them) must be built in the secondary Region at the time of the outage
The risk of deployment issues, each of which could extend the Recovery Time even further. Using Terraform helps to mitigate this risk
The risk of resource contention. A primary Region outage would affect thousands of other customers/organizations, and the AWS/Azure compute resources within a particular Region are finite. Those other customers might want to restore their workloads in the same Region that is our customer’s secondary Region.

The expected Recovery Time depends on the number and types of the components to be deployed, e.g. number of Gateways, number of peerings, and number of SNAT/DNAT rules. However, it is safe to say that the time will be more than an hour.
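To make the dependency on component counts concrete, here is a minimal Python sketch of a back-of-the-envelope estimator. The `DrInventory` type, the function name, and above all the per-item durations are assumptions made up for illustration; they are not measured or Aviatrix-published values, so replace them with timings from your own test deployments.

```python
from dataclasses import dataclass


@dataclass
class DrInventory:
    """Counts of components to build in the secondary Region (hypothetical type)."""
    gateways: int
    peerings: int
    nat_rules: int


# Placeholder minutes per item. These are assumptions for illustration only,
# not measured or vendor-published values; replace them with your own timings.
MINUTES_PER_ITEM = {"gateways": 10.0, "peerings": 2.0, "nat_rules": 0.5}


def estimate_recovery_minutes(inv, minutes_per_item=MINUTES_PER_ITEM, parallelism=1):
    """Crude estimate: total work divided evenly across parallel workers."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    total = (
        inv.gateways * minutes_per_item["gateways"]
        + inv.peerings * minutes_per_item["peerings"]
        + inv.nat_rules * minutes_per_item["nat_rules"]
    )
    return total / parallelism


inventory = DrInventory(gateways=8, peerings=6, nat_rules=20)
print(estimate_recovery_minutes(inventory))                 # serial build
print(estimate_recovery_minutes(inventory, parallelism=2))  # two parallel workers
```

The even split across workers is a deliberate simplification. Real deployments have ordering dependencies (for example, a peering cannot be built before both Gateways exist), so treat the result as a rough lower bound rather than a prediction.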

Diagram: JIT Data-Plane deployment in DR Region

Advantages

No additional Azure/AWS costs (compute resources)
No Aviatrix cost until the deployment
No need to synchronize the configuration between the primary and the secondary environments

Disadvantages

Long Recovery Time (1hr+), because the Aviatrix Gateways, peerings, and SNAT/DNAT rules must all be built during the outage
The risk of deployment issues, which can be mitigated by using Terraform
A finite number of Azure/AWS compute resources in a Region, which creates a risk of resource contention between many customers/organizations

Cold Data-Plane in DR Region

The Cold data-plane approach assumes that the data plane (Aviatrix Spoke and Transit Gateways) is already deployed in both environments/Regions (the primary “active” one and the secondary “standby” one), but the Gateway instances in the secondary environment are shut down. This saves both the Azure/AWS compute cost (no compute costs are charged until the instances are used) and the Aviatrix cost (the tunnels are down). In other words, the customer deploys the same set of Aviatrix Spoke and Transit Gateways in both environments, but only the Gateways and Tunnels in the primary environment are UP and functioning. The customer workloads and all the traffic flows are active only in the primary Region.

Some considerations must be taken into account when it comes to the Cold data-plane scenario:

Prerequisite: Do not deploy the Aviatrix Controller in the primary environment’s Region. A failure of that Region would then take down the control plane as well. Because spinning up the Gateways depends on the Controller being reachable, the Controller would have to be restored first before the data plane (Gateways) in the secondary environment could be activated. A shared failure domain between the Controller and the primary Region could be catastrophic (as discussed in the Part #1 post).
The risk: A primary Region outage would potentially affect thousands of other customers/organizations. The Azure/AWS compute resources within a particular Region are finite. Imagine that those other customers want to restore their workloads in the same Region where your secondary Aviatrix Gateways are deployed but shut down, just as you are trying to bring them up.
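The prerequisite above is easy to check automatically, for example as part of a deployment pipeline. The following is a minimal Python sketch; the function name is hypothetical and the Region names are only example values.

```python
def shared_failure_domains(controller_region, primary_region, secondary_region):
    """Return the data-plane environments that share a Region with the Controller.

    An empty list means the Controller sits in its own failure domain.
    """
    controller = controller_region.strip().lower()
    shared = []
    if controller == primary_region.strip().lower():
        shared.append("primary")
    if controller == secondary_region.strip().lower():
        shared.append("secondary")
    return shared


# Example values only; substitute the Regions you actually use.
problems = shared_failure_domains(
    controller_region="germanywestcentral",
    primary_region="westeurope",
    secondary_region="northeurope",
)
if problems:
    print("Controller shares a Region with:", ", ".join(problems))
else:
    print("Controller is in a separate Region")
```

The check also flags the secondary Region, which goes slightly beyond the Cold prerequisite but matches the general recommendation to keep the Controller outside both environments.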

The interconnection between the primary and secondary environments is not required because inter-region peering between Transit Gateways will be down anyway in this case.

The expected recovery time from a cloud network perspective will be less than an hour because the Aviatrix Gateways are already pre-built in the secondary environment. The only thing left to do is to start them, so the recovery time depends on how long the Gateways take to change their state from down to up and how long the Tunnels between them take to be established.

Advantages

Quickly available (<1hr), with a fair degree of certainty, as the secondary environment is already pre-built but shut down
No time is required for building the data plane because it is already pre-built. Time is still required to bring up the Gateways and all the Tunnels between them
No Azure/AWS compute cost because the Gateways are deployed but shut down. No compute costs are charged until the Gateways are activated
No Aviatrix cost (tunnels down)

Disadvantages

The risk of activating the Gateways in the secondary location at the same time as other organizations are trying to spin up their compute resources. Azure/AWS compute resources in a particular Region are finite
Any configuration changes to the primary location must be executed on the secondary location as well (Terraform could be used to make it automated and consistent)
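Keeping the two environments in sync is easier when drift is detected automatically. Below is a minimal Python sketch that compares two configuration snapshots. The dictionary keys and values are hypothetical, and how you export such a snapshot (from Terraform state, an inventory file, or elsewhere) is left open.

```python
MISSING = "<missing>"


def config_drift(primary, secondary, ignore=()):
    """Return {key: (primary_value, secondary_value)} for every differing key.

    Keys listed in `ignore` are expected to differ (for example the Region).
    """
    keys = (set(primary) | set(secondary)) - set(ignore)
    drift = {}
    for key in sorted(keys):
        left = primary.get(key, MISSING)
        right = secondary.get(key, MISSING)
        if left != right:
            drift[key] = (left, right)
    return drift


# Hypothetical snapshots of the network configuration in each environment.
primary_cfg = {"region": "westeurope", "gw_size": "Standard_D3_v2", "snat_rules": 4}
secondary_cfg = {"region": "northeurope", "gw_size": "Standard_D3_v2", "snat_rules": 3}

for key, (left, right) in config_drift(primary_cfg, secondary_cfg, ignore=("region",)).items():
    print(f"{key}: primary={left} secondary={right}")
```

Running a check like this on a schedule gives early warning that a change made in the primary location was never applied to the secondary one.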

Hybrid Hot/Cold Data-Plane

The Hybrid hot/cold data plane approach, as the name implies, is a mix of the Cold strategy (discussed above) and the Hot strategy (discussed below). The idea is to deploy the whole environment in the secondary Region but keep half of the Aviatrix Gateways in a down state (to reduce both the Azure/AWS and the Aviatrix cost).

Please notice that most of the Tunnels between Spoke and Transit Gateways will be down, and only one Tunnel will be UP. The reason is that a Tunnel can be active only if Gateways on both ends are active, which is shown in the diagram below.
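The rule that a Tunnel is up only when both of its endpoints are up can be expressed in a few lines of Python. This is a minimal sketch: the Gateway names are hypothetical, and it assumes that every Spoke Gateway builds a Tunnel to every Transit Gateway.

```python
from itertools import product


def tunnel_states(spoke_gws, transit_gws):
    """Map each (spoke, transit) Gateway pair to True if its Tunnel can be up.

    Both arguments are dicts of Gateway name -> True (running) / False (shut down).
    """
    return {
        (spoke, transit): spoke_up and transit_up
        for (spoke, spoke_up), (transit, transit_up) in product(
            spoke_gws.items(), transit_gws.items()
        )
    }


# Hybrid scenario: one Gateway of each pair is running, the other is shut down.
spokes = {"spoke-gw": True, "spoke-gw-ha": False}
transits = {"transit-gw": True, "transit-gw-ha": False}

states = tunnel_states(spokes, transits)
active = [pair for pair, is_up in states.items() if is_up]
print(f"{len(active)} of {len(states)} tunnels up: {active}")
```

With half of the Gateways shut down, only the pair of running Gateways can form a Tunnel, which is why a single Tunnel remains active between a Spoke and its Transit.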

In this scenario, the primary Region Transit Gateways and the secondary Region Transit Gateways can be interconnected to make the data plane available for replication during normal operations. However, this depends on the applications/databases deployed by the customer and on the IP address scheme used in the VPCs/VNETs (whether the same IP prefixes must be used in both environments or not).

As always, it is recommended to deploy the Aviatrix Controller in a different Region than the primary and secondary environments. More on that in Part #1 of this series.

When the primary Region fails, the secondary Region is ready to take the traffic because half of the Gateways (and part of the Tunnels) are already functioning. The Recovery Time is the same as in the “Hot data plane” approach. It means it is highly dependent on the time required by the applications/databases/workloads to be migrated from the primary environment to the secondary environment. From a cloud network perspective, the secondary environment is ready to take the traffic. The only required thing is to bring up the remaining half of the Gateways that are shut down to introduce resiliency and improve the performance.

This approach combines the advantages of the Hot and Cold strategies, though their disadvantages apply as well.

Advantages

Combines the advantages of Hot and Cold strategies
No time is required for building the data plane because it is already pre-built and functioning at the secondary location. Only the remaining half of the Gateways must be brought up
Seamless disaster recovery from a cloud network perspective = The secondary environment is immediately available with one tunnel between active Spoke/Transit Gateways
Data plane at the secondary location might be used for replication (depending on the applications/databases used) already during normal operations
Reduced Aviatrix license cost because only some tunnels are up

Disadvantages

The Azure/AWS cost because half of the Gateways are already deployed and functional at the pre-built secondary location and are consuming the compute resources
Any configuration changes to the primary location must be executed on the secondary location as well (Terraform could be used to make it automated and consistent)

Hot Data-Plane in DR Region

The Hot data-plane approach assumes that the data plane (Aviatrix Spoke and Transit Gateways) is already deployed, and running, in both environments/Regions.

As already mentioned a few times, this post focuses on the network perspective to achieve the desired DR strategy. The customer deploys the same set of Aviatrix Spoke and Transit Gateways in Primary and DR environments. All the Gateways and Tunnels in both parts are UP and functioning.

Please keep in mind though that there might be two sub-scenarios from the Application/DataBase perspective. There are a lot of factors (application and database related) that must be taken into account before choosing one or the other sub-scenario, incl. Are the applications ready to function in both environments? What about DNS setup? What about DataBase setup and replication? What about IP addressing? To name just a few. The sub-scenarios are:

Active/Standby: the primary Region is “active” and the secondary is “standby” from the Application/DataBase perspective. Though the network part is active, from the application perspective only the primary Region is UP. In this scenario, there is also a possibility to interconnect the primary and secondary environments to make the data plane (network) available for DB replication during normal operations.
Active/Active: both Regions are active from the Application/DataBase perspective, meaning a true Active/Active or Multi-Site solution is used. Applications are running in both Regions/environments.

It is recommended to deploy the Aviatrix Controller in a different Region than primary and secondary environments. More on that in section: “Separation of failure domains”.

The expected recovery time is the fastest one among all 4 presented scenarios. Of course, it also depends on the sub-scenario chosen from the Application/DB perspective (Active/Standby or Active/Active). For Active/Standby: the recovery time is highly dependent on the time required by the applications/databases/workloads to be migrated from the primary environment to the secondary environment.

Nevertheless, from a cloud network perspective, the secondary environment is up and running all the time and it does not require any additional configuration.

Advantages

The most desirable solution in terms of service uptime, with the best recovery time
No time is required for building the Aviatrix data plane because the data plane is already pre-built and functioning at the secondary location
Seamless disaster recovery from the Aviatrix cloud network perspective = The secondary environment is immediately available
Data plane at the secondary location might be used for replication (depending on the applications/databases used) already during normal operations OR the secondary location could be fully running (from an application/database perspective)

Disadvantages

The highest AWS/Azure cost because Gateways are already deployed at the pre-built secondary location and are consuming the compute resources
The highest Aviatrix cost because all the tunnels are UP
Any configuration changes to the primary location must be executed on the secondary location as well (Terraform could be used to make it automated and consistent)

Summary

The first step of DR/BC planning is to gather application/database requirements. As soon as it is known what is feasible from the app/db perspective, the next step is to think about how the network could be used to achieve the goal. With Aviatrix, the network part of DR is straightforward.

Please find below the comparison of all solutions presented in this post:

JIT Data-Plane deployment
- The cheapest solution – DR network upfront costs can even be brought down to “zero”
- The longest network recovery time (1hr+)
- The whole environment must be rebuilt in DR location
Cold Data-Plane
- No upfront network Azure/AWS and Aviatrix cost (the same advantage as with the JIT approach).
- Network Recovery Time of under an hour is achievable
- Network configuration in DR location must be synchronized with configuration in Primary location
Hybrid Hot/Cold Data-Plane
- Network in DR location is Active (means seamless DR)
- It is possible to have traffic flowing to the DR location (e.g. db replication)
- There are some Azure/AWS compute costs because half of the Aviatrix Gateways are up
- There is some Aviatrix cost because 1 tunnel is up between each Gateway pair
- Network configuration in DR location must be synchronized with configuration in Primary location
Hot Data-Plane
- Network in DR location is fully UP at full capacity (seamless DR)
- Compute cost (AWS/Azure) might be high (compute resources are allocated for the Gateways)
- Aviatrix cost (all tunnels are up)
- Network configuration in DR location must be synchronized with configuration in Primary location
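
The comparison above can also be captured as data, which makes it easy to shortlist strategies against a requirement. The Python sketch below encodes only the network-side properties listed in this post. The field names, the three-level cost scale, and the ranking are simplifying assumptions, and application/database requirements still have to be evaluated separately.

```python
from dataclasses import dataclass

RECOVERY_RANK = {"immediate": 0, "under_1h": 1, "over_1h": 2}
COST_RANK = {"none": 0, "partial": 1, "full": 2}


@dataclass(frozen=True)
class Strategy:
    name: str
    network_recovery: str   # key of RECOVERY_RANK
    standing_cost: str      # key of COST_RANK (compute and Aviatrix cost before an outage)
    replication_path: bool  # DR network usable for replication in normal operations


STRATEGIES = [
    Strategy("JIT", "over_1h", "none", False),
    Strategy("Cold", "under_1h", "none", False),
    Strategy("Hybrid Hot/Cold", "immediate", "partial", True),
    Strategy("Hot", "immediate", "full", True),
]


def candidates(max_recovery, needs_replication_path=False):
    """Return the names of matching strategies, cheapest standing cost first."""
    limit = RECOVERY_RANK[max_recovery]
    matching = [
        s for s in STRATEGIES
        if RECOVERY_RANK[s.network_recovery] <= limit
        and (s.replication_path or not needs_replication_path)
    ]
    # sorted() is stable, so strategies with equal cost keep their listed order.
    return [s.name for s in sorted(matching, key=lambda s: COST_RANK[s.standing_cost])]


print(candidates("under_1h"))
print(candidates("over_1h", needs_replication_path=True))
```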

I hope this post was informative.
