
A guide to EC2 disaster recovery using Terraform.
In the fast-paced world of cloud computing, everyone is looking for streamlined ways to manage their infrastructure. Enter Terraform, a tool that is changing the game by simplifying infrastructure management with its Infrastructure as Code (IaC) approach.
In this article, we're diving into a practical Terraform success story that breaks the mould: using Terraform for EC2 disaster recovery.
The task is as follows: the infrastructure (dozens of services hosted on EC2), which cannot use a multi-zone deployment due to certain constraints, must be restored from snapshots in a secondary availability zone if the primary one fails. This infrastructure is (you guessed it) managed by Terraform.
So we need to develop logic in Terraform that determines whether restoration from a snapshot is required and whether restoration is possible.
This is a basic TF configuration (an EC2 instance with an additional EBS volume) that we will enhance step by step to reach the desired functionality.
We assume that snapshots are already configured for the root and additional volumes, so that part of the configuration is not included in this article.
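Since the snapshot setup is out of scope here, the sketch below shows one way it could look, using Amazon Data Lifecycle Manager. The IAM role variable, tag values, and schedule are illustrative assumptions, not part of the original setup.

```hcl
# Hypothetical snapshot policy via Data Lifecycle Manager.
# var.dlm_role_arn and the tag/schedule values are assumptions.
resource "aws_dlm_lifecycle_policy" "ebs_snapshots" {
  description        = "Daily snapshots of tagged volumes"
  execution_role_arn = var.dlm_role_arn
  state              = "ENABLED"

  policy_details {
    resource_types = ["VOLUME"]

    # Snapshot every volume carrying this tag
    target_tags = {
      Snapshot = "true"
    }

    schedule {
      name = "daily"

      create_rule {
        interval      = 24
        interval_unit = "HOURS"
        times         = ["03:00"]
      }

      retain_rule {
        count = 7
      }

      # Propagate volume tags (e.g. Name) so the snapshot
      # data sources used later in this article can find them
      copy_tags = true
    }
  }
}
```

With `copy_tags = true`, the snapshots inherit the volume's `Name` tag, which is what the `tag:Name` filters in the following sections rely on.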
resource "aws_instance" "ec2_instance" {
  ami           = local.ami
  instance_type = var.instance_type
  subnet_id     = var.subnet
  key_name      = var.ssh_key

  root_block_device {
    delete_on_termination = true
    volume_size           = var.root_volume_size
    volume_type           = "gp3"
    tags = {
      Name = "root_disk"
    }
  }
}

resource "aws_ebs_volume" "ebs_disk" {
  availability_zone = var.availability_zone
  size              = var.data_volume_size
  type              = "gp3"
  tags = {
    Name = "data_disk"
  }
}

resource "aws_volume_attachment" "ebs_disk_att" {
  device_name = var.ebs_mount_point
  volume_id   = aws_ebs_volume.ebs_disk.id
  instance_id = aws_instance.ec2_instance.id
}
First, we want to recover the additional EBS volume from a snapshot when:
- the EBS volume does not exist
- snapshots for this EBS volume exist
We can do this by adding the following to our initial TF configuration.
# check whether ebs volumes exist
data "aws_ebs_volumes" "volumes" {
  filter {
    name   = "tag:Name"
    values = ["data_disk"]
  }
  filter {
    name   = "availability-zone"
    values = [var.availability_zone]
  }
}

data "aws_ebs_volume" "volume" {
  count       = length(data.aws_ebs_volumes.volumes.ids) != 0 ? 1 : 0
  most_recent = true
  filter {
    name   = "tag:Name"
    values = ["data_disk"]
  }
  filter {
    name   = "availability-zone"
    values = [var.availability_zone]
  }
}

# check whether snapshots exist
data "aws_ebs_snapshot_ids" "snapshots" {
  filter {
    name   = "tag:Name"
    values = ["data_disk"]
  }
}

data "aws_ebs_snapshot" "ebs_snapshot" {
  count       = length(data.aws_ebs_snapshot_ids.snapshots.ids) != 0 ? 1 : 0
  most_recent = true
  filter {
    name   = "tag:Name"
    values = ["data_disk"]
  }
}

# define a local variable for the snapshot id
locals {
  snapshot_id = length(data.aws_ebs_volumes.volumes.ids) == 0 && length(data.aws_ebs_snapshot_ids.snapshots.ids) != 0 ? data.aws_ebs_snapshot.ebs_snapshot[0].id : ""
}

resource "aws_ebs_volume" "ebs_disk" {
  ...
  snapshot_id = local.snapshot_id
  ...
}
This configuration leverages data sources for EBS volumes and snapshots to determine whether the resources exist. Note that we first have to query "aws_ebs_volumes" and only then "aws_ebs_volume": "aws_ebs_volumes" returns an array of EBS volume IDs, and Terraform does not fail when that array is empty, whereas "aws_ebs_volume" returns exactly one volume and Terraform raises an error if it is not found. The same applies to "aws_ebs_snapshot_ids" and "aws_ebs_snapshot".
The snapshot_id attribute specifies the ID of the EBS snapshot from which the volume should be created. The value is assigned from the local variable local.snapshot_id and is ignored when set to "", in which case an empty volume is created as usual.
Now we can do something similar for the root volume.
We want to recover the EC2 instance from a snapshot when:
- a snapshot for the root volume exists
- an AMI for the EC2 instance is registered
- the EC2 instance (target VM) does not exist
# check whether a root volume snapshot exists and register an ami from it
data "aws_ebs_snapshot_ids" "root_snapshots" {
  filter {
    name   = "tag:Name"
    values = ["root_disk"]
  }
}

data "aws_ebs_snapshot" "ebs_root_volume" {
  count       = length(data.aws_ebs_snapshot_ids.root_snapshots.ids) != 0 ? 1 : 0
  most_recent = true
  filter {
    name   = "tag:Name"
    values = ["root_disk"]
  }
}

resource "aws_ami" "ami" {
  count            = length(data.aws_ebs_snapshot.ebs_root_volume) != 0 ? 1 : 0
  name             = "custom-ami"
  root_device_name = "/dev/xvda"
  ebs_block_device {
    device_name = "/dev/xvda"
    snapshot_id = data.aws_ebs_snapshot.ebs_root_volume[0].id
    volume_size = var.root_volume_size
  }
}
# define a local variable for the ami and prevent the ec2 from being re-created on ami change
locals {
  ami = length(aws_ami.ami) == 1 ? aws_ami.ami[0].id : var.ami
}

resource "aws_instance" "ec2_instance" {
  ...
  ami = local.ami
  lifecycle {
    ignore_changes = [
      ami
    ]
  }
  ...
}
ami = local.ami: The ami attribute specifies the Amazon Machine Image (AMI) to use for the EC2 instance. Here it takes the value of the local variable local.ami, which resolves to the recovery AMI if one was registered and to var.ami otherwise.
lifecycle { ignore_changes = [ami] }: This block configures the lifecycle behaviour of the resource. Specifically, it instructs Terraform to ignore changes to ami, so that a newly registered recovery AMI does not cause Terraform to attempt to recreate a healthy running EC2 instance.
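Putting the pieces together, the instance resource might look like the sketch below. The variables mirror the base configuration at the top of the article; everything else is assembled from the fragments shown above rather than taken from the original project verbatim.

```hcl
resource "aws_instance" "ec2_instance" {
  ami           = local.ami # recovery AMI if registered, otherwise var.ami
  instance_type = var.instance_type
  subnet_id     = var.subnet
  key_name      = var.ssh_key

  root_block_device {
    delete_on_termination = true
    volume_size           = var.root_volume_size
    volume_type           = "gp3"
    tags = {
      Name = "root_disk"
    }
  }

  # Ignore ami drift so a freshly registered recovery AMI
  # never forces replacement of a healthy instance
  lifecycle {
    ignore_changes = [
      ami
    ]
  }
}
```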
That’s it. As you can see, the solution is simple and pretty straightforward.
The Terraform code in this article has been successfully tested. Terraform, recognised as a key player in Infrastructure as Code (IaC), demonstrated its versatility by addressing the non-standard use case of EC2 disaster recovery in a real-world project.
Get in Touch.
Let’s discuss how we can help with your cloud journey. Our experts are standing by to talk about your migration, modernisation, development and skills challenges.
