Migrating on-prem to AWS with ease, using CloudEndure

The Why

We at Nordcloud have a multitude of customers that turn to us for migrating their on-premise workloads to a public cloud provider. Every migration case is different, each with its own unique starting point, technical challenges and requirements. In the project described here, we started with an in-depth assessment of the current environment and agreed with the customer that most of the workloads would be rehosted to AWS. During this process, the team assessed the available tooling and services that could best facilitate a smooth and efficient migration. We found CloudEndure to be the best tool for lifting and shifting more than a hundred servers.

The How

Lift-and-shift, a.k.a. rehosting, means taking servers as they are in the current environment and spinning up exact copies in the new environment. In our case the migration was from an on-premise datacenter to AWS. The on-premise datacenter architecture relied on VMware to virtualize the workloads, but the decision was made not to migrate to VMware on AWS, but directly to EC2. CloudEndure's strength lies exactly in this. It automates the process of creating a snapshot of the server, transferring it to AWS, performing the OS level changes that are required because of the underlying infrastructure change, and ultimately starting the new instance up.

It executes the whole process without affecting anything on the source instance, allowing for minimal cutover downtime. The source doesn't need to be powered off, restarted or modified in any way, shape or form for a fresh copy of the instance to be started on AWS. Our cutover processes fell into a couple of categories. For example, if the server in scope was a stateful one and also required data consistency (e.g. self-hosted databases), we first shut down the processes of the application, waited a very short time for CloudEndure to sync the changes to the target environment, then started up the instance in the new environment.
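
The stateful cutover described above boils down to a fixed order of operations. Here is a minimal sketch of that order, assuming three hypothetical callables that you would supply yourself (`stop_application`, `replication_lag_seconds` and `launch_target`); none of them are part of CloudEndure, they only stand in for whatever tooling you use:

```python
import time
from typing import Callable


def cutover_stateful_server(
    stop_application: Callable[[], None],
    replication_lag_seconds: Callable[[], float],
    launch_target: Callable[[], str],
    poll_interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Stop writes on the source, wait for replication to catch up, then launch."""
    # 1. Stop the application so no new writes hit the source disks.
    stop_application()

    # 2. Wait until the replica has caught up with the now-quiet source.
    waited = 0.0
    while replication_lag_seconds() > 0:
        if waited >= timeout:
            raise TimeoutError("replication did not catch up in time")
        sleep(poll_interval)
        waited += poll_interval

    # 3. Only now start the copy in the target environment.
    return launch_target()
```

The point of the ordering is that the target is launched only after the last write on the source has been replicated, which is what gives you data consistency for things like self-hosted databases.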

The way CloudEndure works is really simple. You execute the provided script on the source machine, which installs and configures the CloudEndure agent. This agent in turn will start communicating with CloudEndure's public endpoints, a.k.a. "Control Tower". You set up a project within the console and the instances will automatically get added to it. Once your instance shows up, the replication will start. CloudEndure takes a 1:1 snapshot of the instance and continuously replicates new changes, as long as replication is not stopped manually and the instance is not powered off. It offers you an always up-to-date copy - with approx. 5 min latency - that you can spin up any time with a single button. During the cutover process, the OS level changes are performed. It will also ensure that Windows Servers end up on a licensed AMI. This is especially useful because it eases the burden of licensing, which many customers face. This functionality, however, is only supported for Windows and no other operating systems (more on this in the Learnings section).

The Tech

From a technical perspective, CloudEndure's console [Control Tower] makes API calls on your behalf to AWS services. It doesn't deploy a CloudFormation stack or need anything special. It uses an IAM access and secret key pair, for which there is a pre-made IAM policy template following IAM best practices. After you set up your project, a Replicator EC2 instance will be spun up in your target region. This instance handles the temporary storage of the to-be-rehosted instances' snapshots. Each disk on each source server [that you deploy the agent to and start replication for] will be created as an EBS volume and attached to the Replicator instance. You can specify what instance type you want your Replicator instance(s) to be, and I would highly recommend going with AMD-based instances. In our experience, they have been noticeably faster in the cutover, plus they're simply cheaper. Since your Replicator instance(s) will be running constantly throughout the whole period of the migration - which can take a long time - make sure to adjust the sizing to your needs.

The second part of the game is the Machine Converter instance, which does the heavy lifting for the changes required on the OS and disk level. It is amazing how well it works, especially for Windows instances. It performs all the modifications that are required by the complete change of the underlying hypervisor. Windows does not lose its domain join in the process either; the new server comes up as if it had merely been rebooted.

From the moment you initiate a cutover from the console [or through the API], it generally takes ~20-30 minutes for Windows instances, and 5-10 minutes for Linux instances, to be up and running in AWS. The latest snapshot at the start of the cutover is taken, the Machine Converter does its black magic, and voila! The new server is up and running.

CloudEndure Demo

In this short demo I'll demonstrate the process of an instance's migration. I've set up a simple web server on a RHEL7 instance in GCP, which we'll be migrating to AWS. Here's the instance in the GCP console, and the contents of its web server.

A very simple web server running on the source

First you create a new project in the CloudEndure console after you sign up. With the plus sign you can create multiple projects. Each project is tied to a single target AWS account, so if you want to spread your workloads across multiple accounts, you need one project per account. Then you paste the pair of access and secret keys for your IAM user, which has the CloudEndure-provided IAM policy attached.

Paste the credentials of the IAM user here

Then you're presented with the configuration. This is where you set the target region in AWS and your...well...settings. As I mentioned earlier, I recommend AMD instances and a beefy Machine Converter, as the latter will only run for short periods of time but makes a big difference in speed. Sometimes, you can throw money at the problem. You can also set it to use a dedicated Direct Connect connection so the migration doesn't impact your regular internet traffic. If you do need to use the internet, you can also restrict the maximum bandwidth CloudEndure (CE) will use.
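
When you cap the bandwidth, it helps to know roughly how long the initial sync will take. The following is a back-of-the-envelope sketch, not anything CloudEndure provides: the function name and the example figures are made up, and it ignores protocol overhead, compression and data that changes during the sync.

```python
def initial_sync_hours(data_gb: float, bandwidth_mbps: float) -> float:
    """Estimate hours to copy data_gb gigabytes over a link capped at bandwidth_mbps."""
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth_mbps must be positive")
    megabits = data_gb * 1000 * 8  # gigabytes -> megabits (decimal units)
    seconds = megabits / bandwidth_mbps
    return seconds / 3600


# Example: 2 TB of source disks over a 200 Mbps cap.
print(f"{initial_sync_hours(2000, 200):.1f} h")  # prints "22.2 h"
```

Treat the result as a lower bound; a busy source server will take longer because changes keep arriving while the first copy is still in flight.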

Under "Machines" you'll get the instructions on how to deploy the agent on the source server(s). It's a simple two-liner and it can be non-interactive. It'll automatically register the server with the console and even start replication to the target region. Or not, as you can turn off pieces of this automation by using flags during the execution of the install script.

Instructions on how to install the agent on the source machines

In this example I'm running it manually via a terminal on the source machine. You can, however, put this into your existing Ansible workflows, for example.
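
If you want to script the installation, it can help to build the command in one place. The sketch below is an assumption-heavy illustration: the installer file name and the flags (`-t`, `--no-prompt`, `--no-replication`) are recalled from CloudEndure's agent installation instructions and may differ for your account or agent version, so copy the exact command from the "Machines" page before relying on it. The helper name is made up.

```python
import shlex
from typing import List


def agent_install_command(token: str, start_replication: bool = True) -> List[str]:
    """Build a non-interactive agent install command for a Linux source machine."""
    # Assumed installer name and flags; check them against your console.
    cmd = ["sudo", "python", "./installer_linux.py", "-t", token, "--no-prompt"]
    if not start_replication:
        # Assumed flag: register the machine but do not begin replicating yet.
        cmd.append("--no-replication")
    return cmd


# "YOUR-INSTALLATION-TOKEN" is a placeholder, not a real token.
print(shlex.join(agent_install_command("YOUR-INSTALLATION-TOKEN", start_replication=False)))
```

Returning the command as a list keeps it usable both with `subprocess.run` and as the argument list of an automation task, without any shell quoting surprises.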

After the agent is installed, CE can configure everything automatically and also begin the replication by deploying the Replication Server in your region - if there isn't one already. Each disk on the source servers will be attached to the Replication Server as a separate EBS volume. Since there is a maximum number of volumes you can attach to an instance, multiple Replication Servers will be deployed if needed. CE will replicate the contents of each source disk to its EBS volume, and continuously update it when there are changes on the source disk.
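
Because every source disk becomes one EBS volume on a Replication Server, you can estimate up front how many Replication Servers a batch of machines will need. This is only a rough sketch: the per-server volume limit is a parameter you have to look up yourself (the value 15 below is an assumption, not a documented figure from this article), and the inventory is invented.

```python
import math
from typing import Dict


def replication_servers_needed(disks_per_server: Dict[str, int], max_volumes_per_replicator: int) -> int:
    """Estimate how many Replication Servers are needed for the given source disks."""
    if max_volumes_per_replicator <= 0:
        raise ValueError("max_volumes_per_replicator must be positive")
    total_disks = sum(disks_per_server.values())
    # One EBS volume per source disk, packed onto as few servers as possible.
    return math.ceil(total_disks / max_volumes_per_replicator)


# Hypothetical inventory: source server name -> number of disks.
inventory = {"web01": 2, "db01": 6, "app01": 3}
print(replication_servers_needed(inventory, max_volumes_per_replicator=15))  # prints 1
```

Since the Replication Servers run for the whole migration, this number feeds directly into the cost estimate for the project.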

You can monitor the replication in the console

By clicking on the instance in the console, you get to configure the Blueprint for the cutover. This is where you define your target instance's details. The options are essentially 1:1 to what you'd be able to customize when launching an EC2 instance directly. It can also add an IAM role to the instance, but be aware that the provided IAM policy does not allow that (gotcha!).
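
The article does not say which permission is missing. In AWS, launching an instance with a role attached generally requires the caller to have `iam:PassRole` on that role, so a statement along these lines is a likely candidate to add to the CE user's policy. Take it as a sketch under that assumption; the account ID, role name and statement ID are hypothetical.

```python
import json


def pass_role_statement(role_arn: str) -> dict:
    """Build an IAM policy statement that allows passing one specific role."""
    return {
        "Sid": "AllowPassingTargetInstanceRole",
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": role_arn,
    }


# Hypothetical account ID and role name, for illustration only.
statement = pass_role_statement("arn:aws:iam::123456789012:role/migrated-app-role")
print(json.dumps({"Version": "2012-10-17", "Statement": [statement]}, indent=2))
```

Scoping `Resource` to the specific role, rather than `*`, keeps the extra permission as narrow as the rest of the provided policy.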

Use your preconfigured resources or create new ones, either works

After the console reports "Continuous Data Replication" for the instance, you can initiate a cutover any time with the "Launch target machine" button. You can choose between "Test Mode" and "Cutover"; however, the only difference between the two is how the history is displayed in the CE console. There is no technical difference between the two cutover types. You can monitor the progress in the "Job Progress" section on the left.

Fire away and lean back to witness the magic - for 8 minutes

This is what the process looks like on the AWS side. First, a snapshot is taken of the respective [source] EBS volume on the Replication Server. Then the Machine Converter comes in, does its job and terminates once it finishes. Finally the cloudendure-source instance is started (even the hostname persists from the source).

Navigating to the DNS name or IP of the new server, we can see that the same page is served by the new instance. Of course this was a very limited demo, but the instance in AWS is indeed an exact replica of the source.

Learnings about CloudEndure

We only learned about CloudEndure's post-launch-script functionality quite late in the project. We utilized it once to run a complex application's migration, including self-hosted databases and multiple app servers. It allowed us to complete the cutover and all the post-migration tasks in under two hours. We had set up and tested the cutover process in depth beforehand. At the time of the cutover this complex environment started in AWS with all the required configuration changes, without any need for manual input. With the necessary preparation, this can allow for minimal service disruptions when migrating traditional workloads to the cloud.

CloudEndure has more potential. While our team has not explored other usage patterns, it could also be implemented as a disaster recovery tool. For example, your EC2 instances in Frankfurt could be continuously replicating to Ireland. In case there's a region-wide, business-impacting outage in Frankfurt, you could spin up your servers in Ireland reliably and [most importantly] fast.

Windows is better supported than Linux-based instances. The migration of Windows Server (from 2008 R2) will also make sure that the AMI of the AWS instance matches the OS. This is important for licensing, but it's unfortunately not supported on Linux-based instances. There were quite a lot of Red Hat servers in the scope of this migration, and we realized that AWS had no knowledge that they were running RHEL (the "AMI" in the instance details essentially reported "unknown"). Therefore licensing was not automatically channeled through AWS, and the team had to "fix them" in a timely manner. When we became aware of the issue, we learned that CloudEndure is capable of using pre-made instances as targets instead of creating brand new ones. This way, we could specify the AMI by creating a new target instance with the required details, and CE would replace only the drives of the instance, thereby keeping the AMI and licensing. We had tried to use this "target-instance" functionality before, but we received errors every time. This isn't mentioned in the docs, but we found that the IAM policy assigned to the CE user in the AWS account has limited access to EC2. It uses conditions that restrict EC2 actions to resources tagged with CloudEndure Creation Time (any tag value works). Therefore both the instance and its current disks have to have that tag, otherwise CE will not be able to detach the existing disks and attach the new ones to the instance.
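
Before pointing CE at a pre-made target instance, it is worth checking that the instance and all of its volumes carry the tag. A minimal sketch of such a check follows; it works on plain dictionaries rather than calling AWS, the resource IDs and tag values are hypothetical, and the tag key is spelled the way it appears in this article, so verify the exact key against the policy in your own account.

```python
from typing import Dict, List

REQUIRED_TAG = "CloudEndure Creation Time"  # tag key as named in this article


def resources_missing_tag(
    resource_tags: Dict[str, Dict[str, str]], tag_key: str = REQUIRED_TAG
) -> List[str]:
    """Return the IDs of resources that do not carry tag_key (any value counts)."""
    return sorted(rid for rid, tags in resource_tags.items() if tag_key not in tags)


# Hypothetical target instance with its two attached volumes.
target = {
    "i-0aaa": {"Name": "rhel-target", "CloudEndure Creation Time": "n/a"},
    "vol-0bbb": {"CloudEndure Creation Time": "n/a"},
    "vol-0ccc": {"Name": "data"},
}
print(resources_missing_tag(target))  # prints ['vol-0ccc']
```

An empty list means the tag-based condition in the policy should no longer block the detach and attach steps; anything listed needs to be tagged first.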

CloudEndure: the good, the bad and the small amount of ugly

CloudEndure makes lift-and-shift migrations [to AWS] quite effortless. As we discovered, however, it is not as mature as other AWS services, given that CE was only acquired by AWS in 2019. We could rely on it for all our tasks, but we ran into multiple shortcomings.

One of these concerns the documentation. It is extensive and covers most scenarios, but the coverage of not-so-edge cases like the "target instance cutover" lacked a crucial piece of information (tagging).

The other major pain point was the inconsistency of the web interface. There were multiple times when instances reported an up-to-date "continuous replication" state, but suddenly jumped to "replicating...10 minutes left". This would have been understandable in cases where a lot of data was changing on the source servers, but it occurred many times when that wasn't the case. The cutover still proceeds successfully, but seeing the instance suddenly jump to "wait, let me catch up, it'll take 10 minutes" and shortly after go back to "nope, all good, just kidding" was quite frustrating at times. This was especially nerve-racking just before a sensitive cutover. The UI can also be unreliable in other ways, e.g. when you save the blueprint, make sure to come back to it and double check that your configuration has actually been saved.

Getting the AMI correct is also limited to Windows Server. Support for other major operating systems, such as the industry-standard RHEL, would be great. The documentation should describe how to fix the AMIs afterwards or, better yet, how to use the Target Instance ID functionality to avoid this issue.
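
The blueprint double check mentioned above is easier to do reliably if you compare what you intended against what the console shows, setting by setting. Here is a small sketch of that comparison, assuming you have both sides available as dictionaries; the field names and values are hypothetical and do not reflect CloudEndure's actual blueprint schema.

```python
from typing import Any, Dict, Tuple


def blueprint_mismatches(
    intended: Dict[str, Any], saved: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Map each setting that differs to an (intended, saved) pair."""
    return {
        key: (value, saved.get(key))
        for key, value in intended.items()
        if saved.get(key) != value
    }


# Hypothetical field names and values, for illustration only.
intended = {"instanceType": "t3a.large", "subnet": "subnet-0123", "privateIp": "10.0.1.15"}
saved = {"instanceType": "t3a.large", "subnet": "subnet-0123", "privateIp": None}
print(blueprint_mismatches(intended, saved))  # prints {'privateIp': ('10.0.1.15', None)}
```

An empty result means every setting you care about survived the save; anything else is worth fixing before the cutover window opens.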