The 30-Day Cloud Migration Blueprint: Zero Downtime, Enterprise-Grade Results
Week-by-week cloud migration playbook with checklists. Real case study: $2M ARR SaaS migrated from AWS to Managed VPC in 28 days. Data migration, DNS cutover, rollback plans.
Most cloud migrations fail not because the destination platform is wrong, but because the team never built a map before they started driving. They skip the infrastructure audit, underestimate data transfer windows, and discover critical dependencies mid-cutover at 2 AM on a Saturday. The result: emergency rollbacks, unplanned downtime, and a war story the engineering team tells for years.
This blueprint is designed to eliminate that outcome. It is built around a real-world migration — an anonymized $2M ARR SaaS company that moved its entire production workload from AWS to InMotion Cloud's Managed VPC in 28 days without a single minute of customer-facing downtime. Every checklist, bandwidth calculation, and rollback trigger in this guide comes from that engagement.
The Case Study: $2M ARR SaaS on AWS, Moving to Managed VPC
The company runs a B2B workflow automation platform serving mid-market customers. Their AWS footprint was straightforward but expensive:
- 3 EC2 instances (2x m5.xlarge app servers, 1x m5.2xlarge for background workers)
- RDS PostgreSQL 14 (db.r5.large, Multi-AZ, 500 GB storage)
- S3 for object storage (approximately 2.1 TB of customer attachments and exports)
- CloudFront CDN in front of their application tier
Why they moved: their AWS bill averaged $11,400/month with significant month-to-month variance driven by data transfer and CloudFront costs. Support responses averaged 18 hours on non-critical tickets. And the infrastructure had accumulated years of over-engineering — Lambda functions, SQS queues, and IAM policies that nobody on the current team fully understood.
Their target on InMotion Cloud: a clean Managed VPC architecture with right-sized instances (virtual machines), a managed database service, and object storage — simpler, more predictable, and with support they could actually reach.
The final migration completed on day 28. Month-one cloud costs: $4,850 — a 57% reduction.
Week 1: Discovery and Architecture (Days 1–7)
The first week is entirely about understanding what you have before you touch anything. Migrations that skip this phase discover surprise dependencies during cutover, which is the worst possible time.
Infrastructure Audit
Document every running resource in the source environment. Do not rely on what the documentation says — query the actual state. For AWS, aws ec2 describe-instances, aws rds describe-db-instances, and a CloudTrail review of the past 90 days will surface things no architecture diagram shows.
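The query-the-actual-state step can be scripted so the audit is repeatable and dated. A minimal sketch, assuming the AWS CLI is configured — the region argument and output file names are placeholders:

```shell
#!/usr/bin/env bash
# Hypothetical audit helper -- region and output paths are placeholders.
# Each command writes a dated snapshot so the audit can be re-run and diffed.
audit_aws_inventory() {
  region="$1"
  stamp=$(date +%Y%m%d)
  # Running compute instances, with type and state
  aws ec2 describe-instances --region "$region" \
    --query 'Reservations[].Instances[].{Id:InstanceId,Type:InstanceType,State:State.Name}' \
    --output table > "audit-ec2-$stamp.txt"
  # Managed databases, with engine and allocated storage (GB)
  aws rds describe-db-instances --region "$region" \
    --query 'DBInstances[].{Id:DBInstanceIdentifier,Engine:Engine,GB:AllocatedStorage}' \
    --output table > "audit-rds-$stamp.txt"
  # Buckets are account-global; listed once per account
  aws s3api list-buckets --query 'Buckets[].Name' \
    --output text > "audit-s3-$stamp.txt"
}
# Usage: audit_aws_inventory us-east-1
```

Diffing two dated snapshots a week apart also surfaces resources that appear and disappear on schedules — exactly the kind of thing an architecture diagram never shows.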
For this case study, the audit uncovered two undocumented Lambda functions that processed inbound webhook payloads and wrote directly to the PostgreSQL database. Nobody on the current team had built them. They would have broken silently during migration if not caught in Week 1.
Dependency Mapping
Map every external connection: outbound API calls, inbound webhooks, third-party integrations, email delivery services, DNS-dependent services, and hardcoded IP addresses in application configuration. Build a dependency graph, even a simple one. This graph becomes your cutover checklist.
Target Architecture Design
Design the destination environment before provisioning anything. For this migration, the target architecture on InMotion Cloud's Managed VPC included:
- 2x instances using a compute-optimized flavor (resource template) — replacing 3 AWS instances by right-sizing based on actual CPU and RAM utilization data (averaging 22% CPU and 38% RAM on the old instances)
- 1x managed PostgreSQL instance with high-availability replication
- Object storage bucket replacing S3, with S3-compatible API for minimal application code changes
- Load balancer with SSL termination
- Private network for database and inter-instance traffic
Risk Assessment
Score every component by two dimensions: migration complexity and business impact if it breaks. High-complexity, high-impact components (the database, in almost every case) get the most conservative migration path. Low-complexity, low-impact components (static asset storage) can be migrated aggressively.
Week 1 Checklist
- [ ] Complete inventory of all running instances, databases, storage buckets, and networking resources
- [ ] Document all inbound and outbound network dependencies
- [ ] Pull 90-day CPU, RAM, disk I/O, and network utilization data for all instances
- [ ] Identify all DNS records associated with the application
- [ ] Document all SSL/TLS certificates and their expiration dates
- [ ] Map all third-party integrations and their authentication methods
- [ ] Review IAM roles, service accounts, and credentials in use
- [ ] Design target architecture with named resources, IP ranges, and network topology
- [ ] Estimate total data volume for migration (storage + database)
- [ ] Identify maintenance windows and customer usage low points
- [ ] Get sign-off on target architecture from engineering leads and CTO
Rollback at Week 1: There is nothing to roll back. The source environment is untouched. The only risk is time spent on architecture that needs revision.
Week 2: Parallel Environment and Data Sync (Days 8–14)
Week 2 is when you build the destination environment and begin moving data. The principle is parallel operation — nothing in the source environment changes. Both environments run simultaneously, and the source remains authoritative until the moment of cutover.
Standing Up the Managed VPC Environment
Provision every resource defined in your Week 1 architecture design. Use infrastructure-as-code from day one, even if the source environment was built by hand. This gives you a repeatable build, a documented configuration, and a clean rollback path if the target environment needs to be rebuilt.
Validate networking before deploying applications: confirm private network connectivity between instances, confirm the load balancer health checks pass, and confirm external connectivity from the correct IP ranges.
Database Replication Setup
PostgreSQL migration has three viable strategies depending on database size and acceptable replication lag:
1. pg_dump / pg_restore — Appropriate for databases under ~50 GB with a maintenance window available. Take a consistent dump, transfer, restore. Simple, well-understood, no ongoing synchronization.
2. Logical replication (the pglogical extension or PostgreSQL's built-in publication/subscription mechanism) — Appropriate for large databases where you need near-zero downtime. Configure the source as a logical replication publisher, the target as a subscriber. Initial data sync runs in the background; ongoing changes stream continuously. Final cutover lag can be reduced to seconds.
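The publisher/subscriber setup for built-in logical replication is a pair of SQL statements. A sketch, assuming the source has wal_level=logical and a replication-capable role exists — hostnames, database name, and user are placeholders:

```shell
#!/usr/bin/env bash
# Hypothetical logical-replication setup -- hosts, dbname, and user are
# placeholders; the source cluster must be started with wal_level=logical.
setup_logical_replication() {
  # On the source: publish every table in the database
  psql "host=old-db.internal dbname=app user=replicator" \
    -c "CREATE PUBLICATION migration_pub FOR ALL TABLES;"
  # On the target: subscribe; the initial table sync runs in the background
  # and ongoing changes stream continuously afterward
  psql "host=new-db.internal dbname=app user=replicator" \
    -c "CREATE SUBSCRIPTION migration_sub
        CONNECTION 'host=old-db.internal dbname=app user=replicator'
        PUBLICATION migration_pub;"
}
# Monitor from the source with:
#   SELECT slot_name, confirmed_flush_lsn FROM pg_replication_slots;
```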
3. pg_basebackup + WAL streaming — Appropriate for very large databases (500 GB+) where logical replication overhead is unacceptable. Take a base backup, then stream WAL (write-ahead log) records continuously. The target stays within seconds of the source.
For this case study (500 GB PostgreSQL), we used pg_basebackup for the initial transfer followed by continuous WAL streaming. Bandwidth math: 500 GB at a sustained 400 Mbps transfer rate = approximately 2.8 hours for the initial base backup. With WAL streaming running continuously afterward, replication lag stabilized at under 5 seconds within 6 hours of starting.
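The base-backup step can be sketched as a single pg_basebackup invocation; --write-recovery-conf configures the target to keep streaming WAL from the source once the copy finishes. Hostnames, the data directory path, and the replication user are placeholders:

```shell
#!/usr/bin/env bash
# Sketch of the base backup + WAL streaming approach used in the case
# study -- host, user, and data directory are placeholders.
seed_standby() {
  pg_basebackup \
    --host=old-db.internal \
    --username=replicator \
    --pgdata=/var/lib/postgresql/14/main \
    --wal-method=stream \
    --checkpoint=fast \
    --write-recovery-conf \
    --progress
  # --wal-method=stream copies the WAL needed for a consistent start;
  # --write-recovery-conf drops standby.signal plus primary_conninfo so
  # the target continues streaming from the source after the copy.
}
```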
Object Storage Migration
For the 2.1 TB S3 bucket, we used rclone with the --transfers 32 flag to parallelize object transfers and --checksum to verify integrity. The S3-compatible API on InMotion Cloud's object storage meant the rclone configuration required only endpoint, access key, and secret key changes.
Bandwidth math: 2.1 TB at a sustained 600 Mbps = approximately 7.8 hours for initial sync. Run rclone sync again immediately before cutover to catch any objects written after the initial transfer. The delta sync for 2.1 TB of mostly static data completed in under 20 minutes.
Application Deployment to New Environment
Deploy the application to the new instances and validate it can connect to the replicated database (in read-only mode during this phase), the object storage bucket, and all third-party services. Smoke test every critical path manually. Do not rely on automated tests alone — click through the product as a real user would.
Week 2 Checklist
- [ ] All instances provisioned and accessible via SSH
- [ ] Private networking confirmed between all instances
- [ ] Load balancer provisioned with health checks passing
- [ ] SSL certificates installed and validated
- [ ] PostgreSQL replication running with lag under 10 seconds
- [ ] Object storage initial sync complete with checksum validation
- [ ] Application deployed to new instances and connecting to replicated database
- [ ] All environment variables and secrets configured in new environment
- [ ] Monitoring and alerting configured for new environment
- [ ] Confirmed new environment has no access to production traffic yet
Rollback at Week 2: Stop replication, terminate the target environment. Source environment is completely unaffected. Estimated rollback time: 15 minutes.
Week 3: Testing and Validation (Days 15–21)
By day 15, you have a fully operational parallel environment. Week 3 is about proving it is production-ready before a single customer request touches it.
Load Testing
Replay production traffic patterns against the new environment using tools like k6, Locust, or Apache JMeter. Do not use synthetic load profiles that don't match real usage — pull actual traffic patterns from your access logs. The goal is to confirm the new environment handles peak load (not average load) without degradation.
For this case study, peak load was 340 concurrent users during a Tuesday afternoon window. The new environment handled 450 concurrent users at lower p99 latency than the old environment, due to the less-contended hardware on InMotion Cloud's infrastructure.
Performance Benchmarking
Compare key metrics between old and new environments under equivalent load:
| Metric | AWS (old) | InMotion Cloud (new) |
|---|---|---|
| p50 API response time | 148 ms | 112 ms |
| p99 API response time | 890 ms | 620 ms |
| Database query time (p50) | 22 ms | 18 ms |
| Background job throughput | 1,200 jobs/hr | 1,450 jobs/hr |
Document these results. If anything regresses post-cutover, you have a baseline for diagnosis.
Security Audit
Review firewall rules on the new environment. Confirm that the database port is not publicly accessible, that instance-to-instance traffic uses the private network, and that all public-facing endpoints require TLS. Scan for open ports using nmap from an external vantage point.
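A minimal external-scan sketch — the target hostname is a placeholder, and the scan must run from a machine outside the new environment so it sees the public attack surface, not the private network view:

```shell
#!/usr/bin/env bash
# Hypothetical external port scan -- target host is a placeholder.
external_scan() {
  host="$1"
  # -Pn: skip the ping probe (cloud firewalls often drop ICMP);
  # check web ports plus the two that must NOT be publicly open
  nmap -Pn -p 22,80,443,5432 "$host"
}
# Expected result: 80/443 open, 22 filtered or IP-restricted,
# and 5432 (PostgreSQL) closed or filtered.
```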
Review all credentials: confirm no AWS access keys or environment-specific secrets have been hardcoded into application deployments on the new environment.
Rollback Procedure Documentation
Before Week 4, document the rollback procedure in writing and review it with the engineering team. Everyone involved in the cutover should know the rollback steps without having to look them up. Rollback procedures that require someone to think under pressure tend to be executed incorrectly.
Week 3 Checklist
- [ ] Load test completed at 150% of peak production traffic
- [ ] Performance benchmark documented for all key metrics
- [ ] Security port scan completed — no unexpected open ports
- [ ] Database firewall rules reviewed and confirmed
- [ ] SSL certificate expiration dates confirmed on new environment
- [ ] Backup and restore procedure tested (restore to point-in-time from new database)
- [ ] Monitoring dashboards confirmed functional on new environment
- [ ] Rollback procedure documented and reviewed with team
- [ ] Team trained on new infrastructure management tools
- [ ] Cutover runbook written with step-by-step instructions and owner assignments
Rollback at Week 3: Same as Week 2. Estimated rollback time: 15 minutes.
Week 4: Cutover and Stabilization (Days 22–28)
The cutover is the most operationally intense part of the migration, but if Weeks 1–3 were thorough, it should also be the most boring. Boring is the goal.
DNS Cutover Strategy
DNS is the lever you pull to redirect traffic from old to new. The strategy has three phases:
Phase 1 (Days 22–25): Reduce TTL. Reduce the TTL on all DNS records for the application domain to 60 seconds. This ensures that when you make the final change, it propagates globally within 1–2 minutes rather than waiting hours for cached records to expire. Do this several days before cutover, not hours before.
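It is worth verifying that the lowered TTL is actually being served, not just saved in the provider's console. A small sketch — the domain is a placeholder; the TTL is the second field of dig's answer-section output:

```shell
#!/usr/bin/env bash
# Extract the TTL from a dig answer line, e.g.:
#   "app.example.com. 60 IN A 203.0.113.10"  -> 60
answer_ttl() {
  awk '{ print $2; exit }'
}
# Query live DNS and report the served TTL -- domain is a placeholder
check_ttl() {
  dig +noall +answer "$1" A | answer_ttl
}
# Usage: check_ttl app.example.com   (should print 60 after Phase 1)
```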
Phase 2 (Days 26–27, optional): Weighted routing. If your DNS provider supports weighted routing (Route 53 does; many others do as well), shift 5% of traffic to the new environment and monitor for 24 hours. This surfaces any issues that synthetic load testing didn't catch. If errors appear, roll back the weight to 0% — the source environment is still handling 95% of traffic.
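On Route 53, the weighted shift is a change-resource-record-sets call. A hedged sketch — the zone ID, domain, and target IP are placeholders, and it assumes the old environment already has a matching weighted record (its own SetIdentifier, Weight 95) so the two records split traffic:

```shell
#!/usr/bin/env bash
# Hypothetical Route 53 weighted-routing shift -- zone ID, domain, and
# IP are placeholders. Both environments need a record with the same
# Name/Type but different SetIdentifier; Weight controls the split.
shift_weight_to_new_env() {
  aws route53 change-resource-record-sets \
    --hosted-zone-id Z0000000EXAMPLE \
    --change-batch '{
      "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "app.example.com",
          "Type": "A",
          "SetIdentifier": "new-env",
          "Weight": 5,
          "TTL": 60,
          "ResourceRecords": [{"Value": "203.0.113.10"}]
        }
      }]
    }'
}
```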
Phase 3 (Day 28): Final switch. Update DNS to point 100% of traffic to the new environment. With a 60-second TTL, propagation completes in under 2 minutes. Keep the old environment running for 48 hours post-cutover.
Monitoring During Transition
Define your monitoring criteria before cutover day. You need real-time visibility into:
- HTTP error rates (alert threshold: >1% 5xx errors over a 2-minute window)
- Application response time (alert threshold: p99 > 2x pre-cutover baseline)
- Database replication lag (alert threshold: >30 seconds — should be near zero at cutover)
- Background job queue depth (alert threshold: queue growing rather than draining)
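The first threshold above can be spot-checked directly from access logs if your dashboard ever looks suspect. A minimal sketch, assuming combined log format where the HTTP status is the ninth whitespace-separated field:

```shell
#!/usr/bin/env bash
# Percentage of 5xx responses in an access-log window. Assumes combined
# log format: field 9 is the HTTP status code.
error_rate() {
  awk '{ total++; if ($9 ~ /^5/) errors++ }
       END { printf "%.1f\n", (total ? 100 * errors / total : 0) }'
}
# Usage: tail -n 2000 access.log | error_rate   -> percent of 5xx responses
```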
Assign one team member solely to watching dashboards during the cutover window. That person's only job is to call a rollback if any threshold is breached.
Rollback Triggers and Procedure
Rollback triggers (any one of these = immediate rollback):
- HTTP error rate exceeds 1% for more than 3 consecutive minutes
- Any data corruption detected in the new database
- Payment processing or authentication failures
- On-call engineer judgment — if something feels wrong, roll back first, investigate second
Rollback procedure (estimated time: 4 minutes):
- Update DNS A record to point back to old environment load balancer IP — 1 minute
- Confirm traffic is flowing to old environment via monitoring — 2 minutes
- Verify error rates return to baseline — 1 minute
- Page engineering lead and begin post-mortem
The old environment remained in service, with replication continuing to run, for the full 48-hour post-cutover window. Rollback at any point in that window would have meant at most a few seconds of lag in the database — an entirely recoverable situation.
Week 4 Checklist
- [ ] TTL reduced to 60 seconds 72+ hours before cutover
- [ ] Final rclone sync of object storage completed (delta sync)
- [ ] Database replication lag confirmed under 5 seconds
- [ ] Monitoring dashboards open and confirmed functional
- [ ] Rollback triggers and procedure reviewed by all participants
- [ ] Customer support team notified of maintenance window (even if zero downtime is planned)
- [ ] DNS update executed
- [ ] Traffic confirmed flowing to new environment via access logs
- [ ] Error rates confirmed at baseline for 30 minutes post-cutover
- [ ] Old environment kept running for 48-hour post-cutover monitoring window
- [ ] DNS TTL restored to normal value (300–3600 seconds) after stabilization
Data Migration Strategies: The Technical Detail
Block Storage Migration
For application data stored on instance volumes, the simplest approach is rsync with the --checksum and --delete flags. Run an initial sync while the source is live, then run a final incremental sync during the cutover window before switching traffic. For large volumes (> 1 TB), run multiple rsync streams in parallel with xargs so a single stream doesn't bottleneck the transfer; rsync's delta-transfer algorithm (on by default over a network connection) avoids re-transmitting unchanged data on the incremental pass.
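The parallel pattern can be sketched as follows — source and destination paths are placeholders, and it assumes the top-level directory names contain no whitespace:

```shell
#!/usr/bin/env bash
# Hypothetical parallel rsync -- paths and host are placeholders.
# Fans the top-level directories out across 8 concurrent streams.
parallel_rsync() {
  src=/data
  dest="newhost:/data"
  ls "$src" | xargs -P 8 -I{} \
    rsync -a --checksum --delete "$src/{}" "$dest/"
}
# The final incremental pass during the cutover window should be one
# sequential run so --delete sees the whole tree:
#   rsync -a --checksum --delete /data/ newhost:/data/
```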
Bandwidth math for block storage:
- 100 GB at 400 Mbps = ~33 minutes
- 500 GB at 400 Mbps = ~2.8 hours
- 1 TB at 400 Mbps = ~5.6 hours
- 2 TB at 400 Mbps = ~11.1 hours
Plan your transfer windows around these estimates, with a 30% buffer for network variability.
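The estimates above follow from one formula — size in decimal gigabytes, times 8 to get gigabits, divided by sustained throughput. A small helper makes the arithmetic explicit (the 30% buffer is left to the planner):

```shell
#!/usr/bin/env bash
# Transfer-time estimate in hours: decimal GB at a sustained Mbps.
# Multiply the result by 1.3 to apply the 30% variability buffer.
transfer_hours() {  # usage: transfer_hours <size_gb> <mbps>
  awk -v gb="$1" -v mbps="$2" \
    'BEGIN { printf "%.1f\n", gb * 8 * 1000 / mbps / 3600 }'
}
# transfer_hours 500 400   -> 2.8
# transfer_hours 2000 400  -> 11.1
```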
Database Migration Options Compared
| Method | Best for | Downtime required | Complexity |
|---|---|---|---|
| pg_dump / pg_restore | < 50 GB, maintenance window available | Minutes to hours | Low |
| Logical replication | Any size, near-zero downtime | Seconds | Medium |
| pg_basebackup + WAL | > 500 GB, need physical consistency | Seconds (WAL cutover) | Medium-High |
For production SaaS migrations, logical replication or pg_basebackup + WAL streaming is almost always the right choice. The additional complexity is worth the reduction in cutover risk.
Object Storage Sync
rclone is the standard tool for S3-compatible object storage migration. Key flags:
```shell
rclone sync s3:source-bucket inmotioncloud:destination-bucket \
  --transfers 32 \
  --checkers 16 \
  --checksum \
  --progress \
  --log-file=migration-$(date +%Y%m%d).log
```
Run this once for the bulk transfer, then run it again immediately before cutover for the delta. Verify object count and total size match between source and destination before proceeding.
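The count-and-size verification can also be scripted. A sketch using the same placeholder remote names as the sync command above:

```shell
#!/usr/bin/env bash
# Post-sync verification -- remote names mirror the sync example and are
# placeholders for your configured rclone remotes.
verify_sync() {
  # Object count and total bytes for each side; the two should match
  rclone size s3:source-bucket
  rclone size inmotioncloud:destination-bucket
  # --one-way: report only objects missing or differing on the destination
  rclone check s3:source-bucket inmotioncloud:destination-bucket --one-way
}
```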
Post-Migration Validation: 30-Point Checklist
Complete this checklist within 48 hours of cutover:
Performance
- [ ] p50 and p99 API response times at or below pre-migration baseline
- [ ] Database query performance at or below pre-migration baseline
- [ ] Background job throughput at or above pre-migration baseline
- [ ] No memory or CPU pressure on any instance under normal load
- [ ] CDN cache hit rate at expected levels
Security
- [ ] All public-facing endpoints serving HTTPS only
- [ ] Database port not accessible from public internet
- [ ] SSH access restricted to known IP ranges or VPN
- [ ] All secrets rotated post-migration (old environment credentials retired)
- [ ] SSL certificate expiration dates confirmed (minimum 60 days remaining)
- [ ] Firewall rules documented and reviewed
Monitoring and Alerting
- [ ] All uptime monitors confirmed active and alerting to correct recipients
- [ ] Error rate alerts confirmed functional (test with a deliberate 404)
- [ ] Database disk space alerts configured
- [ ] CPU and RAM threshold alerts configured
- [ ] Log aggregation confirmed capturing application and system logs
- [ ] On-call rotation updated to reflect new infrastructure
Billing and Cost
- [ ] First month cost estimate confirmed against actuals
- [ ] Budget alerts configured on new account
- [ ] Old environment resources fully terminated (no zombie instances generating cost)
- [ ] Object storage lifecycle policies configured if applicable
Backup and Recovery
- [ ] Automated database backups confirmed running on schedule
- [ ] Point-in-time recovery tested — confirm restore completes successfully
- [ ] Object storage versioning or backup policy confirmed
- [ ] Disaster recovery runbook updated for new environment
- [ ] RTO and RPO targets confirmed achievable with new backup configuration
Application
- [ ] All critical user flows tested manually (sign up, login, core feature, payment)
- [ ] All third-party integrations confirmed functional
- [ ] Email delivery confirmed (send a test transactional email)
- [ ] All cron jobs and scheduled tasks confirmed running
- [ ] Application error tracking (Sentry, Rollbar, etc.) confirmed receiving events from new environment
Before and After: Cost Comparison
For the case study company, the monthly infrastructure cost changed as follows:
| Service | AWS (before) | InMotion Cloud (after) |
|---|---|---|
| Compute (instances) | $1,840 | $720 |
| Managed database | $3,200 | $980 |
| Object storage | $680 | $210 |
| Load balancer | $180 | $95 |
| Data transfer / CDN | $3,100 | $340 |
| Support plan | $400 | Included |
| Total | $11,400/mo | $4,850/mo |
Annual savings: $78,600. The 28-day migration engagement cost less than one month of the savings it generated.
The data transfer cost reduction is the most significant line item. AWS charges for outbound data transfer; the InMotion Cloud pricing model is more predictable and the included bandwidth allocation covered the majority of their traffic volume.
Lessons Learned from the Case Study
The undocumented Lambda functions nearly caused a failed migration. The Week 1 infrastructure audit caught them. Every hour spent in discovery directly prevents hours of emergency debugging during cutover. Do not skip or compress Week 1.
WAL streaming replication gave the team confidence they wouldn't have had with a dump/restore approach. Knowing the database on the new environment was continuously synchronized — and that rollback meant flipping DNS back with at most 5 seconds of lag — made the cutover window a calm, monitored event rather than a nerve-wracking one.
The 57% cost reduction was a direct result of right-sizing. AWS's instance selection had never been revisited since the company's early growth phase. Pulling 90-day utilization data and sizing appropriately for actual workload is an underrated component of any migration.
Keeping the old environment live for 48 hours post-cutover is non-negotiable. It costs almost nothing, and it means rollback is a DNS change, not a re-deployment. The team slept better knowing the backstop existed.
Conclusion and Next Steps
A 28-day migration with zero downtime is repeatable when the phases are followed in order and the work isn't compressed. The temptation to skip Week 1 discovery or reduce the Week 3 testing window is real — resist it. Every shortcut taken before cutover becomes a problem discovered during cutover.
The key elements that made this migration successful were: a complete infrastructure audit before any provisioning, continuous database replication rather than a one-time dump, a DNS TTL reduction strategy that made cutover a 2-minute operation, and pre-defined rollback triggers that removed the ambiguity of "should we roll back?"
If you are planning a migration to InMotion Cloud's Managed VPC and want to work through an architecture design before you begin, InMotion Cloud's solutions team can review your current environment and help map the target architecture. The discovery phase is the highest-value conversation you can have before day one.
The blueprint is here. The first step is the audit.