Rescuing Failed Instances in OpenStack

When an OpenStack instance fails to boot or becomes inaccessible, rescue mode provides a way to recover your data and fix configuration problems without losing everything. This guide walks through the complete process of rescuing failed instances, from diagnosis to recovery.

What is Rescue Mode?

Rescue mode boots your instance from a temporary image while attaching your original root disk as a secondary device. This gives you access to your data and file system without relying on the broken operating system.

Think of it like booting from a live USB drive on a physical server. You can mount the original disk, recover files, fix configuration errors, or repair a broken bootloader.

When to Use Rescue Mode

Rescue mode is appropriate when:

Instance won't boot due to kernel panic, filesystem corruption, or bootloader issues
SSH access fails after a bad configuration change
Password reset needed and cloud-init isn't working
Critical files need recovery before rebuilding
Filesystem repair required using fsck or similar tools
Configuration files that prevent a normal boot need fixing

Rescue mode is not appropriate for:

Hardware-level failures (these require migration or rebuild)
Severe disk corruption (backup recovery is safer)
Performance troubleshooting (use normal boot with monitoring)

Prerequisites

Before entering rescue mode, ensure you have:

Access to Horizon dashboard or OpenStack CLI tools installed
Valid credentials for your OpenStack environment
Instance ID or name of the failed instance
SSH key configured for the rescue image
Knowledge of the instance's original OS (helps with file paths and commands)

You can use either the Horizon dashboard for a visual interface or the OpenStack CLI for automation and scripting. Both methods provide the same rescue functionality.

Step 1: Identify the Failed Instance

First, locate the instance and verify its current state. You can use either Horizon or the CLI.

Using Horizon Dashboard

Navigate to Project > Compute > Instances
Review the instance list and locate your failed instance
Check the Status column for the instance state
Click the instance name to view detailed information
Note the instance ID and current status

Common failure states include:

ERROR indicates a serious problem
SHUTOFF means the instance stopped
ACTIVE but unreachable suggests OS-level issues

Using OpenStack CLI

List all instances:

openstack server list

Find your failed instance and note its ID. Check its detailed status:

openstack server show <instance-id>

Look at the status field for the same failure states listed above.
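The status check can be folded into a small triage helper. This is a minimal sketch, not part of any OpenStack tooling: the function name describe_status is hypothetical, and its hints only restate the failure states listed in this guide.

```shell
# Hypothetical helper: map a server status string to a short triage hint.
# The hints restate the failure states described in this guide.
describe_status() {
    case "$1" in
        ERROR)   echo "serious problem reported by the platform" ;;
        SHUTOFF) echo "instance is stopped" ;;
        ACTIVE)  echo "running; if unreachable, suspect an OS-level issue" ;;
        RESCUE)  echo "already in rescue mode" ;;
        *)       echo "status not covered by this guide: $1" ;;
    esac
}
```

One way to call it is describe_status "$(openstack server show <instance-id> -f value -c status)", where -f value -c status prints only the status column without table borders.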

Step 2: Create a Snapshot (Optional but Recommended)

Before making changes, create a snapshot for rollback capability. You can create snapshots using either Horizon or the CLI.

Using Horizon Dashboard

Navigate to Project > Compute > Instances
Locate the failed instance in the list
Click the dropdown arrow on the right side of the instance row
Select Create Snapshot from the actions menu
Enter a snapshot name like backup-before-rescue
Click Create Snapshot
Navigate to Project > Compute > Images to monitor progress
Wait for the status to change to Active

When status shows Active, the snapshot is ready.

Using OpenStack CLI

Create a snapshot:

openstack server image create --name backup-before-rescue <instance-id>

This snapshot captures the instance's current state. If rescue operations go wrong, you can rebuild from this point.

Wait for the snapshot to complete:

openstack image list | grep backup-before-rescue

When status shows active, the snapshot is ready.

Step 3: Enter Rescue Mode

You can enter rescue mode using either the Horizon dashboard or the OpenStack CLI. Both methods achieve the same result.

How to Rescue an Instance in Horizon Dashboard

The Horizon dashboard provides a visual interface for entering rescue mode.

Navigate to Project > Compute > Instances
Locate the instance you want to rescue in the instance list
Click the dropdown arrow on the right side of the instance row
Select Rescue Instance from the actions menu
In the dialog that appears, select a rescue image from the dropdown (or leave default to use a copy of the original image)
Optionally set a password for the rescue environment
Click Confirm to start the rescue process

The instance shuts down and boots from the rescue image. This typically takes 30 to 60 seconds.

Watch the Status column in the instance list. When it changes to RESCUE, the instance is ready.

To access the rescued instance, click the instance name to open its details page, then click the Console tab to access the rescue environment.

How to Rescue an Instance Using OpenStack CLI

For automation and scripting, use the OpenStack CLI to enter rescue mode.

Put the instance into rescue mode:

openstack server rescue <instance-id>

You can specify a custom rescue image if needed:

openstack server rescue --image <rescue-image-id> <instance-id>

If no image is specified, Nova rescues the instance with the image it was originally launched from, unless your provider has configured a different default rescue image.

The rescue process typically takes 30 to 60 seconds. Monitor progress:

openstack server show <instance-id> | grep status

When status changes to RESCUE, the instance is ready.
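For scripted use, the manual check can be replaced by a polling loop. The following is a sketch under stated assumptions: wait_for_status is a hypothetical name, the default timeout and interval are arbitrary, and the loop gives up early if the server reports ERROR.

```shell
# Hypothetical helper: poll a server until it reaches the wanted status.
# Usage: wait_for_status <instance-id> <status> [timeout-seconds] [interval-seconds]
wait_for_status() {
    server=$1
    wanted=$2
    timeout=${3:-120}   # assumed default, adjust for your cloud
    interval=${4:-5}    # assumed default
    elapsed=0
    while :; do
        # -f value -c status prints only the status column, with no table borders
        current=$(openstack server show "$server" -f value -c status) || return 2
        if [ "$current" = "$wanted" ]; then
            return 0
        fi
        if [ "$current" = "ERROR" ]; then
            echo "server entered ERROR while waiting for $wanted" >&2
            return 1
        fi
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out after ${elapsed}s; last status: $current" >&2
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
}
```

After openstack server rescue <instance-id>, calling wait_for_status <instance-id> RESCUE blocks until the rescue environment is up. The same function can wait for ACTIVE once the instance has been unrescued.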

Step 4: Connect to the Rescue Environment

Once in rescue mode, connect via SSH using your configured key:

ssh -i /path/to/key.pem rescue@<instance-floating-ip>

Some rescue images use a different default user, such as ubuntu or centos. If the rescue user is rejected, check your cloud provider's documentation for the correct login.

After connecting, you're running in the temporary rescue environment. Your original root disk is attached but not mounted.

Step 5: Locate and Mount the Original Disk

List attached block devices to find your original root disk:

lsblk

You'll see output like:

NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda    252:0    0  10G  0 disk
└─vda1 252:1    0  10G  0 part /
vdb    252:16   0  80G  0 disk
└─vdb1 252:17   0  80G  0 part

In this example:

vda is the rescue image (currently booted)
vdb is your original root disk (unmounted)
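Picking out the unmounted partition can also be done mechanically. The sketch below assumes lsblk's raw output format (-r), in which a partition without a mount point has an empty third column. The function name list_unmounted_partitions is hypothetical.

```shell
# Hypothetical helper: read "NAME TYPE MOUNTPOINT" rows on stdin and print
# every partition that has no mount point, as a /dev path.
# Intended input: lsblk -rno NAME,TYPE,MOUNTPOINT
list_unmounted_partitions() {
    awk '$2 == "part" && $3 == "" { print "/dev/" $1 }'
}
```

Running lsblk -rno NAME,TYPE,MOUNTPOINT | list_unmounted_partitions against the layout shown above would print /dev/vdb1. Partitions that belong to an LVM volume group also appear as unmounted, so treat the result as a candidate list rather than a final answer.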

Create a mount point and mount the original disk:

sudo mkdir /mnt/original
sudo mount /dev/vdb1 /mnt/original

If you have LVM volumes or multiple partitions, you may need to mount them differently:

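# For LVM volumes (vg-root is a placeholder; list the real names with "sudo lvs")
sudo vgscan
sudo vgchange -ay
sudo mount /dev/mapper/vg-root /mnt/original

# For systems with a separate /boot (partition numbers are examples; confirm with lsblk)
sudo mount /dev/vdb1 /mnt/original
sudo mount /dev/vdb2 /mnt/original/boot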

Verify the mount succeeded:

ls /mnt/original

You should see the familiar directory structure from your instance (bin, etc, home, var, and so on).
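The visual check can be scripted as a guard before any recovery step runs. This is a minimal sketch: looks_like_root is a hypothetical name, and the three directories it tests are an assumption about what a typical Linux root filesystem contains.

```shell
# Hypothetical check: succeed only if the directory contains the top-level
# directories this guide expects on a Linux root filesystem.
looks_like_root() {
    for name in etc home var; do
        if [ ! -d "$1/$name" ]; then
            echo "missing $1/$name" >&2
            return 1
        fi
    done
}
```

A call such as looks_like_root /mnt/original || exit 1 at the top of a recovery script stops it early if the wrong partition was mounted.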

Step 6: Recover Data or Fix Configuration

Now that your original filesystem is mounted, you can perform recovery operations.

Recover Important Files

Copy critical data to a safe location:

# Create a tarball of important directories
sudo tar -czf /tmp/backup-data.tar.gz -C /mnt/original/home .

# Copy to object storage or another instance
# Example with Swift:
swift upload backups /tmp/backup-data.tar.gz
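Before uploading, it can be worth confirming that the archive is readable, since a disk that is already failing may produce a truncated tarball. The sketch below wraps the tar command from this section in a hypothetical function, archive_dir, and adds a listing pass as a basic integrity check.

```shell
# Hypothetical helper: archive a directory and confirm the archive is readable.
# Usage: archive_dir <source-dir> <output.tar.gz>
archive_dir() {
    src=$1
    out=$2
    # -C changes into the source directory so paths inside the archive are relative
    tar -czf "$out" -C "$src" . || return 1
    # Listing the archive forces tar to read it from start to end
    tar -tzf "$out" > /dev/null || return 1
    echo "archived $src to $out"
}
```

Reading other users' home directories needs root, so in the rescue environment this would be run from a root shell, for example archive_dir /mnt/original/home /tmp/backup-data.tar.gz.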

Fix Configuration Files

Edit configuration files that prevented boot:

# Fix SSH config
sudo nano /mnt/original/etc/ssh/sshd_config

# Reset network configuration (Debian-style ifupdown path; other distributions differ)
sudo nano /mnt/original/etc/network/interfaces

# Fix fstab if mount issues exist
sudo nano /mnt/original/etc/fstab
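Keeping a copy of each file before editing makes a bad fix easy to undo. The following sketch assumes nothing beyond cp and date. The function name backup_before_edit and the .bak-<timestamp> suffix are illustrative choices, not a convention of any tool.

```shell
# Hypothetical helper: copy a file next to itself with a timestamp suffix
# before editing it, and print the path of the backup.
backup_before_edit() {
    file=$1
    stamp=$(date +%Y%m%d-%H%M%S)
    backup="$file.bak-$stamp"
    # -p preserves mode and timestamps (and ownership when run as root)
    cp -p "$file" "$backup" || return 1
    echo "$backup"
}
```

Because the files under /mnt/original/etc are owned by root, define and call the function from a root shell, for example backup_before_edit /mnt/original/etc/fstab before opening the file in an editor.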

Reset Root Password

If you need to reset the root password:

# Chroot into the original system
sudo chroot /mnt/original

# Change the password
passwd root

# Exit chroot
exit

Check and Repair Filesystem

Run filesystem checks if corruption is suspected:

# Unmount first
sudo umount /mnt/original

# Run fsck
sudo fsck -y /dev/vdb1

# Remount
sudo mount /dev/vdb1 /mnt/original

The -y flag automatically answers yes to repair prompts. Remove it if you want manual control.
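fsck reports its result through an exit status that is a sum of bit flags, which is easy to misread in a script. The sketch below decodes that value using the meanings documented in the fsck(8) manual page. The function name explain_fsck_status is hypothetical.

```shell
# Hypothetical helper: decode the bitmask that fsck returns as its exit status.
# Bit meanings follow the fsck(8) manual page.
explain_fsck_status() {
    rc=$1
    if [ "$rc" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    [ $((rc & 1)) -ne 0 ]   && echo "errors corrected"
    [ $((rc & 2)) -ne 0 ]   && echo "system should be rebooted"
    [ $((rc & 4)) -ne 0 ]   && echo "errors left uncorrected"
    [ $((rc & 8)) -ne 0 ]   && echo "operational error"
    [ $((rc & 16)) -ne 0 ]  && echo "usage or syntax error"
    [ $((rc & 32)) -ne 0 ]  && echo "checking canceled by user request"
    [ $((rc & 128)) -ne 0 ] && echo "shared-library error"
    return 0
}
```

A typical use is sudo fsck -y /dev/vdb1; explain_fsck_status $? so that a status such as 5 is reported as both "errors corrected" and "errors left uncorrected".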

Fix Bootloader Issues

If GRUB is broken:

# Mount the original disk (skip if already mounted), bind the virtual filesystems, then chroot
sudo mount /dev/vdb1 /mnt/original
sudo mount --bind /dev /mnt/original/dev
sudo mount --bind /proc /mnt/original/proc
sudo mount --bind /sys /mnt/original/sys
sudo chroot /mnt/original

# Reinstall GRUB (Debian/Ubuntu command names; RHEL-family systems use grub2-install and grub2-mkconfig)
grub-install /dev/vdb
update-grub

# Exit and clean up
exit
sudo umount /mnt/original/sys
sudo umount /mnt/original/proc
sudo umount /mnt/original/dev
sudo umount /mnt/original

Step 7: Exit Rescue Mode

Once you've completed recovery operations, exit rescue mode. You can unrescue using either Horizon or the CLI.

How to Unrescue an Instance in Horizon Dashboard

First, unmount the original disk if you're still connected to the rescue environment:

sudo umount /mnt/original

Exit the SSH or console session
Navigate to Project > Compute > Instances
Locate the rescued instance (status shows RESCUE)
Click the dropdown arrow on the right side of the instance row
Select Unrescue Instance from the actions menu
Click Confirm in the dialog that appears

The instance reboots using its original root disk. This takes 1 to 2 minutes. The status returns to ACTIVE when complete.

How to Unrescue an Instance Using OpenStack CLI

In the rescue environment, unmount the original disk and close the session:

# First, unmount the original disk if it is still mounted
sudo umount /mnt/original

# Exit SSH session
exit

Then, from your local machine, unrescue the instance:

openstack server unrescue <instance-id>

The instance reboots using its original root disk. This takes 1 to 2 minutes.

Step 8: Verify Recovery

After unrescuing, verify the instance boots normally:

openstack server show <instance-id> | grep status

Status should return to ACTIVE.

Test connectivity:

ssh -i /path/to/key.pem user@<instance-floating-ip>

Check that services are running:

systemctl status

Verify your configuration changes took effect.

Troubleshooting Common Issues

Cannot Mount Original Disk

If mounting fails with errors like "wrong fs type" or "bad superblock":

# Check filesystem type
sudo file -s /dev/vdb1

# Try forcing the filesystem type
sudo mount -t ext4 /dev/vdb1 /mnt/original

For severe corruption, try mounting read-only:

sudo mount -o ro /dev/vdb1 /mnt/original
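The right mount command depends on what the partition actually contains. The sketch below maps a filesystem type, as reported by blkid, to a suggested command. The function name suggest_mount is hypothetical and the mapping is an assumption covering only a few common cases. The nouuid option is included for XFS because XFS refuses to mount a filesystem whose UUID matches one that is already mounted, which can happen when the rescue image is a copy of the original image.

```shell
# Hypothetical helper: print a suggested mount command for a device,
# given the filesystem type reported by blkid.
# Usage: suggest_mount <device> <fstype> [mount-point]
suggest_mount() {
    dev=$1
    fstype=$2
    target=${3:-/mnt/original}
    case "$fstype" in
        xfs)
            # nouuid avoids the duplicate-UUID refusal described above
            echo "mount -t xfs -o nouuid $dev $target" ;;
        LVM2_member)
            echo "vgchange -ay  # then mount the logical volume, not $dev" ;;
        "")
            echo "no filesystem signature found on $dev" ;;
        *)
            echo "mount -t $fstype $dev $target" ;;
    esac
}
```

For example, suggest_mount /dev/vdb1 "$(sudo blkid -o value -s TYPE /dev/vdb1)" prints a command to review and run by hand. It deliberately does not run the mount itself.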

Rescue Image Doesn't Match Original OS

If you need a specific rescue image:

# List available images
openstack image list

# Enter rescue with specific image
openstack server rescue --image <image-id> <instance-id>

Cannot Exit Rescue Mode

If unrescue fails, try:

# Check current state
openstack server show <instance-id>

# Force stop and restart (Nova may reject these while the instance is still in RESCUE state)
openstack server stop <instance-id>
openstack server start <instance-id>

If the instance remains stuck, contact support.

Lost Changes After Unrescue

If your changes didn't persist:

Verify you edited files in /mnt/original, not the rescue environment
Check that you unmounted properly before unrescuing
Ensure filesystem writes completed (run sync before unmounting)

Best Practices

Document what broke. Before fixing anything, note what failed and why. This helps prevent repeat issues.

Work on a copy. When possible, snapshot before rescue and work on a test instance first.

Use rescue mode sparingly. It's for emergency recovery, not routine maintenance. Fix the root cause to avoid needing rescue again.

Keep rescue images updated. Use recent rescue images that match your instance's OS version.

Automate backups. Regular snapshots and backups reduce the need for rescue operations.

Test your recovery process. Practice rescuing a test instance before you need it in production.

Alternative: Rebuild from Backup

If rescue mode doesn't resolve the issue, rebuilding from backup is often faster and safer:

# Create new instance from snapshot
openstack server create --image <snapshot-id> --flavor <flavor> recovered-instance

# Or rebuild existing instance
openstack server rebuild --image <snapshot-id> <instance-id>

Rebuilding replaces the root disk entirely, which solves corruption problems that rescue mode cannot fix.

When to Contact Support

Contact your cloud provider if:

Disk is completely inaccessible even in rescue mode
Hardware issues suspected (repeated failures, IO errors)
Rescue mode fails to boot
Data recovery requires specialized tools

Provide your instance ID, error messages, and steps you've already tried.

Summary

Rescue mode is a powerful tool for recovering failed OpenStack instances. By booting from a clean image and mounting the original disk, you can fix configuration errors, recover data, and repair broken systems without losing everything.

The key steps are:

Snapshot the failed instance for safety
Enter rescue mode using Horizon or the OpenStack CLI
Mount the original root disk
Fix configuration or recover data
Unmount and exit rescue mode
Verify the instance boots normally

With practice, rescue operations become routine. The more familiar you are with the process, the faster you can recover from failures and minimize downtime.