OpenStack Instance Troubleshooting Guide

Introduction

OpenStack instances (virtual machines) can encounter various issues during their lifecycle. When an instance fails to boot, loses network connectivity, or experiences performance degradation, rapid troubleshooting is essential to minimize downtime and maintain service reliability. This guide walks through the most common OpenStack instance problems and provides systematic solutions to resolve them quickly.

Understanding OpenStack Instance States

Before troubleshooting, it's important to understand instance states. An OpenStack instance can exist in several states:

BUILD: Instance is being created
ACTIVE: Instance is running normally
ERROR: Instance failed during creation or operation
SHUTOFF: Instance is powered off
SUSPENDED: Instance has been suspended (RAM saved to disk)
PAUSED: Instance is paused (RAM kept in memory)
REBOOT: Instance is rebooting

Check your instance state using the OpenStack CLI:

openstack server show <instance-id>

Or via the Horizon dashboard under Compute → Instances.
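
Because BUILD and REBOOT are transitional states, it can help to poll until the instance settles rather than re-running the command by hand. The following is a minimal sketch: the helper name wait_for_settled and its default timeout and interval are illustrative assumptions, not part of the OpenStack CLI.

```shell
# Poll an instance until it leaves a transitional state or the timeout expires.
# Hypothetical helper: the name, timeout, and interval defaults are illustrative.
wait_for_settled() {
    local server_id="$1"
    local timeout="${2:-300}"   # seconds to wait before giving up
    local interval="${3:-5}"    # seconds between polls
    local elapsed=0
    local status=""
    while [ "$elapsed" -le "$timeout" ]; do
        status=$(openstack server show "$server_id" -f value -c status) || return 2
        case "$status" in
            BUILD|REBOOT|HARD_REBOOT) ;;          # still in transition, keep polling
            ERROR) echo "$status"; return 1 ;;
            *) echo "$status"; return 0 ;;
        esac
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "timeout (last status: $status)" >&2
    return 3
}

# Usage: wait_for_settled <instance-id> 600 10
```

The function prints the final state and returns non-zero on ERROR or timeout, so it can gate later steps in a script.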

Common Issue 1: Instance Won't Boot

Symptoms

Instance stuck in BUILD state
Instance enters ERROR state immediately after creation
Instance shows ACTIVE but is unreachable

Diagnostic Steps

Check instance status:

openstack server show <instance-id> -f json

Look for the fault field, which contains error details when present.

Review compute logs:

# On the compute node
sudo tail -f /var/log/nova/nova-compute.log
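
On a busy compute node the live log scrolls quickly, so filtering for one instance can be easier to read. This sketch assumes the default Nova log layout, where the log level is the fourth whitespace-separated field; the helper name instance_errors is hypothetical.

```shell
# Print ERROR and WARNING lines that mention a given instance ID.
# Assumes the default log layout: date, time, PID, level, logger, message.
instance_errors() {
    local server_id="$1"
    local log_file="${2:-/var/log/nova/nova-compute.log}"
    grep -F -- "$server_id" "$log_file" | awk '$4 == "ERROR" || $4 == "WARNING"'
}

# Usage (from a shell that can read the log, for example a root shell):
#   instance_errors <instance-id>
```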

Common causes:

Insufficient resources: The compute node lacks CPU, RAM, or disk space
Image issues: The image is corrupted or incompatible
Flavor mismatch: The flavor requests more resources than any compute node can provide, or less disk or RAM than the image requires
Volume attachment failure: Boot volume cannot be attached

Solutions

Insufficient resources:

Check available resources on compute nodes
Migrate existing instances to free resources
Add more compute capacity
Choose a smaller flavor

Image problems:

Verify image integrity: openstack image show <image-id>
Re-upload the image if corrupted
Use a known-working image for testing

Flavor issues:

List available flavors: openstack flavor list
Select a flavor that matches available resources
Create a custom flavor if needed

Volume attachment failures:

Check Cinder volume status: openstack volume list
Verify storage backend connectivity
Review Cinder logs: /var/log/cinder/cinder-volume.log

Common Issue 2: Network Connectivity Problems

Symptoms

Cannot SSH into instance
Instance cannot reach external networks
Instance cannot communicate with other instances

Diagnostic Steps

Check security group rules:

openstack security group list
openstack security group rule list <security-group-name>

Verify network configuration:

openstack server show <instance-id> | grep addresses
openstack port list --server <instance-id>

Test connectivity from the instance console:

Access instance via VNC console in Horizon
Run ping 8.8.8.8 to test external connectivity
Run ip addr to verify IP assignment
Run ip route to check routing table
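
The console checks above can be run in a fixed order so that the first failing step points at the likely layer. This is a sketch only: the helper names default_gateway and check_connectivity are hypothetical, and the ping options assume the Linux iputils version of ping.

```shell
# Print the default gateway from `ip route` output read on stdin.
default_gateway() {
    awk '$1 == "default" && $2 == "via" { print $3; exit }'
}

# Ping the gateway first, then an external address, to localize the failure.
check_connectivity() {
    local gw
    gw=$(ip route | default_gateway)
    if [ -z "$gw" ]; then
        echo "no default route"; return 1
    fi
    if ! ping -c 3 -W 2 "$gw" >/dev/null 2>&1; then
        echo "gateway $gw unreachable"; return 1
    fi
    if ! ping -c 3 -W 2 8.8.8.8 >/dev/null 2>&1; then
        echo "gateway ok, external unreachable"; return 1
    fi
    echo "external connectivity ok"
}
```

A missing default route usually points at DHCP or static configuration inside the instance, while a reachable gateway with no external connectivity points at the router or floating IP.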

Solutions

Security group blocking traffic:

Add rules to allow SSH and other required services:

openstack security group rule create --proto tcp --dst-port 22 <security-group-name>
openstack security group rule create --proto icmp <security-group-name>

No floating IP assigned:

Allocate and associate a floating IP:

openstack floating ip create <external-network>
openstack server add floating ip <instance-id> <floating-ip-address>
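
The two commands can be chained so that the allocated address does not have to be copied by hand. A minimal sketch, assuming the floating_ip_address column reported by the create command; the helper name attach_floating_ip is hypothetical.

```shell
# Allocate a floating IP on the external network and attach it to a server.
# Prints the allocated address on success.
attach_floating_ip() {
    local server_id="$1"
    local ext_net="$2"
    local fip
    fip=$(openstack floating ip create "$ext_net" -f value -c floating_ip_address) || return 1
    openstack server add floating ip "$server_id" "$fip" || return 1
    echo "$fip"
}

# Usage: attach_floating_ip <instance-id> <external-network>
```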

Network configuration issues:

Verify router is attached to subnet: openstack router show <router-id>
Check router gateway: openstack router show <router-id> | grep external_gateway
Restart network agent if needed (requires admin access)

DHCP not working:

Access instance via console and configure static IP:

# Inside instance (adjust the /24 prefix and eth0 to match your subnet and interface)
sudo ip addr add <ip-address>/24 dev eth0
sudo ip route add default via <gateway-ip>

Then troubleshoot DHCP agent: openstack network agent list --agent-type dhcp

Common Issue 3: Performance Degradation

Symptoms

Instance responding slowly
High CPU wait time
Network throughput lower than expected
Disk I/O bottlenecks

Diagnostic Steps

Check instance metrics:

Access the instance and run:

# CPU usage
top
htop

# Disk I/O
iostat -x 1
iotop

# Network usage
iftop
nethogs

# Memory usage
free -h
vmstat 1
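
To put a number on the "high CPU wait time" symptom, the wa column of vmstat can be averaged over a short sample. This sketch locates the column by its header name, since its position can differ between vmstat versions; the helper name avg_iowait is hypothetical.

```shell
# Average the "wa" (I/O wait) column of vmstat output read on stdin.
# The column is found by header name because its position varies by version.
avg_iowait() {
    awk '
        $1 == "r" { for (i = 1; i <= NF; i++) if ($i == "wa") col = i; next }
        col && $1 ~ /^[0-9]+$/ { sum += $col; n++ }
        END { if (n) printf "%.1f\n", sum / n; else exit 1 }
    '
}

# Usage: vmstat 1 5 | avg_iowait
```

Note that the first sample vmstat prints is an average since boot, so longer sampling windows give a more representative figure.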

Check compute node load:

# On compute node (requires admin access)
uptime
top
virsh list --all

Review instance resource allocation:

openstack server show <instance-id>
openstack flavor show <flavor-id>

Solutions

CPU bottleneck:

Resize instance to a larger flavor: openstack server resize --flavor <new-flavor> <instance-id>, then confirm the resize once the instance reaches VERIFY_RESIZE
Identify CPU-intensive processes and optimize
Spread load across multiple instances
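
A resize is a two-step operation: Nova leaves the instance in VERIFY_RESIZE until the change is confirmed or reverted. The sketch below chains both steps. It assumes a recent python-openstackclient that provides the resize confirm subcommand (older clients used a --confirm flag instead); the helper name resize_and_confirm and the POLL_INTERVAL variable are hypothetical.

```shell
# Resize a server, wait for VERIFY_RESIZE, then confirm the resize.
# Assumes a client version that supports `openstack server resize confirm`.
resize_and_confirm() {
    local server_id="$1"
    local flavor="$2"
    local status=""
    local i=0
    openstack server resize --flavor "$flavor" "$server_id" || return 1
    # Poll up to 120 times (about 10 minutes at the default 5-second interval).
    while [ "$i" -lt 120 ]; do
        status=$(openstack server show "$server_id" -f value -c status)
        case "$status" in
            VERIFY_RESIZE) openstack server resize confirm "$server_id"; return ;;
            ERROR) echo "resize failed" >&2; return 1 ;;
        esac
        sleep "${POLL_INTERVAL:-5}"
        i=$((i + 1))
    done
    echo "timed out waiting for VERIFY_RESIZE" >&2
    return 1
}

# Usage: resize_and_confirm <instance-id> <new-flavor>
```

Confirming automatically skips the window in which the resize could still be reverted, so for important workloads it may be safer to check the application before confirming by hand.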

Disk I/O bottleneck:

Move to SSD-backed storage if using HDD volumes
Increase volume IOPS allocation (if supported)
Optimize application disk usage
Consider using Cinder volume instead of ephemeral disk

Network bottleneck:

Check for network congestion on compute node
Verify QoS policies are not limiting bandwidth
Use SR-IOV or DPDK for high-performance networking (if available)

Memory pressure:

Resize to flavor with more RAM
Identify memory leaks in applications
Enable swap (not recommended for production)

Common Issue 4: Instance in ERROR State

Symptoms

Instance shows ERROR state
Cannot perform operations on instance
Previous operations failed

Diagnostic Steps

Check fault message:

openstack server show <instance-id> -f json | grep -A 5 '"fault"'

Review Nova logs:

sudo grep <instance-id> /var/log/nova/nova-compute.log
sudo grep <instance-id> /var/log/nova/nova-api.log

Solutions

Reset instance state (admin only):

openstack server set --state active <instance-id>

Delete and recreate:

If state reset doesn't work, rebuild from scratch:

openstack server delete <instance-id>
openstack server create --image <image> --flavor <flavor> --network <network> <new-name>

Fix underlying issue:

The ERROR state usually indicates a deeper problem. Address the root cause before resetting state:

Storage backend failure
Network configuration error
Hypervisor issue
Image corruption

Common Issue 5: Console Access Not Working

Symptoms

Cannot access VNC console
Console shows blank screen
Console connection times out

Diagnostic Steps

Verify console URL:

openstack console url show <instance-id>

Check Nova console service:

# On controller node
sudo systemctl status nova-novncproxy

Review console logs:

sudo tail -f /var/log/nova/nova-novncproxy.log

Solutions

Restart console proxy:

sudo systemctl restart nova-novncproxy

Verify console port forwarding:

Ensure firewall allows console proxy port (typically 6080):

sudo firewall-cmd --list-ports
sudo firewall-cmd --add-port=6080/tcp --permanent
sudo firewall-cmd --reload

Use serial console instead:

openstack console url show --serial <instance-id>

Advanced Troubleshooting Techniques

Using Instance Console Logs

Retrieve system console output without VNC access:

openstack console log show <instance-id>

This shows boot messages and can reveal:

Kernel panics
Filesystem errors
Network configuration issues
Cloud-init failures
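
A long console log can be scanned for the failure types listed above with a simple pattern filter. The patterns in this sketch are illustrative assumptions about common message text, not an exhaustive list, and the helper name scan_console_log is hypothetical.

```shell
# Print console log lines (with line numbers) that match common failure text.
# Reads the log on stdin; the pattern list is illustrative, not exhaustive.
scan_console_log() {
    grep -n -i -E 'kernel panic|EXT4-fs error|I/O error|cloud-init.*(failed|error)|No DHCPOFFERS'
}

# Usage: openstack console log show <instance-id> | scan_console_log
```

The function returns non-zero when nothing matches, which makes it usable as a quick check in scripts.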

Checking Cloud-Init Status

If an instance boots but doesn't configure properly:

# Inside instance
sudo cloud-init status
sudo cloud-init analyze show
sudo cat /var/log/cloud-init.log

Verifying Volume Attachments

For instances with persistent volumes:

openstack volume list
openstack volume show <volume-id>
openstack server volume list <instance-id>

Detach and reattach if needed:

openstack server remove volume <instance-id> <volume-id>
openstack server add volume <instance-id> <volume-id>
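
Detaching is asynchronous, so reattaching immediately can fail while the volume is still in the detaching state. The sketch below waits for the volume to report available before reattaching. The helper name reattach_volume, the retry count, and the POLL_INTERVAL variable are hypothetical; unmount the filesystem inside the instance before running it.

```shell
# Detach a volume, wait until Cinder reports it as available, then reattach.
reattach_volume() {
    local server_id="$1"
    local volume_id="$2"
    local status=""
    local i=0
    openstack server remove volume "$server_id" "$volume_id" || return 1
    while [ "$i" -lt 60 ]; do
        status=$(openstack volume show "$volume_id" -f value -c status)
        [ "$status" = "available" ] && break
        sleep "${POLL_INTERVAL:-2}"
        i=$((i + 1))
    done
    if [ "$status" != "available" ]; then
        echo "volume still '$status'; not reattaching" >&2
        return 1
    fi
    openstack server add volume "$server_id" "$volume_id"
}

# Usage: reattach_volume <instance-id> <volume-id>
```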

Network Namespace Debugging

For deep network troubleshooting (requires admin access on network node):

# List network namespaces
sudo ip netns list

# Execute command in namespace
sudo ip netns exec qrouter-<router-id> ip addr
sudo ip netns exec qdhcp-<network-id> tcpdump -i any
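
On a network node with many tenants the namespace list can be long. This sketch filters it down to the namespaces for one router or network ID; the helper name namespaces_for is hypothetical.

```shell
# Print the namespace names that contain a given router or network ID.
# Reads `ip netns list` output on stdin.
namespaces_for() {
    awk -v id="$1" 'index($1, id) { print $1 }'
}

# Usage:
#   sudo ip netns list | namespaces_for <network-id>
#   # Then watch only DHCP traffic in that namespace:
#   sudo ip netns exec qdhcp-<network-id> tcpdump -n -i any port 67 or port 68
```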

Error Message Reference

"No valid host was found"

Meaning: Scheduler could not find a compute node that satisfies the request (resources, availability zone, or other scheduling filters)
Solution: Check compute node resources, verify placement service, check scheduler logs

"Build of instance failed: Block Device Mapping is Invalid"

Meaning: Volume or image specification is incorrect
Solution: Verify volume exists and is available, check image format compatibility

"Failed to allocate the network(s), not rescheduling"

Meaning: Neutron could not allocate a port for the instance, often due to IP exhaustion
Solution: Check available IPs in subnet, verify network quotas
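
As a quick sanity check for IP exhaustion, the size of a subnet follows directly from its prefix length: a /N IPv4 subnet has 2^(32-N) addresses, two of which (network and broadcast) are not assignable. The sketch below does that arithmetic; the helper name usable_ips is hypothetical.

```shell
# Print the number of assignable host addresses in an IPv4 CIDR (prefix /0 to /30).
usable_ips() {
    local prefix="${1#*/}"
    [ "$prefix" -ge 0 ] 2>/dev/null && [ "$prefix" -le 30 ] || return 1
    echo $(( (1 << (32 - prefix)) - 2 ))
}

# Usage: usable_ips 10.0.0.0/24
```

The number available to instances is lower still, because the subnet gateway, DHCP ports, and router interfaces each consume an address. Where the deployment exposes it, openstack ip availability show <network> reports the actual used and total counts.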

"Instance failed to spawn"

Meaning: Nova compute failed to create the VM
Solution: Check compute node logs, verify libvirt/KVM status, check disk space

"Exceeded maximum number of retries"

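Meaning: Nova retried the build up to its configured maximum number of scheduling attempts and every attempt failed
Solution: Find the underlying failure for each attempt in the nova-compute and nova-conductor logs, check service health and network connectivity, and increase timeout values if builds are timing out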

Prevention and Best Practices

Before creating instances:

Verify sufficient quota: openstack quota show
Check compute node resources
Validate image compatibility
Test network configuration

During instance lifecycle:

Monitor resource usage regularly
Keep security groups updated
Back up critical instances
Document custom configurations

Logging and monitoring:

Enable instance monitoring via Horizon or CLI
Set up alerts for instance state changes
Regularly review OpenStack service logs
Use external monitoring tools (Prometheus, Nagios, etc.)

When to Contact Support

Contact InMotion Cloud support if you encounter:

Persistent ERROR states that cannot be resolved
Suspected hardware failures on compute nodes
Network-wide connectivity issues
Storage backend problems affecting multiple instances
OpenStack service failures (Nova, Neutron, Cinder)
Performance issues across multiple instances

Provide the following information when contacting support:

Instance ID
Error messages and fault details
Output of openstack server show <instance-id>
Recent actions performed on the instance
Network topology (if network-related)
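
The items above can be gathered into a single file before opening a ticket. This is a sketch under stated assumptions: the helper name collect_support_info and the output file name are hypothetical, and the event list is used here as one way to capture recent actions on the instance.

```shell
# Collect instance details, ports, recent events, and console output in one file.
# Prints the path of the file it wrote.
collect_support_info() {
    local server_id="$1"
    local out_file="${2:-support-$server_id.txt}"
    {
        echo "== server show =="
        openstack server show "$server_id"
        echo "== ports =="
        openstack port list --server "$server_id"
        echo "== recent events =="
        openstack server event list "$server_id"
        echo "== console log (last 50 lines) =="
        openstack console log show --lines 50 "$server_id"
    } > "$out_file" 2>&1
    echo "$out_file"
}

# Usage: collect_support_info <instance-id>
```

Review the file before sending it, since console output can contain hostnames or other details you may not want to share.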

Summary

OpenStack instance troubleshooting requires a systematic approach. Start by identifying the instance state and reviewing error messages. Check resource availability, network configuration, and security groups. Use OpenStack CLI commands and log files to diagnose issues. Most problems fall into categories of insufficient resources, network misconfiguration, or storage issues. When in doubt, recreate the instance with known-working configurations and escalate persistent issues to support.

Regular monitoring, proper documentation, and preventative measures significantly reduce troubleshooting time and improve instance reliability.