OpenStack Instance Troubleshooting Guide
Introduction
OpenStack instances (virtual machines) can encounter various issues during their lifecycle. When an instance fails to boot, loses network connectivity, or experiences performance degradation, rapid troubleshooting is essential to minimize downtime and maintain service reliability. This guide walks through the most common OpenStack instance problems and provides systematic solutions to resolve them quickly.
Understanding OpenStack Instance States
Before troubleshooting, it's important to understand instance states. An OpenStack instance can exist in several states:
- BUILD: Instance is being created
- ACTIVE: Instance is running normally
- ERROR: Instance failed during creation or operation
- SHUTOFF: Instance is powered off
- SUSPENDED: Instance has been suspended (RAM saved to disk)
- PAUSED: Instance is paused (RAM kept in memory)
- REBOOT: Instance is rebooting
Check your instance state using the OpenStack CLI:
openstack server show <instance-id>
Or via the Horizon dashboard under Compute → Instances.
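Each state above suggests a different first step. The sketch below encodes that mapping as a small shell helper. The function name `next_step` and the advice strings are illustrative assumptions based on the sections of this guide, not an official OpenStack workflow.

```shell
# next_step STATUS
# Print a suggested first check for a given instance status.
# The function name and advice strings are illustrative only.
next_step() {
    case "$1" in
        BUILD)     echo "wait, then check nova-compute.log if it stays in BUILD" ;;
        ACTIVE)    echo "instance is running; check networking if unreachable" ;;
        ERROR)     echo "read the fault field in 'openstack server show'" ;;
        SHUTOFF)   echo "start it with 'openstack server start'" ;;
        SUSPENDED) echo "resume it with 'openstack server resume'" ;;
        PAUSED)    echo "unpause it with 'openstack server unpause'" ;;
        REBOOT)    echo "wait for the reboot to finish" ;;
        *)         echo "unrecognized status: $1"; return 1 ;;
    esac
}

# Example (requires a configured OpenStack CLI):
#   next_step "$(openstack server show <instance-id> -f value -c status)"
```

The `-f value -c status` options make the CLI print only the status string, which keeps the helper free of any table or JSON parsing.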
Common Issue 1: Instance Won't Boot
Symptoms
- Instance stuck in BUILD state
- Instance enters ERROR state immediately after creation
- Instance shows ACTIVE but is unreachable
Diagnostic Steps
Check instance status:
openstack server show <instance-id> -f json
Look for the fault field, which contains error details when present.
Review compute logs:
# On the compute node
sudo tail -f /var/log/nova/nova-compute.log
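On a busy compute node the log mixes entries from many instances. A minimal sketch of a filter that keeps only error-level lines for one instance is shown below. The helper name `instance_errors` is hypothetical, and it assumes the default log layout in which the level (ERROR, CRITICAL) appears as a space-delimited word.

```shell
# instance_errors INSTANCE_ID
# Read nova log lines on stdin and print only ERROR/CRITICAL lines that
# mention the given instance ID. Assumes the log level appears as a
# space-delimited word, as in the default nova log format.
instance_errors() {
    grep -F -- "$1" | grep -E ' (ERROR|CRITICAL) '
}

# Example (on the compute node):
#   sudo cat /var/log/nova/nova-compute.log | instance_errors <instance-id>
```

Matching the ID with `grep -F` treats it as a literal string, so the hyphens and hex digits of a UUID are never interpreted as a pattern.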
Common causes:
- Insufficient resources: The compute node lacks CPU, RAM, or disk space
- Image issues: The image is corrupted or incompatible
- Flavor mismatch: The flavor specifies more resources than available
- Volume attachment failure: Boot volume cannot be attached
Solutions
Insufficient resources:
- Check available resources on compute nodes
- Migrate existing instances to free resources
- Add more compute capacity
- Choose a smaller flavor
Image problems:
- Verify image integrity:
openstack image show <image-id>
- Re-upload the image if corrupted
- Use a known-working image for testing
Flavor issues:
- List available flavors:
openstack flavor list
- Select a flavor that matches available resources
- Create a custom flavor if needed
Volume attachment failures:
- Check Cinder volume status:
openstack volume list
- Verify storage backend connectivity
- Review Cinder logs:
/var/log/cinder/cinder-volume.log
Common Issue 2: Network Connectivity Problems
Symptoms
- Cannot SSH into instance
- Instance cannot reach external networks
- Instance cannot communicate with other instances
Diagnostic Steps
Check security group rules:
openstack security group list
openstack security group rule list <security-group-name>
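Reading a long rule list by eye is error-prone, so one option is to check mechanically whether any rule covers the port you need. The sketch below assumes the rules have been reduced to one per line in the form `<protocol> <min>:<max>` (for example `tcp 22:22`). That input format and the helper name `port_allowed` are assumptions, and column names in the CLI output can differ between client versions.

```shell
# port_allowed PROTOCOL PORT
# Read rules on stdin, one per line, as "<protocol> <min>:<max>"
# (for example "tcp 22:22"). Exit 0 if any rule covers PORT.
# Lines without a numeric port range are skipped; this input format
# is an assumption, not a guaranteed CLI output format.
port_allowed() {
    awk -v proto="$1" -v port="$2" '
        $1 == proto && $2 ~ /^[0-9]+:[0-9]+$/ {
            split($2, r, ":")
            if (port + 0 >= r[1] + 0 && port + 0 <= r[2] + 0) found = 1
        }
        END { exit !found }
    '
}

# Example (column names may vary by client version):
#   openstack security group rule list --ingress <security-group-name> \
#       -f value -c "IP Protocol" -c "Port Range" | port_allowed tcp 22
```

A rule with no port range is skipped here, so a non-zero exit status means "no explicit match" rather than proof that traffic is blocked.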
Verify network configuration:
openstack server show <instance-id> | grep addresses
openstack port list --server <instance-id>
Test connectivity from the instance console:
- Access instance via VNC console in Horizon
- Run ping 8.8.8.8 to test external connectivity
- Run ip addr to verify IP assignment
- Run ip route to check routing table
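Pinging the default gateway before pinging an external address helps narrow down where connectivity breaks. The sketch below extracts the gateway from `ip route` output. The helper name `default_gateway` is hypothetical.

```shell
# default_gateway
# Read `ip route` output on stdin and print the default gateway address.
# Exits non-zero when no default route is present.
default_gateway() {
    awk '$1 == "default" && $2 == "via" { print $3; found = 1; exit }
         END { exit !found }'
}

# Example (inside the instance):
#   gw=$(ip route | default_gateway) && ping -c 3 "$gw"
#   ping -c 3 8.8.8.8
```

If the gateway answers but 8.8.8.8 does not, the fault is more likely upstream of the instance (router gateway or floating IP). If no default route is found at all, DHCP or static configuration inside the instance is the first suspect.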
Solutions
Security group blocking traffic:
Add rules to allow SSH and other required services:
openstack security group rule create --proto tcp --dst-port 22 <security-group-name>
openstack security group rule create --proto icmp <security-group-name>
No floating IP assigned:
Allocate and associate a floating IP:
openstack floating ip create <external-network>
openstack server add floating ip <instance-id> <floating-ip-address>
Network configuration issues:
- Verify router is attached to subnet:
openstack router show <router-id>
- Check router gateway:
openstack router show <router-id> | grep external_gateway
- Restart network agent if needed (requires admin access)
DHCP not working:
Access instance via console and configure static IP:
# Inside instance
sudo ip addr add <ip-address>/24 dev eth0
sudo ip route add default via <gateway-ip>
Then troubleshoot DHCP agent: openstack network agent list --agent-type dhcp
Common Issue 3: Performance Degradation
Symptoms
- Instance responding slowly
- High CPU wait time
- Network throughput lower than expected
- Disk I/O bottlenecks
Diagnostic Steps
Check instance metrics:
Access the instance and run:
# CPU usage
top
htop

# Disk I/O
iostat -x 1
iotop

# Network usage
iftop
nethogs

# Memory usage
free -h
vmstat 1
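To turn the "high CPU wait time" symptom into a number, the `wa` column of `vmstat` can be averaged over a few samples. The sketch below does that. The helper name `avg_iowait` is hypothetical, and the column is located from the header row because its position can vary between vmstat versions.

```shell
# avg_iowait
# Read `vmstat` output on stdin and print the average of the "wa" column
# (percentage of CPU time spent waiting on I/O), rounded down.
avg_iowait() {
    awk '
        # Locate the "wa" column from the header row instead of hard-coding it.
        !col { for (i = 1; i <= NF; i++) if ($i == "wa") col = i; next }
        # Only numeric sample rows contribute to the average.
        $col ~ /^[0-9]+$/ { sum += $col; n++ }
        END { if (n) print int(sum / n); else exit 1 }
    '
}

# Example (five one-second samples):
#   vmstat 1 5 | avg_iowait
```

The first sample row printed by `vmstat` reports averages since boot, so a longer run gives a more representative figure. A persistently high value points toward the disk I/O bottleneck rather than a CPU shortage.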
Check compute node load:
# On compute node (requires admin access)
uptime
top
virsh list --all
Review instance resource allocation:
openstack server show <instance-id>
openstack flavor show <flavor-id>
Solutions
CPU bottleneck:
- Resize instance to larger flavor:
openstack server resize --flavor <new-flavor> <instance-id>
- Identify CPU-intensive processes and optimize
- Spread load across multiple instances
Disk I/O bottleneck:
- Move to SSD-backed storage if using HDD volumes
- Increase volume IOPS allocation (if supported)
- Optimize application disk usage
- Consider using Cinder volume instead of ephemeral disk
Network bottleneck:
- Check for network congestion on compute node
- Verify QoS policies are not limiting bandwidth
- Use SR-IOV or DPDK for high-performance networking (if available)
Memory pressure:
- Resize to flavor with more RAM
- Identify memory leaks in applications
- Enable swap (not recommended for production)
Common Issue 4: Instance in ERROR State
Symptoms
- Instance shows ERROR state
- Cannot perform operations on instance
- Previous operations failed
Diagnostic Steps
Check fault message:
openstack server show <instance-id> -f json | grep fault
Review Nova logs:
sudo grep <instance-id> /var/log/nova/nova-compute.log
sudo grep <instance-id> /var/log/nova/nova-api.log
Solutions
Reset instance state (admin only):
openstack server set --state active <instance-id>
Delete and recreate:
If state reset doesn't work, rebuild from scratch:
openstack server delete <instance-id>
openstack server create --image <image> --flavor <flavor> --network <network> <new-name>
Fix underlying issue:
The ERROR state usually indicates a deeper problem. Address the root cause before resetting state:
- Storage backend failure
- Network configuration error
- Hypervisor issue
- Image corruption
Common Issue 5: Console Access Not Working
Symptoms
- Cannot access VNC console
- Console shows blank screen
- Console connection times out
Diagnostic Steps
Verify console URL:
openstack console url show <instance-id>
Check Nova console service:
# On controller node
sudo systemctl status nova-novncproxy
Review console logs:
sudo tail -f /var/log/nova/nova-novncproxy.log
Solutions
Restart console proxy:
sudo systemctl restart nova-novncproxy
Verify console port forwarding:
Ensure firewall allows console proxy port (typically 6080):
sudo firewall-cmd --list-ports
sudo firewall-cmd --add-port=6080/tcp --permanent
sudo firewall-cmd --reload
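To avoid adding a port that is already open, the `--list-ports` output can be checked first. This is a minimal sketch: the helper name `port_open` is hypothetical, and it matches exact entries only.

```shell
# port_open PORT/PROTO
# Read `firewall-cmd --list-ports` output on stdin and exit 0 if the
# exact entry (for example 6080/tcp) is listed. Port ranges such as
# 6000-6100/tcp are not expanded in this sketch.
port_open() {
    tr ' ' '\n' | grep -Fxq -- "$1"
}

# Example (on the controller node):
#   sudo firewall-cmd --list-ports | port_open 6080/tcp ||
#       sudo firewall-cmd --add-port=6080/tcp --permanent
```

Splitting the output into one entry per line and using `grep -x` prevents a partial match, so `16080/tcp` is not mistaken for `6080/tcp`.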
Use serial console instead:
openstack console url show --serial <instance-id>
Advanced Troubleshooting Techniques
Using Instance Console Logs
Retrieve system console output without VNC access:
openstack console log show <instance-id>
This shows boot messages and can reveal:
- Kernel panics
- Filesystem errors
- Network configuration issues
- Cloud-init failures
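Console logs can run to thousands of lines, so a quick scan for well-known failure strings is a reasonable first pass. The sketch below prints one label per problem class it finds. The helper name `scan_console_log`, the labels, and the patterns are illustrative examples and are far from exhaustive.

```shell
# scan_console_log
# Read console output on stdin and print one label per problem class found.
# The patterns are common examples, not an exhaustive list.
scan_console_log() {
    awk '
        /Kernel panic/                         { seen["kernel-panic"] = 1 }
        /EXT4-fs error|XFS.*corruption/        { seen["filesystem-error"] = 1 }
        /Network is unreachable/               { seen["network"] = 1 }
        /cloud-init.*(Traceback|failed|FAIL)/  { seen["cloud-init"] = 1 }
        END { for (k in seen) print k }
    ' | sort
}

# Example:
#   openstack console log show <instance-id> | scan_console_log
```

An empty result only means none of these patterns matched. The full log is still worth reading when the instance misbehaves.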
Checking Cloud-Init Status
If an instance boots but doesn't configure properly:
# Inside instance
sudo cloud-init status
sudo cloud-init analyze show
sudo cat /var/log/cloud-init.log
Verifying Volume Attachments
For instances with persistent volumes:
openstack volume list
openstack volume show <volume-id>
openstack server volume list <instance-id>
Detach and reattach if needed:
openstack server remove volume <instance-id> <volume-id>
openstack server add volume <instance-id> <volume-id>
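Detaching is asynchronous, so reattaching immediately can fail while the volume is still in the `detaching` state. One way to handle this is to poll until the volume reports `available`. The sketch below is a generic polling helper. The name `wait_for_output` and the attempt and delay values are assumptions.

```shell
# wait_for_output EXPECTED ATTEMPTS DELAY COMMAND [ARGS...]
# Run COMMAND repeatedly until it prints EXPECTED, sleeping DELAY seconds
# between tries. Returns 1 if ATTEMPTS runs out first.
wait_for_output() {
    expected=$1; attempts=$2; delay=$3
    shift 3
    while [ "$attempts" -gt 0 ]; do
        [ "$("$@" 2>/dev/null)" = "$expected" ] && return 0
        attempts=$((attempts - 1))
        [ "$attempts" -gt 0 ] && sleep "$delay"
    done
    return 1
}

# Example:
#   openstack server remove volume <instance-id> <volume-id>
#   wait_for_output available 30 2 \
#       openstack volume show <volume-id> -f value -c status &&
#       openstack server add volume <instance-id> <volume-id>
```

Bounding the number of attempts keeps the script from hanging when the storage backend never completes the detach, which is itself a useful signal to check the Cinder logs.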
Network Namespace Debugging
For deep network troubleshooting (requires admin access on network node):
# List network namespaces
sudo ip netns list

# Execute command in namespace
sudo ip netns exec qrouter-<router-id> ip addr
sudo ip netns exec qdhcp-<network-id> tcpdump -i any
Error Message Reference
"No valid host was found"
- Meaning: Scheduler could not find a compute node with sufficient resources
- Solution: Check compute node resources, verify placement service, check scheduler logs
"Build of instance failed: Block Device Mapping is Invalid"
- Meaning: Volume or image specification is incorrect
- Solution: Verify volume exists and is available, check image format compatibility
"Failed to allocate the network(s), not rescheduling"
- Meaning: Network allocation for the instance failed, often due to IP exhaustion
- Solution: Check available IPs in subnet, verify network quotas
"Instance failed to spawn"
- Meaning: Nova compute failed to create the VM
- Solution: Check compute node logs, verify libvirt/KVM status, check disk space
"Exceeded maximum number of retries"
- Meaning: Operation timed out after multiple attempts
- Solution: Check service health, network connectivity, increase timeout values
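The reference above can be turned into a small lookup so that a fault string copied from `openstack server show` maps straight to a first check. The sketch below is a direct transcription of the entries in this section. The function name `triage_fault` is hypothetical.

```shell
# triage_fault "MESSAGE"
# Map a fault message to the first check suggested in the reference above.
# Returns 1 when the message matches none of the known entries.
triage_fault() {
    case "$1" in
        *"No valid host was found"*)
            echo "check compute node resources, the placement service, and scheduler logs" ;;
        *"Block Device Mapping is Invalid"*)
            echo "verify the volume exists and is available; check image format" ;;
        *"Failed to allocate the network"*)
            echo "check free IPs in the subnet and network quotas" ;;
        *"Instance failed to spawn"*)
            echo "check compute node logs, libvirt/KVM status, and disk space" ;;
        *"Exceeded maximum number of retries"*)
            echo "check service health and network connectivity" ;;
        *)
            echo "no match; search the nova logs for the instance ID"; return 1 ;;
    esac
}

# Example:
#   triage_fault "No valid host was found. There are not enough hosts available."
```

Substring matching is used because real fault messages usually carry extra detail before or after the phrases listed here.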
Prevention and Best Practices
Before creating instances:
- Verify sufficient quota:
openstack quota show
- Check compute node resources
- Validate image compatibility
- Test network configuration
During instance lifecycle:
- Monitor resource usage regularly
- Keep security groups updated
- Back up critical instances
- Document custom configurations
Logging and monitoring:
- Enable instance monitoring via Horizon or CLI
- Set up alerts for instance state changes
- Regularly review OpenStack service logs
- Use external monitoring tools (Prometheus, Nagios, etc.)
When to Contact Support
Contact InMotion Cloud support if you encounter:
- Persistent ERROR states that cannot be resolved
- Suspected hardware failures on compute nodes
- Network-wide connectivity issues
- Storage backend problems affecting multiple instances
- OpenStack service failures (Nova, Neutron, Cinder)
- Performance issues across multiple instances
Provide the following information when contacting support:
- Instance ID
- Error messages and fault details
- Output of openstack server show <instance-id>
- Recent actions performed on the instance
- Network topology (if network-related)
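Gathering these items by hand during an incident is easy to get wrong, so a small collection script can help. The sketch below saves the relevant CLI output into a directory. The function name `collect_support_info`, the file names, and the `OS_CLI` override variable are all assumptions for illustration.

```shell
# collect_support_info INSTANCE_ID [OUTPUT_DIR]
# Save the details listed above into OUTPUT_DIR and print its path.
# OS_CLI can be overridden (for example in tests); it defaults to "openstack".
collect_support_info() {
    id=$1
    dir=${2:-support-$id}
    cli=${OS_CLI:-openstack}
    mkdir -p "$dir" || return 1
    # Instance details, including the fault field when present
    "$cli" server show "$id" -f json > "$dir/server-show.json" 2>&1
    # Boot and console messages
    "$cli" console log show "$id"    > "$dir/console.log" 2>&1
    # Recent actions performed on the instance
    "$cli" server event list "$id"   > "$dir/events.txt" 2>&1
    # Ports attached to the instance, for network-related issues
    "$cli" port list --server "$id"  > "$dir/ports.txt" 2>&1
    echo "$dir"
}

# Example:
#   collect_support_info <instance-id>
```

Errors from each command are written into the corresponding file instead of aborting the run, so a partial bundle is still produced when one service is unavailable.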
Summary
OpenStack instance troubleshooting requires a systematic approach. Start by identifying the instance state and reviewing error messages. Check resource availability, network configuration, and security groups. Use OpenStack CLI commands and log files to diagnose issues. Most problems fall into categories of insufficient resources, network misconfiguration, or storage issues. When in doubt, recreate the instance with known-working configurations and escalate persistent issues to support.
Regular monitoring, proper documentation, and preventative measures significantly reduce troubleshooting time and improve instance reliability.
