- Introduction
- VirtIO Framework
- CPU Tuning
- Memory Management
- NUMA Tuning
- Disk and Block I/O Tuning
- Network Tuning
- Time-keeping Best Practices
- Summary and Best Practices
KVM (Kernel-based Virtual Machine) is a full virtualization solution for Linux on x86 hardware containing virtualization extensions (Intel VT or AMD-V). Performance tuning in KVM environments is crucial for achieving optimal virtual machine performance and efficient resource utilization.
This guide covers comprehensive performance tuning techniques and best practices for KVM virtualization, including CPU optimization, memory management, NUMA awareness, I/O tuning, and network configuration.
VirtIO is a virtualization standard for network and disk device drivers where the guest's device driver "knows" it is running in a virtual environment and cooperates with the hypervisor. This cooperation enables better performance compared to traditional device emulation.
- Reduced Overhead: Minimizes the overhead of device emulation
- Better Performance: Provides near-native performance for I/O operations
- Standardization: Offers a common interface for various hypervisors
- virtio-net: Network device
- virtio-blk: Block device
- virtio-scsi: SCSI device
- virtio-balloon: Memory ballooning device
- virtio-rng: Random number generator
<devices>
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2' cache='none' io='native'/>
<source file='/var/lib/libvirt/images/vm.qcow2'/>
<target dev='vda' bus='virtio'/>
</disk>
<interface type='network'>
<source network='default'/>
<model type='virtio'/>
</interface>
</devices>
Proper vCPU allocation is critical for VM performance. Over-allocation can lead to CPU contention and degraded performance.
- Match Physical Cores: Allocate vCPUs based on physical CPU cores available
- Avoid Over-subscription: Keep vCPU to physical CPU ratio reasonable (typically 1:1 or 2:1)
- Consider Workload: Match vCPU count to application requirements
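The over-subscription guideline above can be checked with a little arithmetic. The following is a minimal sketch, not part of any KVM or libvirt tooling; the function names and the default `max_ratio` of 2.0 (taken from the 2:1 guideline) are illustrative assumptions.

```python
def vcpu_ratio(vcpus_per_vm, physical_cores):
    """Return the total number of allocated vCPUs divided by physical cores."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return sum(vcpus_per_vm) / physical_cores


def is_oversubscribed(vcpus_per_vm, physical_cores, max_ratio=2.0):
    """True when the vCPU:pCPU ratio exceeds max_ratio (hypothetical threshold)."""
    return vcpu_ratio(vcpus_per_vm, physical_cores) > max_ratio


if __name__ == "__main__":
    # Hypothetical host: 16 physical cores, four guests.
    guests = [4, 4, 8, 2]
    print("ratio:", vcpu_ratio(guests, 16))
    print("oversubscribed:", is_oversubscribed(guests, 16))
```

A ratio at or below 1.0 means every vCPU can have its own core; anything above the chosen threshold is a hint to revisit the allocation rather than a hard limit.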
<vcpu placement='static'>4</vcpu>
# Set the maximum vCPU count for a domain (takes effect on next boot)
virsh setvcpus <domain> <count> --config --maximum
# Set the current vCPU count
virsh setvcpus <domain> <count> --config
CPU pinning (also called CPU affinity) binds vCPUs to specific physical CPUs, reducing vCPU migration between cores and improving cache locality.
- Reduced CPU migration overhead
- Better cache utilization
- Improved performance consistency
- Reduced latency for real-time workloads
<cputune>
<vcpupin vcpu='0' cpuset='0'/>
<vcpupin vcpu='1' cpuset='1'/>
<vcpupin vcpu='2' cpuset='2'/>
<vcpupin vcpu='3' cpuset='3'/>
<emulatorpin cpuset='0-1'/>
</cputune>
# Pin vCPU to physical CPU
virsh vcpupin <domain> <vcpu> <cpuset> --config
# Example: Pin vCPU 0 to physical CPU 2
virsh vcpupin myvm 0 2 --config
# View current pinning
virsh vcpupin <domain>
For NUMA systems, pin vCPUs to CPUs within the same NUMA node:
# Check NUMA topology
numactl --hardware
# Pin vCPUs to NUMA node 0
virsh vcpupin myvm 0 0-7 --config
virsh vcpupin myvm 1 0-7 --config
Configuring CPU topology helps the guest OS make better scheduling decisions and improves performance for NUMA-aware applications.
<cpu mode='host-passthrough'>
<topology sockets='1' cores='4' threads='2'/>
<numa>
<cell id='0' cpus='0-3' memory='4194304' unit='KiB'/>
<cell id='1' cpus='4-7' memory='4194304' unit='KiB'/>
</numa>
</cpu>
| Mode | Description | Use Case |
|---|---|---|
| host-passthrough | Exposes host CPU features directly | Best performance, limited migration |
| host-model | Uses closest CPU model | Good performance, better migration |
| custom | Manually specify CPU model | Maximum compatibility |
<cpu mode='custom' match='exact'>
<model fallback='allow'>Skylake-Server</model>
<topology sockets='2' cores='4' threads='2'/>
<feature policy='require' name='vmx'/>
</cpu>
Proper memory allocation ensures VMs have sufficient memory while avoiding over-commitment that can lead to swapping.
<memory unit='GiB'>8</memory>
<currentMemory unit='GiB'>8</currentMemory>
<memory unit='GiB'>16</memory>
<currentMemory unit='GiB'>8</currentMemory>
<memballoon model='virtio'>
<stats period='10'/>
</memballoon>
- Allocate memory based on workload requirements
- Leave sufficient memory for the host OS (typically 2-4 GB minimum)
- Monitor memory usage and adjust as needed
- Use memory ballooning for dynamic adjustment
Fine-tune memory parameters for optimal performance.
<memtune>
<hard_limit unit='GiB'>10</hard_limit>
<soft_limit unit='GiB'>8</soft_limit>
<swap_hard_limit unit='GiB'>12</swap_hard_limit>
<min_guarantee unit='GiB'>6</min_guarantee>
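</memtune>
- hard_limit: Maximum memory the VM can use
- soft_limit: Memory limit enforced during memory contention
- swap_hard_limit: Maximum memory + swap; must be larger than hard_limit
- min_guarantee: Minimum memory guaranteed to the VM (per the libvirt documentation, supported only by the VMware ESX and OpenVZ drivers, so QEMU/KVM guests should omit it)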
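The relationships between the memtune limits can be sanity-checked before editing the domain XML. This is a minimal sketch under stated assumptions: `check_memtune` is a hypothetical helper, all values are in GiB, and the only rules it encodes are that the soft limit should not exceed the hard limit and that swap_hard_limit must be greater than hard_limit.

```python
def check_memtune(soft_gib, hard_gib, swap_hard_gib):
    """Return a list of problems found in a set of memtune limits (GiB)."""
    problems = []
    if soft_gib > hard_gib:
        problems.append("soft_limit exceeds hard_limit")
    if swap_hard_gib <= hard_gib:
        problems.append("swap_hard_limit must be greater than hard_limit")
    return problems


if __name__ == "__main__":
    # Values from the memtune example: soft=8, hard=10, swap_hard=12.
    print(check_memtune(8, 10, 12) or "limits look consistent")
```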
Configure memory backing for performance optimization.
<memoryBacking>
<hugepages>
<page size='1' unit='GiB'/>
</hugepages>
<locked/>
<nosharepages/>
</memoryBacking>
- hugepages: Use hugepages for VM memory
- locked: Lock VM memory in RAM (prevent swapping)
- nosharepages: Disable KSM for this VM
- source type='file': Use file-backed memory
Hugepages reduce TLB (Translation Lookaside Buffer) misses and improve memory performance.
Hugepages were introduced in the Linux kernel to improve the performance of memory management. Memory is managed in blocks known as pages.
Page Size Basics:
- Standard Pages: x86 CPUs usually address memory in 4 KB pages
- Hugepages: CPUs are capable of using larger pages:
  - 2 MB pages (common)
  - 1 GB pages (for very large memory systems)
Why Hugepages?
When a system needs to handle huge amounts of memory, there are two options:
- Increase page table entries in hardware MMU (expensive and limited)
  - Modern processors support only hundreds or thousands of page table entries
  - Hardware struggles with a high number of entries or frequent manipulations
- Increase page size (hugepages - more efficient)
  - 2 MB pages are suitable for managing multiple gigabytes of memory
  - 1 GB pages are best for scaling to terabytes of memory
Translation Lookaside Buffer (TLB):
A TLB is a cache used for virtual-to-physical address translations. It's a very scarce resource on processors. Operating systems try to make the best use of limited TLB resources. This optimization is more critical now as bigger physical memories (several GB) are more readily available.
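To make the TLB argument concrete, the amount of memory a TLB can map without a miss (its "reach") is simply the number of entries multiplied by the page size. The sketch below assumes a hypothetical TLB with 1536 entries; real entry counts vary by CPU model and TLB level.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def tlb_reach_bytes(entries, page_size_bytes):
    """Memory covered by a fully populated TLB, in bytes."""
    return entries * page_size_bytes


if __name__ == "__main__":
    entries = 1536  # hypothetical entry count, for illustration only
    for label, size in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB), ("1 GiB", GIB)):
        reach_mib = tlb_reach_bytes(entries, size) / MIB
        print(f"{label} pages: reach = {reach_mib:g} MiB")
```

With the same number of entries, moving from 4 KiB to 2 MiB pages increases the reach by a factor of 512, which is why hugepages reduce TLB misses for large guests.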
Benefits:
- Reduced TLB misses
- Lower memory management overhead
- Better performance for memory-intensive workloads
- Improved memory access latency
- More efficient use of CPU cache
1. Check Current Configuration:
# Check current hugepage configuration
cat /proc/meminfo | grep Huge
# Example output:
# AnonHugePages: 0 kB
# HugePages_Total: 0
# HugePages_Free: 0
# HugePages_Rsvd: 0
# HugePages_Surp: 0
# Hugepagesize: 2048 kB
# Check hugepage size
cat /proc/meminfo | grep Hugepagesize
2. Configure Hugepages:
# View current hugepages value
cat /proc/sys/vm/nr_hugepages
# Or using sysctl
sysctl -a | grep huge
# Configure hugepages (2MB pages)
# Example: Set 1024 hugepages (1024 * 2MB = 2GB)
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# Configure 1GB hugepages
# Example: Set 8 hugepages (8 * 1GB = 8GB)
# Note: 1GB pages often cannot be allocated at runtime once memory is fragmented;
# reserving them at boot (hugepagesz=1G hugepages=8 on the kernel command line) is more reliable
echo 8 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
# Make persistent: add this line to /etc/sysctl.conf
#   vm.nr_hugepages = 1024
# Apply sysctl changes
sysctl -p
Important Note: Memory reserved for hugepages cannot be used by applications that are not hugepage-aware. Over-allocating hugepages can starve the host of regular memory and affect its normal operation.
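The value written to nr_hugepages follows directly from the guest memory size. The following sketch shows the arithmetic; `hugepages_needed` is a hypothetical helper, and it covers only the guest RAM itself, not any extra memory QEMU may need.

```python
def hugepages_needed(guest_mem_mib, page_size_kib=2048):
    """Number of hugepages required to back guest_mem_mib of guest RAM."""
    if guest_mem_mib <= 0 or page_size_kib <= 0:
        raise ValueError("sizes must be positive")
    guest_kib = guest_mem_mib * 1024
    # Ceiling division: a partially used page still has to be reserved.
    return -(-guest_kib // page_size_kib)


if __name__ == "__main__":
    print("8 GiB guest, 2 MiB pages:", hugepages_needed(8 * 1024))
    print("8 GiB guest, 1 GiB pages:", hugepages_needed(8 * 1024, 1048576))
```

Summing this value over all hugepage-backed guests gives the total to reserve on the host.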
3. Mount Hugepages:
# Create mount point
mkdir -p /dev/hugepages
# Mount hugepages
mount -t hugetlbfs hugetlbfs /dev/hugepages
# Verify mount
mount | grep huge
# Make persistent: add this line to /etc/fstab
#   hugetlbfs /dev/hugepages hugetlbfs defaults 0 0
4. Restart libvirtd:
# Restart libvirtd service
systemctl restart libvirtd
5. Configure VM to Use Hugepages:
<memoryBacking>
<hugepages>
<page size='1' unit='GiB' nodeset='0'/>
</hugepages>
</memoryBacking>
6. Verify VM is Using Hugepages:
# Start (or restart) the hugepage-configured guest
virsh start <vm-name>
# Verify on host
cat /proc/meminfo | grep -i huge
# Check which process is using hugepages
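# The leading space in " 0 kB" keeps non-zero values such as "10240 kB" from being filtered out
grep -i hugepages /proc/*/smaps | grep -v " 0 kB"
Transparent Hugepages (THP) is an abstraction layer that automates hugepage allocation, so applications do not need to reserve hugepages explicitly.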
What is THP?
Transparent Hugepage support can be:
- Entirely disabled
- Enabled only inside MADV_HUGEPAGE regions (to avoid consuming more memory)
- Enabled system-wide
THP Modes:
| Mode | Description | Use Case |
|---|---|---|
| always | Always use THP | Maximum performance, may use more memory |
| madvise | Use hugepages only in VMAs marked with MADV_HUGEPAGE | Balanced approach, application control |
| never | Disable THP | When THP causes issues or not needed |
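The kernel reports the THP mode as a single line listing all modes with the active one in square brackets, for example `always [madvise] never`. A minimal sketch for extracting the active mode follows; the helper names are illustrative, and `read_thp_mode` assumes the standard sysfs path exists on the host.

```python
import re

THP_ENABLED_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_thp_mode(text):
    """Return the bracketed (active) mode from the sysfs 'enabled' line."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no active mode found in {text!r}")
    return match.group(1)


def read_thp_mode(path=THP_ENABLED_PATH):
    """Read the active THP mode from sysfs (requires a Linux host with THP)."""
    with open(path) as handle:
        return active_thp_mode(handle.read())


if __name__ == "__main__":
    print(active_thp_mode("always [madvise] never"))
```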
Check Current THP Setting:
# Check current setting
cat /sys/kernel/mm/transparent_hugepage/enabled
# Example output:
# always [madvise] never
# The value in brackets [] is currently active
Configure THP:
# Enable always
echo always > /sys/kernel/mm/transparent_hugepage/enabled
# Enable madvise (recommended)
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
# Disable THP
echo never > /sys/kernel/mm/transparent_hugepage/enabled
# Make persistent (add to /etc/rc.local or systemd service)
THP Benefits:
- System settings for performance automatically optimized
- Allows all free memory to be used as cache
- Performance increased by better memory utilization
- Can coexist with static hugepages
THP and KVM:
- When static hugepages are not used, KVM will use transparent hugepages instead of regular 4 KB page size
- Advantages: Less memory used for page tables, reduced TLB misses, increased performance
- Important: When static hugepages back guest memory, that memory can no longer be swapped or ballooned
THP vs Static Hugepages:
| Feature | Static Hugepages | Transparent Hugepages |
|---|---|---|
| Configuration | Manual | Automatic |
| Memory Reservation | Pre-allocated | Dynamic |
| Swapping | Not supported | Supported (when not using hugepages) |
| Ballooning | Not supported | Supported (when not using hugepages) |
| Best For | Predictable workloads | Dynamic workloads |
| Performance | Slightly better | Very good |
- Calculate Requirements: Determine total memory needed for all VMs
- Reserve Appropriately: Don't over-allocate; leave memory for host OS
- Use 1GB Pages: For very large memory VMs (>64GB)
- Use 2MB Pages: For most standard VMs
- Monitor Usage: Regularly check hugepage utilization
- NUMA Awareness: Allocate hugepages on correct NUMA nodes
- THP for Dynamic: Use THP when workload is unpredictable
- Static for Production: Use static hugepages for production VMs with known memory requirements
Kernel Samepage Merging (KSM) allows the kernel to merge identical memory pages across VMs, reducing memory usage.
# Enable KSM
echo 1 > /sys/kernel/mm/ksm/run
# Configure scan parameters
echo 100 > /sys/kernel/mm/ksm/sleep_millisecs
echo 1000 > /sys/kernel/mm/ksm/pages_to_scan
# Check KSM statistics
cat /sys/kernel/mm/ksm/pages_shared
cat /sys/kernel/mm/ksm/pages_sharing
cat /sys/kernel/mm/ksm/pages_unshared
- Pros: Reduces memory usage, allows higher VM density
- Cons: CPU overhead for scanning, potential security concerns
- Best for: Environments with many similar VMs (e.g., VDI)
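The KSM counters can be turned into a rough estimate of how much memory is being saved. In the kernel's accounting, pages_sharing counts the additional mappings that reuse an already shared page, so it approximates the number of pages saved. The sketch below assumes 4 KiB base pages; the function names are illustrative.

```python
def ksm_saved_bytes(pages_sharing, page_size=4096):
    """Approximate memory saved by KSM, in bytes."""
    return pages_sharing * page_size


def ksm_sharing_ratio(pages_shared, pages_sharing):
    """Ratio of pages_sharing to pages_shared; higher means better sharing."""
    if pages_shared == 0:
        return 0.0
    return pages_sharing / pages_shared


if __name__ == "__main__":
    # Hypothetical counter values read from /sys/kernel/mm/ksm/.
    shared, sharing = 50_000, 262_144
    print("saved MiB:", ksm_saved_bytes(sharing) / 1024 ** 2)
    print("sharing ratio:", ksm_sharing_ratio(shared, sharing))
```

A low ratio combined with a high pages_unshared value suggests the scanning cost is not paying off on that host.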
Non-Uniform Memory Access (NUMA) tuning is critical for performance on multi-socket systems.
Understanding and configuring NUMA topology ensures optimal memory access patterns.
# View NUMA topology
numactl --hardware
# View NUMA statistics
numastat
# Check CPU to NUMA node mapping
lscpu | grep NUMA
<cpu mode='host-passthrough'>
<numa>
<cell id='0' cpus='0-3' memory='4' unit='GiB'/>
<cell id='1' cpus='4-7' memory='4' unit='GiB'/>
</numa>
</cpu>
Configure NUMA memory allocation policies for optimal performance.
<numatune>
<memory mode='strict' nodeset='0'/>
<memnode cellid='0' mode='strict' nodeset='0'/>
<memnode cellid='1' mode='strict' nodeset='1'/>
</numatune>
| Mode | Description | Use Case |
|---|---|---|
| strict | Allocate only from specified nodes | Best performance, may fail if memory unavailable |
| preferred | Prefer specified nodes, fallback allowed | Good performance with flexibility |
| interleave | Distribute across nodes | Balanced memory bandwidth |
| restrictive | Restrict to specified nodes | Resource isolation |
<numatune>
<memory mode='strict' nodeset='0-1'/>
</numatune>
<cputune>
<vcpupin vcpu='0' cpuset='0-7'/>
<vcpupin vcpu='1' cpuset='8-15'/>
</cputune>
Configure automatic NUMA balancing for dynamic workloads.
The main aim of automatic NUMA balancing is to improve the performance of different applications running in a NUMA-aware system. The strategy behind its design is simple: an application will generally perform best when the threads of its processes are accessing memory on the same NUMA node where the threads are scheduled by the kernel.
Automatic NUMA balancing moves tasks (threads or processes) closer to the memory they are accessing. It also moves application data to memory closer to the tasks that reference it. This is all done automatically by the kernel when automatic NUMA balancing is active.
Automatic NUMA balancing will be enabled when booted on hardware with NUMA properties. The main conditions or criteria are:
- numactl --hardware: Shows multiple nodes
- cat /sys/kernel/debug/sched_features: Shows NUMA in the flags
Example output:
[ humble-numaserver ]$ cat /sys/kernel/debug/sched_features
GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY
WAKEUP_PREEMPTION ARCH_POWER NO_HRTICK NO_DOUBLE_TICK LB_BIAS NONTASK_POWER
TTWU_QUEUE NO_FORCE_SD_OVERLAP RT_RUNTIME_SHARE NO_LB_MIN NUMA
NUMA_FAVOUR_HIGHER NO_NUMA_RESIST_LOWER
# Check if enabled
cat /proc/sys/kernel/numa_balancing
# Enable automatic NUMA balancing
echo 1 > /proc/sys/kernel/numa_balancing
# Disable automatic NUMA balancing
echo 0 > /proc/sys/kernel/numa_balancing
# Make persistent: add this line to /etc/sysctl.conf
#   kernel.numa_balancing = 1
# Scan delay (milliseconds)
echo 1000 > /proc/sys/kernel/numa_balancing_scan_delay_ms
# Scan period (milliseconds)
echo 1000 > /proc/sys/kernel/numa_balancing_scan_period_min_ms
echo 60000 > /proc/sys/kernel/numa_balancing_scan_period_max_ms
# Scan size (MB)
echo 256 > /proc/sys/kernel/numa_balancing_scan_size_mb
The automatic NUMA balancing mechanism works based on several algorithms and data structures:
- NUMA hinting page faults: Triggers for memory migration decisions
- NUMA page migration: Moves pages closer to accessing CPUs
- Task grouping: Groups related tasks together
- Fault statistics: Tracks memory access patterns
- Task placement: Optimizes task scheduling
- Pseudo-interleaving: Balances memory across nodes
- Manual NUMA tuning of applications will override automatic NUMA balancing
- Disables the periodic unmapping of memory, NUMA faults, migration, and automatic NUMA placement for manually tuned applications
- In some cases, system-wide manual NUMA tuning is preferred
- Best Practice: Limit guest resources to the amount available on a single NUMA node to avoid unnecessarily splitting resources across NUMA nodes
- Pin VMs to NUMA nodes: Avoid cross-NUMA memory access
- Match vCPU and memory: Allocate both from same NUMA node
- Monitor NUMA statistics: Check for remote memory access
- Size VMs appropriately: Don't exceed single NUMA node capacity when possible
- Consider workload: Automatic balancing works best for dynamic workloads
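The sizing advice above amounts to checking whether a guest's vCPUs and memory both fit on one NUMA node. A minimal sketch follows; the `nodes` dictionary is a hypothetical, hand-entered summary of free resources per node (for example, transcribed from `numactl --hardware`), not something this code discovers itself.

```python
def nodes_that_fit(vm_vcpus, vm_mem_gib, nodes):
    """Return IDs of NUMA nodes able to hold the whole guest.

    nodes maps node ID -> (free_cpus, free_mem_gib).
    """
    return [
        node_id
        for node_id, (free_cpus, free_mem_gib) in sorted(nodes.items())
        if free_cpus >= vm_vcpus and free_mem_gib >= vm_mem_gib
    ]


if __name__ == "__main__":
    # Hypothetical two-node host.
    host = {0: (8, 16), 1: (8, 64)}
    print("4 vCPU / 32 GiB guest fits on nodes:", nodes_that_fit(4, 32, host))
    print("16 vCPU / 8 GiB guest fits on nodes:", nodes_that_fit(16, 8, host))
```

An empty result means the guest has to span nodes, in which case exposing a matching guest NUMA topology becomes more important.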
numad is a user-level daemon that provides placement advice and process management for efficient use of CPUs and memory on systems with NUMA topology. It monitors NUMA topology and resource usage within a system to dynamically improve NUMA resource allocation and management.
Key Features:
- Monitors NUMA topology and resource usage
- Attempts to locate processes for efficient NUMA locality and affinity
- Dynamically adjusts to changing system conditions
- Provides guidance for initial manual binding of CPU and memory resources
When to Use numad:
- Best for: Server consolidation environments with multiple applications or virtual guests
- Most effective: When processes can be localized in a subset of the system's NUMA nodes
- Not recommended: For systems dedicated to large in-memory databases with unpredictable memory access patterns
Starting numad:
# Start numad as executable
numad
# Check if running
ps aux | grep numad
# Stop numad
numad -i 0
Important Notes:
- When numad is enabled, its behavior overrides the default behavior of automatic NUMA balancing
- Stopping numad does not remove the changes it has made to improve NUMA affinity
- If system use changes significantly, running numad again will adjust the affinity
numad Log File:
Monitor numad activity in /var/log/numad:
# Example log entries
Tue Nov 17 06:49:43 2015: Changing THP scan time in /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs from 10000 to 1000 ms.
Tue Nov 17 06:49:43 2015: Registering numad version 20140225 PID 9170
Tue Nov 17 06:49:45 2015: Advising pid 1479 (qemu-kvm) move from nodes (0-1) to nodes (1)
Tue Nov 17 06:49:47 2015: Including PID: 1479 in CPUset: /sys/fs/cgroup/CPUset/machine.slice/machine-qemu\x2dfedora21.scope/emulator
Tue Nov 17 06:49:48 2015: PID 1479 moved to node(s) 1 in 3.33 seconds
numastat shows per-NUMA-node memory statistics for processes and the operating system.
Usage:
# Show NUMA statistics for all processes
numastat
# Show statistics for specific process (e.g., qemu-kvm)
numastat -p $(pgrep qemu-kvm)
# Show statistics with process name
numastat qemu-kvm
Package Information:
- The numactl package provides the numactl binary/command
- The numad package provides the numad binary/command
Monitoring NUMA Performance:
Use numastat to monitor the difference before and after running the numad service to verify performance improvements.
KSM is capable of detecting that a system is using NUMA memory and controlling merging pages across different NUMA nodes.
# Check merge_across_nodes setting
cat /sys/kernel/mm/ksm/merge_across_nodes
# By default, pages from all nodes can be merged (value = 1)
# Set to 0 to merge only pages from the same node
echo 0 > /sys/kernel/mm/ksm/merge_across_nodes
Important Considerations:
- When KSM merges across nodes on a NUMA host with multiple guest virtual machines, guests and CPUs from more distant nodes can suffer a significant increase in access latency to the merged KSM page
- Unless you are oversubscribing or overcommitting system memory, you will get better runtime performance by disabling KSM sharing
- Use the nosharepages option in guest XML to disable KSM for specific guests
Important Warning:
It is harder to live-migrate a pinned guest across hosts because a similar set of backing resources/configurations may not be available on the destination or target host where the VM is getting migrated. For example, the target host may have a different NUMA topology.
You should consider this fact when you tune a KVM environment. Automatic NUMA balancing may help, to a certain extent, reduce the need for manually pinning guest resources.
Optimizing disk I/O is crucial for overall VM performance.
The virtual disk of a VM can be either a block device or an image file.
Key Observations:
- Block Device Backend: Preferred for better VM performance over image files on remote filesystems (NFS, GlusterFS, etc.)
- File Backend: Helps virt admins better manage guest disks and is useful in many scenarios
- Mixed Usage: No restriction on mixing block devices and files as storage disks for the same guest
- Disk Limit: Total number of virtual disks that can be attached to a VM has a limit
When an application inside a guest OS writes data to local storage, the I/O request traverses through multiple layers:
- Guest Filesystem: Application writes to guest filesystem
- Guest I/O Subsystem: Request passes through guest OS I/O subsystem
- qemu-kvm Process: Hypervisor receives request from guest OS
- Host Processing: Hypervisor processes I/O like any other host application
This multi-layer traversal explains why block device backends perform better than image file backends.
- Additional Resource Demand: File image is part of host filesystem, creating additional I/O overhead compared to block devices
- Sparse Images: Using sparse image files helps over-allocate host storage but reduces virtual disk performance
- Partition Alignment: Improper partitioning of guest storage with disk image files can cause unnecessary I/O operations due to misalignment of standard partition units
Choose the appropriate storage backend for your workload.
| Backend | Description | Use Case | Performance |
|---|---|---|---|
| qcow2 | QEMU Copy-On-Write format | Snapshots, thin provisioning | Good (with overhead) |
| raw | Raw disk image | Best performance | Excellent |
| LVM | Logical Volume Manager | Good performance, flexibility | Very Good |
| Ceph/RBD | Distributed storage | Scalability, high availability | Good (network dependent) |
| Block Device | Direct block device | Production workloads | Best |
Recommendation: Use raw format for best performance. The qcow2 format has performance overhead due to the format layer operations (e.g., allocating new clusters when growing images). However, qcow2 is useful when features like snapshots are required.
Always use the virtio disk bus when configuring disks rather than the IDE bus. The virtio_blk driver uses the VirtIO API to provide high performance for storage I/O devices, significantly increasing storage performance, especially in large enterprise storage systems.
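virtio-blk and virtio-scsi are both paravirtualized storage controllers, but they have different characteristics. The first example below uses virtio-blk; the second attaches the disk through a virtio-scsi controller: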
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2' cache='none' io='native'/>
<source file='/var/lib/libvirt/images/vm.qcow2'/>
<target dev='vda' bus='virtio'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</disk>
<controller type='scsi' index='0' model='virtio-scsi'>
<driver queues='4'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</controller>
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2' cache='none' io='native'/>
<source file='/var/lib/libvirt/images/vm.qcow2'/>
<target dev='sda' bus='scsi'/>
<address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>
Configure appropriate cache mode for your workload.
┌─────────────────────────────────────────────────────────────────────┐
│ Disk Cache Modes │
├─────────────┬─────────────┬─────────────┬─────────────────────────┤
│ writeback │writethrough │ none │ directsync │
├─────────────┼─────────────┼─────────────┼─────────────────────────┤
│ │ │ │ │
│ Guest │ Guest │ Guest │ Guest │
│ │ │ │ │ │ │ │ │
│ ▼ │ ▼ │ ▼ │ ▼ │
│ R W │ R W │ R W │ R W │
│ │ │ │ │ │ │ │ │ │ │ │ │
│ ▼ ▼ │ ▼ │ │ │ │ │ │ │ │
│ Host Page │ Host Page │ │ │ │ │ │ │
│ Cache │ Cache │ │ │ │ │ │ │
│ │ │ │ │ │ │ │ │ │ │ │ │
│ ▼ │ │ ▼ ▼ │ ▼ ▼ │ ▼ ▼ │
│ Disk Cache │ Disk Cache │ Disk Cache │ Physical Disk │
│ │ │ │ │ │ │ │ │ │ │ │ │
│ ▼ ▼ │ ▼ ▼ │ ▼ ▼ │ ▼ ▼ │
│ Physical │ Physical │ Physical │ │
│ Disk │ Disk │ Disk │ │
└─────────────┴─────────────┴─────────────┴─────────────────────────┘
Legend:
R = Read operations
W = Write operations
│ = Data flow path
▼ = Direction of data flow
| Mode | Host Page Cache | Disk Write Cache | O_DIRECT | O_DSYNC | Use Case |
|---|---|---|---|---|---|
| none | Bypassed | Used | Yes | No | Best for production, data integrity, supports migration |
| writethrough | Used for reads | Bypassed | No | Yes | Balanced performance and safety |
| writeback | Used | Used | No | No | Best performance, risk of data loss |
| directsync | Bypassed | Bypassed | Yes | Yes | Maximum data integrity |
| unsafe | Used | Used, flush requests ignored | No | No | Testing only, high data loss risk |
| default | System default | System default | - | - | Uses system's default settings |
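The cache modes can also be viewed as combinations of three QEMU block-layer settings: whether writes complete before reaching stable storage (`cache.writeback`), whether the host page cache is bypassed (`cache.direct`), and whether guest flush requests are ignored (`cache.no-flush`). The sketch below encodes that mapping as plain data; the dictionary keys and helper names are illustrative, not a QEMU or libvirt API.

```python
# Mapping of cache mode -> QEMU cache.writeback / cache.direct / cache.no-flush
CACHE_MODES = {
    "writeback":    {"writeback": True,  "direct": False, "no_flush": False},
    "none":         {"writeback": True,  "direct": True,  "no_flush": False},
    "writethrough": {"writeback": False, "direct": False, "no_flush": False},
    "directsync":   {"writeback": False, "direct": True,  "no_flush": False},
    "unsafe":       {"writeback": True,  "direct": False, "no_flush": True},
}


def uses_host_page_cache(mode):
    """True when guest I/O goes through the host page cache (no O_DIRECT)."""
    return not CACHE_MODES[mode]["direct"]


def writes_are_synchronous(mode):
    """True when a write is reported complete only once it is on storage."""
    return not CACHE_MODES[mode]["writeback"]


if __name__ == "__main__":
    for name in CACHE_MODES:
        print(name, uses_host_page_cache(name), writes_are_synchronous(name))
```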
cache=none (Recommended for Production)
- Uses O_DIRECT flag to bypass host page cache
- I/O happens directly between qemu-kvm userspace buffers and storage device
- Guest I/O not cached on host, but may be kept in writeback disk cache
- Best choice for guests with large I/O requirements
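- In most storage setups, libvirt considers only cache=none (and cache=directsync) safe for live migration and refuses other cache modes unless the migration is forced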
- Semantics: Host page cache bypassed, disk write cache used
cache=writethrough
- Matches O_DSYNC semantics
- Guest I/O cached on host but written through to physical medium
- Writes reported as completed only when data committed to storage device
- Slower and prone to scaling problems
- Suitable for small number of guests with lower I/O requirements
- Recommended for guests that don't support writeback cache (when migration not needed)
cache=writeback
- Guest I/O cached on host
- Writes reported as completed when placed in host page cache
- Normal page cache management handles commitment to storage
- Host page cache used, writes reported to guest as completed when in cache
- Best performance but risk of data loss on host failure
cache=directsync
- Similar to writethrough but bypasses host page cache
- I/O from guest bypasses host page cache
- Writes reported as completed only when committed to storage device
- Use when writethrough behavior desired but also want to bypass host page cache
cache=unsafe
- Host may cache all disk I/O
- Sync requests from guests are ignored
- Huge risk of data loss in event of host failure
- May be useful for guest installation or similar non-critical tasks
- Never use in production
<disk type='file' device='disk'>
<driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
<source file='/var/lib/libvirt/images/vm.img'/>
<target dev='vda' bus='virtio'/>
</disk>
- Production Systems: Use cache='none' for best balance of performance and data integrity
- Development/Testing: cache='writeback' for maximum performance (accept the risk)
- Small Deployments: cache='writethrough' for safety with moderate performance
- Maximum Safety: cache='directsync' when data integrity is paramount
- Never in Production: cache='unsafe' - only for temporary, non-critical operations
Fine-tune I/O performance with iotune parameters.
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2'/>
<source file='/var/lib/libvirt/images/vm.qcow2'/>
<target dev='vda' bus='virtio'/>
<iotune>
<!-- total_* limits cannot be combined with the matching read_*/write_* limits; use one form or the other -->
<read_bytes_sec>52428800</read_bytes_sec>
<write_bytes_sec>52428800</write_bytes_sec>
<read_iops_sec>500</read_iops_sec>
<write_iops_sec>500</write_iops_sec>
</iotune>
</disk>
Enable multi-queue for better I/O performance on multi-core systems.
<controller type='scsi' index='0' model='virtio-scsi'>
<driver queues='4' iothread='1'/>
</controller>
Use I/O threads to offload I/O processing from vCPU threads.
<iothreads>4</iothreads>
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2' iothread='1'/>
<source file='/var/lib/libvirt/images/vm.qcow2'/>
<target dev='vda' bus='virtio'/>
</disk>
- Use VirtIO drivers: Always use VirtIO for best performance
- Choose appropriate cache mode: Use cache='none' for production
- Enable native AIO: Use io='native' for better async I/O
- Use raw format when possible: Better performance than qcow2
- Enable discard/TRIM: Use discard='unmap' for SSD optimization
- Configure I/O threads: Separate I/O processing from vCPU threads
- Use multi-queue: Enable for multi-core VMs
Network performance is critical for many virtualized workloads.
Best Practice: Segregate network traffic to avoid congestion in KVM setups.
- Use dedicated networks for different traffic types:
  - Management traffic
  - Backup traffic
  - Live migration traffic
  - Production/application traffic
Important: Avoid multiple network interfaces for the same network segment. If you must use multiple interfaces on the same segment, apply network tuning such as arp_filter to prevent ARP Flux.
ARP Flux is an undesirable condition that can occur in both hosts and guests, caused by the machine responding to ARP requests from more than one network interface.
# Enable arp_filter
echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
# Make persistent: add this line to /etc/sysctl.conf
#   net.ipv4.conf.all.arp_filter = 1
Always use the VirtIO network driver for best performance.
<interface type='network'>
<source network='default'/>
<model type='virtio'/>
<driver name='vhost' queues='4'/>
</interface>
vhost-net is a kernel-level backend for virtio networking that significantly improves performance by reducing the number of context switches and system calls.
Without vhost-net (Traditional virtio):
┌─────────────────────────────────────────────────────────────┐
│ Virtual Machine │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Guest OS │ │
│ │ ┌────────────────────────────────────────────────┐ │ │
│ │ │ virtio-net (frontend driver) │ │ │
│ │ │ TX RX │ │ │
│ │ └────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│ ▲
▼ │
┌─────────────────────────────────────────────────────────────┐
│ Host Kernel │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ QEMU │ │
│ │ ┌────────────────────────────────────────────────┐ │ │
│ │ │ virtio backend │ │ │
│ │ └────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ ▲ │
│ ▼ │ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ TAP Device │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ ▲ │
│ ▼ │ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Bridge │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ ▲ │
│ ▼ │ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Physical NIC │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
With vhost-net (Optimized):
┌─────────────────────────────────────────────────────────────┐
│ Virtual Machine │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Guest OS │ │
│ │ ┌────────────────────────────────────────────────┐ │ │
│ │ │ virtio-net (frontend driver) │ │ │
│ │ │ TX RX │ │ │
│ │ └────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│ ▲
▼ │
┌─────────────────────────────────────────────────────────────┐
│ Host Kernel │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ vhost-net (kernel module) │ │
│ │ (bypasses QEMU for data path) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ ▲ │
│ ▼ │ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ TAP Device │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ ▲ │
│ ▼ │ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Bridge │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ ▲ │
│ ▼ │ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Physical NIC │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Key Benefits of vhost-net:
- Reduces context switches between kernel and userspace
- Lowers latency
- Reduces CPU usage
- Improves overall network throughput
- Data path bypasses QEMU userspace
Load the kernel module:
# Load vhost-net module
modprobe vhost-net
# Verify module is loaded
lsmod | grep vhost
# Check device file created
ls -l /dev/vhost-net
# Make persistent (add to /etc/modules-load.d/vhost.conf)
echo "vhost-net" > /etc/modules-load.d/vhost.conf
Configure in VM XML:
<interface type='network'>
<source network='default'/>
<model type='virtio'/>
<driver name='vhost'/>
</interface>
QEMU Command Line:
When QEMU is launched with -netdev tap,vhost=on, it opens /dev/vhost-net and initializes the vhost-net instance with several ioctl() calls. This initialization process binds QEMU with a vhost-net instance.
Example from qemu-kvm process:
-netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=27 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:49:3b:95,bus=pci.0,addr=0x3
The experimental_zcopytx parameter controls Bridge Zero Copy Transmit, which can improve performance for large packet workloads.
What is Zero Copy Transmit?
Zero copy transmit avoids copying outgoing guest data into a hypervisor buffer. When the guest OS asks to send a packet, the packet already sits in a buffer of the guest OS or guest application, with at least a partial header created by the guest's networking stack. The hypervisor asks the network device driver to transmit the packet directly from that guest buffer instead of copying it first.
When to Use:
- Environment uses large packet sizes
- Reduces host CPU overhead when guest communicates to external network
- Does NOT affect: Guest-to-guest, guest-to-host, or small packet workloads
Enable Zero Copy:
# Check current setting (0 = disabled, 1 = enabled)
cat /sys/module/vhost_net/parameters/experimental_zcopytx
# The parameter is read-only at runtime, so set it when the module loads
# (shut down guests that use vhost-net before unloading the module)
modprobe -r vhost_net
modprobe vhost_net experimental_zcopytx=1
Multi-queue virtio-net provides significant performance improvements by allowing parallel packet processing across multiple vCPUs.
Traditional virtio-net had a single RX (receive) and TX (transmit) queue, which created a bottleneck:
- Even with multiple vCPUs, networking throughput was limited
- Guests couldn't transmit or retrieve packets in parallel
- Virtual NICs couldn't utilize multi-queue support available in Linux kernel
- tap/virtio-net backend had to serialize concurrent transmission/receiving requests from different CPUs
- This serialization caused significant performance overhead
Multi-queue support was introduced in both frontend (guest) and backend (host) drivers to lift this bottleneck:
- Allows guests to scale network performance with more vCPUs
- Each queue can be processed by a different vCPU
- Parallel packet processing significantly improves throughput
- Better CPU cache utilization
Host Configuration (VM XML):
<interface type='network'>
  <source network='default'/>
  <model type='virtio'/>
  <driver name='vhost' queues='4'/>
</interface>
The queues value can range from 1 to 8 on older kernels, whose multi-queue tap device supports at most 8 queues; newer kernels raise this limit.
QEMU Command Line:
# Start guest with 2 queues (mq=on enables multi-queue; vectors = 2 * queues + 2)
qemu-kvm -netdev tap,queues=2,... -device virtio-net-pci,mq=on,vectors=6,...
Guest Configuration:
Inside the guest, enable multi-queue support using ethtool:
# Check current queue configuration
ethtool -l eth0
# Example output:
# Channel parameters for eth0:
# Pre-set maximums:
# RX: 0
# TX: 0
# Other: 0
# Combined: 4 <-- Maximum queues available
# Current hardware settings:
# RX: 0
# TX: 0
# Other: 0
# Combined: 1 <-- Currently active queues
# Enable multi-queue (set the combined count to any value from 1 up to the pre-set maximum)
ethtool -L eth0 combined 4
# Verify the change
ethtool -l eth0
Enable RPS (Receive Packet Steering):
# Enable RPS for better distribution
# (the value is a hexadecimal CPU bitmask; f selects CPUs 0-3)
for i in /sys/class/net/eth0/queues/rx-*/rps_cpus; do
    echo f > "$i"
done
# Or restrict steering to a subset of CPUs (example: mask 3 selects CPUs 0-1)
for i in /sys/class/net/eth0/queues/rx-*/rps_cpus; do
    echo 3 > "$i"
done
Multi-queue virtio-net provides the greatest performance benefit when:
- Large Packets: Traffic packets are relatively large
- High Concurrency: Guest is active on many connections simultaneously
- Multiple Traffic Patterns: Traffic running between:
  - Guest to guest
  - Guest to host
  - Guest to external system
- Optimal Queue Count: Number of queues equals number of vCPUs
  - Multi-queue support optimizes RX interrupt affinity
  - TX queue selection makes a specific queue private to a specific vCPU
Recommendations:
- Match Queue Count to vCPUs: Set the queues parameter equal to the number of vCPUs for optimal performance
- Monitor CPU Usage: Multi-queue increases CPU consumption even with better throughput
- Test Your Workload: Performance improvement varies by workload type
- Consider Trade-offs: Higher throughput comes at cost of increased CPU usage
- Use with vhost-net: Combine multi-queue with vhost-net for best results
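The queue-count and RPS-mask arithmetic used in this section can be sketched as two small shell helpers. This is only an illustration: the function names `queue_count` and `rps_mask` are hypothetical (they are not part of ethtool or libvirt), and the mask helper assumes fewer than 32 CPUs so the result fits in a single hexadecimal word.

```shell
#!/bin/sh
# Hypothetical helpers for sizing multi-queue virtio-net; not part of any tool.

# queue_count VCPUS MAX_QUEUES
# Prints the number of queues to request: one per vCPU, capped at the
# maximum the device reports (the "Combined" pre-set maximum in ethtool -l).
queue_count() {
    if [ "$1" -lt "$2" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

# rps_mask NCPUS
# Prints the hexadecimal bitmask that selects CPUs 0..NCPUS-1, in the form
# expected by /sys/class/net/<iface>/queues/rx-*/rps_cpus.
# Assumption: NCPUS is below 32.
rps_mask() {
    printf '%x\n' $(( (1 << $1) - 1 ))
}
```

For a 4 vCPU guest whose NIC reports a maximum of 8 combined queues, `queue_count 4 8` prints 4 and `rps_mask 4` prints f, which are the values used in the examples in this section.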
<!-- VM with 4 vCPUs and 4 network queues -->
<domain type='kvm'>
  <name>multiqueue-vm</name>
  <vcpu placement='static'>4</vcpu>
  <devices>
    <interface type='network'>
      <source network='default'/>
      <model type='virtio'/>
      <driver name='vhost' queues='4'/>
    </interface>
  </devices>
</domain>
Inside Guest:
# Enable all 4 queues
ethtool -L eth0 combined 4
# Enable RPS
for i in /sys/class/net/eth0/queues/rx-*/rps_cpus; do
    echo f > "$i"
done
# Verify configuration
ethtool -l eth0
cat /sys/class/net/eth0/queues/rx-*/rps_cpus
# Check interrupt distribution across CPUs
grep virtio /proc/interrupts
# Monitor network statistics
ethtool -S eth0
# Check queue statistics
tc -s qdisc show dev eth0
vhost-net offloads network processing to the kernel, improving performance.
<interface type='network'>
  <source network='default'/>
  <model type='virtio'/>
  <driver name='vhost'/>
</interface>
# Load vhost-net module
modprobe vhost-net
# Make persistent (add to /etc/modules-load.d/vhost.conf)
vhost-net
SR-IOV provides near-native network performance by allowing a single physical PCIe device to present itself as multiple virtual devices.
┌─────────────────────────────────────────────────────────────────────┐
│ Hypervisor │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ I/O MMU (Intel VT-d or AMD IOMMU) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Guest OS 1 │ │ Guest OS 2 │ │
│ │ ┌──────────────┐ │ │ ┌──────────────┐ │ │
│ │ │ Virtual NIC │ │ │ │ Virtual NIC │ │ │
│ │ │ Driver │ │ │ │ Driver │ │ │
│ │ └──────────────┘ │ │ └──────────────┘ │ │
│ └─────────────────────┘ └─────────────────────┘ │
│ │ │ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ SR-IOV PCI Device (NIC) │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────────────────┐ │ │
│ │ │ Physical Function (PF) │ │ │
│ │ │ (Full PCIe function with SR-IOV capability) │ │ │
│ │ │ - Physical NIC Driver │ │ │
│ │ └──────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Virtual │ │ Virtual │ │ Virtual │ │ │
│ │ │ Function 1 │ │ Function 2 │ │ Function N │ │ │
│ │ │ (VF) │ │ (VF) │ │ (VF) │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────┐
│ Physical Network │
└───────────────────────┘
Key Components:
- Physical Function (PF): Full PCIe function with SR-IOV capability, manages the SR-IOV functionality
- Virtual Function (VF): Lightweight PCIe function that can be assigned to VMs
- I/O MMU: Provides memory address translation and isolation (Intel VT-d or AMD IOMMU)
Benefits:
- Near-Native Performance: Direct hardware access without hypervisor intervention
- Low Latency: Bypasses virtual switch and hypervisor network stack
- High Throughput: Full bandwidth of physical NIC divided among VFs
- CPU Offload: Network processing offloaded to hardware
- Scalability: Single physical NIC supports multiple VMs
Requirements:
- Hardware support for SR-IOV (check NIC specifications)
- IOMMU support enabled in BIOS (Intel VT-d or AMD-Vi)
- Kernel support for SR-IOV and IOMMU
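The hardware and IOMMU requirements above can be checked from sysfs before any VFs are created. The sketch below is a minimal pre-flight check, assuming the NIC driver exposes `sriov_totalvfs` under `/sys/class/net/<iface>/device/` and that an active IOMMU populates `/sys/kernel/iommu_groups/`. The function name `check_sriov_prereqs` and its optional second argument (an alternate sysfs root, useful for dry runs) are hypothetical.

```shell
#!/bin/sh
# check_sriov_prereqs IFACE [SYSFS_ROOT]
# Hypothetical pre-flight check. Prints one of:
#   "no-sriov"   the interface does not expose an SR-IOV capability
#   "no-iommu"   no IOMMU groups are present (IOMMU disabled or unsupported)
#   "ok <N>"     both checks pass; N is the maximum VF count the device supports
check_sriov_prereqs() {
    iface=$1
    sysfs=${2:-/sys}
    totalvfs_file="$sysfs/class/net/$iface/device/sriov_totalvfs"

    # SR-IOV capable devices expose sriov_totalvfs next to sriov_numvfs.
    if [ ! -r "$totalvfs_file" ]; then
        echo "no-sriov"
        return 1
    fi

    # IOMMU groups only appear when the IOMMU is enabled in firmware and kernel.
    if [ -z "$(ls -A "$sysfs/kernel/iommu_groups" 2>/dev/null)" ]; then
        echo "no-iommu"
        return 1
    fi

    echo "ok $(cat "$totalvfs_file")"
}
```

Running `check_sriov_prereqs eth0` on the host reports whether it is worth continuing with the configuration steps that follow; the reported maximum is the upper bound for the value written to `sriov_numvfs`.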
1. Check SR-IOV Support:
# Check if SR-IOV is supported (lspci prints the capability name as "SR-IOV")
lspci -vvv | grep -i -A 3 "SR-IOV"
# Example output showing SR-IOV capability:
# Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
# IOVCap: Migration-, Interrupt Message Number: 000
# IOVCtl: Enable+ Migration- Interrupt- MSE+ ARIHierarchy+
# Initial VFs: 64, Total VFs: 64, Number of VFs: 4
2. Enable IOMMU in Kernel:
# For Intel processors (add to kernel boot parameters)
intel_iommu=on iommu=pt
# For AMD processors
amd_iommu=on iommu=pt
# Edit /etc/default/grub and add to GRUB_CMDLINE_LINUX
GRUB_CMDLINE_LINUX="... intel_iommu=on iommu=pt"
# Update grub and reboot
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot
3. Enable Virtual Functions:
# Check current VF count
cat /sys/class/net/eth0/device/sriov_numvfs
# Enable SR-IOV (example: create 4 VFs for Intel NIC)
echo 4 > /sys/class/net/eth0/device/sriov_numvfs
# Verify VFs created
lspci | grep -i virtual
# Check VF details
ip link show eth0
4. Make Persistent:
Create a systemd service or add to startup scripts:
# Create /etc/systemd/system/sriov-enable.service
[Unit]
Description=Enable SR-IOV
After=network.target
[Service]
Type=oneshot
ExecStart=/bin/bash -c 'echo 4 > /sys/class/net/eth0/device/sriov_numvfs'
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
# Enable the service
systemctl enable sriov-enable.service
Method 1: Direct VF Assignment (hostdev)
<interface type='hostdev' managed='yes'>
  <source>
    <address type='pci' domain='0x0000' bus='0x05' slot='0x10' function='0x0'/>
  </source>
  <mac address='52:54:00:6d:90:02'/>
  <vlan>
    <tag id='42'/>
  </vlan>
</interface>
Method 2: Using Network Pool
<!-- Define SR-IOV network pool -->
<network>
  <name>sriov-network</name>
  <forward mode='hostdev' managed='yes'>
    <pf dev='eth0'/>
  </forward>
</network>
<!-- Use in VM configuration -->
<interface type='network'>
  <source network='sriov-network'/>
  <mac address='52:54:00:6d:90:02'/>
</interface>
Find VF PCI Address:
# List all VFs with their PCI addresses
lspci | grep -i virtual
# Get VF details (libvirt names the device at 0000:05:10.0 as pci_0000_05_10_0)
virsh nodedev-dumpxml pci_0000_05_10_0
Advantages:
- Best network performance (near-native)
- Low CPU overhead
- Hardware-level isolation
- Support for advanced features (VLAN, QoS)
Limitations:
- No Live Migration: VMs with SR-IOV cannot be live migrated
- Limited VFs: Number of VFs limited by hardware (typically 32-64 per PF)
- Hardware Dependency: Tied to specific physical hardware
- Driver Requirements: Guest needs appropriate VF drivers
- Management Complexity: More complex than software-based networking
Best Use Cases:
- High-performance computing (HPC)
- Network Function Virtualization (NFV)
- Database servers requiring low latency
- Applications with high network throughput requirements
- When live migration is not required
# Increase network buffer sizes
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
sysctl -w net.ipv4.tcp_wmem="4096 65536 134217728"
# Enable TCP window scaling
sysctl -w net.ipv4.tcp_window_scaling=1
# Increase the per-CPU backlog of received packets waiting to be processed
sysctl -w net.core.netdev_max_backlog=5000
# Enable offload features
ethtool -K eth0 tso on
ethtool -K eth0 gso on
ethtool -K eth0 gro on
# Increase ring buffer size (check the supported maximums first with: ethtool -g eth0)
ethtool -G eth0 rx 4096 tx 4096
Optimize bridge configuration for better performance.
# Disable netfilter on bridges
sysctl -w net.bridge.bridge-nf-call-iptables=0
sysctl -w net.bridge.bridge-nf-call-ip6tables=0
sysctl -w net.bridge.bridge-nf-call-arptables=0
# Make persistent (add to /etc/sysctl.conf)
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-arptables = 0
- Use VirtIO with vhost-net: Best performance for most workloads
- Enable multi-queue: Match queue count to vCPU count
- Use SR-IOV for high-performance: When maximum throughput is needed
- Tune buffer sizes: Increase for high-bandwidth workloads
- Enable offload features: TSO, GSO, GRO in guest
- Optimize bridge settings: Disable unnecessary netfilter
- Measure and monitor: Benchmark throughput with tools like iperf or netperf, and watch interface statistics
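When tuning buffer sizes, one widely used rule of thumb is to size the maximum socket buffer to at least the bandwidth-delay product (link bandwidth multiplied by round-trip time). The helper below is a minimal sketch of that arithmetic; the function name `bdp_bytes` is hypothetical, and the rule itself is a general networking guideline rather than something specific to KVM.

```shell
#!/bin/sh
# bdp_bytes MBIT_PER_SEC RTT_MS
# Hypothetical helper: prints the bandwidth-delay product in bytes.
#   bytes = (Mbit/s * 1,000,000 / 8) * (RTT in ms / 1000)
#         = Mbit/s * RTT_ms * 125
bdp_bytes() {
    echo $(( $1 * $2 * 125 ))
}
```

For example, `bdp_bytes 10000 100` (a 10 Gbit/s path with a 100 ms round-trip time) prints 125000000, which fits within the 134217728-byte (128 MiB) maximum used in the sysctl examples in this section.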
Accurate time-keeping is crucial for many applications and system operations. In virtualization environments, maintaining time synchronization between guest and host is critical as it affects many guest operations and can cause unpredictable results if not properly configured.
Different mechanisms exist for time-keeping, with Network Time Protocol (NTP) being one of the best-known techniques for clock synchronization between computer systems over networks.
Key Considerations:
- Guest time should be in sync with hypervisor/host
- Time drift can cause authentication failures, log inconsistencies, and application errors
- Multiple methods available for achieving time sync (NTP, hwclock, kvm-clock)
- Best approach depends on your specific setup
Strategy:
- First: Make KVM host time stable and in sync (use NTP or similar)
- Then: Keep guest time in sync with host
- Best Option: Use kvm-clock for optimal results
Configure appropriate clock source for the VM.
<clock offset='utc'>
  <timer name='rtc' tickpolicy='catchup'/>
  <timer name='pit' tickpolicy='delay'/>
  <timer name='hpet' present='no'/>
  <timer name='hypervclock' present='yes'/>
</clock>
kvm-clock is a paravirtualized (virtualization-aware) clock device that provides the most accurate and stable time-keeping for KVM guests.
When kvm-clock is in use:
- Guest Requests Time: Guest asks hypervisor for current time
- Shared Page: Guest registers a page and shares address with hypervisor
- Continuous Updates: Hypervisor keeps updating this shared page
- Guest Reads: Guest simply reads this page whenever it needs time information
- Guaranteed Accuracy: Provides both stable and accurate timekeeping
Prerequisites:
- Hypervisor must support kvm-clock
- Guest kernel must have kvm-clock support (built-in for modern Linux kernels)
Check if kvm-clock is loaded:
# Check kernel messages for kvm-clock
dmesg | grep kvm-clock
# Example output:
[root@kvmguest ]$ dmesg | grep kvm-clock
[ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 0.000000] kvm-clock: CPU 0, msr 4:27fcf001, primary CPU clock
[ 0.027170] kvm-clock: CPU 1, msr 4:27fcf041, secondary CPU clock
[ 0.376023] kvm-clock: CPU 30, msr 4:27fcf781, secondary CPU clock
[ 0.388027] kvm-clock: CPU 31, msr 4:27fcf7c1, secondary CPU clock
[ 0.597084] Switched to clocksource kvm-clock
Verify kvm-clock is active clocksource:
# Check current clocksource
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
# Expected output:
kvm-clock
# List available clocksources
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
# Example output:
# kvm-clock tsc acpi_pm
Manually set kvm-clock (if needed):
# Set kvm-clock as clocksource
echo kvm-clock > /sys/devices/system/clocksource/clocksource0/current_clocksource
# Verify
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
Linux guest clock configuration:
<clock offset='utc'>
  <timer name='rtc' tickpolicy='catchup'/>
  <timer name='pit' tickpolicy='delay'/>
  <timer name='hpet' present='no'/>
  <timer name='kvmclock' present='yes'/>
</clock>
Windows guest clock configuration:
<clock offset='localtime'>
  <timer name='rtc' tickpolicy='catchup'/>
  <timer name='pit' tickpolicy='delay'/>
  <timer name='hpet' present='yes'/>
  <timer name='hypervclock' present='yes'/>
</clock>
Even with kvm-clock, it's recommended to configure NTP/Chrony in guests for additional time synchronization.
# Install chrony
yum install chrony # RHEL/CentOS
apt install chrony # Debian/Ubuntu
# Edit /etc/chrony.conf
# Add NTP servers
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
server 2.pool.ntp.org iburst
# Use makestep for large time corrections
# makestep <threshold> <limit>
# threshold: step if offset is larger than this (seconds)
# limit: number of clock updates (-1 = unlimited)
makestep 1.0 -1
# Enable and start chronyd
systemctl enable chronyd
systemctl start chronyd
# Check synchronization status
chronyc tracking
chronyc sources
# Install ntp
yum install ntp # RHEL/CentOS
apt install ntp # Debian/Ubuntu
# Edit /etc/ntp.conf
# Add NTP servers
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
server 2.pool.ntp.org iburst
# Disable the panic threshold so ntpd keeps running after a large time offset
# (for example, when a VM resumes from a pause) instead of exiting
tinker panic 0
# Enable and start ntpd
systemctl enable ntpd
systemctl start ntpd
# Check synchronization status
ntpq -p
ntpstat
# 1. Verify kvm-clock is active
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
# 2. Set kvm-clock as clocksource (if not already set)
echo kvm-clock > /sys/devices/system/clocksource/clocksource0/current_clocksource
# 3. Make persistent (add to /etc/rc.local or systemd service)
echo 'echo kvm-clock > /sys/devices/system/clocksource/clocksource0/current_clocksource' >> /etc/rc.local
# 4. Configure NTP/Chrony (see above sections)
# 5. Verify time synchronization
timedatectl status
- Install VirtIO drivers including viostor and vioscsi
- Configure Windows Time service:
# Set time source to hypervisor
w32tm /config /syncfromflags:VM /update
# Restart Windows Time service
net stop w32time && net start w32time
# Check time service status
w32tm /query /status
# Force synchronization
w32tm /resync
# Display time source
w32tm /query /source
- Additional Windows Configuration:
# Set time service to automatic startup
sc config w32time start= auto
# Configure time correction settings
w32tm /config /update /manualpeerlist:"time.windows.com" /syncfromflags:manual
# Check configuration
w32tm /query /configuration
Configure TSC for better time-keeping performance.
<cpu mode='host-passthrough'>
  <feature policy='require' name='invtsc'/>
</cpu>
<clock offset='utc'>
  <timer name='tsc' frequency='3000000000' mode='native'/>
</clock>
- Use kvmclock for Linux: Best time source for KVM guests
- Use Hyper-V enlightenments for Windows: Better time-keeping
- Disable HPET when not needed: Reduces overhead
- Use catchup tickpolicy: Handles time drift better
- Configure NTP/Chrony: Keep guest time synchronized
- Use invtsc feature: For better TSC stability
- Avoid CPU overcommitment: Reduces time-keeping issues
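The clocksource checks described in this section can be wrapped in a small guest-side helper. This is a sketch under stated assumptions: the function name `clocksource_status` is hypothetical, and the optional argument (an alternate directory, defaulting to the sysfs clocksource path used above) exists only to make the function easy to try against sample files.

```shell
#!/bin/sh
# clocksource_status [DIR]
# Hypothetical helper. Prints:
#   "active"     kvm-clock is the current clocksource
#   "available"  kvm-clock is offered by the kernel but not selected
#   "missing"    kvm-clock is not offered at all
clocksource_status() {
    dir=${1:-/sys/devices/system/clocksource/clocksource0}
    current=$(cat "$dir/current_clocksource" 2>/dev/null)
    available=$(cat "$dir/available_clocksource" 2>/dev/null)

    if [ "$current" = "kvm-clock" ]; then
        echo "active"
        return 0
    fi

    # available_clocksource is a space-separated list; pad it so that
    # kvm-clock matches only as a whole word.
    case " $available " in
        *" kvm-clock "*) echo "available" ;;
        *)               echo "missing" ;;
    esac
    return 1
}
```

A result of "available" means the clocksource can be switched by writing kvm-clock to `current_clocksource`, while "missing" points to a guest kernel or hypervisor without kvm-clock support.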
CPU:
- Allocate appropriate number of vCPUs (avoid over-subscription)
- Pin vCPUs to physical CPUs for consistent performance
- Configure CPU topology matching physical layout
- Use host-passthrough mode for best performance
- Align vCPUs with NUMA nodes
Memory:
- Allocate sufficient memory for workload
- Enable hugepages for memory-intensive workloads
- Configure NUMA memory policies
- Lock memory for real-time workloads
- Consider KSM for high-density environments
Disk and Block I/O:
- Use VirtIO-SCSI or VirtIO-BLK drivers
- Set cache='none' for production workloads
- Enable native AIO (io='native')
- Use raw format when snapshots not needed
- Configure I/O threads for better performance
- Enable multi-queue for multi-core VMs
- Enable discard/TRIM for SSDs
Network:
- Use VirtIO network driver with vhost-net
- Enable multi-queue networking
- Match queue count to vCPU count
- Consider SR-IOV for high-performance needs
- Tune network buffers and offload features
- Optimize bridge configuration
Time-keeping:
- Use kvmclock for Linux guests
- Use Hyper-V enlightenments for Windows
- Configure appropriate timer settings
- Disable HPET when not needed
- Set up NTP/Chrony in guests
# CPU usage
top -b -n 1 | head -20
mpstat -P ALL 1
# Memory usage
free -h
vmstat 1
# Disk I/O
iostat -x 1
iotop
# Network
iftop
nethogs
# List all VMs
virsh list --all
# VM CPU stats
virsh cpu-stats <domain>
# VM memory stats
virsh dommemstat <domain>
# VM block stats
virsh domblkstat <domain> <device>
# VM network stats
virsh domifstat <domain> <interface>
| Issue | Symptoms | Solution |
|---|---|---|
| High CPU steal time | Poor performance, high %st in top | Reduce vCPU count or pin vCPUs |
| Memory swapping | Slow performance, high swap usage | Increase VM memory or enable hugepages |
| Disk I/O bottleneck | High iowait, slow disk operations | Use VirtIO, enable native AIO, use faster storage |
| Network latency | High ping times, packet loss | Enable multi-queue, use vhost-net or SR-IOV |
| Time drift | Clock skew, authentication issues | Configure kvmclock, use NTP |
| NUMA imbalance | Uneven memory access times | Pin vCPUs and memory to same NUMA node |
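The steal-time symptom in the first row of the table can be quantified inside a guest from `/proc/stat`, where the aggregate `cpu` line lists user, nice, system, idle, iowait, irq, softirq, and steal counters in that order. The sketch below computes the steal percentage between two samples; the function name `steal_pct` is hypothetical.

```shell
#!/bin/sh
# steal_pct LINE1 LINE2
# Hypothetical helper: takes two samples of the aggregate "cpu" line from
# /proc/stat and prints the integer percentage of CPU time that was stolen
# between them. Fields 2-9 are summed as total time (guest time is already
# included in user and nice); field 9 is steal.
steal_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i
            steal = $9
        }
        NR == 1 { t0 = total; s0 = steal }
        NR == 2 {
            dt = total - t0
            if (dt > 0) { printf "%d\n", 100 * (steal - s0) / dt } else { print 0 }
        }'
}

# Example use inside a guest (left commented out so that sourcing this file
# has no side effects):
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   steal_pct "$a" "$b"
```

A persistently high value suggests the host CPUs are over-subscribed, which is the case the table addresses by reducing vCPU count or pinning vCPUs.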
<domain type='kvm'>
  <name>optimized-vm</name>
  <memory unit='GiB'>8</memory>
  <currentMemory unit='GiB'>8</currentMemory>
  <memoryBacking>
    <hugepages>
      <page size='1' unit='GiB'/>
    </hugepages>
    <locked/>
  </memoryBacking>
  <vcpu placement='static'>4</vcpu>
  <iothreads>2</iothreads>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='3'/>
    <emulatorpin cpuset='0-1'/>
    <iothreadpin iothread='1' cpuset='4'/>
    <iothreadpin iothread='2' cpuset='5'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
  <cpu mode='host-passthrough'>
    <topology sockets='1' cores='4' threads='1'/>
    <numa>
      <cell id='0' cpus='0-3' memory='8' unit='GiB'/>
    </numa>
  </cpu>
  <clock offset='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
    <timer name='kvmclock' present='yes'/>
  </clock>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native' iothread='1'/>
      <source file='/var/lib/libvirt/images/vm.img'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <driver queues='4' iothread='2'/>
    </controller>
    <interface type='network'>
      <source network='default'/>
      <model type='virtio'/>
      <driver name='vhost' queues='4'/>
    </interface>
    <memballoon model='virtio'>
      <stats period='10'/>
    </memballoon>
  </devices>
</domain>
Performance tuning in KVM requires a holistic approach considering CPU, memory, storage, and network optimization. By following the best practices outlined in this guide and continuously monitoring performance metrics, you can achieve optimal VM performance for your workloads.
Remember that performance tuning is an iterative process. Start with baseline measurements, apply optimizations incrementally, and measure the impact of each change. Different workloads may require different optimization strategies, so always test configurations in a non-production environment first.
Document Version: 1.0
Last Updated: 2026-05-12
Author: perfAge Team :)