Nested VMX

On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
to easily and efficiently run guest operating systems. Normally, these guests
*cannot* themselves be hypervisors running their own guests, because in VMX,
guests cannot use VMX instructions.

The "Nested VMX" feature adds this missing capability - of running guest
hypervisors (which use VMX) with their own nested guests. It does so by
allowing a guest to use VMX instructions, and correctly and efficiently
emulating them using the single level of VMX available in the hardware.

We describe in much greater detail the theory behind the nested VMX feature,
its implementation and its performance characteristics, in the OSDI 2010 paper
"The Turtles Project: Design and Implementation of Nested Virtualization",
available at:

Single-level virtualization has two levels - the host (KVM) and the guests.
In nested virtualization, we have three levels: The host (KVM), which we call
L0, the guest hypervisor, which we call L1, and its nested guest, which we
call L2.

Known limitations

The current code supports running Linux guests under KVM guests.
Only 64-bit guest hypervisors are supported.

Additional patches for running Windows under guest KVM, and Linux under
guest VMware server, and support for nested EPT, are currently running in
the lab, and will be sent as follow-on patchsets.

Running nested VMX

The nested VMX feature is disabled by default. It can be enabled by giving
the "nested=1" option to the kvm-intel module.

No modifications are required to user space (qemu). However, qemu's default
emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
explicitly enabled, by giving qemu one of the following options:

     -cpu host              (emulated CPU has all features of the real CPU)

     -cpu qemu64,+vmx       (add just the vmx feature to a named CPU type)

Nested VMX aims to present a standard and (eventually) fully-functional VMX
implementation for a guest hypervisor to use. As such, the official
specification of the ABI that it provides is Intel's VMX specification,
namely volume 3B of their "Intel 64 and IA-32 Architectures Software
Developer's Manual". Not all of VMX's features are currently fully supported,
but the goal is to eventually support them all, starting with the VMX features
which are used in practice by popular hypervisors (KVM and others).

As a VMX implementation, nested VMX presents a VMCS structure to L1.
As mandated by the spec, other than the two fields revision_id and abort,
this structure is *opaque* to its user, who is not supposed to know or care
about its internal structure. Rather, the structure is accessed through the
VMREAD and VMWRITE instructions.
Still, for debugging purposes, KVM developers might be interested to know the
internals of this structure: this is struct vmcs12 from arch/x86/kvm/vmx.c.

The name "vmcs12" refers to the VMCS that L1 builds for L2. In the code we
also have "vmcs01", the VMCS that L0 built for L1, and "vmcs02" is the VMCS
which L0 builds to actually run L2 - how this is done is explained in the
aforementioned paper.

For convenience, we repeat the content of struct vmcs12 here. If the internals
of this structure change, this can break live migration across KVM versions.
VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner
struct shadow_vmcs is ever changed.

        typedef u64 natural_width;
        struct __packed vmcs12 {
                /* According to the Intel spec, a VMCS region must start with
                 * these two user-visible fields */
                u32 revision_id;
                u32 abort;

                u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
                u32 padding[7]; /* room for future expansion */

                u64 io_bitmap_a;
                u64 io_bitmap_b;
                u64 msr_bitmap;
                u64 vm_exit_msr_store_addr;
                u64 vm_exit_msr_load_addr;
                u64 vm_entry_msr_load_addr;
                u64 tsc_offset;
                u64 virtual_apic_page_addr;
                u64 apic_access_addr;
                u64 ept_pointer;
                u64 guest_physical_address;
                u64 vmcs_link_pointer;
                u64 guest_ia32_debugctl;
                u64 guest_ia32_pat;
                u64 guest_ia32_efer;
                u64 guest_pdptr0;
                u64 guest_pdptr1;
                u64 guest_pdptr2;
                u64 guest_pdptr3;
                u64 host_ia32_pat;
                u64 host_ia32_efer;
                u64 padding64[8]; /* room for future expansion */
                natural_width cr0_guest_host_mask;
                natural_width cr4_guest_host_mask;
                natural_width cr0_read_shadow;
                natural_width cr4_read_shadow;
                natural_width cr3_target_value0;
                natural_width cr3_target_value1;
                natural_width cr3_target_value2;
                natural_width cr3_target_value3;
                natural_width exit_qualification;
                natural_width guest_linear_address;
                natural_width guest_cr0;
                natural_width guest_cr3;
                natural_width guest_cr4;
                natural_width guest_es_base;
                natural_width guest_cs_base;
                natural_width guest_ss_base;
                natural_width guest_ds_base;
                natural_width guest_fs_base;
                natural_width guest_gs_base;
                natural_width guest_ldtr_base;
                natural_width guest_tr_base;
                natural_width guest_gdtr_base;
                natural_width guest_idtr_base;
                natural_width guest_dr7;
                natural_width guest_rsp;
                natural_width guest_rip;
                natural_width guest_rflags;
                natural_width guest_pending_dbg_exceptions;
                natural_width guest_sysenter_esp;
                natural_width guest_sysenter_eip;
                natural_width host_cr0;
                natural_width host_cr3;
                natural_width host_cr4;
                natural_width host_fs_base;
                natural_width host_gs_base;
                natural_width host_tr_base;
                natural_width host_gdtr_base;
                natural_width host_idtr_base;
                natural_width host_ia32_sysenter_esp;
                natural_width host_ia32_sysenter_eip;
                natural_width host_rsp;
                natural_width host_rip;
                natural_width paddingl[8]; /* room for future expansion */
                u32 pin_based_vm_exec_control;
                u32 cpu_based_vm_exec_control;
                u32 exception_bitmap;
                u32 page_fault_error_code_mask;
                u32 page_fault_error_code_match;
                u32 cr3_target_count;
                u32 vm_exit_controls;
                u32 vm_exit_msr_store_count;
                u32 vm_exit_msr_load_count;
                u32 vm_entry_controls;
                u32 vm_entry_msr_load_count;
                u32 vm_entry_intr_info_field;
                u32 vm_entry_exception_error_code;
                u32 vm_entry_instruction_len;
                u32 tpr_threshold;
                u32 secondary_vm_exec_control;
                u32 vm_instruction_error;
                u32 vm_exit_reason;
                u32 vm_exit_intr_info;
                u32 vm_exit_intr_error_code;
                u32 idt_vectoring_info_field;
                u32 idt_vectoring_error_code;
                u32 vm_exit_instruction_len;
                u32 vmx_instruction_info;
                u32 guest_es_limit;
                u32 guest_cs_limit;
                u32 guest_ss_limit;
                u32 guest_ds_limit;
                u32 guest_fs_limit;
                u32 guest_gs_limit;
                u32 guest_ldtr_limit;
                u32 guest_tr_limit;
                u32 guest_gdtr_limit;
                u32 guest_idtr_limit;
                u32 guest_es_ar_bytes;
                u32 guest_cs_ar_bytes;
                u32 guest_ss_ar_bytes;
                u32 guest_ds_ar_bytes;
                u32 guest_fs_ar_bytes;
                u32 guest_gs_ar_bytes;
                u32 guest_ldtr_ar_bytes;
                u32 guest_tr_ar_bytes;
                u32 guest_interruptibility_info;
                u32 guest_activity_state;
                u32 guest_sysenter_cs;
                u32 host_ia32_sysenter_cs;
                u32 padding32[8]; /* room for future expansion */
                u16 virtual_processor_id;
                u16 guest_es_selector;
                u16 guest_cs_selector;
                u16 guest_ss_selector;
                u16 guest_ds_selector;
                u16 guest_fs_selector;
                u16 guest_gs_selector;
                u16 guest_ldtr_selector;
                u16 guest_tr_selector;
                u16 host_es_selector;
                u16 host_cs_selector;
                u16 host_ss_selector;
                u16 host_ds_selector;
                u16 host_fs_selector;
                u16 host_gs_selector;
                u16 host_tr_selector;
        };

These patches were written by:
     Abel Gordon, abelg <at>
     Nadav Har'El, nyh <at>
     Orit Wasserman, oritw <at>
     Ben-Ami Yassor, benami <at>
     Muli Ben-Yehuda, muli <at>

With contributions by:
     Anthony Liguori, aliguori <at>
     Mike Day, mdday <at>
     Michael Factor, factor <at>
     Zvi Dubitzky, dubi <at>

And valuable reviews by:
     Avi Kivity, avi <at>
     Gleb Natapov, gleb <at>
     Marcelo Tosatti, mtosatti <at>
     Kevin Tian, kevin.tian <at>
     and others.