Hi all,

Following Greg's suggestion to turn the proposed fix into a real patch,
here is a minimal fix for the vmcb12->save.rip TOCTOU race in KVM's
nested SVM implementation.

Background
----------

The CVE-2021-29657 fix introduced nested_copy_vmcb_save_to_cache() to
snapshot vmcb12 fields before validation and use, preventing a racing L1
vCPU from modifying vmcb12 between check and use. However, the save area
cache deliberately excludes rip, rsp, and rax -- only efer, cr0, cr3,
cr4, dr6, and dr7 are snapshotted.

As a result, vmcb12->save.rip is still read three separate times from
the live guest-mapped HVA pointer during a single nested VMRUN:

  1) enter_svm_guest_mode() passes vmcb12->save.rip to
     nested_vmcb02_prepare_control(), where it is stored in
     svm->soft_int_old_rip, svm->soft_int_next_rip, and
     vmcb02->control.next_rip

  2) nested_vmcb02_prepare_save() calls
     kvm_rip_write(vcpu, vmcb12->save.rip), setting the KVM-internal
     vCPU register state

  3) nested_vmcb02_prepare_save() then does
     vmcb02->save.rip = vmcb12->save.rip, setting the hardware VMCB02
     save area

Since vmcb12 is mapped via kvm_vcpu_map() as a direct HVA into guest
physical memory with no write protection, a concurrent L1 vCPU can
modify vmcb12->save.rip between these reads, producing a three-way RIP
inconsistency. This is the save-area analog of CVE-2021-29657.

The inconsistency is particularly dangerous when combined with soft
interrupt injection (event_inj with TYPE_SOFT): KVM records
soft_int_old_rip from read #1, but the vCPU state and hardware VMCB
reflect reads #2 and #3, respectively. If interrupt delivery faults,
svm_complete_interrupts() uses the stale soft_int_old_rip to
reconstruct pre-injection state, which no longer matches reality.

I am aware of Yosry Ahmed's larger patch series (v3-v6) that
reworks the entire vmcb12 caching architecture and would subsume
this fix. However, that series is still under review and has not
yet been merged. This patch is a minimal, self-contained fix that
can be applied immediately to close the TOCTOU window on rip, rsp,
and rax.

Fix
---

Add rip, rsp, and rax to struct vmcb_save_area_cached, snapshot them
in __nested_copy_vmcb_save_to_cache(), and replace all direct reads
of vmcb12->save.{rip,rsp,rax} with reads from the cached copy. This
ensures all consumers within a single nested VMRUN see consistent
register values.
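
For readers who want the shape of the change without opening the diff,
here is a minimal userspace sketch of the snapshot-once pattern. It is
not the patch itself: struct save_area, struct save_area_cached,
struct consumers, and the helper names are simplified, hypothetical
stand-ins for the kernel structures, and only the three registers
discussed above are modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct save_area {		/* models the guest-writable vmcb12 save area */
	uint64_t rip, rsp, rax;
};

struct save_area_cached {	/* models the host-private snapshot */
	uint64_t rip, rsp, rax;
};

struct consumers {		/* models the three places that consume RIP */
	uint64_t soft_int_old_rip;
	uint64_t vcpu_rip;
	uint64_t vmcb02_rip;
};

/* Read each guest-writable field exactly once, through a volatile lvalue. */
static void snapshot_save_area(struct save_area_cached *to,
			       const volatile struct save_area *from)
{
	to->rip = from->rip;
	to->rsp = from->rsp;
	to->rax = from->rax;
}

/* Every consumer reads the private copy, never the guest-mapped memory. */
static void consume_rip(struct consumers *c,
			const struct save_area_cached *cached)
{
	c->soft_int_old_rip = cached->rip;
	c->vcpu_rip = cached->rip;
	c->vmcb02_rip = cached->rip;
}

/*
 * Returns 1 if all three consumers see the snapshotted value even though
 * the source field is overwritten after the snapshot was taken.
 */
static int snapshot_is_consistent(void)
{
	struct save_area live = { .rip = 0x1000, .rsp = 0x2000, .rax = 0x3000 };
	struct save_area_cached cached;
	struct consumers c;

	snapshot_save_area(&cached, &live);
	live.rip = 0xdead0000;	/* simulated write by a racing L1 vCPU */
	consume_rip(&c, &cached);

	return c.soft_int_old_rip == 0x1000 &&
	       c.vcpu_rip == 0x1000 &&
	       c.vmcb02_rip == 0x1000;
}
```

Calling snapshot_is_consistent() from a test main() and asserting a
nonzero return checks the property: once the fields are copied, later
writes to the source cannot change what any consumer observes.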

Testing
-------

Tested on an AMD Ryzen 7 7800X3D with nested virtualization enabled
(kvm_amd nested=1). A userspace race harness demonstrated a 25.6%
hit rate for concurrent modification of the rip field between reads
across 1M iterations, with 3-way splits (all three reads returning
different values) confirmed. With the patch applied, all three
consumption points see the same snapshotted value regardless of
concurrent modification.
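
To make the measurement concrete, the following is a rough sketch of
how such a userspace harness could be structured. It is an assumption
for illustration, not the harness used for the numbers above:
shared_rip, writer_thread(), and run_race() are hypothetical names, the
shared field is a plain atomic in process memory rather than a real
vmcb12 mapping, and the observed mismatch rate will vary by machine.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the guest-writable vmcb12->save.rip. */
static _Atomic uint64_t shared_rip;
static atomic_bool stop_writer;

/* Models the racing L1 vCPU: keeps rewriting the field until told to stop. */
static void *writer_thread(void *arg)
{
	uint64_t v = 0;

	(void)arg;
	while (!atomic_load_explicit(&stop_writer, memory_order_relaxed))
		atomic_store_explicit(&shared_rip, ++v, memory_order_relaxed);
	return NULL;
}

struct race_result {
	unsigned long live_mismatches;	 /* three separate reads disagreed */
	unsigned long cached_mismatches; /* three uses of one snapshot disagreed */
};

/* Returns 0 on success, -1 if the writer thread could not be started. */
static int run_race(unsigned long iterations, struct race_result *res)
{
	pthread_t writer;
	unsigned long i;

	res->live_mismatches = 0;
	res->cached_mismatches = 0;
	atomic_store(&stop_writer, false);

	if (pthread_create(&writer, NULL, writer_thread, NULL) != 0)
		return -1;

	for (i = 0; i < iterations; i++) {
		/* Unfixed pattern: three independent reads of the live field. */
		uint64_t a = atomic_load_explicit(&shared_rip, memory_order_relaxed);
		uint64_t b = atomic_load_explicit(&shared_rip, memory_order_relaxed);
		uint64_t c = atomic_load_explicit(&shared_rip, memory_order_relaxed);

		if (a != b || b != c)
			res->live_mismatches++;

		/* Fixed pattern: one read, then three uses of the private copy. */
		uint64_t snap = atomic_load_explicit(&shared_rip, memory_order_relaxed);
		uint64_t x = snap, y = snap, z = snap;

		if (x != y || y != z)
			res->cached_mismatches++;
	}

	atomic_store(&stop_writer, true);
	pthread_join(writer, NULL);
	return 0;
}
```

Build with -pthread. The count in live_mismatches depends on
scheduling and hardware, so it is only useful as a rate;
cached_mismatches is zero by construction, which is the property the
patch relies on.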

The original discussion that led to this patch inadvertently went to
a public list. KVM maintainers were not CC'd on the follow-up; this
submission corrects that.

Seungil Jeon (1):
  KVM: nSVM: Snapshot vmcb12 save.rip to prevent TOCTOU race

 arch/x86/kvm/svm/nested.c | 22 +++++++++++-----------
 arch/x86/kvm/svm/svm.h    |  3 +++
 2 files changed, 14 insertions(+), 11 deletions(-)

--
2.43.0