* BUG: Occasional unexpected DR6 value seen with nested virtualization on x86
@ 2025-01-22 6:02 John Stultz
2025-01-22 9:00 ` Paolo Bonzini
2025-01-22 20:55 ` Sean Christopherson
0 siblings, 2 replies; 7+ messages in thread
From: John Stultz @ 2025-01-22 6:02 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, kvm, Peter Zijlstra,
Frederic Weisbecker
Cc: Andy Lutomirski, Borislav Petkov, Jim Mattson, Alex Bennée,
Will Deacon, Thomas Gleixner, Dave Hansen, LKML, kernel-team
For a while now, when testing Android in our virtualized x86
environment (cuttlefish), we've seen flaky failures (~1% of runs or
less) in testing with the bionic sys_ptrace.watchpoint_stress test:
https://android.googlesource.com/platform/bionic/+/refs/heads/main/tests/sys_ptrace_test.cpp#221
The failure looks like:
bionic/tests/sys_ptrace_test.cpp:(194) Failure in test sys_ptrace.watchpoint_stress
Expected equality of these values:
  4
  siginfo.si_code
    Which is: 1
sys_ptrace.watchpoint_stress exited with exitcode 1.
Basically we expect to get a SIGTRAP with si_code: TRAP_HWBKPT, but
occasionally we get an si_code of TRAP_BRKPT.
I managed to reproduce the problem, and isolated it down to the call path:
[ 173.185462] __send_signal_locked+0x3af/0x4b0
[ 173.185563] send_signal_locked+0x16e/0x1b0
[ 173.185649] force_sig_info_to_task+0x118/0x150
[ 173.185759] force_sig_fault+0x60/0x80
[ 173.185847] send_sigtrap+0x48/0x50
[ 173.185933] noist_exc_debug+0xbe/0x100
Where we seem to be in exc_debug_user():
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/traps.c#n1067
Specifically here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/traps.c#n1130
	icebp = !dr6;
	...
	/* Add the virtual_dr6 bits for signals. */
	dr6 |= current->thread.virtual_dr6;
	if (dr6 & (DR_STEP | DR_TRAP_BITS) || icebp)
		send_sigtrap(regs, 0, get_si_code(dr6));
Where get_si_code() is here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/asm/traps.h#n28
static inline int get_si_code(unsigned long condition)
{
	if (condition & DR_STEP)
		return TRAP_TRACE;
	else if (condition & (DR_TRAP0|DR_TRAP1|DR_TRAP2|DR_TRAP3))
		return TRAP_HWBKPT;
	else
		return TRAP_BRKPT;
}
We seem to be hitting the case where dr6 is zero, so icebp gets set,
and we then call get_si_code() with a zero condition value, which
gives us TRAP_BRKPT instead of TRAP_HWBKPT.
The dr6 value passed to exc_debug_user() comes from
debug_read_clear_dr6() in the definition for
DEFINE_IDTENTRY_DEBUG_USER(exc_debug):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/traps.c#n1147
Where debug_read_clear_dr6() is implemented here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/traps.c#n926
I then cut down and ported the bionic test out so it could build under
a standard debian environment:
https://github.com/johnstultz-work/bionic-ptrace-reproducer
Where I was able to reproduce the same problem in a debian VM (after
running the test in a loop for a short while).
Now, here's where it is odd. I could *not* reproduce the problem on
bare metal hardware, *nor* could I reproduce the problem in a virtual
environment. I can *only* reproduce the problem with nested
virtualization (running the VM inside a VM).
I have reproduced this on my intel i12 NUC using the same v6.12 kernel
on metal + virt + nested environments. It also reproduced on the NUC
with v5.15 (metal) + v6.1 (virt) + v6.1(nested).
I've also reproduced it with both VMs using only one CPU, and with the
qemu on bare metal tasksetted to a single CPU, to rule out any sort of
issue with virtual CPUs migrating around.
Also setting enable_shadow_vmcs=0 on the metal host didn't seem to
affect the behavior.
I've tried to do some tracing in the arch/x86/kvm/x86.c logic, but
I've not yet directly correlated anything on the hosts to the point
where we read the zero DR6 value in the nested guest.
But I'm not very savvy around virtualization, ptrace watchpoints, or
the low-level details of the Intel DR6 register, so I wanted to bring
this up on the list to see if folks had suggestions or ideas to
further narrow this down. Happy to test things, as it's pretty simple
to reproduce here.
Many thanks to Alex Bennée and Jim Mattson for their testing
suggestions to help narrow this down so far.
thanks
-john
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: BUG: Occasional unexpected DR6 value seen with nested virtualization on x86
2025-01-22 6:02 BUG: Occasional unexpected DR6 value seen with nested virtualization on x86 John Stultz
@ 2025-01-22 9:00 ` Paolo Bonzini
2025-01-22 16:55 ` Jim Mattson
2025-01-22 20:55 ` Sean Christopherson
1 sibling, 1 reply; 7+ messages in thread
From: Paolo Bonzini @ 2025-01-22 9:00 UTC (permalink / raw)
To: John Stultz
Cc: Sean Christopherson, kvm, Peter Zijlstra, Frederic Weisbecker,
Andy Lutomirski, Borislav Petkov, Jim Mattson, Alex Bennée,
Will Deacon, Thomas Gleixner, Dave Hansen, LKML, Team, Android
On Wed, Jan 22, 2025 at 07:07, John Stultz <jstultz@google.com> wrote:
>
> I then cut down and ported the bionic test out so it could build under
> a standard debian environment:
> https://github.com/johnstultz-work/bionic-ptrace-reproducer
>
> Where I was able to reproduce the same problem in a debian VM (after
> running the test in a loop for a short while).
Thanks, that's nice to have.
> Now, here's where it is odd. I could *not* reproduce the problem on
> bare metal hardware, *nor* could I reproduce the problem in a virtual
> environment. I can *only* reproduce the problem with nested
> virtualization (running the VM inside a VM).
Typically in that case the best thing to do is turn it into a
kvm-unit-test or selftest (though that's often an endeavor of its own,
as it requires distilling the Linux kernel and userspace code into a
guest that runs without an OS). But what you've done is already a good
step.
> I have reproduced this on my intel i12 NUC using the same v6.12 kernel
> on metal + virt + nested environments. It also reproduced on the NUC
> with v5.15 (metal) + v6.1 (virt) + v6.1(nested).
Good that you can use a new kernel. Older kernels are less reliable
with nested virt (especially since the one that matters the most is
the metal one).
Paolo
> I've tried to do some tracing in the arch/x86/kvm/x86.c logic, but
> I've not yet directly correlated anything on the hosts to the point
> where we read the zero DR6 value in the nested guest.
>
> But I'm not very savvy around virtualization or ptrace watchpoints or
> low level details around intel DB6 register, so I wanted to bring this
> up on the list to see if folks had suggestions or ideas to further
> narrow this down? Happy to test things as it's pretty simple to
> reproduce here.
>
> Many thanks to Alex Bennee and Jim Mattson for their testing
> suggestions to help narrow this down so far.
>
> thanks
> -john
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: BUG: Occasional unexpected DR6 value seen with nested virtualization on x86
2025-01-22 9:00 ` Paolo Bonzini
@ 2025-01-22 16:55 ` Jim Mattson
2025-01-22 18:24 ` John Stultz
0 siblings, 1 reply; 7+ messages in thread
From: Jim Mattson @ 2025-01-22 16:55 UTC (permalink / raw)
To: Paolo Bonzini
Cc: John Stultz, Sean Christopherson, kvm, Peter Zijlstra,
Frederic Weisbecker, Andy Lutomirski, Borislav Petkov,
Alex Bennée, Will Deacon, Thomas Gleixner, Dave Hansen, LKML,
Team, Android
On Wed, Jan 22, 2025 at 1:01 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On Wed, Jan 22, 2025 at 07:07, John Stultz <jstultz@google.com> wrote:
> >
> > I then cut down and ported the bionic test out so it could build under
> > a standard debian environment:
> > https://github.com/johnstultz-work/bionic-ptrace-reproducer
> >
> > Where I was able to reproduce the same problem in a debian VM (after
> > running the test in a loop for a short while).
>
>
> Thanks, that's nice to have.
>
> > Now, here's where it is odd. I could *not* reproduce the problem on
> > bare metal hardware, *nor* could I reproduce the problem in a virtual
> > environment. I can *only* reproduce the problem with nested
> > virtualization (running the VM inside a VM).
>
> Typically in that case the best thing to do is turn it into a
> kvm-unit-test or selftest (though that's often an endeavor of its own,
> as it requires distilling the Linux kernel and userspace code into a
> guest that runs without an OS). But what you've done is already a good
> step.
Just run the kvm-unit-tests 'x86/debug' test in a loop inside an L1
VM. It will eventually fail. Maybe not the same bug, but we can hope.
:)
> > I have reproduced this on my intel i12 NUC using the same v6.12 kernel
> > on metal + virt + nested environments. It also reproduced on the NUC
> > with v5.15 (metal) + v6.1 (virt) + v6.1(nested).
>
> Good that you can use a new kernel. Older kernels are less reliable
> with nested virt (especially since the one that matters the most is
> the metal one).
>
> Paolo
>
> > I've tried to do some tracing in the arch/x86/kvm/x86.c logic, but
> > I've not yet directly correlated anything on the hosts to the point
> > where we read the zero DR6 value in the nested guest.
> >
> > But I'm not very savvy around virtualization or ptrace watchpoints or
> > low level details around intel DB6 register, so I wanted to bring this
> > up on the list to see if folks had suggestions or ideas to further
> > narrow this down? Happy to test things as it's pretty simple to
> > reproduce here.
> >
> > Many thanks to Alex Bennee and Jim Mattson for their testing
> > suggestions to help narrow this down so far.
> >
> > thanks
> > -john
> >
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: BUG: Occasional unexpected DR6 value seen with nested virtualization on x86
2025-01-22 16:55 ` Jim Mattson
@ 2025-01-22 18:24 ` John Stultz
0 siblings, 0 replies; 7+ messages in thread
From: John Stultz @ 2025-01-22 18:24 UTC (permalink / raw)
To: Jim Mattson
Cc: Paolo Bonzini, Sean Christopherson, kvm, Peter Zijlstra,
Frederic Weisbecker, Andy Lutomirski, Alex Bennée,
Will Deacon, Thomas Gleixner, Dave Hansen, LKML, Team, Android
On Wed, Jan 22, 2025 at 8:56 AM Jim Mattson <jmattson@google.com> wrote:
>
> On Wed, Jan 22, 2025 at 1:01 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On Wed, Jan 22, 2025 at 07:07, John Stultz <jstultz@google.com> wrote:
> > >
> > > I then cut down and ported the bionic test out so it could build under
> > > a standard debian environment:
> > > https://github.com/johnstultz-work/bionic-ptrace-reproducer
> > >
> > > Where I was able to reproduce the same problem in a debian VM (after
> > > running the test in a loop for a short while).
> >
> >
> > Thanks, that's nice to have.
> >
> > > Now, here's where it is odd. I could *not* reproduce the problem on
> > > bare metal hardware, *nor* could I reproduce the problem in a virtual
> > > environment. I can *only* reproduce the problem with nested
> > > virtualization (running the VM inside a VM).
> >
> > Typically in that case the best thing to do is turn it into a
> > kvm-unit-test or selftest (though that's often an endeavor of its own,
> > as it requires distilling the Linux kernel and userspace code into a
> > guest that runs without an OS). But what you've done is already a good
> > step.
>
> Just run the kvm-unit-tests 'x86/debug' test in a loop inside an L1
> VM. It will eventually fail. Maybe not the same bug, but we can hope.
> :)
Thanks Jim,
I've just reproduced this as well, after running the debug test in a
loop. The one odd bit is that it's not always the same subtest that
fails.
I've seen:
FAIL: Single-step #DB w/ STI blocking
FAIL: Usermode Single-step #DB w/ STI blocking
FAIL: Single-step #DB on emulated instructions
FAIL: Single-step #DB w/ MOVSS blocking
FAIL: Usermode Single-step #DB basic test
FAIL: Single-step #DB w/ MOVSS blocking and DR7.GD=1
FAIL: hw breakpoint (test that dr6.BS is not cleared)
So yeah, hopefully all the same bug? :)
thanks
-john
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: BUG: Occasional unexpected DR6 value seen with nested virtualization on x86
2025-01-22 6:02 BUG: Occasional unexpected DR6 value seen with nested virtualization on x86 John Stultz
2025-01-22 9:00 ` Paolo Bonzini
@ 2025-01-22 20:55 ` Sean Christopherson
2025-01-22 23:00 ` John Stultz
1 sibling, 1 reply; 7+ messages in thread
From: Sean Christopherson @ 2025-01-22 20:55 UTC (permalink / raw)
To: John Stultz
Cc: Paolo Bonzini, kvm, Peter Zijlstra, Frederic Weisbecker,
Andy Lutomirski, Borislav Petkov, Jim Mattson, Alex Bennée,
Will Deacon, Thomas Gleixner, Dave Hansen, LKML, kernel-team
On Tue, Jan 21, 2025, John Stultz wrote:
> For a while now, when testing Android in our virtualized x86
> environment (cuttlefish), we've seen flaky failures (~1% of runs or
> less) in testing with the bionic sys_ptrace.watchpoint_stress test:
> https://android.googlesource.com/platform/bionic/+/refs/heads/main/tests/sys_ptrace_test.cpp#221
>
> The failure looks like:
> bionic/tests/sys_ptrace_test.cpp:(194) Failure in test
> sys_ptrace.watchpoint_stress
> Expected equality of these values:
> 4
> siginfo.si_code
> Which is: 1
> sys_ptrace.watchpoint_stress exited with exitcode 1.
>
> Basically we expect to get a SIGTRAP with si_code: TRAP_HWBKPT, but
> occasionally we get an si_code of TRAP_BRKPT.
...
> Now, here's where it is odd. I could *not* reproduce the problem on
> bare metal hardware, *nor* could I reproduce the problem in a virtual
> environment. I can *only* reproduce the problem with nested
> virtualization (running the VM inside a VM).
>
> I have reproduced this on my intel i12 NUC using the same v6.12 kernel
> on metal + virt + nested environments. It also reproduced on the NUC
> with v5.15 (metal) + v6.1 (virt) + v6.1(nested).
Huh. This isn't actually a nested virtualization bug. It's a flaw in KVM's
fastpath handling. But hitting it in a non-nested setup is practically impossible,
because it requires the "kernel" running the test to have interrupts enabled
(rules out the ptrace test), a source of interrupts (rules out KVM-Unit-Test),
and a window of a handful of instructions (or a weird guest).
If KVM has disabled DR interception, which occurs when the guest accesses a DR
and host userspace isn't debugging the guest, KVM loads the guest's DRs before
VM-Enter, and then saves the guest's DRs on VM-Exit (and restores the intercept).
For DR0-DR3, the behavior is identical between VMX and SVM, and also identical
between KVM_DEBUGREG_BP_ENABLED (userspace debugging the guest) and KVM_DEBUGREG_WONT_EXIT
(guest using DRs), and so KVM handles loading DR0-DR3 in common code, _outside_
of the core kvm_x86_ops.vcpu_run() loop.
But for DR6, the guest's value doesn't need to be loaded into hardware for
KVM_DEBUGREG_BP_ENABLED, and SVM provides a dedicated VMCB field, and so on VMX
loading the guest's value into DR6 is handled by vmx_vcpu_run(), i.e. is done
_inside_ the core run loop.
Unfortunately, saving the guest values on VM-Exit is initiated by common x86,
again outside of the core run loop. If the guest modifies DR6 (in hardware),
and then the next VM-Exit is a fastpath VM-Exit, KVM will reload hardware DR6
with vcpu->arch.dr6 and clobber the guest's actual value.
The bug shows up with nested because KVM handles the VMX preemption timer in the
fastpath. The sequence (omitting a pile of entry/exits that aren't directly
relevant to the bug) being hit is effectively:
                                          Hardware DR6 Value
  KVM:  arch.dr6 = 0xffff0ff0
  KVM:  mov arch.dr6, dr6                 0xffff0ff0
  vCPU: mov 0xffff0ff1, dr6   # VM-Exit
  KVM:  disable DR interception
  vCPU: mov 0xffff0ff1, dr6               0xffff0ff1
        VMX Preemption Timer  # VM-Exit
  KVM:  mov arch.dr6, dr6                 0xffff0ff0
  vCPU: mov dr6, <reg>
Amusingly, SVM has the bug even though there's a dedicated VMCB field. But
because SVM doesn't handle any asynchronous exits in its fastpath, hitting
the issue would require the vCPU to send an IPI between hardware DR6 being
modified (either by software or a #DB) and software reading DR6.
A hack-a-fix hammer would be to force KVM down the slow path if DR interception
is disabled:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b2d9a16fd4d3..354b50a5838c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10995,7 +10995,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (kvm_lapic_enabled(vcpu))
kvm_x86_call(sync_pir_to_irr)(vcpu);
- if (unlikely(kvm_vcpu_exit_request(vcpu))) {
+ if (unlikely(kvm_vcpu_exit_request(vcpu)) ||
+ (vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)) {
exit_fastpath = EXIT_FASTPATH_EXIT_HANDLED;
break;
}
But I'm leaning toward going straight for a more complete fix. My only hesitation
is adding a dedicated .set_dr6() hook, as there's probably more code in VMX and
SVM that can (should?) be moved out of .vcpu_run(), i.e. we could probably add a
.pre_vcpu_run() hook to handle everything. However, even if we added a pre-run
hook, I think I'd still prefer to keep the KVM_DEBUGREG_WONT_EXIT logic in one
place (modulo the SVM behavior :-/).
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/svm/svm.c | 5 ++---
arch/x86/kvm/vmx/main.c | 1 +
arch/x86/kvm/vmx/vmx.c | 10 ++++++----
arch/x86/kvm/vmx/x86_ops.h | 1 +
arch/x86/kvm/x86.c | 3 +++
7 files changed, 15 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 7b4536ff3834..5459bc48cfd1 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -48,6 +48,7 @@ KVM_X86_OP(set_idt)
KVM_X86_OP(get_gdt)
KVM_X86_OP(set_gdt)
KVM_X86_OP(sync_dirty_debug_regs)
+KVM_X86_OP(set_dr6)
KVM_X86_OP(set_dr7)
KVM_X86_OP(cache_reg)
KVM_X86_OP(get_rflags)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5193c3dfbce1..21d247176858 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1685,6 +1685,7 @@ struct kvm_x86_ops {
void (*get_gdt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
void (*set_gdt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
void (*sync_dirty_debug_regs)(struct kvm_vcpu *vcpu);
+ void (*set_dr6)(struct kvm_vcpu *vcpu, unsigned long value);
void (*set_dr7)(struct kvm_vcpu *vcpu, unsigned long value);
void (*cache_reg)(struct kvm_vcpu *vcpu, enum kvm_reg reg);
unsigned long (*get_rflags)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 7640a84e554a..9d2033d64cfb 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4247,9 +4247,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
* Run with all-zero DR6 unless needed, so that we can get the exact cause
* of a #DB.
*/
- if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT))
- svm_set_dr6(svm, vcpu->arch.dr6);
- else
+ if (likely(!(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)))
svm_set_dr6(svm, DR6_ACTIVE_LOW);
clgi();
@@ -5043,6 +5041,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.set_idt = svm_set_idt,
.get_gdt = svm_get_gdt,
.set_gdt = svm_set_gdt,
+ .set_dr6 = svm_set_dr6,
.set_dr7 = svm_set_dr7,
.sync_dirty_debug_regs = svm_sync_dirty_debug_regs,
.cache_reg = svm_cache_reg,
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 2427f918e763..43ee9ed11291 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -61,6 +61,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.set_idt = vmx_set_idt,
.get_gdt = vmx_get_gdt,
.set_gdt = vmx_set_gdt,
+ .set_dr6 = vmx_set_dr6,
.set_dr7 = vmx_set_dr7,
.sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
.cache_reg = vmx_cache_reg,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f72835e85b6d..6c56d5235f0f 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5648,6 +5648,12 @@ void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
set_debugreg(DR6_RESERVED, 6);
}
+void vmx_set_dr6(struct kvm_vcpu *vcpu, unsigned long val)
+{
+ lockdep_assert_irqs_disabled();
+ set_debugreg(vcpu->arch.dr6, 6);
+}
+
void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
{
vmcs_writel(GUEST_DR7, val);
@@ -7417,10 +7423,6 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
vmx->loaded_vmcs->host_state.cr4 = cr4;
}
- /* When KVM_DEBUGREG_WONT_EXIT, dr6 is accessible in guest. */
- if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT))
- set_debugreg(vcpu->arch.dr6, 6);
-
/* When single-stepping over STI and MOV SS, we must clear the
* corresponding interruptibility bits in the guest state. Otherwise
* vmentry fails as it then expects bit 14 (BS) in pending debug
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index ce3295a67c04..430773a5ef8e 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -73,6 +73,7 @@ void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_dr6(struct kvm_vcpu *vcpu, unsigned long val);
void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val);
void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu);
void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b2d9a16fd4d3..9d5d267b4e14 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10971,6 +10971,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
set_debugreg(vcpu->arch.eff_db[1], 1);
set_debugreg(vcpu->arch.eff_db[2], 2);
set_debugreg(vcpu->arch.eff_db[3], 3);
+ /* When KVM_DEBUGREG_WONT_EXIT, dr6 is accessible in guest. */
+ if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT))
+ kvm_x86_call(set_dr6)(vcpu, vcpu->arch.dr6);
} else if (unlikely(hw_breakpoint_active())) {
set_debugreg(0, 7);
}
base-commit: eb723766b1030a23c38adf2348b7c3d1409d11f0
--
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: BUG: Occasional unexpected DR6 value seen with nested virtualization on x86
2025-01-22 20:55 ` Sean Christopherson
@ 2025-01-22 23:00 ` John Stultz
2025-01-23 0:32 ` Sean Christopherson
0 siblings, 1 reply; 7+ messages in thread
From: John Stultz @ 2025-01-22 23:00 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, kvm, Peter Zijlstra, Frederic Weisbecker,
Andy Lutomirski, Borislav Petkov, Jim Mattson, Alex Bennée,
Will Deacon, Thomas Gleixner, Dave Hansen, LKML, kernel-team
On Wed, Jan 22, 2025 at 12:55 PM Sean Christopherson <seanjc@google.com> wrote:
> On Tue, Jan 21, 2025, John Stultz wrote:
> > Now, here's where it is odd. I could *not* reproduce the problem on
> > bare metal hardware, *nor* could I reproduce the problem in a virtual
> > environment. I can *only* reproduce the problem with nested
> > virtualization (running the VM inside a VM).
> >
> > I have reproduced this on my intel i12 NUC using the same v6.12 kernel
> > on metal + virt + nested environments. It also reproduced on the NUC
> > with v5.15 (metal) + v6.1 (virt) + v6.1(nested).
>
> Huh. This isn't actually a nested virtualization bug. It's a flaw in KVM's
> fastpath handling. But hitting it in a non-nested setup is practically impossible,
> because it requires the "kernel" running the test to have interrupts enabled
> (rules out the ptrace test), a source of interrupts (rules out KVM-Unit-Test),
> and a window of a handful of instructions (or a weird guest).
>
...[trimmed the very impressive details I'll pretend I understand :) ]...
> But I'm leaning toward going straight for a more complete fix. My only hesitation
> is adding a dedicated .set_dr6() hook, as there's probably more code in VMX and
> SVM that can (should?) be moved out of .vcpu_run(), i.e. we could probably add a
> .pre_vcpu_run() hook to handle everything. However, even if we added a pre-run
> hook, I think I'd still prefer to keep the KVM_DEBUGREG_WONT_EXIT logic in one
> place (modulo the SVM behavior :-/).
>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/svm/svm.c | 5 ++---
> arch/x86/kvm/vmx/main.c | 1 +
> arch/x86/kvm/vmx/vmx.c | 10 ++++++----
> arch/x86/kvm/vmx/x86_ops.h | 1 +
> arch/x86/kvm/x86.c | 3 +++
> 7 files changed, 15 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 7b4536ff3834..5459bc48cfd1 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -48,6 +48,7 @@ KVM_X86_OP(set_idt)
> KVM_X86_OP(get_gdt)
> KVM_X86_OP(set_gdt)
> KVM_X86_OP(sync_dirty_debug_regs)
> +KVM_X86_OP(set_dr6)
> KVM_X86_OP(set_dr7)
> KVM_X86_OP(cache_reg)
> KVM_X86_OP(get_rflags)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 5193c3dfbce1..21d247176858 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1685,6 +1685,7 @@ struct kvm_x86_ops {
> void (*get_gdt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
> void (*set_gdt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
> void (*sync_dirty_debug_regs)(struct kvm_vcpu *vcpu);
> + void (*set_dr6)(struct kvm_vcpu *vcpu, unsigned long value);
> void (*set_dr7)(struct kvm_vcpu *vcpu, unsigned long value);
> void (*cache_reg)(struct kvm_vcpu *vcpu, enum kvm_reg reg);
> unsigned long (*get_rflags)(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 7640a84e554a..9d2033d64cfb 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4247,9 +4247,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
> * Run with all-zero DR6 unless needed, so that we can get the exact cause
> * of a #DB.
> */
> - if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT))
> - svm_set_dr6(svm, vcpu->arch.dr6);
> - else
> + if (likely(!(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)))
> svm_set_dr6(svm, DR6_ACTIVE_LOW);
>
> clgi();
> @@ -5043,6 +5041,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
> .set_idt = svm_set_idt,
> .get_gdt = svm_get_gdt,
> .set_gdt = svm_set_gdt,
> + .set_dr6 = svm_set_dr6,
Just FYI, to get this to build (svm_set_dr6() takes a *svm, not a *vcpu),
I needed to create a little wrapper to get the types right:

static void svm_set_dr6_vcpu(struct kvm_vcpu *vcpu, unsigned long value)
{
	struct vcpu_svm *svm = to_svm(vcpu);

	svm_set_dr6(svm, value);
}
But otherwise, this looks like it has fixed the issue! I've not been
able to trip a failure with the bionic ptrace test, nor with the debug
test in kvm-unit-tests, both running in loops for several minutes.
Tested-by: John Stultz <jstultz@google.com>
Thank you so much for the fast fix here!
-john
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: BUG: Occasional unexpected DR6 value seen with nested virtualization on x86
2025-01-22 23:00 ` John Stultz
@ 2025-01-23 0:32 ` Sean Christopherson
0 siblings, 0 replies; 7+ messages in thread
From: Sean Christopherson @ 2025-01-23 0:32 UTC (permalink / raw)
To: John Stultz
Cc: Paolo Bonzini, kvm, Peter Zijlstra, Frederic Weisbecker,
Andy Lutomirski, Borislav Petkov, Jim Mattson, Alex Bennée,
Will Deacon, Thomas Gleixner, Dave Hansen, LKML, kernel-team
On Wed, Jan 22, 2025, John Stultz wrote:
> On Wed, Jan 22, 2025 at 12:55 PM Sean Christopherson <seanjc@google.com> wrote:
> > On Tue, Jan 21, 2025, John Stultz wrote:
> > @@ -5043,6 +5041,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
> > .set_idt = svm_set_idt,
> > .get_gdt = svm_get_gdt,
> > .set_gdt = svm_set_gdt,
> > + .set_dr6 = svm_set_dr6,
>
>
> Just fyi, to get this to build (svm_set_dr6 takes a *svm not a *vcpu)
> I needed to create a little wrapper to get the types right:
>
> static void svm_set_dr6_vcpu(struct kvm_vcpu *vcpu, unsigned long value)
> {
> 	struct vcpu_svm *svm = to_svm(vcpu);
>
> 	svm_set_dr6(svm, value);
> }
Heh, yeah, I discovered as much when I tried to build with my more generic kconfig.
> But otherwise, this looks like it has fixed the issue! I've not been
> able to trip a failure with the bionic ptrace test, nor with the debug
> test in kvm-unit-tests, both running in loops for several minutes.
FWIW, I ran the testcase in L2 for ~45 minutes and saw one failure ~3 minutes in,
but unfortunately I didn't have any tracing running so I have zero insight into
what went wrong. I'm fairly certain the failure was due to running an unpatched
kernel in L1, i.e. that I hit the ultra-rare scenario where an L2=>L1 fastpath
exit between the #DB and read from DR6 clobbered hardware DR6.
For giggles and extra confidence, I hacked KVM to emulate HLT as a nop in the
fastpath, and verified the failure (and the fix) in a non-nested setup with the
below selftest, on both AMD and Intel.
Sadly, KVM doesn't handle many exits in the fastpath on AMD, so having a regression
test that isn't Intel-specific isn't really possible at the moment. I'm mildly
tempted to use testing as an excuse to handle some CPUID emulation in the fastpath,
as Linux userspace does a _lot_ of CPUID, e.g. a kernel build generates tens of
thousands of CPUID exits.
Anyways, this all makes me confident in the fix. I'll post it properly tomorrow.
diff --git a/tools/testing/selftests/kvm/x86/debug_regs.c b/tools/testing/selftests/kvm/x86/debug_regs.c
index 2d814c1d1dc4..a34b65052f4e 100644
--- a/tools/testing/selftests/kvm/x86/debug_regs.c
+++ b/tools/testing/selftests/kvm/x86/debug_regs.c
@@ -22,11 +22,25 @@ extern unsigned char sw_bp, hw_bp, write_data, ss_start, bd_start;
static void guest_code(void)
{
+ unsigned long val = 0xffff0ffful;
+
/* Create a pending interrupt on current vCPU */
x2apic_enable();
x2apic_write_reg(APIC_ICR, APIC_DEST_SELF | APIC_INT_ASSERT |
APIC_DM_FIXED | IRQ_VECTOR);
+ /*
+ * Debug Register Interception tests.
+ */
+ asm volatile("mov %%rax, %%dr6\n\t"
+ "hlt\n\t"
+ "mov %%dr6, %%rax\n\t"
+ : "+r" (val));
+
+ __GUEST_ASSERT(val == 0xffff0ffful,
+ "Wanted DR6 = 0xffff0ffful, got %lx\n", val);
+ GUEST_SYNC(0);
+
/*
* Software BP tests.
*
@@ -103,6 +117,9 @@ int main(void)
vm = vm_create_with_one_vcpu(&vcpu, guest_code);
run = vcpu->run;
+ vcpu_run(vcpu);
+ TEST_ASSERT_EQ(get_ucall(vcpu, NULL), UCALL_SYNC);
+
/* Test software BPs - int3 */
memset(&debug, 0, sizeof(debug));
debug.control = KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_USE_SW_BP;
^ permalink raw reply related [flat|nested] 7+ messages in thread
end of thread, other threads:[~2025-01-23 0:32 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-22 6:02 BUG: Occasional unexpected DR6 value seen with nested virtualization on x86 John Stultz
2025-01-22 9:00 ` Paolo Bonzini
2025-01-22 16:55 ` Jim Mattson
2025-01-22 18:24 ` John Stultz
2025-01-22 20:55 ` Sean Christopherson
2025-01-22 23:00 ` John Stultz
2025-01-23 0:32 ` Sean Christopherson