[BUG] Guest OSes die simultaneously (bisected)

linux-perf-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [BUG] Guest OSes die simultaneously (bisected)
@ 2024-01-03 22:22 Paul E. McKenney
  2024-01-03 23:27 ` Paul E. McKenney
  2024-01-04 10:01 ` Breno Leitao
  0 siblings, 2 replies; 12+ messages in thread
From: Paul E. McKenney @ 2024-01-03 22:22 UTC (permalink / raw)
  To: Like Xu, Andi Kleen, Kan Liang, Luwei Kang, Peter Zijlstra,
	Paolo Bonzini
  Cc: linux-perf-users, linux-kernel, kvm, Breno Leitao,
	Arnaldo Carvalho de Melo, Ingo Molnar

Hello!

Since some time between v5.19 and v6.4, long-running rcutorture tests
would (rarely but intolerably often) have all guests on a given host die
simultaneously with something like an instruction fault or a segmentation
violation.

Each bisection step required 20 hosts running 10 hours each, and
this eventually fingered commit c59a1f106f5c ("KVM: x86/pmu: Add
IA32_PEBS_ENABLE MSR emulation for extended PEBS").  Although this commit
is certainly messing with things that could possibly cause all manner
of mischief, I don't immediately see a smoking gun.  Except that the
commit prior to this one is rock solid.

Just to make things a bit more exciting, bisection in mainline proved
to be problematic due to bugs of various kinds that hid this one.  I was
therefore forced to bisect among the commits backported to the internal
v5.19-based kernel, which fingered the backported version of the patch
called out above.

Please note that this is not (yet) an emergency.  I will just continue
to run rcutorture on v5.19-based hypervisors in the meantime.

Any suggestions for debugging or fixing?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG] Guest OSes die simultaneously (bisected)
  2024-01-03 22:22 [BUG] Guest OSes die simultaneously (bisected) Paul E. McKenney
@ 2024-01-03 23:27 ` Paul E. McKenney
  2024-01-04  0:24   ` Sean Christopherson
  2024-01-04 10:01 ` Breno Leitao
  1 sibling, 1 reply; 12+ messages in thread
From: Paul E. McKenney @ 2024-01-03 23:27 UTC (permalink / raw)
  To: Like Xu, Andi Kleen, Kan Liang, Luwei Kang, Peter Zijlstra,
	Paolo Bonzini
  Cc: linux-perf-users, linux-kernel, kvm, Breno Leitao,
	Arnaldo Carvalho de Melo, Ingo Molnar

On Wed, Jan 03, 2024 at 02:22:23PM -0800, Paul E. McKenney wrote:
> Hello!
> 
> Since some time between v5.19 and v6.4, long-running rcutorture tests
> would (rarely but intolerably often) have all guests on a given host die
> simultaneously with something like an instruction fault or a segmentation
> violation.
> 
> Each bisection step required 20 hosts running 10 hours each, and
> this eventually fingered commit c59a1f106f5c ("KVM: x86/pmu: Add
> IA32_PEBS_ENABLE MSR emulation for extended PEBS").  Although this commit
> is certainly messing with things that could possibly cause all manner
> of mischief, I don't immediately see a smoking gun.  Except that the
> commit prior to this one is rock solid.
> 
> Just to make things a bit more exciting, bisection in mainline proved
> to be problematic due to bugs of various kinds that hid this one.  I was
> therefore forced to bisect among the commits backported to the internal
> v5.19-based kernel, which fingered the backported version of the patch
> called out above.

Ah, and so why do I believe that this is a problem in mainline rather
than just (say) a backporting mistake?

Because this issue was first located in v6.4, which already has this
commit included.

							Thanx, Paul

> Please note that this is not (yet) an emergency.  I will just continue
> to run rcutorture on v5.19-based hypervisors in the meantime.
> 
> Any suggestions for debugging or fixing?
> 
> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG] Guest OSes die simultaneously (bisected)
  2024-01-03 23:27 ` Paul E. McKenney
@ 2024-01-04  0:24   ` Sean Christopherson
  2024-01-04  1:00     ` Paul E. McKenney
  0 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2024-01-04  0:24 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Like Xu, Andi Kleen, Kan Liang, Luwei Kang, Peter Zijlstra,
	Paolo Bonzini, linux-perf-users, linux-kernel, kvm, Breno Leitao,
	Arnaldo Carvalho de Melo, Ingo Molnar

On Wed, Jan 03, 2024, Paul E. McKenney wrote:
> On Wed, Jan 03, 2024 at 02:22:23PM -0800, Paul E. McKenney wrote:
> > Hello!
> > 
> > Since some time between v5.19 and v6.4, long-running rcutorture tests
> > would (rarely but intolerably often) have all guests on a given host die
> > simultaneously with something like an instruction fault or a segmentation
> > violation.
> > 
> > Each bisection step required 20 hosts running 10 hours each, and
> > this eventually fingered commit c59a1f106f5c ("KVM: x86/pmu: Add
> > IA32_PEBS_ENABLE MSR emulation for extended PEBS").  Although this commit
> > is certainly messing with things that could possibly cause all manner
> > of mischief, I don't immediately see a smoking gun.  Except that the
> > commit prior to this one is rock solid.
> > Just to make things a bit more exciting, bisection in mainline proved
> > to be problematic due to bugs of various kinds that hid this one.  I was
> > therefore forced to bisect among the commits backported to the internal
> > v5.19-based kernel, which fingered the backported version of the patch
> > called out above.
> 
> Ah, and so why do I believe that this is a problem in mainline rather
> than just (say) a backporting mistake?
> 
> Because this issue was first located in v6.4, which already has this
> commit included.
> 
> 							Thanx, Paul
> 
> > Please note that this is not (yet) an emergency.  I will just continue
> > to run rcutorture on v5.19-based hypervisors in the meantime.
> > 
> > Any suggestions for debugging or fixing?

This looks suspect:

+       u64 pebs_mask = cpuc->pebs_enabled & x86_pmu.pebs_capable;
+       int global_ctrl, pebs_enable;
 
-       arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
-       arr[0].host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
-       arr[0].guest = intel_ctrl & ~cpuc->intel_ctrl_host_mask;
-       arr[0].guest &= ~(cpuc->pebs_enabled & x86_pmu.pebs_capable);
-       *nr = 1;
+       *nr = 0;
+       global_ctrl = (*nr)++;
+       arr[global_ctrl] = (struct perf_guest_switch_msr){
+               .msr = MSR_CORE_PERF_GLOBAL_CTRL,
+               .host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
+               .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),
+       };


IIUC (always a big if with this code), the intent is that the guest's version of
PERF_GLOBAL_CTRL gets bits that are (a) not exclusive to the host and (b) not
being used for PEBS.  (b) is necessary because PEBS generates records in memory
using virtual addresses, i.e. the CPU will write to memory using a virtual address
that is valid for the host but not the guest.  And so PMU counters that are
configured to generate PEBS records need to be disabled while running the guest.

Before that commit, the logic was:

  guest[PERF_GLOBAL_CTRL] = ctrl & ~host;
  guest[PERF_GLOBAL_CTRL] &= ~pebs;

But after, it's now:

  guest[PERF_GLOBAL_CTRL] = ctrl & (~host | ~pebs);

I.e. the kernel is enabled counters in the guest that are not host-only OR not
PEBS.  E.g. if only counter 0 is in use, it's using PEBS, but it's not exclusive
to the host, then the new code will yield (truncated to a single byte for sanity)

  1 = 1 & (0xf | 0xe)

and thus keep counter 0 enabled, whereas the old code would yield

  1 = 1 & 0xf
  0 = 1 & 0xe

A bit of a shot in the dark and completed untested, but I think this is the correct
fix?

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a08f794a0e79..92d5a3464cb2 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4056,7 +4056,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
        arr[global_ctrl] = (struct perf_guest_switch_msr){
                .msr = MSR_CORE_PERF_GLOBAL_CTRL,
                .host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
-               .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),
+               .guest = intel_ctrl & ~(cpuc->intel_ctrl_host_mask | pebs_mask),
        };
 
        if (!x86_pmu.pebs)


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [BUG] Guest OSes die simultaneously (bisected)
  2024-01-04  0:24   ` Sean Christopherson
@ 2024-01-04  1:00     ` Paul E. McKenney
  2024-01-04 14:50       ` Paul E. McKenney
  0 siblings, 1 reply; 12+ messages in thread
From: Paul E. McKenney @ 2024-01-04  1:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Like Xu, Andi Kleen, Kan Liang, Luwei Kang, Peter Zijlstra,
	Paolo Bonzini, linux-perf-users, linux-kernel, kvm, Breno Leitao,
	Arnaldo Carvalho de Melo, Ingo Molnar

On Wed, Jan 03, 2024 at 04:24:06PM -0800, Sean Christopherson wrote:
> On Wed, Jan 03, 2024, Paul E. McKenney wrote:
> > On Wed, Jan 03, 2024 at 02:22:23PM -0800, Paul E. McKenney wrote:
> > > Hello!
> > > 
> > > Since some time between v5.19 and v6.4, long-running rcutorture tests
> > > would (rarely but intolerably often) have all guests on a given host die
> > > simultaneously with something like an instruction fault or a segmentation
> > > violation.
> > > 
> > > Each bisection step required 20 hosts running 10 hours each, and
> > > this eventually fingered commit c59a1f106f5c ("KVM: x86/pmu: Add
> > > IA32_PEBS_ENABLE MSR emulation for extended PEBS").  Although this commit
> > > is certainly messing with things that could possibly cause all manner
> > > of mischief, I don't immediately see a smoking gun.  Except that the
> > > commit prior to this one is rock solid.
> > > Just to make things a bit more exciting, bisection in mainline proved
> > > to be problematic due to bugs of various kinds that hid this one.  I was
> > > therefore forced to bisect among the commits backported to the internal
> > > v5.19-based kernel, which fingered the backported version of the patch
> > > called out above.
> > 
> > Ah, and so why do I believe that this is a problem in mainline rather
> > than just (say) a backporting mistake?
> > 
> > Because this issue was first located in v6.4, which already has this
> > commit included.
> > 
> > 							Thanx, Paul
> > 
> > > Please note that this is not (yet) an emergency.  I will just continue
> > > to run rcutorture on v5.19-based hypervisors in the meantime.
> > > 
> > > Any suggestions for debugging or fixing?
> 
> This looks suspect:
> 
> +       u64 pebs_mask = cpuc->pebs_enabled & x86_pmu.pebs_capable;
> +       int global_ctrl, pebs_enable;
>  
> -       arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
> -       arr[0].host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
> -       arr[0].guest = intel_ctrl & ~cpuc->intel_ctrl_host_mask;
> -       arr[0].guest &= ~(cpuc->pebs_enabled & x86_pmu.pebs_capable);
> -       *nr = 1;
> +       *nr = 0;
> +       global_ctrl = (*nr)++;
> +       arr[global_ctrl] = (struct perf_guest_switch_msr){
> +               .msr = MSR_CORE_PERF_GLOBAL_CTRL,
> +               .host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
> +               .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),
> +       };
> 
> 
> IIUC (always a big if with this code), the intent is that the guest's version of
> PERF_GLOBAL_CTRL gets bits that are (a) not exclusive to the host and (b) not
> being used for PEBS.  (b) is necessary because PEBS generates records in memory
> using virtual addresses, i.e. the CPU will write to memory using a virtual address
> that is valid for the host but not the guest.  And so PMU counters that are
> configured to generate PEBS records need to be disabled while running the guest.
> 
> Before that commit, the logic was:
> 
>   guest[PERF_GLOBAL_CTRL] = ctrl & ~host;
>   guest[PERF_GLOBAL_CTRL] &= ~pebs;
> 
> But after, it's now:
> 
>   guest[PERF_GLOBAL_CTRL] = ctrl & (~host | ~pebs);
> 
> I.e. the kernel is enabled counters in the guest that are not host-only OR not
> PEBS.  E.g. if only counter 0 is in use, it's using PEBS, but it's not exclusive
> to the host, then the new code will yield (truncated to a single byte for sanity)
> 
>   1 = 1 & (0xf | 0xe)
> 
> and thus keep counter 0 enabled, whereas the old code would yield
> 
>   1 = 1 & 0xf
>   0 = 1 & 0xe
> 
> A bit of a shot in the dark and completed untested, but I think this is the correct
> fix?

I am firing off some tests, and either way, thank you very much!!!

							Thanx, Paul

> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index a08f794a0e79..92d5a3464cb2 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -4056,7 +4056,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
>         arr[global_ctrl] = (struct perf_guest_switch_msr){
>                 .msr = MSR_CORE_PERF_GLOBAL_CTRL,
>                 .host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
> -               .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),
> +               .guest = intel_ctrl & ~(cpuc->intel_ctrl_host_mask | pebs_mask),
>         };
>  
>         if (!x86_pmu.pebs)
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG] Guest OSes die simultaneously (bisected)
  2024-01-03 22:22 [BUG] Guest OSes die simultaneously (bisected) Paul E. McKenney
  2024-01-03 23:27 ` Paul E. McKenney
@ 2024-01-04 10:01 ` Breno Leitao
  1 sibling, 0 replies; 12+ messages in thread
From: Breno Leitao @ 2024-01-04 10:01 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Like Xu, Andi Kleen, Kan Liang, Luwei Kang, Peter Zijlstra,
	Paolo Bonzini, linux-perf-users, linux-kernel, kvm,
	Arnaldo Carvalho de Melo, Ingo Molnar, rbc

On Wed, Jan 03, 2024 at 02:22:23PM -0800, Paul E. McKenney wrote:
> I was
> therefore forced to bisect among the commits backported to the internal
> v5.19-based kernel, which fingered the backported version of the patch
> called out above.

Just to add some context to these backport, this commit (c59a1f106f5c)
was backported to the internal v5.19-based kernel in order to easily
backport these two fixes.

	a16eb25b09c02a54c ("KVM: x86: Mask LVTPC when handling a PMI")
	73554b29bd70546c1 ("KVM: x86/pmu: Synthesize at most one PMI per VM-exit")

They are required to solve the softlockup problem reported here:

	https://lore.kernel.org/all/ZRwcpki67uhpAUKi@gmail.com/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG] Guest OSes die simultaneously (bisected)
  2024-01-04  1:00     ` Paul E. McKenney
@ 2024-01-04 14:50       ` Paul E. McKenney
  2024-01-04 14:59         ` Paolo Bonzini
  0 siblings, 1 reply; 12+ messages in thread
From: Paul E. McKenney @ 2024-01-04 14:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Like Xu, Andi Kleen, Kan Liang, Luwei Kang, Peter Zijlstra,
	Paolo Bonzini, linux-perf-users, linux-kernel, kvm, Breno Leitao,
	Arnaldo Carvalho de Melo, Ingo Molnar

On Wed, Jan 03, 2024 at 05:00:35PM -0800, Paul E. McKenney wrote:
> On Wed, Jan 03, 2024 at 04:24:06PM -0800, Sean Christopherson wrote:
> > On Wed, Jan 03, 2024, Paul E. McKenney wrote:
> > > On Wed, Jan 03, 2024 at 02:22:23PM -0800, Paul E. McKenney wrote:
> > > > Hello!
> > > > 
> > > > Since some time between v5.19 and v6.4, long-running rcutorture tests
> > > > would (rarely but intolerably often) have all guests on a given host die
> > > > simultaneously with something like an instruction fault or a segmentation
> > > > violation.
> > > > 
> > > > Each bisection step required 20 hosts running 10 hours each, and
> > > > this eventually fingered commit c59a1f106f5c ("KVM: x86/pmu: Add
> > > > IA32_PEBS_ENABLE MSR emulation for extended PEBS").  Although this commit
> > > > is certainly messing with things that could possibly cause all manner
> > > > of mischief, I don't immediately see a smoking gun.  Except that the
> > > > commit prior to this one is rock solid.
> > > > Just to make things a bit more exciting, bisection in mainline proved
> > > > to be problematic due to bugs of various kinds that hid this one.  I was
> > > > therefore forced to bisect among the commits backported to the internal
> > > > v5.19-based kernel, which fingered the backported version of the patch
> > > > called out above.
> > > 
> > > Ah, and so why do I believe that this is a problem in mainline rather
> > > than just (say) a backporting mistake?
> > > 
> > > Because this issue was first located in v6.4, which already has this
> > > commit included.
> > > 
> > > 							Thanx, Paul
> > > 
> > > > Please note that this is not (yet) an emergency.  I will just continue
> > > > to run rcutorture on v5.19-based hypervisors in the meantime.
> > > > 
> > > > Any suggestions for debugging or fixing?
> > 
> > This looks suspect:
> > 
> > +       u64 pebs_mask = cpuc->pebs_enabled & x86_pmu.pebs_capable;
> > +       int global_ctrl, pebs_enable;
> >  
> > -       arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
> > -       arr[0].host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
> > -       arr[0].guest = intel_ctrl & ~cpuc->intel_ctrl_host_mask;
> > -       arr[0].guest &= ~(cpuc->pebs_enabled & x86_pmu.pebs_capable);
> > -       *nr = 1;
> > +       *nr = 0;
> > +       global_ctrl = (*nr)++;
> > +       arr[global_ctrl] = (struct perf_guest_switch_msr){
> > +               .msr = MSR_CORE_PERF_GLOBAL_CTRL,
> > +               .host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
> > +               .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),
> > +       };
> > 
> > 
> > IIUC (always a big if with this code), the intent is that the guest's version of
> > PERF_GLOBAL_CTRL gets bits that are (a) not exclusive to the host and (b) not
> > being used for PEBS.  (b) is necessary because PEBS generates records in memory
> > using virtual addresses, i.e. the CPU will write to memory using a virtual address
> > that is valid for the host but not the guest.  And so PMU counters that are
> > configured to generate PEBS records need to be disabled while running the guest.
> > 
> > Before that commit, the logic was:
> > 
> >   guest[PERF_GLOBAL_CTRL] = ctrl & ~host;
> >   guest[PERF_GLOBAL_CTRL] &= ~pebs;
> > 
> > But after, it's now:
> > 
> >   guest[PERF_GLOBAL_CTRL] = ctrl & (~host | ~pebs);
> > 
> > I.e. the kernel is enabled counters in the guest that are not host-only OR not
> > PEBS.  E.g. if only counter 0 is in use, it's using PEBS, but it's not exclusive
> > to the host, then the new code will yield (truncated to a single byte for sanity)
> > 
> >   1 = 1 & (0xf | 0xe)
> > 
> > and thus keep counter 0 enabled, whereas the old code would yield
> > 
> >   1 = 1 & 0xf
> >   0 = 1 & 0xe
> > 
> > A bit of a shot in the dark and completed untested, but I think this is the correct
> > fix?
> 
> I am firing off some tests, and either way, thank you very much!!!

Woo-hoo!!!  ;-)

Tested-by: Paul E. McKenney <paulmck@kernel.org>

Will you be sending a proper patch, or would you prefer that I do so?
In the latter case, I would need your Signed-off-by.

And again, thank you very much!!!

							Thanx, Paul

> > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > index a08f794a0e79..92d5a3464cb2 100644
> > --- a/arch/x86/events/intel/core.c
> > +++ b/arch/x86/events/intel/core.c
> > @@ -4056,7 +4056,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
> >         arr[global_ctrl] = (struct perf_guest_switch_msr){
> >                 .msr = MSR_CORE_PERF_GLOBAL_CTRL,
> >                 .host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
> > -               .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),
> > +               .guest = intel_ctrl & ~(cpuc->intel_ctrl_host_mask | pebs_mask),
> >         };
> >  
> >         if (!x86_pmu.pebs)
> > 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG] Guest OSes die simultaneously (bisected)
  2024-01-04 14:50       ` Paul E. McKenney
@ 2024-01-04 14:59         ` Paolo Bonzini
  2024-01-04 16:06           ` Paul E. McKenney
  0 siblings, 1 reply; 12+ messages in thread
From: Paolo Bonzini @ 2024-01-04 14:59 UTC (permalink / raw)
  To: paulmck
  Cc: Sean Christopherson, Like Xu, Andi Kleen, Kan Liang, Luwei Kang,
	Peter Zijlstra, linux-perf-users, linux-kernel, kvm, Breno Leitao,
	Arnaldo Carvalho de Melo, Ingo Molnar

On Thu, Jan 4, 2024 at 3:58 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, Jan 03, 2024 at 05:00:35PM -0800, Paul E. McKenney wrote:
> > On Wed, Jan 03, 2024 at 04:24:06PM -0800, Sean Christopherson wrote:
> > > On Wed, Jan 03, 2024, Paul E. McKenney wrote:
> > > > On Wed, Jan 03, 2024 at 02:22:23PM -0800, Paul E. McKenney wrote:
> > > > > Hello!
> > > > >
> > > > > Since some time between v5.19 and v6.4, long-running rcutorture tests
> > > > > would (rarely but intolerably often) have all guests on a given host die
> > > > > simultaneously with something like an instruction fault or a segmentation
> > > > > violation.
> > > > >
> > > > > Each bisection step required 20 hosts running 10 hours each, and
> > > > > this eventually fingered commit c59a1f106f5c ("KVM: x86/pmu: Add
> > > > > IA32_PEBS_ENABLE MSR emulation for extended PEBS").  Although this commit
> > > > > is certainly messing with things that could possibly cause all manner
> > > > > of mischief, I don't immediately see a smoking gun.  Except that the
> > > > > commit prior to this one is rock solid.
> > > > > Just to make things a bit more exciting, bisection in mainline proved
> > > > > to be problematic due to bugs of various kinds that hid this one.  I was
> > > > > therefore forced to bisect among the commits backported to the internal
> > > > > v5.19-based kernel, which fingered the backported version of the patch
> > > > > called out above.
> > > >
> > > > Ah, and so why do I believe that this is a problem in mainline rather
> > > > than just (say) a backporting mistake?
> > > >
> > > > Because this issue was first located in v6.4, which already has this
> > > > commit included.
> > > >
> > > >                                                   Thanx, Paul
> > > >
> > > > > Please note that this is not (yet) an emergency.  I will just continue
> > > > > to run rcutorture on v5.19-based hypervisors in the meantime.
> > > > >
> > > > > Any suggestions for debugging or fixing?
> > >
> > > This looks suspect:
> > >
> > > +       u64 pebs_mask = cpuc->pebs_enabled & x86_pmu.pebs_capable;
> > > +       int global_ctrl, pebs_enable;
> > >
> > > -       arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
> > > -       arr[0].host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
> > > -       arr[0].guest = intel_ctrl & ~cpuc->intel_ctrl_host_mask;
> > > -       arr[0].guest &= ~(cpuc->pebs_enabled & x86_pmu.pebs_capable);
> > > -       *nr = 1;
> > > +       *nr = 0;
> > > +       global_ctrl = (*nr)++;
> > > +       arr[global_ctrl] = (struct perf_guest_switch_msr){
> > > +               .msr = MSR_CORE_PERF_GLOBAL_CTRL,
> > > +               .host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
> > > +               .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),
> > > +       };
> > >
> > >
> > > IIUC (always a big if with this code), the intent is that the guest's version of
> > > PERF_GLOBAL_CTRL gets bits that are (a) not exclusive to the host and (b) not
> > > being used for PEBS.  (b) is necessary because PEBS generates records in memory
> > > using virtual addresses, i.e. the CPU will write to memory using a virtual address
> > > that is valid for the host but not the guest.  And so PMU counters that are
> > > configured to generate PEBS records need to be disabled while running the guest.
> > >
> > > Before that commit, the logic was:
> > >
> > >   guest[PERF_GLOBAL_CTRL] = ctrl & ~host;
> > >   guest[PERF_GLOBAL_CTRL] &= ~pebs;
> > >
> > > But after, it's now:
> > >
> > >   guest[PERF_GLOBAL_CTRL] = ctrl & (~host | ~pebs);
> > >
> > > I.e. the kernel is enabled counters in the guest that are not host-only OR not
> > > PEBS.  E.g. if only counter 0 is in use, it's using PEBS, but it's not exclusive
> > > to the host, then the new code will yield (truncated to a single byte for sanity)
> > >
> > >   1 = 1 & (0xf | 0xe)
> > >
> > > and thus keep counter 0 enabled, whereas the old code would yield
> > >
> > >   1 = 1 & 0xf
> > >   0 = 1 & 0xe
> > >
> > > A bit of a shot in the dark and completed untested, but I think this is the correct
> > > fix?
> >
> > I am firing off some tests, and either way, thank you very much!!!
>
> Woo-hoo!!!  ;-)
>
> Tested-by: Paul E. McKenney <paulmck@kernel.org>
>
> Will you be sending a proper patch, or would you prefer that I do so?
> In the latter case, I would need your Signed-off-by.

I will fast track this one to Linus.

Paolo

> And again, thank you very much!!!
>
>                                                         Thanx, Paul
>
> > > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > > index a08f794a0e79..92d5a3464cb2 100644
> > > --- a/arch/x86/events/intel/core.c
> > > +++ b/arch/x86/events/intel/core.c
> > > @@ -4056,7 +4056,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
> > >         arr[global_ctrl] = (struct perf_guest_switch_msr){
> > >                 .msr = MSR_CORE_PERF_GLOBAL_CTRL,
> > >                 .host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
> > > -               .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),
> > > +               .guest = intel_ctrl & ~(cpuc->intel_ctrl_host_mask | pebs_mask),
> > >         };
> > >
> > >         if (!x86_pmu.pebs)
> > >
>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG] Guest OSes die simultaneously (bisected)
  2024-01-04 14:59         ` Paolo Bonzini
@ 2024-01-04 16:06           ` Paul E. McKenney
  2024-01-04 16:32             ` Paolo Bonzini
  2024-01-04 17:07             ` Andi Kleen
  0 siblings, 2 replies; 12+ messages in thread
From: Paul E. McKenney @ 2024-01-04 16:06 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Like Xu, Andi Kleen, Kan Liang, Luwei Kang,
	Peter Zijlstra, linux-perf-users, linux-kernel, kvm, Breno Leitao,
	Arnaldo Carvalho de Melo, Ingo Molnar

On Thu, Jan 04, 2024 at 03:59:52PM +0100, Paolo Bonzini wrote:
> On Thu, Jan 4, 2024 at 3:58 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Wed, Jan 03, 2024 at 05:00:35PM -0800, Paul E. McKenney wrote:
> > > On Wed, Jan 03, 2024 at 04:24:06PM -0800, Sean Christopherson wrote:
> > > > On Wed, Jan 03, 2024, Paul E. McKenney wrote:
> > > > > On Wed, Jan 03, 2024 at 02:22:23PM -0800, Paul E. McKenney wrote:
> > > > > > Hello!
> > > > > >
> > > > > > Since some time between v5.19 and v6.4, long-running rcutorture tests
> > > > > > would (rarely but intolerably often) have all guests on a given host die
> > > > > > simultaneously with something like an instruction fault or a segmentation
> > > > > > violation.
> > > > > >
> > > > > > Each bisection step required 20 hosts running 10 hours each, and
> > > > > > this eventually fingered commit c59a1f106f5c ("KVM: x86/pmu: Add
> > > > > > IA32_PEBS_ENABLE MSR emulation for extended PEBS").  Although this commit
> > > > > > is certainly messing with things that could possibly cause all manner
> > > > > > of mischief, I don't immediately see a smoking gun.  Except that the
> > > > > > commit prior to this one is rock solid.
> > > > > > Just to make things a bit more exciting, bisection in mainline proved
> > > > > > to be problematic due to bugs of various kinds that hid this one.  I was
> > > > > > therefore forced to bisect among the commits backported to the internal
> > > > > > v5.19-based kernel, which fingered the backported version of the patch
> > > > > > called out above.
> > > > >
> > > > > Ah, and so why do I believe that this is a problem in mainline rather
> > > > > than just (say) a backporting mistake?
> > > > >
> > > > > Because this issue was first located in v6.4, which already has this
> > > > > commit included.
> > > > >
> > > > >                                                   Thanx, Paul
> > > > >
> > > > > > Please note that this is not (yet) an emergency.  I will just continue
> > > > > > to run rcutorture on v5.19-based hypervisors in the meantime.
> > > > > >
> > > > > > Any suggestions for debugging or fixing?
> > > >
> > > > This looks suspect:
> > > >
> > > > +       u64 pebs_mask = cpuc->pebs_enabled & x86_pmu.pebs_capable;
> > > > +       int global_ctrl, pebs_enable;
> > > >
> > > > -       arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
> > > > -       arr[0].host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
> > > > -       arr[0].guest = intel_ctrl & ~cpuc->intel_ctrl_host_mask;
> > > > -       arr[0].guest &= ~(cpuc->pebs_enabled & x86_pmu.pebs_capable);
> > > > -       *nr = 1;
> > > > +       *nr = 0;
> > > > +       global_ctrl = (*nr)++;
> > > > +       arr[global_ctrl] = (struct perf_guest_switch_msr){
> > > > +               .msr = MSR_CORE_PERF_GLOBAL_CTRL,
> > > > +               .host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
> > > > +               .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),
> > > > +       };
> > > >
> > > >
> > > > IIUC (always a big if with this code), the intent is that the guest's version of
> > > > PERF_GLOBAL_CTRL gets bits that are (a) not exclusive to the host and (b) not
> > > > being used for PEBS.  (b) is necessary because PEBS generates records in memory
> > > > using virtual addresses, i.e. the CPU will write to memory using a virtual address
> > > > that is valid for the host but not the guest.  And so PMU counters that are
> > > > configured to generate PEBS records need to be disabled while running the guest.
> > > >
> > > > Before that commit, the logic was:
> > > >
> > > >   guest[PERF_GLOBAL_CTRL] = ctrl & ~host;
> > > >   guest[PERF_GLOBAL_CTRL] &= ~pebs;
> > > >
> > > > But after, it's now:
> > > >
> > > >   guest[PERF_GLOBAL_CTRL] = ctrl & (~host | ~pebs);
> > > >
> > > > I.e. the kernel is enabled counters in the guest that are not host-only OR not
> > > > PEBS.  E.g. if only counter 0 is in use, it's using PEBS, but it's not exclusive
> > > > to the host, then the new code will yield (truncated to a single byte for sanity)
> > > >
> > > >   1 = 1 & (0xf | 0xe)
> > > >
> > > > and thus keep counter 0 enabled, whereas the old code would yield
> > > >
> > > >   1 = 1 & 0xf
> > > >   0 = 1 & 0xe
> > > >
> > > > A bit of a shot in the dark and completed untested, but I think this is the correct
> > > > fix?
> > >
> > > I am firing off some tests, and either way, thank you very much!!!
> >
> > Woo-hoo!!!  ;-)
> >
> > Tested-by: Paul E. McKenney <paulmck@kernel.org>
> >
> > Will you be sending a proper patch, or would you prefer that I do so?
> > In the latter case, I would need your Signed-off-by.
> 
> I will fast track this one to Linus.

Thank you, Paolo!

One additional request, if I may be so bold...

Although I am happy to have been able to locate the commit (and even
happier that Sean spotted the problem and that you quickly pushed the
fix to mainline!), chasing this consumed a lot of time and systems over
an embarrassingly large number of months.  As in I first spotted this
bug in late July.  Despite a number of increasingly complex attempts,
bisection became feasible only after the buggy commit was backported to
our internal v5.19 code base.  :-(

My (completely random) guess is that there is some rare combination
of events that causes this code to fail.  If so, is it feasible to
construct a test that makes this rare combination of events less rare,
so that similar future bugs are caught more quickly?

And please understand that I am not casting shade on those who wrote,
reviewed, and committed that buggy commit.  As in I freely confess that
I had to stare at Sean's fix for a few minutes before I figured out what
was going on.  Instead, the point I am trying to make is that carefully
constructed tests can serve as tireless and accurate code reviewers.
This won't ever replace actual code review, but my experience indicates
that it will help find more bugs more quickly and more easily.

							Thanx, Paul

> Paolo
> 
> > And again, thank you very much!!!
> >
> >                                                         Thanx, Paul
> >
> > > > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > > > index a08f794a0e79..92d5a3464cb2 100644
> > > > --- a/arch/x86/events/intel/core.c
> > > > +++ b/arch/x86/events/intel/core.c
> > > > @@ -4056,7 +4056,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
> > > >         arr[global_ctrl] = (struct perf_guest_switch_msr){
> > > >                 .msr = MSR_CORE_PERF_GLOBAL_CTRL,
> > > >                 .host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
> > > > -               .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),
> > > > +               .guest = intel_ctrl & ~(cpuc->intel_ctrl_host_mask | pebs_mask),
> > > >         };
> > > >
> > > >         if (!x86_pmu.pebs)
> > > >
> >
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG] Guest OSes die simultaneously (bisected)
  2024-01-04 16:06           ` Paul E. McKenney
@ 2024-01-04 16:32             ` Paolo Bonzini
  2024-01-04 17:25               ` Sean Christopherson
  2024-01-04 19:23               ` Paul E. McKenney
  2024-01-04 17:07             ` Andi Kleen
  1 sibling, 2 replies; 12+ messages in thread
From: Paolo Bonzini @ 2024-01-04 16:32 UTC (permalink / raw)
  To: paulmck
  Cc: Sean Christopherson, Like Xu, Andi Kleen, Kan Liang, Luwei Kang,
	Peter Zijlstra, linux-perf-users, linux-kernel, kvm, Breno Leitao,
	Arnaldo Carvalho de Melo, Ingo Molnar

On 1/4/24 17:06, Paul E. McKenney wrote:
> Although I am happy to have been able to locate the commit (and even
> happier that Sean spotted the problem and that you quickly pushed the
> fix to mainline!), chasing this consumed a lot of time and systems over
> an embarrassingly large number of months.  As in I first spotted this
> bug in late July.  Despite a number of increasingly complex attempts,
> bisection became feasible only after the buggy commit was backported to
> our internal v5.19 code base.  🙁

Yes, this strikes two sore points.

One is that I have also experienced being able to bisect only with a 
somewhat more linear history (namely the CentOS Stream 9 aka c9s 
frankenkernel [1]) and not with upstream.  Even if the c9s kernel is not 
a fully linear set of commits, there's some benefit from merge commits 
that consist of slightly more curated set of patches, where each merge 
commit includes both new features and bugfixes.  Unfortunately, whether 
you'll be able to do this with the c9s kernel depends a lot on the 
subsystems involved and on the bug.  Both are factors that may or may 
not be known in advance.

The other, of course, is testing.  The KVM selftests infrastructure is 
meant for this kind of white box problem, but the space of tests that 
can be written is so large, that there's always too few tests.  It 
shines when you have a clear bisection but an unclear fix (in the past I 
have had cases where spending two days to write a test led me to writing 
a fix in thirty minutes), but boosting the reproducibility is always a 
good thing.

> And please understand that I am not casting shade on those who wrote,
> reviewed, and committed that buggy commit.  As in I freely confess that
> I had to stare at Sean's fix for a few minutes before I figured out what
> was going on.

Oh don't worry about that---rather, I am going to cast a shade on those 
that did not review the commit, namely me.  I am somewhat obsessed with 
Boolean logic and *probably* I would have caught it, or would have asked 
to split the use of designated initializers to a separate patch.  Any of 
the two could, at least potentially, have saved you quite some time.

> Instead, the point I am trying to make is that carefully
> constructed tests can serve as tireless and accurate code reviewers.
> This won't ever replace actual code review, but my experience indicates
> that it will help find more bugs more quickly and more easily.

TBH this (conflict between virtual addresses on the host and the guest 
leading to corruption of the guest) is probably not the kind of 
adversarial test that one would have written or suggested right off the 
bat.  But it should be written now indeed.

Paolo

[1] 
https://www.theregister.com/2023/06/30/enterprise_distro_feature_devconf/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG] Guest OSes die simultaneously (bisected)
  2024-01-04 16:06           ` Paul E. McKenney
  2024-01-04 16:32             ` Paolo Bonzini
@ 2024-01-04 17:07             ` Andi Kleen
  1 sibling, 0 replies; 12+ messages in thread
From: Andi Kleen @ 2024-01-04 17:07 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Paolo Bonzini, Sean Christopherson, Like Xu, Kan Liang,
	Luwei Kang, Peter Zijlstra, linux-perf-users, linux-kernel, kvm,
	Breno Leitao, Arnaldo Carvalho de Melo, Ingo Molnar

> My (completely random) guess is that there is some rare combination
> of events that causes this code to fail.  If so, is it feasible to
> construct a test that makes this rare combination of events less rare,
> so that similar future bugs are caught more quickly?

Yes, I tested something similar before. What you need is create lots of 
PMIs with perf (running perf top should be enough) and a workload that creates
lots of exits in a guest (e.g. running fio on a virtio device). This 
will stress test this particular path.

-Andi

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG] Guest OSes die simultaneously (bisected)
  2024-01-04 16:32             ` Paolo Bonzini
@ 2024-01-04 17:25               ` Sean Christopherson
  2024-01-04 19:23               ` Paul E. McKenney
  1 sibling, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2024-01-04 17:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: paulmck, Like Xu, Andi Kleen, Kan Liang, Luwei Kang,
	Peter Zijlstra, linux-perf-users, linux-kernel, kvm, Breno Leitao,
	Arnaldo Carvalho de Melo, Ingo Molnar

On Thu, Jan 04, 2024, Paolo Bonzini wrote:
> On 1/4/24 17:06, Paul E. McKenney wrote:
> > Instead, the point I am trying to make is that carefully
> > constructed tests can serve as tireless and accurate code reviewers.
> > This won't ever replace actual code review, but my experience indicates
> > that it will help find more bugs more quickly and more easily.
> 
> TBH this (conflict between virtual addresses on the host and the guest
> leading to corruption of the guest) is probably not the kind of adversarial
> test that one would have written or suggested right off the bat.

I disagree.  The flaws with PEBS using a virtual address is blatantly obvious to
anyone that has spent any time dealing with the cross-section of PMU and VMX.
Intel even explicitly added "isolation" functionality to ensure PEBS can't overrun
VM-Enter and generate host records in the guest.  Not to mention that Intel
specifically addressed the virtual addressing issue in the design of Processor
Trace (PT, a.k.a. RTIT).

In other words, we *knew* exactly what would break *and* there had been breakage
in the past.  Chalk it up to messed up priorities, poor test infrastructure, or
anything along those lines.  But we shouldn't pretend that this was some obscure
edge case that didn't warrant a dedicated test from the get-go.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG] Guest OSes die simultaneously (bisected)
  2024-01-04 16:32             ` Paolo Bonzini
  2024-01-04 17:25               ` Sean Christopherson
@ 2024-01-04 19:23               ` Paul E. McKenney
  1 sibling, 0 replies; 12+ messages in thread
From: Paul E. McKenney @ 2024-01-04 19:23 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Like Xu, Andi Kleen, Kan Liang, Luwei Kang,
	Peter Zijlstra, linux-perf-users, linux-kernel, kvm, Breno Leitao,
	Arnaldo Carvalho de Melo, Ingo Molnar

On Thu, Jan 04, 2024 at 05:32:34PM +0100, Paolo Bonzini wrote:
> On 1/4/24 17:06, Paul E. McKenney wrote:
> > Although I am happy to have been able to locate the commit (and even
> > happier that Sean spotted the problem and that you quickly pushed the
> > fix to mainline!), chasing this consumed a lot of time and systems over
> > an embarrassingly large number of months.  As in I first spotted this
> > bug in late July.  Despite a number of increasingly complex attempts,
> > bisection became feasible only after the buggy commit was backported to
> > our internal v5.19 code base.  🙁
> 
> Yes, this strikes two sore points.
> 
> One is that I have also experienced being able to bisect only with a
> somewhat more linear history (namely the CentOS Stream 9 aka c9s
> frankenkernel [1]) and not with upstream.  Even if the c9s kernel is not a
> fully linear set of commits, there's some benefit from merge commits that
> consist of slightly more curated set of patches, where each merge commit
> includes both new features and bugfixes.  Unfortunately, whether you'll be
> able to do this with the c9s kernel depends a lot on the subsystems involved
> and on the bug.  Both are factors that may or may not be known in advance.

I guess I am glad that it is not just me.  ;-)

> The other, of course, is testing.  The KVM selftests infrastructure is meant
> for this kind of white box problem, but the space of tests that can be
> written is so large, that there's always too few tests.  It shines when you
> have a clear bisection but an unclear fix (in the past I have had cases
> where spending two days to write a test led me to writing a fix in thirty
> minutes), but boosting the reproducibility is always a good thing.

Agreed, validation never will be perfect, and so improving the test
suite based on production experience is a good thing, as is creating
test cases based on the behavior of important production workloads for
those who run them.

> > And please understand that I am not casting shade on those who wrote,
> > reviewed, and committed that buggy commit.  As in I freely confess that
> > I had to stare at Sean's fix for a few minutes before I figured out what
> > was going on.
> 
> Oh don't worry about that---rather, I am going to cast a shade on those that
> did not review the commit, namely me.  I am somewhat obsessed with Boolean
> logic and *probably* I would have caught it, or would have asked to split
> the use of designated initializers to a separate patch.  Any of the two
> could, at least potentially, have saved you quite some time.

We have all done similar things.  I certainly have!

> > Instead, the point I am trying to make is that carefully
> > constructed tests can serve as tireless and accurate code reviewers.
> > This won't ever replace actual code review, but my experience indicates
> > that it will help find more bugs more quickly and more easily.
> 
> TBH this (conflict between virtual addresses on the host and the guest
> leading to corruption of the guest) is probably not the kind of adversarial
> test that one would have written or suggested right off the bat.  But it
> should be written now indeed.

Very good, looking forward to seeing it!

							Thanx, Paul

> Paolo
> 
> [1]
> https://www.theregister.com/2023/06/30/enterprise_distro_feature_devconf/
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-01-04 19:23 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-01-03 22:22 [BUG] Guest OSes die simultaneously (bisected) Paul E. McKenney
2024-01-03 23:27 ` Paul E. McKenney
2024-01-04  0:24   ` Sean Christopherson
2024-01-04  1:00     ` Paul E. McKenney
2024-01-04 14:50       ` Paul E. McKenney
2024-01-04 14:59         ` Paolo Bonzini
2024-01-04 16:06           ` Paul E. McKenney
2024-01-04 16:32             ` Paolo Bonzini
2024-01-04 17:25               ` Sean Christopherson
2024-01-04 19:23               ` Paul E. McKenney
2024-01-04 17:07             ` Andi Kleen
2024-01-04 10:01 ` Breno Leitao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).