perf hw in kexeced kernel broken in tip

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* perf hw  in kexeced kernel broken in tip
@ 2010-12-01  8:00 Yinghai Lu
  2010-12-01 11:27 ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Yinghai Lu @ 2010-12-01  8:00 UTC (permalink / raw)
  To: Ingo Molnar, Jason Wessel, Peter Zijlstra, Peter Zijlstra,
	Don Zickus
  Cc: linux-kernel@vger.kernel.org

First kernel:
[    1.139418] calling  init_hw_perf_events+0x0/0xb77 @ 1
[    1.159111] Performance Events: PEBS fmt1+, Nehalem events, Intel PMU
driver.
[    1.159567] ... version:                3
[    1.179121] ... bit width:              48
[    1.179353] ... generic registers:      4
[    1.179593] ... value mask:             0000ffffffffffff
[    1.199211] ... max period:             000000007fffffff
[    1.199554] ... fixed-purpose events:   3
[    1.219108] ... event mask:             000000070000000f
[    1.219454] initcall init_hw_perf_events+0x0/0xb77 returned 0 after
11719 usecs

.....
[   20.220997] checking TSC synchronization [CPU#0 -> CPU#11]: passed.
[   20.260818] NMI watchdog enabled, takes one hw-pmu counter.

kexeced kernel.


[    1.169470] calling  init_hw_perf_events+0x0/0xb77 @ 1
[    1.189265] Performance Events: PEBS fmt1+, Nehalem events, Broken
PMU hardware detected, software events only.
...
[   21.010407] NMI watchdog failed to create perf event on cpu14:
fffffffffffffffe

caused by:

commit 33c6d6a7ad0ffab9b1b15f8e4107a2af072a05a0
Author: Don Zickus <dzickus@redhat.com>
Date:   Mon Nov 22 16:55:23 2010 -0500

    x86, perf, nmi: Disable perf if counters are not accessible
   
    In a kvm virt guests, the perf counters are not emulated.  Instead they
    return zero on a rdmsrl. The perf nmi handler uses the fact that
crossing
    a zero means the counter overflowed (for those counters that do not have
    specific interrupt bits). Therefore on kvm guests, perf will swallow all
    NMIs thinking the counters overflowed.
   
    This causes problems for subsystems like kgdb which needs NMIs to do its
    magic. This problem was discovered by running kgdb tests.
   
    The solution is to write garbage into a perf counter during the
    initialization and hopefully reading back the same number.  On kvm
    guests, the value will be read back as zero and we disable perf as
    a result.
   
    Reported-by: Jason Wessel <jason.wessel@windriver.com>
    Patch-inspired-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Don Zickus <dzickus@redhat.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Stephane Eranian <eranian@google.com>
    LKML-Reference: <1290462923-30734-1-git-send-email-dzickus@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/arch/x86/kernel/cpu/perf_event.c
b/arch/x86/kernel/cpu/perf_event.c
index ed63101..6d75b91 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -381,6 +381,20 @@ static void release_pmc_hardware(void) {}
 
 #endif
 
+static bool check_hw_exists(void)
+{
+       u64 val, val_new = 0;
+       int ret = 0;
+
+       val = 0xabcdUL;
+       ret |= checking_wrmsrl(x86_pmu.perfctr, val);
+       ret |= rdmsrl_safe(x86_pmu.perfctr, &val_new);
+       if (ret || val != val_new)
+               return false;
+
+       return true;
+}
+
 static void reserve_ds_buffers(void);
 static void release_ds_buffers(void);
 
@@ -1372,6 +1386,12 @@ void __init init_hw_perf_events(void)
 
        pmu_check_apic();
 
+       /* sanity check that the hardware exists or is emulated */
+       if (!check_hw_exists()) {
+               pr_cont("Broken PMU hardware detected, software events
only.\n");
+               return;
+       }
+
        pr_cont("%s PMU driver.\n", x86_pmu.name);
 
        if (x86_pmu.quirks)



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-01  8:00 perf hw in kexeced kernel broken in tip Yinghai Lu
@ 2010-12-01 11:27 ` Peter Zijlstra
  2010-12-01 16:06   ` Vivek Goyal
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2010-12-01 11:27 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Ingo Molnar, Jason Wessel, Don Zickus,
	linux-kernel@vger.kernel.org, Vivek Goyal, Haren Myneni

On Wed, 2010-12-01 at 00:00 -0800, Yinghai Lu wrote:
> First kernel:
> [    1.139418] calling  init_hw_perf_events+0x0/0xb77 @ 1
> [    1.159111] Performance Events: PEBS fmt1+, Nehalem events, Intel PMU
> driver.
> [    1.159567] ... version:                3
> [    1.179121] ... bit width:              48
> [    1.179353] ... generic registers:      4
> [    1.179593] ... value mask:             0000ffffffffffff
> [    1.199211] ... max period:             000000007fffffff
> [    1.199554] ... fixed-purpose events:   3
> [    1.219108] ... event mask:             000000070000000f
> [    1.219454] initcall init_hw_perf_events+0x0/0xb77 returned 0 after
> 11719 usecs
> 
> .....
> [   20.220997] checking TSC synchronization [CPU#0 -> CPU#11]: passed.
> [   20.260818] NMI watchdog enabled, takes one hw-pmu counter.
> 
> kexeced kernel.
> 
> 
> [    1.169470] calling  init_hw_perf_events+0x0/0xb77 @ 1
> [    1.189265] Performance Events: PEBS fmt1+, Nehalem events, Broken
> PMU hardware detected, software events only.
> ...
> [   21.010407] NMI watchdog failed to create perf event on cpu14:
> fffffffffffffffe
> 
> caused by:

*sigh*, and people ask me why kexec/kdump are such bad ideas..

apparently kexec doesn't properly shut down the first kernel and leaves
a counter running, then when we write and read the counter value they
don't match because its still running and voila, crap happens.

I've CC'ed the kexec people, maybe they got clue as to how to sort this.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-01 11:27 ` Peter Zijlstra
@ 2010-12-01 16:06   ` Vivek Goyal
  2010-12-01 16:11     ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Vivek Goyal @ 2010-12-01 16:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yinghai Lu, Ingo Molnar, Jason Wessel, Don Zickus,
	linux-kernel@vger.kernel.org, Haren Myneni

On Wed, Dec 01, 2010 at 12:27:47PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-01 at 00:00 -0800, Yinghai Lu wrote:
> > First kernel:
> > [    1.139418] calling  init_hw_perf_events+0x0/0xb77 @ 1
> > [    1.159111] Performance Events: PEBS fmt1+, Nehalem events, Intel PMU
> > driver.
> > [    1.159567] ... version:                3
> > [    1.179121] ... bit width:              48
> > [    1.179353] ... generic registers:      4
> > [    1.179593] ... value mask:             0000ffffffffffff
> > [    1.199211] ... max period:             000000007fffffff
> > [    1.199554] ... fixed-purpose events:   3
> > [    1.219108] ... event mask:             000000070000000f
> > [    1.219454] initcall init_hw_perf_events+0x0/0xb77 returned 0 after
> > 11719 usecs
> > 
> > .....
> > [   20.220997] checking TSC synchronization [CPU#0 -> CPU#11]: passed.
> > [   20.260818] NMI watchdog enabled, takes one hw-pmu counter.
> > 
> > kexeced kernel.
> > 
> > 
> > [    1.169470] calling  init_hw_perf_events+0x0/0xb77 @ 1
> > [    1.189265] Performance Events: PEBS fmt1+, Nehalem events, Broken
> > PMU hardware detected, software events only.
> > ...
> > [   21.010407] NMI watchdog failed to create perf event on cpu14:
> > fffffffffffffffe
> > 
> > caused by:
> 
> *sigh*, and people ask me why kexec/kdump are such bad ideas..
> 
> apparently kexec doesn't properly shut down the first kernel and leaves
> a counter running, then when we write and read the counter value they
> don't match because its still running and voila, crap happens.
> 
> I've CC'ed the kexec people, maybe they got clue as to how to sort this.

So we can shutdown counters while first kernel is going down. Is there a
simple function already which I can call?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-01 16:06   ` Vivek Goyal
@ 2010-12-01 16:11     ` Peter Zijlstra
  2010-12-01 16:23       ` Vivek Goyal
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2010-12-01 16:11 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Yinghai Lu, Ingo Molnar, Jason Wessel, Don Zickus,
	linux-kernel@vger.kernel.org, Haren Myneni

On Wed, 2010-12-01 at 11:06 -0500, Vivek Goyal wrote:
> On Wed, Dec 01, 2010 at 12:27:47PM +0100, Peter Zijlstra wrote:
> > On Wed, 2010-12-01 at 00:00 -0800, Yinghai Lu wrote:
> > > First kernel:
> > > [    1.139418] calling  init_hw_perf_events+0x0/0xb77 @ 1
> > > [    1.159111] Performance Events: PEBS fmt1+, Nehalem events, Intel PMU
> > > driver.
> > > [    1.159567] ... version:                3
> > > [    1.179121] ... bit width:              48
> > > [    1.179353] ... generic registers:      4
> > > [    1.179593] ... value mask:             0000ffffffffffff
> > > [    1.199211] ... max period:             000000007fffffff
> > > [    1.199554] ... fixed-purpose events:   3
> > > [    1.219108] ... event mask:             000000070000000f
> > > [    1.219454] initcall init_hw_perf_events+0x0/0xb77 returned 0 after
> > > 11719 usecs
> > > 
> > > .....
> > > [   20.220997] checking TSC synchronization [CPU#0 -> CPU#11]: passed.
> > > [   20.260818] NMI watchdog enabled, takes one hw-pmu counter.
> > > 
> > > kexeced kernel.
> > > 
> > > 
> > > [    1.169470] calling  init_hw_perf_events+0x0/0xb77 @ 1
> > > [    1.189265] Performance Events: PEBS fmt1+, Nehalem events, Broken
> > > PMU hardware detected, software events only.
> > > ...
> > > [   21.010407] NMI watchdog failed to create perf event on cpu14:
> > > fffffffffffffffe
> > > 
> > > caused by:
> > 
> > *sigh*, and people ask me why kexec/kdump are such bad ideas..
> > 
> > apparently kexec doesn't properly shut down the first kernel and leaves
> > a counter running, then when we write and read the counter value they
> > don't match because its still running and voila, crap happens.
> > 
> > I've CC'ed the kexec people, maybe they got clue as to how to sort this.
> 
> So we can shutdown counters while first kernel is going down. Is there a
> simple function already which I can call?

Dunno, the cpu hotplug stuff should suffice I think, but then I don't
think you actually unplug the boot cpu.

What does kexec normally do to ensure hardware is left in a sane state?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-01 16:11     ` Peter Zijlstra
@ 2010-12-01 16:23       ` Vivek Goyal
  2010-12-01 19:38         ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Vivek Goyal @ 2010-12-01 16:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yinghai Lu, Ingo Molnar, Jason Wessel, Don Zickus,
	linux-kernel@vger.kernel.org, Haren Myneni

On Wed, Dec 01, 2010 at 05:11:46PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-01 at 11:06 -0500, Vivek Goyal wrote:
> > On Wed, Dec 01, 2010 at 12:27:47PM +0100, Peter Zijlstra wrote:
> > > On Wed, 2010-12-01 at 00:00 -0800, Yinghai Lu wrote:
> > > > First kernel:
> > > > [    1.139418] calling  init_hw_perf_events+0x0/0xb77 @ 1
> > > > [    1.159111] Performance Events: PEBS fmt1+, Nehalem events, Intel PMU
> > > > driver.
> > > > [    1.159567] ... version:                3
> > > > [    1.179121] ... bit width:              48
> > > > [    1.179353] ... generic registers:      4
> > > > [    1.179593] ... value mask:             0000ffffffffffff
> > > > [    1.199211] ... max period:             000000007fffffff
> > > > [    1.199554] ... fixed-purpose events:   3
> > > > [    1.219108] ... event mask:             000000070000000f
> > > > [    1.219454] initcall init_hw_perf_events+0x0/0xb77 returned 0 after
> > > > 11719 usecs
> > > > 
> > > > .....
> > > > [   20.220997] checking TSC synchronization [CPU#0 -> CPU#11]: passed.
> > > > [   20.260818] NMI watchdog enabled, takes one hw-pmu counter.
> > > > 
> > > > kexeced kernel.
> > > > 
> > > > 
> > > > [    1.169470] calling  init_hw_perf_events+0x0/0xb77 @ 1
> > > > [    1.189265] Performance Events: PEBS fmt1+, Nehalem events, Broken
> > > > PMU hardware detected, software events only.
> > > > ...
> > > > [   21.010407] NMI watchdog failed to create perf event on cpu14:
> > > > fffffffffffffffe
> > > > 
> > > > caused by:
> > > 
> > > *sigh*, and people ask me why kexec/kdump are such bad ideas..
> > > 
> > > apparently kexec doesn't properly shut down the first kernel and leaves
> > > a counter running, then when we write and read the counter value they
> > > don't match because its still running and voila, crap happens.
> > > 
> > > I've CC'ed the kexec people, maybe they got clue as to how to sort this.
> > 
> > So we can shutdown counters while first kernel is going down. Is there a
> > simple function already which I can call?
> 
> Dunno, the cpu hotplug stuff should suffice I think, but then I don't
> think you actually unplug the boot cpu.
> 
> What does kexec normally do to ensure hardware is left in a sane state?

Typically calls device_shutdown() and sysdev_shutdown() from
kernel_restart_prepare() to shutdown the devices.

Also calls machine_shutdown() which depending on architecture can take
care of various things like stopping other cpus, shutting down LAPIC, 
disabling IOAPIC, disabling hpet, shutting down IOMMU etc
(native_machine_shutdown()).

Thanks
Vivek

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-01 16:23       ` Vivek Goyal
@ 2010-12-01 19:38         ` Peter Zijlstra
  2010-12-01 19:46           ` Vivek Goyal
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2010-12-01 19:38 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Yinghai Lu, Ingo Molnar, Jason Wessel, Don Zickus,
	linux-kernel@vger.kernel.org, Haren Myneni

On Wed, 2010-12-01 at 11:23 -0500, Vivek Goyal wrote:
> > What does kexec normally do to ensure hardware is left in a sane state?
> 
> Typically calls device_shutdown() and sysdev_shutdown() from
> kernel_restart_prepare() to shutdown the devices.
> 
> Also calls machine_shutdown() which depending on architecture can take
> care of various things like stopping other cpus, shutting down LAPIC, 
> disabling IOAPIC, disabling hpet, shutting down IOMMU etc
> (native_machine_shutdown()). 

So basically there's no sane generic reset callout?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-01 19:38         ` Peter Zijlstra
@ 2010-12-01 19:46           ` Vivek Goyal
  2010-12-01 19:49             ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Vivek Goyal @ 2010-12-01 19:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yinghai Lu, Ingo Molnar, Jason Wessel, Don Zickus,
	linux-kernel@vger.kernel.org, Haren Myneni, Eric W. Biederman

On Wed, Dec 01, 2010 at 08:38:12PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-01 at 11:23 -0500, Vivek Goyal wrote:
> > > What does kexec normally do to ensure hardware is left in a sane state?
> > 
> > Typically calls device_shutdown() and sysdev_shutdown() from
> > kernel_restart_prepare() to shutdown the devices.
> > 
> > Also calls machine_shutdown() which depending on architecture can take
> > care of various things like stopping other cpus, shutting down LAPIC, 
> > disabling IOAPIC, disabling hpet, shutting down IOMMU etc
> > (native_machine_shutdown()). 
> 
> So basically there's no sane generic reset callout?

I think ->shutdown() calls are sane generic callouts. Isn't it?

There seem to be few exceptions for LAPIC, IOMMU and HPET and I am not
sure why they are not covered by shutdown calls. CCing Eric, he might
have more insight into it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-01 19:46           ` Vivek Goyal
@ 2010-12-01 19:49             ` Peter Zijlstra
  2010-12-01 19:58               ` Vivek Goyal
  2010-12-01 20:41               ` Eric W. Biederman
  0 siblings, 2 replies; 42+ messages in thread
From: Peter Zijlstra @ 2010-12-01 19:49 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Yinghai Lu, Ingo Molnar, Jason Wessel, Don Zickus,
	linux-kernel@vger.kernel.org, Haren Myneni, Eric W. Biederman

On Wed, 2010-12-01 at 14:46 -0500, Vivek Goyal wrote:
> On Wed, Dec 01, 2010 at 08:38:12PM +0100, Peter Zijlstra wrote:
> > On Wed, 2010-12-01 at 11:23 -0500, Vivek Goyal wrote:
> > > > What does kexec normally do to ensure hardware is left in a sane state?
> > > 
> > > Typically calls device_shutdown() and sysdev_shutdown() from
> > > kernel_restart_prepare() to shutdown the devices.
> > > 
> > > Also calls machine_shutdown() which depending on architecture can take
> > > care of various things like stopping other cpus, shutting down LAPIC, 
> > > disabling IOAPIC, disabling hpet, shutting down IOMMU etc
> > > (native_machine_shutdown()). 
> > 
> > So basically there's no sane generic reset callout?
> 
> I think ->shutdown() calls are sane generic callouts. Isn't it?

->shutdown looks like it's about to reset/halt the hardware, no point in
slowing down the regular shutdown/reboot path for something like this,
we know the hardware will get reset to a sane state.

> There seem to be few exceptions for LAPIC, IOMMU and HPET and I am not
> sure why they are not covered by shutdown calls. CCing Eric, he might
> have more insight into it.

That's all arch specific, but even there I don't think the reset code
should live outside of kexec.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-01 19:49             ` Peter Zijlstra
@ 2010-12-01 19:58               ` Vivek Goyal
  2010-12-01 20:07                 ` Peter Zijlstra
  2010-12-01 20:41               ` Eric W. Biederman
  1 sibling, 1 reply; 42+ messages in thread
From: Vivek Goyal @ 2010-12-01 19:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yinghai Lu, Ingo Molnar, Jason Wessel, Don Zickus,
	linux-kernel@vger.kernel.org, Haren Myneni, Eric W. Biederman

On Wed, Dec 01, 2010 at 08:49:49PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-01 at 14:46 -0500, Vivek Goyal wrote:
> > On Wed, Dec 01, 2010 at 08:38:12PM +0100, Peter Zijlstra wrote:
> > > On Wed, 2010-12-01 at 11:23 -0500, Vivek Goyal wrote:
> > > > > What does kexec normally do to ensure hardware is left in a sane state?
> > > > 
> > > > Typically calls device_shutdown() and sysdev_shutdown() from
> > > > kernel_restart_prepare() to shutdown the devices.
> > > > 
> > > > Also calls machine_shutdown() which depending on architecture can take
> > > > care of various things like stopping other cpus, shutting down LAPIC, 
> > > > disabling IOAPIC, disabling hpet, shutting down IOMMU etc
> > > > (native_machine_shutdown()). 
> > > 
> > > So basically there's no sane generic reset callout?
> > 
> > I think ->shutdown() calls are sane generic callouts. Isn't it?
> 
> ->shutdown looks like it's about to reset/halt the hardware, no point in
> slowing down the regular shutdown/reboot path for something like this,
> we know the hardware will get reset to a sane state.

I think we already call ->shutdown() in regular reboot path.

kernel_restart()
  kernel_restart_prepare()
    device_shutdown();
    sysdev_shutdown();

So it should not make lot of difference if perf subsystem/counters are
also shutdown using ->shutdown().

> 
> > There seem to be few exceptions for LAPIC, IOMMU and HPET and I am not
> > sure why they are not covered by shutdown calls. CCing Eric, he might
> > have more insight into it.
> 
> That's all arch specific, but even there I don't think the reset code
> should live outside of kexec.

I would not know the history but I have heard stories that if you don't
shutdown the hardware over restart, BIOS might not be expecting it and
might get trumped.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-01 19:58               ` Vivek Goyal
@ 2010-12-01 20:07                 ` Peter Zijlstra
  2010-12-01 21:48                   ` Eric W. Biederman
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2010-12-01 20:07 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Yinghai Lu, Ingo Molnar, Jason Wessel, Don Zickus,
	linux-kernel@vger.kernel.org, Haren Myneni, Eric W. Biederman

On Wed, 2010-12-01 at 14:58 -0500, Vivek Goyal wrote:
> On Wed, Dec 01, 2010 at 08:49:49PM +0100, Peter Zijlstra wrote:
> > On Wed, 2010-12-01 at 14:46 -0500, Vivek Goyal wrote:
> > > On Wed, Dec 01, 2010 at 08:38:12PM +0100, Peter Zijlstra wrote:
> > > > On Wed, 2010-12-01 at 11:23 -0500, Vivek Goyal wrote:
> > > > > > What does kexec normally do to ensure hardware is left in a sane state?
> > > > > 
> > > > > Typically calls device_shutdown() and sysdev_shutdown() from
> > > > > kernel_restart_prepare() to shutdown the devices.
> > > > > 
> > > > > Also calls machine_shutdown() which depending on architecture can take
> > > > > care of various things like stopping other cpus, shutting down LAPIC, 
> > > > > disabling IOAPIC, disabling hpet, shutting down IOMMU etc
> > > > > (native_machine_shutdown()). 
> > > > 
> > > > So basically there's no sane generic reset callout?
> > > 
> > > I think ->shutdown() calls are sane generic callouts. Isn't it?
> > 
> > ->shutdown looks like it's about to reset/halt the hardware, no point in
> > slowing down the regular shutdown/reboot path for something like this,
> > we know the hardware will get reset to a sane state.
> 
> I think we already call ->shutdown() in regular reboot path.
> 
> kernel_restart()
>   kernel_restart_prepare()
>     device_shutdown();
>     sysdev_shutdown();
> 
> So it should not make lot of difference if perf subsystem/counters are
> also shutdown using ->shutdown().

Oh, but I'm not a device or sysdev thing, I'll never get something like
that.

> > 
> > > There seem to be few exceptions for LAPIC, IOMMU and HPET and I am not
> > > sure why they are not covered by shutdown calls. CCing Eric, he might
> > > have more insight into it.
> > 
> > That's all arch specific, but even there I don't think the reset code
> > should live outside of kexec.
> 
> I would not know the history but I have heard stories that if you don't
> shutdown the hardware over restart, BIOS might not be expecting it and
> might get trumped.

Never yet had a problem with that.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-01 20:07                 ` Peter Zijlstra
@ 2010-12-01 21:48                   ` Eric W. Biederman
  2010-12-02  5:23                     ` Don Zickus
  0 siblings, 1 reply; 42+ messages in thread
From: Eric W. Biederman @ 2010-12-01 21:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, Yinghai Lu, Ingo Molnar, Jason Wessel, Don Zickus,
	linux-kernel@vger.kernel.org, Haren Myneni

Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, 2010-12-01 at 14:58 -0500, Vivek Goyal wrote:
>> On Wed, Dec 01, 2010 at 08:49:49PM +0100, Peter Zijlstra wrote:
>> > On Wed, 2010-12-01 at 14:46 -0500, Vivek Goyal wrote:
>> > > On Wed, Dec 01, 2010 at 08:38:12PM +0100, Peter Zijlstra wrote:
>> > > > On Wed, 2010-12-01 at 11:23 -0500, Vivek Goyal wrote:
>> > > > > > What does kexec normally do to ensure hardware is left in a sane state?
>> > > > > 
>> > > > > Typically calls device_shutdown() and sysdev_shutdown() from
>> > > > > kernel_restart_prepare() to shutdown the devices.
>> > > > > 
>> > > > > Also calls machine_shutdown() which depending on architecture can take
>> > > > > care of various things like stopping other cpus, shutting down LAPIC, 
>> > > > > disabling IOAPIC, disabling hpet, shutting down IOMMU etc
>> > > > > (native_machine_shutdown()). 
>> > > > 
>> > > > So basically there's no sane generic reset callout?
>> > > 
>> > > I think ->shutdown() calls are sane generic callouts. Isn't it?
>> > 
>> > ->shutdown looks like it's about to reset/halt the hardware, no point in
>> > slowing down the regular shutdown/reboot path for something like this,
>> > we know the hardware will get reset to a sane state.
>> 
>> I think we already call ->shutdown() in regular reboot path.
>> 
>> kernel_restart()
>>   kernel_restart_prepare()
>>     device_shutdown();
>>     sysdev_shutdown();
>> 
>> So it should not make lot of difference if perf subsystem/counters are
>> also shutdown using ->shutdown().
>
> Oh, but I'm not a device or sysdev thing, I'll never get something like
> that.

There is also the reboot notifier, if the NMI needs to be controlled
outside of device model.  Sigh. The NMI handling is such a special case.

>> > > There seem to be few exceptions for LAPIC, IOMMU and HPET and I am not
>> > > sure why they are not covered by shutdown calls. CCing Eric, he might
>> > > have more insight into it.
>> > 
>> > That's all arch specific, but even there I don't think the reset code
>> > should live outside of kexec.
>> 
>> I would not know the history but I have heard stories that if you don't
>> shutdown the hardware over restart, BIOS might not be expecting it and
>> might get trumped.
>
> Never yet had a problem with that.

I haven't personally but I have certainly heard stories and seen
debugging sessions where some devices work or don't depending on the
order of running linux and windows on a machine, with soft reboots in
between.

Eric


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-01 21:48                   ` Eric W. Biederman
@ 2010-12-02  5:23                     ` Don Zickus
  2010-12-02  7:34                       ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Don Zickus @ 2010-12-02  5:23 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Peter Zijlstra, Vivek Goyal, Yinghai Lu, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On Wed, Dec 01, 2010 at 01:48:07PM -0800, Eric W. Biederman wrote:
> >
> > Oh, but I'm not a device or sysdev thing, I'll never get something like
> > that.
> 
> There is also the reboot notifier, if the NMI needs to be controlled
> outside of device model.  Sigh. The NMI handling is such a special case.

I tried reboot notifiers with the nmi_watchdog and acheived some success
(on a Westmere box, a P4 still failed).  Kdump is still screwed, but maybe
we don't care for now.

Here is the quick and dirty patch I used.

Cheers,
Don


diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 792a4ed..3455cf9 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -23,6 +23,7 @@
 #include <linux/notifier.h>
 #include <linux/module.h>
 #include <linux/sysctl.h>
+#include <linux/reboot.h>
 
 #include <asm/irq_regs.h>
 #include <linux/perf_event.h>
@@ -550,6 +551,18 @@ static struct notifier_block __cpuinitdata cpu_nfb = {
 	.notifier_call = cpu_callback
 };
 
+static int __cpuinit
+reboot_callback(struct notifier_block *nfb, unsigned long action, void *unused)
+{
+	watchdog_disable_all_cpus();
+
+	return notifier_from_errno(0);
+}
+
+static struct notifier_block __cpuinitdata reboot_nfb = {
+	.notifier_call = reboot_callback
+};
+
 void __init lockup_detector_init(void)
 {
 	void *cpu = (void *)(long)smp_processor_id();
@@ -563,6 +576,7 @@ void __init lockup_detector_init(void)
 
 	cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
 	register_cpu_notifier(&cpu_nfb);
+	register_reboot_notifier(&reboot_nfb);
 
 	return;
 }

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-02  5:23                     ` Don Zickus
@ 2010-12-02  7:34                       ` Peter Zijlstra
  2010-12-02 16:15                         ` Don Zickus
  2010-12-07 21:16                         ` Don Zickus
  0 siblings, 2 replies; 42+ messages in thread
From: Peter Zijlstra @ 2010-12-02  7:34 UTC (permalink / raw)
  To: Don Zickus
  Cc: Eric W. Biederman, Vivek Goyal, Yinghai Lu, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On Thu, 2010-12-02 at 00:23 -0500, Don Zickus wrote:
> On Wed, Dec 01, 2010 at 01:48:07PM -0800, Eric W. Biederman wrote:
> > >
> > > Oh, but I'm not a device or sysdev thing, I'll never get something like
> > > that.
> > 
> > There is also the reboot notifier, if the NMI needs to be controlled
> > outside of device model.  Sigh. The NMI handling is such a special case.
> 
> I tried reboot notifiers with the nmi_watchdog and acheived some success
> (on a Westmere box, a P4 still failed).  Kdump is still screwed, but maybe
> we don't care for now.
> 
> Here is the quick and dirty patch I used.

> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 792a4ed..3455cf9 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -23,6 +23,7 @@
>  #include <linux/notifier.h>
>  #include <linux/module.h>
>  #include <linux/sysctl.h>
> +#include <linux/reboot.h>
>  
>  #include <asm/irq_regs.h>
>  #include <linux/perf_event.h>
> @@ -550,6 +551,18 @@ static struct notifier_block __cpuinitdata cpu_nfb = {
>  	.notifier_call = cpu_callback
>  };
>  
> +static int __cpuinit
> +reboot_callback(struct notifier_block *nfb, unsigned long action, void *unused)
> +{
> +	watchdog_disable_all_cpus();
> +
> +	return notifier_from_errno(0);
> +}
> +
> +static struct notifier_block __cpuinitdata reboot_nfb = {
> +	.notifier_call = reboot_callback
> +};
> +
>  void __init lockup_detector_init(void)
>  {
>  	void *cpu = (void *)(long)smp_processor_id();
> @@ -563,6 +576,7 @@ void __init lockup_detector_init(void)
>  
>  	cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
>  	register_cpu_notifier(&cpu_nfb);
> +	register_reboot_notifier(&reboot_nfb);
>  
>  	return;
>  }

We'd really want a perf_event.c callback there to do as the hot-unplug
code does and detach all running counters from the cpu.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-02  7:34                       ` Peter Zijlstra
@ 2010-12-02 16:15                         ` Don Zickus
  2010-12-07 23:30                           ` Peter Zijlstra
  2010-12-07 21:16                         ` Don Zickus
  1 sibling, 1 reply; 42+ messages in thread
From: Don Zickus @ 2010-12-02 16:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eric W. Biederman, Vivek Goyal, Yinghai Lu, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On Thu, Dec 02, 2010 at 08:34:30AM +0100, Peter Zijlstra wrote:
> On Thu, 2010-12-02 at 00:23 -0500, Don Zickus wrote:
> > On Wed, Dec 01, 2010 at 01:48:07PM -0800, Eric W. Biederman wrote:
> > > >
> > > > Oh, but I'm not a device or sysdev thing, I'll never get something like
> > > > that.
> > > 
> > > There is also the reboot notifier, if the NMI needs to be controlled
> > > outside of device model.  Sigh. The NMI handling is such a special case.
> > 
> > I tried reboot notifiers with the nmi_watchdog and acheived some success
> > (on a Westmere box, a P4 still failed).  Kdump is still screwed, but maybe
> > we don't care for now.
> > 
> > Here is the quick and dirty patch I used.
> 
> 
> We'd really want a perf_event.c callback there to do as the hot-unplug
> code does and detach all running counters from the cpu.

Ok, I moved the reboot notifier stuff from kernel/watchdog.c to
kernel/perf_event.c.  Things still worked fine from a kexec perspective.

Vivek suggested to me this morning that I should just blantantly disable the
perf counter during init when running my test.  Looking through the code I
don't think I can do this using disable_all because some routines look for
the active bit to be set and some arches have different disable registers
than others.  Thoughts?

Cheers,
Don

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-02 16:15                         ` Don Zickus
@ 2010-12-07 23:30                           ` Peter Zijlstra
  2010-12-08 14:01                             ` Don Zickus
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2010-12-07 23:30 UTC (permalink / raw)
  To: Don Zickus
  Cc: Eric W. Biederman, Vivek Goyal, Yinghai Lu, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On Thu, 2010-12-02 at 11:15 -0500, Don Zickus wrote:

> Vivek suggested to me this morning that I should just blantantly disable the
> perf counter during init when running my test. 

Nah, we should actively scan for that during the bring-up and kill
hw-perf when we find an enable bit set, some BIOSes actively use the
PMU, this is something that should be discouraged.

---
 arch/x86/kernel/cpu/perf_event.c |   30 +++++++++++++++++++++++++++---
 1 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 817d2b1..7f92833 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -375,15 +375,40 @@ static void release_pmc_hardware(void) {}
 static bool check_hw_exists(void)
 {
 	u64 val, val_new = 0;
-	int ret = 0;
+	int i, reg, ret = 0;
 
 	val = 0xabcdUL;
 	ret |= checking_wrmsrl(x86_pmu.perfctr, val);
 	ret |= rdmsrl_safe(x86_pmu.perfctr, &val_new);
-	if (ret || val != val_new)
+	if (ret || val != val_new) {
+		printk(KERN_CONT "Broken PMU hardware detected, software events only.\n");
 		return false;
+	}
+
+	/*
+	 * Check to see if the BIOS enabled any of the counters, if so
+	 * complain and bail.
+	 */
+	for (i = 0; i < x86_pmu.num_counters; i++) {
+		reg = x86_pmu.eventsel + i;
+		rdmsrl(reg, val);
+		if (val & ARCH_PERFMON_EVENTSEL_ENABLE)
+			goto bios_fail;
+	}
+
+	for (i = 0; i < x86_pmu.num_counters_fixed; i++) {
+		reg = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
+		rdmsrl(reg, val);
+		if (val & (0x03 << i*4))
+			goto bios_fail;
+	}
 
 	return true;
+
+bios_fail:
+	printk(KERN_CONT "Broken BIOS detected, software events only.\n");
+	printk(KERN_ERR FW_BUG "invalid MSR %x=%Lx\n", reg, val);
+	return false;
 }
 
 static void reserve_ds_buffers(void);
@@ -1379,7 +1404,6 @@ int __init init_hw_perf_events(void)
 
 	/* sanity check that the hardware exists or is emulated */
 	if (!check_hw_exists()) {
-		pr_cont("Broken PMU hardware detected, software events only.\n");
 		return 0;
 	}
 



>  Looking through the code I
> don't think I can do this using disable_all because some routines look for
> the active bit to be set and some arches have different disable registers
> than others.  Thoughts?

Something like the below, preferably I'd key that off of SYS_KEXEC, but
looking through the existing notifiers adding a state requires touching
all of them :/

---
 kernel/perf_event.c |   21 +++++++++++++++++++++
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 195393c..7abcd8d 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -21,6 +21,7 @@
 #include <linux/dcache.h>
 #include <linux/percpu.h>
 #include <linux/ptrace.h>
+#include <linux/reboot.h>
 #include <linux/vmstat.h>
 #include <linux/vmalloc.h>
 #include <linux/hardirq.h>
@@ -6383,6 +6384,25 @@ static void perf_event_exit_cpu(int cpu)
 static inline void perf_event_exit_cpu(int cpu) { }
 #endif
 
+static int 
+perf_reboot(struct notifier_block *notifier, unsigned long val, void *v)
+{
+	/*
+	 * XXX this relies on hotplug, does kexec do too?
+	 */
+	perf_event_exit_cpu(0);
+	return NOTIFY_OK;
+}
+
+/*
+ * Run the perf reboot notifier at the very last possible moment so that
+ * the generic watchdog code runs as long as possible.
+ */
+static struct notifier_block perf_reboot_notifier = {
+	.notifier_call = perf_reboot,
+	.priority = INT_MIN,
+};
+
 static int __cpuinit
 perf_cpu_notify(struct notifier_block *self, unsigned long action, void *hcpu)
 {
@@ -6418,6 +6438,7 @@ void __init perf_event_init(void)
 	perf_pmu_register(&perf_task_clock);
 	perf_tp_register();
 	perf_cpu_notifier(perf_cpu_notify);
+	register_reboot_notifier(&perf_reboot_notifier);
 
 	ret = init_hw_breakpoint();
 	WARN(ret, "hw_breakpoint initialization failed with: %d", ret);


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-07 23:30                           ` Peter Zijlstra
@ 2010-12-08 14:01                             ` Don Zickus
  2010-12-08 14:20                               ` Peter Zijlstra
                                                 ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Don Zickus @ 2010-12-08 14:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eric W. Biederman, Vivek Goyal, Yinghai Lu, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On Wed, Dec 08, 2010 at 12:30:20AM +0100, Peter Zijlstra wrote:
> On Thu, 2010-12-02 at 11:15 -0500, Don Zickus wrote:
> 
> > Vivek suggested to me this morning that I should just blantantly disable the
> > perf counter during init when running my test. 
> 
> Nah, we should actively scan for that during the bring-up and kill
> hw-perf when we find an enable bit set, some BIOSes actively use the
> PMU, this is something that should be discouraged.

Ok, the reboot notifier addresses the kexec problem but doesn't fix it
though (I have to test to confirm that, comments below).  The bios check
should catch those situations (ironically I stumbled upon a machine with
this problem, so I will test your patch with it, though it only uses perf
counter 0).  The kdump problem will still exist, not sure if we care and
perhaps we should document in the changelog that we know kdump is still
broken (unless we do care).

> 
> ---
>  arch/x86/kernel/cpu/perf_event.c |   30 +++++++++++++++++++++++++++---
>  1 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 817d2b1..7f92833 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -375,15 +375,40 @@ static void release_pmc_hardware(void) {}
>  static bool check_hw_exists(void)
>  {
>  	u64 val, val_new = 0;
> -	int ret = 0;
> +	int i, reg, ret = 0;
>  
>  	val = 0xabcdUL;
>  	ret |= checking_wrmsrl(x86_pmu.perfctr, val);
>  	ret |= rdmsrl_safe(x86_pmu.perfctr, &val_new);
> -	if (ret || val != val_new)
> +	if (ret || val != val_new) {
> +		printk(KERN_CONT "Broken PMU hardware detected, software events only.\n");
>  		return false;
> +	}
> +
> +	/*
> +	 * Check to see if the BIOS enabled any of the counters, if so
> +	 * complain and bail.
> +	 */
> +	for (i = 0; i < x86_pmu.num_counters; i++) {
> +		reg = x86_pmu.eventsel + i;
> +		rdmsrl(reg, val);
> +		if (val & ARCH_PERFMON_EVENTSEL_ENABLE)
> +			goto bios_fail;
> +	}
> +
> +	for (i = 0; i < x86_pmu.num_counters_fixed; i++) {
> +		reg = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
> +		rdmsrl(reg, val);
> +		if (val & (0x03 << i*4))
> +			goto bios_fail;
> +	}

I wonder if you should reverse these checks.  If the bios has the perf
counter enabled, there might be a high chance that it fails the first
check and never gets to the actually bios checks.

>  
>  	return true;
> +
> +bios_fail:
> +	printk(KERN_CONT "Broken BIOS detected, software events only.\n");
> +	printk(KERN_ERR FW_BUG "invalid MSR %x=%Lx\n", reg, val);
> +	return false;
>  }
>  
>  static void reserve_ds_buffers(void);
> @@ -1379,7 +1404,6 @@ int __init init_hw_perf_events(void)
>  
>  	/* sanity check that the hardware exists or is emulated */
>  	if (!check_hw_exists()) {
> -		pr_cont("Broken PMU hardware detected, software events only.\n");
>  		return 0;
>  	}

nitpick - you can probably remove the curly braces, no?

>  
> 
> 
> 
> >  Looking through the code I
> > don't think I can do this using disable_all because some routines look for
> > the active bit to be set and some arches have different disable registers
> > than others.  Thoughts?
> 
> Something like the below, preferably I'd key that off of SYS_KEXEC, but
> looking through the existing notifiers adding a state requires touching
> all of them :/
> 
> ---
>  kernel/perf_event.c |   21 +++++++++++++++++++++
>  1 files changed, 21 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> index 195393c..7abcd8d 100644
> --- a/kernel/perf_event.c
> +++ b/kernel/perf_event.c
> @@ -21,6 +21,7 @@
>  #include <linux/dcache.h>
>  #include <linux/percpu.h>
>  #include <linux/ptrace.h>
> +#include <linux/reboot.h>
>  #include <linux/vmstat.h>
>  #include <linux/vmalloc.h>
>  #include <linux/hardirq.h>
> @@ -6383,6 +6384,25 @@ static void perf_event_exit_cpu(int cpu)
>  static inline void perf_event_exit_cpu(int cpu) { }
>  #endif
>  
> +static int 
> +perf_reboot(struct notifier_block *notifier, unsigned long val, void *v)
> +{
> +	/*
> +	 * XXX this relies on hotplug, does kexec do too?
> +	 */
> +	perf_event_exit_cpu(0);
> +	return NOTIFY_OK;

Ok, so this shuts down the perf counters on cpu0, but the other cpus are
still running and will fail your new bios check, no?

Privately, I used the above wrapped with for_each_online_cpu(cpu) and it
worked fine for me.

Cheers,
Don


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-08 14:01                             ` Don Zickus
@ 2010-12-08 14:20                               ` Peter Zijlstra
  2010-12-08 14:42                                 ` Vivek Goyal
  2010-12-08 14:59                                 ` Peter Zijlstra
  2010-12-08 14:33                               ` Peter Zijlstra
  2010-12-08 14:39                               ` Vivek Goyal
  2 siblings, 2 replies; 42+ messages in thread
From: Peter Zijlstra @ 2010-12-08 14:20 UTC (permalink / raw)
  To: Don Zickus
  Cc: Eric W. Biederman, Vivek Goyal, Yinghai Lu, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On Wed, 2010-12-08 at 09:01 -0500, Don Zickus wrote:
> On Wed, Dec 08, 2010 at 12:30:20AM +0100, Peter Zijlstra wrote:
> > On Thu, 2010-12-02 at 11:15 -0500, Don Zickus wrote:
> > 
> > > Vivek suggested to me this morning that I should just blantantly disable the
> > > perf counter during init when running my test. 
> > 
> > Nah, we should actively scan for that during the bring-up and kill
> > hw-perf when we find an enable bit set, some BIOSes actively use the
> > PMU, this is something that should be discouraged.
> 
> Ok, the reboot notifier addresses the kexec problem but doesn't fix it
> though (I have to test to confirm that, comments below).  


> The bios check
> should catch those situations (ironically I stumbled upon a machine with
> this problem, so I will test your patch with it, though it only uses perf
> counter 0). 

Right, they usually only steal one or two counters, but the fact that
they're using them at all is insane and should be punished.

>  The kdump problem will still exist, not sure if we care and
> perhaps we should document in the changelog that we know kdump is still
> broken (unless we do care).

You mean even if we cure the kexec reboot notifier patch thing kdump is
still borken?


> > ---
> >  arch/x86/kernel/cpu/perf_event.c |   30 +++++++++++++++++++++++++++---
> >  1 files changed, 27 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> > index 817d2b1..7f92833 100644
> > --- a/arch/x86/kernel/cpu/perf_event.c
> > +++ b/arch/x86/kernel/cpu/perf_event.c
> > @@ -375,15 +375,40 @@ static void release_pmc_hardware(void) {}
> >  static bool check_hw_exists(void)
> >  {
> >  	u64 val, val_new = 0;
> > -	int ret = 0;
> > +	int i, reg, ret = 0;
> >  
> >  	val = 0xabcdUL;
> >  	ret |= checking_wrmsrl(x86_pmu.perfctr, val);
> >  	ret |= rdmsrl_safe(x86_pmu.perfctr, &val_new);
> > -	if (ret || val != val_new)
> > +	if (ret || val != val_new) {
> > +		printk(KERN_CONT "Broken PMU hardware detected, software events only.\n");
> >  		return false;
> > +	}
> > +
> > +	/*
> > +	 * Check to see if the BIOS enabled any of the counters, if so
> > +	 * complain and bail.
> > +	 */
> > +	for (i = 0; i < x86_pmu.num_counters; i++) {
> > +		reg = x86_pmu.eventsel + i;
> > +		rdmsrl(reg, val);
> > +		if (val & ARCH_PERFMON_EVENTSEL_ENABLE)
> > +			goto bios_fail;
> > +	}
> > +
> > +	for (i = 0; i < x86_pmu.num_counters_fixed; i++) {
> > +		reg = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
> > +		rdmsrl(reg, val);
> > +		if (val & (0x03 << i*4))
> > +			goto bios_fail;
> > +	}
> 
> I wonder if you should reverse these checks.  If the bios has the perf
> counter enabled, there might be a high chance that it fails the first
> check and never gets to the actually bios checks.

Ah, good point.

> >  
> >  	return true;
> > +
> > +bios_fail:
> > +	printk(KERN_CONT "Broken BIOS detected, software events only.\n");
> > +	printk(KERN_ERR FW_BUG "invalid MSR %x=%Lx\n", reg, val);
> > +	return false;
> >  }
> >  
> >  static void reserve_ds_buffers(void);
> > @@ -1379,7 +1404,6 @@ int __init init_hw_perf_events(void)
> >  
> >  	/* sanity check that the hardware exists or is emulated */
> >  	if (!check_hw_exists()) {
> > -		pr_cont("Broken PMU hardware detected, software events only.\n");
> >  		return 0;
> >  	}
> 
> nitpick - you can probably remove the curly braces, no?

Quite so.


> > @@ -6383,6 +6384,25 @@ static void perf_event_exit_cpu(int cpu)
> >  static inline void perf_event_exit_cpu(int cpu) { }
> >  #endif
> >  
> > +static int 
> > +perf_reboot(struct notifier_block *notifier, unsigned long val, void *v)
> > +{
> > +	/*
> > +	 * XXX this relies on hotplug, does kexec do too?
> > +	 */
> > +	perf_event_exit_cpu(0);
> > +	return NOTIFY_OK;
> 
> Ok, so this shuts down the perf counters on cpu0, but the other cpus are
> still running and will fail your new bios check, no?
> 
> Privately, I used the above wrapped with for_each_online_cpu(cpu) and it
> worked fine for me.

Oh, so reboot doesn't actually stop the non-boot cpus? I was unsure of
that (see my XXX there), so yeah, if it doesn't then I guess the
for_each_possible_cpu() thing is the way out.




^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-08 14:20                               ` Peter Zijlstra
@ 2010-12-08 14:42                                 ` Vivek Goyal
  2010-12-08 14:48                                   ` Peter Zijlstra
  2010-12-08 14:59                                 ` Peter Zijlstra
  1 sibling, 1 reply; 42+ messages in thread
From: Vivek Goyal @ 2010-12-08 14:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Don Zickus, Eric W. Biederman, Yinghai Lu, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On Wed, Dec 08, 2010 at 03:20:05PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-08 at 09:01 -0500, Don Zickus wrote:
> > On Wed, Dec 08, 2010 at 12:30:20AM +0100, Peter Zijlstra wrote:
> > > On Thu, 2010-12-02 at 11:15 -0500, Don Zickus wrote:
> > > 
> > > > Vivek suggested to me this morning that I should just blantantly disable the
> > > > perf counter during init when running my test. 
> > > 
> > > Nah, we should actively scan for that during the bring-up and kill
> > > hw-perf when we find an enable bit set, some BIOSes actively use the
> > > PMU, this is something that should be discouraged.
> > 
> > Ok, the reboot notifier addresses the kexec problem but doesn't fix it
> > though (I have to test to confirm that, comments below).  
> 
> 
> > The bios check
> > should catch those situations (ironically I stumbled upon a machine with
> > this problem, so I will test your patch with it, though it only uses perf
> > counter 0). 
> 
> Right, they usually only steal one or two counters, but the fact that
> they're using them at all is insane and should be punished.
> 
> >  The kdump problem will still exist, not sure if we care and
> > perhaps we should document in the changelog that we know kdump is still
> > broken (unless we do care).
> 
> You mean even if we cure the kexec reboot notifier patch thing kdump is
> still borken?
> 

Yes. reboot notifier notifications are not sent in kdump path. In this
path we know kernel has crashed and we just try to do bare minimal things
to boot into second kernel. If some hardware is left in inconsistent
state we try to recover from that situation by resetting the device
when second kernel is booting.

Either driver itself can detect that device is in inconsistent state and
reset it otherwise we also pass a command line parameter "reset_devices" to
second kernel to explicitly tell kernel that devices might be in bad state,
reset these during initialization. If we want to use these perf counters in
kdump kernel, we shall have to do something similar.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-08 14:42                                 ` Vivek Goyal
@ 2010-12-08 14:48                                   ` Peter Zijlstra
  2010-12-08 15:02                                     ` Vivek Goyal
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2010-12-08 14:48 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Don Zickus, Eric W. Biederman, Yinghai Lu, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On Wed, 2010-12-08 at 09:42 -0500, Vivek Goyal wrote:
> 
> Either driver itself can detect that device is in inconsistent state and
> reset it otherwise we also pass a command line parameter "reset_devices" to
> second kernel to explicitly tell kernel that devices might be in bad state,
> reset these during initialization. If we want to use these perf counters in
> kdump kernel, we shall have to do something similar.

Right, so I'm perfectly fine with leaving the kdump kernel broken for
now and if people really do need hardware events we can try and reset
the hardware when we find that reset_devices command line parameter.

Not sure how that interacts with these broken BIOSes, but its kdump so
its mostly broken by design anyway ;-)



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-08 14:48                                   ` Peter Zijlstra
@ 2010-12-08 15:02                                     ` Vivek Goyal
  2010-12-08 15:15                                       ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Vivek Goyal @ 2010-12-08 15:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Don Zickus, Eric W. Biederman, Yinghai Lu, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On Wed, Dec 08, 2010 at 03:48:04PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-08 at 09:42 -0500, Vivek Goyal wrote:
> > 
> > Either driver itself can detect that device is in inconsistent state and
> > reset it otherwise we also pass a command line parameter "reset_devices" to
> > second kernel to explicitly tell kernel that devices might be in bad state,
> > reset these during initialization. If we want to use these perf counters in
> > kdump kernel, we shall have to do something similar.
> 
> Right, so I'm perfectly fine with leaving the kdump kernel broken for
> now and if people really do need hardware events we can try and reset
> the hardware when we find that reset_devices command line parameter.
> 
> Not sure how that interacts with these broken BIOSes,

reset_devices was meant to be dual purpose so that it can handle broken
BIOSes also. So if BIOS is broken then one can pass "reset_devices" to 
kernel and driver can attempt to reset the device and boot the kernel.

>but its kdump so
> its mostly broken by design anyway ;-)

Kdump has its share of problems especially with the fact that
kernel/drivers find devices in bad state and are not hardened enough
to deal with that. But on bare metal what's the better way of capturing
kernel crash dump? Trying to do anything post crash in the kernel is
also not very reliable either.

I think the way we fix kernel for boot problems on newer hardware, for broken
BIOses, we need to keep on fixing it in kdump path also to make sure new
devices/drivers can cope up with this scenario. 

Thanks
Vivek 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-08 15:02                                     ` Vivek Goyal
@ 2010-12-08 15:15                                       ` Peter Zijlstra
  2010-12-08 15:22                                         ` Vivek Goyal
  2010-12-08 21:16                                         ` Eric W. Biederman
  0 siblings, 2 replies; 42+ messages in thread
From: Peter Zijlstra @ 2010-12-08 15:15 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Don Zickus, Eric W. Biederman, Yinghai Lu, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On Wed, 2010-12-08 at 10:02 -0500, Vivek Goyal wrote:

> >but its kdump so its mostly broken by design anyway ;-)
> 
> Kdump has its share of problems especially with the fact that
> kernel/drivers find devices in bad state and are not hardened enough
> to deal with that. But on bare metal what's the better way of capturing
> kernel crash dump? Trying to do anything post crash in the kernel is
> also not very reliable either.

/me <3 RS-232

I haven't found anything better than that... 

And poking at the RS-232 requires less of the kernel to be functional
than booting into a new kernel (whose image might have been corrupted by
the dying kernel, etc..)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-08 15:15                                       ` Peter Zijlstra
@ 2010-12-08 15:22                                         ` Vivek Goyal
  2010-12-08 21:16                                         ` Eric W. Biederman
  1 sibling, 0 replies; 42+ messages in thread
From: Vivek Goyal @ 2010-12-08 15:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Don Zickus, Eric W. Biederman, Yinghai Lu, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On Wed, Dec 08, 2010 at 04:15:07PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-08 at 10:02 -0500, Vivek Goyal wrote:
> 
> > >but its kdump so its mostly broken by design anyway ;-)
> > 
> > Kdump has its share of problems especially with the fact that
> > kernel/drivers find devices in bad state and are not hardened enough
> > to deal with that. But on bare metal what's the better way of capturing
> > kernel crash dump? Trying to do anything post crash in the kernel is
> > also not very reliable either.
> 
> /me <3 RS-232
> 
> I haven't found anything better than that... 

Serial is good for getting the oops out. But for the big vmcore? Secondly,
people want the flexibility of sending the vmcore over various targets
like over network to some remote server. Booting into second kernel opens
up all those options and now one can do intelligent filtering and send
the vmcore to any kind of destination.

> 
> And poking at the RS-232 requires less of the kernel to be functional
> than booting into a new kernel (whose image might have been corrupted by
> the dying kernel, etc..)

New kernel image being corrupted problem can be solved up to great extent
by write protecting that memory location.

So those who are happy with RS-232, they don't have to configure kdump.
Just connect serial console and get the oops message out.

Thanks
Vivek 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-08 15:15                                       ` Peter Zijlstra
  2010-12-08 15:22                                         ` Vivek Goyal
@ 2010-12-08 21:16                                         ` Eric W. Biederman
  1 sibling, 0 replies; 42+ messages in thread
From: Eric W. Biederman @ 2010-12-08 21:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, Don Zickus, Yinghai Lu, Ingo Molnar, Jason Wessel,
	linux-kernel@vger.kernel.org, Haren Myneni

Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, 2010-12-08 at 10:02 -0500, Vivek Goyal wrote:
>
>> >but its kdump so its mostly broken by design anyway ;-)
>> 
>> Kdump has its share of problems especially with the fact that
>> kernel/drivers find devices in bad state and are not hardened enough
>> to deal with that. But on bare metal what's the better way of capturing
>> kernel crash dump? Trying to do anything post crash in the kernel is
>> also not very reliable either.
>
> /me <3 RS-232
>
> I haven't found anything better than that... 

True.  But it can be a pain to operate RS-232 at production scale, or to
convince customers to hook up RS-232 just in case your released software
happens to crash.

> And poking at the RS-232 requires less of the kernel to be functional
> than booting into a new kernel (whose image might have been corrupted by
> the dying kernel, etc..)

For debugging a reproducible failure RS-232 wins.  For everything else
there is kdump.  It sucks but it is at least fixable.

And really the kdump kernel should be running a minimalistic hardware
config so you only have to get the chunks of hardware you really care
about working.

As for corruption the kdump kernel lives in an area of memory that we
never DMA to in the primary kernel, and we check a sha256 hash before we
start booting the kdump kernel.  In general kdump fails safe. That is if
it can't makes things work it fails to boot and does nothing to your
system.  Definitely not perfect but if you don't have RS-232 it is the
best I have seen.

Eric

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-08 14:20                               ` Peter Zijlstra
  2010-12-08 14:42                                 ` Vivek Goyal
@ 2010-12-08 14:59                                 ` Peter Zijlstra
  2010-12-08 18:43                                   ` Yinghai Lu
                                                     ` (2 more replies)
  1 sibling, 3 replies; 42+ messages in thread
From: Peter Zijlstra @ 2010-12-08 14:59 UTC (permalink / raw)
  To: Don Zickus
  Cc: Eric W. Biederman, Vivek Goyal, Yinghai Lu, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On Wed, 2010-12-08 at 15:20 +0100, Peter Zijlstra wrote:

> > I wonder if you should reverse these checks.  If the bios has the perf
> > counter enabled, there might be a high chance that it fails the first
> > check and never gets to the actually bios checks.
> 
> Ah, good point. 

Something like so..

---
Subject: perf, x86: Detect broken BIOSes
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Wed Dec 08 15:56:23 CET 2010

Some BIOSes use PMU resources, this is a bug.

Try to detect this, warn about it, and further refuse to touch the
PMU ourselves.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -375,15 +375,51 @@ static void release_pmc_hardware(void) {
 static bool check_hw_exists(void)
 {
 	u64 val, val_new = 0;
-	int ret = 0;
+	int i, reg, ret = 0;
 
+	/*
+	 * Check to see if the BIOS enabled any of the counters, if so
+	 * complain and bail.
+	 */
+	for (i = 0; i < x86_pmu.num_counters; i++) {
+		reg = x86_pmu.eventsel + i;
+		ret = rdmsrl_safe(reg, &val);
+		if (ret)
+			goto msr_fail;
+		if (val & ARCH_PERFMON_EVENTSEL_ENABLE)
+			goto bios_fail;
+	}
+
+	for (i = 0; i < x86_pmu.num_counters_fixed; i++) {
+		reg = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
+		ret = rdmsrl_safe(reg, &val);
+		if (ret)
+			goto msr_fail;
+		if (val & (0x03 << i*4))
+			goto bios_fail;
+	}
+
+	/*
+	 * Now write a value and read it back to see if it matches,
+	 * this is needed to detect certain hardware emulators (qemu/kvm)
+	 * that don't trap on the MSR access and always return 0s.
+	 */
 	val = 0xabcdUL;
-	ret |= checking_wrmsrl(x86_pmu.perfctr, val);
+	ret = checking_wrmsrl(x86_pmu.perfctr, val);
 	ret |= rdmsrl_safe(x86_pmu.perfctr, &val_new);
 	if (ret || val != val_new)
-		return false;
+		goto msr_fail;
 
 	return true;
+
+bios_fail:
+	printk(KERN_CONT "Broken BIOS detected, software events only.\n");
+	printk(KERN_ERR FW_BUG "invalid MSR: %x=%Lx\n", reg, val);
+	return false;
+
+msr_fail:
+	printk(KERN_CONT "Broken PMU hardware detected, software events only.\n");
+	return false;
 }
 
 static void reserve_ds_buffers(void);
@@ -1378,10 +1414,8 @@ int __init init_hw_perf_events(void)
 	pmu_check_apic();
 
 	/* sanity check that the hardware exists or is emulated */
-	if (!check_hw_exists()) {
-		pr_cont("Broken PMU hardware detected, software events only.\n");
+	if (!check_hw_exists())
 		return 0;
-	}
 
 	pr_cont("%s PMU driver.\n", x86_pmu.name);
 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-08 14:59                                 ` Peter Zijlstra
@ 2010-12-08 18:43                                   ` Yinghai Lu
  2010-12-08 19:01                                     ` Don Zickus
  2010-12-08 19:06                                     ` Peter Zijlstra
  2010-12-08 22:37                                   ` Don Zickus
  2010-12-09 20:20                                   ` Don Zickus
  2 siblings, 2 replies; 42+ messages in thread
From: Yinghai Lu @ 2010-12-08 18:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Don Zickus, Eric W. Biederman, Vivek Goyal, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On 12/08/2010 06:59 AM, Peter Zijlstra wrote:
> On Wed, 2010-12-08 at 15:20 +0100, Peter Zijlstra wrote:
> 
>>> I wonder if you should reverse these checks.  If the bios has the perf
>>> counter enabled, there might be a high chance that it fails the first
>>> check and never gets to the actually bios checks.
>>
>> Ah, good point. 
> 
> Something like so..
> 
> ---
> Subject: perf, x86: Detect broken BIOSes
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Wed Dec 08 15:56:23 CET 2010
> 
> Some BIOSes use PMU resources, this is a bug.
> 
> Try to detect this, warn about it, and further refuse to touch the
> PMU ourselves.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
> Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
> +++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
> @@ -375,15 +375,51 @@ static void release_pmc_hardware(void) {
>  static bool check_hw_exists(void)
>  {
>  	u64 val, val_new = 0;
> -	int ret = 0;
> +	int i, reg, ret = 0;
>  
> +	/*
> +	 * Check to see if the BIOS enabled any of the counters, if so
> +	 * complain and bail.
> +	 */
> +	for (i = 0; i < x86_pmu.num_counters; i++) {
> +		reg = x86_pmu.eventsel + i;
> +		ret = rdmsrl_safe(reg, &val);
> +		if (ret)
> +			goto msr_fail;
> +		if (val & ARCH_PERFMON_EVENTSEL_ENABLE)
> +			goto bios_fail;
> +	}
> +
> +	for (i = 0; i < x86_pmu.num_counters_fixed; i++) {
> +		reg = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
> +		ret = rdmsrl_safe(reg, &val);
> +		if (ret)
> +			goto msr_fail;
> +		if (val & (0x03 << i*4))
> +			goto bios_fail;
> +	}
> +
> +	/*
> +	 * Now write a value and read it back to see if it matches,
> +	 * this is needed to detect certain hardware emulators (qemu/kvm)
> +	 * that don't trap on the MSR access and always return 0s.
> +	 */
>  	val = 0xabcdUL;
> -	ret |= checking_wrmsrl(x86_pmu.perfctr, val);
> +	ret = checking_wrmsrl(x86_pmu.perfctr, val);
>  	ret |= rdmsrl_safe(x86_pmu.perfctr, &val_new);
>  	if (ret || val != val_new)
> -		return false;
> +		goto msr_fail;
>  
>  	return true;
> +
> +bios_fail:
> +	printk(KERN_CONT "Broken BIOS detected, software events only.\n");
> +	printk(KERN_ERR FW_BUG "invalid MSR: %x=%Lx\n", reg, val);
> +	return false;
> +
> +msr_fail:
> +	printk(KERN_CONT "Broken PMU hardware detected, software events only.\n");
> +	return false;
>  }

can you add sth force_... in command line to take over ownership of perf from BIOS or previous kernel ?

then still can use perf etc after we kexec from RHEL or SLES kernel to later kernel ( from 2.6.37)

Thanks

Yinghai

>  
>  static void reserve_ds_buffers(void);
> @@ -1378,10 +1414,8 @@ int __init init_hw_perf_events(void)
>  	pmu_check_apic();
>  
>  	/* sanity check that the hardware exists or is emulated */
> -	if (!check_hw_exists()) {
> -		pr_cont("Broken PMU hardware detected, software events only.\n");
> +	if (!check_hw_exists())
>  		return 0;
> -	}
>  
>  	pr_cont("%s PMU driver.\n", x86_pmu.name);
>  
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-08 18:43                                   ` Yinghai Lu
@ 2010-12-08 19:01                                     ` Don Zickus
  2010-12-08 19:05                                       ` Yinghai Lu
  2010-12-08 19:06                                     ` Peter Zijlstra
  1 sibling, 1 reply; 42+ messages in thread
From: Don Zickus @ 2010-12-08 19:01 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Peter Zijlstra, Eric W. Biederman, Vivek Goyal, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On Wed, Dec 08, 2010 at 10:43:14AM -0800, Yinghai Lu wrote:
> 
> can you add sth force_... in command line to take over ownership of perf from BIOS or previous kernel ?
> 
> then still can use perf etc after we kexec from RHEL or SLES kernel to later kernel ( from 2.6.37)

My understand is that you can't because the BIOS is actively using it
behind the scenes of the kernel (well during an SMI).  I have a machine
where I tried to force take it but it still stopped triggering interrupts.

Cheers,
Don

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-08 19:01                                     ` Don Zickus
@ 2010-12-08 19:05                                       ` Yinghai Lu
  2010-12-08 19:17                                         ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Yinghai Lu @ 2010-12-08 19:05 UTC (permalink / raw)
  To: Don Zickus
  Cc: Peter Zijlstra, Eric W. Biederman, Vivek Goyal, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On 12/08/2010 11:01 AM, Don Zickus wrote:
> On Wed, Dec 08, 2010 at 10:43:14AM -0800, Yinghai Lu wrote:
>>
>> can you add sth force_... in command line to take over ownership of perf from BIOS or previous kernel ?
>>
>> then still can use perf etc after we kexec from RHEL or SLES kernel to later kernel ( from 2.6.37)
> 
> My understand is that you can't because the BIOS is actively using it
> behind the scenes of the kernel (well during an SMI).  I have a machine
> where I tried to force take it but it still stopped triggering interrupts.

how about second case: kexec from RHEL 6 stock kernel to upstream kernel ?

Thanks

	Yinghai

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-08 19:05                                       ` Yinghai Lu
@ 2010-12-08 19:17                                         ` Peter Zijlstra
  2010-12-08 19:20                                           ` Yinghai Lu
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2010-12-08 19:17 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Don Zickus, Eric W. Biederman, Vivek Goyal, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On Wed, 2010-12-08 at 11:05 -0800, Yinghai Lu wrote:
> On 12/08/2010 11:01 AM, Don Zickus wrote:
> > On Wed, Dec 08, 2010 at 10:43:14AM -0800, Yinghai Lu wrote:
> >>
> >> can you add sth force_... in command line to take over ownership of perf from BIOS or previous kernel ?
> >>
> >> then still can use perf etc after we kexec from RHEL or SLES kernel to later kernel ( from 2.6.37)
> > 
> > My understand is that you can't because the BIOS is actively using it
> > behind the scenes of the kernel (well during an SMI).  I have a machine
> > where I tried to force take it but it still stopped triggering interrupts.
> 
> how about second case: kexec from RHEL 6 stock kernel to upstream kernel ?

Its impossible to distinguish between a BIOS having claimed a counter
and a previous kernel not having shut things down properly.

The best we can do is allow a force parameter and let the user keep all
pieces when he uses it.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-08 19:17                                         ` Peter Zijlstra
@ 2010-12-08 19:20                                           ` Yinghai Lu
  0 siblings, 0 replies; 42+ messages in thread
From: Yinghai Lu @ 2010-12-08 19:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Don Zickus, Eric W. Biederman, Vivek Goyal, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On 12/08/2010 11:17 AM, Peter Zijlstra wrote:
> On Wed, 2010-12-08 at 11:05 -0800, Yinghai Lu wrote:
>> On 12/08/2010 11:01 AM, Don Zickus wrote:
>>> On Wed, Dec 08, 2010 at 10:43:14AM -0800, Yinghai Lu wrote:
>>>>
>>>> can you add sth force_... in command line to take over ownership of perf from BIOS or previous kernel ?
>>>>
>>>> then still can use perf etc after we kexec from RHEL or SLES kernel to later kernel ( from 2.6.37)
>>>
>>> My understand is that you can't because the BIOS is actively using it
>>> behind the scenes of the kernel (well during an SMI).  I have a machine
>>> where I tried to force take it but it still stopped triggering interrupts.
>>
>> how about second case: kexec from RHEL 6 stock kernel to upstream kernel ?
> 
> Its impossible to distinguish between a BIOS having claimed a counter
> and a previous kernel not having shut things down properly.
> 
> The best we can do is allow a force parameter and let the user keep all
> pieces when he uses it.

ok, thanks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-08 18:43                                   ` Yinghai Lu
  2010-12-08 19:01                                     ` Don Zickus
@ 2010-12-08 19:06                                     ` Peter Zijlstra
  2010-12-08 19:20                                       ` Yinghai Lu
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2010-12-08 19:06 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Don Zickus, Eric W. Biederman, Vivek Goyal, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On Wed, 2010-12-08 at 10:43 -0800, Yinghai Lu wrote:
> can you add sth force_... in command line to take over ownership of
> perf from BIOS or previous kernel ?

The problem is, you cannot steal the thing from the BIOS, you'll trample
on its settings and the next time it runs it will simply re-instate it.

And aside from probing the EN bit on boot there is no way of determining
this.

So forcing the state might get you an ill-functioning system.

> then still can use perf etc after we kexec from RHEL or SLES kernel to
> later kernel ( from 2.6.37)

I'm not sure why people would do that, but yeah I guess we can do
something like that.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-08 19:06                                     ` Peter Zijlstra
@ 2010-12-08 19:20                                       ` Yinghai Lu
  0 siblings, 0 replies; 42+ messages in thread
From: Yinghai Lu @ 2010-12-08 19:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Don Zickus, Eric W. Biederman, Vivek Goyal, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On 12/08/2010 11:06 AM, Peter Zijlstra wrote:
>> then still can use perf etc after we kexec from RHEL or SLES kernel to
>> later kernel ( from 2.6.37)
> 
> I'm not sure why people would do that, but yeah I guess we can do
> something like that.

more problem:
system with linuxbios and have kernel in flash as bootloader. they may kexec to final production kernel. 

and they may need to update that embedded kernel to shutdown perf....

Yinghai

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-08 14:59                                 ` Peter Zijlstra
  2010-12-08 18:43                                   ` Yinghai Lu
@ 2010-12-08 22:37                                   ` Don Zickus
  2010-12-08 23:20                                     ` Eric W. Biederman
  2010-12-09 20:20                                   ` Don Zickus
  2 siblings, 1 reply; 42+ messages in thread
From: Don Zickus @ 2010-12-08 22:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eric W. Biederman, Vivek Goyal, Yinghai Lu, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On Wed, Dec 08, 2010 at 03:59:16PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-08 at 15:20 +0100, Peter Zijlstra wrote:
> 
> > > I wonder if you should reverse these checks.  If the bios has the perf
> > > counter enabled, there might be a high chance that it fails the first
> > > check and never gets to the actually bios checks.
> > 
> > Ah, good point. 
> 
> Something like so..

This seems to work correctly on my Nehalem and broken bios machines during
boot and kexec.  As expected it fails during kdump.  My p4 box failed
during kexec for some reason.  But p4 has other issues.

Cheers,
Don

> 
> ---
> Subject: perf, x86: Detect broken BIOSes
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Wed Dec 08 15:56:23 CET 2010
> 
> Some BIOSes use PMU resources, this is a bug.
> 
> Try to detect this, warn about it, and further refuse to touch the
> PMU ourselves.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
> Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
> +++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
> @@ -375,15 +375,51 @@ static void release_pmc_hardware(void) {
>  static bool check_hw_exists(void)
>  {
>  	u64 val, val_new = 0;
> -	int ret = 0;
> +	int i, reg, ret = 0;
>  
> +	/*
> +	 * Check to see if the BIOS enabled any of the counters, if so
> +	 * complain and bail.
> +	 */
> +	for (i = 0; i < x86_pmu.num_counters; i++) {
> +		reg = x86_pmu.eventsel + i;
> +		ret = rdmsrl_safe(reg, &val);
> +		if (ret)
> +			goto msr_fail;
> +		if (val & ARCH_PERFMON_EVENTSEL_ENABLE)
> +			goto bios_fail;
> +	}
> +
> +	for (i = 0; i < x86_pmu.num_counters_fixed; i++) {
> +		reg = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
> +		ret = rdmsrl_safe(reg, &val);
> +		if (ret)
> +			goto msr_fail;
> +		if (val & (0x03 << i*4))
> +			goto bios_fail;
> +	}
> +
> +	/*
> +	 * Now write a value and read it back to see if it matches,
> +	 * this is needed to detect certain hardware emulators (qemu/kvm)
> +	 * that don't trap on the MSR access and always return 0s.
> +	 */
>  	val = 0xabcdUL;
> -	ret |= checking_wrmsrl(x86_pmu.perfctr, val);
> +	ret = checking_wrmsrl(x86_pmu.perfctr, val);
>  	ret |= rdmsrl_safe(x86_pmu.perfctr, &val_new);
>  	if (ret || val != val_new)
> -		return false;
> +		goto msr_fail;
>  
>  	return true;
> +
> +bios_fail:
> +	printk(KERN_CONT "Broken BIOS detected, software events only.\n");
> +	printk(KERN_ERR FW_BUG "invalid MSR: %x=%Lx\n", reg, val);
> +	return false;
> +
> +msr_fail:
> +	printk(KERN_CONT "Broken PMU hardware detected, software events only.\n");
> +	return false;
>  }
>  
>  static void reserve_ds_buffers(void);
> @@ -1378,10 +1414,8 @@ int __init init_hw_perf_events(void)
>  	pmu_check_apic();
>  
>  	/* sanity check that the hardware exists or is emulated */
> -	if (!check_hw_exists()) {
> -		pr_cont("Broken PMU hardware detected, software events only.\n");
> +	if (!check_hw_exists())
>  		return 0;
> -	}
>  
>  	pr_cont("%s PMU driver.\n", x86_pmu.name);
>  
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-08 22:37                                   ` Don Zickus
@ 2010-12-08 23:20                                     ` Eric W. Biederman
  2010-12-09  4:34                                       ` Don Zickus
  0 siblings, 1 reply; 42+ messages in thread
From: Eric W. Biederman @ 2010-12-08 23:20 UTC (permalink / raw)
  To: Don Zickus
  Cc: Peter Zijlstra, Vivek Goyal, Yinghai Lu, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

Don Zickus <dzickus@redhat.com> writes:

> On Wed, Dec 08, 2010 at 03:59:16PM +0100, Peter Zijlstra wrote:
>> On Wed, 2010-12-08 at 15:20 +0100, Peter Zijlstra wrote:
>> 
>> > > I wonder if you should reverse these checks.  If the bios has the perf
>> > > counter enabled, there might be a high chance that it fails the first
>> > > check and never gets to the actually bios checks.
>> > 
>> > Ah, good point. 
>> 
>> Something like so..
>
> This seems to work correctly on my Nehalem and broken bios machines during
> boot and kexec.  As expected it fails during kdump.  My p4 box failed
> during kexec for some reason.  But p4 has other issues.

Does the kdump kernel still boot?

It looks like it should I just want to double check.

Eric

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-08 23:20                                     ` Eric W. Biederman
@ 2010-12-09  4:34                                       ` Don Zickus
  0 siblings, 0 replies; 42+ messages in thread
From: Don Zickus @ 2010-12-09  4:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Peter Zijlstra, Vivek Goyal, Yinghai Lu, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On Wed, Dec 08, 2010 at 03:20:37PM -0800, Eric W. Biederman wrote:
> Don Zickus <dzickus@redhat.com> writes:
> 
> > On Wed, Dec 08, 2010 at 03:59:16PM +0100, Peter Zijlstra wrote:
> >> On Wed, 2010-12-08 at 15:20 +0100, Peter Zijlstra wrote:
> >> 
> >> > > I wonder if you should reverse these checks.  If the bios has the perf
> >> > > counter enabled, there might be a high chance that it fails the first
> >> > > check and never gets to the actually bios checks.
> >> > 
> >> > Ah, good point. 
> >> 
> >> Something like so..
> >
> > This seems to work correctly on my Nehalem and broken bios machines during
> > boot and kexec.  As expected it fails during kdump.  My p4 box failed
> > during kexec for some reason.  But p4 has other issues.
> 
> Does the kdump kernel still boot?
> 
> It looks like it should I just want to double check.

Yeah, sorry for not being clear.  It definitely boots and does it thing.
perf init (and thus nmi watchdog) fail with 'BIOS broken' because the perf
counters were not shutdown prior to executing kdump.

Cheers,
Don

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-08 14:59                                 ` Peter Zijlstra
  2010-12-08 18:43                                   ` Yinghai Lu
  2010-12-08 22:37                                   ` Don Zickus
@ 2010-12-09 20:20                                   ` Don Zickus
  2010-12-09 20:44                                     ` Cyrill Gorcunov
  2 siblings, 1 reply; 42+ messages in thread
From: Don Zickus @ 2010-12-09 20:20 UTC (permalink / raw)
  To: Peter Zijlstra, gorcunov
  Cc: Eric W. Biederman, Vivek Goyal, Yinghai Lu, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On Wed, Dec 08, 2010 at 03:59:16PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-08 at 15:20 +0100, Peter Zijlstra wrote:
> 
> > > I wonder if you should reverse these checks.  If the bios has the perf
> > > counter enabled, there might be a high chance that it fails the first
> > > check and never gets to the actually bios checks.
> > 
> > Ah, good point. 
> 
> Something like so..

Getting closer...

Pentium4s are special they need the double write, so...

> +	/*
> +	 * Now write a value and read it back to see if it matches,
> +	 * this is needed to detect certain hardware emulators (qemu/kvm)
> +	 * that don't trap on the MSR access and always return 0s.
> +	 */
>  	val = 0xabcdUL;
> -	ret |= checking_wrmsrl(x86_pmu.perfctr, val);
> +	ret = checking_wrmsrl(x86_pmu.perfctr, val);

	if (x86_pmu.perfctr_second_write)
		ret |= checking_wrmsrl(x86_pmu.perfctr, val);


solved my p4 problems for kexec.

>  	ret |= rdmsrl_safe(x86_pmu.perfctr, &val_new);
>  	if (ret || val != val_new)
> -		return false;
> +		goto msr_fail;
>  
>  	return true;

Cheers,
Don

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-09 20:20                                   ` Don Zickus
@ 2010-12-09 20:44                                     ` Cyrill Gorcunov
  0 siblings, 0 replies; 42+ messages in thread
From: Cyrill Gorcunov @ 2010-12-09 20:44 UTC (permalink / raw)
  To: Don Zickus
  Cc: Peter Zijlstra, Eric W. Biederman, Vivek Goyal, Yinghai Lu,
	Ingo Molnar, Jason Wessel, linux-kernel@vger.kernel.org,
	Haren Myneni

On Thu, Dec 09, 2010 at 03:20:08PM -0500, Don Zickus wrote:
...
> 
> Getting closer...
> 
> Pentium4s are special they need the double write, so...
> 
> > +	/*
> > +	 * Now write a value and read it back to see if it matches,
> > +	 * this is needed to detect certain hardware emulators (qemu/kvm)
> > +	 * that don't trap on the MSR access and always return 0s.
> > +	 */
> >  	val = 0xabcdUL;
> > -	ret |= checking_wrmsrl(x86_pmu.perfctr, val);
> > +	ret = checking_wrmsrl(x86_pmu.perfctr, val);
> 
> 	if (x86_pmu.perfctr_second_write)
> 		ret |= checking_wrmsrl(x86_pmu.perfctr, val);
> 
> solved my p4 problems for kexec.
> 
...

yeah, thanks! would you push a patch upstream?

  Cyrill

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-08 14:01                             ` Don Zickus
  2010-12-08 14:20                               ` Peter Zijlstra
@ 2010-12-08 14:33                               ` Peter Zijlstra
  2010-12-08 14:39                               ` Vivek Goyal
  2 siblings, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2010-12-08 14:33 UTC (permalink / raw)
  To: Don Zickus
  Cc: Eric W. Biederman, Vivek Goyal, Yinghai Lu, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On Wed, 2010-12-08 at 09:01 -0500, Don Zickus wrote:
> Privately, I used the above wrapped with for_each_online_cpu(cpu) and it
> worked fine for me. 


Something like so then?

---
Subject: perf: Stop all counters on reboot
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Wed Dec 08 15:29:02 CET 2010

Use the reboot notifier to detach all running counters on reboot, this
solves a problem with kexec where the new kernel doesn't expect
running counters (rightly so).

It will however decrease the coverage of the NMI watchdog. Making a
kexec specific reboot notifier callback would be best, however that
would require touching all notifier callback handlers as they are not
properly structured to deal with new state.

As a compromise, place the perf reboot notifier at the very last
position in the list.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
Index: linux-2.6/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/kernel/perf_event.c
+++ linux-2.6/kernel/perf_event.c
@@ -21,6 +21,7 @@
 #include <linux/dcache.h>
 #include <linux/percpu.h>
 #include <linux/ptrace.h>
+#include <linux/reboot.h>
 #include <linux/vmstat.h>
 #include <linux/vmalloc.h>
 #include <linux/hardirq.h>
@@ -6329,7 +6330,7 @@ static void __cpuinit perf_event_init_cp
 	mutex_unlock(&swhash->hlist_mutex);
 }
 
-#ifdef CONFIG_HOTPLUG_CPU
+#if defined CONFIG_HOTPLUG_CPU || defined CONFIG_KEXEC
 static void perf_pmu_rotate_stop(struct pmu *pmu)
 {
 	struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
@@ -6383,6 +6384,26 @@ static void perf_event_exit_cpu(int cpu)
 static inline void perf_event_exit_cpu(int cpu) { }
 #endif
 
+static int
+perf_reboot(struct notifier_block *notifier, unsigned long val, void *v)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu)
+		perf_event_exit_cpu(cpu);
+
+	return NOTIFY_OK;
+}
+
+/*
+ * Run the perf reboot notifier at the very last possible moment so that
+ * the generic watchdog code runs as long as possible.
+ */
+static struct notifier_block perf_reboot_notifier = {
+	.notifier_call = perf_reboot,
+	.priority = INT_MIN,
+};
+
 static int __cpuinit
 perf_cpu_notify(struct notifier_block *self, unsigned long action, void *hcpu)
 {
@@ -6418,6 +6439,7 @@ void __init perf_event_init(void)
 	perf_pmu_register(&perf_task_clock);
 	perf_tp_register();
 	perf_cpu_notifier(perf_cpu_notify);
+	register_reboot_notifier(&perf_reboot_notifier);
 
 	ret = init_hw_breakpoint();
 	WARN(ret, "hw_breakpoint initialization failed with: %d", ret);


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-08 14:01                             ` Don Zickus
  2010-12-08 14:20                               ` Peter Zijlstra
  2010-12-08 14:33                               ` Peter Zijlstra
@ 2010-12-08 14:39                               ` Vivek Goyal
  2 siblings, 0 replies; 42+ messages in thread
From: Vivek Goyal @ 2010-12-08 14:39 UTC (permalink / raw)
  To: Don Zickus
  Cc: Peter Zijlstra, Eric W. Biederman, Yinghai Lu, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On Wed, Dec 08, 2010 at 09:01:03AM -0500, Don Zickus wrote:
> On Wed, Dec 08, 2010 at 12:30:20AM +0100, Peter Zijlstra wrote:
> > On Thu, 2010-12-02 at 11:15 -0500, Don Zickus wrote:
> > 
> > > Vivek suggested to me this morning that I should just blantantly disable the
> > > perf counter during init when running my test. 
> > 
> > Nah, we should actively scan for that during the bring-up and kill
> > hw-perf when we find an enable bit set, some BIOSes actively use the
> > PMU, this is something that should be discouraged.
> 
> Ok, the reboot notifier addresses the kexec problem but doesn't fix it
> though (I have to test to confirm that, comments below).  The bios check
> should catch those situations (ironically I stumbled upon a machine with
> this problem, so I will test your patch with it, though it only uses perf
> counter 0).  The kdump problem will still exist, not sure if we care and
> perhaps we should document in the changelog that we know kdump is still
> broken (unless we do care).

Can't think why would somebody like to use performance counters in kdump
kernel. So that probably should not be a concern.

Vivek

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-02  7:34                       ` Peter Zijlstra
  2010-12-02 16:15                         ` Don Zickus
@ 2010-12-07 21:16                         ` Don Zickus
  2010-12-08  0:26                           ` Yinghai Lu
  1 sibling, 1 reply; 42+ messages in thread
From: Don Zickus @ 2010-12-07 21:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eric W. Biederman, Vivek Goyal, Yinghai Lu, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On Thu, Dec 02, 2010 at 08:34:30AM +0100, Peter Zijlstra wrote:
> >  void __init lockup_detector_init(void)
> >  {
> >  	void *cpu = (void *)(long)smp_processor_id();
> > @@ -563,6 +576,7 @@ void __init lockup_detector_init(void)
> >  
> >  	cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
> >  	register_cpu_notifier(&cpu_nfb);
> > +	register_reboot_notifier(&reboot_nfb);
> >  
> >  	return;
> >  }
> 
> We'd really want a perf_event.c callback there to do as the hot-unplug
> code does and detach all running counters from the cpu.

Ok, here is a simpler patch for now.

--------------------------------8<--------
From: Don Zickus <dzickus@redhat.com>
Date: Tue, 7 Dec 2010 16:06:59 -0500
Subject: [PATCH] perf:  Use event select bits for hardware check

The counter registers can continue to increment if left enabled
across a kexec or a kdump.  The makes the perf hardware check
accidentally return false when the hardware really does exist.

Change the check to use the first bits of event selection.  Those
bits should be safe as they are used to program the type of events
to use.  And more importantly, they won't increment across kexec/kdump.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 arch/x86/kernel/cpu/perf_event.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 7b91396..7d869c0 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -377,10 +377,10 @@ static bool check_hw_exists(void)
 	u64 val, val_new = 0;
 	int ret = 0;
 
-	val = 0xabcdUL;
-	ret |= checking_wrmsrl(x86_pmu.perfctr, val);
-	ret |= rdmsrl_safe(x86_pmu.perfctr, &val_new);
-	if (ret || val != val_new)
+	val = 0xabUL;
+	ret |= checking_wrmsrl(x86_pmu.eventsel, val);
+	ret |= rdmsrl_safe(x86_pmu.eventsel, &val_new);
+	if (ret || val != (val_new & 0xFF))
 		return false;
 
 	return true;
-- 
1.7.3.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-07 21:16                         ` Don Zickus
@ 2010-12-08  0:26                           ` Yinghai Lu
  2010-12-08 10:39                             ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Yinghai Lu @ 2010-12-08  0:26 UTC (permalink / raw)
  To: Don Zickus
  Cc: Peter Zijlstra, Eric W. Biederman, Vivek Goyal, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On 12/07/2010 01:16 PM, Don Zickus wrote:
> On Thu, Dec 02, 2010 at 08:34:30AM +0100, Peter Zijlstra wrote:
>>>  void __init lockup_detector_init(void)
>>>  {
>>>  	void *cpu = (void *)(long)smp_processor_id();
>>> @@ -563,6 +576,7 @@ void __init lockup_detector_init(void)
>>>  
>>>  	cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
>>>  	register_cpu_notifier(&cpu_nfb);
>>> +	register_reboot_notifier(&reboot_nfb);
>>>  
>>>  	return;
>>>  }
>>
>> We'd really want a perf_event.c callback there to do as the hot-unplug
>> code does and detach all running counters from the cpu.
> 
> Ok, here is a simpler patch for now.
> 
> --------------------------------8<--------
> From: Don Zickus <dzickus@redhat.com>
> Date: Tue, 7 Dec 2010 16:06:59 -0500
> Subject: [PATCH] perf:  Use event select bits for hardware check
> 
> The counter registers can continue to increment if left enabled
> across a kexec or a kdump.  The makes the perf hardware check
> accidentally return false when the hardware really does exist.
> 
> Change the check to use the first bits of event selection.  Those
> bits should be safe as they are used to program the type of events
> to use.  And more importantly, they won't increment across kexec/kdump.
> 
> Signed-off-by: Don Zickus <dzickus@redhat.com>
> ---
>  arch/x86/kernel/cpu/perf_event.c |    8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 7b91396..7d869c0 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -377,10 +377,10 @@ static bool check_hw_exists(void)
>  	u64 val, val_new = 0;
>  	int ret = 0;
>  
> -	val = 0xabcdUL;
> -	ret |= checking_wrmsrl(x86_pmu.perfctr, val);
> -	ret |= rdmsrl_safe(x86_pmu.perfctr, &val_new);
> -	if (ret || val != val_new)
> +	val = 0xabUL;
> +	ret |= checking_wrmsrl(x86_pmu.eventsel, val);
> +	ret |= rdmsrl_safe(x86_pmu.eventsel, &val_new);
> +	if (ret || val != (val_new & 0xFF))
>  		return false;
>  
>  	return true;

Thanks. it fixes the problem.

Yinghai

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-08  0:26                           ` Yinghai Lu
@ 2010-12-08 10:39                             ` Peter Zijlstra
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2010-12-08 10:39 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Don Zickus, Eric W. Biederman, Vivek Goyal, Ingo Molnar,
	Jason Wessel, linux-kernel@vger.kernel.org, Haren Myneni

On Tue, 2010-12-07 at 16:26 -0800, Yinghai Lu wrote:
> > +++ b/arch/x86/kernel/cpu/perf_event.c
> > @@ -377,10 +377,10 @@ static bool check_hw_exists(void)
> >       u64 val, val_new = 0;
> >       int ret = 0;
> >  
> > -     val = 0xabcdUL;
> > -     ret |= checking_wrmsrl(x86_pmu.perfctr, val);
> > -     ret |= rdmsrl_safe(x86_pmu.perfctr, &val_new);
> > -     if (ret || val != val_new)
> > +     val = 0xabUL;
> > +     ret |= checking_wrmsrl(x86_pmu.eventsel, val);
> > +     ret |= rdmsrl_safe(x86_pmu.eventsel, &val_new);
> > +     if (ret || val != (val_new & 0xFF))
> >               return false;
> >  
> >       return true;
> 
> Thanks. it fixes the problem.
> 
Won't merge it though, I think it stinks..

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: perf hw  in kexeced kernel broken in tip
  2010-12-01 19:49             ` Peter Zijlstra
  2010-12-01 19:58               ` Vivek Goyal
@ 2010-12-01 20:41               ` Eric W. Biederman
  1 sibling, 0 replies; 42+ messages in thread
From: Eric W. Biederman @ 2010-12-01 20:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, Yinghai Lu, Ingo Molnar, Jason Wessel, Don Zickus,
	linux-kernel@vger.kernel.org, Haren Myneni

Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, 2010-12-01 at 14:46 -0500, Vivek Goyal wrote:
>> On Wed, Dec 01, 2010 at 08:38:12PM +0100, Peter Zijlstra wrote:
>> > On Wed, 2010-12-01 at 11:23 -0500, Vivek Goyal wrote:
>> > > > What does kexec normally do to ensure hardware is left in a sane state?
>> > > 
>> > > Typically calls device_shutdown() and sysdev_shutdown() from
>> > > kernel_restart_prepare() to shutdown the devices.
>> > > 
>> > > Also calls machine_shutdown() which depending on architecture can take
>> > > care of various things like stopping other cpus, shutting down LAPIC, 
>> > > disabling IOAPIC, disabling hpet, shutting down IOMMU etc
>> > > (native_machine_shutdown()). 
>> > 
>> > So basically there's no sane generic reset callout?
>> 
>> I think ->shutdown() calls are sane generic callouts. Isn't it?
>
> ->shutdown looks like it's about to reset/halt the hardware, no point in
> slowing down the regular shutdown/reboot path for something like this,
> we know the hardware will get reset to a sane state.

No you don't!

Most BIOSen implement a board level reset there, but it isn't required.
Just doing a software only reinitialization is allowed, and on some
arches is the only thing you can do.

Speed during reboot is not a reason to avoid anything.  reboot
is not a fast path, and we are talking about things in human tersm.

The only argument I have heard that holds the least amount of
sense is to keep what we do to a minimum, to increase the chances
that we can do a reboot even after a kernel oops.

All of that said.  What insane start are we leaving the hardware
in that we think it is going to be slow in human terms to remove?

Eric

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2010-12-09 20:44 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-12-01  8:00 perf hw in kexeced kernel broken in tip Yinghai Lu
2010-12-01 11:27 ` Peter Zijlstra
2010-12-01 16:06   ` Vivek Goyal
2010-12-01 16:11     ` Peter Zijlstra
2010-12-01 16:23       ` Vivek Goyal
2010-12-01 19:38         ` Peter Zijlstra
2010-12-01 19:46           ` Vivek Goyal
2010-12-01 19:49             ` Peter Zijlstra
2010-12-01 19:58               ` Vivek Goyal
2010-12-01 20:07                 ` Peter Zijlstra
2010-12-01 21:48                   ` Eric W. Biederman
2010-12-02  5:23                     ` Don Zickus
2010-12-02  7:34                       ` Peter Zijlstra
2010-12-02 16:15                         ` Don Zickus
2010-12-07 23:30                           ` Peter Zijlstra
2010-12-08 14:01                             ` Don Zickus
2010-12-08 14:20                               ` Peter Zijlstra
2010-12-08 14:42                                 ` Vivek Goyal
2010-12-08 14:48                                   ` Peter Zijlstra
2010-12-08 15:02                                     ` Vivek Goyal
2010-12-08 15:15                                       ` Peter Zijlstra
2010-12-08 15:22                                         ` Vivek Goyal
2010-12-08 21:16                                         ` Eric W. Biederman
2010-12-08 14:59                                 ` Peter Zijlstra
2010-12-08 18:43                                   ` Yinghai Lu
2010-12-08 19:01                                     ` Don Zickus
2010-12-08 19:05                                       ` Yinghai Lu
2010-12-08 19:17                                         ` Peter Zijlstra
2010-12-08 19:20                                           ` Yinghai Lu
2010-12-08 19:06                                     ` Peter Zijlstra
2010-12-08 19:20                                       ` Yinghai Lu
2010-12-08 22:37                                   ` Don Zickus
2010-12-08 23:20                                     ` Eric W. Biederman
2010-12-09  4:34                                       ` Don Zickus
2010-12-09 20:20                                   ` Don Zickus
2010-12-09 20:44                                     ` Cyrill Gorcunov
2010-12-08 14:33                               ` Peter Zijlstra
2010-12-08 14:39                               ` Vivek Goyal
2010-12-07 21:16                         ` Don Zickus
2010-12-08  0:26                           ` Yinghai Lu
2010-12-08 10:39                             ` Peter Zijlstra
2010-12-01 20:41               ` Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).