* How to generate a HW NMI
@ 2010-09-30 17:59 Roger Cruz
  2010-10-01 14:15 ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 15+ messages in thread
From: Roger Cruz @ 2010-09-30 17:59 UTC (permalink / raw)
  To: xen-devel



Hi fellow Xen developers,

 

I continue to get system hangs where the watchdog NMI in Xen is not
doing its job.  I am completely blind as to what is getting jammed.
Tried multiple experiments to force the hang and in each, the watchdog
has kicked in, so I know the mechanism works 99% of the time except in
my one hang.

 

So in the old days of PCI bus, I used to be able to generate a HW NMI by
asserting the SERR signal in the connector.  With the advent of PCIe, I
believe that signal is no longer present, so I am looking for any other
way to cause a system error.    I have examined the PCI express
mini-card specification looking for a signal I can use in the internal
WiFi connector, but alas, none of the signals I read about seem like
they would do what I need.  I am not sure if there is anything I can
short in the PCIe signals that could have a similar effect as the SERR
signal.  The platform is a Lenovo T500 laptop so the number of
connectors to play with is limited.

 

I also thought of causing a parity/ECC error but the GM45 chipset used
in this laptop does not support ECC memory.

 

So I'm basically looking for any other ideas on how to cause a fault by
probing somewhere in the motherboard.  This MB has a docking station
connector but I have not been able to find the pinout list so I don't
know what is brought out there.  At this point, I have no problem
cracking open the case and soldering something onto the motherboard.  I
just need to know what chips and signals to tap.

 

Thanks in advance.

 

Roger R. Cruz



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel


* Re: How to generate a HW NMI
  2010-09-30 17:59 How to generate a HW NMI Roger Cruz
@ 2010-10-01 14:15 ` Konrad Rzeszutek Wilk
  2010-10-01 19:33   ` Roger Cruz
  0 siblings, 1 reply; 15+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-10-01 14:15 UTC (permalink / raw)
  To: Roger Cruz; +Cc: xen-devel

On Thu, Sep 30, 2010 at 12:59:25PM -0500, Roger Cruz wrote:
> Hi fellow Xen developers,
> 
>  
> 
> I continue to get system hangs where the watchdog NMI in Xen is not
> doing its job.  I am completely blind as to what is getting jammed.
> Tried multiple experiments to force the hang and in each, the watchdog
> has kicked in, so I know the mechanism works 99% of the time except in
> my one hang.
> 
>  
> 
> So in the old days of PCI bus, I used to be able to generate a HW NMI by
> asserting the SERR signal in the connector.  With the advent of PCIe, I

Nice.

> believe that signal is no longer present, so I am looking for any other
> way to cause a system error.    I have examined the PCI express

What about the Mini PCI-e to PCI-e adapter:
http://www.hwtools.net/adapter/PM2C.html

And then plug in a PCI to PCI-e adapter:

http://www.newegg.com/Product/Product.aspx?Item=N82E16815158165&nm_mc=OTC-Froogle&cm_mmc=OTC-Froogle-_-Add-On+Cards-_-STARTECH-_-15158165

And then assert the SERR#?

> mini-card specification looking for a signal I can use in the internal
> WiFi connector, but alas, none of the signals I read about seem like
> they would do what I need.  I am not sure if there is anything I can
> short in the PCIe signals that could have a similar effect as the SERR

Per this slide deck:
http://www.pcisig.com/developers/main/training_materials/get_document?doc_id=cdf593816ee20b90d8603d4aeb081a726ddc3091
it looks as if you can program the PCIe bridge to fall back to "legacy" mode.

And per this forum post:
http://forums.gentoo.org/viewtopic-t-752165.html

it looks as if the SERR# signal is asserted on the SMBus controller?
Maybe there is a way to do it via that?

> signal.  The platform is a Lenovo T500 laptop so the number of
> connectors to play with is limited.
> 

IBM servers used to have NMI buttons - it could be that Lenovo
hasn't completely gotten rid of them. Since you are open to looking at the
motherboard, maybe there is a spot marked #NMI?

> 
>  
> 
> I also thought of causing a parity/ECC error but the GM45 chipset used
> in this laptop does not support ECC memory.

>  
> 
> So I'm basically looking for any other ideas on how to cause a fault by
> probing somewhere in the motherboard.  This MB has a docking station
> connector but I have not been able to find the pinout list so I don't
> know what is brought out there.  At this point, I have no problem

How about just shorting the pins randomly :-)

> cracking open the case and soldering something onto the motherboard.  I
> just need to know what chips and signals to tap.
> 
>  
> 
> Thanks in advance.
> 
>  
> 
> Roger R. Cruz
> 


* RE: How to generate a HW NMI
  2010-10-01 14:15 ` Konrad Rzeszutek Wilk
@ 2010-10-01 19:33   ` Roger Cruz
  2010-10-01 20:01     ` Konrad Rzeszutek Wilk
       [not found]     ` <4CA9AC25.6020707@siemens.com>
  0 siblings, 2 replies; 15+ messages in thread
From: Roger Cruz @ 2010-10-01 19:33 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel

Great ideas Konrad.  I have ordered these parts.  It will probably take
a few days before they get here.
The goal of using the HW NMI is to rule out any incorrect SW settings of
the Performance Monitoring counters used in Xen to trigger the NMI.
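
To make the distinction concrete, here is a rough model of how a
counter-driven NMI watchdog decides that a CPU is stuck.  This is a
sketch of the general technique only, not Xen's code: the class name,
the method names and the threshold are all made up for illustration.

```python
class NmiWatchdog:
    """Toy model of an NMI watchdog (hypothetical names, not Xen's code)."""

    def __init__(self, threshold):
        self.threshold = threshold  # stalled NMIs tolerated before declaring a hang
        self.heartbeat = 0          # bumped by the normal timer interrupt
        self.last_seen = 0          # heartbeat value seen by the previous NMI
        self.stalled_nmis = 0       # consecutive NMIs with no heartbeat progress

    def timer_tick(self):
        # Runs from an ordinary (maskable) timer interrupt while the CPU is healthy.
        self.heartbeat += 1

    def nmi(self):
        # Runs from the periodic NMI, e.g. on performance counter overflow.
        # Returns True when the CPU should be declared hung.
        if self.heartbeat != self.last_seen:
            self.last_seen = self.heartbeat
            self.stalled_nmis = 0
            return False
        self.stalled_nmis += 1
        return self.stalled_nmis >= self.threshold


wd = NmiWatchdog(threshold=3)
wd.timer_tick()
healthy = wd.nmi()                       # heartbeat advanced, so no hang
verdicts = [wd.nmi() for _ in range(3)]  # no more ticks: third stalled NMI fires
```

The model only fires if nmi() is actually called.  If the NMI itself is
never delivered, whether because the counters are misprogrammed or
because something outside the hypervisor blocks it, the hang goes
unreported, which is why an external HW NMI source helps separate the
two cases.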

Someone else mentioned that another possibility as to why an NMI may not
be triggered is that the system is stuck handling an SMI interrupt.  I
haven't studied Xen code with respect to SMIs yet, but I assume that Xen
doesn't do much in that area right?  I was under the impression that the
BIOS usually set this up and the OSs could not even modify the handlers
as they were in protected RAM.

R.


* Re: How to generate a HW NMI
  2010-10-01 19:33   ` Roger Cruz
@ 2010-10-01 20:01     ` Konrad Rzeszutek Wilk
  2010-10-01 20:36       ` pciback doesn't take CardBus device Huang2, Wei
       [not found]     ` <4CA9AC25.6020707@siemens.com>
  1 sibling, 1 reply; 15+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-10-01 20:01 UTC (permalink / raw)
  To: Roger Cruz; +Cc: xen-devel

On Fri, Oct 01, 2010 at 02:33:20PM -0500, Roger Cruz wrote:
> Great ideas Konrad.  I have ordered these parts.  It will probably take
> a few days before they get here.
> The goal of using the HW NMI is to rule out any incorrect SW settings of
> the Performance Monitoring counters used in Xen to trigger the NMI.

Right.
> 
> Someone else mentioned that another possibility as to why an NMI may not
> be triggered is that the system is stuck handling an SMI interrupt.  I
> haven't studied Xen code with respect to SMIs yet, but I assume that Xen
> doesn't do much in that area right?  I was under the impression that the
> BIOS usually set this up and the OSs could not even modify the handlers
> as they were in protected RAM.

Ugh. That is true - we have no notion of when the SMIs run. Not that
the SMIs actually work 100% of the time.

Another thought, and this might be a complete shot in the dark:
look in the upstream (2.6.36-rc6) blacklist.c file. There is an entry
for that specific ThinkPad which activates the ACPI _OSI; maybe that
needs to be done?


* pciback doesn't take CardBus device
  2010-10-01 20:01     ` Konrad Rzeszutek Wilk
@ 2010-10-01 20:36       ` Huang2, Wei
  2010-10-01 20:45         ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 15+ messages in thread
From: Huang2, Wei @ 2010-10-01 20:36 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Xen-devel

Hi Konrad,

I found that pciback doesn't accept CardBus devices. It only handles type-0 and type-1. Any specific reason to skip them? That caused some trouble for FireWire passthrough on my laptop. I want to know the reason before submitting a patch.

Thanks,
-Wei


* Re: pciback doesn't take CardBus device
  2010-10-01 20:36       ` pciback doesn't take CardBus device Huang2, Wei
@ 2010-10-01 20:45         ` Konrad Rzeszutek Wilk
  2010-10-01 21:04           ` Huang2, Wei
  0 siblings, 1 reply; 15+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-10-01 20:45 UTC (permalink / raw)
  To: Huang2, Wei; +Cc: Xen-devel

On Fri, Oct 01, 2010 at 03:36:45PM -0500, Huang2, Wei wrote:
> Hi Konrad,
> 
> I found that pciback doesn't accept CardBus devices. It only handles type-0 and type-1. Any specific reason to skip them? That caused some trouble for FireWire passthrough on my laptop. I want to know the reason before submitting a patch.

No reason at all.
Was this working in the past (2.6.18)? I will gladly accept any patch.


* RE: pciback doesn't take CardBus device
  2010-10-01 20:45         ` Konrad Rzeszutek Wilk
@ 2010-10-01 21:04           ` Huang2, Wei
  0 siblings, 0 replies; 15+ messages in thread
From: Huang2, Wei @ 2010-10-01 21:04 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Xen-devel

I haven't tested 2.6.18 yet, but will do. The issue I found is with the following configuration. These devices are behind the same bridge, but because 46:06.5 is a CardBus bridge and can't be assigned, it blocks the other devices from being assigned to a guest VM. I will create a patch for it.

Thanks,
-Wei

===========
46:06.0 FireWire (IEEE 1394): Ricoh Co Ltd R5C832 IEEE 1394 Controller (rev 06)
46:06.1 SD Host controller: Ricoh Co Ltd R5C822 SD/SDIO/MMC/MS/MSPro Host Adapter (rev 25)
46:06.2 System peripheral: Ricoh Co Ltd R5C843 MMC Host Controller (rev 14)
46:06.3 System peripheral: Ricoh Co Ltd R5C592 Memory Stick Bus Host Adapter (rev 14)
46:06.4 System peripheral: Ricoh Co Ltd xD-Picture Card Controller (rev 14)
46:06.5 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev bb)
===========
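
For reference, the "type" being discussed is the layout field of the PCI
Header Type register (configuration space offset 0x0E): 0 is an ordinary
device, 1 is a PCI-to-PCI bridge and 2 is a CardBus bridge, with bit 7
flagging a multi-function device.  The sketch below decodes that
register and models a backend that accepts only type-0 and type-1.  The
function names and the sample register values are hypothetical; this is
not the pciback source.

```python
MULTIFUNCTION_BIT = 0x80  # bit 7 of the Header Type register
LAYOUT_MASK = 0x7F        # bits 6:0 select the header layout

LAYOUTS = {
    0x00: "type-0 (ordinary device)",
    0x01: "type-1 (PCI-to-PCI bridge)",
    0x02: "type-2 (CardBus bridge)",
}


def decode_header_type(raw):
    """Return (layout description, is_multifunction) for a raw register byte."""
    layout = raw & LAYOUT_MASK
    return LAYOUTS.get(layout, "unknown"), bool(raw & MULTIFUNCTION_BIT)


def accepted_by_type01_only_backend(raw):
    """Hypothetical model of a backend that only handles type-0 and type-1."""
    return (raw & LAYOUT_MASK) in (0x00, 0x01)


# Illustrative values only, not read from the hardware listed above.
firewire_like = 0x80  # type-0 function of a multi-function device
cardbus_like = 0x82   # type-2 function of a multi-function device

firewire_ok = accepted_by_type01_only_backend(firewire_like)
cardbus_ok = accepted_by_type01_only_backend(cardbus_like)
```

With such a check, the type-0 functions pass while the CardBus function
is rejected, which matches the situation described above where one
function holds back its siblings behind the same bridge.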



* RE: How to generate a HW NMI
       [not found]     ` <4CA9AC25.6020707@siemens.com>
@ 2010-10-04 13:56       ` Roger Cruz
       [not found]         ` <4CA9E0FB.6000109@siemens.com>
  2010-10-12 15:59       ` Roger Cruz
  1 sibling, 1 reply; 15+ messages in thread
From: Roger Cruz @ 2010-10-04 13:56 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xen-devel, Konrad Rzeszutek Wilk



Jan, 

I will try your suggestion of turning off SMIs.  I am also interested in
you conducting an experiment for me.  If you can, please tell your kernel
not to use any CPU power saving modes.  In Xen I use max_cstate=0 on the
boot line.  I have found that when I do this, the hangs appear to go away
(one customer has reported a hang since adopting this workaround, so it
is not 100% effective).

Thanks
Roger


-----Original Message-----
From: Jan Kiszka [mailto:jan.kiszka@siemens.com]
Sent: Mon 10/4/2010 6:27 AM
To: Roger Cruz
Cc: Konrad Rzeszutek Wilk; xen-devel@lists.xensource.com
Subject: Re: How to generate a HW NMI
 
Am 01.10.2010 21:33, Roger Cruz wrote:
> Someone else mentioned that another possibility as to why an NMI may not
> be triggered is that the system is stuck handling an SMI interrupt.  I
> haven't studied Xen code with respect to SMIs yet, but I assume that Xen
> doesn't do much in that area right?  I was under the impression that the
> BIOS usually set this up and the OSs could not even modify the handlers
> as they were in protected RAM.

We happen to face strange freezes of KVM right now as well (CPU is
apparently stuck in guest mode), and turning off SMIs cures them here
[1]. However, it's too early to draw final conclusions, we are still
collecting test results & data on the systems.

It would therefore be interesting to see if your case is similar to ours.
If you feel brave enough to turn off your SMIs (there are rumors that
CPUs /could/ get fried as some thermal management /might/ be done via
SMIs), please check out [2], build it (requires libpci and a kernel
source tree), and run "smictrl -s 0" on your box. Should give something
like this:

SMI-enabled chipset found:
 PCI_VENDOR_ID_INTEL:PCI_DEVICE_ID_INTEL_PCH_LPC_MIN+7 (8086:3b07)
 SMI_EN register:       0006403b
 new value:             00000002

If the chipset is not detected, add the PCI device ID of your ISA bridge
to the list in smictrl.c. If the new value still has bit 0 set, you are
unlucky as your BIOS has locked some SMIs against disabling. Otherwise,
SMIs are off now, and your lock up /may/ disappear. Looking forward to
your results!
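
The bit-0 check described above can be sketched as follows.  The name
GBL_SMI_EN for bit 0 is taken from memory of Intel chipset datasheets
and should be treated as an assumption; the helper names are made up,
and the two register values are the ones from the sample output.

```python
GBL_SMI_EN = 1 << 0  # assumed name for bit 0, the global SMI enable


def smis_globally_enabled(smi_en):
    """True if bit 0 of an SMI_EN register value is still set."""
    return bool(smi_en & GBL_SMI_EN)


def set_bits(value):
    """List the positions of the set bits in a 32-bit register value."""
    return [bit for bit in range(32) if (value >> bit) & 1]


before = 0x0006403B  # "SMI_EN register" line in the sample output
after = 0x00000002   # "new value" line in the sample output

enabled_before = smis_globally_enabled(before)
enabled_after = smis_globally_enabled(after)
bits_before = set_bits(before)
bits_after = set_bits(after)
```

In the sample output bit 0 is clear in the new value, so that box counts
as the "SMIs are off now" case; a new value with bit 0 still set would
be the locked-by-BIOS case.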

Jan

[1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/60326
[2] http://git.kiszka.org/?p=smictrl.git

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


* RE: How to generate a HW NMI
       [not found]         ` <4CA9E0FB.6000109@siemens.com>
@ 2010-10-04 14:19           ` Roger Cruz
  2010-10-04 15:23             ` Dan Magenheimer
       [not found]             ` <4CA9F16C.905@siemens.com>
  0 siblings, 2 replies; 15+ messages in thread
From: Roger Cruz @ 2010-10-04 14:19 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xen-devel, Konrad Rzeszutek Wilk

Until Friday, all hard hangs that we and our customers had experienced
were on Lenovo T500 and X200, even with their latest BIOSes.  The Lenovo
T400 has never hung for me and I don't have any reports on them from the
field.  On Friday, I had an HP i5 hard hang with a similar footprint to
the Lenovos.  When this hard hang happens, the Xen watchdog (which is
driven by the NMI handler) fails to do its job of causing a crash/stack
trace.  This is why we have started to suspect something with the BIOS
and SMIs as they are the only thing that can block an NMI.  I am pretty
certain that this is somehow related to entering C3 power states and
possibly at the same time an SMI comes in.  The time it takes to hang
varies from 30mins to 24 hrs.

Roger




-----Original Message-----
From: Jan Kiszka [mailto:jan.kiszka@siemens.com] 
Sent: Monday, October 04, 2010 10:13 AM
To: Roger Cruz
Cc: Konrad Rzeszutek Wilk; xen-devel@lists.xensource.com
Subject: Re: How to generate a HW NMI

Am 04.10.2010 15:56, Roger Cruz wrote:
> Jan,
> 
> I will try your suggestion of turning off SMIs. I am also interested in you
> conducting an experiment for me. If you can, please tell your kernel not to use
> any CPU power saving modes. In Xen I use max_cstate=0 in the bootline. I have
> found that when I do this, the hangs appear to go away (we had one customer
> report one since using this work-around, so it is not 100% working).

Will do. My customer reported that he was able to easily crash his i7
notebook by pulling and re-plugging the power cable. I bet all of these
events are trapped by the BIOS via power management SMIs...

BTW, do you see any correlation between crashable boxes and BIOS
vendors? We have no representative numbers yet, just one confirmed
unstable notebook that is Phoenix-based and one AMI-based i7 server
that is rock-stable.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


* RE: RE: How to generate a HW NMI
  2010-10-04 14:19           ` Roger Cruz
@ 2010-10-04 15:23             ` Dan Magenheimer
       [not found]             ` <4CA9F16C.905@siemens.com>
  1 sibling, 0 replies; 15+ messages in thread
From: Dan Magenheimer @ 2010-10-04 15:23 UTC (permalink / raw)
  To: Roger Cruz, Jan Kiszka; +Cc: xen-devel, Konrad Wilk

This is a long shot, but since my thoughts jumped to it after
reading this, I thought I'd post anyway.

Some systems support a special "C1E" power state
that can be enabled/disabled in the BIOS.  My Dell Core2Duo
laptop has this feature.  I remember running into
some weirdness that went away when I turned it off.
Perhaps the power management code is somehow entering
the BIOS to see if this is enabled and max_cstate isn't
controlling it since the check is done in the BIOS
bypassing Xen?

Google for C1E to find lots of information about
this weird power state.


* RE: How to generate a HW NMI
       [not found]             ` <4CA9F16C.905@siemens.com>
@ 2010-10-04 19:03               ` Roger Cruz
  2010-10-11 21:20                 ` Roger Cruz
  0 siblings, 1 reply; 15+ messages in thread
From: Roger Cruz @ 2010-10-04 19:03 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xen-devel, Konrad Rzeszutek Wilk

> BTW, "rmmod processor thermal" (should be equivalent to your Xen
> parameter) did not make a difference here.

I am not familiar with the thermal module, but my guess is that what it
controls is not the same as the C3 state, which can be entered when the
kernel becomes idle.  I believe the thermal module plays with another
type of state (P?), where it alters the voltage and frequency of the CPU
to keep the CPU running but at a particular % of the top speed.  The C3
state causes the CPU clocks to shut down entirely, and the CPU is then
awakened by an external event.

R.

-----Original Message-----
From: Jan Kiszka [mailto:jan.kiszka@siemens.com] 
Sent: Monday, October 04, 2010 11:23 AM
To: Roger Cruz
Cc: Konrad Rzeszutek Wilk; xen-devel@lists.xensource.com
Subject: Re: How to generate a HW NMI

Am 04.10.2010 16:19, Roger Cruz wrote:
> Until Friday, all hard hangs that we and our customers had experienced
> were on Lenovo T500 and X200, even with their latest BIOSes.

Yeah, the T500 was reported as problematic here as well. My Fujitsu
Celsius H700 also crashes.

In contrast, we have positive results from a Dell server with an Asus
P6T Deluxe V2 board and a Core i7 920.

>  The Lenovo
> T400 has never hung for me and I don't have any reports on them from the
> field.  On Friday, I had an HP i5 hard hang with similar footprint as

i5? Mmh, we only have reports from i7 so far. Which BIOS vendor?

> the Lenovos.  When this hard hang happens, the Xen watchdog (which is
> driven by the NMI handler) will not do its job and cause a crash/stack
> trace.
>  This is why we have started to suspect something with the BIOS
> and SMIs as they are the only thing that can block an NMI.  I am pretty
> certain that this is somehow related to entering C3 power states and
> possibly at the same time an SMI comes in.

I tried various stuff under Linux as well: nmi_watchdog=1, tracing to
VGA buffer right before/after guest-host switch (it always hangs after
entry here), verified guest interruptibility before entry (though
hypervisors usually do not play with the critical bits), read-out of
host RAM (including kernel log buffer) via Firewire - it all points to a
crash outside the scope of the host OS.

>  The time it takes to hang
> varies from 30mins to 24 hrs.

We are a bit luckier, maybe due to our special guest (an old RTOS in
16-bit mode): I can reproduce the hang after a few minutes.

BTW, "rmmod processor thermal" (should be equivalent to your Xen
parameter) did not make a difference here.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


* RE: RE: How to generate a HW NMI
  2010-10-04 19:03               ` Roger Cruz
@ 2010-10-11 21:20                 ` Roger Cruz
       [not found]                   ` <4CB420D4.2010507@siemens.com>
  0 siblings, 1 reply; 15+ messages in thread
From: Roger Cruz @ 2010-10-11 21:20 UTC (permalink / raw)
  To: Roger Cruz, Jan Kiszka; +Cc: xen-devel, Konrad Rzeszutek Wilk

Here is some additional info from my experiments over the weekend.

I took the Lenovo T500 and removed its internal WiFi miniPCIe card.  In
its place, I put in a miniPCIe to PCIe converter card with a PCIe
socket.  Into that socket, I placed a PCIe dump card.  This card has a
switch that when you press it, it creates an SERR error.  Using the
utility provided by the vendor, I enabled all the bridges between the
card and the CPU to carry the SERR signal and cause the CPU to see it as
an NMI.  I tested the set-up several times.  Every single time I pressed
the switch, I got an NMI, followed by a kdump core.  So I was sure the
HW setup was working correctly.
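
As a rough illustration of what enabling the bridges involves: on a
PCI-to-PCI bridge, SERR# forwarding is governed by the SERR# Enable bit
in the Command register (offset 0x04, bit 8) and the SERR# Enable bit in
the Bridge Control register (offset 0x3E, bit 1).  The sketch below
checks a path of bridges for the first one that would stop the signal.
The bridge names and register values are invented for the example, and
the chipset-side routing of SERR to NMI is not modeled.

```python
PCI_COMMAND_SERR = 1 << 8     # Command register (offset 0x04): SERR# Enable
PCI_BRIDGE_CTL_SERR = 1 << 1  # Bridge Control register (offset 0x3E): SERR# Enable


def bridge_forwards_serr(command, bridge_control):
    """True if both SERR# enable bits are set on a bridge."""
    return bool(command & PCI_COMMAND_SERR) and bool(bridge_control & PCI_BRIDGE_CTL_SERR)


def first_blocking_bridge(path):
    """Return the name of the first bridge that would not forward SERR#.

    `path` is a list of (name, command, bridge_control) tuples ordered
    from the card towards the CPU.  Returns None if every bridge forwards.
    """
    for name, command, bridge_control in path:
        if not bridge_forwards_serr(command, bridge_control):
            return name
    return None


# Hypothetical register values for two bridges on the way to the CPU.
all_enabled = [
    ("adapter bridge", 0x0147, 0x0002),
    ("root port", 0x0107, 0x0003),
]
one_disabled = [
    ("adapter bridge", 0x0147, 0x0002),
    ("root port", 0x0007, 0x0002),  # Command register bit 8 is clear
]

blocker_when_enabled = first_blocking_bridge(all_enabled)
blocker_when_disabled = first_blocking_bridge(one_disabled)
```

A check like this only covers the configuration-space side; it says
nothing about whether the CPU is in a state to take the NMI once the
signal arrives.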

I left two Lenovo T500s running over the weekend and when I returned this
morning, both had hung.  Completely frozen.  I pressed the NMI switch on
both systems and nothing.  No crashes, no coredumps.  It looks as if the
SERR/NMI is getting ignored/blocked or the CPU is completely shut down
(STPCLK).

This experiment helps me prove that the software watchdog code in Xen
was not the problem and indeed the NMIs are getting blocked somehow.
This is what I now need to investigate.  Areas that I care to learn more
about are the SMI handler and the external chip's use of the STPCLK
signal to the CPU.

As an additional bit of info, the only response we get when the systems
are hung is a beep when the power cord is unplugged/plugged from the
laptop.  I don't know if the beep is done via a HW module or whether
ACPI/BIOS is involved.

Still looking for additional ideas.

Regards,
Roger R. Cruz


-----Original Message-----
From: xen-devel-bounces@lists.xensource.com
[mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Roger Cruz
Sent: Monday, October 04, 2010 3:03 PM
To: Jan Kiszka
Cc: xen-devel@lists.xensource.com; Konrad Rzeszutek Wilk
Subject: [Xen-devel] RE: How to generate a HW NMI

> BTW, "rmmod processor thermal" (should be equivalent to your Xen

I am not familiar with the thermal module but my guess is that they are
not the same as the C3 states which can be entered when the kernel
becomes idle.  I believe the thermal plays with other type of state (P?)
where it alters the voltage and frequency of the CPU to keep the CPU
still running but at a particular % of the top speed.  The C3 state
causes the CPU clocks to shutdown entirely and then it is awaken by an
external event.

R.

-----Original Message-----
From: Jan Kiszka [mailto:jan.kiszka@siemens.com] 
Sent: Monday, October 04, 2010 11:23 AM
To: Roger Cruz
Cc: Konrad Rzeszutek Wilk; xen-devel@lists.xensource.com
Subject: Re: How to generate a HW NMI

Am 04.10.2010 16:19, Roger Cruz wrote:
> Until Friday, all hard hangs that we and our customers had experienced
> were on Lenovo T500 and X200, even with their latest BIOSes.

Yeah, the T500 was reported as problematic here as well. My Fujitsu
Celsius H700 also crashes.

In contrast, we have positive results from a Dell server with an Asus
P6T Deluxe V2 board and a Core i7 920.

>  The Lenovo
> T400 has never hung for me and I don't have any reports on them from
the
> field.  On Friday, I had an HP i5 hard hang with similar footprint as

i5? Mmh, we only have reports from i7 so far. Which BIOS vendor?

> the Lenovos.  When this hard hang happens, the Xen watchdog (which is
> driven by the NMI handler) will not do its job and cause a crash/stack
> trace.
>  This is why we have started to suspect something with the BIOS
> and SMIs as they are the only thing that can block an NMI.  I am
pretty
> certain that this is somehow related to entering C3 power states and
> possibly at the same time an SMI comes in.

I tried various stuff under Linux as well: nmi_watchdog=1, tracing to
VGA buffer right before/after guest-host switch (it always hangs after
entry here), verified guest interruptibility before entry (though
hypervisors usually do not play with the critical bits), read-out of
host RAM (including kernel log buffer) via Firewire - it all points to a
crash outside the scope of the host OS.

>  The time it takes to hang
> varies from 30mins to 24 hrs.

We are a bit more lucky, maybe due to our special guest (an old RTOS in
16-bit mode): I can reproduce the hang after a few minutes.

BTW, "rmmod processor thermal" (should be equivalent to your Xen
parameter) did not make a difference here.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

No virus found in this incoming message.
Checked by AVG - www.avg.com 
Version: 9.0.856 / Virus Database: 271.1.1/3168 - Release Date: 10/04/10
02:35:00

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

* Re: RE: How to generate a HW NMI
       [not found]                   ` <4CB420D4.2010507@siemens.com>
@ 2010-10-12 12:42                     ` Roger Cruz
  2010-10-25 15:34                       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 15+ messages in thread
From: Roger Cruz @ 2010-10-12 12:42 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xen-devel, Konrad Rzeszutek Wilk

Disabling SMIs is part of the experiments to be conducted today or
tomorrow.  I will keep you posted.




On Oct 12, 2010, at 4:48 AM, "Jan Kiszka" <jan.kiszka@siemens.com>  
wrote:

> On 11.10.2010 23:20, Roger Cruz wrote:
>> Here is some additional info from my experiments over the weekend.
>>
>> I took the Lenovo T500 and removed its internal WiFi miniPCIe card.  In
>> its place, I put in a miniPCIe to PCIe converter card with a PCIe
>> socket.  Into that socket, I placed a PCIe dump card.  This card has a
>> switch that, when pressed, creates an SERR error.  Using the
>> utility provided by the vendor, I enabled all the bridges between the
>> card and the CPU to carry the SERR signal so that the CPU sees it as
>> an NMI.  I tested the set-up several times.  Every single time I pressed
>> the switch, I got an NMI, followed by a kdump core.  So I was sure the
>> HW setup was working correctly.
>>
>> I left two Lenovo T500s running over the weekend and when I returned this
>> morning, both had hung.  Completely frozen.  I pressed the NMI switch on
>> both systems and nothing.  No crashes, no coredumps.  It looks as if the
>> SERR/NMI is getting ignored/blocked or the CPU is completely shut down
>> (STPCLK).
>>
>> This experiment helps me prove that the software watchdog code in Xen
>> was not the problem and indeed the NMIs are getting blocked somehow.
>> This is what I now need to investigate.  Areas that I care to learn more
>> about are the SMI handler and the external chip's use of the STPCLK
>> signal to the CPU.
>>
>> As an additional bit of info, the only response we get when the systems
>> are hung is a beep when the power cord is unplugged/plugged from the
>> laptop.  I don't know if the beep is done via a HW module or whether
>> ACPI/BIOS is involved.
>>
>> Still looking for additional ideas.
>
> Already tried to disable SMIs?
>
> Jan
>
> -- 
> Siemens AG, Corporate Technology, CT T DE IT 1
> Corporate Competence Center Embedded Linux

* RE: How to generate a HW NMI
       [not found]     ` <4CA9AC25.6020707@siemens.com>
  2010-10-04 13:56       ` How to generate a HW NMI Roger Cruz
@ 2010-10-12 15:59       ` Roger Cruz
  1 sibling, 0 replies; 15+ messages in thread
From: Roger Cruz @ 2010-10-12 15:59 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xen-devel, Konrad Rzeszutek Wilk


[-- Attachment #1.1: Type: text/plain, Size: 2727 bytes --]


Hi Jan,

Just letting you know that I am grateful for the help you have been providing.  I finally got around to doing the SMI test as you have described here.  It takes a day or two to know for sure the problem is not going to happen so I will let the system stand still for a while.

This is the output of your tool.  Bit 0 was cleared so SMIs should be disabled at this point.


root@hedley-t500:~# ./smictrl -s 0
SMI-enabled chipset found:
 PCI_VENDOR_ID_INTEL:PCI_DEVICE_ID_INTEL_ICH9_1 (8086:2917)
 SMI_EN register:	00062033
 new value:		00000002
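
As a quick sanity check of the two values printed above, here is a minimal Python sketch that tests bit 0 before and after the write and lists which bits were cleared. It assumes, as described in this thread, that bit 0 of SMI_EN is the global SMI enable. The names GLOBAL_SMI_ENABLE_BIT, smis_globally_enabled and cleared_bits are made up for illustration and are not part of smictrl.

```python
# Hypothetical helpers for illustration only; they are not part of smictrl.
GLOBAL_SMI_ENABLE_BIT = 0  # assumption: bit 0 of SMI_EN is the global enable


def smis_globally_enabled(smi_en):
    """Return True if bit 0 of the given SMI_EN value is set."""
    return bool((smi_en >> GLOBAL_SMI_ENABLE_BIT) & 1)


def cleared_bits(old, new):
    """Return the positions of bits that are set in old but clear in new."""
    diff = old & ~new
    return [i for i in range(32) if (diff >> i) & 1]


old_value = 0x00062033  # "SMI_EN register" line from the output above
new_value = 0x00000002  # "new value" line from the output above

print(smis_globally_enabled(old_value))    # True
print(smis_globally_enabled(new_value))    # False
print(cleared_bits(old_value, new_value))  # [0, 4, 5, 13, 17, 18]
```

The one bit left set in the new value is bit 1. What the individual bits mean is chipset-specific and would need to be checked against the ICH9 datasheet.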


-----Original Message-----
From: Jan Kiszka [mailto:jan.kiszka@siemens.com]
Sent: Mon 10/4/2010 6:27 AM
To: Roger Cruz
Cc: Konrad Rzeszutek Wilk; xen-devel@lists.xensource.com
Subject: Re: How to generate a HW NMI
 
On 01.10.2010 21:33, Roger Cruz wrote:
> Someone else mentioned that another possibility as to why an NMI may not
> be triggered is that the system is stuck handling an SMI interrupt.  I
> haven't studied Xen code with respect to SMIs yet, but I assume that Xen
> doesn't do much in that area right?  I was under the impression that the
> BIOS usually set this up and the OSs could not even modify the handlers
> as they were in protected RAM.

We happen to face strange freezes of KVM right now as well (CPU is
apparently stuck in guest mode), and turning off SMIs cures them here
[1]. However, it's too early to draw final conclusions, we are still
collecting test results & data on the systems.

It would therefore be interesting to see if your case is similar to ours.
If you feel brave enough to turn off your SMIs (there are rumors that
CPUs /could/ get fried as some thermal management /might/ be done via
SMIs), please check out [2], build it (requires libpci and a kernel
source tree), and run "smictrl -s 0" on your box. Should give something
like this:

SMI-enabled chipset found:
 PCI_VENDOR_ID_INTEL:PCI_DEVICE_ID_INTEL_PCH_LPC_MIN+7 (8086:3b07)
 SMI_EN register:       0006403b
 new value:             00000002

If the chipset is not detected, add the PCI device ID of your ISA bridge
to the list in smictrl.c. If the new value still has bit 0 set, you are
unlucky as your BIOS has locked some SMIs against disabling. Otherwise,
SMIs are off now, and your lock up /may/ disappear. Looking forward to
your results!

Jan

[1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/60326
[2] http://git.kiszka.org/?p=smictrl.git

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


[-- Attachment #1.2: Type: text/html, Size: 3808 bytes --]

* Re: RE: How to generate a HW NMI
  2010-10-12 12:42                     ` Roger Cruz
@ 2010-10-25 15:34                       ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 15+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-10-25 15:34 UTC (permalink / raw)
  To: Roger Cruz; +Cc: Jan Kiszka, xen-devel

On Tue, Oct 12, 2010 at 08:42:13AM -0400, Roger Cruz wrote:
> Disabling SMIs is part of the experiments to be conducted today or
> tomorrow.  I will keep you posted.

Soo, what happened? Machine melted down? It caught on fire?

end of thread, other threads:[~2010-10-25 15:34 UTC | newest]

Thread overview: 15+ messages
2010-09-30 17:59 How to generate a HW NMI Roger Cruz
2010-10-01 14:15 ` Konrad Rzeszutek Wilk
2010-10-01 19:33   ` Roger Cruz
2010-10-01 20:01     ` Konrad Rzeszutek Wilk
2010-10-01 20:36       ` pciback doesn't take CardBus device Huang2, Wei
2010-10-01 20:45         ` Konrad Rzeszutek Wilk
2010-10-01 21:04           ` Huang2, Wei
     [not found]     ` <4CA9AC25.6020707@siemens.com>
2010-10-04 13:56       ` How to generate a HW NMI Roger Cruz
     [not found]         ` <4CA9E0FB.6000109@siemens.com>
2010-10-04 14:19           ` Roger Cruz
2010-10-04 15:23             ` Dan Magenheimer
     [not found]             ` <4CA9F16C.905@siemens.com>
2010-10-04 19:03               ` Roger Cruz
2010-10-11 21:20                 ` Roger Cruz
     [not found]                   ` <4CB420D4.2010507@siemens.com>
2010-10-12 12:42                     ` Roger Cruz
2010-10-25 15:34                       ` Konrad Rzeszutek Wilk
2010-10-12 15:59       ` Roger Cruz
