From: James Cleverdon <jamesclv@us.ibm.com>
To: "Martin J. Bligh" <mbligh@aracnet.com>,
	Linus Torvalds <torvalds@transmeta.com>
Cc: William Lee Irwin III <wli@holomorphy.com>,
	Rusty Russell <rusty@rustcorp.com.au>,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	Mikael Pettersson <mikpe@csd.uu.se>,
	Asit Mallick <asit.k.mallick@intel.com>
Subject: Re: [BUG] 2.5.63: ESR killed my box!
Date: Wed, 26 Feb 2003 16:32:09 -0800
Message-ID: <200302261632.09436.jamesclv@us.ibm.com>
In-Reply-To: <8750000.1046278359@[10.10.2.4]>

On Wednesday 26 February 2003 08:52 am, Martin J. Bligh wrote:
[ Snip! ]
> >
> > Anyway, the above is clearly not what we're doing with the ESR right now.
> >
> > Martin: in the esr disable case you clearly write the ESR multiple times
> > ("over the head with a big hammer"), and you must do that because you
> > noticed that a single write was insufficient. Why four? Did you just
> > decide that as long as you're doing multiple writes, you might as well
> > just do "several". Or did four writes work and two didn't?
>
> The latter, IIRC, 2 writes worked most of the time, but never really fixed
> it. Using any kind of logical analysis never seemed to work on that chip
> ... brute force, trial and error, and 3 months of tearing my hair out was
> the only thing that succeeded in the end. A time I have no wish to revisit
> ;-)
>
> cc'ed James Cleverdon ... he was involved in this with PTX, and gave me
> some  pointers to hair-restorer during the Linux timeframe.
>
> M.
> -

You want _that_ story, eh?   8^)

	*	*	*	*	*

Yeah, we had ESR problems on the original NUMA-Q boxes with P6 CPUs.  On system 
shutdown, CPU 0 on one or more secondary nodes would occasionally spasm with 
an infinite stream of APIC error interrupts claiming an invalid message.  A 
couple of hardware guys and I spent a lot of time looking at the APIC bus with 
special APIC bus analyzers, etc.  We _never_ caught a malformed message on 
the APIC bus.

Once a CPU started weirding out like this, it was impossible to make it shut 
up.  We could clear the error status, and it would show cleared in the ESR, 
but the local APIC would reissue the same error interrupt as soon as we 
returned from the error handler.

In fact, with kernel printf turned off we would get about a million of them 
per second, faster than most APIC messages could be sent over the APIC bus.  
(This was a 16.6667 MHz two bit wide bus.  Messages were about 10 to 40 
frames long.)
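
A back-of-the-envelope check of that comparison, as a minimal sketch.  It 
assumes one frame occupies one bus clock, which the message does not state, 
and the function names are made up for illustration.  Under that assumption, 
any message longer than about 16 frames cannot be delivered a million times 
per second, which is why the interrupt rate points away from the bus.

```c
#include <stdio.h>

/* Bus clock taken from the message: 16.6667 MHz. */
#define APIC_BUS_HZ 16666700.0

/*
 * Upper bound on back-to-back messages per second.
 * ASSUMPTION: one frame takes exactly one bus clock and there is no
 * idle time between messages. The name is hypothetical.
 */
static double max_messages_per_second(unsigned frames_per_message)
{
    return APIC_BUS_HZ / (double)frames_per_message;
}

/* Print the bound for the 10 to 40 frame range quoted in the message. */
static void print_apic_bus_rates(void)
{
    for (unsigned frames = 10; frames <= 40; frames += 10)
        printf("%2u frames: at most %.0f messages/s\n",
               frames, max_messages_per_second(frames));
}
```

Calling print_apic_bus_rates() lists the bound for 10, 20, 30 and 40 frame 
messages; only the shortest ones could keep up with the observed rate.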

Thus, I concluded that it was some weird error state in the local APIC.  We 
never got any answer back from Intel on how to clear this state, let alone 
an admission that it existed, so we just turned off the APIC error IRQ.  Since 
we were shutting down the system anyway, this seemed an adequate kludge.

Writing 0 to the ESR four times was done out of paranoia, and a desire to 
grind the clear deeper into the local APIC's state machine.  I have no 
evidence that it ever really fixed this bug.  Nothing did.
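
For readers who have not seen the kludge, here is a rough sketch of its 
shape: mask the error entry in the local vector table, then write 0 to the 
ESR four times.  This is not the Dynix/PTX or Linux code.  The register 
offsets and the mask bit follow Intel's local APIC documentation, but the 
fake register array, the helper names, and the order of the two steps are 
assumptions made so the example can run in user space.  A real ESR also 
latches on write rather than storing the value, which this stand-in does not 
model.

```c
#include <stdint.h>

/* Local APIC register offsets and LVT mask bit, per Intel's documentation. */
#define APIC_ESR        0x280u
#define APIC_LVT_ERROR  0x370u
#define APIC_LVT_MASKED (1u << 16)

/* HYPOTHETICAL stand-in for the memory-mapped local APIC registers. */
static volatile uint32_t fake_apic[0x400 / 4];

/* Hypothetical accessors; a kernel would use its own MMIO helpers. */
static void apic_write(uint32_t reg, uint32_t val)
{
    fake_apic[reg / 4] = val;
}

static uint32_t apic_read(uint32_t reg)
{
    return fake_apic[reg / 4];
}

/*
 * Silence APIC error interrupts on the way down.
 * ASSUMPTION: masking first, then clearing; the message gives no order.
 */
static void quiet_apic_errors(void)
{
    /* Turn off the APIC error IRQ, keeping the vector bits intact. */
    apic_write(APIC_LVT_ERROR, apic_read(APIC_LVT_ERROR) | APIC_LVT_MASKED);

    /* Four writes of 0, purely out of paranoia, as described above. */
    for (int i = 0; i < 4; i++)
        apic_write(APIC_ESR, 0);
}
```

The repeated write is kept as a loop so the count stays visible; as the 
message says, there is no evidence that any particular count fixes anything.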

Maybe this weirdness was fixed in P2s or later CPUs.  Maybe.  Intel never did 
say anything about it to us.  Regardless, the four writes to ESR are still 
enshrined in Dynix/PTX's APIC error handler, and will remain a hidden 
testimony to this bug for as long as IBM maintains PTX support.

-- 
James Cleverdon
IBM xSeries Linux Solutions
{jamesclv(Unix, preferred), cleverdj(Notes)} at us dot ibm dot com