public inbox for linux-kernel@vger.kernel.org
* APIC error on SMP machine
@ 2003-09-30 21:42 Chris Rankin
  2003-10-01  1:52 ` James Cleverdon
  2003-10-01  7:47 ` Rogier Wolff
  0 siblings, 2 replies; 5+ messages in thread
From: Chris Rankin @ 2003-09-30 21:42 UTC (permalink / raw)
  To: linux-kernel

Linux-2.4.22-SMP, 1 GB RAM, devfs, gcc-3.2.3.

Hi,

Today, my dual PIII (Coppermine) refused to boot, and wrote a large number of 
these messages to the serial console instead:

APIC error on CPU1: 04(04)
APIC error on CPU1: 04(04)
APIC error on CPU1: 04(04)
APIC error on CPU1: 04(04)
APIC error on CPU1: 04(04)
APIC error on CPU1: 04(04)
APIC error on CPU1: 04(04)
APIC error on CPU1: 04(04)
APIC error on CPU1: 04(04)
APIC error on CPU1: 04(04)
APIC error on CPU1: 04(04)
APIC error on CPU1: 04(04)
APIC error on CPU1: 04(04)
APIC error on CPU1: 04(04)
APIC error on CPU1: 04(04)

Can anyone tell me what these might mean, please? The kernel source implies that 
it's a "Send accept error", but this doesn't help me in an "Ah, I can fix that!" 
sense.

Does this APIC error just mean that the CPU is unhappy in this slot, and is 
refusing to listen to the motherboard? Or is the motherboard refusing to listen 
to the CPU?

Background:
This machine has been misbehaving for a while. I thought I had worked around the 
problem by underclocking the FSB from 133 MHz to 100 MHz, but that now looks 
like it was just a "reprieve". I have tried running "nosmp", "pci=noacpi" and 
"noapic pci=noacpi" without success, and have resorted to yanking the CPU out of 
this slot entirely. (I suspect that the CPU is fine, however.) I have also 
restored the FSB to 133 MHz, so I am currently running the SMP kernel on a 
single 933 MHz PIII.

Cheers,
Chris



* Re: APIC error on SMP machine
  2003-09-30 21:42 APIC error on SMP machine Chris Rankin
@ 2003-10-01  1:52 ` James Cleverdon
  2003-10-01 10:14   ` Chris Rankin
  2003-10-01  7:47 ` Rogier Wolff
  1 sibling, 1 reply; 5+ messages in thread
From: James Cleverdon @ 2003-10-01  1:52 UTC (permalink / raw)
  To: Chris Rankin, linux-kernel

On Tuesday 30 September 2003 2:42 pm, Chris Rankin wrote:
> Linux-2.4.22-SMP, 1 GB RAM, devfs, gcc-3.2.3.
>
> Hi,
>
> Today, my dual PIII (Coppermine) refused to boot, and wrote a large number
> of these messages to the serial console instead:
>
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
>
> Can anyone tell me what these might mean, please? The kernel source implies
> that it's a "Send accept error", but this doesn't help me in an "Ah, I can
> fix that!" sense.
>
> Does this APIC error just mean that the CPU is unhappy in this slot, and is
> refusing to listen to the motherboard? Or is the motherboard refusing to
> listen to the CPU?

Neither.  An APIC send accept error means that when trying to send an 
interrupt, it was not accepted by the target.  In this case, the target is a  
CPU, either your other CPU or the same one (a CPU can send itself an 
interrupt).

While there are several reasons why this can happen, the most common ones are:

1) The target CPU is "full".  The local APIC on P54Cs through P3s only has two 
interrupt latches per interrupt "level", which is the high nibble of the IRQ 
vector number.  So, if a CPU had already latched interrupt vectors 0x30 and 
0x3A, it would have to reject any other 0x3X vector that was sent until it 
could service one of the two latched vectors.

You can force this to happen by manually binding too many IRQs that happen to 
be on the same "level" to one CPU, then causing a lot of interrupt traffic on 
those devices.

In order to avoid this problem, Linux spreads the IRQs among as many vector 
levels as possible.  Still, the vector assignment is done before any devices 
have requested interrupts.  You may get unlucky and have 3 devices on one 
level.

2) The interrupt cannot be delivered because something is wrong with it.  This 
can happen if the kernel screws up and picks "clustered" APIC mode on a 
"flat" system or vice versa.  A dual P3 system should be flat.  Check your 
dmesg log to make sure it was properly detected.  (This seldom happens unless 
you're doing interrupt development work in Linux.)

3) Maybe the other CPU is broken and physically cannot accept the interrupt.  
Do any previous kernels boot?
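
As a rough illustration of the two-latch limit described in 1), here is a toy Python model. The names `ToyLocalApic`, `deliver` and `service` are made up for this sketch and are not kernel or hardware interfaces. The model ignores priorities and everything else a real local APIC does; it only counts latched vectors per level, taking the level to be the high nibble of the vector as stated above.

```python
from collections import defaultdict

# Assumption taken from the explanation above: two latches per level
# on P54C through P3 local APICs.
LATCHES_PER_LEVEL = 2


def level(vector):
    """Return the interrupt 'level', i.e. the high nibble of the vector."""
    return vector >> 4


class ToyLocalApic:
    """Hypothetical model: tracks which vectors are latched at each level."""

    def __init__(self):
        self.latched = defaultdict(list)

    def deliver(self, vector):
        """Try to latch a vector; False models a rejected interrupt."""
        slots = self.latched[level(vector)]
        if len(slots) >= LATCHES_PER_LEVEL:
            return False  # the sender would see a send accept error
        slots.append(vector)
        return True

    def service(self, vector):
        """Servicing a latched vector frees its slot at that level."""
        self.latched[level(vector)].remove(vector)


apic = ToyLocalApic()
# 0x30 and 0x3A fill level 3, so 0x35 is refused; 0x41 is on level 4.
results = [apic.deliver(v) for v in (0x30, 0x3A, 0x35, 0x41)]
apic.service(0x30)
retry = apic.deliver(0x35)  # a slot on level 3 is free again
```

In this model the third 0x3X vector is only accepted once one of the two latched ones has been serviced, which is the situation described in 1).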

> Background:
> This machine has been misbehaving for a while. I thought I had worked
> around the problem by underclocking the FSB from 133 MHz to 100 MHz, but
> that now looks like it was just a "reprieve". I have tried running "nosmp",
> "pci=noacpi" and "noapic pci=noacpi" without success, and have resorted to
> yanking the CPU out of this slot entirely. (I suspect that the CPU is fine,
> however.) I have also restored the FSB to 133 MHz, so I am currently
> running the SMP kernel on a single 933 MHz PIII.
>
> Cheers,
> Chris
>


-- 
James Cleverdon
IBM xSeries Linux Solutions
{jamesclv(Unix, preferred), cleverdj(Notes)} at us dot ibm dot comm


* Re: APIC error on SMP machine
  2003-09-30 21:42 APIC error on SMP machine Chris Rankin
  2003-10-01  1:52 ` James Cleverdon
@ 2003-10-01  7:47 ` Rogier Wolff
  1 sibling, 0 replies; 5+ messages in thread
From: Rogier Wolff @ 2003-10-01  7:47 UTC (permalink / raw)
  To: Chris Rankin; +Cc: linux-kernel

On Tue, Sep 30, 2003 at 10:42:19PM +0100, Chris Rankin wrote:
> Linux-2.4.22-SMP, 1 GB RAM, devfs, gcc-3.2.3.
> 
> Hi,
> 
> Today, my dual PIII (Coppermine) refused to boot, and wrote a large number 
> of these messages to the serial console instead:
> 
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)

> Can anyone tell me what these might mean, please? The kernel source implies 
> that it's a "Send accept error", but this doesn't help me in an "Ah, I can 
> fix that!" sense.

I rewrote that code to make it spit out those messages that you
see. That however doesn't mean I know what I'm doing....

The APIC chip has a bit register that indicates errors. The kernel
reads the register, stores it, and that should clear the error. Just to
be sure, we read it again, and store the result. Then we print the two
results. 

In your case, the APIC seems to have a problem, and it doesn't go away
when we read the register, as it should. 

On my "BP6" motherboard, I often see 04(08) errors: The error changes
after I read it once.
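
The read-twice-then-print sequence described above can be sketched in Python as follows. This is not the kernel code: `handle_apic_error` and `ToyEsr` are hypothetical names, and the register model (a read returns the latched bits and clears them, after which the next queued error is latched) is an assumption made only to reproduce the message format seen in this thread.

```python
def handle_apic_error(cpu, read_esr):
    """Read the error register twice and format both raw values."""
    first = read_esr()   # per the description, this read should clear the error
    second = read_esr()  # read again to see whether anything is set now
    return "APIC error on CPU%d: %02x(%02x)" % (cpu, first, second)


class ToyEsr:
    """Hypothetical stand-in for the APIC error register."""

    def __init__(self, incoming):
        self.incoming = list(incoming)
        self.latched = self.incoming.pop(0) if self.incoming else 0

    def read(self):
        # Return the latched bits, then latch the next queued error (or 0).
        value = self.latched
        self.latched = self.incoming.pop(0) if self.incoming else 0
        return value


# The same error arrives again between the two reads.
line_persistent = handle_apic_error(1, ToyEsr([0x04, 0x04]).read)
# A different error arrives between the two reads.
line_changing = handle_apic_error(1, ToyEsr([0x04, 0x08]).read)
# Nothing new arrives, so the second read comes back clean.
line_cleared = handle_apic_error(0, ToyEsr([0x04]).read)
```

Under these assumptions the first case gives the 04(04) form from the original report and the second gives the 04(08) form seen on the BP6.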

The code was printing the whole bitflag shebang before reading it again,
allowing the system to generate another error in the meanwhile, and
hanging the machine. To prevent this, I modified it to just print the
raw bits, trusting that you'd be knowledgeable enough to grep through the
kernel sources to find the definitions of the bits. That proved true.
And as expected, you (just like me) don't know what to do with the
definition of that bit anyway. 
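
For readers who do not want to grep, a small Python sketch of decoding the raw bits follows. Only the meaning of 0x04 ("Send accept error") is confirmed in this thread. The other names are an assumption based on the bit list in the comment in the 2.4 kernel's arch/i386/kernel/apic.c and should be checked against your own tree. `ESR_BITS` and `decode_esr` are hypothetical names.

```python
# Bit position -> meaning. Assumed from the kernel source comment; only
# bit 2 (0x04, "Send accept error") is confirmed by this thread.
ESR_BITS = {
    0: "Send CS error",
    1: "Receive CS error",
    2: "Send accept error",
    3: "Receive accept error",
    4: "Reserved",
    5: "Send illegal vector",
    6: "Received illegal vector",
    7: "Illegal register address",
}


def decode_esr(value):
    """Return the names of all bits set in a raw error register value."""
    return [name for bit, name in sorted(ESR_BITS.items())
            if value & (1 << bit)]


first_read = decode_esr(0x04)
second_read_bp6 = decode_esr(0x08)
both = decode_esr(0x0C)
```

With this table, the 04(08) seen on the BP6 would read as a send accept error followed by a receive accept error.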

On the BP6 it seems that the APIC bus is a bit noisy. So we get
transmission errors on that bus, allowing for a variety of errors on the
receiving end. In your case, the errors seem to end up happening faster 
than the machine can handle :-(

		Roger. 

-- 
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2600998 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
**** "Linux is like a wigwam -  no windows, no gates, apache inside!" ****


* Re: APIC error on SMP machine
  2003-10-01  1:52 ` James Cleverdon
@ 2003-10-01 10:14   ` Chris Rankin
  0 siblings, 0 replies; 5+ messages in thread
From: Chris Rankin @ 2003-10-01 10:14 UTC (permalink / raw)
  To: jamesclv, linux-kernel; +Cc: R.E.Wolff

 --- James Cleverdon <jamesclv@us.ibm.com> wrote:
> An APIC send accept error means that when trying to
> send an interrupt, it was not accepted by the
> target.
> In this case, the target is a CPU, either your other
> CPU or the same one (a CPU can send itself an 
> interrupt).
...
> 3) Maybe the other CPU is broken and physically
> cannot accept the interrupt.

Given the background, the most likely cause would seem
to be a bad CPU/motherboard connection. I have
realised that the APIC error is for CPU1, but I have
actually removed CPU0. And a bad CPU0 would explain
why "nosmp" didn't work either.

It's a pity that "nosmp" doesn't (logically cannot?)
take a "boot CPU number" as a parameter.

> Do any previous kernels boot?

Not any more. Everything started to hit the fan at the
beginning of August, and I thought that I had
"patched" things by underclocking the FSB. However,
that only seems to have delayed the inevitable. CPU
slot 2 on my motherboard just seems not to work any
more. I have no idea why - it's not like I can see a
lot of dust and dirt in there.

Oh well, I hear that Dell are selling dual 2.6 GHz
Xeons with RedHat preinstalled nowadays. (These should
have "hyperthreading support", right ;-) ?)

Cheers,
Chris




* Re: APIC error on SMP machine
@ 2003-10-01 13:08 Matt_Domsch
  0 siblings, 0 replies; 5+ messages in thread
From: Matt_Domsch @ 2003-10-01 13:08 UTC (permalink / raw)
  To: rankincj; +Cc: jamesclv, linux-kernel, R.E.Wolff

> Oh well, I hear that Dell are selling dual 2.6 GHz
> Xeons with RedHat preinstalled nowadays. (These should
> have "hyperthreading support", right ;-) ?)

Right.

-- 
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com



