public inbox for linux-kernel@vger.kernel.org
* [BUG] 2.5.63: ESR killed my box!
@ 2003-02-26  4:33 Rusty Russell
  2003-02-26  5:51 ` Martin J. Bligh
  0 siblings, 1 reply; 31+ messages in thread
From: Rusty Russell @ 2003-02-26  4:33 UTC (permalink / raw)
  To: torvalds, linux-kernel; +Cc: mingo, mbligh

SMP box, compiled for UP with CONFIG_LOCAL_APIC=y freezes on boot with
last lines:

	POSIX conformance testing by UNIFIX
	masked ExtINT on CPU#0
	ESR value before enabling vector: 00000008
	[ Freeze here ]

With SMP, it boots fine (then freezes mysteriously a few mins after
boot, which is what I am still chasing):

	masked ExtINT on CPU#0
	ESR value before enabling vector: 00000000
	ESR value after enabling vector: 00000000
	...

Don't know exactly which kernel this first happened in; I usually run SMP.

Clues?
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

* Re: [BUG] 2.5.63: ESR killed my box!
@ 2003-02-26  8:43 Rusty Russell
  0 siblings, 0 replies; 31+ messages in thread
From: Rusty Russell @ 2003-02-26  8:43 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: torvalds, linux-kernel, mingo

> In message <9530000.1046238665@[10.10.2.4]> I wrote:
> Yes.  Hmm.  Wonder if that helps my SMP weirdness, too.

Didn't get that far.  Booted, then froze later in UP with esr_disable
#defined to 1, too.

I'm stumped.  2.4 works fine, 2.5 has trouble lasting 10 minutes.
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

* RE: [BUG] 2.5.63: ESR killed my box!
@ 2003-02-26 22:34 Grover, Andrew
  0 siblings, 0 replies; 31+ messages in thread
From: Grover, Andrew @ 2003-02-26 22:34 UTC (permalink / raw)
  To: Linus Torvalds, Ion Badulescu
  Cc: Martin J. Bligh, Rusty Russell, linux-kernel, mingo

> From: Linus Torvalds [mailto:torvalds@transmeta.com] 
> Wouldn't it be nicer to just fix the write instead? I can see the
> potential to actually want to change the APIC ID - in particular, if
> the SMP MP tables say that the APIC ID for the BP should be X, maybe
> we should actually write X to it instead of just using what is there.

OK so we have a redundancy. You can get the same info from MPS and from
the lapic itself.

The fact that ACPI's boot tables do not include the lapic id (just its
address) strongly suggests to me that we should similarly query the
lapic for its id instead of writing in a new value when using the
MPS tables as well.

> In particular, Mikael's patch will BUG() if the MP tables don't match
> the APIC ID. I think that's extremely rude: we should select one of
> the two and just run with it, instead of unconditionally failing.

Agree.

Regards -- Andy

* RE: [BUG] 2.5.63: ESR killed my box!
@ 2003-03-01  1:42 Mallick, Asit K
  0 siblings, 0 replies; 31+ messages in thread
From: Mallick, Asit K @ 2003-03-01  1:42 UTC (permalink / raw)
  To: Maciej W. Rozycki, Linus Torvalds
  Cc: Martin J. Bligh, William Lee Irwin III, Rusty Russell,
	linux-kernel, mingo, Mikael Pettersson, Saxena, Sunil

Linus,

Your interpretation is correct. The algorithm in the documentation (PRM
vol3) is defined to make it work for both Pentium and P6 and above
family of processors.

As Maciej mentioned, on Pentium any read of the ESR clears it, and
writes to the ESR have no effect, except for erratum #3AP, which was
fixed in C-stepping Pentium processors.

On P6 and later processors a write is needed to make the current error
bits visible in the ESR (the latch, as you mentioned). This write also
clears the current error bits. So, this will provide the current status
of the errors:

> >  - latch current state and read it:
> >
> > 	apic_write(0, APIC_ESR);	// I doubt the value matters
> > 	value = apic_read(APIC_ESR);
> >
> >    This reads the real "current state", leaving it in the latch.

To clear the ESR (the latch) you need to do back-to-back writes: the
first write clears the current error bits and the second write moves
the cleared bits (from the previous write) into the readable ESR. So,
your algorithm should work:

> >  - clear and read current state:
> >
> > 	apic_write(0, APIC_ESR);
> > 	value = apic_read(APIC_ESR);

					<<<====== another error can occur
> > 	apic_write(0, APIC_ESR);
> >

However, there is a window where another error could be generated after
the first write and read. In this case, a read after the last write will
see a non-zero value and the ESR will not be cleared. You can handle
this by repeating the write and read in a loop until the ESR reads 0.
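
A minimal sketch of that loop, run against a small software model of
the P6 behaviour described above rather than real hardware. The names
esr_model, esr_write, esr_read and esr_drain are hypothetical, and the
max_tries bound is an added assumption so the loop cannot spin forever:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software model of a P6-style ESR, not real APIC access. */
struct esr_model {
	uint32_t pending;	/* "real" error bits accumulated by the APIC */
	uint32_t latch;		/* value returned by a read of the ESR */
};

/* Models apic_write(0, APIC_ESR): latch the pending bits, then clear them. */
static void esr_write(struct esr_model *esr)
{
	esr->latch = esr->pending;
	esr->pending = 0;
}

/* Models apic_read(APIC_ESR): return the latched value. */
static uint32_t esr_read(const struct esr_model *esr)
{
	return esr->latch;
}

/*
 * Write and read until the ESR reads 0, or until max_tries attempts
 * have been made (the bound is an assumption, not from the thread).
 * Returns the bit-wise "or" of every value read along the way.
 */
static uint32_t esr_drain(struct esr_model *esr, int max_tries)
{
	uint32_t seen = 0;
	uint32_t value;

	assert(max_tries > 0);
	do {
		esr_write(esr);
		value = esr_read(esr);
		seen |= value;
	} while (value != 0 && --max_tries > 0);

	return seen;
}
```

In this model a single pending error takes two iterations: the first
write latches it, the second write latches the cleared state.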

> > Also, I would _assume_ that the error interrupt is active based on the
> > bit-wise "or" of both the latched and the real value, since the docs
> > clearly say that it must be cleared by sw by back-to-back writes
> 
>  I believe only the real value matters.

It is the real value. The error interrupt is generated when any bit is
set in the real value (the error bits); it does not use the visible ESR.
However, the ESR (latch) bits are cumulative, and if the ESR is not
cleared (using 2 writes) when we handle the interrupt, the read of the
ESR status will also contain the errors from the previous interrupt. So,
the interrupt handler should also use the clear and read current state
sequence you mentioned.

I do not know the details of the problem that James mentioned, but it
would be good to know which errors were causing it. Also, if someone
can provide a test case, we will be glad to look at it.

Thanks,
Asit



* RE: [BUG] 2.5.63: ESR killed my box!
@ 2003-03-01  3:07 Mallick, Asit K
  0 siblings, 0 replies; 31+ messages in thread
From: Mallick, Asit K @ 2003-03-01  3:07 UTC (permalink / raw)
  To: Maciej W. Rozycki, Linus Torvalds
  Cc: Martin J. Bligh, William Lee Irwin III, Rusty Russell,
	linux-kernel, mingo, Mikael Pettersson, Saxena, Sunil


I want to correct the cumulative part:

> 
> It is the real value. Error interrupt is generated when any 
> bit is set in the real value (error bits) and does not use 
> visible ESR. However, the ESR (latch) bits are cumulative and 
> if the ESR is not cleared (using 2 writes) when we handle the 
> interrupt the read of ESR status will also contain the errors 
> for the previous error. So, the interrupt handler also should 
> use the clear and read current state as you mentioned.

The error interrupt is generated based on the real error bits, and the
readable ESR bits do not affect interrupt generation (I verified this
with the architects). We need the back-to-back writes only to make the
readable ESR return 0 on a read. So, the interrupt handler should be
able to use a write to the ESR followed by a read of the ESR to get the
current error status.
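
To illustrate the corrected statement, here is a small sketch using a
software model of the P6 ESR, not real APIC access. All names
(esr_model, esr_irq_asserted, esr_handle_error) are hypothetical. It
assumes, as described above, that the interrupt condition depends only
on the real error bits and that one write plus one read returns the
current status:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of a P6-style ESR. */
struct esr_model {
	uint32_t pending;	/* real error bits; these drive the interrupt */
	uint32_t latch;		/* readable ESR; does not drive the interrupt */
};

/* Models apic_write(0, APIC_ESR): latch the pending bits, then clear them. */
static void esr_write(struct esr_model *esr)
{
	esr->latch = esr->pending;
	esr->pending = 0;
}

/* Models apic_read(APIC_ESR): return the latched value. */
static uint32_t esr_read(const struct esr_model *esr)
{
	return esr->latch;
}

/* The interrupt condition looks only at the real error bits. */
static bool esr_irq_asserted(const struct esr_model *esr)
{
	return esr->pending != 0;
}

/* Handler sketch: one write and one read yield the current error status. */
static uint32_t esr_handle_error(struct esr_model *esr)
{
	assert(esr != NULL);
	esr_write(esr);
	return esr_read(esr);
}
```

After esr_handle_error() returns, the model no longer asserts the
interrupt even though the readable ESR still holds the reported bits; a
second write is only needed if a later read should return 0.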

Thanks,
Asit



end of thread, other threads:[~2003-03-01  2:57 UTC | newest]

Thread overview: 31+ messages
2003-02-26  4:33 [BUG] 2.5.63: ESR killed my box! Rusty Russell
2003-02-26  5:51 ` Martin J. Bligh
2003-02-26  7:14   ` Rusty Russell
2003-02-26  7:27     ` William Lee Irwin III
2003-02-26 15:20       ` Linus Torvalds
2003-02-26 15:52         ` Martin J. Bligh
2003-02-26 16:37           ` Linus Torvalds
2003-02-26 16:52             ` Martin J. Bligh
2003-02-27  0:32               ` James Cleverdon
2003-02-27 11:11             ` Maciej W. Rozycki
2003-02-26 20:47     ` Ion Badulescu
2003-02-26 21:03       ` Martin J. Bligh
2003-02-26 21:16         ` Ion Badulescu
2003-02-26 21:23           ` Martin J. Bligh
2003-02-26 21:30           ` Linus Torvalds
2003-02-26 21:44             ` Ion Badulescu
2003-02-26 22:05               ` Linus Torvalds
2003-02-26 22:51                 ` Martin J. Bligh
2003-02-26 23:07                 ` Mikael Pettersson
2003-02-27  0:00                   ` Linus Torvalds
2003-02-27  0:45                     ` Martin J. Bligh
2003-02-27  1:20                       ` Ion Badulescu
2003-02-27  1:33                         ` Martin J. Bligh
2003-02-27 10:33                       ` Mikael Pettersson
2003-02-27  1:26                     ` Ion Badulescu
2003-02-27  1:40                       ` Martin J. Bligh
2003-02-27  7:17                     ` Rusty Russell
  -- strict thread matches above, loose matches on Subject: below --
2003-02-26  8:43 Rusty Russell
2003-02-26 22:34 Grover, Andrew
2003-03-01  1:42 Mallick, Asit K
2003-03-01  3:07 Mallick, Asit K
