public inbox for linux-acpi@vger.kernel.org
* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
       [not found]   ` <Pine.LNX.4.55.0806161636510.20218@cliff.in.clinika.pl>
@ 2008-06-16 22:38     ` Rafael J. Wysocki
  2008-06-16 23:05       ` Rafael J. Wysocki
  2008-06-17 20:59       ` Rafael J. Wysocki
  0 siblings, 2 replies; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-16 22:38 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Stephen Rothwell, linux-next, LKML, Ingo Molnar, Thomas Gleixner,
	ACPI Devel Maling List, Len Brown

On Monday, 16 of June 2008, Maciej W. Rozycki wrote:
> On Mon, 16 Jun 2008, Rafael J. Wysocki wrote:
> 
> > > > commit 7e3530cd98a0c6ab38f5898e855a5beffab26561
> > > > Author: Maciej W. Rozycki <macro@linux-mips.org>
> > > > Date:   Tue May 27 21:19:51 2008 +0100
> > > > 
> > > >     x86: I/O APIC: timer through 8259A second-chance
> > > > 
> > > >     Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
> > > >     Signed-off-by: Ingo Molnar <mingo@elte.hu>
> > > 
> > >  Can I have .config used and a full bootstrap log from that system with
> > > the patch still applied?
> > 
> > That may be difficult, because with the patch applied the box either doesn't
> > boot at all, or works unreliably when booted (depending on the set of patches
> > applied on top of it).
> 
>  Serial console?

No, this box doesn't have any serial ports.  It has a FireWire one, but I don't
have a matching cable ...

>  I'm most interested in one from a configuration that 
> does not boot at all as that's easier to reproduce, determine the cause
> and verify whether a change fixes the problem or not.  Other
> configurations may then be tested with the fix in place.

With the -next from today (20080616) I get a different picture.

Without any patches on top it boots, but the fan is turned 100% on as soon as
the ACPI modules get loaded, regardless of the temperature (normally it does
that above 75^o C, which is impossible to get normally, because there are 3
temperature trip points below that level; generally the hardware only does that
when overheating).  After that, things start to go _very_ slow, like 10x slower
than usual in X and somewhat slower in the fb console, but I was able to get
a dmesg output.  This is reproducible 100% of the time.

With commit 7e3530cd98a0c6ab38f5898e855a5beffab26561 reverted the box seems to
work normally.  However, while I was writing this message, ACPI decided it was
overheating and emergency shut down the box, although that was completely
wrong.  Next time I'll try with the C1E patches reverted.

The .config is at: http://www.sisk.pl/kernel/debug/20080616/next-config

dmesg output without any patches is at
http://www.sisk.pl/kernel/debug/20080616/dmesg-1.log

dmesg output with commit 7e3530cd98a0c6ab38f5898e855a5beffab26561 reverted is
at: http://www.sisk.pl/kernel/debug/20080616/dmesg-2.log

(they look pretty similar to my untrained eye, but well).

Thanks,
Rafael


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-16 22:38     ` linux-next: Tree for June 13: IO APIC breakage on HP nx6325 Rafael J. Wysocki
@ 2008-06-16 23:05       ` Rafael J. Wysocki
  2008-06-17  7:12         ` Thomas Gleixner
  2008-06-17 20:59       ` Rafael J. Wysocki
  1 sibling, 1 reply; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-16 23:05 UTC (permalink / raw)
  To: Maciej W. Rozycki, Thomas Gleixner
  Cc: Stephen Rothwell, linux-next, LKML, Ingo Molnar,
	ACPI Devel Maling List, Len Brown

On Tuesday, 17 of June 2008, Rafael J. Wysocki wrote:
> On Monday, 16 of June 2008, Maciej W. Rozycki wrote:
> > On Mon, 16 Jun 2008, Rafael J. Wysocki wrote:
> > 
> > > > > commit 7e3530cd98a0c6ab38f5898e855a5beffab26561
> > > > > Author: Maciej W. Rozycki <macro@linux-mips.org>
> > > > > Date:   Tue May 27 21:19:51 2008 +0100
> > > > > 
> > > > >     x86: I/O APIC: timer through 8259A second-chance
> > > > > 
> > > > >     Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
> > > > >     Signed-off-by: Ingo Molnar <mingo@elte.hu>
> > > > 
> > > >  Can I have .config used and a full bootstrap log from that system with
> > > > the patch still applied?
> > > 
> > > That may be difficult, because with the patch applied the box either doesn't
> > > boot at all, or works unreliably when booted (depending on the set of patches
> > > applied on top of it).
> > 
> >  Serial console?
> 
> No, this box doesn't have any serial ports.  It has a FireWire one, but I don't
> have a matching cable ...
> 
> >  I'm most interested in one from a configuration that 
> > does not boot at all as that's easier to reproduce, determine the cause
> > and verify whether a change fixes the problem or not.  Other
> > configurations may then be tested with the fix in place.
> 
> With the -next from today (20080616) I get a different picture.
> 
> Without any patches on top it boots, but the fan is turned 100% on as soon as
> the ACPI modules get loaded, regardless of the temperature (normally it does
> that above 75^o C, which is impossible to get normally, because there are 3
> temperature trip points below that level; generally the hardware only does that
> when overheating).  After that, things start to go _very_ slow, like 10x slower
> than usual in X and somewhat slower in the fb console, but I was able to get
> a dmesg output.  This is reproducible 100% of the time.
> 
> With commit 7e3530cd98a0c6ab38f5898e855a5beffab26561 reverted the box seems to
> work normally.  However, while I was writing this message, ACPI decided it was
> overheating and emergency shut down the box, although that was completely
> wrong.  Next time I'll try with the C1E patches reverted.
> 
> The .config is at: http://www.sisk.pl/kernel/debug/20080616/next-config
> 
> dmesg output without any patches is at
> http://www.sisk.pl/kernel/debug/20080616/dmesg-1.log
> 
> dmesg output with commit 7e3530cd98a0c6ab38f5898e855a5beffab26561 reverted is
> at: http://www.sisk.pl/kernel/debug/20080616/dmesg-2.log
> 
> (they look pretty similar to my untrained eye, but well).

BTW, with the C1E patches reverted I don't get the
WARNING: at /home/rafael/src/linux-next/kernel/smp.c:215 smp_call_function_single+0x3d/0xa2
in the log.  Thomas?

dmesg with commit 7e3530cd98a0c6ab38f5898e855a5beffab26561 and with the C1E
commits (the ones between 8750bf598db6a0ea3475d1cf8da922b325941e12 and
aa83f3f2cfc74d66d01b1d2eb1485ea1103a0f4e inclusive) reverted is at:
http://www.sisk.pl/kernel/debug/20080616/dmesg-3.log

Thanks,
Rafael


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-16 23:05       ` Rafael J. Wysocki
@ 2008-06-17  7:12         ` Thomas Gleixner
  2008-06-17 20:44           ` Rafael J. Wysocki
  0 siblings, 1 reply; 73+ messages in thread
From: Thomas Gleixner @ 2008-06-17  7:12 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Maciej W. Rozycki, Stephen Rothwell, linux-next, LKML,
	Ingo Molnar, ACPI Devel Maling List, Len Brown

On Tue, 17 Jun 2008, Rafael J. Wysocki wrote:
> 
> BTW, with the C1E patches reverted I don't get the
> WARNING: at /home/rafael/src/linux-next/kernel/smp.c:215 smp_call_function_single+0x3d/0xa2
> in the log.  Thomas?

Yeah, my bad. Fix below.

Thanks,
	tglx

------------------->
Subject: x86: c1e_idle, run BROADCAST_FORCE notify with irqs enabled
From: Thomas Gleixner <tglx@linutronix.de>
Date: Tue, 17 Jun 2008 09:07:53 +0200

The BROADCAST_FORCE notification uses smp_function_call and therefore
must be run with interrupts enabled.

While at it, add a comment for the BROADCAST_EXIT notifier as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 57fa86d..1450e0f 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -267,17 +267,29 @@ static void c1e_idle(void)
 
 		if (!cpu_isset(cpu, c1e_mask)) {
 			cpu_set(cpu, c1e_mask);
-			/* Force broadcast so ACPI can not interfere */
+			/*
+			 * Force broadcast so ACPI can not interfere. Needs
+			 * to run with interrupts enabled as it uses
+			 * smp_function_call.
+			 */
+			local_irq_enable();
 			clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_FORCE,
 					   &cpu);
 			printk(KERN_INFO "Switch to broadcast mode on CPU%d\n",
 			       cpu);
+			local_irq_disable();
 		}
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
+
 		default_idle();
-		local_irq_disable();
-		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
-		local_irq_enable();
+
+		/*
+		 * The switch back from broadcast mode needs to be
+		 * called with interrupts disabled.
+		 */
+		local_irq_disable();
+		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
+		local_irq_enable();
 	} else
 		default_idle();
 }



* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-17  7:12         ` Thomas Gleixner
@ 2008-06-17 20:44           ` Rafael J. Wysocki
  2008-06-17 22:19             ` Rafael J. Wysocki
  0 siblings, 1 reply; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-17 20:44 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Maciej W. Rozycki, Stephen Rothwell, linux-next, LKML,
	Ingo Molnar, ACPI Devel Maling List, Len Brown

On Tuesday, 17 of June 2008, Thomas Gleixner wrote:
> On Tue, 17 Jun 2008, Rafael J. Wysocki wrote:
> > 
> > BTW, with the C1E patches reverted I don't get the
> > WARNING: at /home/rafael/src/linux-next/kernel/smp.c:215 smp_call_function_single+0x3d/0xa2
> > in the log.  Thomas?
> 
> Yeah, my bad. Fix below.

Thanks, it eliminates the WARNING, but still the box doesn't work with
the "x86: add C1E aware idle function" patch applied, even with 'highres=off'.

The main symptom is that CPU loads are computed incorrectly (I got X using 126%
of CPU time from 'top', for example).  Apart from this, some processes (like
gkrellm) seem to be 'frozen' and only change their state in 'jumps', as though
they only got CPU from time to time at random.

Reverting the above-mentioned patch fixes those problems.

Thanks,
Rafael


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-16 22:38     ` linux-next: Tree for June 13: IO APIC breakage on HP nx6325 Rafael J. Wysocki
  2008-06-16 23:05       ` Rafael J. Wysocki
@ 2008-06-17 20:59       ` Rafael J. Wysocki
  2008-06-17 21:19         ` Maciej W. Rozycki
  1 sibling, 1 reply; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-17 20:59 UTC (permalink / raw)
  To: Maciej W. Rozycki, Ingo Molnar
  Cc: Stephen Rothwell, linux-next, LKML, Thomas Gleixner,
	ACPI Devel Maling List, Len Brown

On Tuesday, 17 of June 2008, Rafael J. Wysocki wrote:
> On Monday, 16 of June 2008, Maciej W. Rozycki wrote:
> > On Mon, 16 Jun 2008, Rafael J. Wysocki wrote:
> > 
> > > > > commit 7e3530cd98a0c6ab38f5898e855a5beffab26561
> > > > > Author: Maciej W. Rozycki <macro@linux-mips.org>
> > > > > Date:   Tue May 27 21:19:51 2008 +0100
> > > > > 
> > > > >     x86: I/O APIC: timer through 8259A second-chance
> > > > > 
> > > > >     Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
> > > > >     Signed-off-by: Ingo Molnar <mingo@elte.hu>
> > > > 
> > > >  Can I have .config used and a full bootstrap log from that system with
> > > > the patch still applied?
> > > 
> > > That may be difficult, because with the patch applied the box either doesn't
> > > boot at all, or works unreliably when booted (depending on the set of patches
> > > applied on top of it).
> > 
> >  Serial console?
> 
> No, this box doesn't have any serial ports.  It has a FireWire one, but I don't
> have a matching cable ...
> 
> >  I'm most interested in one from a configuration that 
> > does not boot at all as that's easier to reproduce, determine the cause
> > and verify whether a change fixes the problem or not.  Other
> > configurations may then be tested with the fix in place.
> 
> With the -next from today (20080616) I get a different picture.
> 
> Without any patches on top it boots, but the fan is turned 100% on as soon as
> the ACPI modules get loaded, regardless of the temperature (normally it does
> that above 75^o C, which is impossible to get normally, because there are 3
> temperature trip points below that level; generally the hardware only does that
> when overheating).  After that, things start to go _very_ slow, like 10x slower
> than usual in X and somewhat slower in the fb console, but I was able to get
> a dmesg output.  This is reproducible 100% of the time.
> 
> With commit 7e3530cd98a0c6ab38f5898e855a5beffab26561 reverted the box seems to
> work normally.

To debug this problem a bit more, I applied the following change:

--- linux-next.orig/arch/x86/kernel/io_apic_64.c
+++ linux-next/arch/x86/kernel/io_apic_64.c
@@ -1667,7 +1667,7 @@ static inline void __init check_timer(vo
 	pin2  = ioapic_i8259.pin;
 	apic2 = ioapic_i8259.apic;
 
-	apic_printk(APIC_VERBOSE,KERN_INFO "..TIMER: vector=0x%02X apic1=%d pin1=%d apic2=%d pin2=%d\n",
+	printk(KERN_CRIT "TIMER: vector=0x%02X apic1=%d pin1=%d apic2=%d pin2=%d\n",
 		cfg->vector, apic1, pin1, apic2, pin2);
 
 	if (pin1 != -1) {

and found that apic1=0, pin1=2, apic2=-1, pin2=-1.  Moreover, the
(!no_timer_check && timer_irq_works()) test evidently fails, so the timer
cannot be connected to apic1, but the patch forcibly ignores that, which in
turn, on this particular box, confuses the heck out of the northbridge.

May I gently ask that the patch ("x86: I/O APIC: timer through 8259A second-chance")
be reverted?

Thanks,
Rafael


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-17 20:59       ` Rafael J. Wysocki
@ 2008-06-17 21:19         ` Maciej W. Rozycki
  2008-06-17 21:38           ` Rafael J. Wysocki
  0 siblings, 1 reply; 73+ messages in thread
From: Maciej W. Rozycki @ 2008-06-17 21:19 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Ingo Molnar, Stephen Rothwell, linux-next, LKML, Thomas Gleixner,
	ACPI Devel Maling List, Len Brown

On Tue, 17 Jun 2008, Rafael J. Wysocki wrote:

> May I gently ask that the patch ("x86: I/O APIC: timer through 8259A second-chance")
> be reverted?

 We're trying to find a solution for a long-standing problem and this
patch is a step in that direction.  We need to find out exactly what is
going wrong with the HP nx6325 system and removing the patch would make us
lose the opportunity to get things right in this area.  At the time I
submitted that patch I warned a lot of testing would be required before it
goes upstream and hopefully my request will get honored.  If you do not
want to participate in testing for whatever reason, you have the right to
do so, but I insist that the patch stay at least until we know the source
of the problem and conclude there is no other way to get it fixed.  Len
reported he's got the same system and it behaves the same, so I hope he'll
be able to do the testing if you decide to opt out.

 Unfortunately the 64-bit variation has a lot of necessary logging
disabled by default (as you have now discovered with the need to rename
apic_printk() to printk()), so my plan is to cook up a patch to enable all
the available logging facilities around that code first.  I was very tired
yesterday and could not afford having a look at the logs -- sorry about
that.  I'll try to do it tonight and see if there is anything else I can
do.

  Maciej


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-17 21:19         ` Maciej W. Rozycki
@ 2008-06-17 21:38           ` Rafael J. Wysocki
  2008-06-17 22:53             ` Rafael J. Wysocki
  0 siblings, 1 reply; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-17 21:38 UTC (permalink / raw)
  To: Maciej W. Rozycki, Ingo Molnar
  Cc: Stephen Rothwell, linux-next, LKML, Thomas Gleixner,
	ACPI Devel Maling List, Len Brown

On Tuesday, 17 of June 2008, Maciej W. Rozycki wrote:
> On Tue, 17 Jun 2008, Rafael J. Wysocki wrote:
> 
> > May I gently ask that the patch ("x86: I/O APIC: timer through 8259A second-chance")
> > be reverted?
> 
>  We're trying to find a solution for a long-standing problem and this
> patch is a step in that direction.  We need to find out exactly what is
> going wrong with the HP nx6325 system and removing the patch would make us
> lose the opportunity to get things right in this area.  At the time I
> submitted that patch I warned a lot of testing would be required before it
> goes upstream and hopefully my request will get honored.  If you do not
> want to participate in testing for whatever reason, you have the right to
> do so, but I insist that the patch stay at least until we know the source
> of the problem and conclude there is no other way to get it fixed.  Len
> reported he's got the same system and it behaves the same, so I hope he'll
> be able to do the testing if you decide to opt out.

I can do the testing actually, but IMO putting that patch into linux-next was a
mistake.

>  Unfortunately the 64-bit variation has a lot of necessary logging
> disabled by default (as you have now discovered with the need to rename
> apic_printk() to printk()), so my plan is to cook up a patch to enable all
> the available logging facilities around that code first.

Well, that's easy.  I can send you a dmesg output with all of the printk()s in
there functional if that helps, but frankly I don't see how this is going to
get you more information than I've already posted.

Thanks,
Rafael


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-17 20:44           ` Rafael J. Wysocki
@ 2008-06-17 22:19             ` Rafael J. Wysocki
  2008-06-17 22:25               ` Rafael J. Wysocki
  2008-06-18 13:14               ` Ingo Molnar
  0 siblings, 2 replies; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-17 22:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Maciej W. Rozycki, Stephen Rothwell, linux-next, LKML,
	Ingo Molnar, ACPI Devel Maling List, Len Brown

On Tuesday, 17 of June 2008, Rafael J. Wysocki wrote:
> On Tuesday, 17 of June 2008, Thomas Gleixner wrote:
> > On Tue, 17 Jun 2008, Rafael J. Wysocki wrote:
> > > 
> > > BTW, with the C1E patches reverted I don't get the
> > > WARNING: at /home/rafael/src/linux-next/kernel/smp.c:215 smp_call_function_single+0x3d/0xa2
> > > in the log.  Thomas?
> > 
> > Yeah, my bad. Fix below.
> 
> Thanks, it eliminates the WARNING, but still the box doesn't work with
> the "x86: add C1E aware idle function" patch applied, even with 'highres=off'.
> 
> The main symptom is that CPU loads are computed incorrectly (I got X using 126%
> of CPU time from 'top', for example).  Apart from this, some processes (like
> gkrellm) seem to be 'frozen' and only change their state in 'jumps', as though
> they only got CPU from time to time at random.
> 
> Reverting the above-mentioned patch fixes those problems.

Ah.  If your fix is replaced with the appended one, the system happily works
with C1E and highres.

Thanks,
Rafael


---
 arch/x86/kernel/process.c |   16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

Index: linux-next/arch/x86/kernel/process.c
===================================================================
--- linux-next.orig/arch/x86/kernel/process.c
+++ linux-next/arch/x86/kernel/process.c
@@ -265,16 +265,30 @@ static void c1e_idle(void)
 	if (c1e_detected) {
 		int cpu = smp_processor_id();
 
+		local_irq_enable();
+
 		if (!cpu_isset(cpu, c1e_mask)) {
 			cpu_set(cpu, c1e_mask);
-			/* Force broadcast so ACPI can not interfere */
+			/*
+			 * Force broadcast so ACPI can not interfere. Needs
+			 * to run with interrupts enabled as it uses
+			 * smp_function_call.
+			 */
 			clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_FORCE,
 					   &cpu);
 			printk(KERN_INFO "Switch to broadcast mode on CPU%d\n",
 			       cpu);
 		}
+		local_irq_disable();
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
+		local_irq_enable();
+
 		default_idle();
+
+		/*
+		 * The switch back from broadcast mode needs to be
+		 * called with interrupts disabled.
+		 */
 		local_irq_disable();
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
 		local_irq_enable();


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-17 22:19             ` Rafael J. Wysocki
@ 2008-06-17 22:25               ` Rafael J. Wysocki
  2008-06-18  8:02                 ` Thomas Gleixner
  2008-06-18 13:15                 ` Ingo Molnar
  2008-06-18 13:14               ` Ingo Molnar
  1 sibling, 2 replies; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-17 22:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Maciej W. Rozycki, Stephen Rothwell, linux-next, LKML,
	Ingo Molnar, ACPI Devel Maling List, Len Brown

On Wednesday, 18 of June 2008, Rafael J. Wysocki wrote:
> On Tuesday, 17 of June 2008, Rafael J. Wysocki wrote:
> > On Tuesday, 17 of June 2008, Thomas Gleixner wrote:
> > > On Tue, 17 Jun 2008, Rafael J. Wysocki wrote:
> > > > 
> > > > BTW, with the C1E patches reverted I don't get the
> > > > WARNING: at /home/rafael/src/linux-next/kernel/smp.c:215 smp_call_function_single+0x3d/0xa2
> > > > in the log.  Thomas?
> > > 
> > > Yeah, my bad. Fix below.
> > 
> > Thanks, it eliminates the WARNING, but still the box doesn't work with
> > the "x86: add C1E aware idle function" patch applied, even with 'highres=off'.
> > 
> > The main symptom is that CPU loads are computed incorrectly (I got X using 126%
> > of CPU time from 'top', for example).  Apart from this, some processes (like
> > gkrellm) seem to be 'frozen' and only change their state in 'jumps', as though
> > they only got CPU from time to time at random.
> > 
> > Reverting the above-mentioned patch fixes those problems.
> 
> Ah.  If your fix is replaced with the appended one, the system happily works
> with C1E and highres.

Scratch that.  The symptoms appeared later this time, that's all.  I've just got
b43 consuming 90+ % of the CPU time. :-(

Thanks,
Rafael


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-17 21:38           ` Rafael J. Wysocki
@ 2008-06-17 22:53             ` Rafael J. Wysocki
  2008-06-18  4:02               ` Maciej W. Rozycki
  0 siblings, 1 reply; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-17 22:53 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Ingo Molnar, Stephen Rothwell, linux-next, LKML, Thomas Gleixner,
	ACPI Devel Maling List, Len Brown

On Tuesday, 17 of June 2008, Rafael J. Wysocki wrote:
> On Tuesday, 17 of June 2008, Maciej W. Rozycki wrote:
> > On Tue, 17 Jun 2008, Rafael J. Wysocki wrote:
> > 
> > > May I gently ask that the patch ("x86: I/O APIC: timer through 8259A second-chance")
> > > be reverted?
> > 
> >  We're trying to find a solution for a long-standing problem and this
> > patch is a step in that direction.  We need to find out exactly what is
> > going wrong with the HP nx6325 system and removing the patch would make us
> > lose the opportunity to get things right in this area.  At the time I
> > submitted that patch I warned a lot of testing would be required before it
> > goes upstream and hopefully my request will get honored.  If you do not
> > want to participate in testing for whatever reason, you have the right to
> > do so, but I insist that the patch stay at least until we know the source
> > of the problem and conclude there is no other way to get it fixed.  Len
> > reported he's got the same system and it behaves the same, so I hope he'll
> > be able to do the testing if you decide to opt out.
> 
> I can do the testing actually, but IMO putting that patch into linux-next was a
> mistake.
> 
> >  Unfortunately the 64-bit variation has a lot of necessary logging
> > disabled by default (as you have now discovered with the need to rename
> > apic_printk() to printk()), so my plan is to cook up a patch to enable all
> > the available logging facilities around that code first.
> 
> Well, that's easy.  I can send you a dmesg output with all of the printk()s in
> there functional if that helps, but frankly I don't see how this is going to
> get you more information than I've already posted.

Here you go.  Below is the relevant snippet from yesterday's linux-next
dmesg with the patches:
"x86: I/O APIC: timer through 8259A second-chance"
"x86: add C1E aware idle function"
reverted and the appended debug patch applied.

[    0.108006] TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.108006] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
[    0.108006] ...trying to set up timer (IRQ0) through the 8259A ... <2> failed
[    0.108006] ...trying to set up timer as Virtual Wire IRQ...<2> works.

The entire dmesg is at: http://www.sisk.pl/kernel/debug/20080616/dmesg-4.log

Thanks,
Rafael

---
 arch/x86/kernel/io_apic_64.c |   26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

Index: linux-next/arch/x86/kernel/io_apic_64.c
===================================================================
--- linux-next.orig/arch/x86/kernel/io_apic_64.c
+++ linux-next/arch/x86/kernel/io_apic_64.c
@@ -1667,7 +1667,7 @@ static inline void __init check_timer(vo
 	pin2  = ioapic_i8259.pin;
 	apic2 = ioapic_i8259.apic;
 
-	apic_printk(APIC_VERBOSE,KERN_INFO "..TIMER: vector=0x%02X apic1=%d pin1=%d apic2=%d pin2=%d\n",
+	printk(KERN_CRIT "TIMER: vector=0x%02X apic1=%d pin1=%d apic2=%d pin2=%d\n",
 		cfg->vector, apic1, pin1, apic2, pin2);
 
 	if (pin1 != -1) {
@@ -1686,14 +1686,14 @@ static inline void __init check_timer(vo
 			goto out;
 		}
 		clear_IO_APIC_pin(apic1, pin1);
-		apic_printk(APIC_QUIET,KERN_ERR "..MP-BIOS bug: 8254 timer not "
+		printk(KERN_CRIT "..MP-BIOS bug: 8254 timer not "
 				"connected to IO-APIC\n");
 	}
 
-	apic_printk(APIC_VERBOSE,KERN_INFO "...trying to set up timer (IRQ0) "
+	printk(KERN_CRIT "...trying to set up timer (IRQ0) "
 				"through the 8259A ... ");
 	if (pin2 != -1) {
-		apic_printk(APIC_VERBOSE,"\n..... (found apic %d pin %d) ...",
+		printk(KERN_CRIT "\n..... (found apic %d pin %d) ...",
 			apic2, pin2);
 		/*
 		 * legacy devices should be connected to IO APIC #0
@@ -1702,7 +1702,7 @@ static inline void __init check_timer(vo
 		unmask_IO_APIC_irq(0);
 		enable_8259A_irq(0);
 		if (timer_irq_works()) {
-			apic_printk(APIC_VERBOSE," works.\n");
+			printk(KERN_CRIT " works.\n");
 			timer_through_8259 = 1;
 			nmi_watchdog_default();
 			if (nmi_watchdog == NMI_IO_APIC) {
@@ -1718,28 +1718,28 @@ static inline void __init check_timer(vo
 		disable_8259A_irq(0);
 		clear_IO_APIC_pin(apic2, pin2);
 	}
-	apic_printk(APIC_VERBOSE," failed.\n");
+	printk(KERN_CRIT " failed.\n");
 
 	if (nmi_watchdog == NMI_IO_APIC) {
-		printk(KERN_WARNING "timer doesn't work through the IO-APIC - disabling NMI Watchdog!\n");
+		printk(KERN_CRIT "timer doesn't work through the IO-APIC - disabling NMI Watchdog!\n");
 		nmi_watchdog = NMI_NONE;
 	}
 
-	apic_printk(APIC_VERBOSE, KERN_INFO "...trying to set up timer as Virtual Wire IRQ...");
+	printk(KERN_CRIT "...trying to set up timer as Virtual Wire IRQ...");
 
 	irq_desc[0].chip = &lapic_irq_type;
 	apic_write(APIC_LVT0, APIC_DM_FIXED | cfg->vector);	/* Fixed mode */
 	enable_8259A_irq(0);
 
 	if (timer_irq_works()) {
-		apic_printk(APIC_VERBOSE," works.\n");
+		printk(KERN_CRIT " works.\n");
 		goto out;
 	}
 	disable_8259A_irq(0);
 	apic_write(APIC_LVT0, APIC_LVT_MASKED | APIC_DM_FIXED | cfg->vector);
-	apic_printk(APIC_VERBOSE," failed.\n");
+	printk(KERN_CRIT " failed.\n");
 
-	apic_printk(APIC_VERBOSE, KERN_INFO "...trying to set up timer as ExtINT IRQ...");
+	printk(KERN_CRIT "...trying to set up timer as ExtINT IRQ...");
 
 	init_8259A(0);
 	make_8259A_irq(0);
@@ -1748,10 +1748,10 @@ static inline void __init check_timer(vo
 	unlock_ExtINT_logic();
 
 	if (timer_irq_works()) {
-		apic_printk(APIC_VERBOSE," works.\n");
+		printk(KERN_CRIT " works.\n");
 		goto out;
 	}
-	apic_printk(APIC_VERBOSE," failed :(.\n");
+	printk(KERN_CRIT " failed :(.\n");
 	panic("IO-APIC + timer doesn't work! Try using the 'noapic' kernel parameter\n");
 out:
 	local_irq_restore(flags);


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-17 22:53             ` Rafael J. Wysocki
@ 2008-06-18  4:02               ` Maciej W. Rozycki
  2008-06-18 19:06                 ` Cyrill Gorcunov
  2008-06-18 22:11                 ` Rafael J. Wysocki
  0 siblings, 2 replies; 73+ messages in thread
From: Maciej W. Rozycki @ 2008-06-18  4:02 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Ingo Molnar, Stephen Rothwell, linux-next, LKML, Thomas Gleixner,
	ACPI Devel Maling List, Len Brown

On Wed, 18 Jun 2008, Rafael J. Wysocki wrote:

> Here you go.  Below is the relevant snippet from yesterday's linux-next
> dmesg with the patches:
> "x86: I/O APIC: timer through 8259A second-chance"
> "x86: add C1E aware idle function"
> reverted and the appended debug patch applied.
> 
> [    0.108006] TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> [    0.108006] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> [    0.108006] ...trying to set up timer (IRQ0) through the 8259A ... <2> failed
> [    0.108006] ...trying to set up timer as Virtual Wire IRQ...<2> works.
> 
> The entire dmesg is at: http://www.sisk.pl/kernel/debug/20080616/dmesg-4.log

 Thanks -- this is very important and useful information as it shows the
exact alternative used.

 With such a configuration the "x86: I/O APIC: timer through 8259A
second-chance" patch should not matter, because the only change it
introduces is an attempt to try the same I/O APIC pin again, but with the
IRQ0 line of the master 8259A enabled.  That's not a terribly unusual 
configuration and nothing should get confused in the system.

 Barring the unlikely possibility of the 8259A actually being wired to 
INTIN2 of the I/O APIC I can see two possible explanations:

1. The 8259A interrupt actually escapes to the CPU somehow and is handled
   as an ExtINTA interrupt.  This would make the code in check_timer()  
   decide it has found a working configuration, while actually it has been
   fooled.

2. There is a bug in this patch or an assumption it makes which results
   in the state of some component not being restored correctly.
   Unfortunately I have no resources to test the 64-bit variation of the 
   code, so something may have escaped my attention.

 I'd like to find out which one is the case -- can you please reapply the
patch and send me the corresponding section of the bootstrap log?  If the
system hangs before you can retrieve the log, please just place:

while (1);

or something like that after the out: label in check_timer().

 Thanks.

  Maciej


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-17 22:25               ` Rafael J. Wysocki
@ 2008-06-18  8:02                 ` Thomas Gleixner
  2008-06-18 12:41                   ` Thomas Gleixner
  2008-06-18 13:15                 ` Ingo Molnar
  1 sibling, 1 reply; 73+ messages in thread
From: Thomas Gleixner @ 2008-06-18  8:02 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Maciej W. Rozycki, Stephen Rothwell, linux-next, LKML,
	Ingo Molnar, ACPI Devel Maling List, Len Brown

On Wed, 18 Jun 2008, Rafael J. Wysocki wrote:
> On Wednesday, 18 of June 2008, Rafael J. Wysocki wrote:
> > On Tuesday, 17 of June 2008, Rafael J. Wysocki wrote:
> > > On Tuesday, 17 of June 2008, Thomas Gleixner wrote:
> > > > On Tue, 17 Jun 2008, Rafael J. Wysocki wrote:
> > > > > 
> > > > > BTW, with the C1E patches reverted I don't get the
> > > > > WARNING: at /home/rafael/src/linux-next/kernel/smp.c:215 smp_call_function_single+0x3d/0xa2
> > > > > in the log.  Thomas?
> > > > 
> > > > Yeah, my bad. Fix below.
> > > 
> > > Thanks, it eliminates the WARNING, but still the box doesn't work with
> > > the "x86: add C1E aware idle function" patch applied, even with 'highres=off'.
> > > 
> > > The main symptom is that CPU loads are computed incorrectly (I got X using 126%
> > > of CPU time from 'top', for example).  Apart from this, some processes (like
> > > gkrellm) seem to be 'frozen' and only change their state in 'jumps', as though
> > > they only got CPU from time to time at random.
> > > 
> > > Reverting the above-mentioned patch fixes those problems.
> > 
> > Ah.  If your fix is replaced with the appended one, the system happily works
> > with C1E and highres.
> 
> Scratch that.  The symptoms appeared later this time, that's all.  I've just got
> b43 consuming 90+ % of the CPU time. :-(

I would have been pretty surprised if it had helped :)

Does the box boot when you disable the local apic timer on the kernel
command line with the patch applied ?

Also does forcing hpet change anything ?

Thanks,
	tglx




* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-18  8:02                 ` Thomas Gleixner
@ 2008-06-18 12:41                   ` Thomas Gleixner
  2008-06-18 14:37                     ` Rafael J. Wysocki
  2008-06-18 14:40                     ` Rafael J. Wysocki
  0 siblings, 2 replies; 73+ messages in thread
From: Thomas Gleixner @ 2008-06-18 12:41 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Maciej W. Rozycki, Stephen Rothwell, linux-next, LKML,
	Ingo Molnar, ACPI Devel Maling List, Len Brown

On Wed, 18 Jun 2008, Thomas Gleixner wrote:
> On Wed, 18 Jun 2008, Rafael J. Wysocki wrote:
> > On Wednesday, 18 of June 2008, Rafael J. Wysocki wrote:
> > > On Tuesday, 17 of June 2008, Rafael J. Wysocki wrote:
> > > > On Tuesday, 17 of June 2008, Thomas Gleixner wrote:
> > > > > On Tue, 17 Jun 2008, Rafael J. Wysocki wrote:
> > > > > > 
> > > > > > BTW, with the C1E patches reverted I don't get the
> > > > > > WARNING: at /home/rafael/src/linux-next/kernel/smp.c:215 smp_call_function_single+0x3d/0xa2
> > > > > > in the log.  Thomas?
> > > > > 
> > > > > Yeah, my bad. Fix below.
> > > > 
> > > > Thanks, it eliminates the WARNING, but still the box doesn't work with
> > > > the "x86: add C1E aware idle function" patch applied, even with 'highres=off'.
> > > > 
> > > > The main symptom is that CPU loads are computed incorrectly (I got X using 126%
> > > > of CPU time from 'top', for example).  Apart from this, some processes (like
> > > > gkrellm) seem to be 'frozen' and only change their state in 'jumps', as though
> > > > they only got CPU from time to time at random.
> > > > 
> > > > Reverting the above-mentioned patch fixes those problems.
> > > 
> > > Ah.  If your fix is replaced with the appended one, the system happily works
> > > with C1E and highres.
> > 
> > Scratch that.  The symptoms appeared later this time, that's all.  I've just got
> > b43 consuming 90+ % of the CPU time. :-(
> 
> I would have been pretty surprised if it had helped :)
> 
> Does the box boot when you disable the local apic timer on the kernel
> command line with the patch applied ?
> 
> Also does forcing hpet change anything ?

I just checked that the original c1e series and the affected code in
tip are not different. IIRC you confirmed that the C1E patches would
work on your box. So I wonder what else got changed which causes these
problems.

Thanks,
	tglx


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-17 22:19             ` Rafael J. Wysocki
  2008-06-17 22:25               ` Rafael J. Wysocki
@ 2008-06-18 13:14               ` Ingo Molnar
  1 sibling, 0 replies; 73+ messages in thread
From: Ingo Molnar @ 2008-06-18 13:14 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Thomas Gleixner, Maciej W. Rozycki, Stephen Rothwell, linux-next,
	LKML, ACPI Devel Maling List, Len Brown


* Rafael J. Wysocki <rjw@sisk.pl> wrote:

> > Thanks, it eliminates the WARNING, but still the box doesn't work 
> > with the "x86: add C1E aware idle function" patch applied, even with 
> > 'highres=off'.
> > 
> > The main symptom is that CPU loads are computed incorrectly (I got X 
> > using 126% of CPU time from 'top', for example).  Apart from this, 
> > some processes (like gkrellm) seem to be 'frozen' and only change 
> > their state in 'jumps', as though they only got CPU from time to 
> > time at random.
> > 
> > Reverting the above-mentioned patch fixes those problems.
> 
> Ah.  If your fix is replaced with the appended one, the system happily 
> works with C1E and highres.

very nice! I have applied your fix to tip/x86/cpu.

does this resolve all problems on your box, or is the IO-APIC problem 
still open?

	Ingo


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-17 22:25               ` Rafael J. Wysocki
  2008-06-18  8:02                 ` Thomas Gleixner
@ 2008-06-18 13:15                 ` Ingo Molnar
  1 sibling, 0 replies; 73+ messages in thread
From: Ingo Molnar @ 2008-06-18 13:15 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Thomas Gleixner, Maciej W. Rozycki, Stephen Rothwell, linux-next,
	LKML, ACPI Devel Maling List, Len Brown


* Rafael J. Wysocki <rjw@sisk.pl> wrote:

> > Ah.  If your fix is replaced with the appended one, the system 
> > happily works with C1E and highres.
> 
> Scratch that.  The symptoms appeared later this time, that's all.  
> I've just got b43 consuming 90+ % of the CPU time. :-(

ah, ok. Discarded the patch :-/

	Ingo


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-18 12:41                   ` Thomas Gleixner
@ 2008-06-18 14:37                     ` Rafael J. Wysocki
  2008-06-18 14:40                     ` Rafael J. Wysocki
  1 sibling, 0 replies; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-18 14:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Maciej W. Rozycki, Stephen Rothwell, linux-next, LKML,
	Ingo Molnar, ACPI Devel Maling List, Len Brown

On Wednesday, 18 of June 2008, Thomas Gleixner wrote:
> On Wed, 18 Jun 2008, Thomas Gleixner wrote:
> > On Wed, 18 Jun 2008, Rafael J. Wysocki wrote:
> > > On Wednesday, 18 of June 2008, Rafael J. Wysocki wrote:
> > > > On Tuesday, 17 of June 2008, Rafael J. Wysocki wrote:
> > > > > On Tuesday, 17 of June 2008, Thomas Gleixner wrote:
> > > > > > On Tue, 17 Jun 2008, Rafael J. Wysocki wrote:
> > > > > > > 
> > > > > > > BTW, with the C1E patches reverted I don't get the
> > > > > > > WARNING: at /home/rafael/src/linux-next/kernel/smp.c:215 smp_call_function_single+0x3d/0xa2
> > > > > > > in the log.  Thomas?
> > > > > > 
> > > > > > Yeah, my bad. Fix below.
> > > > > 
> > > > > Thanks, it eliminates the WARNING, but still the box doesn't work with
> > > > > the "x86: add C1E aware idle function" patch applied, even with 'highres=off'.
> > > > > 
> > > > > The main symptom is that CPU loads are computed incorrectly (I got X using 126%
> > > > > of CPU time from 'top', for example).  Apart from this, some processes (like
> > > > > gkrellm) seem to be 'frozen' and only change their state in 'jumps', as though
> > > > > they only got CPU from time to time at random.
> > > > > 
> > > > > Reverting the above-mentioned patch fixes those problems.
> > > > 
> > > > Ah.  If your fix is replaced with the appended one, the system happily works
> > > > with C1E and highres.
> > > 
> > > Scratch that.  The symptoms appeared later this time, that's all.  I've just got
> > > b43 consuming 90+ % of the CPU time. :-(
> > 
> > I would have been pretty surprised if it had helped :)
> > 
> > Does the box boot when you disable the local apic timer on the kernel
> > command line with the patch applied ?
> > 
> > Also does forcing hpet change anything ?
> 
> I just checked that the original c1e series and the affected code in
> tip are not different. IIRC you confirmed that the C1E patches would
> work on your box. So I wonder what else got changed which causes these
> problems.

Well, probably I didn't test that long enough.

The symptoms do not always appear immediately, they sometimes appear only
after several minutes.

Thanks,
Rafael


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-18 12:41                   ` Thomas Gleixner
  2008-06-18 14:37                     ` Rafael J. Wysocki
@ 2008-06-18 14:40                     ` Rafael J. Wysocki
  2008-06-18 15:29                       ` Thomas Gleixner
  1 sibling, 1 reply; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-18 14:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Maciej W. Rozycki, Stephen Rothwell, linux-next, LKML,
	Ingo Molnar, ACPI Devel Maling List, Len Brown

On Wednesday, 18 of June 2008, Thomas Gleixner wrote:
> On Wed, 18 Jun 2008, Thomas Gleixner wrote:
> > On Wed, 18 Jun 2008, Rafael J. Wysocki wrote:
> > > On Wednesday, 18 of June 2008, Rafael J. Wysocki wrote:
> > > > On Tuesday, 17 of June 2008, Rafael J. Wysocki wrote:
> > > > > On Tuesday, 17 of June 2008, Thomas Gleixner wrote:
> > > > > > On Tue, 17 Jun 2008, Rafael J. Wysocki wrote:
> > > > > > > 
> > > > > > > BTW, with the C1E patches reverted I don't get the
> > > > > > > WARNING: at /home/rafael/src/linux-next/kernel/smp.c:215 smp_call_function_single+0x3d/0xa2
> > > > > > > in the log.  Thomas?
> > > > > > 
> > > > > > Yeah, my bad. Fix below.
> > > > > 
> > > > > Thanks, it eliminates the WARNING, but still the box doesn't work with
> > > > > the "x86: add C1E aware idle function" patch applied, even with 'highres=off'.
> > > > > 
> > > > > The main symptom is that CPU loads are computed incorrectly (I got X using 126%
> > > > > of CPU time from 'top', for example).  Apart from this, some processes (like
> > > > > gkrellm) seem to be 'frozen' and only change their state in 'jumps', as though
> > > > > they only got CPU from time to time at random.
> > > > > 
> > > > > Reverting the above-mentioned patch fixes those problems.
> > > > 
> > > > Ah.  If your fix is replaced with the appended one, the system happily works
> > > > with C1E and highres.
> > > 
> > > Scratch that.  The symptoms appeared later this time, that's all.  I've just got
> > > b43 consuming 90+ % of the CPU time. :-(
> > 
> > I would have been pretty surprised if it had helped :)
> > 
> > Does the box boot when you disable the local apic timer on the kernel
> > command line with the patch applied ?
> > 
> > Also does forcing hpet change anything ?
> 
> I just checked that the original c1e series and the affected code in
> tip are not different. IIRC you confirmed that the C1E patches would
> work on your box. So I wonder what else got changed which causes these
> problems.

Well, to eliminate any possible correlations, do you have a version of the
series or a single patch against the current mainline?

Rafael


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-18 14:40                     ` Rafael J. Wysocki
@ 2008-06-18 15:29                       ` Thomas Gleixner
  2008-06-21 22:47                         ` Rafael J. Wysocki
  0 siblings, 1 reply; 73+ messages in thread
From: Thomas Gleixner @ 2008-06-18 15:29 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Maciej W. Rozycki, Stephen Rothwell, linux-next, LKML,
	Ingo Molnar, ACPI Devel Maling List, Len Brown

On Wed, 18 Jun 2008, Rafael J. Wysocki wrote:
> > I just checked that the original c1e series and the affected code in
> > tip are not different. IIRC you confirmed that the C1E patches would
> > work on your box. So I wonder what else got changed which causes these
> > problems.
> 
> Well, to eliminate any possible correlations, do you have a version of the
> series or a single patch against the current mainline?

http://userweb.kernel.org/~tglx/952f4a-c1e-apic.patch
http://userweb.kernel.org/~tglx/952f4a-c1e.patch

c1e-apic is the forward port of the apic changes and c1e is the pure
c1e stuff. On my box it does not work w/o the c1e-apic one, but ....

Thanks,
	tglx


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-18  4:02               ` Maciej W. Rozycki
@ 2008-06-18 19:06                 ` Cyrill Gorcunov
  2008-06-18 22:36                   ` Maciej W. Rozycki
  2008-06-18 22:11                 ` Rafael J. Wysocki
  1 sibling, 1 reply; 73+ messages in thread
From: Cyrill Gorcunov @ 2008-06-18 19:06 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Rafael J. Wysocki, Ingo Molnar, Stephen Rothwell, linux-next,
	LKML, Thomas Gleixner, ACPI Devel Maling List, Len Brown

[Maciej W. Rozycki - Wed, Jun 18, 2008 at 05:02:48AM +0100]
| On Wed, 18 Jun 2008, Rafael J. Wysocki wrote:
| 
| > Here you go.  Below is the relevant snippet from the yesterday's linux-next
| > dmesg with the patches:
| > "x86: I/O APIC: timer through 8259A second-chance"
| > "x86: add C1E aware idle function"
| > reverted and the appended debug patch applied.
| > 
| > [    0.108006] TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
| > [    0.108006] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
| > [    0.108006] ...trying to set up timer (IRQ0) through the 8259A ... <2> failed
| > [    0.108006] ...trying to set up timer as Virtual Wire IRQ...<2> works.
| > 
| > The entire dmesg is at: http://www.sisk.pl/kernel/debug/20080616/dmesg-4.log
| 
|  Thanks -- this is very important and useful information as it shows the
| exact alternative used.
| 
|  With such a configuration the "x86: I/O APIC: timer through 8259A
| second-chance" patch should not matter, because the only change it
| introduces is an attempt to try the same I/O APIC pin again, but with the
| IRQ0 line of the master 8259A enabled.  That's not a terribly unusual 
| configuration and nothing should get confused in the system.
| 
|  Barring the unlikely possibility of the 8259A actually being wired to 
| INTIN2 of the I/O APIC I can see two possible explanations:
| 
| 1. The 8259A interrupt actually escapes to the CPU somehow and is handled
|    as an ExtINTA interrupt.  This would make the code in check_timer()  
|    decide it has found a working configuration, while actually it has been
|    fooled.

Maciej, that is why we get 'received illegal vector'?

	[  129.092151] APIC error on CPU1: 00(40)

| 
| 2. There is a bug in this patch or an assumption it makes which results 
|    in the state of some component not to be restored correctly.  
|    Unfortunately I have no resources to test the 64-bit variation of the 
|    code, so something may have escaped my attention.
| 
|  I'd like to find out which one is the case -- can you please reapply the
| patch and send me the corresponding section of the bootstrap log?  If the
| system hangs before you can retrieve the log, please just place:
| 
| while (1);
| 
| or something like that after the out: label in check_timer().
| 
|  Thanks.
| 
|   Maciej
| 

		- Cyrill -


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-18  4:02               ` Maciej W. Rozycki
  2008-06-18 19:06                 ` Cyrill Gorcunov
@ 2008-06-18 22:11                 ` Rafael J. Wysocki
  2008-06-18 23:39                   ` Maciej W. Rozycki
  1 sibling, 1 reply; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-18 22:11 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Ingo Molnar, Stephen Rothwell, linux-next, LKML, Thomas Gleixner,
	ACPI Devel Maling List, Len Brown

On Wednesday, 18 of June 2008, Maciej W. Rozycki wrote:
> On Wed, 18 Jun 2008, Rafael J. Wysocki wrote:
> 
> > Here you go.  Below is the relevant snippet from the yesterday's linux-next
> > dmesg with the patches:
> > "x86: I/O APIC: timer through 8259A second-chance"
> > "x86: add C1E aware idle function"
> > reverted and the appended debug patch applied.
> > 
> > [    0.108006] TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> > [    0.108006] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> > [    0.108006] ...trying to set up timer (IRQ0) through the 8259A ... <2> failed
> > [    0.108006] ...trying to set up timer as Virtual Wire IRQ...<2> works.
> > 
> > The entire dmesg is at: http://www.sisk.pl/kernel/debug/20080616/dmesg-4.log
> 
>  Thanks -- this is very important and useful information as it shows the
> exact alternative used.
> 
>  With such a configuration the "x86: I/O APIC: timer through 8259A
> second-chance" patch should not matter, because the only change it
> introduces is an attempt to try the same I/O APIC pin again, but with the
> IRQ0 line of the master 8259A enabled.  That's not a terribly unusual 
> configuration and nothing should get confused in the system.

But it _does_ get confused, really.

>  Barring the unlikely possibility of the 8259A actually being wired to 
> INTIN2 of the I/O APIC I can see two possible explanations:
> 
> 1. The 8259A interrupt actually escapes to the CPU somehow and is handled
>    as an ExtINTA interrupt.  This would make the code in check_timer()  
>    decide it has found a working configuration, while actually it has been
>    fooled.
> 
> 2. There is a bug in this patch or an assumption it makes which results 
>    in the state of some component not to be restored correctly.  
>    Unfortunately I have no resources to test the 64-bit variation of the 
>    code, so something may have escaped my attention.
> 
>  I'd like to find out which one is the case -- can you please reapply the
> patch and send me the corresponding section of the bootstrap log?  If the
> system hangs before you can retrieve the log, please just place:

Here you go:

[    0.108006] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.108006] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
[    0.108006] ...trying to set up timer (IRQ0) through the 8259A ... <3>
[    0.108006] ..... (found apic 0 pin 2) ...<3> works.

The full dmesg is at: http://www.sisk.pl/kernel/debug/20080618/dmesg-1.log

Thanks,
Rafael


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-18 19:06                 ` Cyrill Gorcunov
@ 2008-06-18 22:36                   ` Maciej W. Rozycki
  2008-06-20 18:59                     ` Cyrill Gorcunov
  0 siblings, 1 reply; 73+ messages in thread
From: Maciej W. Rozycki @ 2008-06-18 22:36 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Rafael J. Wysocki, Ingo Molnar, Stephen Rothwell, linux-next,
	LKML, Thomas Gleixner, ACPI Devel Maling List, Len Brown

On Wed, 18 Jun 2008, Cyrill Gorcunov wrote:

> | 1. The 8259A interrupt actually escapes to the CPU somehow and is handled
> |    as an ExtINTA interrupt.  This would make the code in check_timer()  
> |    decide it has found a working configuration, while actually it has been
> |    fooled.
> 
> Maciej, that is why we get 'received illegal vector'?
> 
> 	[  129.092151] APIC error on CPU1: 00(40)

 No, but that's an interesting observation, thank you -- well spotted!  

 ExtINTA stands for an "External INTA cycle" which is passed through from
the CPU down to the system bus instead of being intercepted by the local
APIC unit as usual.  In response to the INTA cycle one of the 8259A
chips (either the master or the slave, depending on the source of the
interrupt selected for handling) supplies the vector directly to the CPU
through PCI (or whatever kind of bus links the legacy bridge with the host
bridge) and then the FSB.  Therefore the vector bypasses all the APIC
circuitry and cannot result in an APIC error interrupt.

 Instead the message quoted means an APIC input is misprogrammed
somewhere.  This error happens if an interrupt is signalled to an unmasked
APIC input which uses the Fixed or Lowest-Priority delivery mode and its
vector implies priority below the minimum permitted, that is in the range
from 0 to 15.

 We have code already in place in io_apic_{32,64}.c that can be used to
find out the offender with a piece of code like this (#if 0 has to be
deactivated for this to work and there may be bit rot bugs to be fixed):

int __init all_pic_dump(void)
{
	int v = apic_verbosity;

	apic_verbosity = APIC_DEBUG;
	print_IO_APIC();
	print_all_local_APICs();
	print_PIC();
	apic_verbosity = v;

	return 0;
}

late_initcall(all_pic_dump);

if somebody is willing to aid with debugging this problem.

  Maciej


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-18 22:11                 ` Rafael J. Wysocki
@ 2008-06-18 23:39                   ` Maciej W. Rozycki
  2008-06-19  0:25                     ` Rafael J. Wysocki
  2008-06-19  9:35                     ` Ingo Molnar
  0 siblings, 2 replies; 73+ messages in thread
From: Maciej W. Rozycki @ 2008-06-18 23:39 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Ingo Molnar, Stephen Rothwell, linux-next, LKML, Thomas Gleixner,
	ACPI Devel Maling List, Len Brown

On Thu, 19 Jun 2008, Rafael J. Wysocki wrote:

> >  With such a configuration the "x86: I/O APIC: timer through 8259A
> > second-chance" patch should not matter, because the only change it
> > introduces is an attempt to try the same I/O APIC pin again, but with the
> > IRQ0 line of the master 8259A enabled.  That's not a terribly unusual 
> > configuration and nothing should get confused in the system.
> 
> But it _does_ get confused, really.

 Something certainly gets confused, but so far I am not sure which bit 
exactly it is, are you?

> >  Barring the unlikely possibility of the 8259A actually being wired to 
> > INTIN2 of the I/O APIC I can see two possible explanations:
> > 
> > 1. The 8259A interrupt actually escapes to the CPU somehow and is handled
> >    as an ExtINTA interrupt.  This would make the code in check_timer()  
> >    decide it has found a working configuration, while actually it has been
> >    fooled.
[...]
> Here you go:
> 
> [    0.108006] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> [    0.108006] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> [    0.108006] ...trying to set up timer (IRQ0) through the 8259A ... <3>
> [    0.108006] ..... (found apic 0 pin 2) ...<3> works.
> 
> The full dmesg is at: http://www.sisk.pl/kernel/debug/20080618/dmesg-1.log

 Thanks.  In this case I suspect the case #1 quoted above happens, that is
the 8259A manages to deliver its interrupt somehow.  Note at this stage it
is meant to be in the AEOI mode, so it can happily resubmit the interrupt
indefinitely with no additional handling as long as it receives INTA
cycles.

 Can you please try the patch below on top of "x86: I/O APIC: timer
through 8259A second-chance" to see whether my hypothesis is true?  It
modifies the through-8259A setup path so that the APIC input gets masked,
but the 8259A has the timer interrupt still enabled.  Let me know how the
timer interrupt is routed in this case.

 BTW, do we have any piece of technical information about the chipset
used?  The southbridge used is an ATI SB400, which is where I would
normally expect two 8259A and an I/O APIC core to be placed.

  Maciej

--- a/arch/x86/kernel/io_apic_64.c	2008-06-18 22:53:34.000000000 +0000
+++ b/arch/x86/kernel/io_apic_64.c	2008-06-18 22:58:45.000000000 +0000
@@ -1714,6 +1714,7 @@ static inline void __init check_timer(vo
 		/* replace_pin_at_irq(0, apic1, pin1, apic2, pin2); */
 		setup_timer_IRQ0_pin(apic2, pin2, cfg->vector);
 		unmask_IO_APIC_irq(0);
+		clear_IO_APIC_pin(apic2, pin2);
 		enable_8259A_irq(0);
 		if (timer_irq_works()) {
 			apic_printk(APIC_VERBOSE," works.\n");


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-18 23:39                   ` Maciej W. Rozycki
@ 2008-06-19  0:25                     ` Rafael J. Wysocki
  2008-06-20  0:35                       ` Maciej W. Rozycki
  2008-06-19  9:35                     ` Ingo Molnar
  1 sibling, 1 reply; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-19  0:25 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Ingo Molnar, Stephen Rothwell, linux-next, LKML, Thomas Gleixner,
	ACPI Devel Maling List, Len Brown

On Thursday, 19 of June 2008, Maciej W. Rozycki wrote:
> On Thu, 19 Jun 2008, Rafael J. Wysocki wrote:
> 
> > >  With such a configuration the "x86: I/O APIC: timer through 8259A
> > > second-chance" patch should not matter, because the only change it
> > > introduces is an attempt to try the same I/O APIC pin again, but with the
> > > IRQ0 line of the master 8259A enabled.  That's not a terribly unusual 
> > > configuration and nothing should get confused in the system.
> > 
> > But it _does_ get confused, really.
> 
>  Something certainly gets confused, but so far I am not sure which bit 
> exactly it is, are you?

No, I'm not.

> > >  Barring the unlikely possibility of the 8259A actually being wired to 
> > > INTIN2 of the I/O APIC I can see two possible explanations:
> > > 
> > > 1. The 8259A interrupt actually escapes to the CPU somehow and is handled
> > >    as an ExtINTA interrupt.  This would make the code in check_timer()  
> > >    decide it has found a working configuration, while actually it has been
> > >    fooled.
> [...]
> > Here you go:
> > 
> > [    0.108006] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> > [    0.108006] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> > [    0.108006] ...trying to set up timer (IRQ0) through the 8259A ... <3>
> > [    0.108006] ..... (found apic 0 pin 2) ...<3> works.
> > 
> > The full dmesg is at: http://www.sisk.pl/kernel/debug/20080618/dmesg-1.log
> 
>  Thanks.  In this case I suspect the case #1 quoted above happens, that is
> the 8259A manages to deliver its interrupt somehow.  Note at this stage it
> is meant to be in the AEOI mode, so it can happily resubmit the interrupt
> indefinitely with no additional handling as long as it receives INTA
> cycles.
> 
>  Can you please try the patch below on top of "x86: I/O APIC: timer
> through 8259A second-chance" to see whether my hypothesis is true?  It
> modifies the through-8259A setup path so that the APIC input gets masked,
> but the 8259A has the timer interrupt still enabled.  Let me know how the
> timer interrupt is routed in this case.

That helped a lot, the system seems to work normally now.

Here's the relevant snippet from dmesg:

[    0.108006] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.108006] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
[    0.108006] ...trying to set up timer (IRQ0) through the 8259A ... <3>
[    0.108006] ..... (found apic 0 pin 2) ...<3> failed.
[    0.108006] ...trying to set up timer as Virtual Wire IRQ...<3> works.

and the whole thing is at: http://www.sisk.pl/kernel/debug/20080618/dmesg-2.log
 
>  BTW, do we have any piece of technical information about the chipset
> used?

I, personally, don't have any and AMD only has SB600 documentation on its
web page (it's still marked as "AMD confidential" ;-)).

> The southbridge used is an ATI SB400, which is where I would 
> normally expect two 8259A and an I/O APIC core to be placed.

There is an interrupt controller in there, but I'm not sure if there's any
8259A.  The northbridge is on the CPU, actually.

Thanks,
Rafael


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-18 23:39                   ` Maciej W. Rozycki
  2008-06-19  0:25                     ` Rafael J. Wysocki
@ 2008-06-19  9:35                     ` Ingo Molnar
  2008-06-19 18:17                       ` Maciej W. Rozycki
  1 sibling, 1 reply; 73+ messages in thread
From: Ingo Molnar @ 2008-06-19  9:35 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Rafael J. Wysocki, Stephen Rothwell, linux-next, LKML,
	Thomas Gleixner, ACPI Devel Maling List, Len Brown


* Maciej W. Rozycki <macro@linux-mips.org> wrote:

> --- a/arch/x86/kernel/io_apic_64.c	2008-06-18 22:53:34.000000000 +0000
> +++ b/arch/x86/kernel/io_apic_64.c	2008-06-18 22:58:45.000000000 +0000
> @@ -1714,6 +1714,7 @@ static inline void __init check_timer(vo
>  		/* replace_pin_at_irq(0, apic1, pin1, apic2, pin2); */
>  		setup_timer_IRQ0_pin(apic2, pin2, cfg->vector);
>  		unmask_IO_APIC_irq(0);
> +		clear_IO_APIC_pin(apic2, pin2);
>  		enable_8259A_irq(0);
>  		if (timer_irq_works()) {
>  			apic_printk(APIC_VERBOSE," works.\n");

would it be fine with you if we applied this to tip/x86, as it unbreaks 
Rafael's box?

does PIT programming matter? One detail which might matter and which 
touches IRQ0 generation is the clockevent driver on nohz/highres. See 
arch/x86/kernel/i8253.c:init_pit_timer():

        case CLOCK_EVT_MODE_SHUTDOWN:
        case CLOCK_EVT_MODE_UNUSED:
                if (evt->mode == CLOCK_EVT_MODE_PERIODIC ||
                    evt->mode == CLOCK_EVT_MODE_ONESHOT) {
                        outb_pit(0x30, PIT_MODE);
                        outb_pit(0, PIT_CH0);
                        outb_pit(0, PIT_CH0);
                }
                pit_disable_clocksource();
                break;

        case CLOCK_EVT_MODE_ONESHOT:
                /* One shot setup */
                pit_disable_clocksource();
                outb_pit(0x38, PIT_MODE);
                break;

	Ingo


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-19  9:35                     ` Ingo Molnar
@ 2008-06-19 18:17                       ` Maciej W. Rozycki
  2008-06-20 10:44                         ` Ingo Molnar
  2008-06-20 13:11                         ` Thomas Gleixner
  0 siblings, 2 replies; 73+ messages in thread
From: Maciej W. Rozycki @ 2008-06-19 18:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rafael J. Wysocki, Stephen Rothwell, linux-next, LKML,
	Thomas Gleixner, ACPI Devel Maling List, Len Brown

On Thu, 19 Jun 2008, Ingo Molnar wrote:

> * Maciej W. Rozycki <macro@linux-mips.org> wrote:
> 
> > --- a/arch/x86/kernel/io_apic_64.c	2008-06-18 22:53:34.000000000 +0000
> > +++ b/arch/x86/kernel/io_apic_64.c	2008-06-18 22:58:45.000000000 +0000
> > @@ -1714,6 +1714,7 @@ static inline void __init check_timer(vo
> >  		/* replace_pin_at_irq(0, apic1, pin1, apic2, pin2); */
> >  		setup_timer_IRQ0_pin(apic2, pin2, cfg->vector);
> >  		unmask_IO_APIC_irq(0);
> > +		clear_IO_APIC_pin(apic2, pin2);
> >  		enable_8259A_irq(0);
> >  		if (timer_irq_works()) {
> >  			apic_printk(APIC_VERBOSE," works.\n");
> 
> would it be fine with you if we applied this to tip/x86, as it unbreaks 
> Rafael's box?

 It makes no sense to push it anywhere -- it is a diagnostic check only
which makes most of the surrounding code useless.  It masks the APIC input
selected for use so that it can be seen whether the 8259A delivers its
interrupt regardless.  Obviously in this case it does not, so I must
conclude the 8259A is really wired to this I/O APIC input.

 As expressed before, unfortunately a lot of diagnostic APIC messages have
been disabled in the 64-bit variation.  The result is I was unable to get
good results from my Internet search for bootstrap logs from other systems
using this southbridge.  Fortunately at least ACPI messages are present
and what I noticed is some of the systems do not provide an IRQ0 override
and still work correctly.  So it is quite possible the chip actually wires
the timer interrupt to INTIN0 and the virtual wire cascade to INTIN2 (that
would make the ACPI override provided by this machine incorrect).  That
would be unusual, but not unreasonable, especially for someone like ATI
doing their first chipset with no legacy burden carried over.  I'll post a
patch shortly, that will make it possible to determine that.

 Overall, it would really help to see a piece of documentation for the
SB400.  Now that ATI has been taken over by AMD it might be a bit easier.  
Both companies have a reasonably good record of providing technical
documentation, but AMD's track seems a little bit better.  At least to me.
Perhaps someone who cooperates with AMD officially could approach them?

> does PIT programming matter? One detail which might matter and which 
> touches IRQ0 generation is the clockevent driver on nohz/highres. See 
> arch/x86/kernel/i8253.c:init_pit_timer():
> 
>         case CLOCK_EVT_MODE_SHUTDOWN:
>         case CLOCK_EVT_MODE_UNUSED:
>                 if (evt->mode == CLOCK_EVT_MODE_PERIODIC ||
>                     evt->mode == CLOCK_EVT_MODE_ONESHOT) {
>                         outb_pit(0x30, PIT_MODE);
>                         outb_pit(0, PIT_CH0);
>                         outb_pit(0, PIT_CH0);
>                 }
>                 pit_disable_clocksource();
>                 break;
> 
>         case CLOCK_EVT_MODE_ONESHOT:
>                 /* One shot setup */
>                 pit_disable_clocksource();
>                 outb_pit(0x38, PIT_MODE);
>                 break;

 It does, though not necessarily in this case.  In principle all this
8254-through-APIC timer validation code assumes the source retriggers
automatically and if an edge is lost because the APIC input targeted is
masked or not configured yet, another one will follow shortly by itself.  
It used to be the case when this code was implemented as we never used any
of the single-shot modes of the 8254 back then.

 Is it now possible at the time check_timer() is called the 8254 has been
put in one of the single-shot modes?  If so, then additional code has to
be put in place either to switch the timer into the periodic mode for the
duration of check_timer() or to rearm the timer if in a single-shot mode
each time timer_irq_works() is called.

  Maciej

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-19  0:25                     ` Rafael J. Wysocki
@ 2008-06-20  0:35                       ` Maciej W. Rozycki
  2008-06-20 11:53                         ` Rafael J. Wysocki
  0 siblings, 1 reply; 73+ messages in thread
From: Maciej W. Rozycki @ 2008-06-20  0:35 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Ingo Molnar, Stephen Rothwell, linux-next, LKML, Thomas Gleixner,
	ACPI Devel Maling List, Len Brown

On Thu, 19 Jun 2008, Rafael J. Wysocki wrote:

> That helped a lot, the system seems to work normally now.
> 
> Here's the relevant snippet from dmesg:
> 
> [    0.108006] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> [    0.108006] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> [    0.108006] ...trying to set up timer (IRQ0) through the 8259A ... <3>
> [    0.108006] ..... (found apic 0 pin 2) ...<3> failed.
> [    0.108006] ...trying to set up timer as Virtual Wire IRQ...<3> works.
> 
> and the whole thing is at: http://www.sisk.pl/kernel/debug/20080618/dmesg-2.log

 Hmm, that only proved the 8259A is indeed wired to the pin #2 of the I/O 
APIC.

> I, personally, don't have any and AMD only has SB600 documentation on its
> web page (it's still marked as "AMD confidential" ;-)).

 Well, the IC block is most likely the same as that's not rocket science
and once done there is no need to fiddle with that.  That written, I am
afraid there is nothing useful about the IC in the document, except that
it's there and consists of an I/O APIC providing 24 inputs and the usual
pair of 8259A cores.  Thanks for the reference anyway.

> There is an interrupt controller in there, but I'm not sure if there's any
> 8259A.  The northbridge is on the CPU, actually.

 I will praise the day someone ships an x86 machine without an 8259A core!

 As expressed in another mail I suspect there may actually be a direct
route from the 8254 to INTIN0 in the southbridge -- this is what other
bootstrap logs seen on the Internet suggest.  This would mean this
particular BIOS is buggy (is it the latest version?) and provides an
incorrect IRQ override in its ACPI tables, for example because the
responsible block has been blindly copied from a machine using a commoner
wiring.  This could be moderately easily fixed up with a quirk based on
the PCI ID (after checking it again, we actually used to have a quirk for
ATI in this area, but the way it was done suggests the issue was not
understood well enough).

 Could you please remove the hack sent yesterday and test the patch
provided below?  I do hope it builds, but I have no immediate means to
check it.  Please report the output.  The intent is to test INTIN0
directly before testing INTIN2 through the 8259A.  Thanks.

 Aside from that, what I have gathered from your reports (please correct me
if I have got it wrong) is that when the through-8259A mode is used, then
after a while 8254 timer interrupts stop arriving.  What's interesting,
the "Virtual Wire IRQ" seems to work for you correctly (that's quite an
odd setup where a local APIC input is used in the native mode -- please
post /proc/interrupts for confirmation), which in turn implies the master
8259A drives its INT output as we expect.  Why would the I/O APIC input
have problems then?  Hmm...

  Maciej

patch-2.6.26-rc1-20080505-ioapic-replace-debug-1
diff -up --recursive --new-file linux-2.6.26-rc1-20080505.macro/arch/x86/kernel/io_apic_64.c linux-2.6.26-rc1-20080505/arch/x86/kernel/io_apic_64.c
--- linux-2.6.26-rc1-20080505.macro/arch/x86/kernel/io_apic_64.c	2008-06-18 03:24:54.000000000 +0000
+++ linux-2.6.26-rc1-20080505/arch/x86/kernel/io_apic_64.c	2008-06-20 00:12:39.000000000 +0000
@@ -360,6 +360,26 @@ static void add_pin_to_irq(unsigned int 
 	entry->pin = pin;
 }
 
+/*
+ * Reroute an IRQ to a different pin.
+ */
+static void __init replace_pin_at_irq(unsigned int irq,
+				      int oldapic, int oldpin,
+				      int newapic, int newpin)
+{
+	struct irq_pin_list *entry = irq_2_pin + irq;
+
+	while (1) {
+		if (entry->apic == oldapic && entry->pin == oldpin) {
+			entry->apic = newapic;
+			entry->pin = newpin;
+		}
+		if (!entry->next)
+			break;
+		entry = irq_2_pin + entry->next;
+	}
+}
+
 
 #define DO_ACTION(name,R,ACTION, FINAL)					\
 									\
@@ -1679,6 +1699,11 @@ static inline void __init check_timer(vo
 		apic2 = apic1;
 	}
 
+	replace_pin_at_irq(0, 0, 0, apic1, pin1);
+	apic1 = 0;
+	pin1 = 0;
+	setup_timer_IRQ0_pin(apic1, pin1, cfg->vector);
+
 	if (pin1 != -1) {
 		/*
 		 * Ok, does IRQ0 through the IOAPIC work?
@@ -1711,7 +1736,7 @@ static inline void __init check_timer(vo
 		/*
 		 * legacy devices should be connected to IO APIC #0
 		 */
-		/* replace_pin_at_irq(0, apic1, pin1, apic2, pin2); */
+		replace_pin_at_irq(0, apic1, pin1, apic2, pin2);
 		setup_timer_IRQ0_pin(apic2, pin2, cfg->vector);
 		unmask_IO_APIC_irq(0);
 		enable_8259A_irq(0);

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-19 18:17                       ` Maciej W. Rozycki
@ 2008-06-20 10:44                         ` Ingo Molnar
  2008-06-20 13:11                         ` Thomas Gleixner
  1 sibling, 0 replies; 73+ messages in thread
From: Ingo Molnar @ 2008-06-20 10:44 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Rafael J. Wysocki, Stephen Rothwell, linux-next, LKML,
	Thomas Gleixner, ACPI Devel Maling List, Len Brown


* Maciej W. Rozycki <macro@linux-mips.org> wrote:

>  As expressed before, unfortunately a lot of diagnostic APIC messages 
> have been disabled in the 64-bit variation.  The result is I was 
> unable to get good results from my Internet search for bootstrap logs 
> from other systems using this southbridge.  Fortunately at least ACPI 
> messages are present and what I noticed is some of the systems do not 
> provide an IRQ0 override and still work correctly. [...]

okay, so when those files are unified, the diagnostics should remain and 
be prominent. (or even be put back into the 64-bit version right now.)

> > does PIT programming matter? One detail which might matter and which 
> > touches IRQ0 generation is the clockevent driver on nohz/highres. See 
> > arch/x86/kernel/i8253.c:init_pit_timer():
> > 
> >         case CLOCK_EVT_MODE_SHUTDOWN:
> >         case CLOCK_EVT_MODE_UNUSED:
> >                 if (evt->mode == CLOCK_EVT_MODE_PERIODIC ||
> >                     evt->mode == CLOCK_EVT_MODE_ONESHOT) {
> >                         outb_pit(0x30, PIT_MODE);
> >                         outb_pit(0, PIT_CH0);
> >                         outb_pit(0, PIT_CH0);
> >                 }
> >                 pit_disable_clocksource();
> >                 break;
> > 
> >         case CLOCK_EVT_MODE_ONESHOT:
> >                 /* One shot setup */
> >                 pit_disable_clocksource();
> >                 outb_pit(0x38, PIT_MODE);
> >                 break;
> 
>  It does, though not necessarily in this case.  In principle all this 
> 8254-through-APIC timer validation code assumes the source retriggers 
> automatically and if an edge is lost because the APIC input targeted 
> is masked or not configured yet, another one will follow shortly by 
> itself.  It used to be the case when this code was implemented as we 
> never used any of the single-shot modes of the 8254 back then.
> 
>  Is it now possible at the time check_timer() is called the 8254 has 
> been put in one of the single-shot modes?  If so, then additional code 
> has to be put in place either to switch the timer into the periodic 
> mode for the duration of check_timer() or to rearm the timer if in a 
> single-shot mode each time timer_irq_works() is called.

that's a question for Thomas i guess, he wrote the PIT single-shot code.

	Ingo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-20  0:35                       ` Maciej W. Rozycki
@ 2008-06-20 11:53                         ` Rafael J. Wysocki
  2008-06-20 11:57                           ` Matthew Garrett
  2008-06-21  1:49                           ` Maciej W. Rozycki
  0 siblings, 2 replies; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-20 11:53 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Ingo Molnar, Stephen Rothwell, linux-next, LKML, Thomas Gleixner,
	ACPI Devel Maling List, Len Brown

On Friday, 20 of June 2008, Maciej W. Rozycki wrote:
> On Thu, 19 Jun 2008, Rafael J. Wysocki wrote:
> 
> > That helped a lot, the system seems to work normally now.
> > 
> > Here's the relevant snippet from dmesg:
> > 
> > [    0.108006] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> > [    0.108006] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> > [    0.108006] ...trying to set up timer (IRQ0) through the 8259A ... <3>
> > [    0.108006] ..... (found apic 0 pin 2) ...<3> failed.
> > [    0.108006] ...trying to set up timer as Virtual Wire IRQ...<3> works.
> > 
> > and the whole thing is at: http://www.sisk.pl/kernel/debug/20080618/dmesg-2.log
> 
>  Hmm, that only proved the 8259A is indeed wired to the pin #2 of the I/O 
> APIC.
> 
> > I, personally, don't have any and AMD only has SB600 documentation on its
> > web page (it's still marked as "AMD confidential" ;-)).
> 
>  Well, the IC block is most likely the same as that's not rocket science
> and once done there is no need to fiddle with that.  That written, I am
> afraid there is nothing useful about the IC in the document, except that
> it's there and consists of an I/O APIC providing 24 inputs and the usual
> pair of 8259A cores.  Thanks for the reference anyway.
> 
> > There is an interrupt controller in there, but I'm not sure if there's any
> > 8259A.  The northbridge is on the CPU, actually.
> 
>  I will praise the day someone ships an x86 machine without an 8259A core!
> 
>  As expressed in another mail I suspect there may actually be a direct
> route from the 8254 to INTIN0 in the southbridge -- this is what other
> bootstrap logs seen on the Internet suggest.  This would mean this
> particular BIOS is buggy (is it the latest version?) and provides an
> incorrect IRQ override in its ACPI tables, for example because the
> responsible block has been blindly copied from a machine using a commoner
> wiring.  This could be moderately easily fixed up with a quirk based on
> the PCI ID (after checking it again, we actually used to have a quirk for
> ATI in this area, but the way it was done suggests the issue was not
> understood well enough).
> 
>  Could you please remove the hack sent yesterday and test the patch
> provided below?  I do hope it builds, but I have no immediate means to
> check it.  Please report the output.  The intent is to test INTIN0
> directly before testing INTIN2 through the 8259A.  Thanks.

Tested, doesn't work.  The symptoms are exactly the same as with the unpatched
kernel.

This is the relevant snippet from dmesg:

[    0.108006] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.108006] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
[    0.108006] ...trying to set up timer (IRQ0) through the 8259A ... <3>
[    0.108006] ..... (found apic 0 pin 2) ...<3> works.

and the whole thing is at: http://www.sisk.pl/kernel/debug/20080620/dmesg-1.log

>  Aside from that, what I have gathered from your reports (please correct me
> if I have got it wrong) is that when the through-8259A mode is used, then
> after a while 8254 timer interrupts stop arriving.

What exactly I observe is that in this case:
1) The cooling fan is 100% on, as though the box were overheating, which seems
   to indicate some serious confusion of the platform (the mechanism turning
   the fan 100% on is supposed to be transparent to software).
2) Everything seems to slow down substantially, at least as soon as X is
   started.
3) The box cannot reboot, ie. it turns everything off as expected, but when the
   BIOS is supposed to restart the box, it just hangs solid.

> What's interesting, the "Virtual Wire IRQ" seems to work for you correctly
> (that's quite an odd setup where a local APIC input is used in the native
> mode -- please post /proc/interrupts for confirmation),

           CPU0       CPU1       
  0:        885      37234   IO-APIC-edge      timer
  1:          1        250   IO-APIC-edge      i8042
  8:          0          0   IO-APIC-edge      rtc0
 12:          4        148   IO-APIC-edge      i8042
 14:        568         52   IO-APIC-edge      ide0
 15:          0          0   IO-APIC-edge      ide1
 16:       5048       4555   IO-APIC-fasteoi   sata_sil, HDA Intel
 18:         45        110   IO-APIC-fasteoi   b43
 19:      11811      11973   IO-APIC-fasteoi   ohci_hcd:usb1, ohci_hcd:usb2, ehci_hcd:usb3
 20:          0          4   IO-APIC-fasteoi   yenta, tifm_7xx1, ohci1394
 21:      11695       1987   IO-APIC-fasteoi   acpi
 23:        883        115   IO-APIC-fasteoi   eth0
NMI:          0          0   Non-maskable interrupts
LOC:      36636        585   Local timer interrupts
RES:       7982       4590   Rescheduling interrupts
CAL:        260         75   function call interrupts
TLB:        207        146   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
SPU:          0          0   Spurious interrupts
ERR:          1

(also available at: http://www.sisk.pl/kernel/debug/20080620/interrupts-1.txt).

> which in turn implies the master 8259A drives its INT output as we expect.
> Why would the I/O APIC input have problems then?  Hmm...

Because it's wired to something we're not aware of?

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-20 11:53                         ` Rafael J. Wysocki
@ 2008-06-20 11:57                           ` Matthew Garrett
  2008-06-20 12:22                             ` Rafael J. Wysocki
  2008-06-21  1:49                           ` Maciej W. Rozycki
  1 sibling, 1 reply; 73+ messages in thread
From: Matthew Garrett @ 2008-06-20 11:57 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Maciej W. Rozycki, Ingo Molnar, Stephen Rothwell, linux-next,
	LKML, Thomas Gleixner, ACPI Devel Maling List, Len Brown

On Fri, Jun 20, 2008 at 01:53:58PM +0200, Rafael J. Wysocki wrote:

> What exactly I observe is that in this case:
> 1) The cooling fan is 100% on, as though the box were overheating, which seems
>    to indicate some serious confusion of the platform (the mechanism turning
>    the fan 100% on is supposed to be transparent to software).
> 2) Everything seems to slow down substantially, at least as soon as X is
>    started.

What does ACPI claim the trip points are set to in this case? On the 
6125, if IRQ 2 is enabled in the APIC then the DSDT sets all the thermal 
trip points to 16 degrees C. I suspect this means that enabling IRQ 2 is 
the wrong thing to do on this chipset.

-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-20 11:57                           ` Matthew Garrett
@ 2008-06-20 12:22                             ` Rafael J. Wysocki
  2008-06-20 12:27                               ` Matthew Garrett
  2008-06-24  9:15                               ` Pavel Machek
  0 siblings, 2 replies; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-20 12:22 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Maciej W. Rozycki, Ingo Molnar, Stephen Rothwell, linux-next,
	LKML, Thomas Gleixner, ACPI Devel Maling List, Len Brown

On Friday, 20 of June 2008, Matthew Garrett wrote:
> On Fri, Jun 20, 2008 at 01:53:58PM +0200, Rafael J. Wysocki wrote:
> 
> > What exactly I observe is that in this case:
> > 1) The cooling fan is 100% on, as though the box were overheating, which seems
> >    to indicate some serious confusion of the platform (the mechanism turning
> >    the fan 100% on is supposed to be transparent to software).
> > 2) Everything seems to slow down substantially, at least as soon as X is
> >    started.
> 
> What does ACPI claim the trip points are set to in this case? On the 
> 6125, if IRQ 2 is enabled in the APIC then the DSDT sets all the thermal 
> trip points to 16 degrees C. I suspect this means that enabling IRQ 2 is 
> the wrong thing to do on this chipset.

Ah, indeed, thanks for the hint.  This is the output of

$ cat /proc/acpi/thermal_zone/TZ*/trip_points

in the failing case:

critical (S5):           105 C
passive:                 16 C: tc1=1 tc2=2 tsp=100 devices=C000 C001 
active[0]:               16 C: devices=C34F 
active[1]:               16 C: devices=C350 
active[2]:               16 C: devices=C351 
active[3]:               16 C: devices=C352 
critical (S5):           100 C
passive:                 16 C: tc1=1 tc2=2 tsp=300 devices=C000 C001 
critical (S5):           100 C
passive:                 16 C: tc1=1 tc2=2 tsp=300 devices=C000 C001 

(also available at: http://www.sisk.pl/kernel/debug/20080620/trip-points.txt).

So, the observed slowdown may be a result of throttling.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-20 12:22                             ` Rafael J. Wysocki
@ 2008-06-20 12:27                               ` Matthew Garrett
  2008-06-21  1:09                                 ` Maciej W. Rozycki
  2008-06-24  9:15                               ` Pavel Machek
  1 sibling, 1 reply; 73+ messages in thread
From: Matthew Garrett @ 2008-06-20 12:27 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Maciej W. Rozycki, Ingo Molnar, Stephen Rothwell, linux-next,
	LKML, Thomas Gleixner, ACPI Devel Maling List, Len Brown

On Fri, Jun 20, 2008 at 02:22:11PM +0200, Rafael J. Wysocki wrote:

> Ah, indeed, thanks for the hint.  This is the output of

Right. My recollection of this is somewhat hazy, so here's something I 
wrote a couple of years ago:

"If you dig through the DSDT code for the 6125, you'll find a bit where 
it writes 0x14 to 0xfec00000 and then checks whether offset 0x12 from 
there is 1. In other words, it's checking if pin 2 of the io-apic is 
masked. If it's not masked (that is, offset 0x12 is 0 and irq 2 is 
enabled) it sets another bit in a register. This is then checked by the 
thermal zone code which as a result sets the thermal trip temperatures 
to 16 degrees Celsius. This bites when the acpi_skip_timer_override 
option is used in Linux."

I have no idea what this code is for, but it's pretty clear that Windows 
sets it up in such a way that this isn't true.

-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-19 18:17                       ` Maciej W. Rozycki
  2008-06-20 10:44                         ` Ingo Molnar
@ 2008-06-20 13:11                         ` Thomas Gleixner
  2008-06-20 20:56                           ` Maciej W. Rozycki
  1 sibling, 1 reply; 73+ messages in thread
From: Thomas Gleixner @ 2008-06-20 13:11 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Ingo Molnar, Rafael J. Wysocki, Stephen Rothwell, linux-next,
	LKML, ACPI Devel Maling List, Len Brown

On Thu, 19 Jun 2008, Maciej W. Rozycki wrote:
> On Thu, 19 Jun 2008, Ingo Molnar wrote:
> > * Maciej W. Rozycki <macro@linux-mips.org> wrote:
> > does PIT programming matter? One detail which might matter and which 
> > touches IRQ0 generation is the clockevent driver on nohz/highres. See 
> > arch/x86/kernel/i8253.c:init_pit_timer():
> > 
> >         case CLOCK_EVT_MODE_SHUTDOWN:
> >         case CLOCK_EVT_MODE_UNUSED:
> >                 if (evt->mode == CLOCK_EVT_MODE_PERIODIC ||
> >                     evt->mode == CLOCK_EVT_MODE_ONESHOT) {
> >                         outb_pit(0x30, PIT_MODE);
> >                         outb_pit(0, PIT_CH0);
> >                         outb_pit(0, PIT_CH0);
> >                 }
> >                 pit_disable_clocksource();
> >                 break;
> > 
> >         case CLOCK_EVT_MODE_ONESHOT:
> >                 /* One shot setup */
> >                 pit_disable_clocksource();
> >                 outb_pit(0x38, PIT_MODE);
> >                 break;
> 
>  It does, though not necessarily in this case.  In principle all this
> 8254-through-APIC timer validation code assumes the source retriggers
> automatically and if an edge is lost because the APIC input targeted is
> masked or not configured yet, another one will follow shortly by itself.  
> It used to be the case when this code was implemented as we never used any
> of the single-shot modes of the 8254 back then.
> 
>  Is it now possible at the time check_timer() is called the 8254 has been
> put in one of the single-shot modes?  If so, then additional code has to
> be put in place either to switch the timer into the periodic mode for the
> duration of check_timer() or to rearm the timer if in a single-shot mode
> each time timer_irq_works() is called.

At this point the PIT is in periodic mode.

Let me explain how the timer startup works:

PIT is started in periodic mode
... basic CPU bring up
APIC timer initialization (switches PIT off)
...
Highres/Dyntick mode switches local apic timers to one shot mode

When the system has C2/C3 or C1E states, then we restart the PIT in
one shot mode and reprogram it every time when the system goes into
idle to replace the local apic timer, which stops in those states.

Thanks,
	tglx

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-18 22:36                   ` Maciej W. Rozycki
@ 2008-06-20 18:59                     ` Cyrill Gorcunov
  2008-06-20 20:44                       ` Maciej W. Rozycki
  0 siblings, 1 reply; 73+ messages in thread
From: Cyrill Gorcunov @ 2008-06-20 18:59 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Rafael J. Wysocki, Ingo Molnar, Stephen Rothwell, linux-next,
	LKML, Thomas Gleixner, ACPI Devel Maling List, Len Brown

[Maciej W. Rozycki - Wed, Jun 18, 2008 at 11:36:16PM +0100]
| On Wed, 18 Jun 2008, Cyrill Gorcunov wrote:
| 
| > | 1. The 8259A interrupt actually escapes to the CPU somehow and is handled
| > |    as an ExtINTA interrupt.  This would make the code in check_timer()  
| > |    decide it has found a working configuration, while actually it has been
| > |    fooled.
| > 
| > Maciej, that is why we get 'received illegal vector'?
| > 
| > 	[  129.092151] APIC error on CPU1: 00(40)
| 
|  No, but that's an interesting observation, thank you -- well spotted!  
| 
|  ExtINTA stands for an "External INTA cycle" which is passed through from
| the CPU down to the system bus instead of being intercepted by the local
| APIC unit as usual.  In response to the INTA cycle one of the 8259A
| chips (either the master or the slave, depending on the source of the
| interrupt selected for handling) supplies the vector directly to the CPU
| through PCI (or whatever kind of bus links the legacy bridge with the host
| bridge) and then the FSB.  Therefore the vector bypasses all the APIC
| circuitry and cannot result in an APIC error interrupt.
| 
|  Instead the message quoted means an APIC input is misprogrammed
| somewhere.  This error happens if an interrupt is signalled to an unmasked
| APIC input which uses the Fixed or Lowest-Priority delivery mode and its
| vector implies priority below the minimum permitted, that is in the range
| from 0 to 15.
| 
|  We have code already in place in io_apic_{32,64}.c that can be used to
| find out the offender with a piece of code like this (#if 0 has to be
| deactivated for this to work and there may be bit rot bugs to be fixed):
| 
| int __init all_pic_dump(void)
| {
| 	int v = apic_verbosity;
| 
| 	apic_verbosity = APIC_DEBUG;
| 	print_IO_APIC();
| 	print_all_local_APICs();
| 	print_PIC();
| 	apic_verbosity = v;
| 
| 	return 0;
| }
| 
| late_initcall(all_pic_dump);
| 
| if somebody is willing to aid with debugging this problem.
| 
|   Maciej
| 

Thanks, Maciej,

I would really like to help... but I can't even hit this
bug on my laptop :(

		- Cyrill -

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-20 18:59                     ` Cyrill Gorcunov
@ 2008-06-20 20:44                       ` Maciej W. Rozycki
  0 siblings, 0 replies; 73+ messages in thread
From: Maciej W. Rozycki @ 2008-06-20 20:44 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Rafael J. Wysocki, Ingo Molnar, Stephen Rothwell, linux-next,
	LKML, Thomas Gleixner, ACPI Devel Maling List, Len Brown

On Fri, 20 Jun 2008, Cyrill Gorcunov wrote:

> i would really like to help... but I can't even hit this
> bug on my laptop :(

 Hmm, most people would be rather happy not to have a given bug on their 
piece of hardware... ;)

  Maciej

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-20 13:11                         ` Thomas Gleixner
@ 2008-06-20 20:56                           ` Maciej W. Rozycki
  0 siblings, 0 replies; 73+ messages in thread
From: Maciej W. Rozycki @ 2008-06-20 20:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Rafael J. Wysocki, Stephen Rothwell, linux-next,
	LKML, ACPI Devel Maling List, Len Brown

On Fri, 20 Jun 2008, Thomas Gleixner wrote:

> >  Is it now possible at the time check_timer() is called the 8254 has been
> > put in one of the single-shot modes?  If so, then additional code has to
> > be put in place either to switch the timer into the periodic mode for the
> > duration of check_timer() or to rearm the timer if in a single-shot mode
> > each time timer_irq_works() is called.
> 
> At this point the PIT is in periodic mode.

 I had a feeling this was the case -- thanks for your clarification.  
Nothing to change in check_timer() as far as this property is concerned
then.

  Maciej

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-20 12:27                               ` Matthew Garrett
@ 2008-06-21  1:09                                 ` Maciej W. Rozycki
  2008-06-21  1:40                                   ` Matthew Garrett
  0 siblings, 1 reply; 73+ messages in thread
From: Maciej W. Rozycki @ 2008-06-21  1:09 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Rafael J. Wysocki, Ingo Molnar, Stephen Rothwell, linux-next,
	LKML, Thomas Gleixner, ACPI Devel Maling List, Len Brown

On Fri, 20 Jun 2008, Matthew Garrett wrote:

> > Ah, indeed, thanks for the hint.  This is the output of
> 
> Right. My recollection of this is somewhat hazy, so here's something I 
> wrote a couple of years ago:
> 
> "If you dig through the DSDT code for the 6125, you'll find a bit where 
> it writes 0x14 to 0xfec00000 and then checks whether offset 0x12 from 
> there is 1. In other words, it's checking if pin 2 of the io-apic is 
> masked. If it's not masked (that is, offset 0x12 is 0 and irq 2 is 
> enabled) it sets another bit in a register. This is then checked by the 
> thermal zone code which as a result sets the thermal trip temperatures 
> to 16 degrees Celsius. This bites when the acpi_skip_timer_override 
> option is used in Linux."
> 
> I have no idea what this code is for, but it's pretty clear that Windows 
> sets it up in such a way that this isn't true.

 Thanks, that is a very useful insight indeed.  I went through the effort
to locate a DSDT dump for the nx6325.  Here are the relevant parts, first
the definition:

OperationRegion (C253, SystemMemory, 0xFEC00000, 0x14)
Field (C253, ByteAcc, NoLock, Preserve)
{
    C08B,   8,
    Offset (0x10),
    Offset (0x12),
    C08C,   1
}

So now we have got a block defined, which corresponds to the location of
the I/O APIC and is 0x14 bytes long.  That is not top quality code, I
would say, but surely it achieves what it is meant to.  Within that block 
two fields are defined:

1. An 8-bit one at the byte offset 0 -- that corresponds to the index
   register.

2. A 1-bit one at the byte offset 0x12 -- that corresponds to the bit #16 
   of the data register, which for redirection entries is the mask 
   bit.
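
 To spell out the arithmetic behind these two fields, here is a minimal
sketch in C.  The helper names are made up for illustration and are not
kernel functions; the layout assumed is the standard I/O APIC one, with the
index register at offset 0x00, the data register at offset 0x10 and two
32-bit words per redirection entry starting at index 0x10.

```c
#include <stdint.h>

/* Offsets within the I/O APIC's memory window (base 0xFEC00000 here). */
#define IOAPIC_IOREGSEL	0x00	/* index register */
#define IOAPIC_IOWIN	0x10	/* data register */

/* Hypothetical helper: index of the low word of a pin's redirection entry. */
static inline uint8_t ioapic_redir_low_index(unsigned int pin)
{
	return 0x10 + 2 * pin;
}

/* Hypothetical helper: the mask bit is bit #16 of that low word. */
static inline int ioapic_redir_masked(uint32_t low_word)
{
	return (low_word >> 16) & 1;
}

/*
 * Hypothetical helper: byte offset from the base at which a given bit
 * of the (little-endian) data register lives.
 */
static inline unsigned int ioapic_data_bit_byte_offset(unsigned int bit)
{
	return IOAPIC_IOWIN + bit / 8;
}
```

 For pin 2 the index comes out as 0x14, and bit #16 of the data register is
bit 0 of the byte at offset 0x12, which is where the 1-bit field sits.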

 And then we have a method elsewhere, which uses the above definition:

Method (_INI, 0, NotSerialized)
{
    C084 ()
    Store (0x00, \_SB.C074.C089.C08A)
    Store (0x14, C08B)
    If (LEqual (C08C, 0x00))
    {
        Store (0x01, \_SB.C074.C089.C08A)
    }
}

_SB.C074.C089.C08A refers to a piece of 8-bit data at an offset of 0xf0 
accessed through an index and data registers located at 0x72 and 0x73 in 
the port I/O space.  That's probably an extended part of the NVRAM 
associated with the RTC.

 That location is referred from two places as follows:

If (LEqual (\_SB.C074.C089.C08A, 0x01))
{
    Store (0x0B4B, Local2)
}

which is obviously that 16C trip point mentioned, overriding the result 
of the method obtained from the respective device in the usual way, and:

If (LEqual (\_SB.C074.C089.C08A, 0x00))
{
    \_SB.C074.C0E3.C149.C195 (0x00)
}

elsewhere which sets a location in the embedded controller which seems
related to battery control.  Overall my gut feeling is it's some
debugging or leftover code meant for a different configuration.
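
 As a cross-check of the 16C figure: ACPI reports temperatures in tenths of
a kelvin, so the 0x0B4B stored in the first fragment can be converted as in
the sketch below.  The function name is hypothetical, and the
round-to-nearest step is an assumption about how the value ends up printed
as a whole number of degrees.

```c
/*
 * Hypothetical helper: convert an ACPI temperature (tenths of a kelvin)
 * to whole degrees Celsius, rounding to the nearest degree.
 */
static inline long acpi_deci_kelvin_to_celsius(unsigned long t)
{
	long d = (long)t - 2732;	/* tenths of a degree Celsius */

	return (d >= 0 ? d + 5 : d - 5) / 10;
}
```

 0x0B4B is 2891, that is 289.1 K or about 15.9 degrees Celsius, which
rounds to the 16 C seen in the trip point listing.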

 This is further confirmed by another block defined next to the one quoted
above:

OperationRegion (C254, SystemIO, 0x21, 0x01)
Field (C254, ByteAcc, NoLock, Preserve)
{
    C255,   1
}

which quite similarly defines a mask for the 8254 timer interrupt in the
master 8259A.  This is nowhere used though -- any references may have been
removed with the I/O APIC part not adjusted accordingly.  Note that the
I/O APIC mask defined above is not quite a mask for the 8254 timer
interrupt in this system (as it is the ExtINTA 8259A cascade), but it is a
common location for one.
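
 For reference, the C255 field can be read as bit 0 of the master 8259A
interrupt mask register at port 0x21, where bit n masks IRQ n.  A minimal
sketch of that decoding follows; the function name is made up for
illustration and the code only inspects a byte passed to it rather than
touching any port.

```c
#include <assert.h>
#include <stdint.h>

#define PIC_MASTER_IMR 0x21     /* master 8259A interrupt mask register */

/* Return 1 if the given master IRQ line (0-7) is masked in imr, else 0. */
int pic_irq_masked(uint8_t imr, unsigned int irq)
{
	assert(irq < 8);
	return (imr >> irq) & 1;
}
```

 IRQ0 is the 8254 timer input on the master 8259A, so C255 would be
pic_irq_masked(imr, 0) for a value read from PIC_MASTER_IMR.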

 Anyway, it's clear it's firmware that is at fault here and not hardware.  
There are actually two bugs -- the first is described above and the other one
is the IRQ0 override, which is clearly incorrect.  The piece of hardware
comes from a reputable vendor, so it should be possible to submit a bug
report for the firmware.  Does anybody happen to know the appropriate contact?

 Meanwhile we may consider implementing a workaround.  I think one that 
does not hurt competent vendors would be preferable.  The DSDT containing 
the rubbish described here is marked with an OEM ID: "HP    " and OEM 
Table ID: "SB400".  These keys could be used to remove IRQ0 information
from the IRQ tables.  Our code is prepared to handle such a case.  
Something easy to do for a seasoned ACPI fiddler, I suppose. ;)
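
 To make the idea concrete, here is a rough sketch of what matching on the
DSDT header could look like.  The structure follows the standard ACPI
table header layout; the function names are hypothetical, the padding
handling (trailing spaces or NULs) is an assumption, and this is not how
the kernel's ACPI code is actually organised.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Standard ACPI system description table header. */
struct acpi_hdr {
	char     signature[4];
	uint32_t length;
	uint8_t  revision;
	uint8_t  checksum;
	char     oem_id[6];
	char     oem_table_id[8];
	uint32_t oem_revision;
	char     asl_compiler_id[4];
	uint32_t asl_compiler_revision;
};

/* Compare a fixed-width field with a string; trailing ' ' or NUL is padding. */
static int field_eq(const char *field, size_t width, const char *want)
{
	size_t n, i;

	assert(field && want);
	n = strlen(want);
	if (n > width || memcmp(field, want, n) != 0)
		return 0;
	for (i = n; i < width; i++)
		if (field[i] != ' ' && field[i] != '\0')
			return 0;
	return 1;
}

/* Hypothetical quirk test: is this the DSDT discussed in this thread? */
int dsdt_is_hp_sb400(const struct acpi_hdr *h)
{
	return memcmp(h->signature, "DSDT", 4) == 0 &&
	       field_eq(h->oem_id, sizeof(h->oem_id), "HP") &&
	       field_eq(h->oem_table_id, sizeof(h->oem_table_id), "SB400");
}
```

 A quirk keyed this way would fire on any machine shipping this table,
whatever its product name.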

 Windows does not trigger this bug, because it stays away from the 8254 on 
APIC platforms and uses the RTC for the timer instead, I am told.

  Maciej

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-21  1:09                                 ` Maciej W. Rozycki
@ 2008-06-21  1:40                                   ` Matthew Garrett
  2008-06-21  2:41                                     ` Maciej W. Rozycki
  2008-06-26 19:52                                     ` Rafael J. Wysocki
  0 siblings, 2 replies; 73+ messages in thread
From: Matthew Garrett @ 2008-06-21  1:40 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Rafael J. Wysocki, Ingo Molnar, Stephen Rothwell, linux-next,
	LKML, Thomas Gleixner, ACPI Devel Maling List, Len Brown

On Sat, Jun 21, 2008 at 02:09:00AM +0100, Maciej W. Rozycki wrote:

>  Meanwhile we may consider implementing a workaround.  I think one that 
> does not hurt competent vendors would be preferable.  The DSDT containing 
> the rubbish described here is marked with an OEM ID: "HP    " and OEM 
> Table ID: "SB400".  These keys could be used to remove IRQ0 information
> from the IRQ tables.  Our code is prepared to handle such a case.  
> Something easy to do for a seasoned ACPI fiddler, I suppose. ;)

Something roughly like the following? Entirely untested, my 6125 is in a 
box somewhere. My recollection is that skip_timer_override will disable 
the IRQ 0->2 mapping, which I believe is what's broken here?

diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 33c5216..6ca5eff 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -1060,6 +1060,16 @@ static int __init force_acpi_ht(const struct dmi_system_id *d)
 	return 0;
 }
 
+#ifdef CONFIG_X86_IO_APIC
+static int __init force_skip_timer_override(const struct dmi_system_id *d)
+{
+	printk(KERN_NOTICE "%s detected: disabling timer overrides\n",
+	       d->ident);
+	acpi_skip_timer_override = 1;
+	return 0;
+}
+#endif
+
 /*
  * If your system is blacklisted here, but you find that acpi=force
  * works for you, please contact acpi-devel@sourceforge.net
@@ -1227,6 +1237,24 @@ static struct dmi_system_id __initdata acpi_dmi_table[] = {
 		     DMI_MATCH(DMI_PRODUCT_NAME, "TravelMate 360"),
 		     },
 	 },
+#ifdef CONFIG_X86_IO_APIC
+	{
+	 .callback = force_skip_timer_override,
+	 .ident = "HP NX6125 laptop",
+	 .matches = {
+		     DMI_MATCH(DMI_SYS_VENDOR, "Hewlett-Packard"),
+		     DMI_MATCH(DMI_PRODUCT_NAME, "HP Compaq nx6125"),
+		     },
+	 },
+	{
+	 .callback = force_skip_timer_override,
+	 .ident = "HP NX6325 laptop",
+	 .matches = {
+		     DMI_MATCH(DMI_SYS_VENDOR, "Hewlett-Packard"),
+		     DMI_MATCH(DMI_PRODUCT_NAME, "HP Compaq nx6325"),
+		     },
+	 },
+#endif
 	{}
 };


-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-20 11:53                         ` Rafael J. Wysocki
  2008-06-20 11:57                           ` Matthew Garrett
@ 2008-06-21  1:49                           ` Maciej W. Rozycki
  1 sibling, 0 replies; 73+ messages in thread
From: Maciej W. Rozycki @ 2008-06-21  1:49 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Ingo Molnar, Stephen Rothwell, linux-next, LKML, Thomas Gleixner,
	ACPI Devel Maling List, Len Brown

On Fri, 20 Jun 2008, Rafael J. Wysocki wrote:

> Tested, doesn't work.  The symptoms are exactly the same as with the unpatched
> kernel.

 Thanks.

> This is the relevant snippet from dmesg:
> 
> [    0.108006] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> [    0.108006] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> [    0.108006] ...trying to set up timer (IRQ0) through the 8259A ... <3>
> [    0.108006] ..... (found apic 0 pin 2) ...<3> works.
> 
> and the whole thing is at: http://www.sisk.pl/kernel/debug/20080620/dmesg-1.log

 Hmm, it means INTIN0 is not connected to the output of the 8254.  Which
in turn means the input is either externally rewirable or internally
reconfigurable for use with the 8254, or something else, or nothing at
all (although it seems a dumb idea not to wire the 8254 to the I/O APIC).  
It might be interesting to know whether the HPET #0 can be routed to
INTIN0 on this platform.

> What exactly I observe is that in this case:
> 1) The cooling fan is 100% on, as though the box were overheating, which seems
>    to indicate some serious confusion of the platform (the mechanism turning
>    the fan 100% on is supposed to be transparent to software).
> 2) Everything seems to slow down substantially, at least as soon as X is
>    started.
> 3) The box cannot reboot, ie. it turns everything off as expected, but when the
>    BIOS is supposed to restart the box, it just hangs solid.

 OK, as explained by Matthew and investigated by myself, it is not exactly 
a problem with the timer itself, but broken power-management 
configuration.

 This could explain the reboot thing too -- our shutdown code is meant to
revert all the APIC configuration back to the bootstrap default as yours
would not be the first BIOS that has problems with its reboot vector being
entered with the APIC infrastructure active.  But the bit that's written 
to the NVRAM may interact with the BIOS for example.

 OTOH, perhaps something has got broken on the way with the APIC code too
-- I have had a look and now we have two local APIC shutdown functions:
disable_local_APIC() and lapic_shutdown() with overlapping functionality,
plus the I/O APIC is cleared after the local APIC in at least one place,
so I would not feel terribly confident about this code.

> > What's interesting, the "Virtual Wire IRQ" seems to work for you correctly
> > (that's quite an odd setup where a local APIC input is used in the native
> > mode -- please post /proc/interrupts for confirmation),
> 
>            CPU0       CPU1       
>   0:        885      37234   IO-APIC-edge      timer
[...]
> (also available at: http://www.sisk.pl/kernel/debug/20080620/interrupts-1.txt).

 One for the other configuration, which reports "Virtual Wire IRQ", i.e.  
without my "x86: I/O APIC: timer through 8259A second-chance" patch, would
be more interesting, though perhaps less so now that the reason for the
misbehaviour is known.

> > which in turn implies the master 8259A drives its INT output as we expect.
> > Why would the I/O APIC input have problems then?  Hmm...
> 
> Because it's wired to something we're not aware of?

 Well, sure, but the question in such a case would be: "What for?"  The
output of the 8259A has had quite a standard meaning for some 30 years
now, so I would expect one would not wire it to anything else but the
interrupt input of a CPU or an APIC input without a purpose.  Or at least
a reason.

  Maciej

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-21  1:40                                   ` Matthew Garrett
@ 2008-06-21  2:41                                     ` Maciej W. Rozycki
  2008-06-21 12:38                                       ` Matthew Garrett
  2008-06-26 19:52                                     ` Rafael J. Wysocki
  1 sibling, 1 reply; 73+ messages in thread
From: Maciej W. Rozycki @ 2008-06-21  2:41 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Rafael J. Wysocki, Ingo Molnar, Stephen Rothwell, linux-next,
	LKML, Thomas Gleixner, ACPI Devel Maling List, Len Brown

On Sat, 21 Jun 2008, Matthew Garrett wrote:

> >  Meanwhile we may consider implementing a workaround.  I think one that 
> > does not hurt competent vendors would be preferable.  The DSDT containing 
> > the rubbish described here is marked with an OEM ID: "HP    " and OEM 
> > Table ID: "SB400".  These keys could be used to remove IRQ0 information
> > from the IRQ tables.  Our code is prepared to handle such a case.  
> > Something easy to do for a seasoned ACPI fiddler, I suppose. ;)
> 
> Something roughly like the following? Entirely untested, my 6125 is in a 

 Maybe, though your code seems to match product IDs rather than the broken
DSDT itself.  I think the latter would be preferable as it would cover all
the pieces of equipment using the broken piece of firmware rather than
ones we have already tracked down.  Perhaps the version could be included
too, but that would only make sense if the breakage ever gets fixed -- the
use of the through-8259A mode for the 8254 timer would allow this piece of
equipment to benefit from the I/O APIC driven NMI watchdog.

> box somewhere. My recollection is that skip_timer_override will disable 
> the IRQ 0->2 mapping, which I believe is what's broken here?

 Not exactly.  The IRQ0->2 mapping is certainly wrong here, but so is the
identity IRQ0->0 one.  Which means it should not be recorded in
mp_config_acpi_legacy_irqs() at all.  I can cook this part if you'd rather
not, if you do the ACPI part.  If you think there is no easy way to
match the DSDT rather than the product ID -- we are trying to cope with
outright breakage here after all, so any amount of effort is too much ;)  
-- I can update your proposal with what I have in mind.
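
 Roughly what I mean by not recording IRQ0, as a sketch only: the code
below is a made-up stand-in and not the actual
mp_config_acpi_legacy_irqs(); the structure, the names and the skip_irq0
flag are all hypothetical.  It builds the identity ISA IRQ to I/O APIC
pin table and leaves IRQ0 out when a quirk asks for it.

```c
#include <assert.h>
#include <stddef.h>

#define NR_LEGACY_IRQS 16

struct irq_route {
	unsigned int irq;       /* ISA IRQ number */
	unsigned int pin;       /* I/O APIC input it is assumed to reach */
};

/*
 * Fill 'out' with identity IRQ -> pin routes, optionally leaving out
 * IRQ0.  Returns the number of entries written.
 */
size_t build_legacy_routes(struct irq_route out[NR_LEGACY_IRQS], int skip_irq0)
{
	size_t n = 0;
	unsigned int i;

	assert(out != NULL);
	for (i = 0; i < NR_LEGACY_IRQS; i++) {
		if (i == 0 && skip_irq0)
			continue;
		out[n].irq = i;
		out[n].pin = i;
		n++;
	}
	return n;
}
```

 With skip_irq0 set the table has 15 entries and starts at IRQ1, so the
timer setup code finds no I/O APIC pin for IRQ0 and has to fall back to
its other options.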

 One point to note -- perhaps it would be better to avoid these #ifdef
clauses -- even though it's a workaround, I think the amount of resources
consumed does not justify the clutter introduced.

 Thanks for your submission.

  Maciej

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-21  2:41                                     ` Maciej W. Rozycki
@ 2008-06-21 12:38                                       ` Matthew Garrett
  0 siblings, 0 replies; 73+ messages in thread
From: Matthew Garrett @ 2008-06-21 12:38 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Rafael J. Wysocki, Ingo Molnar, Stephen Rothwell, linux-next,
	LKML, Thomas Gleixner, ACPI Devel Maling List, Len Brown

On Sat, Jun 21, 2008 at 03:41:37AM +0100, Maciej W. Rozycki wrote:

>  Maybe, though your code seems to match product IDs rather than the broken
> DSDT itself.  I think the latter would be preferable as it would cover all
> the pieces of equipment using the broken piece of firmware rather than
> ones we have already tracked down.  Perhaps the version could be included
> too, but that would only make sense if the breakage ever gets fixed -- the
> use of the through-8259A mode for the 8254 timer would allow this piece of
> equipment to benefit from the I/O APIC driven NMI watchdog.

I haven't seen any other machines with this issue, so I suspect that 
this is HP-specific code. I'll look into what would need doing to quirk 
it off the DSDT strings, though.

>  Not exactly.  The IRQ0->2 mapping is certainly wrong here, but so is the
> identity IRQ0->0 one.  Which means it should not be recorded in
> mp_config_acpi_legacy_irqs() at all.  I can cook this part if you'd rather
> not, if you do the ACPI part.  If you think there is no easy way to
> match the DSDT rather than the product ID -- we are trying to cope with
> outright breakage here after all, so any amount of effort is too much ;)  
> -- I can update your proposal with what I have in mind.

Ok, that works for me.

-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-18 15:29                       ` Thomas Gleixner
@ 2008-06-21 22:47                         ` Rafael J. Wysocki
  0 siblings, 0 replies; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-21 22:47 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Maciej W. Rozycki, Stephen Rothwell, linux-next, LKML,
	Ingo Molnar, ACPI Devel Maling List, Len Brown

On Wednesday, 18 of June 2008, Thomas Gleixner wrote:
> On Wed, 18 Jun 2008, Rafael J. Wysocki wrote:
> > > I just checked that the original c1e series and the affected code in
> > > tip are not different. IIRC you confirmed that the C1E patches would
> > > work on your box. So I wonder what else got changed which causes these
> > > problems.
> > 
> > Well, to eliminate any possible correlations, do you have a version of the
> > series or a single patch against the current mainline?
> 
> http://userweb.kernel.org/~tglx/952f4a-c1e-apic.patch
> http://userweb.kernel.org/~tglx/952f4a-c1e.patch
> 
> c1e-apic is the forward port of the apic changes and c1e is the pure
> c1e stuff. On my box it does not work w/o the c1e-apic one, but ....

Unfortunately, with the c1e.patch on top of the apic.patch on top of the
current mainline I get the same symptoms as with -next:
- processes freeze
- CPU loads are unreasonably high
- things generally get stuck if I don't move the mouse or press keys

Removing the c1e.patch makes things work again.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-20 12:22                             ` Rafael J. Wysocki
  2008-06-20 12:27                               ` Matthew Garrett
@ 2008-06-24  9:15                               ` Pavel Machek
  2008-06-26  8:37                                 ` Rafael J. Wysocki
  2008-06-27  1:53                                 ` Maciej W. Rozycki
  1 sibling, 2 replies; 73+ messages in thread
From: Pavel Machek @ 2008-06-24  9:15 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Matthew Garrett, Maciej W. Rozycki, Ingo Molnar, Stephen Rothwell,
	linux-next, LKML, Thomas Gleixner, ACPI Devel Maling List,
	Len Brown

Hi!

> > What does ACPI claim the trip points are set to in this case? On the 
> > 6125, if IRQ 2 is enabled in the APIC then the DSDT sets all the thermal 
> > trip points to 16 degrees C. I suspect this means that enabling IRQ 2 is 
> > the wrong thing to do on this chipset.
> 
> Ah, indeed, thanks for the hint.  This is the output of
> 
> $ cat /proc/acpi/thermal_zone/TZ*/trip_points
> 
> in the failing case:
> 
> critical (S5):           105 C
> passive:                 16 C: tc1=1 tc2=2 tsp=100 devices=C000 C001 
> active[0]:               16 C: devices=C34F 
> active[1]:               16 C: devices=C350 
> active[2]:               16 C: devices=C351 
> active[3]:               16 C: devices=C352 
> critical (S5):           100 C
> passive:                 16 C: tc1=1 tc2=2 tsp=300 devices=C000 C001 
> critical (S5):           100 C
> passive:                 16 C: tc1=1 tc2=2 tsp=300 devices=C000 C001 

Can we declare the ACPI BIOS terminally broken at this point?

							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-24  9:15                               ` Pavel Machek
@ 2008-06-26  8:37                                 ` Rafael J. Wysocki
  2008-06-27  1:53                                 ` Maciej W. Rozycki
  1 sibling, 0 replies; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-26  8:37 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Matthew Garrett, Maciej W. Rozycki, Ingo Molnar, Stephen Rothwell,
	linux-next, LKML, Thomas Gleixner, ACPI Devel Maling List,
	Len Brown

On Tuesday, 24 of June 2008, Pavel Machek wrote:
> Hi!
> 
> > > What does ACPI claim the trip points are set to in this case? On the 
> > > 6125, if IRQ 2 is enabled in the APIC then the DSDT sets all the thermal 
> > > trip points to 16 degrees C. I suspect this means that enabling IRQ 2 is 
> > > the wrong thing to do on this chipset.
> > 
> > Ah, indeed, thanks for the hint.  This is the output of
> > 
> > $ cat /proc/acpi/thermal_zone/TZ*/trip_points
> > 
> > in the failing case:
> > 
> > critical (S5):           105 C
> > passive:                 16 C: tc1=1 tc2=2 tsp=100 devices=C000 C001 
> > active[0]:               16 C: devices=C34F 
> > active[1]:               16 C: devices=C350 
> > active[2]:               16 C: devices=C351 
> > active[3]:               16 C: devices=C352 
> > critical (S5):           100 C
> > passive:                 16 C: tc1=1 tc2=2 tsp=300 devices=C000 C001 
> > critical (S5):           100 C
> > passive:                 16 C: tc1=1 tc2=2 tsp=300 devices=C000 C001 
> 
> Can we declare the ACPI BIOS terminally broken at this point?

It is broken, but the configuration worked before the patch.  Consequently,
the patch introduces a regression.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-21  1:40                                   ` Matthew Garrett
  2008-06-21  2:41                                     ` Maciej W. Rozycki
@ 2008-06-26 19:52                                     ` Rafael J. Wysocki
  2008-06-27  0:06                                       ` Maciej W. Rozycki
  1 sibling, 1 reply; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-26 19:52 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Maciej W. Rozycki, Ingo Molnar, Stephen Rothwell, linux-next,
	LKML, Thomas Gleixner, ACPI Devel Maling List, Len Brown

On Saturday, 21 of June 2008, Matthew Garrett wrote:
> On Sat, Jun 21, 2008 at 02:09:00AM +0100, Maciej W. Rozycki wrote:
> 
> >  Meanwhile we may consider implementing a workaround.  I think one that 
> > does not hurt competent vendors would be preferable.  The DSDT containing 
> > the rubbish described here is marked with an OEM ID: "HP    " and OEM 
> > Table ID: "SB400".  These keys could be used to remove IRQ0 information
> > from the IRQ tables.  Our code is prepared to handle such a case.  
> > Something easy to do for a seasoned ACPI fiddler, I suppose. ;)
> 
> Something roughly like the following? Entirely untested, my 6125 is in a 
> box somewhere. My recollection is that skip_timer_override will disable 
> the IRQ 0->2 mapping, which I believe is what's broken here?

Well, actually, I'm not sure that will work.  I have only found
acpi_skip_timer_override being set to 1 in two places, but it doesn't seem to
be read anywhere.  What am I missing?


> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
> index 33c5216..6ca5eff 100644
> --- a/arch/x86/kernel/acpi/boot.c
> +++ b/arch/x86/kernel/acpi/boot.c
> @@ -1060,6 +1060,16 @@ static int __init force_acpi_ht(const struct dmi_system_id *d)
>  	return 0;
>  }
>  
> +#ifdef CONFIG_X86_IO_APIC
> +static int __init force_skip_timer_override(const struct dmi_system_id *d)
> +{
> +	printk(KERN_NOTICE "%s detected: disabling timer overrides\n",
> +	       d->ident);
> +	acpi_skip_timer_override = 1;
> +	return 0;
> +}
> +#endif
> +
>  /*
>   * If your system is blacklisted here, but you find that acpi=force
>   * works for you, please contact acpi-devel@sourceforge.net
> @@ -1227,6 +1237,24 @@ static struct dmi_system_id __initdata acpi_dmi_table[] = {
>  		     DMI_MATCH(DMI_PRODUCT_NAME, "TravelMate 360"),
>  		     },
>  	 },
> +#ifdef CONFIG_X86_IO_APIC
> +	{
> +	 .callback = force_skip_timer_override,
> +	 .ident = "HP NX6125 laptop",
> +	 .matches = {
> +		     DMI_MATCH(DMI_SYS_VENDOR, "Hewlett-Packard"),
> +		     DMI_MATCH(DMI_PRODUCT_NAME, "HP Compaq nx6125"),
> +		     },
> +	 },
> +	{
> +	 .callback = force_skip_timer_override,
> +	 .ident = "HP NX6325 laptop",
> +	 .matches = {
> +		     DMI_MATCH(DMI_SYS_VENDOR, "Hewlett-Packard"),
> +		     DMI_MATCH(DMI_PRODUCT_NAME, "HP Compaq nx6325"),
> +		     },
> +	 },
> +#endif
>  	{}
>  };
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-26 19:52                                     ` Rafael J. Wysocki
@ 2008-06-27  0:06                                       ` Maciej W. Rozycki
  2008-06-29 14:00                                         ` Rafael J. Wysocki
  0 siblings, 1 reply; 73+ messages in thread
From: Maciej W. Rozycki @ 2008-06-27  0:06 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Matthew Garrett, Ingo Molnar, Stephen Rothwell, linux-next, LKML,
	Thomas Gleixner, ACPI Devel Maling List, Len Brown

On Thu, 26 Jun 2008, Rafael J. Wysocki wrote:

> Well, actually, I'm not sure that will work.  I have only found
> acpi_skip_timer_override being set to 1 in two places, but it doesn't seem to
> be read anywhere.  What am I missing?

 I believe I removed all the occurrences.  I am waiting for a proposal of a
quirk based on the DSDT ID -- my time is a bit too limited to study the
internals of our ACPI code at the moment; sorry about that.  I will
complement it with a change to remove IRQ0 from I/O APIC tables as
promised then; this piece of code I am quite familiar with.

  Maciej

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-24  9:15                               ` Pavel Machek
  2008-06-26  8:37                                 ` Rafael J. Wysocki
@ 2008-06-27  1:53                                 ` Maciej W. Rozycki
  2008-07-08 12:48                                   ` Pavel Machek
  1 sibling, 1 reply; 73+ messages in thread
From: Maciej W. Rozycki @ 2008-06-27  1:53 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rafael J. Wysocki, Matthew Garrett, Ingo Molnar, Stephen Rothwell,
	linux-next, LKML, Thomas Gleixner, ACPI Devel Maling List,
	Len Brown

On Tue, 24 Jun 2008, Pavel Machek wrote:

> > $ cat /proc/acpi/thermal_zone/TZ*/trip_points
> > 
> > in the failing case:
> > 
> > critical (S5):           105 C
> > passive:                 16 C: tc1=1 tc2=2 tsp=100 devices=C000 C001 
> > active[0]:               16 C: devices=C34F 
> > active[1]:               16 C: devices=C350 
> > active[2]:               16 C: devices=C351 
> > active[3]:               16 C: devices=C352 
> > critical (S5):           100 C
> > passive:                 16 C: tc1=1 tc2=2 tsp=300 devices=C000 C001 
> > critical (S5):           100 C
> > passive:                 16 C: tc1=1 tc2=2 tsp=300 devices=C000 C001 
> 
> Can we declare the ACPI BIOS terminally broken at this point?

 Do we have any point of contact at HP and/or ATI/AMD?  I suppose getting 
hands on a SB400 datasheet could be tricky, but someone may be able to 
answer questions about the interrupt routing between the 8254, the 8259A 
and the I/O APIC for this chip and/or fix the DSDT.

  Maciej

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-27  0:06                                       ` Maciej W. Rozycki
@ 2008-06-29 14:00                                         ` Rafael J. Wysocki
  2008-06-29 19:05                                           ` Maciej W. Rozycki
  0 siblings, 1 reply; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-29 14:00 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Matthew Garrett, Ingo Molnar, Stephen Rothwell, linux-next, LKML,
	Thomas Gleixner, ACPI Devel Maling List, Len Brown

On Friday, 27 of June 2008, Maciej W. Rozycki wrote:
> On Thu, 26 Jun 2008, Rafael J. Wysocki wrote:
> 
> > Well, actually, I'm not sure that will work.  I have only found
> > acpi_skip_timer_override being set to 1 in two places, but it doesn't seem to
> > be read anywhere.  What am I missing?
> 
>  I believe I removed all the occurrences.  I am waiting for a proposal of a
> quirk based on the DSDT ID -- my time is a bit too limited to study the
> internals of our ACPI code at the moment; sorry about that.  I will
> complement it with a change to remove IRQ0 from I/O APIC tables as
> promised then; this piece of code I am quite familiar with.

Well, why don't we use the DMI identification as suggested by Matthew?

I think we can safely assume that all of these boxes are broken for now and we
can use a more fine grained identification in the future, if necessary.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-29 14:00                                         ` Rafael J. Wysocki
@ 2008-06-29 19:05                                           ` Maciej W. Rozycki
  2008-06-29 19:23                                             ` Rafael J. Wysocki
  2008-06-29 19:23                                             ` Matthew Garrett
  0 siblings, 2 replies; 73+ messages in thread
From: Maciej W. Rozycki @ 2008-06-29 19:05 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Matthew Garrett, Ingo Molnar, Stephen Rothwell, linux-next, LKML,
	Thomas Gleixner, ACPI Devel Maling List, Len Brown

On Sun, 29 Jun 2008, Rafael J. Wysocki wrote:

> >  I believe I removed all the occurrences.  I am waiting for a proposal of a
> > quirk based on the DSDT ID -- my time is a bit too limited to study the
> > internals of our ACPI code at the moment; sorry about that.  I will
> > complement it with a change to remove IRQ0 from I/O APIC tables as
> > promised then; this piece of code I am quite familiar with.
> 
> Well, why don't we use the DMI identification as suggested by Matthew?

 Because it checks the wrong property.

> I think we can safely assume that all of these boxes are broken for now and we
> can use a more fine grained identification in the future, if necessary.

 It is the reverse -- checking the DSDT ID is coarser, matching all the
systems that use the broken firmware.  With DMI we may face both false
positives and false negatives which imply further maintenance actions.  
Please note that, as proved over the years, understanding of these issues seems
to be problematic for people, so the result may be another round of
discussions reinventing the wheel in a couple of years' time or so.

 That's my opinion only though -- if it was to hinder the progress, then I
am not going to persist.

 Have you tried to report the issue through the usual manufacturer's
support channels, BTW?  They may not even be aware of the existence of the
bug.  Of course they may dismiss it anyway, but at least they will have a
record of it somewhere.

  Maciej

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-29 19:05                                           ` Maciej W. Rozycki
@ 2008-06-29 19:23                                             ` Rafael J. Wysocki
  2008-06-29 19:56                                               ` Maciej W. Rozycki
  2008-06-29 19:23                                             ` Matthew Garrett
  1 sibling, 1 reply; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-29 19:23 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Matthew Garrett, Ingo Molnar, Stephen Rothwell, linux-next, LKML,
	Thomas Gleixner, ACPI Devel Maling List, Len Brown, Andi Kleen

On Sunday, 29 of June 2008, Maciej W. Rozycki wrote:
> On Sun, 29 Jun 2008, Rafael J. Wysocki wrote:
> 
> > >  I believe I removed all the occurrences.  I am waiting for a proposal of a
> > > quirk based on the DSDT ID -- my time is a bit too limited to study the
> > > internals of our ACPI code at the moment; sorry about that.  I will
> > > complement it with a change to remove IRQ0 from I/O APIC tables as
> > > promised then; this piece of code I am quite familiar with.
> > 
> > Well, why don't we use the DMI identification as suggested by Matthew?
> 
>  Because it checks the wrong property.
> 
> > I think we can safely assume that all of these boxes are broken for now and we
> > can use a more fine grained identification in the future, if necessary.
> 
>  It is the reverse -- checking the DSDT ID is coarser, matching all the
> systems that use the broken firmware.

How can you tell which DSDTs are broken until somebody reports them?

> With DMI we may face both false positives and false negatives which imply
> further maintenance actions.   

With DSDT matching you're likely to end up breaking systems the users of
which have not reported problems.

> Please note as proved over the years understanding of these issues seems
> to be problematic for people, so the result may be another round of
> discussions reinventing the wheel in a couple of years' time or so.
> 
>  That's my opinion only though -- if it was to hinder the progress, then I
> am not going to persist.

Good.

>  Have you tried to report the issue through the usual manufacturer's
> support channels, BTW?

My experience with HP indicates that it would have been a waste of time.

Apart from this, I've always been against forcing people to upgrade their
BIOSes just because we had a brilliant idea that made the kernel stop
working on their systems.  IMO it's extremely user-unfriendly and plain wrong.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-29 19:05                                           ` Maciej W. Rozycki
  2008-06-29 19:23                                             ` Rafael J. Wysocki
@ 2008-06-29 19:23                                             ` Matthew Garrett
  2008-06-29 19:31                                               ` Rafael J. Wysocki
  2008-06-29 20:03                                               ` Maciej W. Rozycki
  1 sibling, 2 replies; 73+ messages in thread
From: Matthew Garrett @ 2008-06-29 19:23 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Rafael J. Wysocki, Ingo Molnar, Stephen Rothwell, linux-next,
	LKML, Thomas Gleixner, ACPI Devel Maling List, Len Brown

On Sun, Jun 29, 2008 at 08:05:42PM +0100, Maciej W. Rozycki wrote:

>  It is the reverse -- checking the DSDT ID is coarser, matching all the
> systems that use the broken firmware.  With DMI we may face both false
> positives and false negatives which imply further maintenance actions.  
> Please note as proved over the years understanding of these issues seems
> to be problematic for people, so the result may be another round of
> discussions reinventing the wheel in a couple of years' time or so.

The DSDT can't be updated without the BIOS being updated, and the DMI 
information gives us a BIOS version string that can be matched against 
if a fixed version is ever released. I'd be in favour of doing it with 
DMI on the grounds that it's how we already handle machine-specific 
quirks rather than adding new code to do it.
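
To illustrate the BIOS version point, here is a small userspace model of
the matching, not kernel code.  The structure and function names are
hypothetical, the substring comparison mirrors how I believe DMI_MATCH
behaves, and the "fixed" version string is a placeholder since no fixed
BIOS is known to exist.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the DMI strings a quirk would look at. */
struct dmi_info {
	const char *sys_vendor;
	const char *product_name;
	const char *bios_version;
};

/* Substring match; an absent pattern matches anything. */
static int dmi_field_matches(const char *value, const char *pattern)
{
	if (!pattern)
		return 1;
	return value && strstr(value, pattern) != NULL;
}

/*
 * Return nonzero if the quirk should apply: vendor and product match and
 * the BIOS version is not the (placeholder) known-good one.
 */
int quirk_applies(const struct dmi_info *d, const char *fixed_bios)
{
	assert(d);
	if (!dmi_field_matches(d->sys_vendor, "Hewlett-Packard") ||
	    !dmi_field_matches(d->product_name, "HP Compaq nx6325"))
		return 0;
	if (fixed_bios && d->bios_version &&
	    strcmp(d->bios_version, fixed_bios) == 0)
		return 0;
	return 1;
}
```

Passing NULL for fixed_bios gives the behaviour of the patch I posted; a
version string can be added later if one ever turns out to be fixed.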

-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-29 19:23                                             ` Matthew Garrett
@ 2008-06-29 19:31                                               ` Rafael J. Wysocki
  2008-06-29 20:03                                               ` Maciej W. Rozycki
  1 sibling, 0 replies; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-29 19:31 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Maciej W. Rozycki, Ingo Molnar, Stephen Rothwell, linux-next,
	LKML, Thomas Gleixner, ACPI Devel Maling List, Len Brown

On Sunday, 29 of June 2008, Matthew Garrett wrote:
> On Sun, Jun 29, 2008 at 08:05:42PM +0100, Maciej W. Rozycki wrote:
> 
> >  It is the reverse -- checking the DSDT ID is coarser, matching all the
> > systems that use the broken firmware.  With DMI we may face both false
> > positives and false negatives which imply further maintenance actions.  
> > Please note as proved over the years understanding of these issues seems
> > to be problematic for people, so the result may be another round of
> > discussions reinventing the wheel in a couple of years' time or so.
> 
> The DSDT can't be updated without the BIOS being updated, and the DMI 
> information gives us a BIOS version string that can be matched against 
> if a fixed version is ever released. I'd be in favour of doing it with 
> DMI on the grounds that it's how we already handle machine-specific 
> quirks rather than adding new code to do it.

I violently agree.

Thanks,
Rafael


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-29 19:23                                             ` Rafael J. Wysocki
@ 2008-06-29 19:56                                               ` Maciej W. Rozycki
  2008-06-29 20:02                                                 ` Ingo Molnar
  2008-06-29 22:56                                                 ` Rafael J. Wysocki
  0 siblings, 2 replies; 73+ messages in thread
From: Maciej W. Rozycki @ 2008-06-29 19:56 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Matthew Garrett, Ingo Molnar, Stephen Rothwell, linux-next, LKML,
	Thomas Gleixner, ACPI Devel Maling List, Len Brown, Andi Kleen

On Sun, 29 Jun 2008, Rafael J. Wysocki wrote:

> >  It is the reverse -- checking the DSDT ID is coarser, matching all the
> > systems that use the broken firmware.
> 
> How can you tell which DSDTs are broken until somebody reports them?

 We know the DSDT matching OEM ID: "HP ", OEM Table ID: "SB400" and OEM
Revision: 10000 is broken, because it has already been reported.  If these
properties are checked, there is no need for further reports providing
us with DMI IDs of systems using the same DSDT.  The revision can be used
to make sure a good one is not selected inadvertently.
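
As a rough illustration of the check described above, here is a minimal
userspace sketch that compares the OEM ID, OEM Table ID and OEM Revision
of a table header against a known-broken triple.  The struct is a
simplified, hypothetical view of the header fields, the function names
are made up for this example, and the thread does not say whether the
revision 10000 is decimal or hexadecimal, so it is passed in as a
parameter rather than hard-coded.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified, hypothetical view of the header fields used for matching. */
struct table_ids {
	char oem_id[6];        /* fixed width, not NUL-terminated */
	char oem_table_id[8];  /* fixed width, not NUL-terminated */
	uint32_t oem_revision;
};

/*
 * Compare a fixed-width field with a C string, ignoring trailing
 * spaces and NULs in the field and trailing spaces in the string.
 */
static bool fixed_field_equals(const char *field, size_t width,
			       const char *want)
{
	size_t flen = width;
	size_t wlen = strlen(want);

	while (flen > 0 && (field[flen - 1] == ' ' || field[flen - 1] == '\0'))
		flen--;
	while (wlen > 0 && want[wlen - 1] == ' ')
		wlen--;
	return flen == wlen && memcmp(field, want, flen) == 0;
}

/* True only if all three identifiers match the reported broken table. */
static bool dsdt_is_known_broken(const struct table_ids *t,
				 const char *oem_id,
				 const char *oem_table_id,
				 uint32_t oem_revision)
{
	return fixed_field_equals(t->oem_id, sizeof(t->oem_id), oem_id) &&
	       fixed_field_equals(t->oem_table_id, sizeof(t->oem_table_id),
				  oem_table_id) &&
	       t->oem_revision == oem_revision;
}
```

Including the revision in the comparison is what keeps a later, corrected
table with the same OEM strings from being selected by accident.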

> > With DMI we may face both false positives and false negatives which imply
> > further maintenance actions.   
> 
> With DSDT matching you're likely to end up breaking systems the users of
> which have not reported problems.

 s/breaking/fixing/

 Besides, there is nothing to break here -- the mixed interrupt mode will
be used when the workaround is selected and the mode has to work or pieces
of legacy software, such as DOS, which make use of the 8259A would not
work.

> >  Have you tried to report the issue through the usual manufacturer's
> > support channels, BTW?
> 
> My experience with HP indicates that it would have been a waste of time.

 Well, if you do not report problems, they may never know of their
existence and obviously will have no way to fix them.  They may ignore
your report, but at least you can say you have done your part.  Based on
the experience the next time you may choose another manufacturer when
making a purchase decision.

> Apart from this, I've always been against forcing people to upgrade their
> BIOSes just because we just had a brilliant idea that made the kernel stop
> working on their systems.  IMO it's extremely user-unfriendly and plain wrong.

 The BIOS is broken and should be fixed -- it is not our mission to fix up
somebody else's faults.  As a courtesy to users we may try to work around
problems that are hard for them to cope with, but in a sense this is
promoting bad quality of hardware: "Don't bother doing this properly --
they will fix it up somehow in the OS anyway."

 You may argue this is a regression, but this is simply the cost paid for
progress -- the kernel stays within the spec as defined both by ACPI and
MPS, we have just started using a different configuration now and an
interrupt source override provided by the manufacturer explicitly states
INTIN2 is good to use.  In a sense you were simply lucky previously the
kernel was bad enough with the way it configured the timer through the I/O
APIC it failed completely avoiding the bug in your firmware.  Now the bug
has got uncovered.

 And last but not least, you can always specify "noapic" to get away --
that's a perfectly good workaround.

 I'll cook up the part I promised shortly and leave it up to the others to
"wire" it to some breakage detection logic.

  Maciej


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-29 19:56                                               ` Maciej W. Rozycki
@ 2008-06-29 20:02                                                 ` Ingo Molnar
  2008-06-29 20:14                                                   ` Maciej W. Rozycki
  2008-06-29 22:59                                                   ` Rafael J. Wysocki
  2008-06-29 22:56                                                 ` Rafael J. Wysocki
  1 sibling, 2 replies; 73+ messages in thread
From: Ingo Molnar @ 2008-06-29 20:02 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Rafael J. Wysocki, Matthew Garrett, Stephen Rothwell, linux-next,
	LKML, Thomas Gleixner, ACPI Devel Maling List, Len Brown,
	Andi Kleen


* Maciej W. Rozycki <macro@linux-mips.org> wrote:

>  You may argue this is a regression, but this is simply the cost paid 
> for progress -- the kernel stays within the spec as defined both by 
> ACPI and MPS, we have just started using a different configuration now 
> and an interrupt source override provided by the manufacturer 
> explicitly states INTIN2 is good to use.  In a sense you were simply 
> lucky previously the kernel was bad enough with the way it configured 
> the timer through the I/O APIC it failed completely avoiding the bug 
> in your firmware.  Now the bug has got uncovered.

well as long as we eliminate the bad effects around via DMI exceptions 
nobody will feel the need to argue whether it's a regression ;-) [this 
problem could be argued to be a regression, even if it's caused by prior 
luck/stupidity of Linux. We have to live with the effects of our 
mistakes.]

	Ingo


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-29 19:23                                             ` Matthew Garrett
  2008-06-29 19:31                                               ` Rafael J. Wysocki
@ 2008-06-29 20:03                                               ` Maciej W. Rozycki
  2008-06-29 20:07                                                 ` Matthew Garrett
  1 sibling, 1 reply; 73+ messages in thread
From: Maciej W. Rozycki @ 2008-06-29 20:03 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Rafael J. Wysocki, Ingo Molnar, Stephen Rothwell, linux-next,
	LKML, Thomas Gleixner, ACPI Devel Maling List, Len Brown

On Sun, 29 Jun 2008, Matthew Garrett wrote:

> The DSDT can't be updated without the BIOS being updated, and the DMI 
> information gives us a BIOS version string that can be matched against 
> if a fixed version is ever released. I'd be in favour of doing it with 
> DMI on the grounds that it's how we already handle machine-specific 
> quirks rather than adding new code to do it.

 Is the DMI ID *guaranteed* to be changed with an update to the DSDT?  
Anyway, you cannot infer from a given DMI ID a broken DSDT is present, so
you will have to repeat the experience of adding another DMI ID whenever a
user hits this broken DSDT with another piece of hardware.  As long as you
are able to pull this piece of information from that user, that is...

  Maciej


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-29 20:03                                               ` Maciej W. Rozycki
@ 2008-06-29 20:07                                                 ` Matthew Garrett
  2008-06-29 20:16                                                   ` Maciej W. Rozycki
  0 siblings, 1 reply; 73+ messages in thread
From: Matthew Garrett @ 2008-06-29 20:07 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Rafael J. Wysocki, Ingo Molnar, Stephen Rothwell, linux-next,
	LKML, Thomas Gleixner, ACPI Devel Maling List, Len Brown

On Sun, Jun 29, 2008 at 09:03:29PM +0100, Maciej W. Rozycki wrote:
>  Is the DMI ID *guaranteed* to be changed with an update to the DSDT?  

The BIOS version is, yes.

> Anyway, you cannot infer from a given DMI ID a broken DSDT is present, so
> you will have to repeat the experience of adding another DMI ID whenever a
> user hits this broken DSDT with another piece of hardware.  As long as you
> are able to pull this piece of information from that user, that is...

These are the only two pieces of hardware ever reported to have this 
problem. Nobody appears to have demonstrated it on any other HP systems, 
and any non-HP systems would have a different identifier string. With 
the SB400 being superseded, I don't expect us to see any more machines 
with the same ID.

-- 
Matthew Garrett | mjg59@srcf.ucam.org


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-29 20:02                                                 ` Ingo Molnar
@ 2008-06-29 20:14                                                   ` Maciej W. Rozycki
  2008-06-29 23:06                                                     ` Rafael J. Wysocki
  2008-06-29 22:59                                                   ` Rafael J. Wysocki
  1 sibling, 1 reply; 73+ messages in thread
From: Maciej W. Rozycki @ 2008-06-29 20:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rafael J. Wysocki, Matthew Garrett, Stephen Rothwell, linux-next,
	LKML, Thomas Gleixner, ACPI Devel Maling List, Len Brown,
	Andi Kleen

On Sun, 29 Jun 2008, Ingo Molnar wrote:

> >  You may argue this is a regression, but this is simply the cost paid 
> > for progress -- the kernel stays within the spec as defined both by 
> > ACPI and MPS, we have just started using a different configuration now 
> > and an interrupt source override provided by the manufacturer 
> > explicitly states INTIN2 is good to use.  In a sense you were simply 
> > lucky previously the kernel was bad enough with the way it configured 
> > the timer through the I/O APIC it failed completely avoiding the bug 
> > in your firmware.  Now the bug has got uncovered.
> 
> well as long as we eliminate the bad effects around via DMI exceptions 
> nobody will feel the need to argue whether it's a regression ;-) [this 
> problem could be argued to be a regression, even if it's caused by prior 
> luck/stupidity of Linux. We have to live with the effects of our 
> mistakes.]

 Of course -- this is the only reason I can be bothered with the issue in
the first place.  Otherwise, I would have said: 'Get the manufacturer to
fix it, use "noapic" or live with a local patch.'

 This is actually how I have kept one of my old MPS SMP systems up for
years now -- it has a broken MP table which prevents interrupts from
working when too many PCI option cards are present, so I have prepared a
patch for patching the table manually.  I proposed it once, which you may
recall, but it was rejected on the grounds of the syntax being too tough
for a poor average user to comprehend.  I am sure more systems would
benefit as MP table breakages used to be quite common.

 Here the simple workaround was "noapic" too, so everyone else could be
happy and I have been happy to keep the patch and use the capabilities of
the piece of hardware properly despite its broken firmware.

  Maciej


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-29 20:07                                                 ` Matthew Garrett
@ 2008-06-29 20:16                                                   ` Maciej W. Rozycki
  0 siblings, 0 replies; 73+ messages in thread
From: Maciej W. Rozycki @ 2008-06-29 20:16 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Rafael J. Wysocki, Ingo Molnar, Stephen Rothwell, linux-next,
	LKML, Thomas Gleixner, ACPI Devel Maling List, Len Brown

On Sun, 29 Jun 2008, Matthew Garrett wrote:

> >  Is the DMI ID *guaranteed* to be changed with an update to the DSDT?  
> 
> The BIOS version is, yes.

 Good to know, thanks.

> > Anyway, you cannot infer from a given DMI ID a broken DSDT is present, so
> > you will have to repeat the experience of adding another DMI ID whenever a
> > user hits this broken DSDT with another piece of hardware.  As long as you
> > are able to pull this piece of information from that user, that is...
> 
> These are the only two pieces of hardware ever reported to have this 
> problem. Nobody appears to have demonstrated it on any other HP systems, 
> and any non-HP systems would have a different identifier string. With 
> the SB400 being superseded, I don't expect us to see any more machines 
> with the same ID.

 Fair enough.  As I wrote, I'll send an update shortly.

  Maciej


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-29 19:56                                               ` Maciej W. Rozycki
  2008-06-29 20:02                                                 ` Ingo Molnar
@ 2008-06-29 22:56                                                 ` Rafael J. Wysocki
  2008-06-30  1:00                                                   ` Maciej W. Rozycki
  1 sibling, 1 reply; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-29 22:56 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Matthew Garrett, Ingo Molnar, Stephen Rothwell, linux-next, LKML,
	Thomas Gleixner, ACPI Devel Maling List, Len Brown, Andi Kleen,
	Andrew Morton, Linus Torvalds

On Sunday, 29 of June 2008, Maciej W. Rozycki wrote:
> On Sun, 29 Jun 2008, Rafael J. Wysocki wrote:
> 
> > >  It is the reverse -- checking the DSDT ID is coarser, matching all the
> > > systems that use the broken firmware.
> > 
> > How can you tell which DSDTs are broken until somebody reports them?
> 
>  We know the DSDT matching OEM ID: "HP ", OEM Table ID: "SB400" and OEM
> Revision: 10000 is broken, because it has already been reported.  If these
> properties are checked, there is no need for further reports providing
> us with DMI IDs of systems using the same DSDT.  The revision can be used
> to make sure a good one is not selected inadvertently.
> 
> > > With DMI we may face both false positives and false negatives which imply
> > > further maintenance actions.   
> > 
> > With DSDT matching you're likely to end up breaking systems the users of
> > which have not reported problems.
> 
>  s/breaking/fixing/

No.

If your patch is applied in its present form, all of the boxes from HP
nx6x25 series won't work any more, although they worked before.

If you use DSDT matching and all of the DSDTs of these boxes are similarly
broken, which is quite possible, some of them will not be matched and will be
broken.  If you use DMI matching, there's a chance we'll cover all of them.

>  Besides, there is nothing to break here -- the mixed interrupt mode will
> be used when the workaround is selected and the mode has to work or pieces
> of legacy software, such as DOS, which make use of the 8259A would not
> work.

I'm not sure what you mean here.

> > >  Have you tried to report the issue through the usual manufacturer's
> > > support channels, BTW?
> > 
> > My experience with HP indicates that it would have been a waste of time.
> 
>  Well, if you do not report problems, they may never know of their
> existence and obviously will have no way to fix them.  They may ignore
> your report, but at least you can say you have done your part.  Based on
> the experience the next time you may choose another manufacturer when
> making a purchase decision.

Surely I will, but as long as I have the HP box here, I need to live with it.
Also, there are other people who happen to use the affected boxes and do not
expect them to stop working with future kernel releases.

> > Apart from this, I've always been against forcing people to upgrade their
> > BIOSes just because we just had a brilliant idea that made the kernel stop
> > working on their systems.  IMO it's extremely user-unfriendly and plain wrong.
> 
>  The BIOS is broken and should be fixed -- it is not our mission to fix up
> somebody else's faults.  As a courtesy to users we may try to work around
> problems that are hard for them to cope with, but in a sense this is
> promoting bad quality of hardware: "Don't bother doing this properly --
> they will fix it up somehow in the OS anyway."
> 
>  You may argue this is a regression,

This IS a regression.

The patch breaks a perfectly working configuration and something like this
_always_ is a regression.  The root cause of this regression may be a BIOS
breakage, but you have to take this into account, this way or another.

We can't really afford breaking working configurations.

>  but this is simply the cost paid for progress -- 

Sorry, with this philosophy I could reject 90% of suspend-related bug reports.

>  the kernel stays within the spec as defined both by ACPI and 
> MPS, we have just started using a different configuration now and an
> interrupt source override provided by the manufacturer explicitly states
> INTIN2 is good to use.  In a sense you were simply lucky previously the
> kernel was bad enough with the way it configured the timer through the I/O
> APIC it failed completely avoiding the bug in your firmware.  Now the bug
> has got uncovered.

No, you are wrong.  The kernel previously _worked_ on the affected boxes and
now it _doesn't_.  The reason why it worked before doesn't matter one whit.

If we did something that made it work despite the BIOS brokenness, we have to
continue doing it on these particular boxes.

>  And last but not least, you can always specify "noapic" to get away --
> that's a perfectly good workaround.

Which was unnecessary before your patch.

>  I'll cook up the part I promised shortly and leave it up to the others to
> "wire" it to some breakage detection logic.

Please do, perhaps I'll be able to fix it up.

Still, you should pay more attention to what your patches may break, IMO,
although those systems may contain broken BIOSes or something.  If they worked
before, they are expected to continue to work and everything that violates this
expectation is a regression.  Sorry, but that's how it goes.

Thanks,
Rafael


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-29 20:02                                                 ` Ingo Molnar
  2008-06-29 20:14                                                   ` Maciej W. Rozycki
@ 2008-06-29 22:59                                                   ` Rafael J. Wysocki
  1 sibling, 0 replies; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-29 22:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Maciej W. Rozycki, Matthew Garrett, Stephen Rothwell, linux-next,
	LKML, Thomas Gleixner, ACPI Devel Maling List, Len Brown,
	Andi Kleen, Andrew Morton, Linus Torvalds

On Sunday, 29 of June 2008, Ingo Molnar wrote:
> 
> * Maciej W. Rozycki <macro@linux-mips.org> wrote:
> 
> >  You may argue this is a regression, but this is simply the cost paid 
> > for progress -- the kernel stays within the spec as defined both by 
> > ACPI and MPS, we have just started using a different configuration now 
> > and an interrupt source override provided by the manufacturer 
> > explicitly states INTIN2 is good to use.  In a sense you were simply 
> > lucky previously the kernel was bad enough with the way it configured 
> > the timer through the I/O APIC it failed completely avoiding the bug 
> > in your firmware.  Now the bug has got uncovered.
> 
> well as long as we eliminate the bad effects around via DMI exceptions 
> nobody will feel the need to argue whether it's a regression ;-)

If all boxes affected by this particular breakage are covered by DMI-based
workarounds, they will continue to work and that won't be any regression.
The point is that the patch should go along with such workarounds.

> [this problem could be argued to be a regression, even if it's caused by
> prior luck/stupidity of Linux. We have to live with the effects of our
> mistakes.] 

That's exactly right.

Thanks,
Rafael


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-29 20:14                                                   ` Maciej W. Rozycki
@ 2008-06-29 23:06                                                     ` Rafael J. Wysocki
  2008-06-30  0:45                                                       ` Andi Kleen
  2008-06-30  1:39                                                       ` Maciej W. Rozycki
  0 siblings, 2 replies; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-29 23:06 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Ingo Molnar, Matthew Garrett, Stephen Rothwell, linux-next, LKML,
	Thomas Gleixner, ACPI Devel Maling List, Len Brown, Andi Kleen,
	Andrew Morton, Linus Torvalds

On Sunday, 29 of June 2008, Maciej W. Rozycki wrote:
> On Sun, 29 Jun 2008, Ingo Molnar wrote:
> 
> > >  You may argue this is a regression, but this is simply the cost paid 
> > > for progress -- the kernel stays within the spec as defined both by 
> > > ACPI and MPS, we have just started using a different configuration now 
> > > and an interrupt source override provided by the manufacturer 
> > > explicitly states INTIN2 is good to use.  In a sense you were simply 
> > > lucky previously the kernel was bad enough with the way it configured 
> > > the timer through the I/O APIC it failed completely avoiding the bug 
> > > in your firmware.  Now the bug has got uncovered.
> > 
> > well as long as we eliminate the bad effects around via DMI exceptions 
> > nobody will feel the need to argue whether it's a regression ;-) [this 
> > problem could be argued to be a regression, even if it's caused by prior 
> > luck/stupidity of Linux. We have to live with the effects of our 
> > mistakes.]
> 
>  Of course -- this is the only reason I can be bothered with the issue in
> the first place.  Otherwise, I would have said: 'Get the manufacturer to
> fix it, use "noapic" or live with a local patch.'

In that case your patch would surely make it to the regression list.

>  This is actually how I have kept one of my old MPS SMP systems up for
> years now -- it has a broken MP table which prevents interrupts from
> working when too many PCI option cards are present, so I have prepared a
> patch for patching the table manually.  I proposed it once, which you may
> recall, but it was rejected on the grounds of the syntax being too tough
> for a poor average user to comprehend.  I am sure more systems would
> benefit as MP table breakages used to be quite common.
> 
>  Here the simple workaround was "noapic" too, so everyone else could be
> happy and I have been happy to keep the patch and use the capabilities of
> the piece of hardware properly despite its broken firmware.

Again.  If there's a configuration that didn't need any manual workarounds
before, it's expected to continue to work without any manual workarounds and
as a patch submitter, it's _your_ burden to make that happen.

Otherwise you throw this burden onto users who
(1) don't expect things to stop working,
(2) may not be able to figure out themselves what the right workaround is,
(3) may not be able to make hardware manufacturers do anything.

If there's a configuration that worked before your patch and doesn't work
after it, you're hurting the users of that configuration.

Thanks,
Rafael


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-29 23:06                                                     ` Rafael J. Wysocki
@ 2008-06-30  0:45                                                       ` Andi Kleen
  2008-06-30  0:47                                                         ` Matthew Garrett
  2008-06-30  1:39                                                       ` Maciej W. Rozycki
  1 sibling, 1 reply; 73+ messages in thread
From: Andi Kleen @ 2008-06-30  0:45 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Maciej W. Rozycki, Ingo Molnar, Matthew Garrett, Stephen Rothwell,
	linux-next, LKML, Thomas Gleixner, ACPI Devel Maling List,
	Len Brown, Andi Kleen, Andrew Morton, Linus Torvalds

> 
> Otherwise you throw this burden onto users who
> (1) don't expect things to stop working,
> (2) may not be able to figure out themselves what the right workaround is,
> (3) may not be able to make hardware manufacturers do anything.

Right thing would be to revert the guilty patches until these
problems are resolved.

> 
> If there's a configuration that worked before your patch and doesn't work
> after it, you're hurting the users of that configuration.

... also past experience is that DMI tables don't work well for this.
We tried that early when ACPI was still very problematic and it turned
out to be a flawed non-scalable strategy.

Typically the configurations causing problems are in multiple motherboards
with different DMI strings and it's very difficult to catch them all.

Also sometimes BIOS behaviour changes over versions and that's tricky to catch
with the standard DMI matches.

One way that would half way scale is to check for specific configurations
based on PCI-IDs and knowledge of the config space of these chipsets,
although it's also not ideal because often multiple chipset generations
with different PCI-IDs have similar issues.

-Andi


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-30  0:45                                                       ` Andi Kleen
@ 2008-06-30  0:47                                                         ` Matthew Garrett
  0 siblings, 0 replies; 73+ messages in thread
From: Matthew Garrett @ 2008-06-30  0:47 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Rafael J. Wysocki, Maciej W. Rozycki, Ingo Molnar,
	Stephen Rothwell, linux-next, LKML, Thomas Gleixner,
	ACPI Devel Maling List, Len Brown, Andi Kleen, Andrew Morton,
	Linus Torvalds

On Mon, Jun 30, 2008 at 02:45:44AM +0200, Andi Kleen wrote:

> ... also past experience is that DMI tables don't work well for this.
> We tried that early when ACPI was still very problematic and it turned
> out to be a flawed non-scalable strategy.
> 
> Typically the configurations causing problems are in multiple motherboards
> with different DMI strings and it's very difficult to catch them all.
> 
> Also sometimes BIOS behaviour changes over versions and that's tricky to catch
> with the standard DMI matches.

In this specific case, the problem is clearly due to nonsensical code in 
the DSDT that alters the thermal trip points based on ioapic 
configuration. It's only been observed on two almost identical models 
from one manufacturer, and doesn't occur on any other known machines 
with the same chipset. I think it's safe to special-case.

-- 
Matthew Garrett | mjg59@srcf.ucam.org


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-29 22:56                                                 ` Rafael J. Wysocki
@ 2008-06-30  1:00                                                   ` Maciej W. Rozycki
  2008-06-30  9:06                                                     ` Matthew Garrett
  0 siblings, 1 reply; 73+ messages in thread
From: Maciej W. Rozycki @ 2008-06-30  1:00 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Matthew Garrett, Ingo Molnar, Stephen Rothwell, linux-next, LKML,
	Thomas Gleixner, ACPI Devel Maling List, Len Brown, Andi Kleen,
	Andrew Morton, Linus Torvalds

On Mon, 30 Jun 2008, Rafael J. Wysocki wrote:

> > > With DSDT matching you're likely to end up breaking systems the users of
> > > which have not reported problems.
> > 
> >  s/breaking/fixing/
> 
> No.
> 
> If your patch is applied in its present form, all of the boxes from HP
> nx6x25 series won't work any more, although they worked before.

 I have not proposed a patch to do DSDT matching, so you mean Matthew's 
patch, right?  Well, there are two possibilities -- either a true or a 
false positive.  For a true positive, the patch will work around the DSDT 
problem by disabling the I/O APIC route for the timer interrupt.  For a 
false positive, the effect will be the same, although unnecessary.  I am 
not sure what you think will not work anymore.

> If you use DSDT matching and all of the DSDTs of these boxes are similarly
> broken, which is quite possible, some of them will not be matched and will be
> broken.  If you use DMI matching, there's a chance we'll cover all of them.

 The DSDT is clearly associated with the SB400 southbridge.  I would not
expect a given make and model to use different southbridges across the
series, so there will only be one DSDT per model, possibly in a number of
revisions.  On the other hand different models may use the same
southbridge and hence the same DSDT.

 Note that Matthew's made a point here, that apparently there are only two 
models using this southbridge and new ones are unlikely to be released, so 
my note is for a reference only.

> >  Besides, there is nothing to break here -- the mixed interrupt mode will
> > be used when the workaround is selected and the mode has to work or pieces
> > of legacy software, such as DOS, which make use of the 8259A would not
> > work.
> 
> I'm not sure what you mean here.

 The workaround makes the system use the mixed interrupt mode (well, to 
be honest, it is a simplification, because LINT0 is tried as a native 
interrupt before falling back to ExtINTA), which means some interrupts go 
through the I/O APIC and some go through the 8259A.  The route through the 
8259A has to work, because otherwise legacy software would fail.

 Without the workaround the APIC mode would be used, where all interrupts
go through the I/O APIC (but it fails on your system).

 The third alternative is the virtual-wire mode, the default at the
bootstrap (or IOW the point control is passed to Linux from the firmware)
and then forced to stay with the "noapic" option, where all interrupts go
through the 8259A.
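
The three modes above can be summarised in a small sketch.  This is a
deliberately simplified model, not kernel code: the enum and function
names are hypothetical, and it assumes that in the mixed mode only the
timer is routed through the 8259A while everything else stays on the I/O
APIC (it also ignores the LINT0 attempt mentioned above).

```c
#include <stdbool.h>

/* Hypothetical names for the three interrupt delivery modes. */
enum irq_mode {
	MODE_APIC,          /* all interrupts through the I/O APIC */
	MODE_MIXED,         /* timer via the 8259A, the rest via the I/O APIC */
	MODE_VIRTUAL_WIRE   /* all interrupts through the 8259A ("noapic") */
};

enum irq_path {
	PATH_IOAPIC,
	PATH_8259A
};

/* Pick a mode from the "noapic" option and the timer workaround flag. */
static enum irq_mode select_mode(bool noapic, bool timer_workaround)
{
	if (noapic)
		return MODE_VIRTUAL_WIRE;
	return timer_workaround ? MODE_MIXED : MODE_APIC;
}

/* Which controller delivers a given interrupt in a given mode. */
static enum irq_path route_for(enum irq_mode mode, bool is_timer)
{
	switch (mode) {
	case MODE_APIC:
		return PATH_IOAPIC;
	case MODE_MIXED:
		return is_timer ? PATH_8259A : PATH_IOAPIC;
	case MODE_VIRTUAL_WIRE:
		return PATH_8259A;
	}
	return PATH_8259A; /* unreachable for valid enum values */
}
```

In this model the workaround changes the path of exactly one interrupt,
the timer, which is why the 8259A route has to be functional for it to
help.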

> >  Well, if you do not report problems, they may never know of their
> > existence and obviously will have no way to fix them.  They may ignore
> > your report, but at least you can say you have done your part.  Based on
> > the experience the next time you may choose another manufacturer when
> > making a purchase decision.
> 
> Surely I will, but as long as I have the HP box here, I need to live with it.
> Also, there are other people who happen to use the affected boxes and do not
> expect them to stop working with future kernel releases.

 There's always the "noapic" option.  It was added for the very purpose of
dealing with various kinds of breakages manufacturers have been happy to
put into I/O APIC interrupts for years and is meant to work.  Please
report if there is a problem with the option with your system.

> >  The BIOS is broken and should be fixed -- it is not our mission to fix up
> > somebody else's faults.  As a courtesy to users we may try to work around
> > problems that are hard for them to cope with, but in a sense this is
> > promoting bad quality of hardware: "Don't bother doing this properly --
> > they will fix it up somehow in the OS anyway."
> > 
> >  You may argue this is a regression,
> 
> This IS a regression.
> 
> The patch breaks a perfectly working configuration and something like this
> _always_ is a regression.  The root cause of this regression may be a BIOS
> breakage, but you have to take this into account, this way or another.
> 
> We can't really afford breaking working configurations.

 Noted, with the exception yours is not a "perfectly working
configuration" -- notice how the timer interrupt is set up twice and fails
before the third fallback recovers.  Were it not for our persistence in
keeping it going despite the hardware breakage, we would have panic()ked at
the very first failure.  Now the attempts have been improved so that the
second one
already succeeds, but it does not make your piece of hardware less broken.

> >  but this is simply the cost paid for progress -- 
> 
> Sorry, with this philosophy I could reject 90% of suspend-related bug reports.

 Are these genuine bugs in code you take responsibility for or bugs in
some other code?

> >  the kernel stays within the spec as defined both by ACPI and 
> > MPS, we have just started using a different configuration now and an
> > interrupt source override provided by the manufacturer explicitly states
> > INTIN2 is good to use.  In a sense you were simply lucky previously the
> > kernel was bad enough with the way it configured the timer through the I/O
> > APIC it failed completely avoiding the bug in your firmware.  Now the bug
> > has got uncovered.
> 
> No, you are wrong.  The kernel previously _worked_ on the affected boxes and
> now it _doesn't_.  The reason why it worked before doesn't matter one whit.
> 
> If we did something that made it work despite the BIOS brokenness, we have to
> continue doing it on these particular boxes.

 This is what the specs are for to resolve.  We keep to the spec on one
side and the hardware/firmware has to on the other -- this is a contract 
set between components.  Not some particular version of a piece of 
software or equipment.  If we stopped using parts of some spec, because 
there are broken pieces of equipment out there, then we would soon reach 
the point we could not use the spec at all.

 To give you an example: let's assume we have a class of hardware which
comes in two generations, G1 and G2.  Both generations were designed to a
separate open spec each and the newer one may optionally implement a
crippled legacy mode where the older revision of the spec is used;
initially all G2 hardware implements this mode.

 Let's assume we have version V1 of Linux which supports the legacy mode
only, which works correctly with all known G1 and G2 hardware at the time
of its release.  Now in version V2 (V2 = V1 + 1) native Linux support for
G2 hardware has been added.  Unfortunately one of the manufacturers of G2
hardware misinterpreted the spec for its H2 and an essential status bit B2
is negated compared to the spec and to all the other pieces of G2
hardware.  As a result, code updated to work with G2 natively does not
work on this H2 piece of equipment.

 This is clearly a regression, because this H2 piece of equipment used to
work flawlessly before.  What should we do then?  I think we have four
notable choices:

1. Ignore all the mix-up and blame the manufacturer.  The hardware is
   faulty and it is up to users to return it to the supplier for money 
   back.

2. Scrap all the G2 support because it introduces a regression.  We were 
   not fast enough to implement it before someone broke the spec and we
   are doomed.  Sorry.

3. Add an option that would flip the meaning of B2 or force the legacy 
   mode.  This way there is no negative impact on good G2 hardware.

4. Discover and special-case H2, proceeding with the option #3 as above 
   automatically.  Likewise, no negative impact.

 In an ideal world (but not as ideal for hardware bugs not to happen) the
#1 would be the natural option -- the offender would pay the price of
their mistake.  Unfortunately we do not live in an ideal world and expect
the offender to ignore the blame.  Therefore we are left with the
remaining options.  You seem to insist on the #2 and I argue for either
the #3 or the #4.

 All of the three deal with the problem somehow.  Unfortunately I fail to
see any advantage from the #2, but I look forward to justification I may
have missed.  OTOH, the disadvantage from the #3 is negligible -- an 
additional option put somewhere -- and there is no disadvantage from the 
#4 that I would recognise.  Therefore I fail to see why the #2 would have 
to be chosen.
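
 Options #3 and #4 from the list above could look roughly like the sketch
below.  Everything in it is hypothetical: the bit position of B2, the
model string "H2" and the function names exist only for the sake of the
example.  The point is merely that good G2 hardware never takes the
negating path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define STATUS_B2 (1u << 2)	/* hypothetical position of status bit B2 */

/* Option #4: models known to negate B2; "H2" is the example's placeholder. */
static const char *const b2_negated_models[] = { "H2" };

/* Return true if the model is on the list of known offenders. */
bool model_negates_b2(const char *model)
{
	size_t i;
	size_t n = sizeof(b2_negated_models) / sizeof(b2_negated_models[0]);

	assert(model != NULL);
	for (i = 0; i < n; i++)
		if (strcmp(model, b2_negated_models[i]) == 0)
			return true;
	return false;
}

/*
 * Return B2 as the spec defines it.  negate_b2 comes either from a manual
 * option (#3) or from model_negates_b2() (#4).
 */
bool read_b2(unsigned int status, bool negate_b2)
{
	bool raw = (status & STATUS_B2) != 0;

	return negate_b2 ? !raw : raw;
}
```

 A caller would use read_b2(status, model_negates_b2(model)) for automatic
detection, or pass a user-supplied flag instead for the manual variant.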

> >  And last but not least, you can always specify "noapic" to get away --
> > that's a perfectly good workaround.
> 
> Which was unnecessary before your patch.

 It would not be necessary with your piece of hardware running Linux 2.2
too.  My old SMP board (mentioned in another mail in this thread) stopped
working without "noapic" at one point because of its MP table breakage too
and yet "noapic" has not become the default since then.

> >  I'll cook up the part I promised shortly and leave it up to the others to
> > "wire" it to some breakage detection logic.
> 
> Please do, perhaps I'll be able to fix it up.

 Nothing to do on your side except for further testing perhaps, as I
think we have agreed upon Matthew's proposal.  I'll try to get it wrapped
up today, though not necessarily before noon. ;)

> Still, you should pay more attention to what your patches may break, IMO,
> although those systems may contain broken BIOSes or something.  If they worked
> before, they are expected to continue to work and everything that violates this
> expectation is a regression.  Sorry, but that's how it goes.

 It is not the lack of attention -- please do me a favour and try not to
give me unjustified pieces of advice.  Thank you.

 I have explicitly warned the patch may break things and was pretty much
confident it would -- see my comment accompanying the original submission
at "http://lkml.org/lkml/2008/5/27/306".  I was pretty much confident it
would fix more systems than it would break too.  We are dealing with
substandard hardware/firmware here and these painful efforts should not be
necessary at all in the first place.  Your system is an example of a
particularly degenerate breakage, where the mode of failure triggered is
not immediately disastrous, and you are lucky a culprit has been found at
all.

 In all cases thanks a lot for your testing -- you have just uncovered one
example of the inevitable and I am trying to tackle it the best way
possible.

  Maciej

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-29 23:06                                                     ` Rafael J. Wysocki
  2008-06-30  0:45                                                       ` Andi Kleen
@ 2008-06-30  1:39                                                       ` Maciej W. Rozycki
  2008-06-30  9:24                                                         ` Andi Kleen
  2008-06-30 10:41                                                         ` Rafael J. Wysocki
  1 sibling, 2 replies; 73+ messages in thread
From: Maciej W. Rozycki @ 2008-06-30  1:39 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Ingo Molnar, Matthew Garrett, Stephen Rothwell, linux-next, LKML,
	Thomas Gleixner, ACPI Devel Maling List, Len Brown, Andi Kleen,
	Andrew Morton, Linus Torvalds

On Mon, 30 Jun 2008, Rafael J. Wysocki wrote:

> > > well as long as we eliminate the bad effects around via DMI exceptions 
> > > nobody will feel the need to argue whether it's a regression ;-) [this 
> > > problem could be argued to be a regression, even if it's caused by prior 
> > > luck/stupidity of Linux. We have to live with the effects of our 
> > > mistakes.]
> > 
> >  Of course -- this is the only reason I can be bothered with the issue in
> > the first place.  Otherwise, I would have said: 'Get the manufacturer to
> > fix it, use "noapic" or live with a local patch.'
> 
> In that case your patch would surely make it to the regression list.

 Please be careful -- you seem to contradict yourself.  I wrote to the
effect of: "If this wasn't a regression, I would have said [...]" and your
reply is: "In that case your [non-regressing] patch would surely make it
to the regression list."

> >  This is actually how I have kept one of my old MPS SMP systems up for
> > years now -- it has a broken MP table which prevents interrupts from
> > working when too many PCI option cards are present, so I have prepared a
> > patch for patching the table manually.  I proposed it once, which you may
> > recall, but it was rejected on the grounds of the syntax being too tough
> > to comprehend to a poor average user being.  I am sure more systems would
> > benefit as MP table breakages used to be quite common.
> > 
> >  Here the simple workaround was "noapic" too, so everyone else could be
> > happy and I have been happy to keep the patch and use the capabilities of
> > the piece of hardware properly despite its broken firmware.
> 
> Again.  If there's a configuration that didn't need any manual workarounds
> before, it's expected to continue to work without any manual workarounds and
> as a patch submitter, it's _your_ burden to make that happen.

 That is certainly true for standard hardware.  We have to take
responsibility for own bugs, sure.  I cannot readily understand why you
apparently try to imply hardware vendors do not.

> Otherwise you throw this burden onto users who
> (1) don't expect things to stop working,
> (2) may not be able to figure out themselves what the right workaround is,
> (3) may not be able to make hardware manufacturers do anything.
> 
> If there's a configuration that worked before your patch and doesn't work
> after it, you're hurting the users of that configuration.

 Honestly?  These poor users who have no clue or time to follow the
development lists and/or fix bugs themselves should report the problem to
the supplier of their Linux distribution, who would sort it out by, first,
providing a temporary workaround till the problem is sorted out correctly,
second, contacting the hardware vendor through a recognised channel to
request the problem to be investigated and fixed properly.  I am fairly
sure all the reputable (responsible?) distribution vendors have service
agreements already in place with all the major hardware vendors and all
the minor hardware vendors will be happy to cooperate anyway so as not to
be minor vendors anymore.  This is why I have asked for points of contact 
repeatedly in this thread.

 Of course it leaves hobbyist distributions at a slight disadvantage, but
their users are sort of expected to be "power users" (otherwise they
wouldn't have been hobbyists, would they?) and adding an option or a patch
even should not be a problem for them.  We may try to do our best to help
them, but not at the price of penalising good hardware.

  Maciej

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-30  1:00                                                   ` Maciej W. Rozycki
@ 2008-06-30  9:06                                                     ` Matthew Garrett
  2008-06-30 15:29                                                       ` Maciej W. Rozycki
  0 siblings, 1 reply; 73+ messages in thread
From: Matthew Garrett @ 2008-06-30  9:06 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Rafael J. Wysocki, Ingo Molnar, Stephen Rothwell, linux-next,
	LKML, Thomas Gleixner, ACPI Devel Maling List, Len Brown,
	Andi Kleen, Andrew Morton, Linus Torvalds

On Mon, Jun 30, 2008 at 02:00:04AM +0100, Maciej W. Rozycki wrote:

>  The DSDT is clearly associated with the SB400 southbridge.  I would not
> expect a given make and model to use different southbridges across the
> series, so there will only be one DSDT per model, possibly in a number of
> revisions.  On the other hand different models may use the same
> southbridge and hence the same DSDT.
>
>  Note that Matthew's made a point here, that apparently there are only two 
> models using this southbridge and new ones are unlikely to be released, so 
> my note is for a reference only.

No, there's many other systems using the same southbridge that don't 
have the bizarre DSDT code and so don't show this behaviour.
 
-- 
Matthew Garrett | mjg59@srcf.ucam.org

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-30  1:39                                                       ` Maciej W. Rozycki
@ 2008-06-30  9:24                                                         ` Andi Kleen
  2008-07-02  1:19                                                           ` Maciej W. Rozycki
  2008-06-30 10:41                                                         ` Rafael J. Wysocki
  1 sibling, 1 reply; 73+ messages in thread
From: Andi Kleen @ 2008-06-30  9:24 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Rafael J. Wysocki, Ingo Molnar, Matthew Garrett, Stephen Rothwell,
	linux-next, LKML, Thomas Gleixner, ACPI Devel Maling List,
	Len Brown, Andi Kleen, Andrew Morton, Linus Torvalds


>  That is certainly true for standard hardware.  We have to take
> responsibility for own bugs, sure.  I cannot readily understand why you
> apparently try to imply hardware vendors do not.

Sorry Maciej, you're totally off base on that. On consumer hardware
vendors very rarely fix anything after release of the machine
and in general users expect Linux to work around any BIOS or
hardware bugs that happen (especially if it's a regression and worked
before)

So you either need to provide a workaround for the problem or your
patches should be reverted.

-Andi

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-30  1:39                                                       ` Maciej W. Rozycki
  2008-06-30  9:24                                                         ` Andi Kleen
@ 2008-06-30 10:41                                                         ` Rafael J. Wysocki
  2008-07-02  1:48                                                           ` Maciej W. Rozycki
  1 sibling, 1 reply; 73+ messages in thread
From: Rafael J. Wysocki @ 2008-06-30 10:41 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Ingo Molnar, Matthew Garrett, Stephen Rothwell, linux-next, LKML,
	Thomas Gleixner, ACPI Devel Maling List, Len Brown, Andi Kleen,
	Andrew Morton, Linus Torvalds

On Monday, 30 of June 2008, Maciej W. Rozycki wrote:
> On Mon, 30 Jun 2008, Rafael J. Wysocki wrote:
> 
> > > > well as long as we eliminate the bad effects around via DMI exceptions 
> > > > nobody will feel the need to argue whether it's a regression ;-) [this 
> > > > problem could be argued to be a regression, even if it's caused by prior 
> > > > luck/stupidity of Linux. We have to live with the effects of our 
> > > > mistakes.]
> > > 
> > >  Of course -- this is the only reason I can be bothered with the issue in
> > > the first place.  Otherwise, I would have said: 'Get the manufacturer to
> > > fix it, use "noapic" or live with a local patch.'
> > 
> > In that case your patch would surely make it to the regression list.
> 
>  Please be careful -- you seem to contradict yourself.  I wrote to the
> effect of: "If this wasn't a regression, I would have said [...]" and your
> reply is: "In that case your [non-regressing] patch would surely make it
> to the regression list."

Sorry, I didn't parse that paragraph correctly.

> > >  This is actually how I have kept one of my old MPS SMP systems up for
> > > years now -- it has a broken MP table which prevents interrupts from
> > > working when too many PCI option cards are present, so I have prepared a
> > > patch for patching the table manually.  I proposed it once, which you may
> > > recall, but it was rejected on the grounds of the syntax being too tough
> > > to comprehend to a poor average user being.  I am sure more systems would
> > > benefit as MP table breakages used to be quite common.
> > > 
> > >  Here the simple workaround was "noapic" too, so everyone else could be
> > > happy and I have been happy to keep the patch and use the capabilities of
> > > the piece of hardware properly despite its broken firmware.
> > 
> > Again.  If there's a configuration that didn't need any manual workarounds
> > before, it's expected to continue to work without any manual workarounds and
> > as a patch submitter, it's _your_ burden to make that happen.
> 
>  That is certainly true for standard hardware.  We have to take
> responsibility for own bugs, sure.  I cannot readily understand why you
> apparently try to imply hardware vendors do not.
> 
> > Otherwise you throw this burden onto users who
> > (1) don't expect things to stop working,
> > (2) may not be able to figure out themselves what the right workaround is,
> > (3) may not be able to make hardware manufacturers do anything.
> > 
> > If there's a configuration that worked before your patch and doesn't work
> > after it, you're hurting the users of that configuration.
> 
>  Honestly?  These poor users who have no clue or time to follow the
> development lists and/or fix bugs themselves should report the problem to
> the supplier of their Linux distribution, who would sort it out by, first,
> providing a temporary workaround till the problem is sorted out correctly,
> second, contacting the hardware vendor through a recognised channel to
> request the problem to be investigated and fixed properly.  I am fairly
> sure all the reputable (responsible?) distribution vendors have service
> agreements already in place with all the major hardware vendors and all
> the minor hardware vendors will be happy to cooperate anyway so as not to
> be minor vendors anymore.  This is why I have asked for points of contact 
> repeatedly in this thread.
> 
>  Of course it leaves hobbyist distributions at a slight disadvantage, but
> their users are sort of expected to be "power users" (otherwise they
> wouldn't have been hobbyists, would they?) and adding an option or a patch
> even should not be a problem for them.  We may try to do our best to help
> them, but not at the price of penalising good hardware.

Well, there are lots of pieces of hardware that are not up to the
specifications, more or less, and I don't think that's a good enough reason
for us to refuse to support them.  The same applies to BIOSes IMO.

Of course, the _default_ should be to follow the spec, but if that doesn't work
on given hardware/BIOS combination and we know what to do to handle it, we
should just handle it instead of asking users to fix their BIOSes.

I have seen enough failed BIOS upgrades to be very cautious about such things.
Certainly, I wouldn't have seriously asked anyone to upgrade the BIOS in a
notebook, because if that had failed, the user would have ended up with a piece
of electronic junk.

Thanks,
Rafael

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-30  9:06                                                     ` Matthew Garrett
@ 2008-06-30 15:29                                                       ` Maciej W. Rozycki
  2008-06-30 15:35                                                         ` Matthew Garrett
  0 siblings, 1 reply; 73+ messages in thread
From: Maciej W. Rozycki @ 2008-06-30 15:29 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Rafael J. Wysocki, Ingo Molnar, Stephen Rothwell, linux-next,
	LKML, Thomas Gleixner, ACPI Devel Maling List, Len Brown,
	Andi Kleen, Andrew Morton, Linus Torvalds

On Mon, 30 Jun 2008, Matthew Garrett wrote:

> >  Note that Matthew's made a point here, that apparently there are only two 
> > models using this southbridge and new ones are unlikely to be released, so 
> > my note is for a reference only.
> 
> No, there's many other systems using the same southbridge that don't 
> have the bizarre DSDT code and so don't show this behaviour.

 I meant "... two models [of HP laptops] using this southbridge..." of 
course. :)

 Now I did a search of the Internet and have become puzzled.  Apparently
there *are* other devices using this DSDT.  See for example a thread at:  
"http://bbs.archlinux.org/viewtopic.php?pid=359559" where an owner of an
HP Compaq 6715s has some other problems with a DSDT which coincidentally
is the very same HP/SB400/10000 (though built with a different ASL
compiler, hmm...).

 Matthew, where did you get these DMI IDs from? -- I cannot see them being 
reported in any bootstrap log.

  Maciej

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-30 15:29                                                       ` Maciej W. Rozycki
@ 2008-06-30 15:35                                                         ` Matthew Garrett
  0 siblings, 0 replies; 73+ messages in thread
From: Matthew Garrett @ 2008-06-30 15:35 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Rafael J. Wysocki, Ingo Molnar, Stephen Rothwell, linux-next,
	LKML, Thomas Gleixner, ACPI Devel Maling List, Len Brown,
	Andi Kleen, Andrew Morton, Linus Torvalds

On Mon, Jun 30, 2008 at 04:29:33PM +0100, Maciej W. Rozycki wrote:

>  Now I did a search of the Internet and have become puzzled.  Apparently
> there *are* other devices using this DSDT.  See for example a thread at:  
> "http://bbs.archlinux.org/viewtopic.php?pid=359559" where an owner of an
> HP Compaq 6715s has some other problems with a DSDT which coincidentally
> is the very same HP/SB400/10000 (though built with a different ASL
> compiler, hmm...).

Hm. It'd be interesting to know whether the bizarre debug code is in 
there. What's even more interesting is that the 6715s is an SB600, not 
an SB400...

>  Matthew, where did you get these DMI IDs from? -- I cannot see them being 
> reported in any bootstrap log.

dmidecode or /sys/class/dmi. They're not reported on boot.
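
For reference, reading one of those IDs from sysfs can be sketched in C as
below. This is only a sketch: the helper name read_dmi_field() is made up
here, and it assumes the attributes are exposed as one-line text files
under /sys/class/dmi/id (e.g. sys_vendor, product_name).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read one DMI attribute (e.g. "sys_vendor", "product_name") from the
 * given sysfs directory into buf.  Returns 0 on success, -1 on failure.
 */
int read_dmi_field(const char *dir, const char *name, char *buf, size_t len)
{
	char path[256];
	FILE *f;
	int n;

	assert(buf != NULL && len > 0);

	n = snprintf(path, sizeof(path), "%s/%s", dir, name);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;	/* path did not fit */

	f = fopen(path, "r");
	if (!f)
		return -1;	/* attribute missing or not readable */

	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	buf[strcspn(buf, "\n")] = '\0';	/* strip the trailing newline */
	return 0;
}
```

Calling read_dmi_field("/sys/class/dmi/id", "product_name", buf, sizeof(buf))
would then fill buf with the product name string on a system that exposes it.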

-- 
Matthew Garrett | mjg59@srcf.ucam.org

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-30  9:24                                                         ` Andi Kleen
@ 2008-07-02  1:19                                                           ` Maciej W. Rozycki
  0 siblings, 0 replies; 73+ messages in thread
From: Maciej W. Rozycki @ 2008-07-02  1:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Rafael J. Wysocki, Ingo Molnar, Matthew Garrett, Stephen Rothwell,
	linux-next, LKML, Thomas Gleixner, ACPI Devel Maling List,
	Len Brown, Andi Kleen, Andrew Morton, Linus Torvalds

On Mon, 30 Jun 2008, Andi Kleen wrote:

> >  That is certainly true for standard hardware.  We have to take
> > responsibility for own bugs, sure.  I cannot readily understand why you
> > apparently try to imply hardware vendors do not.
> 
> Sorry Maciej, you're totally off base on that. On consumer hardware

 Am I?  Just a little bit maybe.  Or actually I am more like playing a
devil's advocate here. ;)

> vendors very rarely fix anything after release of the machine
> and in general users expect Linux to work around any BIOS or
> hardware bugs that happen (especially if it's a regression and worked
> before)

 Well, that's no news to me, but my point is: are you happy with such a
state of affairs?  I am not.

 It is well known (at least to me) that quality of x86 firmware is
questionable and has been such for many years now.  I recall bad
experiences from as long ago as 15 years.  That was before I discovered
Linux and even without knowing the quality of free software I considered
many implementations of PC BIOSes a bad joke.

 Of course it may sometimes be difficult to notice all the problematic
bugs early enough.  Testing is expensive and takes a lot of time.  
Proving functional correctness of a non-trivial piece of software against
a complex specification is infeasible.  Back then firmware used to be
included in a piece of EPROM memory, so for practical purposes it was cast
in stone -- nobody would order new chips with an updated replacement for
their PC, which was a practice quite common among workstation/server
manufacturers though.

 These days the firmware is included in an easily reprogrammable piece of
Flash memory, which means technically an update can be done by any user.  
Yet apparently PC equipment manufacturers taught users (similarly to what
some companies did about operating system software) that bad quality is an
immanent property of firmware.  This way they can cut the cost of testing
down, effectively shifting it to someone else.  They take no
responsibility for their mistakes and make the others pay for them.  
That's quite a convenient situation, isn't it? -- I wish I could apply it
to myself as well.

 I do not blame the users.  For most of the users the internals of
computer equipment are beyond comprehension and this is perfectly fine as
nobody is meant to be skilled in everything.  Likewise I do not want to
know in details how a bridge is to be constructed -- I only want to use it
to cross the river.  I just need to trust someone the bridge is safe to
use.  Similarly the user of a computer trusts someone to decide whether a
given piece of equipment is good or not.  In this case I think it is our
role to make users aware firmware bugs are not our responsibility, and our
willingness to cope with them is more a courtesy than a duty.

 In a perfect world they would go back to the manufacturer, or better yet,
to the point of sale and demand the piece of equipment to be replaced with
a good one, fixed, discounted or refunded.  Just as with all the other goods
-- if I buy a shirt and discover it has three sleeves despite being
advertised for regular human beings, I will not demand from a coat
manufacturer to get it fitted with three sleeves to match.  Instead I will
go to the shop I got the shirt from and demand to get the situation
rectified.  Of course I could go to a coat manufacturer instead and ask
them nicely to add an extra sleeve and they might do it, but that's by no
means their duty.

 As we are not in a perfect world, users are not likely to do so as they
can easily be got rid of, by ignoring them or giving arguments they
feel not competent enough to dispute.  And if all manufacturers behaved
consistently, users would have no alternative for their next purchase.  

 The cost remains though.  For example people involved with this case
could have spent the time on something creative, like adding new
functionality.  I do not consider it fair when someone shifts their costs 
onto me and while I may accept it for a given case for some reasons, I am 
not going to treat the situation as normal and will seek a proper 
solution.

 Here I think there is some potential interest to a few well-defined
parties to get better support from hardware manufacturers when it comes to
the firmware.  These parties may be vendors of Linux distributions who
certainly bear costs of dealing with firmware bugs.  These parties may be
x86 CPU vendors as well as the overall quality of equipment matters for
their reputation.  And it is not that they can relax unconcerned in the
belief the x86 is there forever -- times are changing and there are
alternatives on the horizon (the Jisus laptop may be just one swallow, but
even if it fails, there will be followers), which are unlikely to be
beaten by price, but may be beaten by quality.  While users will not care 
how baroque the solution is, they certainly will not disregard how it 
works.

 Sorry for such a long dissertation, but I think the current situation is
too far from perfect not to do anything about it.  I do not seem to be in
a position to change it, but at least I may try to increase the awareness of
the problem.  And refer users who complain to the respective
manufacturers.  What I am sure of is if we just keep papering firmware
bugs over and never come back with them to the (ir)responsible
manufacturer, then the situation will never change.

  Maciej

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-30 10:41                                                         ` Rafael J. Wysocki
@ 2008-07-02  1:48                                                           ` Maciej W. Rozycki
  2008-07-02  9:35                                                             ` Andi Kleen
  0 siblings, 1 reply; 73+ messages in thread
From: Maciej W. Rozycki @ 2008-07-02  1:48 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Ingo Molnar, Matthew Garrett, Stephen Rothwell, linux-next, LKML,
	Thomas Gleixner, ACPI Devel Maling List, Len Brown, Andi Kleen,
	Andrew Morton, Linus Torvalds

On Mon, 30 Jun 2008, Rafael J. Wysocki wrote:

> Well, there are lots of pieces of hardware that are not up to the
> specifications, more or less, and I don't think that's a good enough reason
> for us to refuse to support them.  The same applies to BIOSes IMO.

 Refusing to support broken hardware would provide some incentive to
manufacturers to improve it, because people would rather not buy
unsupported pieces of junk.  I realise that may be impractical though --
we would get the blame anyway, because "it runs the other OS just fine."  

 I think we may legitimately request something in return for our effort
though, for example at least minimal support from hardware manufacturers.  
It is not that we would waste a lot of their time, because in general
anything we do not filter out must be really tough.

> Of course, the _default_ should be to follow the spec, but if that doesn't work
> on given hardware/BIOS combination and we know what to do to handle it, we
> should just handle it instead of asking users to fix their BIOSes.

 I think we should insist on getting issues reported back to the 
manufacturer.  We may implement workarounds independently and leave it up 
to the users whether they want to do a BIOS upgrade or not.

> I have seen enough failed BIOS upgrades to be very cautious about such things.
> Certainly, I wouldn't have seriously asked anyone to upgrade the BIOS in a
> notebook, because if that had failed, the user would have ended up with a piece
> of electronic junk.

 That's a valid point, although making the point of quality yet clearer --
being critical enough, I would expect it to have been thoroughly tested by
the manufacturer.  Also solutions like protected Flash areas have been
available for many years now, which means a machine should be operative
enough for recovery to be doable if an upgrade fails.  So perhaps the very
first thing to do after a new purchase should be doing a BIOS update, so
that you can claim your warranty if something goes wrong.

 Technically upgrading a laptop should be safer as, bearing an on-board
UPS, laptops are protected from power failures, which may be problematic
for some users of other equipment.

  Maciej

* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-07-02  1:48                                                           ` Maciej W. Rozycki
@ 2008-07-02  9:35                                                             ` Andi Kleen
  0 siblings, 0 replies; 73+ messages in thread
From: Andi Kleen @ 2008-07-02  9:35 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Rafael J. Wysocki, Ingo Molnar, Matthew Garrett, Stephen Rothwell,
	linux-next, LKML, Thomas Gleixner, ACPI Devel Maling List,
	Len Brown, Andi Kleen, Andrew Morton, Linus Torvalds

Maciej W. Rozycki wrote:
> On Mon, 30 Jun 2008, Rafael J. Wysocki wrote:
> 
>> Well, there are lots of pieces of hardware that are not up to the
>> specifications, more or less, and I don't think that's a good enough reason
>> for us to refuse to support them.  The same applies to BIOSes IMO.
> 
>  Refusing to support broken hardware would provide some incentive to
> manufacturers to improve it, because people would rather not buy
> unsupported pieces of junk. 

For most consumer level hardware the vendors generally don't
really care if Linux runs on it or not.

Also they very rarely fix anything after release anyways because
they don't make enough money on it.

For server hardware that is different (vendors care about Linux,
though typically about given RHEL/SLES releases rather than mainline),
but even there we generally try to work around BIOS bugs
(at least as long as it is possible)
because it tends to be quite difficult logistically to require
a BIOS update. In the end it just hurts the user.

> I realise that may be impractical though 

It is.

> we would get the blame anyway, because "it runs the other OS just fine."

That is exactly what happens.

-Andi


* Re: linux-next: Tree for June 13: IO APIC breakage on HP nx6325
  2008-06-27  1:53                                 ` Maciej W. Rozycki
@ 2008-07-08 12:48                                   ` Pavel Machek
  0 siblings, 0 replies; 73+ messages in thread
From: Pavel Machek @ 2008-07-08 12:48 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Rafael J. Wysocki, Matthew Garrett, Ingo Molnar, Stephen Rothwell,
	linux-next, LKML, Thomas Gleixner, ACPI Devel Maling List,
	Len Brown

On Fri 2008-06-27 02:53:09, Maciej W. Rozycki wrote:
> On Tue, 24 Jun 2008, Pavel Machek wrote:
> 
> > > $ cat /proc/acpi/thermal_zone/TZ*/trip_points
> > > 
> > > in the failing case:
> > > 
> > > critical (S5):           105 C
> > > passive:                 16 C: tc1=1 tc2=2 tsp=100 devices=C000 C001 
> > > active[0]:               16 C: devices=C34F 
> > > active[1]:               16 C: devices=C350 
> > > active[2]:               16 C: devices=C351 
> > > active[3]:               16 C: devices=C352 
> > > critical (S5):           100 C
> > > passive:                 16 C: tc1=1 tc2=2 tsp=300 devices=C000 C001 
> > > critical (S5):           100 C
> > > passive:                 16 C: tc1=1 tc2=2 tsp=300 devices=C000 C001 
> > 
> > Can we call the ACPI BIOS terminally broken at this point?
> 
>  Do we have any point of contact at HP and/or ATI/AMD?  I suppose getting
> our hands on an SB400 datasheet could be tricky, but someone may be able to
> answer questions about the interrupt routing between the 8254, the 8259A 
> and the I/O APIC for this chip and/or fix the DSDT.

Yes, we do have AMD contacts. Contact me privately if that's still
relevant.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


end of thread, other threads:[~2008-07-08 12:48 UTC | newest]

Thread overview: 73+ messages
     [not found] <20080613232214.394fd6fd.sfr@canb.auug.org.au>
     [not found] ` <200806161539.05524.rjw@sisk.pl>
     [not found]   ` <Pine.LNX.4.55.0806161636510.20218@cliff.in.clinika.pl>
2008-06-16 22:38     ` linux-next: Tree for June 13: IO APIC breakage on HP nx6325 Rafael J. Wysocki
2008-06-16 23:05       ` Rafael J. Wysocki
2008-06-17  7:12         ` Thomas Gleixner
2008-06-17 20:44           ` Rafael J. Wysocki
2008-06-17 22:19             ` Rafael J. Wysocki
2008-06-17 22:25               ` Rafael J. Wysocki
2008-06-18  8:02                 ` Thomas Gleixner
2008-06-18 12:41                   ` Thomas Gleixner
2008-06-18 14:37                     ` Rafael J. Wysocki
2008-06-18 14:40                     ` Rafael J. Wysocki
2008-06-18 15:29                       ` Thomas Gleixner
2008-06-21 22:47                         ` Rafael J. Wysocki
2008-06-18 13:15                 ` Ingo Molnar
2008-06-18 13:14               ` Ingo Molnar
2008-06-17 20:59       ` Rafael J. Wysocki
2008-06-17 21:19         ` Maciej W. Rozycki
2008-06-17 21:38           ` Rafael J. Wysocki
2008-06-17 22:53             ` Rafael J. Wysocki
2008-06-18  4:02               ` Maciej W. Rozycki
2008-06-18 19:06                 ` Cyrill Gorcunov
2008-06-18 22:36                   ` Maciej W. Rozycki
2008-06-20 18:59                     ` Cyrill Gorcunov
2008-06-20 20:44                       ` Maciej W. Rozycki
2008-06-18 22:11                 ` Rafael J. Wysocki
2008-06-18 23:39                   ` Maciej W. Rozycki
2008-06-19  0:25                     ` Rafael J. Wysocki
2008-06-20  0:35                       ` Maciej W. Rozycki
2008-06-20 11:53                         ` Rafael J. Wysocki
2008-06-20 11:57                           ` Matthew Garrett
2008-06-20 12:22                             ` Rafael J. Wysocki
2008-06-20 12:27                               ` Matthew Garrett
2008-06-21  1:09                                 ` Maciej W. Rozycki
2008-06-21  1:40                                   ` Matthew Garrett
2008-06-21  2:41                                     ` Maciej W. Rozycki
2008-06-21 12:38                                       ` Matthew Garrett
2008-06-26 19:52                                     ` Rafael J. Wysocki
2008-06-27  0:06                                       ` Maciej W. Rozycki
2008-06-29 14:00                                         ` Rafael J. Wysocki
2008-06-29 19:05                                           ` Maciej W. Rozycki
2008-06-29 19:23                                             ` Rafael J. Wysocki
2008-06-29 19:56                                               ` Maciej W. Rozycki
2008-06-29 20:02                                                 ` Ingo Molnar
2008-06-29 20:14                                                   ` Maciej W. Rozycki
2008-06-29 23:06                                                     ` Rafael J. Wysocki
2008-06-30  0:45                                                       ` Andi Kleen
2008-06-30  0:47                                                         ` Matthew Garrett
2008-06-30  1:39                                                       ` Maciej W. Rozycki
2008-06-30  9:24                                                         ` Andi Kleen
2008-07-02  1:19                                                           ` Maciej W. Rozycki
2008-06-30 10:41                                                         ` Rafael J. Wysocki
2008-07-02  1:48                                                           ` Maciej W. Rozycki
2008-07-02  9:35                                                             ` Andi Kleen
2008-06-29 22:59                                                   ` Rafael J. Wysocki
2008-06-29 22:56                                                 ` Rafael J. Wysocki
2008-06-30  1:00                                                   ` Maciej W. Rozycki
2008-06-30  9:06                                                     ` Matthew Garrett
2008-06-30 15:29                                                       ` Maciej W. Rozycki
2008-06-30 15:35                                                         ` Matthew Garrett
2008-06-29 19:23                                             ` Matthew Garrett
2008-06-29 19:31                                               ` Rafael J. Wysocki
2008-06-29 20:03                                               ` Maciej W. Rozycki
2008-06-29 20:07                                                 ` Matthew Garrett
2008-06-29 20:16                                                   ` Maciej W. Rozycki
2008-06-24  9:15                               ` Pavel Machek
2008-06-26  8:37                                 ` Rafael J. Wysocki
2008-06-27  1:53                                 ` Maciej W. Rozycki
2008-07-08 12:48                                   ` Pavel Machek
2008-06-21  1:49                           ` Maciej W. Rozycki
2008-06-19  9:35                     ` Ingo Molnar
2008-06-19 18:17                       ` Maciej W. Rozycki
2008-06-20 10:44                         ` Ingo Molnar
2008-06-20 13:11                         ` Thomas Gleixner
2008-06-20 20:56                           ` Maciej W. Rozycki
