Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered

public inbox for linux-kernel@vger.kernel.org

* Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
@ 2003-12-07 13:12 Ross Dickson
  2003-12-09 15:20 ` Maciej W. Rozycki
  2003-12-10  3:39 ` Jesse Allen
  0 siblings, 2 replies; 35+ messages in thread
From: Ross Dickson @ 2003-12-07 13:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: AMartin, ross, andre, kernel

[-- Attachment #1: Type: text/plain, Size: 5471 bytes --]

Greetings,
I am not subscribed so please cc responses.
I have monitored the list and know my nforce2 experiences have been common.
Attached patches are in a single bzip tar ball.

I have Albatron KM18G Pro & Epox 8RGA+ MOBOs both using nforce2 chipsets.
I made up a kernel as follows.
Get std 2.4.22 src
apply patch-2.4.23
apply 2.4.22-low-latency.patch
apply preempt-kernel-rml-2.4.23-pre5-1.patch
apply vhz-j64-2.4.22.patch

One patch fails on inode.c, dispose_list() so I placed conditional_schedule() as follows
=static void dispose_list(struct list_head *head)
={
=	int nr_disposed = 0;
=
=	while (!list_empty(head)) {
=		struct inode *inode;
=		conditional_schedule();

Config for athlon with 1000hz tics, preempt & low-lat on.
Compiled and installed nvnet & nvidia video driver.

Disclaimer: The following information and code patches are not fully tested and may be
dangerous. These are also the first patches I have made for public consumption, so I hope
that their format works.

Note also that the patches are against 2.4.22 even though they were developed
against the heavily patched 2.4.23 mentioned above. The patch code is the same for both
kernels but at different line numbers.

When I enabled either apic or io-apic in kern config, lockups came hard and fast.
Particularly bad under hard disk load. Heaps of lost ints on irq7 in apic and ioapic mode. 
Lockups disappeared when I lowered the ide hda udma speed to mode 3 with hdparm so
I went looking for answers which now follow.

There are three parts to this email.
a) apic mods.
b) io-apic mods
c) ide driver mods

a) Lockups are due to too fast an apic acknowledge of apic timer int.
Apic hard locked up the system - no nmi debug available.
Fixed it by introducing a delay of at least 500ns into smp_apic_timer_interrupt() 
just prior to ack_APIC_irq().
See attached diff file "nforce2-apic.c-2.4.22.patch" for details. 
I have guessed at a suitable cpu speed dependent delay.
Perhaps someone with AMD cpu docs (apic timing specs)  & analyser tools could refine it.

Maybe nforce2 chipset really is very quick accessing ram in dual dimm mode? 
Or AMD 2200XP has a really slow APIC?

--- linux-2.4.22/arch/i386/kernel/apic.c	2003-06-14 00:51:29.000000000 +1000
+++ linux-2.4.22-rd/arch/i386/kernel/apic.c	2003-12-07 18:27:32.000000000 +1000
@@ -1078,6 +1078,15 @@
 	 */
 	apic_timer_irqs[cpu]++;

+#if defined(CONFIG_MK7) && defined(CONFIG_BLK_DEV_AMD74XX)
+	/*
+	 * on 2200XP & nforce2 chipset we need at least 500ns delay here
+	 * to stop lockups with udma100 drive. try to scale delay time
+	 * with cpu speed. Ross Dickson.
+	 */
+	ndelay((cpu_khz >> 12)+200 ); /* don't ack too soon or hard lockup */
+#endif
+
 	/*
 	 * NOTE! We'd better ACK the irq immediately,
 	 * because timer handling can be slow.

b) I was also disappointed to see I could not have irq0 timer IO-APIC-edge. 
So I have fixed it too (tested on both my epox and albatron MOBOs).
Firstly I found 8254 connected directly to pin 0 not pin 2 of io-apic.
I have modified check_timer() in io_apic.c to trial connect pin and test for it
after the existing test for connection to io-apic.
See attached diff file nforce2-io-apic.c-2.4.22 for details.

--- linux-2.4.22/arch/i386/kernel/io_apic.c	2003-08-25 21:44:39.000000000 +1000
+++ linux-2.4.22-rd/arch/i386/kernel/io_apic.c	2003-12-07 18:40:40.000000000 +1000
@@ -1614,9 +1614,44 @@
 			return;
 		}
 		clear_IO_APIC_pin(0, pin1);
-		printk(KERN_ERR "..MP-BIOS bug: 8254 timer not connected to IO-APIC\n");
+		printk(KERN_ERR "..MP-BIOS bug: 8254 timer not connected to IO-APIC pin%d\n",pin1);
 	}

+#if defined(CONFIG_ACPI_BOOT) && defined(CONFIG_X86_UP_IOAPIC)
+	/* for nforce2 try vector 0 on pin0
+	 * Note the io_apic_set_pci_routing call disables the 8259 irq 0
+	 * so we must be connected directly to the 8254 timer if this works
+	 * Note2: this violates the above comment re Subtle but works!
+	 */
+	printk(KERN_INFO "..TIMER: Is timer irq0 connected to IOAPIC Pin0? ...\n");
+	if ( pin1 != -1 && nr_ioapics ) {
+		int saved_timer_ack = timer_ack;
+		/* next call also disables 8259 irq0 */
+		int result = io_apic_set_pci_routing ( 0, 0, 0, 0, 0);
+		/*
+		 * Ok, does IRQ0 through the IOAPIC work?
+		 */
+		unmask_IO_APIC_irq(0);
+		timer_ack = 0 ;
+		if (timer_irq_works()) {
+			if (nmi_watchdog == NMI_IO_APIC) {
+				disable_8259A_irq(0);
+				setup_nmi();
+				enable_8259A_irq(0);
+				check_nmi_watchdog();
+			}
+			printk(KERN_INFO "..TIMER: works OK on apic pin0 irq0\n" );
+			return;
+		}
+		/* failed */
+		timer_ack = saved_timer_ack;
+		clear_IO_APIC_pin(0, 0);
+		result = io_apic_set_pci_routing ( 0, pin1, 0, 0, 0);
+		printk(KERN_ERR "..MP-BIOS bug: 8254 timer not connected to IO-APIC Pin 0\n");
+	}
+#endif
+/* end new stuff for nforce2 */
+
 	printk(KERN_INFO "...trying to set up timer (IRQ0) through the 8259A ... ");
 	if (pin2 != -1) {
 		printk("\n..... (found pin %d) ...", pin2);

c) Finally, during my fault finding I merged A. Martin's patches for the nforce2 IDE driver.
I note that the nforce2 address setup timing bits are different to the AMD ones.
I have assumed the nforce2 address timings apply to nforce and nforce3 chipsets.
I could be wrong, so could someone with the nvidia docs please check it.
I have also not tested it with anything but a WDC ata100 hard drive.
For info see attached patch files (I think pci ids are already in 2.4.23)
nforce2-amd74xx.c-2.4.22.patch, nforce2-amd74xx.h-2.4.22.patch, nforce2-pci_ids.h-2.4.22.patch

Thanks
Ross Dickson

[-- Attachment #2: ross-diffs.tar.bz2 --]
[-- Type: application/x-tbz, Size: 4375 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-07 13:12 Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered Ross Dickson
@ 2003-12-09 15:20 ` Maciej W. Rozycki
  2003-12-10  5:43   ` Ross Dickson
  2003-12-10  3:39 ` Jesse Allen
  1 sibling, 1 reply; 35+ messages in thread
From: Maciej W. Rozycki @ 2003-12-09 15:20 UTC (permalink / raw)
  To: Ross Dickson; +Cc: linux-kernel, AMartin, andre, kernel

On Sun, 7 Dec 2003, Ross Dickson wrote:

> b) I was also disappointed to see I could not have irq0 timer IO-APIC-edge. 
> So I have fixed it too (tested on both my epox and albatron MOBOs).
> Firstly I found 8254 connected directly to pin 0 not pin 2 of io-apic.
> I have modified check_timer() in io_apic.c to trial connect pin and test for it
> after the existing test for connection to io-apic.

 I'm pretty sure this part is bogus.  Have you actually verified it either
by using a hardware probe or at least by investigating documentation you
really have IRQ 0 routed to the I/O APIC interrupt #0 (INTIN 0)?  If no,
then you can almost surely see interrupts travelling across the pair of
8259A PICS which are connected to the INTIN 0 input of the first I/O APIC
in every IA32-based PC system providing an I/O APIC seen so far.

-- 
+  Maciej W. Rozycki, Technical University of Gdansk, Poland   +
+--------------------------------------------------------------+
+        e-mail: macro@ds2.pg.gda.pl, PGP key available        +


* Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-07 13:12 Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered Ross Dickson
  2003-12-09 15:20 ` Maciej W. Rozycki
@ 2003-12-10  3:39 ` Jesse Allen
  2003-12-10  9:22   ` Ross Dickson
  2003-12-10 10:00   ` Mikael Pettersson
  1 sibling, 2 replies; 35+ messages in thread
From: Jesse Allen @ 2003-12-10  3:39 UTC (permalink / raw)
  To: Ross Dickson; +Cc: linux-kernel, AMartin

[-- Attachment #1: Type: text/plain, Size: 957 bytes --]

Hi Ross,

I have rediffed your two patches for vanilla 2.6.0-test11.  Briefly, I tried the apic patch first, and found that there are no lockups so far; well, it passed my grep tests and even a kernel compile =).  Then I tried your io_apic patch + apic patch.  With nmi_watchdog=1, "NMI:" in /proc/interrupts increments a lot compared to nmi_watchdog=2 before (as much as the timer).  So I believe your two patches are more correct than the other two, especially given that I can run with CPU Disconnect and not lock up, just like windows ... for people that have windows (I don't have windows =), plus a probably working nmi_watchdog.

And for comparison, my setup:
Shuttle AN35N Ultra v 1.1  (Nforce2 400 ultra), bios updated
Athlon Barton 2600+ (1.9 GHz)
256 MB PC3200, single stick.

The patches are included in this mail.  I suppose the next thing to do is get out of nvidia the corresponding information.  And then clean up the patch for inclusion.

Jesse



[-- Attachment #2: nforce2-apic-delay-2.6t11.patch --]
[-- Type: text/plain, Size: 611 bytes --]

--- linux/arch/i386/kernel/apic.c	2003-10-25 11:44:59.000000000 -0700
+++ linux-jla/arch/i386/kernel/apic.c	2003-12-09 19:07:19.000000000 -0700
@@ -1089,6 +1089,16 @@
 	 */
 	irq_stat[cpu].apic_timer_irqs++;
 
+#if defined(CONFIG_MK7) && defined(CONFIG_BLK_DEV_AMD74XX)
+
+	/*
+	 * on 2200XP & nforce2 chipset we need at least 500ns delay here
+	 * to stop lockups with udma100 drive. try to scale delay time
+	 * with cpu speed. Ross Dickson.
+	 */
+	ndelay((cpu_khz >> 12)+200 ); /* don't ack too soon or hard lockup */
+#endif
+
 	/*
 	 * NOTE! We'd better ACK the irq immediately,
 	 * because timer handling can be slow.

[-- Attachment #3: nforce2-ioapic-timer-2.6t11.patch --]
[-- Type: text/plain, Size: 1564 bytes --]

--- linux/arch/i386/kernel/io_apic.c	2003-10-25 11:43:20.000000000 -0700
+++ linux-jla/arch/i386/kernel/io_apic.c	2003-12-09 19:56:07.000000000 -0700
@@ -2128,6 +2128,41 @@
 		printk(KERN_ERR "..MP-BIOS bug: 8254 timer not connected to IO-APIC\n");
 	}
 
+#if defined(CONFIG_ACPI_BOOT) && defined(CONFIG_X86_UP_IOAPIC)
+	/* for nforce2 try vector 0 on pin0
+	 * Note the io_apic_set_pci_routing call disables the 8259 irq 0
+	 * so we must be connected directly to the 8254 timer if this works
+	 * Note2: this violates the above comment re Subtle but works!
+	 */
+	printk(KERN_INFO "..TIMER: Is timer irq0 connected to IOAPIC Pin0? ...\n");
+	if ( pin1 != -1 && nr_ioapics ) {
+		int saved_timer_ack = timer_ack;
+		/* next call also disables 8259 irq0 */
+		int result = io_apic_set_pci_routing ( 0, 0, 0, 0, 0);
+		/*
+		 * Ok, does IRQ0 through the IOAPIC work?
+		 */
+		unmask_IO_APIC_irq(0);
+		timer_ack = 0 ;
+		if (timer_irq_works()) {
+			if (nmi_watchdog == NMI_IO_APIC) {
+				disable_8259A_irq(0);
+				setup_nmi();
+				enable_8259A_irq(0);
+				check_nmi_watchdog();
+			}
+			printk(KERN_INFO "..TIMER: works OK on apic pin0 irq0\n" );
+			return;
+		}
+		/* failed */
+		timer_ack = saved_timer_ack;
+		clear_IO_APIC_pin(0, 0);
+		result = io_apic_set_pci_routing ( 0, pin1, 0, 0, 0);
+		printk(KERN_ERR "..MP-BIOS bug: 8254 timer not connected to IO-APIC Pin 0\n");
+	}
+#endif
+/* end new stuff for nforce2 */
+
 	printk(KERN_INFO "...trying to set up timer (IRQ0) through the 8259A ... ");
 	if (pin2 != -1) {
 		printk("\n..... (found pin %d) ...", pin2);


* Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-09 15:20 ` Maciej W. Rozycki
@ 2003-12-10  5:43   ` Ross Dickson
  2003-12-10 16:06     ` Maciej W. Rozycki
  0 siblings, 1 reply; 35+ messages in thread
From: Ross Dickson @ 2003-12-10  5:43 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: linux-kernel, AMartin, kernel, Ian Kumlien

On Wednesday 10 December 2003 01:20, Maciej W. Rozycki wrote:
> On Sun, 7 Dec 2003, Ross Dickson wrote:
> 
> > b) I was also disappointed to see I could not have irq0 timer IO-APIC-edge. 
> > So I have fixed it too (tested on both my epox and albatron MOBOs).
> > Firstly I found 8254 connected directly to pin 0 not pin 2 of io-apic.
> > I have modified check_timer() in io_apic.c to trial connect pin and test for it
> > after the existing test for connection to io-apic.
> 
>  I'm pretty sure this part is bogus.  Have you actually verified it either
> by using a hardware probe or at least by investigating documentation you
> really have IRQ 0 routed to the I/O APIC interrupt #0 (INTIN 0)?  If no,
> then you can almost surely see interrupts travelling across the pair of
> 8259A PICS which are connected to the INTIN 0 input of the first I/O APIC
> in every IA32-based PC system providing an I/O APIC seen so far.
> 
> -- 
> +  Maciej W. Rozycki, Technical University of Gdansk, Poland   +
> +--------------------------------------------------------------+
> +        e-mail: macro@ds2.pg.gda.pl, PGP key available        +
> 
> 
> 

Thanks Maciej for your response.

If everyone followed published recommendations then I would agree with
your comments, but does nvidia (et al.)?

I have no appropriate docs and cannot confirm via a real hardware probe,
so I can only offer a software confirmation.

Background musings:

I was forced to approach the problems using somewhat educated guesses and
with the tools I had at hand. As with most discoveries about black boxes the answer 
comes about by a combination of educated guess, luck and checking the unlikely.
The apic delay (a) came about because the lockup problem went away when I put 
a debugging outb_p() statement flipping bits at the lpt port while I was trying 
to catch the frozen IRQ state info on my CRO. I was pleasantly surprised when 
the lockups ceased so I replaced the outb_p with a delay and trimmed it as 
best I could without docs. I did not change it within the Ack call as I realised 
that all the other normal apic ack paths had considerably more code delay time - 
although could this be a gotcha depending on what code path is in the driver.
What if we had a really fast cpu or is it restricted solely to the timer irq??

Back to your query:

I approached the io-apic edge with the same what if?
I think I got it right but please check my following code to confirm.  I have
since hacked the kernel as follows.

WARNING Following Mods For Debugging Only!

In File i8259.c I needed to get to "cached_irq_mask" 

/*
 * This contains the irq mask for both 8259A irq controllers,
 */
//static unsigned int cached_irq_mask = 0xffff; debug ross
unsigned int cached_irq_mask = 0xffff;

In File io_apic.c I have tried to fully mask the 8259.

/*
 * This code may look a bit paranoid, but it's supposed to cooperate with
 * a wide range of boards and BIOS bugs.  Fortunately only the timer IRQ
 * is so screwy.  Thanks to Brian Perkins for testing/hacking this beast
 * fanatically on his truly buggy board.
 */
// debug ross
extern spinlock_t i8259A_lock;
extern unsigned int cached_irq_mask;

static inline void check_timer(void)
{
....
<snip>
....
#if defined(CONFIG_ACPI_BOOT) && defined(CONFIG_X86_UP_IOAPIC)
	/* for nforce2 try vector 0 on pin0
	 * Note the io_apic_set_pci_routing call disables the 8259 irq 0
	 * so we must be connected directly to the 8254 timer if this works
	 * Note2: this violates the above comment re Subtle but works!
	 */
	printk(KERN_INFO "..TIMER: Is timer irq0 connected to IOAPIC Pin0? ...\n");
	if ( pin1 != -1 && nr_ioapics ) {
		int result, tok;
		unsigned long flags;
		unsigned int saved_cached_irq_mask;
		unsigned char imr1, imr2;

		int saved_timer_ack = timer_ack;

		// disable all of 8259 irq's
		spin_lock_irqsave(&i8259A_lock, flags);
		saved_cached_irq_mask = cached_irq_mask;
		cached_irq_mask = 0xffff; // ensure nothing restores 8259 ints
		outb(0xff, 0x21);	/* mask all of 8259A-1 */
		outb(0xff, 0xA1);	/* mask all of 8259A-2 */
		spin_unlock_irqrestore(&i8259A_lock, flags);

		/*
		 * Ok, does IRQ0 through the IOAPIC work?
		 */
		/* next call also disables 8259 irq0 */
		result = io_apic_set_pci_routing ( 0, 0, 0, 0, 0);
		unmask_IO_APIC_irq(0);
		timer_ack = 0 ;

		spin_lock_irqsave(&i8259A_lock, flags);
		imr1 = inb(0x21);
		imr2 = inb(0xA1);
		printk("..TIMER check 8259 ints disabled, imr1:%02x, imr2:%02x\n", imr1, imr2);
		tok = timer_irq_works();
		spin_unlock_irqrestore(&i8259A_lock, flags);

		// restore 8259 mask
		spin_lock_irqsave(&i8259A_lock, flags);
		cached_irq_mask = saved_cached_irq_mask;
		outb( cached_irq_mask & 0xff, 0x21 ); /* restore all of 8259A-1 */
		outb( cached_irq_mask >> 8, 0xA1 ); /* restore all of 8259A-2 */
		spin_unlock_irqrestore(&i8259A_lock, flags);

		/*
		 * Ok, does IRQ0 through the IOAPIC work?
		 */
//		unmask_IO_APIC_irq(0);
//		timer_ack = 0 ;
//		if (timer_irq_works()) {
		if (tok) {
			if (nmi_watchdog == NMI_IO_APIC) {
				disable_8259A_irq(0);
				setup_nmi();
				enable_8259A_irq(0);
				check_nmi_watchdog();
			}
			printk(KERN_INFO "..TIMER: works OK on apic pin0 irq0\n" );
			return;
		}
		/* failed */
		timer_ack = saved_timer_ack;
		clear_IO_APIC_pin(0, 0);
		result = io_apic_set_pci_routing ( 0, pin1, 0, 0, 0);
		printk(KERN_ERR "..MP-BIOS bug: 8254 timer not connected to IO-APIC Pin 0\n");
	}
#endif
/* end new stuff for nforce2 */

The inner spinlock around timer_irq_works() I think is redundant but I put it there 
for good measure.
Relevant dmesg output from Albatron KM18G Pro ( this is different MOBO (same type) but 
this time has a barton core 2500 XP cpu).

enabled ExtINT on CPU#0
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
ENABLING IO-APIC IRQs
init IO_APIC IRQs
 IO-APIC (apicid-pin) 2-0, 2-16, 2-17, 2-18, 2-19, 2-20, 2-21, 2-22, 2-23 not connected.
..TIMER: vector=0x31 pin1=2 pin2=-1
..MP-BIOS bug: 8254 timer not connected to IO-APIC pin2
..TIMER: Is timer irq0 connected to IOAPIC Pin0? ...
IOAPIC[0]: Set PCI routing entry (2-0 -> 0x31 -> IRQ 0 Mode:0 Active:0)
..TIMER check 8259 ints disabled, imr1:ff, imr2:ff
..TIMER: works OK on apic pin0 irq0
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 1829.0708 MHz.
..... host bus clock speed is 332.0674 MHz.
cpu: 0, clocks: 332674, slice: 166337
CPU0<T0:332672,T1:166320,D:15,S:166337,C:332674>

Please advise if anyone knows of extra registers which may have been added to the
nforce2 8259 core which could allow the interrupts through the masked chip core?
I note that they may exist after reading your email March 21 2002 (irq FosterP4) 
http://www.ussg.iu.edu/hypermail/linux/kernel/0203.2/1213.html

Note that I think it is safe to leave the 8259 irq(0) implicitly disabled on 
failure exit as the code paths following my code patch do it anyway.

Regards
Ross.


* Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-10 10:00   ` Mikael Pettersson
@ 2003-12-10  8:40     ` Ross Dickson
  2003-12-11 14:32     ` Jesse Allen
  1 sibling, 0 replies; 35+ messages in thread
From: Ross Dickson @ 2003-12-10  8:40 UTC (permalink / raw)
  To: Mikael Pettersson, Jesse Allen; +Cc: linux-kernel, AMartin, Ian Kumlien

On Wednesday 10 December 2003 20:00, Mikael Pettersson wrote:
> Jesse Allen writes:
>  > --- linux/arch/i386/kernel/apic.c	2003-10-25 11:44:59.000000000 -0700
>  > +++ linux-jla/arch/i386/kernel/apic.c	2003-12-09 19:07:19.000000000 -0700
>  > @@ -1089,6 +1089,16 @@
>  >  	 */
>  >  	irq_stat[cpu].apic_timer_irqs++;
>  >  
>  > +#if defined(CONFIG_MK7) && defined(CONFIG_BLK_DEV_AMD74XX)
>  > +
>  > +	/*
>  > +	 * on 2200XP & nforce2 chipset we need at least 500ns delay here
>  > +	 * to stop lockups with udma100 drive. try to scale delay time
>  > +	 * with cpu speed. Ross Dickson.
>  > +	 */
>  > +	ndelay((cpu_khz >> 12)+200 ); /* don't ack too soon or hard lockup */
>  > +#endif
>  > +
>  >  	/*
>  >  	 * NOTE! We'd better ACK the irq immediately,
>  >  	 * because timer handling can be slow.
> 
> This is too much of a kludge. APIC timer ACKing is supposed to be fast.
> Please try without this delay but with the disconnect PCI quirk.
> 
> If the delay is still needed even when disconnect is disabled, _then_
> we can discuss how to do the delay properly.
> 
> /Mikael
> 
> 
> 

Thanks Mikael, I think the more heads on this problem the better.

I don't like timing kludges either, such as this existing one in ide-iops.c
in kernel 2.4.23:

	hwif->OUTBSYNC(drive, cmd, IDE_COMMAND_REG);
	/* Drive takes 400nS to respond, we must avoid the IRQ being
	   serviced before that.

	   FIXME: we could skip this delay with care on non shared
	   devices

	   For DMA transfers highpoint have a neat trick we could
	   use. When they take an IRQ they check STS but also that
	   the DMA count is not zero (see hpt's own driver)
	*/
	ndelay(400);
	spin_unlock_irqrestore(&io_request_lock, flags);
}

But does anyone exactly know what nvidia and the bios writer are doing - why the 
cpu-disconnect is an issue for the nforce2 boards?

Is it technically correct in their view to turn off features that some pci or
other device they have made may expect? I wonder about their
ram devices because I note after some more testing that without any 
lockup fixes the lockups were spaced a lot further apart in time when I used 
a pair of KINGMAX 256MB DDR-333 than when I used a pair of SEITEC 256MB 
DDR-400 memory. The cpu used, an XP2500, has a 333 fsb. Is the ram driver chip core
enforcing the disconnect for a reason?

When using the ack delay, lockups with both memory types ceased - as they may 
also cease with the disconnect patch. So the disconnect cycles seem related 
to the nforce2 ram driver circuitry. (See Ian's take towards end of this email)

The reason why I put the ack delay in only the apic timer servicing path is that
I think it is the only commonly traversed path which acks the apic so quickly. If
we end up stuck with a delay then we could probably make it more accurate by
reading the apic timer within the delay and using the counts down from the reload
value, because if our irq was already delayed then no additional delay would 
be required. I am sure many clever programmers can improve on it - not that we
want it at all.

I note the following comments in 2.2.23 io_apic.c
/*
 * Level triggered interrupts can just be masked,
 * and shutting down and starting up the interrupt
 * is the same as enabling and disabling them -- except
 * with a startup need to return a "was pending" value.
 *
 * Level triggered interrupts are special because we
 * do not touch any IO-APIC register while handling
 * them. We ack the APIC in the end-IRQ handler, not
 * in the start-IRQ-handler. Protection against reentrance
 * from the same interrupt is still provided, both by the
 * generic IRQ layer and by the fact that an unacked local
 * APIC does not accept IRQs.
 */
If I am reading this correctly then PCI interrupts which are level 
triggered are processed with the equivalent of a global (maskable) hardware 
interrupt disable (on a uniprocessor machine) if all hardware interrupts
are routed via the APIC. Chances are that we have more than 500ns irq off
times occurring with these servicing routines especially if several handlers
are chained on the one pirq.

Another clue may have just come to light, does the ack in this routine (io_apic.c)
usually get done within the 500ns or so from its activation? If it does then either the
mask_IO_APIC_irq() has a positive effect on the lockups or alternately the 
problem is synchronous with, or inherent to the apic timer.

/*
 * Once we have recorded IRQ_PENDING already, we can mask the
 * interrupt for real. This prevents IRQ storms from unhandled
 * devices.
 */
static void ack_edge_ioapic_irq(unsigned int irq)
{
	if ((irq_desc[irq].status & (IRQ_PENDING | IRQ_DISABLED))
					== (IRQ_PENDING | IRQ_DISABLED))
		mask_IO_APIC_irq(irq);
	ack_APIC_irq();
}

How slow can timer handling be?
When I was debugging with my CRO on the LPT port and turning a bit on
going into the smp_apic_timer_interrupt() routine and turning the bit off
when exiting I saw times of greater than 0.5ms for the routine to complete.
That's milliseconds! I certainly agree with the comment regarding the ack 
immediately and think it means before 0.5ms instead of after 0.5ms 
because 0.5ms is an eternity to have interrupts disabled in a hardware 
interrupt context.
 /*
  * NOTE! We'd better ACK the irq immediately,
  * because timer handling can be slow.
I am not too crazy about having them off for 500ns to 1000ns either, but until I 
know for certain that the cpu disconnect issue is a non-issue I will
choose to suffer a time hit and let the hardware run as the maker intended.

BIG HINT TO THOSE IN THE KNOW.
If we had the docs from nvidia regarding the unknown pci devices?

00:00.0 Host bridge: nVidia Corporation: Unknown device 01e0 (rev a2)
00:00.1 RAM memory: nVidia Corporation: Unknown device 01eb (rev a2)
00:00.2 RAM memory: nVidia Corporation: Unknown device 01ee (rev a2)
00:00.3 RAM memory: nVidia Corporation: Unknown device 01ed (rev a2)
00:00.4 RAM memory: nVidia Corporation: Unknown device 01ec (rev a2)
00:00.5 RAM memory: nVidia Corporation: Unknown device 01ef (rev a2)

then perhaps the underlying cause would present itself. Then we could properly
deal with the issue because we would know why we should do whatever it
is we should do. If the disconnect should be left on then hopefully we could test
a register somewhere to know when it is safe to ack or not - something like
that.

I think Ian is heading in the right direction with his comments:

On Wednesday 10 December 2003 11:20, Ian Kumlien wrote:
> Hi, again.
> 
> I did some reading on amd's site, and if the disconnect + apic fixed the
> same problem as the ~500ns delay, then it could be as i suspect...
> 
> I suspect that something goes wrong with apic ack when the cpu is
> disconnected and according to the amd docs we could check the
> Northbridge's CLKFWDRST or isn't that avail on the outside?
> (It would be interesting to see if that fixes the problem as well.)
> 
> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26237.PDF
> 
> I don't really have the knowledge but it would sure be nicer to fix this
> by checking this than to just disable it. I dunno if there is something
> we could do from within the kernel as well with the sending of HLT but I
> doubt it.
> 
> Anyways, we need a generalized patch that does better checking on the
> NMI bit (like Ross' patch). 
> 
> PS. Anyone that can point me to northbridge tech docs? and CC
> 
> -- 
> Ian Kumlien <pomac () vapor ! com> -- http://pomac.netswarm.net
> 

Regards
Ross.


* Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-10  3:39 ` Jesse Allen
@ 2003-12-10  9:22   ` Ross Dickson
  2003-12-10 10:00   ` Mikael Pettersson
  1 sibling, 0 replies; 35+ messages in thread
From: Ross Dickson @ 2003-12-10  9:22 UTC (permalink / raw)
  To: Jesse Allen; +Cc: linux-kernel, AMartin

On Wednesday 10 December 2003 13:39, Jesse Allen wrote:
> Hi Ross,
> 
> I have rediffed your two patches for vanilla 2.6.0-test11.  Briefly, I tried the apic patch first, and found that there are no lockups so far; well, it passed my grep tests and even a kernel compile =).  Then I tried your io_apic patch + apic patch.  With nmi_watchdog=1, "NMI:" in /proc/interrupts increments a lot compared to nmi_watchdog=2 before (as much as the timer).  So I believe your two patches are more correct than the other two, especially given that I can run with CPU Disconnect and not lock up, just like windows ... for people that have windows (I don't have windows =), plus a probably working nmi_watchdog.
> 
> And for comparison, my setup:
> Shuttle AN35N Ultra v 1.1  (Nforce2 400 ultra), bios updated
> Athlon Barton 2600+ (1.9 GHz)
> 256 MB PC3200, single stick.
> 
> The patches are included in this mail.  I suppose the next thing to do is get out of nvidia the corresponding information.  And then clean up the patch for inclusion.
> 
> Jesse
> 
> 
> 
Thanks Jesse,
It is interesting that the lockup problems also occur with a single memory stick;
I have only tried dual sticks.
Regards
Ross.


* Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-10  3:39 ` Jesse Allen
  2003-12-10  9:22   ` Ross Dickson
@ 2003-12-10 10:00   ` Mikael Pettersson
  2003-12-10  8:40     ` Ross Dickson
  2003-12-11 14:32     ` Jesse Allen
  1 sibling, 2 replies; 35+ messages in thread
From: Mikael Pettersson @ 2003-12-10 10:00 UTC (permalink / raw)
  To: Jesse Allen; +Cc: Ross Dickson, linux-kernel, AMartin

Jesse Allen writes:
 > --- linux/arch/i386/kernel/apic.c	2003-10-25 11:44:59.000000000 -0700
 > +++ linux-jla/arch/i386/kernel/apic.c	2003-12-09 19:07:19.000000000 -0700
 > @@ -1089,6 +1089,16 @@
 >  	 */
 >  	irq_stat[cpu].apic_timer_irqs++;
 >  
 > +#if defined(CONFIG_MK7) && defined(CONFIG_BLK_DEV_AMD74XX)
 > +
 > +	/*
 > +	 * on 2200XP & nforce2 chipset we need at least 500ns delay here
 > +	 * to stop lockups with udma100 drive. try to scale delay time
 > +	 * with cpu speed. Ross Dickson.
 > +	 */
 > +	ndelay((cpu_khz >> 12)+200 ); /* don't ack too soon or hard lockup */
 > +#endif
 > +
 >  	/*
 >  	 * NOTE! We'd better ACK the irq immediately,
 >  	 * because timer handling can be slow.

This is too much of a kludge. APIC timer ACKing is supposed to be fast.
Please try without this delay but with the disconnect PCI quirk.

If the delay is still needed even when disconnect is disabled, _then_
we can discuss how to do the delay properly.

/Mikael


* Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-10  5:43   ` Ross Dickson
@ 2003-12-10 16:06     ` Maciej W. Rozycki
  2003-12-11  6:55       ` Ross Dickson
  0 siblings, 1 reply; 35+ messages in thread
From: Maciej W. Rozycki @ 2003-12-10 16:06 UTC (permalink / raw)
  To: Ross Dickson; +Cc: linux-kernel, AMartin, kernel, Ian Kumlien

On Wed, 10 Dec 2003, Ross Dickson wrote:

> Relevant dmesg output from Albatron KM18G Pro ( this is different MOBO (same type) but 
> this time has a barton core 2500 XP cpu).
> 
> enabled ExtINT on CPU#0
> ESR value before enabling vector: 00000000
> ESR value after enabling vector: 00000000
> ENABLING IO-APIC IRQs
> init IO_APIC IRQs
>  IO-APIC (apicid-pin) 2-0, 2-16, 2-17, 2-18, 2-19, 2-20, 2-21, 2-22, 2-23 not connected.
> ..TIMER: vector=0x31 pin1=2 pin2=-1
> ..MP-BIOS bug: 8254 timer not connected to IO-APIC pin2
> ..TIMER: Is timer irq0 connected to IOAPIC Pin0? ...
> IOAPIC[0]: Set PCI routing entry (2-0 -> 0x31 -> IRQ 0 Mode:0 Active:0)
> ..TIMER check 8259 ints disabled, imr1:ff, imr2:ff
> ..TIMER: works OK on apic pin0 irq0
> Using local APIC timer interrupts.
> calibrating APIC timer ...
> ..... CPU clock speed is 1829.0708 MHz.
> ..... host bus clock speed is 332.0674 MHz.
> cpu: 0, clocks: 332674, slice: 166337
> CPU0<T0:332672,T1:166320,D:15,S:166337,C:332674>

 Hmm, while this is different from what is documented in the MP Spec, it
looks like the 8254 IRQ is connected to INTIN0 indeed.  We can handle such
a setup if the BIOS reports routing correctly.  Since you invoke
io_apic_set_pci_routing() I assume you use ACPI for IRQ routing
information.  Can you please rebuild the kernel with APIC_DEBUG set to 1
in include/asm-i386/apic.h and send me the bootstrap log?  Can you please
send me the output of a tool called `mptable' as well, so that I can
compare the results?

  Maciej

-- 
+  Maciej W. Rozycki, Technical University of Gdansk, Poland   +
+--------------------------------------------------------------+
+        e-mail: macro@ds2.pg.gda.pl, PGP key available        +


* Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-10 16:06     ` Maciej W. Rozycki
@ 2003-12-11  6:55       ` Ross Dickson
  2003-12-11 11:47         ` Ian Kumlien
  2003-12-11 15:15         ` Maciej W. Rozycki
  0 siblings, 2 replies; 35+ messages in thread
From: Ross Dickson @ 2003-12-11  6:55 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: linux-kernel, AMartin, kernel, Ian Kumlien

On Thursday 11 December 2003 02:06, Maciej W. Rozycki wrote:
> On Wed, 10 Dec 2003, Ross Dickson wrote:
> 
> > Relevant dmesg output from Albatron KM18G Pro ( this is different MOBO (same type) but 
> > this time has a barton core 2500 XP cpu).
> > 
> > enabled ExtINT on CPU#0
> > ESR value before enabling vector: 00000000
> > ESR value after enabling vector: 00000000
> > ENABLING IO-APIC IRQs
> > init IO_APIC IRQs
> >  IO-APIC (apicid-pin) 2-0, 2-16, 2-17, 2-18, 2-19, 2-20, 2-21, 2-22, 2-23 not connected.
> > ..TIMER: vector=0x31 pin1=2 pin2=-1
> > ..MP-BIOS bug: 8254 timer not connected to IO-APIC pin2
> > ..TIMER: Is timer irq0 connected to IOAPIC Pin0? ...
> > IOAPIC[0]: Set PCI routing entry (2-0 -> 0x31 -> IRQ 0 Mode:0 Active:0)
> > ..TIMER check 8259 ints disabled, imr1:ff, imr2:ff
> > ..TIMER: works OK on apic pin0 irq0
> > Using local APIC timer interrupts.
> > calibrating APIC timer ...
> > ..... CPU clock speed is 1829.0708 MHz.
> > ..... host bus clock speed is 332.0674 MHz.
> > cpu: 0, clocks: 332674, slice: 166337
> > CPU0<T0:332672,T1:166320,D:15,S:166337,C:332674>
> 
>  Hmm, while this is different from what is documented in the MP Spec, it
> looks like the 8254 IRQ is connected to INTIN0 indeed.  We can handle such
> a setup if the BIOS reports routing correctly.  Since you invoke
> io_apic_set_pci_routing() I assume you use ACPI for IRQ routing
> information.  Can you please rebuild the kernel with APIC_DEBUG set to 1
> in include/asm-i386/apic.h and send me the bootstrap log?  Can you please
> send me the output of a tool called `mptable' as well, so that I can
> compare the results?
> 
>   Maciej
> 
> -- 
> +  Maciej W. Rozycki, Technical University of Gdansk, Poland   +
> +--------------------------------------------------------------+
> +        e-mail: macro@ds2.pg.gda.pl, PGP key available        +
> 
> 
> 
Thanks Maciej,
bootstrap log follows

ACPI: RSDT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x1dff3000
ACPI: FADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x1dff3040
ACPI: MADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x1dff7980
ACPI: DSDT (v001 NVIDIA AWRDACPI 0x00001000 MSFT 0x0100000e) @ 0x00000000
ACPI: Local APIC address 0xfee00000
Boot CPU = 0
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 Pentium(tm) Pro APIC version 16
    Floating point unit present.
    Machine Exception supported.
    64 bit compare & exchange supported.
    Internal APIC present.
    SEP present.
    MTRR  present.
    PGE  present.
    MCA  present.
    CMOV  present.
    PAT  present.
    PSE  present.
    MMX  present.
    FXSR  present.
    XMM  present.
    Bootup CPU
ACPI: LAPIC_NMI (acpi_id[0x00] polarity[0x1] trigger[0x1] lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec00000] global_irq_base[0x0])
IOAPIC[0]: Assigned apic_id 2
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, IRQ 0-23
Bus #0 is ISA
Int: type 3, pol 0, trig 0, bus 0, irq 0, 2-0
Int: type 0, pol 0, trig 0, bus 0, irq 1, 2-1
Int: type 0, pol 0, trig 0, bus 0, irq 3, 2-3
Int: type 0, pol 0, trig 0, bus 0, irq 4, 2-4
Int: type 0, pol 0, trig 0, bus 0, irq 5, 2-5
Int: type 0, pol 0, trig 0, bus 0, irq 6, 2-6
Int: type 0, pol 0, trig 0, bus 0, irq 7, 2-7
Int: type 0, pol 0, trig 0, bus 0, irq 8, 2-8
Int: type 0, pol 0, trig 0, bus 0, irq 9, 2-9
Int: type 0, pol 0, trig 0, bus 0, irq 10, 2-10
Int: type 0, pol 0, trig 0, bus 0, irq 11, 2-11
Int: type 0, pol 0, trig 0, bus 0, irq 12, 2-12
Int: type 0, pol 0, trig 0, bus 0, irq 13, 2-13
Int: type 0, pol 0, trig 0, bus 0, irq 14, 2-14
Int: type 0, pol 0, trig 0, bus 0, irq 15, 2-15
ACPI: INT_SRC_OVR (bus[0] irq[0x0] global_irq[0x2] polarity[0x0] trigger[0x0])
Int: type 0, pol 0, trig 0, bus 0, irq 0, 2-2
ACPI: INT_SRC_OVR (bus[0] irq[0x9] global_irq[0x9] polarity[0x1] trigger[0x3])
Int: type 0, pol 1, trig 3, bus 0, irq 9, 2-9
ACPI BALANCE SET
Using ACPI (MADT) for SMP configuration information
Kernel command line: splash=silent root=/dev/hda2 hdc=ide-scsi hdclun=0
ide_setup: hdc=ide-scsi
ide_setup: hdclun=0
mapped APIC to ffffe000 (fee00000)
mapped IOAPIC to ffffd000 (fec00000)
Initializing CPU#0
Detected 1830.076 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 3620.86 BogoMIPS
Memory: 482980k/491456k available (1800k kernel code, 8088k reserved, 622k data, 112k init, 0k highmem)
Dentry cache hash table entries: 65536 (order: 7, 524288 bytes)
Inode cache hash table entries: 32768 (order: 6, 262144 bytes)
Mount cache hash table entries: 512 (order: 0, 4096 bytes)
Buffer cache hash table entries: 32768 (order: 5, 131072 bytes)
Page-cache hash table entries: 131072 (order: 7, 524288 bytes)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU:     After generic, caps: 0383fbff c1c3fbff 00000000 00000000
CPU:             Common caps: 0383fbff c1c3fbff 00000000 00000000
CPU: AMD Athlon(tm) XP 2500+ stepping 00
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
Getting VERSION: 40010
Getting VERSION: 40010
Getting ID: 0
Getting ID: f000000
Getting LVT0: 700
Getting LVT1: 400
enabled ExtINT on CPU#0
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
ENABLING IO-APIC IRQs
Synchronizing Arb IDs.
init IO_APIC IRQs
 IO-APIC (apicid-pin) 2-0, 2-16, 2-17, 2-18, 2-19, 2-20, 2-21, 2-22, 2-23 not connected.
..TIMER: vector=0x31 pin1=2 pin2=-1
..MP-BIOS bug: 8254 timer not connected to IO-APIC pin2
..TIMER: Is timer irq0 connected to IOAPIC Pin0? ...
IOAPIC[0]: Set PCI routing entry (2-0 -> 0x31 -> IRQ 0 Mode:0 Active:0)
..TIMER check 8259 ints disabled, imr1:ff, imr2:ff
..TIMER: works OK on apic pin0 irq0
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 1829.0813 MHz.
..... host bus clock speed is 332.0693 MHz.
cpu: 0, clocks: 332693, slice: 166346
CPU0<T0:332688,T1:166336,D:6,S:166346,C:332693>
mtrr: v1.40 (20010327) Richard Gooch (rgooch@atnf.csiro.au)
mtrr: detected mtrr type: Intel
ACPI: Subsystem revision 20031002
PCI: PCI BIOS revision 2.10 entry at 0xfb4e0, last bus=2
PCI: Using configuration type 1
IOAPIC[0]: Set PCI routing entry (2-9 -> 0x71 -> IRQ 9 Mode:1 Active:0)
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: System [ACPI] (supports S0 S1 S4 S5)
ACPI: PCI Root Bridge [PCI0] (00:00)
PCI: Probing PCI hardware (bus 00)
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.HUB0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.AGPB._PRT]
ACPI: PCI Interrupt Link [LNK1] (IRQs 3 4 5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNK2] (IRQs 3 4 5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNK3] (IRQs 3 4 5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNK4] (IRQs 3 4 5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNK5] (IRQs 3 4 *5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LUBA] (IRQs 3 4 *5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LUBB] (IRQs 3 4 5 6 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LMAC] (IRQs 3 4 5 6 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LAPU] (IRQs 3 4 5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LACI] (IRQs 3 4 5 6 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LMCI] (IRQs 3 4 5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LSMB] (IRQs 3 4 5 6 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LUB2] (IRQs 3 4 *5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LFIR] (IRQs 3 4 5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [L3CM] (IRQs 3 4 5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LIDE] (IRQs 3 4 5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [APC1] (IRQs 16)
ACPI: PCI Interrupt Link [APC2] (IRQs 17)
ACPI: PCI Interrupt Link [APC3] (IRQs 18)
ACPI: PCI Interrupt Link [APC4] (IRQs 19)
ACPI: PCI Interrupt Link [APC5] (IRQs *16)
ACPI: PCI Interrupt Link [APCF] (IRQs 20 21 22)
ACPI: PCI Interrupt Link [APCG] (IRQs 20 21 22)
ACPI: PCI Interrupt Link [APCH] (IRQs 20 21 22)
ACPI: PCI Interrupt Link [APCI] (IRQs 20 21 22)
ACPI: PCI Interrupt Link [APCJ] (IRQs 20 21 22)
ACPI: PCI Interrupt Link [APCK] (IRQs 20 21 22)
ACPI: PCI Interrupt Link [APCS] (IRQs *23)
ACPI: PCI Interrupt Link [APCL] (IRQs 20 21 22)
ACPI: PCI Interrupt Link [APCM] (IRQs 20 21 22)
ACPI: PCI Interrupt Link [AP3C] (IRQs 20 21 22)
ACPI: PCI Interrupt Link [APCZ] (IRQs 20 21 22)
PCI: Probing PCI hardware
ACPI: PCI Interrupt Link [APCS] enabled at IRQ 23
IOAPIC[0]: Set PCI routing entry (2-23 -> 0xa9 -> IRQ 23 Mode:1 Active:0)
00:00:01[A] -> 2-23 -> IRQ 23
Pin 2-23 already programmed
ACPI: PCI Interrupt Link [APCF] enabled at IRQ 20
IOAPIC[0]: Set PCI routing entry (2-20 -> 0xb1 -> IRQ 20 Mode:1 Active:0)
00:00:02[A] -> 2-20 -> IRQ 20
ACPI: PCI Interrupt Link [APCG] enabled at IRQ 22
IOAPIC[0]: Set PCI routing entry (2-22 -> 0xb9 -> IRQ 22 Mode:1 Active:0)
00:00:02[B] -> 2-22 -> IRQ 22
ACPI: PCI Interrupt Link [APCL] enabled at IRQ 21
IOAPIC[0]: Set PCI routing entry (2-21 -> 0xc1 -> IRQ 21 Mode:1 Active:0)
00:00:02[C] -> 2-21 -> IRQ 21
ACPI: PCI Interrupt Link [APCH] enabled at IRQ 20
Pin 2-20 already programmed
ACPI: PCI Interrupt Link [APCI] enabled at IRQ 22
Pin 2-22 already programmed
ACPI: PCI Interrupt Link [APCJ] enabled at IRQ 21
Pin 2-21 already programmed
ACPI: PCI Interrupt Link [APCK] enabled at IRQ 20
Pin 2-20 already programmed
ACPI: PCI Interrupt Link [APCM] enabled at IRQ 22
Pin 2-22 already programmed
ACPI: PCI Interrupt Link [APCZ] enabled at IRQ 21
Pin 2-21 already programmed
ACPI: PCI Interrupt Link [APC3] enabled at IRQ 18
IOAPIC[0]: Set PCI routing entry (2-18 -> 0xc9 -> IRQ 18 Mode:1 Active:0)
00:01:06[A] -> 2-18 -> IRQ 18
ACPI: PCI Interrupt Link [APC4] enabled at IRQ 19
IOAPIC[0]: Set PCI routing entry (2-19 -> 0xd1 -> IRQ 19 Mode:1 Active:0)
00:01:06[B] -> 2-19 -> IRQ 19
ACPI: PCI Interrupt Link [APC1] enabled at IRQ 16
IOAPIC[0]: Set PCI routing entry (2-16 -> 0xd9 -> IRQ 16 Mode:1 Active:0)
00:01:06[C] -> 2-16 -> IRQ 16
ACPI: PCI Interrupt Link [APC2] enabled at IRQ 17
IOAPIC[0]: Set PCI routing entry (2-17 -> 0xe1 -> IRQ 17 Mode:1 Active:0)
00:01:06[D] -> 2-17 -> IRQ 17
Pin 2-19 already programmed
Pin 2-16 already programmed
Pin 2-17 already programmed
Pin 2-18 already programmed
Pin 2-16 already programmed
Pin 2-17 already programmed
Pin 2-18 already programmed
Pin 2-19 already programmed
ACPI: PCI Interrupt Link [APC5] enabled at IRQ 16
Pin 2-16 already programmed
number of MP IRQ sources: 15.
number of IO-APIC #2 registers: 24.
testing the IO APIC.......................

IO APIC #2......
.... register #00: 02000000
.......    : physical APIC id: 02
.......    : Delivery Type: 0
.......    : LTS          : 0
.... register #01: 00170011
.......     : max redirection entries: 0017
.......     : PRQ implemented: 0
.......     : IO APIC version: 0011
.... register #02: 00000000
.......     : arbitration: 00
.... IRQ redirection table:
 NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
 00 001 01  0    0    0   0   0    1    1    31
 01 001 01  0    0    0   0   0    1    1    39
 02 000 00  0    0    0   0   0    0    0    00
 03 001 01  0    0    0   0   0    1    1    41
 04 001 01  0    0    0   0   0    1    1    49
 05 001 01  0    0    0   0   0    1    1    51
 06 001 01  0    0    0   0   0    1    1    59
 07 001 01  0    0    0   0   0    1    1    61
 08 001 01  0    0    0   0   0    1    1    69
 09 001 01  0    1    0   0   0    1    1    71
 0a 001 01  0    0    0   0   0    1    1    79
 0b 001 01  0    0    0   0   0    1    1    81
 0c 001 01  0    0    0   0   0    1    1    89
 0d 001 01  0    0    0   0   0    1    1    91
 0e 001 01  0    0    0   0   0    1    1    99
 0f 001 01  0    0    0   0   0    1    1    A1
 10 001 01  1    1    0   0   0    1    1    D9
 11 001 01  1    1    0   0   0    1    1    E1
 12 001 01  1    1    0   0   0    1    1    C9
 13 001 01  1    1    0   0   0    1    1    D1
 14 001 01  1    1    0   0   0    1    1    B1
 15 001 01  1    1    0   0   0    1    1    C1
 16 001 01  1    1    0   0   0    1    1    B9
 17 001 01  1    1    0   0   0    1    1    A9
IRQ to pin mappings:
IRQ0 -> 0:2-> 0:0
IRQ1 -> 0:1
IRQ3 -> 0:3
IRQ4 -> 0:4
IRQ5 -> 0:5
IRQ6 -> 0:6
IRQ7 -> 0:7
IRQ8 -> 0:8
IRQ9 -> 0:9-> 0:9
IRQ10 -> 0:10
IRQ11 -> 0:11
IRQ12 -> 0:12
IRQ13 -> 0:13
IRQ14 -> 0:14
IRQ15 -> 0:15
IRQ16 -> 0:16
IRQ17 -> 0:17
IRQ18 -> 0:18
IRQ19 -> 0:19
IRQ20 -> 0:20
IRQ21 -> 0:21
IRQ22 -> 0:22
IRQ23 -> 0:23
.................................... done.
PCI: Using ACPI for IRQ routing
PCI: if you experience problems, try using option 'pci=noacpi' or even 'acpi=off'
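
As an aside on reading the dump above: a minimal sketch, assuming the usual
82093AA-style layout of the I/O APIC version register (version in bits 7:0,
PRQ in bit 15, highest redirection entry index in bits 23:16), of how
"register #01: 00170011" breaks down. The struct and function names are made
up for illustration and are not kernel identifiers.

```c
#include <stdint.h>
#include <stdio.h>

/* Decoded fields of the I/O APIC version register (register index 0x01).
 * Bit positions are an assumption based on the 82093AA-style layout. */
struct ioapic_version {
	unsigned version;   /* bits 7:0   */
	unsigned prq;       /* bit 15     */
	unsigned max_redir; /* bits 23:16, index of the last redirection entry */
};

/* Split a raw 32-bit register value into its fields. */
struct ioapic_version decode_ioapic_version(uint32_t reg)
{
	struct ioapic_version v;

	v.version = reg & 0xff;
	v.prq = (reg >> 15) & 1;
	v.max_redir = (reg >> 16) & 0xff;
	return v;
}

/* Print the fields in roughly the same shape as the boot log. */
void print_ioapic_version(uint32_t reg)
{
	struct ioapic_version v = decode_ioapic_version(reg);

	printf(".... register #01: %08X\n", (unsigned)reg);
	printf(".......     : max redirection entries: %04X\n", v.max_redir);
	printf(".......     : PRQ implemented: %X\n", v.prq);
	printf(".......     : IO APIC version: %04X\n", v.version);
	/* The field holds the highest index, so the pin count is one more. */
	printf(".......     : redirection entries available: %u\n", v.max_redir + 1);
}
```

Decoding 0x00170011 this way gives version 0x11 and a highest entry index of
0x17, i.e. 24 redirection entries, which agrees with the "IRQ 0-23" line
earlier in the log.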

mptable doesn't like my BIOS.
I tried setting the BIOS MP version to both 1.1 and 1.4.

albatron:/usr/src/mptable-2.0.15a # ./mptable -verbose

===============================================================================

MPTable, version 2.0.15 Linux

 looking for EBDA pointer @ 0x040e, found, searching EBDA @ 0x0009fc00
 searching CMOS 'top of mem' @ 0x0009f800 (638K)
 searching default 'top of mem' @ 0x0009fc00 (639K)
 searching BIOS @ 0x000f0000

 MP FPS found in BIOS @ physical addr: 0x000f50b0

-------------------------------------------------------------------------------

MP Floating Pointer Structure:

  location:                     BIOS
  physical address:             0x000f50b0
  signature:                    '_MP_'
  length:                       16 bytes
  version:                      1.1
  checksum:                     0x00
  mode:                         Virtual Wire

-------------------------------------------------------------------------------

MP Config Table Header:

  physical address:             0x0xf0c00
  signature:                    '$ml$'
  base table length:            0
  version:                      1.6
  checksum:                     0x00
  OEM ID:                       'Ä
                                  ¸§'
°öProduct ID:                   '(
m'P
  OEM table pointer:            0x12d90e22
  OEM table size:               7964
  entry count:                  7964
  local APIC address:           0x1f1c1f1c
  extended table length:        65284
  extended table checksum:      255

-------------------------------------------------------------------------------

MP Config Base Table Entries:

--
MPTABLE HOSED! record type = 55
albatron:/usr/src/mptable-2.0.15a #

Finally, others working with kernel 2.6 earlier trialled the following patch, which
may provide some more clues. It was retrieved from:

http://www.kernel.org/pub/linux/kernel/people/bart/2.6.0-test11-bart1/broken-out/nforce2-apic.patch 
 
[x86] do not wrongly override mp_ExtINT IRQ

From: Mathieu <cheuche+lkml@free.fr>.

With this patch timer IRQ0 is correctly set to IO-APIC-edge
(not XT-PIC) on nForce2 boards when using APIC and ACPI.

 arch/i386/kernel/mpparse.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletion(-)

diff -puN arch/i386/kernel/mpparse.c~nforce2-apic arch/i386/kernel/mpparse.c
--- linux-2.6.0-test11/arch/i386/kernel/mpparse.c~nforce2-apic	2003-12-08 00:12:25.782597272 +0100
+++ linux-2.6.0-test11-root/arch/i386/kernel/mpparse.c	2003-12-08 00:12:25.786596664 +0100
@@ -962,7 +962,8 @@ void __init mp_override_legacy_irq (
 	 */
 	for (i = 0; i < mp_irq_entries; i++) {
 		if ((mp_irqs[i].mpc_dstapic == intsrc.mpc_dstapic)
-			&& (mp_irqs[i].mpc_srcbusirq == intsrc.mpc_srcbusirq)) {
+			&& (mp_irqs[i].mpc_srcbusirq == intsrc.mpc_srcbusirq)
+			&& (mp_irqs[i].mpc_irqtype == intsrc.mpc_irqtype)) {
 			mp_irqs[i] = intsrc;
 			found = 1;
 			break;

_
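
A standalone sketch of what the added mpc_irqtype comparison changes, assuming
the legacy table starts out as in the bootstrap log above (an ExtINT entry for
IRQ 0 on pin 0, plain INT entries for IRQs 1 and 3-15) and that the two
INT_SRC_OVR entries from that log are applied. The struct irq_src type and the
function names are hypothetical and only model the fields the patch compares;
this is not the kernel code.

```c
#include <assert.h>

#define MAX_IRQ_SOURCES 256
#define IRQTYPE_INT     0 /* "type 0" in the log (mp_INT)    */
#define IRQTYPE_EXTINT  3 /* "type 3" in the log (mp_ExtINT) */

/* Hypothetical stand-in for one interrupt source entry. */
struct irq_src {
	int irqtype;   /* INT or ExtINT          */
	int srcbusirq; /* ISA IRQ number         */
	int dstapic;   /* destination I/O APIC id */
	int dstirq;    /* destination pin (INTIN) */
};

/* Build the initial legacy table as the log shows it; returns the count. */
int build_legacy_table(struct irq_src *tbl, int apic)
{
	int n = 0;

	tbl[n++] = (struct irq_src){ IRQTYPE_EXTINT, 0, apic, 0 };
	for (int irq = 1; irq < 16; irq++) {
		if (irq == 2)
			continue; /* no entry for IRQ 2 appears in the log */
		tbl[n++] = (struct irq_src){ IRQTYPE_INT, irq, apic, irq };
	}
	return n;
}

/* Apply one override. With match_type == 0 this mimics the unpatched
 * comparison; with match_type != 0 the entry type must match as well.
 * Returns the new entry count. */
int apply_override(struct irq_src *tbl, int n, struct irq_src ovr, int match_type)
{
	for (int i = 0; i < n; i++) {
		if (tbl[i].dstapic == ovr.dstapic &&
		    tbl[i].srcbusirq == ovr.srcbusirq &&
		    (!match_type || tbl[i].irqtype == ovr.irqtype)) {
			tbl[i] = ovr; /* existing entry is replaced */
			return n;
		}
	}
	assert(n < MAX_IRQ_SOURCES);
	tbl[n] = ovr; /* no match: append a new entry */
	return n + 1;
}

/* Apply the IRQ 0 -> pin 2 and IRQ 9 -> pin 9 overrides from the log. */
int count_after_overrides(int match_type)
{
	struct irq_src tbl[MAX_IRQ_SOURCES];
	int n = build_legacy_table(tbl, 2);

	n = apply_override(tbl, n, (struct irq_src){ IRQTYPE_INT, 0, 2, 2 }, match_type);
	n = apply_override(tbl, n, (struct irq_src){ IRQTYPE_INT, 9, 2, 9 }, match_type);
	return n;
}
```

With match_type set to 0 the IRQ 0 override replaces the ExtINT entry and the
table stays at 15 sources; with match_type set to 1 the override is appended
and the table grows to 16, which is consistent with the "number of MP IRQ
sources" lines in the dmesg differences quoted below.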

However, the results were not completely successful, as this posting appears to
show the timer still being routed through the 8259:

http://linux.derkeiler.com/Mailing-Lists/Kernel/2003-12/1303.html

dmesg differences: 

1. 
 after: 
 ..TIMER: vector=0x31 pin1=2 pin2=0 

before: 
 ..TIMER: vector=0x31 pin1=2 pin2=-1 

2. 
 after: 
 ...trying to set up timer (IRQ0) through the 8259A ... 
 ..... (found pin 0) ...works. 
 number of MP IRQ sources: 16. 

before: 
 ...trying to set up timer (IRQ0) through the 8259A ... failed. 
 ...trying to set up timer as Virtual Wire IRQ... failed. 
 ...trying to set up timer as ExtINT IRQ... works. 
 number of MP IRQ sources: 15. 

Perhaps someone else could get mptable to run on their machine and send you
the result.

Regards
Ross 



* Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-11 11:47         ` Ian Kumlien
@ 2003-12-11  9:12           ` Ross Dickson
  2003-12-11 17:52             ` Ian Kumlien
  2003-12-11 14:58           ` Jesse Allen
  1 sibling, 1 reply; 35+ messages in thread
From: Ross Dickson @ 2003-12-11  9:12 UTC (permalink / raw)
  To: Ian Kumlien; +Cc: Maciej W. Rozycki, linux-kernel, AMartin, kernel

On Thursday 11 December 2003 21:47, Ian Kumlien wrote:
> On Thu, 2003-12-11 at 07:55, Ross Dickson wrote:
> > albatron:/usr/src/mptable-2.0.15a # ./mptable -verbose
> > 
> > ===============================================================================
> > 
> > MPTable, version 2.0.15 Linux
> > 
> >  looking for EBDA pointer @ 0x040e, found, searching EBDA @ 0x0009fc00
> >  searching CMOS 'top of mem' @ 0x0009f800 (638K)
> >  searching default 'top of mem' @ 0x0009fc00 (639K)
> >  searching BIOS @ 0x000f0000
> > 
> >  MP FPS found in BIOS @ physical addr: 0x000f50b0
> > 
> > -------------------------------------------------------------------------------
> > 
> > MP Floating Pointer Structure:
> > 
> >   location:                     BIOS
> >   physical address:             0x000f50b0
> >   signature:                    '_MP_'
> >   length:                       16 bytes
> >   version:                      1.1
> >   checksum:                     0x00
> >   mode:                         Virtual Wire
> > 
> > -------------------------------------------------------------------------------
> > 
> > MP Config Table Header:
> > 
> >   physical address:             0x0xf0c00
> >   signature:                    '$ml$'
> >   base table length:            0
> >   version:                      1.6
> >   checksum:                     0x00
> >   OEM ID:                       'Ä
> >                                   ¸§'
> > °öProduct ID:                   '(
> > m'P
> >   OEM table pointer:            0x12d90e22
> >   OEM table size:               7964
> >   entry count:                  7964
> >   local APIC address:           0x1f1c1f1c
> >   extended table length:        65284
> >   extended table checksum:      255
> > 
> > -------------------------------------------------------------------------------
> > 
> > MP Config Base Table Entries:
> > 
> > --
> > MPTABLE HOSED! record type = 55
> > albatron:/usr/src/mptable-2.0.15a #
> > 
> 
> > Perhaps someone else could get mptable to run on their machine and send you
> > the result.
> 
> mptable doesn't seem to accept its own options; anyway, here's the
> output.
> 
> mptable -extra -verbose -pirq
>  
> ===============================================================================
>  
> MPTable, version 2.0.15 Linux
>  
>  looking for EBDA pointer @ 0x040e, found, searching EBDA @ 0x0009fc00
>  searching CMOS 'top of mem' @ 0x0009f800 (638K)
>  searching default 'top of mem' @ 0x0009fc00 (639K)
>  searching BIOS @ 0x000f0000
>  
>  MP FPS found in BIOS @ physical addr: 0x000f5ce0
>  
> -------------------------------------------------------------------------------
>  
> MP Floating Pointer Structure:
>  
>   location:                     BIOS
>   physical address:             0x000f5ce0
>   signature:                    '_MP_'
>   length:                       16 bytes
>   version:                      1.1
>   checksum:                     0x00
>   mode:                         Virtual Wire
>  
> -------------------------------------------------------------------------------
>  
> MP Config Table Header:
>  
>   physical address:             0x0xf0c00
>   signature:                    ''
>   base table length:            1280
>   version:                      1.7
>   checksum:                     0x00
>   OEM ID:                       ''
>   Product ID:                   ''
>   OEM table pointer:            0x0000ffff
>   OEM table size:               0
>   entry count:                  65535
>   local APIC address:           0x000000c4
>   extended table length:        1
>   extended table checksum:      0
>  
> -------------------------------------------------------------------------------
>  
> MP Config Base Table Entries:
>  
> --
> Processors:     APIC ID Version State           Family  Model   Step    Flags
>                  0       0x 7    BSP, usable     15      15      15      0x1a00c035
>                  0       0x 0    AP, unusable    0       0       10      0x78ffff0a
> --
> MPTABLE HOSED! record type = 15
> 
> I couldn't find the source, so I used an old Red Hat RPM...
> (Asus A7N8X-X bios 1007)
>  
> -- 
> Ian Kumlien <pomac () vapor ! com> -- http://pomac.netswarm.net
> 

Thanks Ian

Also many thanks for pointing out the relevant section to look in with the AMD
cpu link that you sent - Credit where credit is due (assuming we are both on the
right track).

I had a read and refined your surmisings. I think the
problem appears synchronous with the APIC timer for two reasons:
1) Any APIC IRQ can cause re-connection of the system bus after a disconnect.
2) The APIC timer IRQ, in my examinations, has the shortest path to an ack.

I also had a look back through the athlon cooler and power management 
postings and web site articles. I was blissfully ignorant of these issues when I
started and now I wonder what I have stepped into... Yuk

I submitted a support request to AMD. Apologies for not cc'ing you; I kept
the cc's down to just nvidia and the mailing list. If you have not seen it yet,
it is here:

http://lkml.org/lkml/2003/12/11/17

We hope....

Regards
Ross



* Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-11  6:55       ` Ross Dickson
@ 2003-12-11 11:47         ` Ian Kumlien
  2003-12-11  9:12           ` Ross Dickson
  2003-12-11 14:58           ` Jesse Allen
  2003-12-11 15:15         ` Maciej W. Rozycki
  1 sibling, 2 replies; 35+ messages in thread
From: Ian Kumlien @ 2003-12-11 11:47 UTC (permalink / raw)
  To: ross; +Cc: Maciej W. Rozycki, linux-kernel, AMartin, kernel

[-- Attachment #1: Type: text/plain, Size: 4023 bytes --]

On Thu, 2003-12-11 at 07:55, Ross Dickson wrote:
> albatron:/usr/src/mptable-2.0.15a # ./mptable -verbose
> 
> ===============================================================================
> 
> MPTable, version 2.0.15 Linux
> 
>  looking for EBDA pointer @ 0x040e, found, searching EBDA @ 0x0009fc00
>  searching CMOS 'top of mem' @ 0x0009f800 (638K)
>  searching default 'top of mem' @ 0x0009fc00 (639K)
>  searching BIOS @ 0x000f0000
> 
>  MP FPS found in BIOS @ physical addr: 0x000f50b0
> 
> -------------------------------------------------------------------------------
> 
> MP Floating Pointer Structure:
> 
>   location:                     BIOS
>   physical address:             0x000f50b0
>   signature:                    '_MP_'
>   length:                       16 bytes
>   version:                      1.1
>   checksum:                     0x00
>   mode:                         Virtual Wire
> 
> -------------------------------------------------------------------------------
> 
> MP Config Table Header:
> 
>   physical address:             0x0xf0c00
>   signature:                    '$ml$'
>   base table length:            0
>   version:                      1.6
>   checksum:                     0x00
>   OEM ID:                       'Ä
>                                   ¸§'
> °öProduct ID:                   '(
> m'P
>   OEM table pointer:            0x12d90e22
>   OEM table size:               7964
>   entry count:                  7964
>   local APIC address:           0x1f1c1f1c
>   extended table length:        65284
>   extended table checksum:      255
> 
> -------------------------------------------------------------------------------
> 
> MP Config Base Table Entries:
> 
> --
> MPTABLE HOSED! record type = 55
> albatron:/usr/src/mptable-2.0.15a #
> 

> Perhaps someone else could get mptable to run on their machine and send you
> the result.

mptable doesn't seem to accept its own options; anyway, here's the
output.

mptable -extra -verbose -pirq
 
===============================================================================
 
MPTable, version 2.0.15 Linux
 
 looking for EBDA pointer @ 0x040e, found, searching EBDA @ 0x0009fc00
 searching CMOS 'top of mem' @ 0x0009f800 (638K)
 searching default 'top of mem' @ 0x0009fc00 (639K)
 searching BIOS @ 0x000f0000
 
 MP FPS found in BIOS @ physical addr: 0x000f5ce0
 
-------------------------------------------------------------------------------
 
MP Floating Pointer Structure:
 
  location:                     BIOS
  physical address:             0x000f5ce0
  signature:                    '_MP_'
  length:                       16 bytes
  version:                      1.1
  checksum:                     0x00
  mode:                         Virtual Wire
 
-------------------------------------------------------------------------------
 
MP Config Table Header:
 
  physical address:             0x0xf0c00
  signature:                    ''
  base table length:            1280
  version:                      1.7
  checksum:                     0x00
  OEM ID:                       ''
  Product ID:                   ''
  OEM table pointer:            0x0000ffff
  OEM table size:               0
  entry count:                  65535
  local APIC address:           0x000000c4
  extended table length:        1
  extended table checksum:      0
 
-------------------------------------------------------------------------------
 
MP Config Base Table Entries:
 
--
Processors:     APIC ID Version State           Family  Model   Step    Flags
                 0       0x 7    BSP, usable     15      15      15      0x1a00c035
                 0       0x 0    AP, unusable    0       0       10      0x78ffff0a
--
MPTABLE HOSED! record type = 15

I couldn't find the source, so I used an old Red Hat RPM...
(Asus A7N8X-X bios 1007)
 
-- 
Ian Kumlien <pomac () vapor ! com> -- http://pomac.netswarm.net

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]


* Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-10 10:00   ` Mikael Pettersson
  2003-12-10  8:40     ` Ross Dickson
@ 2003-12-11 14:32     ` Jesse Allen
  1 sibling, 0 replies; 35+ messages in thread
From: Jesse Allen @ 2003-12-11 14:32 UTC (permalink / raw)
  To: Mikael Pettersson; +Cc: Ross Dickson, linux-kernel, AMartin

On Wed, Dec 10, 2003 at 11:00:39AM +0100, Mikael Pettersson wrote:
> Please try without this delay but with the disconnect PCI quirk.
> 

OK, I have tried it without the delay, and with Ross' timer patch.  As
expected, it locks up, and nmi_watchdog doesn't work.  After adding the
disconnect quirk patch, the lockups are gone and nmi_watchdog works.  So I
see no difference in effect between the disconnect patch and the ACK delay
patch.  I did find that nmi_watchdog depends on having either the
disconnect patch or the delay patch (not an io_apic patch).  Do you think
the disconnect patch is better?  In any event, they both indicate a
behavior, and there may be a better solution to all of it.

Jesse


* Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-11 11:47         ` Ian Kumlien
  2003-12-11  9:12           ` Ross Dickson
@ 2003-12-11 14:58           ` Jesse Allen
  2003-12-11 15:20             ` Craig Bradney
  1 sibling, 1 reply; 35+ messages in thread
From: Jesse Allen @ 2003-12-11 14:58 UTC (permalink / raw)
  To: Ian Kumlien; +Cc: linux-kernel, ross, macro

My mptable output looks pretty weird.  (Product ID "ny Key "?)
It doesn't even compare to the other two.  I have a Shuttle AN35N.


===============================================================================

MPTable, version 2.0.15 Linux

-------------------------------------------------------------------------------

MP Floating Pointer Structure:

  location:			BIOS
  physical address:		0x000f5650
  signature:			'_MP_'
  length:			16 bytes
  version:			1.1
  checksum:			0x00
  mode:				Virtual Wire

-------------------------------------------------------------------------------

MP Config Table Header:

  physical address:		0x0xf0c00
  signature:			'N   '
  base table length:		8224
  version:			1.32
  checksum:			0x20
  OEM ID:			'    : '
  Product ID:			'ny Key '
  OEM table pointer:		0x2031462d
  OEM table size:		17152
  entry count:			29300
  local APIC address:		0x32462d6c
  extended table length:	32
  extended table checksum:	67

-------------------------------------------------------------------------------

MP Config Base Table Entries:

--
MPTABLE HOSED! record type = 114


* Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-11  6:55       ` Ross Dickson
  2003-12-11 11:47         ` Ian Kumlien
@ 2003-12-11 15:15         ` Maciej W. Rozycki
  2003-12-11 16:23           ` Josh McKinney
  1 sibling, 1 reply; 35+ messages in thread
From: Maciej W. Rozycki @ 2003-12-11 15:15 UTC (permalink / raw)
  To: Ross Dickson, len.brown; +Cc: linux-kernel, AMartin, kernel, Ian Kumlien

On Thu, 11 Dec 2003, Ross Dickson wrote:

> ACPI: IOAPIC (id[0x02] address[0xfec00000] global_irq_base[0x0])
> IOAPIC[0]: Assigned apic_id 2
> IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, IRQ 0-23
> Bus #0 is ISA
> Int: type 3, pol 0, trig 0, bus 0, irq 0, 2-0

 I've browsed the relevant part of the ACPI spec and the above entry is 
incorrect.  It looks like INTIN0 is now the preferred line for the 8254 
timer; at least it is the default one when using ACPI tables.  This is a 
bug in Linux.

> ACPI: INT_SRC_OVR (bus[0] irq[0x0] global_irq[0x2] polarity[0x0] trigger[0x0])
> Int: type 0, pol 0, trig 0, bus 0, irq 0, 2-2

 Now this is an explicit entry stating the 8254 timer is connected to
INTIN2.  If this is not the case, the BIOS is buggy and the solution is to
fix it.  I don't consider it possible to be worked around in Linux except
maybe with a command line option added manually.

> ACPI: INT_SRC_OVR (bus[0] irq[0x9] global_irq[0x9] polarity[0x1] trigger[0x3])
> Int: type 0, pol 1, trig 3, bus 0, irq 9, 2-9

 And yet another explicit entry which has an effect on configuration as
reported below.
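
For reference, a small sketch of how the pol/trig pairs in these entries
read, assuming the two-bit encodings from the MP spec interrupt flags
(00 = conforms to bus, 01 = active high or edge, 10 = reserved, 11 = active
low or level) and the (trigger << 2) | polarity packing.  The function names
are illustrative only.

```c
/* Pack polarity and trigger into one flag word: polarity in bits 1:0,
 * trigger mode in bits 3:2 (assumed MP-spec style layout). */
unsigned pack_irqflag(unsigned polarity, unsigned trigger)
{
	return ((trigger & 3) << 2) | (polarity & 3);
}

/* Human-readable name for the polarity field of a packed flag word. */
const char *polarity_name(unsigned irqflag)
{
	static const char *const names[] = {
		"bus default", "active high", "reserved", "active low"
	};

	return names[irqflag & 3];
}

/* Human-readable name for the trigger field of a packed flag word. */
const char *trigger_name(unsigned irqflag)
{
	static const char *const names[] = {
		"bus default", "edge", "reserved", "level"
	};

	return names[(irqflag >> 2) & 3];
}
```

Under that encoding, the IRQ 9 override above (polarity 0x1, trigger 0x3)
reads as active high, level triggered, while the IRQ 0 override (both 0x0)
leaves polarity and trigger at the bus defaults.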

> init IO_APIC IRQs
>  IO-APIC (apicid-pin) 2-0, 2-16, 2-17, 2-18, 2-19, 2-20, 2-21, 2-22, 2-23 not connected.
> ..TIMER: vector=0x31 pin1=2 pin2=-1
> ..MP-BIOS bug: 8254 timer not connected to IO-APIC pin2

 As reported above, the BIOS explicitly reports the timer is there.

> ..TIMER: Is timer irq0 connected to IOAPIC Pin0? ...
> IOAPIC[0]: Set PCI routing entry (2-0 -> 0x31 -> IRQ 0 Mode:0 Active:0)
> ..TIMER check 8259 ints disabled, imr1:ff, imr2:ff
> ..TIMER: works OK on apic pin0 irq0

 And this may be correct if the default ACPI settings reflect the actual 
wiring of this board (but the BIOS says otherwise).

> IRQ to pin mappings:
> IRQ0 -> 0:2-> 0:0
[...]
> IRQ9 -> 0:9-> 0:9

 These two entries are wrong -- the interrupts are set up as if they were
connected to multiple I/O APIC inputs.  The first entry is a result of 
your hack, but the second one suggests a bug somewhere.

> Finally others working with kern 2.6  earlier trialled the following patch which may provide some
> more clues: 
> retrieved from:
> 
> http://www.kernel.org/pub/linux/kernel/people/bart/2.6.0-test11-bart1/broken-out/nforce2-apic.patch 
>  
> [x86] do not wrongly override mp_ExtINT IRQ

 That's a workaround to the bug in Linux I've mentioned earlier.  The bug
should be fixed instead.  The ACPI spec doesn't support mixed 
configurations, so ExtINT is irrelevant.

> Perhaps someone else could get mptable to run on their machine and send you
> the result.

 I wanted it to compare with the ACPI table and possibly to treat as a
reference for a workaround.  Since you have no valid MP-table, there's
nothing to do.

 Here's a patch that fixes a few bugs I've spotted browsing through our
ACPI code.  Please try it and report the result.  I don't have a system
with ACPI available, so I cannot verify the changes at all.

 The same bugs are present in 2.4 and I have a corresponding patch
available if someone wants to test the changes with that version.

  Maciej

-- 
+  Maciej W. Rozycki, Technical University of Gdansk, Poland   +
+--------------------------------------------------------------+
+        e-mail: macro@ds2.pg.gda.pl, PGP key available        +

patch-mips-2.6.0-test11-20031209-acpi-irq0-1
diff -up --recursive --new-file linux-mips-2.6.0-test11-20031209.macro/arch/i386/kernel/mpparse.c linux-mips-2.6.0-test11-20031209/arch/i386/kernel/mpparse.c
--- linux-mips-2.6.0-test11-20031209.macro/arch/i386/kernel/mpparse.c	2003-11-25 04:57:01.000000000 +0000
+++ linux-mips-2.6.0-test11-20031209/arch/i386/kernel/mpparse.c	2003-12-11 09:43:26.000000000 +0000
@@ -940,7 +940,7 @@ void __init mp_override_legacy_irq (
 	 *      erroneously sets the trigger to level, resulting in a HUGE 
 	 *      increase of timer interrupts!
 	 */
-	if ((bus_irq == 0) && (global_irq == 2) && (trigger == 3))
+	if ((bus_irq == 0) && (trigger == 3))
 		trigger = 1;

 	intsrc.mpc_type = MP_INTSRC;
@@ -961,7 +961,7 @@ void __init mp_override_legacy_irq (
 	 * Otherwise create a new entry (e.g. global_irq == 2).
 	 */
 	for (i = 0; i < mp_irq_entries; i++) {
-		if ((mp_irqs[i].mpc_dstapic == intsrc.mpc_dstapic) 
+		if ((mp_irqs[i].mpc_srcbus == intsrc.mpc_srcbus) 
 			&& (mp_irqs[i].mpc_srcbusirq == intsrc.mpc_srcbusirq)) {
 			mp_irqs[i] = intsrc;
 			found = 1;
@@ -1008,9 +1008,10 @@ void __init mp_config_acpi_legacy_irqs (
 	 */
 	for (i = 0; i < 16; i++) {

-		if (i == 2) continue;			/* Don't connect IRQ2 */
+		if (i == 2)
+			continue;			/* Don't connect IRQ2 */

-		intsrc.mpc_irqtype = i ? mp_INT : mp_ExtINT;   /* 8259A to #0 */
+		intsrc.mpc_irqtype = mp_INT;
 		intsrc.mpc_srcbusirq = i;		   /* Identity mapped */
 		intsrc.mpc_dstirq = i;
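
To illustrate the second hunk: a minimal sketch, assuming a simplified
table, of keying the lookup on the source (bus, IRQ) pair, so that an
override replaces any existing entry for the same ISA IRQ, whichever
I/O APIC that entry pointed at.  The struct and function names are
hypothetical stand-ins, not the kernel's definitions.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for an interrupt source entry. */
struct irq_src {
    unsigned char srcbus;     /* source bus ID (0 = ISA in this sketch) */
    unsigned char srcbusirq;  /* IRQ number on that bus */
    unsigned char dstapic;    /* destination I/O APIC ID */
    unsigned char dstirq;     /* destination INTIN pin */
};

/* Replace the entry whose source (bus, irq) matches the override,
 * or append the override if there is none.
 * Returns 0 on success, -1 if the table is full. */
int apply_override(struct irq_src *tbl, size_t *n, size_t cap,
                   struct irq_src ovr)
{
    size_t i;

    for (i = 0; i < *n; i++) {
        if (tbl[i].srcbus == ovr.srcbus &&
            tbl[i].srcbusirq == ovr.srcbusirq) {
            tbl[i] = ovr;             /* override in place */
            return 0;
        }
    }
    if (*n == cap)
        return -1;
    tbl[(*n)++] = ovr;                /* no prior entry: append */
    return 0;
}

/* Count how many entries route the given source IRQ somewhere. */
size_t pins_for_irq(const struct irq_src *tbl, size_t n,
                    unsigned char bus, unsigned char irq)
{
    size_t i, count = 0;

    for (i = 0; i < n; i++)
        if (tbl[i].srcbus == bus && tbl[i].srcbusirq == irq)
            count++;
    return count;
}
```

Applied to an identity-mapped table of ISA IRQs, an override of IRQ 0 to
pin 2 leaves IRQ 0 with exactly one destination pin.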

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-11 14:58           ` Jesse Allen
@ 2003-12-11 15:20             ` Craig Bradney
  2003-12-11 16:05               ` Jesse Allen
  0 siblings, 1 reply; 35+ messages in thread
From: Craig Bradney @ 2003-12-11 15:20 UTC (permalink / raw)
  To: Jesse Allen; +Cc: Ian Kumlien, linux-kernel, ross, macro

I'm not really sure what I'm looking at here, but since you are all
posting this information I thought it might help those who can use it to
have the mptable output from an Asus A7N8X Deluxe (v2.0, BIOS 1007) with
an Athlon XP 2600+.

===============================================================================

MPTable, version 2.0.15 Linux

-------------------------------------------------------------------------------

MP Floating Pointer Structure:

  location:                     BIOS
  physical address:             0x000f5ce0
  signature:                    '_MP_'
  length:                       16 bytes
  version:                      1.1
  checksum:                     0x00
  mode:                         Virtual Wire

-------------------------------------------------------------------------------

MP Config Table Header:

  physical address:             0x0xf0c00
  signature:                    '
'
  base table length:            65287
  version:                      1.255
  checksum:                     0x04
  OEM ID:                       ''
  Product ID:                   ''
  OEM table pointer:            0x00000704
  OEM table size:               15
  entry count:                  3896
  local APIC address:           0x00070500
  extended table length:        3584
  extended table checksum:      0

-------------------------------------------------------------------------------

MP Config Base Table Entries:

--
Processors:     APIC ID Version State           Family  Model   Step    Flags
                13       0x 0    AP, usable      3       0       0      0xff070600
                 0       0xff    BSP, usable     0       12      4      0x0001
--
MPTABLE HOSED! record type = 53



Craig


On Thu, 2003-12-11 at 15:58, Jesse Allen wrote:
> My mptable output looks pretty weird.  (Product ID "ny Key "?)
> It doesn't even compare to the other two.  I have a shuttle AN35N.
> 
> 
> ===============================================================================
> 
> MPTable, version 2.0.15 Linux
> 
> -------------------------------------------------------------------------------
> 
> MP Floating Pointer Structure:
> 
>   location:			BIOS
>   physical address:		0x000f5650
>   signature:			'_MP_'
>   length:			16 bytes
>   version:			1.1
>   checksum:			0x00
>   mode:				Virtual Wire
> 
> -------------------------------------------------------------------------------
> 
> MP Config Table Header:
> 
>   physical address:		0x0xf0c00
>   signature:			'N   '
>   base table length:		8224
>   version:			1.32
>   checksum:			0x20
>   OEM ID:			'    : '
>   Product ID:			'ny Key '
>   OEM table pointer:		0x2031462d
>   OEM table size:		17152
>   entry count:			29300
>   local APIC address:		0x32462d6c
>   extended table length:	32
>   extended table checksum:	67
> 
> -------------------------------------------------------------------------------
> 
> MP Config Base Table Entries:
> 
> --
> MPTABLE HOSED! record type = 114
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-11 15:20             ` Craig Bradney
@ 2003-12-11 16:05               ` Jesse Allen
  0 siblings, 0 replies; 35+ messages in thread
From: Jesse Allen @ 2003-12-11 16:05 UTC (permalink / raw)
  To: Craig Bradney; +Cc: linux-kernel

On Thu, Dec 11, 2003 at 04:20:58PM +0100, Craig Bradney wrote:
> Not really sure what I'm looking at here but as you guys are showing
> this information I thought it might be helpful for those that can use it
> to have the information run on a Asus A7N8X Deluxe (v2.0 bios 1007) with
> Athlon XP 2600+. 
> 

Unfortunately, it looks as if all our MP tables are invalid, so I don't
think we can use them.  I thought mine was especially weird because the
Product ID seems to point into a "Press Any Key" string, which confirms it.
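
For reference, a minimal sketch of the validity check, assuming the MP
specification layout: the configuration table must begin with the
signature "PCMP", and all bytes of the base table must sum to zero
modulo 256.  The function name is hypothetical, and this is not the code
mptable itself uses.

```c
#include <stddef.h>
#include <string.h>

#define MP_HEADER_LEN 44u   /* size of the fixed table header in bytes */

/* Returns 1 if buf holds a plausible MP configuration table:
 * "PCMP" signature, sane base table length, zero byte checksum.
 * Returns 0 otherwise. */
int mp_config_table_valid(const unsigned char *buf, size_t avail)
{
    size_t len, i;
    unsigned char sum = 0;

    if (avail < MP_HEADER_LEN)
        return 0;
    if (memcmp(buf, "PCMP", 4) != 0)
        return 0;

    /* Base table length: 16-bit little-endian at offset 4. */
    len = (size_t)buf[4] | ((size_t)buf[5] << 8);
    if (len < MP_HEADER_LEN || len > avail)
        return 0;

    /* Every byte of the base table, checksum byte included,
     * must add up to zero modulo 256. */
    for (i = 0; i < len; i++)
        sum = (unsigned char)(sum + buf[i]);
    return sum == 0;
}
```

A header whose signature reads 'N   ', as in the Shuttle AN35N output,
fails the first test before the checksum is even looked at.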

Jesse

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-11 15:15         ` Maciej W. Rozycki
@ 2003-12-11 16:23           ` Josh McKinney
  2003-12-11 17:04             ` Maciej W. Rozycki
  0 siblings, 1 reply; 35+ messages in thread
From: Josh McKinney @ 2003-12-11 16:23 UTC (permalink / raw)
  To: linux-kernel

Trying to get a grasp on all the fixes floating around.  I have
been running the first "timer" patch, the two liner to mpparse.c, for
about five days until I made it crash by catting 4 drives to
/dev/null.  It crashed after I turned on disconnect with athcool, so
that may be related, because I could crash it with disconnect off.
Now I am running both of Ross's patches for 2.6 for just 10 hours, but
disconnect is still enabled, so far so good.  

So the consensus seems to be that Ross's timer patch and the
disconnect OR delay ACK patch is the mostly *correct* fix?  As of
right now I am compiling kernels with the disconnect patch and Ross's
timer patch, and one with those fixes and Maciej's acpi fixes below.
Should I try it with just the acpi fixes sent by Maciej or are these
just general fixes?

I also tried running mptable, but the output is "hosed".

Thanks
    
On approximately Thu, Dec 11, 2003 at 04:15:28PM +0100, Maciej W. Rozycki wrote:
> On Thu, 11 Dec 2003, Ross Dickson wrote:
> 
> > ACPI: IOAPIC (id[0x02] address[0xfec00000] global_irq_base[0x0])
> > IOAPIC[0]: Assigned apic_id 2
> > IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, IRQ 0-23
> > Bus #0 is ISA
> > Int: type 3, pol 0, trig 0, bus 0, irq 0, 2-0
> 
>  I've browsed the relevant part of the ACPI spec and the above entry is 
> incorrect.  It looks like INTIN0 is now the preferred line for the 8254 
> timer; at least it is the default one when using ACPI tables.  This is a 
> bug in Linux.
> 
> > ACPI: INT_SRC_OVR (bus[0] irq[0x0] global_irq[0x2] polarity[0x0] trigger[0x0])
> > Int: type 0, pol 0, trig 0, bus 0, irq 0, 2-2
> 
>  Now this is an explicit entry stating the 8254 timer is connected to
> INTIN2.  If this is not the case, the BIOS is buggy and the solution is to
> fix it.  I don't consider it possible to be worked around in Linux except
> maybe with a command line option added manually.
> 
> > ACPI: INT_SRC_OVR (bus[0] irq[0x9] global_irq[0x9] polarity[0x1] trigger[0x3])
> > Int: type 0, pol 1, trig 3, bus 0, irq 9, 2-9
> 
>  And yet another explicit entry which has an effect on configuration as
> reported below.
> 
> > init IO_APIC IRQs
> >  IO-APIC (apicid-pin) 2-0, 2-16, 2-17, 2-18, 2-19, 2-20, 2-21, 2-22, 2-23 not connected.
> > ..TIMER: vector=0x31 pin1=2 pin2=-1
> > ..MP-BIOS bug: 8254 timer not connected to IO-APIC pin2
> 
>  As reported above, the BIOS explicitly reports the timer is there.
> 
> > ..TIMER: Is timer irq0 connected to IOAPIC Pin0? ...
> > IOAPIC[0]: Set PCI routing entry (2-0 -> 0x31 -> IRQ 0 Mode:0 Active:0)
> > ..TIMER check 8259 ints disabled, imr1:ff, imr2:ff
> > ..TIMER: works OK on apic pin0 irq0
> 
>  And this may be correct if the default ACPI settings reflect the actual 
> wiring of this board (but the BIOS says otherwise).
> 
> > IRQ to pin mappings:
> > IRQ0 -> 0:2-> 0:0
> [...]
> > IRQ9 -> 0:9-> 0:9
> 
>  These two entries are wrong -- the interrupts are set up as if they were
> connected to multiple I/O APIC inputs.  The first entry is a result of 
> your hack, but the second one suggests a bug somewhere.
> 
> > Finally others working with kern 2.6  earlier trialled the following patch which may provide some
> > more clues: 
> > retrieved from:
> > 
> > http://www.kernel.org/pub/linux/kernel/people/bart/2.6.0-test11-bart1/broken-out/nforce2-apic.patch 
> >  
> > [x86] do not wrongly override mp_ExtINT IRQ
> 
>  That's a workaround to the bug in Linux I've mentioned earlier.  The bug
> should be fixed instead.  The ACPI spec doesn't support mixed 
> configurations, so ExtINT is irrelevant.
> 
> > Perhaps someone else could get mptable to run on their machine and send you
> > the result.
> 
>  I wanted it to compare with the ACPI table and possibly to treat as a
> reference for a workaround.  Since you have no valid MP-table, there's
> nothing to do.
> 
>  Here's a patch that fixes a few bugs I've spotted browsing through our
> ACPI code.  Please try it and report the result.  I don't have a system
> with ACPI available, so I cannot verify the changes at all.
> 
>  The same bugs are present in 2.4 and I have a corresponding patch
> available if some wants to test the changes with that version.
> 
>   Maciej
> 
> -- 
> +  Maciej W. Rozycki, Technical University of Gdansk, Poland   +
> +--------------------------------------------------------------+
> +        e-mail: macro@ds2.pg.gda.pl, PGP key available        +
> 
> patch-mips-2.6.0-test11-20031209-acpi-irq0-1
> diff -up --recursive --new-file linux-mips-2.6.0-test11-20031209.macro/arch/i386/kernel/mpparse.c linux-mips-2.6.0-test11-20031209/arch/i386/kernel/mpparse.c
> --- linux-mips-2.6.0-test11-20031209.macro/arch/i386/kernel/mpparse.c	2003-11-25 04:57:01.000000000 +0000
> +++ linux-mips-2.6.0-test11-20031209/arch/i386/kernel/mpparse.c	2003-12-11 09:43:26.000000000 +0000
> @@ -940,7 +940,7 @@ void __init mp_override_legacy_irq (
>  	 *      erroneously sets the trigger to level, resulting in a HUGE 
>  	 *      increase of timer interrupts!
>  	 */
> -	if ((bus_irq == 0) && (global_irq == 2) && (trigger == 3))
> +	if ((bus_irq == 0) && (trigger == 3))
>  		trigger = 1;
>  
>  	intsrc.mpc_type = MP_INTSRC;
> @@ -961,7 +961,7 @@ void __init mp_override_legacy_irq (
>  	 * Otherwise create a new entry (e.g. global_irq == 2).
>  	 */
>  	for (i = 0; i < mp_irq_entries; i++) {
> -		if ((mp_irqs[i].mpc_dstapic == intsrc.mpc_dstapic) 
> +		if ((mp_irqs[i].mpc_srcbus == intsrc.mpc_srcbus) 
>  			&& (mp_irqs[i].mpc_srcbusirq == intsrc.mpc_srcbusirq)) {
>  			mp_irqs[i] = intsrc;
>  			found = 1;
> @@ -1008,9 +1008,10 @@ void __init mp_config_acpi_legacy_irqs (
>  	 */
>  	for (i = 0; i < 16; i++) {
>  
> -		if (i == 2) continue;			/* Don't connect IRQ2 */
> +		if (i == 2)
> +			continue;			/* Don't connect IRQ2 */
>  
> -		intsrc.mpc_irqtype = i ? mp_INT : mp_ExtINT;   /* 8259A to #0 */
> +		intsrc.mpc_irqtype = mp_INT;
>  		intsrc.mpc_srcbusirq = i;		   /* Identity mapped */
>  		intsrc.mpc_dstirq = i;
>  

-- 
Josh McKinney		     |	Webmaster: http://joshandangie.org
--------------------------------------------------------------------------
                             | They that can give up essential liberty
Linux, the choice       -o)  | to obtain a little temporary safety deserve 
of the GNU generation    /\  | neither liberty or safety. 
                        _\_v |                          -Benjamin Franklin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-11 16:23           ` Josh McKinney
@ 2003-12-11 17:04             ` Maciej W. Rozycki
  2003-12-11 17:25               ` Jesse Allen
  0 siblings, 1 reply; 35+ messages in thread
From: Maciej W. Rozycki @ 2003-12-11 17:04 UTC (permalink / raw)
  To: Josh McKinney; +Cc: linux-kernel

On Thu, 11 Dec 2003, Josh McKinney wrote:

> Should I try it with just the acpi fixes sent by Maciej or are these
> just general fixes?

 They should make (at least some of) the reported problems go away,
superseding the respective workarounds.

-- 
+  Maciej W. Rozycki, Technical University of Gdansk, Poland   +
+--------------------------------------------------------------+
+        e-mail: macro@ds2.pg.gda.pl, PGP key available        +

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-11 17:04             ` Maciej W. Rozycki
@ 2003-12-11 17:25               ` Jesse Allen
  0 siblings, 0 replies; 35+ messages in thread
From: Jesse Allen @ 2003-12-11 17:25 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: linux-kernel

On Thu, Dec 11, 2003 at 06:04:54PM +0100, Maciej W. Rozycki wrote:
> On Thu, 11 Dec 2003, Josh McKinney wrote:
> 
> > Should I try it with just the acpi fixes sent by Maciej or are these
> > just general fixes?
> 
>  They should make (at least some of) the reported problems go away,
> superseding the respective workarounds.
> 

As far as I can tell, your patch _alone_ doesn't prevent the lockup, fix the timer, or nmi_watchdog.  I have attached a dmesg of my current running kernel that includes Ross' io_apic patch, the disconnect quirk patch, your acpi patch, and other minor patches.  ACPI and APIC debugging are on.


Linux version 2.6.0-test11 (jesse@tesore) (gcc version 3.3.2) #2 Thu Dec 11 09:45:15 MST 2003
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 00000000000a0000 (usable)
 BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000000fff0000 (usable)
 BIOS-e820: 000000000fff0000 - 000000000fff3000 (ACPI NVS)
 BIOS-e820: 000000000fff3000 - 0000000010000000 (ACPI data)
 BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)
255MB LOWMEM available.
On node 0 totalpages: 65520
  DMA zone: 4096 pages, LIFO batch:1
  Normal zone: 61424 pages, LIFO batch:14
  HighMem zone: 0 pages, LIFO batch:1
DMI 2.2 present.
ACPI: RSDP (v000 Nvidia                                    ) @ 0x000f6f60
ACPI: RSDT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x0fff3000
ACPI: FADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x0fff3040
ACPI: MADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x0fff7880
ACPI: DSDT (v001 NVIDIA AWRDACPI 0x00001000 MSFT 0x0100000e) @ 0x00000000
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 6:10 APIC version 16
ACPI: LAPIC_NMI (acpi_id[0x00] polarity[0x1] trigger[0x1] lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec00000] global_irq_base[0x0])
IOAPIC[0]: Assigned apic_id 2
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, IRQ 0-23
ACPI: INT_SRC_OVR (bus[0] irq[0x0] global_irq[0x2] polarity[0x0] trigger[0x0])
ACPI: INT_SRC_OVR (bus[0] irq[0x9] global_irq[0x9] polarity[0x1] trigger[0x3])
ACPI: INT_SRC_OVR (bus[0] irq[0xe] global_irq[0xe] polarity[0x1] trigger[0x1])
ACPI: INT_SRC_OVR (bus[0] irq[0xf] global_irq[0xf] polarity[0x1] trigger[0x1])
Enabling APIC mode:  Flat.  Using 1 I/O APICs
Using ACPI (MADT) for SMP configuration information
Building zonelist for node : 0
Kernel command line: BOOT_IMAGE=Linux-2.6 ro root=301
Initializing CPU#0
PID hash table entries: 1024 (order 10: 8192 bytes)
Detected 1913.621 MHz processor.
Console: colour VGA+ 80x25
Memory: 256144k/262080k available (1611k kernel code, 5212k reserved, 693k data, 128k init, 0k highmem)
Calibrating delay loop... 3784.70 BogoMIPS
Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
CPU:     After generic identify, caps: 0383fbff c1c3fbff 00000000 00000000
CPU:     After vendor identify, caps: 0383fbff c1c3fbff 00000000 00000000
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU:     After all inits, caps: 0383fbff c1c3fbff 00000000 00000020
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU: AMD Athlon(tm) XP 2600+ stepping 00
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
enabled ExtINT on CPU#0
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
ENABLING IO-APIC IRQs
init IO_APIC IRQs
 IO-APIC (apicid-pin) 2-0, 2-16, 2-17, 2-18, 2-19, 2-20, 2-21, 2-22, 2-23 not connected.
..TIMER: vector=0x31 pin1=2 pin2=-1
..MP-BIOS bug: 8254 timer not connected to IO-APIC
..TIMER: Is timer irq0 connected to IOAPIC Pin0? ...
IOAPIC[0]: Set PCI routing entry (2-0 -> 0x31 -> IRQ 0 Mode:0 Active:0)
..TIMER: works OK on apic pin0 irq0
number of MP IRQ sources: 15.
number of IO-APIC #2 registers: 24.
testing the IO APIC.......................
IO APIC #2......
.... register #00: 02000000
.......    : physical APIC id: 02
.......    : Delivery Type: 0
.......    : LTS          : 0
.... register #01: 00170011
.......     : max redirection entries: 0017
.......     : PRQ implemented: 0
.......     : IO APIC version: 0011
.... register #02: 00000000
.......     : arbitration: 00
.... IRQ redirection table:
 NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:   
 00 001 01  0    0    0   0   0    1    1    31
 01 001 01  0    0    0   0   0    1    1    39
 02 000 00  0    0    0   0   0    0    0    00
 03 001 01  0    0    0   0   0    1    1    41
 04 001 01  0    0    0   0   0    1    1    49
 05 001 01  0    0    0   0   0    1    1    51
 06 001 01  0    0    0   0   0    1    1    59
 07 001 01  0    0    0   0   0    1    1    61
 08 001 01  0    0    0   0   0    1    1    69
 09 001 01  1    1    0   0   0    1    1    71
 0a 001 01  0    0    0   0   0    1    1    79
 0b 001 01  0    0    0   0   0    1    1    81
 0c 001 01  0    0    0   0   0    1    1    89
 0d 001 01  0    0    0   0   0    1    1    91
 0e 001 01  0    0    0   0   0    1    1    99
 0f 001 01  0    0    0   0   0    1    1    A1
 10 000 00  1    0    0   0   0    0    0    00
 11 000 00  1    0    0   0   0    0    0    00
 12 000 00  1    0    0   0   0    0    0    00
 13 000 00  1    0    0   0   0    0    0    00
 14 000 00  1    0    0   0   0    0    0    00
 15 000 00  1    0    0   0   0    0    0    00
 16 000 00  1    0    0   0   0    0    0    00
 17 000 00  1    0    0   0   0    0    0    00
IRQ to pin mappings:
IRQ0 -> 0:2-> 0:0
IRQ1 -> 0:1
IRQ3 -> 0:3
IRQ4 -> 0:4
IRQ5 -> 0:5
IRQ6 -> 0:6
IRQ7 -> 0:7
IRQ8 -> 0:8
IRQ9 -> 0:9
IRQ10 -> 0:10
IRQ11 -> 0:11
IRQ12 -> 0:12
IRQ13 -> 0:13
IRQ14 -> 0:14
IRQ15 -> 0:15
.................................... done.
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 1912.0861 MHz.
..... host bus clock speed is 332.0671 MHz.
NET: Registered protocol family 16
PCI: PCI BIOS revision 2.10 entry at 0xfb590, last bus=2
PCI: Using configuration type 1
mtrr: v2.0 (20020519)
ACPI: Subsystem revision 20031002
 tbxface-0117 [03] acpi_load_tables      : ACPI Tables successfully acquired
Parsing all Control Methods:........................................................................................................................................................................................................................................................................................
Table [DSDT](id F004) - 761 Objects with 78 Devices 280 Methods 30 Regions
ACPI Namespace successfully loaded at root c0378d3c
IOAPIC[0]: Set PCI routing entry (2-9 -> 0x71 -> IRQ 9 Mode:1 Active:0)
evxfevnt-0093 [04] acpi_enable           : Transition to ACPI mode successful
evgpeblk-0748 [06] ev_create_gpe_block   : GPE 00 to 31 [_GPE] 4 regs at 0000000000004020 on int 9
evgpeblk-0748 [06] ev_create_gpe_block   : GPE 32 to 95 [_GPE] 8 regs at 00000000000044A0 on int 9
Completing Region/Field/Buffer/Package initialization:.................................................................................................
Initialized 30/30 Regions 9/9 Fields 31/31 Buffers 27/27 Packages (769 nodes)
Executing all Device _STA and_INI methods:...............................................................................
79 Devices found containing: 79 _STA, 2 _INI methods
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (00:00)
PCI: Probing PCI hardware (bus 00)
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.HUB0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.AGPB._PRT]
ACPI: PCI Interrupt Link [LNK1] (IRQs 3 4 5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNK2] (IRQs 3 4 5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNK3] (IRQs 3 4 5 6 7 10 11 *12 14 15)
ACPI: PCI Interrupt Link [LNK4] (IRQs 3 4 *5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNK5] (IRQs 3 4 5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LUBA] (IRQs 3 4 5 6 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LUBB] (IRQs 3 4 *5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LMAC] (IRQs 3 4 5 6 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LAPU] (IRQs 3 4 5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LACI] (IRQs 3 4 5 6 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LMCI] (IRQs 3 4 5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LSMB] (IRQs 3 4 5 6 7 10 11 *12 14 15)
ACPI: PCI Interrupt Link [LUB2] (IRQs 3 4 5 6 7 10 11 *12 14 15)
ACPI: PCI Interrupt Link [LFIR] (IRQs 3 4 5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [L3CM] (IRQs 3 4 5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LIDE] (IRQs 3 4 5 6 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [APC1] (IRQs 16)
ACPI: PCI Interrupt Link [APC2] (IRQs 17)
ACPI: PCI Interrupt Link [APC3] (IRQs *18)
ACPI: PCI Interrupt Link [APC4] (IRQs *19)
ACPI: PCI Interrupt Link [APC5] (IRQs 16)
pci_link-0262 [40] acpi_pci_link_get_curr: No IRQ resource found
ACPI: PCI Interrupt Link [APCF] (IRQs 20 21 22)
pci_link-0262 [42] acpi_pci_link_get_curr: No IRQ resource found
ACPI: PCI Interrupt Link [APCG] (IRQs 20 21 22)
pci_link-0262 [44] acpi_pci_link_get_curr: No IRQ resource found
ACPI: PCI Interrupt Link [APCH] (IRQs 20 21 22)
ACPI: PCI Interrupt Link [APCI] (IRQs 20 21 22)
pci_link-0262 [47] acpi_pci_link_get_curr: No IRQ resource found
ACPI: PCI Interrupt Link [APCJ] (IRQs 20 21 22)
ACPI: PCI Interrupt Link [APCK] (IRQs 20 21 22)
ACPI: PCI Interrupt Link [APCS] (IRQs *23)
pci_link-0262 [52] acpi_pci_link_get_curr: No IRQ resource found
ACPI: PCI Interrupt Link [APCL] (IRQs 20 21 22)
ACPI: PCI Interrupt Link [APCM] (IRQs 20 21 22)
ACPI: PCI Interrupt Link [AP3C] (IRQs 20 21 22)
ACPI: PCI Interrupt Link [APCZ] (IRQs 20 21 22)
ACPI: PCI Interrupt Link [APCS] enabled at IRQ 23
IOAPIC[0]: Set PCI routing entry (2-23 -> 0xa9 -> IRQ 23 Mode:1 Active:0)
00:00:01[A] -> 2-23 -> IRQ 23
Pin 2-23 already programmed
ACPI: PCI Interrupt Link [APCF] enabled at IRQ 20
IOAPIC[0]: Set PCI routing entry (2-20 -> 0xb1 -> IRQ 20 Mode:1 Active:0)
00:00:02[A] -> 2-20 -> IRQ 20
ACPI: PCI Interrupt Link [APCG] enabled at IRQ 22
IOAPIC[0]: Set PCI routing entry (2-22 -> 0xb9 -> IRQ 22 Mode:1 Active:0)
00:00:02[B] -> 2-22 -> IRQ 22
ACPI: PCI Interrupt Link [APCL] enabled at IRQ 21
IOAPIC[0]: Set PCI routing entry (2-21 -> 0xc1 -> IRQ 21 Mode:1 Active:0)
00:00:02[C] -> 2-21 -> IRQ 21
ACPI: PCI Interrupt Link [APCH] enabled at IRQ 20
Pin 2-20 already programmed
ACPI: PCI Interrupt Link [APCI] enabled at IRQ 22
Pin 2-22 already programmed
ACPI: PCI Interrupt Link [APCJ] enabled at IRQ 21
Pin 2-21 already programmed
ACPI: PCI Interrupt Link [APCK] enabled at IRQ 20
Pin 2-20 already programmed
ACPI: PCI Interrupt Link [APCM] enabled at IRQ 22
Pin 2-22 already programmed
ACPI: PCI Interrupt Link [APCZ] enabled at IRQ 21
Pin 2-21 already programmed
ACPI: PCI Interrupt Link [APC1] enabled at IRQ 16
IOAPIC[0]: Set PCI routing entry (2-16 -> 0xc9 -> IRQ 16 Mode:1 Active:0)
00:01:08[A] -> 2-16 -> IRQ 16
ACPI: PCI Interrupt Link [APC2] enabled at IRQ 17
IOAPIC[0]: Set PCI routing entry (2-17 -> 0xd1 -> IRQ 17 Mode:1 Active:0)
00:01:08[B] -> 2-17 -> IRQ 17
ACPI: PCI Interrupt Link [APC3] enabled at IRQ 18
IOAPIC[0]: Set PCI routing entry (2-18 -> 0xd9 -> IRQ 18 Mode:1 Active:0)
00:01:08[C] -> 2-18 -> IRQ 18
ACPI: PCI Interrupt Link [APC4] enabled at IRQ 19
IOAPIC[0]: Set PCI routing entry (2-19 -> 0xe1 -> IRQ 19 Mode:1 Active:0)
00:01:08[D] -> 2-19 -> IRQ 19
Pin 2-17 already programmed
Pin 2-18 already programmed
Pin 2-19 already programmed
Pin 2-16 already programmed
Pin 2-18 already programmed
Pin 2-19 already programmed
Pin 2-16 already programmed
Pin 2-17 already programmed
Pin 2-19 already programmed
Pin 2-16 already programmed
Pin 2-17 already programmed
Pin 2-18 already programmed
Pin 2-16 already programmed
Pin 2-17 already programmed
Pin 2-18 already programmed
Pin 2-19 already programmed
Pin 2-19 already programmed
PCI: Using ACPI for IRQ routing
PCI: if you experience problems, try using option 'pci=noacpi' or even 'acpi=off'
Machine check exception polling timer started.
ACPI: Power Button (FF) [PWRF]
ACPI: Sleep Button (CM) [SLPB]
ACPI: Fan [FAN] (on)
ACPI: Processor [CPU0] (supports C1)
ACPI: Thermal Zone [THRM] (38 C)
pty: 256 Unix98 ptys configured
Real Time Clock Driver v1.12
Serial: 8250/16550 driver $Revision: 1.90 $ 8 ports, IRQ sharing disabled
ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
Using anticipatory io scheduler
Floppy drive(s): fd0 is 1.44M
FDC 0 is a post-1991 82077
loop: loaded (max 8 devices)
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
NFORCE2: IDE controller at PCI slot 0000:00:09.0
NFORCE2: chipset revision 162
NFORCE2: not 100% native mode: will probe irqs later
NFORCE2: BIOS didn't set cable bits correctly. Enabling workaround.
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
NFORCE2: 0000:00:09.0 (rev a2) UDMA133 controller
    ide0: BM-DMA at 0xf000-0xf007, BIOS settings: hda:DMA, hdb:DMA
    ide1: BM-DMA at 0xf008-0xf00f, BIOS settings: hdc:DMA, hdd:DMA
hda: WDC WD200BB-00DEA0, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hdc: MATSHITA CR-585, ATAPI CD/DVD-ROM drive
ide1 at 0x170-0x177,0x376 on irq 15
hda: max request size: 128KiB
hda: 39102336 sectors (20020 MB) w/2048KiB Cache, CHS=38792/16/63, UDMA(100)
 hda: hda1 hda2 hda3
hdc: ATAPI 24X CD-ROM drive, 128kB Cache, DMA
Uniform CD-ROM driver Revision: 3.12
mice: PS/2 mouse device common for all mice
input: PC Speaker
serio: i8042 AUX port at 0x60,0x64 irq 12
input: AT Translated Set 2 keyboard on isa0060/serio0
serio: i8042 KBD port at 0x60,0x64 irq 1
NET: Registered protocol family 2
IP: routing cache hash table of 2048 buckets, 16Kbytes
TCP: Hash tables configured (established 16384 bind 32768)
NET: Registered protocol family 1
NET: Registered protocol family 17
found reiserfs format "3.6" with standard journal
Reiserfs journal params: device hda1, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
reiserfs: checking transaction log (hda1) for (hda1)
Using r5 hash to sort names
VFS: Mounted root (reiserfs filesystem) readonly.
Freeing unused kernel memory: 128k freed
Adding 377516k swap on /dev/hda2.  Priority:-1 extents:1
found reiserfs format "3.6" with standard journal
Reiserfs journal params: device hda3, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
reiserfs: checking transaction log (hda3) for (hda3)
Using r5 hash to sort names
Linux agpgart interface v0.100 (c) Dave Jones
8139too Fast Ethernet driver 0.9.27
eth0: RealTek RTL8139 at 0xd09a1000, 00:40:c7:77:0a:d5, IRQ 18
eth0:  Identified 8139 chip type 'RTL-8139 rev K'
eth0: link up, 100Mbps, full-duplex, lpa 0x45E1
agpgart: Detected NVIDIA nForce2 chipset
agpgart: Maximum main memory to use for agp memory: 203M
agpgart: AGP aperture is 64M @ 0xe8000000
PCI: Setting latency timer of device 0000:00:06.0 to 64
intel8x0: clocking to 47451
/home/jesse/linux/drivers/usb/core/usb.c: registered new driver hub
ohci_hcd: 2003 Oct 13 USB 1.1 'Open' Host Controller (OHCI) Driver (PCI)
ohci_hcd: block sizes: ed 64 td 64
ohci_hcd 0000:00:02.0: OHCI Host Controller
PCI: Setting latency timer of device 0000:00:02.0 to 64
ohci_hcd 0000:00:02.0: irq 20, pci mem d0a17000
ohci_hcd 0000:00:02.0: new USB bus registered, assigned bus number 1
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 3 ports detected
ohci_hcd 0000:00:02.1: OHCI Host Controller
PCI: Setting latency timer of device 0000:00:02.1 to 64
ohci_hcd 0000:00:02.1: irq 22, pci mem d0a20000
ohci_hcd 0000:00:02.1: new USB bus registered, assigned bus number 2
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 3 ports detected
ehci_hcd 0000:00:02.2: EHCI Host Controller
PCI: Setting latency timer of device 0000:00:02.2 to 64
ehci_hcd 0000:00:02.2: irq 21, pci mem d0a2e000
ehci_hcd 0000:00:02.2: new USB bus registered, assigned bus number 3
PCI: cache line size of 64 is not supported by device 0000:00:02.2
ehci_hcd 0000:00:02.2: USB 2.0 enabled, EHCI 1.00, driver 2003-Jun-13
hub 3-0:1.0: USB hub found
hub 3-0:1.0: 6 ports detected
parport0: PC-style at 0x378, irq 7 [PCSPP,TRISTATE]
parport0: cpp_daisy: aa5500ff(38)
parport0: assign_addrs: aa5500ff(38)
parport0: cpp_daisy: aa5500ff(38)
parport0: assign_addrs: aa5500ff(38)
i2c_adapter i2c-0: nForce2 SMBus adapter at 0x5000
i2c_adapter i2c-1: nForce2 SMBus adapter at 0x5100


* Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-11  9:12           ` Ross Dickson
@ 2003-12-11 17:52             ` Ian Kumlien
  2003-12-11 18:21               ` Jesse Allen
  0 siblings, 1 reply; 35+ messages in thread
From: Ian Kumlien @ 2003-12-11 17:52 UTC (permalink / raw)
  To: ross; +Cc: macro, linux-kernel, AMartin, kernel


On Thu, 2003-12-11 at 10:12, Ross Dickson wrote:
> On Thursday 11 December 2003 21:47, Ian Kumlien wrote:
> Thanks Ian
> 
> Also many thanks for pointing out the relevant section to look in with the AMD
> cpu link that you sent - Credit where credit is due (assuming we are both on the
> right track).

Heh, thanks, feels nice to have someone who agrees with you =).

> I had a read and refined your surmisings. I think the 
> problem appears synchronous with the apic timer because of two reasons.
> 1) any apic irq can cause re-connection of the system bus after disconnect.
> 2) the apic timer irq in my examinations has the shortest path to an ack.

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24416.pdf
Pages 42 and 94 might help as well. I haven't grasped it all or had any
food yet but I hope I'm right =)

> I also had a look back through the athlon cooler and power management 
> postings and web site articles. I was blissfully ignorant of these issues when I
> started and now I wonder what I have stepped into... Yuk

Heh, yeah, the need for disconnect is somewhat dodgy, I haven't read up
on the rest.

> I submitted a support request to AMD, apologies for not cc'ing you, I kept
> the cc's down to just nvidia and the mailing list. If you have not seen it yet
> then it is here

Thanks

> We hope....

Yup...

-- 
Ian Kumlien <pomac () vapor ! com> -- http://pomac.netswarm.net



* Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-11 17:52             ` Ian Kumlien
@ 2003-12-11 18:21               ` Jesse Allen
  2003-12-12  9:27                 ` Bob
  0 siblings, 1 reply; 35+ messages in thread
From: Jesse Allen @ 2003-12-11 18:21 UTC (permalink / raw)
  To: Ian Kumlien; +Cc: linux-kernel

On Thu, Dec 11, 2003 at 06:52:41PM +0100, Ian Kumlien wrote:
> Heh, yeah, the need for disconnect is somewhat dodgy, i haven't read up
> on the rest.
> 

Hmm, weird.  I went to go look at the Shuttle motherboard maker's site - maybe so that I can bug them for a bios disconnect option - but I checked for a bios update first.  And sure enough like they read my mind, just posted online today, an update.  Here are the details of fixes:

" Checksum:   8B00H                         Date Code: 12/05/03
1.Support 0.18 micron AMD Duron (Palomino) CPU.
2.Add C1 disconnect item."

It's almost as if they're reading this list.  This disconnect problem was discovered on the 5th (well the 5th in my timezone).  Perhaps they're aware of this issue...  I'm gonna talk to them.

Jesse


* Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-11 18:21               ` Jesse Allen
@ 2003-12-12  9:27                 ` Bob
  2003-12-12 16:59                   ` Working nforce2, was " Jesse Allen
  0 siblings, 1 reply; 35+ messages in thread
From: Bob @ 2003-12-12  9:27 UTC (permalink / raw)
  To: linux-kernel

Jesse Allen wrote:

>On Thu, Dec 11, 2003 at 06:52:41PM +0100, Ian Kumlien wrote:
>  
>
>>Heh, yeah, the need for disconnect is somewhat dodgy, i haven't read up
>>on the rest.
>>
>Hmm, weird.  I went to go look at the Shuttle motherboard maker's site - maybe so that I can bug them for a bios disconnect option - but I checked for a bios update first.  And sure enough like they read my mind, just posted online today, an update.  Here are the details of fixes:
>
>" Checksum:   8B00H                         Date Code: 12/05/03
>1.Support 0.18 micron AMD Duron (Palomino) CPU.
>2.Add C1 disconnect item."
>
>It's almost as if they're reading this list.  This disconnect problem was discovered on the 5th (well the 5th in my timezone).  Perhaps they're aware of this issue...  I'm gonna talk to them.
>
>Jesse
>
A bios update for MSI K7N2 MCP2-T nforce2 board
fixed the crashing BEFORE these patches were developed,
but there was no documentation that would relate or explain.

http://www.msi.com.tw/program/support/bios/bos/spt_bos_detail.php?UID=436&kind=1
http://download.msi.com.tw/support/bos_exe/6570v76.exe

Award 7.6 at the top of the list. Maybe somebody can figure
out what they're doing.

Nvidia X driver for ti4200 agp8 still locks up linux though,
but X nv works fine. agp8 3d may expose the timer issue.

-Bob


* Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-12  9:27                 ` Bob
@ 2003-12-12 16:59                   ` Jesse Allen
  2003-12-12 17:18                     ` Jesse Allen
                                       ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Jesse Allen @ 2003-12-12 16:59 UTC (permalink / raw)
  To: linux-kernel

On Fri, Dec 12, 2003 at 04:27:59AM -0500, Bob wrote:
> Jesse Allen wrote:
> 
> >On Thu, Dec 11, 2003 at 06:52:41PM +0100, Ian Kumlien wrote:
> > 
> >
> >>Heh, yeah, the need for disconnect is somewhat dodgy, i haven't read up
> >>on the rest.
> >>
> >Hmm, weird.  I went to go look at the Shuttle motherboard maker's site - 
> >maybe so that I can bug them for a bios disconnect option - but I checked 
> >for a bios update first.  And sure enough like they read my mind, just 
> >posted online today, an update.  Here are the details of fixes:
> >
> >" Checksum:   8B00H                         Date Code: 12/05/03
> >1.Support 0.18 micron AMD Duron (Palomino) CPU.
> >2.Add C1 disconnect item."
> >
> >It's almost as if they're reading this list.  This disconnect problem was 
> >discovered on the 5th (well the 5th in my timezone).  Perhaps they're 
> >aware of this issue...  I'm gonna talk to them.
> >
> >Jesse
> >
> A bios update for MSI K7N2 MCP2-T nforce2 board
> fixed the crashing BEFORE these patches were developed,
> but there was no documentation that would relate or explain.

Last night, I updated the bios to the 12-5-03 released yesterday (see above).  I looked at the new option under Advanced Chipset Features, "C1 Disconnect".  It has three selections: Auto, Enabled, Disabled.  There seems to be no default.  The item help says:
"Force En/Disabled 
 or Auto mode:
 C17 IGP/SPP NB A03
 C18D SPP NM A01 (C01)
 enabled C1 disconnect
 otherwise disabled it"

Auto sounded nice, so I selected that first.  I compiled a new kernel without the disconnect off patch, or the ack delay.  These are the exact patches I used on 2.6.0-test11:
patch-2.6.0-test11-bk8.bz2
acpi-2.6.0t11.patch acpi bugfixes from Maciej.
nforce-ioapic-timer-2.6t11.patch from Ross Dickson.  Timer patch.
forcedeth.patch Patch stolen from -test10-mm1?  Unused.
forcedeth-update-2.patch Same.

Sure enough, under this kernel, no lockups.  Athcool reported Disconnect was "on".
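
As an aside for readers following along: the "Disconnect" state that athcool reports can be pictured as one bit in a northbridge PCI configuration register that a tool reads, modifies, and writes back. The sketch below only illustrates that read-modify-write idea on a plain byte value. The register offset 0x6F and mask 0x10 are assumptions that are not confirmed anywhere in this thread, and the function names are hypothetical. It does not touch hardware.

```c
#include <stdint.h>

/* ASSUMPTION: offset and mask are illustrative values for the nForce2
 * northbridge; verify against the athcool source or chipset docs. */
#define NFORCE2_DISCONNECT_REG  0x6F  /* assumed PCI config-space offset */
#define NFORCE2_DISCONNECT_MASK 0x10  /* assumed "disconnect enable" bit */

/* Hypothetical helper: returns 1 if the disconnect bit is set in a
 * register value that was already read from config space. */
int disconnect_enabled(uint8_t reg_value)
{
    return (reg_value & NFORCE2_DISCONNECT_MASK) != 0;
}

/* Hypothetical helper: returns the register value with the disconnect
 * bit forced on or off, leaving all other bits untouched. */
uint8_t set_disconnect(uint8_t reg_value, int enable)
{
    if (enable)
        return (uint8_t)(reg_value | NFORCE2_DISCONNECT_MASK);
    return (uint8_t)(reg_value & (uint8_t)~NFORCE2_DISCONNECT_MASK);
}
```

A real tool would wrap these in a config-space read and write of the northbridge device; only the bit handling is shown here.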

I decided to wait till this morning, to try the BIOS "C1 Disconnect" set to enabled.  Still no lockups under this kernel.  Tried a vanilla kernel, no lockups (but timer and watchdog messed up still).  Now that I read your message Bob, I understand what you are saying.  Luckily, the updated BIOS changelog states "Add C1 disconnect item."  And this exact version seems to have fixed it, and now we have an exact fix (another one?) to refer to.

So the fix was absolutely a BIOS fix.  It seems a lot of people have buggy BIOSes on nforce2 boards, even some that have the option.  I guess I haven't proved that it was the BIOS fix, because I haven't stressed it for a long period of time.  But I don't believe I have to, because I can do greps and kernel compiles with disconnect on now, where before I couldn't (it has always been very easy to reproduce the lockup).

> 
> http://www.msi.com.tw/program/support/bios/bos/spt_bos_detail.php?UID=436&kind=1
> http://download.msi.com.tw/support/bos_exe/6570v76.exe
> 
> Award 7.6 at the top of the list. Maybe somebody can figure
> out what they're doing.

I think I'll continue on contacting shuttle and ask them why they added the option, and how they added it.  Maybe that will give us the right information.

> 
> Nvidia X driver for ti4200 agp8 still locks up linux though,
> but X nv works fine. agp8 3d may expose the timer issue.
> 

That's either an nvidia driver problem, or an agpgart-nforce problem.  I'd try 4x agp, and/or NVAGP (or agpgart, if already using NVAGP).  If you think it's the timer, try the timer patch, or boot with nolapic noapic.

Jesse


* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-12 16:59                   ` Working nforce2, was " Jesse Allen
@ 2003-12-12 17:18                     ` Jesse Allen
  2003-12-12 18:18                     ` Josh McKinney
  2003-12-13  6:34                     ` Bob
  2 siblings, 0 replies; 35+ messages in thread
From: Jesse Allen @ 2003-12-12 17:18 UTC (permalink / raw)
  To: linux-kernel

Oops, typo: NM supposed to be NB
On Fri, Dec 12, 2003 at 09:59:29AM -0700, Jesse Allen wrote:
> The item help says:
> "Force En/Disabled 
>  or Auto mode:
>  C17 IGP/SPP NB A03
>  C18D SPP NM A01 (C01)
  C18D SPP /NB/ A01 (C01)
>  enabled C1 disconnect
>  otherwise disabled it"
> 

Maybe NB means northbridge?


* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-12 16:59                   ` Working nforce2, was " Jesse Allen
  2003-12-12 17:18                     ` Jesse Allen
@ 2003-12-12 18:18                     ` Josh McKinney
  2003-12-12 19:29                       ` Jesse Allen
                                         ` (2 more replies)
  2003-12-13  6:34                     ` Bob
  2 siblings, 3 replies; 35+ messages in thread
From: Josh McKinney @ 2003-12-12 18:18 UTC (permalink / raw)
  To: linux-kernel

On approximately Fri, Dec 12, 2003 at 09:59:29AM -0700, Jesse Allen wrote:
> On Fri, Dec 12, 2003 at 04:27:59AM -0500, Bob wrote:
> > Jesse Allen wrote:
> > 
> > >On Thu, Dec 11, 2003 at 06:52:41PM +0100, Ian Kumlien wrote:
> > > 
> > >
> > >>Heh, yeah, the need for disconnect is somewhat dodgy, i haven't read up
> > >>on the rest.
> > >>
> > >Hmm, weird.  I went to go look at the Shuttle motherboard maker's site - 
> > >maybe so that I can bug them for a bios disconnect option - but I checked 
> > >for a bios update first.  And sure enough like they read my mind, just 
> > >posted online today, an update.  Here are the details of fixes:
> > >
> > >" Checksum:   8B00H                         Date Code: 12/05/03
> > >1.Support 0.18 micron AMD Duron (Palomino) CPU.
> > >2.Add C1 disconnect item."
> > >
> > >It's almost as if they're reading this list.  This disconnect problem was 
> > >discovered on the 5th (well the 5th in my timezone).  Perhaps they're 
> > >aware of this issue...  I'm gonna talk to them.
> > >
> > >Jesse
> > >
> > A bios update for MSI K7N2 MCP2-T nforce2 board
> > fixed the crashing BEFORE these patches were developed,
> > but there was no documentation that would relate or explain.
> 
> Last night, I updated the bios to the 12-5-03 released yesterday (see above).  I looked at the new option under Advanced Chipset Features, "C1 Disconnect".  It has three selections: Auto, Enabled, Disabled.  There seems to be no default.  The item help says:
> "Force En/Disabled 
>  or Auto mode:
>  C17 IGP/SPP NB A03
>  C18D SPP NM A01 (C01)
>  enabled C1 disconnect
>  otherwise disabled it"
> 
> Auto sounded nice, so I selected that first.  I compiled a new kernel without the disconnect off patch, or the ack delay.  These are the exact patches I used on 2.6.0-test11:
> patch-2.6.0-test11-bk8.bz2
> acpi-2.6.0t11.patch acpi bugfixes from Maciej.
> nforce-ioapic-timer-2.6t11.patch from Ross Dickson.  Timer patch.
> forcedeth.patch Patch stolen from -test10-mm1?  Unused.
> forcedeth-update-2.patch Same.
> 
> Sure enough, under this kernel, no lockups.  Athcool reported Disconnect was "on".
> 
<snip>
> So the fix was absolutely a BIOS fix.  It seems a lot of people have buggy BIOSes on nforce2 boards.  Even some that have the option.  I guess I haven't proved that it was the BIOS fix, because I haven't stressed it for a long period of time.  But I don't believe I have to because I can do grep's and kernel compiles with disconnect on now, where before I couldn't (always been very easy to reproduce lockup).
<snip>

The thing that strikes me funny is that you get no crashes with the
updated BIOS and Disconnect on, but without the updated BIOS we have
to turn disconnect off with athcool or the patch?  This makes me think
that there is some voodoo going on in the BIOS update that they aren't
saying, surprise surprise, or something is just slowing down the time
it takes for it to crash.  I say this because I have gone 5+ days
without any of the patches from these threads, acpi apic lapic
enabled, and CPU disconnect on as stated by athcool.  This was with
much stress testing, idle time, etc.  One day I just ran a grep that I
have done probably 30 times and boom, hang.  

Good luck, hope the BIOS is the trick, now off to see how I can get
ASUS to put the C1 Disconnect in the next revision.    

-- 
Josh McKinney		     |	Webmaster: http://joshandangie.org
--------------------------------------------------------------------------
                             | They that can give up essential liberty
Linux, the choice       -o)  | to obtain a little temporary safety deserve 
of the GNU generation    /\  | neither liberty or safety. 
                        _\_v |                          -Benjamin Franklin


* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-12 18:18                     ` Josh McKinney
@ 2003-12-12 19:29                       ` Jesse Allen
  2003-12-12 21:42                       ` Craig Bradney
  2003-12-13  4:18                       ` Bob
  2 siblings, 0 replies; 35+ messages in thread
From: Jesse Allen @ 2003-12-12 19:29 UTC (permalink / raw)
  To: Josh McKinney; +Cc: linux-kernel

On Fri, Dec 12, 2003 at 01:18:27PM -0500, Josh McKinney wrote:
> 
> The thing that strikes me funny is that you get no crashes with the
> updated BIOS and Disconnect on, but without the updated BIOS we have
> to turn disconnect off with athcool or the patch?  This makes me think
> that there is some voodoo going on in the BIOS update that they aren't
> saying, surprise surprise, 

Yes, it is weird.  I've now asked shuttle for more information.

> or something is just slowing down the time
> it takes for it to crash.  I say this because I have gone 5+ days
> without any of the patches from these threads, acpi apic lapic
> enabled, and CPU disconnect on as stated by athcool.  This was with
> much stress testing, idle time, etc.  One day I just ran a grep that I
> have done probably 30 times and boom, hang.  

I hope this is not the case!  The one/two grep test worked flawlessly, but now if it's delayed, then I can't do that anymore.

(but at least I have the bios option now! heh)

I suggest you reference the Shuttle AN35 12-05-2003 BIOS, and maybe Bob's MSI, when you talk to Asus.  If they can do it, then Asus should be able as well.

Jesse


* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-12 18:18                     ` Josh McKinney
  2003-12-12 19:29                       ` Jesse Allen
@ 2003-12-12 21:42                       ` Craig Bradney
  2003-12-13  4:18                       ` Bob
  2 siblings, 0 replies; 35+ messages in thread
From: Craig Bradney @ 2003-12-12 21:42 UTC (permalink / raw)
  To: Josh McKinney; +Cc: linux-kernel

On Fri, 2003-12-12 at 19:18, Josh McKinney wrote:
> On approximately Fri, Dec 12, 2003 at 09:59:29AM -0700, Jesse Allen wrote:
> > On Fri, Dec 12, 2003 at 04:27:59AM -0500, Bob wrote:
> > > Jesse Allen wrote:
> > > 
> > > >On Thu, Dec 11, 2003 at 06:52:41PM +0100, Ian Kumlien wrote:
> > > > 
> > > >
> > > >>Heh, yeah, the need for disconnect is somewhat dodgy, i haven't read up
> > > >>on the rest.
> > > >>
> > > >Hmm, weird.  I went to go look at the Shuttle motherboard maker's site - 
> > > >maybe so that I can bug them for a bios disconnect option - but I checked 
> > > >for a bios update first.  And sure enough like they read my mind, just 
> > > >posted online today, an update.  Here are the details of fixes:
> > > >
> > > >" Checksum:   8B00H                         Date Code: 12/05/03
> > > >1.Support 0.18 micron AMD Duron (Palomino) CPU.
> > > >2.Add C1 disconnect item."
> > > >
> > > >It's almost as if they're reading this list.  This disconnect problem was 
> > > >discovered on the 5th (well the 5th in my timezone).  Perhaps they're 
> > > >aware of this issue...  I'm gonna talk to them.
> > > >
> > > >Jesse
> > > >
> > > A bios update for MSI K7N2 MCP2-T nforce2 board
> > > fixed the crashing BEFORE these patches were developed,
> > > but there was no documentation that would relate or explain.
> > 
> > Last night, I updated the bios to the 12-5-03 released yesterday (see above).  I looked at the new option under Advanced Chipset Features, "C1 Disconnect".  It has three selections: Auto, Enabled, Disabled.  There seems to be no default.  The item help says:
> > "Force En/Disabled 
> >  or Auto mode:
> >  C17 IGP/SPP NB A03
> >  C18D SPP NM A01 (C01)
> >  enabled C1 disconnect
> >  otherwise disabled it"
> > 
> > Auto sounded nice, so I selected that first.  I compiled a new kernel without the disconnect off patch, or the ack delay.  These are the exact patches I used on 2.6.0-test11:
> > patch-2.6.0-test11-bk8.bz2
> > acpi-2.6.0t11.patch acpi bugfixes from Maciej.
> > nforce-ioapic-timer-2.6t11.patch from Ross Dickson.  Timer patch.
> > forcedeth.patch Patch stolen from -test10-mm1?  Unused.
> > forcedeth-update-2.patch Same.
> > 
> > Sure enough, under this kernel, no lockups.  Athcool reported Disconnect was "on".
> > 
> <snip>
> > So the fix was absolutely a BIOS fix.  It seems a lot of people have buggy BIOSes on nforce2 boards.  Even some that have the option.  I guess I haven't proved that it was the BIOS fix, because I haven't stressed it for a long period of time.  But I don't believe I have to because I can do grep's and kernel compiles with disconnect on now, where before I couldn't (always been very easy to reproduce lockup).
> <snip>
> 
> The thing that strikes me funny is that you get no crashes with the
> updated BIOS and Disconnect on, but without the updated BIOS we have
> to turn disconnect off with athcool or the patch?  This makes me think
> that there is some voodoo going on in the BIOS update that they aren't
> saying, surprise surprise, or something is just slowing down the time
> it takes for it to crash.  I say this because I have gone 5+ days
> without any of the patches from these threads, acpi apic lapic
> enabled, and CPU disconnect on as stated by athcool.  This was with
> much stress testing, idle time, etc.  One day I just ran a grep that I
> have done probably 30 times and boom, hang.  
> 
> Good luck, hope the BIOS is the trick, now off to see how I can get
> ASUS to put the C1 Disconnect in the next revision.    


Yes, that's how it was for me.. I was the only one here saying "no
problems, la la la", then at about 5.25 days.. boom. Then the next day
it crashed twice. Hopefully you make some progress with ASUS.. (for the
A7N8X Deluxe as well as your mobo please :) ).

I've been playing with hardware in the past few days (new quieter Zalman
PSU, and Zalman 7000 Cu fan etc) so no uptime to speak of here now. I
did compile KDE 3.2 beta 2 last night though.. 6 hours of solid
compilation.. no hassles. I have never turned off Disconnect either.

Thanks to all you guys who are working on this one. Seems to be getting
somewhere.

Craig



* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-12 18:18                     ` Josh McKinney
  2003-12-12 19:29                       ` Jesse Allen
  2003-12-12 21:42                       ` Craig Bradney
@ 2003-12-13  4:18                       ` Bob
  2 siblings, 0 replies; 35+ messages in thread
From: Bob @ 2003-12-13  4:18 UTC (permalink / raw)
  To: linux-kernel

Re: two instances of good but undocumented bios voodoo

Josh McKinney wrote:

>On approximately Fri, Dec 12, 2003 at 09:59:29AM -0700, Jesse Allen wrote:
>  
>
>>On Fri, Dec 12, 2003 at 04:27:59AM -0500, Bob wrote:
>>    
>>
>>>Jesse Allen wrote:
>>>
>>>      
>>>
>>>>On Thu, Dec 11, 2003 at 06:52:41PM +0100, Ian Kumlien wrote:
>>>>
>>>>
>>>>        
>>>>
>>>> ............
>>>>
>>>>but I checked 
>>>>for a bios update first.  And sure enough like they read my mind, just 
>>>>posted online today, an update.  Here are the details of fixes:
>>>>
>>>>" Checksum:   8B00H                         Date Code: 12/05/03
>>>>1.Support 0.18 micron AMD Duron (Palomino) CPU.
>>>>2.Add C1 disconnect item."..........Jesse
>>>>        
>>>>
-Jesse got a bios update that gives him a cpu disconnect
option now in setup

>>>>        
>>>>
>>>A bios update for MSI K7N2 MCP2-T nforce2 board
>>>fixed the crashing BEFORE these patches were developed,
>>>but there was no documentation that would relate or explain.
>>>      
>>>
-Bob said that about his bios update fixing
the lockup problem entirely, but no doc,
needing no patch except to turn on the ioapic
edge timer (another clue: without the ioapic
edge timer working, the bios update fixed this
nforce2 situation!), no clue as to whether the bios
update sets cpu disconnect one way or the other,
and no option to choose cpu disconnect in the new or old
setup.

Jesse continues--

>>Last night, I updated the bios to the 12-5-03 released yesterday (see above).  I looked at the new option under Advanced Chipset Features, "C1 Disconnect".  It has three selections: Auto, Enabled, Disabled.  There seems to be no default.  The item help says:
>>"Force En/Disabled 
>> or Auto mode:
>> C17 IGP/SPP NB A03
>> C18D SPP NM A01 (C01)
>> enabled C1 disconnect
>> otherwise disabled it"
>>
>>Auto sounded nice, so I selected that first.  I compiled a new kernel without the disconnect off patch, or the ack delay.  These are the exact patches I used on 2.6.0-test11:
>>patch-2.6.0-test11-bk8.bz2
>>acpi-2.6.0t11.patch acpi bugfixes from Maciej.
>>nforce-ioapic-timer-2.6t11.patch from Ross Dickson.  Timer patch.
>>forcedeth.patch Patch stolen from -test10-mm1?  Unused.
>>forcedeth-update-2.patch Same.
>>
>>Sure enough, under this kernel, no lockups.  Athcool reported Disconnect was "on".
>>    
>>
Disconnect was ON!!!

> <snip>

...in one case the bios update fixed the problem
without needing cpu disconnect off; in the other case we
don't know how or whether cpu disconnect is on or
off now, but the bios update fixed nforce2 without turning
the ioapic edge timer on. I guess these two cases prove that
neither cpu disconnect = on nor ioapic timer = off is
causing the problem directly.

>The thing that strikes me funny is that you get no crashes with the
>updated BIOS and Disconnect on, but without the updated BIOS we have
>to turn disconnect off with athcool or the patch?  This makes me think
>that there is some voodoo going on in the BIOS update that they aren't
>saying, surprise surprise, or something is just slowing down the time
>it takes for it to crash.  I say this because I have gone 5+ days
>without any of the patches from these threads, acpi apic lapic
>enabled, and CPU disconnect on as stated by athcool.  This was with
>much stress testing, idle time, etc.  One day I just ran a grep that I
>have done probably 30 times and boom, hang.  
>
>Good luck, hope the BIOS is the trick, now off to see how I can get
>ASUS to put the C1 Disconnect in the next revision.
>
...and at least two motherboard makers have voodoo
to fix the problem.



* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
@ 2003-12-13  5:16 Ross Dickson
  2003-12-13  6:04 ` Jesse Allen
  0 siblings, 1 reply; 35+ messages in thread
From: Ross Dickson @ 2003-12-13  5:16 UTC (permalink / raw)
  To: cbradney; +Cc: linux-kernel, AMartin, Ian Kumlien

<snip>
> > The thing that strikes me funny is that you get no crashes with the
> > updated BIOS and Disconnect on, but without the updated BIOS we have
> > to turn disconnect off with athcool or the patch? This makes me think
> > that there is some voodoo going on in the BIOS update that they aren't
> > saying, surprise surprise, or something is just slowing down the time
> > it takes for it to crash. I say this because I have gone 5+ days
> > without any of the patches from these threads, acpi apic lapic
> > enabled, and CPU disconnect on as stated by athcool. This was with
> > much stress testing, idle time, etc. One day I just ran a grep that I
> > have done probably 30 times and boom, hang.
> >
> > Good luck, hope the BIOS is the trick, now off to see how I can get
> > ASUS to put the C1 Disconnect in the next revision.
>
> Yes, that's how it was for me.. I was the only one here saying "no
> problems, la la la", then at about 5.25 days.. boom. Then the next day
> it crashed twice. Hopefully you make some progress with ASUS.. (for the
> A7N8X Deluxe as well as your mobo please :) ).
>
> I've been playing with hardware in the past few days (new quieter Zalman
> PSU, and Zalman 7000 Cu fan etc) so no uptime to speak of here now. I
> did compile KDE 3.2 beta 2 last night though.. 6 hours of solid
> compilation.. no hassles. I have never turned off Disconnect either.
>
> Thanks to all you guys who are working on this one. Seems to be getting
> somewhere.
>
> Craig

I wonder about the "voodoo" because my apic ack delay patch was developed
without knowledge of the C1 disconnect bit and reports I have received so far
are that the hard lockups go away when using it independent of the state of the 
disconnect bit. Apparently the bit was on in my test systems. 

Ian Kumlien pointed out the linkage with the northbridge timing signals 
to the CPU to do with the connect disconnect handshake so I now wonder just how
programmable the nforce2 northbridge is? Is it a bit fpga'ish in that they may be
using the bios boot to alter the handshake timing enough to accomplish what
the ack delay does but like it should be - transparent to the OS?

Of course they -the makers- have access to knowledge we don't so it could be 
something completely different that they are doing!

In short I agree with the suggestion that the new bios options do more behind
the scenes than what the athcool and disconnect patches do. 

I am pretty sure that I read somewhere that when the epox boards 
were first released the epox 8rda bios started out with it (the disconnect bit) off
then the 8rga+ came out with it on by default? So back then people were wanting
to turn it on in the 8rda to lower their CPU temperature - now some want it off
in search of stability? Back then under win.... some experienced lockups depending
on which IDE driver was used and which state the bit was in!

Out of interest has anyone seen new disconnect bit options in the Phoenix bios or
only in the award bios?

Finally I have done some more work and found that the ack delay patch on my
system costs about 13 apic timer counts, about half the 28 apic timer counts
required to write a byte directly to the printer port with outb(0x00, 0x378).
So the ack delay is about twice as quick as writing a single EOI to the 8259 in
XTPIC mode, provided the 8259 accesses are not souped up under the hood.
In other words, whilst it is a timing hit it is not much of one, and it won't be
needed once this is all fixed by the respective manufacturers. Let's hope they
can do it on the hardware we have already bought.
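
The comparison above can be sketched in C. This is a minimal illustration, assuming the counts are sampled from a down-counting periodic timer (as the local APIC timer's current-count register is); the function names and the single-reload handling are illustrative and are not the measurement code used for the figures in this message. Only the values 13 and 28 come from the text.

```c
#include <stdint.h>

/* Counts elapsed between two samples of a down-counting periodic timer.
 * `before` and `after` are current-count samples; `initial` is the value
 * the timer reloads after reaching zero. Assumes at most one reload
 * happened between the samples (an approximation). */
uint32_t elapsed_counts(uint32_t before, uint32_t after, uint32_t initial)
{
    if (before >= after)
        return before - after;          /* no reload in between */
    return before + (initial - after);  /* timer wrapped once */
}

/* Cost of one operation relative to a reference operation, both in
 * timer counts. Returns 0.0 if the reference is zero. */
double relative_cost(uint32_t op_counts, uint32_t reference_counts)
{
    if (reference_counts == 0)
        return 0.0;
    return (double)op_counts / (double)reference_counts;
}
```

With the quoted figures, relative_cost(13, 28) comes out at roughly 0.46, which matches the "about half" estimate for the ack delay against a single port write.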

Regards
Ross Dickson


* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-13  5:16 Working nforce2, was " Ross Dickson
@ 2003-12-13  6:04 ` Jesse Allen
  0 siblings, 0 replies; 35+ messages in thread
From: Jesse Allen @ 2003-12-13  6:04 UTC (permalink / raw)
  To: Ross Dickson; +Cc: AMartin, linux-kernel

On Sat, Dec 13, 2003 at 03:16:51PM +1000, Ross Dickson wrote:
> I wonder about the "voodoo" because my apic ack delay patch was developed
> without knowledge of the C1 disconnect bit and reports I have received so far
> are that the hard lockups go away when using it independent of the state of the 
> disconnect bit. Apparently the bit was on in my test systems. 
> 
> Ian Kumlien pointed out the linkage with the northbridge timing signals 
> to the CPU to do with the connect disconnect handshake 

This is what the item help for C1 Disconnect in my BIOS said:
 "Force En/Disabled
  or Auto mode:
  C17 IGP/SPP NB A03
  C18D SPP NB A01 (C01)
  enabled C1 disconnect
  otherwise disabled it"

I was thinking NB referred to northbridge.  SPP is the type of NForce chip.  IGP would be a graphics chip(?), though this board doesn't have that.

So yes, we do have at least some relationship with the northbridge and disconnect.  This BIOS update probably addressed that, and the BIOS changelog is just a summary.

> so I now wonder just how
> programmable the nforce2 northbridge is? Is it a bit fpga'ish in that they may be
> using the bios boot to alter the handshake timing enough to accomplish what
> the ack delay does but like it should be - transparent to the OS?

Probably.  That's what I'm thinking too now.

> 
> Of course they -the makers- have access to knowledge we don't so it could be 
> something completely different that they are doing!
> 
> In short I agree with the suggestion that the new bios options do more behind
> the scenes than what the athcool and disconnect patches do. 

That's why I'm trying to contact shuttle.

> 
> I am pretty sure that I read somewhere that when the epox boards 
> were first released the epox 8rda bios started out with it (the disconnect bit) off
> then the 8rga+ came out with it on by default? So back then people were wanting
> to turn it on in the 8rda to lower their CPU temperature - now some want it off
> in search of stability? 

Ah, that reminds me.  The very first day I ran this board last week, I was very worried about how high the system temperature was getting -- above 40 deg C.  CPU was getting up to 49 deg C.  Not that it was locking up because of temperature - it would lock up on a cold boot - but I was experiencing lockups and higher than normal temperatures, which indicates to me now how poorly its thermal management was operating then.  Now with the new patches, and ultimately the BIOS update, system temperature is about 35 deg C, which ain't too bad =)

> Back then under win.... some experienced lockups depending
> on which IDE driver was used and which state the bit was in!

Good point!  I was reading some message boards discussing nforce2s yesterday.  And they pretty much unanimously said, don't use the NForce IDE driver, use the windows provided IDE driver, because the NForce IDE _locks up_.  So windows does have the same problem after all.  I wouldn't know because I don't have windows... but you can find this same issue everywhere then.

> 
> Out of interest has anyone seen new disconnect bit options in the Pheonix bios or
> only in the award bios?

I have an Award BIOS.

> 
> Finally I have done some more work and found that the ack delay patch on my
> system is about 13 apic timer counts, about half that required to write a byte 
> directly outb(0x00, 0x378) to the printer port at 28 apic timer counts. 
> So the ack delay is about twice as quick as writing a single EOI to the 8259 in
> XTPIC mode provided the 8259 accesses are not souped up under the hood.
> In other words whilst it is a timing hit it is not much of one and it won't be
> needed once this is all fixed by the respective manufacturers -lets hope they
> can do it on the hardware we have already bought.
> 
> Regards
> Ross Dickson
> 
> 

Good work.  Let's hope the hardware manufacturers come through.

Jesse


* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-12 16:59                   ` Working nforce2, was " Jesse Allen
  2003-12-12 17:18                     ` Jesse Allen
  2003-12-12 18:18                     ` Josh McKinney
@ 2003-12-13  6:34                     ` Bob
  2 siblings, 0 replies; 35+ messages in thread
From: Bob @ 2003-12-13  6:34 UTC (permalink / raw)
  To: linux-kernel

hackers be clever--

"system temperature was getting -- above 40 deg C. CPU was getting up to 
49 deg C...how poorly its thermal management was operating then. Now 
with the new patches, and ultimately, BIOS update, system temperature is 
about 35 deg C -JesseAllen"

Maybe that tells me that my bios update fixed my
lockup problems without turning on cpu disconnect,
or even by turning it off, undocumented, as a face-saver,
and without giving me a choice in setup. Like yours
before cpu disconnect was working, my temp
is 41C most of the time and 48C under a
heavy load, possibly 49C: the exact range you
were seeing before you had cpu disconnect
working

or they turned cpu disconnect off without saying
anything, buying time, saving embarrassment

anyway it's probably off here since I have exactly
the same heat profile

I have 120mm fans one in one out, blowing air
across Zalman cpu and gpu heatsinks, no 80mm
extra Zalman fan. amd xp 3000+ 333mhz 1:1
arctic silver compound on heatsinks

Thermal 1: ok, 41.0 degrees C 105.8 degrees F
 - 41C in X, running realplayer
 - 48C compile a fat kernel or several heavy tasks

-Bob

Jesse Allen wrote:

> ....I compiled a new kernel without the disconnect off patch, or the 
> ack delay. These are the exact patches I used on 2.6.0-test11:
>
>patch-2.6.0-test11-bk8.bz2
>acpi-2.6.0t11.patch acpi bugfixes from Maciej.
>nforce-ioapic-timer-2.6t11.patch from Ross Dickson.  Timer patch.
>forcedeth.patch Patch stolen from -test10-mm1?  Unused.
>forcedeth-update-2.patch Same.
>
>Sure enough, under this kernel, no lockups.  Athcool reported Disconnect was "on".
>
>I decided to wait till this morning, to try the BIOS "C1 Disconnect" set to enabled.  Still no lockups under this kernel.  Tried a vanilla kernel, no lockups (but timer and watchdog messed up still).  Now that I read your message Bob, I understand what you are saying.  Luckily, the updated BIOS changelog states "Add C1 disconnect item."  And this exact version seems to have fixed it, and now we have an exact fix (another one?) to refer to.
>
>So the fix was absolutely a BIOS fix.
>
...but we're stuck looking at smoke and mirrors,
when the kernel might be able to work around
bioses that have not been "updated". Or to put
it another way, "voodoo" may be done by
kernel if not done by bios. Whatever is being
tweaked may be accessible to kernel code.

I can't read anything useful in my bios flash
file w6570nms.760 which is contained in--

>>http://download.msi.com.tw/support/bos_exe/6570v76.exe
>>

>>Nvidia X driver for ti4200 agp8 still locks up linux though,
>>but X nv works fine. agp8 3d may expose the timer issue.
>>
>>    
>>
>
>That's either an nvidia driver problem, or agpgart-nforce problem.  I'd try 4x agp, and or NVAGP (or agpgart, if already using NVAGP).  If you think it's the timer, try the timer patch, or with nolapic noapic.
>
>Jesse
>
Thanks, I've tried all of those except passing agp4 or agp2
to the nvidia X "nvidia" driver. Another clue that it's related
to interrupts or timing of access to interrupts is that before
I put another card on the pci bus I could get into X for a
few seconds with the nvidia driver before linux locked up,
now with an elan pcmcia 32-bit cardbus pci card that claims
it needs its own interrupt(can't give it one yet!) X just locks
up linux on load.


* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
@ 2003-12-13  9:20 Ross Dickson
  2003-12-13  9:51 ` Bob
  0 siblings, 1 reply; 35+ messages in thread
From: Ross Dickson @ 2003-12-13  9:20 UTC (permalink / raw)
  To: linux-kernel

<snip>
>>I decided to wait till this morning, to try the BIOS "C1 Disconnect" set to
>> enabled. Still no lockups under this kernel. Tried a vanilla kernel, no
>> lockups (but timer and watchdog messed up still). Now that I read
>> your message Bob, I understand what you are saying. Luckily, the 
>>updated BIOS changelog states "Add C1 disconnect item." And this exact
>> version seems to have fixed it, and now we have an exact fix (another one?) 
>>to refer to. 
> > 
> >So the fix was absolutely a BIOS fix. 
> > 
<snip>

==That's why I'm trying to contact shuttle.
Jesse

Good work Jesse, I hope Shuttle give up some info - especially as I have
Phoenix BIOSes and they are doing ?? about it?

> ...but we're stuck looking at smoke and mirrors, 
> when the kernel might be able to work around 
> bioses that have not been "updated". Or to put 
> it another way, "voodoo" may be done by 
> kernel if not done by bios. Whatever is being 
> tweaked may be accessible to kernel code. 
<snip>
Bob

Please ignore the following if you are already up to speed on SMM. Some
readers may not know why we cannot do all that the bios can do aside from
a lack of information.

Agreed, but the keywords are might and may. I remember doing DOS-based data acquisition 
with 486SX laptops, and then Intel brought out the 486SL and our pulse counting 
went bad because of the power saving core. I got the data book from Intel and
was very dismayed to see that BIOS code was being executed when I thought our code
was running, and there was not a darn thing I could do about it and keep the
laptop warranty intact. 

Its offspring, as you may already know, is SMM (System Management Mode). It is a privileged mode that we can
do pretty much squat about. It can pop up anywhere in the middle of our code 
and the only thing we will know about it aside from missing time is when it has
stuffed something up - like setting registers back to the wrong values. Think of
it like a kernel within our kernel with permissions set so it can hack us but we
cannot hack it.
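
A rough way to see the "missing time" described above from user space is to sample a monotonic clock in a tight loop and look for unusually large gaps between consecutive readings. The sketch below is only an illustration: the helper names (now_ns, sample_timestamps, max_gap_ns) are made up for this example, and a large gap can just as well come from an ordinary interrupt or the scheduler as from SMM, so it shows that time went missing, not why.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: current monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec t;
    int rc = clock_gettime(CLOCK_MONOTONIC, &t);

    assert(rc == 0);
    (void)rc; /* keep builds with NDEBUG free of unused warnings */
    return (uint64_t)t.tv_sec * 1000000000ULL + (uint64_t)t.tv_nsec;
}

/* Hypothetical helper: fill ts[0..n-1] with back-to-back clock readings. */
static void sample_timestamps(uint64_t *ts, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        ts[i] = now_ns();
}

/*
 * Hypothetical helper: largest gap between consecutive readings.
 * Returns 0 when there are fewer than two samples.
 */
static uint64_t max_gap_ns(const uint64_t *ts, size_t n)
{
    uint64_t worst = 0;
    size_t i;

    for (i = 1; i < n; i++) {
        uint64_t gap = ts[i] - ts[i - 1];

        if (gap > worst)
            worst = gap;
    }
    return worst;
}
```

Calling sample_timestamps() on an array of a few thousand entries and then max_gap_ns() gives the worst stall seen during the run; repeating it on an otherwise idle machine makes outliers easier to spot.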

Maciej recently wrote about its continuing effect on NMI debug here.

http://linux.derkeiler.com/Mailing-Lists/Kernel/2003-12/2940.html

Regards
Ross.


* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-13  9:20 Ross Dickson
@ 2003-12-13  9:51 ` Bob
  0 siblings, 0 replies; 35+ messages in thread
From: Bob @ 2003-12-13  9:51 UTC (permalink / raw)
  To: linux-kernel

Ross Dickson wrote:

>>>So the fix was absolutely a BIOS fix. 
>>>
>>>      
>>>
><snip>
>
>==That's why I'm trying to contact shuttle.
>Jesse
>  
>
R> Good Work Jesse, I hope shuttle give up some info - especially as I have
pheonix bioses and they are doing ?? about it? -Ross

B> I was expecting to hear that.

I have an Award bios on an MSI nforce2 mboard. Their bios flash
file name begins with a letter for the bios vendor:
"w" for Award, e.g. w6570nms.760 (W6570 v760 bios flash file)
"p" for Phoenix
"a" for AMI
The vendor also appears at boot, but goes by in a flash,
and it appears on the first cmos setup page.

So Award bios has a fix for the nforce2.

How about Jesse's bios that can fix the
problem without a kernel patch, as my
Award bios is doing? What kind of bios
is that you have, Jesse?

My Award bios does not give me any way to
turn the ioapic edge timer on, though.
I need a patch to get that on.

Also I don't have a cpu disconnect choice in
setup and by running temp range 41C to 48C
I guess cpu disconnect is not on. 48C once in
a while does not hurt anything though.  -Bob

>>...but we're stuck looking at smoke and mirrors, 
>>when the kernel might be able to work around 
>>bioses that have not been "updated". Or to put 
>>it another way, "voodoo" may be done by 
>>kernel if not done by bios. Whatever is being 
>>tweaked may be accessible to kernel code. 
>>    
>>
><snip>
>Bob
>
>Please ignore the following if you are already up to speed on SMM. Some
>readers may not know why we cannot do all that the bios can do aside from
>a lack of information.
> 
>Agreed but the keywords are might and may. I remember doing dos based data acquisition 
>with 486SX laptops and then Intel brought out the 486Sl and our pulse counting 
>went bad because of the power saving core. I got the data book from Intel and
>was very dismayed to see that bios code was being executed when I thought our code
>was running and there was not a darn thing I could do about it and keep the
>laptop warranty intact. 
>
>Its offspring as you may already know is SMM. It is a priviledged mode that we can
>do pretty much squat about. It can pop up anywhere in the middle of our code 
>and the only thing we will know about it aside from missing time is when it has
>stuffed something up - like setting registers back to the wrong values. Think of
>it like a kernel within our kernel with permissions set so it can hack us but we
>cannot hack it.
>
>Maciej recently writes of its continuing effect on NMI debug here.
>
>http://linux.derkeiler.com/Mailing-Lists/Kernel/2003-12/2940.html
>
>Regards
>Ross
>
>  
>
Thanks for explaining. We got some new functionality
just by turning nmi_watchdog on, but has anybody
learned anything from the extra debug output
as far as this nforce2 timing thing goes?     -Bob


* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-15 16:54   ` Ross Dickson
@ 2003-12-16  6:07     ` Bob
  0 siblings, 0 replies; 35+ messages in thread
From: Bob @ 2003-12-16  6:07 UTC (permalink / raw)
  To: linux-kernel

Ross,

my_make_script nf2-800UL 2>&1 | tee /tmp/make.err

#/tmp/make.err
<snip>

  CC      arch/i386/kernel/apic.o
arch/i386/kernel/apic.c: In function `smp_apic_timer_interrupt':
arch/i386/kernel/apic.c:1105: warning: unsigned int format, long unsigned int arg (arg 2)


...which is around the printk line here--

                       printk("..APIC TIMER ack delay, reload:%u, safe:%u\n",

+                if(!passno) { /* calculate timing */
+                        safecnt = apic_read(APIC_TMICT) -
+                                ( (800UL * apic_read(APIC_TMICT) ) /
+                                (1000000000UL/HZ) );
+                        printk("..APIC TIMER ack delay, reload:%u, safe:%u\n",
+                                apic_read(APIC_TMICT), safecnt);
+                        passno++;


Here are the two patches with "#ifdef N" to "#if defined(N)" change
but not the unsigned int change --


diff -urN linux-2.6.0-test11/arch/i386/kernel/apic.c linux-2.6.0-test11-nf2/arch/i386/kernel/apic.c
--- linux-2.6.0-test11/arch/i386/kernel/apic.c	2003-11-26 15:46:07.000000000 -0500
+++ linux-2.6.0-test11-nf2/arch/i386/kernel/apic.c	2003-12-13 23:48:30.000000000 -0500
@@ -1089,6 +1089,37 @@
 	 */
 	irq_stat[cpu].apic_timer_irqs++;
 
+#if defined(CONFIG_MK7) && defined(CONFIG_BLK_DEV_AMD74XX)
+        /*
+         * on 2200XP & nforce2 chipset we need 600ns? 800? 1000? 1100?
+         * from timer irq start to apic irq ack to prevent
+         * hard lockups, use apic timer itself.
+         * C1 disconnect bit related.  Ross Dickson.
+         */
+        {
+                static unsigned int passno, safecnt;
+                if(!passno) { /* calculate timing */
+                        safecnt = apic_read(APIC_TMICT) -
+                                ( (800UL * apic_read(APIC_TMICT) ) /
+                                (1000000000UL/HZ) );
+                        printk("..APIC TIMER ack delay, reload:%u, safe:%u\n",
+                                apic_read(APIC_TMICT), safecnt);
+                        passno++;
+                }
+#if APIC_DEBUG
+                if(passno<12) {
+                        unsigned int at1 = apic_read(APIC_TMCCT);
+                        if( passno > 1 )
+                                Dprintk("..APIC TIMER ack delay, predelay count:%u \n", at1 );
+                        passno++;
+                }
+#endif
+                /* delay only if required */
+                while( apic_read(APIC_TMCCT) > safecnt )
+                        ndelay(100);
+        }
+#endif
+
 	/*
 	 * NOTE! We'd better ACK the irq immediately,
 	 * because timer handling can be slow.*/


diff -urN linux-2.6.0-test11/arch/i386/kernel/io_apic.c linux-2.6.0-test11-nf2/arch/i386/kernel/io_apic.c
--- linux-2.6.0-test11/arch/i386/kernel/io_apic.c	2003-11-26 15:43:32.000000000 -0500
+++ linux-2.6.0-test11-nf2/arch/i386/kernel/io_apic.c	2003-12-13 15:14:25.000000000 -0500
@@ -2128,6 +2128,54 @@
 		printk(KERN_ERR "..MP-BIOS bug: 8254 timer not connected to IO-APIC\n");
 	}
 
+#if defined(CONFIG_ACPI_BOOT) && defined(CONFIG_X86_UP_IOAPIC)
+        /* for nforce2 try vector 0 on pin0
+         * Note 8259a is already masked, also by default
+         * the io_apic_set_pci_routing call disables the 8259 irq 0
+         * so we must be connected directly to the 8254 timer if this works
+         * Note2: this violates the above comment re Subtle but works!
+         */
+        printk(KERN_INFO "..TIMER: Is timer irq0 connected to IOAPIC Pin0? ...\n");
+        if (pin1 != -1) {
+                extern spinlock_t i8259A_lock;
+                unsigned long flags;
+                int tok, saved_timer_ack = timer_ack;
+                /*
+                 * Ok, does IRQ0 through the IOAPIC work?
+                 */
+                io_apic_set_pci_routing ( 0, 0, 0, 0, 0); /* connect pin */
+                unmask_IO_APIC_irq(0);
+                timer_ack = 0;
+
+                /*
+
+
+
+                 * Ok, does IRQ0 through the IOAPIC work?
+                 */
+                spin_lock_irqsave(&i8259A_lock, flags);
+                Dprintk("..TIMER check 8259 ints disabled, imr1:%02x, imr2:%02x\n", inb(0x21), inb(0xA1));
+                tok = timer_irq_works();
+                spin_unlock_irqrestore(&i8259A_lock, flags);
+                if (tok) {
+                        if (nmi_watchdog == NMI_IO_APIC) {
+                                disable_8259A_irq(0);
+                                setup_nmi();
+                                enable_8259A_irq(0);
+                                check_nmi_watchdog();
+                        }
+                        printk(KERN_INFO "..TIMER: works OK on apic pin0 irq0\n" );
+                        return;
+                }
+                /* failed */
+                timer_ack = saved_timer_ack;
+                clear_IO_APIC_pin(0, 0);
+                io_apic_set_pci_routing ( 0, pin1, 0, 0, 0);
+                printk(KERN_ERR "..MP-BIOS bug: 8254 timer not connected to IO-APIC Pin 0\n");
+        }
+/* end new stuff for nforce2 */
+#endif
+
 	printk(KERN_INFO "...trying to set up timer (IRQ0) through the 8259A ... ");
 	if (pin2 != -1) {
 		printk("\n..... (found pin %d) ...", pin2);




* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
  2003-12-15 13:11   ` Maciej W. Rozycki
@ 2003-12-16  7:18     ` Bob
  0 siblings, 0 replies; 35+ messages in thread
From: Bob @ 2003-12-16  7:18 UTC (permalink / raw)
  To: linux-kernel

apic.c patch needs reload:%lu instead of %u  ---------->
printk("..APIC TIMER ack delay, reload:%lu, safe:%u\n",
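
For reference, the arithmetic behind the safe count and the corrected format string can be tried outside the kernel. This is a minimal user-space sketch, not the patch itself: ack_safe_count and format_ack_line are hypothetical names, the reload argument stands in for apic_read(APIC_TMICT), and it assumes the timer reloads once per 1/HZ seconds as the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper mirroring the patch arithmetic.
 * reload   : timer initial count, i.e. counts per 1/hz seconds
 * delay_ns : minimum time wanted between timer reload and the ack
 * Returns the current-count value at or below which at least delay_ns
 * has elapsed (the APIC timer counts down from reload).
 */
static unsigned int ack_safe_count(unsigned long reload,
                                   unsigned long delay_ns,
                                   unsigned long hz)
{
    unsigned long long ns_per_tick, delay_counts;

    assert(hz > 0 && hz <= 1000000000UL);
    ns_per_tick = 1000000000ULL / hz;
    assert(delay_ns <= ns_per_tick);

    /* 64-bit intermediate so delay_ns * reload cannot overflow */
    delay_counts = ((unsigned long long)delay_ns * reload) / ns_per_tick;
    return (unsigned int)(reload - delay_counts);
}

/*
 * Same message as the patch's printk, with %lu for the unsigned long
 * reload value and %u for the unsigned int safe count.
 */
static int format_ack_line(char *buf, size_t len,
                           unsigned long reload, unsigned int safe)
{
    return snprintf(buf, len,
                    "..APIC TIMER ack delay, reload:%lu, safe:%u\n",
                    reload, safe);
}
```

With a made-up reload of 20000 counts per tick at HZ=1000, 800 ns corresponds to 16 counts, so the safe count is 19984; the handler would spin until the current count has dropped to that value or below.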

System: amd xp3000+, 333mhz fsb (166mhz cpu bus clock), 1:1 fsb
to ram, dual channel 2 x 512mb pc3200 sticks tested at cas2.
Award bios with the update that stops the crashes but does not
give an edge timer without a patch.  MSI K7N2 Delta MCP2-T mobo,
linux-2.6.0-test11

This was with 3ware controller and unpatched 2.6.0-test11
Note low MIS score but PIC timer and no nmi--

          CPU0
 0:  244393560          XT-PIC  timer
 1:      31963    IO-APIC-edge  i8042
 2:          0          XT-PIC  cascade
 8:          1    IO-APIC-edge  rtc
 9:          0   IO-APIC-level  acpi
12:     251884    IO-APIC-edge  i8042
14:         22    IO-APIC-edge  ide0
15:         24    IO-APIC-edge  ide1
16:    4290216   IO-APIC-level  3ware Storage Controller, yenta, yenta
17:    5929405   IO-APIC-level  eth0
21:          0   IO-APIC-level  NVidia nForce2
NMI:          0
LOC:  244378698
ERR:          0
MIS:          6

Next is with the first edge timer patch: nmi_watchdog=2
works but =1 does not, and MIS is really high ("noisy bus").
Replacing the 3ware with promise cards and setting udma133
with hdparm causes an apic error to be logged to the console
during a bonnie++ test--

>>APIC error on CPU0: 02(02)
>>what?? no crash though.
>>    
>>
>>bob@where cat /proc/interrupts
>>           CPU0      
>>  0:    3350153    IO-APIC-edge  timer
>>  1:       5775    IO-APIC-edge  i8042
>>  2:          0          XT-PIC  cascade
>>  8:          1    IO-APIC-edge  rtc
>>  9:          0   IO-APIC-level  acpi
>> 12:       5385    IO-APIC-edge  i8042
>> 14:         10    IO-APIC-edge  ide0
>> 15:         10    IO-APIC-edge  ide1
>> 16:    1717957   IO-APIC-level  ide2, ide3, eth0
>> 19:     472929   IO-APIC-level  ide4, ide5
>> 21:          0   IO-APIC-level  NVidia nForce2
>>NMI:        822
>>LOC:    3350073
>>ERR:         35
>>MIS:      15818
>>    
>>

Now with promise controllers again: the new edge timer patch
permits nmi_watchdog=1 but not =2, there are lots of nmi ticks,
and the MIS count is under half of what it was with the first
timer patch. NMI ticks = LOC?

bob@where cat /proc/interrupts
           CPU0      
  0:   46188571    IO-APIC-edge  timer
  1:      12396    IO-APIC-edge  i8042
  2:          0          XT-PIC  cascade
  8:          1    IO-APIC-edge  rtc
  9:          0   IO-APIC-level  acpi
 12:     147429    IO-APIC-edge  i8042
 14:         10    IO-APIC-edge  ide0
 15:         10    IO-APIC-edge  ide1
 16:    1413705   IO-APIC-level  ide2, ide3, eth0
 17:          0   IO-APIC-level  yenta, yenta
 19:     258804   IO-APIC-level  ide4, ide5
 21:          0   IO-APIC-level  NVidia nForce2
NMI:   46188592
LOC:   46188482
ERR:         36
MIS:       6877
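
Since the three /proc/interrupts dumps cover very different uptimes, the raw MIS counts are easier to compare when scaled by the timer interrupt count on IRQ 0. A small sketch of that normalisation follows; per_million is a hypothetical helper name, and the figures in the note after it are taken from the dumps in this message.

```c
#include <assert.h>

/*
 * Hypothetical helper: events per million timer interrupts,
 * rounded down. timer_irqs is the IRQ 0 count from /proc/interrupts.
 */
static unsigned long long per_million(unsigned long long events,
                                      unsigned long long timer_irqs)
{
    assert(timer_irqs > 0);
    return events * 1000000ULL / timer_irqs;
}
```

Fed with the dumps above, this gives 0 MIS per million timer interrupts for the unpatched kernel (6 / 244393560), about 4721 with the first timer patch (15818 / 3350153) and about 148 with the new one (6877 / 46188571).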

Now I'll try 800UL/100ndelay to see if it helps with
MIS count(pseudo-sci masochism), be back in a while.

Oh, by the way, I set debug 1 in apic.h but I don't
see anything, and I thought I saw a compile error
flash by, so now I'll compile > logfile 2>&1 and
might see why I don't see--

"..APIC TIMER ack delay, predelay count: 20769"

I don't see any of that debug stuff. Maybe the compile
errors I found were it, see my previous message about
"unsigned int format", maybe printk needs %lu (I don't
know hardly nuffing yet). I'm going to boot 800UL/100ndelay
now.

it needs reload:%lu instead of %u  ---------->
printk("..APIC TIMER ack delay, reload:%lu, safe:%u\n",

Ross: "Can you also advise if your bios setting of the
"C1 disconnect" is set"

I can only guess by my 41C low load 48C high load
temps exactly equal to range for "2.1Ghz 333mhz"
of Ian Kumlien(his?) which is same speed as mine,
that probably cpu disconnect is not on. I have
no visible choice in setup for cpu disconnect.
I'll try athcool to see how disconnect is set.

Ross:"I have heard lockups are not supposed to happen
at all if the fsb (host bus clock speed) matches the
ddr speed. One of my systems went about 4 hours (xp2500
333fsb, DDR333) without the apic delay patch on a phoenix 
bios before lockup"

A couple of months ago, before the bios update, I was
overly optimistic a couple of times. It seemed to work
to use 1:1 and only the amd74xx onboard hd controller, no
hd cards, with pre-emptive, anticipatory sched not
deadline, apic off in setup but on in linux, lapic
off, acpi on. It was almost stable if using only one
drive, but I really can't go without hd cards for
software raid, and with an hd card the first fsck on
boot would crash. I could finesse stability by using
options but never quite reach reliability without a
bios update. Certain functions still need patching, and
I still have "MIS count, noisy bus" and the agp8 crash (I can
use the X nv driver and agpgart no problem, but not nvidia
drivers for X and agp8).


end of thread, other threads:[~2003-12-16  7:18 UTC | newest]

Thread overview: 35+ messages
2003-12-07 13:12 Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered Ross Dickson
2003-12-09 15:20 ` Maciej W. Rozycki
2003-12-10  5:43   ` Ross Dickson
2003-12-10 16:06     ` Maciej W. Rozycki
2003-12-11  6:55       ` Ross Dickson
2003-12-11 11:47         ` Ian Kumlien
2003-12-11  9:12           ` Ross Dickson
2003-12-11 17:52             ` Ian Kumlien
2003-12-11 18:21               ` Jesse Allen
2003-12-12  9:27                 ` Bob
2003-12-12 16:59                   ` Working nforce2, was " Jesse Allen
2003-12-12 17:18                     ` Jesse Allen
2003-12-12 18:18                     ` Josh McKinney
2003-12-12 19:29                       ` Jesse Allen
2003-12-12 21:42                       ` Craig Bradney
2003-12-13  4:18                       ` Bob
2003-12-13  6:34                     ` Bob
2003-12-11 14:58           ` Jesse Allen
2003-12-11 15:20             ` Craig Bradney
2003-12-11 16:05               ` Jesse Allen
2003-12-11 15:15         ` Maciej W. Rozycki
2003-12-11 16:23           ` Josh McKinney
2003-12-11 17:04             ` Maciej W. Rozycki
2003-12-11 17:25               ` Jesse Allen
2003-12-10  3:39 ` Jesse Allen
2003-12-10  9:22   ` Ross Dickson
2003-12-10 10:00   ` Mikael Pettersson
2003-12-10  8:40     ` Ross Dickson
2003-12-11 14:32     ` Jesse Allen
  -- strict thread matches above, loose matches on Subject: below --
2003-12-13  5:16 Working nforce2, was " Ross Dickson
2003-12-13  6:04 ` Jesse Allen
2003-12-13  9:20 Ross Dickson
2003-12-13  9:51 ` Bob
2003-12-15 14:30 Fwd: " Ross Dickson
2003-12-15 15:02 ` Craig Bradney
2003-12-15 16:54   ` Ross Dickson
2003-12-16  6:07     ` Bob
     [not found] <200312132040.00875.ross@datscreative.com.au>
2003-12-13 12:00 ` Fwd: " Bob
2003-12-15 13:11   ` Maciej W. Rozycki
2003-12-16  7:18     ` Bob
