public inbox for linux-kernel@vger.kernel.org
* Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
@ 2003-12-07 13:12 Ross Dickson
  2003-12-09 15:20 ` Maciej W. Rozycki
  2003-12-10  3:39 ` Jesse Allen
  0 siblings, 2 replies; 35+ messages in thread
From: Ross Dickson @ 2003-12-07 13:12 UTC
  To: linux-kernel; +Cc: AMartin, ross, andre, kernel

[-- Attachment #1: Type: text/plain, Size: 5471 bytes --]

Greetings,
I am not subscribed, so please cc responses.
I have monitored the list and know my nforce2 experiences have been common.
The attached patches are in a single bzip2 tarball.

I have Albatron KM18G Pro & Epox 8RGA+ MOBOs both using nforce2 chipsets.
I made up a kernel as follows.
Get std 2.4.22 src
apply patch-2.4.23
apply 2.4.22-low-latency.patch
apply preempt-kernel-rml-2.4.23-pre5-1.patch
apply vhz-j64-2.4.22.patch

One patch fails on inode.c, dispose_list() so I placed conditional_schedule() as follows
=static void dispose_list(struct list_head *head)
={
=	int nr_disposed = 0;
=
=	while (!list_empty(head)) {
=		struct inode *inode;
=		conditional_schedule();

Configured for Athlon with 1000 Hz ticks, preempt and low-latency on.
Compiled and installed nvnet & nvidia video driver.

Disclaimer: The following information and code patches are not fully tested and may be
dangerous. Also, these are the first patches I have made for public consumption, so I hope
that their format works.

Note also that the patches are against 2.4.22 even though they were developed
against the heavily patched 2.4.23 mentioned above. The patch code is the same for both
kernels but at different line numbers.

When I enabled either apic or io-apic in kern config, lockups came hard and fast.
Particularly bad under hard disk load. Heaps of lost ints on irq7 in apic and ioapic mode. 
Lockups disappeared when I lowered the ide hda udma speed to mode 3 with hdparm so
I went looking for answers which now follow.

There are three parts to this email.
a) apic mods.
b) io-apic mods
c) ide driver mods

a) Lockups are due to too fast an apic acknowledge of apic timer int.
Apic hard locked up the system - no nmi debug available.
Fixed it by introducing a delay of at least 500ns into smp_apic_timer_interrupt() 
just prior to ack_APIC_irq().
See attached diff file "nforce2-apic.c-2.4.22.patch" for details. 
I have guessed at a suitable cpu speed dependent delay.
Perhaps someone with AMD cpu docs (apic timing specs)  & analyser tools could refine it.

Maybe nforce2 chipset really is very quick accessing ram in dual dimm mode? 
Or AMD 2200XP has a really slow APIC?

--- linux-2.4.22/arch/i386/kernel/apic.c	2003-06-14 00:51:29.000000000 +1000
+++ linux-2.4.22-rd/arch/i386/kernel/apic.c	2003-12-07 18:27:32.000000000 +1000
@@ -1078,6 +1078,15 @@
 	 */
 	apic_timer_irqs[cpu]++;

+#if defined(CONFIG_MK7) && defined(CONFIG_BLK_DEV_AMD74XX)
+	/*
+	 * on 2200XP & nforce2 chipset we need at least 500ns delay here
+	 * to stop lockups with udma100 drive. try to scale delay time
+	 * with cpu speed. Ross Dickson.
+	 */
+	ndelay((cpu_khz >> 12)+200 ); /* don't ack too soon or hard lockup */
+#endif
+
 	/*
 	 * NOTE! We'd better ACK the irq immediately,
 	 * because timer handling can be slow.


b) I was also disappointed to see I could not have irq0 timer IO-APIC-edge. 
So I have fixed it too (tested on both my epox and albatron MOBOs).
Firstly I found the 8254 connected directly to pin 0, not pin 2, of the io-apic.
I have modified check_timer() in io_apic.c to trial-connect pin 0 and test it
after the existing test for connection to the io-apic.
See attached diff file nforce2-io-apic.c-2.4.22 for details.

--- linux-2.4.22/arch/i386/kernel/io_apic.c	2003-08-25 21:44:39.000000000 +1000
+++ linux-2.4.22-rd/arch/i386/kernel/io_apic.c	2003-12-07 18:40:40.000000000 +1000
@@ -1614,9 +1614,44 @@
 			return;
 		}
 		clear_IO_APIC_pin(0, pin1);
-		printk(KERN_ERR "..MP-BIOS bug: 8254 timer not connected to IO-APIC\n");
+		printk(KERN_ERR "..MP-BIOS bug: 8254 timer not connected to IO-APIC pin%d\n",pin1);
 	}

+#if defined(CONFIG_ACPI_BOOT) && defined(CONFIG_X86_UP_IOAPIC)
+	/* for nforce2 try vector 0 on pin0
+	 * Note the io_apic_set_pci_routing call disables the 8259 irq 0
+	 * so we must be connected directly to the 8254 timer if this works
+	 * Note2: this violates the above comment re Subtle but works!
+	 */
+	printk(KERN_INFO "..TIMER: Is timer irq0 connected to IOAPIC Pin0? ...\n");
+	if ( pin1 != -1 && nr_ioapics ) {
+		int saved_timer_ack = timer_ack;
+		/* next call also disables 8259 irq0 */
+		int result = io_apic_set_pci_routing ( 0, 0, 0, 0, 0);
+		/*
+		 * Ok, does IRQ0 through the IOAPIC work?
+		 */
+		unmask_IO_APIC_irq(0);
+		timer_ack = 0 ;
+		if (timer_irq_works()) {
+			if (nmi_watchdog == NMI_IO_APIC) {
+				disable_8259A_irq(0);
+				setup_nmi();
+				enable_8259A_irq(0);
+				check_nmi_watchdog();
+			}
+			printk(KERN_INFO "..TIMER: works OK on apic pin0 irq0\n" );
+			return;
+		}
+		/* failed */
+		timer_ack = saved_timer_ack;
+		clear_IO_APIC_pin(0, 0);
+		result = io_apic_set_pci_routing ( 0, pin1, 0, 0, 0);
+		printk(KERN_ERR "..MP-BIOS bug: 8254 timer not connected to IO-APIC Pin 0\n");
+	}
+#endif
+/* end new stuff for nforce2 */
+
 	printk(KERN_INFO "...trying to set up timer (IRQ0) through the 8259A ... ");
 	if (pin2 != -1) {
 		printk("\n..... (found pin %d) ...", pin2);

c) Finally, during my fault finding I merged A. Martin's patches for the nforce2 IDE driver.
I note that the nforce2 address setup timing bits are different to the AMD ones.
I have assumed the nforce2 address timings apply to nforce and nforce3 chipsets.
I could be wrong, so could someone with the nvidia docs please check it.
I have also not tested it with anything but a WDC ata100 hard drive.
For info see attached patch files (I think the pci ids are already in 2.4.23):
nforce2-amd74xx.c-2.4.22.patch, nforce2-amd74xx.h-2.4.22.patch, nforce2-pci_ids.h-2.4.22.patch

Thanks
Ross Dickson

[-- Attachment #2: ross-diffs.tar.bz2 --]
[-- Type: application/x-tbz, Size: 4375 bytes --]

* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
@ 2003-12-13  5:16 Ross Dickson
  2003-12-13  6:04 ` Jesse Allen
  0 siblings, 1 reply; 35+ messages in thread
From: Ross Dickson @ 2003-12-13  5:16 UTC
  To: cbradney; +Cc: linux-kernel, AMartin, Ian Kumlien

<snip>
> > The thing that strikes me funny is that you get no crashes with the
> > updated BIOS and Disconnect on, but without the updated BIOS we have
> > to turn disconnect off with athcool or the patch? This makes me think
> > that there is some voodoo going on in the BIOS update that they aren't
> > saying, surprise surprise, or something is just slowing down the time
> > it takes for it to crash. I say this because I have gone 5+ days
> > without any of the patches from these threads, acpi apic lapic
> > enabled, and CPU disconnect on as stated by athcool. This was with
> > much stress testing, idle time, etc. One day I just ran a grep that I
> > have done probably 30 times and boom, hang.
> >
> > Good luck, hope the BIOS is the trick, now off to see how I can get
> > ASUS to put the C1 Disconnect in the next revision.
>
> Yes, that's how it was for me.. I was the only one here saying "no
> problems, la la la", then at about 5.25 days.. boom. Then the next day
> it crashed twice. Hopefully you make some progress with ASUS.. (for the
> A7N8X Deluxe as well as your mobo please :) ).
>
> I've been playing with hardware in the past few days (new quieter Zalman
> PSU, and Zalman 7000 Cu fan etc) so no uptime to speak of here now. I
> did compile KDE 3.2 beta 2 last night though.. 6 hours of solid
> compilation.. no hassles. I have never turned off Disconnect either.
>
> Thanks to all you guys who are working on this one. Seems to be getting
> somewhere.
>
> Craig

I wonder about the "voodoo" because my apic ack delay patch was developed
without knowledge of the C1 disconnect bit and reports I have received so far
are that the hard lockups go away when using it independent of the state of the 
disconnect bit. Apparently the bit was on in my test systems. 

Ian Kumlien pointed out the linkage with the northbridge timing signals 
to the CPU to do with the connect disconnect handshake so I now wonder just how
programmable the nforce2 northbridge is? Is it a bit fpga'ish in that they may be
using the bios boot to alter the handshake timing enough to accomplish what
the ack delay does but like it should be - transparent to the OS?

Of course they -the makers- have access to knowledge we don't so it could be 
something completely different that they are doing!

In short I agree with the suggestion that the new bios options do more behind
the scenes than what the athcool and disconnect patches do. 

I am pretty sure that I read somewhere that when the epox boards 
were first released the epox 8rda bios started out with it (the disconnect bit) off
then the 8rga+ came out with it on by default? So back then people were wanting
to turn it on in the 8rda to lower their CPU temperature - now some want it off
in search of stability? Back then under win.... some experienced lockups depending
on which IDE driver was used and which state the bit was in!

Out of interest, has anyone seen new disconnect bit options in the Phoenix bios or
only in the Award bios?

Finally I have done some more work and found that the ack delay patch on my
system is about 13 apic timer counts, about half that required to write a byte 
directly outb(0x00, 0x378) to the printer port at 28 apic timer counts. 
So the ack delay is about twice as quick as writing a single EOI to the 8259 in
XTPIC mode provided the 8259 accesses are not souped up under the hood.
In other words, whilst it is a timing hit it is not much of one, and it won't be
needed once this is all fixed by the respective manufacturers - let's hope they
can do it on the hardware we have already bought.

Regards
Ross Dickson




* Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
@ 2003-12-13  9:20 Ross Dickson
  2003-12-13  9:51 ` Bob
  0 siblings, 1 reply; 35+ messages in thread
From: Ross Dickson @ 2003-12-13  9:20 UTC
  To: linux-kernel

<snip>
> > I decided to wait till this morning, to try the BIOS "C1 Disconnect" set to
> > enabled. Still no lockups under this kernel. Tried a vanilla kernel, no
> > lockups (but timer and watchdog messed up still). Now that I read
> > your message Bob, I understand what you are saying. Luckily, the
> > updated BIOS changelog states "Add C1 disconnect item." And this exact
> > version seems to have fixed it, and now we have an exact fix (another one?)
> > to refer to.
> >
> > So the fix was absolutely a BIOS fix.
> >
<snip>

> > That's why I'm trying to contact shuttle.
> > Jesse

Good work Jesse, I hope shuttle give up some info - especially as I have
Phoenix bioses, and who knows what they are doing about it?


> ...but we're stuck looking at smoke and mirrors, 
> when the kernel might be able to work around 
> bioses that have not been "updated". Or to put 
> it another way, "voodoo" may be done by 
> kernel if not done by bios. Whatever is being 
> tweaked may be accessible to kernel code. 
<snip>
Bob

Please ignore the following if you are already up to speed on SMM. Some
readers may not know why we cannot do all that the bios can do aside from
a lack of information.
 
Agreed, but the keywords are "might" and "may". I remember doing DOS based data acquisition
with 486SX laptops, and then Intel brought out the 486SL and our pulse counting
went bad because of the power saving core. I got the data book from Intel and
was very dismayed to see that bios code was being executed when I thought our code
was running, and there was not a darn thing I could do about it and keep the
laptop warranty intact.

Its offspring, as you may already know, is SMM. It is a privileged mode that we can
do pretty much squat about. It can pop up anywhere in the middle of our code
and the only thing we will know about it, aside from missing time, is when it has
stuffed something up - like setting registers back to the wrong values. Think of
it like a kernel within our kernel with permissions set so it can hack us but we
cannot hack it.

Maciej recently wrote about its continuing effect on NMI debug here:

http://linux.derkeiler.com/Mailing-Lists/Kernel/2003-12/2940.html

Regards
Ross.


* Re: Fwd: Re: Working nforce2, was Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered
@ 2003-12-15 14:30 Ross Dickson
  2003-12-15 15:02 ` Craig Bradney
  0 siblings, 1 reply; 35+ messages in thread
From: Ross Dickson @ 2003-12-15 14:30 UTC
  To: Maciej W. Rozycki; +Cc: recbo, linux-kernel

>> APIC error on CPU0: 02(02) 
> > what?? no crash though. 
> [...] 
> > bob@where cat /proc/interrupts 
> > CPU0 
> > 0: 3350153 IO-APIC-edge timer 
> > 1: 5775 IO-APIC-edge i8042 
> > 2: 0 XT-PIC cascade 
> > 8: 1 IO-APIC-edge rtc 
> > 9: 0 IO-APIC-level acpi 
> > 12: 5385 IO-APIC-edge i8042 
> > 14: 10 IO-APIC-edge ide0 
> > 15: 10 IO-APIC-edge ide1 
> > 16: 1717957 IO-APIC-level ide2, ide3, eth0 
> > 19: 472929 IO-APIC-level ide4, ide5 
> > 21: 0 IO-APIC-level NVidia nForce2 
> > NMI: 822 
> > LOC: 3350073 
> > ERR: 35 
> > MIS: 15818 

> It looks like the infamous APIC delivery bug -- the "MIS" counter shows
> how many level-triggered interrupts have been erroneously delivered as
> edge-triggered ones. No wonder the system shows instability -- you have
> noise problems at the APIC bus.
 
Thanks Maciej
I was wondering about those. I had seen the workaround code and would not
have thought it need apply to recent Athlon chipsets?


For comparison here is my /proc/interrupts
CPU0
  0:   50462204    IO-APIC-edge  timer
  1:      49153    IO-APIC-edge  keyboard
  2:          0          XT-PIC  cascade
  9:          0   IO-APIC-level  acpi
 12:     395912    IO-APIC-edge  PS/2 Mouse
 14:     995872    IO-APIC-edge  ide0
 15:        283    IO-APIC-edge  ide1
 16:    3921102   IO-APIC-level  nvidia
 18:          2   IO-APIC-level  bttv
 20:     136325   IO-APIC-level  eth0, usb-ohci
 21:     146903   IO-APIC-level  ehci_hcd, NVIDIA nForce Audio
 22:          0   IO-APIC-level  usb-ohci
NMI:          0
LOC:   50457798
ERR:          0
MIS:          0

Albatron KM18G-Pro, nforce2, Phoenix bios, 2200XP, 255fsb, ddr400,
ide0 is hard drive, ide1 is cdrom, nmi watchdog off

Report seems OK but this machine locks up hard without the apic delay patch.

I am currently trying the simpler v1 (always add a delay) patch but on all apic
acks as per this posting

http://linux.derkeiler.com/Mailing-Lists/Kernel/2003-12/3291.html

which is a reply to an earlier posting of the same name, but I accidentally
omitted the "Re" in the subject.

Regards,
Ross.


[parent not found: <200312132040.00875.ross@datscreative.com.au>]

end of thread, other threads:[~2003-12-16  7:18 UTC | newest]

Thread overview: 35+ messages
2003-12-07 13:12 Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered Ross Dickson
2003-12-09 15:20 ` Maciej W. Rozycki
2003-12-10  5:43   ` Ross Dickson
2003-12-10 16:06     ` Maciej W. Rozycki
2003-12-11  6:55       ` Ross Dickson
2003-12-11 11:47         ` Ian Kumlien
2003-12-11  9:12           ` Ross Dickson
2003-12-11 17:52             ` Ian Kumlien
2003-12-11 18:21               ` Jesse Allen
2003-12-12  9:27                 ` Bob
2003-12-12 16:59                   ` Working nforce2, was " Jesse Allen
2003-12-12 17:18                     ` Jesse Allen
2003-12-12 18:18                     ` Josh McKinney
2003-12-12 19:29                       ` Jesse Allen
2003-12-12 21:42                       ` Craig Bradney
2003-12-13  4:18                       ` Bob
2003-12-13  6:34                     ` Bob
2003-12-11 14:58           ` Jesse Allen
2003-12-11 15:20             ` Craig Bradney
2003-12-11 16:05               ` Jesse Allen
2003-12-11 15:15         ` Maciej W. Rozycki
2003-12-11 16:23           ` Josh McKinney
2003-12-11 17:04             ` Maciej W. Rozycki
2003-12-11 17:25               ` Jesse Allen
2003-12-10  3:39 ` Jesse Allen
2003-12-10  9:22   ` Ross Dickson
2003-12-10 10:00   ` Mikael Pettersson
2003-12-10  8:40     ` Ross Dickson
2003-12-11 14:32     ` Jesse Allen
  -- strict thread matches above, loose matches on Subject: below --
2003-12-13  5:16 Working nforce2, was " Ross Dickson
2003-12-13  6:04 ` Jesse Allen
2003-12-13  9:20 Ross Dickson
2003-12-13  9:51 ` Bob
2003-12-15 14:30 Fwd: " Ross Dickson
2003-12-15 15:02 ` Craig Bradney
2003-12-15 16:54   ` Ross Dickson
2003-12-16  6:07     ` Bob
     [not found] <200312132040.00875.ross@datscreative.com.au>
2003-12-13 12:00 ` Fwd: " Bob
2003-12-15 13:11   ` Maciej W. Rozycki
2003-12-16  7:18     ` Bob
