public inbox for linux-kernel@vger.kernel.org
* Re: test13-pre3
  2000-12-18 17:35 test13-pre3 Petr Vandrovec
@ 2000-12-18 17:18 ` Maciej W. Rozycki
  2000-12-18 18:48   ` Problem with UDMA 4 - deadlocking machine Zdenek Kabelac
  0 siblings, 1 reply; 3+ messages in thread
From: Maciej W. Rozycki @ 2000-12-18 17:18 UTC (permalink / raw)
  To: Petr Vandrovec; +Cc: Kernel Mailing List, mingo, torvalds

On Mon, 18 Dec 2000, Petr Vandrovec wrote:

> No. Without a udelay() before the first printk() it just does not boot on my
> motherboard. There were two choices: either remove all printk() calls from
> these loops (define Dprintk to nothing), or add udelay(x), where x >= 200,
> before the first printk(). I sent the patch twice to linux-kernel and to
> mingo@redhat.com, and nobody said anything against it.

 I see.  But are you sure this is the right fix?  You may be covering
the real problem with this arbitrary delay.

 I haven't actually noticed any of your previous mails -- given the load
on the list I sometimes miss letters with "uninteresting" subjects. 

> No. If there is no udelay() before the first printk(), on my GA-6VXD7 board
> (SMP VIA 694X) only 'Startup point 1.' is printed, but not 'Waiting
> for send to finish...'. So maybe we do not need the udelay(200) below the
> loop, but we certainly need a udelay() before the first printk(). (My board
> works without ANY udelay() in smpboot.c except the one I added; that one is
> required.) If somebody lives in Prague and wants to come with a logic

 Other delays are imposed by the MPS (most if not all of them).  For
example there are systems that assert RESET to a CPU as a result of an
INIT IPI.  These systems need these delays to allow CPUs to recover. 

> analyzer (or if I should come with the motherboard), I'm willing to continue
> testing. The current idea is that the inb/outb done by the cursor positioning
> code is incompatible with something else done during secondary CPU startup.

 Have you tried putting explicit display adapter (or other ISA) I/O accesses
after sending the IPI to see if they trigger the problem?  IPIs are
transmitted over the inter-APIC bus and should be completely invisible to
other parts of the system.  But the code involved in processing a printk() 
may interact with the one executed by another CPU right after waking it
up.  It would be worth investigating...

> (It also boots without any kernel change with 'console=ttyS0,9600', but
> that may be because of the slow serial line.)

 Or by running different code.

> Without delay() both CPUs die, and the board no longer reacts to anything
> except hard reset (and sometimes not even to hard reset; look up my
> messages from last week).

 Now THAT is weird.  It might mean a chipset bug.  Still no idea how an
inter-APIC message might trigger it -- it completely bypasses MB
chipset...  Hmm, maybe not...  Is your I/O APIC discrete (like Intel's
82093AA) or integrated?  It appears there are vendors manufacturing I/O
APIC clones and this may imply new problems, sigh...

  Maciej

-- 
+  Maciej W. Rozycki, Technical University of Gdansk, Poland   +
+--------------------------------------------------------------+
+        e-mail: macro@ds2.pg.gda.pl, PGP key available        +

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/


* Re: test13-pre3
@ 2000-12-18 17:35 Petr Vandrovec
  2000-12-18 17:18 ` test13-pre3 Maciej W. Rozycki
  0 siblings, 1 reply; 3+ messages in thread
From: Petr Vandrovec @ 2000-12-18 17:35 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: Kernel Mailing List, mingo, torvalds

On 18 Dec 00 at 13:58, Maciej W. Rozycki wrote:

>  What is this change about?
> 
> diff -u --recursive --new-file v2.4.0-test12/linux/arch/i386/kernel/smpboot.c linux/arch/i386/kernel/smpboot.c
> --- v2.4.0-test12/linux/arch/i386/kernel/smpboot.c      Mon Dec 11 17:59:43 2000
> +++ linux/arch/i386/kernel/smpboot.c    Thu Dec 14 14:54:40 2000
> @@ -694,6 +694,11 @@
>         apic_write_around(APIC_ICR, APIC_DM_STARTUP
>                     | (start_eip >> 12));
>  
> +       /*
> +        * Give the other CPU some time to accept the IPI.
> +        */
> +       udelay(300);
> +
>         Dprintk("Startup point 1.\n");
>  
>         Dprintk("Waiting for send to finish...\n");
> 
> The following code comes right after it, making the above change
> just useless garbage:

No. Without a udelay() before the first printk() it just does not boot on my
motherboard. There were two choices: either remove all printk() calls from
these loops (define Dprintk to nothing), or add udelay(x), where x >= 200,
before the first printk(). I sent the patch twice to linux-kernel and to
mingo@redhat.com, and nobody said anything against it.
 
>         timeout = 0;
>         do {
>             Dprintk("+");
>             udelay(100);
>             send_status = apic_read(APIC_ICR) & APIC_ICR_BUSY;
>         } while (send_status && (timeout++ < 1000));
> 
>         /*
>          * Give the other CPU some time to accept the IPI.
>          */
>         udelay(200);
> 
> If we need 600usecs of delay for certain systems, then why not just make
> it like below?

No. If there is no udelay() before the first printk(), on my GA-6VXD7 board
(SMP VIA 694X) only 'Startup point 1.' is printed, but not 'Waiting
for send to finish...'. So maybe we do not need the udelay(200) below the
loop, but we certainly need a udelay() before the first printk(). (My board
works without ANY udelay() in smpboot.c except the one I added; that one is
required.) If somebody lives in Prague and wants to come with a logic
analyzer (or if I should come with the motherboard), I'm willing to continue
testing. The current idea is that the inb/outb done by the cursor positioning
code is incompatible with something else done during secondary CPU startup.
(It also boots without any kernel change with 'console=ttyS0,9600', but
that may be because of the slow serial line.)

Without delay() both CPUs die, and the board no longer reacts to anything
except hard reset (and sometimes not even to hard reset; look up my
messages from last week).
                                    Best regards,
                                            Petr Vandrovec
                                            vandrove@vc.cvut.cz
                                            

* Problem with UDMA 4 - deadlocking machine
  2000-12-18 17:18 ` test13-pre3 Maciej W. Rozycki
@ 2000-12-18 18:48   ` Zdenek Kabelac
  0 siblings, 0 replies; 3+ messages in thread
From: Zdenek Kabelac @ 2000-12-18 18:48 UTC (permalink / raw)
  To: andre, linux-abit, morton

Hello

Last Saturday I spent some time testing a deadlock that has been biting
me for a long time on my BP6 SMP board (practically since I started
using the ATA66 controller). Before you stop reading and say I should
buy a different board, please bear with me for a few moments. (It's a
bit long, but I think there are a few interesting points.)

First I'll describe my hw configuration - BP6 128MB 2x400MHz CPU
(usually running @92MHz FSB - ~560MHz)
Cards: SBLive slot 2, TV-Phone98 slot 5, G400Max AGP
Hard drives: 6GB Western on UDMA33 (contains only vfat partitions),
             CD-ROM Ricoh 9060A UDMA33,
             30GB IBM UDMA66 hpt366 (contains only ext2 partitions).
There are no shared interrupts in /proc/interrupts.

hdparm gives me around 34MB/s for the UDMA66 drive and 8MB/s for the
WD 6GB drive.

For testing I've used the following simple sequence:

I created a 650MB file on the UDMA66 drive and kept copying it to the
UDMA33 vfat partition:
(while : ; do cp /ext2/my650 /vfat/  ; md5sum /vfat/my650 ; done)

When this test ran correctly 6 times, I considered the setup stable.

The kernel used for testing was a CLEAN UNPATCHED 2.4.0-test12
(both with and without SMP support).

Here are some initial results:

After a reboot, running this on the console: everything was OK.
Starting the X server and not doing anything else in it (e.g. moving
the mouse): again everything was OK.

And now to the less positive part - as soon as I started programs
which use the SBLive or the TV card, the lockups began.

After a while I came to the conclusion that it doesn't matter whether
it's just the emu10k1 driver, just the bttv driver, or both of them
running at the same time - whenever either of them was running I was
unable to copy the file from UDMA66 to UDMA33 (the lockup usually
occurred after about 200MB had been copied).

For the bulk of the testing I used just fbtv on the Linux matrox
framebuffer console, with only bttv loaded, as this was usually the
fastest way to reach the deadlock.

For the tests I flashed around 9 different BP6 BIOSes (always clearing
all the BIOS settings) - LP, NJ, QQbeta, QQ-2, and a couple of RU
BIOSes - usually I'm using RU with hpt 1.26. I also checked both the
non-overclocked and the overclocked setup with each BIOS.

After a few hours the clear result was:

The Linux kernel, with or without SMP, ALWAYS locks up during this
test - it never finished more than 1 copy of this file - and the crash
is always a complete deadlock. The console says something about
"hdf: lost interrupt" first, but I've already reported this and no one
seems to care. (I'll repeat: I used all the BIOSes known to me and a
correctly clocked CPU for this test, and it always failed.)

So I came to the conclusion that there could be several possible
problems:

UDMA 4 mode is not correctly programmed for the hpt366 controller (as
the only one who seems to understand the settings for this chipset is
Andre himself, I'm afraid he is the only one who could play with them
and possibly fix it)

UDMA 4 mode is incorrectly programmed in that it doesn't take into
account that some other interrupt services might lock the system bus
for a while (again this is probably for Andre Hedrick). (I've also
noticed a few other people complaining about problems while copying
huge files - and they didn't have a BP6 board, so maybe it's a more
general problem.)

BP6 hpt366 is a completely bad piece of hardware (I hope it's not that
bad)

OK, after this I also found a partial solution for this problem which
is relatively easy - downgrading the hpt366 from UDMA 4 to UDMA 2 with
'hdparm -X66 /dev/hdf'. After issuing this command none of my tests
failed (8 copies without any problems while playing TV and mp3 files
with xmms). (BTW, is there some way I could turn it back to UDMA 4
mode??)

So hpt366 & Linux can coexist peacefully, they just can't use UDMA4
mode. I also ran this test with both drives on the UDMA33 controllers -
again without a single problem.

So that's probably all I wanted to say. Maybe some advice for BP6
users: if they want a stable system, they should probably not use
UDMA 4 mode (use -X66 for hpt366, or just ignore the UDMA66 controller
altogether), especially with devices which generate a lot of
interrupts, like sound cards, TV cards, network cards :)

And a strange note for the end: I usually had problems even booting
the machine when it was not overclocked and was running at 66MHz FSB.
In that case, however, the "hdf: lost interrupt" message did not lead
to a deadlock of the machine (deadlock means for me that the computer
doesn't react even to Magic SysRq) - after a few messages about lost
interrupts the system turned off DMA and continued with the boot
process. Using a computer with just 3MB/s throughput is not much fun,
though (also there were no APIC errors with 66MHz FSB). This boot
problem has never happened to me while running with 92MHz FSB, even
though APIC errors make the console practically unusable.

I think this letter is already way too long - so now I'm curious
whether this will be helpful to anyone....

-- 
             There are three types of people in the world:
               those who can count, and those who can't.
  Zdenek Kabelac  http://i.am/kabi/ kabi@i.am {debian.org; fi.muni.cz}
