Re: Oops in sched.c on PPro SMP

From: Andrea Arcangeli <andrea@suse.de>
To: Peter Waechtler <pwaechtler@mac.com>
Cc: linux-kernel@vger.kernel.org, mingo@redhat.com
Subject: Re: Oops in sched.c on PPro SMP
Date: Mon, 16 Sep 2002 17:42:33 +0200
Message-ID: <20020916154233.GH11605@dualathlon.random>
In-Reply-To: <174178B9-C980-11D6-8873-00039387C942@mac.com>

On Mon, Sep 16, 2002 at 04:25:03PM +0200, Peter Waechtler wrote:
> Process setiathome (pid: 2035, stackpage=c45d3000)
	  ^^^^^^^^^^ stress the cpu

>      651:   81 f9 00 00 00 00       cmp    $0x0,%ecx
				      ^^^^^^^^^^^^^^^^^
>      657:   74 26                   je     67f <schedule+0x26f>
>      659:   bb 14 00 00 00          mov    $0x14,%ebx
>      65e:   89 f6                   mov    %esi,%esi
>             p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
>      660:   8b 51 20                mov    0x20(%ecx),%edx  <= CRASH
>      663:   d1 fa                   sar    %edx
>      665:   89 d8                   mov    %ebx,%eax
>      667:   2b 41 24                sub    0x24(%ecx),%eax
>      66a:   c1 f8 02                sar    $0x2,%eax
>      66d:   8d 54 10 01             lea    0x1(%eax,%edx,1),%edx
>      671:   89 51 20                mov    %edx,0x20(%ecx)
>      674:   8b 49 48                mov    0x48(%ecx),%ecx
>      677:   81 f9 00 00 00 00       cmp    $0x0,%ecx
				      ^^^^^^^^^^^^^^^^
>      67d:   75 e1                   jne    660 <schedule+0x250>
>         read_unlock(&tasklist_lock);
>      67f:   f0 ff 05 00 00 00 00    lock incl 0x0
>         spin_lock_irq(&runqueue_lock);
>      686:   fa                      cli

As you said, apparently %ecx cannot be zero. But I bet the 0x00 is
really a placeholder waiting for relocation at link time. It shouldn't
be zero in fact; it should be the address of the init_task (&init_task).

Can you disassemble the .o object using objdump -Dr, or disassemble this
piece of code from the vmlinux instead of compiling with the -S flag, to
verify that? If it really checks against zero then it's a
miscompilation; it should check against &init_task as said above.
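
To make the point about &init_task concrete, here is a minimal user-space
sketch of the loop in the quoted disassembly. The struct layout, link_task()
and recalc_counters() are hypothetical stand-ins, not the real kernel
definitions; the NICE_TO_TICKS arithmetic is read off the quoted
mov $0x14 / sub / sar $0x2 / lea sequence. The walk terminates when the
pointer comes back to &init_task, so a correct loop never compares against
NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for the 2.4 task_struct. */
struct task {
    long counter;            /* remaining timeslice, in ticks */
    long nice;               /* -20 .. 19 */
    struct task *next_task;  /* circular list that passes through init_task */
};

/* Mirrors the quoted assembly: ((20 - nice) >> 2) + 1. */
#define NICE_TO_TICKS(nice) (((20 - (nice)) >> 2) + 1)

/* The list head points at itself when no other task is linked. */
struct task init_task = { 0, 0, &init_task };

/* The loop ends when we are back at &init_task, never at NULL. */
#define for_each_task(p) \
    for ((p) = init_task.next_task; (p) != &init_task; (p) = (p)->next_task)

/* Hypothetical helper: insert a task right after init_task. */
void link_task(struct task *t)
{
    assert(t != NULL);
    t->next_task = init_task.next_task;
    init_task.next_task = t;
}

/* The recalculation loop from schedule(), minus all locking. */
void recalc_counters(void)
{
    struct task *p;

    for_each_task(p) {
        /* A NULL here means the list itself is corrupted. */
        assert(p != NULL);
        p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
    }
}
```

In an unlinked .o the address of init_task is not known yet, so the
compare immediate would show up as zero with a relocation entry next to
it in objdump -r style output; only the linked vmlinux shows the real
address.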

> First I thought the readlocks were broken by the compiler, due
> to syntax changes. But staring at the code I wondered how
> %ecx can become zero at 660: - from this code it's impossible!
> But wait: we allowed interrupts again...
> 
> So my explanation is as follows: the scheduler is interrupted
> and entry.S calls:

Interrupts cannot clobber %ecx or change the tasklist; if they did
either, that would be the real bug.

> So there are 2 possibilities: the spin_unlock_irq(&runqueue_lock)
> is wrong in the scheduler, but this should be noted by more SMP
> users then, or the CONFIG_X86_PPRO_FENCE does not work as expected.

The PPRO_FENCE variant is stricter than the non-FENCE one. However, the
corruption you notice here is in the tasklist, and the read/write locks
are not affected by the FENCE option, so FENCE isn't likely to explain it.

If anything, I would suspect something wrong in the read/write
spinlocks. To rule out that possibility you could, for example, try to
replace all the read_lock and write_lock calls around the tasklist with
spin_lock_irqsave/spin_unlock_irqrestore. That way you would use the
FENCE xchg functionality around the tasklist too, and you would also
make sure that no irq can happen in between, just to be 100% sure that
if it still crashes it's because the memory is corrupted somehow. But
really, the read/write locks just use the lock on the bus when they
modify the spinlock value, so I'd be *very* surprised if that doesn't
work on PPro. The non-FENCE case on recent cpus skips the lock on the
bus during the unlock operation, exploiting the release semantics of all
writes to memory in writeback cache mode on recent x86 (which allows
unrelated speculative reads from outside the critical section to move
inside it).
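
The difference between the two unlock flavours can be sketched in user
space with C11 atomics. This is only an analogy under stated assumptions:
toy_spinlock and its functions are hypothetical names, not the kernel's
spinlock implementation. On x86 compilers typically turn the exchange
into an xchg (implicitly bus-locked) and the release store into a plain
mov.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space lock: 0 = free, 1 = held. */
typedef struct {
    atomic_int held;
} toy_spinlock;

/* One acquisition attempt; returns true if we now own the lock. */
bool toy_trylock(toy_spinlock *l)
{
    return atomic_exchange_explicit(&l->held, 1, memory_order_acquire) == 0;
}

/* Spin until the lock is ours. */
void toy_lock(toy_spinlock *l)
{
    while (!toy_trylock(l))
        ; /* busy-wait */
}

/* FENCE-style unlock: a full atomic exchange, i.e. a locked bus cycle. */
void toy_unlock_fenced(toy_spinlock *l)
{
    (void)atomic_exchange_explicit(&l->held, 0, memory_order_seq_cst);
}

/* Non-FENCE-style unlock: a release store, relying on store ordering. */
void toy_unlock_plain(toy_spinlock *l)
{
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

Both unlock variants leave the lock free; they differ only in how
strongly the store is ordered against surrounding memory accesses.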

I really suspect a hardware fault here. If you can reproduce it easily,
you could try to drop a DIMM of RAM and retest; you can also try memtst
from the cerberus testsuite and/or memtest86 from the lilo menu.

The tasklist walk is very likely triggering very quick cacheline
bouncing and lots of RAM I/O; 99% of hardware bugs trigger while
walking lists, because of the cpu-dcache thrashing that the RAM cannot
cope with. Probably the O(1) scheduler would hide the problem because it
drops this tight loop. Note that the 2.4.18 SuSE kernel scheduler
algorithm is O(N) where N is always the number of running tasks, never
the total number of tasks in the system, while in mainline the scheduler
is O(N) where N is the total number of tasks in the system. This means
that in mainline you normally walk a list with 100 to 500 elements,
while with the SuSE kernel you always walk a list of around 2 or 3
elements, depending on the workload of course. The O(1) scheduler
included in 8.1 reduces N to a constant 1. So if you cannot reproduce
with the SuSE 8.0 kernel, it could simply be because you have lots of
tasks in your system but only a few of them run at the same time. That's
a dramatic optimization in the SuSE kernel but it's not a bugfix; it
only hides the corruption in your case methinks, like the O(1) scheduler
in 8.1 will hide it even better, even if you have lots of tasks running
at the same time ;). It is true that to walk the runqueue we need the
runqueue_lock, which needs irqs disabled, but regardless, irqs must not
screw up anything in the tasklist.
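
The difference in walk length can be illustrated with a small user-space
sketch. All names here are hypothetical, and the lists are
NULL-terminated for brevity (unlike the kernel's circular tasklist). It
only counts how many nodes each style of walk touches: every task in the
system versus only the runnable ones.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task with two independent list links. */
struct task {
    int runnable;
    struct task *next_task;  /* links every task in the system */
    struct task *next_run;   /* links runnable tasks only */
};

/* Link tasks[0..n) into both lists, preserving array order. */
void build_lists(struct task *tasks, size_t n,
                 struct task **all, struct task **runq)
{
    assert(tasks != NULL && all != NULL && runq != NULL);
    *all = NULL;
    *runq = NULL;
    for (size_t i = n; i-- > 0; ) {
        tasks[i].next_task = *all;
        *all = &tasks[i];
        tasks[i].next_run = NULL;
        if (tasks[i].runnable) {
            tasks[i].next_run = *runq;
            *runq = &tasks[i];
        }
    }
}

/* Walk over all tasks: cost grows with the total task count. */
size_t visits_all_tasks(const struct task *all)
{
    size_t n = 0;
    for (const struct task *p = all; p != NULL; p = p->next_task)
        n++;
    return n;
}

/* Walk over the runqueue only: cost grows with the runnable count. */
size_t visits_runqueue(const struct task *runq)
{
    size_t n = 0;
    for (const struct task *p = runq; p != NULL; p = p->next_run)
        n++;
    return n;
}
```

With 100 tasks of which 3 are runnable, the first walk touches 100 nodes
and the second touches 3, which is the kind of gap that could make a
marginal memory subsystem fail under one kernel and not the other.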

So I would say it's either a hardware issue or random memory
corruption generated by some driver. Just some guesses. And if it's the
irq clobbering %ecx or the tasklist, then something is very wrong in
the irq code or in the hardware (I'd exclude that possibility, but you
can try adding _irq to the read_lock/read_unlock of the tasklist_lock to
disable irqs and see if you can still reproduce). If %ecx really is
checked against zero, something is very wrong too, but in the compiler,
and that would explain things as well (you're recommended to use either
gcc 2.95 or 3.2).

Hope this helps,

Andrea
