A couple of oops.

public inbox for linux-kernel@vger.kernel.org

* A couple of oops.
From: Carlos Martín @ 2006-05-23 18:28 UTC
  To: linux-kernel

Hi,

I've nailed this down to something that happened in 2.6.17-rc4. The
system locks up with either a NULL dereference or an unhandleable paging
request. The stack trace shows this:

paging request            NULL dereference

_raw_spin_trylock+12    _raw_spin_trylock+20
__spin_lock+22
main_timer_handler+22
timer_interrupt+18
handle_IRQ_event+41
__do_IRQ+156
do_IRQ+51
default_idle+0
_spin_unlock_irq+43
thread_return+187
generic_unplug_device+0
default_idle+45
dev_idle+95 (I can't read the func clearly in this handwriting)
start_secondary+1129

I'm guessing this is the same problem, which manifests itself sometimes
as one and sometimes as the other. The problem is in the call to
write_seqlock(&xtime_lock) from main_timer_handler().

I've not been able to determine what patch has caused this to happen,
but it is between 2.6.17-rc3 and -rc4. I'm bisecting, but if anybody has
a good candidate, it'd probably be faster than doing a complete bisect.

   cmn
-- 
Carlos Martín Nieto    |   http://www.cmartin.tk
Hobbyist programmer    |

* Re: A couple of oops.
From: Andrew Morton @ 2006-05-24  0:40 UTC
  To: Carlos Martín; +Cc: linux-kernel, Andi Kleen

Carlos Martín <carlos@cmartin.tk> wrote:
>
> Hi,
> 
> I've nailed this down to something that happened in 2.6.17-rc4. The
> system locks up with either a NULL dereference or an unhandleable paging
> request. The stack trace shows this:
> 
> paging request            NULL dereference
> 
> _raw_spin_trylock+12    _raw_spin_trylock+20
> __spin_lock+22
> main_timer_handler+22
> timer_interrupt+18
> handle_IRQ_event+41
> __do_IRQ+156
> do_IRQ+51
> default_idle+0
> _spin_unlock_irq+43
> thread_return+187
> generic_unplug_device+0
> default_idle+45
> dev_idle+95 (I can't read the func clearly in this handwriting)
> start_secondary+1129
> 
> I'm guessing this is the same problem, which manifests itself sometimes
> as one and sometimes as the other. The problem is in the call to
> write_seqlock(&xtime_lock) from main_timer_handler().
> 
> I've not been able to determine what patch has caused this to happen,
> but it is between 2.6.17-rc3 and -rc4. I'm bisecting, but if anybody has
> a good candidate, it'd probably be faster than doing a complete bisect.

hm, so an attempt to access xtime_lock.lock results in a null-pointer deref?

x86_64 does novel things, putting xtime_lock into a linker section all of
its own.  At a guess I'd say that your compiler/assembler/linker toolchain
got confused and generated the incorrect address for xtime_lock.  But if
that was the case you'd get oopses 100% of the time - it wouldn't be
intermittent, as your description seems to imply (although it's quite
unclear?).
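
As an aside on what "a linker section all of its own" means: the sketch
below places one variable in a dedicated section and checks its address
against the section bounds. It assumes GCC or Clang on an ELF target with
GNU ld or lld. The section name xtime_sketch, the struct and the helper are
hypothetical names made up for illustration; this is not the kernel's
actual declaration of xtime_lock.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in type; the field names are illustrative only. */
struct lock_sketch {
    unsigned sequence;
    unsigned lock;
};

/* Assumption: GCC/Clang on ELF. Put this one object in its own section. */
struct lock_sketch xtime_lock_sketch
    __attribute__((section("xtime_sketch"), aligned(16))) = { 0, 0 };

/* An ordinary global for comparison; it is placed in the usual data/bss. */
struct lock_sketch ordinary_sketch;

/*
 * GNU ld and lld define __start_<name> and __stop_<name> for sections
 * whose names are valid C identifiers.
 */
extern char __start_xtime_sketch[];
extern char __stop_xtime_sketch[];

/* Nonzero if p lies inside the dedicated section. */
static int in_own_section(const void *p)
{
    uintptr_t a = (uintptr_t)p;

    return a >= (uintptr_t)__start_xtime_sketch &&
           a <  (uintptr_t)__stop_xtime_sketch;
}
```

If the toolchain resolved such a symbol to the wrong address, every access
would go to the wrong place, which is why a toolchain fault would be
expected to fail on every run rather than intermittently.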

You could do:

gdb vmlinux
(gdb) p &xtime_lock
(gdb) x/40i main_timer_handler



* Re: A couple of oops.
From: Carlos Martín @ 2006-05-24 14:26 UTC
  To: Andrew Morton; +Cc: linux-kernel, Andi Kleen

On Tue, 2006-05-23 at 17:40 -0700, Andrew Morton wrote:
> Carlos Martín <carlos@cmartin.tk> wrote:
> >
> > Hi,
> > 
> > I've nailed this down to something that happened in 2.6.17-rc4. The
> > system locks up with either a NULL dereference or an unhandleable paging
> > request. The stack trace shows this:
> > 
> > paging request            NULL dereference
> > 
> > _raw_spin_trylock+12    _raw_spin_trylock+20
> > __spin_lock+22
> > main_timer_handler+22
> > timer_interrupt+18
> > handle_IRQ_event+41
> > __do_IRQ+156
> > do_IRQ+51
> > default_idle+0
> > _spin_unlock_irq+43
> > thread_return+187
> > generic_unplug_device+0
> > default_idle+45
> > dev_idle+95 (I can't read the func clearly in this handwriting)
> > start_secondary+1129
> > 
> > I'm guessing this is the same problem, which manifests itself sometimes
> > as one and sometimes as the other. The problem is in the call to
> > write_seqlock(&xtime_lock) from main_timer_handler().
> > 
> > I've not been able to determine what patch has caused this to happen,
> > but it is between 2.6.17-rc3 and -rc4. I'm bisecting, but if anybody has
> > a good candidate, it'd probably be faster than doing a complete bisect.
> 
> hm, so an attempt to access xtime_lock.lock results in a null-pointer deref?

It seems to be xtime_lock.sequence:
main_timer_handler:
        pushq   %r13    #
        movq    %rdi, %r13      # regs, regs
        movq    $xtime_lock+8, %rdi     #,
        pushq   %r12    #
        pushq   %rbp    #
        pushq   %rbx    #
        pushq   %rbp    #
        call    _spin_lock      #
OOPS--> incl    xtime_lock(%rip)        # xtime_lock.sequence
        xorl    %r12d, %r12d    # offset
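
For readers following the listing: it takes a lock and then increments
xtime_lock.sequence, which is the general shape of the write side of a
sequence lock. A minimal userspace sketch of that pattern, using C11 atomics,
might look like the following. All names here (seqlock_sketch, sketch_*) are
hypothetical, and this is not the kernel's seqlock_t implementation or its
memory layout.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userspace stand-in for a sequence lock. */
struct seqlock_sketch {
    atomic_uint sequence;  /* odd while a writer holds the lock */
    atomic_flag lock;      /* serialises writers (stands in for a spinlock) */
};

/* Write side: take the writer lock first, then bump the counter to odd. */
static void sketch_write_lock(struct seqlock_sketch *sl)
{
    while (atomic_flag_test_and_set(&sl->lock))
        ;  /* spin until the previous writer clears the flag */
    atomic_fetch_add(&sl->sequence, 1);
}

/* Bump the counter back to even, then release the writer lock. */
static void sketch_write_unlock(struct seqlock_sketch *sl)
{
    atomic_fetch_add(&sl->sequence, 1);
    atomic_flag_clear(&sl->lock);
}

/* Read side: sample the counter before reading the protected data. */
static unsigned sketch_read_begin(struct seqlock_sketch *sl)
{
    return atomic_load(&sl->sequence);
}

/* Nonzero if the read overlapped a write and has to be repeated. */
static int sketch_read_retry(struct seqlock_sketch *sl, unsigned start)
{
    return (start & 1u) || atomic_load(&sl->sequence) != start;
}
```

A reader would call sketch_read_begin(), copy the protected data, and loop
while sketch_read_retry() returns nonzero. Both the lock and the counter
live inside the one object, so a bad address for that object would affect
the lock call and the increment alike.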

> 
> x86_64 does novel things, putting xtime_lock into a linker section all of
> its own.  At a guess I'd say that your compiler/assembler/linker toolchain
> got confused and generated the incorrect address for xtime_lock.  But if
> that was the case you'd get oopses 100% of the time - it wouldn't be
> intermittent, as your description seems to imply (although it's quite
> unclear?).

I've seen the NULL dereference once; the rest have been paging request
errors. Most of the time I'm in X, so I don't actually see the output, but
the ones I have seen have been like that.

It starts somewhere between rc3 and rc4.

With the bisect, I'm now left with a bunch of SCSI patches which don't
seem to bear any relation to this and whose code I don't use (except for
usb-storage).

> 
> You could do:
> 
> gdb vmlinux
> (gdb) p &xtime_lock
> (gdb) x/40i main_timer_handler
> 
-- 
Carlos Martín Nieto    |   http://www.cmartin.tk
Hobbyist programmer    |


* Re: A couple of oops.
From: Carlos Martín @ 2006-05-26 16:30 UTC
  To: Andrew Morton; +Cc: linux-kernel, Andi Kleen

This seems to have gone as quickly as it appeared. I'm currently running
rc5 with three hours of uptime, which is much longer than I ever managed
with the broken kernel.

It looks like it was due to some miscompilation issue, though it
happened with a clean tree (make mrproper), so it had seemed to be a
problem with the kernel.

   cmn
-- 
Carlos Martín Nieto    |   http://www.cmartin.tk
Hobbyist programmer    |
