* A couple of oops.
From: Carlos Martín @ 2006-05-23 18:28 UTC
To: linux-kernel

Hi,

I've nailed this down to something that happened in 2.6.17-rc4. The
system locks up with either a NULL dereference or an unhandable paging
request. The stack trace shows this:

paging request          NULL dereference

_raw_spin_trylock+12    _raw_spin_trylock+20
__spin_lock+22
main_timer_handler+22
timer_interrupt+18
handle_IRQ_event+41
__do_IRQ+156
do_IRQ+51
default_idle+0
_spin_unlock_irq+43
thread_return+187
generic_unplug_device+0
default_idle+45
dev_idle+95 (I can't read the func clearly in this handwriting)
start_secondary+1129

I'm guessing this is the same problem only that it once manifests itself
as one and another time as the other. The problem is in the call to
write_seqlock(&xtime_lock) from main_timer_handler().

I've not been able to determine what patch has caused this to happen,
but it is between 2.6.17-rc3 and -rc4. I'm bisecting, but if anybody has
a good candidate, it'd probably be faster than doing a complete bisect.

   cmn
-- 
Carlos Martín Nieto    |    http://www.cmartin.tk
Hobbyist programmer    |
* Re: A couple of oops.
From: Andrew Morton @ 2006-05-24 0:40 UTC
To: Carlos Martín; +Cc: linux-kernel, Andi Kleen

Carlos Martín <carlos@cmartin.tk> wrote:
>
> Hi,
>
> I've nailed this down to something that happened in 2.6.17-rc4. The
> system locks up with either a NULL dereference or an unhandable paging
> request. The stack trace shows this:
>
> paging request          NULL dereference
>
> _raw_spin_trylock+12    _raw_spin_trylock+20
> __spin_lock+22
> main_timer_handler+22
> timer_interrupt+18
> handle_IRQ_event+41
> __do_IRQ+156
> do_IRQ+51
> default_idle+0
> _spin_unlock_irq+43
> thread_return+187
> generic_unplug_device+0
> default_idle+45
> dev_idle+95 (I can't read the func clearly in this handwriting)
> start_secondary+1129
>
> I'm guessing this is the same problem only that it once manifests itself
> as one and another time as the other. The problem is in the call to
> write_seqlock(&xtime_lock) from main_timer_handler().
>
> I've not been able to determine what patch has caused this to happen,
> but it is between 2.6.17-rc3 and -rc4. I'm bisecting, but if anybody has
> a good candidate, it'd probably be faster than doing a complete bisect.

hm, so an attempt to access xtime_lock.lock results in a null-pointer deref?

x86_64 does novel things, putting xtime_lock into a linker section all of
its own. At a guess I'd say that your compiler/assembler/linker toolchain
got confused and generated the incorrect address for xtime_lock. But if
that was the case you'd get oopses 100% of the time - it wouldn't be
intermittent, as your description seems to imply (although it's quite
unclear?).

You could do:

    gdb vmlinux
    (gdb) p &xtime_lock
    (gdb) x/40i main_timer_handler
* Re: A couple of oops.
From: Carlos Martín @ 2006-05-24 14:26 UTC
To: Andrew Morton; +Cc: linux-kernel, Andi Kleen

On Tue, 2006-05-23 at 17:40 -0700, Andrew Morton wrote:
> Carlos Martín <carlos@cmartin.tk> wrote:
> >
> > Hi,
> >
> > I've nailed this down to something that happened in 2.6.17-rc4. The
> > system locks up with either a NULL dereference or an unhandable paging
> > request. The stack trace shows this:
> >
> > paging request          NULL dereference
> >
> > _raw_spin_trylock+12    _raw_spin_trylock+20
> > __spin_lock+22
> > main_timer_handler+22
> > timer_interrupt+18
> > handle_IRQ_event+41
> > __do_IRQ+156
> > do_IRQ+51
> > default_idle+0
> > _spin_unlock_irq+43
> > thread_return+187
> > generic_unplug_device+0
> > default_idle+45
> > dev_idle+95 (I can't read the func clearly in this handwriting)
> > start_secondary+1129
> >
> > I'm guessing this is the same problem only that it once manifests itself
> > as one and another time as the other. The problem is in the call to
> > write_seqlock(&xtime_lock) from main_timer_handler().
> >
> > I've not been able to determine what patch has caused this to happen,
> > but it is between 2.6.17-rc3 and -rc4. I'm bisecting, but if anybody has
> > a good candidate, it'd probably be faster than doing a complete bisect.
>
> hm, so an attempt to access xtime_lock.lock results in a null-pointer deref?

It seems to be xtime_lock.sequence:

main_timer_handler:
        pushq   %r13                    #
        movq    %rdi, %r13              # regs, regs
        movq    $xtime_lock+8, %rdi     #,
        pushq   %r12                    #
        pushq   %rbp                    #
        pushq   %rbx                    #
        pushq   %rbp                    #
        call    _spin_lock              #
OOPS--> incl    xtime_lock(%rip)        # xtime_lock.sequence
        xorl    %r12d, %r12d            # offset

>
> x86_64 does novel things, putting xtime_lock into a linker section all of
> its own. At a guess I'd say that your compiler/assembler/linker toolchain
> got confused and generated the incorrect address for xtime_lock. But if
> that was the case you'd get oopses 100% of the time - it wouldn't be
> intermittent, as your description seems to imply (although it's quite
> unclear?).

Once I've seen the NULL dereference, the rest are paging request errors.
Most of the time I'm on X, so I don't actually see output, but the ones
I've seen have been like that.

It starts somewhere between rc3 and rc4. With the bisect, now I'm left
with a bunch of SCSI patches which don't seem to bear any relationship
and which I don't use (except for usb-storage).

>
> You could do:
>
> gdb vmlinux
> (gdb) p &xtime_lock
> (gdb) x/40i main_timer_handler
>
-- 
Carlos Martín Nieto    |    http://www.cmartin.tk
Hobbyist programmer    |
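For readers following the disassembly in the message above: write_seqlock() takes the spinlock embedded in the seqlock and then increments the sequence counter, which matches the `call _spin_lock` followed by `incl xtime_lock(%rip)`. What follows is only a userspace sketch of that ordering, not kernel code. All `model_*` names are hypothetical, and the spinlock layout (three 32-bit fields plus a 64-bit owner field) is an assumption about a build with spinlock debugging enabled, chosen because it would put the lock at offset 8 as in `$xtime_lock+8`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a debug-enabled spinlock. The field list is an
 * assumption; the 64-bit owner field forces 8-byte alignment. */
struct model_spinlock {
    uint32_t slock;      /* 0 = free, 1 = held (single-threaded model) */
    uint32_t magic;
    uint32_t owner_cpu;
    uint64_t owner;      /* stands in for a 64-bit pointer */
};

/* Sequence counter first, lock second, as the disassembly suggests:
 * sequence at offset 0, lock at offset 8 under the assumed layout. */
struct model_seqlock {
    uint32_t sequence;
    struct model_spinlock lock;
};

/* Write side: take the lock, then bump the counter. The counter is odd
 * while a write is in progress. */
static void model_write_seqlock(struct model_seqlock *sl)
{
    assert(sl->lock.slock == 0);  /* model only: the lock must be free */
    sl->lock.slock = 1;           /* stands in for spin_lock(&sl->lock) */
    sl->sequence++;               /* the step marked OOPS in the listing */
}

/* Release: bump the counter back to an even value, then drop the lock. */
static void model_write_sequnlock(struct model_seqlock *sl)
{
    sl->sequence++;
    sl->lock.slock = 0;           /* stands in for spin_unlock(&sl->lock) */
}
```

Under these assumptions the spinlock call and the increment address the same object, 8 bytes apart. A fault on the increment alone, after the lock call has returned, is therefore hard to explain by a plainly wrong address for the whole structure.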
* Re: A couple of oops.
From: Carlos Martín @ 2006-05-26 16:30 UTC
To: Andrew Morton; +Cc: linux-kernel, Andi Kleen

This seems to have gone as quick as it appeared. I'm currently running
rc5 with three hours of uptime, which is much longer than I ever managed
with the broken kernel.

It looks like it was due to some miscompilation issue, though it
happened with a clean tree (make mrproper) so it seemed to be a problem
with the kernel.

   cmn
-- 
Carlos Martín Nieto    |    http://www.cmartin.tk
Hobbyist programmer    |