new kernel oops in recent kernels

* new kernel oops in recent kernels
@ 2008-03-16 15:19 Giuseppe Sacco
  2008-03-16 16:39 ` James Bottomley
  2008-03-16 16:42 ` Matthew Wilcox
  0 siblings, 2 replies; 9+ messages in thread
From: Giuseppe Sacco @ 2008-03-16 15:19 UTC
  To: linux-scsi

Hi all,
while testing the latest kernels on an SGI O2, I found this new kernel oops.
It was produced with a kernel built from last night's linux-mips.org git tree.
A very similar oops has been reported by others[0] using 2.6.22.

As you can see, the oops happens while booting the machine, when init
runs all the scripts via rc. One of those scripts runs hald-probe-storage,
the process that actually triggers the oops.

I am not able to identify the cause or to propose a solution, but I am
willing to test any patch for this problem.

Thanks a lot,
Giuseppe

CPU 0 Unable to handle kernel paging request at virtual address 0000000000000000, epc == 0000000000000000, ra == 0000000000000000
Oops[#1]:
Cpu 0
$ 0   : 0000000000000000 ffffffff9001fce0 ffffffffffffff86 0000000000000028
$ 4   : 980000000fc01140 0000000000000080 0000000000024000 0000000000000000
$ 8   : 980000000fc54700 0000000000000001 0000000000008000 404000130a0808ff
$12   : 0000000000000008 ffffffff801b8db8 0000000000000000 ffffffff803f0000
$16   : 980000000ff2fa70 980000000c417bb8 980000000c417c20 980000000fdeb610
$20   : 000000007fffffff 980000000f9211a0 980000000fc26000 000000007fa51ecd
$24   : 0000000000000000 ffffffff80074290                                  
$28   : 980000000c414000 980000000c417bb0 0000000000400000 0000000000000000
Hi    : 0000000000000000
Lo    : 003d08dbda057200
epc   : 0000000000000000 0x0     Not tainted
ra    : 0000000000000000 0x0
Status: 9001fce3    KX SX UX KERNEL EXL IE 
Cause : 00000008
BadVA : 0000000000000000
PrId  : 00002321 (R5000)
Modules linked in: parport_pc lp parport ipv6 deflate zlib_deflate ctr twofish twofish_common camellia serpent blowfish des_generic cbc aes_generic xcbc sha256_generic sha1_generic crypto_null crypto_blkcipher dm_snapshot dm_mirror dm_mod ehci_hcd ohci_hcd r8169 usbcore sg evdev
Process hald-probe-stor (pid: 1937, threadinfo=980000000c414000, task=980000000ebf47d8)
Stack : 980000000c417be0 980000000c417de0 0800000000000000 980000000c417bb0
        00000008ffffff86 0000000000000000 0200000000000001 000006d600000000
        0000000000000000 980000000c417de0 980000000fdeb610 0000000000000001
        0000000000005326 ffffffff802460b0 0000000070023a00 000000000f9211a0
        ffffffff80490000 ffffffff8024bb84 980000000fc10e80 980000000f80bb28
        0000000000000000 980000000f9210e0 0000010100000001 00000000800d1618
        0000000000000004 980000000fc8f850 000000007fffffff 980000000fde4000
        0000000000005326 000000007fffffff 980000000c407540 980000000f9211a0
        980000000fc26000 000000007fa51ecd ffffffff80245c6c 980000000c407540
        0000000000000000 fffffffffffffdfd 0000000000005326 ffffffff801ad8bc
        ...
Call Trace:
[<ffffffff802460b0>] sr_drive_status+0x50/0xe8
[<ffffffff8024bb84>] cdrom_ioctl+0x5f4/0x1208
[<ffffffff80245c6c>] sr_block_ioctl+0x64/0xe8
[<ffffffff801ad8bc>] compat_blkdev_ioctl+0x7cc/0x18e0
[<ffffffff800d1870>] do_open+0x98/0x310
[<ffffffff800d1d60>] blkdev_open+0x0/0xc0
[<ffffffff800d1da8>] blkdev_open+0x48/0xc0
[<ffffffff8009c444>] __dentry_open+0x114/0x2e0
[<ffffffff8009c740>] do_filp_open+0x48/0x58
[<ffffffff8009c740>] do_filp_open+0x48/0x58
[<ffffffff800def8c>] compat_sys_ioctl+0xf4/0x440
[<ffffffff80019154>] handle_sys+0x114/0x130
[<ffffffff8001fcf3>] fpu_emulator_cop1Handler+0x362/0x2270


Code: (Bad address in epc)

[0]http://lists.debian.org/debian-mips/2008/03/msg00082.html



* Re: new kernel oops in recent kernels
  2008-03-16 15:19 new kernel oops in recent kernels Giuseppe Sacco
@ 2008-03-16 16:39 ` James Bottomley
  2008-03-16 18:32   ` Giuseppe Sacco
  2008-03-16 16:42 ` Matthew Wilcox
  1 sibling, 1 reply; 9+ messages in thread
From: James Bottomley @ 2008-03-16 16:39 UTC
  To: Giuseppe Sacco; +Cc: linux-scsi

On Sun, 2008-03-16 at 16:19 +0100, Giuseppe Sacco wrote:
> Hi all,
> testing latest kernels on SGI O2, I found this new kernel oops. It has
> been produced with kernel from linux-mips.org git of yesterday night.
> A very similar oops has been reported by others[0] using 2.6.22.
> 
> As you may see, the oops happens while booting the machine, when init
> run all scripts via rc. One of those scripts run hald-probe-storage, the
> process that actually create the oops.
> 
> I am not able to identify the cause nor to propose a solution, but I am
> willing to test any patch for this problem.
> 
> Thanks a lot,
> Giuseppe
> 
> CPU 0 Unable to handle kernel paging request at virtual address 0000000000000000, epc == 0000000000000000, ra == 0000000000000000
> Oops[#1]:
> Cpu 0
> $ 0   : 0000000000000000 ffffffff9001fce0 ffffffffffffff86 0000000000000028
> $ 4   : 980000000fc01140 0000000000000080 0000000000024000 0000000000000000
> $ 8   : 980000000fc54700 0000000000000001 0000000000008000 404000130a0808ff
> $12   : 0000000000000008 ffffffff801b8db8 0000000000000000 ffffffff803f0000
> $16   : 980000000ff2fa70 980000000c417bb8 980000000c417c20 980000000fdeb610
> $20   : 000000007fffffff 980000000f9211a0 980000000fc26000 000000007fa51ecd
> $24   : 0000000000000000 ffffffff80074290                                  
> $28   : 980000000c414000 980000000c417bb0 0000000000400000 0000000000000000
> Hi    : 0000000000000000
> Lo    : 003d08dbda057200
> epc   : 0000000000000000 0x0     Not tainted
> ra    : 0000000000000000 0x0
> Status: 9001fce3    KX SX UX KERNEL EXL IE 
> Cause : 00000008
> BadVA : 0000000000000000
> PrId  : 00002321 (R5000)
> Modules linked in: parport_pc lp parport ipv6 deflate zlib_deflate ctr twofish twofish_common camellia serpent blowfish des_generic cbc aes_generic xcbc sha256_generic sha1_generic crypto_null crypto_blkcipher dm_snapshot dm_mirror dm_mod ehci_hcd ohci_hcd r8169 usbcore sg evdev
> Process hald-probe-stor (pid: 1937, threadinfo=980000000c414000, task=980000000ebf47d8)
> Stack : 980000000c417be0 980000000c417de0 0800000000000000 980000000c417bb0
>         00000008ffffff86 0000000000000000 0200000000000001 000006d600000000
>         0000000000000000 980000000c417de0 980000000fdeb610 0000000000000001
>         0000000000005326 ffffffff802460b0 0000000070023a00 000000000f9211a0
>         ffffffff80490000 ffffffff8024bb84 980000000fc10e80 980000000f80bb28
>         0000000000000000 980000000f9210e0 0000010100000001 00000000800d1618
>         0000000000000004 980000000fc8f850 000000007fffffff 980000000fde4000
>         0000000000005326 000000007fffffff 980000000c407540 980000000f9211a0
>         980000000fc26000 000000007fa51ecd ffffffff80245c6c 980000000c407540
>         0000000000000000 fffffffffffffdfd 0000000000005326 ffffffff801ad8bc
>         ...
> Call Trace:
> [<ffffffff802460b0>] sr_drive_status+0x50/0xe8
> [<ffffffff8024bb84>] cdrom_ioctl+0x5f4/0x1208
> [<ffffffff80245c6c>] sr_block_ioctl+0x64/0xe8
> [<ffffffff801ad8bc>] compat_blkdev_ioctl+0x7cc/0x18e0
> [<ffffffff800d1870>] do_open+0x98/0x310
> [<ffffffff800d1d60>] blkdev_open+0x0/0xc0
> [<ffffffff800d1da8>] blkdev_open+0x48/0xc0
> [<ffffffff8009c444>] __dentry_open+0x114/0x2e0
> [<ffffffff8009c740>] do_filp_open+0x48/0x58
> [<ffffffff8009c740>] do_filp_open+0x48/0x58
> [<ffffffff800def8c>] compat_sys_ioctl+0xf4/0x440
> [<ffffffff80019154>] handle_sys+0x114/0x130
> [<ffffffff8001fcf3>] fpu_emulator_cop1Handler+0x362/0x2270

This is a bit strange.  It's obviously O2 specific, which makes it a lot
harder.  Can you compile the kernel with CONFIG_DEBUG_INFO and reproduce
the oops (just in case this changes the symbol layout)?  Then ask gdb where
sr_drive_status+0x50 (or whatever it moves to) is:

gdb vmlinux
b *(sr_drive_status+0x50)

should identify the file and line.

The signature implies that cdi->handle may be NULL, so you could put in
a check for that as well.
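
A minimal userspace sketch of such a check, for illustration only.  The
names mock_scsi_cd, mock_cdi_with_handle and mock_get_handle are
hypothetical stand-ins, not the kernel's definitions; the real check would
presumably sit in sr_drive_status() before cdi->handle is dereferenced.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the driver's private data. */
struct mock_scsi_cd {
    void *device;
};

/* Hypothetical, simplified stand-in for struct cdrom_device_info. */
struct mock_cdi_with_handle {
    void *handle;   /* would point at the driver's private data */
};

/*
 * Store the private data in *out and return 0 when the handle is set.
 * Return -ENODEV when it is NULL, instead of dereferencing it.
 */
int mock_get_handle(const struct mock_cdi_with_handle *cdi,
                    struct mock_scsi_cd **out)
{
    assert(out != NULL);   /* caller must supply somewhere to store it */

    if (cdi == NULL || cdi->handle == NULL)
        return -ENODEV;

    *out = cdi->handle;
    return 0;
}
```

In the kernel, a printk() next to the error return would show whether this
path is the one being hit.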

Thanks,

James




* Re: new kernel oops in recent kernels
  2008-03-16 15:19 new kernel oops in recent kernels Giuseppe Sacco
  2008-03-16 16:39 ` James Bottomley
@ 2008-03-16 16:42 ` Matthew Wilcox
  2008-03-16 18:29   ` Giuseppe Sacco
  1 sibling, 1 reply; 9+ messages in thread
From: Matthew Wilcox @ 2008-03-16 16:42 UTC
  To: Giuseppe Sacco; +Cc: linux-scsi

On Sun, Mar 16, 2008 at 04:19:08PM +0100, Giuseppe Sacco wrote:
> testing latest kernels on SGI O2, I found this new kernel oops. It has
> been produced with kernel from linux-mips.org git of yesterday night.
> A very similar oops has been reported by others[0] using 2.6.22.

> CPU 0 Unable to handle kernel paging request at virtual address 0000000000000000, epc == 0000000000000000, ra == 0000000000000000

I'm not familiar with MIPS; is epc the program counter?  If so, this
would be a branch to 0.  That's somewhat confusing as I don't see any
function pointers used within sr_drive_status().  How accurate are MIPS
backtraces?

> Call Trace:
> [<ffffffff802460b0>] sr_drive_status+0x50/0xe8
> [<ffffffff8024bb84>] cdrom_ioctl+0x5f4/0x1208
> [<ffffffff80245c6c>] sr_block_ioctl+0x64/0xe8

It would be interesting to see a disassembly (objdump -dr
drivers/scsi/sr_ioctl.o) of sr_drive_status from say 0x40 to 0x60.

And if that calls a function, it would be interesting to put in printks
to figure out where we're dereferencing a null pointer.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."


* Re: new kernel oops in recent kernels
  2008-03-16 16:42 ` Matthew Wilcox
@ 2008-03-16 18:29   ` Giuseppe Sacco
  2008-03-17  3:58     ` Matthew Wilcox
  2008-03-17  4:41     ` Matthew Wilcox
  0 siblings, 2 replies; 9+ messages in thread
From: Giuseppe Sacco @ 2008-03-16 18:29 UTC
  To: linux-scsi

Hi all,

On Sun, 16/03/2008 at 10.42 -0600, Matthew Wilcox wrote:
> On Sun, Mar 16, 2008 at 04:19:08PM +0100, Giuseppe Sacco wrote:
[...]
> > Call Trace:
> > [<ffffffff802460b0>] sr_drive_status+0x50/0xe8
> > [<ffffffff8024bb84>] cdrom_ioctl+0x5f4/0x1208
> > [<ffffffff80245c6c>] sr_block_ioctl+0x64/0xe8
> 
> It would be interesting to see a disassembly (objdump -dr
> drivers/scsi/sr_ioctl.o) of sr_drive_status from say 0x40 to 0x60.

here it is:

(gdb) disassemble sr_drive_status+0x50
Dump of assembler code for function sr_drive_status:
0xffffffff80246060 <sr_drive_status+0>:	daddiu	sp,sp,-32
0xffffffff80246064 <sr_drive_status+4>:	lui	v0,0x7fff
0xffffffff80246068 <sr_drive_status+8>:	sd	s0,16(sp)
0xffffffff8024606c <sr_drive_status+12>:	sd	ra,24(sp)
0xffffffff80246070 <sr_drive_status+16>:	ori	v0,v0,0xffff
0xffffffff80246074 <sr_drive_status+20>:	move	s0,a0
0xffffffff80246078 <sr_drive_status+24>:	bne	a1,v0,0xffffffff802460e8 <sr_drive_status+136>
0xffffffff8024607c <sr_drive_status+28>:	ld	v1,24(a0)
0xffffffff80246080 <sr_drive_status+32>:	ld	a0,16(v1)
0xffffffff80246084 <sr_drive_status+36>:	jal	0xffffffff80244c70 <sr_test_unit_ready>
0xffffffff80246088 <sr_drive_status+40>:	daddiu	a1,sp,4
0xffffffff8024608c <sr_drive_status+44>:	bnez	v0,0xffffffff802460a8 <sr_drive_status+72>
0xffffffff80246090 <sr_drive_status+48>:	move	a0,s0
0xffffffff80246094 <sr_drive_status+52>:	li	v0,4
0xffffffff80246098 <sr_drive_status+56>:	ld	ra,24(sp)
0xffffffff8024609c <sr_drive_status+60>:	ld	s0,16(sp)
0xffffffff802460a0 <sr_drive_status+64>:	jr	ra
0xffffffff802460a4 <sr_drive_status+68>:	daddiu	sp,sp,32
0xffffffff802460a8 <sr_drive_status+72>:	jal	0xffffffff8024c838 <cdrom_get_media_event>
0xffffffff802460ac <sr_drive_status+76>:	move	a1,sp
0xffffffff802460b0 <sr_drive_status+80>:	bnez	v0,0xffffffff802460fc <sr_drive_status+156>
0xffffffff802460b4 <sr_drive_status+84>:	lhu	v0,0(sp)
0xffffffff802460b8 <sr_drive_status+88>:	sll	v0,v0,0x0
0xffffffff802460bc <sr_drive_status+92>:	andi	v0,v0,0xff
0xffffffff802460c0 <sr_drive_status+96>:	andi	v1,v0,0x2
0xffffffff802460c4 <sr_drive_status+100>:	bnez	v1,0xffffffff80246094 <sr_drive_status+52>
0xffffffff802460c8 <sr_drive_status+104>:	andi	v0,v0,0x1
0xffffffff802460cc <sr_drive_status+108>:	beqz	v0,0xffffffff80246098 <sr_drive_status+56>
0xffffffff802460d0 <sr_drive_status+112>:	li	v0,1
0xffffffff802460d4 <sr_drive_status+116>:	ld	ra,24(sp)

> And if that calls a function, it would be interesting to put in printks
> to figure out where we're dereferencing a null pointer.
> 



* Re: new kernel oops in recent kernels
  2008-03-16 16:39 ` James Bottomley
@ 2008-03-16 18:32   ` Giuseppe Sacco
  2008-03-16 18:47     ` James Bottomley
  0 siblings, 1 reply; 9+ messages in thread
From: Giuseppe Sacco @ 2008-03-16 18:32 UTC
  To: linux-scsi

Hi James,

On Sun, 16/03/2008 at 11.39 -0500, James Bottomley wrote:
> On Sun, 2008-03-16 at 16:19 +0100, Giuseppe Sacco wrote:
[...]
> This is a bit strange.  It's obviously O2 specific, which makes it a lot
> harder.  Can you compile the kernel with CONFIG_DEBUG_INFO and reproduce
> (just in case this changes the symbol layout).  Then ask gdb where
[...]

I cannot find any CONFIG_DEBUG_INFO. Do you mean CONFIG_DEBUG_KERNEL?

Thanks,
Giuseppe



* Re: new kernel oops in recent kernels
  2008-03-16 18:32   ` Giuseppe Sacco
@ 2008-03-16 18:47     ` James Bottomley
  0 siblings, 0 replies; 9+ messages in thread
From: James Bottomley @ 2008-03-16 18:47 UTC
  To: Giuseppe Sacco; +Cc: linux-scsi

On Sun, 2008-03-16 at 19:32 +0100, Giuseppe Sacco wrote:
> Hi James,
> 
> On Sun, 16/03/2008 at 11.39 -0500, James Bottomley wrote:
> > On Sun, 2008-03-16 at 16:19 +0100, Giuseppe Sacco wrote:
> [...]
> > This is a bit strange.  It's obviously O2 specific, which makes it a lot
> > harder.  Can you compile the kernel with CONFIG_DEBUG_INFO and reproduce
> > (just in case this changes the symbol layout).  Then ask gdb where
> [...]
> 
> I cannot find any CONFIG_DEBUG_INFO. Do you mean CONFIG_DEBUG_KERNEL?

This is from lib/Kconfig.debug:

config DEBUG_INFO
        bool "Compile the kernel with debug info"
        depends on DEBUG_KERNEL
        help
          If you say Y here the resulting kernel image will include
          debugging info resulting in a larger kernel image.
          This adds debug symbols to the kernel and modules (gcc -g), and
          is needed if you intend to use kernel crashdump or binary object
          tools like crash, kgdb, LKCD, gdb, etc on the kernel.
          Say Y here only if you plan to debug the kernel.

          If unsure, say N.

It does depend on CONFIG_DEBUG_KERNEL according to the depends clause.

James




* Re: new kernel oops in recent kernels
  2008-03-16 18:29   ` Giuseppe Sacco
@ 2008-03-17  3:58     ` Matthew Wilcox
  2008-03-17  4:41     ` Matthew Wilcox
  1 sibling, 0 replies; 9+ messages in thread
From: Matthew Wilcox @ 2008-03-17  3:58 UTC
  To: Giuseppe Sacco; +Cc: linux-scsi

On Sun, Mar 16, 2008 at 07:29:07PM +0100, Giuseppe Sacco wrote:
> > It would be interesting to see a disassembly (objdump -dr
> > drivers/scsi/sr_ioctl.o) of sr_drive_status from say 0x40 to 0x60.
> 
> here it is:
> 
> (gdb) disassemble sr_drive_status+0x50
> 0xffffffff802460b0 <sr_drive_status+80>:	bnez	v0,0xffffffff802460fc <sr_drive_status+156>

The thing about objdump -dr is that it gives you the name of the
function that's being called.  gdb apparently doesn't, or would need a
different command from "disassemble".

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."


* Re: new kernel oops in recent kernels
  2008-03-16 18:29   ` Giuseppe Sacco
  2008-03-17  3:58     ` Matthew Wilcox
@ 2008-03-17  4:41     ` Matthew Wilcox
  2008-03-17  8:17       ` Giuseppe Sacco
  1 sibling, 1 reply; 9+ messages in thread
From: Matthew Wilcox @ 2008-03-17  4:41 UTC
  To: Giuseppe Sacco; +Cc: linux-scsi

On Sun, Mar 16, 2008 at 07:29:07PM +0100, Giuseppe Sacco wrote:
> > > [<ffffffff802460b0>] sr_drive_status+0x50/0xe8
> > > [<ffffffff8024bb84>] cdrom_ioctl+0x5f4/0x1208
> > > [<ffffffff80245c6c>] sr_block_ioctl+0x64/0xe8
> > 

> 0xffffffff802460a4 <sr_drive_status+68>:	daddiu	sp,sp,32
> 0xffffffff802460a8 <sr_drive_status+72>:	jal	0xffffffff8024c838 <cdrom_get_media_event>
> 0xffffffff802460ac <sr_drive_status+76>:	move	a1,sp
> 0xffffffff802460b0 <sr_drive_status+80>:	bnez	v0,0xffffffff802460fc <sr_drive_status+156>

I think I was confused earlier.  156 is 0x9c, thus within the function.
The backtrace must be incorrect; this is really 0x48 and thus a call to
cdrom_get_media_event, which points the finger at
cdi->ops->generic_packet being NULL.
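
One way to reconcile the two offsets, assuming the trace entry is a saved
return address: on MIPS, jal stores in ra the address of the instruction
after its delay slot, so the call at +0x48 (72) returns to +0x50 (80).
The helper name mips_call_site_offset is made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of one MIPS instruction. */
#define MIPS_INSN_SIZE 4u

/*
 * jal writes the address of the instruction after its delay slot to ra,
 * i.e. the jal address plus two instructions.  Given a return offset
 * taken from a backtrace, recover the offset of the jal itself.
 */
uint32_t mips_call_site_offset(uint32_t return_offset)
{
    /* A return offset must leave room for the jal and its delay slot. */
    assert(return_offset >= 2u * MIPS_INSN_SIZE);

    return return_offset - 2u * MIPS_INSN_SIZE;
}
```

Applied to the trace entry, mips_call_site_offset(0x50) gives 0x48, which
is the jal to cdrom_get_media_event in the disassembly.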

Put a BUG_ON(!cdi->ops->generic_packet) in drivers/cdrom/cdrom.c right
before the line that calls it (i.e. line 11 of cdrom_get_media_event).
That should trigger and give a better backtrace.  Then it's a simple (*)
matter of figuring out why it's NULL.

* This is sarcasm.
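
A userspace sketch of the same idea, for illustration only.  In the kernel
the check is the BUG_ON() line; here the guard returns an error code
instead so that it can run outside the kernel.  All mock_* names are
hypothetical stand-ins, not the definitions in drivers/cdrom/cdrom.c.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel types. */
struct mock_packet_command {
    int stat;
};

struct mock_cdrom_device_info;

struct mock_cdrom_device_ops {
    int (*generic_packet)(struct mock_cdrom_device_info *cdi,
                          struct mock_packet_command *cgc);
};

struct mock_cdrom_device_info {
    const struct mock_cdrom_device_ops *ops;
};

/*
 * Call ops->generic_packet only when the pointer is set.  A jump through
 * a NULL function pointer is what would leave the program counter at 0.
 */
int mock_send_packet(struct mock_cdrom_device_info *cdi,
                     struct mock_packet_command *cgc)
{
    if (cdi == NULL || cdi->ops == NULL || cdi->ops->generic_packet == NULL)
        return -ENOSYS;

    return cdi->ops->generic_packet(cdi, cgc);
}

/* Example callback used to exercise the non-NULL path. */
int mock_generic_packet_ok(struct mock_cdrom_device_info *cdi,
                           struct mock_packet_command *cgc)
{
    (void)cdi;             /* unused in this sketch */
    assert(cgc != NULL);
    cgc->stat = 0;
    return 0;
}
```

With an ops table whose generic_packet is NULL, mock_send_packet() returns
-ENOSYS rather than branching to address 0.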

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."


* Re: new kernel oops in recent kernels
  2008-03-17  4:41     ` Matthew Wilcox
@ 2008-03-17  8:17       ` Giuseppe Sacco
  0 siblings, 0 replies; 9+ messages in thread
From: Giuseppe Sacco @ 2008-03-17  8:17 UTC
  To: linux-scsi

Hi all,
I wrote to the linux-mips mailing list to investigate the assembly
code generated for sr_drive_status(), since adding the suggested
printk() made the oops disappear. I suspected a problem with the
compiler, but people there are looking into a possible cache
coherency problem.
You can follow the thread in the public web archive at
http://www.linux-mips.org/archives/linux-mips/2008-03/msg00079.html

I'll come back to this list after checking for problems with cache
coherency and the C compiler.

Thanks,
Giuseppe
