AIO/DIO lockup/crash

public inbox for linux-kernel@vger.kernel.org

* AIO/DIO lockup/crash
From: Peter Zijlstra @ 2008-04-28 12:29 UTC
  To: linux-kernel, linux-aio; +Cc: Zach Brown, Clark Williams

Hi guys,

I'm getting this (and several variations thereof - like crashing in the
PI chain code on -rt) when running aio-dio-invalidate-failure for a few
hours.

(dual core opteron - single spindle - ext3)

Is this a known issue?

I'll run the same on current -git overnight to see if it went away :-)


[ 1796.238953] BUG: soft lockup - CPU#1 stuck for 11s! [aio-dio-invalid:3037]
[ 1796.245794] CPU 1:
[ 1796.247802] Modules linked in: autofs4 binfmt_misc ext2 psmouse evbug evdev i2c_piix4 i2c_core pcspkr thermal processor button sr_mod cdrom sg shpchp pci_hotplug sd_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd usbcore
[ 1796.267532] Pid: 3037, comm: aio-dio-invalid Not tainted 2.6.24.4 #194
[ 1796.274023] RIP: 0010:[<ffffffff804a7993>]  [<ffffffff804a7993>] _spin_lock_irqsave+0x63/0x90
[ 1796.282517] RSP: 0018:ffff81007fba7ce0  EFLAGS: 00000246
[ 1796.287800] RAX: 0000000000000000 RBX: ffff81007fba7cf0 RCX: 0000000000001000
[ 1796.294895] RDX: 0000000000000213 RSI: ffff810067dbc740 RDI: 0000000000000001
[ 1796.301993] RBP: ffff81007fba7c60 R08: 0000000000000101 R09: 000000000169aa28
[ 1796.309090] R10: 000000000169aa28 R11: 0000000000000003 R12: ffffffff8020d0c6
[ 1796.316187] R13: ffff81007fba7c60 R14: ffff81007eaddc00 R15: ffff81007eaddf24
[ 1796.323283] FS:  00002b489f45db00(0000) GS:ffff81007fb6cac0(0000) knlGS:0000000000000000
[ 1796.331330] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1796.337043] CR2: 00000000008c7f1c CR3: 0000000068610000 CR4: 00000000000006e0
[ 1796.344140] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1796.351237] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1796.358334]
[ 1796.358334] Call Trace:
[ 1796.362244]  <IRQ>  [<ffffffff802dee4a>] dio_bio_end_aio+0x3a/0xe0
[ 1796.368405]  [<ffffffff802dac79>] bio_endio+0x19/0x40
[ 1796.373430]  [<ffffffff8034fe8e>] req_bio_endio+0x4e/0xa0
[ 1796.378800]  [<ffffffff80350084>] __end_that_request_first+0x1a4/0x3c0
[ 1796.385292]  [<ffffffff803502a9>] end_that_request_chunk+0x9/0x10
[ 1796.391354]  [<ffffffff803e95fb>] scsi_end_request+0x3b/0x110
[ 1796.397069]  [<ffffffff803e99d5>] scsi_io_completion+0xa5/0x3b0
[ 1796.402958]  [<ffffffff804a7e06>] _spin_unlock_irqrestore+0x16/0x40
[ 1796.409192]  [<ffffffff803e3479>] scsi_finish_command+0x99/0xf0
[ 1796.415079]  [<ffffffff803ea515>] scsi_softirq_done+0x115/0x150
[ 1796.420967]  [<ffffffff803536db>] blk_done_softirq+0x6b/0x80
[ 1796.426598]  [<ffffffff802458c4>] __do_softirq+0x64/0xd0
[ 1796.431883]  [<ffffffff8020d61c>] call_softirq+0x1c/0x30
[ 1796.437166]  [<ffffffff8020efbd>] do_softirq+0x3d/0x90
[ 1796.442276]  [<ffffffff802457d8>] irq_exit+0x88/0xa0
[ 1796.447213]  [<ffffffff8020f095>] do_IRQ+0x85/0x100
[ 1796.452064]  [<ffffffff8020c971>] ret_from_intr+0x0/0xa
[ 1796.457258]  <EOI>  [<ffffffff804a799e>] _spin_lock_irqsave+0x6e/0x90
[ 1796.463678]  [<ffffffff804a796e>] _spin_lock_irqsave+0x3e/0x90
[ 1796.469479]  [<ffffffff802ddded>] dio_bio_submit+0x2d/0x90
[ 1796.474935]  [<ffffffff802ddeee>] dio_send_cur_page+0x9e/0xa0
[ 1796.480648]  [<ffffffff802ddf2e>] submit_page_section+0x3e/0x130
[ 1796.486623]  [<ffffffff802deb39>] __blockdev_direct_IO+0x979/0xc50
[ 1796.492783]  [<ffffffff8806591f>] :ext3:ext3_direct_IO+0xaf/0x1c0
[ 1796.498847]  [<ffffffff88063ad0>] :ext3:ext3_get_block+0x0/0x110
[ 1796.504825]  [<ffffffff802851ba>] generic_file_direct_IO+0xba/0x160
[ 1796.511059]  [<ffffffff802852cf>] generic_file_direct_write+0x6f/0x130
[ 1796.517551]  [<ffffffff80285e13>] __generic_file_aio_write_nolock+0x383/0x440
[ 1796.524650]  [<ffffffff80285f34>] generic_file_aio_write+0x64/0xd0
[ 1796.530802]  [<ffffffff88060a26>] :ext3:ext3_file_write+0x26/0xc0
[ 1796.536865]  [<ffffffff88060a00>] :ext3:ext3_file_write+0x0/0xc0
[ 1796.542841]  [<ffffffff802cce4f>] aio_rw_vect_retry+0x6f/0x180
[ 1796.548642]  [<ffffffff802ccde0>] aio_rw_vect_retry+0x0/0x180
[ 1796.554355]  [<ffffffff802cda19>] aio_run_iocb+0x49/0x110
[ 1796.559725]  [<ffffffff802ce663>] io_submit_one+0x1d3/0x3f0
[ 1796.565268]  [<ffffffff802cf22e>] sys_io_submit+0xde/0x140
[ 1796.570725]  [<ffffffff8020c5dc>] tracesys+0xdc/0xe1
[ 1796.575661]
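
Each Call Trace entry above has the form [<address>] symbol+0xoffset/0xsize: the return address, the enclosing function, the offset of that address from the start of the function, and the function's total size. A minimal C sketch of pulling those fields apart follows. The names parse_trace_line and struct trace_entry are hypothetical, chosen for illustration only and not part of any kernel tool, and the parser is only meant for plain Call Trace lines like the ones in this report.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one decoded Call Trace entry. */
struct trace_entry {
    unsigned long long addr;   /* return address inside the brackets */
    char symbol[64];           /* function name, may carry a :module: prefix */
    unsigned long long offset; /* offset of addr from the function start */
    unsigned long long size;   /* total size of the function */
};

/* Returns 0 on success, -1 if the line holds no "[<addr>] sym+off/size". */
static int parse_trace_line(const char *line, struct trace_entry *out)
{
    /* Skip the timestamp and any <IRQ>/<EOI> marker. */
    const char *p = strstr(line, "[<");

    if (p == NULL)
        return -1;
    /* %llx accepts the optional 0x prefix on the offset and size. */
    if (sscanf(p, "[<%llx>] %63[^+]+%llx/%llx",
               &out->addr, out->symbol, &out->offset, &out->size) != 4)
        return -1;
    return 0;
}
```

Applied to the first entry of the trace, this gives symbol dio_bio_end_aio, offset 0x3a and size 0xe0, i.e. the address lies 0x3a bytes into a 0xe0-byte function.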




* Re: AIO/DIO lockup/crash
From: Andrew Morton @ 2008-04-28 16:08 UTC
  To: Peter Zijlstra; +Cc: linux-kernel, linux-aio, Zach Brown, Clark Williams

On Mon, 28 Apr 2008 14:29:42 +0200 Peter Zijlstra <peterz@infradead.org> wrote:

> Hi guys,
> 
> I'm getting this (and several variations thereof - like crashing in the
> PI chain code on -rt) when running aio-dio-invalidate-failure for a few
> hours.
> 
> (dual core opteron - single spindle - ext3)
> 
> Is this a known issue?
> 
> I'll run the same on current -git overnight to see if it went away :-)
> 
> 
> [ 1796.238953] BUG: soft lockup - CPU#1 stuck for 11s! [aio-dio-invalid:3037]
> [ 1796.245794] CPU 1:
> [ 1796.247802] Modules linked in: autofs4 binfmt_misc ext2 psmouse evbug evdev i2c_piix4 i2c_core pcspkr thermal processor button sr_mod cdrom sg shpchp pci_hotplug sd_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd usbcore
> [ 1796.267532] Pid: 3037, comm: aio-dio-invalid Not tainted 2.6.24.4 #194
> [ 1796.274023] RIP: 0010:[<ffffffff804a7993>]  [<ffffffff804a7993>] _spin_lock_irqsave+0x63/0x90
> [ 1796.282517] RSP: 0018:ffff81007fba7ce0  EFLAGS: 00000246
> [ 1796.287800] RAX: 0000000000000000 RBX: ffff81007fba7cf0 RCX: 0000000000001000
> [ 1796.294895] RDX: 0000000000000213 RSI: ffff810067dbc740 RDI: 0000000000000001
> [ 1796.301993] RBP: ffff81007fba7c60 R08: 0000000000000101 R09: 000000000169aa28
> [ 1796.309090] R10: 000000000169aa28 R11: 0000000000000003 R12: ffffffff8020d0c6
> [ 1796.316187] R13: ffff81007fba7c60 R14: ffff81007eaddc00 R15: ffff81007eaddf24
> [ 1796.323283] FS:  00002b489f45db00(0000) GS:ffff81007fb6cac0(0000) knlGS:0000000000000000
> [ 1796.331330] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 1796.337043] CR2: 00000000008c7f1c CR3: 0000000068610000 CR4: 00000000000006e0
> [ 1796.344140] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1796.351237] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 1796.358334]
> [ 1796.358334] Call Trace:
> [ 1796.362244]  <IRQ>  [<ffffffff802dee4a>] dio_bio_end_aio+0x3a/0xe0
> [ 1796.368405]  [<ffffffff802dac79>] bio_endio+0x19/0x40
> [ 1796.373430]  [<ffffffff8034fe8e>] req_bio_endio+0x4e/0xa0
> [ 1796.378800]  [<ffffffff80350084>] __end_that_request_first+0x1a4/0x3c0
> [ 1796.385292]  [<ffffffff803502a9>] end_that_request_chunk+0x9/0x10
> [ 1796.391354]  [<ffffffff803e95fb>] scsi_end_request+0x3b/0x110
> [ 1796.397069]  [<ffffffff803e99d5>] scsi_io_completion+0xa5/0x3b0
> [ 1796.402958]  [<ffffffff804a7e06>] _spin_unlock_irqrestore+0x16/0x40
> [ 1796.409192]  [<ffffffff803e3479>] scsi_finish_command+0x99/0xf0
> [ 1796.415079]  [<ffffffff803ea515>] scsi_softirq_done+0x115/0x150
> [ 1796.420967]  [<ffffffff803536db>] blk_done_softirq+0x6b/0x80
> [ 1796.426598]  [<ffffffff802458c4>] __do_softirq+0x64/0xd0
> [ 1796.431883]  [<ffffffff8020d61c>] call_softirq+0x1c/0x30
> [ 1796.437166]  [<ffffffff8020efbd>] do_softirq+0x3d/0x90
> [ 1796.442276]  [<ffffffff802457d8>] irq_exit+0x88/0xa0
> [ 1796.447213]  [<ffffffff8020f095>] do_IRQ+0x85/0x100
> [ 1796.452064]  [<ffffffff8020c971>] ret_from_intr+0x0/0xa
> [ 1796.457258]  <EOI>  [<ffffffff804a799e>] _spin_lock_irqsave+0x6e/0x90
> [ 1796.463678]  [<ffffffff804a796e>] _spin_lock_irqsave+0x3e/0x90
> [ 1796.469479]  [<ffffffff802ddded>] dio_bio_submit+0x2d/0x90
> [ 1796.474935]  [<ffffffff802ddeee>] dio_send_cur_page+0x9e/0xa0
> [ 1796.480648]  [<ffffffff802ddf2e>] submit_page_section+0x3e/0x130
> [ 1796.486623]  [<ffffffff802deb39>] __blockdev_direct_IO+0x979/0xc50
> [ 1796.492783]  [<ffffffff8806591f>] :ext3:ext3_direct_IO+0xaf/0x1c0
> [ 1796.498847]  [<ffffffff88063ad0>] :ext3:ext3_get_block+0x0/0x110
> [ 1796.504825]  [<ffffffff802851ba>] generic_file_direct_IO+0xba/0x160
> [ 1796.511059]  [<ffffffff802852cf>] generic_file_direct_write+0x6f/0x130
> [ 1796.517551]  [<ffffffff80285e13>] __generic_file_aio_write_nolock+0x383/0x440
> [ 1796.524650]  [<ffffffff80285f34>] generic_file_aio_write+0x64/0xd0
> [ 1796.530802]  [<ffffffff88060a26>] :ext3:ext3_file_write+0x26/0xc0
> [ 1796.536865]  [<ffffffff88060a00>] :ext3:ext3_file_write+0x0/0xc0
> [ 1796.542841]  [<ffffffff802cce4f>] aio_rw_vect_retry+0x6f/0x180
> [ 1796.548642]  [<ffffffff802ccde0>] aio_rw_vect_retry+0x0/0x180
> [ 1796.554355]  [<ffffffff802cda19>] aio_run_iocb+0x49/0x110
> [ 1796.559725]  [<ffffffff802ce663>] io_submit_one+0x1d3/0x3f0
> [ 1796.565268]  [<ffffffff802cf22e>] sys_io_submit+0xde/0x140
> [ 1796.570725]  [<ffffffff8020c5dc>] tracesys+0xdc/0xe1

erk, that's dio->bio_lock, isn't it?

That lock is super-simple and hasn't changed in quite some time.  If there
has been major memory wreckage and we're simply grabbing at a "lock" in
random memory then I'd expect the bug to manifest in different ways on
different runs?

I assume you have lots of runtime debugging options enabled.
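
A userspace sketch of the locking pattern under discussion, under the assumption that bio_lock only guards a per-request reference count which the submission path raises and the completion path drops. Every name here (struct dio_sketch, sketch_submit, sketch_complete) is a hypothetical stand-in rather than the kernel's code, and the sketch does not model interrupts, which the kernel handles with the irqsave lock variants seen in the trace.

```c
#include <stdatomic.h>

/* Hypothetical userspace stand-in for the kernel's per-request dio state. */
struct dio_sketch {
    atomic_flag bio_lock;   /* stand-in for the spinlock named bio_lock */
    unsigned long refcount; /* assumed to be protected by bio_lock */
};

static void sketch_lock(struct dio_sketch *dio)
{
    /* Spin until the flag was previously clear, i.e. we now own the lock. */
    while (atomic_flag_test_and_set_explicit(&dio->bio_lock,
                                             memory_order_acquire))
        ;
}

static void sketch_unlock(struct dio_sketch *dio)
{
    atomic_flag_clear_explicit(&dio->bio_lock, memory_order_release);
}

static void sketch_init(struct dio_sketch *dio)
{
    atomic_flag_clear(&dio->bio_lock);
    dio->refcount = 1; /* the submitter's own reference */
}

/* Submission side: take one reference per bio sent to the block layer. */
static void sketch_submit(struct dio_sketch *dio)
{
    sketch_lock(dio);
    dio->refcount++;
    sketch_unlock(dio);
}

/* Completion side: drop one reference and report how many remain. */
static unsigned long sketch_complete(struct dio_sketch *dio)
{
    unsigned long remaining;

    sketch_lock(dio);
    remaining = --dio->refcount;
    sketch_unlock(dio);
    return remaining; /* 0 means the caller may release the object */
}
```

With a critical section this small there is little room for a plain locking mistake. If, however, the object were released or overwritten while another path still expected to take the lock, the next acquisition would spin on whatever then occupies that memory, which is the "lock in random memory" case described above.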



* Re: AIO/DIO lockup/crash
From: Peter Zijlstra @ 2008-04-28 17:48 UTC
  To: Andrew Morton; +Cc: linux-kernel, linux-aio, Zach Brown, Clark Williams

On Mon, 2008-04-28 at 09:08 -0700, Andrew Morton wrote:

> erk, that's dio->bio_lock, isn't it?

Yep.

> That lock is super-simple and hasn't changed in quite some time.  If there
> has been major memory wreckage and we're simply grabbing at a "lock" in
> random memory then I'd expect the bug to manifest in different ways on
> different runs?

Looks like it.

> I assume you have lots of runtime debugging options enabled.

Not on this particular run. I'll start a -git run this evening with most
of the debugging options enabled. It takes a few hours to reproduce, so
I'll let it run overnight.



* Re: AIO/DIO lockup/crash
From: Jeff Moyer @ 2008-04-30 14:46 UTC
  To: Peter Zijlstra
  Cc: Andrew Morton, linux-kernel, linux-aio, Zach Brown,
	Clark Williams

Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, 2008-04-28 at 09:08 -0700, Andrew Morton wrote:
>
>> erk, that's dio->bio_lock, isn't it?
>
> Yep.
>
>> That lock is super-simple and hasn't changed in quite some time.  If there
>> has been major memory wreckage and we're simply grabbing at a "lock" in
>> random memory then I'd expect the bug to manifest in different ways on
>> different runs?
>
> Looks like it.
>
>> I assume you have lots of runtime debugging options enabled.
>
> Not on this particular run. I'll start a -git run this evening with most
> of the debugging options enabled. It takes a few hours to reproduce, so
> I'll let it run overnight.

Peter, any update on this?

FWIW, I've been running the aio-dio-invalidate-failure test on a fedora
kernel (2.6.25-8.fc9.i686) for several days now without any problems.
However, I'm not sure I can reproduce the bugs at all.  I'll revert to a
2.6.24 kernel and try.

Cheers,

Jeff


* Re: AIO/DIO lockup/crash
From: Peter Zijlstra @ 2008-04-30 16:31 UTC
  To: Jeff Moyer
  Cc: Andrew Morton, linux-kernel, linux-aio, Zach Brown,
	Clark Williams

On Wed, 2008-04-30 at 10:46 -0400, Jeff Moyer wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Mon, 2008-04-28 at 09:08 -0700, Andrew Morton wrote:
> >
> >> erk, that's dio->bio_lock, isn't it?
> >
> > Yep.
> >
> >> That lock is super-simple and hasn't changed in quite some time.  If there
> >> has been major memory wreckage and we're simply grabbing at a "lock" in
> >> random memory then I'd expect the bug to manifest in different ways on
> >> different runs?
> >
> > Looks like it.
> >
> >> I assume you have lots of runtime debugging options enabled.
> >
> > Not on this particular run. I'll start a -git run this evening with most
> > of the debugging options enabled. It takes a few hours to reproduce, so
> > I'll let it run overnight.
> 
> Peter, any update on this?
> 
> FWIW, I've been running the aio-dio-invalidate-failure test on a fedora
> kernel (2.6.25-8.fc9.i686) for several days now without any problems.
> However, I'm not sure I can reproduce the bugs at all.  I'll revert to a
> 2.6.24 kernel and try.

I've run -git for 10+ hours without crashing, but I've also
changed .config settings (enabled many debugging switches). I'm starting
to work my way backwards to the previous setup that did crash, but since
it takes so long to test each kernel it's slow going.



