* Re: Strange crash on Dell R720xd
From: Borislav Petkov @ 2012-10-16 9:03 UTC
To: Laurent CARON; +Cc: linux-kernel, linux-raid

On Mon, Oct 15, 2012 at 09:42:58PM +0200, Laurent CARON wrote:
> Hi,
>
> I'm currently replacing an old system (HP DL 380 G5) with new Dell R720xd.
> On those new boxes I did configure the H310 controller as plain JBOD.
>
> Those boxes appear to crash more often than not (from 5 mins to a couple
> of hours).
> I have the impression those crashes appear under heavy IO.
>
> The setup consists of a few md RAID arrays serving as underlying devices
> for either filesystem, or drbd (plus lvm on top).
>
> I managed to catch a trace over netconsole:
> ------------[ cut here ]------------
> kernel BUG at crypto/async_tx/async_tx.c:174!

That's:

	BUG_ON(async_tx_test_ack(depend_tx) || txd_next(depend_tx) ||
	       txd_parent(tx));

but probably the b0rkage happens up the stack. And this __raid_run_ops
is probably starting the whole TX so maybe we should add
linux-raid@vger.kernel.org to CC. Added.

> invalid opcode: 0000 [#1] SMP
> Modules linked in: drbd lru_cache netconsole iptable_filter ip_tables ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue bonding ipv6 btrfs ioatdma lpc_ich sb_edac dca mfd_core
> CPU 0
> Pid: 12580, comm: kworker/u:2 Not tainted 3.6.2-r510-r720xd #1 Dell Inc. PowerEdge R720xd

What is that "r510" thing in the kernel version? You have your patches
on top? If yes, please try reproducing this with a kernel.org kernel
without anything else on top.

Also, it might be worth trying plain 3.6 to rule out a regression
introduced in the stable 3.6 series.

(leaving in the rest for reference)

> RIP: 0010:[<ffffffff8130f9ab>] [<ffffffff8130f9ab>] async_tx_submit+0x29/0xab
> RSP: 0018:ffff88100940fb30 EFLAGS: 00010202
> RAX: ffff88100b30aeb0 RBX: ffff88080b5cf390 RCX: 0000000000000029
> RDX: ffff88100940fd00 RSI: ffff88080b5cf390 RDI: ffff880809ad0818
> RBP: ffff8808054a7d90 R08: ffff88080b5cf900 R09: 0000000000000001
> R10: 0000000000001000 R11: 0000000000000001 R12: ffff88100940fd00
> R13: 0000000000000002 R14: ffff880809ad0638 R15: ffff880809ad0818
> FS: 0000000000000000(0000) GS:ffff88080fc00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: ffffffffff600400 CR3: 0000000e4055f000 CR4: 00000000000407f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process kworker/u:2 (pid: 12580, threadinfo ffff88100940e000, task ffff880804850630)
> Stack:
>  ffff88100940fd00 ffff88100940fc40 0000000000000101 ffffffff8131044b
>  0000000000000001 0000000000000246 0000000000000201 ffffffffa0073a00
>  ffff8808054a7d90 ffff8808054a7690 ffff88100940fc40 ffff88080bf9e668
> Call Trace:
>  [<ffffffff8131044b>] ? do_async_gen_syndrome+0x2f3/0x320
>  [<ffffffffa0073a00>] ? ioat2_tx_submit_unlock+0xac/0xb3 [ioatdma]
>  [<ffffffff815e6820>] ? ops_complete_compute+0x7b/0x7b
>  [<ffffffff81310540>] ? async_gen_syndrome+0xc8/0x1d6
>  [<ffffffff815e8b9a>] ? __raid_run_ops+0x9e7/0xb5a
>  [<ffffffff810848f0>] ? select_task_rq_fair+0x487/0x74b
>  [<ffffffff815e6820>] ? ops_complete_compute+0x7b/0x7b
>  [<ffffffff8107e40b>] ? __wake_up+0x35/0x46
>  [<ffffffff8107ca2a>] ? async_schedule+0x12/0x12
>  [<ffffffff815e8d3f>] ? async_run_ops+0x32/0x3e
>  [<ffffffff8107cace>] ? async_run_entry_fn+0xa4/0x17e
>  [<ffffffff8107ca2a>] ? async_schedule+0x12/0x12
>  [<ffffffff81071cf8>] ? process_one_work+0x259/0x381
>  [<ffffffff81072312>] ? worker_thread+0x2ad/0x3e3
>  [<ffffffff81082e50>] ? try_to_wake_up+0x1fc/0x20c
>  [<ffffffff81072065>] ? manage_workers+0x245/0x245
>  [<ffffffff81072065>] ? manage_workers+0x245/0x245
>  [<ffffffff8107746a>] ? kthread+0x81/0x89
>  [<ffffffff81791034>] ? kernel_thread_helper+0x4/0x10
>  [<ffffffff810773e9>] ? kthread_freezable_should_stop+0x4e/0x4e
>  [<ffffffff81791030>] ? gs_change+0xb/0xb
> Code: 5b c3 41 54 49 89 d4 55 53 48 89 f3 48 8b 6a 08 48 8b 42 10 48 85 ed 48 89 46 20 48 8b 42 18 48 89 46 28 74 5c f6 45 04 02 74 72 <0f> 0b eb fe 48 8b 02 48 8b 48 28 80 e1 40 74 24 31 f6 48 89 d7
> RIP [<ffffffff8130f9ab>] async_tx_submit+0x29/0xab
>  RSP <ffff88100940fb30>
> ---[ end trace 64fb561d16a3b535 ]---
> Kernel panic - not syncing: Fatal exception in interrupt
> Rebooting in 5 seconds..
>
> Do any of you guys have a clue about it?
>
> Thanks
>
> Laurent
>
> PS: The very same kernel doesn't cause any trouble on R510 hardware.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Regards/Gruss,
Boris.
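
A note for readers decoding the oops above: on x86, BUG() and BUG_ON() trap by executing the two-byte ud2 instruction (0f 0b), which is why the trace reports "invalid opcode", and the Code: line marks the byte at RIP with angle brackets. The following is a minimal C sketch of that check, not part of any kernel tool; the helper name code_line_faults_on_ud2() is hypothetical and exists only for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only (not a kernel or tool API).
 *
 * Returns 1 if the byte marked <xx> in an oops "Code:" line is 0f and the
 * byte after it is 0b (the x86 ud2 opcode), 0 if the marked bytes are
 * something else, and -1 if the line has no well-formed <xx> marker.
 */
int code_line_faults_on_ud2(const char *code_line)
{
	const char *open;
	char *end;
	char *end2;
	unsigned long first, second;

	assert(code_line != NULL);

	/* The faulting byte is the one wrapped in angle brackets. */
	open = strchr(code_line, '<');
	if (open == NULL)
		return -1;

	first = strtoul(open + 1, &end, 16);
	if (end == open + 1 || *end != '>')
		return -1;

	/* strtoul() skips the space that separates the bytes. */
	second = strtoul(end + 1, &end2, 16);
	if (end2 == end + 1)
		return -1;

	return first == 0x0f && second == 0x0b;
}
```

Fed the tail of the Code: line from this trace ("... 74 72 <0f> 0b eb fe ..."), the helper reports a ud2 at the faulting address, which matches the "kernel BUG at crypto/async_tx/async_tx.c:174!" line.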
* Re: Strange crash on Dell R720xd
From: Laurent CARON @ 2012-10-16 9:26 UTC
To: Borislav Petkov, linux-kernel, linux-raid

On Tue, Oct 16, 2012 at 11:03:53AM +0200, Borislav Petkov wrote:
> That's:
>
> 	BUG_ON(async_tx_test_ack(depend_tx) || txd_next(depend_tx) ||
> 	       txd_parent(tx));
>
> but probably the b0rkage happens up the stack. And this __raid_run_ops
> is probably starting the whole TX so maybe we should add
> linux-raid@vger.kernel.org to CC. Added.

Hi,

The machines seem stable after disabling I/O AT DMA at the BIOS level.

> What is that "r510" thing in the kernel version? You have your patches
> on top? If yes, please try reproducing this with a kernel.org kernel
> without anything else on top.

My kernel is vanilla from Kernel.org. The -r510 string is because I
tried it on a -r510 also.

> Also, it might be worth trying plain 3.6 to rule out a regression
> introduced in the stable 3.6 series.

I tried 3.5.x, 3.6, 3.6.1, 3.6.2 with exactly the same results.

For now, I did create more volumes, rsync lots of data over the network
to the disks with no crashes (after disabling I/O AT DMA).

...snip...
* Re: Strange crash on Dell R720xd
From: Borislav Petkov @ 2012-10-16 12:48 UTC
To: Laurent CARON; +Cc: linux-kernel, linux-raid, Vinod Koul, Dan Williams

On Tue, Oct 16, 2012 at 11:26:01AM +0200, Laurent CARON wrote:
> On Tue, Oct 16, 2012 at 11:03:53AM +0200, Borislav Petkov wrote:
> > That's:
> >
> > 	BUG_ON(async_tx_test_ack(depend_tx) || txd_next(depend_tx) ||
> > 	       txd_parent(tx));
> >
> > but probably the b0rkage happens up the stack. And this __raid_run_ops
> > is probably starting the whole TX so maybe we should add
> > linux-raid@vger.kernel.org to CC. Added.
>
>
> Hi,
>
> The machines seem stable after disabling I/O AT DMA at the BIOS level.

That's a good point because the backtrace goes through I/O AT DMA so it
could very well be the culprit. Let's add some more people to Cc.

Vinod/Dan, here's the BUG_ON Laurent is hitting:

http://marc.info/?l=linux-kernel&m=135033064724794&w=2

and it has ioat2_tx_submit_unlock in the backtrace. Disabling ioat dma
in the BIOS makes the issue disappear so ...

> > What is that "r510" thing in the kernel version? You have your patches
> > on top? If yes, please try reproducing this with a kernel.org kernel
> > without anything else on top.
>
> My kernel is vanilla from Kernel.org. The -r510 string is because I
> tried it on a -r510 also.

Ok, good.

> > Also, it might be worth trying plain 3.6 to rule out a regression
> > introduced in the stable 3.6 series.
>
> I tried 3.5.x, 3.6, 3.6.1, 3.6.2 with exactly the same results.
>
> For now, I did create more volumes, rsync lots of data over the network
> to the disks with no crashes (after disabling I/O AT DMA).

And when you do this with ioat dma enabled, you get the bug, right? So
it is reproducible...?

Thanks.

--
Regards/Gruss,
Boris.
* Re: Strange crash on Dell R720xd
From: Laurent CARON @ 2012-10-16 12:52 UTC
To: Borislav Petkov, linux-kernel, linux-raid, Vinod Koul, Dan Williams

On Tue, Oct 16, 2012 at 02:48:25PM +0200, Borislav Petkov wrote:
> On Tue, Oct 16, 2012 at 11:26:01AM +0200, Laurent CARON wrote:
> > On Tue, Oct 16, 2012 at 11:03:53AM +0200, Borislav Petkov wrote:
> > > That's:
> > >
> > > 	BUG_ON(async_tx_test_ack(depend_tx) || txd_next(depend_tx) ||
> > > 	       txd_parent(tx));
> > >
> > > but probably the b0rkage happens up the stack. And this __raid_run_ops
> > > is probably starting the whole TX so maybe we should add
> > > linux-raid@vger.kernel.org to CC. Added.
> >
> >
> > Hi,
> >
> > The machines seem stable after disabling I/O AT DMA at the BIOS level.
>
> That's a good point because the backtrace goes through I/O AT DMA so it
> could very well be the culprit. Let's add some more people to Cc.
>
> Vinod/Dan, here's the BUG_ON Laurent is hitting:
>
> http://marc.info/?l=linux-kernel&m=135033064724794&w=2
>
> and it has ioat2_tx_submit_unlock in the backtrace. Disabling ioat dma
> in the BIOS makes the issue disappear so ...
>
> > > What is that "r510" thing in the kernel version? You have your patches
> > > on top? If yes, please try reproducing this with a kernel.org kernel
> > > without anything else on top.
> >
> > My kernel is vanilla from Kernel.org. The -r510 string is because I
> > tried it on a -r510 also.
>
> Ok, good.
>
> > > Also, it might be worth trying plain 3.6 to rule out a regression
> > > introduced in the stable 3.6 series.
> >
> > I tried 3.5.x, 3.6, 3.6.1, 3.6.2 with exactly the same results.
> >
> > For now, I did create more volumes, rsync lots of data over the network
> > to the disks with no crashes (after disabling I/O AT DMA).
>
> And when you do this with ioat dma enabled, you get the bug, right? So
> it is reproducible...?

It is 100% reproducible. The only "nondeterministic" point is the time
it takes to have the machine crash.
* Re: Strange crash on Dell R720xd
From: Dan Williams @ 2012-10-16 17:58 UTC
To: Borislav Petkov, linux-kernel, linux-raid, Vinod Koul, Dan Williams

On Tue, Oct 16, 2012 at 5:52 AM, Laurent CARON <lcaron@unix-scripts.info> wrote:
> On Tue, Oct 16, 2012 at 02:48:25PM +0200, Borislav Petkov wrote:
>> On Tue, Oct 16, 2012 at 11:26:01AM +0200, Laurent CARON wrote:
>> > On Tue, Oct 16, 2012 at 11:03:53AM +0200, Borislav Petkov wrote:
>> > > That's:
>> > >
>> > > 	BUG_ON(async_tx_test_ack(depend_tx) || txd_next(depend_tx) ||
>> > > 	       txd_parent(tx));
>> > >
>> > > but probably the b0rkage happens up the stack. And this __raid_run_ops
>> > > is probably starting the whole TX so maybe we should add
>> > > linux-raid@vger.kernel.org to CC. Added.
>> >
>> >
>> > Hi,
>> >
>> > The machines seem stable after disabling I/O AT DMA at the BIOS level.
>>
>> That's a good point because the backtrace goes through I/O AT DMA so it
>> could very well be the culprit. Let's add some more people to Cc.
>>
>> Vinod/Dan, here's the BUG_ON Laurent is hitting:
>>
>> http://marc.info/?l=linux-kernel&m=135033064724794&w=2
>>
>> and it has ioat2_tx_submit_unlock in the backtrace. Disabling ioat dma
>> in the BIOS makes the issue disappear so ...
>>
>> > > What is that "r510" thing in the kernel version? You have your patches
>> > > on top? If yes, please try reproducing this with a kernel.org kernel
>> > > without anything else on top.
>> >
>> > My kernel is vanilla from Kernel.org. The -r510 string is because I
>> > tried it on a -r510 also.
>>
>> Ok, good.
>>
>> > > Also, it might be worth trying plain 3.6 to rule out a regression
>> > > introduced in the stable 3.6 series.
>> >
>> > I tried 3.5.x, 3.6, 3.6.1, 3.6.2 with exactly the same results.
>> >
>> > For now, I did create more volumes, rsync lots of data over the network
>> > to the disks with no crashes (after disabling I/O AT DMA).
>>
>> And when you do this with ioat dma enabled, you get the bug, right? So
>> it is reproducible...?
>
> It is 100% reproducible. The only "nondeterministic" point is the time
> it takes to have the machine crash.
>

I think this may be a bug in __raid_run_ops that is only possible when
raid offload and CONFIG_MULTICORE_RAID456 are enabled. I'm thinking
the descriptor is completed and recycled to another requester in the
space between these two events:

	ops_run_compute();

	/* terminate the chain if reconstruct is not set to be run */
	if (tx && !test_bit(STRIPE_OP_RECONSTRUCT, &ops_request))
		async_tx_ack(tx);

...don't use the experimental CONFIG_MULTICORE_RAID456 even if you
leave IOAT DMA disabled. A rework of the raid operation dma chaining
is in progress, but may not be ready for a while.

--
Dan
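
To make the hypothesis above more concrete, here is a user-space toy model in C. It is only a sketch under stated assumptions: struct toy_desc, toy_submit_would_bug(), toy_recycle_and_chain() and toy_demo() are hypothetical names, not kernel APIs, and the model mirrors only the shape of the BUG_ON() condition quoted earlier in the thread. It does not model the real driver's locking or descriptor lifetime. It shows how a caller that still holds a pointer to a descriptor would trip the check once another requester has reused and chained that descriptor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a DMA descriptor; NOT the kernel's descriptor type. */
struct toy_desc {
	bool acked;              /* models what async_tx_test_ack() tests */
	struct toy_desc *next;   /* models what txd_next() returns */
	struct toy_desc *parent; /* models what txd_parent() returns */
};

/* Same shape as the BUG_ON() condition quoted in the thread. */
bool toy_submit_would_bug(const struct toy_desc *depend_tx,
			  const struct toy_desc *tx)
{
	assert(depend_tx != NULL && tx != NULL);
	return depend_tx->acked || depend_tx->next != NULL ||
	       tx->parent != NULL;
}

/*
 * Assumed behaviour of "another requester": it takes over the completed
 * descriptor slot and chains its own operation behind it.
 */
void toy_recycle_and_chain(struct toy_desc *slot, struct toy_desc *other)
{
	slot->acked = false;
	slot->parent = NULL;
	slot->next = other;
	other->parent = slot;
}

/* Returns how many of the two submit attempts would hit the check. */
int toy_demo(void)
{
	struct toy_desc slot = {0}, mine = {0}, theirs = {0};
	int hits = 0;

	/* Fresh dependency, nothing chained yet: the check passes. */
	if (toy_submit_would_bug(&slot, &mine))
		hits++;

	/* The slot is reused while the first caller still points at it. */
	toy_recycle_and_chain(&slot, &theirs);

	/* The stale pointer now has a successor: the check fires. */
	if (toy_submit_would_bug(&slot, &mine))
		hits++;

	return hits;
}
```

In this model toy_demo() returns 1: the first submit passes and the second trips the condition. Whether the real crash takes exactly this path is not established in the thread.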
* Re: Strange crash on Dell R720xd
From: Laurent CARON @ 2012-10-17 7:31 UTC
To: Dan Williams; +Cc: Borislav Petkov, linux-kernel, linux-raid, Vinod Koul

On Tue, Oct 16, 2012 at 10:58:49AM -0700, Dan Williams wrote:
> I think this may be a bug in __raid_run_ops that is only possible when
> raid offload and CONFIG_MULTICORE_RAID456 are enabled. I'm thinking
> the descriptor is completed and recycled to another requester in the
> space between these two events:
>
> 	ops_run_compute();
>
> 	/* terminate the chain if reconstruct is not set to be run */
> 	if (tx && !test_bit(STRIPE_OP_RECONSTRUCT, &ops_request))
> 		async_tx_ack(tx);
>
> ...don't use the experimental CONFIG_MULTICORE_RAID456 even if you
> leave IOAT DMA disabled. A rework of the raid operation dma chaining
> is in progress, but may not be ready for a while.

Hi,

I usually don't use CONFIG_MULTICORE_RAID456 as it proved to be sluggish
and/or unstable in my experience, so I should be pretty safe leaving
I/O AT DMA disabled for now on those boxes.

Thanks