* Re: raid1 boot regression in 2.6.37 [bisected]
  [not found] <201103251725.21180.thomas.jarosch@intra2net.com>
From: Tejun Heo @ 2011-03-28  7:59 UTC
To: Thomas Jarosch; +Cc: linux-raid, Neil Brown

Hello,

(cc'ing Neil and quoting whole body)

On Fri, Mar 25, 2011 at 05:25:20PM +0100, Thomas Jarosch wrote:
> Hello,
>
> I've just updated from kernel 2.6.34.7 to kernel 2.6.37.5 and one
> HP Proliant DL320 G3 box with a raid1 software RAID stopped booting.
> (also two other non-HP boxes).
>
> We run this script at boot time via dracut:
> ----------------------------------
> #!/bin/sh
> . /lib/dracut-lib.sh
>
> info "Telling kernel to auto-detect RAID arrays"
> /sbin/initqueue --settled --name kerneldetectraid /sbin/mdadm --auto-detect
> ----------------------------------
>
> With the "bad" commit in place, the kernel doesn't output
> any md message at all. I've bisected it down to this commit:
>
> e804ac780e2f01cb3b914daca2fd4780d1743db1 is the first bad commit
> commit e804ac780e2f01cb3b914daca2fd4780d1743db1
> Author: Tejun Heo <tj@kernel.org>
> Date:   Fri Oct 15 15:36:08 2010 +0200
>
>     md: fix and update workqueue usage
>
>     Workqueue usage in md has two problems.
>
>     * Flush can be used during or depended upon by memory reclaim, but md
>       uses the system workqueue for flush_work which may lead to deadlock.
>
>     * md depends on flush_scheduled_work() to achieve exclusion against
>       completion of removal of previous instances. flush_scheduled_work()
>       may incur unexpected amount of delay and is scheduled to be removed.
>
>     This patch adds two workqueues to md - md_wq and md_misc_wq. The
>     former is guaranteed to make forward progress under memory pressure
>     and serves flush_work. The latter serves as the flush domain for
>     other works.
>
>     Signed-off-by: Tejun Heo <tj@kernel.org>
>     Signed-off-by: NeilBrown <neilb@suse.de>
>
> :040000 040000 f6b6a34a71864263ed253866c5f8abe7f766ac6b
> dc2eff4a91825142b7c88cf54751fc7acdf1a6d2 M      drivers
>
> I manually verified that the commit before it
> (57dab0bdf689d42972975ec646d862b0900a4bf3) works
> and the "bad" commit prevents the box from booting.
>
> Some more info:
>
> # mdadm --version
> mdadm - v2.6.9 - 10th March 2009
>
> # mdadm --detail /dev/md0
> /dev/md0:
>         Version : 0.90
>   Creation Time : Wed May 27 17:52:40 2009
>      Raid Level : raid1
>      Array Size : 2562240 (2.44 GiB 2.62 GB)
>   Used Dev Size : 2562240 (2.44 GiB 2.62 GB)
>    Raid Devices : 2
>   Total Devices : 2
> Preferred Minor : 0
>     Persistence : Superblock is persistent
>
>     Update Time : Fri Mar 25 17:11:33 2011
>           State : clean
>  Active Devices : 2
> Working Devices : 2
>  Failed Devices : 0
>   Spare Devices : 0
>
>            UUID : 0ee8da2c:5803478b:e399b924:6520c535
>          Events : 0.160
>
>     Number   Major   Minor   RaidDevice State
>        0       8        1        0      active sync   /dev/sda1
>        1       8       17        1      active sync   /dev/sdb1
>
> Any idea what might go wrong? May be building a kernel
> with lock debugging on Monday might help.
>
> Unfortunately bugzilla.kernel.org is currently down,
> so I can't look for a possible existing bug/solution.

I don't think it's a reported problem. How does it fail? Things just
stop? As you wrote in the other mail, lockdep would definitely help.
Another thing which can be helpful is sysrq-t and see where things are
stuck.

Thanks.

--
tejun
* Re: raid1 boot regression in 2.6.37 [bisected] 2011-03-28 7:59 ` raid1 boot regression in 2.6.37 [bisected] Tejun Heo @ 2011-03-28 11:02 ` Thomas Jarosch 2011-03-28 12:53 ` Thomas Jarosch 0 siblings, 1 reply; 22+ messages in thread From: Thomas Jarosch @ 2011-03-28 11:02 UTC (permalink / raw) To: Tejun Heo, linux-raid; +Cc: Neil Brown Hello Tejun, On Monday, 28. March 2011 09:59:37 Tejun Heo wrote: > I don't think it's a reported problem. How does it fail? Things just > stop? As you wrote in the other mail, lockdep would definitely help. > Another thing which can be helpful is sysrq-t and see where things are > stuck. It just stops. I enabled kernel debugging and the USB serial console. Here are the gory details: -------------------------------------------- BUG: soft lockup - CPU#0 stuck for 61s! [blkid:1113] Modules linked in: ahci libahci libata ehci_hcd ohci_hcd dm_mod [last unloaded: scsi_wait_scan] irq event stamp: 2136355122 hardirqs last enabled at (2136355121): [<c12c3532>] _raw_spin_unlock_irq+0x22/0 x30 hardirqs last disabled at (2136355122): [<c12c3d4b>] apic_timer_interrupt+0x2f/0 x3c softirqs last enabled at (2136337366): [<c103d98c>] __do_softirq+0xfc/0x150 softirqs last disabled at (2136337351): [<c1005624>] do_softirq+0xa4/0x100 Modules linked in: ahci libahci libata ehci_hcd ohci_hcd dm_mod [last unloaded: scsi_wait_scan] Pid: 1113, comm: blkid Not tainted 2.6.36+ #24 MS-7349/MS-7349 EIP: 0060:[<c12c3536>] EFLAGS: 00000202 CPU: 0 EIP is at _raw_spin_unlock_irq+0x26/0x30 EAX: 7f563131 EBX: f6407000 ECX: 00000006 EDX: f60e98f4 ESI: 00000000 EDI: f61ce9c0 EBP: f5641d24 ESP: f5641d20 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 Process blkid (pid: 1113, ti=f5640000 task=f60e9530 task.ti=f5640000) Stack: f640b800 f5641d50 c104b646 001ce9f4 00000005 f61cea20 00000001 00000002 <0> f6407000 f61ce9c0 00000002 f5641d80 f5641df4 c104c990 00000000 00000002 <0> 00000000 c104c7e0 f60e9548 f60e98f4 00000010 f61ce9d0 f61cea30 00000001 Call Trace: [<c104b646>] ? flush_workqueue_prep_cwqs+0x106/0x200 [<c104c990>] ? flush_workqueue+0x1b0/0x430 [<c104c7e0>] ? flush_workqueue+0x0/0x430 [<c1213fb6>] ? md_open+0x96/0xa0 [<c10e4aa4>] ? __blkdev_get+0x74/0x310 [<c10e4d4a>] ? blkdev_get+0xa/0x10 [<c10e4e0e>] ? blkdev_open+0xbe/0x110 [<c10b9c50>] ? __dentry_open+0xa0/0x260 [<c10b9f01>] ? nameidata_to_filp+0x51/0x60 [<c10e4d50>] ? blkdev_open+0x0/0x110 [<c10c59a7>] ? do_last+0x327/0x600 [<c10c7772>] ? do_filp_open+0x1a2/0x4c0 [<c10a5eb7>] ? handle_mm_fault+0x137/0x8a0 [<c12c35bd>] ? _raw_spin_unlock+0x1d/0x20 [<c10d12c8>] ? alloc_fd+0xb8/0xf0 [<c10b9a51>] ? do_sys_open+0x51/0x100 [<c10b9b69>] ? sys_open+0x29/0x40 [<c1002e9f>] ? sysenter_do_call+0x12/0x38 Code: 26 00 00 00 00 55 ba 01 00 00 00 89 e5 8b 4d 04 53 89 c3 8d 40 10 e8 7a 40 da ff 89 d8 e8 b3 b7 e8 ff e8 be 11 da ff fb 66 66 90 <66> 90 5b 5d c3 90 8d 74 26 00 55 89 e5 83 ec 08 8b 4d 04 89 1c Call Trace: [<c104b646>] flush_workqueue_prep_cwqs+0x106/0x200 [<c104c990>] flush_workqueue+0x1b0/0x430 [<c104c7e0>] ? flush_workqueue+0x0/0x430 [<c1213fb6>] md_open+0x96/0xa0 [<c10e4aa4>] __blkdev_get+0x74/0x310 [<c10e4d4a>] blkdev_get+0xa/0x10 [<c10e4e0e>] blkdev_open+0xbe/0x110 [<c10b9c50>] __dentry_open+0xa0/0x260 [<c10b9f01>] nameidata_to_filp+0x51/0x60 [<c10e4d50>] ? blkdev_open+0x0/0x110 [<c10c59a7>] do_last+0x327/0x600 [<c10c7772>] do_filp_open+0x1a2/0x4c0 [<c10a5eb7>] ? handle_mm_fault+0x137/0x8a0 [<c12c35bd>] ? _raw_spin_unlock+0x1d/0x20 [<c10d12c8>] ? 
alloc_fd+0xb8/0xf0 [<c10b9a51>] do_sys_open+0x51/0x100 [<c10b9b69>] sys_open+0x29/0x40 [<c1002e9f>] sysenter_do_call+0x12/0x38 -------------------------------------------- Or a similar looking one: -------------------------------------------- BUG: soft lockup - CPU#0 stuck for 63s! [blkid:1113] Modules linked in: ahci libahci libata ehci_hcd ohci_hcd dm_mod [last unloaded: scsi_wait_scan] irq event stamp: 3897487660 hardirqs last enabled at (3897487659): [<c12c15c4>] __mutex_unlock_slowpath+0xb 4/0x150 hardirqs last disabled at (3897487660): [<c12c3d4b>] apic_timer_interrupt+0x2f/0 x3c softirqs last enabled at (3897470978): [<c103d98c>] __do_softirq+0xfc/0x150 softirqs last disabled at (3897470965): [<c1005624>] do_softirq+0xa4/0x100 Modules linked in: ahci libahci libata ehci_hcd ohci_hcd dm_mod [last unloaded: scsi_wait_scan] Pid: 1113, comm: blkid Not tainted 2.6.36+ #24 MS-7349/MS-7349 EIP: 0060:[<c1067298>] EFLAGS: 00000246 CPU: 0 EIP is at lock_acquire+0x88/0xa0 EAX: 00000246 EBX: f60e9530 ECX: 00000000 EDX: c1823eb0 ESI: 00000000 EDI: f5dc5d08 EBP: f5641dd0 ESP: f5641da8 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 Process blkid (pid: 1113, ti=f5640000 task=f60e9530 task.ti=f5640000) Stack: 00000000 00000002 00000000 00000000 c10e4a5a 00000000 00000246 f5dc5cd0 <0> 00000000 f5dc5cd0 f5641e0c c12c1eb1 00000000 00000002 00000000 c10e4a5a <0> f61c0800 f60e9530 00000000 f5641e2c f5dc5cd0 f5641e0c f5dc5cc0 f5520400 Call Trace: [<c10e4a5a>] ? __blkdev_get+0x2a/0x310 [<c12c1eb1>] ? mutex_lock_nested+0x51/0x2f0 [<c10e4a5a>] ? __blkdev_get+0x2a/0x310 [<c10e4a5a>] ? __blkdev_get+0x2a/0x310 [<c10e4d4a>] ? blkdev_get+0xa/0x10 [<c10e4e0e>] ? blkdev_open+0xbe/0x110 [<c10b9c50>] ? __dentry_open+0xa0/0x260 [<c10b9f01>] ? nameidata_to_filp+0x51/0x60 [<c10e4d50>] ? blkdev_open+0x0/0x110 [<c10c59a7>] ? do_last+0x327/0x600 [<c10c7772>] ? do_filp_open+0x1a2/0x4c0 [<c10a5eb7>] ? handle_mm_fault+0x137/0x8a0 [<c12c35bd>] ? _raw_spin_unlock+0x1d/0x20 [<c10d12c8>] ? alloc_fd+0xb8/0xf0 [<c10b9a51>] ? do_sys_open+0x51/0x100 [<c10b9b69>] ? sys_open+0x29/0x40 [<c1002e9f>] ? sysenter_do_call+0x12/0x38 Code: 01 83 e0 01 89 44 24 08 8b 45 0c 89 44 24 04 8b 45 08 89 04 24 89 f8 e8 77 e7 ff ff c7 83 a8 03 00 00 00 00 00 00 8b 45 f0 50 9d <66> 66 66 90 8b 5d f4 8b 75 f8 8b 7d fc 89 ec 5d c3 8d b4 26 00 Call Trace: [<c10e4a5a>] ? __blkdev_get+0x2a/0x310 [<c12c1eb1>] mutex_lock_nested+0x51/0x2f0 [<c10e4a5a>] ? __blkdev_get+0x2a/0x310 [<c10e4a5a>] __blkdev_get+0x2a/0x310 [<c10e4d4a>] blkdev_get+0xa/0x10 [<c10e4e0e>] blkdev_open+0xbe/0x110 [<c10b9c50>] __dentry_open+0xa0/0x260 [<c10b9f01>] nameidata_to_filp+0x51/0x60 [<c10e4d50>] ? blkdev_open+0x0/0x110 [<c10c59a7>] do_last+0x327/0x600 [<c10c7772>] do_filp_open+0x1a2/0x4c0 [<c10a5eb7>] ? handle_mm_fault+0x137/0x8a0 [<c12c35bd>] ? _raw_spin_unlock+0x1d/0x20 [<c10d12c8>] ? alloc_fd+0xb8/0xf0 [<c10b9a51>] do_sys_open+0x51/0x100 [<c10b9b69>] sys_open+0x29/0x40 [<c1002e9f>] sysenter_do_call+0x12/0x38 -------------------------------------------- I'll try to reproduce the backtrace with 2.6.38 and see if it's any different. Cheers, Thomas ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: raid1 boot regression in 2.6.37 [bisected] 2011-03-28 11:02 ` Thomas Jarosch @ 2011-03-28 12:53 ` Thomas Jarosch 2011-03-28 15:59 ` Tejun Heo 0 siblings, 1 reply; 22+ messages in thread From: Thomas Jarosch @ 2011-03-28 12:53 UTC (permalink / raw) To: Tejun Heo; +Cc: linux-raid, Neil Brown Hi, here's a backtrace from vanilla 2.6.38: -------------------------------------------- BUG: soft lockup - CPU#0 stuck for 64s! [blkid:1143] Modules linked in: ahci libahci libata dm_mod [last unloaded: scsi_wait_scan] irq event stamp: 981226512 hardirqs last enabled at (981226511): [<c12dc764>] __mutex_unlock_slowpath+0xb4 /0x150 hardirqs last disabled at (981226512): [<c12deeeb>] apic_timer_interrupt+0x2f/0x 3c softirqs last enabled at (981207466): [<c103dbfd>] __do_softirq+0xed/0x140 softirqs last disabled at (981207451): [<c100558d>] do_softirq+0x9d/0xf0 Modules linked in: ahci libahci libata dm_mod [last unloaded: scsi_wait_scan] Pid: 1143, comm: blkid Not tainted 2.6.38 #26 MSI MS-7349/MS-7349 EIP: 0060:[<c12dc768>] EFLAGS: 00000246 CPU: 0 EIP is at __mutex_unlock_slowpath+0xb8/0x150 EAX: 00000246 EBX: c14252a0 ECX: 00000001 EDX: 00000001 ESI: c14252c4 EDI: 00000246 EBP: f568fd8c ESP: f568fd80 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 Process blkid (pid: 1143, ti=f568e000 task=f615c030 task.ti=f568e000) Stack: 00000000 f545a9c0 00900000 f568fd94 c12dc808 f568fdc0 c11c4451 c11323b0 f568fdf0 f55e2e00 00000009 f61ae800 00000000 00000000 f568fdf0 f5df7490 f568fdd0 c11331b8 f5df7480 f55e2e00 f568fe00 c10e85aa f615c030 00000000 Call Trace: [<c12dc808>] mutex_unlock+0x8/0x10 [<c11c4451>] kobj_lookup+0xe1/0x140 [<c11323b0>] ? exact_match+0x0/0x10 [<c11331b8>] get_gendisk+0x98/0xb0 [<c10e85aa>] __blkdev_get+0xca/0x320 [<c10e8843>] blkdev_get+0x43/0x2c0 [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20 [<c10e8b12>] blkdev_open+0x52/0x70 [<c10bb12d>] __dentry_open+0x9d/0x240 [<c10bb3c6>] nameidata_to_filp+0x66/0x80 [<c10e8ac0>] ? blkdev_open+0x0/0x70 [<c10c781f>] finish_open+0xaf/0x190 [<c10c8a24>] ? do_path_lookup+0x44/0xe0 [<c10c9920>] do_filp_open+0x210/0x6d0 [<c10672e9>] ? lock_release_non_nested+0x59/0x2f0 [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20 [<c10d47d8>] ? alloc_fd+0xb8/0xf0 [<c10baf45>] do_sys_open+0x55/0xf0 [<c10bb049>] sys_open+0x29/0x40 [<c1002e9f>] sysenter_do_call+0x12/0x38 Code: 00 00 39 c6 74 11 89 d8 89 f2 e8 14 5a d8 ff 8b 46 08 e8 4c 75 d5 ff fe 43 04 f7 c7 00 02 00 00 74 21 e8 6c 7f d8 ff 89 f8 50 9d <66> 66 66 90 8b 1c 24 8b 74 24 04 8b 7c 24 08 89 ec 5d c3 90 8d Call Trace: [<c12dc808>] mutex_unlock+0x8/0x10 [<c11c4451>] kobj_lookup+0xe1/0x140 [<c11323b0>] ? exact_match+0x0/0x10 [<c11331b8>] get_gendisk+0x98/0xb0 [<c10e85aa>] __blkdev_get+0xca/0x320 [<c10e8843>] blkdev_get+0x43/0x2c0 [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20 [<c10e8b12>] blkdev_open+0x52/0x70 [<c10bb12d>] __dentry_open+0x9d/0x240 [<c10bb3c6>] nameidata_to_filp+0x66/0x80 [<c10e8ac0>] ? blkdev_open+0x0/0x70 [<c10c781f>] finish_open+0xaf/0x190 [<c10c8a24>] ? do_path_lookup+0x44/0xe0 [<c10c9920>] do_filp_open+0x210/0x6d0 [<c10672e9>] ? lock_release_non_nested+0x59/0x2f0 [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20 [<c10d47d8>] ? alloc_fd+0xb8/0xf0 [<c10baf45>] do_sys_open+0x55/0xf0 [<c10bb049>] sys_open+0x29/0x40 [<c1002e9f>] sysenter_do_call+0x12/0x38 -------------------------------------------- Or this one: -------------------------------------------- BUG: soft lockup - CPU#0 stuck for 63s! 
[blkid:1143] Modules linked in: ahci libahci libata dm_mod [last unloaded: scsi_wait_scan] irq event stamp: 2778370780 hardirqs last enabled at (2778370779): [<c12de715>] _raw_spin_unlock_irqrestore +0x35/0x60 hardirqs last disabled at (2778370780): [<c12deeeb>] apic_timer_interrupt+0x2f/0 x3c softirqs last enabled at (2778355336): [<c103dbfd>] __do_softirq+0xed/0x140 softirqs last disabled at (2778355323): [<c100558d>] do_softirq+0x9d/0xf0 Modules linked in: ahci libahci libata dm_mod [last unloaded: scsi_wait_scan] Pid: 1143, comm: blkid Not tainted 2.6.38 #26 MSI MS-7349/MS-7349 EIP: 0060:[<c12de719>] EFLAGS: 00000286 CPU: 0 EIP is at _raw_spin_unlock_irqrestore+0x39/0x60 EAX: 00000286 EBX: f568fd54 ECX: 00000006 EDX: f615c3ac ESI: 00000286 EDI: f568fd50 EBP: f568fccc ESP: f568fcc4 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 Process blkid (pid: 1143, ti=f568e000 task=f615c030 task.ti=f568e000) Stack: f568fd54 00000286 f568fce8 c102cd89 00000000 00000000 00000007 00000008 f61b0840 f568fd14 c104b9c8 001b0874 0000000b f61b08a0 00000007 00000008 f6503100 f61b0840 00000008 f568fd44 f568fdb8 c104cd20 00000000 00000002 Call Trace: [<c102cd89>] complete+0x49/0x60 [<c104b9c8>] flush_workqueue_prep_cwqs+0x138/0x200 [<c104cd20>] flush_workqueue+0x1b0/0x430 [<c104cb70>] ? flush_workqueue+0x0/0x430 [<c1229e79>] md_open+0x99/0xb0 [<c10e8557>] __blkdev_get+0x77/0x320 [<c10e8843>] blkdev_get+0x43/0x2c0 [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20 [<c10e8b12>] blkdev_open+0x52/0x70 [<c10bb12d>] __dentry_open+0x9d/0x240 [<c10bb3c6>] nameidata_to_filp+0x66/0x80 [<c10e8ac0>] ? blkdev_open+0x0/0x70 [<c10c781f>] finish_open+0xaf/0x190 [<c10c8a24>] ? do_path_lookup+0x44/0xe0 [<c10c9920>] do_filp_open+0x210/0x6d0 [<c10672e9>] ? lock_release_non_nested+0x59/0x2f0 [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20 [<c10d47d8>] ? alloc_fd+0xb8/0xf0 [<c10baf45>] do_sys_open+0x55/0xf0 [<c10bb049>] sys_open+0x29/0x40 [<c1002e9f>] sysenter_do_call+0x12/0x38 Code: 89 74 24 04 8d 40 10 89 d6 ba 01 00 00 00 e8 7f 8e d8 ff 89 d8 e8 48 5b e7 ff f7 c6 00 02 00 00 74 18 e8 bb 5f d8 ff 89 f0 50 9d <66> 66 66 90 8b 1c 24 8b 74 24 04 89 ec 5d c3 89 f0 50 9d 66 66 Call Trace: [<c102cd89>] complete+0x49/0x60 [<c104b9c8>] flush_workqueue_prep_cwqs+0x138/0x200 [<c104cd20>] flush_workqueue+0x1b0/0x430 [<c104cb70>] ? flush_workqueue+0x0/0x430 [<c1229e79>] md_open+0x99/0xb0 [<c10e8557>] __blkdev_get+0x77/0x320 [<c10e8843>] blkdev_get+0x43/0x2c0 [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20 [<c10e8b12>] blkdev_open+0x52/0x70 [<c10bb12d>] __dentry_open+0x9d/0x240 [<c10bb3c6>] nameidata_to_filp+0x66/0x80 [<c10e8ac0>] ? blkdev_open+0x0/0x70 [<c10c781f>] finish_open+0xaf/0x190 [<c10c8a24>] ? do_path_lookup+0x44/0xe0 [<c10c9920>] do_filp_open+0x210/0x6d0 [<c10672e9>] ? lock_release_non_nested+0x59/0x2f0 [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20 [<c10d47d8>] ? alloc_fd+0xb8/0xf0 [<c10baf45>] do_sys_open+0x55/0xf0 [<c10bb049>] sys_open+0x29/0x40 [<c1002e9f>] sysenter_do_call+0x12/0x38 -------------------------------------------- Cheers, Thomas ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: raid1 boot regression in 2.6.37 [bisected]
From: Tejun Heo @ 2011-03-28 15:59 UTC
To: Thomas Jarosch; +Cc: linux-raid, Neil Brown

Hello,

On Mon, Mar 28, 2011 at 02:53:53PM +0200, Thomas Jarosch wrote:
> Pid: 1143, comm: blkid Not tainted 2.6.38 #26 MSI MS-7349/MS-7349
> EIP: 0060:[<c12dc768>] EFLAGS: 00000246 CPU: 0
> EIP is at __mutex_unlock_slowpath+0xb8/0x150
> EAX: 00000246 EBX: c14252a0 ECX: 00000001 EDX: 00000001
> ESI: c14252c4 EDI: 00000246 EBP: f568fd8c ESP: f568fd80
>  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> Process blkid (pid: 1143, ti=f568e000 task=f615c030 task.ti=f568e000)
> Stack:
>  00000000 f545a9c0 00900000 f568fd94 c12dc808 f568fdc0 c11c4451 c11323b0
>  f568fdf0 f55e2e00 00000009 f61ae800 00000000 00000000 f568fdf0 f5df7490
>  f568fdd0 c11331b8 f5df7480 f55e2e00 f568fe00 c10e85aa f615c030 00000000
> Call Trace:
>  [<c12dc808>] mutex_unlock+0x8/0x10
>  [<c11c4451>] kobj_lookup+0xe1/0x140
>  [<c11323b0>] ? exact_match+0x0/0x10
>  [<c11331b8>] get_gendisk+0x98/0xb0
>  [<c10e85aa>] __blkdev_get+0xca/0x320
>  [<c10e8843>] blkdev_get+0x43/0x2c0
>  [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20
>  [<c10e8b12>] blkdev_open+0x52/0x70
>  [<c10bb12d>] __dentry_open+0x9d/0x240
>  [<c10bb3c6>] nameidata_to_filp+0x66/0x80
>  [<c10e8ac0>] ? blkdev_open+0x0/0x70
>  [<c10c781f>] finish_open+0xaf/0x190
>  [<c10c8a24>] ? do_path_lookup+0x44/0xe0
>  [<c10c9920>] do_filp_open+0x210/0x6d0
>  [<c10672e9>] ? lock_release_non_nested+0x59/0x2f0
>  [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20
>  [<c10d47d8>] ? alloc_fd+0xb8/0xf0
>  [<c10baf45>] do_sys_open+0x55/0xf0
>  [<c10bb049>] sys_open+0x29/0x40
>  [<c1002e9f>] sysenter_do_call+0x12/0x38

Hmmm... Weird.

* blkid seems to be looping in blkdev_open() repeatedly calling
  md_open() which keeps returning -ERESTARTSYS.

* It triggered softlockup. Even with -ERESTARTSYS looping, I can't
  see how that would be possible.

Is this custom boot script? If so, do you use RT priority in the
script?

Thanks.

--
tejun
* Re: raid1 boot regression in 2.6.37 [bisected]
From: Thomas Jarosch @ 2011-03-28 19:46 UTC
To: Tejun Heo; +Cc: linux-raid, Neil Brown

On 03/28/2011 05:59 PM, Tejun Heo wrote:
>> Call Trace:
>>  [<c12dc808>] mutex_unlock+0x8/0x10
>>  [<c11c4451>] kobj_lookup+0xe1/0x140
>>  [<c11323b0>] ? exact_match+0x0/0x10
>>  [<c11331b8>] get_gendisk+0x98/0xb0
>>  [<c10e85aa>] __blkdev_get+0xca/0x320
>>  [<c10e8843>] blkdev_get+0x43/0x2c0
>>  [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20
>>  [<c10e8b12>] blkdev_open+0x52/0x70
>>  [<c10bb12d>] __dentry_open+0x9d/0x240
>>  [<c10bb3c6>] nameidata_to_filp+0x66/0x80
>>  [<c10e8ac0>] ? blkdev_open+0x0/0x70
>>  [<c10c781f>] finish_open+0xaf/0x190
>>  [<c10c8a24>] ? do_path_lookup+0x44/0xe0
>>  [<c10c9920>] do_filp_open+0x210/0x6d0
>>  [<c10672e9>] ? lock_release_non_nested+0x59/0x2f0
>>  [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20
>>  [<c10d47d8>] ? alloc_fd+0xb8/0xf0
>>  [<c10baf45>] do_sys_open+0x55/0xf0
>>  [<c10bb049>] sys_open+0x29/0x40
>>  [<c1002e9f>] sysenter_do_call+0x12/0x38
>
> Hmmm... Weird.
>
> * blkid seems to be looping in blkdev_open() repeatedly calling
>   md_open() which keeps returning -ERESTARTSYS.
>
> * It triggered softlockup. Even with -ERESTARTSYS looping, I can't
>   see how that would be possible.
>
> Is this custom boot script? If so, do you use RT priority in the
> script?

It's a normal dracut installation with an additional custom script
to trigger kernel raid auto detection via mdadm.
The custom script was part of the initial post.

I've also noticed another odd thing: On a HP Proliant ML110 G6 box,
which is quite fast / SMP, the box brings up the software
RAID successfully. The box is slow as hell and I can see a constant load
on a kernel process (could be "kworker", don't remember it exactly).
I'll try tomorrow if that is also related to the RAID subsystem
or something else turning it into a PDP11...

Thanks for your help so far!

Cheers,
Thomas
* Re: raid1 boot regression in 2.6.37 [bisected]
From: Roberto Spadim @ 2011-03-28 19:59 UTC
To: Thomas Jarosch; +Cc: Tejun Heo, linux-raid, Neil Brown

That's what I did, just my own experience, not a linux-raid list opinion:
I had some problems with an HP server (the same model but G5) and new
Intel Core i7 desktops with Archlinux and kernel 2.6.37. Try an older
kernel version, just to keep your production server stable. Maybe some
changes between kernel 2.6.32 and 2.6.37 made md work worse; I will wait
for further kernel changes, since realtime work is being developed and
many modifications are in progress.

Using the same installation process with Archlinux on a Dell (I don't
remember the model, but I bought it in January 2011) removed the problem
(maybe a problem in the HP SAS interface/driver?! I didn't test the HP
with the Dell interface).

2011/3/28 Thomas Jarosch <thomas.jarosch@intra2net.com>:
> On 03/28/2011 05:59 PM, Tejun Heo wrote:
>>> Call Trace:
>>>  [<c12dc808>] mutex_unlock+0x8/0x10
>>>  [<c11c4451>] kobj_lookup+0xe1/0x140
>>>  [<c11323b0>] ? exact_match+0x0/0x10
>>>  [<c11331b8>] get_gendisk+0x98/0xb0
>>>  [<c10e85aa>] __blkdev_get+0xca/0x320
>>>  [<c10e8843>] blkdev_get+0x43/0x2c0
>>>  [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20
>>>  [<c10e8b12>] blkdev_open+0x52/0x70
>>>  [<c10bb12d>] __dentry_open+0x9d/0x240
>>>  [<c10bb3c6>] nameidata_to_filp+0x66/0x80
>>>  [<c10e8ac0>] ? blkdev_open+0x0/0x70
>>>  [<c10c781f>] finish_open+0xaf/0x190
>>>  [<c10c8a24>] ? do_path_lookup+0x44/0xe0
>>>  [<c10c9920>] do_filp_open+0x210/0x6d0
>>>  [<c10672e9>] ? lock_release_non_nested+0x59/0x2f0
>>>  [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20
>>>  [<c10d47d8>] ? alloc_fd+0xb8/0xf0
>>>  [<c10baf45>] do_sys_open+0x55/0xf0
>>>  [<c10bb049>] sys_open+0x29/0x40
>>>  [<c1002e9f>] sysenter_do_call+0x12/0x38
>>
>> Hmmm... Weird.
>>
>> * blkid seems to be looping in blkdev_open() repeatedly calling
>>   md_open() which keeps returning -ERESTARTSYS.
>>
>> * It triggered softlockup. Even with -ERESTARTSYS looping, I can't
>>   see how that would be possible.
>>
>> Is this custom boot script? If so, do you use RT priority in the
>> script?
>
> It's a normal dracut installation with an additional custom script
> to trigger kernel raid auto detection via mdadm.
> The custom script was part of the initial post.
>
> I've also noticed another odd thing: On a HP Proliant ML110 G6 box,
> which is quite fast / SMP, the box brings up the software
> RAID successfully. The box is slow as hell and I can see a constant load
> on a kernel process (could be "kworker", don't remember it exactly).
> I'll try tomorrow if that is also related to the RAID subsystem
> or something else turning it into a PDP11...
>
> Thanks for your help so far!
>
> Cheers,
> Thomas

--
Roberto Spadim
Spadim Technology / SPAEmpresarial
* Re: raid1 boot regression in 2.6.37 [bisected]
From: Thomas Jarosch @ 2011-03-29 12:06 UTC
To: Roberto Spadim; +Cc: linux-raid, Tejun Heo, Neil Brown

Hello Roberto,

On Monday, 28. March 2011 21:59:37 Roberto Spadim wrote:
> i had some problems with hp server (the same model but G5) and new
> intel core i7 desktops with archlinux and kernel 2.6.37, try to get a
> older kernel version, just to make your production server stable

Can you describe in more detail what kind of issue you were having?

Didn't it boot at all or was it just slow / unusable?
Was the issue related to md?

Cheers,
Thomas
* Re: raid1 boot regression in 2.6.37 [bisected]
From: Roberto Spadim @ 2011-03-29 12:22 UTC
To: Thomas Jarosch; +Cc: linux-raid, Tejun Heo, Neil Brown

It booted without problem and worked with apache, mysql and softraid (md),
but after some time it 'crashed', with many messages like "kernel not
tainted" / "kernel tainted <0x0000fffff>" and some debug information on
the console.

I tested with xfs, reiserfs3 and ext4 and nothing changed, so maybe it's a
lower-layer problem: softraid, hardware module, kernel thread/process
options. I didn't diff the old and the new kernel config files, so maybe
something changed there that I missed. Changing to the old kernel solved
the problem. I didn't find performance problems, just kernel messages, and
sometimes it stopped working and I had to reboot.

2011/3/29 Thomas Jarosch <thomas.jarosch@intra2net.com>:
> Hello Roberto,
>
> On Monday, 28. March 2011 21:59:37 Roberto Spadim wrote:
>> i had some problems with hp server (the same model but G5) and new
>> intel core i7 desktops with archlinux and kernel 2.6.37, try to get a
>> older kernel version, just to make your production server stable
>
> Can you describe in more detail what kind of issue you were having?
>
> Didn't it boot at all or was it just slow / unusable?
> Was the issue related to md?
>
> Cheers,
> Thomas

--
Roberto Spadim
Spadim Technology / SPAEmpresarial
* Re: raid1 boot regression in 2.6.37 [bisected]
From: Tejun Heo @ 2011-03-29  8:25 UTC
To: Thomas Jarosch; +Cc: linux-raid, Neil Brown

On Mon, Mar 28, 2011 at 09:46:08PM +0200, Thomas Jarosch wrote:
> On 03/28/2011 05:59 PM, Tejun Heo wrote:
> >> Call Trace:
> >>  [<c12dc808>] mutex_unlock+0x8/0x10
> >>  [<c11c4451>] kobj_lookup+0xe1/0x140
> >>  [<c11323b0>] ? exact_match+0x0/0x10
> >>  [<c11331b8>] get_gendisk+0x98/0xb0
> >>  [<c10e85aa>] __blkdev_get+0xca/0x320
> >>  [<c10e8843>] blkdev_get+0x43/0x2c0
> >>  [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20
> >>  [<c10e8b12>] blkdev_open+0x52/0x70
> >>  [<c10bb12d>] __dentry_open+0x9d/0x240
> >>  [<c10bb3c6>] nameidata_to_filp+0x66/0x80
> >>  [<c10e8ac0>] ? blkdev_open+0x0/0x70
> >>  [<c10c781f>] finish_open+0xaf/0x190
> >>  [<c10c8a24>] ? do_path_lookup+0x44/0xe0
> >>  [<c10c9920>] do_filp_open+0x210/0x6d0
> >>  [<c10672e9>] ? lock_release_non_nested+0x59/0x2f0
> >>  [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20
> >>  [<c10d47d8>] ? alloc_fd+0xb8/0xf0
> >>  [<c10baf45>] do_sys_open+0x55/0xf0
> >>  [<c10bb049>] sys_open+0x29/0x40
> >>  [<c1002e9f>] sysenter_do_call+0x12/0x38
> >
> > Hmmm... Weird.
> >
> > * blkid seems to be looping in blkdev_open() repeatedly calling
> >   md_open() which keeps returning -ERESTARTSYS.
> >
> > * It triggered softlockup. Even with -ERESTARTSYS looping, I can't
> >   see how that would be possible.
> >
> > Is this custom boot script? If so, do you use RT priority in the
> > script?
>
> It's a normal dracut installation with an additional custom script
> to trigger kernel raid auto detection via mdadm.
> The custom script was part of the initial post.
>
> I've also noticed another odd thing: On a HP Proliant ML110 G6 box,
> which is quite fast / SMP, the box brings up the software
> RAID successfully. The box is slow as hell and I can see a constant load
> on a kernel process (could be "kworker", don't remember it exactly).
> I'll try tomorrow if that is also related to the RAID subsystem
> or something else turning it into a PDP11...

Can you please apply the following patch and see whether it resolves
the problem and report the boot log?

Thanks.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8b66e04..e17098b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6001,6 +6001,15 @@ static int md_open(struct block_device *bdev, fmode_t mode)
 	 * bd_disk.
 	 */
 	mddev_put(mddev);
+	if (current->policy == SCHED_FIFO || current->policy == SCHED_RR) {
+		static bool once;
+		if (!once) {
+			printk("%s: md_open(): RT prio, pol=%u p=%d rt_p=%u\n",
+			       current->comm, current->policy, current->static_prio, current->rt_priority);
+			once = true;
+		}
+	}
+	msleep(10);
 	/* Wait until bdev->bd_disk is definitely gone */
 	flush_workqueue(md_misc_wq);
 	/* Then retry the open from the top */
* Re: raid1 boot regression in 2.6.37 [bisected] 2011-03-29 8:25 ` Tejun Heo @ 2011-03-29 9:53 ` Thomas Jarosch 2011-03-29 10:07 ` Tejun Heo 0 siblings, 1 reply; 22+ messages in thread From: Thomas Jarosch @ 2011-03-29 9:53 UTC (permalink / raw) To: Tejun Heo; +Cc: linux-raid, Neil Brown On Tuesday, 29. March 2011 10:25:03 Tejun Heo wrote: > Can you please apply the following patch and see whether it resolves > the problem and report the boot log? Ok, I did the following: - Check out commit e804ac780e2f01cb3b914daca2fd4780d1743db1 (md: fix and update workqueue usage) - Apply your patch - Add small debug output on top of it: ------------------------------ # git diff diff --git a/drivers/md/md.c b/drivers/md/md.c index 1e6534d..d2ddef4 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -5899,6 +5899,16 @@ static int md_open(struct block_device *bdev, fmode_t mode) once = true; } } + /* DEBUG HACK */ + { + static bool tomj_once = false; + if (!tomj_once) + { + printk("TOMJ %s: md_open(): RT prio, pol=%u p=%d rt_p=%u\n", + current->comm, current->policy, current->static_prio, current->rt_priority); + tomj_once = true; + } + } msleep(10); /* Wait until bdev->bd_disk is definitely gone */ flush_workqueue(md_misc_wq); ------------------------------ With your patch applied, it boots! Here's the dmesg output: ------------------------------ pci 0000:00:00.0: Found enabled HT MSI Mapping pci 0000:00:00.0: Found enabled HT MSI Mapping pci 0000:00:00.0: Found enabled HT MSI Mapping pci 0000:00:00.0: Found enabled HT MSI Mapping pci 0000:00:12.0: Boot video device PCI: CLS 64 bytes, default 64 Trying to unpack rootfs image as initramfs... Freeing initrd memory: 5120k freed apm: BIOS version 1.2 Flags 0x03 (Driver version 1.16ac) apm: overridden by ACPI. audit: initializing netlink socket (disabled) type=2000 audit(1301389498.428:1): initialized highmem bounce pool size: 64 pages msgmni has been set to 1730 Block layer SCSI generic (bsg) driver version 0.4 loaded (major 253) io scheduler noop registered io scheduler deadline registered io scheduler cfq registered (default) pcieport 0000:00:0b.0: ACPI _OSC control granted for 0x1d pcieport 0000:00:0b.0: setting latency timer to 64 pcieport 0000:00:0b.0: irq 40 for MSI/MSI-X pcieport 0000:00:0c.0: ACPI _OSC control granted for 0x1d pcieport 0000:00:0c.0: setting latency timer to 64 pcieport 0000:00:0c.0: irq 41 for MSI/MSI-X pcieport 0000:00:0d.0: ACPI _OSC control granted for 0x1d pcieport 0000:00:0d.0: setting latency timer to 64 pcieport 0000:00:0d.0: irq 42 for MSI/MSI-X pcieport 0000:00:0e.0: ACPI _OSC control granted for 0x1d pcieport 0000:00:0e.0: setting latency timer to 64 pcieport 0000:00:0e.0: irq 43 for MSI/MSI-X pcieport 0000:00:0f.0: ACPI _OSC control granted for 0x1d pcieport 0000:00:0f.0: setting latency timer to 64 pcieport 0000:00:0f.0: irq 44 for MSI/MSI-X pcieport 0000:00:10.0: ACPI _OSC control granted for 0x1d pcieport 0000:00:10.0: setting latency timer to 64 pcieport 0000:00:10.0: irq 45 for MSI/MSI-X pcieport 0000:00:11.0: ACPI _OSC control granted for 0x1d pcieport 0000:00:11.0: setting latency timer to 64 pcieport 0000:00:11.0: irq 46 for MSI/MSI-X ACPI: acpi_idle registered with cpuidle Non-volatile memory driver v1.3 Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A 00:04: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A brd: module loaded loop: module loaded Uniform Multi-Platform E-IDE driver ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver ACPI: PCI Interrupt 
Link [LUB2] enabled at IRQ 23 ehci_hcd 0000:00:02.1: PCI INT B -> Link[LUB2] -> GSI 23 (level, low) -> IRQ 23 ehci_hcd 0000:00:02.1: setting latency timer to 64 ehci_hcd 0000:00:02.1: EHCI Host Controller ehci_hcd 0000:00:02.1: new USB bus registered, assigned bus number 1 ehci_hcd 0000:00:02.1: debug port 1 ehci_hcd 0000:00:02.1: cache line size of 64 is not supported ehci_hcd 0000:00:02.1: irq 23, io mem 0xfe8fec00 ehci_hcd 0000:00:02.1: USB 2.0 started, EHCI 1.00 hub 1-0:1.0: USB hub found hub 1-0:1.0: 6 ports detected ACPI: PCI Interrupt Link [UB12] enabled at IRQ 22 ehci_hcd 0000:00:04.1: PCI INT B -> Link[UB12] -> GSI 22 (level, low) -> IRQ 22 ehci_hcd 0000:00:04.1: setting latency timer to 64 ehci_hcd 0000:00:04.1: EHCI Host Controller ehci_hcd 0000:00:04.1: new USB bus registered, assigned bus number 2 ehci_hcd 0000:00:04.1: debug port 1 ehci_hcd 0000:00:04.1: cache line size of 64 is not supported ehci_hcd 0000:00:04.1: irq 22, io mem 0xfe8fe800 ehci_hcd 0000:00:04.1: USB 2.0 started, EHCI 1.00 hub 2-0:1.0: USB hub found hub 2-0:1.0: 6 ports detected ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver ACPI: PCI Interrupt Link [LUB0] enabled at IRQ 21 ohci_hcd 0000:00:02.0: PCI INT A -> Link[LUB0] -> GSI 21 (level, low) -> IRQ 21 ohci_hcd 0000:00:02.0: setting latency timer to 64 ohci_hcd 0000:00:02.0: OHCI Host Controller ohci_hcd 0000:00:02.0: new USB bus registered, assigned bus number 3 ohci_hcd 0000:00:02.0: irq 21, io mem 0xfe8ff000 hub 3-0:1.0: USB hub found hub 3-0:1.0: 6 ports detected ACPI: PCI Interrupt Link [UB11] enabled at IRQ 20 ohci_hcd 0000:00:04.0: PCI INT A -> Link[UB11] -> GSI 20 (level, low) -> IRQ 20 ohci_hcd 0000:00:04.0: setting latency timer to 64 ohci_hcd 0000:00:04.0: OHCI Host Controller ohci_hcd 0000:00:04.0: new USB bus registered, assigned bus number 4 ohci_hcd 0000:00:04.0: irq 20, io mem 0xfe8fd000 hub 4-0:1.0: USB hub found hub 4-0:1.0: 6 ports detected uhci_hcd: USB Universal Host Controller Interface driver usbcore: registered new interface driver usbserial USB Serial support registered for generic usbcore: registered new interface driver usbserial_generic usbserial: USB Serial Driver core USB Serial support registered for FTDI USB Serial Device usbcore: registered new interface driver ftdi_sio ftdi_sio: v1.6.0:USB FTDI Serial Converters Driver PNP: No PS/2 controller found. Probing ports directly. serio: i8042 KBD port at 0x60,0x64 irq 1 serio: i8042 AUX port at 0x60,0x64 irq 12 mice: PS/2 mouse device common for all mice input: PC Speaker as /devices/platform/pcspkr/input/input0 md: raid1 personality registered for level 1 cpuidle: using governor ladder cpuidle: using governor menu usbcore: registered new interface driver usbhid usbhid: USB HID core driver TCP cubic registered NET: Registered protocol family 17 Using IPI No-Shortcut mode drivers/rtc/hctosys.c: unable to open rtc device (rtc0) Freeing unused kernel memory: 408k freed Write protecting the kernel text: 2896k Write protecting the kernel read-only data: 1108k dracut: dracut-004-6.i2n device-mapper: ioctl: 4.18.0-ioctl (2010-06-29) initialised: dm-devel@redhat.com dracut: rd_NO_MDIMSM: no MD RAID for imsm/isw raids udev: starting version 145 udevd (582): /proc/582/oom_adj is deprecated, please use /proc/582/oom_score_adj instead. dracut: Telling kernel to auto-detect RAID arrays libata version 3.00 loaded. 
ahci 0000:00:09.0: version 3.0 ACPI: PCI Interrupt Link [LSA0] enabled at IRQ 23 ahci 0000:00:09.0: PCI INT A -> Link[LSA0] -> GSI 23 (level, low) -> IRQ 23 ahci 0000:00:09.0: controller can't do PMP, turning off CAP_PMP ahci 0000:00:09.0: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0xf impl IDE mode ahci 0000:00:09.0: flags: 64bit ncq sntf led clo pio ahci 0000:00:09.0: setting latency timer to 64 scsi0 : ahci scsi1 : ahci scsi2 : ahci scsi3 : ahci ata1: SATA max UDMA/133 irq_stat 0x00400040, connection status changed irq 23 ata2: SATA max UDMA/133 irq_stat 0x00400040, connection status changed irq 23 ata3: SATA max UDMA/133 abar m8192@0xfe8f6000 port 0xfe8f6200 irq 23 ata4: SATA max UDMA/133 abar m8192@0xfe8f6000 port 0xfe8f6280 irq 23 usb 4-1: new low speed USB device using ohci_hcd and address 2 ata4: SATA link down (SStatus 0 SControl 300) ata3: SATA link down (SStatus 0 SControl 300) input: USB CAT5 as /devices/pci0000:00/0000:00:04.0/usb4/4-1/4-1:1.0/input/input1 generic-usb 0003:0D3D:0040.0001: input: USB HID v1.00 Keyboard [USB CAT5] on usb-0000:00:04.0-1/input0 input: USB CAT5 as /devices/pci0000:00/0000:00:04.0/usb4/4-1/4-1:1.1/input/input2 generic-usb 0003:0D3D:0040.0002: input: USB HID v1.00 Mouse [USB CAT5] on usb-0000:00:04.0-1/input1 ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300) usb 4-2: new full speed USB device using ohci_hcd and address 3 ata2.00: ATA-7: ST3160215AS, 3.AAD, max UDMA/133 ata2.00: 312581808 sectors, multi 16: LBA48 NCQ (depth 31/32) ata1.00: ATA-7: ST3160215AS, 3.AAD, max UDMA/133 ata1.00: 312581808 sectors, multi 16: LBA48 NCQ (depth 31/32) ata2.00: configured for UDMA/133 ata1.00: configured for UDMA/133 scsi 0:0:0:0: Direct-Access ATA ST3160215AS 3.AA PQ: 0 ANSI: 5 sd 0:0:0:0: [sda] 312581808 512-byte logical blocks: (160 GB/149 GiB) sd 0:0:0:0: [sda] Write Protect is off sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00 sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA scsi 1:0:0:0: Direct-Access ATA ST3160215AS 3.AA PQ: 0 ANSI: 5 sd 1:0:0:0: [sdb] 312581808 512-byte logical blocks: (160 GB/149 GiB) sd 1:0:0:0: [sdb] Write Protect is off sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00 sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA sdb: sdb1 sdb2 < sdb5 sdb6 sdb7 sdb8 sdb9 > sd 1:0:0:0: [sdb] Attached SCSI disk ftdi_sio 4-2:1.0: FTDI USB Serial Device converter detected usb 4-2: Detected FT232BM usb 4-2: Number of endpoints 2 usb 4-2: Endpoint 1 MaxPacketSize 64 usb 4-2: Endpoint 2 MaxPacketSize 64 usb 4-2: Setting MaxPacketSize 64 sda: sda1 sda2 < sda5 sda6 sda7 sda8 sda9 > sd 0:0:0:0: [sda] Attached SCSI disk usb 4-2: FTDI USB Serial Device converter now attached to ttyUSB0 console [ttyUSB0] enabled usb 4-4: new full speed USB device using ohci_hcd and address 4 TOMJ blkid: md_open(): RT prio, pol=0 p=118 rt_p=0 md: Autodetecting RAID arrays. md: Scanned 12 and added 12 devices. md: autorun ... md: considering sda9 ... md: adding sda9 ... md: sda8 has different UUID to sda9 md: sda7 has different UUID to sda9 md: sda6 has different UUID to sda9 md: sda5 has different UUID to sda9 md: sda1 has different UUID to sda9 md: adding sdb9 ... 
md: sdb8 has different UUID to sda9 md: sdb7 has different UUID to sda9 md: sdb6 has different UUID to sda9 md: sdb5 has different UUID to sda9 md: sdb1 has different UUID to sda9 md: created md5 md: bind<sdb9> md: bind<sda9> md: running: <sda9><sdb9> md/raid1:md5: active with 2 out of 2 mirrors md5: detected capacity change from 0 to 147446235136 md5: unknown partition table md: considering sda8 ... md: adding sda8 ... md: sda7 has different UUID to sda8 md: sda6 has different UUID to sda8 md: sda5 has different UUID to sda8 md: sda1 has different UUID to sda8 md: adding sdb8 ... md: sdb7 has different UUID to sda8 md: sdb6 has different UUID to sda8 md: sdb5 has different UUID to sda8 md: sdb1 has different UUID to sda8 md: created md4 md: bind<sdb8> md: bind<sda8> md: running: <sda8><sdb8> md/raid1:md4: active with 2 out of 2 mirrors md4: detected capacity change from 0 to 2097348608 md4: unknown partition table md: considering sda7 ... md: adding sda7 ... md: sda6 has different UUID to sda7 md: sda5 has different UUID to sda7 md: sda1 has different UUID to sda7 md: adding sdb7 ... md: sdb6 has different UUID to sda7 md: sdb5 has different UUID to sda7 md: sdb1 has different UUID to sda7 md: created md3 md: bind<sdb7> md: bind<sda7> md: running: <sda7><sdb7> md/raid1:md3: active with 2 out of 2 mirrors md3: detected capacity change from 0 to 2623733760 md3: unknown partition table md: considering sda6 ... md: adding sda6 ... md: sda5 has different UUID to sda6 md: sda1 has different UUID to sda6 md: adding sdb6 ... md: sdb5 has different UUID to sda6 md: sdb1 has different UUID to sda6 md: created md2 md: bind<sdb6> md: bind<sda6> md: running: <sda6><sdb6> md/raid1:md2: active with 2 out of 2 mirrors md2: detected capacity change from 0 to 2623733760 md: considering sda5 ... md: adding sda5 ... md: sda1 has different UUID to sda5 md: adding sdb5 ... md: sdb1 has different UUID to sda5 md: created md1 md: bind<sdb5> md: bind<sda5> md: running: <sda5><sdb5> md/raid1:md1: active with 2 out of 2 mirrors md1: detected capacity change from 0 to 2623733760 md1: unknown partition table md: considering sda1 ... md: adding sda1 ... md: adding sdb1 ... md: created md0 md: bind<sdb1> md: bind<sda1> md: running: <sda1><sdb1> md/raid1:md0: active with 2 out of 2 mirrors md0: detected capacity change from 0 to 2623733760 md: ... autorun DONE. md0: unknown partition table md5: detected capacity change from 0 to 147446235136 md5: unknown partition table md3: detected capacity change from 0 to 2623733760 md3: unknown partition table md2: detected capacity change from 0 to 2623733760 md2: unknown partition table md1: detected capacity change from 0 to 2623733760 md4: detected capacity change from 0 to 2097348608 md4: unknown partition table md1: unknown partition table usb 4-4: usbfs: interface 0 claimed by usbfs while 'display' sets config #1 EXT3-fs: barriers not enabled kjournald starting. Commit interval 5 seconds EXT3-fs (md0): mounted filesystem with ordered data mode dracut: Remounting /dev/md0 with -o ro,defaults,noatime EXT3-fs: barriers not enabled kjournald starting. Commit interval 5 seconds EXT3-fs (md0): mounted filesystem with ordered data mode dracut: Mounted root filesystem /dev/md0 dracut: Switching root ------------------------------ As you can see, your printk() is not triggered(). I just copied your printk and made it print once unconditionally. So probably the msleep(10); does the trick. 
Something seems very racy to me as other boxes with software RAID
can boot the exact same kernel + dracut version just fine.

I'll put the box in a reboot loop over the lunch break.

Cheers,
Thomas
* Re: raid1 boot regression in 2.6.37 [bisected]
From: Tejun Heo @ 2011-03-29 10:07 UTC
To: Thomas Jarosch; +Cc: linux-raid, Neil Brown

On Tue, Mar 29, 2011 at 11:53:06AM +0200, Thomas Jarosch wrote:
> On Tuesday, 29. March 2011 10:25:03 Tejun Heo wrote:
> > Can you please apply the following patch and see whether it resolves
> > the problem and report the boot log?
>
> Ok, I did the following:
> - Check out commit e804ac780e2f01cb3b914daca2fd4780d1743db1
>   (md: fix and update workqueue usage)
> - Apply your patch
> - Add small debug output on top of it:
>
> ------------------------------
> # git diff
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 1e6534d..d2ddef4 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5899,6 +5899,16 @@ static int md_open(struct block_device *bdev, fmode_t mode)
>  			once = true;
>  		}
>  	}
> +	/* DEBUG HACK */
> +	{
> +		static bool tomj_once = false;
> +		if (!tomj_once)
> +		{
> +			printk("TOMJ %s: md_open(): RT prio, pol=%u p=%d rt_p=%u\n",
> +			       current->comm, current->policy, current->static_prio, current->rt_priority);
> +			tomj_once = true;
> +		}
> +	}
>  	msleep(10);
>  	/* Wait until bdev->bd_disk is definitely gone */
>  	flush_workqueue(md_misc_wq);
...
> TOMJ blkid: md_open(): RT prio, pol=0 p=118 rt_p=0
...
> As you can see, your printk() is not triggered(). I just
> copied your printk and made it print once unconditionally.
>
> So probably the msleep(10); does the trick. Something
> seems very racy to me as other boxes with software RAID
> can boot the exact same kernel + dracut version just fine.
>
> I'll put the box in a reboot loop over the lunch break.

Hmmm.. interesting, so no RT task there. I don't know why the
softlockup is triggering then. Ah, okay, none of CONFIG_PREEMPT and
CONFIG_PREEMPT_VOLUNTARY is set, right?

Anyways, the root cause here is that md_open() -ERESTARTSYS retrying
is busy looping without giving the put path a chance to run. When it
was using flush_scheduled_work(), there were some unrelated work items
there so it ended up sleeping by accident giving the put path a chance
to run. With the conversion, the flush domain is reduced and there's
nothing unrelated to wait for so it just busy loops.

Neil, we can put a short unconditional sleep there or somehow ensure
work item is queued before the restart loop engages. What do you
think?

Thanks.

--
tejun
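[A compact userspace model of the livelock described in the message above may make the mechanism easier to see. It is only a sketch: the names (cleanup_queued, flush_misc_queue, model_open) and the boolean "workqueue" are illustrative stand-ins, not the real md or workqueue API.]
--------------------------------------------
/*
 * Toy userspace model of the livelock, not kernel code.  The "open" path
 * keeps retrying until a deferred cleanup has run, but it only blocks if
 * the queue it flushes actually contains that cleanup.  With a dedicated,
 * otherwise empty queue the flush returns immediately, so on a
 * non-preemptive single CPU the retry loop never lets the other side run.
 */
#include <stdbool.h>
#include <stdio.h>

static bool cleanup_queued; /* would be set by the model's "mddev_put()" */
static bool cleanup_done;   /* stale "gendisk" finally gone              */

/* A flush only waits for work that is already on the queue. */
static void flush_misc_queue(void)
{
	if (cleanup_queued) {
		cleanup_done = true;    /* pending cleanup runs; open can succeed */
		cleanup_queued = false;
	}
	/* empty queue: return at once, nothing to sleep on */
}

/* Model of md_open(): stale state means "retry from the top". */
static int model_open(void)
{
	return cleanup_done ? 0 : -1;   /* -1 stands in for -ERESTARTSYS */
}

int main(void)
{
	cleanup_queued = false;     /* nothing was ever queued on this queue */

	for (int attempt = 1; attempt <= 5; attempt++) {  /* __blkdev_get retry */
		if (model_open() == 0) {
			printf("open succeeded on attempt %d\n", attempt);
			return 0;
		}
		flush_misc_queue();     /* no-op here, so the loop just spins */
	}
	puts("livelock: the flush never had anything to wait for");
	return 1;
}
--------------------------------------------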
* Re: raid1 boot regression in 2.6.37 [bisected]
From: Thomas Jarosch @ 2011-03-29 11:52 UTC
To: Tejun Heo; +Cc: linux-raid, Neil Brown

On Tuesday, 29. March 2011 12:07:44 Tejun Heo wrote:
> > I'll put the box in a reboot loop over the lunch break.
>
> Hmmm.. interesting, so no RT task there. I don't know why the
> softlockup is triggering then. Ah, okay, none of CONFIG_PREEMPT and
> CONFIG_PREEMPT_VOLUNTARY is set, right?

"grep PREEM .config" shows:

# CONFIG_PREEMPT_RCU is not set
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set

You were right :)

> Anyways, the root cause here is that md_open() -ERESTARTSYS retrying
> is busy looping without giving the put path a chance to run. When it
> was using flush_scheduled_work(), there were some unrelated work items
> there so it ended up sleeping by accident giving the put path a chance
> to run. With the conversion, the flush domain is reduced and there's
> nothing unrelated to wait for so it just busy loops.

The test box survived 52 reboots without a hiccup. When we come up with
a final solution, I'm going to let that reboot loop run overnight to be
on the safe side. Thanks!

The HP Proliant ML110 G6 issue seems unrelated to this one as I can
reproduce the slowdown issue also with 2.6.36. I also don't think it's
md related as the box is only slow over the network (igb driver) and
fast on the console. So it must be something else.

Crap, two issues to hunt down :o)

Thomas
* Re: raid1 boot regression in 2.6.37 [bisected]
From: NeilBrown @ 2011-04-05  3:46 UTC
To: Tejun Heo; +Cc: Thomas Jarosch, linux-raid

On Tue, 29 Mar 2011 12:07:44 +0200 Tejun Heo <tj@kernel.org> wrote:

> On Tue, Mar 29, 2011 at 11:53:06AM +0200, Thomas Jarosch wrote:
> > On Tuesday, 29. March 2011 10:25:03 Tejun Heo wrote:
> > > Can you please apply the following patch and see whether it resolves
> > > the problem and report the boot log?
> >
> > Ok, I did the following:
> > - Check out commit e804ac780e2f01cb3b914daca2fd4780d1743db1
> >   (md: fix and update workqueue usage)
> > - Apply your patch
> > - Add small debug output on top of it:
> >
> > ------------------------------
> > # git diff
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 1e6534d..d2ddef4 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -5899,6 +5899,16 @@ static int md_open(struct block_device *bdev, fmode_t mode)
> >  			once = true;
> >  		}
> >  	}
> > +	/* DEBUG HACK */
> > +	{
> > +		static bool tomj_once = false;
> > +		if (!tomj_once)
> > +		{
> > +			printk("TOMJ %s: md_open(): RT prio, pol=%u p=%d rt_p=%u\n",
> > +			       current->comm, current->policy, current->static_prio, current->rt_priority);
> > +			tomj_once = true;
> > +		}
> > +	}
> >  	msleep(10);
> >  	/* Wait until bdev->bd_disk is definitely gone */
> >  	flush_workqueue(md_misc_wq);
> ...
> > TOMJ blkid: md_open(): RT prio, pol=0 p=118 rt_p=0
> ...
> > As you can see, your printk() is not triggered(). I just
> > copied your printk and made it print once unconditionally.
> >
> > So probably the msleep(10); does the trick. Something
> > seems very racy to me as other boxes with software RAID
> > can boot the exact same kernel + dracut version just fine.
> >
> > I'll put the box in a reboot loop over the lunch break.
>
> Hmmm.. interesting, so no RT task there. I don't know why the
> softlockup is triggering then. Ah, okay, none of CONFIG_PREEMPT and
> CONFIG_PREEMPT_VOLUNTARY is set, right?
>
> Anyways, the root cause here is that md_open() -ERESTARTSYS retrying
> is busy looping without giving the put path a chance to run. When it
> was using flush_scheduled_work(), there were some unrelated work items
> there so it ended up sleeping by accident giving the put path a chance
> to run. With the conversion, the flush domain is reduced and there's
> nothing unrelated to wait for so it just busy loops.
>
> Neil, we can put a short unconditional sleep there or somehow ensure
> work item is queued before the restart loop engages. What do you
> think?

(sorry for delay - leave and stuff...)

I don't think we should need to add a sleep - it should work as is.
Obviously it doesn't so I must be missing something, but I would like to
find out what it is to make sure I fix the right thing.

The interesting race is between md_open and mddev_put.

After mddev_put has made a firm decision to discard the mddev and
consequently del_gendisk, __blkdev_get can find the gendisk and try to
re-open it. This will find a different (new) mddev which is a problem.
So we return -ERESTARTSYS in that case so the lookup for a gendisk will
be retried.

The mddev_find in md_open will only return a new mddev after taking
all_mddevs_lock and finding that the old mddev is no longer available.
This must be after mddev_put has dropped the lock, and so after
queue_work has been called.

After mddev_find returns the new mddev, md_open calls flush_workqueue
and as the work item to complete the delete has definitely been queued, it
should wait for that work item to complete.

So the next time around the retry loop in __blkdev_get the old gendisk will
not be found....

Where is my logic wrong??

To put it another way matching your description Tejun, the put path has a
chance to run firstly while mddev_find is waiting for the spinlock, and then
while flush_workqueue is waiting for the rest of the put path to complete.

Thanks for any light you can shed on this...

NeilBrown
* Re: raid1 boot regression in 2.6.37 [bisected]
From: Tejun Heo @ 2011-04-06 10:16 UTC
To: NeilBrown; +Cc: Thomas Jarosch, linux-raid

Hey, Neil.

On Tue, Apr 05, 2011 at 01:46:29PM +1000, NeilBrown wrote:
> After mddev_find returns the new mddev, md_open calls flush_workqueue
> and as the work item to complete the delete has definitely been queued, it
> should wait for that work item to complete.
>
> So the next time around the retry loop in __blkdev_get the old gendisk will
> not be found....
>
> Where is my logic wrong??
>
> To put it another way matching your description Tejun, the put path has a
> chance to run firstly while mddev_find is waiting for the spinlock, and then
> while flush_workqueue is waiting for the rest of the put path to complete.

I don't think the logic is wrong per-se. It's more likely that the
implemented code doesn't really follow the model described by the
logic.

Probably the best way would be reproducing the problem and throwing in
some diagnostic code to tell the sequence of events? If work is being
queued first but it still ends up busy looping, that would be a bug in
flush_workqueue(), but I think it's more likely that the restart
condition somehow triggers in an unexpected way without the work item
queued as expected.

Thanks.

--
tejun
* Re: raid1 boot regression in 2.6.37 [bisected]
From: Thomas Jarosch @ 2011-04-12 14:05 UTC
To: Tejun Heo; +Cc: NeilBrown, linux-raid

Hello Neil,

On Wednesday, 6. April 2011 12:16:00 Tejun Heo wrote:
> > To put it another way matching your description Tejun, the put path has
> > a chance to run firstly while mddev_find is waiting for the spinlock,
> > and then while flush_workqueue is waiting for the rest of the put path
> > to complete.
>
> I don't think the logic is wrong per-se. It's more likely that the
> implemented code doesn't really follow the model described by the
> logic.
>
> Probably the best way would be reproducing the problem and throwing in
> some diagnostic code to tell the sequence of events? If work is being
> queued first but it still ends up busy looping, that would be a bug in
> flush_workqueue(), but I think it's more likely that the restart
> condition somehow triggers in an unexpected way without the work item
> queued as expected.

I can test any debug patch you want, the box is in a test lab anyway.

Best regards,
Thomas
* Re: raid1 boot regression in 2.6.37 [bisected]
From: NeilBrown @ 2011-04-12 22:44 UTC
To: Thomas Jarosch; +Cc: Tejun Heo, linux-raid

On Tue, 12 Apr 2011 16:05:52 +0200 Thomas Jarosch
<thomas.jarosch@intra2net.com> wrote:

> Hello Neil,
>
> On Wednesday, 6. April 2011 12:16:00 Tejun Heo wrote:
> > > To put it another way matching your description Tejun, the put path has
> > > a chance to run firstly while mddev_find is waiting for the spinlock,
> > > and then while flush_workqueue is waiting for the rest of the put path
> > > to complete.
> >
> > I don't think the logic is wrong per-se. It's more likely that the
> > implemented code doesn't really follow the model described by the
> > logic.
> >
> > Probably the best way would be reproducing the problem and throwing in
> > some diagnostic code to tell the sequence of events? If work is being
> > queued first but it still ends up busy looping, that would be a bug in
> > flush_workqueue(), but I think it's more likely that the restart
> > condition somehow triggers in an unexpected way without the work item
> > queued as expected.
>
> I can test any debug patch you want, the box is in a test lab anyway.
>
> Best regards,
> Thomas

Could you try this?

diff --git a/drivers/md/md.c b/drivers/md/md.c
index a0ccaab..07c97b1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6175,6 +6175,8 @@ static int md_open(struct block_device *bdev, fmode_t mode)
 	mddev_t *mddev = mddev_find(bdev->bd_dev);
 	int err;

+	BUG_ON(!mddev->gendisk);
+
 	if (mddev->gendisk != bdev->bd_disk) {
 		/* we are racing with mddev_put which is discarding this
 		 * bd_disk.

I don't know how it could get to the state where gendisk was NULL, but
it is the only way I can see that the looping could happen.

If the BUG_ON does trigger I'll probably be able to find out why it
happens. If it doesn't then I'll still be at a loss.

NeilBrown
[parent not found: <201104261051.09464.thomas.jarosch@intra2net.com>]
* Re: raid1 boot regression in 2.6.37 [bisected]
  [not found] ` <201104261051.09464.thomas.jarosch@intra2net.com>
@ 2011-04-27  8:17     ` NeilBrown
  2011-04-27 10:05       ` NeilBrown
  0 siblings, 1 reply; 22+ messages in thread
From: NeilBrown @ 2011-04-27 8:17 UTC (permalink / raw)
  To: Thomas Jarosch; +Cc: Tejun Heo, linux-raid

On Tue, 26 Apr 2011 10:51:09 +0200 Thomas Jarosch
<thomas.jarosch@intra2net.com> wrote:

> Hello Neil,
>
> On Wednesday, 13. April 2011 00:44:08 NeilBrown wrote:
> > Could you try this?
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index a0ccaab..07c97b1 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -6175,6 +6175,8 @@ static int md_open(struct block_device *bdev, fmode_t mode)
> >  	mddev_t *mddev = mddev_find(bdev->bd_dev);
> >  	int err;
> >  
> > +	BUG_ON(!mddev->gendisk);
> > +
> >  	if (mddev->gendisk != bdev->bd_disk) {
> >  		/* we are racing with mddev_put which is discarding this
> >  		 * bd_disk.
> >
> >
> > I don't know how it could get to the state where gendisk was NULL, but
> > it is the only way I can see that the looping could happen.
> >
> > If the BUG_ON does trigger I'll probably be able to find out why it
> > happens.  If it doesn't then I'll still be at a loss.
>
> Sorry for the late reply, I somehow missed your message.
>
> Your intuition was right, the BUG_ON is triggered.
> Attached you'll find a screenshot of the call trace.
>
> Best regards,
> Thomas

Thanks.

I managed to reproduce something very similar and I think I know what is
happening.  It appears to be fixed by this change to drivers/md/md.c

@@ -4340,8 +4344,8 @@ static int md_alloc(dev_t dev, char *name)
 	 * remove it now.
 	 */
 	disk->flags |= GENHD_FL_EXT_DEVT;
-	add_disk(disk);
 	mddev->gendisk = disk;
+	add_disk(disk);
 	error = kobject_init_and_add(&mddev->kobj, &md_ktype,
 				     &disk_to_dev(disk)->kobj, "%s", "md");
 	if (error) {

However I need to think through the sequence of events in the morning and
make sure it all makes sense and there isn't some other race hiding in
there.

Thanks again for the report.

NeilBrown

^ permalink raw reply [flat|nested] 22+ messages in thread
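The window this reordering closes can be pictured as the following interleaving. This is an inferred reconstruction based on the patch above and the triggered BUG_ON, not something spelled out in the message itself:

/* CPU A: md_alloc()                     CPU B: a concurrent opener (blkid)
 *
 * add_disk(disk);
 *   disk becomes visible to             __blkdev_get()
 *   get_gendisk() and a uevent            -> get_gendisk() finds the disk
 *   is emitted                            -> md_open()
 *                                            mddev->gendisk is still NULL,
 *                                            so it != bdev->bd_disk
 *                                            -> flush_workqueue(md_misc_wq)
 *                                               (nothing queued, returns)
 *                                            -> -ERESTARTSYS, retry
 * mddev->gendisk = disk;
 *
 * Until CPU A performs the assignment, every retry on CPU B takes the
 * mismatch branch.  With the assignment moved before add_disk(), an opener
 * can never see a registered disk whose mddev->gendisk is still NULL.
 */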
* Re: raid1 boot regression in 2.6.37 [bisected]
  2011-04-27  8:17 ` NeilBrown
@ 2011-04-27 10:05   ` NeilBrown
  [not found]             ` <201104271700.58894.thomas.jarosch@intra2net.com>
  0 siblings, 1 reply; 22+ messages in thread
From: NeilBrown @ 2011-04-27 10:05 UTC (permalink / raw)
  To: Thomas Jarosch; +Cc: Tejun Heo, linux-raid

On Wed, 27 Apr 2011 18:17:14 +1000 NeilBrown <neilb@suse.de> wrote:

> I managed to reproduce something very similar and I think I know what is
> happening.  It appears to be fixed by this change to drivers/md/md.c
>
> @@ -4340,8 +4344,8 @@ static int md_alloc(dev_t dev, char *name)
>  	 * remove it now.
>  	 */
>  	disk->flags |= GENHD_FL_EXT_DEVT;
> -	add_disk(disk);
>  	mddev->gendisk = disk;
> +	add_disk(disk);
>  	error = kobject_init_and_add(&mddev->kobj, &md_ktype,
>  				     &disk_to_dev(disk)->kobj, "%s", "md");
>  	if (error) {
>
>
> However I need to think through the sequence of events in the morning and
> make sure it all makes sense and there isn't some other race hiding in
> there.
>
> Thanks again for the report.

On reflection over dinner, I think I should move the add_disk right to the
end just before the mutex_unlock.
I'm not sure it makes a big practical difference but it will be easier to
justify.

So if you could test this one instead please that would be great.

The rationale is that as soon as we call add_disk, get_gendisk will return
the disk rather than coming through to md_probe and blocking on the mutex.
So we want everything set up properly before calling add_disk.

Thanks,
NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 818313e..6ab5bfb 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4340,7 +4340,6 @@ static int md_alloc(dev_t dev, char *name)
 	 * remove it now.
 	 */
 	disk->flags |= GENHD_FL_EXT_DEVT;
-	add_disk(disk);
 	mddev->gendisk = disk;
 	error = kobject_init_and_add(&mddev->kobj, &md_ktype,
 				     &disk_to_dev(disk)->kobj, "%s", "md");
@@ -4357,6 +4356,7 @@ static int md_alloc(dev_t dev, char *name)
 			printk(KERN_DEBUG "pointless warning\n");
 
 	blk_queue_flush(mddev->queue, REQ_FLUSH | REQ_FUA);
+	add_disk(disk);
  abort:
 	mutex_unlock(&disks_mutex);
 	if (!error && mddev->kobj.sd) {

^ permalink raw reply related [flat|nested] 22+ messages in thread
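The two lookup paths being contrasted in the rationale can be sketched as follows. The md_probe() shape is simplified from the 2.6.37-era md.c and should be read as an assumption rather than a quote:

/* Before add_disk(): no gendisk is registered for the requested dev_t, so
 * the block layer falls back to the probe hook md registers with
 * blk_register_region(), and that path serializes behind disks_mutex. */
static struct kobject *md_probe(dev_t dev, int *part, void *data)
{
	md_alloc(dev, NULL);	/* takes disks_mutex, performs the setup */
	return NULL;
}

/* After add_disk(): get_gendisk() returns the disk immediately and the open
 * goes straight to md_open() with no serialization against md_alloc(), so
 * every field md_open() relies on must already be initialized by the time
 * add_disk() is called. */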
[parent not found: <201104271700.58894.thomas.jarosch@intra2net.com>]
* Re: raid1 boot regression in 2.6.37 [bisected]
  [not found] ` <201104271700.58894.thomas.jarosch@intra2net.com>
@ 2011-04-28  1:23     ` NeilBrown
  2011-04-28 13:47       ` Thomas Jarosch
  0 siblings, 1 reply; 22+ messages in thread
From: NeilBrown @ 2011-04-28 1:23 UTC (permalink / raw)
  To: Thomas Jarosch; +Cc: Tejun Heo, linux-raid

On Wed, 27 Apr 2011 17:00:58 +0200 Thomas Jarosch
<thomas.jarosch@intra2net.com> wrote:

> On Wednesday, 27. April 2011 12:05:13 NeilBrown wrote:
> > > Thanks again for the report.
> >
> > On reflection over dinner, I think I should move the add_disk right to
> > the end just before the mutex_unlock.
> > I'm not sure it makes a big practical difference but it will be easier to
> > justify.
> >
> > So if you could test this one instead please that would be great.
> >
> > The rationale is that as soon as we call add_disk, get_gendisk will
> > return the disk rather than coming through to md_probe and blocking on
> > the mutex.  So we want everything set up properly before calling
> > add_disk.
>
> Unfortunately the kernel oopses now. Attached you'll find screenshots
> of the oops. I can bring back the serial console if needed.
>
> I've double verified with "git diff" that I only have your
> latest change in the tree.
>
> Hope that helps,
> Thomas

Yes it helps, thanks.

I do need to add_disk before I kobject_init_and_add the mddev object,
otherwise it will try to create something in sysfs in a non-existent
directory.

So the sysfs stuff needs to come last.

Could you try this one please?

Thanks,
NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 818313e..9f9fa54 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4335,13 +4335,15 @@ static int md_alloc(dev_t dev, char *name)
 	disk->fops = &md_fops;
 	disk->private_data = mddev;
 	disk->queue = mddev->queue;
+	blk_queue_flush(mddev->queue, REQ_FLUSH | REQ_FUA);
 	/* Allow extended partitions.  This makes the
 	 * 'mdp' device redundant, but we can't really
 	 * remove it now.
 	 */
 	disk->flags |= GENHD_FL_EXT_DEVT;
-	add_disk(disk);
 	mddev->gendisk = disk;
+	add_disk(disk);
+
 	error = kobject_init_and_add(&mddev->kobj, &md_ktype,
 				     &disk_to_dev(disk)->kobj, "%s", "md");
 	if (error) {
@@ -4355,8 +4357,6 @@ static int md_alloc(dev_t dev, char *name)
 		if (mddev->kobj.sd &&
 		    sysfs_create_group(&mddev->kobj, &md_bitmap_group))
 			printk(KERN_DEBUG "pointless warning\n");
-
-	blk_queue_flush(mddev->queue, REQ_FLUSH | REQ_FUA);
  abort:
 	mutex_unlock(&disks_mutex);
 	if (!error && mddev->kobj.sd) {

^ permalink raw reply related [flat|nested] 22+ messages in thread
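Summarizing the initialization order that the final patch above establishes in md_alloc() (a reading of the patch, not text from the message):

/* 1. blk_queue_flush()       - the request queue is fully set up first
 * 2. mddev->gendisk = disk   - md_open() can trust gendisk once the disk
 *                              becomes visible
 * 3. add_disk(disk)          - registers the disk, creates its sysfs
 *                              directory and emits the uevent
 * 4. kobject_init_and_add()  - adds the "md" kobject underneath the
 *                              directory that add_disk() just created,
 *                              which is why it has to come after add_disk()
 */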
* Re: raid1 boot regression in 2.6.37 [bisected]
  2011-04-28  1:23 ` NeilBrown
@ 2011-04-28 13:47   ` Thomas Jarosch
  2011-05-02 12:17     ` Thomas Jarosch
  0 siblings, 1 reply; 22+ messages in thread
From: Thomas Jarosch @ 2011-04-28 13:47 UTC (permalink / raw)
  To: NeilBrown; +Cc: Tejun Heo, linux-raid

On Thursday, 28. April 2011 03:23:28 NeilBrown wrote:
> Yes it helps, thanks.
>
> I do need to add_disk before I kobject_init_and_add the mddev object,
> otherwise it will try to create something in sysfs in a non-existent
> directory.
>
> So the sysfs stuff needs to come last.
>
> Could you try this one please?

Works like a charm! I've put the box in a reboot loop again,
let's see if it survives until tomorrow :)

Cheers,
Thomas

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: raid1 boot regression in 2.6.37 [bisected]
  2011-04-28 13:47 ` Thomas Jarosch
@ 2011-05-02 12:17   ` Thomas Jarosch
  0 siblings, 0 replies; 22+ messages in thread
From: Thomas Jarosch @ 2011-05-02 12:17 UTC (permalink / raw)
  To: NeilBrown; +Cc: Tejun Heo, linux-raid

Neil,

On Thursday, 28. April 2011 15:47:40 Thomas Jarosch wrote:
> > Could you try this one please?
>
> Works like a charm! I've put the box in a reboot loop again,
> let's see if it survives until tomorrow :)

Patch is working fine, the box survived rebooting for several days.

Thanks,
Thomas

^ permalink raw reply [flat|nested] 22+ messages in thread
* raid1 boot regression in 2.6.37 [bisected]
@ 2011-03-25 18:55 Thomas Jarosch
  0 siblings, 0 replies; 22+ messages in thread
From: Thomas Jarosch @ 2011-03-25 18:55 UTC (permalink / raw)
  To: linux-raid; +Cc: Tejun Heo

Hello,

I've just updated from kernel 2.6.34.7 to kernel 2.6.37.5 and one
HP Proliant DL320 G3 box with a raid1 software RAID stopped booting.
(also two other non-HP boxes).

We run this script at boot time via dracut:
----------------------------------
#!/bin/sh
. /lib/dracut-lib.sh

info "Telling kernel to auto-detect RAID arrays"
/sbin/initqueue --settled --name kerneldetectraid /sbin/mdadm --auto-detect
----------------------------------

With the "bad" commit in place, the kernel doesn't output
any md message at all. I've bisected it down to this commit:

e804ac780e2f01cb3b914daca2fd4780d1743db1 is the first bad commit
commit e804ac780e2f01cb3b914daca2fd4780d1743db1
Author: Tejun Heo <tj@kernel.org>
Date:   Fri Oct 15 15:36:08 2010 +0200

    md: fix and update workqueue usage

    Workqueue usage in md has two problems.

    * Flush can be used during or depended upon by memory reclaim, but md
      uses the system workqueue for flush_work which may lead to deadlock.

    * md depends on flush_scheduled_work() to achieve exclusion against
      completion of removal of previous instances.  flush_scheduled_work()
      may incur unexpected amount of delay and is scheduled to be removed.

    This patch adds two workqueues to md - md_wq and md_misc_wq.  The
    former is guaranteed to make forward progress under memory pressure
    and serves flush_work.  The latter serves as the flush domain for
    other works.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: NeilBrown <neilb@suse.de>

:040000 040000 f6b6a34a71864263ed253866c5f8abe7f766ac6b
dc2eff4a91825142b7c88cf54751fc7acdf1a6d2 M	drivers

I manually verified that the commit before it
(57dab0bdf689d42972975ec646d862b0900a4bf3) works
and the "bad" commit prevents the box from booting.


Some more info:

# mdadm --version
mdadm - v2.6.9 - 10th March 2009

# mdadm --detail /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Wed May 27 17:52:40 2009
     Raid Level : raid1
     Array Size : 2562240 (2.44 GiB 2.62 GB)
  Used Dev Size : 2562240 (2.44 GiB 2.62 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Fri Mar 25 17:11:33 2011
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 0ee8da2c:5803478b:e399b924:6520c535
         Events : 0.160

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1

Any idea what might go wrong? Maybe building a kernel
with lock debugging on Monday might help.

I think I also tried kernel 2.6.38, though I'll verify on Monday, too.

Have a nice weekend,
Thomas

PS: Sorry Tejun for the HTML crap in my first mail.

^ permalink raw reply [flat|nested] 22+ messages in thread
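As background on the commit message quoted in the bisection result: the "forward progress under memory pressure" property comes from creating the workqueue with the WQ_MEM_RECLAIM flag, which keeps a rescuer thread available for it. The sketch below shows the general shape of such an initialization; it is illustrative and is not claimed to match the commit line for line.

static struct workqueue_struct *md_wq;
static struct workqueue_struct *md_misc_wq;

static int __init md_init(void)
{
	/* md_wq serves the flush work and must keep making progress even
	 * while memory reclaim is waiting on it, hence WQ_MEM_RECLAIM. */
	md_wq = alloc_workqueue("md", WQ_MEM_RECLAIM, 0);
	if (!md_wq)
		return -ENOMEM;

	/* md_misc_wq is an ordinary workqueue that doubles as the flush
	 * domain for md's other work items (such as the delayed delete
	 * of an mddev). */
	md_misc_wq = alloc_workqueue("md_misc", 0, 0);
	if (!md_misc_wq) {
		destroy_workqueue(md_wq);
		return -ENOMEM;
	}

	/* ... the rest of md's module initialization ... */
	return 0;
}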
end of thread, other threads:[~2011-05-02 12:17 UTC | newest]

Thread overview: 22+ messages:
     [not found] <201103251725.21180.thomas.jarosch@intra2net.com>
2011-03-28  7:59 ` raid1 boot regression in 2.6.37 [bisected] Tejun Heo
2011-03-28 11:02 ` Thomas Jarosch
2011-03-28 12:53 ` Thomas Jarosch
2011-03-28 15:59 ` Tejun Heo
2011-03-28 19:46 ` Thomas Jarosch
2011-03-28 19:59 ` Roberto Spadim
2011-03-29 12:06 ` Thomas Jarosch
2011-03-29 12:22 ` Roberto Spadim
2011-03-29  8:25 ` Tejun Heo
2011-03-29  9:53 ` Thomas Jarosch
2011-03-29 10:07 ` Tejun Heo
2011-03-29 11:52 ` Thomas Jarosch
2011-04-05  3:46 ` NeilBrown
2011-04-06 10:16 ` Tejun Heo
2011-04-12 14:05 ` Thomas Jarosch
2011-04-12 22:44 ` NeilBrown
     [not found] ` <201104261051.09464.thomas.jarosch@intra2net.com>
2011-04-27  8:17 ` NeilBrown
2011-04-27 10:05 ` NeilBrown
     [not found] ` <201104271700.58894.thomas.jarosch@intra2net.com>
2011-04-28  1:23 ` NeilBrown
2011-04-28 13:47 ` Thomas Jarosch
2011-05-02 12:17 ` Thomas Jarosch
2011-03-25 18:55 Thomas Jarosch
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).