* Re: raid1 boot regression in 2.6.37 [bisected]
  [not found] <201103251725.21180.thomas.jarosch@intra2net.com>
From: Tejun Heo @ 2011-03-28  7:59 UTC
To: Thomas Jarosch; +Cc: linux-raid, Neil Brown

Hello,

(cc'ing Neil and quoting whole body)

On Fri, Mar 25, 2011 at 05:25:20PM +0100, Thomas Jarosch wrote:
> Hello,
>
> I've just updated from kernel 2.6.34.7 to kernel 2.6.37.5 and one
> HP Proliant DL320 G3 box with a raid1 software RAID stopped booting.
> (also two other non-HP boxes).
>
> We run this script at boot time via dracut:
> ----------------------------------
> #!/bin/sh
> . /lib/dracut-lib.sh
>
> info "Telling kernel to auto-detect RAID arrays"
> /sbin/initqueue --settled --name kerneldetectraid /sbin/mdadm --auto-detect
> ----------------------------------
>
> With the "bad" commit in place, the kernel doesn't output
> any md message at all. I've bisected it down to this commit:
>
> e804ac780e2f01cb3b914daca2fd4780d1743db1 is the first bad commit
> commit e804ac780e2f01cb3b914daca2fd4780d1743db1
> Author: Tejun Heo <tj@kernel.org>
> Date:   Fri Oct 15 15:36:08 2010 +0200
>
>     md: fix and update workqueue usage
>
>     Workqueue usage in md has two problems.
>
>     * Flush can be used during or depended upon by memory reclaim, but md
>       uses the system workqueue for flush_work which may lead to deadlock.
>
>     * md depends on flush_scheduled_work() to achieve exclusion against
>       completion of removal of previous instances. flush_scheduled_work()
>       may incur unexpected amount of delay and is scheduled to be removed.
>
>     This patch adds two workqueues to md - md_wq and md_misc_wq. The
>     former is guaranteed to make forward progress under memory pressure
>     and serves flush_work. The latter serves as the flush domain for
>     other works.
>
>     Signed-off-by: Tejun Heo <tj@kernel.org>
>     Signed-off-by: NeilBrown <neilb@suse.de>
>
> :040000 040000 f6b6a34a71864263ed253866c5f8abe7f766ac6b
> dc2eff4a91825142b7c88cf54751fc7acdf1a6d2 M      drivers
>
> I manually verified that the commit before it
> (57dab0bdf689d42972975ec646d862b0900a4bf3) works
> and the "bad" commit prevents the box from booting.
>
> Some more info:
>
> # mdadm --version
> mdadm - v2.6.9 - 10th March 2009
>
> # mdadm --detail /dev/md0
> /dev/md0:
>         Version : 0.90
>   Creation Time : Wed May 27 17:52:40 2009
>      Raid Level : raid1
>      Array Size : 2562240 (2.44 GiB 2.62 GB)
>   Used Dev Size : 2562240 (2.44 GiB 2.62 GB)
>    Raid Devices : 2
>   Total Devices : 2
> Preferred Minor : 0
>     Persistence : Superblock is persistent
>
>     Update Time : Fri Mar 25 17:11:33 2011
>           State : clean
>  Active Devices : 2
> Working Devices : 2
>  Failed Devices : 0
>   Spare Devices : 0
>
>            UUID : 0ee8da2c:5803478b:e399b924:6520c535
>          Events : 0.160
>
>     Number   Major   Minor   RaidDevice State
>        0       8        1        0      active sync   /dev/sda1
>        1       8       17        1      active sync   /dev/sdb1
>
> Any idea what might go wrong? May be building a kernel
> with lock debugging on Monday might help.
>
> Unfortunately bugzilla.kernel.org is currently down,
> so I can't look for a possible existing bug/solution.

I don't think it's a reported problem. How does it fail? Things just
stop? As you wrote in the other mail, lockdep would definitely help.
Another thing which can be helpful is sysrq-t and see where things are
stuck.

Thanks.

--
tejun
* Re: raid1 boot regression in 2.6.37 [bisected] 2011-03-28 7:59 ` raid1 boot regression in 2.6.37 [bisected] Tejun Heo @ 2011-03-28 11:02 ` Thomas Jarosch 2011-03-28 12:53 ` Thomas Jarosch 0 siblings, 1 reply; 22+ messages in thread From: Thomas Jarosch @ 2011-03-28 11:02 UTC (permalink / raw) To: Tejun Heo, linux-raid; +Cc: Neil Brown Hello Tejun, On Monday, 28. March 2011 09:59:37 Tejun Heo wrote: > I don't think it's a reported problem. How does it fail? Things just > stop? As you wrote in the other mail, lockdep would definitely help. > Another thing which can be helpful is sysrq-t and see where things are > stuck. It just stops. I enabled kernel debugging and the USB serial console. Here are the gory details: -------------------------------------------- BUG: soft lockup - CPU#0 stuck for 61s! [blkid:1113] Modules linked in: ahci libahci libata ehci_hcd ohci_hcd dm_mod [last unloaded: scsi_wait_scan] irq event stamp: 2136355122 hardirqs last enabled at (2136355121): [<c12c3532>] _raw_spin_unlock_irq+0x22/0 x30 hardirqs last disabled at (2136355122): [<c12c3d4b>] apic_timer_interrupt+0x2f/0 x3c softirqs last enabled at (2136337366): [<c103d98c>] __do_softirq+0xfc/0x150 softirqs last disabled at (2136337351): [<c1005624>] do_softirq+0xa4/0x100 Modules linked in: ahci libahci libata ehci_hcd ohci_hcd dm_mod [last unloaded: scsi_wait_scan] Pid: 1113, comm: blkid Not tainted 2.6.36+ #24 MS-7349/MS-7349 EIP: 0060:[<c12c3536>] EFLAGS: 00000202 CPU: 0 EIP is at _raw_spin_unlock_irq+0x26/0x30 EAX: 7f563131 EBX: f6407000 ECX: 00000006 EDX: f60e98f4 ESI: 00000000 EDI: f61ce9c0 EBP: f5641d24 ESP: f5641d20 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 Process blkid (pid: 1113, ti=f5640000 task=f60e9530 task.ti=f5640000) Stack: f640b800 f5641d50 c104b646 001ce9f4 00000005 f61cea20 00000001 00000002 <0> f6407000 f61ce9c0 00000002 f5641d80 f5641df4 c104c990 00000000 00000002 <0> 00000000 c104c7e0 f60e9548 f60e98f4 00000010 f61ce9d0 f61cea30 00000001 Call Trace: [<c104b646>] ? flush_workqueue_prep_cwqs+0x106/0x200 [<c104c990>] ? flush_workqueue+0x1b0/0x430 [<c104c7e0>] ? flush_workqueue+0x0/0x430 [<c1213fb6>] ? md_open+0x96/0xa0 [<c10e4aa4>] ? __blkdev_get+0x74/0x310 [<c10e4d4a>] ? blkdev_get+0xa/0x10 [<c10e4e0e>] ? blkdev_open+0xbe/0x110 [<c10b9c50>] ? __dentry_open+0xa0/0x260 [<c10b9f01>] ? nameidata_to_filp+0x51/0x60 [<c10e4d50>] ? blkdev_open+0x0/0x110 [<c10c59a7>] ? do_last+0x327/0x600 [<c10c7772>] ? do_filp_open+0x1a2/0x4c0 [<c10a5eb7>] ? handle_mm_fault+0x137/0x8a0 [<c12c35bd>] ? _raw_spin_unlock+0x1d/0x20 [<c10d12c8>] ? alloc_fd+0xb8/0xf0 [<c10b9a51>] ? do_sys_open+0x51/0x100 [<c10b9b69>] ? sys_open+0x29/0x40 [<c1002e9f>] ? sysenter_do_call+0x12/0x38 Code: 26 00 00 00 00 55 ba 01 00 00 00 89 e5 8b 4d 04 53 89 c3 8d 40 10 e8 7a 40 da ff 89 d8 e8 b3 b7 e8 ff e8 be 11 da ff fb 66 66 90 <66> 90 5b 5d c3 90 8d 74 26 00 55 89 e5 83 ec 08 8b 4d 04 89 1c Call Trace: [<c104b646>] flush_workqueue_prep_cwqs+0x106/0x200 [<c104c990>] flush_workqueue+0x1b0/0x430 [<c104c7e0>] ? flush_workqueue+0x0/0x430 [<c1213fb6>] md_open+0x96/0xa0 [<c10e4aa4>] __blkdev_get+0x74/0x310 [<c10e4d4a>] blkdev_get+0xa/0x10 [<c10e4e0e>] blkdev_open+0xbe/0x110 [<c10b9c50>] __dentry_open+0xa0/0x260 [<c10b9f01>] nameidata_to_filp+0x51/0x60 [<c10e4d50>] ? blkdev_open+0x0/0x110 [<c10c59a7>] do_last+0x327/0x600 [<c10c7772>] do_filp_open+0x1a2/0x4c0 [<c10a5eb7>] ? handle_mm_fault+0x137/0x8a0 [<c12c35bd>] ? _raw_spin_unlock+0x1d/0x20 [<c10d12c8>] ? 
alloc_fd+0xb8/0xf0 [<c10b9a51>] do_sys_open+0x51/0x100 [<c10b9b69>] sys_open+0x29/0x40 [<c1002e9f>] sysenter_do_call+0x12/0x38 -------------------------------------------- Or a similar looking one: -------------------------------------------- BUG: soft lockup - CPU#0 stuck for 63s! [blkid:1113] Modules linked in: ahci libahci libata ehci_hcd ohci_hcd dm_mod [last unloaded: scsi_wait_scan] irq event stamp: 3897487660 hardirqs last enabled at (3897487659): [<c12c15c4>] __mutex_unlock_slowpath+0xb 4/0x150 hardirqs last disabled at (3897487660): [<c12c3d4b>] apic_timer_interrupt+0x2f/0 x3c softirqs last enabled at (3897470978): [<c103d98c>] __do_softirq+0xfc/0x150 softirqs last disabled at (3897470965): [<c1005624>] do_softirq+0xa4/0x100 Modules linked in: ahci libahci libata ehci_hcd ohci_hcd dm_mod [last unloaded: scsi_wait_scan] Pid: 1113, comm: blkid Not tainted 2.6.36+ #24 MS-7349/MS-7349 EIP: 0060:[<c1067298>] EFLAGS: 00000246 CPU: 0 EIP is at lock_acquire+0x88/0xa0 EAX: 00000246 EBX: f60e9530 ECX: 00000000 EDX: c1823eb0 ESI: 00000000 EDI: f5dc5d08 EBP: f5641dd0 ESP: f5641da8 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 Process blkid (pid: 1113, ti=f5640000 task=f60e9530 task.ti=f5640000) Stack: 00000000 00000002 00000000 00000000 c10e4a5a 00000000 00000246 f5dc5cd0 <0> 00000000 f5dc5cd0 f5641e0c c12c1eb1 00000000 00000002 00000000 c10e4a5a <0> f61c0800 f60e9530 00000000 f5641e2c f5dc5cd0 f5641e0c f5dc5cc0 f5520400 Call Trace: [<c10e4a5a>] ? __blkdev_get+0x2a/0x310 [<c12c1eb1>] ? mutex_lock_nested+0x51/0x2f0 [<c10e4a5a>] ? __blkdev_get+0x2a/0x310 [<c10e4a5a>] ? __blkdev_get+0x2a/0x310 [<c10e4d4a>] ? blkdev_get+0xa/0x10 [<c10e4e0e>] ? blkdev_open+0xbe/0x110 [<c10b9c50>] ? __dentry_open+0xa0/0x260 [<c10b9f01>] ? nameidata_to_filp+0x51/0x60 [<c10e4d50>] ? blkdev_open+0x0/0x110 [<c10c59a7>] ? do_last+0x327/0x600 [<c10c7772>] ? do_filp_open+0x1a2/0x4c0 [<c10a5eb7>] ? handle_mm_fault+0x137/0x8a0 [<c12c35bd>] ? _raw_spin_unlock+0x1d/0x20 [<c10d12c8>] ? alloc_fd+0xb8/0xf0 [<c10b9a51>] ? do_sys_open+0x51/0x100 [<c10b9b69>] ? sys_open+0x29/0x40 [<c1002e9f>] ? sysenter_do_call+0x12/0x38 Code: 01 83 e0 01 89 44 24 08 8b 45 0c 89 44 24 04 8b 45 08 89 04 24 89 f8 e8 77 e7 ff ff c7 83 a8 03 00 00 00 00 00 00 8b 45 f0 50 9d <66> 66 66 90 8b 5d f4 8b 75 f8 8b 7d fc 89 ec 5d c3 8d b4 26 00 Call Trace: [<c10e4a5a>] ? __blkdev_get+0x2a/0x310 [<c12c1eb1>] mutex_lock_nested+0x51/0x2f0 [<c10e4a5a>] ? __blkdev_get+0x2a/0x310 [<c10e4a5a>] __blkdev_get+0x2a/0x310 [<c10e4d4a>] blkdev_get+0xa/0x10 [<c10e4e0e>] blkdev_open+0xbe/0x110 [<c10b9c50>] __dentry_open+0xa0/0x260 [<c10b9f01>] nameidata_to_filp+0x51/0x60 [<c10e4d50>] ? blkdev_open+0x0/0x110 [<c10c59a7>] do_last+0x327/0x600 [<c10c7772>] do_filp_open+0x1a2/0x4c0 [<c10a5eb7>] ? handle_mm_fault+0x137/0x8a0 [<c12c35bd>] ? _raw_spin_unlock+0x1d/0x20 [<c10d12c8>] ? alloc_fd+0xb8/0xf0 [<c10b9a51>] do_sys_open+0x51/0x100 [<c10b9b69>] sys_open+0x29/0x40 [<c1002e9f>] sysenter_do_call+0x12/0x38 -------------------------------------------- I'll try to reproduce the backtrace with 2.6.38 and see if it's any different. Cheers, Thomas ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: raid1 boot regression in 2.6.37 [bisected] 2011-03-28 11:02 ` Thomas Jarosch @ 2011-03-28 12:53 ` Thomas Jarosch 2011-03-28 15:59 ` Tejun Heo 0 siblings, 1 reply; 22+ messages in thread From: Thomas Jarosch @ 2011-03-28 12:53 UTC (permalink / raw) To: Tejun Heo; +Cc: linux-raid, Neil Brown Hi, here's a backtrace from vanilla 2.6.38: -------------------------------------------- BUG: soft lockup - CPU#0 stuck for 64s! [blkid:1143] Modules linked in: ahci libahci libata dm_mod [last unloaded: scsi_wait_scan] irq event stamp: 981226512 hardirqs last enabled at (981226511): [<c12dc764>] __mutex_unlock_slowpath+0xb4 /0x150 hardirqs last disabled at (981226512): [<c12deeeb>] apic_timer_interrupt+0x2f/0x 3c softirqs last enabled at (981207466): [<c103dbfd>] __do_softirq+0xed/0x140 softirqs last disabled at (981207451): [<c100558d>] do_softirq+0x9d/0xf0 Modules linked in: ahci libahci libata dm_mod [last unloaded: scsi_wait_scan] Pid: 1143, comm: blkid Not tainted 2.6.38 #26 MSI MS-7349/MS-7349 EIP: 0060:[<c12dc768>] EFLAGS: 00000246 CPU: 0 EIP is at __mutex_unlock_slowpath+0xb8/0x150 EAX: 00000246 EBX: c14252a0 ECX: 00000001 EDX: 00000001 ESI: c14252c4 EDI: 00000246 EBP: f568fd8c ESP: f568fd80 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 Process blkid (pid: 1143, ti=f568e000 task=f615c030 task.ti=f568e000) Stack: 00000000 f545a9c0 00900000 f568fd94 c12dc808 f568fdc0 c11c4451 c11323b0 f568fdf0 f55e2e00 00000009 f61ae800 00000000 00000000 f568fdf0 f5df7490 f568fdd0 c11331b8 f5df7480 f55e2e00 f568fe00 c10e85aa f615c030 00000000 Call Trace: [<c12dc808>] mutex_unlock+0x8/0x10 [<c11c4451>] kobj_lookup+0xe1/0x140 [<c11323b0>] ? exact_match+0x0/0x10 [<c11331b8>] get_gendisk+0x98/0xb0 [<c10e85aa>] __blkdev_get+0xca/0x320 [<c10e8843>] blkdev_get+0x43/0x2c0 [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20 [<c10e8b12>] blkdev_open+0x52/0x70 [<c10bb12d>] __dentry_open+0x9d/0x240 [<c10bb3c6>] nameidata_to_filp+0x66/0x80 [<c10e8ac0>] ? blkdev_open+0x0/0x70 [<c10c781f>] finish_open+0xaf/0x190 [<c10c8a24>] ? do_path_lookup+0x44/0xe0 [<c10c9920>] do_filp_open+0x210/0x6d0 [<c10672e9>] ? lock_release_non_nested+0x59/0x2f0 [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20 [<c10d47d8>] ? alloc_fd+0xb8/0xf0 [<c10baf45>] do_sys_open+0x55/0xf0 [<c10bb049>] sys_open+0x29/0x40 [<c1002e9f>] sysenter_do_call+0x12/0x38 Code: 00 00 39 c6 74 11 89 d8 89 f2 e8 14 5a d8 ff 8b 46 08 e8 4c 75 d5 ff fe 43 04 f7 c7 00 02 00 00 74 21 e8 6c 7f d8 ff 89 f8 50 9d <66> 66 66 90 8b 1c 24 8b 74 24 04 8b 7c 24 08 89 ec 5d c3 90 8d Call Trace: [<c12dc808>] mutex_unlock+0x8/0x10 [<c11c4451>] kobj_lookup+0xe1/0x140 [<c11323b0>] ? exact_match+0x0/0x10 [<c11331b8>] get_gendisk+0x98/0xb0 [<c10e85aa>] __blkdev_get+0xca/0x320 [<c10e8843>] blkdev_get+0x43/0x2c0 [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20 [<c10e8b12>] blkdev_open+0x52/0x70 [<c10bb12d>] __dentry_open+0x9d/0x240 [<c10bb3c6>] nameidata_to_filp+0x66/0x80 [<c10e8ac0>] ? blkdev_open+0x0/0x70 [<c10c781f>] finish_open+0xaf/0x190 [<c10c8a24>] ? do_path_lookup+0x44/0xe0 [<c10c9920>] do_filp_open+0x210/0x6d0 [<c10672e9>] ? lock_release_non_nested+0x59/0x2f0 [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20 [<c10d47d8>] ? alloc_fd+0xb8/0xf0 [<c10baf45>] do_sys_open+0x55/0xf0 [<c10bb049>] sys_open+0x29/0x40 [<c1002e9f>] sysenter_do_call+0x12/0x38 -------------------------------------------- Or this one: -------------------------------------------- BUG: soft lockup - CPU#0 stuck for 63s! 
[blkid:1143] Modules linked in: ahci libahci libata dm_mod [last unloaded: scsi_wait_scan] irq event stamp: 2778370780 hardirqs last enabled at (2778370779): [<c12de715>] _raw_spin_unlock_irqrestore +0x35/0x60 hardirqs last disabled at (2778370780): [<c12deeeb>] apic_timer_interrupt+0x2f/0 x3c softirqs last enabled at (2778355336): [<c103dbfd>] __do_softirq+0xed/0x140 softirqs last disabled at (2778355323): [<c100558d>] do_softirq+0x9d/0xf0 Modules linked in: ahci libahci libata dm_mod [last unloaded: scsi_wait_scan] Pid: 1143, comm: blkid Not tainted 2.6.38 #26 MSI MS-7349/MS-7349 EIP: 0060:[<c12de719>] EFLAGS: 00000286 CPU: 0 EIP is at _raw_spin_unlock_irqrestore+0x39/0x60 EAX: 00000286 EBX: f568fd54 ECX: 00000006 EDX: f615c3ac ESI: 00000286 EDI: f568fd50 EBP: f568fccc ESP: f568fcc4 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 Process blkid (pid: 1143, ti=f568e000 task=f615c030 task.ti=f568e000) Stack: f568fd54 00000286 f568fce8 c102cd89 00000000 00000000 00000007 00000008 f61b0840 f568fd14 c104b9c8 001b0874 0000000b f61b08a0 00000007 00000008 f6503100 f61b0840 00000008 f568fd44 f568fdb8 c104cd20 00000000 00000002 Call Trace: [<c102cd89>] complete+0x49/0x60 [<c104b9c8>] flush_workqueue_prep_cwqs+0x138/0x200 [<c104cd20>] flush_workqueue+0x1b0/0x430 [<c104cb70>] ? flush_workqueue+0x0/0x430 [<c1229e79>] md_open+0x99/0xb0 [<c10e8557>] __blkdev_get+0x77/0x320 [<c10e8843>] blkdev_get+0x43/0x2c0 [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20 [<c10e8b12>] blkdev_open+0x52/0x70 [<c10bb12d>] __dentry_open+0x9d/0x240 [<c10bb3c6>] nameidata_to_filp+0x66/0x80 [<c10e8ac0>] ? blkdev_open+0x0/0x70 [<c10c781f>] finish_open+0xaf/0x190 [<c10c8a24>] ? do_path_lookup+0x44/0xe0 [<c10c9920>] do_filp_open+0x210/0x6d0 [<c10672e9>] ? lock_release_non_nested+0x59/0x2f0 [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20 [<c10d47d8>] ? alloc_fd+0xb8/0xf0 [<c10baf45>] do_sys_open+0x55/0xf0 [<c10bb049>] sys_open+0x29/0x40 [<c1002e9f>] sysenter_do_call+0x12/0x38 Code: 89 74 24 04 8d 40 10 89 d6 ba 01 00 00 00 e8 7f 8e d8 ff 89 d8 e8 48 5b e7 ff f7 c6 00 02 00 00 74 18 e8 bb 5f d8 ff 89 f0 50 9d <66> 66 66 90 8b 1c 24 8b 74 24 04 89 ec 5d c3 89 f0 50 9d 66 66 Call Trace: [<c102cd89>] complete+0x49/0x60 [<c104b9c8>] flush_workqueue_prep_cwqs+0x138/0x200 [<c104cd20>] flush_workqueue+0x1b0/0x430 [<c104cb70>] ? flush_workqueue+0x0/0x430 [<c1229e79>] md_open+0x99/0xb0 [<c10e8557>] __blkdev_get+0x77/0x320 [<c10e8843>] blkdev_get+0x43/0x2c0 [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20 [<c10e8b12>] blkdev_open+0x52/0x70 [<c10bb12d>] __dentry_open+0x9d/0x240 [<c10bb3c6>] nameidata_to_filp+0x66/0x80 [<c10e8ac0>] ? blkdev_open+0x0/0x70 [<c10c781f>] finish_open+0xaf/0x190 [<c10c8a24>] ? do_path_lookup+0x44/0xe0 [<c10c9920>] do_filp_open+0x210/0x6d0 [<c10672e9>] ? lock_release_non_nested+0x59/0x2f0 [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20 [<c10d47d8>] ? alloc_fd+0xb8/0xf0 [<c10baf45>] do_sys_open+0x55/0xf0 [<c10bb049>] sys_open+0x29/0x40 [<c1002e9f>] sysenter_do_call+0x12/0x38 -------------------------------------------- Cheers, Thomas ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: raid1 boot regression in 2.6.37 [bisected]
From: Tejun Heo @ 2011-03-28 15:59 UTC
To: Thomas Jarosch; +Cc: linux-raid, Neil Brown

Hello,

On Mon, Mar 28, 2011 at 02:53:53PM +0200, Thomas Jarosch wrote:
> Pid: 1143, comm: blkid Not tainted 2.6.38 #26 MSI MS-7349/MS-7349
> EIP: 0060:[<c12dc768>] EFLAGS: 00000246 CPU: 0
> EIP is at __mutex_unlock_slowpath+0xb8/0x150
> EAX: 00000246 EBX: c14252a0 ECX: 00000001 EDX: 00000001
> ESI: c14252c4 EDI: 00000246 EBP: f568fd8c ESP: f568fd80
>  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> Process blkid (pid: 1143, ti=f568e000 task=f615c030 task.ti=f568e000)
> Stack:
>  00000000 f545a9c0 00900000 f568fd94 c12dc808 f568fdc0 c11c4451 c11323b0
>  f568fdf0 f55e2e00 00000009 f61ae800 00000000 00000000 f568fdf0 f5df7490
>  f568fdd0 c11331b8 f5df7480 f55e2e00 f568fe00 c10e85aa f615c030 00000000
> Call Trace:
>  [<c12dc808>] mutex_unlock+0x8/0x10
>  [<c11c4451>] kobj_lookup+0xe1/0x140
>  [<c11323b0>] ? exact_match+0x0/0x10
>  [<c11331b8>] get_gendisk+0x98/0xb0
>  [<c10e85aa>] __blkdev_get+0xca/0x320
>  [<c10e8843>] blkdev_get+0x43/0x2c0
>  [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20
>  [<c10e8b12>] blkdev_open+0x52/0x70
>  [<c10bb12d>] __dentry_open+0x9d/0x240
>  [<c10bb3c6>] nameidata_to_filp+0x66/0x80
>  [<c10e8ac0>] ? blkdev_open+0x0/0x70
>  [<c10c781f>] finish_open+0xaf/0x190
>  [<c10c8a24>] ? do_path_lookup+0x44/0xe0
>  [<c10c9920>] do_filp_open+0x210/0x6d0
>  [<c10672e9>] ? lock_release_non_nested+0x59/0x2f0
>  [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20
>  [<c10d47d8>] ? alloc_fd+0xb8/0xf0
>  [<c10baf45>] do_sys_open+0x55/0xf0
>  [<c10bb049>] sys_open+0x29/0x40
>  [<c1002e9f>] sysenter_do_call+0x12/0x38

Hmmm... Weird.

* blkid seems to be looping in blkdev_open() repeatedly calling
  md_open() which keeps returning -ERESTARTSYS.

* It triggered softlockup. Even with -ERESTARTSYS looping, I can't
  see how that would be possible.

Is this custom boot script? If so, do you use RT priority in the
script?

Thanks.

--
tejun
* Re: raid1 boot regression in 2.6.37 [bisected]
From: Thomas Jarosch @ 2011-03-28 19:46 UTC
To: Tejun Heo; +Cc: linux-raid, Neil Brown

On 03/28/2011 05:59 PM, Tejun Heo wrote:
>> Call Trace:
>>  [<c12dc808>] mutex_unlock+0x8/0x10
>>  [<c11c4451>] kobj_lookup+0xe1/0x140
>>  [<c11323b0>] ? exact_match+0x0/0x10
>>  [<c11331b8>] get_gendisk+0x98/0xb0
>>  [<c10e85aa>] __blkdev_get+0xca/0x320
>>  [<c10e8843>] blkdev_get+0x43/0x2c0
>>  [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20
>>  [<c10e8b12>] blkdev_open+0x52/0x70
>>  [<c10bb12d>] __dentry_open+0x9d/0x240
>>  [<c10bb3c6>] nameidata_to_filp+0x66/0x80
>>  [<c10e8ac0>] ? blkdev_open+0x0/0x70
>>  [<c10c781f>] finish_open+0xaf/0x190
>>  [<c10c8a24>] ? do_path_lookup+0x44/0xe0
>>  [<c10c9920>] do_filp_open+0x210/0x6d0
>>  [<c10672e9>] ? lock_release_non_nested+0x59/0x2f0
>>  [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20
>>  [<c10d47d8>] ? alloc_fd+0xb8/0xf0
>>  [<c10baf45>] do_sys_open+0x55/0xf0
>>  [<c10bb049>] sys_open+0x29/0x40
>>  [<c1002e9f>] sysenter_do_call+0x12/0x38
>
> Hmmm... Weird.
>
> * blkid seems to be looping in blkdev_open() repeatedly calling
>   md_open() which keeps returning -ERESTARTSYS.
>
> * It triggered softlockup. Even with -ERESTARTSYS looping, I can't
>   see how that would be possible.
>
> Is this custom boot script? If so, do you use RT priority in the
> script?

It's a normal dracut installation with an additional custom script
to trigger kernel raid auto detection via mdadm.
The custom script was part of the initial post.

I've also noticed another odd thing: On a HP Proliant ML110 G6 box,
which is quite fast / SMP, the box brings up the software
RAID successfully. The box is slow as hell and I can see a constant load
on a kernel process (could be "kworker", don't remember it exactly).
I'll try tomorrow if that is also related to the RAID subsystem
or something else turning it into a PDP11...

Thanks for your help so far!

Cheers,
Thomas
* Re: raid1 boot regression in 2.6.37 [bisected]
From: Roberto Spadim @ 2011-03-28 19:59 UTC
To: Thomas Jarosch; +Cc: Tejun Heo, linux-raid, Neil Brown

That's what I did, just my own experience, not a linux-raid list opinion:
I had some problems with an HP server (the same model but G5) and new
Intel Core i7 desktops with Archlinux and kernel 2.6.37. Try an older
kernel version, just to keep your production server stable. Maybe some
changes between kernel 2.6.32 and 2.6.37 made md work worse; I will wait
for further kernel changes, since realtime work is being developed and
many modifications are in progress.

Using the same installation process with Archlinux on a Dell (I don't
remember the model, but I bought it in January 2011) removed the problem
(maybe a problem in the HP SAS interface/driver?! I didn't test the HP
with the Dell interface).

2011/3/28 Thomas Jarosch <thomas.jarosch@intra2net.com>:
> On 03/28/2011 05:59 PM, Tejun Heo wrote:
>>> Call Trace:
>>>  [<c12dc808>] mutex_unlock+0x8/0x10
>>>  [<c11c4451>] kobj_lookup+0xe1/0x140
>>>  [<c11323b0>] ? exact_match+0x0/0x10
>>>  [<c11331b8>] get_gendisk+0x98/0xb0
>>>  [<c10e85aa>] __blkdev_get+0xca/0x320
>>>  [<c10e8843>] blkdev_get+0x43/0x2c0
>>>  [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20
>>>  [<c10e8b12>] blkdev_open+0x52/0x70
>>>  [<c10bb12d>] __dentry_open+0x9d/0x240
>>>  [<c10bb3c6>] nameidata_to_filp+0x66/0x80
>>>  [<c10e8ac0>] ? blkdev_open+0x0/0x70
>>>  [<c10c781f>] finish_open+0xaf/0x190
>>>  [<c10c8a24>] ? do_path_lookup+0x44/0xe0
>>>  [<c10c9920>] do_filp_open+0x210/0x6d0
>>>  [<c10672e9>] ? lock_release_non_nested+0x59/0x2f0
>>>  [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20
>>>  [<c10d47d8>] ? alloc_fd+0xb8/0xf0
>>>  [<c10baf45>] do_sys_open+0x55/0xf0
>>>  [<c10bb049>] sys_open+0x29/0x40
>>>  [<c1002e9f>] sysenter_do_call+0x12/0x38
>>
>> Hmmm... Weird.
>>
>> * blkid seems to be looping in blkdev_open() repeatedly calling
>>   md_open() which keeps returning -ERESTARTSYS.
>>
>> * It triggered softlockup. Even with -ERESTARTSYS looping, I can't
>>   see how that would be possible.
>>
>> Is this custom boot script? If so, do you use RT priority in the
>> script?
>
> It's a normal dracut installation with an additional custom script
> to trigger kernel raid auto detection via mdadm.
> The custom script was part of the initial post.
>
> I've also noticed another odd thing: On a HP Proliant ML110 G6 box,
> which is quite fast / SMP, the box brings up the software
> RAID successfully. The box is slow as hell and I can see a constant load
> on a kernel process (could be "kworker", don't remember it exactly).
> I'll try tomorrow if that is also related to the RAID subsystem
> or something else turning it into a PDP11...
>
> Thanks for your help so far!
>
> Cheers,
> Thomas

--
Roberto Spadim
Spadim Technology / SPAEmpresarial
* Re: raid1 boot regression in 2.6.37 [bisected]
From: Thomas Jarosch @ 2011-03-29 12:06 UTC
To: Roberto Spadim; +Cc: linux-raid, Tejun Heo, Neil Brown

Hello Roberto,

On Monday, 28. March 2011 21:59:37 Roberto Spadim wrote:
> i had some problems with hp server (the same model but G5) and new
> intel core i7 desktops with archlinux and kernel 2.6.37, try to get a
> older kernel version, just to make your production server stable

Can you describe in more detail what kind of issue you were having?

Didn't it boot at all or was it just slow / unusable?
Was the issue related to md?

Cheers,
Thomas
* Re: raid1 boot regression in 2.6.37 [bisected]
From: Roberto Spadim @ 2011-03-29 12:22 UTC
To: Thomas Jarosch; +Cc: linux-raid, Tejun Heo, Neil Brown

It booted without problem and worked with apache, mysql and softraid (md),
but after some time it 'crashed', with many messages like "kernel not
tainted" / "kernel tainted <0x0000fffff>" and some debug information on
the console.

I tested with xfs, reiserfs3 and ext4 and nothing changed, so maybe it's a
lower-layer problem: softraid, hardware module, kernel thread/process
options. I didn't diff the old and the new kernel config files, so maybe
something changed there that I missed. Changing to the old kernel solved
the problem. I didn't find performance problems, just kernel messages, and
sometimes it stopped working and I had to reboot.

2011/3/29 Thomas Jarosch <thomas.jarosch@intra2net.com>:
> Hello Roberto,
>
> On Monday, 28. March 2011 21:59:37 Roberto Spadim wrote:
>> i had some problems with hp server (the same model but G5) and new
>> intel core i7 desktops with archlinux and kernel 2.6.37, try to get a
>> older kernel version, just to make your production server stable
>
> Can you describe in more detail what kind of issue you were having?
>
> Didn't it boot at all or was it just slow / unusable?
> Was the issue related to md?
>
> Cheers,
> Thomas

--
Roberto Spadim
Spadim Technology / SPAEmpresarial
* Re: raid1 boot regression in 2.6.37 [bisected]
From: Tejun Heo @ 2011-03-29  8:25 UTC
To: Thomas Jarosch; +Cc: linux-raid, Neil Brown

On Mon, Mar 28, 2011 at 09:46:08PM +0200, Thomas Jarosch wrote:
> On 03/28/2011 05:59 PM, Tejun Heo wrote:
> >> Call Trace:
> >>  [<c12dc808>] mutex_unlock+0x8/0x10
> >>  [<c11c4451>] kobj_lookup+0xe1/0x140
> >>  [<c11323b0>] ? exact_match+0x0/0x10
> >>  [<c11331b8>] get_gendisk+0x98/0xb0
> >>  [<c10e85aa>] __blkdev_get+0xca/0x320
> >>  [<c10e8843>] blkdev_get+0x43/0x2c0
> >>  [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20
> >>  [<c10e8b12>] blkdev_open+0x52/0x70
> >>  [<c10bb12d>] __dentry_open+0x9d/0x240
> >>  [<c10bb3c6>] nameidata_to_filp+0x66/0x80
> >>  [<c10e8ac0>] ? blkdev_open+0x0/0x70
> >>  [<c10c781f>] finish_open+0xaf/0x190
> >>  [<c10c8a24>] ? do_path_lookup+0x44/0xe0
> >>  [<c10c9920>] do_filp_open+0x210/0x6d0
> >>  [<c10672e9>] ? lock_release_non_nested+0x59/0x2f0
> >>  [<c12de75d>] ? _raw_spin_unlock+0x1d/0x20
> >>  [<c10d47d8>] ? alloc_fd+0xb8/0xf0
> >>  [<c10baf45>] do_sys_open+0x55/0xf0
> >>  [<c10bb049>] sys_open+0x29/0x40
> >>  [<c1002e9f>] sysenter_do_call+0x12/0x38
> >
> > Hmmm... Weird.
> >
> > * blkid seems to be looping in blkdev_open() repeatedly calling
> >   md_open() which keeps returning -ERESTARTSYS.
> >
> > * It triggered softlockup. Even with -ERESTARTSYS looping, I can't
> >   see how that would be possible.
> >
> > Is this custom boot script? If so, do you use RT priority in the
> > script?
>
> It's a normal dracut installation with an additional custom script
> to trigger kernel raid auto detection via mdadm.
> The custom script was part of the initial post.
>
> I've also noticed another odd thing: On a HP Proliant ML110 G6 box,
> which is quite fast / SMP, the box brings up the software
> RAID successfully. The box is slow as hell and I can see a constant load
> on a kernel process (could be "kworker", don't remember it exactly).
> I'll try tomorrow if that is also related to the RAID subsystem
> or something else turning it into a PDP11...

Can you please apply the following patch and see whether it resolves
the problem and report the boot log?

Thanks.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8b66e04..e17098b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6001,6 +6001,15 @@ static int md_open(struct block_device *bdev, fmode_t mode)
 	 * bd_disk.
 	 */
 	mddev_put(mddev);
+	if (current->policy == SCHED_FIFO || current->policy == SCHED_RR) {
+		static bool once;
+		if (!once) {
+			printk("%s: md_open(): RT prio, pol=%u p=%d rt_p=%u\n",
+			       current->comm, current->policy, current->static_prio, current->rt_priority);
+			once = true;
+		}
+	}
+	msleep(10);
 	/* Wait until bdev->bd_disk is definitely gone */
 	flush_workqueue(md_misc_wq);
 	/* Then retry the open from the top */
* Re: raid1 boot regression in 2.6.37 [bisected] 2011-03-29 8:25 ` Tejun Heo @ 2011-03-29 9:53 ` Thomas Jarosch 2011-03-29 10:07 ` Tejun Heo 0 siblings, 1 reply; 22+ messages in thread From: Thomas Jarosch @ 2011-03-29 9:53 UTC (permalink / raw) To: Tejun Heo; +Cc: linux-raid, Neil Brown On Tuesday, 29. March 2011 10:25:03 Tejun Heo wrote: > Can you please apply the following patch and see whether it resolves > the problem and report the boot log? Ok, I did the following: - Check out commit e804ac780e2f01cb3b914daca2fd4780d1743db1 (md: fix and update workqueue usage) - Apply your patch - Add small debug output on top of it: ------------------------------ # git diff diff --git a/drivers/md/md.c b/drivers/md/md.c index 1e6534d..d2ddef4 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -5899,6 +5899,16 @@ static int md_open(struct block_device *bdev, fmode_t mode) once = true; } } + /* DEBUG HACK */ + { + static bool tomj_once = false; + if (!tomj_once) + { + printk("TOMJ %s: md_open(): RT prio, pol=%u p=%d rt_p=%u\n", + current->comm, current->policy, current->static_prio, current->rt_priority); + tomj_once = true; + } + } msleep(10); /* Wait until bdev->bd_disk is definitely gone */ flush_workqueue(md_misc_wq); ------------------------------ With your patch applied, it boots! Here's the dmesg output: ------------------------------ pci 0000:00:00.0: Found enabled HT MSI Mapping pci 0000:00:00.0: Found enabled HT MSI Mapping pci 0000:00:00.0: Found enabled HT MSI Mapping pci 0000:00:00.0: Found enabled HT MSI Mapping pci 0000:00:12.0: Boot video device PCI: CLS 64 bytes, default 64 Trying to unpack rootfs image as initramfs... Freeing initrd memory: 5120k freed apm: BIOS version 1.2 Flags 0x03 (Driver version 1.16ac) apm: overridden by ACPI. audit: initializing netlink socket (disabled) type=2000 audit(1301389498.428:1): initialized highmem bounce pool size: 64 pages msgmni has been set to 1730 Block layer SCSI generic (bsg) driver version 0.4 loaded (major 253) io scheduler noop registered io scheduler deadline registered io scheduler cfq registered (default) pcieport 0000:00:0b.0: ACPI _OSC control granted for 0x1d pcieport 0000:00:0b.0: setting latency timer to 64 pcieport 0000:00:0b.0: irq 40 for MSI/MSI-X pcieport 0000:00:0c.0: ACPI _OSC control granted for 0x1d pcieport 0000:00:0c.0: setting latency timer to 64 pcieport 0000:00:0c.0: irq 41 for MSI/MSI-X pcieport 0000:00:0d.0: ACPI _OSC control granted for 0x1d pcieport 0000:00:0d.0: setting latency timer to 64 pcieport 0000:00:0d.0: irq 42 for MSI/MSI-X pcieport 0000:00:0e.0: ACPI _OSC control granted for 0x1d pcieport 0000:00:0e.0: setting latency timer to 64 pcieport 0000:00:0e.0: irq 43 for MSI/MSI-X pcieport 0000:00:0f.0: ACPI _OSC control granted for 0x1d pcieport 0000:00:0f.0: setting latency timer to 64 pcieport 0000:00:0f.0: irq 44 for MSI/MSI-X pcieport 0000:00:10.0: ACPI _OSC control granted for 0x1d pcieport 0000:00:10.0: setting latency timer to 64 pcieport 0000:00:10.0: irq 45 for MSI/MSI-X pcieport 0000:00:11.0: ACPI _OSC control granted for 0x1d pcieport 0000:00:11.0: setting latency timer to 64 pcieport 0000:00:11.0: irq 46 for MSI/MSI-X ACPI: acpi_idle registered with cpuidle Non-volatile memory driver v1.3 Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A 00:04: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A brd: module loaded loop: module loaded Uniform Multi-Platform E-IDE driver ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver ACPI: PCI Interrupt 
Link [LUB2] enabled at IRQ 23 ehci_hcd 0000:00:02.1: PCI INT B -> Link[LUB2] -> GSI 23 (level, low) -> IRQ 23 ehci_hcd 0000:00:02.1: setting latency timer to 64 ehci_hcd 0000:00:02.1: EHCI Host Controller ehci_hcd 0000:00:02.1: new USB bus registered, assigned bus number 1 ehci_hcd 0000:00:02.1: debug port 1 ehci_hcd 0000:00:02.1: cache line size of 64 is not supported ehci_hcd 0000:00:02.1: irq 23, io mem 0xfe8fec00 ehci_hcd 0000:00:02.1: USB 2.0 started, EHCI 1.00 hub 1-0:1.0: USB hub found hub 1-0:1.0: 6 ports detected ACPI: PCI Interrupt Link [UB12] enabled at IRQ 22 ehci_hcd 0000:00:04.1: PCI INT B -> Link[UB12] -> GSI 22 (level, low) -> IRQ 22 ehci_hcd 0000:00:04.1: setting latency timer to 64 ehci_hcd 0000:00:04.1: EHCI Host Controller ehci_hcd 0000:00:04.1: new USB bus registered, assigned bus number 2 ehci_hcd 0000:00:04.1: debug port 1 ehci_hcd 0000:00:04.1: cache line size of 64 is not supported ehci_hcd 0000:00:04.1: irq 22, io mem 0xfe8fe800 ehci_hcd 0000:00:04.1: USB 2.0 started, EHCI 1.00 hub 2-0:1.0: USB hub found hub 2-0:1.0: 6 ports detected ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver ACPI: PCI Interrupt Link [LUB0] enabled at IRQ 21 ohci_hcd 0000:00:02.0: PCI INT A -> Link[LUB0] -> GSI 21 (level, low) -> IRQ 21 ohci_hcd 0000:00:02.0: setting latency timer to 64 ohci_hcd 0000:00:02.0: OHCI Host Controller ohci_hcd 0000:00:02.0: new USB bus registered, assigned bus number 3 ohci_hcd 0000:00:02.0: irq 21, io mem 0xfe8ff000 hub 3-0:1.0: USB hub found hub 3-0:1.0: 6 ports detected ACPI: PCI Interrupt Link [UB11] enabled at IRQ 20 ohci_hcd 0000:00:04.0: PCI INT A -> Link[UB11] -> GSI 20 (level, low) -> IRQ 20 ohci_hcd 0000:00:04.0: setting latency timer to 64 ohci_hcd 0000:00:04.0: OHCI Host Controller ohci_hcd 0000:00:04.0: new USB bus registered, assigned bus number 4 ohci_hcd 0000:00:04.0: irq 20, io mem 0xfe8fd000 hub 4-0:1.0: USB hub found hub 4-0:1.0: 6 ports detected uhci_hcd: USB Universal Host Controller Interface driver usbcore: registered new interface driver usbserial USB Serial support registered for generic usbcore: registered new interface driver usbserial_generic usbserial: USB Serial Driver core USB Serial support registered for FTDI USB Serial Device usbcore: registered new interface driver ftdi_sio ftdi_sio: v1.6.0:USB FTDI Serial Converters Driver PNP: No PS/2 controller found. Probing ports directly. serio: i8042 KBD port at 0x60,0x64 irq 1 serio: i8042 AUX port at 0x60,0x64 irq 12 mice: PS/2 mouse device common for all mice input: PC Speaker as /devices/platform/pcspkr/input/input0 md: raid1 personality registered for level 1 cpuidle: using governor ladder cpuidle: using governor menu usbcore: registered new interface driver usbhid usbhid: USB HID core driver TCP cubic registered NET: Registered protocol family 17 Using IPI No-Shortcut mode drivers/rtc/hctosys.c: unable to open rtc device (rtc0) Freeing unused kernel memory: 408k freed Write protecting the kernel text: 2896k Write protecting the kernel read-only data: 1108k dracut: dracut-004-6.i2n device-mapper: ioctl: 4.18.0-ioctl (2010-06-29) initialised: dm-devel@redhat.com dracut: rd_NO_MDIMSM: no MD RAID for imsm/isw raids udev: starting version 145 udevd (582): /proc/582/oom_adj is deprecated, please use /proc/582/oom_score_adj instead. dracut: Telling kernel to auto-detect RAID arrays libata version 3.00 loaded. 
ahci 0000:00:09.0: version 3.0 ACPI: PCI Interrupt Link [LSA0] enabled at IRQ 23 ahci 0000:00:09.0: PCI INT A -> Link[LSA0] -> GSI 23 (level, low) -> IRQ 23 ahci 0000:00:09.0: controller can't do PMP, turning off CAP_PMP ahci 0000:00:09.0: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0xf impl IDE mode ahci 0000:00:09.0: flags: 64bit ncq sntf led clo pio ahci 0000:00:09.0: setting latency timer to 64 scsi0 : ahci scsi1 : ahci scsi2 : ahci scsi3 : ahci ata1: SATA max UDMA/133 irq_stat 0x00400040, connection status changed irq 23 ata2: SATA max UDMA/133 irq_stat 0x00400040, connection status changed irq 23 ata3: SATA max UDMA/133 abar m8192@0xfe8f6000 port 0xfe8f6200 irq 23 ata4: SATA max UDMA/133 abar m8192@0xfe8f6000 port 0xfe8f6280 irq 23 usb 4-1: new low speed USB device using ohci_hcd and address 2 ata4: SATA link down (SStatus 0 SControl 300) ata3: SATA link down (SStatus 0 SControl 300) input: USB CAT5 as /devices/pci0000:00/0000:00:04.0/usb4/4-1/4-1:1.0/input/input1 generic-usb 0003:0D3D:0040.0001: input: USB HID v1.00 Keyboard [USB CAT5] on usb-0000:00:04.0-1/input0 input: USB CAT5 as /devices/pci0000:00/0000:00:04.0/usb4/4-1/4-1:1.1/input/input2 generic-usb 0003:0D3D:0040.0002: input: USB HID v1.00 Mouse [USB CAT5] on usb-0000:00:04.0-1/input1 ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300) usb 4-2: new full speed USB device using ohci_hcd and address 3 ata2.00: ATA-7: ST3160215AS, 3.AAD, max UDMA/133 ata2.00: 312581808 sectors, multi 16: LBA48 NCQ (depth 31/32) ata1.00: ATA-7: ST3160215AS, 3.AAD, max UDMA/133 ata1.00: 312581808 sectors, multi 16: LBA48 NCQ (depth 31/32) ata2.00: configured for UDMA/133 ata1.00: configured for UDMA/133 scsi 0:0:0:0: Direct-Access ATA ST3160215AS 3.AA PQ: 0 ANSI: 5 sd 0:0:0:0: [sda] 312581808 512-byte logical blocks: (160 GB/149 GiB) sd 0:0:0:0: [sda] Write Protect is off sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00 sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA scsi 1:0:0:0: Direct-Access ATA ST3160215AS 3.AA PQ: 0 ANSI: 5 sd 1:0:0:0: [sdb] 312581808 512-byte logical blocks: (160 GB/149 GiB) sd 1:0:0:0: [sdb] Write Protect is off sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00 sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA sdb: sdb1 sdb2 < sdb5 sdb6 sdb7 sdb8 sdb9 > sd 1:0:0:0: [sdb] Attached SCSI disk ftdi_sio 4-2:1.0: FTDI USB Serial Device converter detected usb 4-2: Detected FT232BM usb 4-2: Number of endpoints 2 usb 4-2: Endpoint 1 MaxPacketSize 64 usb 4-2: Endpoint 2 MaxPacketSize 64 usb 4-2: Setting MaxPacketSize 64 sda: sda1 sda2 < sda5 sda6 sda7 sda8 sda9 > sd 0:0:0:0: [sda] Attached SCSI disk usb 4-2: FTDI USB Serial Device converter now attached to ttyUSB0 console [ttyUSB0] enabled usb 4-4: new full speed USB device using ohci_hcd and address 4 TOMJ blkid: md_open(): RT prio, pol=0 p=118 rt_p=0 md: Autodetecting RAID arrays. md: Scanned 12 and added 12 devices. md: autorun ... md: considering sda9 ... md: adding sda9 ... md: sda8 has different UUID to sda9 md: sda7 has different UUID to sda9 md: sda6 has different UUID to sda9 md: sda5 has different UUID to sda9 md: sda1 has different UUID to sda9 md: adding sdb9 ... 
md: sdb8 has different UUID to sda9 md: sdb7 has different UUID to sda9 md: sdb6 has different UUID to sda9 md: sdb5 has different UUID to sda9 md: sdb1 has different UUID to sda9 md: created md5 md: bind<sdb9> md: bind<sda9> md: running: <sda9><sdb9> md/raid1:md5: active with 2 out of 2 mirrors md5: detected capacity change from 0 to 147446235136 md5: unknown partition table md: considering sda8 ... md: adding sda8 ... md: sda7 has different UUID to sda8 md: sda6 has different UUID to sda8 md: sda5 has different UUID to sda8 md: sda1 has different UUID to sda8 md: adding sdb8 ... md: sdb7 has different UUID to sda8 md: sdb6 has different UUID to sda8 md: sdb5 has different UUID to sda8 md: sdb1 has different UUID to sda8 md: created md4 md: bind<sdb8> md: bind<sda8> md: running: <sda8><sdb8> md/raid1:md4: active with 2 out of 2 mirrors md4: detected capacity change from 0 to 2097348608 md4: unknown partition table md: considering sda7 ... md: adding sda7 ... md: sda6 has different UUID to sda7 md: sda5 has different UUID to sda7 md: sda1 has different UUID to sda7 md: adding sdb7 ... md: sdb6 has different UUID to sda7 md: sdb5 has different UUID to sda7 md: sdb1 has different UUID to sda7 md: created md3 md: bind<sdb7> md: bind<sda7> md: running: <sda7><sdb7> md/raid1:md3: active with 2 out of 2 mirrors md3: detected capacity change from 0 to 2623733760 md3: unknown partition table md: considering sda6 ... md: adding sda6 ... md: sda5 has different UUID to sda6 md: sda1 has different UUID to sda6 md: adding sdb6 ... md: sdb5 has different UUID to sda6 md: sdb1 has different UUID to sda6 md: created md2 md: bind<sdb6> md: bind<sda6> md: running: <sda6><sdb6> md/raid1:md2: active with 2 out of 2 mirrors md2: detected capacity change from 0 to 2623733760 md: considering sda5 ... md: adding sda5 ... md: sda1 has different UUID to sda5 md: adding sdb5 ... md: sdb1 has different UUID to sda5 md: created md1 md: bind<sdb5> md: bind<sda5> md: running: <sda5><sdb5> md/raid1:md1: active with 2 out of 2 mirrors md1: detected capacity change from 0 to 2623733760 md1: unknown partition table md: considering sda1 ... md: adding sda1 ... md: adding sdb1 ... md: created md0 md: bind<sdb1> md: bind<sda1> md: running: <sda1><sdb1> md/raid1:md0: active with 2 out of 2 mirrors md0: detected capacity change from 0 to 2623733760 md: ... autorun DONE. md0: unknown partition table md5: detected capacity change from 0 to 147446235136 md5: unknown partition table md3: detected capacity change from 0 to 2623733760 md3: unknown partition table md2: detected capacity change from 0 to 2623733760 md2: unknown partition table md1: detected capacity change from 0 to 2623733760 md4: detected capacity change from 0 to 2097348608 md4: unknown partition table md1: unknown partition table usb 4-4: usbfs: interface 0 claimed by usbfs while 'display' sets config #1 EXT3-fs: barriers not enabled kjournald starting. Commit interval 5 seconds EXT3-fs (md0): mounted filesystem with ordered data mode dracut: Remounting /dev/md0 with -o ro,defaults,noatime EXT3-fs: barriers not enabled kjournald starting. Commit interval 5 seconds EXT3-fs (md0): mounted filesystem with ordered data mode dracut: Mounted root filesystem /dev/md0 dracut: Switching root ------------------------------ As you can see, your printk() is not triggered(). I just copied your printk and made it print once unconditionally. So probably the msleep(10); does the trick. 
Something seems very racy to me as other boxes with software RAID
can boot the exact same kernel + dracut version just fine.

I'll put the box in a reboot loop over the lunch break.

Cheers,
Thomas
* Re: raid1 boot regression in 2.6.37 [bisected]
From: Tejun Heo @ 2011-03-29 10:07 UTC
To: Thomas Jarosch; +Cc: linux-raid, Neil Brown

On Tue, Mar 29, 2011 at 11:53:06AM +0200, Thomas Jarosch wrote:
> On Tuesday, 29. March 2011 10:25:03 Tejun Heo wrote:
> > Can you please apply the following patch and see whether it resolves
> > the problem and report the boot log?
>
> Ok, I did the following:
> - Check out commit e804ac780e2f01cb3b914daca2fd4780d1743db1
>   (md: fix and update workqueue usage)
> - Apply your patch
> - Add small debug output on top of it:
>
> ------------------------------
> # git diff
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 1e6534d..d2ddef4 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5899,6 +5899,16 @@ static int md_open(struct block_device *bdev, fmode_t mode)
>  			once = true;
>  		}
>  	}
> +	/* DEBUG HACK */
> +	{
> +		static bool tomj_once = false;
> +		if (!tomj_once)
> +		{
> +			printk("TOMJ %s: md_open(): RT prio, pol=%u p=%d rt_p=%u\n",
> +			       current->comm, current->policy, current->static_prio, current->rt_priority);
> +			tomj_once = true;
> +		}
> +	}
>  	msleep(10);
>  	/* Wait until bdev->bd_disk is definitely gone */
>  	flush_workqueue(md_misc_wq);
...
> TOMJ blkid: md_open(): RT prio, pol=0 p=118 rt_p=0
...
> As you can see, your printk() is not triggered(). I just
> copied your printk and made it print once unconditionally.
>
> So probably the msleep(10); does the trick. Something
> seems very racy to me as other boxes with software RAID
> can boot the exact same kernel + dracut version just fine.
>
> I'll put the box in a reboot loop over the lunch break.

Hmmm.. interesting, so no RT task there. I don't know why the
softlockup is triggering then. Ah, okay, none of CONFIG_PREEMPT and
CONFIG_PREEMPT_VOLUNTARY is set, right?

Anyways, the root cause here is that md_open() -ERESTARTSYS retrying
is busy looping without giving the put path a chance to run. When it
was using flush_scheduled_work(), there were some unrelated work items
there so it ended up sleeping by accident giving the put path a chance
to run. With the conversion, the flush domain is reduced and there's
nothing unrelated to wait for so it just busy loops.

Neil, we can put a short unconditional sleep there or somehow ensure
work item is queued before the restart loop engages. What do you
think?

Thanks.

--
tejun
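[A compact userspace model of the livelock described in the message above may make the mechanism easier to see. It is only a sketch: the names (cleanup_queued, flush_misc_queue, model_open) and the boolean "workqueue" are illustrative stand-ins, not the real md or workqueue API.]
--------------------------------------------
/*
 * Toy userspace model of the livelock, not kernel code.  The "open" path
 * keeps retrying until a deferred cleanup has run, but it only blocks if
 * the queue it flushes actually contains that cleanup.  With a dedicated,
 * otherwise empty queue the flush returns immediately, so on a
 * non-preemptive single CPU the retry loop never lets the other side run.
 */
#include <stdbool.h>
#include <stdio.h>

static bool cleanup_queued; /* would be set by the model's "mddev_put()" */
static bool cleanup_done;   /* stale "gendisk" finally gone              */

/* A flush only waits for work that is already on the queue. */
static void flush_misc_queue(void)
{
	if (cleanup_queued) {
		cleanup_done = true;    /* pending cleanup runs; open can succeed */
		cleanup_queued = false;
	}
	/* empty queue: return at once, nothing to sleep on */
}

/* Model of md_open(): stale state means "retry from the top". */
static int model_open(void)
{
	return cleanup_done ? 0 : -1;   /* -1 stands in for -ERESTARTSYS */
}

int main(void)
{
	cleanup_queued = false;     /* nothing was ever queued on this queue */

	for (int attempt = 1; attempt <= 5; attempt++) {  /* __blkdev_get retry */
		if (model_open() == 0) {
			printf("open succeeded on attempt %d\n", attempt);
			return 0;
		}
		flush_misc_queue();     /* no-op here, so the loop just spins */
	}
	puts("livelock: the flush never had anything to wait for");
	return 1;
}
--------------------------------------------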
* Re: raid1 boot regression in 2.6.37 [bisected]
From: Thomas Jarosch @ 2011-03-29 11:52 UTC
To: Tejun Heo; +Cc: linux-raid, Neil Brown

On Tuesday, 29. March 2011 12:07:44 Tejun Heo wrote:
> > I'll put the box in a reboot loop over the lunch break.
>
> Hmmm.. interesting, so no RT task there. I don't know why the
> softlockup is triggering then. Ah, okay, none of CONFIG_PREEMPT and
> CONFIG_PREEMPT_VOLUNTARY is set, right?

"grep PREEM .config" shows:

# CONFIG_PREEMPT_RCU is not set
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set

You were right :)

> Anyways, the root cause here is that md_open() -ERESTARTSYS retrying
> is busy looping without giving the put path a chance to run. When it
> was using flush_scheduled_work(), there were some unrelated work items
> there so it ended up sleeping by accident giving the put path a chance
> to run. With the conversion, the flush domain is reduced and there's
> nothing unrelated to wait for so it just busy loops.

The test box survived 52 reboots without a hiccup. When we come up with
a final solution, I'm going to let that reboot loop run overnight to be
on the safe side. Thanks!

The HP Proliant ML110 G6 issue seems unrelated to this one as I can
reproduce the slowdown issue also with 2.6.36. I also don't think it's
md related as the box is only slow over the network (igb driver) and
fast on the console. So it must be something else.

Crap, two issues to hunt down :o)

Thomas
* Re: raid1 boot regression in 2.6.37 [bisected]
From: NeilBrown @ 2011-04-05  3:46 UTC
To: Tejun Heo; +Cc: Thomas Jarosch, linux-raid

On Tue, 29 Mar 2011 12:07:44 +0200 Tejun Heo <tj@kernel.org> wrote:

> On Tue, Mar 29, 2011 at 11:53:06AM +0200, Thomas Jarosch wrote:
> > On Tuesday, 29. March 2011 10:25:03 Tejun Heo wrote:
> > > Can you please apply the following patch and see whether it resolves
> > > the problem and report the boot log?
> >
> > Ok, I did the following:
> > - Check out commit e804ac780e2f01cb3b914daca2fd4780d1743db1
> >   (md: fix and update workqueue usage)
> > - Apply your patch
> > - Add small debug output on top of it:
> >
> > ------------------------------
> > # git diff
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 1e6534d..d2ddef4 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -5899,6 +5899,16 @@ static int md_open(struct block_device *bdev, fmode_t mode)
> >  			once = true;
> >  		}
> >  	}
> > +	/* DEBUG HACK */
> > +	{
> > +		static bool tomj_once = false;
> > +		if (!tomj_once)
> > +		{
> > +			printk("TOMJ %s: md_open(): RT prio, pol=%u p=%d rt_p=%u\n",
> > +			       current->comm, current->policy, current->static_prio, current->rt_priority);
> > +			tomj_once = true;
> > +		}
> > +	}
> >  	msleep(10);
> >  	/* Wait until bdev->bd_disk is definitely gone */
> >  	flush_workqueue(md_misc_wq);
> ...
> > TOMJ blkid: md_open(): RT prio, pol=0 p=118 rt_p=0
> ...
> > As you can see, your printk() is not triggered(). I just
> > copied your printk and made it print once unconditionally.
> >
> > So probably the msleep(10); does the trick. Something
> > seems very racy to me as other boxes with software RAID
> > can boot the exact same kernel + dracut version just fine.
> >
> > I'll put the box in a reboot loop over the lunch break.
>
> Hmmm.. interesting, so no RT task there. I don't know why the
> softlockup is triggering then. Ah, okay, none of CONFIG_PREEMPT and
> CONFIG_PREEMPT_VOLUNTARY is set, right?
>
> Anyways, the root cause here is that md_open() -ERESTARTSYS retrying
> is busy looping without giving the put path a chance to run. When it
> was using flush_scheduled_work(), there were some unrelated work items
> there so it ended up sleeping by accident giving the put path a chance
> to run. With the conversion, the flush domain is reduced and there's
> nothing unrelated to wait for so it just busy loops.
>
> Neil, we can put a short unconditional sleep there or somehow ensure
> work item is queued before the restart loop engages. What do you
> think?

(sorry for delay - leave and stuff...)

I don't think we should need to add a sleep - it should work as is.
Obviously it doesn't so I must be missing something, but I would like to
find out what it is to make sure I fix the right thing.

The interesting race is between md_open and mddev_put.

After mddev_put has made a firm decision to discard the mddev and
consequently del_gendisk, __blkdev_get can find the gendisk and try to
re-open it. This will find a different (new) mddev which is a problem.
So we return -ERESTARTSYS in that case so the lookup for a gendisk will
be retried.

The mddev_find in md_open will only return a new mddev after taking
all_mddevs_lock and finding that the old mddev is no longer available.
This must be after mddev_put has dropped the lock, and so after
queue_work has been called.

After mddev_find returns the new mddev, md_open calls flush_workqueue
and as the work item to complete the delete has definitely been queued, it
should wait for that work item to complete.

So the next time around the retry loop in __blkdev_get the old gendisk will
not be found....

Where is my logic wrong??

To put it another way matching your description Tejun, the put path has a
chance to run firstly while mddev_find is waiting for the spinlock, and then
while flush_workqueue is waiting for the rest of the put path to complete.

Thanks for any light you can shed on this...

NeilBrown
* Re: raid1 boot regression in 2.6.37 [bisected]
From: Tejun Heo @ 2011-04-06 10:16 UTC
To: NeilBrown; +Cc: Thomas Jarosch, linux-raid

Hey, Neil.

On Tue, Apr 05, 2011 at 01:46:29PM +1000, NeilBrown wrote:
> After mddev_find returns the new mddev, md_open calls flush_workqueue
> and as the work item to complete the delete has definitely been queued, it
> should wait for that work item to complete.
>
> So the next time around the retry loop in __blkdev_get the old gendisk will
> not be found....
>
> Where is my logic wrong??
>
> To put it another way matching your description Tejun, the put path has a
> chance to run firstly while mddev_find is waiting for the spinlock, and then
> while flush_workqueue is waiting for the rest of the put path to complete.

I don't think the logic is wrong per-se. It's more likely that the
implemented code doesn't really follow the model described by the
logic.

Probably the best way would be reproducing the problem and throwing in
some diagnostic code to tell the sequence of events? If work is being
queued first but it still ends up busy looping, that would be a bug in
flush_workqueue(), but I think it's more likely that the restart
condition somehow triggers in an unexpected way without the work item
queued as expected.

Thanks.

--
tejun
* Re: raid1 boot regression in 2.6.37 [bisected]
From: Thomas Jarosch @ 2011-04-12 14:05 UTC
To: Tejun Heo; +Cc: NeilBrown, linux-raid

Hello Neil,

On Wednesday, 6. April 2011 12:16:00 Tejun Heo wrote:
> > To put it another way matching your description Tejun, the put path has
> > a chance to run firstly while mddev_find is waiting for the spinlock,
> > and then while flush_workqueue is waiting for the rest of the put path
> > to complete.
>
> I don't think the logic is wrong per-se. It's more likely that the
> implemented code doesn't really follow the model described by the
> logic.
>
> Probably the best way would be reproducing the problem and throwing in
> some diagnostic code to tell the sequence of events? If work is being
> queued first but it still ends up busy looping, that would be a bug in
> flush_workqueue(), but I think it's more likely that the restart
> condition somehow triggers in an unexpected way without the work item
> queued as expected.

I can test any debug patch you want, the box is in a test lab anyway.

Best regards,
Thomas
* Re: raid1 boot regression in 2.6.37 [bisected]
From: NeilBrown @ 2011-04-12 22:44 UTC
To: Thomas Jarosch; +Cc: Tejun Heo, linux-raid

On Tue, 12 Apr 2011 16:05:52 +0200 Thomas Jarosch
<thomas.jarosch@intra2net.com> wrote:

> Hello Neil,
>
> On Wednesday, 6. April 2011 12:16:00 Tejun Heo wrote:
> > > To put it another way matching your description Tejun, the put path has
> > > a chance to run firstly while mddev_find is waiting for the spinlock,
> > > and then while flush_workqueue is waiting for the rest of the put path
> > > to complete.
> >
> > I don't think the logic is wrong per-se. It's more likely that the
> > implemented code doesn't really follow the model described by the
> > logic.
> >
> > Probably the best way would be reproducing the problem and throwing in
> > some diagnostic code to tell the sequence of events? If work is being
> > queued first but it still ends up busy looping, that would be a bug in
> > flush_workqueue(), but I think it's more likely that the restart
> > condition somehow triggers in an unexpected way without the work item
> > queued as expected.
>
> I can test any debug patch you want, the box is in a test lab anyway.
>
> Best regards,
> Thomas

Could you try this?

diff --git a/drivers/md/md.c b/drivers/md/md.c
index a0ccaab..07c97b1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6175,6 +6175,8 @@ static int md_open(struct block_device *bdev, fmode_t mode)
 	mddev_t *mddev = mddev_find(bdev->bd_dev);
 	int err;

+	BUG_ON(!mddev->gendisk);
+
 	if (mddev->gendisk != bdev->bd_disk) {
 		/* we are racing with mddev_put which is discarding this
 		 * bd_disk.

I don't know how it could get to the state where gendisk was NULL, but
it is the only way I can see that the looping could happen.

If the BUG_ON does trigger I'll probably be able to find out why it
happens. If it doesn't then I'll still be at a loss.

NeilBrown
[parent not found: <201104261051.09464.thomas.jarosch@intra2net.com>]
* Re: raid1 boot regression in 2.6.37 [bisected]
  [not found] ` <201104261051.09464.thomas.jarosch@intra2net.com>
@ 2011-04-27  8:17     ` NeilBrown
  2011-04-27 10:05       ` NeilBrown
  0 siblings, 1 reply; 22+ messages in thread
From: NeilBrown @ 2011-04-27 8:17 UTC (permalink / raw)
  To: Thomas Jarosch; +Cc: Tejun Heo, linux-raid

On Tue, 26 Apr 2011 10:51:09 +0200 Thomas Jarosch
<thomas.jarosch@intra2net.com> wrote:

> Hello Neil,
>
> On Wednesday, 13. April 2011 00:44:08 NeilBrown wrote:
> > Could you try this?
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index a0ccaab..07c97b1 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -6175,6 +6175,8 @@ static int md_open(struct block_device *bdev, fmode_t mode)
> >  	mddev_t *mddev = mddev_find(bdev->bd_dev);
> >  	int err;
> >  
> > +	BUG_ON(!mddev->gendisk);
> > +
> >  	if (mddev->gendisk != bdev->bd_disk) {
> >  		/* we are racing with mddev_put which is discarding this
> >  		 * bd_disk.
> >
> >
> > I don't know how it could get to the state where gendisk was NULL, but
> > it is the only way I can see that the looping could happen.
> >
> > If the BUG_ON does trigger I'll probably be able to find out why it
> > happens.  If it doesn't then I'll still be at a loss.
>
> Sorry for the late reply, I somehow missed your message.
>
> Your intuition was right, the BUG_ON is triggered.
> Attached you'll find a screenshot of the call trace.
>
> Best regards,
> Thomas

Thanks.

I managed to reproduce something very similar and I think I know what is
happening.  It appears to be fixed by this change to drivers/md/md.c

@@ -4340,8 +4344,8 @@ static int md_alloc(dev_t dev, char *name)
 	 * remove it now.
 	 */
 	disk->flags |= GENHD_FL_EXT_DEVT;
-	add_disk(disk);
 	mddev->gendisk = disk;
+	add_disk(disk);
 	error = kobject_init_and_add(&mddev->kobj, &md_ktype,
 				     &disk_to_dev(disk)->kobj, "%s", "md");
 	if (error) {

However I need to think through the sequence of events in the morning and
make sure it all makes sense and there isn't some other race hiding in
there.

Thanks again for the report.

NeilBrown

^ permalink raw reply [flat|nested] 22+ messages in thread
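The window this reordering closes can be pictured as the following interleaving. This is an inferred reconstruction based on the patch above and the triggered BUG_ON, not something spelled out in the message itself:

/* CPU A: md_alloc()                     CPU B: a concurrent opener (blkid)
 *
 * add_disk(disk);
 *   disk becomes visible to             __blkdev_get()
 *   get_gendisk() and a uevent            -> get_gendisk() finds the disk
 *   is emitted                            -> md_open()
 *                                            mddev->gendisk is still NULL,
 *                                            so it != bdev->bd_disk
 *                                            -> flush_workqueue(md_misc_wq)
 *                                               (nothing queued, returns)
 *                                            -> -ERESTARTSYS, retry
 * mddev->gendisk = disk;
 *
 * Until CPU A performs the assignment, every retry on CPU B takes the
 * mismatch branch.  With the assignment moved before add_disk(), an opener
 * can never see a registered disk whose mddev->gendisk is still NULL.
 */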
* Re: raid1 boot regression in 2.6.37 [bisected]
  2011-04-27  8:17 ` NeilBrown
@ 2011-04-27 10:05   ` NeilBrown
  [not found]             ` <201104271700.58894.thomas.jarosch@intra2net.com>
  0 siblings, 1 reply; 22+ messages in thread
From: NeilBrown @ 2011-04-27 10:05 UTC (permalink / raw)
  To: Thomas Jarosch; +Cc: Tejun Heo, linux-raid

On Wed, 27 Apr 2011 18:17:14 +1000 NeilBrown <neilb@suse.de> wrote:

> I managed to reproduce something very similar and I think I know what is
> happening.  It appears to be fixed by this change to drivers/md/md.c
>
> @@ -4340,8 +4344,8 @@ static int md_alloc(dev_t dev, char *name)
>  	 * remove it now.
>  	 */
>  	disk->flags |= GENHD_FL_EXT_DEVT;
> -	add_disk(disk);
>  	mddev->gendisk = disk;
> +	add_disk(disk);
>  	error = kobject_init_and_add(&mddev->kobj, &md_ktype,
>  				     &disk_to_dev(disk)->kobj, "%s", "md");
>  	if (error) {
>
>
> However I need to think through the sequence of events in the morning and
> make sure it all makes sense and there isn't some other race hiding in
> there.
>
> Thanks again for the report.

On reflection over dinner, I think I should move the add_disk right to the
end just before the mutex_unlock.
I'm not sure it makes a big practical difference but it will be easier to
justify.

So if you could test this one instead please that would be great.

The rationale is that as soon as we call add_disk, get_gendisk will return
the disk rather than coming through to md_probe and blocking on the mutex.
So we want everything set up properly before calling add_disk.

Thanks,
NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 818313e..6ab5bfb 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4340,7 +4340,6 @@ static int md_alloc(dev_t dev, char *name)
 	 * remove it now.
 	 */
 	disk->flags |= GENHD_FL_EXT_DEVT;
-	add_disk(disk);
 	mddev->gendisk = disk;
 	error = kobject_init_and_add(&mddev->kobj, &md_ktype,
 				     &disk_to_dev(disk)->kobj, "%s", "md");
@@ -4357,6 +4356,7 @@ static int md_alloc(dev_t dev, char *name)
 			printk(KERN_DEBUG "pointless warning\n");
 
 	blk_queue_flush(mddev->queue, REQ_FLUSH | REQ_FUA);
+	add_disk(disk);
  abort:
 	mutex_unlock(&disks_mutex);
 	if (!error && mddev->kobj.sd) {

^ permalink raw reply related [flat|nested] 22+ messages in thread
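The two lookup paths being contrasted in the rationale can be sketched as follows. The md_probe() shape is simplified from the 2.6.37-era md.c and should be read as an assumption rather than a quote:

/* Before add_disk(): no gendisk is registered for the requested dev_t, so
 * the block layer falls back to the probe hook md registers with
 * blk_register_region(), and that path serializes behind disks_mutex. */
static struct kobject *md_probe(dev_t dev, int *part, void *data)
{
	md_alloc(dev, NULL);	/* takes disks_mutex, performs the setup */
	return NULL;
}

/* After add_disk(): get_gendisk() returns the disk immediately and the open
 * goes straight to md_open() with no serialization against md_alloc(), so
 * every field md_open() relies on must already be initialized by the time
 * add_disk() is called. */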
[parent not found: <201104271700.58894.thomas.jarosch@intra2net.com>]
* Re: raid1 boot regression in 2.6.37 [bisected]
  [not found] ` <201104271700.58894.thomas.jarosch@intra2net.com>
@ 2011-04-28  1:23     ` NeilBrown
  2011-04-28 13:47       ` Thomas Jarosch
  0 siblings, 1 reply; 22+ messages in thread
From: NeilBrown @ 2011-04-28 1:23 UTC (permalink / raw)
  To: Thomas Jarosch; +Cc: Tejun Heo, linux-raid

On Wed, 27 Apr 2011 17:00:58 +0200 Thomas Jarosch
<thomas.jarosch@intra2net.com> wrote:

> On Wednesday, 27. April 2011 12:05:13 NeilBrown wrote:
> > > Thanks again for the report.
> >
> > On reflection over dinner, I think I should move the add_disk right to
> > the end just before the mutex_unlock.
> > I'm not sure it makes a big practical difference but it will be easier to
> > justify.
> >
> > So if you could test this one instead please that would be great.
> >
> > The rationale is that as soon as we call add_disk, get_gendisk will
> > return the disk rather than coming through to md_probe and blocking on
> > the mutex.  So we want everything set up properly before calling
> > add_disk.
>
> Unfortunately the kernel oopses now. Attached you'll find screenshots
> of the oops. I can bring back the serial console if needed.
>
> I've double verified with "git diff" that I only have your
> latest change in the tree.
>
> Hope that helps,
> Thomas

Yes it helps, thanks.

I do need to add_disk before I kobject_init_and_add the mddev object,
otherwise it will try to create something in sysfs in a non-existent
directory.

So the sysfs stuff needs to come last.

Could you try this one please?

Thanks,
NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 818313e..9f9fa54 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4335,13 +4335,15 @@ static int md_alloc(dev_t dev, char *name)
 	disk->fops = &md_fops;
 	disk->private_data = mddev;
 	disk->queue = mddev->queue;
+	blk_queue_flush(mddev->queue, REQ_FLUSH | REQ_FUA);
 	/* Allow extended partitions.  This makes the
 	 * 'mdp' device redundant, but we can't really
 	 * remove it now.
 	 */
 	disk->flags |= GENHD_FL_EXT_DEVT;
-	add_disk(disk);
 	mddev->gendisk = disk;
+	add_disk(disk);
+
 	error = kobject_init_and_add(&mddev->kobj, &md_ktype,
 				     &disk_to_dev(disk)->kobj, "%s", "md");
 	if (error) {
@@ -4355,8 +4357,6 @@ static int md_alloc(dev_t dev, char *name)
 		if (mddev->kobj.sd &&
 		    sysfs_create_group(&mddev->kobj, &md_bitmap_group))
 			printk(KERN_DEBUG "pointless warning\n");
-
-	blk_queue_flush(mddev->queue, REQ_FLUSH | REQ_FUA);
  abort:
 	mutex_unlock(&disks_mutex);
 	if (!error && mddev->kobj.sd) {

^ permalink raw reply related [flat|nested] 22+ messages in thread
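Summarizing the initialization order that the final patch above establishes in md_alloc() (a reading of the patch, not text from the message):

/* 1. blk_queue_flush()       - the request queue is fully set up first
 * 2. mddev->gendisk = disk   - md_open() can trust gendisk once the disk
 *                              becomes visible
 * 3. add_disk(disk)          - registers the disk, creates its sysfs
 *                              directory and emits the uevent
 * 4. kobject_init_and_add()  - adds the "md" kobject underneath the
 *                              directory that add_disk() just created,
 *                              which is why it has to come after add_disk()
 */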
* Re: raid1 boot regression in 2.6.37 [bisected]
  2011-04-28  1:23 ` NeilBrown
@ 2011-04-28 13:47   ` Thomas Jarosch
  2011-05-02 12:17     ` Thomas Jarosch
  0 siblings, 1 reply; 22+ messages in thread
From: Thomas Jarosch @ 2011-04-28 13:47 UTC (permalink / raw)
  To: NeilBrown; +Cc: Tejun Heo, linux-raid

On Thursday, 28. April 2011 03:23:28 NeilBrown wrote:
> Yes it helps, thanks.
>
> I do need to add_disk before I kobject_init_and_add the mddev object,
> otherwise it will try to create something in sysfs in a non-existent
> directory.
>
> So the sysfs stuff needs to come last.
>
> Could you try this one please?

Works like a charm! I've put the box in a reboot loop again,
let's see if it survives until tomorrow :)

Cheers,
Thomas

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: raid1 boot regression in 2.6.37 [bisected]
  2011-04-28 13:47 ` Thomas Jarosch
@ 2011-05-02 12:17   ` Thomas Jarosch
  0 siblings, 0 replies; 22+ messages in thread
From: Thomas Jarosch @ 2011-05-02 12:17 UTC (permalink / raw)
  To: NeilBrown; +Cc: Tejun Heo, linux-raid

Neil,

On Thursday, 28. April 2011 15:47:40 Thomas Jarosch wrote:
> > Could you try this one please?
>
> Works like a charm! I've put the box in a reboot loop again,
> let's see if it survives until tomorrow :)

Patch is working fine, the box survived rebooting for several days.

Thanks,
Thomas

^ permalink raw reply [flat|nested] 22+ messages in thread
* raid1 boot regression in 2.6.37 [bisected]
@ 2011-03-25 18:55 Thomas Jarosch
  0 siblings, 0 replies; 22+ messages in thread
From: Thomas Jarosch @ 2011-03-25 18:55 UTC (permalink / raw)
  To: linux-raid; +Cc: Tejun Heo

Hello,

I've just updated from kernel 2.6.34.7 to kernel 2.6.37.5 and one
HP Proliant DL320 G3 box with a raid1 software RAID stopped booting.
(also two other non-HP boxes).

We run this script at boot time via dracut:
----------------------------------
#!/bin/sh
. /lib/dracut-lib.sh

info "Telling kernel to auto-detect RAID arrays"
/sbin/initqueue --settled --name kerneldetectraid /sbin/mdadm --auto-detect
----------------------------------

With the "bad" commit in place, the kernel doesn't output
any md message at all. I've bisected it down to this commit:

e804ac780e2f01cb3b914daca2fd4780d1743db1 is the first bad commit
commit e804ac780e2f01cb3b914daca2fd4780d1743db1
Author: Tejun Heo <tj@kernel.org>
Date:   Fri Oct 15 15:36:08 2010 +0200

    md: fix and update workqueue usage

    Workqueue usage in md has two problems.

    * Flush can be used during or depended upon by memory reclaim, but md
      uses the system workqueue for flush_work which may lead to deadlock.

    * md depends on flush_scheduled_work() to achieve exclusion against
      completion of removal of previous instances.  flush_scheduled_work()
      may incur unexpected amount of delay and is scheduled to be removed.

    This patch adds two workqueues to md - md_wq and md_misc_wq.  The
    former is guaranteed to make forward progress under memory pressure
    and serves flush_work.  The latter serves as the flush domain for
    other works.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: NeilBrown <neilb@suse.de>

:040000 040000 f6b6a34a71864263ed253866c5f8abe7f766ac6b
dc2eff4a91825142b7c88cf54751fc7acdf1a6d2 M	drivers

I manually verified that the commit before it
(57dab0bdf689d42972975ec646d862b0900a4bf3) works
and the "bad" commit prevents the box from booting.


Some more info:

# mdadm --version
mdadm - v2.6.9 - 10th March 2009

# mdadm --detail /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Wed May 27 17:52:40 2009
     Raid Level : raid1
     Array Size : 2562240 (2.44 GiB 2.62 GB)
  Used Dev Size : 2562240 (2.44 GiB 2.62 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Fri Mar 25 17:11:33 2011
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 0ee8da2c:5803478b:e399b924:6520c535
         Events : 0.160

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1

Any idea what might go wrong? Maybe building a kernel
with lock debugging on Monday might help.

I think I also tried kernel 2.6.38, though I'll verify on Monday, too.

Have a nice weekend,
Thomas

PS: Sorry Tejun for the HTML crap in my first mail.

^ permalink raw reply [flat|nested] 22+ messages in thread
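As background on the commit message quoted in the bisection result: the "forward progress under memory pressure" property comes from creating the workqueue with the WQ_MEM_RECLAIM flag, which keeps a rescuer thread available for it. The sketch below shows the general shape of such an initialization; it is illustrative and is not claimed to match the commit line for line.

static struct workqueue_struct *md_wq;
static struct workqueue_struct *md_misc_wq;

static int __init md_init(void)
{
	/* md_wq serves the flush work and must keep making progress even
	 * while memory reclaim is waiting on it, hence WQ_MEM_RECLAIM. */
	md_wq = alloc_workqueue("md", WQ_MEM_RECLAIM, 0);
	if (!md_wq)
		return -ENOMEM;

	/* md_misc_wq is an ordinary workqueue that doubles as the flush
	 * domain for md's other work items (such as the delayed delete
	 * of an mddev). */
	md_misc_wq = alloc_workqueue("md_misc", 0, 0);
	if (!md_misc_wq) {
		destroy_workqueue(md_wq);
		return -ENOMEM;
	}

	/* ... the rest of md's module initialization ... */
	return 0;
}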
end of thread, other threads:[~2011-05-02 12:17 UTC | newest]

Thread overview: 22+ messages:
     [not found] <201103251725.21180.thomas.jarosch@intra2net.com>
2011-03-28  7:59 ` raid1 boot regression in 2.6.37 [bisected] Tejun Heo
2011-03-28 11:02 ` Thomas Jarosch
2011-03-28 12:53 ` Thomas Jarosch
2011-03-28 15:59 ` Tejun Heo
2011-03-28 19:46 ` Thomas Jarosch
2011-03-28 19:59 ` Roberto Spadim
2011-03-29 12:06 ` Thomas Jarosch
2011-03-29 12:22 ` Roberto Spadim
2011-03-29  8:25 ` Tejun Heo
2011-03-29  9:53 ` Thomas Jarosch
2011-03-29 10:07 ` Tejun Heo
2011-03-29 11:52 ` Thomas Jarosch
2011-04-05  3:46 ` NeilBrown
2011-04-06 10:16 ` Tejun Heo
2011-04-12 14:05 ` Thomas Jarosch
2011-04-12 22:44 ` NeilBrown
     [not found] ` <201104261051.09464.thomas.jarosch@intra2net.com>
2011-04-27  8:17 ` NeilBrown
2011-04-27 10:05 ` NeilBrown
     [not found] ` <201104271700.58894.thomas.jarosch@intra2net.com>
2011-04-28  1:23 ` NeilBrown
2011-04-28 13:47 ` Thomas Jarosch
2011-05-02 12:17 ` Thomas Jarosch
2011-03-25 18:55 Thomas Jarosch
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).