linux-raid.vger.kernel.org archive mirror
* BUG  - raid 1 deadlock on handle_read_error / wait_barrier
@ 2013-02-21 22:58 Tregaron Bayly
  2013-02-22  3:44 ` Joe Lawrence
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Tregaron Bayly @ 2013-02-21 22:58 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb

Symptom:
A RAID 1 array ends up with two threads (flush and raid1) stuck in D
state forever.  The array is inaccessible and the host must be restarted
to restore access to the array.

I have some scripted workloads that reproduce this within a maximum of a
couple hours on kernels from 3.6.11 - 3.8-rc7.  I cannot reproduce on
3.4.32.  3.5.7 ends up with three threads stuck in D state, but the
stacks are different from this bug (as it's EOL maybe of interest in
bisecting the problem?).

Details:

[flush-9:16]
[<ffffffffa009f1a4>] wait_barrier+0x124/0x180 [raid1]
[<ffffffffa00a2a15>] make_request+0x85/0xd50 [raid1]
[<ffffffff813653c3>] md_make_request+0xd3/0x200
[<ffffffff811f494a>] generic_make_request+0xca/0x100
[<ffffffff811f49f9>] submit_bio+0x79/0x160
[<ffffffff811808f8>] submit_bh+0x128/0x200
[<ffffffff81182fe0>] __block_write_full_page+0x1d0/0x330
[<ffffffff8118320e>] block_write_full_page_endio+0xce/0x100
[<ffffffff81183255>] block_write_full_page+0x15/0x20
[<ffffffff81187908>] blkdev_writepage+0x18/0x20
[<ffffffff810f73b7>] __writepage+0x17/0x40
[<ffffffff810f8543>] write_cache_pages+0x1d3/0x4c0
[<ffffffff810f8881>] generic_writepages+0x51/0x80
[<ffffffff810f88d0>] do_writepages+0x20/0x40
[<ffffffff811782bb>] __writeback_single_inode+0x3b/0x160
[<ffffffff8117a8a9>] writeback_sb_inodes+0x1e9/0x430
[<ffffffff8117ab8e>] __writeback_inodes_wb+0x9e/0xd0
[<ffffffff8117ae9b>] wb_writeback+0x24b/0x2e0
[<ffffffff8117b171>] wb_do_writeback+0x241/0x250
[<ffffffff8117b222>] bdi_writeback_thread+0xa2/0x250
[<ffffffff8106414e>] kthread+0xce/0xe0
[<ffffffff81488a6c>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

[md16-raid1]
[<ffffffffa009ffb9>] handle_read_error+0x119/0x790 [raid1]
[<ffffffffa00a0862>] raid1d+0x232/0x1060 [raid1]
[<ffffffff813675a7>] md_thread+0x117/0x150
[<ffffffff8106414e>] kthread+0xce/0xe0
[<ffffffff81488a6c>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

Both processes end up in wait_event_lock_irq() waiting for favorable
conditions in the struct r1conf to proceed.  These conditions obviously
seem to never arrive.  I placed printk statements in freeze_array() and
wait_barrier() directly before calling their respective
wait_event_lock_irq() and this is an example output:

Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Attempting to freeze array: barrier (1), nr_waiting (1), nr_pending (5), nr_queued (3)
Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (2), nr_pending (5), nr_queued (3)
Feb 20 17:47:38 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (3), nr_pending (5), nr_queued (3)

The flush seems to come after the attempt to handle the read error.  I
believe the incrementing nr_waiting comes from the multiple read
processes I have going as they get stuck behind this deadlock.
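
For reference, the instrumentation is essentially just the two printk()s shown
below (a rough sketch against the 3.8-era drivers/md/raid1.c; the array
identifier prefix from the log lines is replaced with mdname() here and the
wait conditions are abbreviated):

static void wait_barrier(struct r1conf *conf)
{
	spin_lock_irq(&conf->resync_lock);
	if (conf->barrier) {
		conf->nr_waiting++;
		printk(KERN_INFO "%s: Awaiting barrier: barrier (%d), "
		       "nr_waiting (%d), nr_pending (%d), nr_queued (%d)\n",
		       mdname(conf->mddev), conf->barrier, conf->nr_waiting,
		       conf->nr_pending, conf->nr_queued);
		wait_event_lock_irq(conf->wait_barrier, !conf->barrier,
				    conf->resync_lock);
		conf->nr_waiting--;
	}
	conf->nr_pending++;
	spin_unlock_irq(&conf->resync_lock);
}

static void freeze_array(struct r1conf *conf)
{
	spin_lock_irq(&conf->resync_lock);
	conf->barrier++;
	conf->nr_waiting++;
	printk(KERN_INFO "%s: Attempting to freeze array: barrier (%d), "
	       "nr_waiting (%d), nr_pending (%d), nr_queued (%d)\n",
	       mdname(conf->mddev), conf->barrier, conf->nr_waiting,
	       conf->nr_pending, conf->nr_queued);
	wait_event_lock_irq_cmd(conf->wait_barrier,
				conf->nr_pending == conf->nr_queued + 1,
				conf->resync_lock,
				flush_pending_writes(conf));
	spin_unlock_irq(&conf->resync_lock);
}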

majianpeng may have been referring to this condition on the linux-raid
list a few days ago (http://www.spinics.net/lists/raid/msg42150.html)
when he stated "Above code is what's you said.  But it met read-error,
raid1d will blocked for ever.  The reason is freeze_array"

Environment:
A RAID 1 array built on top of two multipath maps
Devices underlying the maps are iscsi disks exported from a linux SAN
Multipath is configured to queue i/o if no path for 5 retries (polling
interval 5) before failing and causing the mirror to degrade.

Reproducible on kernels 3.6.11 and up (x86_64)
  
To reproduce, I create 10 raid 1 arrays on a client built on mpio over
iscsi.  For each of the arrays, I start the following to create some
writes:

NODE=array_5; while :; do TIME=$((RANDOM % 10)); size=$((RANDOM * 6));
sleep $TIME; dd if=/dev/zero of=/dev/md/$NODE bs=1024 count=$size; done

I then start 10 processes each doing this to create some reads:

while :; do SAN=$((RANDOM % 10)); dd if=/dev/md/array_$SAN
of=/dev/null bs=1024 count=50000; done

In a screen I start a script running in a loop that
monitors /proc/mdstat for failed arrays and does a fail/remove/re-add on
the failed disk.  (This is mainly so that I get more than one crack at
reproducing the bug since the timing never happens just right on one
try.)

Now on one of the iscsi servers I start a loop that sleeps a random
amount of time between 1-5 minutes then stops the export, sleeps again
and then restores the export.

The net effect of all this is that the disks on the initiator will queue
i/o for a while when the export is off and eventually fail.  In most
cases this happens gracefully and i/o will go to the remaining disk.  In
the failure case we get the behavior described here.  The test is set up
to restore the arrays and try again until the bug is hit.

We do see this situation in production with relative frequency so it's
not something that happens only in theory under artificial conditions.

Any help or insight here is much appreciated.

Tregaron Bayly
Systems Administrator
Bluehost, Inc.




* Re: BUG  - raid 1 deadlock on handle_read_error / wait_barrier
  2013-02-21 22:58 BUG - raid 1 deadlock on handle_read_error / wait_barrier Tregaron Bayly
@ 2013-02-22  3:44 ` Joe Lawrence
  2013-02-22 11:52 ` majianpeng
  2013-02-24 22:43 ` NeilBrown
  2 siblings, 0 replies; 20+ messages in thread
From: Joe Lawrence @ 2013-02-22  3:44 UTC (permalink / raw)
  To: Tregaron Bayly; +Cc: linux-raid, neilb

On Thu, 21 Feb 2013, Tregaron Bayly wrote:

> Symptom:
> A RAID 1 array ends up with two threads (flush and raid1) stuck in D
> state forever.  The array is inaccessible and the host must be restarted
> to restore access to the array.
> 
> I have some scripted workloads that reproduce this within a maximum of a
> couple hours on kernels from 3.6.11 - 3.8-rc7.  I cannot reproduce on
> 3.4.32.  3.5.7 ends up with three threads stuck in D state, but the
> stacks are different from this bug (as it's EOL maybe of interest in
> bisecting the problem?).
> 
> Details:
> 
> [flush-9:16]
> [<ffffffffa009f1a4>] wait_barrier+0x124/0x180 [raid1]
> [<ffffffffa00a2a15>] make_request+0x85/0xd50 [raid1]
> [<ffffffff813653c3>] md_make_request+0xd3/0x200
> [<ffffffff811f494a>] generic_make_request+0xca/0x100
> [<ffffffff811f49f9>] submit_bio+0x79/0x160
> [<ffffffff811808f8>] submit_bh+0x128/0x200
> [<ffffffff81182fe0>] __block_write_full_page+0x1d0/0x330
> [<ffffffff8118320e>] block_write_full_page_endio+0xce/0x100
> [<ffffffff81183255>] block_write_full_page+0x15/0x20
> [<ffffffff81187908>] blkdev_writepage+0x18/0x20
> [<ffffffff810f73b7>] __writepage+0x17/0x40
> [<ffffffff810f8543>] write_cache_pages+0x1d3/0x4c0
> [<ffffffff810f8881>] generic_writepages+0x51/0x80
> [<ffffffff810f88d0>] do_writepages+0x20/0x40
> [<ffffffff811782bb>] __writeback_single_inode+0x3b/0x160
> [<ffffffff8117a8a9>] writeback_sb_inodes+0x1e9/0x430
> [<ffffffff8117ab8e>] __writeback_inodes_wb+0x9e/0xd0
> [<ffffffff8117ae9b>] wb_writeback+0x24b/0x2e0
> [<ffffffff8117b171>] wb_do_writeback+0x241/0x250
> [<ffffffff8117b222>] bdi_writeback_thread+0xa2/0x250
> [<ffffffff8106414e>] kthread+0xce/0xe0
> [<ffffffff81488a6c>] ret_from_fork+0x7c/0xb0
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> [md16-raid1]
> [<ffffffffa009ffb9>] handle_read_error+0x119/0x790 [raid1]
> [<ffffffffa00a0862>] raid1d+0x232/0x1060 [raid1]
> [<ffffffff813675a7>] md_thread+0x117/0x150
> [<ffffffff8106414e>] kthread+0xce/0xe0
> [<ffffffff81488a6c>] ret_from_fork+0x7c/0xb0
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> Both processes end up in wait_event_lock_irq() waiting for favorable
> conditions in the struct r1conf to proceed.  These conditions obviously
> seem to never arrive.  I placed printk statements in freeze_array() and
> wait_barrier() directly before calling their respective
> wait_event_lock_irq() and this is an example output:
> 
> Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Attempting to freeze array: barrier (1), nr_waiting (1), nr_pending (5), nr_queued (3)
> Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (2), nr_pending (5), nr_queued (3)
> Feb 20 17:47:38 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (3), nr_pending (5), nr_queued (3)
> 
> The flush seems to come after the attempt to handle the read error.  I
> believe the incrementing nr_waiting comes from the multiple read
> processes I have going as they get stuck behind this deadlock.
> 
> majienpeng may have been referring to this condition on the linux-raid
> list a few days ago (http://www.spinics.net/lists/raid/msg42150.html)
> when he stated "Above code is what's you said.  But it met read-error,
> raid1d will blocked for ever.  The reason is freeze_array"
> 
> Environment:
> A RAID 1 array built on top of two multipath maps
> Devices underlying the maps are iscsi disks exported from a linux SAN
> Multipath is configured to queue i/o if no path for 5 retries (polling
> interval 5) before failing and causing the mirror to degrade.
> 
> Reproducible on kernels 3.6.11 and up (x86_64)
>   
> To reproduce, I create 10 raid 1 arrays on a client built on mpio over
> iscsi.  For each of the arrays, I start the following to create some
> writes:
> 
> $NODE=array_5; while :; do let TIME=($RANDOM % 10); let size=$RANDOM*6;
> sleep $TIME; dd if=/dev/zero of=/dev/md/$NODE bs=1024 count=$size; done
> 
> I then start 10 processes each doing this to create some reads:
> 
> while :; do let SAN=($RANDOM % 10); dd if=/dev/md/array_$SAN
> of=/dev/null bs=1024 count=50000; done
> 
> In a screen I start a script running in a loop that
> monitors /proc/mdstat for failed arrays and does a fail/remove/re-add on
> the failed disk.  (This is mainly so that I get more than one crack at
> reproducing the bug since the timing never happens just right on one
> try.)
> 
> Now on one of the iscsi servers I start a loop that sleeps a random
> amount of time between 1-5 minutes then stops the export, sleeps again
> and then restores the export.
> 
> The net effect of all this is that the disks on the initiator will queue
> i/o for a while when the export is off and eventually fail.  In most
> cases this happens gracefully and i/o will go to the remaining disk.  In
> the failure case we get the behavior described here.  The test is setup
> to restore the arrays and try again until the bug is hit.
> 
> We do see this situation in production with relative frequency so it's
> not something that happens only in theory under artificial conditions.
> 
> Any help or insight here is much appreciated.
> 
> Tregaron Bayly
> Systems Administrator
> Bluehost, Inc.

Hi Tregaron,

I've been seeing a similar looking hang over the past few days.  In my 
testing, I'm running 3.7.0-rcX with direct IO via fio out to an MD RAID1 
array during surprise disk removals.  (We script fail/remove/add when the 
controller is hotplugged.)  This week I have seen three instances of the 
hang, usually after ~5 hours of testing, but once I added debugging 
printk's it went away :(  

For reference, my traces look like:

INFO: task md127_raid1:500 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md127_raid1     D ffff88107fd13d40     0   500      2 0x00000000
 ffff880850643bc8 0000000000000046 ffff88085056c9b0 ffff880850643fd8
 ffff880850643fd8 ffff880850643fd8 ffff880854683120 ffff88085056c9b0
 ffff880850643bf8 ffff8810528d88e8 ffff880850643fd8 ffff8810543c6b10
Call Trace:
 [<ffffffff8165f4a9>] schedule+0x29/0x70
 [<ffffffffa0161e8a>] handle_read_error+0xea/0x7b0 [raid1]
 [<ffffffff81092653>] ? __wake_up+0x53/0x70
 [<ffffffff814dcd3b>] ? mddev_unlock+0x8b/0xf0
 [<ffffffff81085eb0>] ? wake_up_bit+0x40/0x40
 [<ffffffffa0162b39>] raid1d+0x219/0x1020 [raid1]
 [<ffffffff8107114a>] ? del_timer_sync+0x5a/0x70
 [<ffffffff8165d74a>] ? schedule_timeout+0x17a/0x310
 [<ffffffff8108607b>] ? prepare_to_wait+0x5b/0x90
 [<ffffffff814dbced>] md_thread+0x10d/0x140
 [<ffffffff81085eb0>] ? wake_up_bit+0x40/0x40
 [<ffffffff814dbbe0>] ? md_rdev_init+0x140/0x140
 [<ffffffff810854d0>] kthread+0xc0/0xd0
 [<ffffffff81010000>] ? ftrace_define_fields_xen_mmu_flush_tlb+0x10/0x80
 [<ffffffff81085410>] ? flush_kthread_worker+0x80/0x80
 [<ffffffff8166906c>] ret_from_fork+0x7c/0xb0
 [<ffffffff81085410>] ? flush_kthread_worker+0x80/0x80

(handle_read_error+0xea puts me at the wait_event_lock_irq_cmd in 
freeze_array.)

INFO: task jbd2/md127-8:547 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/md127-8    D ffff88107fdd3d40     0   547      2 0x00000000
 ffff88085060f9c8 0000000000000046 ffff88084d768000 ffff88085060ffd8
 ffff88085060ffd8 ffff88085060ffd8 ffff8808546b0000 ffff88084d768000
 ffffffff8113c1a5 ffff8810543c6aa8 ffff8810543c6b10 ffff88085060ffd8
Call Trace:
 [<ffffffff8113c1a5>] ? mempool_alloc_slab+0x15/0x20
 [<ffffffff8165f4a9>] schedule+0x29/0x70
 [<ffffffffa01610f9>] wait_barrier+0xc9/0x170 [raid1]
 [<ffffffff81085eb0>] ? wake_up_bit+0x40/0x40
 [<ffffffffa0163bdb>] make_request+0x7b/0xce0 [raid1]
 [<ffffffff8113c1a5>] ? mempool_alloc_slab+0x15/0x20
 [<ffffffff814db837>] md_make_request+0x107/0x260
 [<ffffffff8113c1a5>] ? mempool_alloc_slab+0x15/0x20
 [<ffffffff812e6672>] generic_make_request+0xc2/0x110
 [<ffffffff812e6a09>] submit_bio+0x79/0x160
 [<ffffffff811dbd35>] ? bio_alloc_bioset+0x65/0x120
 [<ffffffff811d6954>] submit_bh+0x114/0x1e0
 [<ffffffff812794af>] jbd2_journal_commit_transaction+0x75f/0x1c40
 [<ffffffff8127f441>] kjournald2+0xd1/0x2a0
 [<ffffffff81085eb0>] ? wake_up_bit+0x40/0x40
 [<ffffffff8127f370>] ? jbd2_journal_release_jbd_inode+0x140/0x140
 [<ffffffff810854d0>] kthread+0xc0/0xd0
 [<ffffffff81010000>] ? ftrace_define_fields_xen_mmu_flush_tlb+0x10/0x80
 [<ffffffff81085410>] ? flush_kthread_worker+0x80/0x80
 [<ffffffff8166906c>] ret_from_fork+0x7c/0xb0
 [<ffffffff81085410>] ? flush_kthread_worker+0x80/0x80

INFO: task flush-9:127:580 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-9:127     D ffff88107fc13d40     0   580      2 0x00000000
 ffff880851b1b5c8 0000000000000046 ffff880850568000 ffff880851b1bfd8
 ffff880851b1bfd8 ffff880851b1bfd8 ffff880854666240 ffff880850568000
 00000000000000bb ffff8810543c6aa8 ffff8810543c6b10 ffff880851b1bfd8
Call Trace:
 [<ffffffff8165f4a9>] schedule+0x29/0x70
 [<ffffffffa01610f9>] wait_barrier+0xc9/0x170 [raid1]
 [<ffffffff81085eb0>] ? wake_up_bit+0x40/0x40
 [<ffffffffa0163bdb>] make_request+0x7b/0xce0 [raid1]
 [<ffffffff811dbc1a>] ? bvec_alloc_bs+0x6a/0x120
 [<ffffffff814db837>] md_make_request+0x107/0x260
 [<ffffffff811db4f4>] ? __bio_add_page.part.13+0xf4/0x220
 [<ffffffff812e6672>] generic_make_request+0xc2/0x110
 [<ffffffff812e6a09>] submit_bio+0x79/0x160
 [<ffffffff812343b9>] ext4_io_submit+0x29/0x60
 [<ffffffff8122f73f>] mpage_da_submit_io+0x31f/0x5c0
 [<ffffffff812325a5>] ? write_cache_pages_da+0x175/0x4f0
 [<ffffffff81276907>] ? jbd2__journal_start+0x87/0x110
 [<ffffffff8123204a>] mpage_da_map_and_submit+0x11a/0x420
 [<ffffffff81232c77>] ? ext4_da_writepages+0x357/0x660
 [<ffffffff812769a3>] ? jbd2_journal_start+0x13/0x20
 [<ffffffff81232cc8>] ext4_da_writepages+0x3a8/0x660
 [<ffffffff811458de>] do_writepages+0x1e/0x40
 [<ffffffff811cdbeb>] __writeback_single_inode+0x3b/0x190
 [<ffffffff811ceb1e>] writeback_sb_inodes+0x1ee/0x4e0
 [<ffffffff811ceeaf>] __writeback_inodes_wb+0x9f/0xd0
 [<ffffffff811d061b>] wb_writeback+0x2fb/0x320
 [<ffffffff811d0acf>] wb_do_writeback+0x22f/0x240
 [<ffffffff811d0b73>] bdi_writeback_thread+0x93/0x280
 [<ffffffff811d0ae0>] ? wb_do_writeback+0x240/0x240
 [<ffffffff810854d0>] kthread+0xc0/0xd0
 [<ffffffff81010000>] ? ftrace_define_fields_xen_mmu_flush_tlb+0x10/0x80
 [<ffffffff81085410>] ? flush_kthread_worker+0x80/0x80
 [<ffffffff8166906c>] ret_from_fork+0x7c/0xb0
 [<ffffffff81085410>] ? flush_kthread_worker+0x80/0x80

And a few stats before the machine completely wedged:

[root@dhcp-linux-2103-650 Breaker]# cat /proc/mdstat

Personalities : [raid1] 
md124 : active raid1 sdq3[4]
      4192192 blocks super 1.2 [2/1] [_U]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md125 : active raid1 sdq2[0]
      16762752 blocks super 1.2 [2/1] [U_]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md126 : active raid1 sdq1[1]
      16779136 blocks super 1.0 [2/1] [_U]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md127 : active raid1 sdq5[0] sds5[1](F)
      255080256 blocks super 1.2 [2/1] [U_]
      bitmap: 1/2 pages [4KB], 65536KB chunk

unused devices: <none>

[root@dhcp-linux-2103-650 Breaker]# mdadm --detail /dev/md127
/dev/md127:
        Version : 1.2
  Creation Time : Thu Jan 17 16:56:09 2013
     Raid Level : raid1
     Array Size : 255080256 (243.26 GiB 261.20 GB)
  Used Dev Size : 255080256 (243.26 GiB 261.20 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Feb 12 21:35:02 2013
          State : active, degraded 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           Name : dhcp-Linux-2103-627:root
           UUID : b5e4bddd:e76802b0:3b6badd6:e85ebc27
         Events : 128267

    Number   Major   Minor   RaidDevice State
       0      65        5        0      active sync   /dev/sdq5
       1      65       37        1      faulty

I'll try kicking off the test again tonight (sans debug printk's) and see 
if I can get a kdump out of the hang.

Regards,

-- Joe


* Re: BUG  - raid 1 deadlock on handle_read_error / wait_barrier
  2013-02-21 22:58 BUG - raid 1 deadlock on handle_read_error / wait_barrier Tregaron Bayly
  2013-02-22  3:44 ` Joe Lawrence
@ 2013-02-22 11:52 ` majianpeng
  2013-02-22 16:03   ` Tregaron Bayly
  2013-02-24 22:43 ` NeilBrown
  2 siblings, 1 reply; 20+ messages in thread
From: majianpeng @ 2013-02-22 11:52 UTC (permalink / raw)
  To: tbayly; +Cc: linux-raid, neilb

On 02/22/2013 06:58 AM, Tregaron Bayly wrote:
> Symptom:
> A RAID 1 array ends up with two threads (flush and raid1) stuck in D
> state forever.  The array is inaccessible and the host must be restarted
> to restore access to the array.
>
> I have some scripted workloads that reproduce this within a maximum of a
> couple hours on kernels from 3.6.11 - 3.8-rc7.  I cannot reproduce on
> 3.4.32.  3.5.7 ends up with three threads stuck in D state, but the
> stacks are different from this bug (as it's EOL maybe of interest in
> bisecting the problem?).
>
> Details:
>
> [flush-9:16]
> [<ffffffffa009f1a4>] wait_barrier+0x124/0x180 [raid1]
> [<ffffffffa00a2a15>] make_request+0x85/0xd50 [raid1]
> [<ffffffff813653c3>] md_make_request+0xd3/0x200
> [<ffffffff811f494a>] generic_make_request+0xca/0x100
> [<ffffffff811f49f9>] submit_bio+0x79/0x160
> [<ffffffff811808f8>] submit_bh+0x128/0x200
> [<ffffffff81182fe0>] __block_write_full_page+0x1d0/0x330
> [<ffffffff8118320e>] block_write_full_page_endio+0xce/0x100
> [<ffffffff81183255>] block_write_full_page+0x15/0x20
> [<ffffffff81187908>] blkdev_writepage+0x18/0x20
> [<ffffffff810f73b7>] __writepage+0x17/0x40
> [<ffffffff810f8543>] write_cache_pages+0x1d3/0x4c0
> [<ffffffff810f8881>] generic_writepages+0x51/0x80
> [<ffffffff810f88d0>] do_writepages+0x20/0x40
> [<ffffffff811782bb>] __writeback_single_inode+0x3b/0x160
> [<ffffffff8117a8a9>] writeback_sb_inodes+0x1e9/0x430
> [<ffffffff8117ab8e>] __writeback_inodes_wb+0x9e/0xd0
> [<ffffffff8117ae9b>] wb_writeback+0x24b/0x2e0
> [<ffffffff8117b171>] wb_do_writeback+0x241/0x250
> [<ffffffff8117b222>] bdi_writeback_thread+0xa2/0x250
> [<ffffffff8106414e>] kthread+0xce/0xe0
> [<ffffffff81488a6c>] ret_from_fork+0x7c/0xb0
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> [md16-raid1]
> [<ffffffffa009ffb9>] handle_read_error+0x119/0x790 [raid1]
> [<ffffffffa00a0862>] raid1d+0x232/0x1060 [raid1]
> [<ffffffff813675a7>] md_thread+0x117/0x150
> [<ffffffff8106414e>] kthread+0xce/0xe0
> [<ffffffff81488a6c>] ret_from_fork+0x7c/0xb0
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> Both processes end up in wait_event_lock_irq() waiting for favorable
> conditions in the struct r1conf to proceed.  These conditions obviously
> seem to never arrive.  I placed printk statements in freeze_array() and
> wait_barrier() directly before calling their respective
> wait_event_lock_irq() and this is an example output:
>
> Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Attempting to freeze array: barrier (1), nr_waiting (1), nr_pending (5), nr_queued (3)
> Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (2), nr_pending (5), nr_queued (3)
> Feb 20 17:47:38 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (3), nr_pending (5), nr_queued (3)
From those messages, there is one request that should either complete or meet an error.
If it completed, ->nr_pending would decrease by one.
If it met an error, it would be added to ->retry_list and ->nr_queued would increase by one.
In either of those two conditions the hang should not persist.
What's the state of this request?  Maybe this bug is caused by the lower-layer drivers.
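
For reference, the error path I mean is reschedule_retry(), which in the
3.8-era raid1.c is roughly:

static void reschedule_retry(struct r1bio *r1_bio)
{
	unsigned long flags;
	struct mddev *mddev = r1_bio->mddev;
	struct r1conf *conf = mddev->private;

	spin_lock_irqsave(&conf->device_lock, flags);
	/* queue the failed r1_bio for raid1d and account for it */
	list_add(&r1_bio->retry_list, &conf->retry_list);
	conf->nr_queued++;
	spin_unlock_irqrestore(&conf->device_lock, flags);

	wake_up(&conf->wait_barrier);
	md_wakeup_thread(mddev->thread);
}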

> The flush seems to come after the attempt to handle the read error.  I
> believe the incrementing nr_waiting comes from the multiple read
> processes I have going as they get stuck behind this deadlock.
>
> majienpeng may have been referring to this condition on the linux-raid
> list a few days ago (http://www.spinics.net/lists/raid/msg42150.html)
> when he stated "Above code is what's you said.  But it met read-error,
> raid1d will blocked for ever.  The reason is freeze_array"
>
> Environment:
> A RAID 1 array built on top of two multipath maps
> Devices underlying the maps are iscsi disks exported from a linux SAN
> Multipath is configured to queue i/o if no path for 5 retries (polling
> interval 5) before failing and causing the mirror to degrade.
>
> Reproducible on kernels 3.6.11 and up (x86_64)
>   
> To reproduce, I create 10 raid 1 arrays on a client built on mpio over
> iscsi.  For each of the arrays, I start the following to create some
> writes:
>
> $NODE=array_5; while :; do let TIME=($RANDOM % 10); let size=$RANDOM*6;
> sleep $TIME; dd if=/dev/zero of=/dev/md/$NODE bs=1024 count=$size; done
>
> I then start 10 processes each doing this to create some reads:
>
> while :; do let SAN=($RANDOM % 10); dd if=/dev/md/array_$SAN
> of=/dev/null bs=1024 count=50000; done
>
> In a screen I start a script running in a loop that
> monitors /proc/mdstat for failed arrays and does a fail/remove/re-add on
> the failed disk.  (This is mainly so that I get more than one crack at
> reproducing the bug since the timing never happens just right on one
> try.)
>
> Now on one of the iscsi servers I start a loop that sleeps a random
> amount of time between 1-5 minutes then stops the export, sleeps again
> and then restores the export.
>
> The net effect of all this is that the disks on the initiator will queue
> i/o for a while when the export is off and eventually fail.  In most
> cases this happens gracefully and i/o will go to the remaining disk.  In
> the failure case we get the behavior described here.  The test is setup
> to restore the arrays and try again until the bug is hit.
>
> We do see this situation in production with relative frequency so it's
> not something that happens only in theory under artificial conditions.
>
> Any help or insight here is much appreciated.
>
> Tregaron Bayly
> Systems Administrator
> Bluehost, Inc.
>
>



* Re: BUG  - raid 1 deadlock on handle_read_error / wait_barrier
  2013-02-22 11:52 ` majianpeng
@ 2013-02-22 16:03   ` Tregaron Bayly
  2013-02-22 18:14     ` Joe Lawrence
  0 siblings, 1 reply; 20+ messages in thread
From: Tregaron Bayly @ 2013-02-22 16:03 UTC (permalink / raw)
  To: majianpeng; +Cc: linux-raid, neilb

On Fri, 2013-02-22 at 19:52 +0800, majianpeng wrote:

> > Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Attempting to freeze array: barrier (1), nr_waiting (1), nr_pending (5), nr_queued (3)
> > Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (2), nr_pending (5), nr_queued (3)
> > Feb 20 17:47:38 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (3), nr_pending (5), nr_queued (3)
> From those messages, there is one request that should either complete or meet an error.
> If it completed, ->nr_pending would decrease by one.
> If it met an error, it would be added to ->retry_list and ->nr_queued would increase by one.
> In either of those two conditions the hang should not persist.
> What's the state of this request?  Maybe this bug is caused by the lower-layer drivers.

That's a good question.  Do you have any suggestion as to how I might
answer it?  As I say, I can readily reproduce it so if you have some
insight into how to capture that info I can do it with no problem.

Also, I don't know for sure but it didn't sound like Joe Lawrence is
using the same lower layer drivers but he has the same problem.

Tregaron



* Re: BUG  - raid 1 deadlock on handle_read_error / wait_barrier
  2013-02-22 16:03   ` Tregaron Bayly
@ 2013-02-22 18:14     ` Joe Lawrence
  0 siblings, 0 replies; 20+ messages in thread
From: Joe Lawrence @ 2013-02-22 18:14 UTC (permalink / raw)
  To: Tregaron Bayly; +Cc: majianpeng, linux-raid, neilb

On Fri, 22 Feb 2013, Tregaron Bayly wrote:
> Also, I don't know for sure but it didn't sound like Joe Lawrence is
> using the same lower layer drivers but he has the same problem.

I'm running a slightly modified mpt2sas driver under 3.8.0-rcX (not 3.7 as
I mentioned earlier).

I did capture a kdump vmcore on the freeze_array hang, but haven't had 
time to look at it yet.

Regards,

-- Joe


* Re: BUG  - raid 1 deadlock on handle_read_error / wait_barrier
  2013-02-21 22:58 BUG - raid 1 deadlock on handle_read_error / wait_barrier Tregaron Bayly
  2013-02-22  3:44 ` Joe Lawrence
  2013-02-22 11:52 ` majianpeng
@ 2013-02-24 22:43 ` NeilBrown
  2013-02-25  0:04   ` NeilBrown
  2 siblings, 1 reply; 20+ messages in thread
From: NeilBrown @ 2013-02-24 22:43 UTC (permalink / raw)
  To: tbayly; +Cc: linux-raid

On Thu, 21 Feb 2013 15:58:24 -0700 Tregaron Bayly <tbayly@bluehost.com> wrote:

> Symptom:
> A RAID 1 array ends up with two threads (flush and raid1) stuck in D
> state forever.  The array is inaccessible and the host must be restarted
> to restore access to the array.
> 
> I have some scripted workloads that reproduce this within a maximum of a
> couple hours on kernels from 3.6.11 - 3.8-rc7.  I cannot reproduce on
> 3.4.32.  3.5.7 ends up with three threads stuck in D state, but the
> stacks are different from this bug (as it's EOL maybe of interest in
> bisecting the problem?).

Can you post the 3 stacks from the 3.5.7 case?  It might help get a more
complete understanding.

...
> Both processes end up in wait_event_lock_irq() waiting for favorable
> conditions in the struct r1conf to proceed.  These conditions obviously
> seem to never arrive.  I placed printk statements in freeze_array() and
> wait_barrier() directly before calling their respective
> wait_event_lock_irq() and this is an example output:
> 
> Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Attempting to freeze array: barrier (1), nr_waiting (1), nr_pending (5), nr_queued (3)
> Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (2), nr_pending (5), nr_queued (3)
> Feb 20 17:47:38 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (3), nr_pending (5), nr_queued (3)

This is very useful, thanks.  Clearly there is one 'pending' request that
isn't being counted, but also isn't being allowed to complete.
Maybe it is in pending_bio_list, and so counted in conf->pending_count.

Could you print out that value as well and try to trigger the bug again?  If
conf->pending_count is non-zero, then it seems very likely that we have found
the problem.
Fixing it isn't quite so easy.  'nr_pending' counts requests from the
filesystem that are still pending.  'pending_count' counts requests down to
the underlying device that are still pending.  There isn't a 1-to-1
correspondence, so we cannot just subtract one from the other.  It will
require more thought than that.

Thanks for the thorough report,

NeilBrown


* Re: BUG  - raid 1 deadlock on handle_read_error / wait_barrier
  2013-02-24 22:43 ` NeilBrown
@ 2013-02-25  0:04   ` NeilBrown
  2013-02-25 16:11     ` Tregaron Bayly
  0 siblings, 1 reply; 20+ messages in thread
From: NeilBrown @ 2013-02-25  0:04 UTC (permalink / raw)
  To: NeilBrown; +Cc: tbayly, linux-raid

On Mon, 25 Feb 2013 09:43:50 +1100 NeilBrown <neilb@suse.de> wrote:

> On Thu, 21 Feb 2013 15:58:24 -0700 Tregaron Bayly <tbayly@bluehost.com> wrote:
> 
> > Symptom:
> > A RAID 1 array ends up with two threads (flush and raid1) stuck in D
> > state forever.  The array is inaccessible and the host must be restarted
> > to restore access to the array.
> > 
> > I have some scripted workloads that reproduce this within a maximum of a
> > couple hours on kernels from 3.6.11 - 3.8-rc7.  I cannot reproduce on
> > 3.4.32.  3.5.7 ends up with three threads stuck in D state, but the
> > stacks are different from this bug (as it's EOL maybe of interest in
> > bisecting the problem?).
> 
> Can you post the 3 stacks from the 3.5.7 case?  It might help get a more
> complete understanding.
> 
> ...
> > Both processes end up in wait_event_lock_irq() waiting for favorable
> > conditions in the struct r1conf to proceed.  These conditions obviously
> > seem to never arrive.  I placed printk statements in freeze_array() and
> > wait_barrier() directly before calling their respective
> > wait_event_lock_irq() and this is an example output:
> > 
> > Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Attempting to freeze array: barrier (1), nr_waiting (1), nr_pending (5), nr_queued (3)
> > Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (2), nr_pending (5), nr_queued (3)
> > Feb 20 17:47:38 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (3), nr_pending (5), nr_queued (3)
> 
> This is very useful, thanks.  Clearly there is one 'pending' request that
> isn't being counted, but also isn't being allowed to complete.
> Maybe it is in pending_bio_list, and so counted in conf->pending_count.
> 
> Could you print out that value as well and try to trigger the bug again?  If
> conf->pending_count is non-zero, then it seems very likely that we have found
> the problem.

Actually  don't bother.  I think I've found the problem.  It is related to
pending_count and is easy to fix.
Could you try this patch please?

Thanks.
NeilBrown

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 6e5d5a5..fd86b37 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -967,6 +967,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
 		bio_list_merge(&conf->pending_bio_list, &plug->pending);
 		conf->pending_count += plug->pending_cnt;
 		spin_unlock_irq(&conf->device_lock);
+		wake_up(&conf->wait_barrier);
 		md_wakeup_thread(mddev->thread);
 		kfree(plug);
 		return;
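
For context: the writes that raid1_unplug() moves onto conf->pending_bio_list
are already counted in nr_pending (their submitters went through
wait_barrier()), while freeze_array() is sleeping on conf->wait_barrier with,
roughly, this wait:

	wait_event_lock_irq_cmd(conf->wait_barrier,
				conf->nr_pending == conf->nr_queued + 1,
				conf->resync_lock,
				flush_pending_writes(conf));

flush_pending_writes() is only re-run when that waitqueue is woken, and
md_wakeup_thread() doesn't help because raid1d is the very task blocked there,
so without the extra wake_up() the newly queued writes are never flushed and
nr_pending never drains.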


* Re: BUG  - raid 1 deadlock on handle_read_error / wait_barrier
  2013-02-25  0:04   ` NeilBrown
@ 2013-02-25 16:11     ` Tregaron Bayly
  2013-02-25 22:54       ` NeilBrown
  2013-02-26 14:09       ` Joe Lawrence
  0 siblings, 2 replies; 20+ messages in thread
From: Tregaron Bayly @ 2013-02-25 16:11 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

> Actually  don't bother.  I think I've found the problem.  It is related to
> pending_count and is easy to fix.
> Could you try this patch please?
> 
> Thanks.
> NeilBrown
> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 6e5d5a5..fd86b37 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -967,6 +967,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
>  		bio_list_merge(&conf->pending_bio_list, &plug->pending);
>  		conf->pending_count += plug->pending_cnt;
>  		spin_unlock_irq(&conf->device_lock);
> +		wake_up(&conf->wait_barrier);
>  		md_wakeup_thread(mddev->thread);
>  		kfree(plug);
>  		return;

Running 15 hours now and no sign of the problem, which is 12 hours
longer than it took to trigger the bug in the past.  I'll continue
testing to be sure but I think this patch is a fix.

Thanks for the fast response!

Tregaron Bayly



* Re: BUG  - raid 1 deadlock on handle_read_error / wait_barrier
  2013-02-25 16:11     ` Tregaron Bayly
@ 2013-02-25 22:54       ` NeilBrown
  2013-02-26 14:09       ` Joe Lawrence
  1 sibling, 0 replies; 20+ messages in thread
From: NeilBrown @ 2013-02-25 22:54 UTC (permalink / raw)
  To: tbayly; +Cc: linux-raid

On Mon, 25 Feb 2013 09:11:02 -0700 Tregaron Bayly <tbayly@bluehost.com> wrote:

> > Actually  don't bother.  I think I've found the problem.  It is related to
> > pending_count and is easy to fix.
> > Could you try this patch please?
> > 
> > Thanks.
> > NeilBrown
> > 
> > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> > index 6e5d5a5..fd86b37 100644
> > --- a/drivers/md/raid1.c
> > +++ b/drivers/md/raid1.c
> > @@ -967,6 +967,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
> >  		bio_list_merge(&conf->pending_bio_list, &plug->pending);
> >  		conf->pending_count += plug->pending_cnt;
> >  		spin_unlock_irq(&conf->device_lock);
> > +		wake_up(&conf->wait_barrier);
> >  		md_wakeup_thread(mddev->thread);
> >  		kfree(plug);
> >  		return;
> 
> Running 15 hours now and no sign of the problem, which is 12 hours
> longer than it took to trigger the bug in the past.  I'll continue
> testing to be sure but I think this patch is a fix.

Thanks.  I've added a "Tested-by:" from you to the patch header.

Should go to Linus, and thence to -stable in a day or 2.

NeilBrown

> 
> Thanks for the fast response!
> 
> Tregaron Bayly



* Re: BUG  - raid 1 deadlock on handle_read_error / wait_barrier
  2013-02-25 16:11     ` Tregaron Bayly
  2013-02-25 22:54       ` NeilBrown
@ 2013-02-26 14:09       ` Joe Lawrence
  2013-05-16 14:07         ` Alexander Lyakas
  1 sibling, 1 reply; 20+ messages in thread
From: Joe Lawrence @ 2013-02-26 14:09 UTC (permalink / raw)
  To: Tregaron Bayly; +Cc: NeilBrown, linux-raid

Same here:  after ~15 hrs, ~300 surprise device removals and fio stress, 
no hung tasks to report.

-- Joe

On Mon, 25 Feb 2013, Tregaron Bayly wrote:

> > Actually  don't bother.  I think I've found the problem.  It is related to
> > pending_count and is easy to fix.
> > Could you try this patch please?
> > 
> > Thanks.
> > NeilBrown
> > 
> > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> > index 6e5d5a5..fd86b37 100644
> > --- a/drivers/md/raid1.c
> > +++ b/drivers/md/raid1.c
> > @@ -967,6 +967,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
> >  		bio_list_merge(&conf->pending_bio_list, &plug->pending);
> >  		conf->pending_count += plug->pending_cnt;
> >  		spin_unlock_irq(&conf->device_lock);
> > +		wake_up(&conf->wait_barrier);
> >  		md_wakeup_thread(mddev->thread);
> >  		kfree(plug);
> >  		return;
> 
> Running 15 hours now and no sign of the problem, which is 12 hours
> longer than it took to trigger the bug in the past.  I'll continue
> testing to be sure but I think this patch is a fix.
> 
> Thanks for the fast response!
> 
> Tregaron Bayly
> 


* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier
  2013-02-26 14:09       ` Joe Lawrence
@ 2013-05-16 14:07         ` Alexander Lyakas
  2013-05-20  7:17           ` NeilBrown
  0 siblings, 1 reply; 20+ messages in thread
From: Alexander Lyakas @ 2013-05-16 14:07 UTC (permalink / raw)
  To: NeilBrown; +Cc: Tregaron Bayly, linux-raid, Shyam Kaushik

Hello Neil,
we are hitting an issue that looks very similar; we are on kernel 3.8.2.
The array is a 2-device raid1 which experienced a drive failure, but
then the drive was removed and re-added to the array (although the
rebuild never started). Relevant parts of the kernel log:

May 16 11:12:14  kernel: [46850.090499] md/raid1:md2: Disk failure on
dm-6, disabling device.
May 16 11:12:14  kernel: [46850.090499] md/raid1:md2: Operation
continuing on 1 devices.
May 16 11:12:14  kernel: [46850.090511] md: super_written gets
error=-5, uptodate=0
May 16 11:12:14  kernel: [46850.090676] md/raid1:md2: dm-6:
rescheduling sector 18040736
May 16 11:12:14  kernel: [46850.090764] md/raid1:md2: dm-6:
rescheduling sector 20367040
May 16 11:12:14  kernel: [46850.090826] md/raid1:md2: dm-6:
rescheduling sector 17613504
May 16 11:12:14  kernel: [46850.090883] md/raid1:md2: dm-6:
rescheduling sector 18042720
May 16 11:12:15  kernel: [46850.229970] md/raid1:md2: redirecting
sector 18040736 to other mirror: dm-13
May 16 11:12:15  kernel: [46850.257687] md/raid1:md2: redirecting
sector 20367040 to other mirror: dm-13
May 16 11:12:15  kernel: [46850.268731] md/raid1:md2: redirecting
sector 17613504 to other mirror: dm-13
May 16 11:12:15  kernel: [46850.398242] md/raid1:md2: redirecting
sector 18042720 to other mirror: dm-13
May 16 11:12:23  kernel: [46858.448465] md: unbind<dm-6>
May 16 11:12:23  kernel: [46858.456081] md: export_rdev(dm-6)
May 16 11:23:19  kernel: [47515.062547] md: bind<dm-6>

May 16 11:24:28  kernel: [47583.920126] INFO: task md2_raid1:15749
blocked for more than 60 seconds.
May 16 11:24:28  kernel: [47583.921829] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 16 11:24:28  kernel: [47583.923361] md2_raid1       D
0000000000000001     0 15749      2 0x00000000
May 16 11:24:28  kernel: [47583.923367]  ffff880097c23c28
0000000000000046 ffff880000000002 00000000982c43b8
May 16 11:24:28  kernel: [47583.923372]  ffff880097c23fd8
ffff880097c23fd8 ffff880097c23fd8 0000000000013f40
May 16 11:24:28  kernel: [47583.923376]  ffff880119b11740
ffff8800a5489740 ffff880097c23c38 ffff88011609d3c0
May 16 11:24:28  kernel: [47583.923381] Call Trace:
May 16 11:24:28  kernel: [47583.923395]  [<ffffffff816eb399>] schedule+0x29/0x70
May 16 11:24:28  kernel: [47583.923402]  [<ffffffffa0516736>]
raise_barrier+0x106/0x160 [raid1]
May 16 11:24:28  kernel: [47583.923410]  [<ffffffff8107fcc0>] ?
add_wait_queue+0x60/0x60
May 16 11:24:28  kernel: [47583.923415]  [<ffffffffa0516af7>]
raid1_add_disk+0x197/0x200 [raid1]
May 16 11:24:28  kernel: [47583.923421]  [<ffffffff81567fa4>]
remove_and_add_spares+0x104/0x220
May 16 11:24:28  kernel: [47583.923426]  [<ffffffff8156a02d>]
md_check_recovery.part.49+0x40d/0x530
May 16 11:24:28  kernel: [47583.923429]  [<ffffffff8156a165>]
md_check_recovery+0x15/0x20
May 16 11:24:28  kernel: [47583.923433]  [<ffffffffa0517e42>]
raid1d+0x22/0x180 [raid1]
May 16 11:24:28  kernel: [47583.923439]  [<ffffffff81045cd9>] ?
default_spin_lock_flags+0x9/0x10
May 16 11:24:28  kernel: [47583.923443]  [<ffffffff81045cd9>] ?
default_spin_lock_flags+0x9/0x10
May 16 11:24:28  kernel: [47583.923449]  [<ffffffff815624ed>]
md_thread+0x10d/0x140
May 16 11:24:28  kernel: [47583.923453]  [<ffffffff8107fcc0>] ?
add_wait_queue+0x60/0x60
May 16 11:24:28  kernel: [47583.923457]  [<ffffffff815623e0>] ?
md_rdev_init+0x140/0x140
May 16 11:24:28  kernel: [47583.923461]  [<ffffffff8107f0d0>] kthread+0xc0/0xd0
May 16 11:24:28  kernel: [47583.923465]  [<ffffffff8107f010>] ?
flush_kthread_worker+0xb0/0xb0
May 16 11:24:28  kernel: [47583.923472]  [<ffffffff816f506c>]
ret_from_fork+0x7c/0xb0
May 16 11:24:28  kernel: [47583.923476]  [<ffffffff8107f010>] ?
flush_kthread_worker+0xb0/0xb0

dm-13 is the good drive, dm-6 is the one that failed.

At this point, we have several threads calling submit_bio and all
stuck like this:

cat /proc/16218/stack
[<ffffffffa0516d6e>] wait_barrier+0xbe/0x160 [raid1]
[<ffffffffa0518627>] make_request+0x87/0xa90 [raid1]
[<ffffffff81561ed0>] md_make_request+0xd0/0x200
[<ffffffff8132bcaa>] generic_make_request+0xca/0x100
[<ffffffff8132bd5b>] submit_bio+0x7b/0x160
...

And md raid1 thread stuck like this:

cat /proc/15749/stack
[<ffffffffa0516736>] raise_barrier+0x106/0x160 [raid1]
[<ffffffffa0516af7>] raid1_add_disk+0x197/0x200 [raid1]
[<ffffffff81567fa4>] remove_and_add_spares+0x104/0x220
[<ffffffff8156a02d>] md_check_recovery.part.49+0x40d/0x530
[<ffffffff8156a165>] md_check_recovery+0x15/0x20
[<ffffffffa0517e42>] raid1d+0x22/0x180 [raid1]
[<ffffffff815624ed>] md_thread+0x10d/0x140
[<ffffffff8107f0d0>] kthread+0xc0/0xd0
[<ffffffff816f506c>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

We have also two user-space threads stuck:

one is trying to read /sys/block/md2/md/array_state and its kernel stack is:
# cat /proc/2251/stack
[<ffffffff81564602>] md_attr_show+0x72/0xf0
[<ffffffff8120f116>] fill_read_buffer.isra.8+0x66/0xf0
[<ffffffff8120f244>] sysfs_read_file+0xa4/0xc0
[<ffffffff8119b0d0>] vfs_read+0xb0/0x180
[<ffffffff8119b1f2>] sys_read+0x52/0xa0
[<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f
[<ffffffffffffffff>] 0xffffffffffffffff

the other wants to read from /proc/mdstat and is:
[<ffffffff81563d2b>] md_seq_show+0x4b/0x540
[<ffffffff811bdd1b>] seq_read+0x16b/0x400
[<ffffffff811ff572>] proc_reg_read+0x82/0xc0
[<ffffffff8119b0d0>] vfs_read+0xb0/0x180
[<ffffffff8119b1f2>] sys_read+0x52/0xa0
[<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f
[<ffffffffffffffff>] 0xffffffffffffffff

mdadm --detail also gets stuck if attempted, in stack like this:
cat /proc/2864/stack
[<ffffffff81564602>] md_attr_show+0x72/0xf0
[<ffffffff8120f116>] fill_read_buffer.isra.8+0x66/0xf0
[<ffffffff8120f244>] sysfs_read_file+0xa4/0xc0
[<ffffffff8119b0d0>] vfs_read+0xb0/0x180
[<ffffffff8119b1f2>] sys_read+0x52/0xa0
[<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f
[<ffffffffffffffff>] 0xffffffffffffffff

Might your patch https://patchwork.kernel.org/patch/2260051/ fix this
issue? Is this patch alone applicable to kernel 3.8.2?
Can you pls kindly comment on this.

Thanks for your help,
Alex.




On Tue, Feb 26, 2013 at 4:09 PM, Joe Lawrence <Joe.Lawrence@stratus.com> wrote:
>
> Same here:  after ~15 hrs, ~300 surprise device removals and fio stress,
> no hung tasks to report.
>
> -- Joe
>
> On Mon, 25 Feb 2013, Tregaron Bayly wrote:
>
> > > Actually  don't bother.  I think I've found the problem.  It is related to
> > > pending_count and is easy to fix.
> > > Could you try this patch please?
> > >
> > > Thanks.
> > > NeilBrown
> > >
> > > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> > > index 6e5d5a5..fd86b37 100644
> > > --- a/drivers/md/raid1.c
> > > +++ b/drivers/md/raid1.c
> > > @@ -967,6 +967,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
> > >             bio_list_merge(&conf->pending_bio_list, &plug->pending);
> > >             conf->pending_count += plug->pending_cnt;
> > >             spin_unlock_irq(&conf->device_lock);
> > > +           wake_up(&conf->wait_barrier);
> > >             md_wakeup_thread(mddev->thread);
> > >             kfree(plug);
> > >             return;
> >
> > Running 15 hours now and no sign of the problem, which is 12 hours
> > longer than it took to trigger the bug in the past.  I'll continue
> > testing to be sure but I think this patch is a fix.
> >
> > Thanks for the fast response!
> >
> > Tregaron Bayly
> >


* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier
  2013-05-16 14:07         ` Alexander Lyakas
@ 2013-05-20  7:17           ` NeilBrown
  2013-05-30 14:30             ` Alexander Lyakas
  0 siblings, 1 reply; 20+ messages in thread
From: NeilBrown @ 2013-05-20  7:17 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: Tregaron Bayly, linux-raid, Shyam Kaushik

On Thu, 16 May 2013 17:07:04 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:

> Hello Neil,
> we are hitting issue that looks very similar; we are on kernel 3.8.2.
> The array is a 2-device raid1, which experienced a drive failure, but
> then drive was removed and re-added back to the array (although
> rebuild never started). Relevant parts of the kernel log:
> 
> May 16 11:12:14  kernel: [46850.090499] md/raid1:md2: Disk failure on
> dm-6, disabling device.
> May 16 11:12:14  kernel: [46850.090499] md/raid1:md2: Operation
> continuing on 1 devices.
> May 16 11:12:14  kernel: [46850.090511] md: super_written gets
> error=-5, uptodate=0
> May 16 11:12:14  kernel: [46850.090676] md/raid1:md2: dm-6:
> rescheduling sector 18040736
> May 16 11:12:14  kernel: [46850.090764] md/raid1:md2: dm-6:
> rescheduling sector 20367040
> May 16 11:12:14  kernel: [46850.090826] md/raid1:md2: dm-6:
> rescheduling sector 17613504
> May 16 11:12:14  kernel: [46850.090883] md/raid1:md2: dm-6:
> rescheduling sector 18042720
> May 16 11:12:15  kernel: [46850.229970] md/raid1:md2: redirecting
> sector 18040736 to other mirror: dm-13
> May 16 11:12:15  kernel: [46850.257687] md/raid1:md2: redirecting
> sector 20367040 to other mirror: dm-13
> May 16 11:12:15  kernel: [46850.268731] md/raid1:md2: redirecting
> sector 17613504 to other mirror: dm-13
> May 16 11:12:15  kernel: [46850.398242] md/raid1:md2: redirecting
> sector 18042720 to other mirror: dm-13
> May 16 11:12:23  kernel: [46858.448465] md: unbind<dm-6>
> May 16 11:12:23  kernel: [46858.456081] md: export_rdev(dm-6)
> May 16 11:23:19  kernel: [47515.062547] md: bind<dm-6>
> 
> May 16 11:24:28  kernel: [47583.920126] INFO: task md2_raid1:15749
> blocked for more than 60 seconds.
> May 16 11:24:28  kernel: [47583.921829] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> May 16 11:24:28  kernel: [47583.923361] md2_raid1       D
> 0000000000000001     0 15749      2 0x00000000
> May 16 11:24:28  kernel: [47583.923367]  ffff880097c23c28
> 0000000000000046 ffff880000000002 00000000982c43b8
> May 16 11:24:28  kernel: [47583.923372]  ffff880097c23fd8
> ffff880097c23fd8 ffff880097c23fd8 0000000000013f40
> May 16 11:24:28  kernel: [47583.923376]  ffff880119b11740
> ffff8800a5489740 ffff880097c23c38 ffff88011609d3c0
> May 16 11:24:28  kernel: [47583.923381] Call Trace:
> May 16 11:24:28  kernel: [47583.923395]  [<ffffffff816eb399>] schedule+0x29/0x70
> May 16 11:24:28  kernel: [47583.923402]  [<ffffffffa0516736>]
> raise_barrier+0x106/0x160 [raid1]
> May 16 11:24:28  kernel: [47583.923410]  [<ffffffff8107fcc0>] ?
> add_wait_queue+0x60/0x60
> May 16 11:24:28  kernel: [47583.923415]  [<ffffffffa0516af7>]
> raid1_add_disk+0x197/0x200 [raid1]
> May 16 11:24:28  kernel: [47583.923421]  [<ffffffff81567fa4>]
> remove_and_add_spares+0x104/0x220
> May 16 11:24:28  kernel: [47583.923426]  [<ffffffff8156a02d>]
> md_check_recovery.part.49+0x40d/0x530
> May 16 11:24:28  kernel: [47583.923429]  [<ffffffff8156a165>]
> md_check_recovery+0x15/0x20
> May 16 11:24:28  kernel: [47583.923433]  [<ffffffffa0517e42>]
> raid1d+0x22/0x180 [raid1]
> May 16 11:24:28  kernel: [47583.923439]  [<ffffffff81045cd9>] ?
> default_spin_lock_flags+0x9/0x10
> May 16 11:24:28  kernel: [47583.923443]  [<ffffffff81045cd9>] ?
> default_spin_lock_flags+0x9/0x10
> May 16 11:24:28  kernel: [47583.923449]  [<ffffffff815624ed>]
> md_thread+0x10d/0x140
> May 16 11:24:28  kernel: [47583.923453]  [<ffffffff8107fcc0>] ?
> add_wait_queue+0x60/0x60
> May 16 11:24:28  kernel: [47583.923457]  [<ffffffff815623e0>] ?
> md_rdev_init+0x140/0x140
> May 16 11:24:28  kernel: [47583.923461]  [<ffffffff8107f0d0>] kthread+0xc0/0xd0
> May 16 11:24:28  kernel: [47583.923465]  [<ffffffff8107f010>] ?
> flush_kthread_worker+0xb0/0xb0
> May 16 11:24:28  kernel: [47583.923472]  [<ffffffff816f506c>]
> ret_from_fork+0x7c/0xb0
> May 16 11:24:28  kernel: [47583.923476]  [<ffffffff8107f010>] ?
> flush_kthread_worker+0xb0/0xb0
> 
> dm-13 is the good drive, dm-6 is the one that failed.
> 
> At this point, we have several threads calling submit_bio and all
> stuck like this:
> 
> cat /proc/16218/stack
> [<ffffffffa0516d6e>] wait_barrier+0xbe/0x160 [raid1]
> [<ffffffffa0518627>] make_request+0x87/0xa90 [raid1]
> [<ffffffff81561ed0>] md_make_request+0xd0/0x200
> [<ffffffff8132bcaa>] generic_make_request+0xca/0x100
> [<ffffffff8132bd5b>] submit_bio+0x7b/0x160
> ...
> 
> And md raid1 thread stuck like this:
> 
> cat /proc/15749/stack
> [<ffffffffa0516736>] raise_barrier+0x106/0x160 [raid1]
> [<ffffffffa0516af7>] raid1_add_disk+0x197/0x200 [raid1]
> [<ffffffff81567fa4>] remove_and_add_spares+0x104/0x220
> [<ffffffff8156a02d>] md_check_recovery.part.49+0x40d/0x530
> [<ffffffff8156a165>] md_check_recovery+0x15/0x20
> [<ffffffffa0517e42>] raid1d+0x22/0x180 [raid1]
> [<ffffffff815624ed>] md_thread+0x10d/0x140
> [<ffffffff8107f0d0>] kthread+0xc0/0xd0
> [<ffffffff816f506c>] ret_from_fork+0x7c/0xb0
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> We have also two user-space threads stuck:
> 
> one is trying to read /sys/block/md2/md/array_state and its kernel stack is:
> # cat /proc/2251/stack
> [<ffffffff81564602>] md_attr_show+0x72/0xf0
> [<ffffffff8120f116>] fill_read_buffer.isra.8+0x66/0xf0
> [<ffffffff8120f244>] sysfs_read_file+0xa4/0xc0
> [<ffffffff8119b0d0>] vfs_read+0xb0/0x180
> [<ffffffff8119b1f2>] sys_read+0x52/0xa0
> [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> the other wants to read from /proc/mdstat and is:
> [<ffffffff81563d2b>] md_seq_show+0x4b/0x540
> [<ffffffff811bdd1b>] seq_read+0x16b/0x400
> [<ffffffff811ff572>] proc_reg_read+0x82/0xc0
> [<ffffffff8119b0d0>] vfs_read+0xb0/0x180
> [<ffffffff8119b1f2>] sys_read+0x52/0xa0
> [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> mdadm --detail also gets stuck if attempted, in stack like this:
> cat /proc/2864/stack
> [<ffffffff81564602>] md_attr_show+0x72/0xf0
> [<ffffffff8120f116>] fill_read_buffer.isra.8+0x66/0xf0
> [<ffffffff8120f244>] sysfs_read_file+0xa4/0xc0
> [<ffffffff8119b0d0>] vfs_read+0xb0/0x180
> [<ffffffff8119b1f2>] sys_read+0x52/0xa0
> [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> Might your patch https://patchwork.kernel.org/patch/2260051/ fix this

Probably.

> issue? Is this patch alone applicable to kernel 3.8.2?

Probably.

> Can you pls kindly comment on this.
> 

NeilBrown


* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier
  2013-05-20  7:17           ` NeilBrown
@ 2013-05-30 14:30             ` Alexander Lyakas
  2013-06-02 12:43               ` Alexander Lyakas
  0 siblings, 1 reply; 20+ messages in thread
From: Alexander Lyakas @ 2013-05-30 14:30 UTC (permalink / raw)
  To: NeilBrown, linux-raid; +Cc: Tregaron Bayly, Shyam Kaushik

Hi Neil,

On Mon, May 20, 2013 at 10:17 AM, NeilBrown <neilb@suse.de> wrote:
> On Thu, 16 May 2013 17:07:04 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> Hello Neil,
>> we are hitting issue that looks very similar; we are on kernel 3.8.2.
>> The array is a 2-device raid1, which experienced a drive failure, but
>> then drive was removed and re-added back to the array (although
>> rebuild never started). Relevant parts of the kernel log:
>>
>> May 16 11:12:14  kernel: [46850.090499] md/raid1:md2: Disk failure on
>> dm-6, disabling device.
>> May 16 11:12:14  kernel: [46850.090499] md/raid1:md2: Operation
>> continuing on 1 devices.
>> May 16 11:12:14  kernel: [46850.090511] md: super_written gets
>> error=-5, uptodate=0
>> May 16 11:12:14  kernel: [46850.090676] md/raid1:md2: dm-6:
>> rescheduling sector 18040736
>> May 16 11:12:14  kernel: [46850.090764] md/raid1:md2: dm-6:
>> rescheduling sector 20367040
>> May 16 11:12:14  kernel: [46850.090826] md/raid1:md2: dm-6:
>> rescheduling sector 17613504
>> May 16 11:12:14  kernel: [46850.090883] md/raid1:md2: dm-6:
>> rescheduling sector 18042720
>> May 16 11:12:15  kernel: [46850.229970] md/raid1:md2: redirecting
>> sector 18040736 to other mirror: dm-13
>> May 16 11:12:15  kernel: [46850.257687] md/raid1:md2: redirecting
>> sector 20367040 to other mirror: dm-13
>> May 16 11:12:15  kernel: [46850.268731] md/raid1:md2: redirecting
>> sector 17613504 to other mirror: dm-13
>> May 16 11:12:15  kernel: [46850.398242] md/raid1:md2: redirecting
>> sector 18042720 to other mirror: dm-13
>> May 16 11:12:23  kernel: [46858.448465] md: unbind<dm-6>
>> May 16 11:12:23  kernel: [46858.456081] md: export_rdev(dm-6)
>> May 16 11:23:19  kernel: [47515.062547] md: bind<dm-6>
>>
>> May 16 11:24:28  kernel: [47583.920126] INFO: task md2_raid1:15749
>> blocked for more than 60 seconds.
>> May 16 11:24:28  kernel: [47583.921829] "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> May 16 11:24:28  kernel: [47583.923361] md2_raid1       D
>> 0000000000000001     0 15749      2 0x00000000
>> May 16 11:24:28  kernel: [47583.923367]  ffff880097c23c28
>> 0000000000000046 ffff880000000002 00000000982c43b8
>> May 16 11:24:28  kernel: [47583.923372]  ffff880097c23fd8
>> ffff880097c23fd8 ffff880097c23fd8 0000000000013f40
>> May 16 11:24:28  kernel: [47583.923376]  ffff880119b11740
>> ffff8800a5489740 ffff880097c23c38 ffff88011609d3c0
>> May 16 11:24:28  kernel: [47583.923381] Call Trace:
>> May 16 11:24:28  kernel: [47583.923395]  [<ffffffff816eb399>] schedule+0x29/0x70
>> May 16 11:24:28  kernel: [47583.923402]  [<ffffffffa0516736>]
>> raise_barrier+0x106/0x160 [raid1]
>> May 16 11:24:28  kernel: [47583.923410]  [<ffffffff8107fcc0>] ?
>> add_wait_queue+0x60/0x60
>> May 16 11:24:28  kernel: [47583.923415]  [<ffffffffa0516af7>]
>> raid1_add_disk+0x197/0x200 [raid1]
>> May 16 11:24:28  kernel: [47583.923421]  [<ffffffff81567fa4>]
>> remove_and_add_spares+0x104/0x220
>> May 16 11:24:28  kernel: [47583.923426]  [<ffffffff8156a02d>]
>> md_check_recovery.part.49+0x40d/0x530
>> May 16 11:24:28  kernel: [47583.923429]  [<ffffffff8156a165>]
>> md_check_recovery+0x15/0x20
>> May 16 11:24:28  kernel: [47583.923433]  [<ffffffffa0517e42>]
>> raid1d+0x22/0x180 [raid1]
>> May 16 11:24:28  kernel: [47583.923439]  [<ffffffff81045cd9>] ?
>> default_spin_lock_flags+0x9/0x10
>> May 16 11:24:28  kernel: [47583.923443]  [<ffffffff81045cd9>] ?
>> default_spin_lock_flags+0x9/0x10
>> May 16 11:24:28  kernel: [47583.923449]  [<ffffffff815624ed>]
>> md_thread+0x10d/0x140
>> May 16 11:24:28  kernel: [47583.923453]  [<ffffffff8107fcc0>] ?
>> add_wait_queue+0x60/0x60
>> May 16 11:24:28  kernel: [47583.923457]  [<ffffffff815623e0>] ?
>> md_rdev_init+0x140/0x140
>> May 16 11:24:28  kernel: [47583.923461]  [<ffffffff8107f0d0>] kthread+0xc0/0xd0
>> May 16 11:24:28  kernel: [47583.923465]  [<ffffffff8107f010>] ?
>> flush_kthread_worker+0xb0/0xb0
>> May 16 11:24:28  kernel: [47583.923472]  [<ffffffff816f506c>]
>> ret_from_fork+0x7c/0xb0
>> May 16 11:24:28  kernel: [47583.923476]  [<ffffffff8107f010>] ?
>> flush_kthread_worker+0xb0/0xb0
>>
>> dm-13 is the good drive, dm-6 is the one that failed.
>>
>> At this point, we have several threads calling submit_bio and all
>> stuck like this:
>>
>> cat /proc/16218/stack
>> [<ffffffffa0516d6e>] wait_barrier+0xbe/0x160 [raid1]
>> [<ffffffffa0518627>] make_request+0x87/0xa90 [raid1]
>> [<ffffffff81561ed0>] md_make_request+0xd0/0x200
>> [<ffffffff8132bcaa>] generic_make_request+0xca/0x100
>> [<ffffffff8132bd5b>] submit_bio+0x7b/0x160
>> ...
>>
>> And md raid1 thread stuck like this:
>>
>> cat /proc/15749/stack
>> [<ffffffffa0516736>] raise_barrier+0x106/0x160 [raid1]
>> [<ffffffffa0516af7>] raid1_add_disk+0x197/0x200 [raid1]
>> [<ffffffff81567fa4>] remove_and_add_spares+0x104/0x220
>> [<ffffffff8156a02d>] md_check_recovery.part.49+0x40d/0x530
>> [<ffffffff8156a165>] md_check_recovery+0x15/0x20
>> [<ffffffffa0517e42>] raid1d+0x22/0x180 [raid1]
>> [<ffffffff815624ed>] md_thread+0x10d/0x140
>> [<ffffffff8107f0d0>] kthread+0xc0/0xd0
>> [<ffffffff816f506c>] ret_from_fork+0x7c/0xb0
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> We have also two user-space threads stuck:
>>
>> one is trying to read /sys/block/md2/md/array_state and its kernel stack is:
>> # cat /proc/2251/stack
>> [<ffffffff81564602>] md_attr_show+0x72/0xf0
>> [<ffffffff8120f116>] fill_read_buffer.isra.8+0x66/0xf0
>> [<ffffffff8120f244>] sysfs_read_file+0xa4/0xc0
>> [<ffffffff8119b0d0>] vfs_read+0xb0/0x180
>> [<ffffffff8119b1f2>] sys_read+0x52/0xa0
>> [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> the other wants to read from /proc/mdstat and is:
>> [<ffffffff81563d2b>] md_seq_show+0x4b/0x540
>> [<ffffffff811bdd1b>] seq_read+0x16b/0x400
>> [<ffffffff811ff572>] proc_reg_read+0x82/0xc0
>> [<ffffffff8119b0d0>] vfs_read+0xb0/0x180
>> [<ffffffff8119b1f2>] sys_read+0x52/0xa0
>> [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> mdadm --detail also gets stuck if attempted, in stack like this:
>> cat /proc/2864/stack
>> [<ffffffff81564602>] md_attr_show+0x72/0xf0
>> [<ffffffff8120f116>] fill_read_buffer.isra.8+0x66/0xf0
>> [<ffffffff8120f244>] sysfs_read_file+0xa4/0xc0
>> [<ffffffff8119b0d0>] vfs_read+0xb0/0x180
>> [<ffffffff8119b1f2>] sys_read+0x52/0xa0
>> [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> Might your patch https://patchwork.kernel.org/patch/2260051/ fix this
>
> Probably.
>
>> issue? Is this patch alone applicable to kernel 3.8.2?
>
> Probably.

This problem still happens in 3.8.13, which contains your fix above. I
am currently debugging it more deeply, but still don't know what the
issue is.
The problem happens when raid1_add_disk() calls raise_barrier().
conf->nr_waiting is 0 at this point, so raise_barrier() starts waiting
for conf->nr_pending to reach 0, but it never does.
From this point on, anyone who calls wait_barrier() is also stuck.

This looks like a real deadlock in a non-ancient kernel. I realize
that 3.8.x is EOL, but it is still only one stable release behind.
Can you help further with this problem?

Another problem that I see often is that when a drive fails in RAID1,
its rdev is marked as "Blocked", but the Blocked bit never gets
cleared for some reason, so the drive is not ejected from the array.
Again, I have not been able to debug this more deeply at this point.

Thanks,
Alex.

>
>> Can you pls kindly comment on this.
>>
>
> NeilBrown

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier
  2013-05-30 14:30             ` Alexander Lyakas
@ 2013-06-02 12:43               ` Alexander Lyakas
  2013-06-04  1:49                 ` NeilBrown
  0 siblings, 1 reply; 20+ messages in thread
From: Alexander Lyakas @ 2013-06-02 12:43 UTC (permalink / raw)
  To: NeilBrown, linux-raid; +Cc: Shyam Kaushik, yair

[-- Attachment #1: Type: text/plain, Size: 9743 bytes --]

Hello Neil,
I believe I have found what is causing the deadlock. It happens in two flavors:

1)
# raid1d() is called, and conf->pending_bio_list is non-empty at this point
# raid1d() calls md_check_recovery(), which eventually calls
raid1_add_disk(), which calls raise_barrier()
# Now raise_barrier will wait for conf->nr_pending to become 0, but it
cannot become 0, because there are bios sitting in
conf->pending_bio_list that nobody will flush: raid1d is the thread
that is supposed to call flush_pending_writes(), either directly or
via handle_read_error, but it is stuck in raise_barrier.

2)
# raid1_add_disk() calls raise_barrier(), and waits for
conf->nr_pending to become 0, as before
# a new WRITE comes in and calls wait_barrier(), but this thread has a
non-empty current->bio_list
# In this case, the code allows the WRITE to go through wait_barrier()
and to trigger WRITEs to the mirror legs, but these WRITEs again end
up in conf->pending_bio_list (either via raid1_unplug or directly).
Nobody will flush conf->pending_bio_list, because raid1d is stuck in
raise_barrier.

Previously, for example in kernel 3.2, raid1_add_disk did not call
raise_barrier, so this problem did not happen.
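
To make the circular dependency concrete, here is a minimal user-space
model of flavor 1 (a sketch only; the pthread objects and the "pending"
counter are an analogy for conf->resync_lock, conf->wait_barrier and the
writes parked on conf->pending_bio_list, not the kernel code):

/* deadlock_model.c - build with: gcc -pthread deadlock_model.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int pending = 3;		/* writes parked on the pending list */

static void flush_pending(void)	/* only "raid1d" ever does this */
{
	pending = 0;
	pthread_cond_broadcast(&cond);
}

static void *raid1d(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	/* raise_barrier(): wait for pending to reach 0 before doing
	 * anything else, but pending can only reach 0 after this very
	 * thread calls flush_pending(), which it never gets to do. */
	while (pending != 0)
		pthread_cond_wait(&cond, &lock);
	flush_pending();	/* never reached */
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, raid1d, NULL);
	sleep(2);
	printf("pending=%d, raid1d still waiting: deadlock\n", pending);
	return 0;
}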

Attached is a reproduction with some prints that I added to
raise_barrier and wait_barrier (their code is also attached). It
demonstrates case 2. It shows that once raise_barrier is called,
conf->nr_pending drops until it equals the number of wait_barrier
calls that slipped through because of a non-empty current->bio_list.
At this point, the array hangs.

Can you please comment on how to fix this problem? It looks like a
real deadlock.
We could perhaps call md_check_recovery() after flush_pending_writes(),
but that only makes the window smaller; it does not close it entirely.
It looks like we really should not be calling raise_barrier from raid1d.

Thanks,
Alex.


On Thu, May 30, 2013 at 5:30 PM, Alexander Lyakas
<alex.bolshoy@gmail.com> wrote:
> Hi Neil,
>
> On Mon, May 20, 2013 at 10:17 AM, NeilBrown <neilb@suse.de> wrote:
>> On Thu, 16 May 2013 17:07:04 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
>> wrote:
>>
>>> Hello Neil,
>>> we are hitting issue that looks very similar; we are on kernel 3.8.2.
>>> The array is a 2-device raid1, which experienced a drive failure, but
>>> then drive was removed and re-added back to the array (although
>>> rebuild never started). Relevant parts of the kernel log:
>>>
>>> May 16 11:12:14  kernel: [46850.090499] md/raid1:md2: Disk failure on
>>> dm-6, disabling device.
>>> May 16 11:12:14  kernel: [46850.090499] md/raid1:md2: Operation
>>> continuing on 1 devices.
>>> May 16 11:12:14  kernel: [46850.090511] md: super_written gets
>>> error=-5, uptodate=0
>>> May 16 11:12:14  kernel: [46850.090676] md/raid1:md2: dm-6:
>>> rescheduling sector 18040736
>>> May 16 11:12:14  kernel: [46850.090764] md/raid1:md2: dm-6:
>>> rescheduling sector 20367040
>>> May 16 11:12:14  kernel: [46850.090826] md/raid1:md2: dm-6:
>>> rescheduling sector 17613504
>>> May 16 11:12:14  kernel: [46850.090883] md/raid1:md2: dm-6:
>>> rescheduling sector 18042720
>>> May 16 11:12:15  kernel: [46850.229970] md/raid1:md2: redirecting
>>> sector 18040736 to other mirror: dm-13
>>> May 16 11:12:15  kernel: [46850.257687] md/raid1:md2: redirecting
>>> sector 20367040 to other mirror: dm-13
>>> May 16 11:12:15  kernel: [46850.268731] md/raid1:md2: redirecting
>>> sector 17613504 to other mirror: dm-13
>>> May 16 11:12:15  kernel: [46850.398242] md/raid1:md2: redirecting
>>> sector 18042720 to other mirror: dm-13
>>> May 16 11:12:23  kernel: [46858.448465] md: unbind<dm-6>
>>> May 16 11:12:23  kernel: [46858.456081] md: export_rdev(dm-6)
>>> May 16 11:23:19  kernel: [47515.062547] md: bind<dm-6>
>>>
>>> May 16 11:24:28  kernel: [47583.920126] INFO: task md2_raid1:15749
>>> blocked for more than 60 seconds.
>>> May 16 11:24:28  kernel: [47583.921829] "echo 0 >
>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> May 16 11:24:28  kernel: [47583.923361] md2_raid1       D
>>> 0000000000000001     0 15749      2 0x00000000
>>> May 16 11:24:28  kernel: [47583.923367]  ffff880097c23c28
>>> 0000000000000046 ffff880000000002 00000000982c43b8
>>> May 16 11:24:28  kernel: [47583.923372]  ffff880097c23fd8
>>> ffff880097c23fd8 ffff880097c23fd8 0000000000013f40
>>> May 16 11:24:28  kernel: [47583.923376]  ffff880119b11740
>>> ffff8800a5489740 ffff880097c23c38 ffff88011609d3c0
>>> May 16 11:24:28  kernel: [47583.923381] Call Trace:
>>> May 16 11:24:28  kernel: [47583.923395]  [<ffffffff816eb399>] schedule+0x29/0x70
>>> May 16 11:24:28  kernel: [47583.923402]  [<ffffffffa0516736>]
>>> raise_barrier+0x106/0x160 [raid1]
>>> May 16 11:24:28  kernel: [47583.923410]  [<ffffffff8107fcc0>] ?
>>> add_wait_queue+0x60/0x60
>>> May 16 11:24:28  kernel: [47583.923415]  [<ffffffffa0516af7>]
>>> raid1_add_disk+0x197/0x200 [raid1]
>>> May 16 11:24:28  kernel: [47583.923421]  [<ffffffff81567fa4>]
>>> remove_and_add_spares+0x104/0x220
>>> May 16 11:24:28  kernel: [47583.923426]  [<ffffffff8156a02d>]
>>> md_check_recovery.part.49+0x40d/0x530
>>> May 16 11:24:28  kernel: [47583.923429]  [<ffffffff8156a165>]
>>> md_check_recovery+0x15/0x20
>>> May 16 11:24:28  kernel: [47583.923433]  [<ffffffffa0517e42>]
>>> raid1d+0x22/0x180 [raid1]
>>> May 16 11:24:28  kernel: [47583.923439]  [<ffffffff81045cd9>] ?
>>> default_spin_lock_flags+0x9/0x10
>>> May 16 11:24:28  kernel: [47583.923443]  [<ffffffff81045cd9>] ?
>>> default_spin_lock_flags+0x9/0x10
>>> May 16 11:24:28  kernel: [47583.923449]  [<ffffffff815624ed>]
>>> md_thread+0x10d/0x140
>>> May 16 11:24:28  kernel: [47583.923453]  [<ffffffff8107fcc0>] ?
>>> add_wait_queue+0x60/0x60
>>> May 16 11:24:28  kernel: [47583.923457]  [<ffffffff815623e0>] ?
>>> md_rdev_init+0x140/0x140
>>> May 16 11:24:28  kernel: [47583.923461]  [<ffffffff8107f0d0>] kthread+0xc0/0xd0
>>> May 16 11:24:28  kernel: [47583.923465]  [<ffffffff8107f010>] ?
>>> flush_kthread_worker+0xb0/0xb0
>>> May 16 11:24:28  kernel: [47583.923472]  [<ffffffff816f506c>]
>>> ret_from_fork+0x7c/0xb0
>>> May 16 11:24:28  kernel: [47583.923476]  [<ffffffff8107f010>] ?
>>> flush_kthread_worker+0xb0/0xb0
>>>
>>> dm-13 is the good drive, dm-6 is the one that failed.
>>>
>>> At this point, we have several threads calling submit_bio and all
>>> stuck like this:
>>>
>>> cat /proc/16218/stack
>>> [<ffffffffa0516d6e>] wait_barrier+0xbe/0x160 [raid1]
>>> [<ffffffffa0518627>] make_request+0x87/0xa90 [raid1]
>>> [<ffffffff81561ed0>] md_make_request+0xd0/0x200
>>> [<ffffffff8132bcaa>] generic_make_request+0xca/0x100
>>> [<ffffffff8132bd5b>] submit_bio+0x7b/0x160
>>> ...
>>>
>>> And md raid1 thread stuck like this:
>>>
>>> cat /proc/15749/stack
>>> [<ffffffffa0516736>] raise_barrier+0x106/0x160 [raid1]
>>> [<ffffffffa0516af7>] raid1_add_disk+0x197/0x200 [raid1]
>>> [<ffffffff81567fa4>] remove_and_add_spares+0x104/0x220
>>> [<ffffffff8156a02d>] md_check_recovery.part.49+0x40d/0x530
>>> [<ffffffff8156a165>] md_check_recovery+0x15/0x20
>>> [<ffffffffa0517e42>] raid1d+0x22/0x180 [raid1]
>>> [<ffffffff815624ed>] md_thread+0x10d/0x140
>>> [<ffffffff8107f0d0>] kthread+0xc0/0xd0
>>> [<ffffffff816f506c>] ret_from_fork+0x7c/0xb0
>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>
>>> We have also two user-space threads stuck:
>>>
>>> one is trying to read /sys/block/md2/md/array_state and its kernel stack is:
>>> # cat /proc/2251/stack
>>> [<ffffffff81564602>] md_attr_show+0x72/0xf0
>>> [<ffffffff8120f116>] fill_read_buffer.isra.8+0x66/0xf0
>>> [<ffffffff8120f244>] sysfs_read_file+0xa4/0xc0
>>> [<ffffffff8119b0d0>] vfs_read+0xb0/0x180
>>> [<ffffffff8119b1f2>] sys_read+0x52/0xa0
>>> [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f
>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>
>>> the other wants to read from /proc/mdstat and is:
>>> [<ffffffff81563d2b>] md_seq_show+0x4b/0x540
>>> [<ffffffff811bdd1b>] seq_read+0x16b/0x400
>>> [<ffffffff811ff572>] proc_reg_read+0x82/0xc0
>>> [<ffffffff8119b0d0>] vfs_read+0xb0/0x180
>>> [<ffffffff8119b1f2>] sys_read+0x52/0xa0
>>> [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f
>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>
>>> mdadm --detail also gets stuck if attempted, in stack like this:
>>> cat /proc/2864/stack
>>> [<ffffffff81564602>] md_attr_show+0x72/0xf0
>>> [<ffffffff8120f116>] fill_read_buffer.isra.8+0x66/0xf0
>>> [<ffffffff8120f244>] sysfs_read_file+0xa4/0xc0
>>> [<ffffffff8119b0d0>] vfs_read+0xb0/0x180
>>> [<ffffffff8119b1f2>] sys_read+0x52/0xa0
>>> [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f
>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>
>>> Might your patch https://patchwork.kernel.org/patch/2260051/ fix this
>>
>> Probably.
>>
>>> issue? Is this patch alone applicable to kernel 3.8.2?
>>
>> Probably.
>
> This problem still happens in 3.8.13, which contains the above fix of
> yours. I am currently debugging it deeper, but still don't know what
> the issue is.
> The problem happens when raid1_add_disk() calls raise_barrier().
> conf->nr_waiting is 0 at this point. So then raise_barrier() starts
> waiting for conf->nr_pending getting to 0, but it never gets to 0.
> From this point, anybody that calls wait_barrier() is also stuck.
>
> This looks like a real deadlock in a non-ancient kernel. I realize
> that 3.8.x is EOL, but still it's only one stable behind.
> Can you help further with this problem?
>
> Another problem that I see often, is when a drive fails in RAID1, and
> then rdev is marked as "Blocked", but then Blocked bit never gets
> cleared for some reason. So this drive is not ejected from the array.
> Again, I was unable to debug it deeper at this point.
>
> Thanks,
> Alex.
>
>
>
>
>
>
>
>
>
>
>
>
>>
>>> Can you pls kindly comment on this.
>>>
>>
>> NeilBrown

[-- Attachment #2: repro.txt --]
[-- Type: text/plain, Size: 14688 bytes --]

[  514.714425] md: bind<dm-5>
[  514.718751] md2: raise_barrier(1): conf->barrier=0 conf->nr_waiting=0 conf->nr_pending=191 conf->pending_count=0
[  514.718755] md2: raise_barrier(2): conf->barrier=1 conf->nr_waiting=0 conf->nr_pending=191 conf->pending_count=0
[  514.720564] md2: wait_barrier(PID=7079,1) conf->barrier=1 conf->nr_waiting=1 conf->nr_pending=191 conf->pending_count=0
[  514.722073] md2: wait_barrier(PID=7069,1) conf->barrier=1 conf->nr_waiting=2 conf->nr_pending=191 conf->pending_count=0
[  514.724602] md2: wait_barrier(PID=9922,1) conf->barrier=1 conf->nr_waiting=3 conf->nr_pending=191 conf->pending_count=0
[  514.725575] md2: wait_barrier(PID=9909,1) conf->barrier=1 conf->nr_waiting=4 conf->nr_pending=187 conf->pending_count=0
[  514.725661] md2: wait_barrier(PID=7082,1) conf->barrier=1 conf->nr_waiting=5 conf->nr_pending=187 conf->pending_count=0
[  514.725836] md2: wait_barrier(PID=9910,1) conf->barrier=1 conf->nr_waiting=6 conf->nr_pending=178 conf->pending_count=0
[  514.725848] md2: wait_barrier(PID=9918,1) conf->barrier=1 conf->nr_waiting=7 conf->nr_pending=178 conf->pending_count=0
[  514.725860] md2: wait_barrier(PID=7077,1) conf->barrier=1 conf->nr_waiting=8 conf->nr_pending=178 conf->pending_count=0
[  514.725872] md2: wait_barrier(PID=9905,1) conf->barrier=1 conf->nr_waiting=9 conf->nr_pending=178 conf->pending_count=0
[  514.725884] md2: wait_barrier(PID=7147,1) conf->barrier=1 conf->nr_waiting=10 conf->nr_pending=178 conf->pending_count=0
[  514.725917] md2: wait_barrier(PID=10217,1) conf->barrier=1 conf->nr_waiting=11 conf->nr_pending=178 conf->pending_count=0
[  514.727122] md2: wait_barrier(PID=7177,1) conf->barrier=1 conf->nr_waiting=12 conf->nr_pending=155 conf->pending_count=0
[  514.727446] md2: wait_barrier(PID=9904,1) conf->barrier=1 conf->nr_waiting=13 conf->nr_pending=155 conf->pending_count=0
[  514.727475] md2: wait_barrier(PID=9894,1) conf->barrier=1 conf->nr_waiting=14 conf->nr_pending=155 conf->pending_count=0
[  514.727489] md2: wait_barrier(PID=9898,1) conf->barrier=1 conf->nr_waiting=15 conf->nr_pending=155 conf->pending_count=0
[  514.727507] md2: wait_barrier(PID=7072,1) conf->barrier=1 conf->nr_waiting=16 conf->nr_pending=155 conf->pending_count=0
[  514.727651] md2: wait_barrier(PID=9899,1) conf->barrier=1 conf->nr_waiting=17 conf->nr_pending=150 conf->pending_count=0
[  514.727853] md2: wait_barrier(PID=9906,1) conf->barrier=1 conf->nr_waiting=18 conf->nr_pending=150 conf->pending_count=0
[  514.727867] md2: wait_barrier(PID=9912,1) conf->barrier=1 conf->nr_waiting=19 conf->nr_pending=150 conf->pending_count=0
[  514.727915] md2: wait_barrier(PID=6888,1) conf->barrier=1 conf->nr_waiting=20 conf->nr_pending=150 conf->pending_count=0
[  514.728109] md2: wait_barrier(PID=7080,1) conf->barrier=1 conf->nr_waiting=21 conf->nr_pending=150 conf->pending_count=0
[  514.728112] md2: wait_barrier(PID=7080,2) conf->barrier=1 conf->nr_waiting=20 conf->nr_pending=151 conf->pending_count=0
[  514.728136] md2: wait_barrier(PID=7080,1) conf->barrier=1 conf->nr_waiting=21 conf->nr_pending=151 conf->pending_count=1
[  514.728138] md2: wait_barrier(PID=7080,2) conf->barrier=1 conf->nr_waiting=20 conf->nr_pending=152 conf->pending_count=1
[  514.728143] md2: wait_barrier(PID=7080,1) conf->barrier=1 conf->nr_waiting=21 conf->nr_pending=152 conf->pending_count=2
[  514.728145] md2: wait_barrier(PID=7080,2) conf->barrier=1 conf->nr_waiting=20 conf->nr_pending=153 conf->pending_count=2
[  514.728149] md2: wait_barrier(PID=7080,1) conf->barrier=1 conf->nr_waiting=21 conf->nr_pending=153 conf->pending_count=3
[  514.730110] md2: wait_barrier(PID=9919,1) conf->barrier=1 conf->nr_waiting=22 conf->nr_pending=153 conf->pending_count=3
[  514.730188] md2: wait_barrier(PID=5773,1) conf->barrier=1 conf->nr_waiting=23 conf->nr_pending=153 conf->pending_count=3
[  514.731139] md2: wait_barrier(PID=7174,1) conf->barrier=1 conf->nr_waiting=24 conf->nr_pending=153 conf->pending_count=3
[  514.731143] md2: wait_barrier(PID=7174,2) conf->barrier=1 conf->nr_waiting=23 conf->nr_pending=154 conf->pending_count=3
[  514.731176] md2: wait_barrier(PID=7174,1) conf->barrier=1 conf->nr_waiting=24 conf->nr_pending=154 conf->pending_count=4
[  514.731179] md2: wait_barrier(PID=7174,2) conf->barrier=1 conf->nr_waiting=23 conf->nr_pending=155 conf->pending_count=4
[  514.731183] md2: wait_barrier(PID=7174,1) conf->barrier=1 conf->nr_waiting=24 conf->nr_pending=155 conf->pending_count=5
[  514.731186] md2: wait_barrier(PID=7174,2) conf->barrier=1 conf->nr_waiting=23 conf->nr_pending=156 conf->pending_count=5
[  514.731193] md2: wait_barrier(PID=7174,1) conf->barrier=1 conf->nr_waiting=24 conf->nr_pending=156 conf->pending_count=6
[  514.731371] md2: wait_barrier(PID=7070,1) conf->barrier=1 conf->nr_waiting=25 conf->nr_pending=156 conf->pending_count=6
[  514.731374] md2: wait_barrier(PID=7070,2) conf->barrier=1 conf->nr_waiting=24 conf->nr_pending=157 conf->pending_count=6
[  514.731382] md2: wait_barrier(PID=7070,1) conf->barrier=1 conf->nr_waiting=25 conf->nr_pending=157 conf->pending_count=7
[  514.731385] md2: wait_barrier(PID=7070,2) conf->barrier=1 conf->nr_waiting=24 conf->nr_pending=158 conf->pending_count=7
[  514.731388] md2: wait_barrier(PID=7070,1) conf->barrier=1 conf->nr_waiting=25 conf->nr_pending=158 conf->pending_count=8
[  514.731390] md2: wait_barrier(PID=7070,2) conf->barrier=1 conf->nr_waiting=24 conf->nr_pending=159 conf->pending_count=8
[  514.731394] md2: wait_barrier(PID=7070,1) conf->barrier=1 conf->nr_waiting=25 conf->nr_pending=159 conf->pending_count=9
[  514.731491] md2: wait_barrier(PID=9888,1) conf->barrier=1 conf->nr_waiting=26 conf->nr_pending=159 conf->pending_count=9
[  514.731494] md2: wait_barrier(PID=9888,2) conf->barrier=1 conf->nr_waiting=25 conf->nr_pending=160 conf->pending_count=9
[  514.731502] md2: wait_barrier(PID=9888,1) conf->barrier=1 conf->nr_waiting=26 conf->nr_pending=160 conf->pending_count=10
[  514.731505] md2: wait_barrier(PID=9888,2) conf->barrier=1 conf->nr_waiting=25 conf->nr_pending=161 conf->pending_count=10
[  514.731508] md2: wait_barrier(PID=9888,1) conf->barrier=1 conf->nr_waiting=26 conf->nr_pending=161 conf->pending_count=11
[  514.731510] md2: wait_barrier(PID=9888,2) conf->barrier=1 conf->nr_waiting=25 conf->nr_pending=162 conf->pending_count=11
[  514.731514] md2: wait_barrier(PID=9888,1) conf->barrier=1 conf->nr_waiting=26 conf->nr_pending=162 conf->pending_count=12
[  514.731755] md2: wait_barrier(PID=9895,1) conf->barrier=1 conf->nr_waiting=27 conf->nr_pending=162 conf->pending_count=12
[  514.731758] md2: wait_barrier(PID=9895,2) conf->barrier=1 conf->nr_waiting=26 conf->nr_pending=163 conf->pending_count=12
[  514.731767] md2: wait_barrier(PID=9895,1) conf->barrier=1 conf->nr_waiting=27 conf->nr_pending=163 conf->pending_count=13
[  514.731769] md2: wait_barrier(PID=9895,2) conf->barrier=1 conf->nr_waiting=26 conf->nr_pending=164 conf->pending_count=13
[  514.731773] md2: wait_barrier(PID=9895,1) conf->barrier=1 conf->nr_waiting=27 conf->nr_pending=164 conf->pending_count=14
[  514.731775] md2: wait_barrier(PID=9895,2) conf->barrier=1 conf->nr_waiting=26 conf->nr_pending=165 conf->pending_count=14
[  514.731826] md2: wait_barrier(PID=9895,1) conf->barrier=1 conf->nr_waiting=27 conf->nr_pending=165 conf->pending_count=15
[  514.731875] md2: wait_barrier(PID=9897,1) conf->barrier=1 conf->nr_waiting=28 conf->nr_pending=165 conf->pending_count=15
[  514.731878] md2: wait_barrier(PID=9897,2) conf->barrier=1 conf->nr_waiting=27 conf->nr_pending=166 conf->pending_count=15
[  514.731884] md2: wait_barrier(PID=9897,1) conf->barrier=1 conf->nr_waiting=28 conf->nr_pending=166 conf->pending_count=16
[  514.731887] md2: wait_barrier(PID=9897,2) conf->barrier=1 conf->nr_waiting=27 conf->nr_pending=167 conf->pending_count=16
[  514.731890] md2: wait_barrier(PID=9897,1) conf->barrier=1 conf->nr_waiting=28 conf->nr_pending=167 conf->pending_count=17
[  514.731893] md2: wait_barrier(PID=9897,2) conf->barrier=1 conf->nr_waiting=27 conf->nr_pending=168 conf->pending_count=17
[  514.731902] md2: wait_barrier(PID=9897,1) conf->barrier=1 conf->nr_waiting=28 conf->nr_pending=168 conf->pending_count=18
[  514.731971] md2: wait_barrier(PID=9896,1) conf->barrier=1 conf->nr_waiting=29 conf->nr_pending=168 conf->pending_count=18
[  514.732304] md2: wait_barrier(PID=9908,1) conf->barrier=1 conf->nr_waiting=30 conf->nr_pending=168 conf->pending_count=18
[  514.732307] md2: wait_barrier(PID=9908,2) conf->barrier=1 conf->nr_waiting=29 conf->nr_pending=169 conf->pending_count=18
[  514.732340] md2: wait_barrier(PID=9908,1) conf->barrier=1 conf->nr_waiting=30 conf->nr_pending=169 conf->pending_count=19
[  514.732343] md2: wait_barrier(PID=9908,2) conf->barrier=1 conf->nr_waiting=29 conf->nr_pending=170 conf->pending_count=19
[  514.732346] md2: wait_barrier(PID=9908,1) conf->barrier=1 conf->nr_waiting=30 conf->nr_pending=170 conf->pending_count=20
[  514.732349] md2: wait_barrier(PID=9908,2) conf->barrier=1 conf->nr_waiting=29 conf->nr_pending=171 conf->pending_count=20
[  514.732352] md2: wait_barrier(PID=9908,1) conf->barrier=1 conf->nr_waiting=30 conf->nr_pending=171 conf->pending_count=21
[  514.732397] md2: wait_barrier(PID=5774,1) conf->barrier=1 conf->nr_waiting=31 conf->nr_pending=171 conf->pending_count=21
[  514.732400] md2: wait_barrier(PID=5774,2) conf->barrier=1 conf->nr_waiting=30 conf->nr_pending=172 conf->pending_count=21
[  514.732412] md2: wait_barrier(PID=5774,1) conf->barrier=1 conf->nr_waiting=31 conf->nr_pending=172 conf->pending_count=22
[  514.732414] md2: wait_barrier(PID=5774,2) conf->barrier=1 conf->nr_waiting=30 conf->nr_pending=173 conf->pending_count=22
[  514.732419] md2: wait_barrier(PID=5774,1) conf->barrier=1 conf->nr_waiting=31 conf->nr_pending=173 conf->pending_count=23
[  514.732421] md2: wait_barrier(PID=5774,2) conf->barrier=1 conf->nr_waiting=30 conf->nr_pending=174 conf->pending_count=23
[  514.732424] md2: wait_barrier(PID=5774,1) conf->barrier=1 conf->nr_waiting=31 conf->nr_pending=174 conf->pending_count=24
[  514.732459] md2: wait_barrier(PID=7068,1) conf->barrier=1 conf->nr_waiting=32 conf->nr_pending=174 conf->pending_count=24
[  514.732461] md2: wait_barrier(PID=7068,2) conf->barrier=1 conf->nr_waiting=31 conf->nr_pending=175 conf->pending_count=24
[  514.732467] md2: wait_barrier(PID=7068,1) conf->barrier=1 conf->nr_waiting=32 conf->nr_pending=175 conf->pending_count=25
[  514.732469] md2: wait_barrier(PID=7068,2) conf->barrier=1 conf->nr_waiting=31 conf->nr_pending=176 conf->pending_count=25
[  514.732473] md2: wait_barrier(PID=7068,1) conf->barrier=1 conf->nr_waiting=32 conf->nr_pending=176 conf->pending_count=26
[  514.732475] md2: wait_barrier(PID=7068,2) conf->barrier=1 conf->nr_waiting=31 conf->nr_pending=177 conf->pending_count=26
[  514.732478] md2: wait_barrier(PID=7068,1) conf->barrier=1 conf->nr_waiting=32 conf->nr_pending=177 conf->pending_count=27
[  514.734807] md2: wait_barrier(PID=9901,1) conf->barrier=1 conf->nr_waiting=33 conf->nr_pending=177 conf->pending_count=27
[  514.734830] md2: wait_barrier(PID=9923,1) conf->barrier=1 conf->nr_waiting=34 conf->nr_pending=177 conf->pending_count=27
[  514.735975] md2: wait_barrier(PID=7175,1) conf->barrier=1 conf->nr_waiting=35 conf->nr_pending=158 conf->pending_count=27
[  514.735992] md2: wait_barrier(PID=7081,1) conf->barrier=1 conf->nr_waiting=36 conf->nr_pending=158 conf->pending_count=27
[  514.736083] md2: wait_barrier(PID=9921,1) conf->barrier=1 conf->nr_waiting=37 conf->nr_pending=158 conf->pending_count=27
[  514.736670] md2: wait_barrier(PID=7075,1) conf->barrier=1 conf->nr_waiting=38 conf->nr_pending=158 conf->pending_count=27
[  514.740425] md2: wait_barrier(PID=9925,1) conf->barrier=1 conf->nr_waiting=39 conf->nr_pending=125 conf->pending_count=27
[  514.740448] md2: wait_barrier(PID=41,1) conf->barrier=1 conf->nr_waiting=40 conf->nr_pending=125 conf->pending_count=27
[  514.740465] md2: wait_barrier(PID=7083,1) conf->barrier=1 conf->nr_waiting=41 conf->nr_pending=125 conf->pending_count=27
[  514.740478] md2: wait_barrier(PID=9911,1) conf->barrier=1 conf->nr_waiting=42 conf->nr_pending=125 conf->pending_count=27
[  514.740492] md2: wait_barrier(PID=9903,1) conf->barrier=1 conf->nr_waiting=43 conf->nr_pending=125 conf->pending_count=27
[  514.741067] md2: wait_barrier(PID=7071,1) conf->barrier=1 conf->nr_waiting=44 conf->nr_pending=125 conf->pending_count=27
[  514.743463] md2: wait_barrier(PID=9920,1) conf->barrier=1 conf->nr_waiting=45 conf->nr_pending=105 conf->pending_count=27
[  514.743478] md2: wait_barrier(PID=9924,1) conf->barrier=1 conf->nr_waiting=46 conf->nr_pending=105 conf->pending_count=27
[  514.743489] md2: wait_barrier(PID=7146,1) conf->barrier=1 conf->nr_waiting=47 conf->nr_pending=105 conf->pending_count=27
[  514.745765] md2: wait_barrier(PID=9890,1) conf->barrier=1 conf->nr_waiting=48 conf->nr_pending=89 conf->pending_count=27
[  514.745784] md2: wait_barrier(PID=9879,1) conf->barrier=1 conf->nr_waiting=49 conf->nr_pending=89 conf->pending_count=27
[  514.746556] md2: wait_barrier(PID=6,1) conf->barrier=1 conf->nr_waiting=50 conf->nr_pending=81 conf->pending_count=27
[  514.748897] md2: wait_barrier(PID=9966,1) conf->barrier=1 conf->nr_waiting=51 conf->nr_pending=75 conf->pending_count=27
[  514.749195] md2: wait_barrier(PID=9900,1) conf->barrier=1 conf->nr_waiting=52 conf->nr_pending=67 conf->pending_count=27
[  514.750047] md2: wait_barrier(PID=9887,1) conf->barrier=1 conf->nr_waiting=53 conf->nr_pending=57 conf->pending_count=27
[  514.750065] md2: wait_barrier(PID=9889,1) conf->barrier=1 conf->nr_waiting=54 conf->nr_pending=57 conf->pending_count=27
[  514.750076] md2: wait_barrier(PID=9893,1) conf->barrier=1 conf->nr_waiting=55 conf->nr_pending=57 conf->pending_count=27
[  514.752565] md2: wait_barrier(PID=9915,1) conf->barrier=1 conf->nr_waiting=56 conf->nr_pending=48 conf->pending_count=27
[  514.752580] md2: wait_barrier(PID=7074,1) conf->barrier=1 conf->nr_waiting=57 conf->nr_pending=48 conf->pending_count=27
[  514.752592] md2: wait_barrier(PID=9902,1) conf->barrier=1 conf->nr_waiting=58 conf->nr_pending=48 conf->pending_count=27
[  514.752603] md2: wait_barrier(PID=9916,1) conf->barrier=1 conf->nr_waiting=59 conf->nr_pending=48 conf->pending_count=27
[  514.752926] md2: wait_barrier(PID=42,1) conf->barrier=1 conf->nr_waiting=60 conf->nr_pending=27 conf->pending_count=27
[  514.752946] md2: wait_barrier(PID=7073,1) conf->barrier=1 conf->nr_waiting=61 conf->nr_pending=27 conf->pending_count=27
[  514.752960] md2: wait_barrier(PID=9914,1) conf->barrier=1 conf->nr_waiting=62 conf->nr_pending=27 conf->pending_count=27


[-- Attachment #3: raise_wait_barrier.c --]
[-- Type: text/x-csrc, Size: 2457 bytes --]
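
/*
 * Instrumented copies of raise_barrier() and wait_barrier() used for the
 * reproduction above.  The pr_info()/dump_stack() calls are debugging
 * additions for this test and are not part of the stock raid1.c.
 */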

static void raise_barrier(struct r1conf *conf)
{
	spin_lock_irq(&conf->resync_lock);

	pr_info("%s: raise_barrier(1): conf->barrier=%d conf->nr_waiting=%d conf->nr_pending=%d conf->pending_count=%d\n", mdname(conf->mddev), conf->barrier, conf->nr_waiting, conf->nr_pending, conf->pending_count);

	/* Wait until no block IO is waiting */
	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
			    conf->resync_lock);

	/* block any new IO from starting */
	conf->barrier++;

	pr_info("%s: raise_barrier(2): conf->barrier=%d conf->nr_waiting=%d conf->nr_pending=%d conf->pending_count=%d\n", mdname(conf->mddev), conf->barrier, conf->nr_waiting, conf->nr_pending, conf->pending_count);

	/* Now wait for all pending IO to complete */
	wait_event_lock_irq(conf->wait_barrier,
			    !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
			    conf->resync_lock);

	pr_info("%s: raise_barrier(3): conf->barrier=%d conf->nr_waiting=%d conf->nr_pending=%d conf->pending_count=%d\n", mdname(conf->mddev), conf->barrier, conf->nr_waiting, conf->nr_pending, conf->pending_count);

	spin_unlock_irq(&conf->resync_lock);
}

static void wait_barrier(struct r1conf *conf)
{
	int waited = 0;
	
	spin_lock_irq(&conf->resync_lock);
	if (conf->barrier) {
		conf->nr_waiting++;

		waited = 1;
		pr_info("%s: wait_barrier(PID=%d,1) conf->barrier=%d conf->nr_waiting=%d conf->nr_pending=%d conf->pending_count=%d\n", mdname(conf->mddev), current->pid, 
			    conf->barrier, conf->nr_waiting, conf->nr_pending, conf->pending_count);
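		/* Debugging aid: dump the stack of a writer that arrives with
		 * pending IO and a non-empty current->bio_list, i.e. exactly
		 * the case that the wait condition below lets through. */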
		if (conf->nr_pending && current->bio_list && !bio_list_empty(current->bio_list))
			dump_stack();

		/* Wait for the barrier to drop.
		 * However if there are already pending
		 * requests (preventing the barrier from
		 * rising completely), and the
		 * pre-process bio queue isn't empty,
		 * then don't wait, as we need to empty
		 * that queue to get the nr_pending
		 * count down.
		 */
		wait_event_lock_irq(conf->wait_barrier,
				    !conf->barrier ||
				    (conf->nr_pending &&
				     current->bio_list &&
				     !bio_list_empty(current->bio_list)),
				    conf->resync_lock);
		conf->nr_waiting--;
	}
	conf->nr_pending++;

	if (waited)
		pr_info("%s: wait_barrier(PID=%d,2) conf->barrier=%d conf->nr_waiting=%d conf->nr_pending=%d conf->pending_count=%d\n", mdname(conf->mddev), current->pid, 
			    conf->barrier, conf->nr_waiting, conf->nr_pending, conf->pending_count);

	spin_unlock_irq(&conf->resync_lock);
}

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier
  2013-06-02 12:43               ` Alexander Lyakas
@ 2013-06-04  1:49                 ` NeilBrown
  2013-06-04  9:52                   ` Alexander Lyakas
  2013-06-06 15:00                   ` Tregaron Bayly
  0 siblings, 2 replies; 20+ messages in thread
From: NeilBrown @ 2013-06-04  1:49 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid, Shyam Kaushik, yair

[-- Attachment #1: Type: text/plain, Size: 2948 bytes --]

On Sun, 2 Jun 2013 15:43:41 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:

> Hello Neil,
> I believe I have found what is causing the deadlock. It happens in two flavors:
> 
> 1)
> # raid1d() is called, and conf->pending_bio_list is non-empty at this point
> # raid1d() calls md_check_recovery(), which eventually calls
> raid1_add_disk(), which calls raise_barrier()
> # Now raise_barrier will wait for conf->nr_pending to become 0, but it
> cannot become 0, because there are bios sitting in
> conf->pending_bio_list, which nobody will flush, because raid1d is the
> one supposed to call flush_pending_writes(), either directly or via
> handle_read_error. But it is stuck in raise_barrier.
> 
> 2)
> # raid1_add_disk() calls raise_barrier(), and waits for
> conf->nr_pending to become 0, as before
> # new WRITE comes and calls wait_barrier(), but this thread has a
> non-empty current->bio_list
> # In this case, the code allows the WRITE to go through
> wait_barrier(), and trigger WRITEs to mirror legs, but these WRITEs
> again end up in conf->pending_bio_list (either via raid1_unplug or
> directly). But nobody will flush conf->pending_bio_list, because
> raid1d is stuck in raise_barrier.
> 
> Previously, for example in kernel 3.2, raid1_add_disk did not call
> raise_barrier, so this problem did not happen.
> 
> Attached is a reproduction with some prints that I added to
> raise_barrier and wait_barrier (their code also attached). It
> demonstrates case 2. It shows that once raise_barrier got called,
> conf->nr_pending drops down, until it equals the number of
> wait_barrier calls, that slipped through because of non-empty
> current->bio_list. And at this point, this array hangs.
> 
> Can you please comment on how to fix this problem. It looks like a
> real deadlock.
> We can perhaps call md_check_recovery() after flush_pending_writes(),
> but this only makes the window smaller, not closes it entirely. But it
> looks like we really should not be calling raise_barrier from raid1d.
> 
> Thanks,
> Alex.

Hi Alex,
 thanks for the analysis.

Does the following patch fix it?  It makes raise_barrier  more  like
freeze_array().
If not, could you try making the same change to the first
wait_event_lock_irq in raise_barrier?

Thanks.
NeilBrown



diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 328fa2d..d34f892 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -828,9 +828,9 @@ static void raise_barrier(struct r1conf *conf)
 	conf->barrier++;
 
 	/* Now wait for all pending IO to complete */
-	wait_event_lock_irq(conf->wait_barrier,
-			    !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
-			    conf->resync_lock);
+	wait_event_lock_irq_cmd(conf->wait_barrier,
+				!conf->nr_pending && conf->barrier < RESYNC_DEPTH,
+				conf->resync_lock, flush_pending_writes(conf));
 
 	spin_unlock_irq(&conf->resync_lock);
 }
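
For reference, wait_event_lock_irq_cmd(wq, condition, lock, cmd) boils
down to roughly the following loop (paraphrased; the exact code lives in
the kernel's wait headers and varies between versions).  The point of the
patch is that whenever the condition is found false, the lock is dropped
and cmd runs before the task sleeps, so the waiter in raise_barrier() now
flushes conf->pending_bio_list itself instead of depending on raid1d:

	DEFINE_WAIT(__wait);
	for (;;) {
		prepare_to_wait(&wq, &__wait, TASK_UNINTERRUPTIBLE);
		if (condition)
			break;
		spin_unlock_irq(&lock);
		cmd;			/* here: flush_pending_writes(conf) */
		schedule();
		spin_lock_irq(&lock);
	}
	finish_wait(&wq, &__wait);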

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier
  2013-06-04  1:49                 ` NeilBrown
@ 2013-06-04  9:52                   ` Alexander Lyakas
  2013-06-06 15:00                   ` Tregaron Bayly
  1 sibling, 0 replies; 20+ messages in thread
From: Alexander Lyakas @ 2013-06-04  9:52 UTC (permalink / raw)
  To: NeilBrown, linux-raid; +Cc: Shyam Kaushik, yair

[-- Attachment #1: Type: text/plain, Size: 6344 bytes --]

Hello Neil,
thanks for your response.

On Tue, Jun 4, 2013 at 4:49 AM, NeilBrown <neilb@suse.de> wrote:
> On Sun, 2 Jun 2013 15:43:41 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> Hello Neil,
>> I believe I have found what is causing the deadlock. It happens in two flavors:
>>
>> 1)
>> # raid1d() is called, and conf->pending_bio_list is non-empty at this point
>> # raid1d() calls md_check_recovery(), which eventually calls
>> raid1_add_disk(), which calls raise_barrier()
>> # Now raise_barrier will wait for conf->nr_pending to become 0, but it
>> cannot become 0, because there are bios sitting in
>> conf->pending_bio_list, which nobody will flush, because raid1d is the
>> one supposed to call flush_pending_writes(), either directly or via
>> handle_read_error. But it is stuck in raise_barrier.
>>
>> 2)
>> # raid1_add_disk() calls raise_barrier(), and waits for
>> conf->nr_pending to become 0, as before
>> # new WRITE comes and calls wait_barrier(), but this thread has a
>> non-empty current->bio_list
>> # In this case, the code allows the WRITE to go through
>> wait_barrier(), and trigger WRITEs to mirror legs, but these WRITEs
>> again end up in conf->pending_bio_list (either via raid1_unplug or
>> directly). But nobody will flush conf->pending_bio_list, because
>> raid1d is stuck in raise_barrier.
>>
>> Previously, for example in kernel 3.2, raid1_add_disk did not call
>> raise_barrier, so this problem did not happen.
>>
>> Attached is a reproduction with some prints that I added to
>> raise_barrier and wait_barrier (their code also attached). It
>> demonstrates case 2. It shows that once raise_barrier got called,
>> conf->nr_pending drops down, until it equals the number of
>> wait_barrier calls, that slipped through because of non-empty
>> current->bio_list. And at this point, this array hangs.
>>
>> Can you please comment on how to fix this problem. It looks like a
>> real deadlock.
>> We can perhaps call md_check_recovery() after flush_pending_writes(),
>> but this only makes the window smaller, not closes it entirely. But it
>> looks like we really should not be calling raise_barrier from raid1d.
>>
>> Thanks,
>> Alex.
>
> Hi Alex,
>  thanks for the analysis.
>
> Does the following patch fix it?  It makes raise_barrier  more  like
> freeze_array().
> If not, could you try making the same change to the first
> wait_event_lock_irq in raise_barrier?
>
> Thanks.
> NeilBrown
>
>
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 328fa2d..d34f892 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -828,9 +828,9 @@ static void raise_barrier(struct r1conf *conf)
>         conf->barrier++;
>
>         /* Now wait for all pending IO to complete */
> -       wait_event_lock_irq(conf->wait_barrier,
> -                           !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
> -                           conf->resync_lock);
> +       wait_event_lock_irq_cmd(conf->wait_barrier,
> +                               !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
> +                               conf->resync_lock, flush_pending_writes(conf));
>
>         spin_unlock_irq(&conf->resync_lock);
>  }

Yes, this patch fixes the problem, thanks! Actually, yesterday, I
attempted a similar fix[1] and it also seemed to work.
I have several comments about this patch:

# It fully fixes case 1, and almost fully closes the race window for
case 2. I attach a reproduction in which it can be seen that, while
raise_barrier is waiting, new bios slip through wait_barrier (because
of a non-empty current->bio_list), and raise_barrier is awakened more
than once to flush them. Eventually it takes 2 seconds for
raise_barrier to complete. This is still much better than sleeping
forever, though :)

# We are now calling flush_pending_writes() while mddev_lock() is
held. It doesn't seem problematic to call generic_make_request()
under this lock, but flush_pending_writes() also does bitmap_unplug(),
which may wait for a superblock update, etc. Is this ok? I found one
or two other places where we wait for a superblock update under
mddev_lock (ADD_NEW_DISK, for example), so it's probably ok?

# I am concerned that raise_barrier is also called from sync_request,
so it would then also attempt to call flush_pending_writes. Can we
make a more conservative patch, like this:
        /* Now wait for all pending IO to complete */
-       wait_event_lock_irq(conf->wait_barrier,
-                           !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
-                           conf->resync_lock);
+       if (conf->mddev->thread && conf->mddev->thread->tsk == current) {
+               /*
+                * If we are called from the management thread (raid1d), we
+                * need to flush the bios that might be sitting in
+                * conf->pending_bio_list, otherwise we will wait for them
+                * here forever.
+                */
+               wait_event_lock_irq_cmd(conf->wait_barrier,
+                                       !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
+                                       conf->resync_lock,
+                                       flush_pending_writes(conf));
+       } else {
+               wait_event_lock_irq(conf->wait_barrier,
+                                   !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
+                                   conf->resync_lock);
+       }
+

        spin_unlock_irq(&conf->resync_lock);

Thanks again. I will work on and reply about the other issues I
mentioned; for some of them I have already made patches.
Alex.


[1]
 	/* block any new IO from starting */
 	conf->barrier++;

+	/* if we are raising the barrier while inside raid1d (which we really shouldn't)... */
+	if (conf->mddev->thread && conf->mddev->thread->tsk == current) {
+		while (!(!conf->nr_pending && conf->barrier < RESYNC_DEPTH)) {
+			int nr_pending = conf->nr_pending;
+
+			spin_unlock_irq(&conf->resync_lock);
+
+			if (nr_pending)
+				flush_pending_writes(conf);
+			wait_event_timeout(conf->wait_barrier,
+				!conf->nr_pending && conf->barrier < RESYNC_DEPTH,
+				msecs_to_jiffies(100));
+
+			spin_lock_irq(&conf->resync_lock);
+		}
+
+		spin_unlock_irq(&conf->resync_lock);
+
+		return;
+	}
 	/* Now wait for all pending IO to complete */
 	wait_event_lock_irq(conf->wait_barrier,
 			    !conf->nr_pending && conf->barrier < RESYNC_DEPTH,

[-- Attachment #2: repro.tgz --]
[-- Type: application/x-gzip, Size: 12349 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier
  2013-06-04  1:49                 ` NeilBrown
  2013-06-04  9:52                   ` Alexander Lyakas
@ 2013-06-06 15:00                   ` Tregaron Bayly
  2013-06-08  9:45                     ` Alexander Lyakas
  2013-06-12  1:30                     ` NeilBrown
  1 sibling, 2 replies; 20+ messages in thread
From: Tregaron Bayly @ 2013-06-06 15:00 UTC (permalink / raw)
  To: NeilBrown; +Cc: Alexander Lyakas, linux-raid, Shyam Kaushik, yair

On 06/03/2013 07:49 PM, NeilBrown wrote:
> On Sun, 2 Jun 2013 15:43:41 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> Hello Neil,
>> I believe I have found what is causing the deadlock. It happens in two flavors:
>>
>> 1)
>> # raid1d() is called, and conf->pending_bio_list is non-empty at this point
>> # raid1d() calls md_check_recovery(), which eventually calls
>> raid1_add_disk(), which calls raise_barrier()
>> # Now raise_barrier will wait for conf->nr_pending to become 0, but it
>> cannot become 0, because there are bios sitting in
>> conf->pending_bio_list, which nobody will flush, because raid1d is the
>> one supposed to call flush_pending_writes(), either directly or via
>> handle_read_error. But it is stuck in raise_barrier.
>>
>> 2)
>> # raid1_add_disk() calls raise_barrier(), and waits for
>> conf->nr_pending to become 0, as before
>> # new WRITE comes and calls wait_barrier(), but this thread has a
>> non-empty current->bio_list
>> # In this case, the code allows the WRITE to go through
>> wait_barrier(), and trigger WRITEs to mirror legs, but these WRITEs
>> again end up in conf->pending_bio_list (either via raid1_unplug or
>> directly). But nobody will flush conf->pending_bio_list, because
>> raid1d is stuck in raise_barrier.
>>
>> Previously, for example in kernel 3.2, raid1_add_disk did not call
>> raise_barrier, so this problem did not happen.
>>
>> Attached is a reproduction with some prints that I added to
>> raise_barrier and wait_barrier (their code also attached). It
>> demonstrates case 2. It shows that once raise_barrier got called,
>> conf->nr_pending drops down, until it equals the number of
>> wait_barrier calls, that slipped through because of non-empty
>> current->bio_list. And at this point, this array hangs.
>>
>> Can you please comment on how to fix this problem. It looks like a
>> real deadlock.
>> We can perhaps call md_check_recovery() after flush_pending_writes(),
>> but this only makes the window smaller, not closes it entirely. But it
>> looks like we really should not be calling raise_barrier from raid1d.
>>
>> Thanks,
>> Alex.
>
> Hi Alex,
>   thanks for the analysis.
>
> Does the following patch fix it?  It makes raise_barrier  more  like
> freeze_array().
> If not, could you try making the same change to the first
> wait_event_lock_irq in raise_barrier?
>
> Thanks.
> NeilBrown
>
>
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 328fa2d..d34f892 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -828,9 +828,9 @@ static void raise_barrier(struct r1conf *conf)
>   	conf->barrier++;
>
>   	/* Now wait for all pending IO to complete */
> -	wait_event_lock_irq(conf->wait_barrier,
> -			    !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
> -			    conf->resync_lock);
> +	wait_event_lock_irq_cmd(conf->wait_barrier,
> +				!conf->nr_pending && conf->barrier < RESYNC_DEPTH,
> +				conf->resync_lock, flush_pending_writes(conf));
>
>   	spin_unlock_irq(&conf->resync_lock);
>   }
>

Neil,

This deadlock also cropped up in 3.4 between .37 and .38.  Passing flush_pending_writes(conf) as cmd to wait_event_lock_irq seems to fix it there as well.

--- linux-3.4.38/drivers/md/raid1.c     2013-03-28 13:12:41.000000000 -0600
+++ linux-3.4.38.patch/drivers/md/raid1.c       2013-06-04 12:17:35.314194903 -0600
@@ -751,7 +751,7 @@
         /* Now wait for all pending IO to complete */
         wait_event_lock_irq(conf->wait_barrier,
                             !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
-                           conf->resync_lock, );
+                           conf->resync_lock, flush_pending_writes(conf));
  
         spin_unlock_irq(&conf->resync_lock);
  }

Thanks,

Tregaron

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier
  2013-06-06 15:00                   ` Tregaron Bayly
@ 2013-06-08  9:45                     ` Alexander Lyakas
  2013-06-12  0:42                       ` NeilBrown
  2013-06-12  1:30                     ` NeilBrown
  1 sibling, 1 reply; 20+ messages in thread
From: Alexander Lyakas @ 2013-06-08  9:45 UTC (permalink / raw)
  To: NeilBrown, linux-raid; +Cc: Shyam Kaushik, yair, Tregaron Bayly

Hi Neil,
after reading the raid1.c code, I see that there is also
conf->retry_list, which is also flushed by raid1d but not by
flush_pending_writes(). So I think it can also cause a similar
deadlock, but I don't know how to fix it :(

Alex.


On Thu, Jun 6, 2013 at 6:00 PM, Tregaron Bayly <tbayly@bluehost.com> wrote:
> On 06/03/2013 07:49 PM, NeilBrown wrote:
>>
>> On Sun, 2 Jun 2013 15:43:41 +0300 Alexander Lyakas
>> <alex.bolshoy@gmail.com>
>> wrote:
>>
>>> Hello Neil,
>>> I believe I have found what is causing the deadlock. It happens in two
>>> flavors:
>>>
>>> 1)
>>> # raid1d() is called, and conf->pending_bio_list is non-empty at this
>>> point
>>> # raid1d() calls md_check_recovery(), which eventually calls
>>> raid1_add_disk(), which calls raise_barrier()
>>> # Now raise_barrier will wait for conf->nr_pending to become 0, but it
>>> cannot become 0, because there are bios sitting in
>>> conf->pending_bio_list, which nobody will flush, because raid1d is the
>>> one supposed to call flush_pending_writes(), either directly or via
>>> handle_read_error. But it is stuck in raise_barrier.
>>>
>>> 2)
>>> # raid1_add_disk() calls raise_barrier(), and waits for
>>> conf->nr_pending to become 0, as before
>>> # new WRITE comes and calls wait_barrier(), but this thread has a
>>> non-empty current->bio_list
>>> # In this case, the code allows the WRITE to go through
>>> wait_barrier(), and trigger WRITEs to mirror legs, but these WRITEs
>>> again end up in conf->pending_bio_list (either via raid1_unplug or
>>> directly). But nobody will flush conf->pending_bio_list, because
>>> raid1d is stuck in raise_barrier.
>>>
>>> Previously, for example in kernel 3.2, raid1_add_disk did not call
>>> raise_barrier, so this problem did not happen.
>>>
>>> Attached is a reproduction with some prints that I added to
>>> raise_barrier and wait_barrier (their code also attached). It
>>> demonstrates case 2. It shows that once raise_barrier got called,
>>> conf->nr_pending drops down, until it equals the number of
>>> wait_barrier calls, that slipped through because of non-empty
>>> current->bio_list. And at this point, this array hangs.
>>>
>>> Can you please comment on how to fix this problem. It looks like a
>>> real deadlock.
>>> We can perhaps call md_check_recovery() after flush_pending_writes(),
>>> but this only makes the window smaller, not closes it entirely. But it
>>> looks like we really should not be calling raise_barrier from raid1d.
>>>
>>> Thanks,
>>> Alex.
>>
>>
>> Hi Alex,
>>   thanks for the analysis.
>>
>> Does the following patch fix it?  It makes raise_barrier  more  like
>> freeze_array().
>> If not, could you try making the same change to the first
>> wait_event_lock_irq in raise_barrier?
>>
>> Thanks.
>> NeilBrown
>>
>>
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 328fa2d..d34f892 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -828,9 +828,9 @@ static void raise_barrier(struct r1conf *conf)
>>         conf->barrier++;
>>
>>         /* Now wait for all pending IO to complete */
>> -       wait_event_lock_irq(conf->wait_barrier,
>> -                           !conf->nr_pending && conf->barrier <
>> RESYNC_DEPTH,
>> -                           conf->resync_lock);
>> +       wait_event_lock_irq_cmd(conf->wait_barrier,
>> +                               !conf->nr_pending && conf->barrier <
>> RESYNC_DEPTH,
>> +                               conf->resync_lock,
>> flush_pending_writes(conf));
>>
>>         spin_unlock_irq(&conf->resync_lock);
>>   }
>>
>
> Neil,
>
> This deadlock also cropped up in 3.4 between .37 and .38.  Passing
> flush_pending_writes(conf) as cmd to wait_event_lock_irq seems to fix it
> there as well.
>
> --- linux-3.4.38/drivers/md/raid1.c     2013-03-28 13:12:41.000000000 -0600
> +++ linux-3.4.38.patch/drivers/md/raid1.c       2013-06-04
> 12:17:35.314194903 -0600
> @@ -751,7 +751,7 @@
>
>         /* Now wait for all pending IO to complete */
>         wait_event_lock_irq(conf->wait_barrier,
>
>                             !conf->nr_pending && conf->barrier <
> RESYNC_DEPTH,
> -                           conf->resync_lock, );
>
> +                           conf->resync_lock, flush_pending_writes(conf));
>          spin_unlock_irq(&conf->resync_lock);
>  }
>
> Thanks,
>
> Tregaron

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier
  2013-06-08  9:45                     ` Alexander Lyakas
@ 2013-06-12  0:42                       ` NeilBrown
  0 siblings, 0 replies; 20+ messages in thread
From: NeilBrown @ 2013-06-12  0:42 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid, Shyam Kaushik, yair, Tregaron Bayly

[-- Attachment #1: Type: text/plain, Size: 2219 bytes --]

On Sat, 8 Jun 2013 12:45:41 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:

> Hi Neil,
> after reading the code of raid1.c, I see that there's also
> conf->retry_list, which is also flushed by raid1d, but not by
> flush_pending_writes(). So I think it can also cause similar deadlock,
> but I don't know how to fix it:(
> 

Good point.
Requests in retry_list are counted in nr_queued, which is checked in
freeze_array().

And freeze_array() already calls flush_pending_writes().

So I suspect the right thing to do is to use freeze_array() in place
of raise_barrier().

Maybe this is the right approach.

Testing greatly appreciated...

Thanks,
NeilBrown



diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 226dcd0..240b328 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1569,8 +1569,8 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
 		 * we wait for all outstanding requests to complete.
 		 */
 		synchronize_sched();
-		raise_barrier(conf);
-		lower_barrier(conf);
+		freeze_array(conf);
+		unfreeze_array(conf);
 		clear_bit(Unmerged, &rdev->flags);
 	}
 	md_integrity_add_rdev(rdev, mddev);
@@ -1620,11 +1620,11 @@ static int raid1_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
 			 */
 			struct md_rdev *repl =
 				conf->mirrors[conf->raid_disks + number].rdev;
-			raise_barrier(conf);
+			freeze_array(conf);
 			clear_bit(Replacement, &repl->flags);
 			p->rdev = repl;
 			conf->mirrors[conf->raid_disks + number].rdev = NULL;
-			lower_barrier(conf);
+			unfreeze_array(conf);
 			clear_bit(WantReplacement, &rdev->flags);
 		} else
 			clear_bit(WantReplacement, &rdev->flags);
@@ -3021,7 +3021,7 @@ static int raid1_reshape(struct mddev *mddev)
 		return -ENOMEM;
 	}
 
-	raise_barrier(conf);
+	freeze_array(conf);
 
 	/* ok, everything is stopped */
 	oldpool = conf->r1bio_pool;
@@ -3052,7 +3052,7 @@ static int raid1_reshape(struct mddev *mddev)
 	conf->raid_disks = mddev->raid_disks = raid_disks;
 	mddev->delta_disks = 0;
 
-	lower_barrier(conf);
+	unfreeze_array(conf);
 
 	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 	md_wakeup_thread(mddev->thread);
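
For context, freeze_array() and unfreeze_array() in this era look roughly
like the following (paraphrased from drivers/md/raid1.c; exact comments
and details differ between kernel versions).  Unlike raise_barrier(), the
wait condition accounts for requests parked on conf->retry_list through
nr_queued, and flush_pending_writes() runs as the wait command, which
addresses the pending_bio_list and retry_list concerns raised above:

static void freeze_array(struct r1conf *conf)
{
	/* Stop sync IO and normal IO and wait for everything to go quiet.
	 * Wait until nr_pending matches nr_queued + 1; the "+ 1" accounts
	 * for the request whose handling triggered the freeze (the function
	 * was written to be called from the context of one failed normal
	 * IO request).
	 */
	spin_lock_irq(&conf->resync_lock);
	conf->barrier++;
	conf->nr_waiting++;
	wait_event_lock_irq_cmd(conf->wait_barrier,
				conf->nr_pending == conf->nr_queued + 1,
				conf->resync_lock,
				flush_pending_writes(conf));
	spin_unlock_irq(&conf->resync_lock);
}

static void unfreeze_array(struct r1conf *conf)
{
	/* reverse the effect of the freeze */
	spin_lock_irq(&conf->resync_lock);
	conf->barrier--;
	conf->nr_waiting--;
	wake_up(&conf->wait_barrier);
	spin_unlock_irq(&conf->resync_lock);
}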

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier
  2013-06-06 15:00                   ` Tregaron Bayly
  2013-06-08  9:45                     ` Alexander Lyakas
@ 2013-06-12  1:30                     ` NeilBrown
  1 sibling, 0 replies; 20+ messages in thread
From: NeilBrown @ 2013-06-12  1:30 UTC (permalink / raw)
  To: Tregaron Bayly; +Cc: Alexander Lyakas, linux-raid, Shyam Kaushik, yair

[-- Attachment #1: Type: text/plain, Size: 4261 bytes --]

On Thu, 06 Jun 2013 09:00:26 -0600 Tregaron Bayly <tbayly@bluehost.com> wrote:

> On 06/03/2013 07:49 PM, NeilBrown wrote:
> > On Sun, 2 Jun 2013 15:43:41 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> > wrote:
> >
> >> Hello Neil,
> >> I believe I have found what is causing the deadlock. It happens in two flavors:
> >>
> >> 1)
> >> # raid1d() is called, and conf->pending_bio_list is non-empty at this point
> >> # raid1d() calls md_check_recovery(), which eventually calls
> >> raid1_add_disk(), which calls raise_barrier()
> >> # Now raise_barrier will wait for conf->nr_pending to become 0, but it
> >> cannot become 0, because there are bios sitting in
> >> conf->pending_bio_list, which nobody will flush, because raid1d is the
> >> one supposed to call flush_pending_writes(), either directly or via
> >> handle_read_error. But it is stuck in raise_barrier.
> >>
> >> 2)
> >> # raid1_add_disk() calls raise_barrier(), and waits for
> >> conf->nr_pending to become 0, as before
> >> # new WRITE comes and calls wait_barrier(), but this thread has a
> >> non-empty current->bio_list
> >> # In this case, the code allows the WRITE to go through
> >> wait_barrier(), and trigger WRITEs to mirror legs, but these WRITEs
> >> again end up in conf->pending_bio_list (either via raid1_unplug or
> >> directly). But nobody will flush conf->pending_bio_list, because
> >> raid1d is stuck in raise_barrier.
> >>
> >> Previously, for example in kernel 3.2, raid1_add_disk did not call
> >> raise_barrier, so this problem did not happen.
> >>
> >> Attached is a reproduction with some prints that I added to
> >> raise_barrier and wait_barrier (their code also attached). It
> >> demonstrates case 2. It shows that once raise_barrier got called,
> >> conf->nr_pending drops down, until it equals the number of
> >> wait_barrier calls, that slipped through because of non-empty
> >> current->bio_list. And at this point, this array hangs.
> >>
> >> Can you please comment on how to fix this problem. It looks like a
> >> real deadlock.
> >> We can perhaps call md_check_recovery() after flush_pending_writes(),
> >> but this only makes the window smaller, not closes it entirely. But it
> >> looks like we really should not be calling raise_barrier from raid1d.
> >>
> >> Thanks,
> >> Alex.
> >
> > Hi Alex,
> >   thanks for the analysis.
> >
> > Does the following patch fix it?  It makes raise_barrier  more  like
> > freeze_array().
> > If not, could you try making the same change to the first
> > wait_event_lock_irq in raise_barrier?
> >
> > Thanks.
> > NeilBrown
> >
> >
> >
> > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> > index 328fa2d..d34f892 100644
> > --- a/drivers/md/raid1.c
> > +++ b/drivers/md/raid1.c
> > @@ -828,9 +828,9 @@ static void raise_barrier(struct r1conf *conf)
> >   	conf->barrier++;
> >
> >   	/* Now wait for all pending IO to complete */
> > -	wait_event_lock_irq(conf->wait_barrier,
> > -			    !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
> > -			    conf->resync_lock);
> > +	wait_event_lock_irq_cmd(conf->wait_barrier,
> > +				!conf->nr_pending && conf->barrier < RESYNC_DEPTH,
> > +				conf->resync_lock, flush_pending_writes(conf));
> >
> >   	spin_unlock_irq(&conf->resync_lock);
> >   }
> >
> 
> Neil,
> 
> This deadlock also cropped up in 3.4 between .37 and .38.  Passing flush_pending_writes(conf) as cmd to wait_event_lock_irq seems to fix it there as well.
> 
> --- linux-3.4.38/drivers/md/raid1.c     2013-03-28 13:12:41.000000000 -0600
> +++ linux-3.4.38.patch/drivers/md/raid1.c       2013-06-04 12:17:35.314194903 -0600
> @@ -751,7 +751,7 @@
>          /* Now wait for all pending IO to complete */
>          wait_event_lock_irq(conf->wait_barrier,
>                              !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
> -                           conf->resync_lock, );
> +                           conf->resync_lock, flush_pending_writes(conf));
>   
>          spin_unlock_irq(&conf->resync_lock);
>   }
> 

I suspect it was already there in 3.4.37; in fact I think it was there
in 3.4.

I'll mark the fix for including in 3.4-stable

Thanks for the report.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2013-06-12  1:30 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-02-21 22:58 BUG - raid 1 deadlock on handle_read_error / wait_barrier Tregaron Bayly
2013-02-22  3:44 ` Joe Lawrence
2013-02-22 11:52 ` majianpeng
2013-02-22 16:03   ` Tregaron Bayly
2013-02-22 18:14     ` Joe Lawrence
2013-02-24 22:43 ` NeilBrown
2013-02-25  0:04   ` NeilBrown
2013-02-25 16:11     ` Tregaron Bayly
2013-02-25 22:54       ` NeilBrown
2013-02-26 14:09       ` Joe Lawrence
2013-05-16 14:07         ` Alexander Lyakas
2013-05-20  7:17           ` NeilBrown
2013-05-30 14:30             ` Alexander Lyakas
2013-06-02 12:43               ` Alexander Lyakas
2013-06-04  1:49                 ` NeilBrown
2013-06-04  9:52                   ` Alexander Lyakas
2013-06-06 15:00                   ` Tregaron Bayly
2013-06-08  9:45                     ` Alexander Lyakas
2013-06-12  0:42                       ` NeilBrown
2013-06-12  1:30                     ` NeilBrown
