* BUG - raid 1 deadlock on handle_read_error / wait_barrier @ 2013-02-21 22:58 Tregaron Bayly 2013-02-22 3:44 ` Joe Lawrence ` (2 more replies) 0 siblings, 3 replies; 20+ messages in thread From: Tregaron Bayly @ 2013-02-21 22:58 UTC (permalink / raw) To: linux-raid; +Cc: neilb Symptom: A RAID 1 array ends up with two threads (flush and raid1) stuck in D state forever. The array is inaccessible and the host must be restarted to restore access to the array. I have some scripted workloads that reproduce this within a maximum of a couple hours on kernels from 3.6.11 - 3.8-rc7. I cannot reproduce on 3.4.32. 3.5.7 ends up with three threads stuck in D state, but the stacks are different from this bug (as it's EOL maybe of interest in bisecting the problem?). Details: [flush-9:16] [<ffffffffa009f1a4>] wait_barrier+0x124/0x180 [raid1] [<ffffffffa00a2a15>] make_request+0x85/0xd50 [raid1] [<ffffffff813653c3>] md_make_request+0xd3/0x200 [<ffffffff811f494a>] generic_make_request+0xca/0x100 [<ffffffff811f49f9>] submit_bio+0x79/0x160 [<ffffffff811808f8>] submit_bh+0x128/0x200 [<ffffffff81182fe0>] __block_write_full_page+0x1d0/0x330 [<ffffffff8118320e>] block_write_full_page_endio+0xce/0x100 [<ffffffff81183255>] block_write_full_page+0x15/0x20 [<ffffffff81187908>] blkdev_writepage+0x18/0x20 [<ffffffff810f73b7>] __writepage+0x17/0x40 [<ffffffff810f8543>] write_cache_pages+0x1d3/0x4c0 [<ffffffff810f8881>] generic_writepages+0x51/0x80 [<ffffffff810f88d0>] do_writepages+0x20/0x40 [<ffffffff811782bb>] __writeback_single_inode+0x3b/0x160 [<ffffffff8117a8a9>] writeback_sb_inodes+0x1e9/0x430 [<ffffffff8117ab8e>] __writeback_inodes_wb+0x9e/0xd0 [<ffffffff8117ae9b>] wb_writeback+0x24b/0x2e0 [<ffffffff8117b171>] wb_do_writeback+0x241/0x250 [<ffffffff8117b222>] bdi_writeback_thread+0xa2/0x250 [<ffffffff8106414e>] kthread+0xce/0xe0 [<ffffffff81488a6c>] ret_from_fork+0x7c/0xb0 [<ffffffffffffffff>] 0xffffffffffffffff [md16-raid1] [<ffffffffa009ffb9>] handle_read_error+0x119/0x790 [raid1] [<ffffffffa00a0862>] raid1d+0x232/0x1060 [raid1] [<ffffffff813675a7>] md_thread+0x117/0x150 [<ffffffff8106414e>] kthread+0xce/0xe0 [<ffffffff81488a6c>] ret_from_fork+0x7c/0xb0 [<ffffffffffffffff>] 0xffffffffffffffff Both processes end up in wait_event_lock_irq() waiting for favorable conditions in the struct r1conf to proceed. These conditions obviously seem to never arrive. I placed printk statements in freeze_array() and wait_barrier() directly before calling their respective wait_event_lock_irq() and this is an example output: Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Attempting to freeze array: barrier (1), nr_waiting (1), nr_pending (5), nr_queued (3) Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (2), nr_pending (5), nr_queued (3) Feb 20 17:47:38 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (3), nr_pending (5), nr_queued (3) The flush seems to come after the attempt to handle the read error. I believe the incrementing nr_waiting comes from the multiple read processes I have going as they get stuck behind this deadlock. majienpeng may have been referring to this condition on the linux-raid list a few days ago (http://www.spinics.net/lists/raid/msg42150.html) when he stated "Above code is what's you said. But it met read-error, raid1d will blocked for ever. 
The reason is freeze_array" Environment: A RAID 1 array built on top of two multipath maps Devices underlying the maps are iscsi disks exported from a linux SAN Multipath is configured to queue i/o if no path for 5 retries (polling interval 5) before failing and causing the mirror to degrade. Reproducible on kernels 3.6.11 and up (x86_64) To reproduce, I create 10 raid 1 arrays on a client built on mpio over iscsi. For each of the arrays, I start the following to create some writes: $NODE=array_5; while :; do let TIME=($RANDOM % 10); let size=$RANDOM*6; sleep $TIME; dd if=/dev/zero of=/dev/md/$NODE bs=1024 count=$size; done I then start 10 processes each doing this to create some reads: while :; do let SAN=($RANDOM % 10); dd if=/dev/md/array_$SAN of=/dev/null bs=1024 count=50000; done In a screen I start a script running in a loop that monitors /proc/mdstat for failed arrays and does a fail/remove/re-add on the failed disk. (This is mainly so that I get more than one crack at reproducing the bug since the timing never happens just right on one try.) Now on one of the iscsi servers I start a loop that sleeps a random amount of time between 1-5 minutes then stops the export, sleeps again and then restores the export. The net effect of all this is that the disks on the initiator will queue i/o for a while when the export is off and eventually fail. In most cases this happens gracefully and i/o will go to the remaining disk. In the failure case we get the behavior described here. The test is setup to restore the arrays and try again until the bug is hit. We do see this situation in production with relative frequency so it's not something that happens only in theory under artificial conditions. Any help or insight here is much appreciated. Tregaron Bayly Systems Administrator Bluehost, Inc. ^ permalink raw reply [flat|nested] 20+ messages in thread
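For readers following the two stacks above in the source, the two waits involved look roughly like this in the raid1.c of that era (a trimmed paraphrase of drivers/md/raid1.c around v3.8, from memory and not verbatim; the field names match the printk output above):

static void wait_barrier(struct r1conf *conf)
{
	spin_lock_irq(&conf->resync_lock);
	if (conf->barrier) {
		conf->nr_waiting++;
		/* Wait for the barrier to drop -- unless requests are already
		 * pending and this task still has bios queued on its own
		 * current->bio_list, in which case it is let through so those
		 * bios can be issued and nr_pending can drain. */
		wait_event_lock_irq(conf->wait_barrier,
				    !conf->barrier ||
				    (conf->nr_pending &&
				     current->bio_list &&
				     !bio_list_empty(current->bio_list)),
				    conf->resync_lock);
		conf->nr_waiting--;
	}
	conf->nr_pending++;
	spin_unlock_irq(&conf->resync_lock);
}

/* Called by handle_read_error() before it retries the failed read. */
static void freeze_array(struct r1conf *conf)
{
	spin_lock_irq(&conf->resync_lock);
	conf->barrier++;
	conf->nr_waiting++;
	/* Wait until every pending request is either the failed one being
	 * handled here (the "+1") or parked on conf->retry_list (counted in
	 * nr_queued).  flush_pending_writes() is re-run each time this
	 * waiter is woken so that queued writes can still reach the disks. */
	wait_event_lock_irq_cmd(conf->wait_barrier,
				conf->nr_pending == conf->nr_queued + 1,
				conf->resync_lock,
				flush_pending_writes(conf));
	spin_unlock_irq(&conf->resync_lock);
}

With the barrier raised by handle_read_error(), the flush thread sits in wait_barrier() until conf->barrier drops, while freeze_array() waits for nr_pending to fall to nr_queued + 1 -- so a single pending request that can neither complete nor be queued for retry leaves both sides waiting forever.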
* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier 2013-02-21 22:58 BUG - raid 1 deadlock on handle_read_error / wait_barrier Tregaron Bayly @ 2013-02-22 3:44 ` Joe Lawrence 2013-02-22 11:52 ` majianpeng 2013-02-24 22:43 ` NeilBrown 2 siblings, 0 replies; 20+ messages in thread From: Joe Lawrence @ 2013-02-22 3:44 UTC (permalink / raw) To: Tregaron Bayly; +Cc: linux-raid, neilb On Thu, 21 Feb 2013, Tregaron Bayly wrote: > Symptom: > A RAID 1 array ends up with two threads (flush and raid1) stuck in D > state forever. The array is inaccessible and the host must be restarted > to restore access to the array. > > I have some scripted workloads that reproduce this within a maximum of a > couple hours on kernels from 3.6.11 - 3.8-rc7. I cannot reproduce on > 3.4.32. 3.5.7 ends up with three threads stuck in D state, but the > stacks are different from this bug (as it's EOL maybe of interest in > bisecting the problem?). > > Details: > > [flush-9:16] > [<ffffffffa009f1a4>] wait_barrier+0x124/0x180 [raid1] > [<ffffffffa00a2a15>] make_request+0x85/0xd50 [raid1] > [<ffffffff813653c3>] md_make_request+0xd3/0x200 > [<ffffffff811f494a>] generic_make_request+0xca/0x100 > [<ffffffff811f49f9>] submit_bio+0x79/0x160 > [<ffffffff811808f8>] submit_bh+0x128/0x200 > [<ffffffff81182fe0>] __block_write_full_page+0x1d0/0x330 > [<ffffffff8118320e>] block_write_full_page_endio+0xce/0x100 > [<ffffffff81183255>] block_write_full_page+0x15/0x20 > [<ffffffff81187908>] blkdev_writepage+0x18/0x20 > [<ffffffff810f73b7>] __writepage+0x17/0x40 > [<ffffffff810f8543>] write_cache_pages+0x1d3/0x4c0 > [<ffffffff810f8881>] generic_writepages+0x51/0x80 > [<ffffffff810f88d0>] do_writepages+0x20/0x40 > [<ffffffff811782bb>] __writeback_single_inode+0x3b/0x160 > [<ffffffff8117a8a9>] writeback_sb_inodes+0x1e9/0x430 > [<ffffffff8117ab8e>] __writeback_inodes_wb+0x9e/0xd0 > [<ffffffff8117ae9b>] wb_writeback+0x24b/0x2e0 > [<ffffffff8117b171>] wb_do_writeback+0x241/0x250 > [<ffffffff8117b222>] bdi_writeback_thread+0xa2/0x250 > [<ffffffff8106414e>] kthread+0xce/0xe0 > [<ffffffff81488a6c>] ret_from_fork+0x7c/0xb0 > [<ffffffffffffffff>] 0xffffffffffffffff > > [md16-raid1] > [<ffffffffa009ffb9>] handle_read_error+0x119/0x790 [raid1] > [<ffffffffa00a0862>] raid1d+0x232/0x1060 [raid1] > [<ffffffff813675a7>] md_thread+0x117/0x150 > [<ffffffff8106414e>] kthread+0xce/0xe0 > [<ffffffff81488a6c>] ret_from_fork+0x7c/0xb0 > [<ffffffffffffffff>] 0xffffffffffffffff > > Both processes end up in wait_event_lock_irq() waiting for favorable > conditions in the struct r1conf to proceed. These conditions obviously > seem to never arrive. I placed printk statements in freeze_array() and > wait_barrier() directly before calling their respective > wait_event_lock_irq() and this is an example output: > > Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Attempting to freeze array: barrier (1), nr_waiting (1), nr_pending (5), nr_queued (3) > Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (2), nr_pending (5), nr_queued (3) > Feb 20 17:47:38 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (3), nr_pending (5), nr_queued (3) > > The flush seems to come after the attempt to handle the read error. I > believe the incrementing nr_waiting comes from the multiple read > processes I have going as they get stuck behind this deadlock. 
> > majienpeng may have been referring to this condition on the linux-raid > list a few days ago (http://www.spinics.net/lists/raid/msg42150.html) > when he stated "Above code is what's you said. But it met read-error, > raid1d will blocked for ever. The reason is freeze_array" > > Environment: > A RAID 1 array built on top of two multipath maps > Devices underlying the maps are iscsi disks exported from a linux SAN > Multipath is configured to queue i/o if no path for 5 retries (polling > interval 5) before failing and causing the mirror to degrade. > > Reproducible on kernels 3.6.11 and up (x86_64) > > To reproduce, I create 10 raid 1 arrays on a client built on mpio over > iscsi. For each of the arrays, I start the following to create some > writes: > > $NODE=array_5; while :; do let TIME=($RANDOM % 10); let size=$RANDOM*6; > sleep $TIME; dd if=/dev/zero of=/dev/md/$NODE bs=1024 count=$size; done > > I then start 10 processes each doing this to create some reads: > > while :; do let SAN=($RANDOM % 10); dd if=/dev/md/array_$SAN > of=/dev/null bs=1024 count=50000; done > > In a screen I start a script running in a loop that > monitors /proc/mdstat for failed arrays and does a fail/remove/re-add on > the failed disk. (This is mainly so that I get more than one crack at > reproducing the bug since the timing never happens just right on one > try.) > > Now on one of the iscsi servers I start a loop that sleeps a random > amount of time between 1-5 minutes then stops the export, sleeps again > and then restores the export. > > The net effect of all this is that the disks on the initiator will queue > i/o for a while when the export is off and eventually fail. In most > cases this happens gracefully and i/o will go to the remaining disk. In > the failure case we get the behavior described here. The test is setup > to restore the arrays and try again until the bug is hit. > > We do see this situation in production with relative frequency so it's > not something that happens only in theory under artificial conditions. > > Any help or insight here is much appreciated. > > Tregaron Bayly > Systems Administrator > Bluehost, Inc. Hi Tregaron, I've been seeing a similar looking hang over the past few days. In my testing, I'm running 3.7.0-rcX with direct IO via fio out to an MD RAID1 array during surprise disk removals. (We script fail/remove/add when the controller is hotplugged.) This week I have seen three instances of the hang, usually after ~5 hours of testing, but once I added debugging printk's it went away :( For reference, my traces look like: INFO: task md127_raid1:500 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. md127_raid1 D ffff88107fd13d40 0 500 2 0x00000000 ffff880850643bc8 0000000000000046 ffff88085056c9b0 ffff880850643fd8 ffff880850643fd8 ffff880850643fd8 ffff880854683120 ffff88085056c9b0 ffff880850643bf8 ffff8810528d88e8 ffff880850643fd8 ffff8810543c6b10 Call Trace: [<ffffffff8165f4a9>] schedule+0x29/0x70 [<ffffffffa0161e8a>] handle_read_error+0xea/0x7b0 [raid1] [<ffffffff81092653>] ? __wake_up+0x53/0x70 [<ffffffff814dcd3b>] ? mddev_unlock+0x8b/0xf0 [<ffffffff81085eb0>] ? wake_up_bit+0x40/0x40 [<ffffffffa0162b39>] raid1d+0x219/0x1020 [raid1] [<ffffffff8107114a>] ? del_timer_sync+0x5a/0x70 [<ffffffff8165d74a>] ? schedule_timeout+0x17a/0x310 [<ffffffff8108607b>] ? prepare_to_wait+0x5b/0x90 [<ffffffff814dbced>] md_thread+0x10d/0x140 [<ffffffff81085eb0>] ? wake_up_bit+0x40/0x40 [<ffffffff814dbbe0>] ? 
md_rdev_init+0x140/0x140 [<ffffffff810854d0>] kthread+0xc0/0xd0 [<ffffffff81010000>] ? ftrace_define_fields_xen_mmu_flush_tlb+0x10/0x80 [<ffffffff81085410>] ? flush_kthread_worker+0x80/0x80 [<ffffffff8166906c>] ret_from_fork+0x7c/0xb0 [<ffffffff81085410>] ? flush_kthread_worker+0x80/0x80 (handle_read_error+0xea puts me at the wait_event_lock_irq_cmd in freeze_array.) INFO: task jbd2/md127-8:547 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. jbd2/md127-8 D ffff88107fdd3d40 0 547 2 0x00000000 ffff88085060f9c8 0000000000000046 ffff88084d768000 ffff88085060ffd8 ffff88085060ffd8 ffff88085060ffd8 ffff8808546b0000 ffff88084d768000 ffffffff8113c1a5 ffff8810543c6aa8 ffff8810543c6b10 ffff88085060ffd8 Call Trace: [<ffffffff8113c1a5>] ? mempool_alloc_slab+0x15/0x20 [<ffffffff8165f4a9>] schedule+0x29/0x70 [<ffffffffa01610f9>] wait_barrier+0xc9/0x170 [raid1] [<ffffffff81085eb0>] ? wake_up_bit+0x40/0x40 [<ffffffffa0163bdb>] make_request+0x7b/0xce0 [raid1] [<ffffffff8113c1a5>] ? mempool_alloc_slab+0x15/0x20 [<ffffffff814db837>] md_make_request+0x107/0x260 [<ffffffff8113c1a5>] ? mempool_alloc_slab+0x15/0x20 [<ffffffff812e6672>] generic_make_request+0xc2/0x110 [<ffffffff812e6a09>] submit_bio+0x79/0x160 [<ffffffff811dbd35>] ? bio_alloc_bioset+0x65/0x120 [<ffffffff811d6954>] submit_bh+0x114/0x1e0 [<ffffffff812794af>] jbd2_journal_commit_transaction+0x75f/0x1c40 [<ffffffff8127f441>] kjournald2+0xd1/0x2a0 [<ffffffff81085eb0>] ? wake_up_bit+0x40/0x40 [<ffffffff8127f370>] ? jbd2_journal_release_jbd_inode+0x140/0x140 [<ffffffff810854d0>] kthread+0xc0/0xd0 [<ffffffff81010000>] ? ftrace_define_fields_xen_mmu_flush_tlb+0x10/0x80 [<ffffffff81085410>] ? flush_kthread_worker+0x80/0x80 [<ffffffff8166906c>] ret_from_fork+0x7c/0xb0 [<ffffffff81085410>] ? flush_kthread_worker+0x80/0x80 INFO: task flush-9:127:580 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. flush-9:127 D ffff88107fc13d40 0 580 2 0x00000000 ffff880851b1b5c8 0000000000000046 ffff880850568000 ffff880851b1bfd8 ffff880851b1bfd8 ffff880851b1bfd8 ffff880854666240 ffff880850568000 00000000000000bb ffff8810543c6aa8 ffff8810543c6b10 ffff880851b1bfd8 Call Trace: [<ffffffff8165f4a9>] schedule+0x29/0x70 [<ffffffffa01610f9>] wait_barrier+0xc9/0x170 [raid1] [<ffffffff81085eb0>] ? wake_up_bit+0x40/0x40 [<ffffffffa0163bdb>] make_request+0x7b/0xce0 [raid1] [<ffffffff811dbc1a>] ? bvec_alloc_bs+0x6a/0x120 [<ffffffff814db837>] md_make_request+0x107/0x260 [<ffffffff811db4f4>] ? __bio_add_page.part.13+0xf4/0x220 [<ffffffff812e6672>] generic_make_request+0xc2/0x110 [<ffffffff812e6a09>] submit_bio+0x79/0x160 [<ffffffff812343b9>] ext4_io_submit+0x29/0x60 [<ffffffff8122f73f>] mpage_da_submit_io+0x31f/0x5c0 [<ffffffff812325a5>] ? write_cache_pages_da+0x175/0x4f0 [<ffffffff81276907>] ? jbd2__journal_start+0x87/0x110 [<ffffffff8123204a>] mpage_da_map_and_submit+0x11a/0x420 [<ffffffff81232c77>] ? ext4_da_writepages+0x357/0x660 [<ffffffff812769a3>] ? jbd2_journal_start+0x13/0x20 [<ffffffff81232cc8>] ext4_da_writepages+0x3a8/0x660 [<ffffffff811458de>] do_writepages+0x1e/0x40 [<ffffffff811cdbeb>] __writeback_single_inode+0x3b/0x190 [<ffffffff811ceb1e>] writeback_sb_inodes+0x1ee/0x4e0 [<ffffffff811ceeaf>] __writeback_inodes_wb+0x9f/0xd0 [<ffffffff811d061b>] wb_writeback+0x2fb/0x320 [<ffffffff811d0acf>] wb_do_writeback+0x22f/0x240 [<ffffffff811d0b73>] bdi_writeback_thread+0x93/0x280 [<ffffffff811d0ae0>] ? 
wb_do_writeback+0x240/0x240 [<ffffffff810854d0>] kthread+0xc0/0xd0 [<ffffffff81010000>] ? ftrace_define_fields_xen_mmu_flush_tlb+0x10/0x80 [<ffffffff81085410>] ? flush_kthread_worker+0x80/0x80 [<ffffffff8166906c>] ret_from_fork+0x7c/0xb0 [<ffffffff81085410>] ? flush_kthread_worker+0x80/0x80 And a few stats before the machine completely wedged: [root@dhcp-linux-2103-650 Breaker]# cat /proc/mdstat Personalities : [raid1] md124 : active raid1 sdq3[4] 4192192 blocks super 1.2 [2/1] [_U] bitmap: 0/1 pages [0KB], 65536KB chunk md125 : active raid1 sdq2[0] 16762752 blocks super 1.2 [2/1] [U_] bitmap: 0/1 pages [0KB], 65536KB chunk md126 : active raid1 sdq1[1] 16779136 blocks super 1.0 [2/1] [_U] bitmap: 0/1 pages [0KB], 65536KB chunk md127 : active raid1 sdq5[0] sds5[1](F) 255080256 blocks super 1.2 [2/1] [U_] bitmap: 1/2 pages [4KB], 65536KB chunk unused devices: <none> [root@dhcp-linux-2103-650 Breaker]# mdadm --detail /dev/md127 /dev/md127: Version : 1.2 Creation Time : Thu Jan 17 16:56:09 2013 Raid Level : raid1 Array Size : 255080256 (243.26 GiB 261.20 GB) Used Dev Size : 255080256 (243.26 GiB 261.20 GB) Raid Devices : 2 Total Devices : 2 Persistence : Superblock is persistent Intent Bitmap : Internal Update Time : Tue Feb 12 21:35:02 2013 State : active, degraded Active Devices : 1 Working Devices : 1 Failed Devices : 1 Spare Devices : 0 Name : dhcp-Linux-2103-627:root UUID : b5e4bddd:e76802b0:3b6badd6:e85ebc27 Events : 128267 Number Major Minor RaidDevice State 0 65 5 0 active sync /dev/sdq5 1 65 37 1 faulty I'll try kicking off the test again tonight (sans debug printk's) and see if I can get a kdump out of the hang. Regards, -- Joe ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier 2013-02-21 22:58 BUG - raid 1 deadlock on handle_read_error / wait_barrier Tregaron Bayly 2013-02-22 3:44 ` Joe Lawrence @ 2013-02-22 11:52 ` majianpeng 2013-02-22 16:03 ` Tregaron Bayly 2013-02-24 22:43 ` NeilBrown 2 siblings, 1 reply; 20+ messages in thread From: majianpeng @ 2013-02-22 11:52 UTC (permalink / raw) To: tbayly; +Cc: linux-raid, neilb On 02/22/2013 06:58 AM, Tregaron Bayly wrote: > Symptom: > A RAID 1 array ends up with two threads (flush and raid1) stuck in D > state forever. The array is inaccessible and the host must be restarted > to restore access to the array. > > I have some scripted workloads that reproduce this within a maximum of a > couple hours on kernels from 3.6.11 - 3.8-rc7. I cannot reproduce on > 3.4.32. 3.5.7 ends up with three threads stuck in D state, but the > stacks are different from this bug (as it's EOL maybe of interest in > bisecting the problem?). > > Details: > > [flush-9:16] > [<ffffffffa009f1a4>] wait_barrier+0x124/0x180 [raid1] > [<ffffffffa00a2a15>] make_request+0x85/0xd50 [raid1] > [<ffffffff813653c3>] md_make_request+0xd3/0x200 > [<ffffffff811f494a>] generic_make_request+0xca/0x100 > [<ffffffff811f49f9>] submit_bio+0x79/0x160 > [<ffffffff811808f8>] submit_bh+0x128/0x200 > [<ffffffff81182fe0>] __block_write_full_page+0x1d0/0x330 > [<ffffffff8118320e>] block_write_full_page_endio+0xce/0x100 > [<ffffffff81183255>] block_write_full_page+0x15/0x20 > [<ffffffff81187908>] blkdev_writepage+0x18/0x20 > [<ffffffff810f73b7>] __writepage+0x17/0x40 > [<ffffffff810f8543>] write_cache_pages+0x1d3/0x4c0 > [<ffffffff810f8881>] generic_writepages+0x51/0x80 > [<ffffffff810f88d0>] do_writepages+0x20/0x40 > [<ffffffff811782bb>] __writeback_single_inode+0x3b/0x160 > [<ffffffff8117a8a9>] writeback_sb_inodes+0x1e9/0x430 > [<ffffffff8117ab8e>] __writeback_inodes_wb+0x9e/0xd0 > [<ffffffff8117ae9b>] wb_writeback+0x24b/0x2e0 > [<ffffffff8117b171>] wb_do_writeback+0x241/0x250 > [<ffffffff8117b222>] bdi_writeback_thread+0xa2/0x250 > [<ffffffff8106414e>] kthread+0xce/0xe0 > [<ffffffff81488a6c>] ret_from_fork+0x7c/0xb0 > [<ffffffffffffffff>] 0xffffffffffffffff > > [md16-raid1] > [<ffffffffa009ffb9>] handle_read_error+0x119/0x790 [raid1] > [<ffffffffa00a0862>] raid1d+0x232/0x1060 [raid1] > [<ffffffff813675a7>] md_thread+0x117/0x150 > [<ffffffff8106414e>] kthread+0xce/0xe0 > [<ffffffff81488a6c>] ret_from_fork+0x7c/0xb0 > [<ffffffffffffffff>] 0xffffffffffffffff > > Both processes end up in wait_event_lock_irq() waiting for favorable > conditions in the struct r1conf to proceed. These conditions obviously > seem to never arrive. I placed printk statements in freeze_array() and > wait_barrier() directly before calling their respective > wait_event_lock_irq() and this is an example output: > > Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Attempting to freeze array: barrier (1), nr_waiting (1), nr_pending (5), nr_queued (3) > Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (2), nr_pending (5), nr_queued (3) > Feb 20 17:47:38 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (3), nr_pending (5), nr_queued (3) From those message,there's a request which will be completed or met error. If completed, the ->nr_pening decrease one. If request met error, it add ->retry_list and the ->nr_queued add one. So in two conditions,the hung will be happened. 
What's the state of this request? Maybe this bug caused by lower layer drivers. > The flush seems to come after the attempt to handle the read error. I > believe the incrementing nr_waiting comes from the multiple read > processes I have going as they get stuck behind this deadlock. > > majienpeng may have been referring to this condition on the linux-raid > list a few days ago (http://www.spinics.net/lists/raid/msg42150.html) > when he stated "Above code is what's you said. But it met read-error, > raid1d will blocked for ever. The reason is freeze_array" > > Environment: > A RAID 1 array built on top of two multipath maps > Devices underlying the maps are iscsi disks exported from a linux SAN > Multipath is configured to queue i/o if no path for 5 retries (polling > interval 5) before failing and causing the mirror to degrade. > > Reproducible on kernels 3.6.11 and up (x86_64) > > To reproduce, I create 10 raid 1 arrays on a client built on mpio over > iscsi. For each of the arrays, I start the following to create some > writes: > > $NODE=array_5; while :; do let TIME=($RANDOM % 10); let size=$RANDOM*6; > sleep $TIME; dd if=/dev/zero of=/dev/md/$NODE bs=1024 count=$size; done > > I then start 10 processes each doing this to create some reads: > > while :; do let SAN=($RANDOM % 10); dd if=/dev/md/array_$SAN > of=/dev/null bs=1024 count=50000; done > > In a screen I start a script running in a loop that > monitors /proc/mdstat for failed arrays and does a fail/remove/re-add on > the failed disk. (This is mainly so that I get more than one crack at > reproducing the bug since the timing never happens just right on one > try.) > > Now on one of the iscsi servers I start a loop that sleeps a random > amount of time between 1-5 minutes then stops the export, sleeps again > and then restores the export. > > The net effect of all this is that the disks on the initiator will queue > i/o for a while when the export is off and eventually fail. In most > cases this happens gracefully and i/o will go to the remaining disk. In > the failure case we get the behavior described here. The test is setup > to restore the arrays and try again until the bug is hit. > > We do see this situation in production with relative frequency so it's > not something that happens only in theory under artificial conditions. > > Any help or insight here is much appreciated. > > Tregaron Bayly > Systems Administrator > Bluehost, Inc. > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 20+ messages in thread
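The two outcomes majianpeng describes correspond to two small helpers in raid1.c; paraphrased from the ~v3.8 source (trimmed sketch, not verbatim), they are allow_barrier(), reached when a request completes, and reschedule_retry(), used when a request fails and must be handed to raid1d:

/* Normal completion: the request gives its barrier "ticket" back. */
static void allow_barrier(struct r1conf *conf)
{
	unsigned long flags;

	spin_lock_irqsave(&conf->resync_lock, flags);
	conf->nr_pending--;
	spin_unlock_irqrestore(&conf->resync_lock, flags);
	wake_up(&conf->wait_barrier);
}

/* Failure: park the r1_bio for raid1d and account for it in nr_queued. */
static void reschedule_retry(struct r1bio *r1_bio)
{
	unsigned long flags;
	struct mddev *mddev = r1_bio->mddev;
	struct r1conf *conf = mddev->private;

	spin_lock_irqsave(&conf->device_lock, flags);
	list_add(&r1_bio->retry_list, &conf->retry_list);
	conf->nr_queued++;
	spin_unlock_irqrestore(&conf->device_lock, flags);

	wake_up(&conf->wait_barrier);
	md_wakeup_thread(mddev->thread);
}

freeze_array()'s condition nr_pending == nr_queued + 1 can only be met once every outstanding request has taken one of these two exits; a write still parked on conf->pending_bio_list has taken neither, which is the case Neil identifies below.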
* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier 2013-02-22 11:52 ` majianpeng @ 2013-02-22 16:03 ` Tregaron Bayly 2013-02-22 18:14 ` Joe Lawrence 0 siblings, 1 reply; 20+ messages in thread From: Tregaron Bayly @ 2013-02-22 16:03 UTC (permalink / raw) To: majianpeng; +Cc: linux-raid, neilb On Fri, 2013-02-22 at 19:52 +0800, majianpeng wrote: > > Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Attempting to freeze array: barrier (1), nr_waiting (1), nr_pending (5), nr_queued (3) > > Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (2), nr_pending (5), nr_queued (3) > > Feb 20 17:47:38 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (3), nr_pending (5), nr_queued (3) > From those message,there's a request which will be completed or met error. > If completed, the ->nr_pening decrease one. > If request met error, it add ->retry_list and the ->nr_queued add one. > So in two conditions,the hung will be happened. > What's the state of this request? Maybe this bug caused by lower layer drivers. That's a good question. Do you have any suggestion as to how I might answer it? As I say, I can readily reproduce it so if you have some insight into how to capture that info I can do it with no problem. Also, I don't know for sure but it didn't sound like Joe Lawrence is using the same lower layer drivers but he has the same problem. Tregaron ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier 2013-02-22 16:03 ` Tregaron Bayly @ 2013-02-22 18:14 ` Joe Lawrence 0 siblings, 0 replies; 20+ messages in thread From: Joe Lawrence @ 2013-02-22 18:14 UTC (permalink / raw) To: Tregaron Bayly; +Cc: majianpeng, linux-raid, neilb On Fri, 22 Feb 2013, Tregaron Bayly wrote: > Also, I don't know for sure but it didn't sound like Joe Lawrence is > using the same lower layer drivers but he has the same problem. I'm running a slightly modified mpt2sas driver under 3.8.0rcX (not 3.7 as I mentioned earlier). I did capture a kdump vmcore on the freeze_array hang, but haven't had time to look at it yet. Regards, -- Joe ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier 2013-02-21 22:58 BUG - raid 1 deadlock on handle_read_error / wait_barrier Tregaron Bayly 2013-02-22 3:44 ` Joe Lawrence 2013-02-22 11:52 ` majianpeng @ 2013-02-24 22:43 ` NeilBrown 2013-02-25 0:04 ` NeilBrown 2 siblings, 1 reply; 20+ messages in thread From: NeilBrown @ 2013-02-24 22:43 UTC (permalink / raw) To: tbayly; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 2274 bytes --] On Thu, 21 Feb 2013 15:58:24 -0700 Tregaron Bayly <tbayly@bluehost.com> wrote: > Symptom: > A RAID 1 array ends up with two threads (flush and raid1) stuck in D > state forever. The array is inaccessible and the host must be restarted > to restore access to the array. > > I have some scripted workloads that reproduce this within a maximum of a > couple hours on kernels from 3.6.11 - 3.8-rc7. I cannot reproduce on > 3.4.32. 3.5.7 ends up with three threads stuck in D state, but the > stacks are different from this bug (as it's EOL maybe of interest in > bisecting the problem?). Can you post the 3 stacks from the 3.5.7 case? It might help get a more complete understanding. ... > Both processes end up in wait_event_lock_irq() waiting for favorable > conditions in the struct r1conf to proceed. These conditions obviously > seem to never arrive. I placed printk statements in freeze_array() and > wait_barrier() directly before calling their respective > wait_event_lock_irq() and this is an example output: > > Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Attempting to freeze array: barrier (1), nr_waiting (1), nr_pending (5), nr_queued (3) > Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (2), nr_pending (5), nr_queued (3) > Feb 20 17:47:38 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (3), nr_pending (5), nr_queued (3) This is very useful, thanks. Clearly there is one 'pending' request that isn't being counted, but also isn't being allowed to complete. Maybe it is in pending_bio_list, and so counted in conf->pending_count. Could you print out that value as well and try to trigger the bug again? If conf->pending_count is non-zero, then it seems very likely that we have found the problem. Fixing it isn't quite so easy. 'nr_pending' counts requests from the filesystem that are still pending. 'pending_count' counts requests down to the underlying device that are still pending. There isn't a 1-to-1 correspondence, so we cannot just subtract one from the other. It will require more thought than that. Thanks for the thorough report, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier 2013-02-24 22:43 ` NeilBrown @ 2013-02-25 0:04 ` NeilBrown 2013-02-25 16:11 ` Tregaron Bayly 0 siblings, 1 reply; 20+ messages in thread From: NeilBrown @ 2013-02-25 0:04 UTC (permalink / raw) To: NeilBrown; +Cc: tbayly, linux-raid [-- Attachment #1: Type: text/plain, Size: 2690 bytes --] On Mon, 25 Feb 2013 09:43:50 +1100 NeilBrown <neilb@suse.de> wrote: > On Thu, 21 Feb 2013 15:58:24 -0700 Tregaron Bayly <tbayly@bluehost.com> wrote: > > > Symptom: > > A RAID 1 array ends up with two threads (flush and raid1) stuck in D > > state forever. The array is inaccessible and the host must be restarted > > to restore access to the array. > > > > I have some scripted workloads that reproduce this within a maximum of a > > couple hours on kernels from 3.6.11 - 3.8-rc7. I cannot reproduce on > > 3.4.32. 3.5.7 ends up with three threads stuck in D state, but the > > stacks are different from this bug (as it's EOL maybe of interest in > > bisecting the problem?). > > Can you post the 3 stacks from the 3.5.7 case? It might help get a more > complete understanding. > > ... > > Both processes end up in wait_event_lock_irq() waiting for favorable > > conditions in the struct r1conf to proceed. These conditions obviously > > seem to never arrive. I placed printk statements in freeze_array() and > > wait_barrier() directly before calling their respective > > wait_event_lock_irq() and this is an example output: > > > > Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Attempting to freeze array: barrier (1), nr_waiting (1), nr_pending (5), nr_queued (3) > > Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (2), nr_pending (5), nr_queued (3) > > Feb 20 17:47:38 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (3), nr_pending (5), nr_queued (3) > > This is very useful, thanks. Clearly there is one 'pending' request that > isn't being counted, but also isn't being allowed to complete. > Maybe it is in pending_bio_list, and so counted in conf->pending_count. > > Could you print out that value as well and try to trigger the bug again? If > conf->pending_count is non-zero, then it seems very likely the we have found > the problem. Actually don't bother. I think I've found the problem. It is related to pending_count and is easy to fix. Could you try this patch please? Thanks. NeilBrown diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 6e5d5a5..fd86b37 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -967,6 +967,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule) bio_list_merge(&conf->pending_bio_list, &plug->pending); conf->pending_count += plug->pending_cnt; spin_unlock_irq(&conf->device_lock); + wake_up(&conf->wait_barrier); md_wakeup_thread(mddev->thread); kfree(plug); return; [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply related [flat|nested] 20+ messages in thread
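For context on why a single wake_up() is enough here: freeze_array() sleeps in wait_event_lock_irq_cmd(), and that macro only re-checks its condition -- and only re-runs its cmd, in this case flush_pending_writes(conf) -- when something wakes the wait queue it sleeps on. Roughly (a simplified rendering of the ~v3.8 helper in include/linux/wait.h, not verbatim):

#define __wait_event_lock_irq(wq, condition, lock, cmd)			\
do {									\
	DEFINE_WAIT(__wait);						\
	for (;;) {							\
		prepare_to_wait(&wq, &__wait, TASK_UNINTERRUPTIBLE);	\
		if (condition)						\
			break;						\
		spin_unlock_irq(&lock);					\
		cmd;		/* e.g. flush_pending_writes(conf) */	\
		schedule();	/* sleeps until wq is woken */		\
		spin_lock_irq(&lock);					\
	}								\
	finish_wait(&wq, &__wait);					\
} while (0)

So when raid1_unplug() moved a plugged batch of writes onto conf->pending_bio_list it only called md_wakeup_thread() -- but in this hang the raid1d thread is the very task sleeping inside freeze_array(), so nothing woke conf->wait_barrier, flush_pending_writes() was never re-run, and the freeze condition could never become true. The added wake_up() lets the sleeping freeze_array() loop around and flush the list itself.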
* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier 2013-02-25 0:04 ` NeilBrown @ 2013-02-25 16:11 ` Tregaron Bayly 2013-02-25 22:54 ` NeilBrown 2013-02-26 14:09 ` Joe Lawrence 0 siblings, 2 replies; 20+ messages in thread From: Tregaron Bayly @ 2013-02-25 16:11 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid > Actually don't bother. I think I've found the problem. It is related to > pending_count and is easy to fix. > Could you try this patch please? > > Thanks. > NeilBrown > > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c > index 6e5d5a5..fd86b37 100644 > --- a/drivers/md/raid1.c > +++ b/drivers/md/raid1.c > @@ -967,6 +967,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule) > bio_list_merge(&conf->pending_bio_list, &plug->pending); > conf->pending_count += plug->pending_cnt; > spin_unlock_irq(&conf->device_lock); > + wake_up(&conf->wait_barrier); > md_wakeup_thread(mddev->thread); > kfree(plug); > return; Running 15 hours now and no sign of the problem, which is 12 hours longer than it took to trigger the bug in the past. I'll continue testing to be sure but I think this patch is a fix. Thanks for the fast response! Tregaron Bayly ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier 2013-02-25 16:11 ` Tregaron Bayly @ 2013-02-25 22:54 ` NeilBrown 2013-02-26 14:09 ` Joe Lawrence 0 siblings, 0 replies; 20+ messages in thread From: NeilBrown @ 2013-02-25 22:54 UTC (permalink / raw) To: tbayly; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 1199 bytes --] On Mon, 25 Feb 2013 09:11:02 -0700 Tregaron Bayly <tbayly@bluehost.com> wrote: > > Actually don't bother. I think I've found the problem. It is related to > > pending_count and is easy to fix. > > Could you try this patch please? > > > > Thanks. > > NeilBrown > > > > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c > > index 6e5d5a5..fd86b37 100644 > > --- a/drivers/md/raid1.c > > +++ b/drivers/md/raid1.c > > @@ -967,6 +967,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule) > > bio_list_merge(&conf->pending_bio_list, &plug->pending); > > conf->pending_count += plug->pending_cnt; > > spin_unlock_irq(&conf->device_lock); > > + wake_up(&conf->wait_barrier); > > md_wakeup_thread(mddev->thread); > > kfree(plug); > > return; > > Running 15 hours now and no sign of the problem, which is 12 hours > longer than it took to trigger the bug in the past. I'll continue > testing to be sure but I think this patch is a fix. Thanks. I've added a "Tested-by" from you to the patch header. Should go to Linus, and thence to -stable in a day or 2. NeilBrown > > Thanks for the fast response! > > Tregaron Bayly [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier 2013-02-25 16:11 ` Tregaron Bayly 2013-02-25 22:54 ` NeilBrown @ 2013-02-26 14:09 ` Joe Lawrence 2013-05-16 14:07 ` Alexander Lyakas 1 sibling, 1 reply; 20+ messages in thread From: Joe Lawrence @ 2013-02-26 14:09 UTC (permalink / raw) To: Tregaron Bayly; +Cc: NeilBrown, linux-raid Same here: after ~15 hrs, ~300 surprise device removals and fio stress, no hung tasks to report. -- Joe On Mon, 25 Feb 2013, Tregaron Bayly wrote: > > Actually don't bother. I think I've found the problem. It is related to > > pending_count and is easy to fix. > > Could you try this patch please? > > > > Thanks. > > NeilBrown > > > > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c > > index 6e5d5a5..fd86b37 100644 > > --- a/drivers/md/raid1.c > > +++ b/drivers/md/raid1.c > > @@ -967,6 +967,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule) > > bio_list_merge(&conf->pending_bio_list, &plug->pending); > > conf->pending_count += plug->pending_cnt; > > spin_unlock_irq(&conf->device_lock); > > + wake_up(&conf->wait_barrier); > > md_wakeup_thread(mddev->thread); > > kfree(plug); > > return; > > Running 15 hours now and no sign of the problem, which is 12 hours > longer than it took to trigger the bug in the past. I'll continue > testing to be sure but I think this patch is a fix. > > Thanks for the fast response! > > Tregaron Bayly > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier 2013-02-26 14:09 ` Joe Lawrence @ 2013-05-16 14:07 ` Alexander Lyakas 2013-05-20 7:17 ` NeilBrown 0 siblings, 1 reply; 20+ messages in thread From: Alexander Lyakas @ 2013-05-16 14:07 UTC (permalink / raw) To: NeilBrown; +Cc: Tregaron Bayly, linux-raid, Shyam Kaushik Hello Neil, we are hitting issue that looks very similar; we are on kernel 3.8.2. The array is a 2-device raid1, which experienced a drive failure, but then drive was removed and re-added back to the array (although rebuild never started). Relevant parts of the kernel log: May 16 11:12:14 kernel: [46850.090499] md/raid1:md2: Disk failure on dm-6, disabling device. May 16 11:12:14 kernel: [46850.090499] md/raid1:md2: Operation continuing on 1 devices. May 16 11:12:14 kernel: [46850.090511] md: super_written gets error=-5, uptodate=0 May 16 11:12:14 kernel: [46850.090676] md/raid1:md2: dm-6: rescheduling sector 18040736 May 16 11:12:14 kernel: [46850.090764] md/raid1:md2: dm-6: rescheduling sector 20367040 May 16 11:12:14 kernel: [46850.090826] md/raid1:md2: dm-6: rescheduling sector 17613504 May 16 11:12:14 kernel: [46850.090883] md/raid1:md2: dm-6: rescheduling sector 18042720 May 16 11:12:15 kernel: [46850.229970] md/raid1:md2: redirecting sector 18040736 to other mirror: dm-13 May 16 11:12:15 kernel: [46850.257687] md/raid1:md2: redirecting sector 20367040 to other mirror: dm-13 May 16 11:12:15 kernel: [46850.268731] md/raid1:md2: redirecting sector 17613504 to other mirror: dm-13 May 16 11:12:15 kernel: [46850.398242] md/raid1:md2: redirecting sector 18042720 to other mirror: dm-13 May 16 11:12:23 kernel: [46858.448465] md: unbind<dm-6> May 16 11:12:23 kernel: [46858.456081] md: export_rdev(dm-6) May 16 11:23:19 kernel: [47515.062547] md: bind<dm-6> May 16 11:24:28 kernel: [47583.920126] INFO: task md2_raid1:15749 blocked for more than 60 seconds. May 16 11:24:28 kernel: [47583.921829] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. May 16 11:24:28 kernel: [47583.923361] md2_raid1 D 0000000000000001 0 15749 2 0x00000000 May 16 11:24:28 kernel: [47583.923367] ffff880097c23c28 0000000000000046 ffff880000000002 00000000982c43b8 May 16 11:24:28 kernel: [47583.923372] ffff880097c23fd8 ffff880097c23fd8 ffff880097c23fd8 0000000000013f40 May 16 11:24:28 kernel: [47583.923376] ffff880119b11740 ffff8800a5489740 ffff880097c23c38 ffff88011609d3c0 May 16 11:24:28 kernel: [47583.923381] Call Trace: May 16 11:24:28 kernel: [47583.923395] [<ffffffff816eb399>] schedule+0x29/0x70 May 16 11:24:28 kernel: [47583.923402] [<ffffffffa0516736>] raise_barrier+0x106/0x160 [raid1] May 16 11:24:28 kernel: [47583.923410] [<ffffffff8107fcc0>] ? add_wait_queue+0x60/0x60 May 16 11:24:28 kernel: [47583.923415] [<ffffffffa0516af7>] raid1_add_disk+0x197/0x200 [raid1] May 16 11:24:28 kernel: [47583.923421] [<ffffffff81567fa4>] remove_and_add_spares+0x104/0x220 May 16 11:24:28 kernel: [47583.923426] [<ffffffff8156a02d>] md_check_recovery.part.49+0x40d/0x530 May 16 11:24:28 kernel: [47583.923429] [<ffffffff8156a165>] md_check_recovery+0x15/0x20 May 16 11:24:28 kernel: [47583.923433] [<ffffffffa0517e42>] raid1d+0x22/0x180 [raid1] May 16 11:24:28 kernel: [47583.923439] [<ffffffff81045cd9>] ? default_spin_lock_flags+0x9/0x10 May 16 11:24:28 kernel: [47583.923443] [<ffffffff81045cd9>] ? default_spin_lock_flags+0x9/0x10 May 16 11:24:28 kernel: [47583.923449] [<ffffffff815624ed>] md_thread+0x10d/0x140 May 16 11:24:28 kernel: [47583.923453] [<ffffffff8107fcc0>] ? 
add_wait_queue+0x60/0x60 May 16 11:24:28 kernel: [47583.923457] [<ffffffff815623e0>] ? md_rdev_init+0x140/0x140 May 16 11:24:28 kernel: [47583.923461] [<ffffffff8107f0d0>] kthread+0xc0/0xd0 May 16 11:24:28 kernel: [47583.923465] [<ffffffff8107f010>] ? flush_kthread_worker+0xb0/0xb0 May 16 11:24:28 kernel: [47583.923472] [<ffffffff816f506c>] ret_from_fork+0x7c/0xb0 May 16 11:24:28 kernel: [47583.923476] [<ffffffff8107f010>] ? flush_kthread_worker+0xb0/0xb0 dm-13 is the good drive, dm-6 is the one that failed. At this point, we have several threads calling submit_bio and all stuck like this: cat /proc/16218/stack [<ffffffffa0516d6e>] wait_barrier+0xbe/0x160 [raid1] [<ffffffffa0518627>] make_request+0x87/0xa90 [raid1] [<ffffffff81561ed0>] md_make_request+0xd0/0x200 [<ffffffff8132bcaa>] generic_make_request+0xca/0x100 [<ffffffff8132bd5b>] submit_bio+0x7b/0x160 ... And md raid1 thread stuck like this: cat /proc/15749/stack [<ffffffffa0516736>] raise_barrier+0x106/0x160 [raid1] [<ffffffffa0516af7>] raid1_add_disk+0x197/0x200 [raid1] [<ffffffff81567fa4>] remove_and_add_spares+0x104/0x220 [<ffffffff8156a02d>] md_check_recovery.part.49+0x40d/0x530 [<ffffffff8156a165>] md_check_recovery+0x15/0x20 [<ffffffffa0517e42>] raid1d+0x22/0x180 [raid1] [<ffffffff815624ed>] md_thread+0x10d/0x140 [<ffffffff8107f0d0>] kthread+0xc0/0xd0 [<ffffffff816f506c>] ret_from_fork+0x7c/0xb0 [<ffffffffffffffff>] 0xffffffffffffffff We have also two user-space threads stuck: one is trying to read /sys/block/md2/md/array_state and its kernel stack is: # cat /proc/2251/stack [<ffffffff81564602>] md_attr_show+0x72/0xf0 [<ffffffff8120f116>] fill_read_buffer.isra.8+0x66/0xf0 [<ffffffff8120f244>] sysfs_read_file+0xa4/0xc0 [<ffffffff8119b0d0>] vfs_read+0xb0/0x180 [<ffffffff8119b1f2>] sys_read+0x52/0xa0 [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f [<ffffffffffffffff>] 0xffffffffffffffff the other wants to read from /proc/mdstat and is: [<ffffffff81563d2b>] md_seq_show+0x4b/0x540 [<ffffffff811bdd1b>] seq_read+0x16b/0x400 [<ffffffff811ff572>] proc_reg_read+0x82/0xc0 [<ffffffff8119b0d0>] vfs_read+0xb0/0x180 [<ffffffff8119b1f2>] sys_read+0x52/0xa0 [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f [<ffffffffffffffff>] 0xffffffffffffffff mdadm --detail also gets stuck if attempted, in stack like this: cat /proc/2864/stack [<ffffffff81564602>] md_attr_show+0x72/0xf0 [<ffffffff8120f116>] fill_read_buffer.isra.8+0x66/0xf0 [<ffffffff8120f244>] sysfs_read_file+0xa4/0xc0 [<ffffffff8119b0d0>] vfs_read+0xb0/0x180 [<ffffffff8119b1f2>] sys_read+0x52/0xa0 [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f [<ffffffffffffffff>] 0xffffffffffffffff Might your patch https://patchwork.kernel.org/patch/2260051/ fix this issue? Is this patch alone applicable to kernel 3.8.2? Can you pls kindly comment on this. Thanks for your help, Alex. On Tue, Feb 26, 2013 at 4:09 PM, Joe Lawrence <Joe.Lawrence@stratus.com> wrote: > > Same here: after ~15 hrs, ~300 surprise device removals and fio stress, > no hung tasks to report. > > -- Joe > > On Mon, 25 Feb 2013, Tregaron Bayly wrote: > > > > Actually don't bother. I think I've found the problem. It is related to > > > pending_count and is easy to fix. > > > Could you try this patch please? > > > > > > Thanks. 
> > > NeilBrown > > > > > > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c > > > index 6e5d5a5..fd86b37 100644 > > > --- a/drivers/md/raid1.c > > > +++ b/drivers/md/raid1.c > > > @@ -967,6 +967,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule) > > > bio_list_merge(&conf->pending_bio_list, &plug->pending); > > > conf->pending_count += plug->pending_cnt; > > > spin_unlock_irq(&conf->device_lock); > > > + wake_up(&conf->wait_barrier); > > > md_wakeup_thread(mddev->thread); > > > kfree(plug); > > > return; > > > > Running 15 hours now and no sign of the problem, which is 12 hours > > longer than it took to trigger the bug in the past. I'll continue > > testing to be sure but I think this patch is a fix. > > > > Thanks for the fast response! > > > > Tregaron Bayly > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier 2013-05-16 14:07 ` Alexander Lyakas @ 2013-05-20 7:17 ` NeilBrown 2013-05-30 14:30 ` Alexander Lyakas 0 siblings, 1 reply; 20+ messages in thread From: NeilBrown @ 2013-05-20 7:17 UTC (permalink / raw) To: Alexander Lyakas; +Cc: Tregaron Bayly, linux-raid, Shyam Kaushik [-- Attachment #1: Type: text/plain, Size: 6645 bytes --] On Thu, 16 May 2013 17:07:04 +0300 Alexander Lyakas <alex.bolshoy@gmail.com> wrote: > Hello Neil, > we are hitting issue that looks very similar; we are on kernel 3.8.2. > The array is a 2-device raid1, which experienced a drive failure, but > then drive was removed and re-added back to the array (although > rebuild never started). Relevant parts of the kernel log: > > May 16 11:12:14 kernel: [46850.090499] md/raid1:md2: Disk failure on > dm-6, disabling device. > May 16 11:12:14 kernel: [46850.090499] md/raid1:md2: Operation > continuing on 1 devices. > May 16 11:12:14 kernel: [46850.090511] md: super_written gets > error=-5, uptodate=0 > May 16 11:12:14 kernel: [46850.090676] md/raid1:md2: dm-6: > rescheduling sector 18040736 > May 16 11:12:14 kernel: [46850.090764] md/raid1:md2: dm-6: > rescheduling sector 20367040 > May 16 11:12:14 kernel: [46850.090826] md/raid1:md2: dm-6: > rescheduling sector 17613504 > May 16 11:12:14 kernel: [46850.090883] md/raid1:md2: dm-6: > rescheduling sector 18042720 > May 16 11:12:15 kernel: [46850.229970] md/raid1:md2: redirecting > sector 18040736 to other mirror: dm-13 > May 16 11:12:15 kernel: [46850.257687] md/raid1:md2: redirecting > sector 20367040 to other mirror: dm-13 > May 16 11:12:15 kernel: [46850.268731] md/raid1:md2: redirecting > sector 17613504 to other mirror: dm-13 > May 16 11:12:15 kernel: [46850.398242] md/raid1:md2: redirecting > sector 18042720 to other mirror: dm-13 > May 16 11:12:23 kernel: [46858.448465] md: unbind<dm-6> > May 16 11:12:23 kernel: [46858.456081] md: export_rdev(dm-6) > May 16 11:23:19 kernel: [47515.062547] md: bind<dm-6> > > May 16 11:24:28 kernel: [47583.920126] INFO: task md2_raid1:15749 > blocked for more than 60 seconds. > May 16 11:24:28 kernel: [47583.921829] "echo 0 > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > May 16 11:24:28 kernel: [47583.923361] md2_raid1 D > 0000000000000001 0 15749 2 0x00000000 > May 16 11:24:28 kernel: [47583.923367] ffff880097c23c28 > 0000000000000046 ffff880000000002 00000000982c43b8 > May 16 11:24:28 kernel: [47583.923372] ffff880097c23fd8 > ffff880097c23fd8 ffff880097c23fd8 0000000000013f40 > May 16 11:24:28 kernel: [47583.923376] ffff880119b11740 > ffff8800a5489740 ffff880097c23c38 ffff88011609d3c0 > May 16 11:24:28 kernel: [47583.923381] Call Trace: > May 16 11:24:28 kernel: [47583.923395] [<ffffffff816eb399>] schedule+0x29/0x70 > May 16 11:24:28 kernel: [47583.923402] [<ffffffffa0516736>] > raise_barrier+0x106/0x160 [raid1] > May 16 11:24:28 kernel: [47583.923410] [<ffffffff8107fcc0>] ? > add_wait_queue+0x60/0x60 > May 16 11:24:28 kernel: [47583.923415] [<ffffffffa0516af7>] > raid1_add_disk+0x197/0x200 [raid1] > May 16 11:24:28 kernel: [47583.923421] [<ffffffff81567fa4>] > remove_and_add_spares+0x104/0x220 > May 16 11:24:28 kernel: [47583.923426] [<ffffffff8156a02d>] > md_check_recovery.part.49+0x40d/0x530 > May 16 11:24:28 kernel: [47583.923429] [<ffffffff8156a165>] > md_check_recovery+0x15/0x20 > May 16 11:24:28 kernel: [47583.923433] [<ffffffffa0517e42>] > raid1d+0x22/0x180 [raid1] > May 16 11:24:28 kernel: [47583.923439] [<ffffffff81045cd9>] ? 
> default_spin_lock_flags+0x9/0x10 > May 16 11:24:28 kernel: [47583.923443] [<ffffffff81045cd9>] ? > default_spin_lock_flags+0x9/0x10 > May 16 11:24:28 kernel: [47583.923449] [<ffffffff815624ed>] > md_thread+0x10d/0x140 > May 16 11:24:28 kernel: [47583.923453] [<ffffffff8107fcc0>] ? > add_wait_queue+0x60/0x60 > May 16 11:24:28 kernel: [47583.923457] [<ffffffff815623e0>] ? > md_rdev_init+0x140/0x140 > May 16 11:24:28 kernel: [47583.923461] [<ffffffff8107f0d0>] kthread+0xc0/0xd0 > May 16 11:24:28 kernel: [47583.923465] [<ffffffff8107f010>] ? > flush_kthread_worker+0xb0/0xb0 > May 16 11:24:28 kernel: [47583.923472] [<ffffffff816f506c>] > ret_from_fork+0x7c/0xb0 > May 16 11:24:28 kernel: [47583.923476] [<ffffffff8107f010>] ? > flush_kthread_worker+0xb0/0xb0 > > dm-13 is the good drive, dm-6 is the one that failed. > > At this point, we have several threads calling submit_bio and all > stuck like this: > > cat /proc/16218/stack > [<ffffffffa0516d6e>] wait_barrier+0xbe/0x160 [raid1] > [<ffffffffa0518627>] make_request+0x87/0xa90 [raid1] > [<ffffffff81561ed0>] md_make_request+0xd0/0x200 > [<ffffffff8132bcaa>] generic_make_request+0xca/0x100 > [<ffffffff8132bd5b>] submit_bio+0x7b/0x160 > ... > > And md raid1 thread stuck like this: > > cat /proc/15749/stack > [<ffffffffa0516736>] raise_barrier+0x106/0x160 [raid1] > [<ffffffffa0516af7>] raid1_add_disk+0x197/0x200 [raid1] > [<ffffffff81567fa4>] remove_and_add_spares+0x104/0x220 > [<ffffffff8156a02d>] md_check_recovery.part.49+0x40d/0x530 > [<ffffffff8156a165>] md_check_recovery+0x15/0x20 > [<ffffffffa0517e42>] raid1d+0x22/0x180 [raid1] > [<ffffffff815624ed>] md_thread+0x10d/0x140 > [<ffffffff8107f0d0>] kthread+0xc0/0xd0 > [<ffffffff816f506c>] ret_from_fork+0x7c/0xb0 > [<ffffffffffffffff>] 0xffffffffffffffff > > We have also two user-space threads stuck: > > one is trying to read /sys/block/md2/md/array_state and its kernel stack is: > # cat /proc/2251/stack > [<ffffffff81564602>] md_attr_show+0x72/0xf0 > [<ffffffff8120f116>] fill_read_buffer.isra.8+0x66/0xf0 > [<ffffffff8120f244>] sysfs_read_file+0xa4/0xc0 > [<ffffffff8119b0d0>] vfs_read+0xb0/0x180 > [<ffffffff8119b1f2>] sys_read+0x52/0xa0 > [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f > [<ffffffffffffffff>] 0xffffffffffffffff > > the other wants to read from /proc/mdstat and is: > [<ffffffff81563d2b>] md_seq_show+0x4b/0x540 > [<ffffffff811bdd1b>] seq_read+0x16b/0x400 > [<ffffffff811ff572>] proc_reg_read+0x82/0xc0 > [<ffffffff8119b0d0>] vfs_read+0xb0/0x180 > [<ffffffff8119b1f2>] sys_read+0x52/0xa0 > [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f > [<ffffffffffffffff>] 0xffffffffffffffff > > mdadm --detail also gets stuck if attempted, in stack like this: > cat /proc/2864/stack > [<ffffffff81564602>] md_attr_show+0x72/0xf0 > [<ffffffff8120f116>] fill_read_buffer.isra.8+0x66/0xf0 > [<ffffffff8120f244>] sysfs_read_file+0xa4/0xc0 > [<ffffffff8119b0d0>] vfs_read+0xb0/0x180 > [<ffffffff8119b1f2>] sys_read+0x52/0xa0 > [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f > [<ffffffffffffffff>] 0xffffffffffffffff > > Might your patch https://patchwork.kernel.org/patch/2260051/ fix this Probably. > issue? Is this patch alone applicable to kernel 3.8.2? Probably. > Can you pls kindly comment on this. > NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier 2013-05-20 7:17 ` NeilBrown @ 2013-05-30 14:30 ` Alexander Lyakas 2013-06-02 12:43 ` Alexander Lyakas 0 siblings, 1 reply; 20+ messages in thread From: Alexander Lyakas @ 2013-05-30 14:30 UTC (permalink / raw) To: NeilBrown, linux-raid; +Cc: Tregaron Bayly, Shyam Kaushik Hi Neil, On Mon, May 20, 2013 at 10:17 AM, NeilBrown <neilb@suse.de> wrote: > On Thu, 16 May 2013 17:07:04 +0300 Alexander Lyakas <alex.bolshoy@gmail.com> > wrote: > >> Hello Neil, >> we are hitting issue that looks very similar; we are on kernel 3.8.2. >> The array is a 2-device raid1, which experienced a drive failure, but >> then drive was removed and re-added back to the array (although >> rebuild never started). Relevant parts of the kernel log: >> >> May 16 11:12:14 kernel: [46850.090499] md/raid1:md2: Disk failure on >> dm-6, disabling device. >> May 16 11:12:14 kernel: [46850.090499] md/raid1:md2: Operation >> continuing on 1 devices. >> May 16 11:12:14 kernel: [46850.090511] md: super_written gets >> error=-5, uptodate=0 >> May 16 11:12:14 kernel: [46850.090676] md/raid1:md2: dm-6: >> rescheduling sector 18040736 >> May 16 11:12:14 kernel: [46850.090764] md/raid1:md2: dm-6: >> rescheduling sector 20367040 >> May 16 11:12:14 kernel: [46850.090826] md/raid1:md2: dm-6: >> rescheduling sector 17613504 >> May 16 11:12:14 kernel: [46850.090883] md/raid1:md2: dm-6: >> rescheduling sector 18042720 >> May 16 11:12:15 kernel: [46850.229970] md/raid1:md2: redirecting >> sector 18040736 to other mirror: dm-13 >> May 16 11:12:15 kernel: [46850.257687] md/raid1:md2: redirecting >> sector 20367040 to other mirror: dm-13 >> May 16 11:12:15 kernel: [46850.268731] md/raid1:md2: redirecting >> sector 17613504 to other mirror: dm-13 >> May 16 11:12:15 kernel: [46850.398242] md/raid1:md2: redirecting >> sector 18042720 to other mirror: dm-13 >> May 16 11:12:23 kernel: [46858.448465] md: unbind<dm-6> >> May 16 11:12:23 kernel: [46858.456081] md: export_rdev(dm-6) >> May 16 11:23:19 kernel: [47515.062547] md: bind<dm-6> >> >> May 16 11:24:28 kernel: [47583.920126] INFO: task md2_raid1:15749 >> blocked for more than 60 seconds. >> May 16 11:24:28 kernel: [47583.921829] "echo 0 > >> /proc/sys/kernel/hung_task_timeout_secs" disables this message. >> May 16 11:24:28 kernel: [47583.923361] md2_raid1 D >> 0000000000000001 0 15749 2 0x00000000 >> May 16 11:24:28 kernel: [47583.923367] ffff880097c23c28 >> 0000000000000046 ffff880000000002 00000000982c43b8 >> May 16 11:24:28 kernel: [47583.923372] ffff880097c23fd8 >> ffff880097c23fd8 ffff880097c23fd8 0000000000013f40 >> May 16 11:24:28 kernel: [47583.923376] ffff880119b11740 >> ffff8800a5489740 ffff880097c23c38 ffff88011609d3c0 >> May 16 11:24:28 kernel: [47583.923381] Call Trace: >> May 16 11:24:28 kernel: [47583.923395] [<ffffffff816eb399>] schedule+0x29/0x70 >> May 16 11:24:28 kernel: [47583.923402] [<ffffffffa0516736>] >> raise_barrier+0x106/0x160 [raid1] >> May 16 11:24:28 kernel: [47583.923410] [<ffffffff8107fcc0>] ? 
>> add_wait_queue+0x60/0x60 >> May 16 11:24:28 kernel: [47583.923415] [<ffffffffa0516af7>] >> raid1_add_disk+0x197/0x200 [raid1] >> May 16 11:24:28 kernel: [47583.923421] [<ffffffff81567fa4>] >> remove_and_add_spares+0x104/0x220 >> May 16 11:24:28 kernel: [47583.923426] [<ffffffff8156a02d>] >> md_check_recovery.part.49+0x40d/0x530 >> May 16 11:24:28 kernel: [47583.923429] [<ffffffff8156a165>] >> md_check_recovery+0x15/0x20 >> May 16 11:24:28 kernel: [47583.923433] [<ffffffffa0517e42>] >> raid1d+0x22/0x180 [raid1] >> May 16 11:24:28 kernel: [47583.923439] [<ffffffff81045cd9>] ? >> default_spin_lock_flags+0x9/0x10 >> May 16 11:24:28 kernel: [47583.923443] [<ffffffff81045cd9>] ? >> default_spin_lock_flags+0x9/0x10 >> May 16 11:24:28 kernel: [47583.923449] [<ffffffff815624ed>] >> md_thread+0x10d/0x140 >> May 16 11:24:28 kernel: [47583.923453] [<ffffffff8107fcc0>] ? >> add_wait_queue+0x60/0x60 >> May 16 11:24:28 kernel: [47583.923457] [<ffffffff815623e0>] ? >> md_rdev_init+0x140/0x140 >> May 16 11:24:28 kernel: [47583.923461] [<ffffffff8107f0d0>] kthread+0xc0/0xd0 >> May 16 11:24:28 kernel: [47583.923465] [<ffffffff8107f010>] ? >> flush_kthread_worker+0xb0/0xb0 >> May 16 11:24:28 kernel: [47583.923472] [<ffffffff816f506c>] >> ret_from_fork+0x7c/0xb0 >> May 16 11:24:28 kernel: [47583.923476] [<ffffffff8107f010>] ? >> flush_kthread_worker+0xb0/0xb0 >> >> dm-13 is the good drive, dm-6 is the one that failed. >> >> At this point, we have several threads calling submit_bio and all >> stuck like this: >> >> cat /proc/16218/stack >> [<ffffffffa0516d6e>] wait_barrier+0xbe/0x160 [raid1] >> [<ffffffffa0518627>] make_request+0x87/0xa90 [raid1] >> [<ffffffff81561ed0>] md_make_request+0xd0/0x200 >> [<ffffffff8132bcaa>] generic_make_request+0xca/0x100 >> [<ffffffff8132bd5b>] submit_bio+0x7b/0x160 >> ... 
>> >> And md raid1 thread stuck like this: >> >> cat /proc/15749/stack >> [<ffffffffa0516736>] raise_barrier+0x106/0x160 [raid1] >> [<ffffffffa0516af7>] raid1_add_disk+0x197/0x200 [raid1] >> [<ffffffff81567fa4>] remove_and_add_spares+0x104/0x220 >> [<ffffffff8156a02d>] md_check_recovery.part.49+0x40d/0x530 >> [<ffffffff8156a165>] md_check_recovery+0x15/0x20 >> [<ffffffffa0517e42>] raid1d+0x22/0x180 [raid1] >> [<ffffffff815624ed>] md_thread+0x10d/0x140 >> [<ffffffff8107f0d0>] kthread+0xc0/0xd0 >> [<ffffffff816f506c>] ret_from_fork+0x7c/0xb0 >> [<ffffffffffffffff>] 0xffffffffffffffff >> >> We have also two user-space threads stuck: >> >> one is trying to read /sys/block/md2/md/array_state and its kernel stack is: >> # cat /proc/2251/stack >> [<ffffffff81564602>] md_attr_show+0x72/0xf0 >> [<ffffffff8120f116>] fill_read_buffer.isra.8+0x66/0xf0 >> [<ffffffff8120f244>] sysfs_read_file+0xa4/0xc0 >> [<ffffffff8119b0d0>] vfs_read+0xb0/0x180 >> [<ffffffff8119b1f2>] sys_read+0x52/0xa0 >> [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f >> [<ffffffffffffffff>] 0xffffffffffffffff >> >> the other wants to read from /proc/mdstat and is: >> [<ffffffff81563d2b>] md_seq_show+0x4b/0x540 >> [<ffffffff811bdd1b>] seq_read+0x16b/0x400 >> [<ffffffff811ff572>] proc_reg_read+0x82/0xc0 >> [<ffffffff8119b0d0>] vfs_read+0xb0/0x180 >> [<ffffffff8119b1f2>] sys_read+0x52/0xa0 >> [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f >> [<ffffffffffffffff>] 0xffffffffffffffff >> >> mdadm --detail also gets stuck if attempted, in stack like this: >> cat /proc/2864/stack >> [<ffffffff81564602>] md_attr_show+0x72/0xf0 >> [<ffffffff8120f116>] fill_read_buffer.isra.8+0x66/0xf0 >> [<ffffffff8120f244>] sysfs_read_file+0xa4/0xc0 >> [<ffffffff8119b0d0>] vfs_read+0xb0/0x180 >> [<ffffffff8119b1f2>] sys_read+0x52/0xa0 >> [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f >> [<ffffffffffffffff>] 0xffffffffffffffff >> >> Might your patch https://patchwork.kernel.org/patch/2260051/ fix this > > Probably. > >> issue? Is this patch alone applicable to kernel 3.8.2? > > Probably. This problem still happens in 3.8.13, which contains the above fix of yours. I am currently debugging it deeper, but still don't know what the issue is. The problem happens when raid1_add_disk() calls raise_barrier(). conf->nr_waiting is 0 at this point. So then raise_barrier() starts waiting for conf->nr_pending getting to 0, but it never gets to 0. From this point, anybody that calls wait_barrier() is also stuck. This looks like a real deadlock in a non-ancient kernel. I realize that 3.8.x is EOL, but still it's only one stable behind. Can you help further with this problem? Another problem that I see often, is when a drive fails in RAID1, and then rdev is marked as "Blocked", but then Blocked bit never gets cleared for some reason. So this drive is not ejected from the array. Again, I was unable to debug it deeper at this point. Thanks, Alex. > >> Can you pls kindly comment on this. >> > > NeilBrown ^ permalink raw reply [flat|nested] 20+ messages in thread
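For reference, the raise_barrier() that raid1_add_disk() is stuck in looks roughly like this in the 3.8-era raid1.c (paraphrased sketch, not verbatim; RESYNC_DEPTH is a small constant bounding nested barriers):

static void raise_barrier(struct r1conf *conf)
{
	spin_lock_irq(&conf->resync_lock);

	/* Wait until no block IO is waiting on the barrier */
	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
			    conf->resync_lock);

	/* block any new IO from starting */
	conf->barrier++;

	/* Now wait for all pending IO to complete */
	wait_event_lock_irq(conf->wait_barrier,
			    !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
			    conf->resync_lock);

	spin_unlock_irq(&conf->resync_lock);
}

Unlike freeze_array(), neither wait here has a cmd that drains conf->pending_bio_list, so the second wait depends entirely on nr_pending falling to zero on its own.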
* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier 2013-05-30 14:30 ` Alexander Lyakas @ 2013-06-02 12:43 ` Alexander Lyakas 2013-06-04 1:49 ` NeilBrown 0 siblings, 1 reply; 20+ messages in thread From: Alexander Lyakas @ 2013-06-02 12:43 UTC (permalink / raw) To: NeilBrown, linux-raid; +Cc: Shyam Kaushik, yair [-- Attachment #1: Type: text/plain, Size: 9743 bytes --] Hello Neil, I believe I have found what is causing the deadlock. It happens in two flavors: 1) # raid1d() is called, and conf->pending_bio_list is non-empty at this point # raid1d() calls md_check_recovery(), which eventually calls raid1_add_disk(), which calls raise_barrier() # Now raise_barrier will wait for conf->nr_pending to become 0, but it cannot become 0, because there are bios sitting in conf->pending_bio_list, which nobody will flush, because raid1d is the one supposed to call flush_pending_writes(), either directly or via handle_read_error. But it is stuck in raise_barrier. 2) # raid1_add_disk() calls raise_barrier(), and waits for conf->nr_pending to become 0, as before # new WRITE comes and calls wait_barrier(), but this thread has a non-empty current->bio_list # In this case, the code allows the WRITE to go through wait_barrier(), and trigger WRITEs to mirror legs, but these WRITEs again end up in conf->pending_bio_list (either via raid1_unplug or directly). But nobody will flush conf->pending_bio_list, because raid1d is stuck in raise_barrier. Previously, for example in kernel 3.2, raid1_add_disk did not call raise_barrier, so this problem did not happen. Attached is a reproduction with some prints that I added to raise_barrier and wait_barrier (their code also attached). It demonstrates case 2. It shows that once raise_barrier got called, conf->nr_pending drops down, until it equals the number of wait_barrier calls, that slipped through because of non-empty current->bio_list. And at this point, this array hangs. Can you please comment on how to fix this problem. It looks like a real deadlock. We can perhaps call md_check_recovery() after flush_pending_writes(), but this only makes the window smaller, not closes it entirely. But it looks like we really should not be calling raise_barrier from raid1d. Thanks, Alex. On Thu, May 30, 2013 at 5:30 PM, Alexander Lyakas <alex.bolshoy@gmail.com> wrote: > Hi Neil, > > On Mon, May 20, 2013 at 10:17 AM, NeilBrown <neilb@suse.de> wrote: >> On Thu, 16 May 2013 17:07:04 +0300 Alexander Lyakas <alex.bolshoy@gmail.com> >> wrote: >> >>> Hello Neil, >>> we are hitting issue that looks very similar; we are on kernel 3.8.2. >>> The array is a 2-device raid1, which experienced a drive failure, but >>> then drive was removed and re-added back to the array (although >>> rebuild never started). Relevant parts of the kernel log: >>> >>> May 16 11:12:14 kernel: [46850.090499] md/raid1:md2: Disk failure on >>> dm-6, disabling device. >>> May 16 11:12:14 kernel: [46850.090499] md/raid1:md2: Operation >>> continuing on 1 devices. 
>>> May 16 11:12:14 kernel: [46850.090511] md: super_written gets >>> error=-5, uptodate=0 >>> May 16 11:12:14 kernel: [46850.090676] md/raid1:md2: dm-6: >>> rescheduling sector 18040736 >>> May 16 11:12:14 kernel: [46850.090764] md/raid1:md2: dm-6: >>> rescheduling sector 20367040 >>> May 16 11:12:14 kernel: [46850.090826] md/raid1:md2: dm-6: >>> rescheduling sector 17613504 >>> May 16 11:12:14 kernel: [46850.090883] md/raid1:md2: dm-6: >>> rescheduling sector 18042720 >>> May 16 11:12:15 kernel: [46850.229970] md/raid1:md2: redirecting >>> sector 18040736 to other mirror: dm-13 >>> May 16 11:12:15 kernel: [46850.257687] md/raid1:md2: redirecting >>> sector 20367040 to other mirror: dm-13 >>> May 16 11:12:15 kernel: [46850.268731] md/raid1:md2: redirecting >>> sector 17613504 to other mirror: dm-13 >>> May 16 11:12:15 kernel: [46850.398242] md/raid1:md2: redirecting >>> sector 18042720 to other mirror: dm-13 >>> May 16 11:12:23 kernel: [46858.448465] md: unbind<dm-6> >>> May 16 11:12:23 kernel: [46858.456081] md: export_rdev(dm-6) >>> May 16 11:23:19 kernel: [47515.062547] md: bind<dm-6> >>> >>> May 16 11:24:28 kernel: [47583.920126] INFO: task md2_raid1:15749 >>> blocked for more than 60 seconds. >>> May 16 11:24:28 kernel: [47583.921829] "echo 0 > >>> /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>> May 16 11:24:28 kernel: [47583.923361] md2_raid1 D >>> 0000000000000001 0 15749 2 0x00000000 >>> May 16 11:24:28 kernel: [47583.923367] ffff880097c23c28 >>> 0000000000000046 ffff880000000002 00000000982c43b8 >>> May 16 11:24:28 kernel: [47583.923372] ffff880097c23fd8 >>> ffff880097c23fd8 ffff880097c23fd8 0000000000013f40 >>> May 16 11:24:28 kernel: [47583.923376] ffff880119b11740 >>> ffff8800a5489740 ffff880097c23c38 ffff88011609d3c0 >>> May 16 11:24:28 kernel: [47583.923381] Call Trace: >>> May 16 11:24:28 kernel: [47583.923395] [<ffffffff816eb399>] schedule+0x29/0x70 >>> May 16 11:24:28 kernel: [47583.923402] [<ffffffffa0516736>] >>> raise_barrier+0x106/0x160 [raid1] >>> May 16 11:24:28 kernel: [47583.923410] [<ffffffff8107fcc0>] ? >>> add_wait_queue+0x60/0x60 >>> May 16 11:24:28 kernel: [47583.923415] [<ffffffffa0516af7>] >>> raid1_add_disk+0x197/0x200 [raid1] >>> May 16 11:24:28 kernel: [47583.923421] [<ffffffff81567fa4>] >>> remove_and_add_spares+0x104/0x220 >>> May 16 11:24:28 kernel: [47583.923426] [<ffffffff8156a02d>] >>> md_check_recovery.part.49+0x40d/0x530 >>> May 16 11:24:28 kernel: [47583.923429] [<ffffffff8156a165>] >>> md_check_recovery+0x15/0x20 >>> May 16 11:24:28 kernel: [47583.923433] [<ffffffffa0517e42>] >>> raid1d+0x22/0x180 [raid1] >>> May 16 11:24:28 kernel: [47583.923439] [<ffffffff81045cd9>] ? >>> default_spin_lock_flags+0x9/0x10 >>> May 16 11:24:28 kernel: [47583.923443] [<ffffffff81045cd9>] ? >>> default_spin_lock_flags+0x9/0x10 >>> May 16 11:24:28 kernel: [47583.923449] [<ffffffff815624ed>] >>> md_thread+0x10d/0x140 >>> May 16 11:24:28 kernel: [47583.923453] [<ffffffff8107fcc0>] ? >>> add_wait_queue+0x60/0x60 >>> May 16 11:24:28 kernel: [47583.923457] [<ffffffff815623e0>] ? >>> md_rdev_init+0x140/0x140 >>> May 16 11:24:28 kernel: [47583.923461] [<ffffffff8107f0d0>] kthread+0xc0/0xd0 >>> May 16 11:24:28 kernel: [47583.923465] [<ffffffff8107f010>] ? >>> flush_kthread_worker+0xb0/0xb0 >>> May 16 11:24:28 kernel: [47583.923472] [<ffffffff816f506c>] >>> ret_from_fork+0x7c/0xb0 >>> May 16 11:24:28 kernel: [47583.923476] [<ffffffff8107f010>] ? >>> flush_kthread_worker+0xb0/0xb0 >>> >>> dm-13 is the good drive, dm-6 is the one that failed. 
>>> >>> At this point, we have several threads calling submit_bio and all >>> stuck like this: >>> >>> cat /proc/16218/stack >>> [<ffffffffa0516d6e>] wait_barrier+0xbe/0x160 [raid1] >>> [<ffffffffa0518627>] make_request+0x87/0xa90 [raid1] >>> [<ffffffff81561ed0>] md_make_request+0xd0/0x200 >>> [<ffffffff8132bcaa>] generic_make_request+0xca/0x100 >>> [<ffffffff8132bd5b>] submit_bio+0x7b/0x160 >>> ... >>> >>> And md raid1 thread stuck like this: >>> >>> cat /proc/15749/stack >>> [<ffffffffa0516736>] raise_barrier+0x106/0x160 [raid1] >>> [<ffffffffa0516af7>] raid1_add_disk+0x197/0x200 [raid1] >>> [<ffffffff81567fa4>] remove_and_add_spares+0x104/0x220 >>> [<ffffffff8156a02d>] md_check_recovery.part.49+0x40d/0x530 >>> [<ffffffff8156a165>] md_check_recovery+0x15/0x20 >>> [<ffffffffa0517e42>] raid1d+0x22/0x180 [raid1] >>> [<ffffffff815624ed>] md_thread+0x10d/0x140 >>> [<ffffffff8107f0d0>] kthread+0xc0/0xd0 >>> [<ffffffff816f506c>] ret_from_fork+0x7c/0xb0 >>> [<ffffffffffffffff>] 0xffffffffffffffff >>> >>> We have also two user-space threads stuck: >>> >>> one is trying to read /sys/block/md2/md/array_state and its kernel stack is: >>> # cat /proc/2251/stack >>> [<ffffffff81564602>] md_attr_show+0x72/0xf0 >>> [<ffffffff8120f116>] fill_read_buffer.isra.8+0x66/0xf0 >>> [<ffffffff8120f244>] sysfs_read_file+0xa4/0xc0 >>> [<ffffffff8119b0d0>] vfs_read+0xb0/0x180 >>> [<ffffffff8119b1f2>] sys_read+0x52/0xa0 >>> [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f >>> [<ffffffffffffffff>] 0xffffffffffffffff >>> >>> the other wants to read from /proc/mdstat and is: >>> [<ffffffff81563d2b>] md_seq_show+0x4b/0x540 >>> [<ffffffff811bdd1b>] seq_read+0x16b/0x400 >>> [<ffffffff811ff572>] proc_reg_read+0x82/0xc0 >>> [<ffffffff8119b0d0>] vfs_read+0xb0/0x180 >>> [<ffffffff8119b1f2>] sys_read+0x52/0xa0 >>> [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f >>> [<ffffffffffffffff>] 0xffffffffffffffff >>> >>> mdadm --detail also gets stuck if attempted, in stack like this: >>> cat /proc/2864/stack >>> [<ffffffff81564602>] md_attr_show+0x72/0xf0 >>> [<ffffffff8120f116>] fill_read_buffer.isra.8+0x66/0xf0 >>> [<ffffffff8120f244>] sysfs_read_file+0xa4/0xc0 >>> [<ffffffff8119b0d0>] vfs_read+0xb0/0x180 >>> [<ffffffff8119b1f2>] sys_read+0x52/0xa0 >>> [<ffffffff816f511d>] system_call_fastpath+0x1a/0x1f >>> [<ffffffffffffffff>] 0xffffffffffffffff >>> >>> Might your patch https://patchwork.kernel.org/patch/2260051/ fix this >> >> Probably. >> >>> issue? Is this patch alone applicable to kernel 3.8.2? >> >> Probably. > > This problem still happens in 3.8.13, which contains the above fix of > yours. I am currently debugging it deeper, but still don't know what > the issue is. > The problem happens when raid1_add_disk() calls raise_barrier(). > conf->nr_waiting is 0 at this point. So then raise_barrier() starts > waiting for conf->nr_pending getting to 0, but it never gets to 0. > From this point, anybody that calls wait_barrier() is also stuck. > > This looks like a real deadlock in a non-ancient kernel. I realize > that 3.8.x is EOL, but still it's only one stable behind. > Can you help further with this problem? > > Another problem that I see often, is when a drive fails in RAID1, and > then rdev is marked as "Blocked", but then Blocked bit never gets > cleared for some reason. So this drive is not ejected from the array. > Again, I was unable to debug it deeper at this point. > > Thanks, > Alex. > > > > > > > > > > > > >> >>> Can you pls kindly comment on this. 
>>> >> >> NeilBrown [-- Attachment #2: repro.txt --] [-- Type: text/plain, Size: 14688 bytes --] [ 514.714425] md: bind<dm-5> [ 514.718751] md2: raise_barrier(1): conf->barrier=0 conf->nr_waiting=0 conf->nr_pending=191 conf->pending_count=0 [ 514.718755] md2: raise_barrier(2): conf->barrier=1 conf->nr_waiting=0 conf->nr_pending=191 conf->pending_count=0 [ 514.720564] md2: wait_barrier(PID=7079,1) conf->barrier=1 conf->nr_waiting=1 conf->nr_pending=191 conf->pending_count=0 [ 514.722073] md2: wait_barrier(PID=7069,1) conf->barrier=1 conf->nr_waiting=2 conf->nr_pending=191 conf->pending_count=0 [ 514.724602] md2: wait_barrier(PID=9922,1) conf->barrier=1 conf->nr_waiting=3 conf->nr_pending=191 conf->pending_count=0 [ 514.725575] md2: wait_barrier(PID=9909,1) conf->barrier=1 conf->nr_waiting=4 conf->nr_pending=187 conf->pending_count=0 [ 514.725661] md2: wait_barrier(PID=7082,1) conf->barrier=1 conf->nr_waiting=5 conf->nr_pending=187 conf->pending_count=0 [ 514.725836] md2: wait_barrier(PID=9910,1) conf->barrier=1 conf->nr_waiting=6 conf->nr_pending=178 conf->pending_count=0 [ 514.725848] md2: wait_barrier(PID=9918,1) conf->barrier=1 conf->nr_waiting=7 conf->nr_pending=178 conf->pending_count=0 [ 514.725860] md2: wait_barrier(PID=7077,1) conf->barrier=1 conf->nr_waiting=8 conf->nr_pending=178 conf->pending_count=0 [ 514.725872] md2: wait_barrier(PID=9905,1) conf->barrier=1 conf->nr_waiting=9 conf->nr_pending=178 conf->pending_count=0 [ 514.725884] md2: wait_barrier(PID=7147,1) conf->barrier=1 conf->nr_waiting=10 conf->nr_pending=178 conf->pending_count=0 [ 514.725917] md2: wait_barrier(PID=10217,1) conf->barrier=1 conf->nr_waiting=11 conf->nr_pending=178 conf->pending_count=0 [ 514.727122] md2: wait_barrier(PID=7177,1) conf->barrier=1 conf->nr_waiting=12 conf->nr_pending=155 conf->pending_count=0 [ 514.727446] md2: wait_barrier(PID=9904,1) conf->barrier=1 conf->nr_waiting=13 conf->nr_pending=155 conf->pending_count=0 [ 514.727475] md2: wait_barrier(PID=9894,1) conf->barrier=1 conf->nr_waiting=14 conf->nr_pending=155 conf->pending_count=0 [ 514.727489] md2: wait_barrier(PID=9898,1) conf->barrier=1 conf->nr_waiting=15 conf->nr_pending=155 conf->pending_count=0 [ 514.727507] md2: wait_barrier(PID=7072,1) conf->barrier=1 conf->nr_waiting=16 conf->nr_pending=155 conf->pending_count=0 [ 514.727651] md2: wait_barrier(PID=9899,1) conf->barrier=1 conf->nr_waiting=17 conf->nr_pending=150 conf->pending_count=0 [ 514.727853] md2: wait_barrier(PID=9906,1) conf->barrier=1 conf->nr_waiting=18 conf->nr_pending=150 conf->pending_count=0 [ 514.727867] md2: wait_barrier(PID=9912,1) conf->barrier=1 conf->nr_waiting=19 conf->nr_pending=150 conf->pending_count=0 [ 514.727915] md2: wait_barrier(PID=6888,1) conf->barrier=1 conf->nr_waiting=20 conf->nr_pending=150 conf->pending_count=0 [ 514.728109] md2: wait_barrier(PID=7080,1) conf->barrier=1 conf->nr_waiting=21 conf->nr_pending=150 conf->pending_count=0 [ 514.728112] md2: wait_barrier(PID=7080,2) conf->barrier=1 conf->nr_waiting=20 conf->nr_pending=151 conf->pending_count=0 [ 514.728136] md2: wait_barrier(PID=7080,1) conf->barrier=1 conf->nr_waiting=21 conf->nr_pending=151 conf->pending_count=1 [ 514.728138] md2: wait_barrier(PID=7080,2) conf->barrier=1 conf->nr_waiting=20 conf->nr_pending=152 conf->pending_count=1 [ 514.728143] md2: wait_barrier(PID=7080,1) conf->barrier=1 conf->nr_waiting=21 conf->nr_pending=152 conf->pending_count=2 [ 514.728145] md2: wait_barrier(PID=7080,2) conf->barrier=1 conf->nr_waiting=20 conf->nr_pending=153 conf->pending_count=2 [ 
514.728149] md2: wait_barrier(PID=7080,1) conf->barrier=1 conf->nr_waiting=21 conf->nr_pending=153 conf->pending_count=3 [ 514.730110] md2: wait_barrier(PID=9919,1) conf->barrier=1 conf->nr_waiting=22 conf->nr_pending=153 conf->pending_count=3 [ 514.730188] md2: wait_barrier(PID=5773,1) conf->barrier=1 conf->nr_waiting=23 conf->nr_pending=153 conf->pending_count=3 [ 514.731139] md2: wait_barrier(PID=7174,1) conf->barrier=1 conf->nr_waiting=24 conf->nr_pending=153 conf->pending_count=3 [ 514.731143] md2: wait_barrier(PID=7174,2) conf->barrier=1 conf->nr_waiting=23 conf->nr_pending=154 conf->pending_count=3 [ 514.731176] md2: wait_barrier(PID=7174,1) conf->barrier=1 conf->nr_waiting=24 conf->nr_pending=154 conf->pending_count=4 [ 514.731179] md2: wait_barrier(PID=7174,2) conf->barrier=1 conf->nr_waiting=23 conf->nr_pending=155 conf->pending_count=4 [ 514.731183] md2: wait_barrier(PID=7174,1) conf->barrier=1 conf->nr_waiting=24 conf->nr_pending=155 conf->pending_count=5 [ 514.731186] md2: wait_barrier(PID=7174,2) conf->barrier=1 conf->nr_waiting=23 conf->nr_pending=156 conf->pending_count=5 [ 514.731193] md2: wait_barrier(PID=7174,1) conf->barrier=1 conf->nr_waiting=24 conf->nr_pending=156 conf->pending_count=6 [ 514.731371] md2: wait_barrier(PID=7070,1) conf->barrier=1 conf->nr_waiting=25 conf->nr_pending=156 conf->pending_count=6 [ 514.731374] md2: wait_barrier(PID=7070,2) conf->barrier=1 conf->nr_waiting=24 conf->nr_pending=157 conf->pending_count=6 [ 514.731382] md2: wait_barrier(PID=7070,1) conf->barrier=1 conf->nr_waiting=25 conf->nr_pending=157 conf->pending_count=7 [ 514.731385] md2: wait_barrier(PID=7070,2) conf->barrier=1 conf->nr_waiting=24 conf->nr_pending=158 conf->pending_count=7 [ 514.731388] md2: wait_barrier(PID=7070,1) conf->barrier=1 conf->nr_waiting=25 conf->nr_pending=158 conf->pending_count=8 [ 514.731390] md2: wait_barrier(PID=7070,2) conf->barrier=1 conf->nr_waiting=24 conf->nr_pending=159 conf->pending_count=8 [ 514.731394] md2: wait_barrier(PID=7070,1) conf->barrier=1 conf->nr_waiting=25 conf->nr_pending=159 conf->pending_count=9 [ 514.731491] md2: wait_barrier(PID=9888,1) conf->barrier=1 conf->nr_waiting=26 conf->nr_pending=159 conf->pending_count=9 [ 514.731494] md2: wait_barrier(PID=9888,2) conf->barrier=1 conf->nr_waiting=25 conf->nr_pending=160 conf->pending_count=9 [ 514.731502] md2: wait_barrier(PID=9888,1) conf->barrier=1 conf->nr_waiting=26 conf->nr_pending=160 conf->pending_count=10 [ 514.731505] md2: wait_barrier(PID=9888,2) conf->barrier=1 conf->nr_waiting=25 conf->nr_pending=161 conf->pending_count=10 [ 514.731508] md2: wait_barrier(PID=9888,1) conf->barrier=1 conf->nr_waiting=26 conf->nr_pending=161 conf->pending_count=11 [ 514.731510] md2: wait_barrier(PID=9888,2) conf->barrier=1 conf->nr_waiting=25 conf->nr_pending=162 conf->pending_count=11 [ 514.731514] md2: wait_barrier(PID=9888,1) conf->barrier=1 conf->nr_waiting=26 conf->nr_pending=162 conf->pending_count=12 [ 514.731755] md2: wait_barrier(PID=9895,1) conf->barrier=1 conf->nr_waiting=27 conf->nr_pending=162 conf->pending_count=12 [ 514.731758] md2: wait_barrier(PID=9895,2) conf->barrier=1 conf->nr_waiting=26 conf->nr_pending=163 conf->pending_count=12 [ 514.731767] md2: wait_barrier(PID=9895,1) conf->barrier=1 conf->nr_waiting=27 conf->nr_pending=163 conf->pending_count=13 [ 514.731769] md2: wait_barrier(PID=9895,2) conf->barrier=1 conf->nr_waiting=26 conf->nr_pending=164 conf->pending_count=13 [ 514.731773] md2: wait_barrier(PID=9895,1) conf->barrier=1 conf->nr_waiting=27 conf->nr_pending=164 
conf->pending_count=14 [ 514.731775] md2: wait_barrier(PID=9895,2) conf->barrier=1 conf->nr_waiting=26 conf->nr_pending=165 conf->pending_count=14 [ 514.731826] md2: wait_barrier(PID=9895,1) conf->barrier=1 conf->nr_waiting=27 conf->nr_pending=165 conf->pending_count=15 [ 514.731875] md2: wait_barrier(PID=9897,1) conf->barrier=1 conf->nr_waiting=28 conf->nr_pending=165 conf->pending_count=15 [ 514.731878] md2: wait_barrier(PID=9897,2) conf->barrier=1 conf->nr_waiting=27 conf->nr_pending=166 conf->pending_count=15 [ 514.731884] md2: wait_barrier(PID=9897,1) conf->barrier=1 conf->nr_waiting=28 conf->nr_pending=166 conf->pending_count=16 [ 514.731887] md2: wait_barrier(PID=9897,2) conf->barrier=1 conf->nr_waiting=27 conf->nr_pending=167 conf->pending_count=16 [ 514.731890] md2: wait_barrier(PID=9897,1) conf->barrier=1 conf->nr_waiting=28 conf->nr_pending=167 conf->pending_count=17 [ 514.731893] md2: wait_barrier(PID=9897,2) conf->barrier=1 conf->nr_waiting=27 conf->nr_pending=168 conf->pending_count=17 [ 514.731902] md2: wait_barrier(PID=9897,1) conf->barrier=1 conf->nr_waiting=28 conf->nr_pending=168 conf->pending_count=18 [ 514.731971] md2: wait_barrier(PID=9896,1) conf->barrier=1 conf->nr_waiting=29 conf->nr_pending=168 conf->pending_count=18 [ 514.732304] md2: wait_barrier(PID=9908,1) conf->barrier=1 conf->nr_waiting=30 conf->nr_pending=168 conf->pending_count=18 [ 514.732307] md2: wait_barrier(PID=9908,2) conf->barrier=1 conf->nr_waiting=29 conf->nr_pending=169 conf->pending_count=18 [ 514.732340] md2: wait_barrier(PID=9908,1) conf->barrier=1 conf->nr_waiting=30 conf->nr_pending=169 conf->pending_count=19 [ 514.732343] md2: wait_barrier(PID=9908,2) conf->barrier=1 conf->nr_waiting=29 conf->nr_pending=170 conf->pending_count=19 [ 514.732346] md2: wait_barrier(PID=9908,1) conf->barrier=1 conf->nr_waiting=30 conf->nr_pending=170 conf->pending_count=20 [ 514.732349] md2: wait_barrier(PID=9908,2) conf->barrier=1 conf->nr_waiting=29 conf->nr_pending=171 conf->pending_count=20 [ 514.732352] md2: wait_barrier(PID=9908,1) conf->barrier=1 conf->nr_waiting=30 conf->nr_pending=171 conf->pending_count=21 [ 514.732397] md2: wait_barrier(PID=5774,1) conf->barrier=1 conf->nr_waiting=31 conf->nr_pending=171 conf->pending_count=21 [ 514.732400] md2: wait_barrier(PID=5774,2) conf->barrier=1 conf->nr_waiting=30 conf->nr_pending=172 conf->pending_count=21 [ 514.732412] md2: wait_barrier(PID=5774,1) conf->barrier=1 conf->nr_waiting=31 conf->nr_pending=172 conf->pending_count=22 [ 514.732414] md2: wait_barrier(PID=5774,2) conf->barrier=1 conf->nr_waiting=30 conf->nr_pending=173 conf->pending_count=22 [ 514.732419] md2: wait_barrier(PID=5774,1) conf->barrier=1 conf->nr_waiting=31 conf->nr_pending=173 conf->pending_count=23 [ 514.732421] md2: wait_barrier(PID=5774,2) conf->barrier=1 conf->nr_waiting=30 conf->nr_pending=174 conf->pending_count=23 [ 514.732424] md2: wait_barrier(PID=5774,1) conf->barrier=1 conf->nr_waiting=31 conf->nr_pending=174 conf->pending_count=24 [ 514.732459] md2: wait_barrier(PID=7068,1) conf->barrier=1 conf->nr_waiting=32 conf->nr_pending=174 conf->pending_count=24 [ 514.732461] md2: wait_barrier(PID=7068,2) conf->barrier=1 conf->nr_waiting=31 conf->nr_pending=175 conf->pending_count=24 [ 514.732467] md2: wait_barrier(PID=7068,1) conf->barrier=1 conf->nr_waiting=32 conf->nr_pending=175 conf->pending_count=25 [ 514.732469] md2: wait_barrier(PID=7068,2) conf->barrier=1 conf->nr_waiting=31 conf->nr_pending=176 conf->pending_count=25 [ 514.732473] md2: wait_barrier(PID=7068,1) conf->barrier=1 
conf->nr_waiting=32 conf->nr_pending=176 conf->pending_count=26 [ 514.732475] md2: wait_barrier(PID=7068,2) conf->barrier=1 conf->nr_waiting=31 conf->nr_pending=177 conf->pending_count=26 [ 514.732478] md2: wait_barrier(PID=7068,1) conf->barrier=1 conf->nr_waiting=32 conf->nr_pending=177 conf->pending_count=27 [ 514.734807] md2: wait_barrier(PID=9901,1) conf->barrier=1 conf->nr_waiting=33 conf->nr_pending=177 conf->pending_count=27 [ 514.734830] md2: wait_barrier(PID=9923,1) conf->barrier=1 conf->nr_waiting=34 conf->nr_pending=177 conf->pending_count=27 [ 514.735975] md2: wait_barrier(PID=7175,1) conf->barrier=1 conf->nr_waiting=35 conf->nr_pending=158 conf->pending_count=27 [ 514.735992] md2: wait_barrier(PID=7081,1) conf->barrier=1 conf->nr_waiting=36 conf->nr_pending=158 conf->pending_count=27 [ 514.736083] md2: wait_barrier(PID=9921,1) conf->barrier=1 conf->nr_waiting=37 conf->nr_pending=158 conf->pending_count=27 [ 514.736670] md2: wait_barrier(PID=7075,1) conf->barrier=1 conf->nr_waiting=38 conf->nr_pending=158 conf->pending_count=27 [ 514.740425] md2: wait_barrier(PID=9925,1) conf->barrier=1 conf->nr_waiting=39 conf->nr_pending=125 conf->pending_count=27 [ 514.740448] md2: wait_barrier(PID=41,1) conf->barrier=1 conf->nr_waiting=40 conf->nr_pending=125 conf->pending_count=27 [ 514.740465] md2: wait_barrier(PID=7083,1) conf->barrier=1 conf->nr_waiting=41 conf->nr_pending=125 conf->pending_count=27 [ 514.740478] md2: wait_barrier(PID=9911,1) conf->barrier=1 conf->nr_waiting=42 conf->nr_pending=125 conf->pending_count=27 [ 514.740492] md2: wait_barrier(PID=9903,1) conf->barrier=1 conf->nr_waiting=43 conf->nr_pending=125 conf->pending_count=27 [ 514.741067] md2: wait_barrier(PID=7071,1) conf->barrier=1 conf->nr_waiting=44 conf->nr_pending=125 conf->pending_count=27 [ 514.743463] md2: wait_barrier(PID=9920,1) conf->barrier=1 conf->nr_waiting=45 conf->nr_pending=105 conf->pending_count=27 [ 514.743478] md2: wait_barrier(PID=9924,1) conf->barrier=1 conf->nr_waiting=46 conf->nr_pending=105 conf->pending_count=27 [ 514.743489] md2: wait_barrier(PID=7146,1) conf->barrier=1 conf->nr_waiting=47 conf->nr_pending=105 conf->pending_count=27 [ 514.745765] md2: wait_barrier(PID=9890,1) conf->barrier=1 conf->nr_waiting=48 conf->nr_pending=89 conf->pending_count=27 [ 514.745784] md2: wait_barrier(PID=9879,1) conf->barrier=1 conf->nr_waiting=49 conf->nr_pending=89 conf->pending_count=27 [ 514.746556] md2: wait_barrier(PID=6,1) conf->barrier=1 conf->nr_waiting=50 conf->nr_pending=81 conf->pending_count=27 [ 514.748897] md2: wait_barrier(PID=9966,1) conf->barrier=1 conf->nr_waiting=51 conf->nr_pending=75 conf->pending_count=27 [ 514.749195] md2: wait_barrier(PID=9900,1) conf->barrier=1 conf->nr_waiting=52 conf->nr_pending=67 conf->pending_count=27 [ 514.750047] md2: wait_barrier(PID=9887,1) conf->barrier=1 conf->nr_waiting=53 conf->nr_pending=57 conf->pending_count=27 [ 514.750065] md2: wait_barrier(PID=9889,1) conf->barrier=1 conf->nr_waiting=54 conf->nr_pending=57 conf->pending_count=27 [ 514.750076] md2: wait_barrier(PID=9893,1) conf->barrier=1 conf->nr_waiting=55 conf->nr_pending=57 conf->pending_count=27 [ 514.752565] md2: wait_barrier(PID=9915,1) conf->barrier=1 conf->nr_waiting=56 conf->nr_pending=48 conf->pending_count=27 [ 514.752580] md2: wait_barrier(PID=7074,1) conf->barrier=1 conf->nr_waiting=57 conf->nr_pending=48 conf->pending_count=27 [ 514.752592] md2: wait_barrier(PID=9902,1) conf->barrier=1 conf->nr_waiting=58 conf->nr_pending=48 conf->pending_count=27 [ 514.752603] md2: 
wait_barrier(PID=9916,1) conf->barrier=1 conf->nr_waiting=59 conf->nr_pending=48 conf->pending_count=27 [ 514.752926] md2: wait_barrier(PID=42,1) conf->barrier=1 conf->nr_waiting=60 conf->nr_pending=27 conf->pending_count=27 [ 514.752946] md2: wait_barrier(PID=7073,1) conf->barrier=1 conf->nr_waiting=61 conf->nr_pending=27 conf->pending_count=27 [ 514.752960] md2: wait_barrier(PID=9914,1) conf->barrier=1 conf->nr_waiting=62 conf->nr_pending=27 conf->pending_count=27 [-- Attachment #3: raise_wait_barrier.c --] [-- Type: text/x-csrc, Size: 2457 bytes --] static void raise_barrier(struct r1conf *conf) { spin_lock_irq(&conf->resync_lock); pr_info("%s: raise_barrier(1): conf->barrier=%d conf->nr_waiting=%d conf->nr_pending=%d conf->pending_count=%d\n", mdname(conf->mddev), conf->barrier, conf->nr_waiting, conf->nr_pending, conf->pending_count); /* Wait until no block IO is waiting */ wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting, conf->resync_lock); /* block any new IO from starting */ conf->barrier++; pr_info("%s: raise_barrier(2): conf->barrier=%d conf->nr_waiting=%d conf->nr_pending=%d conf->pending_count=%d\n", mdname(conf->mddev), conf->barrier, conf->nr_waiting, conf->nr_pending, conf->pending_count); /* Now wait for all pending IO to complete */ wait_event_lock_irq(conf->wait_barrier, !conf->nr_pending && conf->barrier < RESYNC_DEPTH, conf->resync_lock); pr_info("%s: raise_barrier(3): conf->barrier=%d conf->nr_waiting=%d conf->nr_pending=%d conf->pending_count=%d\n", mdname(conf->mddev), conf->barrier, conf->nr_waiting, conf->nr_pending, conf->pending_count); spin_unlock_irq(&conf->resync_lock); } static void wait_barrier(struct r1conf *conf) { int waited = 0; spin_lock_irq(&conf->resync_lock); if (conf->barrier) { conf->nr_waiting++; waited = 1; pr_info("%s: wait_barrier(PID=%d,1) conf->barrier=%d conf->nr_waiting=%d conf->nr_pending=%d conf->pending_count=%d\n", mdname(conf->mddev), current->pid, conf->barrier, conf->nr_waiting, conf->nr_pending, conf->pending_count); if (conf->nr_pending && current->bio_list && !bio_list_empty(current->bio_list)) dump_stack(); /* Wait for the barrier to drop. * However if there are already pending * requests (preventing the barrier from * rising completely), and the * pre-process bio queue isn't empty, * then don't wait, as we need to empty * that queue to get the nr_pending * count down. */ wait_event_lock_irq(conf->wait_barrier, !conf->barrier || (conf->nr_pending && current->bio_list && !bio_list_empty(current->bio_list)), conf->resync_lock); conf->nr_waiting--; } conf->nr_pending++; if (waited) pr_info("%s: wait_barrier(PID=%d,2) conf->barrier=%d conf->nr_waiting=%d conf->nr_pending=%d conf->pending_count=%d\n", mdname(conf->mddev), current->pid, conf->barrier, conf->nr_waiting, conf->nr_pending, conf->pending_count); spin_unlock_irq(&conf->resync_lock); } ^ permalink raw reply [flat|nested] 20+ messages in thread
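The interaction captured in the attachment above can also be modelled entirely in userspace. The toy program below is an analogue of the raid1 barrier logic, not kernel code; all names are invented for illustration. A "raid1d" thread raises the barrier and waits for nr_pending to reach 0 while "writer" threads keep slipping requests onto a pending list that only raid1d would ever flush, so it deadlocks the same way. Build with "gcc -pthread barrier_model.c".

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Toy model of the raid1 barrier deadlock (illustration only). */
static pthread_mutex_t resync_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wait_barrier = PTHREAD_COND_INITIALIZER;
static int barrier;        /* raised by the "raid1d" thread */
static int nr_pending;     /* in-flight writes */
static int pending_count;  /* writes parked on the pending list */

static void *writer(void *arg)
{
	(void)arg;
	for (;;) {
		pthread_mutex_lock(&resync_lock);
		/* In the real code the write passes wait_barrier() despite the
		 * raised barrier because the submitter still has bios on
		 * current->bio_list; here we simply skip the barrier check.
		 * The resulting mirror writes land on the pending list. */
		nr_pending++;
		pending_count++;
		pthread_mutex_unlock(&resync_lock);
		usleep(100 * 1000);
		/* nr_pending would only drop after the pending list is flushed,
		 * and only the "raid1d" thread ever does that. */
	}
	return NULL;
}

static void *raid1d(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&resync_lock);
	barrier = 1;
	printf("raid1d: barrier raised, waiting for nr_pending == 0\n");
	while (nr_pending != 0)
		pthread_cond_wait(&wait_barrier, &resync_lock); /* sleeps forever */
	pthread_mutex_unlock(&resync_lock);
	printf("raid1d: all pending IO drained\n");              /* never printed */
	return NULL;
}

int main(void)
{
	pthread_t w, d;

	pthread_create(&w, NULL, writer, NULL);
	sleep(1);
	pthread_create(&d, NULL, raid1d, NULL);
	sleep(5);

	pthread_mutex_lock(&resync_lock);
	printf("after 5s: barrier=%d nr_pending=%d pending_count=%d (stuck)\n",
	       barrier, nr_pending, pending_count);
	pthread_mutex_unlock(&resync_lock);
	return 0; /* exiting main tears down the stuck threads */
}

The fix discussed in the next message corresponds to making the waiter drain the pending list itself from inside its wait loop, i.e. flushing pending_count before each sleep in raid1d().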
* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier
  2013-06-02 12:43 ` Alexander Lyakas
@ 2013-06-04  1:49   ` NeilBrown
  2013-06-04  9:52     ` Alexander Lyakas
  2013-06-06 15:00     ` Tregaron Bayly
  0 siblings, 2 replies; 20+ messages in thread
From: NeilBrown @ 2013-06-04  1:49 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid, Shyam Kaushik, yair

[-- Attachment #1: Type: text/plain, Size: 2948 bytes --]

On Sun, 2 Jun 2013 15:43:41 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:

> Hello Neil,
> I believe I have found what is causing the deadlock. It happens in two flavors:
>
> 1)
> # raid1d() is called, and conf->pending_bio_list is non-empty at this point
> # raid1d() calls md_check_recovery(), which eventually calls
> raid1_add_disk(), which calls raise_barrier()
> # Now raise_barrier will wait for conf->nr_pending to become 0, but it
> cannot become 0, because there are bios sitting in
> conf->pending_bio_list, which nobody will flush, because raid1d is the
> one supposed to call flush_pending_writes(), either directly or via
> handle_read_error. But it is stuck in raise_barrier.
>
> 2)
> # raid1_add_disk() calls raise_barrier(), and waits for
> conf->nr_pending to become 0, as before
> # new WRITE comes and calls wait_barrier(), but this thread has a
> non-empty current->bio_list
> # In this case, the code allows the WRITE to go through
> wait_barrier(), and trigger WRITEs to mirror legs, but these WRITEs
> again end up in conf->pending_bio_list (either via raid1_unplug or
> directly). But nobody will flush conf->pending_bio_list, because
> raid1d is stuck in raise_barrier.
>
> Previously, for example in kernel 3.2, raid1_add_disk did not call
> raise_barrier, so this problem did not happen.
>
> Attached is a reproduction with some prints that I added to
> raise_barrier and wait_barrier (their code also attached). It
> demonstrates case 2. It shows that once raise_barrier got called,
> conf->nr_pending drops down, until it equals the number of
> wait_barrier calls, that slipped through because of non-empty
> current->bio_list. And at this point, this array hangs.
>
> Can you please comment on how to fix this problem. It looks like a
> real deadlock.
> We can perhaps call md_check_recovery() after flush_pending_writes(),
> but this only makes the window smaller, not closes it entirely. But it
> looks like we really should not be calling raise_barrier from raid1d.
>
> Thanks,
> Alex.

Hi Alex,
 thanks for the analysis.

Does the following patch fix it?  It makes raise_barrier more like
freeze_array().
If not, could you try making the same change to the first
wait_event_lock_irq in raise_barrier?

Thanks.
NeilBrown

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 328fa2d..d34f892 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -828,9 +828,9 @@ static void raise_barrier(struct r1conf *conf)
 	conf->barrier++;
 
 	/* Now wait for all pending IO to complete */
-	wait_event_lock_irq(conf->wait_barrier,
-			    !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
-			    conf->resync_lock);
+	wait_event_lock_irq_cmd(conf->wait_barrier,
+				!conf->nr_pending && conf->barrier < RESYNC_DEPTH,
+				conf->resync_lock, flush_pending_writes(conf));
 
 	spin_unlock_irq(&conf->resync_lock);
 }

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply related	[flat|nested] 20+ messages in thread
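A note on the macro used in the patch above: wait_event_lock_irq_cmd() was added to include/linux/wait.h around 3.8 (previously an md-private macro). As far as I recall it behaves roughly like the sketch below, with the extra cmd argument run each time around the wait loop, after the spinlock has been dropped and before the task sleeps. Passing flush_pending_writes(conf) there lets the waiter drain conf->pending_bio_list itself instead of depending on raid1d, which is the very task doing the waiting. This is an approximate expansion from memory, not the verbatim macro:

/* Approximate behaviour of wait_event_lock_irq_cmd(wq, condition, lock, cmd)
 * in the 3.8-era include/linux/wait.h -- a sketch, not verbatim.  The point
 * is where "cmd" runs inside the loop. */
#define wait_event_lock_irq_cmd_sketch(wq, condition, lock, cmd)	\
do {									\
	DEFINE_WAIT(__wait);						\
	for (;;) {							\
		prepare_to_wait(&wq, &__wait, TASK_UNINTERRUPTIBLE);	\
		if (condition)						\
			break;						\
		spin_unlock_irq(&lock);					\
		cmd;	/* e.g. flush_pending_writes(conf) */		\
		schedule();						\
		spin_lock_irq(&lock);					\
	}								\
	finish_wait(&wq, &__wait);					\
} while (0)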
* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier 2013-06-04 1:49 ` NeilBrown @ 2013-06-04 9:52 ` Alexander Lyakas 2013-06-06 15:00 ` Tregaron Bayly 1 sibling, 0 replies; 20+ messages in thread From: Alexander Lyakas @ 2013-06-04 9:52 UTC (permalink / raw) To: NeilBrown, linux-raid; +Cc: Shyam Kaushik, yair [-- Attachment #1: Type: text/plain, Size: 6344 bytes --] Hello Neil, thanks for your response. On Tue, Jun 4, 2013 at 4:49 AM, NeilBrown <neilb@suse.de> wrote: > On Sun, 2 Jun 2013 15:43:41 +0300 Alexander Lyakas <alex.bolshoy@gmail.com> > wrote: > >> Hello Neil, >> I believe I have found what is causing the deadlock. It happens in two flavors: >> >> 1) >> # raid1d() is called, and conf->pending_bio_list is non-empty at this point >> # raid1d() calls md_check_recovery(), which eventually calls >> raid1_add_disk(), which calls raise_barrier() >> # Now raise_barrier will wait for conf->nr_pending to become 0, but it >> cannot become 0, because there are bios sitting in >> conf->pending_bio_list, which nobody will flush, because raid1d is the >> one supposed to call flush_pending_writes(), either directly or via >> handle_read_error. But it is stuck in raise_barrier. >> >> 2) >> # raid1_add_disk() calls raise_barrier(), and waits for >> conf->nr_pending to become 0, as before >> # new WRITE comes and calls wait_barrier(), but this thread has a >> non-empty current->bio_list >> # In this case, the code allows the WRITE to go through >> wait_barrier(), and trigger WRITEs to mirror legs, but these WRITEs >> again end up in conf->pending_bio_list (either via raid1_unplug or >> directly). But nobody will flush conf->pending_bio_list, because >> raid1d is stuck in raise_barrier. >> >> Previously, for example in kernel 3.2, raid1_add_disk did not call >> raise_barrier, so this problem did not happen. >> >> Attached is a reproduction with some prints that I added to >> raise_barrier and wait_barrier (their code also attached). It >> demonstrates case 2. It shows that once raise_barrier got called, >> conf->nr_pending drops down, until it equals the number of >> wait_barrier calls, that slipped through because of non-empty >> current->bio_list. And at this point, this array hangs. >> >> Can you please comment on how to fix this problem. It looks like a >> real deadlock. >> We can perhaps call md_check_recovery() after flush_pending_writes(), >> but this only makes the window smaller, not closes it entirely. But it >> looks like we really should not be calling raise_barrier from raid1d. >> >> Thanks, >> Alex. > > Hi Alex, > thanks for the analysis. > > Does the following patch fix it? It makes raise_barrier more like > freeze_array(). > If not, could you try making the same change to the first > wait_event_lock_irq in raise_barrier? > > Thanks. > NeilBrown > > > > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c > index 328fa2d..d34f892 100644 > --- a/drivers/md/raid1.c > +++ b/drivers/md/raid1.c > @@ -828,9 +828,9 @@ static void raise_barrier(struct r1conf *conf) > conf->barrier++; > > /* Now wait for all pending IO to complete */ > - wait_event_lock_irq(conf->wait_barrier, > - !conf->nr_pending && conf->barrier < RESYNC_DEPTH, > - conf->resync_lock); > + wait_event_lock_irq_cmd(conf->wait_barrier, > + !conf->nr_pending && conf->barrier < RESYNC_DEPTH, > + conf->resync_lock, flush_pending_writes(conf)); > > spin_unlock_irq(&conf->resync_lock); > } Yes, this patch fixes the problem, thanks! Actually, yesterday, I attempted a similar fix[1] and it also seemed to work. 
I have several comments about this patch: # It fully fixes case 1, and almost fully closes the race window for case 2. I attach a reproduction, in which it can be seen, that while raise_barrier is waiting, new bios slip through wait_barrier (because of non-empty current->bio_list), and raise_barrier is awaken more than once to flush them. Eventually it takes 2 seconds for raise_barrier to complete. This is still much better than sleeping forever though:) # We are now calling flush_pending_writes() while mddev_lock() is locked. It doesn't seem problematic to call generic_make_request() under this lock, but flush_pending_writes() also does bitmap_unplug(), which may wait for superblock update etc. Is this ok? I found one or two other places, where we wait for superblock update under mddev_lock (ADD_NEW_DISK, for example), so it's probably ok? # I am concerned that raise_barrier is also called from sync_request, so it may also attempt to flush_pending_writes. Can we make a more conservative patch, like this: /* Now wait for all pending IO to complete */ - wait_event_lock_irq(conf->wait_barrier, - !conf->nr_pending && conf->barrier < RESYNC_DEPTH, - conf->resync_lock); + if (conf->mddev->thread && conf->mddev->thread->tsk == current) { + /* + * If we are called from the management thread (raid1d), we + * need to flush the bios that might be sitting in conf->pending_bio_list, + * otherwise, we will wait for them here forever + */ + wait_event_lock_irq_cmd(conf->wait_barrier, + !conf->nr_pending && conf->barrier < RESYNC_DEPTH, + conf->resync_lock, flush_pending_writes(conf)); + } else { + wait_event_lock_irq(conf->wait_barrier, + !conf->nr_pending && conf->barrier < RESYNC_DEPTH, + conf->resync_lock); + } + spin_unlock_irq(&conf->resync_lock); Thanks again. I will work & reply on other issues I mentioned, for some of them I already made patches. Alex. [1] /* block any new IO from starting */ conf->barrier++; + /* if we are raising the barrier while inside raid1d (which we really shouldn't)... */ + if (conf->mddev->thread && conf->mddev->thread->tsk == current) { + while (!(!conf->nr_pending && conf->barrier < RESYNC_DEPTH)) { + int nr_pending = conf->nr_pending; + + spin_unlock_irq(&conf->resync_lock); + + if (nr_pending) + flush_pending_writes(conf); + wait_event_timeout(conf->wait_barrier, + !conf->nr_pending && conf->barrier < RESYNC_DEPTH, + msecs_to_jiffies(100)); + + spin_lock_irq(&conf->resync_lock); + } + + spin_unlock_irq(&conf->resync_lock); + + return; + } /* Now wait for all pending IO to complete */ wait_event_lock_irq(conf->wait_barrier, !conf->nr_pending && conf->barrier < RESYNC_DEPTH, [-- Attachment #2: repro.tgz --] [-- Type: application/x-gzip, Size: 12349 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier 2013-06-04 1:49 ` NeilBrown 2013-06-04 9:52 ` Alexander Lyakas @ 2013-06-06 15:00 ` Tregaron Bayly 2013-06-08 9:45 ` Alexander Lyakas 2013-06-12 1:30 ` NeilBrown 1 sibling, 2 replies; 20+ messages in thread From: Tregaron Bayly @ 2013-06-06 15:00 UTC (permalink / raw) To: NeilBrown; +Cc: Alexander Lyakas, linux-raid, Shyam Kaushik, yair On 06/03/2013 07:49 PM, NeilBrown wrote: > On Sun, 2 Jun 2013 15:43:41 +0300 Alexander Lyakas <alex.bolshoy@gmail.com> > wrote: > >> Hello Neil, >> I believe I have found what is causing the deadlock. It happens in two flavors: >> >> 1) >> # raid1d() is called, and conf->pending_bio_list is non-empty at this point >> # raid1d() calls md_check_recovery(), which eventually calls >> raid1_add_disk(), which calls raise_barrier() >> # Now raise_barrier will wait for conf->nr_pending to become 0, but it >> cannot become 0, because there are bios sitting in >> conf->pending_bio_list, which nobody will flush, because raid1d is the >> one supposed to call flush_pending_writes(), either directly or via >> handle_read_error. But it is stuck in raise_barrier. >> >> 2) >> # raid1_add_disk() calls raise_barrier(), and waits for >> conf->nr_pending to become 0, as before >> # new WRITE comes and calls wait_barrier(), but this thread has a >> non-empty current->bio_list >> # In this case, the code allows the WRITE to go through >> wait_barrier(), and trigger WRITEs to mirror legs, but these WRITEs >> again end up in conf->pending_bio_list (either via raid1_unplug or >> directly). But nobody will flush conf->pending_bio_list, because >> raid1d is stuck in raise_barrier. >> >> Previously, for example in kernel 3.2, raid1_add_disk did not call >> raise_barrier, so this problem did not happen. >> >> Attached is a reproduction with some prints that I added to >> raise_barrier and wait_barrier (their code also attached). It >> demonstrates case 2. It shows that once raise_barrier got called, >> conf->nr_pending drops down, until it equals the number of >> wait_barrier calls, that slipped through because of non-empty >> current->bio_list. And at this point, this array hangs. >> >> Can you please comment on how to fix this problem. It looks like a >> real deadlock. >> We can perhaps call md_check_recovery() after flush_pending_writes(), >> but this only makes the window smaller, not closes it entirely. But it >> looks like we really should not be calling raise_barrier from raid1d. >> >> Thanks, >> Alex. > > Hi Alex, > thanks for the analysis. > > Does the following patch fix it? It makes raise_barrier more like > freeze_array(). > If not, could you try making the same change to the first > wait_event_lock_irq in raise_barrier? > > Thanks. > NeilBrown > > > > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c > index 328fa2d..d34f892 100644 > --- a/drivers/md/raid1.c > +++ b/drivers/md/raid1.c > @@ -828,9 +828,9 @@ static void raise_barrier(struct r1conf *conf) > conf->barrier++; > > /* Now wait for all pending IO to complete */ > - wait_event_lock_irq(conf->wait_barrier, > - !conf->nr_pending && conf->barrier < RESYNC_DEPTH, > - conf->resync_lock); > + wait_event_lock_irq_cmd(conf->wait_barrier, > + !conf->nr_pending && conf->barrier < RESYNC_DEPTH, > + conf->resync_lock, flush_pending_writes(conf)); > > spin_unlock_irq(&conf->resync_lock); > } > Neil, This deadlock also cropped up in 3.4 between .37 and .38. Passing flush_pending_writes(conf) as cmd to wait_event_lock_irq seems to fix it there as well. 
--- linux-3.4.38/drivers/md/raid1.c	2013-03-28 13:12:41.000000000 -0600
+++ linux-3.4.38.patch/drivers/md/raid1.c	2013-06-04 12:17:35.314194903 -0600
@@ -751,7 +751,7 @@
 	/* Now wait for all pending IO to complete */
 	wait_event_lock_irq(conf->wait_barrier,
 			    !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
-			    conf->resync_lock, );
+			    conf->resync_lock, flush_pending_writes(conf));
 
 	spin_unlock_irq(&conf->resync_lock);
 }

Thanks,

Tregaron

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier 2013-06-06 15:00 ` Tregaron Bayly @ 2013-06-08 9:45 ` Alexander Lyakas 2013-06-12 0:42 ` NeilBrown 2013-06-12 1:30 ` NeilBrown 1 sibling, 1 reply; 20+ messages in thread From: Alexander Lyakas @ 2013-06-08 9:45 UTC (permalink / raw) To: NeilBrown, linux-raid; +Cc: Shyam Kaushik, yair, Tregaron Bayly Hi Neil, after reading the code of raid1.c, I see that there's also conf->retry_list, which is also flushed by raid1d, but not by flush_pending_writes(). So I think it can also cause similar deadlock, but I don't know how to fix it:( Alex. On Thu, Jun 6, 2013 at 6:00 PM, Tregaron Bayly <tbayly@bluehost.com> wrote: > On 06/03/2013 07:49 PM, NeilBrown wrote: >> >> On Sun, 2 Jun 2013 15:43:41 +0300 Alexander Lyakas >> <alex.bolshoy@gmail.com> >> wrote: >> >>> Hello Neil, >>> I believe I have found what is causing the deadlock. It happens in two >>> flavors: >>> >>> 1) >>> # raid1d() is called, and conf->pending_bio_list is non-empty at this >>> point >>> # raid1d() calls md_check_recovery(), which eventually calls >>> raid1_add_disk(), which calls raise_barrier() >>> # Now raise_barrier will wait for conf->nr_pending to become 0, but it >>> cannot become 0, because there are bios sitting in >>> conf->pending_bio_list, which nobody will flush, because raid1d is the >>> one supposed to call flush_pending_writes(), either directly or via >>> handle_read_error. But it is stuck in raise_barrier. >>> >>> 2) >>> # raid1_add_disk() calls raise_barrier(), and waits for >>> conf->nr_pending to become 0, as before >>> # new WRITE comes and calls wait_barrier(), but this thread has a >>> non-empty current->bio_list >>> # In this case, the code allows the WRITE to go through >>> wait_barrier(), and trigger WRITEs to mirror legs, but these WRITEs >>> again end up in conf->pending_bio_list (either via raid1_unplug or >>> directly). But nobody will flush conf->pending_bio_list, because >>> raid1d is stuck in raise_barrier. >>> >>> Previously, for example in kernel 3.2, raid1_add_disk did not call >>> raise_barrier, so this problem did not happen. >>> >>> Attached is a reproduction with some prints that I added to >>> raise_barrier and wait_barrier (their code also attached). It >>> demonstrates case 2. It shows that once raise_barrier got called, >>> conf->nr_pending drops down, until it equals the number of >>> wait_barrier calls, that slipped through because of non-empty >>> current->bio_list. And at this point, this array hangs. >>> >>> Can you please comment on how to fix this problem. It looks like a >>> real deadlock. >>> We can perhaps call md_check_recovery() after flush_pending_writes(), >>> but this only makes the window smaller, not closes it entirely. But it >>> looks like we really should not be calling raise_barrier from raid1d. >>> >>> Thanks, >>> Alex. >> >> >> Hi Alex, >> thanks for the analysis. >> >> Does the following patch fix it? It makes raise_barrier more like >> freeze_array(). >> If not, could you try making the same change to the first >> wait_event_lock_irq in raise_barrier? >> >> Thanks. 
>> NeilBrown >> >> >> >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c >> index 328fa2d..d34f892 100644 >> --- a/drivers/md/raid1.c >> +++ b/drivers/md/raid1.c >> @@ -828,9 +828,9 @@ static void raise_barrier(struct r1conf *conf) >> conf->barrier++; >> >> /* Now wait for all pending IO to complete */ >> - wait_event_lock_irq(conf->wait_barrier, >> - !conf->nr_pending && conf->barrier < >> RESYNC_DEPTH, >> - conf->resync_lock); >> + wait_event_lock_irq_cmd(conf->wait_barrier, >> + !conf->nr_pending && conf->barrier < >> RESYNC_DEPTH, >> + conf->resync_lock, >> flush_pending_writes(conf)); >> >> spin_unlock_irq(&conf->resync_lock); >> } >> > > Neil, > > This deadlock also cropped up in 3.4 between .37 and .38. Passing > flush_pending_writes(conf) as cmd to wait_event_lock_irq seems to fix it > there as well. > > --- linux-3.4.38/drivers/md/raid1.c 2013-03-28 13:12:41.000000000 -0600 > +++ linux-3.4.38.patch/drivers/md/raid1.c 2013-06-04 > 12:17:35.314194903 -0600 > @@ -751,7 +751,7 @@ > > /* Now wait for all pending IO to complete */ > wait_event_lock_irq(conf->wait_barrier, > > !conf->nr_pending && conf->barrier < > RESYNC_DEPTH, > - conf->resync_lock, ); > > + conf->resync_lock, flush_pending_writes(conf)); > spin_unlock_irq(&conf->resync_lock); > } > > Thanks, > > Tregaron ^ permalink raw reply [flat|nested] 20+ messages in thread
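Background for the retry_list concern, as far as I can tell from the raid1.c of that era: a failed read is parked for raid1d roughly as sketched below (approximate, not verbatim). The parked r1_bio still counts towards conf->nr_pending and is additionally counted in conf->nr_queued, and only raid1d ever takes it off retry_list. So a raise_barrier() waiting inside raid1d for nr_pending to hit 0 can hang on these as well, which is what the freeze_array() approach in the next message accounts for via nr_queued.

/* Sketch of reschedule_retry() from 3.x drivers/md/raid1.c (approximate,
 * not verbatim): the failed r1_bio is queued for raid1d and nr_queued is
 * bumped; nr_pending only drops once raid1d finishes handling it. */
static void reschedule_retry(struct r1bio *r1_bio)
{
	unsigned long flags;
	struct r1conf *conf = r1_bio->mddev->private;

	spin_lock_irqsave(&conf->device_lock, flags);
	list_add(&r1_bio->retry_list, &conf->retry_list);
	conf->nr_queued++;
	spin_unlock_irqrestore(&conf->device_lock, flags);

	wake_up(&conf->wait_barrier);
	md_wakeup_thread(conf->mddev->thread);	/* poke raid1d */
}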
* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier
  2013-06-08  9:45 ` Alexander Lyakas
@ 2013-06-12  0:42   ` NeilBrown
  0 siblings, 0 replies; 20+ messages in thread
From: NeilBrown @ 2013-06-12  0:42 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid, Shyam Kaushik, yair, Tregaron Bayly

[-- Attachment #1: Type: text/plain, Size: 2219 bytes --]

On Sat, 8 Jun 2013 12:45:41 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:

> Hi Neil,
> after reading the code of raid1.c, I see that there's also
> conf->retry_list, which is also flushed by raid1d, but not by
> flush_pending_writes(). So I think it can also cause similar deadlock,
> but I don't know how to fix it:(
>

Good point.
Requests in retry_list are counted in nr_queued, which is checked in
freeze_array().  And freeze_array() already calls flush_pending_writes().
So I suspect the right thing to do is use freeze_array() in place of
raise_barrier().

So maybe this is the right approach.  Testing greatly appreciated...

Thanks,
NeilBrown

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 226dcd0..240b328 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1569,8 +1569,8 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
 		 * we wait for all outstanding requests to complete.
 		 */
 		synchronize_sched();
-		raise_barrier(conf);
-		lower_barrier(conf);
+		freeze_array(conf);
+		unfreeze_array(conf);
 		clear_bit(Unmerged, &rdev->flags);
 	}
 	md_integrity_add_rdev(rdev, mddev);
@@ -1620,11 +1620,11 @@ static int raid1_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
 			 */
 			struct md_rdev *repl =
 				conf->mirrors[conf->raid_disks + number].rdev;
-			raise_barrier(conf);
+			freeze_array(conf);
 			clear_bit(Replacement, &repl->flags);
 			p->rdev = repl;
 			conf->mirrors[conf->raid_disks + number].rdev = NULL;
-			lower_barrier(conf);
+			unfreeze_array(conf);
 			clear_bit(WantReplacement, &rdev->flags);
 		} else
 			clear_bit(WantReplacement, &rdev->flags);
@@ -3021,7 +3021,7 @@ static int raid1_reshape(struct mddev *mddev)
 			return -ENOMEM;
 	}
 
-	raise_barrier(conf);
+	freeze_array(conf);
 
 	/* ok, everything is stopped */
 	oldpool = conf->r1bio_pool;
@@ -3052,7 +3052,7 @@ static int raid1_reshape(struct mddev *mddev)
 	conf->raid_disks = mddev->raid_disks = raid_disks;
 	mddev->delta_disks = 0;
 
-	lower_barrier(conf);
+	unfreeze_array(conf);
 
 	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 	md_wakeup_thread(mddev->thread);

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply related	[flat|nested] 20+ messages in thread
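For reference, freeze_array()/unfreeze_array() in the raid1.c of that era look approximately like the sketch below (from memory, not verbatim). Unlike raise_barrier(), the wait condition tolerates requests parked on retry_list (counted in nr_queued, plus one for the request being handled), and flush_pending_writes() runs inside the wait loop, so neither pending_bio_list nor retry_list can wedge the waiter:

/* Approximate 3.8-era freeze_array()/unfreeze_array() from
 * drivers/md/raid1.c (sketch, not verbatim). */
static void freeze_array(struct r1conf *conf)
{
	spin_lock_irq(&conf->resync_lock);
	conf->barrier++;
	conf->nr_waiting++;
	/* Wait until nr_pending matches nr_queued+1: everything still in
	 * flight is either parked for raid1d or is the request being
	 * handled.  Pending writes are flushed while waiting. */
	wait_event_lock_irq_cmd(conf->wait_barrier,
				conf->nr_pending == conf->nr_queued + 1,
				conf->resync_lock,
				flush_pending_writes(conf));
	spin_unlock_irq(&conf->resync_lock);
}

static void unfreeze_array(struct r1conf *conf)
{
	/* reverse the effect of the freeze */
	spin_lock_irq(&conf->resync_lock);
	conf->barrier--;
	conf->nr_waiting--;
	wake_up(&conf->wait_barrier);
	spin_unlock_irq(&conf->resync_lock);
}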
* Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier 2013-06-06 15:00 ` Tregaron Bayly 2013-06-08 9:45 ` Alexander Lyakas @ 2013-06-12 1:30 ` NeilBrown 1 sibling, 0 replies; 20+ messages in thread From: NeilBrown @ 2013-06-12 1:30 UTC (permalink / raw) To: Tregaron Bayly; +Cc: Alexander Lyakas, linux-raid, Shyam Kaushik, yair [-- Attachment #1: Type: text/plain, Size: 4261 bytes --] On Thu, 06 Jun 2013 09:00:26 -0600 Tregaron Bayly <tbayly@bluehost.com> wrote: > On 06/03/2013 07:49 PM, NeilBrown wrote: > > On Sun, 2 Jun 2013 15:43:41 +0300 Alexander Lyakas <alex.bolshoy@gmail.com> > > wrote: > > > >> Hello Neil, > >> I believe I have found what is causing the deadlock. It happens in two flavors: > >> > >> 1) > >> # raid1d() is called, and conf->pending_bio_list is non-empty at this point > >> # raid1d() calls md_check_recovery(), which eventually calls > >> raid1_add_disk(), which calls raise_barrier() > >> # Now raise_barrier will wait for conf->nr_pending to become 0, but it > >> cannot become 0, because there are bios sitting in > >> conf->pending_bio_list, which nobody will flush, because raid1d is the > >> one supposed to call flush_pending_writes(), either directly or via > >> handle_read_error. But it is stuck in raise_barrier. > >> > >> 2) > >> # raid1_add_disk() calls raise_barrier(), and waits for > >> conf->nr_pending to become 0, as before > >> # new WRITE comes and calls wait_barrier(), but this thread has a > >> non-empty current->bio_list > >> # In this case, the code allows the WRITE to go through > >> wait_barrier(), and trigger WRITEs to mirror legs, but these WRITEs > >> again end up in conf->pending_bio_list (either via raid1_unplug or > >> directly). But nobody will flush conf->pending_bio_list, because > >> raid1d is stuck in raise_barrier. > >> > >> Previously, for example in kernel 3.2, raid1_add_disk did not call > >> raise_barrier, so this problem did not happen. > >> > >> Attached is a reproduction with some prints that I added to > >> raise_barrier and wait_barrier (their code also attached). It > >> demonstrates case 2. It shows that once raise_barrier got called, > >> conf->nr_pending drops down, until it equals the number of > >> wait_barrier calls, that slipped through because of non-empty > >> current->bio_list. And at this point, this array hangs. > >> > >> Can you please comment on how to fix this problem. It looks like a > >> real deadlock. > >> We can perhaps call md_check_recovery() after flush_pending_writes(), > >> but this only makes the window smaller, not closes it entirely. But it > >> looks like we really should not be calling raise_barrier from raid1d. > >> > >> Thanks, > >> Alex. > > > > Hi Alex, > > thanks for the analysis. > > > > Does the following patch fix it? It makes raise_barrier more like > > freeze_array(). > > If not, could you try making the same change to the first > > wait_event_lock_irq in raise_barrier? > > > > Thanks. 
> > NeilBrown > > > > > > > > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c > > index 328fa2d..d34f892 100644 > > --- a/drivers/md/raid1.c > > +++ b/drivers/md/raid1.c > > @@ -828,9 +828,9 @@ static void raise_barrier(struct r1conf *conf) > > conf->barrier++; > > > > /* Now wait for all pending IO to complete */ > > - wait_event_lock_irq(conf->wait_barrier, > > - !conf->nr_pending && conf->barrier < RESYNC_DEPTH, > > - conf->resync_lock); > > + wait_event_lock_irq_cmd(conf->wait_barrier, > > + !conf->nr_pending && conf->barrier < RESYNC_DEPTH, > > + conf->resync_lock, flush_pending_writes(conf)); > > > > spin_unlock_irq(&conf->resync_lock); > > } > > > > Neil, > > This deadlock also cropped up in 3.4 between .37 and .38. Passing flush_pending_writes(conf) as cmd to wait_event_lock_irq seems to fix it there as well. > > --- linux-3.4.38/drivers/md/raid1.c 2013-03-28 13:12:41.000000000 -0600 > +++ linux-3.4.38.patch/drivers/md/raid1.c 2013-06-04 12:17:35.314194903 -0600 > @@ -751,7 +751,7 @@ > /* Now wait for all pending IO to complete */ > wait_event_lock_irq(conf->wait_barrier, > !conf->nr_pending && conf->barrier < RESYNC_DEPTH, > - conf->resync_lock, ); > + conf->resync_lock, flush_pending_writes(conf)); > > spin_unlock_irq(&conf->resync_lock); > } > I suspect it was already there in 3.4.37 .. in fact I think it was there in 3.4. I'll mark the fix for including in 3.4-stable Thanks for the report. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
Thread overview: 20+ messages
2013-02-21 22:58 BUG - raid 1 deadlock on handle_read_error / wait_barrier Tregaron Bayly
2013-02-22  3:44 ` Joe Lawrence
2013-02-22 11:52 ` majianpeng
2013-02-22 16:03 ` Tregaron Bayly
2013-02-22 18:14 ` Joe Lawrence
2013-02-24 22:43 ` NeilBrown
2013-02-25  0:04 ` NeilBrown
2013-02-25 16:11 ` Tregaron Bayly
2013-02-25 22:54 ` NeilBrown
2013-02-26 14:09 ` Joe Lawrence
2013-05-16 14:07 ` Alexander Lyakas
2013-05-20  7:17 ` NeilBrown
2013-05-30 14:30 ` Alexander Lyakas
2013-06-02 12:43 ` Alexander Lyakas
2013-06-04  1:49 ` NeilBrown
2013-06-04  9:52 ` Alexander Lyakas
2013-06-06 15:00 ` Tregaron Bayly
2013-06-08  9:45 ` Alexander Lyakas
2013-06-12  0:42 ` NeilBrown
2013-06-12  1:30 ` NeilBrown