From: majianpeng
Subject: Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier
Date: Fri, 22 Feb 2013 19:52:22 +0800
Message-ID: <51275BF6.2050702@gmail.com>
In-Reply-To: <1361487504.4863.54.camel@linux-lxtg.site>
To: tbayly@bluehost.com
Cc: linux-raid@vger.kernel.org, neilb@suse.de

On 02/22/2013 06:58 AM, Tregaron Bayly wrote:
> Symptom:
> A RAID 1 array ends up with two threads (flush and raid1) stuck in D
> state forever. The array is inaccessible and the host must be
> restarted to restore access to the array.
>
> I have some scripted workloads that reproduce this within a couple of
> hours on kernels from 3.6.11 to 3.8-rc7. I cannot reproduce it on
> 3.4.32. 3.5.7 ends up with three threads stuck in D state, but the
> stacks are different from this bug (as it's EOL, maybe that is of
> interest in bisecting the problem?).
>
> Details:
>
> [flush-9:16]
> [] wait_barrier+0x124/0x180 [raid1]
> [] make_request+0x85/0xd50 [raid1]
> [] md_make_request+0xd3/0x200
> [] generic_make_request+0xca/0x100
> [] submit_bio+0x79/0x160
> [] submit_bh+0x128/0x200
> [] __block_write_full_page+0x1d0/0x330
> [] block_write_full_page_endio+0xce/0x100
> [] block_write_full_page+0x15/0x20
> [] blkdev_writepage+0x18/0x20
> [] __writepage+0x17/0x40
> [] write_cache_pages+0x1d3/0x4c0
> [] generic_writepages+0x51/0x80
> [] do_writepages+0x20/0x40
> [] __writeback_single_inode+0x3b/0x160
> [] writeback_sb_inodes+0x1e9/0x430
> [] __writeback_inodes_wb+0x9e/0xd0
> [] wb_writeback+0x24b/0x2e0
> [] wb_do_writeback+0x241/0x250
> [] bdi_writeback_thread+0xa2/0x250
> [] kthread+0xce/0xe0
> [] ret_from_fork+0x7c/0xb0
> [] 0xffffffffffffffff
>
> [md16-raid1]
> [] handle_read_error+0x119/0x790 [raid1]
> [] raid1d+0x232/0x1060 [raid1]
> [] md_thread+0x117/0x150
> [] kthread+0xce/0xe0
> [] ret_from_fork+0x7c/0xb0
> [] 0xffffffffffffffff
>
> Both processes end up in wait_event_lock_irq(), waiting for favorable
> conditions in the struct r1conf before they can proceed. These
> conditions evidently never arrive. I placed printk statements in
> freeze_array() and wait_barrier(), directly before their respective
> calls to wait_event_lock_irq(), and this is an example of the output:
>
> Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Attempting to freeze array: barrier (1), nr_waiting (1), nr_pending (5), nr_queued (3)
> Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (2), nr_pending (5), nr_queued (3)
> Feb 20 17:47:38 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (3), nr_pending (5), nr_queued (3)

From those messages, there is an outstanding request that should
either complete or fail. If it completes, ->nr_pending decreases by
one; if it fails, it is added to ->retry_list and ->nr_queued
increases by one. Neither seems to be happening here, and that is how
the hang arises. So: what is the state of that request? Maybe this bug
is caused by a lower-layer driver.
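To make the arithmetic concrete, the two waits look roughly like this
(a simplified paraphrase of drivers/md/raid1.c from this era, not the
verbatim source; the real code has extra escape clauses and wakeup
plumbing that I have trimmed):

static void freeze_array(struct r1conf *conf)
{
	/* Called from raid1d via handle_read_error(): raise the
	 * barrier, then wait until every pending request is parked
	 * on the retry list.  The "+1" is the request that raid1d
	 * itself is currently handling.
	 */
	spin_lock_irq(&conf->resync_lock);
	conf->barrier++;
	conf->nr_waiting++;
	wait_event_lock_irq(conf->wait_barrier,
			    conf->nr_pending == conf->nr_queued + 1,
			    conf->resync_lock);
	spin_unlock_irq(&conf->resync_lock);
}

static void wait_barrier(struct r1conf *conf)
{
	/* Called from make_request(): new I/O must wait while a
	 * barrier is raised.
	 */
	spin_lock_irq(&conf->resync_lock);
	if (conf->barrier) {
		conf->nr_waiting++;
		wait_event_lock_irq(conf->wait_barrier, !conf->barrier,
				    conf->resync_lock);
		conf->nr_waiting--;
	}
	conf->nr_pending++;
	spin_unlock_irq(&conf->resync_lock);
}

Plugging in your numbers: freeze_array() is waiting for nr_pending (5)
to equal nr_queued (3) + 1, and wait_barrier() blocks the flush thread
and your readers because barrier is 1. Exactly one request is neither
completed nor on the retry list; until it comes back from below MD,
neither wait condition can ever become true.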
> The flush seems to come after the attempt to handle the read error.
> I believe the incrementing nr_waiting comes from the multiple read
> processes I have going as they get stuck behind this deadlock.
>
> majianpeng may have been referring to this condition on the
> linux-raid list a few days ago
> (http://www.spinics.net/lists/raid/msg42150.html) when he stated
> "Above code is what's you said. But it met read-error, raid1d will
> blocked for ever. The reason is freeze_array"
>
> Environment:
> A RAID 1 array built on top of two multipath maps
> Devices underlying the maps are iSCSI disks exported from a Linux SAN
> Multipath is configured to queue I/O if no path is available for 5
> retries (polling interval 5) before failing and causing the mirror
> to degrade.
>
> Reproducible on kernels 3.6.11 and up (x86_64)
>
> To reproduce, I create 10 raid 1 arrays on a client built on mpio
> over iSCSI. For each of the arrays, I start the following to create
> some writes:
>
> NODE=array_5; while :; do TIME=$((RANDOM % 10)); size=$((RANDOM * 6));
> sleep $TIME; dd if=/dev/zero of=/dev/md/$NODE bs=1024 count=$size; done
>
> I then start 10 processes, each doing this to create some reads:
>
> while :; do SAN=$((RANDOM % 10)); dd if=/dev/md/array_$SAN
> of=/dev/null bs=1024 count=50000; done
>
> In a screen I start a script running in a loop that
> monitors /proc/mdstat for failed arrays and does a fail/remove/re-add
> on the failed disk. (This is mainly so that I get more than one crack
> at reproducing the bug, since the timing never happens just right on
> one try.)
>
> Now on one of the iSCSI servers I start a loop that sleeps a random
> amount of time between 1 and 5 minutes, then stops the export, sleeps
> again, and then restores the export.
>
> The net effect of all this is that the disks on the initiator will
> queue I/O for a while when the export is off and eventually fail. In
> most cases this happens gracefully and I/O will go to the remaining
> disk. In the failure case we get the behavior described here. The
> test is set up to restore the arrays and try again until the bug is
> hit.
>
> We do see this situation in production with relative frequency, so
> it's not something that happens only in theory under artificial
> conditions.
>
> Any help or insight here is much appreciated.
>
> Tregaron Bayly
> Systems Administrator
> Bluehost, Inc.
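For completeness, the path that should account a failed request (the
one that bumps ->nr_queued) looks roughly like this (again a
paraphrase of raid1.c, not the verbatim source):

static void reschedule_retry(struct r1bio *r1_bio)
{
	unsigned long flags;
	struct mddev *mddev = r1_bio->mddev;
	struct r1conf *conf = mddev->private;

	/* Park the failed r1bio for raid1d and account it, so that
	 * freeze_array() can see nr_pending == nr_queued + 1.
	 */
	spin_lock_irqsave(&conf->device_lock, flags);
	list_add(&r1_bio->retry_list, &conf->retry_list);
	conf->nr_queued++;
	spin_unlock_irqrestore(&conf->device_lock, flags);

	wake_up(&conf->wait_barrier);
	md_wakeup_thread(mddev->thread);
}

Since your multipath maps are configured to queue I/O while no path is
available, a bio can sit below MD for a long time without either
completing or failing. During that window neither this path nor the
normal completion path runs, the counters never move, and
freeze_array() waits forever. That is why I suspect the lower-layer
driver rather than raid1 itself.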