Re: BUG - raid 1 deadlock on handle_read_error / wait_barrier

All of lore.kernel.org
 help / color / mirror / Atom feed

From: majianpeng <kedacomkernel@gmail.com>
To: tbayly@bluehost.com
Cc: linux-raid@vger.kernel.org, neilb@suse.de
Subject: Re: BUG  - raid 1 deadlock on handle_read_error / wait_barrier
Date: Fri, 22 Feb 2013 19:52:22 +0800	[thread overview]
Message-ID: <51275BF6.2050702@gmail.com> (raw)
In-Reply-To: <1361487504.4863.54.camel@linux-lxtg.site>

On 02/22/2013 06:58 AM, Tregaron Bayly wrote:
> Symptom:
> A RAID 1 array ends up with two threads (flush and raid1) stuck in D
> state forever.  The array is inaccessible and the host must be restarted
> to restore access to the array.
>
> I have some scripted workloads that reproduce this within a maximum of a
> couple hours on kernels from 3.6.11 - 3.8-rc7.  I cannot reproduce on
> 3.4.32.  3.5.7 ends up with three threads stuck in D state, but the
> stacks are different from this bug (as it's EOL maybe of interest in
> bisecting the problem?).
>
> Details:
>
> [flush-9:16]
> [<ffffffffa009f1a4>] wait_barrier+0x124/0x180 [raid1]
> [<ffffffffa00a2a15>] make_request+0x85/0xd50 [raid1]
> [<ffffffff813653c3>] md_make_request+0xd3/0x200
> [<ffffffff811f494a>] generic_make_request+0xca/0x100
> [<ffffffff811f49f9>] submit_bio+0x79/0x160
> [<ffffffff811808f8>] submit_bh+0x128/0x200
> [<ffffffff81182fe0>] __block_write_full_page+0x1d0/0x330
> [<ffffffff8118320e>] block_write_full_page_endio+0xce/0x100
> [<ffffffff81183255>] block_write_full_page+0x15/0x20
> [<ffffffff81187908>] blkdev_writepage+0x18/0x20
> [<ffffffff810f73b7>] __writepage+0x17/0x40
> [<ffffffff810f8543>] write_cache_pages+0x1d3/0x4c0
> [<ffffffff810f8881>] generic_writepages+0x51/0x80
> [<ffffffff810f88d0>] do_writepages+0x20/0x40
> [<ffffffff811782bb>] __writeback_single_inode+0x3b/0x160
> [<ffffffff8117a8a9>] writeback_sb_inodes+0x1e9/0x430
> [<ffffffff8117ab8e>] __writeback_inodes_wb+0x9e/0xd0
> [<ffffffff8117ae9b>] wb_writeback+0x24b/0x2e0
> [<ffffffff8117b171>] wb_do_writeback+0x241/0x250
> [<ffffffff8117b222>] bdi_writeback_thread+0xa2/0x250
> [<ffffffff8106414e>] kthread+0xce/0xe0
> [<ffffffff81488a6c>] ret_from_fork+0x7c/0xb0
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> [md16-raid1]
> [<ffffffffa009ffb9>] handle_read_error+0x119/0x790 [raid1]
> [<ffffffffa00a0862>] raid1d+0x232/0x1060 [raid1]
> [<ffffffff813675a7>] md_thread+0x117/0x150
> [<ffffffff8106414e>] kthread+0xce/0xe0
> [<ffffffff81488a6c>] ret_from_fork+0x7c/0xb0
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> Both processes end up in wait_event_lock_irq() waiting for favorable
> conditions in the struct r1conf to proceed.  These conditions obviously
> seem to never arrive.  I placed printk statements in freeze_array() and
> wait_barrier() directly before calling their respective
> wait_event_lock_irq() and this is an example output:
>
> Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Attempting to freeze array: barrier (1), nr_waiting (1), nr_pending (5), nr_queued (3)
> Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (2), nr_pending (5), nr_queued (3)
> Feb 20 17:47:38 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (3), nr_pending (5), nr_queued (3)
From those message,there's a request which will be completed or met error.
If completed, the ->nr_pening decrease one.
If request met error, it add ->retry_list and the ->nr_queued add one.
So in two conditions,the hung will be happened.
What's the state of this request? Maybe this bug caused by lower layer drivers.

> The flush seems to come after the attempt to handle the read error.  I
> believe the incrementing nr_waiting comes from the multiple read
> processes I have going as they get stuck behind this deadlock.
>
> majienpeng may have been referring to this condition on the linux-raid
> list a few days ago (http://www.spinics.net/lists/raid/msg42150.html)
> when he stated "Above code is what's you said.  But it met read-error,
> raid1d will blocked for ever.  The reason is freeze_array"
>
> Environment:
> A RAID 1 array built on top of two multipath maps
> Devices underlying the maps are iscsi disks exported from a linux SAN
> Multipath is configured to queue i/o if no path for 5 retries (polling
> interval 5) before failing and causing the mirror to degrade.
>
> Reproducible on kernels 3.6.11 and up (x86_64)
>   
> To reproduce, I create 10 raid 1 arrays on a client built on mpio over
> iscsi.  For each of the arrays, I start the following to create some
> writes:
>
> $NODE=array_5; while :; do let TIME=($RANDOM % 10); let size=$RANDOM*6;
> sleep $TIME; dd if=/dev/zero of=/dev/md/$NODE bs=1024 count=$size; done
>
> I then start 10 processes each doing this to create some reads:
>
> while :; do let SAN=($RANDOM % 10); dd if=/dev/md/array_$SAN
> of=/dev/null bs=1024 count=50000; done
>
> In a screen I start a script running in a loop that
> monitors /proc/mdstat for failed arrays and does a fail/remove/re-add on
> the failed disk.  (This is mainly so that I get more than one crack at
> reproducing the bug since the timing never happens just right on one
> try.)
>
> Now on one of the iscsi servers I start a loop that sleeps a random
> amount of time between 1-5 minutes then stops the export, sleeps again
> and then restores the export.
>
> The net effect of all this is that the disks on the initiator will queue
> i/o for a while when the export is off and eventually fail.  In most
> cases this happens gracefully and i/o will go to the remaining disk.  In
> the failure case we get the behavior described here.  The test is setup
> to restore the arrays and try again until the bug is hit.
>
> We do see this situation in production with relative frequency so it's
> not something that happens only in theory under artificial conditions.
>
> Any help or insight here is much appreciated.
>
> Tregaron Bayly
> Systems Administrator
> Bluehost, Inc.
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

next prev parent reply	other threads:[~2013-02-22 11:52 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-02-21 22:58 BUG - raid 1 deadlock on handle_read_error / wait_barrier Tregaron Bayly
2013-02-22  3:44 ` Joe Lawrence
2013-02-22 11:52 ` majianpeng [this message]
2013-02-22 16:03   ` Tregaron Bayly
2013-02-22 18:14     ` Joe Lawrence
2013-02-24 22:43 ` NeilBrown
2013-02-25  0:04   ` NeilBrown
2013-02-25 16:11     ` Tregaron Bayly
2013-02-25 22:54       ` NeilBrown
2013-02-26 14:09       ` Joe Lawrence
2013-05-16 14:07         ` Alexander Lyakas
2013-05-20  7:17           ` NeilBrown
2013-05-30 14:30             ` Alexander Lyakas
2013-06-02 12:43               ` Alexander Lyakas
2013-06-04  1:49                 ` NeilBrown
2013-06-04  9:52                   ` Alexander Lyakas
2013-06-06 15:00                   ` Tregaron Bayly
2013-06-08  9:45                     ` Alexander Lyakas
2013-06-12  0:42                       ` NeilBrown
2013-06-12  1:30                     ` NeilBrown

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=51275BF6.2050702@gmail.com \
    --to=kedacomkernel@gmail.com \
    --cc=linux-raid@vger.kernel.org \
    --cc=neilb@suse.de \
    --cc=tbayly@bluehost.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.