BUG - raid 1 deadlock on handle_read_error / wait_barrier

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Tregaron Bayly <tbayly@bluehost.com>
To: linux-raid@vger.kernel.org
Cc: neilb@suse.de
Subject: BUG  - raid 1 deadlock on handle_read_error / wait_barrier
Date: Thu, 21 Feb 2013 15:58:24 -0700	[thread overview]
Message-ID: <1361487504.4863.54.camel@linux-lxtg.site> (raw)

Symptom:
A RAID 1 array ends up with two threads (flush and raid1) stuck in D
state forever.  The array is inaccessible and the host must be restarted
to restore access to the array.

I have some scripted workloads that reproduce this within a maximum of a
couple hours on kernels from 3.6.11 - 3.8-rc7.  I cannot reproduce on
3.4.32.  3.5.7 ends up with three threads stuck in D state, but the
stacks are different from this bug (as it's EOL maybe of interest in
bisecting the problem?).

Details:

[flush-9:16]
[<ffffffffa009f1a4>] wait_barrier+0x124/0x180 [raid1]
[<ffffffffa00a2a15>] make_request+0x85/0xd50 [raid1]
[<ffffffff813653c3>] md_make_request+0xd3/0x200
[<ffffffff811f494a>] generic_make_request+0xca/0x100
[<ffffffff811f49f9>] submit_bio+0x79/0x160
[<ffffffff811808f8>] submit_bh+0x128/0x200
[<ffffffff81182fe0>] __block_write_full_page+0x1d0/0x330
[<ffffffff8118320e>] block_write_full_page_endio+0xce/0x100
[<ffffffff81183255>] block_write_full_page+0x15/0x20
[<ffffffff81187908>] blkdev_writepage+0x18/0x20
[<ffffffff810f73b7>] __writepage+0x17/0x40
[<ffffffff810f8543>] write_cache_pages+0x1d3/0x4c0
[<ffffffff810f8881>] generic_writepages+0x51/0x80
[<ffffffff810f88d0>] do_writepages+0x20/0x40
[<ffffffff811782bb>] __writeback_single_inode+0x3b/0x160
[<ffffffff8117a8a9>] writeback_sb_inodes+0x1e9/0x430
[<ffffffff8117ab8e>] __writeback_inodes_wb+0x9e/0xd0
[<ffffffff8117ae9b>] wb_writeback+0x24b/0x2e0
[<ffffffff8117b171>] wb_do_writeback+0x241/0x250
[<ffffffff8117b222>] bdi_writeback_thread+0xa2/0x250
[<ffffffff8106414e>] kthread+0xce/0xe0
[<ffffffff81488a6c>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

[md16-raid1]
[<ffffffffa009ffb9>] handle_read_error+0x119/0x790 [raid1]
[<ffffffffa00a0862>] raid1d+0x232/0x1060 [raid1]
[<ffffffff813675a7>] md_thread+0x117/0x150
[<ffffffff8106414e>] kthread+0xce/0xe0
[<ffffffff81488a6c>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

Both processes end up in wait_event_lock_irq() waiting for favorable
conditions in the struct r1conf to proceed.  These conditions obviously
seem to never arrive.  I placed printk statements in freeze_array() and
wait_barrier() directly before calling their respective
wait_event_lock_irq() and this is an example output:

Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Attempting to freeze array: barrier (1), nr_waiting (1), nr_pending (5), nr_queued (3)
Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (2), nr_pending (5), nr_queued (3)
Feb 20 17:47:38 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (3), nr_pending (5), nr_queued (3)

The flush seems to come after the attempt to handle the read error.  I
believe the incrementing nr_waiting comes from the multiple read
processes I have going as they get stuck behind this deadlock.

majienpeng may have been referring to this condition on the linux-raid
list a few days ago (http://www.spinics.net/lists/raid/msg42150.html)
when he stated "Above code is what's you said.  But it met read-error,
raid1d will blocked for ever.  The reason is freeze_array"

Environment:
A RAID 1 array built on top of two multipath maps
Devices underlying the maps are iscsi disks exported from a linux SAN
Multipath is configured to queue i/o if no path for 5 retries (polling
interval 5) before failing and causing the mirror to degrade.

Reproducible on kernels 3.6.11 and up (x86_64)

To reproduce, I create 10 raid 1 arrays on a client built on mpio over
iscsi.  For each of the arrays, I start the following to create some
writes:

$NODE=array_5; while :; do let TIME=($RANDOM % 10); let size=$RANDOM*6;
sleep $TIME; dd if=/dev/zero of=/dev/md/$NODE bs=1024 count=$size; done

I then start 10 processes each doing this to create some reads:

while :; do let SAN=($RANDOM % 10); dd if=/dev/md/array_$SAN
of=/dev/null bs=1024 count=50000; done

In a screen I start a script running in a loop that
monitors /proc/mdstat for failed arrays and does a fail/remove/re-add on
the failed disk.  (This is mainly so that I get more than one crack at
reproducing the bug since the timing never happens just right on one
try.)

Now on one of the iscsi servers I start a loop that sleeps a random
amount of time between 1-5 minutes then stops the export, sleeps again
and then restores the export.

The net effect of all this is that the disks on the initiator will queue
i/o for a while when the export is off and eventually fail.  In most
cases this happens gracefully and i/o will go to the remaining disk.  In
the failure case we get the behavior described here.  The test is setup
to restore the arrays and try again until the bug is hit.

We do see this situation in production with relative frequency so it's
not something that happens only in theory under artificial conditions.

Any help or insight here is much appreciated.

Tregaron Bayly
Systems Administrator
Bluehost, Inc.

next             reply	other threads:[~2013-02-21 22:58 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-02-21 22:58 Tregaron Bayly [this message]
2013-02-22  3:44 ` BUG - raid 1 deadlock on handle_read_error / wait_barrier Joe Lawrence
2013-02-22 11:52 ` majianpeng
2013-02-22 16:03   ` Tregaron Bayly
2013-02-22 18:14     ` Joe Lawrence
2013-02-24 22:43 ` NeilBrown
2013-02-25  0:04   ` NeilBrown
2013-02-25 16:11     ` Tregaron Bayly
2013-02-25 22:54       ` NeilBrown
2013-02-26 14:09       ` Joe Lawrence
2013-05-16 14:07         ` Alexander Lyakas
2013-05-20  7:17           ` NeilBrown
2013-05-30 14:30             ` Alexander Lyakas
2013-06-02 12:43               ` Alexander Lyakas
2013-06-04  1:49                 ` NeilBrown
2013-06-04  9:52                   ` Alexander Lyakas
2013-06-06 15:00                   ` Tregaron Bayly
2013-06-08  9:45                     ` Alexander Lyakas
2013-06-12  0:42                       ` NeilBrown
2013-06-12  1:30                     ` NeilBrown

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1361487504.4863.54.camel@linux-lxtg.site \
    --to=tbayly@bluehost.com \
    --cc=linux-raid@vger.kernel.org \
    --cc=neilb@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.