From: NeilBrown <neilb@suse.de>
To: Ray Morris <support@bettercgi.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: debugging md2_resync hang at raise_barrier
Date: Thu, 1 Mar 2012 12:34:18 +1100
Message-ID: <20120301123418.1254f763@notabene.brown>
In-Reply-To: <20120229184413.13db7143@bettercgi.com>
On Wed, 29 Feb 2012 18:44:13 -0600 Ray Morris <support@bettercgi.com> wrote:
> I am attempting to debug a hang in raid1 and possibly one raid5.
> I have experienced the same problem with many kernel versions
> over a couple of years, and with disparate hardware.
>
> My current plan, if no one more experienced suggests I do otherwise, is
> to compile a kernel with some printk() in strategic locations and try to
> narrow down the problem. I know very little about kernel work, so I am
> seeking suggestions from those who know better than I.
>
> In the case logged below, md2_resync appears to hang at raise_barrier,
> then further access to the device hangs. I'm just a Perl
> programmer who dabbles in C, but my intuition said I should check whether
> lower_barrier had been called with conf->barrier already at zero, so that's
> the one printk I've added so far. It may take a week or more before it
> crashes again, so is there any more debugging I should add before waiting
> for it to hang?
>
> Also below is some older logging from similar symptoms with raid5,
> hanging at raid5_quiesce. I got rid of the raid5 in hopes of getting
> rid of the problem, but if anyone has suggestions on how to further
> debug that I may be able to add a raid5 array.
>
> The load when I've noticed it is rsync to LVM volumes with snapshots.
> After some discussion, lvm-devel suggested linux-raid would be the next
> logical step. Tested kernels include 2.6.32-220.4.1.el6.x86_64,
> 2.6.32.26-175.fc12.x86_64, vmlinuz-2.6.32.9-70.fc12.x86_64, and others.
> Since I already have updated the kernel several times in the last couple
> of years, I figured I'd try some debugging with the current EL 6 kernel.
>
> Anyway, any thoughts on how to debug, where to stick some printk, other
> debugging functions?
I might know what is happening.
It is kind of complicated and involves the magic code in
block/blk-core.c:generic_make_request which turns recursive calls into tail
recursion.
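To illustrate the queuing behaviour referred to above, here is a minimal
user-space sketch. It is not the kernel code: struct req, struct task,
submit(), handle() and run_stack() are hypothetical stand-ins for a bio, the
per-task pending list, generic_make_request, a stacked driver's make_request
function and a small driver for the demo. The point it shows is that a
request submitted from inside a handler is only queued, and is not acted on
until the handler that queued it has returned.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio travelling down a stack of devices. */
struct req {
    int depth;              /* layers still to pass through */
    struct req *next;
};

/* Hypothetical stand-in for the per-task state used by the submit loop. */
struct task {
    struct req *head, *tail; /* requests queued while a loop is active */
    int active;              /* non-zero while submit() is looping */
    int nesting, max_nesting;
    int handled;
};

static struct req pool[16];
static int pool_used;

static void submit(struct task *t, struct req *r);

/* A stacked driver: handling a request may submit one to the layer below. */
static void handle(struct task *t, struct req *r)
{
    t->nesting++;
    if (t->nesting > t->max_nesting)
        t->max_nesting = t->nesting;
    t->handled++;
    if (r->depth > 0) {
        assert(pool_used < 16);
        struct req *child = &pool[pool_used++];
        child->depth = r->depth - 1;
        child->next = NULL;
        submit(t, child);   /* looks recursive, but only queues */
    }
    t->nesting--;
}

static void submit(struct task *t, struct req *r)
{
    if (t->active) {
        /* Already inside a submit loop: park the request and return. */
        r->next = NULL;
        if (t->tail)
            t->tail->next = r;
        else
            t->head = r;
        t->tail = r;
        return;
    }
    t->active = 1;
    do {
        handle(t, r);
        /* Only now, after handle() has returned, take the next one. */
        r = t->head;
        if (r) {
            t->head = r->next;
            if (!t->head)
                t->tail = NULL;
        }
    } while (r);
    t->active = 0;
}

/* Submits one request that passes through `depth` further layers.
 * Returns the deepest handle() nesting seen; *handled gets the count. */
static int run_stack(int depth, int *handled)
{
    struct task t = {0};
    struct req top = { .depth = depth, .next = NULL };

    pool_used = 0;
    submit(&t, &top);
    *handled = t.handled;
    return t.max_nesting;
}
```

Calling run_stack(3, &n) handles four requests without handle() ever nesting
more than one level deep. The cost of that flattening is the property that
matters here: anything a handler queues stays parked for as long as that
task is stuck inside a handler.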
The fs sends a request to dm.
dm splits it in 2 for some reason and sends both parts to md.
This involves them getting queued in generic_make_request.
The first gets actioned by md/raid1 and converted into a request to the
underlying device (it must be a read request for this to happen - so just one
device). This gets added to the queue and is counted in nr_pending.
At this point sync_request is called by another thread and it tries to
raise_barrier. It gets past the first hurdle, increments ->barrier, and
waits for nr_pending to hit zero.
Now the second request from dm to md is passed to raid1.c:make_request where
it tries to wait_barrier. This blocks because ->barrier is up, and we have a
deadlock - the request to the underlying device will not progress until this
md request progresses, and it is stuck.
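The sequence above can be replayed with a small user-space model. This is a
sketch under stated assumptions, not the driver: struct conf_model and
struct task_model are hypothetical, their fields only mirror the counters
named in this mail, and queued_bios stands in for the list that
generic_make_request keeps for the submitting task. The wait condition is
the one in the unpatched wait_barrier (!conf->barrier).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the raid1 counters discussed in the mail. */
struct conf_model {
    int barrier;     /* raised by resync in raise_barrier */
    int nr_pending;  /* normal I/O past wait_barrier, not yet completed */
    int nr_waiting;  /* normal I/O blocked in wait_barrier */
};

/* Hypothetical model of the submitting task's generic_make_request list. */
struct task_model {
    int queued_bios;
};

/* Unpatched wait_barrier: normal I/O proceeds only with no barrier up. */
static bool wait_barrier_may_proceed(const struct conf_model *c)
{
    return !c->barrier;
}

/* Second stage of raise_barrier: resync proceeds once normal I/O drains. */
static bool resync_may_proceed(const struct conf_model *c)
{
    return c->nr_pending == 0;
}

/* Replays the steps; returns true if each side ends up waiting on the other. */
static bool replay_deadlock(void)
{
    struct conf_model conf = {0, 0, 0};
    struct task_model submitter = {0};

    /* dm splits the request; both halves are queued for md. */
    submitter.queued_bios = 2;

    /* First half reaches raid1: no barrier yet, so it passes. */
    submitter.queued_bios--;
    assert(wait_barrier_may_proceed(&conf));
    conf.nr_pending++;
    /* Its read to the underlying device is queued behind the second half. */
    submitter.queued_bios++;

    /* Another thread: resync raises the barrier, then waits for drain. */
    conf.barrier++;
    bool resync_blocked = !resync_may_proceed(&conf);

    /* Second half reaches raid1 and hits the raised barrier. */
    submitter.queued_bios--;
    bool io_blocked = !wait_barrier_may_proceed(&conf);
    if (io_blocked)
        conf.nr_waiting++;

    /* The queued read cannot be issued until the submitter returns, so
     * nr_pending cannot drop while the submitter is blocked. */
    bool read_stuck = io_blocked && submitter.queued_bios > 0;

    return resync_blocked && io_blocked && read_stuck;
}
```

In this model replay_deadlock() returns true: resync waits on nr_pending,
the second half waits on barrier, and the one bio whose completion would
lower nr_pending is parked on the list of the blocked task.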
This patch might fix it. Maybe. If it compiles.
NeilBrown
Index: linux-2.6.32-SLE11-SP1/drivers/md/raid1.c
===================================================================
--- linux-2.6.32-SLE11-SP1.orig/drivers/md/raid1.c 2012-03-01 12:28:05.000000000 +1100
+++ linux-2.6.32-SLE11-SP1/drivers/md/raid1.c 2012-03-01 12:28:22.427992913 +1100
@@ -695,7 +695,11 @@ static void wait_barrier(conf_t *conf)
 	spin_lock_irq(&conf->resync_lock);
 	if (conf->barrier) {
 		conf->nr_waiting++;
-		wait_event_lock_irq(conf->wait_barrier, !conf->barrier,
+		wait_event_lock_irq(conf->wait_barrier,
+				    !conf->barrier ||
+				    (current->bio_tail &&
+				     current->bio_list &&
+				     conf->nr_pending),
 				    conf->resync_lock,
 				    raid1_unplug(conf->mddev->queue));
 		conf->nr_waiting--;
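
For readers who want to compare the old and new wait conditions side by
side, here is a small user-space sketch of just the predicates. It is an
illustration only: struct wait_state is hypothetical, and has_queued_bios
collapses the patch's current->bio_tail && current->bio_list test into one
flag, on the assumption that together they mean "this task is inside
generic_make_request and has bios parked on its list".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of what a task sees when it enters wait_barrier. */
struct wait_state {
    int barrier;          /* conf->barrier */
    int nr_pending;       /* conf->nr_pending */
    bool has_queued_bios; /* models current->bio_tail && current->bio_list */
};

/* Condition before the patch: wait until the barrier is lowered. */
static bool old_condition(const struct wait_state *s)
{
    return !s->barrier;
}

/* Condition after the patch: also stop waiting when this task holds
 * queued bios and there is pending I/O that resync is waiting for. */
static bool patched_condition(const struct wait_state *s)
{
    assert(s->nr_pending >= 0);
    return !s->barrier || (s->has_queued_bios && s->nr_pending);
}
```

With barrier = 1, nr_pending = 1 and has_queued_bios = true (the state
described in the explanation), old_condition() is false and
patched_condition() is true, so the second request would be let through
and the queued read could then be issued. A task with nothing queued still
waits for the barrier exactly as before.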