From: Dan Williams <dan.j.williams@intel.com>
To: dean gaudet <dean@arctic.org>
Cc: linux-raid@vger.kernel.org, neilb@suse.de
Subject: Re: 2.6.24-rc6 reproducible raid5 hang
Date: Wed, 09 Jan 2008 11:28:50 -0700
Message-ID: <1199903330.18882.7.camel@dwillia2-linux.ch.intel.com>
In-Reply-To: <alpine.DEB.0.999999.0712300922520.20553@twinlark.arctic.org>
On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote:
> On Sat, 29 Dec 2007, Dan Williams wrote:
>
> > On Dec 29, 2007 1:58 PM, dean gaudet <dean@arctic.org> wrote:
> > > On Sat, 29 Dec 2007, Dan Williams wrote:
> > >
> > > > On Dec 29, 2007 9:48 AM, dean gaudet <dean@arctic.org> wrote:
> > > > > hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on
> > > > > the same 64k chunk array and had raised the stripe_cache_size to 1024...
> > > > > and got a hang. this time i grabbed stripe_cache_active before bumping
> > > > > the size again -- it was only 905 active. as i recall the bug we were
> > > > > debugging a year+ ago the active was at the size when it would hang. so
> > > > > this is probably something new.
> > > >
> > > > I believe I am seeing the same issue and am trying to track down
> > > > whether XFS is doing something unexpected, i.e. I have not been able
> > > > to reproduce the problem with EXT3. MD tries to increase throughput
> > > > by letting some stripe work build up in batches. It looks like every
> > > > time your system has hung it has been in the 'inactive_blocked' state
> > > > i.e. > 3/4 of stripes active. This state should automatically
> > > > clear...
> > >
> > > cool, glad you can reproduce it :)
> > >
> > > i have a bit more data... i'm seeing the same problem on debian's
> > > 2.6.22-3-amd64 kernel, so it's not new in 2.6.24.
> > >
> >
> > This is just brainstorming at this point, but it looks like xfs can
> > submit more requests in the bi_end_io path such that it can lock
> > itself out of the RAID array. The sequence that concerns me is:
> >
> > return_io->xfs_buf_end_io->xfs_buf_io_end->xfs_buf_iodone_work->xfs_buf_iorequest->make_request-><hang>
> >
> > I need to verify whether this path is actually triggering, but if we are
> > in an inactive_blocked condition this new request will be put on a
> > wait queue and we'll never get to the release_stripe() call after
> > return_io(). It would be interesting to see if this is new XFS
> > behavior in recent kernels.
>
>
> i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
>
> which was Neil's change in 2.6.22 for deferring generic_make_request
> until there's enough stack space for it.
>
> with my git tree sync'd to that commit my test cases fail in under 20
> minutes uptime (i rebooted and tested 3x). sync'd to the commit previous
> to it i've got 8h of run-time now without the problem.
>
> this isn't definitive of course since it does seem to be timing
> dependent, but since all failures have occurred much earlier than that
> for me so far i think this indicates this change is either the cause of
> the problem or exacerbates an existing raid5 problem.
>
> given that this looks like a very rare problem i saw with 2.6.18
> (raid5+xfs there too) i'm thinking Neil's commit may just exacerbate an
> existing problem... not that i have evidence either way.
>
> i've attached a new kernel log with a hang at d89d87965d... and the
> reduced config file i was using for the bisect. hopefully the hang
> looks the same as what we were seeing at 2.6.24-rc6. let me know.
>
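For reference, the place submitters end up blocking is get_active_stripe()
in drivers/md/raid5.c.  A condensed sketch from memory (locking, unplugging
and the stripe hash lookup are elided, so don't treat it as exact):

	sh = get_free_stripe(conf);
	if (!sh) {
		conf->inactive_blocked = 1;
		wait_event_lock_irq(conf->wait_for_stripe,
				    !list_empty(&conf->inactive_list) &&
				    (atomic_read(&conf->active_stripes)
				     < (conf->max_nr_stripes * 3 / 4)
				     || !conf->inactive_blocked),
				    conf->device_lock, /* unplug */);
		conf->inactive_blocked = 0;
	}

Once active_stripes crosses 3/4 of max_nr_stripes (the stripe_cache_size
knob) a caller sleeps here until stripes are released -- which never
happens if the i/o needed to retire those stripes is itself queued behind
the sleeping caller.
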
Dean, could you try the patch below to see if it fixes your failure
scenario?  It passes my test case.

Thanks,
Dan
------->
md: add generic_make_request_immed to prevent raid5 hang
From: Dan Williams <dan.j.williams@intel.com>
Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization
by preventing recursive calls to generic_make_request. However, the
following conditions can cause raid5 to hang until 'stripe_cache_size' is
increased:

1/ stripe_cache_active is N stripes away from the 'inactive_blocked' limit
   (3/4 * stripe_cache_size)
2/ a bio is submitted that requires M stripes to be processed, where M > N
3/ stripes 1 through N are up-to-date and ready for immediate processing,
   i.e. no trip through raid5d required

This results in the calling thread hanging while waiting for resources to
process stripes N through M, which means we never return from make_request.
All other raid5 users then pile up in get_active_stripe. Increasing
stripe_cache_size temporarily resolves the blockage by allowing the blocked
make_request to return to generic_make_request.
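
To illustrate, the stuck call path looks roughly like this (an
illustration of the sequence, not an actual backtrace):

  raid5 make_request
    get_active_stripe             <- stripes 1..N acquired
    handle_stripe / ops_run_io
      generic_make_request(bi)    <- bi is parked on current->bio_list
                                     because we are nested inside the
                                     caller's generic_make_request, so
                                     nothing reaches the member disks
    get_active_stripe             <- stripe N+1: sleeps until active
                                     stripes drop below 3/4 of the cache,
                                     which needs the parked bios to
                                     complete first => deadlock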

Another way to solve this is to move all i/o submission to raid5d context.

Thanks to Dean Gaudet for bisecting this down to d89d8796.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/ll_rw_blk.c      |   16 +++++++++++++---
 drivers/md/raid5.c     |    4 ++--
 include/linux/blkdev.h |    1 +
 3 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 8b91994..bff40c2 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -3287,16 +3287,26 @@ end_io:
 }
 
 /*
- * We only want one ->make_request_fn to be active at a time,
- * else stack usage with stacked devices could be a problem.
+ * In the general case we only want one ->make_request_fn to be active
+ * at a time, else stack usage with stacked devices could be a problem.
  * So use current->bio_{list,tail} to keep a list of requests
  * submited by a make_request_fn function.
  * current->bio_tail is also used as a flag to say if
  * generic_make_request is currently active in this task or not.
  * If it is NULL, then no make_request is active. If it is non-NULL,
  * then a make_request is active, and new requests should be added
- * at the tail
+ * at the tail.
+ * However, some stacking drivers, like md-raid5, need to submit
+ * the bio without delay when it may not have the resources to
+ * complete its q->make_request_fn. generic_make_request_immed is
+ * provided for this explicit purpose.
  */
+void generic_make_request_immed(struct bio *bio)
+{
+	__generic_make_request(bio);
+}
+EXPORT_SYMBOL(generic_make_request_immed);
+
 void generic_make_request(struct bio *bio)
 {
 	if (current->bio_tail) {
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c857b5a..ffa2be4 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -450,7 +450,7 @@ static void ops_run_io(struct stripe_head *sh)
 			    test_bit(R5_ReWrite, &sh->dev[i].flags))
 				atomic_add(STRIPE_SECTORS,
 					&rdev->corrected_errors);
-			generic_make_request(bi);
+			generic_make_request_immed(bi);
 		} else {
 			if (rw == WRITE)
 				set_bit(STRIPE_DEGRADED, &sh->state);
@@ -3124,7 +3124,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 			if (rw == WRITE &&
 			    test_bit(R5_ReWrite, &sh->dev[i].flags))
 				atomic_add(STRIPE_SECTORS, &rdev->corrected_errors);
-			generic_make_request(bi);
+			generic_make_request_immed(bi);
 		} else {
 			if (rw == WRITE)
 				set_bit(STRIPE_DEGRADED, &sh->state);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d18ee67..774a3a0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -642,6 +642,7 @@ extern int blk_register_queue(struct gendisk *disk);
 extern void blk_unregister_queue(struct gendisk *disk);
 extern void register_disk(struct gendisk *dev);
 extern void generic_make_request(struct bio *bio);
+extern void generic_make_request_immed(struct bio *bio);
 extern void blk_put_request(struct request *);
 extern void __blk_put_request(struct request_queue *, struct request *);
 extern void blk_end_sync_rq(struct request *rq, int error);
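
For anyone who has not read the post-d89d8796 code recently: the deferral
that generic_make_request now performs, and that _immed skips, looks
roughly like the following.  This is paraphrased from memory with my own
comments added; block/ll_rw_blk.c is the authoritative version.

void generic_make_request(struct bio *bio)
{
	if (current->bio_tail) {
		/* Nested call: we are already inside some device's
		 * ->make_request_fn.  Park the bio on the per-task list
		 * and return; it is only issued when the outer
		 * make_request_fn returns to the loop below. */
		*(current->bio_tail) = bio;
		bio->bi_next = NULL;
		current->bio_tail = &bio->bi_next;
		return;
	}
	BUG_ON(bio->bi_next);
	do {
		current->bio_list = bio->bi_next;
		if (bio->bi_next == NULL)
			current->bio_tail = &current->bio_list;
		else
			bio->bi_next = NULL;
		/* ->make_request_fn runs in here.  If it blocks, e.g. in
		 * get_active_stripe, the bios parked above are never
		 * submitted.  generic_make_request_immed goes straight
		 * to this call and so cannot park anything. */
		__generic_make_request(bio);
		bio = current->bio_list;
	} while (bio);
	current->bio_tail = NULL;
}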
Thread overview: 30+ messages
2007-12-27 17:06 2.6.24-rc6 reproducible raid5 hang dean gaudet
2007-12-27 17:39 ` dean gaudet
2007-12-29 16:48 ` dean gaudet
2007-12-29 20:47 ` Dan Williams
2007-12-29 20:58 ` dean gaudet
2007-12-29 21:50 ` Justin Piszcz
2007-12-29 22:11 ` dean gaudet
2007-12-29 22:21 ` dean gaudet
2007-12-29 22:06 ` Dan Williams
2007-12-30 17:58 ` dean gaudet
2008-01-09 18:28 ` Dan Williams [this message]
2008-01-10 0:09 ` Neil Brown
2008-01-10 3:07 ` Dan Williams
2008-01-10 3:57 ` Neil Brown
2008-01-10 4:56 ` Dan Williams
2008-01-10 20:28 ` Bill Davidsen
2008-01-10 7:13 ` dean gaudet
2008-01-10 18:49 ` Dan Williams
2008-01-11 1:46 ` Neil Brown
2008-01-11 2:14 ` dean gaudet
2008-01-10 17:59 ` dean gaudet
2007-12-27 19:52 ` Justin Piszcz
2007-12-28 0:08 ` dean gaudet
-- strict thread matches above, loose matches on Subject: below --
2008-01-23 13:37 Tim Southerwood
2008-01-23 17:43 ` Carlos Carvalho
2008-01-24 20:30 ` Tim Southerwood
2008-01-28 17:29 ` Tim Southerwood
2008-01-29 14:16 ` Carlos Carvalho
2008-01-29 22:58 ` Bill Davidsen
2008-02-14 10:13 ` Burkhard Carstens