From: Dan Williams <dan.j.williams@intel.com>
To: dean gaudet <dean@arctic.org>
Cc: linux-raid@vger.kernel.org, neilb@suse.de
Subject: Re: 2.6.24-rc6 reproducible raid5 hang
Date: Wed, 09 Jan 2008 11:28:50 -0700
Message-ID: <1199903330.18882.7.camel@dwillia2-linux.ch.intel.com>
In-Reply-To: <alpine.DEB.0.999999.0712300922520.20553@twinlark.arctic.org>
On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote:
> On Sat, 29 Dec 2007, Dan Williams wrote:
>
> > On Dec 29, 2007 1:58 PM, dean gaudet <dean@arctic.org> wrote:
> > > On Sat, 29 Dec 2007, Dan Williams wrote:
> > >
> > > > On Dec 29, 2007 9:48 AM, dean gaudet <dean@arctic.org> wrote:
> > > > > hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on
> > > > > the same 64k chunk array and had raised the stripe_cache_size to 1024...
> > > > > and got a hang. this time i grabbed stripe_cache_active before bumping
> > > > > the size again -- it was only 905 active. as i recall the bug we were
> > > > > debugging a year+ ago the active was at the size when it would hang. so
> > > > > this is probably something new.
> > > >
> > > > I believe I am seeing the same issue and am trying to track down
> > > > whether XFS is doing something unexpected, i.e. I have not been able
> > > > to reproduce the problem with EXT3. MD tries to increase throughput
> > > > by letting some stripe work build up in batches. It looks like every
> > > > time your system has hung it has been in the 'inactive_blocked' state
> > > > i.e. > 3/4 of stripes active. This state should automatically
> > > > clear...
> > >
> > > cool, glad you can reproduce it :)
> > >
> > > i have a bit more data... i'm seeing the same problem on debian's
> > > 2.6.22-3-amd64 kernel, so it's not new in 2.6.24.
> > >
> >
> > This is just brainstorming at this point, but it looks like xfs can
> > submit more requests in the bi_end_io path such that it can lock
> > itself out of the RAID array. The sequence that concerns me is:
> >
> > return_io->xfs_buf_end_io->xfs_buf_io_end->xfs_buf_iodone_work->xfs_buf_iorequest->make_request-><hang>
> >
> > I need to verify whether this path is actually triggering, but if we are
> > in an inactive_blocked condition this new request will be put on a
> > wait queue and we'll never get to the release_stripe() call after
> > return_io(). It would be interesting to see if this is new XFS
> > behavior in recent kernels.
>
>
> i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
>
> which was Neil's change in 2.6.22 for deferring generic_make_request
> until there's enough stack space for it.
>
> with my git tree sync'd to that commit my test cases fail in under 20
> minutes uptime (i rebooted and tested 3x). sync'd to the commit previous
> to it i've got 8h of run-time now without the problem.
>
> this isn't definitive of course since it does seem to be timing
> dependent, but since all failures have occurred much earlier than that
> for me so far i think this indicates this change is either the cause of
> the problem or exacerbates an existing raid5 problem.
>
> given that this looks like a very rare problem i saw with 2.6.18
> (raid5+xfs there too) i'm thinking Neil's commit may just exacerbate an
> existing problem... not that i have evidence either way.
>
> i've attached a new kernel log with a hang at d89d87965d... and the
> reduced config file i was using for the bisect. hopefully the hang
> looks the same as what we were seeing at 2.6.24-rc6. let me know.
>
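For reference, the place submitters end up blocking is get_active_stripe()
in drivers/md/raid5.c.  A condensed sketch from memory (locking, unplugging
and the stripe hash lookup are elided, so don't treat it as exact):

	sh = get_free_stripe(conf);
	if (!sh) {
		conf->inactive_blocked = 1;
		wait_event_lock_irq(conf->wait_for_stripe,
				    !list_empty(&conf->inactive_list) &&
				    (atomic_read(&conf->active_stripes)
				     < (conf->max_nr_stripes * 3 / 4)
				     || !conf->inactive_blocked),
				    conf->device_lock, /* unplug */);
		conf->inactive_blocked = 0;
	}

Once active_stripes crosses 3/4 of max_nr_stripes (the stripe_cache_size
knob) a caller sleeps here until stripes are released -- which never
happens if the i/o needed to retire those stripes is itself queued behind
the sleeping caller.
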
Dean, could you try the patch below to see if it fixes your failure
scenario?  It passes my test case.

Thanks,
Dan
------->
md: add generic_make_request_immed to prevent raid5 hang
From: Dan Williams <dan.j.williams@intel.com>
Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization
by preventing recursive calls to generic_make_request. However, the
following conditions can cause raid5 to hang until 'stripe_cache_size' is
increased:

1/ stripe_cache_active is N stripes away from the 'inactive_blocked' limit
   (3/4 * stripe_cache_size)
2/ a bio is submitted that requires M stripes to be processed, where M > N
3/ stripes 1 through N are up-to-date and ready for immediate processing,
   i.e. no trip through raid5d required

This results in the calling thread hanging while waiting for resources to
process stripes N through M, which means we never return from make_request.
All other raid5 users then pile up in get_active_stripe. Increasing
stripe_cache_size temporarily resolves the blockage by allowing the blocked
make_request to return to generic_make_request.
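
To illustrate, the stuck call path looks roughly like this (an
illustration of the sequence, not an actual backtrace):

  raid5 make_request
    get_active_stripe             <- stripes 1..N acquired
    handle_stripe / ops_run_io
      generic_make_request(bi)    <- bi is parked on current->bio_list
                                     because we are nested inside the
                                     caller's generic_make_request, so
                                     nothing reaches the member disks
    get_active_stripe             <- stripe N+1: sleeps until active
                                     stripes drop below 3/4 of the cache,
                                     which needs the parked bios to
                                     complete first => deadlock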

Another way to solve this is to move all i/o submission to raid5d context.

Thanks to Dean Gaudet for bisecting this down to d89d8796.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/ll_rw_blk.c      |   16 +++++++++++++---
 drivers/md/raid5.c     |    4 ++--
 include/linux/blkdev.h |    1 +
 3 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 8b91994..bff40c2 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -3287,16 +3287,26 @@ end_io:
 }
 
 /*
- * We only want one ->make_request_fn to be active at a time,
- * else stack usage with stacked devices could be a problem.
+ * In the general case we only want one ->make_request_fn to be active
+ * at a time, else stack usage with stacked devices could be a problem.
  * So use current->bio_{list,tail} to keep a list of requests
  * submited by a make_request_fn function.
  * current->bio_tail is also used as a flag to say if
  * generic_make_request is currently active in this task or not.
  * If it is NULL, then no make_request is active. If it is non-NULL,
  * then a make_request is active, and new requests should be added
- * at the tail
+ * at the tail.
+ * However, some stacking drivers, like md-raid5, need to submit
+ * the bio without delay when it may not have the resources to
+ * complete its q->make_request_fn. generic_make_request_immed is
+ * provided for this explicit purpose.
  */
+void generic_make_request_immed(struct bio *bio)
+{
+	__generic_make_request(bio);
+}
+EXPORT_SYMBOL(generic_make_request_immed);
+
 void generic_make_request(struct bio *bio)
 {
 	if (current->bio_tail) {
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c857b5a..ffa2be4 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -450,7 +450,7 @@ static void ops_run_io(struct stripe_head *sh)
 			    test_bit(R5_ReWrite, &sh->dev[i].flags))
 				atomic_add(STRIPE_SECTORS,
 					&rdev->corrected_errors);
-			generic_make_request(bi);
+			generic_make_request_immed(bi);
 		} else {
 			if (rw == WRITE)
 				set_bit(STRIPE_DEGRADED, &sh->state);
@@ -3124,7 +3124,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 			if (rw == WRITE &&
 			    test_bit(R5_ReWrite, &sh->dev[i].flags))
 				atomic_add(STRIPE_SECTORS, &rdev->corrected_errors);
-			generic_make_request(bi);
+			generic_make_request_immed(bi);
 		} else {
 			if (rw == WRITE)
 				set_bit(STRIPE_DEGRADED, &sh->state);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d18ee67..774a3a0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -642,6 +642,7 @@ extern int blk_register_queue(struct gendisk *disk);
 extern void blk_unregister_queue(struct gendisk *disk);
 extern void register_disk(struct gendisk *dev);
 extern void generic_make_request(struct bio *bio);
+extern void generic_make_request_immed(struct bio *bio);
 extern void blk_put_request(struct request *);
 extern void __blk_put_request(struct request_queue *, struct request *);
 extern void blk_end_sync_rq(struct request *rq, int error);
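
For anyone who has not read the post-d89d8796 code recently: the deferral
that generic_make_request now performs, and that _immed skips, looks
roughly like the following.  This is paraphrased from memory with my own
comments added; block/ll_rw_blk.c is the authoritative version.

void generic_make_request(struct bio *bio)
{
	if (current->bio_tail) {
		/* Nested call: we are already inside some device's
		 * ->make_request_fn.  Park the bio on the per-task list
		 * and return; it is only issued when the outer
		 * make_request_fn returns to the loop below. */
		*(current->bio_tail) = bio;
		bio->bi_next = NULL;
		current->bio_tail = &bio->bi_next;
		return;
	}
	BUG_ON(bio->bi_next);
	do {
		current->bio_list = bio->bi_next;
		if (bio->bi_next == NULL)
			current->bio_tail = &current->bio_list;
		else
			bio->bi_next = NULL;
		/* ->make_request_fn runs in here.  If it blocks, e.g. in
		 * get_active_stripe, the bios parked above are never
		 * submitted.  generic_make_request_immed goes straight
		 * to this call and so cannot park anything. */
		__generic_make_request(bio);
		bio = current->bio_list;
	} while (bio);
	current->bio_tail = NULL;
}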
Thread overview: 30+ messages
2007-12-27 17:06 2.6.24-rc6 reproducible raid5 hang dean gaudet
2007-12-27 17:39 ` dean gaudet
2007-12-29 16:48 ` dean gaudet
2007-12-29 20:47 ` Dan Williams
2007-12-29 20:58 ` dean gaudet
2007-12-29 21:50 ` Justin Piszcz
2007-12-29 22:11 ` dean gaudet
2007-12-29 22:21 ` dean gaudet
2007-12-29 22:06 ` Dan Williams
2007-12-30 17:58 ` dean gaudet
2008-01-09 18:28 ` Dan Williams [this message]
2008-01-10 0:09 ` Neil Brown
2008-01-10 3:07 ` Dan Williams
2008-01-10 3:57 ` Neil Brown
2008-01-10 4:56 ` Dan Williams
2008-01-10 20:28 ` Bill Davidsen
2008-01-10 7:13 ` dean gaudet
2008-01-10 18:49 ` Dan Williams
2008-01-11 1:46 ` Neil Brown
2008-01-11 2:14 ` dean gaudet
2008-01-10 17:59 ` dean gaudet
2007-12-27 19:52 ` Justin Piszcz
2007-12-28 0:08 ` dean gaudet
-- strict thread matches above, loose matches on Subject: below --
2008-01-23 13:37 Tim Southerwood
2008-01-23 17:43 ` Carlos Carvalho
2008-01-24 20:30 ` Tim Southerwood
2008-01-28 17:29 ` Tim Southerwood
2008-01-29 14:16 ` Carlos Carvalho
2008-01-29 22:58 ` Bill Davidsen
2008-02-14 10:13 ` Burkhard Carstens