Re: [patch 1/4] raid1: directly dispatch write request if no bitmap

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Shaohua Li <shli@kernel.org>
To: NeilBrown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org, axboe@kernel.dk, shli@fusionio.com
Subject: Re: [patch 1/4] raid1: directly dispatch write request if no bitmap
Date: Thu, 24 May 2012 11:31:26 +0800	[thread overview]
Message-ID: <20120524033126.GA1172@kernel.org> (raw)
In-Reply-To: <20120524130912.4076ea1a@notabene.brown>

On Thu, May 24, 2012 at 01:09:12PM +1000, NeilBrown wrote:
> On Thu, 24 May 2012 10:54:25 +0800 Shaohua Li <shli@kernel.org> wrote:
> 
> > On Thu, May 24, 2012 at 12:21:12PM +1000, NeilBrown wrote:
> > > On Wed, 23 May 2012 15:26:20 +0800 Shaohua Li <shli@kernel.org> wrote:
> > > 
> > > > In raid1, all write requests are dispatched in raid1d thread. In fast storage,
> > > > the raid1d thread is a bottleneck, because it dispatches request too slow. Also
> > > > raid1d thread migrates freely, which makes request completion cpu not match
> > > > with submission cpu even driver/block layer has such capability. This will
> > > > cause bad cache issue.
> > > > 
> > > > If no bitmap, there is no point to queue bio to a thread and dispatch it in the
> > > > thread. Directly dispatching bio doesn't impact correctness and removes above
> > > > bottleneck.
> > > > 
> > > > Multiple threads dispatch requests could potentially reduce request merge and
> > > > increase lock contention. For slow stroage, we just worry about request merge.
> > > > Caller of .make_request should already have correct block plug set, which will
> > > > take care of request merge and locking just like accessing raw device, so we
> > > > don't need worry about this too much.
> > > > 
> > > > In a 4k randwrite test with a 2 disks setup, below patch can provide 20% ~ 50%
> > > > performance improvements depending on numa binding.
> > > > 
> > > > Signed-off-by: Shaohua Li <shli@fusionio.com>
> > > > ---
> > > >  drivers/md/raid1.c |   11 +++++++----
> > > >  1 file changed, 7 insertions(+), 4 deletions(-)
> > > > 
> > > > Index: linux/drivers/md/raid1.c
> > > > ===================================================================
> > > > --- linux.orig/drivers/md/raid1.c	2012-05-22 13:50:26.989820654 +0800
> > > > +++ linux/drivers/md/raid1.c	2012-05-22 13:56:46.117054559 +0800
> > > > @@ -1187,10 +1187,13 @@ read_again:
> > > >  		mbio->bi_private = r1_bio;
> > > >  
> > > >  		atomic_inc(&r1_bio->remaining);
> > > > -		spin_lock_irqsave(&conf->device_lock, flags);
> > > > -		bio_list_add(&conf->pending_bio_list, mbio);
> > > > -		conf->pending_count++;
> > > > -		spin_unlock_irqrestore(&conf->device_lock, flags);
> > > > +		if (bitmap) {
> > > > +			spin_lock_irqsave(&conf->device_lock, flags);
> > > > +			bio_list_add(&conf->pending_bio_list, mbio);
> > > > +			conf->pending_count++;
> > > > +			spin_unlock_irqrestore(&conf->device_lock, flags);
> > > > +		} else
> > > > +			generic_make_request(mbio);
> > > >  	}
> > > >  	/* Mustn't call r1_bio_write_done before this next test,
> > > >  	 * as it could result in the bio being freed.
> > > 
> > > This looks like it should be 'obviously correct' but unfortunately it isn't.
> > > 
> > > If this raid1 is beneath a dm device (e.g. LVM), then generic_make_request
> > > will queue the request internally and not actually start it.
> > > In particular this means that the cloned bio has no chance of being released
> > > before the next clone_bio call which is made on the next time around the loop.
> > > 
> > > This can lead to a deadlock as the clone_bio() might be waiting for that
> > > first cloned bio to be released (if memory is really tight).
> > > 
> > > i.e. when allocating multiple bios from a mempool, we need to arrange for
> > > them to be submitted by a separate thread.
> > 
> > If there isn't block plug, generic_make_request will dispatch the request
> > directly.
> 
> Not necessarily.  It might queue it:
> 	if (current->bio_list) {
> 		bio_list_add(current->bio_list, bio);
> 		return;
> 	}

Ah, yes. Looks we can workaround it to increase bioset entry number to
raid_disks*2. There isn't too much memory wasted with it.

Thanks,
Shaohua

next prev parent reply	other threads:[~2012-05-24  3:31 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-05-23  7:26 [patch 0/4] MD: improve raid1/10 write performance for fast storage Shaohua Li
2012-05-23  7:26 ` [patch 1/4] raid1: directly dispatch write request if no bitmap Shaohua Li
2012-05-24  2:21   ` NeilBrown
2012-05-24  2:54     ` Shaohua Li
2012-05-24  3:09       ` NeilBrown
2012-05-24  3:31         ` Shaohua Li [this message]
2012-05-24  3:51           ` NeilBrown
2012-05-23  7:26 ` [patch 2/4] raid1: percpu dispatch for write request if bitmap supported Shaohua Li
2012-05-24  3:34   ` NeilBrown
2012-05-24  5:17     ` Shaohua Li
2012-05-24  5:34       ` NeilBrown
2012-05-24  5:50         ` Shaohua Li
2012-05-23  7:26 ` [patch 3/4] raid10: directly dispatch write request if no bitmap Shaohua Li
2012-05-23  7:26 ` [patch 4/4] raid10: percpu dispatch for write request if bitmap supported Shaohua Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120524033126.GA1172@kernel.org \
    --to=shli@kernel.org \
    --cc=axboe@kernel.dk \
    --cc=linux-raid@vger.kernel.org \
    --cc=neilb@suse.de \
    --cc=shli@fusionio.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.