Re: [PATCH] raid5: add support for rmw writes in raid6

All of lore.kernel.org
 help / color / mirror / Atom feed

From: NeilBrown <neilb@suse.de>
To: Dan Williams <djbw@fb.com>
Cc: linux-raid@vger.kernel.org, Kumar Sundararajan <kumar@fb.com>
Subject: Re: [PATCH] raid5: add support for rmw writes in raid6
Date: Mon, 22 Apr 2013 14:22:11 +1000	[thread overview]
Message-ID: <20130422142211.71a1ae8e@notabene.brown> (raw)
In-Reply-To: <20130418005609.6782.15041.stgit@dev279.prn2.facebook.com>

[-- Attachment #1: Type: text/plain, Size: 15497 bytes --]

On Wed, 17 Apr 2013 17:56:56 -0700 Dan Williams <djbw@fb.com> wrote:

> From: Kumar Sundararajan <kumar@fb.com>
> 
> Add support for rmw writes in raid6. This improves small write performance for most workloads.
> Since there may be some configurations and workloads where rcw always outperforms rmw,
> this feature is disabled by default. It can be enabled via sysfs. For example,
> 
> 	#echo 1 > /sys/block/md0/md/enable_rmw
> 
> Signed-off-by: Kumar Sundararajan <kumar@fb.com>
> Signed-off-by: Dan Williams <djbw@fb.com>
> ---
> Hi Neil,
> 
> We decided to leave the enable in for the few cases where forced-rcw
> outperformed rmw and there may be other cases out there.
> 
> Thoughts?

- More commentary would help.  The text at the top should explain enough so
  when I read the code I am just verifying the text at the top, not trying to
  figure out how it is supposed to work.

- If 'enable_rmw' really is a good idea, then it is possibly a good idea for
  RAID5 to and so should be a separate patch and should work for RAID4/5/6.
  The default for each array type may well be different, but the
  functionality would be the same.

- Can you  explain *why* rcw is sometimes better than rmw even on large
  arrays? Even a fairly hand-wavy arguement would help.  And it would go in
  the comment at the top of the patch that adds enable_rmw.

- patch looks fairly sane assuming that it works - but I don't really know if
  it does.  You've reported speeds but haven't told me that you ran 'check'
  after doing each test and it reported no mismatches.  I suspect you did but
  I'd like to be told.  I'd also like to be told what role 'spare2' plays.

There: my complaints are longer than your explanatory text, so I think I
can stop now :-)

Oh, and the stuff below, that should be above so that it gets captured with
the patch and remains in the git logs of posterity. 

NeilBrown

P.S. a couple of comments further down.

> 
> Here are some numbers from Kumar's testing with a 12 disk array:
> 
>                                      with rmw   rcw only
> 4K random write                    887.0 KB/s  94.4 KB/s
> 64k seq write                      291.5 MB/s 283.2 MB/s
> 64k random write                    16.4 MB/s   7.7 MB/s
> 64K mixed read/write                94.4 MB/s  94.0 MB/s
> (70% random reads/30 % seq writes)   1.6 MB/s   1.8 MB/s
> 
> --
> Dan
> 
>  drivers/md/raid5.c |  169 ++++++++++++++++++++++++++++++++++++++++++++++++----
>  drivers/md/raid5.h |    4 +
>  2 files changed, 161 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index e25520e..c9b6112 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -1015,6 +1015,40 @@ static int set_syndrome_sources(struct page **srcs, struct stripe_head *sh)
>  	return syndrome_disks;
>  }
>  
> +static int set_rmw_syndrome_sources(struct page **srcs, struct stripe_head *sh,
> +	 struct raid5_percpu *percpu, int subtract)

Arg.  Bad intending.

I suspect it would look better if this were two separate functions instead of
having a 'subtract' arg, but I'm not sure.


> +{
> +	int disks = sh->disks;
> +	int syndrome_disks = sh->ddf_layout ? disks : (disks - 2);
> +	int d0_idx = raid6_d0(sh);
> +	int count;
> +	int i;
> +
> +	for (i = 0; i < disks; i++)
> +		srcs[i] = NULL;
> +
> +	count = 0;
> +	i = d0_idx;
> +	do {
> +		int slot = raid6_idx_to_slot(i, sh, &count, syndrome_disks);
> +
> +		if (subtract) {
> +			if (test_bit(R5_Wantdrain, &sh->dev[i].flags))
> +				srcs[slot] = sh->dev[i].page;
> +			else if (i == sh->pd_idx)
> +				srcs[slot] = percpu->spare_page;
> +			else if (i == sh->qd_idx)
> +				srcs[slot] = percpu->spare_page2;
> +		} else if (sh->dev[i].written || i == sh->pd_idx ||
> +			    i == sh->qd_idx)
> +			srcs[slot] = sh->dev[i].page;
> +
> +		i = raid6_next_disk(i, disks);
> +	} while (i != d0_idx);
> +
> +	return syndrome_disks;
> +}
> +
>  static struct dma_async_tx_descriptor *
>  ops_run_compute6_1(struct stripe_head *sh, struct raid5_percpu *percpu)
>  {
> @@ -1401,6 +1435,50 @@ ops_run_reconstruct6(struct stripe_head *sh, struct raid5_percpu *percpu,
>  	async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE,  &submit);
>  }
>  
> +static struct dma_async_tx_descriptor *
> +ops_run_rmw(struct stripe_head *sh, struct raid5_percpu *percpu,
> +	    struct dma_async_tx_descriptor *tx, int subtract)
> +{
> +	int pd_idx = sh->pd_idx;
> +	int qd_idx = sh->qd_idx;
> +	struct page **blocks = percpu->scribble;
> +	struct page *xor_dest;
> +	struct r5dev *dev;
> +	struct async_submit_ctl submit;
> +	int count;
> +
> +	pr_debug("%s: stripe %llu\n", __func__,
> +		(unsigned long long)sh->sector);
> +
> +	count = set_rmw_syndrome_sources(blocks, sh, percpu, subtract);
> +
> +	init_async_submit(&submit, ASYNC_TX_ACK, tx, NULL, NULL,
> +			  to_addr_conv(sh, percpu));
> +	tx = async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit);
> +
> +	dev = &sh->dev[pd_idx];
> +	xor_dest = blocks[0] = subtract ? percpu->spare_page : dev->page;
> +	blocks[1] = subtract ? dev->page : percpu->spare_page;
> +
> +	init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_DROP_DST, tx,
> +			  NULL, sh, to_addr_conv(sh, percpu));
> +	tx = async_xor(xor_dest, blocks, 0, 2, STRIPE_SIZE, &submit);
> +
> +	dev = &sh->dev[qd_idx];
> +	xor_dest = blocks[0] = subtract ? percpu->spare_page2 : dev->page;
> +	blocks[1] = subtract ? dev->page : percpu->spare_page2;
> +
> +	if (!subtract)
> +		atomic_inc(&sh->count);
> +
> +	init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_DROP_DST, tx,
> +			  subtract ? NULL : ops_complete_reconstruct, sh,
> +			  to_addr_conv(sh, percpu));
> +	tx = async_xor(xor_dest, blocks, 0, 2, STRIPE_SIZE, &submit);
> +
> +	return tx;

The continual switching on 'subtract' make this hard to read too.  It is
probably a bit big to duplication ... Is there anything you can do to make it
easier to read?


> +}
> +
>  static void ops_complete_check(void *stripe_head_ref)
>  {
>  	struct stripe_head *sh = stripe_head_ref;
> @@ -1500,6 +1578,9 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
>  	if (test_bit(STRIPE_OP_PREXOR, &ops_request))
>  		tx = ops_run_prexor(sh, percpu, tx);
>  
> +	if (test_bit(STRIPE_OP_RMW_SUBTRACT, &ops_request))
> +		tx = ops_run_rmw(sh, percpu, tx, 1);
> +
>  	if (test_bit(STRIPE_OP_BIODRAIN, &ops_request)) {
>  		tx = ops_run_biodrain(sh, tx);
>  		overlap_clear++;
> @@ -1512,6 +1593,9 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
>  			ops_run_reconstruct6(sh, percpu, tx);
>  	}
>  
> +	if (test_bit(STRIPE_OP_RMW, &ops_request))
> +		tx = ops_run_rmw(sh, percpu, tx, 0);
> +
>  	if (test_bit(STRIPE_OP_CHECK, &ops_request)) {
>  		if (sh->check_state == check_state_run)
>  			ops_run_check_p(sh, percpu);
> @@ -2346,7 +2430,7 @@ static void
>  schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
>  			 int rcw, int expand)
>  {
> -	int i, pd_idx = sh->pd_idx, disks = sh->disks;
> +	int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx, disks = sh->disks;
>  	struct r5conf *conf = sh->raid_conf;
>  	int level = conf->level;
>  
> @@ -2382,13 +2466,15 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
>  			if (!test_and_set_bit(STRIPE_FULL_WRITE, &sh->state))
>  				atomic_inc(&conf->pending_full_writes);
>  	} else {
> -		BUG_ON(level == 6);
>  		BUG_ON(!(test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags) ||
>  			test_bit(R5_Wantcompute, &sh->dev[pd_idx].flags)));
> +		BUG_ON(level == 6 &&
> +			(!(test_bit(R5_UPTODATE, &sh->dev[qd_idx].flags) ||
> +			test_bit(R5_Wantcompute, &sh->dev[qd_idx].flags))));
>  
>  		for (i = disks; i--; ) {
>  			struct r5dev *dev = &sh->dev[i];
> -			if (i == pd_idx)
> +			if (i == pd_idx || i == qd_idx)
>  				continue;
>  
>  			if (dev->towrite &&
> @@ -2404,9 +2490,16 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
>  			/* False alarm - nothing to do */
>  			return;
>  		sh->reconstruct_state = reconstruct_state_prexor_drain_run;
> -		set_bit(STRIPE_OP_PREXOR, &s->ops_request);
> -		set_bit(STRIPE_OP_BIODRAIN, &s->ops_request);
> -		set_bit(STRIPE_OP_RECONSTRUCT, &s->ops_request);
> +
> +		if (level == 6) {
> +			set_bit(STRIPE_OP_RMW_SUBTRACT, &s->ops_request);
> +			set_bit(STRIPE_OP_BIODRAIN, &s->ops_request);
> +			set_bit(STRIPE_OP_RMW, &s->ops_request);
> +		} else {
> +			set_bit(STRIPE_OP_PREXOR, &s->ops_request);
> +			set_bit(STRIPE_OP_BIODRAIN, &s->ops_request);
> +			set_bit(STRIPE_OP_RECONSTRUCT, &s->ops_request);
> +		}
>  	}
>  
>  	/* keep the parity disk(s) locked while asynchronous operations
> @@ -2876,15 +2969,14 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>  	int rmw = 0, rcw = 0, i;
>  	sector_t recovery_cp = conf->mddev->recovery_cp;
>  
> -	/* RAID6 requires 'rcw' in current implementation.
> -	 * Otherwise, check whether resync is now happening or should start.
> +	/* Check whether resync is now happening or should start.
>  	 * If yes, then the array is dirty (after unclean shutdown or
>  	 * initial creation), so parity in some stripes might be inconsistent.
>  	 * In this case, we need to always do reconstruct-write, to ensure
>  	 * that in case of drive failure or read-error correction, we
>  	 * generate correct data from the parity.
>  	 */
> -	if (conf->max_degraded == 2 ||
> +	if ((conf->max_degraded == 2 && !conf->enable_rmw) ||
>  	    (recovery_cp < MaxSector && sh->sector >= recovery_cp)) {
>  		/* Calculate the real rcw later - for now make it
>  		 * look like rcw is cheaper
> @@ -2896,7 +2988,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>  	} else for (i = disks; i--; ) {
>  		/* would I have to read this buffer for read_modify_write */
>  		struct r5dev *dev = &sh->dev[i];
> -		if ((dev->towrite || i == sh->pd_idx) &&
> +		if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx) &&
>  		    !test_bit(R5_LOCKED, &dev->flags) &&
>  		    !(test_bit(R5_UPTODATE, &dev->flags) ||
>  		      test_bit(R5_Wantcompute, &dev->flags))) {
> @@ -2907,6 +2999,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>  		}
>  		/* Would I have to read this buffer for reconstruct_write */
>  		if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sh->pd_idx &&
> +		    i != sh->qd_idx &&
>  		    !test_bit(R5_LOCKED, &dev->flags) &&
>  		    !(test_bit(R5_UPTODATE, &dev->flags) ||
>  		    test_bit(R5_Wantcompute, &dev->flags))) {
> @@ -2926,7 +3019,8 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>  					  (unsigned long long)sh->sector, rmw);
>  		for (i = disks; i--; ) {
>  			struct r5dev *dev = &sh->dev[i];
> -			if ((dev->towrite || i == sh->pd_idx) &&
> +			if ((dev->towrite || i == sh->pd_idx ||
> +			    i == sh->qd_idx) &&
>  			    !test_bit(R5_LOCKED, &dev->flags) &&
>  			    !(test_bit(R5_UPTODATE, &dev->flags) ||
>  			    test_bit(R5_Wantcompute, &dev->flags)) &&
> @@ -5360,11 +5454,48 @@ raid5_auxthread_number = __ATTR(auxthread_number, S_IRUGO|S_IWUSR,
>  				raid5_show_auxthread_number,
>  				raid5_store_auxthread_number);
>  
> +static ssize_t
> +raid5_show_enable_rmw(struct mddev  *mddev, char *page)
> +{
> +	struct r5conf *conf = mddev->private;
> +
> +	if (conf->level == 6)
> +		return sprintf(page, "%d\n", conf->enable_rmw);
> +	else
> +		return sprintf(page, "%d\n", 1);
> +}
> +
> +static ssize_t
> +raid5_store_enable_rmw(struct mddev  *mddev, const char *page, size_t len)
> +{
> +	struct r5conf *conf = mddev->private;
> +	unsigned long new;
> +
> +	if (!conf)
> +		return -ENODEV;
> +	if (conf->level != 6)
> +		return -EINVAL;
> +
> +	if (len >= PAGE_SIZE)
> +		return -EINVAL;
> +
> +	if (kstrtoul(page, 10, &new))
> +		return -EINVAL;
> +	conf->enable_rmw = !!new;
> +	return len;
> +}
> +
> +static struct md_sysfs_entry
> +raid5_enable_rmw = __ATTR(enable_rmw, S_IRUGO | S_IWUSR,
> +				raid5_show_enable_rmw,
> +				raid5_store_enable_rmw);
> +
>  static struct attribute *raid5_attrs[] =  {
>  	&raid5_stripecache_size.attr,
>  	&raid5_stripecache_active.attr,
>  	&raid5_preread_bypass_threshold.attr,
>  	&raid5_auxthread_number.attr,
> +	&raid5_enable_rmw.attr,
>  	NULL,
>  };
>  static struct attribute_group raid5_attrs_group = {
> @@ -5400,6 +5531,7 @@ static void raid5_free_percpu(struct r5conf *conf)
>  	for_each_possible_cpu(cpu) {
>  		percpu = per_cpu_ptr(conf->percpu, cpu);
>  		safe_put_page(percpu->spare_page);
> +		safe_put_page(percpu->spare_page2);
>  		kfree(percpu->scribble);
>  	}
>  #ifdef CONFIG_HOTPLUG_CPU
> @@ -5433,12 +5565,16 @@ static int raid456_cpu_notify(struct notifier_block *nfb, unsigned long action,
>  	case CPU_UP_PREPARE_FROZEN:
>  		if (conf->level == 6 && !percpu->spare_page)
>  			percpu->spare_page = alloc_page(GFP_KERNEL);
> +		if (conf->level == 6 && !percpu->spare_page2)
> +			percpu->spare_page2 = alloc_page(GFP_KERNEL);
>  		if (!percpu->scribble)
>  			percpu->scribble = kmalloc(conf->scribble_len, GFP_KERNEL);
>  
>  		if (!percpu->scribble ||
> -		    (conf->level == 6 && !percpu->spare_page)) {
> +		    (conf->level == 6 && !percpu->spare_page) ||
> +		    (conf->level == 6 && !percpu->spare_page2)) {
>  			safe_put_page(percpu->spare_page);
> +			safe_put_page(percpu->spare_page2);
>  			kfree(percpu->scribble);
>  			pr_err("%s: failed memory allocation for cpu%ld\n",
>  			       __func__, cpu);
> @@ -5456,8 +5592,10 @@ static int raid456_cpu_notify(struct notifier_block *nfb, unsigned long action,
>  		spin_unlock_irq(&conf->device_lock);
>  
>  		safe_put_page(percpu->spare_page);
> +		safe_put_page(percpu->spare_page2);
>  		kfree(percpu->scribble);
>  		percpu->spare_page = NULL;
> +		percpu->spare_page2 = NULL;
>  		percpu->scribble = NULL;
>  		break;
>  	default:
> @@ -5496,6 +5634,13 @@ static int raid5_alloc_percpu(struct r5conf *conf)
>  				break;
>  			}
>  			percpu->spare_page = spare_page;
> +			spare_page = alloc_page(GFP_KERNEL);
> +			if (!spare_page) {
> +				err = -ENOMEM;
> +				break;
> +			}
> +			per_cpu_ptr(conf->percpu, cpu)->spare_page2 =
> +			    spare_page;
>  		}
>  		scribble = kmalloc(conf->scribble_len, GFP_KERNEL);
>  		if (!scribble) {
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 217cb19..fd7ed19 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -334,6 +334,8 @@ enum {
>  	STRIPE_OP_PREXOR,
>  	STRIPE_OP_BIODRAIN,
>  	STRIPE_OP_RECONSTRUCT,
> +	STRIPE_OP_RMW_SUBTRACT,
> +	STRIPE_OP_RMW,
>  	STRIPE_OP_CHECK,
>  };
>  /*
> @@ -437,6 +439,7 @@ struct r5conf {
>  	/* per cpu variables */
>  	struct raid5_percpu {
>  		struct page	*spare_page; /* Used when checking P/Q in raid6 */
> +		struct page	*spare_page2; /* Used for rmw writes in raid6 */
>  		void		*scribble;   /* space for constructing buffer
>  					      * lists and performing address
>  					      * conversions
> @@ -479,6 +482,7 @@ struct r5conf {
>  	struct raid5_auxth	**aux_threads;
>  	/* which CPUs should raid5d thread handle stripes from */
>  	cpumask_t		work_mask;
> +	int			enable_rmw; /* 1 if rmw is enabled for raid6 */
>  };
>  
>  /*


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

next prev parent reply	other threads:[~2013-04-22  4:22 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-04-18  0:56 [PATCH] raid5: add support for rmw writes in raid6 Dan Williams
2013-04-18 18:40 ` Dan Williams
2013-04-22  4:22 ` NeilBrown [this message]
2013-04-26 21:35   ` Dan Williams
2013-04-29  1:29     ` NeilBrown
2013-04-29  7:10       ` David Brown
2013-04-29 16:11         ` Kumar Sundararajan
2013-04-29 19:28           ` David Brown
2013-04-30  0:05             ` Dan Williams
2013-04-30  6:48               ` David Brown
2013-04-30 16:01                 ` Kumar Sundararajan
2013-04-30 16:10                   ` Kumar Sundararajan
2013-04-30 18:39                     ` David Brown
2013-04-29 17:54       ` Kumar Sundararajan
2013-04-30 21:32       ` Dan Williams
2013-05-01 13:57         ` David Brown
2013-05-08 17:42       ` Dan Williams

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130422142211.71a1ae8e@notabene.brown \
    --to=neilb@suse.de \
    --cc=djbw@fb.com \
    --cc=kumar@fb.com \
    --cc=linux-raid@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.