From: NeilBrown <neilb@suse.de>
To: Dan Williams <djbw@fb.com>
Cc: linux-raid@vger.kernel.org, Kumar Sundararajan <kumar@fb.com>
Subject: Re: [PATCH] raid5: add support for rmw writes in raid6
Date: Mon, 22 Apr 2013 14:22:11 +1000
Message-ID: <20130422142211.71a1ae8e@notabene.brown>
In-Reply-To: <20130418005609.6782.15041.stgit@dev279.prn2.facebook.com>
On Wed, 17 Apr 2013 17:56:56 -0700 Dan Williams <djbw@fb.com> wrote:
> From: Kumar Sundararajan <kumar@fb.com>
>
> Add support for rmw writes in raid6. This improves small write performance for most workloads.
> Since there may be some configurations and workloads where rcw always outperforms rmw,
> this feature is disabled by default. It can be enabled via sysfs. For example,
>
> #echo 1 > /sys/block/md0/md/enable_rmw
>
> Signed-off-by: Kumar Sundararajan <kumar@fb.com>
> Signed-off-by: Dan Williams <djbw@fb.com>
> ---
> Hi Neil,
>
> We decided to leave the enable in for the few cases where forced-rcw
> outperformed rmw and there may be other cases out there.
>
> Thoughts?
- More commentary would help. The text at the top should explain enough that
  when I read the code I am just verifying the text at the top, not trying to
  figure out how it is supposed to work.
- If 'enable_rmw' really is a good idea, then it is possibly a good idea for
  RAID5 too, and so should be a separate patch and should work for RAID4/5/6.
  The default for each array type may well be different, but the
  functionality would be the same.
- Can you explain *why* rcw is sometimes better than rmw even on large
  arrays? Even a fairly hand-wavy argument would help (my own back-of-envelope
  count is below). And it would go in the comment at the top of the patch
  that adds enable_rmw.
- Patch looks fairly sane assuming that it works - but I don't really know if
it does. You've reported speeds but haven't told me that you ran 'check'
after doing each test and it reported no mismatches. I suspect you did but
I'd like to be told. I'd also like to be told what role 'spare2' plays.
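(For what it's worth, my back-of-envelope count - tell me where it breaks
down: with n devices and k data blocks modified in one stripe,

    rmw: read the k old data blocks plus P and Q, write k + 2  ->  2k + 4 IOs
    rcw: read the n - 2 - k untouched data blocks, write k + 2  ->  n IOs

so rmw should win whenever 2k + 4 < n, e.g. on your 12-device array for any
write touching 3 or fewer data blocks. Any configuration where rcw still
wins despite that is exactly what the comment needs to explain.)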
There: my complaints are longer than your explanatory text, so I think I
can stop now :-)
Oh, and the stuff below, that should be above so that it gets captured with
the patch and remains in the git logs for posterity.
NeilBrown
P.S. a couple of comments further down.
>
> Here are some numbers from Kumar's testing with a 12 disk array:
>
>                                         with rmw      rcw only
>   4K random write                     887.0 KB/s     94.4 KB/s
>   64k seq write                       291.5 MB/s    283.2 MB/s
>   64k random write                     16.4 MB/s      7.7 MB/s
>   64K mixed read/write                 94.4 MB/s     94.0 MB/s  (reads)
>   (70% random reads/30% seq writes)     1.6 MB/s      1.8 MB/s  (writes)
>
> --
> Dan
>
> drivers/md/raid5.c | 169 ++++++++++++++++++++++++++++++++++++++++++++++++----
> drivers/md/raid5.h | 4 +
> 2 files changed, 161 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index e25520e..c9b6112 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -1015,6 +1015,40 @@ static int set_syndrome_sources(struct page **srcs, struct stripe_head *sh)
> return syndrome_disks;
> }
>
> +static int set_rmw_syndrome_sources(struct page **srcs, struct stripe_head *sh,
> + struct raid5_percpu *percpu, int subtract)
Arg. Bad indenting.
I suspect it would look better if this were two separate functions instead of
having a 'subtract' arg, but I'm not sure - a sketch of what I mean is below,
after the function body.
> +{
> + int disks = sh->disks;
> + int syndrome_disks = sh->ddf_layout ? disks : (disks - 2);
> + int d0_idx = raid6_d0(sh);
> + int count;
> + int i;
> +
> + for (i = 0; i < disks; i++)
> + srcs[i] = NULL;
> +
> + count = 0;
> + i = d0_idx;
> + do {
> + int slot = raid6_idx_to_slot(i, sh, &count, syndrome_disks);
> +
> + if (subtract) {
> + if (test_bit(R5_Wantdrain, &sh->dev[i].flags))
> + srcs[slot] = sh->dev[i].page;
> + else if (i == sh->pd_idx)
> + srcs[slot] = percpu->spare_page;
> + else if (i == sh->qd_idx)
> + srcs[slot] = percpu->spare_page2;
> + } else if (sh->dev[i].written || i == sh->pd_idx ||
> + i == sh->qd_idx)
> + srcs[slot] = sh->dev[i].page;
> +
> + i = raid6_next_disk(i, disks);
> + } while (i != d0_idx);
> +
> + return syndrome_disks;
> +}
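To be concrete, the sort of split I had in mind is something like this
(untested, and the name is only a suggestion):

static int set_rmw_subtract_sources(struct page **srcs, struct stripe_head *sh,
				    struct raid5_percpu *percpu)
{
	int disks = sh->disks;
	int syndrome_disks = sh->ddf_layout ? disks : (disks - 2);
	int d0_idx = raid6_d0(sh);
	int count = 0;
	int i = d0_idx;

	memset(srcs, 0, disks * sizeof(*srcs));

	do {
		int slot = raid6_idx_to_slot(i, sh, &count, syndrome_disks);

		/* the old data that is about to be overwritten ... */
		if (test_bit(R5_Wantdrain, &sh->dev[i].flags))
			srcs[slot] = sh->dev[i].page;
		/* ... with the partial syndrome landing in the spares */
		else if (i == sh->pd_idx)
			srcs[slot] = percpu->spare_page;
		else if (i == sh->qd_idx)
			srcs[slot] = percpu->spare_page2;

		i = raid6_next_disk(i, disks);
	} while (i != d0_idx);

	return syndrome_disks;
}

with a matching set_rmw_write_sources() for the !subtract case. The
duplicated loop skeleton could be shared via a small callback, but two
plain functions are probably easier to read.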
> +
> static struct dma_async_tx_descriptor *
> ops_run_compute6_1(struct stripe_head *sh, struct raid5_percpu *percpu)
> {
> @@ -1401,6 +1435,50 @@ ops_run_reconstruct6(struct stripe_head *sh, struct raid5_percpu *percpu,
> async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit);
> }
>
> +static struct dma_async_tx_descriptor *
> +ops_run_rmw(struct stripe_head *sh, struct raid5_percpu *percpu,
> + struct dma_async_tx_descriptor *tx, int subtract)
> +{
> + int pd_idx = sh->pd_idx;
> + int qd_idx = sh->qd_idx;
> + struct page **blocks = percpu->scribble;
> + struct page *xor_dest;
> + struct r5dev *dev;
> + struct async_submit_ctl submit;
> + int count;
> +
> + pr_debug("%s: stripe %llu\n", __func__,
> + (unsigned long long)sh->sector);
> +
> + count = set_rmw_syndrome_sources(blocks, sh, percpu, subtract);
> +
> + init_async_submit(&submit, ASYNC_TX_ACK, tx, NULL, NULL,
> + to_addr_conv(sh, percpu));
> + tx = async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit);
> +
> + dev = &sh->dev[pd_idx];
> + xor_dest = blocks[0] = subtract ? percpu->spare_page : dev->page;
> + blocks[1] = subtract ? dev->page : percpu->spare_page;
> +
> + init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_DROP_DST, tx,
> + NULL, sh, to_addr_conv(sh, percpu));
> + tx = async_xor(xor_dest, blocks, 0, 2, STRIPE_SIZE, &submit);
> +
> + dev = &sh->dev[qd_idx];
> + xor_dest = blocks[0] = subtract ? percpu->spare_page2 : dev->page;
> + blocks[1] = subtract ? dev->page : percpu->spare_page2;
> +
> + if (!subtract)
> + atomic_inc(&sh->count);
> +
> + init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_DROP_DST, tx,
> + subtract ? NULL : ops_complete_reconstruct, sh,
> + to_addr_conv(sh, percpu));
> + tx = async_xor(xor_dest, blocks, 0, 2, STRIPE_SIZE, &submit);
> +
> + return tx;
The continual switching on 'subtract' makes this hard to read too. It is
probably a bit big to duplicate ... Is there anything you can do to make it
easier to read? A possible shape is sketched just below.
> +}
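Factoring the repeated P/Q correction step out would at least halve the
'subtract ? :' noise - something like this, untested:

static struct dma_async_tx_descriptor *
ops_run_rmw_xor(struct stripe_head *sh, struct raid5_percpu *percpu,
		struct dma_async_tx_descriptor *tx, struct r5dev *dev,
		struct page *spare, int subtract,
		dma_async_tx_callback done)
{
	struct page **blocks = percpu->scribble;
	struct async_submit_ctl submit;
	struct page *xor_dest;

	/* subtract: fold the old parity into the spare page;
	 * otherwise: fold the spare back into the new parity page */
	xor_dest = blocks[0] = subtract ? spare : dev->page;
	blocks[1] = subtract ? dev->page : spare;

	init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_DROP_DST, tx,
			  done, sh, to_addr_conv(sh, percpu));
	return async_xor(xor_dest, blocks, 0, 2, STRIPE_SIZE, &submit);
}

Then ops_run_rmw() is just the gen_syndrome plus two calls to that, and the
remaining 'subtract' tests (the atomic_inc and the choice of completion
callback) stand out clearly.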
> +
> static void ops_complete_check(void *stripe_head_ref)
> {
> struct stripe_head *sh = stripe_head_ref;
> @@ -1500,6 +1578,9 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
> if (test_bit(STRIPE_OP_PREXOR, &ops_request))
> tx = ops_run_prexor(sh, percpu, tx);
>
> + if (test_bit(STRIPE_OP_RMW_SUBTRACT, &ops_request))
> + tx = ops_run_rmw(sh, percpu, tx, 1);
> +
> if (test_bit(STRIPE_OP_BIODRAIN, &ops_request)) {
> tx = ops_run_biodrain(sh, tx);
> overlap_clear++;
> @@ -1512,6 +1593,9 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
> ops_run_reconstruct6(sh, percpu, tx);
> }
>
> + if (test_bit(STRIPE_OP_RMW, &ops_request))
> + tx = ops_run_rmw(sh, percpu, tx, 0);
> +
> if (test_bit(STRIPE_OP_CHECK, &ops_request)) {
> if (sh->check_state == check_state_run)
> ops_run_check_p(sh, percpu);
> @@ -2346,7 +2430,7 @@ static void
> schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
> int rcw, int expand)
> {
> - int i, pd_idx = sh->pd_idx, disks = sh->disks;
> + int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx, disks = sh->disks;
> struct r5conf *conf = sh->raid_conf;
> int level = conf->level;
>
> @@ -2382,13 +2466,15 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
> if (!test_and_set_bit(STRIPE_FULL_WRITE, &sh->state))
> atomic_inc(&conf->pending_full_writes);
> } else {
> - BUG_ON(level == 6);
> BUG_ON(!(test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags) ||
> test_bit(R5_Wantcompute, &sh->dev[pd_idx].flags)));
> + BUG_ON(level == 6 &&
> + (!(test_bit(R5_UPTODATE, &sh->dev[qd_idx].flags) ||
> + test_bit(R5_Wantcompute, &sh->dev[qd_idx].flags))));
>
> for (i = disks; i--; ) {
> struct r5dev *dev = &sh->dev[i];
> - if (i == pd_idx)
> + if (i == pd_idx || i == qd_idx)
> continue;
>
> if (dev->towrite &&
> @@ -2404,9 +2490,16 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
> /* False alarm - nothing to do */
> return;
> sh->reconstruct_state = reconstruct_state_prexor_drain_run;
> - set_bit(STRIPE_OP_PREXOR, &s->ops_request);
> - set_bit(STRIPE_OP_BIODRAIN, &s->ops_request);
> - set_bit(STRIPE_OP_RECONSTRUCT, &s->ops_request);
> +
> + if (level == 6) {
> + set_bit(STRIPE_OP_RMW_SUBTRACT, &s->ops_request);
> + set_bit(STRIPE_OP_BIODRAIN, &s->ops_request);
> + set_bit(STRIPE_OP_RMW, &s->ops_request);
> + } else {
> + set_bit(STRIPE_OP_PREXOR, &s->ops_request);
> + set_bit(STRIPE_OP_BIODRAIN, &s->ops_request);
> + set_bit(STRIPE_OP_RECONSTRUCT, &s->ops_request);
> + }
> }
>
> /* keep the parity disk(s) locked while asynchronous operations
> @@ -2876,15 +2969,14 @@ static void handle_stripe_dirtying(struct r5conf *conf,
> int rmw = 0, rcw = 0, i;
> sector_t recovery_cp = conf->mddev->recovery_cp;
>
> - /* RAID6 requires 'rcw' in current implementation.
> - * Otherwise, check whether resync is now happening or should start.
> + /* Check whether resync is now happening or should start.
> * If yes, then the array is dirty (after unclean shutdown or
> * initial creation), so parity in some stripes might be inconsistent.
> * In this case, we need to always do reconstruct-write, to ensure
> * that in case of drive failure or read-error correction, we
> * generate correct data from the parity.
> */
> - if (conf->max_degraded == 2 ||
> + if ((conf->max_degraded == 2 && !conf->enable_rmw) ||
> (recovery_cp < MaxSector && sh->sector >= recovery_cp)) {
> /* Calculate the real rcw later - for now make it
> * look like rcw is cheaper
> @@ -2896,7 +2988,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
> } else for (i = disks; i--; ) {
> /* would I have to read this buffer for read_modify_write */
> struct r5dev *dev = &sh->dev[i];
> - if ((dev->towrite || i == sh->pd_idx) &&
> + if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx) &&
> !test_bit(R5_LOCKED, &dev->flags) &&
> !(test_bit(R5_UPTODATE, &dev->flags) ||
> test_bit(R5_Wantcompute, &dev->flags))) {
> @@ -2907,6 +2999,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
> }
> /* Would I have to read this buffer for reconstruct_write */
> if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sh->pd_idx &&
> + i != sh->qd_idx &&
> !test_bit(R5_LOCKED, &dev->flags) &&
> !(test_bit(R5_UPTODATE, &dev->flags) ||
> test_bit(R5_Wantcompute, &dev->flags))) {
> @@ -2926,7 +3019,8 @@ static void handle_stripe_dirtying(struct r5conf *conf,
> (unsigned long long)sh->sector, rmw);
> for (i = disks; i--; ) {
> struct r5dev *dev = &sh->dev[i];
> - if ((dev->towrite || i == sh->pd_idx) &&
> + if ((dev->towrite || i == sh->pd_idx ||
> + i == sh->qd_idx) &&
> !test_bit(R5_LOCKED, &dev->flags) &&
> !(test_bit(R5_UPTODATE, &dev->flags) ||
> test_bit(R5_Wantcompute, &dev->flags)) &&
> @@ -5360,11 +5454,48 @@ raid5_auxthread_number = __ATTR(auxthread_number, S_IRUGO|S_IWUSR,
> raid5_show_auxthread_number,
> raid5_store_auxthread_number);
>
> +static ssize_t
> +raid5_show_enable_rmw(struct mddev *mddev, char *page)
> +{
> + struct r5conf *conf = mddev->private;
> +
> + if (conf->level == 6)
> + return sprintf(page, "%d\n", conf->enable_rmw);
> + else
> + return sprintf(page, "%d\n", 1);
> +}
> +
> +static ssize_t
> +raid5_store_enable_rmw(struct mddev *mddev, const char *page, size_t len)
> +{
> + struct r5conf *conf = mddev->private;
> + unsigned long new;
> +
> + if (!conf)
> + return -ENODEV;
> + if (conf->level != 6)
> + return -EINVAL;
> +
> + if (len >= PAGE_SIZE)
> + return -EINVAL;
> +
> + if (kstrtoul(page, 10, &new))
> + return -EINVAL;
> + conf->enable_rmw = !!new;
> + return len;
> +}
> +
> +static struct md_sysfs_entry
> +raid5_enable_rmw = __ATTR(enable_rmw, S_IRUGO | S_IWUSR,
> + raid5_show_enable_rmw,
> + raid5_store_enable_rmw);
> +
> static struct attribute *raid5_attrs[] = {
> &raid5_stripecache_size.attr,
> &raid5_stripecache_active.attr,
> &raid5_preread_bypass_threshold.attr,
> &raid5_auxthread_number.attr,
> + &raid5_enable_rmw.attr,
> NULL,
> };
> static struct attribute_group raid5_attrs_group = {
> @@ -5400,6 +5531,7 @@ static void raid5_free_percpu(struct r5conf *conf)
> for_each_possible_cpu(cpu) {
> percpu = per_cpu_ptr(conf->percpu, cpu);
> safe_put_page(percpu->spare_page);
> + safe_put_page(percpu->spare_page2);
> kfree(percpu->scribble);
> }
> #ifdef CONFIG_HOTPLUG_CPU
> @@ -5433,12 +5565,16 @@ static int raid456_cpu_notify(struct notifier_block *nfb, unsigned long action,
> case CPU_UP_PREPARE_FROZEN:
> if (conf->level == 6 && !percpu->spare_page)
> percpu->spare_page = alloc_page(GFP_KERNEL);
> + if (conf->level == 6 && !percpu->spare_page2)
> + percpu->spare_page2 = alloc_page(GFP_KERNEL);
> if (!percpu->scribble)
> percpu->scribble = kmalloc(conf->scribble_len, GFP_KERNEL);
>
> if (!percpu->scribble ||
> - (conf->level == 6 && !percpu->spare_page)) {
> + (conf->level == 6 && !percpu->spare_page) ||
> + (conf->level == 6 && !percpu->spare_page2)) {
> safe_put_page(percpu->spare_page);
> + safe_put_page(percpu->spare_page2);
> kfree(percpu->scribble);
> pr_err("%s: failed memory allocation for cpu%ld\n",
> __func__, cpu);
> @@ -5456,8 +5592,10 @@ static int raid456_cpu_notify(struct notifier_block *nfb, unsigned long action,
> spin_unlock_irq(&conf->device_lock);
>
> safe_put_page(percpu->spare_page);
> + safe_put_page(percpu->spare_page2);
> kfree(percpu->scribble);
> percpu->spare_page = NULL;
> + percpu->spare_page2 = NULL;
> percpu->scribble = NULL;
> break;
> default:
> @@ -5496,6 +5634,13 @@ static int raid5_alloc_percpu(struct r5conf *conf)
> break;
> }
> percpu->spare_page = spare_page;
> + spare_page = alloc_page(GFP_KERNEL);
> + if (!spare_page) {
> + err = -ENOMEM;
> + break;
> + }
> + per_cpu_ptr(conf->percpu, cpu)->spare_page2 =
> + spare_page;
> }
> scribble = kmalloc(conf->scribble_len, GFP_KERNEL);
> if (!scribble) {
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 217cb19..fd7ed19 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -334,6 +334,8 @@ enum {
> STRIPE_OP_PREXOR,
> STRIPE_OP_BIODRAIN,
> STRIPE_OP_RECONSTRUCT,
> + STRIPE_OP_RMW_SUBTRACT,
> + STRIPE_OP_RMW,
> STRIPE_OP_CHECK,
> };
> /*
> @@ -437,6 +439,7 @@ struct r5conf {
> /* per cpu variables */
> struct raid5_percpu {
> struct page *spare_page; /* Used when checking P/Q in raid6 */
> + struct page *spare_page2; /* Used for rmw writes in raid6 */
> void *scribble; /* space for constructing buffer
> * lists and performing address
> * conversions
> @@ -479,6 +482,7 @@ struct r5conf {
> struct raid5_auxth **aux_threads;
> /* which CPUs should raid5d thread handle stripes from */
> cpumask_t work_mask;
> + int enable_rmw; /* 1 if rmw is enabled for raid6 */
> };
>
> /*