Re: [RFC]RAID5: batch adjacent full stripe write

All of lore.kernel.org
 help / color / mirror / Atom feed

From: NeilBrown <neilb@suse.de>
To: Shaohua Li <shli@kernel.org>
Cc: linux-raid@vger.kernel.org
Subject: Re: [RFC]RAID5: batch adjacent full stripe write
Date: Tue, 2 Sep 2014 16:52:40 +1000	[thread overview]
Message-ID: <20140902165240.3affa8b0@notabene.brown> (raw)
In-Reply-To: <20140818082531.GA13732@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 3592 bytes --]

On Mon, 18 Aug 2014 16:25:31 +0800 Shaohua Li <shli@kernel.org> wrote:

> 
> stripe cache is 4k size. Even adjacent full stripe writes are handled in 4k
> unit. Idealy we should use big size for adjacent full stripe writes. Bigger
> stripe cache size means less stripes runing in the state machine so can reduce
> cpu overhead. And also bigger size can cause bigger IO size dispatched to under
> layer disks.
> 
> With below patch, we will automatically batch adjacent full stripe write
> together. Such stripes will form to a container and be added to the container
> list. Only the first stripe of a container will be put to handle_list and so
> run handle_stripe(). Some steps of handle_stripe() are extended to cover whole
> container stripes, including ops_run_io, ops_run_biodrain and so on. With this
> patch, we have less stripes running in handle_stripe() and we send IO of whole
> container stripes together to increase IO size.
> 
> Stripes added to a container have some limitations. A container can only
> include full stripe write and can't cross chunk boundary to make sure stripes
> have the same parity disk. Stripes in a container must in the same state (no
> written, toread and so on). If a stripe is in a container, all new read/write
> to add_stripe_bio will be blocked to overlap conflict till the container are
> handled. The limitations will make sure stripes in a container in exactly the
> same state in the life circly of the container.
> 
> I did test running 160k randwrite in a RAID5 array with 32k chunk size and 6
> PCIe SSD. This patch improves around 30% performance and IO size to under layer
> disk is exactly 32k. I also run a 4k randwrite test in the same array to make
> sure the performance isn't changed with the patch.
> 
> Signed-off-by: Shaohua Li <shli@fusionio.com>

Thanks for posting this ... and sorry for taking so long to look at it - I'm
still fighting of the flu so I'm not thinking a clearly as I would like so
I'll have to look over this again once I'm fully recovered.

I think I like it.  It seems more complex than I would like, which makes it
harder to review, but it probably needs to be that complex to actually work.

I'm a bit worried about the ->scribble  usage.  The default chunk size of
512K with means 128 stripe_heads in a batch.  On a 64 bit machine that is
1kilobyte of pointers per device.  8 devices in a RAID6 means more than 8K
needs to be allocated for ->scribble.  That has a risk of failing.

Maybe it would make sense to use a flex_array
(Documentation/flexible-arrays.txt).

Splitting out the changes for ->scribble into a separate patch might help.

The testing for "can this stripe_head be batched" seems a bit clumsy - lots
of loops hunting for problems.
Could we just set a "don't batch" flag whenever something happens that makes
a stripe un-batchable?  Have another flag that gets set when a stripe becomes
a full-write stripe?

Can we call the collections of stripe_heads "batch"es rather than
"container"s?   mdadm already used the name "containers" for something else,
and I think "batch" fits better.

I think it might be useful if we could start batching together stripe_heads
that are in the same stripe, even before they are full-write.  That might
help the scheduling and avoid some of the unnecessary pre-reading that we
currently do.  I haven't really thought properly about it and don't expect
you to do that, but I thought I'd mention it anyway.

Do any of the above changes seem reasonable to you?

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

next prev parent reply	other threads:[~2014-09-02  6:52 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-08-18  8:25 [RFC]RAID5: batch adjacent full stripe write Shaohua Li
2014-09-02  6:52 ` NeilBrown [this message]
2014-09-03  1:04   ` Shaohua Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140902165240.3affa8b0@notabene.brown \
    --to=neilb@suse.de \
    --cc=linux-raid@vger.kernel.org \
    --cc=shli@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.