From: David Brown <david@westcontrol.com>
To: linux-raid@vger.kernel.org
Subject: Re: raid5/raid6 write performance question
Date: Fri, 18 Feb 2011 10:56:34 +0100
Message-ID: <ijlfpe$ief$1@dough.gmane.org>
In-Reply-To: <AANLkTinRCMJq3bt5jmQfJ1PRfN1S-ZoJB0HXeMzFeqSA@mail.gmail.com>
On 17/02/2011 19:52, Patrick J. LoPresti wrote:
> I have a fair amount of experience with hardware RAID devices, but now
> I am investigating Linux software RAID and I have a question. Well, a
> few questions.
>
I'll give some answers, but I am not sure about all the details. I hope
that someone else will correct me if I'm wrong :-)
> The classic problem for RAID5/RAID6 write performance, especially when
> striping across many drives, is that a single small write requires
> reading in the entire stripe from all disks to calculate the new
> syndrome block(s).
>
You don't need to read the whole stripe (at least, not for RAID5 - I
don't know enough about RAID6 to comment).
With RAID5, the parity is the xor of all the other blocks in the stripe.
So if you only want to change one block, you can read the old block
and the old parity block, and calculate the new parity block as the xor
of the old data block, the old parity block, and the new data block.
You still have to do some reads then a write, but at least you don't
need to read the whole stripe.
I presume that's the way md RAID5 implements small writes.
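To make the arithmetic concrete, here is a minimal sketch of that read-modify-write parity update in Python. It only illustrates the xor identity and is not how md is actually written; the names xor_blocks and rmw_parity are made up for this example.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise xor of one or more equally sized blocks."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("need at least one block, all of the same size")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rmw_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """New parity from only the old data block, old parity and new data."""
    return xor_blocks(old_data, old_parity, new_data)

# Toy stripe: three data blocks plus one parity block (a 4-disk RAID5).
data = [bytes([value] * 8) for value in (1, 2, 3)]
parity = xor_blocks(*data)

# Small write to block 1: two reads (old data, old parity), two writes.
new_block = bytes([0x5A] * 8)
new_parity = rmw_parity(data[1], parity, new_block)
data[1] = new_block

# The shortcut gives the same parity as recomputing over the whole stripe.
assert new_parity == xor_blocks(*data)
```

The point is that blocks 0 and 2 are never touched: xoring the old data block out of the old parity and the new one in gives the same result as a full recalculation.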
> Hardware RAID controllers typically mitigate this problem by using a
> sizable (512MiB - 4GiB) non-volatile write-back cache, in the hopes
> that enough blocks will be written in a short period of time to
> populate an entire stripe. Once an entire stripe is in the write-back
> cache, it can be written out with its syndrome blocks without having
> to read anything.
>
> Of course, the cache has to be non-volatile (battery backed or solid
> state), because the kernel is expecting stuff it has written to disk
> not to vanish because of a power failure.
>
> My question is this: How does Linux RAID5/RAID6 avoid reading an
> entire stripe every time the kernel flushes a single page? Does it
> have a (volatile?) cache? Or does it rely on the kernel flushing lots
> of contiguous data in a single request? Or something else?
>
My understanding is that md keeps a cache of the stripes in ram. Any
writes must be completed to the disk itself, rather than just the stripe
cache, before being reported to the file system as completed, as this
cache is volatile. But the next time you make a small write to a stripe
that is in the cache, it can avoid the reads. Of course, the cache will
also be used for reads.
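As a rough illustration of why a cached stripe saves reads, here is a toy model in Python. It is a simplification under my own assumptions, not md's real logic: a hit means the old data and old parity are already in RAM, a miss costs two disk reads, every small write still costs two disk writes because the cache is volatile, and eviction is plain LRU. The class name ToyStripeCache is hypothetical.

```python
from collections import OrderedDict

class ToyStripeCache:
    """Counts disk I/O for small RAID5 writes with a volatile stripe cache."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.stripes = OrderedDict()  # stripe number -> cached marker, LRU order
        self.disk_reads = 0
        self.disk_writes = 0

    def small_write(self, stripe_no: int) -> None:
        if stripe_no in self.stripes:
            # Hit: old data and parity are in RAM, so no reads are needed.
            self.stripes.move_to_end(stripe_no)
        else:
            # Miss: read the old data block and the old parity block.
            self.disk_reads += 2
            self.stripes[stripe_no] = True
            if len(self.stripes) > self.capacity:
                self.stripes.popitem(last=False)  # evict least recently used
        # Write-through: data and parity go to disk before completion.
        self.disk_writes += 2

cache = ToyStripeCache(capacity=2)
for stripe_no in (0, 0, 1, 0, 2, 1):
    cache.small_write(stripe_no)
print(cache.disk_reads, cache.disk_writes)
```

With no cache at all, the same six writes would need twelve reads; the writes to disk are the same either way.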
The size of this cache is configurable - using a larger stripe cache
will give you a higher hit ratio, and thus faster small writes on
average. But the same ram can be used for other types of caches -
directory entry caches, file caches, etc. The best balance will depend
on your load - for a read-mostly array, ram will probably be better
spent as file cache, while for a write-mostly array the stripe cache is
more important.
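For a feel of the ram cost: as far as I know, the setting is exposed as /sys/block/mdX/md/stripe_cache_size (counted in entries, not bytes), and the kernel's md documentation gives the memory used as page size times number of disks times that value. A small Python sketch under that assumption, where the helper name stripe_cache_bytes and the 8-disk, 4 KiB-page figures are just examples:

```python
import os
from typing import Optional

def stripe_cache_bytes(entries: int, nr_disks: int,
                       page_size: Optional[int] = None) -> int:
    """Estimated RAM used by the stripe cache: entries * disks * page size."""
    if page_size is None:
        page_size = os.sysconf("SC_PAGE_SIZE")  # system page size on Unix
    return entries * nr_disks * page_size

# Example figures for an 8-disk array with 4 KiB pages (assumed values).
for entries in (256, 4096, 32768):
    mib = stripe_cache_bytes(entries, nr_disks=8, page_size=4096) / 2**20
    print(f"stripe_cache_size={entries}: about {mib:.0f} MiB")
```

So a small setting costs very little, but the largest values take a noticeable bite out of the ram that would otherwise be available as file cache.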
My understanding of hardware raid cards is that they have stripe caches,
but these are typically volatile. A non-volatile cache would mean you
can't swap out controllers or disks when the system is switched off, as
some of the data might be in the controller card's cache instead of the
disks.
For high-end systems, your battery backup must not only keep the cache
alive, but also keep your disks running so that the cache can be
flushed to disk when there is a power failure. Then the controller will
be able to report a write as "complete" as soon as it is cached, and
handle the flush to disk in the background.
Less high-end systems would, I believe, handle the cache in the same way
as md raid - the stripe cache in ram would help avoid the reads before
writing to part of a RAID 5 stripe. Typically, this on-board cache will
be a lot smaller than you would have in an md RAID system.