From: Emmanuel Florac <eflorac@intellique.com>
To: Peter Grandi <pg_xf2@xf2.for.sabi.co.UK>,
	Linux fs JFS <jfs-discussion@lists.SourceForge.net>
Cc: Linux RAID <linux-raid@vger.kernel.org>, Linux fs XFS <xfs@oss.sgi.com>
Subject: Re: RAID6 r-m-w, op-journaled fs, SSDs
Date: Sat, 30 Apr 2011 18:02:13 +0200
Message-ID: <20110430180213.6dcfc41c@galadriel2.home>
In-Reply-To: <19900.10868.583555.849181@tree.ty.sabi.co.UK>

On Sat, 30 Apr 2011 16:27:48 +0100, you wrote:

> While I agree with BAARF.com arguments fully, I sometimes have
> to deal with legacy systems with wide RAID6 sets (for example 16
> drives, quite revolting)

Revolting for what? I manage hundreds of such systems, but 99% of them
are used for video storage (typical file sizes range from a few GB to
hundreds of GB).

> Sometimes (but fortunately not that recently) I have had to deal
> with small-file filesystems setup on wide-stripe RAID6 setup

What do you call "wide stripe" exactly? Do you mean a 256K stripe, or a
4MB stripe?
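
Just to make sure we're talking about the same thing, here's the
arithmetic I have in mind (example numbers only, a hypothetical
16-drive RAID6 with 256K chunks, not your actual setup):

# Chunk ("strip") size vs. full stripe width on RAID6 -- example numbers.
n_drives = 16           # hypothetical 16-drive RAID6
n_parity = 2            # RAID6 carries two parity blocks per stripe
chunk_kib = 256         # per-drive chunk size

data_drives = n_drives - n_parity
stripe_kib = chunk_kib * data_drives

print(f"chunk size  : {chunk_kib} KiB per drive")
print(f"full stripe : {stripe_kib} KiB ({stripe_kib / 1024:.1f} MiB of data)")
# "Wide" can mean a big per-drive chunk, many data drives, or both:
# 256 KiB chunks across 14 data drives is already a 3.5 MiB stripe.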

> by
> morons who don't understand the difference between a database
> and a filesystem (and I have strong doubts that RAID6 is
> remotely appropriate to databases).

RAID-6 isn't appropriate for databases, but it works reasonably well if
the workload is almost read-only. And creating hundreds of millions of
files in a filesystem works reasonably well, too.

> So I'd like to figure out how much effort I should invest in
> undoing cases of the above, that is how badly they are likely to
> be and degrade over time (usually very badly).

Well, actually my bet is that it's impossible to say without much more
detail from you on the hardware, the file I/O patterns, and so on.

> 
>   * When reading or writing part of RAID[456] stripe for example
>     smaller than a sector, what is the minimum unit of transfer
>     with Linux MD? The full stripe, the chunk containing the
>     sector, or just the sector containing the bytes to be
>     written or updated (and potentially the parity sectors)? I
>     would expect reads to always read just the sector, but not
>     so sure about writing.
> 
>   * What about popular HW RAID host adapter (e.g. LSI, Adaptec,
>     Areca, 3ware), where is the documentation if any on how they
>     behave in these cases?

I may be wrong, but in my tests both Linux RAID and the 3ware, LSI and
Adaptec controllers (I didn't really test Areca on that point) read the
full stripe most of the time, at least in a single-threaded
environment. With many concurrent threads, however, the behaviour
changes and they seem to work at chunk level.
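
A rough way to see why that matters for small reads (assumed numbers,
not measurements from any of those controllers):

# Read amplification when a full stripe is fetched to serve a small read.
# Assumed example numbers, not measured controller behaviour.
chunk_kib = 256
data_drives = 14                    # 16-drive RAID6, two parity drives
stripe_kib = chunk_kib * data_drives
request_kib = 4                     # small-file / metadata sized read

print(f"full-stripe read : {stripe_kib / request_kib:.0f}x the data requested")
print(f"chunk-level read : {chunk_kib / request_kib:.0f}x the data requested")
# Under many concurrent threads the request stream looks random, so
# dropping to chunk-sized reads wastes far less bandwidth per request.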
 
> Regardless, op-journaled file system designs like JFS and XFS
> write small records (way below a stripe set size, and usually
> way below a chunk size) to the journal when they queue
> operations, even if sometimes depending on design and options
> may "batch" the journal updates (potentially breaking safety
> semantics). Also they do small write when they dequeue the
> operations from the journal to the actual metadata records
> involved.
> 
> How bad can this be when the journal is say internal for a
> filesystem that is held on wide-stride RAID6 set?

Not that bad because typically the journal is small enough to fit
entirely in the controller cache.

> I suspect very
> very bad, with apocalyptic read-modify-write storms, eating IOPS.

Not if you're using write-back cache.
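
A crude model of what I mean (assumptions: a small RAID6 write that
misses the cache does read-modify-write, roughly 3 reads plus 3 writes;
a write absorbed by the write-back cache and flushed later as part of a
full-stripe write is counted as about 1 I/O):

# Very crude model of the RAID6 small-write penalty with and without a
# write-back cache coalescing the journal writes. Assumed numbers only.
def physical_ios(n_small_writes, coalesced_fraction):
    # A cache miss does read-modify-write: read old data + old P + old Q,
    # then write new data + new P + new Q, roughly 6 I/Os per write.
    # A write coalesced into a later full-stripe flush is ~1 I/O here.
    missed = n_small_writes * (1 - coalesced_fraction)
    coalesced = n_small_writes * coalesced_fraction
    return missed * 6 + coalesced * 1

n = 10_000   # journal / metadata updates
print(f"no cache, 0% coalesced    : {physical_ios(n, 0.0):,.0f} I/Os")
print(f"write-back, 90% coalesced : {physical_ios(n, 0.9):,.0f} I/Os")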

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------
