From: David Brown <david.brown@hesbynett.no>
To: linux-raid@vger.kernel.org
Subject: Re: RAID6 r-m-w, op-journaled fs, SSDs
Date: Sun, 01 May 2011 17:24:09 +0200
Message-ID: <ipjtup$47u$1@dough.gmane.org>
In-Reply-To: <20110501082717.5116e575@notabene.brown>

On 01/05/11 00:27, NeilBrown wrote:
> On Sat, 30 Apr 2011 16:27:48 +0100 pg_xf2@xf2.for.sabi.co.UK (Peter Grandi)
> wrote:
>
>> While I agree with BAARF.com arguments fully, I sometimes have
>> to deal with legacy systems with wide RAID6 sets (for example 16
>> drives, quite revolting) which have op-journaled filesystems on
>> them like XFS or JFS (sometimes block-journaled ext[34], but I
>> am not that interested in them for this).
>>
>> Sometimes (but fortunately not that recently) I have had to deal
>> with small-file filesystems set up on wide-stripe RAID6 sets by
>> morons who don't understand the difference between a database
>> and a filesystem (and I have strong doubts that RAID6 is
>> remotely appropriate to databases).
>>
>> So I'd like to figure out how much effort I should invest in
>> undoing cases of the above, that is how badly they are likely to
>> be and degrade over time (usually very badly).
>>
>> First a couple of questions purely about RAID, but indirectly
>> relevant to op-journaled filesystems:
>>
>>    * Can Linux MD do "abbreviated" read-modify-write RAID6
>>      updates like for RAID5? That is where not the whole stripe
>>      is read in, modified and written, but just the block to be
>>      updated and the parity blocks.
>
> No.  (patches welcome).

As far as I understand the raid6 mathematics, it shouldn't be too hard
to do such abbreviated updates, but it could quickly lead to complex
code if you are trying to update more than a couple of blocks at a
time.
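
A minimal sketch of the single-block case, assuming the usual RAID6
construction (P is the XOR of the data blocks, Q is the sum of g^i * D_i
in GF(2^8) with polynomial 0x11d and generator g = 2).  The function names
are hypothetical and this is an illustration of the arithmetic only, not
md code.  Because both parities are linear, only the old and new contents
of the changed block are needed: P' = P ^ (D ^ D') and
Q' = Q ^ g^i * (D ^ D').

```python
# Illustrative only: all names here are made up for this sketch.
# Assumes GF(2^8) with polynomial 0x11d and generator 2.

def gf_mul2(x):
    """Multiply a field element by the generator 2, reducing by 0x11d."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
    return x

def gf_mul(a, b):
    """General GF(2^8) multiplication by shift-and-add."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = gf_mul2(a)
        b >>= 1
    return r

def gf_pow2(i):
    """Return the generator raised to the power i."""
    r = 1
    for _ in range(i):
        r = gf_mul2(r)
    return r

def full_parity(blocks):
    """Compute P and Q from every data block (the full-stripe path)."""
    n = len(blocks[0])
    p = bytearray(n)
    q = bytearray(n)
    for i, blk in enumerate(blocks):
        g = gf_pow2(i)
        for j, byte in enumerate(blk):
            p[j] ^= byte
            q[j] ^= gf_mul(g, byte)
    return bytes(p), bytes(q)

def rmw_update(p, q, index, old, new):
    """Update P and Q from the old and new contents of one data block."""
    g = gf_pow2(index)
    new_p = bytearray(p)
    new_q = bytearray(q)
    for j in range(len(old)):
        delta = old[j] ^ new[j]        # what changed in this byte
        new_p[j] ^= delta              # P' = P ^ delta
        new_q[j] ^= gf_mul(g, delta)   # Q' = Q ^ g^index * delta
    return bytes(new_p), bytes(new_q)
```

For a single block this reads three blocks (old data, P, Q) and writes
three, whatever the stripe width.  With several changed blocks the deltas
simply accumulate into P and Q, but the code then has to decide per stripe
whether this path or reading the untouched data blocks is cheaper.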

>
>>
>>    * When reading or writing part of a RAID[456] stripe, for example
>>      smaller than a sector, what is the minimum unit of transfer
>>      with Linux MD? The full stripe, the chunk containing the
>>      sector, or just the sector containing the bytes to be
>>      written or updated (and potentially the parity sectors)? I
>>      would expect reads to always read just the sector, but not
>>      so sure about writing.
>
> 1 "PAGE" - normally 4K.
>
>
>>
>>    * What about popular HW RAID host adapter (e.g. LSI, Adaptec,
>>      Areca, 3ware), where is the documentation if any on how they
>>      behave in these cases?
>>
>> Regardless, op-journaled file system designs like JFS and XFS
>> write small records (way below a stripe set size, and usually
>> way below a chunk size) to the journal when they queue
>> operations, even if sometimes depending on design and options
>> may "batch" the journal updates (potentially breaking safety
>> semantics). Also they do small writes when they dequeue the
>> operations from the journal to the actual metadata records
>> involved.
>
> The ideal config for a journalled filesystem is to put the journal on a
> separate smaller lower-latency device.  e.g. a small RAID1 pair.
>
> In a previous work place I had good results with:
>    RAID1 pair of small disks with root, swap, journal
>    Large RAID5/6 array with bulk of filesystem.
>
> I also did data journalling as it helps a lot with NFS.
>

I suppose it also makes sense to put the write-intent bitmap for md raid
on such a raid1 pair (typically SSDs).

What would be very nice is a RAM-based SSD with battery backup, rather
than a flash disk.  These sorts of devices exist, but they are usually
vastly expensive because RAM is expensive at disk-like sizes.  I'd
like to see a physically small and cheap RAM-based SSD with 1 or 2 GB;
that would be ideal for file system journals, write-intent bitmaps, etc.

>
>
>>
>> How bad can this be when the journal is say internal for a
>> filesystem that is held on wide-stride RAID6 set? I suspect very
>> very bad, with apocalyptic read-modify-write storms, eating IOPS.
>>
>> I suspect that this happens a lot with SSDs too, where the role
>> of stripe set size is played by the erase block size (often in
>> the hundreds of KBytes, and even more expensive).
>>
>> Where are studies or even just impressions or anecdotes on how
>> bad this is?
>>
>> Are there instrumentation tools in JFS or XFS that may allow me
>> to watch/inspect what is happening with the journal? For Linux
>> MD to see what are the rates of stripe r-m-w cases?
>
> Not that I am aware of.
>
>
> NeilBrown