* Re: RAID6 r-m-w, op-journaled fs, SSDs
2011-04-30 15:27 RAID6 r-m-w, op-journaled fs, SSDs Peter Grandi
@ 2011-04-30 16:02 ` Emmanuel Florac
2011-04-30 19:54 ` Stan Hoeppner
2011-04-30 22:27 ` NeilBrown
2011-05-01 9:36 ` Dave Chinner
2 siblings, 1 reply; 11+ messages in thread
From: Emmanuel Florac @ 2011-04-30 16:02 UTC (permalink / raw)
To: Peter Grandi, Linux fs JFS; +Cc: Linux RAID, Linux fs XFS
On Sat, 30 Apr 2011 16:27:48 +0100 you wrote:
> While I agree with BAARF.com arguments fully, I sometimes have
> to deal with legacy systems with wide RAID6 sets (for example 16
> drives, quite revolting)
Revolting in what way? I manage hundreds of such systems, but 99% of
them are used for video storage (typical file sizes range from a few
GB to hundreds of GB).
> Sometimes (but fortunately not that recently) I have had to deal
> with small-file filesystems setup on wide-stripe RAID6 setup
What do you call "wide stripe" exactly? Do you mean a 256K stripe, a
4MB stripe?
> by
> morons who don't understand the difference between a database
> and a filesystem (and I have strong doubts that RAID6 is
> remotely appropriate to databases).
RAID-6 isn't appropriate for databases, but it works reasonably well
if the workload is almost exclusively reads. And creating hundreds of
millions of files in a filesystem works reasonably well, too.
> So I'd like to figure out how much effort I should invest in
> undoing cases of the above, that is how badly they are likely to
> be and degrade over time (usually very badly).
Well, actually my bet is that it's impossible to say without much
more detail on the hardware, the file I/O patterns, and so on.
>
> * When reading or writing part of RAID[456] stripe for example
> smaller than a sector, what is the minimum unit of transfer
> with Linux MD? The full stripe, the chunk containing the
> sector, or just the sector containing the bytes to be
> written or updated (and potentially the parity sectors)? I
> would expect reads to always read just the sector, but not
> so sure about writing.
>
> * What about popular HW RAID host adapter (e.g. LSI, Adaptec,
> Areca, 3ware), where is the documentation if any on how they
> behave in these cases?
I may be wrong, but in my tests both Linux RAID and 3ware, LSI and
Adaptec controllers (I didn't really test Areca on that point) would
read the full stripe most of the time. At least, they'll read the full
stripe in a single-threaded environment. However, when using many
concurrent threads the behaviour changes and they seem to work at
chunk level.
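One way to check this on a given array, rather than guessing, is to
issue small direct reads at random offsets and compare the bytes the
application requested with what the device actually transferred
(iostat or /proc/diskstats). Below is a rough, single-threaded sketch;
the device path, 4 KiB read size and read count are illustrative
assumptions, not values from this thread, and to probe the concurrent
case you could simply run several instances in parallel:

/* Issue random 4 KiB O_DIRECT reads against a block device so that the
 * device-level read volume (iostat, /proc/diskstats) can be compared
 * with the bytes actually requested here. A large ratio suggests the
 * controller or MD layer reads whole chunks or stripes per request.
 * Build: cc -O2 -o randread randread.c
 * Run:   ./randread /dev/md0 10000     (read-only, but use a test box)
 */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define RD_SIZE 4096

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <block-device> <reads>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    unsigned long long dev_bytes = 0;
    if (ioctl(fd, BLKGETSIZE64, &dev_bytes) < 0) { perror("ioctl"); return 1; }

    void *buf;
    if (posix_memalign(&buf, RD_SIZE, RD_SIZE)) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    long reads = atol(argv[2]);
    unsigned long long nblocks = dev_bytes / RD_SIZE;
    srand48(12345);                      /* fixed seed: repeatable runs */
    for (long i = 0; i < reads; i++) {
        off_t off = (off_t)(drand48() * nblocks) * RD_SIZE;
        if (pread(fd, buf, RD_SIZE, off) != RD_SIZE) { perror("pread"); return 1; }
    }
    printf("requested %lld bytes in %ld reads\n",
           (long long)reads * RD_SIZE, reads);
    close(fd);
    return 0;
}

If the device-level read volume comes out many times larger than the
requested volume, the controller (or MD) is reading whole chunks or
stripes behind your back.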
> Regardless, op-journaled file system designs like JFS and XFS
> write small records (way below a stripe set size, and usually
> way below a chunk size) to the journal when they queue
> operations, even if sometimes depending on design and options
> may "batch" the journal updates (potentially breaking safety
> semantics). Also they do small write when they dequeue the
> operations from the journal to the actual metadata records
> involved.
>
> How bad can this be when the journal is say internal for a
> filesystem that is held on wide-stride RAID6 set?
Not that bad because typically the journal is small enough to fit
entirely in the controller cache.
> I suspect very
> very bad, with apocalyptic read-modify-write storms, eating IOPS.
Not if you're using write-back cache.
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: RAID6 r-m-w, op-journaled fs, SSDs
2011-04-30 16:02 ` Emmanuel Florac
@ 2011-04-30 19:54 ` Stan Hoeppner
2011-04-30 21:50 ` Michael Monnerie
2011-05-01 9:11 ` Emmanuel Florac
0 siblings, 2 replies; 11+ messages in thread
From: Stan Hoeppner @ 2011-04-30 19:54 UTC (permalink / raw)
To: xfs
On 4/30/2011 11:02 AM, Emmanuel Florac wrote:
> On Sat, 30 Apr 2011 16:27:48 +0100 you wrote:
>> How bad can this be when the journal is say internal for a
>> filesystem that is held on wide-stride RAID6 set?
>
> Not that bad because typically the journal is small enough to fit
> entirely in the controller cache.
>
>> I suspect very
>> very bad, with apocalyptic read-modify-write storms, eating IOPS.
>
> Not if you're using write-back cache.
Just having a write-back cache isn't magic by itself. The cache
management algorithm and its configuration are often as important as,
if not more important than, the total cache size on the RAID HBA or
SAN controller.
Poor cache management, I'd guess, is one reason why you see Areca RAID
cards with 1-4GB cache DRAM whereas competing cards w/ similar
price/performance/features from LSI, Adaptec, and others sport 512MB.
--
Stan
* Re: RAID6 r-m-w, op-journaled fs, SSDs
2011-04-30 19:54 ` Stan Hoeppner
@ 2011-04-30 21:50 ` Michael Monnerie
2011-05-01 3:17 ` Stan Hoeppner
2011-05-01 9:14 ` Emmanuel Florac
2011-05-01 9:11 ` Emmanuel Florac
1 sibling, 2 replies; 11+ messages in thread
From: Michael Monnerie @ 2011-04-30 21:50 UTC (permalink / raw)
To: xfs; +Cc: Stan Hoeppner
On Saturday, 30 April 2011 Stan Hoeppner wrote:
> Poor cache management, I'd guess, is one reason why you see Areca
> RAID cards with 1-4GB cache DRAM whereas competing cards w/ similar
> price/performance/features from LSI, Adaptec, and others sport
> 512MB.
On one server (XenServer, virtualized with ~14 VMs running Linux) which
suffered from slow I/O on RAID-6 during busy periods, I upgraded the
cache from 1GB to 4GB on an Areca ARC-1260 controller (somewhat
outdated now), and couldn't see any advantage. Maybe it would have been
measurable, but the damn thing was still pretty slow, so using more hard
disks is still a better option than upgrading the cache.
Just for documentation, in case someone sees slow I/O on Areca: more
spindles rock. That server had 8x 10krpm WD Raptor 150GB drives at the time.
--
with kind regards,
Michael Monnerie, Ing. BSc
it-management Internet Services: Protéger
http://proteger.at [pronounced: Prot-e-schee]
Tel: +43 660 / 415 6531
// ****** Radio interview on the subject of spam ******
// http://www.it-podcast.at/archiv.html#podcast-100716
//
// House for sale: http://zmi.at/langegg/
* Re: RAID6 r-m-w, op-journaled fs, SSDs
2011-04-30 21:50 ` Michael Monnerie
@ 2011-05-01 3:17 ` Stan Hoeppner
2011-05-01 9:14 ` Emmanuel Florac
1 sibling, 0 replies; 11+ messages in thread
From: Stan Hoeppner @ 2011-05-01 3:17 UTC (permalink / raw)
To: xfs
On 4/30/2011 4:50 PM, Michael Monnerie wrote:
> On Saturday, 30 April 2011 Stan Hoeppner wrote:
>> Poor cache management, I'd guess, is one reason why you see Areca
>> RAID cards with 1-4GB cache DRAM whereas competing cards w/ similar
>> price/performance/features from LSI, Adaptec, and others sport
>> 512MB.
>
> On one server (XenServer, virtualized with ~14 VMs running Linux) which
> suffered from slow I/O on RAID-6 during busy periods, I upgraded the
> cache from 1GB to 4GB on an Areca ARC-1260 controller (somewhat
> outdated now), and couldn't see any advantage. Maybe it would have been
> measurable, but the damn thing was still pretty slow, so using more hard
> disks is still a better option than upgrading the cache.
>
> Just for documentation, in case someone sees slow I/O on Areca: more
> spindles rock. That server had 8x 10krpm WD Raptor 150GB drives at the time.
Similar to the case with CPUs, more cache can only take you so far. The
benefit resulting from the cache size, locality (on/off chip), and
algorithm is often very workload dependent, as is the case with RAID
controller cache.
Adding controller cache can benefit some workloads, depending on the
controller make/model, but I agree with you that adding spindles, or
swapping to faster spindles (say 7.2 to 15k, or SSD), will typically
benefit all workloads. However, given that DIMMs are so cheap compared
to hot swap disks, maxing out controller cache on models that have DIMM
slots is an inexpensive first step to take when faced with an IO bottleneck.
Larger controller cache seemed to have more positive impact on SCSI RAID
controllers of the mid/late 90s than on modern controllers. The
difference between 8MB and 64MB was substantial with many workloads back
then. On many modern SAS/SATA controllers the difference between 512MB
and 1GB isn't nearly as profound, if there is any at all. The shared
SCSI bus dictated sequential access to all 15 drives on the bus, which
would tend to explain why more cache made a big difference: it masked
the latencies. SAS/SATA allows concurrent access to all drives (assuming
no expanders) without the SCSI bus latencies, which may explain why a
larger RAID cache on today's controllers doesn't yield the benefits it
did on previous-generation SCSI RAID cards.
--
Stan
* Re: RAID6 r-m-w, op-journaled fs, SSDs
2011-04-30 21:50 ` Michael Monnerie
2011-05-01 3:17 ` Stan Hoeppner
@ 2011-05-01 9:14 ` Emmanuel Florac
1 sibling, 0 replies; 11+ messages in thread
From: Emmanuel Florac @ 2011-05-01 9:14 UTC (permalink / raw)
To: Michael Monnerie; +Cc: Stan Hoeppner, xfs
On Sat, 30 Apr 2011 23:50:31 +0200 you wrote:
> Just for documentation if someone sees slow I/O on Areca. More
> spindles rock. That server had 8x 10krpm WD Raptor 150G drives by the
> time.
As a side note, VMs typically create lots of small random I/Os, and
perform quite poorly on RAID-6 arrays.
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: RAID6 r-m-w, op-journaled fs, SSDs
2011-04-30 19:54 ` Stan Hoeppner
2011-04-30 21:50 ` Michael Monnerie
@ 2011-05-01 9:11 ` Emmanuel Florac
1 sibling, 0 replies; 11+ messages in thread
From: Emmanuel Florac @ 2011-05-01 9:11 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: xfs
On Sat, 30 Apr 2011 14:54:02 -0500 you wrote:
> Just having a write-back cache isn't magic by itself. The cache
> management algorithm and its configuration are often as important as,
> if not more important than, the total cache size on the RAID HBA or
> SAN controller.
>
> Poor cache management, I'd guess, is one reason why you see Areca
> RAID cards with 1-4GB cache DRAM whereas competing cards w/ similar
> price/performance/features from LSI, Adaptec, and others sport 512MB.
Yes, probably. To give some meat to the argument:
- XFS mounted nobarrier on an 8-drive RAID-6 array with write-back cache:
  30,000 journal (file creation/deletion) operations/s
- XFS with barriers on the same RAID: 7,000 journal operations/s
- XFS nobarrier with write-through cache: 700 journal operations/s
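For context, figures like these typically come from a tight
create/delete loop on an otherwise idle filesystem (tools like fs_mark
do essentially this). Here is a minimal sketch of such a
micro-benchmark; the mount point and iteration count are arbitrary
illustration values, not anything from this thread:

/* Minimal metadata micro-benchmark: time N create+unlink pairs in a
 * directory. Each pair is two journalled metadata operations, so the
 * printed ops/s figure is a rough proxy for journal operations/s.
 * Build: cc -O2 -o metabench metabench.c
 * Run:   ./metabench /mnt/test 100000
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <dir> <iterations>\n", argv[0]);
        return 1;
    }
    long n = atol(argv[2]);
    char path[4096];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < n; i++) {
        snprintf(path, sizeof(path), "%s/bench-%ld", argv[1], i);
        int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd < 0) { perror("open"); return 1; }
        close(fd);
        if (unlink(path) < 0) { perror("unlink"); return 1; }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.0f ops/s\n", 2.0 * n / secs);   /* create + unlink per loop */
    return 0;
}

Running it against the same array mounted with barrier/nobarrier and
with the controller cache in write-back or write-through mode gives a
comparison of the kind quoted above.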
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: RAID6 r-m-w, op-journaled fs, SSDs
2011-04-30 15:27 RAID6 r-m-w, op-journaled fs, SSDs Peter Grandi
2011-04-30 16:02 ` Emmanuel Florac
@ 2011-04-30 22:27 ` NeilBrown
2011-05-01 15:31 ` Peter Grandi
2011-05-01 9:36 ` Dave Chinner
2 siblings, 1 reply; 11+ messages in thread
From: NeilBrown @ 2011-04-30 22:27 UTC (permalink / raw)
To: Peter Grandi; +Cc: Linux RAID, Linux fs JFS, Linux fs XFS
On Sat, 30 Apr 2011 16:27:48 +0100 pg_xf2@xf2.for.sabi.co.UK (Peter Grandi)
wrote:
> While I agree with BAARF.com arguments fully, I sometimes have
> to deal with legacy systems with wide RAID6 sets (for example 16
> drives, quite revolting) which have op-journaled filesystems on
> them like XFS or JFS (sometimes block-journaled ext[34], but I
> am not that interested in them for this).
>
> Sometimes (but fortunately not that recently) I have had to deal
> with small-file filesystems setup on wide-stripe RAID6 setup by
> morons who don't understand the difference between a database
> and a filesystem (and I have strong doubts that RAID6 is
> remotely appropriate to databases).
>
> So I'd like to figure out how much effort I should invest in
> undoing cases of the above, that is how badly they are likely to
> be and degrade over time (usually very badly).
>
> First a couple of question purely about RAID, but indirectly
> relevant to op-journaled filesystems:
>
> * Can Linux MD do "abbreviated" read-modify-write RAID6
> updates like for RAID5? That is where not the whole stripe
> is read in, modified and written, but just the block to be
> updated and the parity blocks.
No. (patches welcome).
>
> * When reading or writing part of RAID[456] stripe for example
> smaller than a sector, what is the minimum unit of transfer
> with Linux MD? The full stripe, the chunk containing the
> sector, or just the sector containing the bytes to be
> written or updated (and potentially the parity sectors)? I
> would expect reads to always read just the sector, but not
> so sure about writing.
1 "PAGE" - normally 4K.
>
> * What about popular HW RAID host adapter (e.g. LSI, Adaptec,
> Areca, 3ware), where is the documentation if any on how they
> behave in these cases?
>
> Regardless, op-journaled file system designs like JFS and XFS
> write small records (way below a stripe set size, and usually
> way below a chunk size) to the journal when they queue
> operations, even if sometimes depending on design and options
> may "batch" the journal updates (potentially breaking safety
> semantics). Also they do small write when they dequeue the
> operations from the journal to the actual metadata records
> involved.
The ideal config for a journalled filesystem is to put the journal on a
separate, smaller, lower-latency device, e.g. a small RAID1 pair.
In a previous workplace I had good results with:
RAID1 pair of small disks with root, swap, journal
Large RAID5/6 array with the bulk of the filesystem.
I also did data journalling as it helps a lot with NFS.
>
> How bad can this be when the journal is say internal for a
> filesystem that is held on wide-stride RAID6 set? I suspect very
> very bad, with apocalyptic read-modify-write storms, eating IOPS.
>
> I suspect that this happens a lot with SSDs too, where the role
> of stripe set size is played by the erase block size (often in
> the hundreds of KBytes, and even more expensive).
>
> Where are studies or even just impressions or anecdotes on how
> bad this is?
>
> Are there instrumentation tools in JFS or XFS that may allow me
> to watch/inspect what is happening with the journal? For Linux
> MD to see what are the rates of stripe r-m-w cases?
Not that I am aware of.
NeilBrown
* Re: RAID6 r-m-w, op-journaled fs, SSDs
2011-04-30 22:27 ` NeilBrown
@ 2011-05-01 15:31 ` Peter Grandi
2011-05-01 18:32 ` David Brown
0 siblings, 1 reply; 11+ messages in thread
From: Peter Grandi @ 2011-05-01 15:31 UTC (permalink / raw)
To: Linux RAID, Linux fs JFS, Linux fs XFS
[ ... ]
>> * Can Linux MD do "abbreviated" read-modify-write RAID6
>> updates like for RAID5? [ ... ]
> No. (patches welcome).
Ahhhm, but let me dig a bit deeper, even if it may be implied in
the answer: would it be *possible*?
That is, is the double parity scheme used in MD such that it is
possible to "subtract" the old content of a page and "add" the
new content of that page to both parity pages?
[ ... ]
> The ideal config for a journalled filesystem is for put the
> journal on a separate smaller lower-latency device. e.g. a
> small RAID1 pair.
> In a previous work place I had good results with:
> RAID1 pair of small disks with root, swap, journal
> Large RAID5/6 array with bulk of filesystem.
Sounds reasonable, except that I am allergic to RAID5 (except in
two cases) and RAID6 (in general) :-), but it would work equally
well, I guess, with RAID10 and its delightful MD implementation.
[ ... ]
Thanks for the information!
* Re: RAID6 r-m-w, op-journaled fs, SSDs
2011-05-01 15:31 ` Peter Grandi
@ 2011-05-01 18:32 ` David Brown
0 siblings, 0 replies; 11+ messages in thread
From: David Brown @ 2011-05-01 18:32 UTC (permalink / raw)
To: linux-xfs; +Cc: linux-raid
On 01/05/11 17:31, Peter Grandi wrote:
> [ ... ]
>
>>> * Can Linux MD do "abbreviated" read-modify-write RAID6
>>> updates like for RAID5? [ ... ]
>
>> No. (patches welcome).
>
> Ahhhm, but let me dig a bit deeper, even if it may be implied in
> the answer: would it be *possible*?
>
> That is, is the double parity scheme used in MD such that it is
> possible to "subtract" the old content of a page and "add" the
> new content of that page to both parity pages?
>
If I've understood the maths correctly, then yes it would be possible.
But it would involve more calculations, and it is difficult to see where
the best balance lies between cpu demands and IO demands. In general,
calculating the Q parity block for raid6 is processor-intensive -
there's a fair amount of optimisation done in the normal calculations to
keep it reasonable.
Basically, the first parity P is a simple calculation:
P = D_0 + D_1 + ... + D_(n-1)
But Q is more difficult:
Q = D_0 + g.D_1 + g².D_2 + ... + g^(n-1).D_(n-1)
where "plus" is xor, "times" is multiplication over the GF(2^8) field,
and g is a generator of that field.
If you want to replace D_i, then you can calculate:
P(new) = P(old) + D_i(old) + D_i(new)
Q(new) = Q(old) + g^i.(D_i(old) + D_i(new))
This means multiplying by g^i for whichever block i is being replaced.
The generator and multiply operation are picked to make it relatively
fast and easy to multiply by g, especially if you've got a processor
that has vector operations (as most powerful cpus do). This means that
the original Q calculation is fairly efficient. But to do general
multiplications by g^i is more effort, and will typically involve
cache-killing lookup tables or multiple steps.
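To make that concrete, here is a small illustrative sketch (not the md
implementation, just the arithmetic above spelled out) of how an
abbreviated RAID6 update of one data block could recompute P and Q from
just the old data, the new data and the old parity, using GF(2^8) with
the 0x11d reduction polynomial commonly used for RAID6 and g = 2:

/* Abbreviated RAID6 read-modify-write for a single data block D_i:
 * given the old data, new data, old P and old Q, compute new P and Q
 * without touching the other data blocks. Illustration only.
 * GF(2^8), reduction polynomial 0x11d, generator g = 2. */
#include <stddef.h>
#include <stdint.h>

/* multiply by g (= 2) in GF(2^8) */
static uint8_t gf_mul2(uint8_t x)
{
    return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1d : 0));
}

/* multiply by g^i by applying gf_mul2 i times (a real implementation
 * would use lookup tables or vector ops, as noted above) */
static uint8_t gf_mulg_pow(uint8_t x, unsigned i)
{
    while (i--)
        x = gf_mul2(x);
    return x;
}

/* Update P and Q in place for a change of data block i (the index of
 * the data block within the stripe), block length len bytes. */
void raid6_rmw_update(unsigned i, size_t len,
                      const uint8_t *d_old, const uint8_t *d_new,
                      uint8_t *p, uint8_t *q)
{
    for (size_t b = 0; b < len; b++) {
        uint8_t delta = d_old[b] ^ d_new[b];   /* D_i(old) + D_i(new)       */
        p[b] ^= delta;                         /* P(new) = P(old) + delta   */
        q[b] ^= gf_mulg_pow(delta, i);         /* Q(new) = Q(old) + g^i.delta */
    }
}

The per-byte cost is one g^i multiplication, which is exactly the part
that benefits from precomputed tables or SIMD in a real implementation.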
It is probably reasonable to say that when md raid first implemented
raid6, it made little sense to do these abbreviated parity calculations.
But as processors have got faster (and wider, with more cores) while
disk throughput has made slower progress, the balance may now be
different. So it's probably both possible and practical to do these
calculations. All it needs is someone to spend the time writing the
code - and lots of people willing to test it.
* Re: RAID6 r-m-w, op-journaled fs, SSDs
2011-04-30 15:27 RAID6 r-m-w, op-journaled fs, SSDs Peter Grandi
2011-04-30 16:02 ` Emmanuel Florac
2011-04-30 22:27 ` NeilBrown
@ 2011-05-01 9:36 ` Dave Chinner
2 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2011-05-01 9:36 UTC (permalink / raw)
To: Peter Grandi; +Cc: Linux RAID, Linux fs JFS, Linux fs XFS
On Sat, Apr 30, 2011 at 04:27:48PM +0100, Peter Grandi wrote:
> Regardless, op-journaled file system designs like JFS and XFS
> write small records (way below a stripe set size, and usually
> way below a chunk size) to the journal when they queue
> operations,
XFS will write log-stripe-unit sized records to disk. If the log
buffers are not full, it pads them. Supported log-sunit sizes are up
to 256k.
> even if sometimes depending on design and options
> may "batch" the journal updates (potentially breaking safety
> semantics). Also they do small write when they dequeue the
> operations from the journal to the actual metadata records
> involved.
>
> How bad can this be when the journal is say internal for a
> filesystem that is held on wide-stride RAID6 set? I suspect very
> very bad, with apocalyptic read-modify-write storms, eating IOPS.
Not bad at all, because the journal writes are sequential, and XFS
can have multiple log IOs in progress at once (up to 8 x 256k =
2MB). So in general while metadata operations are in progress, XFS
will fill full stripes with log IO and you won't get problems with
RMW.
> Where are studies or even just impressions or anecdotes on how
> bad this is?
Just buy decent RAID hardware with a BBWC and journal IO does not
hurt at all.
> Are there instrumentation tools in JFS or XFS that may allow me
> to watch/inspect what is happening with the journal? For Linux
> MD to see what are the rates of stripe r-m-w cases?
XFS has plenty of event tracing, including all the transaction
reservation and commit accounting in it. And if you know what you
are looking for, you can see all the log IO and transaction
completion processing in the event traces, too.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com