* Verify filesystem is aligned to stripes
@ 2010-11-24 18:39 Spelic
2010-11-25 5:46 ` Dave Chinner
0 siblings, 1 reply; 11+ messages in thread
From: Spelic @ 2010-11-24 18:39 UTC (permalink / raw)
To: xfs
Hi there,
I thought there was a way to empirically check that the filesystem is
correctly aligned to RAID stripes, but my attempts fail.
I don't mean by looking at sunit and swidth from xfs_info, because that
would not detect if there is some LVM offset problem.
I am particularly interested for parity RAIDs in MD.
I was thinking of "iostat -x 1": if writes are aligned I shouldn't see
any reads from the drives in a parity RAID...
unfortunately this does not work:
- a dd streaming write test shows almost no reads even when I mount with
"noalign", given a sufficiently large stripe_cache_size such as 1024. If it
is smaller, there are always reads, even if XFS is aligned.
- a kernel untar shows lots of reads at any stripe_cache_size even
though I'm pretty sure I aligned the stripes correctly on my 1024k x 15 data
disks and the .tar.bz2 file was in cache. I tried both XFS stripe
autodetection in 2.6.37-rc2 and specifying su and sw values by hand,
which turned out to be the same; I was without LVM so I'm pretty sure
alignment was correct. Why are there still lots of reads in this case?
So I'm pretty clueless. Does anyone have good suggestions?
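For reference, the check I have in mind boils down to this arithmetic (a hypothetical sketch, not an existing tool; the data-area start offset would come from the partition table or e.g. `pvs -o pe_start`):

```python
# Check whether a data-area start offset is aligned to the RAID geometry.
# chunk_kib: the MD chunk size (stripe unit); data_disks: members minus
# parity; offset_bytes: where the filesystem's data actually starts.

def alignment_report(offset_bytes, chunk_kib, data_disks):
    chunk = chunk_kib * 1024
    stripe = chunk * data_disks          # full stripe width in bytes
    return {
        "chunk_aligned": offset_bytes % chunk == 0,
        "stripe_aligned": offset_bytes % stripe == 0,
    }

# A filesystem starting 1 MiB into a 1024k-chunk, 15-data-disk array is
# chunk aligned but not stripe aligned:
print(alignment_report(1024 * 1024, 1024, 15))
```

If both remainders are zero, every su/sw boundary the filesystem computes lines up with a physical stripe boundary.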
Thank you
PS, OT: do you confirm it's not a good idea to have agsize be a multiple of
the stripe size, as mkfs warns against? Today I offset it by +1
stripe unit (chunk) so that every AG begins on a different drive, but
performance didn't improve noticeably. Wouldn't that cause more
unfilled stripes when writing?
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Verify filesystem is aligned to stripes
2010-11-24 18:39 Spelic
@ 2010-11-25 5:46 ` Dave Chinner
2010-11-25 7:00 ` Stan Hoeppner
0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2010-11-25 5:46 UTC (permalink / raw)
To: Spelic; +Cc: xfs
On Wed, Nov 24, 2010 at 07:39:56PM +0100, Spelic wrote:
> Hi there,
> I thought there was a way to empirically check that the filesystem
> is correctly aligned to RAID stripes, but my attempts fail.
> I don't mean by looking at sunit and swidth from xfs_info, because
> that would not detect if there is some LVM offset problem.
>
> I am particularly interested for parity RAIDs in MD.
>
> I was thinking at "iostat -x 1": if writes are aligned I shouldn't
> see any reads from the drives in a parity RAID...
>
> unfortunately this does not work:
> - a dd streaming write test has almost no reads even when I mount
> with "noalign", with sufficiently large stripe_cache_size such as
> 1024. If it is smaller, always reads, even if xfs is aligned.
IO may not be aligned, though allocation usually is. With a large
stripe cache, the MD device waits long enough for sequential
unaligned IO to fill a full stripe width and hence never needs to
read to calculate parity.
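As a toy model of that decision (illustrative only; it ignores the stripe cache's ability to merge later writes into the same stripe):

```python
# A write avoids the read-modify-write (RMW) path only if it covers
# whole stripes: it starts on a stripe boundary and its length is a
# multiple of the full stripe width (chunk size x data disks).

def needs_rmw(offset, length, chunk_kib=1024, data_disks=15):
    stripe = chunk_kib * 1024 * data_disks
    return offset % stripe != 0 or length % stripe != 0

STRIPE = 1024 * 1024 * 15        # 15 MiB full stripe, as in this thread
print(needs_rmw(0, STRIPE))      # full stripe write: no reads needed
print(needs_rmw(0, 4096))        # small write: parity must be rebuilt
```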
> - a kernel untar will show lots of reads at any stripe_cache_size
> even if I'm pretty sure I aligned the stripes correctly on my 1024k
> x 15 data disks and the .tar.bz2 file was in cache. I tried with
> both xfs stripes autodetection in 2.6.37-rc2 and by specifying su
> and sw values by hand, which turned out to be the same; I was
> without LVM so I'm pretty sure alignment was correct. Why are there
> still lots of reads in this case?
Because writes for workloads like this are never full stripe writes.
Hence reads must be done to pull in the rest of the stripe before the
new parity can be calculated. This RMW cycle for small IOs has
always been the pain point for stripe based parity protection. If
you are doing lots of small IOs, RAID1 is your friend.
> PS, OT: do you confirm it's not a good idea to have agsize multiple
> of stripe size like the mkfs warns you against? Today I offsetted it
> by +1 stripe unit (chunk) so that every AG begins on a different
> drive but performances didn't improve noticeably.
Depends on the workload and a lot of other factors. In general,
putting all the AG headers on the same spindle/lun results in that
spindle/lun becoming a hotspot, especially when you have a
filesystem with a few hundred AGs in it...
> Wouldn't that
> cause more unfilled stripes when writing?
Not for sequential IO (for the above reason), and for small IOs it
will make zero difference.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Verify filesystem is aligned to stripes
2010-11-25 5:46 ` Dave Chinner
@ 2010-11-25 7:00 ` Stan Hoeppner
2010-11-25 10:15 ` Dave Chinner
0 siblings, 1 reply; 11+ messages in thread
From: Stan Hoeppner @ 2010-11-25 7:00 UTC (permalink / raw)
To: xfs
Dave Chinner put forth on 11/24/2010 11:46 PM:
> Because writes for workloads like this are never full stripe writes.
> Hence reads must be done to pull in the rest of the stripe before the
> new parity can be calculated. This RMW cycle for small IOs has
> always been the pain point for stripe based parity protection. If
> you are doing lots of small IOs, RAID1 is your friend.
Do you really mean RAID1 here Dave, or RAID10? If RAID1, please
elaborate a bit. RAID1 traditionally has equal read performance to a
single device, and half the write performance of a single device.
--
Stan
* Re: Verify filesystem is aligned to stripes
2010-11-25 7:00 ` Stan Hoeppner
@ 2010-11-25 10:15 ` Dave Chinner
2010-11-25 22:57 ` Stan Hoeppner
0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2010-11-25 10:15 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: xfs
On Thu, Nov 25, 2010 at 01:00:37AM -0600, Stan Hoeppner wrote:
> Dave Chinner put forth on 11/24/2010 11:46 PM:
>
> > Because writes for workloads like this are never full stripe writes.
> > Hence reads must be done to pull in the rest of the stripe before the
> > new parity can be calculated. This RMW cycle for small IOs has
> > always been the pain point for stripe based parity protection. If
> > you are doing lots of small IOs, RAID1 is your friend.
>
> Do you really mean RAID1 here Dave, or RAID10? If RAID1, please
> elaborate a bit.
RAID10 is just a convenient way of saying "striped mirrors" or
"mirrored stripes". Fundamentally they are still using RAID1 for
redundancy - a mirror of two devices. A device could be a single
drive or a stripe of drives.
> RAID1 traditionally has equal read performance to a
> single device, and half the write performance of a single device.
A good RAID1 implementation typically has the read performance of
two devices (i.e. it can read from both legs simultaneously) and the
write performance of a single device.
Parity based RAID is only fast for large write IOs or small IOs that
are close enough together that a stripe cache can coalesce them into
large writes. If this can't be achieved, parity based RAID will be
no faster than a _single drive_ for writes because all drives will
be involved in RMW cycles. Indeed, I've seen RAID5 luns be saturated
at only 50 iops because every IO required a RMW cycle, while an
equivalent number of drives using RAID1 of RAID0 stripes did 1,000
iops...
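The arithmetic behind those numbers is roughly this (assumed figures: ~100 random IOPS per spindle and 16 drives; a worst-case model where an RMW cycle touches every drive in the stripe):

```python
# Worst-case small-write model: in an RMW cycle every drive in the
# stripe does a read plus a write per logical IO, so the whole array
# delivers about half of ONE drive's IOPS. RAID10 instead costs two
# writes (one per mirror leg), spread over independent pairs.

def raid5_rmw_iops(per_drive_iops):
    return per_drive_iops // 2             # array-wide, not per drive

def raid10_iops(drives, per_drive_iops):
    return (drives // 2) * per_drive_iops  # n/2 pairs work in parallel

print(raid5_rmw_iops(100))       # ~50, like the saturated RAID5 lun
print(raid10_iops(16, 100))      # ~800 from the same number of spindles
```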
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Verify filesystem is aligned to stripes
2010-11-25 10:15 ` Dave Chinner
@ 2010-11-25 22:57 ` Stan Hoeppner
2010-11-26 8:16 ` Emmanuel Florac
0 siblings, 1 reply; 11+ messages in thread
From: Stan Hoeppner @ 2010-11-25 22:57 UTC (permalink / raw)
To: xfs
Dave Chinner put forth on 11/25/2010 4:15 AM:
> On Thu, Nov 25, 2010 at 01:00:37AM -0600, Stan Hoeppner wrote:
>> RAID1 traditionally has equal read performance to a
>> single device, and half the write performance of a single device.
>
> A good RAID1 implementation typically has the read performance of
> two devices (i.e. it can read from both legs simultaneously) and the
> write performance of a single device.
I've done no formal testing myself, but the last article I read that
tested md RAID1 performance showed marginally faster read performance
for a two disk mirror. IIRC, one test of 10 showed a 50% improvement.
The rest showed less than 10% improvement, and some showed lower
performance than a single drive. The performance would probably still
be greater than the parity scenario you described with heavy RMW ops.
> Parity based RAID is only fast for large write IOs or small IOs that
> are close enough together that a stripe cache can coalesce them into
> large writes. If this can't be achieved, parity based RAID will be
> no faster than a _single drive_ for writes because all drives will
> be involved in RMW cycles. Indeed, I've seen RAID5 luns be saturated
> at only 50 iops because every IO required a RMW cycle, while an
> equivalent number of drives using RAID1 of RAID0 stripes did 1,000
> iops...
This point brings up a question I've had for some time for which I've
never found a thorough technical answer (maybe for lack of looking hard
enough). And I'm painfully showing my lack of knowledge of how striping
actually works, so please don't beat me up too much here. :)
Let's use an IMAP mail server in our example, configured to use maildir
storage format. Most email messages are less than 4KB in size, and many
are less than 512B--not even a full sector. Thus, the real size of each
maildir file is going to be less than 4KB or 512B.
Let's say our array, either software or hardware based, contains
14x300GB SAS drives in RAID10. Let's say we've created the array with a
(7x32KB) 224KB stripe size (though most hardware controllers would
probably force us to choose between 128 or 256).
Looking at the stripe size, which is equal to 64 sectors per array
member drive (448 sectors total), how exactly is a sub 4KB mail file (8
sectors) going to be split up into equal chunks across a 224KB RAID
stripe? Does 220KB of the stripe merely get wasted? Will XFS pack this
tiny file into the same extent with other small files, and then the
extent gets written into the 128KB stripe?
So, for an array+filesystem that is going to overwhelmingly be storing
lots of tiny files (mail), what array stripe size should one use, and
what XFS parameters should the filesystem be created and mounted with to
yield maximum random IOPs and minimum latency? Obviously these
parameters may be different depending on RAID level chosen, so let's
stick with this 14 disk RAID10 for our discussion.
--
Stan
* Re: Verify filesystem is aligned to stripes
@ 2010-11-26 2:43 Richard Scobie
0 siblings, 0 replies; 11+ messages in thread
From: Richard Scobie @ 2010-11-26 2:43 UTC (permalink / raw)
To: xfs
Stan Hoeppner wrote:
> I've done no formal testing myself, but the last article I read that
> tested md RAID1 performance showed marginally faster read performance
> for a two disk mirror.
All my testing of md RAID1 has shown write performance of one device and
single threaded read performance of one device. A second, concurrent
read will be directed to the second device though.
Regards,
Richard
* Re: Verify filesystem is aligned to stripes
2010-11-25 22:57 ` Stan Hoeppner
@ 2010-11-26 8:16 ` Emmanuel Florac
2010-11-26 12:22 ` Dave Chinner
0 siblings, 1 reply; 11+ messages in thread
From: Emmanuel Florac @ 2010-11-26 8:16 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: xfs
On Thu, 25 Nov 2010 16:57:00 -0600, you wrote:
> Looking at the stripe size, which is equal to 64 sectors per array
> member drive (448 sectors total), how exactly is a sub 4KB mail file
> (8 sectors) going to be split up into equal chunks across a 224KB RAID
> stripe?
It won't; it will simply end up on one drive (actually one mirror).
However, because the mirrors are striped together, all drives in the
array will be solicited in my experience; that's why you need at least
as many writing threads as there are stripes to reach the top IOPS. In
your case, writing 56 4K files simultaneously will effectively write on
all drives at once, hopefully (it depends upon the filesystem allocation
policy, though).
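The mapping can be sketched like this (plain striping across the 7 mirror pairs of the 14-drive example; md's internal layout details are ignored):

```python
# Map a byte offset to the mirror pair that stores it: chunk index
# modulo the number of pairs. A sub-chunk file lands entirely on one
# pair, so concurrency only comes from many files in different chunks.

def mirror_pair_for(offset, chunk_kib=32, pairs=7):
    chunk = chunk_kib * 1024
    return (offset // chunk) % pairs

print(mirror_pair_for(0))        # pair 0
print(mirror_pair_for(4096))     # still pair 0: same 32 KiB chunk
# Files one chunk apart spread across all seven pairs:
print({mirror_pair_for(i * 32 * 1024) for i in range(7)})
```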
> Does 220KB of the stripe merely get wasted?
It's not wasted, it just remains unallocated. What's wasted is
potential IO performance.
What appears from the benchmarks I ran over the years is that any way
you turn it, whatever caching, command tag queuing and reordering
you're using, a single thread can't reach maximal IOPS throughput on
an array, i.e. writing on all drives simultaneously; a single thread
writing to the fastest RAID 10 with 4K or 8K IOs can't do much better
than with a single drive, 200 to 300 IOPS for a 15k drive.
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: Verify filesystem is aligned to stripes
2010-11-26 8:16 ` Emmanuel Florac
@ 2010-11-26 12:22 ` Dave Chinner
2010-11-26 13:15 ` Spelic
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Dave Chinner @ 2010-11-26 12:22 UTC (permalink / raw)
To: Emmanuel Florac; +Cc: Stan Hoeppner, xfs
On Fri, Nov 26, 2010 at 09:16:22AM +0100, Emmanuel Florac wrote:
> On Thu, 25 Nov 2010 16:57:00 -0600, you wrote:
>
> > Looking at the stripe size, which is equal to 64 sectors per array
> > member drive (448 sectors total), how exactly is a sub 4KB mail file
> > (8 sectors) going to be split up into equal chunks across a 224KB RAID
> > stripe?
>
> It won't, it will simply end on one drive (actually one mirror).
> However because the mirrors are striped together, all drives in the
> array will be solicited in my experience; that's why you need at least
> as many writing threads as there are stripes to reach the top IOPS. In
> your case, writing 56 4K files simultaneously will effectively write on
> all drives at once, hopefully (depends upon the filesystem allocation
> policy though).
>
> > Does 220KB of the stripe merely get wasted?
>
> It's not wasted, it just remains unallocated. What's wasted is
> potential IO performance.
No, that's wrong. I don't have the time to explain the intricacies
of how XFS packs small files together, but it does. You can observe
the result by unpacking a kernel tarball and looking at the layout
with xfs_bmap if you really want to...
FWIW, for workloads that do random, small IO, XFS works best when you
_turn off_ aligned allocation and just let it spray the IO at the
disks. This works best if you are using RAID 0/1/10. All the numbers
I've been posting are with aligned allocation turned off (i.e. no
sunit/swidth set).
> What appears from the benchmarks I ran along the year is that anyway
> you turn it, whatever caching, command tag queuing and reordering
> you're using, a single thread can't reach maximal IOPS throughput on
> an array, i. e. writing on all drives simultaneously; a single thread
> writing to the fastest RAID 10 with 4K or 8K IOs can't do much better
> than with a single drive, 200 to 300 IOPS for a 15k drive.
Assuming synchronous IO. If you are doing async IO, a single CPU
should be able to keep hundreds of SRDs (Spinning Rust Disks) busy...
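Back of the envelope (assumed figures: ~50 microseconds of CPU per async IO submission, ~8 ms average service time per random IO):

```python
# A synchronous thread is capped at 1/latency IOPS, but an async
# submitter is capped only by per-IO submission cost; each in-flight
# IO keeps one disk busy, so one CPU can drive many disks at once.

def disks_one_cpu_can_drive(submit_us=50, io_latency_ms=8):
    submissions_per_sec = 1_000_000 // submit_us   # CPU-side ceiling
    per_disk_iops = 1000 // io_latency_ms          # ~125 for 8 ms
    return submissions_per_sec // per_disk_iops

print(disks_one_cpu_can_drive())   # 160 disks with these assumed figures
```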
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Verify filesystem is aligned to stripes
2010-11-26 12:22 ` Dave Chinner
@ 2010-11-26 13:15 ` Spelic
2010-11-26 14:05 ` Michael Monnerie
2010-11-26 14:36 ` Emmanuel Florac
2 siblings, 0 replies; 11+ messages in thread
From: Spelic @ 2010-11-26 13:15 UTC (permalink / raw)
To: xfs
On 11/26/2010 01:22 PM, Dave Chinner wrote:
>
> FWIW, for workloads that do random, small IO, XFS works best when you
> _turn off_ aligned allocation and just let it spray the IO at the
> disks. This works best if you are using RAID 0/1/10. All the numbers
> I've been posting are with aligned allocation turned off (i.e. no
> sunit/swidth set).
>
I think I also noticed this...
The thing is, for large sequential I/O it seems to me that it makes no
difference whether XFS is aligned or not, because the resulting file
will be sequential anyway, and if you have RAID10, or even parity RAID
with a large stripe cache, there won't be any reads anyway. OK, maybe
with alignment you could avoid reads on the first stripe (not sure; it
might read anyway if the RAID reacts fast, before enough output is sent
to it), but that's the only one.
So when should alignment be turned on?
Thanks for the info
* Re: Verify filesystem is aligned to stripes
2010-11-26 12:22 ` Dave Chinner
2010-11-26 13:15 ` Spelic
@ 2010-11-26 14:05 ` Michael Monnerie
2010-11-26 14:36 ` Emmanuel Florac
2 siblings, 0 replies; 11+ messages in thread
From: Michael Monnerie @ 2010-11-26 14:05 UTC (permalink / raw)
To: xfs; +Cc: Stan Hoeppner
On Friday, 26 November 2010, Dave Chinner wrote:
> FWIW, for workloads that do random, small IO, XFS works best when you
> turn off aligned allocation and just let it spray the IO at the
> disks. This works best if you are using RAID 0/1/10. All the numbers
> I've been posting are with aligned allocation turned off (i.e. no
> sunit/swidth set).
That's interesting to read.
Why would sunit/swidth be slower then? I'd have thought that XFS would
then know one stripe unit is 64k and I have 8 disks, so it should try to
pack 8*64 = 512 KB into one chunk on disk, and that especially for small
files it would write them like that.
The man page just says inodes and the log are stripe aligned, and file
tails >512k are extended to full stripes on append. I thought that even
the inode/log alignment alone would help a lot.
Now what is the advantage of skipping sunit/swidth altogether?
And what is the difference when it's on RAID10 versus RAID6?
I'm always eager to understand performance issues ;-)
--
with kind regards,
Michael Monnerie, Ing. BSc
it-management Internet Services: Protéger
http://proteger.at [pronounced: Prot-e-schee]
Tel: +43 660 / 415 6531
// ****** Radio interview on the subject of spam ******
// http://www.it-podcast.at/archiv.html#podcast-100716
//
// House for sale: http://zmi.at/langegg/
* Re: Verify filesystem is aligned to stripes
2010-11-26 12:22 ` Dave Chinner
2010-11-26 13:15 ` Spelic
2010-11-26 14:05 ` Michael Monnerie
@ 2010-11-26 14:36 ` Emmanuel Florac
2 siblings, 0 replies; 11+ messages in thread
From: Emmanuel Florac @ 2010-11-26 14:36 UTC (permalink / raw)
To: Dave Chinner; +Cc: Stan Hoeppner, xfs
On Fri, 26 Nov 2010 23:22:18 +1100,
Dave Chinner <david@fromorbit.com> wrote:
> Assuming synchronous IO. If you are doing async IO, a single CPU
> should be able to keep hundreds of SRDs (Spinning Rust Disks) busy...
>
What I had in mind (typical random IO) is DB access, which is AFAIK
synchronous IO, my bad :)
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------