public inbox for linux-xfs@vger.kernel.org
* 30 TB RAID6 + XFS slow write performance
@ 2011-07-18 19:58 John Bokma
  2011-07-19  0:00 ` Eric Sandeen
  2011-07-19  8:37 ` Emmanuel Florac
  0 siblings, 2 replies; 17+ messages in thread
From: John Bokma @ 2011-07-18 19:58 UTC (permalink / raw)
  To: xfs

Dear list members,

A customer of mine is currently struggling with the performance of a 30 
TB RAID6 array which uses XFS as the file system. I am fairly sure it's 
not XFS that's causing the performance issue, but my expertise is in 
neither XFS nor RAID; I just wrote the software that, after moving to 
the larger RAID (from a much smaller one, ~ 3TB, using ext3), suddenly 
seems to have suffered a huge drop in write performance.

The software I wrote writes many small (50-150K) files in parallel (100+ 
processes), thousands of times per hour. Writing a file of 50-150K now 
and then seems to take between 30 and 90 seconds, and more rarely can 
take over 200 seconds (several times an hour).

When all processes are stopped and then restarted, the 30-90 second 
delays start happening once about 16-20+ processes are running.

To me this sounds like something has been configured wrong. I have 
already recommended that my customer find someone capable of 
configuring the RAID correctly; it sounds like a hardware/configuration 
issue.

Any insights are very welcome.

Hardware:
card: MegaRAID SAS 9260-16i
disks: 14x Barracuda® XT ST33000651AS 3TB (2 hot spares).
RAID6
~ 30TB

Thanks for reading,
John

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: 30 TB RAID6 + XFS slow write performance
  2011-07-18 19:58 30 TB RAID6 + XFS slow write performance John Bokma
@ 2011-07-19  0:00 ` Eric Sandeen
  2011-07-19  8:37 ` Emmanuel Florac
  1 sibling, 0 replies; 17+ messages in thread
From: Eric Sandeen @ 2011-07-19  0:00 UTC (permalink / raw)
  To: John Bokma; +Cc: xfs

On 7/18/11 2:58 PM, John Bokma wrote:
> Dear list members,
> 
> A customer of mine is currently struggling with the performance of a
> 30 TB RAID6 which uses XFS as the filing system. I am somewhat sure
> it's not XFS that's causing the performance issue but my expertise is
> not XFS nor RAID; I just wrote the software that after moving to the
> larger RAID (from a much smaller one, ~ 3TB, using ext3) suddenly
> seems to have a huge drop in write performance.
> 
> The software I wrote writes many small (50-150K) files in parallel
> (100+ processes), thousands of times per hour. Writing a file of
> 50-150K now and then seems to take between 30 and 90 seconds, and
> more rarely can take over 200 seconds (several times an hour).
> 
> When all processes are stopped and restarted again the 30-90 seconds
> delay start happening when about 16-20+ processes are running.
> 
> To me this sounds like something has been configured wrong. I already
> recommended my customer to find someone who is capable of configuring
> the RAID correctly; to me it sounds like a hardware/configuration
> issue.
> 
> Any insights are very welcome.
> 
> Hardware: card: MegaRAID SAS 9260-16i disks: 14x Barracuda® XT
> ST33000651AS 3TB (2 hot spares). RAID6 ~ 30TB

My first suggestion would be to check the partition alignment on the RAID (if it is partitioned), and be sure it is aligned with the underlying RAID geometry.

And then make sure you give mkfs.xfs the proper geometry as well.

After that, does the raid card have a battery-backed write cache?  If so, you can safely disable barriers.

More info is always good too, for starters what kernel & what xfsprogs version?

What mkfs & mount options?
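For reference, a quick sketch of commands that gather the information asked for above (nothing here is specific to the OP's box; the commands are guarded in case a tool isn't installed):

```shell
# Kernel version -- the list will want this first
uname -r

# xfsprogs version, if installed (mkfs.xfs -V prints it)
command -v mkfs.xfs >/dev/null 2>&1 && mkfs.xfs -V

# Current XFS mounts and their mount options
grep -w xfs /proc/mounts || true
```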

-Eric

> Thanks for reading, John
> 


* Re: 30 TB RAID6 + XFS slow write performance
  2011-07-18 19:58 30 TB RAID6 + XFS slow write performance John Bokma
  2011-07-19  0:00 ` Eric Sandeen
@ 2011-07-19  8:37 ` Emmanuel Florac
  2011-07-19 22:37   ` Stan Hoeppner
  1 sibling, 1 reply; 17+ messages in thread
From: Emmanuel Florac @ 2011-07-19  8:37 UTC (permalink / raw)
  To: John Bokma; +Cc: xfs

On Mon, 18 Jul 2011 14:58:55 -0500, you wrote:

> card: MegaRAID SAS 9260-16i
> disks: 14x Barracuda® XT ST33000651AS 3TB (2 hot spares).
> RAID6
> ~ 30TB
> 

This card doesn't enable the write cache without a BBU present. Be
sure you have a BBU or the performance will always be unbearably awful.
Then proceed as Eric suggested. Initialize your filesystem with the
right options: su= your RAID stripe size, sw= the number of data
members in your RAID array (for RAID6, the total number of drives minus
2), don't forget the useful option -l lazy-count=1, and mount with
nobarrier and inode64.
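To make the su/sw arithmetic concrete, here's a small sketch. It assumes the LSI default 64KB stripe size and all 14 drives in the RAID6 set (12 data + 2 parity), per Emmanuel's surmise; /dev/sdX is a placeholder device:

```python
# Sketch: derive mkfs.xfs su/sw for this array.  The 64KB chunk size
# and 14-drives-in-array figure are assumptions; /dev/sdX is a
# placeholder, not a real device.
stripe_size_kb = 64                            # per-drive chunk on the card
total_drives = 14
parity_drives = 2                              # RAID6
data_members = total_drives - parity_drives    # sw = 12

cmd = (f"mkfs.xfs -d su={stripe_size_kb}k,sw={data_members} "
       f"-l lazy-count=1 /dev/sdX")
print(cmd)
# Full stripe width the filesystem will try to align to:
print(f"swidth = {stripe_size_kb * data_members} KiB")
```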

BTW, apparently you're confusing hot spares and parity drives. A RAID6
array has 2 parity drives; it may or may not also have one or more hot
spares (generally one is enough). I suppose your array is actually 12
data + 2 parity drives.

regards,
-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------


* Re: 30 TB RAID6 + XFS slow write performance
  2011-07-19  8:37 ` Emmanuel Florac
@ 2011-07-19 22:37   ` Stan Hoeppner
  2011-07-20  0:20     ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Stan Hoeppner @ 2011-07-19 22:37 UTC (permalink / raw)
  To: Emmanuel Florac; +Cc: xfs, John Bokma

On 7/19/2011 3:37 AM, Emmanuel Florac wrote:
> On Mon, 18 Jul 2011 14:58:55 -0500, you wrote:
> 
>> card: MegaRAID SAS 9260-16i
>> disks: 14x Barracuda® XT ST33000651AS 3TB (2 hot spares).
>> RAID6
>> ~ 30TB

> This card doesn't activate the write cache without a BBU present. Be
> sure you have a BBU or the performance will always be unbearably awful.

In addition to all the other recommendations, once the BBU is installed,
disable the individual drive caches (if this isn't done automatically),
and set the controller cache mode to 'write back'.  The write through
and direct I/O cache modes will deliver horrible RAID6 write performance.

And, BTW, RAID6 is a horrible choice for a parallel, small file, high
random I/O workload such as you've described.  RAID10 would be much more
suitable.  Actually, any striped RAID is less than optimal for such a
small file workload.  The default stripe size for the LSI RAID
controllers, IIRC, is 64KB.  With 14 spindles of stripe width you end up
with 64*14 = 896KB.  XFS will try to pack as many of these 50-150K files
as possible into a single extent, but you're talking 6 to 18 files per extent, and
this is wholly dependent on the parallel write pattern, and in which of
the allocation groups XFS decides to write each file.  XFS isn't going
to be 100% efficient in this case.  Thus, you will end up with many
partial stripe width writes, eliminating much of the performance
advantage of striping.
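The stripe-width figures above can be checked quickly (the 64KB chunk size and 14-spindle width are Stan's assumed numbers):

```python
# Stan's stripe-width arithmetic, checked.  64KB chunk and 14 spindles
# are his assumed figures for this card and array.
chunk_kb = 64
spindles = 14
stripe_width_kb = chunk_kb * spindles      # 896 KiB full stripe

# Roughly how many 50-150K files fit in one full stripe width:
files_at_150k = stripe_width_kb / 150      # ~6 of the largest files
files_at_50k = stripe_width_kb / 50        # ~18 of the smallest files
print(stripe_width_kb, round(files_at_150k), round(files_at_50k))
```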

These are large 7200 rpm SATA drives which have poor seek performance to
begin with, unlike the 'small' 300GB 15k SAS drives.  You're robbing
that poor seek performance further by:

1.  Using double parity striped RAID
2.  Writing thousands of small files in parallel

This workload is very similar to the case of a mail server using the
maildir storage format.  If you read the list archives you'll see
recommendations for an optimal storage stack setup for this workload.
It goes something like this:

1.  Create a linear array of hardware RAID1 mirror sets.
    Do this all in the controller if it can do it.
    If not, use Linux RAID (mdadm) to create a '--linear' array of the
    multiple (7 in your case, apparently) hardware RAID1 mirror sets

2.  Now let XFS handle the write parallelism.  Format the resulting
    7 spindle Linux RAID device with, for example:

    mkfs.xfs -d agcount=14 /dev/md0

By using this configuration you eliminate the excessive head seeking
associated with the partial stripe write problems of RAID6, restoring
performance efficiency to the array.  Using 14 allocation groups allows
XFS to write, at minimum, 14 such files in parallel.  This may not
seem like a lot given you have ~200 writers, but it's actually far more
than what you're getting now, or what you'll get with striped parity
RAID.  Consider the 150KB file case:  14*150KB = 2.1MB per parallel
writeback pass.  Assuming this hardware and software stack can sink
210MB/s with this workload, that's ~1,400 files written per second, or
about 5 million files per hour.  Would this be sufficient for your
application?
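The back-of-envelope numbers work out as follows (the 210MB/s sink rate is an assumed figure, as in the text above):

```python
# Back-of-envelope write-rate check for the 150KB file case.
# The 210 MB/s sink rate is an assumption, not a measurement.
file_kb = 150
parallel_files = 14
sink_kb_per_sec = 210_000                   # 210 MB/s, decimal units

per_pass_kb = parallel_files * file_kb      # 2100 KB ~= 2.1 MB per pass
files_per_sec = sink_kb_per_sec // file_kb  # 1400 files/s
print(per_pass_kb, files_per_sec)
```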

Now that we've covered the XFS and hardware RAID side of this equation,
does your application run directly on this machine, or are you
writing over NFS or CIFS to this XFS filesystem?  If so, that's another
fly in the ointment we may have to deal with.

-- 
Stan


* Re: 30 TB RAID6 + XFS slow write performance
  2011-07-19 22:37   ` Stan Hoeppner
@ 2011-07-20  0:20     ` Dave Chinner
  2011-07-20  5:16       ` Stan Hoeppner
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2011-07-20  0:20 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: John Bokma, xfs

On Tue, Jul 19, 2011 at 05:37:25PM -0500, Stan Hoeppner wrote:
> On 7/19/2011 3:37 AM, Emmanuel Florac wrote:
> > On Mon, 18 Jul 2011 14:58:55 -0500, you wrote:
> > 
> >> card: MegaRAID SAS 9260-16i
> >> disks: 14x Barracuda® XT ST33000651AS 3TB (2 hot spares).
> >> RAID6
> >> ~ 30TB
> 
> > This card doesn't activate the write cache without a BBU present. Be
> > sure you have a BBU or the performance will always be unbearably awful.
> 
> In addition to all the other recommendations, once the BBU is installed,
> disable the individual drive caches (if this isn't done automatically),
> and set the controller cache mode to 'write back'.  The write through
> and direct I/O cache modes will deliver horrible RAID6 write performance.
> 
> And, BTW, RAID6 is a horrible choice for a parallel, small file, high
> random I/O workload such as you've described.  RAID10 would be much more
> suitable.  Actually, any striped RAID is less than optimal for such a
> small file workload.  The default stripe size for the LSI RAID
> controllers, IIRC, is 64KB.  With 14 spindles of stripe width you end up
> with 64*14 = 896KB. 

All good up to here.

> XFS will try to pack as many of these 50-150K files
> into a single extent, but you're talking 6 to 18 files per extent,

I think you've got your terminology wrong. An extent can only belong
to a single inode, but an inode can contain many extents, as can a
stripe width. We do not pack data from multiple files into a single
extent.

For new files on a su/sw aware filesystem, however, XFS will *not*
pack multiple files into the same stripe unit. It will try to align
the first extent of the file to sunit, or if you have the swalloc
mount option set and the allocation is for more than a swidth of
space it will align to swidth rather than sunit.

So if you have a small file workload, specifying sunit/swidth can
actually -decrease- performance because it allocates the file
extents sparsely. IOWs, stripe alignment is important for bandwidth
intensive applications because it allows full stripe writes to occur
much more frequently, but can be harmful to small file performance
as the aligned allocation pattern can prevent full stripe writes
from occurring.....
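A toy calculation illustrates the sparse-allocation effect described above (64KB sunit and 50KB files are assumed figures, and the start-of-sunit alignment is a simplification of the real allocator):

```python
# Toy model: sunit-aligned allocation of small files leaves a gap at
# the end of every stripe unit, so full-stripe writes can't form.
# 64 KiB sunit and 50 KiB files are assumed figures; the real XFS
# allocator is more subtle than this.
sunit_kb = 64
file_kb = 50

def align_up(x, a):
    """Round x up to the next multiple of a."""
    return -(-x // a) * a

offset = 0
used = 0
for _ in range(12):                       # write 12 small files back to back
    offset = align_up(offset, sunit_kb)   # each file starts on a sunit
    offset += file_kb
    used += file_kb

span = align_up(offset, sunit_kb)
print(f"data written: {used} KiB over a {span} KiB span")
# Every stripe unit holds 50 KiB of data plus a 14 KiB hole, so each
# writeback is a partial-stripe write -> RAID6 read-modify-write.
```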

> and
> this is wholly dependent on the parallel write pattern, and in which of
> the allocation groups XFS decides to write each file.

That's pretty much irrelevant for small files as a single allocation
is done for each file during writeback.

> XFS isn't going
> to be 100% efficient in this case.  Thus, you will end up with many
> partial stripe width writes, eliminating much of the performance
> advantage of striping.

Yes, that's the ultimate problem, but not for the reasons you
suggested. ;)

> These are large 7200 rpm SATA drives which have poor seek performance to
> begin with, unlike the 'small' 300GB 15k SAS drives.  You're robbing
> that poor seek performance further by:
> 
> 1.  Using double parity striped RAID
> 2.  Writing thousands of small files in parallel

The writing in parallel is only an issue if it is direct or
synchronous IO. If it's using normal buffered writes, then writeback
is mostly single threaded and delayed allocation should be preventing
fragmentation completely. That still doesn't guarantee that
writeback avoids RAID RMW cycles (see above about allocation
alignment).

> This workload is very similar to the case of a mail server using the
> maildir storage format.

There's not enough detail in the workload description to make that
assumption.

> If you read the list archives you'll see
> recommendations for an optimal storage stack setup for this workload.
> It goes something like this:
> 
> 1.  Create a linear array of hardware RAID1 mirror sets.
>     Do this all in the controller if it can do it.
>     If not, use Linux RAID (mdadm) to create a '--linear' array of the
>     multiple (7 in your case, apparently) hardware RAID1 mirror sets
> 
> 2.  Now let XFS handle the write parallelism.  Format the resulting
>     7 spindle Linux RAID device with, for example:
> 
>     mkfs.xfs -d agcount=14 /dev/md0
> 
> By using this configuration you eliminate the excessive head seeking
> associated with the partial stripe write problems of RAID6, restoring
> performance efficiency to the array.  Using 14 allocation groups allows
> XFS to write, at minimum, 14 such files in parallel.

That's not correct. 14 AGs means that if the files are laid out
across all AGs then there can be 14 -allocations- in parallel at
once. If IOs do not require allocation, then they don't serialise
at all on the AGs.  IOWs, if allocation takes 1ms of work in an AG,
then you could have 1,000 allocations per second per AG. With 14
AGs, that gives an allocation capability of up to 14,000/s.

And given that not all writes require allocation, and allocation is
usually only a small percentage of the total IO time, you can have
many, many more write IOs in flight than you can do allocations in
an AG....
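Spelling out the arithmetic (the 1ms cost per allocation is the illustrative figure used above):

```python
# The allocation-parallelism arithmetic, spelled out.  The 1ms cost
# per allocation is an illustrative figure, not a measurement.
ags = 14
alloc_cost_ms = 1.0
allocs_per_ag_per_sec = 1000 / alloc_cost_ms    # 1,000/s in one AG
total_allocs_per_sec = int(ags * allocs_per_ag_per_sec)
print(total_allocs_per_sec)                     # up to 14,000/s filesystem-wide
```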

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: 30 TB RAID6 + XFS slow write performance
  2011-07-20  0:20     ` Dave Chinner
@ 2011-07-20  5:16       ` Stan Hoeppner
  2011-07-20  6:44         ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Stan Hoeppner @ 2011-07-20  5:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: John Bokma, xfs

On 7/19/2011 7:20 PM, Dave Chinner wrote:
> On Tue, Jul 19, 2011 at 05:37:25PM -0500, Stan Hoeppner wrote:
>> On 7/19/2011 3:37 AM, Emmanuel Florac wrote:
>>> On Mon, 18 Jul 2011 14:58:55 -0500, you wrote:
>>>
>>>> card: MegaRAID SAS 9260-16i
>>>> disks: 14x Barracuda® XT ST33000651AS 3TB (2 hot spares).
>>>> RAID6
>>>> ~ 30TB
>>
>>> This card doesn't activate the write cache without a BBU present. Be
>>> sure you have a BBU or the performance will always be unbearably awful.
>>
>> In addition to all the other recommendations, once the BBU is installed,
>> disable the individual drive caches (if this isn't done automatically),
>> and set the controller cache mode to 'write back'.  The write through
>> and direct I/O cache modes will deliver horrible RAID6 write performance.
>>
>> And, BTW, RAID6 is a horrible choice for a parallel, small file, high
>> random I/O workload such as you've described.  RAID10 would be much more
>> suitable.  Actually, any striped RAID is less than optimal for such a
>> small file workload.  The default stripe size for the LSI RAID
>> controllers, IIRC, is 64KB.  With 14 spindles of stripe width you end up
>> with 64*14 = 896KB. 
> 
> All good up to here.

And then my lack of understanding of XFS internals begins to show. :(

>> XFS will try to pack as many of these 50-150K files
>> into a single extent, but you're talking 6 to 18 files per extent,

> I think you've got your terminology wrong. An extent can only belong
> to a single inode, but an inode can contain many extents, as can a
> stripe width. We do not pack data from multiple files into a single
> extent.

Yes, I think I meant stripe unit, the 896KB.

> For new files on a su/sw aware filesystem, however, XFS will *not*
> pack multiple files into the same stripe unit. It will try to align
> the first extent of the file to sunit, or if you have the swalloc
> mount option set and the allocation is for more than a swidth of
> space it will align to swidth rather than sunit.

Interesting.  Didn't realize this.

> So if you have a small file workload, specifying sunit/swidth can
> actually -decrease- performance because it allocates the file
> extents sparsely. IOWs, stripe alignment is important for bandwidth
> intensive applications because it allows full stripe writes to occur
> much more frequently, but can be harmful to small file performance
> as the aligned allocation pattern can prevent full stripe writes
> from occurring.....

I don't recall reading this before Dave.  Thank you for this tidbit.
How much performance decrease are we looking at here?  An mkfs.xfs of an
mdraid striped array will by default create sunit/swidth values, right?
And thus this lower performance with small files.

>> and
>> this is wholly dependent on the parallel write pattern, and in which of
>> the allocation groups XFS decides to write each file.
> 
> That's pretty much irrelevant for small files as a single allocation
> is done for each file during writeback.

I believe I was already thinking of the concatenated array at this point
and accidentally dropped those thoughts into the striped array discussion.

>> XFS isn't going
>> to be 100% efficient in this case.  Thus, you will end up with many
>> partial stripe width writes, eliminating much of the performance
>> advantage of striping.
> 
> Yes, that's the ultimate problem, but not for the reasons you
> suggested. ;)

Thanks for saving me Dave. :)  I had the big picture right but FUBAR'd
some of the details.  Maybe there's a job in politics waiting for me. ;)

>> These are large 7200 rpm SATA drives which have poor seek performance to
>> begin with, unlike the 'small' 300GB 15k SAS drives.  You're robbing
>> that poor seek performance further by:
>>
>> 1.  Using double parity striped RAID
>> 2.  Writing thousands of small files in parallel
> 
> The writing in parallel is only an issue if it is direct or
> synchronous IO. If it's using normal buffered writes, then writeback
> is mostly single threaded and delayed allocation should be preventing
> fragmentation completely. That still doesn't guarantee that
> writeback avoids RAID RMW cycles (see above about allocation
> alignment).

The RMW was mainly what I was concerned with here.

>> This workload is very similar to the case of a mail server using the
>> maildir storage format.
> 
> There's not enough detail in the workload description to make that
> assumption.

Good point.  I should have said "at first glance... seems similar".

>> If you read the list archives you'll see
>> recommendations for an optimal storage stack setup for this workload.
>> It goes something like this:
>>
>> 1.  Create a linear array of hardware RAID1 mirror sets.
>>     Do this all in the controller if it can do it.
>>     If not, use Linux RAID (mdadm) to create a '--linear' array of the
>>     multiple (7 in your case, apparently) hardware RAID1 mirror sets
>>
>> 2.  Now let XFS handle the write parallelism.  Format the resulting
>>     7 spindle Linux RAID device with, for example:
>>
>>     mkfs.xfs -d agcount=14 /dev/md0
>>
>> By using this configuration you eliminate the excessive head seeking
>> associated with the partial stripe write problems of RAID6, restoring
>> performance efficiency to the array.  Using 14 allocation groups allows
>> XFS to write, at minimum, 14 such files in parallel.
> 
> That's not correct. 14 AG means that if the files are laid out
> across all AGs then there can be 14 -allocations- in parallel at
> once. If IO does not require allocation, then they don't serialise
> at all on the AGs.  IOWs, If allocation takes 1ms of work in an AG,
> then you could have 1,000 allocations per second per AG. With 14
> AGs, that gives allocation capability of up to 14,000/s

So are you saying that we have no guarantee, nor high probability, that
the small files in this case will be spread out across all AGs, thus
making more efficient use of each disk's performance in the concatenated
array, vs a striped array?  Or, are you merely pointing out a detail I
have incorrect, which I've yet to fully understand?

> And given that not all writes require allocation and allocation is
> usually only a small percentage of the total IO time. You can have
> many, many more write IOs in flight than you can do allocations in
> an AG....

Ahh, I think I see your point.  For the maildir case, more of the IO is
likely due to things like updating message flags, etc, than actually
writing new mail files into the directory.  Such operations don't
require allocation.  With the workload mentioned by the OP, it's
possible that all of the small file writes may indeed require
allocation, unlike the maildir workload.  But if this is the case,
wouldn't the concatenated array still yield better overall performance
than RAID6, or any other striped array?

If I misunderstood your last point, or any points, please guide me to
the light Dave.

-- 
Stan


* Re: 30 TB RAID6 + XFS slow write performance
  2011-07-20  5:16       ` Stan Hoeppner
@ 2011-07-20  6:44         ` Dave Chinner
  2011-07-20 12:10           ` Stan Hoeppner
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2011-07-20  6:44 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: John Bokma, xfs

On Wed, Jul 20, 2011 at 12:16:15AM -0500, Stan Hoeppner wrote:
> On 7/19/2011 7:20 PM, Dave Chinner wrote:
> > On Tue, Jul 19, 2011 at 05:37:25PM -0500, Stan Hoeppner wrote:
> >> On 7/19/2011 3:37 AM, Emmanuel Florac wrote:
> >>> On Mon, 18 Jul 2011 14:58:55 -0500, you wrote:
> >>>
> >>>> card: MegaRAID SAS 9260-16i
> >>>> disks: 14x Barracuda® XT ST33000651AS 3TB (2 hot spares).
> >>>> RAID6
> >>>> ~ 30TB
> >>
> >>> This card doesn't activate the write cache without a BBU present. Be
> >>> sure you have a BBU or the performance will always be unbearably awful.
> >>
> >> In addition to all the other recommendations, once the BBU is installed,
> >> disable the individual drive caches (if this isn't done automatically),
> >> and set the controller cache mode to 'write back'.  The write through
> >> and direct I/O cache modes will deliver horrible RAID6 write performance.
> >>
> >> And, BTW, RAID6 is a horrible choice for a parallel, small file, high
> >> random I/O workload such as you've described.  RAID10 would be much more
> >> suitable.  Actually, any striped RAID is less than optimal for such a
> >> small file workload.  The default stripe size for the LSI RAID
> >> controllers, IIRC, is 64KB.  With 14 spindles of stripe width you end up
> >> with 64*14 = 896KB. 
> > 
> > All good up to here.
> 
> And then my lack of understanding of XFS internals begins to show. :(

The fact you are trying to understand them is the important bit!

....
> > So if you have a small file workload, specifying sunit/swidth can
> > actually -decrease- performance because it allocates the file
> > extents sparsely. IOWs, stripe alignment is important for bandwidth
> > intensive applications because it allows full stripe writes to occur
> > much more frequently, but can be harmful to small file performance
> > as the aligned allocation pattern can prevent full stripe writes
> > from occurring.....
> 
> I don't recall reading this before Dave.  Thank you for this tidbit.

I'm sure I've said this before, but it's possible I've said it this
time in a way that is obvious and understandable. Most people
struggle with the concept of allocation alignment and why it might be
important, let alone understand it well enough to discuss intricate
details of the allocator and tuning it for different workloads...

> How much performance decrease are we looking at here?

Depends on your hardware and the workload. It may not be measurable,
it may be very noticeable. Benchmarking your system with your
workload is the only way to really know.

> An mkfs.xfs of an
> mdraid striped array will by default create sunit/swidth values right?
> And thus this lower performance w/small files.

In general, specifying sunit/swidth provides a better tradeoff
for maintaining consistent performance on files across the
filesystem. It might cost a little for small files, but unaligned IO
on large files causes much more noticeable performance problems...

....

> >> If you read the list archives you'll see
> >> recommendations for an optimal storage stack setup for this workload.
> >> It goes something like this:
> >>
> >> 1.  Create a linear array of hardware RAID1 mirror sets.
> >>     Do this all in the controller if it can do it.
> >>     If not, use Linux RAID (mdadm) to create a '--linear' array of the
> >>     multiple (7 in your case, apparently) hardware RAID1 mirror sets
> >>
> >> 2.  Now let XFS handle the write parallelism.  Format the resulting
> >>     7 spindle Linux RAID device with, for example:
> >>
> >>     mkfs.xfs -d agcount=14 /dev/md0
> >>
> >> By using this configuration you eliminate the excessive head seeking
> >> associated with the partial stripe write problems of RAID6, restoring
> >> performance efficiency to the array.  Using 14 allocation groups allows
> >> XFS to write, at minimum, 14 such files in parallel.
> > 
> > That's not correct. 14 AG means that if the files are laid out
> > across all AGs then there can be 14 -allocations- in parallel at
> > once. If IO does not require allocation, then they don't serialise
> > at all on the AGs.  IOWs, If allocation takes 1ms of work in an AG,
> > then you could have 1,000 allocations per second per AG. With 14
> > AGs, that gives allocation capability of up to 14,000/s
> 
> So are you saying that we have no guarantee, nor high probability, that
> the small files in this case will be spread out across all AGs, thus
> making more efficient use of each disk's performance in the concatenated
> array, vs a striped array?  Or, are you merely pointing out a detail I
> have incorrect, which I've yet to fully understand?

Yet to fully understand. It's not limited to small files, either.

XFS doesn't guarantee that specific allocations are evenly
distributed across AGs, but it does try to spread the overall
contents of the filesystem across all AGs. It does have concepts of
locality of reference, but they change depending on the allocator in
use.

Take, for example, inode32 vs inode64 which are the two most common
allocation strategies and assume we have a 16TB fs with 1TB AGs.
The inode32 allocator will place all inodes and most directory
metadata in the first AG, below one TB. There is basically no
metadata allocation parallelism in this strategy, so metadata
performance is limited and will often serialise. Metadata tends to
have good locality of reference - all directories and inodes will
tend to be close together on disk because they are in the same AG.

Data, on the other hand, is rotored around AGs 2-16 on a per-file
basis, so there is no locality between inodes and their data, nor of
data between two adjacent files in the same directory. There is,
however, data allocation parallelism because files are spread
across allocation groups...

Hence for inode32, metadata is closely located, but data is spread
out widely. Hence metadata operations don't scale at all well on a
linear concat (e.g. hit only one disk/mirror pair), but data
allocations are spread effectively and hence parallelise and scale
quite well. The downside to this is that data lookups involve large
seeks if you have a stripe, and hence can be quite slow. Data reads
on a linear concat are not guaranteed to evenly load the disks,
either, simply because there's no correlation between the location
of the data and the access patterns.

For inode64, locality of reference clusters around the directory
structure. The inodes for files in a directory will be allocated in
the same AG as the directory inode, and the data for each file will
be allocated in the same AG as the file inodes. When you create a
new directory, it gets placed in a different AG, and the pattern
repeats. So for inode64, distributing files across all AGs is caused
by distributing the directory structure. FWIW, an example is a
kernel source tree:

~/src/kern/xfsdev$ find . -type d -exec sudo xfs_bmap -v {} \; | awk '/ 0: / { print $4 }' |sort -n |uniq -c
     76 0
     66 1
     85 2
     81 3
     82 4
     69 5
     89 6
     74 7
     90 8
     81 9
     96 10
     84 11
     85 12
     84 13
     86 14
     71 15

As you can see, there's a relatively even spread of the directories
across all 16 AGs in that directory structure, and the file data
will follow this pattern. Because of its better metadata<->data
locality of reference, inode64 tends to be significantly faster on
workloads that mix metadata operations with data operations (e.g.
recursive grep across a kernel source tree) as the seek cost between
the inode and its data is much less than for inode32....

However, if your workload does not spread across directories, then
IO will tend to be limited to specific silos in the linear concat
while other disks sit idle. If you have a stripe, then the seeks to
get to the data are small, and hence much faster than inode32 on
similar workloads.

This is all ignoring stripe aligned allocation - that is often lost
in the noise compared to bigger issues like seeking from AG 0 to AG
15 when reading the inode and then the data, or having a workload
only use a single AG because it is all confined to a single directory.

IOWs, the best, most optimal filesystem layout and allocation
strategy is both workload and hardware dependent, and there's no one
right answer. The defaults select the best balance for typical usage
- beyond that benchmarking the workload is the only way to really
measure whether your tweaks are the right ones or not. IOWs, you
need to understand the filesystem, your storage hardware and -the
application IO patterns- to make the right tuning decisions.


> > And given that not all writes require allocation and allocation is
> > usually only a small percentage of the total IO time. You can have
> > many, many more write IOs in flight than you can do allocations in
> > an AG....
> 
> Ahh, I think I see your point.  For the maildir case, more of the IO is
> likely due to things like updating message flags, etc, than actually
> writing new mail files into the directory. 

I wasn't really talking about maildir here, just pointing out that
allocation is generally not the limiting factor in doing large
amounts of concurrent write IO.

> Such operations don't
> require allocation.  With the workload mentioned by the OP, it's
> possible that all of the small file writes may indeed require
> allocation, unlike the maildir workload.  But if this is the case,
> wouldn't the concatenated array still yield better overall performance
> than RAID6, or any other striped array?

<shrug>

Quite possibly, but I can't say conclusively - I simply don't know
enough about the workload or the fs configuration.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: 30 TB RAID6 + XFS slow write performance
  2011-07-20  6:44         ` Dave Chinner
@ 2011-07-20 12:10           ` Stan Hoeppner
  2011-07-20 14:04             ` Michael Monnerie
  0 siblings, 1 reply; 17+ messages in thread
From: Stan Hoeppner @ 2011-07-20 12:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: John Bokma, xfs

On 7/20/2011 1:44 AM, Dave Chinner wrote:
> On Wed, Jul 20, 2011 at 12:16:15AM -0500, Stan Hoeppner wrote:
>> On 7/19/2011 7:20 PM, Dave Chinner wrote:
>>> On Tue, Jul 19, 2011 at 05:37:25PM -0500, Stan Hoeppner wrote:
>>>> On 7/19/2011 3:37 AM, Emmanuel Florac wrote:
>>>>> Le Mon, 18 Jul 2011 14:58:55 -0500 vous écriviez:
>>>>>
>>>>>> card: MegaRAID SAS 9260-16i
>>>>>> disks: 14x Barracuda® XT ST33000651AS 3TB (2 hot spares).
>>>>>> RAID6
>>>>>> ~ 30TB
>>>>
>>>>> This card doesn't activate the write cache without a BBU present. Be
>>>>> sure you have a BBU or the performance will always be unbearably awful.
>>>>
>>>> In addition to all the other recommendations, once the BBU is installed,
>>>> disable the individual drive caches (if this isn't done automatically),
>>>> and set the controller cache mode to 'write back'.  The write through
>>>> and direct I/O cache modes will deliver horrible RAID6 write performance.
>>>>
>>>> And, BTW, RAID6 is a horrible choice for a parallel, small file, high
>>>> random I/O workload such as you've described.  RAID10 would be much more
>>>> suitable.  Actually, any striped RAID is less than optimal for such a
>>>> small file workload.  The default stripe size for the LSI RAID
>>>> controllers, IIRC, is 64KB.  With 14 spindles of stripe width you end up
>>>> with 64*14 = 896KB. 
>>>
>>> All good up to here.
>>
>> And then my lack of understanding of XFS internals begins to show. :(
> 
> The fact you are trying to understand them is the important bit!

I've always found XFS fascinating (as with most of SGI's creations).
The more I use XFS, and the more I participate here, the more I want to
understand how the cogs turn.  And as you mentioned previously, it's
beneficial to this list if users can effectively answer other users'
questions, giving devs more time for developing. :)

> ....
>>> So if you have a small file workload, specifying sunit/swidth can
>>> actually -decrease- performance because it allocates the file
>>> extents sparsely. IOWs, stripe alignment is important for bandwidth
>>> intensive applications because it allows full stripe writes to occur
>>> much more frequently, but can be harmful to small file performance
>>> as the aligned allocation pattern can prevent full stripe writes
>>> from occurring.....
>>
>> I don't recall reading this before Dave.  Thank you for this tidbit.
> 
> I'm sure I've said this before, but it's possible I've said it this
> time in a way that is obvious and understandable. Most people
> struggle with the concept of allocation alignment and why it might be
> important, let alone understand it well enough to discuss intricate
> details of the allocator and tuning it for different workloads...

In general I've understood for quite some time that large stripes were
typically bad for small file performance due to the partial stripe write
issue.  However, I misunderstood something you said quite some time ago
about XFS having some tricks to somewhat mitigate partial stripe writes
during writeback.  I thought this was packing multiple small files into
a single stripe write, which you just explained XFS does not do.
Thinking back you were probably talking about some other aggregation
that occurs in the allocator to cut down on the number of physical IOs
required to write the data, or something like that.

...
>> An mkfs.xfs of an
>> mdraid striped array will by default create sunit/swidth values right?
>> And thus this lower performance w/small files.
> 
> In general, sunit/swidth being specified provides a better tradeoff
> for maintaining consistent performance on files across the
> filesystem. It might cost a little for small files, but unaligned IO
> on large files causes much more noticeable performance problems...

The reason I asked is to get something in Google.  If a user has a
purely small file workload, such as maildir, but insists on using an
mdraid striped array, would it be better to override the mkfs.xfs
defaults here so sunit/swidth aren't defined?  If so, would one specify
zero for each parameter on the command line?
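For reference, a hedged sketch of what that override might look like — the exact syntax should be verified against mkfs.xfs(8) for the version in use:

```shell
# Hypothetical: suppress stripe alignment on an md array whose geometry
# mkfs.xfs would otherwise auto-detect from the device.
mkfs.xfs -d su=0,sw=0 /dev/md0
```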

> ....
> 
>>>> If you read the list archives you'll see
>>>> recommendations for an optimal storage stack setup for this workload.
>>>> It goes something like this:
>>>>
>>>> 1.  Create a linear array of hardware RAID1 mirror sets.
>>>>     Do this all in the controller if it can do it.
>>>>     If not, use Linux RAID (mdadm) to create a '--linear' array of the
>>>>     multiple (7 in your case, apparently) hardware RAID1 mirror sets
>>>>
>>>> 2.  Now let XFS handle the write parallelism.  Format the resulting
>>>>     7 spindle Linux RAID device with, for example:
>>>>
>>>>     mkfs.xfs -d agcount=14 /dev/md0
>>>>
>>>> By using this configuration you eliminate the excessive head seeking
>>>> associated with the partial stripe write problems of RAID6, restoring
>>>> performance efficiency to the array.  Using 14 allocation groups allows
>>>> XFS to write, at minimum, 14 such files in parallel.
>>>
>>> That's not correct. 14 AG means that if the files are laid out
>>> across all AGs then there can be 14 -allocations- in parallel at
>>> once. If Io does not require allocation, then they don't serialise
>>> at all on the AGs.  IOWs, If allocation takes 1ms of work in an AG,
>>> then you could have 1,000 allocations per second per AG. With 14
>>> AGs, that gives allocation capability of up to 14,000/s
>>
>> So are you saying that we have no guarantee, nor high probability, that
>> the small files in this case will be spread out across all AGs, thus
>> making more efficient use of each disk's performance in the concatenated
>> array, vs a striped array?  Or, are you merely pointing out a detail I
>> have incorrect, which I've yet to fully understand?
> 
> Yet to fully understand. It's not limited to small files, either.
> 
> XFS doesn't guarantee that specific allocations are evenly
> distributed across AGs, but it does try to spread the overall
> contents of the filesystem across all AGs. It does have concepts of
> locality of reference, but they change depending on the allocator in
> use.
> 
> Take, for example, inode32 vs inode64 which are the two most common
> allocation strategies and assume we have a 16TB fs with 1TB AGs.
> The inode32 allocator will place all inodes and most directory
> metadata in the first AG, below one TB. There is basically no
> metadata allocation parallelism in this strategy, so metadata
> performance is limited and will often serialise. Metadata tends to
> have good locality of reference - all directories and inodes will
> tend to be close together on disk because they are in the same AG.

I'd forgotten this.  I do recall discussions of all the directories and
inodes being in the first 1TB on an inode32 filesystem.  IIRC, those
were focused on people "running out of space" when they still had many
hundreds of Gigs or a TB free, simply because they ran out of space for
inodes.  Until now I hadn't tied this together with the potential
metadata performance issue, and specifically with a linear concat setup.

> Data, on the other hand, is rotored around AGs 2-16 on a per file
> basis, so there is no locality between inodes and their data, nor of
> data between two adjacent files in the same directory. There is,
> however, data allocation parallelism because files are spread
> across allocation groups...
> 
> Hence for inode32, metadata is closely located, but data is spread
> out widely. Hence metadata operations don't scale at all well on a
> linear concat (e.g. hit only one disk/mirror pair), but data
> allocations are spread effectively and hence parallelise and scale
> quite well. The downside to this is that data lookups involve large
> seeks if you have a stripe, and hence can be quite slow. Data reads
> on a linear concat are not guaranteed to evenly load the disks,
> either, simply because there's no correlation between the location
> of the data and the access patterns.

Got it.

> For inode64, locality of reference clusters around the directory
> structure. The inodes for files in a directory will be allocated in
> the same AG as the directory inode, and the data for each file will
> be allocated in the same AG as the file inodes. When you create a
> new directory, it gets placed in a different AG, and the pattern
> repeats. So for inode64, distributing files across all AGs is caused
> by distributing the directory structure. 

And this is why maildir works very well with a linear concat on an
inode64 filesystem, as each mailbox is in a different directory, thus
spreading all the small mail files and metadata across all AGs.  Which
is why I've been recommending it.  I don't think I've been specifying
inode64 though in my previous recommendations.  I should probably be
doing that.  I guess I assumed everyone running XFS today is running a
64bit kernel/user space--probably not good to simply assume that.

> FWIW, an example is a
> kernel source tree:
> 
> ~/src/kern/xfsdev$ find . -type d -exec sudo xfs_bmap -v {} \; | awk '/ 0: / { print $4 }' |sort -n |uniq -c
>      76 0
>      66 1
>      85 2
>      81 3
>      82 4
>      69 5
>      89 6
>      74 7
>      90 8
>      81 9
>      96 10
>      84 11
>      85 12
>      84 13
>      86 14
>      71 15
> 
> As you can see, there's a relatively even spread of the directories
> across all 16 AGs in that directory structure, and the file data
> will follow this pattern. Because of its better metadata<->data
> locality of reference, inode64 tends to be significantly faster on
> workloads that mix metadata operations with data operations (e.g.
> recursive grep across a kernel source tree) as the seek cost between
> the inode and its data is much less than for inode32....

Right.

> However, if your workload does not spread across directories, then
> IO will tend to be limited to specific silos in the linear concat
> while other disks sit idle. If you have a stripe, then the seeks to
> get to the data are small, and hence much faster than inode32 on
> similar workloads.

And now I understand your previous comment that we don't know enough
about the user's workload to make the linear concat recommendation.  If
he's writing all those hundreds of thousands of small files into the
same directory the performance of a linear concat would be horrible.

> This is all ignoring stripe aligned allocation - that is often lost
> in the noise compared to bigger issues like seeking from AG 0 to AG
> 15 when reading the inode then the data or having a workload only
> use a single AG because it is all confined to a single directory.
> 
> IOWs, the best, most optimal filesystem layout and allocation
> strategy is both workload and hardware dependent, and there's no one
> right answer. The defaults select the best balance for typical usage
> - beyond that benchmarking the workload is the only way to really
> measure whether your tweaks are the right ones or not. IOWs, you
> need to understand the filesystem, your storage hardware and -the
> application IO patterns- to make the right tuning decisions.

Got it.  When I prematurely recommended the linear concat I'd simply
forgotten that AG parallelism is dependent on having many directories,
not just many small files.

>>> And given that not all writes require allocation and allocation is
>>> usually only a small percentage of the total IO time. You can have
>>> many, many more write IOs in flight than you can do allocations in
>>> an AG....
>>
>> Ahh, I think I see your point.  For the maildir case, more of the IO is
>> likely due to things like updating message flags, etc, than actually
>> writing new mail files into the directory. 
> 
> I wasn't really talking about maildir here, just pointing out that
> allocation is generally not the limiting factor in doing large
> amounts of concurrent write IO.

Got it.  In the specific case the OP posted about, hundreds of thousands
of small file writes, allocation could be a limiting factor though, correct?

>> Such operations don't
>> require allocation.  With the workload mentioned by the OP, it's
>> possible that all of the small file writes may indeed require
>> allocation, unlike the maildir workload.  But if this is the case,
>> wouldn't the concatenated array still yield better overall performance
>> than RAID6, or any other striped array?
> 
> <shrug>
> 
> Quite possibly, but I can't say conclusively - I simply don't know
> enough about the workload or the fs configuration.

Don't shrug, Dave. :)  You already answered this question up above.
Well, you provided me some new information, and reminded me of things I
already knew, which allowed me to answer this for myself.

Thanks for spending the time you have in this thread to do some serious
teaching.  You provided some valuable information that isn't in the XFS
User Guide, nor the XFS File System Structure document.  If it is there,
it's not in a format that a mere mortal such as myself can digest.  You
make deeper aspects of XFS understandable, and I really appreciate that.

-- 
Stan


* Re: 30 TB RAID6 + XFS slow write performance
  2011-07-20 12:10           ` Stan Hoeppner
@ 2011-07-20 14:04             ` Michael Monnerie
  2011-07-20 23:01               ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Monnerie @ 2011-07-20 14:04 UTC (permalink / raw)
  To: xfs; +Cc: Stan Hoeppner, John Bokma


On Mittwoch, 20. Juli 2011 Stan Hoeppner wrote:
> I thought this was packing multiple small files into
> a single stripe write, which you just explained XFS does not do.

This is interesting, so I'll jump in here. Does that mean that if I have an XFS
volume with sw=14,su=64k (14*64=896KiB) that when I write 10 small files 
in the same dir with 2KB each, each file would be placed at a 896KiB 
boundary? That way, all stripes of a 1GB partition would be full when 
there are roughly 1170 files (1170*896KiB ~ 1GB). What would happen when 
I create other files - is XFS "full" then, or would it start using sub-
stripes? If sub-stripes, would they start at su (=64KiB) distances, or 
at single block (e.g. 4KiB) distances?
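(A quick sanity check of the arithmetic above — pure numbers, not a statement about XFS behaviour:)

```shell
# su=64KiB and sw=14 give a stripe width of 896KiB; a 1GiB region
# holds roughly 1170 such stripes. All figures in KiB.
su=64                         # stripe unit, KiB
sw=14                         # number of data spindles
stripe=$((su * sw))           # full stripe width, KiB
region=$((1024 * 1024))       # 1 GiB in KiB
echo $stripe                  # 896
echo $((region / stripe))     # 1170
```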

I hope I could explain my thoughts in an understandable way ;-)

-- 
mit freundlichen Grüssen,
Michael Monnerie, Ing. BSc

it-management Internet Services: Protéger
http://proteger.at [gesprochen: Prot-e-schee]
Tel: +43 660 / 415 6531

// Haus zu verkaufen: http://zmi.at/langegg/


* Re: 30 TB RAID6 + XFS slow write performance
  2011-07-20 14:04             ` Michael Monnerie
@ 2011-07-20 23:01               ` Dave Chinner
  2011-07-21  6:19                 ` Michael Monnerie
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2011-07-20 23:01 UTC (permalink / raw)
  To: Michael Monnerie; +Cc: John Bokma, Stan Hoeppner, xfs

On Wed, Jul 20, 2011 at 04:04:31PM +0200, Michael Monnerie wrote:
> On Mittwoch, 20. Juli 2011 Stan Hoeppner wrote:
> > I thought this was packing multiple small files into
> > a single stripe write, which you just explained XFS does not do.
> 
> This is interesting, I jump in here. Does that mean that if I have a XFS 
> volume with sw=14,su=64k (14*64=896KiB) that when I write 10 small files 
> in the same dir with 2KB each, each file would be placed at a 896KiB 
> boundary?

No, they'll get sunit aligned by default, which would be on 64k
boundaries.

> That way, all stripes of a 1GB partition would be full when 
> there are roughly 1170 files (1170*896KiB ~ 1GB). What would happen when 
> I create other files - is XFS "full" then, or would it start using sub-
> stripes? If sub-stripes, would they start at su (=64KiB) distances, or 
> at single block (e.g. 4KiB) distances?

It starts packing files tightly into remaining free space when no
free aligned extents are available for allocation in the AG.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: 30 TB RAID6 + XFS slow write performance
  2011-07-20 23:01               ` Dave Chinner
@ 2011-07-21  6:19                 ` Michael Monnerie
  2011-07-21  6:48                   ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Monnerie @ 2011-07-21  6:19 UTC (permalink / raw)
  To: xfs; +Cc: Stan Hoeppner, John Bokma


On Donnerstag, 21. Juli 2011 Dave Chinner wrote:
> No, they'll get sunit aligned by default, which would be on 64k
> boundaries.

OK, so only when <quote Dave> "swalloc mount option set and the 
allocation is for more than a swidth of space it will align to swidth 
rather than sunit" </quote Dave>.

So even when I specify swalloc, a file of only 4KB will very probably
be sunit aligned on disk.
 
> > That way, all stripes of a 1GB partition would be full when 
> > there are roughly 1170 files (1170*896KiB ~ 1GB). What would happen
> > when  I create other files - is XFS "full" then, or would it start
> > using sub- stripes? If sub-stripes, would they start at su
> > (=64KiB) distances, or at single block (e.g. 4KiB) distances?
> 
> It starts packing files tightly into remaining free space when no
> free aligned extents are available for allocation in the AG.

That means, for the above example, that 16384 x 2KiB files could be
created, each sunit aligned on disk. Then all sunit start blocks are
full, so additional files will be "packed" at sub-sunit granularity - is
that it?

That would mean fragmentation is likely to occur from that moment on, if
there are files that grow. And files >64KiB are immediately fragmented
then. At this point, only 16384 * 2KiB = 32MiB are used, which is
3.125% of the disk. I can't believe my numbers, are they true?
OK, this is a worst case scenario, and as you've said before, any 
filesystem can be considered full at 85% fill grade. But it's incredible 
how quickly you could fuck up a filesystem when using su/sw and writing 
small files.
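The worst-case arithmetic above can be sanity-checked (again, pure arithmetic for the scenario as posed, not a claim about how XFS actually behaves):

```shell
# One 2KiB file per 64KiB sunit boundary in a 1GiB allocation group.
ag=$((1024 * 1024))           # AG size in KiB (1 GiB)
su=64                         # stripe unit, KiB
boundaries=$((ag / su))       # number of sunit-aligned start positions
used=$((boundaries * 2))      # KiB consumed by 2KiB files
echo $boundaries              # 16384
echo $((used / 1024))         # 32 (MiB)
echo $((100 * used / ag))     # 3 (i.e. ~3.125%, integer arithmetic)
```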

-- 
mit freundlichen Grüssen,
Michael Monnerie, Ing. BSc

it-management Internet Services: Protéger
http://proteger.at [gesprochen: Prot-e-schee]
Tel: +43 660 / 415 6531

// Haus zu verkaufen: http://zmi.at/langegg/


* Re: 30 TB RAID6 + XFS slow write performance
  2011-07-21  6:19                 ` Michael Monnerie
@ 2011-07-21  6:48                   ` Dave Chinner
  2011-07-22  6:10                     ` Michael Monnerie
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2011-07-21  6:48 UTC (permalink / raw)
  To: Michael Monnerie; +Cc: John Bokma, Stan Hoeppner, xfs

On Thu, Jul 21, 2011 at 08:19:54AM +0200, Michael Monnerie wrote:
> On Donnerstag, 21. Juli 2011 Dave Chinner wrote:
> > No, they'll get sunit aligned by default, which would be on 64k
> > boundaries.
> 
> OK, so only when <quote Dave> "swalloc mount option set and the 
> allocation is for more than a swidth of space it will align to swidth 
> rather than sunit" </quote Dave>.
> 
> So even when I specify swalloc, a file of only 4KB will very probably
> be sunit aligned on disk.
>  
> > > That way, all stripes of a 1GB partition would be full when 
> > > there are roughly 1170 files (1170*896KiB ~ 1GB). What would happen
> > > when  I create other files - is XFS "full" then, or would it start
> > > using sub- stripes? If sub-stripes, would they start at su
> > > (=64KiB) distances, or at single block (e.g. 4KiB) distances?
> > 
> > It starts packing files tightly into remaining free space when no
> > free aligned extents are available for allocation in the AG.
> 
> That means, for the above example, that 16384 x 2KiB files could be
> created, each sunit aligned on disk. Then all sunit start blocks are
> full, so additional files will be "packed" at sub-sunit granularity - is
> that it?

Effectively.

> That would mean fragmentation is likely to occur from that moment, if 
> there are files that grow.

If you are writing files that grow like this, then you are doing
something wrong. If the app can't do its IO differently, then this
is exactly the reason we have userspace-controlled preallocation
interfaces.

Filesystems cannot prevent user stupidity from screwing something
up....
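A sketch of the preallocation interfaces referred to above (both tools are real — fallocate from util-linux and xfs_io from xfsprogs — but the path and sizes are illustrative only):

```shell
# Reserve 64MiB up front for a file that will grow, so the allocator
# can hand out contiguous space instead of extending it piecemeal.
# -n (--keep-size) reserves blocks without changing the apparent size.
fallocate -n -l 64M /data/growing.log

# XFS-native equivalent: reserve space via xfs_io's resvsp command.
xfs_io -f -c "resvsp 0 64m" /data/growing.log
```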

> And files >64KiB are immediately fragmented 
> then. At this time, there are only 16384 * 2KiB = 32MiB used, which is 
> 3,125% of the disk. I can't believe my numbers, are they true?

No, because most filesystems have a 4k block size. Not to mention
that fragmentation is likely to be limited to the single AG the files
in the directory belong to. i.e. even if we can't allocate a sunit
aligned chunk in an AG, we won't switch to another AG just to do
sunit aligned allocation.

> OK, this is a worst case scenario, and as you've said before, any 
> filesystem can be considered full at 85% fill grade. But it's incredible 
> how quickly you could fuck up a filesystem when using su/sw and writing 
> small files.

Well, don't use a filesystem that is optimised for large filesystem
sizes, large files and high bandwidth for storing lots of small
files, then.  Indeed, the point of not packing the files is so they
-don't fragment as they grow-. XFS is not designed to be optimal
for small filesystems or small files. In most cases it will deal
with them just fine, so in reality your concerns are mostly
unfounded...

BTW, ext3/ext4 do exactly the same thing with spreading files out
over block groups before packing them tightly when there are not
more empty block groups left....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: 30 TB RAID6 + XFS slow write performance
  2011-07-21  6:48                   ` Dave Chinner
@ 2011-07-22  6:10                     ` Michael Monnerie
  2011-07-22 18:05                       ` Stan Hoeppner
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Monnerie @ 2011-07-22  6:10 UTC (permalink / raw)
  To: xfs; +Cc: Stan Hoeppner, John Bokma


On Donnerstag, 21. Juli 2011 Dave Chinner wrote:
> If you are writing files that grow like this, then you are doing
> something wrong. If the app can't do its IO differently, then this
> is exactly the reason we have userspace-controlled preallocation
> interfaces.
> 
> Filesystems cannot prevent user stupidity from screwing something
> up....

This can happen if you copy a syslog server over to a new disk, then let
it start its work again: many files that start small and grow. Luckily,
the logs are rotated monthly at the latest, so it shouldn't be too bad.
 
> > And files >64KiB are immediately fragmented
> > then. At this time, there are only 16384 * 2KiB = 32MiB used, which
> > is 3,125% of the disk. I can't believe my numbers, are they true?
> 
> No, because most filesystems have a 4k block size. 

I just meant pure disk usage. Of 1GB, only 32MB are used, and this worst 
case example hits us badly.

> Not to mention
> that fragmentation is likely to be limited to the single AG the files
> in the directory belong to. i.e. even if we can't allocate a sunit
> aligned chunk in an AG, we won't switch to another AG just to do
> sunit aligned allocation.

This is good to know also, thanks.
 
> > OK, this is a worst case scenario, and as you've said before, any
> > filesystem can be considered full at 85% fill grade. But it's
> > incredible how quickly you could fuck up a filesystem when using
> > su/sw and writing small files.
> 
> Well, don't use a filesystem that is optimised for large filesystem
> sizes, large files and high bandwidth for storing lots of small
> files, then.  Indeed, the point of not packing the files is so they
> -don't fragment as they grow-. XFS is not designed to be optimal
> for small filesystems or small files. In most cases it will deal
> with them just fine, so in reality your concerns are mostly
> unfounded...

Yes, I just wanted to know about the corner cases, and how XFS behaves. 
Actually, we're changing over to using NetApps, and with their WAFL 
anyway I should drop all su/sw usage and just use 4KB blocks.

And even when XFS is optimized for large files, there are often small 
ones. Think of a MySQL server with hundreds of DBs and
innodb_file_per_table set. Even when some DBs are large, there are many 
small files.

But this thread has drifted a bit. XFS does great work, and now I 
understand the background a bit more. Thanks, Dave.

-- 
mit freundlichen Grüssen,
Michael Monnerie, Ing. BSc

it-management Internet Services: Protéger
http://proteger.at [gesprochen: Prot-e-schee]
Tel: +43 660 / 415 6531

// Haus zu verkaufen: http://zmi.at/langegg/


* Re: 30 TB RAID6 + XFS slow write performance
  2011-07-22  6:10                     ` Michael Monnerie
@ 2011-07-22 18:05                       ` Stan Hoeppner
  2011-07-22 23:10                         ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Stan Hoeppner @ 2011-07-22 18:05 UTC (permalink / raw)
  To: Michael Monnerie; +Cc: John Bokma, xfs

On 7/22/2011 1:10 AM, Michael Monnerie wrote:

> Yes, I just wanted to know about the corner cases, and how XFS behaves. 
> Actually, we're changing over to using NetApps, and with their WAFL 
> anyway I should drop all su/sw usage and just use 4KB blocks.

I've never used a NetApp filer myself.  That said, I would
assume that WAFL is only in play for NFS/CIFS transactions since WAFL is
itself a filesystem.

When exposing LUNs from the same filer to FC and iSCSI hosts I would
assume the filer acts just as any other SAN controller would.  In this
case I would think you'd probably still want to align your XFS
filesystem to the underlying RAID stripe from which the LUN was carved.
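For example (hypothetical geometry — the su/sw values must come from however the LUN was actually laid out on the filer, and the device name is a placeholder):

```shell
# Align XFS to a LUN carved from, say, a 14-spindle stripe with a
# 64KiB stripe unit. Values are illustrative only.
mkfs.xfs -d su=64k,sw=14 /dev/mapper/netapp-lun
```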

-- 
Stan


* Re: 30 TB RAID6 + XFS slow write performance
  2011-07-22 18:05                       ` Stan Hoeppner
@ 2011-07-22 23:10                         ` Dave Chinner
  2011-07-24  6:14                           ` Stan Hoeppner
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2011-07-22 23:10 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Michael Monnerie, John Bokma, xfs

On Fri, Jul 22, 2011 at 01:05:14PM -0500, Stan Hoeppner wrote:
> On 7/22/2011 1:10 AM, Michael Monnerie wrote:
> 
> > Yes, I just wanted to know about the corner cases, and how XFS behaves. 
> > Actually, we're changing over to using NetApps, and with their WAFL 
> > anyway I should drop all su/sw usage and just use 4KB blocks.
> 
> I've never used a NetApp filer myself.  However, that said, I would
> assume that WAFL is only in play for NFS/CIFS transactions since WAFL is
> itself a filesystem.

Netapp's website is busted, so here's a cached link:

http://webcache.googleusercontent.com/search?q=cache:9DdO2a16hdIJ:blogs.netapp.com/extensible_netapp/2008/10/what-is-wafl--3.html+netapp+san+wafl&cd=1&hl=en&ct=clnk&source=www.google.com

"The point is that WAFL is the part of the code that provides the
'read or write from-disk' mechanisms to both NFS and CIFS and SAN.
The semantics of a how the blocks are accessed are provided by
higher level code not by WAFL, which means WAFL is not a file
system."

If you can be bothered trawling through that entire series of blog posts
in the Google cache, it's probably a good idea so you can get a
basic understanding of what WAFL actually is.

> When exposing LUNs from the same filer to FC and iSCSI hosts I would
> assume the filer acts just as any other SAN controller would.

It has its own quirks, just like any other FC attached RAID array...

> In this case I would think you'd probably still want to align your
> XFS filesystem to the underlying RAID stripe from which the LUN
> was carved.

Which actually matters very little when WAFL sits between the FS and
the disk, because WAFL uses copy-on-write and stages all its writes
through NVRAM, so you've got no idea what the alignment of any
given address in the filesystem maps to, anyway.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: 30 TB RAID6 + XFS slow write performance
  2011-07-22 23:10                         ` Dave Chinner
@ 2011-07-24  6:14                           ` Stan Hoeppner
  2011-07-24  8:47                             ` Michael Monnerie
  0 siblings, 1 reply; 17+ messages in thread
From: Stan Hoeppner @ 2011-07-24  6:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Michael Monnerie, xfs, John Bokma

On 7/22/2011 6:10 PM, Dave Chinner wrote:
> On Fri, Jul 22, 2011 at 01:05:14PM -0500, Stan Hoeppner wrote:

>> I've never used a NetApp filer myself.  That said, I would
>> assume that WAFL is only in play for NFS/CIFS transactions since WAFL is
>> itself a filesystem.
> 
> Netapp's website is busted, so here's a cached link:
> 
> http://webcache.googleusercontent.com/search?q=cache:9DdO2a16hdIJ:blogs.netapp.com/extensible_netapp/2008/10/what-is-wafl--3.html+netapp+san+wafl&cd=1&hl=en&ct=clnk&source=www.google.com

This is interesting:
http://communities.netapp.com/community/netapp-blogs/dave/blog/2008/12/08/is-wafl-a-filesystem

The author implemented WAFL in two layers.  The bottom layer handles
block stuff including volume management, dedup, snapshots, etc, and the
top layer functions as multiple file systems, amongst other duties.

> If you can be bothered trolling for that entire series of blog posts
> in the google cache, it's probably a good idea so you can get a
> basic understanding of what WAFL actually is.

It's never a bother to learn something new. :)

>> When exposing LUNs from the same filer to FC and iSCSI hosts I would
>> assume the filer acts just as any other SAN controller would.
> 
> It has its own quirks, just like any other FC-attached RAID array...
> 
>> In this case I would think you'd probably still want to align your
>> XFS filesystem to the underlying RAID stripe from which the LUN
>> was carved.
> 
> Which actually matters very little when WAFL sits between the FS and the
> disk, because WAFL uses copy-on-write and stages all its writes
> through NVRAM, so you've got no idea what the alignment of any
> given address in the filesystem maps to, anyway.

Is the NetApp FC/iSCSI attachment performance still competitive for
large file/streaming IO, given that one can't optimize XFS stripe
alignment, and with no indication of where the file fragments are
actually written on the media?  Or does it lag behind something like a
roughly equivalent class Infinite Storage array, or IBM DS?
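[Editor's note: as background for the alignment question above, on a conventional striped array (where alignment *does* matter) the XFS geometry hints are simple arithmetic. A minimal, hypothetical sketch assuming a RAID6 with a 64 KiB per-disk chunk size, where two disks' worth of capacity hold parity; the device name and numbers are illustrative, not from this thread:]

```python
def xfs_alignment(chunk_kib, data_disks):
    """Return (su, sw) hints for mkfs.xfs on a conventional striped array.

    su = stripe unit (the per-disk chunk size), sw = number of
    data-bearing disks; su * sw is one full stripe of data.
    """
    su = f"{chunk_kib}k"
    sw = data_disks
    return su, sw

# Hypothetical RAID6 of 14 disks: 12 carry data, 2 carry parity.
su, sw = xfs_alignment(64, 14 - 2)
print(f"mkfs.xfs -d su={su},sw={sw} /dev/sdX")
# -> mkfs.xfs -d su=64k,sw=12 /dev/sdX
```

Behind a WAFL-backed LUN, per Dave's point above, these hints buy you little because the array remaps blocks anyway.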

-- 
Stan


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: 30 TB RAID6 + XFS slow write performance
  2011-07-24  6:14                           ` Stan Hoeppner
@ 2011-07-24  8:47                             ` Michael Monnerie
  0 siblings, 0 replies; 17+ messages in thread
From: Michael Monnerie @ 2011-07-24  8:47 UTC (permalink / raw)
  To: xfs; +Cc: Stan Hoeppner, John Bokma



On Sonntag, 24. Juli 2011 Stan Hoeppner wrote:
> Is the NetApp FC/iSCSI attachment performance still competitive for
> large file/streaming IO, given that one can't optimize XFS stripe
> alignment, and with no indication of where the file fragments are
> actually written on the media?  Or does it lag behind something like
> a roughly equivalent class Infinite Storage array, or IBM DS?

I can't speak to the performance difference, but I'd like to explain two 
fundamental differences from all other storage arrays:

1) WAFL *never* overwrites an existing block. Whenever there's a write to 
an existing block, that block is instead written to a new location, and 
afterwards the old block is remapped to the new one. This is a key factor 
in keeping performance up when using snapshots and deduplication.

2) WAFL never does small or random writes. All writes are collected in 
NVRAM and then written out as one large sequential write; a full stripe 
is always written.
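[Editor's note: the two properties above can be illustrated with a toy model: logical blocks are never overwritten in place, and buffered writes are flushed as whole stripes. This is purely a sketch of the concept; real WAFL is far more involved, and the stripe size and class names here are invented for illustration:]

```python
STRIPE_BLOCKS = 4  # blocks per full stripe in this toy model

class CowStripeWriter:
    def __init__(self):
        self.block_map = {}   # logical block -> current physical block
        self.nvram = []       # (logical, data) pairs awaiting a full stripe
        self.disk = []        # append-only "media": one entry per block
        self.stripes_written = 0

    def write(self, logical, data):
        self.nvram.append((logical, data))
        if len(self.nvram) == STRIPE_BLOCKS:
            self._flush_stripe()

    def _flush_stripe(self):
        # One large sequential write: the whole stripe lands at once,
        # always at fresh physical locations (copy-on-write).
        for logical, data in self.nvram:
            physical = len(self.disk)           # never reuse a location
            self.disk.append(data)
            self.block_map[logical] = physical  # remap after the write
        self.nvram.clear()
        self.stripes_written += 1

w = CowStripeWriter()
for i in range(4):
    w.write(i, f"v1-{i}")
first_loc = w.block_map[0]
for i in range(4):
    w.write(i, f"v2-{i}")   # "overwrites" land on new physical blocks
print(first_loc, w.block_map[0], w.stripes_written)
# -> 0 4 2
```

Note how rewriting logical block 0 moves it from physical block 0 to physical block 4, and both flushes are full-stripe writes, mirroring the two points above.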

That means that for workloads with lots of small random writes, NetApp 
arrays beat the hell out of the disks compared to other storage systems.
I can't tell for large sequential writes, though; I don't have such a workload.

-- 
with kind regards,
Michael Monnerie, Ing. BSc

it-management Internet Services: Protéger
http://proteger.at [pronounced: Prot-e-schee]
Tel: +43 660 / 415 6531

// House for sale: http://zmi.at/langegg/



^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2011-07-24  8:47 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-18 19:58 30 TB RAID6 + XFS slow write performance John Bokma
2011-07-19  0:00 ` Eric Sandeen
2011-07-19  8:37 ` Emmanuel Florac
2011-07-19 22:37   ` Stan Hoeppner
2011-07-20  0:20     ` Dave Chinner
2011-07-20  5:16       ` Stan Hoeppner
2011-07-20  6:44         ` Dave Chinner
2011-07-20 12:10           ` Stan Hoeppner
2011-07-20 14:04             ` Michael Monnerie
2011-07-20 23:01               ` Dave Chinner
2011-07-21  6:19                 ` Michael Monnerie
2011-07-21  6:48                   ` Dave Chinner
2011-07-22  6:10                     ` Michael Monnerie
2011-07-22 18:05                       ` Stan Hoeppner
2011-07-22 23:10                         ` Dave Chinner
2011-07-24  6:14                           ` Stan Hoeppner
2011-07-24  8:47                             ` Michael Monnerie
