Rationale for hardware RAID 10 su, sw values in FAQ

linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Rationale for hardware RAID 10 su, sw values in FAQ
@ 2017-09-26  8:54 Ewen McNeill
  2017-09-27  0:43 ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Ewen McNeill @ 2017-09-26  8:54 UTC (permalink / raw)
  To: linux-xfs

The FAQ:

http://xfs.org/index.php/XFS_FAQ#Q:_How_to_calculate_the_correct_sunit.2Cswidth_values_for_optimal_performance

suggests using:

su = hardware RAID stripe size on single disk

sw = (disks in RAID-10 / 2)

on hardware RAID 10 volumes, but doesn't provide a reason for that "sw" 
value (other than that "(disks in RAID-10 / 2)" are the effective data 
disks).

In the RAID 5 / RAID 6 case, obviously you want (su * sw) to cover the 
user data that you can write across the whole array in a single stripe, 
since that is the "writeable unit" on the array on which 
read/modify/write will need to be done -- so you do not want to have a 
data structure spanning the boundary between two writeable units (as 
that means two blocks will need to be read / modify / written).

In the RAID 10 case it is clearly preferable to avoid spanning across 
the boundary of a _single_ disk's (pair's) stripe size (su * 1), as then 
_two_ pairs of disks in the RAID 10 need to get involved in the write 
(so you potentially have two seek penalties, etc).

But in the RAID 10 case the each physical disk is just paired with one 
other disk, and that pair can be written independently of the rest -- 
since there's no parity information as such, there's normally no need 
for a read / modify / write cycle of any block larger than, eg, a 
physical sector or SSD erase block.

So why is "sw" in the RAID 10 case given as "(disks in RAID-10 / 2)" 
rather than "1"?  Wouldn't

su = hardware RAID stripe size on single disk

sw = 1

make more sense for RAID 10?

In the RAID 10 case, spanning across the whole data disk set seems 
likely to align data structures (more frequently) on the first disk pair 
in the RAID set (especially with larger single-disk stripe sizes), 
potentially making that the "metadata disk pair" -- and thus both 
potentially having more metadata activity on it, and also being more at 
risk if one disk in that pair is lost or that pair is rebuilding.  (The 
same "align to start of disk set" would seem to happen with RAID 5 / 
RAID 6 too, but is unavoidable due to the "large smallest physically 
modifiable block" issue.)

What am I missing that leads to the FAQ suggesting "sw = (disks in 
RAID-10 / 2)"?  Perhaps this additional rationale could be added to that 
FAQ question?  (Or if "sw = 1" actually does make sense on RAID 10, the 
FAQ could be updated to suggest that as an option.)

Thanks,

Ewen

PS: In the specific case that had me pondering this today, it's RAID 10 
over 12 spindles, with a 512KB per-spindle stripe size.  So that's 
either 512KB * 1 = 512KB, or 512KB * 6 = 3072KB depending on the rationale.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Rationale for hardware RAID 10 su, sw values in FAQ
  2017-09-26  8:54 Rationale for hardware RAID 10 su, sw values in FAQ Ewen McNeill
@ 2017-09-27  0:43 ` Dave Chinner
  2017-09-27  1:56   ` Ewen McNeill
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2017-09-27  0:43 UTC (permalink / raw)
  To: Ewen McNeill; +Cc: linux-xfs

On Tue, Sep 26, 2017 at 09:54:28PM +1300, Ewen McNeill wrote:
> The FAQ:
> 
> http://xfs.org/index.php/XFS_FAQ#Q:_How_to_calculate_the_correct_sunit.2Cswidth_values_for_optimal_performance
> 
> suggests using:
> 
> su = hardware RAID stripe size on single disk
> 
> sw = (disks in RAID-10 / 2)
> 
> on hardware RAID 10 volumes, but doesn't provide a reason for that
> "sw" value (other than that "(disks in RAID-10 / 2)" are the
> effective data disks).

Because the RAID-0 portion of the RAID-10 volume is half the number
of disks.

RAID-1 does not affect the performance of the underlying volume -
the behaviour and performance of RAID-1 is identical to a single
drive, so layout can not be optimised to improve performance on
RAID-1 volumes. RAID-0, OTOH, can give dramatically better
performance if we tell the filesystem about it because we can do
things like allocate more carefully to prevent hotspots....

> In the RAID 5 / RAID 6 case, obviously you want (su * sw) to cover
> the user data that you can write across the whole array in a single
> stripe, since that is the "writeable unit" on the array on which
> read/modify/write will need to be done -- so you do not want to have
> a data structure spanning the boundary between two writeable units
> (as that means two blocks will need to be read / modify / written).

Yes, it's more complex than that, especially when striping RAID5/6
luns together (e.g. RAID50), but the concept is correct - su/sw tell
the filesystem what the most efficient write sizes and alignment
are.

> In the RAID 10 case it is clearly preferable to avoid spanning
> across the boundary of a _single_ disk's (pair's) stripe size (su *
> 1), as then _two_ pairs of disks in the RAID 10 need to get involved
> in the write (so you potentially have two seek penalties, etc).
> 
> But in the RAID 10 case the each physical disk is just paired with
> one other disk, and that pair can be written independently of the
> rest -- since there's no parity information as such, there's
> normally no need for a read / modify / write cycle of any block
> larger than, eg, a physical sector or SSD erase block.

Which is exactly the same as RAID 0.

> So why is "sw" in the RAID 10 case given as "(disks in RAID-10 / 2)"
> rather than "1"?  Wouldn't
> 
> su = hardware RAID stripe size on single disk
> 
> sw = 1
> 
> make more sense for RAID 10?

Nope, because then you have no idea about how many disks you have
to spread the data over. e.g. if we have 8 disks and a sw=1, then
how do you optimise allocation to hit every disk just once for
a (su * number of disks) sized write? i.e. the sw config allows
allocation and IO sizes to be optimised to load all disks in the
RAID-0 stripe equally.

> In the RAID 10 case, spanning across the whole data disk set seems
> likely to align data structures (more frequently) on the first disk
> pair in the RAID set (especially with larger single-disk stripe
> sizes), potentially making that the "metadata disk pair" -- and thus
> both potentially having more metadata activity on it, and also being
> more at risk if one disk in that pair is lost or that pair is
> rebuilding.

Nope, the filesystem does not do that. Allocation is complex, and
the filesystem may choose to pack the data (small files) so there is
no alignment, it may align to /any/ stripe unit for allocations of
stripe unit size of larger, or for really large allocations it may
attempt to align and round out to stripe width.

It does all this to prevent hot spots on disk in the RAID - that's
the problem you're trying to describe, I think. There was lots of
analysis and optimisation work done back at SGI around 1999/2000 to
sort out all the hot-spot problems in really large arrays. Out of
that came things like mkfs placing static metadata across all stripe
units instead of just the first in a stripe width, better selection
of initial alignment, etc.

For the vast majority of users, hot spot problems went away and we
really haven't seen such problems in the 15+ years since these
problems were addressed...

> (The same "align to start of disk set" would seem to
> happen with RAID 5 / RAID 6 too, but is unavoidable due to the
> "large smallest physically modifiable block" issue.)

RAID5/6 have different issues, and the way you have to think about
RAID5/6 luns changes depending on how you are aggregating them into
a larger single unit (e.g. optimising for RAID-0 stripes of RAID5/6
luns is highly complex).

> What am I missing that leads to the FAQ suggesting "sw = (disks in
> RAID-10 / 2)"?  Perhaps this additional rationale could be added to
> that FAQ question?  (Or if "sw = 1" actually does make sense on RAID
> 10, the FAQ could be updated to suggest that as an option.)

That RAID-10 is optimised for the dominant RAID-0 layout and IO
characteristics, not the RAID-1 characteristics which have no impact
on performance and hence don't require any optimisations.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Rationale for hardware RAID 10 su, sw values in FAQ
  2017-09-27  0:43 ` Dave Chinner
@ 2017-09-27  1:56   ` Ewen McNeill
  2017-09-27  4:33     ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Ewen McNeill @ 2017-09-27  1:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Hi Dave,

On 27/09/17 13:43, Dave Chinner wrote:
> RAID-1 does not affect the performance of the underlying volume -
> the behaviour and performance of RAID-1 is identical to a single
> drive, so layout can not be optimised to improve performance on
> RAID-1 volumes. RAID-0, OTOH, can give dramatically better
> performance if we tell the filesystem about it because we can do
> things like allocate more carefully to prevent hotspots....
> [...]
> > [why not sw=1]
> Nope, because then you have no idea about how many disks you have
> to spread the data over. e.g. if we have 8 disks and a sw=1, then
> how do you optimise allocation to hit every disk just once for
> a (su * number of disks) sized write? i.e. the sw config allows
> allocation and IO sizes to be optimised to load all disks in the
> RAID-0 stripe equally.

Thanks for the detailed answer.  I'd been assuming that the su/sw values 
were to align with "rewritable chunks" (which clearly is important in 
the RAID 5 / 6 case), and ignoring the benefit in letting the file 
system choose allocation locations in RAID 0 to best maximise 
distribution across the individual RAID 0 elements to avoid hot spots.

> [hot spot work in late 1990s]  Out of
> that came things like mkfs placing static metadata across all stripe
> units instead of just the first in a stripe width, better selection
> of initial alignment, etc.

It's good to hear that XFS work already anticipated the side effect that 
I was concerned about (accidentally aligning everything on "start of 
stripe * width boundary, and thus one disk).

I did end up creating the file system with su=512k,sw=6 (RAID 10 on 12 
disks) anyway, so I'm glad to hear this is supported by earlier 
performance tuning work.

As a suggestion the FAQ section 
(http://xfs.org/index.php/XFS_FAQ#Q:_How_to_calculate_the_correct_sunit.2Cswidth_values_for_optimal_performance) 
could hint at this reasoning with, eg:

-=- cut here -=-
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use

  su = 256k
  sw = 8 (RAID-10 of 16 disks has 8 data disks)

because in RAID-10 the RAID-0 behaviour dominates performance, and this
allows XFS to spread the workload evenly across all pairs of disks.
--= cut here -=-

and/or that FAQ entry could also talk about RAID-0 sw/su tuning values 
which would provide a hint towards the "spread workload over all disks" 
rationale.

Ewen

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Rationale for hardware RAID 10 su, sw values in FAQ
  2017-09-27  1:56   ` Ewen McNeill
@ 2017-09-27  4:33     ` Dave Chinner
  2017-09-27 23:36       ` Ewen McNeill
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2017-09-27  4:33 UTC (permalink / raw)
  To: Ewen McNeill; +Cc: linux-xfs

On Wed, Sep 27, 2017 at 02:56:16PM +1300, Ewen McNeill wrote:
> Hi Dave,
> 
> On 27/09/17 13:43, Dave Chinner wrote:
> >RAID-1 does not affect the performance of the underlying volume -
> >the behaviour and performance of RAID-1 is identical to a single
> >drive, so layout can not be optimised to improve performance on
> >RAID-1 volumes. RAID-0, OTOH, can give dramatically better
> >performance if we tell the filesystem about it because we can do
> >things like allocate more carefully to prevent hotspots....
> >[...]
> >> [why not sw=1]
> >Nope, because then you have no idea about how many disks you have
> >to spread the data over. e.g. if we have 8 disks and a sw=1, then
> >how do you optimise allocation to hit every disk just once for
> >a (su * number of disks) sized write? i.e. the sw config allows
> >allocation and IO sizes to be optimised to load all disks in the
> >RAID-0 stripe equally.
> 
> Thanks for the detailed answer.  I'd been assuming that the su/sw
> values were to align with "rewritable chunks" (which clearly is
> important in the RAID 5 / 6 case), and ignoring the benefit in
> letting the file system choose allocation locations in RAID 0 to
> best maximise distribution across the individual RAID 0 elements to
> avoid hot spots.

RAID 5/6 need the same optimisations as RAID-0. In that case,
optimal RAID performance occurs when writing stripe width aligned
and sized IOs so there is no need for RMW cycles to calculate
parity. For large files being written sequentially, it really
doesn't matter that much if the head and/or tail are not exactly
stripe width aligned as things like the stripe cache in MD or the
BBWC in hardware RAID take care of delaying the data IO long enough
to do full stripe writes....

> >[hot spot work in late 1990s]  Out of
> >that came things like mkfs placing static metadata across all stripe
> >units instead of just the first in a stripe width, better selection
> >of initial alignment, etc.
> 
> It's good to hear that XFS work already anticipated the side effect
> that I was concerned about (accidentally aligning everything on
> "start of stripe * width boundary, and thus one disk).
> 
> I did end up creating the file system with su=512k,sw=6 (RAID 10 on
> 12 disks) anyway, so I'm glad to hear this is supported by earlier
> performance tuning work.
> 
> As a suggestion the FAQ section (http://xfs.org/index.php/XFS_FAQ#Q:_How_to_calculate_the_correct_sunit.2Cswidth_values_for_optimal_performance)
> could hint at this reasoning with, eg:
> 
> -=- cut here -=-
> A RAID stripe size of 256KB with a RAID-10 over 16 disks should use
> 
>  su = 256k
>  sw = 8 (RAID-10 of 16 disks has 8 data disks)
> 
> because in RAID-10 the RAID-0 behaviour dominates performance, and this
> allows XFS to spread the workload evenly across all pairs of disks.
> --= cut here -=-

This assumes the reader understands exaclty how RAID works and how
the different types of RAID affect IO performance. The FAQ simply
tries to convey how to get it right without going into the
complicated reasons for doing it that way. Most users don't care
about the implementation or reasons, they just want to know how to
set it up correctly and quickly :P

> and/or that FAQ entry could also talk about RAID-0 sw/su tuning
> values which would provide a hint towards the "spread workload over
> all disks" rationale.

The FAQ is not the place to explain how the filesystem optimises
allocation for different types of storage. The moment we add that
for RAID-0, we;ve got to do it for RAID 5, then raid 6, then
someone will ask about raid 50, etc and suddenly the typical reader
no longer understands the answer to the question....

If you look at the admin doc that is sitting a git repo:

https://git.kernel.org/pub/scm/fs/xfs/xfs-documentation.git/tree/admin/XFS_Performance_Tuning/filesystem_tunables.asciidoc

You'll see the section about all this reads:

	==== Alignment to storage geometry

	TODO: This is extremely complex and requires an entire chapter to itself.

because nobody (well, me, really) has had the time to write
everything down that is needed to cover this topic sufficiently...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Rationale for hardware RAID 10 su, sw values in FAQ
  2017-09-27  4:33     ` Dave Chinner
@ 2017-09-27 23:36       ` Ewen McNeill
  2017-09-28  1:39         ` Dave Chinner
                           ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Ewen McNeill @ 2017-09-27 23:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Hi Dave,

On 27/09/17 17:33, Dave Chinner wrote:
> On Wed, Sep 27, 2017 at 02:56:16PM +1300, Ewen McNeill wrote:
>> As a suggestion the FAQ section [hint at reason for RAID10 sw=N/2]
> 
> This assumes the reader understands exactly how RAID works and how
> the different types of RAID affect IO performance. The FAQ simply
> tries to convey how to get it right without going into the
> complicated reasons for doing it that way.

Fair enough.

However, as a "sysadmin searching for answers" it is.... difficult to 
judge the up-to-date-ness and accuracy of "use these magic values" 
tuning information without some insight into why and thus whether those 
values (a) are still applicable with modern software/hardware or (b) 
applicable in one's own situation.  Particularly so when one has already 
read N different variations on recommendations, all without 
justification, some with conflicting (or confused) advice.

If the FAQ isn't the right place to even hint at the reasons, then it 
would at least be very helpful if the FAQ said "see this $REFERENCE for 
more information" and said $REFERENCE had some more "behind the scenes" 
information.

> https://git.kernel.org/pub/scm/fs/xfs/xfs-documentation.git/tree/admin/XFS_Performance_Tuning/filesystem_tunables.asciidoc
> 
> You'll see the section about all this reads: [TODO]
> because nobody (well, me, really) has had the time to write
> everything down that is needed to cover this topic sufficiently...

Perhaps the perfect is the enemy of the good here.  Would it help if I 
were to write up some text covering:

- key storage alignment tuning aims (avoid work amplification, eg RMW; 
avoid hot spots; increase parallelism)

- summary of physical storage technology concerns (4K sectors, seek 
latency, SSD erase blocks)

- summary of considerations for RAID levels (1 - none; 0 - hot spots; 
5/6 RMW, hot spots; 10 - hot spots; stripe size/width)

- summary of storage virtualisation considerations on alignment (eg, LVM)

- high level advice on determining su/sw given the above, and short 
rationale for each formula

- some observations on unit confusion (sunit/su swidth/sw, values 
supplied in 512 byte sectors / kB versus reported in usually 4kB blocks)

I'm obviously not an expert in filesystem layout optimisation.  But I 
have many years sysadmin experience dealing with lots of types of 
storage, and can write a couple of pages of text on the "background" 
bits next time I have a free moment.  The more advanced tunables could 
remain "todo" until there's time for the perfect version...

(No promises when, but "October" seems plausible if this would be useful.)

Ewen

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Rationale for hardware RAID 10 su, sw values in FAQ
  2017-09-27 23:36       ` Ewen McNeill
@ 2017-09-28  1:39         ` Dave Chinner
  2017-09-28 10:53         ` Emmanuel Florac
  2017-12-19  2:19         ` Ewen McNeill
  2 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2017-09-28  1:39 UTC (permalink / raw)
  To: Ewen McNeill; +Cc: linux-xfs

On Thu, Sep 28, 2017 at 12:36:39PM +1300, Ewen McNeill wrote:
> Hi Dave,
> 
> On 27/09/17 17:33, Dave Chinner wrote:
> >On Wed, Sep 27, 2017 at 02:56:16PM +1300, Ewen McNeill wrote:
> >>As a suggestion the FAQ section [hint at reason for RAID10 sw=N/2]
> >
> >This assumes the reader understands exactly how RAID works and how
> >the different types of RAID affect IO performance. The FAQ simply
> >tries to convey how to get it right without going into the
> >complicated reasons for doing it that way.
> 
> Fair enough.
> 
> However, as a "sysadmin searching for answers" it is.... difficult
> to judge the up-to-date-ness and accuracy of "use these magic
> values" tuning information without some insight into why and thus
> whether those values (a) are still applicable with modern
> software/hardware or (b) applicable in one's own situation.
> Particularly so when one has already read N different variations on
> recommendations, all without justification, some with conflicting
> (or confused) advice.

Yeah, searching for information doesn't help these days - engines
like google are bloody terrible at finding relevant, useful answers
to technical questions these days....

> If the FAQ isn't the right place to even hint at the reasons, then
> it would at least be very helpful if the FAQ said "see this
> $REFERENCE for more information" and said $REFERENCE had some more
> "behind the scenes" information.

*nod*

> >https://git.kernel.org/pub/scm/fs/xfs/xfs-documentation.git/tree/admin/XFS_Performance_Tuning/filesystem_tunables.asciidoc
> >
> >You'll see the section about all this reads: [TODO]
> >because nobody (well, me, really) has had the time to write
> >everything down that is needed to cover this topic sufficiently...
> 
> Perhaps the perfect is the enemy of the good here.

It always is :)

> Would it help if
> I were to write up some text covering:
> 
> - key storage alignment tuning aims (avoid work amplification, eg
> RMW; avoid hot spots; increase parallelism)
> 
> - summary of physical storage technology concerns (4K sectors, seek
> latency, SSD erase blocks)
> 
> - summary of considerations for RAID levels (1 - none; 0 - hot
> spots; 5/6 RMW, hot spots; 10 - hot spots; stripe size/width)
> 
> - summary of storage virtualisation considerations on alignment (eg, LVM)
> 
> - high level advice on determining su/sw given the above, and short
> rationale for each formula
> 
> - some observations on unit confusion (sunit/su swidth/sw, values
> supplied in 512 byte sectors / kB versus reported in usually 4kB
> blocks)
> 
> I'm obviously not an expert in filesystem layout optimisation.  But
> I have many years sysadmin experience dealing with lots of types of
> storage, and can write a couple of pages of text on the "background"
> bits next time I have a free moment.  The more advanced tunables
> could remain "todo" until there's time for the perfect version...

That sounds like an excellent idea. The documentation always ends up
more useful to end users and admins when it is written by a user
rather than a developer.

> (No promises when, but "October" seems plausible if this would be useful.)

Sounds good to me.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Rationale for hardware RAID 10 su, sw values in FAQ
  2017-09-27 23:36       ` Ewen McNeill
  2017-09-28  1:39         ` Dave Chinner
@ 2017-09-28 10:53         ` Emmanuel Florac
  2017-09-28 11:07           ` Dave Chinner
  2017-12-19  2:19         ` Ewen McNeill
  2 siblings, 1 reply; 11+ messages in thread
From: Emmanuel Florac @ 2017-09-28 10:53 UTC (permalink / raw)
  To: Ewen McNeill; +Cc: Dave Chinner, linux-xfs

[-- Attachment #1: Type: text/plain, Size: 1033 bytes --]

Le Thu, 28 Sep 2017 12:36:39 +1300
Ewen McNeill <xfs@ewen.mcneill.gen.nz> écrivait:

> I'm obviously not an expert in filesystem layout optimisation.  But I 
> have many years sysadmin experience dealing with lots of types of 
> storage, and can write a couple of pages of text on the "background" 
> bits next time I have a free moment.  The more advanced tunables
> could remain "todo" until there's time for the perfect version...
> 

As you're at it, please mention the potential problems with 512e (512
bytes blocks emulated) disks, which are the vast majority of shipped
disks nowadays. Doing RAID with these can lead to all sort of trouble
by worsening the read/modify/write problem.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

[-- Attachment #2: Signature digitale OpenPGP --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Rationale for hardware RAID 10 su, sw values in FAQ
  2017-09-28 10:53         ` Emmanuel Florac
@ 2017-09-28 11:07           ` Dave Chinner
  0 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2017-09-28 11:07 UTC (permalink / raw)
  To: Emmanuel Florac; +Cc: Ewen McNeill, linux-xfs

On Thu, Sep 28, 2017 at 12:53:10PM +0200, Emmanuel Florac wrote:
> Le Thu, 28 Sep 2017 12:36:39 +1300
> Ewen McNeill <xfs@ewen.mcneill.gen.nz> écrivait:
> 
> > I'm obviously not an expert in filesystem layout optimisation.  But I 
> > have many years sysadmin experience dealing with lots of types of 
> > storage, and can write a couple of pages of text on the "background" 
> > bits next time I have a free moment.  The more advanced tunables
> > could remain "todo" until there's time for the perfect version...
> > 
> 
> As you're at it, please mention the potential problems with 512e (512
> bytes blocks emulated) disks, which are the vast majority of shipped
> disks nowadays. Doing RAID with these can lead to all sort of trouble
> by worsening the read/modify/write problem.

That's not a filesystem problem, though. Answering questions like
"what is the optimal device sector size for RAID on some random
storage hardware" is quite a long way outside the scope the
filesystem configuration guidelines....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Rationale for hardware RAID 10 su, sw values in FAQ
  2017-09-27 23:36       ` Ewen McNeill
  2017-09-28  1:39         ` Dave Chinner
  2017-09-28 10:53         ` Emmanuel Florac
@ 2017-12-19  2:19         ` Ewen McNeill
  2017-12-20 22:29           ` Dave Chinner
  2 siblings, 1 reply; 11+ messages in thread
From: Ewen McNeill @ 2017-12-19  2:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Hi Dave,

On 28/09/17 12:36, Ewen McNeill wrote:
> Perhaps the perfect is the enemy of the good here.  Would it help if I 
> were to write up some text covering:
> [... storage media/RAID considerations for xfs layout ...]
> 
> (No promises when, but "October" seems plausible if this would be useful.)

So... October proved not to be plausible.  But I'm still considering 
writing something, perhaps while it's quieter over late December/early 
January.

AFAICT the relevant "TODO" section of:

https://git.kernel.org/pub/scm/fs/xfs/xfs-documentation.git/tree/admin/XFS_Performance_Tuning/filesystem_tunables.asciidoc

is unchanged.

If it's still useful to write something, I'll see if there's a quiet 
"home" day over the holidays to write a page or so about the 
considerations I'm aware of as a SysAdmin.

Ewen


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Rationale for hardware RAID 10 su, sw values in FAQ
  2017-12-19  2:19         ` Ewen McNeill
@ 2017-12-20 22:29           ` Dave Chinner
  2018-02-06  1:56             ` Storage considerations for XFS layout documentation (was Re: Rationale for hardware RAID 10 su, sw values in FAQ) Ewen McNeill
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2017-12-20 22:29 UTC (permalink / raw)
  To: Ewen McNeill; +Cc: linux-xfs

On Tue, Dec 19, 2017 at 03:19:32PM +1300, Ewen McNeill wrote:
> Hi Dave,
> 
> On 28/09/17 12:36, Ewen McNeill wrote:
> >Perhaps the perfect is the enemy of the good here.  Would it help
> >if I were to write up some text covering:
> >[... storage media/RAID considerations for xfs layout ...]
> >
> >(No promises when, but "October" seems plausible if this would be useful.)
> 
> So... October proved not to be plausible.  But I'm still considering
> writing something, perhaps while it's quieter over late
> December/early January.
> 
> AFAICT the relevant "TODO" section of:
> 
> https://git.kernel.org/pub/scm/fs/xfs/xfs-documentation.git/tree/admin/XFS_Performance_Tuning/filesystem_tunables.asciidoc
> 
> is unchanged.
> 
> If it's still useful to write something, I'll see if there's a quiet
> "home" day over the holidays to write a page or so about the
> considerations I'm aware of as a SysAdmin.

Sure, if you've got time to write something it would be greatly
appreciated.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Storage considerations for XFS layout documentation (was Re: Rationale for hardware RAID 10 su, sw values in FAQ)
  2017-12-20 22:29           ` Dave Chinner
@ 2018-02-06  1:56             ` Ewen McNeill
  0 siblings, 0 replies; 11+ messages in thread
From: Ewen McNeill @ 2018-02-06  1:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Hi Dave,

On 21/12/17 11:29, Dave Chinner wrote:
>> On 28/09/17 12:36, Ewen McNeill wrote:
>>> Perhaps the perfect is the enemy of the good here.  Would it help
>>> if I were to write up some text covering:
>>> [... storage media/RAID considerations for xfs layout ...]
>> [...]
> Sure, if you've got time to write something it would be greatly
> appreciated.

I finally found some time to write up some storage technology background 
that impacts on file system layout, and some recommendations on 
alignment and su/sw values.

It... turned out longer than "a page or so" (my fingers default to 
verbose, not terse!), and I suspect in practice most of it is common to 
*all* filesystems and would be better in another document referenced 
from XFS_Performance_Tuning/filesystem_tunables.asciidoc, rather than 
inlined in the XFS document.

But it is the sort of documentation I was hoping to find, as a sysadmin, 
when I went looking last year -- including enough background to 
understand *why* particular values are recommended -- so I think it's 
all useful somewhere in the kernel documentation.

There's a non-trivial amount of yak shaving required to, eg, try to 
contribute it via a kernel git patch (and I'm unclear what the 
documentation subtree requires for contributions anyway).  So given that 
I'm not even sure it all belongs in 
XFS_Performance_Tuning/filesystem_tunables.asciidoc I thought I'd just 
include what I wrote here (ie, linux-xfs) in the hope that it is useful. 
  If you, eg, want to accept it as is into that document I can work on 
turning it into an actual patch.  Or feel free to cut'n'paste it into 
the existing document or another nearby document in the kernel source.

Below is written as a replacement for:

-=- cut here -=-
==== Alignment to storage geometry

TODO: This is extremely complex and requires an entire chapter to itself.
-=- cut here -=-

portion of the 
https://git.kernel.org/pub/scm/fs/xfs/xfs-documentation.git/tree/admin/XFS_Performance_Tuning/filesystem_tunables.asciidoc 
document.  I believe it's valid asciidoc (seems to format okay here) if 
one puts enough "higher level" headers in as a prefix to make the 
asciidoc tools happy).

Ewen

PS: Please CC me on all comments; I'm not subscribed to linux-xfs.

-=- cut here -=-
==== Alignment to storage geometry

XFS can be used on a wide variety of storage technology (spinning
magnetic disks, SSDs), on single disks or spanned across multiple
disks (with software or hardware RAID).  Potentially there are
multiple layers of abstraction between the physical storage medium
and the file system (XFS), including software layers like LVM, and
potentially https://en.wikipedia.org/wiki/Flash_memory_controller[flash
translation layers] or
https://en.wikipedia.org/wiki/Hierarchical_storage_management[hierachical
storage management].

Each of these technology choices has its own requirements for best
alignment, and/or its own trade offs between latency and performance,
and the combination of multiple layers may introduce additional
alignment or layout constraints.

The goal of file system alignment to the storage geometry is to:

* maximise throughput (eg, through locality or parallelism)

* minimise latency (at least for common activities)

* minimise storage overhead (such as write amplification
due to read-modify-write -- RMW -- cycles).

===== Physical Storage Technology

Modern storage technology divides into two broad categories:

* magnetic storage on spinning media (eg, HDD)

* flash storage (eg, SSD or https://en.wikipedia.org/wiki/NVM_Express[NVMe])

These two storage technology families have distinct features that influence
the optimal file system layout.

*Magnetic Storage*: accessing magnetic storage requires moving a
physical read/write head across the magnetic media, which takes a
non-trivial amount of time (ms).  The seek time required to move
the head to the correct location is approximately linearly proportional
to the distance the head needs to move, which means two locations
near each other are faster to access than two locations far away.
Performance can be improved by locating data regularly accessed
together "near" each other.  (See also
https://en.wikipedia.org/wiki/Hard_disk_drive_performance_characteristics[Wikipeida
Overview of HDD performance characteristics].)

*4KiB physical sectors HDD*: Most larger modern magnetic HDDs (many
2TiB+, almost all 4TiB+) use 4KiB physical sectors to help minimise
storage overhead (of sector headers/footers and inter-sector gaps),
and thus maximise storage density.  But for backwards compatibility
they continue to present the illusion of 512 byte logical sectors.
Alignment of file system data structures and user data blocks to
the start of (4KiB) _physical_ sectors avoids unnecessarily spanning
a read or write across two physical sectors, and thus avoids write
amplification.

*Flash Storage*: Flash storage has both a page size (smallest unit
that can be written at once), _and_ an erase block size (smallest
unit that can be erased) which is typically much larger (eg, 128KiB).
A key limitation of flash storage is that only one value can be
written to it on an individual bit/byte level.  This means that
updates to physical flash storage usually involve an erase cycle
to "blank the slate" with a single common value, followed by writing
the bits that should have the other value (and writing back
the unmodified data -- a read-modify-write cycle).  To further
complicate matters, most flash storage physical media has a limitation
on how many times a given physical storage cell can be erased, depending
on the technology used (typically in the order of 10000 times).

To compensate for these technological limitations, all flash storage
suitable for use with XFS uses a Flash Translation Layer within the
device, which provides both wear levelling and relocation of
individual pages to different erase blocks as they are updated (to
minimise the amount that needs to be updated with each write, and
reduce the frequency blocks are erased).  These are often implemented
on-device as a type of log structured file system, hidden within the
device.

For a file system like XFS, a key consideration is to avoid spanning
data structures across erase blocks boundaries, as that would mean
that multiple erase blocks would need updating for a single change.
https://en.wikipedia.org/wiki/Write_amplification[Write amplification]
within the SSD may still result in multiple updates to physical
media for a single update, but this can be reduced by advising the
flash storage of blocks that do not need to be preserved (eg, with
the `discard` mount option, or by using `fstrim`) so it stops copying
those blocks around.

===== RAID

RAID provides a way to combine multiple storage devices into one
larger logical storage device, with better performance or more
redundancy (and sometimes both, eg, RAID-10).  There are multiple
RAID array arrangements ("levels") with different performance
considerations.  RAID can be implemented both directly in the Linux
kernel ("software RAID", eg the "MD" subsystem), or within a dedicated
controller card ("hardware RAID").  The filesystem layout considerations
are similar for both, but where the "MD" subsystem is used modern user
space tools can often automatically determine key RAID parameters and
use those to tune the layout of higher layers; for hardware RAID
these key values typically need to be manually determined and provided
to user space tools by hand.

*RAID 0* stripes data across two or more storage devices, with
the aim of increasing performance, but provides no redundancy (in fact
the data is more at risk as failure of any disk probably renders the
data inaccessible).  For XFS storage layout the key consideration is
to maximise parallel access to all the underlying storage devices by
avoiding "hot spots" that are reliant on a single underlying device.

*RAID 1* duplicates data (identically) across two more more
storage devices, with the aim of increasing redundancy.  It may provide
a small read performance boost if data can be read from multiple
disks at once, but provides no write performance boost (data needs
to be written to all disks).  There are no special XFS storage layout
considerations for RAID 1, as every disk has the same data.

*RAID 5* organises data into stripes across three or more storage
devices, where N-1 storage devices contain file system data, and
the remaining storage device contains parity information which
allows recalculation of the contents of any one other storage device
(eg in the event that storage device fails).  To avoid the "parity"
block being a hot spot, its location is rotated amongst all the
member storage devices (unlike RAID 4 which had a parity hot spot).
Writes to RAID-5 require reading multiple elements of the RAID 5
parity block set (to be able to recalculate the parity values), and
writing at least the modified data block and parity block.  The
performance of RAID 5 is improved by having a high hit rate on
caching (thus avoiding the read part of the read-modify-write cycle),
but there is still an inevitable write overhead.

For XFS storage layout on RAID 5 the key considerations are the
read-modify-write cycle to update the parity blocks (and avoiding
needing to unnecessarily modify multiple parity blocks), as well as
increasing parallelism by avoiding hot spots on a single underlying
storage device.  For this XFS needs to know both the stripe size on
an underlying disk, and how many of those stripes can be stored before
it cycles back to the same underlying disk (N-1).

*RAID 6* is an extension of the RAID 5 idea, which uses two parity
blocks per set, so N-2 storage devices contain file system data and
the remaining two storage device contain parity information.  This
increases the overhead of writes, for the benefit of being able to
recover information if more than one storage device fails at the
same time (including, eg, during the recovery from the first storage
device failing -- a not unknown even with larger storage devices and
thus longer RAID parity rebuild recovery times).

For XFS storage layout on RAID 6, the considerations are the same
as RAID 5, but only N-2 disks contain user data.

*RAID 10* is a conceptual combination of RAID 1 and RAID 0, across
at least four underlying storage devices.  It provides both storage
redundancy (like RAID 1) and interleaving for performance (like
RAID 0).  The write performance (particularly for smaller writes)
is usually better than RAID 5/6, at the cost of less usable storage
space.  For XFS storage layout the RAID-0 performance considerations
apply -- spread the work across the underlying storage devices to
maximise parallelism.

A further layout consideration with RAID is that RAID arrays typically
need to store some metadata with each RAID array that helps it
locate the underlying storage devices.  This metadata may be stored
at the start or end of the RAID member devices.  If it is stored
at the start of the member devices, then this may introduce alignment
considerations.  For instance the Linux "MD" subsystem has multiple
metadata formats, and formats 0.9/1.0 store the metadata at the end
of the RAID member devices and formats 1.1/1.2 store the metadata
at the beginning of the RAID member devices.  Modern user space
tools will typically try to ensure user data starts on a 1MiB
boundary ("Data Offset").

Hardware RAID controllers may use either of these techniques too, and
may require manual determination of the relevant offsets from documentation
or vendor tools.

===== Disk partitioning

Disk partitioning impacts on file system alignment to the underlying
storage blocks in two ways:

* the starting sectors of each partition need to be aligned to the
underlying storage blocks for best performance.  With modern Linux
user space tools this will typically happen automatically, but older
Linux and other tools often would attempt to align to historically
relevant boundaries (eg, 63-sector tracks) that are not only
irrelevant to modern storage technology but due to the odd number
(63) result in misalignment to the underlying storage blocks (eg,
4KiB sector HDD, 128KiB erase block SSD, or RAID array stripes).

* the partitioning system may require storing metadata about the
partition locations between partitions (eg, `MBR` logical partitions),
which may throw off the alignment of the start of the partition
from the optimal location.  Use of `GPT` partitioning is recommended
for modern systems to avoid this, or if `MBR` partitioning is used
either use only the 4 primary partitions or take extra care when
adding logical partitions.

Modern Linux user space tools will typically attempt to align on
1MiB boundaries to maximise the chance of achieving a good alignment;
beware if using older tools, or storage media partitioned with older
tools.

===== Storage Virtualisation and Encryption

Storage virtualisation such as the Linux kernel LVM (Logical Volume
Manager) introduce another layer of abstraction between the storage
device and the file system.  These layers may also need to store
their own metadata, which may affect alignment with the underlying
storage sectors or erase blocks.

LVM needs to store metadata the physical volumes (PV) -- typically
192KiB at the start of the physical volume (check the "1st PE" value
with `pvs -o name,pe_start`).  This holds both physical volume
information as well as volume group (VG) and logical volume (LV)
information.  The size of this metadata can be adjusted at `pvcreate`
time to help improve alignment of the user data with the underlying
storage.

Encrypted volumes (such as LUKS) also need to store their own
metadata at the start of the volume.  The size of this metadata
depends on the key size used for encryption.  Typical sizes are
1MiB (256-bit key) or 2MiB (512-bit key), stored at the start of
the underlying volume.  These headers may also cause alignment
issues with the underlying storage, although probably only in the
case of wider RAID 5/6/10 sets.  The `--align-payload` argument
to `cryptsetup` may be used to influence the data alignment of
the user data in the encrypted volume (it takes a value in 512
byte logical sectors), or a detached header (`--header DEVICE`)
may be used to store the header somewhere other than the start
of the underlying device.

===== Determining `su`/`sw` values

Assuming every layer in your storage stack is properly aligned with
the underlying layers, the remaining step is to give +mkfs.xfs+
appropriate values to guide the XFS layout across the underlying
storage to minimise latency and hot spots and maximise performance.
In some simple cases (eg, modern Linux software RAID) +mkfs.xfs+ can
automatically determine these values; in other cases they may need
to be manually calculated and supplied.

The key values to control layout are:

* *`su`*: stripe unit size, in _bytes_ (use `m` or `g` suffixes for
MiB or GiB) that is updatable on a single underlying device (eg,
RAID set member)

* *`sw`*: stripe width, in member elements storing user data before
you wrap around to the first storage device again (ie, excluding
parity disks, spares, etc); this is used to distribute data/metadata
(and thus work) between multiple members of the underlying storage
to reduce hot spots and increase parallelism.

When multiple layers of storage technology are involved, you want to
ensure that each higher layer has a block size that is the same as
the underlying layer, or an even multiple of the underlying layer, and
then give that largest multiple to +mkfs.xfs+.

Formulas for calculating appropriate values for various storage technology:

* *HDD*: alignment to physical sector size (512 bytes or 4KiB).
This will happen automatically due to XFS defaulting to 4KiB block
sizes.

* *Flash Storage*: alignment to erase blocks (eg, 128 KiB).  If you have
a single flash storage device, specify `su=ERASE_BLOCK_SIZE` and `sw=1`.

* *RAID 0*: Set `su=RAID_CHUNK_SIZE` and `sw=NUMBER_OF_ACTIVE_DISKS`, to
spread the work as evenly as possible across all member disks.

* *RAID 1*: No special values required; use the values required from
the underlying storage.

* *RAID 5*: Set `su=RAID_CHUNK_SIZE` and `sw=(NUMBER_OF_ACTIVE_DISKS-1)`,
as one disk is used for parity so the wrap around to the first disk
happens one disk earlier than the full RAID set width.

* *RAID 6*: Set `su=RAID_CHUNK_SIZE` and `sw=(NUMBER_OF_ACTIVE_DISKS-2)`,
as one disk is used for parity so the wrap around to the first disk
happens one disk earlier than the full RAID set width.

* *RAID-10*: The RAID 0 portion of RAID-10 dominates alignment
considerations. The RAID 1 redundancy reduces the effective number
of active disks, eg 2-way mirroring halves the effective number of
active disks, and 3-way mirroring reduces it to one third.  Calculate
the number of effective active disks, and then use the RAID 0 values.
Eg, for 2-way RAID 10 mirroring, use `su=RAID_CHUNK_SIZE` and
`sw=(NUMBER_OF_MEMBER_DISKS / 2)`.

* *RAID-50/RAID-60*: These are logical combinations of RAID 5 and
RAID 0, or RAID 6 and RAID 0 respectively.  Both the RAID 5/6 and
the RAID 0 performance characteristics matter.  Calculate the number
of disks holding parity (2+ for RAID 50; 4+ for RAID 60) and subtract
that from the number of disks in the RAID set to get the number of
data disks.  Then use su=RAID_CHUNK_SIZE` and
`sw=NUMBER_OF_DATA_DISKS`.

For the purpose of calculating these values in a RAID set only the
active storage devices in the RAID set should be included; spares,
even dedicated spares, are outside the layout considerations.

===== A note on `sunit`/`swidth` versus `su`/`sw`

Alignment values historically were specified in `sunit`/`swidth`
values, which provided numbers in _512-byte sectors_, where `swidth`
was some multiple of `sunit`.  These units were historically useful
when all storage technology used 512-byte logical and physical
sectors, and often reported by underlying layers in physical sectors.
However they are increasingly difficult to work with for modern
storage technology with its variety of physical sector and block sizes.

The `su`/`sw` values, introduced later, provide a value in *bytes* (`su`)
and a number of occurrences (`sw`), which are easier to work with when
calculating values for a variety of physical sector and block sizes.

Logically:

* `sunit = su / 512`

* `swidth = sunit * sw`

With the result that `swidth = (su / 512) * sw`.

Use of `sunit` / `swidth` is discouraged, and use of `su` / `sw` is
encouraged to avoid confusion.

*WARNING*: beware that while the `sunit`/`swidth` values are
*specified* to +mkfs.xfs+ in _512-byte sectors_, they are *reported*
by +mkfs.xfs+ (and `xfs_info`) in file system _blocks_ (typically
4KiB, shown in the `bsize` value).  This can be very confusing, and
is another reason to prefer to specify values with `su` / `sw` and
ignore the `sunit` / `swidth` options to +mkfs.xfs+.
-=- cut here -=-

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-02-06  2:04 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-09-26  8:54 Rationale for hardware RAID 10 su, sw values in FAQ Ewen McNeill
2017-09-27  0:43 ` Dave Chinner
2017-09-27  1:56   ` Ewen McNeill
2017-09-27  4:33     ` Dave Chinner
2017-09-27 23:36       ` Ewen McNeill
2017-09-28  1:39         ` Dave Chinner
2017-09-28 10:53         ` Emmanuel Florac
2017-09-28 11:07           ` Dave Chinner
2017-12-19  2:19         ` Ewen McNeill
2017-12-20 22:29           ` Dave Chinner
2018-02-06  1:56             ` Storage considerations for XFS layout documentation (was Re: Rationale for hardware RAID 10 su, sw values in FAQ) Ewen McNeill

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).