From: Dave Chinner <david@fromorbit.com>
To: Phil Turmel <philip@turmel.org>
Cc: Wols Lists <antlists@youngman.org.uk>,
linux-xfs@vger.kernel.org, linux-raid@vger.kernel.org
Subject: Re: Growing RAID10 with active XFS filesystem
Date: Thu, 11 Jan 2018 14:07:23 +1100
Message-ID: <20180111030723.GU16421@dastard>
In-Reply-To: <85f60b96-b2ca-be84-a7f6-380b545c3ac8@turmel.org>
On Wed, Jan 10, 2018 at 09:10:55AM -0500, Phil Turmel wrote:
> On 01/09/2018 05:25 PM, Dave Chinner wrote:
>
> > It's nice to know that MD has redefined RAID-10 to be different to
> > the industry standard definition that has been used for 20 years and
> > optimised filesystem layouts for. Rotoring data across odd numbers
> > of disks like this is going to really, really suck on filesystems
> > that are stripe layout aware..
>
> You're a bit late to this party, Dave. MD has implemented raid10 like
> this as far back as I can remember, and it is especially valuable when
> running more than two copies. Running raid10,n3 across four or five
> devices is a nice capacity boost without giving up triple copies (when
> multiples of three aren't available) or giving up the performance of
> mirrored raid.
XFS comes from a different background - high performance, high
reliability and hardware RAID storage. Think hundreds of drives in a
filesystem, not a handful. i.e. The XFS world is largely enterprise
and HPC storage, not small DIY solutions for a home or back-room
office. We live in a different world, and MD rarely enters mine.
> > For example, XFS has hot-spot prevention algorithms in its
> > internal physical layout for striped devices. It aligns AGs across
> > different stripe units so that metadata and data doesn't all get
> > aligned to the one disk in a RAID0/5/6 stripe. If the stripes are
> > rotoring across disks themselves, then we're going to end up back in
> > the same position we started with - multiple AGs aligned to the
> > same disk.
>
> All of MD's default raid5 and raid6 layouts rotate stripes, too, so that
> parity and syndrome are distributed uniformly.
Well, yes, but it appears you haven't thought through what that
typically means. Take a 4+1, chunk size 128k, stripe width 512k:

	A  B  C  D  E
	0  0  0  0  P
	P  1  1  1  1
	2  P  2  2  2
	3  3  P  3  3
	4  4  4  P  4
For every 5 stripe widths, each disk holds one stripe unit of
parity. Hence 80% of data accesses aligned to a specific data offset
hit that disk. i.e. disk A is hit by 0-128k, parity for 512-1024k,
1024-1152k, 1536-1664k and 2048-2176k. IOWs, if we align stuff to
512k, we're going to hit disk A 80% of the time and disk B 20% of
the time.
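The arithmetic above can be sketched in a few lines of Python. This is an illustration of the layout in the table, not md's actual code; the disk names and rotation direction are taken from the example:

```python
# Sketch of the layout above: 4+1 RAID5, 128k chunks, 512k stripe
# width, parity rotating as in the table (stripe 0 -> E, stripe 1 ->
# A, stripe 2 -> B, ...). disk_for_offset() returns which disk holds
# the *data* chunk at a given byte offset.
from collections import Counter

CHUNK = 128 * 1024
DATA_DISKS = 4
DISKS = "ABCDE"
STRIPE = CHUNK * DATA_DISKS          # 512k stripe width

def disk_for_offset(offset):
    """Disk holding the data chunk containing this byte offset."""
    stripe, pos = divmod(offset // CHUNK, DATA_DISKS)
    parity = (DATA_DISKS + stripe) % len(DISKS)   # E, A, B, C, D, ...
    data = [d for d in range(len(DISKS)) if d != parity]
    return DISKS[data[pos]]

# Data accesses aligned to the 512k stripe width, over 5 stripes:
hits = Counter(disk_for_offset(s * STRIPE) for s in range(5))
print(hits)    # Counter({'A': 4, 'B': 1}) -- 80% land on disk A
```

That reproduces the 80/20 split: four of the five stripe-width-aligned offsets land on disk A, the remaining one on disk B.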
So, if mkfs.xfs ends up aligning all AGs to a multiple of 512k, then
all our static AG metadata is aligned to disk A. Further, all the
AGs will align their first stripe unit in a stripe width to Disk A,
too. Hence this results in a major IO hotspot on disk A and a
smaller hotspot on disk B. Disks C, D and E will have the least IO
load on them.
By telling XFS that the stripe unit is 128k and the stripe width is
512k, we can avoid this problem. mkfs.xfs will rotor its AG
alignment by some number of stripe units at a time. i.e. AG 0 aligns
to disk A, AG 1 aligns to disk B, AG 2 aligns to disk C, and so on.
The result is that base alignment used by the filesystem is now
distributed evenly across all disks in the RAID array and so all
disks get loaded evenly. The hot spots go away because the
filesystem has aligned its layout appropriately for the underlying
storage geometry. This applies to any RAID geometry that stripes
data across multiple disks in a regular/predictable pattern.
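The rotoring idea can be sketched as follows. This is not mkfs.xfs's actual code, just an illustration of the principle: bump each AG's stripe-width-aligned start by an extra stripe unit per AG, so the AG headers rotate across the disks instead of all landing on the first one. A simple 4-disk stripe with 128k chunks is assumed.

```python
# Sketch of AG-alignment rotoring (illustrative, not mkfs.xfs code).
SU = 128 * 1024             # stripe unit (chunk size)
SW = 4                      # stripe width, in units (number of data disks)
STRIPE = SU * SW

def ag_start(ag_no, raw_ag_size):
    """Stripe-width align an AG start, then rotor it by stripe units."""
    base = (ag_no * raw_ag_size // STRIPE) * STRIPE   # naive alignment
    return base + (ag_no % SW) * SU                   # rotor across disks

def disk_of(offset):
    """Disk index holding a byte offset on a plain striped (RAID0) layout."""
    return (offset // SU) % SW

# An arbitrary AG size that is not a multiple of the stripe width:
raw = 100 * STRIPE + 7 * SU
disks = [disk_of(ag_start(i, raw)) for i in range(8)]
print(disks)   # AG headers rotate: [0, 1, 2, 3, 0, 1, 2, 3]
```

Without the rotor term, every `ag_start()` would be a stripe-width multiple and every AG header would sit on disk 0; with it, the base alignment is spread evenly across all disks in the array.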
[ I'd cite an internal SGI paper written in 1999 that measured and
analysed all this on RAID0 in real world workloads and industry
standard benchmarks like AIM7 and SpecSFS and led to the mkfs.xfs
changes I described above, but, well, I haven't had access to that
since I left SGI 10 years ago... ]
> > IMO, odd-numbered disks in RAID-10 should be considered harmful and
> > never used....
>
> Users are perfectly able to layer raid1+0 or raid0+1 if they don't want
> the features of raid10. Given the advantages of MD's raid10, a pedant
> could say XFS's lack of support for it should be considered harmful and
> XFS never used. (-:
MD RAID is fine with XFS as long as you use a sane layout and avoid
doing stupid things that require reshaping and changing the geometry
of the underlying device. Reshaping is where the trouble all
starts...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com