From: Richard Wareing <rwareing@fb.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>,
"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>,
"darrick.wong@oracle.com" <darrick.wong@oracle.com>
Subject: Re: [PATCH v2 0/3] XFS real-time device tweaks
Date: Wed, 6 Sep 2017 06:54:41 +0000 [thread overview]
Message-ID: <9729DF06-8F96-4F93-BF50-133F9BA2770F@fb.com> (raw)
In-Reply-To: <20170906034443.GQ17782@dastard>
On 9/5/17, 8:45 PM, "Dave Chinner" <david@fromorbit.com> wrote:
On Sun, Sep 03, 2017 at 10:02:41PM +0000, Richard Wareing wrote:
>
> > On Sep 3, 2017, at 1:56 AM, Christoph Hellwig
> > <hch@infradead.org> wrote:
> >
> > On Sat, Sep 02, 2017 at 03:41:42PM -0700, Richard Wareing
> > wrote:
> >> - Replaced rtdefault with rtdisable, this yields similar
> >> operational benefits when combined with the existing mkfs time
> >> setting of the inheritance flag on the root directory. Allows
> >> temporary disabling of real-time allocation without having to
> >> walk entire FS to remove flags (which could be time consuming).
> >> I still don't think it's super obvious to an admin the
> >> real-time flag was put there at mkfs time (vs. rtdefault being
> >> in mount flags), but this gets me half of what I'm after.
> >
> > I still don't understand this option. What is the use case of
> > dynamically switching on/off these default to the rt device?
> >
>
> Say you are in a bit of an emergency, and you need IOPs *now*
> (incident recovery), w/ rtdisable you could funnel the IO to the
> SSD
But it /doesn't do that/. It only prevents new files from being
allocated on the rt device. All reads of data on the RT device, and
writes to existing files, still go to the RT device.
> without having to strip the inheritance bits from all the
> directories (which would require two walks....one to remove and
> one to add them all back). I think this is about having some
> options during incidents, and a "kill-switch" should the need
> arise.
And soon after the kill switch is triggered, your tiny data device
will go ENOSPC, because changing that mount option effectively removed
TBs of free space from the filesystem. Then things will really start
going bad.
So maybe you didn't think this through properly - the last thing a
typical user would expect is a filesystem reporting TBs of free
space to go ENOSPC and not being able to recover, regardless of what
mount options are present. And they'll be especially confused when
they start looking at inodes and seeing RT bits set all over the
place...
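To make the accounting concrete, here is a minimal userspace sketch (illustrative only, not part of the patch series) of the free-space figure that df-style tools compute from statfs. With the statfs patch (2/3) applied, this number would also count the RT device's free blocks, which rtdisable would then make unreachable for new allocations:

```python
import os

def free_bytes(path):
    """Free space visible to unprivileged callers (statfs f_bavail)."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

# With RT-aware statfs, this total would include RT-device free blocks
# that rtdisable makes unusable for new files -- hence the surprise ENOSPC.
print(f"free: {free_bytes('/tmp') / 2**30:.1f} GiB")
```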
It's just a recipe for confusion, unexpected behaviour and all I
see here is a support and triage nightmare. Not to mention FB will
move on to something else in a couple of years, and we get stuck
having to maintain it forever more (*cough* filestreams *cough*).
Fair enough. What are your thoughts on rtdefault if I changed it to *not* set the inheritance bits, but to take over this responsibility in their place? My thinking here is that this integrates better with policy-management systems such as Chef/Puppet than inheritance bits do. Inheritance bits don't really lend themselves to machine-level policies: they can be sprinkled about all over the FS, and a walk would be required to enforce a machine-wide policy.
Or instead of a mount option, would a sysfs option be acceptable?
My hope is we don't move on, but collaborate a bit more with the open-source world on these sorts of problems instead of re-inventing the proverbial FS wheel (and re-learning old lessons solved many moons ago by FS developers). I'm trying to do my part now, to show it can be done and should be done.
> The other problem I see is accessibility and usability. By making
> these decisions buried in more generic XFS allocation mechanisms
> or fnctl's, few developers are going to really understand how to
> safely use them (e.g. without blowing up their SSD's WAF or
> endurance).
The whole point of putting them into the XFS allocator as admin
policies is that *applications developers don't need to know they
exist*.
I get you now: *admins* need to know, but application developers not so much.
> Fallocation is a better understood notion, easier to
> use and has wider support amongst existing utilities.
Almost every application I've seen that uses fallocate does
something wrong and/or breaks a longevity or performance
optimisation that filesystems have been making for years.
fallocate is "easy to understand" but *difficult to use optimally*
because it's behaviour is tightly bound to the filesystem allocator
algorithms. i.e. it's easy to defeat hidden filesystem optimisations
with fallocate, but it's difficult to understand a sub-optimal
corner case in the filesystem allocator that fallocate could be used
to avoid.
In reality, we don't want people using fallocate - the filesystem
algorithms should do the right thing so people don't need to modify
their applications. In cases like this, having the filesystem decide
automatically at first allocation what device to use is the right
way to integrate the functionality, not require users to use
fallocate to trigger such a decision and, as a side effect, prevent
the filesystem from making all the other optimisations they still
want it to make.
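A small userspace illustration of the difference (my sketch, run against whatever filesystem backs the temp directory, not XFS-specific): posix_fallocate binds blocks to the file at call time, before any data exists, which is exactly what takes the allocation decision away from writeback. A plain buffered write leaves the allocator free to choose blocks later:

```python
import os, tempfile

BLOCKSIZE = 1 << 20  # 1 MiB

# Preallocated file: blocks are bound to the file at fallocate time,
# before any data is written -- this is what defeats delayed allocation.
fd, prealloc_path = tempfile.mkstemp()
os.posix_fallocate(fd, 0, BLOCKSIZE)
st = os.fstat(fd)
print("fallocate:", st.st_size, "bytes,", st.st_blocks * 512, "allocated")
os.close(fd)

# Buffered write: on a delalloc filesystem the physical blocks are chosen
# at writeback, so the allocator can still place concurrent files contiguously.
fd2, buffered_path = tempfile.mkstemp()
os.write(fd2, b"\0" * BLOCKSIZE)
st2 = os.fstat(fd2)
print("buffered: ", st2.st_size, "bytes (allocation may be deferred to writeback)")
os.close(fd2)
os.unlink(prealloc_path)
os.unlink(buffered_path)
```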
You make a good point here on preventing the FS from making other optimizations. I'm re-working this as you and others have suggested (new version tomorrow).
And xfs_fsr would be the home for code migrating the file to the real-time device once it grows beyond some tunable size.
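Roughly the policy loop I have in mind, as a sketch only: the size threshold and its name are hypothetical, and the actual data movement would be xfs_fsr's job (it can swap a file's extents safely); this only implements the selection side:

```python
import os

RT_SIZE_THRESHOLD = 256 * 1024  # hypothetical tunable: files this large move to RT

def files_to_migrate(root, threshold=RT_SIZE_THRESHOLD):
    """Select regular files large enough to belong on the RT (HDD) device.

    The migration itself would be left to xfs_fsr; this is just the
    size-based selection policy an admin tool might apply.
    """
    victims = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if os.path.isfile(path) and not os.path.islink(path) \
                    and st.st_size >= threshold:
                victims.append(path)
    return victims
```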
> Keep in
> mind, we need our SSDs to last >3 years (w/ only a mere 70-80 TBW;
> 300TBW if we are lucky), so we want to design things such that
> application developers are less likely to step on land mines
> causing pre-mature SSD failure.
Hmmm. I don't think the way you are using fallocate is doing what
you think it is doing.
That is, using fallocate to preallocate all files so you can direct
allocation to a different device means that delayed allocation is
turned off. Hence XFS cannot optimise allocation across multiple
files at writeback time. This means that writeback across multiple
files will be sprayed around disjointed preallocated regions. When
using delayed allocation, the filesystem will allocate the blocks
for all the files sequentially, and so the block layer will merge
them all into one big contiguous IO.
IOWs, fallocate sprays write IO around because it decouples
allocation locality from temporal writeback locality, and this causes
non-contiguous write patterns which are a significant contributing
factor to write amplification in SSDs. In comparison, delayed
allocation results in large sequential IOs that minimise write
amplification in the SSD...
Hence the method you describe that "maximises SSD life" won't help
- if anything it's going to actively harm the SSD life when
compared to just letting the filesystem use delayed allocation and
choose what device to write to at that time....
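The decoupling described above can be seen with a toy model (purely illustrative; this is not XFS's allocator, and the round-robin writeback is an assumption standing in for concurrent dirty files being flushed together). Preallocation fixes each file's blocks at create time, so interleaved writeback hops between disjoint regions; delayed allocation hands out blocks in writeback order, giving one sequential stream:

```python
def writeback_lba_sequence(num_files=3, blocks_per_file=4, prealloc=False):
    """Toy model: device block addresses touched, in write order.

    Writeback interleaves one block from each file per pass, roughly
    modelling several dirty files being flushed concurrently.
    """
    if prealloc:
        # fallocate at create time: file i owns a fixed disjoint region,
        # so interleaved writeback jumps between those regions.
        regions = [list(range(i * blocks_per_file, (i + 1) * blocks_per_file))
                   for i in range(num_files)]
        return [regions[i][p]
                for p in range(blocks_per_file)
                for i in range(num_files)]
    # Delayed allocation: blocks are assigned in writeback order, so the
    # resulting device writes are strictly sequential and merge into one IO.
    return list(range(num_files * blocks_per_file))

def is_contiguous(seq):
    return all(b - a == 1 for a, b in zip(seq, seq[1:]))

print("delalloc:", writeback_lba_sequence())               # sequential
print("prealloc:", writeback_lba_sequence(prealloc=True))  # sprayed
```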
Wrt SSDs you are completely correct on this. Our fallocate calls were intended to pay up front on the write path for more favorable allocations, which pays off during reads on HDDs. For SSDs this clearly makes less sense, and it's an optimization we will need to make in our code for the reasons you point out.
Hacking one-off high level controls into APIs like fallocate does
not work. Allocation policies need to be integrated into the
filesystem allocators for them to be effective and useful to
administrators and applications alike. fallocate is no substitute for
the black magic that filesystems do to optimise allocation and IO
patterns....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
Thanks for the great comments, suggestions & insights. Learning a lot.
Richard
Thread overview: 17+ messages
2017-09-02 22:41 [PATCH v2 0/3] XFS real-time device tweaks Richard Wareing
2017-09-02 22:41 ` [PATCH v2 1/3] fs/xfs: Add rtdisable option Richard Wareing
2017-09-02 22:41 ` [PATCH v2 2/3] fs/xfs: Add real-time device support to statfs Richard Wareing
2017-09-03 8:49 ` Christoph Hellwig
2017-09-02 22:41 ` [PATCH v2 3/3] fs/xfs: Add rtfallocmin mount option Richard Wareing
2017-09-03 8:50 ` Christoph Hellwig
2017-09-03 22:04 ` Richard Wareing
2017-09-03 8:56 ` [PATCH v2 0/3] XFS real-time device tweaks Christoph Hellwig
2017-09-03 22:02 ` Richard Wareing
2017-09-06 3:44 ` Dave Chinner
2017-09-06 6:54 ` Richard Wareing [this message]
2017-09-06 11:19 ` Dave Chinner
2017-09-06 11:43 ` Brian Foster
2017-09-06 12:12 ` Dave Chinner
2017-09-06 12:49 ` Brian Foster
2017-09-06 23:29 ` Dave Chinner
2017-09-07 11:58 ` Brian Foster