From: Boris Burkov <boris@bur.io>
To: Chris Mason <clm@meta.com>
Cc: Christoph Hellwig <hch@infradead.org>,
Linux regressions mailing list <regressions@lists.linux.dev>,
Sergei Trofimovich <slyich@gmail.com>,
Josef Bacik <josef@toxicpanda.com>,
Christopher Price <pricechrispy@gmail.com>,
anand.jain@oracle.com, clm@fb.com, dsterba@suse.com,
linux-btrfs@vger.kernel.org
Subject: Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
Date: Tue, 4 Apr 2023 11:51:51 -0700
Message-ID: <20230404185138.GB344341@zen>
In-Reply-To: <41141706-2685-1b32-8624-c895a3b219ea@meta.com>
On Tue, Apr 04, 2023 at 02:15:38PM -0400, Chris Mason wrote:
> On 4/4/23 12:04 PM, Christoph Hellwig wrote:
> > On Tue, Apr 04, 2023 at 12:49:40PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> > >>>> And that's just NVMe; the still-shipping SATA SSDs are a different
> >>>> story. Not helped by the fact that we don't even support ranged
> >>>> discards for them in Linux.
> >>
> >> Thx for your comments Christoph. Quick question, just to be sure I
> >> understand things properly:
> >>
> >> I assume on some of those problematic devices these discard storms will
> >> lead to a performance regression?
>
> I'm searching through the various threads, but I don't think I've seen
> the discard storm quantified?
>
> Boris sent me this:
> https://lore.kernel.org/linux-btrfs/ZCxP%2Fll7YjPdb9Ou@infradead.org/T/#m65851e5b8b0caa5320d2b7e322805dd200686f01
>
> Which seems to match the 10 discards per second setting? We should be
> doing more of a dribble than a storm, so I'd like to understand if this
> is a separate bug that should be fixed.
>
> >
> > Probably.
> >
> >> I also heard people saying these discard storms might reduce the
> >> lifetime of some devices - is that true?
> >
> > Also very much possible. There are various SSDs that treat a discard
> > as a write zeroes and always return zeroes from all discarded blocks.
> > If the discards are smaller than or not aligned to the internal erase
> > (super)blocks, this will actually cause additional writes.
> >
> >> If the answer to at least one of these is "yes" I'd say it might be
> >> best to revert 63a7cb130718 for now.
> >
> > I don't think enabling it is a very smart idea for most consumer
> > devices.
>
> It seems like a good time to talk through the variations of discard usage
> in fb data centers. We run a pretty wide variety of hardware from
> consumer grade ssds to enterprise ssds, and we've run these on
> ext4/btrfs/xfs.
>
> (Christoph knows most of this already, so I'm only partially replying to
> him here)
>
> First, there was synchronous discard. These were pretty dark times
> because all three of our filesystems would build a batch of synchronous
> discards and then wait for them during filesystem commit. There were
> long tail latencies across all of our workloads, and so workload owners
> would turn off discard and declare victory over terrible latencies.
>
> Of course this predictably ends up with GC on the drives leading to
> terrible latencies because we weren't discarding anymore, and nightly
> trims are the obvious answer. Different workloads would gyrate through
> the variations and the only consistent result was unhappiness.
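The "nightly trims" mentioned here are usually wired up with util-linux's
fstrim rather than a mount option. A minimal sketch of that setup - the unit
name and cron path are common distro defaults, not anything from this thread:

```shell
# Run fstrim from a scheduler instead of mounting with a discard option.
# util-linux ships an fstrim.timer unit for exactly this (weekly by default;
# override OnCalendar in a drop-in for nightly runs):
#   systemctl enable --now fstrim.timer
# Equivalent cron entry for a nightly 03:00 trim of all mounted filesystems:
#   0 3 * * * /usr/sbin/fstrim --all --quiet
```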
>
> Some places in the fleet still do this, and it can be a pretty simple
> tradeoff between the IO impacts of full drive trims vs the latency
> impact of built up GC vs over-provisioning. It works for consistent
> workloads, but honestly there aren't many of those.
>
> Along the way both btrfs and xfs have grown variations of async discard.
> The XFS one (sorry if I'm out of date here), didn't include any kind of
> rate limiting, so if you were bulk deleting a lot of data, XFS would
> effectively queue up so many discards that it actually saturated the
> device for a long time, starving reads and writes. If your workload did
> a constant stream of allocation and deletion, the async discards would
> just saturate the drive forever.
>
> The workloads that care about latencies on XFS ended up going back to
> synchronous discards, and they do a slow-rm hack that nibbles away at
> the ends of files with periodic fsyncs mixed in until the file is zero
> length. They love this and it makes me cry.
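To make that concrete, here is a hypothetical sketch of such a "slow-rm"
nibbler, assuming coreutils truncate/sync; the chunk size and pacing are
invented illustration values, not fb's actual tooling:

```shell
#!/bin/sh
# Sketch: instead of unlinking a large file at once (one burst of discards
# at commit time), nibble it down from the tail with periodic syncs so the
# freed extents trickle out. CHUNK and the sleep are made-up tuning values.
slow_rm() {
    f=$1
    CHUNK=$((64 * 1024))              # bytes dropped per step
    size=$(wc -c < "$f")
    while [ "$size" -gt 0 ]; do
        if [ "$size" -gt "$CHUNK" ]; then
            size=$((size - CHUNK))
        else
            size=0
        fi
        truncate -s "$size" "$f"      # shrink the file from the tail
        sync "$f"                     # flush so the frees reach the fs
        # sleep 0.1                   # pace the frees on a real system
    done
    rm -f "$f"
}
```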
>
> The btrfs async discard feature was meant to address both of these
> cases. The primary features:
>
> - Get rid of the transaction commit latency
> - Enable allocations to steal from discards, reducing discard IO
> - Avoid saturating the devices with discards by metering them out
>
> Christoph mentions that modern enterprise drives are much better at
> discarding, and we see this in production too. But, we still have
> workloads that switched from XFS to Btrfs because the async discard
> feature did a better job of reducing drive write-amp and latencies.
>
> So, honestly from my POV the async discard is best suited to consumer
> devices. Our defaults are probably wrong because no matter what you
> choose there's a drive out there that makes it look bad. Also, laptops
> probably don't want the slow dribble.
>
> I know Boris has some ideas on how to make the defaults better, so I'll
> let him chime in there.
>
> -chris
>
Our reasonable options, as I see them:
- back to nodiscard, rely on periodic trims from the OS.
- leave low iops_limit, drives stay busy unexpectedly long, conclude that
that's OK, and communicate the tuning/measurement options better.
- set a high iops_limit (e.g. 1000), so drives get back to idle faster.
- change an unset iops_limit to mean truly unlimited async discard, set
that as the default, and anyone who cares to meter it can set an
iops_limit.
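For reference, the iops_limit knob in the options above lives in sysfs; a
sketch of inspecting and raising it at runtime, assuming the kernel's
/sys/fs/btrfs/<UUID>/discard/ interface is present (1000 is just the example
value from above, not a recommendation):

```shell
#!/bin/sh
# Print the current async discard rate limit for every mounted btrfs
# filesystem; quietly does nothing if no btrfs discard sysfs dirs exist.
show_discard_tuning() {
    for d in /sys/fs/btrfs/*/discard; do
        [ -d "$d" ] || continue
        printf '%s: iops_limit=%s\n' "$d" "$(cat "$d/iops_limit")"
        # To raise the limit to the 1000 figure above (as root):
        # echo 1000 > "$d/iops_limit"
    done
}
show_discard_tuning
```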
The regression here is in drive idle time due to modest discard getting
metered out over minutes rather than dealt with relatively quickly. So
I would favor the unlimited async discard mode and will send a patch to
that effect which we can discuss.
IMO, the periodic discard cron screwing up your box once a week or once
a day or whatever is a pretty bad user experience as well, as is randomly
hitting bad latencies because you haven't been discarding often enough.
Boris