From: Boris Burkov <boris@bur.io>
To: Chris Mason <clm@meta.com>
Cc: Christoph Hellwig <hch@infradead.org>,
	Linux regressions mailing list <regressions@lists.linux.dev>,
	Sergei Trofimovich <slyich@gmail.com>,
	Josef Bacik <josef@toxicpanda.com>,
	Christopher Price <pricechrispy@gmail.com>,
	anand.jain@oracle.com, clm@fb.com, dsterba@suse.com,
	linux-btrfs@vger.kernel.org
Subject: Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
Date: Tue, 4 Apr 2023 11:51:51 -0700
Message-ID: <20230404185138.GB344341@zen>
In-Reply-To: <41141706-2685-1b32-8624-c895a3b219ea@meta.com>

On Tue, Apr 04, 2023 at 02:15:38PM -0400, Chris Mason wrote:
> On 4/4/23 12:04 PM, Christoph Hellwig wrote:
> > On Tue, Apr 04, 2023 at 12:49:40PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> >>>> And that's just NVMe; the still-shipping SATA SSDs are a different
> >>>> story.  Not helped by the fact that we don't even support ranged
> >>>> discards for them in Linux.
> >>
> >> Thx for your comments Christoph. Quick question, just to be sure I
> >> understand things properly:
> >>
> >> I assume on some of those problematic devices these discard storms will
> >> lead to a performance regression?
> 
> I'm searching through the various threads, but I don't think I've seen
> the discard storm quantified?
> 
> Boris sent me this:
> https://lore.kernel.org/linux-btrfs/ZCxP%2Fll7YjPdb9Ou@infradead.org/T/#m65851e5b8b0caa5320d2b7e322805dd200686f01
> 
> Which seems to match the 10 discards per second setting?  We should be
> doing more of a dribble than a storm, so I'd like to understand if this
> is a separate bug that should be fixed.
> 
> > 
> > Probably.
> > 
> >> I also heard people saying these discard storms might reduce the life
> >> time of some devices - is that true?
> > 
> > Also very much possible.  There are various SSDs that treat a discard
> > as a write zeroes and always return zeroes from all discarded blocks.
> > If the discards are smaller than or not aligned to the internal erase
> > (super)blocks, this will actually cause additional writes.
> > 
> >> If the answer to at least one of these is "yes" I'd say it might be
> >> best to revert 63a7cb130718 for now.
> > 
> > I don't think enabling it is a very smart idea for most consumer
> > devices.
> 
> It seems like a good time to talk through the variations of discard usage
> in fb data centers.  We run a pretty wide variety of hardware, from
> consumer-grade SSDs to enterprise SSDs, and we've run these on
> ext4/btrfs/xfs.
> 
> (Christoph knows most of this already, so I'm only partially replying to
> him here)
> 
> First, there was synchronous discard.  These were pretty dark times
> because all three of our filesystems would build a batch of synchronous
> discards and then wait for them during filesystem commit.  There were
> long tail latencies across all of our workloads, and so workload owners
> would turn off discard and declare victory over terrible latencies.
> 
> Of course this predictably ends up with GC on the drives leading to
> terrible latencies because we weren't discarding anymore, and nightly
> trims are the obvious answer.  Different workloads would gyrate through
> the variations and the only consistent result was unhappiness.
> 
> Some places in the fleet still do this, and it can be a pretty simple
> tradeoff between the IO impacts of full drive trims vs the latency
> impact of built up GC vs over-provisioning.  It works for consistent
> workloads, but honestly there aren't many of those.
> 
> Along the way both btrfs and xfs have grown variations of async discard.
> The XFS one (sorry if I'm out of date here) didn't include any kind of
> rate limiting, so if you were bulk deleting a lot of data, XFS would
> effectively queue up so many discards that it actually saturated the
> device for a long time, starving reads and writes.  If your workload did
> a constant stream of allocation and deletion, the async discards would
> just saturate the drive forever.
> 
> The workloads that care about latencies on XFS ended up going back to
> synchronous discards, and they do a slow-rm hack that nibbles away at
> the ends of files with periodic fsyncs mixed in until the file is zero
> length.  They love this and it makes me cry.
> 
> The btrfs async discard feature was meant to address both of these
> cases.  The primary features:
> 
> - Get rid of the transaction commit latency
> - Enable allocations to steal from discards, reducing discard IO
> - Avoid saturating the devices with discards by metering them out
> 
> Christoph mentions that modern enterprise drives are much better at
> discarding, and we see this in production too.  But, we still have
> workloads that switched from XFS to Btrfs because the async discard
> feature did a better job of reducing drive write-amp and latencies.
> 
> So, honestly from my POV the async discard is best suited to consumer
> devices.  Our defaults are probably wrong because no matter what you
> choose there's a drive out there that makes it look bad.  Also, laptops
> probably don't want the slow dribble.
> 
> I know Boris has some ideas on how to make the defaults better, so I'll
> let him chime in there.
> 
> -chris
>

Our reasonable options, as I see them:
- go back to nodiscard and rely on periodic trims from the OS.
- keep the low iops_limit default; drives stay busy unexpectedly long,
  we conclude that's OK, and we communicate the tuning/measurement
  options better (see the sysfs sketch below).
- set a high iops_limit (e.g. 1000) so drives get back to idle faster.
- change an unset iops_limit to mean truly unlimited async discard, make
  that the default, and anyone who cares to meter it can set an
  iops_limit.
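To make the tuning/measurement option concrete, here's a rough sketch (not
part of any patch) of reading and adjusting the discard knobs from Python.
The sysfs layout is what I believe recent kernels expose under
/sys/fs/btrfs/<FSID>/discard/, and the FSID below is a placeholder:

#!/usr/bin/env python3
# Rough sketch: inspect/adjust the btrfs async discard knobs via sysfs.
# Assumes /sys/fs/btrfs/<FSID>/discard/ exists on this kernel; the FSID is
# a placeholder, substitute your filesystem UUID.  Writing needs root, and
# the semantics of iops_limit=0 are exactly what's under discussion here.
from pathlib import Path

FSID = "00000000-0000-0000-0000-000000000000"  # placeholder
DISCARD = Path("/sys/fs/btrfs") / FSID / "discard"

def read_knob(name: str) -> int:
    """Read one of the discard sysfs files as an integer."""
    return int((DISCARD / name).read_text())

def set_iops_limit(value: int) -> None:
    """Write a new iops_limit (discards issued per second)."""
    (DISCARD / "iops_limit").write_text(f"{value}\n")

if __name__ == "__main__":
    for knob in ("iops_limit", "kbps_limit",
                 "discardable_extents", "discardable_bytes"):
        print(knob, read_knob(knob))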

The regression here is in drive idle time: a modest amount of discard work
gets metered out over minutes rather than dealt with relatively quickly. So
I would favor the unlimited async discard mode and will send a patch to
that effect, which we can discuss.
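To put numbers on "metered out over minutes", a trivial back-of-the-envelope
sketch; the extent count is a made-up example, not a measurement from the
reports in this thread:

# Back-of-the-envelope: how long a discard backlog takes to drain when each
# freed extent costs one discard and we issue at most iops_limit per second.
# Example numbers are made up, not measurements from this thread.
def drain_seconds(discardable_extents: int, iops_limit: int) -> float:
    if iops_limit <= 0:
        raise ValueError("no iops metering")
    return discardable_extents / iops_limit

# A few thousand freed extents at the default of 10 discards per second:
print(drain_seconds(3000, 10) / 60, "minutes of periodic drive wakeups")  # ~5.0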

IMO, the periodic discard cron screwing up your box once a week or once
a day or whatever is a pretty bad user experience as well, as is randomly
hitting bad latencies because you haven't been discarding often enough.

Boris


Thread overview: 30+ messages
2023-03-20 22:40 [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async Christopher Price
2023-03-21 21:26 ` Josef Bacik
2023-03-22  8:38   ` Christoph Hellwig
2023-03-23 22:26     ` Sergei Trofimovich
2023-04-04 10:49       ` Linux regression tracking (Thorsten Leemhuis)
2023-04-04 16:04         ` Christoph Hellwig
2023-04-04 16:20           ` Roman Mamedov
2023-04-04 16:27             ` Christoph Hellwig
2023-04-04 23:37               ` Damien Le Moal
2023-04-04 18:15           ` Chris Mason
2023-04-04 18:51             ` Boris Burkov [this message]
2023-04-04 19:22               ` David Sterba
2023-04-04 19:39                 ` Boris Burkov
2023-04-05  8:17                   ` Linux regression tracking (Thorsten Leemhuis)
2023-04-10  2:03               ` Michael Bromilow
2023-04-11 17:52                 ` David Sterba
2023-04-11 18:15                   ` Linux regression tracking (Thorsten Leemhuis)
2023-04-04 19:08             ` Sergei Trofimovich
2023-04-05  6:18             ` Christoph Hellwig
2023-04-05 12:01               ` Chris Mason
2023-04-04 18:23         ` Boris Burkov
2023-04-04 19:12           ` Sergei Trofimovich
  -- strict thread matches above, loose matches on Subject: below --
2023-03-01 19:30 Sergei Trofimovich
2023-03-02  8:04 ` Linux regression tracking #adding (Thorsten Leemhuis)
2023-04-04 10:52   ` Linux regression tracking #update (Thorsten Leemhuis)
2023-04-21 13:56   ` Linux regression tracking #update (Thorsten Leemhuis)
2023-03-02  9:12 ` Anand Jain
2023-03-02 10:54   ` Sergei Trofimovich
2023-03-15 11:44     ` Linux regression tracking (Thorsten Leemhuis)
2023-03-15 16:34       ` Sergei Trofimovich
