Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async

From: Sergei Trofimovich <slyich@gmail.com>
To: Anand Jain <anand.jain@oracle.com>
Cc: linux-btrfs@vger.kernel.org, David Sterba <dsterba@suse.com>,
	Boris Burkov <boris@bur.io>, Chris Mason <clm@fb.com>,
	Josef Bacik <josef@toxicpanda.com>
Subject: Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
Date: Thu, 2 Mar 2023 10:54:06 +0000
Message-ID: <20230302105406.2cd367f7@nz>
In-Reply-To: <94cf49d0-fa2d-cc2c-240e-222706d69eb3@oracle.com>

On Thu, 2 Mar 2023 17:12:27 +0800
Anand Jain <anand.jain@oracle.com> wrote:

> On 3/2/23 03:30, Sergei Trofimovich wrote:
> > Hi btrfs maintainers!
> > 
> > Tl;DR:
> > 
> >    After 63a7cb13071842 "btrfs: auto enable discard=async when possible" I
> >    see constant DISCARD storm towards my NVME device be it idle or not.
> > 
> >    No storm: v6.1 and older
> >    Has storm: v6.2 and newer
> > 
> > More words:
> > 
> > After upgrade from 6.1 to 6.2 I noticed that Disk led on my desktop
> > started flashing incessantly regardless of present or absent workload.
> > 
> > I think I confirmed the storm with `perf`: led flashes align with output
> > of:
> > 
> >      # perf ftrace -a -T 'nvme_setup*' | cat
> > 
> >      kworker/6:1H-298     [006]   2569.645201: nvme_setup_cmd <-nvme_queue_rq
> >      kworker/6:1H-298     [006]   2569.645205: nvme_setup_discard <-nvme_setup_cmd
> >      kworker/6:1H-298     [006]   2569.749198: nvme_setup_cmd <-nvme_queue_rq
> >      kworker/6:1H-298     [006]   2569.749202: nvme_setup_discard <-nvme_setup_cmd
> >      kworker/6:1H-298     [006]   2569.853204: nvme_setup_cmd <-nvme_queue_rq
> >      kworker/6:1H-298     [006]   2569.853209: nvme_setup_discard <-nvme_setup_cmd
> >      kworker/6:1H-298     [006]   2569.958198: nvme_setup_cmd <-nvme_queue_rq
> >      kworker/6:1H-298     [006]   2569.958202: nvme_setup_discard <-nvme_setup_cmd
> > 
> > `iotop` shows no read/write IO at all (expected).
> > 
> > I was able to bisect it down to this commit:
> > 
> >    $ git bisect good
> >    63a7cb13071842966c1ce931edacbc23573aada5 is the first bad commit
> >    commit 63a7cb13071842966c1ce931edacbc23573aada5
> >    Author: David Sterba <dsterba@suse.com>
> >    Date:   Tue Jul 26 20:54:10 2022 +0200
> > 
> >      btrfs: auto enable discard=async when possible
> > 
> >      There's a request to automatically enable async discard for capable
> >      devices. We can do that, the async mode is designed to wait for larger
> >      freed extents and is not intrusive, with limits to iops, kbps or latency.
> > 
> >      The status and tunables will be exported in /sys/fs/btrfs/FSID/discard .
> > 
> >      The automatic selection is done if there's at least one discard capable
> >      device in the filesystem (not capable devices are skipped). Mounting
> >      with any other discard option will honor that option, notably mounting
> >      with nodiscard will keep it disabled.
> > 
> >      Link: https://lore.kernel.org/linux-btrfs/CAEg-Je_b1YtdsCR0zS5XZ_SbvJgN70ezwvRwLiCZgDGLbeMB=w@mail.gmail.com/
> >      Reviewed-by: Boris Burkov <boris@bur.io>
> >      Signed-off-by: David Sterba <dsterba@suse.com>
> > 
> >     fs/btrfs/ctree.h   |  1 +
> >     fs/btrfs/disk-io.c | 14 ++++++++++++++
> >     fs/btrfs/super.c   |  2 ++
> >     fs/btrfs/volumes.c |  3 +++
> >     fs/btrfs/volumes.h |  2 ++
> >     5 files changed, 22 insertions(+)
> > 
> > Is this storm a known issue? I did not dig too much into the patch. But
> > glancing at it this bit looks slightly off:
> > 
> >      +       if (bdev_max_discard_sectors(bdev))
> >      +               fs_devices->discardable = true;
> > 
> > Is it expected that there is no `= false` assignment?
> > 
> > This is the list of `btrfs` filesystems I have:
> > 
> >    $ cat /proc/mounts | fgrep btrfs
> >    /dev/nvme0n1p3 / btrfs rw,noatime,compress=zstd:3,ssd,space_cache,subvolid=848,subvol=/nixos 0 0
> >    /dev/sda3 /mnt/archive btrfs rw,noatime,compress=zstd:3,space_cache,subvolid=5,subvol=/ 0 0
> >    # skipped bind mounts
> >   
> 
> 
> 
> > The device is:
> > 
> >    $ lspci | fgrep -i Solid
> >    01:00.0 Non-Volatile memory controller: ADATA Technology Co., Ltd. XPG SX8200 Pro PCIe Gen3x4 M.2 2280 Solid State Drive (rev 03)  
> 
> 
>   It is a SSD device with NVME interface, that needs regular discard.
>   Why not try tune io intensity using
> 
>   /sys/fs/btrfs/<uuid>/discard
> 
>   options?
> 
>   Maybe not all discardable sectors are not issued at once. It is a good
>   idea to try with a fresh mkfs (which runs discard at mkfs) to see if
>   discard is being issued even if there are no fs activities.

Ah, thank you Anand! I poked a bit more in `perf ftrace` and I think I
see a "slow" pass through the discard backlog:

    /sys/fs/btrfs/<UUID>/discard$  cat iops_limit
    10

Twice a minute I get a short burst of file creates/deletes that produces
a bit of free space in many block groups. That enqueues hundreds of work
items.

    $ sudo perf ftrace -a -T 'btrfs_discard_workfn' -T 'btrfs_issue_discard' -T 'btrfs_discard_queue_work'
     btrfs-transacti-407     [011]  42800.424027: btrfs_discard_queue_work <-__btrfs_add_free_space
     btrfs-transacti-407     [011]  42800.424070: btrfs_discard_queue_work <-__btrfs_add_free_space
     ...
     btrfs-transacti-407     [011]  42800.425053: btrfs_discard_queue_work <-__btrfs_add_free_space
     btrfs-transacti-407     [011]  42800.425055: btrfs_discard_queue_work <-__btrfs_add_free_space

193 entries of btrfs_discard_queue_work.
It took 1ms to enqueue all of the work into the workqueue.
    
     kworker/u64:1-2379115 [000]  42800.487010: btrfs_discard_workfn <-process_one_work
     kworker/u64:1-2379115 [000]  42800.487028: btrfs_issue_discard <-btrfs_discard_extent
     kworker/u64:1-2379115 [005]  42800.594010: btrfs_discard_workfn <-process_one_work
     kworker/u64:1-2379115 [005]  42800.594031: btrfs_issue_discard <-btrfs_discard_extent
     ...
     kworker/u64:15-2396822 [007]  42830.441487: btrfs_discard_workfn <-process_one_work
     kworker/u64:15-2396822 [007]  42830.441502: btrfs_issue_discard <-btrfs_discard_extent
     kworker/u64:15-2396822 [000]  42830.546497: btrfs_discard_workfn <-process_one_work
     kworker/u64:15-2396822 [000]  42830.546524: btrfs_issue_discard <-btrfs_discard_extent

286 pairs of btrfs_discard_workfn / btrfs_issue_discard.
Each pair takes about 100ms to process (the workfn timestamps above are
roughly 105ms apart), which seems to match iops_limit=10.
That means I can get about 300 discards per 30-second commit interval max.
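
As a rough cross-check of the pacing, here is a small sketch that redoes the
arithmetic from the trace timestamps shown above. It assumes iops_limit is
interpreted as "discards per second" (the thread only says the numbers seem
to match) and that the 286 pairs are spread evenly between the first and
last workfn entries shown; the variable names are made up for illustration.

```python
# Timestamps (seconds) of btrfs_discard_workfn entries copied from the trace.
first_ts, second_ts = 42800.487010, 42800.594010   # first two entries
last_ts = 42830.546497                              # last entry shown
pairs = 286                                         # workfn/issue_discard pairs counted

# Gap between the first two discards, in milliseconds.
first_gap_ms = (second_ts - first_ts) * 1000

# Average gap over the whole pass, assuming even spacing (an assumption).
avg_gap_ms = (last_ts - first_ts) / (pairs - 1) * 1000

# What iops_limit=10 would allow, assuming it means discards per second.
iops_limit = 10
commit_interval_s = 30
expected_gap_ms = 1000 / iops_limit
max_discards_per_commit = iops_limit * commit_interval_s

print(f"first gap:   {first_gap_ms:.1f} ms")
print(f"average gap: {avg_gap_ms:.1f} ms (expected {expected_gap_ms:.0f} ms)")
print(f"budget per {commit_interval_s}s commit interval: {max_discards_per_commit} discards")
```

Under those assumptions the budget per commit interval is close to the 286
discards observed, so the queue is drained only barely before the next
transaction refills it.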

     btrfs-transacti-407     [002]  42830.634216: btrfs_discard_queue_work <-__btrfs_add_free_space
     btrfs-transacti-407     [002]  42830.634228: btrfs_discard_queue_work <-__btrfs_add_free_space
     ...

The next transaction started 30 seconds later, which is the default
commit interval.

My file system is 512GB in size. My guess is that I get about one discard
entry per block group on each 

Does my system keep up with the scheduled discard backlog? Can I peek at
the workqueue size?
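
One way to look for backlog-related counters, as a sketch only: the commit
message quoted above says status and tunables are exported under
/sys/fs/btrfs/FSID/discard, so dumping every file in that directory shows
what is available. Whether any of those files reports the queue depth is not
confirmed anywhere in this thread, and read_discard_dir is a hypothetical
helper name.

```python
from pathlib import Path


def read_discard_dir(discard_dir):
    """Return {file name: stripped contents} for each readable file in discard_dir."""
    values = {}
    for entry in sorted(Path(discard_dir).iterdir()):
        if not entry.is_file():
            continue  # skip subdirectories, if any
        try:
            values[entry.name] = entry.read_text().strip()
        except OSError:
            continue  # unreadable entries are skipped (e.g. missing permissions)
    return values
```

Calling read_discard_dir("/sys/fs/btrfs/<UUID>/discard") with the real
filesystem UUID substituted, once per second, and comparing the results
would show which values move while the discards are being issued.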

Is iops_limit=10 a reasonable default for discard=async? It feels like
for larger file systems it will not be enough even for this idle state.

-- 

  Sergei
