From: Mike Snitzer <snitzer@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, Brian Foster <bfoster@redhat.com>,
	linux-nfs@vger.kernel.org
Subject: Re: [PATCH v2] xfs: enable WQ_MEM_RECLAIM on m_sync_workqueue
Date: Mon, 1 Jul 2024 10:13:16 -0400	[thread overview]
Message-ID: <ZoK5fEz6HSU5iUSH@kernel.org> (raw)
In-Reply-To: <ZoHuXHMEuMrem73H@dread.disaster.area>

On Mon, Jul 01, 2024 at 09:46:36AM +1000, Dave Chinner wrote:
> On Sun, Jun 30, 2024 at 12:35:17PM -0400, Mike Snitzer wrote:
> > The need for this fix was exposed while developing a new NFS feature
> > called "localio" which bypasses the network, if both the client and
> > server are on the same host, see:
> > https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=nfs-localio-for-6.11
> > 
> > Because NFS's nfsiod_workqueue enables WQ_MEM_RECLAIM, writeback will
> > call into NFS and if localio is enabled the NFS client will call
> > directly into xfs_file_write_iter, this causes the following
> > backtrace when running xfstest generic/476 against NFS with localio:
> 
> Oh, that's nasty.
> 
> We now have to change every path in every filesystem that NFS can
> call that might defer work to a workqueue.
> 
> IOWs, this makes WQ_MEM_RECLAIM pretty much mandatory for front end
> workqueues in the filesystem and block layers regardless of whether
> the filesystem or block layer path runs under memory reclaim context
> or not.

As you noticed later (though didn't circle back to edit here), this
only triggers when free space is low.

But yeah, I submitted the patch knowing it was likely not viable; I
just wasn't clear on all the whys.  I appreciate your feedback, and
yes, it did feel like a slippery slope (understatement) that would
force other filesystems to follow.
 
> All WQ_MEM_RECLAIM does is create a rescuer thread at workqueue
> creation that is used as a "worker of last resort" when forking new
> worker threads fail due to ENOMEM. This prevents deadlocks when
> doing GFP_KERNEL allocations in workqueue context and potentially
> deadlocking because a GFP_KERNEL allocation is blocking waiting for
> this workqueue to allocate workers to make progress.

Right.
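
(For reference, the v2 patch amounted to roughly the following in
xfs_init_mount_workqueues(); the exact flags here are paraphrased
from memory, not quoted from the diff:

	mp->m_sync_workqueue = alloc_workqueue("xfs-sync/%s",
			XFS_WQFLAGS(WQ_MEM_RECLAIM | WQ_FREEZABLE), 0,
			mp->m_super->s_id);

i.e. it just tacked WQ_MEM_RECLAIM onto the existing flags so that
check_flush_dependency() would be satisfied.)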

> >   workqueue: WQ_MEM_RECLAIM writeback:wb_workfn is flushing !WQ_MEM_RECLAIM xfs-sync/vdc:xfs_flush_inodes_worker
> >   WARNING: CPU: 6 PID: 8525 at kernel/workqueue.c:3706 check_flush_dependency+0x2a4/0x328
> >   Modules linked in:
> >   CPU: 6 PID: 8525 Comm: kworker/u71:5 Not tainted 6.10.0-rc3-ktest-00032-g2b0a133403ab #18502
> >   Hardware name: linux,dummy-virt (DT)
> >   Workqueue: writeback wb_workfn (flush-0:33)
> >   pstate: 400010c5 (nZcv daIF -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
> >   pc : check_flush_dependency+0x2a4/0x328
> >   lr : check_flush_dependency+0x2a4/0x328
> >   sp : ffff0000c5f06bb0
> >   x29: ffff0000c5f06bb0 x28: ffff0000c998a908 x27: 1fffe00019331521
> >   x26: ffff0000d0620900 x25: ffff0000c5f06ca0 x24: ffff8000828848c0
> >   x23: 1fffe00018be0d8e x22: ffff0000c1210000 x21: ffff0000c75fde00
> >   x20: ffff800080bfd258 x19: ffff0000cad63400 x18: ffff0000cd3a4810
> >   x17: 0000000000000000 x16: 0000000000000000 x15: ffff800080508d98
> >   x14: 0000000000000000 x13: 204d49414c434552 x12: 1fffe0001b6eeab2
> >   x11: ffff60001b6eeab2 x10: dfff800000000000 x9 : ffff60001b6eeab3
> >   x8 : 0000000000000001 x7 : 00009fffe491154e x6 : ffff0000db775593
> >   x5 : ffff0000db775590 x4 : ffff0000db775590 x3 : 0000000000000000
> >   x2 : 0000000000000027 x1 : ffff600018be0d62 x0 : dfff800000000000
> >   Call trace:
> >    check_flush_dependency+0x2a4/0x328
> >    __flush_work+0x184/0x5c8
> >    flush_work+0x18/0x28
> >    xfs_flush_inodes+0x68/0x88
> >    xfs_file_buffered_write+0x128/0x6f0
> >    xfs_file_write_iter+0x358/0x448
> >    nfs_local_doio+0x854/0x1568
> >    nfs_initiate_pgio+0x214/0x418
> >    nfs_generic_pg_pgios+0x304/0x480
> >    nfs_pageio_doio+0xe8/0x240
> >    nfs_pageio_complete+0x160/0x480
> >    nfs_writepages+0x300/0x4f0
> >    do_writepages+0x12c/0x4a0
> >    __writeback_single_inode+0xd4/0xa68
> >    writeback_sb_inodes+0x470/0xcb0
> >    __writeback_inodes_wb+0xb0/0x1d0
> >    wb_writeback+0x594/0x808
> >    wb_workfn+0x5e8/0x9e0
> >    process_scheduled_works+0x53c/0xd90
> 
> Ah, this is just the standard backing device flusher thread that is
> running. This is the back end of filesystem writeback, not the front
> end. It was never intended to be able to directly do loop back IO
> submission to the front end filesystem IO paths like this - they are
> very different contexts and have very different constraints.
> 
> This particular situation occurs when XFS is near ENOSPC.  There's a
> very high probability it is going to fail these writes, and so it's
> doing slow path work that involves blocking, and front end filesystem
> processing is allowed to block on just about anything in the
> filesystem as long as it can guarantee it won't deadlock.
> 
> Fundamentally, doing IO submission in WQ_MEM_RECLAIM context changes
> the submission context for -all- filesystems, not just XFS.

Yes, I see that.

> If we have to make this change to XFS, then -every-
> workqueue in XFS (not just this one) must be converted to
> WQ_MEM_RECLAIM, and then many workqueues in all the other
> filesystems will need to have the same changes made, too.

AFAICT they are all WQ_MEM_RECLAIM aside from m_sync_workqueue, but
that's beside the point.

> That doesn't smell right to me.
> 
> ----
> 
> So let's look at how back end filesystem IO currently submits new
> front end filesystem IO: the loop block device does this, and it
> uses workqueues to defer submitted IO so that the lower IO
> submission context can be directly controlled and made compatible
> with the front end filesystem IO submission path behaviours.
> 
> The loop device does not use rescuer threads - that's not needed
> when you have a queue based submission and just use the workqueues
> to run the queues until they are empty. So the loop device uses
> a standard unbound workqueue for its IO submission path, and
> then when the work is running it sets the task flags to say "this is
> a nested IO worker thread" before it starts processing the
> submission queue and submitting new front end filesystem IO:
> 
> static void loop_process_work(struct loop_worker *worker,
>                         struct list_head *cmd_list, struct loop_device *lo)
> {
>         int orig_flags = current->flags;
>         struct loop_cmd *cmd;
> 
>         current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
> 
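>         /*
>          * [elided: the worker then drains cmd_list, calling
>          *  loop_handle_cmd() on each queued command, and finally
>          *  restores the original task flags]
>          */
> 
>         current->flags = orig_flags;
> }
> 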
> PF_LOCAL_THROTTLE prevents deadlocks in balance_dirty_pages() by
> lifting the dirty ratio for this thread a little, hence giving it
> priority over the upper filesystem. i.e. the upper filesystem will
> throttle incoming writes first, then the back end IO submission
> thread can still submit new front end IOs to the lower filesystem
> and they won't block in balance_dirty_pages() because the lower
> filesystem has a higher limit. Hence the lower filesystem can always
> drain the dirty pages on the upper filesystem, and the system won't
> deadlock in balance_dirty_pages().

Perfect, thanks for the guidance.
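
For localio I plan to do the equivalent in the submission worker.
A minimal sketch, assuming a deferred-submission worker; the function
name and the queue handling below are hypothetical placeholders, not
the actual patch:

static void nfs_local_submit_work(struct work_struct *work)
{
	unsigned int orig_flags = current->flags;

	/*
	 * Mirror loop_process_work(): lift this thread's dirty
	 * throttling limit (PF_LOCAL_THROTTLE) and stop memory
	 * reclaim from recursing into filesystem/IO paths
	 * (PF_MEMALLOC_NOIO) while submitting front end filesystem
	 * IO on behalf of writeback.
	 */
	current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;

	/* ... dequeue and submit the deferred localio requests ... */

	current->flags = orig_flags;
}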

> Using WQ_MEM_RECLAIM context for IO submission does not address this
> deadlock.
> 
> The PF_MEMALLOC_NOIO flag prevents the lower filesystem IO from
> causing memory reclaim to re-enter filesystems or IO devices and so
> prevents deadlocks from occurring where IO that cleans pages is
> waiting on IO to complete.
> 
> Using WQ_MEM_RECLAIM context for IO submission does not address this
> deadlock either.
> 
> IOWs, doing front end IO submission like this from the BDI flusher
> thread is guaranteed to deadlock sooner or later, regardless of
> whether WQ_MEM_RECLAIM is set or not on workqueues that are flushed
> during IO submission. The WQ_MEM_RECLAIM warning is effectively your
> canary in the coal mine. And the canary just carked it.

Yes, I recognized it as such, but I wanted to pin down why it died.
Thanks for your help!

FYI, the commit below is the long-standing approach Trond used to
deal with this WQ_MEM_RECLAIM situation. I was just hoping to avoid
having to introduce a dedicated workqueue for localio's needs
(NeilBrown really disliked that the commit header emphasized how it
avoids blowing the stack, but we get that as a side-effect of needing
it to avoid WQ_MEM_RECLAIM):
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=nfs-localio-for-6.11-testing&id=0cd25f7610df291827ad95023e03fdd4f93bbea7
(I revised the patch to add PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO and
adjusted the header to reflect our discussion in this thread).
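
Shape-wise, that approach is just a plain unbound workqueue, without
WQ_MEM_RECLAIM, that localio IO submission gets deferred to. A sketch,
with illustrative names rather than what the commit actually uses:

static struct workqueue_struct *nfslocaliod_wq;

static int nfs_localio_init_wq(void)
{
	/*
	 * Deliberately NOT WQ_MEM_RECLAIM: deferring IO submission to
	 * this worker takes it out of the rescuer-backed writeback
	 * context, so flushing !WQ_MEM_RECLAIM workqueues (like
	 * xfs-sync) in the lower filesystem no longer trips
	 * check_flush_dependency().
	 */
	nfslocaliod_wq = alloc_workqueue("nfslocaliod", WQ_UNBOUND, 0);
	return nfslocaliod_wq ? 0 : -ENOMEM;
}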

> IMO, the only sane way to ensure this sort of nested "back-end page
> cleaning submits front-end filesystem IO" mechanism works is to
> do something similar to the loop device. You most definitely don't
> want to be doing buffered IO (double caching is almost always bad)
> and you want to be doing async direct IO so that the submission
> thread is not waiting on completion before the next IO is
> submitted.

Yes, the follow-on work for me is to revive localio's directio path,
which ultimately wasn't pursued (or properly wired up) because it
imposes DIO alignment requirements on NFS client IO:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=nfs-localio-for-6.11-testing&id=f6c9f51fca819a8af595a4eb94811c1f90051eab

But underlying filesystems (like XFS) have the appropriate checks; we
just need to fail gracefully and disable NFS localio if the IO is
misaligned.
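
Something like the following is the idea. This helper is entirely
hypothetical, just sketching a misalignment check against the DIO
constraints the underlying filesystem reports (e.g. statx's
STATX_DIOALIGN stx_dio_mem_align/stx_dio_offset_align values):

static bool nfs_local_dio_aligned(loff_t pos, struct iov_iter *iter,
				  u32 mem_align, u32 offset_align)
{
	/* No reported DIO constraints means no DIO support. */
	if (!mem_align || !offset_align)
		return false;
	/* File offset and total length must be offset-aligned. */
	if (!IS_ALIGNED(pos, offset_align) ||
	    !IS_ALIGNED(iov_iter_count(iter), offset_align))
		return false;
	/* Every segment's address and length must conform too. */
	return iov_iter_is_aligned(iter, mem_align - 1,
				   offset_align - 1);
}

If this returns false, localio would fall back to the non-direct path
for that IO rather than failing it.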

Mike
