From: Matthew Wilcox <willy@infradead.org>
To: Rushil Patel <rushil.patel@gsacapital.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, linux-nfs@vger.kernel.org,
Trond Myklebust <trondmy@kernel.org>,
Anna Schumaker <anna@kernel.org>
Subject: Re: [RFC PATCH 0/1] mm/filemap: make writeback wait killable in __filemap_fdatawait_range()
Date: Wed, 25 Mar 2026 19:15:45 +0000 [thread overview]
Message-ID: <acQ0YRLM_SxYfjfI@casper.infradead.org> (raw)
In-Reply-To: <20260325113616.785496-1-rushil.patel@gsacapital.com>
On Wed, Mar 25, 2026 at 11:36:15AM +0000, Rushil Patel wrote:
> We run Slurm on compute nodes with NFS mounts (NFSv4.1, NetApp).
> When a job is cancelled, processes with dirty NFS pages get stuck
> in D-state inside folio_wait_bit_common() because
> __filemap_fdatawait_range() uses folio_wait_writeback(), which is
> TASK_UNINTERRUPTIBLE. If the filer is slow to respond these processes are
> unkillable - we've found the only recovery in practice is rebooting
> the node.
Hi Rushil. Thanks for the patch! I have a lot of sympathy for the
problem you're trying to solve. It was something similar which led
to me introducing the TASK_KILLABLE infrastructure back in 2007.
My problem was read-only though, and while I had an initial attempt
to also handle write workloads, it didn't work and I didn't have a
personal need for it, so I abandoned it. Now you have a real need, so
let's make it work.
> The patch switches to folio_wait_writeback_killable() so SIGKILL can
> interrupt the wait. Writeback itself continues on the server, we just stop
> waiting for the ack. All 6 callers of __filemap_fdatawait_range() detect
> errors independently via errseq_t / filemap_check_errors(), so the early
> return doesn't suppress error reporting.
Well ... I'm not entirely sure it doesn't suppress error reporting.
But I think I see what you're trying to say, and I think the change
of behaviour is one that was never guaranteed anyway.
> The tricky part is a re-entry through do_exit(). Making the wait killable
> alone isn't enough - we hit this in testing:
>
> 1. SIGKILL wakes the killable wait, signal is consumed by get_signal()
> 2. do_exit() -> exit_signals() sets PF_EXITING
> 3. do_exit() -> exit_files() -> nfs4_file_flush() -> nfs_wb_all()
> re-enters __filemap_fdatawait_range()
> 4. wants_signal() checks PF_EXITING *before* the SIGKILL special case
> (kernel/signal.c:951 vs 954), so it returns false
> 5. No signal can wake the second wait -> stuck in D-state again
Yes, this was where I got stuck too!
> The PF_EXITING check at the top of the function avoids re-entering the
> wait entirely. This is the same pattern used in mm/oom_kill.c,
> mm/memcontrol.c, block/blk-ioc.c, and io_uring/.
I'm not entirely comfortable with the location of the check. I feel
that __filemap_fdatawait_range() is a bit too low level for a check
of PF_EXITING. I could see there being other places
which really do want to wait, even in the presence of an exiting task.
Maybe I'm being overly paranoid there, but I would suppress the call
from nfs_wb_all(). Maybe something like this?
- ret = filemap_write_and_wait(inode->i_mapping);
+ if (current->flags & PF_EXITING)
+ ret = filemap_fdatawrite(inode->i_mapping);
+ else
+ ret = filemap_write_and_wait(inode->i_mapping);
What held me up from doing this though was the next part of
nfs_wb_all():
ret = nfs_commit_inode(inode, FLUSH_SYNC);
I didn't trace through exactly what this would do, but I inferred from
the FLUSH_SYNC that it would also wait for the file server to finish
the write of the inode ...
> Reproduced with iptables DROP on port 2049, confirmed the killable-only
> revision gets stuck on re-entry, and the PF_EXITING + killable revision
> kills cleanly.
... but if your testing shows that it works, I must be mistaken about
that.
> Sending as RFC because this touches the generic writeback sync path in
> mm/filemap.c rather than being NFS-specific. NFS can't really fix this on
> its own - it reaches __filemap_fdatawait_range() through
> filemap_write_and_wait() and doesn't own the wait. But I wanted to get
> guidance on whether this is the right place for the fix, or if you'd prefer
> a different approach.
Appreciate your flexibillity on this ... sounds like you considered
doing it this way, but didn't know about filemap_fdatawrite()?
Anyway, adding the NFS people for their opinions. Other filesystems
don't do this flush-on-close behaviour (for various reasons, but
basically NFS has a close-to-open consistency model). I believe
we can break this guarantee in this case as it's not an orderly close
but an involuntary termination of the process.
prev parent reply other threads:[~2026-03-25 19:15 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-25 11:36 [RFC PATCH 0/1] mm/filemap: make writeback wait killable in __filemap_fdatawait_range() Rushil Patel
2026-03-25 11:36 ` [RFC PATCH 1/1] " Rushil Patel
2026-03-25 19:15 ` Matthew Wilcox [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=acQ0YRLM_SxYfjfI@casper.infradead.org \
--to=willy@infradead.org \
--cc=akpm@linux-foundation.org \
--cc=anna@kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-nfs@vger.kernel.org \
--cc=rushil.patel@gsacapital.com \
--cc=trondmy@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox