From: Jens Axboe <axboe@kernel.dk>
To: Tal Zussman <tz2294@columbia.edu>,
	"Tigran A. Aivazian" <aivazian.tigran@gmail.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	Namjae Jeon <linkinjeon@kernel.org>,
	Sungjong Seo <sj1557.seo@samsung.com>,
	Yuezhang Mo <yuezhang.mo@sony.com>,
	Dave Kleikamp <shaggy@kernel.org>,
	Ryusuke Konishi <konishi.ryusuke@gmail.com>,
	Viacheslav Dubeyko <slava@dubeyko.com>,
	Konstantin Komarov <almaz.alexandrovich@paragon-software.com>,
	Bob Copeland <me@bobcopeland.com>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
	jfs-discussion@lists.sourceforge.net,
	linux-nilfs@vger.kernel.org, ntfs3@lists.linux.dev,
	linux-karma-devel@lists.sourceforge.net, linux-mm@kvack.org
Subject: Re: [PATCH RFC v2 1/2] filemap: defer dropbehind invalidation from IRQ context
Date: Wed, 25 Feb 2026 15:52:41 -0700
Message-ID: <c8078a80-f801-4f8a-b3cd-e2ccbfca1def@kernel.dk>
In-Reply-To: <20260225-blk-dontcache-v2-1-70e7ac4f7108@columbia.edu>

On 2/25/26 3:40 PM, Tal Zussman wrote:
> folio_end_dropbehind() is called from folio_end_writeback(), which can
> run in IRQ context through buffer_head completion.
>
> Previously, when folio_end_dropbehind() detected !in_task(), it skipped
> the invalidation entirely. This meant that folios marked for dropbehind
> via RWF_DONTCACHE would remain in the page cache after writeback when
> completed from IRQ context, defeating the purpose of using it.
>
> Fix this by deferring the dropbehind invalidation to a work item. When
> folio_end_dropbehind() is called from IRQ context, the folio is added to
> a global folio_batch and the work item is scheduled. The worker drains
> the batch, locking each folio and calling filemap_end_dropbehind(), and
> re-drains if new folios arrived while processing.
>
> This unblocks enabling RWF_DONTCACHE for block devices and other
> buffer_head-based I/O.
>
> Signed-off-by: Tal Zussman <tz2294@columbia.edu>
> ---
> mm/filemap.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 79 insertions(+), 5 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index ebd75684cb0a..6263f35c5d13 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1085,6 +1085,8 @@ static const struct ctl_table filemap_sysctl_table[] = {
>  	}
>  };
>
> +static void __init dropbehind_init(void);
> +
>  void __init pagecache_init(void)
>  {
>  	int i;
> @@ -1092,6 +1094,7 @@ void __init pagecache_init(void)
>  	for (i = 0; i < PAGE_WAIT_TABLE_SIZE; i++)
>  		init_waitqueue_head(&folio_wait_table[i]);
>
> +	dropbehind_init();
>  	page_writeback_init();
>  	register_sysctl_init("vm", filemap_sysctl_table);
>  }
> @@ -1613,23 +1616,94 @@ static void filemap_end_dropbehind(struct folio *folio)
>   * If folio was marked as dropbehind, then pages should be dropped when writeback
>   * completes. Do that now. If we fail, it's likely because of a big folio -
>   * just reset dropbehind for that case and latter completions should invalidate.
> + *
> + * When called from IRQ context (e.g. buffer_head completion), we cannot lock
> + * the folio and invalidate. Defer to a workqueue so that callers like
> + * end_buffer_async_write() that complete in IRQ context still get their folios
> + * pruned.
>   */
> +static DEFINE_SPINLOCK(dropbehind_lock);
> +static struct folio_batch dropbehind_fbatch;
> +static struct work_struct dropbehind_work;
> +
> +static void dropbehind_work_fn(struct work_struct *w)
> +{
> +	struct folio_batch fbatch;
> +
> +again:
> +	spin_lock_irq(&dropbehind_lock);
> +	fbatch = dropbehind_fbatch;
> +	folio_batch_reinit(&dropbehind_fbatch);
> +	spin_unlock_irq(&dropbehind_lock);
> +
> +	for (int i = 0; i < folio_batch_count(&fbatch); i++) {
> +		struct folio *folio = fbatch.folios[i];
> +
> +		if (folio_trylock(folio)) {
> +			filemap_end_dropbehind(folio);
> +			folio_unlock(folio);
> +		}
> +		folio_put(folio);
> +	}
> +
> +	/* Drain folios that were added while we were processing. */
> +	spin_lock_irq(&dropbehind_lock);
> +	if (folio_batch_count(&dropbehind_fbatch)) {
> +		spin_unlock_irq(&dropbehind_lock);
> +		goto again;
> +	}
> +	spin_unlock_irq(&dropbehind_lock);
> +}
> +
> +static void __init dropbehind_init(void)
> +{
> +	folio_batch_init(&dropbehind_fbatch);
> +	INIT_WORK(&dropbehind_work, dropbehind_work_fn);
> +}
> +
> +static void folio_end_dropbehind_irq(struct folio *folio)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&dropbehind_lock, flags);
> +
> +	/* If there is no space in the folio_batch, skip the invalidation. */
> +	if (!folio_batch_space(&dropbehind_fbatch)) {
> +		spin_unlock_irqrestore(&dropbehind_lock, flags);
> +		return;
> +	}
> +
> +	folio_get(folio);
> +	folio_batch_add(&dropbehind_fbatch, folio);
> +	spin_unlock_irqrestore(&dropbehind_lock, flags);
> +
> +	schedule_work(&dropbehind_work);
> +}

How well does this scale? I did a patch basically the same as this, but
not using a folio batch though. But the main sticking point was
dropbehind_lock contention, to the point where I left it alone and
thought "ok maybe we just do this when we're done with the awful
buffer_head stuff". What happens if you have N threads doing IO at the
same time to N block devices? I suspect it'll look absolutely terrible,
as each thread will be banging on that dropbehind_lock.

One solution could potentially be to use per-cpu lists for this. If you
have N threads working on separate block devices, they will tend to be
sticky to their CPU anyway.

tldr - I don't believe the above will work well enough to scale
appropriately.

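To make the per-cpu idea concrete, here is a rough userspace model of it,
not kernel code and not a proposed patch. Each "CPU" gets its own small
batch and its own lock, so producers on different CPUs never touch the
same lock. Every name in it (NR_SHARDS, BATCH_SIZE, struct item_batch,
shard_add, shard_drain, bump_counter) is made up for illustration, and
pthread mutexes stand in for whatever per-CPU protection the kernel
version would use:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NR_SHARDS  4   /* stand-in for the number of CPUs (assumption) */
#define BATCH_SIZE 31  /* illustrative per-shard capacity (assumption) */

/* One batch per shard, each with its own lock: no shared global lock. */
struct item_batch {
	pthread_mutex_t lock;
	unsigned int nr;
	void *items[BATCH_SIZE];
};

static struct item_batch shards[NR_SHARDS];

static void shards_init(void)
{
	for (int i = 0; i < NR_SHARDS; i++) {
		pthread_mutex_init(&shards[i].lock, NULL);
		shards[i].nr = 0;
	}
}

/*
 * Queue an item on the given shard. Returns 0 on success, or -1 if the
 * shard is full, in which case the item is simply skipped (the quoted
 * patch does the same when its global batch has no space).
 */
static int shard_add(unsigned int shard, void *item)
{
	struct item_batch *b = &shards[shard % NR_SHARDS];
	int ret = -1;

	pthread_mutex_lock(&b->lock);
	if (b->nr < BATCH_SIZE) {
		b->items[b->nr++] = item;
		ret = 0;
	}
	pthread_mutex_unlock(&b->lock);
	return ret;
}

/*
 * Drain one shard: snapshot and reset it under the lock, then run fn on
 * each item with the lock dropped. Returns the number of items handled.
 */
static unsigned int shard_drain(unsigned int shard, void (*fn)(void *))
{
	struct item_batch *b = &shards[shard % NR_SHARDS];
	void *local[BATCH_SIZE];
	unsigned int nr;

	pthread_mutex_lock(&b->lock);
	nr = b->nr;
	assert(nr <= BATCH_SIZE);
	for (unsigned int i = 0; i < nr; i++)
		local[i] = b->items[i];
	b->nr = 0;
	pthread_mutex_unlock(&b->lock);

	for (unsigned int i = 0; i < nr; i++)
		fn(local[i]);
	return nr;
}

/* Example callback: treat each item as an int counter and bump it. */
static void bump_counter(void *item)
{
	(*(int *)item)++;
}
```

A kernel version would presumably use real per-CPU data (for example
DEFINE_PER_CPU) with interrupts disabled around the add, plus a per-CPU
work item, instead of mutexes and an explicit shard index. Whether that
removes enough contention would still need to be measured.
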
Let me know if you want me to test this on my big box, it's got a bunch
of drives and CPUs to match.

I did a patch exactly matching this, you can probably find it

>  void folio_end_dropbehind(struct folio *folio)
>  {
>  	if (!folio_test_dropbehind(folio))
>  		return;
>
>  	/*
> -	 * Hitting !in_task() should not happen off RWF_DONTCACHE writeback,
> -	 * but can happen if normal writeback just happens to find dirty folios
> -	 * that were created as part of uncached writeback, and that writeback
> -	 * would otherwise not need non-IRQ handling. Just skip the
> -	 * invalidation in that case.
> +	 * Hitting !in_task() can happen for IO completed from IRQ contexts or
> +	 * if normal writeback just happens to find dirty folios that were
> +	 * created as part of uncached writeback, and that writeback would
> +	 * otherwise not need non-IRQ handling.
>  	 */
>  	if (in_task() && folio_trylock(folio)) {
>  		filemap_end_dropbehind(folio);
>  		folio_unlock(folio);
> +		return;
>  	}
> +
> +	/*
> +	 * In IRQ context we cannot lock the folio or call into the
> +	 * invalidation path. Defer to a workqueue. This happens for
> +	 * buffer_head-based writeback which runs from bio IRQ context.
> +	 */
> +	if (!in_task())
> +		folio_end_dropbehind_irq(folio);
>  }

Ideally we'd have the caller be responsible for this, rather than put it
inside folio_end_dropbehind().

--
Jens Axboe