From: Jeff Layton <jlayton@kernel.org>
To: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>,
"Matthew Wilcox (Oracle)" <willy@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>,
Mike Snitzer <snitzer@kernel.org>, Jens Axboe <axboe@kernel.dk>,
Ritesh Harjani <ritesh.list@gmail.com>,
Chuck Lever <chuck.lever@oracle.com>,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-nfs@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v4 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
Date: Sun, 03 May 2026 20:41:24 +0200
Message-ID: <d12b162648d8d2aef521439b672e5624f95c728d.camel@kernel.org>
In-Reply-To: <oykxd436yv47u2yojrwrp3qdtzekq63hanezs6bwlovot6il2a@266nl5oqnnam>
On Sun, 2026-05-03 at 16:45 +0200, Jan Kara wrote:
> On Fri 01-05-26 10:49:36, Jeff Layton wrote:
> > The IOCB_DONTCACHE writeback path in generic_write_sync() calls
> > filemap_flush_range() on every write, submitting writeback inline in
> > the writer's context. Perf lock contention profiling shows the
> > performance problem is not lock contention but the writeback submission
> > work itself — walking the page tree and submitting I/O blocks the writer
> > for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
> > (dontcache).
> >
> > Replace the inline filemap_flush_range() call with a flusher kick that
> > drains dirty pages in the background. This moves writeback submission
> > completely off the writer's hot path.
> >
> > To avoid flushing unrelated buffered dirty data, add a dedicated
> > WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
> > the per-wb WB_DONTCACHE_DIRTY counter to determine how many pages to
> > write back. The flusher writes back that many pages from the oldest dirty
> > inodes (not restricted to dontcache-specific inodes). This helps
> > preserve I/O batching while limiting the scope of expedited writeback.
> >
> > Like WB_start_all, the WB_start_dontcache bit coalesces multiple
> > DONTCACHE writes into a single flusher wakeup without per-write
> > allocations.
> >
> > Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
> > visibility, and target the correct cgroup writeback domain via
> > unlocked_inode_to_wb_begin().
> >
> > dontcache-bench results (same host, T6F_SKL_1920GBF, 251 GiB RAM,
> > xfs on NVMe, fio io_uring):
> >
> > Buffered and direct I/O paths are unaffected by this patchset. All
> > improvements are confined to the dontcache path:
> >
> > Single-stream throughput (MB/s):
> >                          Before    After   Change
> >   seq-write/dontcache       298      897    +201%
> >   rand-write/dontcache      131      236     +80%
> >
> > Tail latency improvements (seq-write/dontcache):
> > p99: 135,266 us -> 23,986 us (-82%)
> > p99.9: 8,925,479 us -> 28,443 us (-99.7%)
> >
> > Multi-writer (4 jobs, sequential write):
> >                                   Before     After    Change
> >   dontcache aggregate (MB/s)       2,529     4,532      +79%
> >   dontcache p99 (us)               8,553     1,002      -88%
> >   dontcache p99.9 (us)           109,314     1,057      -99%
> >
> > Dontcache multi-writer throughput now matches buffered (4,532 vs
> > 4,616 MB/s).
> >
> > 32-file write (Axboe test):
> >                                   Before     After    Change
> >   dontcache aggregate (MB/s)       1,548     3,499     +126%
> >   dontcache p99 (us)              10,170       602      -94%
> >   Peak dirty pages (MB)            1,837       213      -88%
> >
> > Dontcache now reaches 81% of buffered throughput (was 35%).
> >
> > Competing writers (dontcache vs buffered, separate files):
> >                        Before    After
> >   buffered writer         868      433 MB/s
> >   dontcache writer        415      433 MB/s
> >   Aggregate             1,284      866 MB/s
> >
> > Previously the buffered writer starved the dontcache writer 2:1.
> > With per-bdi_writeback tracking, both writers now receive equal
> > bandwidth. The aggregate matches the buffered-vs-buffered baseline
> > (863 MB/s), indicating fair sharing regardless of I/O mode.
> >
> > The dontcache writer's p99.9 latency collapsed from 119 ms to
> > 33 ms (-73%), eliminating the severe periodic stalls seen in the
> > baseline. Both writers now share identical latency profiles,
> > matching the buffered-vs-buffered pattern.
> >
> > The per-bdi_writeback dirty tracking dramatically reduces peak dirty
> > pages in dontcache workloads, with the 32-file test dropping from
> > 1.8 GB to 213 MB. Dontcache sequential write throughput triples and
> > multi-writer throughput reaches parity with buffered I/O, with tail
> > latencies collapsing by 1-2 orders of magnitude.
> >
> > Assisted-by: Claude:claude-opus-4-6
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
>
> Nice and looks good to me now. Feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
>
> One nit below:
>
> > +/**
> > + * filemap_dontcache_kick_writeback - kick flusher for IOCB_DONTCACHE writes
> > + * @mapping: address_space that was just written to
> > + *
> > + * Kick the writeback flusher thread to expedite writeback of dontcache
> > + * dirty pages. Uses a dedicated WB_start_dontcache bit so that only
> > + * pages tracked by WB_DONTCACHE_DIRTY are written back, rather than
> > + * flushing the entire BDI's dirty pages.
>
> This comment is a bit confusing, as in fact we write arbitrary dirty pages.
> Only the number of pages is influenced by WB_DONTCACHE_DIRTY. So
> I'd rephrase the last sentence like: We queue writeback for the inode's wb
> for as many pages as there are dontcache pages, but we don't restrict
> writeback to dontcache pages only. This significantly improves performance
> over either writing all wb's pages or writing only dontcache pages.
> Although it doesn't guarantee quick writeback and reclaim of dontcache
> pages, it keeps the amount of dirty pages in check, and over the longer term
> dontcache pages get written and reclaimed by background writeback even with
> this rough heuristic.
>
> Honza
>
I'll add that. Thanks for the suggestion and review!
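
For anyone following the thread, the heuristic Honza describes can be
modeled in user space roughly as below. This is only an illustrative
sketch, not code from the patch: struct wb_model, struct inode_model,
model_init(), model_dontcache_write() and model_flusher_run() are all
hypothetical names, an atomic_bool stands in for the WB_start_dontcache
bit, and an atomic_long stands in for the WB_DONTCACHE_DIRTY counter.
The model does not track which pages are dontcache, so it simply clears
the counter after a flusher pass; the real accounting is not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model. None of these names are kernel APIs. */
struct inode_model {
	long dirty_pages;		/* dirty pages of any kind on this inode */
};

struct wb_model {
	atomic_bool start_pending;	/* stands in for the WB_start_dontcache bit */
	atomic_long dontcache_dirty;	/* stands in for the WB_DONTCACHE_DIRTY counter */
	int wakeups;			/* how many times the flusher was woken */
};

static void model_init(struct wb_model *wb)
{
	assert(wb != NULL);
	atomic_init(&wb->start_pending, false);
	atomic_init(&wb->dontcache_dirty, 0);
	wb->wakeups = 0;
}

/*
 * Writer side: account the new dirty pages and kick the flusher.
 * Returns true only if this call issued a wakeup; later writers find
 * the bit already set and return without doing any more work.
 */
static bool model_dontcache_write(struct wb_model *wb,
				  struct inode_model *inode, long pages)
{
	inode->dirty_pages += pages;
	atomic_fetch_add(&wb->dontcache_dirty, pages);

	if (atomic_exchange(&wb->start_pending, true))
		return false;		/* kick already pending: coalesced */

	wb->wakeups++;
	return true;
}

/*
 * Flusher side: if a kick is pending, write back as many pages as the
 * dontcache counter says, taking them from inodes[] in order (index 0
 * is assumed to be the oldest dirty inode). The pages written are not
 * restricted to dontcache pages. Returns the number of pages written.
 */
static long model_flusher_run(struct wb_model *wb,
			      struct inode_model *inodes, size_t n)
{
	long budget, written = 0;

	if (!atomic_exchange(&wb->start_pending, false))
		return 0;		/* nothing requested */

	budget = atomic_load(&wb->dontcache_dirty);
	for (size_t i = 0; i < n && written < budget; i++) {
		long chunk = inodes[i].dirty_pages;

		if (chunk > budget - written)
			chunk = budget - written;
		inodes[i].dirty_pages -= chunk;
		written += chunk;
	}

	/* Model simplification: treat the whole budget as consumed. */
	atomic_store(&wb->dontcache_dirty, 0);
	return written;
}
```

With 100 buffered dirty pages on an older inode and three 10-page
dontcache writes to a newer one, this model wakes the flusher once and
writes back 30 pages, all of them from the older inode. That is the
behaviour the reworded comment is meant to describe: the counter bounds
the amount of writeback, not which pages are chosen.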
--
Jeff Layton <jlayton@kernel.org>
Thread overview: 9+ messages
2026-05-01  9:49 [PATCH v4 0/4] mm: improve write performance with RWF_DONTCACHE Jeff Layton
2026-05-01  9:49 ` [PATCH v4 1/4] mm: track DONTCACHE dirty pages per bdi_writeback Jeff Layton
2026-05-03 14:37   ` Jan Kara
2026-05-01  9:49 ` [PATCH v4 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking Jeff Layton
2026-05-01 16:44   ` Jens Axboe
2026-05-03 14:45   ` Jan Kara
2026-05-03 18:41     ` Jeff Layton [this message]
2026-05-01  9:49 ` [PATCH v4 3/4] testing: add nfsd-io-bench NFS server benchmark suite Jeff Layton
2026-05-01  9:49 ` [PATCH v4 4/4] testing: add dontcache-bench local filesystem " Jeff Layton