From: Jeff Layton <jlayton@kernel.org>
To: Ritesh Harjani <ritesh.list@gmail.com>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
"Matthew Wilcox (Oracle)" <willy@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>,
Mike Snitzer <snitzer@kernel.org>, Jens Axboe <axboe@kernel.dk>,
Chuck Lever <chuck.lever@oracle.com>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-nfs@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 1/3] mm: kick writeback flusher instead of inline flush for IOCB_DONTCACHE
Date: Thu, 16 Apr 2026 15:49:51 -0700
Message-ID: <52b81c4d1fb2ad0e07b3b3b4dfbd3d36e8ee3e7d.camel@kernel.org>
In-Reply-To: <tstklxm7.ritesh.list@gmail.com>

On Thu, 2026-04-09 at 07:10 +0530, Ritesh Harjani wrote:
> Jeff Layton <jlayton@kernel.org> writes:
>
> > The IOCB_DONTCACHE writeback path in generic_write_sync() calls
> > filemap_flush_range() on every write, submitting writeback inline in
> > the writer's context. Perf lock contention profiling shows the
> > performance problem is not lock contention but the writeback submission
> > work itself — walking the page tree and submitting I/O blocks the
> > writer for milliseconds, inflating p99.9 latency from 23ms (buffered)
> > to 93ms (dontcache).
> >
> > Replace the inline filemap_flush_range() call with a
> > wakeup_flusher_threads_bdi() call that kicks the BDI's flusher thread
> > to drain dirty pages in the background. This moves writeback
> > submission completely off the writer's hot path. The flusher thread
> > handles writeback asynchronously, naturally coalescing and rate-limiting
> > I/O without any explicit skip-if-busy or dirty pressure checks.
> >
>
> Thanks, Jeff, for explaining this. It makes sense now.
>
>
> > Add WB_REASON_DONTCACHE as a new writeback reason for tracing
> > visibility.
> >
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> > fs/fs-writeback.c | 14 ++++++++++++++
> > include/linux/backing-dev-defs.h | 1 +
> > include/linux/fs.h | 6 ++----
> > include/trace/events/writeback.h | 3 ++-
> > 4 files changed, 19 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 3c75ee025bda..88dc31388a31 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -2466,6 +2466,20 @@ void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
> > rcu_read_unlock();
> > }
> >
> > +/**
> > + * filemap_dontcache_kick_writeback - kick flusher for IOCB_DONTCACHE writes
> > + * @mapping: address_space that was just written to
> > + *
> > + * Wake the BDI flusher thread to start writeback of dirty pages in the
> > + * background.
> > + */
> > +void filemap_dontcache_kick_writeback(struct address_space *mapping)
>
> This api gives a wrong sense that we are kicking writeback to write
> dirty pages which belongs to only this inode's address space mapping.
> But instead we are starting wb for everything on the respective bdi.
>
> So instead why not just export symbol for wakeup_flusher_threads_bdi()
> and use it instead?
>
> If not, then IMO at least making it...
> filemap_kick_writeback_all(mapping, enum wb_reason)
>
> ... might be better.

I did draft up a version of this -- adding a way to tell the flusher
thread to only flush a single inode. The performance was better than
today's DONTCACHE, but worse than just kicking the flusher thread. I
think we're probably better off not doing this, since we lose some
batching opportunities by forcing out a single inode's pages rather
than allowing the thread to do its thing.

The raw results and Claude's analysis follow. The analysis seems a bit
speculative, but the numbers were pretty clear:
==================================================================
Deliverable 1: Single-Client fio Benchmarks
==================================================================
--- seq-write ---
Mode MB/s IOPS Avg(us) p50(us) p99(us) p99.9(us) Sys CPU% PeakDirty(kB) PeakCache(kB)
------------------------------------------------------------------------------------------------------------------------
buffered 695.5 695.528450 22949.1 20054.016 55312.384 61079.552 1.9 44987764 253243376
dontcache 1437.9 1437.886186 11072.1 11993.088 21889.024 23461.888 2.0 42870772 52593220
direct 1383.4 1383.368808 11511.1 10944.512 22151.168 23986.176 2.0 44749956 252773512
--- rand-write ---
Mode MB/s IOPS Avg(us) p50(us) p99(us) p99.9(us) Sys CPU% PeakDirty(kB) PeakCache(kB)
------------------------------------------------------------------------------------------------------------------------
buffered 314 80374.217300 196.5 193.536 522.24 757.76 2.0 44149552 252083008
dontcache 285.8 73158.694019 216.2 112.128 489.472 16711.68 1.8 45074268 54406332
direct 220.9 56539.452768 280.5 98.816 749.568 17170.432 1.6 44545592 251172912
--- seq-read ---
Mode MB/s IOPS Avg(us) p50(us) p99(us) p99.9(us) Sys CPU% PeakDirty(kB) PeakCache(kB)
------------------------------------------------------------------------------------------------------------------------
buffered 2330.9 2330.876027 6508.6 6455.296 9633.792 11075.584 1.5 12112 253248964
dontcache 2350.4 2350.352499 6382.3 6324.224 7503.872 8093.696 1.4 5420 5717984
direct 2392.4 2392.364823 6364.5 6324.224 9240.576 10420.224 1.4 17784 252508448
--- rand-read ---
Mode MB/s IOPS Avg(us) p50(us) p99(us) p99.9(us) Sys CPU% PeakDirty(kB) PeakCache(kB)
------------------------------------------------------------------------------------------------------------------------
buffered 590.8 151245.227927 99.7 94.72 179.2 228.352 1.4 207388 253316616
dontcache 518.8 132801.398005 113.3 105.984 218.112 856.064 1.4 15504 8251388
direct 574.2 146991.048352 102.6 97.792 181.248 232.448 1.3 1520268 253300752
==================================================================
Deliverable 2: Noisy-Neighbor Benchmarks
==================================================================
--- Scenario A: Multiple Writers ---
Mode: buffered
Client MB/s IOPS Avg(us) p50(us) p99(us) p99.9(us)
client1 405.9 405.890628 2332.4 0.564 40632.32 57933.824
client2 397.2 397.237862 2371.2 0.572 40632.32 58458.112
client3 398.1 398.125174 2365.1 0.548 40632.32 58982.4
client4 398.7 398.714803 2360.9 0.628 40632.32 58458.112
Aggregate BW: 1599.9 MB/s | Sys CPU: 1.1% | Peak Dirty: 48237132 kB
Mode: dontcache
Client MB/s IOPS Avg(us) p50(us) p99(us) p99.9(us)
client1 249.7 249.679863 3861 0.556 51642.368 77070.336
client2 250.8 250.827411 3844.4 0.532 51642.368 77070.336
client3 249.5 249.498815 3871.6 0.596 52166.656 77070.336
client4 249.3 249.264908 3875.4 0.612 52166.656 77070.336
Aggregate BW: 999.3 MB/s | Sys CPU: 1.1% | Peak Dirty: 44716500 kB
Mode: direct
Client MB/s IOPS Avg(us) p50(us) p99(us) p99.9(us)
client1 237.8 237.802737 4072.9 0.564 42729.472 50069.504
client2 239.3 239.331213 4046.4 0.588 42729.472 50069.504
client3 238.1 238.123964 4069.2 0.58 42729.472 50069.504
client4 239.8 239.752136 4038.6 0.596 42729.472 50069.504
Aggregate BW: 955.0 MB/s | Sys CPU: 1.1% | Peak Dirty: 44481312 kB
--- Scenario C: Noisy Writer + Latency-Sensitive Readers ---
Mode: buffered
Job MB/s IOPS Avg(us) p50(us) p99(us) p99.9(us)
Bulk writer 891.7 891.707189 964.6 0.668 18743.296 25559.04
reader1 643.2 164663.316583 4.7 0.131 164.864 684.032
reader2 638.4 163431.421446 4.4 0.131 162.816 634.88
reader3 633.7 162217.821782 4.4 0.131 173.056 626.688
Sys CPU: 1.1% | Peak Dirty: 44129604 kB
Mode: dontcache
Job MB/s IOPS Avg(us) p50(us) p99(us) p99.9(us)
Bulk writer 1385.9 1385.920180 566.7 0.564 14876.672 23461.888
reader1 18.7 4781.729962 204.3 0.183 1548.288 88604.672
reader2 18.7 4784.697379 204.1 0.185 1581.056 88604.672
reader3 18.7 4780.857893 204.3 0.183 1564.672 88604.672
Sys CPU: 1.1% | Peak Dirty: 44262448 kB
Mode: direct
Job MB/s IOPS Avg(us) p50(us) p99(us) p99.9(us)
Bulk writer 1313.6 1313.630216 604.1 0.652 15007.744 24248.32
reader1 984.6 252061.538462 2.5 0.131 136.192 259.072
reader2 992.2 254015.503876 2.4 0.131 130.56 280.576
reader3 988.4 253034.749035 2.4 0.131 128.512 284.672
Sys CPU: 1.0% | Peak Dirty: 44416804 kB
--- Scenario D: Mixed-Mode Noisy Writer + Readers ---
Mode: dontcache-w_buffered-r
Job MB/s IOPS Avg(us) p50(us) p99(us) p99.9(us)
Bulk writer 1466.7 1466.717576 529.7 0.564 13959.168 23724.032
reader1 504.4 129134.975369 6.3 0.131 129.536 358.4
reader2 504.4 129134.975369 6.1 0.131 123.392 358.4
reader3 513.5 131466.399198 6.2 0.131 140.288 358.4
Sys CPU: 0.9% | Peak Dirty: 43284376 kB
==================================================================
System Info
==================================================================
Timestamp: 20260408-071440
Kernel: 7.0.0-rc6-00025-gad8572d1ac65
Filesystem: xfs
File size: 514746M
Test dir: /export
Host: 256GB RAM + 80 CPUs
==================================================================
CLAUDE's Comparison and Analysis
==================================================================
Comparing dontcache numbers against the previous whole-BDI flusher kernel (from /tmp/dontcache-local-4way-flusher.md):
Per-Inode vs Whole-BDI Flusher — DONTCACHE on Local XFS
Single-Client Writes
┌──────────────────┬───────────┬───────────┬─────────────┐
│ Benchmark        │ Whole-BDI │ Per-Inode │ Change      │
├──────────────────┼───────────┼───────────┼─────────────┤
│ Seq write MB/s   │ 1450      │ 1438      │ -1% (noise) │
├──────────────────┼───────────┼───────────┼─────────────┤
│ Seq write p99.9  │ 23.5 ms   │ 23.5 ms   │ identical   │
├──────────────────┼───────────┼───────────┼─────────────┤
│ Rand write MB/s  │ 363       │ 286       │ -21%        │
├──────────────────┼───────────┼───────────┼─────────────┤
│ Rand write p99.9 │ 1.8 ms    │ 16.7 ms   │ regression  │
└──────────────────┴───────────┴───────────┴─────────────┘
Seq write is identical. Rand write regressed — the whole-BDI flusher batched all dirty pages in one pass with writeback_sb_inodes() under a
single blk_plug, while per-inode write_inode_now() loses that batching.
Single-Client Reads
┌────────────────┬───────────┬───────────┬────────┐
│ Benchmark      │ Whole-BDI │ Per-Inode │ Change │
├────────────────┼───────────┼───────────┼────────┤
│ Seq read MB/s  │ 2950      │ 2350      │ -20%   │
├────────────────┼───────────┼───────────┼────────┤
│ Rand read MB/s │ 651       │ 519       │ -20%   │
└────────────────┴───────────┴───────────┴────────┘
Reads shouldn't be affected by writeback path changes. Buffered reads also dropped (2888 → 2331), suggesting different system conditions
between runs rather than a per-inode regression.
Multi-Writer (Scenario A)
┌────────────────┬───────────┬───────────┬────────────┐
│ Metric         │ Whole-BDI │ Per-Inode │ Change     │
├────────────────┼───────────┼───────────┼────────────┤
│ Aggregate MB/s │ 1478      │ 999       │ -32%       │
├────────────────┼───────────┼───────────┼────────────┤
│ p99.9          │ 46 ms     │ 77 ms     │ +67% worse │
└────────────────┴───────────┴───────────┴────────────┘
This is the biggest regression. With whole-BDI, the flusher did one batched pass through all dirty inodes via writeback_sb_inodes(). With
per-inode, each of 4 writers queues a separate work item processed serially by write_inode_now() — losing the batch I/O merging benefit.
Scenario C & D (Noisy Neighbor)
┌─────────────────────────┬───────────┬───────────┬─────────────┐
│ Metric                  │ Whole-BDI │ Per-Inode │ Change      │
├─────────────────────────┼───────────┼───────────┼─────────────┤
│ Scenario C writer       │ 1468      │ 1386      │ -6%         │
├─────────────────────────┼───────────┼───────────┼─────────────┤
│ Scenario C readers      │ 18.7 MB/s │ 18.7 MB/s │ identical   │
├─────────────────────────┼───────────┼───────────┼─────────────┤
│ Scenario D writer       │ 1472      │ 1467      │ identical   │
├─────────────────────────┼───────────┼───────────┼─────────────┤
│ Scenario D readers      │ 496 MB/s  │ 507 MB/s  │ +2%         │
├─────────────────────────┼───────────┼───────────┼─────────────┤
│ Scenario D reader p99.9 │ 440 us    │ 358 us    │ +19% better │
└─────────────────────────┴───────────┴───────────┴─────────────┘
Mixed-mode (Scenario D) is the intended production case and it's essentially identical or slightly better — per-inode writeback creates less
device contention for buffered readers.
Summary
The per-inode approach is neutral-to-slightly-better for the production scenario (Scenario D), but regresses on multi-writer and random write
workloads. The core issue is loss of I/O batching — writeback_sb_inodes() processes all dirty inodes in one blk_plug'd pass, while per-inode
write_inode_now() calls are processed one at a time. The read regressions likely reflect different system conditions since buffered/direct
reads also dropped ~20%.
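
The blk_plug point in the summary can be sketched in kernel-style
pseudocode (not the literal fs/fs-writeback.c code; the dirty-inode
iterator here is hypothetical):

```c
/*
 * Whole-BDI pass: the flusher walks every dirty inode inside one
 * block plug, so bios from adjacent inodes can be merged into larger
 * requests before being released to the device.
 */
static void whole_bdi_pass(struct bdi_writeback *wb)
{
	struct blk_plug plug;

	blk_start_plug(&plug);
	for_each_dirty_inode(wb, inode)		/* hypothetical iterator */
		writeback_single_inode(inode);
	blk_finish_plug(&plug);			/* merged I/O submitted here */
}

/*
 * Per-inode variant: each work item plugs (at best) only its own
 * inode's pages, so cross-inode merging is lost -- consistent with
 * the -21% rand-write and -32% multi-writer numbers above.
 */
```
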
Thread overview: 16+ messages
2026-04-08 14:25 [PATCH v2 0/3] mm: improve write performance with RWF_DONTCACHE Jeff Layton
2026-04-08 14:25 ` [PATCH v2 1/3] mm: kick writeback flusher instead of inline flush for IOCB_DONTCACHE Jeff Layton
2026-04-09 1:40 ` Ritesh Harjani
2026-04-09 5:52 ` Christoph Hellwig
2026-04-16 22:49 ` Jeff Layton [this message]
2026-04-17 2:55 ` Ritesh Harjani
2026-04-09 5:50 ` Christoph Hellwig
2026-04-09 7:21 ` Jan Kara
2026-04-09 14:21 ` Christoph Hellwig
2026-04-10 10:41 ` Jan Kara
2026-04-10 11:05 ` Jeff Layton
2026-04-08 14:25 ` [PATCH v2 2/3] testing: add nfsd-io-bench NFS server benchmark suite Jeff Layton
2026-04-08 14:25 ` [PATCH v2 3/3] testing: add dontcache-bench local filesystem " Jeff Layton
2026-04-08 18:45 ` [PATCH v2 0/3] mm: improve write performance with RWF_DONTCACHE Jeff Layton
2026-04-09 6:06 ` Christoph Hellwig
2026-04-09 6:05 ` Christoph Hellwig