public inbox for linux-nfs@vger.kernel.org
From: Jeff Layton <jlayton@kernel.org>
To: Ritesh Harjani <ritesh.list@gmail.com>,
	Alexander Viro	 <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Andrew Morton	 <akpm@linux-foundation.org>,
	David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <ljs@kernel.org>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka	 <vbabka@kernel.org>,
	Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan	 <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	Mike Snitzer	 <snitzer@kernel.org>, Jens Axboe <axboe@kernel.dk>,
	Chuck Lever	 <chuck.lever@oracle.com>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	 linux-nfs@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 1/3] mm: kick writeback flusher instead of inline flush for IOCB_DONTCACHE
Date: Thu, 16 Apr 2026 15:49:51 -0700	[thread overview]
Message-ID: <52b81c4d1fb2ad0e07b3b3b4dfbd3d36e8ee3e7d.camel@kernel.org> (raw)
In-Reply-To: <tstklxm7.ritesh.list@gmail.com>

On Thu, 2026-04-09 at 07:10 +0530, Ritesh Harjani wrote:
> Jeff Layton <jlayton@kernel.org> writes:
> 
> > The IOCB_DONTCACHE writeback path in generic_write_sync() calls
> > filemap_flush_range() on every write, submitting writeback inline in
> > the writer's context.  Perf lock contention profiling shows the
> > performance problem is not lock contention but the writeback submission
> > work itself — walking the page tree and submitting I/O blocks the
> > writer for milliseconds, inflating p99.9 latency from 23ms (buffered)
> > to 93ms (dontcache).
> > 
> > Replace the inline filemap_flush_range() call with a
> > wakeup_flusher_threads_bdi() call that kicks the BDI's flusher thread
> > to drain dirty pages in the background.  This moves writeback
> > submission completely off the writer's hot path.  The flusher thread
> > handles writeback asynchronously, naturally coalescing and rate-limiting
> > I/O without any explicit skip-if-busy or dirty pressure checks.
> > 
> 
> Thanks Jeff for explaining this. It makes sense now.
> 
> 
> > Add WB_REASON_DONTCACHE as a new writeback reason for tracing
> > visibility.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/fs-writeback.c                | 14 ++++++++++++++
> >  include/linux/backing-dev-defs.h |  1 +
> >  include/linux/fs.h               |  6 ++----
> >  include/trace/events/writeback.h |  3 ++-
> >  4 files changed, 19 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 3c75ee025bda..88dc31388a31 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -2466,6 +2466,20 @@ void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
> >  	rcu_read_unlock();
> >  }
> >  
> > +/**
> > + * filemap_dontcache_kick_writeback - kick flusher for IOCB_DONTCACHE writes
> > + * @mapping:	address_space that was just written to
> > + *
> > + * Wake the BDI flusher thread to start writeback of dirty pages in the
> > + * background.
> > + */
> > +void filemap_dontcache_kick_writeback(struct address_space *mapping)
> 
> This api gives a wrong sense that we are kicking writeback to write
> dirty pages which belongs to only this inode's address space mapping.
> But instead we are starting wb for everything on the respective bdi.
> 
> So instead why not just export symbol for wakeup_flusher_threads_bdi()
> and use it instead?
> 
> If not, then IMO at least making it... 
>    filemap_kick_writeback_all(mapping, enum wb_reason)
> 
> ... might be better.

I did draft up a version of this -- adding a way to tell the flusher
thread to flush only a single inode. The performance was better than
today's DONTCACHE, but worse than just kicking the flusher thread.

I think we're probably better off not doing this because we lose some
batching opportunities by trying to force out a single inode's pages
rather than allowing the thread to do its thing.

The raw results and Claude's analysis follow. The analysis seems a bit
speculative, but the numbers were pretty clear:

==================================================================
  Deliverable 1: Single-Client fio Benchmarks
==================================================================

--- seq-write ---
Mode                   MB/s       IOPS    Avg(us)    p50(us)    p99(us)  p99.9(us)   Sys CPU% PeakDirty(kB) PeakCache(kB)
------------------------------------------------------------------------------------------------------------------------
buffered              695.5 695.528450    22949.1  20054.016  55312.384  61079.552        1.9     44987764    253243376
dontcache            1437.9 1437.886186    11072.1  11993.088  21889.024  23461.888        2.0     42870772     52593220
direct               1383.4 1383.368808    11511.1  10944.512  22151.168  23986.176        2.0     44749956    252773512

--- rand-write ---
Mode                   MB/s       IOPS    Avg(us)    p50(us)    p99(us)  p99.9(us)   Sys CPU% PeakDirty(kB) PeakCache(kB)
------------------------------------------------------------------------------------------------------------------------
buffered                314 80374.217300      196.5    193.536     522.24     757.76        2.0     44149552    252083008
dontcache             285.8 73158.694019      216.2    112.128    489.472   16711.68        1.8     45074268     54406332
direct                220.9 56539.452768      280.5     98.816    749.568  17170.432        1.6     44545592    251172912

--- seq-read ---
Mode                   MB/s       IOPS    Avg(us)    p50(us)    p99(us)  p99.9(us)   Sys CPU% PeakDirty(kB) PeakCache(kB)
------------------------------------------------------------------------------------------------------------------------
buffered             2330.9 2330.876027     6508.6   6455.296   9633.792  11075.584        1.5        12112    253248964
dontcache            2350.4 2350.352499     6382.3   6324.224   7503.872   8093.696        1.4         5420      5717984
direct               2392.4 2392.364823     6364.5   6324.224   9240.576  10420.224        1.4        17784    252508448

--- rand-read ---
Mode                   MB/s       IOPS    Avg(us)    p50(us)    p99(us)  p99.9(us)   Sys CPU% PeakDirty(kB) PeakCache(kB)
------------------------------------------------------------------------------------------------------------------------
buffered              590.8 151245.227927       99.7      94.72      179.2    228.352        1.4       207388    253316616
dontcache             518.8 132801.398005      113.3    105.984    218.112    856.064        1.4        15504      8251388
direct                574.2 146991.048352      102.6     97.792    181.248    232.448        1.3      1520268    253300752

==================================================================
  Deliverable 2: Noisy-Neighbor Benchmarks
==================================================================

--- Scenario A: Multiple Writers ---
  Mode: buffered
  Client           MB/s       IOPS    Avg(us)    p50(us)    p99(us)  p99.9(us)
  client1         405.9 405.890628     2332.4      0.564   40632.32  57933.824
  client2         397.2 397.237862     2371.2      0.572   40632.32  58458.112
  client3         398.1 398.125174     2365.1      0.548   40632.32    58982.4
  client4         398.7 398.714803     2360.9      0.628   40632.32  58458.112
  Aggregate BW: 1599.9 MB/s | Sys CPU: 1.1% | Peak Dirty: 48237132 kB

  Mode: dontcache
  Client           MB/s       IOPS    Avg(us)    p50(us)    p99(us)  p99.9(us)
  client1         249.7 249.679863       3861      0.556  51642.368  77070.336
  client2         250.8 250.827411     3844.4      0.532  51642.368  77070.336
  client3         249.5 249.498815     3871.6      0.596  52166.656  77070.336
  client4         249.3 249.264908     3875.4      0.612  52166.656  77070.336
  Aggregate BW: 999.3 MB/s | Sys CPU: 1.1% | Peak Dirty: 44716500 kB

  Mode: direct
  Client           MB/s       IOPS    Avg(us)    p50(us)    p99(us)  p99.9(us)
  client1         237.8 237.802737     4072.9      0.564  42729.472  50069.504
  client2         239.3 239.331213     4046.4      0.588  42729.472  50069.504
  client3         238.1 238.123964     4069.2       0.58  42729.472  50069.504
  client4         239.8 239.752136     4038.6      0.596  42729.472  50069.504
  Aggregate BW: 955.0 MB/s | Sys CPU: 1.1% | Peak Dirty: 44481312 kB

--- Scenario C: Noisy Writer + Latency-Sensitive Readers ---
  Mode: buffered
  Job                  MB/s       IOPS    Avg(us)    p50(us)    p99(us)  p99.9(us)
  Bulk writer         891.7 891.707189      964.6      0.668  18743.296   25559.04
  reader1             643.2 164663.316583        4.7      0.131    164.864    684.032
  reader2             638.4 163431.421446        4.4      0.131    162.816     634.88
  reader3             633.7 162217.821782        4.4      0.131    173.056    626.688
  Sys CPU: 1.1% | Peak Dirty: 44129604 kB

  Mode: dontcache
  Job                  MB/s       IOPS    Avg(us)    p50(us)    p99(us)  p99.9(us)
  Bulk writer        1385.9 1385.920180      566.7      0.564  14876.672  23461.888
  reader1              18.7 4781.729962      204.3      0.183   1548.288  88604.672
  reader2              18.7 4784.697379      204.1      0.185   1581.056  88604.672
  reader3              18.7 4780.857893      204.3      0.183   1564.672  88604.672
  Sys CPU: 1.1% | Peak Dirty: 44262448 kB

  Mode: direct
  Job                  MB/s       IOPS    Avg(us)    p50(us)    p99(us)  p99.9(us)
  Bulk writer        1313.6 1313.630216      604.1      0.652  15007.744   24248.32
  reader1             984.6 252061.538462        2.5      0.131    136.192    259.072
  reader2             992.2 254015.503876        2.4      0.131     130.56    280.576
  reader3             988.4 253034.749035        2.4      0.131    128.512    284.672
  Sys CPU: 1.0% | Peak Dirty: 44416804 kB

--- Scenario D: Mixed-Mode Noisy Writer + Readers ---
  Mode: dontcache-w_buffered-r
  Job                  MB/s       IOPS    Avg(us)    p50(us)    p99(us)  p99.9(us)
  Bulk writer        1466.7 1466.717576      529.7      0.564  13959.168  23724.032
  reader1             504.4 129134.975369        6.3      0.131    129.536      358.4
  reader2             504.4 129134.975369        6.1      0.131    123.392      358.4
  reader3             513.5 131466.399198        6.2      0.131    140.288      358.4
  Sys CPU: 0.9% | Peak Dirty: 43284376 kB

==================================================================
  System Info
==================================================================
Timestamp: 20260408-071440
Kernel: 7.0.0-rc6-00025-gad8572d1ac65
Filesystem: xfs
File size: 514746M
Test dir: /export
Host: 256GB RAM + 80 CPUs



==================================================================
Claude's Comparison and Analysis
==================================================================

Comparing dontcache numbers against the previous whole-BDI flusher kernel (from /tmp/dontcache-local-4way-flusher.md):

  Per-Inode vs Whole-BDI Flusher — DONTCACHE on Local XFS

  Single-Client Writes

  ┌──────────────────┬───────────┬───────────┬─────────────┐
  │    Benchmark     │ Whole-BDI │ Per-Inode │   Change    │
  ├──────────────────┼───────────┼───────────┼─────────────┤
  │ Seq write MB/s   │ 1450      │ 1438      │ -1% (noise) │
  ├──────────────────┼───────────┼───────────┼─────────────┤
  │ Seq write p99.9  │ 23.5 ms   │ 23.5 ms   │ identical   │
  ├──────────────────┼───────────┼───────────┼─────────────┤
  │ Rand write MB/s  │ 363       │ 286       │ -21%        │
  ├──────────────────┼───────────┼───────────┼─────────────┤
  │ Rand write p99.9 │ 1.8 ms    │ 16.7 ms   │ regression  │
  └──────────────────┴───────────┴───────────┴─────────────┘

  Seq write is identical. Rand write regressed — the whole-BDI flusher batched all dirty pages in one pass with writeback_sb_inodes() under a
  single blk_plug, while per-inode write_inode_now() loses that batching.

  Single-Client Reads

  ┌────────────────┬───────────┬───────────┬────────┐
  │   Benchmark    │ Whole-BDI │ Per-Inode │ Change │
  ├────────────────┼───────────┼───────────┼────────┤
  │ Seq read MB/s  │ 2950      │ 2350      │ -20%   │
  ├────────────────┼───────────┼───────────┼────────┤
  │ Rand read MB/s │ 651       │ 519       │ -20%   │
  └────────────────┴───────────┴───────────┴────────┘

  Reads shouldn't be affected by writeback path changes. Buffered reads also dropped (2888 → 2331), suggesting different system conditions
  between runs rather than a per-inode regression.

  Multi-Writer (Scenario A)

  ┌────────────────┬───────────┬───────────┬────────────┐
  │     Metric     │ Whole-BDI │ Per-Inode │   Change   │
  ├────────────────┼───────────┼───────────┼────────────┤
  │ Aggregate MB/s │ 1478      │ 999       │ -32%       │
  ├────────────────┼───────────┼───────────┼────────────┤
  │ p99.9          │ 46 ms     │ 77 ms     │ +67% worse │
  └────────────────┴───────────┴───────────┴────────────┘

  This is the biggest regression. With whole-BDI, the flusher did one batched pass through all dirty inodes via writeback_sb_inodes(). With
  per-inode, each of 4 writers queues a separate work item processed serially by write_inode_now() — losing the batch I/O merging benefit.

  Scenario C & D (Noisy Neighbor)

  ┌─────────────────────────┬───────────┬───────────┬─────────────┐
  │         Metric          │ Whole-BDI │ Per-Inode │   Change    │
  ├─────────────────────────┼───────────┼───────────┼─────────────┤
  │ Scenario C writer       │ 1468      │ 1386      │ -6%         │
  ├─────────────────────────┼───────────┼───────────┼─────────────┤
  │ Scenario C readers      │ 18.7 MB/s │ 18.7 MB/s │ identical   │
  ├─────────────────────────┼───────────┼───────────┼─────────────┤
  │ Scenario D writer       │ 1472      │ 1467      │ identical   │
  ├─────────────────────────┼───────────┼───────────┼─────────────┤
  │ Scenario D readers      │ 496 MB/s  │ 507 MB/s  │ +2%         │
  ├─────────────────────────┼───────────┼───────────┼─────────────┤
  │ Scenario D reader p99.9 │ 440 us    │ 358 us    │ +19% better │
  └─────────────────────────┴───────────┴───────────┴─────────────┘

  Mixed-mode (Scenario D) is the intended production case and it's essentially identical or slightly better — per-inode writeback creates less
  device contention for buffered readers.

  Summary

  The per-inode approach is neutral-to-slightly-better for the production scenario (Scenario D), but regresses on multi-writer and random write
  workloads. The core issue is loss of I/O batching — writeback_sb_inodes() processes all dirty inodes in one blk_plug'd pass, while per-inode
  write_inode_now() calls are processed one at a time. The read regressions likely reflect different system conditions since buffered/direct
  reads also dropped ~20%.
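
  For reference, the batching at issue is the usual blk_plug pattern in
  the flusher's writeback pass -- roughly this shape (kernel-style
  sketch, not the actual wb_writeback() code):

```c
	struct blk_plug plug;

	blk_start_plug(&plug);
	/*
	 * writeback_sb_inodes() walks all dirty inodes on the sb and
	 * submits their pages; the plug holds back and merges the bios.
	 */
	writeback_sb_inodes(sb, wb, &work);
	blk_finish_plug(&plug);		/* one batched flush to the device */
```

  A per-inode write_inode_now() call sits outside any such shared plug,
  so each work item is submitted on its own.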

Thread overview: 16+ messages
2026-04-08 14:25 [PATCH v2 0/3] mm: improve write performance with RWF_DONTCACHE Jeff Layton
2026-04-08 14:25 ` [PATCH v2 1/3] mm: kick writeback flusher instead of inline flush for IOCB_DONTCACHE Jeff Layton
2026-04-09  1:40   ` Ritesh Harjani
2026-04-09  5:52     ` Christoph Hellwig
2026-04-16 22:49     ` Jeff Layton [this message]
2026-04-17  2:55       ` Ritesh Harjani
2026-04-09  5:50   ` Christoph Hellwig
2026-04-09  7:21     ` Jan Kara
2026-04-09 14:21       ` Christoph Hellwig
2026-04-10 10:41         ` Jan Kara
2026-04-10 11:05           ` Jeff Layton
2026-04-08 14:25 ` [PATCH v2 2/3] testing: add nfsd-io-bench NFS server benchmark suite Jeff Layton
2026-04-08 14:25 ` [PATCH v2 3/3] testing: add dontcache-bench local filesystem " Jeff Layton
2026-04-08 18:45 ` [PATCH v2 0/3] mm: improve write performance with RWF_DONTCACHE Jeff Layton
2026-04-09  6:06   ` Christoph Hellwig
2026-04-09  6:05 ` Christoph Hellwig
