From: Jeff Layton <jlayton@kernel.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <ljs@kernel.org>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@kernel.org>,
	Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	Mike Snitzer <snitzer@kernel.org>, Jens Axboe <axboe@kernel.dk>,
	Ritesh Harjani <ritesh.list@gmail.com>,
	Christoph Hellwig <hch@infradead.org>,
	Kairui Song <kasong@tencent.com>, Qi Zheng <qi.zheng@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Barry Song <baohua@kernel.org>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Masami Hiramatsu <mhiramat@kernel.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Chuck Lever <chuck.lever@oracle.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-nfs@vger.kernel.org, linux-mm@kvack.org,
	linux-trace-kernel@vger.kernel.org
Subject: Re: [PATCH v3 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
Date: Sun, 26 Apr 2026 14:25:36 -0400
Message-ID: <c40b691f76e6d397b889b4ee27e43490320660f1.camel@kernel.org>
In-Reply-To: <20260426052854.8372fb9d4c616f16a8aa0a0f@linux-foundation.org>

On Sun, 2026-04-26 at 05:28 -0700, Andrew Morton wrote:
> Naive questions...
>
> On Sun, 26 Apr 2026 07:56:08 -0400 Jeff Layton <jlayton@kernel.org> wrote:
>
> > The IOCB_DONTCACHE writeback path in generic_write_sync() calls
> > filemap_flush_range() on every write, submitting writeback inline in
> > the writer's context. Perf lock contention profiling shows the
> > performance problem is not lock contention but the writeback submission
> > work itself — walking the page tree and submitting I/O blocks the writer
> > for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
> > (dontcache).
>
> So in the current case, when generic_write_sync() returns, all that
> memory is written back and clean&reclaimable (or freed?), yes?
>
> > Replace the inline filemap_flush_range() call with a flusher kick that
> > drains dirty pages in the background. This moves writeback submission
> > completely off the writer's hot path.
>
> Whereas after this change, that pagecache is probably still dirty,
> unreclaimable, waiting for the flusher to do its thing?
>
> So is there potential that the system will get all gummed up with
> dirty, to-be-written-soon pagecache? Is there something which limits
> this buildup?
>
> > ...
> >
> > dontcache-bench results on dual-socket Xeon Gold 6138 (80 CPUs, 256 GB
> > RAM, Samsung MZ1LB1T9HALS 1.7 TB NVMe, local XFS, io_uring, file size
> > ~503 GB, compared to a v6.19-ish baseline):
> >
> > Single-client sequential write (MB/s):
> >                         baseline   patched    change
> >   buffered                1449.8    1440.1     -0.7%
> >   dontcache               1347.9    1461.5     +8.4%
> >   direct                  1450.0    1440.1     -0.7%
> >
> > Single-client sequential write latency (us):
> >                         baseline   patched    change
> >   dontcache p50           3031.0   10551.3   +248.1%
> >   dontcache p99          74973.2   21626.9    -71.2%
> >   dontcache p99.9        85459.0   23199.7    -72.9%
> >
> > Single-client random write (MB/s):
> >                         baseline   patched    change
> >   dontcache                284.2     295.4     +3.9%
> >
> > Single-client random write p99.9 latency (us):
> >                         baseline   patched    change
> >   dontcache               2277.4     872.4    -61.7%
> >
> > Multi-writer aggregate throughput (MB/s):
> >                         baseline   patched    change
> >   buffered                1619.5    1611.2     -0.5%
> >   dontcache               1281.1    1629.4    +27.2%
> >   direct                  1545.4    1609.4     +4.1%
> >
> > Mixed-mode noisy neighbor (dontcache writer + buffered readers):
> >                         baseline   patched    change
> >   writer (MB/s)           1297.6    1471.1    +13.4%
> >   readers avg (MB/s)       855.0     462.4    -45.9%
>
> These results look ambiguous. Sometimes better, sometimes worse?
>

Forgot to comment on this part earlier...

This is the "mixed-mode" test (dontcache writes + buffered reads). I
tried a number of different settings under nfsd, and that combination
turned out to perform the best with this benchmark.

I suspect that the higher write throughput we get from writing via the
flusher thread is crowding out reads, so read throughput suffers in
this test. There are a number of ways we could probably make that more
fair.

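
For anyone who wants to double-check the "change" column in the
mixed-mode table quoted above, here is a minimal sketch in Python. It
only recomputes the relative change from the baseline and patched
figures in that table; pct_change() is a hypothetical helper name used
for illustration and is not part of the benchmark suite.

```python
def pct_change(baseline: float, patched: float) -> float:
    """Relative change from baseline to patched, in percent."""
    return (patched - baseline) / baseline * 100.0


# Figures copied from the quoted mixed-mode noisy-neighbor table (MB/s).
mixed_mode = {
    "writer": (1297.6, 1471.1),
    "readers avg": (855.0, 462.4),
}

for name, (baseline, patched) in mixed_mode.items():
    change = pct_change(baseline, patched)
    print(f"{name:12s} {baseline:8.1f} {patched:8.1f} {change:+7.1f}%")
```

That puts the tradeoff in one place: the writer gains roughly 13%,
while the readers lose roughly 46% on average.
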
> > nfsd-io-bench results on same hardware (XFS on NVMe, NFSv3 via fio
> > NFS engine with libnfs, 1024 NFSD threads, pool_mode=pernode,
> > file size ~502 GB, compared to v6.19-ish baseline):
> >
> > Single-client sequential write (MB/s):
> >                         baseline   patched    change
> >   buffered                4844.2    4653.4     -3.9%
> >   dontcache               3028.3    3723.1    +22.9%
> >   direct                   957.6     987.8     +3.2%
> >
> > Single-client sequential write p99.9 latency (us):
> >                         baseline   patched    change
> >   dontcache             759169.0  175112.2    -76.9%
> >
> > Single-client random write (MB/s):
> >                         baseline   patched    change
> >   dontcache                590.0    1561.0   +164.6%
> >
> > Multi-writer aggregate throughput (MB/s):
> >                         baseline   patched    change
> >   buffered                9636.3    9422.9     -2.2%
> >   dontcache               1894.9    9442.6   +398.3%
> >   direct                   809.6     975.1    +20.4%
> >
> > Noisy neighbor (dontcache writer + random readers):
> >                         baseline   patched    change
> >   writer (MB/s)           1854.5    4063.6   +119.1%
> >   readers avg (MB/s)       131.2     101.6    -22.5%
>
> Ditto but less so.
>

Same reason for the drop, I think.

> > The NFS results show even larger improvements than the local benchmarks.
> > Multi-writer dontcache throughput improves nearly 5x, matching buffered
> > I/O. Dirty page footprint drops 85-95% in sequential workloads vs.
> > buffered.
>
> It sounds that you like the results, so OK ;)

I think it's a win overall. As with anything writeback-related, it's a
game of tradeoffs. The good news is that DONTCACHE is still fairly new
and not many applications are using it yet, so the blast radius from
any change here should be rather small.

As a side note: I've long thought that we in general wait too long to
kick off writeback with normal buffered I/O, particularly with modern
memory sizes. DONTCACHE gives us a place to experiment with this
scheme, but we may want to think about kicking off writeback earlier in
the normal buffered case too.

--
Jeff Layton <jlayton@kernel.org>