From: Jeff Layton <jlayton@kernel.org>
To: Ritesh Harjani <ritesh.list@gmail.com>,
	Alexander Viro	 <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Andrew Morton	 <akpm@linux-foundation.org>,
	David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <ljs@kernel.org>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka	 <vbabka@kernel.org>,
	Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan	 <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	Mike Snitzer	 <snitzer@kernel.org>, Jens Axboe <axboe@kernel.dk>,
	Christoph Hellwig	 <hch@infradead.org>,
	Kairui Song <kasong@tencent.com>, Qi Zheng	 <qi.zheng@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Barry Song	 <baohua@kernel.org>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Yuanchu Xie	 <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
	Steven Rostedt	 <rostedt@goodmis.org>,
	Masami Hiramatsu <mhiramat@kernel.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Chuck Lever <chuck.lever@oracle.com>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	 linux-nfs@vger.kernel.org, linux-mm@kvack.org,
	 linux-trace-kernel@vger.kernel.org
Subject: Re: [PATCH v3 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
Date: Mon, 27 Apr 2026 11:44:14 +0100
Message-ID: <bb418f9a7bfcabc3070b412c745c5b6456d592b9.camel@kernel.org>
In-Reply-To: <qzo1s6a4.ritesh.list@gmail.com>

On Mon, 2026-04-27 at 04:01 +0530, Ritesh Harjani wrote:
> Jeff Layton <jlayton@kernel.org> writes:
> 
> > The IOCB_DONTCACHE writeback path in generic_write_sync() calls
> > filemap_flush_range() on every write, submitting writeback inline in
> > the writer's context.  Perf lock contention profiling shows the
> > performance problem is not lock contention but the writeback submission
> > work itself — walking the page tree and submitting I/O blocks the writer
> > for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
> > (dontcache).
> > 
> > Replace the inline filemap_flush_range() call with a flusher kick that
> > drains dirty pages in the background.  This moves writeback submission
> > completely off the writer's hot path.
> > 
> > To avoid flushing unrelated buffered dirty data, add a dedicated
> > WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
> > the new NR_DONTCACHE_DIRTY counter to determine how many pages to write
> > back.  The flusher writes back that many pages from the oldest dirty
> > inodes (not restricted to dontcache-specific inodes). This helps
> > preserve I/O batching while limiting the scope of expedited writeback.
> > 
> 
> Yup, so we wake up the writeback flusher, which will write that many
> dirty pages. The pages written back can be of any type, though: DONTCACHE
> or normal (non-dontcache) dirty pages. IIUC, writeback doesn't
> distinguish between them while writing.
> 

Correct. This was the approach that Jan and HCH suggested in the
responses to the last posting.
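
As a rough illustration of the behaviour described above, here is a toy
user-space model, not kernel code. The class and method names are
hypothetical; the flag and counter only stand in for WB_start_dontcache and
NR_DONTCACHE_DIRTY, and the rule that the counter drops only when dontcache
pages get cleaned is an assumption of the sketch.

```python
class FlusherModel:
    """Toy model of a coalescing flusher kick (illustrative only)."""

    def __init__(self):
        self.start_dontcache = False  # stands in for the WB_start_dontcache bit
        self.nr_dontcache_dirty = 0   # stands in for NR_DONTCACHE_DIRTY
        self.dirty = []               # oldest-first list of (kind, pages)
        self.wakeups = 0              # number of flusher wakeups requested

    def write(self, kind, pages):
        """Dirty some pages; dontcache writes also kick the flusher."""
        self.dirty.append((kind, pages))
        if kind == "dontcache":
            self.nr_dontcache_dirty += pages
            # Coalesce: only the first kick since the last run wakes the flusher.
            if not self.start_dontcache:
                self.start_dontcache = True
                self.wakeups += 1

    def run_flusher(self):
        """Write back up to nr_dontcache_dirty pages, oldest first."""
        if not self.start_dontcache:
            return []
        self.start_dontcache = False
        budget = self.nr_dontcache_dirty
        written = []
        while budget > 0 and self.dirty:
            kind, pages = self.dirty.pop(0)
            n = min(pages, budget)
            if n < pages:
                # Partially written entry goes back to the head of the list.
                self.dirty.insert(0, (kind, pages - n))
            if kind == "dontcache":
                self.nr_dontcache_dirty -= n  # assumption: cleaned pages leave the count
            written.append((kind, n))
            budget -= n
        return written
```

With 8 older buffered pages followed by two 4-page dontcache writes, the
model requests a single wakeup, and the 8-page budget is spent entirely on
the older buffered data. That is the "writeback doesn't distinguish" case
Ritesh describes.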

> 
> IMO, the commit msg could also explain why this approach was taken.
> IIUC, that is because writing NR_DONTCACHE_DIRTY pages still reduces
> page cache pressure and the amount of work that reclaim has to do,
> even though some of those pages may be non-dontcache pages if there
> was a parallel buffered write in the system.
> 

Good suggestion. I'll add that.

> 
> Also should the following change be documented somewhere? Like in Man
> page maybe? i.e.
> Earlier RWF_DONTCACHE writes made sure that those dirty pages are
> immediately submitted for writeback and completion would release those
> pages. But now, in certain cases when there is a mixed buffered write in
> the system, those dontcache dirty pages might be written back after a
> delay (whenever the next time writeback kicks in).
> However for RWF_DONTCACHE reads, it should not affect anything.
> 

Looks like DONTCACHE is documented in the preadv2/pwritev2 manpage. Here's
the current blurb about writes:

    Additionally, any range dirtied by a write operation with RWF_DONT‐
    CACHE  set  will  get kicked off for writeback.  This is similar to
    calling  sync_file_range(2)  with  SYNC_FILE_RANGE_WRITE  to  start
    writeback on the given range.  RWF_DONTCACHE is a hint, or best ef‐
    fort,  where  no hard guarantees are given on the state of the page
    cache once the operation completes.

I don't think this verbiage is invalid after this change. Kicking off
writeback is still just a hint, like it was before. We could mention
that this I/O can now compete with regular buffered I/O, but that seems
like adding detail that would just confuse users.

> > Like WB_start_all, the WB_start_dontcache bit coalesces multiple
> > DONTCACHE writes into a single flusher wakeup without per-write
> > allocations.
> > 
> > Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
> > visibility, and target the correct cgroup writeback domain via
> > unlocked_inode_to_wb_begin().
> > 
> > dontcache-bench results on dual-socket Xeon Gold 6138 (80 CPUs, 256 GB
> > RAM, Samsung MZ1LB1T9HALS 1.7 TB NVMe, local XFS, io_uring, file size
> > ~503 GB, compared to a v6.19-ish baseline):
> > 
> 
> Can we please also test parallel buffered writes and dontcache writes?
> Since this patch series definitely affects that.
>
> BTW - adding these numbers in the commit msg itself is very helpful.
> 

To be clear, this only affects DONTCACHE, not normal buffered writes,
but I guess you're referring to the fact that DONTCACHE and buffered
writes can compete now.

Can you clarify specifically what you'd like me to test here? Are you
saying you want me to run DONTCACHE and buffered writes in parallel
(i.e. make them compete)?

I should be able to do that for the local benchmarks, but nfsd's iomode
settings are global and that won't be possible there.

> >   Single-client sequential write (MB/s):
> >                        baseline    patched     change
> >   buffered              1449.8     1440.1      -0.7%
> >   dontcache             1347.9     1461.5      +8.4%
> >   direct                1450.0     1440.1      -0.7%
> > 
> >   Single-client sequential write latency (us):
> >                        baseline    patched     change
> >   dontcache p50         3031.0    10551.3    +248.1%
> >   dontcache p99        74973.2    21626.9     -71.2%
> >   dontcache p99.9      85459.0    23199.7     -72.9%
> > 
> >   Single-client random write (MB/s):
> >                        baseline    patched     change
> >   dontcache              284.2      295.4      +3.9%
> > 
> >   Single-client random write p99.9 latency (us):
> >                        baseline    patched     change
> >   dontcache             2277.4      872.4     -61.7%
> > 
> >   Multi-writer aggregate throughput (MB/s):
> 
> Can you please describe this test scenario, if possible? Above you
> mentioned the file size is 2x RAM_SIZE, but your multi-client test
> says something else:
> 
> local num_clients=4
> +	mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
> +	client_size="$(( mem_kb / 1024 / num_clients ))M"
> 
> Also the multi-writer case is spawning parallel fio jobs, and then
> parsing and aggregating the bandwidth results instead of using fio to
> spawn multiple parallel threads... which is ok, but a bit weird.
> Why not let fio do the aggregate bandwidth and latency calculation
> instead?
> 

That's what I get for asking Claude to roll a testsuite. I'm not that
well-versed in fio, but that makes sense. I'll have a look at reworking
it along those lines.
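
On the file-size question, the quoted shell snippet works out as follows.
This is a small sketch that mirrors the script's integer arithmetic; the
function name is hypothetical, and the MemTotal value assumes exactly
256 GiB, whereas the real /proc/meminfo figure on the test box would be
somewhat lower.

```python
def client_size_mb(mem_kb: int, num_clients: int) -> int:
    """Mirror the shell expression $(( mem_kb / 1024 / num_clients ))."""
    return mem_kb // 1024 // num_clients


# Assumed MemTotal for a 256 GiB machine, in kB (hypothetical round number).
mem_kb = 256 * 1024 * 1024
per_client_mb = client_size_mb(mem_kb, 4)
total_mb = per_client_mb * 4

print(per_client_mb, "MB per client;", total_mb // 1024, "GB in total")
```

Under that assumption each of the 4 clients writes about 64 GB, so the
multi-writer test dirties roughly 1x RAM in aggregate, rather than the ~2x
RAM (~503 GB) used for the single-client runs.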

> >                        baseline    patched     change
> >   buffered              1619.5     1611.2      -0.5%
> >   dontcache             1281.1     1629.4     +27.2%
> >   direct                1545.4     1609.4      +4.1%
> > 
> >   Mixed-mode noisy neighbor (dontcache writer + buffered readers):
> >                        baseline    patched     change
> >   writer (MB/s)         1297.6     1471.1     +13.4%
> >   readers avg (MB/s)     855.0      462.4     -45.9%
> > 
> > nfsd-io-bench results on same hardware (XFS on NVMe, NFSv3 via fio
> > NFS engine with libnfs, 1024 NFSD threads, pool_mode=pernode,
> > file size ~502 GB, compared to v6.19-ish baseline):
> > 
> >   Single-client sequential write (MB/s):
> >                        baseline    patched     change
> >   buffered              4844.2     4653.4      -3.9%
> >   dontcache             3028.3     3723.1     +22.9%
> >   direct                 957.6      987.8      +3.2%
> > 
> >   Single-client sequential write p99.9 latency (us):
> >                        baseline    patched     change
> >   dontcache            759169.0   175112.2     -76.9%
> > 
> >   Single-client random write (MB/s):
> >                        baseline    patched     change
> >   dontcache              590.0     1561.0    +164.6%
> > 
> >   Multi-writer aggregate throughput (MB/s):
> >                        baseline    patched     change
> >   buffered              9636.3     9422.9      -2.2%
> >   dontcache             1894.9     9442.6    +398.3%
> >   direct                 809.6      975.1     +20.4%
> > 
> >   Noisy neighbor (dontcache writer + random readers):
> >                        baseline    patched     change
> >   writer (MB/s)         1854.5     4063.6    +119.1%
> >   readers avg (MB/s)     131.2      101.6     -22.5%
> > 
> > The NFS results show even larger improvements than the local benchmarks.
> > Multi-writer dontcache throughput improves nearly 5x, matching buffered
> > I/O. Dirty page footprint drops 85-95% in sequential workloads vs.
> > buffered.
> > 
> 
> Nice :)
> Could you add some explanation of why NFS sees a 5x improvement compared
> to local filesystems?
> (I am not very familiar with the NFS side, but a possible reasoning would help)
> 

I suspect that it's because of the "scattered" nature of nfsd writes.
When the client sends a write to nfsd, we wake a nfsd thread to service
it. So, if there are a lot of writes operating in parallel, they all
get done in the context of different tasks.

My hunch is that this I/O pattern (writing to the same file from a bunch
of different threads) particularly suffers from the DONTCACHE inline
write behavior. The threads all end up competing to submit jobs to the
queue, and that causes the performance to fall off sharply.
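
For reference, the "nearly 5x" and "+27.2%" figures can be rechecked from
the multi-writer dontcache rows quoted above. The helper names below are
hypothetical; the inputs are the MB/s values from the tables.

```python
def change_pct(baseline: float, patched: float) -> float:
    """Relative change from baseline to patched, in percent."""
    return (patched - baseline) / baseline * 100.0


def speedup(baseline: float, patched: float) -> float:
    """Patched throughput as a multiple of baseline."""
    return patched / baseline


# Multi-writer aggregate dontcache throughput (MB/s) from the quoted tables.
nfs_base, nfs_patched = 1894.9, 9442.6
local_base, local_patched = 1281.1, 1629.4

print(f"NFS:   {change_pct(nfs_base, nfs_patched):+.1f}% "
      f"({speedup(nfs_base, nfs_patched):.2f}x)")
print(f"local: {change_pct(local_base, local_patched):+.1f}% "
      f"({speedup(local_base, local_patched):.2f}x)")
```

The NFS case comes out just under 5x, against roughly 1.27x for the local
run.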

Thanks for the review!
-- 
Jeff Layton <jlayton@kernel.org>

