From: Aubrey Li <aubrey.li@linux.intel.com>
To: Jan Kara <jack@suse.cz>, Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>,
Nanhai Zou <nanhai.zou@intel.com>,
Gang Deng <gang.deng@intel.com>,
Tianyou Li <tianyou.li@intel.com>,
Vinicius Gomes <vinicius.gomes@intel.com>,
Tim Chen <tim.c.chen@linux.intel.com>,
Chen Yu <yu.c.chen@intel.com>,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org,
Roman Gushchin <roman.gushchin@linux.dev>
Subject: Re: [PATCH] mm/readahead: Skip fully overlapped range
Date: Fri, 7 Nov 2025 18:28:00 +0800 [thread overview]
Message-ID: <48341947-dd12-4a89-870d-fb73f5121888@linux.intel.com> (raw)
In-Reply-To: <mze6nnqy2xwwqaz5xpwkthx3x4n6yd5vgbnyateyjlyjefiwde@qclv7inpacqe>
Really sorry for the late, too. Thunderbird collapsed this thread, but didn't
highlight it as unread, I thought no one response, :(
On 10/17/25 12:21 AM, Jan Kara wrote:
> Sorry for not replying earlier. I wanted make up my mind about this and
> other stuff was keeping preempting me...
>
> On Sat 11-10-25 15:20:42, Andrew Morton wrote:
>> On Tue, 30 Sep 2025 13:35:43 +0800 Aubrey Li <aubrey.li@linux.intel.com> wrote:
>>
>>> file_ra_state is considered a performance hint, not a critical correctness
>>> field. The race conditions on file's readahead state don't affect the
>>> correctness of file I/O because later the page cache mechanisms ensure data
>>> consistency, it won't cause wrong data to be read. I think that's why we do
>>> not lock file_ra_state today, to avoid performance penalties on this hot path.
>>>
>>> That said, this patch didn't make things worse, and it does take a risk but
>>> brings the rewards of RocksDB's readseq benchmark.
>>
>> So if I may summarize:
>>
>> - you've identifed and addressed an issue with concurrent readahead
>> against an fd
>
> Right but let me also note that the patch modifies only
> force_page_cache_ra() which is a pretty peculiar function. It's used at two
> places:
> 1) When page_cache_sync_ra() decides it isn't worth to do a proper
> readahead and just wants to read that one one.
>
> 2) From POSIX_FADV_WILLNEED - I suppose this is Aubrey's case.
>
> As such it seems to be fixing mostly a "don't do it when it hurts" kind of
> load from the benchmark than a widely used practical case since I'm not
> sure many programs call POSIX_FADV_WILLNEED from many threads in parallel
> for the same range.
>
>> - Jan points out that we don't properly handle concurrent access to a
>> file's ra_state. This is somewhat offtopic, but we should address
>> this sometime anyway. Then we can address the RocksDB issue later.
>>
>> Another practicality: improving a benchmark is nice, but do we have any
>> reasons to believe that this change will improve any real-world
>> workload? If so, which and by how much?
I only have RocksDB on my side, but this isn't a lab case but a real case.
It's an issue reported by a customer. They use this case to stress test the
system under high-concurrency data workloads, it could have business impact.
>
> The problem I had with the patch is that it adds more racy updates & checks
> for the shared ra state so it's kind of difficult to say whether some
> workload will not now more often clobber the ra state resulting in poor
> readahead behavior. Also as I looked into the patch now another objection I
> have is that force_page_cache_ra() previously didn't touch the ra state at
> all, it just read the requested pages. After the patch
> force_page_cache_ra() will destroy the readahead state completely. This is
> definitely something we don't want to do.
This is also something I worried about, so I added two trace points at the
entry and exit of force_page_cache_ra(), and I got all ZEROs.
test-9858 [018] ..... 554.352691: force_page_cache_ra: force_page_cache_ra entry: ra->start = 0, ra->size = 0
test-9858 [018] ..... 554.352695: force_page_cache_ra: force_page_cache_ra exit: ra->start = 0, ra->size = 0
test-9855 [009] ..... 554.352701: force_page_cache_ra: force_page_cache_ra entry: ra->start = 0, ra->size = 0
test-9855 [009] ..... 554.352705: force_page_cache_ra: force_page_cache_ra exit: ra->start = 0, ra->size = 0
I think for this code path, my patch doesn't break anything. Do we have any
other code paths I can check?
Anyway, thanks Andrew and Jan for the detailed feedback and discussion. if
we later plan to make file_ra_state concurrency-safe first, I'd be happy to
help test or rebase this optimization on top of that work.
Thanks,
-Aubrey
prev parent reply other threads:[~2025-11-07 10:30 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-23 3:59 [PATCH] mm/readahead: Skip fully overlapped range Aubrey Li
2025-09-23 3:49 ` Andrew Morton
2025-09-23 5:11 ` Aubrey Li
2025-09-23 9:57 ` Jan Kara
2025-09-24 0:27 ` Aubrey Li
2025-09-30 5:35 ` Aubrey Li
2025-10-11 22:20 ` Andrew Morton
2025-10-16 16:21 ` Jan Kara
2025-11-07 10:28 ` Aubrey Li [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=48341947-dd12-4a89-870d-fb73f5121888@linux.intel.com \
--to=aubrey.li@linux.intel.com \
--cc=akpm@linux-foundation.org \
--cc=gang.deng@intel.com \
--cc=jack@suse.cz \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=nanhai.zou@intel.com \
--cc=roman.gushchin@linux.dev \
--cc=tianyou.li@intel.com \
--cc=tim.c.chen@linux.intel.com \
--cc=vinicius.gomes@intel.com \
--cc=willy@infradead.org \
--cc=yu.c.chen@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.