From: Roman Gushchin <roman.gushchin@linux.dev>
To: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Matthew Wilcox <willy@infradead.org>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Liu Shixin <liushixin2@huawei.com>
Subject: Re: [PATCH] mm: consider disabling readahead if there are signs of thrashing
Date: Fri, 25 Jul 2025 16:25:49 -0700 [thread overview]
Message-ID: <87jz3vdf9e.fsf@linux.dev> (raw)
In-Reply-To: <875xffsxj4.fsf@linux.dev> (Roman Gushchin's message of "Fri, 25 Jul 2025 15:42:07 -0700")
Roman Gushchin <roman.gushchin@linux.dev> writes:
> Jan Kara <jack@suse.cz> writes:
>
>> On Thu 10-07-25 12:52:32, Roman Gushchin wrote:
>>> We've noticed in production that under a very heavy memory pressure
>>> the readahead behavior becomes unstable causing spikes in memory
>>> pressure and CPU contention on zone locks.
>>>
>>> The current mmap_miss heuristics considers minor pagefaults as a
>>> good reason to decrease mmap_miss and conditionally start async
>>> readahead. This creates a vicious cycle: asynchronous readahead
>>> loads more pages, which in turn causes more minor pagefaults.
>>> This problem is especially pronounced when multiple threads of
>>> an application fault on consecutive pages of an evicted executable,
>>> aggressively lowering the mmap_miss counter and preventing readahead
>>> from being disabled.
>>
>> I think you're talking about filemap_map_pages() logic of handling
>> mmap_miss. It would be nice to mention it in the changelog. There's one
>> thing that doesn't quite make sense to me: When there's memory pressure,
>> I'd expect the pages to be reclaimed from memory and not just unmapped.
>> Also given your solution uses !uptodate folios suggests the pages were
>> actually fully reclaimed and the problem really is that filemap_map_pages()
>> treats as minor page fault (i.e., cache hit) what is in fact a major page
>> fault (i.e., cache miss)?
>>
>> Actually, now that I digged deeper I've remembered that based on Liu
>> Shixin's report
>> (https://lore.kernel.org/all/20240201100835.1626685-1-liushixin2@huawei.com/)
>> which sounds a lot like what you're reporting, we have eventually merged his
>> fixes (ended up as commits 0fd44ab213bc ("mm/readahead: break read-ahead
>> loop if filemap_add_folio return -ENOMEM"), 5c46d5319bde ("mm/filemap:
>> don't decrease mmap_miss when folio has workingset flag")). Did you test a
>> kernel with these fixes (6.10 or later)? In particular after these fixes
>> the !folio_test_workingset() check in filemap_map_folio_range() and
>> filemap_map_order0_folio() should make sure we don't decrease mmap_miss
>> when faulting fresh pages. Or was in your case page evicted so long ago
>> that workingset bit is already clear?
>>
>> Once we better understand the situation, let me also mention that I have
>> two patches which I originally proposed to fix Liu's problems. They didn't
>> quite fix them so his patches got merged in the end but the problems
>> described there are still somewhat valid:
>
> Ok, I got a better understanding of the situation now. Basically we have
> a multi-threaded application which is under very heavy memory pressure.
> I multiple threads are faulting simultaneously into the same page,
> do_sync_mmap_readahead() can be called multiple times for the same page.
> This creates a negative pressure on the mmap_miss counter, which can't be
> matched by do_sync_mmap_readahead(), which is be called only once
> for every page. This basically keeps the readahead on, despite the heavy
> memory pressure.
>
> The following patch solves the problem, at least in my test scenario.
> Wdyt?
Actually, a better version is below. We don't have to avoid the actual
readahead, just not decrease mmap_miss if the page is locked.
--
diff --git a/mm/filemap.c b/mm/filemap.c
index 0d0369fb5fa1..1756690dd275 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3323,9 +3323,15 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
return fpin;
- mmap_miss = READ_ONCE(ra->mmap_miss);
- if (mmap_miss)
- WRITE_ONCE(ra->mmap_miss, --mmap_miss);
+ /* If folio is locked, we're likely racing against another fault,
+ * don't decrease the mmap_miss counter to avoid decreasing it
+ * multiple times for the same page and break the balance.
+ */
+ if (likely(!folio_test_locked(folio))) {
+ mmap_miss = READ_ONCE(ra->mmap_miss);
+ if (mmap_miss)
+ WRITE_ONCE(ra->mmap_miss, --mmap_miss);
+ }
if (folio_test_readahead(folio)) {
fpin = maybe_unlock_mmap_for_io(vmf, fpin);
next prev parent reply other threads:[~2025-07-25 23:25 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-07-10 19:52 [PATCH] mm: consider disabling readahead if there are signs of thrashing Roman Gushchin
2025-07-10 20:57 ` Andrew Morton
2025-07-10 22:54 ` Roman Gushchin
2025-07-10 21:43 ` Matthew Wilcox
2025-07-11 16:29 ` Roman Gushchin
2025-07-14 15:16 ` Jan Kara
2025-07-14 20:12 ` Roman Gushchin
2025-07-25 22:42 ` Roman Gushchin
2025-07-25 23:25 ` Roman Gushchin [this message]
2025-07-28 9:16 ` Jan Kara
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87jz3vdf9e.fsf@linux.dev \
--to=roman.gushchin@linux.dev \
--cc=akpm@linux-foundation.org \
--cc=jack@suse.cz \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=liushixin2@huawei.com \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.