From: Roman Gushchin <roman.gushchin@linux.dev>
To: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Matthew Wilcox <willy@infradead.org>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Liu Shixin <liushixin2@huawei.com>
Subject: Re: [PATCH] mm: consider disabling readahead if there are signs of thrashing
Date: Mon, 14 Jul 2025 13:12:59 -0700
Message-ID: <87ms96v8dw.fsf@linux.dev>
In-Reply-To: <at4ojyziprhhktjgtfmuyzrqwfmomnly6fubkvmbtxkdnx6hpb@5nldc3vipwny> (Jan Kara's message of "Mon, 14 Jul 2025 17:16:51 +0200")
Jan Kara <jack@suse.cz> writes:
> On Thu 10-07-25 12:52:32, Roman Gushchin wrote:
>> We've noticed in production that under a very heavy memory pressure
>> the readahead behavior becomes unstable causing spikes in memory
>> pressure and CPU contention on zone locks.
>>
>> The current mmap_miss heuristics considers minor pagefaults as a
>> good reason to decrease mmap_miss and conditionally start async
>> readahead. This creates a vicious cycle: asynchronous readahead
>> loads more pages, which in turn causes more minor pagefaults.
>> This problem is especially pronounced when multiple threads of
>> an application fault on consecutive pages of an evicted executable,
>> aggressively lowering the mmap_miss counter and preventing readahead
>> from being disabled.
>
> I think you're talking about filemap_map_pages() logic of handling
> mmap_miss. It would be nice to mention it in the changelog. There's one
> thing that doesn't quite make sense to me: When there's memory pressure,
> I'd expect the pages to be reclaimed from memory and not just unmapped.
> Also given your solution uses !uptodate folios suggests the pages were
> actually fully reclaimed and the problem really is that filemap_map_pages()
> treats as minor page fault (i.e., cache hit) what is in fact a major page
> fault (i.e., cache miss)?
>
> Actually, now that I digged deeper I've remembered that based on Liu
> Shixin's report
> (https://lore.kernel.org/all/20240201100835.1626685-1-liushixin2@huawei.com/)
> which sounds a lot like what you're reporting, we have eventually merged his
> fixes (ended up as commits 0fd44ab213bc ("mm/readahead: break read-ahead
> loop if filemap_add_folio return -ENOMEM"), 5c46d5319bde ("mm/filemap:
> don't decrease mmap_miss when folio has workingset flag")). Did you test a
> kernel with these fixes (6.10 or later)? In particular after these fixes
> the !folio_test_workingset() check in filemap_map_folio_range() and
> filemap_map_order0_folio() should make sure we don't decrease mmap_miss
> when faulting fresh pages. Or was in your case page evicted so long ago
> that workingset bit is already clear?
Hi Jan!

I've tried to cherry-pick those changes into the kernel I'm looking at
(it's 6.6-based), but it didn't help much. I haven't looked into why, but
at some point I added traces to filemap_map_pages() and I don't remember
seeing anything, most likely because the pages are not uptodate at the
moment of the fault.

In my case I saw a large number of minor pagefaults on consecutive pages.
It seems like one thread takes a major pagefault and then a bunch
of other threads fault on the following pages, effectively breaking
the mmap_miss heuristics. Sometimes mmap_miss reaches 1000, but it
struggles to stay there.
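
To make the counter arithmetic concrete, here is a toy userspace sketch,
not kernel code. It assumes a simplified model of what mm/filemap.c does:
a major fault bumps mmap_miss by one up to a cap of 1000, each minor fault
that maps a page lowers it by one down to zero, and readaround is skipped
while the counter is above 100. The toy_* names are hypothetical, and the
constants are my reading of MMAP_LOTSAMISS and should be checked against
the tree in question.

```c
#include <assert.h>

/* Assumed values, modelled on MMAP_LOTSAMISS in mm/filemap.c. */
#define TOY_LOTSAMISS 100
#define TOY_MISS_CAP  (TOY_LOTSAMISS * 10)

/* Hypothetical stand-in for the mmap_miss field of the readahead state. */
struct toy_ra {
	unsigned int mmap_miss;
};

/* Major fault: the page was not in the page cache, count one miss. */
static void toy_major_fault(struct toy_ra *ra)
{
	if (ra->mmap_miss < TOY_MISS_CAP)
		ra->mmap_miss++;
}

/* Minor faults: nr pages were found and mapped, each counted as a hit. */
static void toy_minor_faults(struct toy_ra *ra, unsigned int nr)
{
	ra->mmap_miss = (nr >= ra->mmap_miss) ? 0 : ra->mmap_miss - nr;
}

/* Readaround stays enabled until the counter exceeds the threshold. */
static int toy_readaround_enabled(const struct toy_ra *ra)
{
	return ra->mmap_miss <= TOY_LOTSAMISS;
}

/*
 * Each round: one thread takes a major fault, then `followers` other
 * threads take minor faults on the following pages.
 * Returns the final value of the counter.
 */
static unsigned int toy_simulate(unsigned int start, unsigned int rounds,
				 unsigned int followers)
{
	struct toy_ra ra = { .mmap_miss = start };

	assert(start <= TOY_MISS_CAP);
	for (unsigned int i = 0; i < rounds; i++) {
		toy_major_fault(&ra);
		toy_minor_faults(&ra, followers);
	}
	return ra.mmap_miss;
}
```

In this model, with more than one follower per major fault the counter
only goes down: starting at the cap, toy_simulate(1000, 200, 7) ends at
zero, so readaround gets re-enabled even though every round began with a
real cache miss. With no followers the counter climbs by one per round and
saturates at 1000.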
>
> Once we better understand the situation, let me also mention that I have
> two patches which I originally proposed to fix Liu's problems. They didn't
> quite fix them so his patches got merged in the end but the problems
> described there are still somewhat valid:
>
> mm/readahead: Improve page readaround miss detection
>
> filemap_map_pages() decreases ra->mmap_miss for every page it maps. This
> however overestimates number of real cache hits because we have no idea
> whether the application will use the pages we map or not. This is
> problematic in particular in memory constrained situations where we
> think we have great readahead success rate although in fact we are just
> trashing page cache & disk. Change filemap_map_pages() to count only
> success of mapping the page we are faulting in. This should be actually
> enough to keep mmap_miss close to 0 for workloads doing sequential reads
> because filemap_map_pages() does not map page with readahead flag and
> thus these are going to contribute to decreasing the mmap_miss counter.
>
> Fixes: f1820361f83d ("mm: implement ->map_pages for page cache")
>
> -
> mm/readahead: Fix readahead miss detection with FAULT_FLAG_RETRY_NOWAIT
>
> When the page fault happens with FAULT_FLAG_RETRY_NOWAIT (which is
> common) we will bail out of the page fault after issuing reads and retry
> the fault. That will then find the created pages in filemap_map_pages()
> and hence will be treated as cache hit canceling out the cache miss in
> do_sync_mmap_readahead(). Increment mmap_miss by two in
> do_sync_mmap_readahead() in case FAULT_FLAG_RETRY_NOWAIT is set to
> account for the following expected hit. If the page gets evicted even
> before we manage to retry the fault, we are under so heavy memory
> pressure that increasing mmap_miss by two is fine.
>
> Fixes: d065bd810b6d ("mm: retry page fault when blocking on disk transfer")
Yeah, this looks interesting...
Thanks!