linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm: consider disabling readahead if there are signs of thrashing
@ 2025-07-10 19:52 Roman Gushchin
  2025-07-10 20:57 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Roman Gushchin @ 2025-07-10 19:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Matthew Wilcox, linux-mm, linux-kernel, Liu Shixin,
	Roman Gushchin

We've noticed in production that under very heavy memory pressure the
readahead behavior becomes unstable, causing spikes in memory pressure
and CPU contention on zone locks.

The current mmap_miss heuristic treats minor pagefaults as a good
reason to decrease mmap_miss and conditionally start async readahead.
This creates a vicious cycle: asynchronous readahead loads more pages,
which in turn causes more minor pagefaults. The problem is especially
pronounced when multiple threads of an application fault on
consecutive pages of an evicted executable, aggressively lowering the
mmap_miss counter and preventing readahead from being disabled.

To improve the logic, let's check for !uptodate workingset folios in
do_async_mmap_readahead(). The presence of such folios is a strong
indicator of thrashing, which is also used by the delay accounting
code, e.g. in folio_wait_bit_common(). So instead of decreasing
mmap_miss and lowering the chances of readahead being disabled, let's
do the opposite and bump it by MMAP_LOTSAMISS / 2.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: linux-mm@kvack.org
---
 mm/filemap.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 0d0369fb5fa1..ec3f611c3320 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3324,6 +3324,17 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
 		return fpin;
 
 	mmap_miss = READ_ONCE(ra->mmap_miss);
+	if (unlikely(!folio_test_uptodate(folio) &&
+		     folio_test_workingset(folio))) {
+		/*
+		 * If there are signs of thrashing, take a big step
+		 * towards disabling readahead.
+		 */
+		mmap_miss += MMAP_LOTSAMISS / 2;
+		mmap_miss = min(mmap_miss, MMAP_LOTSAMISS * 10);
+		WRITE_ONCE(ra->mmap_miss, mmap_miss);
+		return fpin;
+	}
 	if (mmap_miss)
 		WRITE_ONCE(ra->mmap_miss, --mmap_miss);
 
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm: consider disabling readahead if there are signs of thrashing
  2025-07-10 19:52 [PATCH] mm: consider disabling readahead if there are signs of thrashing Roman Gushchin
@ 2025-07-10 20:57 ` Andrew Morton
  2025-07-10 22:54   ` Roman Gushchin
  2025-07-10 21:43 ` Matthew Wilcox
  2025-07-14 15:16 ` Jan Kara
  2 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2025-07-10 20:57 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Jan Kara, Matthew Wilcox, linux-mm, linux-kernel, Liu Shixin

On Thu, 10 Jul 2025 12:52:32 -0700 Roman Gushchin <roman.gushchin@linux.dev> wrote:

> We've noticed in production that under a very heavy memory pressure
> the readahead behavior becomes unstable causing spikes in memory
> pressure and CPU contention on zone locks.
> 
> The current mmap_miss heuristics considers minor pagefaults as a
> good reason to decrease mmap_miss and conditionally start async
> readahead. This creates a vicious cycle: asynchronous readahead
> loads more pages, which in turn causes more minor pagefaults.
> This problem is especially pronounced when multiple threads of
> an application fault on consecutive pages of an evicted executable,
> aggressively lowering the mmap_miss counter and preventing readahead
> from being disabled.
> 
> To improve the logic let's check for !uptodate and workingset
> folios in do_async_mmap_readahead(). The presence of such pages
> is a strong indicator of thrashing, which is also used by the
> delay accounting code, e.g. in folio_wait_bit_common(). So instead
> of decreasing mmap_miss and lower chances to disable readahead,
> let's do the opposite and bump it by MMAP_LOTSAMISS / 2.

Are there any testing results to share?

What sort of workloads might be harmed by this change?

We do seem to be thrashing around (heh) with these readahead
heuristics.  Lots of potential for playing whack-a-mole.  

Should we make the readahead code more observable?  We don't seem to
have much in the way of statistics, counters, etc.  And no tracepoints,
which is surprising.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm: consider disabling readahead if there are signs of thrashing
  2025-07-10 19:52 [PATCH] mm: consider disabling readahead if there are signs of thrashing Roman Gushchin
  2025-07-10 20:57 ` Andrew Morton
@ 2025-07-10 21:43 ` Matthew Wilcox
  2025-07-11 16:29   ` Roman Gushchin
  2025-07-14 15:16 ` Jan Kara
  2 siblings, 1 reply; 10+ messages in thread
From: Matthew Wilcox @ 2025-07-10 21:43 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Jan Kara, linux-mm, linux-kernel, Liu Shixin

On Thu, Jul 10, 2025 at 12:52:32PM -0700, Roman Gushchin wrote:
> We've noticed in production that under a very heavy memory pressure
> the readahead behavior becomes unstable causing spikes in memory
> pressure and CPU contention on zone locks.
> 
> The current mmap_miss heuristics considers minor pagefaults as a
> good reason to decrease mmap_miss and conditionally start async
> readahead. This creates a vicious cycle: asynchronous readahead
> loads more pages, which in turn causes more minor pagefaults.

Is the correct response to turn off faultaround, or would we be better
off scaling it down (eg as low as 64k)?

I like the signal you're using; I think that makes a lot of sense.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm: consider disabling readahead if there are signs of thrashing
  2025-07-10 20:57 ` Andrew Morton
@ 2025-07-10 22:54   ` Roman Gushchin
  0 siblings, 0 replies; 10+ messages in thread
From: Roman Gushchin @ 2025-07-10 22:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Matthew Wilcox, linux-mm, linux-kernel, Liu Shixin

Andrew Morton <akpm@linux-foundation.org> writes:

> On Thu, 10 Jul 2025 12:52:32 -0700 Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
>> We've noticed in production that under a very heavy memory pressure
>> the readahead behavior becomes unstable causing spikes in memory
>> pressure and CPU contention on zone locks.
>> 
>> The current mmap_miss heuristics considers minor pagefaults as a
>> good reason to decrease mmap_miss and conditionally start async
>> readahead. This creates a vicious cycle: asynchronous readahead
>> loads more pages, which in turn causes more minor pagefaults.
>> This problem is especially pronounced when multiple threads of
>> an application fault on consecutive pages of an evicted executable,
>> aggressively lowering the mmap_miss counter and preventing readahead
>> from being disabled.
>> 
>> To improve the logic let's check for !uptodate and workingset
>> folios in do_async_mmap_readahead(). The presence of such pages
>> is a strong indicator of thrashing, which is also used by the
>> delay accounting code, e.g. in folio_wait_bit_common(). So instead
>> of decreasing mmap_miss and lower chances to disable readahead,
>> let's do the opposite and bump it by MMAP_LOTSAMISS / 2.
>
> Are there any testing results to share?

Nothing from production yet, but it makes a lot of difference to the
reproducer I use (authored by Greg Thelen), which basically runs a
huge binary with 2x the number of CPUs worth of threads in a very
constrained memory cgroup. Without this change the system oscillates
between performing more or less well and being completely stuck on
zone lock contention when 256 threads are all competing for a small
number of pages. With this change the system is pretty stable once it
reaches the point where readahead is disabled.

>
> What sort of workloads might be harmed by this change?

I hope none, but maybe I'm missing something.

>
> We do seem to be thrashing around (heh) with these readahead
> heuristics.  Lots of potential for playing whack-a-mole.  
>
> Should we make the readahead code more observable?  We don't seem to
> have much in the way of statistics, counters, etc.  And no tracepoints,
> which is surprising.

I think it's another good mm candidate for eventual bpf-ization (the
first being oom killer policies, which I'm working on). For example, I
can easily see a policy specific to a file format making a large
difference.

In this particular case I guess we could disable readahead based on
memory PSI metrics, potentially all in bpf.

Thanks

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm: consider disabling readahead if there are signs of thrashing
  2025-07-10 21:43 ` Matthew Wilcox
@ 2025-07-11 16:29   ` Roman Gushchin
  0 siblings, 0 replies; 10+ messages in thread
From: Roman Gushchin @ 2025-07-11 16:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Jan Kara, linux-mm, linux-kernel, Liu Shixin

Matthew Wilcox <willy@infradead.org> writes:

> On Thu, Jul 10, 2025 at 12:52:32PM -0700, Roman Gushchin wrote:
>> We've noticed in production that under a very heavy memory pressure
>> the readahead behavior becomes unstable causing spikes in memory
>> pressure and CPU contention on zone locks.
>> 
>> The current mmap_miss heuristics considers minor pagefaults as a
>> good reason to decrease mmap_miss and conditionally start async
>> readahead. This creates a vicious cycle: asynchronous readahead
>> loads more pages, which in turn causes more minor pagefaults.
>
> Is the correct response to turn off faultaround, or would we be better
> off scaling it down (eg as low as 64k)?

I think at this point it's better to turn it off entirely.

As for scaling, I wonder if we want to scale it depending on PSI
numbers?

>
> I like the signal you're using; I think that makes a lot of sense.

Thanks!

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm: consider disabling readahead if there are signs of thrashing
  2025-07-10 19:52 [PATCH] mm: consider disabling readahead if there are signs of thrashing Roman Gushchin
  2025-07-10 20:57 ` Andrew Morton
  2025-07-10 21:43 ` Matthew Wilcox
@ 2025-07-14 15:16 ` Jan Kara
  2025-07-14 20:12   ` Roman Gushchin
  2025-07-25 22:42   ` Roman Gushchin
  2 siblings, 2 replies; 10+ messages in thread
From: Jan Kara @ 2025-07-14 15:16 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Jan Kara, Matthew Wilcox, linux-mm, linux-kernel,
	Liu Shixin

On Thu 10-07-25 12:52:32, Roman Gushchin wrote:
> We've noticed in production that under a very heavy memory pressure
> the readahead behavior becomes unstable causing spikes in memory
> pressure and CPU contention on zone locks.
> 
> The current mmap_miss heuristics considers minor pagefaults as a
> good reason to decrease mmap_miss and conditionally start async
> readahead. This creates a vicious cycle: asynchronous readahead
> loads more pages, which in turn causes more minor pagefaults.
> This problem is especially pronounced when multiple threads of
> an application fault on consecutive pages of an evicted executable,
> aggressively lowering the mmap_miss counter and preventing readahead
> from being disabled.

I think you're talking about filemap_map_pages() logic of handling
mmap_miss. It would be nice to mention it in the changelog. There's one
thing that doesn't quite make sense to me: When there's memory pressure,
I'd expect the pages to be reclaimed from memory and not just unmapped. 
Also given your solution uses !uptodate folios suggests the pages were
actually fully reclaimed and the problem really is that filemap_map_pages()
treats as minor page fault (i.e., cache hit) what is in fact a major page
fault (i.e., cache miss)?

Actually, now that I dug deeper I've remembered that based on Liu
Shixin's report
(https://lore.kernel.org/all/20240201100835.1626685-1-liushixin2@huawei.com/)
which sounds a lot like what you're reporting, we have eventually merged his
fixes (ended up as commits 0fd44ab213bc ("mm/readahead: break read-ahead
loop if filemap_add_folio return -ENOMEM"), 5c46d5319bde ("mm/filemap:
don't decrease mmap_miss when folio has workingset flag")). Did you test a
kernel with these fixes (6.10 or later)? In particular after these fixes
the !folio_test_workingset() check in filemap_map_folio_range() and
filemap_map_order0_folio() should make sure we don't decrease mmap_miss
when faulting fresh pages. Or was in your case page evicted so long ago
that workingset bit is already clear?

Once we better understand the situation, let me also mention that I have
two patches which I originally proposed to fix Liu's problems. They didn't
quite fix them so his patches got merged in the end but the problems
described there are still somewhat valid:

    mm/readahead: Improve page readaround miss detection
    
    filemap_map_pages() decreases ra->mmap_miss for every page it maps. This
    however overestimates number of real cache hits because we have no idea
    whether the application will use the pages we map or not. This is
    problematic in particular in memory constrained situations where we
    think we have great readahead success rate although in fact we are just
    thrashing page cache & disk. Change filemap_map_pages() to count only
    success of mapping the page we are faulting in. This should be actually
    enough to keep mmap_miss close to 0 for workloads doing sequential reads
    because filemap_map_pages() does not map page with readahead flag and
    thus these are going to contribute to decreasing the mmap_miss counter.

    Fixes: f1820361f83d ("mm: implement ->map_pages for page cache")

-
    mm/readahead: Fix readahead miss detection with FAULT_FLAG_RETRY_NOWAIT
    
    When the page fault happens with FAULT_FLAG_RETRY_NOWAIT (which is
    common) we will bail out of the page fault after issuing reads and retry
    the fault. That will then find the created pages in filemap_map_pages()
    and hence will be treated as cache hit canceling out the cache miss in
    do_sync_mmap_readahead(). Increment mmap_miss by two in
    do_sync_mmap_readahead() in case FAULT_FLAG_RETRY_NOWAIT is set to
    account for the following expected hit. If the page gets evicted even
    before we manage to retry the fault, we are under so heavy memory
    pressure that increasing mmap_miss by two is fine.

    Fixes: d065bd810b6d ("mm: retry page fault when blocking on disk transfer")

In particular the second problem described could still lead to mmap_miss
not growing as fast as it should so maybe it would be worth reviving it.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm: consider disabling readahead if there are signs of thrashing
  2025-07-14 15:16 ` Jan Kara
@ 2025-07-14 20:12   ` Roman Gushchin
  2025-07-25 22:42   ` Roman Gushchin
  1 sibling, 0 replies; 10+ messages in thread
From: Roman Gushchin @ 2025-07-14 20:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Matthew Wilcox, linux-mm, linux-kernel, Liu Shixin

Jan Kara <jack@suse.cz> writes:

> On Thu 10-07-25 12:52:32, Roman Gushchin wrote:
>> We've noticed in production that under a very heavy memory pressure
>> the readahead behavior becomes unstable causing spikes in memory
>> pressure and CPU contention on zone locks.
>> 
>> The current mmap_miss heuristics considers minor pagefaults as a
>> good reason to decrease mmap_miss and conditionally start async
>> readahead. This creates a vicious cycle: asynchronous readahead
>> loads more pages, which in turn causes more minor pagefaults.
>> This problem is especially pronounced when multiple threads of
>> an application fault on consecutive pages of an evicted executable,
>> aggressively lowering the mmap_miss counter and preventing readahead
>> from being disabled.
>
> I think you're talking about filemap_map_pages() logic of handling
> mmap_miss. It would be nice to mention it in the changelog. There's one
> thing that doesn't quite make sense to me: When there's memory pressure,
> I'd expect the pages to be reclaimed from memory and not just unmapped. 
> Also given your solution uses !uptodate folios suggests the pages were
> actually fully reclaimed and the problem really is that filemap_map_pages()
> treats as minor page fault (i.e., cache hit) what is in fact a major page
> fault (i.e., cache miss)?
>
> Actually, now that I digged deeper I've remembered that based on Liu
> Shixin's report
> (https://lore.kernel.org/all/20240201100835.1626685-1-liushixin2@huawei.com/)
> which sounds a lot like what you're reporting, we have eventually merged his
> fixes (ended up as commits 0fd44ab213bc ("mm/readahead: break read-ahead
> loop if filemap_add_folio return -ENOMEM"), 5c46d5319bde ("mm/filemap:
> don't decrease mmap_miss when folio has workingset flag")). Did you test a
> kernel with these fixes (6.10 or later)? In particular after these fixes
> the !folio_test_workingset() check in filemap_map_folio_range() and
> filemap_map_order0_folio() should make sure we don't decrease mmap_miss
> when faulting fresh pages. Or was in your case page evicted so long ago
> that workingset bit is already clear?

Hi Jan!

I've tried cherry-picking those changes into the kernel I'm looking at
(it's 6.6-based), but it didn't help much. I haven't looked into why,
but at some point I added traces to filemap_map_pages() and I don't
remember seeing anything. Most likely because the pages are not
uptodate at the moment of the fault.

In my case I saw a large number of minor pagefaults on consecutive
pages. It seems like one thread takes a major pagefault and then a
bunch of other threads fault on the next pages, effectively breaking
the mmap_miss heuristic. Sometimes it reaches 1000, but it struggles
to stay there.

>
> Once we better understand the situation, let me also mention that I have
> two patches which I originally proposed to fix Liu's problems. They didn't
> quite fix them so his patches got merged in the end but the problems
> described there are still somewhat valid:
>
>     mm/readahead: Improve page readaround miss detection
>     
>     filemap_map_pages() decreases ra->mmap_miss for every page it maps. This
>     however overestimates number of real cache hits because we have no idea
>     whether the application will use the pages we map or not. This is
>     problematic in particular in memory constrained situations where we
>     think we have great readahead success rate although in fact we are just
>     trashing page cache & disk. Change filemap_map_pages() to count only
>     success of mapping the page we are faulting in. This should be actually
>     enough to keep mmap_miss close to 0 for workloads doing sequential reads
>     because filemap_map_pages() does not map page with readahead flag and
>     thus these are going to contribute to decreasing the mmap_miss counter.
>
>     Fixes: f1820361f83d ("mm: implement ->map_pages for page cache")
>
> -
>     mm/readahead: Fix readahead miss detection with FAULT_FLAG_RETRY_NOWAIT
>     
>     When the page fault happens with FAULT_FLAG_RETRY_NOWAIT (which is
>     common) we will bail out of the page fault after issuing reads and retry
>     the fault. That will then find the created pages in filemap_map_pages()
>     and hence will be treated as cache hit canceling out the cache miss in
>     do_sync_mmap_readahead(). Increment mmap_miss by two in
>     do_sync_mmap_readahead() in case FAULT_FLAG_RETRY_NOWAIT is set to
>     account for the following expected hit. If the page gets evicted even
>     before we manage to retry the fault, we are under so heavy memory
>     pressure that increasing mmap_miss by two is fine.
>
>     Fixes: d065bd810b6d ("mm: retry page fault when blocking on disk transfer")

Yeah, this looks interesting...

Thanks!

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm: consider disabling readahead if there are signs of thrashing
  2025-07-14 15:16 ` Jan Kara
  2025-07-14 20:12   ` Roman Gushchin
@ 2025-07-25 22:42   ` Roman Gushchin
  2025-07-25 23:25     ` Roman Gushchin
  1 sibling, 1 reply; 10+ messages in thread
From: Roman Gushchin @ 2025-07-25 22:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Matthew Wilcox, linux-mm, linux-kernel, Liu Shixin

Jan Kara <jack@suse.cz> writes:

> On Thu 10-07-25 12:52:32, Roman Gushchin wrote:
>> We've noticed in production that under a very heavy memory pressure
>> the readahead behavior becomes unstable causing spikes in memory
>> pressure and CPU contention on zone locks.
>> 
>> The current mmap_miss heuristics considers minor pagefaults as a
>> good reason to decrease mmap_miss and conditionally start async
>> readahead. This creates a vicious cycle: asynchronous readahead
>> loads more pages, which in turn causes more minor pagefaults.
>> This problem is especially pronounced when multiple threads of
>> an application fault on consecutive pages of an evicted executable,
>> aggressively lowering the mmap_miss counter and preventing readahead
>> from being disabled.
>
> I think you're talking about filemap_map_pages() logic of handling
> mmap_miss. It would be nice to mention it in the changelog. There's one
> thing that doesn't quite make sense to me: When there's memory pressure,
> I'd expect the pages to be reclaimed from memory and not just unmapped. 
> Also given your solution uses !uptodate folios suggests the pages were
> actually fully reclaimed and the problem really is that filemap_map_pages()
> treats as minor page fault (i.e., cache hit) what is in fact a major page
> fault (i.e., cache miss)?
>
> Actually, now that I digged deeper I've remembered that based on Liu
> Shixin's report
> (https://lore.kernel.org/all/20240201100835.1626685-1-liushixin2@huawei.com/)
> which sounds a lot like what you're reporting, we have eventually merged his
> fixes (ended up as commits 0fd44ab213bc ("mm/readahead: break read-ahead
> loop if filemap_add_folio return -ENOMEM"), 5c46d5319bde ("mm/filemap:
> don't decrease mmap_miss when folio has workingset flag")). Did you test a
> kernel with these fixes (6.10 or later)? In particular after these fixes
> the !folio_test_workingset() check in filemap_map_folio_range() and
> filemap_map_order0_folio() should make sure we don't decrease mmap_miss
> when faulting fresh pages. Or was in your case page evicted so long ago
> that workingset bit is already clear?
>
> Once we better understand the situation, let me also mention that I have
> two patches which I originally proposed to fix Liu's problems. They didn't
> quite fix them so his patches got merged in the end but the problems
> described there are still somewhat valid:

Ok, I got a better understanding of the situation now. Basically we
have a multi-threaded application which is under very heavy memory
pressure. If multiple threads are faulting simultaneously on the same
page, do_async_mmap_readahead() can be called multiple times for the
same page. This creates negative pressure on the mmap_miss counter,
which can't be matched by do_sync_mmap_readahead(), which is called
only once for every page. This effectively keeps readahead on, despite
the heavy memory pressure.

The following patch solves the problem, at least in my test scenario.
Wdyt?

Thanks!

--

@@ -3323,6 +3323,10 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
 	if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
 		return fpin;
 
+	/* We're likely racing against another fault, bail out */
+	if (folio_test_locked(folio) && !folio_test_uptodate(folio))
+		return fpin;
+
 	mmap_miss = READ_ONCE(ra->mmap_miss);
 	if (mmap_miss)
 		WRITE_ONCE(ra->mmap_miss, --mmap_miss);



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm: consider disabling readahead if there are signs of thrashing
  2025-07-25 22:42   ` Roman Gushchin
@ 2025-07-25 23:25     ` Roman Gushchin
  2025-07-28  9:16       ` Jan Kara
  0 siblings, 1 reply; 10+ messages in thread
From: Roman Gushchin @ 2025-07-25 23:25 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Matthew Wilcox, linux-mm, linux-kernel, Liu Shixin

Roman Gushchin <roman.gushchin@linux.dev> writes:

> Jan Kara <jack@suse.cz> writes:
>
>> On Thu 10-07-25 12:52:32, Roman Gushchin wrote:
>>> We've noticed in production that under a very heavy memory pressure
>>> the readahead behavior becomes unstable causing spikes in memory
>>> pressure and CPU contention on zone locks.
>>> 
>>> The current mmap_miss heuristics considers minor pagefaults as a
>>> good reason to decrease mmap_miss and conditionally start async
>>> readahead. This creates a vicious cycle: asynchronous readahead
>>> loads more pages, which in turn causes more minor pagefaults.
>>> This problem is especially pronounced when multiple threads of
>>> an application fault on consecutive pages of an evicted executable,
>>> aggressively lowering the mmap_miss counter and preventing readahead
>>> from being disabled.
>>
>> I think you're talking about filemap_map_pages() logic of handling
>> mmap_miss. It would be nice to mention it in the changelog. There's one
>> thing that doesn't quite make sense to me: When there's memory pressure,
>> I'd expect the pages to be reclaimed from memory and not just unmapped. 
>> Also given your solution uses !uptodate folios suggests the pages were
>> actually fully reclaimed and the problem really is that filemap_map_pages()
>> treats as minor page fault (i.e., cache hit) what is in fact a major page
>> fault (i.e., cache miss)?
>>
>> Actually, now that I digged deeper I've remembered that based on Liu
>> Shixin's report
>> (https://lore.kernel.org/all/20240201100835.1626685-1-liushixin2@huawei.com/)
>> which sounds a lot like what you're reporting, we have eventually merged his
>> fixes (ended up as commits 0fd44ab213bc ("mm/readahead: break read-ahead
>> loop if filemap_add_folio return -ENOMEM"), 5c46d5319bde ("mm/filemap:
>> don't decrease mmap_miss when folio has workingset flag")). Did you test a
>> kernel with these fixes (6.10 or later)? In particular after these fixes
>> the !folio_test_workingset() check in filemap_map_folio_range() and
>> filemap_map_order0_folio() should make sure we don't decrease mmap_miss
>> when faulting fresh pages. Or was in your case page evicted so long ago
>> that workingset bit is already clear?
>>
>> Once we better understand the situation, let me also mention that I have
>> two patches which I originally proposed to fix Liu's problems. They didn't
>> quite fix them so his patches got merged in the end but the problems
>> described there are still somewhat valid:
>
> Ok, I got a better understanding of the situation now. Basically we have
> a multi-threaded application which is under very heavy memory pressure.
> I multiple threads are faulting simultaneously into the same page,
> do_sync_mmap_readahead() can be called multiple times for the same page.
> This creates a negative pressure on the mmap_miss counter, which can't be
> matched by do_sync_mmap_readahead(), which is be called only once
> for every page. This basically keeps the readahead on, despite the heavy
> memory pressure.
>
> The following patch solves the problem, at least in my test scenario.
> Wdyt?

Actually, a better version is below. We don't have to skip the actual
readahead, just avoid decreasing mmap_miss if the page is locked.

--

diff --git a/mm/filemap.c b/mm/filemap.c
index 0d0369fb5fa1..1756690dd275 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3323,9 +3323,15 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
        if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
                return fpin;
 
-       mmap_miss = READ_ONCE(ra->mmap_miss);
-       if (mmap_miss)
-               WRITE_ONCE(ra->mmap_miss, --mmap_miss);
+       /* If the folio is locked, we're likely racing against another
+        * fault; don't decrease the mmap_miss counter multiple times
+        * for the same folio, which would break the balance.
+        */
+       if (likely(!folio_test_locked(folio))) {
+               mmap_miss = READ_ONCE(ra->mmap_miss);
+               if (mmap_miss)
+                       WRITE_ONCE(ra->mmap_miss, --mmap_miss);
+       }
 
        if (folio_test_readahead(folio)) {
                fpin = maybe_unlock_mmap_for_io(vmf, fpin);

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] mm: consider disabling readahead if there are signs of thrashing
  2025-07-25 23:25     ` Roman Gushchin
@ 2025-07-28  9:16       ` Jan Kara
  0 siblings, 0 replies; 10+ messages in thread
From: Jan Kara @ 2025-07-28  9:16 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Jan Kara, Andrew Morton, Matthew Wilcox, linux-mm, linux-kernel,
	Liu Shixin

On Fri 25-07-25 16:25:49, Roman Gushchin wrote:
> Roman Gushchin <roman.gushchin@linux.dev> writes:
> > Jan Kara <jack@suse.cz> writes:
> >> On Thu 10-07-25 12:52:32, Roman Gushchin wrote:
> >>> We've noticed in production that under a very heavy memory pressure
> >>> the readahead behavior becomes unstable causing spikes in memory
> >>> pressure and CPU contention on zone locks.
> >>> 
> >>> The current mmap_miss heuristics considers minor pagefaults as a
> >>> good reason to decrease mmap_miss and conditionally start async
> >>> readahead. This creates a vicious cycle: asynchronous readahead
> >>> loads more pages, which in turn causes more minor pagefaults.
> >>> This problem is especially pronounced when multiple threads of
> >>> an application fault on consecutive pages of an evicted executable,
> >>> aggressively lowering the mmap_miss counter and preventing readahead
> >>> from being disabled.
> >>
> >> I think you're talking about filemap_map_pages() logic of handling
> >> mmap_miss. It would be nice to mention it in the changelog. There's one
> >> thing that doesn't quite make sense to me: When there's memory pressure,
> >> I'd expect the pages to be reclaimed from memory and not just unmapped. 
> >> Also given your solution uses !uptodate folios suggests the pages were
> >> actually fully reclaimed and the problem really is that filemap_map_pages()
> >> treats as minor page fault (i.e., cache hit) what is in fact a major page
> >> fault (i.e., cache miss)?
> >>
> >> Actually, now that I digged deeper I've remembered that based on Liu
> >> Shixin's report
> >> (https://lore.kernel.org/all/20240201100835.1626685-1-liushixin2@huawei.com/)
> >> which sounds a lot like what you're reporting, we have eventually merged his
> >> fixes (ended up as commits 0fd44ab213bc ("mm/readahead: break read-ahead
> >> loop if filemap_add_folio return -ENOMEM"), 5c46d5319bde ("mm/filemap:
> >> don't decrease mmap_miss when folio has workingset flag")). Did you test a
> >> kernel with these fixes (6.10 or later)? In particular after these fixes
> >> the !folio_test_workingset() check in filemap_map_folio_range() and
> >> filemap_map_order0_folio() should make sure we don't decrease mmap_miss
> >> when faulting fresh pages. Or was in your case page evicted so long ago
> >> that workingset bit is already clear?
> >>
> >> Once we better understand the situation, let me also mention that I have
> >> two patches which I originally proposed to fix Liu's problems. They didn't
> >> quite fix them so his patches got merged in the end but the problems
> >> described there are still somewhat valid:
> >
> > Ok, I got a better understanding of the situation now. Basically we have
> > a multi-threaded application which is under very heavy memory pressure.
> > I multiple threads are faulting simultaneously into the same page,
> > do_sync_mmap_readahead() can be called multiple times for the same page.
> > This creates a negative pressure on the mmap_miss counter, which can't be
> > matched by do_sync_mmap_readahead(), which is be called only once
> > for every page. This basically keeps the readahead on, despite the heavy
> > memory pressure.
> >
> > The following patch solves the problem, at least in my test scenario.
> > Wdyt?
> 
> Actually, a better version is below. We don't have to avoid the actual
> readahead, just not decrease mmap_miss if the page is locked.
> 
> --
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 0d0369fb5fa1..1756690dd275 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3323,9 +3323,15 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
>         if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
>                 return fpin;
>  
> -       mmap_miss = READ_ONCE(ra->mmap_miss);
> -       if (mmap_miss)
> -               WRITE_ONCE(ra->mmap_miss, --mmap_miss);
> +       /* If folio is locked, we're likely racing against another fault,
> +        * don't decrease the mmap_miss counter to avoid decreasing it
> +        * multiple times for the same page and break the balance.
> +        */
> +       if (likely(!folio_test_locked(folio))) {

I like this, although even more understandable to me would be to have

	  if (likely(folio_test_uptodate(folio)))

which should be more or less equivalent for your situation but would better
express, whether this is indeed a cache hit or not. But I can live with
either variant.

								Honza

> +               mmap_miss = READ_ONCE(ra->mmap_miss);
> +               if (mmap_miss)
> +                       WRITE_ONCE(ra->mmap_miss, --mmap_miss);
> +       }
>  
>         if (folio_test_readahead(folio)) {
>                 fpin = maybe_unlock_mmap_for_io(vmf, fpin);
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-07-28  9:16 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-10 19:52 [PATCH] mm: consider disabling readahead if there are signs of thrashing Roman Gushchin
2025-07-10 20:57 ` Andrew Morton
2025-07-10 22:54   ` Roman Gushchin
2025-07-10 21:43 ` Matthew Wilcox
2025-07-11 16:29   ` Roman Gushchin
2025-07-14 15:16 ` Jan Kara
2025-07-14 20:12   ` Roman Gushchin
2025-07-25 22:42   ` Roman Gushchin
2025-07-25 23:25     ` Roman Gushchin
2025-07-28  9:16       ` Jan Kara
