* [PATCH] mm: readahead: improve mmap_miss heuristic for concurrent faults
From: Roman Gushchin @ 2025-08-15 18:32 UTC
To: Andrew Morton
Cc: linux-mm, linux-kernel, Roman Gushchin, Matthew Wilcox (Oracle),
Jan Kara
If two or more threads of an application fault on the same folio,
the mmap_miss counter can be decreased multiple times. This breaks the
mmap_miss heuristic and keeps readahead enabled even under extreme
levels of memory pressure.

This happens often when file folios backing a multi-threaded
application are evicted and re-faulted.

Fix it by not decreasing mmap_miss if the folio is locked.

This change was evaluated on several hundred thousand hosts in Google's
production over a couple of weeks. The number of containers stuck in a
vicious reclaim cycle for a long time was reduced several fold
(~10-20x), and the overall fleet-wide CPU time spent in direct memory
reclaim was meaningfully reduced. No regressions were observed.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-mm@kvack.org
---
mm/filemap.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index c21e98657e0b..983ba1019674 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3324,9 +3324,17 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
 	if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
 		return fpin;
 
-	mmap_miss = READ_ONCE(ra->mmap_miss);
-	if (mmap_miss)
-		WRITE_ONCE(ra->mmap_miss, --mmap_miss);
+	/*
+	 * If the folio is locked, we're likely racing against another fault.
+	 * Don't touch the mmap_miss counter to avoid decreasing it multiple
+	 * times for a single folio and break the balance with mmap_miss
+	 * increase in do_sync_mmap_readahead().
+	 */
+	if (likely(!folio_test_locked(folio))) {
+		mmap_miss = READ_ONCE(ra->mmap_miss);
+		if (mmap_miss)
+			WRITE_ONCE(ra->mmap_miss, --mmap_miss);
+	}
 
 	if (folio_test_readahead(folio)) {
 		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
--
2.50.1
* Re: [PATCH] mm: readahead: improve mmap_miss heuristic for concurrent faults
From: David Hildenbrand @ 2025-08-19 7:33 UTC
To: Roman Gushchin, Andrew Morton
Cc: linux-mm, linux-kernel, Matthew Wilcox (Oracle), Jan Kara
On 15.08.25 20:32, Roman Gushchin wrote:
> If two or more threads of an application fault on the same folio,
> the mmap_miss counter can be decreased multiple times. This breaks the
> mmap_miss heuristic and keeps readahead enabled even under extreme
> levels of memory pressure.
>
> This happens often when file folios backing a multi-threaded
> application are evicted and re-faulted.
>
> Fix it by not decreasing mmap_miss if the folio is locked.
>
> This change was evaluated on several hundred thousand hosts in Google's
> production over a couple of weeks. The number of containers stuck in a
> vicious reclaim cycle for a long time was reduced several fold
> (~10-20x), and the overall fleet-wide CPU time spent in direct memory
> reclaim was meaningfully reduced. No regressions were observed.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: linux-mm@kvack.org
> ---
> mm/filemap.c | 14 +++++++++++---
> 1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c21e98657e0b..983ba1019674 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3324,9 +3324,17 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
> if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
> return fpin;
>
> - mmap_miss = READ_ONCE(ra->mmap_miss);
> - if (mmap_miss)
> - WRITE_ONCE(ra->mmap_miss, --mmap_miss);
> + /*
> + * If the folio is locked, we're likely racing against another fault.
> + * Don't touch the mmap_miss counter to avoid decreasing it multiple
> + * times for a single folio and break the balance with mmap_miss
> + * increase in do_sync_mmap_readahead().
> + */
> + if (likely(!folio_test_locked(folio))) {
> + mmap_miss = READ_ONCE(ra->mmap_miss);
> + if (mmap_miss)
> + WRITE_ONCE(ra->mmap_miss, --mmap_miss);
> + }
Makes sense to me, but I am no readahead expert.
--
Cheers
David / dhildenb
* Re: [PATCH] mm: readahead: improve mmap_miss heuristic for concurrent faults
From: Jan Kara @ 2025-08-25 8:16 UTC
To: Roman Gushchin
Cc: Andrew Morton, linux-mm, linux-kernel, Matthew Wilcox (Oracle),
Jan Kara
On Fri 15-08-25 11:32:24, Roman Gushchin wrote:
> If two or more threads of an application fault on the same folio,
> the mmap_miss counter can be decreased multiple times. This breaks the
> mmap_miss heuristic and keeps readahead enabled even under extreme
> levels of memory pressure.
>
> This happens often when file folios backing a multi-threaded
> application are evicted and re-faulted.
>
> Fix it by not decreasing mmap_miss if the folio is locked.
>
> This change was evaluated on several hundred thousand hosts in Google's
> production over a couple of weeks. The number of containers stuck in a
> vicious reclaim cycle for a long time was reduced several fold
> (~10-20x), and the overall fleet-wide CPU time spent in direct memory
> reclaim was meaningfully reduced. No regressions were observed.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: linux-mm@kvack.org
Looks good! Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> mm/filemap.c | 14 +++++++++++---
> 1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c21e98657e0b..983ba1019674 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3324,9 +3324,17 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
> if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
> return fpin;
>
> - mmap_miss = READ_ONCE(ra->mmap_miss);
> - if (mmap_miss)
> - WRITE_ONCE(ra->mmap_miss, --mmap_miss);
> + /*
> + * If the folio is locked, we're likely racing against another fault.
> + * Don't touch the mmap_miss counter to avoid decreasing it multiple
> + * times for a single folio and break the balance with mmap_miss
> + * increase in do_sync_mmap_readahead().
> + */
> + if (likely(!folio_test_locked(folio))) {
> + mmap_miss = READ_ONCE(ra->mmap_miss);
> + if (mmap_miss)
> + WRITE_ONCE(ra->mmap_miss, --mmap_miss);
> + }
>
> if (folio_test_readahead(folio)) {
> fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> --
> 2.50.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH] mm: readahead: improve mmap_miss heuristic for concurrent faults
From: Mateusz Guzik @ 2025-08-25 12:27 UTC
To: Roman Gushchin
Cc: Andrew Morton, linux-mm, linux-kernel, Matthew Wilcox (Oracle),
Jan Kara
On Fri, Aug 15, 2025 at 11:32:24AM -0700, Roman Gushchin wrote:
> If two or more threads of an application fault on the same folio,
> the mmap_miss counter can be decreased multiple times. This breaks the
> mmap_miss heuristic and keeps readahead enabled even under extreme
> levels of memory pressure.
>
> This happens often when file folios backing a multi-threaded
> application are evicted and re-faulted.
>
> Fix it by not decreasing mmap_miss if the folio is locked.
>
> This change was evaluated on several hundred thousand hosts in Google's
> production over a couple of weeks. The number of containers stuck in a
> vicious reclaim cycle for a long time was reduced several fold
> (~10-20x), and the overall fleet-wide CPU time spent in direct memory
> reclaim was meaningfully reduced. No regressions were observed.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: linux-mm@kvack.org
> ---
> mm/filemap.c | 14 +++++++++++---
> 1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c21e98657e0b..983ba1019674 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3324,9 +3324,17 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
> if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
> return fpin;
>
> - mmap_miss = READ_ONCE(ra->mmap_miss);
> - if (mmap_miss)
> - WRITE_ONCE(ra->mmap_miss, --mmap_miss);
> + /*
> + * If the folio is locked, we're likely racing against another fault.
> + * Don't touch the mmap_miss counter to avoid decreasing it multiple
> + * times for a single folio and break the balance with mmap_miss
> + * increase in do_sync_mmap_readahead().
> + */
> + if (likely(!folio_test_locked(folio))) {
> + mmap_miss = READ_ONCE(ra->mmap_miss);
> + if (mmap_miss)
> + WRITE_ONCE(ra->mmap_miss, --mmap_miss);
> + }
I'm not an mm person.
The comment implies the change fixes the race, but it is not at all
clear to me how.
Does it merely make it significantly less likely?
* Re: [PATCH] mm: readahead: improve mmap_miss heuristic for concurrent faults
From: Roman Gushchin @ 2025-08-25 16:50 UTC
To: Jan Kara; +Cc: Andrew Morton, linux-mm, linux-kernel, Matthew Wilcox (Oracle)
Jan Kara <jack@suse.cz> writes:
> On Fri 15-08-25 11:32:24, Roman Gushchin wrote:
>> If two or more threads of an application fault on the same folio,
>> the mmap_miss counter can be decreased multiple times. This breaks the
>> mmap_miss heuristic and keeps readahead enabled even under extreme
>> levels of memory pressure.
>>
>> This happens often when file folios backing a multi-threaded
>> application are evicted and re-faulted.
>>
>> Fix it by not decreasing mmap_miss if the folio is locked.
>>
>> This change was evaluated on several hundred thousand hosts in Google's
>> production over a couple of weeks. The number of containers stuck in a
>> vicious reclaim cycle for a long time was reduced several fold
>> (~10-20x), and the overall fleet-wide CPU time spent in direct memory
>> reclaim was meaningfully reduced. No regressions were observed.
>>
>> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
>> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: linux-mm@kvack.org
>
> Looks good! Feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
Thank you!
* Re: [PATCH] mm: readahead: improve mmap_miss heuristic for concurrent faults
From: Roman Gushchin @ 2025-08-25 16:54 UTC
To: Mateusz Guzik
Cc: Andrew Morton, linux-mm, linux-kernel, Matthew Wilcox (Oracle),
Jan Kara
Mateusz Guzik <mjguzik@gmail.com> writes:
> On Fri, Aug 15, 2025 at 11:32:24AM -0700, Roman Gushchin wrote:
>> If two or more threads of an application fault on the same folio,
>> the mmap_miss counter can be decreased multiple times. This breaks the
>> mmap_miss heuristic and keeps readahead enabled even under extreme
>> levels of memory pressure.
>>
>> This happens often when file folios backing a multi-threaded
>> application are evicted and re-faulted.
>>
>> Fix it by not decreasing mmap_miss if the folio is locked.
>>
>> This change was evaluated on several hundred thousand hosts in Google's
>> production over a couple of weeks. The number of containers stuck in a
>> vicious reclaim cycle for a long time was reduced several fold
>> (~10-20x), and the overall fleet-wide CPU time spent in direct memory
>> reclaim was meaningfully reduced. No regressions were observed.
>>
>> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
>> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: linux-mm@kvack.org
>> ---
>> mm/filemap.c | 14 +++++++++++---
>> 1 file changed, 11 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index c21e98657e0b..983ba1019674 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -3324,9 +3324,17 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
>> if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
>> return fpin;
>>
>> - mmap_miss = READ_ONCE(ra->mmap_miss);
>> - if (mmap_miss)
>> - WRITE_ONCE(ra->mmap_miss, --mmap_miss);
>> + /*
>> + * If the folio is locked, we're likely racing against another fault.
>> + * Don't touch the mmap_miss counter to avoid decreasing it multiple
>> + * times for a single folio and break the balance with mmap_miss
>> + * increase in do_sync_mmap_readahead().
>> + */
>> + if (likely(!folio_test_locked(folio))) {
>> + mmap_miss = READ_ONCE(ra->mmap_miss);
>> + if (mmap_miss)
>> + WRITE_ONCE(ra->mmap_miss, --mmap_miss);
>> + }
>
> I'm not an mm person.
>
> The comment implies the change fixes the race, but it is not at all
> clear to me how.
>
> Does it merely make it significantly less likely?
It's not fixing any race; it's fixing the imbalance between the upward
and downward pressure on the mmap_miss variable. This improves readahead
behavior under very specific circumstances: a multi-threaded application
under very heavy memory pressure. There should be no visible difference
in behavior in other cases.
Thanks!