[PATCH v2] mm: limit filemap_fault readahead to VMA boundaries

* [PATCH v2] mm: limit filemap_fault readahead to VMA boundaries
@ 2026-04-27  3:01 Frederick Mayle
  2026-04-27 12:41 ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 4+ messages in thread
From: Frederick Mayle @ 2026-04-27  3:01 UTC (permalink / raw)
  To: David Hildenbrand, Jan Kara, Lorenzo Stoakes, Matthew Wilcox,
	Andrew Morton, Pedro Falcato
  Cc: Frederick Mayle, Kalesh Singh, Suren Baghdasaryan, android-mm,
	kernel-team, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Michal Hocko, linux-fsdevel, linux-mm, linux-kernel

When a file mapping covers a strict subset of a file, an access to the
mapping can trigger readahead of file pages outside the mapped region.
Readahead is meant to prefetch pages likely to be accessed soon, but
these pages aren't accessible via the same means, so it is fair to say we
don't have a good indicator they'll be accessed soon. Take an ELF file
for example: An access to the end of a program's read-only segment isn't
a sign that nearby file contents will be accessed next (they are likely
to be mapped discontiguously, or not at all). The pressure from loading
these pages into the cache can evict more useful pages.

To improve the behavior, make three changes:

* Introduce a new readahead_control field, _max_index, as a hard limit
  on the readahead. The existing file_ra_state->size can't be used as a
  limit; it is more of a hint and can be increased by various
  heuristics.
* Set readahead_control->_max_index to the end of the VMA in all of the
  readahead paths that can be triggered from a fault on a file mapping
  (both "sync" and "async" readahead).
* Limit the read-around range start to the VMA's start.
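
As a rough userspace illustration of the three changes above (not the
kernel code itself), the clamping can be sketched as follows. The names
vma_model, ra_window, and read_around are hypothetical stand-ins; only
the arithmetic (VMA end = vm_pgoff + pages - 1, start clamped to
vm_pgoff) follows the patch. The sketch assumes the fault offset lies
inside the VMA.

```c
#include <assert.h>

typedef unsigned long pgoff_t;

/* Hypothetical stand-in for the VMA fields the patch reads. */
struct vma_model {
	pgoff_t vm_pgoff;	/* file page index of the first mapped page */
	pgoff_t nr_pages;	/* mapping length in pages, like vma_pages() */
};

/* Inclusive range of file page indices to read. */
struct ra_window {
	pgoff_t start;
	pgoff_t end;
};

/* Last file page index covered by the VMA, inclusive. */
static pgoff_t vma_last_index(const struct vma_model *vma)
{
	assert(vma->nr_pages > 0);
	return vma->vm_pgoff + vma->nr_pages - 1;
}

static pgoff_t min_index(pgoff_t a, pgoff_t b)
{
	return a < b ? a : b;
}

/* Read-around window centred on the fault, clamped to file and VMA. */
static struct ra_window read_around(const struct vma_model *vma,
				    pgoff_t fault_pgoff, pgoff_t ra_pages,
				    pgoff_t file_last_index)
{
	struct ra_window w;
	pgoff_t half = ra_pages / 2;

	assert(ra_pages > 0);
	assert(fault_pgoff >= vma->vm_pgoff);
	assert(fault_pgoff <= vma_last_index(vma));

	/* Centre the window on the fault without going below page 0. */
	w.start = fault_pgoff > half ? fault_pgoff - half : 0;
	/* Do not start before the first page of the VMA. */
	if (w.start < vma->vm_pgoff)
		w.start = vma->vm_pgoff;

	/* Hard limits: end of file, then end of the VMA. */
	w.end = w.start + ra_pages - 1;
	w.end = min_index(w.end, file_last_index);
	w.end = min_index(w.end, vma_last_index(vma));
	return w;
}
```

For a 16-page VMA at file page 100, a fault at page 110 with a 32-page
window gives the range 100..115 in this model, rather than 94..125.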

Note that these changes only affect readahead triggered in the context
of a fault; they do not affect readahead triggered by read syscalls. If
a user mixes the two types of accesses, the behavior is expected to be
the following: if a fault causes readahead and places a PG_readahead
marker and then a read(2) syscall hits the PG_readahead marker, the
resulting async readahead *will not* be limited to the VMA end.
Conversely, if a read(2) syscall places a PG_readahead marker and then a
fault hits the marker, the async readahead *will* be limited to the VMA
end.
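
The mixed-access rule above can be modelled with a small sketch: the
limit depends on which path hits the marker, not on which path placed
it. This is a simplified userspace model, not the kernel logic; the enum
ra_trigger and the function async_ra_last_index are hypothetical names.
Only the ULONG_MAX default and the clamp to the VMA end come from the
patch.

```c
#include <assert.h>
#include <limits.h>

typedef unsigned long pgoff_t;

/* Hypothetical: which kind of access hit the PG_readahead marker. */
enum ra_trigger {
	RA_TRIGGER_READ,	/* read(2) path, no VMA involved */
	RA_TRIGGER_FAULT,	/* page fault path, VMA is known */
};

/*
 * Last page index (inclusive) an async readahead of nr_pages starting
 * at 'start' may reach in this model.
 */
static pgoff_t async_ra_last_index(enum ra_trigger trigger, pgoff_t start,
				   pgoff_t nr_pages, pgoff_t vma_last_index,
				   pgoff_t file_last_index)
{
	/* Default mirrors the patch's initializer: effectively no limit. */
	pgoff_t max_index = ULONG_MAX;
	pgoff_t end;

	assert(nr_pages > 0);

	/* Only the fault path knows a VMA, so only it sets the hard limit. */
	if (trigger == RA_TRIGGER_FAULT)
		max_index = vma_last_index;

	end = start + nr_pages - 1;
	if (end > file_last_index)
		end = file_last_index;
	if (end > max_index)
		end = max_index;
	return end;
}
```

With a marker at page 110, a 32-page readahead, and a VMA ending at page
115, the model reads up to page 141 when read(2) hits the marker and up
to page 115 when a fault hits it.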

There is an edge case that the above motivation glosses over: A single
file mapping might be backed by multiple VMAs. For example, a whole file
could be mapped RW, then part of the mapping made RO using mprotect.
This patch would hurt performance of a sequential faulted read of such a
mapping, the degree depending on how fragmented the VMAs are. A usage
pattern like that is likely rare and already suffering from sub-optimal
performance because, e.g., the fragmented VMAs limit the fault-around,
so each VMA boundary in a sequential faulted read would cause a minor
fault. Still, this patch would make it worse. See a previous discussion
of this topic at [1].
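
To give a feel for how the cost scales with fragmentation, here is a
deliberately crude toy model. It assumes fixed-size readahead windows
and ignores window ramp-up, async markers, and fault-around, so it is
not a prediction of real behavior. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static unsigned long div_round_up(unsigned long n, unsigned long d)
{
	return (n + d - 1) / d;
}

/*
 * Toy model: number of fixed-size readahead windows needed to read a
 * contiguous file range sequentially through a mapping that is split
 * into nr_vmas VMAs of vma_pages[i] pages each.
 */
static unsigned long ra_windows_needed(const unsigned long *vma_pages,
				       size_t nr_vmas, unsigned long ra_pages,
				       int clamp_to_vma)
{
	unsigned long total = 0;
	unsigned long clamped = 0;
	size_t i;

	assert(ra_pages > 0);

	for (i = 0; i < nr_vmas; i++) {
		total += vma_pages[i];
		/* A clamped window never crosses into the next VMA. */
		clamped += div_round_up(vma_pages[i], ra_pages);
	}

	return clamp_to_vma ? clamped : div_round_up(total, ra_pages);
}
```

In this model, 32 pages split into four 8-page VMAs need one 32-page
window without the clamp and four windows with it, while a single
32-page VMA needs one window either way.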

Tested by mapping and reading a small subset of a large file, then using
the cachestat syscall to verify the number of cached pages didn't exceed
the mapping size.

In practical scenarios, the effect depends on the specific file and
usage. Sometimes there is no effect at all, but, for some ELF files in
Android, we see ~20% fewer pages pulled into the cache.

A comprehensive performance evaluation hasn't been done, but, in
addition to the anecdotal memory savings mentioned above, a benchmark
was run with fio 3.38, showing neutral-looking results:

    /data/local/tmp/fio --version

    fio --name=mmap_test --ioengine=mmap --rw=read --bs=4k \
        --offset=1G --size=1G --filesize=3G --numjobs=1 \
        --filename=testfile.bin

        Before: 4366.6 MiB/s (avg of 3459, 4592, 4613, 4697, 4472)
        After:  4444.0 MiB/s (avg of 4633, 4655, 4511, 4571, 3850)
                +1.8%

    Same, with --ioengine=mmap --rw=randread

        Before: 445.6 MiB/s  (avg of 446, 447, 442, 452, 441)
        After:  447.0 MiB/s  (avg of 447, 446, 446, 451, 445)
                +0.3%

    Same, with --ioengine=psync --rw=read

        Before: 3086.6 MiB/s (avg of 3122, 3094, 3066, 3094, 3057)
        After:  3084.6 MiB/s (avg of 3039, 3103, 3103, 3084, 3094)
                -0.06%

    Same, with --ioengine=psync --rw=randread

        Before: 2226.4 MiB/s (avg of 2256, 2183, 2207, 2265, 2221)
        After:  2231.4 MiB/s (avg of 2236, 2241, 2236, 2193, 2251)
                +0.2%
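
For readers who want to recheck the summary figures, the averages and
relative changes can be recomputed from the listed runs. This is a small
sketch; the helper names mean and percent_change are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples. */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Relative change from 'before' to 'after', in percent. */
static double percent_change(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}
```

For the sequential mmap runs, this gives means of 4366.6 and 4444.0
MiB/s and a change of about +1.77%. The individual runs in that set
range from 3459 to 4697 MiB/s, a spread far larger than the difference
between the two means.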

[1] https://lore.kernel.org/all/ivnv2crd3et76p2nx7oszuqhzzah756oecn5yuykzqfkqzoygw@yvnlkhjjssoz/

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: android-mm@google.com
Cc: kernel-team@android.com
Signed-off-by: Frederick Mayle <fmayle@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---

This is v2 of
https://lore.kernel.org/r/20260422005608.342028-1-fmayle@google.com/

In v1 of the patch, I made a mistake and accidentally mailed it twice,
the first time without including get_maintainer.pl output, so the
mailing lists weren't CC'd. There were replies from Andrew to the first
email which aren't visible on the mailing list or lore.

Changes in v2:
  - Add Jan's Reviewed-by tag.
  - Tweak commit message wording, per Andrew
  - Change field from `unsigned long max_index` to `pgoff_t _max_index`
    and move next to `_index`, per Andrew
  - Avoid min_t, per Andrew

 include/linux/pagemap.h | 2 ++
 mm/filemap.c            | 4 ++++
 mm/readahead.c          | 6 +++++-
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ec442af3f886..6fd2a8914073 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1361,6 +1361,7 @@ struct readahead_control {
 	struct file_ra_state *ra;
 /* private: use the readahead_* accessors instead */
 	pgoff_t _index;
+	pgoff_t _max_index; /* limit readahead to _max_index, inclusive */
 	unsigned int _nr_pages;
 	unsigned int _batch_count;
 	bool dropbehind;
@@ -1374,6 +1375,7 @@ struct readahead_control {
 		.mapping = m,						\
 		.ra = r,						\
 		._index = i,						\
+		._max_index = ULONG_MAX,				\
 	}

 #define VM_READAHEAD_PAGES	(SZ_128K / PAGE_SIZE)
diff --git a/mm/filemap.c b/mm/filemap.c
index 4e636647100c..97772a05a18e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3314,6 +3314,8 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	bool force_thp_readahead = false;
 	unsigned short mmap_miss;

+	ractl._max_index = vmf->vma->vm_pgoff + vma_pages(vmf->vma) - 1;
+
 	/* Use the readahead code, even if readahead is disabled */
 	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
 	    (vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER)
@@ -3396,6 +3398,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 		 * mmap read-around
 		 */
 		ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
+		ra->start = max(ra->start, vmf->vma->vm_pgoff);
 		ra->size = ra->ra_pages;
 		ra->async_size = ra->ra_pages / 4;
 		ra->order = 0;
@@ -3438,6 +3441,7 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
 	}

 	if (folio_test_readahead(folio)) {
+		ractl._max_index = vmf->vma->vm_pgoff + vma_pages(vmf->vma) - 1;
 		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 		page_cache_async_ra(&ractl, folio, ra->ra_pages);
 	}
diff --git a/mm/readahead.c b/mm/readahead.c
index 7b05082c89ea..8c12b63ccd4a 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -324,6 +324,8 @@ static void do_page_cache_ra(struct readahead_control *ractl,
 		return;

 	end_index = (isize - 1) >> PAGE_SHIFT;
+	if (end_index > ractl->_max_index)
+		end_index = ractl->_max_index;
 	if (index > end_index)
 		return;
 	/* Don't read past the page containing the last byte of the file */
@@ -471,7 +473,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
 	pgoff_t start = readahead_index(ractl);
 	pgoff_t index = start;
 	unsigned int min_order = mapping_min_folio_order(mapping);
-	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
+	pgoff_t limit;
 	pgoff_t mark = index + ra->size - ra->async_size;
 	unsigned int nofs;
 	int err = 0;
@@ -484,6 +486,8 @@ void page_cache_ra_order(struct readahead_control *ractl,
 		goto fallback;
 	}

+	limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
+	limit = min(limit, ractl->_max_index);
 	limit = min(limit, index + ra->size - 1);

 	new_order = min(mapping_max_folio_order(mapping), new_order);

base-commit: db2a1695b2b6feb071b47b72e61d0359bf1524bf
-- 
2.54.0.545.g6539524ca2-goog

* Re: [PATCH v2] mm: limit filemap_fault readahead to VMA boundaries
  2026-04-27  3:01 [PATCH v2] mm: limit filemap_fault readahead to VMA boundaries Frederick Mayle
@ 2026-04-27 12:41 ` David Hildenbrand (Arm)
  2026-04-27 16:22   ` Kalesh Singh
  0 siblings, 1 reply; 4+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-27 12:41 UTC (permalink / raw)
  To: Frederick Mayle, Jan Kara, Lorenzo Stoakes, Matthew Wilcox,
	Andrew Morton, Pedro Falcato
  Cc: Kalesh Singh, Suren Baghdasaryan, android-mm, kernel-team,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko,
	linux-fsdevel, linux-mm, linux-kernel

On 4/27/26 05:01, Frederick Mayle wrote:
> There is an edge case that the above motivation glosses over: A single
> file mapping might be backed by multiple VMAs. For example, a whole file
> could be mapped RW, then part of the mapping made RO using mprotect.
> This patch would hurt performance of a sequential faulted read of such a
> mapping, the degree depending on how fragmented the VMAs are. A usage
> pattern like that is likely rare and already suffering from sub-optimal
> performance because, e.g., the fragmented VMAs limit the fault-around,
> so each VMA boundary in a sequential faulted read would cause a minor
> fault. Still, this patch would make it worse. See a previous discussion
> of this topic at [1].

I agree that workloads that do a lot of mprotect() magic likely do not depend on
readahead optimizations.

But I'm sure we'll learn quickly if that is not the case :)

-- 
Cheers,

David


* Re: [PATCH v2] mm: limit filemap_fault readahead to VMA boundaries
  2026-04-27 12:41 ` David Hildenbrand (Arm)
@ 2026-04-27 16:22   ` Kalesh Singh
  2026-04-27 22:58     ` Frederick Mayle
  0 siblings, 1 reply; 4+ messages in thread
From: Kalesh Singh @ 2026-04-27 16:22 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Frederick Mayle, Jan Kara, Lorenzo Stoakes, Matthew Wilcox,
	Andrew Morton, Pedro Falcato, Suren Baghdasaryan, android-mm,
	kernel-team, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Michal Hocko, linux-fsdevel, linux-mm, linux-kernel

On Mon, Apr 27, 2026 at 5:41 AM 'David Hildenbrand (Arm)' via
android-mm <android-mm@google.com> wrote:
>
> On 4/27/26 05:01, Frederick Mayle wrote:
> > There is an edge case that the above motivation glosses over: A single
> > file mapping might be backed by multiple VMAs. For example, a whole file
> > could be mapped RW, then part of the mapping made RO using mprotect.
> > This patch would hurt performance of a sequential faulted read of such a
> > mapping, the degree depending on how fragmented the VMAs are. A usage
> > pattern like that is likely rare and already suffering from sub-optimal
> > performance because, e.g., the fragmented VMAs limit the fault-around,
> > so each VMA boundary in a sequential faulted read would cause a minor
> > fault. Still, this patch would make it worse. See a previous discussion
> > of this topic at [1].
>
> I agree that workloads that do a lot of mprotect() magic likely do not depend on
> readahead optimizations.
>
> But I'm sure we'll learn quickly if that is not the case :)

Hi David,

There is already this limit for the exec VMAs, so perhaps these use
cases are in fact rare enough; but we'll need to see ...

https://lore.kernel.org/all/20250609092729.274960-6-ryan.roberts@arm.com/

Frederick, could we also now remove that logic (EXEC mappings)? Maybe
in a follow up patch.

For this patch: Reviewed-by: Kalesh Singh <kaleshsingh@google.com>

Thanks,
Kalesh

* Re: [PATCH v2] mm: limit filemap_fault readahead to VMA boundaries
  2026-04-27 16:22   ` Kalesh Singh
@ 2026-04-27 22:58     ` Frederick Mayle
  0 siblings, 0 replies; 4+ messages in thread
From: Frederick Mayle @ 2026-04-27 22:58 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: David Hildenbrand (Arm), Jan Kara, Lorenzo Stoakes,
	Matthew Wilcox, Andrew Morton, Pedro Falcato, Suren Baghdasaryan,
	android-mm, kernel-team, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, linux-fsdevel, linux-mm,
	linux-kernel, Ryan Roberts

On Mon, Apr 27, 2026 at 9:23 AM Kalesh Singh <kaleshsingh@google.com> wrote:
>
> On Mon, Apr 27, 2026 at 5:41 AM 'David Hildenbrand (Arm)' via
> android-mm <android-mm@google.com> wrote:
> >
> > On 4/27/26 05:01, Frederick Mayle wrote:
> > > There is an edge case that the above motivation glosses over: A single
> > > file mapping might be backed by multiple VMAs. For example, a whole file
> > > could be mapped RW, then part of the mapping made RO using mprotect.
> > > This patch would hurt performance of a sequential faulted read of such a
> > > mapping, the degree depending on how fragmented the VMAs are. A usage
> > > pattern like that is likely rare and already suffering from sub-optimal
> > > performance because, e.g., the fragmented VMAs limit the fault-around,
> > > so each VMA boundary in a sequential faulted read would cause a minor
> > > fault. Still, this patch would make it worse. See a previous discussion
> > > of this topic at [1].
> >
> > I agree that workloads that do a lot of mprotect() magic likely do not depend on
> > readahead optimizations.
> >
> > But I'm sure we'll learn quickly if that is not the case :)
>
> Hi David,
>
> There is already this limit for the exec VMAs, so perhaps these use
> cases are in fact rare enough; but we'll need to see ...
>
> https://lore.kernel.org/all/20250609092729.274960-6-ryan.roberts@arm.com/
>
> Frederick, could we also now remove that logic (EXEC mappings)? Maybe
> in a follow up patch.

The VM_EXEC branch is still meaningfully different from the read-around branch:
it sets `ra->order = exec_folio_order()` and `ra->async_size = 0`. But I think
most of the details could be unified. I can try that in a follow-up.

