* [PATCH v5 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
2025-06-09 9:27 [PATCH v5 0/5] Readahead tweaks for larger folios Ryan Roberts
@ 2025-06-09 9:27 ` Ryan Roberts
2025-06-09 9:27 ` [PATCH v5 2/5] mm/readahead: Terminate async readahead on natural boundary Ryan Roberts
` (3 subsequent siblings)
4 siblings, 0 replies; 11+ messages in thread
From: Ryan Roberts @ 2025-06-09 9:27 UTC (permalink / raw)
To: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-fsdevel,
linux-mm, Chaitanya S Prakash
page_cache_ra_order() takes a parameter called new_order, which is
intended to express the preferred order of the folios that will be
allocated for the readahead operation. Most callers indeed call this
with their preferred new order. But page_cache_async_ra() calls it with
the preferred order of the previous readahead request (actually the
order of the folio that had the readahead marker, which may be smaller
when alignment comes into play).
And despite the parameter name, page_cache_ra_order() always treats it
as the old order, adding 2 to it on entry. As a result, a cold readahead
always starts with order-2 folios.
Let's fix this behaviour by always passing in the *new* order.
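In other words, the +2 ramp moves out of the callee and into the async
caller; the fragments below are pulled straight from the diff at the end
of this mail:
  /* Before: page_cache_ra_order() silently bumped whatever it was given */
  if (new_order < mapping_max_folio_order(mapping))
          new_order += 2;
  /* After: page_cache_async_ra() ramps up explicitly; the callee only clamps */
  order += 2;
  ractl->_index = ra->start;
  page_cache_ra_order(ractl, ra, order);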
Worked example:
Prior to the change, mmapping an 8MB file and touching each page
sequentially resulted in the following, where we start with order-2
folios for the first 128K, then ramp up to order-4 for the next 128K,
then get clamped to order-5 for the rest of the file because ra_pages is
limited to 128K:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
----- ---------- ---------- --------- ------- ------- ----- -----
FOLIO 0x00000000 0x00004000 16384 0 4 4 2
FOLIO 0x00004000 0x00008000 16384 4 8 4 2
FOLIO 0x00008000 0x0000c000 16384 8 12 4 2
FOLIO 0x0000c000 0x00010000 16384 12 16 4 2
FOLIO 0x00010000 0x00014000 16384 16 20 4 2
FOLIO 0x00014000 0x00018000 16384 20 24 4 2
FOLIO 0x00018000 0x0001c000 16384 24 28 4 2
FOLIO 0x0001c000 0x00020000 16384 28 32 4 2
FOLIO 0x00020000 0x00030000 65536 32 48 16 4
FOLIO 0x00030000 0x00040000 65536 48 64 16 4
FOLIO 0x00040000 0x00060000 131072 64 96 32 5
FOLIO 0x00060000 0x00080000 131072 96 128 32 5
FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
...
After the change, the same operation results in the first 128K being
order-0, then we start ramping up to order-2, -4, and finally get
clamped at order-5:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
----- ---------- ---------- --------- ------- ------- ----- -----
FOLIO 0x00000000 0x00001000 4096 0 1 1 0
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00024000 16384 32 36 4 2
FOLIO 0x00024000 0x00028000 16384 36 40 4 2
FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
FOLIO 0x00030000 0x00034000 16384 48 52 4 2
FOLIO 0x00034000 0x00038000 16384 52 56 4 2
FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
FOLIO 0x00040000 0x00050000 65536 64 80 16 4
FOLIO 0x00050000 0x00060000 65536 80 96 16 4
FOLIO 0x00060000 0x00080000 131072 96 128 32 5
FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
...
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
mm/readahead.c | 11 ++---------
1 file changed, 2 insertions(+), 9 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index 20d36d6b055e..973de2551efe 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -468,20 +468,12 @@ void page_cache_ra_order(struct readahead_control *ractl,
unsigned int nofs;
int err = 0;
gfp_t gfp = readahead_gfp_mask(mapping);
- unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
- /*
- * Fallback when size < min_nrpages as each folio should be
- * at least min_nrpages anyway.
- */
- if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
+ if (!mapping_large_folio_support(mapping))
goto fallback;
limit = min(limit, index + ra->size - 1);
- if (new_order < mapping_max_folio_order(mapping))
- new_order += 2;
-
new_order = min(mapping_max_folio_order(mapping), new_order);
new_order = min_t(unsigned int, new_order, ilog2(ra->size));
new_order = max(new_order, min_order);
@@ -683,6 +675,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
ra->size = get_next_ra_size(ra, max_pages);
ra->async_size = ra->size;
readit:
+ order += 2;
ractl->_index = ra->start;
page_cache_ra_order(ractl, ra, order);
}
--
2.43.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH v5 2/5] mm/readahead: Terminate async readahead on natural boundary
2025-06-09 9:27 [PATCH v5 0/5] Readahead tweaks for larger folios Ryan Roberts
2025-06-09 9:27 ` [PATCH v5 1/5] mm/readahead: Honour new_order in page_cache_ra_order() Ryan Roberts
@ 2025-06-09 9:27 ` Ryan Roberts
2025-06-09 9:27 ` [PATCH v5 3/5] mm/readahead: Make space in struct file_ra_state Ryan Roberts
` (2 subsequent siblings)
4 siblings, 0 replies; 11+ messages in thread
From: Ryan Roberts @ 2025-06-09 9:27 UTC (permalink / raw)
To: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-fsdevel,
linux-mm
Previously, asynchronous readahead would read ra_pages (usually 128K)
directly after the end of the synchronous readahead. Given that the
synchronous readahead portion had no alignment guarantees (beyond page
boundaries), it was possible (and likely) that the end of the initial
128K region would not fall on a natural boundary for the folio size
being used. Therefore smaller folios were used to align down to the
required boundary, both at the end of the previous readahead block and
at the start of the new one.
In the worst cases, this can result in never properly ramping up the
folio size, and instead getting stuck oscillating between order-0, -1
and -2 folios. The next readahead will try to use folios whose order is
+2 bigger than the folio that had the readahead marker. But because of
the alignment requirements, that folio (the first one in the readahead
block) can end up being order-0 in some cases.
Two modifications are needed to solve this issue:
1) Calculate the readahead size so the end is aligned to a folio
boundary. This prevents needing to allocate small folios to align
down at the end of the window and fixes the oscillation problem (see
the sketch below).
2) Remember the "preferred folio order" in the ra state instead of
inferring it from the folio with the readahead marker. This solves
the slow ramp up problem (discussed in a subsequent patch).
This patch addresses (1) only. A subsequent patch will address (2).
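For reference, modification (1) boils down to the following few lines in
page_cache_async_ra() (same logic as the diff at the end of this mail;
max_pages is the ra_pages limit):
  order += 2;                                    /* preferred order for the next window */
  align = 1UL << min(order, ffs(max_pages) - 1); /* never align beyond ra_pages */
  end = ra->start + ra->size;
  aligned_end = round_down(end, align);          /* pull the end back to a natural boundary */
  if (aligned_end > ra->start)                   /* only trim if a non-empty window remains */
          ra->size -= end - aligned_end;
  ra->async_size = ra->size;
Plugging in the numbers from the worked example below: ra->start = 33,
ra->size = 32 and order = 2 give align = 4, end = 65 and aligned_end =
64, so one page is trimmed and the window now ends on a natural order-2
boundary at page 64.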
Worked example:
The following shows the previous pathological behaviour when the initial
synchronous readahead is unaligned. We start reading at page 17 in the
file and read sequentially from there. I'm showing a dump of the pages
in the page cache just after we read the first page of the folio with
the readahead marker.
Initially there are no pages in the page cache:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00800000 8388608 0 2048 2048
Then we access page 17, causing synchronous read-around of 128K with a
readahead marker set up at page 25. So far, all as expected:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 Y
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
HOLE 0x00021000 0x00800000 8253440 33 2048 2015
Now access pages 18-25 inclusive. This causes an asynchronous 128K
readahead starting at page 33. But since we are unaligned, even though
the preferred folio order is 2, the first folio in this batch (the one
with the new readahead marker) is order-0:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
FOLIO 0x00021000 0x00022000 4096 33 34 1 0 Y
FOLIO 0x00022000 0x00024000 8192 34 36 2 1
FOLIO 0x00024000 0x00028000 16384 36 40 4 2
FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
FOLIO 0x00030000 0x00034000 16384 48 52 4 2
FOLIO 0x00034000 0x00038000 16384 52 56 4 2
FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
FOLIO 0x00040000 0x00041000 4096 64 65 1 0
HOLE 0x00041000 0x00800000 8122368 65 2048 1983
Which means that when we now read pages 26-33 and readahead is kicked
off again, the new preferred order is 2 (0 + 2), not 4 as we intended:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
FOLIO 0x00021000 0x00022000 4096 33 34 1 0
FOLIO 0x00022000 0x00024000 8192 34 36 2 1
FOLIO 0x00024000 0x00028000 16384 36 40 4 2
FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
FOLIO 0x00030000 0x00034000 16384 48 52 4 2
FOLIO 0x00034000 0x00038000 16384 52 56 4 2
FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
FOLIO 0x00040000 0x00041000 4096 64 65 1 0
FOLIO 0x00041000 0x00042000 4096 65 66 1 0 Y
FOLIO 0x00042000 0x00044000 8192 66 68 2 1
FOLIO 0x00044000 0x00048000 16384 68 72 4 2
FOLIO 0x00048000 0x0004c000 16384 72 76 4 2
FOLIO 0x0004c000 0x00050000 16384 76 80 4 2
FOLIO 0x00050000 0x00054000 16384 80 84 4 2
FOLIO 0x00054000 0x00058000 16384 84 88 4 2
FOLIO 0x00058000 0x0005c000 16384 88 92 4 2
FOLIO 0x0005c000 0x00060000 16384 92 96 4 2
FOLIO 0x00060000 0x00061000 4096 96 97 1 0
HOLE 0x00061000 0x00800000 7991296 97 2048 1951
This cycle of ramping up from order-0, with smaller orders at the edges
for alignment, continues all the way to the end of the file (not shown).
After the change, we round down the end boundary to the order boundary
so we no longer get stuck in the cycle and can ramp up the order over
time. Note that the rate of the ramp up is still slower than we would
like; that is fixed in the next patch. Here we are touching pages 17-256
sequentially:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
FOLIO 0x00021000 0x00022000 4096 33 34 1 0
FOLIO 0x00022000 0x00024000 8192 34 36 2 1
FOLIO 0x00024000 0x00028000 16384 36 40 4 2
FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
FOLIO 0x00030000 0x00034000 16384 48 52 4 2
FOLIO 0x00034000 0x00038000 16384 52 56 4 2
FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
FOLIO 0x00040000 0x00044000 16384 64 68 4 2
FOLIO 0x00044000 0x00048000 16384 68 72 4 2
FOLIO 0x00048000 0x0004c000 16384 72 76 4 2
FOLIO 0x0004c000 0x00050000 16384 76 80 4 2
FOLIO 0x00050000 0x00054000 16384 80 84 4 2
FOLIO 0x00054000 0x00058000 16384 84 88 4 2
FOLIO 0x00058000 0x0005c000 16384 88 92 4 2
FOLIO 0x0005c000 0x00060000 16384 92 96 4 2
FOLIO 0x00060000 0x00070000 65536 96 112 16 4
FOLIO 0x00070000 0x00080000 65536 112 128 16 4
FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
FOLIO 0x000e0000 0x00100000 131072 224 256 32 5
FOLIO 0x00100000 0x00120000 131072 256 288 32 5
FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y
HOLE 0x00140000 0x00800000 7077888 320 2048 1728
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
mm/readahead.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index 973de2551efe..87be20ae00d0 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -620,7 +620,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
unsigned long max_pages;
struct file_ra_state *ra = ractl->ra;
pgoff_t index = readahead_index(ractl);
- pgoff_t expected, start;
+ pgoff_t expected, start, end, aligned_end, align;
unsigned int order = folio_order(folio);
/* no readahead */
@@ -652,7 +652,6 @@ void page_cache_async_ra(struct readahead_control *ractl,
* the readahead window.
*/
ra->size = max(ra->size, get_next_ra_size(ra, max_pages));
- ra->async_size = ra->size;
goto readit;
}
@@ -673,9 +672,14 @@ void page_cache_async_ra(struct readahead_control *ractl,
ra->size = start - index; /* old async_size */
ra->size += req_count;
ra->size = get_next_ra_size(ra, max_pages);
- ra->async_size = ra->size;
readit:
order += 2;
+ align = 1UL << min(order, ffs(max_pages) - 1);
+ end = ra->start + ra->size;
+ aligned_end = round_down(end, align);
+ if (aligned_end > ra->start)
+ ra->size -= end - aligned_end;
+ ra->async_size = ra->size;
ractl->_index = ra->start;
page_cache_ra_order(ractl, ra, order);
}
--
2.43.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH v5 3/5] mm/readahead: Make space in struct file_ra_state
2025-06-09 9:27 [PATCH v5 0/5] Readahead tweaks for larger folios Ryan Roberts
2025-06-09 9:27 ` [PATCH v5 1/5] mm/readahead: Honour new_order in page_cache_ra_order() Ryan Roberts
2025-06-09 9:27 ` [PATCH v5 2/5] mm/readahead: Terminate async readahead on natural boundary Ryan Roberts
@ 2025-06-09 9:27 ` Ryan Roberts
2025-06-11 9:57 ` Christian Brauner
2025-06-09 9:27 ` [PATCH v5 4/5] mm/readahead: Store folio order " Ryan Roberts
2025-06-09 9:27 ` [PATCH v5 5/5] mm/filemap: Allow arch to request folio size for exec memory Ryan Roberts
4 siblings, 1 reply; 11+ messages in thread
From: Ryan Roberts @ 2025-06-09 9:27 UTC (permalink / raw)
To: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-fsdevel,
linux-mm
We need to be able to store the preferred folio order associated with a
readahead request in the struct file_ra_state so that we can more
accurately increase the order across subsequent readahead requests. But
struct file_ra_state is per-struct file, so we don't really want to
increase its size.
mmap_miss is currently 32 bits but it is only counted up to 10 *
MMAP_LOTSAMISS, which is currently defined as 1000. So 16 bits should be
plenty. Redefine it as unsigned short, making room for order as an
unsigned short in a follow-up commit.
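For context, the counter already saturates in do_sync_mmap_readahead()
(roughly the following, from mm/filemap.c), so the largest value it can
ever hold is 10 * MMAP_LOTSAMISS, far below the 65535 that an unsigned
short can represent:
  /* Avoid banging the cache line if not needed */
  mmap_miss = READ_ONCE(ra->mmap_miss);
  if (mmap_miss < MMAP_LOTSAMISS * 10)
          WRITE_ONCE(ra->mmap_miss, ++mmap_miss);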
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
include/linux/fs.h | 2 +-
mm/filemap.c | 11 ++++++-----
2 files changed, 7 insertions(+), 6 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 05abdabe9db7..87e7d5790e43 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1052,7 +1052,7 @@ struct file_ra_state {
unsigned int size;
unsigned int async_size;
unsigned int ra_pages;
- unsigned int mmap_miss;
+ unsigned short mmap_miss;
loff_t prev_pos;
};
diff --git a/mm/filemap.c b/mm/filemap.c
index a6459874bb2a..7bb4ffca8487 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3217,7 +3217,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
struct file *fpin = NULL;
unsigned long vm_flags = vmf->vma->vm_flags;
- unsigned int mmap_miss;
+ unsigned short mmap_miss;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/* Use the readahead code, even if readahead is disabled */
@@ -3285,7 +3285,7 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
struct file_ra_state *ra = &file->f_ra;
DEFINE_READAHEAD(ractl, file, ra, file->f_mapping, vmf->pgoff);
struct file *fpin = NULL;
- unsigned int mmap_miss;
+ unsigned short mmap_miss;
/* If we don't want any read-ahead, don't bother */
if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
@@ -3605,7 +3605,7 @@ static struct folio *next_uptodate_folio(struct xa_state *xas,
static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
struct folio *folio, unsigned long start,
unsigned long addr, unsigned int nr_pages,
- unsigned long *rss, unsigned int *mmap_miss)
+ unsigned long *rss, unsigned short *mmap_miss)
{
vm_fault_t ret = 0;
struct page *page = folio_page(folio, start);
@@ -3667,7 +3667,7 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
struct folio *folio, unsigned long addr,
- unsigned long *rss, unsigned int *mmap_miss)
+ unsigned long *rss, unsigned short *mmap_miss)
{
vm_fault_t ret = 0;
struct page *page = &folio->page;
@@ -3709,7 +3709,8 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
struct folio *folio;
vm_fault_t ret = 0;
unsigned long rss = 0;
- unsigned int nr_pages = 0, mmap_miss = 0, mmap_miss_saved, folio_type;
+ unsigned int nr_pages = 0, folio_type;
+ unsigned short mmap_miss = 0, mmap_miss_saved;
rcu_read_lock();
folio = next_uptodate_folio(&xas, mapping, end_pgoff);
--
2.43.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH v5 3/5] mm/readahead: Make space in struct file_ra_state
2025-06-09 9:27 ` [PATCH v5 3/5] mm/readahead: Make space in struct file_ra_state Ryan Roberts
@ 2025-06-11 9:57 ` Christian Brauner
0 siblings, 0 replies; 11+ messages in thread
From: Christian Brauner @ 2025-06-11 9:57 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro, Jan Kara,
David Hildenbrand, Dave Chinner, Catalin Marinas, Will Deacon,
Kalesh Singh, Zi Yan, linux-arm-kernel, linux-kernel,
linux-fsdevel, linux-mm
On Mon, Jun 09, 2025 at 10:27:25AM +0100, Ryan Roberts wrote:
> We need to be able to store the preferred folio order associated with a
> readahead request in the struct file_ra_state so that we can more
> accurately increase the order across subsequent readahead requests. But
> struct file_ra_state is per-struct file, so we don't really want to
> increase it's size.
>
> mmap_miss is currently 32 bits but it is only counted up to 10 *
> MMAP_LOTSAMISS, which is currently defined as 1000. So 16 bits should be
> plenty. Redefine it to unsigned short, making room for order as unsigned
> short in follow up commit.
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> include/linux/fs.h | 2 +-
> mm/filemap.c | 11 ++++++-----
> 2 files changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 05abdabe9db7..87e7d5790e43 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1052,7 +1052,7 @@ struct file_ra_state {
> unsigned int size;
> unsigned int async_size;
> unsigned int ra_pages;
> - unsigned int mmap_miss;
> + unsigned short mmap_miss;
Thanks for not making struct file grow!
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH v5 4/5] mm/readahead: Store folio order in struct file_ra_state
2025-06-09 9:27 [PATCH v5 0/5] Readahead tweaks for larger folios Ryan Roberts
` (2 preceding siblings ...)
2025-06-09 9:27 ` [PATCH v5 3/5] mm/readahead: Make space in struct file_ra_state Ryan Roberts
@ 2025-06-09 9:27 ` Ryan Roberts
2025-06-12 11:37 ` Jan Kara
2025-06-09 9:27 ` [PATCH v5 5/5] mm/filemap: Allow arch to request folio size for exec memory Ryan Roberts
4 siblings, 1 reply; 11+ messages in thread
From: Ryan Roberts @ 2025-06-09 9:27 UTC (permalink / raw)
To: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-fsdevel,
linux-mm
Previously the folio order of the previous readahead request was
inferred from the folio whose readahead marker was hit. But due to the
way we sometimes have to round to non-natural boundaries, this first
folio in the readahead block is often smaller than the preferred order
for that request. This means that for cases where the initial sync
readahead is poorly aligned, the folio order will ramp up much more
slowly.
So instead, let's store the order in struct file_ra_state so we are not
affected by any required alignment. We previously made enough room in
the struct for a 16-bit order field. This should be plenty big enough
since we are limited to MAX_PAGECACHE_ORDER anyway, which is certainly
never larger than ~20.
Since we now pass order in struct file_ra_state, page_cache_ra_order()
no longer needs its new_order parameter, so let's remove it.
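Condensed, the flow after this patch looks like this (a sketch assembled
from the hunks below, not a complete function):
  /* Sync readahead and mmap read-around seed the preferred order: */
  ra->order = 0;
  page_cache_ra_order(ractl, ra);
  /* Async readahead ramps up from the order remembered last time: */
  ra->order += 2;
  page_cache_ra_order(ractl, ra);
  /* And page_cache_ra_order() writes back the order it actually used: */
  new_order = min(mapping_max_folio_order(mapping), ra->order);
  new_order = min_t(unsigned int, new_order, ilog2(ra->size));
  new_order = max(new_order, min_order);
  ra->order = new_order;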
Worked example:
Here we are touching pages 17-256 sequentially just as we did in the
previous commit, but now that we are remembering the preferred order
explicitly, we no longer have the slow ramp up problem. Note
specifically that we no longer have 2 rounds (2x ~128K) of order-2
folios:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
FOLIO 0x00021000 0x00022000 4096 33 34 1 0
FOLIO 0x00022000 0x00024000 8192 34 36 2 1
FOLIO 0x00024000 0x00028000 16384 36 40 4 2
FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
FOLIO 0x00030000 0x00034000 16384 48 52 4 2
FOLIO 0x00034000 0x00038000 16384 52 56 4 2
FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
FOLIO 0x00040000 0x00050000 65536 64 80 16 4
FOLIO 0x00050000 0x00060000 65536 80 96 16 4
FOLIO 0x00060000 0x00080000 131072 96 128 32 5
FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
FOLIO 0x000e0000 0x00100000 131072 224 256 32 5
FOLIO 0x00100000 0x00120000 131072 256 288 32 5
FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y
HOLE 0x00140000 0x00800000 7077888 320 2048 1728
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
include/linux/fs.h | 2 ++
mm/filemap.c | 6 ++++--
mm/internal.h | 3 +--
mm/readahead.c | 21 +++++++++++++--------
4 files changed, 20 insertions(+), 12 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 87e7d5790e43..b5172b691f97 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1041,6 +1041,7 @@ struct fown_struct {
* and so were/are genuinely "ahead". Start next readahead when
* the first of these pages is accessed.
* @ra_pages: Maximum size of a readahead request, copied from the bdi.
+ * @order: Preferred folio order used for most recent readahead.
* @mmap_miss: How many mmap accesses missed in the page cache.
* @prev_pos: The last byte in the most recent read request.
*
@@ -1052,6 +1053,7 @@ struct file_ra_state {
unsigned int size;
unsigned int async_size;
unsigned int ra_pages;
+ unsigned short order;
unsigned short mmap_miss;
loff_t prev_pos;
};
diff --git a/mm/filemap.c b/mm/filemap.c
index 7bb4ffca8487..4b5c8d69f04c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3232,7 +3232,8 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
if (!(vm_flags & VM_RAND_READ))
ra->size *= 2;
ra->async_size = HPAGE_PMD_NR;
- page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER);
+ ra->order = HPAGE_PMD_ORDER;
+ page_cache_ra_order(&ractl, ra);
return fpin;
}
#endif
@@ -3268,8 +3269,9 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
ra->size = ra->ra_pages;
ra->async_size = ra->ra_pages / 4;
+ ra->order = 0;
ractl._index = ra->start;
- page_cache_ra_order(&ractl, ra, 0);
+ page_cache_ra_order(&ractl, ra);
return fpin;
}
diff --git a/mm/internal.h b/mm/internal.h
index 6b8ed2017743..f91688e2894f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -436,8 +436,7 @@ void zap_page_range_single_batched(struct mmu_gather *tlb,
int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
gfp_t gfp);
-void page_cache_ra_order(struct readahead_control *, struct file_ra_state *,
- unsigned int order);
+void page_cache_ra_order(struct readahead_control *, struct file_ra_state *);
void force_page_cache_ra(struct readahead_control *, unsigned long nr);
static inline void force_page_cache_readahead(struct address_space *mapping,
struct file *file, pgoff_t index, unsigned long nr_to_read)
diff --git a/mm/readahead.c b/mm/readahead.c
index 87be20ae00d0..95a24f12d1e7 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -457,7 +457,7 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
}
void page_cache_ra_order(struct readahead_control *ractl,
- struct file_ra_state *ra, unsigned int new_order)
+ struct file_ra_state *ra)
{
struct address_space *mapping = ractl->mapping;
pgoff_t start = readahead_index(ractl);
@@ -468,9 +468,12 @@ void page_cache_ra_order(struct readahead_control *ractl,
unsigned int nofs;
int err = 0;
gfp_t gfp = readahead_gfp_mask(mapping);
+ unsigned int new_order = ra->order;
- if (!mapping_large_folio_support(mapping))
+ if (!mapping_large_folio_support(mapping)) {
+ ra->order = 0;
goto fallback;
+ }
limit = min(limit, index + ra->size - 1);
@@ -478,6 +481,8 @@ void page_cache_ra_order(struct readahead_control *ractl,
new_order = min_t(unsigned int, new_order, ilog2(ra->size));
new_order = max(new_order, min_order);
+ ra->order = new_order;
+
/* See comment in page_cache_ra_unbounded() */
nofs = memalloc_nofs_save();
filemap_invalidate_lock_shared(mapping);
@@ -609,8 +614,9 @@ void page_cache_sync_ra(struct readahead_control *ractl,
ra->size = min(contig_count + req_count, max_pages);
ra->async_size = 1;
readit:
+ ra->order = 0;
ractl->_index = ra->start;
- page_cache_ra_order(ractl, ra, 0);
+ page_cache_ra_order(ractl, ra);
}
EXPORT_SYMBOL_GPL(page_cache_sync_ra);
@@ -621,7 +627,6 @@ void page_cache_async_ra(struct readahead_control *ractl,
struct file_ra_state *ra = ractl->ra;
pgoff_t index = readahead_index(ractl);
pgoff_t expected, start, end, aligned_end, align;
- unsigned int order = folio_order(folio);
/* no readahead */
if (!ra->ra_pages)
@@ -644,7 +649,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
* Ramp up sizes, and push forward the readahead window.
*/
expected = round_down(ra->start + ra->size - ra->async_size,
- 1UL << order);
+ 1UL << folio_order(folio));
if (index == expected) {
ra->start += ra->size;
/*
@@ -673,15 +678,15 @@ void page_cache_async_ra(struct readahead_control *ractl,
ra->size += req_count;
ra->size = get_next_ra_size(ra, max_pages);
readit:
- order += 2;
- align = 1UL << min(order, ffs(max_pages) - 1);
+ ra->order += 2;
+ align = 1UL << min(ra->order, ffs(max_pages) - 1);
end = ra->start + ra->size;
aligned_end = round_down(end, align);
if (aligned_end > ra->start)
ra->size -= end - aligned_end;
ra->async_size = ra->size;
ractl->_index = ra->start;
- page_cache_ra_order(ractl, ra, order);
+ page_cache_ra_order(ractl, ra);
}
EXPORT_SYMBOL_GPL(page_cache_async_ra);
--
2.43.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH v5 4/5] mm/readahead: Store folio order in struct file_ra_state
2025-06-09 9:27 ` [PATCH v5 4/5] mm/readahead: Store folio order " Ryan Roberts
@ 2025-06-12 11:37 ` Jan Kara
0 siblings, 0 replies; 11+ messages in thread
From: Jan Kara @ 2025-06-12 11:37 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan,
linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On Mon 09-06-25 10:27:26, Ryan Roberts wrote:
> Previously the folio order of the previous readahead request was
> inferred from the folio who's readahead marker was hit. But due to the
> way we have to round to non-natural boundaries sometimes, this first
> folio in the readahead block is often smaller than the preferred order
> for that request. This means that for cases where the initial sync
> readahead is poorly aligned, the folio order will ramp up much more
> slowly.
>
> So instead, let's store the order in struct file_ra_state so we are not
> affected by any required alignment. We previously made enough room in
> the struct for a 16 order field. This should be plenty big enough since
> we are limited to MAX_PAGECACHE_ORDER anyway, which is certainly never
> larger than ~20.
>
> Since we now pass order in struct file_ra_state, page_cache_ra_order()
> no longer needs it's new_order parameter, so let's remove that.
>
> Worked example:
>
> Here we are touching pages 17-256 sequentially just as we did in the
> previous commit, but now that we are remembering the preferred order
> explicitly, we no longer have the slow ramp up problem. Note
> specifically that we no longer have 2 rounds (2x ~128K) of order-2
> folios:
>
> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
> ----- ---------- ---------- ---------- ------- ------- ----- ----- --
> HOLE 0x00000000 0x00001000 4096 0 1 1
> FOLIO 0x00001000 0x00002000 4096 1 2 1 0
> FOLIO 0x00002000 0x00003000 4096 2 3 1 0
> FOLIO 0x00003000 0x00004000 4096 3 4 1 0
> FOLIO 0x00004000 0x00005000 4096 4 5 1 0
> FOLIO 0x00005000 0x00006000 4096 5 6 1 0
> FOLIO 0x00006000 0x00007000 4096 6 7 1 0
> FOLIO 0x00007000 0x00008000 4096 7 8 1 0
> FOLIO 0x00008000 0x00009000 4096 8 9 1 0
> FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
> FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
> FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
> FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
> FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
> FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
> FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
> FOLIO 0x00010000 0x00011000 4096 16 17 1 0
> FOLIO 0x00011000 0x00012000 4096 17 18 1 0
> FOLIO 0x00012000 0x00013000 4096 18 19 1 0
> FOLIO 0x00013000 0x00014000 4096 19 20 1 0
> FOLIO 0x00014000 0x00015000 4096 20 21 1 0
> FOLIO 0x00015000 0x00016000 4096 21 22 1 0
> FOLIO 0x00016000 0x00017000 4096 22 23 1 0
> FOLIO 0x00017000 0x00018000 4096 23 24 1 0
> FOLIO 0x00018000 0x00019000 4096 24 25 1 0
> FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
> FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
> FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
> FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
> FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
> FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
> FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
> FOLIO 0x00020000 0x00021000 4096 32 33 1 0
> FOLIO 0x00021000 0x00022000 4096 33 34 1 0
> FOLIO 0x00022000 0x00024000 8192 34 36 2 1
> FOLIO 0x00024000 0x00028000 16384 36 40 4 2
> FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
> FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
> FOLIO 0x00030000 0x00034000 16384 48 52 4 2
> FOLIO 0x00034000 0x00038000 16384 52 56 4 2
> FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
> FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
> FOLIO 0x00040000 0x00050000 65536 64 80 16 4
> FOLIO 0x00050000 0x00060000 65536 80 96 16 4
> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
> FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
> FOLIO 0x000e0000 0x00100000 131072 224 256 32 5
> FOLIO 0x00100000 0x00120000 131072 256 288 32 5
> FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y
> HOLE 0x00140000 0x00800000 7077888 320 2048 1728
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Looks good! Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> include/linux/fs.h | 2 ++
> mm/filemap.c | 6 ++++--
> mm/internal.h | 3 +--
> mm/readahead.c | 21 +++++++++++++--------
> 4 files changed, 20 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 87e7d5790e43..b5172b691f97 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1041,6 +1041,7 @@ struct fown_struct {
> * and so were/are genuinely "ahead". Start next readahead when
> * the first of these pages is accessed.
> * @ra_pages: Maximum size of a readahead request, copied from the bdi.
> + * @order: Preferred folio order used for most recent readahead.
> * @mmap_miss: How many mmap accesses missed in the page cache.
> * @prev_pos: The last byte in the most recent read request.
> *
> @@ -1052,6 +1053,7 @@ struct file_ra_state {
> unsigned int size;
> unsigned int async_size;
> unsigned int ra_pages;
> + unsigned short order;
> unsigned short mmap_miss;
> loff_t prev_pos;
> };
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7bb4ffca8487..4b5c8d69f04c 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3232,7 +3232,8 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> if (!(vm_flags & VM_RAND_READ))
> ra->size *= 2;
> ra->async_size = HPAGE_PMD_NR;
> - page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER);
> + ra->order = HPAGE_PMD_ORDER;
> + page_cache_ra_order(&ractl, ra);
> return fpin;
> }
> #endif
> @@ -3268,8 +3269,9 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
> ra->size = ra->ra_pages;
> ra->async_size = ra->ra_pages / 4;
> + ra->order = 0;
> ractl._index = ra->start;
> - page_cache_ra_order(&ractl, ra, 0);
> + page_cache_ra_order(&ractl, ra);
> return fpin;
> }
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 6b8ed2017743..f91688e2894f 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -436,8 +436,7 @@ void zap_page_range_single_batched(struct mmu_gather *tlb,
> int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
> gfp_t gfp);
>
> -void page_cache_ra_order(struct readahead_control *, struct file_ra_state *,
> - unsigned int order);
> +void page_cache_ra_order(struct readahead_control *, struct file_ra_state *);
> void force_page_cache_ra(struct readahead_control *, unsigned long nr);
> static inline void force_page_cache_readahead(struct address_space *mapping,
> struct file *file, pgoff_t index, unsigned long nr_to_read)
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 87be20ae00d0..95a24f12d1e7 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -457,7 +457,7 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
> }
>
> void page_cache_ra_order(struct readahead_control *ractl,
> - struct file_ra_state *ra, unsigned int new_order)
> + struct file_ra_state *ra)
> {
> struct address_space *mapping = ractl->mapping;
> pgoff_t start = readahead_index(ractl);
> @@ -468,9 +468,12 @@ void page_cache_ra_order(struct readahead_control *ractl,
> unsigned int nofs;
> int err = 0;
> gfp_t gfp = readahead_gfp_mask(mapping);
> + unsigned int new_order = ra->order;
>
> - if (!mapping_large_folio_support(mapping))
> + if (!mapping_large_folio_support(mapping)) {
> + ra->order = 0;
> goto fallback;
> + }
>
> limit = min(limit, index + ra->size - 1);
>
> @@ -478,6 +481,8 @@ void page_cache_ra_order(struct readahead_control *ractl,
> new_order = min_t(unsigned int, new_order, ilog2(ra->size));
> new_order = max(new_order, min_order);
>
> + ra->order = new_order;
> +
> /* See comment in page_cache_ra_unbounded() */
> nofs = memalloc_nofs_save();
> filemap_invalidate_lock_shared(mapping);
> @@ -609,8 +614,9 @@ void page_cache_sync_ra(struct readahead_control *ractl,
> ra->size = min(contig_count + req_count, max_pages);
> ra->async_size = 1;
> readit:
> + ra->order = 0;
> ractl->_index = ra->start;
> - page_cache_ra_order(ractl, ra, 0);
> + page_cache_ra_order(ractl, ra);
> }
> EXPORT_SYMBOL_GPL(page_cache_sync_ra);
>
> @@ -621,7 +627,6 @@ void page_cache_async_ra(struct readahead_control *ractl,
> struct file_ra_state *ra = ractl->ra;
> pgoff_t index = readahead_index(ractl);
> pgoff_t expected, start, end, aligned_end, align;
> - unsigned int order = folio_order(folio);
>
> /* no readahead */
> if (!ra->ra_pages)
> @@ -644,7 +649,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
> * Ramp up sizes, and push forward the readahead window.
> */
> expected = round_down(ra->start + ra->size - ra->async_size,
> - 1UL << order);
> + 1UL << folio_order(folio));
> if (index == expected) {
> ra->start += ra->size;
> /*
> @@ -673,15 +678,15 @@ void page_cache_async_ra(struct readahead_control *ractl,
> ra->size += req_count;
> ra->size = get_next_ra_size(ra, max_pages);
> readit:
> - order += 2;
> - align = 1UL << min(order, ffs(max_pages) - 1);
> + ra->order += 2;
> + align = 1UL << min(ra->order, ffs(max_pages) - 1);
> end = ra->start + ra->size;
> aligned_end = round_down(end, align);
> if (aligned_end > ra->start)
> ra->size -= end - aligned_end;
> ra->async_size = ra->size;
> ractl->_index = ra->start;
> - page_cache_ra_order(ractl, ra, order);
> + page_cache_ra_order(ractl, ra);
> }
> EXPORT_SYMBOL_GPL(page_cache_async_ra);
>
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH v5 5/5] mm/filemap: Allow arch to request folio size for exec memory
2025-06-09 9:27 [PATCH v5 0/5] Readahead tweaks for larger folios Ryan Roberts
` (3 preceding siblings ...)
2025-06-09 9:27 ` [PATCH v5 4/5] mm/readahead: Store folio order " Ryan Roberts
@ 2025-06-09 9:27 ` Ryan Roberts
2025-06-19 11:07 ` Ryan Roberts
2025-07-11 15:41 ` Tao Xu
4 siblings, 2 replies; 11+ messages in thread
From: Ryan Roberts @ 2025-06-09 9:27 UTC (permalink / raw)
To: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-fsdevel,
linux-mm
Change the readahead config so that, when readahead is requested for an
executable mapping, we do a synchronous read into a set of folios with
an arch-specified order, in a naturally aligned manner. We no longer
center the read on the faulting page but simply align it down to the
previous natural boundary. Additionally, we don't bother with an
asynchronous part.
On arm64 if memory is physically contiguous and naturally aligned to the
"contpte" size, we can use contpte mappings, which improves utilization
of the TLB. When paired with the "multi-size THP" feature, this works
well to reduce dTLB pressure. However iTLB pressure is still high due to
executable mappings having a low likelihood of being in the required
folio size and mapping alignment, even when the filesystem supports
readahead into large folios (e.g. XFS).
The reason for the low likelihood is that the current readahead
algorithm starts with an order-0 folio and increases the folio order by
2 every time the readahead mark is hit. But most executable memory tends
to be accessed randomly and so the readahead mark is rarely hit and most
executable folios remain order-0.
So let's special-case the read(ahead) logic for executable mappings. The
trade-off is performance improvement (due to more efficient storage of
the translations in iTLB) vs potential for making reclaim more difficult
(due to the folios being larger so if a part of the folio is hot the
whole thing is considered hot). But executable memory is a small portion
of the overall system memory so I doubt this will even register from a
reclaim perspective.
I've chosen 64K folio size for arm64 which benefits both the 4K and 16K
base page size configs. Crucially the same amount of data is still read
(usually 128K) so I'm not expecting any read amplification issues. I
don't anticipate any write amplification because text is always RO.
Note that the text region of an ELF file could be populated into the
page cache for other reasons than taking a fault in a mmapped area. The
most common case is due to the loader read()ing the header which can be
shared with the beginning of text. So some text will still remain in
small folios, but this simple, best effort change provides good
performance improvements as is.
Confine this special-case approach to the bounds of the VMA. This
prevents wasting memory for any padding that might exist in the file
between sections. Previously the padding would have been contained in
order-0 folios and would be easy to reclaim. But now it would be part of
a larger folio so more difficult to reclaim. Solve this by simply not
reading it into memory in the first place.
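The window calculation for the VM_EXEC case then works out as follows
(taken from the mm/filemap.c hunk below; the example values in the
comments assume arm64 with 4K pages, so exec_folio_order() == 4, the
default 128K (32 page) ra_pages, and a fault at pgoff 100 in a large VMA
starting at pgoff 0):
  ra->order = exec_folio_order();                       /* 4, i.e. 64K folios */
  ra->start = round_down(vmf->pgoff, 1UL << ra->order); /* 100 -> 96          */
  ra->start = max(ra->start, start);                    /* clamp to VMA start */
  ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order); /* 128       */
  ra_end = min(ra_end, end);                            /* clamp to VMA end   */
  ra->size = ra_end - ra->start;                        /* 32 pages           */
  ra->async_size = 0;                                   /* no async part      */
So the faulting page lands inside a naturally aligned run of two 64K
folios and nothing outside the VMA is read.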
Benchmarking
============
The below shows pgbench and redis benchmarks on Graviton3 arm64 system.
First, confirmation that this patch causes more text to be contained in
64K folios:
+----------------------+---------------+---------------+---------------+
| File-backed folios by| system boot | pgbench | redis |
| size as percentage of+-------+-------+-------+-------+-------+-------+
| all mapped text mem |before | after |before | after |before | after |
+======================+=======+=======+=======+=======+=======+=======+
| base-page-4kB | 78% | 30% | 78% | 11% | 73% | 14% |
| thp-aligned-8kB | 1% | 0% | 0% | 0% | 1% | 0% |
| thp-aligned-16kB | 17% | 4% | 17% | 3% | 20% | 4% |
| thp-aligned-32kB | 1% | 1% | 1% | 2% | 1% | 1% |
| thp-aligned-64kB | 3% | 63% | 3% | 81% | 4% | 77% |
| thp-aligned-128kB | 0% | 1% | 1% | 1% | 1% | 2% |
| thp-unaligned-64kB | 0% | 0% | 0% | 1% | 0% | 1% |
| thp-unaligned-128kB | 0% | 1% | 0% | 0% | 0% | 0% |
| thp-partial | 0% | 0% | 0% | 1% | 0% | 1% |
+----------------------+-------+-------+-------+-------+-------+-------+
| cont-aligned-64kB | 4% | 65% | 4% | 83% | 6% | 79% |
+----------------------+-------+-------+-------+-------+-------+-------+
The above shows that for both workloads (each isolated with cgroups) as
well as the general system state after boot, the amount of text backed
by 4K and 16K folios reduces and the amount backed by 64K folios
increases significantly. And the amount of text that is contpte-mapped
significantly increases (see last row).
And this is reflected in a performance improvement. "(I)" indicates a
statistically significant improvement. Note that TPS and Reqs/sec are
rates, so bigger is better; ms is a latency, so smaller is better:
+-------------+-------------------------------------------+------------+
| Benchmark | Result Class | Improvement |
+=============+===========================================+============+
| pts/pgbench | Scale: 1 Clients: 1 RO (TPS) | (I) 3.47% |
| | Scale: 1 Clients: 1 RO - Latency (ms) | -2.88% |
| | Scale: 1 Clients: 250 RO (TPS) | (I) 5.02% |
| | Scale: 1 Clients: 250 RO - Latency (ms) | (I) -4.79% |
| | Scale: 1 Clients: 1000 RO (TPS) | (I) 6.16% |
| | Scale: 1 Clients: 1000 RO - Latency (ms) | (I) -5.82% |
| | Scale: 100 Clients: 1 RO (TPS) | 2.51% |
| | Scale: 100 Clients: 1 RO - Latency (ms) | -3.51% |
| | Scale: 100 Clients: 250 RO (TPS) | (I) 4.75% |
| | Scale: 100 Clients: 250 RO - Latency (ms) | (I) -4.44% |
| | Scale: 100 Clients: 1000 RO (TPS) | (I) 6.34% |
| | Scale: 100 Clients: 1000 RO - Latency (ms)| (I) -5.95% |
+-------------+-------------------------------------------+------------+
| pts/redis | Test: GET Connections: 50 (Reqs/sec) | (I) 3.20% |
| | Test: GET Connections: 1000 (Reqs/sec) | (I) 2.55% |
| | Test: LPOP Connections: 50 (Reqs/sec) | (I) 4.59% |
| | Test: LPOP Connections: 1000 (Reqs/sec) | (I) 4.81% |
| | Test: LPUSH Connections: 50 (Reqs/sec) | (I) 5.31% |
| | Test: LPUSH Connections: 1000 (Reqs/sec) | (I) 4.36% |
| | Test: SADD Connections: 50 (Reqs/sec) | (I) 2.64% |
| | Test: SADD Connections: 1000 (Reqs/sec) | (I) 4.15% |
| | Test: SET Connections: 50 (Reqs/sec) | (I) 3.11% |
| | Test: SET Connections: 1000 (Reqs/sec) | (I) 3.36% |
+-------------+-------------------------------------------+------------+
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/arm64/include/asm/pgtable.h | 8 ++++++
include/linux/pgtable.h | 11 ++++++++
mm/filemap.c | 47 ++++++++++++++++++++++++++------
3 files changed, 57 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 88db8a0c0b37..7a7dfdce14b8 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1643,6 +1643,14 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
*/
#define arch_wants_old_prefaulted_pte cpu_has_hw_af
+/*
+ * Request exec memory is read into pagecache in at least 64K folios. This size
+ * can be contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB
+ * entry), and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base
+ * pages are in use.
+ */
+#define exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
+
static inline bool pud_sect_supported(void)
{
return PAGE_SIZE == SZ_4K;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 0b6e1f781d86..e4a3895c043b 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -456,6 +456,17 @@ static inline bool arch_has_hw_pte_young(void)
}
#endif
+#ifndef exec_folio_order
+/*
+ * Returns preferred minimum folio order for executable file-backed memory. Must
+ * be in range [0, PMD_ORDER). Default to order-0.
+ */
+static inline unsigned int exec_folio_order(void)
+{
+ return 0;
+}
+#endif
+
#ifndef arch_check_zapped_pte
static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
pte_t pte)
diff --git a/mm/filemap.c b/mm/filemap.c
index 4b5c8d69f04c..93fbc2ef232a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3238,8 +3238,11 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
}
#endif
- /* If we don't want any read-ahead, don't bother */
- if (vm_flags & VM_RAND_READ)
+ /*
+ * If we don't want any read-ahead, don't bother. VM_EXEC case below is
+ * already intended for random access.
+ */
+ if ((vm_flags & (VM_RAND_READ | VM_EXEC)) == VM_RAND_READ)
return fpin;
if (!ra->ra_pages)
return fpin;
@@ -3262,14 +3265,40 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
if (mmap_miss > MMAP_LOTSAMISS)
return fpin;
- /*
- * mmap read-around
- */
fpin = maybe_unlock_mmap_for_io(vmf, fpin);
- ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
- ra->size = ra->ra_pages;
- ra->async_size = ra->ra_pages / 4;
- ra->order = 0;
+ if (vm_flags & VM_EXEC) {
+ /*
+ * Allow arch to request a preferred minimum folio order for
+ * executable memory. This can often be beneficial to
+ * performance if (e.g.) arm64 can contpte-map the folio.
+ * Executable memory rarely benefits from readahead, due to its
+ * random access nature, so set async_size to 0.
+ *
+ * Limit to the boundaries of the VMA to avoid reading in any
+ * pad that might exist between sections, which would be a waste
+ * of memory.
+ */
+ struct vm_area_struct *vma = vmf->vma;
+ unsigned long start = vma->vm_pgoff;
+ unsigned long end = start + ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT);
+ unsigned long ra_end;
+
+ ra->order = exec_folio_order();
+ ra->start = round_down(vmf->pgoff, 1UL << ra->order);
+ ra->start = max(ra->start, start);
+ ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order);
+ ra_end = min(ra_end, end);
+ ra->size = ra_end - ra->start;
+ ra->async_size = 0;
+ } else {
+ /*
+ * mmap read-around
+ */
+ ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
+ ra->size = ra->ra_pages;
+ ra->async_size = ra->ra_pages / 4;
+ ra->order = 0;
+ }
ractl._index = ra->start;
page_cache_ra_order(&ractl, ra);
return fpin;
--
2.43.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH v5 5/5] mm/filemap: Allow arch to request folio size for exec memory
2025-06-09 9:27 ` [PATCH v5 5/5] mm/filemap: Allow arch to request folio size for exec memory Ryan Roberts
@ 2025-06-19 11:07 ` Ryan Roberts
2025-07-11 15:41 ` Tao Xu
1 sibling, 0 replies; 11+ messages in thread
From: Ryan Roberts @ 2025-06-19 11:07 UTC (permalink / raw)
To: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
Hi Andrew,
On 09/06/2025 10:27, Ryan Roberts wrote:
> Change the readahead config so that if it is being requested for an
> executable mapping, do a synchronous read into a set of folios with an
> arch-specified order and in a naturally aligned manner. We no longer
> center the read on the faulting page but simply align it down to the
> previous natural boundary. Additionally, we don't bother with an
> asynchronous part.
>
> On arm64 if memory is physically contiguous and naturally aligned to the
> "contpte" size, we can use contpte mappings, which improves utilization
> of the TLB. When paired with the "multi-size THP" feature, this works
> well to reduce dTLB pressure. However iTLB pressure is still high due to
> executable mappings having a low likelihood of being in the required
> folio size and mapping alignment, even when the filesystem supports
> readahead into large folios (e.g. XFS).
>
> The reason for the low likelihood is that the current readahead
> algorithm starts with an order-0 folio and increases the folio order by
> 2 every time the readahead mark is hit. But most executable memory tends
> to be accessed randomly and so the readahead mark is rarely hit and most
> executable folios remain order-0.
>
> So let's special-case the read(ahead) logic for executable mappings. The
> trade-off is performance improvement (due to more efficient storage of
> the translations in iTLB) vs potential for making reclaim more difficult
> (due to the folios being larger so if a part of the folio is hot the
> whole thing is considered hot). But executable memory is a small portion
> of the overall system memory so I doubt this will even register from a
> reclaim perspective.
>
> I've chosen 64K folio size for arm64 which benefits both the 4K and 16K
> base page size configs. Crucially the same amount of data is still read
> (usually 128K) so I'm not expecting any read amplification issues. I
> don't anticipate any write amplification because text is always RO.
>
> Note that the text region of an ELF file could be populated into the
> page cache for other reasons than taking a fault in a mmapped area. The
> most common case is due to the loader read()ing the header which can be
> shared with the beginning of text. So some text will still remain in
> small folios, but this simple, best effort change provides good
> performance improvements as is.
>
> Confine this special-case approach to the bounds of the VMA. This
> prevents wasting memory for any padding that might exist in the file
> between sections. Previously the padding would have been contained in
> order-0 folios and would be easy to reclaim. But now it would be part of
> a larger folio so more difficult to reclaim. Solve this by simply not
> reading it into memory in the first place.
>
> Benchmarking
> ============
>
> The below shows pgbench and redis benchmarks on Graviton3 arm64 system.
>
> First, confirmation that this patch causes more text to be contained in
> 64K folios:
>
> +----------------------+---------------+---------------+---------------+
> | File-backed folios by| system boot | pgbench | redis |
> | size as percentage of+-------+-------+-------+-------+-------+-------+
> | all mapped text mem |before | after |before | after |before | after |
> +======================+=======+=======+=======+=======+=======+=======+
> | base-page-4kB | 78% | 30% | 78% | 11% | 73% | 14% |
> | thp-aligned-8kB | 1% | 0% | 0% | 0% | 1% | 0% |
> | thp-aligned-16kB | 17% | 4% | 17% | 3% | 20% | 4% |
> | thp-aligned-32kB | 1% | 1% | 1% | 2% | 1% | 1% |
> | thp-aligned-64kB | 3% | 63% | 3% | 81% | 4% | 77% |
> | thp-aligned-128kB | 0% | 1% | 1% | 1% | 1% | 2% |
> | thp-unaligned-64kB | 0% | 0% | 0% | 1% | 0% | 1% |
> | thp-unaligned-128kB | 0% | 1% | 0% | 0% | 0% | 0% |
> | thp-partial | 0% | 0% | 0% | 1% | 0% | 1% |
> +----------------------+-------+-------+-------+-------+-------+-------+
> | cont-aligned-64kB | 4% | 65% | 4% | 83% | 6% | 79% |
> +----------------------+-------+-------+-------+-------+-------+-------+
>
> The above shows that for both workloads (each isolated with cgroups) as
> well as the general system state after boot, the amount of text backed
> by 4K and 16K folios reduces and the amount backed by 64K folios
> increases significantly. And the amount of text that is contpte-mapped
> significantly increases (see last row).
>
> And this is reflected in performance improvement. "(I)" indicates a
> statistically significant improvement. Note TPS and Reqs/sec are rates
> so bigger is better, ms is time so smaller is better:
>
> +-------------+-------------------------------------------+------------+
> | Benchmark | Result Class | Improvemnt |
> +=============+===========================================+============+
> | pts/pgbench | Scale: 1 Clients: 1 RO (TPS) | (I) 3.47% |
> | | Scale: 1 Clients: 1 RO - Latency (ms) | -2.88% |
> | | Scale: 1 Clients: 250 RO (TPS) | (I) 5.02% |
> | | Scale: 1 Clients: 250 RO - Latency (ms) | (I) -4.79% |
> | | Scale: 1 Clients: 1000 RO (TPS) | (I) 6.16% |
> | | Scale: 1 Clients: 1000 RO - Latency (ms) | (I) -5.82% |
> | | Scale: 100 Clients: 1 RO (TPS) | 2.51% |
> | | Scale: 100 Clients: 1 RO - Latency (ms) | -3.51% |
> | | Scale: 100 Clients: 250 RO (TPS) | (I) 4.75% |
> | | Scale: 100 Clients: 250 RO - Latency (ms) | (I) -4.44% |
> | | Scale: 100 Clients: 1000 RO (TPS) | (I) 6.34% |
> | | Scale: 100 Clients: 1000 RO - Latency (ms)| (I) -5.95% |
> +-------------+-------------------------------------------+------------+
> | pts/redis | Test: GET Connections: 50 (Reqs/sec) | (I) 3.20% |
> | | Test: GET Connections: 1000 (Reqs/sec) | (I) 2.55% |
> | | Test: LPOP Connections: 50 (Reqs/sec) | (I) 4.59% |
> | | Test: LPOP Connections: 1000 (Reqs/sec) | (I) 4.81% |
> | | Test: LPUSH Connections: 50 (Reqs/sec) | (I) 5.31% |
> | | Test: LPUSH Connections: 1000 (Reqs/sec) | (I) 4.36% |
> | | Test: SADD Connections: 50 (Reqs/sec) | (I) 2.64% |
> | | Test: SADD Connections: 1000 (Reqs/sec) | (I) 4.15% |
> | | Test: SET Connections: 50 (Reqs/sec) | (I) 3.11% |
> | | Test: SET Connections: 1000 (Reqs/sec) | (I) 3.36% |
> +-------------+-------------------------------------------+------------+
>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Acked-by: Will Deacon <will@kernel.org>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
A use-after-free issue was reported against this patch, which I believe is still
in mm-unstable? The problem is that I'm accessing the vma after dropping the
mmap lock, so the fix is to move the unlock to after the if/else. Would you mind
squashing this into the patch?
The report is here:
https://lore.kernel.org/linux-mm/hi6tsbuplmf6jcr44tqu6mdhtyebyqgsfif7okhnrzkcowpo4d@agoyrl4ozyth/
---8<---
diff --git a/mm/filemap.c b/mm/filemap.c
index 93fbc2ef232a..eaf853d6b719 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3265,7 +3265,6 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
if (mmap_miss > MMAP_LOTSAMISS)
return fpin;
- fpin = maybe_unlock_mmap_for_io(vmf, fpin);
if (vm_flags & VM_EXEC) {
/*
* Allow arch to request a preferred minimum folio order for
@@ -3299,6 +3298,8 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
ra->async_size = ra->ra_pages / 4;
ra->order = 0;
}
+
+ fpin = maybe_unlock_mmap_for_io(vmf, fpin);
ractl._index = ra->start;
page_cache_ra_order(&ractl, ra);
return fpin;
---8<---
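To make the net effect of the squash explicit, the tail of
do_sync_mmap_readahead() would then read roughly as below (an abbreviated
sketch assembled from the two diffs above, with the unchanged VM_EXEC body
elided, so not compilable on its own); the point is that vmf->vma is only
dereferenced while the mmap lock is still held:

        if (mmap_miss > MMAP_LOTSAMISS)
                return fpin;

        if (vm_flags & VM_EXEC) {
                /* VM_EXEC window setup as in the patch; reads vmf->vma fields */
                ...
        } else {
                /*
                 * mmap read-around
                 */
                ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
                ra->size = ra->ra_pages;
                ra->async_size = ra->ra_pages / 4;
                ra->order = 0;
        }

        /* Only now may the mmap lock be dropped; no vmf->vma access after this */
        fpin = maybe_unlock_mmap_for_io(vmf, fpin);
        ractl._index = ra->start;
        page_cache_ra_order(&ractl, ra);
        return fpin;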
Thanks,
Ryan
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH v5 5/5] mm/filemap: Allow arch to request folio size for exec memory
2025-06-09 9:27 ` [PATCH v5 5/5] mm/filemap: Allow arch to request folio size for exec memory Ryan Roberts
2025-06-19 11:07 ` Ryan Roberts
@ 2025-07-11 15:41 ` Tao Xu
2025-07-14 8:19 ` Ryan Roberts
1 sibling, 1 reply; 11+ messages in thread
From: Tao Xu @ 2025-07-11 15:41 UTC (permalink / raw)
To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
Alexander Viro, Christian Brauner, Jan Kara, David Hildenbrand,
Dave Chinner, Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On 09/06/2025 10:27, Ryan Roberts wrote:
> Change the readahead config so that if it is being requested for an
> executable mapping, do a synchronous read into a set of folios with an
> arch-specified order and in a naturally aligned manner. We no longer
> center the read on the faulting page but simply align it down to the
> previous natural boundary. Additionally, we don't bother with an
> asynchronous part.
>
> On arm64 if memory is physically contiguous and naturally aligned to the
> "contpte" size, we can use contpte mappings, which improves utilization
> of the TLB. When paired with the "multi-size THP" feature, this works
> well to reduce dTLB pressure. However iTLB pressure is still high due to
> executable mappings having a low likelihood of being in the required
> folio size and mapping alignment, even when the filesystem supports
> readahead into large folios (e.g. XFS).
>
> The reason for the low likelihood is that the current readahead
> algorithm starts with an order-0 folio and increases the folio order by
> 2 every time the readahead mark is hit. But most executable memory tends
> to be accessed randomly and so the readahead mark is rarely hit and most
> executable folios remain order-0.
>
> So let's special-case the read(ahead) logic for executable mappings. The
> trade-off is performance improvement (due to more efficient storage of
> the translations in iTLB) vs potential for making reclaim more difficult
> (due to the folios being larger so if a part of the folio is hot the
> whole thing is considered hot). But executable memory is a small portion
> of the overall system memory so I doubt this will even register from a
> reclaim perspective.
>
> I've chosen 64K folio size for arm64 which benefits both the 4K and 16K
> base page size configs. Crucially the same amount of data is still read
> (usually 128K) so I'm not expecting any read amplification issues. I
> don't anticipate any write amplification because text is always RO.
>
> Note that the text region of an ELF file could be populated into the
> page cache for other reasons than taking a fault in a mmapped area. The
> most common case is due to the loader read()ing the header which can be
> shared with the beginning of text. So some text will still remain in
> small folios, but this simple, best effort change provides good
> performance improvements as is.
>
> Confine this special-case approach to the bounds of the VMA. This
> prevents wasting memory for any padding that might exist in the file
> between sections. Previously the padding would have been contained in
> order-0 folios and would be easy to reclaim. But now it would be part of
> a larger folio so more difficult to reclaim. Solve this by simply not
> reading it into memory in the first place.
>
> Benchmarking
> ============
>
> The below shows pgbench and redis benchmarks on Graviton3 arm64 system.
>
> First, confirmation that this patch causes more text to be contained in
> 64K folios:
>
> +----------------------+---------------+---------------+---------------+
> | File-backed folios by| system boot | pgbench | redis |
> | size as percentage of+-------+-------+-------+-------+-------+-------+
> | all mapped text mem |before | after |before | after |before | after |
> +======================+=======+=======+=======+=======+=======+=======+
> | base-page-4kB | 78% | 30% | 78% | 11% | 73% | 14% |
> | thp-aligned-8kB | 1% | 0% | 0% | 0% | 1% | 0% |
> | thp-aligned-16kB | 17% | 4% | 17% | 3% | 20% | 4% |
> | thp-aligned-32kB | 1% | 1% | 1% | 2% | 1% | 1% |
> | thp-aligned-64kB | 3% | 63% | 3% | 81% | 4% | 77% |
> | thp-aligned-128kB | 0% | 1% | 1% | 1% | 1% | 2% |
> | thp-unaligned-64kB | 0% | 0% | 0% | 1% | 0% | 1% |
> | thp-unaligned-128kB | 0% | 1% | 0% | 0% | 0% | 0% |
> | thp-partial | 0% | 0% | 0% | 1% | 0% | 1% |
> +----------------------+-------+-------+-------+-------+-------+-------+
> | cont-aligned-64kB | 4% | 65% | 4% | 83% | 6% | 79% |
> +----------------------+-------+-------+-------+-------+-------+-------+
>
> The above shows that for both workloads (each isolated with cgroups) as
> well as the general system state after boot, the amount of text backed
> by 4K and 16K folios reduces and the amount backed by 64K folios
> increases significantly. And the amount of text that is contpte-mapped
> significantly increases (see last row).
>
> And this is reflected in performance improvement. "(I)" indicates a
> statistically significant improvement. Note TPS and Reqs/sec are rates
> so bigger is better, ms is time so smaller is better:
>
> +-------------+-------------------------------------------+------------+
> | Benchmark | Result Class | Improvemnt |
> +=============+===========================================+============+
> | pts/pgbench | Scale: 1 Clients: 1 RO (TPS) | (I) 3.47% |
> | | Scale: 1 Clients: 1 RO - Latency (ms) | -2.88% |
> | | Scale: 1 Clients: 250 RO (TPS) | (I) 5.02% |
> | | Scale: 1 Clients: 250 RO - Latency (ms) | (I) -4.79% |
> | | Scale: 1 Clients: 1000 RO (TPS) | (I) 6.16% |
> | | Scale: 1 Clients: 1000 RO - Latency (ms) | (I) -5.82% |
> | | Scale: 100 Clients: 1 RO (TPS) | 2.51% |
> | | Scale: 100 Clients: 1 RO - Latency (ms) | -3.51% |
> | | Scale: 100 Clients: 250 RO (TPS) | (I) 4.75% |
> | | Scale: 100 Clients: 250 RO - Latency (ms) | (I) -4.44% |
> | | Scale: 100 Clients: 1000 RO (TPS) | (I) 6.34% |
> | | Scale: 100 Clients: 1000 RO - Latency (ms)| (I) -5.95% |
> +-------------+-------------------------------------------+------------+
> | pts/redis | Test: GET Connections: 50 (Reqs/sec) | (I) 3.20% |
> | | Test: GET Connections: 1000 (Reqs/sec) | (I) 2.55% |
> | | Test: LPOP Connections: 50 (Reqs/sec) | (I) 4.59% |
> | | Test: LPOP Connections: 1000 (Reqs/sec) | (I) 4.81% |
> | | Test: LPUSH Connections: 50 (Reqs/sec) | (I) 5.31% |
> | | Test: LPUSH Connections: 1000 (Reqs/sec) | (I) 4.36% |
> | | Test: SADD Connections: 50 (Reqs/sec) | (I) 2.64% |
> | | Test: SADD Connections: 1000 (Reqs/sec) | (I) 4.15% |
> | | Test: SET Connections: 50 (Reqs/sec) | (I) 3.11% |
> | | Test: SET Connections: 1000 (Reqs/sec) | (I) 3.36% |
> +-------------+-------------------------------------------+------------+
>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Acked-by: Will Deacon <will@kernel.org>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Tao Xu <tao.xu@arm.com>
Observed similar performance improvements and iTLB benefits with mysql
sysbench on an Azure Cobalt-100 arm64 system.
The table below shows that more of the .text of the 52MiB mysqld binary on
XFS is now backed by 64K folios, and by 128K folios when p_align is
increased from the default 64K to 2M in the ELF header (a small sketch for
inspecting p_align follows the table):
+----------------------+-------+-------+-------+
|                      |         mysql         |
+----------------------+-------+-------+-------+
|                      |before |     after     |
+----------------------+-------+-------+-------+
|                      |       |    p_align    |
|                      |       |  64k  |  2M   |
+----------------------+-------+-------+-------+
| thp-aligned-8kB      |   1%  |   0%  |   0%  |
| thp-aligned-16kB     |  53%  |   0%  |   0%  |
| thp-aligned-32kB     |   0%  |   0%  |   0%  |
| thp-aligned-64kB     |   3%  |  72%  |   1%  |
| thp-aligned-128kB    |   0%  |   0%  |  67%  |
| thp-partial          |   0%  |   0%  |   5%  |
+----------------------+-------+-------+-------+
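For reference, p_align can be read straight from the program headers; the
following minimal sketch (not part of the test setup, just the standard
<elf.h> definitions, with error handling kept minimal and 64-bit ELF assumed)
prints the alignment of each PT_LOAD segment so the 64K vs 2M cases above can
be distinguished:

#include <elf.h>
#include <stdio.h>

int main(int argc, char **argv)
{
        Elf64_Ehdr eh;
        Elf64_Phdr ph;
        FILE *f;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <elf-file>\n", argv[0]);
                return 1;
        }
        f = fopen(argv[1], "rb");
        if (!f || fread(&eh, sizeof(eh), 1, f) != 1) {
                perror(argv[1]);
                return 1;
        }
        for (unsigned int i = 0; i < eh.e_phnum; i++) {
                if (fseek(f, eh.e_phoff + (long)i * eh.e_phentsize, SEEK_SET) ||
                    fread(&ph, sizeof(ph), 1, f) != 1) {
                        perror("program header");
                        return 1;
                }
                /* Executable LOAD segments carry the text's p_align */
                if (ph.p_type == PT_LOAD)
                        printf("LOAD vaddr=0x%llx align=0x%llx%s\n",
                               (unsigned long long)ph.p_vaddr,
                               (unsigned long long)ph.p_align,
                               (ph.p_flags & PF_X) ? " (exec)" : "");
        }
        fclose(f);
        return 0;
}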
The resulting performance improvement is +5.65% in TPS throughput and
-6.06% in average latency, using 16 local sysbench clients against mysqld
running on 32 cores with a 12GiB innodb_buffer_pool_size. Corresponding
iTLB effectiveness benefits can also be observed in the perf PMU metrics:
+-------------+--------------------------+------------+
| Benchmark   | Result                   | Improvement|
+=============+==========================+============+
| sysbench    | TPS                      |      5.65% |
|             | Latency (ms)             |     -6.06% |
+-------------+--------------------------+------------+
| perf PMU    | l1i_tlb (M/sec)          |     +1.11% |
|             | l2d_tlb (M/sec)          |    -13.01% |
|             | l1i_tlb_refill (K/sec)   |    -46.50% |
|             | itlb_walk (K/sec)        |    -64.03% |
|             | l2d_tlb_refill (K/sec)   |    -33.90% |
|             | l1d_tlb (M/sec)          |     +1.24% |
|             | l1d_tlb_refill (M/sec)   |     +2.23% |
|             | dtlb_walk (K/sec)        |    -20.69% |
|             | IPC                      |     +1.85% |
+-------------+--------------------------+------------+
> ---
> arch/arm64/include/asm/pgtable.h | 8 ++++++
> include/linux/pgtable.h | 11 ++++++++
> mm/filemap.c | 47 ++++++++++++++++++++++++++------
> 3 files changed, 57 insertions(+), 9 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 88db8a0c0b37..7a7dfdce14b8 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1643,6 +1643,14 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
> */
> #define arch_wants_old_prefaulted_pte cpu_has_hw_af
>
> +/*
> + * Request exec memory is read into pagecache in at least 64K folios. This size
> + * can be contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB
> + * entry), and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base
> + * pages are in use.
> + */
> +#define exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
> +
> static inline bool pud_sect_supported(void)
> {
> return PAGE_SIZE == SZ_4K;
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 0b6e1f781d86..e4a3895c043b 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -456,6 +456,17 @@ static inline bool arch_has_hw_pte_young(void)
> }
> #endif
>
> +#ifndef exec_folio_order
> +/*
> + * Returns preferred minimum folio order for executable file-backed memory. Must
> + * be in range [0, PMD_ORDER). Default to order-0.
> + */
> +static inline unsigned int exec_folio_order(void)
> +{
> + return 0;
> +}
> +#endif
> +
> #ifndef arch_check_zapped_pte
> static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
> pte_t pte)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4b5c8d69f04c..93fbc2ef232a 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3238,8 +3238,11 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> }
> #endif
>
> - /* If we don't want any read-ahead, don't bother */
> - if (vm_flags & VM_RAND_READ)
> + /*
> + * If we don't want any read-ahead, don't bother. VM_EXEC case below is
> + * already intended for random access.
> + */
> + if ((vm_flags & (VM_RAND_READ | VM_EXEC)) == VM_RAND_READ)
> return fpin;
> if (!ra->ra_pages)
> return fpin;
> @@ -3262,14 +3265,40 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> if (mmap_miss > MMAP_LOTSAMISS)
> return fpin;
>
> - /*
> - * mmap read-around
> - */
> fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> - ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
> - ra->size = ra->ra_pages;
> - ra->async_size = ra->ra_pages / 4;
> - ra->order = 0;
> + if (vm_flags & VM_EXEC) {
> + /*
> + * Allow arch to request a preferred minimum folio order for
> + * executable memory. This can often be beneficial to
> + * performance if (e.g.) arm64 can contpte-map the folio.
> + * Executable memory rarely benefits from readahead, due to its
> + * random access nature, so set async_size to 0.
> + *
> + * Limit to the boundaries of the VMA to avoid reading in any
> + * pad that might exist between sections, which would be a waste
> + * of memory.
> + */
> + struct vm_area_struct *vma = vmf->vma;
> + unsigned long start = vma->vm_pgoff;
> + unsigned long end = start + ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT);
> + unsigned long ra_end;
> +
> + ra->order = exec_folio_order();
> + ra->start = round_down(vmf->pgoff, 1UL << ra->order);
> + ra->start = max(ra->start, start);
> + ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order);
> + ra_end = min(ra_end, end);
> + ra->size = ra_end - ra->start;
> + ra->async_size = 0;
> + } else {
> + /*
> + * mmap read-around
> + */
> + ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
> + ra->size = ra->ra_pages;
> + ra->async_size = ra->ra_pages / 4;
> + ra->order = 0;
> + }
> ractl._index = ra->start;
> page_cache_ra_order(&ractl, ra);
> return fpin;
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v5 5/5] mm/filemap: Allow arch to request folio size for exec memory
2025-07-11 15:41 ` Tao Xu
@ 2025-07-14 8:19 ` Ryan Roberts
0 siblings, 0 replies; 11+ messages in thread
From: Ryan Roberts @ 2025-07-14 8:19 UTC (permalink / raw)
To: Tao Xu, Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On 11/07/2025 16:41, Tao Xu wrote:
> On 09/06/2025 10:27, Ryan Roberts wrote:
>> Change the readahead config so that if it is being requested for an
>> executable mapping, do a synchronous read into a set of folios with an
>> arch-specified order and in a naturally aligned manner. We no longer
>> center the read on the faulting page but simply align it down to the
>> previous natural boundary. Additionally, we don't bother with an
>> asynchronous part.
>>
>> On arm64 if memory is physically contiguous and naturally aligned to the
>> "contpte" size, we can use contpte mappings, which improves utilization
>> of the TLB. When paired with the "multi-size THP" feature, this works
>> well to reduce dTLB pressure. However iTLB pressure is still high due to
>> executable mappings having a low likelihood of being in the required
>> folio size and mapping alignment, even when the filesystem supports
>> readahead into large folios (e.g. XFS).
>>
>> The reason for the low likelihood is that the current readahead
>> algorithm starts with an order-0 folio and increases the folio order by
>> 2 every time the readahead mark is hit. But most executable memory tends
>> to be accessed randomly and so the readahead mark is rarely hit and most
>> executable folios remain order-0.
>>
>> So let's special-case the read(ahead) logic for executable mappings. The
>> trade-off is performance improvement (due to more efficient storage of
>> the translations in iTLB) vs potential for making reclaim more difficult
>> (due to the folios being larger so if a part of the folio is hot the
>> whole thing is considered hot). But executable memory is a small portion
>> of the overall system memory so I doubt this will even register from a
>> reclaim perspective.
>>
>> I've chosen 64K folio size for arm64 which benefits both the 4K and 16K
>> base page size configs. Crucially the same amount of data is still read
>> (usually 128K) so I'm not expecting any read amplification issues. I
>> don't anticipate any write amplification because text is always RO.
>>
>> Note that the text region of an ELF file could be populated into the
>> page cache for other reasons than taking a fault in a mmapped area. The
>> most common case is due to the loader read()ing the header which can be
>> shared with the beginning of text. So some text will still remain in
>> small folios, but this simple, best effort change provides good
>> performance improvements as is.
>>
>> Confine this special-case approach to the bounds of the VMA. This
>> prevents wasting memory for any padding that might exist in the file
>> between sections. Previously the padding would have been contained in
>> order-0 folios and would be easy to reclaim. But now it would be part of
>> a larger folio so more difficult to reclaim. Solve this by simply not
>> reading it into memory in the first place.
>>
>> Benchmarking
>> ============
>>
>> The below shows pgbench and redis benchmarks on Graviton3 arm64 system.
>>
>> First, confirmation that this patch causes more text to be contained in
>> 64K folios:
>>
>> +----------------------+---------------+---------------+---------------+
>> | File-backed folios by| system boot | pgbench | redis |
>> | size as percentage of+-------+-------+-------+-------+-------+-------+
>> | all mapped text mem |before | after |before | after |before | after |
>> +======================+=======+=======+=======+=======+=======+=======+
>> | base-page-4kB | 78% | 30% | 78% | 11% | 73% | 14% |
>> | thp-aligned-8kB | 1% | 0% | 0% | 0% | 1% | 0% |
>> | thp-aligned-16kB | 17% | 4% | 17% | 3% | 20% | 4% |
>> | thp-aligned-32kB | 1% | 1% | 1% | 2% | 1% | 1% |
>> | thp-aligned-64kB | 3% | 63% | 3% | 81% | 4% | 77% |
>> | thp-aligned-128kB | 0% | 1% | 1% | 1% | 1% | 2% |
>> | thp-unaligned-64kB | 0% | 0% | 0% | 1% | 0% | 1% |
>> | thp-unaligned-128kB | 0% | 1% | 0% | 0% | 0% | 0% |
>> | thp-partial | 0% | 0% | 0% | 1% | 0% | 1% |
>> +----------------------+-------+-------+-------+-------+-------+-------+
>> | cont-aligned-64kB | 4% | 65% | 4% | 83% | 6% | 79% |
>> +----------------------+-------+-------+-------+-------+-------+-------+
>>
>> The above shows that for both workloads (each isolated with cgroups) as
>> well as the general system state after boot, the amount of text backed
>> by 4K and 16K folios reduces and the amount backed by 64K folios
>> increases significantly. And the amount of text that is contpte-mapped
>> significantly increases (see last row).
>>
>> And this is reflected in performance improvement. "(I)" indicates a
>> statistically significant improvement. Note TPS and Reqs/sec are rates
>> so bigger is better, ms is time so smaller is better:
>>
>> +-------------+-------------------------------------------+------------+
>> | Benchmark | Result Class | Improvemnt |
>> +=============+===========================================+============+
>> | pts/pgbench | Scale: 1 Clients: 1 RO (TPS) | (I) 3.47% |
>> | | Scale: 1 Clients: 1 RO - Latency (ms) | -2.88% |
>> | | Scale: 1 Clients: 250 RO (TPS) | (I) 5.02% |
>> | | Scale: 1 Clients: 250 RO - Latency (ms) | (I) -4.79% |
>> | | Scale: 1 Clients: 1000 RO (TPS) | (I) 6.16% |
>> | | Scale: 1 Clients: 1000 RO - Latency (ms) | (I) -5.82% |
>> | | Scale: 100 Clients: 1 RO (TPS) | 2.51% |
>> | | Scale: 100 Clients: 1 RO - Latency (ms) | -3.51% |
>> | | Scale: 100 Clients: 250 RO (TPS) | (I) 4.75% |
>> | | Scale: 100 Clients: 250 RO - Latency (ms) | (I) -4.44% |
>> | | Scale: 100 Clients: 1000 RO (TPS) | (I) 6.34% |
>> | | Scale: 100 Clients: 1000 RO - Latency (ms)| (I) -5.95% |
>> +-------------+-------------------------------------------+------------+
>> | pts/redis | Test: GET Connections: 50 (Reqs/sec) | (I) 3.20% |
>> | | Test: GET Connections: 1000 (Reqs/sec) | (I) 2.55% |
>> | | Test: LPOP Connections: 50 (Reqs/sec) | (I) 4.59% |
>> | | Test: LPOP Connections: 1000 (Reqs/sec) | (I) 4.81% |
>> | | Test: LPUSH Connections: 50 (Reqs/sec) | (I) 5.31% |
>> | | Test: LPUSH Connections: 1000 (Reqs/sec) | (I) 4.36% |
>> | | Test: SADD Connections: 50 (Reqs/sec) | (I) 2.64% |
>> | | Test: SADD Connections: 1000 (Reqs/sec) | (I) 4.15% |
>> | | Test: SET Connections: 50 (Reqs/sec) | (I) 3.11% |
>> | | Test: SET Connections: 1000 (Reqs/sec) | (I) 3.36% |
>> +-------------+-------------------------------------------+------------+
>>
>> Reviewed-by: Jan Kara <jack@suse.cz>
>> Acked-by: Will Deacon <will@kernel.org>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>
> Tested-by: Tao Xu <tao.xu@arm.com>
Thanks for testing! Unfortunately I think you were a day late, though; this
patch is now in mm-stable, so it's too late to add the tag.
Thanks,
Ryan
>
> Observed similar performance optimization and iTLB benefits in mysql sysbench on
> Azure Cobalt-100 arm64 system.
>
> Below shows more .text sections are now backed by 64K folios for the 52MiB
> mysqld binary file in XFS, and more in 128K folios when increasing the p_align
> from default 64k to 2M in ELF header:
>
> +----------------------+-------+-------+-------+
> | | mysql |
> +----------------------+-------+-------+-------+
> | |before | after |
> +----------------------+-------+-------+-------+
> | | | p_align |
> | | | 64k | 2M |
> +----------------------+-------+-------+-------+
> | thp-aligned-8kB | 1% | 0% | 0% |
> | thp-aligned-16kB | 53% | 0% | 0% |
> | thp-aligned-32kB | 0% | 0% | 0% |
> | thp-aligned-64kB | 3% | 72% | 1% |
> | thp-aligned-128kB | 0% | 0% | 67% |
> | thp-partial | 0% | 0% | 5% |
> +----------------------+-------+-------+-------+
>
> The resulting performance improvment is +5.65% in TPS throughput and -6.06% in
> average latency, using 16 local sysbench clients to the mysqld running on 32
> cores and 12GiB innodb_buffer_pool_size. Corresponding iTLB effectiveness
> benefits can also be observed from perf PMU metrics:
>
> +-------------+--------------------------+------------+
> | Benchmark | Result | Improvemnt |
> +=============+==========================+============+
> | sysbench | TPS | 5.65% |
> | | Latency (ms)| -6.06% |
> +-------------+--------------------------+------------+
> | perf PMU | l1i_tlb (M/sec)| +1.11% |
> | | l2d_tlb (M/sec)| -13.01% |
> | | l1i_tlb_refill (K/sec)| -46.50% |
> | | itlb_walk (K/sec)| -64.03% |
> | | l2d_tlb_refill (K/sec)| -33.90% |
> | | l1d_tlb (M/sec)| +1.24% |
> | | l1d_tlb_refill (M/sec)| +2.23% |
> | | dtlb_walk (K/sec)| -20.69% |
> | | IPC | +1.85% |
> +-------------+--------------------------+------------+
>
>> ---
>> arch/arm64/include/asm/pgtable.h | 8 ++++++
>> include/linux/pgtable.h | 11 ++++++++
>> mm/filemap.c | 47 ++++++++++++++++++++++++++------
>> 3 files changed, 57 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 88db8a0c0b37..7a7dfdce14b8 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1643,6 +1643,14 @@ static inline void update_mmu_cache_range(struct
>> vm_fault *vmf,
>> */
>> #define arch_wants_old_prefaulted_pte cpu_has_hw_af
>> +/*
>> + * Request exec memory is read into pagecache in at least 64K folios. This size
>> + * can be contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB
>> + * entry), and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base
>> + * pages are in use.
>> + */
>> +#define exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
>> +
>> static inline bool pud_sect_supported(void)
>> {
>> return PAGE_SIZE == SZ_4K;
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 0b6e1f781d86..e4a3895c043b 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -456,6 +456,17 @@ static inline bool arch_has_hw_pte_young(void)
>> }
>> #endif
>> +#ifndef exec_folio_order
>> +/*
>> + * Returns preferred minimum folio order for executable file-backed memory. Must
>> + * be in range [0, PMD_ORDER). Default to order-0.
>> + */
>> +static inline unsigned int exec_folio_order(void)
>> +{
>> + return 0;
>> +}
>> +#endif
>> +
>> #ifndef arch_check_zapped_pte
>> static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
>> pte_t pte)
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 4b5c8d69f04c..93fbc2ef232a 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -3238,8 +3238,11 @@ static struct file *do_sync_mmap_readahead(struct
>> vm_fault *vmf)
>> }
>> #endif
>> - /* If we don't want any read-ahead, don't bother */
>> - if (vm_flags & VM_RAND_READ)
>> + /*
>> + * If we don't want any read-ahead, don't bother. VM_EXEC case below is
>> + * already intended for random access.
>> + */
>> + if ((vm_flags & (VM_RAND_READ | VM_EXEC)) == VM_RAND_READ)
>> return fpin;
>> if (!ra->ra_pages)
>> return fpin;
>> @@ -3262,14 +3265,40 @@ static struct file *do_sync_mmap_readahead(struct
>> vm_fault *vmf)
>> if (mmap_miss > MMAP_LOTSAMISS)
>> return fpin;
>> - /*
>> - * mmap read-around
>> - */
>> fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>> - ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
>> - ra->size = ra->ra_pages;
>> - ra->async_size = ra->ra_pages / 4;
>> - ra->order = 0;
>> + if (vm_flags & VM_EXEC) {
>> + /*
>> + * Allow arch to request a preferred minimum folio order for
>> + * executable memory. This can often be beneficial to
>> + * performance if (e.g.) arm64 can contpte-map the folio.
>> + * Executable memory rarely benefits from readahead, due to its
>> + * random access nature, so set async_size to 0.
>> + *
>> + * Limit to the boundaries of the VMA to avoid reading in any
>> + * pad that might exist between sections, which would be a waste
>> + * of memory.
>> + */
>> + struct vm_area_struct *vma = vmf->vma;
>> + unsigned long start = vma->vm_pgoff;
>> + unsigned long end = start + ((vma->vm_end - vma->vm_start) >>
>> PAGE_SHIFT);
>> + unsigned long ra_end;
>> +
>> + ra->order = exec_folio_order();
>> + ra->start = round_down(vmf->pgoff, 1UL << ra->order);
>> + ra->start = max(ra->start, start);
>> + ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order);
>> + ra_end = min(ra_end, end);
>> + ra->size = ra_end - ra->start;
>> + ra->async_size = 0;
>> + } else {
>> + /*
>> + * mmap read-around
>> + */
>> + ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
>> + ra->size = ra->ra_pages;
>> + ra->async_size = ra->ra_pages / 4;
>> + ra->order = 0;
>> + }
>> ractl._index = ra->start;
>> page_cache_ra_order(&ractl, ra);
>> return fpin;
>
^ permalink raw reply [flat|nested] 11+ messages in thread