* [PATCH v5 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
2025-06-09 9:27 [PATCH v5 0/5] Readahead tweaks for larger folios Ryan Roberts
@ 2025-06-09 9:27 ` Ryan Roberts
2025-06-09 9:27 ` [PATCH v5 2/5] mm/readahead: Terminate async readahead on natural boundary Ryan Roberts
` (3 subsequent siblings)
4 siblings, 0 replies; 11+ messages in thread
From: Ryan Roberts @ 2025-06-09 9:27 UTC (permalink / raw)
To: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-fsdevel,
linux-mm, Chaitanya S Prakash
page_cache_ra_order() takes a parameter called new_order, which is
intended to express the preferred order of the folios that will be
allocated for the readahead operation. Most callers indeed call this
with their preferred new order. But page_cache_async_ra() calls it with
the preferred order of the previous readahead request (actually the
order of the folio that had the readahead marker, which may be smaller
when alignment comes into play).
And despite the parameter name, page_cache_ra_order() always treats it
as the old order, adding 2 to it on entry. As a result, a cold readahead
always starts with order-2 folios.
Let's fix this behaviour by always passing in the *new* order.
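In other words, the +2 ramp moves out of the callee and into the async
caller; the fragments below are pulled straight from the diff at the end
of this mail:
  /* Before: page_cache_ra_order() silently bumped whatever it was given */
  if (new_order < mapping_max_folio_order(mapping))
          new_order += 2;
  /* After: page_cache_async_ra() ramps up explicitly; the callee only clamps */
  order += 2;
  ractl->_index = ra->start;
  page_cache_ra_order(ractl, ra, order);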
Worked example:
Prior to the change, mmapping an 8MB file and touching each page
sequentially resulted in the following, where we start with order-2
folios for the first 128K, then ramp up to order-4 for the next 128K,
then get clamped to order-5 for the rest of the file because ra_pages is
limited to 128K:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
----- ---------- ---------- --------- ------- ------- ----- -----
FOLIO 0x00000000 0x00004000 16384 0 4 4 2
FOLIO 0x00004000 0x00008000 16384 4 8 4 2
FOLIO 0x00008000 0x0000c000 16384 8 12 4 2
FOLIO 0x0000c000 0x00010000 16384 12 16 4 2
FOLIO 0x00010000 0x00014000 16384 16 20 4 2
FOLIO 0x00014000 0x00018000 16384 20 24 4 2
FOLIO 0x00018000 0x0001c000 16384 24 28 4 2
FOLIO 0x0001c000 0x00020000 16384 28 32 4 2
FOLIO 0x00020000 0x00030000 65536 32 48 16 4
FOLIO 0x00030000 0x00040000 65536 48 64 16 4
FOLIO 0x00040000 0x00060000 131072 64 96 32 5
FOLIO 0x00060000 0x00080000 131072 96 128 32 5
FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
...
After the change, the same operation results in the first 128K being
order-0, then we start ramping up to order-2, -4, and finally get
clamped at order-5:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
----- ---------- ---------- --------- ------- ------- ----- -----
FOLIO 0x00000000 0x00001000 4096 0 1 1 0
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00024000 16384 32 36 4 2
FOLIO 0x00024000 0x00028000 16384 36 40 4 2
FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
FOLIO 0x00030000 0x00034000 16384 48 52 4 2
FOLIO 0x00034000 0x00038000 16384 52 56 4 2
FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
FOLIO 0x00040000 0x00050000 65536 64 80 16 4
FOLIO 0x00050000 0x00060000 65536 80 96 16 4
FOLIO 0x00060000 0x00080000 131072 96 128 32 5
FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
...
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
mm/readahead.c | 11 ++---------
1 file changed, 2 insertions(+), 9 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index 20d36d6b055e..973de2551efe 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -468,20 +468,12 @@ void page_cache_ra_order(struct readahead_control *ractl,
unsigned int nofs;
int err = 0;
gfp_t gfp = readahead_gfp_mask(mapping);
- unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
- /*
- * Fallback when size < min_nrpages as each folio should be
- * at least min_nrpages anyway.
- */
- if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
+ if (!mapping_large_folio_support(mapping))
goto fallback;
limit = min(limit, index + ra->size - 1);
- if (new_order < mapping_max_folio_order(mapping))
- new_order += 2;
-
new_order = min(mapping_max_folio_order(mapping), new_order);
new_order = min_t(unsigned int, new_order, ilog2(ra->size));
new_order = max(new_order, min_order);
@@ -683,6 +675,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
ra->size = get_next_ra_size(ra, max_pages);
ra->async_size = ra->size;
readit:
+ order += 2;
ractl->_index = ra->start;
page_cache_ra_order(ractl, ra, order);
}
--
2.43.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH v5 2/5] mm/readahead: Terminate async readahead on natural boundary
2025-06-09 9:27 [PATCH v5 0/5] Readahead tweaks for larger folios Ryan Roberts
2025-06-09 9:27 ` [PATCH v5 1/5] mm/readahead: Honour new_order in page_cache_ra_order() Ryan Roberts
@ 2025-06-09 9:27 ` Ryan Roberts
2025-06-09 9:27 ` [PATCH v5 3/5] mm/readahead: Make space in struct file_ra_state Ryan Roberts
` (2 subsequent siblings)
4 siblings, 0 replies; 11+ messages in thread
From: Ryan Roberts @ 2025-06-09 9:27 UTC (permalink / raw)
To: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-fsdevel,
linux-mm
Previously, asynchronous readahead would read ra_pages (usually 128K)
directly after the end of the synchronous readahead. Given that the
synchronous readahead portion had no alignment guarantees (beyond page
boundaries), it was possible (and likely) that the end of the initial
128K region would not fall on a natural boundary for the folio size
being used. Therefore smaller folios were used to align down to the
required boundary, both at the end of the previous readahead block and
at the start of the new one.
In the worst cases, this can result in never properly ramping up the
folio size, and instead getting stuck oscillating between order-0, -1
and -2 folios. The next readahead will try to use folios whose order is
+2 bigger than the folio that had the readahead marker. But because of
the alignment requirements, that folio (the first one in the readahead
block) can end up being order-0 in some cases.
Two modifications are needed to solve this issue:
1) Calculate the readahead size so the end is aligned to a folio
boundary. This prevents needing to allocate small folios to align
down at the end of the window and fixes the oscillation problem (see
the sketch below).
2) Remember the "preferred folio order" in the ra state instead of
inferring it from the folio with the readahead marker. This solves
the slow ramp up problem (discussed in a subsequent patch).
This patch addresses (1) only. A subsequent patch will address (2).
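For reference, modification (1) boils down to the following few lines in
page_cache_async_ra() (same logic as the diff at the end of this mail;
max_pages is the ra_pages limit):
  order += 2;                                    /* preferred order for the next window */
  align = 1UL << min(order, ffs(max_pages) - 1); /* never align beyond ra_pages */
  end = ra->start + ra->size;
  aligned_end = round_down(end, align);          /* pull the end back to a natural boundary */
  if (aligned_end > ra->start)                   /* only trim if a non-empty window remains */
          ra->size -= end - aligned_end;
  ra->async_size = ra->size;
Plugging in the numbers from the worked example below: ra->start = 33,
ra->size = 32 and order = 2 give align = 4, end = 65 and aligned_end =
64, so one page is trimmed and the window now ends on a natural order-2
boundary at page 64.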
Worked example:
The following shows the previous pathological behaviour when the initial
synchronous readahead is unaligned. We start reading at page 17 in the
file and read sequentially from there. I'm showing a dump of the pages
in the page cache just after we read the first page of the folio with
the readahead marker.
Initially there are no pages in the page cache:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00800000 8388608 0 2048 2048
Then we access page 17, causing synchronous read-around of 128K with a
readahead marker set up at page 25. So far, all as expected:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 Y
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
HOLE 0x00021000 0x00800000 8253440 33 2048 2015
Now access pages 18-25 inclusive. This causes an asynchronous 128K
readahead starting at page 33. But since we are unaligned, even though
the preferred folio order is 2, the first folio in this batch (the one
with the new readahead marker) is order-0:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
FOLIO 0x00021000 0x00022000 4096 33 34 1 0 Y
FOLIO 0x00022000 0x00024000 8192 34 36 2 1
FOLIO 0x00024000 0x00028000 16384 36 40 4 2
FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
FOLIO 0x00030000 0x00034000 16384 48 52 4 2
FOLIO 0x00034000 0x00038000 16384 52 56 4 2
FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
FOLIO 0x00040000 0x00041000 4096 64 65 1 0
HOLE 0x00041000 0x00800000 8122368 65 2048 1983
Which means that when we now read pages 26-33 and readahead is kicked
off again, the new preferred order is 2 (0 + 2), not 4 as we intended:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
FOLIO 0x00021000 0x00022000 4096 33 34 1 0
FOLIO 0x00022000 0x00024000 8192 34 36 2 1
FOLIO 0x00024000 0x00028000 16384 36 40 4 2
FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
FOLIO 0x00030000 0x00034000 16384 48 52 4 2
FOLIO 0x00034000 0x00038000 16384 52 56 4 2
FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
FOLIO 0x00040000 0x00041000 4096 64 65 1 0
FOLIO 0x00041000 0x00042000 4096 65 66 1 0 Y
FOLIO 0x00042000 0x00044000 8192 66 68 2 1
FOLIO 0x00044000 0x00048000 16384 68 72 4 2
FOLIO 0x00048000 0x0004c000 16384 72 76 4 2
FOLIO 0x0004c000 0x00050000 16384 76 80 4 2
FOLIO 0x00050000 0x00054000 16384 80 84 4 2
FOLIO 0x00054000 0x00058000 16384 84 88 4 2
FOLIO 0x00058000 0x0005c000 16384 88 92 4 2
FOLIO 0x0005c000 0x00060000 16384 92 96 4 2
FOLIO 0x00060000 0x00061000 4096 96 97 1 0
HOLE 0x00061000 0x00800000 7991296 97 2048 1951
This cycle of ramping up from order-0, with smaller orders at the edges
for alignment, continues all the way to the end of the file (not shown).
After the change, we round down the end boundary to the order boundary
so we no longer get stuck in the cycle and can ramp up the order over
time. Note that the rate of the ramp up is still slower than we would
like; that is fixed in the next patch. Here we are touching pages 17-256
sequentially:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
FOLIO 0x00021000 0x00022000 4096 33 34 1 0
FOLIO 0x00022000 0x00024000 8192 34 36 2 1
FOLIO 0x00024000 0x00028000 16384 36 40 4 2
FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
FOLIO 0x00030000 0x00034000 16384 48 52 4 2
FOLIO 0x00034000 0x00038000 16384 52 56 4 2
FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
FOLIO 0x00040000 0x00044000 16384 64 68 4 2
FOLIO 0x00044000 0x00048000 16384 68 72 4 2
FOLIO 0x00048000 0x0004c000 16384 72 76 4 2
FOLIO 0x0004c000 0x00050000 16384 76 80 4 2
FOLIO 0x00050000 0x00054000 16384 80 84 4 2
FOLIO 0x00054000 0x00058000 16384 84 88 4 2
FOLIO 0x00058000 0x0005c000 16384 88 92 4 2
FOLIO 0x0005c000 0x00060000 16384 92 96 4 2
FOLIO 0x00060000 0x00070000 65536 96 112 16 4
FOLIO 0x00070000 0x00080000 65536 112 128 16 4
FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
FOLIO 0x000e0000 0x00100000 131072 224 256 32 5
FOLIO 0x00100000 0x00120000 131072 256 288 32 5
FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y
HOLE 0x00140000 0x00800000 7077888 320 2048 1728
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
mm/readahead.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index 973de2551efe..87be20ae00d0 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -620,7 +620,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
unsigned long max_pages;
struct file_ra_state *ra = ractl->ra;
pgoff_t index = readahead_index(ractl);
- pgoff_t expected, start;
+ pgoff_t expected, start, end, aligned_end, align;
unsigned int order = folio_order(folio);
/* no readahead */
@@ -652,7 +652,6 @@ void page_cache_async_ra(struct readahead_control *ractl,
* the readahead window.
*/
ra->size = max(ra->size, get_next_ra_size(ra, max_pages));
- ra->async_size = ra->size;
goto readit;
}
@@ -673,9 +672,14 @@ void page_cache_async_ra(struct readahead_control *ractl,
ra->size = start - index; /* old async_size */
ra->size += req_count;
ra->size = get_next_ra_size(ra, max_pages);
- ra->async_size = ra->size;
readit:
order += 2;
+ align = 1UL << min(order, ffs(max_pages) - 1);
+ end = ra->start + ra->size;
+ aligned_end = round_down(end, align);
+ if (aligned_end > ra->start)
+ ra->size -= end - aligned_end;
+ ra->async_size = ra->size;
ractl->_index = ra->start;
page_cache_ra_order(ractl, ra, order);
}
--
2.43.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH v5 3/5] mm/readahead: Make space in struct file_ra_state
2025-06-09 9:27 [PATCH v5 0/5] Readahead tweaks for larger folios Ryan Roberts
2025-06-09 9:27 ` [PATCH v5 1/5] mm/readahead: Honour new_order in page_cache_ra_order() Ryan Roberts
2025-06-09 9:27 ` [PATCH v5 2/5] mm/readahead: Terminate async readahead on natural boundary Ryan Roberts
@ 2025-06-09 9:27 ` Ryan Roberts
2025-06-11 9:57 ` Christian Brauner
2025-06-09 9:27 ` [PATCH v5 4/5] mm/readahead: Store folio order " Ryan Roberts
2025-06-09 9:27 ` [PATCH v5 5/5] mm/filemap: Allow arch to request folio size for exec memory Ryan Roberts
4 siblings, 1 reply; 11+ messages in thread
From: Ryan Roberts @ 2025-06-09 9:27 UTC (permalink / raw)
To: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-fsdevel,
linux-mm
We need to be able to store the preferred folio order associated with a
readahead request in the struct file_ra_state so that we can more
accurately increase the order across subsequent readahead requests. But
struct file_ra_state is per-struct file, so we don't really want to
increase its size.
mmap_miss is currently 32 bits but it is only counted up to 10 *
MMAP_LOTSAMISS, which is currently defined as 1000. So 16 bits should be
plenty. Redefine it as unsigned short, making room for order as an
unsigned short in a follow-up commit.
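For context, the counter already saturates in do_sync_mmap_readahead()
(roughly the following, from mm/filemap.c), so the largest value it can
ever hold is 10 * MMAP_LOTSAMISS, far below the 65535 that an unsigned
short can represent:
  /* Avoid banging the cache line if not needed */
  mmap_miss = READ_ONCE(ra->mmap_miss);
  if (mmap_miss < MMAP_LOTSAMISS * 10)
          WRITE_ONCE(ra->mmap_miss, ++mmap_miss);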
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
include/linux/fs.h | 2 +-
mm/filemap.c | 11 ++++++-----
2 files changed, 7 insertions(+), 6 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 05abdabe9db7..87e7d5790e43 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1052,7 +1052,7 @@ struct file_ra_state {
unsigned int size;
unsigned int async_size;
unsigned int ra_pages;
- unsigned int mmap_miss;
+ unsigned short mmap_miss;
loff_t prev_pos;
};
diff --git a/mm/filemap.c b/mm/filemap.c
index a6459874bb2a..7bb4ffca8487 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3217,7 +3217,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
struct file *fpin = NULL;
unsigned long vm_flags = vmf->vma->vm_flags;
- unsigned int mmap_miss;
+ unsigned short mmap_miss;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/* Use the readahead code, even if readahead is disabled */
@@ -3285,7 +3285,7 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
struct file_ra_state *ra = &file->f_ra;
DEFINE_READAHEAD(ractl, file, ra, file->f_mapping, vmf->pgoff);
struct file *fpin = NULL;
- unsigned int mmap_miss;
+ unsigned short mmap_miss;
/* If we don't want any read-ahead, don't bother */
if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
@@ -3605,7 +3605,7 @@ static struct folio *next_uptodate_folio(struct xa_state *xas,
static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
struct folio *folio, unsigned long start,
unsigned long addr, unsigned int nr_pages,
- unsigned long *rss, unsigned int *mmap_miss)
+ unsigned long *rss, unsigned short *mmap_miss)
{
vm_fault_t ret = 0;
struct page *page = folio_page(folio, start);
@@ -3667,7 +3667,7 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
struct folio *folio, unsigned long addr,
- unsigned long *rss, unsigned int *mmap_miss)
+ unsigned long *rss, unsigned short *mmap_miss)
{
vm_fault_t ret = 0;
struct page *page = &folio->page;
@@ -3709,7 +3709,8 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
struct folio *folio;
vm_fault_t ret = 0;
unsigned long rss = 0;
- unsigned int nr_pages = 0, mmap_miss = 0, mmap_miss_saved, folio_type;
+ unsigned int nr_pages = 0, folio_type;
+ unsigned short mmap_miss = 0, mmap_miss_saved;
rcu_read_lock();
folio = next_uptodate_folio(&xas, mapping, end_pgoff);
--
2.43.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH v5 3/5] mm/readahead: Make space in struct file_ra_state
2025-06-09 9:27 ` [PATCH v5 3/5] mm/readahead: Make space in struct file_ra_state Ryan Roberts
@ 2025-06-11 9:57 ` Christian Brauner
0 siblings, 0 replies; 11+ messages in thread
From: Christian Brauner @ 2025-06-11 9:57 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro, Jan Kara,
David Hildenbrand, Dave Chinner, Catalin Marinas, Will Deacon,
Kalesh Singh, Zi Yan, linux-arm-kernel, linux-kernel,
linux-fsdevel, linux-mm
On Mon, Jun 09, 2025 at 10:27:25AM +0100, Ryan Roberts wrote:
> We need to be able to store the preferred folio order associated with a
> readahead request in the struct file_ra_state so that we can more
> accurately increase the order across subsequent readahead requests. But
> struct file_ra_state is per-struct file, so we don't really want to
> increase it's size.
>
> mmap_miss is currently 32 bits but it is only counted up to 10 *
> MMAP_LOTSAMISS, which is currently defined as 1000. So 16 bits should be
> plenty. Redefine it to unsigned short, making room for order as unsigned
> short in follow up commit.
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> include/linux/fs.h | 2 +-
> mm/filemap.c | 11 ++++++-----
> 2 files changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 05abdabe9db7..87e7d5790e43 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1052,7 +1052,7 @@ struct file_ra_state {
> unsigned int size;
> unsigned int async_size;
> unsigned int ra_pages;
> - unsigned int mmap_miss;
> + unsigned short mmap_miss;
Thanks for not making struct file grow!
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH v5 4/5] mm/readahead: Store folio order in struct file_ra_state
2025-06-09 9:27 [PATCH v5 0/5] Readahead tweaks for larger folios Ryan Roberts
` (2 preceding siblings ...)
2025-06-09 9:27 ` [PATCH v5 3/5] mm/readahead: Make space in struct file_ra_state Ryan Roberts
@ 2025-06-09 9:27 ` Ryan Roberts
2025-06-12 11:37 ` Jan Kara
2025-06-09 9:27 ` [PATCH v5 5/5] mm/filemap: Allow arch to request folio size for exec memory Ryan Roberts
4 siblings, 1 reply; 11+ messages in thread
From: Ryan Roberts @ 2025-06-09 9:27 UTC (permalink / raw)
To: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-fsdevel,
linux-mm
Previously the folio order of the previous readahead request was
inferred from the folio whose readahead marker was hit. But due to the
way we sometimes have to round to non-natural boundaries, this first
folio in the readahead block is often smaller than the preferred order
for that request. This means that for cases where the initial sync
readahead is poorly aligned, the folio order will ramp up much more
slowly.
So instead, let's store the order in struct file_ra_state so we are not
affected by any required alignment. We previously made enough room in
the struct for a 16-bit order field. This should be plenty big enough
since we are limited to MAX_PAGECACHE_ORDER anyway, which is certainly
never larger than ~20.
Since we now pass order in struct file_ra_state, page_cache_ra_order()
no longer needs its new_order parameter, so let's remove it.
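Condensed, the flow after this patch looks like this (a sketch assembled
from the hunks below, not a complete function):
  /* Sync readahead and mmap read-around seed the preferred order: */
  ra->order = 0;
  page_cache_ra_order(ractl, ra);
  /* Async readahead ramps up from the order remembered last time: */
  ra->order += 2;
  page_cache_ra_order(ractl, ra);
  /* And page_cache_ra_order() writes back the order it actually used: */
  new_order = min(mapping_max_folio_order(mapping), ra->order);
  new_order = min_t(unsigned int, new_order, ilog2(ra->size));
  new_order = max(new_order, min_order);
  ra->order = new_order;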
Worked example:
Here we are touching pages 17-256 sequentially just as we did in the
previous commit, but now that we are remembering the preferred order
explicitly, we no longer have the slow ramp up problem. Note
specifically that we no longer have 2 rounds (2x ~128K) of order-2
folios:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
FOLIO 0x00021000 0x00022000 4096 33 34 1 0
FOLIO 0x00022000 0x00024000 8192 34 36 2 1
FOLIO 0x00024000 0x00028000 16384 36 40 4 2
FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
FOLIO 0x00030000 0x00034000 16384 48 52 4 2
FOLIO 0x00034000 0x00038000 16384 52 56 4 2
FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
FOLIO 0x00040000 0x00050000 65536 64 80 16 4
FOLIO 0x00050000 0x00060000 65536 80 96 16 4
FOLIO 0x00060000 0x00080000 131072 96 128 32 5
FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
FOLIO 0x000e0000 0x00100000 131072 224 256 32 5
FOLIO 0x00100000 0x00120000 131072 256 288 32 5
FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y
HOLE 0x00140000 0x00800000 7077888 320 2048 1728
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
include/linux/fs.h | 2 ++
mm/filemap.c | 6 ++++--
mm/internal.h | 3 +--
mm/readahead.c | 21 +++++++++++++--------
4 files changed, 20 insertions(+), 12 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 87e7d5790e43..b5172b691f97 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1041,6 +1041,7 @@ struct fown_struct {
* and so were/are genuinely "ahead". Start next readahead when
* the first of these pages is accessed.
* @ra_pages: Maximum size of a readahead request, copied from the bdi.
+ * @order: Preferred folio order used for most recent readahead.
* @mmap_miss: How many mmap accesses missed in the page cache.
* @prev_pos: The last byte in the most recent read request.
*
@@ -1052,6 +1053,7 @@ struct file_ra_state {
unsigned int size;
unsigned int async_size;
unsigned int ra_pages;
+ unsigned short order;
unsigned short mmap_miss;
loff_t prev_pos;
};
diff --git a/mm/filemap.c b/mm/filemap.c
index 7bb4ffca8487..4b5c8d69f04c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3232,7 +3232,8 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
if (!(vm_flags & VM_RAND_READ))
ra->size *= 2;
ra->async_size = HPAGE_PMD_NR;
- page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER);
+ ra->order = HPAGE_PMD_ORDER;
+ page_cache_ra_order(&ractl, ra);
return fpin;
}
#endif
@@ -3268,8 +3269,9 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
ra->size = ra->ra_pages;
ra->async_size = ra->ra_pages / 4;
+ ra->order = 0;
ractl._index = ra->start;
- page_cache_ra_order(&ractl, ra, 0);
+ page_cache_ra_order(&ractl, ra);
return fpin;
}
diff --git a/mm/internal.h b/mm/internal.h
index 6b8ed2017743..f91688e2894f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -436,8 +436,7 @@ void zap_page_range_single_batched(struct mmu_gather *tlb,
int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
gfp_t gfp);
-void page_cache_ra_order(struct readahead_control *, struct file_ra_state *,
- unsigned int order);
+void page_cache_ra_order(struct readahead_control *, struct file_ra_state *);
void force_page_cache_ra(struct readahead_control *, unsigned long nr);
static inline void force_page_cache_readahead(struct address_space *mapping,
struct file *file, pgoff_t index, unsigned long nr_to_read)
diff --git a/mm/readahead.c b/mm/readahead.c
index 87be20ae00d0..95a24f12d1e7 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -457,7 +457,7 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
}
void page_cache_ra_order(struct readahead_control *ractl,
- struct file_ra_state *ra, unsigned int new_order)
+ struct file_ra_state *ra)
{
struct address_space *mapping = ractl->mapping;
pgoff_t start = readahead_index(ractl);
@@ -468,9 +468,12 @@ void page_cache_ra_order(struct readahead_control *ractl,
unsigned int nofs;
int err = 0;
gfp_t gfp = readahead_gfp_mask(mapping);
+ unsigned int new_order = ra->order;
- if (!mapping_large_folio_support(mapping))
+ if (!mapping_large_folio_support(mapping)) {
+ ra->order = 0;
goto fallback;
+ }
limit = min(limit, index + ra->size - 1);
@@ -478,6 +481,8 @@ void page_cache_ra_order(struct readahead_control *ractl,
new_order = min_t(unsigned int, new_order, ilog2(ra->size));
new_order = max(new_order, min_order);
+ ra->order = new_order;
+
/* See comment in page_cache_ra_unbounded() */
nofs = memalloc_nofs_save();
filemap_invalidate_lock_shared(mapping);
@@ -609,8 +614,9 @@ void page_cache_sync_ra(struct readahead_control *ractl,
ra->size = min(contig_count + req_count, max_pages);
ra->async_size = 1;
readit:
+ ra->order = 0;
ractl->_index = ra->start;
- page_cache_ra_order(ractl, ra, 0);
+ page_cache_ra_order(ractl, ra);
}
EXPORT_SYMBOL_GPL(page_cache_sync_ra);
@@ -621,7 +627,6 @@ void page_cache_async_ra(struct readahead_control *ractl,
struct file_ra_state *ra = ractl->ra;
pgoff_t index = readahead_index(ractl);
pgoff_t expected, start, end, aligned_end, align;
- unsigned int order = folio_order(folio);
/* no readahead */
if (!ra->ra_pages)
@@ -644,7 +649,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
* Ramp up sizes, and push forward the readahead window.
*/
expected = round_down(ra->start + ra->size - ra->async_size,
- 1UL << order);
+ 1UL << folio_order(folio));
if (index == expected) {
ra->start += ra->size;
/*
@@ -673,15 +678,15 @@ void page_cache_async_ra(struct readahead_control *ractl,
ra->size += req_count;
ra->size = get_next_ra_size(ra, max_pages);
readit:
- order += 2;
- align = 1UL << min(order, ffs(max_pages) - 1);
+ ra->order += 2;
+ align = 1UL << min(ra->order, ffs(max_pages) - 1);
end = ra->start + ra->size;
aligned_end = round_down(end, align);
if (aligned_end > ra->start)
ra->size -= end - aligned_end;
ra->async_size = ra->size;
ractl->_index = ra->start;
- page_cache_ra_order(ractl, ra, order);
+ page_cache_ra_order(ractl, ra);
}
EXPORT_SYMBOL_GPL(page_cache_async_ra);
--
2.43.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH v5 4/5] mm/readahead: Store folio order in struct file_ra_state
2025-06-09 9:27 ` [PATCH v5 4/5] mm/readahead: Store folio order " Ryan Roberts
@ 2025-06-12 11:37 ` Jan Kara
0 siblings, 0 replies; 11+ messages in thread
From: Jan Kara @ 2025-06-12 11:37 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan,
linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On Mon 09-06-25 10:27:26, Ryan Roberts wrote:
> Previously the folio order of the previous readahead request was
> inferred from the folio who's readahead marker was hit. But due to the
> way we have to round to non-natural boundaries sometimes, this first
> folio in the readahead block is often smaller than the preferred order
> for that request. This means that for cases where the initial sync
> readahead is poorly aligned, the folio order will ramp up much more
> slowly.
>
> So instead, let's store the order in struct file_ra_state so we are not
> affected by any required alignment. We previously made enough room in
> the struct for a 16 order field. This should be plenty big enough since
> we are limited to MAX_PAGECACHE_ORDER anyway, which is certainly never
> larger than ~20.
>
> Since we now pass order in struct file_ra_state, page_cache_ra_order()
> no longer needs it's new_order parameter, so let's remove that.
>
> Worked example:
>
> Here we are touching pages 17-256 sequentially just as we did in the
> previous commit, but now that we are remembering the preferred order
> explicitly, we no longer have the slow ramp up problem. Note
> specifically that we no longer have 2 rounds (2x ~128K) of order-2
> folios:
>
> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
> ----- ---------- ---------- ---------- ------- ------- ----- ----- --
> HOLE 0x00000000 0x00001000 4096 0 1 1
> FOLIO 0x00001000 0x00002000 4096 1 2 1 0
> FOLIO 0x00002000 0x00003000 4096 2 3 1 0
> FOLIO 0x00003000 0x00004000 4096 3 4 1 0
> FOLIO 0x00004000 0x00005000 4096 4 5 1 0
> FOLIO 0x00005000 0x00006000 4096 5 6 1 0
> FOLIO 0x00006000 0x00007000 4096 6 7 1 0
> FOLIO 0x00007000 0x00008000 4096 7 8 1 0
> FOLIO 0x00008000 0x00009000 4096 8 9 1 0
> FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
> FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
> FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
> FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
> FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
> FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
> FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
> FOLIO 0x00010000 0x00011000 4096 16 17 1 0
> FOLIO 0x00011000 0x00012000 4096 17 18 1 0
> FOLIO 0x00012000 0x00013000 4096 18 19 1 0
> FOLIO 0x00013000 0x00014000 4096 19 20 1 0
> FOLIO 0x00014000 0x00015000 4096 20 21 1 0
> FOLIO 0x00015000 0x00016000 4096 21 22 1 0
> FOLIO 0x00016000 0x00017000 4096 22 23 1 0
> FOLIO 0x00017000 0x00018000 4096 23 24 1 0
> FOLIO 0x00018000 0x00019000 4096 24 25 1 0
> FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
> FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
> FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
> FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
> FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
> FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
> FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
> FOLIO 0x00020000 0x00021000 4096 32 33 1 0
> FOLIO 0x00021000 0x00022000 4096 33 34 1 0
> FOLIO 0x00022000 0x00024000 8192 34 36 2 1
> FOLIO 0x00024000 0x00028000 16384 36 40 4 2
> FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
> FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
> FOLIO 0x00030000 0x00034000 16384 48 52 4 2
> FOLIO 0x00034000 0x00038000 16384 52 56 4 2
> FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
> FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
> FOLIO 0x00040000 0x00050000 65536 64 80 16 4
> FOLIO 0x00050000 0x00060000 65536 80 96 16 4
> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
> FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
> FOLIO 0x000e0000 0x00100000 131072 224 256 32 5
> FOLIO 0x00100000 0x00120000 131072 256 288 32 5
> FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y
> HOLE 0x00140000 0x00800000 7077888 320 2048 1728
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Looks good! Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> include/linux/fs.h | 2 ++
> mm/filemap.c | 6 ++++--
> mm/internal.h | 3 +--
> mm/readahead.c | 21 +++++++++++++--------
> 4 files changed, 20 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 87e7d5790e43..b5172b691f97 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1041,6 +1041,7 @@ struct fown_struct {
> * and so were/are genuinely "ahead". Start next readahead when
> * the first of these pages is accessed.
> * @ra_pages: Maximum size of a readahead request, copied from the bdi.
> + * @order: Preferred folio order used for most recent readahead.
> * @mmap_miss: How many mmap accesses missed in the page cache.
> * @prev_pos: The last byte in the most recent read request.
> *
> @@ -1052,6 +1053,7 @@ struct file_ra_state {
> unsigned int size;
> unsigned int async_size;
> unsigned int ra_pages;
> + unsigned short order;
> unsigned short mmap_miss;
> loff_t prev_pos;
> };
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7bb4ffca8487..4b5c8d69f04c 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3232,7 +3232,8 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> if (!(vm_flags & VM_RAND_READ))
> ra->size *= 2;
> ra->async_size = HPAGE_PMD_NR;
> - page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER);
> + ra->order = HPAGE_PMD_ORDER;
> + page_cache_ra_order(&ractl, ra);
> return fpin;
> }
> #endif
> @@ -3268,8 +3269,9 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
> ra->size = ra->ra_pages;
> ra->async_size = ra->ra_pages / 4;
> + ra->order = 0;
> ractl._index = ra->start;
> - page_cache_ra_order(&ractl, ra, 0);
> + page_cache_ra_order(&ractl, ra);
> return fpin;
> }
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 6b8ed2017743..f91688e2894f 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -436,8 +436,7 @@ void zap_page_range_single_batched(struct mmu_gather *tlb,
> int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
> gfp_t gfp);
>
> -void page_cache_ra_order(struct readahead_control *, struct file_ra_state *,
> - unsigned int order);
> +void page_cache_ra_order(struct readahead_control *, struct file_ra_state *);
> void force_page_cache_ra(struct readahead_control *, unsigned long nr);
> static inline void force_page_cache_readahead(struct address_space *mapping,
> struct file *file, pgoff_t index, unsigned long nr_to_read)
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 87be20ae00d0..95a24f12d1e7 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -457,7 +457,7 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
> }
>
> void page_cache_ra_order(struct readahead_control *ractl,
> - struct file_ra_state *ra, unsigned int new_order)
> + struct file_ra_state *ra)
> {
> struct address_space *mapping = ractl->mapping;
> pgoff_t start = readahead_index(ractl);
> @@ -468,9 +468,12 @@ void page_cache_ra_order(struct readahead_control *ractl,
> unsigned int nofs;
> int err = 0;
> gfp_t gfp = readahead_gfp_mask(mapping);
> + unsigned int new_order = ra->order;
>
> - if (!mapping_large_folio_support(mapping))
> + if (!mapping_large_folio_support(mapping)) {
> + ra->order = 0;
> goto fallback;
> + }
>
> limit = min(limit, index + ra->size - 1);
>
> @@ -478,6 +481,8 @@ void page_cache_ra_order(struct readahead_control *ractl,
> new_order = min_t(unsigned int, new_order, ilog2(ra->size));
> new_order = max(new_order, min_order);
>
> + ra->order = new_order;
> +
> /* See comment in page_cache_ra_unbounded() */
> nofs = memalloc_nofs_save();
> filemap_invalidate_lock_shared(mapping);
> @@ -609,8 +614,9 @@ void page_cache_sync_ra(struct readahead_control *ractl,
> ra->size = min(contig_count + req_count, max_pages);
> ra->async_size = 1;
> readit:
> + ra->order = 0;
> ractl->_index = ra->start;
> - page_cache_ra_order(ractl, ra, 0);
> + page_cache_ra_order(ractl, ra);
> }
> EXPORT_SYMBOL_GPL(page_cache_sync_ra);
>
> @@ -621,7 +627,6 @@ void page_cache_async_ra(struct readahead_control *ractl,
> struct file_ra_state *ra = ractl->ra;
> pgoff_t index = readahead_index(ractl);
> pgoff_t expected, start, end, aligned_end, align;
> - unsigned int order = folio_order(folio);
>
> /* no readahead */
> if (!ra->ra_pages)
> @@ -644,7 +649,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
> * Ramp up sizes, and push forward the readahead window.
> */
> expected = round_down(ra->start + ra->size - ra->async_size,
> - 1UL << order);
> + 1UL << folio_order(folio));
> if (index == expected) {
> ra->start += ra->size;
> /*
> @@ -673,15 +678,15 @@ void page_cache_async_ra(struct readahead_control *ractl,
> ra->size += req_count;
> ra->size = get_next_ra_size(ra, max_pages);
> readit:
> - order += 2;
> - align = 1UL << min(order, ffs(max_pages) - 1);
> + ra->order += 2;
> + align = 1UL << min(ra->order, ffs(max_pages) - 1);
> end = ra->start + ra->size;
> aligned_end = round_down(end, align);
> if (aligned_end > ra->start)
> ra->size -= end - aligned_end;
> ra->async_size = ra->size;
> ractl->_index = ra->start;
> - page_cache_ra_order(ractl, ra, order);
> + page_cache_ra_order(ractl, ra);
> }
> EXPORT_SYMBOL_GPL(page_cache_async_ra);
>
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH v5 5/5] mm/filemap: Allow arch to request folio size for exec memory
2025-06-09 9:27 [PATCH v5 0/5] Readahead tweaks for larger folios Ryan Roberts
` (3 preceding siblings ...)
2025-06-09 9:27 ` [PATCH v5 4/5] mm/readahead: Store folio order " Ryan Roberts
@ 2025-06-09 9:27 ` Ryan Roberts
2025-06-19 11:07 ` Ryan Roberts
2025-07-11 15:41 ` Tao Xu
4 siblings, 2 replies; 11+ messages in thread
From: Ryan Roberts @ 2025-06-09 9:27 UTC (permalink / raw)
To: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-fsdevel,
linux-mm
Change the readahead config so that, when readahead is requested for an
executable mapping, we do a synchronous read into a set of folios with
an arch-specified order, in a naturally aligned manner. We no longer
center the read on the faulting page but simply align it down to the
previous natural boundary. Additionally, we don't bother with an
asynchronous part.
On arm64 if memory is physically contiguous and naturally aligned to the
"contpte" size, we can use contpte mappings, which improves utilization
of the TLB. When paired with the "multi-size THP" feature, this works
well to reduce dTLB pressure. However iTLB pressure is still high due to
executable mappings having a low likelihood of being in the required
folio size and mapping alignment, even when the filesystem supports
readahead into large folios (e.g. XFS).
The reason for the low likelihood is that the current readahead
algorithm starts with an order-0 folio and increases the folio order by
2 every time the readahead mark is hit. But most executable memory tends
to be accessed randomly and so the readahead mark is rarely hit and most
executable folios remain order-0.
So let's special-case the read(ahead) logic for executable mappings. The
trade-off is performance improvement (due to more efficient storage of
the translations in iTLB) vs potential for making reclaim more difficult
(due to the folios being larger so if a part of the folio is hot the
whole thing is considered hot). But executable memory is a small portion
of the overall system memory so I doubt this will even register from a
reclaim perspective.
I've chosen 64K folio size for arm64 which benefits both the 4K and 16K
base page size configs. Crucially the same amount of data is still read
(usually 128K) so I'm not expecting any read amplification issues. I
don't anticipate any write amplification because text is always RO.
Note that the text region of an ELF file could be populated into the
page cache for other reasons than taking a fault in a mmapped area. The
most common case is due to the loader read()ing the header which can be
shared with the beginning of text. So some text will still remain in
small folios, but this simple, best effort change provides good
performance improvements as is.
Confine this special-case approach to the bounds of the VMA. This
prevents wasting memory for any padding that might exist in the file
between sections. Previously the padding would have been contained in
order-0 folios and would be easy to reclaim. But now it would be part of
a larger folio so more difficult to reclaim. Solve this by simply not
reading it into memory in the first place.
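The window calculation for the VM_EXEC case then works out as follows
(taken from the mm/filemap.c hunk below; the example values in the
comments assume arm64 with 4K pages, so exec_folio_order() == 4, the
default 128K (32 page) ra_pages, and a fault at pgoff 100 in a large VMA
starting at pgoff 0):
  ra->order = exec_folio_order();                       /* 4, i.e. 64K folios */
  ra->start = round_down(vmf->pgoff, 1UL << ra->order); /* 100 -> 96          */
  ra->start = max(ra->start, start);                    /* clamp to VMA start */
  ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order); /* 128       */
  ra_end = min(ra_end, end);                            /* clamp to VMA end   */
  ra->size = ra_end - ra->start;                        /* 32 pages           */
  ra->async_size = 0;                                   /* no async part      */
So the faulting page lands inside a naturally aligned run of two 64K
folios and nothing outside the VMA is read.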
Benchmarking
============
The below shows pgbench and redis benchmarks on Graviton3 arm64 system.
First, confirmation that this patch causes more text to be contained in
64K folios:
+----------------------+---------------+---------------+---------------+
| File-backed folios by| system boot | pgbench | redis |
| size as percentage of+-------+-------+-------+-------+-------+-------+
| all mapped text mem |before | after |before | after |before | after |
+======================+=======+=======+=======+=======+=======+=======+
| base-page-4kB | 78% | 30% | 78% | 11% | 73% | 14% |
| thp-aligned-8kB | 1% | 0% | 0% | 0% | 1% | 0% |
| thp-aligned-16kB | 17% | 4% | 17% | 3% | 20% | 4% |
| thp-aligned-32kB | 1% | 1% | 1% | 2% | 1% | 1% |
| thp-aligned-64kB | 3% | 63% | 3% | 81% | 4% | 77% |
| thp-aligned-128kB | 0% | 1% | 1% | 1% | 1% | 2% |
| thp-unaligned-64kB | 0% | 0% | 0% | 1% | 0% | 1% |
| thp-unaligned-128kB | 0% | 1% | 0% | 0% | 0% | 0% |
| thp-partial | 0% | 0% | 0% | 1% | 0% | 1% |
+----------------------+-------+-------+-------+-------+-------+-------+
| cont-aligned-64kB | 4% | 65% | 4% | 83% | 6% | 79% |
+----------------------+-------+-------+-------+-------+-------+-------+
The above shows that for both workloads (each isolated with cgroups) as
well as the general system state after boot, the amount of text backed
by 4K and 16K folios reduces and the amount backed by 64K folios
increases significantly. And the amount of text that is contpte-mapped
significantly increases (see last row).
And this is reflected in a performance improvement. "(I)" indicates a
statistically significant improvement. Note that TPS and Reqs/sec are
rates, so bigger is better; ms is a latency, so smaller is better:
+-------------+-------------------------------------------+------------+
| Benchmark | Result Class | Improvement |
+=============+===========================================+============+
| pts/pgbench | Scale: 1 Clients: 1 RO (TPS) | (I) 3.47% |
| | Scale: 1 Clients: 1 RO - Latency (ms) | -2.88% |
| | Scale: 1 Clients: 250 RO (TPS) | (I) 5.02% |
| | Scale: 1 Clients: 250 RO - Latency (ms) | (I) -4.79% |
| | Scale: 1 Clients: 1000 RO (TPS) | (I) 6.16% |
| | Scale: 1 Clients: 1000 RO - Latency (ms) | (I) -5.82% |
| | Scale: 100 Clients: 1 RO (TPS) | 2.51% |
| | Scale: 100 Clients: 1 RO - Latency (ms) | -3.51% |
| | Scale: 100 Clients: 250 RO (TPS) | (I) 4.75% |
| | Scale: 100 Clients: 250 RO - Latency (ms) | (I) -4.44% |
| | Scale: 100 Clients: 1000 RO (TPS) | (I) 6.34% |
| | Scale: 100 Clients: 1000 RO - Latency (ms)| (I) -5.95% |
+-------------+-------------------------------------------+------------+
| pts/redis | Test: GET Connections: 50 (Reqs/sec) | (I) 3.20% |
| | Test: GET Connections: 1000 (Reqs/sec) | (I) 2.55% |
| | Test: LPOP Connections: 50 (Reqs/sec) | (I) 4.59% |
| | Test: LPOP Connections: 1000 (Reqs/sec) | (I) 4.81% |
| | Test: LPUSH Connections: 50 (Reqs/sec) | (I) 5.31% |
| | Test: LPUSH Connections: 1000 (Reqs/sec) | (I) 4.36% |
| | Test: SADD Connections: 50 (Reqs/sec) | (I) 2.64% |
| | Test: SADD Connections: 1000 (Reqs/sec) | (I) 4.15% |
| | Test: SET Connections: 50 (Reqs/sec) | (I) 3.11% |
| | Test: SET Connections: 1000 (Reqs/sec) | (I) 3.36% |
+-------------+-------------------------------------------+------------+
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/arm64/include/asm/pgtable.h | 8 ++++++
include/linux/pgtable.h | 11 ++++++++
mm/filemap.c | 47 ++++++++++++++++++++++++++------
3 files changed, 57 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 88db8a0c0b37..7a7dfdce14b8 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1643,6 +1643,14 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
*/
#define arch_wants_old_prefaulted_pte cpu_has_hw_af
+/*
+ * Request exec memory is read into pagecache in at least 64K folios. This size
+ * can be contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB
+ * entry), and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base
+ * pages are in use.
+ */
+#define exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
+
static inline bool pud_sect_supported(void)
{
return PAGE_SIZE == SZ_4K;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 0b6e1f781d86..e4a3895c043b 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -456,6 +456,17 @@ static inline bool arch_has_hw_pte_young(void)
}
#endif
+#ifndef exec_folio_order
+/*
+ * Returns preferred minimum folio order for executable file-backed memory. Must
+ * be in range [0, PMD_ORDER). Default to order-0.
+ */
+static inline unsigned int exec_folio_order(void)
+{
+ return 0;
+}
+#endif
+
#ifndef arch_check_zapped_pte
static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
pte_t pte)
diff --git a/mm/filemap.c b/mm/filemap.c
index 4b5c8d69f04c..93fbc2ef232a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3238,8 +3238,11 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
}
#endif
- /* If we don't want any read-ahead, don't bother */
- if (vm_flags & VM_RAND_READ)
+ /*
+ * If we don't want any read-ahead, don't bother. VM_EXEC case below is
+ * already intended for random access.
+ */
+ if ((vm_flags & (VM_RAND_READ | VM_EXEC)) == VM_RAND_READ)
return fpin;
if (!ra->ra_pages)
return fpin;
@@ -3262,14 +3265,40 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
if (mmap_miss > MMAP_LOTSAMISS)
return fpin;
- /*
- * mmap read-around
- */
fpin = maybe_unlock_mmap_for_io(vmf, fpin);
- ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
- ra->size = ra->ra_pages;
- ra->async_size = ra->ra_pages / 4;
- ra->order = 0;
+ if (vm_flags & VM_EXEC) {
+ /*
+ * Allow arch to request a preferred minimum folio order for
+ * executable memory. This can often be beneficial to
+ * performance if (e.g.) arm64 can contpte-map the folio.
+ * Executable memory rarely benefits from readahead, due to its
+ * random access nature, so set async_size to 0.
+ *
+ * Limit to the boundaries of the VMA to avoid reading in any
+ * pad that might exist between sections, which would be a waste
+ * of memory.
+ */
+ struct vm_area_struct *vma = vmf->vma;
+ unsigned long start = vma->vm_pgoff;
+ unsigned long end = start + ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT);
+ unsigned long ra_end;
+
+ ra->order = exec_folio_order();
+ ra->start = round_down(vmf->pgoff, 1UL << ra->order);
+ ra->start = max(ra->start, start);
+ ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order);
+ ra_end = min(ra_end, end);
+ ra->size = ra_end - ra->start;
+ ra->async_size = 0;
+ } else {
+ /*
+ * mmap read-around
+ */
+ ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
+ ra->size = ra->ra_pages;
+ ra->async_size = ra->ra_pages / 4;
+ ra->order = 0;
+ }
ractl._index = ra->start;
page_cache_ra_order(&ractl, ra);
return fpin;
--
2.43.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH v5 5/5] mm/filemap: Allow arch to request folio size for exec memory
2025-06-09 9:27 ` [PATCH v5 5/5] mm/filemap: Allow arch to request folio size for exec memory Ryan Roberts
@ 2025-06-19 11:07 ` Ryan Roberts
2025-07-11 15:41 ` Tao Xu
1 sibling, 0 replies; 11+ messages in thread
From: Ryan Roberts @ 2025-06-19 11:07 UTC (permalink / raw)
To: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
Hi Andrew,
On 09/06/2025 10:27, Ryan Roberts wrote:
> Change the readahead config so that if it is being requested for an
> executable mapping, do a synchronous read into a set of folios with an
> arch-specified order and in a naturally aligned manner. We no longer
> center the read on the faulting page but simply align it down to the
> previous natural boundary. Additionally, we don't bother with an
> asynchronous part.
>
> On arm64 if memory is physically contiguous and naturally aligned to the
> "contpte" size, we can use contpte mappings, which improves utilization
> of the TLB. When paired with the "multi-size THP" feature, this works
> well to reduce dTLB pressure. However iTLB pressure is still high due to
> executable mappings having a low likelihood of being in the required
> folio size and mapping alignment, even when the filesystem supports
> readahead into large folios (e.g. XFS).
>
> The reason for the low likelihood is that the current readahead
> algorithm starts with an order-0 folio and increases the folio order by
> 2 every time the readahead mark is hit. But most executable memory tends
> to be accessed randomly and so the readahead mark is rarely hit and most
> executable folios remain order-0.
>
> So let's special-case the read(ahead) logic for executable mappings. The
> trade-off is performance improvement (due to more efficient storage of
> the translations in iTLB) vs potential for making reclaim more difficult
> (due to the folios being larger so if a part of the folio is hot the
> whole thing is considered hot). But executable memory is a small portion
> of the overall system memory so I doubt this will even register from a
> reclaim perspective.
>
> I've chosen 64K folio size for arm64 which benefits both the 4K and 16K
> base page size configs. Crucially the same amount of data is still read
> (usually 128K) so I'm not expecting any read amplification issues. I
> don't anticipate any write amplification because text is always RO.
>
> Note that the text region of an ELF file could be populated into the
> page cache for other reasons than taking a fault in a mmapped area. The
> most common case is due to the loader read()ing the header which can be
> shared with the beginning of text. So some text will still remain in
> small folios, but this simple, best effort change provides good
> performance improvements as is.
>
> Confine this special-case approach to the bounds of the VMA. This
> prevents wasting memory for any padding that might exist in the file
> between sections. Previously the padding would have been contained in
> order-0 folios and would be easy to reclaim. But now it would be part of
> a larger folio so more difficult to reclaim. Solve this by simply not
> reading it into memory in the first place.
>
> Benchmarking
> ============
>
> The below shows pgbench and redis benchmarks on Graviton3 arm64 system.
>
> First, confirmation that this patch causes more text to be contained in
> 64K folios:
>
> +----------------------+---------------+---------------+---------------+
> | File-backed folios by| system boot | pgbench | redis |
> | size as percentage of+-------+-------+-------+-------+-------+-------+
> | all mapped text mem |before | after |before | after |before | after |
> +======================+=======+=======+=======+=======+=======+=======+
> | base-page-4kB | 78% | 30% | 78% | 11% | 73% | 14% |
> | thp-aligned-8kB | 1% | 0% | 0% | 0% | 1% | 0% |
> | thp-aligned-16kB | 17% | 4% | 17% | 3% | 20% | 4% |
> | thp-aligned-32kB | 1% | 1% | 1% | 2% | 1% | 1% |
> | thp-aligned-64kB | 3% | 63% | 3% | 81% | 4% | 77% |
> | thp-aligned-128kB | 0% | 1% | 1% | 1% | 1% | 2% |
> | thp-unaligned-64kB | 0% | 0% | 0% | 1% | 0% | 1% |
> | thp-unaligned-128kB | 0% | 1% | 0% | 0% | 0% | 0% |
> | thp-partial | 0% | 0% | 0% | 1% | 0% | 1% |
> +----------------------+-------+-------+-------+-------+-------+-------+
> | cont-aligned-64kB | 4% | 65% | 4% | 83% | 6% | 79% |
> +----------------------+-------+-------+-------+-------+-------+-------+
>
> The above shows that for both workloads (each isolated with cgroups) as
> well as the general system state after boot, the amount of text backed
> by 4K and 16K folios reduces and the amount backed by 64K folios
> increases significantly. And the amount of text that is contpte-mapped
> significantly increases (see last row).
>
> And this is reflected in performance improvement. "(I)" indicates a
> statistically significant improvement. Note TPS and Reqs/sec are rates
> so bigger is better, ms is time so smaller is better:
>
> +-------------+-------------------------------------------+------------+
> | Benchmark | Result Class | Improvemnt |
> +=============+===========================================+============+
> | pts/pgbench | Scale: 1 Clients: 1 RO (TPS) | (I) 3.47% |
> | | Scale: 1 Clients: 1 RO - Latency (ms) | -2.88% |
> | | Scale: 1 Clients: 250 RO (TPS) | (I) 5.02% |
> | | Scale: 1 Clients: 250 RO - Latency (ms) | (I) -4.79% |
> | | Scale: 1 Clients: 1000 RO (TPS) | (I) 6.16% |
> | | Scale: 1 Clients: 1000 RO - Latency (ms) | (I) -5.82% |
> | | Scale: 100 Clients: 1 RO (TPS) | 2.51% |
> | | Scale: 100 Clients: 1 RO - Latency (ms) | -3.51% |
> | | Scale: 100 Clients: 250 RO (TPS) | (I) 4.75% |
> | | Scale: 100 Clients: 250 RO - Latency (ms) | (I) -4.44% |
> | | Scale: 100 Clients: 1000 RO (TPS) | (I) 6.34% |
> | | Scale: 100 Clients: 1000 RO - Latency (ms)| (I) -5.95% |
> +-------------+-------------------------------------------+------------+
> | pts/redis | Test: GET Connections: 50 (Reqs/sec) | (I) 3.20% |
> | | Test: GET Connections: 1000 (Reqs/sec) | (I) 2.55% |
> | | Test: LPOP Connections: 50 (Reqs/sec) | (I) 4.59% |
> | | Test: LPOP Connections: 1000 (Reqs/sec) | (I) 4.81% |
> | | Test: LPUSH Connections: 50 (Reqs/sec) | (I) 5.31% |
> | | Test: LPUSH Connections: 1000 (Reqs/sec) | (I) 4.36% |
> | | Test: SADD Connections: 50 (Reqs/sec) | (I) 2.64% |
> | | Test: SADD Connections: 1000 (Reqs/sec) | (I) 4.15% |
> | | Test: SET Connections: 50 (Reqs/sec) | (I) 3.11% |
> | | Test: SET Connections: 1000 (Reqs/sec) | (I) 3.36% |
> +-------------+-------------------------------------------+------------+
>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Acked-by: Will Deacon <will@kernel.org>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
A use-after-free issue was reported against this patch, which I believe is still
in mm-unstable? The problem is that I'm accessing the vma after dropping the
mmap lock, so the fix is to move the unlock to after the if/else. Would you mind
squashing this into the patch?
The report is here:
https://lore.kernel.org/linux-mm/hi6tsbuplmf6jcr44tqu6mdhtyebyqgsfif7okhnrzkcowpo4d@agoyrl4ozyth/
---8<---
diff --git a/mm/filemap.c b/mm/filemap.c
index 93fbc2ef232a..eaf853d6b719 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3265,7 +3265,6 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
if (mmap_miss > MMAP_LOTSAMISS)
return fpin;
- fpin = maybe_unlock_mmap_for_io(vmf, fpin);
if (vm_flags & VM_EXEC) {
/*
* Allow arch to request a preferred minimum folio order for
@@ -3299,6 +3298,8 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
ra->async_size = ra->ra_pages / 4;
ra->order = 0;
}
+
+ fpin = maybe_unlock_mmap_for_io(vmf, fpin);
ractl._index = ra->start;
page_cache_ra_order(&ractl, ra);
return fpin;
---8<---
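To make the net effect of the squash explicit, the tail of
do_sync_mmap_readahead() would then read roughly as below (an abbreviated
sketch assembled from the two diffs above, with the unchanged VM_EXEC body
elided, so not compilable on its own); the point is that vmf->vma is only
dereferenced while the mmap lock is still held:

        if (mmap_miss > MMAP_LOTSAMISS)
                return fpin;

        if (vm_flags & VM_EXEC) {
                /* VM_EXEC window setup as in the patch; reads vmf->vma fields */
                ...
        } else {
                /*
                 * mmap read-around
                 */
                ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
                ra->size = ra->ra_pages;
                ra->async_size = ra->ra_pages / 4;
                ra->order = 0;
        }

        /* Only now may the mmap lock be dropped; no vmf->vma access after this */
        fpin = maybe_unlock_mmap_for_io(vmf, fpin);
        ractl._index = ra->start;
        page_cache_ra_order(&ractl, ra);
        return fpin;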
Thanks,
Ryan
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH v5 5/5] mm/filemap: Allow arch to request folio size for exec memory
2025-06-09 9:27 ` [PATCH v5 5/5] mm/filemap: Allow arch to request folio size for exec memory Ryan Roberts
2025-06-19 11:07 ` Ryan Roberts
@ 2025-07-11 15:41 ` Tao Xu
2025-07-14 8:19 ` Ryan Roberts
1 sibling, 1 reply; 11+ messages in thread
From: Tao Xu @ 2025-07-11 15:41 UTC (permalink / raw)
To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
Alexander Viro, Christian Brauner, Jan Kara, David Hildenbrand,
Dave Chinner, Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On 09/06/2025 10:27, Ryan Roberts wrote:
> Change the readahead config so that if it is being requested for an
> executable mapping, do a synchronous read into a set of folios with an
> arch-specified order and in a naturally aligned manner. We no longer
> center the read on the faulting page but simply align it down to the
> previous natural boundary. Additionally, we don't bother with an
> asynchronous part.
>
> On arm64 if memory is physically contiguous and naturally aligned to the
> "contpte" size, we can use contpte mappings, which improves utilization
> of the TLB. When paired with the "multi-size THP" feature, this works
> well to reduce dTLB pressure. However iTLB pressure is still high due to
> executable mappings having a low likelihood of being in the required
> folio size and mapping alignment, even when the filesystem supports
> readahead into large folios (e.g. XFS).
>
> The reason for the low likelihood is that the current readahead
> algorithm starts with an order-0 folio and increases the folio order by
> 2 every time the readahead mark is hit. But most executable memory tends
> to be accessed randomly and so the readahead mark is rarely hit and most
> executable folios remain order-0.
>
> So let's special-case the read(ahead) logic for executable mappings. The
> trade-off is performance improvement (due to more efficient storage of
> the translations in iTLB) vs potential for making reclaim more difficult
> (due to the folios being larger so if a part of the folio is hot the
> whole thing is considered hot). But executable memory is a small portion
> of the overall system memory so I doubt this will even register from a
> reclaim perspective.
>
> I've chosen 64K folio size for arm64 which benefits both the 4K and 16K
> base page size configs. Crucially the same amount of data is still read
> (usually 128K) so I'm not expecting any read amplification issues. I
> don't anticipate any write amplification because text is always RO.
>
> Note that the text region of an ELF file could be populated into the
> page cache for other reasons than taking a fault in a mmapped area. The
> most common case is due to the loader read()ing the header which can be
> shared with the beginning of text. So some text will still remain in
> small folios, but this simple, best effort change provides good
> performance improvements as is.
>
> Confine this special-case approach to the bounds of the VMA. This
> prevents wasting memory for any padding that might exist in the file
> between sections. Previously the padding would have been contained in
> order-0 folios and would be easy to reclaim. But now it would be part of
> a larger folio so more difficult to reclaim. Solve this by simply not
> reading it into memory in the first place.
>
> Benchmarking
> ============
>
> The below shows pgbench and redis benchmarks on Graviton3 arm64 system.
>
> First, confirmation that this patch causes more text to be contained in
> 64K folios:
>
> +----------------------+---------------+---------------+---------------+
> | File-backed folios by| system boot | pgbench | redis |
> | size as percentage of+-------+-------+-------+-------+-------+-------+
> | all mapped text mem |before | after |before | after |before | after |
> +======================+=======+=======+=======+=======+=======+=======+
> | base-page-4kB | 78% | 30% | 78% | 11% | 73% | 14% |
> | thp-aligned-8kB | 1% | 0% | 0% | 0% | 1% | 0% |
> | thp-aligned-16kB | 17% | 4% | 17% | 3% | 20% | 4% |
> | thp-aligned-32kB | 1% | 1% | 1% | 2% | 1% | 1% |
> | thp-aligned-64kB | 3% | 63% | 3% | 81% | 4% | 77% |
> | thp-aligned-128kB | 0% | 1% | 1% | 1% | 1% | 2% |
> | thp-unaligned-64kB | 0% | 0% | 0% | 1% | 0% | 1% |
> | thp-unaligned-128kB | 0% | 1% | 0% | 0% | 0% | 0% |
> | thp-partial | 0% | 0% | 0% | 1% | 0% | 1% |
> +----------------------+-------+-------+-------+-------+-------+-------+
> | cont-aligned-64kB | 4% | 65% | 4% | 83% | 6% | 79% |
> +----------------------+-------+-------+-------+-------+-------+-------+
>
> The above shows that for both workloads (each isolated with cgroups) as
> well as the general system state after boot, the amount of text backed
> by 4K and 16K folios reduces and the amount backed by 64K folios
> increases significantly. And the amount of text that is contpte-mapped
> significantly increases (see last row).
>
> And this is reflected in performance improvement. "(I)" indicates a
> statistically significant improvement. Note TPS and Reqs/sec are rates
> so bigger is better, ms is time so smaller is better:
>
> +-------------+-------------------------------------------+------------+
> | Benchmark | Result Class | Improvemnt |
> +=============+===========================================+============+
> | pts/pgbench | Scale: 1 Clients: 1 RO (TPS) | (I) 3.47% |
> | | Scale: 1 Clients: 1 RO - Latency (ms) | -2.88% |
> | | Scale: 1 Clients: 250 RO (TPS) | (I) 5.02% |
> | | Scale: 1 Clients: 250 RO - Latency (ms) | (I) -4.79% |
> | | Scale: 1 Clients: 1000 RO (TPS) | (I) 6.16% |
> | | Scale: 1 Clients: 1000 RO - Latency (ms) | (I) -5.82% |
> | | Scale: 100 Clients: 1 RO (TPS) | 2.51% |
> | | Scale: 100 Clients: 1 RO - Latency (ms) | -3.51% |
> | | Scale: 100 Clients: 250 RO (TPS) | (I) 4.75% |
> | | Scale: 100 Clients: 250 RO - Latency (ms) | (I) -4.44% |
> | | Scale: 100 Clients: 1000 RO (TPS) | (I) 6.34% |
> | | Scale: 100 Clients: 1000 RO - Latency (ms)| (I) -5.95% |
> +-------------+-------------------------------------------+------------+
> | pts/redis | Test: GET Connections: 50 (Reqs/sec) | (I) 3.20% |
> | | Test: GET Connections: 1000 (Reqs/sec) | (I) 2.55% |
> | | Test: LPOP Connections: 50 (Reqs/sec) | (I) 4.59% |
> | | Test: LPOP Connections: 1000 (Reqs/sec) | (I) 4.81% |
> | | Test: LPUSH Connections: 50 (Reqs/sec) | (I) 5.31% |
> | | Test: LPUSH Connections: 1000 (Reqs/sec) | (I) 4.36% |
> | | Test: SADD Connections: 50 (Reqs/sec) | (I) 2.64% |
> | | Test: SADD Connections: 1000 (Reqs/sec) | (I) 4.15% |
> | | Test: SET Connections: 50 (Reqs/sec) | (I) 3.11% |
> | | Test: SET Connections: 1000 (Reqs/sec) | (I) 3.36% |
> +-------------+-------------------------------------------+------------+
>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Acked-by: Will Deacon <will@kernel.org>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Tao Xu <tao.xu@arm.com>
Observed similar performance improvements and iTLB benefits with mysql
sysbench on an Azure Cobalt-100 arm64 system.
The table below shows that more of the .text of the 52MiB mysqld binary on
XFS is now backed by 64K folios, and by 128K folios when p_align is
increased from the default 64K to 2M in the ELF header (a small sketch for
inspecting p_align follows the table):
+----------------------+-------+-------+-------+
|                      |         mysql         |
+----------------------+-------+-------+-------+
|                      |before |     after     |
+----------------------+-------+-------+-------+
|                      |       |    p_align    |
|                      |       |  64k  |  2M   |
+----------------------+-------+-------+-------+
| thp-aligned-8kB      |   1%  |   0%  |   0%  |
| thp-aligned-16kB     |  53%  |   0%  |   0%  |
| thp-aligned-32kB     |   0%  |   0%  |   0%  |
| thp-aligned-64kB     |   3%  |  72%  |   1%  |
| thp-aligned-128kB    |   0%  |   0%  |  67%  |
| thp-partial          |   0%  |   0%  |   5%  |
+----------------------+-------+-------+-------+
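For reference, p_align can be read straight from the program headers; the
following minimal sketch (not part of the test setup, just the standard
<elf.h> definitions, with error handling kept minimal and 64-bit ELF assumed)
prints the alignment of each PT_LOAD segment so the 64K vs 2M cases above can
be distinguished:

#include <elf.h>
#include <stdio.h>

int main(int argc, char **argv)
{
        Elf64_Ehdr eh;
        Elf64_Phdr ph;
        FILE *f;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <elf-file>\n", argv[0]);
                return 1;
        }
        f = fopen(argv[1], "rb");
        if (!f || fread(&eh, sizeof(eh), 1, f) != 1) {
                perror(argv[1]);
                return 1;
        }
        for (unsigned int i = 0; i < eh.e_phnum; i++) {
                if (fseek(f, eh.e_phoff + (long)i * eh.e_phentsize, SEEK_SET) ||
                    fread(&ph, sizeof(ph), 1, f) != 1) {
                        perror("program header");
                        return 1;
                }
                /* Executable LOAD segments carry the text's p_align */
                if (ph.p_type == PT_LOAD)
                        printf("LOAD vaddr=0x%llx align=0x%llx%s\n",
                               (unsigned long long)ph.p_vaddr,
                               (unsigned long long)ph.p_align,
                               (ph.p_flags & PF_X) ? " (exec)" : "");
        }
        fclose(f);
        return 0;
}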
The resulting performance improvement is +5.65% in TPS throughput and
-6.06% in average latency, using 16 local sysbench clients against mysqld
running on 32 cores with a 12GiB innodb_buffer_pool_size. Corresponding
iTLB effectiveness benefits can also be observed in the perf PMU metrics:
+-------------+--------------------------+------------+
| Benchmark   | Result                   | Improvement|
+=============+==========================+============+
| sysbench    | TPS                      |      5.65% |
|             | Latency (ms)             |     -6.06% |
+-------------+--------------------------+------------+
| perf PMU    | l1i_tlb (M/sec)          |     +1.11% |
|             | l2d_tlb (M/sec)          |    -13.01% |
|             | l1i_tlb_refill (K/sec)   |    -46.50% |
|             | itlb_walk (K/sec)        |    -64.03% |
|             | l2d_tlb_refill (K/sec)   |    -33.90% |
|             | l1d_tlb (M/sec)          |     +1.24% |
|             | l1d_tlb_refill (M/sec)   |     +2.23% |
|             | dtlb_walk (K/sec)        |    -20.69% |
|             | IPC                      |     +1.85% |
+-------------+--------------------------+------------+
> ---
> arch/arm64/include/asm/pgtable.h | 8 ++++++
> include/linux/pgtable.h | 11 ++++++++
> mm/filemap.c | 47 ++++++++++++++++++++++++++------
> 3 files changed, 57 insertions(+), 9 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 88db8a0c0b37..7a7dfdce14b8 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1643,6 +1643,14 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
> */
> #define arch_wants_old_prefaulted_pte cpu_has_hw_af
>
> +/*
> + * Request exec memory is read into pagecache in at least 64K folios. This size
> + * can be contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB
> + * entry), and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base
> + * pages are in use.
> + */
> +#define exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
> +
> static inline bool pud_sect_supported(void)
> {
> return PAGE_SIZE == SZ_4K;
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 0b6e1f781d86..e4a3895c043b 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -456,6 +456,17 @@ static inline bool arch_has_hw_pte_young(void)
> }
> #endif
>
> +#ifndef exec_folio_order
> +/*
> + * Returns preferred minimum folio order for executable file-backed memory. Must
> + * be in range [0, PMD_ORDER). Default to order-0.
> + */
> +static inline unsigned int exec_folio_order(void)
> +{
> + return 0;
> +}
> +#endif
> +
> #ifndef arch_check_zapped_pte
> static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
> pte_t pte)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4b5c8d69f04c..93fbc2ef232a 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3238,8 +3238,11 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> }
> #endif
>
> - /* If we don't want any read-ahead, don't bother */
> - if (vm_flags & VM_RAND_READ)
> + /*
> + * If we don't want any read-ahead, don't bother. VM_EXEC case below is
> + * already intended for random access.
> + */
> + if ((vm_flags & (VM_RAND_READ | VM_EXEC)) == VM_RAND_READ)
> return fpin;
> if (!ra->ra_pages)
> return fpin;
> @@ -3262,14 +3265,40 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> if (mmap_miss > MMAP_LOTSAMISS)
> return fpin;
>
> - /*
> - * mmap read-around
> - */
> fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> - ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
> - ra->size = ra->ra_pages;
> - ra->async_size = ra->ra_pages / 4;
> - ra->order = 0;
> + if (vm_flags & VM_EXEC) {
> + /*
> + * Allow arch to request a preferred minimum folio order for
> + * executable memory. This can often be beneficial to
> + * performance if (e.g.) arm64 can contpte-map the folio.
> + * Executable memory rarely benefits from readahead, due to its
> + * random access nature, so set async_size to 0.
> + *
> + * Limit to the boundaries of the VMA to avoid reading in any
> + * pad that might exist between sections, which would be a waste
> + * of memory.
> + */
> + struct vm_area_struct *vma = vmf->vma;
> + unsigned long start = vma->vm_pgoff;
> + unsigned long end = start + ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT);
> + unsigned long ra_end;
> +
> + ra->order = exec_folio_order();
> + ra->start = round_down(vmf->pgoff, 1UL << ra->order);
> + ra->start = max(ra->start, start);
> + ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order);
> + ra_end = min(ra_end, end);
> + ra->size = ra_end - ra->start;
> + ra->async_size = 0;
> + } else {
> + /*
> + * mmap read-around
> + */
> + ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
> + ra->size = ra->ra_pages;
> + ra->async_size = ra->ra_pages / 4;
> + ra->order = 0;
> + }
> ractl._index = ra->start;
> page_cache_ra_order(&ractl, ra);
> return fpin;
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v5 5/5] mm/filemap: Allow arch to request folio size for exec memory
2025-07-11 15:41 ` Tao Xu
@ 2025-07-14 8:19 ` Ryan Roberts
0 siblings, 0 replies; 11+ messages in thread
From: Ryan Roberts @ 2025-07-14 8:19 UTC (permalink / raw)
To: Tao Xu, Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On 11/07/2025 16:41, Tao Xu wrote:
> On 09/06/2025 10:27, Ryan Roberts wrote:
>> Change the readahead config so that if it is being requested for an
>> executable mapping, do a synchronous read into a set of folios with an
>> arch-specified order and in a naturally aligned manner. We no longer
>> center the read on the faulting page but simply align it down to the
>> previous natural boundary. Additionally, we don't bother with an
>> asynchronous part.
>>
>> On arm64 if memory is physically contiguous and naturally aligned to the
>> "contpte" size, we can use contpte mappings, which improves utilization
>> of the TLB. When paired with the "multi-size THP" feature, this works
>> well to reduce dTLB pressure. However iTLB pressure is still high due to
>> executable mappings having a low likelihood of being in the required
>> folio size and mapping alignment, even when the filesystem supports
>> readahead into large folios (e.g. XFS).
>>
>> The reason for the low likelihood is that the current readahead
>> algorithm starts with an order-0 folio and increases the folio order by
>> 2 every time the readahead mark is hit. But most executable memory tends
>> to be accessed randomly and so the readahead mark is rarely hit and most
>> executable folios remain order-0.
>>
>> So let's special-case the read(ahead) logic for executable mappings. The
>> trade-off is performance improvement (due to more efficient storage of
>> the translations in iTLB) vs potential for making reclaim more difficult
>> (due to the folios being larger so if a part of the folio is hot the
>> whole thing is considered hot). But executable memory is a small portion
>> of the overall system memory so I doubt this will even register from a
>> reclaim perspective.
>>
>> I've chosen 64K folio size for arm64 which benefits both the 4K and 16K
>> base page size configs. Crucially the same amount of data is still read
>> (usually 128K) so I'm not expecting any read amplification issues. I
>> don't anticipate any write amplification because text is always RO.
>>
>> Note that the text region of an ELF file could be populated into the
>> page cache for other reasons than taking a fault in a mmapped area. The
>> most common case is due to the loader read()ing the header which can be
>> shared with the beginning of text. So some text will still remain in
>> small folios, but this simple, best effort change provides good
>> performance improvements as is.
>>
>> Confine this special-case approach to the bounds of the VMA. This
>> prevents wasting memory for any padding that might exist in the file
>> between sections. Previously the padding would have been contained in
>> order-0 folios and would be easy to reclaim. But now it would be part of
>> a larger folio so more difficult to reclaim. Solve this by simply not
>> reading it into memory in the first place.
>>
>> Benchmarking
>> ============
>>
>> The below shows pgbench and redis benchmarks on Graviton3 arm64 system.
>>
>> First, confirmation that this patch causes more text to be contained in
>> 64K folios:
>>
>> +----------------------+---------------+---------------+---------------+
>> | File-backed folios by| system boot | pgbench | redis |
>> | size as percentage of+-------+-------+-------+-------+-------+-------+
>> | all mapped text mem |before | after |before | after |before | after |
>> +======================+=======+=======+=======+=======+=======+=======+
>> | base-page-4kB | 78% | 30% | 78% | 11% | 73% | 14% |
>> | thp-aligned-8kB | 1% | 0% | 0% | 0% | 1% | 0% |
>> | thp-aligned-16kB | 17% | 4% | 17% | 3% | 20% | 4% |
>> | thp-aligned-32kB | 1% | 1% | 1% | 2% | 1% | 1% |
>> | thp-aligned-64kB | 3% | 63% | 3% | 81% | 4% | 77% |
>> | thp-aligned-128kB | 0% | 1% | 1% | 1% | 1% | 2% |
>> | thp-unaligned-64kB | 0% | 0% | 0% | 1% | 0% | 1% |
>> | thp-unaligned-128kB | 0% | 1% | 0% | 0% | 0% | 0% |
>> | thp-partial | 0% | 0% | 0% | 1% | 0% | 1% |
>> +----------------------+-------+-------+-------+-------+-------+-------+
>> | cont-aligned-64kB | 4% | 65% | 4% | 83% | 6% | 79% |
>> +----------------------+-------+-------+-------+-------+-------+-------+
>>
>> The above shows that for both workloads (each isolated with cgroups) as
>> well as the general system state after boot, the amount of text backed
>> by 4K and 16K folios reduces and the amount backed by 64K folios
>> increases significantly. And the amount of text that is contpte-mapped
>> significantly increases (see last row).
>>
>> And this is reflected in performance improvement. "(I)" indicates a
>> statistically significant improvement. Note TPS and Reqs/sec are rates
>> so bigger is better, ms is time so smaller is better:
>>
>> +-------------+-------------------------------------------+------------+
>> | Benchmark | Result Class | Improvemnt |
>> +=============+===========================================+============+
>> | pts/pgbench | Scale: 1 Clients: 1 RO (TPS) | (I) 3.47% |
>> | | Scale: 1 Clients: 1 RO - Latency (ms) | -2.88% |
>> | | Scale: 1 Clients: 250 RO (TPS) | (I) 5.02% |
>> | | Scale: 1 Clients: 250 RO - Latency (ms) | (I) -4.79% |
>> | | Scale: 1 Clients: 1000 RO (TPS) | (I) 6.16% |
>> | | Scale: 1 Clients: 1000 RO - Latency (ms) | (I) -5.82% |
>> | | Scale: 100 Clients: 1 RO (TPS) | 2.51% |
>> | | Scale: 100 Clients: 1 RO - Latency (ms) | -3.51% |
>> | | Scale: 100 Clients: 250 RO (TPS) | (I) 4.75% |
>> | | Scale: 100 Clients: 250 RO - Latency (ms) | (I) -4.44% |
>> | | Scale: 100 Clients: 1000 RO (TPS) | (I) 6.34% |
>> | | Scale: 100 Clients: 1000 RO - Latency (ms)| (I) -5.95% |
>> +-------------+-------------------------------------------+------------+
>> | pts/redis | Test: GET Connections: 50 (Reqs/sec) | (I) 3.20% |
>> | | Test: GET Connections: 1000 (Reqs/sec) | (I) 2.55% |
>> | | Test: LPOP Connections: 50 (Reqs/sec) | (I) 4.59% |
>> | | Test: LPOP Connections: 1000 (Reqs/sec) | (I) 4.81% |
>> | | Test: LPUSH Connections: 50 (Reqs/sec) | (I) 5.31% |
>> | | Test: LPUSH Connections: 1000 (Reqs/sec) | (I) 4.36% |
>> | | Test: SADD Connections: 50 (Reqs/sec) | (I) 2.64% |
>> | | Test: SADD Connections: 1000 (Reqs/sec) | (I) 4.15% |
>> | | Test: SET Connections: 50 (Reqs/sec) | (I) 3.11% |
>> | | Test: SET Connections: 1000 (Reqs/sec) | (I) 3.36% |
>> +-------------+-------------------------------------------+------------+
>>
>> Reviewed-by: Jan Kara <jack@suse.cz>
>> Acked-by: Will Deacon <will@kernel.org>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>
> Tested-by: Tao Xu <tao.xu@arm.com>
Thanks for testing! Unfortunately I think you were a day late, though; this
patch is now in mm-stable, so it's too late to add the tag.
Thanks,
Ryan
>
> Observed similar performance optimization and iTLB benefits in mysql sysbench on
> Azure Cobalt-100 arm64 system.
>
> Below shows more .text sections are now backed by 64K folios for the 52MiB
> mysqld binary file in XFS, and more in 128K folios when increasing the p_align
> from default 64k to 2M in ELF header:
>
> +----------------------+-------+-------+-------+
> | | mysql |
> +----------------------+-------+-------+-------+
> | |before | after |
> +----------------------+-------+-------+-------+
> | | | p_align |
> | | | 64k | 2M |
> +----------------------+-------+-------+-------+
> | thp-aligned-8kB | 1% | 0% | 0% |
> | thp-aligned-16kB | 53% | 0% | 0% |
> | thp-aligned-32kB | 0% | 0% | 0% |
> | thp-aligned-64kB | 3% | 72% | 1% |
> | thp-aligned-128kB | 0% | 0% | 67% |
> | thp-partial | 0% | 0% | 5% |
> +----------------------+-------+-------+-------+
>
> The resulting performance improvment is +5.65% in TPS throughput and -6.06% in
> average latency, using 16 local sysbench clients to the mysqld running on 32
> cores and 12GiB innodb_buffer_pool_size. Corresponding iTLB effectiveness
> benefits can also be observed from perf PMU metrics:
>
> +-------------+--------------------------+------------+
> | Benchmark | Result | Improvemnt |
> +=============+==========================+============+
> | sysbench | TPS | 5.65% |
> | | Latency (ms)| -6.06% |
> +-------------+--------------------------+------------+
> | perf PMU | l1i_tlb (M/sec)| +1.11% |
> | | l2d_tlb (M/sec)| -13.01% |
> | | l1i_tlb_refill (K/sec)| -46.50% |
> | | itlb_walk (K/sec)| -64.03% |
> | | l2d_tlb_refill (K/sec)| -33.90% |
> | | l1d_tlb (M/sec)| +1.24% |
> | | l1d_tlb_refill (M/sec)| +2.23% |
> | | dtlb_walk (K/sec)| -20.69% |
> | | IPC | +1.85% |
> +-------------+--------------------------+------------+
>
>> ---
>> arch/arm64/include/asm/pgtable.h | 8 ++++++
>> include/linux/pgtable.h | 11 ++++++++
>> mm/filemap.c | 47 ++++++++++++++++++++++++++------
>> 3 files changed, 57 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 88db8a0c0b37..7a7dfdce14b8 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1643,6 +1643,14 @@ static inline void update_mmu_cache_range(struct
>> vm_fault *vmf,
>> */
>> #define arch_wants_old_prefaulted_pte cpu_has_hw_af
>> +/*
>> + * Request exec memory is read into pagecache in at least 64K folios. This size
>> + * can be contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB
>> + * entry), and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base
>> + * pages are in use.
>> + */
>> +#define exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
>> +
>> static inline bool pud_sect_supported(void)
>> {
>> return PAGE_SIZE == SZ_4K;
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 0b6e1f781d86..e4a3895c043b 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -456,6 +456,17 @@ static inline bool arch_has_hw_pte_young(void)
>> }
>> #endif
>> +#ifndef exec_folio_order
>> +/*
>> + * Returns preferred minimum folio order for executable file-backed memory. Must
>> + * be in range [0, PMD_ORDER). Default to order-0.
>> + */
>> +static inline unsigned int exec_folio_order(void)
>> +{
>> + return 0;
>> +}
>> +#endif
>> +
>> #ifndef arch_check_zapped_pte
>> static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
>> pte_t pte)
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 4b5c8d69f04c..93fbc2ef232a 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -3238,8 +3238,11 @@ static struct file *do_sync_mmap_readahead(struct
>> vm_fault *vmf)
>> }
>> #endif
>> - /* If we don't want any read-ahead, don't bother */
>> - if (vm_flags & VM_RAND_READ)
>> + /*
>> + * If we don't want any read-ahead, don't bother. VM_EXEC case below is
>> + * already intended for random access.
>> + */
>> + if ((vm_flags & (VM_RAND_READ | VM_EXEC)) == VM_RAND_READ)
>> return fpin;
>> if (!ra->ra_pages)
>> return fpin;
>> @@ -3262,14 +3265,40 @@ static struct file *do_sync_mmap_readahead(struct
>> vm_fault *vmf)
>> if (mmap_miss > MMAP_LOTSAMISS)
>> return fpin;
>> - /*
>> - * mmap read-around
>> - */
>> fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>> - ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
>> - ra->size = ra->ra_pages;
>> - ra->async_size = ra->ra_pages / 4;
>> - ra->order = 0;
>> + if (vm_flags & VM_EXEC) {
>> + /*
>> + * Allow arch to request a preferred minimum folio order for
>> + * executable memory. This can often be beneficial to
>> + * performance if (e.g.) arm64 can contpte-map the folio.
>> + * Executable memory rarely benefits from readahead, due to its
>> + * random access nature, so set async_size to 0.
>> + *
>> + * Limit to the boundaries of the VMA to avoid reading in any
>> + * pad that might exist between sections, which would be a waste
>> + * of memory.
>> + */
>> + struct vm_area_struct *vma = vmf->vma;
>> + unsigned long start = vma->vm_pgoff;
>> + unsigned long end = start + ((vma->vm_end - vma->vm_start) >>
>> PAGE_SHIFT);
>> + unsigned long ra_end;
>> +
>> + ra->order = exec_folio_order();
>> + ra->start = round_down(vmf->pgoff, 1UL << ra->order);
>> + ra->start = max(ra->start, start);
>> + ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order);
>> + ra_end = min(ra_end, end);
>> + ra->size = ra_end - ra->start;
>> + ra->async_size = 0;
>> + } else {
>> + /*
>> + * mmap read-around
>> + */
>> + ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
>> + ra->size = ra->ra_pages;
>> + ra->async_size = ra->ra_pages / 4;
>> + ra->order = 0;
>> + }
>> ractl._index = ra->start;
>> page_cache_ra_order(&ractl, ra);
>> return fpin;
>
^ permalink raw reply [flat|nested] 11+ messages in thread