* [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
2025-04-30 14:59 [RFC PATCH v4 0/5] Readahead tweaks for larger folios Ryan Roberts
@ 2025-04-30 14:59 ` Ryan Roberts
2025-05-05 8:49 ` Jan Kara
` (4 more replies)
2025-04-30 14:59 ` [RFC PATCH v4 2/5] mm/readahead: Terminate async readahead on natural boundary Ryan Roberts
` (4 subsequent siblings)
5 siblings, 5 replies; 40+ messages in thread
From: Ryan Roberts @ 2025-04-30 14:59 UTC (permalink / raw)
To: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-fsdevel,
linux-mm
page_cache_ra_order() takes a parameter called new_order, which is
intended to express the preferred order of the folios that will be
allocated for the readahead operation. Most callers indeed call this
with their preferred new order. But page_cache_async_ra() calls it with
the preferred order of the previous readahead request (actually the
order of the folio that had the readahead marker, which may be smaller
when alignment comes into play).
And despite the parameter name, page_cache_ra_order() always treats it
as the old order, adding 2 to it on entry. As a result, a cold readahead
always starts with order-2 folios.
Let's fix this behaviour by always passing in the *new* order.
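To make the intended ramp concrete, here is a small userspace toy model
(illustration only, not kernel code) of how the preferred order now
evolves across successive async readahead rounds, assuming a 4K base
page, no minimum folio order and ra_pages capped at 32 pages (order-5):

    #include <stdio.h>

    /* Stand-in for the clamping that page_cache_ra_order() performs. */
    static unsigned int clamp_order(unsigned int order,
                                    unsigned int max_order,
                                    unsigned int ra_size_order)
    {
        if (order > max_order)
            order = max_order;
        if (order > ra_size_order)
            order = ra_size_order;
        return order;
    }

    int main(void)
    {
        unsigned int order = 0; /* cold cache: start at order-0 */

        for (int round = 0; round < 6; round++) {
            printf("round %d: order-%u folios\n", round, order);
            /* +2 per readahead marker, as page_cache_async_ra() now does */
            order = clamp_order(order + 2, 9, 5);
        }
        return 0;
    }

This prints order 0, 2, 4 and then 5 for every later round, matching the
ramp shown in the tables below.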
Worked example:
Prior to the change, mmapping an 8MB file and touching each page
sequentially resulted in the following, where we start with order-2
folios for the first 128K, then ramp up to order-4 for the next 128K,
then get clamped to order-5 for the rest of the file because ra_pages is
limited to 128K:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
----- ---------- ---------- --------- ------- ------- ----- -----
FOLIO 0x00000000 0x00004000 16384 0 4 4 2
FOLIO 0x00004000 0x00008000 16384 4 8 4 2
FOLIO 0x00008000 0x0000c000 16384 8 12 4 2
FOLIO 0x0000c000 0x00010000 16384 12 16 4 2
FOLIO 0x00010000 0x00014000 16384 16 20 4 2
FOLIO 0x00014000 0x00018000 16384 20 24 4 2
FOLIO 0x00018000 0x0001c000 16384 24 28 4 2
FOLIO 0x0001c000 0x00020000 16384 28 32 4 2
FOLIO 0x00020000 0x00030000 65536 32 48 16 4
FOLIO 0x00030000 0x00040000 65536 48 64 16 4
FOLIO 0x00040000 0x00060000 131072 64 96 32 5
FOLIO 0x00060000 0x00080000 131072 96 128 32 5
FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
...
After the change, the same operation results in the first 128K being
order-0, then we start ramping up to order-2, -4, and finally get
clamped at order-5:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
----- ---------- ---------- --------- ------- ------- ----- -----
FOLIO 0x00000000 0x00001000 4096 0 1 1 0
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00024000 16384 32 36 4 2
FOLIO 0x00024000 0x00028000 16384 36 40 4 2
FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
FOLIO 0x00030000 0x00034000 16384 48 52 4 2
FOLIO 0x00034000 0x00038000 16384 52 56 4 2
FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
FOLIO 0x00040000 0x00050000 65536 64 80 16 4
FOLIO 0x00050000 0x00060000 65536 80 96 16 4
FOLIO 0x00060000 0x00080000 131072 96 128 32 5
FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
...
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
mm/readahead.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index 6a4e96b69702..8bb316f5a842 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -479,9 +479,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
limit = min(limit, index + ra->size - 1);
- if (new_order < mapping_max_folio_order(mapping))
- new_order += 2;
-
new_order = min(mapping_max_folio_order(mapping), new_order);
new_order = min_t(unsigned int, new_order, ilog2(ra->size));
new_order = max(new_order, min_order);
@@ -683,6 +680,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
ra->size = get_next_ra_size(ra, max_pages);
ra->async_size = ra->size;
readit:
+ order += 2;
ractl->_index = ra->start;
page_cache_ra_order(ractl, ra, order);
}
--
2.43.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
2025-04-30 14:59 ` [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order() Ryan Roberts
@ 2025-05-05 8:49 ` Jan Kara
2025-05-05 9:51 ` David Hildenbrand
` (3 subsequent siblings)
4 siblings, 0 replies; 40+ messages in thread
From: Jan Kara @ 2025-05-05 8:49 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan,
linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On Wed 30-04-25 15:59:14, Ryan Roberts wrote:
> page_cache_ra_order() takes a parameter called new_order, which is
> intended to express the preferred order of the folios that will be
> allocated for the readahead operation. Most callers indeed call this
> with their preferred new order. But page_cache_async_ra() calls it with
> the preferred order of the previous readahead request (actually the
> order of the folio that had the readahead marker, which may be smaller
> when alignment comes into play).
>
> And despite the parameter name, page_cache_ra_order() always treats it
> at the old order, adding 2 to it on entry. As a result, a cold readahead
> always starts with order-2 folios.
>
> Let's fix this behaviour by always passing in the *new* order.
>
> Worked example:
>
> Prior to the change, mmaping an 8MB file and touching each page
> sequentially, resulted in the following, where we start with order-2
> folios for the first 128K then ramp up to order-4 for the next 128K,
> then get clamped to order-5 for the rest of the file because pa_pages is
> limited to 128K:
>
> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
> ----- ---------- ---------- --------- ------- ------- ----- -----
> FOLIO 0x00000000 0x00004000 16384 0 4 4 2
> FOLIO 0x00004000 0x00008000 16384 4 8 4 2
> FOLIO 0x00008000 0x0000c000 16384 8 12 4 2
> FOLIO 0x0000c000 0x00010000 16384 12 16 4 2
> FOLIO 0x00010000 0x00014000 16384 16 20 4 2
> FOLIO 0x00014000 0x00018000 16384 20 24 4 2
> FOLIO 0x00018000 0x0001c000 16384 24 28 4 2
> FOLIO 0x0001c000 0x00020000 16384 28 32 4 2
> FOLIO 0x00020000 0x00030000 65536 32 48 16 4
> FOLIO 0x00030000 0x00040000 65536 48 64 16 4
> FOLIO 0x00040000 0x00060000 131072 64 96 32 5
> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
> ...
>
> After the change, the same operation results in the first 128K being
> order-0, then we start ramping up to order-2, -4, and finally get
> clamped at order-5:
>
> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
> ----- ---------- ---------- --------- ------- ------- ----- -----
> FOLIO 0x00000000 0x00001000 4096 0 1 1 0
> FOLIO 0x00001000 0x00002000 4096 1 2 1 0
> FOLIO 0x00002000 0x00003000 4096 2 3 1 0
> FOLIO 0x00003000 0x00004000 4096 3 4 1 0
> FOLIO 0x00004000 0x00005000 4096 4 5 1 0
> FOLIO 0x00005000 0x00006000 4096 5 6 1 0
> FOLIO 0x00006000 0x00007000 4096 6 7 1 0
> FOLIO 0x00007000 0x00008000 4096 7 8 1 0
> FOLIO 0x00008000 0x00009000 4096 8 9 1 0
> FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
> FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
> FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
> FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
> FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
> FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
> FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
> FOLIO 0x00010000 0x00011000 4096 16 17 1 0
> FOLIO 0x00011000 0x00012000 4096 17 18 1 0
> FOLIO 0x00012000 0x00013000 4096 18 19 1 0
> FOLIO 0x00013000 0x00014000 4096 19 20 1 0
> FOLIO 0x00014000 0x00015000 4096 20 21 1 0
> FOLIO 0x00015000 0x00016000 4096 21 22 1 0
> FOLIO 0x00016000 0x00017000 4096 22 23 1 0
> FOLIO 0x00017000 0x00018000 4096 23 24 1 0
> FOLIO 0x00018000 0x00019000 4096 24 25 1 0
> FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
> FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
> FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
> FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
> FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
> FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
> FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
> FOLIO 0x00020000 0x00024000 16384 32 36 4 2
> FOLIO 0x00024000 0x00028000 16384 36 40 4 2
> FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
> FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
> FOLIO 0x00030000 0x00034000 16384 48 52 4 2
> FOLIO 0x00034000 0x00038000 16384 52 56 4 2
> FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
> FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
> FOLIO 0x00040000 0x00050000 65536 64 80 16 4
> FOLIO 0x00050000 0x00060000 65536 80 96 16 4
> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
> FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
> ...
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Makes sense and looks good to me. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> mm/readahead.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 6a4e96b69702..8bb316f5a842 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -479,9 +479,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
>
> limit = min(limit, index + ra->size - 1);
>
> - if (new_order < mapping_max_folio_order(mapping))
> - new_order += 2;
> -
> new_order = min(mapping_max_folio_order(mapping), new_order);
> new_order = min_t(unsigned int, new_order, ilog2(ra->size));
> new_order = max(new_order, min_order);
> @@ -683,6 +680,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
> ra->size = get_next_ra_size(ra, max_pages);
> ra->async_size = ra->size;
> readit:
> + order += 2;
> ractl->_index = ra->start;
> page_cache_ra_order(ractl, ra, order);
> }
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
2025-04-30 14:59 ` [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order() Ryan Roberts
2025-05-05 8:49 ` Jan Kara
@ 2025-05-05 9:51 ` David Hildenbrand
2025-05-05 10:09 ` Jan Kara
2025-05-05 10:09 ` Anshuman Khandual
` (2 subsequent siblings)
4 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2025-05-05 9:51 UTC (permalink / raw)
To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
Alexander Viro, Christian Brauner, Jan Kara, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On 30.04.25 16:59, Ryan Roberts wrote:
> page_cache_ra_order() takes a parameter called new_order, which is
> intended to express the preferred order of the folios that will be
> allocated for the readahead operation. Most callers indeed call this
> with their preferred new order. But page_cache_async_ra() calls it with
> the preferred order of the previous readahead request (actually the
> order of the folio that had the readahead marker, which may be smaller
> when alignment comes into play).
>
> And despite the parameter name, page_cache_ra_order() always treats it
> at the old order, adding 2 to it on entry. As a result, a cold readahead
> always starts with order-2 folios.
>
> Let's fix this behaviour by always passing in the *new* order.
>
> Worked example:
>
> Prior to the change, mmaping an 8MB file and touching each page
> sequentially, resulted in the following, where we start with order-2
> folios for the first 128K then ramp up to order-4 for the next 128K,
> then get clamped to order-5 for the rest of the file because pa_pages is
> limited to 128K:
>
> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
> ----- ---------- ---------- --------- ------- ------- ----- -----
> FOLIO 0x00000000 0x00004000 16384 0 4 4 2
> FOLIO 0x00004000 0x00008000 16384 4 8 4 2
> FOLIO 0x00008000 0x0000c000 16384 8 12 4 2
> FOLIO 0x0000c000 0x00010000 16384 12 16 4 2
> FOLIO 0x00010000 0x00014000 16384 16 20 4 2
> FOLIO 0x00014000 0x00018000 16384 20 24 4 2
> FOLIO 0x00018000 0x0001c000 16384 24 28 4 2
> FOLIO 0x0001c000 0x00020000 16384 28 32 4 2
> FOLIO 0x00020000 0x00030000 65536 32 48 16 4
> FOLIO 0x00030000 0x00040000 65536 48 64 16 4
> FOLIO 0x00040000 0x00060000 131072 64 96 32 5
> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
Interesting, I would have thought we'd ramp up earlier.
> ...
>
> After the change, the same operation results in the first 128K being
> order-0, then we start ramping up to order-2, -4, and finally get
> clamped at order-5:
>
> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
> ----- ---------- ---------- --------- ------- ------- ----- -----
> FOLIO 0x00000000 0x00001000 4096 0 1 1 0
> FOLIO 0x00001000 0x00002000 4096 1 2 1 0
> FOLIO 0x00002000 0x00003000 4096 2 3 1 0
> FOLIO 0x00003000 0x00004000 4096 3 4 1 0
> FOLIO 0x00004000 0x00005000 4096 4 5 1 0
> FOLIO 0x00005000 0x00006000 4096 5 6 1 0
> FOLIO 0x00006000 0x00007000 4096 6 7 1 0
> FOLIO 0x00007000 0x00008000 4096 7 8 1 0
> FOLIO 0x00008000 0x00009000 4096 8 9 1 0
> FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
> FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
> FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
> FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
> FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
> FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
> FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
> FOLIO 0x00010000 0x00011000 4096 16 17 1 0
> FOLIO 0x00011000 0x00012000 4096 17 18 1 0
> FOLIO 0x00012000 0x00013000 4096 18 19 1 0
> FOLIO 0x00013000 0x00014000 4096 19 20 1 0
> FOLIO 0x00014000 0x00015000 4096 20 21 1 0
> FOLIO 0x00015000 0x00016000 4096 21 22 1 0
> FOLIO 0x00016000 0x00017000 4096 22 23 1 0
> FOLIO 0x00017000 0x00018000 4096 23 24 1 0
> FOLIO 0x00018000 0x00019000 4096 24 25 1 0
> FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
> FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
> FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
> FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
> FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
> FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
> FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
> FOLIO 0x00020000 0x00024000 16384 32 36 4 2
> FOLIO 0x00024000 0x00028000 16384 36 40 4 2
> FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
> FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
> FOLIO 0x00030000 0x00034000 16384 48 52 4 2
> FOLIO 0x00034000 0x00038000 16384 52 56 4 2
> FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
> FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
> FOLIO 0x00040000 0x00050000 65536 64 80 16 4
> FOLIO 0x00050000 0x00060000 65536 80 96 16 4
> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
> FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
Similar here, do you know why we don't ramp up earlier? Allocating that
many order-0 + order-2 pages looks a bit suboptimal to me for a
sequential read.
I wonder if your change will have a measurable downside on sequential
read. Anyhow, I think it was already not behaving how I would have
expected it ... :)
Acked-by: David Hildenbrand <david@redhat.com>
> ...
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> mm/readahead.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 6a4e96b69702..8bb316f5a842 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -479,9 +479,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
>
> limit = min(limit, index + ra->size - 1);
>
> - if (new_order < mapping_max_folio_order(mapping))
> - new_order += 2;
> -
> new_order = min(mapping_max_folio_order(mapping), new_order);
> new_order = min_t(unsigned int, new_order, ilog2(ra->size));
> new_order = max(new_order, min_order);
> @@ -683,6 +680,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
> ra->size = get_next_ra_size(ra, max_pages);
> ra->async_size = ra->size;
> readit:
> + order += 2;
> ractl->_index = ra->start;
> page_cache_ra_order(ractl, ra, order);
> }
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
2025-05-05 9:51 ` David Hildenbrand
@ 2025-05-05 10:09 ` Jan Kara
2025-05-05 10:25 ` David Hildenbrand
0 siblings, 1 reply; 40+ messages in thread
From: Jan Kara @ 2025-05-05 10:09 UTC (permalink / raw)
To: David Hildenbrand
Cc: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
Alexander Viro, Christian Brauner, Jan Kara, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan,
linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On Mon 05-05-25 11:51:43, David Hildenbrand wrote:
> On 30.04.25 16:59, Ryan Roberts wrote:
> > page_cache_ra_order() takes a parameter called new_order, which is
> > intended to express the preferred order of the folios that will be
> > allocated for the readahead operation. Most callers indeed call this
> > with their preferred new order. But page_cache_async_ra() calls it with
> > the preferred order of the previous readahead request (actually the
> > order of the folio that had the readahead marker, which may be smaller
> > when alignment comes into play).
> >
> > And despite the parameter name, page_cache_ra_order() always treats it
> > at the old order, adding 2 to it on entry. As a result, a cold readahead
> > always starts with order-2 folios.
> >
> > Let's fix this behaviour by always passing in the *new* order.
> >
> > Worked example:
> >
> > Prior to the change, mmaping an 8MB file and touching each page
> > sequentially, resulted in the following, where we start with order-2
> > folios for the first 128K then ramp up to order-4 for the next 128K,
> > then get clamped to order-5 for the rest of the file because pa_pages is
> > limited to 128K:
> >
> > TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
> > ----- ---------- ---------- --------- ------- ------- ----- -----
> > FOLIO 0x00000000 0x00004000 16384 0 4 4 2
> > FOLIO 0x00004000 0x00008000 16384 4 8 4 2
> > FOLIO 0x00008000 0x0000c000 16384 8 12 4 2
> > FOLIO 0x0000c000 0x00010000 16384 12 16 4 2
> > FOLIO 0x00010000 0x00014000 16384 16 20 4 2
> > FOLIO 0x00014000 0x00018000 16384 20 24 4 2
> > FOLIO 0x00018000 0x0001c000 16384 24 28 4 2
> > FOLIO 0x0001c000 0x00020000 16384 28 32 4 2
> > FOLIO 0x00020000 0x00030000 65536 32 48 16 4
> > FOLIO 0x00030000 0x00040000 65536 48 64 16 4
> > FOLIO 0x00040000 0x00060000 131072 64 96 32 5
> > FOLIO 0x00060000 0x00080000 131072 96 128 32 5
> > FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
> > FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
>
> Interesting, I would have thought we'd ramp up earlier.
>
> > ...
> >
> > After the change, the same operation results in the first 128K being
> > order-0, then we start ramping up to order-2, -4, and finally get
> > clamped at order-5:
> >
> > TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
> > ----- ---------- ---------- --------- ------- ------- ----- -----
> > FOLIO 0x00000000 0x00001000 4096 0 1 1 0
> > FOLIO 0x00001000 0x00002000 4096 1 2 1 0
> > FOLIO 0x00002000 0x00003000 4096 2 3 1 0
> > FOLIO 0x00003000 0x00004000 4096 3 4 1 0
> > FOLIO 0x00004000 0x00005000 4096 4 5 1 0
> > FOLIO 0x00005000 0x00006000 4096 5 6 1 0
> > FOLIO 0x00006000 0x00007000 4096 6 7 1 0
> > FOLIO 0x00007000 0x00008000 4096 7 8 1 0
> > FOLIO 0x00008000 0x00009000 4096 8 9 1 0
> > FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
> > FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
> > FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
> > FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
> > FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
> > FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
> > FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
> > FOLIO 0x00010000 0x00011000 4096 16 17 1 0
> > FOLIO 0x00011000 0x00012000 4096 17 18 1 0
> > FOLIO 0x00012000 0x00013000 4096 18 19 1 0
> > FOLIO 0x00013000 0x00014000 4096 19 20 1 0
> > FOLIO 0x00014000 0x00015000 4096 20 21 1 0
> > FOLIO 0x00015000 0x00016000 4096 21 22 1 0
> > FOLIO 0x00016000 0x00017000 4096 22 23 1 0
> > FOLIO 0x00017000 0x00018000 4096 23 24 1 0
> > FOLIO 0x00018000 0x00019000 4096 24 25 1 0
> > FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
> > FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
> > FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
> > FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
> > FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
> > FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
> > FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
> > FOLIO 0x00020000 0x00024000 16384 32 36 4 2
> > FOLIO 0x00024000 0x00028000 16384 36 40 4 2
> > FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
> > FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
> > FOLIO 0x00030000 0x00034000 16384 48 52 4 2
> > FOLIO 0x00034000 0x00038000 16384 52 56 4 2
> > FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
> > FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
> > FOLIO 0x00040000 0x00050000 65536 64 80 16 4
> > FOLIO 0x00050000 0x00060000 65536 80 96 16 4
> > FOLIO 0x00060000 0x00080000 131072 96 128 32 5
> > FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
> > FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
> > FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
>
> Similar here, do you know why we don't ramp up earlier. Allocating that many
> order-0 + order-2 pages looks a bit suboptimal to me for a sequential read.
Note that this is reading through mmap using the mmap readahead code. If
you use standard read(2), the readahead window starts small as well and
ramps up along with the desired order so we don't allocate that many small
order pages in that case.
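(For context, the read(2) window growth being described has roughly the
following shape; this is a simplified sketch in the spirit of
get_next_ra_size(), not necessarily the exact current kernel code:

    static unsigned long next_ra_size(unsigned long cur, unsigned long max)
    {
        if (cur < max / 16)
            return 4 * cur;     /* quadruple while the window is small */
        if (cur <= max / 2)
            return 2 * cur;     /* then double */
        return max;             /* finally cap at ra_pages */
    }

so the window size and the preferred folio order ramp up together.)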
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
2025-05-05 10:09 ` Jan Kara
@ 2025-05-05 10:25 ` David Hildenbrand
2025-05-05 12:51 ` Ryan Roberts
0 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2025-05-05 10:25 UTC (permalink / raw)
To: Jan Kara
Cc: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
Alexander Viro, Christian Brauner, Dave Chinner, Catalin Marinas,
Will Deacon, Kalesh Singh, Zi Yan, linux-arm-kernel, linux-kernel,
linux-fsdevel, linux-mm
On 05.05.25 12:09, Jan Kara wrote:
> On Mon 05-05-25 11:51:43, David Hildenbrand wrote:
>> On 30.04.25 16:59, Ryan Roberts wrote:
>>> page_cache_ra_order() takes a parameter called new_order, which is
>>> intended to express the preferred order of the folios that will be
>>> allocated for the readahead operation. Most callers indeed call this
>>> with their preferred new order. But page_cache_async_ra() calls it with
>>> the preferred order of the previous readahead request (actually the
>>> order of the folio that had the readahead marker, which may be smaller
>>> when alignment comes into play).
>>>
>>> And despite the parameter name, page_cache_ra_order() always treats it
>>> at the old order, adding 2 to it on entry. As a result, a cold readahead
>>> always starts with order-2 folios.
>>>
>>> Let's fix this behaviour by always passing in the *new* order.
>>>
>>> Worked example:
>>>
>>> Prior to the change, mmaping an 8MB file and touching each page
>>> sequentially, resulted in the following, where we start with order-2
>>> folios for the first 128K then ramp up to order-4 for the next 128K,
>>> then get clamped to order-5 for the rest of the file because pa_pages is
>>> limited to 128K:
>>>
>>> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
>>> ----- ---------- ---------- --------- ------- ------- ----- -----
>>> FOLIO 0x00000000 0x00004000 16384 0 4 4 2
>>> FOLIO 0x00004000 0x00008000 16384 4 8 4 2
>>> FOLIO 0x00008000 0x0000c000 16384 8 12 4 2
>>> FOLIO 0x0000c000 0x00010000 16384 12 16 4 2
>>> FOLIO 0x00010000 0x00014000 16384 16 20 4 2
>>> FOLIO 0x00014000 0x00018000 16384 20 24 4 2
>>> FOLIO 0x00018000 0x0001c000 16384 24 28 4 2
>>> FOLIO 0x0001c000 0x00020000 16384 28 32 4 2
>>> FOLIO 0x00020000 0x00030000 65536 32 48 16 4
>>> FOLIO 0x00030000 0x00040000 65536 48 64 16 4
>>> FOLIO 0x00040000 0x00060000 131072 64 96 32 5
>>> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
>>> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
>>> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
>>
>> Interesting, I would have thought we'd ramp up earlier.
>>
>>> ...
>>>
>>> After the change, the same operation results in the first 128K being
>>> order-0, then we start ramping up to order-2, -4, and finally get
>>> clamped at order-5:
>>>
>>> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
>>> ----- ---------- ---------- --------- ------- ------- ----- -----
>>> FOLIO 0x00000000 0x00001000 4096 0 1 1 0
>>> FOLIO 0x00001000 0x00002000 4096 1 2 1 0
>>> FOLIO 0x00002000 0x00003000 4096 2 3 1 0
>>> FOLIO 0x00003000 0x00004000 4096 3 4 1 0
>>> FOLIO 0x00004000 0x00005000 4096 4 5 1 0
>>> FOLIO 0x00005000 0x00006000 4096 5 6 1 0
>>> FOLIO 0x00006000 0x00007000 4096 6 7 1 0
>>> FOLIO 0x00007000 0x00008000 4096 7 8 1 0
>>> FOLIO 0x00008000 0x00009000 4096 8 9 1 0
>>> FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
>>> FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
>>> FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
>>> FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
>>> FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
>>> FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
>>> FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
>>> FOLIO 0x00010000 0x00011000 4096 16 17 1 0
>>> FOLIO 0x00011000 0x00012000 4096 17 18 1 0
>>> FOLIO 0x00012000 0x00013000 4096 18 19 1 0
>>> FOLIO 0x00013000 0x00014000 4096 19 20 1 0
>>> FOLIO 0x00014000 0x00015000 4096 20 21 1 0
>>> FOLIO 0x00015000 0x00016000 4096 21 22 1 0
>>> FOLIO 0x00016000 0x00017000 4096 22 23 1 0
>>> FOLIO 0x00017000 0x00018000 4096 23 24 1 0
>>> FOLIO 0x00018000 0x00019000 4096 24 25 1 0
>>> FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
>>> FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
>>> FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
>>> FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
>>> FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
>>> FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
>>> FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
>>> FOLIO 0x00020000 0x00024000 16384 32 36 4 2
>>> FOLIO 0x00024000 0x00028000 16384 36 40 4 2
>>> FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
>>> FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
>>> FOLIO 0x00030000 0x00034000 16384 48 52 4 2
>>> FOLIO 0x00034000 0x00038000 16384 52 56 4 2
>>> FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
>>> FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
>>> FOLIO 0x00040000 0x00050000 65536 64 80 16 4
>>> FOLIO 0x00050000 0x00060000 65536 80 96 16 4
>>> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
>>> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
>>> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
>>> FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
>>
>> Similar here, do you know why we don't ramp up earlier. Allocating that many
>> order-0 + order-2 pages looks a bit suboptimal to me for a sequential read.
>
> Note that this is reading through mmap using the mmap readahead code. If
> you use standard read(2), the readahead window starts small as well and
> ramps us along with the desired order so we don't allocate that many small
> order pages in that case.
Ah, thanks for that information! :)
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
2025-05-05 10:25 ` David Hildenbrand
@ 2025-05-05 12:51 ` Ryan Roberts
2025-05-05 16:14 ` Jan Kara
0 siblings, 1 reply; 40+ messages in thread
From: Ryan Roberts @ 2025-05-05 12:51 UTC (permalink / raw)
To: David Hildenbrand, Jan Kara
Cc: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Dave Chinner, Catalin Marinas, Will Deacon,
Kalesh Singh, Zi Yan, linux-arm-kernel, linux-kernel,
linux-fsdevel, linux-mm
On 05/05/2025 11:25, David Hildenbrand wrote:
> On 05.05.25 12:09, Jan Kara wrote:
>> On Mon 05-05-25 11:51:43, David Hildenbrand wrote:
>>> On 30.04.25 16:59, Ryan Roberts wrote:
>>>> page_cache_ra_order() takes a parameter called new_order, which is
>>>> intended to express the preferred order of the folios that will be
>>>> allocated for the readahead operation. Most callers indeed call this
>>>> with their preferred new order. But page_cache_async_ra() calls it with
>>>> the preferred order of the previous readahead request (actually the
>>>> order of the folio that had the readahead marker, which may be smaller
>>>> when alignment comes into play).
>>>>
>>>> And despite the parameter name, page_cache_ra_order() always treats it
>>>> at the old order, adding 2 to it on entry. As a result, a cold readahead
>>>> always starts with order-2 folios.
>>>>
>>>> Let's fix this behaviour by always passing in the *new* order.
>>>>
>>>> Worked example:
>>>>
>>>> Prior to the change, mmaping an 8MB file and touching each page
>>>> sequentially, resulted in the following, where we start with order-2
>>>> folios for the first 128K then ramp up to order-4 for the next 128K,
>>>> then get clamped to order-5 for the rest of the file because pa_pages is
>>>> limited to 128K:
>>>>
>>>> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
>>>> ----- ---------- ---------- --------- ------- ------- ----- -----
>>>> FOLIO 0x00000000 0x00004000 16384 0 4 4 2
>>>> FOLIO 0x00004000 0x00008000 16384 4 8 4 2
>>>> FOLIO 0x00008000 0x0000c000 16384 8 12 4 2
>>>> FOLIO 0x0000c000 0x00010000 16384 12 16 4 2
>>>> FOLIO 0x00010000 0x00014000 16384 16 20 4 2
>>>> FOLIO 0x00014000 0x00018000 16384 20 24 4 2
>>>> FOLIO 0x00018000 0x0001c000 16384 24 28 4 2
>>>> FOLIO 0x0001c000 0x00020000 16384 28 32 4 2
>>>> FOLIO 0x00020000 0x00030000 65536 32 48 16 4
>>>> FOLIO 0x00030000 0x00040000 65536 48 64 16 4
>>>> FOLIO 0x00040000 0x00060000 131072 64 96 32 5
>>>> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
>>>> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
>>>> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
>>>
>>> Interesting, I would have thought we'd ramp up earlier.
>>>
>>>> ...
>>>>
>>>> After the change, the same operation results in the first 128K being
>>>> order-0, then we start ramping up to order-2, -4, and finally get
>>>> clamped at order-5:
>>>>
>>>> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
>>>> ----- ---------- ---------- --------- ------- ------- ----- -----
>>>> FOLIO 0x00000000 0x00001000 4096 0 1 1 0
>>>> FOLIO 0x00001000 0x00002000 4096 1 2 1 0
>>>> FOLIO 0x00002000 0x00003000 4096 2 3 1 0
>>>> FOLIO 0x00003000 0x00004000 4096 3 4 1 0
>>>> FOLIO 0x00004000 0x00005000 4096 4 5 1 0
>>>> FOLIO 0x00005000 0x00006000 4096 5 6 1 0
>>>> FOLIO 0x00006000 0x00007000 4096 6 7 1 0
>>>> FOLIO 0x00007000 0x00008000 4096 7 8 1 0
>>>> FOLIO 0x00008000 0x00009000 4096 8 9 1 0
>>>> FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
>>>> FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
>>>> FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
>>>> FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
>>>> FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
>>>> FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
>>>> FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
>>>> FOLIO 0x00010000 0x00011000 4096 16 17 1 0
>>>> FOLIO 0x00011000 0x00012000 4096 17 18 1 0
>>>> FOLIO 0x00012000 0x00013000 4096 18 19 1 0
>>>> FOLIO 0x00013000 0x00014000 4096 19 20 1 0
>>>> FOLIO 0x00014000 0x00015000 4096 20 21 1 0
>>>> FOLIO 0x00015000 0x00016000 4096 21 22 1 0
>>>> FOLIO 0x00016000 0x00017000 4096 22 23 1 0
>>>> FOLIO 0x00017000 0x00018000 4096 23 24 1 0
>>>> FOLIO 0x00018000 0x00019000 4096 24 25 1 0
>>>> FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
>>>> FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
>>>> FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
>>>> FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
>>>> FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
>>>> FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
>>>> FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
>>>> FOLIO 0x00020000 0x00024000 16384 32 36 4 2
>>>> FOLIO 0x00024000 0x00028000 16384 36 40 4 2
>>>> FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
>>>> FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
>>>> FOLIO 0x00030000 0x00034000 16384 48 52 4 2
>>>> FOLIO 0x00034000 0x00038000 16384 52 56 4 2
>>>> FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
>>>> FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
>>>> FOLIO 0x00040000 0x00050000 65536 64 80 16 4
>>>> FOLIO 0x00050000 0x00060000 65536 80 96 16 4
>>>> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
>>>> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
>>>> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
>>>> FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
>>>
>>> Similar here, do you know why we don't ramp up earlier. Allocating that many
>>> order-0 + order-2 pages looks a bit suboptimal to me for a sequential read.
>>
>> Note that this is reading through mmap using the mmap readahead code. If
>> you use standard read(2), the readahead window starts small as well and
>> ramps us along with the desired order so we don't allocate that many small
>> order pages in that case.
That does raise an interesting question though; why do we use a fixed size
window for mmap? It feels like we could start with a smaller window and ramp it
up as order ramps up too, capped to the end of the vma.
Although perhaps that is an investigation for another day... My main motivation
here was to be consistent about what page_cache_ra_order()'s new_order means,
and to actually implement the algorithm that was originally intended - start from 0
and ramp up +2 on each readahead marker.
>
> Ah, thanks for that information! :)
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
2025-05-05 12:51 ` Ryan Roberts
@ 2025-05-05 16:14 ` Jan Kara
0 siblings, 0 replies; 40+ messages in thread
From: Jan Kara @ 2025-05-05 16:14 UTC (permalink / raw)
To: Ryan Roberts
Cc: David Hildenbrand, Jan Kara, Andrew Morton,
Matthew Wilcox (Oracle), Alexander Viro, Christian Brauner,
Dave Chinner, Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan,
linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On Mon 05-05-25 13:51:48, Ryan Roberts wrote:
> On 05/05/2025 11:25, David Hildenbrand wrote:
> > On 05.05.25 12:09, Jan Kara wrote:
> >> On Mon 05-05-25 11:51:43, David Hildenbrand wrote:
> >>> On 30.04.25 16:59, Ryan Roberts wrote:
> >>>> page_cache_ra_order() takes a parameter called new_order, which is
> >>>> intended to express the preferred order of the folios that will be
> >>>> allocated for the readahead operation. Most callers indeed call this
> >>>> with their preferred new order. But page_cache_async_ra() calls it with
> >>>> the preferred order of the previous readahead request (actually the
> >>>> order of the folio that had the readahead marker, which may be smaller
> >>>> when alignment comes into play).
> >>>>
> >>>> And despite the parameter name, page_cache_ra_order() always treats it
> >>>> at the old order, adding 2 to it on entry. As a result, a cold readahead
> >>>> always starts with order-2 folios.
> >>>>
> >>>> Let's fix this behaviour by always passing in the *new* order.
> >>>>
> >>>> Worked example:
> >>>>
> >>>> Prior to the change, mmaping an 8MB file and touching each page
> >>>> sequentially, resulted in the following, where we start with order-2
> >>>> folios for the first 128K then ramp up to order-4 for the next 128K,
> >>>> then get clamped to order-5 for the rest of the file because pa_pages is
> >>>> limited to 128K:
> >>>>
> >>>> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
> >>>> ----- ---------- ---------- --------- ------- ------- ----- -----
> >>>> FOLIO 0x00000000 0x00004000 16384 0 4 4 2
> >>>> FOLIO 0x00004000 0x00008000 16384 4 8 4 2
> >>>> FOLIO 0x00008000 0x0000c000 16384 8 12 4 2
> >>>> FOLIO 0x0000c000 0x00010000 16384 12 16 4 2
> >>>> FOLIO 0x00010000 0x00014000 16384 16 20 4 2
> >>>> FOLIO 0x00014000 0x00018000 16384 20 24 4 2
> >>>> FOLIO 0x00018000 0x0001c000 16384 24 28 4 2
> >>>> FOLIO 0x0001c000 0x00020000 16384 28 32 4 2
> >>>> FOLIO 0x00020000 0x00030000 65536 32 48 16 4
> >>>> FOLIO 0x00030000 0x00040000 65536 48 64 16 4
> >>>> FOLIO 0x00040000 0x00060000 131072 64 96 32 5
> >>>> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
> >>>> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
> >>>> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
> >>>
> >>> Interesting, I would have thought we'd ramp up earlier.
> >>>
> >>>> ...
> >>>>
> >>>> After the change, the same operation results in the first 128K being
> >>>> order-0, then we start ramping up to order-2, -4, and finally get
> >>>> clamped at order-5:
> >>>>
> >>>> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
> >>>> ----- ---------- ---------- --------- ------- ------- ----- -----
> >>>> FOLIO 0x00000000 0x00001000 4096 0 1 1 0
> >>>> FOLIO 0x00001000 0x00002000 4096 1 2 1 0
> >>>> FOLIO 0x00002000 0x00003000 4096 2 3 1 0
> >>>> FOLIO 0x00003000 0x00004000 4096 3 4 1 0
> >>>> FOLIO 0x00004000 0x00005000 4096 4 5 1 0
> >>>> FOLIO 0x00005000 0x00006000 4096 5 6 1 0
> >>>> FOLIO 0x00006000 0x00007000 4096 6 7 1 0
> >>>> FOLIO 0x00007000 0x00008000 4096 7 8 1 0
> >>>> FOLIO 0x00008000 0x00009000 4096 8 9 1 0
> >>>> FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
> >>>> FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
> >>>> FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
> >>>> FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
> >>>> FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
> >>>> FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
> >>>> FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
> >>>> FOLIO 0x00010000 0x00011000 4096 16 17 1 0
> >>>> FOLIO 0x00011000 0x00012000 4096 17 18 1 0
> >>>> FOLIO 0x00012000 0x00013000 4096 18 19 1 0
> >>>> FOLIO 0x00013000 0x00014000 4096 19 20 1 0
> >>>> FOLIO 0x00014000 0x00015000 4096 20 21 1 0
> >>>> FOLIO 0x00015000 0x00016000 4096 21 22 1 0
> >>>> FOLIO 0x00016000 0x00017000 4096 22 23 1 0
> >>>> FOLIO 0x00017000 0x00018000 4096 23 24 1 0
> >>>> FOLIO 0x00018000 0x00019000 4096 24 25 1 0
> >>>> FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
> >>>> FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
> >>>> FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
> >>>> FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
> >>>> FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
> >>>> FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
> >>>> FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
> >>>> FOLIO 0x00020000 0x00024000 16384 32 36 4 2
> >>>> FOLIO 0x00024000 0x00028000 16384 36 40 4 2
> >>>> FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
> >>>> FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
> >>>> FOLIO 0x00030000 0x00034000 16384 48 52 4 2
> >>>> FOLIO 0x00034000 0x00038000 16384 52 56 4 2
> >>>> FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
> >>>> FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
> >>>> FOLIO 0x00040000 0x00050000 65536 64 80 16 4
> >>>> FOLIO 0x00050000 0x00060000 65536 80 96 16 4
> >>>> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
> >>>> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
> >>>> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
> >>>> FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
> >>>
> >>> Similar here, do you know why we don't ramp up earlier. Allocating that many
> >>> order-0 + order-2 pages looks a bit suboptimal to me for a sequential read.
> >>
> >> Note that this is reading through mmap using the mmap readahead code. If
> >> you use standard read(2), the readahead window starts small as well and
> >> ramps us along with the desired order so we don't allocate that many small
> >> order pages in that case.
>
> That does raise an interesting question though; why do we use a fixed size
> window for mmap? It feels like we could start with a smaller window and ramp it
> up as order ramps up too, capped to the end of the vma.
>
> Although perhaps that is an investigation for another day... My main motivation
> here was to be consistent about what page_cache_ra_order()'s new_order means,
> and to actually implement algorithm that was originally intended - start from 0
> and ramp up +2 on each readahead marker.
Well, in my opinion the whole mmap readahead logic would deserve some
remodelling :) because a lot of decisions there are quite disputable for
contemporary systems. But that's definitely for some other patchset...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
2025-04-30 14:59 ` [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order() Ryan Roberts
2025-05-05 8:49 ` Jan Kara
2025-05-05 9:51 ` David Hildenbrand
@ 2025-05-05 10:09 ` Anshuman Khandual
2025-05-05 13:00 ` Ryan Roberts
2025-05-08 12:55 ` Pankaj Raghav (Samsung)
2025-05-13 6:19 ` Chaitanya S Prakash
4 siblings, 1 reply; 40+ messages in thread
From: Anshuman Khandual @ 2025-05-05 10:09 UTC (permalink / raw)
To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
Alexander Viro, Christian Brauner, Jan Kara, David Hildenbrand,
Dave Chinner, Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On 4/30/25 20:29, Ryan Roberts wrote:
> page_cache_ra_order() takes a parameter called new_order, which is
> intended to express the preferred order of the folios that will be
> allocated for the readahead operation. Most callers indeed call this
> with their preferred new order. But page_cache_async_ra() calls it with
> the preferred order of the previous readahead request (actually the
> order of the folio that had the readahead marker, which may be smaller
> when alignment comes into play).
>
> And despite the parameter name, page_cache_ra_order() always treats it
> at the old order, adding 2 to it on entry. As a result, a cold readahead
> always starts with order-2 folios.
>
> Let's fix this behaviour by always passing in the *new* order.
Makes sense.
>
> Worked example:
>
> Prior to the change, mmaping an 8MB file and touching each page
> sequentially, resulted in the following, where we start with order-2
> folios for the first 128K then ramp up to order-4 for the next 128K,
> then get clamped to order-5 for the rest of the file because pa_pages is
> limited to 128K:
>
> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
> ----- ---------- ---------- --------- ------- ------- ----- -----
> FOLIO 0x00000000 0x00004000 16384 0 4 4 2
> FOLIO 0x00004000 0x00008000 16384 4 8 4 2
> FOLIO 0x00008000 0x0000c000 16384 8 12 4 2
> FOLIO 0x0000c000 0x00010000 16384 12 16 4 2
> FOLIO 0x00010000 0x00014000 16384 16 20 4 2
> FOLIO 0x00014000 0x00018000 16384 20 24 4 2
> FOLIO 0x00018000 0x0001c000 16384 24 28 4 2
> FOLIO 0x0001c000 0x00020000 16384 28 32 4 2
> FOLIO 0x00020000 0x00030000 65536 32 48 16 4
> FOLIO 0x00030000 0x00040000 65536 48 64 16 4
> FOLIO 0x00040000 0x00060000 131072 64 96 32 5
> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
> ...
>
> After the change, the same operation results in the first 128K being
> order-0, then we start ramping up to order-2, -4, and finally get
> clamped at order-5:
>
> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
> ----- ---------- ---------- --------- ------- ------- ----- -----
> FOLIO 0x00000000 0x00001000 4096 0 1 1 0
> FOLIO 0x00001000 0x00002000 4096 1 2 1 0
> FOLIO 0x00002000 0x00003000 4096 2 3 1 0
> FOLIO 0x00003000 0x00004000 4096 3 4 1 0
> FOLIO 0x00004000 0x00005000 4096 4 5 1 0
> FOLIO 0x00005000 0x00006000 4096 5 6 1 0
> FOLIO 0x00006000 0x00007000 4096 6 7 1 0
> FOLIO 0x00007000 0x00008000 4096 7 8 1 0
> FOLIO 0x00008000 0x00009000 4096 8 9 1 0
> FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
> FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
> FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
> FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
> FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
> FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
> FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
> FOLIO 0x00010000 0x00011000 4096 16 17 1 0
> FOLIO 0x00011000 0x00012000 4096 17 18 1 0
> FOLIO 0x00012000 0x00013000 4096 18 19 1 0
> FOLIO 0x00013000 0x00014000 4096 19 20 1 0
> FOLIO 0x00014000 0x00015000 4096 20 21 1 0
> FOLIO 0x00015000 0x00016000 4096 21 22 1 0
> FOLIO 0x00016000 0x00017000 4096 22 23 1 0
> FOLIO 0x00017000 0x00018000 4096 23 24 1 0
> FOLIO 0x00018000 0x00019000 4096 24 25 1 0
> FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
> FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
> FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
> FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
> FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
> FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
> FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
> FOLIO 0x00020000 0x00024000 16384 32 36 4 2
> FOLIO 0x00024000 0x00028000 16384 36 40 4 2
> FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
> FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
> FOLIO 0x00030000 0x00034000 16384 48 52 4 2
> FOLIO 0x00034000 0x00038000 16384 52 56 4 2
> FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
> FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
> FOLIO 0x00040000 0x00050000 65536 64 80 16 4
> FOLIO 0x00050000 0x00060000 65536 80 96 16 4
> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
> FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
I guess performance-wise this will be worse than before? Although it
does fix the semantics for page_cache_ra_order() with respect to the
parameter 'new_order'.
> ...
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> mm/readahead.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 6a4e96b69702..8bb316f5a842 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -479,9 +479,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
>
> limit = min(limit, index + ra->size - 1);
>
> - if (new_order < mapping_max_folio_order(mapping))
> - new_order += 2;
> -
> new_order = min(mapping_max_folio_order(mapping), new_order);
> new_order = min_t(unsigned int, new_order, ilog2(ra->size));
> new_order = max(new_order, min_order);
> @@ -683,6 +680,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
> ra->size = get_next_ra_size(ra, max_pages);
> ra->async_size = ra->size;
> readit:
Should not the earlier conditional check also be brought here before
incrementing the order? Just curious.
if (new_order < mapping_max_folio_order(mapping))
> + order += 2;
> ractl->_index = ra->start;
> page_cache_ra_order(ractl, ra, order);
> }
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
2025-05-05 10:09 ` Anshuman Khandual
@ 2025-05-05 13:00 ` Ryan Roberts
0 siblings, 0 replies; 40+ messages in thread
From: Ryan Roberts @ 2025-05-05 13:00 UTC (permalink / raw)
To: Anshuman Khandual, Andrew Morton, Matthew Wilcox (Oracle),
Alexander Viro, Christian Brauner, Jan Kara, David Hildenbrand,
Dave Chinner, Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On 05/05/2025 11:09, Anshuman Khandual wrote:
>
>
> On 4/30/25 20:29, Ryan Roberts wrote:
>> page_cache_ra_order() takes a parameter called new_order, which is
>> intended to express the preferred order of the folios that will be
>> allocated for the readahead operation. Most callers indeed call this
>> with their preferred new order. But page_cache_async_ra() calls it with
>> the preferred order of the previous readahead request (actually the
>> order of the folio that had the readahead marker, which may be smaller
>> when alignment comes into play).
>>
>> And despite the parameter name, page_cache_ra_order() always treats it
>> at the old order, adding 2 to it on entry. As a result, a cold readahead
>> always starts with order-2 folios.
>>
>> Let's fix this behaviour by always passing in the *new* order.
>
> Makes sense.
>
>>
>> Worked example:
>>
>> Prior to the change, mmaping an 8MB file and touching each page
>> sequentially, resulted in the following, where we start with order-2
>> folios for the first 128K then ramp up to order-4 for the next 128K,
>> then get clamped to order-5 for the rest of the file because pa_pages is
>> limited to 128K:
>>
>> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
>> ----- ---------- ---------- --------- ------- ------- ----- -----
>> FOLIO 0x00000000 0x00004000 16384 0 4 4 2
>> FOLIO 0x00004000 0x00008000 16384 4 8 4 2
>> FOLIO 0x00008000 0x0000c000 16384 8 12 4 2
>> FOLIO 0x0000c000 0x00010000 16384 12 16 4 2
>> FOLIO 0x00010000 0x00014000 16384 16 20 4 2
>> FOLIO 0x00014000 0x00018000 16384 20 24 4 2
>> FOLIO 0x00018000 0x0001c000 16384 24 28 4 2
>> FOLIO 0x0001c000 0x00020000 16384 28 32 4 2
>> FOLIO 0x00020000 0x00030000 65536 32 48 16 4
>> FOLIO 0x00030000 0x00040000 65536 48 64 16 4
>> FOLIO 0x00040000 0x00060000 131072 64 96 32 5
>> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
>> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
>> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
>> ...
>>
>> After the change, the same operation results in the first 128K being
>> order-0, then we start ramping up to order-2, -4, and finally get
>> clamped at order-5:
>>
>> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
>> ----- ---------- ---------- --------- ------- ------- ----- -----
>> FOLIO 0x00000000 0x00001000 4096 0 1 1 0
>> FOLIO 0x00001000 0x00002000 4096 1 2 1 0
>> FOLIO 0x00002000 0x00003000 4096 2 3 1 0
>> FOLIO 0x00003000 0x00004000 4096 3 4 1 0
>> FOLIO 0x00004000 0x00005000 4096 4 5 1 0
>> FOLIO 0x00005000 0x00006000 4096 5 6 1 0
>> FOLIO 0x00006000 0x00007000 4096 6 7 1 0
>> FOLIO 0x00007000 0x00008000 4096 7 8 1 0
>> FOLIO 0x00008000 0x00009000 4096 8 9 1 0
>> FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
>> FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
>> FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
>> FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
>> FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
>> FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
>> FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
>> FOLIO 0x00010000 0x00011000 4096 16 17 1 0
>> FOLIO 0x00011000 0x00012000 4096 17 18 1 0
>> FOLIO 0x00012000 0x00013000 4096 18 19 1 0
>> FOLIO 0x00013000 0x00014000 4096 19 20 1 0
>> FOLIO 0x00014000 0x00015000 4096 20 21 1 0
>> FOLIO 0x00015000 0x00016000 4096 21 22 1 0
>> FOLIO 0x00016000 0x00017000 4096 22 23 1 0
>> FOLIO 0x00017000 0x00018000 4096 23 24 1 0
>> FOLIO 0x00018000 0x00019000 4096 24 25 1 0
>> FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
>> FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
>> FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
>> FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
>> FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
>> FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
>> FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
>> FOLIO 0x00020000 0x00024000 16384 32 36 4 2
>> FOLIO 0x00024000 0x00028000 16384 36 40 4 2
>> FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
>> FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
>> FOLIO 0x00030000 0x00034000 16384 48 52 4 2
>> FOLIO 0x00034000 0x00038000 16384 52 56 4 2
>> FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
>> FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
>> FOLIO 0x00040000 0x00050000 65536 64 80 16 4
>> FOLIO 0x00050000 0x00060000 65536 80 96 16 4
>> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
>> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
>> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
>> FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
>
> I guess performance wise this will be worse than earlier ?
Maybe, maybe not. If higher order always gave better performance then surely we
would always use the highest order? Order-0 is a bit easier to allocate than
order-2. So if the file actually isn't being accessed sequentially, allocating
order-0 for the cold cache case might actually be better overall?
> Although it
> does fix the semantics for page_cache_ra_order() with respect to the
> parameter 'new_order'.
Yes, that's the piece I was keen to sort out; once you get to patch 5 it's
important that new_order really does mean new_order, otherwise we would end up
allocating a higher order than the arch intended.
If we think we really *should* be starting at order-2 instead of order-0, we
should pass 2 as new_order instead of 0.
>
>> ...
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>> mm/readahead.c | 4 +---
>> 1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/mm/readahead.c b/mm/readahead.c
>> index 6a4e96b69702..8bb316f5a842 100644
>> --- a/mm/readahead.c
>> +++ b/mm/readahead.c
>> @@ -479,9 +479,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
>>
>> limit = min(limit, index + ra->size - 1);
>>
>> - if (new_order < mapping_max_folio_order(mapping))
>> - new_order += 2;
>> -
>> new_order = min(mapping_max_folio_order(mapping), new_order);
>> new_order = min_t(unsigned int, new_order, ilog2(ra->size));
>> new_order = max(new_order, min_order);
>> @@ -683,6 +680,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
>> ra->size = get_next_ra_size(ra, max_pages);
>> ra->async_size = ra->size;
>> readit:
>
> Should not the earlier conditional check also be brought here before
> incrementing the order ? Just curious.
>
> if (new_order < mapping_max_folio_order(mapping))
No, that's not needed. page_cache_ra_order() will clamp new_order appropriately.
The conditional that I removed was unneeded because the following lines are
clamping the new value explicitly anyway.
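(For reference, these are the clamps visible in the patch context above,
which bound whatever order the caller passes in:

    new_order = min(mapping_max_folio_order(mapping), new_order);
    new_order = min_t(unsigned int, new_order, ilog2(ra->size));
    new_order = max(new_order, min_order);

so adding a "new_order < mapping_max_folio_order(mapping)" guard at the
call site would not change the result.)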
Thanks,
Ryan
>
>> + order += 2;
>> ractl->_index = ra->start;
>> page_cache_ra_order(ractl, ra, order);
>> }
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
2025-04-30 14:59 ` [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order() Ryan Roberts
` (2 preceding siblings ...)
2025-05-05 10:09 ` Anshuman Khandual
@ 2025-05-08 12:55 ` Pankaj Raghav (Samsung)
2025-05-09 13:30 ` Ryan Roberts
2025-05-13 6:19 ` Chaitanya S Prakash
4 siblings, 1 reply; 40+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-05-08 12:55 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan,
linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
Hey Ryan,
On Wed, Apr 30, 2025 at 03:59:14PM +0100, Ryan Roberts wrote:
> FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
> FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
> FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
> FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
> FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
> FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
> FOLIO 0x00020000 0x00024000 16384 32 36 4 2
> FOLIO 0x00024000 0x00028000 16384 36 40 4 2
> FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
> FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
> ...
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> mm/readahead.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 6a4e96b69702..8bb316f5a842 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -479,9 +479,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
>
So we have always had a fallback to do_page_cache_ra() if the size of the
readahead is less than 4 pages (16k). I think this was there because we
were adding `2` to the new_order:
unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
/*
* Fallback when size < min_nrpages as each folio should be
* at least min_nrpages anyway.
*/
if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
goto fallback;
> limit = min(limit, index + ra->size - 1);
>
> - if (new_order < mapping_max_folio_order(mapping))
> - new_order += 2;
Now that you have moved this, we could make the lhs of the max 2 (8k)
instead of 4 (16k).
- unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
+ unsigned int min_ra_size = max(2, mapping_min_folio_nrpages(mapping));
I think if we do that, we might ramp up to 8k sooner rather than jumping
from 4k to 16k directly?
> -
> new_order = min(mapping_max_folio_order(mapping), new_order);
> new_order = min_t(unsigned int, new_order, ilog2(ra->size));
> new_order = max(new_order, min_order);
> @@ -683,6 +680,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
> ra->size = get_next_ra_size(ra, max_pages);
> ra->async_size = ra->size;
> readit:
> + order += 2;
> ractl->_index = ra->start;
> page_cache_ra_order(ractl, ra, order);
> }
> --
> 2.43.0
>
--
Pankaj
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
2025-05-08 12:55 ` Pankaj Raghav (Samsung)
@ 2025-05-09 13:30 ` Ryan Roberts
2025-05-09 20:50 ` Pankaj Raghav (Samsung)
0 siblings, 1 reply; 40+ messages in thread
From: Ryan Roberts @ 2025-05-09 13:30 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan,
linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
Hi Pankaj,
Thanks for the review! ...
On 08/05/2025 13:55, Pankaj Raghav (Samsung) wrote:
> Hey Ryan,
>
> On Wed, Apr 30, 2025 at 03:59:14PM +0100, Ryan Roberts wrote:
>> FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
>> FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
>> FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
>> FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
>> FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
>> FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
>> FOLIO 0x00020000 0x00024000 16384 32 36 4 2
>> FOLIO 0x00024000 0x00028000 16384 36 40 4 2
>> FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
>> FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
>> ...
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>> mm/readahead.c | 4 +---
>> 1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/mm/readahead.c b/mm/readahead.c
>> index 6a4e96b69702..8bb316f5a842 100644
>> --- a/mm/readahead.c
>> +++ b/mm/readahead.c
>> @@ -479,9 +479,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
>>
>
> So we always had a fallback to do_page_cache_ra() if the size of the
> readahead is less than 4 pages (16k). I think this was there because we
> were adding `2` to the new_order:
If this is the reason for the magic number 4, then it's a bug in itself IMHO. 4
pages is only 16K when the page size is 4K; arm64 supports other page sizes. But
additionally, it's not just ra->size that dictates the final order of the folio;
it also depends on alignment in the file, EOF, etc.
If we remove the fallback condition completely, things will still work out. So
unless someone can explain the reason for that condition (Matthew?), my vote
would be to remove it entirely.
>
> unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
>
> /*
> * Fallback when size < min_nrpages as each folio should be
> * at least min_nrpages anyway.
> */
> if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
> goto fallback;
>
>> limit = min(limit, index + ra->size - 1);
>>
>> - if (new_order < mapping_max_folio_order(mapping))
>> - new_order += 2;
>
> Now that you have moved this, we could make the lhs of the max to be 2
> (8k) instead of 4(16k).
I don't really understand why the magic number 2 would now be correct?
>
> - unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
> + unsigned int min_ra_size = max(2, mapping_min_folio_nrpages(mapping));
>
> I think if we do that, we might ramp up to 8k sooner rather than jumping
> from 4k to 16k directly?
In practice I don't think so; this would only give us order-1 where we
didn't have it before, when new_order >= 1 and ra->size is 3 or 4 pages.
But as I said, my vote would be to remove this fallback condition entirely. What
do you think?
Thanks,
Ryan
>
>> -
>> new_order = min(mapping_max_folio_order(mapping), new_order);
>> new_order = min_t(unsigned int, new_order, ilog2(ra->size));
>> new_order = max(new_order, min_order);
>> @@ -683,6 +680,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
>> ra->size = get_next_ra_size(ra, max_pages);
>> ra->async_size = ra->size;
>> readit:
>> + order += 2;
>> ractl->_index = ra->start;
>> page_cache_ra_order(ractl, ra, order);
>> }
>> --
>> 2.43.0
>>
>
> --
> Pankaj
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
2025-05-09 13:30 ` Ryan Roberts
@ 2025-05-09 20:50 ` Pankaj Raghav (Samsung)
2025-05-13 12:33 ` Ryan Roberts
0 siblings, 1 reply; 40+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-05-09 20:50 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan,
linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm, p.raghav
> >>
> >
> > So we always had a fallback to do_page_cache_ra() if the size of the
> > readahead is less than 4 pages (16k). I think this was there because we
> > were adding `2` to the new_order:
>
> If this is the reason for the magic number 4, then it's a bug in itself IMHO. 4
> pages is only 16K when the page size is 4K; arm64 supports other page sizes. But
> additionally, it's not just ra->size that dictates the final order of the folio;
> it also depends on alignment in the file, EOF, etc.
>
IIRC, initially we were not able to use order-1 folios[1], so we always
did a fallback for any order < 2 using do_page_cache_ra(). I think that
is where the magic order 2 (4 pages) comes from. Someone please correct
me if I am wrong.
But we don't have that limitation for file-backed folios anymore, so the
fallback for ra->size < 4 is probably not needed. So the only time we do
a fallback is if we don't support large folios.
> If we remove the fallback condition completely, things will still work out. So
> unless someone can explain the reason for that condition (Matthew?), my vote
> would be to remove it entirely.
I am actually fine with removing the first part of this fallback condition.
But as I said, we still need to do a fallback if we don't support large folios.
--
Pankaj
[1] https://lore.kernel.org/all/ZH0GvxAdw1RO2Shr@casper.infradead.org/
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
2025-05-09 20:50 ` Pankaj Raghav (Samsung)
@ 2025-05-13 12:33 ` Ryan Roberts
0 siblings, 0 replies; 40+ messages in thread
From: Ryan Roberts @ 2025-05-13 12:33 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan,
linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm, p.raghav
On 09/05/2025 21:50, Pankaj Raghav (Samsung) wrote:
>>>>
>>>
>>> So we always had a fallback to do_page_cache_ra() if the size of the
>>> readahead is less than 4 pages (16k). I think this was there because we
>>> were adding `2` to the new_order:
>>
>> If this is the reason for the magic number 4, then it's a bug in itself IMHO. 4
>> pages is only 16K when the page size is 4K; arm64 supports other page sizes. But
>> additionally, it's not just ra->size that dictates the final order of the folio;
>> it also depends on alignment in the file, EOF, etc.
>>
>
> IIRC, initially we were not able to use order-1 folios[1], so we always
> did a fallback for any order < 2 using do_page_cache_ra(). I think that
> is where the magic order 2 (4 pages) is coming. Please someone can
> correct me if I am wrong.
Ahh, I see. That might have been where it came from, but IMHO it still
didn't really belong there; just because the size is bigger than 4 pages
doesn't mean you would never want to use order-1 folios - there are
alignment considerations that can cause that. The logic in
page_cache_ra_order() used to know to avoid order-1.
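IIRC that was the check in the EOF clamping loop, something like this
(quoting from memory, so the exact form may be slightly off):
---8<---
	/* Don't allocate pages past EOF */
	while (index + (1UL << order) - 1 > limit) {
		if (--order == 1)
			order = 0;
	}
---8<---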
>
> But we don't have that limitation for file-backed folios anymore, so the
> fallback for ra->size < 4 is probably not needed. So the only time we do
> a fallback is if we don't support large folios.
>
>> If we remove the fallback condition completely, things will still work out. So
>> unless someone can explain the reason for that condition (Matthew?), my vote
>> would be to remove it entirely.
>
> I am actually fine with removing the first part of this fallback condition.
> But as I said, we still need to do a fallback if we don't support large folios.
Yep agreed. I'll make this change in the next version.
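For the record, the shape I have in mind is roughly the below (just a
sketch to show the direction, not the final patch):
---8<---
	/*
	 * Only fall back when large folios aren't supported at all;
	 * page_cache_ra_order() already clamps new_order for small
	 * ra->size and for the mapping's min/max folio orders.
	 */
	if (!mapping_large_folio_support(mapping))
		goto fallback;
---8<---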
>
> --
> Pankaj
>
> [1] https://lore.kernel.org/all/ZH0GvxAdw1RO2Shr@casper.infradead.org/
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
2025-04-30 14:59 ` [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order() Ryan Roberts
` (3 preceding siblings ...)
2025-05-08 12:55 ` Pankaj Raghav (Samsung)
@ 2025-05-13 6:19 ` Chaitanya S Prakash
4 siblings, 0 replies; 40+ messages in thread
From: Chaitanya S Prakash @ 2025-05-13 6:19 UTC (permalink / raw)
To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
Alexander Viro, Christian Brauner, Jan Kara, David Hildenbrand,
Dave Chinner, Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On 30/04/25 20:29, Ryan Roberts wrote:
> page_cache_ra_order() takes a parameter called new_order, which is
> intended to express the preferred order of the folios that will be
> allocated for the readahead operation. Most callers indeed call this
> with their preferred new order. But page_cache_async_ra() calls it with
> the preferred order of the previous readahead request (actually the
> order of the folio that had the readahead marker, which may be smaller
> when alignment comes into play).
>
> And despite the parameter name, page_cache_ra_order() always treats it
> at the old order, adding 2 to it on entry. As a result, a cold readahead
> always starts with order-2 folios.
>
> Let's fix this behaviour by always passing in the *new* order.
>
> Worked example:
>
> Prior to the change, mmaping an 8MB file and touching each page
> sequentially, resulted in the following, where we start with order-2
> folios for the first 128K then ramp up to order-4 for the next 128K,
> then get clamped to order-5 for the rest of the file because pa_pages is
> limited to 128K:
>
> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
> ----- ---------- ---------- --------- ------- ------- ----- -----
> FOLIO 0x00000000 0x00004000 16384 0 4 4 2
> FOLIO 0x00004000 0x00008000 16384 4 8 4 2
> FOLIO 0x00008000 0x0000c000 16384 8 12 4 2
> FOLIO 0x0000c000 0x00010000 16384 12 16 4 2
> FOLIO 0x00010000 0x00014000 16384 16 20 4 2
> FOLIO 0x00014000 0x00018000 16384 20 24 4 2
> FOLIO 0x00018000 0x0001c000 16384 24 28 4 2
> FOLIO 0x0001c000 0x00020000 16384 28 32 4 2
> FOLIO 0x00020000 0x00030000 65536 32 48 16 4
> FOLIO 0x00030000 0x00040000 65536 48 64 16 4
> FOLIO 0x00040000 0x00060000 131072 64 96 32 5
> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
> ...
>
> After the change, the same operation results in the first 128K being
> order-0, then we start ramping up to order-2, -4, and finally get
> clamped at order-5:
>
> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER
> ----- ---------- ---------- --------- ------- ------- ----- -----
> FOLIO 0x00000000 0x00001000 4096 0 1 1 0
> FOLIO 0x00001000 0x00002000 4096 1 2 1 0
> FOLIO 0x00002000 0x00003000 4096 2 3 1 0
> FOLIO 0x00003000 0x00004000 4096 3 4 1 0
> FOLIO 0x00004000 0x00005000 4096 4 5 1 0
> FOLIO 0x00005000 0x00006000 4096 5 6 1 0
> FOLIO 0x00006000 0x00007000 4096 6 7 1 0
> FOLIO 0x00007000 0x00008000 4096 7 8 1 0
> FOLIO 0x00008000 0x00009000 4096 8 9 1 0
> FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
> FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
> FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
> FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
> FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
> FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
> FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
> FOLIO 0x00010000 0x00011000 4096 16 17 1 0
> FOLIO 0x00011000 0x00012000 4096 17 18 1 0
> FOLIO 0x00012000 0x00013000 4096 18 19 1 0
> FOLIO 0x00013000 0x00014000 4096 19 20 1 0
> FOLIO 0x00014000 0x00015000 4096 20 21 1 0
> FOLIO 0x00015000 0x00016000 4096 21 22 1 0
> FOLIO 0x00016000 0x00017000 4096 22 23 1 0
> FOLIO 0x00017000 0x00018000 4096 23 24 1 0
> FOLIO 0x00018000 0x00019000 4096 24 25 1 0
> FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
> FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
> FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
> FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
> FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
> FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
> FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
> FOLIO 0x00020000 0x00024000 16384 32 36 4 2
> FOLIO 0x00024000 0x00028000 16384 36 40 4 2
> FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
> FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
> FOLIO 0x00030000 0x00034000 16384 48 52 4 2
> FOLIO 0x00034000 0x00038000 16384 52 56 4 2
> FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
> FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
> FOLIO 0x00040000 0x00050000 65536 64 80 16 4
> FOLIO 0x00050000 0x00060000 65536 80 96 16 4
> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
> FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
> ...
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
This looks good to me.
Tested-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com>
> ---
> mm/readahead.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 6a4e96b69702..8bb316f5a842 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -479,9 +479,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
>
> limit = min(limit, index + ra->size - 1);
>
> - if (new_order < mapping_max_folio_order(mapping))
> - new_order += 2;
> -
> new_order = min(mapping_max_folio_order(mapping), new_order);
> new_order = min_t(unsigned int, new_order, ilog2(ra->size));
> new_order = max(new_order, min_order);
> @@ -683,6 +680,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
> ra->size = get_next_ra_size(ra, max_pages);
> ra->async_size = ra->size;
> readit:
> + order += 2;
> ractl->_index = ra->start;
> page_cache_ra_order(ractl, ra, order);
> }
^ permalink raw reply [flat|nested] 40+ messages in thread
* [RFC PATCH v4 2/5] mm/readahead: Terminate async readahead on natural boundary
2025-04-30 14:59 [RFC PATCH v4 0/5] Readahead tweaks for larger folios Ryan Roberts
2025-04-30 14:59 ` [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order() Ryan Roberts
@ 2025-04-30 14:59 ` Ryan Roberts
2025-05-05 9:13 ` Jan Kara
2025-04-30 14:59 ` [RFC PATCH v4 3/5] mm/readahead: Make space in struct file_ra_state Ryan Roberts
` (3 subsequent siblings)
5 siblings, 1 reply; 40+ messages in thread
From: Ryan Roberts @ 2025-04-30 14:59 UTC (permalink / raw)
To: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-fsdevel,
linux-mm
Previously asynchronous readahead would read ra_pages (usually 128K)
directly after the end of the synchronous readahead, and given that the
synchronous readahead portion had no alignment guarantees (beyond page
boundaries) it is possible (and likely) that the end of the initial 128K
region would not fall on a natural boundary for the folio size being
used. Therefore smaller folios were used to align down to the required
boundary, both at the end of the previous readahead block and at the
start of the new one.
In the worst cases, this can result in never properly ramping up the
folio size, and instead getting stuck oscillating between order-0, -1
and -2 folios. The next readahead will try to use folios whose order is
+2 bigger than the folio that had the readahead marker. But because of
the alignment requirements, that folio (the first one in the readahead
block) can end up being order-0 in some cases.
There will be 2 modifications to solve this issue:
1) Calculate the readahead size so the end is aligned to a folio
boundary. This prevents needing to allocate small folios to align
down at the end of the window and fixes the oscillation problem.
2) Remember the "preferred folio order" in the ra state instead of
inferring it from the folio with the readahead marker. This solves
the slow ramp up problem (discussed in a subsequent patch).
This patch addresses (1) only. A subsequent patch will address (2).
Worked example:
The following shows the previous pathological behaviour when the initial
synchronous readahead is unaligned. We start reading at page 17 in the
file and read sequentially from there. I'm showing a dump of the pages
in the page cache just after we read the first page of the folio with
the readahead marker.
Initially there are no pages in the page cache:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00800000 8388608 0 2048 2048
Then we access page 17, causing synchronous read-around of 128K with a
readahead marker set up at page 25. So far, all as expected:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 Y
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
HOLE 0x00021000 0x00800000 8253440 33 2048 2015
Now access pages 18-25 inclusive. This causes an asynchronous 128K
readahead starting at page 33. But since we are unaligned, even though
the preferred folio order is 2, the first folio in this batch (the one
with the new readahead marker) is order-0:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
FOLIO 0x00021000 0x00022000 4096 33 34 1 0 Y
FOLIO 0x00022000 0x00024000 8192 34 36 2 1
FOLIO 0x00024000 0x00028000 16384 36 40 4 2
FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
FOLIO 0x00030000 0x00034000 16384 48 52 4 2
FOLIO 0x00034000 0x00038000 16384 52 56 4 2
FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
FOLIO 0x00040000 0x00041000 4096 64 65 1 0
HOLE 0x00041000 0x00800000 8122368 65 2048 1983
Which means that when we now read pages 26-33 and readahead is kicked
off again, the new preferred order is 2 (0 + 2), not 4 as we intended:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
FOLIO 0x00021000 0x00022000 4096 33 34 1 0
FOLIO 0x00022000 0x00024000 8192 34 36 2 1
FOLIO 0x00024000 0x00028000 16384 36 40 4 2
FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
FOLIO 0x00030000 0x00034000 16384 48 52 4 2
FOLIO 0x00034000 0x00038000 16384 52 56 4 2
FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
FOLIO 0x00040000 0x00041000 4096 64 65 1 0
FOLIO 0x00041000 0x00042000 4096 65 66 1 0 Y
FOLIO 0x00042000 0x00044000 8192 66 68 2 1
FOLIO 0x00044000 0x00048000 16384 68 72 4 2
FOLIO 0x00048000 0x0004c000 16384 72 76 4 2
FOLIO 0x0004c000 0x00050000 16384 76 80 4 2
FOLIO 0x00050000 0x00054000 16384 80 84 4 2
FOLIO 0x00054000 0x00058000 16384 84 88 4 2
FOLIO 0x00058000 0x0005c000 16384 88 92 4 2
FOLIO 0x0005c000 0x00060000 16384 92 96 4 2
FOLIO 0x00060000 0x00061000 4096 96 97 1 0
HOLE 0x00061000 0x00800000 7991296 97 2048 1951
This cycle of ramping up from order-0, with smaller orders at the edges
for alignment, continues all the way to the end of the file (not shown).
After the change, we round the end boundary down to the order boundary,
so we no longer get stuck in the cycle and can ramp up the order over
time. Note that the rate of the ramp-up is still not what we would
expect; we will fix that next. Here we are touching pages 17-256
sequentially:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
FOLIO 0x00021000 0x00022000 4096 33 34 1 0
FOLIO 0x00022000 0x00024000 8192 34 36 2 1
FOLIO 0x00024000 0x00028000 16384 36 40 4 2
FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
FOLIO 0x00030000 0x00034000 16384 48 52 4 2
FOLIO 0x00034000 0x00038000 16384 52 56 4 2
FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
FOLIO 0x00040000 0x00044000 16384 64 68 4 2
FOLIO 0x00044000 0x00048000 16384 68 72 4 2
FOLIO 0x00048000 0x0004c000 16384 72 76 4 2
FOLIO 0x0004c000 0x00050000 16384 76 80 4 2
FOLIO 0x00050000 0x00054000 16384 80 84 4 2
FOLIO 0x00054000 0x00058000 16384 84 88 4 2
FOLIO 0x00058000 0x0005c000 16384 88 92 4 2
FOLIO 0x0005c000 0x00060000 16384 92 96 4 2
FOLIO 0x00060000 0x00070000 65536 96 112 16 4
FOLIO 0x00070000 0x00080000 65536 112 128 16 4
FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
FOLIO 0x000e0000 0x00100000 131072 224 256 32 5
FOLIO 0x00100000 0x00120000 131072 256 288 32 5
FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y
HOLE 0x00140000 0x00800000 7077888 320 2048 1728
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
mm/readahead.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index 8bb316f5a842..82f9f623f2d7 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -625,7 +625,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
unsigned long max_pages;
struct file_ra_state *ra = ractl->ra;
pgoff_t index = readahead_index(ractl);
- pgoff_t expected, start;
+ pgoff_t expected, start, end, aligned_end;
unsigned int order = folio_order(folio);
/* no readahead */
@@ -657,7 +657,6 @@ void page_cache_async_ra(struct readahead_control *ractl,
* the readahead window.
*/
ra->size = max(ra->size, get_next_ra_size(ra, max_pages));
- ra->async_size = ra->size;
goto readit;
}
@@ -678,9 +677,13 @@ void page_cache_async_ra(struct readahead_control *ractl,
ra->size = start - index; /* old async_size */
ra->size += req_count;
ra->size = get_next_ra_size(ra, max_pages);
- ra->async_size = ra->size;
readit:
order += 2;
+ end = ra->start + ra->size;
+ aligned_end = round_down(end, 1UL << order);
+ if (aligned_end > ra->start)
+ ra->size -= end - aligned_end;
+ ra->async_size = ra->size;
ractl->_index = ra->start;
page_cache_ra_order(ractl, ra, order);
}
--
2.43.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 2/5] mm/readahead: Terminate async readahead on natural boundary
2025-04-30 14:59 ` [RFC PATCH v4 2/5] mm/readahead: Terminate async readahead on natural boundary Ryan Roberts
@ 2025-05-05 9:13 ` Jan Kara
2025-05-05 9:37 ` Jan Kara
0 siblings, 1 reply; 40+ messages in thread
From: Jan Kara @ 2025-05-05 9:13 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan,
linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On Wed 30-04-25 15:59:15, Ryan Roberts wrote:
> Previously asynchonous readahead would read ra_pages (usually 128K)
> directly after the end of the synchonous readahead and given the
> synchronous readahead portion had no alignment guarantees (beyond page
> boundaries) it is possible (and likely) that the end of the initial 128K
> region would not fall on a natural boundary for the folio size being
> used. Therefore smaller folios were used to align down to the required
> boundary, both at the end of the previous readahead block and at the
> start of the new one.
>
> In the worst cases, this can result in never properly ramping up the
> folio size, and instead getting stuck oscillating between order-0, -1
> and -2 folios. The next readahead will try to use folios whose order is
> +2 bigger than the folio that had the readahead marker. But because of
> the alignment requirements, that folio (the first one in the readahead
> block) can end up being order-0 in some cases.
>
> There will be 2 modifications to solve this issue:
>
> 1) Calculate the readahead size so the end is aligned to a folio
> boundary. This prevents needing to allocate small folios to align
> down at the end of the window and fixes the oscillation problem.
>
> 2) Remember the "preferred folio order" in the ra state instead of
> inferring it from the folio with the readahead marker. This solves
> the slow ramp up problem (discussed in a subsequent patch).
>
> This patch addresses (1) only. A subsequent patch will address (2).
>
> Worked example:
>
> The following shows the previous pathalogical behaviour when the initial
> synchronous readahead is unaligned. We start reading at page 17 in the
> file and read sequentially from there. I'm showing a dump of the pages
> in the page cache just after we read the first page of the folio with
> the readahead marker.
>
> Initially there are no pages in the page cache:
>
> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
> ----- ---------- ---------- ---------- ------- ------- ----- ----- --
> HOLE 0x00000000 0x00800000 8388608 0 2048 2048
>
> Then we access page 17, causing synchonous read-around of 128K with a
> readahead marker set up at page 25. So far, all as expected:
>
> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
> ----- ---------- ---------- ---------- ------- ------- ----- ----- --
> HOLE 0x00000000 0x00001000 4096 0 1 1
> FOLIO 0x00001000 0x00002000 4096 1 2 1 0
> FOLIO 0x00002000 0x00003000 4096 2 3 1 0
> FOLIO 0x00003000 0x00004000 4096 3 4 1 0
> FOLIO 0x00004000 0x00005000 4096 4 5 1 0
> FOLIO 0x00005000 0x00006000 4096 5 6 1 0
> FOLIO 0x00006000 0x00007000 4096 6 7 1 0
> FOLIO 0x00007000 0x00008000 4096 7 8 1 0
> FOLIO 0x00008000 0x00009000 4096 8 9 1 0
> FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
> FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
> FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
> FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
> FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
> FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
> FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
> FOLIO 0x00010000 0x00011000 4096 16 17 1 0
> FOLIO 0x00011000 0x00012000 4096 17 18 1 0
> FOLIO 0x00012000 0x00013000 4096 18 19 1 0
> FOLIO 0x00013000 0x00014000 4096 19 20 1 0
> FOLIO 0x00014000 0x00015000 4096 20 21 1 0
> FOLIO 0x00015000 0x00016000 4096 21 22 1 0
> FOLIO 0x00016000 0x00017000 4096 22 23 1 0
> FOLIO 0x00017000 0x00018000 4096 23 24 1 0
> FOLIO 0x00018000 0x00019000 4096 24 25 1 0
> FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 Y
> FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
> FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
> FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
> FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
> FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
> FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
> FOLIO 0x00020000 0x00021000 4096 32 33 1 0
> HOLE 0x00021000 0x00800000 8253440 33 2048 2015
>
> Now access pages 18-25 inclusive. This causes an asynchronous 128K
> readahead starting at page 33. But since we are unaligned, even though
> the preferred folio order is 2, the first folio in this batch (the one
> with the new readahead marker) is order-0:
>
> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
> ----- ---------- ---------- ---------- ------- ------- ----- ----- --
> HOLE 0x00000000 0x00001000 4096 0 1 1
> FOLIO 0x00001000 0x00002000 4096 1 2 1 0
> FOLIO 0x00002000 0x00003000 4096 2 3 1 0
> FOLIO 0x00003000 0x00004000 4096 3 4 1 0
> FOLIO 0x00004000 0x00005000 4096 4 5 1 0
> FOLIO 0x00005000 0x00006000 4096 5 6 1 0
> FOLIO 0x00006000 0x00007000 4096 6 7 1 0
> FOLIO 0x00007000 0x00008000 4096 7 8 1 0
> FOLIO 0x00008000 0x00009000 4096 8 9 1 0
> FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
> FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
> FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
> FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
> FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
> FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
> FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
> FOLIO 0x00010000 0x00011000 4096 16 17 1 0
> FOLIO 0x00011000 0x00012000 4096 17 18 1 0
> FOLIO 0x00012000 0x00013000 4096 18 19 1 0
> FOLIO 0x00013000 0x00014000 4096 19 20 1 0
> FOLIO 0x00014000 0x00015000 4096 20 21 1 0
> FOLIO 0x00015000 0x00016000 4096 21 22 1 0
> FOLIO 0x00016000 0x00017000 4096 22 23 1 0
> FOLIO 0x00017000 0x00018000 4096 23 24 1 0
> FOLIO 0x00018000 0x00019000 4096 24 25 1 0
> FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
> FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
> FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
> FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
> FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
> FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
> FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
> FOLIO 0x00020000 0x00021000 4096 32 33 1 0
> FOLIO 0x00021000 0x00022000 4096 33 34 1 0 Y
> FOLIO 0x00022000 0x00024000 8192 34 36 2 1
> FOLIO 0x00024000 0x00028000 16384 36 40 4 2
> FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
> FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
> FOLIO 0x00030000 0x00034000 16384 48 52 4 2
> FOLIO 0x00034000 0x00038000 16384 52 56 4 2
> FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
> FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
> FOLIO 0x00040000 0x00041000 4096 64 65 1 0
> HOLE 0x00041000 0x00800000 8122368 65 2048 1983
>
> Which means that when we now read pages 26-33 and readahead is kicked
> off again, the new preferred order is 2 (0 + 2), not 4 as we intended:
>
> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
> ----- ---------- ---------- ---------- ------- ------- ----- ----- --
> HOLE 0x00000000 0x00001000 4096 0 1 1
> FOLIO 0x00001000 0x00002000 4096 1 2 1 0
> FOLIO 0x00002000 0x00003000 4096 2 3 1 0
> FOLIO 0x00003000 0x00004000 4096 3 4 1 0
> FOLIO 0x00004000 0x00005000 4096 4 5 1 0
> FOLIO 0x00005000 0x00006000 4096 5 6 1 0
> FOLIO 0x00006000 0x00007000 4096 6 7 1 0
> FOLIO 0x00007000 0x00008000 4096 7 8 1 0
> FOLIO 0x00008000 0x00009000 4096 8 9 1 0
> FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
> FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
> FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
> FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
> FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
> FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
> FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
> FOLIO 0x00010000 0x00011000 4096 16 17 1 0
> FOLIO 0x00011000 0x00012000 4096 17 18 1 0
> FOLIO 0x00012000 0x00013000 4096 18 19 1 0
> FOLIO 0x00013000 0x00014000 4096 19 20 1 0
> FOLIO 0x00014000 0x00015000 4096 20 21 1 0
> FOLIO 0x00015000 0x00016000 4096 21 22 1 0
> FOLIO 0x00016000 0x00017000 4096 22 23 1 0
> FOLIO 0x00017000 0x00018000 4096 23 24 1 0
> FOLIO 0x00018000 0x00019000 4096 24 25 1 0
> FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
> FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
> FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
> FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
> FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
> FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
> FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
> FOLIO 0x00020000 0x00021000 4096 32 33 1 0
> FOLIO 0x00021000 0x00022000 4096 33 34 1 0
> FOLIO 0x00022000 0x00024000 8192 34 36 2 1
> FOLIO 0x00024000 0x00028000 16384 36 40 4 2
> FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
> FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
> FOLIO 0x00030000 0x00034000 16384 48 52 4 2
> FOLIO 0x00034000 0x00038000 16384 52 56 4 2
> FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
> FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
> FOLIO 0x00040000 0x00041000 4096 64 65 1 0
> FOLIO 0x00041000 0x00042000 4096 65 66 1 0 Y
> FOLIO 0x00042000 0x00044000 8192 66 68 2 1
> FOLIO 0x00044000 0x00048000 16384 68 72 4 2
> FOLIO 0x00048000 0x0004c000 16384 72 76 4 2
> FOLIO 0x0004c000 0x00050000 16384 76 80 4 2
> FOLIO 0x00050000 0x00054000 16384 80 84 4 2
> FOLIO 0x00054000 0x00058000 16384 84 88 4 2
> FOLIO 0x00058000 0x0005c000 16384 88 92 4 2
> FOLIO 0x0005c000 0x00060000 16384 92 96 4 2
> FOLIO 0x00060000 0x00061000 4096 96 97 1 0
> HOLE 0x00061000 0x00800000 7991296 97 2048 1951
>
> This ramp up from order-0 with smaller orders at the edges for alignment
> cycle continues all the way to the end of the file (not shown).
>
> After the change, we round down the end boundary to the order boundary
> so we no longer get stuck in the cycle and can ramp up the order over
> time. Note that the rate of the ramp up is still not as we would expect
> it. We will fix that next. Here we are touching pages 17-256
> sequentially:
>
> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
> ----- ---------- ---------- ---------- ------- ------- ----- ----- --
> HOLE 0x00000000 0x00001000 4096 0 1 1
> FOLIO 0x00001000 0x00002000 4096 1 2 1 0
> FOLIO 0x00002000 0x00003000 4096 2 3 1 0
> FOLIO 0x00003000 0x00004000 4096 3 4 1 0
> FOLIO 0x00004000 0x00005000 4096 4 5 1 0
> FOLIO 0x00005000 0x00006000 4096 5 6 1 0
> FOLIO 0x00006000 0x00007000 4096 6 7 1 0
> FOLIO 0x00007000 0x00008000 4096 7 8 1 0
> FOLIO 0x00008000 0x00009000 4096 8 9 1 0
> FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
> FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
> FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
> FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
> FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
> FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
> FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
> FOLIO 0x00010000 0x00011000 4096 16 17 1 0
> FOLIO 0x00011000 0x00012000 4096 17 18 1 0
> FOLIO 0x00012000 0x00013000 4096 18 19 1 0
> FOLIO 0x00013000 0x00014000 4096 19 20 1 0
> FOLIO 0x00014000 0x00015000 4096 20 21 1 0
> FOLIO 0x00015000 0x00016000 4096 21 22 1 0
> FOLIO 0x00016000 0x00017000 4096 22 23 1 0
> FOLIO 0x00017000 0x00018000 4096 23 24 1 0
> FOLIO 0x00018000 0x00019000 4096 24 25 1 0
> FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
> FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
> FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
> FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
> FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
> FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
> FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
> FOLIO 0x00020000 0x00021000 4096 32 33 1 0
> FOLIO 0x00021000 0x00022000 4096 33 34 1 0
> FOLIO 0x00022000 0x00024000 8192 34 36 2 1
> FOLIO 0x00024000 0x00028000 16384 36 40 4 2
> FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
> FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
> FOLIO 0x00030000 0x00034000 16384 48 52 4 2
> FOLIO 0x00034000 0x00038000 16384 52 56 4 2
> FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
> FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
> FOLIO 0x00040000 0x00044000 16384 64 68 4 2
> FOLIO 0x00044000 0x00048000 16384 68 72 4 2
> FOLIO 0x00048000 0x0004c000 16384 72 76 4 2
> FOLIO 0x0004c000 0x00050000 16384 76 80 4 2
> FOLIO 0x00050000 0x00054000 16384 80 84 4 2
> FOLIO 0x00054000 0x00058000 16384 84 88 4 2
> FOLIO 0x00058000 0x0005c000 16384 88 92 4 2
> FOLIO 0x0005c000 0x00060000 16384 92 96 4 2
> FOLIO 0x00060000 0x00070000 65536 96 112 16 4
> FOLIO 0x00070000 0x00080000 65536 112 128 16 4
> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
> FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
> FOLIO 0x000e0000 0x00100000 131072 224 256 32 5
> FOLIO 0x00100000 0x00120000 131072 256 288 32 5
> FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y
> HOLE 0x00140000 0x00800000 7077888 320 2048 1728
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Looks good. When I was reading this code some time ago, I also felt we
should rather do some rounding instead of creating small folios, so
thanks for working on this. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> mm/readahead.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 8bb316f5a842..82f9f623f2d7 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -625,7 +625,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
> unsigned long max_pages;
> struct file_ra_state *ra = ractl->ra;
> pgoff_t index = readahead_index(ractl);
> - pgoff_t expected, start;
> + pgoff_t expected, start, end, aligned_end;
> unsigned int order = folio_order(folio);
>
> /* no readahead */
> @@ -657,7 +657,6 @@ void page_cache_async_ra(struct readahead_control *ractl,
> * the readahead window.
> */
> ra->size = max(ra->size, get_next_ra_size(ra, max_pages));
> - ra->async_size = ra->size;
> goto readit;
> }
>
> @@ -678,9 +677,13 @@ void page_cache_async_ra(struct readahead_control *ractl,
> ra->size = start - index; /* old async_size */
> ra->size += req_count;
> ra->size = get_next_ra_size(ra, max_pages);
> - ra->async_size = ra->size;
> readit:
> order += 2;
> + end = ra->start + ra->size;
> + aligned_end = round_down(end, 1UL << order);
> + if (aligned_end > ra->start)
> + ra->size -= end - aligned_end;
> + ra->async_size = ra->size;
> ractl->_index = ra->start;
> page_cache_ra_order(ractl, ra, order);
> }
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 2/5] mm/readahead: Terminate async readahead on natural boundary
2025-05-05 9:13 ` Jan Kara
@ 2025-05-05 9:37 ` Jan Kara
2025-05-06 9:28 ` Ryan Roberts
0 siblings, 1 reply; 40+ messages in thread
From: Jan Kara @ 2025-05-05 9:37 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan,
linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On Mon 05-05-25 11:13:26, Jan Kara wrote:
> On Wed 30-04-25 15:59:15, Ryan Roberts wrote:
> > Previously asynchonous readahead would read ra_pages (usually 128K)
> > directly after the end of the synchonous readahead and given the
> > synchronous readahead portion had no alignment guarantees (beyond page
> > boundaries) it is possible (and likely) that the end of the initial 128K
> > region would not fall on a natural boundary for the folio size being
> > used. Therefore smaller folios were used to align down to the required
> > boundary, both at the end of the previous readahead block and at the
> > start of the new one.
> >
> > In the worst cases, this can result in never properly ramping up the
> > folio size, and instead getting stuck oscillating between order-0, -1
> > and -2 folios. The next readahead will try to use folios whose order is
> > +2 bigger than the folio that had the readahead marker. But because of
> > the alignment requirements, that folio (the first one in the readahead
> > block) can end up being order-0 in some cases.
> >
> > There will be 2 modifications to solve this issue:
> >
> > 1) Calculate the readahead size so the end is aligned to a folio
> > boundary. This prevents needing to allocate small folios to align
> > down at the end of the window and fixes the oscillation problem.
> >
> > 2) Remember the "preferred folio order" in the ra state instead of
> > inferring it from the folio with the readahead marker. This solves
> > the slow ramp up problem (discussed in a subsequent patch).
> >
> > This patch addresses (1) only. A subsequent patch will address (2).
> >
> > Worked example:
> >
> > The following shows the previous pathalogical behaviour when the initial
> > synchronous readahead is unaligned. We start reading at page 17 in the
> > file and read sequentially from there. I'm showing a dump of the pages
> > in the page cache just after we read the first page of the folio with
> > the readahead marker.
<snip>
> Looks good. When I was reading this code some time ago, I also felt we
> should rather do some rounding instead of creating small folios so thanks
> for working on this. Feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
But now I've also remembered why what you do here isn't an obvious win.
There are storage devices (mostly RAID arrays) where optimum read size
isn't a power of 2. Think for example of a RAID-0 device composed of three
disks. It will have max_pages something like 384 (512k * 3). Suppose we are
on x86 and max_order is 9. Then previously (if we were lucky with
alignment) we were alternating between order 7 and order 8 pages in the
page cache and did optimally sized IOs of 1536k. Now you will allocate all
folios of order 8 (nice) but reads will be just 1024k and you'll see a
noticeable drop in read throughput (not nice). Note that this is not just a
theoretical example but a real case we have hit when doing performance
testing of servers and for which I was tweaking readahead code in the past.
So I think we need to tweak this logic a bit. Perhaps we should round_down
end to the minimum alignment dictated by 'order' and max_pages? Like:
1 << min(order, ffs(max_pages) + PAGE_SHIFT - 1)
If you set a badly aligned readahead size manually, you will get small
pages in the page cache, but that's just you being stupid. In practice,
hardware-induced readahead sizes need not be powers of 2 but they are
*sane* :).
Honza
> > diff --git a/mm/readahead.c b/mm/readahead.c
> > index 8bb316f5a842..82f9f623f2d7 100644
> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -625,7 +625,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
> > unsigned long max_pages;
> > struct file_ra_state *ra = ractl->ra;
> > pgoff_t index = readahead_index(ractl);
> > - pgoff_t expected, start;
> > + pgoff_t expected, start, end, aligned_end;
> > unsigned int order = folio_order(folio);
> >
> > /* no readahead */
> > @@ -657,7 +657,6 @@ void page_cache_async_ra(struct readahead_control *ractl,
> > * the readahead window.
> > */
> > ra->size = max(ra->size, get_next_ra_size(ra, max_pages));
> > - ra->async_size = ra->size;
> > goto readit;
> > }
> >
> > @@ -678,9 +677,13 @@ void page_cache_async_ra(struct readahead_control *ractl,
> > ra->size = start - index; /* old async_size */
> > ra->size += req_count;
> > ra->size = get_next_ra_size(ra, max_pages);
> > - ra->async_size = ra->size;
> > readit:
> > order += 2;
> > + end = ra->start + ra->size;
> > + aligned_end = round_down(end, 1UL << order);
> > + if (aligned_end > ra->start)
> > + ra->size -= end - aligned_end;
> > + ra->async_size = ra->size;
> > ractl->_index = ra->start;
> > page_cache_ra_order(ractl, ra, order);
> > }
> > --
> > 2.43.0
> >
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 2/5] mm/readahead: Terminate async readahead on natural boundary
2025-05-05 9:37 ` Jan Kara
@ 2025-05-06 9:28 ` Ryan Roberts
2025-05-06 11:29 ` Jan Kara
0 siblings, 1 reply; 40+ messages in thread
From: Ryan Roberts @ 2025-05-06 9:28 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan,
linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On 05/05/2025 10:37, Jan Kara wrote:
> On Mon 05-05-25 11:13:26, Jan Kara wrote:
>> On Wed 30-04-25 15:59:15, Ryan Roberts wrote:
>>> Previously asynchonous readahead would read ra_pages (usually 128K)
>>> directly after the end of the synchonous readahead and given the
>>> synchronous readahead portion had no alignment guarantees (beyond page
>>> boundaries) it is possible (and likely) that the end of the initial 128K
>>> region would not fall on a natural boundary for the folio size being
>>> used. Therefore smaller folios were used to align down to the required
>>> boundary, both at the end of the previous readahead block and at the
>>> start of the new one.
>>>
>>> In the worst cases, this can result in never properly ramping up the
>>> folio size, and instead getting stuck oscillating between order-0, -1
>>> and -2 folios. The next readahead will try to use folios whose order is
>>> +2 bigger than the folio that had the readahead marker. But because of
>>> the alignment requirements, that folio (the first one in the readahead
>>> block) can end up being order-0 in some cases.
>>>
>>> There will be 2 modifications to solve this issue:
>>>
>>> 1) Calculate the readahead size so the end is aligned to a folio
>>> boundary. This prevents needing to allocate small folios to align
>>> down at the end of the window and fixes the oscillation problem.
>>>
>>> 2) Remember the "preferred folio order" in the ra state instead of
>>> inferring it from the folio with the readahead marker. This solves
>>> the slow ramp up problem (discussed in a subsequent patch).
>>>
>>> This patch addresses (1) only. A subsequent patch will address (2).
>>>
>>> Worked example:
>>>
>>> The following shows the previous pathalogical behaviour when the initial
>>> synchronous readahead is unaligned. We start reading at page 17 in the
>>> file and read sequentially from there. I'm showing a dump of the pages
>>> in the page cache just after we read the first page of the folio with
>>> the readahead marker.
>
> <snip>
>
>> Looks good. When I was reading this code some time ago, I also felt we
>> should rather do some rounding instead of creating small folios so thanks
>> for working on this. Feel free to add:
>>
>> Reviewed-by: Jan Kara <jack@suse.cz>
>
> But now I've also remembered why what you do here isn't an obvious win.
> There are storage devices (mostly RAID arrays) where optimum read size
> isn't a power of 2. Think for example a RAID-0 device composed from three
> disks. It will have max_pages something like 384 (512k * 3). Suppose we are
> on x86 and max_order is 9. Then previously (if we were lucky with
> alignment) we were alternating between order 7 and order 8 pages in the
> page cache and do optimally sized IOs od 1536k.
Sorry I'm struggling to follow some of this, perhaps my superficial
understanding of all the readahead subtleties is starting to show...
How is the 384 figure provided? I'd guess that comes from bdi->io_pages, and
bdi->ra_pages would remain the usual 32 (128K)? In which case, for mmap, won't
we continue to be limited by ra_pages and will never get beyond order-5? (for
mmap req_size is always set to ra_pages IIRC, so ractl_max_pages() always just
returns ra_pages). Or perhaps ra_pages is set to 384 somewhere, but I'm not
spotting it in the code...
I guess you are also implicitly teaching me something about how the block
layer works here too... if there are 2 read requests, one order-7 and one
order-8, then the block layer will merge those into a single read (up to
the 384-page optimal size?), but if there are 2 reads of order-8 then it
won't merge them, because the result would be bigger than the optimal
size, and it won't split the second one at the optimal size either? Have
I inferred that correctly?
> Now you will allocate all
> folios of order 8 (nice) but reads will be just 1024k and you'll see
> noticeable drop in read throughput (not nice). Note that this is not just a
> theoretical example but a real case we have hit when doing performance
> testing of servers and for which I was tweaking readahead code in the past.
>
> So I think we need to tweak this logic a bit. Perhaps we should round_down
> end to the minimum alignment dictated by 'order' and maxpages? Like:
>
> 1 << min(order, ffs(max_pages) + PAGE_SHIFT - 1)
Sorry I'm staring at this and struggling to understand the "PAGE_SHIFT - 1" part?
I think what you are suggesting is that the patch becomes something like this:
---8<---
+ end = ra->start + ra->size;
+ aligned_end = round_down(end, 1UL << min(order, ilog2(max_pages)));
+ if (aligned_end > ra->start)
+ ra->size -= end - aligned_end;
+ ra->async_size = ra->size;
---8<---
So if max_pages=384, then ilog2(max_pages) is 8, meaning aligned_end will
be rounded down to at most the previous 256-page (1MB) boundary?
Thanks,
Ryan
>
> If you set badly aligned readahead size manually, you will get small pages
> in the page cache but that's just you being stupid. In practice, hardware
> induced readahead size need not be powers of 2 but they are *sane* :).
>
> Honza
>
>>> diff --git a/mm/readahead.c b/mm/readahead.c
>>> index 8bb316f5a842..82f9f623f2d7 100644
>>> --- a/mm/readahead.c
>>> +++ b/mm/readahead.c
>>> @@ -625,7 +625,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
>>> unsigned long max_pages;
>>> struct file_ra_state *ra = ractl->ra;
>>> pgoff_t index = readahead_index(ractl);
>>> - pgoff_t expected, start;
>>> + pgoff_t expected, start, end, aligned_end;
>>> unsigned int order = folio_order(folio);
>>>
>>> /* no readahead */
>>> @@ -657,7 +657,6 @@ void page_cache_async_ra(struct readahead_control *ractl,
>>> * the readahead window.
>>> */
>>> ra->size = max(ra->size, get_next_ra_size(ra, max_pages));
>>> - ra->async_size = ra->size;
>>> goto readit;
>>> }
>>>
>>> @@ -678,9 +677,13 @@ void page_cache_async_ra(struct readahead_control *ractl,
>>> ra->size = start - index; /* old async_size */
>>> ra->size += req_count;
>>> ra->size = get_next_ra_size(ra, max_pages);
>>> - ra->async_size = ra->size;
>>> readit:
>>> order += 2;
>>> + end = ra->start + ra->size;
>>> + aligned_end = round_down(end, 1UL << order);
>>> + if (aligned_end > ra->start)
>>> + ra->size -= end - aligned_end;
>>> + ra->async_size = ra->size;
>>> ractl->_index = ra->start;
>>> page_cache_ra_order(ractl, ra, order);
>>> }
>>> --
>>> 2.43.0
>>>
>> --
>> Jan Kara <jack@suse.com>
>> SUSE Labs, CR
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 2/5] mm/readahead: Terminate async readahead on natural boundary
2025-05-06 9:28 ` Ryan Roberts
@ 2025-05-06 11:29 ` Jan Kara
2025-05-06 15:31 ` Ryan Roberts
0 siblings, 1 reply; 40+ messages in thread
From: Jan Kara @ 2025-05-06 11:29 UTC (permalink / raw)
To: Ryan Roberts
Cc: Jan Kara, Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan,
linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On Tue 06-05-25 10:28:11, Ryan Roberts wrote:
> On 05/05/2025 10:37, Jan Kara wrote:
> > On Mon 05-05-25 11:13:26, Jan Kara wrote:
> >> On Wed 30-04-25 15:59:15, Ryan Roberts wrote:
> >>> Previously asynchronous readahead would read ra_pages (usually 128K)
> >>> directly after the end of the synchronous readahead and given the
> >>> synchronous readahead portion had no alignment guarantees (beyond page
> >>> boundaries) it is possible (and likely) that the end of the initial 128K
> >>> region would not fall on a natural boundary for the folio size being
> >>> used. Therefore smaller folios were used to align down to the required
> >>> boundary, both at the end of the previous readahead block and at the
> >>> start of the new one.
> >>>
> >>> In the worst cases, this can result in never properly ramping up the
> >>> folio size, and instead getting stuck oscillating between order-0, -1
> >>> and -2 folios. The next readahead will try to use folios whose order is
> >>> +2 bigger than the folio that had the readahead marker. But because of
> >>> the alignment requirements, that folio (the first one in the readahead
> >>> block) can end up being order-0 in some cases.
> >>>
> >>> There will be 2 modifications to solve this issue:
> >>>
> >>> 1) Calculate the readahead size so the end is aligned to a folio
> >>> boundary. This prevents needing to allocate small folios to align
> >>> down at the end of the window and fixes the oscillation problem.
> >>>
> >>> 2) Remember the "preferred folio order" in the ra state instead of
> >>> inferring it from the folio with the readahead marker. This solves
> >>> the slow ramp up problem (discussed in a subsequent patch).
> >>>
> >>> This patch addresses (1) only. A subsequent patch will address (2).
> >>>
> >>> Worked example:
> >>>
> >>> The following shows the previous pathological behaviour when the initial
> >>> synchronous readahead is unaligned. We start reading at page 17 in the
> >>> file and read sequentially from there. I'm showing a dump of the pages
> >>> in the page cache just after we read the first page of the folio with
> >>> the readahead marker.
> >
> > <snip>
> >
> >> Looks good. When I was reading this code some time ago, I also felt we
> >> should rather do some rounding instead of creating small folios so thanks
> >> for working on this. Feel free to add:
> >>
> >> Reviewed-by: Jan Kara <jack@suse.cz>
> >
> > But now I've also remembered why what you do here isn't an obvious win.
> > There are storage devices (mostly RAID arrays) where optimum read size
> > isn't a power of 2. Think for example a RAID-0 device composed from three
> > disks. It will have max_pages something like 384 (512k * 3). Suppose we are
> > on x86 and max_order is 9. Then previously (if we were lucky with
> > alignment) we were alternating between order 7 and order 8 pages in the
> > page cache and doing optimally sized IOs of 1536k.
>
> Sorry I'm struggling to follow some of this, perhaps my superficial
> understanding of all the readahead subtleties is starting to show...
>
> How is the 384 figure provided? I'd guess that comes from bdi->io_pages, and
> bdi->ra_pages would remain the usual 32 (128K)?
Sorry, I was probably too brief in my previous message :)
bdi->ra_pages is actually set based on optimal IO size reported by the
hardware (see blk_apply_bdi_limits() and how its callers are filling in
lim->io_opt). The 128K you speak about is just a last-resort value if
hardware doesn't provide one. And some storage devices do report optimal IO
size that is not a power of two.
Also note that bdi->ra_pages can be tuned in sysfs and a lot of users
actually do this (usually from their udev rules). We don't have to perform
well when some odd value gets set but you definitely cannot assume
bdi->ra_pages is 128K :).
> In which case, for mmap, won't
> we continue to be limited by ra_pages and will never get beyond order-5? (for
> mmap req_size is always set to ra_pages IIRC, so ractl_max_pages() always just
> returns ra_pages). Or perhaps ra_pages is set to 384 somewhere, but I'm not
> spotting it in the code...
>
> I guess you are also implicitly teaching me something about how the block layer
> works here too... if there are 2 read requests for an order-7 and order-8, then
> the block layer will merge those into a single read (up to the 384 optimal size?)
Correct. In fact readahead code will already perform this merging when
submitting the IO.
> but if there are 2 reads of order-8 then it won't merge because it would be
> bigger than the optimal size and it won't split the second one at the optimal
> size either? Have I inferred that correctly?
With the code as you modify it, you would round down ra->size from 384 to
256 and submit only one 1MB sized IO (with one order-8 page). And this will
cause a regression in read throughput for such devices because they now don't
get a buffer large enough to run at full speed.
> > Now you will allocate all
> > folios of order 8 (nice) but reads will be just 1024k and you'll see
> > noticeable drop in read throughput (not nice). Note that this is not just a
> > theoretical example but a real case we have hit when doing performance
> > testing of servers and for which I was tweaking readahead code in the past.
> >
> > So I think we need to tweak this logic a bit. Perhaps we should round_down
> > end to the minimum alignment dictated by 'order' and maxpages? Like:
> >
> > 1 << min(order, ffs(max_pages) + PAGE_SHIFT - 1)
>
> Sorry I'm staring at this and struggling to understand the "PAGE_SHIFT -
> 1" part?
My bad. It should have been:
1 << min(order, ffs(max_pages) - 1)
> I think what you are suggesting is that the patch becomes something like
> this:
>
> ---8<---
> + end = ra->start + ra->size;
> + aligned_end = round_down(end, 1UL << min(order, ilog2(max_pages)));
Not quite. ilog2() returns the most significant bit set but we really want
to align to the least significant bit set. So when max_pages is 384, we
want to align to at most order-7 (aligning the end more does not make sense
when you want to do IO 384 pages large). That's why I'm using ffs() and not
ilog2().
> + if (aligned_end > ra->start)
> + ra->size -= end - aligned_end;
> + ra->async_size = ra->size;
> ---8<---
>
> So if max_pages=384, then aligned_end will be aligned down to a maximum
> of the previous 1MB boundary?
No, it needs to be aligned only to the previous 512K boundary because we want
to do IOs 3*512K large.
Hope things are a bit clearer now :)
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 2/5] mm/readahead: Terminate async readahead on natural boundary
2025-05-06 11:29 ` Jan Kara
@ 2025-05-06 15:31 ` Ryan Roberts
0 siblings, 0 replies; 40+ messages in thread
From: Ryan Roberts @ 2025-05-06 15:31 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan,
linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On 06/05/2025 12:29, Jan Kara wrote:
> On Tue 06-05-25 10:28:11, Ryan Roberts wrote:
>> On 05/05/2025 10:37, Jan Kara wrote:
>>> On Mon 05-05-25 11:13:26, Jan Kara wrote:
>>>> On Wed 30-04-25 15:59:15, Ryan Roberts wrote:
> >>>>> Previously asynchronous readahead would read ra_pages (usually 128K)
> >>>>> directly after the end of the synchronous readahead and given the
>>>>> synchronous readahead portion had no alignment guarantees (beyond page
>>>>> boundaries) it is possible (and likely) that the end of the initial 128K
>>>>> region would not fall on a natural boundary for the folio size being
>>>>> used. Therefore smaller folios were used to align down to the required
>>>>> boundary, both at the end of the previous readahead block and at the
>>>>> start of the new one.
>>>>>
>>>>> In the worst cases, this can result in never properly ramping up the
>>>>> folio size, and instead getting stuck oscillating between order-0, -1
>>>>> and -2 folios. The next readahead will try to use folios whose order is
>>>>> +2 bigger than the folio that had the readahead marker. But because of
>>>>> the alignment requirements, that folio (the first one in the readahead
>>>>> block) can end up being order-0 in some cases.
>>>>>
>>>>> There will be 2 modifications to solve this issue:
>>>>>
>>>>> 1) Calculate the readahead size so the end is aligned to a folio
>>>>> boundary. This prevents needing to allocate small folios to align
>>>>> down at the end of the window and fixes the oscillation problem.
>>>>>
>>>>> 2) Remember the "preferred folio order" in the ra state instead of
>>>>> inferring it from the folio with the readahead marker. This solves
>>>>> the slow ramp up problem (discussed in a subsequent patch).
>>>>>
>>>>> This patch addresses (1) only. A subsequent patch will address (2).
>>>>>
>>>>> Worked example:
>>>>>
> >>>>> The following shows the previous pathological behaviour when the initial
>>>>> synchronous readahead is unaligned. We start reading at page 17 in the
>>>>> file and read sequentially from there. I'm showing a dump of the pages
>>>>> in the page cache just after we read the first page of the folio with
>>>>> the readahead marker.
>>>
>>> <snip>
>>>
>>>> Looks good. When I was reading this code some time ago, I also felt we
>>>> should rather do some rounding instead of creating small folios so thanks
>>>> for working on this. Feel free to add:
>>>>
>>>> Reviewed-by: Jan Kara <jack@suse.cz>
>>>
>>> But now I've also remembered why what you do here isn't an obvious win.
>>> There are storage devices (mostly RAID arrays) where optimum read size
>>> isn't a power of 2. Think for example a RAID-0 device composed from three
>>> disks. It will have max_pages something like 384 (512k * 3). Suppose we are
>>> on x86 and max_order is 9. Then previously (if we were lucky with
>>> alignment) we were alternating between order 7 and order 8 pages in the
> >>> page cache and doing optimally sized IOs of 1536k.
>>
>> Sorry I'm struggling to follow some of this, perhaps my superficial
>> understanding of all the readahead subtleties is starting to show...
>>
>> How is the 384 figure provided? I'd guess that comes from bdi->io_pages, and
>> bdi->ra_pages would remain the usual 32 (128K)?
>
> Sorry, I was probably too brief in my previous message :)
> bdi->ra_pages is actually set based on optimal IO size reported by the
> hardware (see blk_apply_bdi_limits() and how its callers are filling in
> lim->io_opt). The 128K you speak about is just a last-resort value if
> hardware doesn't provide one. And some storage devices do report optimal IO
> size that is not a power of two.
Ahh, got it - thanks for the education!
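For my own notes, the arithmetic behind the 384 figure, as a purely
illustrative userspace sketch (assuming 4K pages and the three-disk
512K-stripe RAID-0 example; in the kernel the conversion happens via
blk_apply_bdi_limits() as you say):

---8<---
#include <stdio.h>

int main(void)
{
	unsigned long page_size = 4096;			/* assumed 4K pages */
	unsigned long stripe = 512 * 1024;		/* 512K per disk */
	unsigned long nr_disks = 3;			/* RAID-0 over 3 disks */

	/* the array reports the full stripe width as its optimal IO size */
	unsigned long io_opt = stripe * nr_disks;	/* 1536K */

	/* bdi->ra_pages ends up as io_opt expressed in pages */
	unsigned long ra_pages = io_opt / page_size;	/* 384 */

	printf("io_opt=%luK ra_pages=%lu\n", io_opt / 1024, ra_pages);
	return 0;
}
---8<---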
>
> Also note that bdi->ra_pages can be tuned in sysfs and a lot of users
> actually do this (usually from their udev rules). We don't have to perform
> well when some odd value gets set but you definitely cannot assume
> bdi->ra_pages is 128K :).
>
>> In which case, for mmap, won't
>> we continue to be limited by ra_pages and will never get beyond order-5? (for
>> mmap req_size is always set to ra_pages IIRC, so ractl_max_pages() always just
>> returns ra_pages). Or perhaps ra_pages is set to 384 somewhere, but I'm not
>> spotting it in the code...
>>
>> I guess you are also implicitly teaching me something about how the block layer
>> works here too... if there are 2 read requests for an order-7 and order-8, then
>> the block layer will merge those into a single read (up to the 384 optimal size?)
>
> Correct. In fact readahead code will already perform this merging when
> submitting the IO.
>
>> but if there are 2 reads of order-8 then it won't merge because it would be
>> bigger than the optimal size and it won't split the second one at the optimal
>> size either? Have I inferred that correctly?
>
> With the code as you modify it, you would round down ra->size from 384 to
> 256 and submit only one 1MB sized IO (with one order-8 page). And this will
> cause a regression in read throughput for such devices because they now don't
> get a buffer large enough to run at full speed.
Ahha, yes, thanks - now it's clicking.
>
>>> Now you will allocate all
>>> folios of order 8 (nice) but reads will be just 1024k and you'll see
>>> noticeable drop in read throughput (not nice). Note that this is not just a
>>> theoretical example but a real case we have hit when doing performance
>>> testing of servers and for which I was tweaking readahead code in the past.
>>>
>>> So I think we need to tweak this logic a bit. Perhaps we should round_down
>>> end to the minimum alignment dictated by 'order' and maxpages? Like:
>>>
>>> 1 << min(order, ffs(max_pages) + PAGE_SHIFT - 1)
>>
>> Sorry I'm staring at this and struggling to understand the "PAGE_SHIFT -
>> 1" part?
>
> My bad. It should have been:
>
> 1 << min(order, ffs(max_pages) - 1)
>
>> I think what you are suggesting is that the patch becomes something like
>> this:
>>
>> ---8<---
>> + end = ra->start + ra->size;
>> + aligned_end = round_down(end, 1UL << min(order, ilog2(max_pages)));
>
> Not quite. ilog2() returns the most significant bit set but we really want
> to align to the least significant bit set. So when max_pages is 384, we
> want to align to at most order-7 (aligning the end more does not make sense
> when you want to do IO 384 pages large). That's why I'm using ffs() and not
> ilog2().
Yep got it now.
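To convince myself, a quick throwaway userspace sketch of the arithmetic
(round_down()/min() are open-coded stand-ins for the kernel macros, the
numbers are just the ones from this thread, and ra->start is assumed to
already be 512K aligned):

---8<---
#include <stdio.h>
#include <strings.h>	/* ffs() */

#define round_down(x, y)	((x) & ~((y) - 1))
#define min(a, b)		((a) < (b) ? (a) : (b))

int main(void)
{
	unsigned long start = 0;	/* assume a 512K-aligned window start */
	unsigned long size = 384;	/* ra->size == max_pages == 384 pages */
	int max_pages = 384;
	unsigned int order = 8;

	unsigned long end = start + size;

	/* aligning to the folio order alone trims the window to 256 pages */
	unsigned long order_only = round_down(end, 1UL << order);

	/* capping the shift at ffs(max_pages) - 1 keeps all 384 pages */
	unsigned int shift = min(order, (unsigned int)(ffs(max_pages) - 1));
	unsigned long capped = round_down(end, 1UL << shift);

	printf("order-only: %lu pages, capped (shift=%u): %lu pages\n",
	       order_only, shift, capped);
	return 0;
}
---8<---

i.e. with max_pages=384 the end is only rounded to a 512K (order-7)
boundary, so the full 1536K window survives instead of being cut back
to 1MB.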
>
>> + if (aligned_end > ra->start)
>> + ra->size -= end - aligned_end;
>> + ra->async_size = ra->size;
>> ---8<---
>>
>> So if max_pages=384, then aligned_end will be aligned down to a maximum
>> of the previous 1MB boundary?
>
> No, it needs to be aligned only to the previous 512K boundary because we want
> to do IOs 3*512K large.
>
> Hope things are a bit clearer now :)
Yes, much!
Thanks,
Ryan
>
> Honza
^ permalink raw reply [flat|nested] 40+ messages in thread
* [RFC PATCH v4 3/5] mm/readahead: Make space in struct file_ra_state
2025-04-30 14:59 [RFC PATCH v4 0/5] Readahead tweaks for larger folios Ryan Roberts
2025-04-30 14:59 ` [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order() Ryan Roberts
2025-04-30 14:59 ` [RFC PATCH v4 2/5] mm/readahead: Terminate async readahead on natural boundary Ryan Roberts
@ 2025-04-30 14:59 ` Ryan Roberts
2025-05-05 9:39 ` Jan Kara
` (2 more replies)
2025-04-30 14:59 ` [RFC PATCH v4 4/5] mm/readahead: Store folio order " Ryan Roberts
` (2 subsequent siblings)
5 siblings, 3 replies; 40+ messages in thread
From: Ryan Roberts @ 2025-04-30 14:59 UTC (permalink / raw)
To: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-fsdevel,
linux-mm
We need to be able to store the preferred folio order associated with a
readahead request in the struct file_ra_state so that we can more
accurately increase the order across subsequent readahead requests. But
struct file_ra_state is per-struct file, so we don't really want to
increase its size.
mmap_miss is currently 32 bits but it is only counted up to 10 *
MMAP_LOTSAMISS, which is currently defined as 1000. So 16 bits should be
plenty. Redefine it to unsigned short, making room for order as unsigned
short in follow up commit.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
include/linux/fs.h | 2 +-
mm/filemap.c | 11 ++++++-----
2 files changed, 7 insertions(+), 6 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 016b0fe1536e..44362bef0010 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1042,7 +1042,7 @@ struct file_ra_state {
unsigned int size;
unsigned int async_size;
unsigned int ra_pages;
- unsigned int mmap_miss;
+ unsigned short mmap_miss;
loff_t prev_pos;
};
diff --git a/mm/filemap.c b/mm/filemap.c
index 7b90cbeb4a1a..fa129ecfd80f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3207,7 +3207,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
struct file *fpin = NULL;
unsigned long vm_flags = vmf->vma->vm_flags;
- unsigned int mmap_miss;
+ unsigned short mmap_miss;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/* Use the readahead code, even if readahead is disabled */
@@ -3275,7 +3275,7 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
struct file_ra_state *ra = &file->f_ra;
DEFINE_READAHEAD(ractl, file, ra, file->f_mapping, vmf->pgoff);
struct file *fpin = NULL;
- unsigned int mmap_miss;
+ unsigned short mmap_miss;
/* If we don't want any read-ahead, don't bother */
if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
@@ -3595,7 +3595,7 @@ static struct folio *next_uptodate_folio(struct xa_state *xas,
static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
struct folio *folio, unsigned long start,
unsigned long addr, unsigned int nr_pages,
- unsigned long *rss, unsigned int *mmap_miss)
+ unsigned long *rss, unsigned short *mmap_miss)
{
vm_fault_t ret = 0;
struct page *page = folio_page(folio, start);
@@ -3657,7 +3657,7 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
struct folio *folio, unsigned long addr,
- unsigned long *rss, unsigned int *mmap_miss)
+ unsigned long *rss, unsigned short *mmap_miss)
{
vm_fault_t ret = 0;
struct page *page = &folio->page;
@@ -3699,7 +3699,8 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
struct folio *folio;
vm_fault_t ret = 0;
unsigned long rss = 0;
- unsigned int nr_pages = 0, mmap_miss = 0, mmap_miss_saved, folio_type;
+ unsigned int nr_pages = 0, folio_type;
+ unsigned short mmap_miss = 0, mmap_miss_saved;
rcu_read_lock();
folio = next_uptodate_folio(&xas, mapping, end_pgoff);
--
2.43.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 3/5] mm/readahead: Make space in struct file_ra_state
2025-04-30 14:59 ` [RFC PATCH v4 3/5] mm/readahead: Make space in struct file_ra_state Ryan Roberts
@ 2025-05-05 9:39 ` Jan Kara
2025-05-05 9:57 ` David Hildenbrand
2025-05-09 10:00 ` Pankaj Raghav (Samsung)
2 siblings, 0 replies; 40+ messages in thread
From: Jan Kara @ 2025-05-05 9:39 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan,
linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On Wed 30-04-25 15:59:16, Ryan Roberts wrote:
> We need to be able to store the preferred folio order associated with a
> readahead request in the struct file_ra_state so that we can more
> accurately increase the order across subsequent readahead requests. But
> struct file_ra_state is per-struct file, so we don't really want to
> increase its size.
>
> mmap_miss is currently 32 bits but it is only counted up to 10 *
> MMAP_LOTSAMISS, which is currently defined as 1000. So 16 bits should be
> plenty. Redefine it to unsigned short, making room for order as unsigned
> short in follow up commit.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Sure. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> include/linux/fs.h | 2 +-
> mm/filemap.c | 11 ++++++-----
> 2 files changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 016b0fe1536e..44362bef0010 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1042,7 +1042,7 @@ struct file_ra_state {
> unsigned int size;
> unsigned int async_size;
> unsigned int ra_pages;
> - unsigned int mmap_miss;
> + unsigned short mmap_miss;
> loff_t prev_pos;
> };
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7b90cbeb4a1a..fa129ecfd80f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3207,7 +3207,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
> struct file *fpin = NULL;
> unsigned long vm_flags = vmf->vma->vm_flags;
> - unsigned int mmap_miss;
> + unsigned short mmap_miss;
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> /* Use the readahead code, even if readahead is disabled */
> @@ -3275,7 +3275,7 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
> struct file_ra_state *ra = &file->f_ra;
> DEFINE_READAHEAD(ractl, file, ra, file->f_mapping, vmf->pgoff);
> struct file *fpin = NULL;
> - unsigned int mmap_miss;
> + unsigned short mmap_miss;
>
> /* If we don't want any read-ahead, don't bother */
> if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
> @@ -3595,7 +3595,7 @@ static struct folio *next_uptodate_folio(struct xa_state *xas,
> static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
> struct folio *folio, unsigned long start,
> unsigned long addr, unsigned int nr_pages,
> - unsigned long *rss, unsigned int *mmap_miss)
> + unsigned long *rss, unsigned short *mmap_miss)
> {
> vm_fault_t ret = 0;
> struct page *page = folio_page(folio, start);
> @@ -3657,7 +3657,7 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>
> static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
> struct folio *folio, unsigned long addr,
> - unsigned long *rss, unsigned int *mmap_miss)
> + unsigned long *rss, unsigned short *mmap_miss)
> {
> vm_fault_t ret = 0;
> struct page *page = &folio->page;
> @@ -3699,7 +3699,8 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
> struct folio *folio;
> vm_fault_t ret = 0;
> unsigned long rss = 0;
> - unsigned int nr_pages = 0, mmap_miss = 0, mmap_miss_saved, folio_type;
> + unsigned int nr_pages = 0, folio_type;
> + unsigned short mmap_miss = 0, mmap_miss_saved;
>
> rcu_read_lock();
> folio = next_uptodate_folio(&xas, mapping, end_pgoff);
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 3/5] mm/readahead: Make space in struct file_ra_state
2025-04-30 14:59 ` [RFC PATCH v4 3/5] mm/readahead: Make space in struct file_ra_state Ryan Roberts
2025-05-05 9:39 ` Jan Kara
@ 2025-05-05 9:57 ` David Hildenbrand
2025-05-09 10:00 ` Pankaj Raghav (Samsung)
2 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2025-05-05 9:57 UTC (permalink / raw)
To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
Alexander Viro, Christian Brauner, Jan Kara, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On 30.04.25 16:59, Ryan Roberts wrote:
> We need to be able to store the preferred folio order associated with a
> readahead request in the struct file_ra_state so that we can more
> accurately increase the order across subsequent readahead requests. But
> struct file_ra_state is per-struct file, so we don't really want to
> increase its size.
>
> mmap_miss is currently 32 bits but it is only counted up to 10 *
> MMAP_LOTSAMISS, which is currently defined as 1000. So 16 bits should be
> plenty. Redefine it to unsigned short, making room for order as unsigned
> short in follow up commit.
Makes sense and LGTM
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 3/5] mm/readahead: Make space in struct file_ra_state
2025-04-30 14:59 ` [RFC PATCH v4 3/5] mm/readahead: Make space in struct file_ra_state Ryan Roberts
2025-05-05 9:39 ` Jan Kara
2025-05-05 9:57 ` David Hildenbrand
@ 2025-05-09 10:00 ` Pankaj Raghav (Samsung)
2 siblings, 0 replies; 40+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-05-09 10:00 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan,
linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On Wed, Apr 30, 2025 at 03:59:16PM +0100, Ryan Roberts wrote:
> We need to be able to store the preferred folio order associated with a
> readahead request in the struct file_ra_state so that we can more
> accurately increase the order across subsequent readahead requests. But
> struct file_ra_state is per-struct file, so we don't really want to
> increase its size.
>
> mmap_miss is currently 32 bits but it is only counted up to 10 *
> MMAP_LOTSAMISS, which is currently defined as 1000. So 16 bits should be
> plenty. Redefine it to unsigned short, making room for order as unsigned
> short in follow up commit.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Looks good.
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
> include/linux/fs.h | 2 +-
> mm/filemap.c | 11 ++++++-----
> 2 files changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 016b0fe1536e..44362bef0010 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1042,7 +1042,7 @@ struct file_ra_state {
> unsigned int size;
> unsigned int async_size;
> unsigned int ra_pages;
> - unsigned int mmap_miss;
> + unsigned short mmap_miss;
> loff_t prev_pos;
> };
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7b90cbeb4a1a..fa129ecfd80f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3207,7 +3207,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
> struct file *fpin = NULL;
> unsigned long vm_flags = vmf->vma->vm_flags;
> - unsigned int mmap_miss;
> + unsigned short mmap_miss;
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> /* Use the readahead code, even if readahead is disabled */
> @@ -3275,7 +3275,7 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
> struct file_ra_state *ra = &file->f_ra;
> DEFINE_READAHEAD(ractl, file, ra, file->f_mapping, vmf->pgoff);
> struct file *fpin = NULL;
> - unsigned int mmap_miss;
> + unsigned short mmap_miss;
>
> /* If we don't want any read-ahead, don't bother */
> if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
> @@ -3595,7 +3595,7 @@ static struct folio *next_uptodate_folio(struct xa_state *xas,
> static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
> struct folio *folio, unsigned long start,
> unsigned long addr, unsigned int nr_pages,
> - unsigned long *rss, unsigned int *mmap_miss)
> + unsigned long *rss, unsigned short *mmap_miss)
> {
> vm_fault_t ret = 0;
> struct page *page = folio_page(folio, start);
> @@ -3657,7 +3657,7 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>
> static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
> struct folio *folio, unsigned long addr,
> - unsigned long *rss, unsigned int *mmap_miss)
> + unsigned long *rss, unsigned short *mmap_miss)
> {
> vm_fault_t ret = 0;
> struct page *page = &folio->page;
> @@ -3699,7 +3699,8 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
> struct folio *folio;
> vm_fault_t ret = 0;
> unsigned long rss = 0;
> - unsigned int nr_pages = 0, mmap_miss = 0, mmap_miss_saved, folio_type;
> + unsigned int nr_pages = 0, folio_type;
> + unsigned short mmap_miss = 0, mmap_miss_saved;
>
> rcu_read_lock();
> folio = next_uptodate_folio(&xas, mapping, end_pgoff);
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* [RFC PATCH v4 4/5] mm/readahead: Store folio order in struct file_ra_state
2025-04-30 14:59 [RFC PATCH v4 0/5] Readahead tweaks for larger folios Ryan Roberts
` (2 preceding siblings ...)
2025-04-30 14:59 ` [RFC PATCH v4 3/5] mm/readahead: Make space in struct file_ra_state Ryan Roberts
@ 2025-04-30 14:59 ` Ryan Roberts
2025-05-05 9:52 ` Jan Kara
2025-05-05 10:08 ` David Hildenbrand
2025-04-30 14:59 ` [RFC PATCH v4 5/5] mm/filemap: Allow arch to request folio size for exec memory Ryan Roberts
2025-05-06 10:05 ` [RFC PATCH v4 0/5] Readahead tweaks for larger folios Ryan Roberts
5 siblings, 2 replies; 40+ messages in thread
From: Ryan Roberts @ 2025-04-30 14:59 UTC (permalink / raw)
To: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-fsdevel,
linux-mm
Previously the folio order of the previous readahead request was
inferred from the folio whose readahead marker was hit. But due to the
way we have to round to non-natural boundaries sometimes, this first
folio in the readahead block is often smaller than the preferred order
for that request. This means that for cases where the initial sync
readahead is poorly aligned, the folio order will ramp up much more
slowly.
So instead, let's store the order in struct file_ra_state so we are not
affected by any required alignment. We previously made enough room in
the struct for a 16-bit order field. This should be plenty big enough since
we are limited to MAX_PAGECACHE_ORDER anyway, which is certainly never
larger than ~20.
Since we now pass order in struct file_ra_state, page_cache_ra_order()
no longer needs its new_order parameter, so let's remove that.
Worked example:
Here we are touching pages 17-256 sequentially just as we did in the
previous commit, but now that we are remembering the preferred order
explicitly, we no longer have the slow ramp up problem. Note
specifically that we no longer have 2 rounds (2x ~128K) of order-2
folios:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
FOLIO 0x00021000 0x00022000 4096 33 34 1 0
FOLIO 0x00022000 0x00024000 8192 34 36 2 1
FOLIO 0x00024000 0x00028000 16384 36 40 4 2
FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
FOLIO 0x00030000 0x00034000 16384 48 52 4 2
FOLIO 0x00034000 0x00038000 16384 52 56 4 2
FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
FOLIO 0x00040000 0x00050000 65536 64 80 16 4
FOLIO 0x00050000 0x00060000 65536 80 96 16 4
FOLIO 0x00060000 0x00080000 131072 96 128 32 5
FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
FOLIO 0x000e0000 0x00100000 131072 224 256 32 5
FOLIO 0x00100000 0x00120000 131072 256 288 32 5
FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y
HOLE 0x00140000 0x00800000 7077888 320 2048 1728
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
include/linux/fs.h | 2 ++
mm/filemap.c | 6 ++++--
mm/internal.h | 3 +--
mm/readahead.c | 18 +++++++++++-------
4 files changed, 18 insertions(+), 11 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 44362bef0010..cde482a7270a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1031,6 +1031,7 @@ struct fown_struct {
* and so were/are genuinely "ahead". Start next readahead when
* the first of these pages is accessed.
* @ra_pages: Maximum size of a readahead request, copied from the bdi.
+ * @order: Preferred folio order used for most recent readahead.
* @mmap_miss: How many mmap accesses missed in the page cache.
* @prev_pos: The last byte in the most recent read request.
*
@@ -1042,6 +1043,7 @@ struct file_ra_state {
unsigned int size;
unsigned int async_size;
unsigned int ra_pages;
+ unsigned short order;
unsigned short mmap_miss;
loff_t prev_pos;
};
diff --git a/mm/filemap.c b/mm/filemap.c
index fa129ecfd80f..e61f374068d4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3222,7 +3222,8 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
if (!(vm_flags & VM_RAND_READ))
ra->size *= 2;
ra->async_size = HPAGE_PMD_NR;
- page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER);
+ ra->order = HPAGE_PMD_ORDER;
+ page_cache_ra_order(&ractl, ra);
return fpin;
}
#endif
@@ -3258,8 +3259,9 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
ra->size = ra->ra_pages;
ra->async_size = ra->ra_pages / 4;
+ ra->order = 0;
ractl._index = ra->start;
- page_cache_ra_order(&ractl, ra, 0);
+ page_cache_ra_order(&ractl, ra);
return fpin;
}
diff --git a/mm/internal.h b/mm/internal.h
index 40464f755092..437c7738668d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -442,8 +442,7 @@ void zap_page_range_single_batched(struct mmu_gather *tlb,
int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
gfp_t gfp);
-void page_cache_ra_order(struct readahead_control *, struct file_ra_state *,
- unsigned int order);
+void page_cache_ra_order(struct readahead_control *, struct file_ra_state *);
void force_page_cache_ra(struct readahead_control *, unsigned long nr);
static inline void force_page_cache_readahead(struct address_space *mapping,
struct file *file, pgoff_t index, unsigned long nr_to_read)
diff --git a/mm/readahead.c b/mm/readahead.c
index 82f9f623f2d7..18972bc34861 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -457,7 +457,7 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
}
void page_cache_ra_order(struct readahead_control *ractl,
- struct file_ra_state *ra, unsigned int new_order)
+ struct file_ra_state *ra)
{
struct address_space *mapping = ractl->mapping;
pgoff_t start = readahead_index(ractl);
@@ -469,6 +469,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
int err = 0;
gfp_t gfp = readahead_gfp_mask(mapping);
unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
+ unsigned int new_order = ra->order;
/*
* Fallback when size < min_nrpages as each folio should be
@@ -483,6 +484,8 @@ void page_cache_ra_order(struct readahead_control *ractl,
new_order = min_t(unsigned int, new_order, ilog2(ra->size));
new_order = max(new_order, min_order);
+ ra->order = new_order;
+
/* See comment in page_cache_ra_unbounded() */
nofs = memalloc_nofs_save();
filemap_invalidate_lock_shared(mapping);
@@ -525,6 +528,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
* ->readahead() may have updated readahead window size so we have to
* check there's still something to read.
*/
+ ra->order = 0;
if (ra->size > index - start)
do_page_cache_ra(ractl, ra->size - (index - start),
ra->async_size);
@@ -614,8 +618,9 @@ void page_cache_sync_ra(struct readahead_control *ractl,
ra->size = min(contig_count + req_count, max_pages);
ra->async_size = 1;
readit:
+ ra->order = 0;
ractl->_index = ra->start;
- page_cache_ra_order(ractl, ra, 0);
+ page_cache_ra_order(ractl, ra);
}
EXPORT_SYMBOL_GPL(page_cache_sync_ra);
@@ -626,7 +631,6 @@ void page_cache_async_ra(struct readahead_control *ractl,
struct file_ra_state *ra = ractl->ra;
pgoff_t index = readahead_index(ractl);
pgoff_t expected, start, end, aligned_end;
- unsigned int order = folio_order(folio);
/* no readahead */
if (!ra->ra_pages)
@@ -649,7 +653,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
* Ramp up sizes, and push forward the readahead window.
*/
expected = round_down(ra->start + ra->size - ra->async_size,
- 1UL << order);
+ 1UL << folio_order(folio));
if (index == expected) {
ra->start += ra->size;
/*
@@ -678,14 +682,14 @@ void page_cache_async_ra(struct readahead_control *ractl,
ra->size += req_count;
ra->size = get_next_ra_size(ra, max_pages);
readit:
- order += 2;
+ ra->order += 2;
end = ra->start + ra->size;
- aligned_end = round_down(end, 1UL << order);
+ aligned_end = round_down(end, 1UL << ra->order);
if (aligned_end > ra->start)
ra->size -= end - aligned_end;
ra->async_size = ra->size;
ractl->_index = ra->start;
- page_cache_ra_order(ractl, ra, order);
+ page_cache_ra_order(ractl, ra);
}
EXPORT_SYMBOL_GPL(page_cache_async_ra);
--
2.43.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 4/5] mm/readahead: Store folio order in struct file_ra_state
2025-04-30 14:59 ` [RFC PATCH v4 4/5] mm/readahead: Store folio order " Ryan Roberts
@ 2025-05-05 9:52 ` Jan Kara
2025-05-06 9:53 ` Ryan Roberts
2025-05-05 10:08 ` David Hildenbrand
1 sibling, 1 reply; 40+ messages in thread
From: Jan Kara @ 2025-05-05 9:52 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan,
linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On Wed 30-04-25 15:59:17, Ryan Roberts wrote:
> Previously the folio order of the previous readahead request was
> inferred from the folio whose readahead marker was hit. But due to the
> way we have to round to non-natural boundaries sometimes, this first
> folio in the readahead block is often smaller than the preferred order
> for that request. This means that for cases where the initial sync
> readahead is poorly aligned, the folio order will ramp up much more
> slowly.
>
> So instead, let's store the order in struct file_ra_state so we are not
> affected by any required alignment. We previously made enough room in
> the struct for a 16-bit order field. This should be plenty big enough since
> we are limited to MAX_PAGECACHE_ORDER anyway, which is certainly never
> larger than ~20.
>
> Since we now pass order in struct file_ra_state, page_cache_ra_order()
> no longer needs its new_order parameter, so let's remove that.
>
> Worked example:
>
> Here we are touching pages 17-256 sequentially just as we did in the
> previous commit, but now that we are remembering the preferred order
> explicitly, we no longer have the slow ramp up problem. Note
> specifically that we no longer have 2 rounds (2x ~128K) of order-2
> folios:
>
> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
> ----- ---------- ---------- ---------- ------- ------- ----- ----- --
> HOLE 0x00000000 0x00001000 4096 0 1 1
> FOLIO 0x00001000 0x00002000 4096 1 2 1 0
> FOLIO 0x00002000 0x00003000 4096 2 3 1 0
> FOLIO 0x00003000 0x00004000 4096 3 4 1 0
> FOLIO 0x00004000 0x00005000 4096 4 5 1 0
> FOLIO 0x00005000 0x00006000 4096 5 6 1 0
> FOLIO 0x00006000 0x00007000 4096 6 7 1 0
> FOLIO 0x00007000 0x00008000 4096 7 8 1 0
> FOLIO 0x00008000 0x00009000 4096 8 9 1 0
> FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
> FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
> FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
> FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
> FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
> FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
> FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
> FOLIO 0x00010000 0x00011000 4096 16 17 1 0
> FOLIO 0x00011000 0x00012000 4096 17 18 1 0
> FOLIO 0x00012000 0x00013000 4096 18 19 1 0
> FOLIO 0x00013000 0x00014000 4096 19 20 1 0
> FOLIO 0x00014000 0x00015000 4096 20 21 1 0
> FOLIO 0x00015000 0x00016000 4096 21 22 1 0
> FOLIO 0x00016000 0x00017000 4096 22 23 1 0
> FOLIO 0x00017000 0x00018000 4096 23 24 1 0
> FOLIO 0x00018000 0x00019000 4096 24 25 1 0
> FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
> FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
> FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
> FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
> FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
> FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
> FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
> FOLIO 0x00020000 0x00021000 4096 32 33 1 0
> FOLIO 0x00021000 0x00022000 4096 33 34 1 0
> FOLIO 0x00022000 0x00024000 8192 34 36 2 1
> FOLIO 0x00024000 0x00028000 16384 36 40 4 2
> FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
> FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
> FOLIO 0x00030000 0x00034000 16384 48 52 4 2
> FOLIO 0x00034000 0x00038000 16384 52 56 4 2
> FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
> FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
> FOLIO 0x00040000 0x00050000 65536 64 80 16 4
> FOLIO 0x00050000 0x00060000 65536 80 96 16 4
> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
> FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
> FOLIO 0x000e0000 0x00100000 131072 224 256 32 5
> FOLIO 0x00100000 0x00120000 131072 256 288 32 5
> FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y
> HOLE 0x00140000 0x00800000 7077888 320 2048 1728
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
...
> @@ -469,6 +469,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
> int err = 0;
> gfp_t gfp = readahead_gfp_mask(mapping);
> unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
> + unsigned int new_order = ra->order;
>
> /*
> * Fallback when size < min_nrpages as each folio should be
> @@ -483,6 +484,8 @@ void page_cache_ra_order(struct readahead_control *ractl,
> new_order = min_t(unsigned int, new_order, ilog2(ra->size));
> new_order = max(new_order, min_order);
>
> + ra->order = new_order;
> +
> /* See comment in page_cache_ra_unbounded() */
> nofs = memalloc_nofs_save();
> filemap_invalidate_lock_shared(mapping);
> @@ -525,6 +528,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
> * ->readahead() may have updated readahead window size so we have to
> * check there's still something to read.
> */
> + ra->order = 0;
Hum, so you reset desired folio order if readahead hit some pre-existing
pages in the page cache. Is this really desirable? Why not leave the
desired order as it was for the next request?
> if (ra->size > index - start)
> do_page_cache_ra(ractl, ra->size - (index - start),
> ra->async_size);
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 4/5] mm/readahead: Store folio order in struct file_ra_state
2025-05-05 9:52 ` Jan Kara
@ 2025-05-06 9:53 ` Ryan Roberts
2025-05-06 10:45 ` Jan Kara
0 siblings, 1 reply; 40+ messages in thread
From: Ryan Roberts @ 2025-05-06 9:53 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan,
linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On 05/05/2025 10:52, Jan Kara wrote:
> On Wed 30-04-25 15:59:17, Ryan Roberts wrote:
>> Previously the folio order of the previous readahead request was
>> inferred from the folio whose readahead marker was hit. But due to the
>> way we have to round to non-natural boundaries sometimes, this first
>> folio in the readahead block is often smaller than the preferred order
>> for that request. This means that for cases where the initial sync
>> readahead is poorly aligned, the folio order will ramp up much more
>> slowly.
>>
>> So instead, let's store the order in struct file_ra_state so we are not
>> affected by any required alignment. We previously made enough room in
>> the struct for a 16-bit order field. This should be plenty big enough since
>> we are limited to MAX_PAGECACHE_ORDER anyway, which is certainly never
>> larger than ~20.
>>
>> Since we now pass order in struct file_ra_state, page_cache_ra_order()
>> no longer needs its new_order parameter, so let's remove that.
>>
>> Worked example:
>>
>> Here we are touching pages 17-256 sequentially just as we did in the
>> previous commit, but now that we are remembering the preferred order
>> explicitly, we no longer have the slow ramp up problem. Note
>> specifically that we no longer have 2 rounds (2x ~128K) of order-2
>> folios:
>>
>> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
>> ----- ---------- ---------- ---------- ------- ------- ----- ----- --
>> HOLE 0x00000000 0x00001000 4096 0 1 1
>> FOLIO 0x00001000 0x00002000 4096 1 2 1 0
>> FOLIO 0x00002000 0x00003000 4096 2 3 1 0
>> FOLIO 0x00003000 0x00004000 4096 3 4 1 0
>> FOLIO 0x00004000 0x00005000 4096 4 5 1 0
>> FOLIO 0x00005000 0x00006000 4096 5 6 1 0
>> FOLIO 0x00006000 0x00007000 4096 6 7 1 0
>> FOLIO 0x00007000 0x00008000 4096 7 8 1 0
>> FOLIO 0x00008000 0x00009000 4096 8 9 1 0
>> FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
>> FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
>> FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
>> FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
>> FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
>> FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
>> FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
>> FOLIO 0x00010000 0x00011000 4096 16 17 1 0
>> FOLIO 0x00011000 0x00012000 4096 17 18 1 0
>> FOLIO 0x00012000 0x00013000 4096 18 19 1 0
>> FOLIO 0x00013000 0x00014000 4096 19 20 1 0
>> FOLIO 0x00014000 0x00015000 4096 20 21 1 0
>> FOLIO 0x00015000 0x00016000 4096 21 22 1 0
>> FOLIO 0x00016000 0x00017000 4096 22 23 1 0
>> FOLIO 0x00017000 0x00018000 4096 23 24 1 0
>> FOLIO 0x00018000 0x00019000 4096 24 25 1 0
>> FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
>> FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
>> FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
>> FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
>> FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
>> FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
>> FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
>> FOLIO 0x00020000 0x00021000 4096 32 33 1 0
>> FOLIO 0x00021000 0x00022000 4096 33 34 1 0
>> FOLIO 0x00022000 0x00024000 8192 34 36 2 1
>> FOLIO 0x00024000 0x00028000 16384 36 40 4 2
>> FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
>> FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
>> FOLIO 0x00030000 0x00034000 16384 48 52 4 2
>> FOLIO 0x00034000 0x00038000 16384 52 56 4 2
>> FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
>> FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
>> FOLIO 0x00040000 0x00050000 65536 64 80 16 4
>> FOLIO 0x00050000 0x00060000 65536 80 96 16 4
>> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
>> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
>> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
>> FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
>> FOLIO 0x000e0000 0x00100000 131072 224 256 32 5
>> FOLIO 0x00100000 0x00120000 131072 256 288 32 5
>> FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y
>> HOLE 0x00140000 0x00800000 7077888 320 2048 1728
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>
> ...
>
>> @@ -469,6 +469,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
>> int err = 0;
>> gfp_t gfp = readahead_gfp_mask(mapping);
>> unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
>> + unsigned int new_order = ra->order;
>>
>> /*
>> * Fallback when size < min_nrpages as each folio should be
>> @@ -483,6 +484,8 @@ void page_cache_ra_order(struct readahead_control *ractl,
>> new_order = min_t(unsigned int, new_order, ilog2(ra->size));
>> new_order = max(new_order, min_order);
>>
>> + ra->order = new_order;
>> +
>> /* See comment in page_cache_ra_unbounded() */
>> nofs = memalloc_nofs_save();
>> filemap_invalidate_lock_shared(mapping);
>> @@ -525,6 +528,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
>> * ->readahead() may have updated readahead window size so we have to
>> * check there's still something to read.
>> */
>> + ra->order = 0;
>
> Hum, so you reset desired folio order if readahead hit some pre-existing
> pages in the page cache. Is this really desirable? Why not leave the
> desired order as it was for the next request?
My aim was to not let order grow unbounded. When the filesystem doesn't support
large folios we end up here (from the "goto fallback") and without this, order
will just grow and grow (perhaps it doesn't matter though). I think we should
keep this.
But I guess your point is that we can also end up here when the filesystem does
support large folios but there is an error. In that case, yes, I'll change to
not reset order to 0; it has already been fixed up earlier in this path.
How's this:
---8<---
diff --git a/mm/readahead.c b/mm/readahead.c
index 18972bc34861..0054ca18a815 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -475,8 +475,10 @@ void page_cache_ra_order(struct readahead_control *ractl,
* Fallback when size < min_nrpages as each folio should be
* at least min_nrpages anyway.
*/
- if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
+ if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size) {
+ ra->order = 0;
goto fallback;
+ }
limit = min(limit, index + ra->size - 1);
@@ -528,7 +530,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
* ->readahead() may have updated readahead window size so we have to
* check there's still something to read.
*/
- ra->order = 0;
if (ra->size > index - start)
do_page_cache_ra(ractl, ra->size - (index - start),
ra->async_size);
---8<---
Thanks,
Ryan
>
>> if (ra->size > index - start)
>> do_page_cache_ra(ractl, ra->size - (index - start),
>> ra->async_size);
>
> Honza
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 4/5] mm/readahead: Store folio order in struct file_ra_state
2025-05-06 9:53 ` Ryan Roberts
@ 2025-05-06 10:45 ` Jan Kara
0 siblings, 0 replies; 40+ messages in thread
From: Jan Kara @ 2025-05-06 10:45 UTC (permalink / raw)
To: Ryan Roberts
Cc: Jan Kara, Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan,
linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On Tue 06-05-25 10:53:07, Ryan Roberts wrote:
> On 05/05/2025 10:52, Jan Kara wrote:
> > On Wed 30-04-25 15:59:17, Ryan Roberts wrote:
> >> Previously the folio order of the previous readahead request was
> >> inferred from the folio whose readahead marker was hit. But due to the
> >> way we have to round to non-natural boundaries sometimes, this first
> >> folio in the readahead block is often smaller than the preferred order
> >> for that request. This means that for cases where the initial sync
> >> readahead is poorly aligned, the folio order will ramp up much more
> >> slowly.
> >>
> >> So instead, let's store the order in struct file_ra_state so we are not
> >> affected by any required alignment. We previously made enough room in
> >> the struct for a 16-bit order field. This should be plenty big enough since
> >> we are limited to MAX_PAGECACHE_ORDER anyway, which is certainly never
> >> larger than ~20.
> >>
> >> Since we now pass order in struct file_ra_state, page_cache_ra_order()
> >> no longer needs its new_order parameter, so let's remove that.
> >>
> >> Worked example:
> >>
> >> Here we are touching pages 17-256 sequentially just as we did in the
> >> previous commit, but now that we are remembering the preferred order
> >> explicitly, we no longer have the slow ramp up problem. Note
> >> specifically that we no longer have 2 rounds (2x ~128K) of order-2
> >> folios:
> >>
> >> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
> >> ----- ---------- ---------- ---------- ------- ------- ----- ----- --
> >> HOLE 0x00000000 0x00001000 4096 0 1 1
> >> FOLIO 0x00001000 0x00002000 4096 1 2 1 0
> >> FOLIO 0x00002000 0x00003000 4096 2 3 1 0
> >> FOLIO 0x00003000 0x00004000 4096 3 4 1 0
> >> FOLIO 0x00004000 0x00005000 4096 4 5 1 0
> >> FOLIO 0x00005000 0x00006000 4096 5 6 1 0
> >> FOLIO 0x00006000 0x00007000 4096 6 7 1 0
> >> FOLIO 0x00007000 0x00008000 4096 7 8 1 0
> >> FOLIO 0x00008000 0x00009000 4096 8 9 1 0
> >> FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
> >> FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
> >> FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
> >> FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
> >> FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
> >> FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
> >> FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
> >> FOLIO 0x00010000 0x00011000 4096 16 17 1 0
> >> FOLIO 0x00011000 0x00012000 4096 17 18 1 0
> >> FOLIO 0x00012000 0x00013000 4096 18 19 1 0
> >> FOLIO 0x00013000 0x00014000 4096 19 20 1 0
> >> FOLIO 0x00014000 0x00015000 4096 20 21 1 0
> >> FOLIO 0x00015000 0x00016000 4096 21 22 1 0
> >> FOLIO 0x00016000 0x00017000 4096 22 23 1 0
> >> FOLIO 0x00017000 0x00018000 4096 23 24 1 0
> >> FOLIO 0x00018000 0x00019000 4096 24 25 1 0
> >> FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
> >> FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
> >> FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
> >> FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
> >> FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
> >> FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
> >> FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
> >> FOLIO 0x00020000 0x00021000 4096 32 33 1 0
> >> FOLIO 0x00021000 0x00022000 4096 33 34 1 0
> >> FOLIO 0x00022000 0x00024000 8192 34 36 2 1
> >> FOLIO 0x00024000 0x00028000 16384 36 40 4 2
> >> FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
> >> FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
> >> FOLIO 0x00030000 0x00034000 16384 48 52 4 2
> >> FOLIO 0x00034000 0x00038000 16384 52 56 4 2
> >> FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
> >> FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
> >> FOLIO 0x00040000 0x00050000 65536 64 80 16 4
> >> FOLIO 0x00050000 0x00060000 65536 80 96 16 4
> >> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
> >> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
> >> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
> >> FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
> >> FOLIO 0x000e0000 0x00100000 131072 224 256 32 5
> >> FOLIO 0x00100000 0x00120000 131072 256 288 32 5
> >> FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y
> >> HOLE 0x00140000 0x00800000 7077888 320 2048 1728
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >
> > ...
> >
> >> @@ -469,6 +469,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
> >> int err = 0;
> >> gfp_t gfp = readahead_gfp_mask(mapping);
> >> unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
> >> + unsigned int new_order = ra->order;
> >>
> >> /*
> >> * Fallback when size < min_nrpages as each folio should be
> >> @@ -483,6 +484,8 @@ void page_cache_ra_order(struct readahead_control *ractl,
> >> new_order = min_t(unsigned int, new_order, ilog2(ra->size));
> >> new_order = max(new_order, min_order);
> >>
> >> + ra->order = new_order;
> >> +
> >> /* See comment in page_cache_ra_unbounded() */
> >> nofs = memalloc_nofs_save();
> >> filemap_invalidate_lock_shared(mapping);
> >> @@ -525,6 +528,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
> >> * ->readahead() may have updated readahead window size so we have to
> >> * check there's still something to read.
> >> */
> >> + ra->order = 0;
> >
> > Hum, so you reset desired folio order if readahead hit some pre-existing
> > pages in the page cache. Is this really desirable? Why not leave the
> > desired order as it was for the next request?
>
> My aim was to not let order grow unbounded. When the filesystem doesn't support
> large folios we end up here (from the "goto fallback") and without this, order
> will just grow and grow (perhaps it doesn't matter though). I think we should
> keep this.
Yes, I agree that should be kept.
>
> But I guess your point is that we can also end up here when the filesystem does
> support large folios but there is an error. In that case, yes, I'll change to
> not reset order to 0; it has already been fixed up earlier in this path.
Right.
> How's this:
>
> ---8<---
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 18972bc34861..0054ca18a815 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -475,8 +475,10 @@ void page_cache_ra_order(struct readahead_control *ractl,
> * Fallback when size < min_nrpages as each folio should be
> * at least min_nrpages anyway.
> */
> - if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
> + if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size) {
> + ra->order = 0;
> goto fallback;
> + }
>
> limit = min(limit, index + ra->size - 1);
>
> @@ -528,7 +530,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
> * ->readahead() may have updated readahead window size so we have to
> * check there's still something to read.
> */
> - ra->order = 0;
> if (ra->size > index - start)
> do_page_cache_ra(ractl, ra->size - (index - start),
> ra->async_size);
Yes, this looks good to me.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 4/5] mm/readahead: Store folio order in struct file_ra_state
2025-04-30 14:59 ` [RFC PATCH v4 4/5] mm/readahead: Store folio order " Ryan Roberts
2025-05-05 9:52 ` Jan Kara
@ 2025-05-05 10:08 ` David Hildenbrand
2025-05-06 10:03 ` Ryan Roberts
1 sibling, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2025-05-05 10:08 UTC (permalink / raw)
To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
Alexander Viro, Christian Brauner, Jan Kara, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On 30.04.25 16:59, Ryan Roberts wrote:
> Previously the folio order of the previous readahead request was
> inferred from the folio whose readahead marker was hit. But due to the
> way we have to round to non-natural boundaries sometimes, this first
> folio in the readahead block is often smaller than the preferred order
> for that request. This means that for cases where the initial sync
> readahead is poorly aligned, the folio order will ramp up much more
> slowly.
>
> So instead, let's store the order in struct file_ra_state so we are not
> affected by any required alignment. We previously made enough room in
> the struct for a 16 order field. This should be plenty big enough since
> we are limited to MAX_PAGECACHE_ORDER anyway, which is certainly never
> larger than ~20.
>
> Since we now pass order in struct file_ra_state, page_cache_ra_order()
> no longer needs its new_order parameter, so let's remove that.
>
> Worked example:
>
> Here we are touching pages 17-256 sequentially just as we did in the
> previous commit, but now that we are remembering the preferred order
> explicitly, we no longer have the slow ramp up problem. Note
> specifically that we no longer have 2 rounds (2x ~128K) of order-2
> folios:
>
> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
> ----- ---------- ---------- ---------- ------- ------- ----- ----- --
> HOLE 0x00000000 0x00001000 4096 0 1 1
> FOLIO 0x00001000 0x00002000 4096 1 2 1 0
> FOLIO 0x00002000 0x00003000 4096 2 3 1 0
> FOLIO 0x00003000 0x00004000 4096 3 4 1 0
> FOLIO 0x00004000 0x00005000 4096 4 5 1 0
> FOLIO 0x00005000 0x00006000 4096 5 6 1 0
> FOLIO 0x00006000 0x00007000 4096 6 7 1 0
> FOLIO 0x00007000 0x00008000 4096 7 8 1 0
> FOLIO 0x00008000 0x00009000 4096 8 9 1 0
> FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
> FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
> FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
> FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
> FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
> FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
> FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
> FOLIO 0x00010000 0x00011000 4096 16 17 1 0
> FOLIO 0x00011000 0x00012000 4096 17 18 1 0
> FOLIO 0x00012000 0x00013000 4096 18 19 1 0
> FOLIO 0x00013000 0x00014000 4096 19 20 1 0
> FOLIO 0x00014000 0x00015000 4096 20 21 1 0
> FOLIO 0x00015000 0x00016000 4096 21 22 1 0
> FOLIO 0x00016000 0x00017000 4096 22 23 1 0
> FOLIO 0x00017000 0x00018000 4096 23 24 1 0
> FOLIO 0x00018000 0x00019000 4096 24 25 1 0
> FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
> FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
> FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
> FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
> FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
> FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
> FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
> FOLIO 0x00020000 0x00021000 4096 32 33 1 0
> FOLIO 0x00021000 0x00022000 4096 33 34 1 0
> FOLIO 0x00022000 0x00024000 8192 34 36 2 1
> FOLIO 0x00024000 0x00028000 16384 36 40 4 2
> FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
> FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
> FOLIO 0x00030000 0x00034000 16384 48 52 4 2
> FOLIO 0x00034000 0x00038000 16384 52 56 4 2
> FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
> FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
> FOLIO 0x00040000 0x00050000 65536 64 80 16 4
> FOLIO 0x00050000 0x00060000 65536 80 96 16 4
> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
> FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
> FOLIO 0x000e0000 0x00100000 131072 224 256 32 5
> FOLIO 0x00100000 0x00120000 131072 256 288 32 5
> FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y
> HOLE 0x00140000 0x00800000 7077888 320 2048 1728
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> include/linux/fs.h | 2 ++
> mm/filemap.c | 6 ++++--
> mm/internal.h | 3 +--
> mm/readahead.c | 18 +++++++++++-------
> 4 files changed, 18 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 44362bef0010..cde482a7270a 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1031,6 +1031,7 @@ struct fown_struct {
> * and so were/are genuinely "ahead". Start next readahead when
> * the first of these pages is accessed.
> * @ra_pages: Maximum size of a readahead request, copied from the bdi.
> + * @order: Preferred folio order used for most recent readahead.
Looking at other members, and how it relates to the other members,
should we call this something like "ra_prev_order" / "prev_ra_order" to
distinguish it from !ra members and indicate the "most recent" semantics
similar to "prev_pos"?
Just a thought while digging through this patch ...
...
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3222,7 +3222,8 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> if (!(vm_flags & VM_RAND_READ))
> ra->size *= 2;
> ra->async_size = HPAGE_PMD_NR;
> - page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER);
> + ra->order = HPAGE_PMD_ORDER;
> + page_cache_ra_order(&ractl, ra);
> return fpin;
> }
> #endif
> @@ -3258,8 +3259,9 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
> ra->size = ra->ra_pages;
> ra->async_size = ra->ra_pages / 4;
> + ra->order = 0;
> ractl._index = ra->start;
> - page_cache_ra_order(&ractl, ra, 0);
> + page_cache_ra_order(&ractl, ra);
> return fpin;
> }
Why not let page_cache_ra_order() consume the order and update ra->order
(or however it will be called :) ) internally?
That might make at least the "most recent readahead" semantics of the
variable clearer.
Again, just a thought ...
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 4/5] mm/readahead: Store folio order in struct file_ra_state
2025-05-05 10:08 ` David Hildenbrand
@ 2025-05-06 10:03 ` Ryan Roberts
2025-05-06 14:24 ` David Hildenbrand
0 siblings, 1 reply; 40+ messages in thread
From: Ryan Roberts @ 2025-05-06 10:03 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Matthew Wilcox (Oracle),
Alexander Viro, Christian Brauner, Jan Kara, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On 05/05/2025 11:08, David Hildenbrand wrote:
> On 30.04.25 16:59, Ryan Roberts wrote:
>> Previously the folio order of the previous readahead request was
>> inferred from the folio whose readahead marker was hit. But due to the
>> way we have to round to non-natural boundaries sometimes, this first
>> folio in the readahead block is often smaller than the preferred order
>> for that request. This means that for cases where the initial sync
>> readahead is poorly aligned, the folio order will ramp up much more
>> slowly.
>>
>> So instead, let's store the order in struct file_ra_state so we are not
>> affected by any required alignment. We previously made enough room in
>> the struct for a 16 order field. This should be plenty big enough since
>> we are limited to MAX_PAGECACHE_ORDER anyway, which is certainly never
>> larger than ~20.
>>
>> Since we now pass order in struct file_ra_state, page_cache_ra_order()
>> no longer needs its new_order parameter, so let's remove that.
>>
>> Worked example:
>>
>> Here we are touching pages 17-256 sequentially just as we did in the
>> previous commit, but now that we are remembering the preferred order
>> explicitly, we no longer have the slow ramp up problem. Note
>> specifically that we no longer have 2 rounds (2x ~128K) of order-2
>> folios:
>>
>> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
>> ----- ---------- ---------- ---------- ------- ------- ----- ----- --
>> HOLE 0x00000000 0x00001000 4096 0 1 1
>> FOLIO 0x00001000 0x00002000 4096 1 2 1 0
>> FOLIO 0x00002000 0x00003000 4096 2 3 1 0
>> FOLIO 0x00003000 0x00004000 4096 3 4 1 0
>> FOLIO 0x00004000 0x00005000 4096 4 5 1 0
>> FOLIO 0x00005000 0x00006000 4096 5 6 1 0
>> FOLIO 0x00006000 0x00007000 4096 6 7 1 0
>> FOLIO 0x00007000 0x00008000 4096 7 8 1 0
>> FOLIO 0x00008000 0x00009000 4096 8 9 1 0
>> FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
>> FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
>> FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
>> FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
>> FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
>> FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
>> FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
>> FOLIO 0x00010000 0x00011000 4096 16 17 1 0
>> FOLIO 0x00011000 0x00012000 4096 17 18 1 0
>> FOLIO 0x00012000 0x00013000 4096 18 19 1 0
>> FOLIO 0x00013000 0x00014000 4096 19 20 1 0
>> FOLIO 0x00014000 0x00015000 4096 20 21 1 0
>> FOLIO 0x00015000 0x00016000 4096 21 22 1 0
>> FOLIO 0x00016000 0x00017000 4096 22 23 1 0
>> FOLIO 0x00017000 0x00018000 4096 23 24 1 0
>> FOLIO 0x00018000 0x00019000 4096 24 25 1 0
>> FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
>> FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
>> FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
>> FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
>> FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
>> FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
>> FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
>> FOLIO 0x00020000 0x00021000 4096 32 33 1 0
>> FOLIO 0x00021000 0x00022000 4096 33 34 1 0
>> FOLIO 0x00022000 0x00024000 8192 34 36 2 1
>> FOLIO 0x00024000 0x00028000 16384 36 40 4 2
>> FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
>> FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
>> FOLIO 0x00030000 0x00034000 16384 48 52 4 2
>> FOLIO 0x00034000 0x00038000 16384 52 56 4 2
>> FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
>> FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
>> FOLIO 0x00040000 0x00050000 65536 64 80 16 4
>> FOLIO 0x00050000 0x00060000 65536 80 96 16 4
>> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
>> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
>> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
>> FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
>> FOLIO 0x000e0000 0x00100000 131072 224 256 32 5
>> FOLIO 0x00100000 0x00120000 131072 256 288 32 5
>> FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y
>> HOLE 0x00140000 0x00800000 7077888 320 2048 1728
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>> include/linux/fs.h | 2 ++
>> mm/filemap.c | 6 ++++--
>> mm/internal.h | 3 +--
>> mm/readahead.c | 18 +++++++++++-------
>> 4 files changed, 18 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 44362bef0010..cde482a7270a 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -1031,6 +1031,7 @@ struct fown_struct {
>> * and so were/are genuinely "ahead". Start next readahead when
>> * the first of these pages is accessed.
>> * @ra_pages: Maximum size of a readahead request, copied from the bdi.
>> + * @order: Preferred folio order used for most recent readahead.
>
> Looking at other members, and how it relates to the other members, should we
> call this something like "ra_prev_order" / "prev_ra_order" to distinguish it
> from !ra members and indicate the "most recent" semantics similar to "prev_pos"?
As you know, I'm crap at naming, but...
start, size, async_size and order make up the parameters for the "most recent"
readahead request. Where "most recent" includes "current" once passed into
page_cache_ra_order(). The others don't include "ra" or "prev" in their name so
wasn't sure it was necessary here.
ra_pages is a bit different; that's not part of the request, it's a (dynamic)
ceiling to use when creating requests.
Personally I'd leave it as is, but no strong opinion.
>
> Just a thought while digging through this patch ...
>
> ...
>
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -3222,7 +3222,8 @@ static struct file *do_sync_mmap_readahead(struct
>> vm_fault *vmf)
>> if (!(vm_flags & VM_RAND_READ))
>> ra->size *= 2;
>> ra->async_size = HPAGE_PMD_NR;
>> - page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER);
>> + ra->order = HPAGE_PMD_ORDER;
>> + page_cache_ra_order(&ractl, ra);
>> return fpin;
>> }
>> #endif
>> @@ -3258,8 +3259,9 @@ static struct file *do_sync_mmap_readahead(struct
>> vm_fault *vmf)
>> ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
>> ra->size = ra->ra_pages;
>> ra->async_size = ra->ra_pages / 4;
>> + ra->order = 0;
>> ractl._index = ra->start;
>> - page_cache_ra_order(&ractl, ra, 0);
>> + page_cache_ra_order(&ractl, ra);
>> return fpin;
>> }
>
> Why not let page_cache_ra_order() consume the order and update ra->order (or
> however it will be called :) ) internally?
You mean continue to pass new_order as a parameter to page_cache_ra_order()? The
reason I did it the way I'm doing it is because I thought it would be weird for
the caller of page_cache_ra_order() to set up all the parameters (start, size,
async_size) of the request except for order...
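>
> For illustration only, here is a minimal userspace model of that convention
> (not kernel code; the *_model names are made up). The caller fills in the
> whole request, including order, and the callee clamps and records the order
> it actually used, along the lines of the quoted hunks:
>
> #include <stdio.h>
>
> struct ra_state_model {
> 	unsigned long start;		/* first page of the request */
> 	unsigned int size;		/* pages to read */
> 	unsigned int async_size;	/* pages to read asynchronously */
> 	unsigned int order;		/* preferred folio order */
> };
>
> static unsigned int ilog2u(unsigned int x)
> {
> 	unsigned int r = 0;
>
> 	while (x >>= 1)
> 		r++;
> 	return r;
> }
>
> /* Models only the clamping shown in the quoted page_cache_ra_order() hunk. */
> static void page_cache_ra_order_model(struct ra_state_model *ra,
> 				      unsigned int min_order)
> {
> 	unsigned int order = ra->order;
>
> 	if (order > ilog2u(ra->size))
> 		order = ilog2u(ra->size);
> 	if (order < min_order)
> 		order = min_order;
> 	ra->order = order;	/* remembered for the next request */
> 	printf("read %u pages from %lu using order-%u folios\n",
> 	       ra->size, ra->start, ra->order);
> }
>
> int main(void)
> {
> 	struct ra_state_model ra = {
> 		.start = 64, .size = 32, .async_size = 8, .order = 9,
> 	};
>
> 	/* an order-9 request gets clamped to ilog2(32) == 5 */
> 	page_cache_ra_order_model(&ra, 0);
> 	return 0;
> }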
>
> That might make at least the "most recent readahead" semantics of the variable
> clearer.
But if you think your suggestion makes things clearer, then that's fine by me.
Thanks,
Ryan
>
> Again, just a thought ...
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 4/5] mm/readahead: Store folio order in struct file_ra_state
2025-05-06 10:03 ` Ryan Roberts
@ 2025-05-06 14:24 ` David Hildenbrand
2025-05-06 15:06 ` Ryan Roberts
0 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2025-05-06 14:24 UTC (permalink / raw)
To: Ryan Roberts, Andrew Morton, Matthew Wilcox (Oracle),
Alexander Viro, Christian Brauner, Jan Kara, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On 06.05.25 12:03, Ryan Roberts wrote:
> On 05/05/2025 11:08, David Hildenbrand wrote:
>> On 30.04.25 16:59, Ryan Roberts wrote:
>>> Previously the folio order of the previous readahead request was
>>> inferred from the folio whose readahead marker was hit. But due to the
>>> way we have to round to non-natural boundaries sometimes, this first
>>> folio in the readahead block is often smaller than the preferred order
>>> for that request. This means that for cases where the initial sync
>>> readahead is poorly aligned, the folio order will ramp up much more
>>> slowly.
>>>
>>> So instead, let's store the order in struct file_ra_state so we are not
>>> affected by any required alignment. We previously made enough room in
>>> the struct for a 16 order field. This should be plenty big enough since
>>> we are limited to MAX_PAGECACHE_ORDER anyway, which is certainly never
>>> larger than ~20.
>>>
>>> Since we now pass order in struct file_ra_state, page_cache_ra_order()
>>> no longer needs its new_order parameter, so let's remove that.
>>>
>>> Worked example:
>>>
>>> Here we are touching pages 17-256 sequentially just as we did in the
>>> previous commit, but now that we are remembering the preferred order
>>> explicitly, we no longer have the slow ramp up problem. Note
>>> specifically that we no longer have 2 rounds (2x ~128K) of order-2
>>> folios:
>>>
>>> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
>>> ----- ---------- ---------- ---------- ------- ------- ----- ----- --
>>> HOLE 0x00000000 0x00001000 4096 0 1 1
>>> FOLIO 0x00001000 0x00002000 4096 1 2 1 0
>>> FOLIO 0x00002000 0x00003000 4096 2 3 1 0
>>> FOLIO 0x00003000 0x00004000 4096 3 4 1 0
>>> FOLIO 0x00004000 0x00005000 4096 4 5 1 0
>>> FOLIO 0x00005000 0x00006000 4096 5 6 1 0
>>> FOLIO 0x00006000 0x00007000 4096 6 7 1 0
>>> FOLIO 0x00007000 0x00008000 4096 7 8 1 0
>>> FOLIO 0x00008000 0x00009000 4096 8 9 1 0
>>> FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
>>> FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
>>> FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
>>> FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
>>> FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
>>> FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
>>> FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
>>> FOLIO 0x00010000 0x00011000 4096 16 17 1 0
>>> FOLIO 0x00011000 0x00012000 4096 17 18 1 0
>>> FOLIO 0x00012000 0x00013000 4096 18 19 1 0
>>> FOLIO 0x00013000 0x00014000 4096 19 20 1 0
>>> FOLIO 0x00014000 0x00015000 4096 20 21 1 0
>>> FOLIO 0x00015000 0x00016000 4096 21 22 1 0
>>> FOLIO 0x00016000 0x00017000 4096 22 23 1 0
>>> FOLIO 0x00017000 0x00018000 4096 23 24 1 0
>>> FOLIO 0x00018000 0x00019000 4096 24 25 1 0
>>> FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
>>> FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
>>> FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
>>> FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
>>> FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
>>> FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
>>> FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
>>> FOLIO 0x00020000 0x00021000 4096 32 33 1 0
>>> FOLIO 0x00021000 0x00022000 4096 33 34 1 0
>>> FOLIO 0x00022000 0x00024000 8192 34 36 2 1
>>> FOLIO 0x00024000 0x00028000 16384 36 40 4 2
>>> FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
>>> FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
>>> FOLIO 0x00030000 0x00034000 16384 48 52 4 2
>>> FOLIO 0x00034000 0x00038000 16384 52 56 4 2
>>> FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
>>> FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
>>> FOLIO 0x00040000 0x00050000 65536 64 80 16 4
>>> FOLIO 0x00050000 0x00060000 65536 80 96 16 4
>>> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
>>> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
>>> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
>>> FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
>>> FOLIO 0x000e0000 0x00100000 131072 224 256 32 5
>>> FOLIO 0x00100000 0x00120000 131072 256 288 32 5
>>> FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y
>>> HOLE 0x00140000 0x00800000 7077888 320 2048 1728
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>> include/linux/fs.h | 2 ++
>>> mm/filemap.c | 6 ++++--
>>> mm/internal.h | 3 +--
>>> mm/readahead.c | 18 +++++++++++-------
>>> 4 files changed, 18 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>> index 44362bef0010..cde482a7270a 100644
>>> --- a/include/linux/fs.h
>>> +++ b/include/linux/fs.h
>>> @@ -1031,6 +1031,7 @@ struct fown_struct {
>>> * and so were/are genuinely "ahead". Start next readahead when
>>> * the first of these pages is accessed.
>>> * @ra_pages: Maximum size of a readahead request, copied from the bdi.
>>> + * @order: Preferred folio order used for most recent readahead.
>>
>> Looking at other members, and how it relates to the other members, should we
>> call this something like "ra_prev_order" / "prev_ra_order" to distinguish it
>> from !ra members and indicate the "most recent" semantics similar to "prev_pos"?
>
> As you know, I'm crap at naming, but...
>
> start, size, async_size and order make up the parameters for the "most recent"
> readahead request. Where "most recent" includes "current" once passed into
> page_cache_ra_order(). The others don't include "ra" or "prev" in their name so
> wasn't sure it was necessary here.
>
> ra_pages is a bit different; that's not part of the request, it's a (dynamic)
> ceiling to use when creating requests.
>
> Personally I'd leave it as is, but no strong opinion.
I'm fine with it staying that way; I was merely trying to make sense of
it all ...
... maybe a better description of the parameters might make the
semantics easier to grasp.
""most recent" includes "current" once passed into page_cache_ra_order()"
is *really* hard to digest :)
>
>>
>> Just a thought while digging through this patch ...
>>
>> ...
>>
>>> --- a/mm/filemap.c
>>> +++ b/mm/filemap.c
>>> @@ -3222,7 +3222,8 @@ static struct file *do_sync_mmap_readahead(struct
>>> vm_fault *vmf)
>>> if (!(vm_flags & VM_RAND_READ))
>>> ra->size *= 2;
>>> ra->async_size = HPAGE_PMD_NR;
>>> - page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER);
>>> + ra->order = HPAGE_PMD_ORDER;
>>> + page_cache_ra_order(&ractl, ra);
>>> return fpin;
>>> }
>>> #endif
>>> @@ -3258,8 +3259,9 @@ static struct file *do_sync_mmap_readahead(struct
>>> vm_fault *vmf)
>>> ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
>>> ra->size = ra->ra_pages;
>>> ra->async_size = ra->ra_pages / 4;
>>> + ra->order = 0;
>>> ractl._index = ra->start;
>>> - page_cache_ra_order(&ractl, ra, 0);
>>> + page_cache_ra_order(&ractl, ra);
>>> return fpin;
>>> }
>>
>> Why not let page_cache_ra_order() consume the order and update ra->order (or
>> however it will be called :) ) internally?
>
> You mean continue to pass new_order as a parameter to page_cache_ra_order()? The
> reason I did it the way I'm doing it is because I thought it would be weird for
> the caller of page_cache_ra_order() to set up all the parameters (start, size,
> async_size) of the request except for order...
Agreed. As above, I think we might do better with the description of
these parameters in general ...
or even document how page_cache_ra_order() acts on these inputs?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 4/5] mm/readahead: Store folio order in struct file_ra_state
2025-05-06 14:24 ` David Hildenbrand
@ 2025-05-06 15:06 ` Ryan Roberts
0 siblings, 0 replies; 40+ messages in thread
From: Ryan Roberts @ 2025-05-06 15:06 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Matthew Wilcox (Oracle),
Alexander Viro, Christian Brauner, Jan Kara, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On 06/05/2025 15:24, David Hildenbrand wrote:
> On 06.05.25 12:03, Ryan Roberts wrote:
>> On 05/05/2025 11:08, David Hildenbrand wrote:
>>> On 30.04.25 16:59, Ryan Roberts wrote:
>>>> Previously the folio order of the previous readahead request was
>>>> inferred from the folio whose readahead marker was hit. But due to the
>>>> way we have to round to non-natural boundaries sometimes, this first
>>>> folio in the readahead block is often smaller than the preferred order
>>>> for that request. This means that for cases where the initial sync
>>>> readahead is poorly aligned, the folio order will ramp up much more
>>>> slowly.
>>>>
>>>> So instead, let's store the order in struct file_ra_state so we are not
>>>> affected by any required alignment. We previously made enough room in
>>>> the struct for a 16 order field. This should be plenty big enough since
>>>> we are limited to MAX_PAGECACHE_ORDER anyway, which is certainly never
>>>> larger than ~20.
>>>>
>>>> Since we now pass order in struct file_ra_state, page_cache_ra_order()
>>>> no longer needs its new_order parameter, so let's remove that.
>>>>
>>>> Worked example:
>>>>
>>>> Here we are touching pages 17-256 sequentially just as we did in the
>>>> previous commit, but now that we are remembering the preferred order
>>>> explicitly, we no longer have the slow ramp up problem. Note
>>>> specifically that we no longer have 2 rounds (2x ~128K) of order-2
>>>> folios:
>>>>
>>>> TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
>>>> ----- ---------- ---------- ---------- ------- ------- ----- ----- --
>>>> HOLE 0x00000000 0x00001000 4096 0 1 1
>>>> FOLIO 0x00001000 0x00002000 4096 1 2 1 0
>>>> FOLIO 0x00002000 0x00003000 4096 2 3 1 0
>>>> FOLIO 0x00003000 0x00004000 4096 3 4 1 0
>>>> FOLIO 0x00004000 0x00005000 4096 4 5 1 0
>>>> FOLIO 0x00005000 0x00006000 4096 5 6 1 0
>>>> FOLIO 0x00006000 0x00007000 4096 6 7 1 0
>>>> FOLIO 0x00007000 0x00008000 4096 7 8 1 0
>>>> FOLIO 0x00008000 0x00009000 4096 8 9 1 0
>>>> FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
>>>> FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
>>>> FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
>>>> FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
>>>> FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
>>>> FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
>>>> FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
>>>> FOLIO 0x00010000 0x00011000 4096 16 17 1 0
>>>> FOLIO 0x00011000 0x00012000 4096 17 18 1 0
>>>> FOLIO 0x00012000 0x00013000 4096 18 19 1 0
>>>> FOLIO 0x00013000 0x00014000 4096 19 20 1 0
>>>> FOLIO 0x00014000 0x00015000 4096 20 21 1 0
>>>> FOLIO 0x00015000 0x00016000 4096 21 22 1 0
>>>> FOLIO 0x00016000 0x00017000 4096 22 23 1 0
>>>> FOLIO 0x00017000 0x00018000 4096 23 24 1 0
>>>> FOLIO 0x00018000 0x00019000 4096 24 25 1 0
>>>> FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
>>>> FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
>>>> FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
>>>> FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
>>>> FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
>>>> FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
>>>> FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
>>>> FOLIO 0x00020000 0x00021000 4096 32 33 1 0
>>>> FOLIO 0x00021000 0x00022000 4096 33 34 1 0
>>>> FOLIO 0x00022000 0x00024000 8192 34 36 2 1
>>>> FOLIO 0x00024000 0x00028000 16384 36 40 4 2
>>>> FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
>>>> FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
>>>> FOLIO 0x00030000 0x00034000 16384 48 52 4 2
>>>> FOLIO 0x00034000 0x00038000 16384 52 56 4 2
>>>> FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
>>>> FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
>>>> FOLIO 0x00040000 0x00050000 65536 64 80 16 4
>>>> FOLIO 0x00050000 0x00060000 65536 80 96 16 4
>>>> FOLIO 0x00060000 0x00080000 131072 96 128 32 5
>>>> FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
>>>> FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
>>>> FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
>>>> FOLIO 0x000e0000 0x00100000 131072 224 256 32 5
>>>> FOLIO 0x00100000 0x00120000 131072 256 288 32 5
>>>> FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y
>>>> HOLE 0x00140000 0x00800000 7077888 320 2048 1728
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>> include/linux/fs.h | 2 ++
>>>> mm/filemap.c | 6 ++++--
>>>> mm/internal.h | 3 +--
>>>> mm/readahead.c | 18 +++++++++++-------
>>>> 4 files changed, 18 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>>> index 44362bef0010..cde482a7270a 100644
>>>> --- a/include/linux/fs.h
>>>> +++ b/include/linux/fs.h
>>>> @@ -1031,6 +1031,7 @@ struct fown_struct {
>>>> * and so were/are genuinely "ahead". Start next readahead when
>>>> * the first of these pages is accessed.
>>>> * @ra_pages: Maximum size of a readahead request, copied from the bdi.
>>>> + * @order: Preferred folio order used for most recent readahead.
>>>
>>> Looking at other members, and how it relates to the other members, should we
>>> call this something like "ra_prev_order" / "prev_ra_order" to distinguish it
>>> from !ra members and indicate the "most recent" semantics similar to "prev_pos"?
>>
>> As you know, I'm crap at naming, but...
>>
>> start, size, async_size and order make up the parameters for the "most recent"
>> readahead request. Where "most recent" includes "current" once passed into
>> page_cache_ra_order(). The others don't include "ra" or "prev" in their name so
>> wasn't sure it was necessary here.
>>
>> ra_pages is a bit different; that's not part of the request, it's a (dynamic)
>> ceiling to use when creating requests.
>>
>> Personally I'd leave it as is, but no strong opinion.
>
> I'm fine with it staying that way; I was merely trying to make sense of it all ...
>
>
> ... maybe a better description of the parameters might make the semantics easier
> to grasp.
>
> ""most recent" includes "current" once passed into page_cache_ra_order()"
>
> is *really* hard to digest :)
>
>>
>>>
>>> Just a thought while digging through this patch ...
>>>
>>> ...
>>>
>>>> --- a/mm/filemap.c
>>>> +++ b/mm/filemap.c
>>>> @@ -3222,7 +3222,8 @@ static struct file *do_sync_mmap_readahead(struct
>>>> vm_fault *vmf)
>>>> if (!(vm_flags & VM_RAND_READ))
>>>> ra->size *= 2;
>>>> ra->async_size = HPAGE_PMD_NR;
>>>> - page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER);
>>>> + ra->order = HPAGE_PMD_ORDER;
>>>> + page_cache_ra_order(&ractl, ra);
>>>> return fpin;
>>>> }
>>>> #endif
>>>> @@ -3258,8 +3259,9 @@ static struct file *do_sync_mmap_readahead(struct
>>>> vm_fault *vmf)
>>>> ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
>>>> ra->size = ra->ra_pages;
>>>> ra->async_size = ra->ra_pages / 4;
>>>> + ra->order = 0;
>>>> ractl._index = ra->start;
>>>> - page_cache_ra_order(&ractl, ra, 0);
>>>> + page_cache_ra_order(&ractl, ra);
>>>> return fpin;
>>>> }
>>>
>>> Why not let page_cache_ra_order() consume the order and update ra->order (or
>>> however it will be called :) ) internally?
>>
>> You mean continue to pass new_order as a parameter to page_cache_ra_order()? The
>> reason I did it the way I'm doing it is because I thought it would be weird for
>> the caller of page_cache_ra_order() to set up all the parameters (start, size,
>> async_size) of the request except for order...
>
> Agreed. As above, I think we might do better with the description of these
> parameters in general ...
>
> or even document how page_cache_ra_order() acts on these inputs?
OK let me try to work something up for the next version...
^ permalink raw reply [flat|nested] 40+ messages in thread
* [RFC PATCH v4 5/5] mm/filemap: Allow arch to request folio size for exec memory
2025-04-30 14:59 [RFC PATCH v4 0/5] Readahead tweaks for larger folios Ryan Roberts
` (3 preceding siblings ...)
2025-04-30 14:59 ` [RFC PATCH v4 4/5] mm/readahead: Store folio order " Ryan Roberts
@ 2025-04-30 14:59 ` Ryan Roberts
2025-05-05 10:06 ` Jan Kara
2025-05-09 13:52 ` Will Deacon
2025-05-06 10:05 ` [RFC PATCH v4 0/5] Readahead tweaks for larger folios Ryan Roberts
5 siblings, 2 replies; 40+ messages in thread
From: Ryan Roberts @ 2025-04-30 14:59 UTC (permalink / raw)
To: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-fsdevel,
linux-mm
Change the readahead config so that if it is being requested for an
executable mapping, do a synchronous read into a set of folios with an
arch-specified order and in a naturally aligned manner. We no longer
center the read on the faulting page but simply align it down to the
previous natural boundary. Additionally, we don't bother with an
asynchronous part.
On arm64 if memory is physically contiguous and naturally aligned to the
"contpte" size, we can use contpte mappings, which improves utilization
of the TLB. When paired with the "multi-size THP" feature, this works
well to reduce dTLB pressure. However iTLB pressure is still high due to
executable mappings having a low likelihood of being in the required
folio size and mapping alignment, even when the filesystem supports
readahead into large folios (e.g. XFS).
The reason for the low likelihood is that the current readahead
algorithm starts with an order-0 folio and increases the folio order by
2 every time the readahead mark is hit. But most executable memory tends
to be accessed randomly and so the readahead mark is rarely hit and most
executable folios remain order-0.
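For illustration only, a minimal userspace sketch (not kernel code) of the
ramp just described, assuming a 128K window of 4K pages so the order is
capped at ilog2(32) == 5:

#include <stdio.h>

int main(void)
{
	unsigned int ra_pages = 32;	/* 128K / 4K, illustrative */
	unsigned int max_order = 0, order = 0;
	unsigned int x, hit;

	for (x = ra_pages; x > 1; x >>= 1)
		max_order++;		/* ilog2(ra_pages) == 5 */

	for (hit = 0; hit < 6; hit++) {
		printf("async readahead %u uses order-%u folios\n", hit, order);
		order += 2;		/* +2 per readahead-mark hit */
		if (order > max_order)
			order = max_order;
	}
	/* prints 0, 2, 4, 5, 5, 5; randomly accessed text rarely gets past 0 */
	return 0;
}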
So let's special-case the read(ahead) logic for executable mappings. The
trade-off is performance improvement (due to more efficient storage of
the translations in iTLB) vs potential for making reclaim more difficult
(due to the folios being larger so if a part of the folio is hot the
whole thing is considered hot). But executable memory is a small portion
of the overall system memory so I doubt this will even register from a
reclaim perspective.
I've chosen 64K folio size for arm64 which benefits both the 4K and 16K
base page size configs. Crucially the same amount of data is still read
(usually 128K) so I'm not expecting any read amplification issues. I
don't anticipate any write amplification because text is always RO.
Note that the text region of an ELF file could be populated into the
page cache for other reasons than taking a fault in a mmapped area. The
most common case is due to the loader read()ing the header which can be
shared with the beginning of text. So some text will still remain in
small folios, but this simple, best effort change provides good
performance improvements as is.
Confine this special-case approach to the bounds of the VMA. This
prevents wasting memory for any padding that might exist in the file
between sections. Previously the padding would have been contained in
order-0 folios and would be easy to reclaim. But now it would be part of
a larger folio so more difficult to reclaim. Solve this by simply not
reading it into memory in the first place.
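As a purely illustrative aside (not part of the patch), the window
computation in the diff below can be modelled in userspace. All concrete
numbers here are made up: order 4 comes from ilog2(SZ_64K >> PAGE_SHIFT) ==
ilog2(16) with 4K pages, the window is 32 pages (128K), the fault is at file
page 0x13 and the VMA covers file pages [0x10, 0x90):

#include <stdio.h>

/* a must be a power of two in both helpers */
static unsigned long round_down_ul(unsigned long x, unsigned long a)
{
	return x & ~(a - 1);
}

static unsigned long round_up_ul(unsigned long x, unsigned long a)
{
	return (x + a - 1) & ~(a - 1);
}

int main(void)
{
	unsigned int order = 4;			/* exec_folio_order() on 4K arm64 */
	unsigned long ra_pages = 32;		/* 128K with 4K pages */
	unsigned long pgoff = 0x13;		/* faulting file page */
	unsigned long vma_start = 0x10;		/* VMA bounds in file pages */
	unsigned long vma_end = 0x90;
	unsigned long start, end;

	start = round_down_ul(pgoff, 1UL << order);
	if (start < vma_start)
		start = vma_start;
	end = round_up_ul(start + ra_pages, 1UL << order);
	if (end > vma_end)
		end = vma_end;

	/* prints: read pages [0x10, 0x30), 32 pages, order-4, async 0 */
	printf("read pages [0x%lx, 0x%lx), %lu pages, order-%u, async 0\n",
	       start, end, end - start, order);
	return 0;
}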
Benchmarking
============
TODO: NUMBERS ARE FOR V3 OF SERIES. NEED TO RERUN FOR THIS VERSION.
The below shows nginx and redis benchmarks on Ampere Altra arm64 system.
First, confirmation that this patch causes more text to be contained in
64K folios:
| File-backed folios | system boot | nginx | redis |
| by size as percentage |-----------------|-----------------|-----------------|
| of all mapped text mem | before | after | before | after | before | after |
|========================|========|========|========|========|========|========|
| base-page-4kB | 26% | 9% | 27% | 6% | 21% | 5% |
| thp-aligned-8kB | 4% | 2% | 3% | 0% | 4% | 1% |
| thp-aligned-16kB | 57% | 21% | 57% | 6% | 54% | 10% |
| thp-aligned-32kB | 4% | 1% | 4% | 1% | 3% | 1% |
| thp-aligned-64kB | 7% | 65% | 8% | 85% | 9% | 72% |
| thp-aligned-2048kB | 0% | 0% | 0% | 0% | 7% | 8% |
| thp-unaligned-16kB | 1% | 1% | 1% | 1% | 1% | 1% |
| thp-unaligned-32kB | 0% | 0% | 0% | 0% | 0% | 0% |
| thp-unaligned-64kB | 0% | 0% | 0% | 1% | 0% | 1% |
| thp-partial | 1% | 1% | 0% | 0% | 1% | 1% |
|------------------------|--------|--------|--------|--------|--------|--------|
| cont-aligned-64kB | 7% | 65% | 8% | 85% | 16% | 80% |
The above shows that for both workloads (each isolated with cgroups) as
well as the general system state after boot, the amount of text backed
by 4K and 16K folios reduces and the amount backed by 64K folios
increases significantly. And the amount of text that is contpte-mapped
significantly increases (see last row).
And this is reflected in performance improvement:
| Benchmark | Improvement |
+===============================================+======================+
| pts/nginx (200 connections) | 8.96% |
| pts/nginx (1000 connections) | 6.80% |
+-----------------------------------------------+----------------------+
| pts/redis (LPOP, 50 connections) | 5.07% |
| pts/redis (LPUSH, 50 connections) | 3.68% |
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/arm64/include/asm/pgtable.h | 8 +++++++
include/linux/pgtable.h | 11 +++++++++
mm/filemap.c | 40 ++++++++++++++++++++++++++------
3 files changed, 52 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 2a77f11b78d5..9eb35af0d3cf 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1537,6 +1537,14 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
*/
#define arch_wants_old_prefaulted_pte cpu_has_hw_af
+/*
+ * Request exec memory is read into pagecache in at least 64K folios. This size
+ * can be contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB
+ * entry), and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base
+ * pages are in use.
+ */
+#define exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
+
static inline bool pud_sect_supported(void)
{
return PAGE_SIZE == SZ_4K;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b50447ef1c92..1dd539c49f90 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -456,6 +456,17 @@ static inline bool arch_has_hw_pte_young(void)
}
#endif
+#ifndef exec_folio_order
+/*
+ * Returns preferred minimum folio order for executable file-backed memory. Must
+ * be in range [0, PMD_ORDER). Default to order-0.
+ */
+static inline unsigned int exec_folio_order(void)
+{
+ return 0;
+}
+#endif
+
#ifndef arch_check_zapped_pte
static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
pte_t pte)
diff --git a/mm/filemap.c b/mm/filemap.c
index e61f374068d4..37fe4a55c00d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3252,14 +3252,40 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
if (mmap_miss > MMAP_LOTSAMISS)
return fpin;
- /*
- * mmap read-around
- */
fpin = maybe_unlock_mmap_for_io(vmf, fpin);
- ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
- ra->size = ra->ra_pages;
- ra->async_size = ra->ra_pages / 4;
- ra->order = 0;
+ if (vm_flags & VM_EXEC) {
+ /*
+ * Allow arch to request a preferred minimum folio order for
+ * executable memory. This can often be beneficial to
+ * performance if (e.g.) arm64 can contpte-map the folio.
+ * Executable memory rarely benefits from readahead, due to its
+ * random access nature, so set async_size to 0.
+ *
+ * Limit to the boundaries of the VMA to avoid reading in any
+ * pad that might exist between sections, which would be a waste
+ * of memory.
+ */
+ struct vm_area_struct *vma = vmf->vma;
+ unsigned long start = vma->vm_pgoff;
+ unsigned long end = start + ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT);
+ unsigned long ra_end;
+
+ ra->order = exec_folio_order();
+ ra->start = round_down(vmf->pgoff, 1UL << ra->order);
+ ra->start = max(ra->start, start);
+ ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order);
+ ra_end = min(ra_end, end);
+ ra->size = ra_end - ra->start;
+ ra->async_size = 0;
+ } else {
+ /*
+ * mmap read-around
+ */
+ ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
+ ra->size = ra->ra_pages;
+ ra->async_size = ra->ra_pages / 4;
+ ra->order = 0;
+ }
ractl._index = ra->start;
page_cache_ra_order(&ractl, ra);
return fpin;
--
2.43.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 5/5] mm/filemap: Allow arch to request folio size for exec memory
2025-04-30 14:59 ` [RFC PATCH v4 5/5] mm/filemap: Allow arch to request folio size for exec memory Ryan Roberts
@ 2025-05-05 10:06 ` Jan Kara
2025-05-09 13:52 ` Will Deacon
1 sibling, 0 replies; 40+ messages in thread
From: Jan Kara @ 2025-05-05 10:06 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan,
linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On Wed 30-04-25 15:59:18, Ryan Roberts wrote:
> Change the readahead config so that if it is being requested for an
> executable mapping, do a synchronous read into a set of folios with an
> arch-specified order and in a naturally aligned manner. We no longer
> center the read on the faulting page but simply align it down to the
> previous natural boundary. Additionally, we don't bother with an
> asynchronous part.
>
> On arm64 if memory is physically contiguous and naturally aligned to the
> "contpte" size, we can use contpte mappings, which improves utilization
> of the TLB. When paired with the "multi-size THP" feature, this works
> well to reduce dTLB pressure. However iTLB pressure is still high due to
> executable mappings having a low likelihood of being in the required
> folio size and mapping alignment, even when the filesystem supports
> readahead into large folios (e.g. XFS).
>
> The reason for the low likelihood is that the current readahead
> algorithm starts with an order-0 folio and increases the folio order by
> 2 every time the readahead mark is hit. But most executable memory tends
> to be accessed randomly and so the readahead mark is rarely hit and most
> executable folios remain order-0.
>
> So let's special-case the read(ahead) logic for executable mappings. The
> trade-off is performance improvement (due to more efficient storage of
> the translations in iTLB) vs potential for making reclaim more difficult
> (due to the folios being larger so if a part of the folio is hot the
> whole thing is considered hot). But executable memory is a small portion
> of the overall system memory so I doubt this will even register from a
> reclaim perspective.
>
> I've chosen 64K folio size for arm64 which benefits both the 4K and 16K
> base page size configs. Crucially the same amount of data is still read
> (usually 128K) so I'm not expecting any read amplification issues. I
> don't anticipate any write amplification because text is always RO.
>
> Note that the text region of an ELF file could be populated into the
> page cache for other reasons than taking a fault in a mmapped area. The
> most common case is due to the loader read()ing the header which can be
> shared with the beginning of text. So some text will still remain in
> small folios, but this simple, best effort change provides good
> performance improvements as is.
>
> Confine this special-case approach to the bounds of the VMA. This
> prevents wasting memory for any padding that might exist in the file
> between sections. Previously the padding would have been contained in
> order-0 folios and would be easy to reclaim. But now it would be part of
> a larger folio so more difficult to reclaim. Solve this by simply not
> reading it into memory in the first place.
>
> Benchmarking
> ============
> TODO: NUMBERS ARE FOR V3 OF SERIES. NEED TO RERUN FOR THIS VERSION.
>
> The below shows nginx and redis benchmarks on Ampere Altra arm64 system.
>
> First, confirmation that this patch causes more text to be contained in
> 64K folios:
>
> | File-backed folios | system boot | nginx | redis |
> | by size as percentage |-----------------|-----------------|-----------------|
> | of all mapped text mem | before | after | before | after | before | after |
> |========================|========|========|========|========|========|========|
> | base-page-4kB | 26% | 9% | 27% | 6% | 21% | 5% |
> | thp-aligned-8kB | 4% | 2% | 3% | 0% | 4% | 1% |
> | thp-aligned-16kB | 57% | 21% | 57% | 6% | 54% | 10% |
> | thp-aligned-32kB | 4% | 1% | 4% | 1% | 3% | 1% |
> | thp-aligned-64kB | 7% | 65% | 8% | 85% | 9% | 72% |
> | thp-aligned-2048kB | 0% | 0% | 0% | 0% | 7% | 8% |
> | thp-unaligned-16kB | 1% | 1% | 1% | 1% | 1% | 1% |
> | thp-unaligned-32kB | 0% | 0% | 0% | 0% | 0% | 0% |
> | thp-unaligned-64kB | 0% | 0% | 0% | 1% | 0% | 1% |
> | thp-partial | 1% | 1% | 0% | 0% | 1% | 1% |
> |------------------------|--------|--------|--------|--------|--------|--------|
> | cont-aligned-64kB | 7% | 65% | 8% | 85% | 16% | 80% |
>
> The above shows that for both workloads (each isolated with cgroups) as
> well as the general system state after boot, the amount of text backed
> by 4K and 16K folios reduces and the amount backed by 64K folios
> increases significantly. And the amount of text that is contpte-mapped
> significantly increases (see last row).
>
> And this is reflected in performance improvement:
>
> | Benchmark | Improvement |
> +===============================================+======================+
> | pts/nginx (200 connections) | 8.96% |
> | pts/nginx (1000 connections) | 6.80% |
> +-----------------------------------------------+----------------------+
> | pts/redis (LPOP, 50 connections) | 5.07% |
> | pts/redis (LPUSH, 50 connections) | 3.68% |
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Looks good to me. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> diff --git a/mm/filemap.c b/mm/filemap.c
> index e61f374068d4..37fe4a55c00d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3252,14 +3252,40 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> if (mmap_miss > MMAP_LOTSAMISS)
> return fpin;
>
> - /*
> - * mmap read-around
> - */
> fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> - ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
> - ra->size = ra->ra_pages;
> - ra->async_size = ra->ra_pages / 4;
> - ra->order = 0;
> + if (vm_flags & VM_EXEC) {
> + /*
> + * Allow arch to request a preferred minimum folio order for
> + * executable memory. This can often be beneficial to
> + * performance if (e.g.) arm64 can contpte-map the folio.
> + * Executable memory rarely benefits from readahead, due to its
> + * random access nature, so set async_size to 0.
> + *
> + * Limit to the boundaries of the VMA to avoid reading in any
> + * pad that might exist between sections, which would be a waste
> + * of memory.
> + */
> + struct vm_area_struct *vma = vmf->vma;
> + unsigned long start = vma->vm_pgoff;
> + unsigned long end = start + ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT);
> + unsigned long ra_end;
> +
> + ra->order = exec_folio_order();
> + ra->start = round_down(vmf->pgoff, 1UL << ra->order);
> + ra->start = max(ra->start, start);
> + ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order);
> + ra_end = min(ra_end, end);
> + ra->size = ra_end - ra->start;
> + ra->async_size = 0;
> + } else {
> + /*
> + * mmap read-around
> + */
> + ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
> + ra->size = ra->ra_pages;
> + ra->async_size = ra->ra_pages / 4;
> + ra->order = 0;
> + }
> ractl._index = ra->start;
> page_cache_ra_order(&ractl, ra);
> return fpin;
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 5/5] mm/filemap: Allow arch to request folio size for exec memory
2025-04-30 14:59 ` [RFC PATCH v4 5/5] mm/filemap: Allow arch to request folio size for exec memory Ryan Roberts
2025-05-05 10:06 ` Jan Kara
@ 2025-05-09 13:52 ` Will Deacon
2025-05-13 12:46 ` Ryan Roberts
1 sibling, 1 reply; 40+ messages in thread
From: Will Deacon @ 2025-05-09 13:52 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Kalesh Singh, Zi Yan, linux-arm-kernel,
linux-kernel, linux-fsdevel, linux-mm
On Wed, Apr 30, 2025 at 03:59:18PM +0100, Ryan Roberts wrote:
> Change the readahead config so that if it is being requested for an
> executable mapping, do a synchronous read into a set of folios with an
> arch-specified order and in a naturally aligned manner. We no longer
> center the read on the faulting page but simply align it down to the
> previous natural boundary. Additionally, we don't bother with an
> asynchronous part.
>
> On arm64 if memory is physically contiguous and naturally aligned to the
> "contpte" size, we can use contpte mappings, which improves utilization
> of the TLB. When paired with the "multi-size THP" feature, this works
> well to reduce dTLB pressure. However iTLB pressure is still high due to
> executable mappings having a low likelihood of being in the required
> folio size and mapping alignment, even when the filesystem supports
> readahead into large folios (e.g. XFS).
>
> The reason for the low likelihood is that the current readahead
> algorithm starts with an order-0 folio and increases the folio order by
> 2 every time the readahead mark is hit. But most executable memory tends
> to be accessed randomly and so the readahead mark is rarely hit and most
> executable folios remain order-0.
>
> So let's special-case the read(ahead) logic for executable mappings. The
> trade-off is performance improvement (due to more efficient storage of
> the translations in iTLB) vs potential for making reclaim more difficult
> (due to the folios being larger so if a part of the folio is hot the
> whole thing is considered hot). But executable memory is a small portion
> of the overall system memory so I doubt this will even register from a
> reclaim perspective.
>
> I've chosen 64K folio size for arm64 which benefits both the 4K and 16K
> base page size configs. Crucially the same amount of data is still read
> (usually 128K) so I'm not expecting any read amplification issues. I
> don't anticipate any write amplification because text is always RO.
>
> Note that the text region of an ELF file could be populated into the
> page cache for other reasons than taking a fault in a mmapped area. The
> most common case is due to the loader read()ing the header which can be
> shared with the beginning of text. So some text will still remain in
> small folios, but this simple, best effort change provides good
> performance improvements as is.
>
> Confine this special-case approach to the bounds of the VMA. This
> prevents wasting memory for any padding that might exist in the file
> between sections. Previously the padding would have been contained in
> order-0 folios and would be easy to reclaim. But now it would be part of
> a larger folio so more difficult to reclaim. Solve this by simply not
> reading it into memory in the first place.
>
> Benchmarking
> ============
> TODO: NUMBERS ARE FOR V3 OF SERIES. NEED TO RERUN FOR THIS VERSION.
>
> The below shows nginx and redis benchmarks on Ampere Altra arm64 system.
>
> First, confirmation that this patch causes more text to be contained in
> 64K folios:
>
> | File-backed folios | system boot | nginx | redis |
> | by size as percentage |-----------------|-----------------|-----------------|
> | of all mapped text mem | before | after | before | after | before | after |
> |========================|========|========|========|========|========|========|
> | base-page-4kB | 26% | 9% | 27% | 6% | 21% | 5% |
> | thp-aligned-8kB | 4% | 2% | 3% | 0% | 4% | 1% |
> | thp-aligned-16kB | 57% | 21% | 57% | 6% | 54% | 10% |
> | thp-aligned-32kB | 4% | 1% | 4% | 1% | 3% | 1% |
> | thp-aligned-64kB | 7% | 65% | 8% | 85% | 9% | 72% |
> | thp-aligned-2048kB | 0% | 0% | 0% | 0% | 7% | 8% |
> | thp-unaligned-16kB | 1% | 1% | 1% | 1% | 1% | 1% |
> | thp-unaligned-32kB | 0% | 0% | 0% | 0% | 0% | 0% |
> | thp-unaligned-64kB | 0% | 0% | 0% | 1% | 0% | 1% |
> | thp-partial | 1% | 1% | 0% | 0% | 1% | 1% |
> |------------------------|--------|--------|--------|--------|--------|--------|
> | cont-aligned-64kB | 7% | 65% | 8% | 85% | 16% | 80% |
>
> The above shows that for both workloads (each isolated with cgroups) as
> well as the general system state after boot, the amount of text backed
> by 4K and 16K folios reduces and the amount backed by 64K folios
> increases significantly. And the amount of text that is contpte-mapped
> significantly increases (see last row).
>
> And this is reflected in performance improvement:
>
> | Benchmark | Improvement |
> +===============================================+======================+
> | pts/nginx (200 connections) | 8.96% |
> | pts/nginx (1000 connections) | 6.80% |
> +-----------------------------------------------+----------------------+
> | pts/redis (LPOP, 50 connections) | 5.07% |
> | pts/redis (LPUSH, 50 connections) | 3.68% |
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> arch/arm64/include/asm/pgtable.h | 8 +++++++
> include/linux/pgtable.h | 11 +++++++++
> mm/filemap.c | 40 ++++++++++++++++++++++++++------
> 3 files changed, 52 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 2a77f11b78d5..9eb35af0d3cf 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1537,6 +1537,14 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
> */
> #define arch_wants_old_prefaulted_pte cpu_has_hw_af
>
> +/*
> + * Request exec memory is read into pagecache in at least 64K folios. This size
> + * can be contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB
> + * entry), and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base
> + * pages are in use.
> + */
> +#define exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
> +
> static inline bool pud_sect_supported(void)
> {
> return PAGE_SIZE == SZ_4K;
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b50447ef1c92..1dd539c49f90 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -456,6 +456,17 @@ static inline bool arch_has_hw_pte_young(void)
> }
> #endif
>
> +#ifndef exec_folio_order
> +/*
> + * Returns preferred minimum folio order for executable file-backed memory. Must
> + * be in range [0, PMD_ORDER). Default to order-0.
> + */
> +static inline unsigned int exec_folio_order(void)
> +{
> + return 0;
> +}
> +#endif
> +
> #ifndef arch_check_zapped_pte
> static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
> pte_t pte)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index e61f374068d4..37fe4a55c00d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3252,14 +3252,40 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> if (mmap_miss > MMAP_LOTSAMISS)
> return fpin;
>
> - /*
> - * mmap read-around
> - */
> fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> - ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
> - ra->size = ra->ra_pages;
> - ra->async_size = ra->ra_pages / 4;
> - ra->order = 0;
> + if (vm_flags & VM_EXEC) {
> + /*
> + * Allow arch to request a preferred minimum folio order for
> + * executable memory. This can often be beneficial to
> + * performance if (e.g.) arm64 can contpte-map the folio.
> + * Executable memory rarely benefits from readahead, due to its
> + * random access nature, so set async_size to 0.
In light of this observation (about randomness of instruction fetch), do
you think it's worth ignoring VM_RAND_READ for VM_EXEC?
Either way, I was looking at this because it touches arm64 and it looks
fine to me:
Acked-by: Will Deacon <will@kernel.org>
Will
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v4 5/5] mm/filemap: Allow arch to request folio size for exec memory
2025-05-09 13:52 ` Will Deacon
@ 2025-05-13 12:46 ` Ryan Roberts
2025-05-14 15:14 ` Will Deacon
0 siblings, 1 reply; 40+ messages in thread
From: Ryan Roberts @ 2025-05-13 12:46 UTC (permalink / raw)
To: Will Deacon
Cc: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Kalesh Singh, Zi Yan, linux-arm-kernel,
linux-kernel, linux-fsdevel, linux-mm
On 09/05/2025 14:52, Will Deacon wrote:
> On Wed, Apr 30, 2025 at 03:59:18PM +0100, Ryan Roberts wrote:
>> Change the readahead config so that if it is being requested for an
>> executable mapping, do a synchronous read into a set of folios with an
>> arch-specified order and in a naturally aligned manner. We no longer
>> center the read on the faulting page but simply align it down to the
>> previous natural boundary. Additionally, we don't bother with an
>> asynchronous part.
>>
>> On arm64 if memory is physically contiguous and naturally aligned to the
>> "contpte" size, we can use contpte mappings, which improves utilization
>> of the TLB. When paired with the "multi-size THP" feature, this works
>> well to reduce dTLB pressure. However iTLB pressure is still high due to
>> executable mappings having a low likelihood of being in the required
>> folio size and mapping alignment, even when the filesystem supports
>> readahead into large folios (e.g. XFS).
>>
>> The reason for the low likelihood is that the current readahead
>> algorithm starts with an order-0 folio and increases the folio order by
>> 2 every time the readahead mark is hit. But most executable memory tends
>> to be accessed randomly and so the readahead mark is rarely hit and most
>> executable folios remain order-0.
>>
>> So let's special-case the read(ahead) logic for executable mappings. The
>> trade-off is performance improvement (due to more efficient storage of
>> the translations in iTLB) vs potential for making reclaim more difficult
>> (due to the folios being larger so if a part of the folio is hot the
>> whole thing is considered hot). But executable memory is a small portion
>> of the overall system memory so I doubt this will even register from a
>> reclaim perspective.
>>
>> I've chosen 64K folio size for arm64 which benefits both the 4K and 16K
>> base page size configs. Crucially the same amount of data is still read
>> (usually 128K) so I'm not expecting any read amplification issues. I
>> don't anticipate any write amplification because text is always RO.
>>
>> Note that the text region of an ELF file could be populated into the
>> page cache for other reasons than taking a fault in a mmapped area. The
>> most common case is due to the loader read()ing the header which can be
>> shared with the beginning of text. So some text will still remain in
>> small folios, but this simple, best effort change provides good
>> performance improvements as is.
>>
>> Confine this special-case approach to the bounds of the VMA. This
>> prevents wasting memory for any padding that might exist in the file
>> between sections. Previously the padding would have been contained in
>> order-0 folios and would be easy to reclaim. But now it would be part of
>> a larger folio so more difficult to reclaim. Solve this by simply not
>> reading it into memory in the first place.
>>
>> Benchmarking
>> ============
>> TODO: NUMBERS ARE FOR V3 OF SERIES. NEED TO RERUN FOR THIS VERSION.
>>
>> The below shows nginx and redis benchmarks on Ampere Altra arm64 system.
>>
>> First, confirmation that this patch causes more text to be contained in
>> 64K folios:
>>
>> | File-backed folios | system boot | nginx | redis |
>> | by size as percentage |-----------------|-----------------|-----------------|
>> | of all mapped text mem | before | after | before | after | before | after |
>> |========================|========|========|========|========|========|========|
>> | base-page-4kB | 26% | 9% | 27% | 6% | 21% | 5% |
>> | thp-aligned-8kB | 4% | 2% | 3% | 0% | 4% | 1% |
>> | thp-aligned-16kB | 57% | 21% | 57% | 6% | 54% | 10% |
>> | thp-aligned-32kB | 4% | 1% | 4% | 1% | 3% | 1% |
>> | thp-aligned-64kB | 7% | 65% | 8% | 85% | 9% | 72% |
>> | thp-aligned-2048kB | 0% | 0% | 0% | 0% | 7% | 8% |
>> | thp-unaligned-16kB | 1% | 1% | 1% | 1% | 1% | 1% |
>> | thp-unaligned-32kB | 0% | 0% | 0% | 0% | 0% | 0% |
>> | thp-unaligned-64kB | 0% | 0% | 0% | 1% | 0% | 1% |
>> | thp-partial | 1% | 1% | 0% | 0% | 1% | 1% |
>> |------------------------|--------|--------|--------|--------|--------|--------|
>> | cont-aligned-64kB | 7% | 65% | 8% | 85% | 16% | 80% |
>>
>> The above shows that for both workloads (each isolated with cgroups) as
>> well as the general system state after boot, the amount of text backed
>> by 4K and 16K folios reduces and the amount backed by 64K folios
>> increases significantly. And the amount of text that is contpte-mapped
>> significantly increases (see last row).
>>
>> And this is reflected in performance improvement:
>>
>> | Benchmark | Improvement |
>> +===============================================+======================+
>> | pts/nginx (200 connections) | 8.96% |
>> | pts/nginx (1000 connections) | 6.80% |
>> +-----------------------------------------------+----------------------+
>> | pts/redis (LPOP, 50 connections) | 5.07% |
>> | pts/redis (LPUSH, 50 connections) | 3.68% |
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>> arch/arm64/include/asm/pgtable.h | 8 +++++++
>> include/linux/pgtable.h | 11 +++++++++
>> mm/filemap.c | 40 ++++++++++++++++++++++++++------
>> 3 files changed, 52 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 2a77f11b78d5..9eb35af0d3cf 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1537,6 +1537,14 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
>>   */
>>  #define arch_wants_old_prefaulted_pte cpu_has_hw_af
>>
>> +/*
>> + * Request exec memory is read into pagecache in at least 64K folios. This size
>> + * can be contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB
>> + * entry), and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base
>> + * pages are in use.
>> + */
>> +#define exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
>> +
>>  static inline bool pud_sect_supported(void)
>>  {
>>  	return PAGE_SIZE == SZ_4K;
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b50447ef1c92..1dd539c49f90 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -456,6 +456,17 @@ static inline bool arch_has_hw_pte_young(void)
>>  }
>>  #endif
>>
>> +#ifndef exec_folio_order
>> +/*
>> + * Returns preferred minimum folio order for executable file-backed memory. Must
>> + * be in range [0, PMD_ORDER). Default to order-0.
>> + */
>> +static inline unsigned int exec_folio_order(void)
>> +{
>> +	return 0;
>> +}
>> +#endif
>> +
>>  #ifndef arch_check_zapped_pte
>>  static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
>>  					 pte_t pte)
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index e61f374068d4..37fe4a55c00d 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -3252,14 +3252,40 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>>  	if (mmap_miss > MMAP_LOTSAMISS)
>>  		return fpin;
>>
>> -	/*
>> -	 * mmap read-around
>> -	 */
>>  	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>> -	ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
>> -	ra->size = ra->ra_pages;
>> -	ra->async_size = ra->ra_pages / 4;
>> -	ra->order = 0;
>> +	if (vm_flags & VM_EXEC) {
>> +		/*
>> +		 * Allow arch to request a preferred minimum folio order for
>> +		 * executable memory. This can often be beneficial to
>> +		 * performance if (e.g.) arm64 can contpte-map the folio.
>> +		 * Executable memory rarely benefits from readahead, due to its
>> +		 * random access nature, so set async_size to 0.
>
> In light of this observation (about randomness of instruction fetch), do
> you think it's worth ignoring VM_RAND_READ for VM_EXEC?
Hmm, yeah that makes sense. Something like:
---8<---
diff --git a/mm/filemap.c b/mm/filemap.c
index 7b90cbeb4a1a..6c8bf5116c54 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3233,7 +3233,8 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	if (!ra->ra_pages)
 		return fpin;

-	if (vm_flags & VM_SEQ_READ) {
+	/* VM_EXEC case below is already intended for random access */
+	if ((vm_flags & (VM_SEQ_READ | VM_EXEC)) == VM_SEQ_READ) {
 		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 		page_cache_sync_ra(&ractl, ra->ra_pages);
 		return fpin;
---8<---
>
> Either way, I was looking at this because it touches arm64 and it looks
> fine to me:
>
> Acked-by: Will Deacon <will@kernel.org>
Thanks!
>
> Will
* Re: [RFC PATCH v4 5/5] mm/filemap: Allow arch to request folio size for exec memory
2025-05-13 12:46 ` Ryan Roberts
@ 2025-05-14 15:14 ` Will Deacon
2025-05-14 15:31 ` Ryan Roberts
0 siblings, 1 reply; 40+ messages in thread
From: Will Deacon @ 2025-05-14 15:14 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Kalesh Singh, Zi Yan, linux-arm-kernel,
linux-kernel, linux-fsdevel, linux-mm
On Tue, May 13, 2025 at 01:46:06PM +0100, Ryan Roberts wrote:
> On 09/05/2025 14:52, Will Deacon wrote:
> > On Wed, Apr 30, 2025 at 03:59:18PM +0100, Ryan Roberts wrote:
> >> diff --git a/mm/filemap.c b/mm/filemap.c
> >> index e61f374068d4..37fe4a55c00d 100644
> >> --- a/mm/filemap.c
> >> +++ b/mm/filemap.c
> >> @@ -3252,14 +3252,40 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> >>  	if (mmap_miss > MMAP_LOTSAMISS)
> >>  		return fpin;
> >>
> >> -	/*
> >> -	 * mmap read-around
> >> -	 */
> >>  	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> >> -	ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
> >> -	ra->size = ra->ra_pages;
> >> -	ra->async_size = ra->ra_pages / 4;
> >> -	ra->order = 0;
> >> +	if (vm_flags & VM_EXEC) {
> >> +		/*
> >> +		 * Allow arch to request a preferred minimum folio order for
> >> +		 * executable memory. This can often be beneficial to
> >> +		 * performance if (e.g.) arm64 can contpte-map the folio.
> >> +		 * Executable memory rarely benefits from readahead, due to its
> >> +		 * random access nature, so set async_size to 0.
> >
> > In light of this observation (about randomness of instruction fetch), do
> > you think it's worth ignoring VM_RAND_READ for VM_EXEC?
>
> Hmm, yeah that makes sense. Something like:
>
> ---8<---
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7b90cbeb4a1a..6c8bf5116c54 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3233,7 +3233,8 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  	if (!ra->ra_pages)
>  		return fpin;
>
> -	if (vm_flags & VM_SEQ_READ) {
> +	/* VM_EXEC case below is already intended for random access */
> +	if ((vm_flags & (VM_SEQ_READ | VM_EXEC)) == VM_SEQ_READ) {
>  		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>  		page_cache_sync_ra(&ractl, ra->ra_pages);
>  		return fpin;
> ---8<---
I was thinking about the:
	if (vm_flags & VM_RAND_READ)
		return fpin;
code above this which bails if VM_RAND_READ is set. That seems contrary
to the code you're adding which says that, even for random access
patterns where readahead doesn't help, it's still worth sizing the folio
appropriately for contpte mappings.
Will
* Re: [RFC PATCH v4 5/5] mm/filemap: Allow arch to request folio size for exec memory
2025-05-14 15:14 ` Will Deacon
@ 2025-05-14 15:31 ` Ryan Roberts
0 siblings, 0 replies; 40+ messages in thread
From: Ryan Roberts @ 2025-05-14 15:31 UTC (permalink / raw)
To: Will Deacon
Cc: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Kalesh Singh, Zi Yan, linux-arm-kernel,
linux-kernel, linux-fsdevel, linux-mm
On 14/05/2025 16:14, Will Deacon wrote:
> On Tue, May 13, 2025 at 01:46:06PM +0100, Ryan Roberts wrote:
>> On 09/05/2025 14:52, Will Deacon wrote:
>>> On Wed, Apr 30, 2025 at 03:59:18PM +0100, Ryan Roberts wrote:
>>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>>> index e61f374068d4..37fe4a55c00d 100644
>>>> --- a/mm/filemap.c
>>>> +++ b/mm/filemap.c
>>>> @@ -3252,14 +3252,40 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>>>>  	if (mmap_miss > MMAP_LOTSAMISS)
>>>>  		return fpin;
>>>>
>>>> -	/*
>>>> -	 * mmap read-around
>>>> -	 */
>>>>  	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>>>> -	ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
>>>> -	ra->size = ra->ra_pages;
>>>> -	ra->async_size = ra->ra_pages / 4;
>>>> -	ra->order = 0;
>>>> +	if (vm_flags & VM_EXEC) {
>>>> +		/*
>>>> +		 * Allow arch to request a preferred minimum folio order for
>>>> +		 * executable memory. This can often be beneficial to
>>>> +		 * performance if (e.g.) arm64 can contpte-map the folio.
>>>> +		 * Executable memory rarely benefits from readahead, due to its
>>>> +		 * random access nature, so set async_size to 0.
>>>
>>> In light of this observation (about randomness of instruction fetch), do
>>> you think it's worth ignoring VM_RAND_READ for VM_EXEC?
>>
>> Hmm, yeah that makes sense. Something like:
>>
>> ---8<---
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 7b90cbeb4a1a..6c8bf5116c54 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -3233,7 +3233,8 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>>  	if (!ra->ra_pages)
>>  		return fpin;
>>
>> -	if (vm_flags & VM_SEQ_READ) {
>> +	/* VM_EXEC case below is already intended for random access */
>> +	if ((vm_flags & (VM_SEQ_READ | VM_EXEC)) == VM_SEQ_READ) {
>>  		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>>  		page_cache_sync_ra(&ractl, ra->ra_pages);
>>  		return fpin;
>> ---8<---
>
> I was thinking about the:
>
> 	if (vm_flags & VM_RAND_READ)
> 		return fpin;
Yes, sorry - I lost my mind when doing that patch... I intended to apply it to the
VM_RAND_READ check as you suggested, but my fingers did something completely different.
>
> code above this which bails if VM_RAND_READ is set. That seems contrary
> to the code you're adding which says that, even for random access
> patterns where readahead doesn't help, it's still worth sizing the folio
> appropriately for contpte mappings.
Anyway, I totally agree with this. So I'll avoid the early return for VM_RAND_READ
if VM_EXEC is also set.
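Something like this (untested sketch; the exact context may differ once I rebase):

	/* VM_EXEC case below still wants the arch-preferred folio size */
	if ((vm_flags & (VM_RAND_READ | VM_EXEC)) == VM_RAND_READ)
		return fpin;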
Thanks,
Ryan
>
> Will
* Re: [RFC PATCH v4 0/5] Readahead tweaks for larger folios
2025-04-30 14:59 [RFC PATCH v4 0/5] Readahead tweaks for larger folios Ryan Roberts
` (4 preceding siblings ...)
2025-04-30 14:59 ` [RFC PATCH v4 5/5] mm/filemap: Allow arch to request folio size for exec memory Ryan Roberts
@ 2025-05-06 10:05 ` Ryan Roberts
5 siblings, 0 replies; 40+ messages in thread
From: Ryan Roberts @ 2025-05-06 10:05 UTC (permalink / raw)
To: Andrew Morton, Matthew Wilcox (Oracle), Alexander Viro,
Christian Brauner, Jan Kara, David Hildenbrand, Dave Chinner,
Catalin Marinas, Will Deacon, Kalesh Singh, Zi Yan
Cc: linux-arm-kernel, linux-kernel, linux-fsdevel, linux-mm
On 30/04/2025 15:59, Ryan Roberts wrote:
> Hi All,
>
> This RFC series adds some tweaks to readahead so that it does a better job of
> ramping up folio sizes as readahead extends further into the file. And it
> additionally special-cases executable mappings to allow the arch to request a
> preferred folio size for text.
>
> Previous versions of the series focussed on the latter part only (large folios
> for text). See [3]. But after discussion with Matthew Wilcox last week, we
> decided that we should really be fixing some of the unintended behaviours in how
> a folio size is selected in general before special-casing for text. As a result
> patches 1-4 make folio size selection behave more sanely, then patch 5
> introduces large folios for text. Patch 5 depends on patch 1, but does not
> depend on patches 2-4.
>
> ---
>
> I'm leaving this marked as RFC for now as I intend to do more testing, and
> haven't yet updated the benchmark results in patch 5 (although I expect them to
> be similar).
Thanks Jan, David and Anshuman for the reviews! I'll do the suggested changes
and complete my testing, then aim to post again against -rc1, to hopefully get
it into linux-next.
Thanks,
Ryan
>
> Applies on top of Monday's mm-unstable (b18dec6a6ad3) and passes all mm
> kselftests.
>
> Changes since v3 [3]
> ====================
>
> - Added patches 1-4 to do a better job of ramping up folio order
> - In patch 5:
> - Confine readahead blocks to vma boundaries (per Kalesh)
> - Rename arch_exec_folio_order() to exec_folio_order() (per Matthew)
> - exec_folio_order() now returns unsigned int and defaults to order-0
> (per Matthew)
> - readahead size is honoured (including when disabled)
>
> Changes since v2 [2]
> ====================
>
> - Rename arch_wants_exec_folio_order() to arch_exec_folio_order() (per Andrew)
> - Fixed some typos (per Andrew)
>
> Changes since v1 [1]
> ====================
>
> - Remove "void" from arch_wants_exec_folio_order() macro args list
>
> [1] https://lore.kernel.org/linux-mm/20240111154106.3692206-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/all/20240215154059.2863126-1-ryan.roberts@arm.com/
> [3] https://lore.kernel.org/linux-mm/20250327160700.1147155-1-ryan.roberts@arm.com/
>
> Thanks,
> Ryan
>
> Ryan Roberts (5):
> mm/readahead: Honour new_order in page_cache_ra_order()
> mm/readahead: Terminate async readahead on natural boundary
> mm/readahead: Make space in struct file_ra_state
> mm/readahead: Store folio order in struct file_ra_state
> mm/filemap: Allow arch to request folio size for exec memory
>
> arch/arm64/include/asm/pgtable.h | 8 +++++
> include/linux/fs.h | 4 ++-
> include/linux/pgtable.h | 11 +++++++
> mm/filemap.c | 55 ++++++++++++++++++++++++--------
> mm/internal.h | 3 +-
> mm/readahead.c | 27 +++++++++-------
> 6 files changed, 81 insertions(+), 27 deletions(-)
>
> --
> 2.43.0
>