[PATCH] mm/filemap: Align last

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm/filemap: Align last_index to folio size
@ 2025-07-11  5:55 Youling Tang
  2025-07-11 16:08 ` David Hildenbrand
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Youling Tang @ 2025-07-11  5:55 UTC (permalink / raw)
  To: Matthew Wilcox, Andrew Morton
  Cc: Jan Kara, linux-fsdevel, linux-mm, linux-kernel, chizhiling,
	youling.tang, Youling Tang, Chi Zhiling

From: Youling Tang <tangyouling@kylinos.cn>

On XFS systems with pagesize=4K, blocksize=16K, and CONFIG_TRANSPARENT_HUGEPAGE
enabled, We observed the following readahead behaviors:
 # echo 3 > /proc/sys/vm/drop_caches
 # dd if=test of=/dev/null bs=64k count=1
 # ./tools/mm/page-types -r -L -f  /mnt/xfs/test
 foffset	offset	flags
 0	136d4c	__RU_l_________H______t_________________F_1
 1	136d4d	__RU_l__________T_____t_________________F_1
 2	136d4e	__RU_l__________T_____t_________________F_1
 3	136d4f	__RU_l__________T_____t_________________F_1
 ...
 c	136bb8	__RU_l_________H______t_________________F_1
 d	136bb9	__RU_l__________T_____t_________________F_1
 e	136bba	__RU_l__________T_____t_________________F_1
 f	136bbb	__RU_l__________T_____t_________________F_1   <-- first read
 10	13c2cc	___U_l_________H______t______________I__F_1   <-- readahead flag
 11	13c2cd	___U_l__________T_____t______________I__F_1
 12	13c2ce	___U_l__________T_____t______________I__F_1
 13	13c2cf	___U_l__________T_____t______________I__F_1
 ...
 1c	1405d4	___U_l_________H______t_________________F_1
 1d	1405d5	___U_l__________T_____t_________________F_1
 1e	1405d6	___U_l__________T_____t_________________F_1
 1f	1405d7	___U_l__________T_____t_________________F_1
 [ra_size = 32, req_count = 16, async_size = 16]

 # echo 3 > /proc/sys/vm/drop_caches
 # dd if=test of=/dev/null bs=60k count=1
 # ./page-types -r -L -f  /mnt/xfs/test
 foffset	offset	flags
 0	136048	__RU_l_________H______t_________________F_1
 ...
 c	110a40	__RU_l_________H______t_________________F_1
 d	110a41	__RU_l__________T_____t_________________F_1
 e	110a42	__RU_l__________T_____t_________________F_1   <-- first read
 f	110a43	__RU_l__________T_____t_________________F_1   <-- first readahead flag
 10	13e7a8	___U_l_________H______t_________________F_1
 ...
 20	137a00	___U_l_________H______t_______P______I__F_1   <-- second readahead flag (20 - 2f)
 21	137a01	___U_l__________T_____t_______P______I__F_1
 ...
 3f	10d4af	___U_l__________T_____t_______P_________F_1
 [first readahead: ra_size = 32, req_count = 15, async_size = 17]

When reading 64k data (same for 61-63k range, where last_index is page-aligned
in filemap_get_pages()), 128k readahead is triggered via page_cache_sync_ra()
and the PG_readahead flag is set on the next folio (the one containing 0x10 page).

When reading 60k data, 128k readahead is also triggered via page_cache_sync_ra().
However, in this case the readahead flag is set on the 0xf page. Although the
requested read size (req_count) is 60k, the actual read will be aligned to
folio size (64k), which triggers the readahead flag and initiates asynchronous
readahead via page_cache_async_ra(). This results in two readahead operations
totaling 256k.

The root cause is that when the requested size is smaller than the actual read
size (due to folio alignment), it triggers asynchronous readahead. By changing
last_index alignment from page size to folio size, we ensure the requested size
matches the actual read size, preventing the case where a single read operation
triggers two readahead operations.

After applying the patch:
 # echo 3 > /proc/sys/vm/drop_caches
 # dd if=test of=/dev/null bs=60k count=1
 # ./page-types -r -L -f  /mnt/xfs/test
 foffset	offset	flags
 0	136d4c	__RU_l_________H______t_________________F_1
 1	136d4d	__RU_l__________T_____t_________________F_1
 2	136d4e	__RU_l__________T_____t_________________F_1
 3	136d4f	__RU_l__________T_____t_________________F_1
 ...
 c	136bb8	__RU_l_________H______t_________________F_1
 d	136bb9	__RU_l__________T_____t_________________F_1
 e	136bba	__RU_l__________T_____t_________________F_1   <-- first read
 f	136bbb	__RU_l__________T_____t_________________F_1
 10	13c2cc	___U_l_________H______t______________I__F_1   <-- readahead flag
 11	13c2cd	___U_l__________T_____t______________I__F_1
 12	13c2ce	___U_l__________T_____t______________I__F_1
 13	13c2cf	___U_l__________T_____t______________I__F_1
 ...
 1c	1405d4	___U_l_________H______t_________________F_1
 1d	1405d5	___U_l__________T_____t_________________F_1
 1e	1405d6	___U_l__________T_____t_________________F_1
 1f	1405d7	___U_l__________T_____t_________________F_1
 [ra_size = 32, req_count = 16, async_size = 16]

The same phenomenon will occur when reading from 49k to 64k. Set the readahead
flag to the next folio.

Because the minimum order of folio in address_space equals the block size (at
least in xfs and bcachefs that already support bs > ps), having request_count
aligned to block size will not cause overread.

Co-developed-by: Chi Zhiling <chizhiling@kylinos.cn>
Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
---
 include/linux/pagemap.h | 6 ++++++
 mm/filemap.c            | 5 +++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e63fbfbd5b0f..447bb264fd94 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -480,6 +480,12 @@ mapping_min_folio_nrpages(struct address_space *mapping)
 	return 1UL << mapping_min_folio_order(mapping);
 }
 
+static inline unsigned long
+mapping_min_folio_nrbytes(struct address_space *mapping)
+{
+	return mapping_min_folio_nrpages(mapping) << PAGE_SHIFT;
+}
+
 /**
  * mapping_align_index() - Align index for this mapping.
  * @mapping: The address_space.
diff --git a/mm/filemap.c b/mm/filemap.c
index 765dc5ef6d5a..56a8656b6f86 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2584,8 +2584,9 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 	unsigned int flags;
 	int err = 0;
 
-	/* "last_index" is the index of the page beyond the end of the read */
-	last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
+	/* "last_index" is the index of the folio beyond the end of the read */
+	last_index = round_up(iocb->ki_pos + count, mapping_min_folio_nrbytes(mapping));
+	last_index >>= PAGE_SHIFT;
 retry:
 	if (fatal_signal_pending(current))
 		return -EINTR;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/filemap: Align last_index to folio size
  2025-07-11  5:55 [PATCH] mm/filemap: Align last_index to folio size Youling Tang
@ 2025-07-11 16:08 ` David Hildenbrand
  2025-07-14  8:16   ` Ryan Roberts
  2025-07-14  9:33 ` Jan Kara
  2025-07-14 22:34 ` Klara Modin
  2 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2025-07-11 16:08 UTC (permalink / raw)
  To: Youling Tang, Matthew Wilcox, Andrew Morton
  Cc: Jan Kara, linux-fsdevel, linux-mm, linux-kernel, chizhiling,
	Youling Tang, Chi Zhiling, Ryan Roberts

CCing Ryan, who recently fiddled with readahead.


On 11.07.25 07:55, Youling Tang wrote:
> From: Youling Tang <tangyouling@kylinos.cn>
> 
> On XFS systems with pagesize=4K, blocksize=16K, and CONFIG_TRANSPARENT_HUGEPAGE
> enabled, We observed the following readahead behaviors:
>   # echo 3 > /proc/sys/vm/drop_caches
>   # dd if=test of=/dev/null bs=64k count=1
>   # ./tools/mm/page-types -r -L -f  /mnt/xfs/test
>   foffset	offset	flags
>   0	136d4c	__RU_l_________H______t_________________F_1
>   1	136d4d	__RU_l__________T_____t_________________F_1
>   2	136d4e	__RU_l__________T_____t_________________F_1
>   3	136d4f	__RU_l__________T_____t_________________F_1
>   ...
>   c	136bb8	__RU_l_________H______t_________________F_1
>   d	136bb9	__RU_l__________T_____t_________________F_1
>   e	136bba	__RU_l__________T_____t_________________F_1
>   f	136bbb	__RU_l__________T_____t_________________F_1   <-- first read
>   10	13c2cc	___U_l_________H______t______________I__F_1   <-- readahead flag
>   11	13c2cd	___U_l__________T_____t______________I__F_1
>   12	13c2ce	___U_l__________T_____t______________I__F_1
>   13	13c2cf	___U_l__________T_____t______________I__F_1
>   ...
>   1c	1405d4	___U_l_________H______t_________________F_1
>   1d	1405d5	___U_l__________T_____t_________________F_1
>   1e	1405d6	___U_l__________T_____t_________________F_1
>   1f	1405d7	___U_l__________T_____t_________________F_1
>   [ra_size = 32, req_count = 16, async_size = 16]
> 
>   # echo 3 > /proc/sys/vm/drop_caches
>   # dd if=test of=/dev/null bs=60k count=1
>   # ./page-types -r -L -f  /mnt/xfs/test
>   foffset	offset	flags
>   0	136048	__RU_l_________H______t_________________F_1
>   ...
>   c	110a40	__RU_l_________H______t_________________F_1
>   d	110a41	__RU_l__________T_____t_________________F_1
>   e	110a42	__RU_l__________T_____t_________________F_1   <-- first read
>   f	110a43	__RU_l__________T_____t_________________F_1   <-- first readahead flag
>   10	13e7a8	___U_l_________H______t_________________F_1
>   ...
>   20	137a00	___U_l_________H______t_______P______I__F_1   <-- second readahead flag (20 - 2f)
>   21	137a01	___U_l__________T_____t_______P______I__F_1
>   ...
>   3f	10d4af	___U_l__________T_____t_______P_________F_1
>   [first readahead: ra_size = 32, req_count = 15, async_size = 17]
> 
> When reading 64k data (same for 61-63k range, where last_index is page-aligned
> in filemap_get_pages()), 128k readahead is triggered via page_cache_sync_ra()
> and the PG_readahead flag is set on the next folio (the one containing 0x10 page).
> 
> When reading 60k data, 128k readahead is also triggered via page_cache_sync_ra().
> However, in this case the readahead flag is set on the 0xf page. Although the
> requested read size (req_count) is 60k, the actual read will be aligned to
> folio size (64k), which triggers the readahead flag and initiates asynchronous
> readahead via page_cache_async_ra(). This results in two readahead operations
> totaling 256k.
> 
> The root cause is that when the requested size is smaller than the actual read
> size (due to folio alignment), it triggers asynchronous readahead. By changing
> last_index alignment from page size to folio size, we ensure the requested size
> matches the actual read size, preventing the case where a single read operation
> triggers two readahead operations.
> 
> After applying the patch:
>   # echo 3 > /proc/sys/vm/drop_caches
>   # dd if=test of=/dev/null bs=60k count=1
>   # ./page-types -r -L -f  /mnt/xfs/test
>   foffset	offset	flags
>   0	136d4c	__RU_l_________H______t_________________F_1
>   1	136d4d	__RU_l__________T_____t_________________F_1
>   2	136d4e	__RU_l__________T_____t_________________F_1
>   3	136d4f	__RU_l__________T_____t_________________F_1
>   ...
>   c	136bb8	__RU_l_________H______t_________________F_1
>   d	136bb9	__RU_l__________T_____t_________________F_1
>   e	136bba	__RU_l__________T_____t_________________F_1   <-- first read
>   f	136bbb	__RU_l__________T_____t_________________F_1
>   10	13c2cc	___U_l_________H______t______________I__F_1   <-- readahead flag
>   11	13c2cd	___U_l__________T_____t______________I__F_1
>   12	13c2ce	___U_l__________T_____t______________I__F_1
>   13	13c2cf	___U_l__________T_____t______________I__F_1
>   ...
>   1c	1405d4	___U_l_________H______t_________________F_1
>   1d	1405d5	___U_l__________T_____t_________________F_1
>   1e	1405d6	___U_l__________T_____t_________________F_1
>   1f	1405d7	___U_l__________T_____t_________________F_1
>   [ra_size = 32, req_count = 16, async_size = 16]
> 
> The same phenomenon will occur when reading from 49k to 64k. Set the readahead
> flag to the next folio.
> 
> Because the minimum order of folio in address_space equals the block size (at
> least in xfs and bcachefs that already support bs > ps), having request_count
> aligned to block size will not cause overread.
> 
> Co-developed-by: Chi Zhiling <chizhiling@kylinos.cn>
> Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
> Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
> ---
>   include/linux/pagemap.h | 6 ++++++
>   mm/filemap.c            | 5 +++--
>   2 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index e63fbfbd5b0f..447bb264fd94 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -480,6 +480,12 @@ mapping_min_folio_nrpages(struct address_space *mapping)
>   	return 1UL << mapping_min_folio_order(mapping);
>   }
>   
> +static inline unsigned long
> +mapping_min_folio_nrbytes(struct address_space *mapping)
> +{
> +	return mapping_min_folio_nrpages(mapping) << PAGE_SHIFT;
> +}
> +
>   /**
>    * mapping_align_index() - Align index for this mapping.
>    * @mapping: The address_space.
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 765dc5ef6d5a..56a8656b6f86 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2584,8 +2584,9 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>   	unsigned int flags;
>   	int err = 0;
>   
> -	/* "last_index" is the index of the page beyond the end of the read */
> -	last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
> +	/* "last_index" is the index of the folio beyond the end of the read */
> +	last_index = round_up(iocb->ki_pos + count, mapping_min_folio_nrbytes(mapping));
> +	last_index >>= PAGE_SHIFT;
>   retry:
>   	if (fatal_signal_pending(current))
>   		return -EINTR;


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/filemap: Align last_index to folio size
  2025-07-11 16:08 ` David Hildenbrand
@ 2025-07-14  8:16   ` Ryan Roberts
  0 siblings, 0 replies; 11+ messages in thread
From: Ryan Roberts @ 2025-07-14  8:16 UTC (permalink / raw)
  To: David Hildenbrand, Youling Tang, Matthew Wilcox, Andrew Morton
  Cc: Jan Kara, linux-fsdevel, linux-mm, linux-kernel, chizhiling,
	Youling Tang, Chi Zhiling

On 11/07/2025 17:08, David Hildenbrand wrote:
> CCing Ryan, who recently fiddled with readahead.
> 
> 
> On 11.07.25 07:55, Youling Tang wrote:
>> From: Youling Tang <tangyouling@kylinos.cn>
>>
>> On XFS systems with pagesize=4K, blocksize=16K, and CONFIG_TRANSPARENT_HUGEPAGE
>> enabled, We observed the following readahead behaviors:
>>   # echo 3 > /proc/sys/vm/drop_caches
>>   # dd if=test of=/dev/null bs=64k count=1
>>   # ./tools/mm/page-types -r -L -f  /mnt/xfs/test
>>   foffset    offset    flags
>>   0    136d4c    __RU_l_________H______t_________________F_1
>>   1    136d4d    __RU_l__________T_____t_________________F_1
>>   2    136d4e    __RU_l__________T_____t_________________F_1
>>   3    136d4f    __RU_l__________T_____t_________________F_1
>>   ...
>>   c    136bb8    __RU_l_________H______t_________________F_1
>>   d    136bb9    __RU_l__________T_____t_________________F_1
>>   e    136bba    __RU_l__________T_____t_________________F_1
>>   f    136bbb    __RU_l__________T_____t_________________F_1   <-- first read
>>   10    13c2cc    ___U_l_________H______t______________I__F_1   <-- readahead
>> flag
>>   11    13c2cd    ___U_l__________T_____t______________I__F_1
>>   12    13c2ce    ___U_l__________T_____t______________I__F_1
>>   13    13c2cf    ___U_l__________T_____t______________I__F_1
>>   ...
>>   1c    1405d4    ___U_l_________H______t_________________F_1
>>   1d    1405d5    ___U_l__________T_____t_________________F_1
>>   1e    1405d6    ___U_l__________T_____t_________________F_1
>>   1f    1405d7    ___U_l__________T_____t_________________F_1
>>   [ra_size = 32, req_count = 16, async_size = 16]
>>
>>   # echo 3 > /proc/sys/vm/drop_caches
>>   # dd if=test of=/dev/null bs=60k count=1
>>   # ./page-types -r -L -f  /mnt/xfs/test
>>   foffset    offset    flags
>>   0    136048    __RU_l_________H______t_________________F_1
>>   ...
>>   c    110a40    __RU_l_________H______t_________________F_1
>>   d    110a41    __RU_l__________T_____t_________________F_1
>>   e    110a42    __RU_l__________T_____t_________________F_1   <-- first read
>>   f    110a43    __RU_l__________T_____t_________________F_1   <-- first
>> readahead flag
>>   10    13e7a8    ___U_l_________H______t_________________F_1
>>   ...
>>   20    137a00    ___U_l_________H______t_______P______I__F_1   <-- second
>> readahead flag (20 - 2f)
>>   21    137a01    ___U_l__________T_____t_______P______I__F_1
>>   ...
>>   3f    10d4af    ___U_l__________T_____t_______P_________F_1
>>   [first readahead: ra_size = 32, req_count = 15, async_size = 17]
>>
>> When reading 64k data (same for 61-63k range, where last_index is page-aligned
>> in filemap_get_pages()), 128k readahead is triggered via page_cache_sync_ra()
>> and the PG_readahead flag is set on the next folio (the one containing 0x10
>> page).
>>
>> When reading 60k data, 128k readahead is also triggered via page_cache_sync_ra().
>> However, in this case the readahead flag is set on the 0xf page. Although the
>> requested read size (req_count) is 60k, the actual read will be aligned to
>> folio size (64k), which triggers the readahead flag and initiates asynchronous
>> readahead via page_cache_async_ra(). This results in two readahead operations
>> totaling 256k.
>>
>> The root cause is that when the requested size is smaller than the actual read
>> size (due to folio alignment), it triggers asynchronous readahead. By changing
>> last_index alignment from page size to folio size, we ensure the requested size
>> matches the actual read size, preventing the case where a single read operation
>> triggers two readahead operations.

I recently fiddled with mmap readahead paths, doing similar-ish things. I
haven't looked at the non-mmap paths so don't consider myself expert here. But
what you are saying makes sense and superficially the solution looks good to me, so:

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

with one nit below...

>>
>> After applying the patch:
>>   # echo 3 > /proc/sys/vm/drop_caches
>>   # dd if=test of=/dev/null bs=60k count=1
>>   # ./page-types -r -L -f  /mnt/xfs/test
>>   foffset    offset    flags
>>   0    136d4c    __RU_l_________H______t_________________F_1
>>   1    136d4d    __RU_l__________T_____t_________________F_1
>>   2    136d4e    __RU_l__________T_____t_________________F_1
>>   3    136d4f    __RU_l__________T_____t_________________F_1
>>   ...
>>   c    136bb8    __RU_l_________H______t_________________F_1
>>   d    136bb9    __RU_l__________T_____t_________________F_1
>>   e    136bba    __RU_l__________T_____t_________________F_1   <-- first read
>>   f    136bbb    __RU_l__________T_____t_________________F_1
>>   10    13c2cc    ___U_l_________H______t______________I__F_1   <-- readahead
>> flag
>>   11    13c2cd    ___U_l__________T_____t______________I__F_1
>>   12    13c2ce    ___U_l__________T_____t______________I__F_1
>>   13    13c2cf    ___U_l__________T_____t______________I__F_1
>>   ...
>>   1c    1405d4    ___U_l_________H______t_________________F_1
>>   1d    1405d5    ___U_l__________T_____t_________________F_1
>>   1e    1405d6    ___U_l__________T_____t_________________F_1
>>   1f    1405d7    ___U_l__________T_____t_________________F_1
>>   [ra_size = 32, req_count = 16, async_size = 16]
>>
>> The same phenomenon will occur when reading from 49k to 64k. Set the readahead
>> flag to the next folio.
>>
>> Because the minimum order of folio in address_space equals the block size (at
>> least in xfs and bcachefs that already support bs > ps), having request_count
>> aligned to block size will not cause overread.
>>
>> Co-developed-by: Chi Zhiling <chizhiling@kylinos.cn>
>> Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
>> Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
>> ---
>>   include/linux/pagemap.h | 6 ++++++
>>   mm/filemap.c            | 5 +++--
>>   2 files changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
>> index e63fbfbd5b0f..447bb264fd94 100644
>> --- a/include/linux/pagemap.h
>> +++ b/include/linux/pagemap.h
>> @@ -480,6 +480,12 @@ mapping_min_folio_nrpages(struct address_space *mapping)
>>       return 1UL << mapping_min_folio_order(mapping);
>>   }
>>   +static inline unsigned long
>> +mapping_min_folio_nrbytes(struct address_space *mapping)
>> +{
>> +    return mapping_min_folio_nrpages(mapping) << PAGE_SHIFT;
>> +}
>> +
>>   /**
>>    * mapping_align_index() - Align index for this mapping.
>>    * @mapping: The address_space.
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 765dc5ef6d5a..56a8656b6f86 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -2584,8 +2584,9 @@ static int filemap_get_pages(struct kiocb *iocb, size_t
>> count,
>>       unsigned int flags;
>>       int err = 0;
>>   -    /* "last_index" is the index of the page beyond the end of the read */
>> -    last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
>> +    /* "last_index" is the index of the folio beyond the end of the read */

pedantic nit: I think you actually mean "the index of the first page within the
first minimum-sized folio beyond the end of the read"?

Thanks,
Ryan

>> +    last_index = round_up(iocb->ki_pos + count,
>> mapping_min_folio_nrbytes(mapping));
>> +    last_index >>= PAGE_SHIFT;
>>   retry:
>>       if (fatal_signal_pending(current))
>>           return -EINTR;
> 
> 



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/filemap: Align last_index to folio size
  2025-07-11  5:55 [PATCH] mm/filemap: Align last_index to folio size Youling Tang
  2025-07-11 16:08 ` David Hildenbrand
@ 2025-07-14  9:33 ` Jan Kara
  2025-08-12  9:08   ` Youling Tang
  2025-07-14 22:34 ` Klara Modin
  2 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2025-07-14  9:33 UTC (permalink / raw)
  To: Youling Tang
  Cc: Matthew Wilcox, Andrew Morton, Jan Kara, linux-fsdevel, linux-mm,
	linux-kernel, chizhiling, Youling Tang, Chi Zhiling

On Fri 11-07-25 13:55:09, Youling Tang wrote:
> From: Youling Tang <tangyouling@kylinos.cn>
> 
> On XFS systems with pagesize=4K, blocksize=16K, and CONFIG_TRANSPARENT_HUGEPAGE
> enabled, We observed the following readahead behaviors:
>  # echo 3 > /proc/sys/vm/drop_caches
>  # dd if=test of=/dev/null bs=64k count=1
>  # ./tools/mm/page-types -r -L -f  /mnt/xfs/test
>  foffset	offset	flags
>  0	136d4c	__RU_l_________H______t_________________F_1
>  1	136d4d	__RU_l__________T_____t_________________F_1
>  2	136d4e	__RU_l__________T_____t_________________F_1
>  3	136d4f	__RU_l__________T_____t_________________F_1
>  ...
>  c	136bb8	__RU_l_________H______t_________________F_1
>  d	136bb9	__RU_l__________T_____t_________________F_1
>  e	136bba	__RU_l__________T_____t_________________F_1
>  f	136bbb	__RU_l__________T_____t_________________F_1   <-- first read
>  10	13c2cc	___U_l_________H______t______________I__F_1   <-- readahead flag
>  11	13c2cd	___U_l__________T_____t______________I__F_1
>  12	13c2ce	___U_l__________T_____t______________I__F_1
>  13	13c2cf	___U_l__________T_____t______________I__F_1
>  ...
>  1c	1405d4	___U_l_________H______t_________________F_1
>  1d	1405d5	___U_l__________T_____t_________________F_1
>  1e	1405d6	___U_l__________T_____t_________________F_1
>  1f	1405d7	___U_l__________T_____t_________________F_1
>  [ra_size = 32, req_count = 16, async_size = 16]
> 
>  # echo 3 > /proc/sys/vm/drop_caches
>  # dd if=test of=/dev/null bs=60k count=1
>  # ./page-types -r -L -f  /mnt/xfs/test
>  foffset	offset	flags
>  0	136048	__RU_l_________H______t_________________F_1
>  ...
>  c	110a40	__RU_l_________H______t_________________F_1
>  d	110a41	__RU_l__________T_____t_________________F_1
>  e	110a42	__RU_l__________T_____t_________________F_1   <-- first read
>  f	110a43	__RU_l__________T_____t_________________F_1   <-- first readahead flag
>  10	13e7a8	___U_l_________H______t_________________F_1
>  ...
>  20	137a00	___U_l_________H______t_______P______I__F_1   <-- second readahead flag (20 - 2f)
>  21	137a01	___U_l__________T_____t_______P______I__F_1
>  ...
>  3f	10d4af	___U_l__________T_____t_______P_________F_1
>  [first readahead: ra_size = 32, req_count = 15, async_size = 17]
> 
> When reading 64k data (same for 61-63k range, where last_index is page-aligned
> in filemap_get_pages()), 128k readahead is triggered via page_cache_sync_ra()
> and the PG_readahead flag is set on the next folio (the one containing 0x10 page).
> 
> When reading 60k data, 128k readahead is also triggered via page_cache_sync_ra().
> However, in this case the readahead flag is set on the 0xf page. Although the
> requested read size (req_count) is 60k, the actual read will be aligned to
> folio size (64k), which triggers the readahead flag and initiates asynchronous
> readahead via page_cache_async_ra(). This results in two readahead operations
> totaling 256k.
> 
> The root cause is that when the requested size is smaller than the actual read
> size (due to folio alignment), it triggers asynchronous readahead. By changing
> last_index alignment from page size to folio size, we ensure the requested size
> matches the actual read size, preventing the case where a single read operation
> triggers two readahead operations.
> 
> After applying the patch:
>  # echo 3 > /proc/sys/vm/drop_caches
>  # dd if=test of=/dev/null bs=60k count=1
>  # ./page-types -r -L -f  /mnt/xfs/test
>  foffset	offset	flags
>  0	136d4c	__RU_l_________H______t_________________F_1
>  1	136d4d	__RU_l__________T_____t_________________F_1
>  2	136d4e	__RU_l__________T_____t_________________F_1
>  3	136d4f	__RU_l__________T_____t_________________F_1
>  ...
>  c	136bb8	__RU_l_________H______t_________________F_1
>  d	136bb9	__RU_l__________T_____t_________________F_1
>  e	136bba	__RU_l__________T_____t_________________F_1   <-- first read
>  f	136bbb	__RU_l__________T_____t_________________F_1
>  10	13c2cc	___U_l_________H______t______________I__F_1   <-- readahead flag
>  11	13c2cd	___U_l__________T_____t______________I__F_1
>  12	13c2ce	___U_l__________T_____t______________I__F_1
>  13	13c2cf	___U_l__________T_____t______________I__F_1
>  ...
>  1c	1405d4	___U_l_________H______t_________________F_1
>  1d	1405d5	___U_l__________T_____t_________________F_1
>  1e	1405d6	___U_l__________T_____t_________________F_1
>  1f	1405d7	___U_l__________T_____t_________________F_1
>  [ra_size = 32, req_count = 16, async_size = 16]
> 
> The same phenomenon will occur when reading from 49k to 64k. Set the readahead
> flag to the next folio.
> 
> Because the minimum order of folio in address_space equals the block size (at
> least in xfs and bcachefs that already support bs > ps), having request_count
> aligned to block size will not cause overread.
> 
> Co-developed-by: Chi Zhiling <chizhiling@kylinos.cn>
> Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
> Signed-off-by: Youling Tang <tangyouling@kylinos.cn>

I agree with analysis of the problem but not quite with the solution. See
below.

> diff --git a/mm/filemap.c b/mm/filemap.c
> index 765dc5ef6d5a..56a8656b6f86 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2584,8 +2584,9 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>  	unsigned int flags;
>  	int err = 0;
>  
> -	/* "last_index" is the index of the page beyond the end of the read */
> -	last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
> +	/* "last_index" is the index of the folio beyond the end of the read */
> +	last_index = round_up(iocb->ki_pos + count, mapping_min_folio_nrbytes(mapping));
> +	last_index >>= PAGE_SHIFT;

I think that filemap_get_pages() shouldn't be really trying to guess what
readahead code needs and round last_index based on min folio order. After
all the situation isn't special for LBS filesystems. It can also happen
that the readahead mark ends up in the middle of large folio for other
reasons. In fact, we already do have code in page_cache_ra_order() ->
ra_alloc_folio() that handles rounding of index where mark should be placed
so your changes essentially try to outsmart that code which is not good. I
think the solution should really be placed in page_cache_ra_order() +
ra_alloc_folio() instead.

In fact the problem you are trying to solve was kind of introduced (or at
least made more visible) by my commit ab4443fe3ca62 ("readahead: avoid
multiple marked readahead pages"). There I've changed the code to round the
index down because I've convinced myself it doesn't matter and rounding
down is easier to handle in that place. But your example shows there are
cases where rounding down has weird consequences and rounding up would have
been better. So I think we need to come up with a method how to round up
the index of marked folio to fix your case without reintroducing problems
mentioned in commit ab4443fe3ca62.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/filemap: Align last_index to folio size
  2025-07-14  9:33 ` Jan Kara
@ 2025-08-12  9:08   ` Youling Tang
  2025-09-03 22:50     ` Andrew Morton
  0 siblings, 1 reply; 11+ messages in thread
From: Youling Tang @ 2025-08-12  9:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthew Wilcox, Andrew Morton, linux-fsdevel, linux-mm,
	linux-kernel, chizhiling, Youling Tang, Chi Zhiling

Hi, Jan
On 2025/7/14 17:33, Jan Kara wrote:
> On Fri 11-07-25 13:55:09, Youling Tang wrote:
>> From: Youling Tang <tangyouling@kylinos.cn>
>>
>> On XFS systems with pagesize=4K, blocksize=16K, and CONFIG_TRANSPARENT_HUGEPAGE
>> enabled, We observed the following readahead behaviors:
>>   # echo 3 > /proc/sys/vm/drop_caches
>>   # dd if=test of=/dev/null bs=64k count=1
>>   # ./tools/mm/page-types -r -L -f  /mnt/xfs/test
>>   foffset	offset	flags
>>   0	136d4c	__RU_l_________H______t_________________F_1
>>   1	136d4d	__RU_l__________T_____t_________________F_1
>>   2	136d4e	__RU_l__________T_____t_________________F_1
>>   3	136d4f	__RU_l__________T_____t_________________F_1
>>   ...
>>   c	136bb8	__RU_l_________H______t_________________F_1
>>   d	136bb9	__RU_l__________T_____t_________________F_1
>>   e	136bba	__RU_l__________T_____t_________________F_1
>>   f	136bbb	__RU_l__________T_____t_________________F_1   <-- first read
>>   10	13c2cc	___U_l_________H______t______________I__F_1   <-- readahead flag
>>   11	13c2cd	___U_l__________T_____t______________I__F_1
>>   12	13c2ce	___U_l__________T_____t______________I__F_1
>>   13	13c2cf	___U_l__________T_____t______________I__F_1
>>   ...
>>   1c	1405d4	___U_l_________H______t_________________F_1
>>   1d	1405d5	___U_l__________T_____t_________________F_1
>>   1e	1405d6	___U_l__________T_____t_________________F_1
>>   1f	1405d7	___U_l__________T_____t_________________F_1
>>   [ra_size = 32, req_count = 16, async_size = 16]
>>
>>   # echo 3 > /proc/sys/vm/drop_caches
>>   # dd if=test of=/dev/null bs=60k count=1
>>   # ./page-types -r -L -f  /mnt/xfs/test
>>   foffset	offset	flags
>>   0	136048	__RU_l_________H______t_________________F_1
>>   ...
>>   c	110a40	__RU_l_________H______t_________________F_1
>>   d	110a41	__RU_l__________T_____t_________________F_1
>>   e	110a42	__RU_l__________T_____t_________________F_1   <-- first read
>>   f	110a43	__RU_l__________T_____t_________________F_1   <-- first readahead flag
>>   10	13e7a8	___U_l_________H______t_________________F_1
>>   ...
>>   20	137a00	___U_l_________H______t_______P______I__F_1   <-- second readahead flag (20 - 2f)
>>   21	137a01	___U_l__________T_____t_______P______I__F_1
>>   ...
>>   3f	10d4af	___U_l__________T_____t_______P_________F_1
>>   [first readahead: ra_size = 32, req_count = 15, async_size = 17]
>>
>> When reading 64k data (same for 61-63k range, where last_index is page-aligned
>> in filemap_get_pages()), 128k readahead is triggered via page_cache_sync_ra()
>> and the PG_readahead flag is set on the next folio (the one containing 0x10 page).
>>
>> When reading 60k data, 128k readahead is also triggered via page_cache_sync_ra().
>> However, in this case the readahead flag is set on the 0xf page. Although the
>> requested read size (req_count) is 60k, the actual read will be aligned to
>> folio size (64k), which triggers the readahead flag and initiates asynchronous
>> readahead via page_cache_async_ra(). This results in two readahead operations
>> totaling 256k.
>>
>> The root cause is that when the requested size is smaller than the actual read
>> size (due to folio alignment), it triggers asynchronous readahead. By changing
>> last_index alignment from page size to folio size, we ensure the requested size
>> matches the actual read size, preventing the case where a single read operation
>> triggers two readahead operations.
>>
>> After applying the patch:
>>   # echo 3 > /proc/sys/vm/drop_caches
>>   # dd if=test of=/dev/null bs=60k count=1
>>   # ./page-types -r -L -f  /mnt/xfs/test
>>   foffset	offset	flags
>>   0	136d4c	__RU_l_________H______t_________________F_1
>>   1	136d4d	__RU_l__________T_____t_________________F_1
>>   2	136d4e	__RU_l__________T_____t_________________F_1
>>   3	136d4f	__RU_l__________T_____t_________________F_1
>>   ...
>>   c	136bb8	__RU_l_________H______t_________________F_1
>>   d	136bb9	__RU_l__________T_____t_________________F_1
>>   e	136bba	__RU_l__________T_____t_________________F_1   <-- first read
>>   f	136bbb	__RU_l__________T_____t_________________F_1
>>   10	13c2cc	___U_l_________H______t______________I__F_1   <-- readahead flag
>>   11	13c2cd	___U_l__________T_____t______________I__F_1
>>   12	13c2ce	___U_l__________T_____t______________I__F_1
>>   13	13c2cf	___U_l__________T_____t______________I__F_1
>>   ...
>>   1c	1405d4	___U_l_________H______t_________________F_1
>>   1d	1405d5	___U_l__________T_____t_________________F_1
>>   1e	1405d6	___U_l__________T_____t_________________F_1
>>   1f	1405d7	___U_l__________T_____t_________________F_1
>>   [ra_size = 32, req_count = 16, async_size = 16]
>>
>> The same phenomenon will occur when reading from 49k to 64k. Set the readahead
>> flag to the next folio.
>>
>> Because the minimum order of folio in address_space equals the block size (at
>> least in xfs and bcachefs that already support bs > ps), having request_count
>> aligned to block size will not cause overread.
>>
>> Co-developed-by: Chi Zhiling <chizhiling@kylinos.cn>
>> Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
>> Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
> I agree with analysis of the problem but not quite with the solution. See
> below.
>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 765dc5ef6d5a..56a8656b6f86 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -2584,8 +2584,9 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>>   	unsigned int flags;
>>   	int err = 0;
>>   
>> -	/* "last_index" is the index of the page beyond the end of the read */
>> -	last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
>> +	/* "last_index" is the index of the folio beyond the end of the read */
>> +	last_index = round_up(iocb->ki_pos + count, mapping_min_folio_nrbytes(mapping));
>> +	last_index >>= PAGE_SHIFT;
> I think that filemap_get_pages() shouldn't be really trying to guess what
> readahead code needs and round last_index based on min folio order. After
> all the situation isn't special for LBS filesystems. It can also happen
> that the readahead mark ends up in the middle of large folio for other
> reasons. In fact, we already do have code in page_cache_ra_order() ->
> ra_alloc_folio() that handles rounding of index where mark should be placed
> so your changes essentially try to outsmart that code which is not good. I
> think the solution should really be placed in page_cache_ra_order() +
> ra_alloc_folio() instead.
>
> In fact the problem you are trying to solve was kind of introduced (or at
> least made more visible) by my commit ab4443fe3ca62 ("readahead: avoid
> multiple marked readahead pages"). There I've changed the code to round the
> index down because I've convinced myself it doesn't matter and rounding
> down is easier to handle in that place. But your example shows there are
> cases where rounding down has weird consequences and rounding up would have
> been better. So I think we need to come up with a method how to round up
> the index of marked folio to fix your case without reintroducing problems
> mentioned in commit ab4443fe3ca62.
Yes, I simply replaced round_up() in ra_alloc_folio() with round_down()
to avoid this phenomenon before submitting this patch.

But at present, I haven't found a suitable way to solve both of these 
problems
simultaneously. Do you have a better solution on your side?

Thanks,
Youling.
>
> 								Honza


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/filemap: Align last_index to folio size
  2025-08-12  9:08   ` Youling Tang
@ 2025-09-03 22:50     ` Andrew Morton
  2025-09-04  2:04       ` Youling Tang
  2025-09-05 12:50       ` Jan Kara
  0 siblings, 2 replies; 11+ messages in thread
From: Andrew Morton @ 2025-09-03 22:50 UTC (permalink / raw)
  To: Youling Tang
  Cc: Jan Kara, Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel,
	chizhiling, Youling Tang, Chi Zhiling

On Tue, 12 Aug 2025 17:08:53 +0800 Youling Tang <youling.tang@linux.dev> wrote:

> Hi, Jan
> On 2025/7/14 17:33, Jan Kara wrote:
> > On Fri 11-07-25 13:55:09, Youling Tang wrote:
> >> From: Youling Tang <tangyouling@kylinos.cn>
>
> ...
>
> >> --- a/mm/filemap.c
> >> +++ b/mm/filemap.c
> >> @@ -2584,8 +2584,9 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
> >>   	unsigned int flags;
> >>   	int err = 0;
> >>   
> >> -	/* "last_index" is the index of the page beyond the end of the read */
> >> -	last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
> >> +	/* "last_index" is the index of the folio beyond the end of the read */
> >> +	last_index = round_up(iocb->ki_pos + count, mapping_min_folio_nrbytes(mapping));
> >> +	last_index >>= PAGE_SHIFT;
> > I think that filemap_get_pages() shouldn't be really trying to guess what
> > readahead code needs and round last_index based on min folio order. After
> > all the situation isn't special for LBS filesystems. It can also happen
> > that the readahead mark ends up in the middle of large folio for other
> > reasons. In fact, we already do have code in page_cache_ra_order() ->
> > ra_alloc_folio() that handles rounding of index where mark should be placed
> > so your changes essentially try to outsmart that code which is not good. I
> > think the solution should really be placed in page_cache_ra_order() +
> > ra_alloc_folio() instead.
> >
> > In fact the problem you are trying to solve was kind of introduced (or at
> > least made more visible) by my commit ab4443fe3ca62 ("readahead: avoid
> > multiple marked readahead pages"). There I've changed the code to round the
> > index down because I've convinced myself it doesn't matter and rounding
> > down is easier to handle in that place. But your example shows there are
> > cases where rounding down has weird consequences and rounding up would have
> > been better. So I think we need to come up with a method how to round up
> > the index of marked folio to fix your case without reintroducing problems
> > mentioned in commit ab4443fe3ca62.
> Yes, I simply replaced round_up() in ra_alloc_folio() with round_down()
> to avoid this phenomenon before submitting this patch.
> 
> But at present, I haven't found a suitable way to solve both of these 
> problems
> simultaneously. Do you have a better solution on your side?
> 

fyi, this patch remains stuck in mm.git awaiting resolution.

Do we have any information regarding its importance?  Which means do we
have any measurement of its effect upon any real-world workload?

Thanks.



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/filemap: Align last_index to folio size
  2025-09-03 22:50     ` Andrew Morton
@ 2025-09-04  2:04       ` Youling Tang
  2025-09-05 12:50       ` Jan Kara
  1 sibling, 0 replies; 11+ messages in thread
From: Youling Tang @ 2025-09-04  2:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Matthew Wilcox, linux-fsdevel, linux-mm, linux-kernel,
	chizhiling, Youling Tang, Chi Zhiling

On 2025/9/4 06:50, Andrew Morton wrote:

> On Tue, 12 Aug 2025 17:08:53 +0800 Youling Tang <youling.tang@linux.dev> wrote:
>
>> Hi, Jan
>> On 2025/7/14 17:33, Jan Kara wrote:
>>> On Fri 11-07-25 13:55:09, Youling Tang wrote:
>>>> From: Youling Tang <tangyouling@kylinos.cn>
>> ...
>>
>>>> --- a/mm/filemap.c
>>>> +++ b/mm/filemap.c
>>>> @@ -2584,8 +2584,9 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>>>>    	unsigned int flags;
>>>>    	int err = 0;
>>>>    
>>>> -	/* "last_index" is the index of the page beyond the end of the read */
>>>> -	last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
>>>> +	/* "last_index" is the index of the folio beyond the end of the read */
>>>> +	last_index = round_up(iocb->ki_pos + count, mapping_min_folio_nrbytes(mapping));
>>>> +	last_index >>= PAGE_SHIFT;
>>> I think that filemap_get_pages() shouldn't be really trying to guess what
>>> readahead code needs and round last_index based on min folio order. After
>>> all the situation isn't special for LBS filesystems. It can also happen
>>> that the readahead mark ends up in the middle of large folio for other
>>> reasons. In fact, we already do have code in page_cache_ra_order() ->
>>> ra_alloc_folio() that handles rounding of index where mark should be placed
>>> so your changes essentially try to outsmart that code which is not good. I
>>> think the solution should really be placed in page_cache_ra_order() +
>>> ra_alloc_folio() instead.
>>>
>>> In fact the problem you are trying to solve was kind of introduced (or at
>>> least made more visible) by my commit ab4443fe3ca62 ("readahead: avoid
>>> multiple marked readahead pages"). There I've changed the code to round the
>>> index down because I've convinced myself it doesn't matter and rounding
>>> down is easier to handle in that place. But your example shows there are
>>> cases where rounding down has weird consequences and rounding up would have
>>> been better. So I think we need to come up with a method how to round up
>>> the index of marked folio to fix your case without reintroducing problems
>>> mentioned in commit ab4443fe3ca62.
>> Yes, I simply replaced round_up() in ra_alloc_folio() with round_down()
>> to avoid this phenomenon before submitting this patch.
>>
>> But at present, I haven't found a suitable way to solve both of these
>> problems
>> simultaneously. Do you have a better solution on your side?
>>
> fyi, this patch remains stuck in mm.git awaiting resolution.
>
> Do we have any information regarding its importance?  Which means do we
> have any measurement of its effect upon any real-world workload?
No measurements were taken regarding the impact on real-world
workloads, this was merely a line of thinking triggered by unusual
readahead behavior.

Thanks,
Youling.
>
> Thanks.
>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/filemap: Align last_index to folio size
  2025-09-03 22:50     ` Andrew Morton
  2025-09-04  2:04       ` Youling Tang
@ 2025-09-05 12:50       ` Jan Kara
  1 sibling, 0 replies; 11+ messages in thread
From: Jan Kara @ 2025-09-05 12:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Youling Tang, Jan Kara, Matthew Wilcox, linux-fsdevel, linux-mm,
	linux-kernel, chizhiling, Youling Tang, Chi Zhiling

On Wed 03-09-25 15:50:46, Andrew Morton wrote:
> On Tue, 12 Aug 2025 17:08:53 +0800 Youling Tang <youling.tang@linux.dev> wrote:
> 
> > Hi, Jan
> > On 2025/7/14 17:33, Jan Kara wrote:
> > > On Fri 11-07-25 13:55:09, Youling Tang wrote:
> > >> From: Youling Tang <tangyouling@kylinos.cn>
> >
> > ...
> >
> > >> --- a/mm/filemap.c
> > >> +++ b/mm/filemap.c
> > >> @@ -2584,8 +2584,9 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
> > >>   	unsigned int flags;
> > >>   	int err = 0;
> > >>   
> > >> -	/* "last_index" is the index of the page beyond the end of the read */
> > >> -	last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
> > >> +	/* "last_index" is the index of the folio beyond the end of the read */
> > >> +	last_index = round_up(iocb->ki_pos + count, mapping_min_folio_nrbytes(mapping));
> > >> +	last_index >>= PAGE_SHIFT;
> > > I think that filemap_get_pages() shouldn't be really trying to guess what
> > > readahead code needs and round last_index based on min folio order. After
> > > all the situation isn't special for LBS filesystems. It can also happen
> > > that the readahead mark ends up in the middle of large folio for other
> > > reasons. In fact, we already do have code in page_cache_ra_order() ->
> > > ra_alloc_folio() that handles rounding of index where mark should be placed
> > > so your changes essentially try to outsmart that code which is not good. I
> > > think the solution should really be placed in page_cache_ra_order() +
> > > ra_alloc_folio() instead.
> > >
> > > In fact the problem you are trying to solve was kind of introduced (or at
> > > least made more visible) by my commit ab4443fe3ca62 ("readahead: avoid
> > > multiple marked readahead pages"). There I've changed the code to round the
> > > index down because I've convinced myself it doesn't matter and rounding
> > > down is easier to handle in that place. But your example shows there are
> > > cases where rounding down has weird consequences and rounding up would have
> > > been better. So I think we need to come up with a method how to round up
> > > the index of marked folio to fix your case without reintroducing problems
> > > mentioned in commit ab4443fe3ca62.
> > Yes, I simply replaced round_up() in ra_alloc_folio() with round_down()
> > to avoid this phenomenon before submitting this patch.
> > 
> > But at present, I haven't found a suitable way to solve both of these 
> > problems
> > simultaneously. Do you have a better solution on your side?
> > 
> 
> fyi, this patch remains stuck in mm.git awaiting resolution.
> 
> Do we have any information regarding its importance?  Which means do we
> have any measurement of its effect upon any real-world workload?

OK, after experimenting with the code some more and with rounding the
readahead mark index up, I've found out we need something like Youling
proposed anyway because otherwise it could happen that the whole readahead
area fits into a single min-order folio and thus with rounding up we
wouldn't have a folio to place readahead mark on. So with that I'm
withdrawing my objections and feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

to Youling's patch.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/filemap: Align last_index to folio size
  2025-07-11  5:55 [PATCH] mm/filemap: Align last_index to folio size Youling Tang
  2025-07-11 16:08 ` David Hildenbrand
  2025-07-14  9:33 ` Jan Kara
@ 2025-07-14 22:34 ` Klara Modin
  2025-07-14 22:43   ` Andrew Morton
  2 siblings, 1 reply; 11+ messages in thread
From: Klara Modin @ 2025-07-14 22:34 UTC (permalink / raw)
  To: Youling Tang
  Cc: Matthew Wilcox, Andrew Morton, Jan Kara, linux-fsdevel, linux-mm,
	linux-kernel, chizhiling, Youling Tang, Chi Zhiling

Hi,

On 2025-07-11 13:55:09 +0800, Youling Tang wrote:
> From: Youling Tang <tangyouling@kylinos.cn>
> 
> On XFS systems with pagesize=4K, blocksize=16K, and CONFIG_TRANSPARENT_HUGEPAGE
> enabled, We observed the following readahead behaviors:
>  # echo 3 > /proc/sys/vm/drop_caches
>  # dd if=test of=/dev/null bs=64k count=1
>  # ./tools/mm/page-types -r -L -f  /mnt/xfs/test
>  foffset	offset	flags
>  0	136d4c	__RU_l_________H______t_________________F_1
>  1	136d4d	__RU_l__________T_____t_________________F_1
>  2	136d4e	__RU_l__________T_____t_________________F_1
>  3	136d4f	__RU_l__________T_____t_________________F_1
>  ...
>  c	136bb8	__RU_l_________H______t_________________F_1
>  d	136bb9	__RU_l__________T_____t_________________F_1
>  e	136bba	__RU_l__________T_____t_________________F_1
>  f	136bbb	__RU_l__________T_____t_________________F_1   <-- first read
>  10	13c2cc	___U_l_________H______t______________I__F_1   <-- readahead flag
>  11	13c2cd	___U_l__________T_____t______________I__F_1
>  12	13c2ce	___U_l__________T_____t______________I__F_1
>  13	13c2cf	___U_l__________T_____t______________I__F_1
>  ...
>  1c	1405d4	___U_l_________H______t_________________F_1
>  1d	1405d5	___U_l__________T_____t_________________F_1
>  1e	1405d6	___U_l__________T_____t_________________F_1
>  1f	1405d7	___U_l__________T_____t_________________F_1
>  [ra_size = 32, req_count = 16, async_size = 16]
> 
>  # echo 3 > /proc/sys/vm/drop_caches
>  # dd if=test of=/dev/null bs=60k count=1
>  # ./page-types -r -L -f  /mnt/xfs/test
>  foffset	offset	flags
>  0	136048	__RU_l_________H______t_________________F_1
>  ...
>  c	110a40	__RU_l_________H______t_________________F_1
>  d	110a41	__RU_l__________T_____t_________________F_1
>  e	110a42	__RU_l__________T_____t_________________F_1   <-- first read
>  f	110a43	__RU_l__________T_____t_________________F_1   <-- first readahead flag
>  10	13e7a8	___U_l_________H______t_________________F_1
>  ...
>  20	137a00	___U_l_________H______t_______P______I__F_1   <-- second readahead flag (20 - 2f)
>  21	137a01	___U_l__________T_____t_______P______I__F_1
>  ...
>  3f	10d4af	___U_l__________T_____t_______P_________F_1
>  [first readahead: ra_size = 32, req_count = 15, async_size = 17]
> 
> When reading 64k data (same for 61-63k range, where last_index is page-aligned
> in filemap_get_pages()), 128k readahead is triggered via page_cache_sync_ra()
> and the PG_readahead flag is set on the next folio (the one containing 0x10 page).
> 
> When reading 60k data, 128k readahead is also triggered via page_cache_sync_ra().
> However, in this case the readahead flag is set on the 0xf page. Although the
> requested read size (req_count) is 60k, the actual read will be aligned to
> folio size (64k), which triggers the readahead flag and initiates asynchronous
> readahead via page_cache_async_ra(). This results in two readahead operations
> totaling 256k.
> 
> The root cause is that when the requested size is smaller than the actual read
> size (due to folio alignment), it triggers asynchronous readahead. By changing
> last_index alignment from page size to folio size, we ensure the requested size
> matches the actual read size, preventing the case where a single read operation
> triggers two readahead operations.
> 
> After applying the patch:
>  # echo 3 > /proc/sys/vm/drop_caches
>  # dd if=test of=/dev/null bs=60k count=1
>  # ./page-types -r -L -f  /mnt/xfs/test
>  foffset	offset	flags
>  0	136d4c	__RU_l_________H______t_________________F_1
>  1	136d4d	__RU_l__________T_____t_________________F_1
>  2	136d4e	__RU_l__________T_____t_________________F_1
>  3	136d4f	__RU_l__________T_____t_________________F_1
>  ...
>  c	136bb8	__RU_l_________H______t_________________F_1
>  d	136bb9	__RU_l__________T_____t_________________F_1
>  e	136bba	__RU_l__________T_____t_________________F_1   <-- first read
>  f	136bbb	__RU_l__________T_____t_________________F_1
>  10	13c2cc	___U_l_________H______t______________I__F_1   <-- readahead flag
>  11	13c2cd	___U_l__________T_____t______________I__F_1
>  12	13c2ce	___U_l__________T_____t______________I__F_1
>  13	13c2cf	___U_l__________T_____t______________I__F_1
>  ...
>  1c	1405d4	___U_l_________H______t_________________F_1
>  1d	1405d5	___U_l__________T_____t_________________F_1
>  1e	1405d6	___U_l__________T_____t_________________F_1
>  1f	1405d7	___U_l__________T_____t_________________F_1
>  [ra_size = 32, req_count = 16, async_size = 16]
> 
> The same phenomenon will occur when reading from 49k to 64k. Set the readahead
> flag to the next folio.
> 
> Because the minimum order of folio in address_space equals the block size (at
> least in xfs and bcachefs that already support bs > ps), having request_count
> aligned to block size will not cause overread.
> 
> Co-developed-by: Chi Zhiling <chizhiling@kylinos.cn>
> Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
> Signed-off-by: Youling Tang <tangyouling@kylinos.cn>

I bisected boot failures on two of my 32-bit systems to this patch.

> ---
>  include/linux/pagemap.h | 6 ++++++
>  mm/filemap.c            | 5 +++--
>  2 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index e63fbfbd5b0f..447bb264fd94 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -480,6 +480,12 @@ mapping_min_folio_nrpages(struct address_space *mapping)
>  	return 1UL << mapping_min_folio_order(mapping);
>  }
>  
> +static inline unsigned long
> +mapping_min_folio_nrbytes(struct address_space *mapping)
> +{
> +	return mapping_min_folio_nrpages(mapping) << PAGE_SHIFT;
> +}
> +
>  /**
>   * mapping_align_index() - Align index for this mapping.
>   * @mapping: The address_space.
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 765dc5ef6d5a..56a8656b6f86 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2584,8 +2584,9 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>  	unsigned int flags;
>  	int err = 0;
>  
> -	/* "last_index" is the index of the page beyond the end of the read */
> -	last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
> +	/* "last_index" is the index of the folio beyond the end of the read */

> +	last_index = round_up(iocb->ki_pos + count, mapping_min_folio_nrbytes(mapping));

iocb->ki_pos is loff_t (long long) while pgoff_t is unsigned long and
this overflow seems to happen in practice, resulting in last_index being
before index.

The following diff resolves the issue for me:

diff --git a/mm/filemap.c b/mm/filemap.c
index 3c071307f40e..d2902be0b845 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2585,8 +2585,8 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 	int err = 0;
 
 	/* "last_index" is the index of the folio beyond the end of the read */
-	last_index = round_up(iocb->ki_pos + count, mapping_min_folio_nrbytes(mapping));
-	last_index >>= PAGE_SHIFT;
+	last_index = round_up(iocb->ki_pos + count,
+			mapping_min_folio_nrbytes(mapping)) >> PAGE_SHIFT;
 retry:
 	if (fatal_signal_pending(current))
 		return -EINTR;

Regards,
Klara Modin

> +	last_index >>= PAGE_SHIFT;
>  retry:
>  	if (fatal_signal_pending(current))
>  		return -EINTR;
> -- 
> 2.34.1
> 


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/filemap: Align last_index to folio size
  2025-07-14 22:34 ` Klara Modin
@ 2025-07-14 22:43   ` Andrew Morton
  2025-07-14 22:48     ` Klara Modin
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2025-07-14 22:43 UTC (permalink / raw)
  To: Klara Modin
  Cc: Youling Tang, Matthew Wilcox, Jan Kara, linux-fsdevel, linux-mm,
	linux-kernel, chizhiling, Youling Tang, Chi Zhiling

On Tue, 15 Jul 2025 00:34:12 +0200 Klara Modin <klarasmodin@gmail.com> wrote:

> iocb->ki_pos is loff_t (long long) while pgoff_t is unsigned long and
> this overflow seems to happen in practice, resulting in last_index being
> before index.
> 
> The following diff resolves the issue for me:
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 3c071307f40e..d2902be0b845 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2585,8 +2585,8 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>  	int err = 0;
>  
>  	/* "last_index" is the index of the folio beyond the end of the read */
> -	last_index = round_up(iocb->ki_pos + count, mapping_min_folio_nrbytes(mapping));
> -	last_index >>= PAGE_SHIFT;
> +	last_index = round_up(iocb->ki_pos + count,
> +			mapping_min_folio_nrbytes(mapping)) >> PAGE_SHIFT;
>  retry:
>  	if (fatal_signal_pending(current))
>  		return -EINTR;

Looks good, thanks.  I added your signed-off-by (which I trust is OK?)
and queued this up.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/filemap: Align last_index to folio size
  2025-07-14 22:43   ` Andrew Morton
@ 2025-07-14 22:48     ` Klara Modin
  0 siblings, 0 replies; 11+ messages in thread
From: Klara Modin @ 2025-07-14 22:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Youling Tang, Matthew Wilcox, Jan Kara, linux-fsdevel, linux-mm,
	linux-kernel, chizhiling, Youling Tang, Chi Zhiling

On 2025-07-14 15:43:55 -0700, Andrew Morton wrote:
> On Tue, 15 Jul 2025 00:34:12 +0200 Klara Modin <klarasmodin@gmail.com> wrote:
> 
> > iocb->ki_pos is loff_t (long long) while pgoff_t is unsigned long and
> > this overflow seems to happen in practice, resulting in last_index being
> > before index.
> > 
> > The following diff resolves the issue for me:
> > 
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 3c071307f40e..d2902be0b845 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2585,8 +2585,8 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
> >  	int err = 0;
> >  
> >  	/* "last_index" is the index of the folio beyond the end of the read */
> > -	last_index = round_up(iocb->ki_pos + count, mapping_min_folio_nrbytes(mapping));
> > -	last_index >>= PAGE_SHIFT;
> > +	last_index = round_up(iocb->ki_pos + count,
> > +			mapping_min_folio_nrbytes(mapping)) >> PAGE_SHIFT;
> >  retry:
> >  	if (fatal_signal_pending(current))
> >  		return -EINTR;
> 
> Looks good, thanks.  I added your signed-off-by (which I trust is OK?)
> and queued this up.

Thanks, that's fine:
Signed-off-by: Klara Modin <klarasmodin@gmail.com>


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2025-09-05 12:50 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-11  5:55 [PATCH] mm/filemap: Align last_index to folio size Youling Tang
2025-07-11 16:08 ` David Hildenbrand
2025-07-14  8:16   ` Ryan Roberts
2025-07-14  9:33 ` Jan Kara
2025-08-12  9:08   ` Youling Tang
2025-09-03 22:50     ` Andrew Morton
2025-09-04  2:04       ` Youling Tang
2025-09-05 12:50       ` Jan Kara
2025-07-14 22:34 ` Klara Modin
2025-07-14 22:43   ` Andrew Morton
2025-07-14 22:48     ` Klara Modin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).