[RFC PATCH] mm: shmem: fix the strategy for the tmpfs 'huge=' options

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH] mm: shmem: fix the strategy for the tmpfs 'huge=' options
@ 2025-07-30  8:14 Baolin Wang
  2025-07-30  9:30 ` David Hildenbrand
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Baolin Wang @ 2025-07-30  8:14 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, lorenzo.stoakes, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, baolin.wang, linux-mm,
	linux-kernel

After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
we have extended tmpfs to allow any sized large folios, rather than just
PMD-sized large folios.

The strategy discussed previously was:

"
Considering that tmpfs already has the 'huge=' option to control the
PMD-sized large folios allocation, we can extend the 'huge=' option to
allow any sized large folios.  The semantics of the 'huge=' mount option
are:

    huge=never: no any sized large folios
    huge=always: any sized large folios
    huge=within_size: like 'always' but respect the i_size
    huge=advise: like 'always' if requested with madvise()

Note: for tmpfs mmap() faults, due to the lack of a write size hint, still
allocate the PMD-sized huge folios if huge=always/within_size/advise is
set.

Moreover, the 'deny' and 'force' testing options controlled by
'/sys/kernel/mm/transparent_hugepage/shmem_enabled', still retain the same
semantics.  The 'deny' can disable any sized large folios for tmpfs, while
the 'force' can enable PMD sized large folios for tmpfs.
"

This means that when tmpfs is mounted with 'huge=always' or 'huge=within_size',
tmpfs will allow getting a highest order hint based on the size of write() and
fallocate() paths. It will then try each allowable large order, rather than
continually attempting to allocate PMD-sized large folios as before.

However, this might break some user scenarios for those who want to use
PMD-sized large folios, such as the i915 driver which did not supply a write
size hint when allocating shmem [1].

Moreover, Hugh also complained that this will cause a regression in userspace
with 'huge=always' or 'huge=within_size'.

So, let's revisit the strategy for tmpfs large page allocation. A simple fix
would be to always try PMD-sized large folios first, and if that fails, fall
back to smaller large folios. However, this approach differs from the strategy
for large folio allocation used by other file systems. Is this acceptable?

[1] https://lore.kernel.org/lkml/0d734549d5ed073c80b11601da3abdd5223e1889.1753689802.git.baolin.wang@linux.alibaba.com/
Fixes: acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
Note: this is just an RFC patch. I would like to hear others' opinions or
see if there is a better way to address Hugh's concern.
---
 Documentation/admin-guide/mm/transhuge.rst |  6 ++-
 mm/shmem.c                                 | 47 +++-------------------
 2 files changed, 10 insertions(+), 43 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 878796b4d7d3..121cbb3a72f7 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -383,12 +383,16 @@ option: ``huge=``. It can have following values:
 
 always
     Attempt to allocate huge pages every time we need a new page;
+    Always try PMD-sized huge pages first, and fall back to smaller-sized
+    huge pages if the PMD-sized huge page allocation fails;
 
 never
     Do not allocate huge pages;
 
 within_size
-    Only allocate huge page if it will be fully within i_size.
+    Only allocate huge page if it will be fully within i_size;
+    Always try PMD-sized huge pages first, and fall back to smaller-sized
+    huge pages if the PMD-sized huge page allocation fails;
     Also respect madvise() hints;
 
 advise
diff --git a/mm/shmem.c b/mm/shmem.c
index 75cc2cb92950..c1040a115f08 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -566,42 +566,6 @@ static int shmem_confirm_swap(struct address_space *mapping, pgoff_t index,
 static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
 static int tmpfs_huge __read_mostly = SHMEM_HUGE_NEVER;
 
-/**
- * shmem_mapping_size_orders - Get allowable folio orders for the given file size.
- * @mapping: Target address_space.
- * @index: The page index.
- * @write_end: end of a write, could extend inode size.
- *
- * This returns huge orders for folios (when supported) based on the file size
- * which the mapping currently allows at the given index. The index is relevant
- * due to alignment considerations the mapping might have. The returned order
- * may be less than the size passed.
- *
- * Return: The orders.
- */
-static inline unsigned int
-shmem_mapping_size_orders(struct address_space *mapping, pgoff_t index, loff_t write_end)
-{
-	unsigned int order;
-	size_t size;
-
-	if (!mapping_large_folio_support(mapping) || !write_end)
-		return 0;
-
-	/* Calculate the write size based on the write_end */
-	size = write_end - (index << PAGE_SHIFT);
-	order = filemap_get_order(size);
-	if (!order)
-		return 0;
-
-	/* If we're not aligned, allocate a smaller folio */
-	if (index & ((1UL << order) - 1))
-		order = __ffs(index);
-
-	order = min_t(size_t, order, MAX_PAGECACHE_ORDER);
-	return order > 0 ? BIT(order + 1) - 1 : 0;
-}
-
 static unsigned int shmem_get_orders_within_size(struct inode *inode,
 		unsigned long within_size_orders, pgoff_t index,
 		loff_t write_end)
@@ -648,22 +612,21 @@ static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index
 	 * For tmpfs mmap()'s huge order, we still use PMD-sized order to
 	 * allocate huge pages due to lack of a write size hint.
 	 *
-	 * Otherwise, tmpfs will allow getting a highest order hint based on
-	 * the size of write and fallocate paths, then will try each allowable
-	 * huge orders.
+	 * For tmpfs with 'huge=always' or 'huge=within_size' mount option,
+	 * we will always try PMD-sized order first. If that failed, it will
+	 * fall back to small large folios.
 	 */
 	switch (SHMEM_SB(inode->i_sb)->huge) {
 	case SHMEM_HUGE_ALWAYS:
 		if (vma)
 			return maybe_pmd_order;
 
-		return shmem_mapping_size_orders(inode->i_mapping, index, write_end);
+		return THP_ORDERS_ALL_FILE_DEFAULT;
 	case SHMEM_HUGE_WITHIN_SIZE:
 		if (vma)
 			within_size_orders = maybe_pmd_order;
 		else
-			within_size_orders = shmem_mapping_size_orders(inode->i_mapping,
-								       index, write_end);
+			within_size_orders = THP_ORDERS_ALL_FILE_DEFAULT;
 
 		within_size_orders = shmem_get_orders_within_size(inode, within_size_orders,
 								  index, write_end);
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] mm: shmem: fix the strategy for the tmpfs 'huge=' options
  2025-07-30  8:14 [RFC PATCH] mm: shmem: fix the strategy for the tmpfs 'huge=' options Baolin Wang
@ 2025-07-30  9:30 ` David Hildenbrand
  2025-07-30 15:23   ` Lorenzo Stoakes
  2025-07-31  2:41   ` Baolin Wang
  2025-07-30 15:17 ` Lorenzo Stoakes
  2025-08-12  8:35 ` Baolin Wang
  2 siblings, 2 replies; 9+ messages in thread
From: David Hildenbrand @ 2025-07-30  9:30 UTC (permalink / raw)
  To: Baolin Wang, akpm, hughd
  Cc: willy, lorenzo.stoakes, ziy, Liam.Howlett, npache, ryan.roberts,
	dev.jain, baohua, linux-mm, linux-kernel

On 30.07.25 10:14, Baolin Wang wrote:
> After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
> we have extended tmpfs to allow any sized large folios, rather than just
> PMD-sized large folios.
> 
> The strategy discussed previously was:
> 
> "
> Considering that tmpfs already has the 'huge=' option to control the
> PMD-sized large folios allocation, we can extend the 'huge=' option to
> allow any sized large folios.  The semantics of the 'huge=' mount option
> are:
> 
>      huge=never: no any sized large folios
>      huge=always: any sized large folios
>      huge=within_size: like 'always' but respect the i_size
>      huge=advise: like 'always' if requested with madvise()
> 
> Note: for tmpfs mmap() faults, due to the lack of a write size hint, still
> allocate the PMD-sized huge folios if huge=always/within_size/advise is
> set.
> 
> Moreover, the 'deny' and 'force' testing options controlled by
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', still retain the same
> semantics.  The 'deny' can disable any sized large folios for tmpfs, while
> the 'force' can enable PMD sized large folios for tmpfs.
> "
> 
> This means that when tmpfs is mounted with 'huge=always' or 'huge=within_size',
> tmpfs will allow getting a highest order hint based on the size of write() and
> fallocate() paths. It will then try each allowable large order, rather than
> continually attempting to allocate PMD-sized large folios as before.
> 
> However, this might break some user scenarios for those who want to use
> PMD-sized large folios, such as the i915 driver which did not supply a write
> size hint when allocating shmem [1].
> 
> Moreover, Hugh also complained that this will cause a regression in userspace
> with 'huge=always' or 'huge=within_size'.
> 
> So, let's revisit the strategy for tmpfs large page allocation. A simple fix
> would be to always try PMD-sized large folios first, and if that fails, fall
> back to smaller large folios. However, this approach differs from the strategy
> for large folio allocation used by other file systems. Is this acceptable?

My opinion so far has been that anon and shmem are different than 
ordinary FS'es ... primarily because 
allocation(readahead)+reclaim(writeback) behave differently.

There were opinions in the past that tmpfs should just behave like any 
other fs, and I think that's what we tried to satisfy here: use the 
write size as an indication.

I assume there will be workloads where either approach will be 
beneficial. I also assume that workloads that use ordinary fs'es could 
benefit from the same strategy (start with PMD), while others will 
clearly not.

So no real opinion, it all doesn't feel ideal ... at least with his 
approach here we would stick more to the old tmpfs behavior.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] mm: shmem: fix the strategy for the tmpfs 'huge=' options
  2025-07-30  9:30 ` David Hildenbrand
@ 2025-07-30 15:23   ` Lorenzo Stoakes
  2025-07-31  2:41   ` Baolin Wang
  1 sibling, 0 replies; 9+ messages in thread
From: Lorenzo Stoakes @ 2025-07-30 15:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Baolin Wang, akpm, hughd, willy, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, linux-mm, linux-kernel

On Wed, Jul 30, 2025 at 11:30:20AM +0200, David Hildenbrand wrote:
> There were opinions in the past that tmpfs should just behave like any other
> fs, and I think that's what we tried to satisfy here: use the write size as
> an indication.

Indeed, it feels like we have too much 'special snowflake' stuff for shmem
anyway. Each instance of those adds more ways to get bugs/unexpected
behaviour.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] mm: shmem: fix the strategy for the tmpfs 'huge=' options
  2025-07-30  9:30 ` David Hildenbrand
  2025-07-30 15:23   ` Lorenzo Stoakes
@ 2025-07-31  2:41   ` Baolin Wang
  1 sibling, 0 replies; 9+ messages in thread
From: Baolin Wang @ 2025-07-31  2:41 UTC (permalink / raw)
  To: David Hildenbrand, akpm, hughd
  Cc: willy, lorenzo.stoakes, ziy, Liam.Howlett, npache, ryan.roberts,
	dev.jain, baohua, linux-mm, linux-kernel



On 2025/7/30 17:30, David Hildenbrand wrote:
> On 30.07.25 10:14, Baolin Wang wrote:
>> After commit acd7ccb284b8 ("mm: shmem: add large folio support for 
>> tmpfs"),
>> we have extended tmpfs to allow any sized large folios, rather than just
>> PMD-sized large folios.
>>
>> The strategy discussed previously was:
>>
>> "
>> Considering that tmpfs already has the 'huge=' option to control the
>> PMD-sized large folios allocation, we can extend the 'huge=' option to
>> allow any sized large folios.  The semantics of the 'huge=' mount option
>> are:
>>
>>      huge=never: no any sized large folios
>>      huge=always: any sized large folios
>>      huge=within_size: like 'always' but respect the i_size
>>      huge=advise: like 'always' if requested with madvise()
>>
>> Note: for tmpfs mmap() faults, due to the lack of a write size hint, 
>> still
>> allocate the PMD-sized huge folios if huge=always/within_size/advise is
>> set.
>>
>> Moreover, the 'deny' and 'force' testing options controlled by
>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', still retain the 
>> same
>> semantics.  The 'deny' can disable any sized large folios for tmpfs, 
>> while
>> the 'force' can enable PMD sized large folios for tmpfs.
>> "
>>
>> This means that when tmpfs is mounted with 'huge=always' or 
>> 'huge=within_size',
>> tmpfs will allow getting a highest order hint based on the size of 
>> write() and
>> fallocate() paths. It will then try each allowable large order, rather 
>> than
>> continually attempting to allocate PMD-sized large folios as before.
>>
>> However, this might break some user scenarios for those who want to use
>> PMD-sized large folios, such as the i915 driver which did not supply a 
>> write
>> size hint when allocating shmem [1].
>>
>> Moreover, Hugh also complained that this will cause a regression in 
>> userspace
>> with 'huge=always' or 'huge=within_size'.
>>
>> So, let's revisit the strategy for tmpfs large page allocation. A 
>> simple fix
>> would be to always try PMD-sized large folios first, and if that 
>> fails, fall
>> back to smaller large folios. However, this approach differs from the 
>> strategy
>> for large folio allocation used by other file systems. Is this 
>> acceptable?
> 
> My opinion so far has been that anon and shmem are different than 
> ordinary FS'es ... primarily because 
> allocation(readahead)+reclaim(writeback) behave differently.
> 
> There were opinions in the past that tmpfs should just behave like any 
> other fs, and I think that's what we tried to satisfy here: use the 
> write size as an indication.
> 
> I assume there will be workloads where either approach will be 
> beneficial. I also assume that workloads that use ordinary fs'es could 
> benefit from the same strategy (start with PMD), while others will 
> clearly not.

Yes, using the write size as an indication to allocate large folios is 
certainly reasonable in some scenarios, as it avoids memory bloat while 
leveraging the advantages of large folios.

Personally, I prefer to use this method by default for allocating tmpfs 
large folios, but we also need to consider how to avoid regression if 
the 'huge=always/within_size' mount option is set.

> So no real opinion, it all doesn't feel ideal ... at least with his 
> approach here we would stick more to the old tmpfs behavior.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] mm: shmem: fix the strategy for the tmpfs 'huge=' options
  2025-07-30  8:14 [RFC PATCH] mm: shmem: fix the strategy for the tmpfs 'huge=' options Baolin Wang
  2025-07-30  9:30 ` David Hildenbrand
@ 2025-07-30 15:17 ` Lorenzo Stoakes
  2025-07-31  2:57   ` Baolin Wang
  2025-08-12  8:35 ` Baolin Wang
  2 siblings, 1 reply; 9+ messages in thread
From: Lorenzo Stoakes @ 2025-07-30 15:17 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, hughd, willy, david, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, linux-mm, linux-kernel

On Wed, Jul 30, 2025 at 04:14:55PM +0800, Baolin Wang wrote:
> After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
> we have extended tmpfs to allow any sized large folios, rather than just
> PMD-sized large folios.
>
> The strategy discussed previously was:
>
> "
> Considering that tmpfs already has the 'huge=' option to control the
> PMD-sized large folios allocation, we can extend the 'huge=' option to
> allow any sized large folios.  The semantics of the 'huge=' mount option
> are:
>
>     huge=never: no any sized large folios
>     huge=always: any sized large folios
>     huge=within_size: like 'always' but respect the i_size
>     huge=advise: like 'always' if requested with madvise()

Sort of hate we have a million different ways of setting behaviour for THP
and they all differ in subtle ways.

Also this is similar to sysfs settings but with slightly different
semantics... <insert appropriate meme here>.

>
> Note: for tmpfs mmap() faults, due to the lack of a write size hint, still
> allocate the PMD-sized huge folios if huge=always/within_size/advise is
> set.
>
> Moreover, the 'deny' and 'force' testing options controlled by
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', still retain the same
> semantics.  The 'deny' can disable any sized large folios for tmpfs, while
> the 'force' can enable PMD sized large folios for tmpfs.

And what about MADV_COLLAPSE?

> "
>
> This means that when tmpfs is mounted with 'huge=always' or 'huge=within_size',
> tmpfs will allow getting a highest order hint based on the size of write() and
> fallocate() paths. It will then try each allowable large order, rather than
> continually attempting to allocate PMD-sized large folios as before.
>
> However, this might break some user scenarios for those who want to use
> PMD-sized large folios, such as the i915 driver which did not supply a write
> size hint when allocating shmem [1].

Hmm, this is unclear to me, surely because it doesn't provide a write size
hint it's not this behaviour that breaks anything, but rather the fact that
we base things on the write hint at all?

>
> Moreover, Hugh also complained that this will cause a regression in userspace
> with 'huge=always' or 'huge=within_size'.

Will cause? Is this not already the case?

And what is the regression precisely? That i915 doesn't get huge pages
because it doesn't provide a hint?

>
> So, let's revisit the strategy for tmpfs large page allocation. A simple fix
> would be to always try PMD-sized large folios first, and if that fails, fall
> back to smaller large folios. However, this approach differs from the strategy
> for large folio allocation used by other file systems. Is this acceptable?

Doesn't this imply a waste of memory?

I mean if the 'implicit' semantics now are 'always ...but respecting a
write size hint' (which kind of sucks), is changing this ok?

Maybe somebody relies on that?

It seems (unless I'm missing something here) that in THP we've both made
never not mean never, and always not mean always.

>
> [1] https://lore.kernel.org/lkml/0d734549d5ed073c80b11601da3abdd5223e1889.1753689802.git.baolin.wang@linux.alibaba.com/
> Fixes: acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs")
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> Note: this is just an RFC patch. I would like to hear others' opinions or
> see if there is a better way to address Hugh's concern.
> ---
>  Documentation/admin-guide/mm/transhuge.rst |  6 ++-
>  mm/shmem.c                                 | 47 +++-------------------
>  2 files changed, 10 insertions(+), 43 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 878796b4d7d3..121cbb3a72f7 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -383,12 +383,16 @@ option: ``huge=``. It can have following values:
>
>  always
>      Attempt to allocate huge pages every time we need a new page;
> +    Always try PMD-sized huge pages first, and fall back to smaller-sized
> +    huge pages if the PMD-sized huge page allocation fails;
>
>  never
>      Do not allocate huge pages;
>
>  within_size
> -    Only allocate huge page if it will be fully within i_size.
> +    Only allocate huge page if it will be fully within i_size;
> +    Always try PMD-sized huge pages first, and fall back to smaller-sized
> +    huge pages if the PMD-sized huge page allocation fails;
>      Also respect madvise() hints;
>
>  advise
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 75cc2cb92950..c1040a115f08 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -566,42 +566,6 @@ static int shmem_confirm_swap(struct address_space *mapping, pgoff_t index,
>  static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
>  static int tmpfs_huge __read_mostly = SHMEM_HUGE_NEVER;
>
> -/**
> - * shmem_mapping_size_orders - Get allowable folio orders for the given file size.
> - * @mapping: Target address_space.
> - * @index: The page index.
> - * @write_end: end of a write, could extend inode size.
> - *
> - * This returns huge orders for folios (when supported) based on the file size
> - * which the mapping currently allows at the given index. The index is relevant
> - * due to alignment considerations the mapping might have. The returned order
> - * may be less than the size passed.
> - *
> - * Return: The orders.
> - */
> -static inline unsigned int
> -shmem_mapping_size_orders(struct address_space *mapping, pgoff_t index, loff_t write_end)
> -{
> -	unsigned int order;
> -	size_t size;
> -
> -	if (!mapping_large_folio_support(mapping) || !write_end)
> -		return 0;
> -
> -	/* Calculate the write size based on the write_end */
> -	size = write_end - (index << PAGE_SHIFT);
> -	order = filemap_get_order(size);
> -	if (!order)
> -		return 0;
> -
> -	/* If we're not aligned, allocate a smaller folio */
> -	if (index & ((1UL << order) - 1))
> -		order = __ffs(index);

We need to care about alignment still no?

> -
> -	order = min_t(size_t, order, MAX_PAGECACHE_ORDER);
> -	return order > 0 ? BIT(order + 1) - 1 : 0;
> -}
> -
>  static unsigned int shmem_get_orders_within_size(struct inode *inode,
>  		unsigned long within_size_orders, pgoff_t index,
>  		loff_t write_end)
> @@ -648,22 +612,21 @@ static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index
>  	 * For tmpfs mmap()'s huge order, we still use PMD-sized order to
>  	 * allocate huge pages due to lack of a write size hint.
>  	 *
> -	 * Otherwise, tmpfs will allow getting a highest order hint based on
> -	 * the size of write and fallocate paths, then will try each allowable
> -	 * huge orders.
> +	 * For tmpfs with 'huge=always' or 'huge=within_size' mount option,
> +	 * we will always try PMD-sized order first. If that failed, it will
> +	 * fall back to small large folios.
>  	 */
>  	switch (SHMEM_SB(inode->i_sb)->huge) {
>  	case SHMEM_HUGE_ALWAYS:
>  		if (vma)
>  			return maybe_pmd_order;
>
> -		return shmem_mapping_size_orders(inode->i_mapping, index, write_end);
> +		return THP_ORDERS_ALL_FILE_DEFAULT;
>  	case SHMEM_HUGE_WITHIN_SIZE:
>  		if (vma)
>  			within_size_orders = maybe_pmd_order;
>  		else
> -			within_size_orders = shmem_mapping_size_orders(inode->i_mapping,
> -								       index, write_end);
> +			within_size_orders = THP_ORDERS_ALL_FILE_DEFAULT;
>
>  		within_size_orders = shmem_get_orders_within_size(inode, within_size_orders,
>  								  index, write_end);
> --
> 2.43.5
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] mm: shmem: fix the strategy for the tmpfs 'huge=' options
  2025-07-30 15:17 ` Lorenzo Stoakes
@ 2025-07-31  2:57   ` Baolin Wang
  0 siblings, 0 replies; 9+ messages in thread
From: Baolin Wang @ 2025-07-31  2:57 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, hughd, willy, david, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, linux-mm, linux-kernel



On 2025/7/30 23:17, Lorenzo Stoakes wrote:
> On Wed, Jul 30, 2025 at 04:14:55PM +0800, Baolin Wang wrote:
>> After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
>> we have extended tmpfs to allow any sized large folios, rather than just
>> PMD-sized large folios.
>>
>> The strategy discussed previously was:
>>
>> "
>> Considering that tmpfs already has the 'huge=' option to control the
>> PMD-sized large folios allocation, we can extend the 'huge=' option to
>> allow any sized large folios.  The semantics of the 'huge=' mount option
>> are:
>>
>>      huge=never: no any sized large folios
>>      huge=always: any sized large folios
>>      huge=within_size: like 'always' but respect the i_size
>>      huge=advise: like 'always' if requested with madvise()
> 
> Sort of hate we have a million different ways of setting behaviour for THP
> and they all differ in subtle ways.
> 
> Also this is similar to sysfs settings but with slightly different
> semantics... <insert appropriate meme here>.
> 
>>
>> Note: for tmpfs mmap() faults, due to the lack of a write size hint, still
>> allocate the PMD-sized huge folios if huge=always/within_size/advise is
>> set.
>>
>> Moreover, the 'deny' and 'force' testing options controlled by
>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', still retain the same
>> semantics.  The 'deny' can disable any sized large folios for tmpfs, while
>> the 'force' can enable PMD sized large folios for tmpfs.
> 
> And what about MADV_COLLAPSE?

As Hugh mentioned beore, the 'deny' option will prohibit MADV_COLLAPSE 
for shmem, while 'force' option will allow it.

>> This means that when tmpfs is mounted with 'huge=always' or 'huge=within_size',
>> tmpfs will allow getting a highest order hint based on the size of write() and
>> fallocate() paths. It will then try each allowable large order, rather than
>> continually attempting to allocate PMD-sized large folios as before.
>>
>> However, this might break some user scenarios for those who want to use
>> PMD-sized large folios, such as the i915 driver which did not supply a write
>> size hint when allocating shmem [1].
> 
> Hmm, this is unclear to me, surely because it doesn't provide a write size
> hint it's not this behaviour that breaks anything, but rather the fact that
> we base things on the write hint at all?

Yes, we changed the allocation strategy for shmem large folios, but 
forgot to update the shmem allocation method for the i915 driver.

>> Moreover, Hugh also complained that this will cause a regression in userspace
>> with 'huge=always' or 'huge=within_size'.
> 
> Will cause? Is this not already the case?
> 
> And what is the regression precisely? That i915 doesn't get huge pages
> because it doesn't provide a hint?

Yes, see above.

>> So, let's revisit the strategy for tmpfs large page allocation. A simple fix
>> would be to always try PMD-sized large folios first, and if that fails, fall
>> back to smaller large folios. However, this approach differs from the strategy
>> for large folio allocation used by other file systems. Is this acceptable?
> 
> Doesn't this imply a waste of memory?

Right. As I replied to David, using the write size as an indication to 
allocate large folios is certainly reasonable in some scenarios, as it 
avoids memory bloat while leveraging the advantages of large folios.

However, there may be scenarios where PMD-sized large folios are always 
expected, such as the i915 driver. It's uncertain whether user-space 
tmpfs mounts with 'huge=' options have such scenarios, but we do have 
this concern.

> I mean if the 'implicit' semantics now are 'always ...but respecting a
> write size hint' (which kind of sucks), is changing this ok?
> 
> Maybe somebody relies on that?
> 
> It seems (unless I'm missing something here) that in THP we've both made
> never not mean never, and always not mean always.
> 
>>
>> [1] https://lore.kernel.org/lkml/0d734549d5ed073c80b11601da3abdd5223e1889.1753689802.git.baolin.wang@linux.alibaba.com/
>> Fixes: acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs")
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> Note: this is just an RFC patch. I would like to hear others' opinions or
>> see if there is a better way to address Hugh's concern.
>> ---
>>   Documentation/admin-guide/mm/transhuge.rst |  6 ++-
>>   mm/shmem.c                                 | 47 +++-------------------
>>   2 files changed, 10 insertions(+), 43 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>> index 878796b4d7d3..121cbb3a72f7 100644
>> --- a/Documentation/admin-guide/mm/transhuge.rst
>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>> @@ -383,12 +383,16 @@ option: ``huge=``. It can have following values:
>>
>>   always
>>       Attempt to allocate huge pages every time we need a new page;
>> +    Always try PMD-sized huge pages first, and fall back to smaller-sized
>> +    huge pages if the PMD-sized huge page allocation fails;
>>
>>   never
>>       Do not allocate huge pages;
>>
>>   within_size
>> -    Only allocate huge page if it will be fully within i_size.
>> +    Only allocate huge page if it will be fully within i_size;
>> +    Always try PMD-sized huge pages first, and fall back to smaller-sized
>> +    huge pages if the PMD-sized huge page allocation fails;
>>       Also respect madvise() hints;
>>
>>   advise
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 75cc2cb92950..c1040a115f08 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -566,42 +566,6 @@ static int shmem_confirm_swap(struct address_space *mapping, pgoff_t index,
>>   static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
>>   static int tmpfs_huge __read_mostly = SHMEM_HUGE_NEVER;
>>
>> -/**
>> - * shmem_mapping_size_orders - Get allowable folio orders for the given file size.
>> - * @mapping: Target address_space.
>> - * @index: The page index.
>> - * @write_end: end of a write, could extend inode size.
>> - *
>> - * This returns huge orders for folios (when supported) based on the file size
>> - * which the mapping currently allows at the given index. The index is relevant
>> - * due to alignment considerations the mapping might have. The returned order
>> - * may be less than the size passed.
>> - *
>> - * Return: The orders.
>> - */
>> -static inline unsigned int
>> -shmem_mapping_size_orders(struct address_space *mapping, pgoff_t index, loff_t write_end)
>> -{
>> -	unsigned int order;
>> -	size_t size;
>> -
>> -	if (!mapping_large_folio_support(mapping) || !write_end)
>> -		return 0;
>> -
>> -	/* Calculate the write size based on the write_end */
>> -	size = write_end - (index << PAGE_SHIFT);
>> -	order = filemap_get_order(size);
>> -	if (!order)
>> -		return 0;
>> -
>> -	/* If we're not aligned, allocate a smaller folio */
>> -	if (index & ((1UL << order) - 1))
>> -		order = __ffs(index);
> 
> We need to care about alignment still no?

We‘ve already done alignment during shmem allocation.

>> -
>> -	order = min_t(size_t, order, MAX_PAGECACHE_ORDER);
>> -	return order > 0 ? BIT(order + 1) - 1 : 0;
>> -}
>> -
>>   static unsigned int shmem_get_orders_within_size(struct inode *inode,
>>   		unsigned long within_size_orders, pgoff_t index,
>>   		loff_t write_end)
>> @@ -648,22 +612,21 @@ static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index
>>   	 * For tmpfs mmap()'s huge order, we still use PMD-sized order to
>>   	 * allocate huge pages due to lack of a write size hint.
>>   	 *
>> -	 * Otherwise, tmpfs will allow getting a highest order hint based on
>> -	 * the size of write and fallocate paths, then will try each allowable
>> -	 * huge orders.
>> +	 * For tmpfs with 'huge=always' or 'huge=within_size' mount option,
>> +	 * we will always try PMD-sized order first. If that failed, it will
>> +	 * fall back to small large folios.
>>   	 */
>>   	switch (SHMEM_SB(inode->i_sb)->huge) {
>>   	case SHMEM_HUGE_ALWAYS:
>>   		if (vma)
>>   			return maybe_pmd_order;
>>
>> -		return shmem_mapping_size_orders(inode->i_mapping, index, write_end);
>> +		return THP_ORDERS_ALL_FILE_DEFAULT;
>>   	case SHMEM_HUGE_WITHIN_SIZE:
>>   		if (vma)
>>   			within_size_orders = maybe_pmd_order;
>>   		else
>> -			within_size_orders = shmem_mapping_size_orders(inode->i_mapping,
>> -								       index, write_end);
>> +			within_size_orders = THP_ORDERS_ALL_FILE_DEFAULT;
>>
>>   		within_size_orders = shmem_get_orders_within_size(inode, within_size_orders,
>>   								  index, write_end);
>> --
>> 2.43.5
>>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] mm: shmem: fix the strategy for the tmpfs 'huge=' options
  2025-07-30  8:14 [RFC PATCH] mm: shmem: fix the strategy for the tmpfs 'huge=' options Baolin Wang
  2025-07-30  9:30 ` David Hildenbrand
  2025-07-30 15:17 ` Lorenzo Stoakes
@ 2025-08-12  8:35 ` Baolin Wang
  2025-08-13  6:59   ` Hugh Dickins
  2 siblings, 1 reply; 9+ messages in thread
From: Baolin Wang @ 2025-08-12  8:35 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, lorenzo.stoakes, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, linux-mm, linux-kernel



On 2025/7/30 16:14, Baolin Wang wrote:
> After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
> we have extended tmpfs to allow any sized large folios, rather than just
> PMD-sized large folios.
> 
> The strategy discussed previously was:
> 
> "
> Considering that tmpfs already has the 'huge=' option to control the
> PMD-sized large folios allocation, we can extend the 'huge=' option to
> allow any sized large folios.  The semantics of the 'huge=' mount option
> are:
> 
>      huge=never: no any sized large folios
>      huge=always: any sized large folios
>      huge=within_size: like 'always' but respect the i_size
>      huge=advise: like 'always' if requested with madvise()
> 
> Note: for tmpfs mmap() faults, due to the lack of a write size hint, still
> allocate the PMD-sized huge folios if huge=always/within_size/advise is
> set.
> 
> Moreover, the 'deny' and 'force' testing options controlled by
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', still retain the same
> semantics.  The 'deny' can disable any sized large folios for tmpfs, while
> the 'force' can enable PMD sized large folios for tmpfs.
> "
> 
> This means that when tmpfs is mounted with 'huge=always' or 'huge=within_size',
> tmpfs will allow getting a highest order hint based on the size of write() and
> fallocate() paths. It will then try each allowable large order, rather than
> continually attempting to allocate PMD-sized large folios as before.
> 
> However, this might break some user scenarios for those who want to use
> PMD-sized large folios, such as the i915 driver which did not supply a write
> size hint when allocating shmem [1].
> 
> Moreover, Hugh also complained that this will cause a regression in userspace
> with 'huge=always' or 'huge=within_size'.
> 
> So, let's revisit the strategy for tmpfs large page allocation. A simple fix
> would be to always try PMD-sized large folios first, and if that fails, fall
> back to smaller large folios. However, this approach differs from the strategy
> for large folio allocation used by other file systems. Is this acceptable?
> 
> [1] https://lore.kernel.org/lkml/0d734549d5ed073c80b11601da3abdd5223e1889.1753689802.git.baolin.wang@linux.alibaba.com/
> Fixes: acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs")
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> Note: this is just an RFC patch. I would like to hear others' opinions or
> see if there is a better way to address Hugh's concern.
> ---

Hi Hugh,

If we use this approach to fix the PMD large folio regression, should we 
also change tmpfs mmap() to allow allocating any sized large folios, but 
always try to allocate PMD-sized large folios first? What do you think? 
Thanks.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] mm: shmem: fix the strategy for the tmpfs 'huge=' options
  2025-08-12  8:35 ` Baolin Wang
@ 2025-08-13  6:59   ` Hugh Dickins
  2025-08-14 10:03     ` Baolin Wang
  0 siblings, 1 reply; 9+ messages in thread
From: Hugh Dickins @ 2025-08-13  6:59 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, hughd, willy, david, lorenzo.stoakes, ziy, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, linux-mm, linux-kernel

On Tue, 12 Aug 2025, Baolin Wang wrote:
> On 2025/7/30 16:14, Baolin Wang wrote:
> > After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
> > we have extended tmpfs to allow any sized large folios, rather than just
> > PMD-sized large folios.
> > 
> > The strategy discussed previously was:
> > 
> > "
> > Considering that tmpfs already has the 'huge=' option to control the
> > PMD-sized large folios allocation, we can extend the 'huge=' option to
> > allow any sized large folios.  The semantics of the 'huge=' mount option
> > are:
> > 
> >      huge=never: no any sized large folios
> >      huge=always: any sized large folios
> >      huge=within_size: like 'always' but respect the i_size
> >      huge=advise: like 'always' if requested with madvise()
> > 
> > Note: for tmpfs mmap() faults, due to the lack of a write size hint, still
> > allocate the PMD-sized huge folios if huge=always/within_size/advise is
> > set.
> > 
> > Moreover, the 'deny' and 'force' testing options controlled by
> > '/sys/kernel/mm/transparent_hugepage/shmem_enabled', still retain the same
> > semantics.  The 'deny' can disable any sized large folios for tmpfs, while
> > the 'force' can enable PMD sized large folios for tmpfs.
> > "
> > 
> > This means that when tmpfs is mounted with 'huge=always' or
> > 'huge=within_size',
> > tmpfs will allow getting a highest order hint based on the size of write()
> > and
> > fallocate() paths. It will then try each allowable large order, rather than
> > continually attempting to allocate PMD-sized large folios as before.
> > 
> > However, this might break some user scenarios for those who want to use
> > PMD-sized large folios, such as the i915 driver which did not supply a write
> > size hint when allocating shmem [1].
> > 
> > Moreover, Hugh also complained that this will cause a regression in
> > userspace
> > with 'huge=always' or 'huge=within_size'.
> > 
> > So, let's revisit the strategy for tmpfs large page allocation. A simple fix
> > would be to always try PMD-sized large folios first, and if that fails, fall
> > back to smaller large folios. However, this approach differs from the
> > strategy
> > for large folio allocation used by other file systems. Is this acceptable?
> > 
> > [1]
> > https://lore.kernel.org/lkml/0d734549d5ed073c80b11601da3abdd5223e1889.1753689802.git.baolin.wang@linux.alibaba.com/
> > Fixes: acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs")
> > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > ---
> > Note: this is just an RFC patch. I would like to hear others' opinions or
> > see if there is a better way to address Hugh's concern.

Sorry, I am still evaluating this RFC patch.

Certainly I observe it taking us in the right direction, giving PMD-sized
pages on tmpfs huge=always, as 6.13 and earlier releases did - thank you.

But the explosion of combinations which mTHP and FS large folios bring,
the amount that needs checking, is close to defeating me; and I've had
to spend a lot of the time re-educating myself on the background -
not looking to see whether this particular patch is right or not.
Still working on it.

> > ---
> 
> Hi Hugh,
> 
> If we use this approach to fix the PMD large folio regression, should we also
> change tmpfs mmap() to allow allocating any sized large folios, but always try
> to allocate PMD-sized large folios first? What do you think? Thanks.

Probably: I would like the mmap allocations to follow the same rules.

But finding it a bit odd how the current implementation limits tmpfs
large folios to when huge=notnever (is that a fair statement?), whereas
other filesystems are now being freely given large folios - using
different GFP flags from what MM uses (closest to defrag=always I think),
and with no limitation - whereas MM folks are off devising ever newer
ways to restrict access to huge pages.

And (conversely) I am unhappy with the way write and fallocate (and split
and collapse? in flight I think) are following the FS approach of allowing
every fractal, when mTHP/shmem_enabled is (or can be) more limiting.  I
think it less surprising (and more efficient when fragmented) for shmem
FS operations to be restricted to the same subset as "shared anon".

Hugh

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH] mm: shmem: fix the strategy for the tmpfs 'huge=' options
  2025-08-13  6:59   ` Hugh Dickins
@ 2025-08-14 10:03     ` Baolin Wang
  0 siblings, 0 replies; 9+ messages in thread
From: Baolin Wang @ 2025-08-14 10:03 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: akpm, willy, david, lorenzo.stoakes, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, linux-mm, linux-kernel



On 2025/8/13 14:59, Hugh Dickins wrote:
> On Tue, 12 Aug 2025, Baolin Wang wrote:
>> On 2025/7/30 16:14, Baolin Wang wrote:
>>> After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
>>> we have extended tmpfs to allow any sized large folios, rather than just
>>> PMD-sized large folios.
>>>
>>> The strategy discussed previously was:
>>>
>>> "
>>> Considering that tmpfs already has the 'huge=' option to control the
>>> PMD-sized large folios allocation, we can extend the 'huge=' option to
>>> allow any sized large folios.  The semantics of the 'huge=' mount option
>>> are:
>>>
>>>       huge=never: no any sized large folios
>>>       huge=always: any sized large folios
>>>       huge=within_size: like 'always' but respect the i_size
>>>       huge=advise: like 'always' if requested with madvise()
>>>
>>> Note: for tmpfs mmap() faults, due to the lack of a write size hint, still
>>> allocate the PMD-sized huge folios if huge=always/within_size/advise is
>>> set.
>>>
>>> Moreover, the 'deny' and 'force' testing options controlled by
>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', still retain the same
>>> semantics.  The 'deny' can disable any sized large folios for tmpfs, while
>>> the 'force' can enable PMD sized large folios for tmpfs.
>>> "
>>>
>>> This means that when tmpfs is mounted with 'huge=always' or
>>> 'huge=within_size',
>>> tmpfs will allow getting a highest order hint based on the size of write()
>>> and
>>> fallocate() paths. It will then try each allowable large order, rather than
>>> continually attempting to allocate PMD-sized large folios as before.
>>>
>>> However, this might break some user scenarios for those who want to use
>>> PMD-sized large folios, such as the i915 driver which did not supply a write
>>> size hint when allocating shmem [1].
>>>
>>> Moreover, Hugh also complained that this will cause a regression in
>>> userspace
>>> with 'huge=always' or 'huge=within_size'.
>>>
>>> So, let's revisit the strategy for tmpfs large page allocation. A simple fix
>>> would be to always try PMD-sized large folios first, and if that fails, fall
>>> back to smaller large folios. However, this approach differs from the
>>> strategy
>>> for large folio allocation used by other file systems. Is this acceptable?
>>>
>>> [1]
>>> https://lore.kernel.org/lkml/0d734549d5ed073c80b11601da3abdd5223e1889.1753689802.git.baolin.wang@linux.alibaba.com/
>>> Fixes: acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs")
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> ---
>>> Note: this is just an RFC patch. I would like to hear others' opinions or
>>> see if there is a better way to address Hugh's concern.
> 
> Sorry, I am still evaluating this RFC patch.
> 
> Certainly I observe it taking us in the right direction, giving PMD-sized
> pages on tmpfs huge=always, as 6.13 and earlier releases did - thank you.
> 
> But the explosion of combinations which mTHP and FS large folios bring,
> the amount that needs checking, is close to defeating me; and I've had
> to spend a lot of the time re-educating myself on the background -
> not looking to see whether this particular patch is right or not.
> Still working on it.

OK. Thanks.

>> If we use this approach to fix the PMD large folio regression, should we also
>> change tmpfs mmap() to allow allocating any sized large folios, but always try
>> to allocate PMD-sized large folios first? What do you think? Thanks.
> 
> Probably: I would like the mmap allocations to follow the same rules.
> 
> But finding it a bit odd how the current implementation limits tmpfs
> large folios to when huge=notnever (is that a fair statement?), whereas

Yes, this is mainly to ensure backward compatibility with the 'huge=' 
options. Moreover, in the future, we could set the default value of 
‘tmpfs_huge’ to ‘always’ (controlled via the cmdline: 
transparent_hugepage_tmpfs=) to allow all tmpfs mounts to use large 
folios by default.

> other filesystems are now being freely given large folios - using
> different GFP flags from what MM uses (closest to defrag=always I think),
> and with no limitation - whereas MM folks are off devising ever newer
> ways to restrict access to huge pages.
> 
> And (conversely) I am unhappy with the way write and fallocate (and split
> and collapse? in flight I think) are following the FS approach of allowing
> every fractal, when mTHP/shmem_enabled is (or can be) more limiting.  I
> think it less surprising (and more efficient when fragmented) for shmem
> FS operations to be restricted to the same subset as "shared anon".

Understood. We discussed this before, but it didn’t get support :(

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-08-14 10:03 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-30  8:14 [RFC PATCH] mm: shmem: fix the strategy for the tmpfs 'huge=' options Baolin Wang
2025-07-30  9:30 ` David Hildenbrand
2025-07-30 15:23   ` Lorenzo Stoakes
2025-07-31  2:41   ` Baolin Wang
2025-07-30 15:17 ` Lorenzo Stoakes
2025-07-31  2:57   ` Baolin Wang
2025-08-12  8:35 ` Baolin Wang
2025-08-13  6:59   ` Hugh Dickins
2025-08-14 10:03     ` Baolin Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).