* [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs
@ 2025-07-04 3:19 Baolin Wang
2025-07-04 9:38 ` David Hildenbrand
` (5 more replies)
0 siblings, 6 replies; 17+ messages in thread
From: Baolin Wang @ 2025-07-04 3:19 UTC (permalink / raw)
To: akpm, hughd, david
Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, vbabka, rppt, surenb, mhocko, baolin.wang,
linux-mm, linux-kernel
After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
tmpfs can also support large folio allocation (not just PMD-sized large
folios).
However, when accessing tmpfs via mmap(), although tmpfs supports large folios,
we still establish mappings at the base page granularity, which is unreasonable.
We can map multiple consecutive pages of a tmpfs folio at once, according to
the size of the large folio. On one hand, this can reduce the overhead of page
faults; on the other hand, it can leverage hardware architecture optimizations
to reduce TLB misses, such as contiguous PTEs on the ARM architecture.
Moreover, a tmpfs mount uses the 'huge=' option to control large folio
allocation explicitly, so an increase in the process's RSS statistics is to be
expected, and I think it will not have any obvious effect on users.
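For reference, large folio allocation on tmpfs is opted into at mount time via
the 'huge=' option mentioned above. A minimal sketch (the mount point and size
are illustrative, not taken from the thread):

```shell
# Mount a 1G tmpfs with large folio allocation enabled.
# 'huge=' accepts never / always / within_size / advise.
mkdir -p /mnt/tmpfs-huge
mount -t tmpfs -o size=1G,huge=always tmpfs /mnt/tmpfs-huge

# Confirm the mount options took effect
mount | grep tmpfs-huge
```

Requires root; this is system configuration, not part of the patch itself.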
Performance test:
I created a 1G tmpfs file, populated with 64K large folios, and write-accessed it
sequentially via mmap(). I observed a significant performance improvement:
Before the patch:
real 0m0.158s
user 0m0.008s
sys 0m0.150s
After the patch:
real 0m0.021s
user 0m0.004s
sys 0m0.017s
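The test described above is roughly the following sketch. The helper name, file
path, and sizes are mine for illustration; the author's actual test program is
not shown in the thread:

```python
import mmap
import os
import tempfile
import time

PAGE = 4096  # base page size, assumed 4K as on most x86/arm64 configs


def sequential_write_bench(path, size):
    """Create a file of `size` bytes and write one byte per base page,
    sequentially, through a shared writable mmap.

    Before the patch, each touched base page takes its own fault; after it,
    on tmpfs backed by large folios, one fault can map a whole folio
    (e.g. 16 pages for a 64K folio)."""
    with open(path, "wb") as f:
        f.truncate(size)
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_WRITE) as m:
            start = time.monotonic()
            for off in range(0, size, PAGE):
                m[off] = 1
            return time.monotonic() - start


if __name__ == "__main__":
    # Point the file at a huge=always tmpfs mount to reproduce the numbers;
    # a generic temp file is used here only so the sketch runs anywhere.
    with tempfile.NamedTemporaryFile() as tmp:
        elapsed = sequential_write_bench(tmp.name, 1 << 20)
        print(f"sequential write took {elapsed:.4f}s")
```

The reported numbers correspond to running such a loop over a 1G file under
`time(1)`, once on a pre-patch and once on a post-patch kernel.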
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
Changes from v1:
- Drop the unnecessary IS_ALIGNED() check, per David.
- Update the commit message, per David.
---
mm/memory.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 0f9b32a20e5b..9944380e947d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5383,10 +5383,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
/*
* Using per-page fault to maintain the uffd semantics, and same
- * approach also applies to non-anonymous-shmem faults to avoid
+ * approach also applies to non shmem/tmpfs faults to avoid
* inflating the RSS of the process.
*/
- if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
+ if (!vma_is_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
unlikely(needs_fallback)) {
nr_pages = 1;
} else if (nr_pages > 1) {
--
2.43.5
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs
2025-07-04 3:19 [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs Baolin Wang
@ 2025-07-04 9:38 ` David Hildenbrand
2025-07-04 22:18 ` Andrew Morton
` (4 subsequent siblings)
5 siblings, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2025-07-04 9:38 UTC (permalink / raw)
To: Baolin Wang, akpm, hughd
Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, vbabka, rppt, surenb, mhocko, linux-mm,
linux-kernel
On 04.07.25 05:19, Baolin Wang wrote:
> After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
> tmpfs can also support large folio allocation (not just PMD-sized large
> folios).
>
> However, when accessing tmpfs via mmap(), although tmpfs supports large folios,
> we still establish mappings at the base page granularity, which is unreasonable.
>
> We can map multiple consecutive pages of a tmpfs folios at once according to
> the size of the large folio. On one hand, this can reduce the overhead of page
> faults; on the other hand, it can leverage hardware architecture optimizations
> to reduce TLB misses, such as contiguous PTEs on the ARM architecture.
>
> Moreover, tmpfs mount will use the 'huge=' option to control large folio
> allocation explicitly. So it can be understood that the process's RSS statistics
> might increase, and I think this will not cause any obvious effects for users.
>
> Performance test:
> I created a 1G tmpfs file, populated with 64K large folios, and write-accessed it
> sequentially via mmap(). I observed a significant performance improvement:
>
> Before the patch:
> real 0m0.158s
> user 0m0.008s
> sys 0m0.150s
>
> After the patch:
> real 0m0.021s
> user 0m0.004s
> sys 0m0.017s
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> Changes from v1:
> - Drop the unnecessary IS_ALIGNED() check, per David.
> - Update the commit message, per David.
> ---
> mm/memory.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 0f9b32a20e5b..9944380e947d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5383,10 +5383,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>
> /*
> * Using per-page fault to maintain the uffd semantics, and same
> - * approach also applies to non-anonymous-shmem faults to avoid
> + * approach also applies to non shmem/tmpfs faults to avoid
> * inflating the RSS of the process.
> */
> - if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
> + if (!vma_is_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
> unlikely(needs_fallback)) {
> nr_pages = 1;
> } else if (nr_pages > 1) {
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
* Re: [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs
2025-07-04 3:19 [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs Baolin Wang
2025-07-04 9:38 ` David Hildenbrand
@ 2025-07-04 22:18 ` Andrew Morton
2025-07-06 2:02 ` Baolin Wang
2025-07-04 22:31 ` Zi Yan
` (3 subsequent siblings)
5 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2025-07-04 22:18 UTC (permalink / raw)
To: Baolin Wang
Cc: hughd, david, ziy, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, vbabka, rppt, surenb, mhocko,
linux-mm, linux-kernel
On Fri, 4 Jul 2025 11:19:26 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
> tmpfs can also support large folio allocation (not just PMD-sized large
> folios).
>
> However, when accessing tmpfs via mmap(), although tmpfs supports large folios,
> we still establish mappings at the base page granularity, which is unreasonable.
>
> We can map multiple consecutive pages of a tmpfs folios at once according to
> the size of the large folio. On one hand, this can reduce the overhead of page
> faults; on the other hand, it can leverage hardware architecture optimizations
> to reduce TLB misses, such as contiguous PTEs on the ARM architecture.
>
> Moreover, tmpfs mount will use the 'huge=' option to control large folio
> allocation explicitly. So it can be understood that the process's RSS statistics
> might increase, and I think this will not cause any obvious effects for users.
>
> Performance test:
> I created a 1G tmpfs file, populated with 64K large folios, and write-accessed it
> sequentially via mmap(). I observed a significant performance improvement:
That doesn't sound like a crazy thing to do.
> Before the patch:
> real 0m0.158s
> user 0m0.008s
> sys 0m0.150s
>
> After the patch:
> real 0m0.021s
> user 0m0.004s
> sys 0m0.017s
And look at that.
> diff --git a/mm/memory.c b/mm/memory.c
> index 0f9b32a20e5b..9944380e947d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5383,10 +5383,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>
> /*
> * Using per-page fault to maintain the uffd semantics, and same
> - * approach also applies to non-anonymous-shmem faults to avoid
> + * approach also applies to non shmem/tmpfs faults to avoid
> * inflating the RSS of the process.
> */
> - if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
> + if (!vma_is_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
> unlikely(needs_fallback)) {
> nr_pages = 1;
> } else if (nr_pages > 1) {
and that's it?
I'm itching to get this into -stable, really. What LTS user wouldn't
want this? Could it be viewed as correcting an oversight in
acd7ccb284b8?
* Re: [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs
2025-07-04 3:19 [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs Baolin Wang
2025-07-04 9:38 ` David Hildenbrand
2025-07-04 22:18 ` Andrew Morton
@ 2025-07-04 22:31 ` Zi Yan
2025-07-07 13:31 ` Lorenzo Stoakes
` (2 subsequent siblings)
5 siblings, 0 replies; 17+ messages in thread
From: Zi Yan @ 2025-07-04 22:31 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, hughd, david, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, vbabka, rppt, surenb, mhocko,
linux-mm, linux-kernel
On 3 Jul 2025, at 23:19, Baolin Wang wrote:
> After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
> tmpfs can also support large folio allocation (not just PMD-sized large
> folios).
>
> However, when accessing tmpfs via mmap(), although tmpfs supports large folios,
> we still establish mappings at the base page granularity, which is unreasonable.
>
> We can map multiple consecutive pages of a tmpfs folios at once according to
> the size of the large folio. On one hand, this can reduce the overhead of page
> faults; on the other hand, it can leverage hardware architecture optimizations
> to reduce TLB misses, such as contiguous PTEs on the ARM architecture.
>
> Moreover, tmpfs mount will use the 'huge=' option to control large folio
> allocation explicitly. So it can be understood that the process's RSS statistics
> might increase, and I think this will not cause any obvious effects for users.
>
> Performance test:
> I created a 1G tmpfs file, populated with 64K large folios, and write-accessed it
> sequentially via mmap(). I observed a significant performance improvement:
>
> Before the patch:
> real 0m0.158s
> user 0m0.008s
> sys 0m0.150s
>
> After the patch:
> real 0m0.021s
> user 0m0.004s
> sys 0m0.017s
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> Changes from v1:
> - Drop the unnecessary IS_ALIGNED() check, per David.
> - Update the commit message, per David.
> ---
> mm/memory.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
LGTM. Acked-by: Zi Yan <ziy@nvidia.com>
--
Best Regards,
Yan, Zi
* Re: [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs
2025-07-04 22:18 ` Andrew Morton
@ 2025-07-06 2:02 ` Baolin Wang
2025-07-07 13:33 ` Lorenzo Stoakes
0 siblings, 1 reply; 17+ messages in thread
From: Baolin Wang @ 2025-07-06 2:02 UTC (permalink / raw)
To: Andrew Morton
Cc: hughd, david, ziy, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, vbabka, rppt, surenb, mhocko,
linux-mm, linux-kernel
On 2025/7/5 06:18, Andrew Morton wrote:
> On Fri, 4 Jul 2025 11:19:26 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>
>> After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
>> tmpfs can also support large folio allocation (not just PMD-sized large
>> folios).
>>
>> However, when accessing tmpfs via mmap(), although tmpfs supports large folios,
>> we still establish mappings at the base page granularity, which is unreasonable.
>>
>> We can map multiple consecutive pages of a tmpfs folios at once according to
>> the size of the large folio. On one hand, this can reduce the overhead of page
>> faults; on the other hand, it can leverage hardware architecture optimizations
>> to reduce TLB misses, such as contiguous PTEs on the ARM architecture.
>>
>> Moreover, tmpfs mount will use the 'huge=' option to control large folio
>> allocation explicitly. So it can be understood that the process's RSS statistics
>> might increase, and I think this will not cause any obvious effects for users.
>>
>> Performance test:
>> I created a 1G tmpfs file, populated with 64K large folios, and write-accessed it
>> sequentially via mmap(). I observed a significant performance improvement:
>
> That doesn't sound like a crazy thing to do.
>
>> Before the patch:
>> real 0m0.158s
>> user 0m0.008s
>> sys 0m0.150s
>>
>> After the patch:
>> real 0m0.021s
>> user 0m0.004s
>> sys 0m0.017s
>
> And look at that.
>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 0f9b32a20e5b..9944380e947d 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -5383,10 +5383,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>
>> /*
>> * Using per-page fault to maintain the uffd semantics, and same
>> - * approach also applies to non-anonymous-shmem faults to avoid
>> + * approach also applies to non shmem/tmpfs faults to avoid
>> * inflating the RSS of the process.
>> */
>> - if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
>> + if (!vma_is_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
>> unlikely(needs_fallback)) {
>> nr_pages = 1;
>> } else if (nr_pages > 1) {
>
> and that's it?
>
> I'm itching to get this into -stable, really. What LTS user wouldn't
> want this?
This is an improvement rather than a bugfix, so I don't think it needs
to go into LTS.
> Could it be viewed as correcting an oversight in
> acd7ccb284b8?
Yes, I should have added this optimization in the series that introduced
commit acd7ccb284b8. But obviously, I missed it :(.
* Re: [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs
2025-07-04 3:19 [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs Baolin Wang
` (2 preceding siblings ...)
2025-07-04 22:31 ` Zi Yan
@ 2025-07-07 13:31 ` Lorenzo Stoakes
2025-07-07 15:47 ` Barry Song
2025-07-07 16:18 ` Vishal Moola (Oracle)
5 siblings, 0 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-07-07 13:31 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, hughd, david, ziy, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, vbabka, rppt, surenb, mhocko, linux-mm,
linux-kernel
On Fri, Jul 04, 2025 at 11:19:26AM +0800, Baolin Wang wrote:
> After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
> tmpfs can also support large folio allocation (not just PMD-sized large
> folios).
>
> However, when accessing tmpfs via mmap(), although tmpfs supports large folios,
> we still establish mappings at the base page granularity, which is unreasonable.
>
> We can map multiple consecutive pages of a tmpfs folios at once according to
> the size of the large folio. On one hand, this can reduce the overhead of page
> faults; on the other hand, it can leverage hardware architecture optimizations
> to reduce TLB misses, such as contiguous PTEs on the ARM architecture.
>
> Moreover, tmpfs mount will use the 'huge=' option to control large folio
> allocation explicitly. So it can be understood that the process's RSS statistics
> might increase, and I think this will not cause any obvious effects for users.
>
> Performance test:
> I created a 1G tmpfs file, populated with 64K large folios, and write-accessed it
> sequentially via mmap(). I observed a significant performance improvement:
>
> Before the patch:
> real 0m0.158s
> user 0m0.008s
> sys 0m0.150s
>
> After the patch:
> real 0m0.021s
> user 0m0.004s
> sys 0m0.017s
Wow!
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Seems reasonable, if we explicitly support larger folios in tmpfs now as
well as anon shmem (what a concept...)
So,
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> Changes from v1:
> - Drop the unnecessary IS_ALIGNED() check, per David.
> - Update the commit message, per David.
> ---
> mm/memory.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 0f9b32a20e5b..9944380e947d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5383,10 +5383,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>
> /*
> * Using per-page fault to maintain the uffd semantics, and same
> - * approach also applies to non-anonymous-shmem faults to avoid
> + * approach also applies to non shmem/tmpfs faults to avoid
> * inflating the RSS of the process.
> */
> - if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
> + if (!vma_is_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
> unlikely(needs_fallback)) {
> nr_pages = 1;
> } else if (nr_pages > 1) {
> --
> 2.43.5
>
* Re: [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs
2025-07-06 2:02 ` Baolin Wang
@ 2025-07-07 13:33 ` Lorenzo Stoakes
2025-07-08 7:53 ` Baolin Wang
0 siblings, 1 reply; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-07-07 13:33 UTC (permalink / raw)
To: Baolin Wang
Cc: Andrew Morton, hughd, david, ziy, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, vbabka, rppt, surenb, mhocko,
linux-mm, linux-kernel
On Sun, Jul 06, 2025 at 10:02:35AM +0800, Baolin Wang wrote:
>
>
> On 2025/7/5 06:18, Andrew Morton wrote:
> > On Fri, 4 Jul 2025 11:19:26 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> >
> > > After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
> > > tmpfs can also support large folio allocation (not just PMD-sized large
> > > folios).
> > >
> > > However, when accessing tmpfs via mmap(), although tmpfs supports large folios,
> > > we still establish mappings at the base page granularity, which is unreasonable.
> > >
> > > We can map multiple consecutive pages of a tmpfs folios at once according to
> > > the size of the large folio. On one hand, this can reduce the overhead of page
> > > faults; on the other hand, it can leverage hardware architecture optimizations
> > > to reduce TLB misses, such as contiguous PTEs on the ARM architecture.
> > >
> > > Moreover, tmpfs mount will use the 'huge=' option to control large folio
> > > allocation explicitly. So it can be understood that the process's RSS statistics
> > > might increase, and I think this will not cause any obvious effects for users.
> > >
> > > Performance test:
> > > I created a 1G tmpfs file, populated with 64K large folios, and write-accessed it
> > > sequentially via mmap(). I observed a significant performance improvement:
> >
> > That doesn't sound like a crazy thing to do.
> >
> > > Before the patch:
> > > real 0m0.158s
> > > user 0m0.008s
> > > sys 0m0.150s
> > >
> > > After the patch:
> > > real 0m0.021s
> > > user 0m0.004s
> > > sys 0m0.017s
> >
> > And look at that.
> >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 0f9b32a20e5b..9944380e947d 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -5383,10 +5383,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> > > /*
> > > * Using per-page fault to maintain the uffd semantics, and same
> > > - * approach also applies to non-anonymous-shmem faults to avoid
> > > + * approach also applies to non shmem/tmpfs faults to avoid
> > > * inflating the RSS of the process.
> > > */
> > > - if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
> > > + if (!vma_is_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
> > > unlikely(needs_fallback)) {
> > > nr_pages = 1;
> > > } else if (nr_pages > 1) {
> >
> > and that's it?
> >
> > I'm itching to get this into -stable, really. What LTS user wouldn't
> > want this?
>
> This is an improvement rather than a bugfix, so I don't think it needs to go
> into LTS.
>
> Could it be viewed as correcting an oversight in
> > acd7ccb284b8?
>
> Yes, I should have added this optimization in the series of the commit
> acd7ccb284b8. But obviously, I missed this :(.
Buuut if this was an oversight for that patch that causes an unnecessary
perf degradation, surely this should have a Fixes tag + cc stable, no?
Seems correct to backport to me.
* Re: [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs
2025-07-04 3:19 [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs Baolin Wang
` (3 preceding siblings ...)
2025-07-07 13:31 ` Lorenzo Stoakes
@ 2025-07-07 15:47 ` Barry Song
2025-07-07 16:18 ` Vishal Moola (Oracle)
5 siblings, 0 replies; 17+ messages in thread
From: Barry Song @ 2025-07-07 15:47 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, hughd, david, ziy, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, vbabka, rppt, surenb, mhocko, linux-mm,
linux-kernel
On Fri, Jul 4, 2025 at 11:19 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
> tmpfs can also support large folio allocation (not just PMD-sized large
> folios).
>
> However, when accessing tmpfs via mmap(), although tmpfs supports large folios,
> we still establish mappings at the base page granularity, which is unreasonable.
>
> We can map multiple consecutive pages of a tmpfs folios at once according to
> the size of the large folio. On one hand, this can reduce the overhead of page
> faults; on the other hand, it can leverage hardware architecture optimizations
> to reduce TLB misses, such as contiguous PTEs on the ARM architecture.
>
> Moreover, tmpfs mount will use the 'huge=' option to control large folio
> allocation explicitly. So it can be understood that the process's RSS statistics
> might increase, and I think this will not cause any obvious effects for users.
>
> Performance test:
> I created a 1G tmpfs file, populated with 64K large folios, and write-accessed it
> sequentially via mmap(). I observed a significant performance improvement:
>
> Before the patch:
> real 0m0.158s
> user 0m0.008s
> sys 0m0.150s
>
> After the patch:
> real 0m0.021s
> user 0m0.004s
> sys 0m0.017s
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
> ---
> Changes from v1:
> - Drop the unnecessary IS_ALIGNED() check, per David.
> - Update the commit message, per David.
> ---
> mm/memory.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 0f9b32a20e5b..9944380e947d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5383,10 +5383,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>
> /*
> * Using per-page fault to maintain the uffd semantics, and same
> - * approach also applies to non-anonymous-shmem faults to avoid
> + * approach also applies to non shmem/tmpfs faults to avoid
> * inflating the RSS of the process.
> */
> - if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
> + if (!vma_is_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
> unlikely(needs_fallback)) {
> nr_pages = 1;
> } else if (nr_pages > 1) {
> --
> 2.43.5
>
>
* Re: [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs
2025-07-04 3:19 [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs Baolin Wang
` (4 preceding siblings ...)
2025-07-07 15:47 ` Barry Song
@ 2025-07-07 16:18 ` Vishal Moola (Oracle)
5 siblings, 0 replies; 17+ messages in thread
From: Vishal Moola (Oracle) @ 2025-07-07 16:18 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, hughd, david, ziy, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, vbabka, rppt, surenb, mhocko,
linux-mm, linux-kernel
On Fri, Jul 04, 2025 at 11:19:26AM +0800, Baolin Wang wrote:
> After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
> tmpfs can also support large folio allocation (not just PMD-sized large
> folios).
>
> However, when accessing tmpfs via mmap(), although tmpfs supports large folios,
> we still establish mappings at the base page granularity, which is unreasonable.
>
> We can map multiple consecutive pages of a tmpfs folios at once according to
> the size of the large folio. On one hand, this can reduce the overhead of page
> faults; on the other hand, it can leverage hardware architecture optimizations
> to reduce TLB misses, such as contiguous PTEs on the ARM architecture.
>
> Moreover, tmpfs mount will use the 'huge=' option to control large folio
> allocation explicitly. So it can be understood that the process's RSS statistics
> might increase, and I think this will not cause any obvious effects for users.
>
> Performance test:
> I created a 1G tmpfs file, populated with 64K large folios, and write-accessed it
> sequentially via mmap(). I observed a significant performance improvement:
>
> Before the patch:
> real 0m0.158s
> user 0m0.008s
> sys 0m0.150s
>
> After the patch:
> real 0m0.021s
> user 0m0.004s
> sys 0m0.017s
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
* Re: [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs
2025-07-07 13:33 ` Lorenzo Stoakes
@ 2025-07-08 7:53 ` Baolin Wang
2025-07-09 13:13 ` Lorenzo Stoakes
0 siblings, 1 reply; 17+ messages in thread
From: Baolin Wang @ 2025-07-08 7:53 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, hughd, david, ziy, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, vbabka, rppt, surenb, mhocko,
linux-mm, linux-kernel
On 2025/7/7 21:33, Lorenzo Stoakes wrote:
> On Sun, Jul 06, 2025 at 10:02:35AM +0800, Baolin Wang wrote:
>>
>>
>> On 2025/7/5 06:18, Andrew Morton wrote:
>>> On Fri, 4 Jul 2025 11:19:26 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>>>
>>>> After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
>>>> tmpfs can also support large folio allocation (not just PMD-sized large
>>>> folios).
>>>>
>>>> However, when accessing tmpfs via mmap(), although tmpfs supports large folios,
>>>> we still establish mappings at the base page granularity, which is unreasonable.
>>>>
>>>> We can map multiple consecutive pages of a tmpfs folios at once according to
>>>> the size of the large folio. On one hand, this can reduce the overhead of page
>>>> faults; on the other hand, it can leverage hardware architecture optimizations
>>>> to reduce TLB misses, such as contiguous PTEs on the ARM architecture.
>>>>
>>>> Moreover, tmpfs mount will use the 'huge=' option to control large folio
>>>> allocation explicitly. So it can be understood that the process's RSS statistics
>>>> might increase, and I think this will not cause any obvious effects for users.
>>>>
>>>> Performance test:
>>>> I created a 1G tmpfs file, populated with 64K large folios, and write-accessed it
>>>> sequentially via mmap(). I observed a significant performance improvement:
>>>
>>> That doesn't sound like a crazy thing to do.
>>>
>>>> Before the patch:
>>>> real 0m0.158s
>>>> user 0m0.008s
>>>> sys 0m0.150s
>>>>
>>>> After the patch:
>>>> real 0m0.021s
>>>> user 0m0.004s
>>>> sys 0m0.017s
>>>
>>> And look at that.
>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 0f9b32a20e5b..9944380e947d 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -5383,10 +5383,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>> /*
>>>> * Using per-page fault to maintain the uffd semantics, and same
>>>> - * approach also applies to non-anonymous-shmem faults to avoid
>>>> + * approach also applies to non shmem/tmpfs faults to avoid
>>>> * inflating the RSS of the process.
>>>> */
>>>> - if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
>>>> + if (!vma_is_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
>>>> unlikely(needs_fallback)) {
>>>> nr_pages = 1;
>>>> } else if (nr_pages > 1) {
>>>
>>> and that's it?
>>>
>>> I'm itching to get this into -stable, really. What LTS user wouldn't
>>> want this?
>>
>> This is an improvement rather than a bugfix, so I don't think it needs to go
>> into LTS.
>>
>> Could it be viewed as correcting an oversight in
>>> acd7ccb284b8?
>>
>> Yes, I should have added this optimization in the series of the commit
>> acd7ccb284b8. But obviously, I missed this :(.
>
> Buuut if this was an oversight for that patch that causes an unnecessary
> perf degradation, surely this should have fixes tag + cc stable no?
IMO, commit acd7ccb284b8 does not cause a perf degradation; rather, it
introduces a new feature, and the current patch is a further reasonable
optimization. As I mentioned, this is an improvement, not a bugfix or a
patch to address a performance regression.
* Re: [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs
2025-07-08 7:53 ` Baolin Wang
@ 2025-07-09 13:13 ` Lorenzo Stoakes
2025-07-15 20:03 ` Hugh Dickins
0 siblings, 1 reply; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-07-09 13:13 UTC (permalink / raw)
To: Baolin Wang
Cc: Andrew Morton, hughd, david, ziy, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, vbabka, rppt, surenb, mhocko,
linux-mm, linux-kernel
On Tue, Jul 08, 2025 at 03:53:56PM +0800, Baolin Wang wrote:
>
>
> On 2025/7/7 21:33, Lorenzo Stoakes wrote:
> > On Sun, Jul 06, 2025 at 10:02:35AM +0800, Baolin Wang wrote:
> > >
> > >
> > > On 2025/7/5 06:18, Andrew Morton wrote:
> > > > On Fri, 4 Jul 2025 11:19:26 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> > > >
> > > > > After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
> > > > > tmpfs can also support large folio allocation (not just PMD-sized large
> > > > > folios).
> > > > >
> > > > > However, when accessing tmpfs via mmap(), although tmpfs supports large folios,
> > > > > we still establish mappings at the base page granularity, which is unreasonable.
> > > > >
> > > > > We can map multiple consecutive pages of a tmpfs folios at once according to
> > > > > the size of the large folio. On one hand, this can reduce the overhead of page
> > > > > faults; on the other hand, it can leverage hardware architecture optimizations
> > > > > to reduce TLB misses, such as contiguous PTEs on the ARM architecture.
> > > > >
> > > > > Moreover, tmpfs mount will use the 'huge=' option to control large folio
> > > > > allocation explicitly. So it can be understood that the process's RSS statistics
> > > > > might increase, and I think this will not cause any obvious effects for users.
> > > > >
> > > > > Performance test:
> > > > > I created a 1G tmpfs file, populated with 64K large folios, and write-accessed it
> > > > > sequentially via mmap(). I observed a significant performance improvement:
> > > >
> > > > That doesn't sound like a crazy thing to do.
> > > >
> > > > > Before the patch:
> > > > > real 0m0.158s
> > > > > user 0m0.008s
> > > > > sys 0m0.150s
> > > > >
> > > > > After the patch:
> > > > > real 0m0.021s
> > > > > user 0m0.004s
> > > > > sys 0m0.017s
> > > >
> > > > And look at that.
> > > >
> > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > index 0f9b32a20e5b..9944380e947d 100644
> > > > > --- a/mm/memory.c
> > > > > +++ b/mm/memory.c
> > > > > @@ -5383,10 +5383,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> > > > > /*
> > > > > * Using per-page fault to maintain the uffd semantics, and same
> > > > > - * approach also applies to non-anonymous-shmem faults to avoid
> > > > > + * approach also applies to non shmem/tmpfs faults to avoid
> > > > > * inflating the RSS of the process.
> > > > > */
> > > > > - if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
> > > > > + if (!vma_is_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
> > > > > unlikely(needs_fallback)) {
> > > > > nr_pages = 1;
> > > > > } else if (nr_pages > 1) {
> > > >
> > > > and that's it?
> > > >
> > > > I'm itching to get this into -stable, really. What LTS user wouldn't
> > > > want this?
> > >
> > > This is an improvement rather than a bugfix, so I don't think it needs to go
> > > into LTS.
> > >
> > > Could it be viewed as correcting an oversight in
> > > > acd7ccb284b8?
> > >
> > > Yes, I should have added this optimization in the series of the commit
> > > acd7ccb284b8. But obviously, I missed this :(.
> >
> > Buuut if this was an oversight for that patch that causes an unnecessary
> > perf degradation, surely this should have fixes tag + cc stable no?
>
> IMO, this commit acd7ccb284b8 won't cause perf degradation, instead it is
> used to introduce a new feature, while the current patch is a further
> reasonable optimization. As I mentioned, this is an improvement, not a
> bugfix or a patch to address performance regression.
Well :) you say yourself it was an oversight, and it very clearly has a perf
_impact_, which if you compare backwards to acd7ccb284b8 is a degradation, but I
get your point.
However, since you say 'oversight' this seems to me that you really meant to
have included it but hadn't noticed, and additionally, since it just seems to be
an unequivocal good - let's maybe flip this round - why NOT backport it to
stable?
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs
2025-07-09 13:13 ` Lorenzo Stoakes
@ 2025-07-15 20:03 ` Hugh Dickins
2025-07-17 8:01 ` Baolin Wang
2025-07-17 9:19 ` Lorenzo Stoakes
0 siblings, 2 replies; 17+ messages in thread
From: Hugh Dickins @ 2025-07-15 20:03 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Baolin Wang, Matthew Wilcox, hughd, david, ziy, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, vbabka, rppt, surenb,
mhocko, linux-mm, linux-kernel
On Wed, 9 Jul 2025, Lorenzo Stoakes wrote:
> On Tue, Jul 08, 2025 at 03:53:56PM +0800, Baolin Wang wrote:
> > On 2025/7/7 21:33, Lorenzo Stoakes wrote:
> > > On Sun, Jul 06, 2025 at 10:02:35AM +0800, Baolin Wang wrote:
> > > > On 2025/7/5 06:18, Andrew Morton wrote:
> > > > > On Fri, 4 Jul 2025 11:19:26 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> > > > >
> > > > > > After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
> > > > > > tmpfs can also support large folio allocation (not just PMD-sized large
> > > > > > folios).
> > > > > >
> > > > > > However, when accessing tmpfs via mmap(), although tmpfs supports large folios,
> > > > > > we still establish mappings at the base page granularity, which is unreasonable.
> > > > > >
> > > > > We can map multiple consecutive pages of a tmpfs folio at once according to
> > > > > > the size of the large folio. On one hand, this can reduce the overhead of page
> > > > > > faults; on the other hand, it can leverage hardware architecture optimizations
> > > > > > to reduce TLB misses, such as contiguous PTEs on the ARM architecture.
> > > > > >
> > > > > > Moreover, tmpfs mount will use the 'huge=' option to control large folio
> > > > > > allocation explicitly. So it can be understood that the process's RSS statistics
> > > > > > might increase, and I think this will not cause any obvious effects for users.
> > > > > >
> > > > > > Performance test:
> > > > > > I created a 1G tmpfs file, populated with 64K large folios, and write-accessed it
> > > > > > sequentially via mmap(). I observed a significant performance improvement:
> > > > >
> > > > > That doesn't sound like a crazy thing to do.
> > > > >
> > > > > > Before the patch:
> > > > > > real 0m0.158s
> > > > > > user 0m0.008s
> > > > > > sys 0m0.150s
> > > > > >
> > > > > > After the patch:
> > > > > > real 0m0.021s
> > > > > > user 0m0.004s
> > > > > > sys 0m0.017s
> > > > >
> > > > > And look at that.
> > > > >
> > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > index 0f9b32a20e5b..9944380e947d 100644
> > > > > > --- a/mm/memory.c
> > > > > > +++ b/mm/memory.c
> > > > > > @@ -5383,10 +5383,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> > > > > > /*
> > > > > > * Using per-page fault to maintain the uffd semantics, and same
> > > > > > - * approach also applies to non-anonymous-shmem faults to avoid
> > > > > > + * approach also applies to non shmem/tmpfs faults to avoid
> > > > > > * inflating the RSS of the process.
> > > > > > */
> > > > > > - if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
> > > > > > + if (!vma_is_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
> > > > > > unlikely(needs_fallback)) {
> > > > > > nr_pages = 1;
> > > > > > } else if (nr_pages > 1) {
> > > > >
> > > > > and that's it?
> > > > >
> > > > > I'm itching to get this into -stable, really. What LTS user wouldn't
> > > > > want this?
> > > >
> > > > This is an improvement rather than a bugfix, so I don't think it needs to go
> > > > into LTS.
> > > >
> > > > > Could it be viewed as correcting an oversight in
> > > > > acd7ccb284b8?
> > > >
> > > > Yes, I should have added this optimization in the series of the commit
> > > > acd7ccb284b8. But obviously, I missed this :(.
> > >
> > > Buuut if this was an oversight for that patch that causes an unnecessary
> > > perf degradation, surely this should have fixes tag + cc stable no?
> >
> > IMO, this commit acd7ccb284b8 won't cause perf degradation, instead it is
> > used to introduce a new feature, while the current patch is a further
> > reasonable optimization. As I mentioned, this is an improvement, not a
> > bugfix or a patch to address performance regression.
>
> Well :) you say yourself it was an oversight, and it very clearly has a perf
> _impact_, which if you compare backwards to acd7ccb284b8 is a degradation, but I
> get your point.
>
> However, since you say 'oversight' this seems to me that you really meant to
> have included it but hadn't noticed, and additionally, since it just seems to be
> an unequivocal good - let's maybe flip this round - why NOT backport it to
> stable?
I strongly agree with Baolin: this patch is good, thank you, but it is
a performance improvement, a new feature, not a candidate for the stable
tree. I'm surprised anyone thinks otherwise: Andrew, please delete that
stable tag before advancing the patch from mm-unstable to mm-stable.
And the Fixee went into 6.14, so it couldn't go to 6.12 LTS anyway.
An unequivocal good? I'm not so sure.
I expect it ought to be limited, by fault_around_bytes (or suchlike).
If I understand all the mTHP versus large folio versus PMD-huge handling
correctly (and of course I do not, I'm still weeks if not months away
from understanding most of it), the old vma_is_anon_shmem() case would
be limited by the shmem mTHP tunables, and one can reasonably argue that
they would already take fault_around_bytes-like considerations into account;
but the newly added file-written cases are governed by huge= mount options
intended for PMD-size, but (currently) permitting all lesser orders.
I don't think that mounting a tmpfs huge=always implies that mapping
256 PTEs for one fault is necessarily a good strategy.
But looking in the opposite direction, why is there now a vma_is_shmem()
check there in finish_fault() at all? If major filesystems are using
large folios, why aren't they also allowed to benefit from mapping
multiple PTEs at once (in this shared-writable case which the existing
fault-around does not cover - I presume to avoid write amplification,
but that's not an issue when the folio is large already).
It's fine to advance cautiously, keeping this to shmem in the coming release;
but I think it should be extended soon (without any backport to stable),
and consideration given to limiting it.
Hugh
* Re: [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs
2025-07-15 20:03 ` Hugh Dickins
@ 2025-07-17 8:01 ` Baolin Wang
2025-07-17 8:22 ` David Hildenbrand
2025-07-17 9:19 ` Lorenzo Stoakes
1 sibling, 1 reply; 17+ messages in thread
From: Baolin Wang @ 2025-07-17 8:01 UTC (permalink / raw)
To: Hugh Dickins, Lorenzo Stoakes, Andrew Morton
Cc: Matthew Wilcox, david, ziy, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, vbabka, rppt, surenb, mhocko, linux-mm,
linux-kernel
On 2025/7/16 04:03, Hugh Dickins wrote:
> On Wed, 9 Jul 2025, Lorenzo Stoakes wrote:
>> On Tue, Jul 08, 2025 at 03:53:56PM +0800, Baolin Wang wrote:
>>> On 2025/7/7 21:33, Lorenzo Stoakes wrote:
>>>> On Sun, Jul 06, 2025 at 10:02:35AM +0800, Baolin Wang wrote:
>>>>> On 2025/7/5 06:18, Andrew Morton wrote:
>>>>>> On Fri, 4 Jul 2025 11:19:26 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>>>>>>
>>>>>>> After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
>>>>>>> tmpfs can also support large folio allocation (not just PMD-sized large
>>>>>>> folios).
>>>>>>>
>>>>>>> However, when accessing tmpfs via mmap(), although tmpfs supports large folios,
>>>>>>> we still establish mappings at the base page granularity, which is unreasonable.
>>>>>>>
>>>>>>> We can map multiple consecutive pages of a tmpfs folio at once according to
>>>>>>> the size of the large folio. On one hand, this can reduce the overhead of page
>>>>>>> faults; on the other hand, it can leverage hardware architecture optimizations
>>>>>>> to reduce TLB misses, such as contiguous PTEs on the ARM architecture.
>>>>>>>
>>>>>>> Moreover, tmpfs mount will use the 'huge=' option to control large folio
>>>>>>> allocation explicitly. So it can be understood that the process's RSS statistics
>>>>>>> might increase, and I think this will not cause any obvious effects for users.
>>>>>>>
>>>>>>> Performance test:
>>>>>>> I created a 1G tmpfs file, populated with 64K large folios, and write-accessed it
>>>>>>> sequentially via mmap(). I observed a significant performance improvement:
>>>>>>
>>>>>> That doesn't sound like a crazy thing to do.
>>>>>>
>>>>>>> Before the patch:
>>>>>>> real 0m0.158s
>>>>>>> user 0m0.008s
>>>>>>> sys 0m0.150s
>>>>>>>
>>>>>>> After the patch:
>>>>>>> real 0m0.021s
>>>>>>> user 0m0.004s
>>>>>>> sys 0m0.017s
>>>>>>
>>>>>> And look at that.
>>>>>>
>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>> index 0f9b32a20e5b..9944380e947d 100644
>>>>>>> --- a/mm/memory.c
>>>>>>> +++ b/mm/memory.c
>>>>>>> @@ -5383,10 +5383,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>>>> /*
>>>>>>> * Using per-page fault to maintain the uffd semantics, and same
>>>>>>> - * approach also applies to non-anonymous-shmem faults to avoid
>>>>>>> + * approach also applies to non shmem/tmpfs faults to avoid
>>>>>>> * inflating the RSS of the process.
>>>>>>> */
>>>>>>> - if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
>>>>>>> + if (!vma_is_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
>>>>>>> unlikely(needs_fallback)) {
>>>>>>> nr_pages = 1;
>>>>>>> } else if (nr_pages > 1) {
>>>>>>
>>>>>> and that's it?
>>>>>>
>>>>>> I'm itching to get this into -stable, really. What LTS user wouldn't
>>>>>> want this?
>>>>>
>>>>> This is an improvement rather than a bugfix, so I don't think it needs to go
>>>>> into LTS.
>>>>>
>>>>>> Could it be viewed as correcting an oversight in
>>>>>> acd7ccb284b8?
>>>>>
>>>>> Yes, I should have added this optimization in the series of the commit
>>>>> acd7ccb284b8. But obviously, I missed this :(.
>>>>
>>>> Buuut if this was an oversight for that patch that causes an unnecessary
>>>> perf degradation, surely this should have fixes tag + cc stable no?
>>>
>>> IMO, this commit acd7ccb284b8 won't cause perf degradation, instead it is
>>> used to introduce a new feature, while the current patch is a further
>>> reasonable optimization. As I mentioned, this is an improvement, not a
>>> bugfix or a patch to address performance regression.
>>
>> Well :) you say yourself it was an oversight, and it very clearly has a perf
>> _impact_, which if you compare backwards to acd7ccb284b8 is a degradation, but I
>> get your point.
>>
>> However, since you say 'oversight' this seems to me that you really meant to
>> have included it but hadn't noticed, and additionally, since it just seems to be
>> an unequivocal good - let's maybe flip this round - why NOT backport it to
>> stable?
>
> I strongly agree with Baolin: this patch is good, thank you, but it is
> a performance improvement, a new feature, not a candidate for the stable
> tree. I'm surprised anyone thinks otherwise: Andrew, please delete that
> stable tag before advancing the patch from mm-unstable to mm-stable.
>
> And the Fixee went into 6.14, so it couldn't go to 6.12 LTS anyway.
Agree.
> An unequivocal good? I'm not so sure.
>
> I expect it ought to be limited, by fault_around_bytes (or suchlike).
>
> If I understand all the mTHP versus large folio versus PMD-huge handling
> correctly (and of course I do not, I'm still weeks if not months away
> from understanding most of it), the old vma_is_anon_shmem() case would
> be limited by the shmem mTHP tunables, and one can reasonably argue that
> they would already take fault_around_bytes-like considerations into account;
> but the newly added file-written cases are governed by huge= mount options
> intended for PMD-size, but (currently) permitting all lesser orders.
> I don't think that mounting a tmpfs huge=always implies that mapping
> 256 PTEs for one fault is necessarily a good strategy.
>
> But looking in the opposite direction, why is there now a vma_is_shmem()
> check there in finish_fault() at all? If major filesystems are using
> large folios, why aren't they also allowed to benefit from mapping
> multiple PTEs at once (in this shared-writable case which the existing
> fault-around does not cover - I presume to avoid write amplification,
> but that's not an issue when the folio is large already).
This is what I'm going to do next. For other filesystems, I think they
should also map multiple PTEs at once. IIUC, this will not lead to write
amplification (since the large folios are already allocated, it will only
inflate the RSS, and I wonder whether that actually causes any problem).
On the contrary, the current dirty tracking for mmap write access is
per-folio (see iomap_page_mkwrite()), which can itself cause write
amplification.
> It's fine to advance cautiously, keeping this to shmem in the coming release;
> but I think it should be extended soon (without any backport to stable),
> and consideration given to limiting it.
Yes, I will consider that. But fault_around_bytes is tricky; we can
discuss it in subsequent patches. Thanks for your comments.
* Re: [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs
2025-07-17 8:01 ` Baolin Wang
@ 2025-07-17 8:22 ` David Hildenbrand
2025-07-17 9:20 ` Lorenzo Stoakes
0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2025-07-17 8:22 UTC (permalink / raw)
To: Baolin Wang, Hugh Dickins, Lorenzo Stoakes, Andrew Morton
Cc: Matthew Wilcox, ziy, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, vbabka, rppt, surenb, mhocko, linux-mm, linux-kernel
On 17.07.25 10:01, Baolin Wang wrote:
>
>
> On 2025/7/16 04:03, Hugh Dickins wrote:
>> On Wed, 9 Jul 2025, Lorenzo Stoakes wrote:
>>> On Tue, Jul 08, 2025 at 03:53:56PM +0800, Baolin Wang wrote:
>>>> On 2025/7/7 21:33, Lorenzo Stoakes wrote:
>>>>> On Sun, Jul 06, 2025 at 10:02:35AM +0800, Baolin Wang wrote:
>>>>>> On 2025/7/5 06:18, Andrew Morton wrote:
>>>>>>> On Fri, 4 Jul 2025 11:19:26 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>>>>>>>
>>>>>>>> After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
>>>>>>>> tmpfs can also support large folio allocation (not just PMD-sized large
>>>>>>>> folios).
>>>>>>>>
>>>>>>>> However, when accessing tmpfs via mmap(), although tmpfs supports large folios,
>>>>>>>> we still establish mappings at the base page granularity, which is unreasonable.
>>>>>>>>
>>>>>>>> We can map multiple consecutive pages of a tmpfs folio at once according to
>>>>>>>> the size of the large folio. On one hand, this can reduce the overhead of page
>>>>>>>> faults; on the other hand, it can leverage hardware architecture optimizations
>>>>>>>> to reduce TLB misses, such as contiguous PTEs on the ARM architecture.
>>>>>>>>
>>>>>>>> Moreover, tmpfs mount will use the 'huge=' option to control large folio
>>>>>>>> allocation explicitly. So it can be understood that the process's RSS statistics
>>>>>>>> might increase, and I think this will not cause any obvious effects for users.
>>>>>>>>
>>>>>>>> Performance test:
>>>>>>>> I created a 1G tmpfs file, populated with 64K large folios, and write-accessed it
>>>>>>>> sequentially via mmap(). I observed a significant performance improvement:
>>>>>>>
>>>>>>> That doesn't sound like a crazy thing to do.
>>>>>>>
>>>>>>>> Before the patch:
>>>>>>>> real 0m0.158s
>>>>>>>> user 0m0.008s
>>>>>>>> sys 0m0.150s
>>>>>>>>
>>>>>>>> After the patch:
>>>>>>>> real 0m0.021s
>>>>>>>> user 0m0.004s
>>>>>>>> sys 0m0.017s
>>>>>>>
>>>>>>> And look at that.
>>>>>>>
>>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>>> index 0f9b32a20e5b..9944380e947d 100644
>>>>>>>> --- a/mm/memory.c
>>>>>>>> +++ b/mm/memory.c
>>>>>>>> @@ -5383,10 +5383,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>>>>> /*
>>>>>>>> * Using per-page fault to maintain the uffd semantics, and same
>>>>>>>> - * approach also applies to non-anonymous-shmem faults to avoid
>>>>>>>> + * approach also applies to non shmem/tmpfs faults to avoid
>>>>>>>> * inflating the RSS of the process.
>>>>>>>> */
>>>>>>>> - if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
>>>>>>>> + if (!vma_is_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
>>>>>>>> unlikely(needs_fallback)) {
>>>>>>>> nr_pages = 1;
>>>>>>>> } else if (nr_pages > 1) {
>>>>>>>
>>>>>>> and that's it?
>>>>>>>
>>>>>>> I'm itching to get this into -stable, really. What LTS user wouldn't
>>>>>>> want this?
>>>>>>
>>>>>> This is an improvement rather than a bugfix, so I don't think it needs to go
>>>>>> into LTS.
>>>>>>
>>>>>>> Could it be viewed as correcting an oversight in
>>>>>>> acd7ccb284b8?
>>>>>>
>>>>>> Yes, I should have added this optimization in the series of the commit
>>>>>> acd7ccb284b8. But obviously, I missed this :(.
>>>>>
>>>>> Buuut if this was an oversight for that patch that causes an unnecessary
>>>>> perf degradation, surely this should have fixes tag + cc stable no?
>>>>
>>>> IMO, this commit acd7ccb284b8 won't cause perf degradation, instead it is
>>>> used to introduce a new feature, while the current patch is a further
>>>> reasonable optimization. As I mentioned, this is an improvement, not a
>>>> bugfix or a patch to address performance regression.
>>>
>>> Well :) you say yourself it was an oversight, and it very clearly has a perf
>>> _impact_, which if you compare backwards to acd7ccb284b8 is a degradation, but I
>>> get your point.
>>>
>>> However, since you say 'oversight' this seems to me that you really meant to
>>> have included it but hadn't noticed, and additionally, since it just seems to be
>>> an unequivocal good - let's maybe flip this round - why NOT backport it to
>>> stable?
>>
>> I strongly agree with Baolin: this patch is good, thank you, but it is
>> a performance improvement, a new feature, not a candidate for the stable
>> tree. I'm surprised anyone thinks otherwise: Andrew, please delete that
>> stable tag before advancing the patch from mm-unstable to mm-stable.
>>
>> And the Fixee went into 6.14, so it couldn't go to 6.12 LTS anyway.
>
> Agree.
>
>> An unequivocal good? I'm not so sure.
>>
>> I expect it ought to be limited, by fault_around_bytes (or suchlike).
>>
>> If I understand all the mTHP versus large folio versus PMD-huge handling
>> correctly (and of course I do not, I'm still weeks if not months away
>> from understanding most of it), the old vma_is_anon_shmem() case would
>> be limited by the shmem mTHP tunables, and one can reasonably argue that
>> they would already take fault_around_bytes-like considerations into account;
>> but the newly added file-written cases are governed by huge= mount options
>> intended for PMD-size, but (currently) permitting all lesser orders.
>> I don't think that mounting a tmpfs huge=always implies that mapping
>> 256 PTEs for one fault is necessarily a good strategy.
>>
>> But looking in the opposite direction, why is there now a vma_is_shmem()
>> check there in finish_fault() at all? If major filesystems are using
>> large folios, why aren't they also allowed to benefit from mapping
>> multiple PTEs at once (in this shared-writable case which the existing
>> fault-around does not cover - I presume to avoid write amplification,
>> but that's not an issue when the folio is large already).
>
> This is what I'm going to do next. For other filesystems, I think they
> should also map multiple PTEs at once.
I recall [1]. But that would, of course, only affect the RSS of a
program and not the actual memory consumption (the large folio resides
in memory ...).
The comment in the code spells that out: "inflating the RSS of the
process."
[1] https://www.suse.com/support/kb/doc/?id=000019017
--
Cheers,
David / dhildenb
* Re: [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs
2025-07-15 20:03 ` Hugh Dickins
2025-07-17 8:01 ` Baolin Wang
@ 2025-07-17 9:19 ` Lorenzo Stoakes
1 sibling, 0 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-07-17 9:19 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, Baolin Wang, Matthew Wilcox, david, ziy,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, vbabka,
rppt, surenb, mhocko, linux-mm, linux-kernel
TL;DR - I am fine with this not being backported.
On Tue, Jul 15, 2025 at 01:03:40PM -0700, Hugh Dickins wrote:
> On Wed, 9 Jul 2025, Lorenzo Stoakes wrote:
> > Well :) you say yourself it was an oversight, and it very clearly has a perf
> > _impact_, which if you compare backwards to acd7ccb284b8 is a degradation, but I
> > get your point.
> >
> > However, since you say 'oversight' this seems to me that you really meant to
> > have included it but hadn't noticed, and additionally, since it just seems to be
> > an unequivocal good - let's maybe flip this round - why NOT backport it to
> > stable?
>
> I strongly agree with Baolin: this patch is good, thank you, but it is
> a performance improvement, a new feature, not a candidate for the stable
> tree. I'm surprised anyone thinks otherwise: Andrew, please delete that
> stable tag before advancing the patch from mm-unstable to mm-stable.
I understand that we don't arbitrarily backport perf improvements,
obviously :)
But at some point here Baolin said (though perhaps I misinterpreted) that
this was, in effect, an oversight: something he had meant to include at
the time but left out.
But I'm fine for us to decide to treat this differently, I'm not going to
keep pushing this - it's all good - let's not backport.
>
> And the Fixee went into 6.14, so it couldn't go to 6.12 LTS anyway.
On my part I was talking about a standard stable backport, not somehow
arbitrarily shoving the two patches into 6.12 LTS... to be clear!
>
> An unequivocal good? I'm not so sure.
>
> I expect it ought to be limited, by fault_around_bytes (or suchlike).
Yeah...
>
> If I understand all the mTHP versus large folio versus PMD-huge handling
> correctly (and of course I do not, I'm still weeks if not months away
> from understanding most of it), the old vma_is_anon_shmem() case would
> be limited by the shmem mTHP tunables, and one can reasonably argue that
> they would already take fault_around_bytes-like considerations into account;
> but the newly added file-written cases are governed by huge= mount options
> intended for PMD-size, but (currently) permitting all lesser orders.
> I don't think that mounting a tmpfs huge=always implies that mapping
> 256 PTEs for one fault is necessarily a good strategy.
>
> But looking in the opposite direction, why is there now a vma_is_shmem()
> check there in finish_fault() at all? If major filesystems are using
> large folios, why aren't they also allowed to benefit from mapping
> multiple PTEs at once (in this shared-writable case which the existing
> fault-around does not cover - I presume to avoid write amplification,
> but that's not an issue when the folio is large already).
This seems like something we should tread carefully around.
>
> It's fine to advance cautiously, keeping this to shmem in the coming release;
> but I think it should be extended soon (without any backport to stable),
> and consideration given to limiting it.
Yes agreed!
>
> Hugh
Cheers, Lorenzo
* Re: [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs
2025-07-17 8:22 ` David Hildenbrand
@ 2025-07-17 9:20 ` Lorenzo Stoakes
2025-07-17 9:25 ` David Hildenbrand
0 siblings, 1 reply; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-07-17 9:20 UTC (permalink / raw)
To: David Hildenbrand
Cc: Baolin Wang, Hugh Dickins, Andrew Morton, Matthew Wilcox, ziy,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, vbabka,
rppt, surenb, mhocko, linux-mm, linux-kernel
On Thu, Jul 17, 2025 at 10:22:47AM +0200, David Hildenbrand wrote:
> I recall [1]. But that would, of course, only affect the RSS of a program
> and not the actual memory consumption (the large folio resides in memory
> ...).
>
> The comment in the code spells that out: "inflating the RSS of the
> process."
RSS is a pretty sucky measure though, as covered in depth by Vlastimil in
his KR 2024 talk [2]. So maybe we don't need to worry _so_ much about this
:)
[2]: https://kernel-recipes.org/en/2024/schedule/all-your-memory-are-belong-to-whom/
>
> [1] https://www.suse.com/support/kb/doc/?id=000019017
>
> --
> Cheers,
>
> David / dhildenb
>
Cheers, Lorenzo
* Re: [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs
2025-07-17 9:20 ` Lorenzo Stoakes
@ 2025-07-17 9:25 ` David Hildenbrand
0 siblings, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2025-07-17 9:25 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Baolin Wang, Hugh Dickins, Andrew Morton, Matthew Wilcox, ziy,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, vbabka,
rppt, surenb, mhocko, linux-mm, linux-kernel
On 17.07.25 11:20, Lorenzo Stoakes wrote:
> On Thu, Jul 17, 2025 at 10:22:47AM +0200, David Hildenbrand wrote:
>> I recall [1]. But that would, of course, only affect the RSS of a program
>> and not the actual memory consumption (the large folio resides in memory
>> ...).
>>
>> The comments in the code spells that out: "inflating the RSS of the
>> process."
>
> RSS is a pretty sucky measure though, as covered in depth by Vlastimil in
> his KR 2024 talk [2]. So maybe we don't need to worry _so_ much about this
> :)
Well, I pointed at an actual issue that people *did* worry about ;)
I also think that it's probably fine, but some apps/environments
apparently don't enjoy suddenly seeing spikes in RSS. Maybe these are so
specific that only distributions like the one in the report have to worry
about that (good luck once the toggles are gone, lol).
--
Cheers,
David / dhildenb
end of thread, other threads:[~2025-07-17 9:25 UTC | newest]
Thread overview: 17+ messages
2025-07-04 3:19 [PATCH v2] mm: fault in complete folios instead of individual pages for tmpfs Baolin Wang
2025-07-04 9:38 ` David Hildenbrand
2025-07-04 22:18 ` Andrew Morton
2025-07-06 2:02 ` Baolin Wang
2025-07-07 13:33 ` Lorenzo Stoakes
2025-07-08 7:53 ` Baolin Wang
2025-07-09 13:13 ` Lorenzo Stoakes
2025-07-15 20:03 ` Hugh Dickins
2025-07-17 8:01 ` Baolin Wang
2025-07-17 8:22 ` David Hildenbrand
2025-07-17 9:20 ` Lorenzo Stoakes
2025-07-17 9:25 ` David Hildenbrand
2025-07-17 9:19 ` Lorenzo Stoakes
2025-07-04 22:31 ` Zi Yan
2025-07-07 13:31 ` Lorenzo Stoakes
2025-07-07 15:47 ` Barry Song
2025-07-07 16:18 ` Vishal Moola (Oracle)