* [PATCH] mm: support large mapping building for tmpfs
@ 2025-07-01  8:40 Baolin Wang
  2025-07-01 13:08 ` David Hildenbrand
  0 siblings, 1 reply; 9+ messages in thread

From: Baolin Wang @ 2025-07-01 8:40 UTC (permalink / raw)
To: akpm, hughd, david
Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
    baohua, vbabka, rppt, surenb, mhocko, baolin.wang, linux-mm, linux-kernel

After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
tmpfs can also support large folio allocation (not just PMD-sized large
folios).

However, when accessing tmpfs via mmap(), although tmpfs supports large folios,
we still establish mappings at the base page granularity, which is unreasonable.

We can establish large mappings according to the size of the large folio. On one
hand, this can reduce the overhead of page faults; on the other hand, it can
leverage hardware architecture optimizations to reduce TLB misses, such as
contiguous PTEs on the ARM architecture.

Moreover, since the user has already added the 'huge=' option when mounting tmpfs
to allow for large folio allocation, establishing large folios' mapping is expected
and will not surprise users by inflating the RSS of the process.

In order to support large mappings for tmpfs, besides checking VMA limits and
PMD pagetable limits, it is also necessary to check if the linear page offset
of the VMA is order-aligned within the file.

Performance test:
I created a 1G tmpfs file, populated with 64K large folios, and accessed it
sequentially via mmap(). I observed a significant performance improvement:

Before the patch:
real	0m0.214s
user	0m0.012s
sys	0m0.203s

After the patch:
real	0m0.025s
user	0m0.000s
sys	0m0.024s

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/memory.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0f9b32a20e5b..6385a9385a9b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5383,10 +5383,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 
 	/*
 	 * Using per-page fault to maintain the uffd semantics, and same
-	 * approach also applies to non-anonymous-shmem faults to avoid
+	 * approach also applies to non shmem/tmpfs faults to avoid
 	 * inflating the RSS of the process.
 	 */
-	if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
+	if (!vma_is_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
 	    unlikely(needs_fallback)) {
 		nr_pages = 1;
 	} else if (nr_pages > 1) {
@@ -5395,15 +5395,20 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 		pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
 		/* The index of the entry in the pagetable for fault page. */
 		pgoff_t pte_off = pte_index(vmf->address);
+		unsigned long hpage_size = PAGE_SIZE << folio_order(folio);
 
 		/*
 		 * Fallback to per-page fault in case the folio size in page
-		 * cache beyond the VMA limits and PMD pagetable limits.
+		 * cache beyond the VMA limits or PMD pagetable limits. And
+		 * also check if the linear page offset of vma is order-aligned
+		 * within the file for tmpfs.
 		 */
 		if (unlikely(vma_off < idx ||
 			     vma_off + (nr_pages - idx) > vma_pages(vma) ||
 			     pte_off < idx ||
-			     pte_off + (nr_pages - idx) > PTRS_PER_PTE)) {
+			     pte_off + (nr_pages - idx) > PTRS_PER_PTE) ||
+		    !IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
+				hpage_size >> PAGE_SHIFT)) {
 			nr_pages = 1;
 		} else {
 			/* Now we can set mappings for the whole large folio. */
-- 
2.43.5

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: [PATCH] mm: support large mapping building for tmpfs

From: David Hildenbrand @ 2025-07-01 13:08 UTC (permalink / raw)
To: Baolin Wang, akpm, hughd
Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
    baohua, vbabka, rppt, surenb, mhocko, linux-mm, linux-kernel

On 01.07.25 10:40, Baolin Wang wrote:

Nit: talking about "large mappings" is confusing. Did you actually mean:

"mm: fault in complete folios instead of individual pages for tmpfs"

I suggest not talking about "large mappings" anywhere in this patch
description, and instead talking about mapping multiple consecutive pages
of a tmpfs folio at once instead.

> After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
> tmpfs can also support large folio allocation (not just PMD-sized large
> folios).
>
> However, when accessing tmpfs via mmap(), although tmpfs supports large folios,
> we still establish mappings at the base page granularity, which is unreasonable.
>
> We can establish large mappings according to the size of the large folio. On one
> hand, this can reduce the overhead of page faults; on the other hand, it can
> leverage hardware architecture optimizations to reduce TLB misses, such as
> contiguous PTEs on the ARM architecture.

The latter would still apply if faulting in each individual page I guess.
cont-pte will try to auto-optimize IIRC.

> Moreover, since the user has already added the 'huge=' option when mounting tmpfs
> to allow for large folio allocation, establishing large folios' mapping is expected
> and will not surprise users by inflating the RSS of the process.

Hm, are we sure about that?

Also, how does fault_around_bytes interact here?

> In order to support large mappings for tmpfs, besides checking VMA limits and
> PMD pagetable limits, it is also necessary to check if the linear page offset
> of the VMA is order-aligned within the file.

Why?

This only applies to PMD mappings. See below.

[...]

> @@ -5395,15 +5395,20 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>  		pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
>  		/* The index of the entry in the pagetable for fault page. */
>  		pgoff_t pte_off = pte_index(vmf->address);
> +		unsigned long hpage_size = PAGE_SIZE << folio_order(folio);
>
>  		/*
>  		 * Fallback to per-page fault in case the folio size in page
> -		 * cache beyond the VMA limits and PMD pagetable limits.
> +		 * cache beyond the VMA limits or PMD pagetable limits. And
> +		 * also check if the linear page offset of vma is order-aligned
> +		 * within the file for tmpfs.
>  		 */
>  		if (unlikely(vma_off < idx ||
>  			     vma_off + (nr_pages - idx) > vma_pages(vma) ||
>  			     pte_off < idx ||
> -			     pte_off + (nr_pages - idx) > PTRS_PER_PTE)) {
> +			     pte_off + (nr_pages - idx) > PTRS_PER_PTE) ||
> +		    !IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
> +				hpage_size >> PAGE_SHIFT)) {

Again, why?

Shouldn't set_pte_range() just do the right thing? set_ptes() doesn't
have any such restriction.

Also see the arm64 variant where we call

	contpte_set_ptes(mm, addr, ptep, pte, nr);

There, I think we perform checks whether we can set the cont-pte bit IIUC.

	if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
		pte = pte_mkcont(pte);
	else
		pte = pte_mknoncont(pte);

-- 
Cheers,

David / dhildenb
* Re: [PATCH] mm: support large mapping building for tmpfs

From: Baolin Wang @ 2025-07-02 2:03 UTC (permalink / raw)
To: David Hildenbrand, akpm, hughd
Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
    baohua, vbabka, rppt, surenb, mhocko, linux-mm, linux-kernel

On 2025/7/1 21:08, David Hildenbrand wrote:
> Nit: talking about "large mappings" is confusing. Did you actually mean:
>
> "mm: fault in complete folios instead of individual pages for tmpfs"
>
> I suggest not talking about "large mappings" anywhere in this patch
> description, and instead talking about mapping multiple consecutive
> pages of a tmpfs folio at once instead.

OK.

>> We can establish large mappings according to the size of the large
>> folio. On one hand, this can reduce the overhead of page faults; on
>> the other hand, it can leverage hardware architecture optimizations
>> to reduce TLB misses, such as contiguous PTEs on the ARM architecture.
>
> The latter would still apply if faulting in each individual page I
> guess. cont-pte will try to auto-optimize IIRC.

Yes, but need more CPU cycles.

>> Moreover, since the user has already added the 'huge=' option when
>> mounting tmpfs to allow for large folio allocation, establishing
>> large folios' mapping is expected and will not surprise users by
>> inflating the RSS of the process.
>
> Hm, are we sure about that?

IMO, referring to the definition of RSS:
"resident set size (RSS) is the portion of memory (measured in
kilobytes) occupied by a process that is held in main memory (RAM)."

Seems we should report the whole large folio already in file to users.
Moreover, the tmpfs mount already adds the 'huge=always (or within)'
option to allocate large folios, so the increase in RSS seems also
expected?

> Also, how does fault_around_bytes interact here?

The 'fault_around' is a bit tricky. Currently, 'fault_around' only
applies to read faults (via do_read_fault()) and does not control write
shared faults (via do_shared_fault()). Additionally, in the
do_shared_fault() function, PMD-sized large folios are also not
controlled by 'fault_around', so I just follow the handling of PMD-sized
large folios.

>> In order to support large mappings for tmpfs, besides checking VMA
>> limits and PMD pagetable limits, it is also necessary to check if the
>> linear page offset of the VMA is order-aligned within the file.
>
> Why?
>
> This only applies to PMD mappings. See below.

I previously had the same question, but I saw the comments for the
'thp_vma_suitable_order' function, so I added the check here. If it's
not necessary to check non-PMD-sized large folios, should we update the
comments for 'thp_vma_suitable_order'?

[...]
* Re: [PATCH] mm: support large mapping building for tmpfs

From: David Hildenbrand @ 2025-07-02 8:45 UTC (permalink / raw)
To: Baolin Wang, akpm, hughd
Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
    baohua, vbabka, rppt, surenb, mhocko, linux-mm, linux-kernel

>> Hm, are we sure about that?
>
> IMO, referring to the definition of RSS:
> "resident set size (RSS) is the portion of memory (measured in
> kilobytes) occupied by a process that is held in main memory (RAM)."
>
> Seems we should report the whole large folio already in file to users.
> Moreover, the tmpfs mount already adds the 'huge=always (or within)'
> option to allocate large folios, so the increase in RSS seems also
> expected?

Well, traditionally we only account what is actually mapped. If you
MADV_DONTNEED part of the large folio, or only mmap() parts of it, the
RSS would never cover the whole folio -- only what is mapped.

I discuss part of that in:

commit 749492229e3bd6222dda7267b8244135229d1fd8
Author: David Hildenbrand <david@redhat.com>
Date:   Mon Mar 3 17:30:13 2025 +0100

    mm: stop maintaining the per-page mapcount of large folios
    (CONFIG_NO_PAGE_MAPCOUNT)

And how my changes there affect some system stats (e.g., "AnonPages",
"Mapped"). But the RSS stays unchanged and corresponds to what is
actually mapped into the process.

Doing something similar for the RSS would be extremely hard (single page
mapped into process -> account whole folio to RSS), because it's
per-folio-per-process information, not per-folio information.

So by mapping more in a single page fault, you end up increasing "RSS".
But I wouldn't call that "expected". I rather suspect that nobody will
really care :)

>> Also, how does fault_around_bytes interact here?
>
> The 'fault_around' is a bit tricky. Currently, 'fault_around' only
> applies to read faults (via do_read_fault()) and does not control write
> shared faults (via do_shared_fault()). Additionally, in the
> do_shared_fault() function, PMD-sized large folios are also not
> controlled by 'fault_around', so I just follow the handling of
> PMD-sized large folios.
>
>>> In order to support large mappings for tmpfs, besides checking VMA
>>> limits and PMD pagetable limits, it is also necessary to check if the
>>> linear page offset of the VMA is order-aligned within the file.
>>
>> Why?
>>
>> This only applies to PMD mappings. See below.
>
> I previously had the same question, but I saw the comments for the
> 'thp_vma_suitable_order' function, so I added the check here. If it's
> not necessary to check non-PMD-sized large folios, should we update the
> comments for 'thp_vma_suitable_order'?

I was not quite clear about PMD vs. !PMD.

The thing is, when you *allocate* a new folio, it must adhere at least
to pagecache alignment (e.g., cannot place an order-2 folio at pgoff 1)
-- that is what thp_vma_suitable_order() checks. Otherwise you cannot
add it to the pagecache.

But once you *obtain* a folio from the pagecache and are supposed to map
it into the page tables, that must already hold true.

So you should be able to just blindly map whatever is given to you here
AFAIKS.

If you would get a pagecache folio that violates the linear page offset
requirement at that point, something else would have messed up the
pagecache.

Or am I missing something?

-- 
Cheers,

David / dhildenb
* Re: [PATCH] mm: support large mapping building for tmpfs

From: Baolin Wang @ 2025-07-02 9:44 UTC (permalink / raw)
To: David Hildenbrand, akpm, hughd
Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
    baohua, vbabka, rppt, surenb, mhocko, linux-mm, linux-kernel

On 2025/7/2 16:45, David Hildenbrand wrote:
> Well, traditionally we only account what is actually mapped. If you
> MADV_DONTNEED part of the large folio, or only mmap() parts of it, the
> RSS would never cover the whole folio -- only what is mapped.
>
> I discuss part of that in:
>
> commit 749492229e3bd6222dda7267b8244135229d1fd8
> Author: David Hildenbrand <david@redhat.com>
> Date:   Mon Mar 3 17:30:13 2025 +0100
>
>     mm: stop maintaining the per-page mapcount of large folios
>     (CONFIG_NO_PAGE_MAPCOUNT)
>
> And how my changes there affect some system stats (e.g., "AnonPages",
> "Mapped"). But the RSS stays unchanged and corresponds to what is
> actually mapped into the process.
>
> Doing something similar for the RSS would be extremely hard (single page
> mapped into process -> account whole folio to RSS), because it's
> per-folio-per-process information, not per-folio information.

Thanks. Good to know this.

> So by mapping more in a single page fault, you end up increasing "RSS".
> But I wouldn't call that "expected". I rather suspect that nobody will
> really care :)

But tmpfs is a little special here. It uses the 'huge=' option to
control large folio allocation. So, I think users should know they want
to use large folios and build the whole mapping for the large folios.
That is why I call it 'expected'.

> The thing is, when you *allocate* a new folio, it must adhere at least
> to pagecache alignment (e.g., cannot place an order-2 folio at pgoff 1) --

Yes, agree.

> that is what thp_vma_suitable_order() checks. Otherwise you cannot add
> it to the pagecache.

But this alignment is not done by thp_vma_suitable_order().

For tmpfs, it will check the alignment in shmem_suitable_orders() via:
"
	if (!xa_find(&mapping->i_pages, &aligned_index,
		     aligned_index + pages - 1, XA_PRESENT))
"

For other fs systems, it will check the alignment in
__filemap_get_folio() via:
"
	/* If we're not aligned, allocate a smaller folio */
	if (index & ((1UL << order) - 1))
		order = __ffs(index);
"

> But once you *obtain* a folio from the pagecache and are supposed to
> map it into the page tables, that must already hold true.
>
> So you should be able to just blindly map whatever is given to you here
> AFAIKS.
>
> If you would get a pagecache folio that violates the linear page offset
> requirement at that point, something else would have messed up the
> pagecache.

Yes. But the comments from thp_vma_suitable_order() are not about the
pagecache alignment; they say "the order-aligned addresses in the VMA
map to order-aligned offsets within the file", which was originally used
to do alignment for PMD mapping. So I wonder if we need this restriction
for non-PMD-sized large folios?

"
 * - For file vma, check if the linear page offset of vma is
 *   order-aligned within the file.  The hugepage is
 *   guaranteed to be order-aligned within the file, but we must
 *   check that the order-aligned addresses in the VMA map to
 *   order-aligned offsets within the file, else the hugepage will
 *   not be mappable.
"
* Re: [PATCH] mm: support large mapping building for tmpfs

From: David Hildenbrand @ 2025-07-02 11:38 UTC (permalink / raw)
To: Baolin Wang, akpm, hughd
Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
    baohua, vbabka, rppt, surenb, mhocko, linux-mm, linux-kernel

>> So by mapping more in a single page fault, you end up increasing
>> "RSS". But I wouldn't call that "expected". I rather suspect that
>> nobody will really care :)
>
> But tmpfs is a little special here. It uses the 'huge=' option to
> control large folio allocation. So, I think users should know they want
> to use large folios and build the whole mapping for the large folios.
> That is why I call it 'expected'.

Well, if your distribution decides to set huge= on /tmp or something
like that, your application might have very little saying in that,
right? :)

Again, I assume it's fine, but we might find surprises on the way.

>> The thing is, when you *allocate* a new folio, it must adhere at least
>> to pagecache alignment (e.g., cannot place an order-2 folio at pgoff 1) --
>
> Yes, agree.
>
>> that is what thp_vma_suitable_order() checks. Otherwise you cannot add
>> it to the pagecache.
>
> But this alignment is not done by thp_vma_suitable_order().
>
> For tmpfs, it will check the alignment in shmem_suitable_orders() via:
> "
> 	if (!xa_find(&mapping->i_pages, &aligned_index,
> 		     aligned_index + pages - 1, XA_PRESENT))
> "

That's not really an alignment check, that's just checking whether a
suitable folio order spans already-present entries, no?

Finding suitable orders is still up to other code IIUC.

> For other fs systems, it will check the alignment in
> __filemap_get_folio() via:
> "
> 	/* If we're not aligned, allocate a smaller folio */
> 	if (index & ((1UL << order) - 1))
> 		order = __ffs(index);
> "
>
>> But once you *obtain* a folio from the pagecache and are supposed to
>> map it into the page tables, that must already hold true.
>>
>> So you should be able to just blindly map whatever is given to you
>> here AFAIKS.
>>
>> If you would get a pagecache folio that violates the linear page
>> offset requirement at that point, something else would have messed up
>> the pagecache.
>
> Yes. But the comments from thp_vma_suitable_order() are not about the
> pagecache alignment; they say "the order-aligned addresses in the VMA
> map to order-aligned offsets within the file",

Let's dig, it's confusing.

The code in question is:

	if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
			hpage_size >> PAGE_SHIFT))

So yes, I think this tells us: if we would have a PMD THP in the
pagecache, would we be able to map it with a PMD. If not, then don't
bother with allocating a PMD THP.

Of course, this also applies to other orders, but for PMD THPs it's
probably most relevant: if we cannot even map it through a PMD, then
probably it could be a wasted THP.

So yes, I agree: if we are both not missing something, then this is
primarily relevant for the PMD case.

And it's more about "optimization" than "correctness" I guess?

But when mapping a folio that is already in the pagecache, I assume this
is not required. Assume we have a 2 MiB THP in the pagecache. If someone
were to map it at virtual addr 1MiB, we could still map 1MiB worth of
PTEs into a single page table in one go, and not fall back to individual
PTEs.

-- 
Cheers,

David / dhildenb
* Re: [PATCH] mm: support large mapping building for tmpfs

From: David Hildenbrand @ 2025-07-02 11:55 UTC (permalink / raw)
To: Baolin Wang, akpm, hughd
Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
    baohua, vbabka, rppt, surenb, mhocko, linux-mm, linux-kernel

On 02.07.25 13:38, David Hildenbrand wrote:

[...]

> So yes, I agree: if we are both not missing something, then this is
> primarily relevant for the PMD case.
>
> And it's more about "optimization" than "correctness" I guess?

Correction: only if a caller doesn't assume that this is an implicit
pagecache alignment check. Not sure if that might be the case for shmem
when it calls thp_vma_suitable_order() with a VMA ...

-- 
Cheers,

David / dhildenb
* Re: [PATCH] mm: support large mapping building for tmpfs

From: Baolin Wang @ 2025-07-04 2:35 UTC (permalink / raw)
To: David Hildenbrand, akpm, hughd
Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
    baohua, vbabka, rppt, surenb, mhocko, linux-mm, linux-kernel

On 2025/7/2 19:55, David Hildenbrand wrote:
> On 02.07.25 13:38, David Hildenbrand wrote:
>
> [...]
>
>> So yes, I agree: if we are both not missing something, then this is
>> primarily relevant for the PMD case.
>>
>> And it's more about "optimization" than "correctness" I guess?
>
> Correction: only if a caller doesn't assume that this is an implicit
> pagecache alignment check. Not sure if that might be the case for shmem
> when it calls thp_vma_suitable_order() with a VMA ...

I am sure shmem will not use thp_vma_suitable_order() for pagecache
alignment checks, because shmem has explicit code for pagecache
alignment checks. Adding thp_vma_suitable_order() in shmem is more about
following the allocation logic of anonymous pages.

I also think the 'IS_ALIGNED()' check in thp_vma_suitable_order() for
shmem is more about 'optimization' rather than 'correctness'.

Anyway, I will take another look at shmem's checking logic.
* Re: [PATCH] mm: support large mapping building for tmpfs
  2025-07-02 11:38 ` David Hildenbrand
  2025-07-02 11:55   ` David Hildenbrand
@ 2025-07-04  2:04 ` Baolin Wang
  1 sibling, 0 replies; 9+ messages in thread
From: Baolin Wang @ 2025-07-04  2:04 UTC (permalink / raw)
  To: David Hildenbrand, akpm, hughd
  Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts,
	dev.jain, baohua, vbabka, rppt, surenb, mhocko, linux-mm, linux-kernel

On 2025/7/2 19:38, David Hildenbrand wrote:
>
>>> So by mapping more in a single page fault, you end up increasing
>>> "RSS". But I wouldn't call that "expected". I rather suspect that
>>> nobody will really care :)
>>
>> But tmpfs is a little special here. It uses the 'huge=' option to
>> control large folio allocation. So, I think users should know they
>> want to use large folios and build the whole mapping for the large
>> folios. That is why I call it 'expected'.
>
> Well, if your distribution decides to set huge= on /tmp or something
> like that, your application might have very little say in that,
> right? :)
>
> Again, I assume it's fine, but we might find surprises on the way.
>
>>> The thing is, when you *allocate* a new folio, it must adhere at
>>> least to pagecache alignment (e.g., cannot place an order-2 folio
>>> at pgoff 1) --
>>
>> Yes, agree.
>>
>>> that is what thp_vma_suitable_order() checks. Otherwise you cannot
>>> add it to the pagecache.
>>
>> But this alignment is not done by thp_vma_suitable_order().
>>
>> For tmpfs, it will check the alignment in shmem_suitable_orders() via:
>> "
>> if (!xa_find(&mapping->i_pages, &aligned_index,
>>              aligned_index + pages - 1, XA_PRESENT))
>> "
>
> That's not really an alignment check, that's just checking whether a
> suitable folio order spans already-present entries, no?

Because 'aligned_index' already did round_down() before checking if
it's suitable:
"
pages = 1UL << order;
aligned_index = round_down(index, pages);
"
So it's still considered an implicit alignment check.

> Finding suitable orders is still up to other code IIUC.
>
>> For other fs systems, it will check the alignment in
>> __filemap_get_folio() via:
>> "
>> /* If we're not aligned, allocate a smaller folio */
>> if (index & ((1UL << order) - 1))
>>         order = __ffs(index);
>> "
>>
>>> But once you *obtain* a folio from the pagecache and are supposed
>>> to map it into the page tables, that must already hold true.
>>>
>>> So you should be able to just blindly map whatever is given to you
>>> here AFAIKS.
>>>
>>> If you would get a pagecache folio that violates the linear page
>>> offset requirement at that point, something else would have messed
>>> up the pagecache.
>>
>> Yes. But the comment in thp_vma_suitable_order() is not about
>> pagecache alignment; it says "the order-aligned addresses in the VMA
>> map to order-aligned offsets within the file".
>
> Let's dig, it's confusing.
>
> The code in question is:
>
> if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
>                 hpage_size >> PAGE_SHIFT))
>
> So yes, I think this tells us: if we had a PMD THP in the pagecache,
> would we be able to map it with a PMD? If not, then don't bother
> allocating a PMD THP.
>
> Of course, this also applies to other orders, but for PMD THPs it's
> probably most relevant: if we cannot even map it through a PMD, then
> it could be a wasted THP.
>
> So yes, I agree: if we are both not missing something, then this is
> primarily relevant for the PMD case.
>
> And it's more about "optimization" than "correctness", I guess?
>
> But when mapping a folio that is already in the pagecache, I assume
> this is not required.
>
> Assume we have a 2 MiB THP in the pagecache.
>
> If someone were to map it at virtual addr 1 MiB, we could still map
> 1 MiB worth of PTEs into a single page table in one go, and not fall
> back to individual PTEs.

That's how I understand the code as well, I just wasn't very sure
before. Thanks for your explanation, which clarified my doubts. I will
drop the check in the next version.
end of thread, other threads: [~2025-07-04  2:35 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-01  8:40 [PATCH] mm: support large mapping building for tmpfs Baolin Wang
2025-07-01 13:08 ` David Hildenbrand
2025-07-02  2:03   ` Baolin Wang
2025-07-02  8:45     ` David Hildenbrand
2025-07-02  9:44       ` Baolin Wang
2025-07-02 11:38         ` David Hildenbrand
2025-07-02 11:55           ` David Hildenbrand
2025-07-04  2:35             ` Baolin Wang
2025-07-04  2:04           ` Baolin Wang