* [PATCH] mm: support large mapping building for tmpfs
@ 2025-07-01  8:40 Baolin Wang
  2025-07-01 13:08 ` David Hildenbrand
  0 siblings, 1 reply; 9+ messages in thread

From: Baolin Wang @ 2025-07-01 8:40 UTC (permalink / raw)
To: akpm, hughd, david
Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
    baohua, vbabka, rppt, surenb, mhocko, baolin.wang, linux-mm, linux-kernel

After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
tmpfs can also support large folio allocation (not just PMD-sized large
folios).

However, when accessing tmpfs via mmap(), although tmpfs supports large folios,
we still establish mappings at the base page granularity, which is unreasonable.

We can establish large mappings according to the size of the large folio. On one
hand, this can reduce the overhead of page faults; on the other hand, it can
leverage hardware architecture optimizations to reduce TLB misses, such as
contiguous PTEs on the ARM architecture.

Moreover, since the user has already added the 'huge=' option when mounting tmpfs
to allow for large folio allocation, establishing large folios' mapping is expected
and will not surprise users by inflating the RSS of the process.

In order to support large mappings for tmpfs, besides checking VMA limits and
PMD pagetable limits, it is also necessary to check if the linear page offset
of the VMA is order-aligned within the file.

Performance test:
I created a 1G tmpfs file, populated with 64K large folios, and accessed it
sequentially via mmap(). I observed a significant performance improvement:

Before the patch:
real	0m0.214s
user	0m0.012s
sys	0m0.203s

After the patch:
real	0m0.025s
user	0m0.000s
sys	0m0.024s

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/memory.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0f9b32a20e5b..6385a9385a9b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5383,10 +5383,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 
 	/*
 	 * Using per-page fault to maintain the uffd semantics, and same
-	 * approach also applies to non-anonymous-shmem faults to avoid
+	 * approach also applies to non shmem/tmpfs faults to avoid
 	 * inflating the RSS of the process.
 	 */
-	if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
+	if (!vma_is_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
 	    unlikely(needs_fallback)) {
 		nr_pages = 1;
 	} else if (nr_pages > 1) {
@@ -5395,15 +5395,20 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 		pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
 		/* The index of the entry in the pagetable for fault page. */
 		pgoff_t pte_off = pte_index(vmf->address);
+		unsigned long hpage_size = PAGE_SIZE << folio_order(folio);
 
 		/*
 		 * Fallback to per-page fault in case the folio size in page
-		 * cache beyond the VMA limits and PMD pagetable limits.
+		 * cache beyond the VMA limits or PMD pagetable limits. And
+		 * also check if the linear page offset of vma is order-aligned
+		 * within the file for tmpfs.
 		 */
 		if (unlikely(vma_off < idx ||
 			     vma_off + (nr_pages - idx) > vma_pages(vma) ||
 			     pte_off < idx ||
-			     pte_off + (nr_pages - idx) > PTRS_PER_PTE)) {
+			     pte_off + (nr_pages - idx) > PTRS_PER_PTE) ||
+		    !IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
+				hpage_size >> PAGE_SHIFT)) {
 			nr_pages = 1;
 		} else {
 			/* Now we can set mappings for the whole large folio. */
-- 
2.43.5

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: [PATCH] mm: support large mapping building for tmpfs

From: David Hildenbrand @ 2025-07-01 13:08 UTC (permalink / raw)
To: Baolin Wang, akpm, hughd
Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
    baohua, vbabka, rppt, surenb, mhocko, linux-mm, linux-kernel

On 01.07.25 10:40, Baolin Wang wrote:

Nit: talking about "large mappings" is confusing. Did you actually mean:

"mm: fault in complete folios instead of individual pages for tmpfs"

I suggest not talking about "large mappings" anywhere in this patch
description, and instead talking about mapping multiple consecutive pages
of a tmpfs folio at once instead.

> After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
> tmpfs can also support large folio allocation (not just PMD-sized large
> folios).
>
> However, when accessing tmpfs via mmap(), although tmpfs supports large folios,
> we still establish mappings at the base page granularity, which is unreasonable.
>
> We can establish large mappings according to the size of the large folio. On one
> hand, this can reduce the overhead of page faults; on the other hand, it can
> leverage hardware architecture optimizations to reduce TLB misses, such as
> contiguous PTEs on the ARM architecture.

The latter would still apply if faulting in each individual page I guess.
cont-pte will try to auto-optimize IIRC.

> Moreover, since the user has already added the 'huge=' option when mounting tmpfs
> to allow for large folio allocation, establishing large folios' mapping is expected
> and will not surprise users by inflating the RSS of the process.

Hm, are we sure about that?

Also, how does fault_around_bytes interact here?

> In order to support large mappings for tmpfs, besides checking VMA limits and
> PMD pagetable limits, it is also necessary to check if the linear page offset
> of the VMA is order-aligned within the file.

Why?

This only applies to PMD mappings. See below.

[...]

> @@ -5395,15 +5395,20 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>  		pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
>  		/* The index of the entry in the pagetable for fault page. */
>  		pgoff_t pte_off = pte_index(vmf->address);
> +		unsigned long hpage_size = PAGE_SIZE << folio_order(folio);
>
>  		/*
>  		 * Fallback to per-page fault in case the folio size in page
> -		 * cache beyond the VMA limits and PMD pagetable limits.
> +		 * cache beyond the VMA limits or PMD pagetable limits. And
> +		 * also check if the linear page offset of vma is order-aligned
> +		 * within the file for tmpfs.
>  		 */
>  		if (unlikely(vma_off < idx ||
>  			     vma_off + (nr_pages - idx) > vma_pages(vma) ||
>  			     pte_off < idx ||
> -			     pte_off + (nr_pages - idx) > PTRS_PER_PTE)) {
> +			     pte_off + (nr_pages - idx) > PTRS_PER_PTE) ||
> +		    !IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
> +				hpage_size >> PAGE_SHIFT)) {

Again, why?

Shouldn't set_pte_range() just do the right thing? set_ptes() doesn't
have any such restriction.

Also see the arm64 variant where we call

	contpte_set_ptes(mm, addr, ptep, pte, nr);

There, I think we perform checks whether we can set the cont-pte bit IIUC.

	if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
		pte = pte_mkcont(pte);
	else
		pte = pte_mknoncont(pte);

-- 
Cheers,

David / dhildenb
* Re: [PATCH] mm: support large mapping building for tmpfs

From: Baolin Wang @ 2025-07-02 2:03 UTC (permalink / raw)
To: David Hildenbrand, akpm, hughd
Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
    baohua, vbabka, rppt, surenb, mhocko, linux-mm, linux-kernel

On 2025/7/1 21:08, David Hildenbrand wrote:
> Nit: talking about "large mappings" is confusing. Did you actually mean:
>
> "mm: fault in complete folios instead of individual pages for tmpfs"
>
> I suggest not talking about "large mappings" anywhere in this patch
> description, and instead talking about mapping multiple consecutive
> pages of a tmpfs folio at once instead.

OK.

>> We can establish large mappings according to the size of the large
>> folio. On one hand, this can reduce the overhead of page faults; on
>> the other hand, it can leverage hardware architecture optimizations
>> to reduce TLB misses, such as contiguous PTEs on the ARM architecture.
>
> The latter would still apply if faulting in each individual page I
> guess. cont-pte will try to auto-optimize IIRC.

Yes, but need more CPU cycles.

>> Moreover, since the user has already added the 'huge=' option when
>> mounting tmpfs to allow for large folio allocation, establishing
>> large folios' mapping is expected and will not surprise users by
>> inflating the RSS of the process.
>
> Hm, are we sure about that?

IMO, referring to the definition of RSS:
"resident set size (RSS) is the portion of memory (measured in
kilobytes) occupied by a process that is held in main memory (RAM)."

Seems we should report the whole large folio already in file to users.
Moreover, the tmpfs mount already adds the 'huge=always (or within)'
option to allocate large folios, so the increase in RSS seems also
expected?

> Also, how does fault_around_bytes interact here?

The 'fault_around' is a bit tricky. Currently, 'fault_around' only
applies to read faults (via do_read_fault()) and does not control write
shared faults (via do_shared_fault()). Additionally, in the
do_shared_fault() function, PMD-sized large folios are also not
controlled by 'fault_around', so I just follow the handling of PMD-sized
large folios.

>> In order to support large mappings for tmpfs, besides checking VMA
>> limits and PMD pagetable limits, it is also necessary to check if the
>> linear page offset of the VMA is order-aligned within the file.
>
> Why?
>
> This only applies to PMD mappings. See below.

I previously had the same question, but I saw the comments for the
'thp_vma_suitable_order' function, so I added the check here. If it's
not necessary to check non-PMD-sized large folios, should we update the
comments for 'thp_vma_suitable_order'?

[...]
* Re: [PATCH] mm: support large mapping building for tmpfs

From: David Hildenbrand @ 2025-07-02 8:45 UTC (permalink / raw)
To: Baolin Wang, akpm, hughd
Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
    baohua, vbabka, rppt, surenb, mhocko, linux-mm, linux-kernel

>> Hm, are we sure about that?
>
> IMO, referring to the definition of RSS:
> "resident set size (RSS) is the portion of memory (measured in
> kilobytes) occupied by a process that is held in main memory (RAM)."
>
> Seems we should report the whole large folio already in file to users.
> Moreover, the tmpfs mount already adds the 'huge=always (or within)'
> option to allocate large folios, so the increase in RSS seems also
> expected?

Well, traditionally we only account what is actually mapped. If you
MADV_DONTNEED part of the large folio, or only mmap() parts of it, the
RSS would never cover the whole folio -- only what is mapped.

I discuss part of that in:

commit 749492229e3bd6222dda7267b8244135229d1fd8
Author: David Hildenbrand <david@redhat.com>
Date:   Mon Mar 3 17:30:13 2025 +0100

    mm: stop maintaining the per-page mapcount of large folios
    (CONFIG_NO_PAGE_MAPCOUNT)

And how my changes there affect some system stats (e.g., "AnonPages",
"Mapped"). But the RSS stays unchanged and corresponds to what is
actually mapped into the process.

Doing something similar for the RSS would be extremely hard (single page
mapped into process -> account whole folio to RSS), because it's
per-folio-per-process information, not per-folio information.

So by mapping more in a single page fault, you end up increasing "RSS".
But I wouldn't call that "expected". I rather suspect that nobody will
really care :)

>> Also, how does fault_around_bytes interact here?
>
> The 'fault_around' is a bit tricky. Currently, 'fault_around' only
> applies to read faults (via do_read_fault()) and does not control write
> shared faults (via do_shared_fault()). Additionally, in the
> do_shared_fault() function, PMD-sized large folios are also not
> controlled by 'fault_around', so I just follow the handling of
> PMD-sized large folios.
>
>>> In order to support large mappings for tmpfs, besides checking VMA
>>> limits and PMD pagetable limits, it is also necessary to check if the
>>> linear page offset of the VMA is order-aligned within the file.
>>
>> Why?
>>
>> This only applies to PMD mappings. See below.
>
> I previously had the same question, but I saw the comments for the
> 'thp_vma_suitable_order' function, so I added the check here. If it's
> not necessary to check non-PMD-sized large folios, should we update the
> comments for 'thp_vma_suitable_order'?

I was not quite clear about PMD vs. !PMD.

The thing is, when you *allocate* a new folio, it must adhere at least
to pagecache alignment (e.g., cannot place an order-2 folio at pgoff 1)
-- that is what thp_vma_suitable_order() checks. Otherwise you cannot
add it to the pagecache.

But once you *obtain* a folio from the pagecache and are supposed to map
it into the page tables, that must already hold true.

So you should be able to just blindly map whatever is given to you here
AFAIKS.

If you would get a pagecache folio that violates the linear page offset
requirement at that point, something else would have messed up the
pagecache.

Or am I missing something?

-- 
Cheers,

David / dhildenb
* Re: [PATCH] mm: support large mapping building for tmpfs

From: Baolin Wang @ 2025-07-02 9:44 UTC (permalink / raw)
To: David Hildenbrand, akpm, hughd
Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
    baohua, vbabka, rppt, surenb, mhocko, linux-mm, linux-kernel

On 2025/7/2 16:45, David Hildenbrand wrote:
> Well, traditionally we only account what is actually mapped. If you
> MADV_DONTNEED part of the large folio, or only mmap() parts of it, the
> RSS would never cover the whole folio -- only what is mapped.
>
> I discuss part of that in:
>
> commit 749492229e3bd6222dda7267b8244135229d1fd8
> Author: David Hildenbrand <david@redhat.com>
> Date:   Mon Mar 3 17:30:13 2025 +0100
>
>     mm: stop maintaining the per-page mapcount of large folios
>     (CONFIG_NO_PAGE_MAPCOUNT)
>
> And how my changes there affect some system stats (e.g., "AnonPages",
> "Mapped"). But the RSS stays unchanged and corresponds to what is
> actually mapped into the process.
>
> Doing something similar for the RSS would be extremely hard (single page
> mapped into process -> account whole folio to RSS), because it's
> per-folio-per-process information, not per-folio information.

Thanks. Good to know this.

> So by mapping more in a single page fault, you end up increasing "RSS".
> But I wouldn't call that "expected". I rather suspect that nobody will
> really care :)

But tmpfs is a little special here. It uses the 'huge=' option to
control large folio allocation. So, I think users should know they want
to use large folios and build the whole mapping for the large folios.
That is why I call it 'expected'.

> The thing is, when you *allocate* a new folio, it must adhere at least
> to pagecache alignment (e.g., cannot place an order-2 folio at pgoff 1) --

Yes, agree.

> that is what thp_vma_suitable_order() checks. Otherwise you cannot add
> it to the pagecache.

But this alignment is not done by thp_vma_suitable_order().

For tmpfs, it will check the alignment in shmem_suitable_orders() via:
"
	if (!xa_find(&mapping->i_pages, &aligned_index,
		     aligned_index + pages - 1, XA_PRESENT))
"

For other fs systems, it will check the alignment in
__filemap_get_folio() via:
"
	/* If we're not aligned, allocate a smaller folio */
	if (index & ((1UL << order) - 1))
		order = __ffs(index);
"

> But once you *obtain* a folio from the pagecache and are supposed to
> map it into the page tables, that must already hold true.
>
> So you should be able to just blindly map whatever is given to you here
> AFAIKS.
>
> If you would get a pagecache folio that violates the linear page offset
> requirement at that point, something else would have messed up the
> pagecache.

Yes. But the comments from thp_vma_suitable_order() are not about the
pagecache alignment; they say "the order-aligned addresses in the VMA
map to order-aligned offsets within the file", which was originally used
to do alignment for PMD mapping. So I wonder if we need this restriction
for non-PMD-sized large folios?

"
 * - For file vma, check if the linear page offset of vma is
 *   order-aligned within the file.  The hugepage is
 *   guaranteed to be order-aligned within the file, but we must
 *   check that the order-aligned addresses in the VMA map to
 *   order-aligned offsets within the file, else the hugepage will
 *   not be mappable.
"
* Re: [PATCH] mm: support large mapping building for tmpfs

From: David Hildenbrand @ 2025-07-02 11:38 UTC (permalink / raw)
To: Baolin Wang, akpm, hughd
Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
    baohua, vbabka, rppt, surenb, mhocko, linux-mm, linux-kernel

>> So by mapping more in a single page fault, you end up increasing
>> "RSS". But I wouldn't call that "expected". I rather suspect that
>> nobody will really care :)
>
> But tmpfs is a little special here. It uses the 'huge=' option to
> control large folio allocation. So, I think users should know they want
> to use large folios and build the whole mapping for the large folios.
> That is why I call it 'expected'.

Well, if your distribution decides to set huge= on /tmp or something
like that, your application might have very little saying in that,
right? :)

Again, I assume it's fine, but we might find surprises on the way.

>> The thing is, when you *allocate* a new folio, it must adhere at least
>> to pagecache alignment (e.g., cannot place an order-2 folio at pgoff 1) --
>
> Yes, agree.
>
>> that is what thp_vma_suitable_order() checks. Otherwise you cannot add
>> it to the pagecache.
>
> But this alignment is not done by thp_vma_suitable_order().
>
> For tmpfs, it will check the alignment in shmem_suitable_orders() via:
> "
> 	if (!xa_find(&mapping->i_pages, &aligned_index,
> 		     aligned_index + pages - 1, XA_PRESENT))
> "

That's not really an alignment check, that's just checking whether a
suitable folio order spans already-present entries, no?

Finding suitable orders is still up to other code IIUC.

> For other fs systems, it will check the alignment in
> __filemap_get_folio() via:
> "
> 	/* If we're not aligned, allocate a smaller folio */
> 	if (index & ((1UL << order) - 1))
> 		order = __ffs(index);
> "
>
>> But once you *obtain* a folio from the pagecache and are supposed to
>> map it into the page tables, that must already hold true.
>>
>> So you should be able to just blindly map whatever is given to you
>> here AFAIKS.
>>
>> If you would get a pagecache folio that violates the linear page
>> offset requirement at that point, something else would have messed up
>> the pagecache.
>
> Yes. But the comments from thp_vma_suitable_order() are not about the
> pagecache alignment; they say "the order-aligned addresses in the VMA
> map to order-aligned offsets within the file",

Let's dig, it's confusing.

The code in question is:

	if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
			hpage_size >> PAGE_SHIFT))

So yes, I think this tells us: if we would have a PMD THP in the
pagecache, would we be able to map it with a PMD. If not, then don't
bother with allocating a PMD THP.

Of course, this also applies to other orders, but for PMD THPs it's
probably most relevant: if we cannot even map it through a PMD, then
probably it could be a wasted THP.

So yes, I agree: if we are both not missing something, then this is
primarily relevant for the PMD case.

And it's more about "optimization" than "correctness" I guess?

But when mapping a folio that is already in the pagecache, I assume this
is not required. Assume we have a 2 MiB THP in the pagecache. If someone
were to map it at virtual addr 1MiB, we could still map 1MiB worth of
PTEs into a single page table in one go, and not fall back to individual
PTEs.

-- 
Cheers,

David / dhildenb
* Re: [PATCH] mm: support large mapping building for tmpfs

From: David Hildenbrand @ 2025-07-02 11:55 UTC (permalink / raw)
To: Baolin Wang, akpm, hughd
Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
    baohua, vbabka, rppt, surenb, mhocko, linux-mm, linux-kernel

On 02.07.25 13:38, David Hildenbrand wrote:

[...]

> So yes, I agree: if we are both not missing something, then this is
> primarily relevant for the PMD case.
>
> And it's more about "optimization" than "correctness" I guess?

Correction: only if a caller doesn't assume that this is an implicit
pagecache alignment check. Not sure if that might be the case for shmem
when it calls thp_vma_suitable_order() with a VMA ...

-- 
Cheers,

David / dhildenb
* Re: [PATCH] mm: support large mapping building for tmpfs

From: Baolin Wang @ 2025-07-04 2:35 UTC (permalink / raw)
To: David Hildenbrand, akpm, hughd
Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
    baohua, vbabka, rppt, surenb, mhocko, linux-mm, linux-kernel

On 2025/7/2 19:55, David Hildenbrand wrote:
> On 02.07.25 13:38, David Hildenbrand wrote:
>
> [...]
>
>> So yes, I agree: if we are both not missing something, then this is
>> primarily relevant for the PMD case.
>>
>> And it's more about "optimization" than "correctness" I guess?
>
> Correction: only if a caller doesn't assume that this is an implicit
> pagecache alignment check. Not sure if that might be the case for shmem
> when it calls thp_vma_suitable_order() with a VMA ...

I am sure shmem will not use thp_vma_suitable_order() for pagecache
alignment checks, because shmem has explicit code for pagecache
alignment checks. Adding thp_vma_suitable_order() in shmem is more about
following the allocation logic of anonymous pages.

I also think the 'IS_ALIGNED()' check in thp_vma_suitable_order() for
shmem is more about 'optimization' rather than 'correctness'.

Anyway, I will take another look at shmem's checking logic.
* Re: [PATCH] mm: support large mapping building for tmpfs
  2025-07-02 11:38 ` David Hildenbrand
  2025-07-02 11:55   ` David Hildenbrand
@ 2025-07-04  2:04 ` Baolin Wang
  1 sibling, 0 replies; 9+ messages in thread
From: Baolin Wang @ 2025-07-04  2:04 UTC (permalink / raw)
  To: David Hildenbrand, akpm, hughd
  Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts,
	dev.jain, baohua, vbabka, rppt, surenb, mhocko, linux-mm, linux-kernel

On 2025/7/2 19:38, David Hildenbrand wrote:
>
>>> So by mapping more in a single page fault, you end up increasing
>>> "RSS". But I wouldn't call that "expected". I rather suspect that
>>> nobody will really care :)
>>
>> But tmpfs is a little special here. It uses the 'huge=' option to
>> control large folio allocation. So, I think users should know they
>> want to use large folios and build the whole mapping for the large
>> folios. That is why I call it 'expected'.
>
> Well, if your distribution decides to set huge= on /tmp or something
> like that, your application might have very little say in that,
> right? :)
>
> Again, I assume it's fine, but we might find surprises on the way.
>
>>> The thing is, when you *allocate* a new folio, it must adhere at
>>> least to pagecache alignment (e.g., cannot place an order-2 folio
>>> at pgoff 1) --
>>
>> Yes, agree.
>>
>>> that is what thp_vma_suitable_order() checks. Otherwise you cannot
>>> add it to the pagecache.
>>
>> But this alignment is not done by thp_vma_suitable_order().
>>
>> For tmpfs, it will check the alignment in shmem_suitable_orders() via:
>> "
>> if (!xa_find(&mapping->i_pages, &aligned_index,
>>              aligned_index + pages - 1, XA_PRESENT))
>> "
>
> That's not really an alignment check, that's just checking whether a
> suitable folio order spans already-present entries, no?

Because 'aligned_index' already did round_down() before checking if
it's suitable:
"
pages = 1UL << order;
aligned_index = round_down(index, pages);
"
So it's still considered an implicit alignment check.

> Finding suitable orders is still up to other code IIUC.
>
>> For other fs systems, it will check the alignment in
>> __filemap_get_folio() via:
>> "
>> /* If we're not aligned, allocate a smaller folio */
>> if (index & ((1UL << order) - 1))
>>         order = __ffs(index);
>> "
>>
>>> But once you *obtain* a folio from the pagecache and are supposed
>>> to map it into the page tables, that must already hold true.
>>>
>>> So you should be able to just blindly map whatever is given to you
>>> here AFAIKS.
>>>
>>> If you would get a pagecache folio that violates the linear page
>>> offset requirement at that point, something else would have messed
>>> up the pagecache.
>>
>> Yes. But the comment in thp_vma_suitable_order() is not about
>> pagecache alignment; it says "the order-aligned addresses in the VMA
>> map to order-aligned offsets within the file".
>
> Let's dig, it's confusing.
>
> The code in question is:
>
> if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
>                 hpage_size >> PAGE_SHIFT))
>
> So yes, I think this tells us: if we had a PMD THP in the pagecache,
> would we be able to map it with a PMD? If not, then don't bother
> allocating a PMD THP.
>
> Of course, this also applies to other orders, but for PMD THPs it's
> probably most relevant: if we cannot even map it through a PMD, then
> it could be a wasted THP.
>
> So yes, I agree: if we are both not missing something, then this is
> primarily relevant for the PMD case.
>
> And it's more about "optimization" than "correctness", I guess?
>
> But when mapping a folio that is already in the pagecache, I assume
> this is not required.
>
> Assume we have a 2 MiB THP in the pagecache.
>
> If someone were to map it at virtual addr 1 MiB, we could still map
> 1 MiB worth of PTEs into a single page table in one go, and not fall
> back to individual PTEs.

That's how I understand the code as well, I just wasn't very sure
before. Thanks for your explanation, which clarified my doubts. I will
drop the check in the next version.
end of thread, other threads: [~2025-07-04  2:35 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-01  8:40 [PATCH] mm: support large mapping building for tmpfs Baolin Wang
2025-07-01 13:08 ` David Hildenbrand
2025-07-02  2:03   ` Baolin Wang
2025-07-02  8:45     ` David Hildenbrand
2025-07-02  9:44       ` Baolin Wang
2025-07-02 11:38         ` David Hildenbrand
2025-07-02 11:55           ` David Hildenbrand
2025-07-04  2:35             ` Baolin Wang
2025-07-04  2:04           ` Baolin Wang