[PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP
@ 2024-01-29 14:32 David Hildenbrand
  2024-01-29 14:32 ` [PATCH v1 1/9] mm/memory: factor out zapping of present pte into zap_present_pte() David Hildenbrand
                   ` (9 more replies)
  0 siblings, 10 replies; 40+ messages in thread
From: David Hildenbrand @ 2024-01-29 14:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Ryan Roberts, Catalin Marinas, Will Deacon, Aneesh Kumar K.V,
	Nick Piggin, Peter Zijlstra, Michael Ellerman, Christophe Leroy,
	Naveen N. Rao, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Arnd Bergmann, linux-arch,
	linuxppc-dev, linux-s390

This series is based on [1] and must be applied on top of it.
Similar to what we did with fork(), let's implement PTE batching
during unmap/zap when processing PTE-mapped THPs.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
entry removal once per batch.

Ryan was previously working on this in the context of cont-pte for
arm64, int latest iteration [2] with a focus on arm6 with cont-pte only.
This series implements the optimization for all architectures, independent
of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
large-folio-pages batches as well, and amkes use of our new rmap batching
function when removing the rmap.

To achieve that, we have to enlighten MMU gather / page freeing code
(i.e., everything that consumes encoded_page) to process unmapping
of consecutive pages that all belong to the same large folio. I'm being
very careful to not degrade order-0 performance, and it looks like I
managed to achieve that.

While this series should -- similar to [1] -- be beneficial for adding
cont-pte support on arm64[2], it's one of the requirements for maintaining
a total mapcount[3] for large folios with minimal added overhead and
further changes[4] that build up on top of the total mapcount.

Independent of all that, this series results in a speedup during munmap()
and similar unmapping (process teardown, MADV_DONTNEED on larger ranges)
with PTE-mapped THP, which is the default with THPs that are smaller than
a PMD (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).

On an Intel Xeon Silver 4210R CPU, munmap'ing a 1GiB VMA backed by
PTE-mapped folios of the same size (stddev < 1%) results in the following
runtimes for munmap() in seconds (shorter is better):

Folio Size | mm-unstable |      New | Change
---------------------------------------------
      4KiB |    0.058110 | 0.057715 |   - 1%
     16KiB |    0.044198 | 0.035469 |   -20%
     32KiB |    0.034216 | 0.023522 |   -31%
     64KiB |    0.029207 | 0.018434 |   -37%
    128KiB |    0.026579 | 0.014026 |   -47%
    256KiB |    0.025130 | 0.011756 |   -53%
    512KiB |    0.024292 | 0.010703 |   -56%
   1024KiB |    0.023812 | 0.010294 |   -57%
   2048KiB |    0.023785 | 0.009910 |   -58%

CCing especially s390x folks, because they have a tlb freeing hooks that
needs adjustment. Only tested on x86-64 for now, will have to do some more
stress testing. Compile-tested on most other architectures. The PPC
change is negleglible and makes my cross-compiler happy.

[1] https://lkml.kernel.org/r/20240129124649.189745-1-david@redhat.com
[2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.roberts@arm.com
[3] https://lkml.kernel.org/r/20230809083256.699513-1-david@redhat.com
[4] https://lkml.kernel.org/r/20231124132626.235350-1-david@redhat.com
[5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.roberts@arm.com

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-arch@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org

David Hildenbrand (9):
  mm/memory: factor out zapping of present pte into zap_present_pte()
  mm/memory: handle !page case in zap_present_pte() separately
  mm/memory: further separate anon and pagecache folio handling in
    zap_present_pte()
  mm/memory: factor out zapping folio pte into zap_present_folio_pte()
  mm/mmu_gather: pass "delay_rmap" instead of encoded page to
    __tlb_remove_page_size()
  mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP
  mm/mmu_gather: add __tlb_remove_folio_pages()
  mm/mmu_gather: add tlb_remove_tlb_entries()
  mm/memory: optimize unmap/zap with PTE-mapped THP

 arch/powerpc/include/asm/tlb.h |   2 +
 arch/s390/include/asm/tlb.h    |  30 ++++--
 include/asm-generic/tlb.h      |  40 ++++++--
 include/linux/mm_types.h       |  37 ++++++--
 include/linux/pgtable.h        |  66 +++++++++++++
 mm/memory.c                    | 167 +++++++++++++++++++++++----------
 mm/mmu_gather.c                |  63 +++++++++++--
 mm/swap.c                      |  12 ++-
 mm/swap_state.c                |  12 ++-
 9 files changed, 347 insertions(+), 82 deletions(-)

-- 
2.43.0

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v1 1/9] mm/memory: factor out zapping of present pte into zap_present_pte()
  2024-01-29 14:32 [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP David Hildenbrand
@ 2024-01-29 14:32 ` David Hildenbrand
  2024-01-30  8:13   ` Ryan Roberts
  2024-01-29 14:32 ` [PATCH v1 2/9] mm/memory: handle !page case in zap_present_pte() separately David Hildenbrand
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2024-01-29 14:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Ryan Roberts, Catalin Marinas, Will Deacon, Aneesh Kumar K.V,
	Nick Piggin, Peter Zijlstra, Michael Ellerman, Christophe Leroy,
	Naveen N. Rao, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Arnd Bergmann, linux-arch,
	linuxppc-dev, linux-s390

Let's prepare for further changes by factoring out processing of present
PTEs.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 92 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 52 insertions(+), 40 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b05fd28dbce1..50a6c79c78fc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1532,13 +1532,61 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
 	pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
 }
 
+static inline void zap_present_pte(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
+		unsigned long addr, struct zap_details *details,
+		int *rss, bool *force_flush, bool *force_break)
+{
+	struct mm_struct *mm = tlb->mm;
+	bool delay_rmap = false;
+	struct folio *folio;
+	struct page *page;
+
+	page = vm_normal_page(vma, addr, ptent);
+	if (page)
+		folio = page_folio(page);
+
+	if (unlikely(!should_zap_folio(details, folio)))
+		return;
+	ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+	arch_check_zapped_pte(vma, ptent);
+	tlb_remove_tlb_entry(tlb, pte, addr);
+	zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
+	if (unlikely(!page)) {
+		ksm_might_unmap_zero_page(mm, ptent);
+		return;
+	}
+
+	if (!folio_test_anon(folio)) {
+		if (pte_dirty(ptent)) {
+			folio_mark_dirty(folio);
+			if (tlb_delay_rmap(tlb)) {
+				delay_rmap = true;
+				*force_flush = true;
+			}
+		}
+		if (pte_young(ptent) && likely(vma_has_recency(vma)))
+			folio_mark_accessed(folio);
+	}
+	rss[mm_counter(folio)]--;
+	if (!delay_rmap) {
+		folio_remove_rmap_pte(folio, page, vma);
+		if (unlikely(page_mapcount(page) < 0))
+			print_bad_pte(vma, addr, ptent, page);
+	}
+	if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
+		*force_flush = true;
+		*force_break = true;
+	}
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
 				struct zap_details *details)
 {
+	bool force_flush = false, force_break = false;
 	struct mm_struct *mm = tlb->mm;
-	int force_flush = 0;
 	int rss[NR_MM_COUNTERS];
 	spinlock_t *ptl;
 	pte_t *start_pte;
@@ -1565,45 +1613,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			break;
 
 		if (pte_present(ptent)) {
-			unsigned int delay_rmap;
-
-			page = vm_normal_page(vma, addr, ptent);
-			if (page)
-				folio = page_folio(page);
-
-			if (unlikely(!should_zap_folio(details, folio)))
-				continue;
-			ptent = ptep_get_and_clear_full(mm, addr, pte,
-							tlb->fullmm);
-			arch_check_zapped_pte(vma, ptent);
-			tlb_remove_tlb_entry(tlb, pte, addr);
-			zap_install_uffd_wp_if_needed(vma, addr, pte, details,
-						      ptent);
-			if (unlikely(!page)) {
-				ksm_might_unmap_zero_page(mm, ptent);
-				continue;
-			}
-
-			delay_rmap = 0;
-			if (!folio_test_anon(folio)) {
-				if (pte_dirty(ptent)) {
-					folio_mark_dirty(folio);
-					if (tlb_delay_rmap(tlb)) {
-						delay_rmap = 1;
-						force_flush = 1;
-					}
-				}
-				if (pte_young(ptent) && likely(vma_has_recency(vma)))
-					folio_mark_accessed(folio);
-			}
-			rss[mm_counter(folio)]--;
-			if (!delay_rmap) {
-				folio_remove_rmap_pte(folio, page, vma);
-				if (unlikely(page_mapcount(page) < 0))
-					print_bad_pte(vma, addr, ptent, page);
-			}
-			if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
-				force_flush = 1;
+			zap_present_pte(tlb, vma, pte, ptent, addr, details,
+					rss, &force_flush, &force_break);
+			if (unlikely(force_break)) {
 				addr += PAGE_SIZE;
 				break;
 			}
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 1/9] mm/memory: factor out zapping of present pte into zap_present_pte()
  2024-01-29 14:32 ` [PATCH v1 1/9] mm/memory: factor out zapping of present pte into zap_present_pte() David Hildenbrand
@ 2024-01-30  8:13   ` Ryan Roberts
  2024-01-30  8:41     ` David Hildenbrand
  0 siblings, 1 reply; 40+ messages in thread
From: Ryan Roberts @ 2024-01-30  8:13 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Catalin Marinas,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra,
	Michael Ellerman, Christophe Leroy, Naveen N. Rao, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Arnd Bergmann, linux-arch, linuxppc-dev,
	linux-s390

On 29/01/2024 14:32, David Hildenbrand wrote:
> Let's prepare for further changes by factoring out processing of present
> PTEs.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memory.c | 92 ++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 52 insertions(+), 40 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index b05fd28dbce1..50a6c79c78fc 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1532,13 +1532,61 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
>  	pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
>  }
>  
> +static inline void zap_present_pte(struct mmu_gather *tlb,
> +		struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
> +		unsigned long addr, struct zap_details *details,
> +		int *rss, bool *force_flush, bool *force_break)
> +{
> +	struct mm_struct *mm = tlb->mm;
> +	bool delay_rmap = false;
> +	struct folio *folio;

You need to init this to NULL otherwise its a random value when calling
should_zap_folio() if vm_normal_page() returns NULL.

> +	struct page *page;
> +
> +	page = vm_normal_page(vma, addr, ptent);
> +	if (page)
> +		folio = page_folio(page);
> +
> +	if (unlikely(!should_zap_folio(details, folio)))
> +		return;
> +	ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> +	arch_check_zapped_pte(vma, ptent);
> +	tlb_remove_tlb_entry(tlb, pte, addr);
> +	zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
> +	if (unlikely(!page)) {
> +		ksm_might_unmap_zero_page(mm, ptent);
> +		return;
> +	}
> +
> +	if (!folio_test_anon(folio)) {
> +		if (pte_dirty(ptent)) {
> +			folio_mark_dirty(folio);
> +			if (tlb_delay_rmap(tlb)) {
> +				delay_rmap = true;
> +				*force_flush = true;
> +			}
> +		}
> +		if (pte_young(ptent) && likely(vma_has_recency(vma)))
> +			folio_mark_accessed(folio);
> +	}
> +	rss[mm_counter(folio)]--;
> +	if (!delay_rmap) {
> +		folio_remove_rmap_pte(folio, page, vma);
> +		if (unlikely(page_mapcount(page) < 0))
> +			print_bad_pte(vma, addr, ptent, page);
> +	}
> +	if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
> +		*force_flush = true;
> +		*force_break = true;
> +	}
> +}
> +
>  static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  				struct vm_area_struct *vma, pmd_t *pmd,
>  				unsigned long addr, unsigned long end,
>  				struct zap_details *details)
>  {
> +	bool force_flush = false, force_break = false;
>  	struct mm_struct *mm = tlb->mm;
> -	int force_flush = 0;
>  	int rss[NR_MM_COUNTERS];
>  	spinlock_t *ptl;
>  	pte_t *start_pte;
> @@ -1565,45 +1613,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  			break;
>  
>  		if (pte_present(ptent)) {
> -			unsigned int delay_rmap;
> -
> -			page = vm_normal_page(vma, addr, ptent);
> -			if (page)
> -				folio = page_folio(page);
> -
> -			if (unlikely(!should_zap_folio(details, folio)))
> -				continue;
> -			ptent = ptep_get_and_clear_full(mm, addr, pte,
> -							tlb->fullmm);
> -			arch_check_zapped_pte(vma, ptent);
> -			tlb_remove_tlb_entry(tlb, pte, addr);
> -			zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> -						      ptent);
> -			if (unlikely(!page)) {
> -				ksm_might_unmap_zero_page(mm, ptent);
> -				continue;
> -			}
> -
> -			delay_rmap = 0;
> -			if (!folio_test_anon(folio)) {
> -				if (pte_dirty(ptent)) {
> -					folio_mark_dirty(folio);
> -					if (tlb_delay_rmap(tlb)) {
> -						delay_rmap = 1;
> -						force_flush = 1;
> -					}
> -				}
> -				if (pte_young(ptent) && likely(vma_has_recency(vma)))
> -					folio_mark_accessed(folio);
> -			}
> -			rss[mm_counter(folio)]--;
> -			if (!delay_rmap) {
> -				folio_remove_rmap_pte(folio, page, vma);
> -				if (unlikely(page_mapcount(page) < 0))
> -					print_bad_pte(vma, addr, ptent, page);
> -			}
> -			if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
> -				force_flush = 1;
> +			zap_present_pte(tlb, vma, pte, ptent, addr, details,
> +					rss, &force_flush, &force_break);
> +			if (unlikely(force_break)) {
>  				addr += PAGE_SIZE;
>  				break;
>  			}



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 1/9] mm/memory: factor out zapping of present pte into zap_present_pte()
  2024-01-30  8:13   ` Ryan Roberts
@ 2024-01-30  8:41     ` David Hildenbrand
  2024-01-30  8:46       ` Ryan Roberts
  0 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2024-01-30  8:41 UTC (permalink / raw)
  To: Ryan Roberts, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Catalin Marinas,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra,
	Michael Ellerman, Christophe Leroy, Naveen N. Rao, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Arnd Bergmann, linux-arch, linuxppc-dev,
	linux-s390

On 30.01.24 09:13, Ryan Roberts wrote:
> On 29/01/2024 14:32, David Hildenbrand wrote:
>> Let's prepare for further changes by factoring out processing of present
>> PTEs.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   mm/memory.c | 92 ++++++++++++++++++++++++++++++-----------------------
>>   1 file changed, 52 insertions(+), 40 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index b05fd28dbce1..50a6c79c78fc 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1532,13 +1532,61 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
>>   	pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
>>   }
>>   
>> +static inline void zap_present_pte(struct mmu_gather *tlb,
>> +		struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
>> +		unsigned long addr, struct zap_details *details,
>> +		int *rss, bool *force_flush, bool *force_break)
>> +{
>> +	struct mm_struct *mm = tlb->mm;
>> +	bool delay_rmap = false;
>> +	struct folio *folio;
> 
> You need to init this to NULL otherwise its a random value when calling
> should_zap_folio() if vm_normal_page() returns NULL.

Right, and we can stop setting it to NULL in the original function. 
Patch #2 changes these checks, which is why it's only a problem in this 
patch.

Will fix, thanks!

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 1/9] mm/memory: factor out zapping of present pte into zap_present_pte()
  2024-01-30  8:41     ` David Hildenbrand
@ 2024-01-30  8:46       ` Ryan Roberts
  2024-01-30  8:49         ` David Hildenbrand
  0 siblings, 1 reply; 40+ messages in thread
From: Ryan Roberts @ 2024-01-30  8:46 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Catalin Marinas,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra,
	Michael Ellerman, Christophe Leroy, Naveen N. Rao, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Arnd Bergmann, linux-arch, linuxppc-dev,
	linux-s390

On 30/01/2024 08:41, David Hildenbrand wrote:
> On 30.01.24 09:13, Ryan Roberts wrote:
>> On 29/01/2024 14:32, David Hildenbrand wrote:
>>> Let's prepare for further changes by factoring out processing of present
>>> PTEs.
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>   mm/memory.c | 92 ++++++++++++++++++++++++++++++-----------------------
>>>   1 file changed, 52 insertions(+), 40 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index b05fd28dbce1..50a6c79c78fc 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -1532,13 +1532,61 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct
>>> *vma,
>>>       pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
>>>   }
>>>   +static inline void zap_present_pte(struct mmu_gather *tlb,
>>> +        struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
>>> +        unsigned long addr, struct zap_details *details,
>>> +        int *rss, bool *force_flush, bool *force_break)
>>> +{
>>> +    struct mm_struct *mm = tlb->mm;
>>> +    bool delay_rmap = false;
>>> +    struct folio *folio;
>>
>> You need to init this to NULL otherwise its a random value when calling
>> should_zap_folio() if vm_normal_page() returns NULL.
> 
> Right, and we can stop setting it to NULL in the original function. Patch #2
> changes these checks, which is why it's only a problem in this patch.

Yeah I only noticed that after sending out this reply and moving to the next
patch. Still worth fixing this intermediate state I think.

> 
> Will fix, thanks!
> 



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 1/9] mm/memory: factor out zapping of present pte into zap_present_pte()
  2024-01-30  8:46       ` Ryan Roberts
@ 2024-01-30  8:49         ` David Hildenbrand
  0 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2024-01-30  8:49 UTC (permalink / raw)
  To: Ryan Roberts, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Catalin Marinas,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra,
	Michael Ellerman, Christophe Leroy, Naveen N. Rao, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Arnd Bergmann, linux-arch, linuxppc-dev,
	linux-s390

On 30.01.24 09:46, Ryan Roberts wrote:
> On 30/01/2024 08:41, David Hildenbrand wrote:
>> On 30.01.24 09:13, Ryan Roberts wrote:
>>> On 29/01/2024 14:32, David Hildenbrand wrote:
>>>> Let's prepare for further changes by factoring out processing of present
>>>> PTEs.
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>    mm/memory.c | 92 ++++++++++++++++++++++++++++++-----------------------
>>>>    1 file changed, 52 insertions(+), 40 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index b05fd28dbce1..50a6c79c78fc 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -1532,13 +1532,61 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct
>>>> *vma,
>>>>        pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
>>>>    }
>>>>    +static inline void zap_present_pte(struct mmu_gather *tlb,
>>>> +        struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
>>>> +        unsigned long addr, struct zap_details *details,
>>>> +        int *rss, bool *force_flush, bool *force_break)
>>>> +{
>>>> +    struct mm_struct *mm = tlb->mm;
>>>> +    bool delay_rmap = false;
>>>> +    struct folio *folio;
>>>
>>> You need to init this to NULL otherwise its a random value when calling
>>> should_zap_folio() if vm_normal_page() returns NULL.
>>
>> Right, and we can stop setting it to NULL in the original function. Patch #2
>> changes these checks, which is why it's only a problem in this patch.
> 
> Yeah I only noticed that after sending out this reply and moving to the next
> patch. Still worth fixing this intermediate state I think.

Absolutely, I didn't do path-by-patch compilation yet (I suspect the 
compiler would complain).

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v1 2/9] mm/memory: handle !page case in zap_present_pte() separately
  2024-01-29 14:32 [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP David Hildenbrand
  2024-01-29 14:32 ` [PATCH v1 1/9] mm/memory: factor out zapping of present pte into zap_present_pte() David Hildenbrand
@ 2024-01-29 14:32 ` David Hildenbrand
  2024-01-30  8:20   ` Ryan Roberts
  2024-01-29 14:32 ` [PATCH v1 3/9] mm/memory: further separate anon and pagecache folio handling in zap_present_pte() David Hildenbrand
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2024-01-29 14:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Ryan Roberts, Catalin Marinas, Will Deacon, Aneesh Kumar K.V,
	Nick Piggin, Peter Zijlstra, Michael Ellerman, Christophe Leroy,
	Naveen N. Rao, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Arnd Bergmann, linux-arch,
	linuxppc-dev, linux-s390

We don't need uptodate accessed/dirty bits, so in theory we could
replace ptep_get_and_clear_full() by an optimized ptep_clear_full()
function. Let's rely on the provided pte.

Further, there is no scenario where we would have to insert uffd-wp
markers when zapping something that is not a normal page (i.e., zeropage).
Add a sanity check to make sure this remains true.

should_zap_folio() no longer has to handle NULL pointers. This change
replaces 2/3 "!page/!folio" checks by a single "!page" one.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 50a6c79c78fc..69502cdc0a7d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1497,10 +1497,6 @@ static inline bool should_zap_folio(struct zap_details *details,
 	if (should_zap_cows(details))
 		return true;
 
-	/* E.g. the caller passes NULL for the case of a zero folio */
-	if (!folio)
-		return true;
-
 	/* Otherwise we should only zap non-anon folios */
 	return !folio_test_anon(folio);
 }
@@ -1543,19 +1539,23 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
 	struct page *page;
 
 	page = vm_normal_page(vma, addr, ptent);
-	if (page)
-		folio = page_folio(page);
+	if (!page) {
+		/* We don't need up-to-date accessed/dirty bits. */
+		ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+		arch_check_zapped_pte(vma, ptent);
+		tlb_remove_tlb_entry(tlb, pte, addr);
+		VM_WARN_ON_ONCE(userfaultfd_wp(vma));
+		ksm_might_unmap_zero_page(mm, ptent);
+		return;
+	}
 
+	folio = page_folio(page);
 	if (unlikely(!should_zap_folio(details, folio)))
 		return;
 	ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
 	arch_check_zapped_pte(vma, ptent);
 	tlb_remove_tlb_entry(tlb, pte, addr);
 	zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
-	if (unlikely(!page)) {
-		ksm_might_unmap_zero_page(mm, ptent);
-		return;
-	}
 
 	if (!folio_test_anon(folio)) {
 		if (pte_dirty(ptent)) {
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 2/9] mm/memory: handle !page case in zap_present_pte() separately
  2024-01-29 14:32 ` [PATCH v1 2/9] mm/memory: handle !page case in zap_present_pte() separately David Hildenbrand
@ 2024-01-30  8:20   ` Ryan Roberts
  0 siblings, 0 replies; 40+ messages in thread
From: Ryan Roberts @ 2024-01-30  8:20 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Catalin Marinas,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra,
	Michael Ellerman, Christophe Leroy, Naveen N. Rao, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Arnd Bergmann, linux-arch, linuxppc-dev,
	linux-s390

On 29/01/2024 14:32, David Hildenbrand wrote:
> We don't need uptodate accessed/dirty bits, so in theory we could
> replace ptep_get_and_clear_full() by an optimized ptep_clear_full()
> function. Let's rely on the provided pte.
> 
> Further, there is no scenario where we would have to insert uffd-wp
> markers when zapping something that is not a normal page (i.e., zeropage).
> Add a sanity check to make sure this remains true.
> 
> should_zap_folio() no longer has to handle NULL pointers. This change
> replaces 2/3 "!page/!folio" checks by a single "!page" one.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> ---
>  mm/memory.c | 20 ++++++++++----------
>  1 file changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 50a6c79c78fc..69502cdc0a7d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1497,10 +1497,6 @@ static inline bool should_zap_folio(struct zap_details *details,
>  	if (should_zap_cows(details))
>  		return true;
>  
> -	/* E.g. the caller passes NULL for the case of a zero folio */
> -	if (!folio)
> -		return true;
> -
>  	/* Otherwise we should only zap non-anon folios */
>  	return !folio_test_anon(folio);
>  }
> @@ -1543,19 +1539,23 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
>  	struct page *page;
>  
>  	page = vm_normal_page(vma, addr, ptent);
> -	if (page)
> -		folio = page_folio(page);
> +	if (!page) {
> +		/* We don't need up-to-date accessed/dirty bits. */
> +		ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> +		arch_check_zapped_pte(vma, ptent);
> +		tlb_remove_tlb_entry(tlb, pte, addr);
> +		VM_WARN_ON_ONCE(userfaultfd_wp(vma));
> +		ksm_might_unmap_zero_page(mm, ptent);
> +		return;
> +	}
>  
> +	folio = page_folio(page);
>  	if (unlikely(!should_zap_folio(details, folio)))
>  		return;
>  	ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>  	arch_check_zapped_pte(vma, ptent);
>  	tlb_remove_tlb_entry(tlb, pte, addr);
>  	zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
> -	if (unlikely(!page)) {
> -		ksm_might_unmap_zero_page(mm, ptent);
> -		return;
> -	}
>  
>  	if (!folio_test_anon(folio)) {
>  		if (pte_dirty(ptent)) {



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v1 3/9] mm/memory: further separate anon and pagecache folio handling in zap_present_pte()
  2024-01-29 14:32 [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP David Hildenbrand
  2024-01-29 14:32 ` [PATCH v1 1/9] mm/memory: factor out zapping of present pte into zap_present_pte() David Hildenbrand
  2024-01-29 14:32 ` [PATCH v1 2/9] mm/memory: handle !page case in zap_present_pte() separately David Hildenbrand
@ 2024-01-29 14:32 ` David Hildenbrand
  2024-01-30  8:31   ` Ryan Roberts
  2024-01-29 14:32 ` [PATCH v1 4/9] mm/memory: factor out zapping folio pte into zap_present_folio_pte() David Hildenbrand
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2024-01-29 14:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Ryan Roberts, Catalin Marinas, Will Deacon, Aneesh Kumar K.V,
	Nick Piggin, Peter Zijlstra, Michael Ellerman, Christophe Leroy,
	Naveen N. Rao, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Arnd Bergmann, linux-arch,
	linuxppc-dev, linux-s390

We don't need up-to-date accessed-dirty information for anon folios and can
simply work with the ptent we already have. Also, we know the RSS counter
we want to update.

We can safely move arch_check_zapped_pte() + tlb_remove_tlb_entry() +
zap_install_uffd_wp_if_needed() after updating the folio and RSS.

While at it, only call zap_install_uffd_wp_if_needed() if there is even
any chance that pte_install_uffd_wp_if_needed() would do *something*.
That is, just don't bother if uffd-wp does not apply.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 69502cdc0a7d..20bc13ab8db2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1552,12 +1552,9 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
 	folio = page_folio(page);
 	if (unlikely(!should_zap_folio(details, folio)))
 		return;
-	ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
-	arch_check_zapped_pte(vma, ptent);
-	tlb_remove_tlb_entry(tlb, pte, addr);
-	zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
 
 	if (!folio_test_anon(folio)) {
+		ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
 		if (pte_dirty(ptent)) {
 			folio_mark_dirty(folio);
 			if (tlb_delay_rmap(tlb)) {
@@ -1567,8 +1564,17 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
 		}
 		if (pte_young(ptent) && likely(vma_has_recency(vma)))
 			folio_mark_accessed(folio);
+		rss[mm_counter(folio)]--;
+	} else {
+		/* We don't need up-to-date accessed/dirty bits. */
+		ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+		rss[MM_ANONPAGES]--;
 	}
-	rss[mm_counter(folio)]--;
+	arch_check_zapped_pte(vma, ptent);
+	tlb_remove_tlb_entry(tlb, pte, addr);
+	if (unlikely(userfaultfd_pte_wp(vma, ptent)))
+		zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
+
 	if (!delay_rmap) {
 		folio_remove_rmap_pte(folio, page, vma);
 		if (unlikely(page_mapcount(page) < 0))
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 3/9] mm/memory: further separate anon and pagecache folio handling in zap_present_pte()
  2024-01-29 14:32 ` [PATCH v1 3/9] mm/memory: further separate anon and pagecache folio handling in zap_present_pte() David Hildenbrand
@ 2024-01-30  8:31   ` Ryan Roberts
  2024-01-30  8:37     ` David Hildenbrand
  0 siblings, 1 reply; 40+ messages in thread
From: Ryan Roberts @ 2024-01-30  8:31 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Catalin Marinas,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra,
	Michael Ellerman, Christophe Leroy, Naveen N. Rao, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Arnd Bergmann, linux-arch, linuxppc-dev,
	linux-s390

On 29/01/2024 14:32, David Hildenbrand wrote:
> We don't need up-to-date accessed-dirty information for anon folios and can
> simply work with the ptent we already have. Also, we know the RSS counter
> we want to update.
> 
> We can safely move arch_check_zapped_pte() + tlb_remove_tlb_entry() +
> zap_install_uffd_wp_if_needed() after updating the folio and RSS.
> 
> While at it, only call zap_install_uffd_wp_if_needed() if there is even
> any chance that pte_install_uffd_wp_if_needed() would do *something*.
> That is, just don't bother if uffd-wp does not apply.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memory.c | 16 +++++++++++-----
>  1 file changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 69502cdc0a7d..20bc13ab8db2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1552,12 +1552,9 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
>  	folio = page_folio(page);
>  	if (unlikely(!should_zap_folio(details, folio)))
>  		return;
> -	ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> -	arch_check_zapped_pte(vma, ptent);
> -	tlb_remove_tlb_entry(tlb, pte, addr);
> -	zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
>  
>  	if (!folio_test_anon(folio)) {
> +		ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>  		if (pte_dirty(ptent)) {
>  			folio_mark_dirty(folio);
>  			if (tlb_delay_rmap(tlb)) {
> @@ -1567,8 +1564,17 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
>  		}
>  		if (pte_young(ptent) && likely(vma_has_recency(vma)))
>  			folio_mark_accessed(folio);
> +		rss[mm_counter(folio)]--;
> +	} else {
> +		/* We don't need up-to-date accessed/dirty bits. */
> +		ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> +		rss[MM_ANONPAGES]--;
>  	}
> -	rss[mm_counter(folio)]--;
> +	arch_check_zapped_pte(vma, ptent);

Isn't the x86 (only) implementation of this relying on the dirty bit? So doesn't
that imply you still need get_and_clear for anon? (And in hindsight I think that
logic would apply to the previous patch too?)

Impl:

void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte)
{
	/*
	 * Hardware before shadow stack can (rarely) set Dirty=1
	 * on a Write=0 PTE. So the below condition
	 * only indicates a software bug when shadow stack is
	 * supported by the HW. This checking is covered in
	 * pte_shstk().
	 */
	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
			pte_shstk(pte));
}

static inline bool pte_shstk(pte_t pte)
{
	return cpu_feature_enabled(X86_FEATURE_SHSTK) &&
	       (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
}


> +	tlb_remove_tlb_entry(tlb, pte, addr);
> +	if (unlikely(userfaultfd_pte_wp(vma, ptent)))
> +		zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
> +
>  	if (!delay_rmap) {
>  		folio_remove_rmap_pte(folio, page, vma);
>  		if (unlikely(page_mapcount(page) < 0))



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 3/9] mm/memory: further separate anon and pagecache folio handling in zap_present_pte()
  2024-01-30  8:31   ` Ryan Roberts
@ 2024-01-30  8:37     ` David Hildenbrand
  2024-01-30  8:45       ` Ryan Roberts
  0 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2024-01-30  8:37 UTC (permalink / raw)
  To: Ryan Roberts, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Catalin Marinas,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra,
	Michael Ellerman, Christophe Leroy, Naveen N. Rao, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Arnd Bergmann, linux-arch, linuxppc-dev,
	linux-s390

On 30.01.24 09:31, Ryan Roberts wrote:
> On 29/01/2024 14:32, David Hildenbrand wrote:
>> We don't need up-to-date accessed-dirty information for anon folios and can
>> simply work with the ptent we already have. Also, we know the RSS counter
>> we want to update.
>>
>> We can safely move arch_check_zapped_pte() + tlb_remove_tlb_entry() +
>> zap_install_uffd_wp_if_needed() after updating the folio and RSS.
>>
>> While at it, only call zap_install_uffd_wp_if_needed() if there is even
>> any chance that pte_install_uffd_wp_if_needed() would do *something*.
>> That is, just don't bother if uffd-wp does not apply.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   mm/memory.c | 16 +++++++++++-----
>>   1 file changed, 11 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 69502cdc0a7d..20bc13ab8db2 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1552,12 +1552,9 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
>>   	folio = page_folio(page);
>>   	if (unlikely(!should_zap_folio(details, folio)))
>>   		return;
>> -	ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>> -	arch_check_zapped_pte(vma, ptent);
>> -	tlb_remove_tlb_entry(tlb, pte, addr);
>> -	zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
>>   
>>   	if (!folio_test_anon(folio)) {
>> +		ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>>   		if (pte_dirty(ptent)) {
>>   			folio_mark_dirty(folio);
>>   			if (tlb_delay_rmap(tlb)) {
>> @@ -1567,8 +1564,17 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
>>   		}
>>   		if (pte_young(ptent) && likely(vma_has_recency(vma)))
>>   			folio_mark_accessed(folio);
>> +		rss[mm_counter(folio)]--;
>> +	} else {
>> +		/* We don't need up-to-date accessed/dirty bits. */
>> +		ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>> +		rss[MM_ANONPAGES]--;
>>   	}
>> -	rss[mm_counter(folio)]--;
>> +	arch_check_zapped_pte(vma, ptent);
> 
> Isn't the x86 (only) implementation of this relying on the dirty bit? So doesn't
> that imply you still need get_and_clear for anon? (And in hindsight I think that
> logic would apply to the previous patch too?)

x86 uses the encoding !writable && dirty to indicate special shadow 
stacks. That is, the hw dirty bit is set by software (to create that 
combination), not by hardware.

So you don't have to sync against any hw changes of the hw dirty bit. 
What you had in the original PTE you read is sufficient.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 3/9] mm/memory: further separate anon and pagecache folio handling in zap_present_pte()
  2024-01-30  8:37     ` David Hildenbrand
@ 2024-01-30  8:45       ` Ryan Roberts
  2024-01-30  8:47         ` David Hildenbrand
  0 siblings, 1 reply; 40+ messages in thread
From: Ryan Roberts @ 2024-01-30  8:45 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Catalin Marinas,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra,
	Michael Ellerman, Christophe Leroy, Naveen N. Rao, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Arnd Bergmann, linux-arch, linuxppc-dev,
	linux-s390

On 30/01/2024 08:37, David Hildenbrand wrote:
> On 30.01.24 09:31, Ryan Roberts wrote:
>> On 29/01/2024 14:32, David Hildenbrand wrote:
>>> We don't need up-to-date accessed-dirty information for anon folios and can
>>> simply work with the ptent we already have. Also, we know the RSS counter
>>> we want to update.
>>>
>>> We can safely move arch_check_zapped_pte() + tlb_remove_tlb_entry() +
>>> zap_install_uffd_wp_if_needed() after updating the folio and RSS.
>>>
>>> While at it, only call zap_install_uffd_wp_if_needed() if there is even
>>> any chance that pte_install_uffd_wp_if_needed() would do *something*.
>>> That is, just don't bother if uffd-wp does not apply.
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>   mm/memory.c | 16 +++++++++++-----
>>>   1 file changed, 11 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 69502cdc0a7d..20bc13ab8db2 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -1552,12 +1552,9 @@ static inline void zap_present_pte(struct mmu_gather
>>> *tlb,
>>>       folio = page_folio(page);
>>>       if (unlikely(!should_zap_folio(details, folio)))
>>>           return;
>>> -    ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>>> -    arch_check_zapped_pte(vma, ptent);
>>> -    tlb_remove_tlb_entry(tlb, pte, addr);
>>> -    zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
>>>         if (!folio_test_anon(folio)) {
>>> +        ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>>>           if (pte_dirty(ptent)) {
>>>               folio_mark_dirty(folio);
>>>               if (tlb_delay_rmap(tlb)) {
>>> @@ -1567,8 +1564,17 @@ static inline void zap_present_pte(struct mmu_gather
>>> *tlb,
>>>           }
>>>           if (pte_young(ptent) && likely(vma_has_recency(vma)))
>>>               folio_mark_accessed(folio);
>>> +        rss[mm_counter(folio)]--;
>>> +    } else {
>>> +        /* We don't need up-to-date accessed/dirty bits. */
>>> +        ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>>> +        rss[MM_ANONPAGES]--;
>>>       }
>>> -    rss[mm_counter(folio)]--;
>>> +    arch_check_zapped_pte(vma, ptent);
>>
>> Isn't the x86 (only) implementation of this relying on the dirty bit? So doesn't
>> that imply you still need get_and_clear for anon? (And in hindsight I think that
>> logic would apply to the previous patch too?)
> 
> x86 uses the encoding !writable && dirty to indicate special shadow stacks. That
> is, the hw dirty bit is set by software (to create that combination), not by
> hardware.
> 
> So you don't have to sync against any hw changes of the hw dirty bit. What you
> had in the original PTE you read is sufficient.
> 

Right, got it. In that case:

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 3/9] mm/memory: further separate anon and pagecache folio handling in zap_present_pte()
  2024-01-30  8:45       ` Ryan Roberts
@ 2024-01-30  8:47         ` David Hildenbrand
  0 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2024-01-30  8:47 UTC (permalink / raw)
  To: Ryan Roberts, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Catalin Marinas,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra,
	Michael Ellerman, Christophe Leroy, Naveen N. Rao, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Arnd Bergmann, linux-arch, linuxppc-dev,
	linux-s390

On 30.01.24 09:45, Ryan Roberts wrote:
> On 30/01/2024 08:37, David Hildenbrand wrote:
>> On 30.01.24 09:31, Ryan Roberts wrote:
>>> On 29/01/2024 14:32, David Hildenbrand wrote:
>>>> We don't need up-to-date accessed-dirty information for anon folios and can
>>>> simply work with the ptent we already have. Also, we know the RSS counter
>>>> we want to update.
>>>>
>>>> We can safely move arch_check_zapped_pte() + tlb_remove_tlb_entry() +
>>>> zap_install_uffd_wp_if_needed() after updating the folio and RSS.
>>>>
>>>> While at it, only call zap_install_uffd_wp_if_needed() if there is even
>>>> any chance that pte_install_uffd_wp_if_needed() would do *something*.
>>>> That is, just don't bother if uffd-wp does not apply.
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>    mm/memory.c | 16 +++++++++++-----
>>>>    1 file changed, 11 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 69502cdc0a7d..20bc13ab8db2 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -1552,12 +1552,9 @@ static inline void zap_present_pte(struct mmu_gather
>>>> *tlb,
>>>>        folio = page_folio(page);
>>>>        if (unlikely(!should_zap_folio(details, folio)))
>>>>            return;
>>>> -    ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>>>> -    arch_check_zapped_pte(vma, ptent);
>>>> -    tlb_remove_tlb_entry(tlb, pte, addr);
>>>> -    zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
>>>>          if (!folio_test_anon(folio)) {
>>>> +        ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>>>>            if (pte_dirty(ptent)) {
>>>>                folio_mark_dirty(folio);
>>>>                if (tlb_delay_rmap(tlb)) {
>>>> @@ -1567,8 +1564,17 @@ static inline void zap_present_pte(struct mmu_gather
>>>> *tlb,
>>>>            }
>>>>            if (pte_young(ptent) && likely(vma_has_recency(vma)))
>>>>                folio_mark_accessed(folio);
>>>> +        rss[mm_counter(folio)]--;
>>>> +    } else {
>>>> +        /* We don't need up-to-date accessed/dirty bits. */
>>>> +        ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>>>> +        rss[MM_ANONPAGES]--;
>>>>        }
>>>> -    rss[mm_counter(folio)]--;
>>>> +    arch_check_zapped_pte(vma, ptent);
>>>
>>> Isn't the x86 (only) implementation of this relying on the dirty bit? So doesn't
>>> that imply you still need get_and_clear for anon? (And in hindsight I think that
>>> logic would apply to the previous patch too?)
>>
>> x86 uses the encoding !writable && dirty to indicate special shadow stacks. That
>> is, the hw dirty bit is set by software (to create that combination), not by
>> hardware.
>>
>> So you don't have to sync against any hw changes of the hw dirty bit. What you
>> had in the original PTE you read is sufficient.
>>
> 
> Right, got it. In that case:

Thanks a lot for paying that much attention during your reviews! Highly 
appreciated!

> 
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> 
> 

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v1 4/9] mm/memory: factor out zapping folio pte into zap_present_folio_pte()
  2024-01-29 14:32 [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP David Hildenbrand
                   ` (2 preceding siblings ...)
  2024-01-29 14:32 ` [PATCH v1 3/9] mm/memory: further separate anon and pagecache folio handling in zap_present_pte() David Hildenbrand
@ 2024-01-29 14:32 ` David Hildenbrand
  2024-01-30  8:47   ` Ryan Roberts
  2024-01-29 14:32 ` [PATCH v1 5/9] mm/mmu_gather: pass "delay_rmap" instead of encoded page to __tlb_remove_page_size() David Hildenbrand
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2024-01-29 14:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Ryan Roberts, Catalin Marinas, Will Deacon, Aneesh Kumar K.V,
	Nick Piggin, Peter Zijlstra, Michael Ellerman, Christophe Leroy,
	Naveen N. Rao, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Arnd Bergmann, linux-arch,
	linuxppc-dev, linux-s390

Let's prepare for further changes by factoring it out into a separate
function.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 53 ++++++++++++++++++++++++++++++++---------------------
 1 file changed, 32 insertions(+), 21 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 20bc13ab8db2..a2190d7cfa74 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1528,30 +1528,14 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
 	pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
 }
 
-static inline void zap_present_pte(struct mmu_gather *tlb,
-		struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
-		unsigned long addr, struct zap_details *details,
-		int *rss, bool *force_flush, bool *force_break)
+static inline void zap_present_folio_pte(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, struct folio *folio,
+		struct page *page, pte_t *pte, pte_t ptent, unsigned long addr,
+		struct zap_details *details, int *rss, bool *force_flush,
+		bool *force_break)
 {
 	struct mm_struct *mm = tlb->mm;
 	bool delay_rmap = false;
-	struct folio *folio;
-	struct page *page;
-
-	page = vm_normal_page(vma, addr, ptent);
-	if (!page) {
-		/* We don't need up-to-date accessed/dirty bits. */
-		ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
-		arch_check_zapped_pte(vma, ptent);
-		tlb_remove_tlb_entry(tlb, pte, addr);
-		VM_WARN_ON_ONCE(userfaultfd_wp(vma));
-		ksm_might_unmap_zero_page(mm, ptent);
-		return;
-	}
-
-	folio = page_folio(page);
-	if (unlikely(!should_zap_folio(details, folio)))
-		return;
 
 	if (!folio_test_anon(folio)) {
 		ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
@@ -1586,6 +1570,33 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
 	}
 }
 
+static inline void zap_present_pte(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
+		unsigned long addr, struct zap_details *details,
+		int *rss, bool *force_flush, bool *force_break)
+{
+	struct mm_struct *mm = tlb->mm;
+	struct folio *folio;
+	struct page *page;
+
+	page = vm_normal_page(vma, addr, ptent);
+	if (!page) {
+		/* We don't need up-to-date accessed/dirty bits. */
+		ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+		arch_check_zapped_pte(vma, ptent);
+		tlb_remove_tlb_entry(tlb, pte, addr);
+		VM_WARN_ON_ONCE(userfaultfd_wp(vma));
+		ksm_might_unmap_zero_page(mm, ptent);
+		return;
+	}
+
+	folio = page_folio(page);
+	if (unlikely(!should_zap_folio(details, folio)))
+		return;
+	zap_present_folio_pte(tlb, vma, folio, page, pte, ptent, addr, details,
+			      rss, force_flush, force_break);
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 4/9] mm/memory: factor out zapping folio pte into zap_present_folio_pte()
  2024-01-29 14:32 ` [PATCH v1 4/9] mm/memory: factor out zapping folio pte into zap_present_folio_pte() David Hildenbrand
@ 2024-01-30  8:47   ` Ryan Roberts
  0 siblings, 0 replies; 40+ messages in thread
From: Ryan Roberts @ 2024-01-30  8:47 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Catalin Marinas,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra,
	Michael Ellerman, Christophe Leroy, Naveen N. Rao, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Arnd Bergmann, linux-arch, linuxppc-dev,
	linux-s390

On 29/01/2024 14:32, David Hildenbrand wrote:
> Let's prepare for further changes by factoring it out into a separate
> function.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> ---
>  mm/memory.c | 53 ++++++++++++++++++++++++++++++++---------------------
>  1 file changed, 32 insertions(+), 21 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 20bc13ab8db2..a2190d7cfa74 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1528,30 +1528,14 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
>  	pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
>  }
>  
> -static inline void zap_present_pte(struct mmu_gather *tlb,
> -		struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
> -		unsigned long addr, struct zap_details *details,
> -		int *rss, bool *force_flush, bool *force_break)
> +static inline void zap_present_folio_pte(struct mmu_gather *tlb,
> +		struct vm_area_struct *vma, struct folio *folio,
> +		struct page *page, pte_t *pte, pte_t ptent, unsigned long addr,
> +		struct zap_details *details, int *rss, bool *force_flush,
> +		bool *force_break)
>  {
>  	struct mm_struct *mm = tlb->mm;
>  	bool delay_rmap = false;
> -	struct folio *folio;
> -	struct page *page;
> -
> -	page = vm_normal_page(vma, addr, ptent);
> -	if (!page) {
> -		/* We don't need up-to-date accessed/dirty bits. */
> -		ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> -		arch_check_zapped_pte(vma, ptent);
> -		tlb_remove_tlb_entry(tlb, pte, addr);
> -		VM_WARN_ON_ONCE(userfaultfd_wp(vma));
> -		ksm_might_unmap_zero_page(mm, ptent);
> -		return;
> -	}
> -
> -	folio = page_folio(page);
> -	if (unlikely(!should_zap_folio(details, folio)))
> -		return;
>  
>  	if (!folio_test_anon(folio)) {
>  		ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> @@ -1586,6 +1570,33 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
>  	}
>  }
>  
> +static inline void zap_present_pte(struct mmu_gather *tlb,
> +		struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
> +		unsigned long addr, struct zap_details *details,
> +		int *rss, bool *force_flush, bool *force_break)
> +{
> +	struct mm_struct *mm = tlb->mm;
> +	struct folio *folio;
> +	struct page *page;
> +
> +	page = vm_normal_page(vma, addr, ptent);
> +	if (!page) {
> +		/* We don't need up-to-date accessed/dirty bits. */
> +		ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> +		arch_check_zapped_pte(vma, ptent);
> +		tlb_remove_tlb_entry(tlb, pte, addr);
> +		VM_WARN_ON_ONCE(userfaultfd_wp(vma));
> +		ksm_might_unmap_zero_page(mm, ptent);
> +		return;
> +	}
> +
> +	folio = page_folio(page);
> +	if (unlikely(!should_zap_folio(details, folio)))
> +		return;
> +	zap_present_folio_pte(tlb, vma, folio, page, pte, ptent, addr, details,
> +			      rss, force_flush, force_break);
> +}
> +
>  static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  				struct vm_area_struct *vma, pmd_t *pmd,
>  				unsigned long addr, unsigned long end,



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v1 5/9] mm/mmu_gather: pass "delay_rmap" instead of encoded page to __tlb_remove_page_size()
  2024-01-29 14:32 [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP David Hildenbrand
                   ` (3 preceding siblings ...)
  2024-01-29 14:32 ` [PATCH v1 4/9] mm/memory: factor out zapping folio pte into zap_present_folio_pte() David Hildenbrand
@ 2024-01-29 14:32 ` David Hildenbrand
  2024-01-30  8:41   ` Ryan Roberts
  2024-01-29 14:32 ` [PATCH v1 6/9] mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP David Hildenbrand
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2024-01-29 14:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Ryan Roberts, Catalin Marinas, Will Deacon, Aneesh Kumar K.V,
	Nick Piggin, Peter Zijlstra, Michael Ellerman, Christophe Leroy,
	Naveen N. Rao, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Arnd Bergmann, linux-arch,
	linuxppc-dev, linux-s390

We have two bits available in the encoded page pointer to store
additional information. Currently, we use one bit to request delay of the
rmap removal until after a TLB flush.

We want to make use of the remaining bit internally for batching of
multiple pages of the same folio, specifying that the next encoded page
pointer in an array is actually "nr_pages". So pass page + delay_rmap flag
instead of an encoded page, to handle the encoding internally.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/include/asm/tlb.h | 13 ++++++-------
 include/asm-generic/tlb.h   | 12 ++++++------
 mm/mmu_gather.c             |  7 ++++---
 3 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index d1455a601adc..48df896d5b79 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -25,8 +25,7 @@
 void __tlb_remove_table(void *_table);
 static inline void tlb_flush(struct mmu_gather *tlb);
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
-					  struct encoded_page *page,
-					  int page_size);
+		struct page *page, bool delay_rmap, int page_size);
 
 #define tlb_flush tlb_flush
 #define pte_free_tlb pte_free_tlb
@@ -42,14 +41,14 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
  * tlb_ptep_clear_flush. In both flush modes the tlb for a page cache page
  * has already been freed, so just do free_page_and_swap_cache.
  *
- * s390 doesn't delay rmap removal, so there is nothing encoded in
- * the page pointer.
+ * s390 doesn't delay rmap removal.
  */
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
-					  struct encoded_page *page,
-					  int page_size)
+		struct page *page, bool delay_rmap, int page_size)
 {
-	free_page_and_swap_cache(encoded_page_ptr(page));
+	VM_WARN_ON_ONCE(delay_rmap);
+
+	free_page_and_swap_cache(page);
 	return false;
 }
 
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 129a3a759976..2eb7b0d4f5d2 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -260,9 +260,8 @@ struct mmu_gather_batch {
  */
 #define MAX_GATHER_BATCH_COUNT	(10000UL/MAX_GATHER_BATCH)
 
-extern bool __tlb_remove_page_size(struct mmu_gather *tlb,
-				   struct encoded_page *page,
-				   int page_size);
+extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
+		bool delay_rmap, int page_size);
 
 #ifdef CONFIG_SMP
 /*
@@ -462,13 +461,14 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 static inline void tlb_remove_page_size(struct mmu_gather *tlb,
 					struct page *page, int page_size)
 {
-	if (__tlb_remove_page_size(tlb, encode_page(page, 0), page_size))
+	if (__tlb_remove_page_size(tlb, page, false, page_size))
 		tlb_flush_mmu(tlb);
 }
 
-static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page, unsigned int flags)
+static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb,
+		struct page *page, bool delay_rmap)
 {
-	return __tlb_remove_page_size(tlb, encode_page(page, flags), PAGE_SIZE);
+	return __tlb_remove_page_size(tlb, page, delay_rmap, PAGE_SIZE);
 }
 
 /* tlb_remove_page
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 604ddf08affe..ac733d81b112 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -116,7 +116,8 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
 	tlb->local.next = NULL;
 }
 
-bool __tlb_remove_page_size(struct mmu_gather *tlb, struct encoded_page *page, int page_size)
+bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
+		bool delay_rmap, int page_size)
 {
 	struct mmu_gather_batch *batch;
 
@@ -131,13 +132,13 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct encoded_page *page, i
 	 * Add the page and check if we are full. If so
 	 * force a flush.
 	 */
-	batch->encoded_pages[batch->nr++] = page;
+	batch->encoded_pages[batch->nr++] = encode_page(page, delay_rmap);
 	if (batch->nr == batch->max) {
 		if (!tlb_next_batch(tlb))
 			return true;
 		batch = tlb->active;
 	}
-	VM_BUG_ON_PAGE(batch->nr > batch->max, encoded_page_ptr(page));
+	VM_BUG_ON_PAGE(batch->nr > batch->max, page);
 
 	return false;
 }
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 5/9] mm/mmu_gather: pass "delay_rmap" instead of encoded page to __tlb_remove_page_size()
  2024-01-29 14:32 ` [PATCH v1 5/9] mm/mmu_gather: pass "delay_rmap" instead of encoded page to __tlb_remove_page_size() David Hildenbrand
@ 2024-01-30  8:41   ` Ryan Roberts
  0 siblings, 0 replies; 40+ messages in thread
From: Ryan Roberts @ 2024-01-30  8:41 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Catalin Marinas,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra,
	Michael Ellerman, Christophe Leroy, Naveen N. Rao, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Arnd Bergmann, linux-arch, linuxppc-dev,
	linux-s390

On 29/01/2024 14:32, David Hildenbrand wrote:
> We have two bits available in the encoded page pointer to store
> additional information. Currently, we use one bit to request delay of the
> rmap removal until after a TLB flush.
> 
> We want to make use of the remaining bit internally for batching of
> multiple pages of the same folio, specifying that the next encoded page
> pointer in an array is actually "nr_pages". So pass page + delay_rmap flag
> instead of an encoded page, to handle the encoding internally.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> ---
>  arch/s390/include/asm/tlb.h | 13 ++++++-------
>  include/asm-generic/tlb.h   | 12 ++++++------
>  mm/mmu_gather.c             |  7 ++++---
>  3 files changed, 16 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
> index d1455a601adc..48df896d5b79 100644
> --- a/arch/s390/include/asm/tlb.h
> +++ b/arch/s390/include/asm/tlb.h
> @@ -25,8 +25,7 @@
>  void __tlb_remove_table(void *_table);
>  static inline void tlb_flush(struct mmu_gather *tlb);
>  static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
> -					  struct encoded_page *page,
> -					  int page_size);
> +		struct page *page, bool delay_rmap, int page_size);
>  
>  #define tlb_flush tlb_flush
>  #define pte_free_tlb pte_free_tlb
> @@ -42,14 +41,14 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
>   * tlb_ptep_clear_flush. In both flush modes the tlb for a page cache page
>   * has already been freed, so just do free_page_and_swap_cache.
>   *
> - * s390 doesn't delay rmap removal, so there is nothing encoded in
> - * the page pointer.
> + * s390 doesn't delay rmap removal.
>   */
>  static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
> -					  struct encoded_page *page,
> -					  int page_size)
> +		struct page *page, bool delay_rmap, int page_size)
>  {
> -	free_page_and_swap_cache(encoded_page_ptr(page));
> +	VM_WARN_ON_ONCE(delay_rmap);
> +
> +	free_page_and_swap_cache(page);
>  	return false;
>  }
>  
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 129a3a759976..2eb7b0d4f5d2 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -260,9 +260,8 @@ struct mmu_gather_batch {
>   */
>  #define MAX_GATHER_BATCH_COUNT	(10000UL/MAX_GATHER_BATCH)
>  
> -extern bool __tlb_remove_page_size(struct mmu_gather *tlb,
> -				   struct encoded_page *page,
> -				   int page_size);
> +extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
> +		bool delay_rmap, int page_size);
>  
>  #ifdef CONFIG_SMP
>  /*
> @@ -462,13 +461,14 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
>  static inline void tlb_remove_page_size(struct mmu_gather *tlb,
>  					struct page *page, int page_size)
>  {
> -	if (__tlb_remove_page_size(tlb, encode_page(page, 0), page_size))
> +	if (__tlb_remove_page_size(tlb, page, false, page_size))
>  		tlb_flush_mmu(tlb);
>  }
>  
> -static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page, unsigned int flags)
> +static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb,
> +		struct page *page, bool delay_rmap)
>  {
> -	return __tlb_remove_page_size(tlb, encode_page(page, flags), PAGE_SIZE);
> +	return __tlb_remove_page_size(tlb, page, delay_rmap, PAGE_SIZE);
>  }
>  
>  /* tlb_remove_page
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 604ddf08affe..ac733d81b112 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -116,7 +116,8 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
>  	tlb->local.next = NULL;
>  }
>  
> -bool __tlb_remove_page_size(struct mmu_gather *tlb, struct encoded_page *page, int page_size)
> +bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
> +		bool delay_rmap, int page_size)
>  {
>  	struct mmu_gather_batch *batch;
>  
> @@ -131,13 +132,13 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct encoded_page *page, i
>  	 * Add the page and check if we are full. If so
>  	 * force a flush.
>  	 */
> -	batch->encoded_pages[batch->nr++] = page;
> +	batch->encoded_pages[batch->nr++] = encode_page(page, delay_rmap);
>  	if (batch->nr == batch->max) {
>  		if (!tlb_next_batch(tlb))
>  			return true;
>  		batch = tlb->active;
>  	}
> -	VM_BUG_ON_PAGE(batch->nr > batch->max, encoded_page_ptr(page));
> +	VM_BUG_ON_PAGE(batch->nr > batch->max, page);
>  
>  	return false;
>  }



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v1 6/9] mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP
  2024-01-29 14:32 [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP David Hildenbrand
                   ` (4 preceding siblings ...)
  2024-01-29 14:32 ` [PATCH v1 5/9] mm/mmu_gather: pass "delay_rmap" instead of encoded page to __tlb_remove_page_size() David Hildenbrand
@ 2024-01-29 14:32 ` David Hildenbrand
  2024-01-30  9:03   ` Ryan Roberts
  2024-01-29 14:32 ` [PATCH v1 7/9] mm/mmu_gather: add __tlb_remove_folio_pages() David Hildenbrand
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2024-01-29 14:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Ryan Roberts, Catalin Marinas, Will Deacon, Aneesh Kumar K.V,
	Nick Piggin, Peter Zijlstra, Michael Ellerman, Christophe Leroy,
	Naveen N. Rao, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Arnd Bergmann, linux-arch,
	linuxppc-dev, linux-s390

Nowadays, encoded pages are only used in mmu_gather handling. Let's
update the documentation, and define ENCODED_PAGE_BIT_DELAY_RMAP. While at
it, rename ENCODE_PAGE_BITS to ENCODED_PAGE_BITS.

If encoded page pointers would ever be used in other context again, we'd
likely want to change the defines to reflect their context (e.g.,
ENCODED_PAGE_FLAG_MMU_GATHER_DELAY_RMAP). For now, let's keep it simple.

This is a preparation for using the remaining spare bit to indicate that
the next item in an array of encoded pages is a "nr_pages" argument and
not an encoded page.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm_types.h | 17 +++++++++++------
 mm/mmu_gather.c          |  5 +++--
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8b611e13153e..1b89eec0d6df 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -210,8 +210,8 @@ struct page {
  *
  * An 'encoded_page' pointer is a pointer to a regular 'struct page', but
  * with the low bits of the pointer indicating extra context-dependent
- * information. Not super-common, but happens in mmu_gather and mlock
- * handling, and this acts as a type system check on that use.
+ * information. Only used in mmu_gather handling, and this acts as a type
+ * system check on that use.
  *
  * We only really have two guaranteed bits in general, although you could
  * play with 'struct page' alignment (see CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
@@ -220,21 +220,26 @@ struct page {
  * Use the supplied helper functions to endcode/decode the pointer and bits.
  */
 struct encoded_page;
-#define ENCODE_PAGE_BITS 3ul
+
+#define ENCODED_PAGE_BITS			3ul
+
+/* Perform rmap removal after we have flushed the TLB. */
+#define ENCODED_PAGE_BIT_DELAY_RMAP		1ul
+
 static __always_inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
 {
-	BUILD_BUG_ON(flags > ENCODE_PAGE_BITS);
+	BUILD_BUG_ON(flags > ENCODED_PAGE_BITS);
 	return (struct encoded_page *)(flags | (unsigned long)page);
 }
 
 static inline unsigned long encoded_page_flags(struct encoded_page *page)
 {
-	return ENCODE_PAGE_BITS & (unsigned long)page;
+	return ENCODED_PAGE_BITS & (unsigned long)page;
 }
 
 static inline struct page *encoded_page_ptr(struct encoded_page *page)
 {
-	return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page);
+	return (struct page *)(~ENCODED_PAGE_BITS & (unsigned long)page);
 }
 
 /*
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index ac733d81b112..6540c99c6758 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -53,7 +53,7 @@ static void tlb_flush_rmap_batch(struct mmu_gather_batch *batch, struct vm_area_
 	for (int i = 0; i < batch->nr; i++) {
 		struct encoded_page *enc = batch->encoded_pages[i];
 
-		if (encoded_page_flags(enc)) {
+		if (encoded_page_flags(enc) & ENCODED_PAGE_BIT_DELAY_RMAP) {
 			struct page *page = encoded_page_ptr(enc);
 			folio_remove_rmap_pte(page_folio(page), page, vma);
 		}
@@ -119,6 +119,7 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
 bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
 		bool delay_rmap, int page_size)
 {
+	int flags = delay_rmap ? ENCODED_PAGE_BIT_DELAY_RMAP : 0;
 	struct mmu_gather_batch *batch;
 
 	VM_BUG_ON(!tlb->end);
@@ -132,7 +133,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
 	 * Add the page and check if we are full. If so
 	 * force a flush.
 	 */
-	batch->encoded_pages[batch->nr++] = encode_page(page, delay_rmap);
+	batch->encoded_pages[batch->nr++] = encode_page(page, flags);
 	if (batch->nr == batch->max) {
 		if (!tlb_next_batch(tlb))
 			return true;
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 6/9] mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP
  2024-01-29 14:32 ` [PATCH v1 6/9] mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP David Hildenbrand
@ 2024-01-30  9:03   ` Ryan Roberts
  0 siblings, 0 replies; 40+ messages in thread
From: Ryan Roberts @ 2024-01-30  9:03 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Catalin Marinas,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra,
	Michael Ellerman, Christophe Leroy, Naveen N. Rao, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Arnd Bergmann, linux-arch, linuxppc-dev,
	linux-s390

On 29/01/2024 14:32, David Hildenbrand wrote:
> Nowadays, encoded pages are only used in mmu_gather handling. Let's
> update the documentation, and define ENCODED_PAGE_BIT_DELAY_RMAP. While at
> it, rename ENCODE_PAGE_BITS to ENCODED_PAGE_BITS.
> 
> If encoded page pointers would ever be used in other context again, we'd
> likely want to change the defines to reflect their context (e.g.,
> ENCODED_PAGE_FLAG_MMU_GATHER_DELAY_RMAP). For now, let's keep it simple.
> 
> This is a preparation for using the remaining spare bit to indicate that
> the next item in an array of encoded pages is a "nr_pages" argument and
> not an encoded page.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> ---
>  include/linux/mm_types.h | 17 +++++++++++------
>  mm/mmu_gather.c          |  5 +++--
>  2 files changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 8b611e13153e..1b89eec0d6df 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -210,8 +210,8 @@ struct page {
>   *
>   * An 'encoded_page' pointer is a pointer to a regular 'struct page', but
>   * with the low bits of the pointer indicating extra context-dependent
> - * information. Not super-common, but happens in mmu_gather and mlock
> - * handling, and this acts as a type system check on that use.
> + * information. Only used in mmu_gather handling, and this acts as a type
> + * system check on that use.
>   *
>   * We only really have two guaranteed bits in general, although you could
>   * play with 'struct page' alignment (see CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
> @@ -220,21 +220,26 @@ struct page {
>   * Use the supplied helper functions to endcode/decode the pointer and bits.
>   */
>  struct encoded_page;
> -#define ENCODE_PAGE_BITS 3ul
> +
> +#define ENCODED_PAGE_BITS			3ul
> +
> +/* Perform rmap removal after we have flushed the TLB. */
> +#define ENCODED_PAGE_BIT_DELAY_RMAP		1ul
> +
>  static __always_inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
>  {
> -	BUILD_BUG_ON(flags > ENCODE_PAGE_BITS);
> +	BUILD_BUG_ON(flags > ENCODED_PAGE_BITS);
>  	return (struct encoded_page *)(flags | (unsigned long)page);
>  }
>  
>  static inline unsigned long encoded_page_flags(struct encoded_page *page)
>  {
> -	return ENCODE_PAGE_BITS & (unsigned long)page;
> +	return ENCODED_PAGE_BITS & (unsigned long)page;
>  }
>  
>  static inline struct page *encoded_page_ptr(struct encoded_page *page)
>  {
> -	return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page);
> +	return (struct page *)(~ENCODED_PAGE_BITS & (unsigned long)page);
>  }
>  
>  /*
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index ac733d81b112..6540c99c6758 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -53,7 +53,7 @@ static void tlb_flush_rmap_batch(struct mmu_gather_batch *batch, struct vm_area_
>  	for (int i = 0; i < batch->nr; i++) {
>  		struct encoded_page *enc = batch->encoded_pages[i];
>  
> -		if (encoded_page_flags(enc)) {
> +		if (encoded_page_flags(enc) & ENCODED_PAGE_BIT_DELAY_RMAP) {
>  			struct page *page = encoded_page_ptr(enc);
>  			folio_remove_rmap_pte(page_folio(page), page, vma);
>  		}
> @@ -119,6 +119,7 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
>  bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
>  		bool delay_rmap, int page_size)
>  {
> +	int flags = delay_rmap ? ENCODED_PAGE_BIT_DELAY_RMAP : 0;
>  	struct mmu_gather_batch *batch;
>  
>  	VM_BUG_ON(!tlb->end);
> @@ -132,7 +133,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
>  	 * Add the page and check if we are full. If so
>  	 * force a flush.
>  	 */
> -	batch->encoded_pages[batch->nr++] = encode_page(page, delay_rmap);
> +	batch->encoded_pages[batch->nr++] = encode_page(page, flags);
>  	if (batch->nr == batch->max) {
>  		if (!tlb_next_batch(tlb))
>  			return true;



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v1 7/9] mm/mmu_gather: add __tlb_remove_folio_pages()
  2024-01-29 14:32 [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP David Hildenbrand
                   ` (5 preceding siblings ...)
  2024-01-29 14:32 ` [PATCH v1 6/9] mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP David Hildenbrand
@ 2024-01-29 14:32 ` David Hildenbrand
  2024-01-30  9:21   ` Ryan Roberts
  2024-01-29 14:32 ` [PATCH v1 8/9] mm/mmu_gather: add tlb_remove_tlb_entries() David Hildenbrand
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2024-01-29 14:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Ryan Roberts, Catalin Marinas, Will Deacon, Aneesh Kumar K.V,
	Nick Piggin, Peter Zijlstra, Michael Ellerman, Christophe Leroy,
	Naveen N. Rao, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Arnd Bergmann, linux-arch,
	linuxppc-dev, linux-s390

Add __tlb_remove_folio_pages(), which will remove multiple consecutive
pages that belong to the same large folio, instead of only a single
page. We'll be using this function when optimizing unmapping/zapping of
large folios that are mapped by PTEs.

We're using the remaining spare bit in an encoded_page to indicate that
the next enoced page in an array contains actually shifted "nr_pages".
Teach swap/freeing code about putting multiple folio references, and
delayed rmap handling to remove page ranges of a folio.

This extension allows for still gathering almost as many small folios
as we used to (-1, because we have to prepare for a possibly bigger next
entry), but still allows for gathering consecutive pages that belong to the
same large folio.

Note that we don't pass the folio pointer, because it is not required for
now. Further, we don't support page_size != PAGE_SIZE, it won't be
required for simple PTE batching.

We have to provide a separate s390 implementation, but it's fairly
straight forward.

Another, more invasive and likely more expensive, approach would be to
use folio+range or a PFN range instead of page+nr_pages. But, we should
do that consistently for the whole mmu_gather. For now, let's keep it
simple and add "nr_pages" only.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/include/asm/tlb.h | 17 +++++++++++
 include/asm-generic/tlb.h   |  8 +++++
 include/linux/mm_types.h    | 20 ++++++++++++
 mm/mmu_gather.c             | 61 +++++++++++++++++++++++++++++++------
 mm/swap.c                   | 12 ++++++--
 mm/swap_state.c             | 12 ++++++--
 6 files changed, 116 insertions(+), 14 deletions(-)

diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 48df896d5b79..abfd2bf29e9e 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -26,6 +26,8 @@ void __tlb_remove_table(void *_table);
 static inline void tlb_flush(struct mmu_gather *tlb);
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
 		struct page *page, bool delay_rmap, int page_size);
+static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
+		struct page *page, unsigned int nr_pages, bool delay_rmap);
 
 #define tlb_flush tlb_flush
 #define pte_free_tlb pte_free_tlb
@@ -52,6 +54,21 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
 	return false;
 }
 
+static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
+		struct page *page, unsigned int nr_pages, bool delay_rmap)
+{
+	struct encoded_page *encoded_pages[] = {
+		encode_page(page, ENCODED_PAGE_BIT_NR_PAGES),
+		encode_nr_pages(nr_pages),
+	};
+
+	VM_WARN_ON_ONCE(delay_rmap);
+	VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1));
+
+	free_pages_and_swap_cache(encoded_pages, ARRAY_SIZE(encoded_pages));
+	return false;
+}
+
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
 	__tlb_flush_mm_lazy(tlb->mm);
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 2eb7b0d4f5d2..428c3f93addc 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -69,6 +69,7 @@
  *
  *  - tlb_remove_page() / __tlb_remove_page()
  *  - tlb_remove_page_size() / __tlb_remove_page_size()
+ *  - __tlb_remove_folio_pages()
  *
  *    __tlb_remove_page_size() is the basic primitive that queues a page for
  *    freeing. __tlb_remove_page() assumes PAGE_SIZE. Both will return a
@@ -78,6 +79,11 @@
  *    tlb_remove_page() and tlb_remove_page_size() imply the call to
  *    tlb_flush_mmu() when required and has no return value.
  *
+ *    __tlb_remove_folio_pages() is similar to __tlb_remove_page(), however,
+ *    instead of removing a single page, remove the given number of consecutive
+ *    pages that are all part of the same (large) folio: just like calling
+ *    __tlb_remove_page() on each page individually.
+ *
  *  - tlb_change_page_size()
  *
  *    call before __tlb_remove_page*() to set the current page-size; implies a
@@ -262,6 +268,8 @@ struct mmu_gather_batch {
 
 extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
 		bool delay_rmap, int page_size);
+bool __tlb_remove_folio_pages(struct mmu_gather *tlb, struct page *page,
+		unsigned int nr_pages, bool delay_rmap);
 
 #ifdef CONFIG_SMP
 /*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1b89eec0d6df..198662b7a39a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -226,6 +226,15 @@ struct encoded_page;
 /* Perform rmap removal after we have flushed the TLB. */
 #define ENCODED_PAGE_BIT_DELAY_RMAP		1ul
 
+/*
+ * The next item in an encoded_page array is the "nr_pages" argument, specifying
+ * the number of consecutive pages starting from this page, that all belong to
+ * the same folio. For example, "nr_pages" corresponds to the number of folio
+ * references that must be dropped. If this bit is not set, "nr_pages" is
+ * implicitly 1.
+ */
+#define ENCODED_PAGE_BIT_NR_PAGES		2ul
+
 static __always_inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
 {
 	BUILD_BUG_ON(flags > ENCODED_PAGE_BITS);
@@ -242,6 +251,17 @@ static inline struct page *encoded_page_ptr(struct encoded_page *page)
 	return (struct page *)(~ENCODED_PAGE_BITS & (unsigned long)page);
 }
 
+static __always_inline struct encoded_page *encode_nr_pages(unsigned long nr)
+{
+	VM_WARN_ON_ONCE((nr << 2) >> 2 != nr);
+	return (struct encoded_page *)(nr << 2);
+}
+
+static __always_inline unsigned long encoded_nr_pages(struct encoded_page *page)
+{
+	return ((unsigned long)page) >> 2;
+}
+
 /*
  * A swap entry has to fit into a "unsigned long", as the entry is hidden
  * in the "index" field of the swapper address space.
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 6540c99c6758..dba1973dfe25 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -50,12 +50,21 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
 #ifdef CONFIG_SMP
 static void tlb_flush_rmap_batch(struct mmu_gather_batch *batch, struct vm_area_struct *vma)
 {
+	struct encoded_page **pages = batch->encoded_pages;
+
 	for (int i = 0; i < batch->nr; i++) {
-		struct encoded_page *enc = batch->encoded_pages[i];
+		struct encoded_page *enc = pages[i];
 
 		if (encoded_page_flags(enc) & ENCODED_PAGE_BIT_DELAY_RMAP) {
 			struct page *page = encoded_page_ptr(enc);
-			folio_remove_rmap_pte(page_folio(page), page, vma);
+			unsigned int nr_pages = 1;
+
+			if (unlikely(encoded_page_flags(enc) &
+				     ENCODED_PAGE_BIT_NR_PAGES))
+				nr_pages = encoded_nr_pages(pages[++i]);
+
+			folio_remove_rmap_ptes(page_folio(page), page, nr_pages,
+					       vma);
 		}
 	}
 }
@@ -89,18 +98,26 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
 	for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
 		struct encoded_page **pages = batch->encoded_pages;
 
-		do {
+		while (batch->nr) {
 			/*
 			 * limit free batch count when PAGE_SIZE > 4K
 			 */
 			unsigned int nr = min(512U, batch->nr);
 
+			/*
+			 * Make sure we cover page + nr_pages, and don't leave
+			 * nr_pages behind when capping the number of entries.
+			 */
+			if (unlikely(encoded_page_flags(pages[nr - 1]) &
+				     ENCODED_PAGE_BIT_NR_PAGES))
+				nr++;
+
 			free_pages_and_swap_cache(pages, nr);
 			pages += nr;
 			batch->nr -= nr;
 
 			cond_resched();
-		} while (batch->nr);
+		}
 	}
 	tlb->active = &tlb->local;
 }
@@ -116,8 +133,9 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
 	tlb->local.next = NULL;
 }
 
-bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
-		bool delay_rmap, int page_size)
+static bool __tlb_remove_folio_pages_size(struct mmu_gather *tlb,
+		struct page *page, unsigned int nr_pages, bool delay_rmap,
+		int page_size)
 {
 	int flags = delay_rmap ? ENCODED_PAGE_BIT_DELAY_RMAP : 0;
 	struct mmu_gather_batch *batch;
@@ -126,6 +144,8 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
 
 #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
 	VM_WARN_ON(tlb->page_size != page_size);
+	VM_WARN_ON_ONCE(nr_pages != 1 && page_size != PAGE_SIZE);
+	VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1));
 #endif
 
 	batch = tlb->active;
@@ -133,17 +153,40 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
 	 * Add the page and check if we are full. If so
 	 * force a flush.
 	 */
-	batch->encoded_pages[batch->nr++] = encode_page(page, flags);
-	if (batch->nr == batch->max) {
+	if (likely(nr_pages == 1)) {
+		batch->encoded_pages[batch->nr++] = encode_page(page, flags);
+	} else {
+		flags |= ENCODED_PAGE_BIT_NR_PAGES;
+		batch->encoded_pages[batch->nr++] = encode_page(page, flags);
+		batch->encoded_pages[batch->nr++] = encode_nr_pages(nr_pages);
+	}
+	/*
+	 * Make sure that we can always add another "page" + "nr_pages",
+	 * requiring two entries instead of only a single one.
+	 */
+	if (batch->nr >= batch->max - 1) {
 		if (!tlb_next_batch(tlb))
 			return true;
 		batch = tlb->active;
 	}
-	VM_BUG_ON_PAGE(batch->nr > batch->max, page);
+	VM_BUG_ON_PAGE(batch->nr > batch->max - 1, page);
 
 	return false;
 }
 
+bool __tlb_remove_folio_pages(struct mmu_gather *tlb, struct page *page,
+		unsigned int nr_pages, bool delay_rmap)
+{
+	return __tlb_remove_folio_pages_size(tlb, page, nr_pages, delay_rmap,
+					     PAGE_SIZE);
+}
+
+bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
+		bool delay_rmap, int page_size)
+{
+	return __tlb_remove_folio_pages_size(tlb, page, 1, delay_rmap, page_size);
+}
+
 #endif /* MMU_GATHER_NO_GATHER */
 
 #ifdef CONFIG_MMU_GATHER_TABLE_FREE
diff --git a/mm/swap.c b/mm/swap.c
index cd8f0150ba3a..2a217520b80b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -967,11 +967,17 @@ void release_pages(release_pages_arg arg, int nr)
 	unsigned int lock_batch;
 
 	for (i = 0; i < nr; i++) {
+		unsigned int nr_refs = 1;
 		struct folio *folio;
 
 		/* Turn any of the argument types into a folio */
 		folio = page_folio(encoded_page_ptr(encoded[i]));
 
+		/* Is our next entry actually "nr_pages" -> "nr_refs" ? */
+		if (unlikely(encoded_page_flags(encoded[i]) &
+			     ENCODED_PAGE_BIT_NR_PAGES))
+			nr_refs = encoded_nr_pages(encoded[++i]);
+
 		/*
 		 * Make sure the IRQ-safe lock-holding time does not get
 		 * excessive with a continuous string of pages from the
@@ -990,14 +996,14 @@ void release_pages(release_pages_arg arg, int nr)
 				unlock_page_lruvec_irqrestore(lruvec, flags);
 				lruvec = NULL;
 			}
-			if (put_devmap_managed_page(&folio->page))
+			if (put_devmap_managed_page_refs(&folio->page, nr_refs))
 				continue;
-			if (folio_put_testzero(folio))
+			if (folio_ref_sub_and_test(folio, nr_refs))
 				free_zone_device_page(&folio->page);
 			continue;
 		}
 
-		if (!folio_put_testzero(folio))
+		if (!folio_ref_sub_and_test(folio, nr_refs))
 			continue;
 
 		if (folio_test_large(folio)) {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e671266ad772..ae0c0f1f51bd 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -311,8 +311,16 @@ void free_page_and_swap_cache(struct page *page)
 void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
 {
 	lru_add_drain();
-	for (int i = 0; i < nr; i++)
-		free_swap_cache(encoded_page_ptr(pages[i]));
+	for (int i = 0; i < nr; i++) {
+		struct page *page = encoded_page_ptr(pages[i]);
+
+		/* Skip over "nr_pages". Only call it once for the folio. */
+		if (unlikely(encoded_page_flags(pages[i]) &
+			     ENCODED_PAGE_BIT_NR_PAGES))
+			i++;
+
+		free_swap_cache(page);
+	}
 	release_pages(pages, nr);
 }
 
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 7/9] mm/mmu_gather: add __tlb_remove_folio_pages()
  2024-01-29 14:32 ` [PATCH v1 7/9] mm/mmu_gather: add __tlb_remove_folio_pages() David Hildenbrand
@ 2024-01-30  9:21   ` Ryan Roberts
  2024-01-30  9:33     ` David Hildenbrand
  0 siblings, 1 reply; 40+ messages in thread
From: Ryan Roberts @ 2024-01-30  9:21 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Catalin Marinas,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra,
	Michael Ellerman, Christophe Leroy, Naveen N. Rao, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Arnd Bergmann, linux-arch, linuxppc-dev,
	linux-s390

On 29/01/2024 14:32, David Hildenbrand wrote:
> Add __tlb_remove_folio_pages(), which will remove multiple consecutive
> pages that belong to the same large folio, instead of only a single
> page. We'll be using this function when optimizing unmapping/zapping of
> large folios that are mapped by PTEs.
> 
> We're using the remaining spare bit in an encoded_page to indicate that
> the next enoced page in an array contains actually shifted "nr_pages".
> Teach swap/freeing code about putting multiple folio references, and
> delayed rmap handling to remove page ranges of a folio.
> 
> This extension allows for still gathering almost as many small folios
> as we used to (-1, because we have to prepare for a possibly bigger next
> entry), but still allows for gathering consecutive pages that belong to the
> same large folio.
> 
> Note that we don't pass the folio pointer, because it is not required for
> now. Further, we don't support page_size != PAGE_SIZE, it won't be
> required for simple PTE batching.
> 
> We have to provide a separate s390 implementation, but it's fairly
> straight forward.
> 
> Another, more invasive and likely more expensive, approach would be to
> use folio+range or a PFN range instead of page+nr_pages. But, we should
> do that consistently for the whole mmu_gather. For now, let's keep it
> simple and add "nr_pages" only.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  arch/s390/include/asm/tlb.h | 17 +++++++++++
>  include/asm-generic/tlb.h   |  8 +++++
>  include/linux/mm_types.h    | 20 ++++++++++++
>  mm/mmu_gather.c             | 61 +++++++++++++++++++++++++++++++------
>  mm/swap.c                   | 12 ++++++--
>  mm/swap_state.c             | 12 ++++++--
>  6 files changed, 116 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
> index 48df896d5b79..abfd2bf29e9e 100644
> --- a/arch/s390/include/asm/tlb.h
> +++ b/arch/s390/include/asm/tlb.h
> @@ -26,6 +26,8 @@ void __tlb_remove_table(void *_table);
>  static inline void tlb_flush(struct mmu_gather *tlb);
>  static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
>  		struct page *page, bool delay_rmap, int page_size);
> +static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
> +		struct page *page, unsigned int nr_pages, bool delay_rmap);
>  
>  #define tlb_flush tlb_flush
>  #define pte_free_tlb pte_free_tlb
> @@ -52,6 +54,21 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
>  	return false;
>  }
>  
> +static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
> +		struct page *page, unsigned int nr_pages, bool delay_rmap)
> +{
> +	struct encoded_page *encoded_pages[] = {
> +		encode_page(page, ENCODED_PAGE_BIT_NR_PAGES),
> +		encode_nr_pages(nr_pages),
> +	};
> +
> +	VM_WARN_ON_ONCE(delay_rmap);
> +	VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1));
> +
> +	free_pages_and_swap_cache(encoded_pages, ARRAY_SIZE(encoded_pages));
> +	return false;
> +}
> +
>  static inline void tlb_flush(struct mmu_gather *tlb)
>  {
>  	__tlb_flush_mm_lazy(tlb->mm);
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 2eb7b0d4f5d2..428c3f93addc 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -69,6 +69,7 @@
>   *
>   *  - tlb_remove_page() / __tlb_remove_page()
>   *  - tlb_remove_page_size() / __tlb_remove_page_size()
> + *  - __tlb_remove_folio_pages()
>   *
>   *    __tlb_remove_page_size() is the basic primitive that queues a page for
>   *    freeing. __tlb_remove_page() assumes PAGE_SIZE. Both will return a
> @@ -78,6 +79,11 @@
>   *    tlb_remove_page() and tlb_remove_page_size() imply the call to
>   *    tlb_flush_mmu() when required and has no return value.
>   *
> + *    __tlb_remove_folio_pages() is similar to __tlb_remove_page(), however,
> + *    instead of removing a single page, remove the given number of consecutive
> + *    pages that are all part of the same (large) folio: just like calling
> + *    __tlb_remove_page() on each page individually.
> + *
>   *  - tlb_change_page_size()
>   *
>   *    call before __tlb_remove_page*() to set the current page-size; implies a
> @@ -262,6 +268,8 @@ struct mmu_gather_batch {
>  
>  extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
>  		bool delay_rmap, int page_size);
> +bool __tlb_remove_folio_pages(struct mmu_gather *tlb, struct page *page,
> +		unsigned int nr_pages, bool delay_rmap);
>  
>  #ifdef CONFIG_SMP
>  /*
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 1b89eec0d6df..198662b7a39a 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -226,6 +226,15 @@ struct encoded_page;
>  /* Perform rmap removal after we have flushed the TLB. */
>  #define ENCODED_PAGE_BIT_DELAY_RMAP		1ul
>  
> +/*
> + * The next item in an encoded_page array is the "nr_pages" argument, specifying
> + * the number of consecutive pages starting from this page, that all belong to
> + * the same folio. For example, "nr_pages" corresponds to the number of folio
> + * references that must be dropped. If this bit is not set, "nr_pages" is
> + * implicitly 1.
> + */
> +#define ENCODED_PAGE_BIT_NR_PAGES		2ul

nit: Perhaps this should be called ENCODED_PAGE_BIT_NR_PAGES_NEXT? There are a
couple of places where you check for this bit on the current entry and advance
to the next. So the "_NEXT" might make things clearer?

Otherwise, LGTM!

> +
>  static __always_inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
>  {
>  	BUILD_BUG_ON(flags > ENCODED_PAGE_BITS);
> @@ -242,6 +251,17 @@ static inline struct page *encoded_page_ptr(struct encoded_page *page)
>  	return (struct page *)(~ENCODED_PAGE_BITS & (unsigned long)page);
>  }
>  
> +static __always_inline struct encoded_page *encode_nr_pages(unsigned long nr)
> +{
> +	VM_WARN_ON_ONCE((nr << 2) >> 2 != nr);
> +	return (struct encoded_page *)(nr << 2);
> +}
> +
> +static __always_inline unsigned long encoded_nr_pages(struct encoded_page *page)
> +{
> +	return ((unsigned long)page) >> 2;
> +}
> +
>  /*
>   * A swap entry has to fit into a "unsigned long", as the entry is hidden
>   * in the "index" field of the swapper address space.
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 6540c99c6758..dba1973dfe25 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -50,12 +50,21 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
>  #ifdef CONFIG_SMP
>  static void tlb_flush_rmap_batch(struct mmu_gather_batch *batch, struct vm_area_struct *vma)
>  {
> +	struct encoded_page **pages = batch->encoded_pages;
> +
>  	for (int i = 0; i < batch->nr; i++) {
> -		struct encoded_page *enc = batch->encoded_pages[i];
> +		struct encoded_page *enc = pages[i];
>  
>  		if (encoded_page_flags(enc) & ENCODED_PAGE_BIT_DELAY_RMAP) {
>  			struct page *page = encoded_page_ptr(enc);
> -			folio_remove_rmap_pte(page_folio(page), page, vma);
> +			unsigned int nr_pages = 1;
> +
> +			if (unlikely(encoded_page_flags(enc) &
> +				     ENCODED_PAGE_BIT_NR_PAGES))
> +				nr_pages = encoded_nr_pages(pages[++i]);
> +
> +			folio_remove_rmap_ptes(page_folio(page), page, nr_pages,
> +					       vma);
>  		}
>  	}
>  }
> @@ -89,18 +98,26 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
>  	for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
>  		struct encoded_page **pages = batch->encoded_pages;
>  
> -		do {
> +		while (batch->nr) {
>  			/*
>  			 * limit free batch count when PAGE_SIZE > 4K
>  			 */
>  			unsigned int nr = min(512U, batch->nr);
>  
> +			/*
> +			 * Make sure we cover page + nr_pages, and don't leave
> +			 * nr_pages behind when capping the number of entries.
> +			 */
> +			if (unlikely(encoded_page_flags(pages[nr - 1]) &
> +				     ENCODED_PAGE_BIT_NR_PAGES))
> +				nr++;
> +
>  			free_pages_and_swap_cache(pages, nr);
>  			pages += nr;
>  			batch->nr -= nr;
>  
>  			cond_resched();
> -		} while (batch->nr);
> +		}
>  	}
>  	tlb->active = &tlb->local;
>  }
> @@ -116,8 +133,9 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
>  	tlb->local.next = NULL;
>  }
>  
> -bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
> -		bool delay_rmap, int page_size)
> +static bool __tlb_remove_folio_pages_size(struct mmu_gather *tlb,
> +		struct page *page, unsigned int nr_pages, bool delay_rmap,
> +		int page_size)
>  {
>  	int flags = delay_rmap ? ENCODED_PAGE_BIT_DELAY_RMAP : 0;
>  	struct mmu_gather_batch *batch;
> @@ -126,6 +144,8 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
>  
>  #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
>  	VM_WARN_ON(tlb->page_size != page_size);
> +	VM_WARN_ON_ONCE(nr_pages != 1 && page_size != PAGE_SIZE);
> +	VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1));
>  #endif
>  
>  	batch = tlb->active;
> @@ -133,17 +153,40 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
>  	 * Add the page and check if we are full. If so
>  	 * force a flush.
>  	 */
> -	batch->encoded_pages[batch->nr++] = encode_page(page, flags);
> -	if (batch->nr == batch->max) {
> +	if (likely(nr_pages == 1)) {
> +		batch->encoded_pages[batch->nr++] = encode_page(page, flags);
> +	} else {
> +		flags |= ENCODED_PAGE_BIT_NR_PAGES;
> +		batch->encoded_pages[batch->nr++] = encode_page(page, flags);
> +		batch->encoded_pages[batch->nr++] = encode_nr_pages(nr_pages);
> +	}
> +	/*
> +	 * Make sure that we can always add another "page" + "nr_pages",
> +	 * requiring two entries instead of only a single one.
> +	 */
> +	if (batch->nr >= batch->max - 1) {
>  		if (!tlb_next_batch(tlb))
>  			return true;
>  		batch = tlb->active;
>  	}
> -	VM_BUG_ON_PAGE(batch->nr > batch->max, page);
> +	VM_BUG_ON_PAGE(batch->nr > batch->max - 1, page);
>  
>  	return false;
>  }
>  
> +bool __tlb_remove_folio_pages(struct mmu_gather *tlb, struct page *page,
> +		unsigned int nr_pages, bool delay_rmap)
> +{
> +	return __tlb_remove_folio_pages_size(tlb, page, nr_pages, delay_rmap,
> +					     PAGE_SIZE);
> +}
> +
> +bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
> +		bool delay_rmap, int page_size)
> +{
> +	return __tlb_remove_folio_pages_size(tlb, page, 1, delay_rmap, page_size);
> +}
> +
>  #endif /* MMU_GATHER_NO_GATHER */
>  
>  #ifdef CONFIG_MMU_GATHER_TABLE_FREE
> diff --git a/mm/swap.c b/mm/swap.c
> index cd8f0150ba3a..2a217520b80b 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -967,11 +967,17 @@ void release_pages(release_pages_arg arg, int nr)
>  	unsigned int lock_batch;
>  
>  	for (i = 0; i < nr; i++) {
> +		unsigned int nr_refs = 1;
>  		struct folio *folio;
>  
>  		/* Turn any of the argument types into a folio */
>  		folio = page_folio(encoded_page_ptr(encoded[i]));
>  
> +		/* Is our next entry actually "nr_pages" -> "nr_refs" ? */
> +		if (unlikely(encoded_page_flags(encoded[i]) &
> +			     ENCODED_PAGE_BIT_NR_PAGES))
> +			nr_refs = encoded_nr_pages(encoded[++i]);
> +
>  		/*
>  		 * Make sure the IRQ-safe lock-holding time does not get
>  		 * excessive with a continuous string of pages from the
> @@ -990,14 +996,14 @@ void release_pages(release_pages_arg arg, int nr)
>  				unlock_page_lruvec_irqrestore(lruvec, flags);
>  				lruvec = NULL;
>  			}
> -			if (put_devmap_managed_page(&folio->page))
> +			if (put_devmap_managed_page_refs(&folio->page, nr_refs))
>  				continue;
> -			if (folio_put_testzero(folio))
> +			if (folio_ref_sub_and_test(folio, nr_refs))
>  				free_zone_device_page(&folio->page);
>  			continue;
>  		}
>  
> -		if (!folio_put_testzero(folio))
> +		if (!folio_ref_sub_and_test(folio, nr_refs))
>  			continue;
>  
>  		if (folio_test_large(folio)) {
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index e671266ad772..ae0c0f1f51bd 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -311,8 +311,16 @@ void free_page_and_swap_cache(struct page *page)
>  void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
>  {
>  	lru_add_drain();
> -	for (int i = 0; i < nr; i++)
> -		free_swap_cache(encoded_page_ptr(pages[i]));
> +	for (int i = 0; i < nr; i++) {
> +		struct page *page = encoded_page_ptr(pages[i]);
> +
> +		/* Skip over "nr_pages". Only call it once for the folio. */
> +		if (unlikely(encoded_page_flags(pages[i]) &
> +			     ENCODED_PAGE_BIT_NR_PAGES))
> +			i++;
> +
> +		free_swap_cache(page);
> +	}
>  	release_pages(pages, nr);
>  }
>  



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 7/9] mm/mmu_gather: add __tlb_remove_folio_pages()
  2024-01-30  9:21   ` Ryan Roberts
@ 2024-01-30  9:33     ` David Hildenbrand
  0 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2024-01-30  9:33 UTC (permalink / raw)
  To: Ryan Roberts, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Catalin Marinas,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra,
	Michael Ellerman, Christophe Leroy, Naveen N. Rao, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Arnd Bergmann, linux-arch, linuxppc-dev,
	linux-s390

On 30.01.24 10:21, Ryan Roberts wrote:
> On 29/01/2024 14:32, David Hildenbrand wrote:
>> Add __tlb_remove_folio_pages(), which will remove multiple consecutive
>> pages that belong to the same large folio, instead of only a single
>> page. We'll be using this function when optimizing unmapping/zapping of
>> large folios that are mapped by PTEs.
>>
>> We're using the remaining spare bit in an encoded_page to indicate that
>> the next enoced page in an array contains actually shifted "nr_pages".
>> Teach swap/freeing code about putting multiple folio references, and
>> delayed rmap handling to remove page ranges of a folio.
>>
>> This extension allows for still gathering almost as many small folios
>> as we used to (-1, because we have to prepare for a possibly bigger next
>> entry), but still allows for gathering consecutive pages that belong to the
>> same large folio.
>>
>> Note that we don't pass the folio pointer, because it is not required for
>> now. Further, we don't support page_size != PAGE_SIZE, it won't be
>> required for simple PTE batching.
>>
>> We have to provide a separate s390 implementation, but it's fairly
>> straight forward.
>>
>> Another, more invasive and likely more expensive, approach would be to
>> use folio+range or a PFN range instead of page+nr_pages. But, we should
>> do that consistently for the whole mmu_gather. For now, let's keep it
>> simple and add "nr_pages" only.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   arch/s390/include/asm/tlb.h | 17 +++++++++++
>>   include/asm-generic/tlb.h   |  8 +++++
>>   include/linux/mm_types.h    | 20 ++++++++++++
>>   mm/mmu_gather.c             | 61 +++++++++++++++++++++++++++++++------
>>   mm/swap.c                   | 12 ++++++--
>>   mm/swap_state.c             | 12 ++++++--
>>   6 files changed, 116 insertions(+), 14 deletions(-)
>>
>> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
>> index 48df896d5b79..abfd2bf29e9e 100644
>> --- a/arch/s390/include/asm/tlb.h
>> +++ b/arch/s390/include/asm/tlb.h
>> @@ -26,6 +26,8 @@ void __tlb_remove_table(void *_table);
>>   static inline void tlb_flush(struct mmu_gather *tlb);
>>   static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
>>   		struct page *page, bool delay_rmap, int page_size);
>> +static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
>> +		struct page *page, unsigned int nr_pages, bool delay_rmap);
>>   
>>   #define tlb_flush tlb_flush
>>   #define pte_free_tlb pte_free_tlb
>> @@ -52,6 +54,21 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
>>   	return false;
>>   }
>>   
>> +static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
>> +		struct page *page, unsigned int nr_pages, bool delay_rmap)
>> +{
>> +	struct encoded_page *encoded_pages[] = {
>> +		encode_page(page, ENCODED_PAGE_BIT_NR_PAGES),
>> +		encode_nr_pages(nr_pages),
>> +	};
>> +
>> +	VM_WARN_ON_ONCE(delay_rmap);
>> +	VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1));
>> +
>> +	free_pages_and_swap_cache(encoded_pages, ARRAY_SIZE(encoded_pages));
>> +	return false;
>> +}
>> +
>>   static inline void tlb_flush(struct mmu_gather *tlb)
>>   {
>>   	__tlb_flush_mm_lazy(tlb->mm);
>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
>> index 2eb7b0d4f5d2..428c3f93addc 100644
>> --- a/include/asm-generic/tlb.h
>> +++ b/include/asm-generic/tlb.h
>> @@ -69,6 +69,7 @@
>>    *
>>    *  - tlb_remove_page() / __tlb_remove_page()
>>    *  - tlb_remove_page_size() / __tlb_remove_page_size()
>> + *  - __tlb_remove_folio_pages()
>>    *
>>    *    __tlb_remove_page_size() is the basic primitive that queues a page for
>>    *    freeing. __tlb_remove_page() assumes PAGE_SIZE. Both will return a
>> @@ -78,6 +79,11 @@
>>    *    tlb_remove_page() and tlb_remove_page_size() imply the call to
>>    *    tlb_flush_mmu() when required and has no return value.
>>    *
>> + *    __tlb_remove_folio_pages() is similar to __tlb_remove_page(), however,
>> + *    instead of removing a single page, remove the given number of consecutive
>> + *    pages that are all part of the same (large) folio: just like calling
>> + *    __tlb_remove_page() on each page individually.
>> + *
>>    *  - tlb_change_page_size()
>>    *
>>    *    call before __tlb_remove_page*() to set the current page-size; implies a
>> @@ -262,6 +268,8 @@ struct mmu_gather_batch {
>>   
>>   extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
>>   		bool delay_rmap, int page_size);
>> +bool __tlb_remove_folio_pages(struct mmu_gather *tlb, struct page *page,
>> +		unsigned int nr_pages, bool delay_rmap);
>>   
>>   #ifdef CONFIG_SMP
>>   /*
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 1b89eec0d6df..198662b7a39a 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -226,6 +226,15 @@ struct encoded_page;
>>   /* Perform rmap removal after we have flushed the TLB. */
>>   #define ENCODED_PAGE_BIT_DELAY_RMAP		1ul
>>   
>> +/*
>> + * The next item in an encoded_page array is the "nr_pages" argument, specifying
>> + * the number of consecutive pages starting from this page, that all belong to
>> + * the same folio. For example, "nr_pages" corresponds to the number of folio
>> + * references that must be dropped. If this bit is not set, "nr_pages" is
>> + * implicitly 1.
>> + */
>> +#define ENCODED_PAGE_BIT_NR_PAGES		2ul
> 
> nit: Perhaps this should be called ENCODED_PAGE_BIT_NR_PAGES_NEXT? There are a
> couple of places where you check for this bit on the current entry and advance
> to the next. So the "_NEXT" might make things clearer?

Yes, makes sense, thanks for the suggestion.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v1 8/9] mm/mmu_gather: add tlb_remove_tlb_entries()
  2024-01-29 14:32 [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP David Hildenbrand
                   ` (6 preceding siblings ...)
  2024-01-29 14:32 ` [PATCH v1 7/9] mm/mmu_gather: add __tlb_remove_folio_pages() David Hildenbrand
@ 2024-01-29 14:32 ` David Hildenbrand
  2024-01-30  9:33   ` Ryan Roberts
  2024-01-29 14:32 ` [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP David Hildenbrand
  2024-01-31  2:20 ` [PATCH v1 0/9] " Yin Fengwei
  9 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2024-01-29 14:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Ryan Roberts, Catalin Marinas, Will Deacon, Aneesh Kumar K.V,
	Nick Piggin, Peter Zijlstra, Michael Ellerman, Christophe Leroy,
	Naveen N. Rao, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Arnd Bergmann, linux-arch,
	linuxppc-dev, linux-s390

Let's add a helper that lets us batch-process multiple consecutive PTEs.

Note that the loop will get optimized out on all architectures except on
powerpc. We have to add an early define of __tlb_remove_tlb_entry() on
ppc to make the compiler happy (and avoid making tlb_remove_tlb_entries() a
macro).

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/powerpc/include/asm/tlb.h |  2 ++
 include/asm-generic/tlb.h      | 20 ++++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index b3de6102a907..1ca7d4c4b90d 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -19,6 +19,8 @@
 
 #include <linux/pagemap.h>
 
+static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
+					  unsigned long address);
 #define __tlb_remove_tlb_entry	__tlb_remove_tlb_entry
 
 #define tlb_flush tlb_flush
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 428c3f93addc..bd00dd238b79 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -616,6 +616,26 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 		__tlb_remove_tlb_entry(tlb, ptep, address);	\
 	} while (0)
 
+/**
+ * tlb_remove_tlb_entries - remember unmapping of multiple consecutive ptes for
+ *			    later tlb invalidation.
+ *
+ * Similar to tlb_remove_tlb_entry(), but remember unmapping of multiple
+ * consecutive ptes instead of only a single one.
+ */
+static inline void tlb_remove_tlb_entries(struct mmu_gather *tlb,
+		pte_t *ptep, unsigned int nr, unsigned long address)
+{
+	tlb_flush_pte_range(tlb, address, PAGE_SIZE * nr);
+	for (;;) {
+		__tlb_remove_tlb_entry(tlb, ptep, address);
+		if (--nr == 0)
+			break;
+		ptep++;
+		address += PAGE_SIZE;
+	}
+}
+
 #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)	\
 	do {							\
 		unsigned long _sz = huge_page_size(h);		\
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 8/9] mm/mmu_gather: add tlb_remove_tlb_entries()
  2024-01-29 14:32 ` [PATCH v1 8/9] mm/mmu_gather: add tlb_remove_tlb_entries() David Hildenbrand
@ 2024-01-30  9:33   ` Ryan Roberts
  0 siblings, 0 replies; 40+ messages in thread
From: Ryan Roberts @ 2024-01-30  9:33 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Catalin Marinas,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra,
	Michael Ellerman, Christophe Leroy, Naveen N. Rao, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Arnd Bergmann, linux-arch, linuxppc-dev,
	linux-s390

On 29/01/2024 14:32, David Hildenbrand wrote:
> Let's add a helper that lets us batch-process multiple consecutive PTEs.
> 
> Note that the loop will get optimized out on all architectures except on
> powerpc. We have to add an early define of __tlb_remove_tlb_entry() on
> ppc to make the compiler happy (and avoid making tlb_remove_tlb_entries() a
> macro).
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> ---
>  arch/powerpc/include/asm/tlb.h |  2 ++
>  include/asm-generic/tlb.h      | 20 ++++++++++++++++++++
>  2 files changed, 22 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
> index b3de6102a907..1ca7d4c4b90d 100644
> --- a/arch/powerpc/include/asm/tlb.h
> +++ b/arch/powerpc/include/asm/tlb.h
> @@ -19,6 +19,8 @@
>  
>  #include <linux/pagemap.h>
>  
> +static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
> +					  unsigned long address);
>  #define __tlb_remove_tlb_entry	__tlb_remove_tlb_entry
>  
>  #define tlb_flush tlb_flush
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 428c3f93addc..bd00dd238b79 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -616,6 +616,26 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
>  		__tlb_remove_tlb_entry(tlb, ptep, address);	\
>  	} while (0)
>  
> +/**
> + * tlb_remove_tlb_entries - remember unmapping of multiple consecutive ptes for
> + *			    later tlb invalidation.
> + *
> + * Similar to tlb_remove_tlb_entry(), but remember unmapping of multiple
> + * consecutive ptes instead of only a single one.
> + */
> +static inline void tlb_remove_tlb_entries(struct mmu_gather *tlb,
> +		pte_t *ptep, unsigned int nr, unsigned long address)
> +{
> +	tlb_flush_pte_range(tlb, address, PAGE_SIZE * nr);
> +	for (;;) {
> +		__tlb_remove_tlb_entry(tlb, ptep, address);
> +		if (--nr == 0)
> +			break;
> +		ptep++;
> +		address += PAGE_SIZE;
> +	}
> +}
> +
>  #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)	\
>  	do {							\
>  		unsigned long _sz = huge_page_size(h);		\



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP
  2024-01-29 14:32 [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP David Hildenbrand
                   ` (7 preceding siblings ...)
  2024-01-29 14:32 ` [PATCH v1 8/9] mm/mmu_gather: add tlb_remove_tlb_entries() David Hildenbrand
@ 2024-01-29 14:32 ` David Hildenbrand
  2024-01-30  9:08   ` David Hildenbrand
                     ` (2 more replies)
  2024-01-31  2:20 ` [PATCH v1 0/9] " Yin Fengwei
  9 siblings, 3 replies; 40+ messages in thread
From: David Hildenbrand @ 2024-01-29 14:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Ryan Roberts, Catalin Marinas, Will Deacon, Aneesh Kumar K.V,
	Nick Piggin, Peter Zijlstra, Michael Ellerman, Christophe Leroy,
	Naveen N. Rao, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Arnd Bergmann, linux-arch,
	linuxppc-dev, linux-s390

Similar to how we optimized fork(), let's implement PTE batching when
consecutive (present) PTEs map consecutive pages of the same large
folio.

Most infrastructure we need for batching (mmu gather, rmap) is already
there. We only have to add get_and_clear_full_ptes() and
clear_full_ptes(). Similarly, extend zap_install_uffd_wp_if_needed() to
process a PTE range.

We won't bother sanity-checking the mapcount of all subpages, but only
check the mapcount of the first subpage we process.

To keep small folios as fast as possible force inlining of a specialized
variant using __always_inline with nr=1.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/pgtable.h | 66 +++++++++++++++++++++++++++++
 mm/memory.c             | 92 +++++++++++++++++++++++++++++------------
 2 files changed, 132 insertions(+), 26 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index aab227e12493..f0feae7f89fb 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -580,6 +580,72 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 }
 #endif
 
+#ifndef get_and_clear_full_ptes
+/**
+ * get_and_clear_full_ptes - Clear PTEs that map consecutive pages of the same
+ *			     folio, collecting dirty/accessed bits.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ * @full: Whether we are clearing a full mm.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into
+ * returned PTE.
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock.  The PTEs map consecutive
+ * pages that belong to the same folio.  The PTEs are all in the same PMD.
+ */
+static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
+		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
+{
+	pte_t pte, tmp_pte;
+
+	pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+	while (--nr) {
+		ptep++;
+		addr += PAGE_SIZE;
+		tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+		if (pte_dirty(tmp_pte))
+			pte = pte_mkdirty(pte);
+		if (pte_young(tmp_pte))
+			pte = pte_mkyoung(pte);
+	}
+	return pte;
+}
+#endif
+
+#ifndef clear_full_ptes
+/**
+ * clear_full_ptes - Clear PTEs that map consecutive pages of the same folio.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ * @full: Whether we are clearing a full mm.
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock.  The PTEs map consecutive
+ * pages that belong to the same folio.  The PTEs are all in the same PMD.
+ */
+static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
+		pte_t *ptep, unsigned int nr, int full)
+{
+	for (;;) {
+		ptep_get_and_clear_full(mm, addr, ptep, full);
+		if (--nr == 0)
+			break;
+		ptep++;
+		addr += PAGE_SIZE;
+	}
+}
+#endif
 
 /*
  * If two threads concurrently fault at the same page, the thread that
diff --git a/mm/memory.c b/mm/memory.c
index a2190d7cfa74..38a010c4d04d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1515,7 +1515,7 @@ static inline bool zap_drop_file_uffd_wp(struct zap_details *details)
  */
 static inline void
 zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
-			      unsigned long addr, pte_t *pte,
+			      unsigned long addr, pte_t *pte, int nr,
 			      struct zap_details *details, pte_t pteval)
 {
 	/* Zap on anonymous always means dropping everything */
@@ -1525,20 +1525,27 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
 	if (zap_drop_file_uffd_wp(details))
 		return;
 
-	pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
+	for (;;) {
+		/* the PFN in the PTE is irrelevant. */
+		pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
+		if (--nr == 0)
+			break;
+		pte++;
+		addr += PAGE_SIZE;
+	}
 }
 
-static inline void zap_present_folio_pte(struct mmu_gather *tlb,
+static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, struct folio *folio,
-		struct page *page, pte_t *pte, pte_t ptent, unsigned long addr,
-		struct zap_details *details, int *rss, bool *force_flush,
-		bool *force_break)
+		struct page *page, pte_t *pte, pte_t ptent, unsigned int nr,
+		unsigned long addr, struct zap_details *details, int *rss,
+		bool *force_flush, bool *force_break)
 {
 	struct mm_struct *mm = tlb->mm;
 	bool delay_rmap = false;
 
 	if (!folio_test_anon(folio)) {
-		ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+		ptent = get_and_clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);
 		if (pte_dirty(ptent)) {
 			folio_mark_dirty(folio);
 			if (tlb_delay_rmap(tlb)) {
@@ -1548,36 +1555,49 @@ static inline void zap_present_folio_pte(struct mmu_gather *tlb,
 		}
 		if (pte_young(ptent) && likely(vma_has_recency(vma)))
 			folio_mark_accessed(folio);
-		rss[mm_counter(folio)]--;
+		rss[mm_counter(folio)] -= nr;
 	} else {
 		/* We don't need up-to-date accessed/dirty bits. */
-		ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
-		rss[MM_ANONPAGES]--;
+		clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);
+		rss[MM_ANONPAGES] -= nr;
 	}
+	/* Checking a single PTE in a batch is sufficient. */
 	arch_check_zapped_pte(vma, ptent);
-	tlb_remove_tlb_entry(tlb, pte, addr);
+	tlb_remove_tlb_entries(tlb, pte, nr, addr);
 	if (unlikely(userfaultfd_pte_wp(vma, ptent)))
-		zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
+		zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details,
+					      ptent);
 
 	if (!delay_rmap) {
-		folio_remove_rmap_pte(folio, page, vma);
+		folio_remove_rmap_ptes(folio, page, nr, vma);
+
+		/* Only sanity-check the first page in a batch. */
 		if (unlikely(page_mapcount(page) < 0))
 			print_bad_pte(vma, addr, ptent, page);
 	}
-	if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
+	if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) {
 		*force_flush = true;
 		*force_break = true;
 	}
 }
 
-static inline void zap_present_pte(struct mmu_gather *tlb,
+/*
+ * Zap or skip one present PTE, trying to batch-process subsequent PTEs that map
+ * consecutive pages of the same folio.
+ *
+ * Returns the number of processed (skipped or zapped) PTEs (at least 1).
+ */
+static inline int zap_present_ptes(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
-		unsigned long addr, struct zap_details *details,
-		int *rss, bool *force_flush, bool *force_break)
+		unsigned int max_nr, unsigned long addr,
+		struct zap_details *details, int *rss, bool *force_flush,
+		bool *force_break)
 {
+	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
 	struct mm_struct *mm = tlb->mm;
 	struct folio *folio;
 	struct page *page;
+	int nr;
 
 	page = vm_normal_page(vma, addr, ptent);
 	if (!page) {
@@ -1587,14 +1607,29 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
 		tlb_remove_tlb_entry(tlb, pte, addr);
 		VM_WARN_ON_ONCE(userfaultfd_wp(vma));
 		ksm_might_unmap_zero_page(mm, ptent);
-		return;
+		return 1;
 	}
 
 	folio = page_folio(page);
 	if (unlikely(!should_zap_folio(details, folio)))
-		return;
-	zap_present_folio_pte(tlb, vma, folio, page, pte, ptent, addr, details,
-			      rss, force_flush, force_break);
+		return 1;
+
+	/*
+	 * Make sure that the common "small folio" case is as fast as possible
+	 * by keeping the batching logic separate.
+	 */
+	if (unlikely(folio_test_large(folio) && max_nr != 1)) {
+		nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags,
+				     NULL);
+
+		zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
+				       addr, details, rss, force_flush,
+				       force_break);
+		return nr;
+	}
+	zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, 1, addr,
+			       details, rss, force_flush, force_break);
+	return 1;
 }
 
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
@@ -1609,6 +1644,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	pte_t *start_pte;
 	pte_t *pte;
 	swp_entry_t entry;
+	int nr;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	init_rss_vec(rss);
@@ -1622,7 +1658,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		pte_t ptent = ptep_get(pte);
 		struct folio *folio = NULL;
 		struct page *page;
+		int max_nr;
 
+		nr = 1;
 		if (pte_none(ptent))
 			continue;
 
@@ -1630,10 +1668,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			break;
 
 		if (pte_present(ptent)) {
-			zap_present_pte(tlb, vma, pte, ptent, addr, details,
-					rss, &force_flush, &force_break);
+			max_nr = (end - addr) / PAGE_SIZE;
+			nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
+					      addr, details, rss, &force_flush,
+					      &force_break);
 			if (unlikely(force_break)) {
-				addr += PAGE_SIZE;
+				addr += nr * PAGE_SIZE;
 				break;
 			}
 			continue;
@@ -1687,8 +1727,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			WARN_ON_ONCE(1);
 		}
 		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
-		zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
-	} while (pte++, addr += PAGE_SIZE, addr != end);
+		zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
+	} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
 
 	add_mm_rss_vec(mm, rss);
 	arch_leave_lazy_mmu_mode();
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP
  2024-01-29 14:32 ` [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP David Hildenbrand
@ 2024-01-30  9:08   ` David Hildenbrand
  2024-01-30  9:48   ` Ryan Roberts
  2024-01-31  2:30   ` Yin Fengwei
  2 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2024-01-30  9:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Ryan Roberts,
	Catalin Marinas, Will Deacon, Aneesh Kumar K.V, Nick Piggin,
	Peter Zijlstra, Michael Ellerman, Christophe Leroy, Naveen N. Rao,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Arnd Bergmann, linux-arch,
	linuxppc-dev, linux-s390

Re-reading the docs myself:

> +#ifndef get_and_clear_full_ptes
> +/**
> + * get_and_clear_full_ptes - Clear PTEs that map consecutive pages of the same
> + *			     folio, collecting dirty/accessed bits.
> + * @mm: Address space the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to clear.
> + * @full: Whether we are clearing a full mm.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into
> + * returned PTE.

"into the"

> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> + * some PTEs might be write-protected.
> + *
> + * Context: The caller holds the page table lock.  The PTEs map consecutive
> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
> + */
> +static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
> +		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
> +{
> +	pte_t pte, tmp_pte;
> +
> +	pte = ptep_get_and_clear_full(mm, addr, ptep, full);
> +	while (--nr) {
> +		ptep++;
> +		addr += PAGE_SIZE;
> +		tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
> +		if (pte_dirty(tmp_pte))
> +			pte = pte_mkdirty(pte);
> +		if (pte_young(tmp_pte))
> +			pte = pte_mkyoung(pte);
> +	}
> +	return pte;
> +}
> +#endif
> +
> +#ifndef clear_full_ptes
> +/**
> + * clear_full_ptes - Clear PTEs that map consecutive pages of the same folio.
> + * @mm: Address space the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to clear.
> + * @full: Whether we are clearing a full mm.

Something went missing:

May be overridden by the architecture; otherwise, implemented as a 
simple loop over ptep_get_and_clear_full().


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP
  2024-01-29 14:32 ` [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP David Hildenbrand
  2024-01-30  9:08   ` David Hildenbrand
@ 2024-01-30  9:48   ` Ryan Roberts
  2024-01-31 10:21     ` David Hildenbrand
  2024-01-31  2:30   ` Yin Fengwei
  2 siblings, 1 reply; 40+ messages in thread
From: Ryan Roberts @ 2024-01-30  9:48 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Catalin Marinas,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra,
	Michael Ellerman, Christophe Leroy, Naveen N. Rao, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Arnd Bergmann, linux-arch, linuxppc-dev,
	linux-s390

On 29/01/2024 14:32, David Hildenbrand wrote:
> Similar to how we optimized fork(), let's implement PTE batching when
> consecutive (present) PTEs map consecutive pages of the same large
> folio.
> 
> Most infrastructure we need for batching (mmu gather, rmap) is already
> there. We only have to add get_and_clear_full_ptes() and
> clear_full_ptes(). Similarly, extend zap_install_uffd_wp_if_needed() to
> process a PTE range.
> 
> We won't bother sanity-checking the mapcount of all subpages, but only
> check the mapcount of the first subpage we process.
> 
> To keep small folios as fast as possible force inlining of a specialized
> variant using __always_inline with nr=1.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/pgtable.h | 66 +++++++++++++++++++++++++++++
>  mm/memory.c             | 92 +++++++++++++++++++++++++++++------------
>  2 files changed, 132 insertions(+), 26 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index aab227e12493..f0feae7f89fb 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -580,6 +580,72 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>  }
>  #endif
>  
> +#ifndef get_and_clear_full_ptes
> +/**
> + * get_and_clear_full_ptes - Clear PTEs that map consecutive pages of the same
> + *			     folio, collecting dirty/accessed bits.
> + * @mm: Address space the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to clear.
> + * @full: Whether we are clearing a full mm.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into
> + * returned PTE.
> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> + * some PTEs might be write-protected.
> + *
> + * Context: The caller holds the page table lock.  The PTEs map consecutive
> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
> + */
> +static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
> +		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
> +{
> +	pte_t pte, tmp_pte;
> +
> +	pte = ptep_get_and_clear_full(mm, addr, ptep, full);
> +	while (--nr) {
> +		ptep++;
> +		addr += PAGE_SIZE;
> +		tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
> +		if (pte_dirty(tmp_pte))
> +			pte = pte_mkdirty(pte);
> +		if (pte_young(tmp_pte))
> +			pte = pte_mkyoung(pte);
> +	}
> +	return pte;
> +}
> +#endif
> +
> +#ifndef clear_full_ptes
> +/**
> + * clear_full_ptes - Clear PTEs that map consecutive pages of the same folio.

I know its implied from "pages of the same folio" (and even more so for the
above variant due to mention of access/dirty), but I wonder if its useful to
explicitly state that "all ptes being cleared are present at the time of the call"?

> + * @mm: Address space the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to clear.
> + * @full: Whether we are clearing a full mm.
> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> + * some PTEs might be write-protected.
> + *
> + * Context: The caller holds the page table lock.  The PTEs map consecutive
> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
> + */
> +static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> +		pte_t *ptep, unsigned int nr, int full)
> +{
> +	for (;;) {
> +		ptep_get_and_clear_full(mm, addr, ptep, full);
> +		if (--nr == 0)
> +			break;
> +		ptep++;
> +		addr += PAGE_SIZE;
> +	}
> +}
> +#endif
>  
>  /*
>   * If two threads concurrently fault at the same page, the thread that
> diff --git a/mm/memory.c b/mm/memory.c
> index a2190d7cfa74..38a010c4d04d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1515,7 +1515,7 @@ static inline bool zap_drop_file_uffd_wp(struct zap_details *details)
>   */
>  static inline void
>  zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
> -			      unsigned long addr, pte_t *pte,
> +			      unsigned long addr, pte_t *pte, int nr,
>  			      struct zap_details *details, pte_t pteval)
>  {
>  	/* Zap on anonymous always means dropping everything */
> @@ -1525,20 +1525,27 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
>  	if (zap_drop_file_uffd_wp(details))
>  		return;
>  
> -	pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
> +	for (;;) {
> +		/* the PFN in the PTE is irrelevant. */
> +		pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
> +		if (--nr == 0)
> +			break;
> +		pte++;
> +		addr += PAGE_SIZE;
> +	}
>  }
>  
> -static inline void zap_present_folio_pte(struct mmu_gather *tlb,
> +static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, struct folio *folio,
> -		struct page *page, pte_t *pte, pte_t ptent, unsigned long addr,
> -		struct zap_details *details, int *rss, bool *force_flush,
> -		bool *force_break)
> +		struct page *page, pte_t *pte, pte_t ptent, unsigned int nr,
> +		unsigned long addr, struct zap_details *details, int *rss,
> +		bool *force_flush, bool *force_break)
>  {
>  	struct mm_struct *mm = tlb->mm;
>  	bool delay_rmap = false;
>  
>  	if (!folio_test_anon(folio)) {
> -		ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> +		ptent = get_and_clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>  		if (pte_dirty(ptent)) {
>  			folio_mark_dirty(folio);
>  			if (tlb_delay_rmap(tlb)) {
> @@ -1548,36 +1555,49 @@ static inline void zap_present_folio_pte(struct mmu_gather *tlb,
>  		}
>  		if (pte_young(ptent) && likely(vma_has_recency(vma)))
>  			folio_mark_accessed(folio);
> -		rss[mm_counter(folio)]--;
> +		rss[mm_counter(folio)] -= nr;
>  	} else {
>  		/* We don't need up-to-date accessed/dirty bits. */
> -		ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> -		rss[MM_ANONPAGES]--;
> +		clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);
> +		rss[MM_ANONPAGES] -= nr;
>  	}
> +	/* Checking a single PTE in a batch is sufficient. */
>  	arch_check_zapped_pte(vma, ptent);
> -	tlb_remove_tlb_entry(tlb, pte, addr);
> +	tlb_remove_tlb_entries(tlb, pte, nr, addr);
>  	if (unlikely(userfaultfd_pte_wp(vma, ptent)))
> -		zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
> +		zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details,
> +					      ptent);
>  
>  	if (!delay_rmap) {
> -		folio_remove_rmap_pte(folio, page, vma);
> +		folio_remove_rmap_ptes(folio, page, nr, vma);
> +
> +		/* Only sanity-check the first page in a batch. */
>  		if (unlikely(page_mapcount(page) < 0))
>  			print_bad_pte(vma, addr, ptent, page);

Is there a case for either removing this all together or moving it into
folio_remove_rmap_ptes()? It seems odd to only check some pages.


>  	}
> -	if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
> +	if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) {
>  		*force_flush = true;
>  		*force_break = true;
>  	}
>  }
>  
> -static inline void zap_present_pte(struct mmu_gather *tlb,
> +/*
> + * Zap or skip one present PTE, trying to batch-process subsequent PTEs that map

Zap or skip *at least* one... ?

> + * consecutive pages of the same folio.
> + *
> + * Returns the number of processed (skipped or zapped) PTEs (at least 1).
> + */
> +static inline int zap_present_ptes(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
> -		unsigned long addr, struct zap_details *details,
> -		int *rss, bool *force_flush, bool *force_break)
> +		unsigned int max_nr, unsigned long addr,
> +		struct zap_details *details, int *rss, bool *force_flush,
> +		bool *force_break)
>  {
> +	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>  	struct mm_struct *mm = tlb->mm;
>  	struct folio *folio;
>  	struct page *page;
> +	int nr;
>  
>  	page = vm_normal_page(vma, addr, ptent);
>  	if (!page) {
> @@ -1587,14 +1607,29 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
>  		tlb_remove_tlb_entry(tlb, pte, addr);
>  		VM_WARN_ON_ONCE(userfaultfd_wp(vma));
>  		ksm_might_unmap_zero_page(mm, ptent);
> -		return;
> +		return 1;
>  	}
>  
>  	folio = page_folio(page);
>  	if (unlikely(!should_zap_folio(details, folio)))
> -		return;
> -	zap_present_folio_pte(tlb, vma, folio, page, pte, ptent, addr, details,
> -			      rss, force_flush, force_break);
> +		return 1;
> +
> +	/*
> +	 * Make sure that the common "small folio" case is as fast as possible
> +	 * by keeping the batching logic separate.
> +	 */
> +	if (unlikely(folio_test_large(folio) && max_nr != 1)) {
> +		nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags,
> +				     NULL);
> +
> +		zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
> +				       addr, details, rss, force_flush,
> +				       force_break);
> +		return nr;
> +	}
> +	zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, 1, addr,
> +			       details, rss, force_flush, force_break);
> +	return 1;
>  }
>  
>  static unsigned long zap_pte_range(struct mmu_gather *tlb,
> @@ -1609,6 +1644,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  	pte_t *start_pte;
>  	pte_t *pte;
>  	swp_entry_t entry;
> +	int nr;
>  
>  	tlb_change_page_size(tlb, PAGE_SIZE);
>  	init_rss_vec(rss);
> @@ -1622,7 +1658,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  		pte_t ptent = ptep_get(pte);
>  		struct folio *folio = NULL;
>  		struct page *page;
> +		int max_nr;
>  
> +		nr = 1;
>  		if (pte_none(ptent))
>  			continue;
>  
> @@ -1630,10 +1668,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  			break;
>  
>  		if (pte_present(ptent)) {
> -			zap_present_pte(tlb, vma, pte, ptent, addr, details,
> -					rss, &force_flush, &force_break);
> +			max_nr = (end - addr) / PAGE_SIZE;
> +			nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
> +					      addr, details, rss, &force_flush,
> +					      &force_break);
>  			if (unlikely(force_break)) {
> -				addr += PAGE_SIZE;
> +				addr += nr * PAGE_SIZE;
>  				break;
>  			}
>  			continue;
> @@ -1687,8 +1727,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  			WARN_ON_ONCE(1);
>  		}
>  		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> -		zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
> -	} while (pte++, addr += PAGE_SIZE, addr != end);
> +		zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
> +	} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>  
>  	add_mm_rss_vec(mm, rss);
>  	arch_leave_lazy_mmu_mode();



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP
  2024-01-30  9:48   ` Ryan Roberts
@ 2024-01-31 10:21     ` David Hildenbrand
  2024-01-31 10:31       ` Ryan Roberts
  0 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2024-01-31 10:21 UTC (permalink / raw)
  To: Ryan Roberts, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Catalin Marinas,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra,
	Michael Ellerman, Christophe Leroy, Naveen N. Rao, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Arnd Bergmann, linux-arch, linuxppc-dev,
	linux-s390


>> +
>> +#ifndef clear_full_ptes
>> +/**
>> + * clear_full_ptes - Clear PTEs that map consecutive pages of the same folio.
> 
> I know its implied from "pages of the same folio" (and even more so for the
> above variant due to mention of access/dirty), but I wonder if its useful to
> explicitly state that "all ptes being cleared are present at the time of the call"?

"Clear PTEs" -> "Clear present PTEs" ?

That should make it clearer.

[...]

>>   	if (!delay_rmap) {
>> -		folio_remove_rmap_pte(folio, page, vma);
>> +		folio_remove_rmap_ptes(folio, page, nr, vma);
>> +
>> +		/* Only sanity-check the first page in a batch. */
>>   		if (unlikely(page_mapcount(page) < 0))
>>   			print_bad_pte(vma, addr, ptent, page);
> 
> Is there a case for either removing this all together or moving it into
> folio_remove_rmap_ptes()? It seems odd to only check some pages.
> 

I really wanted to avoid another nasty loop here.

In my thinking, for 4k folios, or when zapping subpages of large folios, 
we still perform the exact same checks. Only when batching we don't. So 
if there is some problem, there are ways to get it triggered. And these 
problems are barely ever seen.

folio_remove_rmap_ptes() feels like the better place -- especially 
because the delayed-rmap handling is effectively unchecked. But in 
there, we cannot "print_bad_pte()".

[background: if we had a total mapcount -- iow cheap folio_mapcount(), 
I'd check here that the total mapcount does not underflow, instead of 
checking per-subpage]

> 
>>   	}
>> -	if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
>> +	if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) {
>>   		*force_flush = true;
>>   		*force_break = true;
>>   	}
>>   }
>>   
>> -static inline void zap_present_pte(struct mmu_gather *tlb,
>> +/*
>> + * Zap or skip one present PTE, trying to batch-process subsequent PTEs that map
> 
> Zap or skip *at least* one... ?

Ack

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP
  2024-01-31 10:21     ` David Hildenbrand
@ 2024-01-31 10:31       ` Ryan Roberts
  2024-01-31 11:13         ` David Hildenbrand
  0 siblings, 1 reply; 40+ messages in thread
From: Ryan Roberts @ 2024-01-31 10:31 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Catalin Marinas,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra,
	Michael Ellerman, Christophe Leroy, Naveen N. Rao, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Arnd Bergmann, linux-arch, linuxppc-dev,
	linux-s390

On 31/01/2024 10:21, David Hildenbrand wrote:
> 
>>> +
>>> +#ifndef clear_full_ptes
>>> +/**
>>> + * clear_full_ptes - Clear PTEs that map consecutive pages of the same folio.
>>
>> I know its implied from "pages of the same folio" (and even more so for the
>> above variant due to mention of access/dirty), but I wonder if its useful to
>> explicitly state that "all ptes being cleared are present at the time of the
>> call"?
> 
> "Clear PTEs" -> "Clear present PTEs" ?
> 
> That should make it clearer.

Works for me.

> 
> [...]
> 
>>>       if (!delay_rmap) {
>>> -        folio_remove_rmap_pte(folio, page, vma);
>>> +        folio_remove_rmap_ptes(folio, page, nr, vma);
>>> +
>>> +        /* Only sanity-check the first page in a batch. */
>>>           if (unlikely(page_mapcount(page) < 0))
>>>               print_bad_pte(vma, addr, ptent, page);
>>
>> Is there a case for either removing this all together or moving it into
>> folio_remove_rmap_ptes()? It seems odd to only check some pages.
>>
> 
> I really wanted to avoid another nasty loop here.
> 
> In my thinking, for 4k folios, or when zapping subpages of large folios, we
> still perform the exact same checks. Only when batching we don't. So if there is
> some problem, there are ways to get it triggered. And these problems are barely
> ever seen.
> 
> folio_remove_rmap_ptes() feels like the better place -- especially because the
> delayed-rmap handling is effectively unchecked. But in there, we cannot
> "print_bad_pte()".
> 
> [background: if we had a total mapcount -- iow cheap folio_mapcount(), I'd check
> here that the total mapcount does not underflow, instead of checking per-subpage]

All good points... perhaps extend the comment to describe how this could be
solved in future with cheap total_mapcount()? Or in the commit log if you prefer?

> 
>>
>>>       }
>>> -    if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
>>> +    if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) {
>>>           *force_flush = true;
>>>           *force_break = true;
>>>       }
>>>   }
>>>   -static inline void zap_present_pte(struct mmu_gather *tlb,
>>> +/*
>>> + * Zap or skip one present PTE, trying to batch-process subsequent PTEs that
>>> map
>>
>> Zap or skip *at least* one... ?
> 
> Ack
> 



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP
  2024-01-31 10:31       ` Ryan Roberts
@ 2024-01-31 11:13         ` David Hildenbrand
  0 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2024-01-31 11:13 UTC (permalink / raw)
  To: Ryan Roberts, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Catalin Marinas,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra,
	Michael Ellerman, Christophe Leroy, Naveen N. Rao, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Arnd Bergmann, linux-arch, linuxppc-dev,
	linux-s390

>>>> -        folio_remove_rmap_pte(folio, page, vma);
>>>> +        folio_remove_rmap_ptes(folio, page, nr, vma);
>>>> +
>>>> +        /* Only sanity-check the first page in a batch. */
>>>>            if (unlikely(page_mapcount(page) < 0))
>>>>                print_bad_pte(vma, addr, ptent, page);
>>>
>>> Is there a case for either removing this all together or moving it into
>>> folio_remove_rmap_ptes()? It seems odd to only check some pages.
>>>
>>
>> I really wanted to avoid another nasty loop here.
>>
>> In my thinking, for 4k folios, or when zapping subpages of large folios, we
>> still perform the exact same checks. Only when batching we don't. So if there is
>> some problem, there are ways to get it triggered. And these problems are barely
>> ever seen.
>>
>> folio_remove_rmap_ptes() feels like the better place -- especially because the
>> delayed-rmap handling is effectively unchecked. But in there, we cannot
>> "print_bad_pte()".
>>
>> [background: if we had a total mapcount -- iow cheap folio_mapcount(), I'd check
>> here that the total mapcount does not underflow, instead of checking per-subpage]
> 
> All good points... perhaps extend the comment to describe how this could be
> solved in future with cheap total_mapcount()? Or in the commit log if you prefer?

I'll add more meat to the cover letter, thanks!

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP
  2024-01-29 14:32 ` [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP David Hildenbrand
  2024-01-30  9:08   ` David Hildenbrand
  2024-01-30  9:48   ` Ryan Roberts
@ 2024-01-31  2:30   ` Yin Fengwei
  2024-01-31 10:30     ` David Hildenbrand
  2 siblings, 1 reply; 40+ messages in thread
From: Yin Fengwei @ 2024-01-31  2:30 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Ryan Roberts,
	Catalin Marinas, Will Deacon, Aneesh Kumar K.V, Nick Piggin,
	Peter Zijlstra, Michael Ellerman, Christophe Leroy, Naveen N. Rao,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Arnd Bergmann, linux-arch,
	linuxppc-dev, linux-s390



On 1/29/24 22:32, David Hildenbrand wrote:
> +static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
> +		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
> +{
> +	pte_t pte, tmp_pte;
> +
> +	pte = ptep_get_and_clear_full(mm, addr, ptep, full);
> +	while (--nr) {
> +		ptep++;
> +		addr += PAGE_SIZE;
> +		tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
> +		if (pte_dirty(tmp_pte))
> +			pte = pte_mkdirty(pte);
> +		if (pte_young(tmp_pte))
> +			pte = pte_mkyoung(pte);
I am wondering whether it's worthy to move the pte_mkdirty() and pte_mkyoung()
out of the loop and just do it one time if needed. The worst case is that they
are called nr - 1 time. Or it's just too micro?


Regards
Yin, Fengwei

> +	}
> +	return pte;
> +}


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP
  2024-01-31  2:30   ` Yin Fengwei
@ 2024-01-31 10:30     ` David Hildenbrand
  2024-01-31 10:43       ` Yin, Fengwei
  0 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2024-01-31 10:30 UTC (permalink / raw)
  To: Yin Fengwei, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Ryan Roberts,
	Catalin Marinas, Will Deacon, Aneesh Kumar K.V, Nick Piggin,
	Peter Zijlstra, Michael Ellerman, Christophe Leroy, Naveen N. Rao,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Arnd Bergmann, linux-arch,
	linuxppc-dev, linux-s390

On 31.01.24 03:30, Yin Fengwei wrote:
> 
> 
> On 1/29/24 22:32, David Hildenbrand wrote:
>> +static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
>> +		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
>> +{
>> +	pte_t pte, tmp_pte;
>> +
>> +	pte = ptep_get_and_clear_full(mm, addr, ptep, full);
>> +	while (--nr) {
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +		tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
>> +		if (pte_dirty(tmp_pte))
>> +			pte = pte_mkdirty(pte);
>> +		if (pte_young(tmp_pte))
>> +			pte = pte_mkyoung(pte);
> I am wondering whether it's worthy to move the pte_mkdirty() and pte_mkyoung()
> out of the loop and just do it one time if needed. The worst case is that they
> are called nr - 1 time. Or it's just too micro?

I also thought about just indicating "any_accessed" or "any_dirty" using 
flags to the caller, to avoid the PTE modifications completely. Felt a 
bit micro-optimized.

Regarding your proposal: I thought about that as well, but my assumption 
was that dirty+young are "cheap" to be set.

On x86, pte_mkyoung() is setting _PAGE_ACCESSED.
pte_mkdirty() is setting _PAGE_DIRTY | _PAGE_SOFT_DIRTY, but it also has 
to handle the saveddirty handling, using some bit trickery.

So at least for pte_mkyoung() there would be no real benefit as far as I 
can see (might be even worse). For pte_mkdirty() there might be a small 
benefit.

Is it going to be measurable? Likely not.

Am I missing something?

Thanks!

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP
  2024-01-31 10:30     ` David Hildenbrand
@ 2024-01-31 10:43       ` Yin, Fengwei
  0 siblings, 0 replies; 40+ messages in thread
From: Yin, Fengwei @ 2024-01-31 10:43 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Ryan Roberts,
	Catalin Marinas, Will Deacon, Aneesh Kumar K.V, Nick Piggin,
	Peter Zijlstra, Michael Ellerman, Christophe Leroy, Naveen N. Rao,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Arnd Bergmann, linux-arch,
	linuxppc-dev, linux-s390



On 1/31/2024 6:30 PM, David Hildenbrand wrote:
> On 31.01.24 03:30, Yin Fengwei wrote:
>>
>>
>> On 1/29/24 22:32, David Hildenbrand wrote:
>>> +static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
>>> +        unsigned long addr, pte_t *ptep, unsigned int nr, int full)
>>> +{
>>> +    pte_t pte, tmp_pte;
>>> +
>>> +    pte = ptep_get_and_clear_full(mm, addr, ptep, full);
>>> +    while (--nr) {
>>> +        ptep++;
>>> +        addr += PAGE_SIZE;
>>> +        tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
>>> +        if (pte_dirty(tmp_pte))
>>> +            pte = pte_mkdirty(pte);
>>> +        if (pte_young(tmp_pte))
>>> +            pte = pte_mkyoung(pte);
>> I am wondering whether it's worthy to move the pte_mkdirty() and 
>> pte_mkyoung()
>> out of the loop and just do it one time if needed. The worst case is 
>> that they
>> are called nr - 1 time. Or it's just too micro?
> 
> I also thought about just indicating "any_accessed" or "any_dirty" using 
> flags to the caller, to avoid the PTE modifications completely. Felt a 
> bit micro-optimized.
> 
> Regarding your proposal: I thought about that as well, but my assumption 
> was that dirty+young are "cheap" to be set.
> 
> On x86, pte_mkyoung() is setting _PAGE_ACCESSED.
> pte_mkdirty() is setting _PAGE_DIRTY | _PAGE_SOFT_DIRTY, but it also has 
> to handle the saveddirty handling, using some bit trickery.
> 
> So at least for pte_mkyoung() there would be no real benefit as far as I 
> can see (might be even worse). For pte_mkdirty() there might be a small 
> benefit.
> 
> Is it going to be measurable? Likely not.
Yeah. We can do more investigation when performance profiling call this
out.


Regards
Yin, Fengwei

> 
> Am I missing something?
> 
> Thanks!
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP
  2024-01-29 14:32 [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP David Hildenbrand
                   ` (8 preceding siblings ...)
  2024-01-29 14:32 ` [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP David Hildenbrand
@ 2024-01-31  2:20 ` Yin Fengwei
  2024-01-31 10:16   ` David Hildenbrand
  2024-01-31 10:43   ` David Hildenbrand
  9 siblings, 2 replies; 40+ messages in thread
From: Yin Fengwei @ 2024-01-31  2:20 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Ryan Roberts,
	Catalin Marinas, Will Deacon, Aneesh Kumar K.V, Nick Piggin,
	Peter Zijlstra, Michael Ellerman, Christophe Leroy, Naveen N. Rao,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Arnd Bergmann, linux-arch,
	linuxppc-dev, linux-s390, Huang, Ying

On 1/29/24 22:32, David Hildenbrand wrote:
> This series is based on [1] and must be applied on top of it.
> Similar to what we did with fork(), let's implement PTE batching
> during unmap/zap when processing PTE-mapped THPs.
> 
> We collect consecutive PTEs that map consecutive pages of the same large
> folio, making sure that the other PTE bits are compatible, and (a) adjust
> the refcount only once per batch, (b) call rmap handling functions only
> once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
> entry removal once per batch.
> 
> Ryan was previously working on this in the context of cont-pte for
> arm64, int latest iteration [2] with a focus on arm6 with cont-pte only.
> This series implements the optimization for all architectures, independent
> of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
> large-folio-pages batches as well, and amkes use of our new rmap batching
> function when removing the rmap.
> 
> To achieve that, we have to enlighten MMU gather / page freeing code
> (i.e., everything that consumes encoded_page) to process unmapping
> of consecutive pages that all belong to the same large folio. I'm being
> very careful to not degrade order-0 performance, and it looks like I
> managed to achieve that.

One possible scenario:
If all the folio is 2M size folio, then one full batch could hold 510M memory.
Is it too much regarding one full batch before just can hold (2M - 4096 * 2)
memory?


Regards
Yin, Fengwei



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP
  2024-01-31  2:20 ` [PATCH v1 0/9] " Yin Fengwei
@ 2024-01-31 10:16   ` David Hildenbrand
  2024-01-31 10:26     ` Ryan Roberts
  2024-01-31 14:03     ` Michal Hocko
  2024-01-31 10:43   ` David Hildenbrand
  1 sibling, 2 replies; 40+ messages in thread
From: David Hildenbrand @ 2024-01-31 10:16 UTC (permalink / raw)
  To: Yin Fengwei, linux-kernel, Linus Torvalds, Michal Hocko
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Ryan Roberts,
	Catalin Marinas, Will Deacon, Aneesh Kumar K.V, Nick Piggin,
	Peter Zijlstra, Michael Ellerman, Christophe Leroy, Naveen N. Rao,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Arnd Bergmann, linux-arch,
	linuxppc-dev, linux-s390, Huang, Ying

On 31.01.24 03:20, Yin Fengwei wrote:
> On 1/29/24 22:32, David Hildenbrand wrote:
>> This series is based on [1] and must be applied on top of it.
>> Similar to what we did with fork(), let's implement PTE batching
>> during unmap/zap when processing PTE-mapped THPs.
>>
>> We collect consecutive PTEs that map consecutive pages of the same large
>> folio, making sure that the other PTE bits are compatible, and (a) adjust
>> the refcount only once per batch, (b) call rmap handling functions only
>> once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
>> entry removal once per batch.
>>
>> Ryan was previously working on this in the context of cont-pte for
>> arm64, int latest iteration [2] with a focus on arm6 with cont-pte only.
>> This series implements the optimization for all architectures, independent
>> of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
>> large-folio-pages batches as well, and amkes use of our new rmap batching
>> function when removing the rmap.
>>
>> To achieve that, we have to enlighten MMU gather / page freeing code
>> (i.e., everything that consumes encoded_page) to process unmapping
>> of consecutive pages that all belong to the same large folio. I'm being
>> very careful to not degrade order-0 performance, and it looks like I
>> managed to achieve that.
> 

Let's CC Linus and Michal to make sure I'm not daydreaming.

Relevant patch:
   https://lkml.kernel.org/r/20240129143221.263763-8-david@redhat.com

Context: I'm adjusting MMU gather code to support batching of 
consecutive pages that belong to the same large folio, when 
unmapping/zapping PTEs.

For small folios, there is no (relevant) change.

Imagine we have a PTE-mapped THP (2M folio -> 512 pages) and zap all 512 
PTEs: Instead of adding 512 individual encoded_page entries, we add a 
combined entry that expresses "page+nr_pages". That allows for "easily" 
adding various other per-folio batching (refcount, rmap, swap freeing).

The implication is, that we can now batch effective more pages with 
large folios, exceeding the old 10000 limit. The number of involved 
*folios* does not increase, though.

> One possible scenario:
> If all the folio is 2M size folio, then one full batch could hold 510M memory.
> Is it too much regarding one full batch before just can hold (2M - 4096 * 2)
> memory?

Excellent point, I think there are three parts to it:

(1) Batch pages / folio fragments per batch page

Before this change (and with 4k folios) we have exactly one page (4k) 
per encoded_page entry in the batch. Now, we can have (with 2M folios), 
512 pages for every two encoded_page entries (page+nr_pages) in a batch 
page. So an average ~256 pages per encoded_page entry.

So one batch page can now store in the worst case ~256 times the number 
of pages, but the number of folio fragments ("pages+nr_pages") would not 
increase.

The time it takes to perform the actual page freeing of a batch will not 
be 256 times higher -- the time is expected to be much closer to the old 
time (i.e., not freeing more folios).

(2) Delayed rmap handling

We limit batching early (see tlb_next_batch()) when we have delayed rmap 
pending. Reason being, that we don't want to check for many entries if 
they require delayed rmap handling, while still holding the page table 
lock (see tlb_flush_rmaps()), because we have to remove the rmap before 
dropping the PTL.

Note that we perform the check whether we need delayed rmap handling per 
page+nr_pages entry, not per page. So we won't perform more such checks.

Once we set tlb->delayed_rmap (because we add one entry that requires 
it), we already force a flush before dropping the PT lock. So once we 
get a single delayed rmap entry in there, we will not batch more than we 
could have in the same page table: so not more than 512 entries (x86-64) 
in the worst case. So it will still be bounded, and not significantly 
more than what we had before.

So regarding delayed rmap handling I think this should be fine.

(3) Total patched pages

MAX_GATHER_BATCH_COUNT effectively limits the number of pages we 
allocate (full batches), and thereby limits the number of pages we were 
able to batch.

The old limit was ~10000 pages, now we could batch ~5000 folio fragments 
(page+nr_pages), resulting int the "times 256" increase in the worst 
case on x86-64 as you point out.

This 10000 pages limit was introduced in 53a59fc67f97 ("mm: limit 
mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT") where we 
wanted to handle soft-lockups.

As the number of effective folios we are freeing does not increase, I 
*think* this should be fine.

If any of that is a problem, we would have to keep track of the total 
number of pages in our batch, and stop as soon as we hit our 10000 limit 
-- independent of page vs. folio fragment. Something I would like to 
avoid of possible.

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP
  2024-01-31 10:16   ` David Hildenbrand
@ 2024-01-31 10:26     ` Ryan Roberts
  2024-01-31 14:08       ` Michal Hocko
  2024-01-31 14:03     ` Michal Hocko
  1 sibling, 1 reply; 40+ messages in thread
From: Ryan Roberts @ 2024-01-31 10:26 UTC (permalink / raw)
  To: David Hildenbrand, Yin Fengwei, linux-kernel, Linus Torvalds,
	Michal Hocko
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Catalin Marinas,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra,
	Michael Ellerman, Christophe Leroy, Naveen N. Rao, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Arnd Bergmann, linux-arch, linuxppc-dev,
	linux-s390, Huang, Ying

On 31/01/2024 10:16, David Hildenbrand wrote:
> On 31.01.24 03:20, Yin Fengwei wrote:
>> On 1/29/24 22:32, David Hildenbrand wrote:
>>> This series is based on [1] and must be applied on top of it.
>>> Similar to what we did with fork(), let's implement PTE batching
>>> during unmap/zap when processing PTE-mapped THPs.
>>>
>>> We collect consecutive PTEs that map consecutive pages of the same large
>>> folio, making sure that the other PTE bits are compatible, and (a) adjust
>>> the refcount only once per batch, (b) call rmap handling functions only
>>> once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
>>> entry removal once per batch.
>>>
>>> Ryan was previously working on this in the context of cont-pte for
>>> arm64, int latest iteration [2] with a focus on arm6 with cont-pte only.
>>> This series implements the optimization for all architectures, independent
>>> of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
>>> large-folio-pages batches as well, and amkes use of our new rmap batching
>>> function when removing the rmap.
>>>
>>> To achieve that, we have to enlighten MMU gather / page freeing code
>>> (i.e., everything that consumes encoded_page) to process unmapping
>>> of consecutive pages that all belong to the same large folio. I'm being
>>> very careful to not degrade order-0 performance, and it looks like I
>>> managed to achieve that.
>>
> 
> Let's CC Linus and Michal to make sure I'm not daydreaming.
> 
> Relevant patch:
>   https://lkml.kernel.org/r/20240129143221.263763-8-david@redhat.com
> 
> Context: I'm adjusting MMU gather code to support batching of consecutive pages
> that belong to the same large folio, when unmapping/zapping PTEs.
> 
> For small folios, there is no (relevant) change.
> 
> Imagine we have a PTE-mapped THP (2M folio -> 512 pages) and zap all 512 PTEs:
> Instead of adding 512 individual encoded_page entries, we add a combined entry
> that expresses "page+nr_pages". That allows for "easily" adding various other
> per-folio batching (refcount, rmap, swap freeing).
> 
> The implication is, that we can now batch effective more pages with large
> folios, exceeding the old 10000 limit. The number of involved *folios* does not
> increase, though.
> 
>> One possible scenario:
>> If all the folio is 2M size folio, then one full batch could hold 510M memory.
>> Is it too much regarding one full batch before just can hold (2M - 4096 * 2)
>> memory?
> 
> Excellent point, I think there are three parts to it:
> 
> (1) Batch pages / folio fragments per batch page
> 
> Before this change (and with 4k folios) we have exactly one page (4k) per
> encoded_page entry in the batch. Now, we can have (with 2M folios), 512 pages
> for every two encoded_page entries (page+nr_pages) in a batch page. So an
> average ~256 pages per encoded_page entry.
> 
> So one batch page can now store in the worst case ~256 times the number of
> pages, but the number of folio fragments ("pages+nr_pages") would not increase.
> 
> The time it takes to perform the actual page freeing of a batch will not be 256
> times higher -- the time is expected to be much closer to the old time (i.e.,
> not freeing more folios).

IIRC there is an option to zero memory when it is freed back to the buddy? So
that could be a place where time is proportional to size rather than
proportional to folio count? But I think that option is intended for debug only?
So perhaps not a problem in practice?


> 
> (2) Delayed rmap handling
> 
> We limit batching early (see tlb_next_batch()) when we have delayed rmap
> pending. Reason being, that we don't want to check for many entries if they
> require delayed rmap handling, while still holding the page table lock (see
> tlb_flush_rmaps()), because we have to remove the rmap before dropping the PTL.
> 
> Note that we perform the check whether we need delayed rmap handling per
> page+nr_pages entry, not per page. So we won't perform more such checks.
> 
> Once we set tlb->delayed_rmap (because we add one entry that requires it), we
> already force a flush before dropping the PT lock. So once we get a single
> delayed rmap entry in there, we will not batch more than we could have in the
> same page table: so not more than 512 entries (x86-64) in the worst case. So it
> will still be bounded, and not significantly more than what we had before.
> 
> So regarding delayed rmap handling I think this should be fine.
> 
> (3) Total patched pages
> 
> MAX_GATHER_BATCH_COUNT effectively limits the number of pages we allocate (full
> batches), and thereby limits the number of pages we were able to batch.
> 
> The old limit was ~10000 pages, now we could batch ~5000 folio fragments
> (page+nr_pages), resulting int the "times 256" increase in the worst case on
> x86-64 as you point out.
> 
> This 10000 pages limit was introduced in 53a59fc67f97 ("mm: limit mmu_gather
> batching to fix soft lockups on !CONFIG_PREEMPT") where we wanted to handle
> soft-lockups.
> 
> As the number of effective folios we are freeing does not increase, I *think*
> this should be fine.
> 
> 
> If any of that is a problem, we would have to keep track of the total number of
> pages in our batch, and stop as soon as we hit our 10000 limit -- independent of
> page vs. folio fragment. Something I would like to avoid of possible.
> 



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP
  2024-01-31 10:26     ` Ryan Roberts
@ 2024-01-31 14:08       ` Michal Hocko
  2024-01-31 14:20         ` David Hildenbrand
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2024-01-31 14:08 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: David Hildenbrand, Yin Fengwei, linux-kernel, Linus Torvalds,
	linux-mm, Andrew Morton, Matthew Wilcox, Catalin Marinas,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra,
	Michael Ellerman, Christophe Leroy, Naveen N. Rao, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Arnd Bergmann, linux-arch, linuxppc-dev,
	linux-s390, Huang, Ying

On Wed 31-01-24 10:26:13, Ryan Roberts wrote:
> IIRC there is an option to zero memory when it is freed back to the buddy? So
> that could be a place where time is proportional to size rather than
> proportional to folio count? But I think that option is intended for debug only?
> So perhaps not a problem in practice?

init_on_free is considered a security/hardening feature more than a
debugging one. It will surely add an overhead and I guess this is
something people who use it know about. The batch size limit is a latency
reduction feature for !PREEMPT kernels but by no means it should be
considered low latency guarantee feature. A lot of has changed since
the limit was introduced and the current latency numbers will surely be
different than back then. As long as soft lockups do not trigger again
this should be acceptable IMHO.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP
  2024-01-31 14:08       ` Michal Hocko
@ 2024-01-31 14:20         ` David Hildenbrand
  0 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2024-01-31 14:20 UTC (permalink / raw)
  To: Michal Hocko, Ryan Roberts
  Cc: Yin Fengwei, linux-kernel, Linus Torvalds, linux-mm,
	Andrew Morton, Matthew Wilcox, Catalin Marinas, Will Deacon,
	Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra, Michael Ellerman,
	Christophe Leroy, Naveen N. Rao, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Arnd Bergmann, linux-arch, linuxppc-dev, linux-s390, Huang, Ying

On 31.01.24 15:08, Michal Hocko wrote:
> On Wed 31-01-24 10:26:13, Ryan Roberts wrote:
>> IIRC there is an option to zero memory when it is freed back to the buddy? So
>> that could be a place where time is proportional to size rather than
>> proportional to folio count? But I think that option is intended for debug only?
>> So perhaps not a problem in practice?
> 
> init_on_free is considered a security/hardening feature more than a
> debugging one. It will surely add an overhead and I guess this is
> something people who use it know about. The batch size limit is a latency
> reduction feature for !PREEMPT kernels but by no means it should be
> considered low latency guarantee feature. A lot of has changed since
> the limit was introduced and the current latency numbers will surely be
> different than back then. As long as soft lockups do not trigger again
> this should be acceptable IMHO.

It could now be zeroing out ~512 MiB. That shouldn't take double-digit 
seconds unless we are running in a very problematic environment 
(over-committed VM). But then, we might have different problems already.

I'll do some sanity checks with an extremely large processes (as much as 
I can fit on my machines), with a !CONFIG_PREEMPT kernel and 
init_on_free, to see if anything pops up.

Thanks Michal!

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP
  2024-01-31 10:16   ` David Hildenbrand
  2024-01-31 10:26     ` Ryan Roberts
@ 2024-01-31 14:03     ` Michal Hocko
  1 sibling, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2024-01-31 14:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Yin Fengwei, linux-kernel, Linus Torvalds, linux-mm,
	Andrew Morton, Matthew Wilcox, Ryan Roberts, Catalin Marinas,
	Will Deacon, Aneesh Kumar K.V, Nick Piggin, Peter Zijlstra,
	Michael Ellerman, Christophe Leroy, Naveen N. Rao, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Arnd Bergmann, linux-arch, linuxppc-dev,
	linux-s390, Huang, Ying

On Wed 31-01-24 11:16:01, David Hildenbrand wrote:
[...]
> This 10000 pages limit was introduced in 53a59fc67f97 ("mm: limit mmu_gather
> batching to fix soft lockups on !CONFIG_PREEMPT") where we wanted to handle
> soft-lockups.

AFAIR at the time of this patch this was mostly just to put some cap on
the number of batches to collect and free at once. If there is a lot of
free memory and a large process exiting this could grow really high. Now
that those pages^Wfolios can represent larger memory chunks it could
mean more physical memory being freed but from which might make the
operation take longer but still far from soft lockup triggering.

Now latency might suck on !PREEMPT kernels with too many pages to
free in a single batch but I guess this is somehow expected for this
preemption model. The soft lockup has to be avoided because this can
panic the machine in some configurations.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP
  2024-01-31  2:20 ` [PATCH v1 0/9] " Yin Fengwei
  2024-01-31 10:16   ` David Hildenbrand
@ 2024-01-31 10:43   ` David Hildenbrand
  1 sibling, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2024-01-31 10:43 UTC (permalink / raw)
  To: Yin Fengwei, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Ryan Roberts,
	Catalin Marinas, Will Deacon, Aneesh Kumar K.V, Nick Piggin,
	Peter Zijlstra, Michael Ellerman, Christophe Leroy, Naveen N. Rao,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Arnd Bergmann, linux-arch,
	linuxppc-dev, linux-s390, Huang, Ying

On 31.01.24 03:20, Yin Fengwei wrote:
> On 1/29/24 22:32, David Hildenbrand wrote:
>> This series is based on [1] and must be applied on top of it.
>> Similar to what we did with fork(), let's implement PTE batching
>> during unmap/zap when processing PTE-mapped THPs.
>>
>> We collect consecutive PTEs that map consecutive pages of the same large
>> folio, making sure that the other PTE bits are compatible, and (a) adjust
>> the refcount only once per batch, (b) call rmap handling functions only
>> once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
>> entry removal once per batch.
>>
>> Ryan was previously working on this in the context of cont-pte for
>> arm64, int latest iteration [2] with a focus on arm6 with cont-pte only.
>> This series implements the optimization for all architectures, independent
>> of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
>> large-folio-pages batches as well, and amkes use of our new rmap batching
>> function when removing the rmap.
>>
>> To achieve that, we have to enlighten MMU gather / page freeing code
>> (i.e., everything that consumes encoded_page) to process unmapping
>> of consecutive pages that all belong to the same large folio. I'm being
>> very careful to not degrade order-0 performance, and it looks like I
>> managed to achieve that.
> 
> One possible scenario:
> If all the folio is 2M size folio, then one full batch could hold 510M memory.
> Is it too much regarding one full batch before just can hold (2M - 4096 * 2)
> memory?

Good point, we do have CONFIG_INIT_ON_FREE_DEFAULT_ON. I don't remember 
if init_on_free or init_on_alloc was used in production systems. In 
tlb_batch_pages_flush(), there is a cond_resched() to limit the number 
of entries we process.

So if that is actually problematic, we'd run into a soft-lockup and need 
another cond_resched() [I have some faint recollection that people are 
working on removing cond_resched() completely].

One could do some counting in free_pages_and_swap_cache() (where we 
iterate all entries already) and insert cond_resched+release_pages() for 
every (e.g., 512) pages.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2024-01-31 14:20 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-01-29 14:32 [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP David Hildenbrand
2024-01-29 14:32 ` [PATCH v1 1/9] mm/memory: factor out zapping of present pte into zap_present_pte() David Hildenbrand
2024-01-30  8:13   ` Ryan Roberts
2024-01-30  8:41     ` David Hildenbrand
2024-01-30  8:46       ` Ryan Roberts
2024-01-30  8:49         ` David Hildenbrand
2024-01-29 14:32 ` [PATCH v1 2/9] mm/memory: handle !page case in zap_present_pte() separately David Hildenbrand
2024-01-30  8:20   ` Ryan Roberts
2024-01-29 14:32 ` [PATCH v1 3/9] mm/memory: further separate anon and pagecache folio handling in zap_present_pte() David Hildenbrand
2024-01-30  8:31   ` Ryan Roberts
2024-01-30  8:37     ` David Hildenbrand
2024-01-30  8:45       ` Ryan Roberts
2024-01-30  8:47         ` David Hildenbrand
2024-01-29 14:32 ` [PATCH v1 4/9] mm/memory: factor out zapping folio pte into zap_present_folio_pte() David Hildenbrand
2024-01-30  8:47   ` Ryan Roberts
2024-01-29 14:32 ` [PATCH v1 5/9] mm/mmu_gather: pass "delay_rmap" instead of encoded page to __tlb_remove_page_size() David Hildenbrand
2024-01-30  8:41   ` Ryan Roberts
2024-01-29 14:32 ` [PATCH v1 6/9] mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP David Hildenbrand
2024-01-30  9:03   ` Ryan Roberts
2024-01-29 14:32 ` [PATCH v1 7/9] mm/mmu_gather: add __tlb_remove_folio_pages() David Hildenbrand
2024-01-30  9:21   ` Ryan Roberts
2024-01-30  9:33     ` David Hildenbrand
2024-01-29 14:32 ` [PATCH v1 8/9] mm/mmu_gather: add tlb_remove_tlb_entries() David Hildenbrand
2024-01-30  9:33   ` Ryan Roberts
2024-01-29 14:32 ` [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP David Hildenbrand
2024-01-30  9:08   ` David Hildenbrand
2024-01-30  9:48   ` Ryan Roberts
2024-01-31 10:21     ` David Hildenbrand
2024-01-31 10:31       ` Ryan Roberts
2024-01-31 11:13         ` David Hildenbrand
2024-01-31  2:30   ` Yin Fengwei
2024-01-31 10:30     ` David Hildenbrand
2024-01-31 10:43       ` Yin, Fengwei
2024-01-31  2:20 ` [PATCH v1 0/9] " Yin Fengwei
2024-01-31 10:16   ` David Hildenbrand
2024-01-31 10:26     ` Ryan Roberts
2024-01-31 14:08       ` Michal Hocko
2024-01-31 14:20         ` David Hildenbrand
2024-01-31 14:03     ` Michal Hocko
2024-01-31 10:43   ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).