public inbox for linux-arm-kernel@lists.infradead.org
* [PATCH] arm64: Enable vmalloc-huge with ptdump
@ 2025-05-30  8:20 Dev Jain
  2025-05-30  8:28 ` Dev Jain
  2025-05-30  8:40 ` Ryan Roberts
  0 siblings, 2 replies; 16+ messages in thread
From: Dev Jain @ 2025-05-30  8:20 UTC (permalink / raw)
  To: catalin.marinas, will
  Cc: anshuman.khandual, quic_zhenhuah, ryan.roberts, kevin.brodsky,
	yangyicong, joey.gouly, linux-arm-kernel, linux-kernel, david,
	Dev Jain

arm64 disables vmalloc-huge when kernel page table dumping is enabled,
because an intermediate table may be removed, potentially causing the
ptdump code to dereference an invalid address. We want to be able to
analyze block vs page mappings for kernel mappings with ptdump, so to
enable vmalloc-huge with ptdump, synchronize between page table removal in
pmd_free_pte_page()/pud_free_pmd_page() and ptdump page table walking. We
use mmap_read_lock and not write lock because we don't need to synchronize
between two different vm_structs; two vmalloc objects running this same
code path will point to different page tables, hence there is no race.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 arch/arm64/include/asm/vmalloc.h | 6 ++----
 arch/arm64/mm/mmu.c              | 7 +++++++
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index 38fafffe699f..28b7173d8693 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
 	/*
 	 * SW table walks can't handle removal of intermediate entries.
 	 */
-	return pud_sect_supported() &&
-	       !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
+	return pud_sect_supported();
 }
 
 #define arch_vmap_pmd_supported arch_vmap_pmd_supported
 static inline bool arch_vmap_pmd_supported(pgprot_t prot)
 {
-	/* See arch_vmap_pud_supported() */
-	return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
+	return true;
 }
 
 #endif
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index ea6695d53fb9..798cebd9e147 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1261,7 +1261,11 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
 	}
 
 	table = pte_offset_kernel(pmdp, addr);
+
+	/* Synchronize against ptdump_walk_pgd() */
+	mmap_read_lock(&init_mm);
 	pmd_clear(pmdp);
+	mmap_read_unlock(&init_mm);
 	__flush_tlb_kernel_pgtable(addr);
 	pte_free_kernel(NULL, table);
 	return 1;
@@ -1289,7 +1293,10 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
 		pmd_free_pte_page(pmdp, next);
 	} while (pmdp++, next += PMD_SIZE, next != end);
 
+	/* Synchronize against ptdump_walk_pgd() */
+	mmap_read_lock(&init_mm);
 	pud_clear(pudp);
+	mmap_read_unlock(&init_mm);
 	__flush_tlb_kernel_pgtable(addr);
 	pmd_free(NULL, table);
 	return 1;
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] arm64: Enable vmalloc-huge with ptdump
  2025-05-30  8:20 [PATCH] arm64: Enable vmalloc-huge with ptdump Dev Jain
@ 2025-05-30  8:28 ` Dev Jain
  2025-05-30  8:40 ` Ryan Roberts
  1 sibling, 0 replies; 16+ messages in thread
From: Dev Jain @ 2025-05-30  8:28 UTC (permalink / raw)
  To: catalin.marinas, will
  Cc: anshuman.khandual, quic_zhenhuah, ryan.roberts, kevin.brodsky,
	yangyicong, joey.gouly, linux-arm-kernel, linux-kernel, david


On 30/05/25 1:50 pm, Dev Jain wrote:
> arm64 disables vmalloc-huge when kernel page table dumping is enabled,
> because an intermediate table may be removed, potentially causing the
> ptdump code to dereference an invalid address. We want to be able to
> analyze block vs page mappings for kernel mappings with ptdump, so to
> enable vmalloc-huge with ptdump, synchronize between page table removal in
> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
> use mmap_read_lock and not write lock because we don't need to synchronize
> between two different vm_structs; two vmalloc objects running this same
> code path will point to different page tables, hence there is no race.

I mean, there *is* a race, but there is no problem :)

>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>   arch/arm64/include/asm/vmalloc.h | 6 ++----
>   arch/arm64/mm/mmu.c              | 7 +++++++
>   2 files changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
> index 38fafffe699f..28b7173d8693 100644
> --- a/arch/arm64/include/asm/vmalloc.h
> +++ b/arch/arm64/include/asm/vmalloc.h
> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
>   	/*
>   	 * SW table walks can't handle removal of intermediate entries.
>   	 */
> -	return pud_sect_supported() &&
> -	       !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
> +	return pud_sect_supported();
>   }
>   
>   #define arch_vmap_pmd_supported arch_vmap_pmd_supported
>   static inline bool arch_vmap_pmd_supported(pgprot_t prot)
>   {
> -	/* See arch_vmap_pud_supported() */
> -	return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
> +	return true;
>   }
>   
>   #endif
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index ea6695d53fb9..798cebd9e147 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1261,7 +1261,11 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>   	}
>   
>   	table = pte_offset_kernel(pmdp, addr);
> +
> +	/* Synchronize against ptdump_walk_pgd() */
> +	mmap_read_lock(&init_mm);
>   	pmd_clear(pmdp);
> +	mmap_read_unlock(&init_mm);
>   	__flush_tlb_kernel_pgtable(addr);
>   	pte_free_kernel(NULL, table);
>   	return 1;
> @@ -1289,7 +1293,10 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>   		pmd_free_pte_page(pmdp, next);
>   	} while (pmdp++, next += PMD_SIZE, next != end);
>   
> +	/* Synchronize against ptdump_walk_pgd() */
> +	mmap_read_lock(&init_mm);
>   	pud_clear(pudp);
> +	mmap_read_unlock(&init_mm);
>   	__flush_tlb_kernel_pgtable(addr);
>   	pmd_free(NULL, table);
>   	return 1;


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] arm64: Enable vmalloc-huge with ptdump
  2025-05-30  8:20 [PATCH] arm64: Enable vmalloc-huge with ptdump Dev Jain
  2025-05-30  8:28 ` Dev Jain
@ 2025-05-30  8:40 ` Ryan Roberts
  2025-05-30  9:07   ` Anshuman Khandual
  2025-05-30  9:14   ` Dev Jain
  1 sibling, 2 replies; 16+ messages in thread
From: Ryan Roberts @ 2025-05-30  8:40 UTC (permalink / raw)
  To: Dev Jain, catalin.marinas, will
  Cc: anshuman.khandual, quic_zhenhuah, kevin.brodsky, yangyicong,
	joey.gouly, linux-arm-kernel, linux-kernel, david

On 30/05/2025 09:20, Dev Jain wrote:
> arm64 disables vmalloc-huge when kernel page table dumping is enabled,
> because an intermediate table may be removed, potentially causing the
> ptdump code to dereference an invalid address. We want to be able to
> analyze block vs page mappings for kernel mappings with ptdump, so to
> enable vmalloc-huge with ptdump, synchronize between page table removal in
> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
> use mmap_read_lock and not write lock because we don't need to synchronize
> between two different vm_structs; two vmalloc objects running this same
> code path will point to different page tables, hence there is no race. 
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  arch/arm64/include/asm/vmalloc.h | 6 ++----
>  arch/arm64/mm/mmu.c              | 7 +++++++
>  2 files changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
> index 38fafffe699f..28b7173d8693 100644
> --- a/arch/arm64/include/asm/vmalloc.h
> +++ b/arch/arm64/include/asm/vmalloc.h
> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
>  	/*
>  	 * SW table walks can't handle removal of intermediate entries.
>  	 */
> -	return pud_sect_supported() &&
> -	       !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
> +	return pud_sect_supported();
>  }
>  
>  #define arch_vmap_pmd_supported arch_vmap_pmd_supported
>  static inline bool arch_vmap_pmd_supported(pgprot_t prot)
>  {
> -	/* See arch_vmap_pud_supported() */
> -	return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
> +	return true;
>  }
>  
>  #endif
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index ea6695d53fb9..798cebd9e147 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1261,7 +1261,11 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>  	}
>  
>  	table = pte_offset_kernel(pmdp, addr);
> +
> +	/* Synchronize against ptdump_walk_pgd() */
> +	mmap_read_lock(&init_mm);
>  	pmd_clear(pmdp);
> +	mmap_read_unlock(&init_mm);

So this works because ptdump_walk_pgd() takes the write_lock (which is mutually
exclusive with any read_lock holders) for the duration of the table walk, so it
will either consistently see the pgtables before or after this removal. It will
never disappear during the walk, correct?

I guess there is a risk of this showing up as contention with other init_mm
write_lock holders. But I expect that pmd_free_pte_page()/pud_free_pmd_page()
are called sufficiently rarely that the risk is very small. Let's fix any perf
problem if/when we see it.

>  	__flush_tlb_kernel_pgtable(addr);

And the tlbi doesn't need to be serialized because there is no security issue.
The walker can be trusted to only dereference memory that it sees as it walks
the pgtable (obviously).

>  	pte_free_kernel(NULL, table);
>  	return 1;
> @@ -1289,7 +1293,10 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>  		pmd_free_pte_page(pmdp, next);
>  	} while (pmdp++, next += PMD_SIZE, next != end);
>  
> +	/* Synchronize against ptdump_walk_pgd() */
> +	mmap_read_lock(&init_mm);
>  	pud_clear(pudp);
> +	mmap_read_unlock(&init_mm);

Hmm, so pud_free_pmd_page() is now going to cause us to acquire and release the
lock up to 513 times (512 PMD entries plus the PUD itself, for a 4K kernel). I
wonder if there is an argument for clearing the pud first (under the lock), then
the pmds can all be cleared without a lock, since the walker won't be able to
see the pmds once the pud is cleared.

Thanks,
Ryan

>  	__flush_tlb_kernel_pgtable(addr);
>  	pmd_free(NULL, table);
>  	return 1;



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] arm64: Enable vmalloc-huge with ptdump
  2025-05-30  8:40 ` Ryan Roberts
@ 2025-05-30  9:07   ` Anshuman Khandual
  2025-05-30  9:14   ` Dev Jain
  1 sibling, 0 replies; 16+ messages in thread
From: Anshuman Khandual @ 2025-05-30  9:07 UTC (permalink / raw)
  To: Ryan Roberts, Dev Jain, catalin.marinas, will
  Cc: quic_zhenhuah, kevin.brodsky, yangyicong, joey.gouly,
	linux-arm-kernel, linux-kernel, david

On 5/30/25 14:10, Ryan Roberts wrote:
> On 30/05/2025 09:20, Dev Jain wrote:
>> arm64 disables vmalloc-huge when kernel page table dumping is enabled,
>> because an intermediate table may be removed, potentially causing the
>> ptdump code to dereference an invalid address. We want to be able to
>> analyze block vs page mappings for kernel mappings with ptdump, so to
>> enable vmalloc-huge with ptdump, synchronize between page table removal in
>> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
>> use mmap_read_lock and not write lock because we don't need to synchronize
>> between two different vm_structs; two vmalloc objects running this same
>> code path will point to different page tables, hence there is no race. 
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>  arch/arm64/include/asm/vmalloc.h | 6 ++----
>>  arch/arm64/mm/mmu.c              | 7 +++++++
>>  2 files changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
>> index 38fafffe699f..28b7173d8693 100644
>> --- a/arch/arm64/include/asm/vmalloc.h
>> +++ b/arch/arm64/include/asm/vmalloc.h
>> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
>>  	/*
>>  	 * SW table walks can't handle removal of intermediate entries.
>>  	 */
>> -	return pud_sect_supported() &&
>> -	       !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>> +	return pud_sect_supported();
>>  }
>>  
>>  #define arch_vmap_pmd_supported arch_vmap_pmd_supported
>>  static inline bool arch_vmap_pmd_supported(pgprot_t prot)
>>  {
>> -	/* See arch_vmap_pud_supported() */
>> -	return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>> +	return true;
>>  }
>>  
>>  #endif
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index ea6695d53fb9..798cebd9e147 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -1261,7 +1261,11 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>>  	}
>>  
>>  	table = pte_offset_kernel(pmdp, addr);
>> +
>> +	/* Synchronize against ptdump_walk_pgd() */
>> +	mmap_read_lock(&init_mm);
>>  	pmd_clear(pmdp);
>> +	mmap_read_unlock(&init_mm);
> 
> So this works because ptdump_walk_pgd() takes the write_lock (which is mutually
> exclusive with any read_lock holders) for the duration of the table walk, so it
> will either consistently see the pgtables before or after this removal. It will
> never disappear during the walk, correct?

Agreed.

> 
> I guess there is a risk of this showing up as contention with other init_mm
> write_lock holders. But I expect that pmd_free_pte_page()/pud_free_pmd_page()
> are called sufficiently rarely that the risk is very small. Let's fix any perf
> problem if/when we see it.

Checking whether CONFIG_PTDUMP_DEBUGFS is enabled is simple enough and costs
little. So why not take the lock only in the scenarios where this read lock is
really required? Something like

--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1293,11 +1293,15 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
                pmd_free_pte_page(pmdp, next);
        } while (pmdp++, next += PMD_SIZE, next != end);
 
-       /* Synchronize against ptdump_walk_pgd() */
-       mmap_read_lock(&init_mm);
-       pud_clear(pudp);
-       mmap_read_unlock(&init_mm);
        __flush_tlb_kernel_pgtable(addr);
+       if (IS_ENABLED(CONFIG_PTDUMP_DEBUGFS)) {
+               /* Synchronize against ptdump_walk_pgd() */
+               mmap_read_lock(&init_mm);
+               pud_clear(pudp);
+               mmap_read_unlock(&init_mm);
+       } else {
+               pud_clear(pudp);
+       }
        pmd_free(NULL, table);
        return 1;
 }

> 
>>  	__flush_tlb_kernel_pgtable(addr);
> 
> And the tlbi doesn't need to be serialized because there is no security issue.
> The walker can be trusted to only dereference memory that it sees as it walks
> the pgtable (obviously).

Agreed.

> 
>>  	pte_free_kernel(NULL, table);
>>  	return 1;
>> @@ -1289,7 +1293,10 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>  		pmd_free_pte_page(pmdp, next);
>>  	} while (pmdp++, next += PMD_SIZE, next != end);
>>  
>> +	/* Synchronize against ptdump_walk_pgd() */
>> +	mmap_read_lock(&init_mm);
>>  	pud_clear(pudp);
>> +	mmap_read_unlock(&init_mm);
> 
> Hmm, so pud_free_pmd_page() is now going to cause us to acquire and release the
> (upto) lock 513 times (for a 4K kernel). I wonder if there is an argument for
> clearing the pud first (under the lock), then the pmds can all be cleared
> without a lock, since the walker won't be able to see the pmds once the pud is
> cleared.

Makes sense if pud_free_pmd_page() were the only caller, but it seems
vmap_try_huge_pmd() calls pmd_free_pte_page() directly as well.

> 
> Thanks,
> Ryan
> 
>>  	__flush_tlb_kernel_pgtable(addr);
>>  	pmd_free(NULL, table);
>>  	return 1;
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] arm64: Enable vmalloc-huge with ptdump
  2025-05-30  8:40 ` Ryan Roberts
  2025-05-30  9:07   ` Anshuman Khandual
@ 2025-05-30  9:14   ` Dev Jain
  2025-05-30  9:47     ` Ryan Roberts
  2025-05-30 11:50     ` Ryan Roberts
  1 sibling, 2 replies; 16+ messages in thread
From: Dev Jain @ 2025-05-30  9:14 UTC (permalink / raw)
  To: Ryan Roberts, catalin.marinas, will
  Cc: anshuman.khandual, quic_zhenhuah, kevin.brodsky, yangyicong,
	joey.gouly, linux-arm-kernel, linux-kernel, david


On 30/05/25 2:10 pm, Ryan Roberts wrote:
> On 30/05/2025 09:20, Dev Jain wrote:
>> arm64 disables vmalloc-huge when kernel page table dumping is enabled,
>> because an intermediate table may be removed, potentially causing the
>> ptdump code to dereference an invalid address. We want to be able to
>> analyze block vs page mappings for kernel mappings with ptdump, so to
>> enable vmalloc-huge with ptdump, synchronize between page table removal in
>> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
>> use mmap_read_lock and not write lock because we don't need to synchronize
>> between two different vm_structs; two vmalloc objects running this same
>> code path will point to different page tables, hence there is no race.

My "correction" from race->no problem was incorrect after all :) There will
be no race either, since the vm_struct object has exclusive access to whatever
table it is clearing.

>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   arch/arm64/include/asm/vmalloc.h | 6 ++----
>>   arch/arm64/mm/mmu.c              | 7 +++++++
>>   2 files changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
>> index 38fafffe699f..28b7173d8693 100644
>> --- a/arch/arm64/include/asm/vmalloc.h
>> +++ b/arch/arm64/include/asm/vmalloc.h
>> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
>>   	/*
>>   	 * SW table walks can't handle removal of intermediate entries.
>>   	 */
>> -	return pud_sect_supported() &&
>> -	       !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>> +	return pud_sect_supported();
>>   }
>>   
>>   #define arch_vmap_pmd_supported arch_vmap_pmd_supported
>>   static inline bool arch_vmap_pmd_supported(pgprot_t prot)
>>   {
>> -	/* See arch_vmap_pud_supported() */
>> -	return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>> +	return true;
>>   }
>>   
>>   #endif
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index ea6695d53fb9..798cebd9e147 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -1261,7 +1261,11 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>>   	}
>>   
>>   	table = pte_offset_kernel(pmdp, addr);
>> +
>> +	/* Synchronize against ptdump_walk_pgd() */
>> +	mmap_read_lock(&init_mm);
>>   	pmd_clear(pmdp);
>> +	mmap_read_unlock(&init_mm);
> So this works because ptdump_walk_pgd() takes the write_lock (which is mutually
> exclusive with any read_lock holders) for the duration of the table walk, so it
> will either consistently see the pgtables before or after this removal. It will
> never disappear during the walk, correct?
>
> I guess there is a risk of this showing up as contention with other init_mm
> write_lock holders. But I expect that pmd_free_pte_page()/pud_free_pmd_page()
> are called sufficiently rarely that the risk is very small. Let's fix any perf
> problem if/when we see it.

We can avoid all of that with my initial approach - wrapping the lock in a
CONFIG_PTDUMP_DEBUGFS check. I don't have a strong opinion, just putting it
out there.

>
>>   	__flush_tlb_kernel_pgtable(addr);
> And the tlbi doesn't need to be serialized because there is no security issue.
> The walker can be trusted to only dereference memory that it sees as it walks
> the pgtable (obviously).
>
>>   	pte_free_kernel(NULL, table);
>>   	return 1;
>> @@ -1289,7 +1293,10 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>   		pmd_free_pte_page(pmdp, next);
>>   	} while (pmdp++, next += PMD_SIZE, next != end);
>>   
>> +	/* Synchronize against ptdump_walk_pgd() */
>> +	mmap_read_lock(&init_mm);
>>   	pud_clear(pudp);
>> +	mmap_read_unlock(&init_mm);
> Hmm, so pud_free_pmd_page() is now going to cause us to acquire and release the
> (upto) lock 513 times (for a 4K kernel). I wonder if there is an argument for
> clearing the pud first (under the lock), then the pmds can all be cleared
> without a lock, since the walker won't be able to see the pmds once the pud is
> cleared.

Yes, we can isolate the PMD table in case the caller of pmd_free_pte_page() is
pud_free_pmd_page(). In this case, vm_struct_1 has exclusive access to the entire
pmd page, hence no race will occur. But in case of vmap_try_huge_pmd() being the
caller, we cannot drop the locks around pmd_free_pte_page(). So we can have
something like

#ifdef CONFIG_PTDUMP_DEBUGFS
static inline void ptdump_synchronize_lock(bool flag)
{
	if (flag)
		mmap_read_lock(&init_mm);
}
#else
static inline void ptdump_synchronize_lock(bool flag) {}
#endif

(plus a matching ptdump_synchronize_unlock()), and pass false when the caller
is pud_free_pmd_page().

>
> Thanks,
> Ryan
>
>>   	__flush_tlb_kernel_pgtable(addr);
>>   	pmd_free(NULL, table);
>>   	return 1;


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] arm64: Enable vmalloc-huge with ptdump
  2025-05-30  9:14   ` Dev Jain
@ 2025-05-30  9:47     ` Ryan Roberts
  2025-05-30  9:57       ` Dev Jain
  2025-05-30 11:50     ` Ryan Roberts
  1 sibling, 1 reply; 16+ messages in thread
From: Ryan Roberts @ 2025-05-30  9:47 UTC (permalink / raw)
  To: Dev Jain, catalin.marinas, will
  Cc: anshuman.khandual, quic_zhenhuah, kevin.brodsky, yangyicong,
	joey.gouly, linux-arm-kernel, linux-kernel, david

On 30/05/2025 10:14, Dev Jain wrote:
> 
> On 30/05/25 2:10 pm, Ryan Roberts wrote:
>> On 30/05/2025 09:20, Dev Jain wrote:
>>> arm64 disables vmalloc-huge when kernel page table dumping is enabled,
>>> because an intermediate table may be removed, potentially causing the
>>> ptdump code to dereference an invalid address. We want to be able to
>>> analyze block vs page mappings for kernel mappings with ptdump, so to
>>> enable vmalloc-huge with ptdump, synchronize between page table removal in
>>> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
>>> use mmap_read_lock and not write lock because we don't need to synchronize
>>> between two different vm_structs; two vmalloc objects running this same
>>> code path will point to different page tables, hence there is no race.
> 
> My "correction" from race->no problem was incorrect after all :) There will
> be no race too since the vm_struct object has exclusive access to whatever
> table it is clearing.
> 
>>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> ---
>>>   arch/arm64/include/asm/vmalloc.h | 6 ++----
>>>   arch/arm64/mm/mmu.c              | 7 +++++++
>>>   2 files changed, 9 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
>>> index 38fafffe699f..28b7173d8693 100644
>>> --- a/arch/arm64/include/asm/vmalloc.h
>>> +++ b/arch/arm64/include/asm/vmalloc.h
>>> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
>>>       /*
>>>        * SW table walks can't handle removal of intermediate entries.
>>>        */
>>> -    return pud_sect_supported() &&
>>> -           !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>>> +    return pud_sect_supported();
>>>   }
>>>     #define arch_vmap_pmd_supported arch_vmap_pmd_supported
>>>   static inline bool arch_vmap_pmd_supported(pgprot_t prot)
>>>   {
>>> -    /* See arch_vmap_pud_supported() */
>>> -    return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>>> +    return true;
>>>   }
>>>     #endif
>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>> index ea6695d53fb9..798cebd9e147 100644
>>> --- a/arch/arm64/mm/mmu.c
>>> +++ b/arch/arm64/mm/mmu.c
>>> @@ -1261,7 +1261,11 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>>>       }
>>>         table = pte_offset_kernel(pmdp, addr);
>>> +
>>> +    /* Synchronize against ptdump_walk_pgd() */
>>> +    mmap_read_lock(&init_mm);
>>>       pmd_clear(pmdp);
>>> +    mmap_read_unlock(&init_mm);
>> So this works because ptdump_walk_pgd() takes the write_lock (which is mutually
>> exclusive with any read_lock holders) for the duration of the table walk, so it
>> will either consistently see the pgtables before or after this removal. It will
>> never disappear during the walk, correct?
>>
>> I guess there is a risk of this showing up as contention with other init_mm
>> write_lock holders. But I expect that pmd_free_pte_page()/pud_free_pmd_page()
>> are called sufficiently rarely that the risk is very small. Let's fix any perf
>> problem if/when we see it.
> 
> We can avoid all of that by my initial approach - to wrap the lock around
> CONFIG_PTDUMP_DEBUGFS.
> I don't have a strong opinion, just putting it out there.
> 
>>
>>>       __flush_tlb_kernel_pgtable(addr);
>> And the tlbi doesn't need to be serialized because there is no security issue.
>> The walker can be trusted to only dereference memory that it sees as it walks
>> the pgtable (obviously).
>>
>>>       pte_free_kernel(NULL, table);
>>>       return 1;
>>> @@ -1289,7 +1293,10 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>>           pmd_free_pte_page(pmdp, next);
>>>       } while (pmdp++, next += PMD_SIZE, next != end);
>>>   +    /* Synchronize against ptdump_walk_pgd() */
>>> +    mmap_read_lock(&init_mm);
>>>       pud_clear(pudp);
>>> +    mmap_read_unlock(&init_mm);
>> Hmm, so pud_free_pmd_page() is now going to cause us to acquire and release the
>> (upto) lock 513 times (for a 4K kernel). I wonder if there is an argument for
>> clearing the pud first (under the lock), then the pmds can all be cleared
>> without a lock, since the walker won't be able to see the pmds once the pud is
>> cleared.
> 
> Yes, we can isolate the PMD table in case the caller of pmd_free_pte_page is
> pud_free_pmd_page. In this case, vm_struct_1 has exclusive access to the entire
> pmd page, hence no race will occur. But, in case of vmap_try_huge_pmd() being the
> caller, we cannot drop the locks around pmd_free_pte_page. So we can have something
> like
> 
> #ifdef CONFIG_PTDUMP_DEBUGFS
> static inline void ptdump_synchronize_lock(bool flag)
> {
>     if (flag)
>         mmap_read_lock(&init_mm);
> }
> 
> and pass false when the caller is pud_free_pmd_page.

Or something like this? (completely untested):

---8<---
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 8fcf59ba39db..1f3a922167e4 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1267,7 +1267,7 @@ int pmd_clear_huge(pmd_t *pmdp)
        return 1;
 }

-int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
+static int __pmd_free_pte_page(pmd_t *pmdp, unsigned long addr, bool lock)
 {
        pte_t *table;
        pmd_t pmd;
@@ -1280,12 +1280,23 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
        }

        table = pte_offset_kernel(pmdp, addr);
+
+       if (lock)
+               mmap_read_lock(&init_mm);
        pmd_clear(pmdp);
+       if (lock)
+               mmap_read_unlock(&init_mm);
+
        __flush_tlb_kernel_pgtable(addr);
        pte_free_kernel(NULL, table);
        return 1;
 }

+int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
+{
+       return __pmd_free_pte_page(pmdp, addr, true);
+}
+
 int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
 {
        pmd_t *table;
@@ -1300,15 +1311,19 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
                return 1;
        }

+       /* Synchronize against ptdump_walk_pgd() */
+       mmap_read_lock(&init_mm);
+       pud_clear(pudp);
+       mmap_read_unlock(&init_mm);
+
        table = pmd_offset(pudp, addr);
        pmdp = table;
        next = addr;
        end = addr + PUD_SIZE;
        do {
-               pmd_free_pte_page(pmdp, next);
+               __pmd_free_pte_page(pmdp, next, false);
        } while (pmdp++, next += PMD_SIZE, next != end);

-       pud_clear(pudp);
        __flush_tlb_kernel_pgtable(addr);
        pmd_free(NULL, table);
        return 1;
---8<---



> 
>>
>> Thanks,
>> Ryan
>>
>>>       __flush_tlb_kernel_pgtable(addr);
>>>       pmd_free(NULL, table);
>>>       return 1;



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] arm64: Enable vmalloc-huge with ptdump
  2025-05-30  9:47     ` Ryan Roberts
@ 2025-05-30  9:57       ` Dev Jain
  0 siblings, 0 replies; 16+ messages in thread
From: Dev Jain @ 2025-05-30  9:57 UTC (permalink / raw)
  To: Ryan Roberts, catalin.marinas, will
  Cc: anshuman.khandual, quic_zhenhuah, kevin.brodsky, yangyicong,
	joey.gouly, linux-arm-kernel, linux-kernel, david


On 30/05/25 3:17 pm, Ryan Roberts wrote:
> On 30/05/2025 10:14, Dev Jain wrote:
>> On 30/05/25 2:10 pm, Ryan Roberts wrote:
>>> On 30/05/2025 09:20, Dev Jain wrote:
>>>> arm64 disables vmalloc-huge when kernel page table dumping is enabled,
>>>> because an intermediate table may be removed, potentially causing the
>>>> ptdump code to dereference an invalid address. We want to be able to
>>>> analyze block vs page mappings for kernel mappings with ptdump, so to
>>>> enable vmalloc-huge with ptdump, synchronize between page table removal in
>>>> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
>>>> use mmap_read_lock and not write lock because we don't need to synchronize
>>>> between two different vm_structs; two vmalloc objects running this same
>>>> code path will point to different page tables, hence there is no race.
>> My "correction" from race->no problem was incorrect after all :) There will
>> be no race too since the vm_struct object has exclusive access to whatever
>> table it is clearing.
>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>>    arch/arm64/include/asm/vmalloc.h | 6 ++----
>>>>    arch/arm64/mm/mmu.c              | 7 +++++++
>>>>    2 files changed, 9 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
>>>> index 38fafffe699f..28b7173d8693 100644
>>>> --- a/arch/arm64/include/asm/vmalloc.h
>>>> +++ b/arch/arm64/include/asm/vmalloc.h
>>>> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
>>>>        /*
>>>>         * SW table walks can't handle removal of intermediate entries.
>>>>         */
>>>> -    return pud_sect_supported() &&
>>>> -           !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>>>> +    return pud_sect_supported();
>>>>    }
>>>>      #define arch_vmap_pmd_supported arch_vmap_pmd_supported
>>>>    static inline bool arch_vmap_pmd_supported(pgprot_t prot)
>>>>    {
>>>> -    /* See arch_vmap_pud_supported() */
>>>> -    return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>>>> +    return true;
>>>>    }
>>>>      #endif
>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>> index ea6695d53fb9..798cebd9e147 100644
>>>> --- a/arch/arm64/mm/mmu.c
>>>> +++ b/arch/arm64/mm/mmu.c
>>>> @@ -1261,7 +1261,11 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>>>>        }
>>>>          table = pte_offset_kernel(pmdp, addr);
>>>> +
>>>> +    /* Synchronize against ptdump_walk_pgd() */
>>>> +    mmap_read_lock(&init_mm);
>>>>        pmd_clear(pmdp);
>>>> +    mmap_read_unlock(&init_mm);
>>> So this works because ptdump_walk_pgd() takes the write_lock (which is mutually
>>> exclusive with any read_lock holders) for the duration of the table walk, so it
>>> will either consistently see the pgtables before or after this removal. It will
>>> never disappear during the walk, correct?
>>>
>>> I guess there is a risk of this showing up as contention with other init_mm
>>> write_lock holders. But I expect that pmd_free_pte_page()/pud_free_pmd_page()
>>> are called sufficiently rarely that the risk is very small. Let's fix any perf
>>> problem if/when we see it.
>> We can avoid all of that by my initial approach - to wrap the lock around
>> CONFIG_PTDUMP_DEBUGFS.
>> I don't have a strong opinion, just putting it out there.
>>
>>>>        __flush_tlb_kernel_pgtable(addr);
>>> And the tlbi doesn't need to be serialized because there is no security issue.
>>> The walker can be trusted to only dereference memory that it sees as it walks
>>> the pgtable (obviously).
>>>
>>>>        pte_free_kernel(NULL, table);
>>>>        return 1;
>>>> @@ -1289,7 +1293,10 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>>>            pmd_free_pte_page(pmdp, next);
>>>>        } while (pmdp++, next += PMD_SIZE, next != end);
>>>>    +    /* Synchronize against ptdump_walk_pgd() */
>>>> +    mmap_read_lock(&init_mm);
>>>>        pud_clear(pudp);
>>>> +    mmap_read_unlock(&init_mm);
>>> Hmm, so pud_free_pmd_page() is now going to cause us to acquire and release
>>> the lock up to 513 times (for a 4K kernel). I wonder if there is an argument for
>>> clearing the pud first (under the lock), then the pmds can all be cleared
>>> without a lock, since the walker won't be able to see the pmds once the pud is
>>> cleared.
>> Yes, we can isolate the PMD table in case the caller of pmd_free_pte_page is
>> pud_free_pmd_page. In this case, vm_struct_1 has exclusive access to the entire
>> pmd page, hence no race will occur. But, in case of vmap_try_huge_pmd() being the
>> caller, we cannot drop the locks around pmd_free_pte_page. So we can have something
>> like
>>
>> #ifdef CONFIG_PTDUMP_DEBUGFS
>> static inline void ptdump_synchronize_lock(bool flag)
>> {
>>      if (flag)
>>          mmap_read_lock(&init_mm);
>> }
>> #endif
>>
>> and pass false when the caller is pud_free_pmd_page.
> of something like this? (completely untested):
>
> ---8<---
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 8fcf59ba39db..1f3a922167e4 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1267,7 +1267,7 @@ int pmd_clear_huge(pmd_t *pmdp)
>          return 1;
>   }
>
> -int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
> +static int __pmd_free_pte_page(pmd_t *pmdp, unsigned long addr, bool lock)
>   {
>          pte_t *table;
>          pmd_t pmd;
> @@ -1280,12 +1280,23 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>          }
>
>          table = pte_offset_kernel(pmdp, addr);
> +
> +       if (lock)
> +               mmap_read_lock(&init_mm);
>          pmd_clear(pmdp);
> +       if (lock)
> +               mmap_read_unlock(&init_mm);
> +
>          __flush_tlb_kernel_pgtable(addr);
>          pte_free_kernel(NULL, table);
>          return 1;
>   }
>
> +int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
> +{
> +       return __pmd_free_pte_page(pmdp, addr, true);
> +}
> +
>   int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>   {
>          pmd_t *table;
> @@ -1300,15 +1311,19 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>                  return 1;
>          }
>
> +       /* Synchronize against ptdump_walk_pgd() */
> +       mmap_read_lock(&init_mm);
> +       pud_clear(pudp);
> +       mmap_read_unlock(&init_mm);
> +
>          table = pmd_offset(pudp, addr);
>          pmdp = table;
>          next = addr;
>          end = addr + PUD_SIZE;
>          do {
> -               pmd_free_pte_page(pmdp, next);
> +               __pmd_free_pte_page(pmdp, next, false);
>          } while (pmdp++, next += PMD_SIZE, next != end);
>
> -       pud_clear(pudp);
>          __flush_tlb_kernel_pgtable(addr);
>          pmd_free(NULL, table);
>          return 1;
> ---8<---


This looks good too! Although I'd like to first decide whether we want to
wrap this in CONFIG_PTDUMP_DEBUGFS.


>
>
>>> Thanks,
>>> Ryan
>>>
>>>>        __flush_tlb_kernel_pgtable(addr);
>>>>        pmd_free(NULL, table);
>>>>        return 1;


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] arm64: Enable vmalloc-huge with ptdump
  2025-05-30  9:14   ` Dev Jain
  2025-05-30  9:47     ` Ryan Roberts
@ 2025-05-30 11:50     ` Ryan Roberts
  2025-05-30 12:35       ` Will Deacon
  1 sibling, 1 reply; 16+ messages in thread
From: Ryan Roberts @ 2025-05-30 11:50 UTC (permalink / raw)
  To: Dev Jain, catalin.marinas, will
  Cc: anshuman.khandual, quic_zhenhuah, kevin.brodsky, yangyicong,
	joey.gouly, linux-arm-kernel, linux-kernel, david

On 30/05/2025 10:14, Dev Jain wrote:
> 
> On 30/05/25 2:10 pm, Ryan Roberts wrote:
>> On 30/05/2025 09:20, Dev Jain wrote:
>>> arm64 disables vmalloc-huge when kernel page table dumping is enabled,
>>> because an intermediate table may be removed, potentially causing the
>>> ptdump code to dereference an invalid address. We want to be able to
>>> analyze block vs page mappings for kernel mappings with ptdump, so to
>>> enable vmalloc-huge with ptdump, synchronize between page table removal in
>>> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
>>> use mmap_read_lock and not write lock because we don't need to synchronize
>>> between two different vm_structs; two vmalloc objects running this same
>>> code path will point to different page tables, hence there is no race.
> 
> My "correction" from race->no problem was incorrect after all :) There will
> be no race too since the vm_struct object has exclusive access to whatever
> table it is clearing.
> 
>>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> ---
>>>   arch/arm64/include/asm/vmalloc.h | 6 ++----
>>>   arch/arm64/mm/mmu.c              | 7 +++++++
>>>   2 files changed, 9 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
>>> index 38fafffe699f..28b7173d8693 100644
>>> --- a/arch/arm64/include/asm/vmalloc.h
>>> +++ b/arch/arm64/include/asm/vmalloc.h
>>> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
>>>       /*
>>>        * SW table walks can't handle removal of intermediate entries.
>>>        */
>>> -    return pud_sect_supported() &&
>>> -           !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>>> +    return pud_sect_supported();
>>>   }
>>>     #define arch_vmap_pmd_supported arch_vmap_pmd_supported
>>>   static inline bool arch_vmap_pmd_supported(pgprot_t prot)
>>>   {
>>> -    /* See arch_vmap_pud_supported() */
>>> -    return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>>> +    return true;
>>>   }
>>>     #endif
>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>> index ea6695d53fb9..798cebd9e147 100644
>>> --- a/arch/arm64/mm/mmu.c
>>> +++ b/arch/arm64/mm/mmu.c
>>> @@ -1261,7 +1261,11 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>>>       }
>>>         table = pte_offset_kernel(pmdp, addr);
>>> +
>>> +    /* Synchronize against ptdump_walk_pgd() */
>>> +    mmap_read_lock(&init_mm);
>>>       pmd_clear(pmdp);
>>> +    mmap_read_unlock(&init_mm);
>> So this works because ptdump_walk_pgd() takes the write_lock (which is mutually
>> exclusive with any read_lock holders) for the duration of the table walk, so it
>> will either consistently see the pgtables before or after this removal. It will
>> never disappear during the walk, correct?
>>
>> I guess there is a risk of this showing up as contention with other init_mm
>> write_lock holders. But I expect that pmd_free_pte_page()/pud_free_pmd_page()
>> are called sufficiently rarely that the risk is very small. Let's fix any perf
>> problem if/when we see it.
> 
> We can avoid all of that with my initial approach - wrapping the lock in
> CONFIG_PTDUMP_DEBUGFS.
> I don't have a strong opinion, just putting it out there.

(I wrote then failed to send earlier):

It's ugly though. Personally I'd prefer to keep it simple unless we have clear
evidence that it's needed. I was just laying out my justification for not needing
to do the conditional wrapping in this comment.

> 
>>
>>>       __flush_tlb_kernel_pgtable(addr);
>> And the tlbi doesn't need to be serialized because there is no security issue.
>> The walker can be trusted to only dereference memory that it sees as it walks
>> the pgtable (obviously).
>>
>>>       pte_free_kernel(NULL, table);
>>>       return 1;
>>> @@ -1289,7 +1293,10 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>>           pmd_free_pte_page(pmdp, next);
>>>       } while (pmdp++, next += PMD_SIZE, next != end);
>>>   +    /* Synchronize against ptdump_walk_pgd() */
>>> +    mmap_read_lock(&init_mm);
>>>       pud_clear(pudp);
>>> +    mmap_read_unlock(&init_mm);
>> Hmm, so pud_free_pmd_page() is now going to cause us to acquire and release
>> the lock up to 513 times (for a 4K kernel). I wonder if there is an argument for
>> clearing the pud first (under the lock), then the pmds can all be cleared
>> without a lock, since the walker won't be able to see the pmds once the pud is
>> cleared.
> 
> Yes, we can isolate the PMD table in case the caller of pmd_free_pte_page is
> pud_free_pmd_page. In this case, vm_struct_1 has exclusive access to the entire
> pmd page, hence no race will occur. But, in case of vmap_try_huge_pmd() being the
> caller, we cannot drop the locks around pmd_free_pte_page. So we can have something
> like
> 
> #ifdef CONFIG_PTDUMP_DEBUGFS
> static inline void ptdump_synchronize_lock(bool flag)
> {
>     if (flag)
>         mmap_read_lock(&init_mm);
> }
> 
> and pass false when the caller is pud_free_pmd_page.
> 
>>
>> Thanks,
>> Ryan
>>
>>>       __flush_tlb_kernel_pgtable(addr);
>>>       pmd_free(NULL, table);
>>>       return 1;



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] arm64: Enable vmalloc-huge with ptdump
  2025-05-30 11:50     ` Ryan Roberts
@ 2025-05-30 12:35       ` Will Deacon
  2025-05-30 13:11         ` Ryan Roberts
  0 siblings, 1 reply; 16+ messages in thread
From: Will Deacon @ 2025-05-30 12:35 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Dev Jain, catalin.marinas, anshuman.khandual, quic_zhenhuah,
	kevin.brodsky, yangyicong, joey.gouly, linux-arm-kernel,
	linux-kernel, david

On Fri, May 30, 2025 at 12:50:40PM +0100, Ryan Roberts wrote:
> On 30/05/2025 10:14, Dev Jain wrote:
> > 
> > On 30/05/25 2:10 pm, Ryan Roberts wrote:
> >> On 30/05/2025 09:20, Dev Jain wrote:
> >>> arm64 disables vmalloc-huge when kernel page table dumping is enabled,
> >>> because an intermediate table may be removed, potentially causing the
> >>> ptdump code to dereference an invalid address. We want to be able to
> >>> analyze block vs page mappings for kernel mappings with ptdump, so to
> >>> enable vmalloc-huge with ptdump, synchronize between page table removal in
> >>> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
> >>> use mmap_read_lock and not write lock because we don't need to synchronize
> >>> between two different vm_structs; two vmalloc objects running this same
> >>> code path will point to different page tables, hence there is no race.
> > 
> > My "correction" from race->no problem was incorrect after all :) There will
> > be no race too since the vm_struct object has exclusive access to whatever
> > table it is clearing.
> > 
> >>>
> >>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> >>> ---
> >>>   arch/arm64/include/asm/vmalloc.h | 6 ++----
> >>>   arch/arm64/mm/mmu.c              | 7 +++++++
> >>>   2 files changed, 9 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
> >>> index 38fafffe699f..28b7173d8693 100644
> >>> --- a/arch/arm64/include/asm/vmalloc.h
> >>> +++ b/arch/arm64/include/asm/vmalloc.h
> >>> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
> >>>       /*
> >>>        * SW table walks can't handle removal of intermediate entries.
> >>>        */
> >>> -    return pud_sect_supported() &&
> >>> -           !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
> >>> +    return pud_sect_supported();
> >>>   }
> >>>     #define arch_vmap_pmd_supported arch_vmap_pmd_supported
> >>>   static inline bool arch_vmap_pmd_supported(pgprot_t prot)
> >>>   {
> >>> -    /* See arch_vmap_pud_supported() */
> >>> -    return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
> >>> +    return true;
> >>>   }
> >>>     #endif
> >>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> >>> index ea6695d53fb9..798cebd9e147 100644
> >>> --- a/arch/arm64/mm/mmu.c
> >>> +++ b/arch/arm64/mm/mmu.c
> >>> @@ -1261,7 +1261,11 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
> >>>       }
> >>>         table = pte_offset_kernel(pmdp, addr);
> >>> +
> >>> +    /* Synchronize against ptdump_walk_pgd() */
> >>> +    mmap_read_lock(&init_mm);
> >>>       pmd_clear(pmdp);
> >>> +    mmap_read_unlock(&init_mm);
> >> So this works because ptdump_walk_pgd() takes the write_lock (which is mutually
> >> exclusive with any read_lock holders) for the duration of the table walk, so it
> >> will either consistently see the pgtables before or after this removal. It will
> >> never disappear during the walk, correct?
> >>
> >> I guess there is a risk of this showing up as contention with other init_mm
> >> write_lock holders. But I expect that pmd_free_pte_page()/pud_free_pmd_page()
> >> are called sufficiently rarely that the risk is very small. Let's fix any perf
> >> problem if/when we see it.
> > 
> > We can avoid all of that with my initial approach - wrapping the lock in
> > CONFIG_PTDUMP_DEBUGFS.
> > I don't have a strong opinion, just putting it out there.
> 
> (I wrote then failed to send earlier):
> 
> It's ugly though. Personally I'd prefer to keep it simple unless we have clear
> evidence that it's needed. I was just laying out my justification for not needing
> to do the conditional wrapping in this comment.

I really don't think we should be adding unconditional locking overhead
to core mm routines purely to facilitate a rarely used debug option.

Instead, can we either adopt something like the RCU-like walk used by
fast GUP or stick the locking behind a static key that's only enabled
when a ptdump walk is in progress (a bit like how
hugetlb_vmemmap_optimize_folio() manipulates hugetlb_optimize_vmemmap_key)?

Will


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] arm64: Enable vmalloc-huge with ptdump
  2025-05-30 12:35       ` Will Deacon
@ 2025-05-30 13:11         ` Ryan Roberts
  2025-05-30 13:36           ` Will Deacon
  0 siblings, 1 reply; 16+ messages in thread
From: Ryan Roberts @ 2025-05-30 13:11 UTC (permalink / raw)
  To: Will Deacon
  Cc: Dev Jain, catalin.marinas, anshuman.khandual, quic_zhenhuah,
	kevin.brodsky, yangyicong, joey.gouly, linux-arm-kernel,
	linux-kernel, david

On 30/05/2025 13:35, Will Deacon wrote:
> On Fri, May 30, 2025 at 12:50:40PM +0100, Ryan Roberts wrote:
>> On 30/05/2025 10:14, Dev Jain wrote:
>>>
>>> On 30/05/25 2:10 pm, Ryan Roberts wrote:
>>>> On 30/05/2025 09:20, Dev Jain wrote:
>>>>> arm64 disables vmalloc-huge when kernel page table dumping is enabled,
>>>>> because an intermediate table may be removed, potentially causing the
>>>>> ptdump code to dereference an invalid address. We want to be able to
>>>>> analyze block vs page mappings for kernel mappings with ptdump, so to
>>>>> enable vmalloc-huge with ptdump, synchronize between page table removal in
>>>>> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
>>>>> use mmap_read_lock and not write lock because we don't need to synchronize
>>>>> between two different vm_structs; two vmalloc objects running this same
>>>>> code path will point to different page tables, hence there is no race.
>>>
>>> My "correction" from race->no problem was incorrect after all :) There will
>>> be no race too since the vm_struct object has exclusive access to whatever
>>> table it is clearing.
>>>
>>>>>
>>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>>> ---
>>>>>   arch/arm64/include/asm/vmalloc.h | 6 ++----
>>>>>   arch/arm64/mm/mmu.c              | 7 +++++++
>>>>>   2 files changed, 9 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
>>>>> index 38fafffe699f..28b7173d8693 100644
>>>>> --- a/arch/arm64/include/asm/vmalloc.h
>>>>> +++ b/arch/arm64/include/asm/vmalloc.h
>>>>> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
>>>>>       /*
>>>>>        * SW table walks can't handle removal of intermediate entries.
>>>>>        */
>>>>> -    return pud_sect_supported() &&
>>>>> -           !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>>>>> +    return pud_sect_supported();
>>>>>   }
>>>>>     #define arch_vmap_pmd_supported arch_vmap_pmd_supported
>>>>>   static inline bool arch_vmap_pmd_supported(pgprot_t prot)
>>>>>   {
>>>>> -    /* See arch_vmap_pud_supported() */
>>>>> -    return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>>>>> +    return true;
>>>>>   }
>>>>>     #endif
>>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>>> index ea6695d53fb9..798cebd9e147 100644
>>>>> --- a/arch/arm64/mm/mmu.c
>>>>> +++ b/arch/arm64/mm/mmu.c
>>>>> @@ -1261,7 +1261,11 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>>>>>       }
>>>>>         table = pte_offset_kernel(pmdp, addr);
>>>>> +
>>>>> +    /* Synchronize against ptdump_walk_pgd() */
>>>>> +    mmap_read_lock(&init_mm);
>>>>>       pmd_clear(pmdp);
>>>>> +    mmap_read_unlock(&init_mm);
>>>> So this works because ptdump_walk_pgd() takes the write_lock (which is mutually
>>>> exclusive with any read_lock holders) for the duration of the table walk, so it
>>>> will either consistently see the pgtables before or after this removal. It will
>>>> never disappear during the walk, correct?
>>>>
>>>> I guess there is a risk of this showing up as contention with other init_mm
>>>> write_lock holders. But I expect that pmd_free_pte_page()/pud_free_pmd_page()
>>>> are called sufficiently rarely that the risk is very small. Let's fix any perf
>>>> problem if/when we see it.
>>>
>>> We can avoid all of that with my initial approach - wrapping the lock in
>>> CONFIG_PTDUMP_DEBUGFS.
>>> I don't have a strong opinion, just putting it out there.
>>
>> (I wrote then failed to send earlier):
>>
>> It's ugly though. Personally I'd prefer to keep it simple unless we have clear
>> evidence that it's needed. I was just laying out my justification for not needing
>> to do the conditional wrapping in this comment.
> 
> I really don't think we should be adding unconditional locking overhead
> to core mm routines purely to facilitate a rarely used debug option.
> 
> Instead, can we either adopt something like the RCU-like walk used by
> fast GUP or stick the locking behind a static key that's only enabled
> when a ptdump walk is in progress (a bit like how
> hugetlb_vmemmap_optimize_folio() manipulates hugetlb_optimize_vmemmap_key)?

My sense is that the static key will be less effort and can be contained fully
in arm64. I think we would need to enable the key around the call to
ptdump_walk_pgd() in both ptdump_walk() and ptdump_check_wx(). Then where Dev is
currently taking the read lock, that would be contingent on the key being
enabled and the unlock would be contingent on having taken the lock.

Does that sound like an acceptable approach?

> 
> Will



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] arm64: Enable vmalloc-huge with ptdump
  2025-05-30 13:11         ` Ryan Roberts
@ 2025-05-30 13:36           ` Will Deacon
  2025-05-30 14:07             ` Ryan Roberts
  2025-06-05  4:48             ` Dev Jain
  0 siblings, 2 replies; 16+ messages in thread
From: Will Deacon @ 2025-05-30 13:36 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Dev Jain, catalin.marinas, anshuman.khandual, quic_zhenhuah,
	kevin.brodsky, yangyicong, joey.gouly, linux-arm-kernel,
	linux-kernel, david

On Fri, May 30, 2025 at 02:11:36PM +0100, Ryan Roberts wrote:
> On 30/05/2025 13:35, Will Deacon wrote:
> > On Fri, May 30, 2025 at 12:50:40PM +0100, Ryan Roberts wrote:
> >> On 30/05/2025 10:14, Dev Jain wrote:
> >>>
> >>> On 30/05/25 2:10 pm, Ryan Roberts wrote:
> >>>> On 30/05/2025 09:20, Dev Jain wrote:
> >>>>> arm64 disables vmalloc-huge when kernel page table dumping is enabled,
> >>>>> because an intermediate table may be removed, potentially causing the
> >>>>> ptdump code to dereference an invalid address. We want to be able to
> >>>>> analyze block vs page mappings for kernel mappings with ptdump, so to
> >>>>> enable vmalloc-huge with ptdump, synchronize between page table removal in
> >>>>> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
> >>>>> use mmap_read_lock and not write lock because we don't need to synchronize
> >>>>> between two different vm_structs; two vmalloc objects running this same
> >>>>> code path will point to different page tables, hence there is no race.
> >>>
> >>> My "correction" from race->no problem was incorrect after all :) There will
> >>> be no race too since the vm_struct object has exclusive access to whatever
> >>> table it is clearing.
> >>>
> >>>>>
> >>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> >>>>> ---
> >>>>>   arch/arm64/include/asm/vmalloc.h | 6 ++----
> >>>>>   arch/arm64/mm/mmu.c              | 7 +++++++
> >>>>>   2 files changed, 9 insertions(+), 4 deletions(-)
> >>>>>
> >>>>> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
> >>>>> index 38fafffe699f..28b7173d8693 100644
> >>>>> --- a/arch/arm64/include/asm/vmalloc.h
> >>>>> +++ b/arch/arm64/include/asm/vmalloc.h
> >>>>> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
> >>>>>       /*
> >>>>>        * SW table walks can't handle removal of intermediate entries.
> >>>>>        */
> >>>>> -    return pud_sect_supported() &&
> >>>>> -           !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
> >>>>> +    return pud_sect_supported();
> >>>>>   }
> >>>>>     #define arch_vmap_pmd_supported arch_vmap_pmd_supported
> >>>>>   static inline bool arch_vmap_pmd_supported(pgprot_t prot)
> >>>>>   {
> >>>>> -    /* See arch_vmap_pud_supported() */
> >>>>> -    return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
> >>>>> +    return true;
> >>>>>   }
> >>>>>     #endif
> >>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> >>>>> index ea6695d53fb9..798cebd9e147 100644
> >>>>> --- a/arch/arm64/mm/mmu.c
> >>>>> +++ b/arch/arm64/mm/mmu.c
> >>>>> @@ -1261,7 +1261,11 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
> >>>>>       }
> >>>>>         table = pte_offset_kernel(pmdp, addr);
> >>>>> +
> >>>>> +    /* Synchronize against ptdump_walk_pgd() */
> >>>>> +    mmap_read_lock(&init_mm);
> >>>>>       pmd_clear(pmdp);
> >>>>> +    mmap_read_unlock(&init_mm);
> >>>> So this works because ptdump_walk_pgd() takes the write_lock (which is mutually
> >>>> exclusive with any read_lock holders) for the duration of the table walk, so it
> >>>> will either consistently see the pgtables before or after this removal. It will
> >>>> never disappear during the walk, correct?
> >>>>
> >>>> I guess there is a risk of this showing up as contention with other init_mm
> >>>> write_lock holders. But I expect that pmd_free_pte_page()/pud_free_pmd_page()
> >>>> are called sufficiently rarely that the risk is very small. Let's fix any perf
> >>>> problem if/when we see it.
> >>>
> >>> We can avoid all of that with my initial approach - wrapping the lock in
> >>> CONFIG_PTDUMP_DEBUGFS.
> >>> I don't have a strong opinion, just putting it out there.
> >>
> >> (I wrote then failed to send earlier):
> >>
> >> It's ugly though. Personally I'd prefer to keep it simple unless we have clear
> >> evidence that it's needed. I was just laying out my justification for not needing
> >> to do the conditional wrapping in this comment.
> > 
> > I really don't think we should be adding unconditional locking overhead
> > to core mm routines purely to facilitate a rarely used debug option.
> > 
> > Instead, can we either adopt something like the RCU-like walk used by
> > fast GUP or stick the locking behind a static key that's only enabled
> > when a ptdump walk is in progress (a bit like how
> > hugetlb_vmemmap_optimize_folio() manipulates hugetlb_optimize_vmemmap_key)?
> 
> My sense is that the static key will be less effort and can be contained fully
> in arm64. I think we would need to enable the key around the call to
> ptdump_walk_pgd() in both ptdump_walk() and ptdump_check_wx(). Then where Dev is
> currently taking the read lock, that would be contingent on the key being
> enabled and the unlock would be contingent on having taken the lock.
> 
> Does that sound like an acceptable approach?

Yup, and I think you'll probably need something like a synchronize_rcu()
when flipping the key to deal with any pre-existing page-table freers.

Will


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] arm64: Enable vmalloc-huge with ptdump
  2025-05-30 13:36           ` Will Deacon
@ 2025-05-30 14:07             ` Ryan Roberts
  2025-06-05  4:48             ` Dev Jain
  1 sibling, 0 replies; 16+ messages in thread
From: Ryan Roberts @ 2025-05-30 14:07 UTC (permalink / raw)
  To: Will Deacon
  Cc: Dev Jain, catalin.marinas, anshuman.khandual, quic_zhenhuah,
	kevin.brodsky, yangyicong, joey.gouly, linux-arm-kernel,
	linux-kernel, david

On 30/05/2025 14:36, Will Deacon wrote:
> On Fri, May 30, 2025 at 02:11:36PM +0100, Ryan Roberts wrote:
>> On 30/05/2025 13:35, Will Deacon wrote:
>>> On Fri, May 30, 2025 at 12:50:40PM +0100, Ryan Roberts wrote:
>>>> On 30/05/2025 10:14, Dev Jain wrote:
>>>>>
>>>>> On 30/05/25 2:10 pm, Ryan Roberts wrote:
>>>>>> On 30/05/2025 09:20, Dev Jain wrote:
>>>>>>> arm64 disables vmalloc-huge when kernel page table dumping is enabled,
>>>>>>> because an intermediate table may be removed, potentially causing the
>>>>>>> ptdump code to dereference an invalid address. We want to be able to
>>>>>>> analyze block vs page mappings for kernel mappings with ptdump, so to
>>>>>>> enable vmalloc-huge with ptdump, synchronize between page table removal in
>>>>>>> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
>>>>>>> use mmap_read_lock and not write lock because we don't need to synchronize
>>>>>>> between two different vm_structs; two vmalloc objects running this same
>>>>>>> code path will point to different page tables, hence there is no race.
>>>>>
>>>>> My "correction" from race->no problem was incorrect after all :) There will
>>>>> be no race too since the vm_struct object has exclusive access to whatever
>>>>> table it is clearing.
>>>>>
>>>>>>>
>>>>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>>>>> ---
>>>>>>>   arch/arm64/include/asm/vmalloc.h | 6 ++----
>>>>>>>   arch/arm64/mm/mmu.c              | 7 +++++++
>>>>>>>   2 files changed, 9 insertions(+), 4 deletions(-)
>>>>>>>
>>>>>>> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
>>>>>>> index 38fafffe699f..28b7173d8693 100644
>>>>>>> --- a/arch/arm64/include/asm/vmalloc.h
>>>>>>> +++ b/arch/arm64/include/asm/vmalloc.h
>>>>>>> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
>>>>>>>       /*
>>>>>>>        * SW table walks can't handle removal of intermediate entries.
>>>>>>>        */
>>>>>>> -    return pud_sect_supported() &&
>>>>>>> -           !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>>>>>>> +    return pud_sect_supported();
>>>>>>>   }
>>>>>>>     #define arch_vmap_pmd_supported arch_vmap_pmd_supported
>>>>>>>   static inline bool arch_vmap_pmd_supported(pgprot_t prot)
>>>>>>>   {
>>>>>>> -    /* See arch_vmap_pud_supported() */
>>>>>>> -    return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>>>>>>> +    return true;
>>>>>>>   }
>>>>>>>     #endif
>>>>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>>>>> index ea6695d53fb9..798cebd9e147 100644
>>>>>>> --- a/arch/arm64/mm/mmu.c
>>>>>>> +++ b/arch/arm64/mm/mmu.c
>>>>>>> @@ -1261,7 +1261,11 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>>>>>>>       }
>>>>>>>         table = pte_offset_kernel(pmdp, addr);
>>>>>>> +
>>>>>>> +    /* Synchronize against ptdump_walk_pgd() */
>>>>>>> +    mmap_read_lock(&init_mm);
>>>>>>>       pmd_clear(pmdp);
>>>>>>> +    mmap_read_unlock(&init_mm);
>>>>>> So this works because ptdump_walk_pgd() takes the write_lock (which is mutually
>>>>>> exclusive with any read_lock holders) for the duration of the table walk, so it
>>>>>> will either consistently see the pgtables before or after this removal. It will
>>>>>> never disappear during the walk, correct?
>>>>>>
>>>>>> I guess there is a risk of this showing up as contention with other init_mm
>>>>>> write_lock holders. But I expect that pmd_free_pte_page()/pud_free_pmd_page()
>>>>>> are called sufficiently rarely that the risk is very small. Let's fix any perf
>>>>>> problem if/when we see it.
>>>>>
>>>>> We can avoid all of that with my initial approach - wrapping the lock in
>>>>> CONFIG_PTDUMP_DEBUGFS.
>>>>> I don't have a strong opinion, just putting it out there.
>>>>
>>>> (I wrote then failed to send earlier):
>>>>
>>>> It's ugly though. Personally I'd prefer to keep it simple unless we have clear
>>>> evidence that it's needed. I was just laying out my justification for not needing
>>>> to do the conditional wrapping in this comment.
>>>
>>> I really don't think we should be adding unconditional locking overhead
>>> to core mm routines purely to facilitate a rarely used debug option.
>>>
>>> Instead, can we either adopt something like the RCU-like walk used by
>>> fast GUP or stick the locking behind a static key that's only enabled
>>> when a ptdump walk is in progress (a bit like how
>>> hugetlb_vmemmap_optimize_folio() manipulates hugetlb_optimize_vmemmap_key)?
>>
>> My sense is that the static key will be less effort and can be contained fully
>> in arm64. I think we would need to enable the key around the call to
>> ptdump_walk_pgd() in both ptdump_walk() and ptdump_check_wx(). Then where Dev is
>> currently taking the read lock, that would be contingent on the key being
>> enabled and the unlock would be contingent on having taken the lock.
>>
>> Does that sound like an acceptable approach?
> 
> Yup, and I think you'll probably need something like a synchronize_rcu()
> when flipping the key to deal with any pre-existing page-table freers.

Yeah good point.

> 
> Will



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] arm64: Enable vmalloc-huge with ptdump
  2025-05-30 13:36           ` Will Deacon
  2025-05-30 14:07             ` Ryan Roberts
@ 2025-06-05  4:48             ` Dev Jain
  2025-06-05  8:16               ` Dev Jain
  1 sibling, 1 reply; 16+ messages in thread
From: Dev Jain @ 2025-06-05  4:48 UTC (permalink / raw)
  To: Will Deacon, Ryan Roberts
  Cc: catalin.marinas, anshuman.khandual, quic_zhenhuah, kevin.brodsky,
	yangyicong, joey.gouly, linux-arm-kernel, linux-kernel, david


On 30/05/25 7:06 pm, Will Deacon wrote:
> On Fri, May 30, 2025 at 02:11:36PM +0100, Ryan Roberts wrote:
>> On 30/05/2025 13:35, Will Deacon wrote:
>>> On Fri, May 30, 2025 at 12:50:40PM +0100, Ryan Roberts wrote:
>>>> On 30/05/2025 10:14, Dev Jain wrote:
>>>>> On 30/05/25 2:10 pm, Ryan Roberts wrote:
>>>>>> On 30/05/2025 09:20, Dev Jain wrote:
>>>>>>> arm64 disables vmalloc-huge when kernel page table dumping is enabled,
>>>>>>> because an intermediate table may be removed, potentially causing the
>>>>>>> ptdump code to dereference an invalid address. We want to be able to
>>>>>>> analyze block vs page mappings for kernel mappings with ptdump, so to
>>>>>>> enable vmalloc-huge with ptdump, synchronize between page table removal in
>>>>>>> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
>>>>>>> use mmap_read_lock and not write lock because we don't need to synchronize
>>>>>>> between two different vm_structs; two vmalloc objects running this same
>>>>>>> code path will point to different page tables, hence there is no race.
>>>>> My "correction" from race->no problem was incorrect after all :) There will
>>>>> be no race too since the vm_struct object has exclusive access to whatever
>>>>> table it is clearing.
>>>>>
>>>>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>>>>> ---
>>>>>>>    arch/arm64/include/asm/vmalloc.h | 6 ++----
>>>>>>>    arch/arm64/mm/mmu.c              | 7 +++++++
>>>>>>>    2 files changed, 9 insertions(+), 4 deletions(-)
>>>>>>>
>>>>>>> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
>>>>>>> index 38fafffe699f..28b7173d8693 100644
>>>>>>> --- a/arch/arm64/include/asm/vmalloc.h
>>>>>>> +++ b/arch/arm64/include/asm/vmalloc.h
>>>>>>> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
>>>>>>>        /*
>>>>>>>         * SW table walks can't handle removal of intermediate entries.
>>>>>>>         */
>>>>>>> -    return pud_sect_supported() &&
>>>>>>> -           !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>>>>>>> +    return pud_sect_supported();
>>>>>>>    }
>>>>>>>      #define arch_vmap_pmd_supported arch_vmap_pmd_supported
>>>>>>>    static inline bool arch_vmap_pmd_supported(pgprot_t prot)
>>>>>>>    {
>>>>>>> -    /* See arch_vmap_pud_supported() */
>>>>>>> -    return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>>>>>>> +    return true;
>>>>>>>    }
>>>>>>>      #endif
>>>>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>>>>> index ea6695d53fb9..798cebd9e147 100644
>>>>>>> --- a/arch/arm64/mm/mmu.c
>>>>>>> +++ b/arch/arm64/mm/mmu.c
>>>>>>> @@ -1261,7 +1261,11 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>>>>>>>        }
>>>>>>>          table = pte_offset_kernel(pmdp, addr);
>>>>>>> +
>>>>>>> +    /* Synchronize against ptdump_walk_pgd() */
>>>>>>> +    mmap_read_lock(&init_mm);
>>>>>>>        pmd_clear(pmdp);
>>>>>>> +    mmap_read_unlock(&init_mm);
>>>>>> So this works because ptdump_walk_pgd() takes the write_lock (which is mutually
>>>>>> exclusive with any read_lock holders) for the duration of the table walk, so it
>>>>>> will either consistently see the pgtables before or after this removal. It will
>>>>>> never disappear during the walk, correct?
>>>>>>
>>>>>> I guess there is a risk of this showing up as contention with other init_mm
>>>>>> write_lock holders. But I expect that pmd_free_pte_page()/pud_free_pmd_page()
>>>>>> are called sufficiently rarely that the risk is very small. Let's fix any perf
>>>>>> problem if/when we see it.
>>>>> We can avoid all of that by my initial approach - to wrap the lock around
>>>>> CONFIG_PTDUMP_DEBUGFS.
>>>>> I don't have a strong opinion, just putting it out there.
>>>> (I wrote then failed to send earlier):
>>>>
>>>> It's ugly though. Personally I'd prefer to keep it simple unless we have clear
>>>> evidence that its needed. I was just laying out my justification for not needing
>>>> to doing the conditional wrapping in this comment.
>>> I really don't think we should be adding unconditional locking overhead
>>> to core mm routines purely to facilitate a rarely used debug option.
>>>
>>> Instead, can we either adopt something like the RCU-like walk used by
>>> fast GUP or stick the locking behind a static key that's only enabled
>>> when a ptdump walk is in progress (a bit like how
>>> hugetlb_vmemmap_optimize_folio() manipulates hugetlb_optimize_vmemmap_key)?
>> My sense is that the static key will be less effort and can be contained fully
>> in arm64. I think we would need to enable the key around the call to
>> ptdump_walk_pgd() in both ptdump_walk() and ptdump_check_wx(). Then where Dev is
>> currently taking the read lock, that would be contingent on the key being
>> enabled and the unlock would be contingent on having taken the lock.
>>
>> Does that sound like an acceptable approach?
> Yup, and I think you'll probably need something like a synchronize_rcu()
> when flipping the key to deal with any pre-existing page-table freers.

IIUC, you mean to say that we need to ensure any existing readers holding
a reference to the isolated table in pmd_free_pte_page() and friends have exited.
But the problem is that we take an mmap_write_lock() around ptdump_walk_pgd(), so
this is a sleepable code path.

>
> Will


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] arm64: Enable vmalloc-huge with ptdump
  2025-06-05  4:48             ` Dev Jain
@ 2025-06-05  8:16               ` Dev Jain
  2025-06-11  9:33                 ` Will Deacon
  0 siblings, 1 reply; 16+ messages in thread
From: Dev Jain @ 2025-06-05  8:16 UTC (permalink / raw)
  To: Will Deacon, Ryan Roberts
  Cc: catalin.marinas, anshuman.khandual, quic_zhenhuah, kevin.brodsky,
	yangyicong, joey.gouly, linux-arm-kernel, linux-kernel, david


On 05/06/25 10:18 am, Dev Jain wrote:
>
> On 30/05/25 7:06 pm, Will Deacon wrote:
>> On Fri, May 30, 2025 at 02:11:36PM +0100, Ryan Roberts wrote:
>>> On 30/05/2025 13:35, Will Deacon wrote:
>>>> On Fri, May 30, 2025 at 12:50:40PM +0100, Ryan Roberts wrote:
>>>>> On 30/05/2025 10:14, Dev Jain wrote:
>>>>>> On 30/05/25 2:10 pm, Ryan Roberts wrote:
>>>>>>> On 30/05/2025 09:20, Dev Jain wrote:
>>>>>>>> arm64 disables vmalloc-huge when kernel page table dumping is 
>>>>>>>> enabled,
>>>>>>>> because an intermediate table may be removed, potentially 
>>>>>>>> causing the
>>>>>>>> ptdump code to dereference an invalid address. We want to be 
>>>>>>>> able to
>>>>>>>> analyze block vs page mappings for kernel mappings with ptdump, 
>>>>>>>> so to
>>>>>>>> enable vmalloc-huge with ptdump, synchronize between page table 
>>>>>>>> removal in
>>>>>>>> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable 
>>>>>>>> walking. We
>>>>>>>> use mmap_read_lock and not write lock because we don't need to 
>>>>>>>> synchronize
>>>>>>>> between two different vm_structs; two vmalloc objects running 
>>>>>>>> this same
>>>>>>>> code path will point to different page tables, hence there is 
>>>>>>>> no race.
>>>>>> My "correction" from race->no problem was incorrect after all :) 
>>>>>> There will
>>>>>> be no race too since the vm_struct object has exclusive access to 
>>>>>> whatever
>>>>>> table it is clearing.
>>>>>>
>>>>>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>>>>>> ---
>>>>>>>>    arch/arm64/include/asm/vmalloc.h | 6 ++----
>>>>>>>>    arch/arm64/mm/mmu.c              | 7 +++++++
>>>>>>>>    2 files changed, 9 insertions(+), 4 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/arch/arm64/include/asm/vmalloc.h 
>>>>>>>> b/arch/arm64/include/asm/vmalloc.h
>>>>>>>> index 38fafffe699f..28b7173d8693 100644
>>>>>>>> --- a/arch/arm64/include/asm/vmalloc.h
>>>>>>>> +++ b/arch/arm64/include/asm/vmalloc.h
>>>>>>>> @@ -12,15 +12,13 @@ static inline bool 
>>>>>>>> arch_vmap_pud_supported(pgprot_t prot)
>>>>>>>>        /*
>>>>>>>>         * SW table walks can't handle removal of intermediate 
>>>>>>>> entries.
>>>>>>>>         */
>>>>>>>> -    return pud_sect_supported() &&
>>>>>>>> -           !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>>>>>>>> +    return pud_sect_supported();
>>>>>>>>    }
>>>>>>>>      #define arch_vmap_pmd_supported arch_vmap_pmd_supported
>>>>>>>>    static inline bool arch_vmap_pmd_supported(pgprot_t prot)
>>>>>>>>    {
>>>>>>>> -    /* See arch_vmap_pud_supported() */
>>>>>>>> -    return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>>>>>>>> +    return true;
>>>>>>>>    }
>>>>>>>>      #endif
>>>>>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>>>>>> index ea6695d53fb9..798cebd9e147 100644
>>>>>>>> --- a/arch/arm64/mm/mmu.c
>>>>>>>> +++ b/arch/arm64/mm/mmu.c
>>>>>>>> @@ -1261,7 +1261,11 @@ int pmd_free_pte_page(pmd_t *pmdp, 
>>>>>>>> unsigned long addr)
>>>>>>>>        }
>>>>>>>>          table = pte_offset_kernel(pmdp, addr);
>>>>>>>> +
>>>>>>>> +    /* Synchronize against ptdump_walk_pgd() */
>>>>>>>> +    mmap_read_lock(&init_mm);
>>>>>>>>        pmd_clear(pmdp);
>>>>>>>> +    mmap_read_unlock(&init_mm);
>>>>>>> So this works because ptdump_walk_pgd() takes the write_lock 
>>>>>>> (which is mutually
>>>>>>> exclusive with any read_lock holders) for the duration of the 
>>>>>>> table walk, so it
>>>>>>> will either consistently see the pgtables before or after this 
>>>>>>> removal. It will
>>>>>>> never disappear during the walk, correct?
>>>>>>>
>>>>>>> I guess there is a risk of this showing up as contention with 
>>>>>>> other init_mm
>>>>>>> write_lock holders. But I expect that 
>>>>>>> pmd_free_pte_page()/pud_free_pmd_page()
>>>>>>> are called sufficiently rarely that the risk is very small. 
>>>>>>> Let's fix any perf
>>>>>>> problem if/when we see it.
>>>>>> We can avoid all of that by my initial approach - to wrap the 
>>>>>> lock around
>>>>>> CONFIG_PTDUMP_DEBUGFS.
>>>>>> I don't have a strong opinion, just putting it out there.
>>>>> (I wrote then failed to send earlier):
>>>>>
>>>>> It's ugly though. Personally I'd prefer to keep it simple unless 
>>>>> we have clear
>>>>> evidence that its needed. I was just laying out my justification 
>>>>> for not needing
>>>>> to doing the conditional wrapping in this comment.
>>>> I really don't think we should be adding unconditional locking 
>>>> overhead
>>>> to core mm routines purely to facilitate a rarely used debug option.
>>>>
>>>> Instead, can we either adopt something like the RCU-like walk used by
>>>> fast GUP or stick the locking behind a static key that's only enabled
>>>> when a ptdump walk is in progress (a bit like how
>>>> hugetlb_vmemmap_optimize_folio() manipulates 
>>>> hugetlb_optimize_vmemmap_key)?
>>> My sense is that the static key will be less effort and can be 
>>> contained fully
>>> in arm64. I think we would need to enable the key around the call to
>>> ptdump_walk_pgd() in both ptdump_walk() and ptdump_check_wx(). Then 
>>> where Dev is
>>> currently taking the read lock, that would be contingent on the key 
>>> being
>>> enabled and the unlock would be contingent on having taken the lock.
>>>
>>> Does that sound like an acceptable approach?
>> Yup, and I think you'll probably need something like a synchronize_rcu()
>> when flipping the key to deal with any pre-existing page-table freers.
>
> IIUC, you mean to say that we need to ensure any existing readers having
> a reference to the isolated table in pmd_free_pte_page and friend, 
> have exited.
> But the problem is that we take an mmap_write_lock() around 
> ptdump_walk_pgd() so
> this is a sleepable code path.

The mmap_write_lock around ptdump_walk_pgd() is definitely restrictive. I think
that was put in because ptdump is rarely used. We could instead have done
RCU-freeing of pagetables to synchronize between ptdump and vmalloc lazy
freeing / hotunplug pagetable freeing.


>
>>
>> Will
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] arm64: Enable vmalloc-huge with ptdump
  2025-06-05  8:16               ` Dev Jain
@ 2025-06-11  9:33                 ` Will Deacon
  2025-06-11 12:18                   ` Dev Jain
  0 siblings, 1 reply; 16+ messages in thread
From: Will Deacon @ 2025-06-11  9:33 UTC (permalink / raw)
  To: Dev Jain
  Cc: Ryan Roberts, catalin.marinas, anshuman.khandual, quic_zhenhuah,
	kevin.brodsky, yangyicong, joey.gouly, linux-arm-kernel,
	linux-kernel, david

On Thu, Jun 05, 2025 at 01:46:01PM +0530, Dev Jain wrote:
> On 05/06/25 10:18 am, Dev Jain wrote:
> > On 30/05/25 7:06 pm, Will Deacon wrote:
> > > On Fri, May 30, 2025 at 02:11:36PM +0100, Ryan Roberts wrote:
> > > > On 30/05/2025 13:35, Will Deacon wrote:
> > > > > I really don't think we should be adding unconditional
> > > > > locking overhead
> > > > > to core mm routines purely to facilitate a rarely used debug option.
> > > > > 
> > > > > Instead, can we either adopt something like the RCU-like walk used by
> > > > > fast GUP or stick the locking behind a static key that's only enabled
> > > > > when a ptdump walk is in progress (a bit like how
> > > > > hugetlb_vmemmap_optimize_folio() manipulates
> > > > > hugetlb_optimize_vmemmap_key)?
> > > > My sense is that the static key will be less effort and can be
> > > > contained fully
> > > > in arm64. I think we would need to enable the key around the call to
> > > > ptdump_walk_pgd() in both ptdump_walk() and ptdump_check_wx().
> > > > Then where Dev is
> > > > currently taking the read lock, that would be contingent on the
> > > > key being
> > > > enabled and the unlock would be contingent on having taken the lock.
> > > > 
> > > > Does that sound like an acceptable approach?
> > > Yup, and I think you'll probably need something like a synchronize_rcu()
> > > when flipping the key to deal with any pre-existing page-table freers.
> > 
> > IIUC, you mean to say that we need to ensure any existing readers having
> > a reference to the isolated table in pmd_free_pte_page and friend, have
> > exited.
> > But the problem is that we take an mmap_write_lock() around
> > ptdump_walk_pgd() so
> > this is a sleepable code path.
> 
> The mmap_write_lock around ptdump_walk_pgd() is definitely restrictive. I
> think that
> 
> was put because ptdump is rarely used, I think we could have done
> RCU-freeing of pagetables to
> 
> synchronize between ptdump and vmalloc pagetable lazy freeing/ hotunplug
> pagetable freeing.

The other idea was to use a static key like the HVO code does, which
shouldn't place any RCU requirements on the debug walk.

Will


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] arm64: Enable vmalloc-huge with ptdump
  2025-06-11  9:33                 ` Will Deacon
@ 2025-06-11 12:18                   ` Dev Jain
  0 siblings, 0 replies; 16+ messages in thread
From: Dev Jain @ 2025-06-11 12:18 UTC (permalink / raw)
  To: Will Deacon
  Cc: Ryan Roberts, catalin.marinas, anshuman.khandual, quic_zhenhuah,
	kevin.brodsky, yangyicong, joey.gouly, linux-arm-kernel,
	linux-kernel, david


On 11/06/25 3:03 pm, Will Deacon wrote:
> On Thu, Jun 05, 2025 at 01:46:01PM +0530, Dev Jain wrote:
>> On 05/06/25 10:18 am, Dev Jain wrote:
>>> On 30/05/25 7:06 pm, Will Deacon wrote:
>>>> On Fri, May 30, 2025 at 02:11:36PM +0100, Ryan Roberts wrote:
>>>>> On 30/05/2025 13:35, Will Deacon wrote:
>>>>>> I really don't think we should be adding unconditional
>>>>>> locking overhead
>>>>>> to core mm routines purely to facilitate a rarely used debug option.
>>>>>>
>>>>>> Instead, can we either adopt something like the RCU-like walk used by
>>>>>> fast GUP or stick the locking behind a static key that's only enabled
>>>>>> when a ptdump walk is in progress (a bit like how
>>>>>> hugetlb_vmemmap_optimize_folio() manipulates
>>>>>> hugetlb_optimize_vmemmap_key)?
>>>>> My sense is that the static key will be less effort and can be
>>>>> contained fully
>>>>> in arm64. I think we would need to enable the key around the call to
>>>>> ptdump_walk_pgd() in both ptdump_walk() and ptdump_check_wx().
>>>>> Then where Dev is
>>>>> currently taking the read lock, that would be contingent on the
>>>>> key being
>>>>> enabled and the unlock would be contingent on having taken the lock.
>>>>>
>>>>> Does that sound like an acceptable approach?
>>>> Yup, and I think you'll probably need something like a synchronize_rcu()
>>>> when flipping the key to deal with any pre-existing page-table freers.
>>> IIUC, you mean to say that we need to ensure any existing readers having
>>> a reference to the isolated table in pmd_free_pte_page and friend, have
>>> exited.
>>> But the problem is that we take an mmap_write_lock() around
>>> ptdump_walk_pgd() so
>>> this is a sleepable code path.
>> The mmap_write_lock around ptdump_walk_pgd() is definitely restrictive. I
>> think that
>>
>> was put because ptdump is rarely used, I think we could have done
>> RCU-freeing of pagetables to
>>
>> synchronize between ptdump and vmalloc pagetable lazy freeing/ hotunplug
>> pagetable freeing.
> The other idea was to use a static key like the HVO code does, which
> shouldn't place any RCU requirements on the debug walk.

Thanks for your suggestion. I shall look into this.

>
> Will


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2025-06-11 15:32 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-30  8:20 [PATCH] arm64: Enable vmalloc-huge with ptdump Dev Jain
2025-05-30  8:28 ` Dev Jain
2025-05-30  8:40 ` Ryan Roberts
2025-05-30  9:07   ` Anshuman Khandual
2025-05-30  9:14   ` Dev Jain
2025-05-30  9:47     ` Ryan Roberts
2025-05-30  9:57       ` Dev Jain
2025-05-30 11:50     ` Ryan Roberts
2025-05-30 12:35       ` Will Deacon
2025-05-30 13:11         ` Ryan Roberts
2025-05-30 13:36           ` Will Deacon
2025-05-30 14:07             ` Ryan Roberts
2025-06-05  4:48             ` Dev Jain
2025-06-05  8:16               ` Dev Jain
2025-06-11  9:33                 ` Will Deacon
2025-06-11 12:18                   ` Dev Jain

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox