* [PATCH v4] arm64: Enable vmalloc-huge with ptdump
@ 2025-06-26  5:25 Dev Jain
  2025-07-04  5:05 ` Dev Jain
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Dev Jain @ 2025-06-26  5:25 UTC (permalink / raw)
  To: catalin.marinas, will
  Cc: anshuman.khandual, quic_zhenhuah, ryan.roberts, kevin.brodsky,
	yangyicong, joey.gouly, linux-arm-kernel, linux-kernel, david,
	Dev Jain

arm64 disables vmalloc-huge when kernel page table dumping is enabled,
because an intermediate table may be removed, potentially causing the
ptdump code to dereference an invalid address. We want to be able to
analyze block vs page mappings for kernel mappings with ptdump, so to
enable vmalloc-huge with ptdump, synchronize between page table removal in
pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
use mmap_read_lock and not write lock because we don't need to synchronize
between two different vm_structs; two vmalloc objects running this same
code path will point to different page tables, hence there is no race.

For pud_free_pmd_page(), we isolate the PMD table to avoid taking the lock
512 times again via pmd_free_pte_page().

We implement the locking mechanism using static keys, since the chance
of a race is very small. Observe that the synchronization is needed
to avoid the following race:

CPU1							CPU2
						take reference of PMD table
pud_clear()
pte_free_kernel()
						walk freed PMD table

and similar race between pmd_free_pte_page and ptdump_walk_pgd.

Therefore, there are two cases: if ptdump sees the cleared PUD, then
we are safe. If not, then the patched-in read and write locks help us
avoid the race.

To implement the mechanism, we need to access the static key from both mmu.c
and ptdump.c. Note that with !CONFIG_PTDUMP_DEBUGFS, ptdump.o won't be a
target in the Makefile, so we cannot initialize the key there, as is done,
for example, in the static key implementation of hugetlb-vmemmap. Therefore,
include asm/cpufeature.h, which pulls in the jump_label mechanism; declare
the key in asm/ptdump.h and define it to false in mmu.c.
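
Reduced to its essence, the handshake is the following (a sketch only,
summarizing the patch below; ptdump_walk_pgd() takes mmap_write_lock()
internally in the generic ptdump code):

	/* Freeing side: pud_free_pmd_page()/__pmd_free_pte_page() */
	pud_clear(pudp);
	__flush_tlb_kernel_pgtable(addr);	/* contains a DSB */
	if (static_branch_unlikely(&ptdump_lock_key)) {
		mmap_read_lock(&init_mm);	/* drain a concurrent walk */
		mmap_read_unlock(&init_mm);
	}
	pmd_free(NULL, table);

	/* Dumping side: ptdump_walk()/ptdump_check_wx() */
	static_branch_enable(&ptdump_lock_key);
	ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
	static_branch_disable(&ptdump_lock_key);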

No issues were observed with mm-selftests. No issues were observed while
running test_vmalloc.sh in parallel with dumping the kernel pagetable
through sysfs in a loop.

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
v3->v4:
 - Lock-unlock immediately
 - Simplify includes

v2->v3:
 - Use static key mechanism

v1->v2:
 - Take lock only when CONFIG_PTDUMP_DEBUGFS is on
 - In case of pud_free_pmd_page(), isolate the PMD table to avoid taking
   the lock 512 times again via pmd_free_pte_page()

 arch/arm64/include/asm/ptdump.h |  2 ++
 arch/arm64/mm/mmu.c             | 44 ++++++++++++++++++++++++++++++---
 arch/arm64/mm/ptdump.c          |  5 +++-
 3 files changed, 46 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h
index fded5358641f..5b331f2a7be1 100644
--- a/arch/arm64/include/asm/ptdump.h
+++ b/arch/arm64/include/asm/ptdump.h
@@ -7,6 +7,8 @@
 
 #include <linux/ptdump.h>
 
+DECLARE_STATIC_KEY_FALSE(ptdump_lock_key);
+
 #ifdef CONFIG_PTDUMP
 
 #include <linux/mm_types.h>
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 00ab1d648db6..9d3be249047c 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -46,6 +46,8 @@
 #define NO_CONT_MAPPINGS	BIT(1)
 #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
 
+DEFINE_STATIC_KEY_FALSE(ptdump_lock_key);
+
 enum pgtable_type {
 	TABLE_PTE,
 	TABLE_PMD,
@@ -1267,7 +1269,7 @@ int pmd_clear_huge(pmd_t *pmdp)
 	return 1;
 }
 
-int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
+static int __pmd_free_pte_page(pmd_t *pmdp, unsigned long addr, bool lock)
 {
 	pte_t *table;
 	pmd_t pmd;
@@ -1279,13 +1281,24 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
 		return 1;
 	}
 
+	/* See comment in pud_free_pmd_page for static key logic */
 	table = pte_offset_kernel(pmdp, addr);
 	pmd_clear(pmdp);
 	__flush_tlb_kernel_pgtable(addr);
+	if (static_branch_unlikely(&ptdump_lock_key) && lock) {
+		mmap_read_lock(&init_mm);
+		mmap_read_unlock(&init_mm);
+	}
+
 	pte_free_kernel(NULL, table);
 	return 1;
 }
 
+int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
+{
+	return __pmd_free_pte_page(pmdp, addr, true);
+}
+
 int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
 {
 	pmd_t *table;
@@ -1301,16 +1314,39 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
 	}
 
 	table = pmd_offset(pudp, addr);
+	/*
+	 * Isolate the PMD table; in case of race with ptdump, this helps
+	 * us to avoid taking the lock in __pmd_free_pte_page().
+	 *
+	 * Static key logic:
+	 *
+	 * Case 1: If ptdump does static_branch_enable(), and after that we
+	 * execute the if block, then this patches in the read lock, ptdump has
+	 * the write lock patched in, therefore ptdump will never read from
+	 * a potentially freed PMD table.
+	 *
+	 * Case 2: If the if block starts executing before ptdump's
+	 * static_branch_enable(), then no locking synchronization
+	 * will be done. However, pud_clear() + the dsb() in
+	 * __flush_tlb_kernel_pgtable will ensure that ptdump observes an
+	 * empty PUD. Thus, it will never walk over a potentially freed
+	 * PMD table.
+	 */
+	pud_clear(pudp);
+	__flush_tlb_kernel_pgtable(addr);
+	if (static_branch_unlikely(&ptdump_lock_key)) {
+		mmap_read_lock(&init_mm);
+		mmap_read_unlock(&init_mm);
+	}
+
 	pmdp = table;
 	next = addr;
 	end = addr + PUD_SIZE;
 	do {
 		if (pmd_present(pmdp_get(pmdp)))
-			pmd_free_pte_page(pmdp, next);
+			__pmd_free_pte_page(pmdp, next, false);
 	} while (pmdp++, next += PMD_SIZE, next != end);
 
-	pud_clear(pudp);
-	__flush_tlb_kernel_pgtable(addr);
 	pmd_free(NULL, table);
 	return 1;
 }
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index 421a5de806c6..41c9ea61813b 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -25,7 +25,6 @@
 #include <asm/pgtable-hwdef.h>
 #include <asm/ptdump.h>
 
-
 #define pt_dump_seq_printf(m, fmt, args...)	\
 ({						\
 	if (m)					\
@@ -311,7 +310,9 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
 		}
 	};
 
+	static_branch_enable(&ptdump_lock_key);
 	ptdump_walk_pgd(&st.ptdump, info->mm, NULL);
+	static_branch_disable(&ptdump_lock_key);
 }
 
 static void __init ptdump_initialize(void)
@@ -353,7 +354,9 @@ bool ptdump_check_wx(void)
 		}
 	};
 
+	static_branch_enable(&ptdump_lock_key);
 	ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
+	static_branch_disable(&ptdump_lock_key);
 
 	if (st.wx_pages || st.uxn_pages) {
 		pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found, %lu non-UXN pages found\n",
-- 
2.30.2



* Re: [PATCH v4] arm64: Enable vmalloc-huge with ptdump
  2025-06-26  5:25 [PATCH v4] arm64: Enable vmalloc-huge with ptdump Dev Jain
@ 2025-07-04  5:05 ` Dev Jain
  2025-07-04  5:58 ` Anshuman Khandual
  2025-07-04 11:22 ` Catalin Marinas
  2 siblings, 0 replies; 9+ messages in thread
From: Dev Jain @ 2025-07-04  5:05 UTC (permalink / raw)
  To: catalin.marinas, will
  Cc: anshuman.khandual, quic_zhenhuah, ryan.roberts, kevin.brodsky,
	yangyicong, joey.gouly, linux-arm-kernel, linux-kernel, david


On 26/06/25 10:55 am, Dev Jain wrote:
> arm64 disables vmalloc-huge when kernel page table dumping is enabled,
> because an intermediate table may be removed, potentially causing the
> ptdump code to dereference an invalid address. We want to be able to
> analyze block vs page mappings for kernel mappings with ptdump, so to
> enable vmalloc-huge with ptdump, synchronize between page table removal in
> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
> use mmap_read_lock and not write lock because we don't need to synchronize
> between two different vm_structs; two vmalloc objects running this same
> code path will point to different page tables, hence there is no race.
>
> For pud_free_pmd_page(), we isolate the PMD table to avoid taking the lock
> 512 times again via pmd_free_pte_page().
>
> We implement the locking mechanism using static keys, since the chance
> of a race is very small. Observe that the synchronization is needed
> to avoid the following race:
>
> CPU1							CPU2
> 						take reference of PMD table
> pud_clear()
> pte_free_kernel()
> 						walk freed PMD table
>
> and similar race between pmd_free_pte_page and ptdump_walk_pgd.
>
> Therefore, there are two cases: if ptdump sees the cleared PUD, then
> we are safe. If not, then the patched-in read and write locks help us
> avoid the race.
>
> To implement the mechanism, we need the static key access from mmu.c and
> ptdump.c. Note that in case !CONFIG_PTDUMP_DEBUGFS, ptdump.o won't be a
> target in the Makefile, therefore we cannot initialize the key there, as
> is being done, for example, in the static key implementation of
> hugetlb-vmemmap. Therefore, include asm/cpufeature.h, which includes
> the jump_label mechanism. Declare the key there and define the key to false
> in mmu.c.
>
> No issues were observed with mm-selftests. No issues were observed while
> parallelly running test_vmalloc.sh and dumping the kernel pagetable through
> sysfs in a loop.
>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---

Sorry, I observed that post-v3 I forgot to add the change wherein
I drop the PTDUMP condition from arch_vmap_pxd_supported; I'll add
that back in the next version. Thanks, Anshuman, for the spot.



* Re: [PATCH v4] arm64: Enable vmalloc-huge with ptdump
  2025-06-26  5:25 [PATCH v4] arm64: Enable vmalloc-huge with ptdump Dev Jain
  2025-07-04  5:05 ` Dev Jain
@ 2025-07-04  5:58 ` Anshuman Khandual
  2025-07-04  6:37   ` Dev Jain
  2025-07-04 11:22 ` Catalin Marinas
  2 siblings, 1 reply; 9+ messages in thread
From: Anshuman Khandual @ 2025-07-04  5:58 UTC (permalink / raw)
  To: Dev Jain, catalin.marinas, will
  Cc: quic_zhenhuah, ryan.roberts, kevin.brodsky, yangyicong,
	joey.gouly, linux-arm-kernel, linux-kernel, david



On 26/06/25 10:55 AM, Dev Jain wrote:
> arm64 disables vmalloc-huge when kernel page table dumping is enabled,
> because an intermediate table may be removed, potentially causing the
> ptdump code to dereference an invalid address. We want to be able to

Please keep this paragraph separate, explaining the current scenario. It
might be worth adding a bit more detail, such as the CONFIG_PTDUMP_DEBUGFS
config dependency in vmalloc()'s arch callbacks arch_vmap_pud|pmd_supported().

> analyze block vs page mappings for kernel mappings with ptdump, so to
> enable vmalloc-huge with ptdump, synchronize between page table removal in
> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We

Please keep this paragraph separate explaining what is the benefit of
enabling ptdump with huge vmalloc() mappings. This will establish the
required motivation to change the current scenario.

> use mmap_read_lock and not write lock because we don't need to synchronize
> between two different vm_structs; two vmalloc objects running this same
> code path will point to different page tables, hence there is no race.

Different page tables? Or different areas in the same page table, i.e. init_mm?

> 
> For pud_free_pmd_page(), we isolate the PMD table to avoid taking the lock
> 512 times again via pmd_free_pte_page().

This talks about a subsequent optimization without establishing the base
solution first.

>
> We implement the locking mechanism using static keys, since the chance
> of a race is very small. Observe that the synchronization is needed
> to avoid the following race:
> 
> CPU1							CPU2
> 						take reference of PMD table
> pud_clear()
> pte_free_kernel()
> 						walk freed PMD table
> 
> and similar race between pmd_free_pte_page and ptdump_walk_pgd.
> 
> Therefore, there are two cases: if ptdump sees the cleared PUD, then
> we are safe. If not, then the patched-in read and write locks help us
> avoid the race.
> 
> To implement the mechanism, we need the static key access from mmu.c and
> ptdump.c. Note that in case !CONFIG_PTDUMP_DEBUGFS, ptdump.o won't be a
> target in the Makefile, therefore we cannot initialize the key there, as
> is being done, for example, in the static key implementation of
> hugetlb-vmemmap. Therefore, include asm/cpufeature.h, which includes
> the jump_label mechanism. Declare the key there and define the key to false
> in mmu.c.

The above write-up is convoluted and not very clear. Please rewrite the
solution description, avoiding implementation and code-file details that
can be derived from the patch itself.

> 
> No issues were observed with mm-selftests. No issues were observed while
> parallelly running test_vmalloc.sh and dumping the kernel pagetable through
> sysfs in a loop.

Please provide some more details about test_vmalloc.sh, which seems to be a
custom script.

> 
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> v3->v4:
>  - Lock-unlock immediately
>  - Simplify includes
> 
> v2->v3:
>  - Use static key mechanism
> 
> v1->v2:
>  - Take lock only when CONFIG_PTDUMP_DEBUGFS is on
>  - In case of pud_free_pmd_page(), isolate the PMD table to avoid taking
>    the lock 512 times again via pmd_free_pte_page()
> 
>  arch/arm64/include/asm/ptdump.h |  2 ++
>  arch/arm64/mm/mmu.c             | 44 ++++++++++++++++++++++++++++++---
>  arch/arm64/mm/ptdump.c          |  5 +++-
>  3 files changed, 46 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h
> index fded5358641f..5b331f2a7be1 100644
> --- a/arch/arm64/include/asm/ptdump.h
> +++ b/arch/arm64/include/asm/ptdump.h
> @@ -7,6 +7,8 @@
>  
>  #include <linux/ptdump.h>
>  
> +DECLARE_STATIC_KEY_FALSE(ptdump_lock_key);
> +
>  #ifdef CONFIG_PTDUMP
>  
>  #include <linux/mm_types.h>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 00ab1d648db6..9d3be249047c 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -46,6 +46,8 @@
>  #define NO_CONT_MAPPINGS	BIT(1)
>  #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
>  
> +DEFINE_STATIC_KEY_FALSE(ptdump_lock_key);
> +

ptdump_lock_key sounds too generic, even though this locking
requirement is currently arm64 platform specific. Otherwise
this might just appear as locking from the generic ptdump code.
Could this be renamed to arm64_ptdump_lock_key instead?

>  enum pgtable_type {
>  	TABLE_PTE,
>  	TABLE_PMD,
> @@ -1267,7 +1269,7 @@ int pmd_clear_huge(pmd_t *pmdp)
>  	return 1;
>  }
>  
> -int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
> +static int __pmd_free_pte_page(pmd_t *pmdp, unsigned long addr, bool lock)
>  {

'lock' argument needs explanation here in a comment.

>  	pte_t *table;
>  	pmd_t pmd;
> @@ -1279,13 +1281,24 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>  		return 1;
>  	}
>  
> +	/* See comment in pud_free_pmd_page for static key logic */
>  	table = pte_offset_kernel(pmdp, addr);
>  	pmd_clear(pmdp);
>  	__flush_tlb_kernel_pgtable(addr);

Or here ?

> +	if (static_branch_unlikely(&ptdump_lock_key) && lock) {
> +		mmap_read_lock(&init_mm);
> +		mmap_read_unlock(&init_mm);
> +	}
> +
>  	pte_free_kernel(NULL, table);
>  	return 1;
>  }
>  
> +int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
> +{
> +	return __pmd_free_pte_page(pmdp, addr, true);
> +}
> +
>  int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>  {
>  	pmd_t *table;
> @@ -1301,16 +1314,39 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>  	}
>  
>  	table = pmd_offset(pudp, addr);
> +	/*
> +	 * Isolate the PMD table; in case of race with ptdump, this helps
> +	 * us to avoid taking the lock in __pmd_free_pte_page().
> +	 *
> +	 * Static key logic:
> +	 *
> +	 * Case 1: If ptdump does static_branch_enable(), and after that we
> +	 * execute the if block, then this patches in the read lock, ptdump has
> +	 * the write lock patched in, therefore ptdump will never read from
> +	 * a potentially freed PMD table.
> +	 *
> +	 * Case 2: If the if block starts executing before ptdump's
> +	 * static_branch_enable(), then no locking synchronization
> +	 * will be done. However, pud_clear() + the dsb() in
> +	 * __flush_tlb_kernel_pgtable will ensure that ptdump observes an
> +	 * empty PUD. Thus, it will never walk over a potentially freed
> +	 * PMD table.
> +	 */
> +	pud_clear(pudp);
> +	__flush_tlb_kernel_pgtable(addr);
> +	if (static_branch_unlikely(&ptdump_lock_key)) {

		/*
		 * case 1 - ptdump comes first
		 *
		 * mmap_read_lock() here will wait on mmap_write_lock()
		 * taken in generic ptdump until it has been released.
		 */

> +		mmap_read_lock(&init_mm);
> +		mmap_read_unlock(&init_mm);
> +	}
	} else {
		/*
		 * case 2 - pud_free_pmd_page() comes first
		 *
		 * pud_clear() and __flush_tlb_kernel_pgtable() are
		 * sufficient for ptdump to observe an empty PUD.
		 */
	}

> +
>  	pmdp = table;
>  	next = addr;
>  	end = addr + PUD_SIZE;
>  	do {
>  		if (pmd_present(pmdp_get(pmdp)))
> -			pmd_free_pte_page(pmdp, next);
> +			__pmd_free_pte_page(pmdp, next, false);
>  	} while (pmdp++, next += PMD_SIZE, next != end);
>  
> -	pud_clear(pudp);
> -	__flush_tlb_kernel_pgtable(addr);
>  	pmd_free(NULL, table);
>  	return 1;
>  }
> diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
> index 421a5de806c6..41c9ea61813b 100644
> --- a/arch/arm64/mm/ptdump.c
> +++ b/arch/arm64/mm/ptdump.c
> @@ -25,7 +25,6 @@
>  #include <asm/pgtable-hwdef.h>
>  #include <asm/ptdump.h>
>  
> -

Stray change. Please drop it.

>  #define pt_dump_seq_printf(m, fmt, args...)	\
>  ({						\
>  	if (m)					\
> @@ -311,7 +310,9 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
>  		}
>  	};
>  
> +	static_branch_enable(&ptdump_lock_key);
>  	ptdump_walk_pgd(&st.ptdump, info->mm, NULL);
> +	static_branch_disable(&ptdump_lock_key);
>  }
>  
>  static void __init ptdump_initialize(void)
> @@ -353,7 +354,9 @@ bool ptdump_check_wx(void)
>  		}
>  	};
>  
> +	static_branch_enable(&ptdump_lock_key);
>  	ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
> +	static_branch_disable(&ptdump_lock_key);
>  
>  	if (st.wx_pages || st.uxn_pages) {
>  		pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found, %lu non-UXN pages found\n",

Although tempting, I guess encapsulating the above ptdump_walk_pgd()
locking sequence inside a new helper arm64_ptdump_walk_pgd(), while
also explaining this locking, might be worth it?
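
Something along these lines, perhaps (a sketch only; the helper name is
the one proposed above, not existing code):

static void arm64_ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm)
{
	/*
	 * Patch in the read-side lock/unlock in the pgtable freeing
	 * paths for the duration of the walk; the walk itself takes
	 * mmap_write_lock(mm) in the generic ptdump code.
	 */
	static_branch_enable(&ptdump_lock_key);
	ptdump_walk_pgd(st, mm, NULL);
	static_branch_disable(&ptdump_lock_key);
}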

PUD/PMD vmalloc mappings need to be enabled via these arch callback
changes, dropping the current CONFIG_PTDUMP_DEBUGFS dependency:

#define arch_vmap_pud_supported arch_vmap_pud_supported
static inline bool arch_vmap_pud_supported(pgprot_t prot)
{
	return pud_sect_supported();
}

#define arch_vmap_pmd_supported arch_vmap_pmd_supported
static inline bool arch_vmap_pmd_supported(pgprot_t prot)
{
	return true;
}


* Re: [PATCH v4] arm64: Enable vmalloc-huge with ptdump
  2025-07-04  5:58 ` Anshuman Khandual
@ 2025-07-04  6:37   ` Dev Jain
  0 siblings, 0 replies; 9+ messages in thread
From: Dev Jain @ 2025-07-04  6:37 UTC (permalink / raw)
  To: Anshuman Khandual, catalin.marinas, will
  Cc: quic_zhenhuah, ryan.roberts, kevin.brodsky, yangyicong,
	joey.gouly, linux-arm-kernel, linux-kernel, david


On 04/07/25 11:28 am, Anshuman Khandual wrote:
>
> On 26/06/25 10:55 AM, Dev Jain wrote:
>> arm64 disables vmalloc-huge when kernel page table dumping is enabled,
>> because an intermediate table may be removed, potentially causing the
>> ptdump code to dereference an invalid address. We want to be able to
> Please keep this paragraph separate explaining the current scenario. It
> might be worth adding bit more details such CONFIG_PTDUMP_DEBUGFS config
> dependency in vmalloc()'s arch call backs arch_vmap_pud|pmd_supported().
>   > analyze block vs page mappings for kernel mappings with ptdump, so to
>> enable vmalloc-huge with ptdump, synchronize between page table removal in
>> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
> Please keep this paragraph separate explaining what is the benefit of
> enabling ptdump with huge vmalloc() mappings. This will establish the
> required motivation to change the current scenario.
>
>> use mmap_read_lock and not write lock because we don't need to synchronize
>> between two different vm_structs; two vmalloc objects running this same
>> code path will point to different page tables, hence there is no race.
> Different page tables ? OR different areas in same page table i.e init_mm.

Yes, different areas; I will clarify that.

>
>> For pud_free_pmd_page(), we isolate the PMD table to avoid taking the lock
>> 512 times again via pmd_free_pte_page().
> This talks about subsequent optimization without establishing the base
> solution first.
>   >
>> We implement the locking mechanism using static keys, since the chance
>> of a race is very small. Observe that the synchronization is needed
>> to avoid the following race:
>>
>> CPU1							CPU2
>> 						take reference of PMD table
>> pud_clear()
>> pte_free_kernel()
>> 						walk freed PMD table
>>
>> and similar race between pmd_free_pte_page and ptdump_walk_pgd.
>>
>> Therefore, there are two cases: if ptdump sees the cleared PUD, then
>> we are safe. If not, then the patched-in read and write locks help us
>> avoid the race.
>>
>> To implement the mechanism, we need the static key access from mmu.c and
>> ptdump.c. Note that in case !CONFIG_PTDUMP_DEBUGFS, ptdump.o won't be a
>> target in the Makefile, therefore we cannot initialize the key there, as
>> is being done, for example, in the static key implementation of
>> hugetlb-vmemmap. Therefore, include asm/cpufeature.h, which includes
>> the jump_label mechanism. Declare the key there and define the key to false
>> in mmu.c.
> Above write up is convoluted and not very clear. Please rewrite the solution
> description while avoiding too much implementation and code file details that
> can be derived from the patch itself.

Okay.

>
>> No issues were observed with mm-selftests. No issues were observed while
>> parallelly running test_vmalloc.sh and dumping the kernel pagetable through
>> sysfs in a loop.
> Please provide some more details about test_vmallo.sh which seems to be a
> custom script.

What do you mean by "custom script"? It is part of selftests/mm, and
explaining what a selftest is doing in the patch description, when the
name (test_vmalloc) makes it pretty clear, seems redundant.

>
>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> v3->v4:
>>   - Lock-unlock immediately
>>   - Simplify includes
>>
>> v2->v3:
>>   - Use static key mechanism
>>
>> v1->v2:
>>   - Take lock only when CONFIG_PTDUMP_DEBUGFS is on
>>   - In case of pud_free_pmd_page(), isolate the PMD table to avoid taking
>>     the lock 512 times again via pmd_free_pte_page()
>>
>>   arch/arm64/include/asm/ptdump.h |  2 ++
>>   arch/arm64/mm/mmu.c             | 44 ++++++++++++++++++++++++++++++---
>>   arch/arm64/mm/ptdump.c          |  5 +++-
>>   3 files changed, 46 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h
>> index fded5358641f..5b331f2a7be1 100644
>> --- a/arch/arm64/include/asm/ptdump.h
>> +++ b/arch/arm64/include/asm/ptdump.h
>> @@ -7,6 +7,8 @@
>>   
>>   #include <linux/ptdump.h>
>>   
>> +DECLARE_STATIC_KEY_FALSE(ptdump_lock_key);
>> +
>>   #ifdef CONFIG_PTDUMP
>>   
>>   #include <linux/mm_types.h>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 00ab1d648db6..9d3be249047c 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -46,6 +46,8 @@
>>   #define NO_CONT_MAPPINGS	BIT(1)
>>   #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
>>   
>> +DEFINE_STATIC_KEY_FALSE(ptdump_lock_key);
>> +

> ptdump_lock_key sounds too generic, even though this locking
> requirement is currently arm64 platform specific. Otherwise
> this might just appear as locking from the generic ptdump code.
> Could this be renamed to arm64_ptdump_lock_key instead?

Sure.

>
>>   enum pgtable_type {
>>   	TABLE_PTE,
>>   	TABLE_PMD,
>> @@ -1267,7 +1269,7 @@ int pmd_clear_huge(pmd_t *pmdp)
>>   	return 1;
>>   }
>>   
>> -int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>> +static int __pmd_free_pte_page(pmd_t *pmdp, unsigned long addr, bool lock)
>>   {
> 'lock' argument needs explanation here in a comment.

Sure.

>
>>   	pte_t *table;
>>   	pmd_t pmd;
>> @@ -1279,13 +1281,24 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>>   		return 1;
>>   	}
>>   
>> +	/* See comment in pud_free_pmd_page for static key logic */
>>   	table = pte_offset_kernel(pmdp, addr);
>>   	pmd_clear(pmdp);
>>   	__flush_tlb_kernel_pgtable(addr);
> Or here ?

I'll do it above the function declaration; the reader then won't
have to look for the comment.

>
>> +	if (static_branch_unlikely(&ptdump_lock_key) && lock) {
>> +		mmap_read_lock(&init_mm);
>> +		mmap_read_unlock(&init_mm);
>> +	}
>> +
>>   	pte_free_kernel(NULL, table);
>>   	return 1;
>>   }
>>   
>> +int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>> +{
>> +	return __pmd_free_pte_page(pmdp, addr, true);
>> +}
>> +
>>   int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>   {
>>   	pmd_t *table;
>> @@ -1301,16 +1314,39 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>   	}
>>   
>>   	table = pmd_offset(pudp, addr);
>> +	/*
>> +	 * Isolate the PMD table; in case of race with ptdump, this helps
>> +	 * us to avoid taking the lock in __pmd_free_pte_page().
>> +	 *
>> +	 * Static key logic:
>> +	 *
>> +	 * Case 1: If ptdump does static_branch_enable(), and after that we
>> +	 * execute the if block, then this patches in the read lock, ptdump has
>> +	 * the write lock patched in, therefore ptdump will never read from
>> +	 * a potentially freed PMD table.
>> +	 *
>> +	 * Case 2: If the if block starts executing before ptdump's
>> +	 * static_branch_enable(), then no locking synchronization
>> +	 * will be done. However, pud_clear() + the dsb() in
>> +	 * __flush_tlb_kernel_pgtable will ensure that ptdump observes an
>> +	 * empty PUD. Thus, it will never walk over a potentially freed
>> +	 * PMD table.
>> +	 */
>> +	pud_clear(pudp);
>> +	__flush_tlb_kernel_pgtable(addr);
>> +	if (static_branch_unlikely(&ptdump_lock_key)) {
> 		/* case 1 - ptdump comes first
> 		 *
> 		 * mmap_read_lock() here will wait on mmap_write_lock()
> 	 	 * taken in generic ptdump until it has been released.
> 		 */
>
>> +		mmap_read_lock(&init_mm);
>> +		mmap_read_unlock(&init_mm);
>> +	}
> 	} else {
> 		/* case 2 - pud_free_pmd_page() comes first
> 		 *
> 		 * pud_clear() and __flush_tlb_kernel_pgtable() are
> 		 * sufficient for ptdump to observe an empty PUD.
> 		 */
> 	}

>> +

Thanks.

>>   	pmdp = table;
>>   	next = addr;
>>   	end = addr + PUD_SIZE;
>>   	do {
>>   		if (pmd_present(pmdp_get(pmdp)))
>> -			pmd_free_pte_page(pmdp, next);
>> +			__pmd_free_pte_page(pmdp, next, false);
>>   	} while (pmdp++, next += PMD_SIZE, next != end);
>>   
>> -	pud_clear(pudp);
>> -	__flush_tlb_kernel_pgtable(addr);
>>   	pmd_free(NULL, table);
>>   	return 1;
>>   }
>> diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
>> index 421a5de806c6..41c9ea61813b 100644
>> --- a/arch/arm64/mm/ptdump.c
>> +++ b/arch/arm64/mm/ptdump.c
>> @@ -25,7 +25,6 @@
>>   #include <asm/pgtable-hwdef.h>
>>   #include <asm/ptdump.h>
>>   
>> -
> Stray change. Please drop it.

Thanks.

>
>>   #define pt_dump_seq_printf(m, fmt, args...)	\
>>   ({						\
>>   	if (m)					\
>> @@ -311,7 +310,9 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
>>   		}
>>   	};
>>   
>> +	static_branch_enable(&ptdump_lock_key);
>>   	ptdump_walk_pgd(&st.ptdump, info->mm, NULL);
>> +	static_branch_disable(&ptdump_lock_key);
>>   }
>>   
>>   static void __init ptdump_initialize(void)
>> @@ -353,7 +354,9 @@ bool ptdump_check_wx(void)
>>   		}
>>   	};
>>   
>> +	static_branch_enable(&ptdump_lock_key);
>>   	ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
>> +	static_branch_disable(&ptdump_lock_key);
>>   
>>   	if (st.wx_pages || st.uxn_pages) {
>>   		pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found, %lu non-UXN pages found\n",

> Although tempting, I guess encapsulating the above ptdump_walk_pgd()
> locking sequence inside a new helper arm64_ptdump_walk_pgd(), while
> also explaining this locking, might be worth it?

Okay.

>
> PUD/PMD vmalloc mappings need to be enabled via these arch callback
> changes via dropping the current CONFIG_PTDUMP_DEBUGFS dependency.
>
> #define arch_vmap_pud_supported arch_vmap_pud_supported
> static inline bool arch_vmap_pud_supported(pgprot_t prot)
> {
> 	return pud_sect_supported();
> }
>
> #define arch_vmap_pmd_supported arch_vmap_pmd_supported
> static inline bool arch_vmap_pmd_supported(pgprot_t prot)
> {
> 	return true;
> }

Thanks for the spot.



* Re: [PATCH v4] arm64: Enable vmalloc-huge with ptdump
  2025-06-26  5:25 [PATCH v4] arm64: Enable vmalloc-huge with ptdump Dev Jain
  2025-07-04  5:05 ` Dev Jain
  2025-07-04  5:58 ` Anshuman Khandual
@ 2025-07-04 11:22 ` Catalin Marinas
  2025-07-04 11:42   ` Dev Jain
  2 siblings, 1 reply; 9+ messages in thread
From: Catalin Marinas @ 2025-07-04 11:22 UTC (permalink / raw)
  To: Dev Jain
  Cc: will, anshuman.khandual, quic_zhenhuah, ryan.roberts,
	kevin.brodsky, yangyicong, joey.gouly, linux-arm-kernel,
	linux-kernel, david

On Thu, Jun 26, 2025 at 10:55:24AM +0530, Dev Jain wrote:
> @@ -1301,16 +1314,39 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>  	}
>  
>  	table = pmd_offset(pudp, addr);
> +	/*
> +	 * Isolate the PMD table; in case of race with ptdump, this helps
> +	 * us to avoid taking the lock in __pmd_free_pte_page().
> +	 *
> +	 * Static key logic:
> +	 *
> +	 * Case 1: If ptdump does static_branch_enable(), and after that we
> +	 * execute the if block, then this patches in the read lock, ptdump has
> +	 * the write lock patched in, therefore ptdump will never read from
> +	 * a potentially freed PMD table.
> +	 *
> +	 * Case 2: If the if block starts executing before ptdump's
> +	 * static_branch_enable(), then no locking synchronization
> +	 * will be done. However, pud_clear() + the dsb() in
> +	 * __flush_tlb_kernel_pgtable will ensure that ptdump observes an

Statements like "observes" really need to be relative, not absolute.
Like in "observes an empty PUD before/after ...".

> +	 * empty PUD. Thus, it will never walk over a potentially freed
> +	 * PMD table.
> +	 */
> +	pud_clear(pudp);
> +	__flush_tlb_kernel_pgtable(addr);
> +	if (static_branch_unlikely(&ptdump_lock_key)) {
> +		mmap_read_lock(&init_mm);
> +		mmap_read_unlock(&init_mm);
> +	}

This needs a formal model ;).

static_branch_enable() is called before the mmap_write_lock(), so even
if the above observes the new branch, it may do the read_unlock() before
the ptdump_walk_pgd() attempted the write lock. So your case 1
description is not entirely correct.

I don't get case 2. You want to ensure pud_clear() is observed by the
ptdump code before the pmd_free(). The DSB in the TLB flushing code
ensures some ordering between the pud_clear() and presumably something
that the ptdump code can observe as well. Would that be the mmap
semaphore? However, the read_lock would only be attempted if this code
is seeing the static branch update, which is not guaranteed. I don't
think it even matters since the lock may be released anyway before the
write_lock in ptdump.

For example, you do a pud_clear() above, skip the whole static branch.
The ptdump comes along on another CPU but does not observe the
pud_clear() since there's no synchronisation. It goes ahead and
dereferences it while this CPU does a pmd_free().

And I can't get my head around the memory ordering; it doesn't look sound.
static_branch_enable() I don't think has acquire semantics, at least not
in relation to the actual branch update. The static_branch_unlikely()
test, again, not sure what guarantees it has (I don't think it has any
in relation to memory loads). Maybe you have worked it all out and it is
fine, but it needs a better explanation and ideally some simple formal
model. Show it's correct with a global variable first and then we can
optimise with static branches.
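
Something like this, for instance (an illustrative sketch only, with a
plain variable and full barriers standing in for the static key
machinery):

	static bool ptdump_active;	/* stand-in for ptdump_lock_key */

	/* pud_free_pmd_page() side */
	pud_clear(pudp);
	__flush_tlb_kernel_pgtable(addr);
	smp_mb();	/* order the pud write before the flag read */
	if (READ_ONCE(ptdump_active)) {
		mmap_read_lock(&init_mm);
		mmap_read_unlock(&init_mm);
	}
	pmd_free(NULL, table);

	/* ptdump side */
	WRITE_ONCE(ptdump_active, true);
	smp_mb();	/* order the flag write before the pud reads in the walk */
	ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
	WRITE_ONCE(ptdump_active, false);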

-- 
Catalin


* Re: [PATCH v4] arm64: Enable vmalloc-huge with ptdump
  2025-07-04 11:22 ` Catalin Marinas
@ 2025-07-04 11:42   ` Dev Jain
  2025-07-07  0:44     ` Catalin Marinas
  0 siblings, 1 reply; 9+ messages in thread
From: Dev Jain @ 2025-07-04 11:42 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: will, anshuman.khandual, quic_zhenhuah, ryan.roberts,
	kevin.brodsky, yangyicong, joey.gouly, linux-arm-kernel,
	linux-kernel, david


On 04/07/25 4:52 pm, Catalin Marinas wrote:
> On Thu, Jun 26, 2025 at 10:55:24AM +0530, Dev Jain wrote:
>> @@ -1301,16 +1314,39 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>   	}
>>   
>>   	table = pmd_offset(pudp, addr);
>> +	/*
>> +	 * Isolate the PMD table; in case of race with ptdump, this helps
>> +	 * us to avoid taking the lock in __pmd_free_pte_page().
>> +	 *
>> +	 * Static key logic:
>> +	 *
>> +	 * Case 1: If ptdump does static_branch_enable(), and after that we
>> +	 * execute the if block, then this patches in the read lock, ptdump has
>> +	 * the write lock patched in, therefore ptdump will never read from
>> +	 * a potentially freed PMD table.
>> +	 *
>> +	 * Case 2: If the if block starts executing before ptdump's
>> +	 * static_branch_enable(), then no locking synchronization
>> +	 * will be done. However, pud_clear() + the dsb() in
>> +	 * __flush_tlb_kernel_pgtable will ensure that ptdump observes an
> Statements like "observes" really need to be relative, not absolute.
> Like in "observes an empty PUD before/after ...".
>
>> +	 * empty PUD. Thus, it will never walk over a potentially freed
>> +	 * PMD table.
>> +	 */
>> +	pud_clear(pudp);
>> +	__flush_tlb_kernel_pgtable(addr);
>> +	if (static_branch_unlikely(&ptdump_lock_key)) {
>> +		mmap_read_lock(&init_mm);
>> +		mmap_read_unlock(&init_mm);
>> +	}
> This needs a formal model ;).
>
> static_branch_enable() is called before the mmap_write_lock(), so even
> if the above observes the new branch, it may do the read_unlock() before
> the ptdump_walk_pgd() attempted the write lock. So your case 1
> description is not entirely correct.

Thanks, I see I escaped by saying "we have read and write locks patched
in so it will all work out"; this surely needs a better explanation.

>
> I don't get case 2. You want to ensure pud_clear() is observed by the
> ptdump code before the pmd_free(). The DSB in the TLB flushing code
> ensures some ordering between the pud_clear() and presumably something
> that the ptdump code can observe as well. Would that be the mmap
> semaphore? However, the read_lock would only be attempted if this code
> is seeing the static branch update, which is not guaranteed. I don't
> think it even matters since the lock may be released anyway before the
> write_lock in ptdump.
>
> For example, you do a pud_clear() above, skip the whole static branch.
> The ptdump comes along on another CPU but does not observe the
> pud_clear() since there's no synchronisation. It goes ahead and
> dereferences it while this CPU does a pmd_free().

The objective is: ptdump must not dereference a freed pagetable.
So for your example, if the static branch is not observed by
pud_free_pmd_page, this means that ptdump will take the write lock
after the execution of flush_tlb_kernel_pgtable completes (for if ptdump takes
the write lock before execution of flush_tlb_kernel_pgtable completes, we have
executed static_branch_enable(), contradiction). Since that contains a
dsb(ishst), that means that the store is observed by all observers in the inner shareable domain,
which is (I think?) all CPUs. Therefore before taking the write lock, ptdump
will observe an empty pud, so we cannot have an invalid dereference.
What is wrong with my reasoning here?

>
> And I can't get my head around memory ordering, it doesn't look sound.
> static_branch_enable() I don't think has acquire semantics, at least not
> in relation to the actual branch update. The static_branch_unlikely()
> test, again, not sure what guarantees it has (I don't think it has any
> in relation to memory loads). Maybe you have worked it all out and is
> fine but it needs a better explanation and ideally some simple formal
> model. Show it's correct with a global variable first and then we can
> optimise with static branches.

What do you suggest? As in, what do you mean by showing it's correct with
a global variable first... and, for the formal model thingy, do you
want mathematical rigor similar to what you explain in [1] :P, because unfortunately
(and quite embarrassingly) I didn't have formal methods in college : )

[1] https://www.youtube.com/watch?v=qeUHlXOBXmg


>


* Re: [PATCH v4] arm64: Enable vmalloc-huge with ptdump
  2025-07-04 11:42   ` Dev Jain
@ 2025-07-07  0:44     ` Catalin Marinas
  2025-07-14  9:26       ` Dev Jain
  0 siblings, 1 reply; 9+ messages in thread
From: Catalin Marinas @ 2025-07-07  0:44 UTC (permalink / raw)
  To: Dev Jain
  Cc: will, anshuman.khandual, quic_zhenhuah, ryan.roberts,
	kevin.brodsky, yangyicong, joey.gouly, linux-arm-kernel,
	linux-kernel, david

On Fri, Jul 04, 2025 at 05:12:13PM +0530, Dev Jain wrote:
> On 04/07/25 4:52 pm, Catalin Marinas wrote:
> > On Thu, Jun 26, 2025 at 10:55:24AM +0530, Dev Jain wrote:
> > > @@ -1301,16 +1314,39 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
> > >   	}
> > >   	table = pmd_offset(pudp, addr);
> > > +	/*
> > > +	 * Isolate the PMD table; in case of race with ptdump, this helps
> > > +	 * us to avoid taking the lock in __pmd_free_pte_page().
> > > +	 *
> > > +	 * Static key logic:
> > > +	 *
> > > +	 * Case 1: If ptdump does static_branch_enable(), and after that we
> > > +	 * execute the if block, then this patches in the read lock, ptdump has
> > > +	 * the write lock patched in, therefore ptdump will never read from
> > > +	 * a potentially freed PMD table.
> > > +	 *
> > > +	 * Case 2: If the if block starts executing before ptdump's
> > > +	 * static_branch_enable(), then no locking synchronization
> > > +	 * will be done. However, pud_clear() + the dsb() in
> > > +	 * __flush_tlb_kernel_pgtable will ensure that ptdump observes an
[...]
> > I don't get case 2. You want to ensure pud_clear() is observed by the
> > ptdump code before the pmd_free(). The DSB in the TLB flushing code
> > ensures some ordering between the pud_clear() and presumably something
> > that the ptdump code can observe as well. Would that be the mmap
> > semaphore? However, the read_lock would only be attempted if this code
> > is seeing the static branch update, which is not guaranteed. I don't
> > think it even matters since the lock may be released anyway before the
> > write_lock in ptdump.
> > 
> > For example, you do a pud_clear() above, skip the whole static branch.
> > The ptdump comes along on another CPU but does not observe the
> > pud_clear() since there's no synchronisation. It goes ahead and
> > dereferences it while this CPU does a pmd_free().
> 
> The objective is: ptdump must not dereference a freed pagetable.
> So for your example, if the static branch is not observed by
> pud_free_pmd_page, this means that ptdump will take the write lock
> after the execution of flush_tlb_kernel_pagetable completes (for if ptdump takes
> the write lock before execution of flush_tlb_kernel_pagetable completes, we have
> executed static_branch_enable(), contradiction).

I don't see why the write lock matters since pud_free_pmd_page() doesn't
take the read lock in the second scenario. What we need is acquire
semantics after the static branch update on the ptdump path but we get
it before we even attempt the write lock.

For simplicity, ignoring the observability of instruction writes and
considering the static branch a variable, if pud_free_pmd_page() did not
observe the static branch update, is the ptdump guaranteed to see the
cleared pud subsequently?

With initial state pud=1 (non-zero), stb=0 (static branch):

P0 (pud_free_pmd_page)		P1 (ptdump)

    W_pud=0			   W_stb=1
    DSB				   barrier/acq
    R_stb=0			   R_pud=?

The write to the static branch on P1 will be ordered after the read of
the branch on P0, so the pud will be seen as 0. It's not even worth
mentioning the semaphore here as the static branch update has enough
barriers for cache flushing and kick_all_cpus_sync().


The other scenario is P0 (pud_free_pmd_page) observing the write to the
static branch (that's case 1 in your comment). This doesn't say anything
about whether P1 (ptdump) sees a clear or valid pud. What we do know is
that P0 will try to acquire (and release) the lock. If P1 already
acquired the write lock, P0 will wait and the state of the pud is
irrelevant (no freeing). Similarly if P1 already completed by the time
P0 takes the lock.

If P0 takes the lock first, the lock release guarantees that the
pud_clear() is seen by the ptdump code _after_ it acquired the lock.


W.r.t. the visibility of the branch update vs pud access, the
combinations of DSB+ISB (part of the TLB flushing) on P0 and cache
maintenance to PoU together with kick_all_cpus_sync() on P1 should
suffice.

I think the logic works (though I'm heavily jetlagged and writing from a
plane) but the comments need to be improved. As described above, case 1
has two sub-cases depending on when P0 runs in relation to the write
lock (before or during/after). And the write lock doesn't matter for
case 2.

> > And I can't get my head around memory ordering, it doesn't look sound.
> > static_branch_enable() I don't think has acquire semantics, at least not
> > in relation to the actual branch update. The static_branch_unlikely()
> > test, again, not sure what guarantees it has (I don't think it has any
> > in relation to memory loads). Maybe you have worked it all out and is
> > fine but it needs a better explanation and ideally some simple formal
> > model. Show it's correct with a global variable first and then we can
> > optimise with static branches.
> 
> What do you suggest? As in what do you mean by showing its correct with
> a global variable first...and, for the formal model thingy, do you
> want mathematical rigor similar to what you explain in [1] :P, because unfortunately
> (and quite embarrassingly) I didn't have formal methods in college : )

Neither did I ;) (mostly analogue electronics). I was thinking something
like our cat/herd tools where you can write some simple assembly. It's a
good investment if you want to give it a try.
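
For the case where pud_free_pmd_page() misses the branch update, the
question reduces to the classic store-buffering shape. Treating the
static branch as a plain variable, a herd7/LKMM litmus test could look
like this (a sketch, untested; smp_mb() stands in for the real barriers
on both sides):

C ptdump-pud-free-sb

{
	pud=1;
}

P0(int *pud, int *stb)
{
	int r0;

	WRITE_ONCE(*pud, 0);	/* pud_clear() */
	smp_mb();		/* DSB in __flush_tlb_kernel_pgtable() */
	r0 = READ_ONCE(*stb);	/* static_branch_unlikely() */
}

P1(int *pud, int *stb)
{
	int r1;

	WRITE_ONCE(*stb, 1);	/* static_branch_enable() */
	smp_mb();		/* kick_all_cpus_sync() and friends */
	r1 = READ_ONCE(*pud);	/* the walk dereferencing the pud */
}

(* forbidden: both sides missing each other's write *)
exists (0:r0=0 /\ 1:r1=1)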

-- 
Catalin


* Re: [PATCH v4] arm64: Enable vmalloc-huge with ptdump
  2025-07-07  0:44     ` Catalin Marinas
@ 2025-07-14  9:26       ` Dev Jain
  2025-07-14 15:27         ` Dev Jain
  0 siblings, 1 reply; 9+ messages in thread
From: Dev Jain @ 2025-07-14  9:26 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: will, anshuman.khandual, quic_zhenhuah, ryan.roberts,
	kevin.brodsky, yangyicong, joey.gouly, linux-arm-kernel,
	linux-kernel, david


On 07/07/25 6:14 am, Catalin Marinas wrote:
> On Fri, Jul 04, 2025 at 05:12:13PM +0530, Dev Jain wrote:
>> On 04/07/25 4:52 pm, Catalin Marinas wrote:
>>> On Thu, Jun 26, 2025 at 10:55:24AM +0530, Dev Jain wrote:
>>>> @@ -1301,16 +1314,39 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>>>    	}
>>>>    	table = pmd_offset(pudp, addr);
>>>> +	/*
>>>> +	 * Isolate the PMD table; in case of race with ptdump, this helps
>>>> +	 * us to avoid taking the lock in __pmd_free_pte_page().
>>>> +	 *
>>>> +	 * Static key logic:
>>>> +	 *
>>>> +	 * Case 1: If ptdump does static_branch_enable(), and after that we
>>>> +	 * execute the if block, then this patches in the read lock, ptdump has
>>>> +	 * the write lock patched in, therefore ptdump will never read from
>>>> +	 * a potentially freed PMD table.
>>>> +	 *
>>>> +	 * Case 2: If the if block starts executing before ptdump's
>>>> +	 * static_branch_enable(), then no locking synchronization
>>>> +	 * will be done. However, pud_clear() + the dsb() in
>>>> +	 * __flush_tlb_kernel_pgtable will ensure that ptdump observes an
> [...]
>>> I don't get case 2. You want to ensure pud_clear() is observed by the
>>> ptdump code before the pmd_free(). The DSB in the TLB flushing code
>>> ensures some ordering between the pud_clear() and presumably something
>>> that the ptdump code can observe as well. Would that be the mmap
>>> semaphore? However, the read_lock would only be attempted if this code
>>> is seeing the static branch update, which is not guaranteed. I don't
>>> think it even matters since the lock may be released anyway before the
>>> write_lock in ptdump.
>>>
>>> For example, you do a pud_clear() above, skip the whole static branch.
>>> The ptdump comes along on another CPU but does not observe the
>>> pud_clear() since there's no synchronisation. It goes ahead and
>>> dereferences it while this CPU does a pmd_free().
>> The objective is: ptdump must not dereference a freed pagetable.
>> So for your example, if the static branch is not observed by
>> pud_free_pmd_page, this means that ptdump will take the write lock
>> after the execution of flush_tlb_kernel_pagetable completes (for if ptdump takes
>> the write lock before execution of flush_tlb_kernel_pagetable completes, we have
>> executed static_branch_enable(), contradiction).
> I don't see why the write lock matters since pud_free_pmd_page() doesn't

True.

> take the read lock in the second scenario. What we need is acquire
> semantics after the static branch update on the ptdump path but we get
> it before we even attempt the write lock.
>
> For simplicity, ignoring the observability of instruction writes and
> considering the static branch a variable, if pud_free_pmd_page() did not
> observe the static branch update, is the ptdump guaranteed to see the
> cleared pud subsequently?
>
> With initial state pud=1 (non-zero), stb=0 (static branch):
>
> P0 (pud_free_pmd_page)		P1 (ptdump)
>
>      W_pud=0			   W_stb=1
>      DSB				   barrier/acq
>      R_stb=0			   R_pud=?
>
> The write to the static branch on P1 will be ordered after the read of
> the branch on P0, so the pud will be seen as 0. It's not even worth
> mentioning the semaphore here as the static branch update has enough
> barriers for cache flushing and kick_all_cpus_sync().
>
>
> The other scenario is P0 (pud_free_pmd_page) observing the write to the
> static branch (that's case 1 in your comment). This doesn't say anything
> about whether P1 (ptdump) sees a clear or valid pud. What we do know is
> that P0 will try to acquire (and release) the lock. If P1 already
> acquired the write lock, P0 will wait and the state of the pud is
> irrelevant (no freeing). Similarly if P1 already completed by the time
> P0 takes the lock.
>
> If P0 takes the lock first, the lock release guarantees that the
> pud_clear() is seen by the ptdump code _after_ it acquired the lock.
>
>
> W.r.t. the visibility of the branch update vs pud access, the
> combinations of DSB+ISB (part of the TLB flushing) on P0 and cache
> maintenance to PoU together with kick_all_cpus_sync() on P1 should
> suffice.
>
> I think the logic works (though I'm heavily jetlagged and writing from a
> plane) but the comments need to be improved. As described above, case 1
> has two sub-cases depending on when P0 runs in relation to the write
> lock (before or during/after). And the write lock doesn't matter for
> case 2.
>
>>> And I can't get my head around memory ordering, it doesn't look sound.
>>> static_branch_enable() I don't think has acquire semantics, at least not
>>> in relation to the actual branch update. The static_branch_unlikely()
>>> test, again, not sure what guarantees it has (I don't think it has any
>>> in relation to memory loads). Maybe you have worked it all out and is
>>> fine but it needs a better explanation and ideally some simple formal
>>> model. Show it's correct with a global variable first and then we can
>>> optimise with static branches.
>> What do you suggest? As in what do you mean by showing its correct with
>> a global variable first...and, for the formal model thingy, do you
>> want mathematical rigor similar to what you explain in [1] :P, because unfortunately
>> (and quite embarrassingly) I didn't have formal methods in college : )
> Neither did I ;) (mostly analogue electronics). I was thinking something
> like our cat/herd tools where you can write some simple assembly. It's a
> good investment if you want to give it a try.


Will the following proof work?

Proof of correctness: The below diagram represents pud_free_pmd_page
executing on P0 and ptdump executing on P1. Note that we can ignore
the situation where a process migrates to another CPU: we will have
extra barriers because of switch_to(), and all of the embedded
barriers used in the reasoning below apply to the inner shareable
domain and will therefore be observed by all CPUs. Hence it suffices
to prove the case where pud_free_pmd_page executes completely on P0
and ptdump executes completely on P1.

Let t_i, 0 <= i <= 8 denote the *global* timestamp taken for the corresponding
instruction to complete. Therefore from here on we do not need to use the term
"observe" in a relative context. Let t_i' (t_i dash) denote the global timestamp
for the corresponding instruction to start. That is, an instruction labelled
with t_i implies that it started at t_i' and finished at t_i.


P0				P1:

W_PUD = 0: t0			x = 1: t2

if (x == 1) {: t7			write lock: t3
	read lock: t6		R_PUD = 1: t4
	read unlock: t8		write unlock: t5
}

Free PUD: t1

We need to prove that ptdump completely finishes before
we free the PUD. Since write unlock has release semantics,
if the write unlock finishes, it is guaranteed that ptdump
has finished => it suffices to prove that t5 < t1'.


R_PUD = 1 => t4 < t0 .... (i)

Because of acquire semantics of down_write(rw_semaphore lock),
t3 < t4' .... (ii)

(i) and (ii) (and t4' < t4) => t3 < t0 ... (iii)

ptdump is executed on a single kernel thread, which implies
that the transition x = 1 -> x = 1 will never happen; that is,
when static_branch_enable() is executed, then x was 0, which
means that the call chain static_key_enable -> static_key_enable_cpuslocked
-> jump_label_update -> jump_label_can_update/ arch_jump_label_transform_apply
-> kick_all_cpus_sync -> smp_mb -> __smp_mb -> dmb(ish) will always be followed.
The emission of dmb(ish) => t2 < t3 ... (iv)

(iii) and (iv) => t2 < t0, also, flush_tlb_kernel_pgtable -> dsb(ish) ensures that t0 < t7'
=> t2 < t7' => the static branch is observed by P0 always => t6 and t8 exist.

Now, t0 < t6' because of flush_tlb_kernel_pgtable; combining with (iii), this gives
us t3 < t6' => the write lock is observed first => t5 < t6 (the read lock cannot
be taken before the write unlock finishes) ...(v)

Release semantics of read unlock => t8 < t1' ...(vi)
Also, trivially t6 < t8...(vii)

Combining v, vi and vii, t5 < t1'. Q.E.D


>


* Re: [PATCH v4] arm64: Enable vmalloc-huge with ptdump
  2025-07-14  9:26       ` Dev Jain
@ 2025-07-14 15:27         ` Dev Jain
  0 siblings, 0 replies; 9+ messages in thread
From: Dev Jain @ 2025-07-14 15:27 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: will, anshuman.khandual, quic_zhenhuah, ryan.roberts,
	kevin.brodsky, yangyicong, joey.gouly, linux-arm-kernel,
	linux-kernel, david


On 14/07/25 2:56 pm, Dev Jain wrote:
>
> On 07/07/25 6:14 am, Catalin Marinas wrote:
>> On Fri, Jul 04, 2025 at 05:12:13PM +0530, Dev Jain wrote:
>>> On 04/07/25 4:52 pm, Catalin Marinas wrote:
>>>> On Thu, Jun 26, 2025 at 10:55:24AM +0530, Dev Jain wrote:
>>>>> @@ -1301,16 +1314,39 @@ int pud_free_pmd_page(pud_t *pudp, 
>>>>> unsigned long addr)
>>>>>        }
>>>>>        table = pmd_offset(pudp, addr);
>>>>> +    /*
>>>>> +     * Isolate the PMD table; in case of race with ptdump, this 
>>>>> helps
>>>>> +     * us to avoid taking the lock in __pmd_free_pte_page().
>>>>> +     *
>>>>> +     * Static key logic:
>>>>> +     *
>>>>> +     * Case 1: If ptdump does static_branch_enable(), and after 
>>>>> that we
>>>>> +     * execute the if block, then this patches in the read lock, 
>>>>> ptdump has
>>>>> +     * the write lock patched in, therefore ptdump will never 
>>>>> read from
>>>>> +     * a potentially freed PMD table.
>>>>> +     *
>>>>> +     * Case 2: If the if block starts executing before ptdump's
>>>>> +     * static_branch_enable(), then no locking synchronization
>>>>> +     * will be done. However, pud_clear() + the dsb() in
>>>>> +     * __flush_tlb_kernel_pgtable will ensure that ptdump 
>>>>> observes an
>> [...]
>>>> I don't get case 2. You want to ensure pud_clear() is observed by the
>>>> ptdump code before the pmd_free(). The DSB in the TLB flushing code
>>>> ensures some ordering between the pud_clear() and presumably something
>>>> that the ptdump code can observe as well. Would that be the mmap
>>>> semaphore? However, the read_lock would only be attempted if this code
>>>> is seeing the static branch update, which is not guaranteed. I don't
>>>> think it even matters since the lock may be released anyway before the
>>>> write_lock in ptdump.
>>>>
>>>> For example, you do a pud_clear() above, skip the whole static branch.
>>>> The ptdump comes along on another CPU but does not observe the
>>>> pud_clear() since there's no synchronisation. It goes ahead and
>>>> dereferences it while this CPU does a pmd_free().
>>> The objective is: ptdump must not dereference a freed pagetable.
>>> So for your example, if the static branch is not observed by
>>> pud_free_pmd_page, this means that ptdump will take the write lock
>>> after the execution of flush_tlb_kernel_pagetable completes (for if 
>>> ptdump takes
>>> the write lock before execution of flush_tlb_kernel_pagetable 
>>> completes, we have
>>> executed static_branch_enable(), contradiction).
>> I don't see why the write lock matters since pud_free_pmd_page() doesn't
>
> True.
>
>> take the read lock in the second scenario. What we need is acquire
>> semantics after the static branch update on the ptdump path but we get
>> it before we even attempt the write lock.
>>
>> For simplicity, ignoring the observability of instruction writes and
>> considering the static branch a variable, if pud_free_pmd_page() did not
>> observe the static branch update, is the ptdump guaranteed to see the
>> cleared pud subsequently?
>>
>> With initial state pud=1 (non-zero), stb=0 (static branch):
>>
>> P0 (pud_free_pmd_page)        P1 (ptdump)
>>
>>      W_pud=0               W_stb=1
>>      DSB                   barrier/acq
>>      R_stb=0               R_pud=?
>>
>> The write to the static branch on P1 will be ordered after the read of
>> the branch on P0, so the pud will be seen as 0. It's not even worth
>> mentioning the semaphore here as the static branch update has enough
>> barriers for cache flushing and kick_all_cpus_sync().
>>
>>
>> The other scenario is P0 (pud_free_pmd_page) observing the write to the
>> static branch (that's case 1 in your comment). This doesn't say anything
>> about whether P1 (ptdump) sees a clear or valid pud. What we do know is
>> that P0 will try to acquire (and release) the lock. If P1 already
>> acquired the write lock, P0 will wait and the state of the pud is
>> irrelevant (no freeing). Similarly if P1 already completed by the time
>> P0 takes the lock.
>>
>> If P0 takes the lock first, the lock release guarantees that the
>> pud_clear() is seen by the ptdump code _after_ it acquired the lock.
>>
>>
>> W.r.t. the visibility of the branch update vs pud access, the
>> combinations of DSB+ISB (part of the TLB flushing) on P0 and cache
>> maintenance to PoU together with kick_all_cpus_sync() on P1 should
>> suffice.
>>
>> I think the logic works (though I'm heavily jetlagged and writing from a
>> plane) but the comments need to be improved. As described above, case 1
>> has two sub-cases depending on when P0 runs in relation to the write
>> lock (before or during/after). And the write lock doesn't matter for
>> case 2.
>>
>>>> And I can't get my head around memory ordering, it doesn't look sound.
>>>> static_branch_enable() I don't think has acquire semantics, at 
>>>> least not
>>>> in relation to the actual branch update. The static_branch_unlikely()
>>>> test, again, not sure what guarantees it has (I don't think it has any
>>>> in relation to memory loads). Maybe you have worked it all out and is
>>>> fine but it needs a better explanation and ideally some simple formal
>>>> model. Show it's correct with a global variable first and then we can
>>>> optimise with static branches.
>>> What do you suggest? As in what do you mean by showing its correct with
>>> a global variable first...and, for the formal model thingy, do you
>>> want mathematical rigor similar to what you explain in [1] :P, 
>>> because unfortunately
>>> (and quite embarrassingly) I didn't have formal methods in college : )
>> Neither did I ;) (mostly analogue electronics). I was thinking something
>> like our cat/herd tools where you can write some simple assembly. It's a
>> good investment if you want to give it a try.
>
>
> Will the following proof work -
>
> Proof of correctness: The below diagram represents pud_free_pmd_page
> executing on P0 and ptdump executing on P1. Note that, we can ignore
> the situation when processes migrate to another CPU, since we will
> have extra barriers because of switch_to(), and all of the embedded
> barriers that are used in the reasoning of the proof below apply
> to the inner shareable domain, and therefore will be observed by
> all CPUs, therefore it suffices to prove for the case where
> pud_free_pmd_page executes completely on P0 and ptdump
> executes completely on P1.
>
> Let t_i, 0 <= i <= 8 denote the *global* timestamp taken for the 
> corresponding
> instruction to complete. Therefore from here on we do not need to use 
> the term
> "observe" in a relative context. Let t_i' (t_i dash) denote the global 
> timestamp
> for the corresponding instruction to start. That is, an instruction 
> labelled
> with t_i implies that it started at t_i' and finished at t_i.
>
>
> P0                P1:
>
> W_PUD = 0: t0            x = 1: t2
>
> if (x == 1) {: t7            write lock: t3
>     read lock: t6        R_PUD = 1: t4
>     read unlock: t8        write unlock: t5
> }
>
> Free PUD: t1
>
> We need to prove that ptdump completely finishes before
> we free the PUD. Since write unlock has release semantics,
> if the write unlock finishes, it is guaranteed that ptdump
> has finished => it suffices to prove that t5 < t1'.
>
>
> R_PUD = 1 => t4 < t0 .... (i)
>
> Because of acquire semantics of down_write(rw_semaphore lock),
> t3 < t4' .... (ii)
>
> (i) and (ii) (and t4' < t4) => t3 < t0 ... (iii)
>
> ptdump is executed on a single kernel thread, which implies
> that the transition x = 1 -> x = 1 will never happen; that is,
> when static_branch_enable() is executed, then x was 0, which
> means that the call chain static_key_enable -> 
> static_key_enable_cpuslocked
> -> jump_label_update -> jump_label_can_update/ 
> arch_jump_label_transform_apply
> -> kick_all_cpus_sync -> smp_mb -> __smp_mb -> dmb(ish) will always be 
> followed.
> The emission of dmb(ish) => t2 < t3 ... (iv)
>
> (iii) and (iv) => t2 < t0, also, flush_tlb_kernel_pgtable -> dsb(ish) 
> ensures that t0 < t7'
> => t2 < t7' => the static branch is observed by P0 always => t6 and t8 
> exist.
>
> Now, t0 < t6' because of flush_tlb_kernel_pgtable; combining with 
> (iii), this gives
> us t3 < t6' => the write lock is observed first => t5 < t6 (the read 
> lock cannot
> be taken before the write unlock finishes) ...(v)
>
> Release semantics of read unlock => t8 < t1' ...(vi)
> Also, trivially t6 < t8...(vii)
>
> Combining v, vi and vii, t5 < t1'. Q.E.D
>
>
For the sake of rigor, t_i should be defined w.r.t. loads and stores rather
than start and end of instruction, since theoretically, for example, t5 > t6
is possible if t5 includes a lot of, let's say, statistics-update operations.
So, the relevant store of write_unlock will finish, and the read lock will
completely execute before we actually exit write_unlock, due to the
instructions embedded in write_unlock which are not relevant to us.


>>
>

