[RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump

linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump
@ 2025-07-23 16:18 Dev Jain
  2025-07-24  5:50 ` Anshuman Khandual
  2025-07-30 17:00 ` Catalin Marinas
  0 siblings, 2 replies; 11+ messages in thread
From: Dev Jain @ 2025-07-23 16:18 UTC (permalink / raw)
  To: catalin.marinas, will
  Cc: anshuman.khandual, quic_zhenhuah, ryan.roberts, kevin.brodsky,
	yangyicong, joey.gouly, linux-arm-kernel, linux-kernel, david,
	mark.rutland, urezki, Dev Jain

Our goal is to move towards enabling vmalloc-huge by default on arm64 so
as to reduce TLB pressure. Therefore, we need a way to analyze the portion
of block mappings in vmalloc space we can get on a production system; this
can be done through ptdump, but currently we disable vmalloc-huge if
CONFIG_PTDUMP_DEBUGFS is on. The reason is that lazy freeing of kernel
pagetables via vmap_try_huge_pxd() may race with ptdump, so ptdump
may dereference a bogus address.

To solve this, we need to synchronize ptdump_walk_pgd() with
pud_free_pmd_page() and pmd_free_pte_page().

Since this race is very unlikely to happen in practice, we do not want to
penalize other code paths taking the init_mm mmap_lock. Therefore, we use
static keys. ptdump will enable the static key - upon observing that, the
vmalloc table freeing path will get patched in with an
mmap_read_lock/unlock sequence. A code comment explains in detail, how
a combination of acquire sematics of static_branch_enable() and the
barriers in __flush_tlb_kernel_pgtable() ensures that ptdump will never
get a hold on the address of a freed PMD or PTE table.

For an (almost) rigorous proof of correctness, one may look at:
https://lore.kernel.org/all/e1e87f16-1c48-481b-8f7c-9333ac5d13e7@arm.com/

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
Rebased on 6.16-rc7.

mm-selfests pass. No issues were observed while parallelly running
test_vmalloc.sh (which stresses the vmalloc subsystem) and dumping the
kernel pagetable through sysfs in a loop.

v4->v5:
 - The arch_vmap_pxd_supported() changes were dropped by mistake in between
   the revisions, add them back (Anshuman)
 - Rewrite cover letter, drop stray change, add arm64_ptdump_walk_pgd()
   helper, rename ptdump_lock_key -> arm64_ptdump_lock_key, add comment
   above __pmd_free_pte_page() to explain when the lock won't
   be taken (Anshuman)
 - Rewrite the comment explaining the synchronization logic (Catalin)

v3->v4:
 - Lock-unlock immediately
 - Simplify includes

v2->v3:
 - Use static key mechanism

v1->v2:
 - Take lock only when CONFIG_PTDUMP_DEBUGFS is on
 - In case of pud_free_pmd_page(), isolate the PMD table to avoid taking
   the lock 512 times again via pmd_free_pte_page()

---
 arch/arm64/include/asm/ptdump.h  |  2 +
 arch/arm64/include/asm/vmalloc.h |  6 +--
 arch/arm64/mm/mmu.c              | 86 ++++++++++++++++++++++++++++++--
 arch/arm64/mm/ptdump.c           | 11 +++-
 4 files changed, 95 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h
index fded5358641f..baff24004459 100644
--- a/arch/arm64/include/asm/ptdump.h
+++ b/arch/arm64/include/asm/ptdump.h
@@ -7,6 +7,8 @@
 
 #include <linux/ptdump.h>
 
+DECLARE_STATIC_KEY_FALSE(arm64_ptdump_lock_key);
+
 #ifdef CONFIG_PTDUMP
 
 #include <linux/mm_types.h>
diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index 12f534e8f3ed..e835fd437ae0 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
 	/*
 	 * SW table walks can't handle removal of intermediate entries.
 	 */
-	return pud_sect_supported() &&
-	       !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
+	return pud_sect_supported();
 }
 
 #define arch_vmap_pmd_supported arch_vmap_pmd_supported
 static inline bool arch_vmap_pmd_supported(pgprot_t prot)
 {
-	/* See arch_vmap_pud_supported() */
-	return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
+	return true;
 }
 
 #define arch_vmap_pte_range_map_size arch_vmap_pte_range_map_size
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 00ab1d648db6..49932c1dd34e 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -46,6 +46,8 @@
 #define NO_CONT_MAPPINGS	BIT(1)
 #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
 
+DEFINE_STATIC_KEY_FALSE(arm64_ptdump_lock_key);
+
 enum pgtable_type {
 	TABLE_PTE,
 	TABLE_PMD,
@@ -1267,7 +1269,12 @@ int pmd_clear_huge(pmd_t *pmdp)
 	return 1;
 }
 
-int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
+/*
+ * If PMD has been isolated via pud_free_pmd_page(), ptdump cannot get
+ * a hold to it, so no need to serialize with mmap_lock, hence lock
+ * will be passed as false here. Otherwise, lock will be true.
+ */
+static int __pmd_free_pte_page(pmd_t *pmdp, unsigned long addr, bool lock)
 {
 	pte_t *table;
 	pmd_t pmd;
@@ -1279,13 +1286,24 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
 		return 1;
 	}
 
+	/* See comment in pud_free_pmd_page for static key logic */
 	table = pte_offset_kernel(pmdp, addr);
 	pmd_clear(pmdp);
 	__flush_tlb_kernel_pgtable(addr);
+	if (static_branch_unlikely(&arm64_ptdump_lock_key) && lock) {
+		mmap_read_lock(&init_mm);
+		mmap_read_unlock(&init_mm);
+	}
+
 	pte_free_kernel(NULL, table);
 	return 1;
 }
 
+int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
+{
+	return __pmd_free_pte_page(pmdp, addr, true);
+}
+
 int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
 {
 	pmd_t *table;
@@ -1301,16 +1319,76 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
 	}
 
 	table = pmd_offset(pudp, addr);
+
+	/*
+	 * Our objective is to prevent ptdump from reading a PMD table which has
+	 * been freed.  Assume that ptdump_walk_pgd() (call this thread T1)
+	 * executes completely on CPU1 and pud_free_pmd_page() (call this thread
+	 * T2) executes completely on CPU2. Let the region sandwiched by the
+	 * mmap_write_lock/unlock in T1 be called CS (the critical section).
+	 *
+	 * Claim: The CS of T1 will never operate on a freed PMD table.
+	 *
+	 * Proof:
+	 *
+	 * Case 1: The static branch is visible to T2.
+	 *
+	 * Case 1 (a): T1 acquires the lock before T2 can.
+	 * T2 will block until T1 drops the lock, so pmd_free() will only be
+	 * executed after T1 exits CS.
+	 *
+	 * Case 1 (b): T2 acquires the lock before T1 can.
+	 * The sequence of barriers issued in __flush_tlb_kernel_pgtable()
+	 * ensures that an empty PUD (via pud_clear()) is visible to T1 before
+	 * T1 can enter CS, therefore it is impossible for the CS to get hold
+	 * of the address of the isolated PMD table.
+	 *
+	 * Case 2: The static branch is not visible to T2.
+	 *
+	 * Since static_branch_enable() (via dmb(ish)) and mmap_write_lock()
+	 * have acquire semantics, it is guaranteed that the static branch
+	 * will be visible to all CPUs before T1 can enter CS. The static
+	 * branch not being visible to T2 therefore guarantees that T1 has
+	 * not yet entered CS .... (i)
+	 * The sequence of barriers via __flush_tlb_kernel_pgtable() in T2
+	 * implies that if the invisibility of the static branch has been
+	 * observed by T2 (i.e static_branch_unlikely() is observed as false),
+	 * then all CPUs will have observed an empty PUD ... (ii)
+	 * Combining (i) and (ii), we conclude that T1 observes an empty PUD
+	 * before entering CS => it is impossible for the CS to get hold of
+	 * the address of the isolated PMD table. Q.E.D
+	 *
+	 * We have proven that the claim is true on the assumption that
+	 * there is no context switch for T1 and T2. Note that the reasoning
+	 * of the proof uses barriers operating on the inner shareable domain,
+	 * which means that they will affect all CPUs, and also a context switch
+	 * will insert extra barriers into the code paths => the claim will
+	 * stand true even if we drop the assumption.
+	 *
+	 * It is also worth reasoning whether something can go wrong via
+	 * pud_free_pmd_page() -> __pmd_free_pte_page(), since the latter
+	 * will be called locklessly on this code path.
+	 *
+	 * For Case 1 (a), T2 will block until CS is finished, so we are safe.
+	 * For Case 1 (b) and Case 2, the PMD table will be isolated before
+	 * T1 can enter CS, therefore it is safe for T2 to operate on the
+	 * PMD table locklessly.
+	 */
+	pud_clear(pudp);
+	__flush_tlb_kernel_pgtable(addr);
+	if (static_branch_unlikely(&arm64_ptdump_lock_key)) {
+		mmap_read_lock(&init_mm);
+		mmap_read_unlock(&init_mm);
+	}
+
 	pmdp = table;
 	next = addr;
 	end = addr + PUD_SIZE;
 	do {
 		if (pmd_present(pmdp_get(pmdp)))
-			pmd_free_pte_page(pmdp, next);
+			__pmd_free_pte_page(pmdp, next, false);
 	} while (pmdp++, next += PMD_SIZE, next != end);
 
-	pud_clear(pudp);
-	__flush_tlb_kernel_pgtable(addr);
 	pmd_free(NULL, table);
 	return 1;
 }
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index 421a5de806c6..65335c7ba482 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -283,6 +283,13 @@ void note_page_flush(struct ptdump_state *pt_st)
 	note_page(pt_st, 0, -1, pte_val(pte_zero));
 }
 
+static void arm64_ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm)
+{
+	static_branch_enable(&arm64_ptdump_lock_key);
+	ptdump_walk_pgd(st, mm, NULL);
+	static_branch_disable(&arm64_ptdump_lock_key);
+}
+
 void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
 {
 	unsigned long end = ~0UL;
@@ -311,7 +318,7 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
 		}
 	};
 
-	ptdump_walk_pgd(&st.ptdump, info->mm, NULL);
+	arm64_ptdump_walk_pgd(&st.ptdump, info->mm);
 }
 
 static void __init ptdump_initialize(void)
@@ -353,7 +360,7 @@ bool ptdump_check_wx(void)
 		}
 	};
 
-	ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
+	arm64_ptdump_walk_pgd(&st.ptdump, &init_mm);
 
 	if (st.wx_pages || st.uxn_pages) {
 		pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found, %lu non-UXN pages found\n",
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump
  2025-07-23 16:18 [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump Dev Jain
@ 2025-07-24  5:50 ` Anshuman Khandual
  2025-07-24  7:20   ` Dev Jain
  2025-07-30 17:00 ` Catalin Marinas
  1 sibling, 1 reply; 11+ messages in thread
From: Anshuman Khandual @ 2025-07-24  5:50 UTC (permalink / raw)
  To: Dev Jain, catalin.marinas, will
  Cc: quic_zhenhuah, ryan.roberts, kevin.brodsky, yangyicong,
	joey.gouly, linux-arm-kernel, linux-kernel, david, mark.rutland,
	urezki

A very small nit -

arm64/mm: Enable vmalloc-huge with ptdump

On 23/07/25 9:48 PM, Dev Jain wrote:
> Our goal is to move towards enabling vmalloc-huge by default on arm64 so
> as to reduce TLB pressure. Therefore, we need a way to analyze the portion
> of block mappings in vmalloc space we can get on a production system; this
> can be done through ptdump, but currently we disable vmalloc-huge if
> CONFIG_PTDUMP_DEBUGFS is on. The reason is that lazy freeing of kernel
> pagetables via vmap_try_huge_pxd() may race with ptdump, so ptdump
> may dereference a bogus address.
> 
> To solve this, we need to synchronize ptdump_walk_pgd() with
> pud_free_pmd_page() and pmd_free_pte_page().
> 
> Since this race is very unlikely to happen in practice, we do not want to
> penalize other code paths taking the init_mm mmap_lock. Therefore, we use
> static keys. ptdump will enable the static key - upon observing that, the
> vmalloc table freeing path will get patched in with an
> mmap_read_lock/unlock sequence. A code comment explains in detail, how
> a combination of acquire sematics of static_branch_enable() and the

typo                       ^^^^^^^^ s/sematics/semantics

> barriers in __flush_tlb_kernel_pgtable() ensures that ptdump will never
> get a hold on the address of a freed PMD or PTE table.
> 
> For an (almost) rigorous proof of correctness, one may look at:
> https://lore.kernel.org/all/e1e87f16-1c48-481b-8f7c-9333ac5d13e7@arm.com/
> 
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> Rebased on 6.16-rc7.
> 
> mm-selfests pass. No issues were observed while parallelly running
> test_vmalloc.sh (which stresses the vmalloc subsystem) and dumping the
> kernel pagetable through sysfs in a loop.
> 
> v4->v5:
>  - The arch_vmap_pxd_supported() changes were dropped by mistake in between
>    the revisions, add them back (Anshuman)
>  - Rewrite cover letter, drop stray change, add arm64_ptdump_walk_pgd()
>    helper, rename ptdump_lock_key -> arm64_ptdump_lock_key, add comment
>    above __pmd_free_pte_page() to explain when the lock won't
>    be taken (Anshuman)
>  - Rewrite the comment explaining the synchronization logic (Catalin)
> 
> v3->v4:
>  - Lock-unlock immediately
>  - Simplify includes
> 
> v2->v3:
>  - Use static key mechanism
> 
> v1->v2:
>  - Take lock only when CONFIG_PTDUMP_DEBUGFS is on
>  - In case of pud_free_pmd_page(), isolate the PMD table to avoid taking
>    the lock 512 times again via pmd_free_pte_page()
> 
> ---
>  arch/arm64/include/asm/ptdump.h  |  2 +
>  arch/arm64/include/asm/vmalloc.h |  6 +--
>  arch/arm64/mm/mmu.c              | 86 ++++++++++++++++++++++++++++++--
>  arch/arm64/mm/ptdump.c           | 11 +++-
>  4 files changed, 95 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h
> index fded5358641f..baff24004459 100644
> --- a/arch/arm64/include/asm/ptdump.h
> +++ b/arch/arm64/include/asm/ptdump.h
> @@ -7,6 +7,8 @@
>  
>  #include <linux/ptdump.h>
>  
> +DECLARE_STATIC_KEY_FALSE(arm64_ptdump_lock_key);
> +
>  #ifdef CONFIG_PTDUMP
>  
>  #include <linux/mm_types.h>
> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
> index 12f534e8f3ed..e835fd437ae0 100644
> --- a/arch/arm64/include/asm/vmalloc.h
> +++ b/arch/arm64/include/asm/vmalloc.h
> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
>  	/*
>  	 * SW table walks can't handle removal of intermediate entries.
>  	 */

This above comment can now be dropped.

> -	return pud_sect_supported() &&
> -	       !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
> +	return pud_sect_supported();
>  }
>  
>  #define arch_vmap_pmd_supported arch_vmap_pmd_supported
>  static inline bool arch_vmap_pmd_supported(pgprot_t prot)
>  {
> -	/* See arch_vmap_pud_supported() */
> -	return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
> +	return true;
>  }
>  
>  #define arch_vmap_pte_range_map_size arch_vmap_pte_range_map_size
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 00ab1d648db6..49932c1dd34e 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -46,6 +46,8 @@
>  #define NO_CONT_MAPPINGS	BIT(1)
>  #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
>  
> +DEFINE_STATIC_KEY_FALSE(arm64_ptdump_lock_key);

A short introduction about the static key's purpose might be helpful.

> +
>  enum pgtable_type {
>  	TABLE_PTE,
>  	TABLE_PMD,
> @@ -1267,7 +1269,12 @@ int pmd_clear_huge(pmd_t *pmdp)
>  	return 1;
>  }
>  
> -int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
> +/*
> + * If PMD has been isolated via pud_free_pmd_page(), ptdump cannot get
> + * a hold to it, so no need to serialize with mmap_lock, hence lock
> + * will be passed as false here. Otherwise, lock will be true.
> + */

This comment should be split into two and added near their respective
call sites with and without the kernel mmap_lock requirement.

> +static int __pmd_free_pte_page(pmd_t *pmdp, unsigned long addr, bool lock)IIUC 'lock' is the need for taking mmap_lock in init_mm. Hence changing
that as 'mmap_lock' or 'kernel_mmap_lock' might be better which will
also add some required context.

>  {
>  	pte_t *table;
>  	pmd_t pmd;
> @@ -1279,13 +1286,24 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>  		return 1;
>  	}
>  
> +	/* See comment in pud_free_pmd_page for static key logic */
>  	table = pte_offset_kernel(pmdp, addr);
>  	pmd_clear(pmdp);
>  	__flush_tlb_kernel_pgtable(addr);
> +	if (static_branch_unlikely(&arm64_ptdump_lock_key) && lock) {
> +		mmap_read_lock(&init_mm);
> +		mmap_read_unlock(&init_mm);
> +	}
> +
>  	pte_free_kernel(NULL, table);
>  	return 1;
>  }
>  
> +int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
> +{
> +	return __pmd_free_pte_page(pmdp, addr, true);
> +}
> +
>  int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>  {
>  	pmd_t *table;
> @@ -1301,16 +1319,76 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>  	}
>  
>  	table = pmd_offset(pudp, addr);
> +
> +	/*
> +	 * Our objective is to prevent ptdump from reading a PMD table which has
> +	 * been freed.  Assume that ptdump_walk_pgd() (call this thread T1)
> +	 * executes completely on CPU1 and pud_free_pmd_page() (call this thread
> +	 * T2) executes completely on CPU2. Let the region sandwiched by the
> +	 * mmap_write_lock/unlock in T1 be called CS (the critical section).
> +	 *
> +	 * Claim: The CS of T1 will never operate on a freed PMD table.
> +	 *
> +	 * Proof:
> +	 *
> +	 * Case 1: The static branch is visible to T2.
> +	 *
> +	 * Case 1 (a): T1 acquires the lock before T2 can.
> +	 * T2 will block until T1 drops the lock, so pmd_free() will only be
> +	 * executed after T1 exits CS.

T2 blocks here because ptdump_walk_pgd() takes mmap_write_lock(). This needs
to be mentioned.

> +	 *> +	 * Case 1 (b): T2 acquires the lock before T1 can.
> +	 * The sequence of barriers issued in __flush_tlb_kernel_pgtable()
> +	 * ensures that an empty PUD (via pud_clear()) is visible to T1 before
> +	 * T1 can enter CS, therefore it is impossible for the CS to get hold
> +	 * of the address of the isolated PMD table.

Agreed. mmap_write_lock() from ptdump_walk_pgd() will proceed after T2 calls
mmap_read_unlock(). So protection from race comes from the fact that T1 can
never get hold of isolated PMD table not because of the mmap_lock.

> +	 *
> +	 * Case 2: The static branch is not visible to T2.
> +	 *
> +	 * Since static_branch_enable() (via dmb(ish)) and mmap_write_lock()
> +	 * have acquire semantics, it is guaranteed that the static branch
> +	 * will be visible to all CPUs before T1 can enter CS. The static
> +	 * branch not being visible to T2 therefore guarantees that T1 has
> +	 * not yet entered CS .... (i)
> +	 * The sequence of barriers via __flush_tlb_kernel_pgtable() in T2
> +	 * implies that if the invisibility of the static branch has been
> +	 * observed by T2 (i.e static_branch_unlikely() is observed as false),
> +	 * then all CPUs will have observed an empty PUD ... (ii)
> +	 * Combining (i) and (ii), we conclude that T1 observes an empty PUD
> +	 * before entering CS => it is impossible for the CS to get hold of
> +	 * the address of the isolated PMD table. Q.E.D
> +	 *> +	 * We have proven that the claim is true on the assumption that
> +	 * there is no context switch for T1 and T2. Note that the reasoning
> +	 * of the proof uses barriers operating on the inner shareable domain,
> +	 * which means that they will affect all CPUs, and also a context switch
> +	 * will insert extra barriers into the code paths => the claim will
> +	 * stand true even if we drop the assumption.

Do these all rest on the fact that static_branch_enable() and mmap_write_lock()
have acquire semantics available via a particular class of barrier instructions
? In which case should the generation of those instructions via these functions
be asserted for the above T1/T2 guarantee to hold ? Just curious.

> +	 *
> +	 * It is also worth reasoning whether something can go wrong via
> +	 * pud_free_pmd_page() -> __pmd_free_pte_page(), since the latter
> +	 * will be called locklessly on this code path.
> +	 *
> +	 * For Case 1 (a), T2 will block until CS is finished, so we are safe.
> +	 * For Case 1 (b) and Case 2, the PMD table will be isolated before
> +	 * T1 can enter CS, therefore it is safe for T2 to operate on the
> +	 * PMD table locklessly.
> +	 */

Although looks reasonable - the above comment block is going to be difficult to
process after time has passed :) Just wondering could it some how be simplified ?

> +	pud_clear(pudp);
> +	__flush_tlb_kernel_pgtable(addr);
> +	if (static_branch_unlikely(&arm64_ptdump_lock_key)) {
> +		mmap_read_lock(&init_mm);
> +		mmap_read_unlock(&init_mm);
> +	}
> +
>  	pmdp = table;
>  	next = addr;
>  	end = addr + PUD_SIZE;
>  	do {
>  		if (pmd_present(pmdp_get(pmdp)))
> -			pmd_free_pte_page(pmdp, next);
> +			__pmd_free_pte_page(pmdp, next, false);
>  	} while (pmdp++, next += PMD_SIZE, next != end);
>  
> -	pud_clear(pudp);
> -	__flush_tlb_kernel_pgtable(addr);
>  	pmd_free(NULL, table);
>  	return 1;
>  }
> diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
> index 421a5de806c6..65335c7ba482 100644
> --- a/arch/arm64/mm/ptdump.c
> +++ b/arch/arm64/mm/ptdump.c
> @@ -283,6 +283,13 @@ void note_page_flush(struct ptdump_state *pt_st)
>  	note_page(pt_st, 0, -1, pte_val(pte_zero));
>  }
>  
> +static void arm64_ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm)
> +{
> +	static_branch_enable(&arm64_ptdump_lock_key);
> +	ptdump_walk_pgd(st, mm, NULL);
> +	static_branch_disable(&arm64_ptdump_lock_key);
> +}
Encapsulating generic ptdump_walk_pgd() between arm64 platform
specific static key enable/disable requirement does make sense.

> +
>  void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
>  {
>  	unsigned long end = ~0UL;
> @@ -311,7 +318,7 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
>  		}
>  	};
>  
> -	ptdump_walk_pgd(&st.ptdump, info->mm, NULL);
> +	arm64_ptdump_walk_pgd(&st.ptdump, info->mm);
>  }
>  
>  static void __init ptdump_initialize(void)
> @@ -353,7 +360,7 @@ bool ptdump_check_wx(void)
>  		}
>  	};
>  
> -	ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
> +	arm64_ptdump_walk_pgd(&st.ptdump, &init_mm);
>  
>  	if (st.wx_pages || st.uxn_pages) {
>  		pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found, %lu non-UXN pages found\n",


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump
  2025-07-24  5:50 ` Anshuman Khandual
@ 2025-07-24  7:20   ` Dev Jain
  0 siblings, 0 replies; 11+ messages in thread
From: Dev Jain @ 2025-07-24  7:20 UTC (permalink / raw)
  To: Anshuman Khandual, catalin.marinas, will
  Cc: quic_zhenhuah, ryan.roberts, kevin.brodsky, yangyicong,
	joey.gouly, linux-arm-kernel, linux-kernel, david, mark.rutland,
	urezki, Ash.Wilding

+ Ash Wilding. He may be interested in this patch because it concerns memory
barriers, and I always refer to his youtube video for this stuff : )


On 24/07/25 11:20 am, Anshuman Khandual wrote:
> A very small nit -
>
> arm64/mm: Enable vmalloc-huge with ptdump

Sure.

>
> On 23/07/25 9:48 PM, Dev Jain wrote:
>> Our goal is to move towards enabling vmalloc-huge by default on arm64 so
>> as to reduce TLB pressure. Therefore, we need a way to analyze the portion
>> of block mappings in vmalloc space we can get on a production system; this
>> can be done through ptdump, but currently we disable vmalloc-huge if
>> CONFIG_PTDUMP_DEBUGFS is on. The reason is that lazy freeing of kernel
>> pagetables via vmap_try_huge_pxd() may race with ptdump, so ptdump
>> may dereference a bogus address.
>>
>> To solve this, we need to synchronize ptdump_walk_pgd() with
>> pud_free_pmd_page() and pmd_free_pte_page().
>>
>> Since this race is very unlikely to happen in practice, we do not want to
>> penalize other code paths taking the init_mm mmap_lock. Therefore, we use
>> static keys. ptdump will enable the static key - upon observing that, the
>> vmalloc table freeing path will get patched in with an
>> mmap_read_lock/unlock sequence. A code comment explains in detail, how
>> a combination of acquire sematics of static_branch_enable() and the
> typo                       ^^^^^^^^ s/sematics/semantics

Thanks.

>
>> barriers in __flush_tlb_kernel_pgtable() ensures that ptdump will never
>> get a hold on the address of a freed PMD or PTE table.
>>
>> For an (almost) rigorous proof of correctness, one may look at:
>> https://lore.kernel.org/all/e1e87f16-1c48-481b-8f7c-9333ac5d13e7@arm.com/
>>
>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> Rebased on 6.16-rc7.
>>
>> mm-selfests pass. No issues were observed while parallelly running
>> test_vmalloc.sh (which stresses the vmalloc subsystem) and dumping the
>> kernel pagetable through sysfs in a loop.
>>
>> v4->v5:
>>   - The arch_vmap_pxd_supported() changes were dropped by mistake in between
>>     the revisions, add them back (Anshuman)
>>   - Rewrite cover letter, drop stray change, add arm64_ptdump_walk_pgd()
>>     helper, rename ptdump_lock_key -> arm64_ptdump_lock_key, add comment
>>     above __pmd_free_pte_page() to explain when the lock won't
>>     be taken (Anshuman)
>>   - Rewrite the comment explaining the synchronization logic (Catalin)
>>
>> v3->v4:
>>   - Lock-unlock immediately
>>   - Simplify includes
>>
>> v2->v3:
>>   - Use static key mechanism
>>
>> v1->v2:
>>   - Take lock only when CONFIG_PTDUMP_DEBUGFS is on
>>   - In case of pud_free_pmd_page(), isolate the PMD table to avoid taking
>>     the lock 512 times again via pmd_free_pte_page()
>>
>> ---
>>   arch/arm64/include/asm/ptdump.h  |  2 +
>>   arch/arm64/include/asm/vmalloc.h |  6 +--
>>   arch/arm64/mm/mmu.c              | 86 ++++++++++++++++++++++++++++++--
>>   arch/arm64/mm/ptdump.c           | 11 +++-
>>   4 files changed, 95 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h
>> index fded5358641f..baff24004459 100644
>> --- a/arch/arm64/include/asm/ptdump.h
>> +++ b/arch/arm64/include/asm/ptdump.h
>> @@ -7,6 +7,8 @@
>>   
>>   #include <linux/ptdump.h>
>>   
>> +DECLARE_STATIC_KEY_FALSE(arm64_ptdump_lock_key);
>> +
>>   #ifdef CONFIG_PTDUMP
>>   
>>   #include <linux/mm_types.h>
>> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
>> index 12f534e8f3ed..e835fd437ae0 100644
>> --- a/arch/arm64/include/asm/vmalloc.h
>> +++ b/arch/arm64/include/asm/vmalloc.h
>> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
>>   	/*
>>   	 * SW table walks can't handle removal of intermediate entries.
>>   	 */
> This above comment can now be dropped.
>
>> -	return pud_sect_supported() &&
>> -	       !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>> +	return pud_sect_supported();
>>   }

I keep missing something or the other :(
  

>>   
>>   #define arch_vmap_pmd_supported arch_vmap_pmd_supported
>>   static inline bool arch_vmap_pmd_supported(pgprot_t prot)
>>   {
>> -	/* See arch_vmap_pud_supported() */
>> -	return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>> +	return true;
>>   }
>>   
>>   #define arch_vmap_pte_range_map_size arch_vmap_pte_range_map_size
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 00ab1d648db6..49932c1dd34e 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -46,6 +46,8 @@
>>   #define NO_CONT_MAPPINGS	BIT(1)
>>   #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
>>   
>> +DEFINE_STATIC_KEY_FALSE(arm64_ptdump_lock_key);
> A short introduction about the static key's purpose might be helpful.

It may be, but I don't see comments above static key declarations anywhere else
in the codebase. My opinion is that adding a comment here will be over-documenting
stuff.

>
>> +
>>   enum pgtable_type {
>>   	TABLE_PTE,
>>   	TABLE_PMD,
>> @@ -1267,7 +1269,12 @@ int pmd_clear_huge(pmd_t *pmdp)
>>   	return 1;
>>   }
>>   
>> -int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>> +/*
>> + * If PMD has been isolated via pud_free_pmd_page(), ptdump cannot get
>> + * a hold to it, so no need to serialize with mmap_lock, hence lock
>> + * will be passed as false here. Otherwise, lock will be true.
>> + */
> This comment should be split into two and added near their respective
> call sites with and without the kernel mmap_lock requirement.

That would be better.

>
>> +static int __pmd_free_pte_page(pmd_t *pmdp, unsigned long addr, bool lock)IIUC 'lock' is the need for taking mmap_lock in init_mm. Hence changing
> that as 'mmap_lock' or 'kernel_mmap_lock' might be better which will
> also add some required context.

Makes sense.

>
>>   {
>>   	pte_t *table;
>>   	pmd_t pmd;
>> @@ -1279,13 +1286,24 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>>   		return 1;
>>   	}
>>   
>> +	/* See comment in pud_free_pmd_page for static key logic */
>>   	table = pte_offset_kernel(pmdp, addr);
>>   	pmd_clear(pmdp);
>>   	__flush_tlb_kernel_pgtable(addr);
>> +	if (static_branch_unlikely(&arm64_ptdump_lock_key) && lock) {
>> +		mmap_read_lock(&init_mm);
>> +		mmap_read_unlock(&init_mm);
>> +	}
>> +
>>   	pte_free_kernel(NULL, table);
>>   	return 1;
>>   }
>>   
>> +int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>> +{
>> +	return __pmd_free_pte_page(pmdp, addr, true);
>> +}
>> +
>>   int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>   {
>>   	pmd_t *table;
>> @@ -1301,16 +1319,76 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>   	}
>>   
>>   	table = pmd_offset(pudp, addr);
>> +
>> +	/*
>> +	 * Our objective is to prevent ptdump from reading a PMD table which has
>> +	 * been freed.  Assume that ptdump_walk_pgd() (call this thread T1)
>> +	 * executes completely on CPU1 and pud_free_pmd_page() (call this thread
>> +	 * T2) executes completely on CPU2. Let the region sandwiched by the
>> +	 * mmap_write_lock/unlock in T1 be called CS (the critical section).
>> +	 *
>> +	 * Claim: The CS of T1 will never operate on a freed PMD table.
>> +	 *
>> +	 * Proof:
>> +	 *
>> +	 * Case 1: The static branch is visible to T2.
>> +	 *
>> +	 * Case 1 (a): T1 acquires the lock before T2 can.
>> +	 * T2 will block until T1 drops the lock, so pmd_free() will only be
>> +	 * executed after T1 exits CS.
> T2 blocks here because ptdump_walk_pgd() takes mmap_write_lock(). This needs
> to be mentioned.

I have declared the case as "T1 acquires the lock before T2 can". Then I
write "T2 will block until T1 drops the lock". This exactly means what you
mention above.

>
>> +	 *> +	 * Case 1 (b): T2 acquires the lock before T1 can.
>> +	 * The sequence of barriers issued in __flush_tlb_kernel_pgtable()
>> +	 * ensures that an empty PUD (via pud_clear()) is visible to T1 before
>> +	 * T1 can enter CS, therefore it is impossible for the CS to get hold
>> +	 * of the address of the isolated PMD table.
> Agreed. mmap_write_lock() from ptdump_walk_pgd() will proceed after T2 calls
> mmap_read_unlock(). So protection from race comes from the fact that T1 can
> never get hold of isolated PMD table not because of the mmap_lock.

Thanks for confirmation.

>
>> +	 *
>> +	 * Case 2: The static branch is not visible to T2.
>> +	 *
>> +	 * Since static_branch_enable() (via dmb(ish)) and mmap_write_lock()
>> +	 * have acquire semantics, it is guaranteed that the static branch
>> +	 * will be visible to all CPUs before T1 can enter CS. The static
>> +	 * branch not being visible to T2 therefore guarantees that T1 has
>> +	 * not yet entered CS .... (i)
>> +	 * The sequence of barriers via __flush_tlb_kernel_pgtable() in T2
>> +	 * implies that if the invisibility of the static branch has been
>> +	 * observed by T2 (i.e static_branch_unlikely() is observed as false),
>> +	 * then all CPUs will have observed an empty PUD ... (ii)
>> +	 * Combining (i) and (ii), we conclude that T1 observes an empty PUD
>> +	 * before entering CS => it is impossible for the CS to get hold of
>> +	 * the address of the isolated PMD table. Q.E.D
>> +	 *> +	 * We have proven that the claim is true on the assumption that
>> +	 * there is no context switch for T1 and T2. Note that the reasoning
>> +	 * of the proof uses barriers operating on the inner shareable domain,
>> +	 * which means that they will affect all CPUs, and also a context switch
>> +	 * will insert extra barriers into the code paths => the claim will
>> +	 * stand true even if we drop the assumption.
> Do these all rest on the fact that static_branch_enable() and mmap_write_lock()
> have acquire semantics available via a particular class of barrier instructions

Yes.

> ? In which case should the generation of those instructions via these functions
> be asserted for the above T1/T2 guarantee to hold ? Just curious.

Nice suggestion but how do we do that :) I feel it will be being over-cautious.

Although one can check by tracing the code paths that static_branch_enable()
and mmap_write_lock() will emit dmb(ish), one does not need to bother about
looking at the actual implementations. Let us ask ourselves what these
two functions must guarantee us on any operating system.

In the below sequence, let CSI stand for critical section instruction.

Take lock
CSI-1
CSI-2
Release lock

Since mmap_write_lock() is about locking an rw semaphore for protecting a
critical section, the computer science definition of a lock must ensure that
CSI-1 does not get reordered before taking the lock, CSI-2 does not get reordered
after releasing the lock, otherwise the critical section may be expanded beyond
the locks, thus contradicting the definition of a critical section. Hence any
implementation must have acquire and release semantics for locking and unlocking.
(Note that we do not care about instructions operating on registers since registers
are local to the CPU - we need to synchronize only memory loads and stores since
memory is shared). A similar reasoning can be used when one CPU does static_branch_enable()
and others observe it with static_branch_unlikely(). In the actual Linux implementation,
these converge onto __smp_mb_after_atomic() -> the atomic cmpxchg must be followed by
an smp_mb(), so that no CS instruction is executed before the doing the memory store
corresponding to acquiring the lock.


>
>> +	 *
>> +	 * It is also worth reasoning whether something can go wrong via
>> +	 * pud_free_pmd_page() -> __pmd_free_pte_page(), since the latter
>> +	 * will be called locklessly on this code path.
>> +	 *
>> +	 * For Case 1 (a), T2 will block until CS is finished, so we are safe.
>> +	 * For Case 1 (b) and Case 2, the PMD table will be isolated before
>> +	 * T1 can enter CS, therefore it is safe for T2 to operate on the
>> +	 * PMD table locklessly.
>> +	 */
> Although looks reasonable - the above comment block is going to be difficult to
> process after time has passed :) Just wondering could it some how be simplified ?

IMHO it is easy to process, but I have written that so obviously I am biased : )
I can only comment if you can give a suggestion of your own.

>
>> +	pud_clear(pudp);
>> +	__flush_tlb_kernel_pgtable(addr);
>> +	if (static_branch_unlikely(&arm64_ptdump_lock_key)) {
>> +		mmap_read_lock(&init_mm);
>> +		mmap_read_unlock(&init_mm);
>> +	}
>> +
>>   	pmdp = table;
>>   	next = addr;
>>   	end = addr + PUD_SIZE;
>>   	do {
>>   		if (pmd_present(pmdp_get(pmdp)))
>> -			pmd_free_pte_page(pmdp, next);
>> +			__pmd_free_pte_page(pmdp, next, false);
>>   	} while (pmdp++, next += PMD_SIZE, next != end);
>>   
>> -	pud_clear(pudp);
>> -	__flush_tlb_kernel_pgtable(addr);
>>   	pmd_free(NULL, table);
>>   	return 1;
>>   }
>> diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
>> index 421a5de806c6..65335c7ba482 100644
>> --- a/arch/arm64/mm/ptdump.c
>> +++ b/arch/arm64/mm/ptdump.c
>> @@ -283,6 +283,13 @@ void note_page_flush(struct ptdump_state *pt_st)
>>   	note_page(pt_st, 0, -1, pte_val(pte_zero));
>>   }
>>   
>> +static void arm64_ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm)
>> +{
>> +	static_branch_enable(&arm64_ptdump_lock_key);
>> +	ptdump_walk_pgd(st, mm, NULL);
>> +	static_branch_disable(&arm64_ptdump_lock_key);
>> +}
> Encapsulating generic ptdump_walk_pgd() between arm64 platform
> specific static key enable/disable requirement does make sense.
>
>> +
>>   void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
>>   {
>>   	unsigned long end = ~0UL;
>> @@ -311,7 +318,7 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
>>   		}
>>   	};
>>   
>> -	ptdump_walk_pgd(&st.ptdump, info->mm, NULL);
>> +	arm64_ptdump_walk_pgd(&st.ptdump, info->mm);
>>   }
>>   
>>   static void __init ptdump_initialize(void)
>> @@ -353,7 +360,7 @@ bool ptdump_check_wx(void)
>>   		}
>>   	};
>>   
>> -	ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
>> +	arm64_ptdump_walk_pgd(&st.ptdump, &init_mm);
>>   
>>   	if (st.wx_pages || st.uxn_pages) {
>>   		pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found, %lu non-UXN pages found\n",


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump
  2025-07-23 16:18 [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump Dev Jain
  2025-07-24  5:50 ` Anshuman Khandual
@ 2025-07-30 17:00 ` Catalin Marinas
  2025-07-30 18:29   ` Ryan Roberts
  2025-07-31  7:12   ` Dev Jain
  1 sibling, 2 replies; 11+ messages in thread
From: Catalin Marinas @ 2025-07-30 17:00 UTC (permalink / raw)
  To: Dev Jain
  Cc: will, anshuman.khandual, quic_zhenhuah, ryan.roberts,
	kevin.brodsky, yangyicong, joey.gouly, linux-arm-kernel,
	linux-kernel, david, mark.rutland, urezki

On Wed, Jul 23, 2025 at 09:48:27PM +0530, Dev Jain wrote:
> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
> index 12f534e8f3ed..e835fd437ae0 100644
> --- a/arch/arm64/include/asm/vmalloc.h
> +++ b/arch/arm64/include/asm/vmalloc.h
> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
>  	/*
>  	 * SW table walks can't handle removal of intermediate entries.
>  	 */
> -	return pud_sect_supported() &&
> -	       !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
> +	return pud_sect_supported();

Does the "SW table walks..." comment still make sense?

> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 00ab1d648db6..49932c1dd34e 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
[...]
> @@ -1301,16 +1319,76 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>  	}
>  
>  	table = pmd_offset(pudp, addr);
> +
> +	/*
> +	 * Our objective is to prevent ptdump from reading a PMD table which has
> +	 * been freed.  Assume that ptdump_walk_pgd() (call this thread T1)
> +	 * executes completely on CPU1 and pud_free_pmd_page() (call this thread
> +	 * T2) executes completely on CPU2. Let the region sandwiched by the

I think "executes completely on CPU*" is confusing. Just talk about
threads as they can be migrated between CPUs and the memory model is
preserved.

> +	 * mmap_write_lock/unlock in T1 be called CS (the critical section).
> +	 *
> +	 * Claim: The CS of T1 will never operate on a freed PMD table.
> +	 *
> +	 * Proof:
> +	 *
> +	 * Case 1: The static branch is visible to T2.
> +	 *
> +	 * Case 1 (a): T1 acquires the lock before T2 can.
> +	 * T2 will block until T1 drops the lock, so pmd_free() will only be
> +	 * executed after T1 exits CS.

This assumes that there is some ordering between unlock and pmd_free()
(e.g. some poisoning of the old page). The unlock only gives us release
semantics, not acquire. It just happens that we have an atomic
dec-and-test down the __free_pages() path but I'm not convinced we
should rely on it unless free_pages() has clear semantics on ordering
related to prior memory writes.

> +	 *
> +	 * Case 1 (b): T2 acquires the lock before T1 can.
> +	 * The sequence of barriers issued in __flush_tlb_kernel_pgtable()
> +	 * ensures that an empty PUD (via pud_clear()) is visible to T1 before
> +	 * T1 can enter CS, therefore it is impossible for the CS to get hold
> +	 * of the address of the isolated PMD table.

Is this the effect of the barriers in the TLB flushing or the release
semantics of the unlock?

> +	 *
> +	 * Case 2: The static branch is not visible to T2.
> +	 *
> +	 * Since static_branch_enable() (via dmb(ish)) and mmap_write_lock()
> +	 * have acquire semantics, it is guaranteed that the static branch

Does static_branch_disable() have release semantics? I think it does via
some atomic_cmpxchg() but it's worth spelling it out.

> +	 * will be visible to all CPUs before T1 can enter CS. The static

As I mentioned on a previous version, this visibility is not absolute.
You could say that it will be observed by other CPUs before they observe
the write_lock.

> +	 * branch not being visible to T2 therefore guarantees that T1 has
> +	 * not yet entered CS .... (i)
> +	 * The sequence of barriers via __flush_tlb_kernel_pgtable() in T2
> +	 * implies that if the invisibility of the static branch has been
> +	 * observed by T2 (i.e static_branch_unlikely() is observed as false),
> +	 * then all CPUs will have observed an empty PUD ... (ii)
> +	 * Combining (i) and (ii), we conclude that T1 observes an empty PUD
> +	 * before entering CS => it is impossible for the CS to get hold of
> +	 * the address of the isolated PMD table. Q.E.D

OK, I nearly got lost. That's not a straightforward memory ordering with
as we have instruction updates with ISB as part of the TLB flushing. I
concluded last time I looked that this branch patching part works
because we also have kick_all_cpus_sync() as part of the static branch
update.

Stepping back to a simpler model, let's say that the static key is a
variable. I wrote this quick test for herd7 and the Linux kernel memory
model (fairly easy to play with):

-------------------------------------8<---------------------------------------
C pte_free_ptdump

(*
 * $ cd tools/memory-model
 * $ herd7 -conf linux-kernel.cfg path/to/pte_free_ptdump.litmus
 *)

{
	pmd = 1;				(* allocated pmd *)
	pte_page = 1;				(* valid pte page pointed at by the pmd *)
}

// pmd_free_pte_page()
P0(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
{
	WRITE_ONCE(*pmd, 0);			// pmd_clear()
	smp_mb();
	if (READ_ONCE(*ptdump_lock_key)) {	// static_branch() approximation
		spin_lock(init_mm);		// mmap_read_lock()
		spin_unlock(init_mm);		// mmap_read_unlock()
	}
	smp_mb();
	WRITE_ONCE(*pte_page, 0);		// pte_free_kernel()
}

// ptdump_walk_pgd()
P1(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
{
	int val;
	int page;

	WRITE_ONCE(*ptdump_lock_key, 1);	// static_branch_enable()
	smp_mb();
	spin_lock(init_mm);			// mmap_write_lock()
	val = READ_ONCE(*pmd);
	page = READ_ONCE(*pte_page);		// freed pte page?
	spin_unlock(init_mm);			// mmap_write_unlock()
	smp_mb();
	WRITE_ONCE(*ptdump_lock_key, 0);	// static_branch_disable()
}

exists(1:val=1 /\ 1:page=0)
-------------------------------------8<---------------------------------------

I sprinkled some necessary smp_mb() but in most cases we have
release/acquire semantics. It does show that we need a barrier before
the page freeing. We also need to acquire for enabling the static key
and release for disabling it.

Next step is to assess that replacing the static key variable read/write
with code patching preserves the model.

-- 
Catalin


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump
  2025-07-30 17:00 ` Catalin Marinas
@ 2025-07-30 18:29   ` Ryan Roberts
  2025-07-31  4:30     ` Dev Jain
  2025-07-31  7:12   ` Dev Jain
  1 sibling, 1 reply; 11+ messages in thread
From: Ryan Roberts @ 2025-07-30 18:29 UTC (permalink / raw)
  To: Catalin Marinas, Dev Jain
  Cc: will, anshuman.khandual, quic_zhenhuah, kevin.brodsky, yangyicong,
	joey.gouly, linux-arm-kernel, linux-kernel, david, mark.rutland,
	urezki

On 30/07/2025 18:00, Catalin Marinas wrote:
> On Wed, Jul 23, 2025 at 09:48:27PM +0530, Dev Jain wrote:

[...]

> 
>> +	 * mmap_write_lock/unlock in T1 be called CS (the critical section).
>> +	 *
>> +	 * Claim: The CS of T1 will never operate on a freed PMD table.
>> +	 *
>> +	 * Proof:
>> +	 *
>> +	 * Case 1: The static branch is visible to T2.
>> +	 *
>> +	 * Case 1 (a): T1 acquires the lock before T2 can.
>> +	 * T2 will block until T1 drops the lock, so pmd_free() will only be
>> +	 * executed after T1 exits CS.
> 
> This assumes that there is some ordering between unlock and pmd_free()
> (e.g. some poisoning of the old page). The unlock only gives us release
> semantics, not acquire. It just happens that we have an atomic
> dec-and-test down the __free_pages() path but I'm not convinced we
> should rely on it unless free_pages() has clear semantics on ordering
> related to prior memory writes.

I can understand how pmd_free() could be re-ordered before the unlock, but
surely it can't be reorded before the lock? I need to go unlearn everything I
thought I understood about locking if that's the case...



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump
  2025-07-30 18:29   ` Ryan Roberts
@ 2025-07-31  4:30     ` Dev Jain
  2025-07-31 11:38       ` Catalin Marinas
  0 siblings, 1 reply; 11+ messages in thread
From: Dev Jain @ 2025-07-31  4:30 UTC (permalink / raw)
  To: Ryan Roberts, Catalin Marinas
  Cc: will, anshuman.khandual, quic_zhenhuah, kevin.brodsky, yangyicong,
	joey.gouly, linux-arm-kernel, linux-kernel, david, mark.rutland,
	urezki


On 30/07/25 11:59 pm, Ryan Roberts wrote:
> On 30/07/2025 18:00, Catalin Marinas wrote:
>> On Wed, Jul 23, 2025 at 09:48:27PM +0530, Dev Jain wrote:
> [...]
>
>>> +	 * mmap_write_lock/unlock in T1 be called CS (the critical section).
>>> +	 *
>>> +	 * Claim: The CS of T1 will never operate on a freed PMD table.
>>> +	 *
>>> +	 * Proof:
>>> +	 *
>>> +	 * Case 1: The static branch is visible to T2.
>>> +	 *
>>> +	 * Case 1 (a): T1 acquires the lock before T2 can.
>>> +	 * T2 will block until T1 drops the lock, so pmd_free() will only be
>>> +	 * executed after T1 exits CS.
>> This assumes that there is some ordering between unlock and pmd_free()
>> (e.g. some poisoning of the old page). The unlock only gives us release
>> semantics, not acquire. It just happens that we have an atomic
>> dec-and-test down the __free_pages() path but I'm not convinced we
>> should rely on it unless free_pages() has clear semantics on ordering
>> related to prior memory writes.
> I can understand how pmd_free() could be re-ordered before the unlock, but
> surely it can't be reorded before the lock? I need to go unlearn everything I
> thought I understood about locking if that's the case...
>

You are correct, what Catalin is saying is that my reasoning has a hole.
There is no obvious ordering between unlock and free(), but

mmap_write_unlock() will happen before mmap_read_lock() ... (i)
mmap_read_lock() will happen before pmd_free() ... (ii)

which lets us conclude that mmap_write_unlock() will happen before pmd_free().

A more rigorous way to write this would be (for Case 1):

T2								T1

pmd_clear()							5. cmpxchg(enable static branch)
1. while (!cmpxchg(check if lock not taken in write mode));	6. smp_mb()
2. smp_mb();							7. while (!cmpxchg(check if lock not taken))
3. smp_mb();							8. smp_mb()
4. cmpxchg(release lock)					CS instructions
pmd_free()							9. smp_mb()
								10. cmpxchg (release lock)

where: (1,2) = mmap_read_lock(), (3,4) = mmap_read_unlock(), (5,6) = static_branch_enable(),
(7,8) = mmap_write_lock(), (9, 10) = mmap_write_unlock().

For Case 1(a), 7 succeeds before 1 can. So, 1 will block until 10 completes, so 10 is
observed before 1, and 1 is observed before pmd_free() due to the barriers. The associativity follows.



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump
  2025-07-30 17:00 ` Catalin Marinas
  2025-07-30 18:29   ` Ryan Roberts
@ 2025-07-31  7:12   ` Dev Jain
  2025-07-31 17:06     ` Catalin Marinas
  1 sibling, 1 reply; 11+ messages in thread
From: Dev Jain @ 2025-07-31  7:12 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: will, anshuman.khandual, quic_zhenhuah, ryan.roberts,
	kevin.brodsky, yangyicong, joey.gouly, linux-arm-kernel,
	linux-kernel, david, mark.rutland, urezki


On 30/07/25 10:30 pm, Catalin Marinas wrote:
> On Wed, Jul 23, 2025 at 09:48:27PM +0530, Dev Jain wrote:
>> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
>> index 12f534e8f3ed..e835fd437ae0 100644
>> --- a/arch/arm64/include/asm/vmalloc.h
>> +++ b/arch/arm64/include/asm/vmalloc.h
>> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
>>   	/*
>>   	 * SW table walks can't handle removal of intermediate entries.
>>   	 */
>> -	return pud_sect_supported() &&
>> -	       !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>> +	return pud_sect_supported();
> Does the "SW table walks..." comment still make sense?

Anshuman pointed that out, I missed dropping that.

>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 00ab1d648db6..49932c1dd34e 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
> [...]
>> @@ -1301,16 +1319,76 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>   	}
>>   
>>   	table = pmd_offset(pudp, addr);
>> +
>> +	/*
>> +	 * Our objective is to prevent ptdump from reading a PMD table which has
>> +	 * been freed.  Assume that ptdump_walk_pgd() (call this thread T1)
>> +	 * executes completely on CPU1 and pud_free_pmd_page() (call this thread
>> +	 * T2) executes completely on CPU2. Let the region sandwiched by the
> I think "executes completely on CPU*" is confusing. Just talk about
> threads as they can be migrated between CPUs and the memory model is
> preserved.

Okay.

>
>> +	 * mmap_write_lock/unlock in T1 be called CS (the critical section).
>> +	 *
>> +	 * Claim: The CS of T1 will never operate on a freed PMD table.
>> +	 *
>> +	 * Proof:
>> +	 *
>> +	 * Case 1: The static branch is visible to T2.
>> +	 *
>> +	 * Case 1 (a): T1 acquires the lock before T2 can.
>> +	 * T2 will block until T1 drops the lock, so pmd_free() will only be
>> +	 * executed after T1 exits CS.
> This assumes that there is some ordering between unlock and pmd_free()
> (e.g. some poisoning of the old page). The unlock only gives us release
> semantics, not acquire. It just happens that we have an atomic
> dec-and-test down the __free_pages() path but I'm not convinced we
> should rely on it unless free_pages() has clear semantics on ordering
> related to prior memory writes.

I replied in the other mail to Ryan, hopefully that logic is satisfactory.

>
>> +	 *
>> +	 * Case 1 (b): T2 acquires the lock before T1 can.
>> +	 * The sequence of barriers issued in __flush_tlb_kernel_pgtable()
>> +	 * ensures that an empty PUD (via pud_clear()) is visible to T1 before
>> +	 * T1 can enter CS, therefore it is impossible for the CS to get hold
>> +	 * of the address of the isolated PMD table.
> Is this the effect of the barriers in the TLB flushing or the release
> semantics of the unlock?

Yeah you are right :) the release semantics of unlock guarantee that
an empty PUD will be observed before unlock, and the unlock
will have to happen before T1 can enter CS - the associativity leads us
to the conclusion.

>
>> +	 *
>> +	 * Case 2: The static branch is not visible to T2.
>> +	 *
>> +	 * Since static_branch_enable() (via dmb(ish)) and mmap_write_lock()
>> +	 * have acquire semantics, it is guaranteed that the static branch
> Does static_branch_disable() have release semantics? I think it does via
> some atomic_cmpxchg() but it's worth spelling it out.

Yes it *should* have, but this proof is already getting complicated so I thought,
let's not mention anything which is not required in the reasoning of the proof :)

>
>> +	 * will be visible to all CPUs before T1 can enter CS. The static
> As I mentioned on a previous version, this visibility is not absolute.
> You could say that it will be observed by other CPUs before they observe
> the write_lock.

Hmm. I probably am bad at explaining all of this in English :)

>
>> +	 * branch not being visible to T2 therefore guarantees that T1 has
>> +	 * not yet entered CS .... (i)
>> +	 * The sequence of barriers via __flush_tlb_kernel_pgtable() in T2
>> +	 * implies that if the invisibility of the static branch has been
>> +	 * observed by T2 (i.e static_branch_unlikely() is observed as false),
>> +	 * then all CPUs will have observed an empty PUD ... (ii)
>> +	 * Combining (i) and (ii), we conclude that T1 observes an empty PUD
>> +	 * before entering CS => it is impossible for the CS to get hold of
>> +	 * the address of the isolated PMD table. Q.E.D
> OK, I nearly got lost. That's not a straightforward memory ordering with
> as we have instruction updates with ISB as part of the TLB flushing. I
> concluded last time I looked that this branch patching part works
> because we also have kick_all_cpus_sync() as part of the static branch
> update.
>
> Stepping back to a simpler model, let's say that the static key is a
> variable. I wrote this quick test for herd7 and the Linux kernel memory
> model (fairly easy to play with):
>
> -------------------------------------8<---------------------------------------
> C pte_free_ptdump
>
> (*
>   * $ cd tools/memory-model
>   * $ herd7 -conf linux-kernel.cfg path/to/pte_free_ptdump.litmus
>   *)
>
> {
> 	pmd = 1;				(* allocated pmd *)
> 	pte_page = 1;				(* valid pte page pointed at by the pmd *)
> }
>
> // pmd_free_pte_page()
> P0(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
> {
> 	WRITE_ONCE(*pmd, 0);			// pmd_clear()
> 	smp_mb();
> 	if (READ_ONCE(*ptdump_lock_key)) {	// static_branch() approximation
> 		spin_lock(init_mm);		// mmap_read_lock()
> 		spin_unlock(init_mm);		// mmap_read_unlock()
> 	}
> 	smp_mb();
> 	WRITE_ONCE(*pte_page, 0);		// pte_free_kernel()
> }
>
> // ptdump_walk_pgd()
> P1(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
> {
> 	int val;
> 	int page;
>
> 	WRITE_ONCE(*ptdump_lock_key, 1);	// static_branch_enable()
> 	smp_mb();
> 	spin_lock(init_mm);			// mmap_write_lock()
> 	val = READ_ONCE(*pmd);
> 	page = READ_ONCE(*pte_page);		// freed pte page?
> 	spin_unlock(init_mm);			// mmap_write_unlock()
> 	smp_mb();
> 	WRITE_ONCE(*ptdump_lock_key, 0);	// static_branch_disable()
> }
>
> exists(1:val=1 /\ 1:page=0)
> -------------------------------------8<---------------------------------------
>
> I sprinkled some necessary smp_mb() but in most cases we have
> release/acquire semantics. It does show that we need a barrier before
> the page freeing. We also need to acquire for enabling the static key
> and release for disabling it.
>
> Next step is to assess that replacing the static key variable read/write
> with code patching preserves the model.

Thanks for the litmus test!
I ran it and I get the same result. But I still cannot figure out where is the hole
in my proof. I think you raise an objection to Case 2:

T1:						T2:

WRITE_ONCE(*pmd, 0) : t1			WRITE_ONCE(*ptdump_lock_key, 1): t3
smp_mb() // flush_tlb_kernel_pgtable		cmpxchg(acquire write lock)
						smp_mb()
READ_ONCE(*ptdump_lock_key) = 0 : t2		val = READ_ONCE(*pmd) : t4
WRITE_ONCE(*pte_page, 0)			page = READ_ONCE(*pte_page)
						smp_mb()
						cmpxchg(release write lock)


Let t_i be the global timestamps of the execution of the labelled instructions
(note that each labelled instruction is atomic so assigning a timestamp makes sense).

The fact that we read ptdump_lock_key as 0 implies that
t2 < t3 ... (i)
Now,
t1 < t2 because of barrier (ii)
t3 < t4 because of barrier (iii)

So we conclude that t1 < t4, so val will be observed as 0.

Alright I am horribly confused now :)


>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump
  2025-07-31  4:30     ` Dev Jain
@ 2025-07-31 11:38       ` Catalin Marinas
  0 siblings, 0 replies; 11+ messages in thread
From: Catalin Marinas @ 2025-07-31 11:38 UTC (permalink / raw)
  To: Dev Jain
  Cc: Ryan Roberts, will, anshuman.khandual, quic_zhenhuah,
	kevin.brodsky, yangyicong, joey.gouly, linux-arm-kernel,
	linux-kernel, david, mark.rutland, urezki

On Thu, Jul 31, 2025 at 10:00:15AM +0530, Dev Jain wrote:
> On 30/07/25 11:59 pm, Ryan Roberts wrote:
> > On 30/07/2025 18:00, Catalin Marinas wrote:
> > > On Wed, Jul 23, 2025 at 09:48:27PM +0530, Dev Jain wrote:
> > > > +	 * mmap_write_lock/unlock in T1 be called CS (the critical section).
> > > > +	 *
> > > > +	 * Claim: The CS of T1 will never operate on a freed PMD table.
> > > > +	 *
> > > > +	 * Proof:
> > > > +	 *
> > > > +	 * Case 1: The static branch is visible to T2.
> > > > +	 *
> > > > +	 * Case 1 (a): T1 acquires the lock before T2 can.
> > > > +	 * T2 will block until T1 drops the lock, so pmd_free() will only be
> > > > +	 * executed after T1 exits CS.
> > > 
> > > This assumes that there is some ordering between unlock and pmd_free()
> > > (e.g. some poisoning of the old page). The unlock only gives us release
> > > semantics, not acquire. It just happens that we have an atomic
> > > dec-and-test down the __free_pages() path but I'm not convinced we
> > > should rely on it unless free_pages() has clear semantics on ordering
> > > related to prior memory writes.
> > 
> > I can understand how pmd_free() could be re-ordered before the unlock, but
> > surely it can't be reorded before the lock? I need to go unlearn everything I
> > thought I understood about locking if that's the case...

Indeed, it can't be reordered before the lock as it has acquire
semantics.

> You are correct, what Catalin is saying is that my reasoning has a hole.
> There is no obvious ordering between unlock and free(), but
> 
> mmap_write_unlock() will happen before mmap_read_lock() ... (i)
> mmap_read_lock() will happen before pmd_free() ... (ii)
> 
> which lets us conclude that mmap_write_unlock() will happen before pmd_free().

Yes, in this sub-case, mmap_write_unlock() on T1 will happen before
pmd_free() on T2 if T2 waits on the lock.

-- 
Catalin


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump
  2025-07-31  7:12   ` Dev Jain
@ 2025-07-31 17:06     ` Catalin Marinas
  2025-08-01 12:15       ` Dev Jain
  0 siblings, 1 reply; 11+ messages in thread
From: Catalin Marinas @ 2025-07-31 17:06 UTC (permalink / raw)
  To: Dev Jain
  Cc: will, anshuman.khandual, quic_zhenhuah, ryan.roberts,
	kevin.brodsky, yangyicong, joey.gouly, linux-arm-kernel,
	linux-kernel, david, mark.rutland, urezki

On Thu, Jul 31, 2025 at 12:42:21PM +0530, Dev Jain wrote:
> On 30/07/25 10:30 pm, Catalin Marinas wrote:
> > On Wed, Jul 23, 2025 at 09:48:27PM +0530, Dev Jain wrote:
> > > +	 * Case 2: The static branch is not visible to T2.
> > > +	 *
> > > +	 * Since static_branch_enable() (via dmb(ish)) and mmap_write_lock()
> > > +	 * have acquire semantics, it is guaranteed that the static branch
> > > +	 * will be visible to all CPUs before T1 can enter CS. The static
> > 
> > Does static_branch_disable() have release semantics? I think it does via
> > some atomic_cmpxchg() but it's worth spelling it out.
> 
> Yes it *should* have, but this proof is already getting complicated so I thought,
> let's not mention anything which is not required in the reasoning of the proof :)

The static branch observability relative to the mmap_lock is essential,
both when enabling and disabling the static key (i.e. you don't want the
static branch to be seen disabled while T1 is in the critical section).
That's why it's worth mentioning.

On the static_branch_enable() path, we have a kick_all_cpus_sync(), so
the T1 update of mmap lock will become visible after the branch update
(of course, also subject to correct ordering on T2, see more at the end
of the email).

On the static_branch_disable() path, we need it to have release
semantics _prior_ to disabling the key. We get this via the
atomic_cmpxchg() in static_key_disable_cpuslocked().

So I think it's all good here from the T1 perspective, just capture this
in a comment so that we remember in a couple of months time.

> > As I mentioned on a previous version, this visibility is not absolute.
> > You could say that it will be observed by other CPUs before they observe
> > the write_lock.
> 
> Hmm. I probably am bad at explaining all of this in English :)

It's the other way around - I don't think English is appropriate for
explaining this ;).

> > > +	 * branch not being visible to T2 therefore guarantees that T1 has
> > > +	 * not yet entered CS .... (i)
> > > +	 * The sequence of barriers via __flush_tlb_kernel_pgtable() in T2
> > > +	 * implies that if the invisibility of the static branch has been
> > > +	 * observed by T2 (i.e static_branch_unlikely() is observed as false),
> > > +	 * then all CPUs will have observed an empty PUD ... (ii)
> > > +	 * Combining (i) and (ii), we conclude that T1 observes an empty PUD
> > > +	 * before entering CS => it is impossible for the CS to get hold of
> > > +	 * the address of the isolated PMD table. Q.E.D
> > 
> > OK, I nearly got lost. That's not a straightforward memory ordering with
> > as we have instruction updates with ISB as part of the TLB flushing. I
> > concluded last time I looked that this branch patching part works
> > because we also have kick_all_cpus_sync() as part of the static branch
> > update.
> > 
> > Stepping back to a simpler model, let's say that the static key is a
> > variable. I wrote this quick test for herd7 and the Linux kernel memory
> > model (fairly easy to play with):
> > 
> > -------------------------------------8<---------------------------------------
> > C pte_free_ptdump
> > 
> > (*
> >   * $ cd tools/memory-model
> >   * $ herd7 -conf linux-kernel.cfg path/to/pte_free_ptdump.litmus
> >   *)
> > 
> > {
> > 	pmd = 1;				(* allocated pmd *)
> > 	pte_page = 1;				(* valid pte page pointed at by the pmd *)
> > }
> > 
> > // pmd_free_pte_page()
> > P0(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
> > {
> > 	WRITE_ONCE(*pmd, 0);			// pmd_clear()
> > 	smp_mb();
> > 	if (READ_ONCE(*ptdump_lock_key)) {	// static_branch() approximation
> > 		spin_lock(init_mm);		// mmap_read_lock()
> > 		spin_unlock(init_mm);		// mmap_read_unlock()
> > 	}
> > 	smp_mb();
> > 	WRITE_ONCE(*pte_page, 0);		// pte_free_kernel()
> > }
> > 
> > // ptdump_walk_pgd()
> > P1(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
> > {
> > 	int val;
> > 	int page;
> > 
> > 	WRITE_ONCE(*ptdump_lock_key, 1);	// static_branch_enable()
> > 	smp_mb();
> > 	spin_lock(init_mm);			// mmap_write_lock()
> > 	val = READ_ONCE(*pmd);
> > 	page = READ_ONCE(*pte_page);		// freed pte page?
> > 	spin_unlock(init_mm);			// mmap_write_unlock()
> > 	smp_mb();
> > 	WRITE_ONCE(*ptdump_lock_key, 0);	// static_branch_disable()
> > }
> > 
> > exists(1:val=1 /\ 1:page=0)
> > -------------------------------------8<---------------------------------------
> > 
> > I sprinkled some necessary smp_mb() but in most cases we have
> > release/acquire semantics. It does show that we need a barrier before
> > the page freeing. We also need to acquire for enabling the static key
> > and release for disabling it.

The smp_mb() before the write to *pte_page in the model above is only
needed on the static key disabled path. As Ryan pointed out, on the
other path we take the mmap_lock and we have acquire semantics.

> > Next step is to assess that replacing the static key variable read/write
> > with code patching preserves the model.
> 
> Thanks for the litmus test!
> I ran it and I get the same result. But I still cannot figure out where is the hole
> in my proof.

TBH, I couldn't fully follow the proof. The arm memory model is based on
relations between accesses (e.g. read-from, coherence order) rather than
timestamps. They may be equivalent but that's a whole new (and complex)
proof to write.

So I'm happy with a simple English explanation and a formal model along
the lines of the one I posted as the equivalent of a proof.

> I think you raise an objection to Case 2:

Let's follow your approach, you are missing some timestamps, adding them
below:

> T1:						T2:
> 
> WRITE_ONCE(*pmd, 0) : t1			WRITE_ONCE(*ptdump_lock_key, 1) : t3
> smp_mb() // flush_tlb_kernel_pgtable		cmpxchg(acquire write lock)
						^^^ : t3'
> 						smp_mb()
> READ_ONCE(*ptdump_lock_key) = 0 : t2		val = READ_ONCE(*pmd) : t4
> WRITE_ONCE(*pte_page, 0)			page = READ_ONCE(*pte_page)
  ^^^ : t2'					^^^ : t4'
> 						smp_mb()
> 						cmpxchg(release write lock)
						^^^ : t5
						smp_mb()
						WRITE_ONCE(*ptdump_lock_key, 0): t6

> Let t_i be the global timestamps of the execution of the labelled instructions
> (note that each labelled instruction is atomic so assigning a timestamp makes sense).
> 
> The fact that we read ptdump_lock_key as 0 implies that
> t2 < t3 ... (i)

Or that t2 > t6, a plausible scenario in the litmus test above.

> Now,
> t1 < t2 because of barrier (ii)
> t3 < t4 because of barrier (iii)
> 
> So we conclude that t1 < t4, so val will be observed as 0.

Without a barrier between t2 and t2', there's no guarantee that t2' >
t2. If you remove the last smp_mb() on P0 in my litmus test, you'll see
it failing (run herd7 with something like "-view evince -show prop" for
a visualisation of the transitions and relations; you need graphviz
installed as well).

The way it fails is that t4 is seeing the pmd=1 prior to t1, t2 is
reading from the t6 write and t4' is reading pte_page=0 from t2' (since
t2' < t2 is possible without a barrier).

A control dependency would work as well without a barrier, i.e.:

	if (READ_ONCE(*ptdump_lock_key)) {
		mmap_lock();
		mmap_unlock();
		WRITE_ONCE(*pte_page, 0);
	} else {
		WRITE_ONCE(*pte_page, 0);
	}

but the compiler is probably free to only issue a single WRITE_ONCE()
irrespective of the ptdump_lock_key check.

Of course, using smp_load_acquire(ptdump_lock_key) would also work.

However, things get fuzzier as we don't have a classic load from the
ptdump_lock_key but rather a patched instruction. We need to guarantee
that t2' is issued after the t2 branch when the instruction is patched.
The kick_all_cpus_sync() on the static key disable path doesn't help us
since P0 (T2 in your description) may see the patched branch/nop and go
straight to the WRITE_ONCE(*pte_page). Not sure what barrier helps here
(after more sleep, I may have a better idea tomorrow).

-- 
Catalin

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump
  2025-07-31 17:06     ` Catalin Marinas
@ 2025-08-01 12:15       ` Dev Jain
  2025-08-01 15:48         ` Catalin Marinas
  0 siblings, 1 reply; 11+ messages in thread
From: Dev Jain @ 2025-08-01 12:15 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: will, anshuman.khandual, quic_zhenhuah, ryan.roberts,
	kevin.brodsky, yangyicong, joey.gouly, linux-arm-kernel,
	linux-kernel, david, mark.rutland, urezki


On 31/07/25 10:36 pm, Catalin Marinas wrote:
> On Thu, Jul 31, 2025 at 12:42:21PM +0530, Dev Jain wrote:
>> On 30/07/25 10:30 pm, Catalin Marinas wrote:
>>> On Wed, Jul 23, 2025 at 09:48:27PM +0530, Dev Jain wrote:
>>>> +	 * Case 2: The static branch is not visible to T2.
>>>> +	 *
>>>> +	 * Since static_branch_enable() (via dmb(ish)) and mmap_write_lock()
>>>> +	 * have acquire semantics, it is guaranteed that the static branch
>>>> +	 * will be visible to all CPUs before T1 can enter CS. The static
>>> Does static_branch_disable() have release semantics? I think it does via
>>> some atomic_cmpxchg() but it's worth spelling it out.
>> Yes it *should* have, but this proof is already getting complicated so I thought,
>> let's not mention anything which is not required in the reasoning of the proof :)
> The static branch observability relative to the mmap_lock is essential,
> both when enabling and disabling the static key (i.e. you don't want the
> static branch to be seen disabled while T1 is in the critical section).
> That's why it's worth mentioning.
>
> On the static_branch_enable() path, we have a kick_all_cpus_sync(), so
> the T1 update of mmap lock will become visible after the branch update
> (of course, also subject to correct ordering on T2, see more at the end
> of the email).
>
> On the static_branch_disable() path, we need it to have release
> semantics _prior_ to disabling the key. We get this via the
> atomic_cmpxchg() in static_key_disable_cpuslocked().
>
> So I think it's all good here from the T1 perspective, just capture this
> in a comment so that we remember in a couple of months time.

Got it, thanks.

>
>>> As I mentioned on a previous version, this visibility is not absolute.
>>> You could say that it will be observed by other CPUs before they observe
>>> the write_lock.
>> Hmm. I probably am bad at explaining all of this in English :)
> It's the other way around - I don't think English is appropriate for
> explaining this ;).
>
>>>> +	 * branch not being visible to T2 therefore guarantees that T1 has
>>>> +	 * not yet entered CS .... (i)
>>>> +	 * The sequence of barriers via __flush_tlb_kernel_pgtable() in T2
>>>> +	 * implies that if the invisibility of the static branch has been
>>>> +	 * observed by T2 (i.e static_branch_unlikely() is observed as false),
>>>> +	 * then all CPUs will have observed an empty PUD ... (ii)
>>>> +	 * Combining (i) and (ii), we conclude that T1 observes an empty PUD
>>>> +	 * before entering CS => it is impossible for the CS to get hold of
>>>> +	 * the address of the isolated PMD table. Q.E.D
>>> OK, I nearly got lost. That's not a straightforward memory ordering with
>>> as we have instruction updates with ISB as part of the TLB flushing. I
>>> concluded last time I looked that this branch patching part works
>>> because we also have kick_all_cpus_sync() as part of the static branch
>>> update.
>>>
>>> Stepping back to a simpler model, let's say that the static key is a
>>> variable. I wrote this quick test for herd7 and the Linux kernel memory
>>> model (fairly easy to play with):
>>>
>>> -------------------------------------8<---------------------------------------
>>> C pte_free_ptdump
>>>
>>> (*
>>>    * $ cd tools/memory-model
>>>    * $ herd7 -conf linux-kernel.cfg path/to/pte_free_ptdump.litmus
>>>    *)
>>>
>>> {
>>> 	pmd = 1;				(* allocated pmd *)
>>> 	pte_page = 1;				(* valid pte page pointed at by the pmd *)
>>> }
>>>
>>> // pmd_free_pte_page()
>>> P0(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
>>> {
>>> 	WRITE_ONCE(*pmd, 0);			// pmd_clear()
>>> 	smp_mb();
>>> 	if (READ_ONCE(*ptdump_lock_key)) {	// static_branch() approximation
>>> 		spin_lock(init_mm);		// mmap_read_lock()
>>> 		spin_unlock(init_mm);		// mmap_read_unlock()
>>> 	}
>>> 	smp_mb();
>>> 	WRITE_ONCE(*pte_page, 0);		// pte_free_kernel()
>>> }
>>>
>>> // ptdump_walk_pgd()
>>> P1(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
>>> {
>>> 	int val;
>>> 	int page;
>>>
>>> 	WRITE_ONCE(*ptdump_lock_key, 1);	// static_branch_enable()
>>> 	smp_mb();
>>> 	spin_lock(init_mm);			// mmap_write_lock()
>>> 	val = READ_ONCE(*pmd);
>>> 	page = READ_ONCE(*pte_page);		// freed pte page?
>>> 	spin_unlock(init_mm);			// mmap_write_unlock()
>>> 	smp_mb();
>>> 	WRITE_ONCE(*ptdump_lock_key, 0);	// static_branch_disable()
>>> }
>>>
>>> exists(1:val=1 /\ 1:page=0)
>>> -------------------------------------8<---------------------------------------
>>>
>>> I sprinkled some necessary smp_mb() but in most cases we have
>>> release/acquire semantics. It does show that we need a barrier before
>>> the page freeing. We also need to acquire for enabling the static key
>>> and release for disabling it.
> The smp_mb() before the write to *pte_page in the model above is only
> needed on the static key disabled path. As Ryan pointed out, on the
> other path we take the mmap_lock and we have acquire semantics.
>
>>> Next step is to assess that replacing the static key variable read/write
>>> with code patching preserves the model.
>> Thanks for the litmus test!
>> I ran it and I get the same result. But I still cannot figure out where is the hole
>> in my proof.
> TBH, I couldn't fully follow the proof. The arm memory model is based on
> relations between accesses (e.g. read-from, coherence order) rather than
> timestamps. They may be equivalent but that's a whole new (and complex)
> proof to write.
>
> So I'm happy with a simple English explanation and a formal model along
> the lines of the one I posted as the equivalent of a proof.
>
>> I think you raise an objection to Case 2:
> Let's follow your approach, you are missing some timestamps, adding them
> below:
>
>> T1:						T2:
>>
>> WRITE_ONCE(*pmd, 0) : t1			WRITE_ONCE(*ptdump_lock_key, 1) : t3
>> smp_mb() // flush_tlb_kernel_pgtable		cmpxchg(acquire write lock)
> 						^^^ : t3'
>> 						smp_mb()
>> READ_ONCE(*ptdump_lock_key) = 0 : t2		val = READ_ONCE(*pmd) : t4
>> WRITE_ONCE(*pte_page, 0)			page = READ_ONCE(*pte_page)
>    ^^^ : t2'					^^^ : t4'
>> 						smp_mb()
>> 						cmpxchg(release write lock)
> 						^^^ : t5
> 						smp_mb()
> 						WRITE_ONCE(*ptdump_lock_key, 0): t6
>
>> Let t_i be the global timestamps of the execution of the labelled instructions
>> (note that each labelled instruction is atomic so assigning a timestamp makes sense).
>>
>> The fact that we read ptdump_lock_key as 0 implies that
>> t2 < t3 ... (i)
> Or that t2 > t6, a plausible scenario in the litmus test above.
>
>> Now,
>> t1 < t2 because of barrier (ii)
>> t3 < t4 because of barrier (iii)
>>
>> So we conclude that t1 < t4, so val will be observed as 0.
> Without a barrier between t2 and t2', there's no guarantee that t2' >
> t2. If you remove the last smp_mb() on P0 in my litmus test, you'll see
> it failing (run herd7 with something like "-view evince -show prop" for
> a visualisation of the transitions and relations; you need graphviz
> installed as well).
>
> The way it fails is that t4 is seeing the pmd=1 prior to t1, t2 is
> reading from the t6 write and t4' is reading pte_page=0 from t2' (since
> t2' < t2 is possible without a barrier).
>
> A control dependency would work as well without a barrier, i.e.:
>
> 	if (READ_ONCE(*ptdump_lock_key)) {
> 		mmap_lock();
> 		mmap_unlock();
> 		WRITE_ONCE(*pte_page, 0);
> 	} else {
> 		WRITE_ONCE(*pte_page, 0);
> 	}
>
> but the compiler is probably free to only issue a single WRITE_ONCE()
> irrespective of the ptdump_lock_key check.
>
> Of course, using smp_load_acquire(ptdump_lock_key) would also work.
>
> However, things get fuzzier as we don't have a classic load from the
> ptdump_lock_key but rather a patched instruction. We need to guarantee
> that t2' is issued after the t2 branch when the instruction is patched.
> The kick_all_cpus_sync() on the static key disable path doesn't help us
> since P0 (T2 in your description) may see the patched branch/nop and go
> straight to the WRITE_ONCE(*pte_page). Not sure what barrier helps here
> (after more sleep, I may have a better idea tomorrow).

Got it. The hole in my proof is not with Case 2 but with Case 1: the assumption
in the reasoning is that pmd_free() will be observed after the patched-in
read lock/unlock, but that will happen when patching-in happens, for which
we need to observe the branch before the pmd_free(), but that isn't guaranteed
since there is no barrier between the if block and the pmd_free(), nor is there any
control dependency, like you describe above. So, in pud_free_pmd_page, the entire block from "pmdp = table"
till "pmd_free()" can be observed before the observation of the branch.

Reading tools/memory-model/Documentation/control-dependencies.txt, I interpret that the
compiler is free to hoist out the WRITE_ONCE() out of the control block, and then
we have the same problem, BUT I tested with herd and the test passes :)


>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump
  2025-08-01 12:15       ` Dev Jain
@ 2025-08-01 15:48         ` Catalin Marinas
  0 siblings, 0 replies; 11+ messages in thread
From: Catalin Marinas @ 2025-08-01 15:48 UTC (permalink / raw)
  To: Dev Jain
  Cc: will, anshuman.khandual, quic_zhenhuah, ryan.roberts,
	kevin.brodsky, yangyicong, joey.gouly, linux-arm-kernel,
	linux-kernel, david, mark.rutland, urezki

On Fri, Aug 01, 2025 at 05:45:53PM +0530, Dev Jain wrote:
> On 31/07/25 10:36 pm, Catalin Marinas wrote:
> > A control dependency would work as well without a barrier, i.e.:
> > 
> > 	if (READ_ONCE(*ptdump_lock_key)) {
> > 		mmap_lock();
> > 		mmap_unlock();
> > 		WRITE_ONCE(*pte_page, 0);
> > 	} else {
> > 		WRITE_ONCE(*pte_page, 0);
> > 	}
> > 
> > but the compiler is probably free to only issue a single WRITE_ONCE()
> > irrespective of the ptdump_lock_key check.
> > 
> > Of course, using smp_load_acquire(ptdump_lock_key) would also work.
> > 
> > However, things get fuzzier as we don't have a classic load from the
> > ptdump_lock_key but rather a patched instruction. We need to guarantee
> > that t2' is issued after the t2 branch when the instruction is patched.
> > The kick_all_cpus_sync() on the static key disable path doesn't help us
> > since P0 (T2 in your description) may see the patched branch/nop and go
> > straight to the WRITE_ONCE(*pte_page). Not sure what barrier helps here
> > (after more sleep, I may have a better idea tomorrow).
> 
> Got it. The hole in my proof is not with Case 2 but with Case 1: the assumption
> in the reasoning is that pmd_free() will be observed after the patched-in
> read lock/unlock, but that will happen when patching-in happens, for which
> we need to observe the branch before the pmd_free(), but that isn't guaranteed
> since there is no barrier between the if block and the pmd_free(), nor is there any
> control dependency, like you describe above. So, in pud_free_pmd_page, the entire block from "pmdp = table"
> till "pmd_free()" can be observed before the observation of the branch.
> 
> Reading tools/memory-model/Documentation/control-dependencies.txt, I interpret that the
> compiler is free to hoist out the WRITE_ONCE() out of the control block, and then
> we have the same problem, BUT I tested with herd and the test passes :)

I don't think the tool reorders the litmus test events based on what a
compiler may generate. However, with instruction patching we don't even
have a control dependency - there's no check of the ptdump_lock_key but
a replacement of an unconditional branch with a NOP (or vice-versa).

Anyway, email to the memory model experts in Arm sent (you are on copy).

-- 
Catalin


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2025-08-01 15:51 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-23 16:18 [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump Dev Jain
2025-07-24  5:50 ` Anshuman Khandual
2025-07-24  7:20   ` Dev Jain
2025-07-30 17:00 ` Catalin Marinas
2025-07-30 18:29   ` Ryan Roberts
2025-07-31  4:30     ` Dev Jain
2025-07-31 11:38       ` Catalin Marinas
2025-07-31  7:12   ` Dev Jain
2025-07-31 17:06     ` Catalin Marinas
2025-08-01 12:15       ` Dev Jain
2025-08-01 15:48         ` Catalin Marinas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).