* [PATCH v3] arm64: Enable vmalloc-huge with ptdump
@ 2025-06-16 10:33 Dev Jain
2025-06-16 15:00 ` David Hildenbrand
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: Dev Jain @ 2025-06-16 10:33 UTC (permalink / raw)
To: catalin.marinas, will
Cc: anshuman.khandual, quic_zhenhuah, ryan.roberts, kevin.brodsky,
yangyicong, joey.gouly, linux-arm-kernel, linux-kernel, david,
Dev Jain
arm64 disables vmalloc-huge when kernel page table dumping is enabled,
because an intermediate table may be removed, potentially causing the
ptdump code to dereference an invalid address. We want to be able to
analyze block vs page mappings for kernel mappings with ptdump, so to
enable vmalloc-huge with ptdump, synchronize between page table removal in
pmd_free_pte_page()/pud_free_pmd_page() and the ptdump page table walk. We
use mmap_read_lock() rather than mmap_write_lock() because we don't need to
synchronize between two different vm_structs; two vmalloc objects running
this same code path will point to different page tables, hence there is no
race.
For pud_free_pmd_page(), we isolate the PMD table first, to avoid taking
the lock 512 more times via pmd_free_pte_page().
We implement the locking mechanism using a static key, since the chance of
a race is very small; the static branch keeps the common case (no ptdump in
progress) free of locking overhead. Observe that the synchronization is
needed to avoid the following race:
CPU1                                  CPU2
take reference of PMD table
                                      pud_clear()
                                      pte_free_kernel()
walk freed PMD table
and a similar race exists between pmd_free_pte_page() and ptdump_walk_pgd().
Therefore, there are two cases: if ptdump sees the cleared PUD, then we are
safe; if not, then the patched-in read and write locks help us avoid the
race.
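Schematically, the protocol between the two sides looks as follows (a sketch
only; the diff below implements the freeing side with an equivalent, slightly
more roundabout, lock/unlock sequence):

	/* freeing side, e.g. pud_free_pmd_page(), after isolating the table */
	pud_clear(pudp);
	__flush_tlb_kernel_pgtable(addr);	/* dsb publishes the cleared PUD */
	if (static_branch_unlikely(&ptdump_lock_key)) {
		mmap_read_lock(&init_mm);	/* wait out any in-flight walk */
		mmap_read_unlock(&init_mm);
	}
	pmd_free(NULL, table);			/* no walker can reference it now */

	/* ptdump side */
	static_branch_enable(&ptdump_lock_key);
	ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);	/* takes mmap_write_lock() */
	static_branch_disable(&ptdump_lock_key);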
To implement the mechanism, the static key must be accessible from both
mmu.c and ptdump.c. Note that when !CONFIG_PTDUMP_DEBUGFS, ptdump.o is not
a target in the Makefile, so we cannot initialize the key there, as is
done, for example, in the static key implementation of hugetlb-vmemmap.
Instead, declare the key in asm/cpufeature.h, which already includes the
jump_label machinery, and define the key to false in mmu.c.
No issues were observed with the mm selftests. No issues were observed while
running test_vmalloc.sh in parallel with dumping the kernel page table through
sysfs in a loop.
v2->v3:
- Use static key mechanism
v1->v2:
- Take lock only when CONFIG_PTDUMP_DEBUGFS is on
- In case of pud_free_pmd_page(), isolate the PMD table to avoid taking
the lock 512 times again via pmd_free_pte_page()
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
arch/arm64/include/asm/cpufeature.h | 1 +
arch/arm64/mm/mmu.c | 51 ++++++++++++++++++++++++++---
arch/arm64/mm/ptdump.c | 5 +++
3 files changed, 53 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index c4326f1cb917..3e386563b587 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -26,6 +26,7 @@
#include <linux/kernel.h>
#include <linux/cpumask.h>
+DECLARE_STATIC_KEY_FALSE(ptdump_lock_key);
/*
* CPU feature register tracking
*
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 8fcf59ba39db..e242ba428820 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -41,11 +41,14 @@
#include <asm/tlbflush.h>
#include <asm/pgalloc.h>
#include <asm/kfence.h>
+#include <asm/cpufeature.h>
#define NO_BLOCK_MAPPINGS BIT(0)
#define NO_CONT_MAPPINGS BIT(1)
#define NO_EXEC_MAPPINGS BIT(2) /* assumes FEAT_HPDS is not used */
+DEFINE_STATIC_KEY_FALSE(ptdump_lock_key);
+
enum pgtable_type {
TABLE_PTE,
TABLE_PMD,
@@ -1267,8 +1270,9 @@ int pmd_clear_huge(pmd_t *pmdp)
return 1;
}
-int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
+static int __pmd_free_pte_page(pmd_t *pmdp, unsigned long addr, bool lock)
{
+ bool lock_taken = false;
pte_t *table;
pmd_t pmd;
@@ -1279,15 +1283,29 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
return 1;
}
+ /* See comment in pud_free_pmd_page for static key logic */
table = pte_offset_kernel(pmdp, addr);
pmd_clear(pmdp);
__flush_tlb_kernel_pgtable(addr);
+ if (static_branch_unlikely(&ptdump_lock_key) && lock) {
+ mmap_read_lock(&init_mm);
+ lock_taken = true;
+ }
+ if (unlikely(lock_taken))
+ mmap_read_unlock(&init_mm);
+
pte_free_kernel(NULL, table);
return 1;
}
+int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
+{
+ return __pmd_free_pte_page(pmdp, addr, true);
+}
+
int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
{
+ bool lock_taken = false;
pmd_t *table;
pmd_t *pmdp;
pud_t pud;
@@ -1301,15 +1319,40 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
}
table = pmd_offset(pudp, addr);
+ /*
+ * Isolate the PMD table; in case of race with ptdump, this helps
+ * us to avoid taking the lock in __pmd_free_pte_page().
+ *
+ * Static key logic:
+ *
+ * Case 1: If ptdump does static_branch_enable(), and after that we
+ * execute the if block, then this patches in the read lock, ptdump has
+ * the write lock patched in, therefore ptdump will never read from
+ * a potentially freed PMD table.
+ *
+ * Case 2: If the if block starts executing before ptdump's
+ * static_branch_enable(), then no locking synchronization
+ * will be done. However, pud_clear() + the dsb() in
+ * __flush_tlb_kernel_pgtable will ensure that ptdump observes an
+ * empty PUD. Thus, it will never walk over a potentially freed
+ * PMD table.
+ */
+ pud_clear(pudp);
+ __flush_tlb_kernel_pgtable(addr);
+ if (static_branch_unlikely(&ptdump_lock_key)) {
+ mmap_read_lock(&init_mm);
+ lock_taken = true;
+ }
+ if (unlikely(lock_taken))
+ mmap_read_unlock(&init_mm);
+
pmdp = table;
next = addr;
end = addr + PUD_SIZE;
do {
- pmd_free_pte_page(pmdp, next);
+ __pmd_free_pte_page(pmdp, next, false);
} while (pmdp++, next += PMD_SIZE, next != end);
- pud_clear(pudp);
- __flush_tlb_kernel_pgtable(addr);
pmd_free(NULL, table);
return 1;
}
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index 421a5de806c6..f75e12a1d068 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -25,6 +25,7 @@
#include <asm/pgtable-hwdef.h>
#include <asm/ptdump.h>
+#include <asm/cpufeature.h>
#define pt_dump_seq_printf(m, fmt, args...) \
({ \
@@ -311,7 +312,9 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
}
};
+ static_branch_enable(&ptdump_lock_key);
ptdump_walk_pgd(&st.ptdump, info->mm, NULL);
+ static_branch_disable(&ptdump_lock_key);
}
static void __init ptdump_initialize(void)
@@ -353,7 +356,9 @@ bool ptdump_check_wx(void)
}
};
+ static_branch_enable(&ptdump_lock_key);
ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
+ static_branch_disable(&ptdump_lock_key);
if (st.wx_pages || st.uxn_pages) {
pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found, %lu non-UXN pages found\n",
--
2.30.2
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH v3] arm64: Enable vmalloc-huge with ptdump
2025-06-16 10:33 [PATCH v3] arm64: Enable vmalloc-huge with ptdump Dev Jain
@ 2025-06-16 15:00 ` David Hildenbrand
2025-06-16 16:34 ` Dev Jain
2025-06-16 18:07 ` Ryan Roberts
2025-06-25 10:35 ` Ryan Roberts
2 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2025-06-16 15:00 UTC (permalink / raw)
To: Dev Jain, catalin.marinas, will
Cc: anshuman.khandual, quic_zhenhuah, ryan.roberts, kevin.brodsky,
yangyicong, joey.gouly, linux-arm-kernel, linux-kernel
On 16.06.25 12:33, Dev Jain wrote:
> [...]
> @@ -1279,15 +1283,29 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
> return 1;
> }
>
> + /* See comment in pud_free_pmd_page for static key logic */
> table = pte_offset_kernel(pmdp, addr);
> pmd_clear(pmdp);
> __flush_tlb_kernel_pgtable(addr);
> + if (static_branch_unlikely(&ptdump_lock_key) && lock) {
> + mmap_read_lock(&init_mm);
> + lock_taken = true;
> + }
> + if (unlikely(lock_taken))
> + mmap_read_unlock(&init_mm);
> +
I'm missing something important: why not

if (static_branch_unlikely(&ptdump_lock_key) && lock) {
	mmap_read_lock(&init_mm);
	mmap_read_unlock(&init_mm);
}
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3] arm64: Enable vmalloc-huge with ptdump
2025-06-16 15:00 ` David Hildenbrand
@ 2025-06-16 16:34 ` Dev Jain
0 siblings, 0 replies; 19+ messages in thread
From: Dev Jain @ 2025-06-16 16:34 UTC (permalink / raw)
To: David Hildenbrand, catalin.marinas, will
Cc: anshuman.khandual, quic_zhenhuah, ryan.roberts, kevin.brodsky,
yangyicong, joey.gouly, linux-arm-kernel, linux-kernel
On 16/06/25 8:30 pm, David Hildenbrand wrote:
> On 16.06.25 12:33, Dev Jain wrote:
>> [...]
>> + /* See comment in pud_free_pmd_page for static key logic */
>> table = pte_offset_kernel(pmdp, addr);
>> pmd_clear(pmdp);
>> __flush_tlb_kernel_pgtable(addr);
>> + if (static_branch_unlikely(&ptdump_lock_key) && lock) {
>> + mmap_read_lock(&init_mm);
>> + lock_taken = true;
>> + }
>> + if (unlikely(lock_taken))
>> + mmap_read_unlock(&init_mm);
>> +
>
> I'm missing something important: why not
>
> if (static_branch_unlikely(&ptdump_lock_key) && lock) {
> 	mmap_read_lock(&init_mm);
> 	mmap_read_unlock(&init_mm);
> }
>
The thing you are missing is that I unlocked a new personal record in
dumbassery : ) I was focussed so much on the static key logic that I
forgot to change this code from the previous revision.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3] arm64: Enable vmalloc-huge with ptdump
2025-06-16 10:33 [PATCH v3] arm64: Enable vmalloc-huge with ptdump Dev Jain
2025-06-16 15:00 ` David Hildenbrand
@ 2025-06-16 18:07 ` Ryan Roberts
2025-06-16 21:20 ` Ryan Roberts
2025-06-17 2:54 ` Dev Jain
2025-06-25 10:35 ` Ryan Roberts
2 siblings, 2 replies; 19+ messages in thread
From: Ryan Roberts @ 2025-06-16 18:07 UTC (permalink / raw)
To: Dev Jain, catalin.marinas, will
Cc: anshuman.khandual, quic_zhenhuah, kevin.brodsky, yangyicong,
joey.gouly, linux-arm-kernel, linux-kernel, david
On 16/06/2025 11:33, Dev Jain wrote:
> [...]
> int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
> {
> + bool lock_taken = false;
> pmd_t *table;
> pmd_t *pmdp;
> pud_t pud;
> @@ -1301,15 +1319,40 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
> }
>
> table = pmd_offset(pudp, addr);
> + /*
> + * Isolate the PMD table; in case of race with ptdump, this helps
> + * us to avoid taking the lock in __pmd_free_pte_page().
> + *
> + * Static key logic:
> + *
> + * Case 1: If ptdump does static_branch_enable(), and after that we
> + * execute the if block, then this patches in the read lock, ptdump has
> + * the write lock patched in, therefore ptdump will never read from
> + * a potentially freed PMD table.
> + *
> + * Case 2: If the if block starts executing before ptdump's
> + * static_branch_enable(), then no locking synchronization
> + * will be done. However, pud_clear() + the dsb() in
> + * __flush_tlb_kernel_pgtable will ensure that ptdump observes an
> + * empty PUD. Thus, it will never walk over a potentially freed
> + * PMD table.
> + */
> + pud_clear(pudp);
How can this possibly be correct; you're clearing the pud without any
synchronisation. So you could have this situation:

CPU1 (vmalloc)                     CPU2 (ptdump)

                                   static_branch_enable()
                                   mmap_write_lock()
                                   pud = pudp_get()
pud_free_pmd_page()
  pud_clear()
                                   access the table pointed to by pud
                                   BANG!
Surely the logic needs to be:

if (static_branch_unlikely(&ptdump_lock_key)) {
	mmap_read_lock(&init_mm);
	lock_taken = true;
}
pud_clear(pudp);
if (unlikely(lock_taken))
	mmap_read_unlock(&init_mm);
That fixes your first case, I think? But doesn't fix your second case. You could
still have:

CPU1 (vmalloc)                     CPU2 (ptdump)

pud_free_pmd_page()
  <ptdump_lock_key=FALSE>
                                   static_branch_enable()
                                   mmap_write_lock()
                                   pud = pudp_get()
  pud_clear()
                                   access the table pointed to by pud
                                   BANG!
I think what you need is some sort of RCU read-side critical section on the
vmalloc side that you can then synchronize on in the ptdump side. But you would
need to be in the read-side critical section when you sample the static key, yet
you can't sleep waiting for the mmap lock while in the critical section. This
feels solvable, and there is almost certainly a well-used pattern, but I'm not
quite sure what the answer is. Perhaps others can help...
Thanks,
Ryan
> + __flush_tlb_kernel_pgtable(addr);
> + if (static_branch_unlikely(&ptdump_lock_key)) {
> + mmap_read_lock(&init_mm);
> + lock_taken = true;
> + }
> + if (unlikely(lock_taken))
> + mmap_read_unlock(&init_mm);
> +
> [...]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3] arm64: Enable vmalloc-huge with ptdump
2025-06-16 18:07 ` Ryan Roberts
@ 2025-06-16 21:20 ` Ryan Roberts
2025-06-17 11:51 ` Uladzislau Rezki
2025-06-17 2:54 ` Dev Jain
1 sibling, 1 reply; 19+ messages in thread
From: Ryan Roberts @ 2025-06-16 21:20 UTC (permalink / raw)
To: Dev Jain, catalin.marinas, will
Cc: anshuman.khandual, quic_zhenhuah, kevin.brodsky, yangyicong,
joey.gouly, linux-arm-kernel, linux-kernel, david
On 16/06/2025 19:07, Ryan Roberts wrote:
> On 16/06/2025 11:33, Dev Jain wrote:
>> [...]
>
> [...]
> I think what you need is some sort of RCU read-side critical section on the
> vmalloc side that you can then synchronize on in the ptdump side. But you would
> need to be in the read-side critical section when you sample the static key, yet
> you can't sleep waiting for the mmap lock while in the critical section. This
> feels solvable, and there is almost certainly a well-used pattern, but I'm not
> quite sure what the answer is. Perhaps others can help...
Just taking a step back here, I found the "percpu rw semaphore". From the
documentation:
"""
Percpu rw semaphores is a new read-write semaphore design that is
optimized for locking for reading.
The problem with traditional read-write semaphores is that when multiple
cores take the lock for reading, the cache line containing the semaphore
is bouncing between L1 caches of the cores, causing performance
degradation.
Locking for reading is very fast, it uses RCU and it avoids any atomic
instruction in the lock and unlock path. On the other hand, locking for
writing is very expensive, it calls synchronize_rcu() that can take
hundreds of milliseconds.
"""
Perhaps this provides the properties we are looking for? We could just define one
of these and lock it in read mode around pXd_clear() on the vmalloc side, then
lock it in write mode around ptdump_walk_pgd() on the ptdump side. No need for a
static key or other hoops. Given it's a dedicated lock, there is no risk of
accidental contention, because no other code is using it.
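As a rough sketch of that shape (lock name invented here, untested):

	#include <linux/percpu-rwsem.h>

	static DEFINE_STATIC_PERCPU_RWSEM(ptdump_pgtable_sem);

	/* vmalloc side: read locking is cheap (RCU-based, no atomics) */
	percpu_down_read(&ptdump_pgtable_sem);
	pud_clear(pudp);
	percpu_up_read(&ptdump_pgtable_sem);
	__flush_tlb_kernel_pgtable(addr);
	/* ... free the isolated table ... */

	/* ptdump side: write locking is expensive (synchronize_rcu()) */
	percpu_down_write(&ptdump_pgtable_sem);
	ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
	percpu_up_write(&ptdump_pgtable_sem);

A walk that starts after percpu_up_read() is guaranteed to see the cleared
pud, and a clear cannot start while a walk holds the write lock, so a table
is never freed under a walker's feet.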
> [...]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3] arm64: Enable vmalloc-huge with ptdump
2025-06-16 18:07 ` Ryan Roberts
2025-06-16 21:20 ` Ryan Roberts
@ 2025-06-17 2:54 ` Dev Jain
2025-06-17 3:59 ` Dev Jain
1 sibling, 1 reply; 19+ messages in thread
From: Dev Jain @ 2025-06-17 2:54 UTC (permalink / raw)
To: Ryan Roberts, catalin.marinas, will
Cc: anshuman.khandual, quic_zhenhuah, kevin.brodsky, yangyicong,
joey.gouly, linux-arm-kernel, linux-kernel, david
On 16/06/25 11:37 pm, Ryan Roberts wrote:
> On 16/06/2025 11:33, Dev Jain wrote:
>> [...]
>> + pud_clear(pudp);
> How can this possibly be correct; you're clearing the pud without any
> synchronisation. So you could have this situation:
>
> CPU1 (vmalloc)                     CPU2 (ptdump)
>
>                                    static_branch_enable()
>                                    mmap_write_lock()
>                                    pud = pudp_get()
When you do pudp_get(), you won't be dereferencing a NULL pointer.
pud_clear() will nullify the pud entry. So pudp_get() will boil
down to retrieving a NULL entry. Or, pudp_get() will retrieve an
entry pointing to the now isolated PMD table. Correct me if I am
wrong.
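To illustrate (a schematic of the walker's view, not the actual mm/ptdump.c
code):

	pud_t pud = pudp_get(pudp);	/* a single read of the entry */

	if (pud_none(pud) || pud_leaf(pud)) {
		/* nothing below to descend into: safe */
	} else {
		/* pud points to the isolated, but not yet freed, PMD table */
	}

Either value the walker reads is fine; the only danger is dereferencing a
table that has already been freed.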
> pud_free_pmd_page()
>   pud_clear()
>                                    access the table pointed to by pud
>                                    BANG!
> [...]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3] arm64: Enable vmalloc-huge with ptdump
2025-06-17 2:54 ` Dev Jain
@ 2025-06-17 3:59 ` Dev Jain
2025-06-17 8:12 ` Ryan Roberts
0 siblings, 1 reply; 19+ messages in thread
From: Dev Jain @ 2025-06-17 3:59 UTC (permalink / raw)
To: Ryan Roberts, catalin.marinas, will
Cc: anshuman.khandual, quic_zhenhuah, kevin.brodsky, yangyicong,
joey.gouly, linux-arm-kernel, linux-kernel, david
On 17/06/25 8:24 am, Dev Jain wrote:
>
> On 16/06/25 11:37 pm, Ryan Roberts wrote:
>> On 16/06/2025 11:33, Dev Jain wrote:
>>> [...]
>>> + pud_clear(pudp);
>> How can this possibly be correct; you're clearing the pud without any
>> synchronisation. So you could have this situation:
>>
>> CPU1 (vmalloc)                     CPU2 (ptdump)
>>
>>                                    static_branch_enable()
>>                                    mmap_write_lock()
>>                                    pud = pudp_get()
>
> When you do pudp_get(), you won't be dereferencing a NULL pointer.
> pud_clear() will nullify the pud entry. So pudp_get() will boil
> down to retrieving a NULL entry. Or, pudp_get() will retrieve an
> entry pointing to the now isolated PMD table. Correct me if I am
> wrong.
>
>> pud_free_pmd_page()
>>   pud_clear()
>>                                    access the table pointed to by pud
>>                                    BANG!
I am also confused thoroughly now : ) This should not go bang as the
table pointed to by pud is still there, and our sequence guarantees that
if the ptdump walk is using the pmd table, then pud_free_pmd_page won't
free the PMD table yet.
>> [...]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3] arm64: Enable vmalloc-huge with ptdump
2025-06-17 3:59 ` Dev Jain
@ 2025-06-17 8:12 ` Ryan Roberts
2025-06-17 8:58 ` Dev Jain
0 siblings, 1 reply; 19+ messages in thread
From: Ryan Roberts @ 2025-06-17 8:12 UTC (permalink / raw)
To: Dev Jain, catalin.marinas, will
Cc: anshuman.khandual, quic_zhenhuah, kevin.brodsky, yangyicong,
joey.gouly, linux-arm-kernel, linux-kernel, david
On 17/06/2025 04:59, Dev Jain wrote:
>
> On 17/06/25 8:24 am, Dev Jain wrote:
>>
>> On 16/06/25 11:37 pm, Ryan Roberts wrote:
>>> On 16/06/2025 11:33, Dev Jain wrote:
>>>> [...]
>>>> + pud_clear(pudp);
>>> How can this possibly be correct; you're clearing the pud without any
>>> synchronisation. So you could have this situation:
>>>
>>> CPU1 (vmalloc)                     CPU2 (ptdump)
>>>
>>>                                    static_branch_enable()
>>>                                    mmap_write_lock()
>>>                                    pud = pudp_get()
>>
>> When you do pudp_get(), you won't be dereferencing a NULL pointer.
>> pud_clear() will nullify the pud entry. So pudp_get() will boil
>> down to retrieving a NULL entry. Or, pudp_get() will retrieve an
>> entry pointing to the now isolated PMD table. Correct me if I am
>> wrong.
>>
>>> pud_free_pmd_page()
>>> pud_clear()
>>> access the table pointed to by pud
>>> BANG!
>
> I am also confused thoroughly now : ) This should not go bang as the
> table pointed to by pud is still there, and our sequence guarantees that
> if the ptdump walk is using the pmd table, then pud_free_pmd_page won't
> free the PMD table yet.
You're right... I'm not sure what I was smoking last night. For some reason I
read the pXd_clear() as "free". This approach looks good to me - very clever!
And you even managed to ensure that the WRITE_ONCE() in pXd_clear() doesn't get
reordered past the lock acquisition, thanks to the existing dsb in the TLB
maintenance operation - I like it!
I'll send a separate review with some nits, but I'm out today, so that might
have to wait until tomorrow.
Thanks, and sorry again for the noise!
Ryan
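
To make the two cases concrete, here is a condensed sketch of the handshake
being discussed. The vmalloc side is simplified from the posted
__pmd_free_pte_page() hunk (the lock_taken bookkeeping is dropped); the
ptdump-side write lock is placed as in the race diagrams, not quoted from
the patch:

	/* vmalloc side, condensed from the posted diff */
	table = pte_offset_kernel(pmdp, addr);
	pmd_clear(pmdp);
	__flush_tlb_kernel_pgtable(addr);	/* dsb: clear visible before key read */
	if (static_branch_unlikely(&ptdump_lock_key)) {
		/* Case 1: key already on - wait out any in-flight walk */
		mmap_read_lock(&init_mm);
		mmap_read_unlock(&init_mm);
	}
	pte_free_kernel(NULL, table);		/* freed only after the handshake */

	/* ptdump side */
	static_branch_enable(&ptdump_lock_key);
	mmap_write_lock(&init_mm);
	ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
	mmap_write_unlock(&init_mm);
	static_branch_disable(&ptdump_lock_key);

Either the free path sees the key and waits out any in-flight walk via the
lock handshake (Case 1), or it raced ahead of static_branch_enable(), in
which case the dsb has already made the cleared entry visible and the walk
never finds the old table pointer (Case 2).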
* Re: [PATCH v3] arm64: Enable vmalloc-huge with ptdump
2025-06-17 8:12 ` Ryan Roberts
@ 2025-06-17 8:58 ` Dev Jain
0 siblings, 0 replies; 19+ messages in thread
From: Dev Jain @ 2025-06-17 8:58 UTC (permalink / raw)
To: Ryan Roberts, catalin.marinas, will
Cc: anshuman.khandual, quic_zhenhuah, kevin.brodsky, yangyicong,
joey.gouly, linux-arm-kernel, linux-kernel, david
On 17/06/25 1:42 pm, Ryan Roberts wrote:
> On 17/06/2025 04:59, Dev Jain wrote:
>> On 17/06/25 8:24 am, Dev Jain wrote:
>>> On 16/06/25 11:37 pm, Ryan Roberts wrote:
>>>> On 16/06/2025 11:33, Dev Jain wrote:
> [...]
>
>>>>> + pud_clear(pudp);
>>>> How can this possibly be correct; you're clearing the pud without any
>>>> synchronisation. So you could have this situation:
>>>>
>>>> CPU1 (vmalloc) CPU2 (ptdump)
>>>>
>>>> static_branch_enable()
>>>> mmap_write_lock()
>>>> pud = pudp_get()
>>> When you do pudp_get(), you won't be dereferencing a NULL pointer.
>>> pud_clear() will nullify the pud entry. So pudp_get() will boil
>>> down to retrieving a NULL entry. Or, pudp_get() will retrieve an
>>> entry pointing to the now isolated PMD table. Correct me if I am
>>> wrong.
>>>
>>>> pud_free_pmd_page()
>>>> pud_clear()
>>>> access the table pointed to by pud
>>>> BANG!
>> I am also confused thoroughly now : ) This should not go bang as the
>> table pointed to by pud is still there, and our sequence guarantees that
>> if the ptdump walk is using the pmd table, then pud_free_pmd_page won't
>> free the PMD table yet.
> You're right... I'm not sure what I was smoking last night. For some reason I
> read the pXd_clear() as "free". This approach looks good to me - very clever!
> And you even managed to ensure the WRITE_ONCE() in pXd_clear() doesn't get
> reordered after taking the lock via the existing dsb in the tlb maintenance
> operation - I like it!
Haha! It indeed was very confusing; the important observation separating this
from other cases is that ptdump only cares about reading the tables, not about
what it reads.
>
> I'll send a separate review with some nits, but I'm out today, so that might
> have to wait until tomorrow.
>
> Thanks, and sorry again for the noise!
Ah no it was not noise : ) Sure, enjoy.
> Ryan
>
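
Dev's point - that ptdump only needs its reads to be safe, not current - can
be seen from the walker's side. A sketch of the two benign outcomes
(walk_pmd_table() is a hypothetical stand-in for the real descent, not a
kernel API):

	pud_t pud = pudp_get(pudp);	/* may race with pud_clear() */

	if (pud_none(pud))
		return;			/* saw the clear: nothing to walk */
	/*
	 * Saw the old value: the PMD table is isolated but not yet freed,
	 * because the freeing side cannot free it until the walker drops
	 * the mmap lock. Stale entries are harmless - ptdump only reports
	 * whatever it finds.
	 */
	walk_pmd_table(pud);		/* hypothetical descent into the table */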
* Re: [PATCH v3] arm64: Enable vmalloc-huge with ptdump
2025-06-16 21:20 ` Ryan Roberts
@ 2025-06-17 11:51 ` Uladzislau Rezki
2025-06-18 3:11 ` Dev Jain
2025-06-18 11:21 ` Ryan Roberts
0 siblings, 2 replies; 19+ messages in thread
From: Uladzislau Rezki @ 2025-06-17 11:51 UTC (permalink / raw)
To: Ryan Roberts
Cc: Dev Jain, catalin.marinas, will, anshuman.khandual, quic_zhenhuah,
kevin.brodsky, yangyicong, joey.gouly, linux-arm-kernel,
linux-kernel, david
On Mon, Jun 16, 2025 at 10:20:29PM +0100, Ryan Roberts wrote:
> On 16/06/2025 19:07, Ryan Roberts wrote:
> > On 16/06/2025 11:33, Dev Jain wrote:
[...]
> >> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
> >> index c4326f1cb917..3e386563b587 100644
> >> --- a/arch/arm64/include/asm/cpufeature.h
> >> +++ b/arch/arm64/include/asm/cpufeature.h
> >> @@ -26,6 +26,7 @@
> >> #include <linux/kernel.h>
> >> #include <linux/cpumask.h>
> >>
> >> +DECLARE_STATIC_KEY_FALSE(ptdump_lock_key);
> >> /*
> >> * CPU feature register tracking
> >> *
> >> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> >> index 8fcf59ba39db..e242ba428820 100644
> >> --- a/arch/arm64/mm/mmu.c
> >> +++ b/arch/arm64/mm/mmu.c
> >> @@ -41,11 +41,14 @@
> >> #include <asm/tlbflush.h>
> >> #include <asm/pgalloc.h>
> >> #include <asm/kfence.h>
> >> +#include <asm/cpufeature.h>
> >>
> >> #define NO_BLOCK_MAPPINGS BIT(0)
> >> #define NO_CONT_MAPPINGS BIT(1)
> >> #define NO_EXEC_MAPPINGS BIT(2) /* assumes FEAT_HPDS is not used */
> >>
> >> +DEFINE_STATIC_KEY_FALSE(ptdump_lock_key);
> >> +
> >> enum pgtable_type {
> >> TABLE_PTE,
> >> TABLE_PMD,
> >> @@ -1267,8 +1270,9 @@ int pmd_clear_huge(pmd_t *pmdp)
> >> return 1;
> >> }
> >>
> >> -int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
> >> +static int __pmd_free_pte_page(pmd_t *pmdp, unsigned long addr, bool lock)
> >> {
> >> + bool lock_taken = false;
> >> pte_t *table;
> >> pmd_t pmd;
> >>
> >> @@ -1279,15 +1283,29 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
> >> return 1;
> >> }
> >>
> >> + /* See comment in pud_free_pmd_page for static key logic */
> >> table = pte_offset_kernel(pmdp, addr);
> >> pmd_clear(pmdp);
> >> __flush_tlb_kernel_pgtable(addr);
> >> + if (static_branch_unlikely(&ptdump_lock_key) && lock) {
> >> + mmap_read_lock(&init_mm);
> >> + lock_taken = true;
> >> + }
> >> + if (unlikely(lock_taken))
> >> + mmap_read_unlock(&init_mm);
> >> +
> >> pte_free_kernel(NULL, table);
> >> return 1;
> >> }
> >>
> >> +int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
> >> +{
> >> + return __pmd_free_pte_page(pmdp, addr, true);
> >> +}
> >> +
> >> int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
> >> {
> >> + bool lock_taken = false;
> >> pmd_t *table;
> >> pmd_t *pmdp;
> >> pud_t pud;
> >> @@ -1301,15 +1319,40 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
> >> }
> >>
> >> table = pmd_offset(pudp, addr);
> >> + /*
> >> + * Isolate the PMD table; in case of race with ptdump, this helps
> >> + * us to avoid taking the lock in __pmd_free_pte_page().
> >> + *
> >> + * Static key logic:
> >> + *
> >> + * Case 1: If ptdump does static_branch_enable(), and after that we
> >> + * execute the if block, then this patches in the read lock, ptdump has
> >> + * the write lock patched in, therefore ptdump will never read from
> >> + * a potentially freed PMD table.
> >> + *
> >> + * Case 2: If the if block starts executing before ptdump's
> >> + * static_branch_enable(), then no locking synchronization
> >> + * will be done. However, pud_clear() + the dsb() in
> >> + * __flush_tlb_kernel_pgtable will ensure that ptdump observes an
> >> + * empty PUD. Thus, it will never walk over a potentially freed
> >> + * PMD table.
> >> + */
> >> + pud_clear(pudp);
> >
> > How can this possibly be correct; you're clearing the pud without any
> > synchronisation. So you could have this situation:
> >
> > CPU1 (vmalloc) CPU2 (ptdump)
> >
> > static_branch_enable()
> > mmap_write_lock()
> > pud = pudp_get()
> > pud_free_pmd_page()
> > pud_clear()
> > access the table pointed to by pud
> > BANG!
> >
> > Surely the logic needs to be:
> >
> > if (static_branch_unlikely(&ptdump_lock_key)) {
> > mmap_read_lock(&init_mm);
> > lock_taken = true;
> > }
> > pud_clear(pudp);
> > if (unlikely(lock_taken))
> > mmap_read_unlock(&init_mm);
> >
> > That fixes your first case, I think? But doesn't fix your second case. You could
> > still have:
> >
> > CPU1 (vmalloc) CPU2 (ptdump)
> >
> > pud_free_pmd_page()
> > <ptdump_lock_key=FALSE>
> > static_branch_enable()
> > mmap_write_lock()
> > pud = pudp_get()
> > pud_clear()
> > access the table pointed to by pud
> > BANG!
> >
> > I think what you need is some sort of RCU read-size critical section in the
> > vmalloc side that you can then synchonize on in the ptdump side. But you would
> > need to be in the read side critical section when you sample the static key, but
> > you can't sleep waiting for the mmap lock while in the critical section. This
> > feels solvable, and there is almost certainly a well-used pattern, but I'm not
> > quite sure what the answer is. Perhaps others can help...
>
> Just taking a step back here, I found the "percpu rw semaphore". From the
> documentation:
>
> """
> Percpu rw semaphores is a new read-write semaphore design that is
> optimized for locking for reading.
>
> The problem with traditional read-write semaphores is that when multiple
> cores take the lock for reading, the cache line containing the semaphore
> is bouncing between L1 caches of the cores, causing performance
> degradation.
>
> Locking for reading is very fast, it uses RCU and it avoids any atomic
> instruction in the lock and unlock path. On the other hand, locking for
> writing is very expensive, it calls synchronize_rcu() that can take
> hundreds of milliseconds.
> """
>
> Perhaps this provides the properties we are looking for? Could just define one
> of these and lock it in read mode around pXd_clear() on the vmalloc side. Then
> lock it in write mode around ptdump_walk_pgd() on the ptdump side. No need for
> static key or other hoops. Given its a dedicated lock, there is no risk of
> accidental contention because no other code is using it.
>
Write-lock indeed is super expensive, as you noted it blocks on
synchronize_rcu(). If that write-lock interferes with a critical
vmalloc fast path, where a read-lock could be injected, then it
is definitely a problem.
I have not analysed this patch series. I need to have a look what
"ptdump" does.
--
Uladzislau Rezki
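
For reference, a rough sketch of the shape Ryan is suggesting; the lock name
ptdump_pcpu_sem is hypothetical and this is not what the posted patch does:

	#include <linux/percpu-rwsem.h>

	static DEFINE_STATIC_PERCPU_RWSEM(ptdump_pcpu_sem);

	/* vmalloc side: fast, RCU-based read lock around the disconnect */
	percpu_down_read(&ptdump_pcpu_sem);
	pud_clear(pudp);
	percpu_up_read(&ptdump_pcpu_sem);

	/* ptdump side: expensive write lock, may block in synchronize_rcu() */
	percpu_down_write(&ptdump_pcpu_sem);
	ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
	percpu_up_write(&ptdump_pcpu_sem);

Note that the read side, while cheap, can still sleep when a writer holds
the semaphore, which is what the fast-path discussion below turns on.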
* Re: [PATCH v3] arm64: Enable vmalloc-huge with ptdump
2025-06-17 11:51 ` Uladzislau Rezki
@ 2025-06-18 3:11 ` Dev Jain
2025-06-18 17:19 ` Uladzislau Rezki
2025-06-18 11:21 ` Ryan Roberts
1 sibling, 1 reply; 19+ messages in thread
From: Dev Jain @ 2025-06-18 3:11 UTC (permalink / raw)
To: Uladzislau Rezki, Ryan Roberts
Cc: catalin.marinas, will, anshuman.khandual, quic_zhenhuah,
kevin.brodsky, yangyicong, joey.gouly, linux-arm-kernel,
linux-kernel, david
On 17/06/25 5:21 pm, Uladzislau Rezki wrote:
> On Mon, Jun 16, 2025 at 10:20:29PM +0100, Ryan Roberts wrote:
>> On 16/06/2025 19:07, Ryan Roberts wrote:
>>> On 16/06/2025 11:33, Dev Jain wrote:
[...]
>> Just taking a step back here, I found the "percpu rw semaphore". From the
>> documentation:
>>
>> """
>> Percpu rw semaphores is a new read-write semaphore design that is
>> optimized for locking for reading.
>>
>> The problem with traditional read-write semaphores is that when multiple
>> cores take the lock for reading, the cache line containing the semaphore
>> is bouncing between L1 caches of the cores, causing performance
>> degradation.
>>
>> Locking for reading is very fast, it uses RCU and it avoids any atomic
>> instruction in the lock and unlock path. On the other hand, locking for
>> writing is very expensive, it calls synchronize_rcu() that can take
>> hundreds of milliseconds.
>> """
>>
>> Perhaps this provides the properties we are looking for? Could just define one
>> of these and lock it in read mode around pXd_clear() on the vmalloc side. Then
>> lock it in write mode around ptdump_walk_pgd() on the ptdump side. No need for
>> static key or other hoops. Given its a dedicated lock, there is no risk of
>> accidental contention because no other code is using it.
>>
> Write-lock indeed is super expensive, as you noted it blocks on
> synchronize_rcu(). If that write-lock interferes with a critical
> vmalloc fast path, where a read-lock could be injected, then it
> is definitely a problem.
I have a question - is this pmd_free_pte_page/pud_free_pmd_page part of
a fast path?
>
>
> I have not analysed this patch series. I need to have a look what
> "ptdump" does.
>
> --
> Uladzislau Rezki
* Re: [PATCH v3] arm64: Enable vmalloc-huge with ptdump
2025-06-17 11:51 ` Uladzislau Rezki
2025-06-18 3:11 ` Dev Jain
@ 2025-06-18 11:21 ` Ryan Roberts
2025-06-18 17:19 ` Uladzislau Rezki
1 sibling, 1 reply; 19+ messages in thread
From: Ryan Roberts @ 2025-06-18 11:21 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: Dev Jain, catalin.marinas, will, anshuman.khandual, quic_zhenhuah,
kevin.brodsky, yangyicong, joey.gouly, linux-arm-kernel,
linux-kernel, david
On 17/06/2025 12:51, Uladzislau Rezki wrote:
> On Mon, Jun 16, 2025 at 10:20:29PM +0100, Ryan Roberts wrote:
>> On 16/06/2025 19:07, Ryan Roberts wrote:
>>> On 16/06/2025 11:33, Dev Jain wrote:
[...]
> Write-lock indeed is super expensive, as you noted it blocks on
> synchronize_rcu(). If that write-lock interferes with a critical
> vmalloc fast path, where a read-lock could be injected, then it
> is definitely a problem.
>
> I have not analysed this patch series. I need to have a look what
> "ptdump" does.
ptdump is the kernel page table dumper. It is only invoked when a privileged
user reads from a debugfs file. This is debug functionality only. It's
acceptable for the write side to be slow.
Regardless, I've backtracked on my original review... I actually think Dev's
approach with the static key is the correct approach here. It means that there
is zero overhead for the common case of nobody actively dumping kernel page
tables, while still making the table removal operation safe in the presence of
ptdump.
Thanks,
Ryan
>
> --
> Uladzislau Rezki
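
A note on the property Ryan describes, assuming standard jump-label
behaviour (this is not code from the patch): while the key is false, the
guarded check costs a single nop on the vmalloc path.

	/*
	 * Compiles to a nop while ptdump_lock_key is false;
	 * static_branch_enable() patches it into a jump, so the locking
	 * below exists only while a dump is actually in progress.
	 */
	if (static_branch_unlikely(&ptdump_lock_key)) {
		mmap_read_lock(&init_mm);
		mmap_read_unlock(&init_mm);
	}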
* Re: [PATCH v3] arm64: Enable vmalloc-huge with ptdump
2025-06-18 3:11 ` Dev Jain
@ 2025-06-18 17:19 ` Uladzislau Rezki
2025-06-19 3:13 ` Dev Jain
0 siblings, 1 reply; 19+ messages in thread
From: Uladzislau Rezki @ 2025-06-18 17:19 UTC (permalink / raw)
To: Dev Jain
Cc: Uladzislau Rezki, Ryan Roberts, catalin.marinas, will,
anshuman.khandual, quic_zhenhuah, kevin.brodsky, yangyicong,
joey.gouly, linux-arm-kernel, linux-kernel, david
On Wed, Jun 18, 2025 at 08:41:36AM +0530, Dev Jain wrote:
>
> On 17/06/25 5:21 pm, Uladzislau Rezki wrote:
> > On Mon, Jun 16, 2025 at 10:20:29PM +0100, Ryan Roberts wrote:
> > > On 16/06/2025 19:07, Ryan Roberts wrote:
> > > > On 16/06/2025 11:33, Dev Jain wrote:
[...]
> > Write-lock indeed is super expensive, as you noted it blocks on
> > synchronize_rcu(). If that write-lock interferes with a critical
> > vmalloc fast path, where a read-lock could be injected, then it
> > is definitely a problem.
>
> I have a question - is this pmd_free_pte_page/pud_free_pmd_page part of
> a fast path?
>
<snip>
vmalloc()
__vmalloc_node_range_noprof()
__vmalloc_area_node()
vmap_pages_range();
vmap_pages_range_noflush()
__vmap_pages_range_noflush()
vmap_range_noflush()
vmap_p4d_range()
vmap_try_huge_p4d()
if (p4d_present(*p4d) && !p4d_free_pud_page(p4d, addr))
<snip>
The point is, we would like to avoid any long-sleeping primitives or
introducing any new bottlenecks that make vmalloc less scalable
or slower.
I reacted to the synchronize_rcu() and rw-semaphores because they make
the current context enter a sleeping state, i.e. wait on
wait_for_completion(). Also, we would like to exclude any sleeping if
possible at all, for example for GFP_ATOMIC and GFP_NOWAIT flags support,
which I am looking at currently.
--
Uladzislau Rezki
* Re: [PATCH v3] arm64: Enable vmalloc-huge with ptdump
2025-06-18 11:21 ` Ryan Roberts
@ 2025-06-18 17:19 ` Uladzislau Rezki
0 siblings, 0 replies; 19+ messages in thread
From: Uladzislau Rezki @ 2025-06-18 17:19 UTC (permalink / raw)
To: Ryan Roberts
Cc: Uladzislau Rezki, Dev Jain, catalin.marinas, will,
anshuman.khandual, quic_zhenhuah, kevin.brodsky, yangyicong,
joey.gouly, linux-arm-kernel, linux-kernel, david
On Wed, Jun 18, 2025 at 12:21:48PM +0100, Ryan Roberts wrote:
> On 17/06/2025 12:51, Uladzislau Rezki wrote:
> > On Mon, Jun 16, 2025 at 10:20:29PM +0100, Ryan Roberts wrote:
> >> On 16/06/2025 19:07, Ryan Roberts wrote:
> >>> On 16/06/2025 11:33, Dev Jain wrote:
[...]
> > Write-lock indeed is super expensive, as you noted it blocks on
> > synchronize_rcu(). If that write-lock interferes with a critical
> > vmalloc fast path, where a read-lock could be injected, then it
> > is definitely a problem.
> >
> > I have not analysed this patch series. I need to have a look what
> > "ptdump" does.
>
> ptdump is the kernel page table dumper. It is only invoked when a privileged
> user reads from a debugfs file. This is debug functionality only. It's
> acceptable for the write side to be slow.
>
> Regardless, I've backtracked on my original review... I actually think Dev's
> approach with the static key is the correct approach here. It means that there
> is zero overhead for the common case of nobody actively dumping kernel page
> tables but it also makes the table removal operation safe in the presence of ptdump.
>
OK, thank you!
--
Uladzislau Rezki
* Re: [PATCH v3] arm64: Enable vmalloc-huge with ptdump
2025-06-18 17:19 ` Uladzislau Rezki
@ 2025-06-19 3:13 ` Dev Jain
0 siblings, 0 replies; 19+ messages in thread
From: Dev Jain @ 2025-06-19 3:13 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: Ryan Roberts, catalin.marinas, will, anshuman.khandual,
quic_zhenhuah, kevin.brodsky, yangyicong, joey.gouly,
linux-arm-kernel, linux-kernel, david
On 18/06/25 10:49 pm, Uladzislau Rezki wrote:
> On Wed, Jun 18, 2025 at 08:41:36AM +0530, Dev Jain wrote:
>> On 17/06/25 5:21 pm, Uladzislau Rezki wrote:
>>> On Mon, Jun 16, 2025 at 10:20:29PM +0100, Ryan Roberts wrote:
>>>> On 16/06/2025 19:07, Ryan Roberts wrote:
>>>>> On 16/06/2025 11:33, Dev Jain wrote:
[...]
>>>>>> table = pte_offset_kernel(pmdp, addr);
>>>>>> pmd_clear(pmdp);
>>>>>> __flush_tlb_kernel_pgtable(addr);
>>>>>> + if (static_branch_unlikely(&ptdump_lock_key) && lock) {
>>>>>> + mmap_read_lock(&init_mm);
>>>>>> + lock_taken = true;
>>>>>> + }
>>>>>> + if (unlikely(lock_taken))
>>>>>> + mmap_read_unlock(&init_mm);
>>>>>> +
>>>>>> pte_free_kernel(NULL, table);
>>>>>> return 1;
>>>>>> }
>>>>>> +int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>>>>>> +{
>>>>>> + return __pmd_free_pte_page(pmdp, addr, true);
>>>>>> +}
>>>>>> +
>>>>>> int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>>>>> {
>>>>>> + bool lock_taken = false;
>>>>>> pmd_t *table;
>>>>>> pmd_t *pmdp;
>>>>>> pud_t pud;
>>>>>> @@ -1301,15 +1319,40 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>>>>> }
>>>>>> table = pmd_offset(pudp, addr);
>>>>>> + /*
>>>>>> + * Isolate the PMD table; in case of race with ptdump, this helps
>>>>>> + * us to avoid taking the lock in __pmd_free_pte_page().
>>>>>> + *
>>>>>> + * Static key logic:
>>>>>> + *
>>>>>> + * Case 1: If ptdump does static_branch_enable(), and after that we
>>>>>> + * execute the if block, then this patches in the read lock, ptdump has
>>>>>> + * the write lock patched in, therefore ptdump will never read from
>>>>>> + * a potentially freed PMD table.
>>>>>> + *
>>>>>> + * Case 2: If the if block starts executing before ptdump's
>>>>>> + * static_branch_enable(), then no locking synchronization
>>>>>> + * will be done. However, pud_clear() + the dsb() in
>>>>>> + * __flush_tlb_kernel_pgtable will ensure that ptdump observes an
>>>>>> + * empty PUD. Thus, it will never walk over a potentially freed
>>>>>> + * PMD table.
>>>>>> + */
>>>>>> + pud_clear(pudp);
>>>>> How can this possibly be correct; you're clearing the pud without any
>>>>> synchronisation. So you could have this situation:
>>>>>
>>>>> CPU1 (vmalloc) CPU2 (ptdump)
>>>>>
>>>>> static_branch_enable()
>>>>> mmap_write_lock()
>>>>> pud = pudp_get()
>>>>> pud_free_pmd_page()
>>>>> pud_clear()
>>>>> access the table pointed to by pud
>>>>> BANG!
>>>>>
>>>>> Surely the logic needs to be:
>>>>>
>>>>> if (static_branch_unlikely(&ptdump_lock_key)) {
>>>>> mmap_read_lock(&init_mm);
>>>>> lock_taken = true;
>>>>> }
>>>>> pud_clear(pudp);
>>>>> if (unlikely(lock_taken))
>>>>> mmap_read_unlock(&init_mm);
>>>>>
>>>>> That fixes your first case, I think? But doesn't fix your second case. You could
>>>>> still have:
>>>>>
>>>>> CPU1 (vmalloc) CPU2 (ptdump)
>>>>>
>>>>> pud_free_pmd_page()
>>>>> <ptdump_lock_key=FALSE>
>>>>> static_branch_enable()
>>>>> mmap_write_lock()
>>>>> pud = pudp_get()
>>>>> pud_clear()
>>>>> access the table pointed to by pud
>>>>> BANG!
>>>>>
>>>>> I think what you need is some sort of RCU read-side critical section on the
>>>>> vmalloc side that you can then synchronize on in the ptdump side. But you would
>>>>> need to be in the read-side critical section when you sample the static key, yet
>>>>> you can't sleep waiting for the mmap lock while in the critical section. This
>>>>> feels solvable, and there is almost certainly a well-used pattern, but I'm not
>>>>> quite sure what the answer is. Perhaps others can help...
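One shape that pattern could take, sketched here on the assumption that the
unlocked clear is cheap enough to sit inside the RCU read-side section and
that the sleeping mmap lock is only ever taken outside it (illustrative only,
not the posted patch):

/* vmalloc side: sample the key under RCU, clear unlocked on the fast path */
rcu_read_lock();
if (!static_branch_unlikely(&ptdump_lock_key)) {
        pud_clear(pudp);                        /* fast path, no lock */
        __flush_tlb_kernel_pgtable(addr);
        rcu_read_unlock();
} else {
        rcu_read_unlock();
        mmap_read_lock(&init_mm);               /* serialise against the walk */
        pud_clear(pudp);
        __flush_tlb_kernel_pgtable(addr);
        mmap_read_unlock(&init_mm);
}

/* ptdump side */
static_branch_enable(&ptdump_lock_key);
synchronize_rcu();      /* wait out clearers that sampled the key as false */
ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);    /* takes init_mm's mmap write lock */
static_branch_disable(&ptdump_lock_key);

Any clearer that sampled the key as false completes its clear before
synchronize_rcu() returns, so the walk observes the empty PUD; any clearer
that samples it as true serialises on the mmap lock instead.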
>>>> Just taking a step back here, I found the "percpu rw semaphore". From the
>>>> documentation:
>>>>
>>>> """
>>>> Percpu rw semaphores is a new read-write semaphore design that is
>>>> optimized for locking for reading.
>>>>
>>>> The problem with traditional read-write semaphores is that when multiple
>>>> cores take the lock for reading, the cache line containing the semaphore
>>>> is bouncing between L1 caches of the cores, causing performance
>>>> degradation.
>>>>
>>>> Locking for reading is very fast, it uses RCU and it avoids any atomic
>>>> instruction in the lock and unlock path. On the other hand, locking for
>>>> writing is very expensive, it calls synchronize_rcu() that can take
>>>> hundreds of milliseconds.
>>>> """
>>>>
>>>> Perhaps this provides the properties we are looking for? Could just define one
>>>> of these and lock it in read mode around pXd_clear() on the vmalloc side. Then
>>>> lock it in write mode around ptdump_walk_pgd() on the ptdump side. No need for
>>>> a static key or other hoops. Given it's a dedicated lock, there is no risk of
>>>> accidental contention because no other code is using it.
>>>>
>>> The write-lock is indeed super expensive; as you noted, it blocks on
>>> synchronize_rcu(). If that write-lock interferes with a critical
>>> vmalloc fast path, where the read-lock would have to be injected, then
>>> it is definitely a problem.
>> I have a question - is this pmd_free_pte_page/pud_free_pmd_page part of
>> a fast path?
>>
> <snip>
> vmalloc()
> __vmalloc_node_range_noprof()
> __vmalloc_area_node()
> vmap_pages_range();
> vmap_pages_range_noflush()
> __vmap_pages_range_noflush()
> vmap_range_noflush()
> vmap_p4d_range()
> vmap_try_huge_p4d()
> if (p4d_present(*p4d) && !p4d_free_pud_page(p4d, addr))
> <snip>
I meant, how often is it that we actually overwrite the page tables of an old
vmalloc object? pmd_free_pte_page() and friends will only be a hot path if this
overwriting happens frequently; that was my question.
>
> The point is, we would like to avoid any long-sleeping primitive or
> introduce any new bottlenecks that would make vmalloc less scalable
> or slower.
>
> I reacted to the synchronize_rcu() and rw-semaphores because they make
> the current context enter a sleeping state, i.e. waiting on
> wait_for_completion(). Also, we would like to exclude any sleeping if
> possible at all, for example for the GFP_ATOMIC and GFP_NOWAIT flags
> support that I am currently looking at.
>
> --
> Uladzislau Rezki
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3] arm64: Enable vmalloc-huge with ptdump
2025-06-16 10:33 [PATCH v3] arm64: Enable vmalloc-huge with ptdump Dev Jain
2025-06-16 15:00 ` David Hildenbrand
2025-06-16 18:07 ` Ryan Roberts
@ 2025-06-25 10:35 ` Ryan Roberts
2025-06-25 11:12 ` Dev Jain
2 siblings, 1 reply; 19+ messages in thread
From: Ryan Roberts @ 2025-06-25 10:35 UTC (permalink / raw)
To: Dev Jain, catalin.marinas, will
Cc: anshuman.khandual, quic_zhenhuah, kevin.brodsky, yangyicong,
joey.gouly, linux-arm-kernel, linux-kernel, david
On 16/06/2025 11:33, Dev Jain wrote:
> arm64 disables vmalloc-huge when kernel page table dumping is enabled,
> because an intermediate table may be removed, potentially causing the
> ptdump code to dereference an invalid address. We want to be able to
> analyze block vs page mappings for kernel mappings with ptdump, so to
> enable vmalloc-huge with ptdump, synchronize between page table removal in
> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
> use mmap_read_lock and not write lock because we don't need to synchronize
> between two different vm_structs; two vmalloc objects running this same
> code path will point to different page tables, hence there is no race.
>
> For pud_free_pmd_page(), we isolate the PMD table to avoid taking the lock
> 512 times again via pmd_free_pte_page().
>
> We implement the locking mechanism using static keys, since the chance
> of a race is very small. Observe that the synchronization is needed
> to avoid the following race:
>
> CPU1 CPU2
> take reference of PMD table
> pud_clear()
> pte_free_kernel()
> walk freed PMD table
>
> and similar race between pmd_free_pte_page and ptdump_walk_pgd.
>
> Therefore, there are two cases: if ptdump sees the cleared PUD, then
> we are safe. If not, then the patched-in read and write locks help us
> avoid the race.
>
> To implement the mechanism, we need the static key access from mmu.c and
> ptdump.c. Note that in case !CONFIG_PTDUMP_DEBUGFS, ptdump.o won't be a
> target in the Makefile, therefore we cannot initialize the key there, as
> is being done, for example, in the static key implementation of
> hugetlb-vmemmap. Therefore, include asm/cpufeature.h, which includes
> the jump_label mechanism. Declare the key there and define the key to false
> in mmu.c.
>
> No issues were observed with mm-selftests. No issues were observed while
> parallelly running test_vmalloc.sh and dumping the kernel pagetable through
> sysfs in a loop.
>
> v2->v3:
> - Use static key mechanism
>
> v1->v2:
> - Take lock only when CONFIG_PTDUMP_DEBUGFS is on
> - In case of pud_free_pmd_page(), isolate the PMD table to avoid taking
> the lock 512 times again via pmd_free_pte_page()
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> arch/arm64/include/asm/cpufeature.h | 1 +
> arch/arm64/mm/mmu.c | 51 ++++++++++++++++++++++++++---
> arch/arm64/mm/ptdump.c | 5 +++
> 3 files changed, 53 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
> index c4326f1cb917..3e386563b587 100644
> --- a/arch/arm64/include/asm/cpufeature.h
> +++ b/arch/arm64/include/asm/cpufeature.h
> @@ -26,6 +26,7 @@
> #include <linux/kernel.h>
> #include <linux/cpumask.h>
>
> +DECLARE_STATIC_KEY_FALSE(ptdump_lock_key);
Is this really the correct header file for this declaration? Perhaps it would be
better in arch/arm64/include/asm/ptdump.h ?
> /*
> * CPU feature register tracking
> *
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 8fcf59ba39db..e242ba428820 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -41,11 +41,14 @@
> #include <asm/tlbflush.h>
> #include <asm/pgalloc.h>
> #include <asm/kfence.h>
> +#include <asm/cpufeature.h>
>
> #define NO_BLOCK_MAPPINGS BIT(0)
> #define NO_CONT_MAPPINGS BIT(1)
> #define NO_EXEC_MAPPINGS BIT(2) /* assumes FEAT_HPDS is not used */
>
> +DEFINE_STATIC_KEY_FALSE(ptdump_lock_key);
> +
> enum pgtable_type {
> TABLE_PTE,
> TABLE_PMD,
> @@ -1267,8 +1270,9 @@ int pmd_clear_huge(pmd_t *pmdp)
> return 1;
> }
>
> -int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
> +static int __pmd_free_pte_page(pmd_t *pmdp, unsigned long addr, bool lock)
> {
> + bool lock_taken = false;
As David commented, no need for this.
> pte_t *table;
> pmd_t pmd;
>
> @@ -1279,15 +1283,29 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
> return 1;
> }
>
> + /* See comment in pud_free_pmd_page for static key logic */
> table = pte_offset_kernel(pmdp, addr);
> pmd_clear(pmdp);
> __flush_tlb_kernel_pgtable(addr);
> + if (static_branch_unlikely(&ptdump_lock_key) && lock) {
> + mmap_read_lock(&init_mm);
> + lock_taken = true;
> + }
> + if (unlikely(lock_taken))
> + mmap_read_unlock(&init_mm);
> +
As per David's comment this can just be:
if (static_branch_unlikely(&ptdump_lock_key) && lock) {
mmap_read_lock(&init_mm);
mmap_read_unlock(&init_mm);
}
> pte_free_kernel(NULL, table);
> return 1;
> }
>
> +int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
> +{
> + return __pmd_free_pte_page(pmdp, addr, true);
> +}
> +
> int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
> {
> + bool lock_taken = false;
Same comment.
> pmd_t *table;
> pmd_t *pmdp;
> pud_t pud;
> @@ -1301,15 +1319,40 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
> }
>
> table = pmd_offset(pudp, addr);
> + /*
> + * Isolate the PMD table; in case of race with ptdump, this helps
> + * us to avoid taking the lock in __pmd_free_pte_page().
> + *
> + * Static key logic:
> + *
> + * Case 1: If ptdump does static_branch_enable(), and after that we
> + * execute the if block, then this patches in the read lock, ptdump has
> + * the write lock patched in, therefore ptdump will never read from
> + * a potentially freed PMD table.
> + *
> + * Case 2: If the if block starts executing before ptdump's
> + * static_branch_enable(), then no locking synchronization
> + * will be done. However, pud_clear() + the dsb() in
> + * __flush_tlb_kernel_pgtable will ensure that ptdump observes an
> + * empty PUD. Thus, it will never walk over a potentially freed
> + * PMD table.
> + */
> + pud_clear(pudp);
> + __flush_tlb_kernel_pgtable(addr);
> + if (static_branch_unlikely(&ptdump_lock_key)) {
> + mmap_read_lock(&init_mm);
> + lock_taken = true;
> + }
> + if (unlikely(lock_taken))
> + mmap_read_unlock(&init_mm);
Same comment.
> +
> pmdp = table;
> next = addr;
> end = addr + PUD_SIZE;
> do {
> - pmd_free_pte_page(pmdp, next);
> + __pmd_free_pte_page(pmdp, next, false);
> } while (pmdp++, next += PMD_SIZE, next != end);
>
> - pud_clear(pudp);
> - __flush_tlb_kernel_pgtable(addr);
> pmd_free(NULL, table);
> return 1;
> }
> diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
> index 421a5de806c6..f75e12a1d068 100644
> --- a/arch/arm64/mm/ptdump.c
> +++ b/arch/arm64/mm/ptdump.c
> @@ -25,6 +25,7 @@
> #include <asm/pgtable-hwdef.h>
> #include <asm/ptdump.h>
>
> +#include <asm/cpufeature.h>
>
> #define pt_dump_seq_printf(m, fmt, args...) \
> ({ \
> @@ -311,7 +312,9 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
> }
> };
>
> + static_branch_enable(&ptdump_lock_key);
> ptdump_walk_pgd(&st.ptdump, info->mm, NULL);
> + static_branch_disable(&ptdump_lock_key);
> }
>
> static void __init ptdump_initialize(void)
> @@ -353,7 +356,9 @@ bool ptdump_check_wx(void)
> }
> };
>
> + static_branch_enable(&ptdump_lock_key);
> ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
> + static_branch_disable(&ptdump_lock_key);
>
> if (st.wx_pages || st.uxn_pages) {
> pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found, %lu non-UXN pages found\n",
With the improvements as suggested, LGTM:
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3] arm64: Enable vmalloc-huge with ptdump
2025-06-25 10:35 ` Ryan Roberts
@ 2025-06-25 11:12 ` Dev Jain
2025-06-25 11:16 ` Ryan Roberts
0 siblings, 1 reply; 19+ messages in thread
From: Dev Jain @ 2025-06-25 11:12 UTC (permalink / raw)
To: Ryan Roberts, catalin.marinas, will
Cc: anshuman.khandual, quic_zhenhuah, kevin.brodsky, yangyicong,
joey.gouly, linux-arm-kernel, linux-kernel, david
On 25/06/25 4:05 pm, Ryan Roberts wrote:
> On 16/06/2025 11:33, Dev Jain wrote:
>> arm64 disables vmalloc-huge when kernel page table dumping is enabled,
>> because an intermediate table may be removed, potentially causing the
>> ptdump code to dereference an invalid address. We want to be able to
>> analyze block vs page mappings for kernel mappings with ptdump, so to
>> enable vmalloc-huge with ptdump, synchronize between page table removal in
>> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
>> use mmap_read_lock and not write lock because we don't need to synchronize
>> between two different vm_structs; two vmalloc objects running this same
>> code path will point to different page tables, hence there is no race.
>>
>> For pud_free_pmd_page(), we isolate the PMD table to avoid taking the lock
>> 512 times again via pmd_free_pte_page().
>>
>> We implement the locking mechanism using static keys, since the chance
>> of a race is very small. Observe that the synchronization is needed
>> to avoid the following race:
>>
>> CPU1 CPU2
>> take reference of PMD table
>> pud_clear()
>> pte_free_kernel()
>> walk freed PMD table
>>
>> and similar race between pmd_free_pte_page and ptdump_walk_pgd.
>>
>> Therefore, there are two cases: if ptdump sees the cleared PUD, then
>> we are safe. If not, then the patched-in read and write locks help us
>> avoid the race.
>>
>> To implement the mechanism, we need the static key access from mmu.c and
>> ptdump.c. Note that in case !CONFIG_PTDUMP_DEBUGFS, ptdump.o won't be a
>> target in the Makefile, therefore we cannot initialize the key there, as
>> is being done, for example, in the static key implementation of
>> hugetlb-vmemmap. Therefore, include asm/cpufeature.h, which includes
>> the jump_label mechanism. Declare the key there and define the key to false
>> in mmu.c.
>>
>> No issues were observed with mm-selftests. No issues were observed while
>> parallelly running test_vmalloc.sh and dumping the kernel pagetable through
>> sysfs in a loop.
>>
>> v2->v3:
>> - Use static key mechanism
>>
>> v1->v2:
>> - Take lock only when CONFIG_PTDUMP_DEBUGFS is on
>> - In case of pud_free_pmd_page(), isolate the PMD table to avoid taking
>> the lock 512 times again via pmd_free_pte_page()
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> arch/arm64/include/asm/cpufeature.h | 1 +
>> arch/arm64/mm/mmu.c | 51 ++++++++++++++++++++++++++---
>> arch/arm64/mm/ptdump.c | 5 +++
>> 3 files changed, 53 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
>> index c4326f1cb917..3e386563b587 100644
>> --- a/arch/arm64/include/asm/cpufeature.h
>> +++ b/arch/arm64/include/asm/cpufeature.h
>> @@ -26,6 +26,7 @@
>> #include <linux/kernel.h>
>> #include <linux/cpumask.h>
>>
>> +DECLARE_STATIC_KEY_FALSE(ptdump_lock_key);
> Is this really the correct header file for this declaration? Perhaps it would be
> better in arch/arm64/include/asm/ptdump.h ?
I tried a lot of things; this didn't work. I get the following:
ld: arch/arm64/mm/mmu.o:(__jump_table+0x8): undefined reference to `ptdump_lock_key'
ld: arch/arm64/mm/mmu.o:(__jump_table+0x48): undefined reference to `ptdump_lock_key'
in case of !CONFIG_PTDUMP_DEBUGFS.
>
>> /*
>> * CPU feature register tracking
>> *
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 8fcf59ba39db..e242ba428820 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -41,11 +41,14 @@
>> #include <asm/tlbflush.h>
>> #include <asm/pgalloc.h>
>> #include <asm/kfence.h>
>> +#include <asm/cpufeature.h>
>>
>> #define NO_BLOCK_MAPPINGS BIT(0)
>> #define NO_CONT_MAPPINGS BIT(1)
>> #define NO_EXEC_MAPPINGS BIT(2) /* assumes FEAT_HPDS is not used */
>>
>> +DEFINE_STATIC_KEY_FALSE(ptdump_lock_key);
>> +
>> enum pgtable_type {
>> TABLE_PTE,
>> TABLE_PMD,
>> @@ -1267,8 +1270,9 @@ int pmd_clear_huge(pmd_t *pmdp)
>> return 1;
>> }
>>
>> -int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>> +static int __pmd_free_pte_page(pmd_t *pmdp, unsigned long addr, bool lock)
>> {
>> + bool lock_taken = false;
> As David commented, no need for this.
>
>> pte_t *table;
>> pmd_t pmd;
>>
>> @@ -1279,15 +1283,29 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>> return 1;
>> }
>>
>> + /* See comment in pud_free_pmd_page for static key logic */
>> table = pte_offset_kernel(pmdp, addr);
>> pmd_clear(pmdp);
>> __flush_tlb_kernel_pgtable(addr);
>> + if (static_branch_unlikely(&ptdump_lock_key) && lock) {
>> + mmap_read_lock(&init_mm);
>> + lock_taken = true;
>> + }
>> + if (unlikely(lock_taken))
>> + mmap_read_unlock(&init_mm);
>> +
> As per David's comment this can just be:
>
> if (static_branch_unlikely(&ptdump_lock_key) && lock) {
> mmap_read_lock(&init_mm);
> mmap_read_unlock(&init_mm);
> }
>
>> pte_free_kernel(NULL, table);
>> return 1;
>> }
>>
>> +int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>> +{
>> + return __pmd_free_pte_page(pmdp, addr, true);
>> +}
>> +
>> int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>> {
>> + bool lock_taken = false;
> Same comment.
>
>> pmd_t *table;
>> pmd_t *pmdp;
>> pud_t pud;
>> @@ -1301,15 +1319,40 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>> }
>>
>> table = pmd_offset(pudp, addr);
>> + /*
>> + * Isolate the PMD table; in case of race with ptdump, this helps
>> + * us to avoid taking the lock in __pmd_free_pte_page().
>> + *
>> + * Static key logic:
>> + *
>> + * Case 1: If ptdump does static_branch_enable(), and after that we
>> + * execute the if block, then this patches in the read lock, ptdump has
>> + * the write lock patched in, therefore ptdump will never read from
>> + * a potentially freed PMD table.
>> + *
>> + * Case 2: If the if block starts executing before ptdump's
>> + * static_branch_enable(), then no locking synchronization
>> + * will be done. However, pud_clear() + the dsb() in
>> + * __flush_tlb_kernel_pgtable will ensure that ptdump observes an
>> + * empty PUD. Thus, it will never walk over a potentially freed
>> + * PMD table.
>> + */
>> + pud_clear(pudp);
>> + __flush_tlb_kernel_pgtable(addr);
>> + if (static_branch_unlikely(&ptdump_lock_key)) {
>> + mmap_read_lock(&init_mm);
>> + lock_taken = true;
>> + }
>> + if (unlikely(lock_taken))
>> + mmap_read_unlock(&init_mm);
> Same comment.
>
>> +
>> pmdp = table;
>> next = addr;
>> end = addr + PUD_SIZE;
>> do {
>> - pmd_free_pte_page(pmdp, next);
>> + __pmd_free_pte_page(pmdp, next, false);
>> } while (pmdp++, next += PMD_SIZE, next != end);
>>
>> - pud_clear(pudp);
>> - __flush_tlb_kernel_pgtable(addr);
>> pmd_free(NULL, table);
>> return 1;
>> }
>> diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
>> index 421a5de806c6..f75e12a1d068 100644
>> --- a/arch/arm64/mm/ptdump.c
>> +++ b/arch/arm64/mm/ptdump.c
>> @@ -25,6 +25,7 @@
>> #include <asm/pgtable-hwdef.h>
>> #include <asm/ptdump.h>
>>
>> +#include <asm/cpufeature.h>
>>
>> #define pt_dump_seq_printf(m, fmt, args...) \
>> ({ \
>> @@ -311,7 +312,9 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
>> }
>> };
>>
>> + static_branch_enable(&ptdump_lock_key);
>> ptdump_walk_pgd(&st.ptdump, info->mm, NULL);
>> + static_branch_disable(&ptdump_lock_key);
>> }
>>
>> static void __init ptdump_initialize(void)
>> @@ -353,7 +356,9 @@ bool ptdump_check_wx(void)
>> }
>> };
>>
>> + static_branch_enable(&ptdump_lock_key);
>> ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
>> + static_branch_disable(&ptdump_lock_key);
>>
>> if (st.wx_pages || st.uxn_pages) {
>> pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found, %lu non-UXN pages found\n",
>
> With the improvements as suggested, LGTM:
>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3] arm64: Enable vmalloc-huge with ptdump
2025-06-25 11:12 ` Dev Jain
@ 2025-06-25 11:16 ` Ryan Roberts
2025-06-25 11:25 ` Dev Jain
0 siblings, 1 reply; 19+ messages in thread
From: Ryan Roberts @ 2025-06-25 11:16 UTC (permalink / raw)
To: Dev Jain, catalin.marinas, will
Cc: anshuman.khandual, quic_zhenhuah, kevin.brodsky, yangyicong,
joey.gouly, linux-arm-kernel, linux-kernel, david
On 25/06/2025 12:12, Dev Jain wrote:
>
> On 25/06/25 4:05 pm, Ryan Roberts wrote:
>> On 16/06/2025 11:33, Dev Jain wrote:
>>> arm64 disables vmalloc-huge when kernel page table dumping is enabled,
>>> because an intermediate table may be removed, potentially causing the
>>> ptdump code to dereference an invalid address. We want to be able to
>>> analyze block vs page mappings for kernel mappings with ptdump, so to
>>> enable vmalloc-huge with ptdump, synchronize between page table removal in
>>> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
>>> use mmap_read_lock and not write lock because we don't need to synchronize
>>> between two different vm_structs; two vmalloc objects running this same
>>> code path will point to different page tables, hence there is no race.
>>>
>>> For pud_free_pmd_page(), we isolate the PMD table to avoid taking the lock
>>> 512 times again via pmd_free_pte_page().
>>>
>>> We implement the locking mechanism using static keys, since the chance
>>> of a race is very small. Observe that the synchronization is needed
>>> to avoid the following race:
>>>
>>> CPU1 CPU2
>>> take reference of PMD table
>>> pud_clear()
>>> pte_free_kernel()
>>> walk freed PMD table
>>>
>>> and similar race between pmd_free_pte_page and ptdump_walk_pgd.
>>>
>>> Therefore, there are two cases: if ptdump sees the cleared PUD, then
>>> we are safe. If not, then the patched-in read and write locks help us
>>> avoid the race.
>>>
>>> To implement the mechanism, we need the static key access from mmu.c and
>>> ptdump.c. Note that in case !CONFIG_PTDUMP_DEBUGFS, ptdump.o won't be a
>>> target in the Makefile, therefore we cannot initialize the key there, as
>>> is being done, for example, in the static key implementation of
>>> hugetlb-vmemmap. Therefore, include asm/cpufeature.h, which includes
>>> the jump_label mechanism. Declare the key there and define the key to false
>>> in mmu.c.
>>>
>>> No issues were observed with mm-selftests. No issues were observed while
>>> parallelly running test_vmalloc.sh and dumping the kernel pagetable through
>>> sysfs in a loop.
>>>
>>> v2->v3:
>>> - Use static key mechanism
>>>
>>> v1->v2:
>>> - Take lock only when CONFIG_PTDUMP_DEBUGFS is on
>>> - In case of pud_free_pmd_page(), isolate the PMD table to avoid taking
>>> the lock 512 times again via pmd_free_pte_page()
>>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> ---
>>> arch/arm64/include/asm/cpufeature.h | 1 +
>>> arch/arm64/mm/mmu.c | 51 ++++++++++++++++++++++++++---
>>> arch/arm64/mm/ptdump.c | 5 +++
>>> 3 files changed, 53 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/
>>> cpufeature.h
>>> index c4326f1cb917..3e386563b587 100644
>>> --- a/arch/arm64/include/asm/cpufeature.h
>>> +++ b/arch/arm64/include/asm/cpufeature.h
>>> @@ -26,6 +26,7 @@
>>> #include <linux/kernel.h>
>>> #include <linux/cpumask.h>
>>> +DECLARE_STATIC_KEY_FALSE(ptdump_lock_key);
>> Is this really the correct header file for this declaration? Perhaps it would be
>> better in arch/arm64/include/asm/ptdump.h ?
>
> I tried a lot of things; this didn't work. I get the following:
>
> ld: arch/arm64/mm/mmu.o:(__jump_table+0x8): undefined reference to
> `ptdump_lock_key'
> ld: arch/arm64/mm/mmu.o:(__jump_table+0x48): undefined reference to
> `ptdump_lock_key'
>
> in case of !CONFIG_PTDUMP_DEBUGFS.
I think you're talking about *defining*. I'm talking about *declaring*. By all
means define it in mmu.c. Then you always have it. But the declaration doesn't
need to live in cpufeature.h, does it? Why can't it live in ptdump.h? The
declaration is really just a compiler directive; it doesn't control the
emission of any symbols.
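Concretely, the split being suggested would look something like this (sketch):

/* arch/arm64/include/asm/ptdump.h -- declaration only; emits no symbol */
#include <linux/jump_label.h>

DECLARE_STATIC_KEY_FALSE(ptdump_lock_key);

/* arch/arm64/mm/mmu.c -- definition; mmu.o is always built, so the
 * __jump_table references resolve even when ptdump.o is not a target
 */
DEFINE_STATIC_KEY_FALSE(ptdump_lock_key);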
>
>>
>>> /*
>>> * CPU feature register tracking
>>> *
>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>> index 8fcf59ba39db..e242ba428820 100644
>>> --- a/arch/arm64/mm/mmu.c
>>> +++ b/arch/arm64/mm/mmu.c
>>> @@ -41,11 +41,14 @@
>>> #include <asm/tlbflush.h>
>>> #include <asm/pgalloc.h>
>>> #include <asm/kfence.h>
>>> +#include <asm/cpufeature.h>
>>> #define NO_BLOCK_MAPPINGS BIT(0)
>>> #define NO_CONT_MAPPINGS BIT(1)
>>> #define NO_EXEC_MAPPINGS BIT(2) /* assumes FEAT_HPDS is not used */
>>> +DEFINE_STATIC_KEY_FALSE(ptdump_lock_key);
>>> +
>>> enum pgtable_type {
>>> TABLE_PTE,
>>> TABLE_PMD,
>>> @@ -1267,8 +1270,9 @@ int pmd_clear_huge(pmd_t *pmdp)
>>> return 1;
>>> }
>>> -int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>>> +static int __pmd_free_pte_page(pmd_t *pmdp, unsigned long addr, bool lock)
>>> {
>>> + bool lock_taken = false;
>> As David commented, no need for this.
>>
>>> pte_t *table;
>>> pmd_t pmd;
>>> @@ -1279,15 +1283,29 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>>> return 1;
>>> }
>>> + /* See comment in pud_free_pmd_page for static key logic */
>>> table = pte_offset_kernel(pmdp, addr);
>>> pmd_clear(pmdp);
>>> __flush_tlb_kernel_pgtable(addr);
>>> + if (static_branch_unlikely(&ptdump_lock_key) && lock) {
>>> + mmap_read_lock(&init_mm);
>>> + lock_taken = true;
>>> + }
>>> + if (unlikely(lock_taken))
>>> + mmap_read_unlock(&init_mm);
>>> +
>> As per David's comment this can just be:
>>
>> if (static_branch_unlikely(&ptdump_lock_key) && lock) {
>> mmap_read_lock(&init_mm);
>> mmap_read_unlock(&init_mm);
>> }
>>
>>> pte_free_kernel(NULL, table);
>>> return 1;
>>> }
>>> +int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>>> +{
>>> + return __pmd_free_pte_page(pmdp, addr, true);
>>> +}
>>> +
>>> int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>> {
>>> + bool lock_taken = false;
>> Same comment.
>>
>>> pmd_t *table;
>>> pmd_t *pmdp;
>>> pud_t pud;
>>> @@ -1301,15 +1319,40 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>> }
>>> table = pmd_offset(pudp, addr);
>>> + /*
>>> + * Isolate the PMD table; in case of race with ptdump, this helps
>>> + * us to avoid taking the lock in __pmd_free_pte_page().
>>> + *
>>> + * Static key logic:
>>> + *
>>> + * Case 1: If ptdump does static_branch_enable(), and after that we
>>> + * execute the if block, then this patches in the read lock, ptdump has
>>> + * the write lock patched in, therefore ptdump will never read from
>>> + * a potentially freed PMD table.
>>> + *
>>> + * Case 2: If the if block starts executing before ptdump's
>>> + * static_branch_enable(), then no locking synchronization
>>> + * will be done. However, pud_clear() + the dsb() in
>>> + * __flush_tlb_kernel_pgtable will ensure that ptdump observes an
>>> + * empty PUD. Thus, it will never walk over a potentially freed
>>> + * PMD table.
>>> + */
>>> + pud_clear(pudp);
>>> + __flush_tlb_kernel_pgtable(addr);
>>> + if (static_branch_unlikely(&ptdump_lock_key)) {
>>> + mmap_read_lock(&init_mm);
>>> + lock_taken = true;
>>> + }
>>> + if (unlikely(lock_taken))
>>> + mmap_read_unlock(&init_mm);
>> Same comment.
>>
>>> +
>>> pmdp = table;
>>> next = addr;
>>> end = addr + PUD_SIZE;
>>> do {
>>> - pmd_free_pte_page(pmdp, next);
>>> + __pmd_free_pte_page(pmdp, next, false);
>>> } while (pmdp++, next += PMD_SIZE, next != end);
>>> - pud_clear(pudp);
>>> - __flush_tlb_kernel_pgtable(addr);
>>> pmd_free(NULL, table);
>>> return 1;
>>> }
>>> diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
>>> index 421a5de806c6..f75e12a1d068 100644
>>> --- a/arch/arm64/mm/ptdump.c
>>> +++ b/arch/arm64/mm/ptdump.c
>>> @@ -25,6 +25,7 @@
>>> #include <asm/pgtable-hwdef.h>
>>> #include <asm/ptdump.h>
>>> +#include <asm/cpufeature.h>
>>> #define pt_dump_seq_printf(m, fmt, args...) \
>>> ({ \
>>> @@ -311,7 +312,9 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info
>>> *info)
>>> }
>>> };
>>> + static_branch_enable(&ptdump_lock_key);
>>> ptdump_walk_pgd(&st.ptdump, info->mm, NULL);
>>> + static_branch_disable(&ptdump_lock_key);
>>> }
>>> static void __init ptdump_initialize(void)
>>> @@ -353,7 +356,9 @@ bool ptdump_check_wx(void)
>>> }
>>> };
>>> + static_branch_enable(&ptdump_lock_key);
>>> ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
>>> + static_branch_disable(&ptdump_lock_key);
>>> if (st.wx_pages || st.uxn_pages) {
>>> pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found, %lu
>>> non-UXN pages found\n",
>>
>> With the improvements as suggested, LGTM:
>>
>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>
>>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3] arm64: Enable vmalloc-huge with ptdump
2025-06-25 11:16 ` Ryan Roberts
@ 2025-06-25 11:25 ` Dev Jain
0 siblings, 0 replies; 19+ messages in thread
From: Dev Jain @ 2025-06-25 11:25 UTC (permalink / raw)
To: Ryan Roberts, catalin.marinas, will
Cc: anshuman.khandual, quic_zhenhuah, kevin.brodsky, yangyicong,
joey.gouly, linux-arm-kernel, linux-kernel, david
On 25/06/25 4:46 pm, Ryan Roberts wrote:
> On 25/06/2025 12:12, Dev Jain wrote:
>> On 25/06/25 4:05 pm, Ryan Roberts wrote:
>>> On 16/06/2025 11:33, Dev Jain wrote:
>>>> arm64 disables vmalloc-huge when kernel page table dumping is enabled,
>>>> because an intermediate table may be removed, potentially causing the
>>>> ptdump code to dereference an invalid address. We want to be able to
>>>> analyze block vs page mappings for kernel mappings with ptdump, so to
>>>> enable vmalloc-huge with ptdump, synchronize between page table removal in
>>>> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
>>>> use mmap_read_lock and not write lock because we don't need to synchronize
>>>> between two different vm_structs; two vmalloc objects running this same
>>>> code path will point to different page tables, hence there is no race.
>>>>
>>>> For pud_free_pmd_page(), we isolate the PMD table to avoid taking the lock
>>>> 512 times again via pmd_free_pte_page().
>>>>
>>>> We implement the locking mechanism using static keys, since the chance
>>>> of a race is very small. Observe that the synchronization is needed
>>>> to avoid the following race:
>>>>
>>>> CPU1 CPU2
>>>> take reference of PMD table
>>>> pud_clear()
>>>> pte_free_kernel()
>>>> walk freed PMD table
>>>>
>>>> and similar race between pmd_free_pte_page and ptdump_walk_pgd.
>>>>
>>>> Therefore, there are two cases: if ptdump sees the cleared PUD, then
>>>> we are safe. If not, then the patched-in read and write locks help us
>>>> avoid the race.
>>>>
>>>> To implement the mechanism, we need the static key access from mmu.c and
>>>> ptdump.c. Note that in case !CONFIG_PTDUMP_DEBUGFS, ptdump.o won't be a
>>>> target in the Makefile, therefore we cannot initialize the key there, as
>>>> is being done, for example, in the static key implementation of
>>>> hugetlb-vmemmap. Therefore, include asm/cpufeature.h, which includes
>>>> the jump_label mechanism. Declare the key there and define the key to false
>>>> in mmu.c.
>>>>
>>>> No issues were observed with mm-selftests. No issues were observed while
>>>> parallelly running test_vmalloc.sh and dumping the kernel pagetable through
>>>> sysfs in a loop.
>>>>
>>>> v2->v3:
>>>> - Use static key mechanism
>>>>
>>>> v1->v2:
>>>> - Take lock only when CONFIG_PTDUMP_DEBUGFS is on
>>>> - In case of pud_free_pmd_page(), isolate the PMD table to avoid taking
>>>> the lock 512 times again via pmd_free_pte_page()
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>> arch/arm64/include/asm/cpufeature.h | 1 +
>>>> arch/arm64/mm/mmu.c | 51 ++++++++++++++++++++++++++---
>>>> arch/arm64/mm/ptdump.c | 5 +++
>>>> 3 files changed, 53 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/
>>>> cpufeature.h
>>>> index c4326f1cb917..3e386563b587 100644
>>>> --- a/arch/arm64/include/asm/cpufeature.h
>>>> +++ b/arch/arm64/include/asm/cpufeature.h
>>>> @@ -26,6 +26,7 @@
>>>> #include <linux/kernel.h>
>>>> #include <linux/cpumask.h>
>>>> +DECLARE_STATIC_KEY_FALSE(ptdump_lock_key);
>>> Is this really the correct header file for this declaration? Perhaps it would be
>>> better in arch/arm64/include/asm/ptdump.h ?
>> I tried a lot of things; this didn't work. I get the following:
>>
>> ld: arch/arm64/mm/mmu.o:(__jump_table+0x8): undefined reference to
>> `ptdump_lock_key'
>> ld: arch/arm64/mm/mmu.o:(__jump_table+0x48): undefined reference to
>> `ptdump_lock_key'
>>
>> in case of !CONFIG_PTDUMP_DEBUGFS.
> I think you're talking about *defining*. I'm talking about *declaring*. By all
> means define it in mmu.c. Then you always have it. But the declaration doesn't
> need to live in cpufeature.h, does it? Why can't it live in ptdump.h? The
> declaration is really just a compiler directive; it doesn't control the
> emission of any symbols.
Ah, I tried again and it worked. Not sure how I missed this combination
initially; I hit a lot of failures and then came up with this convoluted
include scheme :( Thanks!
>
>>>> /*
>>>> * CPU feature register tracking
>>>> *
>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>> index 8fcf59ba39db..e242ba428820 100644
>>>> --- a/arch/arm64/mm/mmu.c
>>>> +++ b/arch/arm64/mm/mmu.c
>>>> @@ -41,11 +41,14 @@
>>>> #include <asm/tlbflush.h>
>>>> #include <asm/pgalloc.h>
>>>> #include <asm/kfence.h>
>>>> +#include <asm/cpufeature.h>
>>>> #define NO_BLOCK_MAPPINGS BIT(0)
>>>> #define NO_CONT_MAPPINGS BIT(1)
>>>> #define NO_EXEC_MAPPINGS BIT(2) /* assumes FEAT_HPDS is not used */
>>>> +DEFINE_STATIC_KEY_FALSE(ptdump_lock_key);
>>>> +
>>>> enum pgtable_type {
>>>> TABLE_PTE,
>>>> TABLE_PMD,
>>>> @@ -1267,8 +1270,9 @@ int pmd_clear_huge(pmd_t *pmdp)
>>>> return 1;
>>>> }
>>>> -int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>>>> +static int __pmd_free_pte_page(pmd_t *pmdp, unsigned long addr, bool lock)
>>>> {
>>>> + bool lock_taken = false;
>>> As David commented, no need for this.
>>>
>>>> pte_t *table;
>>>> pmd_t pmd;
>>>> @@ -1279,15 +1283,29 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>>>> return 1;
>>>> }
>>>> + /* See comment in pud_free_pmd_page for static key logic */
>>>> table = pte_offset_kernel(pmdp, addr);
>>>> pmd_clear(pmdp);
>>>> __flush_tlb_kernel_pgtable(addr);
>>>> + if (static_branch_unlikely(&ptdump_lock_key) && lock) {
>>>> + mmap_read_lock(&init_mm);
>>>> + lock_taken = true;
>>>> + }
>>>> + if (unlikely(lock_taken))
>>>> + mmap_read_unlock(&init_mm);
>>>> +
>>> As per David's comment this can just be:
>>>
>>> if (static_branch_unlikely(&ptdump_lock_key) && lock) {
>>> mmap_read_lock(&init_mm);
>>> mmap_read_unlock(&init_mm);
>>> }
>>>
>>>> pte_free_kernel(NULL, table);
>>>> return 1;
>>>> }
>>>> +int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>>>> +{
>>>> + return __pmd_free_pte_page(pmdp, addr, true);
>>>> +}
>>>> +
>>>> int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>>> {
>>>> + bool lock_taken = false;
>>> Same comment.
>>>
>>>> pmd_t *table;
>>>> pmd_t *pmdp;
>>>> pud_t pud;
>>>> @@ -1301,15 +1319,40 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>>> }
>>>> table = pmd_offset(pudp, addr);
>>>> + /*
>>>> + * Isolate the PMD table; in case of race with ptdump, this helps
>>>> + * us to avoid taking the lock in __pmd_free_pte_page().
>>>> + *
>>>> + * Static key logic:
>>>> + *
>>>> + * Case 1: If ptdump does static_branch_enable(), and after that we
>>>> + * execute the if block, then this patches in the read lock, ptdump has
>>>> + * the write lock patched in, therefore ptdump will never read from
>>>> + * a potentially freed PMD table.
>>>> + *
>>>> + * Case 2: If the if block starts executing before ptdump's
>>>> + * static_branch_enable(), then no locking synchronization
>>>> + * will be done. However, pud_clear() + the dsb() in
>>>> + * __flush_tlb_kernel_pgtable will ensure that ptdump observes an
>>>> + * empty PUD. Thus, it will never walk over a potentially freed
>>>> + * PMD table.
>>>> + */
>>>> + pud_clear(pudp);
>>>> + __flush_tlb_kernel_pgtable(addr);
>>>> + if (static_branch_unlikely(&ptdump_lock_key)) {
>>>> + mmap_read_lock(&init_mm);
>>>> + lock_taken = true;
>>>> + }
>>>> + if (unlikely(lock_taken))
>>>> + mmap_read_unlock(&init_mm);
>>> Same comment.
>>>
>>>> +
>>>> pmdp = table;
>>>> next = addr;
>>>> end = addr + PUD_SIZE;
>>>> do {
>>>> - pmd_free_pte_page(pmdp, next);
>>>> + __pmd_free_pte_page(pmdp, next, false);
>>>> } while (pmdp++, next += PMD_SIZE, next != end);
>>>> - pud_clear(pudp);
>>>> - __flush_tlb_kernel_pgtable(addr);
>>>> pmd_free(NULL, table);
>>>> return 1;
>>>> }
>>>> diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
>>>> index 421a5de806c6..f75e12a1d068 100644
>>>> --- a/arch/arm64/mm/ptdump.c
>>>> +++ b/arch/arm64/mm/ptdump.c
>>>> @@ -25,6 +25,7 @@
>>>> #include <asm/pgtable-hwdef.h>
>>>> #include <asm/ptdump.h>
>>>> +#include <asm/cpufeature.h>
>>>> #define pt_dump_seq_printf(m, fmt, args...) \
>>>> ({ \
>>>> @@ -311,7 +312,9 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info
>>>> *info)
>>>> }
>>>> };
>>>> + static_branch_enable(&ptdump_lock_key);
>>>> ptdump_walk_pgd(&st.ptdump, info->mm, NULL);
>>>> + static_branch_disable(&ptdump_lock_key);
>>>> }
>>>> static void __init ptdump_initialize(void)
>>>> @@ -353,7 +356,9 @@ bool ptdump_check_wx(void)
>>>> }
>>>> };
>>>> + static_branch_enable(&ptdump_lock_key);
>>>> ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
>>>> + static_branch_disable(&ptdump_lock_key);
>>>> if (st.wx_pages || st.uxn_pages) {
>>>> pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found, %lu
>>>> non-UXN pages found\n",
>>> With the improvements as suggested, LGTM:
>>>
>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>>
>>>
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads: [~2025-06-25 11:25 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-16 10:33 [PATCH v3] arm64: Enable vmalloc-huge with ptdump Dev Jain
2025-06-16 15:00 ` David Hildenbrand
2025-06-16 16:34 ` Dev Jain
2025-06-16 18:07 ` Ryan Roberts
2025-06-16 21:20 ` Ryan Roberts
2025-06-17 11:51 ` Uladzislau Rezki
2025-06-18 3:11 ` Dev Jain
2025-06-18 17:19 ` Uladzislau Rezki
2025-06-19 3:13 ` Dev Jain
2025-06-18 11:21 ` Ryan Roberts
2025-06-18 17:19 ` Uladzislau Rezki
2025-06-17 2:54 ` Dev Jain
2025-06-17 3:59 ` Dev Jain
2025-06-17 8:12 ` Ryan Roberts
2025-06-17 8:58 ` Dev Jain
2025-06-25 10:35 ` Ryan Roberts
2025-06-25 11:12 ` Dev Jain
2025-06-25 11:16 ` Ryan Roberts
2025-06-25 11:25 ` Dev Jain