* [RFC PATCH 0/4] mm/arm64: re-enable HVO
@ 2024-08-06 2:21 Yu Zhao
2024-08-06 2:21 ` [RFC PATCH 1/4] mm: HVO: introduce helper function to update and flush pgtable Yu Zhao
` (3 more replies)
0 siblings, 4 replies; 8+ messages in thread
From: Yu Zhao @ 2024-08-06 2:21 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon
Cc: Andrew Morton, David Rientjes, Douglas Anderson,
Frank van der Linden, Mark Rutland, Muchun Song, Nanyong Sun,
Yang Shi, linux-arm-kernel, linux-mm, linux-kernel, Yu Zhao
This series presents one of the previously discussed approaches to
re-enable HugeTLB Vmemmap Optimization (HVO) on arm64. HVO was
disabled by commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable
HUGETLB_PAGE_OPTIMIZE_VMEMMAP") due to the following reason:
This is deemed UNPREDICTABLE by the Arm architecture without a
break-before-make sequence (make the PTE invalid, TLBI, write the
new valid PTE). However, such sequence is not possible since the
vmemmap may be concurrently accessed by the kernel.
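For reference, the break-before-make sequence on a single vmemmap PTE,
written with generic kernel helpers, looks roughly like this (a minimal
sketch only; the helpers arm64 actually uses are introduced in patch 3):

    pte_clear(&init_mm, addr, ptep);                /* make the PTE invalid */
    flush_tlb_kernel_range(addr, addr + PAGE_SIZE); /* TLBI */
    set_pte_at(&init_mm, addr, ptep, new_pte);      /* write the new valid PTE */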
Other approaches that have been discussed include:
A. Handle kernel PF while doing BBM [1],
B. Use stop_machine() while doing BBM [2], and,
C. Enable FEAT_BBM level 2 and keep the memory contents at the old
and new output addresses unchanged to avoid BBM (D8.16.1-2) [3].
A quick comparison between this approach (D) and the above approaches:
--+------------------------------+-----------------------------+
| Pro | Con |
--+------------------------------+-----------------------------+
A | Low latency, h/w independent | Predictability concerns [4] |
B | Predictable, h/w independent | High latency |
C | Predictable, low latency | H/w dependent, complex |
D | Predictable, h/w independent | Medium latency |
--+------------------------------+-----------------------------+
[1] https://lore.kernel.org/20240113094436.2506396-1-sunnanyong@huawei.com/
[2] https://lore.kernel.org/ZbKjHHeEdFYY1xR5@arm.com/
[3] https://lore.kernel.org/Zo68DP6siXfb6ZBR@arm.com/
[4] https://lore.kernel.org/20240326125409.GA9552@willie-the-truck/
Nanyong Sun (2):
mm: HVO: introduce helper function to update and flush pgtable
arm64: mm: Re-enable OPTIMIZE_HUGETLB_VMEMMAP
Yu Zhao (2):
arm64: use IPIs to pause/resume remote CPUs
arm64: pause remote CPUs to update vmemmap
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/pgalloc.h | 55 ++++++++++++++++
arch/arm64/include/asm/smp.h | 3 +
arch/arm64/kernel/smp.c | 110 +++++++++++++++++++++++++++++++
mm/hugetlb_vmemmap.c | 69 +++++++++++++++----
5 files changed, 226 insertions(+), 12 deletions(-)
base-commit: de9c2c66ad8e787abec7c9d7eff4f8c3cdd28aed
--
2.46.0.rc2.264.g509ed76dc8-goog
* [RFC PATCH 1/4] mm: HVO: introduce helper function to update and flush pgtable
2024-08-06 2:21 [RFC PATCH 0/4] mm/arm64: re-enable HVO Yu Zhao
@ 2024-08-06 2:21 ` Yu Zhao
2024-08-06 2:21 ` [RFC PATCH 2/4] arm64: use IPIs to pause/resume remote CPUs Yu Zhao
` (2 subsequent siblings)
3 siblings, 0 replies; 8+ messages in thread
From: Yu Zhao @ 2024-08-06 2:21 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon
Cc: Andrew Morton, David Rientjes, Douglas Anderson,
Frank van der Linden, Mark Rutland, Muchun Song, Nanyong Sun,
Yang Shi, linux-arm-kernel, linux-mm, linux-kernel, Muchun Song,
Yu Zhao
From: Nanyong Sun <sunnanyong@huawei.com>
Add pmd/pte update and TLB flush helper functions for updating page
tables. This refactoring allows each architecture to implement its own
special logic, in preparation for arm64 to follow the necessary
break-before-make sequence when updating page tables.
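As an illustration, an architecture opts in by defining the
corresponding macro and supplying its own inline helper, e.g. (a
simplified sketch; the full arm64 override, which also pauses remote
CPUs, is added later in this series):

    #define vmemmap_update_pte vmemmap_update_pte
    static inline void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
    {
            /* arch-specific break-before-make sequence */
            pte_clear(&init_mm, addr, ptep);
            flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
            set_pte_at(&init_mm, addr, ptep, pte);
    }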
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Yu Zhao <yuzhao@google.com>
---
mm/hugetlb_vmemmap.c | 55 ++++++++++++++++++++++++++++++++++----------
1 file changed, 43 insertions(+), 12 deletions(-)
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 829112b0a914..2dd92e58f304 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -46,6 +46,37 @@ struct vmemmap_remap_walk {
unsigned long flags;
};
+#ifndef vmemmap_update_pmd
+static inline void vmemmap_update_pmd(unsigned long addr,
+ pmd_t *pmdp, pte_t *ptep)
+{
+ pmd_populate_kernel(&init_mm, pmdp, ptep);
+}
+#endif
+
+#ifndef vmemmap_update_pte
+static inline void vmemmap_update_pte(unsigned long addr,
+ pte_t *ptep, pte_t pte)
+{
+ set_pte_at(&init_mm, addr, ptep, pte);
+}
+#endif
+
+#ifndef vmemmap_flush_tlb_all
+static inline void vmemmap_flush_tlb_all(void)
+{
+ flush_tlb_all();
+}
+#endif
+
+#ifndef vmemmap_flush_tlb_range
+static inline void vmemmap_flush_tlb_range(unsigned long start,
+ unsigned long end)
+{
+ flush_tlb_kernel_range(start, end);
+}
+#endif
+
static int vmemmap_split_pmd(pmd_t *pmd, struct page *head, unsigned long start,
struct vmemmap_remap_walk *walk)
{
@@ -81,9 +112,9 @@ static int vmemmap_split_pmd(pmd_t *pmd, struct page *head, unsigned long start,
/* Make pte visible before pmd. See comment in pmd_install(). */
smp_wmb();
- pmd_populate_kernel(&init_mm, pmd, pgtable);
+ vmemmap_update_pmd(start, pmd, pgtable);
if (!(walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH))
- flush_tlb_kernel_range(start, start + PMD_SIZE);
+ vmemmap_flush_tlb_range(start, start + PMD_SIZE);
} else {
pte_free_kernel(&init_mm, pgtable);
}
@@ -171,7 +202,7 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
return ret;
if (walk->remap_pte && !(walk->flags & VMEMMAP_REMAP_NO_TLB_FLUSH))
- flush_tlb_kernel_range(start, end);
+ vmemmap_flush_tlb_range(start, end);
return 0;
}
@@ -220,15 +251,15 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
/*
* Makes sure that preceding stores to the page contents from
- * vmemmap_remap_free() become visible before the set_pte_at()
- * write.
+ * vmemmap_remap_free() become visible before the
+ * vmemmap_update_pte() write.
*/
smp_wmb();
}
entry = mk_pte(walk->reuse_page, pgprot);
list_add(&page->lru, walk->vmemmap_pages);
- set_pte_at(&init_mm, addr, pte, entry);
+ vmemmap_update_pte(addr, pte, entry);
}
/*
@@ -267,10 +298,10 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
/*
* Makes sure that preceding stores to the page contents become visible
- * before the set_pte_at() write.
+ * before the vmemmap_update_pte() write.
*/
smp_wmb();
- set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
+ vmemmap_update_pte(addr, pte, mk_pte(page, pgprot));
}
/**
@@ -536,7 +567,7 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
}
if (restored)
- flush_tlb_all();
+ vmemmap_flush_tlb_all();
if (!ret)
ret = restored;
return ret;
@@ -664,7 +695,7 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
break;
}
- flush_tlb_all();
+ vmemmap_flush_tlb_all();
/* avoid writes from page_ref_add_unless() while folding vmemmap */
synchronize_rcu();
@@ -684,7 +715,7 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
* allowing more vmemmap remaps to occur.
*/
if (ret == -ENOMEM && !list_empty(&vmemmap_pages)) {
- flush_tlb_all();
+ vmemmap_flush_tlb_all();
free_vmemmap_page_list(&vmemmap_pages);
INIT_LIST_HEAD(&vmemmap_pages);
__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages,
@@ -692,7 +723,7 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
}
}
- flush_tlb_all();
+ vmemmap_flush_tlb_all();
free_vmemmap_page_list(&vmemmap_pages);
}
--
2.46.0.rc2.264.g509ed76dc8-goog
* [RFC PATCH 2/4] arm64: use IPIs to pause/resume remote CPUs
2024-08-06 2:21 [RFC PATCH 0/4] mm/arm64: re-enable HVO Yu Zhao
2024-08-06 2:21 ` [RFC PATCH 1/4] mm: HVO: introduce helper function to update and flush pgtable Yu Zhao
@ 2024-08-06 2:21 ` Yu Zhao
2024-08-06 9:12 ` David Hildenbrand
2024-08-08 16:09 ` Doug Anderson
2024-08-06 2:21 ` [RFC PATCH 3/4] arm64: pause remote CPUs to update vmemmap Yu Zhao
2024-08-06 2:21 ` [RFC PATCH 4/4] arm64: mm: Re-enable OPTIMIZE_HUGETLB_VMEMMAP Yu Zhao
3 siblings, 2 replies; 8+ messages in thread
From: Yu Zhao @ 2024-08-06 2:21 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon
Cc: Andrew Morton, David Rientjes, Douglas Anderson,
Frank van der Linden, Mark Rutland, Muchun Song, Nanyong Sun,
Yang Shi, linux-arm-kernel, linux-mm, linux-kernel, Yu Zhao
Use pseudo-NMI IPIs to pause remote CPUs for a short period of time,
and then reliably resume them when the local CPU exits critical
sections that preclude the execution of remote CPUs.
A typical example of such critical sections is BBM on kernel PTEs.
HugeTLB Vmemmap Optimization (HVO) on arm64 was disabled by commit
060a2c92d1b6 ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP")
due to the following reason:
This is deemed UNPREDICTABLE by the Arm architecture without a
break-before-make sequence (make the PTE invalid, TLBI, write the
new valid PTE). However, such sequence is not possible since the
vmemmap may be concurrently accessed by the kernel.
Supporting BBM on kernel PTEs is one of the approaches that can
potentially make arm64 support HVO.
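For illustration, a caller is expected to wrap the BBM critical section
as follows (a sketch of the usage that the next patch adds for vmemmap
PTEs):

    preempt_disable();
    pause_remote_cpus();

    /* BBM: invalidate the PTE, TLBI, then write the new valid PTE */
    pte_clear(&init_mm, addr, ptep);
    flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
    set_pte_at(&init_mm, addr, ptep, pte);

    resume_remote_cpus();
    preempt_enable();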
Signed-off-by: Yu Zhao <yuzhao@google.com>
---
arch/arm64/include/asm/smp.h | 3 +
arch/arm64/kernel/smp.c | 110 +++++++++++++++++++++++++++++++++++
2 files changed, 113 insertions(+)
diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
index 2510eec026f7..cffb0cfed961 100644
--- a/arch/arm64/include/asm/smp.h
+++ b/arch/arm64/include/asm/smp.h
@@ -133,6 +133,9 @@ bool cpus_are_stuck_in_kernel(void);
extern void crash_smp_send_stop(void);
extern bool smp_crash_stop_failed(void);
+void pause_remote_cpus(void);
+void resume_remote_cpus(void);
+
#endif /* ifndef __ASSEMBLY__ */
#endif /* ifndef __ASM_SMP_H */
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 5e18fbcee9a2..aa80266e5c9d 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -68,16 +68,25 @@ enum ipi_msg_type {
IPI_RESCHEDULE,
IPI_CALL_FUNC,
IPI_CPU_STOP,
+ IPI_CPU_PAUSE,
+#ifdef CONFIG_KEXEC_CORE
IPI_CPU_CRASH_STOP,
+#endif
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
IPI_TIMER,
+#endif
+#ifdef CONFIG_IRQ_WORK
IPI_IRQ_WORK,
+#endif
NR_IPI,
/*
* Any enum >= NR_IPI and < MAX_IPI is special and not tracable
* with trace_ipi_*
*/
IPI_CPU_BACKTRACE = NR_IPI,
+#ifdef CONFIG_KGDB
IPI_KGDB_ROUNDUP,
+#endif
MAX_IPI
};
@@ -821,11 +830,20 @@ static const char *ipi_types[MAX_IPI] __tracepoint_string = {
[IPI_RESCHEDULE] = "Rescheduling interrupts",
[IPI_CALL_FUNC] = "Function call interrupts",
[IPI_CPU_STOP] = "CPU stop interrupts",
+ [IPI_CPU_PAUSE] = "CPU pause interrupts",
+#ifdef CONFIG_KEXEC_CORE
[IPI_CPU_CRASH_STOP] = "CPU stop (for crash dump) interrupts",
+#endif
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
[IPI_TIMER] = "Timer broadcast interrupts",
+#endif
+#ifdef CONFIG_IRQ_WORK
[IPI_IRQ_WORK] = "IRQ work interrupts",
+#endif
[IPI_CPU_BACKTRACE] = "CPU backtrace interrupts",
+#ifdef CONFIG_KGDB
[IPI_KGDB_ROUNDUP] = "KGDB roundup interrupts",
+#endif
};
static void smp_cross_call(const struct cpumask *target, unsigned int ipinr);
@@ -884,6 +902,85 @@ void __noreturn panic_smp_self_stop(void)
local_cpu_stop();
}
+static DEFINE_SPINLOCK(cpu_pause_lock);
+static cpumask_t paused_cpus;
+static cpumask_t resumed_cpus;
+
+static void pause_local_cpu(void)
+{
+ int cpu = smp_processor_id();
+
+ cpumask_clear_cpu(cpu, &resumed_cpus);
+ /*
+ * Paired with pause_remote_cpus() to confirm that this CPU not only
+ * will be paused but also can be reliably resumed.
+ */
+ smp_wmb();
+ cpumask_set_cpu(cpu, &paused_cpus);
+ /* A typical example for sleep and wake-up functions. */
+ smp_mb();
+ while (!cpumask_test_cpu(cpu, &resumed_cpus)) {
+ wfe();
+ barrier();
+ }
+ barrier();
+ cpumask_clear_cpu(cpu, &paused_cpus);
+}
+
+void pause_remote_cpus(void)
+{
+ cpumask_t cpus_to_pause;
+
+ lockdep_assert_cpus_held();
+ lockdep_assert_preemption_disabled();
+
+ cpumask_copy(&cpus_to_pause, cpu_online_mask);
+ cpumask_clear_cpu(smp_processor_id(), &cpus_to_pause);
+
+ spin_lock(&cpu_pause_lock);
+
+ WARN_ON_ONCE(!cpumask_empty(&paused_cpus));
+
+ smp_cross_call(&cpus_to_pause, IPI_CPU_PAUSE);
+
+ while (!cpumask_equal(&cpus_to_pause, &paused_cpus)) {
+ cpu_relax();
+ barrier();
+ }
+ /*
+ * Paired with pause_local_cpu() to confirm that all CPUs not only will
+ * be paused but also can be reliably resumed.
+ */
+ smp_rmb();
+ WARN_ON_ONCE(cpumask_intersects(&cpus_to_pause, &resumed_cpus));
+
+ spin_unlock(&cpu_pause_lock);
+}
+
+void resume_remote_cpus(void)
+{
+ cpumask_t cpus_to_resume;
+
+ lockdep_assert_cpus_held();
+ lockdep_assert_preemption_disabled();
+
+ cpumask_copy(&cpus_to_resume, cpu_online_mask);
+ cpumask_clear_cpu(smp_processor_id(), &cpus_to_resume);
+
+ spin_lock(&cpu_pause_lock);
+
+ cpumask_setall(&resumed_cpus);
+ /* A typical example for sleep and wake-up functions. */
+ smp_mb();
+ while (cpumask_intersects(&cpus_to_resume, &paused_cpus)) {
+ sev();
+ cpu_relax();
+ barrier();
+ }
+
+ spin_unlock(&cpu_pause_lock);
+}
+
#ifdef CONFIG_KEXEC_CORE
static atomic_t waiting_for_crash_ipi = ATOMIC_INIT(0);
#endif
@@ -963,6 +1060,11 @@ static void do_handle_IPI(int ipinr)
local_cpu_stop();
break;
+ case IPI_CPU_PAUSE:
+ pause_local_cpu();
+ break;
+
+#ifdef CONFIG_KEXEC_CORE
case IPI_CPU_CRASH_STOP:
if (IS_ENABLED(CONFIG_KEXEC_CORE)) {
ipi_cpu_crash_stop(cpu, get_irq_regs());
@@ -970,6 +1072,7 @@ static void do_handle_IPI(int ipinr)
unreachable();
}
break;
+#endif
#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
case IPI_TIMER:
@@ -991,9 +1094,11 @@ static void do_handle_IPI(int ipinr)
nmi_cpu_backtrace(get_irq_regs());
break;
+#ifdef CONFIG_KGDB
case IPI_KGDB_ROUNDUP:
kgdb_nmicallback(cpu, get_irq_regs());
break;
+#endif
default:
pr_crit("CPU%u: Unknown IPI message 0x%x\n", cpu, ipinr);
@@ -1023,9 +1128,14 @@ static bool ipi_should_be_nmi(enum ipi_msg_type ipi)
switch (ipi) {
case IPI_CPU_STOP:
+ case IPI_CPU_PAUSE:
+#ifdef CONFIG_KEXEC_CORE
case IPI_CPU_CRASH_STOP:
+#endif
case IPI_CPU_BACKTRACE:
+#ifdef CONFIG_KGDB
case IPI_KGDB_ROUNDUP:
+#endif
return true;
default:
return false;
--
2.46.0.rc2.264.g509ed76dc8-goog
* [RFC PATCH 3/4] arm64: pause remote CPUs to update vmemmap
2024-08-06 2:21 [RFC PATCH 0/4] mm/arm64: re-enable HVO Yu Zhao
2024-08-06 2:21 ` [RFC PATCH 1/4] mm: HVO: introduce helper function to update and flush pgtable Yu Zhao
2024-08-06 2:21 ` [RFC PATCH 2/4] arm64: use IPIs to pause/resume remote CPUs Yu Zhao
@ 2024-08-06 2:21 ` Yu Zhao
2024-08-06 5:08 ` Yu Zhao
2024-08-06 2:21 ` [RFC PATCH 4/4] arm64: mm: Re-enable OPTIMIZE_HUGETLB_VMEMMAP Yu Zhao
3 siblings, 1 reply; 8+ messages in thread
From: Yu Zhao @ 2024-08-06 2:21 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon
Cc: Andrew Morton, David Rientjes, Douglas Anderson,
Frank van der Linden, Mark Rutland, Muchun Song, Nanyong Sun,
Yang Shi, linux-arm-kernel, linux-mm, linux-kernel, Yu Zhao
Pause remote CPUs so that the local CPU can follow the proper BBM
sequence to safely update the vmemmap mapping `struct page` areas.
While updating the vmemmap, it is guaranteed that neither the local
CPU nor the remote ones will access the `struct page` area being
updated, and therefore they will not trigger kernel PFs.
Signed-off-by: Yu Zhao <yuzhao@google.com>
---
arch/arm64/include/asm/pgalloc.h | 55 ++++++++++++++++++++++++++++++++
mm/hugetlb_vmemmap.c | 14 ++++++++
2 files changed, 69 insertions(+)
diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h
index 8ff5f2a2579e..1af1aa34a351 100644
--- a/arch/arm64/include/asm/pgalloc.h
+++ b/arch/arm64/include/asm/pgalloc.h
@@ -12,6 +12,7 @@
#include <asm/processor.h>
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
+#include <asm/cpu.h>
#define __HAVE_ARCH_PGD_FREE
#define __HAVE_ARCH_PUD_FREE
@@ -137,4 +138,58 @@ pmd_populate(struct mm_struct *mm, pmd_t *pmdp, pgtable_t ptep)
__pmd_populate(pmdp, page_to_phys(ptep), PMD_TYPE_TABLE | PMD_TABLE_PXN);
}
+#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+
+#define vmemmap_update_lock vmemmap_update_lock
+static inline void vmemmap_update_lock(void)
+{
+ cpus_read_lock();
+}
+
+#define vmemmap_update_unlock vmemmap_update_unlock
+static inline void vmemmap_update_unlock(void)
+{
+ cpus_read_unlock();
+}
+
+#define vmemmap_update_pte vmemmap_update_pte
+static inline void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
+{
+ preempt_disable();
+ pause_remote_cpus();
+
+ pte_clear(&init_mm, addr, ptep);
+ flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+ set_pte_at(&init_mm, addr, ptep, pte);
+
+ resume_remote_cpus();
+ preempt_enable();
+}
+
+#define vmemmap_update_pmd vmemmap_update_pmd
+static inline void vmemmap_update_pmd(unsigned long addr, pmd_t *pmdp, pte_t *ptep)
+{
+ preempt_disable();
+ pause_remote_cpus();
+
+ pmd_clear(pmdp);
+ flush_tlb_kernel_range(addr, addr + PMD_SIZE);
+ pmd_populate_kernel(&init_mm, pmdp, ptep);
+
+ resume_remote_cpus();
+ preempt_enable();
+}
+
+#define vmemmap_flush_tlb_all vmemmap_flush_tlb_all
+static inline void vmemmap_flush_tlb_all(void)
+{
+}
+
+#define vmemmap_flush_tlb_range vmemmap_flush_tlb_range
+static inline void vmemmap_flush_tlb_range(unsigned long start, unsigned long end)
+{
+}
+
+#endif /* CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP */
+
#endif
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 2dd92e58f304..893c73493d9c 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -46,6 +46,18 @@ struct vmemmap_remap_walk {
unsigned long flags;
};
+#ifndef vmemmap_update_lock
+static void vmemmap_update_lock(void)
+{
+}
+#endif
+
+#ifndef vmemmap_update_unlock
+static void vmemmap_update_unlock(void)
+{
+}
+#endif
+
#ifndef vmemmap_update_pmd
static inline void vmemmap_update_pmd(unsigned long addr,
pmd_t *pmdp, pte_t *ptep)
@@ -194,10 +206,12 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
VM_BUG_ON(!PAGE_ALIGNED(start | end));
+ vmemmap_update_lock();
mmap_read_lock(&init_mm);
ret = walk_page_range_novma(&init_mm, start, end, &vmemmap_remap_ops,
NULL, walk);
mmap_read_unlock(&init_mm);
+ vmemmap_update_unlock();
if (ret)
return ret;
--
2.46.0.rc2.264.g509ed76dc8-goog
* [RFC PATCH 4/4] arm64: mm: Re-enable OPTIMIZE_HUGETLB_VMEMMAP
2024-08-06 2:21 [RFC PATCH 0/4] mm/arm64: re-enable HVO Yu Zhao
` (2 preceding siblings ...)
2024-08-06 2:21 ` [RFC PATCH 3/4] arm64: pause remote CPUs to update vmemmap Yu Zhao
@ 2024-08-06 2:21 ` Yu Zhao
3 siblings, 0 replies; 8+ messages in thread
From: Yu Zhao @ 2024-08-06 2:21 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon
Cc: Andrew Morton, David Rientjes, Douglas Anderson,
Frank van der Linden, Mark Rutland, Muchun Song, Nanyong Sun,
Yang Shi, linux-arm-kernel, linux-mm, linux-kernel, Muchun Song,
Yu Zhao
From: Nanyong Sun <sunnanyong@huawei.com>
Now that updates to the vmemmap page tables can safely follow the
break-before-make rule on arm64, re-enable HVO on arm64.
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Yu Zhao <yuzhao@google.com>
---
arch/arm64/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a2f8ff354ca6..25ff026cdaf5 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -105,6 +105,7 @@ config ARM64
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
+ select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
select ARCH_WANT_LD_ORPHAN_WARN
select ARCH_WANTS_EXECMEM_LATE if EXECMEM
select ARCH_WANTS_NO_INSTR
--
2.46.0.rc2.264.g509ed76dc8-goog
* Re: [RFC PATCH 3/4] arm64: pause remote CPUs to update vmemmap
2024-08-06 2:21 ` [RFC PATCH 3/4] arm64: pause remote CPUs to update vmemmap Yu Zhao
@ 2024-08-06 5:08 ` Yu Zhao
0 siblings, 0 replies; 8+ messages in thread
From: Yu Zhao @ 2024-08-06 5:08 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon
Cc: Andrew Morton, David Rientjes, Douglas Anderson,
Frank van der Linden, Mark Rutland, Muchun Song, Nanyong Sun,
Yang Shi, linux-arm-kernel, linux-mm, linux-kernel
On Mon, Aug 5, 2024 at 8:21 PM Yu Zhao <yuzhao@google.com> wrote:
>
> Pause remote CPUs so that the local CPU can follow the proper BBM
> sequence to safely update the vmemmap mapping `struct page` areas.
>
> While updating the vmemmap, it is guaranteed that neither the local
> CPU nor the remote ones will access the `struct page` area being
> updated, and therefore they will not trigger kernel PFs.
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
> arch/arm64/include/asm/pgalloc.h | 55 ++++++++++++++++++++++++++++++++
> mm/hugetlb_vmemmap.c | 14 ++++++++
> 2 files changed, 69 insertions(+)
>
> diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h
> index 8ff5f2a2579e..1af1aa34a351 100644
> --- a/arch/arm64/include/asm/pgalloc.h
> +++ b/arch/arm64/include/asm/pgalloc.h
> @@ -12,6 +12,7 @@
> #include <asm/processor.h>
> #include <asm/cacheflush.h>
> #include <asm/tlbflush.h>
> +#include <asm/cpu.h>
>
> #define __HAVE_ARCH_PGD_FREE
> #define __HAVE_ARCH_PUD_FREE
> @@ -137,4 +138,58 @@ pmd_populate(struct mm_struct *mm, pmd_t *pmdp, pgtable_t ptep)
> __pmd_populate(pmdp, page_to_phys(ptep), PMD_TYPE_TABLE | PMD_TABLE_PXN);
> }
>
> +#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> +
> +#define vmemmap_update_lock vmemmap_update_lock
> +static inline void vmemmap_update_lock(void)
> +{
> + cpus_read_lock();
> +}
> +
> +#define vmemmap_update_unlock vmemmap_update_unlock
> +static inline void vmemmap_update_unlock(void)
> +{
> + cpus_read_unlock();
> +}
> +
> +#define vmemmap_update_pte vmemmap_update_pte
> +static inline void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
> +{
> + preempt_disable();
> + pause_remote_cpus();
> +
> + pte_clear(&init_mm, addr, ptep);
> + flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> + set_pte_at(&init_mm, addr, ptep, pte);
> +
> + resume_remote_cpus();
> + preempt_enable();
> +}
Note that I kept this API from Nanyong for the sake of discussion.
What I actually plan to test in production is:

#define vmemmap_update_pte_range_start vmemmap_update_pte_range_start
static inline void vmemmap_update_pte_range_start(pte_t *pte,
                                                  unsigned long start,
                                                  unsigned long end)
{
        unsigned long addr;

        preempt_disable();
        pause_remote_cpus();

        /* invalidate all PTEs in the range, then do one TLBI over the range */
        for (addr = start; addr != end; addr += PAGE_SIZE, pte++)
                pte_clear(&init_mm, addr, pte);

        flush_tlb_kernel_range(start, end);
}

#define vmemmap_update_pte_range_end vmemmap_update_pte_range_end
static inline void vmemmap_update_pte_range_end(void)
{
        resume_remote_cpus();
        preempt_enable();
}
> +#define vmemmap_update_pmd vmemmap_update_pmd
> +static inline void vmemmap_update_pmd(unsigned long addr, pmd_t *pmdp, pte_t *ptep)
> +{
> + preempt_disable();
> + pause_remote_cpus();
> +
> + pmd_clear(pmdp);
> + flush_tlb_kernel_range(addr, addr + PMD_SIZE);
> + pmd_populate_kernel(&init_mm, pmdp, ptep);
> +
> + resume_remote_cpus();
> + preempt_enable();
> +}
> +
> +#define vmemmap_flush_tlb_all vmemmap_flush_tlb_all
> +static inline void vmemmap_flush_tlb_all(void)
> +{
> +}
> +
> +#define vmemmap_flush_tlb_range vmemmap_flush_tlb_range
> +static inline void vmemmap_flush_tlb_range(unsigned long start, unsigned long end)
> +{
> +}
> +
> +#endif /* CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP */
> +
> #endif
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 2dd92e58f304..893c73493d9c 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -46,6 +46,18 @@ struct vmemmap_remap_walk {
> unsigned long flags;
> };
>
> +#ifndef vmemmap_update_lock
> +static void vmemmap_update_lock(void)
> +{
> +}
> +#endif
> +
> +#ifndef vmemmap_update_unlock
> +static void vmemmap_update_unlock(void)
> +{
> +}
> +#endif
> +
> #ifndef vmemmap_update_pmd
> static inline void vmemmap_update_pmd(unsigned long addr,
> pmd_t *pmdp, pte_t *ptep)
> @@ -194,10 +206,12 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
>
> VM_BUG_ON(!PAGE_ALIGNED(start | end));
>
> + vmemmap_update_lock();
> mmap_read_lock(&init_mm);
> ret = walk_page_range_novma(&init_mm, start, end, &vmemmap_remap_ops,
> NULL, walk);
> mmap_read_unlock(&init_mm);
> + vmemmap_update_unlock();
> if (ret)
> return ret;
>
> --
> 2.46.0.rc2.264.g509ed76dc8-goog
>
* Re: [RFC PATCH 2/4] arm64: use IPIs to pause/resume remote CPUs
2024-08-06 2:21 ` [RFC PATCH 2/4] arm64: use IPIs to pause/resume remote CPUs Yu Zhao
@ 2024-08-06 9:12 ` David Hildenbrand
2024-08-08 16:09 ` Doug Anderson
1 sibling, 0 replies; 8+ messages in thread
From: David Hildenbrand @ 2024-08-06 9:12 UTC (permalink / raw)
To: Yu Zhao, Catalin Marinas, Will Deacon
Cc: Andrew Morton, David Rientjes, Douglas Anderson,
Frank van der Linden, Mark Rutland, Muchun Song, Nanyong Sun,
Yang Shi, linux-arm-kernel, linux-mm, linux-kernel
[...]
> +
> +void resume_remote_cpus(void)
> +{
> + cpumask_t cpus_to_resume;
> +
> + lockdep_assert_cpus_held();
> + lockdep_assert_preemption_disabled();
> +
> + cpumask_copy(&cpus_to_resume, cpu_online_mask);
> + cpumask_clear_cpu(smp_processor_id(), &cpus_to_resume);
> +
> + spin_lock(&cpu_pause_lock);
> +
> + cpumask_setall(&resumed_cpus);
> + /* A typical example for sleep and wake-up functions. */
> + smp_mb();
> + while (cpumask_intersects(&cpus_to_resume, &paused_cpus)) {
> + sev();
> + cpu_relax();
> + barrier();
> + }
>
I'm curious: is there a fundamental reason why we wait for paused CPUs
to actually start running, or is it simply easier to get the
implementation race-free, in particular when we have two
pause_remote_cpus() calls shortly after each other and another remote
CPU might still be on its way out of pause_local_cpu() from the first
pause?
--
Cheers,
David / dhildenb
* Re: [RFC PATCH 2/4] arm64: use IPIs to pause/resume remote CPUs
2024-08-06 2:21 ` [RFC PATCH 2/4] arm64: use IPIs to pause/resume remote CPUs Yu Zhao
2024-08-06 9:12 ` David Hildenbrand
@ 2024-08-08 16:09 ` Doug Anderson
1 sibling, 0 replies; 8+ messages in thread
From: Doug Anderson @ 2024-08-08 16:09 UTC (permalink / raw)
To: Yu Zhao
Cc: Catalin Marinas, Will Deacon, Andrew Morton, David Rientjes,
Frank van der Linden, Mark Rutland, Muchun Song, Nanyong Sun,
Yang Shi, linux-arm-kernel, linux-mm, linux-kernel
Hi,
On Mon, Aug 5, 2024 at 7:21 PM Yu Zhao <yuzhao@google.com> wrote:
>
> Use pseudo-NMI IPIs to pause remote CPUs for a short period of time,
> and then reliably resume them when the local CPU exits critical
> sections that preclude the execution of remote CPUs.
>
> A typical example of such critical sections is BBM on kernel PTEs.
> HugeTLB Vmemmap Optimization (HVO) on arm64 was disabled by commit
> 060a2c92d1b6 ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP")
> due to the folllowing reason:
>
> This is deemed UNPREDICTABLE by the Arm architecture without a
> break-before-make sequence (make the PTE invalid, TLBI, write the
> new valid PTE). However, such sequence is not possible since the
> vmemmap may be concurrently accessed by the kernel.
>
> Supporting BBM on kernel PTEs is one of the approaches that can
> potentially make arm64 support HVO.
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
> arch/arm64/include/asm/smp.h | 3 +
> arch/arm64/kernel/smp.c | 110 +++++++++++++++++++++++++++++++++++
> 2 files changed, 113 insertions(+)
I'm a bit curious how your approach is reliable / performant in all
cases. As far as I understand it:
1. Patch #4 in your series unconditionally turns on
"ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP" for arm64.
2. In order for it to work reliably, you need the "pause all CPUs"
functionality introduced in this patch.
3. In order for the "pause all CPUs" functionality to be performant
you need NMI or, at least, pseudo-NMI to be used to pause all CPUs.
4. Even when you configure the kernel for pseudo-NMI it's not 100%
guaranteed that pseudo-NMI will be turned on. Specifically:
4a) There's an extra kernel command line parameter you need to
actually enable pseudo-NMI. We can debate about the inability to turn
on pseudo-NMI without the command line parameter, but at the moment
it's there because pseudo-NMI has some performance implications.
Apparently these performance implications are more non-trivial on some
early arm64 CPUs.
4b) Even if we changed it so that the command-line parameter wasn't
needed, there are still some boards out there that are known not to be
able to enable pseudo-NMI. There are certainly some Mediatek
Chromebooks that have a BIOS bug making pseudo-NMI unreliable. See the
`mediatek,broken-save-restore-fw` device tree property. ...and even if
you ignore the Mediatek Chromebooks, there's at least one more system
I know of that's broken with pseudo-NMI. Since you're at Google, you
could look at b/308278090 for details but the quick summary is that
some devices running a TEE are hanging when pseudo NMI is enabled.
...and, even if that's fixed, it feels somewhat likely that there are
other systems where pseudo-NMI won't be usable.
Unless I'm misunderstanding, it feels like anything you have that
relies on NMI/pseudo-NMI needs to fall back safely/reliably if
NMI/pseudo-NMI isn't there.
> diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
> index 2510eec026f7..cffb0cfed961 100644
> --- a/arch/arm64/include/asm/smp.h
> +++ b/arch/arm64/include/asm/smp.h
> @@ -133,6 +133,9 @@ bool cpus_are_stuck_in_kernel(void);
> extern void crash_smp_send_stop(void);
> extern bool smp_crash_stop_failed(void);
>
> +void pause_remote_cpus(void);
> +void resume_remote_cpus(void);
> +
> #endif /* ifndef __ASSEMBLY__ */
>
> #endif /* ifndef __ASM_SMP_H */
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 5e18fbcee9a2..aa80266e5c9d 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -68,16 +68,25 @@ enum ipi_msg_type {
> IPI_RESCHEDULE,
> IPI_CALL_FUNC,
> IPI_CPU_STOP,
> + IPI_CPU_PAUSE,
> +#ifdef CONFIG_KEXEC_CORE
> IPI_CPU_CRASH_STOP,
> +#endif
> +#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
> IPI_TIMER,
> +#endif
> +#ifdef CONFIG_IRQ_WORK
> IPI_IRQ_WORK,
> +#endif
I assume all these "ifdefs" are there because this adds up to more
than 8 IPIs. That means that someone wouldn't be able to enable all of
these things, right? Feels like we'd want to solve this before landing
things. In the least it would be good if this built upon:
https://lore.kernel.org/r/20240625160718.v2.1.Id4817adef610302554b8aa42b090d57270dc119c@changeid/
...and then maybe we could figure out if there are other ways to
consolidate NMIs. Previously, for instance, we had the "KGDB" and
"backtrace" IPIs combined into one but we split them upon review
feedback. If necessary they would probably be easy to re-combine.