* [PATCH AUTOSEL 5.15 02/12] powerpc/mm: Fix null-pointer dereference in pgtable_cache_add
2024-01-15 23:26 [PATCH AUTOSEL 5.15 01/12] watch_queue: fix kcalloc() arguments order Sasha Levin
@ 2024-01-15 23:26 ` Sasha Levin
2024-01-15 23:26 ` [PATCH AUTOSEL 5.15 03/12] arm64: irq: set the correct node for VMAP stack Sasha Levin
` (9 subsequent siblings)
10 siblings, 0 replies; 27+ messages in thread
From: Sasha Levin @ 2024-01-15 23:26 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Kunwu Chan, Christophe Leroy, Michael Ellerman, Sasha Levin,
linuxppc-dev
From: Kunwu Chan <chentao@kylinos.cn>
[ Upstream commit f46c8a75263f97bda13c739ba1c90aced0d3b071 ]
kasprintf() returns a pointer to dynamically allocated memory
which can be NULL upon failure. Ensure the allocation was successful
by checking the pointer validity.
Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20231204023223.2447523-1-chentao@kylinos.cn
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
arch/powerpc/mm/init-common.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/mm/init-common.c b/arch/powerpc/mm/init-common.c
index 3a82f89827a5..4bc8f0c893a2 100644
--- a/arch/powerpc/mm/init-common.c
+++ b/arch/powerpc/mm/init-common.c
@@ -105,7 +105,7 @@ void pgtable_cache_add(unsigned int shift)
* as to leave enough 0 bits in the address to contain it. */
unsigned long minalign = max(MAX_PGTABLE_INDEX_SIZE + 1,
HUGEPD_SHIFT_MASK + 1);
- struct kmem_cache *new;
+ struct kmem_cache *new = NULL;
/* It would be nice if this was a BUILD_BUG_ON(), but at the
* moment, gcc doesn't seem to recognize is_power_of_2 as a
@@ -118,7 +118,8 @@ void pgtable_cache_add(unsigned int shift)
align = max_t(unsigned long, align, minalign);
name = kasprintf(GFP_KERNEL, "pgtable-2^%d", shift);
- new = kmem_cache_create(name, table_size, align, 0, ctor(shift));
+ if (name)
+ new = kmem_cache_create(name, table_size, align, 0, ctor(shift));
if (!new)
panic("Could not allocate pgtable cache for order %d", shift);
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread* [PATCH AUTOSEL 5.15 03/12] arm64: irq: set the correct node for VMAP stack
2024-01-15 23:26 [PATCH AUTOSEL 5.15 01/12] watch_queue: fix kcalloc() arguments order Sasha Levin
2024-01-15 23:26 ` [PATCH AUTOSEL 5.15 02/12] powerpc/mm: Fix null-pointer dereference in pgtable_cache_add Sasha Levin
@ 2024-01-15 23:26 ` Sasha Levin
2024-01-15 23:26 ` [PATCH AUTOSEL 5.15 04/12] drivers/perf: pmuv3: don't expose SW_INCR event in sysfs Sasha Levin
` (8 subsequent siblings)
10 siblings, 0 replies; 27+ messages in thread
From: Sasha Levin @ 2024-01-15 23:26 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Huang Shijie, Catalin Marinas, Will Deacon, Sasha Levin, gregkh,
linux-arm-kernel, linux-arch
From: Huang Shijie <shijie@os.amperecomputing.com>
[ Upstream commit 75b5e0bf90bffaca4b1f19114065dc59f5cc161f ]
In current code, init_irq_stacks() will call cpu_to_node().
The cpu_to_node() depends on percpu "numa_node" which is initialized in:
arch_call_rest_init() --> rest_init() -- kernel_init()
--> kernel_init_freeable() --> smp_prepare_cpus()
But init_irq_stacks() is called in init_IRQ() which is before
arch_call_rest_init().
So in init_irq_stacks(), the cpu_to_node() does not work, it
always return 0. In NUMA, it makes the node 1 cpu accesses the IRQ stack which
is in the node 0.
This patch fixes it by:
1.) export the early_cpu_to_node(), and use it in the init_irq_stacks().
2.) change init_irq_stacks() to __init function.
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20231124031513.81548-1-shijie@os.amperecomputing.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
arch/arm64/kernel/irq.c | 5 +++--
drivers/base/arch_numa.c | 2 +-
include/asm-generic/numa.h | 2 ++
3 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
index bda49430c9ea..dab45f19df49 100644
--- a/arch/arm64/kernel/irq.c
+++ b/arch/arm64/kernel/irq.c
@@ -19,6 +19,7 @@
#include <linux/kprobes.h>
#include <linux/scs.h>
#include <linux/seq_file.h>
+#include <asm/numa.h>
#include <linux/vmalloc.h>
#include <asm/daifflags.h>
#include <asm/vmap_stack.h>
@@ -48,13 +49,13 @@ static void init_irq_scs(void)
}
#ifdef CONFIG_VMAP_STACK
-static void init_irq_stacks(void)
+static void __init init_irq_stacks(void)
{
int cpu;
unsigned long *p;
for_each_possible_cpu(cpu) {
- p = arch_alloc_vmap_stack(IRQ_STACK_SIZE, cpu_to_node(cpu));
+ p = arch_alloc_vmap_stack(IRQ_STACK_SIZE, early_cpu_to_node(cpu));
per_cpu(irq_stack_ptr, cpu) = p;
}
}
diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index 00fb4120a5b3..bce0902dccb4 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -144,7 +144,7 @@ void __init early_map_cpu_to_node(unsigned int cpu, int nid)
unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
EXPORT_SYMBOL(__per_cpu_offset);
-static int __init early_cpu_to_node(int cpu)
+int __init early_cpu_to_node(int cpu)
{
return cpu_to_node_map[cpu];
}
diff --git a/include/asm-generic/numa.h b/include/asm-generic/numa.h
index 1a3ad6d29833..c32e0cf23c90 100644
--- a/include/asm-generic/numa.h
+++ b/include/asm-generic/numa.h
@@ -35,6 +35,7 @@ int __init numa_add_memblk(int nodeid, u64 start, u64 end);
void __init numa_set_distance(int from, int to, int distance);
void __init numa_free_distance(void);
void __init early_map_cpu_to_node(unsigned int cpu, int nid);
+int __init early_cpu_to_node(int cpu);
void numa_store_cpu_info(unsigned int cpu);
void numa_add_cpu(unsigned int cpu);
void numa_remove_cpu(unsigned int cpu);
@@ -46,6 +47,7 @@ static inline void numa_add_cpu(unsigned int cpu) { }
static inline void numa_remove_cpu(unsigned int cpu) { }
static inline void arch_numa_init(void) { }
static inline void early_map_cpu_to_node(unsigned int cpu, int nid) { }
+static inline int early_cpu_to_node(int cpu) { return 0; }
#endif /* CONFIG_NUMA */
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread* [PATCH AUTOSEL 5.15 04/12] drivers/perf: pmuv3: don't expose SW_INCR event in sysfs
2024-01-15 23:26 [PATCH AUTOSEL 5.15 01/12] watch_queue: fix kcalloc() arguments order Sasha Levin
2024-01-15 23:26 ` [PATCH AUTOSEL 5.15 02/12] powerpc/mm: Fix null-pointer dereference in pgtable_cache_add Sasha Levin
2024-01-15 23:26 ` [PATCH AUTOSEL 5.15 03/12] arm64: irq: set the correct node for VMAP stack Sasha Levin
@ 2024-01-15 23:26 ` Sasha Levin
2024-01-15 23:26 ` [PATCH AUTOSEL 5.15 05/12] powerpc: Fix build error due to is_valid_bugaddr() Sasha Levin
` (7 subsequent siblings)
10 siblings, 0 replies; 27+ messages in thread
From: Sasha Levin @ 2024-01-15 23:26 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Mark Rutland, Will Deacon, Sasha Levin, peterz, mingo, acme,
catalin.marinas, linux-arm-kernel, linux-perf-users
From: Mark Rutland <mark.rutland@arm.com>
[ Upstream commit ca6f537e459e2da4b331fe8928d1a0b0f9301f42 ]
The SW_INCR event is somewhat unusual, and depends on the specific HW
counter that it is programmed into. When programmed into PMEVCNTR<n>,
SW_INCR will count any writes to PMSWINC_EL0 with bit n set, ignoring
writes to SW_INCR with bit n clear.
Event rotation means that there's no fixed relationship between
perf_events and HW counters, so this isn't all that useful.
Further, we program PMUSERENR.{SW,EN}=={0,0}, which causes EL0 writes to
PMSWINC_EL0 to be trapped and handled as UNDEFINED, resulting in a
SIGILL to userspace.
Given that, it's not a good idea to expose SW_INCR in sysfs. Hide it as
we did for CHAIN back in commit:
4ba2578fa7b55701 ("arm64: perf: don't expose CHAIN event in sysfs")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20231204115847.2993026-1-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
arch/arm64/kernel/perf_event.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index b4044469527e..c77b9460d63e 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -168,7 +168,11 @@ armv8pmu_events_sysfs_show(struct device *dev,
PMU_EVENT_ATTR_ID(name, armv8pmu_events_sysfs_show, config)
static struct attribute *armv8_pmuv3_event_attrs[] = {
- ARMV8_EVENT_ATTR(sw_incr, ARMV8_PMUV3_PERFCTR_SW_INCR),
+ /*
+ * Don't expose the sw_incr event in /sys. It's not usable as writes to
+ * PMSWINC_EL0 will trap as PMUSERENR.{SW,EN}=={0,0} and event rotation
+ * means we don't have a fixed event<->counter relationship regardless.
+ */
ARMV8_EVENT_ATTR(l1i_cache_refill, ARMV8_PMUV3_PERFCTR_L1I_CACHE_REFILL),
ARMV8_EVENT_ATTR(l1i_tlb_refill, ARMV8_PMUV3_PERFCTR_L1I_TLB_REFILL),
ARMV8_EVENT_ATTR(l1d_cache_refill, ARMV8_PMUV3_PERFCTR_L1D_CACHE_REFILL),
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread* [PATCH AUTOSEL 5.15 05/12] powerpc: Fix build error due to is_valid_bugaddr()
2024-01-15 23:26 [PATCH AUTOSEL 5.15 01/12] watch_queue: fix kcalloc() arguments order Sasha Levin
` (2 preceding siblings ...)
2024-01-15 23:26 ` [PATCH AUTOSEL 5.15 04/12] drivers/perf: pmuv3: don't expose SW_INCR event in sysfs Sasha Levin
@ 2024-01-15 23:26 ` Sasha Levin
2024-01-15 23:26 ` [PATCH AUTOSEL 5.15 06/12] powerpc/mm: Fix build failures due to arch_reserved_kernel_pages() Sasha Levin
` (6 subsequent siblings)
10 siblings, 0 replies; 27+ messages in thread
From: Sasha Levin @ 2024-01-15 23:26 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Michael Ellerman, Sasha Levin, christophe.leroy, bgray, naveen,
arnd, linuxppc-dev
From: Michael Ellerman <mpe@ellerman.id.au>
[ Upstream commit f8d3555355653848082c351fa90775214fb8a4fa ]
With CONFIG_GENERIC_BUG=n the build fails with:
arch/powerpc/kernel/traps.c:1442:5: error: no previous prototype for ‘is_valid_bugaddr’ [-Werror=missing-prototypes]
1442 | int is_valid_bugaddr(unsigned long addr)
| ^~~~~~~~~~~~~~~~
The prototype is only defined, and the function is only needed, when
CONFIG_GENERIC_BUG=y, so move the implementation under that.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20231130114433.3053544-2-mpe@ellerman.id.au
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
arch/powerpc/kernel/traps.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index fe912983ced9..f1a2a75c4757 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -1423,10 +1423,12 @@ static int emulate_instruction(struct pt_regs *regs)
return -EINVAL;
}
+#ifdef CONFIG_GENERIC_BUG
int is_valid_bugaddr(unsigned long addr)
{
return is_kernel_addr(addr);
}
+#endif
#ifdef CONFIG_MATH_EMULATION
static int emulate_math(struct pt_regs *regs)
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread* [PATCH AUTOSEL 5.15 06/12] powerpc/mm: Fix build failures due to arch_reserved_kernel_pages()
2024-01-15 23:26 [PATCH AUTOSEL 5.15 01/12] watch_queue: fix kcalloc() arguments order Sasha Levin
` (3 preceding siblings ...)
2024-01-15 23:26 ` [PATCH AUTOSEL 5.15 05/12] powerpc: Fix build error due to is_valid_bugaddr() Sasha Levin
@ 2024-01-15 23:26 ` Sasha Levin
2024-01-15 23:26 ` [PATCH AUTOSEL 5.15 07/12] powerpc/64s: Fix CONFIG_NUMA=n build due to create_section_mapping() Sasha Levin
` (5 subsequent siblings)
10 siblings, 0 replies; 27+ messages in thread
From: Sasha Levin @ 2024-01-15 23:26 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Michael Ellerman, Sasha Levin, christophe.leroy, linuxppc-dev
From: Michael Ellerman <mpe@ellerman.id.au>
[ Upstream commit d8c3f243d4db24675b653f0568bb65dae34e6455 ]
With NUMA=n and FA_DUMP=y or PRESERVE_FA_DUMP=y the build fails with:
arch/powerpc/kernel/fadump.c:1739:22: error: no previous prototype for ‘arch_reserved_kernel_pages’ [-Werror=missing-prototypes]
1739 | unsigned long __init arch_reserved_kernel_pages(void)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
The prototype for arch_reserved_kernel_pages() is in include/linux/mm.h,
but it's guarded by __HAVE_ARCH_RESERVED_KERNEL_PAGES. The powerpc
headers define __HAVE_ARCH_RESERVED_KERNEL_PAGES in asm/mmzone.h, which
is not included into the generic headers when NUMA=n.
Move the definition of __HAVE_ARCH_RESERVED_KERNEL_PAGES into asm/mmu.h
which is included regardless of NUMA=n.
Additionally the ifdef around __HAVE_ARCH_RESERVED_KERNEL_PAGES needs to
also check for CONFIG_PRESERVE_FA_DUMP.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20231130114433.3053544-1-mpe@ellerman.id.au
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
arch/powerpc/include/asm/mmu.h | 4 ++++
arch/powerpc/include/asm/mmzone.h | 3 ---
2 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
index 8abe8e42e045..de0bb77f54e0 100644
--- a/arch/powerpc/include/asm/mmu.h
+++ b/arch/powerpc/include/asm/mmu.h
@@ -416,5 +416,9 @@ extern void *abatron_pteptrs[2];
#include <asm/nohash/mmu.h>
#endif
+#if defined(CONFIG_FA_DUMP) || defined(CONFIG_PRESERVE_FA_DUMP)
+#define __HAVE_ARCH_RESERVED_KERNEL_PAGES
+#endif
+
#endif /* __KERNEL__ */
#endif /* _ASM_POWERPC_MMU_H_ */
diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
index 4c6c6dbd182f..3764d3585d30 100644
--- a/arch/powerpc/include/asm/mmzone.h
+++ b/arch/powerpc/include/asm/mmzone.h
@@ -42,9 +42,6 @@ u64 memory_hotplug_max(void);
#else
#define memory_hotplug_max() memblock_end_of_DRAM()
#endif /* CONFIG_NUMA */
-#ifdef CONFIG_FA_DUMP
-#define __HAVE_ARCH_RESERVED_KERNEL_PAGES
-#endif
#ifdef CONFIG_MEMORY_HOTPLUG
extern int create_section_mapping(unsigned long start, unsigned long end,
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread* [PATCH AUTOSEL 5.15 07/12] powerpc/64s: Fix CONFIG_NUMA=n build due to create_section_mapping()
2024-01-15 23:26 [PATCH AUTOSEL 5.15 01/12] watch_queue: fix kcalloc() arguments order Sasha Levin
` (4 preceding siblings ...)
2024-01-15 23:26 ` [PATCH AUTOSEL 5.15 06/12] powerpc/mm: Fix build failures due to arch_reserved_kernel_pages() Sasha Levin
@ 2024-01-15 23:26 ` Sasha Levin
2024-01-15 23:26 ` [PATCH AUTOSEL 5.15 08/12] x86/boot: Ignore NMIs during very early boot Sasha Levin
` (4 subsequent siblings)
10 siblings, 0 replies; 27+ messages in thread
From: Sasha Levin @ 2024-01-15 23:26 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Michael Ellerman, Sasha Levin, christophe.leroy, linuxppc-dev
From: Michael Ellerman <mpe@ellerman.id.au>
[ Upstream commit ede66cd22441820cbd399936bf84fdc4294bc7fa ]
With CONFIG_NUMA=n the build fails with:
arch/powerpc/mm/book3s64/pgtable.c:275:15: error: no previous prototype for ‘create_section_mapping’ [-Werror=missing-prototypes]
275 | int __meminit create_section_mapping(unsigned long start, unsigned long end,
| ^~~~~~~~~~~~~~~~~~~~~~
That happens because the prototype for create_section_mapping() is in
asm/mmzone.h, but asm/mmzone.h is only included by linux/mmzone.h
when CONFIG_NUMA=y.
In fact the prototype is only needed by arch/powerpc/mm code, so move
the prototype into arch/powerpc/mm/mmu_decl.h, which also fixes the
build error.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20231129131919.2528517-5-mpe@ellerman.id.au
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
arch/powerpc/include/asm/mmzone.h | 5 -----
arch/powerpc/mm/mmu_decl.h | 5 +++++
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
index 3764d3585d30..da827d2d0866 100644
--- a/arch/powerpc/include/asm/mmzone.h
+++ b/arch/powerpc/include/asm/mmzone.h
@@ -43,10 +43,5 @@ u64 memory_hotplug_max(void);
#define memory_hotplug_max() memblock_end_of_DRAM()
#endif /* CONFIG_NUMA */
-#ifdef CONFIG_MEMORY_HOTPLUG
-extern int create_section_mapping(unsigned long start, unsigned long end,
- int nid, pgprot_t prot);
-#endif
-
#endif /* __KERNEL__ */
#endif /* _ASM_MMZONE_H_ */
diff --git a/arch/powerpc/mm/mmu_decl.h b/arch/powerpc/mm/mmu_decl.h
index dd1cabc2ea0f..21996b9e0a64 100644
--- a/arch/powerpc/mm/mmu_decl.h
+++ b/arch/powerpc/mm/mmu_decl.h
@@ -190,3 +190,8 @@ static inline bool debug_pagealloc_enabled_or_kfence(void)
{
return IS_ENABLED(CONFIG_KFENCE) || debug_pagealloc_enabled();
}
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+int create_section_mapping(unsigned long start, unsigned long end,
+ int nid, pgprot_t prot);
+#endif
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread* [PATCH AUTOSEL 5.15 08/12] x86/boot: Ignore NMIs during very early boot
2024-01-15 23:26 [PATCH AUTOSEL 5.15 01/12] watch_queue: fix kcalloc() arguments order Sasha Levin
` (5 preceding siblings ...)
2024-01-15 23:26 ` [PATCH AUTOSEL 5.15 07/12] powerpc/64s: Fix CONFIG_NUMA=n build due to create_section_mapping() Sasha Levin
@ 2024-01-15 23:26 ` Sasha Levin
2024-01-15 23:26 ` [PATCH AUTOSEL 5.15 09/12] powerpc: pmd_move_must_withdraw() is only needed for CONFIG_TRANSPARENT_HUGEPAGE Sasha Levin
` (3 subsequent siblings)
10 siblings, 0 replies; 27+ messages in thread
From: Sasha Levin @ 2024-01-15 23:26 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Jun'ichi Nomura, Derek Barbosa, Ingo Molnar, Kees Cook,
Linus Torvalds, Paul E . McKenney, Andy Lutomirski,
H. Peter Anvin, Peter Zijlstra, Sasha Levin, tglx, mingo, bp,
dave.hansen, x86, kirill.shutemov, david, ardb, nikunj,
thomas.lendacky
From: Jun'ichi Nomura <junichi.nomura@nec.com>
[ Upstream commit 78a509fba9c9b1fcb77f95b7c6be30da3d24823a ]
When there are two racing NMIs on x86, the first NMI invokes NMI handler and
the 2nd NMI is latched until IRET is executed.
If panic on NMI and panic kexec are enabled, the first NMI triggers
panic and starts booting the next kernel via kexec. Note that the 2nd
NMI is still latched. During the early boot of the next kernel, once
an IRET is executed as a result of a page fault, then the 2nd NMI is
unlatched and invokes the NMI handler.
However, NMI handler is not set up at the early stage of boot, which
results in a boot failure.
Avoid such problems by setting up a NOP handler for early NMIs.
[ mingo: Refined the changelog. ]
Signed-off-by: Jun'ichi Nomura <junichi.nomura@nec.com>
Signed-off-by: Derek Barbosa <debarbos@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
arch/x86/boot/compressed/ident_map_64.c | 5 +++++
arch/x86/boot/compressed/idt_64.c | 1 +
arch/x86/boot/compressed/idt_handlers_64.S | 1 +
arch/x86/boot/compressed/misc.h | 1 +
4 files changed, 8 insertions(+)
diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
index 575d881ff86e..b72dea92cd96 100644
--- a/arch/x86/boot/compressed/ident_map_64.c
+++ b/arch/x86/boot/compressed/ident_map_64.c
@@ -357,3 +357,8 @@ void do_boot_page_fault(struct pt_regs *regs, unsigned long error_code)
*/
add_identity_map(address, end);
}
+
+void do_boot_nmi_trap(struct pt_regs *regs, unsigned long error_code)
+{
+ /* Empty handler to ignore NMI during early boot */
+}
diff --git a/arch/x86/boot/compressed/idt_64.c b/arch/x86/boot/compressed/idt_64.c
index 9b93567d663a..9620883485ac 100644
--- a/arch/x86/boot/compressed/idt_64.c
+++ b/arch/x86/boot/compressed/idt_64.c
@@ -45,6 +45,7 @@ void load_stage2_idt(void)
boot_idt_desc.address = (unsigned long)boot_idt;
set_idt_entry(X86_TRAP_PF, boot_page_fault);
+ set_idt_entry(X86_TRAP_NMI, boot_nmi_trap);
#ifdef CONFIG_AMD_MEM_ENCRYPT
set_idt_entry(X86_TRAP_VC, boot_stage2_vc);
diff --git a/arch/x86/boot/compressed/idt_handlers_64.S b/arch/x86/boot/compressed/idt_handlers_64.S
index 22890e199f5b..4d03c8562f63 100644
--- a/arch/x86/boot/compressed/idt_handlers_64.S
+++ b/arch/x86/boot/compressed/idt_handlers_64.S
@@ -70,6 +70,7 @@ SYM_FUNC_END(\name)
.code64
EXCEPTION_HANDLER boot_page_fault do_boot_page_fault error_code=1
+EXCEPTION_HANDLER boot_nmi_trap do_boot_nmi_trap error_code=0
#ifdef CONFIG_AMD_MEM_ENCRYPT
EXCEPTION_HANDLER boot_stage1_vc do_vc_no_ghcb error_code=1
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 31139256859f..cea1b96c48b2 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -163,6 +163,7 @@ static inline void cleanup_exception_handling(void) { }
/* IDT Entry Points */
void boot_page_fault(void);
+void boot_nmi_trap(void);
void boot_stage1_vc(void);
void boot_stage2_vc(void);
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread* [PATCH AUTOSEL 5.15 09/12] powerpc: pmd_move_must_withdraw() is only needed for CONFIG_TRANSPARENT_HUGEPAGE
2024-01-15 23:26 [PATCH AUTOSEL 5.15 01/12] watch_queue: fix kcalloc() arguments order Sasha Levin
` (6 preceding siblings ...)
2024-01-15 23:26 ` [PATCH AUTOSEL 5.15 08/12] x86/boot: Ignore NMIs during very early boot Sasha Levin
@ 2024-01-15 23:26 ` Sasha Levin
2024-01-15 23:26 ` [PATCH AUTOSEL 5.15 10/12] powerpc/lib: Validate size for vector operations Sasha Levin
` (2 subsequent siblings)
10 siblings, 0 replies; 27+ messages in thread
From: Sasha Levin @ 2024-01-15 23:26 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Stephen Rothwell, Michael Ellerman, Sasha Levin, akpm,
aneesh.kumar, vishal.moola, arnd, christophe.leroy, linuxppc-dev
From: Stephen Rothwell <sfr@canb.auug.org.au>
[ Upstream commit 0d555b57ee660d8a871781c0eebf006e855e918d ]
The linux-next build of powerpc64 allnoconfig fails with:
arch/powerpc/mm/book3s64/pgtable.c:557:5: error: no previous prototype for 'pmd_move_must_withdraw'
557 | int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
| ^~~~~~~~~~~~~~~~~~~~~~
Caused by commit:
c6345dfa6e3e ("Makefile.extrawarn: turn on missing-prototypes globally")
Fix it by moving the function definition under
CONFIG_TRANSPARENT_HUGEPAGE like the prototype. The function is only
called when CONFIG_TRANSPARENT_HUGEPAGE=y.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
[mpe: Flesh out change log from linux-next patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20231127132809.45c2b398@canb.auug.org.au
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
arch/powerpc/mm/book3s64/pgtable.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 9e16c7b1a6c5..da15f28c7b13 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -459,6 +459,7 @@ void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr,
set_pte_at(vma->vm_mm, addr, ptep, pte);
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
* For hash translation mode, we use the deposited table to store hash slot
* information and they are stored at PTRS_PER_PMD offset from related pmd
@@ -480,6 +481,7 @@ int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
return true;
}
+#endif
/*
* Does the CPU support tlbie?
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread* [PATCH AUTOSEL 5.15 10/12] powerpc/lib: Validate size for vector operations
2024-01-15 23:26 [PATCH AUTOSEL 5.15 01/12] watch_queue: fix kcalloc() arguments order Sasha Levin
` (7 preceding siblings ...)
2024-01-15 23:26 ` [PATCH AUTOSEL 5.15 09/12] powerpc: pmd_move_must_withdraw() is only needed for CONFIG_TRANSPARENT_HUGEPAGE Sasha Levin
@ 2024-01-15 23:26 ` Sasha Levin
2024-01-15 23:26 ` [PATCH AUTOSEL 5.15 11/12] x86/barrier: Do not serialize MSR accesses on AMD Sasha Levin
2024-01-15 23:26 ` [PATCH AUTOSEL 5.15 12/12] x86/mce: Mark fatal MCE's page as poison to avoid panic in the kdump kernel Sasha Levin
10 siblings, 0 replies; 27+ messages in thread
From: Sasha Levin @ 2024-01-15 23:26 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Naveen N Rao, Gustavo A . R . Silva, Michael Ellerman,
Sasha Levin, christophe.leroy, linuxppc-dev
From: Naveen N Rao <naveen@kernel.org>
[ Upstream commit 8f9abaa6d7de0a70fc68acaedce290c1f96e2e59 ]
Some of the fp/vmx code in sstep.c assume a certain maximum size for the
instructions being emulated. The size of those operations however is
determined separately in analyse_instr().
Add a check to validate the assumption on the maximum size of the
operations, so as to prevent any unintended kernel stack corruption.
Signed-off-by: Naveen N Rao <naveen@kernel.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Build-tested-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20231123071705.397625-1-naveen@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
arch/powerpc/lib/sstep.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index 2d39b7c246e3..ecc2e06854d7 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -529,6 +529,8 @@ static int do_fp_load(struct instruction_op *op, unsigned long ea,
} u;
nb = GETSIZE(op->type);
+ if (nb > sizeof(u))
+ return -EINVAL;
if (!address_ok(regs, ea, nb))
return -EFAULT;
rn = op->reg;
@@ -579,6 +581,8 @@ static int do_fp_store(struct instruction_op *op, unsigned long ea,
} u;
nb = GETSIZE(op->type);
+ if (nb > sizeof(u))
+ return -EINVAL;
if (!address_ok(regs, ea, nb))
return -EFAULT;
rn = op->reg;
@@ -623,6 +627,9 @@ static nokprobe_inline int do_vec_load(int rn, unsigned long ea,
u8 b[sizeof(__vector128)];
} u = {};
+ if (size > sizeof(u))
+ return -EINVAL;
+
if (!address_ok(regs, ea & ~0xfUL, 16))
return -EFAULT;
/* align to multiple of size */
@@ -650,6 +657,9 @@ static nokprobe_inline int do_vec_store(int rn, unsigned long ea,
u8 b[sizeof(__vector128)];
} u;
+ if (size > sizeof(u))
+ return -EINVAL;
+
if (!address_ok(regs, ea & ~0xfUL, 16))
return -EFAULT;
/* align to multiple of size */
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread* [PATCH AUTOSEL 5.15 11/12] x86/barrier: Do not serialize MSR accesses on AMD
2024-01-15 23:26 [PATCH AUTOSEL 5.15 01/12] watch_queue: fix kcalloc() arguments order Sasha Levin
` (8 preceding siblings ...)
2024-01-15 23:26 ` [PATCH AUTOSEL 5.15 10/12] powerpc/lib: Validate size for vector operations Sasha Levin
@ 2024-01-15 23:26 ` Sasha Levin
2024-11-28 11:59 ` Borislav Petkov
2024-01-15 23:26 ` [PATCH AUTOSEL 5.15 12/12] x86/mce: Mark fatal MCE's page as poison to avoid panic in the kdump kernel Sasha Levin
10 siblings, 1 reply; 27+ messages in thread
From: Sasha Levin @ 2024-01-15 23:26 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Borislav Petkov (AMD), Sasha Levin, tglx, mingo, dave.hansen, x86,
puwen, seanjc, kim.phillips, jmattson, babu.moger, peterz,
rick.p.edgecombe, brgerst, ashok.raj, mjguzik, jpoimboe,
nik.borisov, aik, vegard.nossum, daniel.sneddon, acdunlap
From: "Borislav Petkov (AMD)" <bp@alien8.de>
[ Upstream commit 04c3024560d3a14acd18d0a51a1d0a89d29b7eb5 ]
AMD does not have the requirement for a synchronization barrier when
acccessing a certain group of MSRs. Do not incur that unnecessary
penalty there.
There will be a CPUID bit which explicitly states that a MFENCE is not
needed. Once that bit is added to the APM, this will be extended with
it.
While at it, move to processor.h to avoid include hell. Untangling that
file properly is a matter for another day.
Some notes on the performance aspect of why this is relevant, courtesy
of Kishon VijayAbraham <Kishon.VijayAbraham@amd.com>:
On a AMD Zen4 system with 96 cores, a modified ipi-bench[1] on a VM
shows x2AVIC IPI rate is 3% to 4% lower than AVIC IPI rate. The
ipi-bench is modified so that the IPIs are sent between two vCPUs in the
same CCX. This also requires to pin the vCPU to a physical core to
prevent any latencies. This simulates the use case of pinning vCPUs to
the thread of a single CCX to avoid interrupt IPI latency.
In order to avoid run-to-run variance (for both x2AVIC and AVIC), the
below configurations are done:
1) Disable Power States in BIOS (to prevent the system from going to
lower power state)
2) Run the system at fixed frequency 2500MHz (to prevent the system
from increasing the frequency when the load is more)
With the above configuration:
*) Performance measured using ipi-bench for AVIC:
Average Latency: 1124.98ns [Time to send IPI from one vCPU to another vCPU]
Cumulative throughput: 42.6759M/s [Total number of IPIs sent in a second from
48 vCPUs simultaneously]
*) Performance measured using ipi-bench for x2AVIC:
Average Latency: 1172.42ns [Time to send IPI from one vCPU to another vCPU]
Cumulative throughput: 40.9432M/s [Total number of IPIs sent in a second from
48 vCPUs simultaneously]
From above, x2AVIC latency is ~4% more than AVIC. However, the expectation is
x2AVIC performance to be better or equivalent to AVIC. Upon analyzing
the perf captures, it is observed significant time is spent in
weak_wrmsr_fence() invoked by x2apic_send_IPI().
With the fix to skip weak_wrmsr_fence()
*) Performance measured using ipi-bench for x2AVIC:
Average Latency: 1117.44ns [Time to send IPI from one vCPU to another vCPU]
Cumulative throughput: 42.9608M/s [Total number of IPIs sent in a second from
48 vCPUs simultaneously]
Comparing the performance of x2AVIC with and without the fix, it can be seen
the performance improves by ~4%.
Performance captured using an unmodified ipi-bench using the 'mesh-ipi' option
with and without weak_wrmsr_fence() on a Zen4 system also showed significant
performance improvement without weak_wrmsr_fence(). The 'mesh-ipi' option ignores
CCX or CCD and just picks random vCPU.
Average throughput (10 iterations) with weak_wrmsr_fence(),
Cumulative throughput: 4933374 IPI/s
Average throughput (10 iterations) without weak_wrmsr_fence(),
Cumulative throughput: 6355156 IPI/s
[1] https://github.com/bytedance/kvm-utils/tree/master/microbenchmark/ipi-bench
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230622095212.20940-1-bp@alien8.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
arch/x86/include/asm/barrier.h | 18 ------------------
arch/x86/include/asm/cpufeatures.h | 2 +-
arch/x86/include/asm/processor.h | 18 ++++++++++++++++++
arch/x86/kernel/cpu/amd.c | 3 +++
arch/x86/kernel/cpu/common.c | 7 +++++++
arch/x86/kernel/cpu/hygon.c | 3 +++
6 files changed, 32 insertions(+), 19 deletions(-)
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 3ba772a69cc8..dab2db15a8c4 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -81,22 +81,4 @@ do { \
#include <asm-generic/barrier.h>
-/*
- * Make previous memory operations globally visible before
- * a WRMSR.
- *
- * MFENCE makes writes visible, but only affects load/store
- * instructions. WRMSR is unfortunately not a load/store
- * instruction and is unaffected by MFENCE. The LFENCE ensures
- * that the WRMSR is not reordered.
- *
- * Most WRMSRs are full serializing instructions themselves and
- * do not require this barrier. This is only required for the
- * IA32_TSC_DEADLINE and X2APIC MSRs.
- */
-static inline void weak_wrmsr_fence(void)
-{
- asm volatile("mfence; lfence" : : : "memory");
-}
-
#endif /* _ASM_X86_BARRIER_H */
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index d6089072ee41..dd6f714c215e 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -305,10 +305,10 @@
#define X86_FEATURE_MSR_TSX_CTRL (11*32+20) /* "" MSR IA32_TSX_CTRL (Intel) implemented */
-
#define X86_FEATURE_SRSO (11*32+24) /* "" AMD BTB untrain RETs */
#define X86_FEATURE_SRSO_ALIAS (11*32+25) /* "" AMD BTB untrain RETs through aliasing */
#define X86_FEATURE_IBPB_ON_VMEXIT (11*32+26) /* "" Issue an IBPB only on VMEXIT */
+#define X86_FEATURE_APIC_MSRS_FENCE (11*32+27) /* "" IA32_TSC_DEADLINE and X2APIC MSRs need fencing */
/* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
#define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index bbbf27cfe701..52329752dd4f 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -861,4 +861,22 @@ enum mds_mitigations {
extern bool gds_ucode_mitigated(void);
+/*
+ * Make previous memory operations globally visible before
+ * a WRMSR.
+ *
+ * MFENCE makes writes visible, but only affects load/store
+ * instructions. WRMSR is unfortunately not a load/store
+ * instruction and is unaffected by MFENCE. The LFENCE ensures
+ * that the WRMSR is not reordered.
+ *
+ * Most WRMSRs are full serializing instructions themselves and
+ * do not require this barrier. This is only required for the
+ * IA32_TSC_DEADLINE and X2APIC MSRs.
+ */
+static inline void weak_wrmsr_fence(void)
+{
+ alternative("mfence; lfence", "", ALT_NOT(X86_FEATURE_APIC_MSRS_FENCE));
+}
+
#endif /* _ASM_X86_PROCESSOR_H */
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index dba7fe7ecea9..06186b7d0ed0 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -1158,6 +1158,9 @@ static void init_amd(struct cpuinfo_x86 *c)
if (!cpu_has(c, X86_FEATURE_HYPERVISOR) &&
cpu_has_amd_erratum(c, amd_erratum_1485))
msr_set_bit(MSR_ZEN4_BP_CFG, MSR_ZEN4_BP_CFG_SHARED_BTB_FIX_BIT);
+
+ /* AMD CPUs don't need fencing after x2APIC/TSC_DEADLINE MSR writes. */
+ clear_cpu_cap(c, X86_FEATURE_APIC_MSRS_FENCE);
}
#ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 01c4f8f45b83..1e870f6dd51c 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1687,6 +1687,13 @@ static void identify_cpu(struct cpuinfo_x86 *c)
c->apicid = apic->phys_pkg_id(c->initial_apicid, 0);
#endif
+
+ /*
+ * Set default APIC and TSC_DEADLINE MSR fencing flag. AMD and
+ * Hygon will clear it in ->c_init() below.
+ */
+ set_cpu_cap(c, X86_FEATURE_APIC_MSRS_FENCE);
+
/*
* Vendor-specific initialization. In this section we
* canonicalize the feature flags, meaning if there are
diff --git a/arch/x86/kernel/cpu/hygon.c b/arch/x86/kernel/cpu/hygon.c
index 9e8380bd4fb9..8a80d5343f3a 100644
--- a/arch/x86/kernel/cpu/hygon.c
+++ b/arch/x86/kernel/cpu/hygon.c
@@ -347,6 +347,9 @@ static void init_hygon(struct cpuinfo_x86 *c)
set_cpu_bug(c, X86_BUG_SYSRET_SS_ATTRS);
check_null_seg_clears_base(c);
+
+ /* Hygon CPUs don't need fencing after x2APIC/TSC_DEADLINE MSR writes. */
+ clear_cpu_cap(c, X86_FEATURE_APIC_MSRS_FENCE);
}
static void cpu_detect_tlb_hygon(struct cpuinfo_x86 *c)
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread* Re: [PATCH AUTOSEL 5.15 11/12] x86/barrier: Do not serialize MSR accesses on AMD
2024-01-15 23:26 ` [PATCH AUTOSEL 5.15 11/12] x86/barrier: Do not serialize MSR accesses on AMD Sasha Levin
@ 2024-11-28 11:59 ` Borislav Petkov
2024-11-28 15:52 ` Sasha Levin
0 siblings, 1 reply; 27+ messages in thread
From: Borislav Petkov @ 2024-11-28 11:59 UTC (permalink / raw)
To: Sasha Levin
Cc: linux-kernel, stable, tglx, mingo, dave.hansen, x86, puwen,
seanjc, kim.phillips, jmattson, babu.moger, peterz,
rick.p.edgecombe, brgerst, ashok.raj, mjguzik, jpoimboe,
nik.borisov, aik, vegard.nossum, daniel.sneddon, acdunlap,
Erwan Velu
Hey folks,
On Mon, Jan 15, 2024 at 06:26:56PM -0500, Sasha Levin wrote:
> From: "Borislav Petkov (AMD)" <bp@alien8.de>
>
> [ Upstream commit 04c3024560d3a14acd18d0a51a1d0a89d29b7eb5 ]
>
> AMD does not have the requirement for a synchronization barrier when
> acccessing a certain group of MSRs. Do not incur that unnecessary
> penalty there.
Erwan just mentioned that this one is not in 6.1 and in 5.15. And I have mails
about it getting picked up by AUTOSEL.
Did the AI reconsider in the meantime?
:-P
Thx.
Two autosel messages are:
https://lore.kernel.org/r/20240115232718.209642-11-sashal@kernel.org
https://lore.kernel.org/r/20240115232611.209265-13-sashal@kernel.org
Leaving in the rest for reference.
> There will be a CPUID bit which explicitly states that a MFENCE is not
> needed. Once that bit is added to the APM, this will be extended with
> it.
>
> While at it, move to processor.h to avoid include hell. Untangling that
> file properly is a matter for another day.
>
> Some notes on the performance aspect of why this is relevant, courtesy
> of Kishon VijayAbraham <Kishon.VijayAbraham@amd.com>:
>
> On a AMD Zen4 system with 96 cores, a modified ipi-bench[1] on a VM
> shows x2AVIC IPI rate is 3% to 4% lower than AVIC IPI rate. The
> ipi-bench is modified so that the IPIs are sent between two vCPUs in the
> same CCX. This also requires to pin the vCPU to a physical core to
> prevent any latencies. This simulates the use case of pinning vCPUs to
> the thread of a single CCX to avoid interrupt IPI latency.
>
> In order to avoid run-to-run variance (for both x2AVIC and AVIC), the
> below configurations are done:
>
> 1) Disable Power States in BIOS (to prevent the system from going to
> lower power state)
>
> 2) Run the system at fixed frequency 2500MHz (to prevent the system
> from increasing the frequency when the load is more)
>
> With the above configuration:
>
> *) Performance measured using ipi-bench for AVIC:
> Average Latency: 1124.98ns [Time to send IPI from one vCPU to another vCPU]
>
> Cumulative throughput: 42.6759M/s [Total number of IPIs sent in a second from
> 48 vCPUs simultaneously]
>
> *) Performance measured using ipi-bench for x2AVIC:
> Average Latency: 1172.42ns [Time to send IPI from one vCPU to another vCPU]
>
> Cumulative throughput: 40.9432M/s [Total number of IPIs sent in a second from
> 48 vCPUs simultaneously]
>
> From above, x2AVIC latency is ~4% more than AVIC. However, the expectation is
> x2AVIC performance to be better or equivalent to AVIC. Upon analyzing
> the perf captures, it is observed significant time is spent in
> weak_wrmsr_fence() invoked by x2apic_send_IPI().
>
> With the fix to skip weak_wrmsr_fence()
>
> *) Performance measured using ipi-bench for x2AVIC:
> Average Latency: 1117.44ns [Time to send IPI from one vCPU to another vCPU]
>
> Cumulative throughput: 42.9608M/s [Total number of IPIs sent in a second from
> 48 vCPUs simultaneously]
>
> Comparing the performance of x2AVIC with and without the fix, it can be seen
> the performance improves by ~4%.
>
> Performance captured using an unmodified ipi-bench using the 'mesh-ipi' option
> with and without weak_wrmsr_fence() on a Zen4 system also showed significant
> performance improvement without weak_wrmsr_fence(). The 'mesh-ipi' option ignores
> CCX or CCD and just picks random vCPU.
>
> Average throughput (10 iterations) with weak_wrmsr_fence(),
> Cumulative throughput: 4933374 IPI/s
>
> Average throughput (10 iterations) without weak_wrmsr_fence(),
> Cumulative throughput: 6355156 IPI/s
>
> [1] https://github.com/bytedance/kvm-utils/tree/master/microbenchmark/ipi-bench
>
> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
> Link: https://lore.kernel.org/r/20230622095212.20940-1-bp@alien8.de
> Signed-off-by: Sasha Levin <sashal@kernel.org>
> ---
> arch/x86/include/asm/barrier.h | 18 ------------------
> arch/x86/include/asm/cpufeatures.h | 2 +-
> arch/x86/include/asm/processor.h | 18 ++++++++++++++++++
> arch/x86/kernel/cpu/amd.c | 3 +++
> arch/x86/kernel/cpu/common.c | 7 +++++++
> arch/x86/kernel/cpu/hygon.c | 3 +++
> 6 files changed, 32 insertions(+), 19 deletions(-)
>
> diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
> index 3ba772a69cc8..dab2db15a8c4 100644
> --- a/arch/x86/include/asm/barrier.h
> +++ b/arch/x86/include/asm/barrier.h
> @@ -81,22 +81,4 @@ do { \
>
> #include <asm-generic/barrier.h>
>
> -/*
> - * Make previous memory operations globally visible before
> - * a WRMSR.
> - *
> - * MFENCE makes writes visible, but only affects load/store
> - * instructions. WRMSR is unfortunately not a load/store
> - * instruction and is unaffected by MFENCE. The LFENCE ensures
> - * that the WRMSR is not reordered.
> - *
> - * Most WRMSRs are full serializing instructions themselves and
> - * do not require this barrier. This is only required for the
> - * IA32_TSC_DEADLINE and X2APIC MSRs.
> - */
> -static inline void weak_wrmsr_fence(void)
> -{
> - asm volatile("mfence; lfence" : : : "memory");
> -}
> -
> #endif /* _ASM_X86_BARRIER_H */
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index d6089072ee41..dd6f714c215e 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -305,10 +305,10 @@
>
>
> #define X86_FEATURE_MSR_TSX_CTRL (11*32+20) /* "" MSR IA32_TSX_CTRL (Intel) implemented */
> -
> #define X86_FEATURE_SRSO (11*32+24) /* "" AMD BTB untrain RETs */
> #define X86_FEATURE_SRSO_ALIAS (11*32+25) /* "" AMD BTB untrain RETs through aliasing */
> #define X86_FEATURE_IBPB_ON_VMEXIT (11*32+26) /* "" Issue an IBPB only on VMEXIT */
> +#define X86_FEATURE_APIC_MSRS_FENCE (11*32+27) /* "" IA32_TSC_DEADLINE and X2APIC MSRs need fencing */
>
> /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
> #define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index bbbf27cfe701..52329752dd4f 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -861,4 +861,22 @@ enum mds_mitigations {
>
> extern bool gds_ucode_mitigated(void);
>
> +/*
> + * Make previous memory operations globally visible before
> + * a WRMSR.
> + *
> + * MFENCE makes writes visible, but only affects load/store
> + * instructions. WRMSR is unfortunately not a load/store
> + * instruction and is unaffected by MFENCE. The LFENCE ensures
> + * that the WRMSR is not reordered.
> + *
> + * Most WRMSRs are full serializing instructions themselves and
> + * do not require this barrier. This is only required for the
> + * IA32_TSC_DEADLINE and X2APIC MSRs.
> + */
> +static inline void weak_wrmsr_fence(void)
> +{
> + alternative("mfence; lfence", "", ALT_NOT(X86_FEATURE_APIC_MSRS_FENCE));
> +}
> +
> #endif /* _ASM_X86_PROCESSOR_H */
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index dba7fe7ecea9..06186b7d0ed0 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -1158,6 +1158,9 @@ static void init_amd(struct cpuinfo_x86 *c)
> if (!cpu_has(c, X86_FEATURE_HYPERVISOR) &&
> cpu_has_amd_erratum(c, amd_erratum_1485))
> msr_set_bit(MSR_ZEN4_BP_CFG, MSR_ZEN4_BP_CFG_SHARED_BTB_FIX_BIT);
> +
> + /* AMD CPUs don't need fencing after x2APIC/TSC_DEADLINE MSR writes. */
> + clear_cpu_cap(c, X86_FEATURE_APIC_MSRS_FENCE);
> }
>
> #ifdef CONFIG_X86_32
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 01c4f8f45b83..1e870f6dd51c 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1687,6 +1687,13 @@ static void identify_cpu(struct cpuinfo_x86 *c)
> c->apicid = apic->phys_pkg_id(c->initial_apicid, 0);
> #endif
>
> +
> + /*
> + * Set default APIC and TSC_DEADLINE MSR fencing flag. AMD and
> + * Hygon will clear it in ->c_init() below.
> + */
> + set_cpu_cap(c, X86_FEATURE_APIC_MSRS_FENCE);
> +
> /*
> * Vendor-specific initialization. In this section we
> * canonicalize the feature flags, meaning if there are
> diff --git a/arch/x86/kernel/cpu/hygon.c b/arch/x86/kernel/cpu/hygon.c
> index 9e8380bd4fb9..8a80d5343f3a 100644
> --- a/arch/x86/kernel/cpu/hygon.c
> +++ b/arch/x86/kernel/cpu/hygon.c
> @@ -347,6 +347,9 @@ static void init_hygon(struct cpuinfo_x86 *c)
> set_cpu_bug(c, X86_BUG_SYSRET_SS_ATTRS);
>
> check_null_seg_clears_base(c);
> +
> + /* Hygon CPUs don't need fencing after x2APIC/TSC_DEADLINE MSR writes. */
> + clear_cpu_cap(c, X86_FEATURE_APIC_MSRS_FENCE);
> }
>
> static void cpu_detect_tlb_hygon(struct cpuinfo_x86 *c)
> --
> 2.43.0
>
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: [PATCH AUTOSEL 5.15 11/12] x86/barrier: Do not serialize MSR accesses on AMD
2024-11-28 11:59 ` Borislav Petkov
@ 2024-11-28 15:52 ` Sasha Levin
2024-11-28 16:08 ` Erwan Velu
2024-11-28 16:43 ` Borislav Petkov
0 siblings, 2 replies; 27+ messages in thread
From: Sasha Levin @ 2024-11-28 15:52 UTC (permalink / raw)
To: Borislav Petkov
Cc: linux-kernel, stable, tglx, mingo, dave.hansen, x86, puwen,
seanjc, kim.phillips, jmattson, babu.moger, peterz,
rick.p.edgecombe, brgerst, ashok.raj, mjguzik, jpoimboe,
nik.borisov, aik, vegard.nossum, daniel.sneddon, acdunlap,
Erwan Velu, pavel
On Thu, Nov 28, 2024 at 12:59:24PM +0100, Borislav Petkov wrote:
>Hey folks,
>
>On Mon, Jan 15, 2024 at 06:26:56PM -0500, Sasha Levin wrote:
>> From: "Borislav Petkov (AMD)" <bp@alien8.de>
>>
>> [ Upstream commit 04c3024560d3a14acd18d0a51a1d0a89d29b7eb5 ]
>>
>> AMD does not have the requirement for a synchronization barrier when
>> acccessing a certain group of MSRs. Do not incur that unnecessary
>> penalty there.
>
>Erwan just mentioned that this one is not in 6.1 and in 5.15. And I have mails
>about it getting picked up by AUTOSEL.
>
>Did the AI reconsider in the meantime?
You've missed the 5.10 mail :)
Pavel objected to it so I've dropped it: https://lore.kernel.org/all/Zbli7QIGVFT8EtO4@sashalap/
--
Thanks,
Sasha
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH AUTOSEL 5.15 11/12] x86/barrier: Do not serialize MSR accesses on AMD
2024-11-28 15:52 ` Sasha Levin
@ 2024-11-28 16:08 ` Erwan Velu
2024-11-28 16:43 ` Borislav Petkov
1 sibling, 0 replies; 27+ messages in thread
From: Erwan Velu @ 2024-11-28 16:08 UTC (permalink / raw)
To: Sasha Levin
Cc: Borislav Petkov, linux-kernel, stable, tglx, mingo, dave.hansen,
x86, puwen, seanjc, kim.phillips, jmattson, babu.moger, peterz,
rick.p.edgecombe, brgerst, ashok.raj, mjguzik, jpoimboe,
nik.borisov, aik, vegard.nossum, daniel.sneddon, acdunlap, pavel
But it was backported on 6.6 and the performance impact is pretty high
on some workloads.
As this patch is only providing a perf benefit and to keep consistency
with other stable kernels, would it be possible to get it merged ?
Le jeu. 28 nov. 2024 à 16:52, Sasha Levin <sashal@kernel.org> a écrit :
>
> On Thu, Nov 28, 2024 at 12:59:24PM +0100, Borislav Petkov wrote:
> >Hey folks,
> >
> >On Mon, Jan 15, 2024 at 06:26:56PM -0500, Sasha Levin wrote:
> >> From: "Borislav Petkov (AMD)" <bp@alien8.de>
> >>
> >> [ Upstream commit 04c3024560d3a14acd18d0a51a1d0a89d29b7eb5 ]
> >>
> >> AMD does not have the requirement for a synchronization barrier when
> >> acccessing a certain group of MSRs. Do not incur that unnecessary
> >> penalty there.
> >
> >Erwan just mentioned that this one is not in 6.1 and in 5.15. And I have mails
> >about it getting picked up by AUTOSEL.
> >
> >Did the AI reconsider in the meantime?
>
> You've missed the 5.10 mail :)
>
> Pavel objected to it so I've dropped it: https://lore.kernel.org/all/Zbli7QIGVFT8EtO4@sashalap/
>
> --
> Thanks,
> Sasha
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH AUTOSEL 5.15 11/12] x86/barrier: Do not serialize MSR accesses on AMD
2024-11-28 15:52 ` Sasha Levin
2024-11-28 16:08 ` Erwan Velu
@ 2024-11-28 16:43 ` Borislav Petkov
2024-11-29 0:21 ` Sasha Levin
2024-11-29 9:45 ` Pavel Machek
1 sibling, 2 replies; 27+ messages in thread
From: Borislav Petkov @ 2024-11-28 16:43 UTC (permalink / raw)
To: Sasha Levin
Cc: linux-kernel, stable, tglx, mingo, dave.hansen, x86, puwen,
seanjc, kim.phillips, jmattson, babu.moger, peterz,
rick.p.edgecombe, brgerst, ashok.raj, mjguzik, jpoimboe,
nik.borisov, aik, vegard.nossum, daniel.sneddon, acdunlap,
Erwan Velu, pavel
On Thu, Nov 28, 2024 at 10:52:44AM -0500, Sasha Levin wrote:
> You've missed the 5.10 mail :)
You mean in the flood? ;-P
> Pavel objected to it so I've dropped it: https://lore.kernel.org/all/Zbli7QIGVFT8EtO4@sashalap/
So we're not backporting those anymore? But everything else? :-P
And 5.15 has it already...
Frankly, with the amount of stuff going into stable, I see no problem with
backporting such patches. Especially if the people using stable kernels will
end up backporting it themselves and thus multiply work. I.e., Erwan's case.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: [PATCH AUTOSEL 5.15 11/12] x86/barrier: Do not serialize MSR accesses on AMD
2024-11-28 16:43 ` Borislav Petkov
@ 2024-11-29 0:21 ` Sasha Levin
2024-11-29 9:30 ` Erwan Velu
2024-11-29 13:30 ` Borislav Petkov
2024-11-29 9:45 ` Pavel Machek
1 sibling, 2 replies; 27+ messages in thread
From: Sasha Levin @ 2024-11-29 0:21 UTC (permalink / raw)
To: Borislav Petkov
Cc: linux-kernel, stable, tglx, mingo, dave.hansen, x86, puwen,
seanjc, kim.phillips, jmattson, babu.moger, peterz,
rick.p.edgecombe, brgerst, ashok.raj, mjguzik, jpoimboe,
nik.borisov, aik, vegard.nossum, daniel.sneddon, acdunlap,
Erwan Velu, pavel
On Thu, Nov 28, 2024 at 05:43:10PM +0100, Borislav Petkov wrote:
>On Thu, Nov 28, 2024 at 10:52:44AM -0500, Sasha Levin wrote:
>> You've missed the 5.10 mail :)
>
>You mean in the flood? ;-P
Suggestions welcome... I don't really insist on a massive flood and was
mostly following what was the convention when I started doing this work.
I thought about cutting it down to one mail per commit, but OTOH I had
folks complain often enough that they missed a mail or that it wasn't
obvious enough.
>> Pavel objected to it so I've dropped it: https://lore.kernel.org/all/Zbli7QIGVFT8EtO4@sashalap/
>
>So we're not backporting those anymore? But everything else? :-P
It was dropped from everywhere. The reason it ended up in 6.6 was
because your AMD friends requested it:
https://lore.kernel.org/all/2024022146-chunk-fencing-1e8f@gregkh/
>And 5.15 has it already...
It might be all those thanksgiving drinks, but I can't find it in
5.15... I see it was part of 6.7.6 and 6.6.18, but nothing older.
>Frankly, with the amount of stuff going into stable, I see no problem with
>backporting such patches. Especially if the people using stable kernels will
>end up backporting it themselves and thus multiply work. I.e., Erwan's case.
The stable kernel rules allow for "notable" performance fixes, and we
already added it to 6.6. No objection to adding it to older kernels...
Happy to do it after the current round of releases goes out.
--
Thanks,
Sasha
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH AUTOSEL 5.15 11/12] x86/barrier: Do not serialize MSR accesses on AMD
2024-11-29 0:21 ` Sasha Levin
@ 2024-11-29 9:30 ` Erwan Velu
2024-11-29 13:33 ` Borislav Petkov
2024-11-29 13:30 ` Borislav Petkov
1 sibling, 1 reply; 27+ messages in thread
From: Erwan Velu @ 2024-11-29 9:30 UTC (permalink / raw)
To: Sasha Levin
Cc: Borislav Petkov, linux-kernel, stable, tglx, mingo, dave.hansen,
x86, puwen, seanjc, kim.phillips, jmattson, babu.moger, peterz,
rick.p.edgecombe, brgerst, ashok.raj, mjguzik, jpoimboe,
nik.borisov, aik, vegard.nossum, daniel.sneddon, acdunlap, pavel
[...]
> The stable kernel rules allow for "notable" performance fixes, and we
> already added it to 6.6. No objection to adding it to older kernels...
> Happy to do it after the current round of releases goes out.
Sacha,
Thanks for considering the backport of this patch, much appreciated.
This patch greatly impacts servers on production with AMD systems that
have lasted since 5.11, having it backported really improves systems
performance.
Since this patch, I can share that our database team is no longer
paged during the night, that's a real noticeable impact.
We all know how difficult it is to maintain kernels and support the
hard work you do on this.
It's also up to us, the users & industry, to give feedback and testing
around this type of story.
It could probably be interesting to have a presentation around this at
KernelRecipes ;)
Erwan,
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH AUTOSEL 5.15 11/12] x86/barrier: Do not serialize MSR accesses on AMD
2024-11-29 9:30 ` Erwan Velu
@ 2024-11-29 13:33 ` Borislav Petkov
2024-11-29 15:37 ` Sasha Levin
0 siblings, 1 reply; 27+ messages in thread
From: Borislav Petkov @ 2024-11-29 13:33 UTC (permalink / raw)
To: Erwan Velu
Cc: Sasha Levin, linux-kernel, stable, tglx, mingo, dave.hansen, x86,
puwen, seanjc, kim.phillips, jmattson, babu.moger, peterz,
rick.p.edgecombe, brgerst, ashok.raj, mjguzik, jpoimboe,
nik.borisov, aik, vegard.nossum, daniel.sneddon, acdunlap, pavel
On Fri, Nov 29, 2024 at 10:30:11AM +0100, Erwan Velu wrote:
> We all know how difficult it is to maintain kernels and support the hard
> work you do on this. It's also up to us, the users & industry, to give
> feedback and testing around this type of story. It could probably be
> interesting to have a presentation around this at KernelRecipes ;)
Yes, absolutely.
And you should reserve a whole slot of 1h after it for discussion. Because all
the folks using stable kernels or doing backports will definitely have
experience and thoughts to share... who knows, might improve the proces in the
end.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: [PATCH AUTOSEL 5.15 11/12] x86/barrier: Do not serialize MSR accesses on AMD
2024-11-29 13:33 ` Borislav Petkov
@ 2024-11-29 15:37 ` Sasha Levin
2024-11-29 21:18 ` Erwan Velu
0 siblings, 1 reply; 27+ messages in thread
From: Sasha Levin @ 2024-11-29 15:37 UTC (permalink / raw)
To: Borislav Petkov
Cc: Erwan Velu, linux-kernel, stable, tglx, mingo, dave.hansen, x86,
puwen, seanjc, kim.phillips, jmattson, babu.moger, peterz,
rick.p.edgecombe, brgerst, ashok.raj, mjguzik, jpoimboe,
nik.borisov, aik, vegard.nossum, daniel.sneddon, acdunlap, pavel
On Fri, Nov 29, 2024 at 02:33:10PM +0100, Borislav Petkov wrote:
>On Fri, Nov 29, 2024 at 10:30:11AM +0100, Erwan Velu wrote:
>> We all know how difficult it is to maintain kernels and support the hard
>> work you do on this. It's also up to us, the users & industry, to give
>> feedback and testing around this type of story. It could probably be
>> interesting to have a presentation around this at KernelRecipes ;)
Or LPC? I don't usually end up attending KR but Greg is always there :)
>Yes, absolutely.
>
>And you should reserve a whole slot of 1h after it for discussion. Because all
>the folks using stable kernels or doing backports will definitely have
>experience and thoughts to share... who knows, might improve the proces in the
>end.
If there are enough folks who are interested in a BoF session I'm happy
to help set it up for LPC.
--
Thanks,
Sasha
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH AUTOSEL 5.15 11/12] x86/barrier: Do not serialize MSR accesses on AMD
2024-11-29 15:37 ` Sasha Levin
@ 2024-11-29 21:18 ` Erwan Velu
0 siblings, 0 replies; 27+ messages in thread
From: Erwan Velu @ 2024-11-29 21:18 UTC (permalink / raw)
To: Sasha Levin
Cc: Borislav Petkov, linux-kernel, stable, tglx, mingo, dave.hansen,
x86, puwen, seanjc, kim.phillips, jmattson, babu.moger, peterz,
rick.p.edgecombe, brgerst, ashok.raj, mjguzik, jpoimboe,
nik.borisov, aik, vegard.nossum, daniel.sneddon, acdunlap, pavel
> Or LPC? I don't usually end up attending KR but Greg is always there :)
<innocent tone>
You should come once, it's such a nice conference!
</innocent>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH AUTOSEL 5.15 11/12] x86/barrier: Do not serialize MSR accesses on AMD
2024-11-29 0:21 ` Sasha Levin
2024-11-29 9:30 ` Erwan Velu
@ 2024-11-29 13:30 ` Borislav Petkov
1 sibling, 0 replies; 27+ messages in thread
From: Borislav Petkov @ 2024-11-29 13:30 UTC (permalink / raw)
To: Sasha Levin
Cc: linux-kernel, stable, tglx, mingo, dave.hansen, x86, puwen,
seanjc, kim.phillips, jmattson, babu.moger, peterz,
rick.p.edgecombe, brgerst, ashok.raj, mjguzik, jpoimboe,
nik.borisov, aik, vegard.nossum, daniel.sneddon, acdunlap,
Erwan Velu, pavel
On Thu, Nov 28, 2024 at 07:21:50PM -0500, Sasha Levin wrote:
> Suggestions welcome... I don't really insist on a massive flood and was
> mostly following what was the convention when I started doing this work.
>
> I thought about cutting it down to one mail per commit, but OTOH I had
> folks complain often enough that they missed a mail or that it wasn't
> obvious enough.
I don't know. It is a lot of mail and I doubt people look at most of them but
it's not like cutting down the mails from stable will alleviate the general
firehose drinking situation so...
> It might be all those thanksgiving drinks, but I can't find it in
> 5.15... I see it was part of 6.7.6 and 6.6.18, but nothing older.
Sorry, I meant 6.6.
> The stable kernel rules allow for "notable" performance fixes, and we
> already added it to 6.6.
Frankly, I don't have anything smarter than "well, it needs to be decided on
a case-by-case basis." As Erwan confirms, it really brings perf improvements
for their use case. So I guess that's a reason good enough to backport it
everywhere.
But cutting a general rule about such patches...? Nope, I can't think of any
good one.
> No objection to adding it to older kernels...
>
> Happy to do it after the current round of releases goes out.
Thanks.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH AUTOSEL 5.15 11/12] x86/barrier: Do not serialize MSR accesses on AMD
2024-11-28 16:43 ` Borislav Petkov
2024-11-29 0:21 ` Sasha Levin
@ 2024-11-29 9:45 ` Pavel Machek
2024-11-29 13:38 ` Sasha Levin
1 sibling, 1 reply; 27+ messages in thread
From: Pavel Machek @ 2024-11-29 9:45 UTC (permalink / raw)
To: Borislav Petkov
Cc: Sasha Levin, linux-kernel, stable, tglx, mingo, dave.hansen, x86,
puwen, seanjc, kim.phillips, jmattson, babu.moger, peterz,
rick.p.edgecombe, brgerst, ashok.raj, mjguzik, jpoimboe,
nik.borisov, aik, vegard.nossum, daniel.sneddon, acdunlap,
Erwan Velu, pavel
[-- Attachment #1: Type: text/plain, Size: 1272 bytes --]
Hi!
> > You've missed the 5.10 mail :)
>
> You mean in the flood? ;-P
>
> > Pavel objected to it so I've dropped it: https://lore.kernel.org/all/Zbli7QIGVFT8EtO4@sashalap/
>
> So we're not backporting those anymore? But everything else? :-P
>
> And 5.15 has it already...
>
> Frankly, with the amount of stuff going into stable, I see no problem with
> backporting such patches. Especially if the people using stable kernels will
> end up backporting it themselves and thus multiply work. I.e., Erwan's case.
Well, some people would prefer -stable to only contain fixes for
critical things, as documented.
stable-kernel-rules.rst:
- It must fix a problem that causes a build error (but not for things
marked CONFIG_BROKEN), an oops, a hang, data corruption, a real
security issue, or some "oh, that's not good" issue. In short, something
critical.
Now, you are right that reality and documentation are not exactly
"aligned". I don't care much about which one is fixed, but I'd really
like them to match (because that's what our users expect).
Best regards,
Pavel
--
DENX Software Engineering GmbH, Managing Director: Erika Unter
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH AUTOSEL 5.15 11/12] x86/barrier: Do not serialize MSR accesses on AMD
2024-11-29 9:45 ` Pavel Machek
@ 2024-11-29 13:38 ` Sasha Levin
2024-11-29 20:34 ` Pavel Machek
0 siblings, 1 reply; 27+ messages in thread
From: Sasha Levin @ 2024-11-29 13:38 UTC (permalink / raw)
To: Pavel Machek
Cc: Borislav Petkov, linux-kernel, stable, tglx, mingo, dave.hansen,
x86, puwen, seanjc, kim.phillips, jmattson, babu.moger, peterz,
rick.p.edgecombe, brgerst, ashok.raj, mjguzik, jpoimboe,
nik.borisov, aik, vegard.nossum, daniel.sneddon, acdunlap,
Erwan Velu
On Fri, Nov 29, 2024 at 10:45:48AM +0100, Pavel Machek wrote:
>Hi!
>
>> > You've missed the 5.10 mail :)
>>
>> You mean in the flood? ;-P
>>
>> > Pavel objected to it so I've dropped it: https://lore.kernel.org/all/Zbli7QIGVFT8EtO4@sashalap/
>>
>> So we're not backporting those anymore? But everything else? :-P
>>
>> And 5.15 has it already...
>>
>> Frankly, with the amount of stuff going into stable, I see no problem with
>> backporting such patches. Especially if the people using stable kernels will
>> end up backporting it themselves and thus multiply work. I.e., Erwan's case.
>
>Well, some people would prefer -stable to only contain fixes for
>critical things, as documented.
>
>stable-kernel-rules.rst:
>
> - It must fix a problem that causes a build error (but not for things
> marked CONFIG_BROKEN), an oops, a hang, data corruption, a real
> security issue, or some "oh, that's not good" issue. In short, something
> critical.
>
>Now, you are right that reality and documentation are not exactly
>"aligned". I don't care much about which one is fixed, but I'd really
>like them to match (because that's what our users expect).
You should consider reading past the first bullet in that section :)
- Serious issues as reported by a user of a distribution kernel may also
be considered if they fix a notable performance or interactivity issue.
It sounds like what's going on here, no?
--
Thanks,
Sasha
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: [PATCH AUTOSEL 5.15 11/12] x86/barrier: Do not serialize MSR accesses on AMD
2024-11-29 13:38 ` Sasha Levin
@ 2024-11-29 20:34 ` Pavel Machek
2024-11-29 20:50 ` Sasha Levin
0 siblings, 1 reply; 27+ messages in thread
From: Pavel Machek @ 2024-11-29 20:34 UTC (permalink / raw)
To: Sasha Levin
Cc: Pavel Machek, Borislav Petkov, linux-kernel, stable, tglx, mingo,
dave.hansen, x86, puwen, seanjc, kim.phillips, jmattson,
babu.moger, peterz, rick.p.edgecombe, brgerst, ashok.raj, mjguzik,
jpoimboe, nik.borisov, aik, vegard.nossum, daniel.sneddon,
acdunlap, Erwan Velu
[-- Attachment #1: Type: text/plain, Size: 1943 bytes --]
On Fri 2024-11-29 08:38:48, Sasha Levin wrote:
> On Fri, Nov 29, 2024 at 10:45:48AM +0100, Pavel Machek wrote:
> > Hi!
> >
> > > > You've missed the 5.10 mail :)
> > >
> > > You mean in the flood? ;-P
> > >
> > > > Pavel objected to it so I've dropped it: https://lore.kernel.org/all/Zbli7QIGVFT8EtO4@sashalap/
> > >
> > > So we're not backporting those anymore? But everything else? :-P
> > >
> > > And 5.15 has it already...
> > >
> > > Frankly, with the amount of stuff going into stable, I see no problem with
> > > backporting such patches. Especially if the people using stable kernels will
> > > end up backporting it themselves and thus multiply work. I.e., Erwan's case.
> >
> > Well, some people would prefer -stable to only contain fixes for
> > critical things, as documented.
> >
> > stable-kernel-rules.rst:
> >
> > - It must fix a problem that causes a build error (but not for things
> > marked CONFIG_BROKEN), an oops, a hang, data corruption, a real
> > security issue, or some "oh, that's not good" issue. In short, something
> > critical.
> >
> > Now, you are right that reality and documentation are not exactly
> > "aligned". I don't care much about which one is fixed, but I'd really
> > like them to match (because that's what our users expect).
>
> You should consider reading past the first bullet in that section :)
>
> - Serious issues as reported by a user of a distribution kernel may also
> be considered if they fix a notable performance or interactivity issue.
>
> It sounds like what's going on here, no?
Is it? I'd not expect this to be visible in anything but
microbenchmarks. Do you have user reports hitting this?
It is not like this makes kernel build 10% slower, is it?
Pavel
--
DENX Software Engineering GmbH, Managing Director: Erika Unter
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH AUTOSEL 5.15 11/12] x86/barrier: Do not serialize MSR accesses on AMD
2024-11-29 20:34 ` Pavel Machek
@ 2024-11-29 20:50 ` Sasha Levin
2024-11-29 21:27 ` Pavel Machek
0 siblings, 1 reply; 27+ messages in thread
From: Sasha Levin @ 2024-11-29 20:50 UTC (permalink / raw)
To: Pavel Machek
Cc: Borislav Petkov, linux-kernel, stable, tglx, mingo, dave.hansen,
x86, puwen, seanjc, kim.phillips, jmattson, babu.moger, peterz,
rick.p.edgecombe, brgerst, ashok.raj, mjguzik, jpoimboe,
nik.borisov, aik, vegard.nossum, daniel.sneddon, acdunlap,
Erwan Velu
On Fri, Nov 29, 2024 at 09:34:53PM +0100, Pavel Machek wrote:
>On Fri 2024-11-29 08:38:48, Sasha Levin wrote:
>> On Fri, Nov 29, 2024 at 10:45:48AM +0100, Pavel Machek wrote:
>> > Hi!
>> >
>> > > > You've missed the 5.10 mail :)
>> > >
>> > > You mean in the flood? ;-P
>> > >
>> > > > Pavel objected to it so I've dropped it: https://lore.kernel.org/all/Zbli7QIGVFT8EtO4@sashalap/
>> > >
>> > > So we're not backporting those anymore? But everything else? :-P
>> > >
>> > > And 5.15 has it already...
>> > >
>> > > Frankly, with the amount of stuff going into stable, I see no problem with
>> > > backporting such patches. Especially if the people using stable kernels will
>> > > end up backporting it themselves and thus multiply work. I.e., Erwan's case.
>> >
>> > Well, some people would prefer -stable to only contain fixes for
>> > critical things, as documented.
>> >
>> > stable-kernel-rules.rst:
>> >
>> > - It must fix a problem that causes a build error (but not for things
>> > marked CONFIG_BROKEN), an oops, a hang, data corruption, a real
>> > security issue, or some "oh, that's not good" issue. In short, something
>> > critical.
>> >
>> > Now, you are right that reality and documentation are not exactly
>> > "aligned". I don't care much about which one is fixed, but I'd really
>> > like them to match (because that's what our users expect).
>>
>> You should consider reading past the first bullet in that section :)
>>
>> - Serious issues as reported by a user of a distribution kernel may also
>> be considered if they fix a notable performance or interactivity issue.
>>
>> It sounds like what's going on here, no?
>
>Is it? I'd not expect this to be visible in anything but
>microbenchmarks. Do you have user reports hitting this?
>
>It is not like this makes kernel build 10% slower, is it?
On Fri, Nov 29, 2024 at 10:30:11AM +0100, Erwan Velu wrote:
>This patch greatly impacts servers on production with AMD systems that
>have lasted since 5.11, having it backported really improves systems
>performance.
>Since this patch, I can share that our database team is no longer
>paged during the night, that's a real noticeable impact.
--
Thanks,
Sasha
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH AUTOSEL 5.15 11/12] x86/barrier: Do not serialize MSR accesses on AMD
2024-11-29 20:50 ` Sasha Levin
@ 2024-11-29 21:27 ` Pavel Machek
0 siblings, 0 replies; 27+ messages in thread
From: Pavel Machek @ 2024-11-29 21:27 UTC (permalink / raw)
To: Sasha Levin
Cc: Pavel Machek, Borislav Petkov, linux-kernel, stable, tglx, mingo,
dave.hansen, x86, puwen, seanjc, kim.phillips, jmattson,
babu.moger, peterz, rick.p.edgecombe, brgerst, ashok.raj, mjguzik,
jpoimboe, nik.borisov, aik, vegard.nossum, daniel.sneddon,
acdunlap, Erwan Velu
[-- Attachment #1: Type: text/plain, Size: 1051 bytes --]
Hi!
> > > - Serious issues as reported by a user of a distribution kernel may also
> > > be considered if they fix a notable performance or interactivity issue.
> > >
> > > It sounds like what's going on here, no?
> >
> > Is it? I'd not expect this to be visible in anything but
> > microbenchmarks. Do you have user reports hitting this?
> >
> > It is not like this makes kernel build 10% slower, is it?
>
> On Fri, Nov 29, 2024 at 10:30:11AM +0100, Erwan Velu wrote:
> > This patch greatly impacts servers on production with AMD systems that
> > have lasted since 5.11, having it backported really improves systems
> > performance.
> > Since this patch, I can share that our database team is no longer
> > paged during the night, that's a real noticeable impact.
Ok, looks like I misjudged the performance impact, sorry for the
noise.
Best regards,
Pavel
--
DENX Software Engineering GmbH, Managing Director: Erika Unter
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH AUTOSEL 5.15 12/12] x86/mce: Mark fatal MCE's page as poison to avoid panic in the kdump kernel
2024-01-15 23:26 [PATCH AUTOSEL 5.15 01/12] watch_queue: fix kcalloc() arguments order Sasha Levin
` (9 preceding siblings ...)
2024-01-15 23:26 ` [PATCH AUTOSEL 5.15 11/12] x86/barrier: Do not serialize MSR accesses on AMD Sasha Levin
@ 2024-01-15 23:26 ` Sasha Levin
10 siblings, 0 replies; 27+ messages in thread
From: Sasha Levin @ 2024-01-15 23:26 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Zhiquan Li, Youquan Song, Borislav Petkov, Naoya Horiguchi,
Sasha Levin, tglx, mingo, dave.hansen, x86, linux-edac
From: Zhiquan Li <zhiquan1.li@intel.com>
[ Upstream commit 9f3b130048bfa2e44a8cfb1b616f826d9d5d8188 ]
Memory errors don't happen very often, especially fatal ones. However,
in large-scale scenarios such as data centers, that probability
increases with the amount of machines present.
When a fatal machine check happens, mce_panic() is called based on the
severity grading of that error. The page containing the error is not
marked as poison.
However, when kexec is enabled, tools like makedumpfile understand when
pages are marked as poison and do not touch them so as not to cause
a fatal machine check exception again while dumping the previous
kernel's memory.
Therefore, mark the page containing the error as poisoned so that the
kexec'ed kernel can avoid accessing the page.
[ bp: Rewrite commit message and comment. ]
Co-developed-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Link: https://lore.kernel.org/r/20231014051754.3759099-1-zhiquan1.li@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
arch/x86/kernel/cpu/mce/core.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index a0727723676b..eb48729e220e 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -44,6 +44,7 @@
#include <linux/sync_core.h>
#include <linux/task_work.h>
#include <linux/hardirq.h>
+#include <linux/kexec.h>
#include <asm/intel-family.h>
#include <asm/processor.h>
@@ -274,6 +275,7 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
struct llist_node *pending;
struct mce_evt_llist *l;
int apei_err = 0;
+ struct page *p;
/*
* Allow instrumentation around external facilities usage. Not that it
@@ -329,6 +331,20 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
if (!fake_panic) {
if (panic_timeout == 0)
panic_timeout = mca_cfg.panic_timeout;
+
+ /*
+ * Kdump skips the poisoned page in order to avoid
+ * touching the error bits again. Poison the page even
+ * if the error is fatal and the machine is about to
+ * panic.
+ */
+ if (kexec_crash_loaded()) {
+ if (final && (final->status & MCI_STATUS_ADDRV)) {
+ p = pfn_to_online_page(final->addr >> PAGE_SHIFT);
+ if (p)
+ SetPageHWPoison(p);
+ }
+ }
panic(msg);
} else
pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);
--
2.43.0
^ permalink raw reply related [flat|nested] 27+ messages in thread