* [PATCH 0/3] x86,tlb: context switch optimizations
@ 2024-11-09 0:27 Rik van Riel
2024-11-09 0:27 ` [PATCH 1/3] x86,tlb: update mm_cpumask lazily Rik van Riel
` (3 more replies)
0 siblings, 4 replies; 18+ messages in thread
From: Rik van Riel @ 2024-11-09 0:27 UTC (permalink / raw)
To: linux-kernel
Cc: dave.hansen, luto, peterz, tglx, mingo, bp, x86, kernel-team, hpa
While profiling switch_mm_irqs_off with several workloads,
it appears there are two hot spots that probably don't need
to be there.
The first is the atomic clearing and setting of the current
CPU in prev's and next's mm_cpumask. This can create a large
amount of cache line contention. On a web server, these two
together take about 17% of the CPU time spent in switch_mm_irqs_off.
We should be able to avoid much of the cache line thrashing
by only clearing bits in mm_cpumask lazily from the first
TLB flush to a process, after which the other TLB flushes can
be more narrowly targeted.
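The lazy-clearing scheme described above can be sketched with a small userspace model. This is illustrative only: the names, the single tracked mm, and the plain bitmask are simplifications, not the kernel's actual data structures.

```c
#include <assert.h>

#define NR_CPUS     8
#define TRACKED_MM  1   /* model a single mm of interest */

static unsigned long mm_cpumask;   /* one bit per CPU, for TRACKED_MM */
static int loaded_mm[NR_CPUS];     /* which mm each CPU is running */

/* Context switch: set our own bit if needed, but leave prev's bit alone. */
static void context_switch(int cpu, int mm)
{
    loaded_mm[cpu] = mm;
    if (mm == TRACKED_MM && !(mm_cpumask & (1UL << cpu)))
        mm_cpumask |= 1UL << cpu;
}

/*
 * Flush: visit every CPU whose bit is set; a CPU no longer running the
 * mm clears its bit, so later flushes are more narrowly targeted.
 * Returns the number of "IPIs" sent.
 */
static int flush_mm(int mm)
{
    int ipis = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!(mm_cpumask & (1UL << cpu)))
            continue;
        ipis++;
        if (loaded_mm[cpu] != mm)
            mm_cpumask &= ~(1UL << cpu);   /* lazy clear */
    }
    return ipis;
}
```

In this model, the first flush after the process leaves a CPU still reaches that CPU; every later flush in a flurry skips it, which is the cache line contention win described above.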
A second cause of overhead seems to be the cpumask_test_cpu
inside the WARN_ON_ONCE in the prev == next branch of
switch_mm_irqs_off.
This warning never ever seems to fire, even on a very large
fleet, so it may be best to hide that behind CONFIG_DEBUG_VM.
With the web server workload, this is also about 17% of
switch_mm_irqs_off.
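The CONFIG_DEBUG_VM gating proposed here works because the config check is a compile-time constant, so short-circuit evaluation removes the expensive test from production builds entirely. A minimal stand-alone model, where DEBUG_VM is a stand-in for the kernel's IS_ENABLED(CONFIG_DEBUG_VM):

```c
#include <assert.h>

#define DEBUG_VM 0   /* stand-in for IS_ENABLED(CONFIG_DEBUG_VM) */

static int expensive_checks_run;

/* Models the costly cpumask_test_cpu() in the prev == next branch. */
static int cpumask_check(void)
{
    expensive_checks_run++;
    return 0;
}

static void switch_mm_model(void)
{
    /*
     * With DEBUG_VM == 0 the right-hand side is never evaluated, and
     * the compiler can drop the whole branch as dead code.
     */
    if (DEBUG_VM && cpumask_check())
        ;   /* would WARN and repair the mask here */
}
```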
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH 1/3] x86,tlb: update mm_cpumask lazily
2024-11-09 0:27 [PATCH 0/3] x86,tlb: context switch optimizations Rik van Riel
@ 2024-11-09 0:27 ` Rik van Riel
2024-11-13 2:59 ` [tip: x86/mm] x86/mm/tlb: Update " tip-bot2 for Rik van Riel
2024-11-09 0:27 ` [PATCH 2/3] x86,tlb: add tracepoint for TLB flush IPI to stale CPU Rik van Riel
` (2 subsequent siblings)
3 siblings, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2024-11-09 0:27 UTC (permalink / raw)
To: linux-kernel
Cc: dave.hansen, luto, peterz, tglx, mingo, bp, x86, kernel-team, hpa,
Rik van Riel
On busy multi-threaded workloads, there can be significant contention
on the mm_cpumask at context switch time.
Reduce that contention by updating mm_cpumask lazily, setting the CPU bit
at context switch time (if not already set), and clearing the CPU bit at
the first TLB flush sent to a CPU where the process isn't running.
When a flurry of TLB flushes for a process happen, only the first one
will be sent to CPUs where the process isn't running. The others will
be sent to CPUs where the process is currently running.
On an AMD Milan system with 36 cores, there is a noticeable difference:
$ hackbench --groups 20 --loops 10000
Before: ~4.5s +/- 0.1s
After: ~4.2s +/- 0.1s
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/mm/tlb.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 86593d1b787d..f19f6378cabf 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -606,18 +606,15 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
cond_mitigation(tsk);
/*
- * Stop remote flushes for the previous mm.
- * Skip kernel threads; we never send init_mm TLB flushing IPIs,
- * but the bitmap manipulation can cause cache line contention.
+ * Leave this CPU in prev's mm_cpumask. Atomic writes to
+ * mm_cpumask can be expensive under contention. The CPU
+ * will be removed lazily at TLB flush time.
*/
- if (prev != &init_mm) {
- VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu,
- mm_cpumask(prev)));
- cpumask_clear_cpu(cpu, mm_cpumask(prev));
- }
+ VM_WARN_ON_ONCE(prev != &init_mm && !cpumask_test_cpu(cpu,
+ mm_cpumask(prev)));
/* Start receiving IPIs and then read tlb_gen (and LAM below) */
- if (next != &init_mm)
+ if (next != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next)))
cpumask_set_cpu(cpu, mm_cpumask(next));
next_tlb_gen = atomic64_read(&next->context.tlb_gen);
@@ -761,8 +758,10 @@ static void flush_tlb_func(void *info)
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
/* Can only happen on remote CPUs */
- if (f->mm && f->mm != loaded_mm)
+ if (f->mm && f->mm != loaded_mm) {
+ cpumask_clear_cpu(raw_smp_processor_id(), mm_cpumask(f->mm));
return;
+ }
}
if (unlikely(loaded_mm == &init_mm))
--
2.45.2
* [PATCH 2/3] x86,tlb: add tracepoint for TLB flush IPI to stale CPU
2024-11-09 0:27 [PATCH 0/3] x86,tlb: context switch optimizations Rik van Riel
2024-11-09 0:27 ` [PATCH 1/3] x86,tlb: update mm_cpumask lazily Rik van Riel
@ 2024-11-09 0:27 ` Rik van Riel
2024-11-13 2:59 ` [tip: x86/mm] x86/mm/tlb: Add " tip-bot2 for Rik van Riel
2024-11-09 0:27 ` [PATCH 3/3] x86,tlb: put cpumask_test_cpu in prev == next under CONFIG_DEBUG_VM Rik van Riel
2024-11-13 9:55 ` [PATCH 0/3] x86,tlb: context switch optimizations Borislav Petkov
3 siblings, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2024-11-09 0:27 UTC (permalink / raw)
To: linux-kernel
Cc: dave.hansen, luto, peterz, tglx, mingo, bp, x86, kernel-team, hpa,
Rik van Riel, Dave Hansen
Add a tracepoint when we send a TLB flush IPI to a CPU that used
to be in the mm_cpumask, but isn't any more.
This can be used to evaluate whether there are any workloads where
we end up in this path problematically often. Hopefully they
don't exist.
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/mm/tlb.c | 1 +
include/linux/mm_types.h | 1 +
2 files changed, 2 insertions(+)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index f19f6378cabf..9d0d34576928 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -760,6 +760,7 @@ static void flush_tlb_func(void *info)
/* Can only happen on remote CPUs */
if (f->mm && f->mm != loaded_mm) {
cpumask_clear_cpu(raw_smp_processor_id(), mm_cpumask(f->mm));
+ trace_tlb_flush(TLB_REMOTE_WRONG_CPU, 0);
return;
}
}
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6e3bdf8e38bc..6b6f05404304 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1335,6 +1335,7 @@ enum tlb_flush_reason {
TLB_LOCAL_SHOOTDOWN,
TLB_LOCAL_MM_SHOOTDOWN,
TLB_REMOTE_SEND_IPI,
+ TLB_REMOTE_WRONG_CPU,
NR_TLB_FLUSH_REASONS,
};
--
2.45.2
* [PATCH 3/3] x86,tlb: put cpumask_test_cpu in prev == next under CONFIG_DEBUG_VM
2024-11-09 0:27 [PATCH 0/3] x86,tlb: context switch optimizations Rik van Riel
2024-11-09 0:27 ` [PATCH 1/3] x86,tlb: update mm_cpumask lazily Rik van Riel
2024-11-09 0:27 ` [PATCH 2/3] x86,tlb: add tracepoint for TLB flush IPI to stale CPU Rik van Riel
@ 2024-11-09 0:27 ` Rik van Riel
2024-11-13 2:59 ` [tip: x86/mm] x86/mm/tlb: Put cpumask_test_cpu() check in switch_mm_irqs_off() " tip-bot2 for Rik van Riel
2024-11-13 9:55 ` [PATCH 0/3] x86,tlb: context switch optimizations Borislav Petkov
3 siblings, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2024-11-09 0:27 UTC (permalink / raw)
To: linux-kernel
Cc: dave.hansen, luto, peterz, tglx, mingo, bp, x86, kernel-team, hpa,
Rik van Riel
On a web server workload, the cpumask_test_cpu inside the
WARN_ON_ONCE in the prev == next branch takes about 17% of
all the CPU time of switch_mm_irqs_off.
On a large fleet, this WARN_ON_ONCE has not fired in at least
a month, possibly never.
Move this test under CONFIG_DEBUG_VM so it does not get compiled
in production kernels.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/mm/tlb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 9d0d34576928..1aac4fa90d3d 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -568,7 +568,7 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
* mm_cpumask. The TLB shootdown code can figure out from
* cpu_tlbstate_shared.is_lazy whether or not to send an IPI.
*/
- if (WARN_ON_ONCE(prev != &init_mm &&
+ if (IS_ENABLED(CONFIG_DEBUG_VM) && WARN_ON_ONCE(prev != &init_mm &&
!cpumask_test_cpu(cpu, mm_cpumask(next))))
cpumask_set_cpu(cpu, mm_cpumask(next));
--
2.45.2
* [tip: x86/mm] x86/mm/tlb: Add tracepoint for TLB flush IPI to stale CPU
2024-11-09 0:27 ` [PATCH 2/3] x86,tlb: add tracepoint for TLB flush IPI to stale CPU Rik van Riel
@ 2024-11-13 2:59 ` tip-bot2 for Rik van Riel
0 siblings, 0 replies; 18+ messages in thread
From: tip-bot2 for Rik van Riel @ 2024-11-13 2:59 UTC (permalink / raw)
To: linux-tip-commits
Cc: Dave Hansen, Rik van Riel, Ingo Molnar, Andy Lutomirski,
Peter Zijlstra, Linus Torvalds, x86, linux-kernel
The following commit has been merged into the x86/mm branch of tip:
Commit-ID: c058cc5fedf0cc2d05d44b166e155690e101aa6e
Gitweb: https://git.kernel.org/tip/c058cc5fedf0cc2d05d44b166e155690e101aa6e
Author: Rik van Riel <riel@surriel.com>
AuthorDate: Fri, 08 Nov 2024 19:27:49 -05:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 13 Nov 2024 03:42:41 +01:00
x86/mm/tlb: Add tracepoint for TLB flush IPI to stale CPU
Add a tracepoint when we send a TLB flush IPI to a CPU that used
to be in the mm_cpumask, but isn't any more.
This can be used to evaluate whether there are any workloads where
we end up in this path problematically often. Hopefully they
don't exist.
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20241109003727.3958374-3-riel@surriel.com
---
arch/x86/mm/tlb.c | 1 +
include/linux/mm_types.h | 1 +
2 files changed, 2 insertions(+)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index cc4e57a..1aac4fa 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -760,6 +760,7 @@ static void flush_tlb_func(void *info)
/* Can only happen on remote CPUs */
if (f->mm && f->mm != loaded_mm) {
cpumask_clear_cpu(raw_smp_processor_id(), mm_cpumask(f->mm));
+ trace_tlb_flush(TLB_REMOTE_WRONG_CPU, 0);
return;
}
}
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6e3bdf8..6b6f054 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1335,6 +1335,7 @@ enum tlb_flush_reason {
TLB_LOCAL_SHOOTDOWN,
TLB_LOCAL_MM_SHOOTDOWN,
TLB_REMOTE_SEND_IPI,
+ TLB_REMOTE_WRONG_CPU,
NR_TLB_FLUSH_REASONS,
};
* [tip: x86/mm] x86/mm/tlb: Update mm_cpumask lazily
2024-11-09 0:27 ` [PATCH 1/3] x86,tlb: update mm_cpumask lazily Rik van Riel
@ 2024-11-13 2:59 ` tip-bot2 for Rik van Riel
0 siblings, 0 replies; 18+ messages in thread
From: tip-bot2 for Rik van Riel @ 2024-11-13 2:59 UTC (permalink / raw)
To: linux-tip-commits
Cc: Rik van Riel, Ingo Molnar, Andy Lutomirski, Peter Zijlstra,
Linus Torvalds, x86, linux-kernel
The following commit has been merged into the x86/mm branch of tip:
Commit-ID: d6c1b74d0d5e06106ed6571e4dc90f6b94fff63a
Gitweb: https://git.kernel.org/tip/d6c1b74d0d5e06106ed6571e4dc90f6b94fff63a
Author: Rik van Riel <riel@surriel.com>
AuthorDate: Fri, 08 Nov 2024 19:27:48 -05:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 13 Nov 2024 03:42:41 +01:00
x86/mm/tlb: Update mm_cpumask lazily
On busy multi-threaded workloads, there can be significant contention
on the mm_cpumask at context switch time.
Reduce that contention by updating mm_cpumask lazily, setting the CPU bit
at context switch time (if not already set), and clearing the CPU bit at
the first TLB flush sent to a CPU where the process isn't running.
When a flurry of TLB flushes for a process happen, only the first one
will be sent to CPUs where the process isn't running. The others will
be sent to CPUs where the process is currently running.
On an AMD Milan system with 36 cores, there is a noticeable difference:
$ hackbench --groups 20 --loops 10000
Before: ~4.5s +/- 0.1s
After: ~4.2s +/- 0.1s
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20241109003727.3958374-2-riel@surriel.com
---
arch/x86/mm/tlb.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index b0d5a64..cc4e57a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -606,18 +606,15 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
cond_mitigation(tsk);
/*
- * Stop remote flushes for the previous mm.
- * Skip kernel threads; we never send init_mm TLB flushing IPIs,
- * but the bitmap manipulation can cause cache line contention.
+ * Leave this CPU in prev's mm_cpumask. Atomic writes to
+ * mm_cpumask can be expensive under contention. The CPU
+ * will be removed lazily at TLB flush time.
*/
- if (prev != &init_mm) {
- VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu,
- mm_cpumask(prev)));
- cpumask_clear_cpu(cpu, mm_cpumask(prev));
- }
+ VM_WARN_ON_ONCE(prev != &init_mm && !cpumask_test_cpu(cpu,
+ mm_cpumask(prev)));
/* Start receiving IPIs and then read tlb_gen (and LAM below) */
- if (next != &init_mm)
+ if (next != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next)))
cpumask_set_cpu(cpu, mm_cpumask(next));
next_tlb_gen = atomic64_read(&next->context.tlb_gen);
@@ -761,8 +758,10 @@ static void flush_tlb_func(void *info)
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
/* Can only happen on remote CPUs */
- if (f->mm && f->mm != loaded_mm)
+ if (f->mm && f->mm != loaded_mm) {
+ cpumask_clear_cpu(raw_smp_processor_id(), mm_cpumask(f->mm));
return;
+ }
}
if (unlikely(loaded_mm == &init_mm))
* [tip: x86/mm] x86/mm/tlb: Put cpumask_test_cpu() check in switch_mm_irqs_off() under CONFIG_DEBUG_VM
2024-11-09 0:27 ` [PATCH 3/3] x86,tlb: put cpumask_test_cpu in prev == next under CONFIG_DEBUG_VM Rik van Riel
@ 2024-11-13 2:59 ` tip-bot2 for Rik van Riel
0 siblings, 0 replies; 18+ messages in thread
From: tip-bot2 for Rik van Riel @ 2024-11-13 2:59 UTC (permalink / raw)
To: linux-tip-commits
Cc: Rik van Riel, Ingo Molnar, Andy Lutomirski, Peter Zijlstra,
Linus Torvalds, x86, linux-kernel
The following commit has been merged into the x86/mm branch of tip:
Commit-ID: 7e33001b8b9a78062679e0fdf5b0842a49063135
Gitweb: https://git.kernel.org/tip/7e33001b8b9a78062679e0fdf5b0842a49063135
Author: Rik van Riel <riel@surriel.com>
AuthorDate: Fri, 08 Nov 2024 19:27:50 -05:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 13 Nov 2024 03:42:41 +01:00
x86/mm/tlb: Put cpumask_test_cpu() check in switch_mm_irqs_off() under CONFIG_DEBUG_VM
On a web server workload, the cpumask_test_cpu() inside the
WARN_ON_ONCE() in the 'prev == next' branch takes about 17% of
all the CPU time of switch_mm_irqs_off().
On a large fleet, this WARN_ON_ONCE() has not fired in at least
a month, possibly never.
Move this test under CONFIG_DEBUG_VM so it does not get compiled
in production kernels.
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20241109003727.3958374-4-riel@surriel.com
---
arch/x86/mm/tlb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 86593d1..b0d5a64 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -568,7 +568,7 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
* mm_cpumask. The TLB shootdown code can figure out from
* cpu_tlbstate_shared.is_lazy whether or not to send an IPI.
*/
- if (WARN_ON_ONCE(prev != &init_mm &&
+ if (IS_ENABLED(CONFIG_DEBUG_VM) && WARN_ON_ONCE(prev != &init_mm &&
!cpumask_test_cpu(cpu, mm_cpumask(next))))
cpumask_set_cpu(cpu, mm_cpumask(next));
* Re: [PATCH 0/3] x86,tlb: context switch optimizations
2024-11-09 0:27 [PATCH 0/3] x86,tlb: context switch optimizations Rik van Riel
` (2 preceding siblings ...)
2024-11-09 0:27 ` [PATCH 3/3] x86,tlb: put cpumask_test_cpu in prev == next under CONFIG_DEBUG_VM Rik van Riel
@ 2024-11-13 9:55 ` Borislav Petkov
2024-11-13 10:00 ` Ingo Molnar
` (2 more replies)
3 siblings, 3 replies; 18+ messages in thread
From: Borislav Petkov @ 2024-11-13 9:55 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, dave.hansen, luto, peterz, tglx, mingo, x86,
kernel-team, hpa
On Fri, Nov 08, 2024 at 07:27:47PM -0500, Rik van Riel wrote:
> While profiling switch_mm_irqs_off with several workloads,
> it appears there are two hot spots that probably don't need
> to be there.
One of those three is causing the below here, zapping them from tip.
[ 3.050931] smpboot: x86: Booting SMP configuration:
[ 3.054476] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20
[ 3.166533] Callback from call_rcu_tasks() invoked.
[ 3.178695] #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31
[ 3.186469] ------------[ cut here ]------------
[ 3.186469] WARNING: CPU: 16 PID: 97 at kernel/smp.c:807 smp_call_function_many_cond+0x188/0x720
[ 3.186469] Modules linked in:
[ 3.186469] CPU: 16 UID: 0 PID: 97 Comm: cpuhp/16 Not tainted 6.12.0-rc7+ #1
[ 3.186469] Hardware name: Supermicro Super Server/H12SSL-i, BIOS 2.5 09/08/2022
[ 3.186469] RIP: 0010:smp_call_function_many_cond+0x188/0x720
[ 3.186469] Code: 96 54 91 01 85 d2 0f 84 f7 fe ff ff 65 8b 05 37 8c e3 7e 85 c0 0f 85 e8 fe ff ff 65 8b 05 0c 89 e3 7e 85 c0 0f 85 d9 fe ff ff <0f> 0b e9 d2 fe ff ff 65 f7 05 4e c5 e4 7e ff ff ff 7f 0f 85 a6 fe
[ 3.186469] RSP: 0018:ffffc90000dbfbe8 EFLAGS: 00010046
[ 3.186469] RAX: 0000000000000000 RBX: ffffffff8109eeb0 RCX: 0000000000000000
[ 3.186469] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffff824908dd
[ 3.186469] RBP: ffff889003235380 R08: ffffffff8109eeb0 R09: 0000000000000006
[ 3.186469] R10: 0000000000000013 R11: 0000000000000005 R12: 0000000000000010
[ 3.186469] R13: ffff88810006a480 R14: ffff889003235380 R15: 0000000000000010
[ 3.186469] FS: 0000000000000000(0000) GS:ffff889003200000(0000) knlGS:0000000000000000
[ 3.186469] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3.186469] CR2: 0000000000000000 CR3: 0000000002a3a001 CR4: 0000000000770ef0
[ 3.186469] PKRU: 55555554
[ 3.186469] Call Trace:
[ 3.186469] <TASK>
[ 3.186469] ? __warn+0xa1/0x1c0
[ 3.186469] ? smp_call_function_many_cond+0x188/0x720
[ 3.186469] ? report_bug+0x1b5/0x1e0
[ 3.186469] ? handle_bug+0x58/0x90
[ 3.186469] ? exc_invalid_op+0x17/0x70
[ 3.186469] ? asm_exc_invalid_op+0x16/0x20
[ 3.186469] ? __pfx_tlb_is_not_lazy+0x10/0x10
[ 3.186469] ? __pfx_tlb_is_not_lazy+0x10/0x10
[ 3.186469] ? smp_call_function_many_cond+0x188/0x720
[ 3.186469] ? smp_call_function_many_cond+0x2a/0x720
[ 3.186469] ? __pte_offset_map_lock+0xa4/0x190
[ 3.186469] ? __pfx_flush_tlb_func+0x10/0x10
[ 3.186469] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3.186469] ? lock_acquire+0x11a/0x350
[ 3.186469] ? __pfx_flush_tlb_func+0x10/0x10
[ 3.186469] ? __pfx_tlb_is_not_lazy+0x10/0x10
[ 3.186469] on_each_cpu_cond_mask+0x50/0x90
[ 3.186469] flush_tlb_mm_range+0x1a8/0x1f0
[ 3.186469] ? cpu_bugs_smt_update+0x14/0x1f0
[ 3.186469] __text_poke+0x366/0x5d0
[ 3.186469] ? __pfx_text_poke_memcpy+0x10/0x10
[ 3.186469] ? cpu_bugs_smt_update+0x14/0x1f0
[ 3.186469] text_poke_bp_batch+0xa1/0x3d0
[ 3.186469] ? mptcp_pm_get_add_addr_signal_max+0x10/0x30
[ 3.186469] ? arch_jump_label_transform_queue+0x55/0x80
[ 3.186469] ? __pfx_sched_cpu_activate+0x10/0x10
[ 3.186469] text_poke_finish+0x1b/0x30
[ 3.186469] arch_jump_label_transform_apply+0x18/0x30
[ 3.186469] static_key_slow_inc_cpuslocked+0x55/0xa0
[ 3.186469] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3.186469] sched_cpu_activate+0x45/0x190
[ 3.186469] ? __pfx_sched_cpu_activate+0x10/0x10
[ 3.186469] cpuhp_invoke_callback+0x159/0x6b0
[ 3.186469] ? cpuhp_thread_fun+0x81/0x290
[ 3.186469] cpuhp_thread_fun+0x203/0x290
[ 3.186469] ? cpuhp_thread_fun+0x81/0x290
[ 3.186469] ? smpboot_thread_fn+0x2b/0x260
[ 3.186469] smpboot_thread_fn+0x1ae/0x260
[ 3.186469] ? __pfx_smpboot_thread_fn+0x10/0x10
[ 3.186469] kthread+0xee/0x120
[ 3.186469] ? __pfx_kthread+0x10/0x10
[ 3.186469] ret_from_fork+0x4c/0x60
[ 3.186469] ? __pfx_kthread+0x10/0x10
[ 3.186469] ret_from_fork_asm+0x1a/0x30
[ 3.186469] </TASK>
[ 3.186469] irq event stamp: 122
[ 3.186469] hardirqs last enabled at (121): [<ffffffff81106ff7>] balance_push_set+0xe7/0x110
[ 3.186469] hardirqs last disabled at (122): [<ffffffff81048129>] __text_poke+0x489/0x5d0
[ 3.186469] softirqs last enabled at (0): [<ffffffff810b1ae5>] copy_process+0x9f5/0x2a70
[ 3.186469] softirqs last disabled at (0): [<0000000000000000>] 0x0
[ 3.186469] ---[ end trace 0000000000000000 ]---
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH 0/3] x86,tlb: context switch optimizations
2024-11-13 9:55 ` [PATCH 0/3] x86,tlb: context switch optimizations Borislav Petkov
@ 2024-11-13 10:00 ` Ingo Molnar
2024-11-13 14:38 ` Rik van Riel
2024-11-13 14:55 ` Rik van Riel
2 siblings, 0 replies; 18+ messages in thread
From: Ingo Molnar @ 2024-11-13 10:00 UTC (permalink / raw)
To: Borislav Petkov
Cc: Rik van Riel, linux-kernel, dave.hansen, luto, peterz, tglx,
mingo, x86, kernel-team, hpa
* Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Nov 08, 2024 at 07:27:47PM -0500, Rik van Riel wrote:
> > While profiling switch_mm_irqs_off with several workloads,
> > it appears there are two hot spots that probably don't need
> > to be there.
>
> One of those three is causing the below here, zapping them from tip.
I've zapped the final two commits from tip:x86/mm that are the likely source of the
regression.
Thanks,
Ingo
* Re: [PATCH 0/3] x86,tlb: context switch optimizations
2024-11-13 9:55 ` [PATCH 0/3] x86,tlb: context switch optimizations Borislav Petkov
2024-11-13 10:00 ` Ingo Molnar
@ 2024-11-13 14:38 ` Rik van Riel
2024-11-14 11:33 ` Peter Zijlstra
2024-11-13 14:55 ` Rik van Riel
2 siblings, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2024-11-13 14:38 UTC (permalink / raw)
To: Borislav Petkov
Cc: linux-kernel, dave.hansen, luto, peterz, tglx, mingo, x86,
kernel-team, hpa, bigeasy
On Wed, 13 Nov 2024 10:55:50 +0100
Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Nov 08, 2024 at 07:27:47PM -0500, Rik van Riel wrote:
> > While profiling switch_mm_irqs_off with several workloads,
> > it appears there are two hot spots that probably don't need
> > to be there.
>
> One of those three is causing the below here, zapping them from tip.
>
This is interesting, and unexpected.
> [ 3.186469] ------------[ cut here ]------------
> [ 3.186469] WARNING: CPU: 16 PID: 97 at kernel/smp.c:807
> smp_call_function_many_cond+0x188/0x720
This is the lockdep_assert_irqs_enabled() from this branch:
if (cpu_online(this_cpu) && !oops_in_progress &&
!early_boot_irqs_disabled)
lockdep_assert_irqs_enabled();
> [ 3.186469] Call Trace:
> [ 3.186469] <TASK>
> [ 3.186469] on_each_cpu_cond_mask+0x50/0x90
> [ 3.186469] flush_tlb_mm_range+0x1a8/0x1f0
> [ 3.186469] __text_poke+0x366/0x5d0
... and sure enough, it looks like __text_poke() calls
flush_tlb_mm_range() with IRQs disabled!
> [ 3.186469] text_poke_bp_batch+0xa1/0x3d0
> [ 3.186469] text_poke_finish+0x1b/0x30
> [ 3.186469] arch_jump_label_transform_apply+0x18/0x30
> [ 3.186469] static_key_slow_inc_cpuslocked+0x55/0xa0
...
I have no good explanation for why that lockdep_assert_irqs_enabled()
would not be firing without my patches applied.
We obviously should not be sending out any IPIs with IRQs disabled.
However, __text_poke has been sending IPIs with interrupts disabled
for 4 years now! No wonder we see deadlocks involving __text_poke
on a semi-regular basis.
Should we move the local_irq_restore() in __text_poke() up a few lines,
like in the patch below?
Alternatively, should we explicitly clear the mm_cpumask in unuse_temporary_mm,
to make sure that mm never has any bits set in mm_cpumask?
Or, since we do not flush the TLB for the poking_mm until AFTER we have switched
back to the prev mm, should we simply always switch to the poking_mm in a way
that involves flushing the TLB? That way we won't even have to flush the entry
after unuse...
What is the best approach here?
---8<---
From a2e7c517bbd2cf108fc14c51449bf8e53e314b53 Mon Sep 17 00:00:00 2001
From: Rik van Riel <riel@surriel.com>
Date: Wed, 13 Nov 2024 09:19:39 -0500
Subject: [PATCH] x86,alternatives: re-enable interrupts before sending TLB flush IPI
__text_poke() calls flush_tlb_mm_range() to flush the mapping of
the text poke address. However, it does so with interrupts disabled,
which can cause a deadlock.
We do occasionally observe deadlocks involving __text_poke(), but
not frequently enough to spend much time debugging them.
Borislav triggered this bug while testing a different patch, which
lazily clears bits from the mm_cpumask, resulting in more bits being
set when __text_poke() calls flush_tlb_mm_range(), which in turn
triggered the lockdep_assert_irqs_enabled() in smp_call_function_many_cond().
Avoid sending IPIs with IRQs disabled by re-enabling IRQs earlier.
Signed-off-by: Rik van Riel <riel@surriel.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Cc: stable@kernel.org
Fixes: 7cf494270424 ("x86: expand irq-off region in text_poke()")
---
arch/x86/kernel/alternative.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index d17518ca19b8..f71d84249f6e 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1940,6 +1940,9 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
*/
unuse_temporary_mm(prev);
+ /* Re-enable interrupts before sending an IPI. */
+ local_irq_restore(flags);
+
/*
* Flushing the TLB might involve IPIs, which would require enabled
* IRQs, but not if the mm is not used, as it is in this point.
@@ -1956,7 +1959,6 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
BUG_ON(memcmp(addr, src, len));
}
- local_irq_restore(flags);
pte_unmap_unlock(ptep, ptl);
return addr;
}
--
2.45.2
* Re: [PATCH 0/3] x86,tlb: context switch optimizations
2024-11-13 9:55 ` [PATCH 0/3] x86,tlb: context switch optimizations Borislav Petkov
2024-11-13 10:00 ` Ingo Molnar
2024-11-13 14:38 ` Rik van Riel
@ 2024-11-13 14:55 ` Rik van Riel
2024-11-14 9:52 ` Ingo Molnar
2024-11-14 11:36 ` Peter Zijlstra
2 siblings, 2 replies; 18+ messages in thread
From: Rik van Riel @ 2024-11-13 14:55 UTC (permalink / raw)
To: Borislav Petkov
Cc: linux-kernel, dave.hansen, luto, peterz, tglx, mingo, x86,
kernel-team, hpa, bigeasy
On Wed, 13 Nov 2024 10:55:50 +0100
Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Nov 08, 2024 at 07:27:47PM -0500, Rik van Riel wrote:
> > While profiling switch_mm_irqs_off with several workloads,
> > it appears there are two hot spots that probably don't need
> > to be there.
>
> One of those three is causing the below here, zapping them from tip.
>
TL;DR: __text_poke ends up sending IPIs with interrupts disabled.
> [ 3.186469] on_each_cpu_cond_mask+0x50/0x90
> [ 3.186469] flush_tlb_mm_range+0x1a8/0x1f0
> [ 3.186469] ? cpu_bugs_smt_update+0x14/0x1f0
> [ 3.186469] __text_poke+0x366/0x5d0
Here is an alternative to avoid __text_poke() from calling
on_each_cpu_cond_mask() with IRQs disabled:
---8<---
From e872edeaad14c793036f290afc28000281e1b76a Mon Sep 17 00:00:00 2001
From: Rik van Riel <riel@surriel.com>
Date: Wed, 13 Nov 2024 09:51:16 -0500
Subject: [PATCH] x86/alternatives: defer poking_mm TLB flush to next use
Instead of doing a TLB flush of the poking_mm after we have
already switched back to the prev mm, we can simply increment
the tlb_gen for the poking_mm at unuse time.
This will cause switch_mm_irqs_off to flush the TLB next time
it loads the poking_mm, in the unlikely case that poking_mm still
has an ASID on that CPU by then.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/kernel/alternative.c | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index d17518ca19b8..f3caf5bc4df9 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1830,6 +1830,9 @@ static inline void unuse_temporary_mm(temp_mm_state_t prev_state)
lockdep_assert_irqs_disabled();
switch_mm_irqs_off(NULL, prev_state.mm, current);
+ /* Force a TLB flush next time poking_mm is used. */
+ inc_mm_tlb_gen(poking_mm);
+
/*
* Restore the breakpoints if they were disabled before the temporary mm
* was loaded.
@@ -1940,14 +1943,6 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
*/
unuse_temporary_mm(prev);
- /*
- * Flushing the TLB might involve IPIs, which would require enabled
- * IRQs, but not if the mm is not used, as it is in this point.
- */
- flush_tlb_mm_range(poking_mm, poking_addr, poking_addr +
- (cross_page_boundary ? 2 : 1) * PAGE_SIZE,
- PAGE_SHIFT, false);
-
if (func == text_poke_memcpy) {
/*
* If the text does not match what we just wrote then something is
--
2.45.2
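The generation-counter deferral in the patch above can be modeled in a few lines. This is an illustrative userspace sketch: the two counters mirror the roles of the mm's context.tlb_gen and the per-CPU flushed generation, not the kernel's actual fields.

```c
#include <assert.h>

static unsigned long mm_tlb_gen;   /* role of poking_mm's context.tlb_gen */
static unsigned long cpu_tlb_gen;  /* generation this CPU last flushed at */
static int flushes;

/* Called at unuse time instead of sending a flush IPI. */
static void inc_mm_tlb_gen(void)
{
    mm_tlb_gen++;
}

/*
 * The next switch to the mm notices the stale generation and flushes
 * locally -- no IPIs, and therefore no enabled-IRQs requirement.
 */
static void switch_to_mm(void)
{
    if (cpu_tlb_gen != mm_tlb_gen) {
        flushes++;
        cpu_tlb_gen = mm_tlb_gen;
    }
}
```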
* Re: [PATCH 0/3] x86,tlb: context switch optimizations
2024-11-13 14:55 ` Rik van Riel
@ 2024-11-14 9:52 ` Ingo Molnar
2024-11-14 11:36 ` Peter Zijlstra
2024-11-14 14:27 ` Rik van Riel
2024-11-14 11:36 ` Peter Zijlstra
1 sibling, 2 replies; 18+ messages in thread
From: Ingo Molnar @ 2024-11-14 9:52 UTC (permalink / raw)
To: Rik van Riel
Cc: Borislav Petkov, linux-kernel, dave.hansen, luto, peterz, tglx,
mingo, x86, kernel-team, hpa, bigeasy
* Rik van Riel <riel@surriel.com> wrote:
> On Wed, 13 Nov 2024 10:55:50 +0100
> Borislav Petkov <bp@alien8.de> wrote:
>
> > On Fri, Nov 08, 2024 at 07:27:47PM -0500, Rik van Riel wrote:
> > > While profiling switch_mm_irqs_off with several workloads,
> > > it appears there are two hot spots that probably don't need
> > > to be there.
> >
> > One of those three is causing the below here, zapping them from tip.
> >
>
> TL;DR: __text_poke ends up sending IPIs with interrupts disabled.
>
> > [ 3.186469] on_each_cpu_cond_mask+0x50/0x90
> > [ 3.186469] flush_tlb_mm_range+0x1a8/0x1f0
> > [ 3.186469] ? cpu_bugs_smt_update+0x14/0x1f0
> > [ 3.186469] __text_poke+0x366/0x5d0
>
> Here is an alternative to avoid __text_poke() from calling
> on_each_cpu_cond_mask() with IRQs disabled:
>
> ---8<---
> From e872edeaad14c793036f290afc28000281e1b76a Mon Sep 17 00:00:00 2001
> From: Rik van Riel <riel@surriel.com>
> Date: Wed, 13 Nov 2024 09:51:16 -0500
> Subject: [PATCH] x86/alternatives: defer poking_mm TLB flush to next use
I'd argue *both* of your patches improve the code, right?
Mind sending an updated series? It might not make it into the merge window,
but these look like good changes to me.
Thanks,
Ingo
* Re: [PATCH 0/3] x86,tlb: context switch optimizations
2024-11-13 14:38 ` Rik van Riel
@ 2024-11-14 11:33 ` Peter Zijlstra
0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2024-11-14 11:33 UTC (permalink / raw)
To: Rik van Riel
Cc: Borislav Petkov, linux-kernel, dave.hansen, luto, tglx, mingo,
x86, kernel-team, hpa, bigeasy
On Wed, Nov 13, 2024 at 09:38:26AM -0500, Rik van Riel wrote:
> On Wed, 13 Nov 2024 10:55:50 +0100
> Borislav Petkov <bp@alien8.de> wrote:
> > On Fri, Nov 08, 2024 at 07:27:47PM -0500, Rik van Riel wrote:
> > > While profiling switch_mm_irqs_off with several workloads,
> > > it appears there are two hot spots that probably don't need
> > > to be there.
> >
> > One of those three is causing the below here, zapping them from tip.
> >
>
> This is interesting, and unexpected.
>
> > [ 3.186469] ------------[ cut here ]------------
> > [ 3.186469] WARNING: CPU: 16 PID: 97 at kernel/smp.c:807
> > smp_call_function_many_cond+0x188/0x720
>
> This is the lockdep_assert_irqs_enabled() from this branch:
>
> if (cpu_online(this_cpu) && !oops_in_progress &&
> !early_boot_irqs_disabled)
> lockdep_assert_irqs_enabled();
>
> > [ 3.186469] Call Trace:
> > [ 3.186469] <TASK>
> > [ 3.186469] on_each_cpu_cond_mask+0x50/0x90
> > [ 3.186469] flush_tlb_mm_range+0x1a8/0x1f0
> > [ 3.186469] __text_poke+0x366/0x5d0
>
> ... and sure enough, it looks like __text_poke() calls
> flush_tlb_mm_range() with IRQs disabled!
>
> > [ 3.186469] text_poke_bp_batch+0xa1/0x3d0
> > [ 3.186469] text_poke_finish+0x1b/0x30
> > [ 3.186469] arch_jump_label_transform_apply+0x18/0x30
> > [ 3.186469] static_key_slow_inc_cpuslocked+0x55/0xa0
> ...
>
> I have no good explanation for why that lockdep_assert_irqs_enabled()
> would not be firing without my patches applied.
>
> We obviously should not be sending out any IPIs with IRQs disabled.
>
> However, __text_poke has been sending IPIs with interrupts disabled
> for 4 years now! No wonder we see deadlocks involving __text_poke
> on a semi-regular basis.
I don't think we have. Isn't the problem with patch 1, where
unuse_temporary_mm() now fails to clear the bit, with the direct result
being that flush_tlb_mm_range() now thinks it should be doing IPIs,
where previously it was a strict CPU-local affair?
* Re: [PATCh 0/3] x86,tlb: context switch optimizations
2024-11-13 14:55 ` Rik van Riel
2024-11-14 9:52 ` Ingo Molnar
@ 2024-11-14 11:36 ` Peter Zijlstra
2024-11-14 11:43 ` Peter Zijlstra
1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2024-11-14 11:36 UTC (permalink / raw)
To: Rik van Riel
Cc: Borislav Petkov, linux-kernel, dave.hansen, luto, tglx, mingo,
x86, kernel-team, hpa, bigeasy
On Wed, Nov 13, 2024 at 09:55:57AM -0500, Rik van Riel wrote:
> arch/x86/kernel/alternative.c | 11 +++--------
> 1 file changed, 3 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index d17518ca19b8..f3caf5bc4df9 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -1830,6 +1830,9 @@ static inline void unuse_temporary_mm(temp_mm_state_t prev_state)
> lockdep_assert_irqs_disabled();
> switch_mm_irqs_off(NULL, prev_state.mm, current);
>
> + /* Force a TLB flush next time poking_mm is used. */
> + inc_mm_tlb_gen(poking_mm);
> +
> /*
> * Restore the breakpoints if they were disabled before the temporary mm
> * was loaded.
> @@ -1940,14 +1943,6 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
> */
> unuse_temporary_mm(prev);
>
> - /*
> - * Flushing the TLB might involve IPIs, which would require enabled
> - * IRQs, but not if the mm is not used, as it is in this point.
> - */
> - flush_tlb_mm_range(poking_mm, poking_addr, poking_addr +
> - (cross_page_boundary ? 2 : 1) * PAGE_SIZE,
> - PAGE_SHIFT, false);
> -
> if (func == text_poke_memcpy) {
> /*
> * If the text does not match what we just wrote then something is
So I really don't like this.
Yes it avoids the immediate problem, but we're now sending IPIs where we
shouldn't be.
Fundamentally this whole text_poke thing is set up such that only a
single CPU will have this magical mapping with the aliases. Having it
send IPIs is crazy.
* Re: [PATCh 0/3] x86,tlb: context switch optimizations
2024-11-14 9:52 ` Ingo Molnar
@ 2024-11-14 11:36 ` Peter Zijlstra
2024-11-14 14:27 ` Rik van Riel
1 sibling, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2024-11-14 11:36 UTC (permalink / raw)
To: Ingo Molnar
Cc: Rik van Riel, Borislav Petkov, linux-kernel, dave.hansen, luto,
tglx, mingo, x86, kernel-team, hpa, bigeasy
On Thu, Nov 14, 2024 at 10:52:48AM +0100, Ingo Molnar wrote:
>
> * Rik van Riel <riel@surriel.com> wrote:
>
> > On Wed, 13 Nov 2024 10:55:50 +0100
> > Borislav Petkov <bp@alien8.de> wrote:
> >
> > > On Fri, Nov 08, 2024 at 07:27:47PM -0500, Rik van Riel wrote:
> > > > While profiling switch_mm_irqs_off with several workloads,
> > > > it appears there are two hot spots that probably don't need
> > > > to be there.
> > >
> > > One of those three is causing the below here, zapping them from tip.
> > >
> >
> > TL;DR: __text_poke ends up sending IPIs with interrupts disabled.
> >
> > > [ 3.186469] on_each_cpu_cond_mask+0x50/0x90
> > > [ 3.186469] flush_tlb_mm_range+0x1a8/0x1f0
> > > [ 3.186469] ? cpu_bugs_smt_update+0x14/0x1f0
> > > [ 3.186469] __text_poke+0x366/0x5d0
> >
> > Here is an alternative that keeps __text_poke() from calling
> > on_each_cpu_cond_mask() with IRQs disabled:
> >
> > ---8<---
> > From e872edeaad14c793036f290afc28000281e1b76a Mon Sep 17 00:00:00 2001
> > From: Rik van Riel <riel@surriel.com>
> > Date: Wed, 13 Nov 2024 09:51:16 -0500
> > Subject: [PATCH] x86/alternatives: defer poking_mm TLB flush to next use
>
> I'd argue *both* of your patches improve the code, right?
No, please don't.
* Re: [PATCh 0/3] x86,tlb: context switch optimizations
2024-11-14 11:36 ` Peter Zijlstra
@ 2024-11-14 11:43 ` Peter Zijlstra
0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2024-11-14 11:43 UTC (permalink / raw)
To: Rik van Riel
Cc: Borislav Petkov, linux-kernel, dave.hansen, luto, tglx, mingo,
x86, kernel-team, hpa, bigeasy
On Thu, Nov 14, 2024 at 12:36:17PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 13, 2024 at 09:55:57AM -0500, Rik van Riel wrote:
> > arch/x86/kernel/alternative.c | 11 +++--------
> > 1 file changed, 3 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> > index d17518ca19b8..f3caf5bc4df9 100644
> > --- a/arch/x86/kernel/alternative.c
> > +++ b/arch/x86/kernel/alternative.c
> > @@ -1830,6 +1830,9 @@ static inline void unuse_temporary_mm(temp_mm_state_t prev_state)
> > lockdep_assert_irqs_disabled();
> > switch_mm_irqs_off(NULL, prev_state.mm, current);
> >
> > + /* Force a TLB flush next time poking_mm is used. */
> > + inc_mm_tlb_gen(poking_mm);
> > +
> > /*
> > * Restore the breakpoints if they were disabled before the temporary mm
> > * was loaded.
> > @@ -1940,14 +1943,6 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
> > */
> > unuse_temporary_mm(prev);
> >
> > - /*
> > - * Flushing the TLB might involve IPIs, which would require enabled
> > - * IRQs, but not if the mm is not used, as it is in this point.
> > - */
> > - flush_tlb_mm_range(poking_mm, poking_addr, poking_addr +
> > - (cross_page_boundary ? 2 : 1) * PAGE_SIZE,
> > - PAGE_SHIFT, false);
> > -
> > if (func == text_poke_memcpy) {
> > /*
> > * If the text does not match what we just wrote then something is
>
> So I really don't like this.
>
> Yes it avoids the immediate problem, but we're now sending IPIs where we
> shouldn't be.
>
> Fundamentally this whole text_poke thing is set up such that only a
> single CPU will have this magical mapping with the aliases. Having it
> send IPIs is crazy.
I'm thinking the better fix is to make unuse_temporary_mm() explicitly
clear the bit if we don't want switch_mm() to do it.
* Re: [PATCh 0/3] x86,tlb: context switch optimizations
2024-11-14 9:52 ` Ingo Molnar
2024-11-14 11:36 ` Peter Zijlstra
@ 2024-11-14 14:27 ` Rik van Riel
2024-11-14 14:40 ` Peter Zijlstra
1 sibling, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2024-11-14 14:27 UTC (permalink / raw)
To: Ingo Molnar
Cc: Borislav Petkov, linux-kernel, dave.hansen, luto, peterz, tglx,
mingo, x86, kernel-team, hpa, bigeasy
On Thu, 2024-11-14 at 10:52 +0100, Ingo Molnar wrote:
>
> * Rik van Riel <riel@surriel.com> wrote:
>
> > On Wed, 13 Nov 2024 10:55:50 +0100
> > Borislav Petkov <bp@alien8.de> wrote:
> >
> > > On Fri, Nov 08, 2024 at 07:27:47PM -0500, Rik van Riel wrote:
> > > > While profiling switch_mm_irqs_off with several workloads,
> > > > it appears there are two hot spots that probably don't need
> > > > to be there.
> > >
> > > One of those three is causing the below here, zapping them from
> > > tip.
> > >
> >
> > TL;DR: __text_poke ends up sending IPIs with interrupts disabled.
> >
> > > [ 3.186469] on_each_cpu_cond_mask+0x50/0x90
> > > [ 3.186469] flush_tlb_mm_range+0x1a8/0x1f0
> > > [ 3.186469] ? cpu_bugs_smt_update+0x14/0x1f0
> > > [ 3.186469] __text_poke+0x366/0x5d0
> >
> > Here is an alternative that keeps __text_poke() from calling
> > on_each_cpu_cond_mask() with IRQs disabled:
> >
> > ---8<---
> > From e872edeaad14c793036f290afc28000281e1b76a Mon Sep 17 00:00:00
> > 2001
> > From: Rik van Riel <riel@surriel.com>
> > Date: Wed, 13 Nov 2024 09:51:16 -0500
> > Subject: [PATCH] x86/alternatives: defer poking_mm TLB flush to
> > next use
>
> I'd argue *both* of your patches improve the code, right?
>
We have 3 possible solutions, and I think we only need one of them.
> Mind sending an updated series? It might not make it into the merge
> window,
> but these look like good changes to me.
I would be happy to send a new series, but it would be good if
we agreed on what solution we wanted :)
1) Move the interrupt re-enabling up (probably not this one?)
2) Explicitly clear the mm_cpumask bit in unuse_temporary_mm()
3) Have unuse_temporary_mm increment the mm's tlb_gen, since that
is the only thing flush_tlb_mm_range really does for an MM
without any bits set in the mm_cpumask.
Which one would you guys prefer?
--
All Rights Reversed.
* Re: [PATCh 0/3] x86,tlb: context switch optimizations
2024-11-14 14:27 ` Rik van Riel
@ 2024-11-14 14:40 ` Peter Zijlstra
0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2024-11-14 14:40 UTC (permalink / raw)
To: Rik van Riel
Cc: Ingo Molnar, Borislav Petkov, linux-kernel, dave.hansen, luto,
tglx, mingo, x86, kernel-team, hpa, bigeasy
On Thu, Nov 14, 2024 at 09:27:25AM -0500, Rik van Riel wrote:
> 1) Move the interrupt re-enabling up (probably not this one?)
Correct, that one is wrong, as it results in IPIs that we don't want or
need.
> 2) Explicitly clear the mm_cpumask bit in unuse_temporary_mm()
>
> 3) Have unuse_temporary_mm increment the mm's tlb_gen, since that
> is the only thing flush_tlb_mm_range really does for an MM
> without any bits set in the mm_cpumask.
So flush_tlb_mm_range() has an 'mm == loaded_mm' case, which does a
local flush. I *think* we're not hitting that because switch_mm() does a
write to loaded_mm just before this.
But I don't think we want to proliferate the logic contained in
flush_tlb_mm_range() further than we have to.
So my preference goes to 2, as that seems to be the safest option.
Notably, text_poke() is not concerned with performance much.
end of thread, other threads:[~2024-11-14 14:40 UTC | newest]
Thread overview: 18+ messages
2024-11-09 0:27 [PATCh 0/3] x86,tlb: context switch optimizations Rik van Riel
2024-11-09 0:27 ` [PATCH 1/3] x86,tlb: update mm_cpumask lazily Rik van Riel
2024-11-13 2:59 ` [tip: x86/mm] x86/mm/tlb: Update " tip-bot2 for Rik van Riel
2024-11-09 0:27 ` [PATCH 2/3] x86,tlb: add tracepoint for TLB flush IPI to stale CPU Rik van Riel
2024-11-13 2:59 ` [tip: x86/mm] x86/mm/tlb: Add " tip-bot2 for Rik van Riel
2024-11-09 0:27 ` [PATCH 3/3] x86,tlb: put cpumask_test_cpu in prev == next under CONFIG_DEBUG_VM Rik van Riel
2024-11-13 2:59 ` [tip: x86/mm] x86/mm/tlb: Put cpumask_test_cpu() check in switch_mm_irqs_off() " tip-bot2 for Rik van Riel
2024-11-13 9:55 ` [PATCh 0/3] x86,tlb: context switch optimizations Borislav Petkov
2024-11-13 10:00 ` Ingo Molnar
2024-11-13 14:38 ` Rik van Riel
2024-11-14 11:33 ` Peter Zijlstra
2024-11-13 14:55 ` Rik van Riel
2024-11-14 9:52 ` Ingo Molnar
2024-11-14 11:36 ` Peter Zijlstra
2024-11-14 14:27 ` Rik van Riel
2024-11-14 14:40 ` Peter Zijlstra
2024-11-14 11:36 ` Peter Zijlstra
2024-11-14 11:43 ` Peter Zijlstra