All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 0/2] uprobes: Improve scalability by reducing the contention on siglock
@ 2025-01-24  9:38 Liao Chang
  2025-01-24  9:38 ` [PATCH v5 1/2] uprobes: Remove redundant spinlock in uprobe_deny_signal() Liao Chang
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Liao Chang @ 2025-01-24  9:38 UTC (permalink / raw)
  To: mhiramat, oleg, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	andrii.nakryiko
  Cc: linux-kernel, linux-trace-kernel, linux-perf-users, bpf

The profiling result of BPF selftest on ARM64 platform reveals the
significant contention on the current->sighand->siglock is the
scalability bottleneck. The reason is also very straightforward that all
producer threads of benchmark have to contend the spinlock mentioned to
resume the TIF_SIGPENDING bit in thread_info that might be removed in
uprobe_deny_signal().

The contention on current->sighand->siglock is unnecessary, this series
remove them thoroughly. I've use the script developed by Andrii in [1]
to run benchmark. The CPU used was Kunpeng916 (Hi1616), 4 NUMA nodes,
64 cores@2.4GHz running the kernel on next tree + the optimization in
[2] for get_xol_insn_slot().

before-opt
----------
uprobe-nop      ( 1 cpus):    0.907 ± 0.003M/s  (  0.907M/s/cpu)
uprobe-nop      ( 2 cpus):    1.676 ± 0.008M/s  (  0.838M/s/cpu)
uprobe-nop      ( 4 cpus):    3.210 ± 0.003M/s  (  0.802M/s/cpu)
uprobe-nop      ( 8 cpus):    4.457 ± 0.003M/s  (  0.557M/s/cpu)
uprobe-nop      (16 cpus):    3.724 ± 0.011M/s  (  0.233M/s/cpu)
uprobe-nop      (32 cpus):    2.761 ± 0.003M/s  (  0.086M/s/cpu)
uprobe-nop      (64 cpus):    1.293 ± 0.015M/s  (  0.020M/s/cpu)

uprobe-push     ( 1 cpus):    0.883 ± 0.001M/s  (  0.883M/s/cpu)
uprobe-push     ( 2 cpus):    1.642 ± 0.005M/s  (  0.821M/s/cpu)
uprobe-push     ( 4 cpus):    3.086 ± 0.002M/s  (  0.771M/s/cpu)
uprobe-push     ( 8 cpus):    3.390 ± 0.003M/s  (  0.424M/s/cpu)
uprobe-push     (16 cpus):    2.652 ± 0.005M/s  (  0.166M/s/cpu)
uprobe-push     (32 cpus):    2.713 ± 0.005M/s  (  0.085M/s/cpu)
uprobe-push     (64 cpus):    1.313 ± 0.009M/s  (  0.021M/s/cpu)

uprobe-ret      ( 1 cpus):    1.774 ± 0.000M/s  (  1.774M/s/cpu)
uprobe-ret      ( 2 cpus):    3.350 ± 0.001M/s  (  1.675M/s/cpu)
uprobe-ret      ( 4 cpus):    6.604 ± 0.000M/s  (  1.651M/s/cpu)
uprobe-ret      ( 8 cpus):    6.706 ± 0.005M/s  (  0.838M/s/cpu)
uprobe-ret      (16 cpus):    5.231 ± 0.001M/s  (  0.327M/s/cpu)
uprobe-ret      (32 cpus):    5.743 ± 0.003M/s  (  0.179M/s/cpu)
uprobe-ret      (64 cpus):    4.726 ± 0.016M/s  (  0.074M/s/cpu)

after-opt
---------
uprobe-nop      ( 1 cpus):    0.985 ± 0.002M/s  (  0.985M/s/cpu)
uprobe-nop      ( 2 cpus):    1.773 ± 0.005M/s  (  0.887M/s/cpu)
uprobe-nop      ( 4 cpus):    3.304 ± 0.001M/s  (  0.826M/s/cpu)
uprobe-nop      ( 8 cpus):    5.328 ± 0.002M/s  (  0.666M/s/cpu)
uprobe-nop      (16 cpus):    6.475 ± 0.002M/s  (  0.405M/s/cpu)
uprobe-nop      (32 cpus):    4.831 ± 0.082M/s  (  0.151M/s/cpu)
uprobe-nop      (64 cpus):    2.564 ± 0.053M/s  (  0.040M/s/cpu)

uprobe-push     ( 1 cpus):    0.964 ± 0.001M/s  (  0.964M/s/cpu)
uprobe-push     ( 2 cpus):    1.766 ± 0.002M/s  (  0.883M/s/cpu)
uprobe-push     ( 4 cpus):    3.290 ± 0.009M/s  (  0.823M/s/cpu)
uprobe-push     ( 8 cpus):    4.670 ± 0.002M/s  (  0.584M/s/cpu)
uprobe-push     (16 cpus):    5.197 ± 0.004M/s  (  0.325M/s/cpu)
uprobe-push     (32 cpus):    5.068 ± 0.161M/s  (  0.158M/s/cpu)
uprobe-push     (64 cpus):    2.605 ± 0.026M/s  (  0.041M/s/cpu)

uprobe-ret      ( 1 cpus):    1.833 ± 0.001M/s  (  1.833M/s/cpu)
uprobe-ret      ( 2 cpus):    3.384 ± 0.003M/s  (  1.692M/s/cpu)
uprobe-ret      ( 4 cpus):    6.677 ± 0.004M/s  (  1.669M/s/cpu)
uprobe-ret      ( 8 cpus):    6.854 ± 0.005M/s  (  0.857M/s/cpu)
uprobe-ret      (16 cpus):    6.508 ± 0.006M/s  (  0.407M/s/cpu)
uprobe-ret      (32 cpus):    5.793 ± 0.009M/s  (  0.181M/s/cpu)
uprobe-ret      (64 cpus):    4.743 ± 0.016M/s  (  0.074M/s/cpu)

Above benchmark results demonstrates a obivious improvement in the
scalability of trig-uprobe-nop and trig-uprobe-push, the peak throughput
of which are from 4.5M/s to 6.4M/s and 3.3M/s to 5.1M/s individually.

v5->v4:
Nothing new, just rebase to next-20250124.

v4->v3:
1. Rebase v3 [3] to the lateset tip/perf/core.
2. Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
3. Acked-by: Oleg Nesterov <oleg@redhat.com>

v3->v2:
Renaming the flag in [2/2], s/deny_signal/signal_denied/g.

v2->v1:
Oleg pointed out the _DENY_SIGNAL will be replaced by _ACK upon the
completion of singlestep which leads to handle_singlestep() has no
chance to restore the removed TIF_SIGPENDING [3] and some case in
question. So this revision proposes to use a flag in uprobe_task to
track the denied TIF_SIGPENDING instead of new UPROBE_SSTEP state.

[1] https://lore.kernel.org/all/20240731214256.3588718-1-andrii@kernel.org
[2] https://lore.kernel.org/all/20240727094405.1362496-1-liaochang1@huawei.com
[3] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/ 


Liao Chang (2):
  uprobes: Remove redundant spinlock in uprobe_deny_signal()
  uprobes: Remove the spinlock within handle_singlestep()

 include/linux/uprobes.h |  1 +
 kernel/events/uprobes.c | 10 +++++-----
 2 files changed, 6 insertions(+), 5 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v5 1/2] uprobes: Remove redundant spinlock in uprobe_deny_signal()
  2025-01-24  9:38 [PATCH v5 0/2] uprobes: Improve scalability by reducing the contention on siglock Liao Chang
@ 2025-01-24  9:38 ` Liao Chang
  2025-01-24 15:27   ` Steven Rostedt
  2025-02-03 12:48   ` [tip: perf/core] " tip-bot2 for Liao Chang
  2025-01-24  9:38 ` [PATCH v5 2/2] uprobes: Remove the spinlock within handle_singlestep() Liao Chang
  2025-01-27 11:28 ` [PATCH v5 0/2] uprobes: Improve scalability by reducing the contention on siglock Peter Zijlstra
  2 siblings, 2 replies; 10+ messages in thread
From: Liao Chang @ 2025-01-24  9:38 UTC (permalink / raw)
  To: mhiramat, oleg, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	andrii.nakryiko
  Cc: linux-kernel, linux-trace-kernel, linux-perf-users, bpf

Since clearing a bit in thread_info is an atomic operation, the spinlock
is redundant and can be removed, reducing lock contention is good for
performance.

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Liao Chang <liaochang1@huawei.com>
---
 kernel/events/uprobes.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index e421a5f2ec7d..7a3348dfedeb 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2298,9 +2298,7 @@ bool uprobe_deny_signal(void)
 	WARN_ON_ONCE(utask->state != UTASK_SSTEP);
 
 	if (task_sigpending(t)) {
-		spin_lock_irq(&t->sighand->siglock);
 		clear_tsk_thread_flag(t, TIF_SIGPENDING);
-		spin_unlock_irq(&t->sighand->siglock);
 
 		if (__fatal_signal_pending(t) || arch_uprobe_xol_was_trapped(t)) {
 			utask->state = UTASK_SSTEP_TRAPPED;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v5 2/2] uprobes: Remove the spinlock within handle_singlestep()
  2025-01-24  9:38 [PATCH v5 0/2] uprobes: Improve scalability by reducing the contention on siglock Liao Chang
  2025-01-24  9:38 ` [PATCH v5 1/2] uprobes: Remove redundant spinlock in uprobe_deny_signal() Liao Chang
@ 2025-01-24  9:38 ` Liao Chang
  2025-02-03 12:48   ` [tip: perf/core] " tip-bot2 for Liao Chang
  2025-02-05  9:38   ` tip-bot2 for Liao Chang
  2025-01-27 11:28 ` [PATCH v5 0/2] uprobes: Improve scalability by reducing the contention on siglock Peter Zijlstra
  2 siblings, 2 replies; 10+ messages in thread
From: Liao Chang @ 2025-01-24  9:38 UTC (permalink / raw)
  To: mhiramat, oleg, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	andrii.nakryiko
  Cc: linux-kernel, linux-trace-kernel, linux-perf-users, bpf

This patch introduces a flag to track TIF_SIGPENDING is suppress
temporarily during the uprobe single-step. Upon uprobe singlestep is
handled and the flag is confirmed, it could resume the TIF_SIGPENDING
directly without acquiring the siglock in most case, then reducing
contention and improving overall performance.

I've use the script developed by Andrii in [1] to run benchmark. The CPU
used was Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@2.4GHz running the
kernel on next tree + the optimization for get_xol_insn_slot() [2].

before-opt
----------
uprobe-nop      ( 1 cpus):    0.907 ± 0.003M/s  (  0.907M/s/cpu)
uprobe-nop      ( 2 cpus):    1.676 ± 0.008M/s  (  0.838M/s/cpu)
uprobe-nop      ( 4 cpus):    3.210 ± 0.003M/s  (  0.802M/s/cpu)
uprobe-nop      ( 8 cpus):    4.457 ± 0.003M/s  (  0.557M/s/cpu)
uprobe-nop      (16 cpus):    3.724 ± 0.011M/s  (  0.233M/s/cpu)
uprobe-nop      (32 cpus):    2.761 ± 0.003M/s  (  0.086M/s/cpu)
uprobe-nop      (64 cpus):    1.293 ± 0.015M/s  (  0.020M/s/cpu)

uprobe-push     ( 1 cpus):    0.883 ± 0.001M/s  (  0.883M/s/cpu)
uprobe-push     ( 2 cpus):    1.642 ± 0.005M/s  (  0.821M/s/cpu)
uprobe-push     ( 4 cpus):    3.086 ± 0.002M/s  (  0.771M/s/cpu)
uprobe-push     ( 8 cpus):    3.390 ± 0.003M/s  (  0.424M/s/cpu)
uprobe-push     (16 cpus):    2.652 ± 0.005M/s  (  0.166M/s/cpu)
uprobe-push     (32 cpus):    2.713 ± 0.005M/s  (  0.085M/s/cpu)
uprobe-push     (64 cpus):    1.313 ± 0.009M/s  (  0.021M/s/cpu)

uprobe-ret      ( 1 cpus):    1.774 ± 0.000M/s  (  1.774M/s/cpu)
uprobe-ret      ( 2 cpus):    3.350 ± 0.001M/s  (  1.675M/s/cpu)
uprobe-ret      ( 4 cpus):    6.604 ± 0.000M/s  (  1.651M/s/cpu)
uprobe-ret      ( 8 cpus):    6.706 ± 0.005M/s  (  0.838M/s/cpu)
uprobe-ret      (16 cpus):    5.231 ± 0.001M/s  (  0.327M/s/cpu)
uprobe-ret      (32 cpus):    5.743 ± 0.003M/s  (  0.179M/s/cpu)
uprobe-ret      (64 cpus):    4.726 ± 0.016M/s  (  0.074M/s/cpu)

after-opt
---------
uprobe-nop      ( 1 cpus):    0.985 ± 0.002M/s  (  0.985M/s/cpu)
uprobe-nop      ( 2 cpus):    1.773 ± 0.005M/s  (  0.887M/s/cpu)
uprobe-nop      ( 4 cpus):    3.304 ± 0.001M/s  (  0.826M/s/cpu)
uprobe-nop      ( 8 cpus):    5.328 ± 0.002M/s  (  0.666M/s/cpu)
uprobe-nop      (16 cpus):    6.475 ± 0.002M/s  (  0.405M/s/cpu)
uprobe-nop      (32 cpus):    4.831 ± 0.082M/s  (  0.151M/s/cpu)
uprobe-nop      (64 cpus):    2.564 ± 0.053M/s  (  0.040M/s/cpu)

uprobe-push     ( 1 cpus):    0.964 ± 0.001M/s  (  0.964M/s/cpu)
uprobe-push     ( 2 cpus):    1.766 ± 0.002M/s  (  0.883M/s/cpu)
uprobe-push     ( 4 cpus):    3.290 ± 0.009M/s  (  0.823M/s/cpu)
uprobe-push     ( 8 cpus):    4.670 ± 0.002M/s  (  0.584M/s/cpu)
uprobe-push     (16 cpus):    5.197 ± 0.004M/s  (  0.325M/s/cpu)
uprobe-push     (32 cpus):    5.068 ± 0.161M/s  (  0.158M/s/cpu)
uprobe-push     (64 cpus):    2.605 ± 0.026M/s  (  0.041M/s/cpu)

uprobe-ret      ( 1 cpus):    1.833 ± 0.001M/s  (  1.833M/s/cpu)
uprobe-ret      ( 2 cpus):    3.384 ± 0.003M/s  (  1.692M/s/cpu)
uprobe-ret      ( 4 cpus):    6.677 ± 0.004M/s  (  1.669M/s/cpu)
uprobe-ret      ( 8 cpus):    6.854 ± 0.005M/s  (  0.857M/s/cpu)
uprobe-ret      (16 cpus):    6.508 ± 0.006M/s  (  0.407M/s/cpu)
uprobe-ret      (32 cpus):    5.793 ± 0.009M/s  (  0.181M/s/cpu)
uprobe-ret      (64 cpus):    4.743 ± 0.016M/s  (  0.074M/s/cpu)

Above benchmark results demonstrates a obivious improvement in the
scalability of trig-uprobe-nop and trig-uprobe-push, the peak throughput
of which are from 4.5M/s to 6.4M/s and 3.3M/s to 5.1M/s individually.

[1] https://lore.kernel.org/all/20240731214256.3588718-1-andrii@kernel.org
[2] https://lore.kernel.org/all/20240727094405.1362496-1-liaochang1@huawei.com

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Liao Chang <liaochang1@huawei.com>
---
 include/linux/uprobes.h | 1 +
 kernel/events/uprobes.c | 8 +++++---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b1df7d792fa1..a40efdda9052 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -143,6 +143,7 @@ struct uprobe_task {
 
 	struct uprobe			*active_uprobe;
 	unsigned long			xol_vaddr;
+	bool				signal_denied;
 
 	struct arch_uprobe              *auprobe;
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 7a3348dfedeb..597b9e036e5f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2298,6 +2298,7 @@ bool uprobe_deny_signal(void)
 	WARN_ON_ONCE(utask->state != UTASK_SSTEP);
 
 	if (task_sigpending(t)) {
+		utask->signal_denied = true;
 		clear_tsk_thread_flag(t, TIF_SIGPENDING);
 
 		if (__fatal_signal_pending(t) || arch_uprobe_xol_was_trapped(t)) {
@@ -2731,9 +2732,10 @@ static void handle_singlestep(struct uprobe_task *utask, struct pt_regs *regs)
 	utask->state = UTASK_RUNNING;
 	xol_free_insn_slot(utask);
 
-	spin_lock_irq(&current->sighand->siglock);
-	recalc_sigpending(); /* see uprobe_deny_signal() */
-	spin_unlock_irq(&current->sighand->siglock);
+	if (utask->signal_denied) {
+		set_thread_flag(TIF_SIGPENDING);
+		utask->signal_denied = false;
+	}
 
 	if (unlikely(err)) {
 		uprobe_warn(current, "execute the probed insn, sending SIGILL.");
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v5 1/2] uprobes: Remove redundant spinlock in uprobe_deny_signal()
  2025-01-24  9:38 ` [PATCH v5 1/2] uprobes: Remove redundant spinlock in uprobe_deny_signal() Liao Chang
@ 2025-01-24 15:27   ` Steven Rostedt
  2025-01-24 17:25     ` Oleg Nesterov
  2025-02-03 12:48   ` [tip: perf/core] " tip-bot2 for Liao Chang
  1 sibling, 1 reply; 10+ messages in thread
From: Steven Rostedt @ 2025-01-24 15:27 UTC (permalink / raw)
  To: Liao Chang
  Cc: mhiramat, oleg, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	andrii.nakryiko, linux-kernel, linux-trace-kernel,
	linux-perf-users, bpf

On Fri, 24 Jan 2025 09:38:25 +0000
Liao Chang <liaochang1@huawei.com> wrote:

> Since clearing a bit in thread_info is an atomic operation, the spinlock
> is redundant and can be removed, reducing lock contention is good for
> performance.

Although this patch is probably fine, the change log suggests a dangerous
precedence. Just because clearing a flag is atomic, that alone does not
guarantee that it doesn't need spin locks around it.

There may be another path that tests the flag within a spin lock, and then
does a bunch of work assuming that the flag does not change while it is
doing that work. That other path would require a spin lock around the
clearing of the flag elsewhere.

I don't know this code well enough to know if this has that scenario, and
seeing the Acked-by from Oleg, I'm assuming it does not. But in any case,
the change log needs to give a better rationale for removing a spin lock than
just "clearing a flag atomically doesn't need a spin lock"!

-- Steve


> 
> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> Acked-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Liao Chang <liaochang1@huawei.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v5 1/2] uprobes: Remove redundant spinlock in uprobe_deny_signal()
  2025-01-24 15:27   ` Steven Rostedt
@ 2025-01-24 17:25     ` Oleg Nesterov
  2025-01-24 17:38       ` Oleg Nesterov
  0 siblings, 1 reply; 10+ messages in thread
From: Oleg Nesterov @ 2025-01-24 17:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Liao Chang, mhiramat, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	andrii.nakryiko, linux-kernel, linux-trace-kernel,
	linux-perf-users, bpf

On 01/24, Steven Rostedt wrote:
>
> On Fri, 24 Jan 2025 09:38:25 +0000
> Liao Chang <liaochang1@huawei.com> wrote:
>
> > Since clearing a bit in thread_info is an atomic operation, the spinlock
> > is redundant and can be removed, reducing lock contention is good for
> > performance.
>
> Although this patch is probably fine, the change log suggests a dangerous
> precedence. Just because clearing a flag is atomic, that alone does not
> guarantee that it doesn't need spin locks around it.

Yes. And iirc we already have the lockless users of clear(TIF_SIGPENDING)
(some if not most of them look buggy). But afaics in this (very special)
case it should be fine.

See also https://lore.kernel.org/all/20240812120738.GC11656@redhat.com/

> There may be another path that tests the flag within a spin lock,

Yes, retarget_shared_pending() or the complete_signal/wants_signal loop.
That is why it was decided to take siglock in uprobe_deny_signal(), just
to be "safe".

But I still think this patch is fine. The current task is going to execute
a single insn which can't enter the kernel and/or return to the userspace
before it calls handle_singlestep() and restores TIF_SIGPENDING. We do not
care if it races with another source of TIF_SIGPENDING.

The only problem is that task_sigpending() from another task can "wrongly"
return false in this window, but I don't see any problem.

Oleg.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v5 1/2] uprobes: Remove redundant spinlock in uprobe_deny_signal()
  2025-01-24 17:25     ` Oleg Nesterov
@ 2025-01-24 17:38       ` Oleg Nesterov
  0 siblings, 0 replies; 10+ messages in thread
From: Oleg Nesterov @ 2025-01-24 17:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Liao Chang, mhiramat, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	andrii.nakryiko, linux-kernel, linux-trace-kernel,
	linux-perf-users, bpf

On 01/24, Oleg Nesterov wrote:
>
> But I still think this patch is fine. The current task is going to execute
> a single insn which can't enter the kernel and/or return to the userspace
                      ^^^^^^^^^^^^^^^^^^^^^^
I mean't, it can't do syscall, sorry for the possible confusion.

Oleg.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v5 0/2] uprobes: Improve scalability by reducing the contention on siglock
  2025-01-24  9:38 [PATCH v5 0/2] uprobes: Improve scalability by reducing the contention on siglock Liao Chang
  2025-01-24  9:38 ` [PATCH v5 1/2] uprobes: Remove redundant spinlock in uprobe_deny_signal() Liao Chang
  2025-01-24  9:38 ` [PATCH v5 2/2] uprobes: Remove the spinlock within handle_singlestep() Liao Chang
@ 2025-01-27 11:28 ` Peter Zijlstra
  2 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2025-01-27 11:28 UTC (permalink / raw)
  To: Liao Chang
  Cc: mhiramat, oleg, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	andrii.nakryiko, linux-kernel, linux-trace-kernel,
	linux-perf-users, bpf

On Fri, Jan 24, 2025 at 09:38:24AM +0000, Liao Chang wrote:
> Liao Chang (2):
>   uprobes: Remove redundant spinlock in uprobe_deny_signal()
>   uprobes: Remove the spinlock within handle_singlestep()
> 
>  include/linux/uprobes.h |  1 +
>  kernel/events/uprobes.c | 10 +++++-----
>  2 files changed, 6 insertions(+), 5 deletions(-)

Thanks, I've picked up the patches but will not merge them until post
-rc1.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [tip: perf/core] uprobes: Remove the spinlock within handle_singlestep()
  2025-01-24  9:38 ` [PATCH v5 2/2] uprobes: Remove the spinlock within handle_singlestep() Liao Chang
@ 2025-02-03 12:48   ` tip-bot2 for Liao Chang
  2025-02-05  9:38   ` tip-bot2 for Liao Chang
  1 sibling, 0 replies; 10+ messages in thread
From: tip-bot2 for Liao Chang @ 2025-02-03 12:48 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     a66396c911bd662d503de3ec8f0a140b1081bde7
Gitweb:        https://git.kernel.org/tip/a66396c911bd662d503de3ec8f0a140b1081bde7
Author:        Liao Chang <liaochang1@huawei.com>
AuthorDate:    Fri, 24 Jan 2025 09:38:26 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Feb 2025 11:46:06 +01:00

uprobes: Remove the spinlock within handle_singlestep()

This patch introduces a flag to track TIF_SIGPENDING is suppress
temporarily during the uprobe single-step. Upon uprobe singlestep is
handled and the flag is confirmed, it could resume the TIF_SIGPENDING
directly without acquiring the siglock in most case, then reducing
contention and improving overall performance.

I've use the script developed by Andrii in [1] to run benchmark. The CPU
used was Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@2.4GHz running the
kernel on next tree + the optimization for get_xol_insn_slot() [2].

before-opt
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250124093826.2123675-3-liaochang1@huawei.com
---
 include/linux/uprobes.h | 1 +
 kernel/events/uprobes.c | 8 +++++---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b1df7d7..a40efdd 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -143,6 +143,7 @@ struct uprobe_task {
 
 	struct uprobe			*active_uprobe;
 	unsigned long			xol_vaddr;
+	bool				signal_denied;
 
 	struct arch_uprobe              *auprobe;
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 33bd608..870f697 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2302,6 +2302,7 @@ bool uprobe_deny_signal(void)
 	WARN_ON_ONCE(utask->state != UTASK_SSTEP);
 
 	if (task_sigpending(t)) {
+		utask->signal_denied = true;
 		clear_tsk_thread_flag(t, TIF_SIGPENDING);
 
 		if (__fatal_signal_pending(t) || arch_uprobe_xol_was_trapped(t)) {
@@ -2735,9 +2736,10 @@ static void handle_singlestep(struct uprobe_task *utask, struct pt_regs *regs)
 	utask->state = UTASK_RUNNING;
 	xol_free_insn_slot(utask);
 
-	spin_lock_irq(&current->sighand->siglock);
-	recalc_sigpending(); /* see uprobe_deny_signal() */
-	spin_unlock_irq(&current->sighand->siglock);
+	if (utask->signal_denied) {
+		set_thread_flag(TIF_SIGPENDING);
+		utask->signal_denied = false;
+	}
 
 	if (unlikely(err)) {
 		uprobe_warn(current, "execute the probed insn, sending SIGILL.");

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [tip: perf/core] uprobes: Remove redundant spinlock in uprobe_deny_signal()
  2025-01-24  9:38 ` [PATCH v5 1/2] uprobes: Remove redundant spinlock in uprobe_deny_signal() Liao Chang
  2025-01-24 15:27   ` Steven Rostedt
@ 2025-02-03 12:48   ` tip-bot2 for Liao Chang
  1 sibling, 0 replies; 10+ messages in thread
From: tip-bot2 for Liao Chang @ 2025-02-03 12:48 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Liao Chang, Peter Zijlstra (Intel), Masami Hiramatsu (Google),
	Oleg Nesterov, x86, linux-kernel

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     eae8a56ae0c74c1cf2f92a6709d215a9f329f60c
Gitweb:        https://git.kernel.org/tip/eae8a56ae0c74c1cf2f92a6709d215a9f329f60c
Author:        Liao Chang <liaochang1@huawei.com>
AuthorDate:    Fri, 24 Jan 2025 09:38:25 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Feb 2025 11:46:06 +01:00

uprobes: Remove redundant spinlock in uprobe_deny_signal()

Since clearing a bit in thread_info is an atomic operation, the spinlock
is redundant and can be removed, reducing lock contention is good for
performance.

Signed-off-by: Liao Chang <liaochang1@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250124093826.2123675-2-liaochang1@huawei.com
---
 kernel/events/uprobes.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 2ca797c..33bd608 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2302,9 +2302,7 @@ bool uprobe_deny_signal(void)
 	WARN_ON_ONCE(utask->state != UTASK_SSTEP);
 
 	if (task_sigpending(t)) {
-		spin_lock_irq(&t->sighand->siglock);
 		clear_tsk_thread_flag(t, TIF_SIGPENDING);
-		spin_unlock_irq(&t->sighand->siglock);
 
 		if (__fatal_signal_pending(t) || arch_uprobe_xol_was_trapped(t)) {
 			utask->state = UTASK_SSTEP_TRAPPED;

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [tip: perf/core] uprobes: Remove the spinlock within handle_singlestep()
  2025-01-24  9:38 ` [PATCH v5 2/2] uprobes: Remove the spinlock within handle_singlestep() Liao Chang
  2025-02-03 12:48   ` [tip: perf/core] " tip-bot2 for Liao Chang
@ 2025-02-05  9:38   ` tip-bot2 for Liao Chang
  1 sibling, 0 replies; 10+ messages in thread
From: tip-bot2 for Liao Chang @ 2025-02-05  9:38 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Masami Hiramatsu (Google), Oleg Nesterov, Liao Chang,
	Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     83179cd67846fae0e6aea5848c07faaa5d89a1de
Gitweb:        https://git.kernel.org/tip/83179cd67846fae0e6aea5848c07faaa5d89a1de
Author:        Liao Chang <liaochang1@huawei.com>
AuthorDate:    Fri, 24 Jan 2025 09:38:26 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 05 Feb 2025 10:29:12 +01:00

uprobes: Remove the spinlock within handle_singlestep()

This patch introduces a flag to track TIF_SIGPENDING is suppress
temporarily during the uprobe single-step. Upon uprobe singlestep is
handled and the flag is confirmed, it could resume the TIF_SIGPENDING
directly without acquiring the siglock in most case, then reducing
contention and improving overall performance.

I've use the script developed by Andrii in [1] to run benchmark. The CPU
used was Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@2.4GHz running the
kernel on next tree + the optimization for get_xol_insn_slot() [2].

before-opt
----------
uprobe-nop      ( 1 cpus):    0.907 ± 0.003M/s  (  0.907M/s/cpu)
uprobe-nop      ( 2 cpus):    1.676 ± 0.008M/s  (  0.838M/s/cpu)
uprobe-nop      ( 4 cpus):    3.210 ± 0.003M/s  (  0.802M/s/cpu)
uprobe-nop      ( 8 cpus):    4.457 ± 0.003M/s  (  0.557M/s/cpu)
uprobe-nop      (16 cpus):    3.724 ± 0.011M/s  (  0.233M/s/cpu)
uprobe-nop      (32 cpus):    2.761 ± 0.003M/s  (  0.086M/s/cpu)
uprobe-nop      (64 cpus):    1.293 ± 0.015M/s  (  0.020M/s/cpu)

uprobe-push     ( 1 cpus):    0.883 ± 0.001M/s  (  0.883M/s/cpu)
uprobe-push     ( 2 cpus):    1.642 ± 0.005M/s  (  0.821M/s/cpu)
uprobe-push     ( 4 cpus):    3.086 ± 0.002M/s  (  0.771M/s/cpu)
uprobe-push     ( 8 cpus):    3.390 ± 0.003M/s  (  0.424M/s/cpu)
uprobe-push     (16 cpus):    2.652 ± 0.005M/s  (  0.166M/s/cpu)
uprobe-push     (32 cpus):    2.713 ± 0.005M/s  (  0.085M/s/cpu)
uprobe-push     (64 cpus):    1.313 ± 0.009M/s  (  0.021M/s/cpu)

uprobe-ret      ( 1 cpus):    1.774 ± 0.000M/s  (  1.774M/s/cpu)
uprobe-ret      ( 2 cpus):    3.350 ± 0.001M/s  (  1.675M/s/cpu)
uprobe-ret      ( 4 cpus):    6.604 ± 0.000M/s  (  1.651M/s/cpu)
uprobe-ret      ( 8 cpus):    6.706 ± 0.005M/s  (  0.838M/s/cpu)
uprobe-ret      (16 cpus):    5.231 ± 0.001M/s  (  0.327M/s/cpu)
uprobe-ret      (32 cpus):    5.743 ± 0.003M/s  (  0.179M/s/cpu)
uprobe-ret      (64 cpus):    4.726 ± 0.016M/s  (  0.074M/s/cpu)

after-opt
---------
uprobe-nop      ( 1 cpus):    0.985 ± 0.002M/s  (  0.985M/s/cpu)
uprobe-nop      ( 2 cpus):    1.773 ± 0.005M/s  (  0.887M/s/cpu)
uprobe-nop      ( 4 cpus):    3.304 ± 0.001M/s  (  0.826M/s/cpu)
uprobe-nop      ( 8 cpus):    5.328 ± 0.002M/s  (  0.666M/s/cpu)
uprobe-nop      (16 cpus):    6.475 ± 0.002M/s  (  0.405M/s/cpu)
uprobe-nop      (32 cpus):    4.831 ± 0.082M/s  (  0.151M/s/cpu)
uprobe-nop      (64 cpus):    2.564 ± 0.053M/s  (  0.040M/s/cpu)

uprobe-push     ( 1 cpus):    0.964 ± 0.001M/s  (  0.964M/s/cpu)
uprobe-push     ( 2 cpus):    1.766 ± 0.002M/s  (  0.883M/s/cpu)
uprobe-push     ( 4 cpus):    3.290 ± 0.009M/s  (  0.823M/s/cpu)
uprobe-push     ( 8 cpus):    4.670 ± 0.002M/s  (  0.584M/s/cpu)
uprobe-push     (16 cpus):    5.197 ± 0.004M/s  (  0.325M/s/cpu)
uprobe-push     (32 cpus):    5.068 ± 0.161M/s  (  0.158M/s/cpu)
uprobe-push     (64 cpus):    2.605 ± 0.026M/s  (  0.041M/s/cpu)

uprobe-ret      ( 1 cpus):    1.833 ± 0.001M/s  (  1.833M/s/cpu)
uprobe-ret      ( 2 cpus):    3.384 ± 0.003M/s  (  1.692M/s/cpu)
uprobe-ret      ( 4 cpus):    6.677 ± 0.004M/s  (  1.669M/s/cpu)
uprobe-ret      ( 8 cpus):    6.854 ± 0.005M/s  (  0.857M/s/cpu)
uprobe-ret      (16 cpus):    6.508 ± 0.006M/s  (  0.407M/s/cpu)
uprobe-ret      (32 cpus):    5.793 ± 0.009M/s  (  0.181M/s/cpu)
uprobe-ret      (64 cpus):    4.743 ± 0.016M/s  (  0.074M/s/cpu)

Above benchmark results demonstrates a obivious improvement in the
scalability of trig-uprobe-nop and trig-uprobe-push, the peak throughput
of which are from 4.5M/s to 6.4M/s and 3.3M/s to 5.1M/s individually.

[1] https://lore.kernel.org/all/20240731214256.3588718-1-andrii@kernel.org
[2] https://lore.kernel.org/all/20240727094405.1362496-1-liaochang1@huawei.com

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Liao Chang <liaochang1@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250124093826.2123675-3-liaochang1@huawei.com
---
 include/linux/uprobes.h | 1 +
 kernel/events/uprobes.c | 8 +++++---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b1df7d7..a40efdd 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -143,6 +143,7 @@ struct uprobe_task {
 
 	struct uprobe			*active_uprobe;
 	unsigned long			xol_vaddr;
+	bool				signal_denied;
 
 	struct arch_uprobe              *auprobe;
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 33bd608..870f697 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2302,6 +2302,7 @@ bool uprobe_deny_signal(void)
 	WARN_ON_ONCE(utask->state != UTASK_SSTEP);
 
 	if (task_sigpending(t)) {
+		utask->signal_denied = true;
 		clear_tsk_thread_flag(t, TIF_SIGPENDING);
 
 		if (__fatal_signal_pending(t) || arch_uprobe_xol_was_trapped(t)) {
@@ -2735,9 +2736,10 @@ static void handle_singlestep(struct uprobe_task *utask, struct pt_regs *regs)
 	utask->state = UTASK_RUNNING;
 	xol_free_insn_slot(utask);
 
-	spin_lock_irq(&current->sighand->siglock);
-	recalc_sigpending(); /* see uprobe_deny_signal() */
-	spin_unlock_irq(&current->sighand->siglock);
+	if (utask->signal_denied) {
+		set_thread_flag(TIF_SIGPENDING);
+		utask->signal_denied = false;
+	}
 
 	if (unlikely(err)) {
 		uprobe_warn(current, "execute the probed insn, sending SIGILL.");

^ permalink raw reply related	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-02-05  9:38 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-24  9:38 [PATCH v5 0/2] uprobes: Improve scalability by reducing the contention on siglock Liao Chang
2025-01-24  9:38 ` [PATCH v5 1/2] uprobes: Remove redundant spinlock in uprobe_deny_signal() Liao Chang
2025-01-24 15:27   ` Steven Rostedt
2025-01-24 17:25     ` Oleg Nesterov
2025-01-24 17:38       ` Oleg Nesterov
2025-02-03 12:48   ` [tip: perf/core] " tip-bot2 for Liao Chang
2025-01-24  9:38 ` [PATCH v5 2/2] uprobes: Remove the spinlock within handle_singlestep() Liao Chang
2025-02-03 12:48   ` [tip: perf/core] " tip-bot2 for Liao Chang
2025-02-05  9:38   ` tip-bot2 for Liao Chang
2025-01-27 11:28 ` [PATCH v5 0/2] uprobes: Improve scalability by reducing the contention on siglock Peter Zijlstra

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.