* [PATCH v3 0/2] uprobes: Improve scalability by reducing the contention on siglock
@ 2024-08-15 1:46 Liao Chang
2024-08-15 1:46 ` [PATCH v3 1/2] uprobes: Remove redundant spinlock in uprobe_deny_signal() Liao Chang
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Liao Chang @ 2024-08-15 1:46 UTC (permalink / raw)
To: mhiramat, oleg, peterz, mingo, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
Cc: linux-kernel, linux-trace-kernel, linux-perf-users, bpf
Profiling the BPF selftest benchmark on an ARM64 platform reveals that
significant contention on current->sighand->siglock is the scalability
bottleneck. The reason is straightforward: all producer threads of the
benchmark have to contend for this spinlock to restore the TIF_SIGPENDING
bit in thread_info that might have been cleared in uprobe_deny_signal().
The contention on current->sighand->siglock is unnecessary, and this series
removes it entirely. I've used the script developed by Andrii in [1]
to run the benchmark. The CPU used was a Kunpeng916 (Hi1616), 4 NUMA nodes,
64 cores@2.4GHz, running the kernel from the next tree plus the
optimization in [2] for get_xol_insn_slot().
before-opt
----------
uprobe-nop ( 1 cpus): 0.907 ± 0.003M/s ( 0.907M/s/cpu)
uprobe-nop ( 2 cpus): 1.676 ± 0.008M/s ( 0.838M/s/cpu)
uprobe-nop ( 4 cpus): 3.210 ± 0.003M/s ( 0.802M/s/cpu)
uprobe-nop ( 8 cpus): 4.457 ± 0.003M/s ( 0.557M/s/cpu)
uprobe-nop (16 cpus): 3.724 ± 0.011M/s ( 0.233M/s/cpu)
uprobe-nop (32 cpus): 2.761 ± 0.003M/s ( 0.086M/s/cpu)
uprobe-nop (64 cpus): 1.293 ± 0.015M/s ( 0.020M/s/cpu)
uprobe-push ( 1 cpus): 0.883 ± 0.001M/s ( 0.883M/s/cpu)
uprobe-push ( 2 cpus): 1.642 ± 0.005M/s ( 0.821M/s/cpu)
uprobe-push ( 4 cpus): 3.086 ± 0.002M/s ( 0.771M/s/cpu)
uprobe-push ( 8 cpus): 3.390 ± 0.003M/s ( 0.424M/s/cpu)
uprobe-push (16 cpus): 2.652 ± 0.005M/s ( 0.166M/s/cpu)
uprobe-push (32 cpus): 2.713 ± 0.005M/s ( 0.085M/s/cpu)
uprobe-push (64 cpus): 1.313 ± 0.009M/s ( 0.021M/s/cpu)
uprobe-ret ( 1 cpus): 1.774 ± 0.000M/s ( 1.774M/s/cpu)
uprobe-ret ( 2 cpus): 3.350 ± 0.001M/s ( 1.675M/s/cpu)
uprobe-ret ( 4 cpus): 6.604 ± 0.000M/s ( 1.651M/s/cpu)
uprobe-ret ( 8 cpus): 6.706 ± 0.005M/s ( 0.838M/s/cpu)
uprobe-ret (16 cpus): 5.231 ± 0.001M/s ( 0.327M/s/cpu)
uprobe-ret (32 cpus): 5.743 ± 0.003M/s ( 0.179M/s/cpu)
uprobe-ret (64 cpus): 4.726 ± 0.016M/s ( 0.074M/s/cpu)
after-opt
---------
uprobe-nop ( 1 cpus): 0.985 ± 0.002M/s ( 0.985M/s/cpu)
uprobe-nop ( 2 cpus): 1.773 ± 0.005M/s ( 0.887M/s/cpu)
uprobe-nop ( 4 cpus): 3.304 ± 0.001M/s ( 0.826M/s/cpu)
uprobe-nop ( 8 cpus): 5.328 ± 0.002M/s ( 0.666M/s/cpu)
uprobe-nop (16 cpus): 6.475 ± 0.002M/s ( 0.405M/s/cpu)
uprobe-nop (32 cpus): 4.831 ± 0.082M/s ( 0.151M/s/cpu)
uprobe-nop (64 cpus): 2.564 ± 0.053M/s ( 0.040M/s/cpu)
uprobe-push ( 1 cpus): 0.964 ± 0.001M/s ( 0.964M/s/cpu)
uprobe-push ( 2 cpus): 1.766 ± 0.002M/s ( 0.883M/s/cpu)
uprobe-push ( 4 cpus): 3.290 ± 0.009M/s ( 0.823M/s/cpu)
uprobe-push ( 8 cpus): 4.670 ± 0.002M/s ( 0.584M/s/cpu)
uprobe-push (16 cpus): 5.197 ± 0.004M/s ( 0.325M/s/cpu)
uprobe-push (32 cpus): 5.068 ± 0.161M/s ( 0.158M/s/cpu)
uprobe-push (64 cpus): 2.605 ± 0.026M/s ( 0.041M/s/cpu)
uprobe-ret ( 1 cpus): 1.833 ± 0.001M/s ( 1.833M/s/cpu)
uprobe-ret ( 2 cpus): 3.384 ± 0.003M/s ( 1.692M/s/cpu)
uprobe-ret ( 4 cpus): 6.677 ± 0.004M/s ( 1.669M/s/cpu)
uprobe-ret ( 8 cpus): 6.854 ± 0.005M/s ( 0.857M/s/cpu)
uprobe-ret (16 cpus): 6.508 ± 0.006M/s ( 0.407M/s/cpu)
uprobe-ret (32 cpus): 5.793 ± 0.009M/s ( 0.181M/s/cpu)
uprobe-ret (64 cpus): 4.743 ± 0.016M/s ( 0.074M/s/cpu)
The benchmark results above demonstrate an obvious improvement in the
scalability of trig-uprobe-nop and trig-uprobe-push, whose peak throughput
rises from 4.457M/s to 6.475M/s and from 3.390M/s to 5.197M/s, respectively.
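As a rough cross-check of the peak figures, here is a minimal C sketch that
picks the peak of each series and computes the before/after ratio. The
sketch_* names are illustrative only and not part of this series; the values
are copied from the before-opt and after-opt tables above.

```c
#include <stddef.h>

/* Throughput in M/s for 1, 2, 4, 8, 16, 32 and 64 CPUs, copied from the tables. */
static const double sketch_nop_before[]  = { 0.907, 1.676, 3.210, 4.457, 3.724, 2.761, 1.293 };
static const double sketch_nop_after[]   = { 0.985, 1.773, 3.304, 5.328, 6.475, 4.831, 2.564 };
static const double sketch_push_before[] = { 0.883, 1.642, 3.086, 3.390, 2.652, 2.713, 1.313 };
static const double sketch_push_after[]  = { 0.964, 1.766, 3.290, 4.670, 5.197, 5.068, 2.605 };

#define SKETCH_NR_POINTS (sizeof(sketch_nop_before) / sizeof(sketch_nop_before[0]))

/* Return the largest value in v[0..n-1]; n must be at least 1. */
static inline double sketch_peak(const double *v, size_t n)
{
	double best = v[0];

	for (size_t i = 1; i < n; i++)
		if (v[i] > best)
			best = v[i];
	return best;
}

/* Ratio of the after-opt peak to the before-opt peak of one benchmark. */
static inline double sketch_peak_speedup(const double *before,
					 const double *after, size_t n)
{
	return sketch_peak(after, n) / sketch_peak(before, n);
}
```

With these inputs, sketch_peak_speedup() gives roughly 1.45x for uprobe-nop
and 1.53x for uprobe-push. In both cases the peak also moves from 8 CPUs to
16 CPUs.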
v3->v2:
Rename the flag in [2/2], s/deny_signal/signal_denied/g.
v2->v1:
Oleg pointed out that _DENY_SIGNAL would be replaced by _ACK upon
completion of the singlestep, which leaves handle_singlestep() no
chance to restore the cleared TIF_SIGPENDING [3], among other cases in
question. So this revision uses a flag in uprobe_task to track the
denied TIF_SIGPENDING instead of a new UPROBE_SSTEP state.
[1] https://lore.kernel.org/all/20240731214256.3588718-1-andrii@kernel.org
[2] https://lore.kernel.org/all/20240727094405.1362496-1-liaochang1@huawei.com
[3] https://lore.kernel.org/all/20240801082407.1618451-1-liaochang1@huawei.com
Liao Chang (2):
uprobes: Remove redundant spinlock in uprobe_deny_signal()
uprobes: Remove the spinlock within handle_singlestep()
include/linux/uprobes.h | 1 +
kernel/events/uprobes.c | 10 +++++-----
2 files changed, 6 insertions(+), 5 deletions(-)
--
2.34.1
* [PATCH v3 1/2] uprobes: Remove redundant spinlock in uprobe_deny_signal()
2024-08-15 1:46 [PATCH v3 0/2] uprobes: Improve scalability by reducing the contention on siglock Liao Chang
@ 2024-08-15 1:46 ` Liao Chang
2024-10-22 4:01 ` Masami Hiramatsu
2024-08-15 1:46 ` [PATCH v3 2/2] uprobes: Remove the spinlock within handle_singlestep() Liao Chang
2024-09-14 2:53 ` [PATCH v3 0/2] uprobes: Improve scalability by reducing the contention on siglock Liao, Chang
2 siblings, 1 reply; 12+ messages in thread
From: Liao Chang @ 2024-08-15 1:46 UTC (permalink / raw)
To: mhiramat, oleg, peterz, mingo, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
Cc: linux-kernel, linux-trace-kernel, linux-perf-users, bpf
Since clearing a bit in thread_info is an atomic operation, the spinlock
is redundant and can be removed. Reducing lock contention is good for
performance.
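The atomicity argument can be illustrated with a minimal userspace sketch.
It assumes C11 atomics as a stand-in for the kernel's atomic bit operations;
struct sketch_thread_info, SKETCH_TIF_SIGPENDING and the sketch_* helpers are
hypothetical names, not kernel APIs, and the bit number is arbitrary.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the TIF_SIGPENDING bit number (value is arbitrary). */
#define SKETCH_TIF_SIGPENDING 2u

/* Hypothetical stand-in for the flags word in thread_info. */
struct sketch_thread_info {
	atomic_ulong flags;
};

/* Atomically set one bit in the flags word. */
static inline void sketch_set_flag(struct sketch_thread_info *ti, unsigned int bit)
{
	atomic_fetch_or(&ti->flags, 1UL << bit);
}

/*
 * Atomically clear one bit. The read-modify-write is a single atomic
 * operation, so no lock is needed to keep the other bits intact.
 */
static inline void sketch_clear_flag(struct sketch_thread_info *ti, unsigned int bit)
{
	atomic_fetch_and(&ti->flags, ~(1UL << bit));
}

/* Return true if the given bit is currently set. */
static inline bool sketch_test_flag(struct sketch_thread_info *ti, unsigned int bit)
{
	return (atomic_load(&ti->flags) >> bit) & 1UL;
}
```

The sketch only shows that the clear itself needs no lock; whether other
users of the flag rely on the siglock for ordering is a separate question
that the review on this thread covers.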
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Liao Chang <liaochang1@huawei.com>
---
kernel/events/uprobes.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 73cc47708679..76a51a1f51e2 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1979,9 +1979,7 @@ bool uprobe_deny_signal(void)
WARN_ON_ONCE(utask->state != UTASK_SSTEP);
if (task_sigpending(t)) {
- spin_lock_irq(&t->sighand->siglock);
clear_tsk_thread_flag(t, TIF_SIGPENDING);
- spin_unlock_irq(&t->sighand->siglock);
if (__fatal_signal_pending(t) || arch_uprobe_xol_was_trapped(t)) {
utask->state = UTASK_SSTEP_TRAPPED;
--
2.34.1
* [PATCH v3 2/2] uprobes: Remove the spinlock within handle_singlestep()
2024-08-15 1:46 [PATCH v3 0/2] uprobes: Improve scalability by reducing the contention on siglock Liao Chang
2024-08-15 1:46 ` [PATCH v3 1/2] uprobes: Remove redundant spinlock in uprobe_deny_signal() Liao Chang
@ 2024-08-15 1:46 ` Liao Chang
2024-10-22 4:01 ` Masami Hiramatsu
2024-09-14 2:53 ` [PATCH v3 0/2] uprobes: Improve scalability by reducing the contention on siglock Liao, Chang
2 siblings, 1 reply; 12+ messages in thread
From: Liao Chang @ 2024-08-15 1:46 UTC (permalink / raw)
To: mhiramat, oleg, peterz, mingo, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
Cc: linux-kernel, linux-trace-kernel, linux-perf-users, bpf
This patch introduces a flag to track whether TIF_SIGPENDING has been
temporarily suppressed during the uprobe single-step. Once the single-step
has been handled and the flag is found set, TIF_SIGPENDING can be restored
directly without acquiring the siglock in most cases, reducing
contention and improving overall performance.
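The intended flow can be modelled in userspace as below. This is only a
sketch under stated assumptions: struct sketch_task and the sketch_* helpers
are hypothetical stand-ins for the task state and for uprobe_deny_signal()
and handle_singlestep(), and a C11 atomic_bool stands in for the
TIF_SIGPENDING thread flag.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the per-task state involved. */
struct sketch_task {
	atomic_bool sigpending;	/* models TIF_SIGPENDING */
	bool signal_denied;	/* models uprobe_task::signal_denied; in this
				 * model only the owning task touches it */
};

/* Models uprobe_deny_signal(): hide a pending signal during the single-step. */
static inline void sketch_deny_signal(struct sketch_task *t)
{
	if (atomic_load(&t->sigpending)) {
		t->signal_denied = true;	/* remember that we hid it */
		atomic_store(&t->sigpending, false);
	}
}

/* Models the restore step in handle_singlestep(): no lock is taken. */
static inline void sketch_handle_singlestep(struct sketch_task *t)
{
	if (t->signal_denied) {
		atomic_store(&t->sigpending, true);
		t->signal_denied = false;
	}
}
```

When no signal was pending, both helpers leave the state untouched, which is
the common case in the benchmark and the reason the restore path becomes
cheap.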
I've used the script developed by Andrii in [1] to run the benchmark. The CPU
used was a Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@2.4GHz, running the
kernel from the next tree plus the optimization for get_xol_insn_slot() [2].
before-opt
----------
uprobe-nop ( 1 cpus): 0.907 ± 0.003M/s ( 0.907M/s/cpu)
uprobe-nop ( 2 cpus): 1.676 ± 0.008M/s ( 0.838M/s/cpu)
uprobe-nop ( 4 cpus): 3.210 ± 0.003M/s ( 0.802M/s/cpu)
uprobe-nop ( 8 cpus): 4.457 ± 0.003M/s ( 0.557M/s/cpu)
uprobe-nop (16 cpus): 3.724 ± 0.011M/s ( 0.233M/s/cpu)
uprobe-nop (32 cpus): 2.761 ± 0.003M/s ( 0.086M/s/cpu)
uprobe-nop (64 cpus): 1.293 ± 0.015M/s ( 0.020M/s/cpu)
uprobe-push ( 1 cpus): 0.883 ± 0.001M/s ( 0.883M/s/cpu)
uprobe-push ( 2 cpus): 1.642 ± 0.005M/s ( 0.821M/s/cpu)
uprobe-push ( 4 cpus): 3.086 ± 0.002M/s ( 0.771M/s/cpu)
uprobe-push ( 8 cpus): 3.390 ± 0.003M/s ( 0.424M/s/cpu)
uprobe-push (16 cpus): 2.652 ± 0.005M/s ( 0.166M/s/cpu)
uprobe-push (32 cpus): 2.713 ± 0.005M/s ( 0.085M/s/cpu)
uprobe-push (64 cpus): 1.313 ± 0.009M/s ( 0.021M/s/cpu)
uprobe-ret ( 1 cpus): 1.774 ± 0.000M/s ( 1.774M/s/cpu)
uprobe-ret ( 2 cpus): 3.350 ± 0.001M/s ( 1.675M/s/cpu)
uprobe-ret ( 4 cpus): 6.604 ± 0.000M/s ( 1.651M/s/cpu)
uprobe-ret ( 8 cpus): 6.706 ± 0.005M/s ( 0.838M/s/cpu)
uprobe-ret (16 cpus): 5.231 ± 0.001M/s ( 0.327M/s/cpu)
uprobe-ret (32 cpus): 5.743 ± 0.003M/s ( 0.179M/s/cpu)
uprobe-ret (64 cpus): 4.726 ± 0.016M/s ( 0.074M/s/cpu)
after-opt
---------
uprobe-nop ( 1 cpus): 0.985 ± 0.002M/s ( 0.985M/s/cpu)
uprobe-nop ( 2 cpus): 1.773 ± 0.005M/s ( 0.887M/s/cpu)
uprobe-nop ( 4 cpus): 3.304 ± 0.001M/s ( 0.826M/s/cpu)
uprobe-nop ( 8 cpus): 5.328 ± 0.002M/s ( 0.666M/s/cpu)
uprobe-nop (16 cpus): 6.475 ± 0.002M/s ( 0.405M/s/cpu)
uprobe-nop (32 cpus): 4.831 ± 0.082M/s ( 0.151M/s/cpu)
uprobe-nop (64 cpus): 2.564 ± 0.053M/s ( 0.040M/s/cpu)
uprobe-push ( 1 cpus): 0.964 ± 0.001M/s ( 0.964M/s/cpu)
uprobe-push ( 2 cpus): 1.766 ± 0.002M/s ( 0.883M/s/cpu)
uprobe-push ( 4 cpus): 3.290 ± 0.009M/s ( 0.823M/s/cpu)
uprobe-push ( 8 cpus): 4.670 ± 0.002M/s ( 0.584M/s/cpu)
uprobe-push (16 cpus): 5.197 ± 0.004M/s ( 0.325M/s/cpu)
uprobe-push (32 cpus): 5.068 ± 0.161M/s ( 0.158M/s/cpu)
uprobe-push (64 cpus): 2.605 ± 0.026M/s ( 0.041M/s/cpu)
uprobe-ret ( 1 cpus): 1.833 ± 0.001M/s ( 1.833M/s/cpu)
uprobe-ret ( 2 cpus): 3.384 ± 0.003M/s ( 1.692M/s/cpu)
uprobe-ret ( 4 cpus): 6.677 ± 0.004M/s ( 1.669M/s/cpu)
uprobe-ret ( 8 cpus): 6.854 ± 0.005M/s ( 0.857M/s/cpu)
uprobe-ret (16 cpus): 6.508 ± 0.006M/s ( 0.407M/s/cpu)
uprobe-ret (32 cpus): 5.793 ± 0.009M/s ( 0.181M/s/cpu)
uprobe-ret (64 cpus): 4.743 ± 0.016M/s ( 0.074M/s/cpu)
The benchmark results above demonstrate an obvious improvement in the
scalability of trig-uprobe-nop and trig-uprobe-push, whose peak throughput
rises from 4.457M/s to 6.475M/s and from 3.390M/s to 5.197M/s, respectively.
[1] https://lore.kernel.org/all/20240731214256.3588718-1-andrii@kernel.org
[2] https://lore.kernel.org/all/20240727094405.1362496-1-liaochang1@huawei.com
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Liao Chang <liaochang1@huawei.com>
---
include/linux/uprobes.h | 1 +
kernel/events/uprobes.c | 8 +++++---
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b503fafb7fb3..e4f57117d9c3 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -75,6 +75,7 @@ struct uprobe_task {
struct uprobe *active_uprobe;
unsigned long xol_vaddr;
+ bool signal_denied;
struct return_instance *return_instances;
unsigned int depth;
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 76a51a1f51e2..589aa2af1a99 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1979,6 +1979,7 @@ bool uprobe_deny_signal(void)
WARN_ON_ONCE(utask->state != UTASK_SSTEP);
if (task_sigpending(t)) {
+ utask->signal_denied = true;
clear_tsk_thread_flag(t, TIF_SIGPENDING);
if (__fatal_signal_pending(t) || arch_uprobe_xol_was_trapped(t)) {
@@ -2288,9 +2289,10 @@ static void handle_singlestep(struct uprobe_task *utask, struct pt_regs *regs)
utask->state = UTASK_RUNNING;
xol_free_insn_slot(current);
- spin_lock_irq(&current->sighand->siglock);
- recalc_sigpending(); /* see uprobe_deny_signal() */
- spin_unlock_irq(&current->sighand->siglock);
+ if (utask->signal_denied) {
+ set_thread_flag(TIF_SIGPENDING);
+ utask->signal_denied = false;
+ }
if (unlikely(err)) {
uprobe_warn(current, "execute the probed insn, sending SIGILL.");
--
2.34.1
* Re: [PATCH v3 0/2] uprobes: Improve scalability by reducing the contention on siglock
2024-08-15 1:46 [PATCH v3 0/2] uprobes: Improve scalability by reducing the contention on siglock Liao Chang
2024-08-15 1:46 ` [PATCH v3 1/2] uprobes: Remove redundant spinlock in uprobe_deny_signal() Liao Chang
2024-08-15 1:46 ` [PATCH v3 2/2] uprobes: Remove the spinlock within handle_singlestep() Liao Chang
@ 2024-09-14 2:53 ` Liao, Chang
2024-09-15 15:18 ` Oleg Nesterov
2 siblings, 1 reply; 12+ messages in thread
From: Liao, Chang @ 2024-09-14 2:53 UTC (permalink / raw)
To: oleg
Cc: linux-kernel, linux-trace-kernel, linux-perf-users, bpf,
Masami Hiramatsu, Peter Zijlstra, Andrii Nakryiko
Hi, Oleg
Kindly ping.
This series has been pending for a month. Is there any issue I overlooked?
Thanks.
On 2024/8/15 9:46, Liao Chang wrote:
> Profiling the BPF selftest benchmark on an ARM64 platform reveals that
> significant contention on current->sighand->siglock is the scalability
> bottleneck. The reason is straightforward: all producer threads of the
> benchmark have to contend for this spinlock to restore the TIF_SIGPENDING
> bit in thread_info that might have been cleared in uprobe_deny_signal().
>
> The contention on current->sighand->siglock is unnecessary, and this series
> removes it entirely. I've used the script developed by Andrii in [1]
> to run the benchmark. The CPU used was a Kunpeng916 (Hi1616), 4 NUMA nodes,
> 64 cores@2.4GHz, running the kernel from the next tree plus the
> optimization in [2] for get_xol_insn_slot().
>
> before-opt
> ----------
> uprobe-nop ( 1 cpus): 0.907 ± 0.003M/s ( 0.907M/s/cpu)
> uprobe-nop ( 2 cpus): 1.676 ± 0.008M/s ( 0.838M/s/cpu)
> uprobe-nop ( 4 cpus): 3.210 ± 0.003M/s ( 0.802M/s/cpu)
> uprobe-nop ( 8 cpus): 4.457 ± 0.003M/s ( 0.557M/s/cpu)
> uprobe-nop (16 cpus): 3.724 ± 0.011M/s ( 0.233M/s/cpu)
> uprobe-nop (32 cpus): 2.761 ± 0.003M/s ( 0.086M/s/cpu)
> uprobe-nop (64 cpus): 1.293 ± 0.015M/s ( 0.020M/s/cpu)
>
> uprobe-push ( 1 cpus): 0.883 ± 0.001M/s ( 0.883M/s/cpu)
> uprobe-push ( 2 cpus): 1.642 ± 0.005M/s ( 0.821M/s/cpu)
> uprobe-push ( 4 cpus): 3.086 ± 0.002M/s ( 0.771M/s/cpu)
> uprobe-push ( 8 cpus): 3.390 ± 0.003M/s ( 0.424M/s/cpu)
> uprobe-push (16 cpus): 2.652 ± 0.005M/s ( 0.166M/s/cpu)
> uprobe-push (32 cpus): 2.713 ± 0.005M/s ( 0.085M/s/cpu)
> uprobe-push (64 cpus): 1.313 ± 0.009M/s ( 0.021M/s/cpu)
>
> uprobe-ret ( 1 cpus): 1.774 ± 0.000M/s ( 1.774M/s/cpu)
> uprobe-ret ( 2 cpus): 3.350 ± 0.001M/s ( 1.675M/s/cpu)
> uprobe-ret ( 4 cpus): 6.604 ± 0.000M/s ( 1.651M/s/cpu)
> uprobe-ret ( 8 cpus): 6.706 ± 0.005M/s ( 0.838M/s/cpu)
> uprobe-ret (16 cpus): 5.231 ± 0.001M/s ( 0.327M/s/cpu)
> uprobe-ret (32 cpus): 5.743 ± 0.003M/s ( 0.179M/s/cpu)
> uprobe-ret (64 cpus): 4.726 ± 0.016M/s ( 0.074M/s/cpu)
>
> after-opt
> ---------
> uprobe-nop ( 1 cpus): 0.985 ± 0.002M/s ( 0.985M/s/cpu)
> uprobe-nop ( 2 cpus): 1.773 ± 0.005M/s ( 0.887M/s/cpu)
> uprobe-nop ( 4 cpus): 3.304 ± 0.001M/s ( 0.826M/s/cpu)
> uprobe-nop ( 8 cpus): 5.328 ± 0.002M/s ( 0.666M/s/cpu)
> uprobe-nop (16 cpus): 6.475 ± 0.002M/s ( 0.405M/s/cpu)
> uprobe-nop (32 cpus): 4.831 ± 0.082M/s ( 0.151M/s/cpu)
> uprobe-nop (64 cpus): 2.564 ± 0.053M/s ( 0.040M/s/cpu)
>
> uprobe-push ( 1 cpus): 0.964 ± 0.001M/s ( 0.964M/s/cpu)
> uprobe-push ( 2 cpus): 1.766 ± 0.002M/s ( 0.883M/s/cpu)
> uprobe-push ( 4 cpus): 3.290 ± 0.009M/s ( 0.823M/s/cpu)
> uprobe-push ( 8 cpus): 4.670 ± 0.002M/s ( 0.584M/s/cpu)
> uprobe-push (16 cpus): 5.197 ± 0.004M/s ( 0.325M/s/cpu)
> uprobe-push (32 cpus): 5.068 ± 0.161M/s ( 0.158M/s/cpu)
> uprobe-push (64 cpus): 2.605 ± 0.026M/s ( 0.041M/s/cpu)
>
> uprobe-ret ( 1 cpus): 1.833 ± 0.001M/s ( 1.833M/s/cpu)
> uprobe-ret ( 2 cpus): 3.384 ± 0.003M/s ( 1.692M/s/cpu)
> uprobe-ret ( 4 cpus): 6.677 ± 0.004M/s ( 1.669M/s/cpu)
> uprobe-ret ( 8 cpus): 6.854 ± 0.005M/s ( 0.857M/s/cpu)
> uprobe-ret (16 cpus): 6.508 ± 0.006M/s ( 0.407M/s/cpu)
> uprobe-ret (32 cpus): 5.793 ± 0.009M/s ( 0.181M/s/cpu)
> uprobe-ret (64 cpus): 4.743 ± 0.016M/s ( 0.074M/s/cpu)
>
> The benchmark results above demonstrate an obvious improvement in the
> scalability of trig-uprobe-nop and trig-uprobe-push, whose peak throughput
> rises from 4.457M/s to 6.475M/s and from 3.390M/s to 5.197M/s, respectively.
>
> v3->v2:
> Rename the flag in [2/2], s/deny_signal/signal_denied/g.
>
> v2->v1:
> Oleg pointed out that _DENY_SIGNAL would be replaced by _ACK upon
> completion of the singlestep, which leaves handle_singlestep() no
> chance to restore the cleared TIF_SIGPENDING [3], among other cases in
> question. So this revision uses a flag in uprobe_task to track the
> denied TIF_SIGPENDING instead of a new UPROBE_SSTEP state.
>
> [1] https://lore.kernel.org/all/20240731214256.3588718-1-andrii@kernel.org
> [2] https://lore.kernel.org/all/20240727094405.1362496-1-liaochang1@huawei.com
> [3] https://lore.kernel.org/all/20240801082407.1618451-1-liaochang1@huawei.com
>
> Liao Chang (2):
> uprobes: Remove redundant spinlock in uprobe_deny_signal()
> uprobes: Remove the spinlock within handle_singlestep()
>
> include/linux/uprobes.h | 1 +
> kernel/events/uprobes.c | 10 +++++-----
> 2 files changed, 6 insertions(+), 5 deletions(-)
>
--
BR
Liao, Chang
* Re: [PATCH v3 0/2] uprobes: Improve scalability by reducing the contention on siglock
2024-09-14 2:53 ` [PATCH v3 0/2] uprobes: Improve scalability by reducing the contention on siglock Liao, Chang
@ 2024-09-15 15:18 ` Oleg Nesterov
2024-09-18 2:05 ` Liao, Chang
0 siblings, 1 reply; 12+ messages in thread
From: Oleg Nesterov @ 2024-09-15 15:18 UTC (permalink / raw)
To: Liao, Chang
Cc: linux-kernel, linux-trace-kernel, linux-perf-users, bpf,
Masami Hiramatsu, Peter Zijlstra, Andrii Nakryiko
Hi Liao,
On 09/14, Liao, Chang wrote:
>
> Hi, Oleg
>
> Kindly ping.
>
> This series has been pending for a month. Is there any issue I overlooked?
Well, I have already acked both patches.
Please resend them to Peter/Masami, with my acks included.
Oleg.
* Re: [PATCH v3 0/2] uprobes: Improve scalability by reducing the contention on siglock
2024-09-15 15:18 ` Oleg Nesterov
@ 2024-09-18 2:05 ` Liao, Chang
2024-10-11 19:34 ` Andrii Nakryiko
0 siblings, 1 reply; 12+ messages in thread
From: Liao, Chang @ 2024-09-18 2:05 UTC (permalink / raw)
To: Masami Hiramatsu, Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, linux-perf-users, bpf,
Andrii Nakryiko, Oleg Nesterov
Hi, Peter and Masami
I look forward to your input on this series. Andrii has shown that it is
helpful for uprobe scalability.
Thanks.
On 2024/9/15 23:18, Oleg Nesterov wrote:
> Hi Liao,
>
> On 09/14, Liao, Chang wrote:
>>
>> Hi, Oleg
>>
>> Kindly ping.
>>
>> This series has been pending for a month. Is there any issue I overlooked?
>
> Well, I have already acked both patches.
>
> Please resend them to Peter/Masami, with my acks included.
>
> Oleg.
>
>
--
BR
Liao, Chang
* Re: [PATCH v3 0/2] uprobes: Improve scalability by reducing the contention on siglock
2024-09-18 2:05 ` Liao, Chang
@ 2024-10-11 19:34 ` Andrii Nakryiko
2024-10-21 10:43 ` Liao, Chang
0 siblings, 1 reply; 12+ messages in thread
From: Andrii Nakryiko @ 2024-10-11 19:34 UTC (permalink / raw)
To: Liao, Chang
Cc: Masami Hiramatsu, Peter Zijlstra, linux-kernel,
linux-trace-kernel, linux-perf-users, bpf, Andrii Nakryiko,
Oleg Nesterov
On Tue, Sep 17, 2024 at 7:05 PM Liao, Chang <liaochang1@huawei.com> wrote:
>
> Hi, Peter and Masami
>
> I look forward to your input on this series. Andrii has shown that it is
> helpful for uprobe scalability.
>
> Thanks.
>
> On 2024/9/15 23:18, Oleg Nesterov wrote:
> > Hi Liao,
> >
> > On 09/14, Liao, Chang wrote:
> >>
> >> Hi, Oleg
> >>
> >> Kindly ping.
> >>
> >> This series has been pending for a month. Is there any issue I overlooked?
> >
> > Well, I have already acked both patches.
> >
> > Please resend them to Peter/Masami, with my acks included.
> >
Hey Liao,
I didn't see v4 from you for this patch set with Oleg's acks. Did you
get a chance to rebase, add acks, and send the latest version?
> > Oleg.
> >
> >
>
> --
> BR
> Liao, Chang
* Re: [PATCH v3 0/2] uprobes: Improve scalability by reducing the contention on siglock
2024-10-11 19:34 ` Andrii Nakryiko
@ 2024-10-21 10:43 ` Liao, Chang
2024-10-21 17:18 ` Andrii Nakryiko
0 siblings, 1 reply; 12+ messages in thread
From: Liao, Chang @ 2024-10-21 10:43 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Masami Hiramatsu, Peter Zijlstra, linux-kernel,
linux-trace-kernel, linux-perf-users, bpf, Andrii Nakryiko,
Oleg Nesterov
On 2024/10/12 3:34, Andrii Nakryiko wrote:
> On Tue, Sep 17, 2024 at 7:05 PM Liao, Chang <liaochang1@huawei.com> wrote:
>>
>> Hi, Peter and Masami
>>
>> I look forward to your input on this series. Andrii has shown that it is
>> helpful for uprobe scalability.
>>
>> Thanks.
>>
>> On 2024/9/15 23:18, Oleg Nesterov wrote:
>>> Hi Liao,
>>>
>>> On 09/14, Liao, Chang wrote:
>>>>
>>>> Hi, Oleg
>>>>
>>>> Kindly ping.
>>>>
>>>> This series has been pending for a month. Is there any issue I overlooked?
>>>
>>> Well, I have already acked both patches.
>>>
>>> Please resend them to Peter/Masami, with my acks included.
>>>
>
> Hey Liao,
>
> I didn't see v4 from you for this patch set with Oleg's acks. Did you
> get a chance to rebase, add acks, and send the latest version?
Andrii,
I am ready to send v4 based on the latest kernel from the next tree. That said,
I haven't heard back from any of the maintainers except Oleg, so I'm a bit
unsure whether I should make further changes to this series.
>
>>> Oleg.
>>>
>>>
>>
>> --
>> BR
>> Liao, Chang
--
BR
Liao, Chang
* Re: [PATCH v3 0/2] uprobes: Improve scalability by reducing the contention on siglock
2024-10-21 10:43 ` Liao, Chang
@ 2024-10-21 17:18 ` Andrii Nakryiko
2024-10-22 6:18 ` Liao, Chang
0 siblings, 1 reply; 12+ messages in thread
From: Andrii Nakryiko @ 2024-10-21 17:18 UTC (permalink / raw)
To: Liao, Chang
Cc: Masami Hiramatsu, Peter Zijlstra, linux-kernel,
linux-trace-kernel, linux-perf-users, bpf, Andrii Nakryiko,
Oleg Nesterov
On Mon, Oct 21, 2024 at 3:43 AM Liao, Chang <liaochang1@huawei.com> wrote:
>
>
>
> On 2024/10/12 3:34, Andrii Nakryiko wrote:
> > On Tue, Sep 17, 2024 at 7:05 PM Liao, Chang <liaochang1@huawei.com> wrote:
> >>
> >> Hi, Peter and Masami
> >>
> >> I look forward to your input on this series. Andrii has shown that it is
> >> helpful for uprobe scalability.
> >>
> >> Thanks.
> >>
> >> On 2024/9/15 23:18, Oleg Nesterov wrote:
> >>> Hi Liao,
> >>>
> >>> On 09/14, Liao, Chang wrote:
> >>>>
> >>>> Hi, Oleg
> >>>>
> >>>> Kindly ping.
> >>>>
> >>>> This series has been pending for a month. Is there any issue I overlooked?
> >>>
> >>> Well, I have already acked both patches.
> >>>
> >>> Please resend them to Peter/Masami, with my acks included.
> >>>
> >
> > Hey Liao,
> >
> > I didn't see v4 from you for this patch set with Oleg's acks. Did you
> > get a chance to rebase, add acks, and send the latest version?
>
> Andrii,
>
> I am ready to send v4 based on the latest kernel from the next tree. That said,
> I haven't heard back from any of the maintainers except Oleg, so I'm a bit
> unsure whether I should make further changes to this series.
>
Let's just rebase to the latest tip/perf/core and resend with Oleg's
ack. Hopefully this should be enough.
> >
> >>> Oleg.
> >>>
> >>>
> >>
> >> --
> >> BR
> >> Liao, Chang
>
> --
> BR
> Liao, Chang
>
* Re: [PATCH v3 2/2] uprobes: Remove the spinlock within handle_singlestep()
2024-08-15 1:46 ` [PATCH v3 2/2] uprobes: Remove the spinlock within handle_singlestep() Liao Chang
@ 2024-10-22 4:01 ` Masami Hiramatsu
0 siblings, 0 replies; 12+ messages in thread
From: Masami Hiramatsu @ 2024-10-22 4:01 UTC (permalink / raw)
To: Liao Chang
Cc: oleg, peterz, mingo, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
linux-kernel, linux-trace-kernel, linux-perf-users, bpf
Hi Liao,
On Thu, 15 Aug 2024 01:46:29 +0000
Liao Chang <liaochang1@huawei.com> wrote:
> This patch introduces a flag to track whether TIF_SIGPENDING has been
> temporarily suppressed during the uprobe single-step. Once the single-step
> has been handled and the flag is found set, TIF_SIGPENDING can be restored
> directly without acquiring the siglock in most cases, reducing
> contention and improving overall performance.
>
> I've used the script developed by Andrii in [1] to run the benchmark. The CPU
> used was a Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@2.4GHz, running the
> kernel from the next tree plus the optimization for get_xol_insn_slot() [2].
>
> before-opt
> ----------
> uprobe-nop ( 1 cpus): 0.907 ± 0.003M/s ( 0.907M/s/cpu)
> uprobe-nop ( 2 cpus): 1.676 ± 0.008M/s ( 0.838M/s/cpu)
> uprobe-nop ( 4 cpus): 3.210 ± 0.003M/s ( 0.802M/s/cpu)
> uprobe-nop ( 8 cpus): 4.457 ± 0.003M/s ( 0.557M/s/cpu)
> uprobe-nop (16 cpus): 3.724 ± 0.011M/s ( 0.233M/s/cpu)
> uprobe-nop (32 cpus): 2.761 ± 0.003M/s ( 0.086M/s/cpu)
> uprobe-nop (64 cpus): 1.293 ± 0.015M/s ( 0.020M/s/cpu)
>
> uprobe-push ( 1 cpus): 0.883 ± 0.001M/s ( 0.883M/s/cpu)
> uprobe-push ( 2 cpus): 1.642 ± 0.005M/s ( 0.821M/s/cpu)
> uprobe-push ( 4 cpus): 3.086 ± 0.002M/s ( 0.771M/s/cpu)
> uprobe-push ( 8 cpus): 3.390 ± 0.003M/s ( 0.424M/s/cpu)
> uprobe-push (16 cpus): 2.652 ± 0.005M/s ( 0.166M/s/cpu)
> uprobe-push (32 cpus): 2.713 ± 0.005M/s ( 0.085M/s/cpu)
> uprobe-push (64 cpus): 1.313 ± 0.009M/s ( 0.021M/s/cpu)
>
> uprobe-ret ( 1 cpus): 1.774 ± 0.000M/s ( 1.774M/s/cpu)
> uprobe-ret ( 2 cpus): 3.350 ± 0.001M/s ( 1.675M/s/cpu)
> uprobe-ret ( 4 cpus): 6.604 ± 0.000M/s ( 1.651M/s/cpu)
> uprobe-ret ( 8 cpus): 6.706 ± 0.005M/s ( 0.838M/s/cpu)
> uprobe-ret (16 cpus): 5.231 ± 0.001M/s ( 0.327M/s/cpu)
> uprobe-ret (32 cpus): 5.743 ± 0.003M/s ( 0.179M/s/cpu)
> uprobe-ret (64 cpus): 4.726 ± 0.016M/s ( 0.074M/s/cpu)
>
> after-opt
> ---------
> uprobe-nop ( 1 cpus): 0.985 ± 0.002M/s ( 0.985M/s/cpu)
> uprobe-nop ( 2 cpus): 1.773 ± 0.005M/s ( 0.887M/s/cpu)
> uprobe-nop ( 4 cpus): 3.304 ± 0.001M/s ( 0.826M/s/cpu)
> uprobe-nop ( 8 cpus): 5.328 ± 0.002M/s ( 0.666M/s/cpu)
> uprobe-nop (16 cpus): 6.475 ± 0.002M/s ( 0.405M/s/cpu)
> uprobe-nop (32 cpus): 4.831 ± 0.082M/s ( 0.151M/s/cpu)
> uprobe-nop (64 cpus): 2.564 ± 0.053M/s ( 0.040M/s/cpu)
>
> uprobe-push ( 1 cpus): 0.964 ± 0.001M/s ( 0.964M/s/cpu)
> uprobe-push ( 2 cpus): 1.766 ± 0.002M/s ( 0.883M/s/cpu)
> uprobe-push ( 4 cpus): 3.290 ± 0.009M/s ( 0.823M/s/cpu)
> uprobe-push ( 8 cpus): 4.670 ± 0.002M/s ( 0.584M/s/cpu)
> uprobe-push (16 cpus): 5.197 ± 0.004M/s ( 0.325M/s/cpu)
> uprobe-push (32 cpus): 5.068 ± 0.161M/s ( 0.158M/s/cpu)
> uprobe-push (64 cpus): 2.605 ± 0.026M/s ( 0.041M/s/cpu)
>
> uprobe-ret ( 1 cpus): 1.833 ± 0.001M/s ( 1.833M/s/cpu)
> uprobe-ret ( 2 cpus): 3.384 ± 0.003M/s ( 1.692M/s/cpu)
> uprobe-ret ( 4 cpus): 6.677 ± 0.004M/s ( 1.669M/s/cpu)
> uprobe-ret ( 8 cpus): 6.854 ± 0.005M/s ( 0.857M/s/cpu)
> uprobe-ret (16 cpus): 6.508 ± 0.006M/s ( 0.407M/s/cpu)
> uprobe-ret (32 cpus): 5.793 ± 0.009M/s ( 0.181M/s/cpu)
> uprobe-ret (64 cpus): 4.743 ± 0.016M/s ( 0.074M/s/cpu)
>
> The benchmark results above demonstrate an obvious improvement in the
> scalability of trig-uprobe-nop and trig-uprobe-push, whose peak throughput
> rises from 4.457M/s to 6.475M/s and from 3.390M/s to 5.197M/s, respectively.
>
> [1] https://lore.kernel.org/all/20240731214256.3588718-1-andrii@kernel.org
> [2] https://lore.kernel.org/all/20240727094405.1362496-1-liaochang1@huawei.com
>
This looks good to me.
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Thanks,
> Acked-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> ---
> include/linux/uprobes.h | 1 +
> kernel/events/uprobes.c | 8 +++++---
> 2 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index b503fafb7fb3..e4f57117d9c3 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -75,6 +75,7 @@ struct uprobe_task {
>
> struct uprobe *active_uprobe;
> unsigned long xol_vaddr;
> + bool signal_denied;
>
> struct return_instance *return_instances;
> unsigned int depth;
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 76a51a1f51e2..589aa2af1a99 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1979,6 +1979,7 @@ bool uprobe_deny_signal(void)
> WARN_ON_ONCE(utask->state != UTASK_SSTEP);
>
> if (task_sigpending(t)) {
> + utask->signal_denied = true;
> clear_tsk_thread_flag(t, TIF_SIGPENDING);
>
> if (__fatal_signal_pending(t) || arch_uprobe_xol_was_trapped(t)) {
> @@ -2288,9 +2289,10 @@ static void handle_singlestep(struct uprobe_task *utask, struct pt_regs *regs)
> utask->state = UTASK_RUNNING;
> xol_free_insn_slot(current);
>
> - spin_lock_irq(&current->sighand->siglock);
> - recalc_sigpending(); /* see uprobe_deny_signal() */
> - spin_unlock_irq(&current->sighand->siglock);
> + if (utask->signal_denied) {
> + set_thread_flag(TIF_SIGPENDING);
> + utask->signal_denied = false;
> + }
>
> if (unlikely(err)) {
> uprobe_warn(current, "execute the probed insn, sending SIGILL.");
> --
> 2.34.1
>
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Re: [PATCH v3 1/2] uprobes: Remove redundant spinlock in uprobe_deny_signal()
2024-08-15 1:46 ` [PATCH v3 1/2] uprobes: Remove redundant spinlock in uprobe_deny_signal() Liao Chang
@ 2024-10-22 4:01 ` Masami Hiramatsu
0 siblings, 0 replies; 12+ messages in thread
From: Masami Hiramatsu @ 2024-10-22 4:01 UTC (permalink / raw)
To: Liao Chang
Cc: oleg, peterz, mingo, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
linux-kernel, linux-trace-kernel, linux-perf-users, bpf
On Thu, 15 Aug 2024 01:46:28 +0000
Liao Chang <liaochang1@huawei.com> wrote:
> Since clearing a bit in thread_info is an atomic operation, the spinlock
> is redundant and can be removed. Reducing lock contention is good for
> performance.
>
Looks good to me.
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Thanks!
> Acked-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> ---
> kernel/events/uprobes.c | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 73cc47708679..76a51a1f51e2 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1979,9 +1979,7 @@ bool uprobe_deny_signal(void)
> WARN_ON_ONCE(utask->state != UTASK_SSTEP);
>
> if (task_sigpending(t)) {
> - spin_lock_irq(&t->sighand->siglock);
> clear_tsk_thread_flag(t, TIF_SIGPENDING);
> - spin_unlock_irq(&t->sighand->siglock);
>
> if (__fatal_signal_pending(t) || arch_uprobe_xol_was_trapped(t)) {
> utask->state = UTASK_SSTEP_TRAPPED;
> --
> 2.34.1
>
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Re: [PATCH v3 0/2] uprobes: Improve scalability by reducing the contention on siglock
2024-10-21 17:18 ` Andrii Nakryiko
@ 2024-10-22 6:18 ` Liao, Chang
0 siblings, 0 replies; 12+ messages in thread
From: Liao, Chang @ 2024-10-22 6:18 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Masami Hiramatsu, Peter Zijlstra, linux-kernel,
linux-trace-kernel, linux-perf-users, bpf, Andrii Nakryiko,
Oleg Nesterov
On 2024/10/22 1:18, Andrii Nakryiko wrote:
> On Mon, Oct 21, 2024 at 3:43 AM Liao, Chang <liaochang1@huawei.com> wrote:
>>
>>
>>
>> On 2024/10/12 3:34, Andrii Nakryiko wrote:
>>> On Tue, Sep 17, 2024 at 7:05 PM Liao, Chang <liaochang1@huawei.com> wrote:
>>>>
>>>> Hi, Peter and Masami
>>>>
>>>> I look forward to your input on this series. Andrii has shown that it is
>>>> helpful for uprobe scalability.
>>>>
>>>> Thanks.
>>>>
>>>> On 2024/9/15 23:18, Oleg Nesterov wrote:
>>>>> Hi Liao,
>>>>>
>>>>> On 09/14, Liao, Chang wrote:
>>>>>>
>>>>>> Hi, Oleg
>>>>>>
>>>>>> Kindly ping.
>>>>>>
>>>>>> This series has been pending for a month. Is there any issue I overlooked?
>>>>>
>>>>> Well, I have already acked both patches.
>>>>>
>>>>> Please resend them to Peter/Masami, with my acks included.
>>>>>
>>>
>>> Hey Liao,
>>>
>>> I didn't see v4 from you for this patch set with Oleg's acks. Did you
>>> get a chance to rebase, add acks, and send the latest version?
>>
>> Andrii,
>>
>> I am ready to send v4 based on the latest kernel from the next tree. That said,
>> I haven't heard back from any of the maintainers except Oleg, so I'm a bit
>> unsure whether I should make further changes to this series.
>>
>
> Let's just rebase to the latest tip/perf/core and resend with Oleg's
> ack. Hopefully this should be enough.
OK, v4 is on the way with Masami's Acked-by.
>
>>>
>>>>> Oleg.
>>>>>
>>>>>
>>>>
>>>> --
>>>> BR
>>>> Liao, Chang
>>
>> --
>> BR
>> Liao, Chang
>>
--
BR
Liao, Chang