* [PATCH perf/core 0/4] Improve performance and scalability of uretprobes
@ 2024-12-06  0:24 Andrii Nakryiko
  2024-12-06  0:24 ` [PATCH perf/core 1/4] uprobes: simplify session consumer tracking Andrii Nakryiko
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Andrii Nakryiko @ 2024-12-06  0:24 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, mingo
  Cc: oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, liaochang1,
	kernel-team, Andrii Nakryiko

Improve performance and multi-CPU scalability of uretprobes by avoiding
a rather expensive (and somewhat limiting in terms of multi-CPU scalability)
use of kmalloc()+kfree() combo for short-lived struct return_instance, used
for keeping track of pending return uprobes.

The first few patches are preparatory, doing some internal refactoring and
setting things up for the actual struct return_instance reuse done in the
last patch of the series.

Andrii Nakryiko (4):
  uprobes: simplify session consumer tracking
  uprobes: decouple return_instance list traversal and freeing
  uprobes: ensure return_instance is detached from the list before
    freeing
  uprobes: reuse return_instances between multiple uretprobes within
    task

 include/linux/uprobes.h |  16 +++-
 kernel/events/uprobes.c | 176 +++++++++++++++++++++++++++-------------
 2 files changed, 133 insertions(+), 59 deletions(-)

-- 
2.43.5



* [PATCH perf/core 1/4] uprobes: simplify session consumer tracking
  2024-12-06  0:24 [PATCH perf/core 0/4] Improve performance and scalability of uretprobes Andrii Nakryiko
@ 2024-12-06  0:24 ` Andrii Nakryiko
  2024-12-06 14:07   ` Jiri Olsa
  2024-12-06  0:24 ` [PATCH perf/core 2/4] uprobes: decouple return_instance list traversal and freeing Andrii Nakryiko
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Andrii Nakryiko @ 2024-12-06  0:24 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, mingo
  Cc: oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, liaochang1,
	kernel-team, Andrii Nakryiko

In practice, each return_instance will typically contain either zero or
one return_consumer, depending on whether it has any uprobe session
consumer attached or not. It's highly unlikely that more than one uprobe
session consumer will be attached to any given uprobe, so there is no
need to optimize for that case. But the way we currently do memory
allocation and accounting is by pre-allocating the space for 4 session
consumers in a contiguous block of memory next to the fixed part of
struct return_instance. This is unnecessarily wasteful.

This patch changes that to keep struct return_instance fixed-sized, with
one pre-allocated return_consumer, while (in a highly unlikely scenario)
allowing for more session consumers in a separate dynamically allocated
and reallocated array.
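
Concretely, with this layout session consumer number idx is addressed as
follows (a condensed restatement of the return_consumer_find() logic in
the diff below, not new code):

	ric = likely(idx == 0) ? &ri->consumer : &ri->extra_consumers[idx - 1];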

We also simplify accounting a bit by not maintaining a separate
temporary capacity for the consumers array and, instead, relying on
krealloc() to be a no-op if the underlying memory can accommodate a
slightly bigger allocation (but again, it's a very uncommon scenario to
even have to do this reallocation).

All this gets rid of ri_size(), simplifies push_consumer() and removes
the confusing ri->consumers_cnt re-assignment, while keeping the
singular pre-allocated consumer logic contained within a few simple
preexisting helpers.

Having a fixed-size struct return_instance simplifies and speeds up the
return_instance reuse that is ultimately added later in this patch set;
see the follow-up patches.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/uprobes.h | 10 ++++--
 kernel/events/uprobes.c | 72 +++++++++++++++++++++--------------------
 2 files changed, 45 insertions(+), 37 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index e0a4c2082245..1d449978558d 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -154,12 +154,18 @@ struct return_instance {
 	unsigned long		stack;		/* stack pointer */
 	unsigned long		orig_ret_vaddr; /* original return address */
 	bool			chained;	/* true, if instance is nested */
-	int			consumers_cnt;
+	int			cons_cnt;	/* total number of session consumers */
 
 	struct return_instance	*next;		/* keep as stack */
 	struct rcu_head		rcu;
 
-	struct return_consumer	consumers[] __counted_by(consumers_cnt);
+	/* singular pre-allocated return_consumer instance for common case */
+	struct return_consumer	consumer;
+	/*
+	 * extra return_consumer instances for rare cases of multiple session consumers,
+	 * contains (cons_cnt - 1) elements
+	 */
+	struct return_consumer	*extra_consumers;
 } ____cacheline_aligned;
 
 enum rp_check {
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index daf4314961ab..6beac52239be 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1899,6 +1899,7 @@ static struct return_instance *free_ret_instance(struct return_instance *ri, boo
 		hprobe_finalize(&ri->hprobe, hstate);
 	}
 
+	kfree(ri->extra_consumers);
 	kfree_rcu(ri, rcu);
 	return next;
 }
@@ -1974,32 +1975,34 @@ static struct uprobe_task *get_utask(void)
 	return current->utask;
 }
 
-static size_t ri_size(int consumers_cnt)
-{
-	struct return_instance *ri;
-
-	return sizeof(*ri) + sizeof(ri->consumers[0]) * consumers_cnt;
-}
-
-#define DEF_CNT 4
-
 static struct return_instance *alloc_return_instance(void)
 {
 	struct return_instance *ri;
 
-	ri = kzalloc(ri_size(DEF_CNT), GFP_KERNEL);
+	ri = kzalloc(sizeof(*ri), GFP_KERNEL);
 	if (!ri)
 		return ZERO_SIZE_PTR;
 
-	ri->consumers_cnt = DEF_CNT;
 	return ri;
 }
 
 static struct return_instance *dup_return_instance(struct return_instance *old)
 {
-	size_t size = ri_size(old->consumers_cnt);
+	struct return_instance *ri;
+
+	ri = kmemdup(old, sizeof(*ri), GFP_KERNEL);
+
+	if (unlikely(old->cons_cnt > 1)) {
+		ri->extra_consumers = kmemdup(old->extra_consumers,
+					      sizeof(ri->extra_consumers[0]) * (old->cons_cnt - 1),
+					      GFP_KERNEL);
+		if (!ri->extra_consumers) {
+			kfree(ri);
+			return NULL;
+		}
+	}
 
-	return kmemdup(old, size, GFP_KERNEL);
+	return ri;
 }
 
 static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
@@ -2369,25 +2372,28 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
 	return uprobe;
 }
 
-static struct return_instance*
-push_consumer(struct return_instance *ri, int idx, __u64 id, __u64 cookie)
+static struct return_instance *push_consumer(struct return_instance *ri, __u64 id, __u64 cookie)
 {
+	struct return_consumer *ric;
+
 	if (unlikely(ri == ZERO_SIZE_PTR))
 		return ri;
 
-	if (unlikely(idx >= ri->consumers_cnt)) {
-		struct return_instance *old_ri = ri;
-
-		ri->consumers_cnt += DEF_CNT;
-		ri = krealloc(old_ri, ri_size(old_ri->consumers_cnt), GFP_KERNEL);
-		if (!ri) {
-			kfree(old_ri);
+	if (unlikely(ri->cons_cnt > 0)) {
+		ric = krealloc(ri->extra_consumers, sizeof(*ric) * ri->cons_cnt, GFP_KERNEL);
+		if (!ric) {
+			kfree(ri->extra_consumers);
+			kfree_rcu(ri, rcu);
 			return ZERO_SIZE_PTR;
 		}
+		ri->extra_consumers = ric;
 	}
 
-	ri->consumers[idx].id = id;
-	ri->consumers[idx].cookie = cookie;
+	ric = likely(ri->cons_cnt == 0) ? &ri->consumer : &ri->extra_consumers[ri->cons_cnt - 1];
+	ric->id = id;
+	ric->cookie = cookie;
+
+	ri->cons_cnt++;
 	return ri;
 }
 
@@ -2395,14 +2401,17 @@ static struct return_consumer *
 return_consumer_find(struct return_instance *ri, int *iter, int id)
 {
 	struct return_consumer *ric;
-	int idx = *iter;
+	int idx;
 
-	for (ric = &ri->consumers[idx]; idx < ri->consumers_cnt; idx++, ric++) {
+	for (idx = *iter; idx < ri->cons_cnt; idx++) {
+		ric = likely(idx == 0) ? &ri->consumer : &ri->extra_consumers[idx - 1];
 		if (ric->id == id) {
 			*iter = idx + 1;
 			return ric;
 		}
 	}
+
 	return NULL;
 }
 
@@ -2416,7 +2425,6 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
 	struct uprobe_consumer *uc;
 	bool has_consumers = false, remove = true;
 	struct return_instance *ri = NULL;
-	int push_idx = 0;
 
 	current->utask->auprobe = &uprobe->arch;
 
@@ -2441,18 +2449,12 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
 			ri = alloc_return_instance();
 
 		if (session)
-			ri = push_consumer(ri, push_idx++, uc->id, cookie);
+			ri = push_consumer(ri, uc->id, cookie);
 	}
 	current->utask->auprobe = NULL;
 
-	if (!ZERO_OR_NULL_PTR(ri)) {
-		/*
-		 * The push_idx value has the final number of return consumers,
-		 * and ri->consumers_cnt has number of allocated consumers.
-		 */
-		ri->consumers_cnt = push_idx;
+	if (!ZERO_OR_NULL_PTR(ri))
 		prepare_uretprobe(uprobe, regs, ri);
-	}
 
 	if (remove && has_consumers) {
 		down_read(&uprobe->register_rwsem);
-- 
2.43.5



* [PATCH perf/core 2/4] uprobes: decouple return_instance list traversal and freeing
  2024-12-06  0:24 [PATCH perf/core 0/4] Improve performance and scalability of uretprobes Andrii Nakryiko
  2024-12-06  0:24 ` [PATCH perf/core 1/4] uprobes: simplify session consumer tracking Andrii Nakryiko
@ 2024-12-06  0:24 ` Andrii Nakryiko
  2024-12-06  0:24 ` [PATCH perf/core 3/4] uprobes: ensure return_instance is detached from the list before freeing Andrii Nakryiko
  2024-12-06  0:24 ` [PATCH perf/core 4/4] uprobes: reuse return_instances between multiple uretprobes within task Andrii Nakryiko
  3 siblings, 0 replies; 11+ messages in thread
From: Andrii Nakryiko @ 2024-12-06  0:24 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, mingo
  Cc: oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, liaochang1,
	kernel-team, Andrii Nakryiko

free_ret_instance() has two unrelated responsibilities: actually
cleaning up return_instance's resources and freeing memory, and also
helping with utask->return_instances list traversal by returning the
next alive pointer.

There is no reason why these two aspects have to be mixed together, so
turn free_ret_instance() into a void-returning function and make callers
do list traversal on their own.
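
With that, the callers switch to the usual save-next-then-free pattern
(condensed from the uprobe_free_utask() hunk in the diff below):

	ri = utask->return_instances;
	while (ri) {
		ri_next = ri->next;
		free_ret_instance(ri, true /* cleanup_hprobe */);
		ri = ri_next;
	}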

We'll use this simplification in the next patch, which will guarantee that
a to-be-freed return_instance isn't reachable from the
utask->return_instances list.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 37 +++++++++++++++++++++----------------
 1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6beac52239be..cca1fe4a3fb1 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1888,10 +1888,8 @@ unsigned long uprobe_get_trap_addr(struct pt_regs *regs)
 	return instruction_pointer(regs);
 }
 
-static struct return_instance *free_ret_instance(struct return_instance *ri, bool cleanup_hprobe)
+static void free_ret_instance(struct return_instance *ri, bool cleanup_hprobe)
 {
-	struct return_instance *next = ri->next;
-
 	if (cleanup_hprobe) {
 		enum hprobe_state hstate;
 
@@ -1901,7 +1899,6 @@ static struct return_instance *free_ret_instance(struct return_instance *ri, boo
 
 	kfree(ri->extra_consumers);
 	kfree_rcu(ri, rcu);
-	return next;
 }
 
 /*
@@ -1911,7 +1908,7 @@ static struct return_instance *free_ret_instance(struct return_instance *ri, boo
 void uprobe_free_utask(struct task_struct *t)
 {
 	struct uprobe_task *utask = t->utask;
-	struct return_instance *ri;
+	struct return_instance *ri, *ri_next;
 
 	if (!utask)
 		return;
@@ -1921,8 +1918,11 @@ void uprobe_free_utask(struct task_struct *t)
 	timer_delete_sync(&utask->ri_timer);
 
 	ri = utask->return_instances;
-	while (ri)
-		ri = free_ret_instance(ri, true /* cleanup_hprobe */);
+	while (ri) {
+		ri_next = ri->next;
+		free_ret_instance(ri, true /* cleanup_hprobe */);
+		ri = ri_next;
+	}
 
 	kfree(utask);
 	t->utask = NULL;
@@ -2111,12 +2111,15 @@ unsigned long uprobe_get_trampoline_vaddr(void)
 static void cleanup_return_instances(struct uprobe_task *utask, bool chained,
 					struct pt_regs *regs)
 {
-	struct return_instance *ri = utask->return_instances;
+	struct return_instance *ri = utask->return_instances, *ri_next;
 	enum rp_check ctx = chained ? RP_CHECK_CHAIN_CALL : RP_CHECK_CALL;
 
 	while (ri && !arch_uretprobe_is_alive(ri, ctx, regs)) {
-		ri = free_ret_instance(ri, true /* cleanup_hprobe */);
+		ri_next = ri->next;
 		utask->depth--;
+
+		free_ret_instance(ri, true /* cleanup_hprobe */);
+		ri = ri_next;
 	}
 	rcu_assign_pointer(utask->return_instances, ri);
 }
@@ -2508,7 +2511,7 @@ static struct return_instance *find_next_ret_chain(struct return_instance *ri)
 void uprobe_handle_trampoline(struct pt_regs *regs)
 {
 	struct uprobe_task *utask;
-	struct return_instance *ri, *next;
+	struct return_instance *ri, *ri_next, *next_chain;
 	struct uprobe *uprobe;
 	enum hprobe_state hstate;
 	bool valid;
@@ -2528,8 +2531,8 @@ void uprobe_handle_trampoline(struct pt_regs *regs)
 		 * or NULL; the latter case means that nobody but ri->func
 		 * could hit this trampoline on return. TODO: sigaltstack().
 		 */
-		next = find_next_ret_chain(ri);
-		valid = !next || arch_uretprobe_is_alive(next, RP_CHECK_RET, regs);
+		next_chain = find_next_ret_chain(ri);
+		valid = !next_chain || arch_uretprobe_is_alive(next_chain, RP_CHECK_RET, regs);
 
 		instruction_pointer_set(regs, ri->orig_ret_vaddr);
 		do {
@@ -2541,7 +2544,9 @@ void uprobe_handle_trampoline(struct pt_regs *regs)
 			 * trampoline addresses on the stack are replaced with correct
 			 * original return addresses
 			 */
-			rcu_assign_pointer(utask->return_instances, ri->next);
+			ri_next = ri->next;
+			rcu_assign_pointer(utask->return_instances, ri_next);
+			utask->depth--;
 
 			uprobe = hprobe_consume(&ri->hprobe, &hstate);
 			if (valid)
@@ -2549,9 +2554,9 @@ void uprobe_handle_trampoline(struct pt_regs *regs)
 			hprobe_finalize(&ri->hprobe, hstate);
 
 			/* We already took care of hprobe, no need to waste more time on that. */
-			ri = free_ret_instance(ri, false /* !cleanup_hprobe */);
-			utask->depth--;
-		} while (ri != next);
+			free_ret_instance(ri, false /* !cleanup_hprobe */);
+			ri = ri_next;
+		} while (ri != next_chain);
 	} while (!valid);
 
 	return;
-- 
2.43.5



* [PATCH perf/core 3/4] uprobes: ensure return_instance is detached from the list before freeing
  2024-12-06  0:24 [PATCH perf/core 0/4] Improve performance and scalability of uretprobes Andrii Nakryiko
  2024-12-06  0:24 ` [PATCH perf/core 1/4] uprobes: simplify session consumer tracking Andrii Nakryiko
  2024-12-06  0:24 ` [PATCH perf/core 2/4] uprobes: decouple return_instance list traversal and freeing Andrii Nakryiko
@ 2024-12-06  0:24 ` Andrii Nakryiko
  2024-12-06  0:24 ` [PATCH perf/core 4/4] uprobes: reuse return_instances between multiple uretprobes within task Andrii Nakryiko
  3 siblings, 0 replies; 11+ messages in thread
From: Andrii Nakryiko @ 2024-12-06  0:24 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, mingo
  Cc: oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, liaochang1,
	kernel-team, Andrii Nakryiko

Ensure that by the time we call free_ret_instance() to clean up an
instance of struct return_instance it isn't reachable from
utask->return_instances anymore.

free_ret_instance() is called in a few different situations, all but one
of which are already fine w.r.t. return_instance visibility:
  - uprobe_free_utask() guarantees that ri_timer() won't be called
    (through timer_delete_sync() call), and so there is no need to
    unlink anything, because entire utask is being freed;
  - uprobe_handle_trampoline() is already unlinking to-be-freed
    return_instance with rcu_assign_pointer() before calling
    free_ret_instance().

Only cleanup_return_instances() violates this property. So far that is
not causing problems thanks to the RCU-delayed freeing of return_instance,
but we'll change that in the next patch. So make sure we unlink the
return_instance before passing it into free_ret_instance(), as otherwise
reuse would be unsafe.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index cca1fe4a3fb1..2345aeb63d3b 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2116,12 +2116,12 @@ static void cleanup_return_instances(struct uprobe_task *utask, bool chained,
 
 	while (ri && !arch_uretprobe_is_alive(ri, ctx, regs)) {
 		ri_next = ri->next;
+		rcu_assign_pointer(utask->return_instances, ri_next);
 		utask->depth--;
 
 		free_ret_instance(ri, true /* cleanup_hprobe */);
 		ri = ri_next;
 	}
-	rcu_assign_pointer(utask->return_instances, ri);
 }
 
 static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs,
-- 
2.43.5



* [PATCH perf/core 4/4] uprobes: reuse return_instances between multiple uretprobes within task
  2024-12-06  0:24 [PATCH perf/core 0/4] Improve performance and scalability of uretprobes Andrii Nakryiko
                   ` (2 preceding siblings ...)
  2024-12-06  0:24 ` [PATCH perf/core 3/4] uprobes: ensure return_instance is detached from the list before freeing Andrii Nakryiko
@ 2024-12-06  0:24 ` Andrii Nakryiko
  2024-12-06 14:07   ` Jiri Olsa
  2024-12-06 14:09   ` Jiri Olsa
  3 siblings, 2 replies; 11+ messages in thread
From: Andrii Nakryiko @ 2024-12-06  0:24 UTC (permalink / raw)
  To: linux-trace-kernel, peterz, mingo
  Cc: oleg, rostedt, mhiramat, bpf, linux-kernel, jolsa, liaochang1,
	kernel-team, Andrii Nakryiko

Instead of constantly allocating and freeing very short-lived
struct return_instance, reuse it as much as possible within current
task. For that, store a linked list of reusable return_instances within
current->utask.

The only complication is that ri_timer() might still be processing such
return_instance. And so while the main uretprobe processing logic might
be already done with return_instance and would be OK to immediately
reuse it for the next uretprobe instance, it's not correct to
unconditionally reuse it just like that.

Instead we make sure that ri_timer() can't possibly be processing it by
using seqcount_t, with ri_timer() being "a writer" and
free_ret_instance() being "a reader". If, after we unlink the return
instance from utask->return_instances list, we know that ri_timer()
hasn't gotten to processing utask->return_instances yet, then we can be
sure that immediate return_instance reuse is OK, and so we put it
onto utask->ri_pool for future (potentially, almost immediate) reuse.
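
In condensed form, the two sides of that handshake, as implemented by
the diff below, look roughly like this (a sketch, not a new API):

	/* ri_timer() side: mark the return_instances scan as in progress */
	write_seqcount_begin(&utask->ri_seqcount);
	for_each_ret_instance_rcu(ri, utask->return_instances)
		hprobe_expire(&ri->hprobe, false);
	write_seqcount_end(&utask->ri_seqcount);

	/* free_ret_instance() side: reuse ri only if no scan is in flight */
	if (raw_seqcount_try_begin(&utask->ri_seqcount, seq))
		ri_pool_push(utask, ri);	/* ri_timer() can't be looking at ri */
	else
		ri_free(ri);			/* play it safe, RCU-delayed free */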

This change shows improvements both in single CPU performance (by
avoiding relatively expensive kmalloc/free combo) and in terms of
multi-CPU scalability, where you can see that per-CPU throughput doesn't
decline as steeply with an increased number of CPUs (which was previously
attributed to kmalloc()/free() through profiling):

BASELINE (latest perf/core)
===========================
uretprobe-nop         ( 1 cpus):    1.898 ± 0.002M/s  (  1.898M/s/cpu)
uretprobe-nop         ( 2 cpus):    3.574 ± 0.011M/s  (  1.787M/s/cpu)
uretprobe-nop         ( 3 cpus):    5.279 ± 0.066M/s  (  1.760M/s/cpu)
uretprobe-nop         ( 4 cpus):    6.824 ± 0.047M/s  (  1.706M/s/cpu)
uretprobe-nop         ( 5 cpus):    8.339 ± 0.060M/s  (  1.668M/s/cpu)
uretprobe-nop         ( 6 cpus):    9.812 ± 0.047M/s  (  1.635M/s/cpu)
uretprobe-nop         ( 7 cpus):   11.030 ± 0.048M/s  (  1.576M/s/cpu)
uretprobe-nop         ( 8 cpus):   12.453 ± 0.126M/s  (  1.557M/s/cpu)
uretprobe-nop         (10 cpus):   14.838 ± 0.044M/s  (  1.484M/s/cpu)
uretprobe-nop         (12 cpus):   17.092 ± 0.115M/s  (  1.424M/s/cpu)
uretprobe-nop         (14 cpus):   19.576 ± 0.022M/s  (  1.398M/s/cpu)
uretprobe-nop         (16 cpus):   22.264 ± 0.015M/s  (  1.391M/s/cpu)
uretprobe-nop         (24 cpus):   33.534 ± 0.078M/s  (  1.397M/s/cpu)
uretprobe-nop         (32 cpus):   43.262 ± 0.127M/s  (  1.352M/s/cpu)
uretprobe-nop         (40 cpus):   53.252 ± 0.080M/s  (  1.331M/s/cpu)
uretprobe-nop         (48 cpus):   55.778 ± 0.045M/s  (  1.162M/s/cpu)
uretprobe-nop         (56 cpus):   56.850 ± 0.227M/s  (  1.015M/s/cpu)
uretprobe-nop         (64 cpus):   62.005 ± 0.077M/s  (  0.969M/s/cpu)
uretprobe-nop         (72 cpus):   66.445 ± 0.236M/s  (  0.923M/s/cpu)
uretprobe-nop         (80 cpus):   68.353 ± 0.180M/s  (  0.854M/s/cpu)

THIS PATCHSET (on top of latest perf/core)
==========================================
uretprobe-nop         ( 1 cpus):    2.253 ± 0.004M/s  (  2.253M/s/cpu)
uretprobe-nop         ( 2 cpus):    4.281 ± 0.003M/s  (  2.140M/s/cpu)
uretprobe-nop         ( 3 cpus):    6.389 ± 0.027M/s  (  2.130M/s/cpu)
uretprobe-nop         ( 4 cpus):    8.328 ± 0.005M/s  (  2.082M/s/cpu)
uretprobe-nop         ( 5 cpus):   10.353 ± 0.001M/s  (  2.071M/s/cpu)
uretprobe-nop         ( 6 cpus):   12.513 ± 0.010M/s  (  2.086M/s/cpu)
uretprobe-nop         ( 7 cpus):   14.525 ± 0.017M/s  (  2.075M/s/cpu)
uretprobe-nop         ( 8 cpus):   15.633 ± 0.013M/s  (  1.954M/s/cpu)
uretprobe-nop         (10 cpus):   19.532 ± 0.011M/s  (  1.953M/s/cpu)
uretprobe-nop         (12 cpus):   21.405 ± 0.009M/s  (  1.784M/s/cpu)
uretprobe-nop         (14 cpus):   24.857 ± 0.020M/s  (  1.776M/s/cpu)
uretprobe-nop         (16 cpus):   26.466 ± 0.018M/s  (  1.654M/s/cpu)
uretprobe-nop         (24 cpus):   40.513 ± 0.222M/s  (  1.688M/s/cpu)
uretprobe-nop         (32 cpus):   54.180 ± 0.074M/s  (  1.693M/s/cpu)
uretprobe-nop         (40 cpus):   66.100 ± 0.082M/s  (  1.652M/s/cpu)
uretprobe-nop         (48 cpus):   70.544 ± 0.068M/s  (  1.470M/s/cpu)
uretprobe-nop         (56 cpus):   74.494 ± 0.055M/s  (  1.330M/s/cpu)
uretprobe-nop         (64 cpus):   79.317 ± 0.029M/s  (  1.239M/s/cpu)
uretprobe-nop         (72 cpus):   84.875 ± 0.020M/s  (  1.179M/s/cpu)
uretprobe-nop         (80 cpus):   92.318 ± 0.224M/s  (  1.154M/s/cpu)

For reference, with uprobe-nop we hit the following throughput:

uprobe-nop            (80 cpus):  143.485 ± 0.035M/s  (  1.794M/s/cpu)

So now uretprobe stays a bit closer to that performance.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/uprobes.h |  6 ++-
 kernel/events/uprobes.c | 83 ++++++++++++++++++++++++++++++++++-------
 2 files changed, 75 insertions(+), 14 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 1d449978558d..b1df7d792fa1 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -16,6 +16,7 @@
 #include <linux/types.h>
 #include <linux/wait.h>
 #include <linux/timer.h>
+#include <linux/seqlock.h>
 
 struct uprobe;
 struct vm_area_struct;
@@ -124,6 +125,10 @@ struct uprobe_task {
 	unsigned int			depth;
 	struct return_instance		*return_instances;
 
+	struct return_instance		*ri_pool;
+	struct timer_list		ri_timer;
+	seqcount_t			ri_seqcount;
+
 	union {
 		struct {
 			struct arch_uprobe_task	autask;
@@ -137,7 +142,6 @@ struct uprobe_task {
 	};
 
 	struct uprobe			*active_uprobe;
-	struct timer_list		ri_timer;
 	unsigned long			xol_vaddr;
 
 	struct arch_uprobe              *auprobe;
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 2345aeb63d3b..1af950208c2b 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1888,8 +1888,34 @@ unsigned long uprobe_get_trap_addr(struct pt_regs *regs)
 	return instruction_pointer(regs);
 }
 
-static void free_ret_instance(struct return_instance *ri, bool cleanup_hprobe)
+static void ri_pool_push(struct uprobe_task *utask, struct return_instance *ri)
 {
+	ri->cons_cnt = 0;
+	ri->next = utask->ri_pool;
+	utask->ri_pool = ri;
+}
+
+static struct return_instance *ri_pool_pop(struct uprobe_task *utask)
+{
+	struct return_instance *ri = utask->ri_pool;
+
+	if (likely(ri))
+		utask->ri_pool = ri->next;
+
+	return ri;
+}
+
+static void ri_free(struct return_instance *ri)
+{
+	kfree(ri->extra_consumers);
+	kfree_rcu(ri, rcu);
+}
+
+static void free_ret_instance(struct uprobe_task *utask,
+			      struct return_instance *ri, bool cleanup_hprobe)
+{
+	unsigned seq;
+
 	if (cleanup_hprobe) {
 		enum hprobe_state hstate;
 
@@ -1897,8 +1923,22 @@ static void free_ret_instance(struct return_instance *ri, bool cleanup_hprobe)
 		hprobe_finalize(&ri->hprobe, hstate);
 	}
 
-	kfree(ri->extra_consumers);
-	kfree_rcu(ri, rcu);
+	/*
+	 * At this point return_instance is unlinked from utask's
+	 * return_instances list and this has become visible to ri_timer().
+	 * If seqcount now indicates that ri_timer's return instance
+	 * processing loop isn't active, we can return ri into the pool of
+	 * to-be-reused return instances for future uretprobes. If ri_timer()
+	 * happens to be running right now, though, we fallback to safety and
+	 * just perform RCU-delayed freeing of ri.
+	 */
+	if (raw_seqcount_try_begin(&utask->ri_seqcount, seq)) {
+		/* immediate reuse of ri without RCU GP is OK */
+		ri_pool_push(utask, ri);
+	} else {
+		/* we might be racing with ri_timer(), so play it safe */
+		ri_free(ri);
+	}
 }
 
 /*
@@ -1920,7 +1960,15 @@ void uprobe_free_utask(struct task_struct *t)
 	ri = utask->return_instances;
 	while (ri) {
 		ri_next = ri->next;
-		free_ret_instance(ri, true /* cleanup_hprobe */);
+		free_ret_instance(utask, ri, true /* cleanup_hprobe */);
+		ri = ri_next;
+	}
+
+	/* free_ret_instance() above might add to ri_pool, so this loop should come last */
+	ri = utask->ri_pool;
+	while (ri) {
+		ri_next = ri->next;
+		ri_free(ri);
 		ri = ri_next;
 	}
 
@@ -1943,8 +1991,12 @@ static void ri_timer(struct timer_list *timer)
 	/* RCU protects return_instance from freeing. */
 	guard(rcu)();
 
+	write_seqcount_begin(&utask->ri_seqcount);
+
 	for_each_ret_instance_rcu(ri, utask->return_instances)
 		hprobe_expire(&ri->hprobe, false);
+
+	write_seqcount_end(&utask->ri_seqcount);
 }
 
 static struct uprobe_task *alloc_utask(void)
@@ -1956,6 +2008,7 @@ static struct uprobe_task *alloc_utask(void)
 		return NULL;
 
 	timer_setup(&utask->ri_timer, ri_timer, 0);
+	seqcount_init(&utask->ri_seqcount);
 
 	return utask;
 }
@@ -1975,10 +2028,14 @@ static struct uprobe_task *get_utask(void)
 	return current->utask;
 }
 
-static struct return_instance *alloc_return_instance(void)
+static struct return_instance *alloc_return_instance(struct uprobe_task *utask)
 {
 	struct return_instance *ri;
 
+	ri = ri_pool_pop(utask);
+	if (ri)
+		return ri;
+
 	ri = kzalloc(sizeof(*ri), GFP_KERNEL);
 	if (!ri)
 		return ZERO_SIZE_PTR;
@@ -2119,7 +2176,7 @@ static void cleanup_return_instances(struct uprobe_task *utask, bool chained,
 		rcu_assign_pointer(utask->return_instances, ri_next);
 		utask->depth--;
 
-		free_ret_instance(ri, true /* cleanup_hprobe */);
+		free_ret_instance(utask, ri, true /* cleanup_hprobe */);
 		ri = ri_next;
 	}
 }
@@ -2186,7 +2243,7 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs,
 
 	return;
 free:
-	kfree(ri);
+	ri_free(ri);
 }
 
 /* Prepare to single-step probed instruction out of line. */
@@ -2385,8 +2442,7 @@ static struct return_instance *push_consumer(struct return_instance *ri, __u64 i
 	if (unlikely(ri->cons_cnt > 0)) {
 		ric = krealloc(ri->extra_consumers, sizeof(*ric) * ri->cons_cnt, GFP_KERNEL);
 		if (!ric) {
-			kfree(ri->extra_consumers);
-			kfree_rcu(ri, rcu);
+			ri_free(ri);
 			return ZERO_SIZE_PTR;
 		}
 		ri->extra_consumers = ric;
@@ -2428,8 +2484,9 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
 	struct uprobe_consumer *uc;
 	bool has_consumers = false, remove = true;
 	struct return_instance *ri = NULL;
+	struct uprobe_task *utask = current->utask;
 
-	current->utask->auprobe = &uprobe->arch;
+	utask->auprobe = &uprobe->arch;
 
 	list_for_each_entry_rcu(uc, &uprobe->consumers, cons_node, rcu_read_lock_trace_held()) {
 		bool session = uc->handler && uc->ret_handler;
@@ -2449,12 +2506,12 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
 			continue;
 
 		if (!ri)
-			ri = alloc_return_instance();
+			ri = alloc_return_instance(utask);
 
 		if (session)
 			ri = push_consumer(ri, uc->id, cookie);
 	}
-	current->utask->auprobe = NULL;
+	utask->auprobe = NULL;
 
 	if (!ZERO_OR_NULL_PTR(ri))
 		prepare_uretprobe(uprobe, regs, ri);
@@ -2554,7 +2611,7 @@ void uprobe_handle_trampoline(struct pt_regs *regs)
 			hprobe_finalize(&ri->hprobe, hstate);
 
 			/* We already took care of hprobe, no need to waste more time on that. */
-			free_ret_instance(ri, false /* !cleanup_hprobe */);
+			free_ret_instance(utask, ri, false /* !cleanup_hprobe */);
 			ri = ri_next;
 		} while (ri != next_chain);
 	} while (!valid);
-- 
2.43.5



* Re: [PATCH perf/core 4/4] uprobes: reuse return_instances between multiple uretprobes within task
  2024-12-06  0:24 ` [PATCH perf/core 4/4] uprobes: reuse return_instances between multiple uretprobes within task Andrii Nakryiko
@ 2024-12-06 14:07   ` Jiri Olsa
  2024-12-06 18:00     ` Andrii Nakryiko
  2024-12-06 14:09   ` Jiri Olsa
  1 sibling, 1 reply; 11+ messages in thread
From: Jiri Olsa @ 2024-12-06 14:07 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, peterz, mingo, oleg, rostedt, mhiramat, bpf,
	linux-kernel, liaochang1, kernel-team

On Thu, Dec 05, 2024 at 04:24:17PM -0800, Andrii Nakryiko wrote:

SNIP

> +static void free_ret_instance(struct uprobe_task *utask,
> +			      struct return_instance *ri, bool cleanup_hprobe)
> +{
> +	unsigned seq;
> +
>  	if (cleanup_hprobe) {
>  		enum hprobe_state hstate;
>  
> @@ -1897,8 +1923,22 @@ static void free_ret_instance(struct return_instance *ri, bool cleanup_hprobe)
>  		hprobe_finalize(&ri->hprobe, hstate);
>  	}
>  
> -	kfree(ri->extra_consumers);
> -	kfree_rcu(ri, rcu);
> +	/*
> +	 * At this point return_instance is unlinked from utask's
> +	 * return_instances list and this has become visible to ri_timer().
> +	 * If seqcount now indicates that ri_timer's return instance
> +	 * processing loop isn't active, we can return ri into the pool of
> +	 * to-be-reused return instances for future uretprobes. If ri_timer()
> +	 * happens to be running right now, though, we fallback to safety and
> +	 * just perform RCU-delayed freeing of ri.
> +	 */
> +	if (raw_seqcount_try_begin(&utask->ri_seqcount, seq)) {
> +		/* immediate reuse of ri without RCU GP is OK */
> +		ri_pool_push(utask, ri);

should the push be limited somehow? I wonder if you could make uprobes/consumers
setup that would allocate/push many of ri instances that would not be freed
until the process exits?

jirka

> +	} else {
> +		/* we might be racing with ri_timer(), so play it safe */
> +		ri_free(ri);
> +	}
>  }
>  
>  /*
> @@ -1920,7 +1960,15 @@ void uprobe_free_utask(struct task_struct *t)
>  	ri = utask->return_instances;
>  	while (ri) {
>  		ri_next = ri->next;
> -		free_ret_instance(ri, true /* cleanup_hprobe */);
> +		free_ret_instance(utask, ri, true /* cleanup_hprobe */);
> +		ri = ri_next;
> +	}
> +
> +	/* free_ret_instance() above might add to ri_pool, so this loop should come last */
> +	ri = utask->ri_pool;
> +	while (ri) {
> +		ri_next = ri->next;
> +		ri_free(ri);
>  		ri = ri_next;
>  	}
>  
> @@ -1943,8 +1991,12 @@ static void ri_timer(struct timer_list *timer)
>  	/* RCU protects return_instance from freeing. */
>  	guard(rcu)();
>  
> +	write_seqcount_begin(&utask->ri_seqcount);
> +
>  	for_each_ret_instance_rcu(ri, utask->return_instances)
>  		hprobe_expire(&ri->hprobe, false);
> +
> +	write_seqcount_end(&utask->ri_seqcount);
>  }
>  
>  static struct uprobe_task *alloc_utask(void)
> @@ -1956,6 +2008,7 @@ static struct uprobe_task *alloc_utask(void)
>  		return NULL;
>  
>  	timer_setup(&utask->ri_timer, ri_timer, 0);
> +	seqcount_init(&utask->ri_seqcount);
>  
>  	return utask;
>  }
> @@ -1975,10 +2028,14 @@ static struct uprobe_task *get_utask(void)
>  	return current->utask;
>  }
>  
> -static struct return_instance *alloc_return_instance(void)
> +static struct return_instance *alloc_return_instance(struct uprobe_task *utask)
>  {
>  	struct return_instance *ri;
>  
> +	ri = ri_pool_pop(utask);
> +	if (ri)
> +		return ri;
> +
>  	ri = kzalloc(sizeof(*ri), GFP_KERNEL);
>  	if (!ri)
>  		return ZERO_SIZE_PTR;
> @@ -2119,7 +2176,7 @@ static void cleanup_return_instances(struct uprobe_task *utask, bool chained,
>  		rcu_assign_pointer(utask->return_instances, ri_next);
>  		utask->depth--;
>  
> -		free_ret_instance(ri, true /* cleanup_hprobe */);
> +		free_ret_instance(utask, ri, true /* cleanup_hprobe */);
>  		ri = ri_next;
>  	}
>  }
> @@ -2186,7 +2243,7 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs,
>  
>  	return;
>  free:
> -	kfree(ri);
> +	ri_free(ri);
>  }
>  
>  /* Prepare to single-step probed instruction out of line. */
> @@ -2385,8 +2442,7 @@ static struct return_instance *push_consumer(struct return_instance *ri, __u64 i
>  	if (unlikely(ri->cons_cnt > 0)) {
>  		ric = krealloc(ri->extra_consumers, sizeof(*ric) * ri->cons_cnt, GFP_KERNEL);
>  		if (!ric) {
> -			kfree(ri->extra_consumers);
> -			kfree_rcu(ri, rcu);
> +			ri_free(ri);
>  			return ZERO_SIZE_PTR;
>  		}
>  		ri->extra_consumers = ric;
> @@ -2428,8 +2484,9 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
>  	struct uprobe_consumer *uc;
>  	bool has_consumers = false, remove = true;
>  	struct return_instance *ri = NULL;
> +	struct uprobe_task *utask = current->utask;
>  
> -	current->utask->auprobe = &uprobe->arch;
> +	utask->auprobe = &uprobe->arch;
>  
>  	list_for_each_entry_rcu(uc, &uprobe->consumers, cons_node, rcu_read_lock_trace_held()) {
>  		bool session = uc->handler && uc->ret_handler;
> @@ -2449,12 +2506,12 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
>  			continue;
>  
>  		if (!ri)
> -			ri = alloc_return_instance();
> +			ri = alloc_return_instance(utask);
>  
>  		if (session)
>  			ri = push_consumer(ri, uc->id, cookie);
>  	}
> -	current->utask->auprobe = NULL;
> +	utask->auprobe = NULL;
>  
>  	if (!ZERO_OR_NULL_PTR(ri))
>  		prepare_uretprobe(uprobe, regs, ri);
> @@ -2554,7 +2611,7 @@ void uprobe_handle_trampoline(struct pt_regs *regs)
>  			hprobe_finalize(&ri->hprobe, hstate);
>  
>  			/* We already took care of hprobe, no need to waste more time on that. */
> -			free_ret_instance(ri, false /* !cleanup_hprobe */);
> +			free_ret_instance(utask, ri, false /* !cleanup_hprobe */);
>  			ri = ri_next;
>  		} while (ri != next_chain);
>  	} while (!valid);
> -- 
> 2.43.5
> 


* Re: [PATCH perf/core 1/4] uprobes: simplify session consumer tracking
  2024-12-06  0:24 ` [PATCH perf/core 1/4] uprobes: simplify session consumer tracking Andrii Nakryiko
@ 2024-12-06 14:07   ` Jiri Olsa
  2024-12-06 17:50     ` Andrii Nakryiko
  0 siblings, 1 reply; 11+ messages in thread
From: Jiri Olsa @ 2024-12-06 14:07 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, peterz, mingo, oleg, rostedt, mhiramat, bpf,
	linux-kernel, liaochang1, kernel-team

On Thu, Dec 05, 2024 at 04:24:14PM -0800, Andrii Nakryiko wrote:

SNIP

>  static struct return_instance *alloc_return_instance(void)
>  {
>  	struct return_instance *ri;
>  
> -	ri = kzalloc(ri_size(DEF_CNT), GFP_KERNEL);
> +	ri = kzalloc(sizeof(*ri), GFP_KERNEL);
>  	if (!ri)
>  		return ZERO_SIZE_PTR;
>  
> -	ri->consumers_cnt = DEF_CNT;
>  	return ri;
>  }
>  
>  static struct return_instance *dup_return_instance(struct return_instance *old)
>  {
> -	size_t size = ri_size(old->consumers_cnt);
> +	struct return_instance *ri;
> +
> +	ri = kmemdup(old, sizeof(*ri), GFP_KERNEL);

missing ri == NULL check

jirka

> +
> +	if (unlikely(old->cons_cnt > 1)) {
> +		ri->extra_consumers = kmemdup(old->extra_consumers,
> +					      sizeof(ri->extra_consumers[0]) * (old->cons_cnt - 1),
> +					      GFP_KERNEL);
> +		if (!ri->extra_consumers) {
> +			kfree(ri);
> +			return NULL;
> +		}
> +	}
>  
> -	return kmemdup(old, size, GFP_KERNEL);
> +	return ri;
>  }
>  
>  static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
> @@ -2369,25 +2372,28 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
>  	return uprobe;
>  }
>  
> -static struct return_instance*

SNIP


* Re: [PATCH perf/core 4/4] uprobes: reuse return_instances between multiple uretprobes within task
  2024-12-06  0:24 ` [PATCH perf/core 4/4] uprobes: reuse return_instances between multiple uretprobes within task Andrii Nakryiko
  2024-12-06 14:07   ` Jiri Olsa
@ 2024-12-06 14:09   ` Jiri Olsa
  1 sibling, 0 replies; 11+ messages in thread
From: Jiri Olsa @ 2024-12-06 14:09 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, peterz, mingo, oleg, rostedt, mhiramat, bpf,
	linux-kernel, liaochang1, kernel-team

On Thu, Dec 05, 2024 at 04:24:17PM -0800, Andrii Nakryiko wrote:
> Instead of constantly allocating and freeing very short-lived
> struct return_instance, reuse it as much as possible within current
> task. For that, store a linked list of reusable return_instances within
> current->utask.
> 
> The only complication is that ri_timer() might still be processing such
> return_instance. And so while the main uretprobe processing logic might
> be already done with return_instance and would be OK to immediately
> reuse it for the next uretprobe instance, it's not correct to
> unconditionally reuse it just like that.
> 
> Instead we make sure that ri_timer() can't possibly be processing it by
> using seqcount_t, with ri_timer() being "a writer" and
> free_ret_instance() being "a reader". If, after we unlink the return
> instance from utask->return_instances list, we know that ri_timer()
> hasn't gotten to processing utask->return_instances yet, then we can be
> sure that immediate return_instance reuse is OK, and so we put it
> onto utask->ri_pool for future (potentially, almost immediate) reuse.
> 
> This change shows improvements both in single CPU performance (by
> avoiding relatively expensive kmalloc/free combo) and in terms of
> multi-CPU scalability, where you can see that per-CPU throughput doesn't
> decline as steeply with an increased number of CPUs (which was previously
> attributed to kmalloc()/free() through profiling):
> 
> BASELINE (latest perf/core)
> ===========================
> uretprobe-nop         ( 1 cpus):    1.898 ± 0.002M/s  (  1.898M/s/cpu)
> uretprobe-nop         ( 2 cpus):    3.574 ± 0.011M/s  (  1.787M/s/cpu)
> uretprobe-nop         ( 3 cpus):    5.279 ± 0.066M/s  (  1.760M/s/cpu)
> uretprobe-nop         ( 4 cpus):    6.824 ± 0.047M/s  (  1.706M/s/cpu)
> uretprobe-nop         ( 5 cpus):    8.339 ± 0.060M/s  (  1.668M/s/cpu)
> uretprobe-nop         ( 6 cpus):    9.812 ± 0.047M/s  (  1.635M/s/cpu)
> uretprobe-nop         ( 7 cpus):   11.030 ± 0.048M/s  (  1.576M/s/cpu)
> uretprobe-nop         ( 8 cpus):   12.453 ± 0.126M/s  (  1.557M/s/cpu)
> uretprobe-nop         (10 cpus):   14.838 ± 0.044M/s  (  1.484M/s/cpu)
> uretprobe-nop         (12 cpus):   17.092 ± 0.115M/s  (  1.424M/s/cpu)
> uretprobe-nop         (14 cpus):   19.576 ± 0.022M/s  (  1.398M/s/cpu)
> uretprobe-nop         (16 cpus):   22.264 ± 0.015M/s  (  1.391M/s/cpu)
> uretprobe-nop         (24 cpus):   33.534 ± 0.078M/s  (  1.397M/s/cpu)
> uretprobe-nop         (32 cpus):   43.262 ± 0.127M/s  (  1.352M/s/cpu)
> uretprobe-nop         (40 cpus):   53.252 ± 0.080M/s  (  1.331M/s/cpu)
> uretprobe-nop         (48 cpus):   55.778 ± 0.045M/s  (  1.162M/s/cpu)
> uretprobe-nop         (56 cpus):   56.850 ± 0.227M/s  (  1.015M/s/cpu)
> uretprobe-nop         (64 cpus):   62.005 ± 0.077M/s  (  0.969M/s/cpu)
> uretprobe-nop         (72 cpus):   66.445 ± 0.236M/s  (  0.923M/s/cpu)
> uretprobe-nop         (80 cpus):   68.353 ± 0.180M/s  (  0.854M/s/cpu)
> 
> THIS PATCHSET (on top of latest perf/core)
> ==========================================
> uretprobe-nop         ( 1 cpus):    2.253 ± 0.004M/s  (  2.253M/s/cpu)
> uretprobe-nop         ( 2 cpus):    4.281 ± 0.003M/s  (  2.140M/s/cpu)
> uretprobe-nop         ( 3 cpus):    6.389 ± 0.027M/s  (  2.130M/s/cpu)
> uretprobe-nop         ( 4 cpus):    8.328 ± 0.005M/s  (  2.082M/s/cpu)
> uretprobe-nop         ( 5 cpus):   10.353 ± 0.001M/s  (  2.071M/s/cpu)
> uretprobe-nop         ( 6 cpus):   12.513 ± 0.010M/s  (  2.086M/s/cpu)
> uretprobe-nop         ( 7 cpus):   14.525 ± 0.017M/s  (  2.075M/s/cpu)
> uretprobe-nop         ( 8 cpus):   15.633 ± 0.013M/s  (  1.954M/s/cpu)
> uretprobe-nop         (10 cpus):   19.532 ± 0.011M/s  (  1.953M/s/cpu)
> uretprobe-nop         (12 cpus):   21.405 ± 0.009M/s  (  1.784M/s/cpu)
> uretprobe-nop         (14 cpus):   24.857 ± 0.020M/s  (  1.776M/s/cpu)
> uretprobe-nop         (16 cpus):   26.466 ± 0.018M/s  (  1.654M/s/cpu)
> uretprobe-nop         (24 cpus):   40.513 ± 0.222M/s  (  1.688M/s/cpu)
> uretprobe-nop         (32 cpus):   54.180 ± 0.074M/s  (  1.693M/s/cpu)
> uretprobe-nop         (40 cpus):   66.100 ± 0.082M/s  (  1.652M/s/cpu)
> uretprobe-nop         (48 cpus):   70.544 ± 0.068M/s  (  1.470M/s/cpu)
> uretprobe-nop         (56 cpus):   74.494 ± 0.055M/s  (  1.330M/s/cpu)
> uretprobe-nop         (64 cpus):   79.317 ± 0.029M/s  (  1.239M/s/cpu)
> uretprobe-nop         (72 cpus):   84.875 ± 0.020M/s  (  1.179M/s/cpu)
> uretprobe-nop         (80 cpus):   92.318 ± 0.224M/s  (  1.154M/s/cpu)

nice! left a few comments but overall lgtm

thanks,
jirka


* Re: [PATCH perf/core 1/4] uprobes: simplify session consumer tracking
  2024-12-06 14:07   ` Jiri Olsa
@ 2024-12-06 17:50     ` Andrii Nakryiko
  0 siblings, 0 replies; 11+ messages in thread
From: Andrii Nakryiko @ 2024-12-06 17:50 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Andrii Nakryiko, linux-trace-kernel, peterz, mingo, oleg, rostedt,
	mhiramat, bpf, linux-kernel, liaochang1, kernel-team

On Fri, Dec 6, 2024 at 6:07 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Thu, Dec 05, 2024 at 04:24:14PM -0800, Andrii Nakryiko wrote:
>
> SNIP
>
> >  static struct return_instance *alloc_return_instance(void)
> >  {
> >       struct return_instance *ri;
> >
> > -     ri = kzalloc(ri_size(DEF_CNT), GFP_KERNEL);
> > +     ri = kzalloc(sizeof(*ri), GFP_KERNEL);
> >       if (!ri)
> >               return ZERO_SIZE_PTR;
> >
> > -     ri->consumers_cnt = DEF_CNT;
> >       return ri;
> >  }
> >
> >  static struct return_instance *dup_return_instance(struct return_instance *old)
> >  {
> > -     size_t size = ri_size(old->consumers_cnt);
> > +     struct return_instance *ri;
> > +
> > +     ri = kmemdup(old, sizeof(*ri), GFP_KERNEL);
>
> missing ri == NULL check
>

Doh, of course, sorry, my stupid mistake. I'll send a follow up fix.

> jirka
>
> > +
> > +     if (unlikely(old->cons_cnt > 1)) {
> > +             ri->extra_consumers = kmemdup(old->extra_consumers,
> > +                                           sizeof(ri->extra_consumers[0]) * (old->cons_cnt - 1),
> > +                                           GFP_KERNEL);
> > +             if (!ri->extra_consumers) {
> > +                     kfree(ri);
> > +                     return NULL;
> > +             }
> > +     }
> >
> > -     return kmemdup(old, size, GFP_KERNEL);
> > +     return ri;
> >  }
> >
> >  static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
> > @@ -2369,25 +2372,28 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
> >       return uprobe;
> >  }
> >
> > -static struct return_instance*
>
> SNIP


* Re: [PATCH perf/core 4/4] uprobes: reuse return_instances between multiple uretprobes within task
  2024-12-06 14:07   ` Jiri Olsa
@ 2024-12-06 18:00     ` Andrii Nakryiko
  2024-12-07  0:36       ` Jiri Olsa
  0 siblings, 1 reply; 11+ messages in thread
From: Andrii Nakryiko @ 2024-12-06 18:00 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Andrii Nakryiko, linux-trace-kernel, peterz, mingo, oleg, rostedt,
	mhiramat, bpf, linux-kernel, liaochang1, kernel-team

On Fri, Dec 6, 2024 at 6:07 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Thu, Dec 05, 2024 at 04:24:17PM -0800, Andrii Nakryiko wrote:
>
> SNIP
>
> > +static void free_ret_instance(struct uprobe_task *utask,
> > +                           struct return_instance *ri, bool cleanup_hprobe)
> > +{
> > +     unsigned seq;
> > +
> >       if (cleanup_hprobe) {
> >               enum hprobe_state hstate;
> >
> > @@ -1897,8 +1923,22 @@ static void free_ret_instance(struct return_instance *ri, bool cleanup_hprobe)
> >               hprobe_finalize(&ri->hprobe, hstate);
> >       }
> >
> > -     kfree(ri->extra_consumers);
> > -     kfree_rcu(ri, rcu);
> > +     /*
> > +      * At this point return_instance is unlinked from utask's
> > +      * return_instances list and this has become visible to ri_timer().
> > +      * If seqcount now indicates that ri_timer's return instance
> > +      * processing loop isn't active, we can return ri into the pool of
> > +      * to-be-reused return instances for future uretprobes. If ri_timer()
> > +      * happens to be running right now, though, we fallback to safety and
> > +      * just perform RCU-delated freeing of ri.
> > +      */
> > +     if (raw_seqcount_try_begin(&utask->ri_seqcount, seq)) {
> > +             /* immediate reuse of ri without RCU GP is OK */
> > +             ri_pool_push(utask, ri);
>
> should the push be limited somehow? I wonder if you could make uprobes/consumers
> setup that would allocate/push many of ri instances that would not be freed
> until the process exits?

So I'm just relying on the existing MAX_URETPROBE_DEPTH limit that is
enforced by prepare_uretprobe anyways. But yes, we can have up to 64
instances in ri_pool.

I did consider cleaning this up from ri_timer() (that would be a nice
property, because ri_timer fires after 100ms of inactivity), and my
initial version did use a lockless llist for that, but there is a bit of
a problem: llist doesn't support popping a single item off the list in a
lockless way (you can only atomically take *all* of the items). So
my implementation had to swap the entire list, take one element out of
it, and then put N - 1 items back. Which, when there are deep chains
of uretprobes, would be quite an unnecessary CPU overhead. And I
clearly didn't want to add locking anywhere in this hot path, of
course.
So I figured that at the absolute worst case we'll just keep
MAX_URETPROBE_DEPTH items in ri_pool until the task dies. That's not
that much memory for a small subset of tasks on the system.

One more idea I explored and rejected was to limit the size of ri_pool
to something smaller than MAX_URETPROBE_DEPTH, say just 16. But then
there is a corner case of high-frequency long chain of uretprobes up
to 64 depth, then returning through all of them, and then going into
the same set of functions again, up to 64. So depth oscillates between
0 and full 64. In this case this ri_pool will be causing allocation
for the majority of those invocations, completely defeating the
purpose.

So, in the end, it felt like 64 cached instances (worst case, if we
actually ever reached such a deep chain) would be acceptable.
Especially since, commonly, I wouldn't expect more than 3-4, actually.

WDYT?

>
> jirka
>
> > +     } else {
> > +             /* we might be racing with ri_timer(), so play it safe */
> > +             ri_free(ri);
> > +     }
> >  }
> >
> >  /*

[...]


* Re: [PATCH perf/core 4/4] uprobes: reuse return_instances between multiple uretprobes within task
  2024-12-06 18:00     ` Andrii Nakryiko
@ 2024-12-07  0:36       ` Jiri Olsa
  0 siblings, 0 replies; 11+ messages in thread
From: Jiri Olsa @ 2024-12-07  0:36 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jiri Olsa, Andrii Nakryiko, linux-trace-kernel, peterz, mingo,
	oleg, rostedt, mhiramat, bpf, linux-kernel, liaochang1,
	kernel-team

On Fri, Dec 06, 2024 at 10:00:16AM -0800, Andrii Nakryiko wrote:
> On Fri, Dec 6, 2024 at 6:07 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Thu, Dec 05, 2024 at 04:24:17PM -0800, Andrii Nakryiko wrote:
> >
> > SNIP
> >
> > > +static void free_ret_instance(struct uprobe_task *utask,
> > > +                           struct return_instance *ri, bool cleanup_hprobe)
> > > +{
> > > +     unsigned seq;
> > > +
> > >       if (cleanup_hprobe) {
> > >               enum hprobe_state hstate;
> > >
> > > @@ -1897,8 +1923,22 @@ static void free_ret_instance(struct return_instance *ri, bool cleanup_hprobe)
> > >               hprobe_finalize(&ri->hprobe, hstate);
> > >       }
> > >
> > > -     kfree(ri->extra_consumers);
> > > -     kfree_rcu(ri, rcu);
> > > +     /*
> > > +      * At this point return_instance is unlinked from utask's
> > > +      * return_instances list and this has become visible to ri_timer().
> > > +      * If seqcount now indicates that ri_timer's return instance
> > > +      * processing loop isn't active, we can return ri into the pool of
> > > +      * to-be-reused return instances for future uretprobes. If ri_timer()
> > > +      * happens to be running right now, though, we fallback to safety and
> > > +      * just perform RCU-delated freeing of ri.
> > > +      * just perform RCU-delayed freeing of ri.
> > > +     if (raw_seqcount_try_begin(&utask->ri_seqcount, seq)) {
> > > +             /* immediate reuse of ri without RCU GP is OK */
> > > +             ri_pool_push(utask, ri);
> >
> > should the push be limited somehow? I wonder if you could make uprobes/consumers
> > setup that would allocate/push many of ri instances that would not be freed
> > until the process exits?
> 
> So I'm just relying on the existing MAX_URETPROBE_DEPTH limit that is
> enforced by prepare_uretprobe anyways. But yes, we can have up to 64
> instances in ri_pool.
> 
> I did consider cleaning this up from ri_timer() (that would be a nice
> properly, because ri_timer fires after 100ms of inactivity), and my
> initial version did use lockless llist for that, but there is a bit of
> a problem: llist doesn't support popping single iter from the list
> (you can only atomically take *all* of the items) in lockless way. So
> my implementation had to swap the entire list, take one element out of
> it, and then put N - 1 items back. Which, when there are deep chains
> of uretprobes, would be quite an unnecessary CPU overhead. And I
> clearly didn't want to add locking anywhere in this hot path, of
> course.
> 
> So I figured that at the absolute worst case we'll just keep
> MAX_URETPROBE_DEPTH items in ri_pool until the task dies. That's not
> that much memory for a small subset of tasks on the system.
> 
> One more idea I explored and rejected was to limit the size of ri_pool
> to something smaller than MAX_URETPROBE_DEPTH, say just 16. But then
> there is a corner case of high-frequency long chain of uretprobes up
> to 64 depth, then returning through all of them, and then going into
> the same set of functions again, up to 64. So depth oscillates between
> 0 and full 64. In this case this ri_pool will be causing allocation
> for the majority of those invocations, completely defeating the
> purpose.
> 
> So, in the end, it felt like 64 cached instances (worst case, if we
> actually ever reached such a deep chain) would be acceptable.
> Especially that commonly I wouldn't expect more than 3-4, actually.
> 
> WDYT?

ah ok, there's the MAX_URETPROBE_DEPTH limit per task, 64 should be fine

thanks,
jirka

> 
> >
> > jirka
> >
> > > +     } else {
> > > +             /* we might be racing with ri_timer(), so play it safe */
> > > +             ri_free(ri);
> > > +     }
> > >  }
> > >
> > >  /*
> 
> [...]
