linux-trace-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
@ 2024-06-25  0:21 Andrii Nakryiko
  2024-06-25  0:21 ` [PATCH 01/12] uprobes: update outdated comment Andrii Nakryiko
                   ` (11 more replies)
  0 siblings, 12 replies; 42+ messages in thread
From: Andrii Nakryiko @ 2024-06-25  0:21 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

This patch set, ultimately, switches global uprobes_treelock from RW spinlock
to per-CPU RW semaphore, which has better performance and scales better under
contention and multiple parallel threads triggering lots of uprobes.

To make this work well with attaching multiple uprobes (through BPF
multi-uprobe), we need to add batched versions of uprobe register/unregister
APIs. This is what most of the patch set is actually doing. The actual switch
to per-CPU RW semaphore is trivial after that and is done in the very last
patch #12. See commit message with some comparison numbers.

Patch #4 is probably the most important patch in the series, revamping uprobe
lifetime management and refcounting. See patch description and added code
comments for all the details.

With changes in patch #4, we open up the way to refactor uprobe_register() and
uprobe_unregister() implementations in such a way that we can avoid taking
uprobes_treelock many times during a single batched attachment/detachment.
This allows to accommodate a much higher latency of taking per-CPU RW
semaphore for write. The end result of this patch set is that attaching 50
thousand uprobes with BPF multi-uprobes doesn't regress and takes about 200ms
both before and after the changes in this patch set.

Patch #5 updates existing uprobe consumers to put all the relevant necessary
pieces into struct uprobe_consumer, without having to pass around
offset/ref_ctr_offset. Existing consumers already keep this data around, we
just formalize the interface.

Patches #6 through #10 add batched versions of register/unregister APIs and
gradually factor them in such a way as to allow taking single (batched)
uprobes_treelock, splitting the logic into multiple independent phases.

Patch #11 switched BPF multi-uprobes to batched uprobe APIs.

As mentioned, a very straightforward patch #12 takes advantage of all the prep
work and just switches uprobes_treelock to per-CPU RW semaphore.

Andrii Nakryiko (12):
  uprobes: update outdated comment
  uprobes: grab write mmap lock in unapply_uprobe()
  uprobes: simplify error handling for alloc_uprobe()
  uprobes: revamp uprobe refcounting and lifetime management
  uprobes: move offset and ref_ctr_offset into uprobe_consumer
  uprobes: add batch uprobe register/unregister APIs
  uprobes: inline alloc_uprobe() logic into __uprobe_register()
  uprobes: split uprobe allocation and uprobes_tree insertion steps
  uprobes: batch uprobes_treelock during registration
  uprobes: improve lock batching for uprobe_unregister_batch
  uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes
  uprobes: switch uprobes_treelock to per-CPU RW semaphore

 include/linux/uprobes.h                       |  29 +-
 kernel/events/uprobes.c                       | 522 ++++++++++++------
 kernel/trace/bpf_trace.c                      |  40 +-
 kernel/trace/trace_uprobe.c                   |  53 +-
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  23 +-
 5 files changed, 419 insertions(+), 248 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 01/12] uprobes: update outdated comment
  2024-06-25  0:21 [PATCH 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
@ 2024-06-25  0:21 ` Andrii Nakryiko
  2024-06-25  0:21 ` [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe() Andrii Nakryiko
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 42+ messages in thread
From: Andrii Nakryiko @ 2024-06-25  0:21 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

There is no task_struct passed into get_user_pages_remote() anymore,
drop the parts of comment mentioning NULL tsk, it's just confusing at
this point.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 2816e65729ac..197fbe4663b5 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
 		goto out;
 
 	/*
-	 * The NULL 'tsk' here ensures that any faults that occur here
-	 * will not be accounted to the task.  'mm' *is* current->mm,
-	 * but we treat this as a 'remote' access since it is
-	 * essentially a kernel access to the memory.
+	 * 'mm' *is* current->mm, but we treat this as a 'remote' access since
+	 * it is essentially a kernel access to the memory.
 	 */
 	result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, &page, NULL);
 	if (result < 0)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe()
  2024-06-25  0:21 [PATCH 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
  2024-06-25  0:21 ` [PATCH 01/12] uprobes: update outdated comment Andrii Nakryiko
@ 2024-06-25  0:21 ` Andrii Nakryiko
  2024-06-25  1:29   ` Masami Hiramatsu
  2024-06-25 10:50   ` Oleg Nesterov
  2024-06-25  0:21 ` [PATCH 03/12] uprobes: simplify error handling for alloc_uprobe() Andrii Nakryiko
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 42+ messages in thread
From: Andrii Nakryiko @ 2024-06-25  0:21 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

Given unapply_uprobe() can call remove_breakpoint() which eventually
calls uprobe_write_opcode(), which can modify a set of memory pages and
expects mm->mmap_lock held for write, it needs to have writer lock.

Fix this by switching to mmap_write_lock()/mmap_write_unlock().

Fixes: da1816b1caec ("uprobes: Teach handler_chain() to filter out the probed task")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 197fbe4663b5..e896eeecb091 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1235,7 +1235,7 @@ static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm)
 	struct vm_area_struct *vma;
 	int err = 0;
 
-	mmap_read_lock(mm);
+	mmap_write_lock(mm);
 	for_each_vma(vmi, vma) {
 		unsigned long vaddr;
 		loff_t offset;
@@ -1252,7 +1252,7 @@ static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm)
 		vaddr = offset_to_vaddr(vma, uprobe->offset);
 		err |= remove_breakpoint(uprobe, mm, vaddr);
 	}
-	mmap_read_unlock(mm);
+	mmap_write_unlock(mm);
 
 	return err;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 03/12] uprobes: simplify error handling for alloc_uprobe()
  2024-06-25  0:21 [PATCH 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
  2024-06-25  0:21 ` [PATCH 01/12] uprobes: update outdated comment Andrii Nakryiko
  2024-06-25  0:21 ` [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe() Andrii Nakryiko
@ 2024-06-25  0:21 ` Andrii Nakryiko
  2024-06-25  0:21 ` [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 42+ messages in thread
From: Andrii Nakryiko @ 2024-06-25  0:21 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

Return -ENOMEM instead of NULL, which makes caller's error handling just
a touch simpler.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index e896eeecb091..aa59fa53ae67 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -725,7 +725,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
 
 	uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
 	if (!uprobe)
-		return NULL;
+		return ERR_PTR(-ENOMEM);
 
 	uprobe->inode = inode;
 	uprobe->offset = offset;
@@ -1161,8 +1161,6 @@ static int __uprobe_register(struct inode *inode, loff_t offset,
 
  retry:
 	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
-	if (!uprobe)
-		return -ENOMEM;
 	if (IS_ERR(uprobe))
 		return PTR_ERR(uprobe);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-06-25  0:21 [PATCH 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
                   ` (2 preceding siblings ...)
  2024-06-25  0:21 ` [PATCH 03/12] uprobes: simplify error handling for alloc_uprobe() Andrii Nakryiko
@ 2024-06-25  0:21 ` Andrii Nakryiko
  2024-06-25 14:44   ` Oleg Nesterov
                     ` (2 more replies)
  2024-06-25  0:21 ` [PATCH 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer Andrii Nakryiko
                   ` (7 subsequent siblings)
  11 siblings, 3 replies; 42+ messages in thread
From: Andrii Nakryiko @ 2024-06-25  0:21 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

Revamp how struct uprobe is refcounted, and thus how its lifetime is
managed.

Right now, there are a few possible "owners" of uprobe refcount:
  - uprobes_tree RB tree assumes one refcount when uprobe is registered
    and added to the lookup tree;
  - while uprobe is triggered and kernel is handling it in the breakpoint
    handler code, temporary refcount bump is done to keep uprobe from
    being freed;
  - if we have uretprobe requested on a given struct uprobe instance, we
    take another refcount to keep uprobe alive until user space code
    returns from the function and triggers return handler.

The uprobe_tree's extra refcount of 1 is problematic and inconvenient.
Because of it, we have extra retry logic in uprobe_register(), and we
have an extra logic in __uprobe_unregister(), which checks that uprobe
has no more consumers, and if that's the case, it removes struct uprobe
from uprobes_tree (through delete_uprobe(), which takes writer lock on
uprobes_tree), decrementing refcount after that. The latter is the
source of unfortunate race with uprobe_register, necessitating retries.

All of the above is a complication that makes adding batched uprobe
registration/unregistration APIs hard, and generally makes following the
logic harder.

This patch changes refcounting scheme in such a way as to not have
uprobes_tree keeping extra refcount for struct uprobe. Instead,
uprobe_consumer is assuming this extra refcount, which will be dropped
when consumer is unregistered. Other than that, all the active users of
uprobe (entry and return uprobe handling code) keeps exactly the same
refcounting approach.

With the above setup, once uprobe's refcount drops to zero, we need to
make sure that uprobe's "destructor" removes uprobe from uprobes_tree,
of course. This, though, races with uprobe entry handling code in
handle_swbp(), which, though find_active_uprobe()->find_uprobe() lookup
can race with uprobe being destroyed after refcount drops to zero (e.g.,
due to uprobe_consumer unregistering). This is because
find_active_uprobe() bumps refcount without knowing for sure that
uprobe's refcount is already positive (and it has to be this way, there
is no way around that setup).

One, attempted initially, way to solve this is through using
atomic_inc_not_zero() approach, turning get_uprobe() into
try_get_uprobe(), which can fail to bump refcount if uprobe is already
destined to be destroyed. This, unfortunately, turns out to be a rather
expensive due to underlying cmpxchg() operation in
atomic_inc_not_zero() and scales rather poorly with increased amount of
parallel threads triggering uprobes.

So, we devise a refcounting scheme that doesn't require cmpxchg(),
instead relying only on atomic additions, which scale better and are
faster. While the solution has a bit of a trick to it, all the logic is
nicely compartmentalized in __get_uprobe() and put_uprobe() helpers and
doesn't leak outside of those low-level helpers.

We, effectively, structure uprobe's destruction (i.e., put_uprobe() logic)
in such a way that we support "resurrecting" uprobe by bumping its
refcount from zero back to one, and pretending like it never dropped to
zero in the first place. This is done in a race-free way under
exclusive writer uprobes_treelock. Crucially, we take lock only once
refcount drops to zero. If we had to take lock before decrementing
refcount, the approach would be prohibitively expensive.

Anyways, under exclusive writer lock, we double-check that refcount
didn't change and is still zero. If it is, we proceed with destruction,
because at that point we have a guarantee that find_active_uprobe()
can't successfully look up this uprobe instance, as it's going to be
removed in destructor under writer lock. If, on the other hand,
find_active_uprobe() managed to bump refcount from zero to one in
between put_uprobe()'s atomic_dec_and_test(&uprobe->ref) and
write_lock(&uprobes_treelock), we'll deterministically detect this with
extra atomic_read(&uprobe->ref) check, and if it doesn't hold, we
pretend like atomic_dec_and_test() never returned true. There is no
resource freeing or any other irreversible action taken up till this
point, so we just exit early.

One tricky part in the above is actually two CPUs racing and dropping
refcnt to zero, and then attempting to free resources. This can happen
as follows:
  - CPU #0 drops refcnt from 1 to 0, and proceeds to grab uprobes_treelock;
  - before CPU #0 grabs a lock, CPU #1 updates refcnt as 0 -> 1 -> 0, at
    which point it decides that it needs to free uprobe as well.

At this point both CPU #0 and CPU #1 will believe they need to destroy
uprobe, which is obviously wrong. To prevent this situations, we augment
refcount with epoch counter, which is always incremented by 1 on either
get or put operation. This allows those two CPUs above to disambiguate
who should actually free uprobe (it's the CPU #1, because it has
up-to-date epoch). See comments in the code and note the specific values
of UPROBE_REFCNT_GET and UPROBE_REFCNT_PUT constants. Keep in mind that
a single atomi64_t is actually a two sort-of-independent 32-bit counters
that are incremented/decremented with a single atomic_add_and_return()
operation. Note also a small and extremely rare (and thus having no
effect on performance) need to clear the highest bit every 2 billion
get/put operations to prevent high 32-bit counter from "bleeding over"
into lower 32-bit counter.

All the above contained trickery aside, we end up with a nice semantics
for get and put operations, where get always succeeds and put handles
all the races properly and transparently to the caller.

And just to justify this a bit unorthodox refcounting approach, under
uprobe triggering micro-benchmark (using BPF selftests' bench tool) with
8 triggering threads, atomic_inc_not_zero() approach was producing about
3.3 millions/sec total uprobe triggerings across all threads. While the
final atomic_add_and_return()-based approach managed to get 3.6 millions/sec
throughput under the same 8 competing threads.

Furthermore, CPU profiling showed the following overall CPU usage:
  - try_get_uprobe (19.3%) + put_uprobe (8.2%) = 27.5% CPU usage for
    atomic_inc_not_zero approach;
  - __get_uprobe (12.3%) + put_uprobe (9.9%) = 22.2% CPU usage for
    atomic_add_and_return approach implemented by this patch.

So, CPU is spending relatively more CPU time in get/put operations while
delivering less total throughput if using atomic_inc_not_zero(). And
this will be even more prominent once we optimize away uprobe->register_rwsem
in the subsequent patch sets. So while slightly less straightforward,
current approach seems to be clearly winning and justified.

We also rename get_uprobe() to __get_uprobe() to indicate it's
a delicate internal helper that is only safe to call under valid
circumstances:
  - while holding uprobes_treelock (to synchronize with exclusive write
    lock in put_uprobe(), as described above);
  - or if we have a guarantee that uprobe's refcount is already positive
    through caller holding at least one refcount (in this case there is
    no risk of refcount dropping to zero by any other CPU).

We also document why it's safe to do unconditional __get_uprobe() at all
call sites, to make it clear that we maintain the above invariants.

Note also, we now don't have a race between registration and
unregistration, so we remove the retry logic completely.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 231 +++++++++++++++++++++++++++++-----------
 1 file changed, 167 insertions(+), 64 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index aa59fa53ae67..8ce669bc6474 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -53,7 +53,7 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
 
 struct uprobe {
 	struct rb_node		rb_node;	/* node in the rb tree */
-	refcount_t		ref;
+	atomic64_t		ref;		/* see UPROBE_REFCNT_GET below */
 	struct rw_semaphore	register_rwsem;
 	struct rw_semaphore	consumer_rwsem;
 	struct list_head	pending_list;
@@ -587,15 +587,114 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
 			*(uprobe_opcode_t *)&auprobe->insn);
 }
 
-static struct uprobe *get_uprobe(struct uprobe *uprobe)
+/*
+ * Uprobe's 64-bit refcount is actually two independent counters co-located in
+ * a single u64 value:
+ *   - lower 32 bits are just a normal refcount with is increment and
+ *   decremented on get and put, respectively, just like normal refcount
+ *   would;
+ *   - upper 32 bits are a tag (or epoch, if you will), which is always
+ *   incremented by one, no matter whether get or put operation is done.
+ *
+ * This upper counter is meant to distinguish between:
+ *   - one CPU dropping refcnt from 1 -> 0 and proceeding with "destruction",
+ *   - while another CPU continuing further meanwhile with 0 -> 1 -> 0 refcnt
+ *   sequence, also proceeding to "destruction".
+ *
+ * In both cases refcount drops to zero, but in one case it will have epoch N,
+ * while the second drop to zero will have a different epoch N + 2, allowing
+ * first destructor to bail out because epoch changed between refcount going
+ * to zero and put_uprobe() taking uprobes_treelock (under which overall
+ * 64-bit refcount is double-checked, see put_uprobe() for details).
+ *
+ * Lower 32-bit counter is not meant to over overflow, while it's expected
+ * that upper 32-bit counter will overflow occasionally. Note, though, that we
+ * can't allow upper 32-bit counter to "bleed over" into lower 32-bit counter,
+ * so whenever epoch counter gets highest bit set to 1, __get_uprobe() and
+ * put_uprobe() will attempt to clear upper bit with cmpxchg(). This makes
+ * epoch effectively a 31-bit counter with highest bit used as a flag to
+ * perform a fix-up. This ensures epoch and refcnt parts do not "interfere".
+ *
+ * UPROBE_REFCNT_GET constant is chosen such that it will *increment both*
+ * epoch and refcnt parts atomically with one atomic_add().
+ * UPROBE_REFCNT_PUT is chosen such that it will *decrement* refcnt part and
+ * *increment* epoch part.
+ */
+#define UPROBE_REFCNT_GET ((1LL << 32) | 1LL)
+#define UPROBE_REFCNT_PUT (0xffffffffLL)
+
+/**
+ * Caller has to make sure that:
+ *   a) either uprobe's refcnt is positive before this call;
+ *   b) or uprobes_treelock is held (doesn't matter if for read or write),
+ *      preventing uprobe's destructor from removing it from uprobes_tree.
+ *
+ * In the latter case, uprobe's destructor will "resurrect" uprobe instance if
+ * it detects that its refcount went back to being positive again inbetween it
+ * dropping to zero at some point and (potentially delayed) destructor
+ * callback actually running.
+ */
+static struct uprobe *__get_uprobe(struct uprobe *uprobe)
 {
-	refcount_inc(&uprobe->ref);
+	s64 v;
+
+	v = atomic64_add_return(UPROBE_REFCNT_GET, &uprobe->ref);
+
+	/*
+	 * If the highest bit is set, we need to clear it. If cmpxchg() fails,
+	 * we don't retry because there is another CPU that just managed to
+	 * update refcnt and will attempt the same "fix up". Eventually one of
+	 * them will succeed to clear highset bit.
+	 */
+	if (unlikely(v < 0))
+		(void)atomic64_cmpxchg(&uprobe->ref, v, v & ~(1ULL << 63));
+
 	return uprobe;
 }
 
+static inline bool uprobe_is_active(struct uprobe *uprobe)
+{
+	return !RB_EMPTY_NODE(&uprobe->rb_node);
+}
+
 static void put_uprobe(struct uprobe *uprobe)
 {
-	if (refcount_dec_and_test(&uprobe->ref)) {
+	s64 v;
+
+	v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);
+
+	if (unlikely((u32)v == 0)) {
+		bool destroy;
+
+		write_lock(&uprobes_treelock);
+		/*
+		 * We might race with find_uprobe()->__get_uprobe() executed
+		 * from inside read-locked uprobes_treelock, which can bump
+		 * refcount from zero back to one, after we got here. Even
+		 * worse, it's possible for another CPU to do 0 -> 1 -> 0
+		 * transition between this CPU doing atomic_add() and taking
+		 * uprobes_treelock. In either case this CPU should bail out
+		 * and not proceed with destruction.
+		 *
+		 * So now that we have exclusive write lock, we double check
+		 * the total 64-bit refcount value, which includes the epoch.
+		 * If nothing changed (i.e., epoch is the same and refcnt is
+		 * still zero), we are good and we proceed with the clean up.
+		 *
+		 * But if it managed to be updated back at least once, we just
+		 * pretend it never went to zero. If lower 32-bit refcnt part
+		 * drops to zero again, another CPU will proceed with
+		 * destruction, due to more up to date epoch.
+		 */
+		destroy = atomic64_read(&uprobe->ref) == v;
+		if (destroy && uprobe_is_active(uprobe))
+			rb_erase(&uprobe->rb_node, &uprobes_tree);
+		write_unlock(&uprobes_treelock);
+
+		/* uprobe got resurrected, pretend we never tried to free it */
+		if (!destroy)
+			return;
+
 		/*
 		 * If application munmap(exec_vma) before uprobe_unregister()
 		 * gets called, we don't get a chance to remove uprobe from
@@ -604,8 +703,19 @@ static void put_uprobe(struct uprobe *uprobe)
 		mutex_lock(&delayed_uprobe_lock);
 		delayed_uprobe_remove(uprobe, NULL);
 		mutex_unlock(&delayed_uprobe_lock);
+
 		kfree(uprobe);
+		return;
 	}
+
+	/*
+	 * If the highest bit is set, we need to clear it. If cmpxchg() fails,
+	 * we don't retry because there is another CPU that just managed to
+	 * update refcnt and will attempt the same "fix up". Eventually one of
+	 * them will succeed to clear highset bit.
+	 */
+	if (unlikely(v < 0))
+		(void)atomic64_cmpxchg(&uprobe->ref, v, v & ~(1ULL << 63));
 }
 
 static __always_inline
@@ -653,12 +763,15 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
 		.inode = inode,
 		.offset = offset,
 	};
-	struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
+	struct rb_node *node;
+	struct uprobe *u = NULL;
 
+	node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
 	if (node)
-		return get_uprobe(__node_2_uprobe(node));
+		/* we hold uprobes_treelock, so it's safe to __get_uprobe() */
+		u = __get_uprobe(__node_2_uprobe(node));
 
-	return NULL;
+	return u;
 }
 
 /*
@@ -676,26 +789,37 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
 	return uprobe;
 }
 
+/*
+ * Attempt to insert a new uprobe into uprobes_tree.
+ *
+ * If uprobe already exists (for given inode+offset), we just increment
+ * refcount of previously existing uprobe.
+ *
+ * If not, a provided new instance of uprobe is inserted into the tree (with
+ * assumed initial refcount == 1).
+ *
+ * In any case, we return a uprobe instance that ends up being in uprobes_tree.
+ * Caller has to clean up new uprobe instance, if it ended up not being
+ * inserted into the tree.
+ *
+ * We assume that uprobes_treelock is held for writing.
+ */
 static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
 {
 	struct rb_node *node;
+	struct uprobe *u = uprobe;
 
 	node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
 	if (node)
-		return get_uprobe(__node_2_uprobe(node));
+		/* we hold uprobes_treelock, so it's safe to __get_uprobe() */
+		u = __get_uprobe(__node_2_uprobe(node));
 
-	/* get access + creation ref */
-	refcount_set(&uprobe->ref, 2);
-	return NULL;
+	return u;
 }
 
 /*
- * Acquire uprobes_treelock.
- * Matching uprobe already exists in rbtree;
- *	increment (access refcount) and return the matching uprobe.
- *
- * No matching uprobe; insert the uprobe in rb_tree;
- *	get a double refcount (access + creation) and return NULL.
+ * Acquire uprobes_treelock and insert uprobe into uprobes_tree
+ * (or reuse existing one, see __insert_uprobe() comments above).
  */
 static struct uprobe *insert_uprobe(struct uprobe *uprobe)
 {
@@ -732,11 +856,13 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
 	uprobe->ref_ctr_offset = ref_ctr_offset;
 	init_rwsem(&uprobe->register_rwsem);
 	init_rwsem(&uprobe->consumer_rwsem);
+	RB_CLEAR_NODE(&uprobe->rb_node);
+	atomic64_set(&uprobe->ref, 1);
 
 	/* add to uprobes_tree, sorted on inode:offset */
 	cur_uprobe = insert_uprobe(uprobe);
 	/* a uprobe exists for this inode:offset combination */
-	if (cur_uprobe) {
+	if (cur_uprobe != uprobe) {
 		if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
 			ref_ctr_mismatch_warn(cur_uprobe, uprobe);
 			put_uprobe(cur_uprobe);
@@ -921,27 +1047,6 @@ remove_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, unsigned long vad
 	return set_orig_insn(&uprobe->arch, mm, vaddr);
 }
 
-static inline bool uprobe_is_active(struct uprobe *uprobe)
-{
-	return !RB_EMPTY_NODE(&uprobe->rb_node);
-}
-/*
- * There could be threads that have already hit the breakpoint. They
- * will recheck the current insn and restart if find_uprobe() fails.
- * See find_active_uprobe().
- */
-static void delete_uprobe(struct uprobe *uprobe)
-{
-	if (WARN_ON(!uprobe_is_active(uprobe)))
-		return;
-
-	write_lock(&uprobes_treelock);
-	rb_erase(&uprobe->rb_node, &uprobes_tree);
-	write_unlock(&uprobes_treelock);
-	RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
-	put_uprobe(uprobe);
-}
-
 struct map_info {
 	struct map_info *next;
 	struct mm_struct *mm;
@@ -1082,15 +1187,11 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
 static void
 __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
 {
-	int err;
-
 	if (WARN_ON(!consumer_del(uprobe, uc)))
 		return;
 
-	err = register_for_each_vma(uprobe, NULL);
 	/* TODO : cant unregister? schedule a worker thread */
-	if (!uprobe->consumers && !err)
-		delete_uprobe(uprobe);
+	(void)register_for_each_vma(uprobe, NULL);
 }
 
 /*
@@ -1159,28 +1260,20 @@ static int __uprobe_register(struct inode *inode, loff_t offset,
 	if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
 		return -EINVAL;
 
- retry:
 	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
 	if (IS_ERR(uprobe))
 		return PTR_ERR(uprobe);
 
-	/*
-	 * We can race with uprobe_unregister()->delete_uprobe().
-	 * Check uprobe_is_active() and retry if it is false.
-	 */
 	down_write(&uprobe->register_rwsem);
-	ret = -EAGAIN;
-	if (likely(uprobe_is_active(uprobe))) {
-		consumer_add(uprobe, uc);
-		ret = register_for_each_vma(uprobe, uc);
-		if (ret)
-			__uprobe_unregister(uprobe, uc);
-	}
+	consumer_add(uprobe, uc);
+	ret = register_for_each_vma(uprobe, uc);
+	if (ret)
+		__uprobe_unregister(uprobe, uc);
 	up_write(&uprobe->register_rwsem);
-	put_uprobe(uprobe);
 
-	if (unlikely(ret == -EAGAIN))
-		goto retry;
+	if (ret)
+		put_uprobe(uprobe);
+
 	return ret;
 }
 
@@ -1303,15 +1396,15 @@ static void build_probe_list(struct inode *inode,
 			u = rb_entry(t, struct uprobe, rb_node);
 			if (u->inode != inode || u->offset < min)
 				break;
+			__get_uprobe(u); /* uprobes_treelock is held */
 			list_add(&u->pending_list, head);
-			get_uprobe(u);
 		}
 		for (t = n; (t = rb_next(t)); ) {
 			u = rb_entry(t, struct uprobe, rb_node);
 			if (u->inode != inode || u->offset > max)
 				break;
+			__get_uprobe(u); /* uprobes_treelock is held */
 			list_add(&u->pending_list, head);
-			get_uprobe(u);
 		}
 	}
 	read_unlock(&uprobes_treelock);
@@ -1769,7 +1862,14 @@ static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
 			return -ENOMEM;
 
 		*n = *o;
-		get_uprobe(n->uprobe);
+		/*
+		 * uprobe's refcnt has to be positive at this point, kept by
+		 * utask->return_instances items; return_instances can't be
+		 * removed right now, as task is blocked due to duping; so
+		 * __get_uprobe() is safe to use here without holding
+		 * uprobes_treelock.
+		 */
+		__get_uprobe(n->uprobe);
 		n->next = NULL;
 
 		*p = n;
@@ -1911,8 +2011,11 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
 		}
 		orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
 	}
-
-	ri->uprobe = get_uprobe(uprobe);
+	 /*
+	  * uprobe's refcnt is positive, held by caller, so it's safe to
+	  * unconditionally bump it one more time here
+	  */
+	ri->uprobe = __get_uprobe(uprobe);
 	ri->func = instruction_pointer(regs);
 	ri->stack = user_stack_pointer(regs);
 	ri->orig_ret_vaddr = orig_ret_vaddr;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer
  2024-06-25  0:21 [PATCH 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
                   ` (3 preceding siblings ...)
  2024-06-25  0:21 ` [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
@ 2024-06-25  0:21 ` Andrii Nakryiko
  2024-06-27  3:06   ` Masami Hiramatsu
  2024-06-25  0:21 ` [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs Andrii Nakryiko
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 42+ messages in thread
From: Andrii Nakryiko @ 2024-06-25  0:21 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

Simplify uprobe registration/unregistration interfaces by making offset
and ref_ctr_offset part of uprobe_consumer "interface". In practice, all
existing users already store these fields somewhere in uprobe_consumer's
containing structure, so this doesn't pose any problem. We just move
some fields around.

On the other hand, this simplifies uprobe_register() and
uprobe_unregister() API by having only struct uprobe_consumer as one
thing representing attachment/detachment entity. This makes batched
versions of uprobe_register() and uprobe_unregister() simpler.

This also makes uprobe_register_refctr() unnecessary, so remove it and
simplify consumers.

No functional changes intended.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/uprobes.h                       | 18 +++----
 kernel/events/uprobes.c                       | 19 ++-----
 kernel/trace/bpf_trace.c                      | 21 +++-----
 kernel/trace/trace_uprobe.c                   | 53 ++++++++-----------
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   | 23 ++++----
 5 files changed, 55 insertions(+), 79 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b503fafb7fb3..a75ba37ce3c8 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -42,6 +42,11 @@ struct uprobe_consumer {
 				enum uprobe_filter_ctx ctx,
 				struct mm_struct *mm);
 
+	/* associated file offset of this probe */
+	loff_t offset;
+	/* associated refctr file offset of this probe, or zero */
+	loff_t ref_ctr_offset;
+	/* for internal uprobe infra use, consumers shouldn't touch fields below */
 	struct uprobe_consumer *next;
 };
 
@@ -110,10 +115,9 @@ extern bool is_trap_insn(uprobe_opcode_t *insn);
 extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
 extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
 extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
-extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
-extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
+extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc);
 extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool);
-extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
+extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc);
 extern int uprobe_mmap(struct vm_area_struct *vma);
 extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end);
 extern void uprobe_start_dup_mmap(void);
@@ -152,11 +156,7 @@ static inline void uprobes_init(void)
 #define uprobe_get_trap_addr(regs)	instruction_pointer(regs)
 
 static inline int
-uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
-{
-	return -ENOSYS;
-}
-static inline int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc)
+uprobe_register(struct inode *inode, struct uprobe_consumer *uc)
 {
 	return -ENOSYS;
 }
@@ -166,7 +166,7 @@ uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, boo
 	return -ENOSYS;
 }
 static inline void
-uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
+uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc)
 {
 }
 static inline int uprobe_mmap(struct vm_area_struct *vma)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 8ce669bc6474..2544e8b79bad 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1197,14 +1197,13 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
 /*
  * uprobe_unregister - unregister an already registered probe.
  * @inode: the file in which the probe has to be removed.
- * @offset: offset from the start of the file.
- * @uc: identify which probe if multiple probes are colocated.
+ * @uc: identify which probe consumer to unregister.
  */
-void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
+void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc)
 {
 	struct uprobe *uprobe;
 
-	uprobe = find_uprobe(inode, offset);
+	uprobe = find_uprobe(inode, uc->offset);
 	if (WARN_ON(!uprobe))
 		return;
 
@@ -1277,20 +1276,12 @@ static int __uprobe_register(struct inode *inode, loff_t offset,
 	return ret;
 }
 
-int uprobe_register(struct inode *inode, loff_t offset,
-		    struct uprobe_consumer *uc)
+int uprobe_register(struct inode *inode, struct uprobe_consumer *uc)
 {
-	return __uprobe_register(inode, offset, 0, uc);
+	return __uprobe_register(inode, uc->offset, uc->ref_ctr_offset, uc);
 }
 EXPORT_SYMBOL_GPL(uprobe_register);
 
-int uprobe_register_refctr(struct inode *inode, loff_t offset,
-			   loff_t ref_ctr_offset, struct uprobe_consumer *uc)
-{
-	return __uprobe_register(inode, offset, ref_ctr_offset, uc);
-}
-EXPORT_SYMBOL_GPL(uprobe_register_refctr);
-
 /*
  * uprobe_apply - unregister an already registered probe.
  * @inode: the file in which the probe has to be removed.
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d1daeab1bbc1..ba62baec3152 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3154,8 +3154,6 @@ struct bpf_uprobe_multi_link;
 
 struct bpf_uprobe {
 	struct bpf_uprobe_multi_link *link;
-	loff_t offset;
-	unsigned long ref_ctr_offset;
 	u64 cookie;
 	struct uprobe_consumer consumer;
 };
@@ -3181,8 +3179,7 @@ static void bpf_uprobe_unregister(struct path *path, struct bpf_uprobe *uprobes,
 	u32 i;
 
 	for (i = 0; i < cnt; i++) {
-		uprobe_unregister(d_real_inode(path->dentry), uprobes[i].offset,
-				  &uprobes[i].consumer);
+		uprobe_unregister(d_real_inode(path->dentry), &uprobes[i].consumer);
 	}
 }
 
@@ -3262,10 +3259,10 @@ static int bpf_uprobe_multi_link_fill_link_info(const struct bpf_link *link,
 
 	for (i = 0; i < ucount; i++) {
 		if (uoffsets &&
-		    put_user(umulti_link->uprobes[i].offset, uoffsets + i))
+		    put_user(umulti_link->uprobes[i].consumer.offset, uoffsets + i))
 			return -EFAULT;
 		if (uref_ctr_offsets &&
-		    put_user(umulti_link->uprobes[i].ref_ctr_offset, uref_ctr_offsets + i))
+		    put_user(umulti_link->uprobes[i].consumer.ref_ctr_offset, uref_ctr_offsets + i))
 			return -EFAULT;
 		if (ucookies &&
 		    put_user(umulti_link->uprobes[i].cookie, ucookies + i))
@@ -3439,15 +3436,16 @@ int bpf_uprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
 		goto error_free;
 
 	for (i = 0; i < cnt; i++) {
-		if (__get_user(uprobes[i].offset, uoffsets + i)) {
+		if (__get_user(uprobes[i].consumer.offset, uoffsets + i)) {
 			err = -EFAULT;
 			goto error_free;
 		}
-		if (uprobes[i].offset < 0) {
+		if (uprobes[i].consumer.offset < 0) {
 			err = -EINVAL;
 			goto error_free;
 		}
-		if (uref_ctr_offsets && __get_user(uprobes[i].ref_ctr_offset, uref_ctr_offsets + i)) {
+		if (uref_ctr_offsets &&
+		    __get_user(uprobes[i].consumer.ref_ctr_offset, uref_ctr_offsets + i)) {
 			err = -EFAULT;
 			goto error_free;
 		}
@@ -3477,10 +3475,7 @@ int bpf_uprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
 		      &bpf_uprobe_multi_link_lops, prog);
 
 	for (i = 0; i < cnt; i++) {
-		err = uprobe_register_refctr(d_real_inode(link->path.dentry),
-					     uprobes[i].offset,
-					     uprobes[i].ref_ctr_offset,
-					     &uprobes[i].consumer);
+		err = uprobe_register(d_real_inode(link->path.dentry), &uprobes[i].consumer);
 		if (err) {
 			bpf_uprobe_unregister(&path, uprobes, i);
 			goto error_free;
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index c98e3b3386ba..d786f99114be 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -60,8 +60,6 @@ struct trace_uprobe {
 	struct path			path;
 	struct inode			*inode;
 	char				*filename;
-	unsigned long			offset;
-	unsigned long			ref_ctr_offset;
 	unsigned long			nhit;
 	struct trace_probe		tp;
 };
@@ -205,7 +203,7 @@ static unsigned long translate_user_vaddr(unsigned long file_offset)
 
 	udd = (void *) current->utask->vaddr;
 
-	base_addr = udd->bp_addr - udd->tu->offset;
+	base_addr = udd->bp_addr - udd->tu->consumer.offset;
 	return base_addr + file_offset;
 }
 
@@ -286,13 +284,13 @@ static bool trace_uprobe_match_command_head(struct trace_uprobe *tu,
 	if (strncmp(tu->filename, argv[0], len) || argv[0][len] != ':')
 		return false;
 
-	if (tu->ref_ctr_offset == 0)
-		snprintf(buf, sizeof(buf), "0x%0*lx",
-				(int)(sizeof(void *) * 2), tu->offset);
+	if (tu->consumer.ref_ctr_offset == 0)
+		snprintf(buf, sizeof(buf), "0x%0*llx",
+				(int)(sizeof(void *) * 2), tu->consumer.offset);
 	else
-		snprintf(buf, sizeof(buf), "0x%0*lx(0x%lx)",
-				(int)(sizeof(void *) * 2), tu->offset,
-				tu->ref_ctr_offset);
+		snprintf(buf, sizeof(buf), "0x%0*llx(0x%llx)",
+				(int)(sizeof(void *) * 2), tu->consumer.offset,
+				tu->consumer.ref_ctr_offset);
 	if (strcmp(buf, &argv[0][len + 1]))
 		return false;
 
@@ -410,7 +408,7 @@ static bool trace_uprobe_has_same_uprobe(struct trace_uprobe *orig,
 
 	list_for_each_entry(orig, &tpe->probes, tp.list) {
 		if (comp_inode != d_real_inode(orig->path.dentry) ||
-		    comp->offset != orig->offset)
+		    comp->consumer.offset != orig->consumer.offset)
 			continue;
 
 		/*
@@ -472,8 +470,8 @@ static int validate_ref_ctr_offset(struct trace_uprobe *new)
 
 	for_each_trace_uprobe(tmp, pos) {
 		if (new_inode == d_real_inode(tmp->path.dentry) &&
-		    new->offset == tmp->offset &&
-		    new->ref_ctr_offset != tmp->ref_ctr_offset) {
+		    new->consumer.offset == tmp->consumer.offset &&
+		    new->consumer.ref_ctr_offset != tmp->consumer.ref_ctr_offset) {
 			pr_warn("Reference counter offset mismatch.");
 			return -EINVAL;
 		}
@@ -675,8 +673,8 @@ static int __trace_uprobe_create(int argc, const char **argv)
 		WARN_ON_ONCE(ret != -ENOMEM);
 		goto fail_address_parse;
 	}
-	tu->offset = offset;
-	tu->ref_ctr_offset = ref_ctr_offset;
+	tu->consumer.offset = offset;
+	tu->consumer.ref_ctr_offset = ref_ctr_offset;
 	tu->path = path;
 	tu->filename = filename;
 
@@ -746,12 +744,12 @@ static int trace_uprobe_show(struct seq_file *m, struct dyn_event *ev)
 	char c = is_ret_probe(tu) ? 'r' : 'p';
 	int i;
 
-	seq_printf(m, "%c:%s/%s %s:0x%0*lx", c, trace_probe_group_name(&tu->tp),
+	seq_printf(m, "%c:%s/%s %s:0x%0*llx", c, trace_probe_group_name(&tu->tp),
 			trace_probe_name(&tu->tp), tu->filename,
-			(int)(sizeof(void *) * 2), tu->offset);
+			(int)(sizeof(void *) * 2), tu->consumer.offset);
 
-	if (tu->ref_ctr_offset)
-		seq_printf(m, "(0x%lx)", tu->ref_ctr_offset);
+	if (tu->consumer.ref_ctr_offset)
+		seq_printf(m, "(0x%llx)", tu->consumer.ref_ctr_offset);
 
 	for (i = 0; i < tu->tp.nr_args; i++)
 		seq_printf(m, " %s=%s", tu->tp.args[i].name, tu->tp.args[i].comm);
@@ -1089,12 +1087,7 @@ static int trace_uprobe_enable(struct trace_uprobe *tu, filter_func_t filter)
 	tu->consumer.filter = filter;
 	tu->inode = d_real_inode(tu->path.dentry);
 
-	if (tu->ref_ctr_offset)
-		ret = uprobe_register_refctr(tu->inode, tu->offset,
-				tu->ref_ctr_offset, &tu->consumer);
-	else
-		ret = uprobe_register(tu->inode, tu->offset, &tu->consumer);
-
+	ret = uprobe_register(tu->inode, &tu->consumer);
 	if (ret)
 		tu->inode = NULL;
 
@@ -1112,7 +1105,7 @@ static void __probe_event_disable(struct trace_probe *tp)
 		if (!tu->inode)
 			continue;
 
-		uprobe_unregister(tu->inode, tu->offset, &tu->consumer);
+		uprobe_unregister(tu->inode, &tu->consumer);
 		tu->inode = NULL;
 	}
 }
@@ -1310,7 +1303,7 @@ static int uprobe_perf_close(struct trace_event_call *call,
 		return 0;
 
 	list_for_each_entry(tu, trace_probe_probe_list(tp), tp.list) {
-		ret = uprobe_apply(tu->inode, tu->offset, &tu->consumer, false);
+		ret = uprobe_apply(tu->inode, tu->consumer.offset, &tu->consumer, false);
 		if (ret)
 			break;
 	}
@@ -1334,7 +1327,7 @@ static int uprobe_perf_open(struct trace_event_call *call,
 		return 0;
 
 	list_for_each_entry(tu, trace_probe_probe_list(tp), tp.list) {
-		err = uprobe_apply(tu->inode, tu->offset, &tu->consumer, true);
+		err = uprobe_apply(tu->inode, tu->consumer.offset, &tu->consumer, true);
 		if (err) {
 			uprobe_perf_close(call, event);
 			break;
@@ -1464,7 +1457,7 @@ int bpf_get_uprobe_info(const struct perf_event *event, u32 *fd_type,
 	*fd_type = is_ret_probe(tu) ? BPF_FD_TYPE_URETPROBE
 				    : BPF_FD_TYPE_UPROBE;
 	*filename = tu->filename;
-	*probe_offset = tu->offset;
+	*probe_offset = tu->consumer.offset;
 	*probe_addr = 0;
 	return 0;
 }
@@ -1627,9 +1620,9 @@ create_local_trace_uprobe(char *name, unsigned long offs,
 		return ERR_CAST(tu);
 	}
 
-	tu->offset = offs;
+	tu->consumer.offset = offs;
 	tu->path = path;
-	tu->ref_ctr_offset = ref_ctr_offset;
+	tu->consumer.ref_ctr_offset = ref_ctr_offset;
 	tu->filename = kstrdup(name, GFP_KERNEL);
 	if (!tu->filename) {
 		ret = -ENOMEM;
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index b0132a342bb5..9ae2a3c64daa 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -377,7 +377,6 @@ uprobe_ret_handler(struct uprobe_consumer *self, unsigned long func,
 
 struct testmod_uprobe {
 	struct path path;
-	loff_t offset;
 	struct uprobe_consumer consumer;
 };
 
@@ -391,25 +390,24 @@ static int testmod_register_uprobe(loff_t offset)
 {
 	int err = -EBUSY;
 
-	if (uprobe.offset)
+	if (uprobe.consumer.offset)
 		return -EBUSY;
 
 	mutex_lock(&testmod_uprobe_mutex);
 
-	if (uprobe.offset)
+	if (uprobe.consumer.offset)
 		goto out;
 
 	err = kern_path("/proc/self/exe", LOOKUP_FOLLOW, &uprobe.path);
 	if (err)
 		goto out;
 
-	err = uprobe_register_refctr(d_real_inode(uprobe.path.dentry),
-				     offset, 0, &uprobe.consumer);
-	if (err)
+	uprobe.consumer.offset = offset;
+	err = uprobe_register(d_real_inode(uprobe.path.dentry), &uprobe.consumer);
+	if (err) {
 		path_put(&uprobe.path);
-	else
-		uprobe.offset = offset;
-
+		uprobe.consumer.offset = 0;
+	}
 out:
 	mutex_unlock(&testmod_uprobe_mutex);
 	return err;
@@ -419,10 +417,9 @@ static void testmod_unregister_uprobe(void)
 {
 	mutex_lock(&testmod_uprobe_mutex);
 
-	if (uprobe.offset) {
-		uprobe_unregister(d_real_inode(uprobe.path.dentry),
-				  uprobe.offset, &uprobe.consumer);
-		uprobe.offset = 0;
+	if (uprobe.consumer.offset) {
+		uprobe_unregister(d_real_inode(uprobe.path.dentry), &uprobe.consumer);
+		uprobe.consumer.offset = 0;
 	}
 
 	mutex_unlock(&testmod_uprobe_mutex);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
  2024-06-25  0:21 [PATCH 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
                   ` (4 preceding siblings ...)
  2024-06-25  0:21 ` [PATCH 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer Andrii Nakryiko
@ 2024-06-25  0:21 ` Andrii Nakryiko
  2024-06-26 11:27   ` Jiri Olsa
  2024-06-27 13:04   ` Masami Hiramatsu
  2024-06-25  0:21 ` [PATCH 07/12] uprobes: inline alloc_uprobe() logic into __uprobe_register() Andrii Nakryiko
                   ` (5 subsequent siblings)
  11 siblings, 2 replies; 42+ messages in thread
From: Andrii Nakryiko @ 2024-06-25  0:21 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

Introduce batch versions of uprobe registration (attachment) and
unregistration (detachment) APIs.

Unregistration is presumed to never fail, so that's easy.

Batch registration can fail, and so the semantics of
uprobe_register_batch() is such that either all uprobe_consumers are
successfully attached or none of them remain attached after the return.

There is no guarantee of atomicity of attachment, though, and so while
batch attachment is proceeding, some uprobes might start firing before
others are completely attached. Even if overall attachment eventually
fails, some successfully attached uprobes might fire and callers have to
be prepared to handle that. This is in no way a regression compared to
current approach of attaching uprobes one-by-one, though.

One crucial implementation detail is the addition of `struct uprobe
*uprobe` field to `struct uprobe_consumer` which is meant for internal
uprobe subsystem usage only. We use this field both as temporary storage
(to avoid unnecessary allocations) and as a back link to associated
uprobe to simplify and speed up uprobe unregistration, as we now can
avoid yet another tree lookup when unregistering uprobe_consumer.

The general direction with uprobe registration implementation is to do
batch attachment in distinct steps, each step performing some set of
checks or actions on all uprobe_consumers before proceeding to the next
phase. This, after some more changes in next patches, allows to batch
locking for each phase and in such a way amortize any long delays that
might be added by writer locks (especially once we switch
uprobes_treelock to per-CPU R/W semaphore later).

Currently, uprobe_register_batch() performs all the sanity checks first.
Then proceeds to allocate-and-insert (we'll split this up further later
on) uprobe instances, as necessary. And then the last step is actual
uprobe registration for all affected VMAs.

We take care to undo all the actions in the event of an error at any
point in this lengthy process, so end result is all-or-nothing, as
described above.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/uprobes.h |  17 ++++
 kernel/events/uprobes.c | 181 +++++++++++++++++++++++++++++-----------
 2 files changed, 147 insertions(+), 51 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index a75ba37ce3c8..a6e6eb70539d 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -33,6 +33,8 @@ enum uprobe_filter_ctx {
 	UPROBE_FILTER_MMAP,
 };
 
+typedef struct uprobe_consumer *(*uprobe_consumer_fn)(size_t idx, void *ctx);
+
 struct uprobe_consumer {
 	int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
 	int (*ret_handler)(struct uprobe_consumer *self,
@@ -48,6 +50,8 @@ struct uprobe_consumer {
 	loff_t ref_ctr_offset;
 	/* for internal uprobe infra use, consumers shouldn't touch fields below */
 	struct uprobe_consumer *next;
+	/* associated uprobe instance (or NULL if consumer isn't attached) */
+	struct uprobe *uprobe;
 };
 
 #ifdef CONFIG_UPROBES
@@ -116,8 +120,12 @@ extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
 extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
 extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
 extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc);
+extern int uprobe_register_batch(struct inode *inode, int cnt,
+				 uprobe_consumer_fn get_uprobe_consumer, void *ctx);
 extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool);
 extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc);
+extern void uprobe_unregister_batch(struct inode *inode, int cnt,
+				    uprobe_consumer_fn get_uprobe_consumer, void *ctx);
 extern int uprobe_mmap(struct vm_area_struct *vma);
 extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end);
 extern void uprobe_start_dup_mmap(void);
@@ -160,6 +168,11 @@ uprobe_register(struct inode *inode, struct uprobe_consumer *uc)
 {
 	return -ENOSYS;
 }
+static inline int uprobe_register_batch(struct inode *inode, int cnt,
+					uprobe_consumer_fn get_uprobe_consumer, void *ctx)
+{
+	return -ENOSYS;
+}
 static inline int
 uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool add)
 {
@@ -169,6 +182,10 @@ static inline void
 uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc)
 {
 }
+static inline void uprobe_unregister_batch(struct inode *inode, int cnt,
+					     uprobe_consumer_fn get_uprobe_consumer, void *ctx)
+{
+}
 static inline int uprobe_mmap(struct vm_area_struct *vma)
 {
 	return 0;
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 2544e8b79bad..846efda614cb 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1194,6 +1194,41 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
 	(void)register_for_each_vma(uprobe, NULL);
 }
 
+/*
+ * uprobe_unregister_batch - unregister a batch of already registered uprobe
+ * consumers.
+ * @inode: the file in which the probes have to be removed.
+ * @cnt: number of consumers to unregister
+ * @get_uprobe_consumer: a callback that returns Nth uprobe_consumer to attach
+ * @ctx: an arbitrary context passed through into get_uprobe_consumer callback
+ */
+void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn get_uprobe_consumer, void *ctx)
+{
+	struct uprobe *uprobe;
+	struct uprobe_consumer *uc;
+	int i;
+
+	for (i = 0; i < cnt; i++) {
+		uc = get_uprobe_consumer(i, ctx);
+		uprobe = uc->uprobe;
+
+		if (WARN_ON(!uprobe))
+			continue;
+
+		down_write(&uprobe->register_rwsem);
+		__uprobe_unregister(uprobe, uc);
+		up_write(&uprobe->register_rwsem);
+		put_uprobe(uprobe);
+
+		uc->uprobe = NULL;
+	}
+}
+
+static struct uprobe_consumer *uprobe_consumer_identity(size_t idx, void *ctx)
+{
+	return (struct uprobe_consumer *)ctx;
+}
+
 /*
  * uprobe_unregister - unregister an already registered probe.
  * @inode: the file in which the probe has to be removed.
@@ -1201,84 +1236,128 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
  */
 void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc)
 {
-	struct uprobe *uprobe;
-
-	uprobe = find_uprobe(inode, uc->offset);
-	if (WARN_ON(!uprobe))
-		return;
-
-	down_write(&uprobe->register_rwsem);
-	__uprobe_unregister(uprobe, uc);
-	up_write(&uprobe->register_rwsem);
-	put_uprobe(uprobe);
+	uprobe_unregister_batch(inode, 1, uprobe_consumer_identity, uc);
 }
 EXPORT_SYMBOL_GPL(uprobe_unregister);
 
 /*
- * __uprobe_register - register a probe
- * @inode: the file in which the probe has to be placed.
- * @offset: offset from the start of the file.
- * @uc: information on howto handle the probe..
+ * uprobe_register_batch - register a batch of probes for a given inode
+ * @inode: the file in which the probes have to be placed.
+ * @cnt: number of probes to register
+ * @get_uprobe_consumer: a callback that returns Nth uprobe_consumer
+ * @ctx: an arbitrary context passed through into get_uprobe_consumer callback
  *
- * Apart from the access refcount, __uprobe_register() takes a creation
- * refcount (thro alloc_uprobe) if and only if this @uprobe is getting
- * inserted into the rbtree (i.e first consumer for a @inode:@offset
- * tuple).  Creation refcount stops uprobe_unregister from freeing the
- * @uprobe even before the register operation is complete. Creation
- * refcount is released when the last @uc for the @uprobe
- * unregisters. Caller of __uprobe_register() is required to keep @inode
- * (and the containing mount) referenced.
+ * uprobe_consumer instance itself contains offset and (optional)
+ * ref_ctr_offset within inode to attach to.
+ *
+ * On success, each attached uprobe_consumer assumes one refcount taken for
+ * respective uprobe instance (uniquely identified by inode+offset
+ * combination). Each uprobe_consumer is expected to eventually be detached
+ * through uprobe_unregister() or uprobe_unregister_batch() call, dropping
+ * their owning refcount.
+ *
+ * Caller of uprobe_register()/uprobe_register_batch() is required to keep
+ * @inode (and the containing mount) referenced.
+ *
+ * If not all probes are successfully installed, then all the successfully
+ * installed ones are rolled back. Note, there is no atomicity guarantees
+ * w.r.t. batch attachment. Some probes might start firing before batch
+ * attachment is completed. Even more so, some consumers might fire even if
+ * overall batch attachment ultimately fails.
  *
  * Return errno if it cannot successully install probes
  * else return 0 (success)
  */
-static int __uprobe_register(struct inode *inode, loff_t offset,
-			     loff_t ref_ctr_offset, struct uprobe_consumer *uc)
+int uprobe_register_batch(struct inode *inode, int cnt,
+			  uprobe_consumer_fn get_uprobe_consumer, void *ctx)
 {
 	struct uprobe *uprobe;
-	int ret;
-
-	/* Uprobe must have at least one set consumer */
-	if (!uc->handler && !uc->ret_handler)
-		return -EINVAL;
+	struct uprobe_consumer *uc;
+	int ret, i;
 
 	/* copy_insn() uses read_mapping_page() or shmem_read_mapping_page() */
 	if (!inode->i_mapping->a_ops->read_folio &&
 	    !shmem_mapping(inode->i_mapping))
 		return -EIO;
-	/* Racy, just to catch the obvious mistakes */
-	if (offset > i_size_read(inode))
-		return -EINVAL;
 
-	/*
-	 * This ensures that copy_from_page(), copy_to_page() and
-	 * __update_ref_ctr() can't cross page boundary.
-	 */
-	if (!IS_ALIGNED(offset, UPROBE_SWBP_INSN_SIZE))
-		return -EINVAL;
-	if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
+	if (cnt <= 0 || !get_uprobe_consumer)
 		return -EINVAL;
 
-	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
-	if (IS_ERR(uprobe))
-		return PTR_ERR(uprobe);
+	for (i = 0; i < cnt; i++) {
+		uc = get_uprobe_consumer(i, ctx);
+
+		/* Each consumer must have at least one set consumer */
+		if (!uc || (!uc->handler && !uc->ret_handler))
+			return -EINVAL;
+		/* Racy, just to catch the obvious mistakes */
+		if (uc->offset > i_size_read(inode))
+			return -EINVAL;
+		if (uc->uprobe)
+			return -EINVAL;
+		/*
+		 * This ensures that copy_from_page(), copy_to_page() and
+		 * __update_ref_ctr() can't cross page boundary.
+		 */
+		if (!IS_ALIGNED(uc->offset, UPROBE_SWBP_INSN_SIZE))
+			return -EINVAL;
+		if (!IS_ALIGNED(uc->ref_ctr_offset, sizeof(short)))
+			return -EINVAL;
+	}
 
-	down_write(&uprobe->register_rwsem);
-	consumer_add(uprobe, uc);
-	ret = register_for_each_vma(uprobe, uc);
-	if (ret)
-		__uprobe_unregister(uprobe, uc);
-	up_write(&uprobe->register_rwsem);
+	for (i = 0; i < cnt; i++) {
+		uc = get_uprobe_consumer(i, ctx);
 
-	if (ret)
-		put_uprobe(uprobe);
+		uprobe = alloc_uprobe(inode, uc->offset, uc->ref_ctr_offset);
+		if (IS_ERR(uprobe)) {
+			ret = PTR_ERR(uprobe);
+			goto cleanup_uprobes;
+		}
+
+		uc->uprobe = uprobe;
+	}
+
+	for (i = 0; i < cnt; i++) {
+		uc = get_uprobe_consumer(i, ctx);
+		uprobe = uc->uprobe;
+
+		down_write(&uprobe->register_rwsem);
+		consumer_add(uprobe, uc);
+		ret = register_for_each_vma(uprobe, uc);
+		if (ret)
+			__uprobe_unregister(uprobe, uc);
+		up_write(&uprobe->register_rwsem);
+
+		if (ret) {
+			put_uprobe(uprobe);
+			goto cleanup_unreg;
+		}
+	}
+
+	return 0;
 
+cleanup_unreg:
+	/* unregister all uprobes we managed to register until failure */
+	for (i--; i >= 0; i--) {
+		uc = get_uprobe_consumer(i, ctx);
+
+		down_write(&uprobe->register_rwsem);
+		__uprobe_unregister(uc->uprobe, uc);
+		up_write(&uprobe->register_rwsem);
+	}
+cleanup_uprobes:
+	/* put all the successfully allocated/reused uprobes */
+	for (i = cnt - 1; i >= 0; i--) {
+		uc = get_uprobe_consumer(i, ctx);
+
+		put_uprobe(uc->uprobe);
+		uc->uprobe = NULL;
+	}
 	return ret;
 }
 
 int uprobe_register(struct inode *inode, struct uprobe_consumer *uc)
 {
-	return __uprobe_register(inode, uc->offset, uc->ref_ctr_offset, uc);
+	return uprobe_register_batch(inode, 1, uprobe_consumer_identity, uc);
 }
 EXPORT_SYMBOL_GPL(uprobe_register);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 07/12] uprobes: inline alloc_uprobe() logic into __uprobe_register()
  2024-06-25  0:21 [PATCH 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
                   ` (5 preceding siblings ...)
  2024-06-25  0:21 ` [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs Andrii Nakryiko
@ 2024-06-25  0:21 ` Andrii Nakryiko
  2024-06-25  0:21 ` [PATCH 08/12] uprobes: split uprobe allocation and uprobes_tree insertion steps Andrii Nakryiko
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 42+ messages in thread
From: Andrii Nakryiko @ 2024-06-25  0:21 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

To allow unbundling alloc-uprobe-and-insert step which is currently
tightly coupled, inline alloc_uprobe() logic into
uprobe_register_batch() loop. It's called from one place, so we don't
really lose much in terms of maintainability.

No functional changes.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 65 ++++++++++++++++++-----------------------
 1 file changed, 28 insertions(+), 37 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 846efda614cb..ebd8511b6eb2 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -842,40 +842,6 @@ ref_ctr_mismatch_warn(struct uprobe *cur_uprobe, struct uprobe *uprobe)
 		(unsigned long long) uprobe->ref_ctr_offset);
 }
 
-static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
-				   loff_t ref_ctr_offset)
-{
-	struct uprobe *uprobe, *cur_uprobe;
-
-	uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
-	if (!uprobe)
-		return ERR_PTR(-ENOMEM);
-
-	uprobe->inode = inode;
-	uprobe->offset = offset;
-	uprobe->ref_ctr_offset = ref_ctr_offset;
-	init_rwsem(&uprobe->register_rwsem);
-	init_rwsem(&uprobe->consumer_rwsem);
-	RB_CLEAR_NODE(&uprobe->rb_node);
-	atomic64_set(&uprobe->ref, 1);
-
-	/* add to uprobes_tree, sorted on inode:offset */
-	cur_uprobe = insert_uprobe(uprobe);
-	/* a uprobe exists for this inode:offset combination */
-	if (cur_uprobe != uprobe) {
-		if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
-			ref_ctr_mismatch_warn(cur_uprobe, uprobe);
-			put_uprobe(cur_uprobe);
-			kfree(uprobe);
-			return ERR_PTR(-EINVAL);
-		}
-		kfree(uprobe);
-		uprobe = cur_uprobe;
-	}
-
-	return uprobe;
-}
-
 static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc)
 {
 	down_write(&uprobe->consumer_rwsem);
@@ -1305,14 +1271,39 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 	}
 
 	for (i = 0; i < cnt; i++) {
+		struct uprobe *cur_uprobe;
+
 		uc = get_uprobe_consumer(i, ctx);
 
-		uprobe = alloc_uprobe(inode, uc->offset, uc->ref_ctr_offset);
-		if (IS_ERR(uprobe)) {
-			ret = PTR_ERR(uprobe);
+		uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
+		if (!uprobe) {
+			ret = -ENOMEM;
 			goto cleanup_uprobes;
 		}
 
+		uprobe->inode = inode;
+		uprobe->offset = uc->offset;
+		uprobe->ref_ctr_offset = uc->ref_ctr_offset;
+		init_rwsem(&uprobe->register_rwsem);
+		init_rwsem(&uprobe->consumer_rwsem);
+		RB_CLEAR_NODE(&uprobe->rb_node);
+		atomic64_set(&uprobe->ref, 1);
+
+		/* add to uprobes_tree, sorted on inode:offset */
+		cur_uprobe = insert_uprobe(uprobe);
+		/* a uprobe exists for this inode:offset combination */
+		if (cur_uprobe != uprobe) {
+			if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
+				ref_ctr_mismatch_warn(cur_uprobe, uprobe);
+				put_uprobe(cur_uprobe);
+				kfree(uprobe);
+				ret = -EINVAL;
+				goto cleanup_uprobes;
+			}
+			kfree(uprobe);
+			uprobe = cur_uprobe;
+		}
+
 		uc->uprobe = uprobe;
 	}
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 08/12] uprobes: split uprobe allocation and uprobes_tree insertion steps
  2024-06-25  0:21 [PATCH 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
                   ` (6 preceding siblings ...)
  2024-06-25  0:21 ` [PATCH 07/12] uprobes: inline alloc_uprobe() logic into __uprobe_register() Andrii Nakryiko
@ 2024-06-25  0:21 ` Andrii Nakryiko
  2024-06-25  0:21 ` [PATCH 09/12] uprobes: batch uprobes_treelock during registration Andrii Nakryiko
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 42+ messages in thread
From: Andrii Nakryiko @ 2024-06-25  0:21 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

Now we are ready to split alloc-and-insert coupled step into two
separate phases.

First, we allocate and prepare all potentially-to-be-inserted uprobe
instances, assuming corresponding uprobes are not yet in uprobes_tree.
This is needed so that we don't do memory allocations under
uprobes_treelock (once we batch locking for each step).

Second, we insert new uprobes or reuse already existing ones into
uprobes_tree. Any uprobe that turned out to be not necessary is
immediately freed, as there are no other references to it.

This concludes preparations that make uprobes_register_batch() ready to
batch and optimize locking per each phase.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index ebd8511b6eb2..5e98e179d47d 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1270,9 +1270,8 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 			return -EINVAL;
 	}
 
+	/* pre-allocate new uprobe instances */
 	for (i = 0; i < cnt; i++) {
-		struct uprobe *cur_uprobe;
-
 		uc = get_uprobe_consumer(i, ctx);
 
 		uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
@@ -1289,6 +1288,15 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 		RB_CLEAR_NODE(&uprobe->rb_node);
 		atomic64_set(&uprobe->ref, 1);
 
+		uc->uprobe = uprobe;
+	}
+
+	for (i = 0; i < cnt; i++) {
+		struct uprobe *cur_uprobe;
+
+		uc = get_uprobe_consumer(i, ctx);
+		uprobe = uc->uprobe;
+
 		/* add to uprobes_tree, sorted on inode:offset */
 		cur_uprobe = insert_uprobe(uprobe);
 		/* a uprobe exists for this inode:offset combination */
@@ -1296,15 +1304,12 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 			if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
 				ref_ctr_mismatch_warn(cur_uprobe, uprobe);
 				put_uprobe(cur_uprobe);
-				kfree(uprobe);
 				ret = -EINVAL;
 				goto cleanup_uprobes;
 			}
 			kfree(uprobe);
-			uprobe = cur_uprobe;
+			uc->uprobe = cur_uprobe;
 		}
-
-		uc->uprobe = uprobe;
 	}
 
 	for (i = 0; i < cnt; i++) {
@@ -1318,10 +1323,8 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 			__uprobe_unregister(uprobe, uc);
 		up_write(&uprobe->register_rwsem);
 
-		if (ret) {
-			put_uprobe(uprobe);
+		if (ret)
 			goto cleanup_unreg;
-		}
 	}
 
 	return 0;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 09/12] uprobes: batch uprobes_treelock during registration
  2024-06-25  0:21 [PATCH 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
                   ` (7 preceding siblings ...)
  2024-06-25  0:21 ` [PATCH 08/12] uprobes: split uprobe allocation and uprobes_tree insertion steps Andrii Nakryiko
@ 2024-06-25  0:21 ` Andrii Nakryiko
  2024-06-25  0:21 ` [PATCH 10/12] uprobes: improve lock batching for uprobe_unregister_batch Andrii Nakryiko
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 42+ messages in thread
From: Andrii Nakryiko @ 2024-06-25  0:21 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

Now that we have a good separate of each registration step, take
uprobes_treelock just once for relevant registration step, and then
process all relevant uprobes in one go.

Even if writer lock introduces a relatively large delay (as might happen
with per-CPU RW semaphore), this will keep overall batch attachment
reasonably fast.

We teach put_uprobe(), though __put_uprobe() helper, to optionally take
or not uprobes_treelock, to accommodate this pattern.

With these changes we don't need insert_uprobe() operation that
unconditionally takes uprobes_treelock, so get rid of it, leaving only
lower-level __insert_uprobe() helper.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 45 +++++++++++++++++++++--------------------
 1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 5e98e179d47d..416f408cbed9 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -657,7 +657,7 @@ static inline bool uprobe_is_active(struct uprobe *uprobe)
 	return !RB_EMPTY_NODE(&uprobe->rb_node);
 }
 
-static void put_uprobe(struct uprobe *uprobe)
+static void __put_uprobe(struct uprobe *uprobe, bool tree_locked)
 {
 	s64 v;
 
@@ -666,7 +666,8 @@ static void put_uprobe(struct uprobe *uprobe)
 	if (unlikely((u32)v == 0)) {
 		bool destroy;
 
-		write_lock(&uprobes_treelock);
+		if (!tree_locked)
+			write_lock(&uprobes_treelock);
 		/*
 		 * We might race with find_uprobe()->__get_uprobe() executed
 		 * from inside read-locked uprobes_treelock, which can bump
@@ -689,7 +690,8 @@ static void put_uprobe(struct uprobe *uprobe)
 		destroy = atomic64_read(&uprobe->ref) == v;
 		if (destroy && uprobe_is_active(uprobe))
 			rb_erase(&uprobe->rb_node, &uprobes_tree);
-		write_unlock(&uprobes_treelock);
+		if (!tree_locked)
+			write_unlock(&uprobes_treelock);
 
 		/* uprobe got resurrected, pretend we never tried to free it */
 		if (!destroy)
@@ -718,6 +720,11 @@ static void put_uprobe(struct uprobe *uprobe)
 		(void)atomic64_cmpxchg(&uprobe->ref, v, v & ~(1ULL << 63));
 }
 
+static void put_uprobe(struct uprobe *uprobe)
+{
+	__put_uprobe(uprobe, false);
+}
+
 static __always_inline
 int uprobe_cmp(const struct inode *l_inode, const loff_t l_offset,
 	       const struct uprobe *r)
@@ -817,21 +824,6 @@ static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
 	return u;
 }
 
-/*
- * Acquire uprobes_treelock and insert uprobe into uprobes_tree
- * (or reuse existing one, see __insert_uprobe() comments above).
- */
-static struct uprobe *insert_uprobe(struct uprobe *uprobe)
-{
-	struct uprobe *u;
-
-	write_lock(&uprobes_treelock);
-	u = __insert_uprobe(uprobe);
-	write_unlock(&uprobes_treelock);
-
-	return u;
-}
-
 static void
 ref_ctr_mismatch_warn(struct uprobe *cur_uprobe, struct uprobe *uprobe)
 {
@@ -1291,6 +1283,8 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 		uc->uprobe = uprobe;
 	}
 
+	ret = 0;
+	write_lock(&uprobes_treelock);
 	for (i = 0; i < cnt; i++) {
 		struct uprobe *cur_uprobe;
 
@@ -1298,19 +1292,24 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 		uprobe = uc->uprobe;
 
 		/* add to uprobes_tree, sorted on inode:offset */
-		cur_uprobe = insert_uprobe(uprobe);
+		cur_uprobe = __insert_uprobe(uprobe);
 		/* a uprobe exists for this inode:offset combination */
 		if (cur_uprobe != uprobe) {
 			if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
 				ref_ctr_mismatch_warn(cur_uprobe, uprobe);
-				put_uprobe(cur_uprobe);
+
+				__put_uprobe(cur_uprobe, true);
 				ret = -EINVAL;
-				goto cleanup_uprobes;
+				goto unlock_treelock;
 			}
 			kfree(uprobe);
 			uc->uprobe = cur_uprobe;
 		}
 	}
+unlock_treelock:
+	write_unlock(&uprobes_treelock);
+	if (ret)
+		goto cleanup_uprobes;
 
 	for (i = 0; i < cnt; i++) {
 		uc = get_uprobe_consumer(i, ctx);
@@ -1340,12 +1339,14 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 	}
 cleanup_uprobes:
 	/* put all the successfully allocated/reused uprobes */
+	write_lock(&uprobes_treelock);
 	for (i = cnt - 1; i >= 0; i--) {
 		uc = get_uprobe_consumer(i, ctx);
 
-		put_uprobe(uc->uprobe);
+		__put_uprobe(uc->uprobe, true);
 		uc->uprobe = NULL;
 	}
+	write_unlock(&uprobes_treelock);
 	return ret;
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 10/12] uprobes: improve lock batching for uprobe_unregister_batch
  2024-06-25  0:21 [PATCH 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
                   ` (8 preceding siblings ...)
  2024-06-25  0:21 ` [PATCH 09/12] uprobes: batch uprobes_treelock during registration Andrii Nakryiko
@ 2024-06-25  0:21 ` Andrii Nakryiko
  2024-06-25  0:21 ` [PATCH 11/12] uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes Andrii Nakryiko
  2024-06-25  0:21 ` [PATCH 12/12] uprobes: switch uprobes_treelock to per-CPU RW semaphore Andrii Nakryiko
  11 siblings, 0 replies; 42+ messages in thread
From: Andrii Nakryiko @ 2024-06-25  0:21 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

Similarly to what we did for uprobes_register_batch(), split
uprobe_unregister_batch() into two separate phases with different
locking needs.

First, all the VMA unregistration is performed while holding
a per-uprobe register_rwsem.

Then, we take a batched uprobes_treelock once to __put_uprobe() for all
uprobe_consumers. That uprobe_consumer->uprobe field is really handy in
helping with this.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 416f408cbed9..7e94671a672a 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1162,8 +1162,8 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
  */
 void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn get_uprobe_consumer, void *ctx)
 {
-	struct uprobe *uprobe;
 	struct uprobe_consumer *uc;
+	struct uprobe *uprobe;
 	int i;
 
 	for (i = 0; i < cnt; i++) {
@@ -1176,10 +1176,20 @@ void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn ge
 		down_write(&uprobe->register_rwsem);
 		__uprobe_unregister(uprobe, uc);
 		up_write(&uprobe->register_rwsem);
-		put_uprobe(uprobe);
+	}
 
+	write_lock(&uprobes_treelock);
+	for (i = 0; i < cnt; i++) {
+		uc = get_uprobe_consumer(i, ctx);
+		uprobe = uc->uprobe;
+
+		if (!uprobe)
+			continue;
+
+		__put_uprobe(uprobe, true);
 		uc->uprobe = NULL;
 	}
+	write_unlock(&uprobes_treelock);
 }
 
 static struct uprobe_consumer *uprobe_consumer_identity(size_t idx, void *ctx)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 11/12] uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes
  2024-06-25  0:21 [PATCH 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
                   ` (9 preceding siblings ...)
  2024-06-25  0:21 ` [PATCH 10/12] uprobes: improve lock batching for uprobe_unregister_batch Andrii Nakryiko
@ 2024-06-25  0:21 ` Andrii Nakryiko
  2024-06-25  0:21 ` [PATCH 12/12] uprobes: switch uprobes_treelock to per-CPU RW semaphore Andrii Nakryiko
  11 siblings, 0 replies; 42+ messages in thread
From: Andrii Nakryiko @ 2024-06-25  0:21 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

Switch internals of BPF multi-uprobes to batched version of uprobe
registration and unregistration APIs.

This also simplifies BPF clean up code a bit thanks to all-or-nothing
guarantee of uprobes_register_batch().

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/trace/bpf_trace.c | 23 +++++++++--------------
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ba62baec3152..41bf6736c542 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3173,14 +3173,11 @@ struct bpf_uprobe_multi_run_ctx {
 	struct bpf_uprobe *uprobe;
 };
 
-static void bpf_uprobe_unregister(struct path *path, struct bpf_uprobe *uprobes,
-				  u32 cnt)
+static struct uprobe_consumer *umulti_link_get_uprobe_consumer(size_t idx, void *ctx)
 {
-	u32 i;
+	struct bpf_uprobe_multi_link *link = ctx;
 
-	for (i = 0; i < cnt; i++) {
-		uprobe_unregister(d_real_inode(path->dentry), &uprobes[i].consumer);
-	}
+	return &link->uprobes[idx].consumer;
 }
 
 static void bpf_uprobe_multi_link_release(struct bpf_link *link)
@@ -3188,7 +3185,8 @@ static void bpf_uprobe_multi_link_release(struct bpf_link *link)
 	struct bpf_uprobe_multi_link *umulti_link;
 
 	umulti_link = container_of(link, struct bpf_uprobe_multi_link, link);
-	bpf_uprobe_unregister(&umulti_link->path, umulti_link->uprobes, umulti_link->cnt);
+	uprobe_unregister_batch(d_real_inode(umulti_link->path.dentry), umulti_link->cnt,
+				umulti_link_get_uprobe_consumer, umulti_link);
 	if (umulti_link->task)
 		put_task_struct(umulti_link->task);
 	path_put(&umulti_link->path);
@@ -3474,13 +3472,10 @@ int bpf_uprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
 	bpf_link_init(&link->link, BPF_LINK_TYPE_UPROBE_MULTI,
 		      &bpf_uprobe_multi_link_lops, prog);
 
-	for (i = 0; i < cnt; i++) {
-		err = uprobe_register(d_real_inode(link->path.dentry), &uprobes[i].consumer);
-		if (err) {
-			bpf_uprobe_unregister(&path, uprobes, i);
-			goto error_free;
-		}
-	}
+	err = uprobe_register_batch(d_real_inode(link->path.dentry), cnt,
+				    umulti_link_get_uprobe_consumer, link);
+	if (err)
+		goto error_free;
 
 	err = bpf_link_prime(&link->link, &link_primer);
 	if (err)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 12/12] uprobes: switch uprobes_treelock to per-CPU RW semaphore
  2024-06-25  0:21 [PATCH 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
                   ` (10 preceding siblings ...)
  2024-06-25  0:21 ` [PATCH 11/12] uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes Andrii Nakryiko
@ 2024-06-25  0:21 ` Andrii Nakryiko
  11 siblings, 0 replies; 42+ messages in thread
From: Andrii Nakryiko @ 2024-06-25  0:21 UTC (permalink / raw)
  To: linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: peterz, mingo, bpf, jolsa, paulmck, clm, Andrii Nakryiko

With all the batch uprobe APIs work we are now finally ready to reap the
benefits. Switch uprobes_treelock from reader-writer spinlock to a much
more efficient and scalable per-CPU RW semaphore.

Benchmarks and numbers time. I've used BPF selftests' bench tool,
trig-uprobe-nop benchmark specifically, to see how uprobe total
throughput scales with number of competing threads (mapped to individual
CPUs). Here are results:

  # threads   BEFORE (mln/s)    AFTER (mln/s)
  ---------   --------------    -------------
  1           3.131             3.140
  2           3.394             3.601
  3           3.630             3.960
  4           3.317             3.551
  5           3.448             3.464
  6           3.345             3.283
  7           3.469             3.444
  8           3.182             3.258
  9           3.138             3.139
  10          2.999             3.212
  11          2.903             3.183
  12          2.802             3.027
  13          2.792             3.027
  14          2.695             3.086
  15          2.822             2.965
  16          2.679             2.939
  17          2.622             2.888
  18          2.628             2.914
  19          2.702             2.836
  20          2.561             2.837

One can see that per-CPU RW semaphore-based implementation scales better
with number of CPUs (especially that single CPU throughput is basically
the same).

Note, scalability is still limited by register_rwsem and this will
hopefully be address in follow up patch set(s).

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 7e94671a672a..e24b81b0e149 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
  */
 #define no_uprobe_events()	RB_EMPTY_ROOT(&uprobes_tree)
 
-static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
+DEFINE_STATIC_PERCPU_RWSEM(uprobes_treelock);	/* serialize rbtree access */
 
 #define UPROBES_HASH_SZ	13
 /* serialize uprobe->pending_list */
@@ -667,7 +667,7 @@ static void __put_uprobe(struct uprobe *uprobe, bool tree_locked)
 		bool destroy;
 
 		if (!tree_locked)
-			write_lock(&uprobes_treelock);
+			percpu_down_write(&uprobes_treelock);
 		/*
 		 * We might race with find_uprobe()->__get_uprobe() executed
 		 * from inside read-locked uprobes_treelock, which can bump
@@ -691,7 +691,7 @@ static void __put_uprobe(struct uprobe *uprobe, bool tree_locked)
 		if (destroy && uprobe_is_active(uprobe))
 			rb_erase(&uprobe->rb_node, &uprobes_tree);
 		if (!tree_locked)
-			write_unlock(&uprobes_treelock);
+			percpu_up_write(&uprobes_treelock);
 
 		/* uprobe got resurrected, pretend we never tried to free it */
 		if (!destroy)
@@ -789,9 +789,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
 {
 	struct uprobe *uprobe;
 
-	read_lock(&uprobes_treelock);
+	percpu_down_read(&uprobes_treelock);
 	uprobe = __find_uprobe(inode, offset);
-	read_unlock(&uprobes_treelock);
+	percpu_up_read(&uprobes_treelock);
 
 	return uprobe;
 }
@@ -1178,7 +1178,7 @@ void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn ge
 		up_write(&uprobe->register_rwsem);
 	}
 
-	write_lock(&uprobes_treelock);
+	percpu_down_write(&uprobes_treelock);
 	for (i = 0; i < cnt; i++) {
 		uc = get_uprobe_consumer(i, ctx);
 		uprobe = uc->uprobe;
@@ -1189,7 +1189,7 @@ void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn ge
 		__put_uprobe(uprobe, true);
 		uc->uprobe = NULL;
 	}
-	write_unlock(&uprobes_treelock);
+	percpu_up_write(&uprobes_treelock);
 }
 
 static struct uprobe_consumer *uprobe_consumer_identity(size_t idx, void *ctx)
@@ -1294,7 +1294,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 	}
 
 	ret = 0;
-	write_lock(&uprobes_treelock);
+	percpu_down_write(&uprobes_treelock);
 	for (i = 0; i < cnt; i++) {
 		struct uprobe *cur_uprobe;
 
@@ -1317,7 +1317,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 		}
 	}
 unlock_treelock:
-	write_unlock(&uprobes_treelock);
+	percpu_up_write(&uprobes_treelock);
 	if (ret)
 		goto cleanup_uprobes;
 
@@ -1349,14 +1349,14 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 	}
 cleanup_uprobes:
 	/* put all the successfully allocated/reused uprobes */
-	write_lock(&uprobes_treelock);
+	percpu_down_write(&uprobes_treelock);
 	for (i = cnt - 1; i >= 0; i--) {
 		uc = get_uprobe_consumer(i, ctx);
 
 		__put_uprobe(uc->uprobe, true);
 		uc->uprobe = NULL;
 	}
-	write_unlock(&uprobes_treelock);
+	percpu_up_write(&uprobes_treelock);
 	return ret;
 }
 
@@ -1464,7 +1464,7 @@ static void build_probe_list(struct inode *inode,
 	min = vaddr_to_offset(vma, start);
 	max = min + (end - start) - 1;
 
-	read_lock(&uprobes_treelock);
+	percpu_down_read(&uprobes_treelock);
 	n = find_node_in_range(inode, min, max);
 	if (n) {
 		for (t = n; t; t = rb_prev(t)) {
@@ -1482,7 +1482,7 @@ static void build_probe_list(struct inode *inode,
 			list_add(&u->pending_list, head);
 		}
 	}
-	read_unlock(&uprobes_treelock);
+	percpu_up_read(&uprobes_treelock);
 }
 
 /* @vma contains reference counter, not the probed instruction. */
@@ -1573,9 +1573,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
 	min = vaddr_to_offset(vma, start);
 	max = min + (end - start) - 1;
 
-	read_lock(&uprobes_treelock);
+	percpu_down_read(&uprobes_treelock);
 	n = find_node_in_range(inode, min, max);
-	read_unlock(&uprobes_treelock);
+	percpu_up_read(&uprobes_treelock);
 
 	return !!n;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe()
  2024-06-25  0:21 ` [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe() Andrii Nakryiko
@ 2024-06-25  1:29   ` Masami Hiramatsu
  2024-06-25 14:49     ` Oleg Nesterov
  2024-06-25 10:50   ` Oleg Nesterov
  1 sibling, 1 reply; 42+ messages in thread
From: Masami Hiramatsu @ 2024-06-25  1:29 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, rostedt, oleg, peterz, mingo, bpf, jolsa,
	paulmck, clm

On Mon, 24 Jun 2024 17:21:34 -0700
Andrii Nakryiko <andrii@kernel.org> wrote:

> Given unapply_uprobe() can call remove_breakpoint() which eventually
> calls uprobe_write_opcode(), which can modify a set of memory pages and
> expects mm->mmap_lock held for write, it needs to have writer lock.
> 
> Fix this by switching to mmap_write_lock()/mmap_write_unlock().
> 

Oops, it is an actual bug, right?

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Thanks,

> Fixes: da1816b1caec ("uprobes: Teach handler_chain() to filter out the probed task")
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  kernel/events/uprobes.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 197fbe4663b5..e896eeecb091 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1235,7 +1235,7 @@ static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm)
>  	struct vm_area_struct *vma;
>  	int err = 0;
>  
> -	mmap_read_lock(mm);
> +	mmap_write_lock(mm);
>  	for_each_vma(vmi, vma) {
>  		unsigned long vaddr;
>  		loff_t offset;
> @@ -1252,7 +1252,7 @@ static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm)
>  		vaddr = offset_to_vaddr(vma, uprobe->offset);
>  		err |= remove_breakpoint(uprobe, mm, vaddr);
>  	}
> -	mmap_read_unlock(mm);
> +	mmap_write_unlock(mm);
>  
>  	return err;
>  }
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe()
  2024-06-25  0:21 ` [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe() Andrii Nakryiko
  2024-06-25  1:29   ` Masami Hiramatsu
@ 2024-06-25 10:50   ` Oleg Nesterov
  1 sibling, 0 replies; 42+ messages in thread
From: Oleg Nesterov @ 2024-06-25 10:50 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, rostedt, mhiramat, peterz, mingo, bpf, jolsa,
	paulmck, clm

I don't think I can review, I forgot everything, but I'll try to read this
series when I have time to (try to) understand what it does...

On 06/24, Andrii Nakryiko wrote:
>
> Given unapply_uprobe() can call remove_breakpoint() which eventually
> calls uprobe_write_opcode(), which can modify a set of memory pages and
> expects mm->mmap_lock held for write, it needs to have writer lock.
>
> Fix this by switching to mmap_write_lock()/mmap_write_unlock().
>
> Fixes: da1816b1caec ("uprobes: Teach handler_chain() to filter out the probed task")
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  kernel/events/uprobes.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 197fbe4663b5..e896eeecb091 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1235,7 +1235,7 @@ static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm)
>  	struct vm_area_struct *vma;
>  	int err = 0;
>
> -	mmap_read_lock(mm);
> +	mmap_write_lock(mm);

Can you explain what exactly is wrong?

FOLL_FORCE/etc is fine under mmap_read_lock(), __replace_page() seems too...

I recall that initially uprobes.c always took mmap_sem for reading, then
register_for_each_vma() was changed by 77fc4af1b59d12 but there was other
reasons for this change...

Again, I don't understand this code today, quite possibly I missed something,
I am just trying to understand.

Well, it seems there is another reason for this change... Currently 2
unapply_uprobe()'s can race with each other if they try to update the same
page. But in this case we can rely on -EAGAIN from __replace_page() ?

Oleg.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-06-25  0:21 ` [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
@ 2024-06-25 14:44   ` Oleg Nesterov
  2024-06-25 17:30     ` Andrii Nakryiko
  2024-06-26  6:02   ` kernel test robot
  2024-06-27  2:29   ` Masami Hiramatsu
  2 siblings, 1 reply; 42+ messages in thread
From: Oleg Nesterov @ 2024-06-25 14:44 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, rostedt, mhiramat, peterz, mingo, bpf, jolsa,
	paulmck, clm

Again, I'll try to read (at least this) patch later this week,
just one cosmetic nit for now...

On 06/24, Andrii Nakryiko wrote:
>
> + * UPROBE_REFCNT_GET constant is chosen such that it will *increment both*
> + * epoch and refcnt parts atomically with one atomic_add().
> + * UPROBE_REFCNT_PUT is chosen such that it will *decrement* refcnt part and
> + * *increment* epoch part.
> + */
> +#define UPROBE_REFCNT_GET ((1LL << 32) | 1LL)
> +#define UPROBE_REFCNT_PUT (0xffffffffLL)

How about

	#define UPROBE_REFCNT_GET ((1ULL << 32) + 1ULL)
	#define UPROBE_REFCNT_PUT ((1ULL << 32) - 1ULL)

? this makes them symmetrical and makes it clear why
atomic64_add(UPROBE_REFCNT_PUT) works as described in the comment.

Feel free to ignore if you don't like it.

Oleg.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe()
  2024-06-25  1:29   ` Masami Hiramatsu
@ 2024-06-25 14:49     ` Oleg Nesterov
  2024-06-25 17:37       ` Andrii Nakryiko
  0 siblings, 1 reply; 42+ messages in thread
From: Oleg Nesterov @ 2024-06-25 14:49 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, peterz, mingo, bpf,
	jolsa, paulmck, clm

On 06/25, Masami Hiramatsu wrote:
>
> On Mon, 24 Jun 2024 17:21:34 -0700
> Andrii Nakryiko <andrii@kernel.org> wrote:
>
> > Given unapply_uprobe() can call remove_breakpoint() which eventually
> > calls uprobe_write_opcode(), which can modify a set of memory pages and
> > expects mm->mmap_lock held for write, it needs to have writer lock.
>
> Oops, it is an actual bug, right?

Why?

So far I don't understand this change. Quite possibly I missed something,
but in this case the changelog should explain the problem more clearly.

Oleg.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-06-25 14:44   ` Oleg Nesterov
@ 2024-06-25 17:30     ` Andrii Nakryiko
  0 siblings, 0 replies; 42+ messages in thread
From: Andrii Nakryiko @ 2024-06-25 17:30 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, peterz,
	mingo, bpf, jolsa, paulmck, clm

On Tue, Jun 25, 2024 at 7:45 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> Again, I'll try to read (at least this) patch later this week,
> just one cosmetic nit for now...

Thanks, and yep, please take your time. I understand that this is not
a trivial change to a code base that has been sitting untouched for
many years now. But I'd really appreciate if you can give it a through
review anyways!

>
> On 06/24, Andrii Nakryiko wrote:
> >
> > + * UPROBE_REFCNT_GET constant is chosen such that it will *increment both*
> > + * epoch and refcnt parts atomically with one atomic_add().
> > + * UPROBE_REFCNT_PUT is chosen such that it will *decrement* refcnt part and
> > + * *increment* epoch part.
> > + */
> > +#define UPROBE_REFCNT_GET ((1LL << 32) | 1LL)
> > +#define UPROBE_REFCNT_PUT (0xffffffffLL)
>
> How about
>
>         #define UPROBE_REFCNT_GET ((1ULL << 32) + 1ULL)
>         #define UPROBE_REFCNT_PUT ((1ULL << 32) - 1ULL)

It's cute, I'll change to that. But I'll probably also add a comment
with the final value in hex for someone like me (because I can reason
about 0xffffffff and its effect on refcount, not so much with `(1LL <<
32) - 1`.

>
> ? this makes them symmetrical and makes it clear why
> atomic64_add(UPROBE_REFCNT_PUT) works as described in the comment.
>
> Feel free to ignore if you don't like it.
>
> Oleg.
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe()
  2024-06-25 14:49     ` Oleg Nesterov
@ 2024-06-25 17:37       ` Andrii Nakryiko
  2024-06-25 19:07         ` Oleg Nesterov
  0 siblings, 1 reply; 42+ messages in thread
From: Andrii Nakryiko @ 2024-06-25 17:37 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Masami Hiramatsu, Andrii Nakryiko, linux-trace-kernel, rostedt,
	peterz, mingo, bpf, jolsa, paulmck, clm

On Tue, Jun 25, 2024 at 7:51 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 06/25, Masami Hiramatsu wrote:
> >
> > On Mon, 24 Jun 2024 17:21:34 -0700
> > Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > > Given unapply_uprobe() can call remove_breakpoint() which eventually
> > > calls uprobe_write_opcode(), which can modify a set of memory pages and
> > > expects mm->mmap_lock held for write, it needs to have writer lock.
> >
> > Oops, it is an actual bug, right?
>
> Why?
>
> So far I don't understand this change. Quite possibly I missed something,
> but in this case the changelog should explain the problem more clearly.
>

I just went off of "Called with mm->mmap_lock held for write." comment
in uprobe_write_opcode(), tbh. If we don't actually need writer
mmap_lock, we should probably update at least that comment. There is a
lot going on in uprobe_write_opcode(), and I don't understand all the
requirements there.


> Oleg.
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe()
  2024-06-25 17:37       ` Andrii Nakryiko
@ 2024-06-25 19:07         ` Oleg Nesterov
  2024-06-26 16:38           ` Andrii Nakryiko
  0 siblings, 1 reply; 42+ messages in thread
From: Oleg Nesterov @ 2024-06-25 19:07 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Masami Hiramatsu, Andrii Nakryiko, linux-trace-kernel, rostedt,
	peterz, mingo, bpf, jolsa, paulmck, clm

On 06/25, Andrii Nakryiko wrote:
>
> On Tue, Jun 25, 2024 at 7:51 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Why?
> >
> > So far I don't understand this change. Quite possibly I missed something,
> > but in this case the changelog should explain the problem more clearly.
> >
>
> I just went off of "Called with mm->mmap_lock held for write." comment
> in uprobe_write_opcode(), tbh.

Ah, indeed... and git blame makes me sad ;)

I _think_ that 29dedee0e693a updated this comment without any thinking,
but today I can't recall. In any case, today this nothing to do with
mem_cgroup_charge(). Not sure __replace_page() is correct (in this respect)
when it returns -EAGAIN but this is another story.

> If we don't actually need writer
> mmap_lock, we should probably update at least that comment.

Agreed.

> There is a
> lot going on in uprobe_write_opcode(), and I don't understand all the
> requirements there.

Heh. Neither me today ;)

Oleg.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-06-25  0:21 ` [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
  2024-06-25 14:44   ` Oleg Nesterov
@ 2024-06-26  6:02   ` kernel test robot
  2024-06-26 16:39     ` Andrii Nakryiko
  2024-06-27  2:29   ` Masami Hiramatsu
  2 siblings, 1 reply; 42+ messages in thread
From: kernel test robot @ 2024-06-26  6:02 UTC (permalink / raw)
  To: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, oleg
  Cc: oe-kbuild-all, peterz, mingo, bpf, jolsa, paulmck, clm,
	Andrii Nakryiko

Hi Andrii,

kernel test robot noticed the following build warnings:

[auto build test WARNING on next-20240624]
[also build test WARNING on v6.10-rc5]
[cannot apply to perf-tools-next/perf-tools-next tip/perf/core perf-tools/perf-tools linus/master acme/perf/core v6.10-rc5 v6.10-rc4 v6.10-rc3]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Andrii-Nakryiko/uprobes-update-outdated-comment/20240626-001728
base:   next-20240624
patch link:    https://lore.kernel.org/r/20240625002144.3485799-5-andrii%40kernel.org
patch subject: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management
config: x86_64-defconfig (https://download.01.org/0day-ci/archive/20240626/202406261300.ebbfM0XJ-lkp@intel.com/config)
compiler: gcc-13 (Ubuntu 13.2.0-4ubuntu3) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240626/202406261300.ebbfM0XJ-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202406261300.ebbfM0XJ-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> kernel/events/uprobes.c:638: warning: Function parameter or struct member 'uprobe' not described in '__get_uprobe'
>> kernel/events/uprobes.c:638: warning: expecting prototype for Caller has to make sure that(). Prototype was for __get_uprobe() instead


vim +638 kernel/events/uprobes.c

b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  625  
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  626  /**
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  627   * Caller has to make sure that:
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  628   *   a) either uprobe's refcnt is positive before this call;
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  629   *   b) or uprobes_treelock is held (doesn't matter if for read or write),
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  630   *      preventing uprobe's destructor from removing it from uprobes_tree.
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  631   *
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  632   * In the latter case, uprobe's destructor will "resurrect" uprobe instance if
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  633   * it detects that its refcount went back to being positive again inbetween it
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  634   * dropping to zero at some point and (potentially delayed) destructor
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  635   * callback actually running.
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  636   */
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  637  static struct uprobe *__get_uprobe(struct uprobe *uprobe)
f231722a2b27ee Oleg Nesterov   2015-07-21 @638  {
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  639  	s64 v;
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  640  
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  641  	v = atomic64_add_return(UPROBE_REFCNT_GET, &uprobe->ref);
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  642  
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  643  	/*
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  644  	 * If the highest bit is set, we need to clear it. If cmpxchg() fails,
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  645  	 * we don't retry because there is another CPU that just managed to
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  646  	 * update refcnt and will attempt the same "fix up". Eventually one of
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  647  	 * them will succeed to clear highset bit.
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  648  	 */
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  649  	if (unlikely(v < 0))
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  650  		(void)atomic64_cmpxchg(&uprobe->ref, v, v & ~(1ULL << 63));
b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  651  
f231722a2b27ee Oleg Nesterov   2015-07-21  652  	return uprobe;
f231722a2b27ee Oleg Nesterov   2015-07-21  653  }
f231722a2b27ee Oleg Nesterov   2015-07-21  654  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
  2024-06-25  0:21 ` [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs Andrii Nakryiko
@ 2024-06-26 11:27   ` Jiri Olsa
  2024-06-26 16:44     ` Andrii Nakryiko
  2024-06-27 13:04   ` Masami Hiramatsu
  1 sibling, 1 reply; 42+ messages in thread
From: Jiri Olsa @ 2024-06-26 11:27 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, rostedt, mhiramat, oleg, peterz, mingo, bpf,
	paulmck, clm

On Mon, Jun 24, 2024 at 05:21:38PM -0700, Andrii Nakryiko wrote:

SNIP

> +	for (i = 0; i < cnt; i++) {
> +		uc = get_uprobe_consumer(i, ctx);
> +
> +		/* Each consumer must have at least one set consumer */
> +		if (!uc || (!uc->handler && !uc->ret_handler))
> +			return -EINVAL;
> +		/* Racy, just to catch the obvious mistakes */
> +		if (uc->offset > i_size_read(inode))
> +			return -EINVAL;
> +		if (uc->uprobe)
> +			return -EINVAL;
> +		/*
> +		 * This ensures that copy_from_page(), copy_to_page() and
> +		 * __update_ref_ctr() can't cross page boundary.
> +		 */
> +		if (!IS_ALIGNED(uc->offset, UPROBE_SWBP_INSN_SIZE))
> +			return -EINVAL;
> +		if (!IS_ALIGNED(uc->ref_ctr_offset, sizeof(short)))
> +			return -EINVAL;
> +	}
>  
> -	down_write(&uprobe->register_rwsem);
> -	consumer_add(uprobe, uc);
> -	ret = register_for_each_vma(uprobe, uc);
> -	if (ret)
> -		__uprobe_unregister(uprobe, uc);
> -	up_write(&uprobe->register_rwsem);
> +	for (i = 0; i < cnt; i++) {
> +		uc = get_uprobe_consumer(i, ctx);
>  
> -	if (ret)
> -		put_uprobe(uprobe);
> +		uprobe = alloc_uprobe(inode, uc->offset, uc->ref_ctr_offset);
> +		if (IS_ERR(uprobe)) {
> +			ret = PTR_ERR(uprobe);
> +			goto cleanup_uprobes;
> +		}
> +
> +		uc->uprobe = uprobe;
> +	}
> +
> +	for (i = 0; i < cnt; i++) {
> +		uc = get_uprobe_consumer(i, ctx);
> +		uprobe = uc->uprobe;
> +
> +		down_write(&uprobe->register_rwsem);
> +		consumer_add(uprobe, uc);
> +		ret = register_for_each_vma(uprobe, uc);
> +		if (ret)
> +			__uprobe_unregister(uprobe, uc);
> +		up_write(&uprobe->register_rwsem);
> +
> +		if (ret) {
> +			put_uprobe(uprobe);
> +			goto cleanup_unreg;
> +		}
> +	}
> +
> +	return 0;
>  
> +cleanup_unreg:
> +	/* unregister all uprobes we managed to register until failure */
> +	for (i--; i >= 0; i--) {
> +		uc = get_uprobe_consumer(i, ctx);
> +
> +		down_write(&uprobe->register_rwsem);
> +		__uprobe_unregister(uc->uprobe, uc);
> +		up_write(&uprobe->register_rwsem);
> +	}
> +cleanup_uprobes:

when we jump here from 'goto cleanup_uprobes' not all of the
consumers might have uc->uprobe set up

perhaps we can set cnt = i - 1 before the goto, or just check uc->uprobe below


> +	/* put all the successfully allocated/reused uprobes */
> +	for (i = cnt - 1; i >= 0; i--) {

curious, any reason why we go from the top here?

thanks,
jirka

> +		uc = get_uprobe_consumer(i, ctx);
> +
> +		put_uprobe(uc->uprobe);
> +		uc->uprobe = NULL;
> +	}
>  	return ret;
>  }
>  
>  int uprobe_register(struct inode *inode, struct uprobe_consumer *uc)
>  {
> -	return __uprobe_register(inode, uc->offset, uc->ref_ctr_offset, uc);
> +	return uprobe_register_batch(inode, 1, uprobe_consumer_identity, uc);
>  }
>  EXPORT_SYMBOL_GPL(uprobe_register);
>  
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe()
  2024-06-25 19:07         ` Oleg Nesterov
@ 2024-06-26 16:38           ` Andrii Nakryiko
  0 siblings, 0 replies; 42+ messages in thread
From: Andrii Nakryiko @ 2024-06-26 16:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Masami Hiramatsu, Andrii Nakryiko, linux-trace-kernel, rostedt,
	peterz, mingo, bpf, jolsa, paulmck, clm

On Tue, Jun 25, 2024 at 12:09 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 06/25, Andrii Nakryiko wrote:
> >
> > On Tue, Jun 25, 2024 at 7:51 AM Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > Why?
> > >
> > > So far I don't understand this change. Quite possibly I missed something,
> > > but in this case the changelog should explain the problem more clearly.
> > >
> >
> > I just went off of "Called with mm->mmap_lock held for write." comment
> > in uprobe_write_opcode(), tbh.
>
> Ah, indeed... and git blame makes me sad ;)
>
> I _think_ that 29dedee0e693a updated this comment without any thinking,
> but today I can't recall. In any case, today this nothing to do with
> mem_cgroup_charge(). Not sure __replace_page() is correct (in this respect)
> when it returns -EAGAIN but this is another story.
>
> > If we don't actually need writer
> > mmap_lock, we should probably update at least that comment.
>
> Agreed.

Ok, I'll drop this change and will just update the comment.

>
> > There is a
> > lot going on in uprobe_write_opcode(), and I don't understand all the
> > requirements there.
>
> Heh. Neither me today ;)
>
> Oleg.
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-06-26  6:02   ` kernel test robot
@ 2024-06-26 16:39     ` Andrii Nakryiko
  0 siblings, 0 replies; 42+ messages in thread
From: Andrii Nakryiko @ 2024-06-26 16:39 UTC (permalink / raw)
  To: kernel test robot
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, oleg,
	oe-kbuild-all, peterz, mingo, bpf, jolsa, paulmck, clm

On Tue, Jun 25, 2024 at 11:03 PM kernel test robot <lkp@intel.com> wrote:
>
> Hi Andrii,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on next-20240624]
> [also build test WARNING on v6.10-rc5]
> [cannot apply to perf-tools-next/perf-tools-next tip/perf/core perf-tools/perf-tools linus/master acme/perf/core v6.10-rc5 v6.10-rc4 v6.10-rc3]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url:    https://github.com/intel-lab-lkp/linux/commits/Andrii-Nakryiko/uprobes-update-outdated-comment/20240626-001728
> base:   next-20240624
> patch link:    https://lore.kernel.org/r/20240625002144.3485799-5-andrii%40kernel.org
> patch subject: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management
> config: x86_64-defconfig (https://download.01.org/0day-ci/archive/20240626/202406261300.ebbfM0XJ-lkp@intel.com/config)
> compiler: gcc-13 (Ubuntu 13.2.0-4ubuntu3) 13.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240626/202406261300.ebbfM0XJ-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202406261300.ebbfM0XJ-lkp@intel.com/
>
> All warnings (new ones prefixed by >>):
>
> >> kernel/events/uprobes.c:638: warning: Function parameter or struct member 'uprobe' not described in '__get_uprobe'
> >> kernel/events/uprobes.c:638: warning: expecting prototype for Caller has to make sure that(). Prototype was for __get_uprobe() instead
>
>
> vim +638 kernel/events/uprobes.c
>
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  625
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  626  /**

I shouldn't have used /** here, I'll fix this.

> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  627   * Caller has to make sure that:
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  628   *   a) either uprobe's refcnt is positive before this call;
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  629   *   b) or uprobes_treelock is held (doesn't matter if for read or write),
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  630   *      preventing uprobe's destructor from removing it from uprobes_tree.
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  631   *
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  632   * In the latter case, uprobe's destructor will "resurrect" uprobe instance if
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  633   * it detects that its refcount went back to being positive again inbetween it
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  634   * dropping to zero at some point and (potentially delayed) destructor
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  635   * callback actually running.
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  636   */
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  637  static struct uprobe *__get_uprobe(struct uprobe *uprobe)
> f231722a2b27ee Oleg Nesterov   2015-07-21 @638  {
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  639          s64 v;
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  640
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  641          v = atomic64_add_return(UPROBE_REFCNT_GET, &uprobe->ref);
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  642
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  643          /*
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  644           * If the highest bit is set, we need to clear it. If cmpxchg() fails,
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  645           * we don't retry because there is another CPU that just managed to
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  646           * update refcnt and will attempt the same "fix up". Eventually one of
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  647           * them will succeed to clear highset bit.
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  648           */
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  649          if (unlikely(v < 0))
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  650                  (void)atomic64_cmpxchg(&uprobe->ref, v, v & ~(1ULL << 63));
> b9adadbcb8dfc8 Andrii Nakryiko 2024-06-24  651
> f231722a2b27ee Oleg Nesterov   2015-07-21  652          return uprobe;
> f231722a2b27ee Oleg Nesterov   2015-07-21  653  }
> f231722a2b27ee Oleg Nesterov   2015-07-21  654
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
  2024-06-26 11:27   ` Jiri Olsa
@ 2024-06-26 16:44     ` Andrii Nakryiko
  0 siblings, 0 replies; 42+ messages in thread
From: Andrii Nakryiko @ 2024-06-26 16:44 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, oleg,
	peterz, mingo, bpf, paulmck, clm

On Wed, Jun 26, 2024 at 4:27 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Mon, Jun 24, 2024 at 05:21:38PM -0700, Andrii Nakryiko wrote:
>
> SNIP
>
> > +     for (i = 0; i < cnt; i++) {
> > +             uc = get_uprobe_consumer(i, ctx);
> > +
> > +             /* Each consumer must have at least one set consumer */
> > +             if (!uc || (!uc->handler && !uc->ret_handler))
> > +                     return -EINVAL;
> > +             /* Racy, just to catch the obvious mistakes */
> > +             if (uc->offset > i_size_read(inode))
> > +                     return -EINVAL;
> > +             if (uc->uprobe)
> > +                     return -EINVAL;
> > +             /*
> > +              * This ensures that copy_from_page(), copy_to_page() and
> > +              * __update_ref_ctr() can't cross page boundary.
> > +              */
> > +             if (!IS_ALIGNED(uc->offset, UPROBE_SWBP_INSN_SIZE))
> > +                     return -EINVAL;
> > +             if (!IS_ALIGNED(uc->ref_ctr_offset, sizeof(short)))
> > +                     return -EINVAL;
> > +     }
> >
> > -     down_write(&uprobe->register_rwsem);
> > -     consumer_add(uprobe, uc);
> > -     ret = register_for_each_vma(uprobe, uc);
> > -     if (ret)
> > -             __uprobe_unregister(uprobe, uc);
> > -     up_write(&uprobe->register_rwsem);
> > +     for (i = 0; i < cnt; i++) {
> > +             uc = get_uprobe_consumer(i, ctx);
> >
> > -     if (ret)
> > -             put_uprobe(uprobe);
> > +             uprobe = alloc_uprobe(inode, uc->offset, uc->ref_ctr_offset);
> > +             if (IS_ERR(uprobe)) {
> > +                     ret = PTR_ERR(uprobe);
> > +                     goto cleanup_uprobes;
> > +             }
> > +
> > +             uc->uprobe = uprobe;
> > +     }
> > +
> > +     for (i = 0; i < cnt; i++) {
> > +             uc = get_uprobe_consumer(i, ctx);
> > +             uprobe = uc->uprobe;
> > +
> > +             down_write(&uprobe->register_rwsem);
> > +             consumer_add(uprobe, uc);
> > +             ret = register_for_each_vma(uprobe, uc);
> > +             if (ret)
> > +                     __uprobe_unregister(uprobe, uc);
> > +             up_write(&uprobe->register_rwsem);
> > +
> > +             if (ret) {
> > +                     put_uprobe(uprobe);
> > +                     goto cleanup_unreg;
> > +             }
> > +     }
> > +
> > +     return 0;
> >
> > +cleanup_unreg:
> > +     /* unregister all uprobes we managed to register until failure */
> > +     for (i--; i >= 0; i--) {
> > +             uc = get_uprobe_consumer(i, ctx);
> > +
> > +             down_write(&uprobe->register_rwsem);
> > +             __uprobe_unregister(uc->uprobe, uc);
> > +             up_write(&uprobe->register_rwsem);
> > +     }
> > +cleanup_uprobes:
>
> when we jump here from 'goto cleanup_uprobes' not all of the
> consumers might have uc->uprobe set up
>
> perhaps we can set cnt = i - 1 before the goto, or just check uc->uprobe below

yep, you are right, I missed this part during multiple rounds of
refactorings. I think the `if (uc->uprobe)` check is the cleanest
approach here.

>
>
> > +     /* put all the successfully allocated/reused uprobes */
> > +     for (i = cnt - 1; i >= 0; i--) {
>
> curious, any reason why we go from the top here?

No particular reason. This started as (i = i - 1; i >= 0; i--), but
then as I kept splitting steps I needed to do this over all uprobes.
Anyways, I can do a clean `i = 0; i < cnt; i++` with `if (uc->uprobe)`
check.

>
> thanks,
> jirka
>
> > +             uc = get_uprobe_consumer(i, ctx);
> > +
> > +             put_uprobe(uc->uprobe);
> > +             uc->uprobe = NULL;
> > +     }
> >       return ret;
> >  }
> >
> >  int uprobe_register(struct inode *inode, struct uprobe_consumer *uc)
> >  {
> > -     return __uprobe_register(inode, uc->offset, uc->ref_ctr_offset, uc);
> > +     return uprobe_register_batch(inode, 1, uprobe_consumer_identity, uc);
> >  }
> >  EXPORT_SYMBOL_GPL(uprobe_register);
> >
> > --
> > 2.43.0
> >

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-06-25  0:21 ` [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
  2024-06-25 14:44   ` Oleg Nesterov
  2024-06-26  6:02   ` kernel test robot
@ 2024-06-27  2:29   ` Masami Hiramatsu
  2024-06-27 16:43     ` Andrii Nakryiko
  2 siblings, 1 reply; 42+ messages in thread
From: Masami Hiramatsu @ 2024-06-27  2:29 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, rostedt, oleg, peterz, mingo, bpf, jolsa,
	paulmck, clm

On Mon, 24 Jun 2024 17:21:36 -0700
Andrii Nakryiko <andrii@kernel.org> wrote:

> Anyways, under exclusive writer lock, we double-check that refcount
> didn't change and is still zero. If it is, we proceed with destruction,
> because at that point we have a guarantee that find_active_uprobe()
> can't successfully look up this uprobe instance, as it's going to be
> removed in destructor under writer lock. If, on the other hand,
> find_active_uprobe() managed to bump refcount from zero to one in
> between put_uprobe()'s atomic_dec_and_test(&uprobe->ref) and
> write_lock(&uprobes_treelock), we'll deterministically detect this with
> extra atomic_read(&uprobe->ref) check, and if it doesn't hold, we
> pretend like atomic_dec_and_test() never returned true. There is no
> resource freeing or any other irreversible action taken up till this
> point, so we just exit early.
> 
> One tricky part in the above is actually two CPUs racing and dropping
> refcnt to zero, and then attempting to free resources. This can happen
> as follows:
>   - CPU #0 drops refcnt from 1 to 0, and proceeds to grab uprobes_treelock;
>   - before CPU #0 grabs a lock, CPU #1 updates refcnt as 0 -> 1 -> 0, at
>     which point it decides that it needs to free uprobe as well.
> 
> At this point both CPU #0 and CPU #1 will believe they need to destroy
> uprobe, which is obviously wrong. To prevent this situations, we augment
> refcount with epoch counter, which is always incremented by 1 on either
> get or put operation. This allows those two CPUs above to disambiguate
> who should actually free uprobe (it's the CPU #1, because it has
> up-to-date epoch). See comments in the code and note the specific values
> of UPROBE_REFCNT_GET and UPROBE_REFCNT_PUT constants. Keep in mind that
> a single atomi64_t is actually a two sort-of-independent 32-bit counters
> that are incremented/decremented with a single atomic_add_and_return()
> operation. Note also a small and extremely rare (and thus having no
> effect on performance) need to clear the highest bit every 2 billion
> get/put operations to prevent high 32-bit counter from "bleeding over"
> into lower 32-bit counter.

I have a question here.
Is there any chance to the CPU#1 to put the uprobe before CPU#0 gets
the uprobes_treelock, and free uprobe before CPU#0 validate uprobe->ref
again? e.g.

CPU#0							CPU#1

put_uprobe() {
	atomic64_add_return()
							__get_uprobe();
							put_uprobe() {
								kfree(uprobe)
							}
	write_lock(&uprobes_treelock);
	atomic64_read(&uprobe->ref);
}

I think it is very rare case, but I could not find any code to prevent
this scenario.

Thank you,


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer
  2024-06-25  0:21 ` [PATCH 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer Andrii Nakryiko
@ 2024-06-27  3:06   ` Masami Hiramatsu
  0 siblings, 0 replies; 42+ messages in thread
From: Masami Hiramatsu @ 2024-06-27  3:06 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, rostedt, oleg, peterz, mingo, bpf, jolsa,
	paulmck, clm

On Mon, 24 Jun 2024 17:21:37 -0700
Andrii Nakryiko <andrii@kernel.org> wrote:

> Simplify uprobe registration/unregistration interfaces by making offset
> and ref_ctr_offset part of uprobe_consumer "interface". In practice, all
> existing users already store these fields somewhere in uprobe_consumer's
> containing structure, so this doesn't pose any problem. We just move
> some fields around.
> 
> On the other hand, this simplifies uprobe_register() and
> uprobe_unregister() API by having only struct uprobe_consumer as one
> thing representing attachment/detachment entity. This makes batched
> versions of uprobe_register() and uprobe_unregister() simpler.
> 
> This also makes uprobe_register_refctr() unnecessary, so remove it and
> simplify consumers.
> 
> No functional changes intended.
> 

Looks good to me.

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Thanks!

> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  include/linux/uprobes.h                       | 18 +++----
>  kernel/events/uprobes.c                       | 19 ++-----
>  kernel/trace/bpf_trace.c                      | 21 +++-----
>  kernel/trace/trace_uprobe.c                   | 53 ++++++++-----------
>  .../selftests/bpf/bpf_testmod/bpf_testmod.c   | 23 ++++----
>  5 files changed, 55 insertions(+), 79 deletions(-)
> 
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index b503fafb7fb3..a75ba37ce3c8 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -42,6 +42,11 @@ struct uprobe_consumer {
>  				enum uprobe_filter_ctx ctx,
>  				struct mm_struct *mm);
>  
> +	/* associated file offset of this probe */
> +	loff_t offset;
> +	/* associated refctr file offset of this probe, or zero */
> +	loff_t ref_ctr_offset;
> +	/* for internal uprobe infra use, consumers shouldn't touch fields below */
>  	struct uprobe_consumer *next;
>  };
>  
> @@ -110,10 +115,9 @@ extern bool is_trap_insn(uprobe_opcode_t *insn);
>  extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
>  extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
>  extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
> -extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
> -extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
> +extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc);
>  extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool);
> -extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
> +extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc);
>  extern int uprobe_mmap(struct vm_area_struct *vma);
>  extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end);
>  extern void uprobe_start_dup_mmap(void);
> @@ -152,11 +156,7 @@ static inline void uprobes_init(void)
>  #define uprobe_get_trap_addr(regs)	instruction_pointer(regs)
>  
>  static inline int
> -uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
> -{
> -	return -ENOSYS;
> -}
> -static inline int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc)
> +uprobe_register(struct inode *inode, struct uprobe_consumer *uc)
>  {
>  	return -ENOSYS;
>  }
> @@ -166,7 +166,7 @@ uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, boo
>  	return -ENOSYS;
>  }
>  static inline void
> -uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
> +uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc)
>  {
>  }
>  static inline int uprobe_mmap(struct vm_area_struct *vma)
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 8ce669bc6474..2544e8b79bad 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1197,14 +1197,13 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
>  /*
>   * uprobe_unregister - unregister an already registered probe.
>   * @inode: the file in which the probe has to be removed.
> - * @offset: offset from the start of the file.
> - * @uc: identify which probe if multiple probes are colocated.
> + * @uc: identify which probe consumer to unregister.
>   */
> -void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
> +void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc)
>  {
>  	struct uprobe *uprobe;
>  
> -	uprobe = find_uprobe(inode, offset);
> +	uprobe = find_uprobe(inode, uc->offset);
>  	if (WARN_ON(!uprobe))
>  		return;
>  
> @@ -1277,20 +1276,12 @@ static int __uprobe_register(struct inode *inode, loff_t offset,
>  	return ret;
>  }
>  
> -int uprobe_register(struct inode *inode, loff_t offset,
> -		    struct uprobe_consumer *uc)
> +int uprobe_register(struct inode *inode, struct uprobe_consumer *uc)
>  {
> -	return __uprobe_register(inode, offset, 0, uc);
> +	return __uprobe_register(inode, uc->offset, uc->ref_ctr_offset, uc);
>  }
>  EXPORT_SYMBOL_GPL(uprobe_register);
>  
> -int uprobe_register_refctr(struct inode *inode, loff_t offset,
> -			   loff_t ref_ctr_offset, struct uprobe_consumer *uc)
> -{
> -	return __uprobe_register(inode, offset, ref_ctr_offset, uc);
> -}
> -EXPORT_SYMBOL_GPL(uprobe_register_refctr);
> -
>  /*
>   * uprobe_apply - unregister an already registered probe.
>   * @inode: the file in which the probe has to be removed.
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index d1daeab1bbc1..ba62baec3152 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -3154,8 +3154,6 @@ struct bpf_uprobe_multi_link;
>  
>  struct bpf_uprobe {
>  	struct bpf_uprobe_multi_link *link;
> -	loff_t offset;
> -	unsigned long ref_ctr_offset;
>  	u64 cookie;
>  	struct uprobe_consumer consumer;
>  };
> @@ -3181,8 +3179,7 @@ static void bpf_uprobe_unregister(struct path *path, struct bpf_uprobe *uprobes,
>  	u32 i;
>  
>  	for (i = 0; i < cnt; i++) {
> -		uprobe_unregister(d_real_inode(path->dentry), uprobes[i].offset,
> -				  &uprobes[i].consumer);
> +		uprobe_unregister(d_real_inode(path->dentry), &uprobes[i].consumer);
>  	}
>  }
>  
> @@ -3262,10 +3259,10 @@ static int bpf_uprobe_multi_link_fill_link_info(const struct bpf_link *link,
>  
>  	for (i = 0; i < ucount; i++) {
>  		if (uoffsets &&
> -		    put_user(umulti_link->uprobes[i].offset, uoffsets + i))
> +		    put_user(umulti_link->uprobes[i].consumer.offset, uoffsets + i))
>  			return -EFAULT;
>  		if (uref_ctr_offsets &&
> -		    put_user(umulti_link->uprobes[i].ref_ctr_offset, uref_ctr_offsets + i))
> +		    put_user(umulti_link->uprobes[i].consumer.ref_ctr_offset, uref_ctr_offsets + i))
>  			return -EFAULT;
>  		if (ucookies &&
>  		    put_user(umulti_link->uprobes[i].cookie, ucookies + i))
> @@ -3439,15 +3436,16 @@ int bpf_uprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
>  		goto error_free;
>  
>  	for (i = 0; i < cnt; i++) {
> -		if (__get_user(uprobes[i].offset, uoffsets + i)) {
> +		if (__get_user(uprobes[i].consumer.offset, uoffsets + i)) {
>  			err = -EFAULT;
>  			goto error_free;
>  		}
> -		if (uprobes[i].offset < 0) {
> +		if (uprobes[i].consumer.offset < 0) {
>  			err = -EINVAL;
>  			goto error_free;
>  		}
> -		if (uref_ctr_offsets && __get_user(uprobes[i].ref_ctr_offset, uref_ctr_offsets + i)) {
> +		if (uref_ctr_offsets &&
> +		    __get_user(uprobes[i].consumer.ref_ctr_offset, uref_ctr_offsets + i)) {
>  			err = -EFAULT;
>  			goto error_free;
>  		}
> @@ -3477,10 +3475,7 @@ int bpf_uprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
>  		      &bpf_uprobe_multi_link_lops, prog);
>  
>  	for (i = 0; i < cnt; i++) {
> -		err = uprobe_register_refctr(d_real_inode(link->path.dentry),
> -					     uprobes[i].offset,
> -					     uprobes[i].ref_ctr_offset,
> -					     &uprobes[i].consumer);
> +		err = uprobe_register(d_real_inode(link->path.dentry), &uprobes[i].consumer);
>  		if (err) {
>  			bpf_uprobe_unregister(&path, uprobes, i);
>  			goto error_free;
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index c98e3b3386ba..d786f99114be 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -60,8 +60,6 @@ struct trace_uprobe {
>  	struct path			path;
>  	struct inode			*inode;
>  	char				*filename;
> -	unsigned long			offset;
> -	unsigned long			ref_ctr_offset;
>  	unsigned long			nhit;
>  	struct trace_probe		tp;
>  };
> @@ -205,7 +203,7 @@ static unsigned long translate_user_vaddr(unsigned long file_offset)
>  
>  	udd = (void *) current->utask->vaddr;
>  
> -	base_addr = udd->bp_addr - udd->tu->offset;
> +	base_addr = udd->bp_addr - udd->tu->consumer.offset;
>  	return base_addr + file_offset;
>  }
>  
> @@ -286,13 +284,13 @@ static bool trace_uprobe_match_command_head(struct trace_uprobe *tu,
>  	if (strncmp(tu->filename, argv[0], len) || argv[0][len] != ':')
>  		return false;
>  
> -	if (tu->ref_ctr_offset == 0)
> -		snprintf(buf, sizeof(buf), "0x%0*lx",
> -				(int)(sizeof(void *) * 2), tu->offset);
> +	if (tu->consumer.ref_ctr_offset == 0)
> +		snprintf(buf, sizeof(buf), "0x%0*llx",
> +				(int)(sizeof(void *) * 2), tu->consumer.offset);
>  	else
> -		snprintf(buf, sizeof(buf), "0x%0*lx(0x%lx)",
> -				(int)(sizeof(void *) * 2), tu->offset,
> -				tu->ref_ctr_offset);
> +		snprintf(buf, sizeof(buf), "0x%0*llx(0x%llx)",
> +				(int)(sizeof(void *) * 2), tu->consumer.offset,
> +				tu->consumer.ref_ctr_offset);
>  	if (strcmp(buf, &argv[0][len + 1]))
>  		return false;
>  
> @@ -410,7 +408,7 @@ static bool trace_uprobe_has_same_uprobe(struct trace_uprobe *orig,
>  
>  	list_for_each_entry(orig, &tpe->probes, tp.list) {
>  		if (comp_inode != d_real_inode(orig->path.dentry) ||
> -		    comp->offset != orig->offset)
> +		    comp->consumer.offset != orig->consumer.offset)
>  			continue;
>  
>  		/*
> @@ -472,8 +470,8 @@ static int validate_ref_ctr_offset(struct trace_uprobe *new)
>  
>  	for_each_trace_uprobe(tmp, pos) {
>  		if (new_inode == d_real_inode(tmp->path.dentry) &&
> -		    new->offset == tmp->offset &&
> -		    new->ref_ctr_offset != tmp->ref_ctr_offset) {
> +		    new->consumer.offset == tmp->consumer.offset &&
> +		    new->consumer.ref_ctr_offset != tmp->consumer.ref_ctr_offset) {
>  			pr_warn("Reference counter offset mismatch.");
>  			return -EINVAL;
>  		}
> @@ -675,8 +673,8 @@ static int __trace_uprobe_create(int argc, const char **argv)
>  		WARN_ON_ONCE(ret != -ENOMEM);
>  		goto fail_address_parse;
>  	}
> -	tu->offset = offset;
> -	tu->ref_ctr_offset = ref_ctr_offset;
> +	tu->consumer.offset = offset;
> +	tu->consumer.ref_ctr_offset = ref_ctr_offset;
>  	tu->path = path;
>  	tu->filename = filename;
>  
> @@ -746,12 +744,12 @@ static int trace_uprobe_show(struct seq_file *m, struct dyn_event *ev)
>  	char c = is_ret_probe(tu) ? 'r' : 'p';
>  	int i;
>  
> -	seq_printf(m, "%c:%s/%s %s:0x%0*lx", c, trace_probe_group_name(&tu->tp),
> +	seq_printf(m, "%c:%s/%s %s:0x%0*llx", c, trace_probe_group_name(&tu->tp),
>  			trace_probe_name(&tu->tp), tu->filename,
> -			(int)(sizeof(void *) * 2), tu->offset);
> +			(int)(sizeof(void *) * 2), tu->consumer.offset);
>  
> -	if (tu->ref_ctr_offset)
> -		seq_printf(m, "(0x%lx)", tu->ref_ctr_offset);
> +	if (tu->consumer.ref_ctr_offset)
> +		seq_printf(m, "(0x%llx)", tu->consumer.ref_ctr_offset);
>  
>  	for (i = 0; i < tu->tp.nr_args; i++)
>  		seq_printf(m, " %s=%s", tu->tp.args[i].name, tu->tp.args[i].comm);
> @@ -1089,12 +1087,7 @@ static int trace_uprobe_enable(struct trace_uprobe *tu, filter_func_t filter)
>  	tu->consumer.filter = filter;
>  	tu->inode = d_real_inode(tu->path.dentry);
>  
> -	if (tu->ref_ctr_offset)
> -		ret = uprobe_register_refctr(tu->inode, tu->offset,
> -				tu->ref_ctr_offset, &tu->consumer);
> -	else
> -		ret = uprobe_register(tu->inode, tu->offset, &tu->consumer);
> -
> +	ret = uprobe_register(tu->inode, &tu->consumer);
>  	if (ret)
>  		tu->inode = NULL;
>  
> @@ -1112,7 +1105,7 @@ static void __probe_event_disable(struct trace_probe *tp)
>  		if (!tu->inode)
>  			continue;
>  
> -		uprobe_unregister(tu->inode, tu->offset, &tu->consumer);
> +		uprobe_unregister(tu->inode, &tu->consumer);
>  		tu->inode = NULL;
>  	}
>  }
> @@ -1310,7 +1303,7 @@ static int uprobe_perf_close(struct trace_event_call *call,
>  		return 0;
>  
>  	list_for_each_entry(tu, trace_probe_probe_list(tp), tp.list) {
> -		ret = uprobe_apply(tu->inode, tu->offset, &tu->consumer, false);
> +		ret = uprobe_apply(tu->inode, tu->consumer.offset, &tu->consumer, false);
>  		if (ret)
>  			break;
>  	}
> @@ -1334,7 +1327,7 @@ static int uprobe_perf_open(struct trace_event_call *call,
>  		return 0;
>  
>  	list_for_each_entry(tu, trace_probe_probe_list(tp), tp.list) {
> -		err = uprobe_apply(tu->inode, tu->offset, &tu->consumer, true);
> +		err = uprobe_apply(tu->inode, tu->consumer.offset, &tu->consumer, true);
>  		if (err) {
>  			uprobe_perf_close(call, event);
>  			break;
> @@ -1464,7 +1457,7 @@ int bpf_get_uprobe_info(const struct perf_event *event, u32 *fd_type,
>  	*fd_type = is_ret_probe(tu) ? BPF_FD_TYPE_URETPROBE
>  				    : BPF_FD_TYPE_UPROBE;
>  	*filename = tu->filename;
> -	*probe_offset = tu->offset;
> +	*probe_offset = tu->consumer.offset;
>  	*probe_addr = 0;
>  	return 0;
>  }
> @@ -1627,9 +1620,9 @@ create_local_trace_uprobe(char *name, unsigned long offs,
>  		return ERR_CAST(tu);
>  	}
>  
> -	tu->offset = offs;
> +	tu->consumer.offset = offs;
>  	tu->path = path;
> -	tu->ref_ctr_offset = ref_ctr_offset;
> +	tu->consumer.ref_ctr_offset = ref_ctr_offset;
>  	tu->filename = kstrdup(name, GFP_KERNEL);
>  	if (!tu->filename) {
>  		ret = -ENOMEM;
> diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> index b0132a342bb5..9ae2a3c64daa 100644
> --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> @@ -377,7 +377,6 @@ uprobe_ret_handler(struct uprobe_consumer *self, unsigned long func,
>  
>  struct testmod_uprobe {
>  	struct path path;
> -	loff_t offset;
>  	struct uprobe_consumer consumer;
>  };
>  
> @@ -391,25 +390,24 @@ static int testmod_register_uprobe(loff_t offset)
>  {
>  	int err = -EBUSY;
>  
> -	if (uprobe.offset)
> +	if (uprobe.consumer.offset)
>  		return -EBUSY;
>  
>  	mutex_lock(&testmod_uprobe_mutex);
>  
> -	if (uprobe.offset)
> +	if (uprobe.consumer.offset)
>  		goto out;
>  
>  	err = kern_path("/proc/self/exe", LOOKUP_FOLLOW, &uprobe.path);
>  	if (err)
>  		goto out;
>  
> -	err = uprobe_register_refctr(d_real_inode(uprobe.path.dentry),
> -				     offset, 0, &uprobe.consumer);
> -	if (err)
> +	uprobe.consumer.offset = offset;
> +	err = uprobe_register(d_real_inode(uprobe.path.dentry), &uprobe.consumer);
> +	if (err) {
>  		path_put(&uprobe.path);
> -	else
> -		uprobe.offset = offset;
> -
> +		uprobe.consumer.offset = 0;
> +	}
>  out:
>  	mutex_unlock(&testmod_uprobe_mutex);
>  	return err;
> @@ -419,10 +417,9 @@ static void testmod_unregister_uprobe(void)
>  {
>  	mutex_lock(&testmod_uprobe_mutex);
>  
> -	if (uprobe.offset) {
> -		uprobe_unregister(d_real_inode(uprobe.path.dentry),
> -				  uprobe.offset, &uprobe.consumer);
> -		uprobe.offset = 0;
> +	if (uprobe.consumer.offset) {
> +		uprobe_unregister(d_real_inode(uprobe.path.dentry), &uprobe.consumer);
> +		uprobe.consumer.offset = 0;
>  	}
>  
>  	mutex_unlock(&testmod_uprobe_mutex);
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
  2024-06-25  0:21 ` [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs Andrii Nakryiko
  2024-06-26 11:27   ` Jiri Olsa
@ 2024-06-27 13:04   ` Masami Hiramatsu
  2024-06-27 16:47     ` Andrii Nakryiko
  1 sibling, 1 reply; 42+ messages in thread
From: Masami Hiramatsu @ 2024-06-27 13:04 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-trace-kernel, rostedt, oleg, peterz, mingo, bpf, jolsa,
	paulmck, clm

On Mon, 24 Jun 2024 17:21:38 -0700
Andrii Nakryiko <andrii@kernel.org> wrote:

> -static int __uprobe_register(struct inode *inode, loff_t offset,
> -			     loff_t ref_ctr_offset, struct uprobe_consumer *uc)
> +int uprobe_register_batch(struct inode *inode, int cnt,
> +			  uprobe_consumer_fn get_uprobe_consumer, void *ctx)

Is this interface just for avoiding memory allocation? Can't we just
allocate a temporary array of *uprobe_consumer instead? 

Thank you,

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-06-27  2:29   ` Masami Hiramatsu
@ 2024-06-27 16:43     ` Andrii Nakryiko
  2024-07-01 21:59       ` Andrii Nakryiko
  0 siblings, 1 reply; 42+ messages in thread
From: Andrii Nakryiko @ 2024-06-27 16:43 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, oleg, peterz, mingo,
	bpf, jolsa, paulmck, clm

On Wed, Jun 26, 2024 at 7:30 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Mon, 24 Jun 2024 17:21:36 -0700
> Andrii Nakryiko <andrii@kernel.org> wrote:
>
> > Anyways, under exclusive writer lock, we double-check that refcount
> > didn't change and is still zero. If it is, we proceed with destruction,
> > because at that point we have a guarantee that find_active_uprobe()
> > can't successfully look up this uprobe instance, as it's going to be
> > removed in destructor under writer lock. If, on the other hand,
> > find_active_uprobe() managed to bump refcount from zero to one in
> > between put_uprobe()'s atomic_dec_and_test(&uprobe->ref) and
> > write_lock(&uprobes_treelock), we'll deterministically detect this with
> > extra atomic_read(&uprobe->ref) check, and if it doesn't hold, we
> > pretend like atomic_dec_and_test() never returned true. There is no
> > resource freeing or any other irreversible action taken up till this
> > point, so we just exit early.
> >
> > One tricky part in the above is actually two CPUs racing and dropping
> > refcnt to zero, and then attempting to free resources. This can happen
> > as follows:
> >   - CPU #0 drops refcnt from 1 to 0, and proceeds to grab uprobes_treelock;
> >   - before CPU #0 grabs a lock, CPU #1 updates refcnt as 0 -> 1 -> 0, at
> >     which point it decides that it needs to free uprobe as well.
> >
> > At this point both CPU #0 and CPU #1 will believe they need to destroy
> > uprobe, which is obviously wrong. To prevent this situations, we augment
> > refcount with epoch counter, which is always incremented by 1 on either
> > get or put operation. This allows those two CPUs above to disambiguate
> > who should actually free uprobe (it's the CPU #1, because it has
> > up-to-date epoch). See comments in the code and note the specific values
> > of UPROBE_REFCNT_GET and UPROBE_REFCNT_PUT constants. Keep in mind that
> > a single atomi64_t is actually a two sort-of-independent 32-bit counters
> > that are incremented/decremented with a single atomic_add_and_return()
> > operation. Note also a small and extremely rare (and thus having no
> > effect on performance) need to clear the highest bit every 2 billion
> > get/put operations to prevent high 32-bit counter from "bleeding over"
> > into lower 32-bit counter.
>
> I have a question here.
> Is there any chance to the CPU#1 to put the uprobe before CPU#0 gets
> the uprobes_treelock, and free uprobe before CPU#0 validate uprobe->ref
> again? e.g.
>
> CPU#0                                                   CPU#1
>
> put_uprobe() {
>         atomic64_add_return()
>                                                         __get_uprobe();
>                                                         put_uprobe() {
>                                                                 kfree(uprobe)
>                                                         }
>         write_lock(&uprobes_treelock);
>         atomic64_read(&uprobe->ref);
> }
>
> I think it is very rare case, but I could not find any code to prevent
> this scenario.
>

Yes, I think you are right, great catch!

I concentrated on preventing double kfree() in this situation, and
somehow convinced myself that eager kfree() is fine. But I think I'll
need to delay freeing, probably with RCU. The problem is that we can't
use rcu_read_lock()/rcu_read_unlock() because we take locks, so it has
to be a sleepable variant of RCU. I'm thinking of using
rcu_read_lock_trace(), the same variant of RCU we use for sleepable
BPF programs (including sleepable uprobes). srcu might be too heavy
for this.

I'll try a few variants over the next few days and see how the
performance looks.

> Thank you,
>
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
  2024-06-27 13:04   ` Masami Hiramatsu
@ 2024-06-27 16:47     ` Andrii Nakryiko
  2024-06-28  6:28       ` Masami Hiramatsu
  0 siblings, 1 reply; 42+ messages in thread
From: Andrii Nakryiko @ 2024-06-27 16:47 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, oleg, peterz, mingo,
	bpf, jolsa, paulmck, clm

On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Mon, 24 Jun 2024 17:21:38 -0700
> Andrii Nakryiko <andrii@kernel.org> wrote:
>
> > -static int __uprobe_register(struct inode *inode, loff_t offset,
> > -                          loff_t ref_ctr_offset, struct uprobe_consumer *uc)
> > +int uprobe_register_batch(struct inode *inode, int cnt,
> > +                       uprobe_consumer_fn get_uprobe_consumer, void *ctx)
>
> Is this interface just for avoiding memory allocation? Can't we just
> allocate a temporary array of *uprobe_consumer instead?

Yes, exactly, to avoid the need for allocating another array that
would just contain pointers to uprobe_consumer. Consumers would never
just have an array of `struct uprobe_consumer *`, because
uprobe_consumer struct is embedded in some other struct, so the array
interface isn't the most convenient.

If you feel strongly, I can do an array, but this necessitates
allocating an extra array *and keeping it* for the entire duration of
BPF multi-uprobe link (attachment) existence, so it feels like a
waste. This is because we don't want to do anything that can fail in
the detachment logic (so no temporary array allocation there).

Anyways, let me know how you feel about keeping this callback.

>
> Thank you,
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
  2024-06-27 16:47     ` Andrii Nakryiko
@ 2024-06-28  6:28       ` Masami Hiramatsu
  2024-06-28 16:34         ` Andrii Nakryiko
  0 siblings, 1 reply; 42+ messages in thread
From: Masami Hiramatsu @ 2024-06-28  6:28 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, oleg, peterz, mingo,
	bpf, jolsa, paulmck, clm

On Thu, 27 Jun 2024 09:47:10 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> >
> > On Mon, 24 Jun 2024 17:21:38 -0700
> > Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > > -static int __uprobe_register(struct inode *inode, loff_t offset,
> > > -                          loff_t ref_ctr_offset, struct uprobe_consumer *uc)
> > > +int uprobe_register_batch(struct inode *inode, int cnt,
> > > +                       uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> >
> > Is this interface just for avoiding memory allocation? Can't we just
> > allocate a temporary array of *uprobe_consumer instead?
> 
> Yes, exactly, to avoid the need for allocating another array that
> would just contain pointers to uprobe_consumer. Consumers would never
> just have an array of `struct uprobe_consumer *`, because
> uprobe_consumer struct is embedded in some other struct, so the array
> interface isn't the most convenient.

OK, I understand it.

> 
> If you feel strongly, I can do an array, but this necessitates
> allocating an extra array *and keeping it* for the entire duration of
> BPF multi-uprobe link (attachment) existence, so it feels like a
> waste. This is because we don't want to do anything that can fail in
> the detachment logic (so no temporary array allocation there).

No need to change it, that sounds reasonable.

> 
> Anyways, let me know how you feel about keeping this callback.

IMHO, maybe the interface function is better to change to
`uprobe_consumer *next_uprobe_consumer(void **data)`. If caller
side uses a linked list of structure, index access will need to
follow the list every time.

Thank you,


> 
> >
> > Thank you,
> >
> > --
> > Masami Hiramatsu (Google) <mhiramat@kernel.org>


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
  2024-06-28  6:28       ` Masami Hiramatsu
@ 2024-06-28 16:34         ` Andrii Nakryiko
  2024-06-29 23:30           ` Masami Hiramatsu
  0 siblings, 1 reply; 42+ messages in thread
From: Andrii Nakryiko @ 2024-06-28 16:34 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, oleg, peterz, mingo,
	bpf, jolsa, paulmck, clm

On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Thu, 27 Jun 2024 09:47:10 -0700
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > >
> > > On Mon, 24 Jun 2024 17:21:38 -0700
> > > Andrii Nakryiko <andrii@kernel.org> wrote:
> > >
> > > > -static int __uprobe_register(struct inode *inode, loff_t offset,
> > > > -                          loff_t ref_ctr_offset, struct uprobe_consumer *uc)
> > > > +int uprobe_register_batch(struct inode *inode, int cnt,
> > > > +                       uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> > >
> > > Is this interface just for avoiding memory allocation? Can't we just
> > > allocate a temporary array of *uprobe_consumer instead?
> >
> > Yes, exactly, to avoid the need for allocating another array that
> > would just contain pointers to uprobe_consumer. Consumers would never
> > just have an array of `struct uprobe_consumer *`, because
> > uprobe_consumer struct is embedded in some other struct, so the array
> > interface isn't the most convenient.
>
> OK, I understand it.
>
> >
> > If you feel strongly, I can do an array, but this necessitates
> > allocating an extra array *and keeping it* for the entire duration of
> > BPF multi-uprobe link (attachment) existence, so it feels like a
> > waste. This is because we don't want to do anything that can fail in
> > the detachment logic (so no temporary array allocation there).
>
> No need to change it, that sounds reasonable.
>

Great, thanks.

> >
> > Anyways, let me know how you feel about keeping this callback.
>
> IMHO, maybe the interface function is better to change to
> `uprobe_consumer *next_uprobe_consumer(void **data)`. If caller
> side uses a linked list of structure, index access will need to
> follow the list every time.

This would be problematic. Note how we call get_uprobe_consumer(i,
ctx) with i going from 0 to N in multiple independent loops. So if we
are only allowed to ask for the next consumer, then
uprobe_register_batch and uprobe_unregister_batch would need to build
its own internal index and remember ith instance. Which again means
more allocations and possibly failing uprobe_unregister_batch(), which
isn't great.

For now this API works well, I propose to keep it as is. For linked
list case consumers would need to allocate one extra array or pay the
price of O(N) search (which might be ok, depending on how many uprobes
are being attached). But we don't have such consumers right now,
thankfully.

>
> Thank you,
>
>
> >
> > >
> > > Thank you,
> > >
> > > --
> > > Masami Hiramatsu (Google) <mhiramat@kernel.org>
>
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
  2024-06-28 16:34         ` Andrii Nakryiko
@ 2024-06-29 23:30           ` Masami Hiramatsu
  2024-07-01 17:55             ` Andrii Nakryiko
  0 siblings, 1 reply; 42+ messages in thread
From: Masami Hiramatsu @ 2024-06-29 23:30 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, oleg, peterz, mingo,
	bpf, jolsa, paulmck, clm

On Fri, 28 Jun 2024 09:34:26 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> >
> > On Thu, 27 Jun 2024 09:47:10 -0700
> > Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >
> > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > > >
> > > > On Mon, 24 Jun 2024 17:21:38 -0700
> > > > Andrii Nakryiko <andrii@kernel.org> wrote:
> > > >
> > > > > -static int __uprobe_register(struct inode *inode, loff_t offset,
> > > > > -                          loff_t ref_ctr_offset, struct uprobe_consumer *uc)
> > > > > +int uprobe_register_batch(struct inode *inode, int cnt,
> > > > > +                       uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> > > >
> > > > Is this interface just for avoiding memory allocation? Can't we just
> > > > allocate a temporary array of *uprobe_consumer instead?
> > >
> > > Yes, exactly, to avoid the need for allocating another array that
> > > would just contain pointers to uprobe_consumer. Consumers would never
> > > just have an array of `struct uprobe_consumer *`, because
> > > uprobe_consumer struct is embedded in some other struct, so the array
> > > interface isn't the most convenient.
> >
> > OK, I understand it.
> >
> > >
> > > If you feel strongly, I can do an array, but this necessitates
> > > allocating an extra array *and keeping it* for the entire duration of
> > > BPF multi-uprobe link (attachment) existence, so it feels like a
> > > waste. This is because we don't want to do anything that can fail in
> > > the detachment logic (so no temporary array allocation there).
> >
> > No need to change it, that sounds reasonable.
> >
> 
> Great, thanks.
> 
> > >
> > > Anyways, let me know how you feel about keeping this callback.
> >
> > IMHO, maybe the interface function is better to change to
> > `uprobe_consumer *next_uprobe_consumer(void **data)`. If caller
> > side uses a linked list of structure, index access will need to
> > follow the list every time.
> 
> This would be problematic. Note how we call get_uprobe_consumer(i,
> ctx) with i going from 0 to N in multiple independent loops. So if we
> are only allowed to ask for the next consumer, then
> uprobe_register_batch and uprobe_unregister_batch would need to build
> its own internal index and remember ith instance. Which again means
> more allocations and possibly failing uprobe_unregister_batch(), which
> isn't great.

No, I think we can use a cursor variable as;

int uprobe_register_batch(struct inode *inode,
                 uprobe_consumer_fn get_uprobe_consumer, void *ctx)
{
	void *cur = ctx;

	while ((uc = get_uprobe_consumer(&cur)) != NULL) {
		...
	} 

	cur = ctx;
	while ((uc = get_uprobe_consumer(&cur)) != NULL) {
		...
	} 
}

This can also remove the cnt.

Thank you,

> 
> For now this API works well, I propose to keep it as is. For linked
> list case consumers would need to allocate one extra array or pay the
> price of O(N) search (which might be ok, depending on how many uprobes
> are being attached). But we don't have such consumers right now,
> thankfully.
> 
> >
> > Thank you,
> >
> >
> > >
> > > >
> > > > Thank you,
> > > >
> > > > --
> > > > Masami Hiramatsu (Google) <mhiramat@kernel.org>
> >
> >
> > --
> > Masami Hiramatsu (Google) <mhiramat@kernel.org>


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
  2024-06-29 23:30           ` Masami Hiramatsu
@ 2024-07-01 17:55             ` Andrii Nakryiko
  2024-07-01 22:15               ` Andrii Nakryiko
  0 siblings, 1 reply; 42+ messages in thread
From: Andrii Nakryiko @ 2024-07-01 17:55 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, oleg, peterz, mingo,
	bpf, jolsa, paulmck, clm

On Sat, Jun 29, 2024 at 4:30 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Fri, 28 Jun 2024 09:34:26 -0700
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> > On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > >
> > > On Thu, 27 Jun 2024 09:47:10 -0700
> > > Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> > >
> > > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > > > >
> > > > > On Mon, 24 Jun 2024 17:21:38 -0700
> > > > > Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > >
> > > > > > -static int __uprobe_register(struct inode *inode, loff_t offset,
> > > > > > -                          loff_t ref_ctr_offset, struct uprobe_consumer *uc)
> > > > > > +int uprobe_register_batch(struct inode *inode, int cnt,
> > > > > > +                       uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> > > > >
> > > > > Is this interface just for avoiding memory allocation? Can't we just
> > > > > allocate a temporary array of *uprobe_consumer instead?
> > > >
> > > > Yes, exactly, to avoid the need for allocating another array that
> > > > would just contain pointers to uprobe_consumer. Consumers would never
> > > > just have an array of `struct uprobe_consumer *`, because
> > > > uprobe_consumer struct is embedded in some other struct, so the array
> > > > interface isn't the most convenient.
> > >
> > > OK, I understand it.
> > >
> > > >
> > > > If you feel strongly, I can do an array, but this necessitates
> > > > allocating an extra array *and keeping it* for the entire duration of
> > > > BPF multi-uprobe link (attachment) existence, so it feels like a
> > > > waste. This is because we don't want to do anything that can fail in
> > > > the detachment logic (so no temporary array allocation there).
> > >
> > > No need to change it, that sounds reasonable.
> > >
> >
> > Great, thanks.
> >
> > > >
> > > > Anyways, let me know how you feel about keeping this callback.
> > >
> > > IMHO, maybe the interface function is better to change to
> > > `uprobe_consumer *next_uprobe_consumer(void **data)`. If caller
> > > side uses a linked list of structure, index access will need to
> > > follow the list every time.
> >
> > This would be problematic. Note how we call get_uprobe_consumer(i,
> > ctx) with i going from 0 to N in multiple independent loops. So if we
> > are only allowed to ask for the next consumer, then
> > uprobe_register_batch and uprobe_unregister_batch would need to build
> > its own internal index and remember ith instance. Which again means
> > more allocations and possibly failing uprobe_unregister_batch(), which
> > isn't great.
>
> No, I think we can use a cursor variable as;
>
> int uprobe_register_batch(struct inode *inode,
>                  uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> {
>         void *cur = ctx;
>
>         while ((uc = get_uprobe_consumer(&cur)) != NULL) {
>                 ...
>         }
>
>         cur = ctx;
>         while ((uc = get_uprobe_consumer(&cur)) != NULL) {
>                 ...
>         }
> }
>
> This can also remove the cnt.

Ok, if you prefer this I'll switch. It's a bit more cumbersome to use
for callers, but we have one right now, and might have another one, so
not a big deal.

>
> Thank you,
>
> >
> > For now this API works well, I propose to keep it as is. For linked
> > list case consumers would need to allocate one extra array or pay the
> > price of O(N) search (which might be ok, depending on how many uprobes
> > are being attached). But we don't have such consumers right now,
> > thankfully.
> >
> > >
> > > Thank you,
> > >
> > >
> > > >
> > > > >
> > > > > Thank you,
> > > > >
> > > > > --
> > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > >
> > >
> > > --
> > > Masami Hiramatsu (Google) <mhiramat@kernel.org>
>
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management
  2024-06-27 16:43     ` Andrii Nakryiko
@ 2024-07-01 21:59       ` Andrii Nakryiko
  0 siblings, 0 replies; 42+ messages in thread
From: Andrii Nakryiko @ 2024-07-01 21:59 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, oleg, peterz, mingo,
	bpf, jolsa, paulmck, clm

On Thu, Jun 27, 2024 at 9:43 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Jun 26, 2024 at 7:30 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> >
> > On Mon, 24 Jun 2024 17:21:36 -0700
> > Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > > Anyways, under exclusive writer lock, we double-check that refcount
> > > didn't change and is still zero. If it is, we proceed with destruction,
> > > because at that point we have a guarantee that find_active_uprobe()
> > > can't successfully look up this uprobe instance, as it's going to be
> > > removed in destructor under writer lock. If, on the other hand,
> > > find_active_uprobe() managed to bump refcount from zero to one in
> > > between put_uprobe()'s atomic_dec_and_test(&uprobe->ref) and
> > > write_lock(&uprobes_treelock), we'll deterministically detect this with
> > > extra atomic_read(&uprobe->ref) check, and if it doesn't hold, we
> > > pretend like atomic_dec_and_test() never returned true. There is no
> > > resource freeing or any other irreversible action taken up till this
> > > point, so we just exit early.
> > >
> > > One tricky part in the above is actually two CPUs racing and dropping
> > > refcnt to zero, and then attempting to free resources. This can happen
> > > as follows:
> > >   - CPU #0 drops refcnt from 1 to 0, and proceeds to grab uprobes_treelock;
> > >   - before CPU #0 grabs a lock, CPU #1 updates refcnt as 0 -> 1 -> 0, at
> > >     which point it decides that it needs to free uprobe as well.
> > >
> > > At this point both CPU #0 and CPU #1 will believe they need to destroy
> > > uprobe, which is obviously wrong. To prevent this situations, we augment
> > > refcount with epoch counter, which is always incremented by 1 on either
> > > get or put operation. This allows those two CPUs above to disambiguate
> > > who should actually free uprobe (it's the CPU #1, because it has
> > > up-to-date epoch). See comments in the code and note the specific values
> > > of UPROBE_REFCNT_GET and UPROBE_REFCNT_PUT constants. Keep in mind that
> > > a single atomi64_t is actually a two sort-of-independent 32-bit counters
> > > that are incremented/decremented with a single atomic_add_and_return()
> > > operation. Note also a small and extremely rare (and thus having no
> > > effect on performance) need to clear the highest bit every 2 billion
> > > get/put operations to prevent high 32-bit counter from "bleeding over"
> > > into lower 32-bit counter.
> >
> > I have a question here.
> > Is there any chance to the CPU#1 to put the uprobe before CPU#0 gets
> > the uprobes_treelock, and free uprobe before CPU#0 validate uprobe->ref
> > again? e.g.
> >
> > CPU#0                                                   CPU#1
> >
> > put_uprobe() {
> >         atomic64_add_return()
> >                                                         __get_uprobe();
> >                                                         put_uprobe() {
> >                                                                 kfree(uprobe)
> >                                                         }
> >         write_lock(&uprobes_treelock);
> >         atomic64_read(&uprobe->ref);
> > }
> >
> > I think it is very rare case, but I could not find any code to prevent
> > this scenario.
> >
>
> Yes, I think you are right, great catch!
>
> I concentrated on preventing double kfree() in this situation, and
> somehow convinced myself that eager kfree() is fine. But I think I'll
> need to delay freeing, probably with RCU. The problem is that we can't
> use rcu_read_lock()/rcu_read_unlock() because we take locks, so it has
> to be a sleepable variant of RCU. I'm thinking of using
> rcu_read_lock_trace(), the same variant of RCU we use for sleepable
> BPF programs (including sleepable uprobes). srcu might be too heavy
> for this.
>
> I'll try a few variants over the next few days and see how the
> performance looks.
>

So I think I'm going with the changes below, incorporated into this
patch (nothing else changes). __get_uprobe() doesn't need any added
RCU protection (we know that uprobe is alive). It's only put_uprobe()
that needs to guarantee RCU protection before we drop refcount all the
way until we know whether we are the winning destructor or not.

Good thing is that the changes are pretty minimal in code and also
don't seem to regress performance/scalability. So I'm pretty happy
about that, will send v2 soon.


diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 07ad8b2e7508..41d9e37633ca 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -56,6 +56,7 @@ struct uprobe {
        atomic64_t              ref;            /* see
UPROBE_REFCNT_GET below */
        struct rw_semaphore     register_rwsem;
        struct rw_semaphore     consumer_rwsem;
+       struct rcu_head         rcu;
        struct list_head        pending_list;
        struct uprobe_consumer  *consumers;
        struct inode            *inode;         /* Also hold a ref to inode */
@@ -623,7 +624,7 @@ set_orig_insn(struct arch_uprobe *auprobe, struct
mm_struct *mm, unsigned long v
 #define UPROBE_REFCNT_GET ((1LL << 32) | 1LL)
 #define UPROBE_REFCNT_PUT (0xffffffffLL)

-/**
+/*
  * Caller has to make sure that:
  *   a) either uprobe's refcnt is positive before this call;
  *   b) or uprobes_treelock is held (doesn't matter if for read or write),
@@ -657,10 +658,26 @@ static inline bool uprobe_is_active(struct uprobe *uprobe)
        return !RB_EMPTY_NODE(&uprobe->rb_node);
 }

+static void uprobe_free_rcu(struct rcu_head *rcu)
+{
+       struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
+
+       kfree(uprobe);
+}
+
 static void put_uprobe(struct uprobe *uprobe)
 {
        s64 v;

+       /*
+        * here uprobe instance is guaranteed to be alive, so we use Tasks
+        * Trace RCU to guarantee that uprobe won't be freed from under us, if
+        * we end up being a losing "destructor" inside uprobe_treelock'ed
+        * section double-checking uprobe->ref value below.
+        * Note call_rcu_tasks_trace() + uprobe_free_rcu below.
+        */
+       rcu_read_lock_trace();
+
        v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);

        if (unlikely((u32)v == 0)) {
@@ -691,6 +708,8 @@ static void put_uprobe(struct uprobe *uprobe)
                        rb_erase(&uprobe->rb_node, &uprobes_tree);
                write_unlock(&uprobes_treelock);

+               rcu_read_unlock_trace();
+
                /* uprobe got resurrected, pretend we never tried to free it */
                if (!destroy)
                        return;
@@ -704,7 +723,7 @@ static void put_uprobe(struct uprobe *uprobe)
                delayed_uprobe_remove(uprobe, NULL);
                mutex_unlock(&delayed_uprobe_lock);

-               kfree(uprobe);
+               call_rcu_tasks_trace(&uprobe->rcu, uprobe_free_rcu);
                return;
        }

@@ -716,6 +735,8 @@ static void put_uprobe(struct uprobe *uprobe)
         */
        if (unlikely(v < 0))
                (void)atomic64_cmpxchg(&uprobe->ref, v, v & ~(1ULL << 63));
+
+       rcu_read_unlock_trace();
 }

 static __always_inline



> > Thank you,
> >
> >
> > --
> > Masami Hiramatsu (Google) <mhiramat@kernel.org>
> >

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
  2024-07-01 17:55             ` Andrii Nakryiko
@ 2024-07-01 22:15               ` Andrii Nakryiko
  2024-07-02  1:01                 ` Masami Hiramatsu
  0 siblings, 1 reply; 42+ messages in thread
From: Andrii Nakryiko @ 2024-07-01 22:15 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, oleg, peterz, mingo,
	bpf, jolsa, paulmck, clm

On Mon, Jul 1, 2024 at 10:55 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Sat, Jun 29, 2024 at 4:30 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> >
> > On Fri, 28 Jun 2024 09:34:26 -0700
> > Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >
> > > On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > > >
> > > > On Thu, 27 Jun 2024 09:47:10 -0700
> > > > Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > > > > >
> > > > > > On Mon, 24 Jun 2024 17:21:38 -0700
> > > > > > Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > > >
> > > > > > > -static int __uprobe_register(struct inode *inode, loff_t offset,
> > > > > > > -                          loff_t ref_ctr_offset, struct uprobe_consumer *uc)
> > > > > > > +int uprobe_register_batch(struct inode *inode, int cnt,
> > > > > > > +                       uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> > > > > >
> > > > > > Is this interface just for avoiding memory allocation? Can't we just
> > > > > > allocate a temporary array of *uprobe_consumer instead?
> > > > >
> > > > > Yes, exactly, to avoid the need for allocating another array that
> > > > > would just contain pointers to uprobe_consumer. Consumers would never
> > > > > just have an array of `struct uprobe_consumer *`, because
> > > > > uprobe_consumer struct is embedded in some other struct, so the array
> > > > > interface isn't the most convenient.
> > > >
> > > > OK, I understand it.
> > > >
> > > > >
> > > > > If you feel strongly, I can do an array, but this necessitates
> > > > > allocating an extra array *and keeping it* for the entire duration of
> > > > > BPF multi-uprobe link (attachment) existence, so it feels like a
> > > > > waste. This is because we don't want to do anything that can fail in
> > > > > the detachment logic (so no temporary array allocation there).
> > > >
> > > > No need to change it, that sounds reasonable.
> > > >
> > >
> > > Great, thanks.
> > >
> > > > >
> > > > > Anyways, let me know how you feel about keeping this callback.
> > > >
> > > > IMHO, maybe the interface function is better to change to
> > > > `uprobe_consumer *next_uprobe_consumer(void **data)`. If caller
> > > > side uses a linked list of structure, index access will need to
> > > > follow the list every time.
> > >
> > > This would be problematic. Note how we call get_uprobe_consumer(i,
> > > ctx) with i going from 0 to N in multiple independent loops. So if we
> > > are only allowed to ask for the next consumer, then
> > > uprobe_register_batch and uprobe_unregister_batch would need to build
> > > its own internal index and remember ith instance. Which again means
> > > more allocations and possibly failing uprobe_unregister_batch(), which
> > > isn't great.
> >
> > No, I think we can use a cursor variable as;
> >
> > int uprobe_register_batch(struct inode *inode,
> >                  uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> > {
> >         void *cur = ctx;
> >
> >         while ((uc = get_uprobe_consumer(&cur)) != NULL) {
> >                 ...
> >         }
> >
> >         cur = ctx;
> >         while ((uc = get_uprobe_consumer(&cur)) != NULL) {
> >                 ...
> >         }
> > }
> >
> > This can also remove the cnt.
>
> Ok, if you prefer this I'll switch. It's a bit more cumbersome to use
> for callers, but we have one right now, and might have another one, so
> not a big deal.
>

Actually, now that I started implementing this, I really-really don't
like it. In the example above you assume by storing and reusing
original ctx value you will reset iteration to the very beginning.
This is not how it works in practice though. Ctx is most probably a
pointer to some struct somewhere with iteration state (e.g., array of
all uprobes + current index), and so get_uprobe_consumer() doesn't
update `void *ctx` itself, it updates the state of that struct.

And so there is no easy and clean way to reset this iterator without
adding another callback or something. At which point it becomes quite
cumbersome and convoluted.

How about this? I'll keep the existing get_uprobe_consumer(idx, ctx)
contract, which works for the only user right now, BPF multi-uprobes.
When it's time to add another consumer that works with a linked list,
we can add another more complicated contract that would do
iterator-style callbacks. This would be used by linked list users, and
we can transparently implement existing uprobe_register_batch()
contract on top of if by implementing a trivial iterator wrapper on
top of get_uprobe_consumer(idx, ctx) approach.

Let's not add unnecessary complications right now given we have a
clear path forward to add it later, if necessary, without breaking
anything. I'll send v2 without changes to get_uprobe_consumer() for
now, hopefully my above plan makes sense to you. Thanks!

> >
> > Thank you,
> >
> > >
> > > For now this API works well, I propose to keep it as is. For linked
> > > list case consumers would need to allocate one extra array or pay the
> > > price of O(N) search (which might be ok, depending on how many uprobes
> > > are being attached). But we don't have such consumers right now,
> > > thankfully.
> > >
> > > >
> > > > Thank you,
> > > >
> > > >
> > > > >
> > > > > >
> > > > > > Thank you,
> > > > > >
> > > > > > --
> > > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > > >
> > > >
> > > > --
> > > > Masami Hiramatsu (Google) <mhiramat@kernel.org>
> >
> >
> > --
> > Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
  2024-07-01 22:15               ` Andrii Nakryiko
@ 2024-07-02  1:01                 ` Masami Hiramatsu
  2024-07-02  1:34                   ` Andrii Nakryiko
  0 siblings, 1 reply; 42+ messages in thread
From: Masami Hiramatsu @ 2024-07-02  1:01 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, oleg, peterz, mingo,
	bpf, jolsa, paulmck, clm

On Mon, 1 Jul 2024 15:15:56 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> On Mon, Jul 1, 2024 at 10:55 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Sat, Jun 29, 2024 at 4:30 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > >
> > > On Fri, 28 Jun 2024 09:34:26 -0700
> > > Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> > >
> > > > On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > > > >
> > > > > On Thu, 27 Jun 2024 09:47:10 -0700
> > > > > Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> > > > >
> > > > > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > > > > > >
> > > > > > > On Mon, 24 Jun 2024 17:21:38 -0700
> > > > > > > Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > > > >
> > > > > > > > -static int __uprobe_register(struct inode *inode, loff_t offset,
> > > > > > > > -                          loff_t ref_ctr_offset, struct uprobe_consumer *uc)
> > > > > > > > +int uprobe_register_batch(struct inode *inode, int cnt,
> > > > > > > > +                       uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> > > > > > >
> > > > > > > Is this interface just for avoiding memory allocation? Can't we just
> > > > > > > allocate a temporary array of *uprobe_consumer instead?
> > > > > >
> > > > > > Yes, exactly, to avoid the need for allocating another array that
> > > > > > would just contain pointers to uprobe_consumer. Consumers would never
> > > > > > just have an array of `struct uprobe_consumer *`, because
> > > > > > uprobe_consumer struct is embedded in some other struct, so the array
> > > > > > interface isn't the most convenient.
> > > > >
> > > > > OK, I understand it.
> > > > >
> > > > > >
> > > > > > If you feel strongly, I can do an array, but this necessitates
> > > > > > allocating an extra array *and keeping it* for the entire duration of
> > > > > > BPF multi-uprobe link (attachment) existence, so it feels like a
> > > > > > waste. This is because we don't want to do anything that can fail in
> > > > > > the detachment logic (so no temporary array allocation there).
> > > > >
> > > > > No need to change it, that sounds reasonable.
> > > > >
> > > >
> > > > Great, thanks.
> > > >
> > > > > >
> > > > > > Anyways, let me know how you feel about keeping this callback.
> > > > >
> > > > > IMHO, maybe the interface function is better to change to
> > > > > `uprobe_consumer *next_uprobe_consumer(void **data)`. If caller
> > > > > side uses a linked list of structure, index access will need to
> > > > > follow the list every time.
> > > >
> > > > This would be problematic. Note how we call get_uprobe_consumer(i,
> > > > ctx) with i going from 0 to N in multiple independent loops. So if we
> > > > are only allowed to ask for the next consumer, then
> > > > uprobe_register_batch and uprobe_unregister_batch would need to build
> > > > its own internal index and remember ith instance. Which again means
> > > > more allocations and possibly failing uprobe_unregister_batch(), which
> > > > isn't great.
> > >
> > > No, I think we can use a cursor variable as;
> > >
> > > int uprobe_register_batch(struct inode *inode,
> > >                  uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> > > {
> > >         void *cur = ctx;
> > >
> > >         while ((uc = get_uprobe_consumer(&cur)) != NULL) {
> > >                 ...
> > >         }
> > >
> > >         cur = ctx;
> > >         while ((uc = get_uprobe_consumer(&cur)) != NULL) {
> > >                 ...
> > >         }
> > > }
> > >
> > > This can also remove the cnt.
> >
> > Ok, if you prefer this I'll switch. It's a bit more cumbersome to use
> > for callers, but we have one right now, and might have another one, so
> > not a big deal.
> >
> 
> Actually, now that I started implementing this, I really-really don't
> like it. In the example above you assume by storing and reusing
> original ctx value you will reset iteration to the very beginning.
> This is not how it works in practice though. Ctx is most probably a
> pointer to some struct somewhere with iteration state (e.g., array of
> all uprobes + current index), and so get_uprobe_consumer() doesn't
> update `void *ctx` itself, it updates the state of that struct.

Yeah, that should be noted so that if the get_uprobe_consumer() is
called with original `ctx` value, it should return the same.
Ah, and I found we need to pass both ctx and pos...

       while ((uc = get_uprobe_consumer(&cur, ctx)) != NULL) {
                 ...
         }

Usually it is enough to pass the cursor as similar to the other
for_each_* macros. For example, struct foo has .list and .uc, then

struct uprobe_consumer *get_uprobe_consumer_foo(void **pos, void *head)
{
	struct foo *foo = *pos;

	if (!foo)
		return NULL;

	*pos = list_next_entry(foo, list);
	if (list_is_head(pos, (head)))
		*pos = NULL;

	return foo->uc;
}

This works something like this.

#define for_each_uprobe_consumer_from_foo(uc, pos, head) \
	list_for_each_entry(pos, head, list) \
		if (uc = uprobe_consumer_from_foo(pos))

or, for array of *uprobe_consumer (array must be end with NULL), 

struct uprobe_consumer *get_uprobe_consumer_array(void **pos, void *head __unused)
{
	struct uprobe_consumer **uc = *pos;

	if (!*uc)
		return NULL;

	*pos = uc + 1;

	return *uc;
}

But this may not be able to support array of uprobe_consumer. Hmm.


> And so there is no easy and clean way to reset this iterator without
> adding another callback or something. At which point it becomes quite
> cumbersome and convoluted.

If you consider that is problematic, I think we can prepare more
iterator like object;

struct uprobe_consumer_iter_ops {
	struct uprobe_consumer *(*start)(struct uprobe_consumer_iter_ops *);
	struct uprobe_consumer *(*next)(struct uprobe_consumer_iter_ops *);
	void *ctx; // or, just embed the data in this structure.
};


> How about this? I'll keep the existing get_uprobe_consumer(idx, ctx)
> contract, which works for the only user right now, BPF multi-uprobes.
> When it's time to add another consumer that works with a linked list,
> we can add another more complicated contract that would do
> iterator-style callbacks. This would be used by linked list users, and
> we can transparently implement existing uprobe_register_batch()
> contract on top of if by implementing a trivial iterator wrapper on
> top of get_uprobe_consumer(idx, ctx) approach.

Agreed, anyway as far as it uses an array of uprobe_consumer, it works.
When we need to register list of the structure, we may be possible to
allocate an array or introduce new function.

Thank you!

> 
> Let's not add unnecessary complications right now given we have a
> clear path forward to add it later, if necessary, without breaking
> anything. I'll send v2 without changes to get_uprobe_consumer() for
> now, hopefully my above plan makes sense to you. Thanks!
> 
> > >
> > > Thank you,
> > >
> > > >
> > > > For now this API works well, I propose to keep it as is. For linked
> > > > list case consumers would need to allocate one extra array or pay the
> > > > price of O(N) search (which might be ok, depending on how many uprobes
> > > > are being attached). But we don't have such consumers right now,
> > > > thankfully.
> > > >
> > > > >
> > > > > Thank you,
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Thank you,
> > > > > > >
> > > > > > > --
> > > > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > > > >
> > > > >
> > > > > --
> > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > >
> > >
> > > --
> > > Masami Hiramatsu (Google) <mhiramat@kernel.org>


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
  2024-07-02  1:01                 ` Masami Hiramatsu
@ 2024-07-02  1:34                   ` Andrii Nakryiko
  2024-07-02 15:19                     ` Masami Hiramatsu
  0 siblings, 1 reply; 42+ messages in thread
From: Andrii Nakryiko @ 2024-07-02  1:34 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, oleg, peterz, mingo,
	bpf, jolsa, paulmck, clm

On Mon, Jul 1, 2024 at 6:01 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Mon, 1 Jul 2024 15:15:56 -0700
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> > On Mon, Jul 1, 2024 at 10:55 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Sat, Jun 29, 2024 at 4:30 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > > >
> > > > On Fri, 28 Jun 2024 09:34:26 -0700
> > > > Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > > On Thu, Jun 27, 2024 at 11:28 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > > > > >
> > > > > > On Thu, 27 Jun 2024 09:47:10 -0700
> > > > > > Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> > > > > >
> > > > > > > On Thu, Jun 27, 2024 at 6:04 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > > > > > > >
> > > > > > > > On Mon, 24 Jun 2024 17:21:38 -0700
> > > > > > > > Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > > > > >
> > > > > > > > > -static int __uprobe_register(struct inode *inode, loff_t offset,
> > > > > > > > > -                          loff_t ref_ctr_offset, struct uprobe_consumer *uc)
> > > > > > > > > +int uprobe_register_batch(struct inode *inode, int cnt,
> > > > > > > > > +                       uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> > > > > > > >
> > > > > > > > Is this interface just for avoiding memory allocation? Can't we just
> > > > > > > > allocate a temporary array of *uprobe_consumer instead?
> > > > > > >
> > > > > > > Yes, exactly, to avoid the need for allocating another array that
> > > > > > > would just contain pointers to uprobe_consumer. Consumers would never
> > > > > > > just have an array of `struct uprobe_consumer *`, because
> > > > > > > uprobe_consumer struct is embedded in some other struct, so the array
> > > > > > > interface isn't the most convenient.
> > > > > >
> > > > > > OK, I understand it.
> > > > > >
> > > > > > >
> > > > > > > If you feel strongly, I can do an array, but this necessitates
> > > > > > > allocating an extra array *and keeping it* for the entire duration of
> > > > > > > BPF multi-uprobe link (attachment) existence, so it feels like a
> > > > > > > waste. This is because we don't want to do anything that can fail in
> > > > > > > the detachment logic (so no temporary array allocation there).
> > > > > >
> > > > > > No need to change it, that sounds reasonable.
> > > > > >
> > > > >
> > > > > Great, thanks.
> > > > >
> > > > > > >
> > > > > > > Anyways, let me know how you feel about keeping this callback.
> > > > > >
> > > > > > IMHO, maybe the interface function is better to change to
> > > > > > `uprobe_consumer *next_uprobe_consumer(void **data)`. If caller
> > > > > > side uses a linked list of structure, index access will need to
> > > > > > follow the list every time.
> > > > >
> > > > > This would be problematic. Note how we call get_uprobe_consumer(i,
> > > > > ctx) with i going from 0 to N in multiple independent loops. So if we
> > > > > are only allowed to ask for the next consumer, then
> > > > > uprobe_register_batch and uprobe_unregister_batch would need to build
> > > > > its own internal index and remember ith instance. Which again means
> > > > > more allocations and possibly failing uprobe_unregister_batch(), which
> > > > > isn't great.
> > > >
> > > > No, I think we can use a cursor variable as;
> > > >
> > > > int uprobe_register_batch(struct inode *inode,
> > > >                  uprobe_consumer_fn get_uprobe_consumer, void *ctx)
> > > > {
> > > >         void *cur = ctx;
> > > >
> > > >         while ((uc = get_uprobe_consumer(&cur)) != NULL) {
> > > >                 ...
> > > >         }
> > > >
> > > >         cur = ctx;
> > > >         while ((uc = get_uprobe_consumer(&cur)) != NULL) {
> > > >                 ...
> > > >         }
> > > > }
> > > >
> > > > This can also remove the cnt.
> > >
> > > Ok, if you prefer this I'll switch. It's a bit more cumbersome to use
> > > for callers, but we have one right now, and might have another one, so
> > > not a big deal.
> > >
> >
> > Actually, now that I started implementing this, I really-really don't
> > like it. In the example above you assume by storing and reusing
> > original ctx value you will reset iteration to the very beginning.
> > This is not how it works in practice though. Ctx is most probably a
> > pointer to some struct somewhere with iteration state (e.g., array of
> > all uprobes + current index), and so get_uprobe_consumer() doesn't
> > update `void *ctx` itself, it updates the state of that struct.
>
> Yeah, that should be noted so that if the get_uprobe_consumer() is
> called with original `ctx` value, it should return the same.
> Ah, and I found we need to pass both ctx and pos...
>
>        while ((uc = get_uprobe_consumer(&cur, ctx)) != NULL) {
>                  ...
>          }
>
> Usually it is enough to pass the cursor as similar to the other
> for_each_* macros. For example, struct foo has .list and .uc, then
>
> struct uprobe_consumer *get_uprobe_consumer_foo(void **pos, void *head)
> {
>         struct foo *foo = *pos;
>
>         if (!foo)
>                 return NULL;
>
>         *pos = list_next_entry(foo, list);
>         if (list_is_head(pos, (head)))
>                 *pos = NULL;
>
>         return foo->uc;
> }
>
> This works something like this.
>
> #define for_each_uprobe_consumer_from_foo(uc, pos, head) \
>         list_for_each_entry(pos, head, list) \
>                 if (uc = uprobe_consumer_from_foo(pos))
>
> or, for array of *uprobe_consumer (array must be end with NULL),
>
> struct uprobe_consumer *get_uprobe_consumer_array(void **pos, void *head __unused)
> {
>         struct uprobe_consumer **uc = *pos;
>
>         if (!*uc)
>                 return NULL;
>
>         *pos = uc + 1;
>
>         return *uc;
> }
>
> But this may not be able to support array of uprobe_consumer. Hmm.
>
>
> > And so there is no easy and clean way to reset this iterator without
> > adding another callback or something. At which point it becomes quite
> > cumbersome and convoluted.
>
> If you consider that is problematic, I think we can prepare more
> iterator like object;
>
> struct uprobe_consumer_iter_ops {
>         struct uprobe_consumer *(*start)(struct uprobe_consumer_iter_ops *);
>         struct uprobe_consumer *(*next)(struct uprobe_consumer_iter_ops *);
>         void *ctx; // or, just embed the data in this structure.
> };
>

Yeah, I was thinking about something like this for adding a proper
iterator-based interface.

>
> > How about this? I'll keep the existing get_uprobe_consumer(idx, ctx)
> > contract, which works for the only user right now, BPF multi-uprobes.
> > When it's time to add another consumer that works with a linked list,
> > we can add another more complicated contract that would do
> > iterator-style callbacks. This would be used by linked list users, and
> > we can transparently implement existing uprobe_register_batch()
> > contract on top of if by implementing a trivial iterator wrapper on
> > top of get_uprobe_consumer(idx, ctx) approach.
>
> Agreed, anyway as far as it uses an array of uprobe_consumer, it works.
> When we need to register list of the structure, we may be possible to
> allocate an array or introduce new function.
>

Cool, glad we agree. What you propose above with start + next + ctx
seems like a way forward if we need this.

BTW, is this (batched register/unregister APIs) something you'd like
to use from the tracefs-based (or whatever it's called, I mean non-BPF
ones) uprobes as well? Or there is just no way to even specify a batch
of uprobes? Just curious if you had any plans for this.

> Thank you!
>
> >
> > Let's not add unnecessary complications right now given we have a
> > clear path forward to add it later, if necessary, without breaking
> > anything. I'll send v2 without changes to get_uprobe_consumer() for
> > now, hopefully my above plan makes sense to you. Thanks!
> >
> > > >
> > > > Thank you,
> > > >
> > > > >
> > > > > For now this API works well, I propose to keep it as is. For linked
> > > > > list case consumers would need to allocate one extra array or pay the
> > > > > price of O(N) search (which might be ok, depending on how many uprobes
> > > > > are being attached). But we don't have such consumers right now,
> > > > > thankfully.
> > > > >
> > > > > >
> > > > > > Thank you,
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Thank you,
> > > > > > > >
> > > > > > > > --
> > > > > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > > >
> > > >
> > > > --
> > > > Masami Hiramatsu (Google) <mhiramat@kernel.org>
>
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
  2024-07-02  1:34                   ` Andrii Nakryiko
@ 2024-07-02 15:19                     ` Masami Hiramatsu
  2024-07-02 16:53                       ` Steven Rostedt
  0 siblings, 1 reply; 42+ messages in thread
From: Masami Hiramatsu @ 2024-07-02 15:19 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, oleg, peterz, mingo,
	bpf, jolsa, paulmck, clm

On Mon, 1 Jul 2024 18:34:55 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> > > How about this? I'll keep the existing get_uprobe_consumer(idx, ctx)
> > > contract, which works for the only user right now, BPF multi-uprobes.
> > > When it's time to add another consumer that works with a linked list,
> > > we can add another more complicated contract that would do
> > > iterator-style callbacks. This would be used by linked list users, and
> > > we can transparently implement existing uprobe_register_batch()
> > > contract on top of if by implementing a trivial iterator wrapper on
> > > top of get_uprobe_consumer(idx, ctx) approach.
> >
> > Agreed, anyway as far as it uses an array of uprobe_consumer, it works.
> > When we need to register list of the structure, we may be possible to
> > allocate an array or introduce new function.
> >
> 
> Cool, glad we agree. What you propose above with start + next + ctx
> seems like a way forward if we need this.
> 
> BTW, is this (batched register/unregister APIs) something you'd like
> to use from the tracefs-based (or whatever it's called, I mean non-BPF
> ones) uprobes as well? Or there is just no way to even specify a batch
> of uprobes? Just curious if you had any plans for this.

No, because current tracefs dynamic event interface is not designed for
batched registration. I think we can expand it to pass wildcard symbols
(for kprobe and fprobe) or list of addresses (for uprobes) for uprobe.
Um, that maybe another good idea.

Thank you!

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
  2024-07-02 15:19                     ` Masami Hiramatsu
@ 2024-07-02 16:53                       ` Steven Rostedt
  2024-07-02 21:23                         ` Andrii Nakryiko
  2024-07-02 23:16                         ` Masami Hiramatsu
  0 siblings, 2 replies; 42+ messages in thread
From: Steven Rostedt @ 2024-07-02 16:53 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Andrii Nakryiko, Andrii Nakryiko, linux-trace-kernel, oleg,
	peterz, mingo, bpf, jolsa, paulmck, clm

On Wed, 3 Jul 2024 00:19:05 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> > BTW, is this (batched register/unregister APIs) something you'd like
> > to use from the tracefs-based (or whatever it's called, I mean non-BPF
> > ones) uprobes as well? Or there is just no way to even specify a batch
> > of uprobes? Just curious if you had any plans for this.  
> 
> No, because current tracefs dynamic event interface is not designed for
> batched registration. I think we can expand it to pass wildcard symbols
> (for kprobe and fprobe) or list of addresses (for uprobes) for uprobe.
> Um, that maybe another good idea.

I don't see why not. The wild cards were added to the kernel
specifically for the tracefs interface (set_ftrace_filter).

-- Steve

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
  2024-07-02 16:53                       ` Steven Rostedt
@ 2024-07-02 21:23                         ` Andrii Nakryiko
  2024-07-02 23:16                         ` Masami Hiramatsu
  1 sibling, 0 replies; 42+ messages in thread
From: Andrii Nakryiko @ 2024-07-02 21:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu (Google), Andrii Nakryiko, linux-trace-kernel,
	oleg, peterz, mingo, bpf, jolsa, paulmck, clm

On Tue, Jul 2, 2024 at 9:53 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 3 Jul 2024 00:19:05 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
>
> > > BTW, is this (batched register/unregister APIs) something you'd like
> > > to use from the tracefs-based (or whatever it's called, I mean non-BPF
> > > ones) uprobes as well? Or there is just no way to even specify a batch
> > > of uprobes? Just curious if you had any plans for this.
> >
> > No, because current tracefs dynamic event interface is not designed for
> > batched registration. I think we can expand it to pass wildcard symbols
> > (for kprobe and fprobe) or list of addresses (for uprobes) for uprobe.
> > Um, that maybe another good idea.
>
> I don't see why not. The wild cards were added to the kernel
> specifically for the tracefs interface (set_ftrace_filter).

Nice, I'd be happy to adjust batch API to work for that use case as
well (when we get there).

>
> -- Steve

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs
  2024-07-02 16:53                       ` Steven Rostedt
  2024-07-02 21:23                         ` Andrii Nakryiko
@ 2024-07-02 23:16                         ` Masami Hiramatsu
  1 sibling, 0 replies; 42+ messages in thread
From: Masami Hiramatsu @ 2024-07-02 23:16 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andrii Nakryiko, Andrii Nakryiko, linux-trace-kernel, oleg,
	peterz, mingo, bpf, jolsa, paulmck, clm

On Tue, 2 Jul 2024 12:53:20 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 3 Jul 2024 00:19:05 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> 
> > > BTW, is this (batched register/unregister APIs) something you'd like
> > > to use from the tracefs-based (or whatever it's called, I mean non-BPF
> > > ones) uprobes as well? Or there is just no way to even specify a batch
> > > of uprobes? Just curious if you had any plans for this.  
> > 
> > No, because current tracefs dynamic event interface is not designed for
> > batched registration. I think we can expand it to pass wildcard symbols
> > (for kprobe and fprobe) or list of addresses (for uprobes) for uprobe.
> > Um, that maybe another good idea.
> 
> I don't see why not. The wild cards were added to the kernel
> specifically for the tracefs interface (set_ftrace_filter).

Sorry for mislead you, I meant current "dynamic_events" interface does not
support the wildcard places.
And I agree that we can update it to support something like

 p:multi_uprobe 0x1234,0x2234,0x3234@/bin/foo $arg1 $arg2 $arg3

(note: kernel does not read the symbols in user binary)

Thank you,


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2024-07-02 23:16 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-25  0:21 [PATCH 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore Andrii Nakryiko
2024-06-25  0:21 ` [PATCH 01/12] uprobes: update outdated comment Andrii Nakryiko
2024-06-25  0:21 ` [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe() Andrii Nakryiko
2024-06-25  1:29   ` Masami Hiramatsu
2024-06-25 14:49     ` Oleg Nesterov
2024-06-25 17:37       ` Andrii Nakryiko
2024-06-25 19:07         ` Oleg Nesterov
2024-06-26 16:38           ` Andrii Nakryiko
2024-06-25 10:50   ` Oleg Nesterov
2024-06-25  0:21 ` [PATCH 03/12] uprobes: simplify error handling for alloc_uprobe() Andrii Nakryiko
2024-06-25  0:21 ` [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
2024-06-25 14:44   ` Oleg Nesterov
2024-06-25 17:30     ` Andrii Nakryiko
2024-06-26  6:02   ` kernel test robot
2024-06-26 16:39     ` Andrii Nakryiko
2024-06-27  2:29   ` Masami Hiramatsu
2024-06-27 16:43     ` Andrii Nakryiko
2024-07-01 21:59       ` Andrii Nakryiko
2024-06-25  0:21 ` [PATCH 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer Andrii Nakryiko
2024-06-27  3:06   ` Masami Hiramatsu
2024-06-25  0:21 ` [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs Andrii Nakryiko
2024-06-26 11:27   ` Jiri Olsa
2024-06-26 16:44     ` Andrii Nakryiko
2024-06-27 13:04   ` Masami Hiramatsu
2024-06-27 16:47     ` Andrii Nakryiko
2024-06-28  6:28       ` Masami Hiramatsu
2024-06-28 16:34         ` Andrii Nakryiko
2024-06-29 23:30           ` Masami Hiramatsu
2024-07-01 17:55             ` Andrii Nakryiko
2024-07-01 22:15               ` Andrii Nakryiko
2024-07-02  1:01                 ` Masami Hiramatsu
2024-07-02  1:34                   ` Andrii Nakryiko
2024-07-02 15:19                     ` Masami Hiramatsu
2024-07-02 16:53                       ` Steven Rostedt
2024-07-02 21:23                         ` Andrii Nakryiko
2024-07-02 23:16                         ` Masami Hiramatsu
2024-06-25  0:21 ` [PATCH 07/12] uprobes: inline alloc_uprobe() logic into __uprobe_register() Andrii Nakryiko
2024-06-25  0:21 ` [PATCH 08/12] uprobes: split uprobe allocation and uprobes_tree insertion steps Andrii Nakryiko
2024-06-25  0:21 ` [PATCH 09/12] uprobes: batch uprobes_treelock during registration Andrii Nakryiko
2024-06-25  0:21 ` [PATCH 10/12] uprobes: improve lock batching for uprobe_unregister_batch Andrii Nakryiko
2024-06-25  0:21 ` [PATCH 11/12] uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes Andrii Nakryiko
2024-06-25  0:21 ` [PATCH 12/12] uprobes: switch uprobes_treelock to per-CPU RW semaphore Andrii Nakryiko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).