[PATCH v3 00/13] uprobes: RCU-protected hot path optimizations

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Andrii Nakryiko <andrii@kernel.org>
To: linux-trace-kernel@vger.kernel.org, peterz@infradead.org,
	oleg@redhat.com
Cc: rostedt@goodmis.org, mhiramat@kernel.org, bpf@vger.kernel.org,
	linux-kernel@vger.kernel.org, jolsa@kernel.org,
	paulmck@kernel.org, willy@infradead.org, surenb@google.com,
	akpm@linux-foundation.org, linux-mm@kvack.org,
	Andrii Nakryiko <andrii@kernel.org>
Subject: [PATCH v3 00/13] uprobes: RCU-protected hot path optimizations
Date: Mon, 12 Aug 2024 21:29:04 -0700	[thread overview]
Message-ID: <20240813042917.506057-1-andrii@kernel.org> (raw)

( In addition to previously posted first 8 patches, I'm sending 5 more as an
RFC for people to get the general gist of where this work heading and what
uprobe performance is now achievable. I think first 8 patches are ready to be
applied and I'd appreciate early feedback on the remaining 5 ones.

Patches #9 and #10 implement SRCU-based refcounting avoidance for
return_instance and single-stepped uprobes, and lockless SRCU expiration by
timeout. Please look at hprobe_*() helpers for most of the details.

Last three patches are implementing lockless VMA -> uprobe resolution logic by
relying on a) SLAB_TYPESAFE_BY_RCU protection for files_cachep and b) on
Suren's mm_lock_seq changes. Note benchmarking numbers in the last patch.

The remainder of cover letter is written as applied to only first 8 patches. )

This patch set is heavily inspired by Peter Zijlstra's uprobe optimization
patches ([0]) and continue that work, albeit trying to keep complexity to the
minimum, and attepting to reuse existing primitives as much as possible. The
goal here is to optimize obvious uprobe triggering hot path, while keeping the
rest of locking mostly intact.

I've added uprobe_unregister_sync() into the error handling code path inside
uprobe_unregister(). This is due to recent refactorings from Oleg Nesterov
([1]), which necessitates this addition.

Except for refcounting change patch (which I stongly believe is a good
improvement we should do and forget about quasi-refcounting schema of
uprobe->consumers list), the rest of the changes are similar to Peter's
initial changes in [0].

Main differences would be:
  - no special RCU protection for mmap and fork handling, we just stick to
    refcounts there, as those are infrequent and not performance-sensitive
    code, while being complex and thus benefiting from proper locking;
  - the above means we don't need to do any custom SRCU additions to handle
    forking code path;
  - I handled UPROBE_HANDLER_REMOVE problem in handler_chain() differently,
    again, leveraging existing locking scheam;
  - I kept refcount usage for uretprobe and single-stepping uprobes, I plan to
    address that in a separate follow up patches. The plan is to avoid
    task_work, but I need to sit down and write and test the code.
  - finally, I dutifully was using SRCU throughout all the changes, and only
    last patch switches SRCU to RCU Tasks Trace and demonstrates significant
    performance and scalability gains from this.

The changes in this patch set were tested using BPF selftests and using
uprobe-stress ([2]) tool. One recent BPF selftest (uprobe_multi/consumers),
only recently added by Jiri Olsa will need a single-line adjustment to the
counting logic, but the patch itself is in bpf-next/master, so we'll have to
address that once linux-trace or tip and bpf-next trees merge. I'll take care
of that when this happens.

Now, for the benchmarking results. I've used the following script (which
utilizes BPF selftests-based bench tool). The CPU used was 80-core Intel Xeon
Gold 6138 CPU @ 2.00GHz running kernel with production-like config. I minimized
background noise by stopping any service I could identify and stop, so results
are pretty stable and variability is pretty small, overall.

Benchmark script:

#!/bin/bash

set -eufo pipefail

for i in uprobe-nop uretprobe-nop; do
    for p in 1 2 4 8 16 32 64; do
        summary=$(sudo ./bench -w3 -d5 -p$p -a trig-$i | tail -n1)
        total=$(echo "$summary" | cut -d'(' -f1 | cut -d' ' -f3-)
        percpu=$(echo "$summary" | cut -d'(' -f2 | cut -d')' -f1 | cut -d'/' -f1)
        printf "%-15s (%2d cpus): %s (%s/s/cpu)\n" $i $p "$total" "$percpu"
    done
    echo
done

With all the lock-avoiding changes done in this patch set, we get a pretty
decent improvement in performance and scalability of uprobes with number of
CPUs, even though we are still nowhere near linear scalability. This is due to
the remaining mmap_lock, which is currently taken to resolve interrupt address
to inode+offset and then uprobe instance. And, of course, uretprobes still need
similar RCU to avoid refcount in the hot path, which will be addressed in the
follow up patches. (Again, note, I left the benchmark numbers with lockless
RB-tree patches in.)

BASELINE (on top of Oleg's clean up patches)
============================================
uprobe-nop      ( 1 cpus):    3.032 ± 0.023M/s  (  3.032M/s/cpu)
uprobe-nop      ( 2 cpus):    3.452 ± 0.005M/s  (  1.726M/s/cpu)
uprobe-nop      ( 4 cpus):    3.663 ± 0.005M/s  (  0.916M/s/cpu)
uprobe-nop      ( 8 cpus):    3.718 ± 0.038M/s  (  0.465M/s/cpu)
uprobe-nop      (16 cpus):    3.344 ± 0.008M/s  (  0.209M/s/cpu)
uprobe-nop      (32 cpus):    2.288 ± 0.021M/s  (  0.071M/s/cpu)
uprobe-nop      (64 cpus):    3.205 ± 0.004M/s  (  0.050M/s/cpu)

uretprobe-nop   ( 1 cpus):    1.979 ± 0.005M/s  (  1.979M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.361 ± 0.005M/s  (  1.180M/s/cpu)
uretprobe-nop   ( 4 cpus):    2.309 ± 0.002M/s  (  0.577M/s/cpu)
uretprobe-nop   ( 8 cpus):    2.253 ± 0.001M/s  (  0.282M/s/cpu)
uretprobe-nop   (16 cpus):    2.007 ± 0.000M/s  (  0.125M/s/cpu)
uretprobe-nop   (32 cpus):    1.624 ± 0.003M/s  (  0.051M/s/cpu)
uretprobe-nop   (64 cpus):    2.149 ± 0.001M/s  (  0.034M/s/cpu)

Up to second-to-last patch (i.e., SRCU-based optimizations)
===========================================================
uprobe-nop      ( 1 cpus):    3.276 ± 0.005M/s  (  3.276M/s/cpu)
uprobe-nop      ( 2 cpus):    4.125 ± 0.002M/s  (  2.063M/s/cpu)
uprobe-nop      ( 4 cpus):    7.713 ± 0.002M/s  (  1.928M/s/cpu)
uprobe-nop      ( 8 cpus):    8.097 ± 0.006M/s  (  1.012M/s/cpu)
uprobe-nop      (16 cpus):    6.501 ± 0.056M/s  (  0.406M/s/cpu)
uprobe-nop      (32 cpus):    4.398 ± 0.084M/s  (  0.137M/s/cpu)
uprobe-nop      (64 cpus):    6.452 ± 0.000M/s  (  0.101M/s/cpu)

uretprobe-nop   ( 1 cpus):    2.055 ± 0.001M/s  (  2.055M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.677 ± 0.000M/s  (  1.339M/s/cpu)
uretprobe-nop   ( 4 cpus):    4.561 ± 0.003M/s  (  1.140M/s/cpu)
uretprobe-nop   ( 8 cpus):    5.291 ± 0.002M/s  (  0.661M/s/cpu)
uretprobe-nop   (16 cpus):    5.065 ± 0.019M/s  (  0.317M/s/cpu)
uretprobe-nop   (32 cpus):    3.622 ± 0.003M/s  (  0.113M/s/cpu)
uretprobe-nop   (64 cpus):    3.723 ± 0.002M/s  (  0.058M/s/cpu)

RCU Tasks Trace
===============
uprobe-nop      ( 1 cpus):    3.396 ± 0.002M/s  (  3.396M/s/cpu)
uprobe-nop      ( 2 cpus):    4.271 ± 0.006M/s  (  2.135M/s/cpu)
uprobe-nop      ( 4 cpus):    8.499 ± 0.015M/s  (  2.125M/s/cpu)
uprobe-nop      ( 8 cpus):   10.355 ± 0.028M/s  (  1.294M/s/cpu)
uprobe-nop      (16 cpus):    7.615 ± 0.099M/s  (  0.476M/s/cpu)
uprobe-nop      (32 cpus):    4.430 ± 0.007M/s  (  0.138M/s/cpu)
uprobe-nop      (64 cpus):    6.887 ± 0.020M/s  (  0.108M/s/cpu)

uretprobe-nop   ( 1 cpus):    2.174 ± 0.001M/s  (  2.174M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.853 ± 0.001M/s  (  1.426M/s/cpu)
uretprobe-nop   ( 4 cpus):    4.913 ± 0.002M/s  (  1.228M/s/cpu)
uretprobe-nop   ( 8 cpus):    5.883 ± 0.002M/s  (  0.735M/s/cpu)
uretprobe-nop   (16 cpus):    5.147 ± 0.001M/s  (  0.322M/s/cpu)
uretprobe-nop   (32 cpus):    3.738 ± 0.008M/s  (  0.117M/s/cpu)
uretprobe-nop   (64 cpus):    4.397 ± 0.002M/s  (  0.069M/s/cpu)

For baseline vs SRCU, peak througput increased from 3.7 M/s (million uprobe
triggerings per second) up to about 8 M/s. For uretprobes it's a bit more
modest with bump from 2.4 M/s to 5 M/s.

For SRCU vs RCU Tasks Trace, peak throughput for uprobes increases further from
8 M/s to 10.3 M/s (+28%!), and for uretprobes from 5.3 M/s to 5.8 M/s (+11%),
as we have more work to do on uretprobes side.

Even single-thread (no contention) performance is slightly better: 3.276 M/s to
3.396 M/s (+3.5%) for uprobes, and 2.055 M/s to 2.174 M/s (+5.8%)
for uretprobes.

  [0] https://lore.kernel.org/linux-trace-kernel/20240711110235.098009979@infradead.org/
  [1] https://lore.kernel.org/linux-trace-kernel/20240729134444.GA12293@redhat.com/
  [2] https://github.com/libbpf/libbpf-bootstrap/tree/uprobe-stress

v2->v3:
  - undid rcu and rb_node fields colocation which were causing crashes (Oleg);
  - ensure synchronize_srcu() on registration failure in patch #4 (Oleg);
v1->v2:
  - added back missed kfree() in patch #1 (Oleg);
  - forgot the rest, but there were a few small things here and there.

Andrii Nakryiko (10):
  uprobes: revamp uprobe refcounting and lifetime management
  uprobes: protected uprobe lifetime with SRCU
  uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks
  uprobes: travers uprobe's consumer list locklessly under SRCU
    protection
  uprobes: perform lockless SRCU-protected uprobes_tree lookup
  uprobes: switch to RCU Tasks Trace flavor for better performance
  uprobes: SRCU-protect uretprobe lifetime (with timeout)
  uprobes: implement SRCU-protected lifetime for single-stepped uprobe
  mm: add SLAB_TYPESAFE_BY_RCU to files_cache
  uprobes: add speculative lockless VMA to inode resolution

Peter Zijlstra (2):
  perf/uprobe: split uprobe_unregister()
  rbtree: provide rb_find_rcu() / rb_find_add_rcu()

Suren Baghdasaryan (1):
  mm: introduce mmap_lock_speculation_{start|end}

 include/linux/mm_types.h                      |   3 +
 include/linux/mmap_lock.h                     |  53 +-
 include/linux/rbtree.h                        |  67 ++
 include/linux/uprobes.h                       |  73 +-
 kernel/events/uprobes.c                       | 733 +++++++++++++-----
 kernel/fork.c                                 |   7 +-
 kernel/trace/bpf_trace.c                      |   8 +-
 kernel/trace/trace_uprobe.c                   |  15 +-
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |   3 +-
 9 files changed, 722 insertions(+), 240 deletions(-)

-- 
2.43.5

next             reply	other threads:[~2024-08-13  4:29 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-13  4:29 Andrii Nakryiko [this message]
2024-08-13  4:29 ` [PATCH v3 01/13] uprobes: revamp uprobe refcounting and lifetime management Andrii Nakryiko
2024-08-13  4:29 ` [PATCH v3 02/13] uprobes: protected uprobe lifetime with SRCU Andrii Nakryiko
2024-08-13  4:29 ` [PATCH v3 03/13] uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks Andrii Nakryiko
2024-08-13  4:29 ` [PATCH v3 04/13] uprobes: travers uprobe's consumer list locklessly under SRCU protection Andrii Nakryiko
2024-08-22 14:22   ` Jiri Olsa
2024-08-22 16:59     ` Andrii Nakryiko
2024-08-22 17:35       ` Jiri Olsa
2024-08-22 17:51         ` Andrii Nakryiko
2024-08-13  4:29 ` [PATCH v3 05/13] perf/uprobe: split uprobe_unregister() Andrii Nakryiko
2024-08-13  4:29 ` [PATCH v3 06/13] rbtree: provide rb_find_rcu() / rb_find_add_rcu() Andrii Nakryiko
2024-08-13  4:29 ` [PATCH v3 07/13] uprobes: perform lockless SRCU-protected uprobes_tree lookup Andrii Nakryiko
2024-08-13  4:29 ` [PATCH v3 08/13] uprobes: switch to RCU Tasks Trace flavor for better performance Andrii Nakryiko
2024-08-13  4:29 ` [PATCH RFC v3 09/13] uprobes: SRCU-protect uretprobe lifetime (with timeout) Andrii Nakryiko
2024-08-19 13:41   ` Oleg Nesterov
2024-08-19 20:34     ` Andrii Nakryiko
2024-08-20 15:05       ` Oleg Nesterov
2024-08-20 18:01         ` Andrii Nakryiko
2024-08-13  4:29 ` [PATCH RFC v3 10/13] uprobes: implement SRCU-protected lifetime for single-stepped uprobe Andrii Nakryiko
2024-08-13  4:29 ` [PATCH RFC v3 11/13] mm: introduce mmap_lock_speculation_{start|end} Andrii Nakryiko
2024-08-13  4:29 ` [PATCH RFC v3 12/13] mm: add SLAB_TYPESAFE_BY_RCU to files_cache Andrii Nakryiko
2024-08-13  6:07   ` Mateusz Guzik
2024-08-13 14:49     ` Suren Baghdasaryan
2024-08-13 18:15       ` Andrii Nakryiko
2024-08-13  4:29 ` [PATCH RFC v3 13/13] uprobes: add speculative lockless VMA to inode resolution Andrii Nakryiko
2024-08-13  6:17   ` Mateusz Guzik
2024-08-13 15:36     ` Suren Baghdasaryan
2024-08-15 13:44       ` Mateusz Guzik
2024-08-15 16:47         ` Andrii Nakryiko
2024-08-15 17:45           ` Suren Baghdasaryan
2024-08-15 18:24             ` Mateusz Guzik
2024-08-15 18:58             ` Jann Horn
2024-08-15 19:07               ` Mateusz Guzik
2024-08-15 19:17                 ` Arnaldo Carvalho de Melo
2024-08-15 19:18                   ` Arnaldo Carvalho de Melo
2024-08-15 19:44               ` Suren Baghdasaryan
2024-08-15 20:17               ` Andrii Nakryiko
2024-08-15 13:24 ` [PATCH v3 00/13] uprobes: RCU-protected hot path optimizations Oleg Nesterov
2024-08-15 16:49   ` Andrii Nakryiko
2024-08-21 16:41     ` Andrii Nakryiko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240813042917.506057-1-andrii@kernel.org \
    --to=andrii@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=bpf@vger.kernel.org \
    --cc=jolsa@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-trace-kernel@vger.kernel.org \
    --cc=mhiramat@kernel.org \
    --cc=oleg@redhat.com \
    --cc=paulmck@kernel.org \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=surenb@google.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).