    Subject: Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
    From: Andrii Nakryiko
    Date: 2024-07-03 21:33 UTC
    To: Andrii Nakryiko
    Cc: linux-trace-kernel, rostedt, mhiramat, oleg, peterz, mingo, bpf,
        jolsa, paulmck, clm, open list
    
    On Mon, Jul 1, 2024 at 3:39 PM Andrii Nakryiko <andrii@kernel.org> wrote:
    >
    > This patch set, ultimately, switches global uprobes_treelock from RW spinlock
    > to per-CPU RW semaphore, which has better performance and scales better under
    > contention and multiple parallel threads triggering lots of uprobes.
    >
    > To make this work well with attaching multiple uprobes (through BPF
    > multi-uprobe), we need to add batched versions of uprobe register/unregister
    > APIs. This is what most of the patch set is actually doing. The actual switch
    > to per-CPU RW semaphore is trivial after that and is done in the very last
    > patch #12. See commit message with some comparison numbers.
    >
    
    Peter,
    
    I think I've addressed all the questions so far, but I wanted to take
    a moment and bring all the discussions into a single place, summarize
    what I think are the main points of contention and hopefully make some
    progress, or at least get us to a bit more constructive discussion
    where *both sides* provide arguments. Right now there is a lot of "you
    are doing X, but why don't you just do Y" with no argument for a) why
    X is bad/wrong/inferior and b) why Y is better (and not just
    equivalent or, even worse, inferior).
    
    I trust you have the best intentions in mind for this piece of kernel
    infrastructure, as do I, so let's try to find a path forward.
    
    1. Strategically, uprobes/uretprobes have to be improved. Customers
    complain more and more that "uprobes are slow", justifiably so.
    Single-threaded performance matters, but so does, critically, uprobes
    scalability. I.e., if the kernel can handle N uprobes per second on a
    single uncontended CPU, then triggering uprobes across M CPUs should,
    ideally and roughly, give us about N * M total throughput.
    
    This doesn't seem controversial, but I wanted to make it clear that
    this is the end goal of my work. And no, this patch set alone doesn't,
    yet, get us there. But it's a necessary step, IMO. Jiri Olsa took
    single-threaded performance and is improving it with sys_uretprobe and
    soon sys_uprobe, while I'm looking into scalability and other smaller
    single-threaded wins, where possible.
    
    2. More tactically, RCU protection seems like the best way forward. We
    got hung up on SRCU vs RCU Tasks Trace. Thanks to Paul, we also
    clarified that RCU Tasks Trace has nothing to do with Tasks Rude
    flavor (whatever that is, I have no idea).
    
    Now, RCU Tasks Trace was specifically designed for lowest-overhead
    hot path (reader side) performance, at the expense of slowing down
    much rarer writers. My microbenchmarking does show at least a 5%
    difference. Both flavors can handle sleepable uprobes waiting for page
    faults. The Tasks Trace flavor is already used for tracing in the BPF
    realm, including for sleepable uprobes, and works well. It's not going
    away.
    
    Now, you keep pushing for SRCU instead of RCU Tasks Trace, but I
    haven't seen a single argument why. Please provide that, or let's
    stick to RCU Tasks Trace, because the uprobe use case is exactly the
    kind of workload the Tasks Trace flavor was designed for.
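
    To make the reader-side point concrete, here is a minimal sketch (not
    the actual patch code) of what the hot path could look like under RCU
    Tasks Trace; find_uprobe_rcu() and handle_swbp_sketch() are made-up
    names, purely for illustration:

    #include <linux/rcupdate_trace.h>

    /*
     * Hot path: runs on every uprobe hit, so the reader side has to be as
     * cheap as possible. Tasks Trace readers are allowed to block, so
     * sleepable consumers taking page faults are fine here.
     */
    static void handle_swbp_sketch(struct pt_regs *regs, unsigned long bp_vaddr)
    {
            struct uprobe *uprobe;

            rcu_read_lock_trace();
            uprobe = find_uprobe_rcu(bp_vaddr);     /* hypothetical RCU-protected lookup */
            if (uprobe)
                    handler_chain(uprobe, regs);    /* run registered consumers */
            rcu_read_unlock_trace();
    }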
    
    3. Regardless of the RCU flavor, once we rely on RCU protection we
    have to add batched register/unregister APIs, so that we can amortize
    the RCU synchronization cost during deregistration. Can we please
    agree on that as well? This is
    the main goal of this patch set and I'd like to land it before working
    further on changing and improving the rest of the locking schema.
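
    To spell the amortization out in code form, a hand-wavy sketch (the
    function name and structure below are illustrative only, not
    necessarily the exact API shape in the patches; delete_uprobe() and
    put_uprobe() are the existing uprobes.c helpers):

    #include <linux/rcupdate_trace.h>

    /* pay one grace period for the whole batch instead of one per uprobe */
    static void uprobe_unregister_batch_sketch(struct uprobe **uprobes, int cnt)
    {
            int i;

            for (i = 0; i < cnt; i++)
                    delete_uprobe(uprobes[i]);      /* unlink from lookup structures */

            synchronize_rcu_tasks_trace();          /* single sync for all cnt uprobes */

            for (i = 0; i < cnt; i++)
                    put_uprobe(uprobes[i]);         /* now safe to drop the references */
    }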
    
    I won't be happy about it, but just to move things forward, I can drop
    a) custom refcounting and/or b) percpu RW semaphore. Both are
    beneficial but not essential for the batched APIs to work. But if you
    force me to do that, please state your reasons/arguments clearly. No
    one has yet pointed out why the refcounting is broken or why the
    percpu RW semaphore
    is bad. On the contrary, Ingo Molnar did suggest percpu RW semaphore
    in the first place (see [0]), but we postponed it due to the lack of
    batched APIs, and promised to do this work. Here I am, doing the
    promised work. Not purely because of percpu RW semaphore, but
    benefiting from it just as well.
    
      [0] https://lore.kernel.org/linux-trace-kernel/Zf+d9twfyIDosINf@gmail.com/
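
    For reference, the treelock part of the switch itself is small; a
    rough sketch (the *_sketch wrappers are made-up names and the actual
    patch touches more call sites; __find_uprobe()/__insert_uprobe() are
    the existing rb-tree internals):

    #include <linux/percpu-rwsem.h>

    static DEFINE_STATIC_PERCPU_RWSEM(uprobes_treelock);

    /* hot path lookups take the semaphore for read... */
    static struct uprobe *find_uprobe_sketch(struct inode *inode, loff_t offset)
    {
            struct uprobe *uprobe;

            percpu_down_read(&uprobes_treelock);
            uprobe = __find_uprobe(inode, offset);  /* existing rb-tree lookup */
            percpu_up_read(&uprobes_treelock);

            return uprobe;
    }

    /* ...while (un)registration takes it for write */
    static void insert_uprobe_sketch(struct uprobe *uprobe)
    {
            percpu_down_write(&uprobes_treelock);
            __insert_uprobe(uprobe);                /* existing insert; return value handling omitted */
            percpu_up_write(&uprobes_treelock);
    }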
    
    4. Another tactical thing, but an important one: the refcounting
    schema for uprobes. I've replied already, but I think refcounting is
    unavoidable for uretprobes, and the current refcounting schema is
    problematic for batched APIs because of the race between finding a
    uprobe and the remaining possibility that we'd need to undo all of
    that and retry.
    
    I think the main thing is to agree to change refcounting to avoid this
    race, allowing for simpler batched registration. Hopefully we can
    agree on that.
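
    Schematically, the race/retry problem with a try-get during batched
    registration looks roughly like this (all helper and struct names
    below are made up purely for illustration):

    /*
     * With a refcount_inc_not_zero()-style try-get, the lookup can find a
     * uprobe whose count already hit zero (it is being freed), the get
     * fails, and everything prepared so far for the batch has to be
     * unwound and retried. An always-successful get removes this failure
     * path entirely.
     */
    static int register_batch_sketch(struct uprobe_desc *descs, int cnt)
    {
            int i;

            for (i = 0; i < cnt; i++) {
                    struct uprobe *u = lookup_uprobe(&descs[i]);    /* made-up lookup */

                    if (u && !try_get_uprobe(u)) {          /* can fail under the old schema */
                            unwind_batch(descs, i);         /* undo the i entries done so far */
                            return -EAGAIN;                 /* caller retries the whole batch */
                    }
                    descs[i].uprobe = u;
            }
            return 0;
    }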
    
    But there is also refcount_inc_not_zero(), which is another limiting
    factor for scalability (see above about the end goal of scalability),
    vs the atomic64_add()-based epoch+refcount approach I took, which is
    noticeably better on x86-64 and, as far as I can tell, doesn't hurt
    any other architecture. I think the latter could be generalized as an
    alternative flavor of refcount_t, but I'd prefer to land it in uprobes
    in its current shape, and if we think it's a good idea to generalize,
    we can always do that refactoring once things stabilize a bit.
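
    To show the shape of that difference, a very rough illustration of the
    idea (not the exact scheme from the patch; the field and macro names
    are made up, and all the subtlety of detecting the count hitting zero
    in put while racing with get is omitted):

    #include <linux/atomic.h>

    /*
     * Pack a 32-bit epoch (upper half) and a 32-bit refcount (lower half)
     * into a single atomic64_t. "Get" then becomes one unconditional
     * atomic64_add() with no failure path and no cmpxchg retry loop, in
     * contrast to refcount_inc_not_zero().
     */
    #define UPROBE_GET_SKETCH       ((1LL << 32) | 1)       /* +1 epoch, +1 count */

    static struct uprobe *get_uprobe_sketch(struct uprobe *uprobe)
    {
            atomic64_add(UPROBE_GET_SKETCH, &uprobe->ref64);        /* always succeeds */
            return uprobe;
    }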
    
    You seem to have problems with the refcounting implementation I did
    (besides overflow detection, which I'll address in the next revision,
    so that's not a problem). My arguments for it are a) performance and
    b) that it's well contained within the get/put helpers and doesn't
    leak outside of them *at all*, while providing a nice
    always-successful get_uprobe() primitive.
    
    Can I please hear the arguments for not doing it, besides "Everyone is
    using refcount_inc_not_zero", which isn't much of a reason (we'd never
    do anything novel in the kernel if that was a good enough reason to
    not do something new).
    
    Again, thanks for engaging, I do appreciate it. But let's try to
    move this forward. Thanks!
    
