netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Beau Belgrave <beaub@linux.microsoft.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Song Liu <song@kernel.org>, Steven Rostedt <rostedt@goodmis.org>,
	Masami Hiramatsu <mhiramat@kernel.org>,
	linux-trace-devel <linux-trace-devel@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>, bpf <bpf@vger.kernel.org>,
	Network Development <netdev@vger.kernel.org>,
	linux-arch <linux-arch@vger.kernel.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Subject: Re: [PATCH] tracing/user_events: Add eBPF interface for user_event created events
Date: Wed, 30 Mar 2022 14:27:08 -0700	[thread overview]
Message-ID: <20220330212708.GA2759@kbox> (raw)
In-Reply-To: <CAADnVQL9T=zDbO04-s+WCtNRETs+mC=oDRwv8UUJYiJ7Dv1owA@mail.gmail.com>

On Wed, Mar 30, 2022 at 01:39:49PM -0700, Alexei Starovoitov wrote:
> On Wed, Mar 30, 2022 at 12:15 PM Beau Belgrave
> <beaub@linux.microsoft.com> wrote:
> >
> > On Wed, Mar 30, 2022 at 11:22:32AM -0700, Alexei Starovoitov wrote:
> > > On Wed, Mar 30, 2022 at 9:34 AM Beau Belgrave <beaub@linux.microsoft.com> wrote:
> > > > > >
> > > > > > But you are fine with uprobe costs? uprobes appear to be much more costly
> > > > > > than a syscall approach on the hardware I've run on.
> > >
> > > Care to share the numbers?
> > > uprobe over USDT is a single trap.
> > > Not much slower compared to syscall with kpti.
> > >
> >
> > Sure, these are the numbers we have from a production device.
> >
> > They are captured via perf via PERF_COUNT_HW_CPU_CYCLES.
> > It's running a 20K loop emitting 4 bytes of data out.
> > Each 4 byte event time is recorded via perf.
> > At the end we have the total time and the max seen.
> >
> > null numbers represent a 20K loop with just perf start/stop ioctl costs.
> >
> > null: min=2863, avg=2953, max=30815
> > uprobe: min=10994, avg=11376, max=146682
> 
> I suspect it's a 3 trap case of uprobe.
> USDT is a nop. It's a 1 trap case.
> 
> > uevent: min=7043, avg=7320, max=95396
> > lttng: min=6270, avg=6508, max=41951
> >
> > These costs include the data getting into a buffer, so they represent
> > what we would see in production vs the trap cost alone. For uprobe this
> > means we created a uprobe and attached it via tracefs to get the above
> > numbers.
> >
> > There also seems to be some thinking around this as well from Song Liu.
> > Link: https://lore.kernel.org/lkml/20200801084721.1812607-1-songliubraving@fb.com/
> >
> > From the link:
> > 1. User programs are faster. The new selftest added in 5/5, shows that a
> >    simple uprobe program takes 1400 nanoseconds, while user program only
> >       takes 300 nanoseconds.
> 
> 
> Take a look at Song's code. It's 2 trap case.
> The USDT is a half of that. ~700ns.
> Compared to 300ns of syscall that difference
> could be acceptable.
> 
> >
> > > > >
> > > > > Can we achieve the same/similar performance with sys_bpf(BPF_PROG_RUN)?
> > > > >
> > > >
> > > > I think so, the tough part is how do you let the user-space know which
> > > > program is attached to run? In the current code this is done by the BPF
> > > > program attaching to the event via perf and we run the one there if
> > > > any when data is emitted out via write calls.
> > > >
> > > > I would want to make sure that operators can decide where the user-space
> > > > data goes (perf/ftrace/eBPF) after the code has been written. With the
> > > > current code this is done via the tracepoint callbacks that perf/ftrace
> > > > hook up when operators enable recording via perf, tracefs, libbpf, etc.
> > > >
> > > > We have managed code (C#/Java) where we cannot utilize stubs or traps
> > > > easily due to code movement. So we are limited in how we can approach
> > > > this problem. Having the interface be mmap/write has enabled this
> > > > for us, since it's easy to interact with in most languages and gives us
> > > > lifetime management of the trace objects between user-space and the
> > > > kernel.
> > >
> > > Then you should probably invest into making USDT work inside
> > > java applications instead of reinventing the wheel.
> > >
> > > As an alternative you can do a dummy write or any other syscall
> > > and attach bpf on the kernel side.
> > > No kernel changes are necessary.
> >
> > We only want syscall/tracing overheads for the specific events that are
> > hooked. I don't see how we could hook up a dummy write that is unique
> > per-event without having a way to know when the event is being traced.
> 
> You're adding writev-s to user apps. Keep that writev without
> any user_events on the kernel side and pass -1 as FD.
> Hook bpf prog to sys_writev and filter by pid.

I see. That would have all events incur a syscall cost regardless if a
BPF program is attached or not. We are typically monitoring all processes
so we would not want that overhead on each writev invocation.

We would also have to decode each writev payload to determine if it's
the event we are interested in. The mmap part of user_events solves that
part for us, the byte/bits get set to non-zero when the writev cost is
worth it.

Thanks,
-Beau

  reply	other threads:[~2022-03-30 21:27 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-29 18:19 [PATCH] tracing/user_events: Add eBPF interface for user_event created events Beau Belgrave
2022-03-29 19:50 ` Alexei Starovoitov
2022-03-29 20:10   ` Beau Belgrave
2022-03-29 22:31     ` Alexei Starovoitov
2022-03-29 23:11       ` Beau Belgrave
2022-03-29 23:54         ` Alexei Starovoitov
2022-03-30 16:06         ` Song Liu
2022-03-30 16:34           ` Beau Belgrave
2022-03-30 18:22             ` Alexei Starovoitov
2022-03-30 19:15               ` Beau Belgrave
2022-03-30 19:57                 ` Mathieu Desnoyers
2022-03-30 21:24                   ` Beau Belgrave
2022-03-30 20:39                 ` Alexei Starovoitov
2022-03-30 21:27                   ` Beau Belgrave [this message]
2022-03-30 21:57                     ` Alexei Starovoitov
2022-03-30  5:22       ` Masami Hiramatsu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220330212708.GA2759@kbox \
    --to=beaub@linux.microsoft.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=bpf@vger.kernel.org \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-trace-devel@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mhiramat@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=rostedt@goodmis.org \
    --cc=song@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).