All of lore.kernel.org
 help / color / mirror / Atom feed
From: Arnaldo Carvalho de Melo <acme@kernel.org>
To: Ian Rogers <irogers@google.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>,
	Howard Chu <howardchu95@gmail.com>,
	adrian.hunter@intel.com, jolsa@kernel.org,
	kan.liang@linux.intel.com, namhyung@kernel.org,
	linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 01/10] perf trace: Fix perf trace -p <PID>
Date: Fri, 16 Aug 2024 11:52:36 -0300	[thread overview]
Message-ID: <Zr9ntEtIvseYwG90@x1> (raw)
In-Reply-To: <CAP-5=fXCt4axmdjqDZXZzjW-gZK_KZ8R--OTR-M1sZYxZ94k5A@mail.gmail.com>

On Thu, Aug 15, 2024 at 11:28:17AM -0700, Ian Rogers wrote:
> On Wed, Aug 14, 2024 at 6:36 PM Howard Chu <howardchu95@gmail.com> wrote:
> >
> > perf trace -p <PID> work on a syscall that is unaugmented, but doesn't
> > work on a syscall that's augmented (when it calls perf_event_output() in
> > BPF).
> >
> > Let's take open() as an example. open() is augmented in perf trace.
> >
> > Before:
> > ```
> > perf $ perf trace -e open -p 3792392
> >          ? (         ):  ... [continued]: open())                                             = -1 ENOENT (No such file or directory)
> >          ? (         ):  ... [continued]: open())                                             = -1 ENOENT (No such file or directory)
> > ```
> >
> > We can see there's no output.
> >
> > After:
> > ```
> > perf $ perf trace -e open -p 3792392
> >      0.000 ( 0.123 ms): a.out/3792392 open(filename: "DINGZHEN", flags: WRONLY)                             = -1 ENOENT (No such file or directory)
> >   1000.398 ( 0.116 ms): a.out/3792392 open(filename: "DINGZHEN", flags: WRONLY)                             = -1 ENOENT (No such file or directory)
> > ```
> >
> > Reason:
> >
> > bpf_perf_event_output() will fail when you specify a pid in perf trace (EOPNOTSUPP).

Trying to figure out why it returns EOPNOTSUPP:

static __always_inline u64
__bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map,
                        u64 flags, struct perf_sample_data *sd)
{
        struct bpf_array *array = container_of(map, struct bpf_array, map);
        unsigned int cpu = smp_processor_id();
        u64 index = flags & BPF_F_INDEX_MASK;
        struct bpf_event_entry *ee;
        struct perf_event *event; 

        if (index == BPF_F_CURRENT_CPU)
                index = cpu;
        if (unlikely(index >= array->map.max_entries))
                return -E2BIG;

        ee = READ_ONCE(array->ptrs[index]);
<SNIP>
        if (unlikely(event->oncpu != cpu))
                return -EOPNOTSUPP;

        return perf_event_output(event, sd, regs);
}

⬢[acme@toolbox perf-tools-next]$ git grep -- '->oncpu'
arch/x86/kvm/vmx/pmu_intel.c:    *   cpu pinned event reclaims LBR, the event->oncpu will be set to -1;
kernel/events/core.c:   event->oncpu = -1;
kernel/events/core.c:   WRITE_ONCE(event->oncpu, smp_processor_id());
kernel/events/core.c:    * ->oncpu if it sees ACTIVE.
kernel/events/core.c:           event->oncpu = -1;
kernel/events/core.c:   if (READ_ONCE(event->oncpu) != smp_processor_id())
kernel/events/core.c:            * inactive here (event->oncpu==-1), there's nothing more to do;
kernel/events/core.c:           ret = cpu_function_call(READ_ONCE(event->oncpu),
kernel/events/core.c:   event_oncpu = __perf_event_read_cpu(event, event->oncpu);
kernel/events/core.c:            * Orders the ->state and ->oncpu loads such that if we see
kernel/events/core.c:            * ACTIVE we must also see the right ->oncpu.
kernel/events/core.c:           event_cpu = READ_ONCE(event->oncpu);
kernel/events/core.c:   int cpu = READ_ONCE(event->oncpu);
kernel/events/core.c:   if (WARN_ON_ONCE(READ_ONCE(sampler->oncpu) != smp_processor_id()))
kernel/events/core.c:                   cpu = READ_ONCE(iter->oncpu);
kernel/events/core.c:   event->oncpu            = -1;
kernel/trace/bpf_trace.c:       if (unlikely(event->oncpu != cpu))
⬢[acme@toolbox perf-tools-next]$

This looks like a bug, no? It seems what this patch is doing is papering
onver that bug :-\

Alexei, Peter?

That part in the cset introducing bpf_perf_output_event() says:

"User space needs to perf_event_open() it (either for one or all cpus)"

-1 should mean all cpus, right, or is bpf_perf_event_output() only
supported if we do as Howard did in this patch?

---
commit a43eec304259a6c637f4014a6d4767159b6a3aa3
Author: Alexei Starovoitov <ast@kernel.org>
Date:   Tue Oct 20 20:02:34 2015 -0700

    bpf: introduce bpf_perf_event_output() helper
    
    This helper is used to send raw data from eBPF program into
    special PERF_TYPE_SOFTWARE/PERF_COUNT_SW_BPF_OUTPUT perf_event.
    User space needs to perf_event_open() it (either for one or all cpus) and
    store FD into perf_event_array (similar to bpf_perf_event_read() helper)
    before eBPF program can send data into it.
---

We're missing something, can you help?

- Arnaldo

> > When using perf trace -p 114, before perf_event_open(), we'll have PID
> > = 114, and CPU = -1.
> >
> > This is bad for bpf-output event, because the ring buffer won't accept
> > output from BPF's perf_event_output(), making it fail. I'm still trying
> > to find out why.
> >
> > If we open bpf-output for every cpu, instead of setting it to -1, like
> > this:
> > PID = <PID>, CPU = 0
> > PID = <PID>, CPU = 1
> > PID = <PID>, CPU = 2
> > PID = <PID>, CPU = 3
> > ...
> >
> > Everything works.
> >
> > You can test it with this script:
> > ```
> >  #include <unistd.h>
> >  #include <sys/syscall.h>
> >
> > int main()
> > {
> >         int i1 = 1, i2 = 2, i3 = 3, i4 = 4;
> >         char s1[] = "DINGZHEN", s2[] = "XUEBAO";
> >
> >         while (1) {
> >                 syscall(SYS_open, s1, i1, i2);
> >                 sleep(1);
> >         }
> >
> >         return 0;
> > }
> > ```
> >
> > save, compile
> > ```
> > gcc open.c
> > ```
> >
> > perf trace
> > ```
> > perf trace -e open <path-to-the-executable>
> > ```
> >
> > Signed-off-by: Howard Chu <howardchu95@gmail.com>
> > ---
> >  tools/perf/util/evlist.c | 14 +++++++++++++-
> >  tools/perf/util/evlist.h |  1 +
> >  2 files changed, 14 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> > index c0379fa1ccfe..f14b7e6ff1dc 100644
> > --- a/tools/perf/util/evlist.c
> > +++ b/tools/perf/util/evlist.c
> > @@ -1067,7 +1067,7 @@ int evlist__create_maps(struct evlist *evlist, struct target *target)
> >         if (!threads)
> >                 return -1;
> >
> > -       if (target__uses_dummy_map(target))
> > +       if (target__uses_dummy_map(target) && !evlist__has_bpf_output(evlist))
> 
> I'm wondering if there should be a comment above this, something like:
> 
> Do we need the "any"/-1 CPU value or do we need to open an event on
> all target CPUs (the default NULL implies all online CPUs). The dummy
> map indicates that we need sideband perf record events in order to
> symbolize samples. The mmap-ed ring buffers need CPUs. Similarly BPF
> output events need ring buffers.
> 
> I'm not 100% on the ring buffer perf_event_open restrictions, and the
> man page doesn't cover it, my knowledge is implied from failures and
> seeing what the tool does.
> 
> Thanks,
> Ian
> 
> 
> >                 cpus = perf_cpu_map__new_any_cpu();
> >         else
> >                 cpus = perf_cpu_map__new(target->cpu_list);
> > @@ -2627,3 +2627,15 @@ void evlist__uniquify_name(struct evlist *evlist)
> >                 }
> >         }
> >  }
> > +
> > +bool evlist__has_bpf_output(struct evlist *evlist)
> > +{
> > +       struct evsel *evsel;
> > +
> > +       evlist__for_each_entry(evlist, evsel) {
> > +               if (evsel__is_bpf_output(evsel))
> > +                       return true;
> > +       }
> > +
> > +       return false;
> > +}
> > diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
> > index b46f1a320783..bcc1c6984bb5 100644
> > --- a/tools/perf/util/evlist.h
> > +++ b/tools/perf/util/evlist.h
> > @@ -447,5 +447,6 @@ int evlist__scnprintf_evsels(struct evlist *evlist, size_t size, char *bf);
> >  void evlist__check_mem_load_aux(struct evlist *evlist);
> >  void evlist__warn_user_requested_cpus(struct evlist *evlist, const char *cpu_list);
> >  void evlist__uniquify_name(struct evlist *evlist);
> > +bool evlist__has_bpf_output(struct evlist *evlist);
> >
> >  #endif /* __PERF_EVLIST_H */
> > --
> > 2.45.2
> >

  reply	other threads:[~2024-08-16 14:52 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-15  1:36 [PATCH v2 00/10] perf trace: Enhanced augmentation for pointer arguments Howard Chu
2024-08-15  1:36 ` [PATCH v2 01/10] perf trace: Fix perf trace -p <PID> Howard Chu
2024-08-15 18:28   ` Ian Rogers
2024-08-16 14:52     ` Arnaldo Carvalho de Melo [this message]
2024-08-16 17:25       ` Howard Chu
2024-08-15  1:36 ` [PATCH v2 02/10] perf trace: Change some comments Howard Chu
2024-08-16 14:58   ` Arnaldo Carvalho de Melo
2024-08-15  1:36 ` [PATCH v2 03/10] perf trace: Add trace__bpf_sys_enter_beauty_map() to prepare for fetching data in BPF Howard Chu
2024-08-22 17:49   ` Arnaldo Carvalho de Melo
2024-08-22 17:53     ` Arnaldo Carvalho de Melo
2024-08-22 21:09       ` Arnaldo Carvalho de Melo
2024-08-23  4:09         ` Howard Chu
2024-08-15  1:36 ` [PATCH v2 04/10] perf trace: Add some string arguments' name in syscall_arg_fmt__init_array() Howard Chu
2024-08-22 22:14   ` Arnaldo Carvalho de Melo
2024-08-23  4:37     ` Howard Chu
2024-08-23 13:17       ` Arnaldo Carvalho de Melo
2024-08-15  1:36 ` [PATCH v2 05/10] perf trace: Add a new argument to trace__btf_scnprintf() Howard Chu
2024-08-22 18:00   ` Arnaldo Carvalho de Melo
2024-08-22 18:13     ` Arnaldo Carvalho de Melo
2024-08-23  4:05       ` Howard Chu
2024-08-15  1:36 ` [PATCH v2 06/10] perf trace: Pretty print struct data Howard Chu
2024-08-23 12:41   ` Arnaldo Carvalho de Melo
2024-08-23 13:15     ` Arnaldo Carvalho de Melo
2024-08-15  1:36 ` [PATCH v2 07/10] perf trace: Pretty print buffer data Howard Chu
2024-08-23 14:17   ` Arnaldo Carvalho de Melo
2024-08-15  1:36 ` [PATCH v2 08/10] perf trace: Add pids_allowed and rename pids_filtered Howard Chu
2024-08-15  1:36 ` [PATCH v2 09/10] perf trace: Collect augmented data using BPF Howard Chu
2024-08-23 13:24   ` Arnaldo Carvalho de Melo
2024-08-23 13:38     ` Arnaldo Carvalho de Melo
2024-08-23 13:42       ` Arnaldo Carvalho de Melo
2024-08-23 14:23         ` Arnaldo Carvalho de Melo
2024-08-15  1:36 ` [PATCH v2 10/10] perf trace: Add general tests for augmented syscalls Howard Chu
2024-08-16  3:15   ` Ian Rogers

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zr9ntEtIvseYwG90@x1 \
    --to=acme@kernel.org \
    --cc=adrian.hunter@intel.com \
    --cc=andrii.nakryiko@gmail.com \
    --cc=ast@kernel.org \
    --cc=howardchu95@gmail.com \
    --cc=irogers@google.com \
    --cc=jolsa@kernel.org \
    --cc=kan.liang@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-perf-users@vger.kernel.org \
    --cc=namhyung@kernel.org \
    --cc=peterz@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.