All of lore.kernel.org
 help / color / mirror / Atom feed
From: Arnaldo Carvalho de Melo <acme@kernel.org>
To: "Wangnan (F)" <wangnan0@huawei.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>,
	"linux-perf-use." <linux-perf-users@vger.kernel.org>,
	Alexei Starovoitov <ast@plumgrid.com>
Subject: Re: linux 4.4, perf & BPF, and bpf_perf_event_output
Date: Tue, 12 Jan 2016 12:27:13 -0300	[thread overview]
Message-ID: <20160112152713.GE18367@kernel.org> (raw)
In-Reply-To: <569466B9.2040702@huawei.com>

Em Tue, Jan 12, 2016 at 10:36:41AM +0800, Wangnan (F) escreveu:
> 
> 
> On 2016/1/12 8:07, Brendan Gregg wrote:
> >G'Day,
> >
> >Congrats Wang Nan on the new perf bpf features in 4.4! I just
> >revisited where I was at with them now that it's official, and have a
> >couple of questions.
> >
> >So I can run something like this:
> >
> >---syncsnoop.c---
> >/* perf.data: default event record only; trace_pipe:
> >bpf_trace_printk() output */
> >#include <uapi/linux/bpf.h>
> >#include "bpf_helpers.h"
> >
> >SEC("func=sys_sync")
> >int bpf_func__sys_sync(void *ctx)
> >{
> >     char fmt[] = "sync()\n";
> >     bpf_trace_printk(fmt, sizeof (fmt));
> >     return 1;
> >};
> >
> >char _license[] SEC("license") = "GPL";
> >int _version SEC("version") = LINUX_VERSION_CODE;
> >------
> >
> >With: perf record -a --clang-opt "-DLINUX_VERSION_CODE=0x40400"
> >--event syncsnoop.c sleep 5
> >
> >And it works. I get PERF_RECORD_SAMPLE entries in perf.data, and my
> >custom strings in /sys/kernel/debug/tracing/trace_pipe.
> >
> >Q1. Given bpf_trace_printk() & trace_pipe is a hack, is there a way
> >yet to print my custom strings to perf.data? Eg, something like
> >bpf_perf_event_output() to change the format string of the event.
> >Currently I just get:
> >
> ># perf script
> >             sync 30207 [002]  2509.576984: perf_bpf_probe:func:
> >(ffffffff81221b20)
> >             sync 30209 [003]  2509.766632: perf_bpf_probe:func:
> >(ffffffff81221b20)
> >             sync 30229 [003]  2509.936299: perf_bpf_probe:func:
> >(ffffffff81221b20)
> >             sync 30230 [004]  2510.099059: perf_bpf_probe:func:
> >(ffffffff81221b20)
> >             sync 30231 [004]  2510.289351: perf_bpf_probe:func:
> >(ffffffff81221b20)
> Yes. I have implemented this feature. Patch has posted, but not
> in 4.4. I hope you will be able to use this feature in v4.5.
> It depends on Arnaldo.

Right, I have to go testing this stuff patch by patch, its in my table
on more time this week.
 
> There is a small example at commit message of [1]. The basic workflow is:
> 
>  1. Create a bpf-output map in your BPF file
>  2. Output data to it by bpf_perf_event_output in BPF source
>  3. Create bpf-output event in perf cmdline
>  4. Use perf-CTF conversion
>  5. User babeltrace to see your output in raw format
>  6. Use python script for decoding
> 
> Currently you are unable to see string in normal 'perf script' output.
> I sent a mail for suggestion about perf script output format in [2]
> at Oct. 28. Please check your mailbox.
> 
> At the end of this mail I provide some detail description.
> 
> [1] http://lkml.kernel.org/g/1452520124-2073-26-git-send-email-wangnan0@huawei.com
> [2] http://lkml.kernel.org/g/5630AC2A.7030308@huawei.com
> 
> >Q2. Apart from per-event strings, is there a way to emit the maps?
> >bpf-script-example.c populates a map, but I don't think it ends up in
> >perf.data. I think a common case would be to emit it once for the run.
> >Another common case would be to emit it once per second. Any way we
> >can do this? I did try bpf_perf_event_output(), but I don't think it
> >ended up in perf.data (at least, I don't see it using "perf script
> >-D").
> 
> Not support yet. Map data don't go to 'perf.data' at all. But I have
> another idea.
> 
> I'm thinking about linking user-mode BPF into perf, then we can put all
> things together into one BPF source, including:
> 
> data sampling, filtering, accumulation, pretty-printing
> 
> makes it similar to dtrace.
> 
> Here's an example. In this example we dump a perf.data output when
> we found a write system call takes longer than expected (in my recent
> patch series perf can act as a flight recorder), then print the
> map data we collected during recording.
> 
> /* Define message passed between BPF script and perf */
> enum perf_cmd {
>   DUMP,
> }
> enum perf_cmd cmd;
> 
> /* define histogram map */
> SEC("map")
> struct bpf_map_def hist_map = { ... };
> 
> SEC("func=sys_write")
> int sys_write_enter(...) {  }
> 
> SEC("func=sys_write%return")
> int sys_write_exit(...) {
>   enum perf_cmd cmd;
> 
>   /* check how long this syscall takes */
>   /* maintain histogram */
>   if (syscall_too_long) {
>     cmd = DUMP;
>     bpf_perf_event_output(..., &cmd, ...);   <-- trigger a dump
>   }
> }
> 
> /* Following are user mode BPF scripts which are run in perf context */
> 
> SEC("mode=perf;action=recv_bpf_msg")
> int perf_recv_bpf_msg(void *pcmd)
> {
>    ...
>    switch (*pcmd) {
>      ...
>      case DUMP:
>        /* make perf dump a perf.data. */
>        perf_hist_output(&hist_map);  /* let perf output histogram */
>        ...
>    }
> }
> 
> SEC("mode=perf;action=perf_end")
> int perf_end(...)
> {
>   print_the_final_result(&hist_map);
>   ...
> }
> 
> Thank you.
> 
> >thanks,
> >
> >Brendan
> 
> Ddetail description about perf event output:
> 
> First, you need create a bpf-output map in your BPF file:
> 
>  struct bpf_map_def SEC("maps") channel = {
>      .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
>      .key_size = sizeof(int),
>      .value_size = sizeof(u32),
>      .max_entries = __NR_CPUS__,
>  };
> 
> Then you output data to it by bpf_perf_event_output (this is its name in
> samples/bpf/bpf_helper.h. In my
> commit message it is perf_event_output):
> 
>  SEC("func=sys_write")
>  int bpf_func__sys_write(void *ctx)
>  {
>      char fmt[] = "write()\n";
>      bpf_perf_event_output(ctx, &channel,
>                            bpf_get_smp_processor_id(),
>                            fmt, sizeof (fmt));
>      return 1;
>  };
> 
> When running perf you need create the channel for it by opening a bpf-output
> event and connect it with
> 'channel' map you defined in your BPF source file:
> 
> # perf record -e evt=bpf-output/no-inherit/ -e
> ./test.c/maps:channel.event=evt/ ls
> 
> Then in perf.data you will see two samples, one is the function you probe in
> (sys_write), another is the
> BPF output.
> 
> # perf script
>               ls 32115 1009664.006228: evt=bpf-output/no-inherit/:
> ffffffff811fcac1 sys_write ([kernel.kallsyms])
>               ls 32115 [002] 1009664.006229: perf_bpf_probe:func:
> (ffffffff811fcac0)
> 
> You can't see the string you output now, because perf script doesn't support
> decoding data you passed through bpf-output event. Now you can see your
> string by perf script by 'perf script -D' like this:
> 
>  . ... raw event: size 56 bytes
>  .  0000:  09 00 00 00 01 00 38 00 01 16 00 00 00 00 00 00 ......8.........
>  .  0010:  c1 ca 1f 81 ff ff ff ff 73 7d 00 00 73 7d 00 00 ........s}..s}..
>  .  0020:  bb 09 6f b8 48 96 03 00 0c 00 00 00 77 72 69 74 ..o.H.......writ
> <--- *HERE*
>  .  0030:  65 28 29 0a 00 00 00 00                          e().....
>  .
>  4294967295 1009664006228411 0x2ed0 [0x38]: PERF_RECORD_SAMPLE(IP, 0x1):
> 32115/32115: 0xffffffff811fcac1 period: 1 addr: 0
>   ... thread: ls:32115
>   ...... dso:
> /home/wangnan/.debug/.build-id/fb/0f83d011364583af77e563a390dfc1b78cce8d
>                ls 32115 1009664.006228: evt=bpf-output/no-inherit/:
> ffffffff811fcac1 sys_write ([kernel.kallsyms])
> 
> For me, I prefer CTF conversion:
> 
>  # ~/perf data convert --to-ctf ./out.ctf
>  [ perf data convert: Converted 'perf.data' into CTF data './out.ctf' ]
>  [ perf data convert: Converted and wrote 0.000 MB (2 samples) ]
>  # babeltrace ./out.ctf/
>  [16:27:44.006228411] (+?.?????????) evt=bpf-output/no-inherit/: { cpu_id =
> 0 }, { perf_ip = 0xFFFFFFFF811FCAC1, perf_tid = 32115, perf_pid = 32115,
> perf_id = 5633,  raw_len = 3, raw_data = [ [0] = 0x74697277, [1] =
> 0xA292865, [2] = 0x0 ] }
>  [16:27:44.006229678] (+0.000001267) perf_bpf_probe:func: { cpu_id = 2 }, {
> perf_ip = 0xFFFFFFFF811FCAC1, perf_tid = 32115, perf_pid = 32115, perf_id =
> 5641, perf_period = 1, common_type = 1177, common_flags = 1,
> common_preempt_count = 0, common_pid = 32115, _probe_ip = 0xFFFFFFFF811FCAC0
> }
> 
> Your string is there:
> 
>  raw_data = [ [0] = 0x74697277, [1] = 0xA292865, [2] = 0x0 ] }
> 
> Still not good, but don't forget we have babeltrace python binding. A
> simple python script can decode it for you.
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-perf-users" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

  reply	other threads:[~2016-01-12 15:27 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-01-12  0:07 linux 4.4, perf & BPF, and bpf_perf_event_output Brendan Gregg
2016-01-12  2:36 ` Wangnan (F)
2016-01-12 15:27   ` Arnaldo Carvalho de Melo [this message]
2016-01-12 20:56   ` Brendan Gregg
2016-01-13  2:54     ` Wangnan (F)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160112152713.GE18367@kernel.org \
    --to=acme@kernel.org \
    --cc=ast@plumgrid.com \
    --cc=brendan.d.gregg@gmail.com \
    --cc=linux-perf-users@vger.kernel.org \
    --cc=wangnan0@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.