From: "Wangnan (F)" <wangnan0@huawei.com>
To: Alexei Starovoitov <ast@plumgrid.com>,
He Kuang <hekuang@huawei.com>,
Peter Zijlstra <peterz@infradead.org>
Cc: <rostedt@goodmis.org>, <masami.hiramatsu.pt@hitachi.com>,
<mingo@redhat.com>, <acme@redhat.com>, <jolsa@kernel.org>,
<namhyung@kernel.org>, <linux-kernel@vger.kernel.org>,
pi3orama <pi3orama@163.com>
Subject: Re: [RFC PATCH 0/5] Make eBPF programs output data to perf event
Date: Thu, 2 Jul 2015 17:24:38 +0800
Message-ID: <55950356.3050507@huawei.com>
In-Reply-To: <5594B569.60103@plumgrid.com>
On 2015/7/2 11:52, Alexei Starovoitov wrote:
> On 7/1/15 8:38 PM, He Kuang wrote:
>>
>>
>> On 2015/7/2 10:48, Alexei Starovoitov wrote:
>>> On 7/1/15 4:58 AM, Peter Zijlstra wrote:
>>>>
>>>> But why create a separate trace buffer, it should go into the regular
>>>> perf buffer.
>>>
>>> +1
>>>
>>> I think
>>> +static char __percpu *perf_extra_trace_buf[PERF_NR_CONTEXTS];
>>> is redundant.
>>> It adds quite a bit of unnecessary complexity to the whole patch set.
>>>
>>> Also the call to bpf_output_sample() is not effective unless the
>>> program returns 1. It's a confusing user interface.
>>>
>>> Also you cannot ever do:
>>> BPF_FUNC_probe_read,
>>> + BPF_FUNC_output_sample,
>>> BPF_FUNC_ktime_get_ns,
>>> new functions must be added to the end.
>>>
>>> Why not just do:
>>> perf_trace_buf_prepare() + perf_trace_buf_submit() from the helper?
>>> No changes to current code.
>>> No need to call __get_data_size() and other overhead.
>>> The helper can be called multiple times from the same program.
>>> imo much cleaner.
>>>
>>
>> Invoking perf_trace_buf_submit() will generate a second perf
>> event entry (header->type = PERF_RECORD_SAMPLE), separate from
>> the event entry output by the original kprobe. So the final
>> result of the example in the 00/00 patch may look like this:
>>
>> sample entry 1 (from bpf_prog):
>>   comm timestamp1 generic_perform_write pmu_value=0x1234
>> sample entry 2 (from original kprobe):
>>   comm timestamp2 generic_perform_write: (ffffffff81140b60)
>>
>> Compared with the current implementation:
>> combined sample entry:
>>   comm timestamp generic_perform_write: (ffffffff81140b60)
>>   pmu_value=0x1234
>>
>> The two entries above may not be adjacent, since multiple
>> threads and kprobes are being recorded, and one entry may be
>> lost while the other is recorded. What we need is the
>> pmu_value read when 'generic_perform_write' is entered; the
>> two-entry result is not intuitive, and userspace tools have
>> to find and combine the two sample entries to get it.
>
> Just change your example to return 0 and user space will see
> one sample.
>
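On the quoted point that new BPF_FUNC_* IDs must be appended rather than inserted: a compiled BPF program records only the helper's number in its call instruction. Inserting an entry in the middle therefore renumbers every later helper. The userspace sketch below illustrates this. The helper tables and their numbers are hypothetical and do not match the real values in include/uapi/linux/bpf.h.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified helper numbering (NOT the real UAPI values). */
enum v1_ids { V1_unspec, V1_probe_read, V1_ktime_get_ns, V1_NR };

/* Names a released ("v1") kernel dispatches each numeric ID to. */
static const char *const v1_names[V1_NR] = {
	"unspec", "probe_read", "ktime_get_ns",
};

/* A newer kernel that inserted the new helper in the middle. */
static const char *const v2_inserted[] = {
	"unspec", "probe_read", "output_sample", "ktime_get_ns",
};

/* A newer kernel that appended the new helper at the end. */
static const char *const v2_appended[] = {
	"unspec", "probe_read", "ktime_get_ns", "output_sample",
};

/* A program built against v1 stores only the numeric ID. Return 1 if
 * every v1 ID still resolves to the same helper in the newer table,
 * 0 if some already-compiled call would now reach a different helper. */
static int abi_compatible(const char *const v2[])
{
	for (int id = 0; id < V1_NR; id++) {
		assert(v2[id] != NULL);
		if (strcmp(v1_names[id], v2[id]) != 0)
			return 0;
	}
	return 1;
}
```

With the inserted layout, an old program calling ID 2 would reach "output_sample" instead of "ktime_get_ns". Appending keeps every existing ID stable.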
Yes, by using perf_trace_buf_prepare() + perf_trace_buf_submit() in a
helper function and letting the BPF program always return 0, we can
make the data collected by BPF programs go into samples, provided the
following problems are solved:
1. A BPF program has no way to get 'struct perf_event' or 'struct
   ftrace_event_call'. We would have to deduce them through pt_regs:
     pt_regs -> ip -> kprobe -> struct trace_kprobe -> struct
     ftrace_event_call -> hlist_entry -> struct perf_event
   This seems dirty, but without it we can't call
   perf_trace_buf_submit().
2. Even if we finally get 'struct perf_event', I'm not sure whether
   users really care about it. If we really care about all the
   information output through perf_trace_buf_submit(), such as the
   callstack and registers, why not make the BPF program return
   non-zero instead? But then we have to consider how to connect the
   two samples together.
So maybe writing a new function to replace perf_trace_buf_submit(),
one that outputs some lightweight information instead of the full
event data, is worth considering. Alternatively, maybe a dummy
'struct perf_event' for BPF output is also acceptable?
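If the program returned non-zero and two samples were emitted, userspace would have to connect them, as item 2 notes. The sketch below is minimal and assumes hypothetical decoded records sorted by time. It also assumes the BPF sample is always emitted before the kprobe's own sample on the same thread.

Each kprobe sample is matched with the nearest earlier sample from the same thread, but only if that sample came from BPF. A lost record therefore shows up as an unmatched entry rather than a wrong pairing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical decoded sample records (not perf's on-disk format). */
enum sample_kind { SAMPLE_BPF, SAMPLE_KPROBE };

struct sample {
	enum sample_kind kind;
	int tid;
	unsigned long long time;
	unsigned long long pmu_value;   /* only meaningful for SAMPLE_BPF */
};

/* For each KPROBE sample s[i], set match[i] to the index of its BPF
 * partner, or -1 if none. Non-KPROBE entries get -1. The backward
 * search stops at the previous KPROBE sample of the same thread, so an
 * orphaned BPF sample is never attached to a later hit. Returns the
 * number of KPROBE samples left unmatched. */
static size_t pair_samples(const struct sample *s, size_t n, long match[])
{
	size_t unmatched = 0;

	assert(n == 0 || (s != NULL && match != NULL));
	for (size_t i = 0; i < n; i++) {
		match[i] = -1;
		if (s[i].kind != SAMPLE_KPROBE)
			continue;
		for (size_t j = i; j-- > 0;) {
			if (s[j].tid != s[i].tid)
				continue;
			if (s[j].kind == SAMPLE_BPF)
				match[i] = (long)j;
			break;   /* nearest same-thread sample decides */
		}
		if (match[i] < 0)
			unmatched++;
	}
	return unmatched;
}
```

Even this simple pairing needs an ordering assumption and per-thread bookkeeping. A single combined sample avoids both.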
What we tried to do in the previous patches is merge the data output
by BPF programs with the original data output by
perf_trace_buf_submit(). For example (expressed in CTF metadata
format):
event.header := struct {    // both output by perf_trace_buf_submit()
    integer { ... } id;
    integer { ... } timestamp;
}
event {
    name = "perf_bpf_probe:lock_page";
    ...
    fields := struct {
        integer { ... } perf_ip;            // perf_trace_buf_submit()
        integer { ... } perf_tid;           // perf_trace_buf_submit()
        ...
        integer { ... } page;               <-- Fetched using prologue
        integer { ... } cycle_cmu_counter;  <-- Output by BPF program
    }
}
We believe this approach is simpler to implement. Whether or not to
use an extra perf_trace_buf is open for discussion; there are other
choices. For example, we could make the BPF program write its data
starting from the end of bpf_trace_buf, and join the two parts of the
output data before calling perf_trace_buf_submit().
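The "write from the end, then join" alternative can be sketched as a userspace model. The buffer size, struct, and function names below are hypothetical, not kernel APIs. The kprobe fields grow up from the start of the buffer and the BPF output grows down from the end, so neither writer needs the other's size in advance. Before submission, the BPF bytes are moved down to sit directly after the kprobe fields.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HYP_BUF_SIZE 64   /* hypothetical; real per-CPU buffers are larger */

/* Hypothetical model of one trace buffer shared by two writers. */
struct two_ended_buf {
	char data[HYP_BUF_SIZE];
	size_t head;   /* bytes used at the front (kprobe fields) */
	size_t tail;   /* bytes used at the back (BPF output) */
};

/* Append kprobe field bytes at the front; -1 if the regions would overlap. */
static int put_front(struct two_ended_buf *b, const void *src, size_t len)
{
	if (b->head + b->tail + len > sizeof(b->data))
		return -1;
	memcpy(b->data + b->head, src, len);
	b->head += len;
	return 0;
}

/* Place BPF output just below what is already at the back. Each new
 * chunk lands below the previous one, so several chunks end up in
 * reverse write order after join(). */
static int put_back(struct two_ended_buf *b, const void *src, size_t len)
{
	if (b->head + b->tail + len > sizeof(b->data))
		return -1;
	b->tail += len;
	memcpy(b->data + sizeof(b->data) - b->tail, src, len);
	return 0;
}

/* Move the BPF bytes so they directly follow the kprobe fields and
 * return the length of the combined record, i.e. what would be handed
 * to the submit step. memmove() is used because the ranges may overlap. */
static size_t join(struct two_ended_buf *b)
{
	assert(b->head + b->tail <= sizeof(b->data));
	memmove(b->data + b->head, b->data + sizeof(b->data) - b->tail, b->tail);
	b->head += b->tail;
	b->tail = 0;
	return b->head;
}
```

The cost of this design is one memmove() of the BPF part per event. In exchange, no second per-CPU buffer is needed.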
Thank you.