From: "Wangnan (F)" <wangnan0@huawei.com>
To: Alexei Starovoitov <ast@plumgrid.com>,
He Kuang <hekuang@huawei.com>,
Peter Zijlstra <peterz@infradead.org>
Cc: <rostedt@goodmis.org>, <masami.hiramatsu.pt@hitachi.com>,
<mingo@redhat.com>, <acme@redhat.com>, <jolsa@kernel.org>,
<namhyung@kernel.org>, <linux-kernel@vger.kernel.org>,
pi3orama <pi3orama@163.com>
Subject: Re: [RFC PATCH 0/5] Make eBPF programs output data to perf event
Date: Thu, 2 Jul 2015 17:24:38 +0800
Message-ID: <55950356.3050507@huawei.com>
In-Reply-To: <5594B569.60103@plumgrid.com>
On 2015/7/2 11:52, Alexei Starovoitov wrote:
> On 7/1/15 8:38 PM, He Kuang wrote:
>>
>>
>> On 2015/7/2 10:48, Alexei Starovoitov wrote:
>>> On 7/1/15 4:58 AM, Peter Zijlstra wrote:
>>>>
>>>> But why create a separate trace buffer, it should go into the regular
>>>> perf buffer.
>>>
>>> +1
>>>
>>> I think
>>> +static char __percpu *perf_extra_trace_buf[PERF_NR_CONTEXTS];
>>> is redundant.
>>> It adds quite a bit of unnecessary complexity to the whole patch set.
>>>
>>> Also the call to bpf_output_sample() is not effective unless program
>>> returns 1. It's a confusing user interface.
>>>
>>> Also you cannot ever do:
>>> BPF_FUNC_probe_read,
>>> + BPF_FUNC_output_sample,
>>> BPF_FUNC_ktime_get_ns,
>>> new functions must be added to the end.
>>>
>>> Why not just do:
>>> perf_trace_buf_prepare() + perf_trace_buf_submit() from the helper?
>>> No changes to current code.
>>> No need to call __get_data_size() and other overhead.
>>> The helper can be called multiple times from the same program.
>>> imo much cleaner.
>>>
>>
>> Invoking perf_trace_buf_submit() will generate a second perf
>> event (header->type = PERF_RECORD_SAMPLE) entry, which is
>> different from the event entry output by the original
>> kprobe. So the final result of the example in 00/00 patch may
>> look like this:
>>
>> sample entry 1(from bpf_prog):
>> comm timestamp1 generic_perform_write pmu_value=0x1234
>> sample entry 2(from original kprobe):
>> comm timestamp2 generic_perform_write: (ffffffff81140b60)
>> Compared with current implementation:
>> combined sample entry:
>> comm timestamp generic_perform_write: (ffffffff81140b60)
>> pmu_value=0x1234
>>
>> The former two entries may be discontinuous as there are multiple
>> threads and kprobes to be recorded, and there's a chance that one
>> entry is missed but the other is recorded. What we need is the
>> pmu_value read when 'generic_perform_write' enters, the two
>> entries result is not intuitive enough and userspace tools have
>> to do the work to find and combine those two sample entries to
>> get the result.
>
> Just change your example to return 0 and user space will see
> one sample.
>
Yes, by using perf_trace_buf_prepare() + perf_trace_buf_submit() in the
helper function and letting the bpf program always return 0, we can get
the data collected by BPF programs output into samples, provided the
following problems are solved:

 1. In a bpf program there's no way to get 'struct perf_event' or 'struct
    ftrace_event_call'. We have to deduce them through pt_regs:

      pt_regs -> ip -> kprobe -> struct trace_kprobe -> struct
      ftrace_event_call -> hlist_entry -> struct perf_event

    This seems dirty, but without it we can't call
    perf_trace_buf_submit().

 2. Even if we finally get 'struct perf_event', I'm not sure whether
    users really care about it. If we really care about all the
    information output through perf_trace_buf_submit(), like the
    callstack and registers, why not make the bpf program return non-zero
    instead? But then we have to consider how to connect the two samples
    together.

So maybe writing a new function to replace perf_trace_buf_submit(), one
that outputs some lightweight information instead of the full event data,
is worth considering. Otherwise, maybe a dummy 'struct perf_event' for BPF
output is also acceptable?
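
To make the lookup chain in item 1 concrete, here is a small userspace
mock of it. This is only a sketch: every mock_* type and function is a
hypothetical stand-in invented for illustration, not the kernel
definition, and the per-cpu hlist walk is collapsed into a single
pointer.

```c
#include <stddef.h>

/* Same idea as the kernel's container_of(): member pointer -> enclosing struct. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* All mock_* types are hypothetical stand-ins, not the kernel structs. */
struct mock_pt_regs { unsigned long ip; };
struct mock_perf_event { int id; };
/* The real event call keeps a list of events; one pointer is assumed here. */
struct mock_event_call { struct mock_perf_event *event; };
struct mock_kprobe { unsigned long addr; };
struct mock_trace_kprobe {
	struct mock_kprobe kp;
	struct mock_event_call call;
};

/* Step "ip -> kprobe": linear scan over a mock table of registered probes. */
static struct mock_kprobe *mock_get_kprobe(struct mock_trace_kprobe *tbl,
					   size_t n, unsigned long ip)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].kp.addr == ip)
			return &tbl[i].kp;
	return NULL;
}

/* Walk pt_regs -> ip -> kprobe -> trace_kprobe -> event call -> perf event. */
static struct mock_perf_event *
mock_event_from_regs(const struct mock_pt_regs *regs,
		     struct mock_trace_kprobe *tbl, size_t n)
{
	struct mock_kprobe *kp = mock_get_kprobe(tbl, n, regs->ip);
	struct mock_trace_kprobe *tk;

	if (!kp)
		return NULL;	/* no probe registered at this address */
	tk = container_of(kp, struct mock_trace_kprobe, kp);
	return tk->call.event;
}
```

Even in this reduced form the helper needs knowledge of three unrelated
structures just to find the event, which is why the chain looks dirty.
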

What we were trying to do in the previous patches is to merge the data
output by BPF programs with the original data output by
perf_trace_buf_submit(). For example (expressed in CTF metadata format):

 event.header := struct {  // both output by perf_trace_buf_submit()
     integer { ... } id;
     integer { ... } timestamp;
 }

 event {
     name = "perf_bpf_probe:lock_page";
     ...
     fields := struct {
         integer { ... } perf_ip;   // perf_trace_buf_submit()
         integer { ... } perf_tid;  // perf_trace_buf_submit()
         ...
         integer { ... } page;               <-- Fetched using prologue
         integer { ... } cycle_cmu_counter;  <-- Output by BPF program
     }
 }

We believe the implementation should be simpler. Whether to use an extra
perf_trace_buf or not can be discussed, and we have other choices. For
example, we can make the BPF program write its data from the end of
bpf_trace_buf, and connect the two parts of the output data before
calling perf_trace_buf_submit().
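
A rough userspace sketch of that last alternative follows. It rests on
assumptions: 'struct trace_buf', BUF_SIZE, bpf_write_tail() and
combine_before_submit() are hypothetical names made up for this
illustration, and per-cpu handling, recursion contexts and alignment are
all ignored.

```c
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 64	/* hypothetical size, for illustration only */

/* Hypothetical stand-in for one trace buffer. */
struct trace_buf {
	unsigned char data[BUF_SIZE];
	size_t tail_used;	/* bytes the BPF side has stored at the end */
};

/*
 * BPF side: store data in the region that grows down from the end of the
 * buffer. Later writes land in front of earlier ones.
 * Returns 0 on success, -1 if there is no room left.
 */
static int bpf_write_tail(struct trace_buf *b, const void *src, size_t len)
{
	if (len > BUF_SIZE - b->tail_used)
		return -1;
	b->tail_used += len;
	memcpy(b->data + BUF_SIZE - b->tail_used, src, len);
	return 0;
}

/*
 * Kprobe side: the regular entry occupies data[0..head_len). Move the BPF
 * data so it directly follows that entry, giving one contiguous record.
 * Returns the combined size, or -1 if the two parts would overlap.
 */
static long combine_before_submit(struct trace_buf *b, size_t head_len)
{
	size_t tail = b->tail_used;

	if (head_len > BUF_SIZE - tail)
		return -1;
	/* memmove, not memcpy: source and destination may overlap. */
	memmove(b->data + head_len, b->data + BUF_SIZE - tail, tail);
	b->tail_used = 0;	/* buffer is ready for the next event */
	return (long)(head_len + tail);
}
```

With this layout the BPF helper never needs to know the size of the
regular entry; only the final step, just before submission, has to see
both parts.
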
Thank you.