From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752595AbbE1HPn (ORCPT <rfc822;w@1wt.eu>);
	Thu, 28 May 2015 03:15:43 -0400
Received: from szxga01-in.huawei.com ([58.251.152.64]:16428 "EHLO
	szxga01-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751372AbbE1HPf (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Thu, 28 May 2015 03:15:35 -0400
Message-ID: <5566C064.6020205@huawei.com>
Date: Thu, 28 May 2015 15:14:44 +0800
From: "Wangnan (F)" <wangnan0@huawei.com>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0
MIME-Version: 1.0
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
CC: <paulus@samba.org>, <a.p.zijlstra@chello.nl>, <mingo@redhat.com>,
        <acme@kernel.org>, <namhyung@kernel.org>, <jolsa@kernel.org>,
        <dsahern@gmail.com>, <daniel@iogearbox.net>,
        <brendan.d.gregg@gmail.com>, <masami.hiramatsu.pt@hitachi.com>,
        <lizefan@huawei.com>, <linux-kernel@vger.kernel.org>,
        <pi3orama@163.com>, xiakaixu 00238161 <xiakaixu@huawei.com>
Subject: Re: [RFC PATCH v4 10/29] bpf tools: Collect map definitions from
 'maps' section
References: <1432704004-171454-1-git-send-email-wangnan0@huawei.com> <1432704004-171454-11-git-send-email-wangnan0@huawei.com> <20150528015307.GE20764@Alexeis-MacBook-Pro.local> <55667758.1070206@huawei.com> <20150528022833.GI20764@Alexeis-MacBook-Pro.local> <556686FE.105@huawei.com> <20150528060957.GA21013@Alexeis-MBP.westell.com>
In-Reply-To: <20150528060957.GA21013@Alexeis-MBP.westell.com>
Content-Type: text/plain; charset="utf-8"; format=flowed
Content-Transfer-Encoding: 7bit
X-Originating-IP: [10.111.66.109]
X-CFilter-Loop: Reflected
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org


On 2015/5/28 14:09, Alexei Starovoitov wrote:
> On Thu, May 28, 2015 at 11:09:50AM +0800, Wangnan (F) wrote:
>> However this breaks a law in current design that opening phase doesn't
>> talk to kernel with sys_bpf() at all. All related staff is done in loading
>> phase. This principle ensures that in every systems, no matter it support
>> sys_bpf() or not, can read eBPF object without failure.
> I see, so you want 'parse elf' and 'create maps + load programs'
> to be separate phases?
> Fair enough. Then please add a call to release the information
> collected from elf after program loading is done.
> relocations and other things are not needed at that point.

What about adding a flag to bpf_object__load() to tell it whether to
clean up the resources it has taken? For example:

  int bpf_object__load(struct bpf_object *obj, bool clean);

Then we can wrap it with a macro:

  #define bpf_object__load_clean(o) bpf_object__load(o, true)

If 'clean' is true, the resources will be freed after loading, and the
same object cannot be loaded again after it is unloaded. By doing this
we can avoid adding a new function.
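
To make the intended semantics concrete, here is a minimal user-space
sketch. It is not the real implementation: struct bpf_object_sketch, its
fields and all sketch_*() names are made up for illustration, and the
actual map creation and program loading are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct bpf_object; the fields are assumptions. */
struct bpf_object_sketch {
	void *elf_info;	/* relocations etc. collected in the opening phase */
	bool loaded;
};

/* Opening phase: only collects information, never talks to the kernel. */
static struct bpf_object_sketch *sketch_open(void)
{
	struct bpf_object_sketch *obj = calloc(1, sizeof(*obj));

	if (!obj)
		return NULL;
	obj->elf_info = malloc(64);	/* placeholder for parsed ELF data */
	if (!obj->elf_info) {
		free(obj);
		return NULL;
	}
	return obj;
}

/* Loading phase: returns 0 on success, -1 if the ELF info is already freed. */
static int sketch_load(struct bpf_object_sketch *obj, bool clean)
{
	if (!obj->elf_info)
		return -1;
	/* maps would be created and programs loaded here */
	obj->loaded = true;
	if (clean) {
		free(obj->elf_info);
		obj->elf_info = NULL;
	}
	return 0;
}

static void sketch_unload(struct bpf_object_sketch *obj)
{
	obj->loaded = false;
}

static void sketch_close(struct bpf_object_sketch *obj)
{
	free(obj->elf_info);	/* free(NULL) is a no-op */
	free(obj);
}

#define sketch_load_clean(o) sketch_load(o, true)
```

With this shape, a caller that never reloads uses sketch_load_clean(),
and a second load after unload fails instead of touching freed data.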

>> Moreover, we are planning to introduce hardware PMU to eBPF in the way like
>> maps,
>> to give eBPF programs the ability to access hardware PMU counter. I haven't
> that's very interesting. Please share more info when you can :)
> If I understood it right, you want in-kernel bpf to do aggregation
> and filtering of pmu counters ?
> And computing a number of cache misses between two kprobe events?
> I can see how I can use that to measure not only time
> taken by syscall, but number of cache misses occurred due
> to syscall. Sounds very useful!

I'm glad to see you are also interested in it.

Of course, filtering and aggregation based on PMU counters will be
useful, but this is only our first goal.

You know there are many useful PMU counters provided by x86 and ARM64.
Many people ask me if there is a way to record the absolute PMU counter
value when sampling, so they can measure IPC changes, cache miss rate,
page faults and so on. Currently 'perf stat' is able to read PMU
counters, but the cost is relatively high.

For me, enabling eBPF programs to read PMU counters is the first thing
that needs to be done. The other thing is enabling eBPF programs to
bring some information into perf samples.

Here is an example to show my idea.

I have a program which:

int main()
{
   while(1) {
     read(...);
     /* do A */
     write(...);
     /* do B */
   }
}

Then, using the following script:

  SEC("enter=sys_write $outdata:u64")
  int enter_sys_write(...) {
    u64 cycles_cnt = bpf_read_pmu(&cycles_pmu);
    bpf_store_value(cycles_cnt);
    return 1;
  }

  SEC("enter=sys_read $outdata:u64")
  int enter_sys_read(...) {
    u64 cycles_cnt = bpf_read_pmu(&cycles_pmu);
    bpf_store_value(cycles_cnt);
    return 1;
  }

with 'perf script' we can check the cycle counter at each point, and
then compute the number of cycles between any two sampling points. This
way we can compute how many cycles are taken by A and B. If the
instruction counter is also recorded, we will know the IPC of A and B.
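
The arithmetic on the post-processing side could look like the sketch
below. It assumes each sample carries absolute cycle and instruction
counts; struct pmu_sample and both helpers are hypothetical names, not
an existing perf interface. Note that the interval from a sys_read
entry to the next sys_write entry covers the read itself plus A.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record: absolute counter values at one sampling point. */
struct pmu_sample {
	uint64_t cycles;
	uint64_t instructions;
};

/* Cycles elapsed between two sampling points (a is the earlier one). */
static uint64_t cycles_between(const struct pmu_sample *a,
			       const struct pmu_sample *b)
{
	return b->cycles - a->cycles;
}

/* Instructions per cycle over the same interval; 0.0 if no cycles elapsed. */
static double ipc_between(const struct pmu_sample *a,
			  const struct pmu_sample *b)
{
	uint64_t cyc = b->cycles - a->cycles;
	uint64_t ins = b->instructions - a->instructions;

	return cyc ? (double)ins / (double)cyc : 0.0;
}
```

Feeding it the sample at enter_sys_read and the following one at
enter_sys_write gives the numbers for A, and the next pair gives B.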

The above is still a casual idea. Currently I am focusing on bringing
eBPF to perf. This should be the base for all the other interesting
stuff. However, I'm glad to see people discussing it.

Thank you.