Re: [PATCH v0] bpf: BPF based latency tracing

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Daniel Wagner <daniel.wagner@bmw-carit.de>
To: Alexei Starovoitov <ast@plumgrid.com>
Cc: <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v0] bpf: BPF based latency tracing
Date: Fri, 19 Jun 2015 08:07:37 +0200	[thread overview]
Message-ID: <5583B1A9.60503@bmw-carit.de> (raw)
In-Reply-To: <5582FA7A.4050505@plumgrid.com>

On 06/18/2015 07:06 PM, Alexei Starovoitov wrote:
> On 6/18/15 4:40 AM, Daniel Wagner wrote:
>> BPF offers another way to generate latency histograms. We attach
>> kprobes at trace_preempt_off and trace_preempt_on and calculate the
>> time it takes to from seeing the off/on transition.
>>
>> The first array is used to store the start time stamp. The key is the
>> CPU id. The second array stores the log2(time diff). We need to use
>> static allocation here (array and not hash tables). The kprobes
>> hooking into trace_preempt_on|off should not calling any dynamic
>> memory allocation or free path. We need to avoid recursivly
>> getting called. Besides that, it reduces jitter in the measurement.
>>
>> CPU 0
>>        latency        : count     distribution
>>         1 -> 1        : 0       
>> |                                        |
>>         2 -> 3        : 0       
>> |                                        |
>>         4 -> 7        : 0       
>> |                                        |
>>         8 -> 15       : 0       
>> |                                        |
>>        16 -> 31       : 0       
>> |                                        |
>>        32 -> 63       : 0       
>> |                                        |
>>        64 -> 127      : 0       
>> |                                        |
>>       128 -> 255      : 0       
>> |                                        |
>>       256 -> 511      : 0       
>> |                                        |
>>       512 -> 1023     : 0       
>> |                                        |
>>      1024 -> 2047     : 0       
>> |                                        |
>>      2048 -> 4095     : 166723  
>> |*************************************** |
>>      4096 -> 8191     : 19870   
>> |***                                     |
>>      8192 -> 16383    : 6324    
>> |                                        |
>>     16384 -> 32767    : 1098    
>> |                                        |
> 
> nice useful sample indeed!
> The numbers are non-JITed, right?

You are right. I forgot to enable the JIT compiler. 

> JIT should reduce the measurement cost 2-3x, but preempt_on/off
> latency probably will stay in 2k range.

Here is an example output with JIT enabled after a few seconds.

      latency        : count     distribution
       1 -> 1        : 0        |                                        |
       2 -> 3        : 0        |                                        |
       4 -> 7        : 0        |                                        |
       8 -> 15       : 0        |                                        |
      16 -> 31       : 0        |                                        |
      32 -> 63       : 0        |                                        |
      64 -> 127      : 0        |                                        |
     128 -> 255      : 0        |                                        |
     256 -> 511      : 0        |                                        |
     512 -> 1023     : 0        |                                        |
    1024 -> 2047     : 0        |                                        |
    2048 -> 4095     : 11595    |*************************************** |
    4096 -> 8191     : 3086     |*********                               |
    8192 -> 16383    : 392      |                                        |
   16384 -> 32767    : 114      |                                        |
   32768 -> 65535    : 16       |                                        |
   65536 -> 131071   : 1        |                                        |
  131072 -> 262143   : 0        |                                        |
  262144 -> 524287   : 0        |                                        |
  524288 -> 1048575  : 34       |                                        |

As you predicted it wont go lower than 2k range. 

>> I am not sure if it is really worth spending more time getting
>> the hash table working for the trace_preempt_[on|off] kprobes.
>> There are so many things which could go wrong, so going with
>> a static version seems for me the right choice.
> 
> agree. for this use case arrays are better choice anyway.
> But I'll keep working on getting hash tables working even
> in this extreme conditions. bpf should be always rock solid.
> 
> I'm only a bit suspicious of kprobes, since we have:
> NOKPROBE_SYMBOL(preempt_count_sub)
> but trace_preemp_on() called by preempt_count_sub()
> don't have this mark...

The original commit indicates that anything called from
preempt_disable() should also be marked as NOKPROBE_SYMBOL:

	commit 43627582799db317e966ecb0002c2c3c9805ec0f
	Author:	Srinivasa Ds <srinivasa@in.ibm.com>  Sun Feb 24 00:24:04 2008
	Committer:	Linus Torvalds <torvalds@woody.linux-foundation.org>  Sun Feb 24 02:13:24 2008
	Original File:	kernel/sched.c

	kprobes: refuse kprobe insertion on add/sub_preempt_counter()

	Kprobes makes use of preempt_disable(),preempt_enable_noresched() and these
	functions inturn call add/sub_preempt_count().  So we need to refuse user from
	inserting probe in to these functions.

	This patch disallows user from probing add/sub_preempt_count().


Obviously, this would render this patch useless. 

>> +SEC("kprobe/trace_preempt_off")

BTW, is there a reason why not supporting build-in
tracepoints/events? It looks like it is only an artificial
limitation of bpf_helpers.

>> +int bpf_prog1(struct pt_regs *ctx)
>> +{
>> +    int cpu = bpf_get_smp_processor_id();
>> +    u64 *ts = bpf_map_lookup_elem(&my_map, &cpu);
>> +
>> +    if (ts)
>> +        *ts = bpf_ktime_get_ns();
> 
> btw, I'm planning to add native per-cpu maps which will
> speed up things more and reduce measurement overhead.

Funny I was about to suggest something like this :)

> I think you can retarget this patch to net-next and send
> it to netdev. It's not too late for this merge window.

I'll rebase it to net-next.

cheers,
daniel

next prev parent reply	other threads:[~2015-06-19  6:07 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-06-18 11:40 [PATCH v0] bpf: BPF based latency tracing Daniel Wagner
2015-06-18 17:06 ` Alexei Starovoitov
2015-06-19  6:07   ` Daniel Wagner [this message]
2015-06-19  7:06     ` Alexei Starovoitov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5583B1A9.60503@bmw-carit.de \
    --to=daniel.wagner@bmw-carit.de \
    --cc=ast@plumgrid.com \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.