public inbox for linux-kernel@vger.kernel.org
* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
@ 2015-03-02 19:29 Alexei Starovoitov
  2015-03-02 22:04 ` Tom Zanussi
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2015-03-02 19:29 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML

On Mon, Mar 2, 2015 at 10:54 AM, Tom Zanussi
<tom.zanussi@linux.intel.com> wrote:
>> >
>> > The idea would be that instead of getting your individually kmalloc'ed
>> > elements on-demand from kmalloc while in the handler, you'd get them
>> > from a pool you've pre-allocated when you set up the table.  This could
>> > be from a list of individual entries you've already kmalloc'ed ahead of
>> > time, or from an array of n * sizeof(entry).
>>
>> that would work, but it's kinda ugly, since we would pre-allocate
>> a lot and may not use it at all.
>>
>
> That's true but you have a user-defined map limit anyway, which you can
> adjust to minimize wastage.  And allocating ahead of time also means you
> perturb the system less while actually tracing.

nope. it's the other way around.
using kmalloc is faster and has less overhead on the whole
system than grabbing cache-cold objects from a special pool.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
@ 2015-03-02 19:52 Alexei Starovoitov
  0 siblings, 0 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2015-03-02 19:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Tom Zanussi, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML

On Mon, Mar 2, 2015 at 11:33 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 2 Mar 2015 11:24:04 -0800
> Alexei Starovoitov <ast@plumgrid.com> wrote:
>
>> well, the percentage of tracepoints called from NMI is tiny
>> compared to the rest, so assuming nmi context
>> everywhere is very inefficient.
>> We could use a pre-allocated pool of map entries when
>> a tracepoint is called from NMI, but we shouldn't be using
>> it in other cases. Just like ring buffers and other things
>> have nmi and non-nmi pools and code paths, it doesn't
>> make sense to disallow kmalloc altogether.
>> btw, calling kmalloc is _faster_ than taking
>> objects from a cache-cold, NMI-only special pool.
>
> Please show the numbers and post the tests when stating something like
> that.

sure. here is the work Jesper is doing to accelerate slub:
http://thread.gmane.org/gmane.linux.kernel.mm/126138
he's measuring a kmalloc/kfree pair at 19ns,
which is way less than the cost of a cache miss on the cold
buffer we'd get from a custom pool. We want to minimize cache
misses, not the absolute number of instructions.
yes, from a custom pool one can receive a buffer a few cycles
faster than from kmalloc, but it will be cold.
The 'prefetch next' trick that slub uses won't work
for a custom pool; that is the main problem with it.
With the main allocator, the buffers come back cache-hot.

* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
@ 2015-03-02 19:24 Alexei Starovoitov
  2015-03-02 19:33 ` Steven Rostedt
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2015-03-02 19:24 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Tom Zanussi, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML

On Mon, Mar 2, 2015 at 10:43 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 2 Mar 2015 10:12:32 -0800
> Alexei Starovoitov <ast@plumgrid.com> wrote:
>
>> I'm not sure what would be the meaning of a hash map that has all
>> elements pre-allocated...
>> As I'm reading your cover letter, I agree, we need to find a way
>> to call kmalloc_notrace-like from tracepoints.
>> Not sure that patch 8's style of duplicating the functions is clean.
>> Can we keep kmalloc/kfree as-is and do something like
>> if (in_tracepoint()) check inside ftrace_raw_kmalloc*  ?
>
> I'm strongly against this. You should not be doing anything in a
> tracepoint that you can't do from NMI context. And calling kmalloc
> happens to be one of them.

well, the percentage of tracepoints called from NMI is tiny
compared to the rest, so assuming nmi context
everywhere is very inefficient.
We could use a pre-allocated pool of map entries when
a tracepoint is called from NMI, but we shouldn't be using
it in other cases. Just like ring buffers and other things
have nmi and non-nmi pools and code paths, it doesn't
make sense to disallow kmalloc altogether.
btw, calling kmalloc is _faster_ than taking
objects from a cache-cold, NMI-only special pool.

* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
@ 2015-03-02 18:40 Alexei Starovoitov
  2015-03-02 18:54 ` Tom Zanussi
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2015-03-02 18:40 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML

On Mon, Mar 2, 2015 at 10:25 AM, Tom Zanussi
<tom.zanussi@linux.intel.com> wrote:
> On Mon, 2015-03-02 at 10:12 -0800, Alexei Starovoitov wrote:
>> On Mon, Mar 2, 2015 at 10:03 AM, Tom Zanussi
>> <tom.zanussi@linux.intel.com> wrote:
>> > On Mon, 2015-03-02 at 09:58 -0800, Alexei Starovoitov wrote:
>> >> On Mon, Mar 2, 2015 at 8:46 AM, Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
>> >> > On Mon, 2015-03-02 at 11:37 -0500, Steven Rostedt wrote:
>> >> >> On Mon,  2 Mar 2015 10:01:00 -0600
>> >> >> Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
>> >> >>
>> >> >> > Add a gfp flag that allows kmalloc() et al to be used in tracing
>> >> >> > functions.
>> >> >> >
>> >> >> > The problem with using kmalloc for tracing is that the tracing
>> >> >> > subsystem should be able to trace kmalloc itself, which it can't do
>> >> >> > directly because of paths like kmalloc()->trace_kmalloc()->kmalloc()
>> >> >> > or kmalloc()->trace_mm_page_alloc()->kmalloc().
>> >> >>
>> >> >> This part I don't like at all. Why can't the memory be preallocated
>> >> >> when the hist is created (the echo 'hist:...')?
>> >> >>
>> >> >
>> >> > Yeah, I didn't like it either.  My original version did exactly what you
>> >> > suggest and preallocated an array of entries to 'allocate' from in order
>> >> > to avoid the problem.
>> >> >
>> >> > But I wanted to attempt to use the bpf_map directly, which already uses
>> >> > kmalloc internally.  My fallback in case this wouldn't fly, which it
>> >> > obviously won't, would be to add an option to have the bpf_map code
>> >> > preallocate a maximum number of entries or pass in a client-owned array
>> >> > for the purpose.  I'll do that if I don't hear any better ideas..
>> >>
>> >> Tom, I'm still reading through the patch set.
>> >> Quick comment for the above.
>> >> Currently there are two map types: array and hash.
>> >> array type is pre-allocating all memory at map creation time.
>> >> hash is allocating on demand.
>> >
>> > OK, so would it make sense to do the same for the hash type, or at least
>> > add an option that does that?
>>
>> I'm not sure what would be the meaning of a hash map that has all
>> elements pre-allocated...
>
> The idea would be that instead of getting your individually kmalloc'ed
> elements on-demand from kmalloc while in the handler, you'd get them
> from a pool you've pre-allocated when you set up the table.  This could
> be from a list of individual entries you've already kmalloc'ed ahead of
> time, or from an array of n * sizeof(entry).

that would work, but it's kinda ugly, since we would pre-allocate
a lot and may not use it at all.

> This would also allow you to avoid GFP_ATOMIC for those.
>
>> As I'm reading your cover letter, I agree, we need to find a way
>> to call kmalloc_notrace-like from tracepoints.
>> Not sure that patch 8's style of duplicating the functions is clean.
>
> No, it's horrible, but it does the job without changing the normal path
> at all.
>
>> Can we keep kmalloc/kfree as-is and do something like
>> if (in_tracepoint()) check inside ftrace_raw_kmalloc*  ?
>
> Yeah, that's essentially what TP_CONDITION() in patch 8 (Make kmem
> memory allocation tracepoints conditional) does.

not quite, I mean something like a global kmalloc recursion flag;
then kmalloc doesn't need to change.
ftrace_raw_event_kmem_alloc() would check a flag
inside the ftrace_event_call struct, or a global recursion flag.

* [PATCH v2 00/15] tracing: 'hist' triggers
@ 2015-03-02 16:00 Tom Zanussi
  2015-03-02 16:01 ` [PATCH 07/15] mm: Add ___GFP_NOTRACE Tom Zanussi
  0 siblings, 1 reply; 16+ messages in thread
From: Tom Zanussi @ 2015-03-02 16:00 UTC (permalink / raw)
  To: rostedt; +Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel,
	Tom Zanussi

This is v2 of my previously posted 'hashtriggers' patchset [1], but
renamed to 'hist triggers' following feedback from v1.

Since then, the kernel has gained a tracing map implementation in the
form of bpf_map, which this patchset makes a bit more generic, exports
and uses (as tracing_map_*, still in the bpf syscall file however).

A large part of the initial hash triggers implementation was devoted
to a map implementation and general-purpose hashing functions, which
have now been subsumed by the bpf maps.  I've completely redone the
trigger patches themselves to work on top of tracing_map.  The result
is a much simpler and easier-to-review patchset that's able to focus
more directly on the problem at hand.

The new version addresses all the comments from the previous review,
including changing the name from hash->hist, adding separate 'hist'
files for the output, and moving the examples into Documentation.

This patchset also includes a couple of other new and related
triggers, enable_hist and disable_hist. They're very similar to the
existing enable_event/disable_event triggers used to automatically
enable and disable events based on a triggering condition, but in
this case they allow hist triggers to be enabled and disabled in the
same way.

The only problem with using the bpf_map implementation for this is
that it uses kmalloc internally, which causes problems when trying to
trace kmalloc itself.  I'm guessing the ebpf tracing code would also
share this problem e.g. when using bpf_maps from probes on kmalloc().
This patchset attempts a solution to that problem (by adding a gfp
flag, ___GFP_NOTRACE, and changing the kmem memory allocation
tracepoints to conditional variants that check for it), but I'm not
sure it's the best way to address it.

There are a couple of important bits of functionality that were
present in v1 but dropped in v2 mainly because I'm still trying to
figure out the best way to accomplish those things using the bpf_map
implementation.

The first is support for compound keys.  Currently, maps can only be
keyed on a single event field, whereas in v1 they could be keyed on
multiple fields.  With support for compound keys, you can create much
more interesting output, such as per-pid lists of syscall or read
counts, e.g.:

  # echo 'hist:keys=common_pid.execname,id.syscall:vals=hitcount' > \
        /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger

  # cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist

  key: common_pid:bash[3112], id:sys_write		       vals: count:69
  key: common_pid:bash[3112], id:sys_rt_sigprocmask	       vals: count:218

  key: common_pid:update-notifier[3164], id:sys_poll	       vals: count:37
  key: common_pid:update-notifier[3164], id:sys_recvfrom       vals: count:118

  key: common_pid:deja-dup-monito[3194], id:sys_sendto	       vals: count:1
  key: common_pid:deja-dup-monito[3194], id:sys_read	       vals: count:4
  key: common_pid:deja-dup-monito[3194], id:sys_poll	       vals: count:8
  key: common_pid:deja-dup-monito[3194], id:sys_recvmsg	       vals: count:8
  key: common_pid:deja-dup-monito[3194], id:sys_getegid	       vals: count:8

  key: common_pid:emacs[3275], id:sys_fsync		       vals: count:1
  key: common_pid:emacs[3275], id:sys_open		       vals: count:1
  key: common_pid:emacs[3275], id:sys_symlink		       vals: count:2
  key: common_pid:emacs[3275], id:sys_poll		       vals: count:23
  key: common_pid:emacs[3275], id:sys_select		       vals: count:23
  key: common_pid:emacs[3275], id:unknown_syscall	       vals: count:34
  key: common_pid:emacs[3275], id:sys_ioctl		       vals: count:60
  key: common_pid:emacs[3275], id:sys_rt_sigprocmask	       vals: count:116

  key: common_pid:cat[3323], id:sys_munmap		       vals: count:1
  key: common_pid:cat[3323], id:sys_fadvise64		       vals: count:1

Related to that is support for sorting on multiple fields.  Currently,
you can sort using only a primary key.  Being able to sort on multiple
keys, or at least a secondary key, is indispensable for seeing trends
when displaying multiple values.

[1] http://thread.gmane.org/gmane.linux.kernel/1673551

Changes from v1:
 - completely rewritten on top of tracing_map (renamed and exported bpf_map)
 - added map clearing and client ops to tracing_map
 - changed the name from 'hash' triggers to 'hist' triggers
 - added new trigger 'pause' feature
 - added new enable_hist and disable_hist triggers
 - added usage for hist/enable_hist/disable_hist to tracing/README
 - moved examples into Documentation/trace/event.txt
 - added ___GFP_NOTRACE, kmalloc/kfree macros, and conditional kmem tracepoints

The following changes since commit 49058038a12cfd9044146a1bf4b286781268d5c9:

  ring-buffer: Do not wake up a splice waiter when page is not full (2015-02-24 14:00:41 -0600)

are available in the git repository at:

  git://git.yoctoproject.org/linux-yocto-contrib.git tzanussi/hist-triggers-v2
  http://git.yoctoproject.org/cgit/cgit.cgi/linux-yocto-contrib/log/?h=tzanussi/hist-triggers-v2

Tom Zanussi (15):
  tracing: Make ftrace_event_field checking functions available
  tracing: Add event record param to trigger_ops.func()
  tracing: Add get_syscall_name()
  bpf: Export bpf map functionality as trace_map_*
  bpf: Export a map-clearing function
  bpf: Add tracing_map client ops
  mm: Add ___GFP_NOTRACE
  tracing: Make kmem memory allocation tracepoints conditional
  tracing: Add kmalloc/kfree macros
  bpf: Make tracing_map use kmalloc/kfree_notrace()
  tracing: Add a per-event-trigger 'paused' field
  tracing: Add 'hist' event trigger command
  tracing: Add sorting to hist triggers
  tracing: Add enable_hist/disable_hist triggers
  tracing: Add 'hist' trigger Documentation

 Documentation/trace/events.txt      |  870 +++++++++++++++++++++
 include/linux/bpf.h                 |   15 +
 include/linux/ftrace_event.h        |    9 +-
 include/linux/gfp.h                 |    3 +-
 include/linux/slab.h                |   61 +-
 include/trace/events/kmem.h         |   28 +-
 kernel/bpf/arraymap.c               |   16 +
 kernel/bpf/hashtab.c                |   39 +-
 kernel/bpf/syscall.c                |  193 ++++-
 kernel/trace/trace.c                |   48 ++
 kernel/trace/trace.h                |   25 +-
 kernel/trace/trace_events.c         |    3 +
 kernel/trace/trace_events_filter.c  |   15 +-
 kernel/trace/trace_events_trigger.c | 1466 ++++++++++++++++++++++++++++++++++-
 kernel/trace/trace_syscalls.c       |   11 +
 mm/slab.c                           |   45 +-
 mm/slob.c                           |   45 +-
 mm/slub.c                           |   47 +-
 18 files changed, 2795 insertions(+), 144 deletions(-)

-- 
1.9.3



end of thread, other threads:[~2015-03-02 23:21 UTC | newest]

Thread overview: 16+ messages
2015-03-02 19:29 [PATCH 07/15] mm: Add ___GFP_NOTRACE Alexei Starovoitov
2015-03-02 22:04 ` Tom Zanussi
2015-03-02 23:21   ` Alexei Starovoitov
  -- strict thread matches above, loose matches on Subject: below --
2015-03-02 19:52 Alexei Starovoitov
2015-03-02 19:24 Alexei Starovoitov
2015-03-02 19:33 ` Steven Rostedt
2015-03-02 18:40 Alexei Starovoitov
2015-03-02 18:54 ` Tom Zanussi
2015-03-02 16:00 [PATCH v2 00/15] tracing: 'hist' triggers Tom Zanussi
2015-03-02 16:01 ` [PATCH 07/15] mm: Add ___GFP_NOTRACE Tom Zanussi
2015-03-02 16:37   ` Steven Rostedt
2015-03-02 16:46     ` Tom Zanussi
2015-03-02 17:58       ` Alexei Starovoitov
2015-03-02 18:03         ` Tom Zanussi
2015-03-02 18:12           ` Alexei Starovoitov
2015-03-02 18:25             ` Tom Zanussi
2015-03-02 18:43             ` Steven Rostedt
