* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
@ 2015-03-02 19:29 Alexei Starovoitov
2015-03-02 22:04 ` Tom Zanussi
0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2015-03-02 19:29 UTC (permalink / raw)
To: Tom Zanussi
Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML
On Mon, Mar 2, 2015 at 10:54 AM, Tom Zanussi
<tom.zanussi@linux.intel.com> wrote:
>> >
>> > The idea would be that instead of getting your individually kmalloc'ed
>> > elements on-demand from kmalloc while in the handler, you'd get them
>> > from a pool you've pre-allocated when you set up the table. This could
>> > be from a list of individual entries you've already kmalloc'ed ahead of
>> > time, or from an array of n * sizeof(entry).
>>
>> would work, but kinda ugly, since we will pre-allocate a lot
>> and may not be using it at all.
>>
>
> That's true but you have a user-defined map limit anyway, which you can
> adjust to minimize wastage. And allocating ahead of time also means you
> perturb the system less while actually tracing.
nope. it's the other way around.
using kmalloc is faster and incurs less overhead on the whole system
than grabbing cache-cold objects from a special pool.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
2015-03-02 19:29 [PATCH 07/15] mm: Add ___GFP_NOTRACE Alexei Starovoitov
@ 2015-03-02 22:04 ` Tom Zanussi
2015-03-02 23:21 ` Alexei Starovoitov
0 siblings, 1 reply; 16+ messages in thread
From: Tom Zanussi @ 2015-03-02 22:04 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML
On Mon, 2015-03-02 at 11:29 -0800, Alexei Starovoitov wrote:
> On Mon, Mar 2, 2015 at 10:54 AM, Tom Zanussi
> <tom.zanussi@linux.intel.com> wrote:
> >> >
> >> > The idea would be that instead of getting your individually kmalloc'ed
> >> > elements on-demand from kmalloc while in the handler, you'd get them
> >> > from a pool you've pre-allocated when you set up the table. This could
> >> > be from a list of individual entries you've already kmalloc'ed ahead of
> >> > time, or from an array of n * sizeof(entry).
> >>
> >> would work, but kinda ugly, since we will pre-allocate a lot
> >> and may not be using it at all.
> >>
> >
> > That's true but you have a user-defined map limit anyway, which you can
> > adjust to minimize wastage. And allocating ahead of time also means you
> > perturb the system less while actually tracing.
>
> nope. it's the other way around.
> using kmalloc is faster and incurs less overhead on the whole system
> than grabbing cache-cold objects from a special pool.
Until you start causing GFP_ATOMIC failures for what you're tracing
because your map has grabbed them all...
Tom
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
2015-03-02 22:04 ` Tom Zanussi
@ 2015-03-02 23:21 ` Alexei Starovoitov
0 siblings, 0 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2015-03-02 23:21 UTC (permalink / raw)
To: Tom Zanussi
Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML
On Mon, Mar 2, 2015 at 2:04 PM, Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
> On Mon, 2015-03-02 at 11:29 -0800, Alexei Starovoitov wrote:
>> On Mon, Mar 2, 2015 at 10:54 AM, Tom Zanussi
>> <tom.zanussi@linux.intel.com> wrote:
>> >> >
>> >> > The idea would be that instead of getting your individually kmalloc'ed
>> >> > elements on-demand from kmalloc while in the handler, you'd get them
>> >> > from a pool you've pre-allocated when you set up the table. This could
>> >> > be from a list of individual entries you've already kmalloc'ed ahead of
>> >> > time, or from an array of n * sizeof(entry).
>> >>
>> >> would work, but kinda ugly, since we will pre-allocate a lot
>> >> and may not be using it at all.
>> >>
>> >
>> > That's true but you have a user-defined map limit anyway, which you can
>> > adjust to minimize wastage. And allocating ahead of time also means you
>> > perturb the system less while actually tracing.
>>
>> nope. it's the other way around.
>> using kmalloc is faster and incurs less overhead on the whole system
>> than grabbing cache-cold objects from a special pool.
>
> Until you start causing GFP_ATOMIC failures for what you're tracing
> because your map has grabbed them all...
hmm, you mean kmalloc runs out of memory since we're
trying to trace on a system that has exhausted all memory already?
so then pre-allocating beforehand will help?
yep, that's a good use case.
I'm not against a pre-allocation feature for maps.
My point was that it's not necessary most of the time.
We can add a flag to map_create to pre-allocate. Easily.
Just needs to be controlled by the user.
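The pre-allocated pool being debated in this thread can be sketched as a simple free list filled at map-creation time, so the tracepoint handler never calls the allocator. This is only an illustrative userspace model: the names (`pool_create`, `pool_get`, `pool_put`) are made up here, not actual bpf or kernel API, and a real in-kernel version would additionally need per-CPU lists or locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct pool_elem {
	struct pool_elem *next;
	char data[];			/* payload, sized at pool creation */
};

struct elem_pool {
	struct pool_elem *free_list;	/* singly linked free list */
	size_t elem_size;
};

/* Allocate every element up front; after this, no allocator calls. */
static int pool_create(struct elem_pool *p, size_t elem_size, unsigned int n)
{
	p->free_list = NULL;
	p->elem_size = elem_size;
	for (unsigned int i = 0; i < n; i++) {
		struct pool_elem *e = malloc(sizeof(*e) + elem_size);
		if (!e)
			return -1;
		e->next = p->free_list;
		p->free_list = e;
	}
	return 0;
}

/* Handler-safe: just pointer pops, no kmalloc-style allocation. */
static void *pool_get(struct elem_pool *p)
{
	struct pool_elem *e = p->free_list;
	if (!e)
		return NULL;	/* pool exhausted: caller drops the event */
	p->free_list = e->next;
	return e->data;
}

static void pool_put(struct elem_pool *p, void *data)
{
	struct pool_elem *e = (struct pool_elem *)
		((char *)data - offsetof(struct pool_elem, data));
	e->next = p->free_list;
	p->free_list = e;
}
```

The trade-off argued above shows up directly in this sketch: `pool_get()` is a couple of pointer operations, but the element it returns has not been touched recently and is likely cache-cold, whereas a slab allocator tends to hand back recently freed, cache-warm memory.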
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
@ 2015-03-02 19:52 Alexei Starovoitov
0 siblings, 0 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2015-03-02 19:52 UTC (permalink / raw)
To: Steven Rostedt
Cc: Tom Zanussi, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML
On Mon, Mar 2, 2015 at 11:33 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 2 Mar 2015 11:24:04 -0800
> Alexei Starovoitov <ast@plumgrid.com> wrote:
>
>> well, the percentage of tracepoints called from NMI is tiny
>> compared to the rest, so assuming nmi context
>> everywhere is very inefficient.
>> Like we can use a pre-allocated pool of map entries when
>> a tracepoint is called from NMI, but we shouldn't be using
>> it in other cases. Just like ring buffers and other things
>> have nmi and non-nmi pools and code paths, it doesn't
>> make sense to disallow kmalloc altogether.
>> btw, calling kmalloc is _faster_ than taking
>> objects from a cache-cold special nmi-only pool.
>
> Please show the numbers and post the tests when stating something like
> that.
sure. here is the work that Jesper is doing to accelerate slub:
http://thread.gmane.org/gmane.linux.kernel.mm/126138
he's measuring a kmalloc/kfree pair at 19ns,
which is way less than the cost of a cache miss on the cold buffer
we'd get from a custom pool. We want to minimize cache
misses, not the absolute number of instructions.
yes, from a custom pool one can receive a buffer a few cycles
faster than from kmalloc, but it will be cold.
The 'prefetch of next' trick that slub is using won't work
for a custom pool. That is the main problem with it.
By using the main allocator, the buffers come in hot.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
@ 2015-03-02 19:24 Alexei Starovoitov
2015-03-02 19:33 ` Steven Rostedt
0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2015-03-02 19:24 UTC (permalink / raw)
To: Steven Rostedt
Cc: Tom Zanussi, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML
On Mon, Mar 2, 2015 at 10:43 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 2 Mar 2015 10:12:32 -0800
> Alexei Starovoitov <ast@plumgrid.com> wrote:
>
>> I'm not sure what would be the meaning of hash map that has all
>> elements pre-allocated...
>> As I'm reading your cover letter, I agree, we need to find a way
>> to call kmalloc_notrace-like from tracepoints.
>> Not sure that patch 8 style of duplicating the functions is clean.
>> Can we keep kmalloc/kfree as-is and do something like
>> if (in_tracepoint()) check inside ftrace_raw_kmalloc* ?
>
> I'm strongly against this. You should not be doing anything in a
> tracepoint that you can't do from NMI context. And calling kmalloc
> happens to be one of them.
well, the percentage of tracepoints called from NMI is tiny
compared to the rest, so assuming nmi context
everywhere is very inefficient.
Like we can use a pre-allocated pool of map entries when
a tracepoint is called from NMI, but we shouldn't be using
it in other cases. Just like ring buffers and other things
have nmi and non-nmi pools and code paths, it doesn't
make sense to disallow kmalloc altogether.
btw, calling kmalloc is _faster_ than taking
objects from a cache-cold special nmi-only pool.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
2015-03-02 19:24 Alexei Starovoitov
@ 2015-03-02 19:33 ` Steven Rostedt
0 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2015-03-02 19:33 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Tom Zanussi, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML
On Mon, 2 Mar 2015 11:24:04 -0800
Alexei Starovoitov <ast@plumgrid.com> wrote:
> well, the percentage of tracepoints called from NMI is tiny
> compared to the rest, so assuming nmi context
> everywhere is very inefficient.
> Like we can use a pre-allocated pool of map entries when
> a tracepoint is called from NMI, but we shouldn't be using
> it in other cases. Just like ring buffers and other things
> have nmi and non-nmi pools and code paths, it doesn't
> make sense to disallow kmalloc altogether.
> btw, calling kmalloc is _faster_ than taking
> objects from a cache-cold special nmi-only pool.
Please show the numbers and post the tests when stating something like
that.
-- Steve
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
@ 2015-03-02 18:40 Alexei Starovoitov
2015-03-02 18:54 ` Tom Zanussi
0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2015-03-02 18:40 UTC (permalink / raw)
To: Tom Zanussi
Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML
On Mon, Mar 2, 2015 at 10:25 AM, Tom Zanussi
<tom.zanussi@linux.intel.com> wrote:
> On Mon, 2015-03-02 at 10:12 -0800, Alexei Starovoitov wrote:
>> On Mon, Mar 2, 2015 at 10:03 AM, Tom Zanussi
>> <tom.zanussi@linux.intel.com> wrote:
>> > On Mon, 2015-03-02 at 09:58 -0800, Alexei Starovoitov wrote:
>> >> On Mon, Mar 2, 2015 at 8:46 AM, Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
>> >> > On Mon, 2015-03-02 at 11:37 -0500, Steven Rostedt wrote:
>> >> >> On Mon, 2 Mar 2015 10:01:00 -0600
>> >> >> Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
>> >> >>
>> >> >> > Add a gfp flag that allows kmalloc() et al to be used in tracing
>> >> >> > functions.
>> >> >> >
>> >> >> > The problem with using kmalloc for tracing is that the tracing
>> >> >> > subsystem should be able to trace kmalloc itself, which it can't do
>> >> >> > directly because of paths like kmalloc()->trace_kmalloc()->kmalloc()
>> >> >> > or kmalloc()->trace_mm_page_alloc()->kmalloc().
>> >> >>
>> >> >> This part I don't like at all. Why can't the memory be preallocated
>> >> >> when the hist is created (the echo 'hist:...')?
>> >> >>
>> >> >
>> >> > Yeah, I didn't like it either. My original version did exactly what you
>> >> > suggest and preallocated an array of entries to 'allocate' from in order
>> >> > to avoid the problem.
>> >> >
>> >> > But I wanted to attempt to use the bpf_map directly, which already uses
>> >> > kmalloc internally. My fallback in case this wouldn't fly, which it
>> >> > obviously won't, would be to add an option to have the bpf_map code
>> >> > preallocate a maximum number of entries or pass in a client-owned array
>> >> > for the purpose. I'll do that if I don't hear any better ideas..
>> >>
>> >> Tom, I'm still reading through the patch set.
>> >> Quick comment for the above.
>> >> Currently there are two map types: array and hash.
>> >> array type is pre-allocating all memory at map creation time.
>> >> hash is allocating on demand.
>> >
>> > OK, so would it make sense to do the same for the hash type, or at least
>> > add an option that does that?
>>
>> I'm not sure what would be the meaning of hash map that has all
>> elements pre-allocated...
>
> The idea would be that instead of getting your individually kmalloc'ed
> elements on-demand from kmalloc while in the handler, you'd get them
> from a pool you've pre-allocated when you set up the table. This could
> be from a list of individual entries you've already kmalloc'ed ahead of
> time, or from an array of n * sizeof(entry).
would work, but kinda ugly, since we will pre-allocate a lot
and may not be using it at all.
> This would also allow you to avoid GFP_ATOMIC for those.
>
>> As I'm reading your cover letter, I agree, we need to find a way
>> to call kmalloc_notrace-like from tracepoints.
>> Not sure that patch 8 style of duplicating the functions is clean.
>
> No, it's horrible, but it does the job without changing the normal path
> at all.
>
>> Can we keep kmalloc/kfree as-is and do something like
>> if (in_tracepoint()) check inside ftrace_raw_kmalloc* ?
>
> Yeah, that's essentially what TP_CONDITION() in patch 8 (Make kmem
> memory allocation tracepoints conditional) does.
not quite, I mean something like a global kmalloc recursion flag.
then kmalloc doesn't need to change.
ftrace_raw_event_kmem_alloc() would use a flag
inside the ftrace_event_call struct or a global recursion flag.
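The recursion-flag idea suggested here can be modeled in a few lines: the event handler bails out if it is already active on the current context, so an allocation made while recording the event cannot re-enter the handler. This is a hedged, single-threaded sketch with made-up names; in the kernel the flag would have to be per-CPU or per-task, not a plain global.

```c
#include <assert.h>
#include <stdbool.h>

static bool in_kmem_trace;	/* per-CPU/per-task in a real implementation */
static int events_recorded;
static int recursions_skipped;

static void record_event(void);

/* Stand-in for ftrace_raw_event_kmem_alloc(): the kmalloc tracepoint handler. */
static void kmem_alloc_handler_sketch(void)
{
	if (in_kmem_trace) {
		/* kmalloc was called from inside the handler: break the
		 * kmalloc() -> trace_kmalloc() -> kmalloc() loop here. */
		recursions_skipped++;
		return;
	}
	in_kmem_trace = true;
	record_event();		/* may allocate, re-entering the check above */
	in_kmem_trace = false;
}

/* Simulated event recording that itself allocates, firing the
 * tracepoint again, exactly the recursion discussed in the thread. */
static void record_event(void)
{
	events_recorded++;
	kmem_alloc_handler_sketch();	/* the nested trace_kmalloc() */
}
```

Running the handler once records exactly one event and silently swallows the nested invocation, which is the behavior the flag is meant to guarantee without changing kmalloc() itself.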
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
2015-03-02 18:40 Alexei Starovoitov
@ 2015-03-02 18:54 ` Tom Zanussi
0 siblings, 0 replies; 16+ messages in thread
From: Tom Zanussi @ 2015-03-02 18:54 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML
On Mon, 2015-03-02 at 10:40 -0800, Alexei Starovoitov wrote:
> On Mon, Mar 2, 2015 at 10:25 AM, Tom Zanussi
> <tom.zanussi@linux.intel.com> wrote:
> > On Mon, 2015-03-02 at 10:12 -0800, Alexei Starovoitov wrote:
> >> On Mon, Mar 2, 2015 at 10:03 AM, Tom Zanussi
> >> <tom.zanussi@linux.intel.com> wrote:
> >> > On Mon, 2015-03-02 at 09:58 -0800, Alexei Starovoitov wrote:
> >> >> On Mon, Mar 2, 2015 at 8:46 AM, Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
> >> >> > On Mon, 2015-03-02 at 11:37 -0500, Steven Rostedt wrote:
> >> >> >> On Mon, 2 Mar 2015 10:01:00 -0600
> >> >> >> Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
> >> >> >>
> >> >> >> > Add a gfp flag that allows kmalloc() et al to be used in tracing
> >> >> >> > functions.
> >> >> >> >
> >> >> >> > The problem with using kmalloc for tracing is that the tracing
> >> >> >> > subsystem should be able to trace kmalloc itself, which it can't do
> >> >> >> > directly because of paths like kmalloc()->trace_kmalloc()->kmalloc()
> >> >> >> > or kmalloc()->trace_mm_page_alloc()->kmalloc().
> >> >> >>
> >> >> >> This part I don't like at all. Why can't the memory be preallocated
> >> >> >> when the hist is created (the echo 'hist:...')?
> >> >> >>
> >> >> >
> >> >> > Yeah, I didn't like it either. My original version did exactly what you
> >> >> > suggest and preallocated an array of entries to 'allocate' from in order
> >> >> > to avoid the problem.
> >> >> >
> >> >> > But I wanted to attempt to use the bpf_map directly, which already uses
> >> >> > kmalloc internally. My fallback in case this wouldn't fly, which it
> >> >> > obviously won't, would be to add an option to have the bpf_map code
> >> >> > preallocate a maximum number of entries or pass in a client-owned array
> >> >> > for the purpose. I'll do that if I don't hear any better ideas..
> >> >>
> >> >> Tom, I'm still reading through the patch set.
> >> >> Quick comment for the above.
> >> >> Currently there are two map types: array and hash.
> >> >> array type is pre-allocating all memory at map creation time.
> >> >> hash is allocating on demand.
> >> >
> >> > OK, so would it make sense to do the same for the hash type, or at least
> >> > add an option that does that?
> >>
> >> I'm not sure what would be the meaning of hash map that has all
> >> elements pre-allocated...
> >
> > The idea would be that instead of getting your individually kmalloc'ed
> > elements on-demand from kmalloc while in the handler, you'd get them
> > from a pool you've pre-allocated when you set up the table. This could
> > be from a list of individual entries you've already kmalloc'ed ahead of
> > time, or from an array of n * sizeof(entry).
>
> would work, but kinda ugly, since we will pre-allocate a lot
> and may not be using it at all.
>
That's true but you have a user-defined map limit anyway, which you can
adjust to minimize wastage. And allocating ahead of time also means you
perturb the system less while actually tracing.
Tom
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v2 00/15] tracing: 'hist' triggers
@ 2015-03-02 16:00 Tom Zanussi
2015-03-02 16:01 ` [PATCH 07/15] mm: Add ___GFP_NOTRACE Tom Zanussi
0 siblings, 1 reply; 16+ messages in thread
From: Tom Zanussi @ 2015-03-02 16:00 UTC (permalink / raw)
To: rostedt; +Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel,
Tom Zanussi
This is v2 of my previously posted 'hashtriggers' patchset [1], but
renamed to 'hist triggers' following feedback from v1.
Since then, the kernel has gained a tracing map implementation in the
form of bpf_map, which this patchset makes a bit more generic, exports
and uses (as tracing_map_*, still in the bpf syscall file however).
A large part of the initial hash triggers implementation was devoted
to a map implementation and general-purpose hashing functions, which
have now been subsumed by the bpf maps. I've completely redone the
trigger patches themselves to work on top of tracing_map. The result
is a much simpler and easier-to-review patchset that's able to focus
more directly on the problem at hand.
The new version addresses all the comments from the previous review,
including changing the name from hash->hist, adding separate 'hist'
files for the output, and moving the examples into Documentation.
This patchset also includes a couple other new and related triggers,
enable_hist and disable_hist, very similar to the existing
enable_event/disable_event triggers used to automatically enable and
disable events based on a triggering condition, but in this case
allowing hist triggers to be enabled and disabled in the same way.
The only problem with using the bpf_map implementation for this is
that it uses kmalloc internally, which causes problems when trying to
trace kmalloc itself. I'm guessing the ebpf tracing code would also
share this problem e.g. when using bpf_maps from probes on kmalloc().
This patchset attempts a solution to that problem (by adding a gfp
flag and changing the kmem memory allocation tracepoints to
conditional variants that check for it), but I'm not sure it's
the best way to address it.
There are a couple of important bits of functionality that were
present in v1 but dropped in v2 mainly because I'm still trying to
figure out the best way to accomplish those things using the bpf_map
implementation.
The first is support for compound keys. Currently, maps can only be
keyed on a single event field, whereas in v1 they could be keyed on
multiple keys. With support for compound keys, you can create much
more interesting output, such as for example per-pid lists of
syscalls or read counts e.g.:
# echo 'hist:keys=common_pid.execname,id.syscall:vals=hitcount' > \
/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
# cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
key: common_pid:bash[3112], id:sys_write vals: count:69
key: common_pid:bash[3112], id:sys_rt_sigprocmask vals: count:218
key: common_pid:update-notifier[3164], id:sys_poll vals: count:37
key: common_pid:update-notifier[3164], id:sys_recvfrom vals: count:118
key: common_pid:deja-dup-monito[3194], id:sys_sendto vals: count:1
key: common_pid:deja-dup-monito[3194], id:sys_read vals: count:4
key: common_pid:deja-dup-monito[3194], id:sys_poll vals: count:8
key: common_pid:deja-dup-monito[3194], id:sys_recvmsg vals: count:8
key: common_pid:deja-dup-monito[3194], id:sys_getegid vals: count:8
key: common_pid:emacs[3275], id:sys_fsync vals: count:1
key: common_pid:emacs[3275], id:sys_open vals: count:1
key: common_pid:emacs[3275], id:sys_symlink vals: count:2
key: common_pid:emacs[3275], id:sys_poll vals: count:23
key: common_pid:emacs[3275], id:sys_select vals: count:23
key: common_pid:emacs[3275], id:unknown_syscall vals: count:34
key: common_pid:emacs[3275], id:sys_ioctl vals: count:60
key: common_pid:emacs[3275], id:sys_rt_sigprocmask vals: count:116
key: common_pid:cat[3323], id:sys_munmap vals: count:1
key: common_pid:cat[3323], id:sys_fadvise64 vals: count:1
Related to that is support for sorting on multiple fields. Currently,
you can sort using only a primary key. Being able to sort on multiple
keys, or at least a secondary key, is indispensable for seeing trends
when displaying multiple values.
[1] http://thread.gmane.org/gmane.linux.kernel/1673551
Changes from v1:
- completely rewritten on top of tracing_map (renamed and exported bpf_map)
- added map clearing and client ops to tracing_map
- changed the name from 'hash' triggers to 'hist' triggers
- added new trigger 'pause' feature
- added new enable_hist and disable_hist triggers
- added usage for hist/enable_hist/disable_hist to tracing/README
- moved examples into Documentation/trace/event.txt
- added ___GFP_NOTRACE, kmalloc/kfree macros, and conditional kmem tracepoints
The following changes since commit 49058038a12cfd9044146a1bf4b286781268d5c9:
ring-buffer: Do not wake up a splice waiter when page is not full (2015-02-24 14:00:41 -0600)
are available in the git repository at:
git://git.yoctoproject.org/linux-yocto-contrib.git tzanussi/hist-triggers-v2
http://git.yoctoproject.org/cgit/cgit.cgi/linux-yocto-contrib/log/?h=tzanussi/hist-triggers-v2
Tom Zanussi (15):
tracing: Make ftrace_event_field checking functions available
tracing: Add event record param to trigger_ops.func()
tracing: Add get_syscall_name()
bpf: Export bpf map functionality as trace_map_*
bpf: Export a map-clearing function
bpf: Add tracing_map client ops
mm: Add ___GFP_NOTRACE
tracing: Make kmem memory allocation tracepoints conditional
tracing: Add kmalloc/kfree macros
bpf: Make tracing_map use kmalloc/kfree_notrace()
tracing: Add a per-event-trigger 'paused' field
tracing: Add 'hist' event trigger command
tracing: Add sorting to hist triggers
tracing: Add enable_hist/disable_hist triggers
tracing: Add 'hist' trigger Documentation
Documentation/trace/events.txt | 870 +++++++++++++++++++++
include/linux/bpf.h | 15 +
include/linux/ftrace_event.h | 9 +-
include/linux/gfp.h | 3 +-
include/linux/slab.h | 61 +-
include/trace/events/kmem.h | 28 +-
kernel/bpf/arraymap.c | 16 +
kernel/bpf/hashtab.c | 39 +-
kernel/bpf/syscall.c | 193 ++++-
kernel/trace/trace.c | 48 ++
kernel/trace/trace.h | 25 +-
kernel/trace/trace_events.c | 3 +
kernel/trace/trace_events_filter.c | 15 +-
kernel/trace/trace_events_trigger.c | 1466 ++++++++++++++++++++++++++++++++++-
kernel/trace/trace_syscalls.c | 11 +
mm/slab.c | 45 +-
mm/slob.c | 45 +-
mm/slub.c | 47 +-
18 files changed, 2795 insertions(+), 144 deletions(-)
--
1.9.3
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 07/15] mm: Add ___GFP_NOTRACE
2015-03-02 16:00 [PATCH v2 00/15] tracing: 'hist' triggers Tom Zanussi
@ 2015-03-02 16:01 ` Tom Zanussi
2015-03-02 16:37 ` Steven Rostedt
0 siblings, 1 reply; 16+ messages in thread
From: Tom Zanussi @ 2015-03-02 16:01 UTC (permalink / raw)
To: rostedt; +Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel,
Tom Zanussi
Add a gfp flag that allows kmalloc() et al to be used in tracing
functions.
The problem with using kmalloc for tracing is that the tracing
subsystem should be able to trace kmalloc itself, which it can't do
directly because of paths like kmalloc()->trace_kmalloc()->kmalloc()
or kmalloc()->trace_mm_page_alloc()->kmalloc().
With this flag, tracing code could use a special version of kmalloc()
that sets __GFP_NOTRACE on every allocation it does, while leaving the
normal kmalloc() path untouched.
This would allow any tracepoint in the kmalloc path to be avoided via
DEFINE_EVENT_CONDITION() redefinitions of those events, which check
for ___GFP_NOTRACE immediately in their execution and break if set,
thereby avoiding the recursion.
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
include/linux/gfp.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b840e3b..acbbc7c 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -34,6 +34,7 @@ struct vm_area_struct;
#define ___GFP_NO_KSWAPD 0x400000u
#define ___GFP_OTHER_NODE 0x800000u
#define ___GFP_WRITE 0x1000000u
+#define ___GFP_NOTRACE 0x2000000u
/* If the above are modified, __GFP_BITS_SHIFT may need updating */
/*
@@ -97,7 +98,7 @@ struct vm_area_struct;
*/
#define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
-#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 26 /* Room for N __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
/* This equals 0, but use constants in case they ever change */
--
1.9.3
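The ___GFP_NOTRACE scheme from the patch above can be modeled in userspace to show the control flow: a tracing-internal kmalloc wrapper ORs in the flag, and the tracepoint checks the flag first and bails, so only the tracer's own allocations go untraced. The flag value mirrors the diff; everything else (`kmalloc_model`, `trace_kmalloc_model`, `kmalloc_notrace_model`) is a hypothetical stand-in, not the kernel's actual API.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned int gfp_t;
#define ___GFP_NOTRACE 0x2000000u	/* the bit added in the diff above */

static int traced_allocs;

/* DEFINE_EVENT_CONDITION()-style check: skip the event when NOTRACE is set. */
static void trace_kmalloc_model(gfp_t flags)
{
	if (flags & ___GFP_NOTRACE)
		return;			/* tracer's own allocation: stay silent */
	traced_allocs++;		/* stand-in for recording the event */
}

/* Normal kmalloc path, untouched except for the conditional tracepoint. */
static void *kmalloc_model(size_t size, gfp_t flags)
{
	trace_kmalloc_model(flags);
	return malloc(size);
}

/* The "special version of kmalloc" used by tracing code itself. */
static void *kmalloc_notrace_model(size_t size, gfp_t flags)
{
	return kmalloc_model(size, flags | ___GFP_NOTRACE);
}
```

One normal allocation fires the tracepoint once; one notrace allocation fires it zero times, which is the recursion break the patch description is after.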
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
2015-03-02 16:01 ` [PATCH 07/15] mm: Add ___GFP_NOTRACE Tom Zanussi
@ 2015-03-02 16:37 ` Steven Rostedt
2015-03-02 16:46 ` Tom Zanussi
0 siblings, 1 reply; 16+ messages in thread
From: Steven Rostedt @ 2015-03-02 16:37 UTC (permalink / raw)
To: Tom Zanussi; +Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel
On Mon, 2 Mar 2015 10:01:00 -0600
Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
> Add a gfp flag that allows kmalloc() et al to be used in tracing
> functions.
>
> The problem with using kmalloc for tracing is that the tracing
> subsystem should be able to trace kmalloc itself, which it can't do
> directly because of paths like kmalloc()->trace_kmalloc()->kmalloc()
> or kmalloc()->trace_mm_page_alloc()->kmalloc().
This part I don't like at all. Why can't the memory be preallocated
when the hist is created (the echo 'hist:...')?
kmalloc must never be called from any tracepoint callback.
This change is currently a showstopper.
-- Steve
>
> With this flag, tracing code could use a special version of kmalloc()
> that sets __GFP_NOTRACE on every allocation it does, while leaving the
> normal kmalloc() path untouched.
>
> This would allow any tracepoint in the kmalloc path to be avoided via
> DEFINE_EVENT_CONDITION() redefinitions of those events, which check
> for ___GFP_NOTRACE immediately in their execution and break if set,
> thereby avoiding the recursion.
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
2015-03-02 16:37 ` Steven Rostedt
@ 2015-03-02 16:46 ` Tom Zanussi
2015-03-02 17:58 ` Alexei Starovoitov
0 siblings, 1 reply; 16+ messages in thread
From: Tom Zanussi @ 2015-03-02 16:46 UTC (permalink / raw)
To: Steven Rostedt; +Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel
On Mon, 2015-03-02 at 11:37 -0500, Steven Rostedt wrote:
> On Mon, 2 Mar 2015 10:01:00 -0600
> Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
>
> > Add a gfp flag that allows kmalloc() et al to be used in tracing
> > functions.
> >
> > The problem with using kmalloc for tracing is that the tracing
> > subsystem should be able to trace kmalloc itself, which it can't do
> > directly because of paths like kmalloc()->trace_kmalloc()->kmalloc()
> > or kmalloc()->trace_mm_page_alloc()->kmalloc().
>
> This part I don't like at all. Why can't the memory be preallocated
> when the hist is created (the echo 'hist:...')?
>
Yeah, I didn't like it either. My original version did exactly what you
suggest and preallocated an array of entries to 'allocate' from in order
to avoid the problem.
But I wanted to attempt to use the bpf_map directly, which already uses
kmalloc internally. My fallback in case this wouldn't fly, which it
obviously won't, would be to add an option to have the bpf_map code
preallocate a maximum number of entries or pass in a client-owned array
for the purpose. I'll do that if I don't hear any better ideas..
Tom
> kmalloc must never be called from any tracepoint callback.
>
> This change is currently a showstopper.
>
> -- Steve
>
>
> >
> > With this flag, tracing code could use a special version of kmalloc()
> > that sets __GFP_NOTRACE on every allocation it does, while leaving the
> > normal kmalloc() path untouched.
> >
> > This would allow any tracepoint in the kmalloc path to be avoided via
> > DEFINE_EVENT_CONDITION() redefinitions of those events, which check
> > for ___GFP_NOTRACE immediately in their execution and break if set,
> > thereby avoiding the recursion.
> >
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
2015-03-02 16:46 ` Tom Zanussi
@ 2015-03-02 17:58 ` Alexei Starovoitov
2015-03-02 18:03 ` Tom Zanussi
0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2015-03-02 17:58 UTC (permalink / raw)
To: Tom Zanussi
Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML
On Mon, Mar 2, 2015 at 8:46 AM, Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
> On Mon, 2015-03-02 at 11:37 -0500, Steven Rostedt wrote:
>> On Mon, 2 Mar 2015 10:01:00 -0600
>> Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
>>
>> > Add a gfp flag that allows kmalloc() et al to be used in tracing
>> > functions.
>> >
>> > The problem with using kmalloc for tracing is that the tracing
>> > subsystem should be able to trace kmalloc itself, which it can't do
>> > directly because of paths like kmalloc()->trace_kmalloc()->kmalloc()
>> > or kmalloc()->trace_mm_page_alloc()->kmalloc().
>>
>> This part I don't like at all. Why can't the memory be preallocated
>> when the hist is created (the echo 'hist:...')?
>>
>
> Yeah, I didn't like it either. My original version did exactly what you
> suggest and preallocated an array of entries to 'allocate' from in order
> to avoid the problem.
>
> But I wanted to attempt to use the bpf_map directly, which already uses
> kmalloc internally. My fallback in case this wouldn't fly, which it
> obviously won't, would be to add an option to have the bpf_map code
> preallocate a maximum number of entries or pass in a client-owned array
> for the purpose. I'll do that if I don't hear any better ideas..
Tom, I'm still reading through the patch set.
Quick comment for the above.
Currently there are two map types: array and hash.
array type is pre-allocating all memory at map creation time.
hash is allocating on demand.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
2015-03-02 17:58 ` Alexei Starovoitov
@ 2015-03-02 18:03 ` Tom Zanussi
2015-03-02 18:12 ` Alexei Starovoitov
0 siblings, 1 reply; 16+ messages in thread
From: Tom Zanussi @ 2015-03-02 18:03 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML
On Mon, 2015-03-02 at 09:58 -0800, Alexei Starovoitov wrote:
> On Mon, Mar 2, 2015 at 8:46 AM, Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
> > On Mon, 2015-03-02 at 11:37 -0500, Steven Rostedt wrote:
> >> On Mon, 2 Mar 2015 10:01:00 -0600
> >> Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
> >>
> >> > Add a gfp flag that allows kmalloc() et al to be used in tracing
> >> > functions.
> >> >
> >> > The problem with using kmalloc for tracing is that the tracing
> >> > subsystem should be able to trace kmalloc itself, which it can't do
> >> > directly because of paths like kmalloc()->trace_kmalloc()->kmalloc()
> >> > or kmalloc()->trace_mm_page_alloc()->kmalloc().
> >>
> >> This part I don't like at all. Why can't the memory be preallocated
> >> when the hist is created (the echo 'hist:...')?
> >>
> >
> > Yeah, I didn't like it either. My original version did exactly what you
> > suggest and preallocated an array of entries to 'allocate' from in order
> > to avoid the problem.
> >
> > But I wanted to attempt to use the bpf_map directly, which already uses
> > kmalloc internally. My fallback in case this wouldn't fly, which it
> > obviously won't, would be to add an option to have the bpf_map code
> > preallocate a maximum number of entries or pass in a client-owned array
> > for the purpose. I'll do that if I don't hear any better ideas..
>
> Tom, I'm still reading through the patch set.
> Quick comment for the above.
> Currently there are two map types: array and hash.
> array type is pre-allocating all memory at map creation time.
> hash is allocating on demand.
OK, so would it make sense to do the same for the hash type, or at least
add an option that does that?
Tom
* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
2015-03-02 18:03 ` Tom Zanussi
@ 2015-03-02 18:12 ` Alexei Starovoitov
2015-03-02 18:25 ` Tom Zanussi
2015-03-02 18:43 ` Steven Rostedt
0 siblings, 2 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2015-03-02 18:12 UTC (permalink / raw)
To: Tom Zanussi
Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML
On Mon, Mar 2, 2015 at 10:03 AM, Tom Zanussi
<tom.zanussi@linux.intel.com> wrote:
> On Mon, 2015-03-02 at 09:58 -0800, Alexei Starovoitov wrote:
>> On Mon, Mar 2, 2015 at 8:46 AM, Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
>> > On Mon, 2015-03-02 at 11:37 -0500, Steven Rostedt wrote:
>> >> On Mon, 2 Mar 2015 10:01:00 -0600
>> >> Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
>> >>
>> >> > Add a gfp flag that allows kmalloc() et al to be used in tracing
>> >> > functions.
>> >> >
>> >> > The problem with using kmalloc for tracing is that the tracing
>> >> > subsystem should be able to trace kmalloc itself, which it can't do
>> >> > directly because of paths like kmalloc()->trace_kmalloc()->kmalloc()
>> >> > or kmalloc()->trace_mm_page_alloc()->kmalloc().
>> >>
>> >> This part I don't like at all. Why can't the memory be preallocated
>> >> when the hist is created (the echo 'hist:...')?
>> >>
>> >
>> > Yeah, I didn't like it either. My original version did exactly what you
>> > suggest and preallocated an array of entries to 'allocate' from in order
>> > to avoid the problem.
>> >
>> > But I wanted to attempt to use the bpf_map directly, which already uses
>> > kmalloc internally. My fallback in case this wouldn't fly, which it
>> > obviously won't, would be to add an option to have the bpf_map code
>> > preallocate a maximum number of entries or pass in a client-owned array
>> > for the purpose. I'll do that if I don't hear any better ideas..
>>
>> Tom, I'm still reading through the patch set.
>> Quick comment for the above.
>> Currently there are two map types: array and hash.
>> array type is pre-allocating all memory at map creation time.
>> hash is allocating on demand.
>
> OK, so would it make sense to do the same for the hash type, or at least
> add an option that does that?
I'm not sure what the meaning of a hash map with all of its
elements pre-allocated would be...
As I'm reading your cover letter, I agree, we need to find a way
to call a kmalloc_notrace-like function from tracepoints.
I'm not sure that patch 8's style of duplicating the functions is clean.
Can we keep kmalloc/kfree as-is and do something like an
if (in_tracepoint()) check inside ftrace_raw_kmalloc*,
so that kmalloc is traced but calls to kmalloc from inside
tracepoints are automatically suppressed?
* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
2015-03-02 18:12 ` Alexei Starovoitov
@ 2015-03-02 18:25 ` Tom Zanussi
2015-03-02 18:43 ` Steven Rostedt
1 sibling, 0 replies; 16+ messages in thread
From: Tom Zanussi @ 2015-03-02 18:25 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML
On Mon, 2015-03-02 at 10:12 -0800, Alexei Starovoitov wrote:
> On Mon, Mar 2, 2015 at 10:03 AM, Tom Zanussi
> <tom.zanussi@linux.intel.com> wrote:
> > On Mon, 2015-03-02 at 09:58 -0800, Alexei Starovoitov wrote:
> >> On Mon, Mar 2, 2015 at 8:46 AM, Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
> >> > On Mon, 2015-03-02 at 11:37 -0500, Steven Rostedt wrote:
> >> >> On Mon, 2 Mar 2015 10:01:00 -0600
> >> >> Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
> >> >>
> >> >> > Add a gfp flag that allows kmalloc() et al to be used in tracing
> >> >> > functions.
> >> >> >
> >> >> > The problem with using kmalloc for tracing is that the tracing
> >> >> > subsystem should be able to trace kmalloc itself, which it can't do
> >> >> > directly because of paths like kmalloc()->trace_kmalloc()->kmalloc()
> >> >> > or kmalloc()->trace_mm_page_alloc()->kmalloc().
> >> >>
> >> >> This part I don't like at all. Why can't the memory be preallocated
> >> >> when the hist is created (the echo 'hist:...')?
> >> >>
> >> >
> >> > Yeah, I didn't like it either. My original version did exactly what you
> >> > suggest and preallocated an array of entries to 'allocate' from in order
> >> > to avoid the problem.
> >> >
> >> > But I wanted to attempt to use the bpf_map directly, which already uses
> >> > kmalloc internally. My fallback in case this wouldn't fly, which it
> >> > obviously won't, would be to add an option to have the bpf_map code
> >> > preallocate a maximum number of entries or pass in a client-owned array
> >> > for the purpose. I'll do that if I don't hear any better ideas..
> >>
> >> Tom, I'm still reading through the patch set.
> >> Quick comment for the above.
> >> Currently there are two map types: array and hash.
> >> array type is pre-allocating all memory at map creation time.
> >> hash is allocating on demand.
> >
> > OK, so would it make sense to do the same for the hash type, or at least
> > add an option that does that?
>
> I'm not sure what would be the meaning of hash map that has all
> elements pre-allocated...
The idea would be that instead of getting your individually kmalloc'ed
elements on-demand from kmalloc while in the handler, you'd get them
from a pool you've pre-allocated when you set up the table. This could
be from a list of individual entries you've already kmalloc'ed ahead of
time, or from an array of n * sizeof(entry).
This would also allow you to avoid GFP_ATOMIC for those.
> As I'm reading your cover letter, I agree, we need to find a way
> to call kmalloc_notrace-like from tracepoints.
> Not sure that patch 8 style of duplicating the functions is clean.
No, it's horrible, but it does the job without changing the normal path
at all.
> Can we keep kmalloc/kfree as-is and do something like
> if (in_tracepoint()) check inside ftrace_raw_kmalloc* ?
Yeah, that's essentially what TP_CONDITION() in patch 8 (Make kmem
memory allocation tracepoints conditional) does.
Tom
> so that kmalloc will be traced but calls to kmalloc from inside
> tracepoints will be automatically suppressed ?
* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
2015-03-02 18:12 ` Alexei Starovoitov
2015-03-02 18:25 ` Tom Zanussi
@ 2015-03-02 18:43 ` Steven Rostedt
1 sibling, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2015-03-02 18:43 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Tom Zanussi, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML
On Mon, 2 Mar 2015 10:12:32 -0800
Alexei Starovoitov <ast@plumgrid.com> wrote:
> I'm not sure what would be the meaning of hash map that has all
> elements pre-allocated...
> As I'm reading your cover letter, I agree, we need to find a way
> to call kmalloc_notrace-like from tracepoints.
> Not sure that patch 8 style of duplicating the functions is clean.
> Can we keep kmalloc/kfree as-is and do something like
> if (in_tracepoint()) check inside ftrace_raw_kmalloc* ?
I'm strongly against this. You should not be doing anything in a
tracepoint that you can't do from NMI context. And calling kmalloc
happens to be one of them.
Not to mention, kmalloc is a hot path, and tracing must not have any
impact on it (no extra if statements).
-- Steve
> so that kmalloc will be traced but calls to kmalloc from inside
> tracepoints will be automatically suppressed ?
end of thread, other threads: [~2015-03-02 23:21 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-02 19:29 [PATCH 07/15] mm: Add ___GFP_NOTRACE Alexei Starovoitov
2015-03-02 22:04 ` Tom Zanussi
2015-03-02 23:21 ` Alexei Starovoitov
-- strict thread matches above, loose matches on Subject: below --
2015-03-02 19:52 Alexei Starovoitov
2015-03-02 19:24 Alexei Starovoitov
2015-03-02 19:33 ` Steven Rostedt
2015-03-02 18:40 Alexei Starovoitov
2015-03-02 18:54 ` Tom Zanussi
2015-03-02 16:00 [PATCH v2 00/15] tracing: 'hist' triggers Tom Zanussi
2015-03-02 16:01 ` [PATCH 07/15] mm: Add ___GFP_NOTRACE Tom Zanussi
2015-03-02 16:37 ` Steven Rostedt
2015-03-02 16:46 ` Tom Zanussi
2015-03-02 17:58 ` Alexei Starovoitov
2015-03-02 18:03 ` Tom Zanussi
2015-03-02 18:12 ` Alexei Starovoitov
2015-03-02 18:25 ` Tom Zanussi
2015-03-02 18:43 ` Steven Rostedt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox