* Checking to see if a bit is _not_ set in a ftrace event filter @ 2014-12-02 2:19 Theodore Ts'o 2014-12-02 2:41 ` Steven Rostedt 0 siblings, 1 reply; 10+ messages in thread From: Theodore Ts'o @ 2014-12-02 2:19 UTC (permalink / raw) To: Steven Rostedt; +Cc: LKML I was trying to do something like this: filter="events/writeback/writeback_mark_inode_dirty/filter" echo "(flags & 2048) && ((state & 2048) == 0)" > $filter ... but that doesn't work. This works: echo "flags & 2048" > $filter But the problem is this: echo "(state & 2048) == 0" > $filter The simplest patch to add this would be add a new filter_ops so we could do this: echo "(state !& 2048)" > $filter ... but that's pretty ugly. But adding more general expression parsing in the ftrace event filter code would be non-trivial, and if we start trying to make things like "!(state & 2048)" or "(state & 2048) == 0", then at some point some crazy person might request supporting something like this: "(state ^ flags) == 2048". :-) So I guess the main question I want to ask is your opinion about whether a patch that adds support for the operator "!&" is too ugly to live? Thanks, - Ted ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Checking to see if a bit is _not_ set in a ftrace event filter 2014-12-02 2:19 Checking to see if a bit is _not_ set in a ftrace event filter Theodore Ts'o @ 2014-12-02 2:41 ` Steven Rostedt 2014-12-02 3:52 ` Alexei Starovoitov 0 siblings, 1 reply; 10+ messages in thread From: Steven Rostedt @ 2014-12-02 2:41 UTC (permalink / raw) To: Theodore Ts'o; +Cc: LKML, Alexei Starovoitov On Mon, 1 Dec 2014 21:19:12 -0500 Theodore Ts'o <tytso@mit.edu> wrote: > I was trying to do something like this: > > filter="events/writeback/writeback_mark_inode_dirty/filter" > echo "(flags & 2048) && ((state & 2048) == 0)" > $filter > > ... but that doesn't work. > > This works: > > echo "flags & 2048" > $filter > > But the problem is this: > > echo "(state & 2048) == 0" > $filter > > The simplest patch to add this would be add a new filter_ops so we > could do this: > > echo "(state !& 2048)" > $filter > > ... but that's pretty ugly. But adding more general expression > parsing in the ftrace event filter code would be non-trivial, and if > we start trying to make things like "!(state & 2048)" or "(state & > 2048) == 0", then at some point some crazy person might request > supporting something like this: "(state ^ flags) == 2048". :-) > > So I guess the main question I want to ask is your opinion about > whether a patch that adds support for the operator "!&" is too ugly to > live? > Yeah, I don't want to add some bastardization compare that we'll be stuck with till the end of time. Either we modify the tree walk to handle values (it shouldn't be too difficult, but it wont be trivial), or we wait till eBPF is up and running as the trace filter replacement and that should be able to handle this much better. -- Steve ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Checking to see if a bit is _not_ set in a ftrace event filter 2014-12-02 2:41 ` Steven Rostedt @ 2014-12-02 3:52 ` Alexei Starovoitov 2014-12-02 3:57 ` Steven Rostedt 2014-12-02 5:04 ` Theodore Ts'o 0 siblings, 2 replies; 10+ messages in thread From: Alexei Starovoitov @ 2014-12-02 3:52 UTC (permalink / raw) To: Steven Rostedt; +Cc: Theodore Ts'o, LKML On Mon, Dec 1, 2014 at 6:41 PM, Steven Rostedt <rostedt@goodmis.org> wrote: > On Mon, 1 Dec 2014 21:19:12 -0500 > Theodore Ts'o <tytso@mit.edu> wrote: > >> I was trying to do something like this: >> >> filter="events/writeback/writeback_mark_inode_dirty/filter" >> echo "(flags & 2048) && ((state & 2048) == 0)" > $filter >> >> ... but that doesn't work. >> >> This works: >> >> echo "flags & 2048" > $filter >> >> But the problem is this: >> >> echo "(state & 2048) == 0" > $filter >> >> The simplest patch to add this would be add a new filter_ops so we >> could do this: >> >> echo "(state !& 2048)" > $filter >> >> ... but that's pretty ugly. But adding more general expression >> parsing in the ftrace event filter code would be non-trivial, and if >> we start trying to make things like "!(state & 2048)" or "(state & >> 2048) == 0", then at some point some crazy person might request >> supporting something like this: "(state ^ flags) == 2048". :-) >> >> So I guess the main question I want to ask is your opinion about >> whether a patch that adds support for the operator "!&" is too ugly to >> live? >> > > Yeah, I don't want to add some bastardization compare that we'll be > stuck with till the end of time. Either we modify the tree walk to > handle values (it shouldn't be too difficult, but it wont be trivial), > or we wait till eBPF is up and running as the trace filter replacement > and that should be able to handle this much better. yeah. teaching tree walk to do (state & 2048) == 0 is not trivial, since it doesn't have a concept of value of expression. Doing !(state & 2048) is probably a bit easier, since there are hacks for 'not' already. Another alternative is, of course, to wait little bit for eBPF+tracing to land ;) All the core pieces (verifier, bpf syscall, maps) are in net-next already, so in the next dev cycle we can get tracing bits reviewed/tested/merged without cross-tree conflicts. All parsing and code generation will be done by user space, so all complex expression will be supported. It will change the workflow for folks who use 'echo expr > filter' directly. trace-cmd -e -f can be made to work transparently with new features. Ted, I don't see 'writeback_mark_inode_dirty' event in the tree. Some new stuff? What kind of post-filtering are you doing with this event? Just visually checking that trace is sane or the trace output is fed into other tools? Are you trying to aggregate or correlate multiple events (may be based on 'ino') ? One of the goals for eBPF+tracing is to minimize additions of new tracepoints. Right now we already have a ton of them. events/ext4.h is ~2500 lines. Some of them look like hooks for in-production debugging of a function at a time. Sort of like poor's man kprobe/kretprobe. With eBPF we should be able to avoid adding trace_func_enter(), trace_func_exit() to so many func. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Checking to see if a bit is _not_ set in a ftrace event filter 2014-12-02 3:52 ` Alexei Starovoitov @ 2014-12-02 3:57 ` Steven Rostedt 2014-12-02 3:58 ` Steven Rostedt 2014-12-02 4:51 ` Steven Rostedt 2014-12-02 5:04 ` Theodore Ts'o 1 sibling, 2 replies; 10+ messages in thread From: Steven Rostedt @ 2014-12-02 3:57 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: Theodore Ts'o, LKML On Mon, 1 Dec 2014 19:52:11 -0800 Alexei Starovoitov <ast@plumgrid.com> wrote: > yeah. teaching tree walk to do (state & 2048) == 0 is not trivial, > since it doesn't have a concept of value of expression. > Doing !(state & 2048) is probably a bit easier, since there > are hacks for 'not' already. That's exactly what I was looking at doing. If I can get something working without spending more than an hour or two, I may even push it for this release cycle. -- Steve ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Checking to see if a bit is _not_ set in a ftrace event filter 2014-12-02 3:57 ` Steven Rostedt @ 2014-12-02 3:58 ` Steven Rostedt 2014-12-02 4:51 ` Steven Rostedt 1 sibling, 0 replies; 10+ messages in thread From: Steven Rostedt @ 2014-12-02 3:58 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: Theodore Ts'o, LKML On Mon, 1 Dec 2014 22:57:25 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > On Mon, 1 Dec 2014 19:52:11 -0800 > Alexei Starovoitov <ast@plumgrid.com> wrote: > > > yeah. teaching tree walk to do (state & 2048) == 0 is not trivial, > > since it doesn't have a concept of value of expression. > > Doing !(state & 2048) is probably a bit easier, since there > > are hacks for 'not' already. > > That's exactly what I was looking at doing. If I can get something > working without spending more than an hour or two, I may even push it > for this release cycle. Clarification; When I said "release cycle", I meant the next merge window. Not for 3.18. -- Steve ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Checking to see if a bit is _not_ set in a ftrace event filter 2014-12-02 3:57 ` Steven Rostedt 2014-12-02 3:58 ` Steven Rostedt @ 2014-12-02 4:51 ` Steven Rostedt 1 sibling, 0 replies; 10+ messages in thread From: Steven Rostedt @ 2014-12-02 4:51 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: Theodore Ts'o, LKML On Mon, 1 Dec 2014 22:57:25 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > On Mon, 1 Dec 2014 19:52:11 -0800 > Alexei Starovoitov <ast@plumgrid.com> wrote: > > > yeah. teaching tree walk to do (state & 2048) == 0 is not trivial, > > since it doesn't have a concept of value of expression. > > Doing !(state & 2048) is probably a bit easier, since there > > are hacks for 'not' already. > > That's exactly what I was looking at doing. If I can get something > working without spending more than an hour or two, I may even push it > for this release cycle. Actually, this was rather trivial. It took me longer because I was doing this half asleep, but it seems to work. Note, this is a major beta patch. It probably doesn't work with ANDS and ORS, that is !(x == 1 && y == 2) probably doesn't work, but this should: !(x == 1), and so should !(state & 2048). I'll clean this up and get it ready for 3.19. Cheers! -- Steve diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c index 7a8c1528e141..aee7b743c9df 100644 --- a/kernel/trace/trace_events_filter.c +++ b/kernel/trace/trace_events_filter.c @@ -37,6 +37,7 @@ enum filter_op_ids { OP_OR, OP_AND, + OP_NOT, OP_GLOB, OP_NE, OP_EQ, @@ -59,6 +60,7 @@ struct filter_op { static struct filter_op filter_ops[] = { { OP_OR, "||", 1 }, { OP_AND, "&&", 2 }, + { OP_NOT, "!", 3 }, { OP_GLOB, "~", 4 }, { OP_NE, "!=", 4 }, { OP_EQ, "==", 4 }, @@ -166,7 +168,7 @@ static int filter_pred_##type(struct filter_pred *pred, void *event) \ break; \ } \ \ - return match; \ + return match == !pred->not; \ } #define DEFINE_EQUALITY_PRED(size) \ @@ -565,7 +567,7 @@ int filter_match_preds(struct event_filter *filter, void *rec) data.preds = preds = rcu_dereference_sched(filter->preds); ret = walk_pred_tree(preds, root, filter_match_preds_cb, &data); WARN_ON(ret); - return data.match; + return !!data.match; } EXPORT_SYMBOL_GPL(filter_match_preds); @@ -1028,7 +1030,7 @@ static int init_pred(struct filter_parse_state *ps, } if (pred->op == OP_NE) - pred->not = 1; + pred->not ^= 1; pred->fn = fn; return 0; @@ -1590,6 +1592,16 @@ static int replace_preds(struct ftrace_event_call *call, continue; } + if (elt->op == OP_NOT) { + if (!n_preds) { + err = -EINVAL; + goto fail; + } + if (!dry_run) + filter->preds[n_preds - 1].not ^= 1; + continue; + } + if (WARN_ON(n_preds++ == MAX_FILTER_PRED)) { parse_error(ps, FILT_ERR_TOO_MANY_PREDS, 0); err = -ENOSPC; ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: Checking to see if a bit is _not_ set in a ftrace event filter 2014-12-02 3:52 ` Alexei Starovoitov 2014-12-02 3:57 ` Steven Rostedt @ 2014-12-02 5:04 ` Theodore Ts'o 2014-12-02 5:58 ` Alexei Starovoitov 1 sibling, 1 reply; 10+ messages in thread From: Theodore Ts'o @ 2014-12-02 5:04 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: Steven Rostedt, LKML On Mon, Dec 01, 2014 at 07:52:11PM -0800, Alexei Starovoitov wrote: > Ted, I don't see 'writeback_mark_inode_dirty' event > in the tree. Some new stuff? Yep, see: http://thread.gmane.org/gmane.comp.file-systems.ext4/47092 Except instead of the mini-script which I gave in the above URL, I wanted to do additional filtering. The current hack which I am using instead of: echo "flags == 2048" > events/writeback/writeback_mark_inode_dirty/filter is: echo "(flags == 2048) && (state < 2048)" > events/writeback/writeback_mark_inode_dirty/filter ... but this relies on the fact that all of the i_state bits that I care about are at positions 1 << 10 and below. i.e., it's a terrible hack. > What kind of post-filtering are you doing with this event? > Just visually checking that trace is sane or the trace output > is fed into other tools? Are you trying to aggregate or > correlate multiple events (may be based on 'ino') ? I plan to write some tools that agregate based on 'ino', but I haven't yet. > It will change the workflow for folks who use 'echo expr > filter' > directly. trace-cmd -e -f can be made to work transparently > with new features. This will break a bunch of **really** useful scripts found at: https://github.com/brendangregg/perf-tools.git OTOH, Brendan will probably will be able to rewrite them to take advantage of the new interfaces, and I'm sure he'll appreciate the power of being able to use eBPF. :-) > One of the goals for eBPF+tracing is to minimize > additions of new tracepoints. Right now we already > have a ton of them. events/ext4.h is ~2500 lines. > Some of them look like hooks for in-production > debugging of a function at a time. Sort of like poor's man > kprobe/kretprobe. Well, except that kprobe and kretprobe can't give me the arguments passed into the function (unless you compile with full -g debugging info enabled and bloat the object files and compilation time by a factor of 10 --- which I can't stand and why I use ftrace instead of systemtap :-) > With eBPF we should be able to avoid adding > trace_func_enter(), trace_func_exit() to so many func. If eBPF can solve the ability to be able to get at the critical function variables without making the compiled kernel take 10x the disk space and time to compile (mostly due to the time to write out the !@#!@?! bloated object files), that would be great. My understanding though is that this fundamentally requires improved DWARF compression and structure information deduping, which the systemtap folks promised would be coming in improved compiler toolchains many years ago, but as far as I know has never materialized. :-( But that's why I have the trace_func_enter() and trace_func_exit() calls; I need to be able to get do various run-time debugging without needing to recompile the kernel and without forcing all of my development builds to have full debug info. - Ted ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Checking to see if a bit is _not_ set in a ftrace event filter 2014-12-02 5:04 ` Theodore Ts'o @ 2014-12-02 5:58 ` Alexei Starovoitov 2014-12-02 12:14 ` Theodore Ts'o 0 siblings, 1 reply; 10+ messages in thread From: Alexei Starovoitov @ 2014-12-02 5:58 UTC (permalink / raw) To: Theodore Ts'o, Alexei Starovoitov, Steven Rostedt, LKML On Mon, Dec 1, 2014 at 9:04 PM, Theodore Ts'o <tytso@mit.edu> wrote: > On Mon, Dec 01, 2014 at 07:52:11PM -0800, Alexei Starovoitov wrote: > >> It will change the workflow for folks who use 'echo expr > filter' >> directly. trace-cmd -e -f can be made to work transparently >> with new features. > > This will break a bunch of **really** useful scripts found at: I didn't mean that new features will break old. All existing filters will stay as-is. >> One of the goals for eBPF+tracing is to minimize >> additions of new tracepoints. Right now we already >> have a ton of them. events/ext4.h is ~2500 lines. >> Some of them look like hooks for in-production >> debugging of a function at a time. Sort of like poor's man >> kprobe/kretprobe. > > Well, except that kprobe and kretprobe can't give me the arguments > passed into the function (unless you compile with full -g debugging > info enabled and bloat the object files and compilation time by a > factor of 10 --- which I can't stand and why I use ftrace instead of > systemtap :-) well, dwarf is only needed if you have > 6 function arguments in x64, since x64 ABI promotes scalars to 64-bit and passes first six args in registers, so it's trivial to know where arguments are even without debug info. Similar situation is on most 64-bit risc archs. dwarf is needed when one wants to see a value of local C variable somewhere in the middle of the function, but it's not common and rarely works in practice, since var-tracking is not easy for compilers. Another reason people say that dwarf is 'must have' is to access struct fields. The 'inode->i_state' dereference would require stap to use dwarf to know the offset of 'i_state'. For eBPF we don't need debug info in such case, since C compiler does it for us. We can just do 'offsetof(typeof(*inode), i_state)' as part of eBPF C program and llvm will figure out correct field offset during compile time of eBPF program. (for this to work, one need to have matching kernel headers). > If eBPF can solve the ability to be able to get at the critical > function variables ... If 'critical' function variables are arguments or variables accessible via pointer walking (like inode->i_sb->s_fs_info) then eBPF with kernel headers will do the job. (For arguments only no headers needed) > But that's why I have the trace_func_enter() and trace_func_exit() > calls; I need to be able to get do various run-time debugging without > needing to recompile the kernel and without forcing all of my > development builds to have full debug info. Understood. Thanks for sharing! ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Checking to see if a bit is _not_ set in a ftrace event filter 2014-12-02 5:58 ` Alexei Starovoitov @ 2014-12-02 12:14 ` Theodore Ts'o 2014-12-03 0:02 ` Alexei Starovoitov 0 siblings, 1 reply; 10+ messages in thread From: Theodore Ts'o @ 2014-12-02 12:14 UTC (permalink / raw) To: Alexei Starovoitov; +Cc: Steven Rostedt, LKML On Mon, Dec 01, 2014 at 09:58:32PM -0800, Alexei Starovoitov wrote: > well, dwarf is only needed if you have > 6 function > arguments in x64, since x64 ABI promotes scalars > to 64-bit and passes first six args in registers, so it's > trivial to know where arguments are even without debug info. Well, if eBPF can get more information without DWARF, that's great. I'll use whatever I can get, so long as I can do it without DWARF. :-) So if eBPF can do a better job without the presence of debug info, and even a better job with debug info (i.e., distro kernels where it's worth the time to spend over an hour compiling a distro release rpm or deb package), that would clearly be the best of both worlds. I suspect that developers like me will want to continue to add tracepoints so we can do what we need to get done w/o debug info, though. (To me that was always the reason why I ignored systemtap; it focused too much on the need of the release customer --- which makes sense, since nearly all of the stap developers worked for Red Hat --- but it ignored the needs of kernel developers. So they shouldn't have been surprised that most kernel developers returned the favor. :-) > Similar situation is on most 64-bit risc archs. > dwarf is needed when one wants to see a value of > local C variable somewhere in the middle of the function, > but it's not common and rarely works in practice, since > var-tracking is not easy for compilers. Yes; fortunately, we can make that available via adding new tracepoints --- which in general is why I'm in favor of not relying on the ability to grab information via pure kprobe and kretprobe. If we can get the first six arguments on x86_64, then if I didn't think to add a tracepoint, I can do some emergency debugging w/o needing to recompile, which is a win. (But since I also do a lot of debug runs using 32-bit kernels, it would be nice to have things that worked there as well, which is why in general I'll add tracepoints once I find that it might be useful to trace a particular function.) Cheers, - Ted ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Checking to see if a bit is _not_ set in a ftrace event filter 2014-12-02 12:14 ` Theodore Ts'o @ 2014-12-03 0:02 ` Alexei Starovoitov 0 siblings, 0 replies; 10+ messages in thread From: Alexei Starovoitov @ 2014-12-03 0:02 UTC (permalink / raw) To: Theodore Ts'o, Alexei Starovoitov, Steven Rostedt, LKML On Tue, Dec 2, 2014 at 4:14 AM, Theodore Ts'o <tytso@mit.edu> wrote: > > Well, if eBPF can get more information without DWARF, that's great. > I'll use whatever I can get, so long as I can do it without DWARF. :-) eBPF is just an engine. Anyone can develop a userspace bits that will use debug info and/or their favorite language (stap, ktap, dtrace or any other) when right hooks are in place on the kernel side. I'm focusing on restricted C as a way to program it and environments without debug info. > recompile, which is a win. (But since I also do a lot of debug runs > using 32-bit kernels, it would be nice to have things that worked > there as well, which is why in general I'll add tracepoints once I > find that it might be useful to trace a particular function.) Understood. i386 is difficult, since arg passing is using combination of registers(eax,edx,ecx) and stack in non-obvious way, so some trick is needed to get it right without debug info. I'm still trying to solve this one. That's for programs attached to kprobes. Programs that attach to tracepoints will see arguments in a normal way on all archs. Anyway, one step at a time. Easy things first. As one of the crazy things for the future I was thinking about mini-tracepoints. Instead of generating full ftrace_raw_* bodies, generate minimal stubs to collect tracepoint arguments only and invoke programs. In other words move responsibility of TP_STRUCT__entry, TP_fast_assign and TP_printk into the bpf program. On x64 all we need is single stub for all mini-tracepoints. For 32-bit archs all mini-tracepoint stubs will be different to accomodate 32/64-bit argument passing. That will save quite a bit from .text and hopefully more people will enable such tracepoints in production. ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2014-12-03 0:02 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2014-12-02 2:19 Checking to see if a bit is _not_ set in a ftrace event filter Theodore Ts'o 2014-12-02 2:41 ` Steven Rostedt 2014-12-02 3:52 ` Alexei Starovoitov 2014-12-02 3:57 ` Steven Rostedt 2014-12-02 3:58 ` Steven Rostedt 2014-12-02 4:51 ` Steven Rostedt 2014-12-02 5:04 ` Theodore Ts'o 2014-12-02 5:58 ` Alexei Starovoitov 2014-12-02 12:14 ` Theodore Ts'o 2014-12-03 0:02 ` Alexei Starovoitov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox