Re: [RFC/PATCH] ftrace: add set_graph_notrace filter

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Namhyung Kim <namhyung@kernel.org>
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>,
	Ingo Molnar <mingo@kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Namhyung Kim <namhyung.kim@lge.com>,
	Martin Schwidefsky <schwidefsky@de.ibm.com>,
	Heiko Carstens <heiko.carstens@de.ibm.com>
Subject: Re: [RFC/PATCH] ftrace: add set_graph_notrace filter
Date: Fri, 11 Oct 2013 17:19:46 +0900	[thread overview]
Message-ID: <87a9igcj8d.fsf@sejong.aot.lge.com> (raw)
In-Reply-To: <20131011001717.26b92150@gandalf.local.home> (Steven Rostedt's message of "Fri, 11 Oct 2013 00:17:17 -0400")

Hi Steve,

On Fri, 11 Oct 2013 00:17:17 -0400, Steven Rostedt wrote:
> Sorry for the very late reply, finally got some time to look at other
> peoples code.

Thank you for taking your time to review this carefully. :)

>
> On Tue,  3 Sep 2013 14:05:08 +0900
> Namhyung Kim <namhyung@kernel.org> wrote:
>
>> From: Namhyung Kim <namhyung.kim@lge.com>
>> 
>> The set_graph_notrace filter is analogous to set_ftrace_notrace and
>> can be used for eliminating uninteresting part of function graph trace
>> output.  It also works with set_graph_function nicely.
>> 
>>   # cd /sys/kernel/debug/tracing/
>>   # echo do_page_fault > set_graph_function
>>   # perf ftrace live true
>>    2)               |  do_page_fault() {
>>    2)               |    __do_page_fault() {
>>    2)   0.381 us    |      down_read_trylock();
>>    2)   0.055 us    |      __might_sleep();
>>    2)   0.696 us    |      find_vma();
>>    2)               |      handle_mm_fault() {
>>    2)               |        handle_pte_fault() {
>>    2)               |          __do_fault() {
>>    2)               |            filemap_fault() {
>>    2)               |              find_get_page() {
>>    2)   0.033 us    |                __rcu_read_lock();
>>    2)   0.035 us    |                __rcu_read_unlock();
>>    2)   1.696 us    |              }
>>    2)   0.031 us    |              __might_sleep();
>>    2)   2.831 us    |            }
>>    2)               |            _raw_spin_lock() {
>>    2)   0.046 us    |              add_preempt_count();
>>    2)   0.841 us    |            }
>>    2)   0.033 us    |            page_add_file_rmap();
>>    2)               |            _raw_spin_unlock() {
>>    2)   0.057 us    |              sub_preempt_count();
>>    2)   0.568 us    |            }
>>    2)               |            unlock_page() {
>>    2)   0.084 us    |              page_waitqueue();
>>    2)   0.126 us    |              __wake_up_bit();
>>    2)   1.117 us    |            }
>>    2)   7.729 us    |          }
>>    2)   8.397 us    |        }
>>    2)   8.956 us    |      }
>>    2)   0.085 us    |      up_read();
>>    2) + 12.745 us   |    }
>>    2) + 13.401 us   |  }
>>   ...
>> 
>>   # echo handle_mm_fault > set_graph_notrace
>>   # perf ftrace live true
>>    1)               |  do_page_fault() {
>>    1)               |    __do_page_fault() {
>>    1)   0.205 us    |      down_read_trylock();
>>    1)   0.041 us    |      __might_sleep();
>>    1)   0.344 us    |      find_vma();
>>    1)   0.069 us    |      up_read();
>>    1)   4.692 us    |    }
>>    1)   5.311 us    |  }
>>   ...
>
> Concept looks good.

hehe, thanks.

[SNIP]
>
>> @@ -3910,8 +3971,6 @@ out:
>>  	if (fail)
>>  		return -EINVAL;
>>  
>> -	ftrace_graph_filter_enabled = !!(*idx);
>> -
>>  	return 0;
>>  }
>>  
>> @@ -3946,6 +4005,50 @@ ftrace_graph_write(struct file *file, const char __user *ubuf,
>>  
>>  	ret = read;
>>  
>> +	ftrace_graph_filter_enabled = ftrace_graph_count > 0;
>
> You modified the ftrace_graph_filter code without mentioning this in
> the change log. Actually, this should be a separate patch (clean up)
> and not just included with the ftrace_graph_notrace code, as it is a
> separate change.

Okay, I'll make it a separate patch.  And I think we can get rid of
ftrace_graph_{filter,notrace}_enabled altogether and just use _count
variable instead.

>
>> +
>> +out_free:
>> +	trace_parser_put(&parser);
>> +out_unlock:
>> +	mutex_unlock(&graph_lock);
>> +
>> +	return ret;
>> +}
>> +
>> +static ssize_t
>> +ftrace_graph_notrace_write(struct file *file, const char __user *ubuf,
>> +			   size_t cnt, loff_t *ppos)
>> +{
>> +	struct trace_parser parser;
>> +	ssize_t read, ret;
>> +
>> +	if (!cnt)
>> +		return 0;
>> +
>> +	mutex_lock(&graph_lock);
>> +
>> +	if (trace_parser_get_init(&parser, FTRACE_BUFF_MAX)) {
>> +		ret = -ENOMEM;
>> +		goto out_unlock;
>> +	}
>> +
>> +	read = trace_get_user(&parser, ubuf, cnt, ppos);
>> +
>> +	if (read >= 0 && trace_parser_loaded((&parser))) {
>> +		parser.buffer[parser.idx] = 0;
>> +
>> +		/* we allow only one expression at a time */
>> +		ret = ftrace_set_func(ftrace_graph_notrace_funcs,
>> +				      &ftrace_graph_notrace_count,
>> +				      parser.buffer);
>> +		if (ret)
>> +			goto out_free;
>> +	}
>> +
>> +	ret = read;
>> +
>> +	ftrace_graph_notrace_enabled = ftrace_graph_notrace_count > 0;
>> +
>
> ftrace_graph_notrace_write() and ftrace_graph_write() are basically
> identical. Please make a helper function that does the work and passes
> in the different variables. That is:
>
> static ssize_t
> ftrace_graph_notrace_write(struct file *file, const char __user *ubuf,
> 		size_t cnt, loff_t *ppos)
> {
> 	return __ftrace_graph_write(file, ubuf, cnt, ppos,
> 				ftrace_graph_notrace_funcs,
> 				&ftrace_graph_notrace_count)
> }
>
> and do the same with ftrace_graph_write(). This way if we find a bug,
> or update the code, it gets done to both and not just one.

Agreed, will do that.

[SNIP]
>
>> diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
>> index b5c09242683d..3ba240daa5e0 100644
>> --- a/kernel/trace/trace_functions_graph.c
>> +++ b/kernel/trace/trace_functions_graph.c
>> @@ -114,16 +114,22 @@ ftrace_push_return_trace(unsigned long ret, unsigned long func, int *depth,
>>  		return -EBUSY;
>>  	}
>>  
>> +	/* The function was filtered out */
>> +	if (current->curr_ret_stack < -1)
>> +		return -EBUSY;
>
> I would be a bit more specific in your comment. Explain that
> curr_ret_stack is negative when we hit a function in the
> set_graph_notrace file.

How about this:

	/*
	 * curr_ret_stack is initialized to -1 and gets increased in
	 * this function.  So it can be less than -1 only if it was
	 * filtered out via ftrace_graph_notrace_addr() which can be
	 * set from set_graph_notrace file in debugfs by user.
	 */

>
>> +
>>  	calltime = trace_clock_local();
>>  
>>  	index = ++current->curr_ret_stack;
>> +	if (ftrace_graph_notrace_addr(func))
>> +		current->curr_ret_stack -= FTRACE_NOTRACE_DEPTH;
>
> I really hate this double call to ftrace_graph_notrace_addr(). The
> first one in trace_graph_entry(), and then here too.
>
> Isn't there a way we could pass the state? Hmm, I think we could use
> depth to do that. As depth is a pointer to trace.depth and not used
> before then. We could make it negative and then check that.
>
> /me looks at other archs.
>
> Darn it, s390 calls ftrace_push_return_trace() before
> ftrace_graph_entry(). They got fancy, as they must have noticed that
> trace.depth is set by ftrace_push_return_trace() and they just figured
> to let the ftrace_push_return_trace() do the work.

Yes, I thought about it before but as you can see other archs go to the
other way around so I just ended up to call it twice.

>
> Hmm, we could add a config to do the check on one side or the other
> depending on how the arch handles it.
>
> It needs to be well commented though.

Hmm.. maybe.  But it'd better if this function call order is fixed
IMHO.  Anyway, I'll add comments.

>
>
>
>>  	barrier();
>>  	current->ret_stack[index].ret = ret;
>>  	current->ret_stack[index].func = func;
>>  	current->ret_stack[index].calltime = calltime;
>>  	current->ret_stack[index].subtime = 0;
>>  	current->ret_stack[index].fp = frame_pointer;
>> -	*depth = index;
>> +	*depth = current->curr_ret_stack;
>>  
>>  	return 0;
>>  }
>> @@ -137,6 +143,9 @@ ftrace_pop_return_trace(struct ftrace_graph_ret *trace, unsigned long *ret,
>>  
>>  	index = current->curr_ret_stack;
>>  
>> +	if (index < 0)
>> +		index += FTRACE_NOTRACE_DEPTH;
>
> Speaking of comments. This needs to be commented, and not only that,
> perhaps this is where another ftrace_graph_notrace_addr() belongs.
>
> 	bool check = false;
>
> 	if (index < 0) {
> 		index += FTRACE_NOTRACE_DEPTH;
> 		check = true;
> 	}
>
> 	[...]
>
> 	if (check && !ftrace_graph_notrace_addr(trace->func))
> 		[ do the same bug as index < 0 ]
>
> 		
>
> Otherwise if the stack does get corrupted, this will just think we are
> in a notrace and fail someplace else, making it much harder to debug
> what went wrong.
>
> Hmm, this may be racy, as the user could change the notrace file and
> this could happen after that, causing a false positive.
>
> Well, we could do other tricks, like adding a flag in
> current->ret_stack[index].func (LSB set).
>
> But for now, just comment this better. We can come up with a proper
> check later. (/me is rambling as I've been on and off reviewing this
> patch all day, and it's now past my bedtime).

Sorry for interrupting your bedtime. :)

Hmm.. never thought about the corruption.  And yes, simply checking it
again here might cause a trouble, I guess.  I'll add a comment for this.

	/*
	 * A negative index here means that it's just returned from a
	 * notrace'd function.  Recover index to get an original return
	 * address.  See ftrace_push_return_trace().
	 *
	 * TODO: Need to check whether the stack gets corrupted.
	 */

>
>
>> +
>>  	if (unlikely(index < 0)) {
>>  		ftrace_graph_stop();
>>  		WARN_ON(1);
>> @@ -193,6 +202,10 @@ unsigned long ftrace_return_to_handler(unsigned long frame_pointer)
>>  	trace.rettime = trace_clock_local();
>>  	barrier();
>>  	current->curr_ret_stack--;
>> +	if (current->curr_ret_stack < -1) {
>> +		current->curr_ret_stack += FTRACE_NOTRACE_DEPTH;
>> +		return ret;
>> +	}
>>  
>>  	/*
>>  	 * The trace should run after decrementing the ret counter
>> @@ -259,10 +272,14 @@ int trace_graph_entry(struct ftrace_graph_ent *trace)
>>  
>>  	/* trace it when it is-nested-in or is a function enabled. */
>>  	if ((!(trace->depth || ftrace_graph_addr(trace->func)) ||
>> -	     ftrace_graph_ignore_irqs()) ||
>> +	     ftrace_graph_ignore_irqs()) || (trace->depth < 0) ||
>>  	    (max_depth && trace->depth >= max_depth))
>>  		return 0;
>>  
>> +	/* need to reserve a ret_stack entry to recover the depth */
>> +	if (ftrace_graph_notrace_addr(trace->func))
>> +		return 1;
>> +
>
> Also, I understand what you are doing here, with making curr_ret_stack
> negative to denote we are in a state to not do tracing. But it's more
> of a hack (not a bad one), and really needs to be documented somewhere.
> That is, the design should be in the comments where it's used, and 5
> years from now, nobody will understand how the notrace works without
> spending days trying to figure it out. Let's be nice to that poor soul,
> and write up what is going on so they don't need to pray to the holy
> tuna hoping to get a fish of enlightenment on how turning
> curr_ret_stack negative magically makes the function graph tracing not
> trace down functions, and magically starts tracing again when it comes
> back up.

Fully agreed.  How about this:

/*
 * The curr_ret_stack is an index to ftrace return stack of current
 * task.  Its value should be in [0, FTRACE_RETFUNC_DEPTH) when the
 * function graph tracer is used.  To support filtering out specific
 * functions, it makes the index negative by subtracting huge value
 * (FTRACE_NOTRACE_DEPTH) so when it sees a negative index the ftrace
 * will ignore the record.  And the index gets recovered when returning
 * from the filtered function by adding the FTRACE_NOTRACE_DEPTH and
 * then it will continue to record functions normally.
 */

Thanks,
Namhyung

next prev parent reply	other threads:[~2013-10-11  8:19 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-09-03  5:05 [RFC/PATCH] ftrace: add set_graph_notrace filter Namhyung Kim
2013-09-30  7:04 ` Namhyung Kim
2013-10-01  3:58   ` Steven Rostedt
2013-10-10  1:51     ` Namhyung Kim
2013-10-10  2:11       ` Steven Rostedt
2013-10-11  4:17 ` Steven Rostedt
2013-10-11  7:21   ` Heiko Carstens
2013-10-11  8:34     ` Namhyung Kim
2013-10-11  8:58       ` Heiko Carstens
2013-10-11 12:22     ` Steven Rostedt
2013-10-11  8:19   ` Namhyung Kim [this message]
2013-10-11 12:31     ` Steven Rostedt
2013-10-14  0:59       ` Namhyung Kim

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87a9igcj8d.fsf@sejong.aot.lge.com \
    --to=namhyung@kernel.org \
    --cc=fweisbec@gmail.com \
    --cc=heiko.carstens@de.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=namhyung.kim@lge.com \
    --cc=rostedt@goodmis.org \
    --cc=schwidefsky@de.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.