From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751394AbcFWOUk (ORCPT <rfc822;w@1wt.eu>);
	Thu, 23 Jun 2016 10:20:40 -0400
Received: from mx3-phx2.redhat.com ([209.132.183.24]:42555 "EHLO
	mx3-phx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751141AbcFWOUe convert rfc822-to-8bit (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Thu, 23 Jun 2016 10:20:34 -0400
Date: Thu, 23 Jun 2016 10:20:31 -0400 (EDT)
From: Chunyu Hu <chuhu@redhat.com>
Reply-To: Chunyu Hu <chuhu@redhat.com>
To: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-kernel@vger.kernel.org
Message-ID: <746408767.4672536.1466691631491.JavaMail.zimbra@redhat.com>
In-Reply-To: <20160622090814.0bd039e3@gandalf.local.home>
References: <1466506666-28804-1-git-send-email-chuhu@redhat.com> <20160622090814.0bd039e3@gandalf.local.home>
Subject: Re: [PATCH] tracing: Fix oops caused by graph notrace filter
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8BIT
X-Originating-IP: [10.68.5.41]
X-Mailer: Zimbra 8.0.6_GA_5922 (ZimbraWebClient - FF38 (Linux)/8.0.6_GA_5922)
Thread-Topic: tracing: Fix oops caused by graph notrace filter
Thread-Index: 7/2VAQWBqHsi/gMrDs+VHuu4goRU0g==
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org



----- Original Message -----
> From: "Steven Rostedt" <rostedt@goodmis.org>
> To: "Chunyu Hu" <chuhu@redhat.com>
> Cc: linux-kernel@vger.kernel.org
> Sent: Wednesday, June 22, 2016 9:08:14 PM
> Subject: Re: [PATCH] tracing: Fix oops caused by graph notrace filter
> 
> On Tue, 21 Jun 2016 18:57:46 +0800
> Chunyu Hu <chuhu@redhat.com> wrote:
> 
> > wakeup tracer can use function_graph trace when display_graph trace
> > option is setup by user via tracefs, and bypass the set_graph_function
> > and set_graph_notrace. But the bypass of set_graph_notrace is not clean.
> > Although wakeup_graph_entry does most of the bypass, and both the enry
> > and exit event will be submitted to the trace ring buffer, the ret_stack
> > index, which will be assigned to depth field of graph enrty event is not
> > handled. The issue is that the depth is used as the array index of
> > fgraph_cpu_data and can cause an oops when it's negative. irqsoff tracer
> > has same issue. To see the oops:
> > 
> > echo 1 > options/display_graph
> > echo schedule > set_graph_notrace
> > echo wakeup > current_tracer
> > cat trace
> > cat trace
> 
> I'm not able to trigger an oops, but I do see them being traced.

Actually this needs to 'cat trace' for more than two times, after collecting
three graph entry events at least, as the print showed the depth was
-65534. One example call trace is like this, it's usually the same call trace:

[73710.161617] BUG: unable to handle kernel paging request at ffffe8fdf79836e0 
[73710.168616] IP: [<ffffffff81161254>] print_graph_entry+0x3c4/0x3e0 
[73710.174815] PGD 272759067 PUD 272758067 PMD 0  
[73710.179313] Oops: 0002 [#1] SMP 
[73710.182454] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ocrdma ib_core intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper be2iscsi ipmi_ssif cryptd ipmi_devintf cdc_ether iscsi_boot_sysfs libiscsi sg ipmi_si scsi_transport_iscsi iTCO_wdt usbnet pcspkr iTCO_vendor_support mii ipmi_msghandler ioatdma shpchp dca lpc_ich i2c_i801 i7core_edac mfd_core edac_core acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_generic pata_acpi drm mptsas ata_piix scsi_transport_sas mptscsih libata crc32c_intel serio_raw be2net mptbase i2c_core bnx2 dm_mirror dm_region_hash dm_log dm_mod 
[73710.269323] CPU: 8 PID: 31611 Comm: cat Not tainted 4.7.0-rc3 #1 
[73710.275325] Hardware name: IBM System x3550 M3 -[7944I21]-/69Y4438     , BIOS -[D6E157AUS-1.15]- 06/13/2012 
[73710.285054] task: ffff88007f213d80 ti: ffff88007f238000 task.ti: ffff88007f238000 
[73710.292528] RIP: 0010:[<ffffffff81161254>]  [<ffffffff81161254>] print_graph_entry+0x3c4/0x3e0 
[73710.301149] RSP: 0018:ffff88007f23bc68  EFLAGS: 00010286 
[73710.306458] RAX: ffffe8fdf7a036c0 RBX: ffff88007f2f0000 RCX: ffffffff810ef390 
[73710.313585] RDX: ffffffffffff0002 RSI: 0000000000000282 RDI: ffffffff81c81de0 
[73710.320713] RBP: ffff88007f23bca8 R08: 0000000000000003 R09: ffff880444a69ea0 
[73710.327839] R10: 0000000000000001 R11: 000000000000000b R12: ffff88007f23bce0 
[73710.334966] R13: ffff88007f2f1098 R14: ffff880444a69ea0 R15: 0000000000000038 
[73710.342094] FS:  00007fda5c740740(0000) GS:ffff880277b00000(0000) knlGS:0000000000000000 
[73710.350177] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 
[73710.355918] CR2: ffffe8fdf79836e0 CR3: 0000000033c32000 CR4: 00000000000006e0 
[73710.363044] Stack: 
[73710.365059]  ffff8802730ce400 ffff880444a69ea0 000000007f23bcc8 ffff88007f2f0000 
[73710.372524]  ffff88007f2f1098 0000000000000000 0000000000000038 ffff88027557698c 
[73710.379991]  ffff88007f23bd28 ffffffff81161329 0000000000000aa8 0000000000000994 
[73710.387456] Call Trace: 
[73710.389906]  [<ffffffff81161329>] print_graph_function_flags+0xb9/0x500 
[73710.396514]  [<ffffffff810ef390>] ? rcu_sched_qs+0x60/0x60 
[73710.401998]  [<ffffffff8115ea10>] wakeup_print_line+0x20/0x30 
[73710.407740]  [<ffffffff81158524>] print_trace_line+0x54/0x510 
[73710.413481]  [<ffffffff81159389>] s_show+0x29/0x150 
[73710.418357]  [<ffffffff8123ddbe>] seq_read+0x22e/0x370 
[73710.423495]  [<ffffffff81218b67>] __vfs_read+0x37/0x150 
[73710.428716]  [<ffffffff81218b35>] ? __vfs_read+0x5/0x150 
[73710.434027]  [<ffffffff812c9413>] ? security_file_permission+0xa3/0xc0 
[73710.440550]  [<ffffffff81218b35>] ? __vfs_read+0x5/0x150 
[73710.445860]  [<ffffffff8121912e>] vfs_read+0x8e/0x140 
[73710.450908]  [<ffffffff8121a625>] SyS_read+0x55/0xc0 
[73710.455872]  [<ffffffff81003b12>] do_syscall_64+0x62/0x110 
[73710.461356]  [<ffffffff816c65a1>] entry_SYSCALL64_slow_path+0x25/0x25 
[73710.467791] Code: 42 18 49 89 41 38 48 8b 42 20 49 89 41 40 48 8b 42 28 49 89 41 48 4c 89 e0 e9 93 fd ff ff 89 50 08 e9 95 fc ff ff 49 8b 4c 24 08 <48> 89 4c d0 10 e9 2f ff ff ff 4c 89 e0 e9 74 fd ff ff 66 2e 0f  
[73710.487945] RIP  [<ffffffff81161254>] print_graph_entry+0x3c4/0x3e0 
[73710.494229]  RSP <ffff88007f23bc68> 
[73710.497714] CR2: ffffe8fdf79836e0 
 

> > 
> > Making ftrace_graph_notrace_addr always return false when tracers need
> > to bypass it is a proposed fix.
> 
> There's a lot of things wrong with this patch, but before we go there,

Thanks for the attention. Agree :/. , I found one wrong, I read the code 
lines again, the graph_ret event won't be added to ring buffer in this case.
And the trying to bypass is not so wise, I was thinking why filters were 
ignored in this tracer? overhead reason? Looks like it's not so.

> why should the latency tracers ignore set_graph_notrace?

There are two directions. I went to the wrong. If we resolve this through
enabling this graph notrace filter, it's also providing a small extra function 
of this tracer.

> What about this patch?
> 
> -- Steve
> 
> diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
> index 03cdff84d026..68fcee5097d2 100644
> --- a/kernel/trace/trace_irqsoff.c
> +++ b/kernel/trace/trace_irqsoff.c
> @@ -175,6 +175,9 @@ static int irqsoff_graph_entry(struct ftrace_graph_ent
> *trace)
>  	int ret;
>  	int pc;
>  
> +	if (trace->depth < 0)
> +		return 0;
> +

I tried the patch, if just for resolving the oops issue, it's indeed enough.
But if we care about just this filter, and compare this with the trace_graph_entry
filter handling, I found that, the depth 0 item will be submitted to the ring buffer, not
respecting the filter. I still use the same steps. It's possible to show schedule entry:

 419.456410 |     <idle>-0    |  .N..              |  schedule_preempt_disabled() {
  419.456410 |     <idle>-0    |  .N..              |    schedule() {
 ------------------------------------------
 3)    <idle>-0    =>   rngd-1979
 ------------------------------------------

  419.456425 |    rngd-1979    |  d...  0.405 us    |  finish_task_switch();
  419.456428 |    rngd-1979    |  ....              |  __fdget() {
  419.456428 |    rngd-1979    |  ....  0.913 us    |    __fget_light();
  419.456430 |    rngd-1979    |  ....  2.103 us    |  }
  419.456431 |    rngd-1979    |  ....  0.241 us    |  random_poll();
  419.456433 |    rngd-1979    |  ....              |  poll_schedule_timeout() {
  419.456433 |    rngd-1979    |  ....              |    schedule_hrtimeout_range() {
  419.456434 |    rngd-1979    |  ....              |      schedule_hrtimeout_range_clock() {
  419.456435 |    rngd-1979    |  ....              |        schedule() {
  419.456444 |    rngd-1979    |  d...  0.000 us    |  __schedule();
  419.456445 |    rngd-1979    |  d...              |  /*   1979:120:S ==> [003]  1494:100:R kworker/3:1H */

And compared with trace_graph_entry function, don't we need both the two check? 

+	if (trace->depth < 0)
+		return 0;
+       if (ftrace_graph_notrace_addr(trace->func))
+               return 1;



>  	if (!func_prolog_dec(tr, &data, &flags))
>  		return 0;
>  
> diff --git a/kernel/trace/trace_sched_wakeup.c
> b/kernel/trace/trace_sched_wakeup.c
> index 9d4399b553a3..ce9889bff8b4 100644
> --- a/kernel/trace/trace_sched_wakeup.c
> +++ b/kernel/trace/trace_sched_wakeup.c
> @@ -239,6 +239,9 @@ static int wakeup_graph_entry(struct ftrace_graph_ent
> *trace)
>  	unsigned long flags;
>  	int pc, ret = 0;
>  
> +	if (trace->depth < 0)
> +		return 0;
> +

Do't we need this check?
+       if (ftrace_graph_notrace_addr(trace->func))
+               return 1;

>  	if (!func_prolog_preempt_disable(tr, &data, &pc))
>  		return 0;
>  
> 

-- 
Regards,
Chunyu Hu