perf: 3.17 another perf

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

* perf: 3.17 another perf_fuzzer lockup
@ 2014-10-06 15:36 Vince Weaver
  2014-10-06 15:55 ` Vince Weaver
  0 siblings, 1 reply; 9+ messages in thread
From: Vince Weaver @ 2014-10-06 15:36 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org
  Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo


I reported this on the other thread but it seems to be unrelated to the 
other bug, and this time instead of simply wedging the process I got a 
backtrace and the machine locked solid.

This is reproducible, on 3.17 on a core2.

Trying to get something useful out of ftrace but no luck so far.

Vince

[  843.700042] general protection fault: 0000 [#1] SMP 
[  843.704001] Modules linked in: cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative f71882fg mcs7830 evdev usbnet coretemp ohci_pci pcspkr ohci_hcd psmouse serio_raw video wmi i2c_nforce2 button acpi_cpufreq processor thermal_sys sg ehci_pci ehci_hcd sd_mod usbcore usb_common
[  843.704001] CPU: 1 PID: 7273 Comm: perf_fuzzer Not tainted 3.17.0+ #86
[  843.704001] Hardware name: AOpen   DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015  10/19/2012
[  843.704001] task: ffff88011a874000 ti: ffff8800bc0ec000 task.ti: ffff8800bc0ec000
[  843.704001] RIP: 0010:[<ffffffff810cd913>]  [<ffffffff810cd913>] perf_event_context_sched_in.isra.75+0x1f/0x90
[  843.704001] RSP: 0018:ffff8800bc0efd50  EFLAGS: 00010087
[  843.704001] RAX: ffffea0002ba2d68 RBX: ffff1001e742c000 RCX: 000000000000038f
[  843.704001] RDX: ffff88011fc95b30 RSI: ffff880037d0eb00 RDI: ffff880037d0e700
[  843.704001] RBP: ffff8800bc0efd60 R08: ffff8800bc0ec000 R09: 000000000000baff
[  843.704001] R10: 0000000000000006 R11: 00000000000009bc R12: ffff880037d0e700
[  843.704001] R13: ffff8800c944f400 R14: 0000000000000001 R15: ffff88011b340800
[  843.704001] FS:  00007ffc76c17700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
[  843.704001] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  843.704001] DR0: 0000000001c7b000 DR1: 0000000000000000 DR2: 0000000001c7b000
[  843.704001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[  843.704001] Stack:
[  843.704001]  ffff88011a874000 ffff88011b340800 ffff8800bc0efd90 ffffffff810cd9bb
[  843.704001]  ffff88011a8744e8 ffff88011b340800 ffff88011fc929c0 ffff8800c944f400
[  843.704001]  ffff8800bc0efdc0 ffffffff8105ae62 ffff88011fc929c0 ffff8800c944f400
[  843.704001] Call Trace:
[  843.704001]  [<ffffffff810cd9bb>] __perf_event_task_sched_in+0x37/0xf4
[  843.704001]  [<ffffffff8105ae62>] finish_task_switch+0x9b/0xa6
[  843.704001]  [<ffffffff815209c1>] __schedule+0x309/0x4a5
[  843.704001]  [<ffffffff81520df3>] _cond_resched+0x28/0x3b
[  843.704001]  [<ffffffff815217b5>] mutex_lock+0x12/0x2f
[  843.704001]  [<ffffffff810cb9bb>] find_get_context+0xfc/0x170
[  843.704001]  [<ffffffff810cff77>] SYSC_perf_event_open+0x47b/0x7f5
[  843.704001]  [<ffffffff810d0651>] SyS_perf_event_open+0xe/0x10
[  843.704001]  [<ffffffff81523ad1>] tracesys+0xd4/0xd9
[  843.704001] Code: 89 e7 e8 65 fe ff ff 5b 41 5c 5d c3 e8 c7 7e 45 00 55 48 89 e5 41 54 49 89 fc 53 48 8b 07 48 8b 58 40 65 48 03 1c 25 08 ce 00 00 <48> 39 bb d8 00 00 00 74 63 48 89 fe 48 89 df e8 f0 b4 ff ff 49 
[  843.704001] RIP  [<ffffffff810cd913>] perf_event_context_sched_in.isra.75+0x1f/0x90
[  843.704001]  RSP <ffff8800bc0efd50>
[  843.704001] ---[ end trace 24a30c0f4ce68693 ]---
[  843.960002] ------------[ cut here ]------------
[  843.960002] WARNING: CPU: 0 PID: 2018 at kernel/watchdog.c:267 watchdog_overflow_callback+0x9b/0xa7()
[  843.960002] Watchdog detected hard LOCKUP on cpu 0
[  843.960002] Modules linked in: cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative f71882fg mcs7830 evdev usbnet coretemp ohci_pci pcspkr ohci_hcd psmouse serio_raw video wmi i2c_nforce2 button acpi_cpufreq processor thermal_sys sg ehci_pci ehci_hcd sd_mod usbcore usb_common
[  843.960002] CPU: 0 PID: 2018 Comm: rsyslogd Tainted: G      D        3.17.0+ #86
[  843.960002] Hardware name: AOpen   DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015  10/19/2012
[  843.960002]  0000000000000000 ffff88011fc06c20 ffffffff8151e1be ffff88011fc06c68
[  843.960002]  ffff88011fc06c58 ffffffff8103ee43 ffffffff810a6bd6 ffff88011b349800
[  843.960002]  0000000000000000 ffff88011fc06d70 ffff88011fc06ef8 ffff88011fc06cb8
[  843.960002] Call Trace:
[  843.960002]  <NMI>  [<ffffffff8151e1be>] dump_stack+0x45/0x56
[  843.960002]  [<ffffffff8103ee43>] warn_slowpath_common+0x7f/0x98
[  843.960002]  [<ffffffff810a6bd6>] ? watchdog_overflow_callback+0x9b/0xa7
[  843.960002]  [<ffffffff8103eea8>] warn_slowpath_fmt+0x4c/0x4e
[  843.960002]  [<ffffffff81525821>] ? ftrace_call+0x5/0x2f
[  843.960002]  [<ffffffff81525821>] ? ftrace_call+0x5/0x2f
[  843.960002]  [<ffffffff810a6bd6>] watchdog_overflow_callback+0x9b/0xa7
[  843.960002]  [<ffffffff810cec50>] __perf_event_overflow+0x139/0x1bd
[  843.960002]  [<ffffffff810cf14b>] perf_event_overflow+0x19/0x1b
[  843.960002]  [<ffffffff8101981a>] intel_pmu_handle_irq+0x29d/0x323
[  843.960002]  [<ffffffff81525821>] ? ftrace_call+0x5/0x2f
[  843.960002]  [<ffffffff81012da8>] perf_event_nmi_handler+0x25/0x3e
[  843.960002]  [<ffffffff81005e88>] nmi_handle+0x60/0xf3
[  843.960002]  [<ffffffff8102b34d>] ? cpumask_clear_cpu.constprop.3+0x11/0x11
[  843.960002]  [<ffffffff810060df>] default_do_nmi+0x50/0xdc
[  843.960002]  [<ffffffff810061cf>] do_nmi+0x64/0xa6
[  843.960002]  [<ffffffff8152572a>] end_repeat_nmi+0x1e/0x2e
[  843.960002]  [<ffffffff8106a531>] ? do_raw_spin_lock+0x1c/0x27
[  843.960002]  [<ffffffff8106a531>] ? do_raw_spin_lock+0x1c/0x27
[  843.960002]  [<ffffffff8106a531>] ? do_raw_spin_lock+0x1c/0x27
[  843.960002]  <<EOE>>  <IRQ>  [<ffffffff8152350f>] _raw_spin_lock+0xe/0x10
[  843.960002]  [<ffffffff8106326d>] load_balance+0x2bf/0x63d
[  843.960002]  [<ffffffff8105e0cf>] ? try_to_wake_up+0x1c4/0x1d4
[  843.960002]  [<ffffffff81063a5b>] rebalance_domains+0x144/0x205
[  843.960002]  [<ffffffff81063b59>] run_rebalance_domains+0x3d/0x15c
[  843.960002]  [<ffffffff81041bd4>] __do_softirq+0xde/0x238
[  843.960002]  [<ffffffff81041ea3>] irq_exit+0x3f/0x86
[  843.960002]  [<ffffffff8102a4e1>] smp_apic_timer_interrupt+0x35/0x41
[  843.960002]  [<ffffffff8152480a>] apic_timer_interrupt+0x6a/0x70
[  843.960002]  <EOI> 
[  843.960002] ---[ end trace 24a30c0f4ce68694 ]---
[  843.957990] ------------[ cut here ]------------
[  843.957990] WARNING: CPU: 1 PID: 7273 at kernel/watchdog.c:267 watchdog_overflow_callback+0x9b/0xa7()
[  843.957990] Watchdog detected hard LOCKUP on cpu 1
[  843.957990] Modules linked in: cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative f71882fg mcs7830 evdev usbnet coretemp ohci_pci pcspkr ohci_hcd psmouse serio_raw video wmi i2c_nforce2 button acpi_cpufreq processor thermal_sys sg ehci_pci ehci_hcd sd_mod usbcore usb_common
[  843.957990] CPU: 1 PID: 7273 Comm: perf_fuzzer Tainted: G      D W      3.17.0+ #86
[  843.957990] Hardware name: AOpen   DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015  10/19/2012
[  843.957990]  0000000000000000 ffff88011fc86c20 ffffffff8151e1be ffff88011fc86c68
[  843.957990]  ffff88011fc86c58 ffffffff8103ee43 ffffffff810a6bd6 ffff88011ac2ac00
[  843.957990]  0000000000000001 ffff88011fc86d70 ffff88011fc86ef8 ffff88011fc86cb8
[  843.957990] Call Trace:
[  843.957990]  <NMI>  [<ffffffff8151e1be>] dump_stack+0x45/0x56
[  843.957990]  [<ffffffff8103ee43>] warn_slowpath_common+0x7f/0x98
[  843.957990]  [<ffffffff810a6bd6>] ? watchdog_overflow_callback+0x9b/0xa7
[  843.957990]  [<ffffffff8103eea8>] warn_slowpath_fmt+0x4c/0x4e
[  843.957990]  [<ffffffff81525821>] ? ftrace_call+0x5/0x2f
[  843.957990]  [<ffffffff810a6bd6>] watchdog_overflow_callback+0x9b/0xa7
[  843.957990]  [<ffffffff810cec50>] __perf_event_overflow+0x139/0x1bd
[  843.957990]  [<ffffffff810cf14b>] perf_event_overflow+0x19/0x1b
[  843.957990]  [<ffffffff8101981a>] intel_pmu_handle_irq+0x29d/0x323
[  843.957990]  [<ffffffff81525821>] ? ftrace_call+0x5/0x2f
[  843.957990]  [<ffffffff81012da8>] perf_event_nmi_handler+0x25/0x3e
[  843.957990]  [<ffffffff81005e88>] nmi_handle+0x60/0xf3
[  843.957990]  [<ffffffff8102b34d>] ? cpumask_clear_cpu.constprop.3+0x11/0x11
[  843.957990]  [<ffffffff8108d3f0>] ? tick_sched_do_timer+0x2f/0x2f
[  843.957990]  [<ffffffff810060df>] default_do_nmi+0x50/0xdc
[  843.957990]  [<ffffffff810061cf>] do_nmi+0x64/0xa6
[  843.957990]  [<ffffffff8152572a>] end_repeat_nmi+0x1e/0x2e
[  843.957990]  [<ffffffff8108d3f0>] ? tick_sched_do_timer+0x2f/0x2f
[  843.957990]  [<ffffffff8106a538>] ? do_raw_spin_lock+0x23/0x27
[  843.957990]  [<ffffffff8106a538>] ? do_raw_spin_lock+0x23/0x27
[  843.957990]  [<ffffffff8106a538>] ? do_raw_spin_lock+0x23/0x27
[  843.957990]  <<EOE>>  <IRQ>  [<ffffffff8152350f>] _raw_spin_lock+0xe/0x10
[  843.957990]  [<ffffffff8105d52c>] scheduler_tick+0x3d/0x85
[  843.957990]  [<ffffffff8108145a>] update_process_times+0x56/0x65
[  843.957990]  [<ffffffff8108cf19>] tick_sched_handle+0x4a/0x59
[  843.957990]  [<ffffffff8108d42a>] tick_sched_timer+0x3a/0x58
[  843.957990]  [<ffffffff810818c7>] __run_hrtimer+0x92/0x14b
[  843.957990]  [<ffffffff810820a7>] hrtimer_interrupt+0xe8/0x1d3
[  843.957990]  [<ffffffff8102a189>] local_apic_timer_interrupt+0x54/0x57
[  843.957990]  [<ffffffff8102a4dc>] smp_apic_timer_interrupt+0x30/0x41
[  843.957990]  [<ffffffff8152480a>] apic_timer_interrupt+0x6a/0x70
[  843.957990]  <EOI>  [<ffffffff8107439d>] ? console_unlock+0x30d/0x32b
[  843.957990]  [<ffffffff810975d5>] ? acct_collect+0x164/0x16e
[  843.957990]  [<ffffffff8103fed7>] do_exit+0x21c/0x8fe
[  843.957990]  [<ffffffff8151c536>] ? printk+0x54/0x56
[  843.957990]  [<ffffffff81005685>] oops_end+0xa6/0xad
[  843.957990]  [<ffffffff810059cf>] die+0x5a/0x63
[  843.957990]  [<ffffffff81002e7f>] do_general_protection+0x96/0x142
[  843.957990]  [<ffffffff815253c2>] general_protection+0x22/0x30
[  843.957990]  [<ffffffff810cd913>] ? perf_event_context_sched_in.isra.75+0x1f/0x90
[  843.957990]  [<ffffffff810cd9bb>] __perf_event_task_sched_in+0x37/0xf4
[  843.957990]  [<ffffffff8105ae62>] finish_task_switch+0x9b/0xa6
[  843.957990]  [<ffffffff815209c1>] __schedule+0x309/0x4a5
[  843.957990]  [<ffffffff81520df3>] _cond_resched+0x28/0x3b
[  843.957990]  [<ffffffff815217b5>] mutex_lock+0x12/0x2f
[  843.957990]  [<ffffffff810cb9bb>] find_get_context+0xfc/0x170
[  843.957990]  [<ffffffff810cff77>] SYSC_perf_event_open+0x47b/0x7f5
[  843.957990]  [<ffffffff810d0651>] SyS_perf_event_open+0xe/0x10
[  843.957990]  [<ffffffff81523ad1>] tracesys+0xd4/0xd9
[  843.957990] ---[ end trace 24a30c0f4ce68695 ]---



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: perf: 3.17 another perf_fuzzer lockup
  2014-10-06 15:36 perf: 3.17 another perf_fuzzer lockup Vince Weaver
@ 2014-10-06 15:55 ` Vince Weaver
  2014-10-07  0:52   ` Chuck Ebbert
  2014-10-15 18:34   ` Vince Weaver
  0 siblings, 2 replies; 9+ messages in thread
From: Vince Weaver @ 2014-10-06 15:55 UTC (permalink / raw)
  To: Vince Weaver
  Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo

On Mon, 6 Oct 2014, Vince Weaver wrote:

> [  843.700042] general protection fault: 0000 [#1] SMP 
> ...
> [  843.704001] task: ffff88011a874000 ti: ffff8800bc0ec000 task.ti: ffff8800bc0ec000
> [  843.704001] RIP: 0010:[<ffffffff810cd913>]  [<ffffffff810cd913>] perf_event_context_sched_in.isra.75+0x1f/0x90

For what it's worth, this is 

kernel/events/core.c:2646

        if (atomic_read(&__get_cpu_var(perf_cgroup_events)))
                perf_cgroup_sched_in(prev, task);


ffffffff810cd902:       53                      push   %rbx
ffffffff810cd903:       48 8b 07                mov    (%rdi),%rax
ffffffff810cd906:       48 8b 58 40             mov    0x40(%rax),%rbx
ffffffff810cd90a:       65 48 03 1c 25 08 ce    add    %gs:0xce08,%rbx
ffffffff810cd911:       00 00 
ffffffff810cd913:       48 39 bb d8 00 00 00    cmp    %rdi,0xd8(%rbx)
ffffffff810cd91a:       74 63                   je     ffffffff810cd97f <perf_event_context_sched_in.isra.75+0x8b>


> [  843.704001] RSP: 0018:ffff8800bc0efd50  EFLAGS: 00010087
> [  843.704001] RAX: ffffea0002ba2d68 RBX: ffff1001e742c000 RCX: 000000000000038f
> [  843.704001] RDX: ffff88011fc95b30 RSI: ffff880037d0eb00 RDI: ffff880037d0e700
> [  843.704001] RBP: ffff8800bc0efd60 R08: ffff8800bc0ec000 R09: 000000000000baff
> [  843.704001] R10: 0000000000000006 R11: 00000000000009bc R12: ffff880037d0e700
> [  843.704001] R13: ffff8800c944f400 R14: 0000000000000001 R15: ffff88011b340800
> [  843.704001] FS:  00007ffc76c17700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
> [  843.704001] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  843.704001] DR0: 0000000001c7b000 DR1: 0000000000000000 DR2: 0000000001c7b000
> [  843.704001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> [  843.704001] Stack:
> [  843.704001]  ffff88011a874000 ffff88011b340800 ffff8800bc0efd90 ffffffff810cd9bb
> [  843.704001]  ffff88011a8744e8 ffff88011b340800 ffff88011fc929c0 ffff8800c944f400
> [  843.704001]  ffff8800bc0efdc0 ffffffff8105ae62 ffff88011fc929c0 ffff8800c944f400
> [  843.704001] Call Trace:
> [  843.704001]  [<ffffffff810cd9bb>] __perf_event_task_sched_in+0x37/0xf4
> [  843.704001]  [<ffffffff8105ae62>] finish_task_switch+0x9b/0xa6
> [  843.704001]  [<ffffffff815209c1>] __schedule+0x309/0x4a5
> [  843.704001]  [<ffffffff81520df3>] _cond_resched+0x28/0x3b
> [  843.704001]  [<ffffffff815217b5>] mutex_lock+0x12/0x2f
> [  843.704001]  [<ffffffff810cb9bb>] find_get_context+0xfc/0x170
> [  843.704001]  [<ffffffff810cff77>] SYSC_perf_event_open+0x47b/0x7f5
> [  843.704001]  [<ffffffff810d0651>] SyS_perf_event_open+0xe/0x10
> [  843.704001]  [<ffffffff81523ad1>] tracesys+0xd4/0xd9
> [  843.704001] Code: 89 e7 e8 65 fe ff ff 5b 41 5c 5d c3 e8 c7 7e 45 00 55 48 89 e5 41 54 49 89 fc 53 48 8b 07 48 8b 58 40 65 48 03 1c 25 08 ce 00 00 <48> 39 bb d8 00 00 00 74 63 48 89 fe 48 89 df e8 f0 b4 ff ff 49 
> [  843.704001] RIP  [<ffffffff810cd913>] perf_event_context_sched_in.isra.75+0x1f/0x90
> [  843.704001]  RSP <ffff8800bc0efd50>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: perf: 3.17 another perf_fuzzer lockup
  2014-10-06 15:55 ` Vince Weaver
@ 2014-10-07  0:52   ` Chuck Ebbert
  2014-10-15 18:34   ` Vince Weaver
  1 sibling, 0 replies; 9+ messages in thread
From: Chuck Ebbert @ 2014-10-07  0:52 UTC (permalink / raw)
  To: Vince Weaver
  Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo

On Mon, 6 Oct 2014 11:55:11 -0400 (EDT)
Vince Weaver <vincent.weaver@maine.edu> wrote:

> On Mon, 6 Oct 2014, Vince Weaver wrote:
> 
> > [  843.700042] general protection fault: 0000 [#1] SMP 
> > ...
> > [  843.704001] task: ffff88011a874000 ti: ffff8800bc0ec000 task.ti: ffff8800bc0ec000
> > [  843.704001] RIP: 0010:[<ffffffff810cd913>]  [<ffffffff810cd913>] perf_event_context_sched_in.isra.75+0x1f/0x90
> 
> For what it's worth, this is 
> 
> kernel/events/core.c:2646
> 
>         if (atomic_read(&__get_cpu_var(perf_cgroup_events)))
>                 perf_cgroup_sched_in(prev, task);
> 
> 
> ffffffff810cd902:       53                      push   %rbx
> ffffffff810cd903:       48 8b 07                mov    (%rdi),%rax
> ffffffff810cd906:       48 8b 58 40             mov    0x40(%rax),%rbx
> ffffffff810cd90a:       65 48 03 1c 25 08 ce    add    %gs:0xce08,%rbx
> ffffffff810cd911:       00 00 
> ffffffff810cd913:       48 39 bb d8 00 00 00    cmp    %rdi,0xd8(%rbx)
> ffffffff810cd91a:       74 63                   je     ffffffff810cd97f <perf_event_context_sched_in.isra.75+0x8b>
> 

Actually it's:

static void perf_event_context_sched_in(struct perf_event_context *ctx,
					struct task_struct *task)
{
	struct perf_cpu_context *cpuctx;

	cpuctx = __get_cpu_context(ctx);
	if (cpuctx->task_ctx == ctx)      <======= oops
		return;

cpuctx is in %rbx (=ffff1001e742c000) and that's not even a legal
address, which is what caused the general protection fault

> 
> > [  843.704001] RSP: 0018:ffff8800bc0efd50  EFLAGS: 00010087
> > [  843.704001] RAX: ffffea0002ba2d68 RBX: ffff1001e742c000 RCX: 000000000000038f
> > [  843.704001] RDX: ffff88011fc95b30 RSI: ffff880037d0eb00 RDI: ffff880037d0e700
> > [  843.704001] RBP: ffff8800bc0efd60 R08: ffff8800bc0ec000 R09: 000000000000baff
> > [  843.704001] R10: 0000000000000006 R11: 00000000000009bc R12: ffff880037d0e700
> > [  843.704001] R13: ffff8800c944f400 R14: 0000000000000001 R15: ffff88011b340800
> > [  843.704001] FS:  00007ffc76c17700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
> > [  843.704001] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > [  843.704001] DR0: 0000000001c7b000 DR1: 0000000000000000 DR2: 0000000001c7b000
> > [  843.704001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> > [  843.704001] Stack:
> > [  843.704001]  ffff88011a874000 ffff88011b340800 ffff8800bc0efd90 ffffffff810cd9bb
> > [  843.704001]  ffff88011a8744e8 ffff88011b340800 ffff88011fc929c0 ffff8800c944f400
> > [  843.704001]  ffff8800bc0efdc0 ffffffff8105ae62 ffff88011fc929c0 ffff8800c944f400
> > [  843.704001] Call Trace:
> > [  843.704001]  [<ffffffff810cd9bb>] __perf_event_task_sched_in+0x37/0xf4
> > [  843.704001]  [<ffffffff8105ae62>] finish_task_switch+0x9b/0xa6
> > [  843.704001]  [<ffffffff815209c1>] __schedule+0x309/0x4a5
> > [  843.704001]  [<ffffffff81520df3>] _cond_resched+0x28/0x3b
> > [  843.704001]  [<ffffffff815217b5>] mutex_lock+0x12/0x2f
> > [  843.704001]  [<ffffffff810cb9bb>] find_get_context+0xfc/0x170
> > [  843.704001]  [<ffffffff810cff77>] SYSC_perf_event_open+0x47b/0x7f5
> > [  843.704001]  [<ffffffff810d0651>] SyS_perf_event_open+0xe/0x10
> > [  843.704001]  [<ffffffff81523ad1>] tracesys+0xd4/0xd9
> > [  843.704001] Code: 89 e7 e8 65 fe ff ff 5b 41 5c 5d c3 e8 c7 7e 45 00 55 48 89 e5 41 54 49 89 fc 53 48 8b 07 48 8b 58 40 65 48 03 1c 25 08 ce 00 00 <48> 39 bb d8 00 00 00 74 63 48 89 fe 48 89 df e8 f0 b4 ff ff 49 
> > [  843.704001] RIP  [<ffffffff810cd913>] perf_event_context_sched_in.isra.75+0x1f/0x90
> > [  843.704001]  RSP <ffff8800bc0efd50>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: perf: 3.17 another perf_fuzzer lockup
  2014-10-06 15:55 ` Vince Weaver
  2014-10-07  0:52   ` Chuck Ebbert
@ 2014-10-15 18:34   ` Vince Weaver
  2014-10-16 21:25     ` Vince Weaver
  1 sibling, 1 reply; 9+ messages in thread
From: Vince Weaver @ 2014-10-15 18:34 UTC (permalink / raw)
  To: Vince Weaver
  Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo

OK, so it turns out that the oops I saw with memory corruption wasn't the 
bug I was tracking, but something that comes up sometimes when trying to 
run ftrace at the same time as fuzzing.  So we'll leave that for another 
day.

The 3.17+ lockup I am tracking still reproduces as of git from yesterday 
(even after the 3.18-rc perf_event merges).

I can use sysrq to get the stack trace, the one CPU is stuck in a call
to find_get_context().

An example backtrace:

[88200.300003]  <EOI>
[88200.300003]  [<ffffffff81114869>] ? ____cache_alloc+0x130/0x25b
[88200.300003]  [<ffffffff8107fb05>] ? __call_rcu.constprop.63+0x1bf/0x1cb
[88200.300003]  [<ffffffff8107fb2b>] kfree_call_rcu+0x1a/0x1c
[88200.300003]  [<ffffffff810cf84f>] put_ctx+0x51/0x55
[88200.300003]  [<ffffffff810d1840>] find_get_context+0x166/0x195
[88200.300003]  [<ffffffff810d5856>] SYSC_perf_event_open+0x47b/0x7f5
[88200.300003]  [<ffffffff810d5f55>] SyS_perf_event_open+0xe/0x10
[88200.300003]  [<ffffffff815362d6>] system_call_fastpath+0x16/0x1b

It looks like the
			else if (task->perf_event_ctxp[ctxn])
	                        err = -EAGAIN;

case is triggering non-stop in the retry path of 
find_get_context() and so the kernel gets stuck forever retrying.

I can drop some printks in if it will help debug.  I've tried running 
ftrace, but for whatever reason if I enable ftrace the bug won't trigger.

Vince

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: perf: 3.17 another perf_fuzzer lockup
  2014-10-15 18:34   ` Vince Weaver
@ 2014-10-16 21:25     ` Vince Weaver
  2014-10-17 14:19       ` Vince Weaver
  0 siblings, 1 reply; 9+ messages in thread
From: Vince Weaver @ 2014-10-16 21:25 UTC (permalink / raw)
  To: Vince Weaver
  Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo

On Wed, 15 Oct 2014, Vince Weaver wrote:

> I can use sysrq to get the stack trace, the one CPU is stuck in a call
> to find_get_context().
> 
> An example backtrace:
> 
> [88200.300003]  <EOI>
> [88200.300003]  [<ffffffff81114869>] ? ____cache_alloc+0x130/0x25b
> [88200.300003]  [<ffffffff8107fb05>] ? __call_rcu.constprop.63+0x1bf/0x1cb
> [88200.300003]  [<ffffffff8107fb2b>] kfree_call_rcu+0x1a/0x1c
> [88200.300003]  [<ffffffff810cf84f>] put_ctx+0x51/0x55
> [88200.300003]  [<ffffffff810d1840>] find_get_context+0x166/0x195
> [88200.300003]  [<ffffffff810d5856>] SYSC_perf_event_open+0x47b/0x7f5
> [88200.300003]  [<ffffffff810d5f55>] SyS_perf_event_open+0xe/0x10
> [88200.300003]  [<ffffffff815362d6>] system_call_fastpath+0x16/0x1b
> 
> It looks like the
> 			else if (task->perf_event_ctxp[ctxn])
> 	                        err = -EAGAIN;

It is indeed stuck there, waiting for task->perf_event_ctxp[1] to get
set to zero, which never happens.

As far as I can tell it's when a Software event is being opened.

Still struggling through the code trying to figure out what's going on.

[ 7071.252607] VMW: task->perf_event_ctxp[1]=ffff8800cb12ec00, EAGAIN, ref=1
[ 7071.259439] VMW: type=1 config=8
[ 7071.262713] VMW: task->perf_event_ctxp[1]=ffff8800cb12ec00, EAGAIN, ref=1
[ 7071.269506] VMW: task->perf_event_ctxp[1]=ffff8800cb12ec00, EAGAIN, ref=1
[ 7071.276299] VMW: task->perf_event_ctxp[1]=ffff8800cb12ec00, EAGAIN, ref=1
[ 7071.283087] VMW: task->perf_event_ctxp[1]=ffff8800cb12ec00, EAGAIN, ref=1
[ 7071.289879] VMW: task->perf_event_ctxp[1]=ffff8800cb12ec00, EAGAIN, ref=1
[ 7071.296671] VMW: task->perf_event_ctxp[1]=ffff8800cb12ec00, EAGAIN, ref=1
[ 7071.303457] VMW: task->perf_event_ctxp[1]=ffff8800cb12ec00, EAGAIN, ref=1
[ 7071.310248] VMW: task->perf_event_ctxp[1]=ffff8800cb12ec00, EAGAIN, ref=1
[ 7071.317035] VMW: task->perf_event_ctxp[1]=ffff8800cb12ec00, EAGAIN, ref=1
[ 7076.256032] find_get_context: 7246310 callbacks suppressed

Vince

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: perf: 3.17 another perf_fuzzer lockup
  2014-10-16 21:25     ` Vince Weaver
@ 2014-10-17 14:19       ` Vince Weaver
  2014-10-17 15:21         ` Vince Weaver
  0 siblings, 1 reply; 9+ messages in thread
From: Vince Weaver @ 2014-10-17 14:19 UTC (permalink / raw)
  To: Vince Weaver
  Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo

On Thu, 16 Oct 2014, Vince Weaver wrote:

> > It looks like the
> > 			else if (task->perf_event_ctxp[ctxn])
> > 	                        err = -EAGAIN;
> 
> It is indeed stuck there, waiting for task->perf_event_ctxp[1] to get
> set to zero, which never happens.
> 

OK, so with some more printk()s, it looks like somehow the parent thread
is trying to open a software event on itself.

	task->perf_event_ctxp[1]

has a valid pointer, but the ctx it points to has a ctx->lock of 0.

So perf_lock_task_context() always returns NULL.

So in find_get_context()
	we get stuck in an infinite retry loop, waiting forever for
	either ctx->lock to go positive or for task->perf_event_ctxp[1]
	to go NULL, neither of which is going to happen.

Now to find out why this could happen.  Probably something to do with 
crazy RCU magic :(

Vince


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: perf: 3.17 another perf_fuzzer lockup
  2014-10-17 14:19       ` Vince Weaver
@ 2014-10-17 15:21         ` Vince Weaver
  2014-10-20  4:48           ` Vince Weaver
  0 siblings, 1 reply; 9+ messages in thread
From: Vince Weaver @ 2014-10-17 15:21 UTC (permalink / raw)
  To: Vince Weaver
  Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo

On Fri, 17 Oct 2014, Vince Weaver wrote:

> Now to find out why this could happen.  Probably something to do with 
> crazy RCU magic :(

it looks like there's an unbalanced get_ctx() / put_ctx() here, as the
software event context on the main process should not get decremented
to 0 unless that process is exiting, yet it happens.

Maybe this is bisectable.  Hmmm.

[  106.781177] VMW: using pid 2941
[  127.216558] ------------[ cut here ]------------

And here's where ctx->refcount gets decremented to 0.

[  127.221237] WARNING: CPU: 0 PID: 2941 at kernel/events/core.c:905 put_ctx+0x57/0x8e()
[  127.256799] CPU: 0 PID: 2941 Comm: perf_fuzzer Not tainted 3.17.0+ #97
[  127.263372] Hardware name: AOpen   DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015  10/19/2012
[  127.272289]  0000000000000009 ffff8800cb107d98 ffffffff81530f3c 000000000000249e
[  127.279954]  0000000000000000 ffff8800cb107dd8 ffffffff8104005d ffff8800cae4b750
[  127.287621]  ffffffff810cf819 ffff8800cbb26400 ffff8800cae4b000 ffff8800cbb26410
[  127.295285] Call Trace:
[  127.297789]  [<ffffffff81530f3c>] dump_stack+0x46/0x58
[  127.302980]  [<ffffffff8104005d>] warn_slowpath_common+0x81/0x9b
[  127.309036]  [<ffffffff810cf819>] ? put_ctx+0x57/0x8e
[  127.314134]  [<ffffffff8104011a>] warn_slowpath_null+0x1a/0x1c
[  127.320022]  [<ffffffff810cf819>] put_ctx+0x57/0x8e
[  127.324957]  [<ffffffff810cf898>] __free_event+0x48/0x71
[  127.330326]  [<ffffffff8112bb01>] ? __d_free_external+0x29/0x4f
[  127.336298]  [<ffffffff810d1311>] _free_event+0xd6/0xdb
[  127.341585]  [<ffffffff810d13ee>] put_event+0xd8/0xe1
[  127.346693]  [<ffffffff810d141e>] perf_release+0x15/0x19
[  127.352062]  [<ffffffff8111cd7d>] __fput+0xf1/0x1a6
[  127.356994]  [<ffffffff8111ce6a>] ____fput+0xe/0x10
[  127.361931]  [<ffffffff81055402>] task_work_run+0x83/0x9a
[  127.367389]  [<ffffffff810029ca>] do_notify_resume+0x5a/0x61
[  127.373106]  [<ffffffff81536720>] int_signal+0x12/0x17
[  127.378300] ---[ end trace 8508b4f6a48d2f87 ]---

and here a little later is when we try to add a new software event
but it gets infinitely stuck.

[  127.385717] VMW: task->perf_event_ctxp[1]=ffff8800cbb26400, EAGAIN, ref=1
[  127.392566] VMW: pmu->type=1 type=1 config=8 pid=2941


Vince


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: perf: 3.17 another perf_fuzzer lockup
  2014-10-17 15:21         ` Vince Weaver
@ 2014-10-20  4:48           ` Vince Weaver
  2014-10-20 14:48             ` Vince Weaver
  0 siblings, 1 reply; 9+ messages in thread
From: Vince Weaver @ 2014-10-20  4:48 UTC (permalink / raw)
  To: Vince Weaver
  Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo

On Fri, 17 Oct 2014, Vince Weaver wrote:

> On Fri, 17 Oct 2014, Vince Weaver wrote:
> 
> > Now to find out why this could happen.  Probably something to do with 
> > crazy RCU magic :(
> 
> it looks like there's an unbalanced get_ctx() / put_ctx() here, as the
> software event context on the main process should not get decremented
> to 0 unless that process is exiting, yet it happens.
> 
> Maybe this is bisectable.  Hmmm.

nope.  Wasted 3 days trying to bisect this.  Something during the 3.17 
merge window changed enough to make the perf_fuzzer results 
non-deterministic, and that makes it nearly impossibe to reproduce the
issue properly.  Ugh.  I might be reduced to sprinkling printks around 
unless anyone chimes in with better advice.

Vince

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: perf: 3.17 another perf_fuzzer lockup
  2014-10-20  4:48           ` Vince Weaver
@ 2014-10-20 14:48             ` Vince Weaver
  0 siblings, 0 replies; 9+ messages in thread
From: Vince Weaver @ 2014-10-20 14:48 UTC (permalink / raw)
  To: Vince Weaver
  Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo

On Mon, 20 Oct 2014, Vince Weaver wrote:

> nope.  Wasted 3 days trying to bisect this.  Something during the 3.17 
> merge window changed enough to make the perf_fuzzer results 
> non-deterministic, and that makes it nearly impossibe to reproduce the
> issue properly.  Ugh.  I might be reduced to sprinkling printks around 
> unless anyone chimes in with better advice.

yes, so 3 different bisects from 3 different starting points and it 
bisects each time to:

	commit 29b88e23a9212136d39b0161a39afe587d0170a5
	Merge: 2521129 4e3a25b
	Author: Linus Torvalds <torvalds@linux-foundation.org>
	Date:   Mon Aug 4 18:34:04 2014 -0700

    	Merge tag 'driver-core-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/dri
    
	Pull driver core updates from Greg KH:
		"Here's the big driver-core pull request for 3.17-rc1.

which makes no sense at all :(

Vince

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2014-10-20 14:41 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-10-06 15:36 perf: 3.17 another perf_fuzzer lockup Vince Weaver
2014-10-06 15:55 ` Vince Weaver
2014-10-07  0:52   ` Chuck Ebbert
2014-10-15 18:34   ` Vince Weaver
2014-10-16 21:25     ` Vince Weaver
2014-10-17 14:19       ` Vince Weaver
2014-10-17 15:21         ` Vince Weaver
2014-10-20  4:48           ` Vince Weaver
2014-10-20 14:48             ` Vince Weaver

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox