* perf: 3.17 another perf_fuzzer lockup @ 2014-10-06 15:36 Vince Weaver 2014-10-06 15:55 ` Vince Weaver 0 siblings, 1 reply; 9+ messages in thread From: Vince Weaver @ 2014-10-06 15:36 UTC (permalink / raw) To: linux-kernel@vger.kernel.org Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo I reported this on the other thread but it seems to be unrelated to the other bug, and this time instead of simply wedging the process I got a backtrace and the machine locked solid. This is reproducible, on 3.17 on a core2. Trying to get something useful out of ftrace but no luck so far. Vince [ 843.700042] general protection fault: 0000 [#1] SMP [ 843.704001] Modules linked in: cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative f71882fg mcs7830 evdev usbnet coretemp ohci_pci pcspkr ohci_hcd psmouse serio_raw video wmi i2c_nforce2 button acpi_cpufreq processor thermal_sys sg ehci_pci ehci_hcd sd_mod usbcore usb_common [ 843.704001] CPU: 1 PID: 7273 Comm: perf_fuzzer Not tainted 3.17.0+ #86 [ 843.704001] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015 10/19/2012 [ 843.704001] task: ffff88011a874000 ti: ffff8800bc0ec000 task.ti: ffff8800bc0ec000 [ 843.704001] RIP: 0010:[<ffffffff810cd913>] [<ffffffff810cd913>] perf_event_context_sched_in.isra.75+0x1f/0x90 [ 843.704001] RSP: 0018:ffff8800bc0efd50 EFLAGS: 00010087 [ 843.704001] RAX: ffffea0002ba2d68 RBX: ffff1001e742c000 RCX: 000000000000038f [ 843.704001] RDX: ffff88011fc95b30 RSI: ffff880037d0eb00 RDI: ffff880037d0e700 [ 843.704001] RBP: ffff8800bc0efd60 R08: ffff8800bc0ec000 R09: 000000000000baff [ 843.704001] R10: 0000000000000006 R11: 00000000000009bc R12: ffff880037d0e700 [ 843.704001] R13: ffff8800c944f400 R14: 0000000000000001 R15: ffff88011b340800 [ 843.704001] FS: 00007ffc76c17700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000 [ 843.704001] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 843.704001] DR0: 0000000001c7b000 DR1: 0000000000000000 DR2: 0000000001c7b000 [ 843.704001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 [ 843.704001] Stack: [ 843.704001] ffff88011a874000 ffff88011b340800 ffff8800bc0efd90 ffffffff810cd9bb [ 843.704001] ffff88011a8744e8 ffff88011b340800 ffff88011fc929c0 ffff8800c944f400 [ 843.704001] ffff8800bc0efdc0 ffffffff8105ae62 ffff88011fc929c0 ffff8800c944f400 [ 843.704001] Call Trace: [ 843.704001] [<ffffffff810cd9bb>] __perf_event_task_sched_in+0x37/0xf4 [ 843.704001] [<ffffffff8105ae62>] finish_task_switch+0x9b/0xa6 [ 843.704001] [<ffffffff815209c1>] __schedule+0x309/0x4a5 [ 843.704001] [<ffffffff81520df3>] _cond_resched+0x28/0x3b [ 843.704001] [<ffffffff815217b5>] mutex_lock+0x12/0x2f [ 843.704001] [<ffffffff810cb9bb>] find_get_context+0xfc/0x170 [ 843.704001] [<ffffffff810cff77>] SYSC_perf_event_open+0x47b/0x7f5 [ 843.704001] [<ffffffff810d0651>] SyS_perf_event_open+0xe/0x10 [ 843.704001] [<ffffffff81523ad1>] tracesys+0xd4/0xd9 [ 843.704001] Code: 89 e7 e8 65 fe ff ff 5b 41 5c 5d c3 e8 c7 7e 45 00 55 48 89 e5 41 54 49 89 fc 53 48 8b 07 48 8b 58 40 65 48 03 1c 25 08 ce 00 00 <48> 39 bb d8 00 00 00 74 63 48 89 fe 48 89 df e8 f0 b4 ff ff 49 [ 843.704001] RIP [<ffffffff810cd913>] perf_event_context_sched_in.isra.75+0x1f/0x90 [ 843.704001] RSP <ffff8800bc0efd50> [ 843.704001] ---[ end trace 24a30c0f4ce68693 ]--- [ 843.960002] ------------[ cut here ]------------ [ 843.960002] WARNING: CPU: 0 PID: 2018 at kernel/watchdog.c:267 watchdog_overflow_callback+0x9b/0xa7() [ 843.960002] Watchdog detected hard LOCKUP on cpu 0 [ 843.960002] Modules linked in: cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative f71882fg mcs7830 evdev usbnet coretemp ohci_pci pcspkr ohci_hcd psmouse serio_raw video wmi i2c_nforce2 button acpi_cpufreq processor thermal_sys sg ehci_pci ehci_hcd sd_mod usbcore usb_common [ 843.960002] CPU: 0 PID: 2018 Comm: rsyslogd Tainted: G D 3.17.0+ #86 [ 843.960002] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015 10/19/2012 [ 843.960002] 0000000000000000 ffff88011fc06c20 ffffffff8151e1be ffff88011fc06c68 [ 843.960002] ffff88011fc06c58 ffffffff8103ee43 ffffffff810a6bd6 ffff88011b349800 [ 843.960002] 0000000000000000 ffff88011fc06d70 ffff88011fc06ef8 ffff88011fc06cb8 [ 843.960002] Call Trace: [ 843.960002] <NMI> [<ffffffff8151e1be>] dump_stack+0x45/0x56 [ 843.960002] [<ffffffff8103ee43>] warn_slowpath_common+0x7f/0x98 [ 843.960002] [<ffffffff810a6bd6>] ? watchdog_overflow_callback+0x9b/0xa7 [ 843.960002] [<ffffffff8103eea8>] warn_slowpath_fmt+0x4c/0x4e [ 843.960002] [<ffffffff81525821>] ? ftrace_call+0x5/0x2f [ 843.960002] [<ffffffff81525821>] ? ftrace_call+0x5/0x2f [ 843.960002] [<ffffffff810a6bd6>] watchdog_overflow_callback+0x9b/0xa7 [ 843.960002] [<ffffffff810cec50>] __perf_event_overflow+0x139/0x1bd [ 843.960002] [<ffffffff810cf14b>] perf_event_overflow+0x19/0x1b [ 843.960002] [<ffffffff8101981a>] intel_pmu_handle_irq+0x29d/0x323 [ 843.960002] [<ffffffff81525821>] ? ftrace_call+0x5/0x2f [ 843.960002] [<ffffffff81012da8>] perf_event_nmi_handler+0x25/0x3e [ 843.960002] [<ffffffff81005e88>] nmi_handle+0x60/0xf3 [ 843.960002] [<ffffffff8102b34d>] ? cpumask_clear_cpu.constprop.3+0x11/0x11 [ 843.960002] [<ffffffff810060df>] default_do_nmi+0x50/0xdc [ 843.960002] [<ffffffff810061cf>] do_nmi+0x64/0xa6 [ 843.960002] [<ffffffff8152572a>] end_repeat_nmi+0x1e/0x2e [ 843.960002] [<ffffffff8106a531>] ? do_raw_spin_lock+0x1c/0x27 [ 843.960002] [<ffffffff8106a531>] ? do_raw_spin_lock+0x1c/0x27 [ 843.960002] [<ffffffff8106a531>] ? do_raw_spin_lock+0x1c/0x27 [ 843.960002] <<EOE>> <IRQ> [<ffffffff8152350f>] _raw_spin_lock+0xe/0x10 [ 843.960002] [<ffffffff8106326d>] load_balance+0x2bf/0x63d [ 843.960002] [<ffffffff8105e0cf>] ? try_to_wake_up+0x1c4/0x1d4 [ 843.960002] [<ffffffff81063a5b>] rebalance_domains+0x144/0x205 [ 843.960002] [<ffffffff81063b59>] run_rebalance_domains+0x3d/0x15c [ 843.960002] [<ffffffff81041bd4>] __do_softirq+0xde/0x238 [ 843.960002] [<ffffffff81041ea3>] irq_exit+0x3f/0x86 [ 843.960002] [<ffffffff8102a4e1>] smp_apic_timer_interrupt+0x35/0x41 [ 843.960002] [<ffffffff8152480a>] apic_timer_interrupt+0x6a/0x70 [ 843.960002] <EOI> [ 843.960002] ---[ end trace 24a30c0f4ce68694 ]--- [ 843.957990] ------------[ cut here ]------------ [ 843.957990] WARNING: CPU: 1 PID: 7273 at kernel/watchdog.c:267 watchdog_overflow_callback+0x9b/0xa7() [ 843.957990] Watchdog detected hard LOCKUP on cpu 1 [ 843.957990] Modules linked in: cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative f71882fg mcs7830 evdev usbnet coretemp ohci_pci pcspkr ohci_hcd psmouse serio_raw video wmi i2c_nforce2 button acpi_cpufreq processor thermal_sys sg ehci_pci ehci_hcd sd_mod usbcore usb_common [ 843.957990] CPU: 1 PID: 7273 Comm: perf_fuzzer Tainted: G D W 3.17.0+ #86 [ 843.957990] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015 10/19/2012 [ 843.957990] 0000000000000000 ffff88011fc86c20 ffffffff8151e1be ffff88011fc86c68 [ 843.957990] ffff88011fc86c58 ffffffff8103ee43 ffffffff810a6bd6 ffff88011ac2ac00 [ 843.957990] 0000000000000001 ffff88011fc86d70 ffff88011fc86ef8 ffff88011fc86cb8 [ 843.957990] Call Trace: [ 843.957990] <NMI> [<ffffffff8151e1be>] dump_stack+0x45/0x56 [ 843.957990] [<ffffffff8103ee43>] warn_slowpath_common+0x7f/0x98 [ 843.957990] [<ffffffff810a6bd6>] ? watchdog_overflow_callback+0x9b/0xa7 [ 843.957990] [<ffffffff8103eea8>] warn_slowpath_fmt+0x4c/0x4e [ 843.957990] [<ffffffff81525821>] ? ftrace_call+0x5/0x2f [ 843.957990] [<ffffffff810a6bd6>] watchdog_overflow_callback+0x9b/0xa7 [ 843.957990] [<ffffffff810cec50>] __perf_event_overflow+0x139/0x1bd [ 843.957990] [<ffffffff810cf14b>] perf_event_overflow+0x19/0x1b [ 843.957990] [<ffffffff8101981a>] intel_pmu_handle_irq+0x29d/0x323 [ 843.957990] [<ffffffff81525821>] ? ftrace_call+0x5/0x2f [ 843.957990] [<ffffffff81012da8>] perf_event_nmi_handler+0x25/0x3e [ 843.957990] [<ffffffff81005e88>] nmi_handle+0x60/0xf3 [ 843.957990] [<ffffffff8102b34d>] ? cpumask_clear_cpu.constprop.3+0x11/0x11 [ 843.957990] [<ffffffff8108d3f0>] ? tick_sched_do_timer+0x2f/0x2f [ 843.957990] [<ffffffff810060df>] default_do_nmi+0x50/0xdc [ 843.957990] [<ffffffff810061cf>] do_nmi+0x64/0xa6 [ 843.957990] [<ffffffff8152572a>] end_repeat_nmi+0x1e/0x2e [ 843.957990] [<ffffffff8108d3f0>] ? tick_sched_do_timer+0x2f/0x2f [ 843.957990] [<ffffffff8106a538>] ? do_raw_spin_lock+0x23/0x27 [ 843.957990] [<ffffffff8106a538>] ? do_raw_spin_lock+0x23/0x27 [ 843.957990] [<ffffffff8106a538>] ? do_raw_spin_lock+0x23/0x27 [ 843.957990] <<EOE>> <IRQ> [<ffffffff8152350f>] _raw_spin_lock+0xe/0x10 [ 843.957990] [<ffffffff8105d52c>] scheduler_tick+0x3d/0x85 [ 843.957990] [<ffffffff8108145a>] update_process_times+0x56/0x65 [ 843.957990] [<ffffffff8108cf19>] tick_sched_handle+0x4a/0x59 [ 843.957990] [<ffffffff8108d42a>] tick_sched_timer+0x3a/0x58 [ 843.957990] [<ffffffff810818c7>] __run_hrtimer+0x92/0x14b [ 843.957990] [<ffffffff810820a7>] hrtimer_interrupt+0xe8/0x1d3 [ 843.957990] [<ffffffff8102a189>] local_apic_timer_interrupt+0x54/0x57 [ 843.957990] [<ffffffff8102a4dc>] smp_apic_timer_interrupt+0x30/0x41 [ 843.957990] [<ffffffff8152480a>] apic_timer_interrupt+0x6a/0x70 [ 843.957990] <EOI> [<ffffffff8107439d>] ? console_unlock+0x30d/0x32b [ 843.957990] [<ffffffff810975d5>] ? acct_collect+0x164/0x16e [ 843.957990] [<ffffffff8103fed7>] do_exit+0x21c/0x8fe [ 843.957990] [<ffffffff8151c536>] ? printk+0x54/0x56 [ 843.957990] [<ffffffff81005685>] oops_end+0xa6/0xad [ 843.957990] [<ffffffff810059cf>] die+0x5a/0x63 [ 843.957990] [<ffffffff81002e7f>] do_general_protection+0x96/0x142 [ 843.957990] [<ffffffff815253c2>] general_protection+0x22/0x30 [ 843.957990] [<ffffffff810cd913>] ? perf_event_context_sched_in.isra.75+0x1f/0x90 [ 843.957990] [<ffffffff810cd9bb>] __perf_event_task_sched_in+0x37/0xf4 [ 843.957990] [<ffffffff8105ae62>] finish_task_switch+0x9b/0xa6 [ 843.957990] [<ffffffff815209c1>] __schedule+0x309/0x4a5 [ 843.957990] [<ffffffff81520df3>] _cond_resched+0x28/0x3b [ 843.957990] [<ffffffff815217b5>] mutex_lock+0x12/0x2f [ 843.957990] [<ffffffff810cb9bb>] find_get_context+0xfc/0x170 [ 843.957990] [<ffffffff810cff77>] SYSC_perf_event_open+0x47b/0x7f5 [ 843.957990] [<ffffffff810d0651>] SyS_perf_event_open+0xe/0x10 [ 843.957990] [<ffffffff81523ad1>] tracesys+0xd4/0xd9 [ 843.957990] ---[ end trace 24a30c0f4ce68695 ]--- ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: perf: 3.17 another perf_fuzzer lockup 2014-10-06 15:36 perf: 3.17 another perf_fuzzer lockup Vince Weaver @ 2014-10-06 15:55 ` Vince Weaver 2014-10-07 0:52 ` Chuck Ebbert 2014-10-15 18:34 ` Vince Weaver 0 siblings, 2 replies; 9+ messages in thread From: Vince Weaver @ 2014-10-06 15:55 UTC (permalink / raw) To: Vince Weaver Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo On Mon, 6 Oct 2014, Vince Weaver wrote: > [ 843.700042] general protection fault: 0000 [#1] SMP > ... > [ 843.704001] task: ffff88011a874000 ti: ffff8800bc0ec000 task.ti: ffff8800bc0ec000 > [ 843.704001] RIP: 0010:[<ffffffff810cd913>] [<ffffffff810cd913>] perf_event_context_sched_in.isra.75+0x1f/0x90 For what it's worth, this is kernel/events/core.c:2646 if (atomic_read(&__get_cpu_var(perf_cgroup_events))) perf_cgroup_sched_in(prev, task); ffffffff810cd902: 53 push %rbx ffffffff810cd903: 48 8b 07 mov (%rdi),%rax ffffffff810cd906: 48 8b 58 40 mov 0x40(%rax),%rbx ffffffff810cd90a: 65 48 03 1c 25 08 ce add %gs:0xce08,%rbx ffffffff810cd911: 00 00 ffffffff810cd913: 48 39 bb d8 00 00 00 cmp %rdi,0xd8(%rbx) ffffffff810cd91a: 74 63 je ffffffff810cd97f <perf_event_context_sched_in.isra.75+0x8b> > [ 843.704001] RSP: 0018:ffff8800bc0efd50 EFLAGS: 00010087 > [ 843.704001] RAX: ffffea0002ba2d68 RBX: ffff1001e742c000 RCX: 000000000000038f > [ 843.704001] RDX: ffff88011fc95b30 RSI: ffff880037d0eb00 RDI: ffff880037d0e700 > [ 843.704001] RBP: ffff8800bc0efd60 R08: ffff8800bc0ec000 R09: 000000000000baff > [ 843.704001] R10: 0000000000000006 R11: 00000000000009bc R12: ffff880037d0e700 > [ 843.704001] R13: ffff8800c944f400 R14: 0000000000000001 R15: ffff88011b340800 > [ 843.704001] FS: 00007ffc76c17700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000 > [ 843.704001] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > [ 843.704001] DR0: 0000000001c7b000 DR1: 0000000000000000 DR2: 0000000001c7b000 > [ 843.704001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 > [ 843.704001] Stack: > [ 843.704001] ffff88011a874000 ffff88011b340800 ffff8800bc0efd90 ffffffff810cd9bb > [ 843.704001] ffff88011a8744e8 ffff88011b340800 ffff88011fc929c0 ffff8800c944f400 > [ 843.704001] ffff8800bc0efdc0 ffffffff8105ae62 ffff88011fc929c0 ffff8800c944f400 > [ 843.704001] Call Trace: > [ 843.704001] [<ffffffff810cd9bb>] __perf_event_task_sched_in+0x37/0xf4 > [ 843.704001] [<ffffffff8105ae62>] finish_task_switch+0x9b/0xa6 > [ 843.704001] [<ffffffff815209c1>] __schedule+0x309/0x4a5 > [ 843.704001] [<ffffffff81520df3>] _cond_resched+0x28/0x3b > [ 843.704001] [<ffffffff815217b5>] mutex_lock+0x12/0x2f > [ 843.704001] [<ffffffff810cb9bb>] find_get_context+0xfc/0x170 > [ 843.704001] [<ffffffff810cff77>] SYSC_perf_event_open+0x47b/0x7f5 > [ 843.704001] [<ffffffff810d0651>] SyS_perf_event_open+0xe/0x10 > [ 843.704001] [<ffffffff81523ad1>] tracesys+0xd4/0xd9 > [ 843.704001] Code: 89 e7 e8 65 fe ff ff 5b 41 5c 5d c3 e8 c7 7e 45 00 55 48 89 e5 41 54 49 89 fc 53 48 8b 07 48 8b 58 40 65 48 03 1c 25 08 ce 00 00 <48> 39 bb d8 00 00 00 74 63 48 89 fe 48 89 df e8 f0 b4 ff ff 49 > [ 843.704001] RIP [<ffffffff810cd913>] perf_event_context_sched_in.isra.75+0x1f/0x90 > [ 843.704001] RSP <ffff8800bc0efd50> ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: perf: 3.17 another perf_fuzzer lockup 2014-10-06 15:55 ` Vince Weaver @ 2014-10-07 0:52 ` Chuck Ebbert 2014-10-15 18:34 ` Vince Weaver 1 sibling, 0 replies; 9+ messages in thread From: Chuck Ebbert @ 2014-10-07 0:52 UTC (permalink / raw) To: Vince Weaver Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo On Mon, 6 Oct 2014 11:55:11 -0400 (EDT) Vince Weaver <vincent.weaver@maine.edu> wrote: > On Mon, 6 Oct 2014, Vince Weaver wrote: > > > [ 843.700042] general protection fault: 0000 [#1] SMP > > ... > > [ 843.704001] task: ffff88011a874000 ti: ffff8800bc0ec000 task.ti: ffff8800bc0ec000 > > [ 843.704001] RIP: 0010:[<ffffffff810cd913>] [<ffffffff810cd913>] perf_event_context_sched_in.isra.75+0x1f/0x90 > > For what it's worth, this is > > kernel/events/core.c:2646 > > if (atomic_read(&__get_cpu_var(perf_cgroup_events))) > perf_cgroup_sched_in(prev, task); > > > ffffffff810cd902: 53 push %rbx > ffffffff810cd903: 48 8b 07 mov (%rdi),%rax > ffffffff810cd906: 48 8b 58 40 mov 0x40(%rax),%rbx > ffffffff810cd90a: 65 48 03 1c 25 08 ce add %gs:0xce08,%rbx > ffffffff810cd911: 00 00 > ffffffff810cd913: 48 39 bb d8 00 00 00 cmp %rdi,0xd8(%rbx) > ffffffff810cd91a: 74 63 je ffffffff810cd97f <perf_event_context_sched_in.isra.75+0x8b> > Actually it's: static void perf_event_context_sched_in(struct perf_event_context *ctx, struct task_struct *task) { struct perf_cpu_context *cpuctx; cpuctx = __get_cpu_context(ctx); if (cpuctx->task_ctx == ctx) <======= oops return; cpuctx is in %rbx (=ffff1001e742c000) and that's not even a legal address, which is what caused the general protection fault > > > [ 843.704001] RSP: 0018:ffff8800bc0efd50 EFLAGS: 00010087 > > [ 843.704001] RAX: ffffea0002ba2d68 RBX: ffff1001e742c000 RCX: 000000000000038f > > [ 843.704001] RDX: ffff88011fc95b30 RSI: ffff880037d0eb00 RDI: ffff880037d0e700 > > [ 843.704001] RBP: ffff8800bc0efd60 R08: ffff8800bc0ec000 R09: 000000000000baff > > [ 843.704001] R10: 0000000000000006 R11: 00000000000009bc R12: ffff880037d0e700 > > [ 843.704001] R13: ffff8800c944f400 R14: 0000000000000001 R15: ffff88011b340800 > > [ 843.704001] FS: 00007ffc76c17700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000 > > [ 843.704001] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > > [ 843.704001] DR0: 0000000001c7b000 DR1: 0000000000000000 DR2: 0000000001c7b000 > > [ 843.704001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 > > [ 843.704001] Stack: > > [ 843.704001] ffff88011a874000 ffff88011b340800 ffff8800bc0efd90 ffffffff810cd9bb > > [ 843.704001] ffff88011a8744e8 ffff88011b340800 ffff88011fc929c0 ffff8800c944f400 > > [ 843.704001] ffff8800bc0efdc0 ffffffff8105ae62 ffff88011fc929c0 ffff8800c944f400 > > [ 843.704001] Call Trace: > > [ 843.704001] [<ffffffff810cd9bb>] __perf_event_task_sched_in+0x37/0xf4 > > [ 843.704001] [<ffffffff8105ae62>] finish_task_switch+0x9b/0xa6 > > [ 843.704001] [<ffffffff815209c1>] __schedule+0x309/0x4a5 > > [ 843.704001] [<ffffffff81520df3>] _cond_resched+0x28/0x3b > > [ 843.704001] [<ffffffff815217b5>] mutex_lock+0x12/0x2f > > [ 843.704001] [<ffffffff810cb9bb>] find_get_context+0xfc/0x170 > > [ 843.704001] [<ffffffff810cff77>] SYSC_perf_event_open+0x47b/0x7f5 > > [ 843.704001] [<ffffffff810d0651>] SyS_perf_event_open+0xe/0x10 > > [ 843.704001] [<ffffffff81523ad1>] tracesys+0xd4/0xd9 > > [ 843.704001] Code: 89 e7 e8 65 fe ff ff 5b 41 5c 5d c3 e8 c7 7e 45 00 55 48 89 e5 41 54 49 89 fc 53 48 8b 07 48 8b 58 40 65 48 03 1c 25 08 ce 00 00 <48> 39 bb d8 00 00 00 74 63 48 89 fe 48 89 df e8 f0 b4 ff ff 49 > > [ 843.704001] RIP [<ffffffff810cd913>] perf_event_context_sched_in.isra.75+0x1f/0x90 > > [ 843.704001] RSP <ffff8800bc0efd50> ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: perf: 3.17 another perf_fuzzer lockup 2014-10-06 15:55 ` Vince Weaver 2014-10-07 0:52 ` Chuck Ebbert @ 2014-10-15 18:34 ` Vince Weaver 2014-10-16 21:25 ` Vince Weaver 1 sibling, 1 reply; 9+ messages in thread From: Vince Weaver @ 2014-10-15 18:34 UTC (permalink / raw) To: Vince Weaver Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo OK, so it turns out that the oops I saw with memory corruption wasn't the bug I was tracking, but something that comes up sometimes when trying to run ftrace at the same time as fuzzing. So we'll leave that for another day. The 3.17+ lockup I am tracking still reproduces as of git from yesterday (even after the 3.18-rc perf_event merges). I can use sysrq to get the stack trace, the one CPU is stuck in a call to find_get_context(). An example backtrace: [88200.300003] <EOI> [88200.300003] [<ffffffff81114869>] ? ____cache_alloc+0x130/0x25b [88200.300003] [<ffffffff8107fb05>] ? __call_rcu.constprop.63+0x1bf/0x1cb [88200.300003] [<ffffffff8107fb2b>] kfree_call_rcu+0x1a/0x1c [88200.300003] [<ffffffff810cf84f>] put_ctx+0x51/0x55 [88200.300003] [<ffffffff810d1840>] find_get_context+0x166/0x195 [88200.300003] [<ffffffff810d5856>] SYSC_perf_event_open+0x47b/0x7f5 [88200.300003] [<ffffffff810d5f55>] SyS_perf_event_open+0xe/0x10 [88200.300003] [<ffffffff815362d6>] system_call_fastpath+0x16/0x1b It looks like the else if (task->perf_event_ctxp[ctxn]) err = -EAGAIN; case is triggering non-stop in the retry path of find_get_context() and so the kernel gets stuck forever retrying. I can drop some printks in if it will help debug. I've tried running ftrace, but for whatever reason if I enable ftrace the bug won't trigger. Vince ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: perf: 3.17 another perf_fuzzer lockup 2014-10-15 18:34 ` Vince Weaver @ 2014-10-16 21:25 ` Vince Weaver 2014-10-17 14:19 ` Vince Weaver 0 siblings, 1 reply; 9+ messages in thread From: Vince Weaver @ 2014-10-16 21:25 UTC (permalink / raw) To: Vince Weaver Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo On Wed, 15 Oct 2014, Vince Weaver wrote: > I can use sysrq to get the stack trace, the one CPU is stuck in a call > to find_get_context(). > > An example backtrace: > > [88200.300003] <EOI> > [88200.300003] [<ffffffff81114869>] ? ____cache_alloc+0x130/0x25b > [88200.300003] [<ffffffff8107fb05>] ? __call_rcu.constprop.63+0x1bf/0x1cb > [88200.300003] [<ffffffff8107fb2b>] kfree_call_rcu+0x1a/0x1c > [88200.300003] [<ffffffff810cf84f>] put_ctx+0x51/0x55 > [88200.300003] [<ffffffff810d1840>] find_get_context+0x166/0x195 > [88200.300003] [<ffffffff810d5856>] SYSC_perf_event_open+0x47b/0x7f5 > [88200.300003] [<ffffffff810d5f55>] SyS_perf_event_open+0xe/0x10 > [88200.300003] [<ffffffff815362d6>] system_call_fastpath+0x16/0x1b > > It looks like the > else if (task->perf_event_ctxp[ctxn]) > err = -EAGAIN; It is indeed stuck there, waiting for task->perf_event_ctxp[1] to get set to zero, which never happens. As far as I can tell it's when a Software event is being opened. Still struggling through the code trying to figure out what's going on. [ 7071.252607] VMW: task->perf_event_ctxp[1]=ffff8800cb12ec00, EAGAIN, ref=1 [ 7071.259439] VMW: type=1 config=8 [ 7071.262713] VMW: task->perf_event_ctxp[1]=ffff8800cb12ec00, EAGAIN, ref=1 [ 7071.269506] VMW: task->perf_event_ctxp[1]=ffff8800cb12ec00, EAGAIN, ref=1 [ 7071.276299] VMW: task->perf_event_ctxp[1]=ffff8800cb12ec00, EAGAIN, ref=1 [ 7071.283087] VMW: task->perf_event_ctxp[1]=ffff8800cb12ec00, EAGAIN, ref=1 [ 7071.289879] VMW: task->perf_event_ctxp[1]=ffff8800cb12ec00, EAGAIN, ref=1 [ 7071.296671] VMW: task->perf_event_ctxp[1]=ffff8800cb12ec00, EAGAIN, ref=1 [ 7071.303457] VMW: task->perf_event_ctxp[1]=ffff8800cb12ec00, EAGAIN, ref=1 [ 7071.310248] VMW: task->perf_event_ctxp[1]=ffff8800cb12ec00, EAGAIN, ref=1 [ 7071.317035] VMW: task->perf_event_ctxp[1]=ffff8800cb12ec00, EAGAIN, ref=1 [ 7076.256032] find_get_context: 7246310 callbacks suppressed Vince ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: perf: 3.17 another perf_fuzzer lockup 2014-10-16 21:25 ` Vince Weaver @ 2014-10-17 14:19 ` Vince Weaver 2014-10-17 15:21 ` Vince Weaver 0 siblings, 1 reply; 9+ messages in thread From: Vince Weaver @ 2014-10-17 14:19 UTC (permalink / raw) To: Vince Weaver Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo On Thu, 16 Oct 2014, Vince Weaver wrote: > > It looks like the > > else if (task->perf_event_ctxp[ctxn]) > > err = -EAGAIN; > > It is indeed stuck there, waiting for task->perf_event_ctxp[1] to get > set to zero, which never happens. > OK, so with some more printk()s, it looks like somehow the parent thread is trying to open a software event on itself. task->perf_event_ctxp[1] has a valid pointer, but the ctx it points to has a ctx->lock of 0. So perf_lock_task_context() always returns NULL. So in find_get_context() we get stuck in an infinite retry loop, waiting forever for either ctx->lock to go positive or for task->perf_event_ctxp[1] to go NULL, neither of which is going to happen. Now to find out why this could happen. Probably something to do with crazy RCU magic :( Vince ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: perf: 3.17 another perf_fuzzer lockup 2014-10-17 14:19 ` Vince Weaver @ 2014-10-17 15:21 ` Vince Weaver 2014-10-20 4:48 ` Vince Weaver 0 siblings, 1 reply; 9+ messages in thread From: Vince Weaver @ 2014-10-17 15:21 UTC (permalink / raw) To: Vince Weaver Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo On Fri, 17 Oct 2014, Vince Weaver wrote: > Now to find out why this could happen. Probably something to do with > crazy RCU magic :( it looks like there's an unbalanced get_ctx() / put_ctx() here, as the software event context on the main process should not get decremented to 0 unless that process is exiting, yet it happens. Maybe this is bisectable. Hmmm. [ 106.781177] VMW: using pid 2941 [ 127.216558] ------------[ cut here ]------------ And here's where ctx->refcount gets decremented to 0. [ 127.221237] WARNING: CPU: 0 PID: 2941 at kernel/events/core.c:905 put_ctx+0x57/0x8e() [ 127.256799] CPU: 0 PID: 2941 Comm: perf_fuzzer Not tainted 3.17.0+ #97 [ 127.263372] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015 10/19/2012 [ 127.272289] 0000000000000009 ffff8800cb107d98 ffffffff81530f3c 000000000000249e [ 127.279954] 0000000000000000 ffff8800cb107dd8 ffffffff8104005d ffff8800cae4b750 [ 127.287621] ffffffff810cf819 ffff8800cbb26400 ffff8800cae4b000 ffff8800cbb26410 [ 127.295285] Call Trace: [ 127.297789] [<ffffffff81530f3c>] dump_stack+0x46/0x58 [ 127.302980] [<ffffffff8104005d>] warn_slowpath_common+0x81/0x9b [ 127.309036] [<ffffffff810cf819>] ? put_ctx+0x57/0x8e [ 127.314134] [<ffffffff8104011a>] warn_slowpath_null+0x1a/0x1c [ 127.320022] [<ffffffff810cf819>] put_ctx+0x57/0x8e [ 127.324957] [<ffffffff810cf898>] __free_event+0x48/0x71 [ 127.330326] [<ffffffff8112bb01>] ? __d_free_external+0x29/0x4f [ 127.336298] [<ffffffff810d1311>] _free_event+0xd6/0xdb [ 127.341585] [<ffffffff810d13ee>] put_event+0xd8/0xe1 [ 127.346693] [<ffffffff810d141e>] perf_release+0x15/0x19 [ 127.352062] [<ffffffff8111cd7d>] __fput+0xf1/0x1a6 [ 127.356994] [<ffffffff8111ce6a>] ____fput+0xe/0x10 [ 127.361931] [<ffffffff81055402>] task_work_run+0x83/0x9a [ 127.367389] [<ffffffff810029ca>] do_notify_resume+0x5a/0x61 [ 127.373106] [<ffffffff81536720>] int_signal+0x12/0x17 [ 127.378300] ---[ end trace 8508b4f6a48d2f87 ]--- and here a little later is when we try to add a new software event but it gets infinitely stuck. [ 127.385717] VMW: task->perf_event_ctxp[1]=ffff8800cbb26400, EAGAIN, ref=1 [ 127.392566] VMW: pmu->type=1 type=1 config=8 pid=2941 Vince ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: perf: 3.17 another perf_fuzzer lockup 2014-10-17 15:21 ` Vince Weaver @ 2014-10-20 4:48 ` Vince Weaver 2014-10-20 14:48 ` Vince Weaver 0 siblings, 1 reply; 9+ messages in thread From: Vince Weaver @ 2014-10-20 4:48 UTC (permalink / raw) To: Vince Weaver Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo On Fri, 17 Oct 2014, Vince Weaver wrote: > On Fri, 17 Oct 2014, Vince Weaver wrote: > > > Now to find out why this could happen. Probably something to do with > > crazy RCU magic :( > > it looks like there's an unbalanced get_ctx() / put_ctx() here, as the > software event context on the main process should not get decremented > to 0 unless that process is exiting, yet it happens. > > Maybe this is bisectable. Hmmm. nope. Wasted 3 days trying to bisect this. Something during the 3.17 merge window changed enough to make the perf_fuzzer results non-deterministic, and that makes it nearly impossibe to reproduce the issue properly. Ugh. I might be reduced to sprinkling printks around unless anyone chimes in with better advice. Vince ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: perf: 3.17 another perf_fuzzer lockup 2014-10-20 4:48 ` Vince Weaver @ 2014-10-20 14:48 ` Vince Weaver 0 siblings, 0 replies; 9+ messages in thread From: Vince Weaver @ 2014-10-20 14:48 UTC (permalink / raw) To: Vince Weaver Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo On Mon, 20 Oct 2014, Vince Weaver wrote: > nope. Wasted 3 days trying to bisect this. Something during the 3.17 > merge window changed enough to make the perf_fuzzer results > non-deterministic, and that makes it nearly impossibe to reproduce the > issue properly. Ugh. I might be reduced to sprinkling printks around > unless anyone chimes in with better advice. yes, so 3 different bisects from 3 different starting points and it bisects each time to: commit 29b88e23a9212136d39b0161a39afe587d0170a5 Merge: 2521129 4e3a25b Author: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon Aug 4 18:34:04 2014 -0700 Merge tag 'driver-core-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/dri Pull driver core updates from Greg KH: "Here's the big driver-core pull request for 3.17-rc1. which makes no sense at all :( Vince ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2014-10-20 14:41 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2014-10-06 15:36 perf: 3.17 another perf_fuzzer lockup Vince Weaver 2014-10-06 15:55 ` Vince Weaver 2014-10-07 0:52 ` Chuck Ebbert 2014-10-15 18:34 ` Vince Weaver 2014-10-16 21:25 ` Vince Weaver 2014-10-17 14:19 ` Vince Weaver 2014-10-17 15:21 ` Vince Weaver 2014-10-20 4:48 ` Vince Weaver 2014-10-20 14:48 ` Vince Weaver
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox