* perf on 2.6.38-rc4 wedges my box

From: Jeff Moyer
Date: 2011-02-09 17:38 UTC
To: linux-kernel
Cc: Peter Zijlstra, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo

Hi,

I'm trying out willy's ata_ram driver [1], and in so doing have managed
to wedge my box while using perf record on an aio-stress run:

[root@metallica ~]# modprobe ata_ram capacity=2097152 preallocate=1
[root@metallica ~]# ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds
adding stage write
starting with write
file size 1024MB, record size 4KB, depth 32, ios per iteration 8
max io_submit 16, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
adding file /dev/sds thread 0
write on /dev/sds (621.30 MB/s) 1024.00 MB in 1.65s
thread 0 write totals (621.27 MB/s) 1024.00 MB in 1.65s
[root@metallica ~]# perf record -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds
adding stage write
starting with write
file size 1024MB, record size 4KB, depth 32, ios per iteration 8
max io_submit 16, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
adding file /dev/sds thread 0

and there it sits. On the console, I see:

NOHZ: local_softirq_pending 100
NOHZ: local_softirq_pending 100
NOHZ: local_softirq_pending 100
NOHZ: local_softirq_pending 100
NOHZ: local_softirq_pending 100

The number of messages varies, but this is the most I've seen (it
doesn't keep repeating). At this point, the machine does not respond to
pings. As I don't have physical access at the moment, I can't try
alt-sysrq, but I might be able to do that tomorrow. It's probably worth
noting that I've witnessed similar behavior with real devices, so it's
not just the ata_ram driver.

Any ideas on how to track this down?

Thanks!
Jeff

[1] http://git.kernel.org/?p=linux/kernel/git/willy/misc.git;a=shortlog;h=refs/heads/ata-ram
* Re: perf on 2.6.38-rc4 wedges my box

From: David Ahern
Date: 2011-02-09 17:47 UTC
To: Jeff Moyer
Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo

On 02/09/11 10:38, Jeff Moyer wrote:
> I'm trying out willy's ata_ram driver [1], and in so doing have managed
> to wedge my box while using perf record on an aio-stress run:
>
> [root@metallica ~]# modprobe ata_ram capacity=2097152 preallocate=1
> [root@metallica ~]# ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds
> [...]
> [root@metallica ~]# perf record -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds

Have you tried '-e cpu-clock' for S/W-based profiling vs. the default
H/W profiling? Add -v to see whether the fallback to S/W is happening
now.

David

> [...]
> and there it sits. On the console, I see:
>
> NOHZ: local_softirq_pending 100
> [...]
>
> The number of messages varies, but this is the most I've seen (it
> doesn't keep repeating). At this point, the machine does not respond
> to pings.
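For background on what '-e cpu-clock' switches over to: it selects the
software clock event at the perf_event_open(2) level, and software events
are serviced by timers rather than the hardware PMU. A minimal sketch of
the attribute setup follows — the helper function is illustrative only
(it is not perf's actual code), though the type/config constants are the
real ones from linux/perf_event.h:

```c
#include <assert.h>
#include <string.h>
#include <linux/perf_event.h>

/* Sketch: the attribute that `perf record -e cpu-clock` would hand to
 * perf_event_open(2).  A PERF_TYPE_SOFTWARE event never programs the
 * hardware PMU, which is why this thread's lockup follows cpu-clock. */
static struct perf_event_attr cpu_clock_attr(unsigned long freq_hz)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size        = sizeof(attr);
	attr.type        = PERF_TYPE_SOFTWARE;
	attr.config      = PERF_COUNT_SW_CPU_CLOCK;
	attr.freq        = 1;        /* interpret sample_freq as Hz */
	attr.sample_freq = freq_hz;  /* e.g. 1000 samples/sec */
	return attr;
}
```

The 1000 Hz figure in the usage below is just an example value, not a
claim about perf's exact default at the time.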
* Re: perf on 2.6.38-rc4 wedges my box

From: Jeff Moyer
Date: 2011-02-09 18:22 UTC
To: David Ahern
Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo

David Ahern <dsahern@gmail.com> writes:

> Have you tried '-e cpu-clock' for S/W based profiling vs the default H/W
> profiling? Add -v to see if the fallback to S/W is happening now.

Thanks for the suggestion, David. I tried:

# perf record -v ls
Warning: ... trying to fall back to cpu-clock-ticks
couldn't open /proc/-1/status
couldn't open /proc/-1/maps
[ls output]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.008 MB perf.data (~363 samples) ]

If I explicitly set '-e cpu-clock', then the output is the same, except
that the warning is gone. What's up with the /proc/-1/*?

Now, when running perf record -e cpu-clock on the aio-stress run,
unsurprisingly, I get the same result:

# perf record -e cpu-clock -v -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds
couldn't open /proc/-1/status
couldn't open /proc/-1/maps
adding stage write
starting with write
file size 1024MB, record size 4KB, depth 32, ios per iteration 8
max io_submit 16, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
adding file /dev/sds thread 0

and there it sits. In this case, however, I did not see the NOHZ
warnings on the console, and this time the machine is still responding
to ping, but nothing else.

Cheers,
Jeff
* Re: perf on 2.6.38-rc4 wedges my box

From: David Ahern
Date: 2011-02-09 20:12 UTC
To: Jeff Moyer
Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo

On 02/09/11 11:22, Jeff Moyer wrote:
> If I explicitly set '-e cpu-clock', then the output is the same,
> except that the warning is gone. What's up with the /proc/-1/*?

target_{pid,tid} are initialized to -1 in builtin-record.c. I believe
the tid version is making its way through the event__synthesize_xxx
code (event__synthesize_thread -> __event__synthesize_thread ->
event__synthesize_comm and event__synthesize_mmap_events).

> and there it sits. In this case, however, I did not see the NOHZ
> warnings on the console, and this time the machine is still responding
> to ping, but nothing else.

cpu-clock is handled through hrtimers, if that helps understand the
lockup.

David
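The '/proc/-1/*' paths are exactly what you get from formatting that
uninitialized -1 tid into a procfs path. A small userspace sketch of the
effect — proc_path() is a hypothetical stand-in for the formatting perf
does internally, not a function from the perf sources:

```c
#include <stdio.h>
#include <string.h>

/* Sketch: how a target tid left at -1 becomes the bogus
 * "/proc/-1/status" path that perf record then fails to open. */
static const char *proc_path(char *buf, size_t len, int tid,
			     const char *file)
{
	/* %d happily prints negative pids; nothing validates tid here */
	snprintf(buf, len, "/proc/%d/%s", tid, file);
	return buf;
}
```

With tid == -1 this yields "/proc/-1/status", matching the warning in
Jeff's -v output.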
* Re: perf on 2.6.38-rc4 wedges my box

From: Arnaldo Carvalho de Melo
Date: 2011-02-09 22:11 UTC
To: David Ahern
Cc: Jeff Moyer, linux-kernel, Peter Zijlstra, Ingo Molnar, Paul Mackerras

On Wed, Feb 09, 2011 at 01:12:22PM -0700, David Ahern wrote:
> target_{pid,tid} are initialized to -1 in builtin-record.c. I believe
> the tid version is making its way through the event__synthesize_xxx
> code (event__synthesize_thread -> __event__synthesize_thread ->
> event__synthesize_comm and event__synthesize_mmap_events).

Yes, I'm working on a patch; probably just this will fix it:

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 07f8d6d..dd27b9f 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -680,7 +680,7 @@ static int __cmd_record(int argc, const char **argv)
 					       perf_event__synthesize_guest_os);
 
 	if (!system_wide)
-		perf_event__synthesize_thread(target_tid,
+		perf_event__synthesize_thread(evsel_list->threads->map[0],
 					      process_synthesized_event,
 					      session);
 	else

That way it gets the child_pid, or the tid passed via --tid.

But thinking about it more, the correct fix is to pass the thread_map
(evsel_list->threads); that covers both the case of a single thread
(--tid) and a group of threads (--pid, where the thread_map will have
more than one tid). I'll get a patch done later.
* Re: perf on 2.6.38-rc4 wedges my box

From: Peter Zijlstra
Date: 2011-02-10 21:38 UTC
To: Jeff Moyer
Cc: linux-kernel, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo

On Wed, 2011-02-09 at 12:38 -0500, Jeff Moyer wrote:
> I'm trying out willy's ata_ram driver [1], and in so doing have managed
> to wedge my box while using perf record on an aio-stress run:
> [...]
> The number of messages varies, but this is the most I've seen (it
> doesn't keep repeating). At this point, the machine does not respond
> to pings. [...] It's probably worth noting that I've witnessed similar
> behavior with real devices, so it's not just the ata_ram driver.
>
> Any ideas on how to track this down?

So I took Linus' tree of about half an hour ago, added
git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc.git ata-ram
(fixed up some Kconfig/Makefile rejects), googled aio-stress
(http://fsbench.filesystems.org/bench/aio-stress.c), and set out to
reproduce the above.

Sadly, it all seems to work here; it's spending ~15% in
_raw_spin_lock_irq, which, when I use -g, breaks down like:

- 14.13% aio-stress [kernel.kallsyms] [k] _raw_spin_lock_irq
   - _raw_spin_lock_irq
      + 44.14% __make_request
      + 20.91% __aio_get_req
      + 10.15% aio_run_iocb
      +  7.37% do_io_submit
      +  6.55% scsi_request_fn
      +  5.48% generic_unplug_device
      +  3.58% aio_put_req
      +  0.92% generic_make_request
      +  0.91% __generic_unplug_device
* Re: perf on 2.6.38-rc4 wedges my box

From: David Ahern
Date: 2011-02-11 16:35 UTC
To: Peter Zijlstra, Jeff Moyer
Cc: linux-kernel, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo

On 02/10/11 14:38, Peter Zijlstra wrote:
> So I took linus' tree of about half an hour ago [...] and set out to
> reproduce the above..
>
> Sadly it all seems to work here, its spending ~15% in
> _raw_spin_lock_irq, which when I use -g looks to break down like:
> [...]

I'm guessing that in your case perf is using hardware cycles for
profiling.

I was able to reproduce the lockup in a VM, which uses cpu-clock for
profiling - like Jeff's case. The VM is running Fedora 14 with
2.6.38-rc4. In the host, one of qemu-kvm's threads is pegging the CPU.
A backtrace of the vcpus is given below. stop-bt-cont at various
intervals shows similar traces, and xtime is not advancing. The final
function changes from sample to sample (e.g., from __rb_rotate_left to
__rb_rotate_right).

David

[Switching to Thread 1]
(gdb) bt
#0  native_safe_halt () at /opt/kernel/src/arch/x86/include/asm/irqflags.h:50
#1  0xffffffff81009d76 in arch_safe_halt () at /opt/kernel/src/arch/x86/include/asm/paravirt.h:110
#2  default_idle () at /opt/kernel/src/arch/x86/kernel/process.c:381
#3  0xffffffff8100132a in cpu_idle () at /opt/kernel/src/arch/x86/kernel/process_64.c:139
#4  0xffffffff81392c89 in rest_init () at /opt/kernel/src/init/main.c:463
#5  0xffffffff81697c23 in start_kernel () at /opt/kernel/src/init/main.c:713
#6  0xffffffff816972af in x86_64_start_reservations (real_mode_data=<value optimized out>) at /opt/kernel/src/arch/x86/kernel/head64.c:127
#7  0xffffffff816973b9 in x86_64_start_kernel (real_mode_data=0x93950 <Address 0x93950 out of bounds>) at /opt/kernel/src/arch/x86/kernel/head64.c:97
#8  0x0000000000000000 in ?? ()

(gdb) thread 2
[Switching to thread 2 (Thread 2)]
#0  __rb_rotate_left (node=0xffff8800782dd8f0, root=0xffff88007fd0e358) at /opt/kernel/src/lib/rbtree.c:37
(gdb) bt
#0  __rb_rotate_left (node=0xffff8800782dd8f0, root=0xffff88007fd0e358) at /opt/kernel/src/lib/rbtree.c:37
#1  0xffffffff81206bcc in rb_insert_color (node=0xffff8800782dd8f0, root=0xffff88007fd0e358) at /opt/kernel/src/lib/rbtree.c:130
#2  0xffffffff81207cb5 in timerqueue_add (head=0xffff88007fd0e358, node=0xffff8800782dd8f0) at /opt/kernel/src/lib/timerqueue.c:56
#3  0xffffffff8105f595 in enqueue_hrtimer (timer=0xffff8800782dd8f0, base=0xffff88007fd0e348) at /opt/kernel/src/kernel/hrtimer.c:848
#4  0xffffffff8105fc75 in __hrtimer_start_range_ns (timer=0xffff8800782dd8f0, tim=..., delta_ns=0, mode=<value optimized out>, wakeup=0) at /opt/kernel/src/kernel/hrtimer.c:961
#5  0xffffffff810b662e in perf_swevent_start_hrtimer (event=0xffff8800782dd800) at /opt/kernel/src/kernel/perf_event.c:5092
#6  0xffffffff810b669c in cpu_clock_event_start (event=<value optimized out>, flags=<value optimized out>) at /opt/kernel/src/kernel/perf_event.c:5126
#7  0xffffffff810ba515 in perf_ctx_adjust_freq (ctx=0xffff880079930600, period=999848) at /opt/kernel/src/kernel/perf_event.c:1726
#8  0xffffffff810ba690 in perf_rotate_context () at /opt/kernel/src/kernel/perf_event.c:1787
#9  perf_event_task_tick () at /opt/kernel/src/kernel/perf_event.c:1821
#10 0xffffffff8103d8d8 in scheduler_tick () at /opt/kernel/src/kernel/sched.c:3784
#11 0xffffffff8104effe in update_process_times (user_tick=0) at /opt/kernel/src/kernel/timer.c:1274
#12 0xffffffff81069587 in tick_sched_timer (timer=0xffff88007fd0e3f0) at /opt/kernel/src/kernel/time/tick-sched.c:760
#13 0xffffffff8105f75d in __run_hrtimer (timer=0xffff88007fd0e3f0, now=0xffff88007fd03f48) at /opt/kernel/src/kernel/hrtimer.c:1197
#14 0xffffffff8105ff5e in hrtimer_interrupt (dev=<value optimized out>) at /opt/kernel/src/kernel/hrtimer.c:1283
#15 0xffffffff813acb4e in local_apic_timer_interrupt (regs=<value optimized out>) at /opt/kernel/src/arch/x86/kernel/apic/apic.c:844
#16 smp_apic_timer_interrupt (regs=<value optimized out>) at /opt/kernel/src/arch/x86/kernel/apic/apic.c:871
#17 <signal handler called>
* Re: perf on 2.6.38-rc4 wedges my box

From: Peter Zijlstra
Date: 2011-02-11 17:53 UTC
To: David Ahern
Cc: Jeff Moyer, linux-kernel, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo

On Fri, 2011-02-11 at 09:35 -0700, David Ahern wrote:
> I'm guessing in your case perf is using hardware cycles for profiling.
>
> I was able to reproduce the lockup in a VM which uses cpu-clock for
> profiling - like Jeff's case. The VM is running Fedora 14 with
> 2.6.38-rc4.

Ah, indeed; when I use:

perf record -gfe task-clock -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sdb

things did come apart. Something like the below cured that problem (but
it did show the pending-softirq thing and triggered something iffy in
the backtrace code -- I will have to stare at those still).

---
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index a353a4d..36fb410 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -5123,6 +5123,10 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
 	u64 period;
 
 	event = container_of(hrtimer, struct perf_event, hw.hrtimer);
+
+	if (event->state < PERF_EVENT_STATE_ACTIVE)
+		return HRTIMER_NORESTART;
+
 	event->pmu->read(event);
 
 	perf_sample_data_init(&data, 0);
@@ -5174,7 +5178,7 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
 		ktime_t remaining = hrtimer_get_remaining(&hwc->hrtimer);
 		local64_set(&hwc->period_left, ktime_to_ns(remaining));
 
-		hrtimer_cancel(&hwc->hrtimer);
+		hrtimer_try_to_cancel(&hwc->hrtimer);
 	}
 }
* Re: perf on 2.6.38-rc4 wedges my box

From: Peter Zijlstra
Date: 2011-02-11 18:23 UTC
To: David Ahern
Cc: Jeff Moyer, linux-kernel, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo

On Fri, 2011-02-11 at 18:53 +0100, Peter Zijlstra wrote:
> Ah, indeed, when I use:
>
> perf record -gfe task-clock -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sdb
>
> things did come apart, something like the below cured that problem (but
> did show the pending softirq thing and triggered something iffy in the
> backtrace code -- will have to stare at those still)

So while this doesn't explain those weird things, it should have at
least one race less -- hrtimer_init() on a possibly still-running timer
didn't seem like a very good idea. Also, since hrtimers tick in
nanoseconds, the whole adaptive-frequency thing seemed unnecessary.

---
 kernel/perf_event.c |   37 +++++++++++++++++++++++++++++++++----
 1 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 999835b..08428b3 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -5051,6 +5051,10 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
 	u64 period;
 
 	event = container_of(hrtimer, struct perf_event, hw.hrtimer);
+
+	if (event->state < PERF_EVENT_STATE_ACTIVE)
+		return HRTIMER_NORESTART;
+
 	event->pmu->read(event);
 
 	perf_sample_data_init(&data, 0);
@@ -5077,9 +5081,6 @@ static void perf_swevent_start_hrtimer(struct perf_event *event)
 	if (!is_sampling_event(event))
 		return;
 
-	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
-	hwc->hrtimer.function = perf_swevent_hrtimer;
-
 	period = local64_read(&hwc->period_left);
 	if (period) {
 		if (period < 0)
@@ -5102,7 +5103,31 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
 		ktime_t remaining = hrtimer_get_remaining(&hwc->hrtimer);
 		local64_set(&hwc->period_left, ktime_to_ns(remaining));
 
-		hrtimer_cancel(&hwc->hrtimer);
+		hrtimer_try_to_cancel(&hwc->hrtimer);
+	}
+}
+
+static void perf_swevent_init_hrtimer(struct perf_event *event)
+{
+	struct hw_perf_event *hwc = &event->hw;
+
+	if (!is_sampling_event(event))
+		return;
+
+	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hwc->hrtimer.function = perf_swevent_hrtimer;
+
+	/*
+	 * Since hrtimers have a fixed rate, we can do a static freq->period
+	 * mapping and avoid the whole period adjust feedback stuff.
+	 */
+	if (event->attr.freq) {
+		long freq = event->attr.sample_freq;
+
+		event->attr.sample_period = NSEC_PER_SEC / freq;
+		hwc->sample_period = event->attr.sample_period;
+		local64_set(&hwc->period_left, hwc->sample_period);
+		event->attr.freq = 0;
+	}
 }
 
@@ -5158,6 +5183,8 @@ static int cpu_clock_event_init(struct perf_event *event)
 	if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK)
 		return -ENOENT;
 
+	perf_swevent_init_hrtimer(event);
+
 	return 0;
 }
 
@@ -5235,6 +5262,8 @@ static int task_clock_event_init(struct perf_event *event)
 	if (event->attr.config != PERF_COUNT_SW_TASK_CLOCK)
 		return -ENOENT;
 
+	perf_swevent_init_hrtimer(event);
+
 	return 0;
 }
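The static freq->period mapping in the patch is plain integer division of
a second's worth of nanoseconds by the requested sampling frequency. A
standalone sketch of the arithmetic (mirroring, not taken from, the
patched perf_swevent_init_hrtimer()):

```c
#include <assert.h>

#define NSEC_PER_SEC 1000000000ULL

/* An hrtimer ticks at a fixed nanosecond rate, so a sampling frequency
 * in Hz maps to a fixed period up front -- no adaptive period-adjust
 * feedback loop is needed, which is the point of the patch. */
static unsigned long long freq_to_period_ns(long freq_hz)
{
	return NSEC_PER_SEC / freq_hz;
}
```

For example, a 1000 Hz sampling frequency becomes a fixed 1,000,000 ns
(1 ms) hrtimer period.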
* Re: perf on 2.6.38-rc4 wedges my box

From: David Ahern
Date: 2011-02-11 18:47 UTC
To: Peter Zijlstra
Cc: Jeff Moyer, linux-kernel, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo

On 02/11/11 11:23, Peter Zijlstra wrote:
> So while this doesn't explain these weird things, it should have at
> least one race less -- hrtimer_init() on a possible still running timer
> didn't seem like a very good idea, also since hrtimers are nsec the
> whole freq thing seemed unnecessary.

That solved the lockup with my repro on 2.6.38-rc4. I did try 2.6.37
earlier today (without this patch), and it locked up pretty quickly as
well.

David

> [...]
* [tip:perf/core] perf: Optimize hrtimer events

From: tip-bot for Peter Zijlstra
Date: 2011-02-16 13:50 UTC
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  ba3dd36c6775264ee6e7354ba1aabcd6e86d7298
Gitweb:     http://git.kernel.org/tip/ba3dd36c6775264ee6e7354ba1aabcd6e86d7298
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Tue, 15 Feb 2011 12:41:46 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 16 Feb 2011 13:30:57 +0100

perf: Optimize hrtimer events

There is no need to re-initialize the hrtimer every time we start it,
so don't do that (shaves a few cycles). Also, since we know hrtimers
run at a fixed rate (nanoseconds) we can pre-compute the desired
frequency at which they tick. This avoids us having to go through the
whole adaptive frequency feedback logic (shaves another few cycles).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1297448589.5226.47.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/perf_event.c |   35 ++++++++++++++++++++++++++++++++---
 1 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index e03be08..a0a6987 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -5602,6 +5602,10 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
 	u64 period;
 
 	event = container_of(hrtimer, struct perf_event, hw.hrtimer);
+
+	if (event->state != PERF_EVENT_STATE_ACTIVE)
+		return HRTIMER_NORESTART;
+
 	event->pmu->read(event);
 
 	perf_sample_data_init(&data, 0);
@@ -5628,9 +5632,6 @@ static void perf_swevent_start_hrtimer(struct perf_event *event)
 	if (!is_sampling_event(event))
 		return;
 
-	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
-	hwc->hrtimer.function = perf_swevent_hrtimer;
-
 	period = local64_read(&hwc->period_left);
 	if (period) {
 		if (period < 0)
@@ -5657,6 +5658,30 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
 	}
 }
 
+static void perf_swevent_init_hrtimer(struct perf_event *event)
+{
+	struct hw_perf_event *hwc = &event->hw;
+
+	if (!is_sampling_event(event))
+		return;
+
+	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hwc->hrtimer.function = perf_swevent_hrtimer;
+
+	/*
+	 * Since hrtimers have a fixed rate, we can do a static freq->period
+	 * mapping and avoid the whole period adjust feedback stuff.
+	 */
+	if (event->attr.freq) {
+		long freq = event->attr.sample_freq;
+
+		event->attr.sample_period = NSEC_PER_SEC / freq;
+		hwc->sample_period = event->attr.sample_period;
+		local64_set(&hwc->period_left, hwc->sample_period);
+		event->attr.freq = 0;
+	}
+}
+
 /*
  * Software event: cpu wall time clock
  */
@@ -5709,6 +5734,8 @@ static int cpu_clock_event_init(struct perf_event *event)
 	if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK)
 		return -ENOENT;
 
+	perf_swevent_init_hrtimer(event);
+
 	return 0;
 }
 
@@ -5787,6 +5814,8 @@ static int task_clock_event_init(struct perf_event *event)
 	if (event->attr.config != PERF_COUNT_SW_TASK_CLOCK)
 		return -ENOENT;
 
+	perf_swevent_init_hrtimer(event);
+
 	return 0;
 }