* perf on 2.6.38-rc4 wedges my box
@ 2011-02-09 17:38 Jeff Moyer
2011-02-09 17:47 ` David Ahern
2011-02-10 21:38 ` Peter Zijlstra
0 siblings, 2 replies; 11+ messages in thread
From: Jeff Moyer @ 2011-02-09 17:38 UTC (permalink / raw)
To: linux-kernel
Cc: Peter Zijlstra, Ingo Molnar, Paul Mackerras,
Arnaldo Carvalho de Melo
Hi,
I'm trying out willy's ata_ram driver[1], and in so doing have managed to
wedge my box while using perf record on an aio-stress run:
[root@metallica ~]# modprobe ata_ram capacity=2097152 preallocate=1
[root@metallica ~]# ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds
adding stage write
starting with write
file size 1024MB, record size 4KB, depth 32, ios per iteration 8
max io_submit 16, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
adding file /dev/sds thread 0
write on /dev/sds (621.30 MB/s) 1024.00 MB in 1.65s
thread 0 write totals (621.27 MB/s) 1024.00 MB in 1.65s
[root@metallica ~]# perf record -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16
/dev/sds
adding stage write
starting with write
file size 1024MB, record size 4KB, depth 32, ios per iteration 8
max io_submit 16, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
adding file /dev/sds thread 0
and there it sits. On the console, I see:
NOHZ: local_softirq_pending 100
NOHZ: local_softirq_pending 100
NOHZ: local_softirq_pending 100
NOHZ: local_softirq_pending 100
NOHZ: local_softirq_pending 100
The number of messages varies, but this is the most I've seen (it
doesn't keep repeating). At this point, the machine does not respond to
pings. As I don't have physical access at the moment, I can't try
alt-sysrq, but might be able to do that tomorrow. It's probably worth
noting that I've witnessed similar behavior with real devices, so it's
not just the ata_ram driver.
Any ideas on how to track this down?
Thanks!
Jeff
[1] http://git.kernel.org/?p=linux/kernel/git/willy/misc.git;a=shortlog;h=refs/heads/ata-ram
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: perf on 2.6.38-rc4 wedges my box
2011-02-09 17:38 perf on 2.6.38-rc4 wedges my box Jeff Moyer
@ 2011-02-09 17:47 ` David Ahern
2011-02-09 18:22 ` Jeff Moyer
2011-02-10 21:38 ` Peter Zijlstra
1 sibling, 1 reply; 11+ messages in thread
From: David Ahern @ 2011-02-09 17:47 UTC (permalink / raw)
To: Jeff Moyer
Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Paul Mackerras,
Arnaldo Carvalho de Melo
On 02/09/11 10:38, Jeff Moyer wrote:
> Hi,
>
> I'm trying out willy's ata_ram driver[1], and in so doing have managed to
> wedge my box while using perf record on an aio-stress run:
>
> [root@metallica ~]# modprobe ata_ram capacity=2097152 preallocate=1
> [root@metallica ~]# ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds
> adding stage write
> starting with write
> file size 1024MB, record size 4KB, depth 32, ios per iteration 8
> max io_submit 16, buffer alignment set to 4KB
> threads 1 files 1 contexts 1 context offset 2MB verification off
> adding file /dev/sds thread 0
> write on /dev/sds (621.30 MB/s) 1024.00 MB in 1.65s
> thread 0 write totals (621.27 MB/s) 1024.00 MB in 1.65s
> [root@metallica ~]# perf record -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16
> /dev/sds
Have you tried '-e cpu-clock' for S/W based profiling vs the default H/W
profiling? Add -v to see if the fallback to S/W is happening now.
David
> adding stage write
> starting with write
> file size 1024MB, record size 4KB, depth 32, ios per iteration 8
> max io_submit 16, buffer alignment set to 4KB
> threads 1 files 1 contexts 1 context offset 2MB verification off
> adding file /dev/sds thread 0
>
> and there it sits. On the console, I see:
>
> NOHZ: local_softirq_pending 100
> NOHZ: local_softirq_pending 100
> NOHZ: local_softirq_pending 100
> NOHZ: local_softirq_pending 100
> NOHZ: local_softirq_pending 100
>
> The number of messages varies, but this is the most I've seen (it
> doesn't keep repeating). At this point, the machine does not respond to
> pings. As I don't have physical access at the moment, I can't try
> alt-sysrq, but might be able to do that tomorrow. It's probably worth
> noting that I've witnessed similar behavior with real devices, so it's
> not just the ata_ram driver.
>
> Any ideas on how to track this down?
>
> Thanks!
>
> Jeff
>
> [1] http://git.kernel.org/?p=linux/kernel/git/willy/misc.git;a=shortlog;h=refs/heads/ata-ram
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
* Re: perf on 2.6.38-rc4 wedges my box
2011-02-09 17:47 ` David Ahern
@ 2011-02-09 18:22 ` Jeff Moyer
2011-02-09 20:12 ` David Ahern
0 siblings, 1 reply; 11+ messages in thread
From: Jeff Moyer @ 2011-02-09 18:22 UTC (permalink / raw)
To: David Ahern
Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Paul Mackerras,
Arnaldo Carvalho de Melo
David Ahern <dsahern@gmail.com> writes:
> Have you tried '-e cpu-clock' for S/W based profiling vs the default H/W
> profiling? Add -v to see if the fallback to S/W is happening now.
Thanks for the suggestion, David. I tried:
# perf record -v ls
Warning: ... trying to fall back to cpu-clock-ticks
couldn't open /proc/-1/status
couldn't open /proc/-1/maps
[ls output]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.008 MB perf.data (~363 samples) ]
If I explicitly set '-e cpu-clock', then the output is the same,
except that the warning is gone. What's up with the /proc/-1/*?
Now, when running perf record -e cpu-clock on the aio-stress run,
unsurprisingly, I get the same result:
# perf record -e cpu-clock -v -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds
couldn't open /proc/-1/status
couldn't open /proc/-1/maps
adding stage write
starting with write
file size 1024MB, record size 4KB, depth 32, ios per iteration 8
max io_submit 16, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
adding file /dev/sds thread 0
and there it sits. In this case, however, I did not see the NOHZ
warnings on the console, and this time the machine is still responding
to ping, but nothing else.
Cheers,
Jeff
* Re: perf on 2.6.38-rc4 wedges my box
2011-02-09 18:22 ` Jeff Moyer
@ 2011-02-09 20:12 ` David Ahern
2011-02-09 22:11 ` Arnaldo Carvalho de Melo
0 siblings, 1 reply; 11+ messages in thread
From: David Ahern @ 2011-02-09 20:12 UTC (permalink / raw)
To: Jeff Moyer
Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Paul Mackerras,
Arnaldo Carvalho de Melo
On 02/09/11 11:22, Jeff Moyer wrote:
> David Ahern <dsahern@gmail.com> writes:
>
>> Have you tried '-e cpu-clock' for S/W based profiling vs the default H/W
>> profiling? Add -v to see if the fallback to S/W is happening now.
>
> Thanks for the suggestion, David. I tried:
>
> # perf record -v ls
> Warning: ... trying to fall back to cpu-clock-ticks
>
> couldn't open /proc/-1/status
> couldn't open /proc/-1/maps
> [ls output]
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.008 MB perf.data (~363 samples) ]
>
> If I explicitly set '-e cpu-clock', then the output is the same,
> except that the warning is gone. What's up with the /proc/-1/*?
target_{pid,tid} are initialized to -1 in builtin-record.c. I believe the
tid version is making its way through the event__synthesize_xxx code
(event__synthesize_thread -> __event__synthesize_thread ->
event__synthesize_comm and event__synthesize_mmap_events).
>
> Now, when running perf record -e cpu-clock on the aio-stress run,
> unsurprisingly, I get the same result:
>
> # perf record -e cpu-clock -v -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds
> couldn't open /proc/-1/status
> couldn't open /proc/-1/maps
> adding stage write
> starting with write
> file size 1024MB, record size 4KB, depth 32, ios per iteration 8
> max io_submit 16, buffer alignment set to 4KB
> threads 1 files 1 contexts 1 context offset 2MB verification off
> adding file /dev/sds thread 0
>
> and there it sits. In this case, however, I did not see the NOHZ
> warnings on the console, and this time the machine is still responding
> to ping, but nothing else.
cpu-clock is handled through hrtimers if that helps understand the lockup.
David
>
> Cheers,
> Jeff
* Re: perf on 2.6.38-rc4 wedges my box
2011-02-09 20:12 ` David Ahern
@ 2011-02-09 22:11 ` Arnaldo Carvalho de Melo
0 siblings, 0 replies; 11+ messages in thread
From: Arnaldo Carvalho de Melo @ 2011-02-09 22:11 UTC (permalink / raw)
To: David Ahern
Cc: Jeff Moyer, linux-kernel, Peter Zijlstra, Ingo Molnar,
Paul Mackerras
On Wed, Feb 09, 2011 at 01:12:22PM -0700, David Ahern wrote:
>
>
> On 02/09/11 11:22, Jeff Moyer wrote:
> > David Ahern <dsahern@gmail.com> writes:
> >
> >> Have you tried '-e cpu-clock' for S/W based profiling vs the default H/W
> >> profiling? Add -v to see if the fallback to S/W is happening now.
> >
> > Thanks for the suggestion, David. I tried:
> >
> > # perf record -v ls
> > Warning: ... trying to fall back to cpu-clock-ticks
> >
> > couldn't open /proc/-1/status
> > couldn't open /proc/-1/maps
> > [ls output]
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.008 MB perf.data (~363 samples) ]
> >
> > If I explicitly set '-e cpu-clock', then the output is the same,
> > except that the warning is gone. What's up with the /proc/-1/*?
>
> target_{pid,tid} are initialized to -1 in builtin-record.c. I believe the
> tid version is making its way through the event__synthesize_xxx code
> (event__synthesize_thread -> __event__synthesize_thread ->
> event__synthesize_comm and event__synthesize_mmap_events).
Yes, I'm working on a patch, probably just this will fix it:
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 07f8d6d..dd27b9f 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -680,7 +680,7 @@ static int __cmd_record(int argc, const char **argv)
perf_event__synthesize_guest_os);
if (!system_wide)
- perf_event__synthesize_thread(target_tid,
+ perf_event__synthesize_thread(evsel_list->threads->map[0],
process_synthesized_event,
session);
else
---
So that it gets the child_pid or the tid passed via --tid.
But thinking about it more, I think the correct fix is to pass the
thread_map evsel_list->threads, which will cover both the case of just one
thread (--tid) and a whole group of threads (--pid, when the thread_map
will have more than one tid).
I'll get a patch done later.
> >
> > Now, when running perf record -e cpu-clock on the aio-stress run,
> > unsurprisingly, I get the same result:
> >
> > # perf record -e cpu-clock -v -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds
> > couldn't open /proc/-1/status
> > couldn't open /proc/-1/maps
> > adding stage write
> > starting with write
> > file size 1024MB, record size 4KB, depth 32, ios per iteration 8
> > max io_submit 16, buffer alignment set to 4KB
> > threads 1 files 1 contexts 1 context offset 2MB verification off
> > adding file /dev/sds thread 0
> >
> > and there it sits. In this case, however, I did not see the NOHZ
> > warnings on the console, and this time the machine is still responding
> > to ping, but nothing else.
>
> cpu-clock is handled through hrtimers if that helps understand the lockup.
>
> David
>
> >
> > Cheers,
> > Jeff
* Re: perf on 2.6.38-rc4 wedges my box
2011-02-09 17:38 perf on 2.6.38-rc4 wedges my box Jeff Moyer
2011-02-09 17:47 ` David Ahern
@ 2011-02-10 21:38 ` Peter Zijlstra
2011-02-11 16:35 ` David Ahern
1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2011-02-10 21:38 UTC (permalink / raw)
To: Jeff Moyer
Cc: linux-kernel, Ingo Molnar, Paul Mackerras,
Arnaldo Carvalho de Melo
On Wed, 2011-02-09 at 12:38 -0500, Jeff Moyer wrote:
> Hi,
>
> I'm trying out willy's ata_ram driver[1], and in so doing have managed to
> wedge my box while using perf record on an aio-stress run:
>
> [root@metallica ~]# modprobe ata_ram capacity=2097152 preallocate=1
> [root@metallica ~]# ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds
> adding stage write
> starting with write
> file size 1024MB, record size 4KB, depth 32, ios per iteration 8
> max io_submit 16, buffer alignment set to 4KB
> threads 1 files 1 contexts 1 context offset 2MB verification off
> adding file /dev/sds thread 0
> write on /dev/sds (621.30 MB/s) 1024.00 MB in 1.65s
> thread 0 write totals (621.27 MB/s) 1024.00 MB in 1.65s
> [root@metallica ~]# perf record -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16
> /dev/sds
> adding stage write
> starting with write
> file size 1024MB, record size 4KB, depth 32, ios per iteration 8
> max io_submit 16, buffer alignment set to 4KB
> threads 1 files 1 contexts 1 context offset 2MB verification off
> adding file /dev/sds thread 0
>
> and there it sits. On the console, I see:
>
> NOHZ: local_softirq_pending 100
> NOHZ: local_softirq_pending 100
> NOHZ: local_softirq_pending 100
> NOHZ: local_softirq_pending 100
> NOHZ: local_softirq_pending 100
>
> The number of messages varies, but this is the most I've seen (it
> doesn't keep repeating). At this point, the machine does not respond to
> pings. As I don't have physical access at the moment, I can't try
> alt-sysrq, but might be able to do that tomorrow. It's probably worth
> noting that I've witnessed similar behavior with real devices, so it's
> not just the ata_ram driver.
>
> Any ideas on how to track this down?
So I took linus' tree of about half an hour ago, added
git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc.git ata-ram
(fixed up some Kconfig/Makefile rejects), google'd aio-stress
(http://fsbench.filesystems.org/bench/aio-stress.c) and set out to
reproduce the above..
Sadly it all seems to work here; it's spending ~15% in
_raw_spin_lock_irq, which, when I use -g, breaks down like:
- 14.13% aio-stress [kernel.kallsyms] [k] _raw_spin_lock_irq
- _raw_spin_lock_irq
+ 44.14% __make_request
+ 20.91% __aio_get_req
+ 10.15% aio_run_iocb
+ 7.37% do_io_submit
+ 6.55% scsi_request_fn
+ 5.48% generic_unplug_device
+ 3.58% aio_put_req
+ 0.92% generic_make_request
+ 0.91% __generic_unplug_device
* Re: perf on 2.6.38-rc4 wedges my box
2011-02-10 21:38 ` Peter Zijlstra
@ 2011-02-11 16:35 ` David Ahern
2011-02-11 17:53 ` Peter Zijlstra
0 siblings, 1 reply; 11+ messages in thread
From: David Ahern @ 2011-02-11 16:35 UTC (permalink / raw)
To: Peter Zijlstra, Jeff Moyer
Cc: linux-kernel, Ingo Molnar, Paul Mackerras,
Arnaldo Carvalho de Melo
On 02/10/11 14:38, Peter Zijlstra wrote:
> On Wed, 2011-02-09 at 12:38 -0500, Jeff Moyer wrote:
>> Hi,
>>
>> I'm trying out willy's ata_ram driver[1], and in so doing have managed to
>> wedge my box while using perf record on an aio-stress run:
>>
>> [root@metallica ~]# modprobe ata_ram capacity=2097152 preallocate=1
>> [root@metallica ~]# ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds
>> adding stage write
>> starting with write
>> file size 1024MB, record size 4KB, depth 32, ios per iteration 8
>> max io_submit 16, buffer alignment set to 4KB
>> threads 1 files 1 contexts 1 context offset 2MB verification off
>> adding file /dev/sds thread 0
>> write on /dev/sds (621.30 MB/s) 1024.00 MB in 1.65s
>> thread 0 write totals (621.27 MB/s) 1024.00 MB in 1.65s
>> [root@metallica ~]# perf record -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16
>> /dev/sds
>> adding stage write
>> starting with write
>> file size 1024MB, record size 4KB, depth 32, ios per iteration 8
>> max io_submit 16, buffer alignment set to 4KB
>> threads 1 files 1 contexts 1 context offset 2MB verification off
>> adding file /dev/sds thread 0
>>
>> and there it sits. On the console, I see:
>>
>> NOHZ: local_softirq_pending 100
>> NOHZ: local_softirq_pending 100
>> NOHZ: local_softirq_pending 100
>> NOHZ: local_softirq_pending 100
>> NOHZ: local_softirq_pending 100
>>
>> The number of messages varies, but this is the most I've seen (it
>> doesn't keep repeating). At this point, the machine does not respond to
>> pings. As I don't have physical access at the moment, I can't try
>> alt-sysrq, but might be able to do that tomorrow. It's probably worth
>> noting that I've witnessed similar behavior with real devices, so it's
>> not just the ata_ram driver.
>>
>> Any ideas on how to track this down?
>
> So I took linus' tree of about half an hour ago, added
> git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc.git ata-ram
> (fixed up some Kconfig/Makefile rejects), google'd aio-stress
> (http://fsbench.filesystems.org/bench/aio-stress.c) and set out to
> reproduce the above..
>
> Sadly it all seems to work here; it's spending ~15% in
> _raw_spin_lock_irq, which, when I use -g, breaks down like:
>
> - 14.13% aio-stress [kernel.kallsyms] [k] _raw_spin_lock_irq
> - _raw_spin_lock_irq
> + 44.14% __make_request
> + 20.91% __aio_get_req
> + 10.15% aio_run_iocb
> + 7.37% do_io_submit
> + 6.55% scsi_request_fn
> + 5.48% generic_unplug_device
> + 3.58% aio_put_req
> + 0.92% generic_make_request
> + 0.91% __generic_unplug_device
I'm guessing in your case perf is using hardware cycles for profiling.
I was able to reproduce the lockup in a VM which uses cpu-clock for
profiling - like Jeff's case. The VM is running Fedora 14 with 2.6.38-rc4.
In the host, one of qemu-kvm's threads is pegging the CPU. A backtrace of
the vcpus is given below. stop-bt-cont at various intervals shows
similar traces, and xtime is not advancing; only the final function changes
(e.g., from __rb_rotate_left to __rb_rotate_right).
David
[Switching to Thread 1]
(gdb) bt
#0 native_safe_halt () at
/opt/kernel/src/arch/x86/include/asm/irqflags.h:50
#1 0xffffffff81009d76 in arch_safe_halt () at
/opt/kernel/src/arch/x86/include/asm/paravirt.h:110
#2 default_idle () at /opt/kernel/src/arch/x86/kernel/process.c:381
#3 0xffffffff8100132a in cpu_idle () at
/opt/kernel/src/arch/x86/kernel/process_64.c:139
#4 0xffffffff81392c89 in rest_init () at /opt/kernel/src/init/main.c:463
#5 0xffffffff81697c23 in start_kernel () at /opt/kernel/src/init/main.c:713
#6 0xffffffff816972af in x86_64_start_reservations
(real_mode_data=<value optimized out>) at
/opt/kernel/src/arch/x86/kernel/head64.c:127
#7 0xffffffff816973b9 in x86_64_start_kernel (real_mode_data=0x93950
<Address 0x93950 out of bounds>) at
/opt/kernel/src/arch/x86/kernel/head64.c:97
#8 0x0000000000000000 in ?? ()
(gdb) thread 2
[Switching to thread 2 (Thread 2)]#0 __rb_rotate_left
(node=0xffff8800782dd8f0, root=0xffff88007fd0e358) at
/opt/kernel/src/lib/rbtree.c:37
(gdb) bt
#0 __rb_rotate_left (node=0xffff8800782dd8f0, root=0xffff88007fd0e358)
at /opt/kernel/src/lib/rbtree.c:37
#1 0xffffffff81206bcc in rb_insert_color (node=0xffff8800782dd8f0,
root=0xffff88007fd0e358) at /opt/kernel/src/lib/rbtree.c:130
#2 0xffffffff81207cb5 in timerqueue_add (head=0xffff88007fd0e358,
node=0xffff8800782dd8f0) at /opt/kernel/src/lib/timerqueue.c:56
#3 0xffffffff8105f595 in enqueue_hrtimer (timer=0xffff8800782dd8f0,
base=0xffff88007fd0e348) at /opt/kernel/src/kernel/hrtimer.c:848
#4 0xffffffff8105fc75 in __hrtimer_start_range_ns
(timer=0xffff8800782dd8f0, tim=..., delta_ns=0, mode=<value optimized
out>, wakeup=0)
at /opt/kernel/src/kernel/hrtimer.c:961
#5 0xffffffff810b662e in perf_swevent_start_hrtimer
(event=0xffff8800782dd800) at /opt/kernel/src/kernel/perf_event.c:5092
#6 0xffffffff810b669c in cpu_clock_event_start (event=<value optimized
out>, flags=<value optimized out>) at
/opt/kernel/src/kernel/perf_event.c:5126
#7 0xffffffff810ba515 in perf_ctx_adjust_freq (ctx=0xffff880079930600,
period=999848) at /opt/kernel/src/kernel/perf_event.c:1726
#8 0xffffffff810ba690 in perf_rotate_context () at
/opt/kernel/src/kernel/perf_event.c:1787
#9 perf_event_task_tick () at /opt/kernel/src/kernel/perf_event.c:1821
#10 0xffffffff8103d8d8 in scheduler_tick () at
/opt/kernel/src/kernel/sched.c:3784
#11 0xffffffff8104effe in update_process_times (user_tick=0) at
/opt/kernel/src/kernel/timer.c:1274
#12 0xffffffff81069587 in tick_sched_timer (timer=0xffff88007fd0e3f0) at
/opt/kernel/src/kernel/time/tick-sched.c:760
#13 0xffffffff8105f75d in __run_hrtimer (timer=0xffff88007fd0e3f0,
now=0xffff88007fd03f48) at /opt/kernel/src/kernel/hrtimer.c:1197
#14 0xffffffff8105ff5e in hrtimer_interrupt (dev=<value optimized out>)
at /opt/kernel/src/kernel/hrtimer.c:1283
#15 0xffffffff813acb4e in local_apic_timer_interrupt (regs=<value
optimized out>) at /opt/kernel/src/arch/x86/kernel/apic/apic.c:844
#16 smp_apic_timer_interrupt (regs=<value optimized out>) at
/opt/kernel/src/arch/x86/kernel/apic/apic.c:871
#17 <signal handler called>
>
>
>
* Re: perf on 2.6.38-rc4 wedges my box
2011-02-11 16:35 ` David Ahern
@ 2011-02-11 17:53 ` Peter Zijlstra
2011-02-11 18:23 ` Peter Zijlstra
0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2011-02-11 17:53 UTC (permalink / raw)
To: David Ahern
Cc: Jeff Moyer, linux-kernel, Ingo Molnar, Paul Mackerras,
Arnaldo Carvalho de Melo
On Fri, 2011-02-11 at 09:35 -0700, David Ahern wrote:
> I'm guessing in your case perf is using hardware cycles for profiling.
>
> I was able to reproduce the lockup in a VM which uses cpu-clock for
> profiling - like Jeff's case. The VM is running Fedora 14 with
> 2.6.38-rc4.
>
Ah, indeed, when I use:
perf record -gfe task-clock -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sdb
things did come apart, something like the below cured that problem (but
did show the pending softirq thing and triggered something iffy in the
backtrace code -- will have to stare at those still)
---
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index a353a4d..36fb410 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -5123,6 +5123,10 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
u64 period;
event = container_of(hrtimer, struct perf_event, hw.hrtimer);
+
+ if (event->state < PERF_EVENT_STATE_ACTIVE)
+ return HRTIMER_NORESTART;
+
event->pmu->read(event);
perf_sample_data_init(&data, 0);
@@ -5174,7 +5178,7 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
ktime_t remaining = hrtimer_get_remaining(&hwc->hrtimer);
local64_set(&hwc->period_left, ktime_to_ns(remaining));
- hrtimer_cancel(&hwc->hrtimer);
+ hrtimer_try_to_cancel(&hwc->hrtimer);
}
}
* Re: perf on 2.6.38-rc4 wedges my box
2011-02-11 17:53 ` Peter Zijlstra
@ 2011-02-11 18:23 ` Peter Zijlstra
2011-02-11 18:47 ` David Ahern
2011-02-16 13:50 ` [tip:perf/core] perf: Optimize hrtimer events tip-bot for Peter Zijlstra
0 siblings, 2 replies; 11+ messages in thread
From: Peter Zijlstra @ 2011-02-11 18:23 UTC (permalink / raw)
To: David Ahern
Cc: Jeff Moyer, linux-kernel, Ingo Molnar, Paul Mackerras,
Arnaldo Carvalho de Melo
On Fri, 2011-02-11 at 18:53 +0100, Peter Zijlstra wrote:
> On Fri, 2011-02-11 at 09:35 -0700, David Ahern wrote:
> > I'm guessing in your case perf is using hardware cycles for profiling.
> >
> > I was able to reproduce the lockup in a VM which uses cpu-clock for
> > profiling - like Jeff's case. The VM is running Fedora 14 with
> > 2.6.38-rc4.
> >
> Ah, indeed, when I use:
>
> perf record -gfe task-clock -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sdb
>
> things did come apart, something like the below cured that problem (but
> did show the pending softirq thing and triggered something iffy in the
> backtrace code -- will have to stare at those still)
So while this doesn't explain these weird things, it should have at
least one race less -- hrtimer_init() on a possibly still running timer
didn't seem like a very good idea; also, since hrtimers tick in
nanoseconds, the whole freq thing seemed unnecessary.
---
kernel/perf_event.c | 37 +++++++++++++++++++++++++++++++++----
1 files changed, 33 insertions(+), 4 deletions(-)
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 999835b..08428b3 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -5051,6 +5051,10 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
u64 period;
event = container_of(hrtimer, struct perf_event, hw.hrtimer);
+
+ if (event->state < PERF_EVENT_STATE_ACTIVE)
+ return HRTIMER_NORESTART;
+
event->pmu->read(event);
perf_sample_data_init(&data, 0);
@@ -5077,9 +5081,6 @@ static void perf_swevent_start_hrtimer(struct perf_event *event)
if (!is_sampling_event(event))
return;
- hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
- hwc->hrtimer.function = perf_swevent_hrtimer;
-
period = local64_read(&hwc->period_left);
if (period) {
if (period < 0)
@@ -5102,7 +5103,31 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
ktime_t remaining = hrtimer_get_remaining(&hwc->hrtimer);
local64_set(&hwc->period_left, ktime_to_ns(remaining));
- hrtimer_cancel(&hwc->hrtimer);
+ hrtimer_try_to_cancel(&hwc->hrtimer);
+ }
+}
+
+static void perf_swevent_init_hrtimer(struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+
+ if (!is_sampling_event(event))
+ return;
+
+ hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ hwc->hrtimer.function = perf_swevent_hrtimer;
+
+ /*
+ * Since hrtimers have a fixed rate, we can do a static freq->period
+ * mapping and avoid the whole period adjust feedback stuff.
+ */
+ if (event->attr.freq) {
+ long freq = event->attr.sample_freq;
+
+ event->attr.sample_period = NSEC_PER_SEC / freq;
+ hwc->sample_period = event->attr.sample_period;
+ local64_set(&hwc->period_left, hwc->sample_period);
+ event->attr.freq = 0;
}
}
@@ -5158,6 +5183,8 @@ static int cpu_clock_event_init(struct perf_event *event)
if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK)
return -ENOENT;
+ perf_swevent_init_hrtimer(event);
+
return 0;
}
@@ -5235,6 +5262,8 @@ static int task_clock_event_init(struct perf_event *event)
if (event->attr.config != PERF_COUNT_SW_TASK_CLOCK)
return -ENOENT;
+ perf_swevent_init_hrtimer(event);
+
return 0;
}
* Re: perf on 2.6.38-rc4 wedges my box
2011-02-11 18:23 ` Peter Zijlstra
@ 2011-02-11 18:47 ` David Ahern
2011-02-16 13:50 ` [tip:perf/core] perf: Optimize hrtimer events tip-bot for Peter Zijlstra
1 sibling, 0 replies; 11+ messages in thread
From: David Ahern @ 2011-02-11 18:47 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Jeff Moyer, linux-kernel, Ingo Molnar, Paul Mackerras,
Arnaldo Carvalho de Melo
On 02/11/11 11:23, Peter Zijlstra wrote:
> On Fri, 2011-02-11 at 18:53 +0100, Peter Zijlstra wrote:
>> On Fri, 2011-02-11 at 09:35 -0700, David Ahern wrote:
>>> I'm guessing in your case perf is using hardware cycles for profiling.
>>>
>>> I was able to reproduce the lockup in a VM which uses cpu-clock for
>>> profiling - like Jeff's case. The VM is running Fedora 14 with
>>> 2.6.38-rc4.
>>>
>> Ah, indeed, when I use:
>>
>> perf record -gfe task-clock -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sdb
>>
>> things did come apart, something like the below cured that problem (but
>> did show the pending softirq thing and triggered something iffy in the
>> backtrace code -- will have to stare at those still)
>
> So while this doesn't explain these weird things, it should have at
> least one race less -- hrtimer_init() on a possibly still running timer
> didn't seem like a very good idea; also, since hrtimers tick in
> nanoseconds, the whole freq thing seemed unnecessary.
That solved the lockup with my repro on 2.6.38-rc4.
I did try 2.6.37 earlier today (without this patch), and it locked up
pretty quick as well.
David
>
> ---
> kernel/perf_event.c | 37 +++++++++++++++++++++++++++++++++----
> 1 files changed, 33 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> index 999835b..08428b3 100644
> --- a/kernel/perf_event.c
> +++ b/kernel/perf_event.c
> @@ -5051,6 +5051,10 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
> u64 period;
>
> event = container_of(hrtimer, struct perf_event, hw.hrtimer);
> +
> + if (event->state < PERF_EVENT_STATE_ACTIVE)
> + return HRTIMER_NORESTART;
> +
> event->pmu->read(event);
>
> perf_sample_data_init(&data, 0);
> @@ -5077,9 +5081,6 @@ static void perf_swevent_start_hrtimer(struct perf_event *event)
> if (!is_sampling_event(event))
> return;
>
> - hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> - hwc->hrtimer.function = perf_swevent_hrtimer;
> -
> period = local64_read(&hwc->period_left);
> if (period) {
> if (period < 0)
> @@ -5102,7 +5103,31 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
> ktime_t remaining = hrtimer_get_remaining(&hwc->hrtimer);
> local64_set(&hwc->period_left, ktime_to_ns(remaining));
>
> - hrtimer_cancel(&hwc->hrtimer);
> + hrtimer_try_to_cancel(&hwc->hrtimer);
> + }
> +}
> +
> +static void perf_swevent_init_hrtimer(struct perf_event *event)
> +{
> + struct hw_perf_event *hwc = &event->hw;
> +
> + if (!is_sampling_event(event))
> + return;
> +
> + hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> + hwc->hrtimer.function = perf_swevent_hrtimer;
> +
> + /*
> + * Since hrtimers have a fixed rate, we can do a static freq->period
> + * mapping and avoid the whole period adjust feedback stuff.
> + */
> + if (event->attr.freq) {
> + long freq = event->attr.sample_freq;
> +
> + event->attr.sample_period = NSEC_PER_SEC / freq;
> + hwc->sample_period = event->attr.sample_period;
> + local64_set(&hwc->period_left, hwc->sample_period);
> + event->attr.freq = 0;
> }
> }
>
> @@ -5158,6 +5183,8 @@ static int cpu_clock_event_init(struct perf_event *event)
> if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK)
> return -ENOENT;
>
> + perf_swevent_init_hrtimer(event);
> +
> return 0;
> }
>
> @@ -5235,6 +5262,8 @@ static int task_clock_event_init(struct perf_event *event)
> if (event->attr.config != PERF_COUNT_SW_TASK_CLOCK)
> return -ENOENT;
>
> + perf_swevent_init_hrtimer(event);
> +
> return 0;
> }
>
>
>
* [tip:perf/core] perf: Optimize hrtimer events
2011-02-11 18:23 ` Peter Zijlstra
2011-02-11 18:47 ` David Ahern
@ 2011-02-16 13:50 ` tip-bot for Peter Zijlstra
1 sibling, 0 replies; 11+ messages in thread
From: tip-bot for Peter Zijlstra @ 2011-02-16 13:50 UTC (permalink / raw)
To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, a.p.zijlstra, tglx, mingo
Commit-ID: ba3dd36c6775264ee6e7354ba1aabcd6e86d7298
Gitweb: http://git.kernel.org/tip/ba3dd36c6775264ee6e7354ba1aabcd6e86d7298
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Tue, 15 Feb 2011 12:41:46 +0100
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 16 Feb 2011 13:30:57 +0100
perf: Optimize hrtimer events
There is no need to re-initialize the hrtimer every time we start it,
so don't do that (shaves a few cycles). Also, since we know hrtimers
run at a fixed rate (nanoseconds) we can pre-compute the desired
frequency at which they tick. This avoids us having to go through the
whole adaptive frequency feedback logic (shaves another few cycles).
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1297448589.5226.47.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/perf_event.c | 35 ++++++++++++++++++++++++++++++++---
1 files changed, 32 insertions(+), 3 deletions(-)
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index e03be08..a0a6987 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -5602,6 +5602,10 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
u64 period;
event = container_of(hrtimer, struct perf_event, hw.hrtimer);
+
+ if (event->state != PERF_EVENT_STATE_ACTIVE)
+ return HRTIMER_NORESTART;
+
event->pmu->read(event);
perf_sample_data_init(&data, 0);
@@ -5628,9 +5632,6 @@ static void perf_swevent_start_hrtimer(struct perf_event *event)
if (!is_sampling_event(event))
return;
- hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
- hwc->hrtimer.function = perf_swevent_hrtimer;
-
period = local64_read(&hwc->period_left);
if (period) {
if (period < 0)
@@ -5657,6 +5658,30 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
}
}
+static void perf_swevent_init_hrtimer(struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+
+ if (!is_sampling_event(event))
+ return;
+
+ hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ hwc->hrtimer.function = perf_swevent_hrtimer;
+
+ /*
+ * Since hrtimers have a fixed rate, we can do a static freq->period
+ * mapping and avoid the whole period adjust feedback stuff.
+ */
+ if (event->attr.freq) {
+ long freq = event->attr.sample_freq;
+
+ event->attr.sample_period = NSEC_PER_SEC / freq;
+ hwc->sample_period = event->attr.sample_period;
+ local64_set(&hwc->period_left, hwc->sample_period);
+ event->attr.freq = 0;
+ }
+}
+
/*
* Software event: cpu wall time clock
*/
@@ -5709,6 +5734,8 @@ static int cpu_clock_event_init(struct perf_event *event)
if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK)
return -ENOENT;
+ perf_swevent_init_hrtimer(event);
+
return 0;
}
@@ -5787,6 +5814,8 @@ static int task_clock_event_init(struct perf_event *event)
if (event->attr.config != PERF_COUNT_SW_TASK_CLOCK)
return -ENOENT;
+ perf_swevent_init_hrtimer(event);
+
return 0;
}