* perf on 2.6.38-rc4 wedges my box

From: Jeff Moyer
Date: 2011-02-09 17:38 UTC
To: linux-kernel
Cc: Peter Zijlstra, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo

Hi,

I'm trying out willy's ata_ram driver [1], and in so doing have managed
to wedge my box while using perf record on an aio-stress run:

[root@metallica ~]# modprobe ata_ram capacity=2097152 preallocate=1
[root@metallica ~]# ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds
adding stage write
starting with write
file size 1024MB, record size 4KB, depth 32, ios per iteration 8
max io_submit 16, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
adding file /dev/sds thread 0
write on /dev/sds (621.30 MB/s) 1024.00 MB in 1.65s
thread 0 write totals (621.27 MB/s) 1024.00 MB in 1.65s
[root@metallica ~]# perf record -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds
adding stage write
starting with write
file size 1024MB, record size 4KB, depth 32, ios per iteration 8
max io_submit 16, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
adding file /dev/sds thread 0

and there it sits. On the console, I see:

NOHZ: local_softirq_pending 100
NOHZ: local_softirq_pending 100
NOHZ: local_softirq_pending 100
NOHZ: local_softirq_pending 100
NOHZ: local_softirq_pending 100

The number of messages varies, but this is the most I've seen (it
doesn't keep repeating). At this point, the machine does not respond to
pings. As I don't have physical access at the moment, I can't try
alt-sysrq, but I might be able to do that tomorrow. It's probably worth
noting that I've witnessed similar behavior with real devices, so it's
not just the ata_ram driver.

Any ideas on how to track this down?

Thanks!
Jeff

[1] http://git.kernel.org/?p=linux/kernel/git/willy/misc.git;a=shortlog;h=refs/heads/ata-ram
* Re: perf on 2.6.38-rc4 wedges my box

From: David Ahern
Date: 2011-02-09 17:47 UTC
To: Jeff Moyer
Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo

On 02/09/11 10:38, Jeff Moyer wrote:
> I'm trying out willy's ata_ram driver [1], and in so doing have managed
> to wedge my box while using perf record on an aio-stress run:
>
> [root@metallica ~]# modprobe ata_ram capacity=2097152 preallocate=1
> [root@metallica ~]# ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds
> [...]
> [root@metallica ~]# perf record -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds

Have you tried '-e cpu-clock' for S/W-based profiling vs. the default
H/W profiling? Add -v to see whether the fallback to S/W is happening
now.

David

> [...]
> and there it sits. On the console, I see:
>
> NOHZ: local_softirq_pending 100
> [...]
>
> The number of messages varies, but this is the most I've seen (it
> doesn't keep repeating). At this point, the machine does not respond
> to pings.
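For background on what '-e cpu-clock' switches over to: it selects the
software clock event at the perf_event_open(2) level, and software events
are serviced by timers rather than the hardware PMU. A minimal sketch of
the attribute setup follows — the helper function is illustrative only
(it is not perf's actual code), though the type/config constants are the
real ones from linux/perf_event.h:

```c
#include <assert.h>
#include <string.h>
#include <linux/perf_event.h>

/* Sketch: the attribute that `perf record -e cpu-clock` would hand to
 * perf_event_open(2).  A PERF_TYPE_SOFTWARE event never programs the
 * hardware PMU, which is why this thread's lockup follows cpu-clock. */
static struct perf_event_attr cpu_clock_attr(unsigned long freq_hz)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size        = sizeof(attr);
	attr.type        = PERF_TYPE_SOFTWARE;
	attr.config      = PERF_COUNT_SW_CPU_CLOCK;
	attr.freq        = 1;        /* interpret sample_freq as Hz */
	attr.sample_freq = freq_hz;  /* e.g. 1000 samples/sec */
	return attr;
}
```

The 1000 Hz figure in the usage below is just an example value, not a
claim about perf's exact default at the time.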
* Re: perf on 2.6.38-rc4 wedges my box

From: Jeff Moyer
Date: 2011-02-09 18:22 UTC
To: David Ahern
Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo

David Ahern <dsahern@gmail.com> writes:

> Have you tried '-e cpu-clock' for S/W based profiling vs the default H/W
> profiling? Add -v to see if the fallback to S/W is happening now.

Thanks for the suggestion, David. I tried:

# perf record -v ls
Warning: ... trying to fall back to cpu-clock-ticks
couldn't open /proc/-1/status
couldn't open /proc/-1/maps
[ls output]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.008 MB perf.data (~363 samples) ]

If I explicitly set '-e cpu-clock', then the output is the same, except
that the warning is gone. What's up with the /proc/-1/*?

Now, when running perf record -e cpu-clock on the aio-stress run,
unsurprisingly, I get the same result:

# perf record -e cpu-clock -v -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds
couldn't open /proc/-1/status
couldn't open /proc/-1/maps
adding stage write
starting with write
file size 1024MB, record size 4KB, depth 32, ios per iteration 8
max io_submit 16, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
adding file /dev/sds thread 0

and there it sits. In this case, however, I did not see the NOHZ
warnings on the console, and this time the machine is still responding
to ping, but nothing else.

Cheers,
Jeff
* Re: perf on 2.6.38-rc4 wedges my box

From: David Ahern
Date: 2011-02-09 20:12 UTC
To: Jeff Moyer
Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo

On 02/09/11 11:22, Jeff Moyer wrote:
> If I explicitly set '-e cpu-clock', then the output is the same,
> except that the warning is gone. What's up with the /proc/-1/*?

target_{pid,tid} are initialized to -1 in builtin-record.c. I believe
the tid version is making its way through the event__synthesize_xxx
code (event__synthesize_thread -> __event__synthesize_thread ->
event__synthesize_comm and event__synthesize_mmap_events).

> and there it sits. In this case, however, I did not see the NOHZ
> warnings on the console, and this time the machine is still responding
> to ping, but nothing else.

cpu-clock is handled through hrtimers, if that helps understand the
lockup.

David
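The '/proc/-1/*' paths are exactly what you get from formatting that
uninitialized -1 tid into a procfs path. A small userspace sketch of the
effect — proc_path() is a hypothetical stand-in for the formatting perf
does internally, not a function from the perf sources:

```c
#include <stdio.h>
#include <string.h>

/* Sketch: how a target tid left at -1 becomes the bogus
 * "/proc/-1/status" path that perf record then fails to open. */
static const char *proc_path(char *buf, size_t len, int tid,
			     const char *file)
{
	/* %d happily prints negative pids; nothing validates tid here */
	snprintf(buf, len, "/proc/%d/%s", tid, file);
	return buf;
}
```

With tid == -1 this yields "/proc/-1/status", matching the warning in
Jeff's -v output.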
* Re: perf on 2.6.38-rc4 wedges my box

From: Arnaldo Carvalho de Melo
Date: 2011-02-09 22:11 UTC
To: David Ahern
Cc: Jeff Moyer, linux-kernel, Peter Zijlstra, Ingo Molnar, Paul Mackerras

On Wed, Feb 09, 2011 at 01:12:22PM -0700, David Ahern wrote:
> target_{pid,tid} are initialized to -1 in builtin-record.c. I believe
> the tid version is making its way through the event__synthesize_xxx
> code (event__synthesize_thread -> __event__synthesize_thread ->
> event__synthesize_comm and event__synthesize_mmap_events).

Yes, I'm working on a patch; probably just this will fix it:

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 07f8d6d..dd27b9f 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -680,7 +680,7 @@ static int __cmd_record(int argc, const char **argv)
 					       perf_event__synthesize_guest_os);
 
 	if (!system_wide)
-		perf_event__synthesize_thread(target_tid,
+		perf_event__synthesize_thread(evsel_list->threads->map[0],
 					      process_synthesized_event,
 					      session);
 	else

That way it gets the child_pid, or the tid passed via --tid.

But thinking about it more, the correct fix is to pass the thread_map
(evsel_list->threads); that covers both the case of a single thread
(--tid) and a group of threads (--pid, where the thread_map will have
more than one tid). I'll get a patch done later.
* Re: perf on 2.6.38-rc4 wedges my box

From: Peter Zijlstra
Date: 2011-02-10 21:38 UTC
To: Jeff Moyer
Cc: linux-kernel, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo

On Wed, 2011-02-09 at 12:38 -0500, Jeff Moyer wrote:
> I'm trying out willy's ata_ram driver [1], and in so doing have managed
> to wedge my box while using perf record on an aio-stress run:
> [...]
> The number of messages varies, but this is the most I've seen (it
> doesn't keep repeating). At this point, the machine does not respond
> to pings. [...] It's probably worth noting that I've witnessed similar
> behavior with real devices, so it's not just the ata_ram driver.
>
> Any ideas on how to track this down?

So I took Linus' tree of about half an hour ago, added
git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc.git ata-ram
(fixed up some Kconfig/Makefile rejects), googled aio-stress
(http://fsbench.filesystems.org/bench/aio-stress.c), and set out to
reproduce the above.

Sadly, it all seems to work here; it's spending ~15% in
_raw_spin_lock_irq, which, when I use -g, breaks down like:

- 14.13% aio-stress [kernel.kallsyms] [k] _raw_spin_lock_irq
   - _raw_spin_lock_irq
      + 44.14% __make_request
      + 20.91% __aio_get_req
      + 10.15% aio_run_iocb
      +  7.37% do_io_submit
      +  6.55% scsi_request_fn
      +  5.48% generic_unplug_device
      +  3.58% aio_put_req
      +  0.92% generic_make_request
      +  0.91% __generic_unplug_device
* Re: perf on 2.6.38-rc4 wedges my box

From: David Ahern
Date: 2011-02-11 16:35 UTC
To: Peter Zijlstra, Jeff Moyer
Cc: linux-kernel, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo

On 02/10/11 14:38, Peter Zijlstra wrote:
> So I took linus' tree of about half an hour ago [...] and set out to
> reproduce the above..
>
> Sadly it all seems to work here, its spending ~15% in
> _raw_spin_lock_irq, which when I use -g looks to break down like:
> [...]

I'm guessing that in your case perf is using hardware cycles for
profiling.

I was able to reproduce the lockup in a VM, which uses cpu-clock for
profiling - like Jeff's case. The VM is running Fedora 14 with
2.6.38-rc4. In the host, one of qemu-kvm's threads is pegging the CPU.
A backtrace of the vcpus is given below. stop-bt-cont at various
intervals shows similar traces, and xtime is not advancing. The final
function changes from sample to sample (e.g., from __rb_rotate_left to
__rb_rotate_right).

David

[Switching to Thread 1]
(gdb) bt
#0  native_safe_halt () at /opt/kernel/src/arch/x86/include/asm/irqflags.h:50
#1  0xffffffff81009d76 in arch_safe_halt () at /opt/kernel/src/arch/x86/include/asm/paravirt.h:110
#2  default_idle () at /opt/kernel/src/arch/x86/kernel/process.c:381
#3  0xffffffff8100132a in cpu_idle () at /opt/kernel/src/arch/x86/kernel/process_64.c:139
#4  0xffffffff81392c89 in rest_init () at /opt/kernel/src/init/main.c:463
#5  0xffffffff81697c23 in start_kernel () at /opt/kernel/src/init/main.c:713
#6  0xffffffff816972af in x86_64_start_reservations (real_mode_data=<value optimized out>) at /opt/kernel/src/arch/x86/kernel/head64.c:127
#7  0xffffffff816973b9 in x86_64_start_kernel (real_mode_data=0x93950 <Address 0x93950 out of bounds>) at /opt/kernel/src/arch/x86/kernel/head64.c:97
#8  0x0000000000000000 in ?? ()

(gdb) thread 2
[Switching to thread 2 (Thread 2)]
#0  __rb_rotate_left (node=0xffff8800782dd8f0, root=0xffff88007fd0e358) at /opt/kernel/src/lib/rbtree.c:37
(gdb) bt
#0  __rb_rotate_left (node=0xffff8800782dd8f0, root=0xffff88007fd0e358) at /opt/kernel/src/lib/rbtree.c:37
#1  0xffffffff81206bcc in rb_insert_color (node=0xffff8800782dd8f0, root=0xffff88007fd0e358) at /opt/kernel/src/lib/rbtree.c:130
#2  0xffffffff81207cb5 in timerqueue_add (head=0xffff88007fd0e358, node=0xffff8800782dd8f0) at /opt/kernel/src/lib/timerqueue.c:56
#3  0xffffffff8105f595 in enqueue_hrtimer (timer=0xffff8800782dd8f0, base=0xffff88007fd0e348) at /opt/kernel/src/kernel/hrtimer.c:848
#4  0xffffffff8105fc75 in __hrtimer_start_range_ns (timer=0xffff8800782dd8f0, tim=..., delta_ns=0, mode=<value optimized out>, wakeup=0) at /opt/kernel/src/kernel/hrtimer.c:961
#5  0xffffffff810b662e in perf_swevent_start_hrtimer (event=0xffff8800782dd800) at /opt/kernel/src/kernel/perf_event.c:5092
#6  0xffffffff810b669c in cpu_clock_event_start (event=<value optimized out>, flags=<value optimized out>) at /opt/kernel/src/kernel/perf_event.c:5126
#7  0xffffffff810ba515 in perf_ctx_adjust_freq (ctx=0xffff880079930600, period=999848) at /opt/kernel/src/kernel/perf_event.c:1726
#8  0xffffffff810ba690 in perf_rotate_context () at /opt/kernel/src/kernel/perf_event.c:1787
#9  perf_event_task_tick () at /opt/kernel/src/kernel/perf_event.c:1821
#10 0xffffffff8103d8d8 in scheduler_tick () at /opt/kernel/src/kernel/sched.c:3784
#11 0xffffffff8104effe in update_process_times (user_tick=0) at /opt/kernel/src/kernel/timer.c:1274
#12 0xffffffff81069587 in tick_sched_timer (timer=0xffff88007fd0e3f0) at /opt/kernel/src/kernel/time/tick-sched.c:760
#13 0xffffffff8105f75d in __run_hrtimer (timer=0xffff88007fd0e3f0, now=0xffff88007fd03f48) at /opt/kernel/src/kernel/hrtimer.c:1197
#14 0xffffffff8105ff5e in hrtimer_interrupt (dev=<value optimized out>) at /opt/kernel/src/kernel/hrtimer.c:1283
#15 0xffffffff813acb4e in local_apic_timer_interrupt (regs=<value optimized out>) at /opt/kernel/src/arch/x86/kernel/apic/apic.c:844
#16 smp_apic_timer_interrupt (regs=<value optimized out>) at /opt/kernel/src/arch/x86/kernel/apic/apic.c:871
#17 <signal handler called>
* Re: perf on 2.6.38-rc4 wedges my box

From: Peter Zijlstra
Date: 2011-02-11 17:53 UTC
To: David Ahern
Cc: Jeff Moyer, linux-kernel, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo

On Fri, 2011-02-11 at 09:35 -0700, David Ahern wrote:
> I'm guessing in your case perf is using hardware cycles for profiling.
>
> I was able to reproduce the lockup in a VM which uses cpu-clock for
> profiling - like Jeff's case. The VM is running Fedora 14 with
> 2.6.38-rc4.

Ah, indeed; when I use:

perf record -gfe task-clock -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sdb

things did come apart. Something like the below cured that problem (but
it did show the pending-softirq thing and triggered something iffy in
the backtrace code -- I will have to stare at those still).

---
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index a353a4d..36fb410 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -5123,6 +5123,10 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
 	u64 period;
 
 	event = container_of(hrtimer, struct perf_event, hw.hrtimer);
+
+	if (event->state < PERF_EVENT_STATE_ACTIVE)
+		return HRTIMER_NORESTART;
+
 	event->pmu->read(event);
 
 	perf_sample_data_init(&data, 0);
@@ -5174,7 +5178,7 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
 		ktime_t remaining = hrtimer_get_remaining(&hwc->hrtimer);
 		local64_set(&hwc->period_left, ktime_to_ns(remaining));
 
-		hrtimer_cancel(&hwc->hrtimer);
+		hrtimer_try_to_cancel(&hwc->hrtimer);
 	}
 }
* Re: perf on 2.6.38-rc4 wedges my box

From: Peter Zijlstra
Date: 2011-02-11 18:23 UTC
To: David Ahern
Cc: Jeff Moyer, linux-kernel, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo

On Fri, 2011-02-11 at 18:53 +0100, Peter Zijlstra wrote:
> Ah, indeed, when I use:
>
> perf record -gfe task-clock -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sdb
>
> things did come apart, something like the below cured that problem (but
> did show the pending softirq thing and triggered something iffy in the
> backtrace code -- will have to stare at those still)

So while this doesn't explain those weird things, it should have at
least one race less -- hrtimer_init() on a possibly still-running timer
didn't seem like a very good idea. Also, since hrtimers tick in
nanoseconds, the whole adaptive-frequency thing seemed unnecessary.

---
 kernel/perf_event.c |   37 +++++++++++++++++++++++++++++++++----
 1 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 999835b..08428b3 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -5051,6 +5051,10 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
 	u64 period;
 
 	event = container_of(hrtimer, struct perf_event, hw.hrtimer);
+
+	if (event->state < PERF_EVENT_STATE_ACTIVE)
+		return HRTIMER_NORESTART;
+
 	event->pmu->read(event);
 
 	perf_sample_data_init(&data, 0);
@@ -5077,9 +5081,6 @@ static void perf_swevent_start_hrtimer(struct perf_event *event)
 	if (!is_sampling_event(event))
 		return;
 
-	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
-	hwc->hrtimer.function = perf_swevent_hrtimer;
-
 	period = local64_read(&hwc->period_left);
 	if (period) {
 		if (period < 0)
@@ -5102,7 +5103,31 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
 		ktime_t remaining = hrtimer_get_remaining(&hwc->hrtimer);
 		local64_set(&hwc->period_left, ktime_to_ns(remaining));
 
-		hrtimer_cancel(&hwc->hrtimer);
+		hrtimer_try_to_cancel(&hwc->hrtimer);
+	}
+}
+
+static void perf_swevent_init_hrtimer(struct perf_event *event)
+{
+	struct hw_perf_event *hwc = &event->hw;
+
+	if (!is_sampling_event(event))
+		return;
+
+	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hwc->hrtimer.function = perf_swevent_hrtimer;
+
+	/*
+	 * Since hrtimers have a fixed rate, we can do a static freq->period
+	 * mapping and avoid the whole period adjust feedback stuff.
+	 */
+	if (event->attr.freq) {
+		long freq = event->attr.sample_freq;
+
+		event->attr.sample_period = NSEC_PER_SEC / freq;
+		hwc->sample_period = event->attr.sample_period;
+		local64_set(&hwc->period_left, hwc->sample_period);
+		event->attr.freq = 0;
+	}
 }
 
@@ -5158,6 +5183,8 @@ static int cpu_clock_event_init(struct perf_event *event)
 	if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK)
 		return -ENOENT;
 
+	perf_swevent_init_hrtimer(event);
+
 	return 0;
 }
 
@@ -5235,6 +5262,8 @@ static int task_clock_event_init(struct perf_event *event)
 	if (event->attr.config != PERF_COUNT_SW_TASK_CLOCK)
 		return -ENOENT;
 
+	perf_swevent_init_hrtimer(event);
+
 	return 0;
 }
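The static freq->period mapping in the patch is plain integer division of
a second's worth of nanoseconds by the requested sampling frequency. A
standalone sketch of the arithmetic (mirroring, not taken from, the
patched perf_swevent_init_hrtimer()):

```c
#include <assert.h>

#define NSEC_PER_SEC 1000000000ULL

/* An hrtimer ticks at a fixed nanosecond rate, so a sampling frequency
 * in Hz maps to a fixed period up front -- no adaptive period-adjust
 * feedback loop is needed, which is the point of the patch. */
static unsigned long long freq_to_period_ns(long freq_hz)
{
	return NSEC_PER_SEC / freq_hz;
}
```

For example, a 1000 Hz sampling frequency becomes a fixed 1,000,000 ns
(1 ms) hrtimer period.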
* Re: perf on 2.6.38-rc4 wedges my box

From: David Ahern
Date: 2011-02-11 18:47 UTC
To: Peter Zijlstra
Cc: Jeff Moyer, linux-kernel, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo

On 02/11/11 11:23, Peter Zijlstra wrote:
> So while this doesn't explain these weird things, it should have at
> least one race less -- hrtimer_init() on a possible still running timer
> didn't seem like a very good idea, also since hrtimers are nsec the
> whole freq thing seemed unnecessary.

That solved the lockup with my repro on 2.6.38-rc4. I did try 2.6.37
earlier today (without this patch), and it locked up pretty quickly as
well.

David

> [...]
* [tip:perf/core] perf: Optimize hrtimer events

From: tip-bot for Peter Zijlstra
Date: 2011-02-16 13:50 UTC
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  ba3dd36c6775264ee6e7354ba1aabcd6e86d7298
Gitweb:     http://git.kernel.org/tip/ba3dd36c6775264ee6e7354ba1aabcd6e86d7298
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Tue, 15 Feb 2011 12:41:46 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 16 Feb 2011 13:30:57 +0100

perf: Optimize hrtimer events

There is no need to re-initialize the hrtimer every time we start it,
so don't do that (shaves a few cycles). Also, since we know hrtimers
run at a fixed rate (nanoseconds) we can pre-compute the desired
frequency at which they tick. This avoids us having to go through the
whole adaptive frequency feedback logic (shaves another few cycles).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1297448589.5226.47.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/perf_event.c |   35 ++++++++++++++++++++++++++++++++---
 1 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index e03be08..a0a6987 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -5602,6 +5602,10 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
 	u64 period;
 
 	event = container_of(hrtimer, struct perf_event, hw.hrtimer);
+
+	if (event->state != PERF_EVENT_STATE_ACTIVE)
+		return HRTIMER_NORESTART;
+
 	event->pmu->read(event);
 
 	perf_sample_data_init(&data, 0);
@@ -5628,9 +5632,6 @@ static void perf_swevent_start_hrtimer(struct perf_event *event)
 	if (!is_sampling_event(event))
 		return;
 
-	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
-	hwc->hrtimer.function = perf_swevent_hrtimer;
-
 	period = local64_read(&hwc->period_left);
 	if (period) {
 		if (period < 0)
@@ -5657,6 +5658,30 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
 	}
 }
 
+static void perf_swevent_init_hrtimer(struct perf_event *event)
+{
+	struct hw_perf_event *hwc = &event->hw;
+
+	if (!is_sampling_event(event))
+		return;
+
+	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hwc->hrtimer.function = perf_swevent_hrtimer;
+
+	/*
+	 * Since hrtimers have a fixed rate, we can do a static freq->period
+	 * mapping and avoid the whole period adjust feedback stuff.
+	 */
+	if (event->attr.freq) {
+		long freq = event->attr.sample_freq;
+
+		event->attr.sample_period = NSEC_PER_SEC / freq;
+		hwc->sample_period = event->attr.sample_period;
+		local64_set(&hwc->period_left, hwc->sample_period);
+		event->attr.freq = 0;
+	}
+}
+
 /*
  * Software event: cpu wall time clock
  */
@@ -5709,6 +5734,8 @@ static int cpu_clock_event_init(struct perf_event *event)
 	if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK)
 		return -ENOENT;
 
+	perf_swevent_init_hrtimer(event);
+
 	return 0;
 }
 
@@ -5787,6 +5814,8 @@ static int task_clock_event_init(struct perf_event *event)
 	if (event->attr.config != PERF_COUNT_SW_TASK_CLOCK)
 		return -ENOENT;
 
+	perf_swevent_init_hrtimer(event);
+
 	return 0;
 }