From: Ingo Molnar <mingo@elte.hu>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>,
Andrew Morton <akpm@linux-foundation.org>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
linux-arch@vger.kernel.org,
Martin Schwidefsky <schwidefsky@de.ibm.com>,
Arnd Bergmann <arnd@arndb.de>,
Horst Hartmann <horsth@linux.vnet.ibm.com>,
Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Subject: Re: [patch 0/3] Allow inlined spinlocks again V3
Date: Thu, 13 Aug 2009 20:34:48 +0200
Message-ID: <20090813183448.GA8125@elte.hu>
In-Reply-To: <alpine.LFD.2.01.0908131100000.28882@localhost.localdomain>
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Now, there is one advantage to inlining that I really like - and
> that I personally have missed from the days when we used to
> always do it: profiling. Inlining the spinlocks tends to do
> _wonders_ for showing exactly which spinlock ends up being the
> contention cases for certain loads. [...]
We've got a new tool to extract that kind of information though:
perf + call-graph profiling.
Here is how it works and how it can be used. Say we want to profile
'bw_pipe' from lmbench. The traditional way is to record a flat
profile:
aldebaran:~/l> perf record -f ./bw_pipe
Pipe bandwidth: 4607.33 MB/sec
[ perf record: Captured and wrote 2.503 MB perf.data (~109368 samples) ]
and then to view the flat profile:
aldebaran:~/l> perf report
# Samples: 109331
#
# Overhead Command Shared Object Symbol
# ........ ....... ........................ ......
#
39.61% bw_pipe [kernel] [k] copy_user_generic_string
19.32% bw_pipe ./bw_pipe [.] enough_duration
6.21% bw_pipe [kernel] [k] mutex_spin_on_owner
3.54% bw_pipe ./bw_pipe [.] two_op
3.54% bw_pipe ./bw_pipe [.] one_op
2.79% bw_pipe [kernel] [k] get_page_from_freelist
2.16% bw_pipe [kernel] [k] ia32_sysenter_target
1.91% bw_pipe [kernel] [k] pipe_read
1.83% bw_pipe [kernel] [k] pipe_write
1.63% bw_pipe [kernel] [k] free_hot_cold_page
1.21% bw_pipe [kernel] [k] put_page
1.19% bw_pipe [kernel] [k] __alloc_pages_nodemask
0.99% bw_pipe [kernel] [k] anon_pipe_buf_release
0.74% bw_pipe [kernel] [k] get_pageblock_flags_group
Annoyingly enough, mutex_spin_on_owner() shows up in the profile, but
we have no idea what its parent function(s) are. (Plus there's no way
we could reasonably inline the mutex functions.) So we are stuck
guessing which of the many other functions above (or some other
function that does not even get listed!) carries that mutex-spinning
overhead.
But if you do the recording via call-graph traces:
aldebaran:~/l> perf record -g -f ./bw_pipe
Pipe bandwidth: 4432.75 MB/sec
[ perf record: Captured and wrote 7.550 MB perf.data (~329886 samples) ]
(Note the -g call-graph option.) If we then view the profile via
perf report, we get:
aldebaran:~/l> perf report -S _spin_lock
# symbol: _spin_lock
# Samples: 724
#
# Overhead Command Shared Object
# ........ ....... ........................
#
98.34% lat_tcp [kernel]
|
|--31.74%-- ipt_do_table
| |
| |--26.99%-- ipt_local_in_hook
| | nf_iterate
| | nf_hook_slow
| | ip_local_deliver
| | ip_rcv_finish
| | ip_rcv
| | netif_receive_skb
| | process_backlog
| | net_rx_action
| | __do_softirq
| | call_softirq
| | do_softirq
| | _local_bh_enable_ip
| | local_bh_enable
| | dev_queue_xmit
| | ip_finish_output2
| | ip_finish_output
| | ip_output
| | ip_local_out
| | ip_queue_xmit
| | tcp_transmit_skb
| | tcp_write_xmit
| | __tcp_push_pending_frames
| | tcp_push
| | tcp_sendmsg
| | __sock_sendmsg
| | sock_aio_write
| | do_sync_write
| | vfs_write
| | sys_write
| | sysenter_dispatch
| | 0xf7f50430
| | 0x5
| |
| |--24.34%-- ipt_local_in_hook
| | nf_iterate
| | nf_hook_slow
| | ip_local_deliver
| | ip_rcv_finish
| | ip_rcv
| | netif_receive_skb
| | process_backlog
| | net_rx_action
| | __do_softirq
| | call_softirq
| | do_softirq
| | _local_bh_enable_ip
| | local_bh_enable
| | dev_queue_xmit
| | ip_finish_output2
| | ip_finish_output
| | ip_output
| | ip_local_out
| | ip_queue_xmit
| | tcp_transmit_skb
| | tcp_write_xmit
| | __tcp_push_pending_frames
| | tcp_push
| | tcp_sendmsg
| | __sock_sendmsg
| | sock_aio_write
| | do_sync_write
| | vfs_write
| | sys_write
| | sysenter_dispatch
| | 0xf7f50430
| | 0x5
| |
| |--16.81%-- ipt_pre_routing_hook
| | nf_iterate
| | nf_hook_slow
| | ip_rcv
| | netif_receive_skb
| | process_backlog
| | net_rx_action
| | __do_softirq
| | call_softirq
| | do_softirq
| | _local_bh_enable_ip
| | local_bh_enable
| | dev_queue_xmit
| | ip_finish_output2
| | ip_finish_output
| | ip_output
| | ip_local_out
| | ip_queue_xmit
| | tcp_transmit_skb
| | tcp_write_xmit
| | __tcp_push_pending_frames
| | tcp_push
| | tcp_sendmsg
| | __sock_sendmsg
| | sock_aio_write
| | do_sync_write
| | vfs_write
| | sys_write
| | sysenter_dispatch
| | 0xf7f50430
| | 0x5
| |
| |--11.95%-- ipt_post_routing_hook
| | nf_iterate
| | nf_hook_slow
| | ip_output
| | ip_local_out
| | ip_queue_xmit
| | tcp_transmit_skb
| | tcp_write_xmit
| | __tcp_push_pending_frames
| | tcp_push
| | tcp_sendmsg
| | __sock_sendmsg
| | sock_aio_write
| | do_sync_write
| | vfs_write
| | sys_write
| | sysenter_dispatch
| | 0xf7f50430
| | 0x5
| |
| |--10.18%-- ipt_local_out_hook
| | nf_iterate
| | nf_hook_slow
| | __ip_local_out
| | ip_local_out
| | ip_queue_xmit
| | tcp_transmit_skb
| | tcp_write_xmit
| | __tcp_push_pending_frames
| | tcp_push
| | tcp_sendmsg
| | __sock_sendmsg
| | sock_aio_write
| | do_sync_write
| | vfs_write
| | sys_write
| | sysenter_dispatch
| | 0xf7f50430
| | 0x5
| |
| --9.73%-- ipt_local_hook
| nf_iterate
| nf_hook_slow
| __ip_local_out
| ip_local_out
| ip_queue_xmit
| tcp_transmit_skb
| tcp_write_xmit
| __tcp_push_pending_frames
| tcp_push
| tcp_sendmsg
| __sock_sendmsg
| sock_aio_write
| do_sync_write
| vfs_write
| sys_write
| sysenter_dispatch
| 0xf7f50430
| 0x5
|
|--31.46%-- __mod_timer
| mod_timer
| sk_reset_timer
| |
| |--55.36%-- tcp_send_delayed_ack
| | __tcp_ack_snd_check
| | tcp_rcv_established
| | tcp_v4_do_rcv
| | tcp_prequeue_process
| | tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_aio_read
| | do_sync_read
| | vfs_read
| | sys_read
| | sysenter_dispatch
| | 0xf7f50430
| | 0x5
| |
| --44.64%-- tcp_v4_rcv
| ip_local_deliver_finish
| ip_local_deliver
| ip_rcv_finish
| ip_rcv
| netif_receive_skb
| process_backlog
| net_rx_action
| __do_softirq
| call_softirq
| do_softirq
| _local_bh_enable_ip
| local_bh_enable
| dev_queue_xmit
| ip_finish_output2
| ip_finish_output
| ip_output
| ip_local_out
| ip_queue_xmit
| tcp_transmit_skb
| tcp_write_xmit
| __tcp_push_pending_frames
| tcp_push
| tcp_sendmsg
| __sock_sendmsg
| sock_aio_write
| do_sync_write
| vfs_write
| sys_write
| sysenter_dispatch
| 0xf7f50430
| 0x5
|
|--21.21%-- task_rq_lock
| try_to_wake_up
| default_wake_function
| |
| |--99.34%-- autoremove_wake_function
| | __wake_up_common
| | __wake_up
| | tcp_v4_rcv
| | ip_local_deliver_finish
| | ip_local_deliver
| | ip_rcv_finish
| | ip_rcv
| | netif_receive_skb
| | process_backlog
| | net_rx_action
| | __do_softirq
| | call_softirq
| | do_softirq
| | _local_bh_enable_ip
| | local_bh_enable
| | dev_queue_xmit
| | ip_finish_output2
| | ip_finish_output
| | ip_output
| | ip_local_out
| | ip_queue_xmit
| | tcp_transmit_skb
| | tcp_write_xmit
| | __tcp_push_pending_frames
| | tcp_push
| | tcp_sendmsg
| | __sock_sendmsg
| | sock_aio_write
| | do_sync_write
| | vfs_write
| | sys_write
| | sysenter_dispatch
| | 0xf7f50430
| | 0x5
| |
| --0.66%-- pollwake
| __wake_up_common
| __wake_up
| perf_counter_wakeup
| perf_pending_counter
| perf_counter_do_pending
| smp_perf_pending_interrupt
| perf_pending_interrupt
| 0xf7f50430
| 0x5
|
|--5.06%-- tcp_v4_rcv
| ip_local_deliver_finish
| ip_local_deliver
| ip_rcv_finish
| ip_rcv
| netif_receive_skb
| process_backlog
| net_rx_action
| __do_softirq
| call_softirq
| do_softirq
| _local_bh_enable_ip
| local_bh_enable
| dev_queue_xmit
| ip_finish_output2
| ip_finish_output
| ip_output
| ip_local_out
| ip_queue_xmit
| tcp_transmit_skb
| tcp_write_xmit
| __tcp_push_pending_frames
| tcp_push
| tcp_sendmsg
| __sock_sendmsg
| sock_aio_write
| do_sync_write
| vfs_write
| sys_write
| sysenter_dispatch
| 0xf7f50430
| 0x5
|
|--4.07%-- __perf_counter_sched_out
| perf_counter_task_sched_out
| schedule
| schedule_timeout
| sk_wait_data
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_aio_read
| do_sync_read
| vfs_read
| sys_read
| sysenter_dispatch
| 0xf7f50430
| 0x5
|
|--1.40%-- ipt_local_hook
| nf_iterate
| nf_hook_slow
| __ip_local_out
| ip_local_out
| ip_queue_xmit
| tcp_transmit_skb
| tcp_write_xmit
| __tcp_push_pending_frames
| tcp_push
| tcp_sendmsg
| __sock_sendmsg
| sock_aio_write
| do_sync_write
| vfs_write
| sys_write
| sysenter_dispatch
| 0xf7f50430
| 0x5
|
|--1.26%-- mod_timer
| sk_reset_timer
| |
| |--66.67%-- tcp_send_delayed_ack
| | __tcp_ack_snd_check
| | tcp_rcv_established
| | tcp_v4_do_rcv
| | tcp_prequeue_process
| | tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_aio_read
| | do_sync_read
| | vfs_read
| | sys_read
| | sysenter_dispatch
| | 0xf7f50430
| | 0x5
| |
| --33.33%-- tcp_v4_rcv
| ip_local_deliver_finish
| ip_local_deliver
| ip_rcv_finish
| ip_rcv
| netif_receive_skb
| process_backlog
| net_rx_action
| __do_softirq
| call_softirq
| do_softirq
| _local_bh_enable_ip
| local_bh_enable
| dev_queue_xmit
| ip_finish_output2
| ip_finish_output
| ip_output
| ip_local_out
| ip_queue_xmit
| tcp_transmit_skb
| tcp_write_xmit
| __tcp_push_pending_frames
| tcp_push
| tcp_sendmsg
| __sock_sendmsg
| sock_aio_write
| do_sync_write
| vfs_write
| sys_write
| sysenter_dispatch
| 0xf7f50430
| 0x5
|
|--0.70%-- ipt_local_in_hook
| nf_iterate
| nf_hook_slow
| ip_local_deliver
| ip_rcv_finish
| ip_rcv
| netif_receive_skb
| process_backlog
| net_rx_action
| __do_softirq
| call_softirq
| do_softirq
| _local_bh_enable_ip
| local_bh_enable
| dev_queue_xmit
| ip_finish_output2
| ip_finish_output
| ip_output
| ip_local_out
| ip_queue_xmit
| tcp_transmit_skb
| tcp_write_xmit
| __tcp_push_pending_frames
| tcp_push
| tcp_sendmsg
| __sock_sendmsg
| sock_aio_write
| do_sync_write
| vfs_write
| sys_write
| sysenter_dispatch
| 0xf7f50430
| 0x5
|
|--0.56%-- perf_counter_task_sched_out
| schedule
| schedule_timeout
| sk_wait_data
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_aio_read
| do_sync_read
| vfs_read
| sys_read
| sysenter_dispatch
| 0xf7f50430
| 0x5
--2.53%-- [...]
1.66% perf [kernel]
|
|--25.00%-- journal_stop
| __ext3_journal_stop
| ext3_writeback_write_end
| generic_perform_write
| generic_file_buffered_write
| __generic_file_aio_write_nolock
| generic_file_aio_write
| ext3_file_write
| do_sync_write
| vfs_write
| sys_write
| system_call_fastpath
| __write_nocancel
|
|--16.67%-- __perf_counter_sched_out
| perf_counter_task_sched_out
| schedule
| schedule_hrtimeout_range
| poll_schedule_timeout
| do_poll
| do_sys_poll
| sys_poll
| system_call_fastpath
| __GI___poll
|
|--8.33%-- dup_mmap
| dup_mm
| copy_process
| do_fork
| sys_clone
| stub_clone
| fork
| __cmd_record
|
|--8.33%-- hrtimer_interrupt
| smp_apic_timer_interrupt
| apic_timer_interrupt
| __write_nocancel
|
|--8.33%-- journal_start
| ext3_journal_start_sb
| ext3_write_begin
| generic_perform_write
| generic_file_buffered_write
| __generic_file_aio_write_nolock
| generic_file_aio_write
| ext3_file_write
| do_sync_write
| vfs_write
| sys_write
| system_call_fastpath
| __write_nocancel
|
|--8.33%-- create_empty_buffers
| __block_prepare_write
| block_write_begin
| ext3_write_begin
| generic_perform_write
| generic_file_buffered_write
| __generic_file_aio_write_nolock
| generic_file_aio_write
| ext3_file_write
| do_sync_write
| vfs_write
| sys_write
| system_call_fastpath
| __write_nocancel
|
|--8.33%-- perf_counter_task_sched_out
| schedule
| schedule_hrtimeout_range
| poll_schedule_timeout
| do_poll
| do_sys_poll
| sys_poll
| system_call_fastpath
| __GI___poll
|
|--8.33%-- ext3_alloc_branch
| ext3_get_blocks_handle
| ext3_get_block
| __block_prepare_write
| block_write_begin
| ext3_write_begin
| generic_perform_write
| generic_file_buffered_write
| __generic_file_aio_write_nolock
| generic_file_aio_write
| ext3_file_write
| do_sync_write
| vfs_write
| sys_write
| system_call_fastpath
| __write_nocancel
|
--8.33%-- start_this_handle
journal_start
ext3_journal_start_sb
ext3_write_begin
generic_perform_write
generic_file_buffered_write
__generic_file_aio_write_nolock
generic_file_aio_write
ext3_file_write
do_sync_write
vfs_write
sys_write
system_call_fastpath
__write_nocancel
The graph above (which is the sub-graph of the full graph, limited
via the -S/--symbols option to the callchains that intersect the
_spin_lock symbol) details all the various code paths that make use
of _spin_lock(), and the absolute (and relative) weights of those
code paths.
Note that this works for just about any locking primitive, and note
how it's recursive. If we inlined spin_lock() we would only know
about this __mod_timer overhead:
|--31.46%-- __mod_timer
but we wouldn't know the second level (which is interesting too):
| |--55.36%-- tcp_send_delayed_ack
| --44.64%-- tcp_v4_rcv
All in all, I think we should detach the issue of instrumentation
and profiling from the issue of whether to inline or not. We should
inline where it makes things faster; our profiling tools can deal
with it either way.
Ingo