From: Ingo Molnar <mingo@elte.hu>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>,
Andrew Morton <akpm@linux-foundation.org>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
linux-arch@vger.kernel.org,
Martin Schwidefsky <schwidefsky@de.ibm.com>,
Arnd Bergmann <arnd@arndb.de>,
Horst Hartmann <horsth@linux.vnet.ibm.com>,
Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Subject: Re: [patch 0/3] Allow inlined spinlocks again V3
Date: Thu, 13 Aug 2009 20:43:01 +0200
Message-ID: <20090813184301.GA15187@elte.hu>
In-Reply-To: <20090813183448.GA8125@elte.hu>
* Ingo Molnar <mingo@elte.hu> wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > Now, there is one advantage to inlining that I really like - and
> > that I personally have missed from the days when we used to
> > always do it: profiling. Inlining the spinlocks tends to do
> > _wonders_ for showing exactly which spinlock ends up being the
> > contention cases for certain loads. [...]
>
> We've got a new tool to determine that kind of information though:
> perf + call-graph profiling.
>
> This is how it works and how it can be used. Say we want to profile
> 'bw_pipe' from lmbench. The traditional way is to record a flat
> profile:
>
> aldebaran:~/l> perf record -f ./bw_pipe
> Pipe bandwidth: 4607.33 MB/sec
> [ perf record: Captured and wrote 2.503 MB perf.data (~109368 samples) ]
>
> and then to view the flat profile:
>
> aldebaran:~/l> perf report
> # Samples: 109331
> #
> # Overhead Command Shared Object Symbol
> # ........ ....... ........................ ......
> #
> 39.61% bw_pipe [kernel] [k] copy_user_generic_string
> 19.32% bw_pipe ./bw_pipe [.] enough_duration
> 6.21% bw_pipe [kernel] [k] mutex_spin_on_owner
> 3.54% bw_pipe ./bw_pipe [.] two_op
> 3.54% bw_pipe ./bw_pipe [.] one_op
> 2.79% bw_pipe [kernel] [k] get_page_from_freelist
> 2.16% bw_pipe [kernel] [k] ia32_sysenter_target
> 1.91% bw_pipe [kernel] [k] pipe_read
> 1.83% bw_pipe [kernel] [k] pipe_write
> 1.63% bw_pipe [kernel] [k] free_hot_cold_page
> 1.21% bw_pipe [kernel] [k] put_page
> 1.19% bw_pipe [kernel] [k] __alloc_pages_nodemask
> 0.99% bw_pipe [kernel] [k] anon_pipe_buf_release
> 0.74% bw_pipe [kernel] [k] get_pageblock_flags_group
>
> Annoyingly enough, mutex_spin_on_owner() is in the profile but we
> have no idea what the parent function(s) are. (Plus there's no way
> we could reasonably inline mutex functions.) So we are stuck with
> guessing which of the many other functions above (or some other
> function which does not even get listed!) has that mutex spinning
> overhead.
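>
> (A side note: something like 'perf annotate mutex_spin_on_owner'
> would show us the hot instructions _within_ that function, but
> still nothing about its callers - so annotation does not help
> here either.)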
>
> But if you do the recording via call-graph traces:
>
> aldebaran:~/l> perf record -g -f ./bw_pipe
> Pipe bandwidth: 4432.75 MB/sec
> [ perf record: Captured and wrote 7.550 MB perf.data (~329886 samples) ]
>
> (note the -g call-graph option). Then, when we view the profile via
> perf report, we get:
Sorry - the mail got trimmed due to a cut&paste mishap. Here's the
missing bw_pipe profile:
# Samples: 81007
#
# Overhead Command Shared Object Symbol
# ........ ....... ........................ ......
#
7.60% bw_pipe [kernel] [k] mutex_spin_on_owner
|
|--99.97%-- __mutex_lock_common
| __mutex_lock_slowpath
| mutex_lock
| |
| |--98.46%-- pipe_write
| | do_sync_write
| | vfs_write
| | sys_write
| | sysenter_dispatch
| | 0xf7f4f430
| | 0x600000007
| |
| --1.54%-- pipe_lock
| pipe_wait
| pipe_write
| do_sync_write
| vfs_write
| sys_write
| sysenter_dispatch
| 0xf7f4f430
| 0x600000007
--0.03%-- [...]
It shows that the mutex overhead is distributed between these two
functions:
| |--98.46%-- pipe_write
| --1.54%-- pipe_lock
I also wanted to show a more complex example: how spin-lock overhead
gets distributed in the networking stack, in a "lat_tcp localhost"
workload:
>
> aldebaran:~/l> perf report -S _spin_lock
>
> # symbol: _spin_lock
> # Samples: 724
> #
> # Overhead Command Shared Object
> # ........ ....... ........................
> #
> 98.34% lat_tcp [kernel]
> |
> |--31.74%-- ipt_do_table
> | |
> | |--26.99%-- ipt_local_in_hook
> | | nf_iterate
> | | nf_hook_slow
> | | ip_local_deliver
> | | ip_rcv_finish
> | | ip_rcv
> | | netif_receive_skb
> | | process_backlog
> | | net_rx_action
> | | __do_softirq
> | | call_softirq
> | | do_softirq
> | | _local_bh_enable_ip
> | | local_bh_enable
> | | dev_queue_xmit
> | | ip_finish_output2
> | | ip_finish_output
> | | ip_output
> | | ip_local_out
> | | ip_queue_xmit
> | | tcp_transmit_skb
> | | tcp_write_xmit
> | | __tcp_push_pending_frames
> | | tcp_push
> | | tcp_sendmsg
> | | __sock_sendmsg
> | | sock_aio_write
> | | do_sync_write
> | | vfs_write
> | | sys_write
> | | sysenter_dispatch
> | | 0xf7f50430
> | | 0x5
> | |
> | |--24.34%-- ipt_local_in_hook
> | | nf_iterate
> | | nf_hook_slow
> | | ip_local_deliver
> | | ip_rcv_finish
> | | ip_rcv
> | | netif_receive_skb
> | | process_backlog
> | | net_rx_action
> | | __do_softirq
> | | call_softirq
> | | do_softirq
> | | _local_bh_enable_ip
> | | local_bh_enable
> | | dev_queue_xmit
> | | ip_finish_output2
> | | ip_finish_output
> | | ip_output
> | | ip_local_out
> | | ip_queue_xmit
> | | tcp_transmit_skb
> | | tcp_write_xmit
> | | __tcp_push_pending_frames
> | | tcp_push
> | | tcp_sendmsg
> | | __sock_sendmsg
> | | sock_aio_write
> | | do_sync_write
> | | vfs_write
> | | sys_write
> | | sysenter_dispatch
> | | 0xf7f50430
> | | 0x5
> | |
> | |--16.81%-- ipt_pre_routing_hook
> | | nf_iterate
> | | nf_hook_slow
> | | ip_rcv
> | | netif_receive_skb
> | | process_backlog
> | | net_rx_action
> | | __do_softirq
> | | call_softirq
> | | do_softirq
> | | _local_bh_enable_ip
> | | local_bh_enable
> | | dev_queue_xmit
> | | ip_finish_output2
> | | ip_finish_output
> | | ip_output
> | | ip_local_out
> | | ip_queue_xmit
> | | tcp_transmit_skb
> | | tcp_write_xmit
> | | __tcp_push_pending_frames
> | | tcp_push
> | | tcp_sendmsg
> | | __sock_sendmsg
> | | sock_aio_write
> | | do_sync_write
> | | vfs_write
> | | sys_write
> | | sysenter_dispatch
> | | 0xf7f50430
> | | 0x5
> | |
> | |--11.95%-- ipt_post_routing_hook
> | | nf_iterate
> | | nf_hook_slow
> | | ip_output
> | | ip_local_out
> | | ip_queue_xmit
> | | tcp_transmit_skb
> | | tcp_write_xmit
> | | __tcp_push_pending_frames
> | | tcp_push
> | | tcp_sendmsg
> | | __sock_sendmsg
> | | sock_aio_write
> | | do_sync_write
> | | vfs_write
> | | sys_write
> | | sysenter_dispatch
> | | 0xf7f50430
> | | 0x5
> | |
> | |--10.18%-- ipt_local_out_hook
> | | nf_iterate
> | | nf_hook_slow
> | | __ip_local_out
> | | ip_local_out
> | | ip_queue_xmit
> | | tcp_transmit_skb
> | | tcp_write_xmit
> | | __tcp_push_pending_frames
> | | tcp_push
> | | tcp_sendmsg
> | | __sock_sendmsg
> | | sock_aio_write
> | | do_sync_write
> | | vfs_write
> | | sys_write
> | | sysenter_dispatch
> | | 0xf7f50430
> | | 0x5
> | |
> | --9.73%-- ipt_local_hook
> | nf_iterate
> | nf_hook_slow
> | __ip_local_out
> | ip_local_out
> | ip_queue_xmit
> | tcp_transmit_skb
> | tcp_write_xmit
> | __tcp_push_pending_frames
> | tcp_push
> | tcp_sendmsg
> | __sock_sendmsg
> | sock_aio_write
> | do_sync_write
> | vfs_write
> | sys_write
> | sysenter_dispatch
> | 0xf7f50430
> | 0x5
> |
> |--31.46%-- __mod_timer
> | mod_timer
> | sk_reset_timer
> | |
> | |--55.36%-- tcp_send_delayed_ack
> | | __tcp_ack_snd_check
> | | tcp_rcv_established
> | | tcp_v4_do_rcv
> | | tcp_prequeue_process
> | | tcp_recvmsg
> | | sock_common_recvmsg
> | | __sock_recvmsg
> | | sock_aio_read
> | | do_sync_read
> | | vfs_read
> | | sys_read
> | | sysenter_dispatch
> | | 0xf7f50430
> | | 0x5
> | |
> | --44.64%-- tcp_v4_rcv
> | ip_local_deliver_finish
> | ip_local_deliver
> | ip_rcv_finish
> | ip_rcv
> | netif_receive_skb
> | process_backlog
> | net_rx_action
> | __do_softirq
> | call_softirq
> | do_softirq
> | _local_bh_enable_ip
> | local_bh_enable
> | dev_queue_xmit
> | ip_finish_output2
> | ip_finish_output
> | ip_output
> | ip_local_out
> | ip_queue_xmit
> | tcp_transmit_skb
> | tcp_write_xmit
> | __tcp_push_pending_frames
> | tcp_push
> | tcp_sendmsg
> | __sock_sendmsg
> | sock_aio_write
> | do_sync_write
> | vfs_write
> | sys_write
> | sysenter_dispatch
> | 0xf7f50430
> | 0x5
> |
> |--21.21%-- task_rq_lock
> | try_to_wake_up
> | default_wake_function
> | |
> | |--99.34%-- autoremove_wake_function
> | | __wake_up_common
> | | __wake_up
> | | tcp_v4_rcv
> | | ip_local_deliver_finish
> | | ip_local_deliver
> | | ip_rcv_finish
> | | ip_rcv
> | | netif_receive_skb
> | | process_backlog
> | | net_rx_action
> | | __do_softirq
> | | call_softirq
> | | do_softirq
> | | _local_bh_enable_ip
> | | local_bh_enable
> | | dev_queue_xmit
> | | ip_finish_output2
> | | ip_finish_output
> | | ip_output
> | | ip_local_out
> | | ip_queue_xmit
> | | tcp_transmit_skb
> | | tcp_write_xmit
> | | __tcp_push_pending_frames
> | | tcp_push
> | | tcp_sendmsg
> | | __sock_sendmsg
> | | sock_aio_write
> | | do_sync_write
> | | vfs_write
> | | sys_write
> | | sysenter_dispatch
> | | 0xf7f50430
> | | 0x5
> | |
> | --0.66%-- pollwake
> | __wake_up_common
> | __wake_up
> | perf_counter_wakeup
> | perf_pending_counter
> | perf_counter_do_pending
> | smp_perf_pending_interrupt
> | perf_pending_interrupt
> | 0xf7f50430
> | 0x5
> |
> |--5.06%-- tcp_v4_rcv
> | ip_local_deliver_finish
> | ip_local_deliver
> | ip_rcv_finish
> | ip_rcv
> | netif_receive_skb
> | process_backlog
> | net_rx_action
> | __do_softirq
> | call_softirq
> | do_softirq
> | _local_bh_enable_ip
> | local_bh_enable
> | dev_queue_xmit
> | ip_finish_output2
> | ip_finish_output
> | ip_output
> | ip_local_out
> | ip_queue_xmit
> | tcp_transmit_skb
> | tcp_write_xmit
> | __tcp_push_pending_frames
> | tcp_push
> | tcp_sendmsg
> | __sock_sendmsg
> | sock_aio_write
> | do_sync_write
> | vfs_write
> | sys_write
> | sysenter_dispatch
> | 0xf7f50430
> | 0x5
> |
> |--4.07%-- __perf_counter_sched_out
> | perf_counter_task_sched_out
> | schedule
> | schedule_timeout
> | sk_wait_data
> | tcp_recvmsg
> | sock_common_recvmsg
> | __sock_recvmsg
> | sock_aio_read
> | do_sync_read
> | vfs_read
> | sys_read
> | sysenter_dispatch
> | 0xf7f50430
> | 0x5
> |
> |--1.40%-- ipt_local_hook
> | nf_iterate
> | nf_hook_slow
> | __ip_local_out
> | ip_local_out
> | ip_queue_xmit
> | tcp_transmit_skb
> | tcp_write_xmit
> | __tcp_push_pending_frames
> | tcp_push
> | tcp_sendmsg
> | __sock_sendmsg
> | sock_aio_write
> | do_sync_write
> | vfs_write
> | sys_write
> | sysenter_dispatch
> | 0xf7f50430
> | 0x5
> |
> |--1.26%-- mod_timer
> | sk_reset_timer
> | |
> | |--66.67%-- tcp_send_delayed_ack
> | | __tcp_ack_snd_check
> | | tcp_rcv_established
> | | tcp_v4_do_rcv
> | | tcp_prequeue_process
> | | tcp_recvmsg
> | | sock_common_recvmsg
> | | __sock_recvmsg
> | | sock_aio_read
> | | do_sync_read
> | | vfs_read
> | | sys_read
> | | sysenter_dispatch
> | | 0xf7f50430
> | | 0x5
> | |
> | --33.33%-- tcp_v4_rcv
> | ip_local_deliver_finish
> | ip_local_deliver
> | ip_rcv_finish
> | ip_rcv
> | netif_receive_skb
> | process_backlog
> | net_rx_action
> | __do_softirq
> | call_softirq
> | do_softirq
> | _local_bh_enable_ip
> | local_bh_enable
> | dev_queue_xmit
> | ip_finish_output2
> | ip_finish_output
> | ip_output
> | ip_local_out
> | ip_queue_xmit
> | tcp_transmit_skb
> | tcp_write_xmit
> | __tcp_push_pending_frames
> | tcp_push
> | tcp_sendmsg
> | __sock_sendmsg
> | sock_aio_write
> | do_sync_write
> | vfs_write
> | sys_write
> | sysenter_dispatch
> | 0xf7f50430
> | 0x5
> |
> |--0.70%-- ipt_local_in_hook
> | nf_iterate
> | nf_hook_slow
> | ip_local_deliver
> | ip_rcv_finish
> | ip_rcv
> | netif_receive_skb
> | process_backlog
> | net_rx_action
> | __do_softirq
> | call_softirq
> | do_softirq
> | _local_bh_enable_ip
> | local_bh_enable
> | dev_queue_xmit
> | ip_finish_output2
> | ip_finish_output
> | ip_output
> | ip_local_out
> | ip_queue_xmit
> | tcp_transmit_skb
> | tcp_write_xmit
> | __tcp_push_pending_frames
> | tcp_push
> | tcp_sendmsg
> | __sock_sendmsg
> | sock_aio_write
> | do_sync_write
> | vfs_write
> | sys_write
> | sysenter_dispatch
> | 0xf7f50430
> | 0x5
> |
> |--0.56%-- perf_counter_task_sched_out
> | schedule
> | schedule_timeout
> | sk_wait_data
> | tcp_recvmsg
> | sock_common_recvmsg
> | __sock_recvmsg
> | sock_aio_read
> | do_sync_read
> | vfs_read
> | sys_read
> | sysenter_dispatch
> | 0xf7f50430
> | 0x5
> --2.53%-- [...]
>
> 1.66% perf [kernel]
> |
> |--25.00%-- journal_stop
> | __ext3_journal_stop
> | ext3_writeback_write_end
> | generic_perform_write
> | generic_file_buffered_write
> | __generic_file_aio_write_nolock
> | generic_file_aio_write
> | ext3_file_write
> | do_sync_write
> | vfs_write
> | sys_write
> | system_call_fastpath
> | __write_nocancel
> |
> |--16.67%-- __perf_counter_sched_out
> | perf_counter_task_sched_out
> | schedule
> | schedule_hrtimeout_range
> | poll_schedule_timeout
> | do_poll
> | do_sys_poll
> | sys_poll
> | system_call_fastpath
> | __GI___poll
> |
> |--8.33%-- dup_mmap
> | dup_mm
> | copy_process
> | do_fork
> | sys_clone
> | stub_clone
> | fork
> | __cmd_record
> |
> |--8.33%-- hrtimer_interrupt
> | smp_apic_timer_interrupt
> | apic_timer_interrupt
> | __write_nocancel
> |
> |--8.33%-- journal_start
> | ext3_journal_start_sb
> | ext3_write_begin
> | generic_perform_write
> | generic_file_buffered_write
> | __generic_file_aio_write_nolock
> | generic_file_aio_write
> | ext3_file_write
> | do_sync_write
> | vfs_write
> | sys_write
> | system_call_fastpath
> | __write_nocancel
> |
> |--8.33%-- create_empty_buffers
> | __block_prepare_write
> | block_write_begin
> | ext3_write_begin
> | generic_perform_write
> | generic_file_buffered_write
> | __generic_file_aio_write_nolock
> | generic_file_aio_write
> | ext3_file_write
> | do_sync_write
> | vfs_write
> | sys_write
> | system_call_fastpath
> | __write_nocancel
> |
> |--8.33%-- perf_counter_task_sched_out
> | schedule
> | schedule_hrtimeout_range
> | poll_schedule_timeout
> | do_poll
> | do_sys_poll
> | sys_poll
> | system_call_fastpath
> | __GI___poll
> |
> |--8.33%-- ext3_alloc_branch
> | ext3_get_blocks_handle
> | ext3_get_block
> | __block_prepare_write
> | block_write_begin
> | ext3_write_begin
> | generic_perform_write
> | generic_file_buffered_write
> | __generic_file_aio_write_nolock
> | generic_file_aio_write
> | ext3_file_write
> | do_sync_write
> | vfs_write
> | sys_write
> | system_call_fastpath
> | __write_nocancel
> |
> --8.33%-- start_this_handle
> journal_start
> ext3_journal_start_sb
> ext3_write_begin
> generic_perform_write
> generic_file_buffered_write
> __generic_file_aio_write_nolock
> generic_file_aio_write
> ext3_file_write
> do_sync_write
> vfs_write
> sys_write
> system_call_fastpath
> __write_nocancel
>
> The graph above (which is a sub-graph of the full graph, limited
> via the -S/--symbols option to the callchains that intersect the
> _spin_lock symbol) details all the various code-paths that make
> use of _spin_lock(), and the absolute (and relative) weights of
> those code-paths.
>
> Note that this works for just about any locking primitive, and note
> how the call-graph output is recursive. If we inlined spin_lock we'd
> only know about this __mod_timer overhead:
>
> |--31.46%-- __mod_timer
>
> but we wouldn't know the second level (which is interesting too):
>
> | |--55.36%-- tcp_send_delayed_ack
> | --44.64%-- tcp_v4_rcv
>
> All in all, I think we should detach the issue of instrumentation
> and profiling from the issue of whether to inline or not. We should
> inline where it makes things faster - our profiling tools can deal
> with it either way.
>
> Ingo