From: Ingo Molnar <mingo@elte.hu>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>,
Andrew Morton <akpm@linux-foundation.org>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
linux-arch@vger.kernel.org,
Martin Schwidefsky <schwidefsky@de.ibm.com>,
Arnd Bergmann <arnd@arndb.de>,
Horst Hartmann <horsth@linux.vnet.ibm.com>,
Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Subject: Re: [patch 0/3] Allow inlined spinlocks again V3
Date: Thu, 13 Aug 2009 20:43:01 +0200 [thread overview]
Message-ID: <20090813184301.GA15187@elte.hu> (raw)
In-Reply-To: <20090813183448.GA8125@elte.hu>
* Ingo Molnar <mingo@elte.hu> wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > Now, there is one advantage to inlining that I really like - and
> > that I personally have missed from the days when we used to
> > always do it: profiling. Inlining the spinlocks tends to do
> > _wonders_ for showing exactly which spinlock ends up being the
> > contention cases for certain loads. [...]
>
> We've got a new tool to determine that kind of information though:
> perf + call-graph profiling.
>
> This is how it works and how it can be used. Say we want to profile
> 'bw_tcp' from lmbench. The traditional way is to record a flat
> profile:
>
> aldebaran:~/l> perf record -f ./bw_pipe
> Pipe bandwidth: 4607.33 MB/sec
> [ perf record: Captured and wrote 2.503 MB perf.data (~109368 samples) ]
>
> and then to view the flat profile:
>
> aldebaran:~/l> perf report
> # Samples: 109331
> #
> # Overhead Command Shared Object Symbol
> # ........ ....... ........................ ......
> #
> 39.61% bw_pipe [kernel] [k] copy_user_generic_string
> 19.32% bw_pipe ./bw_pipe [.] enough_duration
> 6.21% bw_pipe [kernel] [k] mutex_spin_on_owner
> 3.54% bw_pipe ./bw_pipe [.] two_op
> 3.54% bw_pipe ./bw_pipe [.] one_op
> 2.79% bw_pipe [kernel] [k] get_page_from_freelist
> 2.16% bw_pipe [kernel] [k] ia32_sysenter_target
> 1.91% bw_pipe [kernel] [k] pipe_read
> 1.83% bw_pipe [kernel] [k] pipe_write
> 1.63% bw_pipe [kernel] [k] free_hot_cold_page
> 1.21% bw_pipe [kernel] [k] put_page
> 1.19% bw_pipe [kernel] [k] __alloc_pages_nodemask
> 0.99% bw_pipe [kernel] [k] anon_pipe_buf_release
> 0.74% bw_pipe [kernel] [k] get_pageblock_flags_group
>
> Annoyingly enough, mutex_spin_on_owner() is in the profile but we
> have no idea what the parent function(s) are. (Plus there's no way
> we could reasonably inline mutex functions.) So we are stuck with
> guessing which of the many other functions above (or some other
> function which does not even get listed!) has that mutex spinning
> overhead.
>
> But if you do the recording via call-graph traces:
>
> aldebaran:~/l> perf record -g -f ./bw_pipe
> Pipe bandwidth: 4432.75 MB/sec
> [ perf record: Captured and wrote 7.550 MB perf.data (~329886 samples) ]
>
> (note the -g/--graph option) And then view the profile via perf
> report, we get:
sorry - mail got trimmed due to a cut&paste mishap. Here's the
missing bw_pipe profile:
# Samples: 81007
#
# Overhead Command Shared Object Symbol
# ........ ....... ........................ ......
#
7.60% bw_pipe [kernel] [k] mutex_spin_on_owner
|
|--99.97%-- __mutex_lock_common
| __mutex_lock_slowpath
| mutex_lock
| |
| |--98.46%-- pipe_write
| | do_sync_write
| | vfs_write
| | sys_write
| | sysenter_dispatch
| | 0xf7f4f430
| | 0x600000007
| |
| --1.54%-- pipe_lock
| pipe_wait
| pipe_write
| do_sync_write
| vfs_write
| sys_write
| sysenter_dispatch
| 0xf7f4f430
| 0x600000007
--0.03%-- [...]
it shows that the mutex overhead is distributed between these two
functions:
| |--98.46%-- pipe_write
| --1.54%-- pipe_lock
I also wanted to show a more complex example, how spin-lock
overhead gets distributed in the networking stack, in a
"lat_tcp localhost" workload:
>
> aldebaran:~/l> perf report -S _spin_lock
>
> # symbol: _spin_lock
> # Samples: 724
> #
> # Overhead Command Shared Object
> # ........ ....... ........................
> #
> 98.34% lat_tcp [kernel]
> |
> |--31.74%-- ipt_do_table
> | |
> | |--26.99%-- ipt_local_in_hook
> | | nf_iterate
> | | nf_hook_slow
> | | ip_local_deliver
> | | ip_rcv_finish
> | | ip_rcv
> | | netif_receive_skb
> | | process_backlog
> | | net_rx_action
> | | __do_softirq
> | | call_softirq
> | | do_softirq
> | | _local_bh_enable_ip
> | | local_bh_enable
> | | dev_queue_xmit
> | | ip_finish_output2
> | | ip_finish_output
> | | ip_output
> | | ip_local_out
> | | ip_queue_xmit
> | | tcp_transmit_skb
> | | tcp_write_xmit
> | | __tcp_push_pending_frames
> | | tcp_push
> | | tcp_sendmsg
> | | __sock_sendmsg
> | | sock_aio_write
> | | do_sync_write
> | | vfs_write
> | | sys_write
> | | sysenter_dispatch
> | | 0xf7f50430
> | | 0x5
> | |
> | |--24.34%-- ipt_local_in_hook
> | | nf_iterate
> | | nf_hook_slow
> | | ip_local_deliver
> | | ip_rcv_finish
> | | ip_rcv
> | | netif_receive_skb
> | | process_backlog
> | | net_rx_action
> | | __do_softirq
> | | call_softirq
> | | do_softirq
> | | _local_bh_enable_ip
> | | local_bh_enable
> | | dev_queue_xmit
> | | ip_finish_output2
> | | ip_finish_output
> | | ip_output
> | | ip_local_out
> | | ip_queue_xmit
> | | tcp_transmit_skb
> | | tcp_write_xmit
> | | __tcp_push_pending_frames
> | | tcp_push
> | | tcp_sendmsg
> | | __sock_sendmsg
> | | sock_aio_write
> | | do_sync_write
> | | vfs_write
> | | sys_write
> | | sysenter_dispatch
> | | 0xf7f50430
> | | 0x5
> | |
> | |--16.81%-- ipt_pre_routing_hook
> | | nf_iterate
> | | nf_hook_slow
> | | ip_rcv
> | | netif_receive_skb
> | | process_backlog
> | | net_rx_action
> | | __do_softirq
> | | call_softirq
> | | do_softirq
> | | _local_bh_enable_ip
> | | local_bh_enable
> | | dev_queue_xmit
> | | ip_finish_output2
> | | ip_finish_output
> | | ip_output
> | | ip_local_out
> | | ip_queue_xmit
> | | tcp_transmit_skb
> | | tcp_write_xmit
> | | __tcp_push_pending_frames
> | | tcp_push
> | | tcp_sendmsg
> | | __sock_sendmsg
> | | sock_aio_write
> | | do_sync_write
> | | vfs_write
> | | sys_write
> | | sysenter_dispatch
> | | 0xf7f50430
> | | 0x5
> | |
> | |--11.95%-- ipt_post_routing_hook
> | | nf_iterate
> | | nf_hook_slow
> | | ip_output
> | | ip_local_out
> | | ip_queue_xmit
> | | tcp_transmit_skb
> | | tcp_write_xmit
> | | __tcp_push_pending_frames
> | | tcp_push
> | | tcp_sendmsg
> | | __sock_sendmsg
> | | sock_aio_write
> | | do_sync_write
> | | vfs_write
> | | sys_write
> | | sysenter_dispatch
> | | 0xf7f50430
> | | 0x5
> | |
> | |--10.18%-- ipt_local_out_hook
> | | nf_iterate
> | | nf_hook_slow
> | | __ip_local_out
> | | ip_local_out
> | | ip_queue_xmit
> | | tcp_transmit_skb
> | | tcp_write_xmit
> | | __tcp_push_pending_frames
> | | tcp_push
> | | tcp_sendmsg
> | | __sock_sendmsg
> | | sock_aio_write
> | | do_sync_write
> | | vfs_write
> | | sys_write
> | | sysenter_dispatch
> | | 0xf7f50430
> | | 0x5
> | |
> | --9.73%-- ipt_local_hook
> | nf_iterate
> | nf_hook_slow
> | __ip_local_out
> | ip_local_out
> | ip_queue_xmit
> | tcp_transmit_skb
> | tcp_write_xmit
> | __tcp_push_pending_frames
> | tcp_push
> | tcp_sendmsg
> | __sock_sendmsg
> | sock_aio_write
> | do_sync_write
> | vfs_write
> | sys_write
> | sysenter_dispatch
> | 0xf7f50430
> | 0x5
> |
> |--31.46%-- __mod_timer
> | mod_timer
> | sk_reset_timer
> | |
> | |--55.36%-- tcp_send_delayed_ack
> | | __tcp_ack_snd_check
> | | tcp_rcv_established
> | | tcp_v4_do_rcv
> | | tcp_prequeue_process
> | | tcp_recvmsg
> | | sock_common_recvmsg
> | | __sock_recvmsg
> | | sock_aio_read
> | | do_sync_read
> | | vfs_read
> | | sys_read
> | | sysenter_dispatch
> | | 0xf7f50430
> | | 0x5
> | |
> | --44.64%-- tcp_v4_rcv
> | ip_local_deliver_finish
> | ip_local_deliver
> | ip_rcv_finish
> | ip_rcv
> | netif_receive_skb
> | process_backlog
> | net_rx_action
> | __do_softirq
> | call_softirq
> | do_softirq
> | _local_bh_enable_ip
> | local_bh_enable
> | dev_queue_xmit
> | ip_finish_output2
> | ip_finish_output
> | ip_output
> | ip_local_out
> | ip_queue_xmit
> | tcp_transmit_skb
> | tcp_write_xmit
> | __tcp_push_pending_frames
> | tcp_push
> | tcp_sendmsg
> | __sock_sendmsg
> | sock_aio_write
> | do_sync_write
> | vfs_write
> | sys_write
> | sysenter_dispatch
> | 0xf7f50430
> | 0x5
> |
> |--21.21%-- task_rq_lock
> | try_to_wake_up
> | default_wake_function
> | |
> | |--99.34%-- autoremove_wake_function
> | | __wake_up_common
> | | __wake_up
> | | tcp_v4_rcv
> | | ip_local_deliver_finish
> | | ip_local_deliver
> | | ip_rcv_finish
> | | ip_rcv
> | | netif_receive_skb
> | | process_backlog
> | | net_rx_action
> | | __do_softirq
> | | call_softirq
> | | do_softirq
> | | _local_bh_enable_ip
> | | local_bh_enable
> | | dev_queue_xmit
> | | ip_finish_output2
> | | ip_finish_output
> | | ip_output
> | | ip_local_out
> | | ip_queue_xmit
> | | tcp_transmit_skb
> | | tcp_write_xmit
> | | __tcp_push_pending_frames
> | | tcp_push
> | | tcp_sendmsg
> | | __sock_sendmsg
> | | sock_aio_write
> | | do_sync_write
> | | vfs_write
> | | sys_write
> | | sysenter_dispatch
> | | 0xf7f50430
> | | 0x5
> | |
> | --0.66%-- pollwake
> | __wake_up_common
> | __wake_up
> | perf_counter_wakeup
> | perf_pending_counter
> | perf_counter_do_pending
> | smp_perf_pending_interrupt
> | perf_pending_interrupt
> | 0xf7f50430
> | 0x5
> |
> |--5.06%-- tcp_v4_rcv
> | ip_local_deliver_finish
> | ip_local_deliver
> | ip_rcv_finish
> | ip_rcv
> | netif_receive_skb
> | process_backlog
> | net_rx_action
> | __do_softirq
> | call_softirq
> | do_softirq
> | _local_bh_enable_ip
> | local_bh_enable
> | dev_queue_xmit
> | ip_finish_output2
> | ip_finish_output
> | ip_output
> | ip_local_out
> | ip_queue_xmit
> | tcp_transmit_skb
> | tcp_write_xmit
> | __tcp_push_pending_frames
> | tcp_push
> | tcp_sendmsg
> | __sock_sendmsg
> | sock_aio_write
> | do_sync_write
> | vfs_write
> | sys_write
> | sysenter_dispatch
> | 0xf7f50430
> | 0x5
> |
> |--4.07%-- __perf_counter_sched_out
> | perf_counter_task_sched_out
> | schedule
> | schedule_timeout
> | sk_wait_data
> | tcp_recvmsg
> | sock_common_recvmsg
> | __sock_recvmsg
> | sock_aio_read
> | do_sync_read
> | vfs_read
> | sys_read
> | sysenter_dispatch
> | 0xf7f50430
> | 0x5
> |
> |--1.40%-- ipt_local_hook
> | nf_iterate
> | nf_hook_slow
> | __ip_local_out
> | ip_local_out
> | ip_queue_xmit
> | tcp_transmit_skb
> | tcp_write_xmit
> | __tcp_push_pending_frames
> | tcp_push
> | tcp_sendmsg
> | __sock_sendmsg
> | sock_aio_write
> | do_sync_write
> | vfs_write
> | sys_write
> | sysenter_dispatch
> | 0xf7f50430
> | 0x5
> |
> |--1.26%-- mod_timer
> | sk_reset_timer
> | |
> | |--66.67%-- tcp_send_delayed_ack
> | | __tcp_ack_snd_check
> | | tcp_rcv_established
> | | tcp_v4_do_rcv
> | | tcp_prequeue_process
> | | tcp_recvmsg
> | | sock_common_recvmsg
> | | __sock_recvmsg
> | | sock_aio_read
> | | do_sync_read
> | | vfs_read
> | | sys_read
> | | sysenter_dispatch
> | | 0xf7f50430
> | | 0x5
> | |
> | --33.33%-- tcp_v4_rcv
> | ip_local_deliver_finish
> | ip_local_deliver
> | ip_rcv_finish
> | ip_rcv
> | netif_receive_skb
> | process_backlog
> | net_rx_action
> | __do_softirq
> | call_softirq
> | do_softirq
> | _local_bh_enable_ip
> | local_bh_enable
> | dev_queue_xmit
> | ip_finish_output2
> | ip_finish_output
> | ip_output
> | ip_local_out
> | ip_queue_xmit
> | tcp_transmit_skb
> | tcp_write_xmit
> | __tcp_push_pending_frames
> | tcp_push
> | tcp_sendmsg
> | __sock_sendmsg
> | sock_aio_write
> | do_sync_write
> | vfs_write
> | sys_write
> | sysenter_dispatch
> | 0xf7f50430
> | 0x5
> |
> |--0.70%-- ipt_local_in_hook
> | nf_iterate
> | nf_hook_slow
> | ip_local_deliver
> | ip_rcv_finish
> | ip_rcv
> | netif_receive_skb
> | process_backlog
> | net_rx_action
> | __do_softirq
> | call_softirq
> | do_softirq
> | _local_bh_enable_ip
> | local_bh_enable
> | dev_queue_xmit
> | ip_finish_output2
> | ip_finish_output
> | ip_output
> | ip_local_out
> | ip_queue_xmit
> | tcp_transmit_skb
> | tcp_write_xmit
> | __tcp_push_pending_frames
> | tcp_push
> | tcp_sendmsg
> | __sock_sendmsg
> | sock_aio_write
> | do_sync_write
> | vfs_write
> | sys_write
> | sysenter_dispatch
> | 0xf7f50430
> | 0x5
> |
> |--0.56%-- perf_counter_task_sched_out
> | schedule
> | schedule_timeout
> | sk_wait_data
> | tcp_recvmsg
> | sock_common_recvmsg
> | __sock_recvmsg
> | sock_aio_read
> | do_sync_read
> | vfs_read
> | sys_read
> | sysenter_dispatch
> | 0xf7f50430
> | 0x5
> --2.53%-- [...]
>
> 1.66% perf [kernel]
> |
> |--25.00%-- journal_stop
> | __ext3_journal_stop
> | ext3_writeback_write_end
> | generic_perform_write
> | generic_file_buffered_write
> | __generic_file_aio_write_nolock
> | generic_file_aio_write
> | ext3_file_write
> | do_sync_write
> | vfs_write
> | sys_write
> | system_call_fastpath
> | __write_nocancel
> |
> |--16.67%-- __perf_counter_sched_out
> | perf_counter_task_sched_out
> | schedule
> | schedule_hrtimeout_range
> | poll_schedule_timeout
> | do_poll
> | do_sys_poll
> | sys_poll
> | system_call_fastpath
> | __GI___poll
> |
> |--8.33%-- dup_mmap
> | dup_mm
> | copy_process
> | do_fork
> | sys_clone
> | stub_clone
> | fork
> | __cmd_record
> |
> |--8.33%-- hrtimer_interrupt
> | smp_apic_timer_interrupt
> | apic_timer_interrupt
> | __write_nocancel
> |
> |--8.33%-- journal_start
> | ext3_journal_start_sb
> | ext3_write_begin
> | generic_perform_write
> | generic_file_buffered_write
> | __generic_file_aio_write_nolock
> | generic_file_aio_write
> | ext3_file_write
> | do_sync_write
> | vfs_write
> | sys_write
> | system_call_fastpath
> | __write_nocancel
> |
> |--8.33%-- create_empty_buffers
> | __block_prepare_write
> | block_write_begin
> | ext3_write_begin
> | generic_perform_write
> | generic_file_buffered_write
> | __generic_file_aio_write_nolock
> | generic_file_aio_write
> | ext3_file_write
> | do_sync_write
> | vfs_write
> | sys_write
> | system_call_fastpath
> | __write_nocancel
> |
> |--8.33%-- perf_counter_task_sched_out
> | schedule
> | schedule_hrtimeout_range
> | poll_schedule_timeout
> | do_poll
> | do_sys_poll
> | sys_poll
> | system_call_fastpath
> | __GI___poll
> |
> |--8.33%-- ext3_alloc_branch
> | ext3_get_blocks_handle
> | ext3_get_block
> | __block_prepare_write
> | block_write_begin
> | ext3_write_begin
> | generic_perform_write
> | generic_file_buffered_write
> | __generic_file_aio_write_nolock
> | generic_file_aio_write
> | ext3_file_write
> | do_sync_write
> | vfs_write
> | sys_write
> | system_call_fastpath
> | __write_nocancel
> |
> --8.33%-- start_this_handle
> journal_start
> ext3_journal_start_sb
> ext3_write_begin
> generic_perform_write
> generic_file_buffered_write
> __generic_file_aio_write_nolock
> generic_file_aio_write
> ext3_file_write
> do_sync_write
> vfs_write
> sys_write
> system_call_fastpath
> __write_nocancel
>
> the graph above (which is the sub-graph of the full graph, limited
> via the -S/--symbols option to all callchains that intersect with
> the _spin_lock symbol) details all the various code-paths that make
> use of _spin_lock(), and the weights (and relative weights) of
> those codepaths.
>
> Note that this works for just about any locking primitive, and note
> how it's recursive. If we inlined spin_lock we'd only know this
> __mod_timer overhead:
>
> |--31.46%-- __mod_timer
>
> but we wouldnt know the second level (which is interesting too):
>
> | |--55.36%-- tcp_send_delayed_ack
> | --44.64%-- tcp_v4_rcv
>
> All in one, i think we should detach the issue of instrumentation
> and profiling from the issue of whether to inline or not. We should
> inline where it makes things faster. Our profiling tools are able
> to deal with it, independently of whether we inline or not.
>
> Ingo
next prev parent reply other threads:[~2009-08-13 18:43 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-08-12 18:39 [patch 0/3] Allow inlined spinlocks again V3 Heiko Carstens
2009-08-12 18:39 ` [patch 1/3] spinlock: move spinlock function bodies to header file Heiko Carstens
2009-08-12 18:39 ` [patch 2/3] spinlock: allow inlined spinlocks Heiko Carstens
2009-08-12 18:39 ` [patch 3/3] spinlock: allow inlined spinlocks on s390 Heiko Carstens
2009-08-13 18:11 ` [patch 0/3] Allow inlined spinlocks again V3 Linus Torvalds
2009-08-13 18:34 ` Ingo Molnar
2009-08-13 18:43 ` Ingo Molnar [this message]
2009-08-14 12:34 ` Heiko Carstens
2009-08-14 16:04 ` Linus Torvalds
2009-08-14 17:13 ` Heiko Carstens
2009-08-14 18:08 ` Linus Torvalds
2009-08-14 20:19 ` David Miller
2009-08-14 20:45 ` Linus Torvalds
2009-08-14 21:10 ` Linus Torvalds
2009-08-14 22:23 ` David Miller
2009-08-16 18:27 ` Heiko Carstens
2009-08-16 18:45 ` Linus Torvalds
2009-08-16 20:36 ` Ingo Molnar
2009-08-17 10:26 ` Heiko Carstens
2009-08-17 21:26 ` [numbers] Re: [patch] more skb ops inlining Ingo Molnar
2009-08-18 11:34 ` Heiko Carstens
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20090813184301.GA15187@elte.hu \
--to=mingo@elte.hu \
--cc=a.p.zijlstra@chello.nl \
--cc=akpm@linux-foundation.org \
--cc=arnd@arndb.de \
--cc=ehrhardt@linux.vnet.ibm.com \
--cc=heiko.carstens@de.ibm.com \
--cc=horsth@linux.vnet.ibm.com \
--cc=linux-arch@vger.kernel.org \
--cc=schwidefsky@de.ibm.com \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.