* Understanding lock contention in __udp4_lib_mcast_deliver
@ 2013-06-27 19:22 Shawn Bohrer
2013-06-27 19:58 ` Rick Jones
2013-06-27 21:39 ` Or Gerlitz
0 siblings, 2 replies; 10+ messages in thread
From: Shawn Bohrer @ 2013-06-27 19:22 UTC (permalink / raw)
To: netdev
I'm looking for opportunities to improve the multicast receive
performance for our application, and I thought I'd spend some time
trying to understand what I thought might be a small/simple
improvement. Profiling with perf I see that there is spin_lock
contention in __udp4_lib_mcast_deliver:
0.68% swapper [kernel.kallsyms] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--24.13%-- perf_adjust_freq_unthr_context.part.21
|
|--22.40%-- scheduler_tick
|
|--14.96%-- __udp4_lib_mcast_deliver
The lock appears to be the udp_hslot lock protecting sockets that have
been hashed based on the local port. My application contains multiple
processes and each process listens on multiple multicast sockets. We
open one socket per multicast addr:port per machine. All of the local
UDP ports are unique. The UDP hash table has 65536 entries and it
appears the hash function is simply the port number plus a
contribution from the network namespace (I believe the namespace part
should be constant since we're not using network namespaces.
Perhaps I should disable CONFIG_NET_NS).
$ dmesg | grep "UDP hash"
[ 0.606472] UDP hash table entries: 65536 (order: 9, 2097152 bytes)
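For reference, the slot selection as I read it in net/ipv4/udp.c looks
roughly like this (paraphrased from the source, so take it as a sketch
rather than a verbatim copy):

    static inline unsigned int udp_hashfn(struct net *net, unsigned int num,
                                          unsigned int mask)
    {
            /* local port plus a per-namespace mix, masked to the table size */
            return (num + net_hash_mix(net)) & mask;
    }

    static inline struct udp_hslot *udp_hashslot(struct udp_table *table,
                                                 struct net *net,
                                                 unsigned int num)
    {
            /* with unique local ports each socket should get its own slot */
            return &table->hash[udp_hashfn(net, num, table->mask)];
    }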
At this point I'm confused why there is any contention over this
spin_lock. We should have enough hash table entries to avoid
collisions and all of our ports are unique. Assuming that part is
true I could imagine contention if the kernel is processing two
packets for the same multicast addr:port in parallel but is that
possible? I'm using a Mellanox ConnectX-3 card that has multiple
receive queues but I've always thought that all packets for a given
srcip:srcport:destip:destport would all be delivered to a single queue
and thus not processed in parallel.
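Put another way, my mental model of the receive queue selection is
something like the sketch below. This is only an illustration of the
assumption, not the actual Toeplitz hash the NIC implements:

    struct flow4 {
            unsigned int saddr, daddr;      /* IPv4 source/destination */
            unsigned short sport, dport;    /* UDP source/destination ports */
    };

    static unsigned int pick_rx_queue(const struct flow4 *f,
                                      unsigned int nr_queues)
    {
            /* some deterministic hash of the 4-tuple: a given flow always
             * maps to the same receive queue */
            unsigned int h = f->saddr ^ f->daddr ^
                             ((unsigned int)f->sport << 16 | f->dport);
            return h % nr_queues;
    }

If that holds, two packets from the same sender to the same addr:port
should never be processed on two CPUs at once.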
There is obviously something wrong with my thinking and I'd appreciate
if someone could tell me what I'm missing.
Thanks,
Shawn
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Understanding lock contention in __udp4_lib_mcast_deliver
2013-06-27 19:22 Understanding lock contention in __udp4_lib_mcast_deliver Shawn Bohrer
@ 2013-06-27 19:58 ` Rick Jones
2013-06-27 20:20 ` Shawn Bohrer
2013-06-27 21:39 ` Or Gerlitz
1 sibling, 1 reply; 10+ messages in thread
From: Rick Jones @ 2013-06-27 19:58 UTC (permalink / raw)
To: Shawn Bohrer; +Cc: netdev
On 06/27/2013 12:22 PM, Shawn Bohrer wrote:
> I'm looking for opportunities to improve the multicast receive
> performance for our application, and I thought I'd spend some time
> trying to understand what I thought might be a small/simple
> improvement. Profiling with perf I see that there is spin_lock
> contention in __udp4_lib_mcast_deliver:
>
> 0.68% swapper [kernel.kallsyms] [k] _raw_spin_lock
> |
> --- _raw_spin_lock
> |
> |--24.13%-- perf_adjust_freq_unthr_context.part.21
> |
> |--22.40%-- scheduler_tick
> |
> |--14.96%-- __udp4_lib_mcast_deliver
Are there other processes showing _raw_spin_lock time? It may be more
clear to add a --sort symbol,dso or some such to your perf report
command. Because what you show there suggests less than 1% of the
active cycles are in _raw_spin_lock.
rick jones
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Understanding lock contention in __udp4_lib_mcast_deliver
2013-06-27 19:58 ` Rick Jones
@ 2013-06-27 20:20 ` Shawn Bohrer
2013-06-27 20:46 ` Rick Jones
0 siblings, 1 reply; 10+ messages in thread
From: Shawn Bohrer @ 2013-06-27 20:20 UTC (permalink / raw)
To: Rick Jones; +Cc: netdev
On Thu, Jun 27, 2013 at 12:58:39PM -0700, Rick Jones wrote:
> On 06/27/2013 12:22 PM, Shawn Bohrer wrote:
> >I'm looking for opportunities to improve the multicast receive
> >performance for our application, and I thought I'd spend some time
> >trying to understand what I thought might be a small/simple
> >improvement. Profiling with perf I see that there is spin_lock
> >contention in __udp4_lib_mcast_deliver:
> >
> >0.68% swapper [kernel.kallsyms] [k] _raw_spin_lock
> > |
> > --- _raw_spin_lock
> > |
> > |--24.13%-- perf_adjust_freq_unthr_context.part.21
> > |
> > |--22.40%-- scheduler_tick
> > |
> > |--14.96%-- __udp4_lib_mcast_deliver
>
> Are there other processes showing _raw_spin_lock time? It may be
> more clear to add a --sort symbol,dso or some such to your perf
> report command. Because what you show there suggests less than 1%
> of the active cycles are in _raw_spin_lock.
You think I'm wasting time going after small potatoes huh? On a
normal system it looks like it is about .12% total which is indeed
small but my thinking was that I should be able to make that go to 0
easily by ensuring we use unique ports and only have one socket per
multicast addr:port. Now that I've failed at making it go to 0 I
would mostly like to understand what part of my thinking was flawed.
Or perhaps I can make it go to zero if I do ...
--
Shawn
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Understanding lock contention in __udp4_lib_mcast_deliver
2013-06-27 20:20 ` Shawn Bohrer
@ 2013-06-27 20:46 ` Rick Jones
2013-06-27 21:54 ` Shawn Bohrer
0 siblings, 1 reply; 10+ messages in thread
From: Rick Jones @ 2013-06-27 20:46 UTC (permalink / raw)
To: Shawn Bohrer; +Cc: netdev
On 06/27/2013 01:20 PM, Shawn Bohrer wrote:
> On Thu, Jun 27, 2013 at 12:58:39PM -0700, Rick Jones wrote:
>> Are there other processes showing _raw_spin_lock time? It may be
>> more clear to add a --sort symbol,dso or some such to your perf
>> report command. Because what you show there suggests less than 1%
>> of the active cycles are in _raw_spin_lock.
>
> You think I'm wasting time going after small potatoes huh?
Perhaps. I also find it difficult to see the potatoes' (symbols') big
picture in perf's default sorting :)
> On a normal system it looks like it is about .12% total which is
> indeed small but my thinking was that I should be able to make that
> go to 0 easily by ensuring we use unique ports and only have one
> socket per multicast addr:port. Now that I've failed at making it go
> to 0 I would mostly like to understand what part of my thinking was
> flawed. Or perhaps I can make it go to zero if I do ...
How do you know that time is actually contention and not simply acquire
and release overhead?
rick jones
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Understanding lock contention in __udp4_lib_mcast_deliver
2013-06-27 19:22 Understanding lock contention in __udp4_lib_mcast_deliver Shawn Bohrer
2013-06-27 19:58 ` Rick Jones
@ 2013-06-27 21:39 ` Or Gerlitz
2013-06-27 21:58 ` Shawn Bohrer
1 sibling, 1 reply; 10+ messages in thread
From: Or Gerlitz @ 2013-06-27 21:39 UTC (permalink / raw)
To: Shawn Bohrer; +Cc: netdev
On Thu, Jun 27, 2013 at 10:22 PM, Shawn Bohrer <sbohrer@rgmadvisors.com> wrote:
> I'm looking for opportunities to improve the multicast receive
> performance for our application, and I thought I'd spend some time
> trying to understand what I thought might be a small/simple
> improvement. Profiling with perf I see that there is spin_lock
> contention in __udp4_lib_mcast_deliver:
>
> 0.68% swapper [kernel.kallsyms] [k] _raw_spin_lock
> |
> --- _raw_spin_lock
> |
> |--24.13%-- perf_adjust_freq_unthr_context.part.21
> |
> |--22.40%-- scheduler_tick
> |
> |--14.96%-- __udp4_lib_mcast_deliver
>
> The lock appears to be the udp_hslot lock protecting sockets that have
> been hashed based on the local port. My application contains multiple
> processes and each process listens on multiple multicast sockets. We
> open one socket per multicast addr:port per machine. All of the local
> UDP ports are unique. The UDP hash table has 65536 entries and it
> appears the hash function is simply the port number plus a
> contribution from the network namespace (I believe the namespace part
> should be constant since we're not using network namespaces.
> Perhaps I should disable CONFIG_NET_NS).
>
> $ dmesg | grep "UDP hash"
> [ 0.606472] UDP hash table entries: 65536 (order: 9, 2097152 bytes)
>
> At this point I'm confused why there is any contention over this
> spin_lock. We should have enough hash table entries to avoid
> collisions and all of our ports are unique. Assuming that part is
> true I could imagine contention if the kernel is processing two
> packets for the same multicast addr:port in parallel but is that
> possible? I'm using a Mellanox ConnectX-3 card that has multiple
> receive queues but I've always thought that all packets for a given
> srcip:srcport:destip:destport would all be delivered to a single queue
> and thus not processed in parallel.
But for a given dest mcast ip / udp port you can have multiple senders,
each with a different source ucast ip / udp port, isn't that the case?
Anyway, if you disable the udp_rss module param you can be sure that
all traffic to a dest mcast group goes to one ring, but then this ring
will be the only one active, so I am not sure this is wise.
Or.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Understanding lock contention in __udp4_lib_mcast_deliver
2013-06-27 20:46 ` Rick Jones
@ 2013-06-27 21:54 ` Shawn Bohrer
2013-06-27 22:03 ` Rick Jones
0 siblings, 1 reply; 10+ messages in thread
From: Shawn Bohrer @ 2013-06-27 21:54 UTC (permalink / raw)
To: Rick Jones; +Cc: netdev
On Thu, Jun 27, 2013 at 01:46:58PM -0700, Rick Jones wrote:
> On 06/27/2013 01:20 PM, Shawn Bohrer wrote:
> >On Thu, Jun 27, 2013 at 12:58:39PM -0700, Rick Jones wrote:
> >>Are there other processes showing _raw_spin_lock time? It may be
> >>more clear to add a --sort symbol,dso or some such to your perf
> >>report command. Because what you show there suggests less than 1%
> >>of the active cycles are in _raw_spin_lock.
> >
> >You think I'm wasting time going after small potatoes huh?
>
> Perhaps. I also find it difficult to see the potatoes' (symbols')
> big picture in perf's default sorting :)
>
> >On a normal system it looks like it is about .12% total which is
> >indeed small but my thinking was that I should be able to make that
> >go to 0 easily by ensuring we use unique ports and only have one
> >socket per multicast addr:port. Now that I've failed at making it go
> >to 0 I would mostly like to understand what part of my thinking was
> >flawed. Or perhaps I can make it go to zero if I do ...
>
> How do you know that time is actually contention and not simply
> acquire and release overhead?
Excellent point, and that could be the problem with my thinking. I
just now tried (unsuccessfully) to use lockstat to see if there was
any contention reported. I read Documentation/lockstat.txt and
followed the instructions but the lock in question did not appear to
be in the output. I think I'm going to have to go with the assumption
that this is just acquire and release overhead.
Thanks,
Shawn
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Understanding lock contention in __udp4_lib_mcast_deliver
2013-06-27 21:39 ` Or Gerlitz
@ 2013-06-27 21:58 ` Shawn Bohrer
0 siblings, 0 replies; 10+ messages in thread
From: Shawn Bohrer @ 2013-06-27 21:58 UTC (permalink / raw)
To: Or Gerlitz; +Cc: netdev
On Fri, Jun 28, 2013 at 12:39:38AM +0300, Or Gerlitz wrote:
> On Thu, Jun 27, 2013 at 10:22 PM, Shawn Bohrer <sbohrer@rgmadvisors.com> wrote:
> > I'm looking for opportunities to improve the multicast receive
> > performance for our application, and I thought I'd spend some time
> > trying to understand what I thought might be a small/simple
> > improvement. Profiling with perf I see that there is spin_lock
> > contention in __udp4_lib_mcast_deliver:
> >
> > 0.68% swapper [kernel.kallsyms] [k] _raw_spin_lock
> > |
> > --- _raw_spin_lock
> > |
> > |--24.13%-- perf_adjust_freq_unthr_context.part.21
> > |
> > |--22.40%-- scheduler_tick
> > |
> > |--14.96%-- __udp4_lib_mcast_deliver
> >
> > The lock appears to be the udp_hslot lock protecting sockets that have
> > been hashed based on the local port. My application contains multiple
> > processes and each process listens on multiple multicast sockets. We
> > open one socket per multicast addr:port per machine. All of the local
> > UDP ports are unique. The UDP hash table has 65536 entries and it
> > appears the hash function is simply the port number plus a
> > contribution from the network namespace (I believe the namespace part
> > should be constant since we're not using network namespaces.
> > Perhaps I should disable CONFIG_NET_NS).
> >
> > $ dmesg | grep "UDP hash"
> > [ 0.606472] UDP hash table entries: 65536 (order: 9, 2097152 bytes)
> >
> > At this point I'm confused why there is any contention over this
> > spin_lock. We should have enough hash table entries to avoid
> > collisions and all of our ports are unique. Assuming that part is
> > true I could imagine contention if the kernel is processing two
> > packets for the same multicast addr:port in parallel but is that
> > possible? I'm using a Mellanox ConnectX-3 card that has multiple
> > receive queues but I've always thought that all packets for a given
> > srcip:srcport:destip:destport would all be delivered to a single queue
> > and thus not processed in parallel.
>
> But for a given dest mcast ip / udp port you can have multiple senders,
> each with a different source ucast ip / udp port, isn't that the case?
Sure, in my case it is one sender per multicast group / port.
> Anyway, if you disable the udp_rss module param you can be sure that
> all traffic to a dest mcast group goes to one ring, but then this ring
> will be the only one active, so I am not sure this is wise.
I can still try disabling udp_rss and see what happens.
Thanks,
Shawn
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Understanding lock contention in __udp4_lib_mcast_deliver
2013-06-27 21:54 ` Shawn Bohrer
@ 2013-06-27 22:03 ` Rick Jones
2013-06-27 22:44 ` Shawn Bohrer
0 siblings, 1 reply; 10+ messages in thread
From: Rick Jones @ 2013-06-27 22:03 UTC (permalink / raw)
To: Shawn Bohrer; +Cc: netdev
On 06/27/2013 02:54 PM, Shawn Bohrer wrote:
> On Thu, Jun 27, 2013 at 01:46:58PM -0700, Rick Jones wrote:
>> How do you know that time is actually contention and not simply
>> acquire and release overhead?
>
> Excellent point, and that could be the problem with my thinking. I
> just now tried (unsuccessfully) to use lockstat to see if there was
> any contention reported. I read Documentation/lockstat.txt and
> followed the instructions but the lock in question did not appear to
> be in the output. I think I'm going to have to go with the assumption
> that this is just acquire and release overhead.
I think there is a way to get perf to "annotate" (iirc that is the term
it uses) the report to show hits at the instruction level. Ostensibly
one could then look and see how many of the hits were for the
acquire/release part of the routine, and how much was for the actual
contention.
Another way to go, perhaps, would be to break out the "contention"
portion into a separate routine. Then the question of contention versus
acquire/release would be clearly answered in a profile without needing
to take the annotation path. Long ago and far away in IA64 land that
was done. An example can be seen in:
ftp://ftp.netperf.org/iptable_scaling/full_iptables/32_core_net_next_full_iptables_cycles.txt
where you will see "ia64_spinlock_contention" rather high in the
profile. (That was taken with HP's "caliper" tool, when perfmon was the
mechanism in use for accessing performance counters.)
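To sketch the idea (illustrative only -- spinlock_contention() is a
made-up name, not actual kernel code -- reusing the ticket-lock pieces
that show up in your annotate output):

    /* Keep the uncontended acquire inline and push the spin-wait into a
     * separately named, noinline routine, so profile hits landing there
     * are unambiguously contention rather than acquire/release overhead. */
    static noinline void spinlock_contention(arch_spinlock_t *lock,
                                             struct __raw_tickets inc)
    {
            while (inc.head != inc.tail) {
                    cpu_relax();
                    inc.head = ACCESS_ONCE(lock->tickets.head);
            }
    }

    static __always_inline void ticket_spin_lock(arch_spinlock_t *lock)
    {
            register struct __raw_tickets inc = { .tail = 1 };

            inc = xadd(&lock->tickets, inc);        /* uncontended fast path */
            if (unlikely(inc.head != inc.tail))
                    spinlock_contention(lock, inc); /* shows up by name */
    }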
rick jones
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Understanding lock contention in __udp4_lib_mcast_deliver
2013-06-27 22:03 ` Rick Jones
@ 2013-06-27 22:44 ` Shawn Bohrer
2013-07-02 20:16 ` Eric Dumazet
0 siblings, 1 reply; 10+ messages in thread
From: Shawn Bohrer @ 2013-06-27 22:44 UTC (permalink / raw)
To: Rick Jones; +Cc: netdev
On Thu, Jun 27, 2013 at 03:03:15PM -0700, Rick Jones wrote:
> On 06/27/2013 02:54 PM, Shawn Bohrer wrote:
> >On Thu, Jun 27, 2013 at 01:46:58PM -0700, Rick Jones wrote:
> >>How do you know that time is actually contention and not simply
> >>acquire and release overhead?
> >
> >Excellent point, and that could be the problem with my thinking. I
> >just now tried (unsuccessfully) to use lockstat to see if there was
> >any contention reported. I read Documentation/lockstat.txt and
> >followed the instructions but the lock in question did not appear to
> >be in the output. I think I'm going to have to go with the assumption
> >that this is just acquire and release overhead.
>
> I think there is a way to get perf to "annotate" (iirc that is the
> term it uses) the report to show hits at the instruction level.
> Ostensibly one could then look and see how many of the hits were for
> the acquire/release part of the routine, and how much was for the
> actual contention.
Yep, so ~1% of my total time is in _raw_spin_lock, and using perf
annotate it appears that maybe only 5-6% of that is actually
contention; the rest is acquire/release. Looks like I need to look
elsewhere for my performance improvements. Thanks, Rick, for your help!
Below is the output of perf annotate if you're curious.
Percent | Source code & Disassembly of vmlinux
------------------------------------------------
:
:
:
: Disassembly of section .text:
:
: ffffffff814c72d0 <_raw_spin_lock>:
: EXPORT_SYMBOL(_raw_spin_trylock_bh);
: #endif
:
: #ifndef CONFIG_INLINE_SPIN_LOCK
: void __lockfunc _raw_spin_lock(raw_spinlock_t *lock)
: {
2.43 : ffffffff814c72d0: callq ffffffff814cf440 <__fentry__>
1.23 : ffffffff814c72d5: push %rbp
1.66 : ffffffff814c72d6: mov %rsp,%rbp
: */
: static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
: {
: register struct __raw_tickets inc = { .tail = 1 };
:
: inc = xadd(&lock->tickets, inc);
0.71 : ffffffff814c72d9: mov $0x10000,%eax
0.00 : ffffffff814c72de: lock xadd %eax,(%rdi)
86.07 : ffffffff814c72e2: mov %eax,%edx
0.05 : ffffffff814c72e4: shr $0x10,%edx
:
: for (;;) {
: if (inc.head == inc.tail)
0.00 : ffffffff814c72e7: cmp %ax,%dx
0.00 : ffffffff814c72ea: je ffffffff814c72fa <_raw_spin_lock+0x2a>
0.04 : ffffffff814c72ec: nopl 0x0(%rax)
: }
:
: /* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
: static inline void rep_nop(void)
: {
: asm volatile("rep; nop" ::: "memory");
0.47 : ffffffff814c72f0: pause
: break;
: cpu_relax();
: inc.head = ACCESS_ONCE(lock->tickets.head);
2.85 : ffffffff814c72f2: movzwl (%rdi),%eax
: register struct __raw_tickets inc = { .tail = 1 };
:
: inc = xadd(&lock->tickets, inc);
:
: for (;;) {
: if (inc.head == inc.tail)
3.53 : ffffffff814c72f5: cmp %ax,%dx
0.00 : ffffffff814c72f8: jne ffffffff814c72f0 <_raw_spin_lock+0x20>
: __raw_spin_lock(lock);
: }
0.91 : ffffffff814c72fa: pop %rbp
0.00 : ffffffff814c72fb: retq
--
Shawn
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Understanding lock contention in __udp4_lib_mcast_deliver
2013-06-27 22:44 ` Shawn Bohrer
@ 2013-07-02 20:16 ` Eric Dumazet
0 siblings, 0 replies; 10+ messages in thread
From: Eric Dumazet @ 2013-07-02 20:16 UTC (permalink / raw)
To: Shawn Bohrer; +Cc: Rick Jones, netdev
On Thu, 2013-06-27 at 17:44 -0500, Shawn Bohrer wrote:
> On Thu, Jun 27, 2013 at 03:03:15PM -0700, Rick Jones wrote:
> > On 06/27/2013 02:54 PM, Shawn Bohrer wrote:
> > >On Thu, Jun 27, 2013 at 01:46:58PM -0700, Rick Jones wrote:
> > >>How do you know that time is actually contention and not simply
> > >>acquire and release overhead?
> > >
> > >Excellent point, and that could be the problem with my thinking. I
> > >just now tried (unsuccessfully) to use lockstat to see if there was
> > >any contention reported. I read Documentation/lockstat.txt and
> > >followed the instructions but the lock in question did not appear to
> > >be in the output. I think I'm going to have to go with the assumption
> > >that this is just acquire and release overhead.
> >
> > I think there is a way to get perf to "annotate" (iirc that is the
> > term it uses) the report to show hits at the instruction level.
> > Ostensibly one could then look and see how many of the hits were for
> > the acquire/release part of the routine, and how much was for the
> > actual contention.
>
> Yep, so ~1% of my total time is in _raw_spin_lock, and using perf
> annotate it appears that maybe only 5-6% of that is actually
> contention; the rest is acquire/release. Looks like I need to look
> elsewhere for my performance improvements. Thanks, Rick, for your help!
> Below is the output of perf annotate if you're curious.
It seems multicast reception could benefit from the 'hcount' infra added
in commit fdcc8aa953a1123 ("udp: add a counter into udp_hslot").
That is: if a udp hash slot has few sockets (apparently in your case, at
most one socket per hash slot), we can perform an RCU lookup as in the
unicast case.
__udp4_lib_mcast_deliver() had to use spin_lock() protection because
we had no idea of the number of sockets we could find, as we use a
stack[256 / sizeof(struct sock *)] to hold socket pointers, and one
incoming message has to be delivered exactly once per socket.
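Very roughly, something like the untested sketch below (just to show
the idea, not a patch -- the rcu delivery helper is hypothetical):

    /* in __udp4_lib_mcast_deliver(), after picking the hash slot */
    struct udp_hslot *hslot = udp_hashslot(udptable, net, ntohs(uh->dest));

    if (hslot->count <= 1) {
            /* at most one socket hashed here: lockless RCU lookup,
             * as the unicast path already does */
            rcu_read_lock();
            deliver_mcast_rcu(net, skb, hslot);     /* hypothetical helper */
            rcu_read_unlock();
            return 0;
    }

    /* crowded slot: keep the existing behaviour */
    spin_lock(&hslot->lock);
    /* ... collect matching sockets into stack[256 / sizeof(struct sock *)] ... */
    spin_unlock(&hslot->lock);
    /* ... then deliver the message exactly once per socket ... */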
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2013-07-02 20:16 UTC | newest]
Thread overview: 10+ messages
2013-06-27 19:22 Understanding lock contention in __udp4_lib_mcast_deliver Shawn Bohrer
2013-06-27 19:58 ` Rick Jones
2013-06-27 20:20 ` Shawn Bohrer
2013-06-27 20:46 ` Rick Jones
2013-06-27 21:54 ` Shawn Bohrer
2013-06-27 22:03 ` Rick Jones
2013-06-27 22:44 ` Shawn Bohrer
2013-07-02 20:16 ` Eric Dumazet
2013-06-27 21:39 ` Or Gerlitz
2013-06-27 21:58 ` Shawn Bohrer