* Heavy spin_lock contention in __udp4_lib_mcast_deliver increase
@ 2012-04-26 15:15 Shawn Bohrer
From: Shawn Bohrer @ 2012-04-26 15:15 UTC (permalink / raw)
To: netdev; +Cc: eric.dumazet
I've been doing some UDP multicast benchmarking and noticed that as we
increase the number of sockets/multicast addresses, performance degrades.
The test I'm running has multiple machines sending packets
on multiple multicast addresses. A single receiving machine opens one
socket per multicast address to receive all the packets. The
receiving process is bound to a core that is not processing
interrupts.
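
For reference, the receive side looks roughly like this (a simplified
sketch, error handling omitted, addresses and port picked arbitrarily;
one socket per group, and in this test every socket binds the same port):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define NGROUPS 300
#define PORT    12345              /* every socket binds this same port */

static int open_group_socket(const char *group)
{
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int on = 1;
        struct sockaddr_in addr;
        struct ip_mreq mreq;

        /* multiple binds to the same port need SO_REUSEADDR */
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = inet_addr(group);  /* bind the group addr */
        addr.sin_port = htons(PORT);
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        mreq.imr_multiaddr.s_addr = inet_addr(group);
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
        return fd;
}

int main(void)
{
        char group[32];
        int i;

        for (i = 0; i < NGROUPS; i++) {
                /* 239.1.1.1 ... 239.1.2.50, one group per socket */
                snprintf(group, sizeof(group), "239.1.%d.%d",
                         1 + i / 250, 1 + i % 250);
                open_group_socket(group);
        }
        pause();                /* receive loop elided */
        return 0;
}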
Running this test with 300 multicast addresses and sockets, and
profiling the receiving machine with 'perf record -a -g', I see the
following:
# Events: 45K cycles
#
# Overhead
# ........ .....................................
#
52.56% [k] _raw_spin_lock
|
|--99.09%-- __udp4_lib_mcast_deliver
20.10% [k] __udp4_lib_mcast_deliver
|
--- __udp4_lib_rcv
If I understand this correctly, 52.56% of the time is spent contending
for the spin_lock in __udp4_lib_mcast_deliver. Reading the code, it
appears that for every packet received we walk the list of all UDP
sockets while holding that spin_lock, so I believe what hurts so much in
this case is simply that we have a lot of UDP sockets.
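
To illustrate the pattern, here is a self-contained userspace toy (all
numbers made up; this is obviously not the kernel code): every "packet"
takes the one lock and scans the whole list to deliver to a single match,
so the receive threads all serialize on that lock.

#include <pthread.h>
#include <stdio.h>

#define NSOCKS   300
#define NTHREADS 4
#define NPKTS    200000

static pthread_mutex_t hslot_lock = PTHREAD_MUTEX_INITIALIZER;
static int sock_addr[NSOCKS];       /* stand-in for the socket list */

static void *rx_queue(void *arg)
{
        long delivered = 0;
        int p, s;

        (void)arg;
        for (p = 0; p < NPKTS; p++) {
                pthread_mutex_lock(&hslot_lock);        /* hslot->lock */
                for (s = 0; s < NSOCKS; s++)            /* walk every socket... */
                        if (sock_addr[s] == p % NSOCKS) /* ...to find one match */
                                delivered++;
                pthread_mutex_unlock(&hslot_lock);
        }
        return (void *)delivered;
}

int main(void)
{
        pthread_t t[NTHREADS];
        void *delivered;
        int i;

        for (i = 0; i < NSOCKS; i++)
                sock_addr[i] = i;
        for (i = 0; i < NTHREADS; i++)
                pthread_create(&t[i], NULL, rx_queue, NULL);
        for (i = 0; i < NTHREADS; i++) {
                pthread_join(t[i], &delivered);
                printf("queue %d delivered %ld packets\n", i, (long)delivered);
        }
        return 0;
}

Adding threads (receive queues) to this toy barely helps throughput,
which matches what we see on the real machine.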
Are there any ideas on how we can improve performance in this case? I
have two ideas, though my understanding of the network stack is limited
and it is unclear to me how to implement either of them.
The first idea is to use RCU instead of acquiring the spin_lock. This is
what the unicast path does; however, looking back at commit 271b72c7
("udp: RCU handling for Unicast packets"), Eric points out that the
multicast path is difficult. From that commit description, the problem is
that since we have to find all sockets interested in receiving the
packet, rather than just one, restarting the scan of the hlist could lead
us to deliver the packet twice to the same socket. That commit is rather
old, though, and things may have changed since. Looking at commit
1240d137 ("ipv4: udp: Optimise multicast reception") I can see that Eric
has already done some work to reduce how long the spin_lock is held in
__udp4_lib_mcast_deliver(). That commit also says "It's also a base for a
future RCU conversion of multicast reception". Is the idea that you could
remove duplicate sockets within flush_stack()? Actually, I don't think
that would work, since flush_stack() can be called multiple times if the
stack gets full.
The second idea is to hash the sockets, to reduce the number of sockets
walked for each packet. Once again, the unicast path already does this,
in commits 512615b6b ("udp: secondary hash on (local port, local
address)") and 5051ebd27 ("ipv4: udp: optimize unicast RX path"). Perhaps
those hash lists could be reused, but I don't think they can as-is: they
are currently RCU-protected, so this might depend on doing the RCU
conversion first.
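
As a toy illustration of the idea (userspace, made-up hash function, not
the kernel's): keying the bucket on (local port, local address) spreads
300 groups sharing one port across chains, where a port-only key puts
them all in the same chain.

#include <stdint.h>
#include <stdio.h>

static unsigned int bucket(uint16_t port, uint32_t addr, unsigned int mask)
{
        uint32_t h = addr ^ ((uint32_t)port * 2654435761u);  /* toy mix */

        h ^= h >> 16;
        return h & mask;
}

int main(void)
{
        unsigned int mask = 1024 - 1;
        uint16_t port = 12345;
        uint32_t i;

        for (i = 0; i < 4; i++) {
                uint32_t group = 0xEF010101u + i;  /* 239.1.1.1 + i */

                printf("239.1.1.%u -> port-only bucket %u, (port,addr) bucket %u\n",
                       i + 1, port & mask, bucket(port, group, mask));
        }
        return 0;
}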
Thanks,
Shawn
* Re: Heavy spin_lock contention in __udp4_lib_mcast_deliver increase
From: Eric Dumazet @ 2012-04-26 15:53 UTC (permalink / raw)
To: Shawn Bohrer; +Cc: netdev

On Thu, 2012-04-26 at 10:15 -0500, Shawn Bohrer wrote:
> I've been doing some UDP multicast benchmarking and noticed that as we
> increase the number of sockets/multicast addresses, performance degrades.
> [...]
> Running this test with 300 multicast addresses and sockets, and
> profiling the receiving machine with 'perf record -a -g', I see the
> following:
> [...]

Let me understand: you have 300 sockets bound to the same port, so a
single message must be copied 300 times and delivered to those sockets?
* Re: Heavy spin_lock contention in __udp4_lib_mcast_deliver increase
From: Eric Dumazet @ 2012-04-26 16:18 UTC (permalink / raw)
To: Shawn Bohrer; +Cc: netdev

On Thu, 2012-04-26 at 17:53 +0200, Eric Dumazet wrote:
> Let me understand: you have 300 sockets bound to the same port, so a
> single message must be copied 300 times and delivered to those sockets?

Please try the following patch. It should allow up to 512 sockets (on
x86_64) to be stored in the stack array, with delivery performed outside
the locked section.

 net/ipv4/udp.c | 16 ++++++++++++----
 net/ipv6/udp.c | 15 +++++++++++----
 2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 279fd08..beb9ea6 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1539,13 +1539,20 @@ static void flush_stack(struct sock **stack, unsigned int count,
 static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
                                     struct udphdr *uh,
                                     __be32 saddr, __be32 daddr,
-                                    struct udp_table *udptable)
+                                    struct udp_table *udptable,
+                                    int proto)
 {
-        struct sock *sk, *stack[256 / sizeof(struct sock *)];
+        struct sock *sk, **stack;
         struct udp_hslot *hslot = udp_hashslot(udptable, net, ntohs(uh->dest));
         int dif;
         unsigned int i, count = 0;
 
+        stack = kmalloc(PAGE_SIZE, GFP_ATOMIC);
+        if (unlikely(!stack)) {
+                UDP_INC_STATS_BH(net, UDP_MIB_RCVBUFERRORS, proto == IPPROTO_UDPLITE);
+                kfree_skb(skb);
+                return 0;
+        }
         spin_lock(&hslot->lock);
         sk = sk_nulls_head(&hslot->head);
         dif = skb->dev->ifindex;
@@ -1554,7 +1561,7 @@ static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
                 stack[count++] = sk;
                 sk = udp_v4_mcast_next(net, sk_nulls_next(sk), uh->dest,
                                        daddr, uh->source, saddr, dif);
-                if (unlikely(count == ARRAY_SIZE(stack))) {
+                if (unlikely(count == PAGE_SIZE/sizeof(*sk))) {
                         if (!sk)
                                 break;
                         flush_stack(stack, count, skb, ~0);
@@ -1580,6 +1587,7 @@ static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
         } else {
                 kfree_skb(skb);
         }
+        kfree(stack);
         return 0;
 }
 
@@ -1661,7 +1669,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 
         if (rt->rt_flags & (RTCF_BROADCAST|RTCF_MULTICAST))
                 return __udp4_lib_mcast_deliver(net, skb, uh,
-                                                saddr, daddr, udptable);
+                                                saddr, daddr, udptable, proto);
 
         sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index d39bbc9..fc79b87 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -640,14 +640,20 @@ drop:
  */
 static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
                 const struct in6_addr *saddr, const struct in6_addr *daddr,
-                struct udp_table *udptable)
+                struct udp_table *udptable, int proto)
 {
-        struct sock *sk, *stack[256 / sizeof(struct sock *)];
+        struct sock *sk, **stack;
         const struct udphdr *uh = udp_hdr(skb);
         struct udp_hslot *hslot = udp_hashslot(udptable, net, ntohs(uh->dest));
         int dif;
         unsigned int i, count = 0;
 
+        stack = kmalloc(PAGE_SIZE, GFP_ATOMIC);
+        if (unlikely(!stack)) {
+                UDP6_INC_STATS_BH(net, UDP_MIB_RCVBUFERRORS, proto == IPPROTO_UDPLITE);
+                kfree_skb(skb);
+                return 0;
+        }
         spin_lock(&hslot->lock);
         sk = sk_nulls_head(&hslot->head);
         dif = inet6_iif(skb);
@@ -656,7 +662,7 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
                 stack[count++] = sk;
                 sk = udp_v6_mcast_next(net, sk_nulls_next(sk), uh->dest, daddr,
                                        uh->source, saddr, dif);
-                if (unlikely(count == ARRAY_SIZE(stack))) {
+                if (unlikely(count == PAGE_SIZE/sizeof(*sk))) {
                         if (!sk)
                                 break;
                         flush_stack(stack, count, skb, ~0);
@@ -679,6 +685,7 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
         } else {
                 kfree_skb(skb);
         }
+        kfree(stack);
         return 0;
 }
 
@@ -763,7 +770,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
          */
         if (ipv6_addr_is_multicast(daddr))
                 return __udp6_lib_mcast_deliver(net, skb,
-                                saddr, daddr, udptable);
+                                saddr, daddr, udptable, proto);
 
         /* Unicast */
* Re: Heavy spin_lock contention in __udp4_lib_mcast_deliver increase
From: Eric Dumazet @ 2012-04-26 16:21 UTC (permalink / raw)
To: Shawn Bohrer; +Cc: netdev

On Thu, 2012-04-26 at 18:18 +0200, Eric Dumazet wrote:
> [...]
> -                if (unlikely(count == ARRAY_SIZE(stack))) {
> +                if (unlikely(count == PAGE_SIZE/sizeof(*sk))) {

Oops, should be PAGE_SIZE/sizeof(sk) of course (same problem in
ipv6/udp.c).
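
(To make the arithmetic concrete — a standalone check, assuming a
4096-byte page; `void *` stands in for the `struct sock *` slot type:)

#include <stdio.h>

int main(void)
{
        const unsigned long page_size = 4096;  /* typical x86_64 PAGE_SIZE */

        /* stack[] holds struct sock pointers, so the slot count divides
         * by the pointer size, sizeof(sk); sizeof(*sk) would wrongly
         * divide by the size of a whole struct sock. */
        printf("slots = %lu\n", page_size / sizeof(void *));  /* 512 */
        return 0;
}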
* Re: Heavy spin_lock contention in __udp4_lib_mcast_deliver increase
From: Shawn Bohrer @ 2012-04-26 16:28 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev

On Thu, Apr 26, 2012 at 05:53:15PM +0200, Eric Dumazet wrote:
> Let me understand: you have 300 sockets bound to the same port, so a
> single message must be copied 300 times and delivered to those sockets?

No, in this case it is 300 unique multicast addresses, with one socket
listening to each multicast address, so a single message is only copied
once, to a single socket. The bottleneck appears to be that even though a
single message is only going to be copied to a single socket, we still
have to walk the list of all 300 sockets, while holding the spin lock, to
figure that out. The incoming packet rate is also roughly evenly
distributed across all 300 multicast addresses, so even though we have
multiple receive queues they are all contending for the same spin lock.

--
Shawn
* Re: Heavy spin_lock contention in __udp4_lib_mcast_deliver increase
From: Eric Dumazet @ 2012-04-26 16:31 UTC (permalink / raw)
To: Shawn Bohrer; +Cc: netdev

On Thu, 2012-04-26 at 11:28 -0500, Shawn Bohrer wrote:
> No, in this case it is 300 unique multicast addresses, with one socket
> listening to each multicast address, so a single message is only copied
> once, to a single socket.
> [...]

I repeat my question: are these 300 sockets bound to the same UDP port?

If not, they should be spread across the hash table.

You can make this hash table very big to reduce hash collisions.

Boot parameter: uhash_entries=65536
* Re: Heavy spin_lock contention in __udp4_lib_mcast_deliver increase
From: Shawn Bohrer @ 2012-04-26 21:44 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev

On Thu, Apr 26, 2012 at 06:31:28PM +0200, Eric Dumazet wrote:
> I repeat my question: are these 300 sockets bound to the same UDP port?
>
> If not, they should be spread across the hash table.
>
> You can make this hash table very big to reduce hash collisions.
>
> Boot parameter: uhash_entries=65536

Thanks Eric, I don't know how I missed this. In my test all 300 sockets
were bound to the same UDP port, so they were all falling into the same
hash bucket. Switching the test to use unique ports solves the issue.

I didn't try your other patch, which increases the stack size to 512
sockets, because I don't think we need it. We rarely have more than a
single socket per machine receiving packets on a given multicast address,
so I think the current stack size is sufficient for us. Or perhaps, once
again, I am misunderstanding the purpose of that patch.

--
Shawn
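
For reference, relative to the receiver sketch in the first message, the
fix amounts to binding each socket to its own port (BASE_PORT is
arbitrary; a sketch of the test change, not code from the thread):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

#define BASE_PORT 20000                 /* arbitrary */

/* One socket per group, each bound to a unique port, so the port-keyed
 * UDP hash spreads the sockets across buckets instead of chaining all
 * 300 in one slot. The senders target BASE_PORT + index for group
 * number `index`. */
static int open_group_socket(const char *group, int index)
{
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr;
        struct ip_mreq mreq;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = inet_addr(group);
        addr.sin_port = htons(BASE_PORT + index);  /* was: one shared port */
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        mreq.imr_multiaddr.s_addr = inet_addr(group);
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
        return fd;
}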
Thread overview: 7 messages
2012-04-26 15:15 Heavy spin_lock contention in __udp4_lib_mcast_deliver increase (Shawn Bohrer)
2012-04-26 15:53 ` Eric Dumazet
2012-04-26 16:18   ` Eric Dumazet
2012-04-26 16:21     ` Eric Dumazet
2012-04-26 16:28   ` Shawn Bohrer
2012-04-26 16:31     ` Eric Dumazet
2012-04-26 21:44       ` Shawn Bohrer