From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eric Dumazet
Subject: Re: High contention on the sk_buff_head.lock
Date: Wed, 18 Mar 2009 20:07:19 +0100
Message-ID: <49C14667.2040806@cosmosbay.com>
References: <49C12E64.1000301@us.ibm.com>
In-Reply-To: <49C12E64.1000301@us.ibm.com>
Cc: netdev , LKML , rt-users
To: Vernon Mauery

Vernon Mauery wrote:
> I have been beating on network throughput in the -rt kernel for some time
> now.  After digging down through the send path of UDP packets, I found
> that the sk_buff_head.lock is under some very high contention.  This lock
> is acquired each time a packet is enqueued on a qdisc and then acquired
> again to dequeue the packet.  Under high networking loads, the enqueueing
> processes are not only contending among each other for the lock, but also
> with the net-tx soft irq.  This makes for some very high contention on this
> one lock.  My testcase is running varying numbers of concurrent netperf
> instances pushing UDP traffic to another machine.  As the count goes from
> 1 to 2, the network performance increases.  But from 2 to 4 and from 4
> to 8, we see a big decline, with 8 instances pushing about half of what a
> single thread can do.
>
> Running 2.6.29-rc6-rt3 on an 8-way machine with a 10GbE card (I have tried
> both NetXen and Broadcom, with very similar results), I can only push about
> 1200 Mb/s.  Whereas with the mainline 2.6.29-rc8 kernel, I can push nearly
> 6000 Mb/s.  But still not as much as I think is possible.
> I was curious and
> decided to see if the mainline kernel was hitting the same lock, and using
> /proc/lock_stat, it is hitting the sk_buff_head.lock as well (it was the
> number one contended lock).
>
> So while this issue really hits -rt kernels hard, it has a real effect on
> mainline kernels as well.  The contention of the spinlocks is amplified
> when they get turned into rt-mutexes, which causes a double context switch.
>
> Below is the top of the lock_stat for 2.6.29-rc8.  This was captured from
> a 1 minute network stress test.  The next high contender had 2 orders of
> magnitude fewer contentions.  Think of the throughput increase if we could
> ease this contention a bit.  We might even be able to saturate a 10GbE
> link.
>
> lock_stat version 0.3
> -----------------------------------------------------------------------------------------------------------------------------------------------------------
>     class name  con-bounces  contentions  waittime-min  waittime-max  waittime-total  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total
> -----------------------------------------------------------------------------------------------------------------------------------------------------------
>
> &list->lock#3:     24517307     24643791          0.71       1286.62     56516392.42     34834296      44904018          0.60        164.79     31314786.02
>   -------------
>   &list->lock#3   15596927   [] dev_queue_xmit+0x2ea/0x468
>   &list->lock#3    9046864   [] __qdisc_run+0x11b/0x1ef
>   -------------
>   &list->lock#3    6525300   [] __qdisc_run+0x11b/0x1ef
>   &list->lock#3   18118491   [] dev_queue_xmit+0x2ea/0x468
>
>
> The story is the same for -rt kernels, only the waittime and holdtime
> are both orders of magnitude greater.
>
> I am not exactly clear on the solution, but if I understand correctly,
> in the past there has been some discussion of batched enqueueing and
> dequeueing.
> Is
> anyone else working on this problem right now who has just not yet posted
> anything for review?  Questions, comments, flames?
>

Yes, we have a known contention point here, but before adding more complex
code, could you please try the following patch?

[PATCH] net: Reorder fields of struct Qdisc

dev_queue_xmit() needs to dirty fields "state" and "q"

On x86_64 arch, they currently span two cache lines, involving more
cache line ping pongs than necessary.

Before patch:

offsetof(struct Qdisc, state)=0x38
offsetof(struct Qdisc, q)=0x48
offsetof(struct Qdisc, dev_queue)=0x60

After patch:

offsetof(struct Qdisc, dev_queue)=0x38
offsetof(struct Qdisc, state)=0x48
offsetof(struct Qdisc, q)=0x50

Signed-off-by: Eric Dumazet

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f8c4742..e24feeb 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -51,10 +51,11 @@ struct Qdisc
        u32                     handle;
        u32                     parent;
        atomic_t                refcnt;
-       unsigned long           state;
+       struct netdev_queue     *dev_queue;
+       struct sk_buff          *gso_skb;
+       unsigned long           state;
        struct sk_buff_head     q;
-       struct netdev_queue     *dev_queue;
        struct Qdisc            *next_sched;
        struct list_head        list;
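
For readers checking the numbers, the cache-line arithmetic behind the
offsets quoted in the changelog can be sketched as below. This is only an
illustration under stated assumptions: it assumes 64-byte cache lines
(typical for x86_64, but not stated in the mail) and that struct Qdisc
starts on a cache-line boundary. The helper names are hypothetical and are
not kernel APIs, and only the start offset of each field is considered.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines, typical on x86_64 (not stated in the mail). */
#define CACHE_LINE_BYTES 64

/*
 * Hypothetical helper: index of the cache line that holds byte `offset`,
 * assuming the structure itself starts on a cache-line boundary.
 */
static size_t cache_line_of(size_t offset)
{
	return offset / CACHE_LINE_BYTES;
}

/* Hypothetical helper: do two field start offsets fall in the same line? */
static bool same_cache_line(size_t a, size_t b)
{
	return cache_line_of(a) == cache_line_of(b);
}
```

With the "before" offsets, same_cache_line(0x38, 0x48) is false: "state"
starts in line 0 and "q" in line 1, so dev_queue_xmit() dirties two lines.
With the "after" offsets, same_cache_line(0x48, 0x50) is true, so both
fields start in line 1.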