netdev.vger.kernel.org archive mirror
From: Eric Dumazet <dada1@cosmosbay.com>
To: Vernon Mauery <vernux@us.ibm.com>
Cc: netdev <netdev@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	rt-users <linux-rt-users@vger.kernel.org>
Subject: Re: High contention on the sk_buff_head.lock
Date: Wed, 18 Mar 2009 20:07:19 +0100	[thread overview]
Message-ID: <49C14667.2040806@cosmosbay.com> (raw)
In-Reply-To: <49C12E64.1000301@us.ibm.com>

Vernon Mauery wrote:
> I have been beating on network throughput in the -rt kernel for some time
> now.  After digging down through the send path of UDP packets, I found
> that the sk_buff_head.lock is under some very high contention.  This lock
> is acquired each time a packet is enqueued on a qdisc and then acquired
> again to dequeue the packet.  Under high networking loads, the enqueueing
> processes are not only contending among each other for the lock, but also
> with the net-tx soft irq.  This makes for some very high contention on this
> one lock.  My testcase is running varying numbers of concurrent netperf
> instances pushing UDP traffic to another machine.  As the count goes from
> 1 to 2, the network performance increases.  But from 2 to 4 and from 4 to 8,
> we see a big decline, with 8 instances pushing about half of what a single
> thread can do.
> 
> Running 2.6.29-rc6-rt3 on an 8-way machine with a 10GbE card (I have tried
> both NetXen and Broadcom, with very similar results), I can only push about
> 1200 Mb/s.  Whereas with the mainline 2.6.29-rc8 kernel, I can push nearly
> 6000 Mb/s. But still not as much as I think is possible.  I was curious and
> decided to see if the mainline kernel was hitting the same lock, and using
> /proc/lock_stat, it is hitting the sk_buff_head.lock as well (it was the
> number one contended lock).
> 
> So while this issue really hits -rt kernels hard, it has a real effect on
> mainline kernels as well.  The contention of the spinlocks is amplified
> when they get turned into rt-mutexes, which causes a double context switch.
> 
> Below is the top of the lock_stat for 2.6.29-rc8.  This was captured from
> a 1 minute network stress test.  The next high contender had 2 orders of
> magnitude fewer contentions.  Think of the throughput increase if we could
> ease this contention a bit.  We might even be able to saturate a 10GbE
> link.
> 
> lock_stat version 0.3
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 
>       class name    con-bounces    contentions   waittime-min   waittime-max   waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max   holdtime-total
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 
>    &list->lock#3:      24517307       24643791           0.71        1286.62      56516392.42       34834296       44904018           0.60         164.79     31314786.02
>     -------------
>    &list->lock#3       15596927    [<ffffffff812474da>] dev_queue_xmit+0x2ea/0x468
>    &list->lock#3        9046864    [<ffffffff812546e9>] __qdisc_run+0x11b/0x1ef
>     -------------
>    &list->lock#3        6525300    [<ffffffff812546e9>] __qdisc_run+0x11b/0x1ef
>    &list->lock#3       18118491    [<ffffffff812474da>] dev_queue_xmit+0x2ea/0x468
> 
> 
> The story is the same for -rt kernels, only the waittime and holdtime are
> both orders of magnitude greater.
> 
> I am not exactly clear on the solution, but if I understand correctly, in
> the past there has been some discussion of batched enqueueing and
> dequeueing.  Is anyone else working on this problem right now who has just
> not yet posted anything for review?  Questions, comments, flames?
> 

Yes, we have a known contention point here, but before adding more complex code,
could you try the following patch, please?

[PATCH] net: Reorder fields of struct Qdisc

dev_queue_xmit() needs to dirty fields "state" and "q"

On x86_64, these fields currently span two cache lines, causing more
cache-line ping-pong than necessary.

Before patch:

offsetof(struct Qdisc, state)=0x38
offsetof(struct Qdisc, q)=0x48
offsetof(struct Qdisc, dev_queue)=0x60

After patch:

offsetof(struct Qdisc, dev_queue)=0x38
offsetof(struct Qdisc, state)=0x48
offsetof(struct Qdisc, q)=0x50


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f8c4742..e24feeb 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -51,10 +51,11 @@ struct Qdisc
 	u32			handle;
 	u32			parent;
 	atomic_t		refcnt;
-	unsigned long		state;
+	struct netdev_queue	*dev_queue;
+
 	struct sk_buff		*gso_skb;
+	unsigned long		state;
 	struct sk_buff_head	q;
-	struct netdev_queue	*dev_queue;
 	struct Qdisc		*next_sched;
 	struct list_head	list;
 



