All of lore.kernel.org
 help / color / mirror / Atom feed
From: Eric Dumazet <dada1@cosmosbay.com>
To: David Miller <davem@davemloft.net>
Cc: sven@thebigcorporation.com, ghaskins@novell.com,
	vernux@us.ibm.com, andi@firstfloor.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org,
	pmullaney@novell.com
Subject: Re: High contention on the sk_buff_head.lock
Date: Thu, 19 Mar 2009 06:49:19 +0100	[thread overview]
Message-ID: <49C1DCDF.6050300@cosmosbay.com> (raw)
In-Reply-To: <20090318.185441.138157931.davem@davemloft.net>

David Miller a écrit :
> From: Sven-Thorsten Dietrich <sven@thebigcorporation.com>
> Date: Wed, 18 Mar 2009 18:43:27 -0700
> 
>> Do we have to rule-out per-CPU queues, that aggregate into a master
>> queue in a batch-wise manner? 
> 
> That would violate the properties and characteristics expected by
> the packet scheduler, wrt. to fair based fairness, rate limiting,
> etc.
> 
> The only legal situation where we can parallelize to single device
> is where only the most trivial packet scheduler is attached to
> the device and the device is multiqueue, and that is exactly what
> we do right now.

I agree with you David.

Still, there is room for improvements, since :

1) default qdisc is pfifo_fast. This beast uses three sk_buff_head (96 bytes)
  where it could use 3 smaller list_head (3 * 16 = 48 bytes on x86_64)

 (assuming sizeof(spinlock_t) is only 4 bytes, but it's more than that
 on various situations (LOCKDEP, ...)

2) struct Qdisc layout could be better, letting read mostly fields
   at beginning of structure. (ie move 'dev_queue', 'next_sched', reshape_fail,
   u32_node, __parent, ...)

  'struct gnet_stats_basic' has a 32 bits hole

   'gnet_stats_queue' could be split, at least in Qdisc, so that three
   seldom use fields (drops, requeues, overlimits) go in a different cache line.

   gnet_stats_rate_est might be also moved in a 'not very used' cache line, if
   I am not mistaken ?

3) In stress situation a CPU A queues a skb to a sk_buff_head, but a CPU B
   dequeues it to feed device, involving an expensive cache line miss
   on the skb.{next|prev} (to set them to NULL)

   We could:
      Use a special dequeue op that doesnt touch skb.{next|prev}
   Eventually set next/prev to NULL after q.lock is released



--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

WARNING: multiple messages have this Message-ID (diff)
From: Eric Dumazet <dada1@cosmosbay.com>
To: David Miller <davem@davemloft.net>
Cc: sven@thebigcorporation.com, ghaskins@novell.com,
	vernux@us.ibm.com, andi@firstfloor.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org,
	pmullaney@novell.com
Subject: Re: High contention on the sk_buff_head.lock
Date: Thu, 19 Mar 2009 06:49:19 +0100	[thread overview]
Message-ID: <49C1DCDF.6050300@cosmosbay.com> (raw)
In-Reply-To: <20090318.185441.138157931.davem@davemloft.net>

David Miller a écrit :
> From: Sven-Thorsten Dietrich <sven@thebigcorporation.com>
> Date: Wed, 18 Mar 2009 18:43:27 -0700
> 
>> Do we have to rule-out per-CPU queues, that aggregate into a master
>> queue in a batch-wise manner? 
> 
> That would violate the properties and characteristics expected by
> the packet scheduler, wrt. to fair based fairness, rate limiting,
> etc.
> 
> The only legal situation where we can parallelize to single device
> is where only the most trivial packet scheduler is attached to
> the device and the device is multiqueue, and that is exactly what
> we do right now.

I agree with you David.

Still, there is room for improvements, since :

1) default qdisc is pfifo_fast. This beast uses three sk_buff_head (96 bytes)
  where it could use 3 smaller list_head (3 * 16 = 48 bytes on x86_64)

 (assuming sizeof(spinlock_t) is only 4 bytes, but it's more than that
 on various situations (LOCKDEP, ...)

2) struct Qdisc layout could be better, letting read mostly fields
   at beginning of structure. (ie move 'dev_queue', 'next_sched', reshape_fail,
   u32_node, __parent, ...)

  'struct gnet_stats_basic' has a 32 bits hole

   'gnet_stats_queue' could be split, at least in Qdisc, so that three
   seldom use fields (drops, requeues, overlimits) go in a different cache line.

   gnet_stats_rate_est might be also moved in a 'not very used' cache line, if
   I am not mistaken ?

3) In stress situation a CPU A queues a skb to a sk_buff_head, but a CPU B
   dequeues it to feed device, involving an expensive cache line miss
   on the skb.{next|prev} (to set them to NULL)

   We could:
      Use a special dequeue op that doesnt touch skb.{next|prev}
   Eventually set next/prev to NULL after q.lock is released




  reply	other threads:[~2009-03-19  5:49 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-03-18 17:24 High contention on the sk_buff_head.lock Vernon Mauery
2009-03-18 19:07 ` Eric Dumazet
2009-03-18 20:17   ` Vernon Mauery
2009-03-20 23:29     ` Jarek Poplawski
2009-03-23  8:32       ` Eric Dumazet
2009-03-23  8:32         ` Eric Dumazet
2009-03-23  8:37         ` David Miller
2009-03-23  8:50           ` Jarek Poplawski
2009-04-02 14:13           ` Herbert Xu
2009-04-02 14:15             ` Herbert Xu
2009-03-18 20:54 ` Andi Kleen
2009-03-18 21:03   ` David Miller
2009-03-18 21:10     ` Vernon Mauery
2009-03-18 21:38       ` David Miller
2009-03-18 21:49         ` Vernon Mauery
2009-03-19  1:02           ` David Miller
2009-03-18 21:54         ` Gregory Haskins
2009-03-19  1:03           ` David Miller
2009-03-19  1:13             ` Sven-Thorsten Dietrich
2009-03-19  1:17               ` David Miller
2009-03-19  1:43                 ` Sven-Thorsten Dietrich
2009-03-19  1:54                   ` David Miller
2009-03-19  5:49                     ` Eric Dumazet [this message]
2009-03-19  5:49                       ` Eric Dumazet
2009-03-19  5:58                       ` David Miller
2009-03-19 14:04                         ` [PATCH] net: reorder struct Qdisc for better SMP performance Eric Dumazet
2009-03-19 14:04                           ` Eric Dumazet
2009-03-20  8:33                           ` David Miller
2009-03-19 13:45                   ` High contention on the sk_buff_head.lock Andi Kleen
2009-03-19  3:48             ` Gregory Haskins
2009-03-19  5:38               ` David Miller
2009-03-19 12:42                 ` Gregory Haskins
2009-03-19 20:52                   ` David Miller
2009-03-19 12:50             ` Peter W. Morreale
2009-03-19  7:15           ` Evgeniy Polyakov
2009-03-18 21:07   ` Vernon Mauery
2009-03-18 21:45     ` Eilon Greenstein
2009-03-18 21:51       ` Vernon Mauery
2009-03-18 21:59         ` Andi Kleen
2009-03-18 22:19           ` Rick Jones
2009-03-19 12:59   ` Peter W. Morreale
2009-03-19 13:36     ` Peter W. Morreale
2009-03-19 13:46     ` Andi Kleen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=49C1DCDF.6050300@cosmosbay.com \
    --to=dada1@cosmosbay.com \
    --cc=andi@firstfloor.org \
    --cc=davem@davemloft.net \
    --cc=ghaskins@novell.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rt-users@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pmullaney@novell.com \
    --cc=sven@thebigcorporation.com \
    --cc=vernux@us.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.