Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)

From: Eric Dumazet <dada1@cosmosbay.com>
To: Andrew Dickinson <andrew@whydna.net>
Cc: David Miller <davem@davemloft.net>,
	jelaas@gmail.com, netdev@vger.kernel.org
Subject: Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
Date: Fri, 01 May 2009 08:14:03 +0200
Message-ID: <49FA932B.4030405@cosmosbay.com>
In-Reply-To: <606676310904301653w28f3226fsc477dc92b6a7cdbc@mail.gmail.com>

Andrew Dickinson wrote:
> OK... I've got some more data on it...
> 
> I passed a small number of packets through the system and added a ton
> of printks to it ;-P
> 
> Here's the distribution of values as seen by
> skb_rx_queue_recorded()... count on the left, value on the right:
>      37 0
>      31 1
>      31 2
>      39 3
>      37 4
>      31 5
>      42 6
>      39 7
> 
> That's nice and even....  Here's what's getting returned from the
> skb_tx_hash().  Again, count on the left, value on the right:
>      31 0
>      81 1
>      37 2
>      70 3
>      37 4
>      31 6
> 
> Note that we're entirely missing 5 and 7 and that those interrupts
> seem to have gotten munged onto 1 and 3.
> 
> I think the voodoo lies within:
>     return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
> 
> David,  I made the change that you suggested:
>         //hash = skb_get_rx_queue(skb);
>         return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
> 
> And now, I see a nice even mixing of interrupts on the TX side (yay!).
> 
> However, my problem's not solved entirely... here's what top is showing me:
> top - 23:37:49 up 9 min,  1 user,  load average: 3.93, 2.68, 1.21
> Tasks: 119 total,   5 running, 114 sleeping,   0 stopped,   0 zombie
> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,  2.0%id,  0.0%wa,  4.0%hi, 94.0%si,  0.0%st
> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,  5.6%id,  0.0%wa,  2.3%hi, 92.1%si,  0.0%st
> Mem:  16403476k total,   335884k used, 16067592k free,    10108k buffers
> Swap:  2096472k total,        0k used,  2096472k free,   146364k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>     7 root      15  -5     0    0    0 R 100.2  0.0   5:35.24 ksoftirqd/1
>    13 root      15  -5     0    0    0 R 100.2  0.0   5:36.98 ksoftirqd/3
>    19 root      15  -5     0    0    0 R 97.8  0.0   5:34.52 ksoftirqd/5
>    25 root      15  -5     0    0    0 R 94.5  0.0   5:13.56 ksoftirqd/7
>  3905 root      20   0 12612 1084  820 R  0.3  0.0   0:00.14 top
> <snip>
> 
> 
> It appears that only the odd CPUs are actually handling the
> interrupts, which doesn't jibe with what /proc/interrupts shows me:
>             CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>   66:    2970565          0          0          0          0          0          0          0   PCI-MSI-edge   eth2-rx-0
>   67:         28     821122          0          0          0          0          0          0   PCI-MSI-edge   eth2-rx-1
>   68:         28          0    2943299          0          0          0          0          0   PCI-MSI-edge   eth2-rx-2
>   69:         28          0          0     817776          0          0          0          0   PCI-MSI-edge   eth2-rx-3
>   70:         28          0          0          0    2963924          0          0          0   PCI-MSI-edge   eth2-rx-4
>   71:         28          0          0          0          0     821032          0          0   PCI-MSI-edge   eth2-rx-5
>   72:         28          0          0          0          0          0    2979987          0   PCI-MSI-edge   eth2-rx-6
>   73:         28          0          0          0          0          0          0     845422   PCI-MSI-edge   eth2-rx-7
>   74:    4664732          0          0          0          0          0          0          0   PCI-MSI-edge   eth2-tx-0
>   75:         34    4679312          0          0          0          0          0          0   PCI-MSI-edge   eth2-tx-1
>   76:         28          0    4665014          0          0          0          0          0   PCI-MSI-edge   eth2-tx-2
>   77:         28          0          0    4681531          0          0          0          0   PCI-MSI-edge   eth2-tx-3
>   78:         28          0          0          0    4665793          0          0          0   PCI-MSI-edge   eth2-tx-4
>   79:         28          0          0          0          0    4671596          0          0   PCI-MSI-edge   eth2-tx-5
>   80:         28          0          0          0          0          0    4665279          0   PCI-MSI-edge   eth2-tx-6
>   81:         28          0          0          0          0          0          0    4664504   PCI-MSI-edge   eth2-tx-7
>   82:          2          0          0          0          0          0          0          0   PCI-MSI-edge   eth2:lsc
> 
> 
> Why would ksoftirqd only run on half of the cores (and only the odd
> ones to boot)?  The one commonality that strikes me is that
> all the odd CPU#'s are on the same physical processor:
> 
> -bash-3.2# cat /proc/cpuinfo | grep -E '(physical|processor)' | grep -v virtual
> processor	: 0
> physical id	: 0
> processor	: 1
> physical id	: 1
> processor	: 2
> physical id	: 0
> processor	: 3
> physical id	: 1
> processor	: 4
> physical id	: 0
> processor	: 5
> physical id	: 1
> processor	: 6
> physical id	: 0
> processor	: 7
> physical id	: 1
> 
> I did compile the kernel with NUMA support... am I being bitten by
> something there?  Any other thoughts on where I should look?
> 
> Also... is there an incantation to get NAPI to work in the torvalds
> kernel?  As you can see, I'm generating quite a few interrupts.
> 
> -A
> 
> 
> On Thu, Apr 30, 2009 at 7:08 AM, David Miller <davem@davemloft.net> wrote:
>> From: Andrew Dickinson <andrew@whydna.net>
>> Date: Thu, 30 Apr 2009 07:04:33 -0700
>>
>>>  I'll do some debugging around skb_tx_hash() and see if I can make
>>> sense of it.  I'll let you know what I find.  My hypothesis is that
>>> skb_record_rx_queue() isn't being called, but I should dig into it
>>> before I start making claims. ;-P
>> That's one possibility.
>>
>> Another is that the hashing isn't working out.  One way to
>> play with that is to simply replace the:
>>
>>                hash = skb_get_rx_queue(skb);
>>
>> in skb_tx_hash() with something like:
>>
>>                return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>
>> and see if that improves the situation.
>>

Hi Andrew

Please try the following patch (I don't have a multi-queue NIC, sorry).

I will do the follow-up patch if this one corrects the distribution problem
you noticed.

Thanks very much for all your findings.
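
For illustration only, here is a small user-space sketch of why hashing
eight fixed RX queue numbers tends to leave some TX queues unused, while a
plain modulo cannot. If the hash behaved like a uniformly random function,
only 8!/8^8 (about 0.24%) of the possible mappings of 8 inputs onto 8
queues would be collision-free. The mixer below is a stand-in (the
MurmurHash3 finalizer), not the kernel's jhash_1word(), and the function
names, the seed parameter and MAX_QUEUES are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_QUEUES 64	/* hypothetical upper bound for this sketch */

/* Stand-in 32-bit mixer (MurmurHash3 finalizer). NOT the kernel's jhash_1word(). */
static uint32_t mix32(uint32_t h)
{
	h ^= h >> 16;
	h *= 0x85ebca6bu;
	h ^= h >> 13;
	h *= 0xc2b2ae35u;
	h ^= h >> 16;
	return h;
}

/* Scale a 32-bit hash into [0, n), as the skb_tx_hash() line quoted above does. */
static uint16_t scale_hash(uint32_t hash, uint32_t n)
{
	return (uint16_t)(((uint64_t)hash * n) >> 32);
}

/* Number of distinct TX queues reached when RX queues 0..n-1 are hashed first. */
static unsigned int distinct_hashed(uint32_t n, uint32_t seed)
{
	unsigned char seen[MAX_QUEUES] = { 0 };
	unsigned int distinct = 0;
	uint32_t rxq;

	assert(n > 0 && n <= MAX_QUEUES);
	for (rxq = 0; rxq < n; rxq++) {
		uint16_t txq = scale_hash(mix32(rxq ^ seed), n);

		if (!seen[txq]) {
			seen[txq] = 1;
			distinct++;
		}
	}
	return distinct;
}

/* Same count when the RX queue number is reduced with a plain modulo. */
static unsigned int distinct_modulo(uint32_t n)
{
	unsigned char seen[MAX_QUEUES] = { 0 };
	unsigned int distinct = 0;
	uint32_t rxq;

	assert(n > 0 && n <= MAX_QUEUES);
	for (rxq = 0; rxq < n; rxq++) {
		uint32_t txq = rxq % n;

		if (!seen[txq]) {
			seen[txq] = 1;
			distinct++;
		}
	}
	return distinct;
}
```

Calling distinct_modulo(8) from a small main() always reports 8 queues in
use; distinct_hashed(8, seed) depends on the seed and may well report
fewer, which would match the missing queues 5 and 7 in the counts above.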

[PATCH] net: skb_tx_hash() improvements

When skb_rx_queue_recorded() is true, we don't want to use the jhash
distribution, as the device driver told us exactly which queue was selected
at RX time. jhash makes a statistical shuffle, but this won't work with
8 static inputs.

A later improvement would be to compute the reciprocal value of
real_num_tx_queues to avoid a divide here. But this computation should be
done once, when real_num_tx_queues is set. This needs a separate patch, and
a new field in struct net_device.

Reported-by: Andrew Dickinson <andrew@whydna.net>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

diff --git a/net/core/dev.c b/net/core/dev.c
index 308a7d0..e2e9e4a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1735,11 +1735,12 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
 {
 	u32 hash;
 
-	if (skb_rx_queue_recorded(skb)) {
-		hash = skb_get_rx_queue(skb);
-	} else if (skb->sk && skb->sk->sk_hash) {
+	if (skb_rx_queue_recorded(skb))
+		return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
+
+	if (skb->sk && skb->sk->sk_hash)
 		hash = skb->sk->sk_hash;
-	} else
+	else
 		hash = skb->protocol;
 
 	hash = jhash_1word(hash, skb_tx_hashrnd);
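
The reciprocal idea mentioned in the changelog could look roughly like the
user-space sketch below. It is only an illustration under stated
assumptions: queue_reciprocal() and mod_by_reciprocal() are hypothetical
names, not existing kernel functions, and the result is exact only because
the RX queue index is a 16-bit value and the queue count is assumed to be
at most 65536.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: ceil(2^32 / n), to be computed once whenever the
 * number of TX queues changes. n == 1 is excluded because the result
 * would not fit in 32 bits; the caller below handles that case itself.
 */
static uint32_t queue_reciprocal(uint32_t n)
{
	assert(n >= 2 && n <= 65536);
	return (uint32_t)((((uint64_t)1 << 32) + n - 1) / n);
}

/*
 * x % n for a 16-bit x, using one multiply and one shift instead of a
 * divide. recip must be queue_reciprocal(n) when n >= 2.
 */
static uint16_t mod_by_reciprocal(uint16_t x, uint32_t n, uint32_t recip)
{
	uint32_t q;

	if (n == 1)
		return 0;

	q = (uint32_t)(((uint64_t)x * recip) >> 32);	/* q == x / n */
	return (uint16_t)(x - q * n);
}
```

With 8 queues, queue_reciprocal(8) is 0x20000000, and
mod_by_reciprocal(13, 8, 0x20000000) gives 5, the same as 13 % 8.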
