From: Eric Dumazet <dada1@cosmosbay.com>
To: Andrew Dickinson <andrew@whydna.net>
Cc: David Miller <davem@davemloft.net>,
	jelaas@gmail.com, netdev@vger.kernel.org
Subject: Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
Date: Fri, 01 May 2009 08:40:00 +0200
Message-ID: <49FA9940.1010203@cosmosbay.com>
In-Reply-To: <606676310904302319u1eacc634qde4b1f70e9936779@mail.gmail.com>

Andrew Dickinson wrote:
> On Thu, Apr 30, 2009 at 11:14 PM, Eric Dumazet <dada1@cosmosbay.com> wrote:
>> Andrew Dickinson wrote:
>>> OK... I've got some more data on it...
>>>
>>> I passed a small number of packets through the system and added a ton
>>> of printks to it ;-P
>>>
>>> Here's the distribution of values as seen by
>>> skb_rx_queue_recorded()... count on the left, value on the right:
>>>      37 0
>>>      31 1
>>>      31 2
>>>      39 3
>>>      37 4
>>>      31 5
>>>      42 6
>>>      39 7
>>>
>>> That's nice and even....  Here's what's getting returned from the
>>> skb_tx_hash().  Again, count on the left, value on the right:
>>>      31 0
>>>      81 1
>>>      37 2
>>>      70 3
>>>      37 4
>>>      31 6
>>>
>>> Note that we're entirely missing 5 and 7 and that those interrupts
>>> seem to have gotten munged onto 1 and 3.
>>>
>>> I think the voodoo lies within:
>>>     return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
>>>
>>> David,  I made the change that you suggested:
>>>         //hash = skb_get_rx_queue(skb);
>>>         return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>
>>> And now, I see a nice even mixing of interrupts on the TX side (yay!).
>>>
>>> However, my problem's not solved entirely... here's what top is showing me:
>>> top - 23:37:49 up 9 min,  1 user,  load average: 3.93, 2.68, 1.21
>>> Tasks: 119 total,   5 running, 114 sleeping,   0 stopped,   0 zombie
>>> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
>>> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
>>> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
>>> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
>>> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
>>> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,  2.0%id,  0.0%wa,  4.0%hi, 94.0%si,  0.0%st
>>> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>>> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,  5.6%id,  0.0%wa,  2.3%hi, 92.1%si,  0.0%st
>>> Mem:  16403476k total,   335884k used, 16067592k free,    10108k buffers
>>> Swap:  2096472k total,        0k used,  2096472k free,   146364k cached
>>>
>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>     7 root      15  -5     0    0    0 R 100.2  0.0   5:35.24 ksoftirqd/1
>>>    13 root      15  -5     0    0    0 R 100.2  0.0   5:36.98 ksoftirqd/3
>>>    19 root      15  -5     0    0    0 R  97.8  0.0   5:34.52 ksoftirqd/5
>>>    25 root      15  -5     0    0    0 R  94.5  0.0   5:13.56 ksoftirqd/7
>>>  3905 root      20   0 12612 1084  820 R   0.3  0.0   0:00.14 top
>>> <snip>
>>>
>>>
>>> It appears that only the odd CPUs are actually handling the
>>> interrupts, which doesn't jibe with what /proc/interrupts shows me:
>>>             CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>>>   66:    2970565          0          0          0          0          0          0          0   PCI-MSI-edge   eth2-rx-0
>>>   67:         28     821122          0          0          0          0          0          0   PCI-MSI-edge   eth2-rx-1
>>>   68:         28          0    2943299          0          0          0          0          0   PCI-MSI-edge   eth2-rx-2
>>>   69:         28          0          0     817776          0          0          0          0   PCI-MSI-edge   eth2-rx-3
>>>   70:         28          0          0          0    2963924          0          0          0   PCI-MSI-edge   eth2-rx-4
>>>   71:         28          0          0          0          0     821032          0          0   PCI-MSI-edge   eth2-rx-5
>>>   72:         28          0          0          0          0          0    2979987          0   PCI-MSI-edge   eth2-rx-6
>>>   73:         28          0          0          0          0          0          0     845422   PCI-MSI-edge   eth2-rx-7
>>>   74:    4664732          0          0          0          0          0          0          0   PCI-MSI-edge   eth2-tx-0
>>>   75:         34    4679312          0          0          0          0          0          0   PCI-MSI-edge   eth2-tx-1
>>>   76:         28          0    4665014          0          0          0          0          0   PCI-MSI-edge   eth2-tx-2
>>>   77:         28          0          0    4681531          0          0          0          0   PCI-MSI-edge   eth2-tx-3
>>>   78:         28          0          0          0    4665793          0          0          0   PCI-MSI-edge   eth2-tx-4
>>>   79:         28          0          0          0          0    4671596          0          0   PCI-MSI-edge   eth2-tx-5
>>>   80:         28          0          0          0          0          0    4665279          0   PCI-MSI-edge   eth2-tx-6
>>>   81:         28          0          0          0          0          0          0    4664504   PCI-MSI-edge   eth2-tx-7
>>>   82:          2          0          0          0          0          0          0          0   PCI-MSI-edge   eth2:lsc
>>>
>>>
>>> Why would ksoftirqd only run on half of the cores (and only the odd
>>> ones to boot)?  The one commonality that strikes me is that all the
>>> odd CPU numbers are on the same physical processor:
>>>
>>> -bash-3.2# cat /proc/cpuinfo | grep -E '(physical|processor)' | grep -v virtual
>>> processor     : 0
>>> physical id   : 0
>>> processor     : 1
>>> physical id   : 1
>>> processor     : 2
>>> physical id   : 0
>>> processor     : 3
>>> physical id   : 1
>>> processor     : 4
>>> physical id   : 0
>>> processor     : 5
>>> physical id   : 1
>>> processor     : 6
>>> physical id   : 0
>>> processor     : 7
>>> physical id   : 1
>>>
>>> I did compile the kernel with NUMA support... am I being bitten by
>>> something there?  Other thoughts on where I should look?
>>>
>>> Also... is there an incantation to get NAPI to work in the Torvalds
>>> kernel?  As you can see, I'm generating quite a few interrupts.
>>>
>>> -A
>>>
>>>
>>> On Thu, Apr 30, 2009 at 7:08 AM, David Miller <davem@davemloft.net> wrote:
>>>> From: Andrew Dickinson <andrew@whydna.net>
>>>> Date: Thu, 30 Apr 2009 07:04:33 -0700
>>>>
>>>>>  I'll do some debugging around skb_tx_hash() and see if I can make
>>>>> sense of it.  I'll let you know what I find.  My hypothesis is that
>>>>> skb_record_rx_queue() isn't being called, but I should dig into it
>>>>> before I start making claims. ;-P
>>>> That's one possibility.
>>>>
>>>> Another is that the hashing isn't working out.  One way to
>>>> play with that is to simply replace the:
>>>>
>>>>                hash = skb_get_rx_queue(skb);
>>>>
>>>> in skb_tx_hash() with something like:
>>>>
>>>>                return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>>
>>>> and see if that improves the situation.
>>>>
>> Hi Andrew
>>
>> Please try the following patch (I don't have a multi-queue NIC, sorry).
>>
>> I will do the follow-up patch if this one corrects the distribution problem
>> you noticed.
>>
>> Thanks very much for all your findings.
>>
>> [PATCH] net: skb_tx_hash() improvements
>>
>> When skb_rx_queue_recorded() is true, we don't want to use the jhash distribution,
>> as the device driver told us exactly which queue was selected at RX time.
>> jhash makes a statistical shuffle, but this won't work with 8 static inputs.
>>
>> A later improvement would be to compute the reciprocal value of real_num_tx_queues
>> to avoid a divide here. But this computation should be done once,
>> when real_num_tx_queues is set. That needs a separate patch, and a new
>> field in struct net_device.
>>
>> Reported-by: Andrew Dickinson <andrew@whydna.net>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>>
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 308a7d0..e2e9e4a 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -1735,11 +1735,12 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
>>  {
>>        u32 hash;
>>
>> -       if (skb_rx_queue_recorded(skb)) {
>> -               hash = skb_get_rx_queue(skb);
>> -       } else if (skb->sk && skb->sk->sk_hash) {
>> +       if (skb_rx_queue_recorded(skb))
>> +               return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>> +
>> +       if (skb->sk && skb->sk->sk_hash)
>>                hash = skb->sk->sk_hash;
>> -       } else
>> +       else
>>                hash = skb->protocol;
>>
>>        hash = jhash_1word(hash, skb_tx_hashrnd);
>>
>>
> 
> Eric,
> 
> That's exactly what I did!  It solved the problem of hot-spots on some
> interrupts.  However, I now have a new problem (which is documented in
> my previous posts).  The short of it is that I'm only seeing 4 (out of
> 8) ksoftirqds busy under heavy load... the other 4 seem idle.  The
> busy 4 are always on one physical package (though not always the same
> package; it changes on reboot or when I change some parameters via
> ethtool), but never both.  This is despite /proc/interrupts showing me
> that all 8 interrupts are being hit evenly.  There are more details in
> my last mail. ;-D
> 

Well, I was reacting to your 'voodoo' comment about

return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);

This is not the problem. The problem comes from jhash(), which shuffles
the input, while in your case we want to select the same output queue
because of CPU affinities. No shuffle is required.

(assuming cpu0 is handling tx-queue-0 and rx-queue-0,
          cpu1 is handling tx-queue-1 and rx-queue-1, and so on...)
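
To make the jhash point concrete, here is a small userspace sketch (not kernel
code; a multiplicative golden-ratio hash stands in for jhash_1word(), so the
exact mix differs). Hashing the 8 static rx queue numbers and then applying the
multiply+shift scaling leaves at least one tx queue unused and doubles up
another, while the direct modulo of the recorded rx queue hits every tx queue
exactly once:

/*
 * Userspace sketch of the distribution problem.  0x9E3779B1 (a
 * golden-ratio multiplier) stands in for jhash_1word(); any "good"
 * hash of only 8 static inputs shows the same kind of clumping.
 */
#include <stdio.h>
#include <stdint.h>

#define NUM_QUEUES 8

int main(void)
{
	unsigned int scaled[NUM_QUEUES] = { 0 };
	unsigned int direct[NUM_QUEUES] = { 0 };
	unsigned int rxq;

	for (rxq = 0; rxq < NUM_QUEUES; rxq++) {
		uint32_t hash = rxq * 0x9E3779B1u;	/* stand-in for jhash_1word() */

		/* multiply+shift scaling used by skb_tx_hash() */
		scaled[(uint16_t)(((uint64_t)hash * NUM_QUEUES) >> 32)]++;

		/* direct mapping from the patch: rx queue % tx queue count */
		direct[rxq % NUM_QUEUES]++;
	}

	printf("txq  hashed+scaled  direct\n");
	for (rxq = 0; rxq < NUM_QUEUES; rxq++)
		printf("%3u  %13u  %6u\n", rxq, scaled[rxq], direct[rxq]);
	return 0;
}

With many distinct flows the hashed path spreads the load just fine; it is only
the 8 static rx queue values that defeat it.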

Then either /proc/interrupts is showing that your rx interrupts are not evenly
distributed, or ksoftirqd is being woken only on one physical package while on
the other CPUs the softirq work is done directly on IRQ exit rather than from
ksoftirqd (ksoftirqd only takes over under sustained load). It's only a matter
of load.
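
As for the reciprocal-divide follow-up mentioned in the commit message of the
patch above, a minimal sketch using the kernel's reciprocal_div helpers could
look like the following. The net_device field reciprocal_tx_queues is
hypothetical (it is not part of the patch) and would be filled in once,
wherever real_num_tx_queues is set:

#include <linux/netdevice.h>
#include <linux/reciprocal_div.h>

/*
 * Hypothetical setup, done once when real_num_tx_queues is assigned:
 *	dev->reciprocal_tx_queues = reciprocal_value(dev->real_num_tx_queues);
 */
static inline u16 rx_queue_to_tx_queue(const struct net_device *dev, u32 rxq)
{
	/* quotient rxq / real_num_tx_queues, computed with a multiply */
	u32 q = reciprocal_divide(rxq, dev->reciprocal_tx_queues);

	/* remainder, i.e. rxq % real_num_tx_queues, still with no divide */
	return (u16)(rxq - q * dev->real_num_tx_queues);
}

This is exact for small values such as an rx queue index; whether the single
divide in this path is worth avoiding is a separate question, and the sketch
only shows the shape of the change hinted at.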



