All of lore.kernel.org
 help / color / mirror / Atom feed
From: Neil Horman <nhorman@tuxdriver.com>
To: Bill Fink <billfink@mindspring.com>
Cc: Linux Network Developers <netdev@vger.kernel.org>,
	brice@myri.com, gallatin@myri.com
Subject: Re: Receive side performance issue with multi-10-GigE and NUMA
Date: Fri, 7 Aug 2009 18:12:11 -0400	[thread overview]
Message-ID: <20090807221211.GA16874@localhost.localdomain> (raw)
In-Reply-To: <20090807170600.9a2eff2e.billfink@mindspring.com>

On Fri, Aug 07, 2009 at 05:06:00PM -0400, Bill Fink wrote:
> I've run into a major receive side performance issue with multi-10-GigE
> on a NUMA system.  The system is using a SuperMicro X8DAH+-F motherboard
> with 2 3.2 GHz quad-core Intel Xeon 5580 processors and 12 GB of
> 1333 MHz DDR3 memory.  It is a Fedora 10 system but using the latest
> 2.6.29.6 kernel from Fedora 11 (originally tried the 2.6.27.29 kernel
> from Fedora 10).
> 
> The test setup is:
> 
> 	i7test1----(6)----xeontest1----(6)----i7test2
> 	         10-GigE             10-GigE
> 
> So xeontest1 has 6 dual-port Myricom 10-GigE NICs for a total
> of 12 10-GigE interfaces.  eth2 through eth7 (which are on the
> second Intel 5520 I/O Hub) are connected to i7test1 while
> eth8 through eth13 (which are on the first Intel 5520 I/O Hub)
> are connected to i7test2.
> 
> Previous direct testing between i7test1 and i7test2 (which use an
> Asus P6T6 WS Revolution motherboard) demonstrated that they could
> achieve ~70 Gbps performance for either transmit or receive using
> 8 10-GigE interfaces.
> 
> The transmit side performance of xeontest1 is fantastic:
> 
> [root@xeontest1 ~]# numactl --membind=2 nuttcp -In2 -xc1/0 -p5001 192.168.1.10 & numactl --membind=2 nuttcp -In3 -xc3/0 -p5002 192.168.2.10 & numactl --membind=2 nuttcp -In4 -xc5/1 -p5003 192.168.3.10 & numactl --membind=2 nuttcp -In5 -xc7/1 -p5004 192.168.4.10 & nuttcp -In8 -xc0/0 -p5007 192.168.7.11 & nuttcp -In9 -xc2/0 -p5008 192.168.8.11 & nuttcp -In10 -xc4/1 -p5009 192.168.9.11 & nuttcp -In11 -xc6/1 -p5010 192.168.10.11 & numactl --membind=2 nuttcp -In6 -xc5/2 -p5005 192.168.5.10 & numactl --membind=2 nuttcp -In7 -xc7/3 -p5006 192.168.6.10 & nuttcp -In12 -xc4/2 -p5011 192.168.11.11 & nuttcp -In13 -xc6/3 -p5012 192.168.12.11 &
> n12:  9648.0522 MB /  10.00 sec = 8091.4066 Mbps 49 %TX 26 %RX 0 retrans 0.18 msRTT
> n9: 11130.5320 MB /  10.01 sec = 9328.3224 Mbps 47 %TX 37 %RX 0 retrans 0.19 msRTT
> n11:  9418.1250 MB /  10.00 sec = 7897.5848 Mbps 50 %TX 30 %RX 0 retrans 0.18 msRTT
> n10:  9279.4758 MB /  10.01 sec = 7778.7146 Mbps 49 %TX 28 %RX 0 retrans 0.12 msRTT
> n8: 11142.6574 MB /  10.01 sec = 9340.3789 Mbps 47 %TX 35 %RX 0 retrans 0.18 msRTT
> n13:  9422.1492 MB /  10.01 sec = 7897.4115 Mbps 49 %TX 25 %RX 0 retrans 0.17 msRTT
> n3: 11471.2500 MB /  10.01 sec = 9613.9477 Mbps 49 %TX 32 %RX 0 retrans 0.15 msRTT
> n6:  9339.6354 MB /  10.01 sec = 7828.5345 Mbps 50 %TX 25 %RX 0 retrans 0.19 msRTT
> n4:  9093.2500 MB /  10.01 sec = 7624.1589 Mbps 49 %TX 28 %RX 0 retrans 0.15 msRTT
> n5:  9121.8367 MB /  10.01 sec = 7646.8646 Mbps 50 %TX 29 %RX 0 retrans 0.17 msRTT
> n7:  9292.2500 MB /  10.01 sec = 7789.1574 Mbps 49 %TX 26 %RX 0 retrans 0.17 msRTT
> n2: 11487.1150 MB /  10.01 sec = 9627.2690 Mbps 49 %TX 46 %RX 0 retrans 0.19 msRTT
> 
> Aggregate performance:			100.4637 Gbps
> 
> The problem is with the receive side performance.
> 
> [root@xeontest1 ~]# numactl --membind=2 nuttcp -In2 -r -xc1/0 -p5001 192.168.1.10 & numactl --membind=2 nuttcp -In3 -r -xc3/0 -p5002 192.168.2.10 & numactl --membind=2 nuttcp -In4 -r -xc5/1 -p5003 192.168.3.10 & numactl --membind=2 nuttcp -In5 -r -xc7/1 -p5004 192.168.4.10 & nuttcp -In8 -r -xc0/0 -p5007 192.168.7.11 & nuttcp -In9 -r -xc2/0 -p5008 192.168.8.11 & nuttcp -In10 -r -xc4/1 -p5009 192.168.9.11 & nuttcp -In11 -r -xc6/1 -p5010 192.168.10.11 & numactl --membind=2 nuttcp -In6 -r -xc5/2 -p5005 192.168.5.10 & numactl --membind=2 nuttcp -In7 -r -xc7/3 -p5006 192.168.6.10 & nuttcp -In12 -r -xc4/2 -p5011 192.168.11.11 & nuttcp -In13 -r -xc6/3 -p5012 192.168.12.11 &
> n11:  6983.6359 MB /  10.09 sec = 5803.2293 Mbps 13 %TX 26 %RX 0 retrans 0.11 msRTT
> n10:  7000.1557 MB /  10.11 sec = 5807.5978 Mbps 13 %TX 26 %RX 0 retrans 0.12 msRTT
> n9:  2451.7206 MB /  10.21 sec = 2014.8397 Mbps 4 %TX 13 %RX 0 retrans 0.11 msRTT
> n13:  2453.0887 MB /  10.20 sec = 2016.8751 Mbps 3 %TX 11 %RX 0 retrans 0.10 msRTT
> n12:  2446.5303 MB /  10.24 sec = 2004.4638 Mbps 4 %TX 11 %RX 0 retrans 0.10 msRTT
> n8:  2462.5890 MB /  10.26 sec = 2014.0272 Mbps 3 %TX 11 %RX 0 retrans 0.12 msRTT
> n4:  2763.5091 MB /  10.26 sec = 2258.4871 Mbps 4 %TX 14 %RX 0 retrans 0.10 msRTT
> n5:  2770.0887 MB /  10.28 sec = 2261.2562 Mbps 4 %TX 15 %RX 0 retrans 0.10 msRTT
> n2:  1777.7277 MB /  10.32 sec = 1444.9054 Mbps 2 %TX 11 %RX 0 retrans 0.11 msRTT
> n6:  1772.7962 MB /  10.31 sec = 1442.0346 Mbps 3 %TX 10 %RX 0 retrans 0.11 msRTT
> n3:  1779.4535 MB /  10.32 sec = 1446.0090 Mbps 2 %TX 11 %RX 0 retrans 0.15 msRTT
> n7:  1770.8359 MB /  10.35 sec = 1435.4757 Mbps 2 %TX 11 %RX 0 retrans 0.12 msRTT
> 
> Aggregate performance:			29.9492 Gbps
> 
> I suspected that this was because the memory being allocated by the
> myri10ge driver was not being allocated on the optimum NUMA node.
> BTW the NUMA nodes on the system are 0 and 2 instead of 0 and 1 which
> is what I would have expected, but this is my first experience with
> a NUMA system.
> 
> Based upon a patch by Peter Zijlstra that I discovered through Google
> searching, I tried patching the myri10ge driver to change its memory
> allocation of memory pages from alloc_pages() to alloc_pages_node()
> and specifying the NUMA node of the parent device of the Myricom 10-GigE
> device, which IIUC should be the PCIe switch.  This didn't help.
> 
> This could be because I discovered that if I did:
> 
> 	find /sys -name numa_node -exec grep . {} /dev/null \;
> 
> that the numa_node associated with all the PCI devices was always 0,
> and if IIUC then I believe some of the PCI devices should have been
> associated with NUMA node 2.  Perhaps this is what is causing all
> the memory pages allocated by the myri10ge driver to be on NUMA
> node 0, and thus causing the major performance issue.
> 
> To kludge around this, I made a different patch to the myri10ge driver.
> This time I hardcoded the NUMA node in the call to alloc_pages_node()
> to 2 for devices with an IRQ between 113 and 118 (eth2 through eth7)
> and to 0 for devices with an IRQ between 119 and 124 (eth8 through eth13).
> This is of course very specific to our specific system (NUMA node ids
> and Myricom 10-GigE device IRQs), and is not something that would be
> generically applicable.  But it was useful as a test, and it did
> improve the receive side performance substantially!
> 
> [root@xeontest1 ~]# numactl --membind=2 nuttcp -In2 -r -xc1/0 -p5001 192.168.1.10 & numactl --membind=2 nuttcp -In3 -r -xc3/0 -p5002 192.168.2.10 & numactl --membind=2 nuttcp -In4 -r -xc5/1 -p5003 192.168.3.10 & numactl --membind=2 nuttcp -In5 -r -xc7/1 -p5004 192.168.4.10 & nuttcp -In8 -r -xc0/0 -p5007 192.168.7.11 & nuttcp -In9 -r -xc2/0 -p5008 192.168.8.11 & nuttcp -In10 -r -xc4/1 -p5009 192.168.9.11 & nuttcp -In11 -r -xc6/1 -p5010 192.168.10.11 & numactl --membind=2 nuttcp -In6 -r -xc5/2 -p5005 192.168.5.10 & numactl --membind=2 nuttcp -In7 -r -xc7/3 -p5006 192.168.6.10 & nuttcp -In12 -r -xc4/2 -p5011 192.168.11.11 & nuttcp -In13 -r -xc6/3 -p5012 192.168.12.11 &
> n5:  8221.2911 MB /  10.09 sec = 6836.0343 Mbps 17 %TX 31 %RX 0 retrans 0.12 msRTT
> n4:  8237.9524 MB /  10.10 sec = 6840.2379 Mbps 16 %TX 31 %RX 0 retrans 0.11 msRTT
> n11:  7935.3750 MB /  10.11 sec = 6586.2476 Mbps 15 %TX 29 %RX 0 retrans 0.16 msRTT
> n2:  4543.1621 MB /  10.13 sec = 3763.0669 Mbps 9 %TX 21 %RX 0 retrans 0.12 msRTT
> n10:  7916.3925 MB /  10.13 sec = 6555.5210 Mbps 15 %TX 28 %RX 0 retrans 0.13 msRTT
> n7:  4558.4817 MB /  10.14 sec = 3771.6557 Mbps 7 %TX 22 %RX 0 retrans 0.10 msRTT
> n13:  4390.1875 MB /  10.14 sec = 3633.6421 Mbps 6 %TX 21 %RX 0 retrans 0.12 msRTT
> n3:  4572.6478 MB /  10.15 sec = 3778.2596 Mbps 9 %TX 21 %RX 0 retrans 0.14 msRTT
> n6:  4564.4776 MB /  10.14 sec = 3774.4373 Mbps 9 %TX 21 %RX 0 retrans 0.11 msRTT
> n8:  4409.8551 MB /  10.16 sec = 3642.1920 Mbps 8 %TX 19 %RX 0 retrans 0.12 msRTT
> n9:  4412.7836 MB /  10.16 sec = 3643.7788 Mbps 8 %TX 20 %RX 0 retrans 0.14 msRTT
> n12:  4413.4061 MB /  10.16 sec = 3645.2544 Mbps 8 %TX 21 %RX 0 retrans 0.11 msRTT
> 
> Aggregate performance:			56.4703 Gbps
> 
> This was basically double the previous receive side performance
> without the patch.
> 
> I don't know if this is fundamentally a myri10ge driver issue or
> some underlying Linux kernel issue, so it's not clear to me what
> a proper fix would be.
> 
> Finally, while definitely a major improvement, I think it should be
> possible to do even better, since we achieved 70 Gbps in the i7 to i7
> tests, and probably could have done 80 Gbps except for an Asus
> motherboard restriction with the interconnect between the Intel X58
> and Nvidia NF200 chips.  It's definitely a big step in the right
> direction though if this issue can be resolved.
> 
> Any help greatly appreicated in advance.
> 
> 						-Thanks
> 
> 						-Bill
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

You're timing is impeccable!  I just posted a patch for an ftrace module to help
detect just these kind of conditions:
http://marc.info/?l=linux-netdev&m=124967650218846&w=2

Hope that helps you out
Neil


  parent reply	other threads:[~2009-08-07 22:12 UTC|newest]

Thread overview: 95+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-08-07 21:06 Receive side performance issue with multi-10-GigE and NUMA Bill Fink
2009-08-07 21:18 ` Brice Goglin
2009-08-07 21:51   ` Bill Fink
2009-08-07 21:53     ` Brice Goglin
2009-08-07 22:08       ` Bill Fink
2009-08-07 22:17         ` Brice Goglin
2009-08-07 22:55           ` Bill Fink
2009-08-08  1:03     ` Andrew Gallatin
2009-08-08  1:35       ` Bill Fink
2009-08-08 11:08         ` Andrew Gallatin
2009-08-08 11:26           ` Neil Horman
2009-08-08 18:21             ` Andrew Gallatin
2009-08-08 18:32               ` Neil Horman
2009-08-11  7:32                 ` Bill Fink
2009-08-11 11:02                   ` Neil Horman
2009-08-11 19:15                     ` Christoph Lameter
2009-08-11 22:27                   ` Andi Kleen
2009-08-12  4:30                     ` Bill Fink
2009-08-12  7:21                       ` Andi Kleen
     [not found]                       ` <4A856781.2080301@myri.com>
2009-08-14 16:38                         ` Bill Fink
2009-08-14 16:55                           ` Andrew Gallatin
2009-08-14 21:13                             ` Aviv Greenberg
2009-08-20  7:26                               ` Bill Fink
2009-08-20 13:14                                 ` Ben Hutchings
2009-08-21  4:00                                   ` Bill Fink
2009-08-20 13:17                                 ` Aviv Greenberg
2009-08-12  0:02                   ` Brandeburg, Jesse
2009-08-12  4:38                     ` Bill Fink
2009-08-12 16:00                       ` Jesse Barnes
2009-08-14 20:31                       ` Bill Fink
2009-08-17 16:53                         ` Jesse Barnes
2009-08-18  7:07                           ` Bill Fink
2009-08-18 11:54                             ` Andrew Gallatin
2009-08-19 17:59                               ` Bill Fink
2009-08-07 22:12 ` Neil Horman [this message]
2009-08-08  0:54   ` Bill Fink
2009-08-08  1:56     ` Neil Horman
2009-08-14 20:44       ` Bill Fink
2009-08-14 23:25         ` Neil Horman
2009-08-20  7:50           ` Bill Fink
2009-08-20 20:19             ` Neil Horman
2009-08-21  4:14               ` Bill Fink
2009-08-21 15:23                 ` Neil Horman
2009-08-21 15:36                   ` Andrew Gallatin
2009-08-26  7:10                   ` Bill Fink
2009-08-26 11:00                     ` Neil Horman
2009-08-26 18:08                       ` Neil Horman
2009-08-26 18:15                         ` Ingo Molnar
2009-08-26 19:04                           ` Neil Horman
2009-08-26 19:08                             ` Ingo Molnar
2009-08-26 19:36                               ` David Miller
2009-08-26 19:48                                 ` Ingo Molnar
2009-08-26 20:23                                   ` Neil Horman
2009-08-26 20:40                                     ` Ingo Molnar
2009-08-26 22:39                                       ` Neil Horman
2009-08-26 22:44                                         ` David Miller
2009-08-26 23:05                                           ` Ingo Molnar
2009-08-26 23:08                                             ` David Miller
2009-08-26 23:58                                               ` Ingo Molnar
2009-08-27  0:05                                                 ` Steven Rostedt
2009-08-27  0:35                                                 ` Christoph Hellwig
2009-08-27  9:28                                                   ` Ingo Molnar
2009-08-26 23:05                                           ` Steven Rostedt
2009-08-26 23:09                                             ` David Miller
2009-08-26 23:30                                               ` Ingo Molnar
2009-08-26 23:23                                             ` Neil Horman
2009-08-26 23:29                                               ` David Miller
2009-08-26 23:19                                           ` Neil Horman
2009-08-26 23:14                                         ` Ingo Molnar
2009-08-26 23:33                                         ` Steven Rostedt
2009-08-27  0:14                                           ` Neil Horman
2009-08-27  0:29                                             ` Steven Rostedt
2009-08-27  1:17                                               ` Neil Horman
2009-08-27  9:06                                                 ` Ingo Molnar
2009-08-27  9:34                                               ` Ingo Molnar
2009-08-27  0:34                                         ` Christoph Hellwig
2009-08-27  0:30                                       ` blktrace ftrace plugin, was " Christoph Hellwig
2009-08-27  5:26                                         ` Jens Axboe
2009-08-27  9:12                                           ` Ingo Molnar
2009-08-27  9:14                                             ` Jens Axboe
2009-08-27 13:55                                               ` Arnaldo Carvalho de Melo
2009-08-28  2:03                                             ` Li Zefan
2009-08-26 23:46                                     ` Frederic Weisbecker
2009-08-26 20:28                                   ` Ingo Molnar
2009-08-26 20:01                               ` Neil Horman
2009-08-26 22:57                                 ` Ingo Molnar
2009-08-27 17:32                         ` Bill Fink
2009-09-02  5:28                           ` Bill Fink
2009-08-27 17:44                         ` Bill Fink
2009-08-27 17:51                           ` Neil Horman
2009-09-02  5:11                             ` Bill Fink
2009-09-02 10:49                               ` Neil Horman
2009-09-02 15:38                                 ` Bill Fink
2009-08-12 23:29 ` David Miller
2009-08-13  2:35   ` Bill Fink

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090807221211.GA16874@localhost.localdomain \
    --to=nhorman@tuxdriver.com \
    --cc=billfink@mindspring.com \
    --cc=brice@myri.com \
    --cc=gallatin@myri.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.