netdev.vger.kernel.org archive mirror
* NUMA and multiQ interaction
@ 2009-11-24 17:04 Tom Herbert
  2009-11-24 17:36 ` Eric Dumazet
  2009-11-24 18:06 ` David Miller
  0 siblings, 2 replies; 5+ messages in thread
From: Tom Herbert @ 2009-11-24 17:04 UTC (permalink / raw)
  To: Linux Netdev List

This is a question about the expected interaction between NUMA and
receive multiqueue.  Our test setup is a 16-core AMD system with 4
sockets, one NUMA node per socket, and a bnx2x NIC.  The test runs
500 netperf RR streams with one-byte requests and responses, using
net-next-2.6.

The highest throughput we are seeing is with 4 queues (one queue
processed per socket), giving 361862 tps at 67% CPU.  16 queues (one
queue per CPU) gives 226722 tps at 30.43% CPU.

However, with a modified kernel that does RX skb allocations from the
local node rather than the device's NUMA node, I'm getting 923422 tps
at 100% CPU.  This is much higher tps and better CPU utilization than
the case where allocations come from the device's NUMA node.  It
appears that cross-node allocations are causing a significant
performance hit.  For a 2.5x performance improvement I'm kind of
motivated to revert netdev_alloc_skb to when it did not pay attention
to the NUMA node :-)
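
For reference, the change I'm testing is roughly along these lines (a
sketch rather than the exact patch; the wrapper name is made up for
illustration, the rest is the usual skb allocation path):

#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/topology.h>

/* Sketch: allocate the RX skb on the node of the CPU doing the
 * allocation (i.e. the CPU handling the RX queue) instead of the
 * node reported by dev_to_node() for the device.
 */
static struct sk_buff *netdev_alloc_skb_local_node(struct net_device *dev,
                                                   unsigned int length,
                                                   gfp_t gfp_mask)
{
        int node = numa_node_id();      /* local node, not the device's node */
        struct sk_buff *skb;

        skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
        if (likely(skb)) {
                skb_reserve(skb, NET_SKB_PAD);
                skb->dev = dev;
        }
        return skb;
}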

What is the expected interaction here, and would these results be
typical?  If so, would this warrant associating each RX queue with a
NUMA node, instead of just the device?
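
To make the second part of the question concrete, the association I
have in mind is something like a per-queue node hint (the structure
and helper here are hypothetical, just to illustrate the idea):

#include <linux/topology.h>

/* Hypothetical: remember, per RX queue, the NUMA node of the CPU that
 * services it, so the refill path can pass that node to __alloc_skb()
 * instead of the device's node.
 */
struct rx_queue_hint {
        int numa_node;
};

static void rx_queue_set_node(struct rx_queue_hint *q, int cpu)
{
        q->numa_node = cpu_to_node(cpu);        /* node of the servicing CPU */
}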

Thanks,
Tom


* Re: NUMA and multiQ interaction
  2009-11-24 17:04 NUMA and multiQ interaction Tom Herbert
@ 2009-11-24 17:36 ` Eric Dumazet
  2009-11-24 18:06 ` David Miller
  1 sibling, 0 replies; 5+ messages in thread
From: Eric Dumazet @ 2009-11-24 17:36 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Netdev List

Tom Herbert wrote:
> This is a question about the expected interaction between NUMA and
> receive multiqueue.  Our test setup is a 16-core AMD system with 4
> sockets, one NUMA node per socket, and a bnx2x NIC.  The test runs
> 500 netperf RR streams with one-byte requests and responses, using
> net-next-2.6.
> 
> The highest throughput we are seeing is with 4 queues (one queue
> processed per socket), giving 361862 tps at 67% CPU.  16 queues (one
> queue per CPU) gives 226722 tps at 30.43% CPU.
> 
> However, with a modified kernel that does RX skb allocations from the
> local node rather than the device's NUMA node, I'm getting 923422 tps
> at 100% CPU.  This is much higher tps and better CPU utilization than
> the case where allocations come from the device's NUMA node.  It
> appears that cross-node allocations are causing a significant
> performance hit.  For a 2.5x performance improvement I'm kind of
> motivated to revert netdev_alloc_skb to when it did not pay attention
> to the NUMA node :-)
> 
> What is the expected interaction here, and would these results be
> typical?  If so, would this warrant associating each RX queue with a
> NUMA node, instead of just the device?
> 

I believe you answered your own question, Tom.

I always had doubts about forcing RX buffers to be 'close to the device'.

RPS clearly shows that the hard work is done on a CPU close to the application,
so it would make sense to allocate RX buffers on the local node (of the CPU handling
the RX queue), i.e. not necessarily on the device's NUMA node.

When a packet is transferred from the NIC to memory, the NUMA distance is only paid
once per cache line, with no cache involvement.

Then, when the packet is processed by the host CPUs, we might need many transfers,
because of TCP coalescing and copying to user space.

SLUB/SLAB also pay an extra cost when cross-node allocations/deallocations are performed.

Another point to look at is the vmalloc() that various drivers use at NIC initialization.
Module loading is performed with poor NUMA properties (forcing all allocations onto a single node).

For example, on this IXGBE adapter, all working space (TX and RX queue rings) is allocated
on node 0 on my dual-node machine.

# grep ixgbe_setup_rx_resources /proc/vmallocinfo 

0xffffc90006c67000-0xffffc90006c72000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006c73000-0xffffc90006c7e000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d02000-0xffffc90006d0d000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d0e000-0xffffc90006d19000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d1a000-0xffffc90006d25000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d26000-0xffffc90006d31000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d32000-0xffffc90006d3d000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d3e000-0xffffc90006d49000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d4a000-0xffffc90006d55000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d56000-0xffffc90006d61000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d62000-0xffffc90006d6d000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d6e000-0xffffc90006d79000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d7a000-0xffffc90006d85000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d86000-0xffffc90006d91000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d92000-0xffffc90006d9d000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d9e000-0xffffc90006da9000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006e4a000-0xffffc90006e55000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006e56000-0xffffc90006e61000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006e62000-0xffffc90006e6d000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006e6e000-0xffffc90006e79000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006e7a000-0xffffc90006e85000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006e86000-0xffffc90006e91000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006e92000-0xffffc90006e9d000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006e9e000-0xffffc90006ea9000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006eaa000-0xffffc90006eb5000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006eb6000-0xffffc90006ec1000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006ec2000-0xffffc90006ecd000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006ece000-0xffffc90006ed9000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006eda000-0xffffc90006ee5000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006ee6000-0xffffc90006ef1000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006ef2000-0xffffc90006efd000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006efe000-0xffffc90006f09000   45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
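
For the vmalloc() case, the node-aware variant already exists, so a
driver could do something like the following at queue setup time (a
sketch only; 'size' and 'node' stand in for whatever the driver
computes, e.g. the node of the CPU expected to service the queue):

#include <linux/vmalloc.h>

static void *queue_working_space_alloc(unsigned long size, int node)
{
        void *p = vmalloc_node(size, node);     /* allocate on the chosen node */

        if (!p)
                p = vmalloc(size);      /* fall back to any node rather than fail */
        return p;
}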


alloc_large_system_hash() has better NUMA properties, spreading large hash tables to all nodes.


* Re: NUMA and multiQ interaction
  2009-11-24 17:04 NUMA and multiQ interaction Tom Herbert
  2009-11-24 17:36 ` Eric Dumazet
@ 2009-11-24 18:06 ` David Miller
  2009-11-24 19:51   ` Tom Herbert
  1 sibling, 1 reply; 5+ messages in thread
From: David Miller @ 2009-11-24 18:06 UTC (permalink / raw)
  To: therbert; +Cc: netdev

From: Tom Herbert <therbert@google.com>
Date: Tue, 24 Nov 2009 09:04:41 -0800

> What is the expected interaction here, and would these results be
> typical?  If so, would this warrant associating each RX queue with a
> NUMA node, instead of just the device?

Yes, we are fully aware of this and discussed it at netconf this year; see
in particular:

http://vger.kernel.org/netconf2009_slides/netconf2009_numa_discussion.odp

PJ is also currently trying to push upstream some changes so that these
NUMA allocation bits can be exported to userspace, and thus irqbalance
can pick IRQ targeting more intelligently when a driver allocates per-queue
memory resources on different NUMA nodes.  The thread discussing this has
been active for the past few days; maybe you missed it.
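
I don't have the exact patches in front of me, but the general shape is
a read-only sysfs attribute reporting the NUMA node per device (and
eventually per queue), in the same style as the existing numa_node
attribute on PCI devices, which irqbalance can read when choosing IRQ
affinities.  As a rough illustration only (not PJ's actual patch):

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/stat.h>

/* Illustration only: expose a NUMA node to userspace as a read-only
 * sysfs attribute, in the usual show() style.
 */
static ssize_t numa_node_show(struct device *dev,
                              struct device_attribute *attr, char *buf)
{
        return sprintf(buf, "%d\n", dev_to_node(dev));
}
static DEVICE_ATTR(numa_node, S_IRUGO, numa_node_show, NULL);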


* Re: NUMA and multiQ interaction
  2009-11-24 18:06 ` David Miller
@ 2009-11-24 19:51   ` Tom Herbert
  2009-11-24 20:10     ` David Miller
  0 siblings, 1 reply; 5+ messages in thread
From: Tom Herbert @ 2009-11-24 19:51 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

On Tue, Nov 24, 2009 at 10:06 AM, David Miller <davem@davemloft.net> wrote:
> From: Tom Herbert <therbert@google.com>
> Date: Tue, 24 Nov 2009 09:04:41 -0800
>
>> What is the expected interaction here, and would these results be
>> typical?  If so, would this warrant associating each RX queue with a
>> NUMA node, instead of just the device?
>
> Yes, we are fully aware of this and discussed it at netconf this year; see
> in particular:
>
> http://vger.kernel.org/netconf2009_slides/netconf2009_numa_discussion.odp
>
> PJ is also currently trying to push upstream some changes so that these
> NUMA allocation bits can be exported to userspace, and thus irqbalance
> can pick IRQ targeting more intelligently when a driver allocates per-queue
> memory resources on different NUMA nodes.  The thread discussing this has
> been active for the past few days; maybe you missed it.
>

I saw that thread, and it looks compelling.  But we have applications
that are network bound such that we want to use multiple queues across
multiple nodes for scaling -- and trying to keep all queues on the same
node does not scale very well.  Maybe ignoring NUMA allocation could
be a fall-back mode in a dynamic allocation scheme under heavy load?

Tom


* Re: NUMA and multiQ interaction
  2009-11-24 19:51   ` Tom Herbert
@ 2009-11-24 20:10     ` David Miller
  0 siblings, 0 replies; 5+ messages in thread
From: David Miller @ 2009-11-24 20:10 UTC (permalink / raw)
  To: therbert; +Cc: netdev

From: Tom Herbert <therbert@google.com>
Date: Tue, 24 Nov 2009 11:51:39 -0800

> I saw that thread, and it looks compelling.  But we have applications
> that are network bound such that we want to use multiple queues across
> multiple nodes for scaling -- and trying to keep all queues on the same
> node does not scale very well.  Maybe ignoring NUMA allocation could
> be a fall-back mode in a dynamic allocation scheme under heavy load?

You use the word "But" as if what PJ and friends are doing is different
from what you're trying to achieve.  I think they are the same.

