* [RFC] Setting processor affinity for network queues
@ 2010-03-01 17:21 Ben Hutchings
2010-03-01 17:43 ` Tadepalli, Hari K
2010-03-01 20:46 ` Tom Herbert
From: Ben Hutchings @ 2010-03-01 17:21 UTC
To: netdev
Cc: Peter P Waskiewicz Jr, Peter Zijlstra, Thomas Gleixner,
Tom Herbert, Stephen Hemminger, sf-linux-drivers
With multiqueue network hardware or Receive/Transmit Packet Steering
(RPS/XPS) we can spread out network processing across multiple
processors. The administrator should be able to control the number of
channels and the processor affinity of each.
By 'channel' I mean a bundle of:
- a wakeup (IRQ or IPI)
- a receive queue whose completions trigger the wakeup
- a transmit queue whose completions trigger the wakeup
- a NAPI instance scheduled by the wakeup, which handles the completions
Numbers of RX and TX queues used on a device do not have to match, but
ideally they should. For generality, you can substitute 'a receive
and/or a transmit queue' above. At the hardware level the numbers of
queues could be different, e.g. in the sfc driver a channel would be
associated with 1 hardware RX queue, 2 hardware TX queues (with and
without checksum offload) and 1 hardware event queue.
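Concretely, a channel might be represented by something like the following
(a rough sketch only; field names are illustrative rather than taken from
any driver):

#include <linux/netdevice.h>
#include <linux/cpumask.h>

/* Rough sketch of the 'channel' bundle described above. */
struct example_channel {
        unsigned int       irq;        /* wakeup: IRQ (or IPI when using RPS) */
        unsigned int       rx_queue;   /* RX queue whose completions trigger the wakeup */
        unsigned int       tx_queue;   /* TX queue whose completions trigger the wakeup */
        struct napi_struct napi;       /* NAPI instance scheduled by the wakeup */
        int                numa_node;  /* node for rings and driver structures */
};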
Currently we have a userspace interface for setting affinity of IRQs and
a convention for naming each channel's IRQ handler, but no such
interface for memory allocation. For RX buffers this should not be a
problem since they are normally allocated as older buffers are
completed, in the NAPI context. However, the DMA descriptor rings and
driver structures for a channel should also be allocated on the NUMA
node where NAPI processing is done. Currently this allocation takes
place when a net device is created or when it is opened, before an
administrator has any opportunity to configure affinity. Reallocation
will normally require a complete stop to network traffic (at least on
the affected queues) so it should not be done automatically when the
driver detects a change in IRQ affinity. There needs to be an explicit
mechanism for changing it.
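To make the allocation side concrete, tying a channel's driver state to the
node of the CPU that will run its NAPI poll is roughly the following
(structure and helper names are invented for the sketch):

#include <linux/slab.h>
#include <linux/topology.h>

/* Sketch only: allocate per-channel driver state on the NUMA node of the
 * CPU expected to run the channel's NAPI poll.  The structure stands in
 * for whatever per-channel state a driver keeps. */
struct example_channel_state {
        void *ring_bookkeeping;
};

static struct example_channel_state *example_alloc_channel_state(int cpu)
{
        int node = cpu_to_node(cpu);

        /* kzalloc_node() prefers 'node' but falls back to other nodes. */
        return kzalloc_node(sizeof(struct example_channel_state),
                            GFP_KERNEL, node);
}

The allocation itself is trivial; the point is that it has to be redone,
under an explicit trigger, if the affinity later moves to another node.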
Devices using RPS will not generally be able to implement NUMA affinity
for RX buffer allocation, but there will be a similar issue of processor
selection for IPIs and NUMA node affinity for driver structures. The
proposed interface for setting processor affinity should cover this, but
it is completely different from the IRQ affinity mechanism for hardware
multiqueue devices. That seems undesirable.
Therefore I propose that:
1. Channels (or NAPI instances) should be exposed in sysfs.
2. Channels will have processor affinity, exposed read/write in sysfs
(a rough sketch follows below). Changing this triggers the networking
core and driver to reallocate associated structures if the processor
affinity has moved between NUMA nodes, and triggers the driver to set
IRQ affinity.
3. The networking core will set the initial affinity for each channel.
There may be global settings to control this.
4. Drivers should not set IRQ affinity.
5. irqbalanced should not set IRQ affinity for multiqueue network
devices.
(Most of this has been proposed already, but I'm trying to bring it all
together.)
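As a very rough sketch of proposal 2 (nothing here is an existing
interface; the per-channel kobject, the set_affinity callback and the
'affinity' attribute name are all hypothetical):

#include <linux/kobject.h>
#include <linux/sysfs.h>
#include <linux/cpumask.h>
#include <linux/slab.h>
#include <linux/mm.h>

/* Hypothetical per-channel object that a net device would expose in sysfs. */
struct example_channel_obj {
        struct kobject kobj;
        cpumask_var_t affinity;
        /* driver callback: re-target the IRQ and, if the NUMA node changed,
         * reallocate rings and per-channel structures */
        int (*set_affinity)(struct example_channel_obj *chan,
                            const struct cpumask *mask);
};

#define to_example_channel(k) container_of(k, struct example_channel_obj, kobj)

static ssize_t affinity_show(struct kobject *kobj,
                             struct kobj_attribute *attr, char *buf)
{
        struct example_channel_obj *chan = to_example_channel(kobj);

        return scnprintf(buf, PAGE_SIZE, "%*pb\n",
                         cpumask_pr_args(chan->affinity));
}

static ssize_t affinity_store(struct kobject *kobj,
                              struct kobj_attribute *attr,
                              const char *buf, size_t count)
{
        struct example_channel_obj *chan = to_example_channel(kobj);
        cpumask_var_t mask;
        int err;

        if (!alloc_cpumask_var(&mask, GFP_KERNEL))
                return -ENOMEM;

        err = cpumask_parse(buf, mask);
        if (!err) {
                err = chan->set_affinity(chan, mask);
                if (!err)
                        cpumask_copy(chan->affinity, mask);
        }
        free_cpumask_var(mask);
        return err ? err : count;
}

static struct kobj_attribute example_affinity_attr =
        __ATTR(affinity, 0644, affinity_show, affinity_store);

The attribute would be created with sysfs_create_file() on a per-channel
kobject parented to the net device; all of the interesting work (IRQ
re-targeting, NUMA reallocation) stays behind the driver callback.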
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
* RE: [RFC] Setting processor affinity for network queues
2010-03-01 17:21 [RFC] Setting processor affinity for network queues Ben Hutchings
@ 2010-03-01 17:43 ` Tadepalli, Hari K
2010-03-01 20:46 ` Tom Herbert
From: Tadepalli, Hari K @ 2010-03-01 17:43 UTC
To: Ben Hutchings, netdev
>> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On Behalf Of Ben Hutchings
>> Sent: Monday, March 01, 2010 10:22 AM
>> To: netdev
>> Cc: Waskiewicz Jr, Peter P; Peter Zijlstra; Thomas Gleixner; Tom Herbert; Stephen Hemminger; sf-linux-drivers
>> Subject: [RFC] Setting processor affinity for network queues
>>
>> Currently we have a userspace interface for setting affinity of IRQs and
>> a convention for naming each channel's IRQ handler, but no such
>> interface for memory allocation. For RX buffers this should not be a
>> problem since they are normally allocated as older buffers are
>> completed, in the NAPI context. However, the DMA descriptor rings and
>> driver structures for a channel should also be allocated on the NUMA
>> node where NAPI processing is done. Currently this allocation takes
>> place when a net device is created or when it is opened, before an
>> administrator has any opportunity to configure affinity. Reallocation
>> will normally require a complete stop to network traffic (at least on
>> the affected queues) so it should not be done automatically when the
>> driver detects a change in IRQ affinity. There needs to be an explicit
>> mechanism for changing it.
I sought clarification on this issue two weeks ago in the context of IxGbE:
http://marc.info/?l=linux-netdev&m=126638089307398&w=2
>> (i) tx_ring, rx_ring control structures of tx/rx rings:
>> tx_ring->tx_buffer_info = vmalloc_node(size, tx_ring->numa_node);
>>
>> (ii) descriptor rings in the DMA region: tx_ring->dma, rx_ring->dma
>> tx_ring->desc = pci_alloc_consistent(pdev, tx_ring->size, &tx_ring->dma);
>>
>> (iii) packet buffers:
>> struct sk_buff *skb = netdev_alloc_skb(adapter->netdev,bufsz);
The above memory allocation calls (apart from vmalloc_node()) conventionally
used in a network driver allow for node awareness only at the physical
device (pdev) level; queues within a device do not have independent node
affinity. This is a serious setback to the benefits expected of recent
developments for high-throughput networking: RSS, 10/40GbE, NUMA and many
cores on an SBC (single board computer or server). Cross-node traffic, even
in small proportions, drags performance down in the million-packets/sec range.
I have hacked around some of the above by using alloc_pages_node() and
setting the node_id explicitly when allocating for specific queues, but I
look forward to a clean, 'hack-free' solution.
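For reference, the kind of per-queue node awareness being described might
look roughly like this (ring and field names are invented, and the
set_dev_node() workaround for the descriptor ring is just one possibility,
not something settled here):

#include <linux/pci.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/* Illustrative per-ring bookkeeping; names are not from any driver. */
struct example_rx_ring {
        int          numa_node;  /* node of the CPU servicing this queue */
        unsigned int size;       /* bytes of descriptor memory */
        void         *desc;      /* descriptor ring */
        dma_addr_t   dma;
};

/* Descriptor ring: the coherent allocation is done against dev_to_node()
 * of the device, so temporarily overriding the device's node biases the
 * ring towards the queue's node. */
static int example_alloc_ring(struct pci_dev *pdev,
                              struct example_rx_ring *ring)
{
        int orig_node = dev_to_node(&pdev->dev);

        set_dev_node(&pdev->dev, ring->numa_node);
        ring->desc = pci_alloc_consistent(pdev, ring->size, &ring->dma);
        set_dev_node(&pdev->dev, orig_node);

        return ring->desc ? 0 : -ENOMEM;
}

/* RX buffer: allocate the page explicitly on the queue's node. */
static struct page *example_alloc_rx_page(struct pci_dev *pdev,
                                          struct example_rx_ring *ring,
                                          dma_addr_t *dma)
{
        struct page *page = alloc_pages_node(ring->numa_node, GFP_ATOMIC, 0);

        if (!page)
                return NULL;

        *dma = pci_map_page(pdev, page, 0, PAGE_SIZE, PCI_DMA_FROMDEVICE);
        if (pci_dma_mapping_error(pdev, *dma)) {
                __free_page(page);
                return NULL;
        }
        return page;
}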
Thanks,
- Hari
__________________________________________
Embedded Communications Group
Intel/ IAG, Chandler, AZ
* Re: [RFC] Setting processor affinity for network queues
2010-03-01 17:21 [RFC] Setting processor affinity for network queues Ben Hutchings
2010-03-01 17:43 ` Tadepalli, Hari K
@ 2010-03-01 20:46 ` Tom Herbert
From: Tom Herbert @ 2010-03-01 20:46 UTC
To: Ben Hutchings
Cc: netdev, Peter P Waskiewicz Jr, Peter Zijlstra, Thomas Gleixner,
Stephen Hemminger, sf-linux-drivers
On Mon, Mar 1, 2010 at 9:21 AM, Ben Hutchings <bhutchings@solarflare.com> wrote:
> With multiqueue network hardware or Receive/Transmit Packet Steering
> (RPS/XPS) we can spread out network processing across multiple
> processors. The administrator should be able to control the number of
> channels and the processor affinity of each.
>
> By 'channel' I mean a bundle of:
> - a wakeup (IRQ or IPI)
> - a receive queue whose completions trigger the wakeup
> - a transmit queue whose completions trigger the wakeup
> - a NAPI instance scheduled by the wakeup, which handles the completions
>
Yes. Also on the receive side it is really cumbersome to do per-NAPI
RPS settings when the receiving NAPI instance is not exposed in
netif_rx. Maybe a reference to the NAPI structure could be added to the
skb? This could clean up RPS a lot.
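Roughly: drivers can already record the hardware RX queue on the skb, and
the suggestion here would be an analogous per-NAPI reference. The sketch
below shows it only as a comment, since no such field exists (names are
illustrative):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Sketch: driver receive path handing an skb to the stack.  The queue
 * index hint exists today; a NAPI back-reference on the skb does not. */
static void example_deliver(struct napi_struct *napi, struct sk_buff *skb,
                            u16 rx_queue)
{
        skb_record_rx_queue(skb, rx_queue);  /* existing per-queue hint */
        /* skb->napi = napi; */              /* hypothetical per-NAPI hint */
        napi_gro_receive(napi, skb);
}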
Tom
> Numbers of RX and TX queues used on a device do not have to match, but
> ideally they should. For generality, you can substitute 'a receive
> and/or a transmit queue' above. At the hardware level the numbers of
> queues could be different e.g. in the sfc driver a channel would be
> associated with 1 hardware RX queue, 2 hardware TX queues (with and
> without checksum offload) and 1 hardware event queue.
>
> Currently we have a userspace interface for setting affinity of IRQs and
> a convention for naming each channel's IRQ handler, but no such
> interface for memory allocation. For RX buffers this should not be a
> problem since they are normally allocated as older buffers are
> completed, in the NAPI context. However, the DMA descriptor rings and
> driver structures for a channel should also be allocated on the NUMA
> node where NAPI processing is done. Currently this allocation takes
> place when a net device is created or when it is opened, before an
> administrator has any opportunity to configure affinity. Reallocation
> will normally require a complete stop to network traffic (at least on
> the affected queues) so it should not be done automatically when the
> driver detects a change in IRQ affinity. There needs to be an explicit
> mechanism for changing it.
>
> Devices using RPS will not generally be able to implement NUMA affinity
> for RX buffer allocation, but there will be a similar issue of processor
> selection for IPIs and NUMA node affinity for driver structures. The
> proposed interface for setting processor affinity should cover this, but
> it is completely different from the IRQ affinity mechanism for hardware
> multiqueue devices. That seems undesirable.
>
> Therefore I propose that:
>
> 1. Channels (or NAPI instances) should be exposed in sysfs.
> 2. Channels will have processor affinity, exposed read/write in sysfs.
> Changing this triggers the networking core and driver to reallocate
> associated structures if the processor affinity moved between NUMA
> nodes, and triggers the driver to set IRQ affinity.
> 3. The networking core will set the initial affinity for each channel.
> There may be global settings to control this.
> 4. Drivers should not set IRQ affinity.
> 5. irqbalanced should not set IRQ affinity for multiqueue network
> devices.
>
> (Most of this has been proposed already, but I'm trying to bring it all
> together.)
>
> Ben.
>
> --
> Ben Hutchings, Senior Software Engineer, Solarflare Communications
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
>
>