* [RFC] Idea about increasing efficiency of skb allocation in network devices
@ 2009-07-27 0:36 Neil Horman
2009-07-27 1:02 ` David Miller
0 siblings, 1 reply; 9+ messages in thread
From: Neil Horman @ 2009-07-27 0:36 UTC (permalink / raw)
To: netdev; +Cc: nhorman
Hey all-
I've been thinking about an idea lately, and I'm starting to tinker with an
implementation, so before I go too far down any one path I'd like to
solicit comments on it, just to avoid early design errors and the like.
Please find my proposal below. Feel free to openly ridicule it if you think it's
completely off base or pointless. Any and all criticism is welcome. Thanks!
Problem Statement:
Currently the networking stack receive path consists of a set of
producers (the network drivers which allocate skbs to receive on-the-wire data
into) and a set of consumers (user space applications and other networking
devices which free those skbs when their use is finished). These consumers and
producers are dynamic (additional consumers and producers can be added almost
at will within the system). There exists a potential inefficiency
in this receive path on NUMA systems. Given that allocation of skb data
buffers is done with only minimal regard to the NUMA node on which a producer
exists (following standard VM policy, in which we try to allocate on the local
node first), it is entirely possible that a consumer of this frame data will
exist on a different NUMA node than the one on which it was allocated. This
disparity leads to slower copying when an application attempts to copy the data
from the kernel, as it must cross a greater number of memory bridges.
Proposed solution:
Since network devices DMA their data into a provided DMA buffer (which
can usually be at an arbitrary location, as they must potentially cross several
PCI buses to reach any memory location), I'm postulating that it would increase
our receive path efficiency to provide a hint to the driver layer as to which
node to allocate an skb data buffer on. This hint would be determined by a
feedback mechanism. I was thinking that we could provide a callback function
via the skb that accepted the skb and the originating net_device. This
callback could track statistics on which NUMA nodes consume (read: copy data from)
skbs that were produced by specific net devices. Then, when that netdevice later
allocates a new skb (perhaps via netdev_alloc_skb), we could use that
statistical profile to determine whether the data buffer should be allocated on the
local node or on a remote node instead. Ideally, this 'consumer based
allocation bias' would allow us to reduce the amount of time it takes to
transfer received buffers to user space and make the overall receive path more
efficient. I see lots of opportunity here to develop tools to measure the
speedup this might provide (perhaps via ftrace plugins), as well as various
algorithms to better predict how to allocate skbs on various nodes.
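To make the feedback idea a bit more concrete, here is a rough, untested sketch
of what it might look like; every identifier below (rx_node_stats,
rx_consume_node_cb, netdev_pick_rx_node, dev->rx_node_stats) is hypothetical
and does not exist in the kernel today:

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/topology.h>

/* Hypothetical per-device record of which nodes consumed received data. */
struct rx_node_stats {
	atomic_t consumed_on_node[MAX_NUMNODES];
};

/* Called from the copy-to-user path when a consumer touches skb data. */
static void rx_consume_node_cb(struct sk_buff *skb, struct net_device *dev)
{
	struct rx_node_stats *stats = dev->rx_node_stats;	/* hypothetical field */

	atomic_inc(&stats->consumed_on_node[numa_node_id()]);
}

/* Consulted by something like __netdev_alloc_skb() to bias the allocation. */
static int netdev_pick_rx_node(struct net_device *dev)
{
	struct rx_node_stats *stats = dev->rx_node_stats;
	int node, best = dev_to_node(&dev->dev), max = 0;

	for_each_online_node(node) {
		int seen = atomic_read(&stats->consumed_on_node[node]);

		if (seen > max) {
			max = seen;
			best = node;
		}
	}
	return best;	/* node hint for the skb data allocation */
}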
Obviously, the code is going to do the talking here, but I wanted to get
the idea out there so that anyone who wanted to could point out anything
obvious that would lead to the conclusion that I was nuts. Feel free to tear it
all apart, or, on the off chance that this has legs, to suggest
improvements/features that you might like.
Thanks!
Neil
* Re: [RFC] Idea about increasing efficiency of skb allocation in network devices
2009-07-27 0:36 [RFC] Idea about increasing efficiency of skb allocation in network devices Neil Horman
@ 2009-07-27 1:02 ` David Miller
2009-07-27 7:10 ` Brice Goglin
2009-07-27 10:52 ` Neil Horman
0 siblings, 2 replies; 9+ messages in thread
From: David Miller @ 2009-07-27 1:02 UTC (permalink / raw)
To: nhorman; +Cc: netdev
From: Neil Horman <nhorman@tuxdriver.com>
Date: Sun, 26 Jul 2009 20:36:09 -0400
> Since Network devices dma their memory into a provided DMA
> buffer (which can usually be at an arbitrary location, as they must
> cross potentially several pci busses to reach any memory location),
> I'm postulating that it would increase our receive path efficiency
> to provide a hint to the driver layer as to which node to allocate
> an skb data buffer on. This hint would be determined by a feedback
> mechanism. I was thinking that we could provide a callback function
> via the skb, that accepted the skb and the originating net_device.
> This callback can track statistics on which numa nodes consume
> (read: copy data from) skbs that were produced by specific net
> devices. Then, when in the future that netdevice allocates a new
> skb (perhaps via netdev_alloc_skb), we can use that statistical
> profile to determine if the data buffer should be allocated on the
> local node, or on a remote node instead.
No matter what, you will do an inter-node memory operation,
unless the consumer NUMA node is the same as the one the
device is on.
Because the device sits on a NUMA node, if you DMA remotely
you've eaten the NUMA cost already.
If you always DMA to the device's NUMA node (what we try to do now), at
least there is the possibility of eliminating cross-NUMA traffic.
Better to move the application or stack processing towards the NUMA
node the network device is on, I think.
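As an illustration of that last point, a userspace thread could discover the
device's node and bind itself there; a minimal sketch, assuming libnuma and the
standard PCI 'numa_node' sysfs attribute (the interface name is just an example):

#include <stdio.h>
#include <numa.h>	/* libnuma: numa_available(), numa_run_on_node() */

static int run_near_netdev(const char *ifname)
{
	char path[128];
	int node = -1;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/class/net/%s/device/numa_node", ifname);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &node) != 1)
		node = -1;
	fclose(f);

	if (node < 0 || numa_available() < 0)
		return -1;
	return numa_run_on_node(node);	/* run this task on that node's CPUs */
}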
* Re: [RFC] Idea about increasing efficiency of skb allocation in network devices
2009-07-27 1:02 ` David Miller
@ 2009-07-27 7:10 ` Brice Goglin
2009-07-27 7:58 ` Eric Dumazet
2009-07-27 10:52 ` Neil Horman
1 sibling, 1 reply; 9+ messages in thread
From: Brice Goglin @ 2009-07-27 7:10 UTC (permalink / raw)
To: David Miller; +Cc: nhorman, netdev
David Miller wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Sun, 26 Jul 2009 20:36:09 -0400
>
>
>> Since Network devices dma their memory into a provided DMA
>> buffer (which can usually be at an arbitrary location, as they must
>> cross potentially several pci busses to reach any memory location),
>> I'm postulating that it would increase our receive path efficiency
>> to provide a hint to the driver layer as to which node to allocate
>> an skb data buffer on. This hint would be determined by a feedback
>> mechanism. I was thinking that we could provide a callback function
>> via the skb, that accepted the skb and the originating net_device.
>> This callback can track statistics on which numa nodes consume
>> (read: copy data from) skbs that were produced by specific net
>> devices. Then, when in the future that netdevice allocates a new
>> skb (perhaps via netdev_alloc_skb), we can use that statistical
>> profile to determine if the data buffer should be allocated on the
>> local node, or on a remote node instead.
>>
>
> No matter what, you will do an inter-node memory operation.
>
> Unless, the consumer NUMA node is the same as the one the
> device is on.
>
> Because since the device is on a NUMA node, if you DMA remotely
> you've eaten the NUMA cost already.
>
> If you always DMA to the device's NUMA node (what we try to do now) at
> least the is the possibility of eliminating cross-NUMA traffic.
>
> Better to move the application or stack processing towards the NUMA
> node the network device is on, I think.
>
Is there an easy way to get this NUMA node from the application socket
descriptor?
Also, one question that was raised at the Linux Symposium is: how do you
know which processors run the receive queue for a specific connection?
It would be nice to have a way to retrieve such information in the
application, to avoid inter-node and inter-core/cache traffic.
Brice
* Re: [RFC] Idea about increasing efficiency of skb allocation in network devices
2009-07-27 7:10 ` Brice Goglin
@ 2009-07-27 7:58 ` Eric Dumazet
2009-07-27 8:27 ` Brice Goglin
2009-07-27 10:55 ` Neil Horman
0 siblings, 2 replies; 9+ messages in thread
From: Eric Dumazet @ 2009-07-27 7:58 UTC (permalink / raw)
To: Brice Goglin; +Cc: David Miller, nhorman, netdev
Brice Goglin wrote:
> David Miller wrote:
>> From: Neil Horman <nhorman@tuxdriver.com>
>> Date: Sun, 26 Jul 2009 20:36:09 -0400
>>
>>
>>> Since Network devices dma their memory into a provided DMA
>>> buffer (which can usually be at an arbitrary location, as they must
>>> cross potentially several pci busses to reach any memory location),
>>> I'm postulating that it would increase our receive path efficiency
>>> to provide a hint to the driver layer as to which node to allocate
>>> an skb data buffer on. This hint would be determined by a feedback
>>> mechanism. I was thinking that we could provide a callback function
>>> via the skb, that accepted the skb and the originating net_device.
>>> This callback can track statistics on which numa nodes consume
>>> (read: copy data from) skbs that were produced by specific net
>>> devices. Then, when in the future that netdevice allocates a new
>>> skb (perhaps via netdev_alloc_skb), we can use that statistical
>>> profile to determine if the data buffer should be allocated on the
>>> local node, or on a remote node instead.
>>>
>> No matter what, you will do an inter-node memory operation.
>>
>> Unless, the consumer NUMA node is the same as the one the
>> device is on.
>>
>> Because since the device is on a NUMA node, if you DMA remotely
>> you've eaten the NUMA cost already.
>>
>> If you always DMA to the device's NUMA node (what we try to do now) at
>> least the is the possibility of eliminating cross-NUMA traffic.
>>
>> Better to move the application or stack processing towards the NUMA
>> node the network device is on, I think.
>>
>
> Is there an easy way to get this NUMA node from the application socket
> descriptor?
That's not easy; this information can change for every packet (think of
bonding setups, with aggregation of devices on different NUMA nodes).
We could add a getsockopt() call to peek at this information for the next
data to be read from the socket (returning the node id where the skb data is
sitting, hoping that the NIC driver hadn't applied copybreak to it (i.e.
allocated a small skb and copied the device-provided data into it before
feeding the packet to the network stack)).
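From the application side, such a call might be used like this; note that
SO_INCOMING_NODE is purely hypothetical, not an existing socket option, and the
numeric value is only a placeholder:

#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_NODE
#define SO_INCOMING_NODE 9999	/* hypothetical option, placeholder value */
#endif

/* Returns the NUMA node holding the next queued skb's data, or -1. */
static int next_data_node(int sockfd)
{
	int node = -1;
	socklen_t len = sizeof(node);

	if (getsockopt(sockfd, SOL_SOCKET, SO_INCOMING_NODE, &node, &len) < 0)
		return -1;	/* option unsupported, or no data queued */
	return node;
}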
> Also, one question that was raised at the Linux Symposium is: how do you
> know which processors run the receive queue for a specific connection ?
> It would be nice to have a way to retrieve such information in the
> application to avoid inter-node and inter-core/cache traffic.
All this depends on whether you have multiqueue devices or not, and
whether traffic spreads across all queues or not.
Assuming you have a single-queue device, the only current way to handle
this is to do the reverse thinking:
i.e., bind NIC interrupts to the appropriate set of CPUs, and
possibly bind the user application threads dealing with network traffic to the same set.
Only background or CPU-hungry threads should be allowed to run
on foreign nodes.
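A rough sketch of the thread-binding half (the IRQ half is done separately,
e.g. by writing a mask to /proc/irq/<N>/smp_affinity as root); the CPU numbers
below are arbitrary examples chosen to match wherever the NIC's interrupt is
routed:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin a network-handling thread to the CPUs servicing the NIC IRQ. */
static int pin_network_thread(pthread_t thread)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);	/* example CPUs only */
	CPU_SET(1, &set);

	return pthread_setaffinity_np(thread, sizeof(set), &set);
}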
* Re: [RFC] Idea about increasing efficiency of skb allocation in network devices
2009-07-27 7:58 ` Eric Dumazet
@ 2009-07-27 8:27 ` Brice Goglin
2009-07-27 10:55 ` Neil Horman
1 sibling, 0 replies; 9+ messages in thread
From: Brice Goglin @ 2009-07-27 8:27 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, nhorman, netdev
Eric Dumazet wrote:
>> Is there an easy way to get this NUMA node from the application socket
>> descriptor?
>>
>
> Thats not easy, this information can change for every packet (think of
> bonding setups, whith aggregation of devices on different NUMA nodes)
>
If we returned a mask of CPUs near the NIC, we could return the mask
containing the CPUs that are close to any of the devices that were
aggregated in the bonding setup.
Without bonding, it's fine. With bonding, the behavior looks acceptable to me.
> We could add a getsockopt() call to peek this information from the next
> data to be read from socket (returns node id where skb data is sitting,
> hoping that NIC driver hadnt copybreak it (ie : allocate a small skb and
> copy the device provided data on it before feeding packet to network stack))
>
>
>
>> Also, one question that was raised at the Linux Symposium is: how do you
>> know which processors run the receive queue for a specific connection ?
>> It would be nice to have a way to retrieve such information in the
>> application to avoid inter-node and inter-core/cache traffic.
>>
>
> All this depends on the fact you have multiqueue devices or not, and
> trafic spreads on all queues or not.
>
Again, on a per-connection basis, you should know whether your packets
are going through a single queue or to all of them. If going to a single
queue, return a mask of CPUs near this exact queue. If going to multiple
queues (or if you don't know), just sum up the cpumasks of all queues.
Brice
* Re: [RFC] Idea about increasing efficiency of skb allocation in network devices
2009-07-27 1:02 ` David Miller
2009-07-27 7:10 ` Brice Goglin
@ 2009-07-27 10:52 ` Neil Horman
1 sibling, 0 replies; 9+ messages in thread
From: Neil Horman @ 2009-07-27 10:52 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On Sun, Jul 26, 2009 at 06:02:54PM -0700, David Miller wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Sun, 26 Jul 2009 20:36:09 -0400
>
> > Since Network devices dma their memory into a provided DMA
> > buffer (which can usually be at an arbitrary location, as they must
> > cross potentially several pci busses to reach any memory location),
> > I'm postulating that it would increase our receive path efficiency
> > to provide a hint to the driver layer as to which node to allocate
> > an skb data buffer on. This hint would be determined by a feedback
> > mechanism. I was thinking that we could provide a callback function
> > via the skb, that accepted the skb and the originating net_device.
> > This callback can track statistics on which numa nodes consume
> > (read: copy data from) skbs that were produced by specific net
> > devices. Then, when in the future that netdevice allocates a new
> > skb (perhaps via netdev_alloc_skb), we can use that statistical
> > profile to determine if the data buffer should be allocated on the
> > local node, or on a remote node instead.
>
> No matter what, you will do an inter-node memory operation.
>
> Unless, the consumer NUMA node is the same as the one the
> device is on.
>
> Because since the device is on a NUMA node, if you DMA remotely
> you've eaten the NUMA cost already.
>
> If you always DMA to the device's NUMA node (what we try to do now) at
> least the is the possibility of eliminating cross-NUMA traffic.
>
> Better to move the application or stack processing towards the NUMA
> node the network device is on, I think.
>
I take your point, and I see where we attempt to allocate on the same node that
the device is in, in __netdev_alloc_skb. I'm just wondering whether (since we are
going to have cross-node traffic if the app and device are on disparate nodes)
it wouldn't be better to eat that cross-node latency at the bottom of the
stack rather than at the top. If we do it at the bottom, we at least have a DMA
engine eating that time, rather than a CPU that could be doing some other work.
Not sure if that's worth the effort, but I think it's worth asking the question.
Thoughts?
Regards
Neil
* Re: [RFC] Idea about increasing efficiency of skb allocation in network devices
2009-07-27 7:58 ` Eric Dumazet
2009-07-27 8:27 ` Brice Goglin
@ 2009-07-27 10:55 ` Neil Horman
2009-07-29 8:20 ` Brice Goglin
1 sibling, 1 reply; 9+ messages in thread
From: Neil Horman @ 2009-07-27 10:55 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Brice Goglin, David Miller, netdev
On Mon, Jul 27, 2009 at 09:58:22AM +0200, Eric Dumazet wrote:
> Brice Goglin wrote:
> > David Miller wrote:
> >> From: Neil Horman <nhorman@tuxdriver.com>
> >> Date: Sun, 26 Jul 2009 20:36:09 -0400
> >>
> >>
> >>> Since Network devices dma their memory into a provided DMA
> >>> buffer (which can usually be at an arbitrary location, as they must
> >>> cross potentially several pci busses to reach any memory location),
> >>> I'm postulating that it would increase our receive path efficiency
> >>> to provide a hint to the driver layer as to which node to allocate
> >>> an skb data buffer on. This hint would be determined by a feedback
> >>> mechanism. I was thinking that we could provide a callback function
> >>> via the skb, that accepted the skb and the originating net_device.
> >>> This callback can track statistics on which numa nodes consume
> >>> (read: copy data from) skbs that were produced by specific net
> >>> devices. Then, when in the future that netdevice allocates a new
> >>> skb (perhaps via netdev_alloc_skb), we can use that statistical
> >>> profile to determine if the data buffer should be allocated on the
> >>> local node, or on a remote node instead.
> >>>
> >> No matter what, you will do an inter-node memory operation.
> >>
> >> Unless, the consumer NUMA node is the same as the one the
> >> device is on.
> >>
> >> Because since the device is on a NUMA node, if you DMA remotely
> >> you've eaten the NUMA cost already.
> >>
> >> If you always DMA to the device's NUMA node (what we try to do now) at
> >> least the is the possibility of eliminating cross-NUMA traffic.
> >>
> >> Better to move the application or stack processing towards the NUMA
> >> node the network device is on, I think.
> >>
> >
> > Is there an easy way to get this NUMA node from the application socket
> > descriptor?
>
> Thats not easy, this information can change for every packet (think of
> bonding setups, whith aggregation of devices on different NUMA nodes)
>
> We could add a getsockopt() call to peek this information from the next
> data to be read from socket (returns node id where skb data is sitting,
> hoping that NIC driver hadnt copybreak it (ie : allocate a small skb and
> copy the device provided data on it before feeding packet to network stack))
>
Would a proc or debugfs interface perhaps be helpful here? Something that
showed a statistical distribution of how many packets were received by
each process on each IRQ (operating under the assumption that each RX queue has
its own MSI IRQ, giving us an easy identifier).
Neil
>
> > Also, one question that was raised at the Linux Symposium is: how do you
> > know which processors run the receive queue for a specific connection ?
> > It would be nice to have a way to retrieve such information in the
> > application to avoid inter-node and inter-core/cache traffic.
>
> All this depends on the fact you have multiqueue devices or not, and
> trafic spreads on all queues or not.
>
> Assuming you have single queue device, only current way to handle
> this is to do the reverse thinking.
>
> Ie, bind NIC interrupts to the appropriate set of cpus, and
> possibly bind user apps threads dealing with network trafic to same set.
>
> Only background or cpu hungry threads should be allowed to run
> on foreigns nodes.
>
>
* Re: [RFC] Idea about increasing efficiency of skb allocation in network devices
2009-07-27 10:55 ` Neil Horman
@ 2009-07-29 8:20 ` Brice Goglin
2009-07-29 10:47 ` Neil Horman
0 siblings, 1 reply; 9+ messages in thread
From: Brice Goglin @ 2009-07-29 8:20 UTC (permalink / raw)
To: Neil Horman; +Cc: Eric Dumazet, David Miller, netdev
Neil Horman wrote:
>>> Is there an easy way to get this NUMA node from the application socket
>>> descriptor?
>>>
>> Thats not easy, this information can change for every packet (think of
>> bonding setups, whith aggregation of devices on different NUMA nodes)
>>
>> We could add a getsockopt() call to peek this information from the next
>> data to be read from socket (returns node id where skb data is sitting,
>> hoping that NIC driver hadnt copybreak it (ie : allocate a small skb and
>> copy the device provided data on it before feeding packet to network stack))
>>
>>
> Would a proc or debugfs interface perhaps be helpful here? Something that
> perhaps showed a statistical distribution of how many packets were received by
> each process on each irq (operating under the assumption that each rx queue has
> its own msi irq, giving us an easy identifier).
>
It could be interesting. But unprivileged user processes cannot read
/proc/irq/*/smp_affinity, so they would not be able to translate your
procfs information into a binding hint.
Brice
* Re: [RFC] Idea about increasing efficiency of skb allocation in network devices
2009-07-29 8:20 ` Brice Goglin
@ 2009-07-29 10:47 ` Neil Horman
0 siblings, 0 replies; 9+ messages in thread
From: Neil Horman @ 2009-07-29 10:47 UTC (permalink / raw)
To: Brice Goglin; +Cc: Eric Dumazet, David Miller, netdev
On Wed, Jul 29, 2009 at 10:20:55AM +0200, Brice Goglin wrote:
> Neil Horman wrote:
> >>> Is there an easy way to get this NUMA node from the application socket
> >>> descriptor?
> >>>
> >> Thats not easy, this information can change for every packet (think of
> >> bonding setups, whith aggregation of devices on different NUMA nodes)
> >>
> >> We could add a getsockopt() call to peek this information from the next
> >> data to be read from socket (returns node id where skb data is sitting,
> >> hoping that NIC driver hadnt copybreak it (ie : allocate a small skb and
> >> copy the device provided data on it before feeding packet to network stack))
> >>
> >>
> > Would a proc or debugfs interface perhaps be helpful here? Something that
> > perhaps showed a statistical distribution of how many packets were received by
> > each process on each irq (operating under the assumption that each rx queue has
> > its own msi irq, giving us an easy identifier).
> >
>
> It could be intereting. But unprivileged user processes cannot read
> /proc/irq/*/smp_affinity, so they would not be able to translate your
> procfs information into a binding hint.
>
I don't think you'd need read access to the IRQ affinity files. If the above
debugfs/proc information were exported to indicate which NUMA node or CPU the
allocated skbs were local to, that could be used by the process to set its
scheduler affinity via taskset.
Neil
> Brice
>
>