netdev.vger.kernel.org archive mirror
* dma_alloc_coherent() to use memory close to cpu
@ 2015-05-13 12:40 Amir Vadai
  2015-05-13 15:49 ` Alexander Duyck
From: Amir Vadai @ 2015-05-13 12:40 UTC
  To: Alexander Duyck; +Cc: achiad, Or Gerlitz, netdev@vger.kernel.org

Hi Alex,

dma_alloc_coherent() allocates memory close to the device, according
to dev_to_node(dev). Sometimes it is better to use memory close to the
CPU, e.g. for a buffer that the NIC writes and the CPU reads.

It seems that you thought so too, and added a commit to the ixgbe
driver that follows that logic [1].
You added calls to set_dev_node() before and after the allocation.
This seems to be prone to races in case multiple processes want to
allocate in parallel. The proper fix seems to be to extend
dma_alloc_coherent() to accept a NUMA node as an argument (for when the
device's node is not good enough).
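
For readers unfamiliar with the pattern, here is a minimal user-space
sketch of the save/override/restore sequence described above. It is not
the ixgbe code: struct device, dev_to_node() and set_dev_node() are
simplified stand-ins for the kernel ones, and mock_buf,
mock_dma_alloc_coherent() and alloc_ring_on_node() are hypothetical
names invented for illustration. The mock allocator only records which
node the device reported at allocation time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* User-space stand-in, NOT the kernel's struct device. */
struct device {
	int numa_node;
};

/* Stand-ins that mirror the kernel helpers' names and intent. */
static int dev_to_node(const struct device *dev)
{
	return dev->numa_node;
}

static void set_dev_node(struct device *dev, int node)
{
	dev->numa_node = node;
}

/* Hypothetical record of where a mock allocation was placed. */
struct mock_buf {
	int node;
	size_t size;
};

/* Mock allocator: places the buffer on whatever node the device reports. */
static struct mock_buf *mock_dma_alloc_coherent(struct device *dev, size_t size)
{
	struct mock_buf *buf = malloc(sizeof(*buf));

	if (!buf)
		return NULL;
	buf->node = dev_to_node(dev);
	buf->size = size;
	return buf;
}

/*
 * The pattern under discussion: temporarily retarget the device's node,
 * allocate, then restore. Between the two set_dev_node() calls the
 * shared device state is modified, which is where a concurrent
 * allocation on the same device could observe the wrong node.
 */
static struct mock_buf *alloc_ring_on_node(struct device *dev, size_t size,
					   int ring_node)
{
	int orig_node = dev_to_node(dev);
	struct mock_buf *buf;

	assert(dev != NULL);
	set_dev_node(dev, ring_node);
	buf = mock_dma_alloc_coherent(dev, size);
	set_dev_node(dev, orig_node);
	return buf;
}
```

Calling alloc_ring_on_node(&dev, 4096, 1) on a device whose node is 0
yields a buffer recorded on node 1 and leaves the device's node at 0
afterwards.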

I looked for, but couldn't find any discussion about that - is there a
special reason not to extend dma_alloc_coherent()?

[1] - de88eee ("ixgbe: Allocate rings as part of the q_vector")

Thanks,
Amir


* Re: dma_alloc_coherent() to use memory close to cpu
  2015-05-13 12:40 dma_alloc_coherent() to use memory close to cpu Amir Vadai
@ 2015-05-13 15:49 ` Alexander Duyck
  2015-05-14  7:15   ` Amir Vadai
From: Alexander Duyck @ 2015-05-13 15:49 UTC
  To: Amir Vadai; +Cc: achiad, Or Gerlitz, netdev@vger.kernel.org

On 05/13/2015 05:40 AM, Amir Vadai wrote:
> Hi Alex,
>
> dma_alloc_coherent() allocates memory close to the device, according
> to dev_to_node(dev). Sometimes it is better to use memory close to the
> CPU, e.g. for a buffer that the NIC writes and the CPU reads.

Yes, the easiest way to visualize this is to ask whether you want this
to operate under a push or a pull model.  Either the hardware pushes the
data to where the interrupt will be processed, or the interrupt handler
has to pull the data to the CPU it is running on.  As long as there are
enough PCIe credits to keep the PCIe link fully utilized, you are
usually better off pushing the data to the CPU the interrupt is on, as
the reads/writes are usually batched by the hardware.

> It seems that you thought so too, and added a commit to the ixgbe
> driver that follows that logic [1].
> You added calls to set_dev_node() before and after the allocation.
> This seems to be prone to races in case multiple processes want to
> allocate in parallel. The proper fix seems to be to extend
> dma_alloc_coherent() to accept a NUMA node as an argument (for when the
> device's node is not good enough).

I'm not sure how racy it would be since you can really only have one 
driver per device and the function that does this is protected by the 
RTNL lock as I recall.

> I looked for, but couldn't find any discussion about that - is there a
> special reason not to extend dma_alloc_coherent()?

I think most of that is due to the fact that it is buried in multiple 
levels of abstraction and at the time I wrote that code I had only been 
working in the kernel drivers for a year or so.  I had to revert similar 
code from igb as it was buggy so I wasn't really in a place to be 
modifying that at that time.

If you are planning to give it a try I would say go for it.  The fact is
there are models where you want to have the device memory spread around,
since DMA writes to a remote node are usually much less expensive than
accessing a remote node from the interrupt handler.

- Alex


* Re: dma_alloc_coherent() to use memory close to cpu
  2015-05-13 15:49 ` Alexander Duyck
@ 2015-05-14  7:15   ` Amir Vadai
From: Amir Vadai @ 2015-05-14  7:15 UTC
  To: Alexander Duyck
  Cc: Amir Vadai, Achiad Shochat, Or Gerlitz, netdev@vger.kernel.org

On Wed, May 13, 2015 at 6:49 PM, Alexander Duyck
<alexander.h.duyck@redhat.com> wrote:
> On 05/13/2015 05:40 AM, Amir Vadai wrote:
>>
>> Hi Alex,
>>
>> dma_alloc_coherent() allocates memory close to the device, according
>> to dev_to_node(dev). Sometimes it is better to use memory close to the
>> CPU, e.g. for a buffer that the NIC writes and the CPU reads.
>
>
> Yes, the easiest way to visualize this is to ask whether you want this
> to operate under a push or a pull model.  Either the hardware pushes the
> data to where the interrupt will be processed, or the interrupt handler
> has to pull the data to the CPU it is running on.  As long as there are
> enough PCIe credits to keep the PCIe link fully utilized, you are
> usually better off pushing the data to the CPU the interrupt is on, as
> the reads/writes are usually batched by the hardware.
>
>> It seems that you thought so too, and added a commit to the ixgbe
>> driver that follows that logic [1].
>> You added calls to set_dev_node() before and after the allocation.
>> This seems to be prone to races in case multiple processes want to
>> allocate in parallel. The proper fix seems to be to extend
>> dma_alloc_coherent() to accept a NUMA node as an argument (for when the
>> device's node is not good enough).
>
>
> I'm not sure how racy it would be since you can really only have one driver
> per device and the function that does this is protected by the RTNL lock as
> I recall.
>
>> I looked for, but couldn't find any discussion about that - is there a
>> special reason not to extend dma_alloc_coherent()?
>
>
> I think most of that is due to the fact that it is buried in multiple levels
> of abstraction and at the time I wrote that code I had only been working in
> the kernel drivers for a year or so.  I had to revert similar code from igb
> as it was buggy so I wasn't really in a place to be modifying that at that
> time.
>
> If you are planning to give it a try I would say go for it.  The fact is
> there are models where you want to have the device memory spread around,
> since DMA writes to a remote node are usually much less expensive than
> accessing a remote node from the interrupt handler.

I will try to find some time to extend dma_alloc_coherent() - I see
this set_dev_node() before-and-after pattern in too many drivers
(including Mellanox's)...
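
As a rough illustration of what such an extension could look like, here
is a user-space sketch only. Nothing in this thread defines the real
interface: mock_dma_alloc_coherent_node(), mock_buf and MOCK_NO_NODE
are hypothetical names, and struct device is a simplified stand-in for
the kernel structure. The point is that the caller passes the desired
node explicitly, so the device is never modified and there is no window
in which another allocation could see a temporarily changed node.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sentinel meaning "no preference, use the device's node". */
#define MOCK_NO_NODE (-1)

/* User-space stand-in, NOT the kernel's struct device. */
struct device {
	int numa_node;
};

/* Hypothetical record of where a mock allocation was placed. */
struct mock_buf {
	int node;
	size_t size;
};

/*
 * Hypothetical node-aware allocator. The device is taken as const:
 * the requested node travels as an argument instead of being written
 * into shared device state before the call and restored after it.
 */
static struct mock_buf *mock_dma_alloc_coherent_node(const struct device *dev,
						     size_t size, int node)
{
	struct mock_buf *buf;

	assert(dev != NULL);
	buf = malloc(sizeof(*buf));
	if (!buf)
		return NULL;
	/* Fall back to the device's own node when none is requested. */
	buf->node = (node == MOCK_NO_NODE) ? dev->numa_node : node;
	buf->size = size;
	return buf;
}
```

With a device on node 0, requesting node 1 records the buffer on node
1, while passing MOCK_NO_NODE keeps the existing behaviour of following
the device's node.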

Thanks for the quick reply,
Amir

>
> - Alex
