Netdev List

* Re: Designing a safe RX-zero-copy Memory Model for Networking
From: John Fastabend @ 2016-12-13 17:43 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Christoph Lameter
  Cc: Mike Rapoport, netdev@vger.kernel.org, linux-mm, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg, Kalman Meth, Vladislav Yasevich
In-Reply-To: <20161213171028.24dbf519@redhat.com>

On 16-12-13 08:10 AM, Jesper Dangaard Brouer wrote:
> 
> On Mon, 12 Dec 2016 12:06:59 -0600 (CST) Christoph Lameter <cl@linux.com> wrote:
>> On Mon, 12 Dec 2016, Jesper Dangaard Brouer wrote:
>>
>>> Hmmm. If you can rely on hardware setup to give you steering and
>>> dedicated access to the RX rings.  In those cases, I guess, the "push"
>>> model could be a more direct API approach.  
>>
>> If the hardware does not support steering then one should be able to
>> provide those services in software.
> 
> This is the early demux problem.  With the push-mode of registering
> memory, you need hardware steering support, for zero-copy support, as
> the software step happens after DMA engine have written into the memory.
> 
> My model pre-VMA map all the pages in the RX ring (if zero-copy gets
> enabled, by a single user).  The software step can filter and zero-copy
> send packet-pages to the application/socket that requested this. The

What does "zero-copy send packet-pages to the application/socket that
requested this" mean? At the moment on x86 page-flipping appears to be
more expensive than memcpy (I can post some data shortly) and shared
memory was proposed and rejected for security reasons when we were
working on bifurcated driver.

> disadvantage is all zero-copy application need to share this VMA
> mapping.  This is solved by configuring HW filters into a RX-queue, and
> then only attach your zero-copy application to that queue.
> 
> 
>>> I was shooting for a model that worked without hardware support.
>>> And then transparently benefit from HW support by configuring a HW
>>> filter into a specific RX queue and attaching/using to that queue.  
>>
>> The discussion here is a bit amusing since these issues have been
>> resolved a long time ago with the design of the RDMA subsystem. Zero
>> copy is already in wide use. Memory registration is used to pin down
>> memory areas. Work requests can be filed with the RDMA subsystem that
>> then send and receive packets from the registered memory regions.
>> This is not strictly remote memory access but this is a basic mode of
>> operations supported  by the RDMA subsystem. The mlx5 driver quoted
>> here supports all of that.
> 
> I hear what you are saying.  I will look into a push-model, as it might
> be a better solution.
>  I will read up on RDMA + verbs and learn more about their API model.  I
> even plan to write a small sample program to get a feeling for the API,
> and maybe we can use that as a baseline for the performance target we
> can obtain on the same HW. (Thanks to Björn for already giving me some
> pointer here)
> 
> 
>> What is bad about RDMA is that it is a separate kernel subsystem.
>> What I would like to see is a deeper integration with the network
>> stack so that memory regions can be registred with a network socket
>> and work requests then can be submitted and processed that directly
>> read and write in these regions. The network stack should provide the
>> services that the hardware of the NIC does not suppport as usual.
> 
> Interesting.  So you even imagine sockets registering memory regions
> with the NIC.  If we had a proper NIC HW filter API across the drivers,
> to register the steering rule (like ibv_create_flow), this would be
> doable, but we don't (DPDK actually have an interesting proposal[1])
> 

Note rte_flow is in the same family of APIs as the proposed Flow API
that was rejected as well.  The features in Flow API that are not
included in the rte_flow proposal have logical extensions to support
them. In kernel we have 'tc' and multiple vendors support cls_flower
and cls_tc which offer a subset of the functionality in the DPDK
implementation.

Are you suggesting 'tc' is not a proper NIC HW filter API?

>  
>> The RX/TX ring in user space should be an additional mode of
>> operation of the socket layer. Once that is in place the "Remote
>> memory acces" can be trivially implemented on top of that and the
>> ugly RDMA sidecar subsystem can go away.
>  
> I cannot follow that 100%, but I guess you are saying we also need a
> more efficient mode of handing over pages/packet to userspace (than
> going through the normal socket API calls).
> 
> 
> Appreciate your input, it challenged my thinking.
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply