Re: [RFC PATCH v2 00/11] Device Memory TCP

public inbox for linux-media@vger.kernel.org
 help / color / mirror / Atom feed

From: "Christian König" <christian.koenig@amd.com>
To: Mina Almasry <almasrymina@google.com>,
	netdev@vger.kernel.org, linux-media@vger.kernel.org,
	dri-devel@lists.freedesktop.org
Cc: "David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	Ilias Apalodimas <ilias.apalodimas@linaro.org>,
	Arnd Bergmann <arnd@arndb.de>, David Ahern <dsahern@kernel.org>,
	Willem de Bruijn <willemdebruijn.kernel@gmail.com>,
	Sumit Semwal <sumit.semwal@linaro.org>,
	Jason Gunthorpe <jgg@ziepe.ca>,
	Hari Ramakrishnan <rharix@google.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Andy Lutomirski <luto@kernel.org>,
	stephen@networkplumber.org, sdf@google.com
Subject: Re: [RFC PATCH v2 00/11] Device Memory TCP
Date: Thu, 10 Aug 2023 12:29:08 +0200	[thread overview]
Message-ID: <1009bd5b-d577-ca7b-8eff-192ee89ad67d@amd.com> (raw)
In-Reply-To: <20230810015751.3297321-1-almasrymina@google.com>

Am 10.08.23 um 03:57 schrieb Mina Almasry:
> Changes in RFC v2:
> ------------------
>
> The sticking point in RFC v1[1] was the dma-buf pages approach we used to
> deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
> that attempts to resolve this by implementing scatterlist support in the
> networking stack, such that we can import the dma-buf scatterlist
> directly.

Impressive work, I didn't thought that this would be possible that "easily".

Please note that we have considered replacing scatterlists with simple 
arrays of DMA-addresses in the DMA-buf framework to avoid people trying 
to access the struct page inside the scatterlist.

It might be a good idea to push for that first before this here is 
finally implemented.

GPU drivers already convert the scatterlist used to arrays of 
DMA-addresses as soon as they get them. This leaves RDMA and V4L as the 
other two main users which would need to be converted.

>   This is the approach proposed at a high level here[2].
>
> Detailed changes:
> 1. Replaced dma-buf pages approach with importing scatterlist into the
>     page pool.
> 2. Replace the dma-buf pages centric API with a netlink API.
> 3. Removed the TX path implementation - there is no issue with
>     implementing the TX path with scatterlist approach, but leaving
>     out the TX path makes it easier to review.
> 4. Functionality is tested with this proposal, but I have not conducted
>     perf testing yet. I'm not sure there are regressions, but I removed
>     perf claims from the cover letter until they can be re-confirmed.
> 5. Added Signed-off-by: contributors to the implementation.
> 6. Fixed some bugs with the RX path since RFC v1.
>
> Any feedback welcome, but specifically the biggest pending questions
> needing feedback IMO are:
>
> 1. Feedback on the scatterlist-based approach in general.

As far as I can see this sounds like the right thing to do in general.

Question is rather if we should stick with scatterlist, use array of 
DMA-addresses or maybe even come up with a completely new structure.

> 2. Netlink API (Patch 1 & 2).

How does netlink manage the lifetime of objects?

> 3. Approach to handle all the drivers that expect to receive pages from
>     the page pool (Patch 6).

Can't say anything about that. I know TCP/IP inside out, but I'm a GPU 
and not a network driver author.

Regards,
Christian.

>
> [1] https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb0b4@gmail.com/T/
> [2] https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLXCc=H7Nw@mail.gmail.com/
>
> ----------------------
>
> * TL;DR:
>
> Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
> from device memory efficiently, without bouncing the data to a host memory
> buffer.
>
> * Problem:
>
> A large amount of data transfers have device memory as the source and/or
> destination. Accelerators drastically increased the volume of such transfers.
> Some examples include:
> - ML accelerators transferring large amounts of training data from storage into
>    GPU/TPU memory. In some cases ML training setup time can be as long as 50% of
>    TPU compute time, improving data transfer throughput & efficiency can help
>    improving GPU/TPU utilization.
>
> - Distributed training, where ML accelerators, such as GPUs on different hosts,
>    exchange data among them.
>
> - Distributed raw block storage applications transfer large amounts of data with
>    remote SSDs, much of this data does not require host processing.
>
> Today, the majority of the Device-to-Device data transfers the network are
> implemented as the following low level operations: Device-to-Host copy,
> Host-to-Host network transfer, and Host-to-Device copy.
>
> The implementation is suboptimal, especially for bulk data transfers, and can
> put significant strains on system resources, such as host memory bandwidth,
> PCIe bandwidth, etc. One important reason behind the current state is the
> kernel’s lack of semantics to express device to network transfers.
>
> * Proposal:
>
> In this patch series we attempt to optimize this use case by implementing
> socket APIs that enable the user to:
>
> 1. send device memory across the network directly, and
> 2. receive incoming network packets directly into device memory.
>
> Packet _payloads_ go directly from the NIC to device memory for receive and from
> device memory to NIC for transmit.
> Packet _headers_ go to/from host memory and are processed by the TCP/IP stack
> normally. The NIC _must_ support header split to achieve this.
>
> Advantages:
>
> - Alleviate host memory bandwidth pressure, compared to existing
>   network-transfer + device-copy semantics.
>
> - Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
>    of the PCIe tree, compared to traditional path which sends data through the
>    root complex.
>
> * Patch overview:
>
> ** Part 1: netlink API
>
> Gives user ability to bind dma-buf to an RX queue.
>
> ** Part 2: scatterlist support
>
> Currently the standard for device memory sharing is DMABUF, which doesn't
> generate struct pages. On the other hand, networking stack (skbs, drivers, and
> page pool) operate on pages. We have 2 options:
>
> 1. Generate struct pages for dmabuf device memory, or,
> 2. Modify the networking stack to process scatterlist.
>
> Approach #1 was attempted in RFC v1. RFC v2 implements approach #2.
>
> ** part 3: page pool support
>
> We piggy back on page pool memory providers proposal:
> https://github.com/kuba-moo/linux/tree/pp-providers
>
> It allows the page pool to define a memory provider that provides the
> page allocation and freeing. It helps abstract most of the device memory
> TCP changes from the driver.
>
> ** part 4: support for unreadable skb frags
>
> Page pool iovs are not accessible by the host; we implement changes
> throughput the networking stack to correctly handle skbs with unreadable
> frags.
>
> ** Part 5: recvmsg() APIs
>
> We define user APIs for the user to send and receive device memory.
>
> Not included with this RFC is the GVE devmem TCP support, just to
> simplify the review. Code available here if desired:
> https://github.com/mina/linux/tree/tcpdevmem
>
> This RFC is built on top of net-next with Jakub's pp-providers changes
> cherry-picked.
>
> * NIC dependencies:
>
> 1. (strict) Devmem TCP require the NIC to support header split, i.e. the
>     capability to split incoming packets into a header + payload and to put
>     each into a separate buffer. Devmem TCP works by using device memory
>     for the packet payload, and host memory for the packet headers.
>
> 2. (optional) Devmem TCP works better with flow steering support & RSS support,
>     i.e. the NIC's ability to steer flows into certain rx queues. This allows the
>     sysadmin to enable devmem TCP on a subset of the rx queues, and steer
>     devmem TCP traffic onto these queues and non devmem TCP elsewhere.
>
> The NIC I have access to with these properties is the GVE with DQO support
> running in Google Cloud, but any NIC that supports these features would suffice.
> I may be able to help reviewers bring up devmem TCP on their NICs.
>
> * Testing:
>
> The series includes a udmabuf kselftest that show a simple use case of
> devmem TCP and validates the entire data path end to end without
> a dependency on a specific dmabuf provider.
>
> ** Test Setup
>
> Kernel: net-next with this RFC and memory provider API cherry-picked
> locally.
>
> Hardware: Google Cloud A3 VMs.
>
> NIC: GVE with header split & RSS & flow steering support.
>
> Mina Almasry (11):
>    net: add netdev netlink api to bind dma-buf to a net device
>    netdev: implement netlink api to bind dma-buf to netdevice
>    netdev: implement netdevice devmem allocator
>    memory-provider: updates to core provider API for devmem TCP
>    memory-provider: implement dmabuf devmem memory provider
>    page-pool: add device memory support
>    net: support non paged skb frags
>    net: add support for skbs with unreadable frags
>    tcp: implement recvmsg() RX path for devmem TCP
>    net: add SO_DEVMEM_DONTNEED setsockopt to release RX pages
>    selftests: add ncdevmem, netcat for devmem TCP
>
>   Documentation/netlink/specs/netdev.yaml |  27 ++
>   include/linux/netdevice.h               |  61 +++
>   include/linux/skbuff.h                  |  54 ++-
>   include/linux/socket.h                  |   1 +
>   include/net/page_pool.h                 | 186 ++++++++-
>   include/net/sock.h                      |   2 +
>   include/net/tcp.h                       |   5 +-
>   include/uapi/asm-generic/socket.h       |   6 +
>   include/uapi/linux/netdev.h             |  10 +
>   include/uapi/linux/uio.h                |  10 +
>   net/core/datagram.c                     |   6 +
>   net/core/dev.c                          | 214 ++++++++++
>   net/core/gro.c                          |   2 +-
>   net/core/netdev-genl-gen.c              |  14 +
>   net/core/netdev-genl-gen.h              |   1 +
>   net/core/netdev-genl.c                  | 103 +++++
>   net/core/page_pool.c                    | 171 ++++++--
>   net/core/skbuff.c                       |  80 +++-
>   net/core/sock.c                         |  36 ++
>   net/ipv4/tcp.c                          | 196 ++++++++-
>   net/ipv4/tcp_input.c                    |  13 +-
>   net/ipv4/tcp_ipv4.c                     |   7 +
>   net/ipv4/tcp_output.c                   |   5 +-
>   net/packet/af_packet.c                  |   4 +-
>   tools/include/uapi/linux/netdev.h       |  10 +
>   tools/net/ynl/generated/netdev-user.c   |  41 ++
>   tools/net/ynl/generated/netdev-user.h   |  46 ++
>   tools/testing/selftests/net/.gitignore  |   1 +
>   tools/testing/selftests/net/Makefile    |   5 +
>   tools/testing/selftests/net/ncdevmem.c  | 534 ++++++++++++++++++++++++
>   30 files changed, 1787 insertions(+), 64 deletions(-)
>   create mode 100644 tools/testing/selftests/net/ncdevmem.c
>

next prev parent reply	other threads:[~2023-08-10 10:29 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-10  1:57 [RFC PATCH v2 00/11] Device Memory TCP Mina Almasry
2023-08-10  1:57 ` [RFC PATCH v2 01/11] net: add netdev netlink api to bind dma-buf to a net device Mina Almasry
2023-08-10 16:04   ` Samudrala, Sridhar
2023-08-11  2:19     ` Mina Almasry
2023-08-10  1:57 ` [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice Mina Almasry
2023-08-13 11:26   ` Leon Romanovsky
2023-08-14  1:10   ` David Ahern
2023-08-14  3:15     ` Mina Almasry
2023-08-30 12:38   ` Yunsheng Lin
2023-09-08  0:47   ` David Wei
2023-08-10  1:57 ` [RFC PATCH v2 03/11] netdev: implement netdevice devmem allocator Mina Almasry
2023-08-10  1:57 ` [RFC PATCH v2 04/11] memory-provider: updates to core provider API for devmem TCP Mina Almasry
2023-08-10  1:57 ` [RFC PATCH v2 05/11] memory-provider: implement dmabuf devmem memory provider Mina Almasry
2023-08-10  1:57 ` [RFC PATCH v2 06/11] page-pool: add device memory support Mina Almasry
2023-08-19  9:51   ` Jesper Dangaard Brouer
2023-08-19 14:08     ` Willem de Bruijn
2023-08-19 15:22       ` Jesper Dangaard Brouer
2023-08-19 15:49         ` David Ahern
2023-08-19 16:12           ` Willem de Bruijn
2023-08-21 21:31             ` Jakub Kicinski
2023-08-22  0:58               ` Willem de Bruijn
2023-08-19 16:11         ` Willem de Bruijn
2023-08-19 20:24         ` Mina Almasry
2023-08-19 20:27           ` Mina Almasry
2023-09-08  2:32           ` David Wei
2023-08-22  6:05     ` Mina Almasry
2023-08-22 12:24       ` Jesper Dangaard Brouer
2023-08-22 23:33         ` Mina Almasry
2023-08-10  1:57 ` [RFC PATCH v2 07/11] net: support non paged skb frags Mina Almasry
2023-08-10  1:57 ` [RFC PATCH v2 08/11] net: add support for skbs with unreadable frags Mina Almasry
2023-08-10  1:57 ` [RFC PATCH v2 09/11] tcp: implement recvmsg() RX path for devmem TCP Mina Almasry
2023-08-10  1:57 ` [RFC PATCH v2 10/11] net: add SO_DEVMEM_DONTNEED setsockopt to release RX pages Mina Almasry
2023-08-10  1:57 ` [RFC PATCH v2 11/11] selftests: add ncdevmem, netcat for devmem TCP Mina Almasry
2023-08-10 10:29 ` Christian König [this message]
2023-08-10 16:06   ` [RFC PATCH v2 00/11] Device Memory TCP Jason Gunthorpe
2023-08-10 18:44   ` Mina Almasry
2023-08-10 18:58     ` Jason Gunthorpe
2023-08-11  1:56       ` Mina Almasry
2023-08-11 11:02     ` Christian König
2023-08-14  1:12 ` David Ahern
2023-08-14  2:11   ` Mina Almasry
2023-08-17 18:00   ` Pavel Begunkov
2023-08-17 22:18     ` Mina Almasry
2023-08-23 22:52       ` David Wei
2023-08-24  3:35         ` David Ahern
2023-08-15 13:38 ` David Laight
2023-08-15 14:41   ` Willem de Bruijn

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1009bd5b-d577-ca7b-8eff-192ee89ad67d@amd.com \
    --to=christian.koenig@amd.com \
    --cc=almasrymina@google.com \
    --cc=arnd@arndb.de \
    --cc=dan.j.williams@intel.com \
    --cc=davem@davemloft.net \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=dsahern@kernel.org \
    --cc=edumazet@google.com \
    --cc=hawk@kernel.org \
    --cc=ilias.apalodimas@linaro.org \
    --cc=jgg@ziepe.ca \
    --cc=kuba@kernel.org \
    --cc=linux-media@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=rharix@google.com \
    --cc=sdf@google.com \
    --cc=stephen@networkplumber.org \
    --cc=sumit.semwal@linaro.org \
    --cc=willemdebruijn.kernel@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox