From: Jonathan Lemon <jonathan.lemon@gmail.com>
To: <netdev@vger.kernel.org>
Cc: <kernel-team@fb.com>, <axboe@kernel.dk>
Subject: [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU.
Date: Thu, 18 Jun 2020 09:09:20 -0700
Message-ID: <20200618160941.879717-1-jonathan.lemon@gmail.com>
This series is a working RFC proof-of-concept that implements DMA
zero-copy between the NIC and a GPU device for the data path, while
keeping the protocol processing on the host CPU.
This also works for zero-copy send/recv to host (CPU) memory.
Current limitations:
- mlx5 only, header splitting is at a fixed offset.
- currently only TCP protocol delivery is performed.
- not optimized (hey, it works!)
- TX completion notification is planned, but not in this patchset.
- one socket per device
- not compatible with xsk (reuses the same data structures)
- not compatible with bpf payload inspection
- x86 !iommu only; liberties are taken with PA addresses.
The next section provides a brief overview of how things work, for this
phase 0 proof of concept.
A transport context is created on a device, which sets up the datapath,
and the device queues. Only specialized RX queues are needed, the
standard TX queues are used for packet transmission.
Memory areas which participate in zero-copy transmission are registered
with the context. These areas can be used as either RX packet buffers
or TX data areas (or both). The memory can come from either malloc/mmap
or cudaMalloc(). The latter call provides a handle to the userspace
application, but the memory region is only accessible to the GPU.
A socket is created and registered with the context, which sets
SOCK_ZEROCOPY, and is bound to the device with SO_BINDTODEVICE.
Asymmetric data paths (zero-copy TX with normal RX, or vice versa) are
possible, but the current PoC sets things up for symmetric transport.
The application needs to provide the RX buffers to the receive queue,
similar to AF_XDP.
Once things are set up, data is sent to the network with sendmsg(). The
iovecs provided contain an address in the region previously registered.
The normal protocol stack processing constructs the packet, but the data
is not touched by the stack. In this phase, the application is not
notified when the protocol processing is complete and the data area is
safe to modify again.
For RX, packets undergo the usual protocol processing and are delivered
up to the socket receive queue. At this point, the skb data fragments
are delivered to the application as iovecs through an AF_XDP style
queue. The application can poll for readability, but does not use
read() to receive the data.
The initial test application is iperf3; a modified version, together
with the userspace library, is available at:
https://github.com/jlemon/iperf
https://github.com/jlemon/netgpu
Running "iperf3 -s -z --dport 8888" (host memory) on a 12Gbps link:
11.3 Gbit/sec receive
10.8 Gbit/sec transmit
Running "iperf3 -s -z --dport 8888 --gpu" on a 25Gbps link:
22.5 Gbit/sec receive
12.6 Gbit/sec transmit (!!!)
For the GPU runs, the Intel PCI monitoring tools were used to confirm
that the host PCI bus was mostly idle. The TX performance needs further
investigation.
Comments welcome. The next phase of the work will clean up the
interface, add completion notifications, and provide a flexible
queue-creation mechanism.
--
Jonathan
Jonathan Lemon (21):
mm: add {add|release}_memory_pages
mm: Allow DMA mapping of pages which are not online
tcp: Pad TCP options out to a fixed size
mlx5: add definitions for header split and netgpu
mlx5/xsk: check that xsk does not conflict with netgpu
mlx5: add header_split flag
mlx5: remove the umem parameter from mlx5e_open_channel
misc: add shqueue.h for prototyping
include: add definitions for netgpu
mlx5: add netgpu queue functions
skbuff: add a zc_netgpu bitflag
mlx5: hook up the netgpu channel functions
netdevice: add SETUP_NETGPU to the netdev_bpf structure
kernel: export free_uid
netgpu: add network/gpu dma module
lib: have __zerocopy_sg_from_iter get netgpu pages for a sk
net/core: add the SO_REGISTER_DMA socket option
tcp: add MSG_NETDMA flag for sendmsg()
core: add page recycling logic for netgpu pages
core/skbuff: use skb_zdata for testing whether skb is zerocopy
mlx5: add XDP_SETUP_NETGPU hook
drivers/misc/Kconfig | 1 +
drivers/misc/Makefile | 1 +
drivers/misc/netgpu/Kconfig | 10 +
drivers/misc/netgpu/Makefile | 11 +
drivers/misc/netgpu/nvidia.c | 1516 +++++++++++++++++
.../net/ethernet/mellanox/mlx5/core/Makefile | 3 +-
drivers/net/ethernet/mellanox/mlx5/core/en.h | 22 +-
.../mellanox/mlx5/core/en/netgpu/setup.c | 475 ++++++
.../mellanox/mlx5/core/en/netgpu/setup.h | 42 +
.../net/ethernet/mellanox/mlx5/core/en/txrx.h | 3 +
.../ethernet/mellanox/mlx5/core/en/xsk/umem.c | 3 +
.../ethernet/mellanox/mlx5/core/en/xsk/umem.h | 3 +
.../ethernet/mellanox/mlx5/core/en_ethtool.c | 15 +
.../net/ethernet/mellanox/mlx5/core/en_main.c | 118 +-
.../net/ethernet/mellanox/mlx5/core/en_rx.c | 52 +-
.../net/ethernet/mellanox/mlx5/core/en_txrx.c | 15 +-
include/linux/dma-mapping.h | 4 +-
include/linux/memory_hotplug.h | 4 +
include/linux/mmzone.h | 7 +
include/linux/netdevice.h | 6 +
include/linux/skbuff.h | 27 +-
include/linux/socket.h | 1 +
include/linux/uio.h | 4 +
include/net/netgpu.h | 65 +
include/uapi/asm-generic/socket.h | 2 +
include/uapi/misc/netgpu.h | 43 +
include/uapi/misc/shqueue.h | 205 +++
kernel/user.c | 1 +
lib/iov_iter.c | 45 +
mm/memory_hotplug.c | 65 +-
net/core/datagram.c | 6 +-
net/core/skbuff.c | 44 +-
net/core/sock.c | 26 +
net/ipv4/tcp.c | 8 +
net/ipv4/tcp_output.c | 16 +
35 files changed, 2828 insertions(+), 41 deletions(-)
create mode 100644 drivers/misc/netgpu/Kconfig
create mode 100644 drivers/misc/netgpu/Makefile
create mode 100644 drivers/misc/netgpu/nvidia.c
create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c
create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.h
create mode 100644 include/net/netgpu.h
create mode 100644 include/uapi/misc/netgpu.h
create mode 100644 include/uapi/misc/shqueue.h
--
2.24.1