* [PATCH net-next v3 0/6] Device memory TCP TX
@ 2025-02-03 22:39 Mina Almasry
  2025-02-03 22:39 ` [PATCH net-next v3 1/6] net: add devmem TCP TX documentation Mina Almasry
                   ` (7 more replies)
  0 siblings, 8 replies; 39+ messages in thread
From: Mina Almasry @ 2025-02-03 22:39 UTC (permalink / raw)
  To: netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest
  Cc: Mina Almasry, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Jonathan Corbet,
	Andrew Lunn, Neal Cardwell, David Ahern, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja

v3: https://patchwork.kernel.org/project/netdevbpf/list/?series=929401&state=*
===

Address minor comments from RFCv2 and fix a few build warnings and
ynl-regen issues. No major changes.

RFC v2: https://patchwork.kernel.org/project/netdevbpf/list/?series=920056&state=*
=======

RFC v2 addresses much of the feedback from RFC v1. I plan on sending
something close to this once net-next reopens, and am sending it slightly
early to get any feedback.

Major changes:
--------------

- Much improved UAPI, as suggested by Stan. We now interpret the iov_base
  of the iov passed in from userspace as the offset into the dmabuf to
  send from. This removes the need to set iov.iov_base = NULL, which may
  be confusing to users, and enables us to send multiple iovs in the same
  sendmsg() call. ncdevmem and the docs show a sample use of that, and a
  minimal sketch follows this list.

- Removed the duplicate dmabuf iov_iter in binding->iov_iter. I think
  this is a good improvement, as it was confusing to keep track of
  2 iterators for the same sendmsg, and mistracking the two iterators
  caused a couple of bugs reported in the last iteration that are now
  resolved with this streamlining.

- Improved test coverage in ncdevmem. Now multiple sendmsg() calls are
  tested, and sending multiple iovs in the same sendmsg() call is tested.

- Fixed issue where dmabuf unmapping was happening in invalid context
  (Stan).
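
A minimal sketch of the new iov_base semantics (offsets and lengths here
are illustrative; the complete example, including the SCM_DEVMEM_DMABUF
cmsg setup, is in the documentation patch below):

	struct msghdr msg = {};
	struct iovec iov[2];

	iov[0].iov_base = (void *)0;	/* send from dmabuf offset 0 */
	iov[0].iov_len = 4096;
	iov[1].iov_base = (void *)8192;	/* send from dmabuf offset 8192 */
	iov[1].iov_len = 4096;

	msg.msg_iov = iov;
	msg.msg_iovlen = 2;

	/* the dmabuf id goes in an SCM_DEVMEM_DMABUF cmsg (see docs patch) */
	sendmsg(socket_fd, &msg, MSG_ZEROCOPY);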

====================================================================

The TX path had been dropped from the Device Memory TCP patch series
post RFCv1 [1], to make that series slightly easier to review. This
series rebases the implementation of the TX path on top of the
net_iov/netmem framework that was agreed upon and merged since. The
motivation for the feature is thoroughly described in the docs & cover
letter of the original proposal, so I don't repeat the lengthy
descriptions here; they are available in [1].

Sending this series as RFC as the window closure is imminent. I plan on
reposting as non-RFC once the tree re-opens, addressing any feedback
I receive in the meantime.

A full outline of TX path usage is detailed in the documentation added
in the first patch.

A test example is available via the kselftest included in the series as
well.

The series is relatively small, as the TX path for this feature largely
piggybacks on the existing MSG_ZEROCOPY implementation.

Patch Overview:
---------------

1. Documentation & tests to give high level overview of the feature
   being added.

2. Add netmem refcounting needed for the TX path.

3. Devmem TX netlink API.

4. Devmem TX net stack implementation.

Testing:
--------

Testing is very similar to the devmem TCP RX path. The ncdevmem test used
for the RX path is now augmented with client functionality to test the TX
path.

* Test Setup:

Kernel: net-next with this RFC and memory provider API cherry-picked
locally.

Hardware: Google Cloud A3 VMs.

NIC: GVE with header split & RSS & flow steering support.

Performance results are not included with this version, unfortunately.
I'm having issues running the dma-buf exporter driver against the
upstream kernel on my test setup. The issues are specific to that
dma-buf exporter and do not affect this patch series. I plan to follow
up this series with perf fixes if the tests point to issues once they're
up and running.

Special thanks to Stan, who took a stab at rebasing the TX implementation
on top of the merged netmem/net_iov framework. Parts of his proposal [2]
that are reused as-is are forked off into their own patches to give him
full credit.

[1] https://lore.kernel.org/netdev/20240909054318.1809580-1-almasrymina@google.com/
[2] https://lore.kernel.org/netdev/20240913150913.1280238-2-sdf@fomichev.me/T/#m066dd407fbed108828e2c40ae50e3f4376ef57fd

Cc: sdf@fomichev.me
Cc: asml.silence@gmail.com
Cc: dw@davidwei.uk
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Victor Nogueira <victor@mojatatu.com>
Cc: Pedro Tammela <pctammela@mojatatu.com>
Cc: Samiullah Khawaja <skhawaja@google.com>


Mina Almasry (5):
  net: add devmem TCP TX documentation
  selftests: ncdevmem: Implement devmem TCP TX
  net: add get_netmem/put_netmem support
  net: devmem: Implement TX path
  net: devmem: make dmabuf unbinding scheduled work

Stanislav Fomichev (1):
  net: devmem: TCP tx netlink api

 Documentation/netlink/specs/netdev.yaml       |  12 +
 Documentation/networking/devmem.rst           | 144 ++++++++-
 include/linux/skbuff.h                        |  15 +-
 include/linux/skbuff_ref.h                    |   4 +-
 include/net/netmem.h                          |   3 +
 include/net/sock.h                            |   1 +
 include/uapi/linux/netdev.h                   |   1 +
 include/uapi/linux/uio.h                      |   6 +-
 net/core/datagram.c                           |  41 ++-
 net/core/devmem.c                             | 111 ++++++-
 net/core/devmem.h                             |  70 +++-
 net/core/netdev-genl-gen.c                    |  13 +
 net/core/netdev-genl-gen.h                    |   1 +
 net/core/netdev-genl.c                        |  66 +++-
 net/core/skbuff.c                             |  36 ++-
 net/core/sock.c                               |   8 +
 net/ipv4/tcp.c                                |  36 ++-
 net/vmw_vsock/virtio_transport_common.c       |   3 +-
 tools/include/uapi/linux/netdev.h             |   1 +
 .../selftests/drivers/net/hw/ncdevmem.c       | 300 +++++++++++++++++-
 20 files changed, 819 insertions(+), 53 deletions(-)

-- 
2.48.1.362.g079036d154-goog



* [PATCH net-next v3 1/6] net: add devmem TCP TX documentation
  2025-02-03 22:39 [PATCH net-next v3 0/6] Device memory TCP TX Mina Almasry
@ 2025-02-03 22:39 ` Mina Almasry
  2025-02-03 22:39 ` [PATCH net-next v3 2/6] selftests: ncdevmem: Implement devmem TCP TX Mina Almasry
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 39+ messages in thread
From: Mina Almasry @ 2025-02-03 22:39 UTC (permalink / raw)
  To: netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest
  Cc: Mina Almasry, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Jonathan Corbet,
	Andrew Lunn, Neal Cardwell, David Ahern, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja

Add documentation outlining the usage and details of the devmem TCP TX
API.

Signed-off-by: Mina Almasry <almasrymina@google.com>

---

v2:
- Update documentation: iov_base is now the dmabuf offset (Stan)
---
 Documentation/networking/devmem.rst | 144 +++++++++++++++++++++++++++-
 1 file changed, 140 insertions(+), 4 deletions(-)

diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst
index d95363645331..8166fe09da13 100644
--- a/Documentation/networking/devmem.rst
+++ b/Documentation/networking/devmem.rst
@@ -62,15 +62,15 @@ More Info
     https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/
 
 
-Interface
-=========
+RX Interface
+============
 
 
 Example
 -------
 
-tools/testing/selftests/net/ncdevmem.c:do_server shows an example of setting up
-the RX path of this API.
+./tools/testing/selftests/drivers/net/hw/ncdevmem:do_server shows an example of
+setting up the RX path of this API.
 
 
 NIC Setup
@@ -235,6 +235,142 @@ can be less than the tokens provided by the user in case of:
 (a) an internal kernel leak bug.
 (b) the user passed more than 1024 frags.
 
+TX Interface
+============
+
+
+Example
+-------
+
+./tools/testing/selftests/drivers/net/hw/ncdevmem:do_client shows an example of
+setting up the TX path of this API.
+
+
+NIC Setup
+---------
+
+The user must bind a TX dmabuf to a given NIC using the netlink API::
+
+        struct netdev_bind_tx_req *req = NULL;
+        struct netdev_bind_tx_rsp *rsp = NULL;
+        struct ynl_error yerr;
+
+        *ys = ynl_sock_create(&ynl_netdev_family, &yerr);
+
+        req = netdev_bind_tx_req_alloc();
+        netdev_bind_tx_req_set_ifindex(req, ifindex);
+        netdev_bind_tx_req_set_fd(req, dmabuf_fd);
+
+        rsp = netdev_bind_tx(*ys, req);
+
+        tx_dmabuf_id = rsp->id;
+
+
+The netlink API returns a dmabuf_id: a unique ID that refers to the dmabuf
+that has been bound.
+
+The user can unbind the dmabuf from the netdevice by closing the netlink socket
+that established the binding. We do this so that the binding is automatically
+unbound even if the userspace process crashes.
+
+Note that any reasonably well-behaved dmabuf from any exporter should work with
+devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
+this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.
+
+Socket Setup
+------------
+
+The user application must use the MSG_ZEROCOPY flag when sending devmem TCP.
+Devmem cannot be copied by the kernel, so the semantics of devmem TX are
+similar to the semantics of MSG_ZEROCOPY::
+
+	ret = setsockopt(socket_fd, SOL_SOCKET, SO_ZEROCOPY, &opt, sizeof(opt));
+
+Sending data
+------------
+
+Devmem data is sent using the SCM_DEVMEM_DMABUF cmsg.
+
+The user should create a msghdr where:
+
+- iov_base is set to the offset into the dmabuf to start sending from.
+- iov_len is set to the number of bytes to be sent from the dmabuf.
+
+The user passes the dma-buf id to send from via the dmabuf_tx_cmsg.dmabuf_id.
+
+The example below sends 1024 bytes from offset 100 into the dmabuf, and 2048
+bytes from offset 2000 into the dmabuf. The dmabuf to send from is tx_dmabuf_id::
+
+       char ctrl_data[CMSG_SPACE(sizeof(struct dmabuf_tx_cmsg))];
+       struct dmabuf_tx_cmsg ddmabuf;
+       struct msghdr msg = {};
+       struct cmsghdr *cmsg;
+       struct iovec iov[2];
+
+       iov[0].iov_base = (void *)100;
+       iov[0].iov_len = 1024;
+       iov[1].iov_base = (void *)2000;
+       iov[1].iov_len = 2048;
+
+       msg.msg_iov = iov;
+       msg.msg_iovlen = 2;
+
+       msg.msg_control = ctrl_data;
+       msg.msg_controllen = sizeof(ctrl_data);
+
+       cmsg = CMSG_FIRSTHDR(&msg);
+       cmsg->cmsg_level = SOL_SOCKET;
+       cmsg->cmsg_type = SCM_DEVMEM_DMABUF;
+       cmsg->cmsg_len = CMSG_LEN(sizeof(struct dmabuf_tx_cmsg));
+
+       ddmabuf.dmabuf_id = tx_dmabuf_id;
+
+       *((struct dmabuf_tx_cmsg *)CMSG_DATA(cmsg)) = ddmabuf;
+
+       sendmsg(socket_fd, &msg, MSG_ZEROCOPY);
+
+
+Reusing TX dmabufs
+------------------
+
+Similar to MSG_ZEROCOPY with regular memory, the user should not modify the
+contents of the dma-buf while a send operation is in progress. This is because
+the kernel does not keep a copy of the dmabuf contents. Instead, the kernel
+pins and sends data directly from the buffer made available by userspace.
+
+Just as in MSG_ZEROCOPY, the kernel notifies userspace of send completions
+using MSG_ERRQUEUE::
+
+        int64_t tstop = gettimeofday_ms() + waittime_ms;
+        char control[CMSG_SPACE(100)] = {};
+        struct sock_extended_err *serr;
+        struct msghdr msg = {};
+        struct cmsghdr *cm;
+        int retries = 10;
+        __u32 hi, lo;
+
+        msg.msg_control = control;
+        msg.msg_controllen = sizeof(control);
+
+        while (gettimeofday_ms() < tstop) {
+                if (!do_poll(fd)) continue;
+
+                ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
+
+                for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
+                        serr = (void *)CMSG_DATA(cm);
+
+                        hi = serr->ee_data;
+                        lo = serr->ee_info;
+
+                        fprintf(stdout, "tx complete [%d,%d]\n", lo, hi);
+                }
+        }
+
+After the associated sendmsg has completed, the dmabuf can be reused by
+userspace.
+
+
 Implementation & Caveats
 ========================
 
-- 
2.48.1.362.g079036d154-goog



* [PATCH net-next v3 2/6] selftests: ncdevmem: Implement devmem TCP TX
  2025-02-03 22:39 [PATCH net-next v3 0/6] Device memory TCP TX Mina Almasry
  2025-02-03 22:39 ` [PATCH net-next v3 1/6] net: add devmem TCP TX documentation Mina Almasry
@ 2025-02-03 22:39 ` Mina Almasry
  2025-02-04 12:29   ` Paolo Abeni
  2025-02-03 22:39 ` [PATCH net-next v3 3/6] net: add get_netmem/put_netmem support Mina Almasry
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 39+ messages in thread
From: Mina Almasry @ 2025-02-03 22:39 UTC (permalink / raw)
  To: netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest
  Cc: Mina Almasry, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Jonathan Corbet,
	Andrew Lunn, Neal Cardwell, David Ahern, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja

Add support for devmem TX in ncdevmem.

This is a combination of the ncdevmem from the devmem TCP series RFCv1,
which included the TX path, and work by Stan to include the netlink API,
refactored on top of his generic memory_provider support.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>

---

v3:
- Update ncdevmem docs to run validation with RX-only and RX-with-TX.
- Fix build warnings (Stan).
- Make the validation expect new lines in the pattern so we can have the
  TX path behave like netcat (Stan).
- Change ret to errno in error() calls (Stan).
- Handle the case where client_ip is not provided (Stan).
- Don't assume mid is <= 2000 (Stan).

v2:
- make errors a static variable so that we catch instances where there
  are less than 20 errors across different buffers.
- Fix the issue where the seed is reset to 0 instead of its starting
  value 1.
- Use 1000ULL instead of 1000 to guard against overflow (Willem).
- Do not set POLLERR (Willem).
- Update the test to use the new interface where iov_base is the
  dmabuf_offset.
- Update the test to send 2 iov instead of 1, so we get some test
  coverage over sending multiple iovs at once.
- Print the ifindex the test is using, useful for debugging issues where
  maybe the test may fail because the ifindex of the socket is different
  from the dmabuf binding.
---
 .../selftests/drivers/net/hw/ncdevmem.c       | 300 +++++++++++++++++-
 1 file changed, 289 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/hw/ncdevmem.c b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
index 19a6969643f4..a5ac78ed007e 100644
--- a/tools/testing/selftests/drivers/net/hw/ncdevmem.c
+++ b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
@@ -9,22 +9,31 @@
  *     ncdevmem -s <server IP> [-c <client IP>] -f eth1 -l -p 5201
  *
  *     On client:
- *     echo -n "hello\nworld" | nc -s <server IP> 5201 -p 5201
+ *     echo -n "hello\nworld" | \
+ *		ncdevmem -s <server IP> [-c <client IP>] -p 5201 -f eth1
  *
- * Test data validation:
+ * Note this is compatible with regular netcat. i.e. the sender or receiver can
+ * be replaced with regular netcat to test the RX or TX path in isolation.
+ *
+ * Test data validation (devmem TCP on RX only):
  *
  *     On server:
  *     ncdevmem -s <server IP> [-c <client IP>] -f eth1 -l -p 5201 -v 7
  *
  *     On client:
  *     yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | \
- *             tr \\n \\0 | \
- *             head -c 5G | \
+ *             head -c 1G | \
  *             nc <server IP> 5201 -p 5201
  *
+ * Test data validation (devmem TCP on RX and TX, validation happens on RX):
  *
- * Note this is compatible with regular netcat. i.e. the sender or receiver can
- * be replaced with regular netcat to test the RX or TX path in isolation.
+ *	On server:
+ *	ncdevmem -s <server IP> [-c <client IP>] -l -p 5201 -v 8 -f eth1
+ *
+ *	On client:
+ *	yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06\\x07) | \
+ *		head -c 1M | \
+ *		ncdevmem -s <server IP> [-c <client IP>] -p 5201 -f eth1
  */
 #define _GNU_SOURCE
 #define __EXPORTED_HEADERS__
@@ -40,15 +49,18 @@
 #include <fcntl.h>
 #include <malloc.h>
 #include <error.h>
+#include <poll.h>
 
 #include <arpa/inet.h>
 #include <sys/socket.h>
 #include <sys/mman.h>
 #include <sys/ioctl.h>
 #include <sys/syscall.h>
+#include <sys/time.h>
 
 #include <linux/memfd.h>
 #include <linux/dma-buf.h>
+#include <linux/errqueue.h>
 #include <linux/udmabuf.h>
 #include <libmnl/libmnl.h>
 #include <linux/types.h>
@@ -80,6 +92,8 @@ static int num_queues = -1;
 static char *ifname;
 static unsigned int ifindex;
 static unsigned int dmabuf_id;
+static uint32_t tx_dmabuf_id;
+static int waittime_ms = 500;
 
 struct memory_buffer {
 	int fd;
@@ -93,6 +107,8 @@ struct memory_buffer {
 struct memory_provider {
 	struct memory_buffer *(*alloc)(size_t size);
 	void (*free)(struct memory_buffer *ctx);
+	void (*memcpy_to_device)(struct memory_buffer *dst, size_t off,
+				 void *src, int n);
 	void (*memcpy_from_device)(void *dst, struct memory_buffer *src,
 				   size_t off, int n);
 };
@@ -153,6 +169,20 @@ static void udmabuf_free(struct memory_buffer *ctx)
 	free(ctx);
 }
 
+static void udmabuf_memcpy_to_device(struct memory_buffer *dst, size_t off,
+				     void *src, int n)
+{
+	struct dma_buf_sync sync = {};
+
+	sync.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE;
+	ioctl(dst->fd, DMA_BUF_IOCTL_SYNC, &sync);
+
+	memcpy(dst->buf_mem + off, src, n);
+
+	sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
+	ioctl(dst->fd, DMA_BUF_IOCTL_SYNC, &sync);
+}
+
 static void udmabuf_memcpy_from_device(void *dst, struct memory_buffer *src,
 				       size_t off, int n)
 {
@@ -170,6 +200,7 @@ static void udmabuf_memcpy_from_device(void *dst, struct memory_buffer *src,
 static struct memory_provider udmabuf_memory_provider = {
 	.alloc = udmabuf_alloc,
 	.free = udmabuf_free,
+	.memcpy_to_device = udmabuf_memcpy_to_device,
 	.memcpy_from_device = udmabuf_memcpy_from_device,
 };
 
@@ -188,14 +219,16 @@ void validate_buffer(void *line, size_t size)
 {
 	static unsigned char seed = 1;
 	unsigned char *ptr = line;
-	int errors = 0;
+	unsigned char expected;
+	static int errors;
 	size_t i;
 
 	for (i = 0; i < size; i++) {
-		if (ptr[i] != seed) {
+		expected = seed ? seed : '\n';
+		if (ptr[i] != expected) {
 			fprintf(stderr,
 				"Failed validation: expected=%u, actual=%u, index=%lu\n",
-				seed, ptr[i], i);
+				expected, ptr[i], i);
 			errors++;
 			if (errors > 20)
 				error(1, 0, "validation failed.");
@@ -394,6 +427,49 @@ static int bind_rx_queue(unsigned int ifindex, unsigned int dmabuf_fd,
 	return -1;
 }
 
+static int bind_tx_queue(unsigned int ifindex, unsigned int dmabuf_fd,
+			 struct ynl_sock **ys)
+{
+	struct netdev_bind_tx_req *req = NULL;
+	struct netdev_bind_tx_rsp *rsp = NULL;
+	struct ynl_error yerr;
+
+	*ys = ynl_sock_create(&ynl_netdev_family, &yerr);
+	if (!*ys) {
+		fprintf(stderr, "YNL: %s\n", yerr.msg);
+		return -1;
+	}
+
+	req = netdev_bind_tx_req_alloc();
+	netdev_bind_tx_req_set_ifindex(req, ifindex);
+	netdev_bind_tx_req_set_fd(req, dmabuf_fd);
+
+	rsp = netdev_bind_tx(*ys, req);
+	if (!rsp) {
+		perror("netdev_bind_tx");
+		goto err_close;
+	}
+
+	if (!rsp->_present.id) {
+		perror("id not present");
+		goto err_close;
+	}
+
+	fprintf(stderr, "got tx dmabuf id=%d\n", rsp->id);
+	tx_dmabuf_id = rsp->id;
+
+	netdev_bind_tx_req_free(req);
+	netdev_bind_tx_rsp_free(rsp);
+
+	return 0;
+
+err_close:
+	fprintf(stderr, "YNL failed: %s\n", (*ys)->err.msg);
+	netdev_bind_tx_req_free(req);
+	ynl_sock_destroy(*ys);
+	return -1;
+}
+
 static void enable_reuseaddr(int fd)
 {
 	int opt = 1;
@@ -432,7 +508,7 @@ static int parse_address(const char *str, int port, struct sockaddr_in6 *sin6)
 	return 0;
 }
 
-int do_server(struct memory_buffer *mem)
+static int do_server(struct memory_buffer *mem)
 {
 	char ctrl_data[sizeof(int) * 20000];
 	struct netdev_queue_id *queues;
@@ -686,6 +762,206 @@ void run_devmem_tests(void)
 	provider->free(mem);
 }
 
+static uint64_t gettimeofday_ms(void)
+{
+	struct timeval tv;
+
+	gettimeofday(&tv, NULL);
+	return (tv.tv_sec * 1000ULL) + (tv.tv_usec / 1000ULL);
+}
+
+static int do_poll(int fd)
+{
+	struct pollfd pfd;
+	int ret;
+
+	pfd.revents = 0;
+	pfd.fd = fd;
+
+	ret = poll(&pfd, 1, waittime_ms);
+	if (ret == -1)
+		error(1, errno, "poll");
+
+	return ret && (pfd.revents & POLLERR);
+}
+
+static void wait_compl(int fd)
+{
+	int64_t tstop = gettimeofday_ms() + waittime_ms;
+	char control[CMSG_SPACE(100)] = {};
+	struct sock_extended_err *serr;
+	struct msghdr msg = {};
+	struct cmsghdr *cm;
+	__u32 hi, lo;
+	int ret;
+
+	msg.msg_control = control;
+	msg.msg_controllen = sizeof(control);
+
+	while (gettimeofday_ms() < tstop) {
+		if (!do_poll(fd))
+			continue;
+
+		ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
+		if (ret < 0) {
+			if (errno == EAGAIN)
+				continue;
+			error(1, errno, "recvmsg(MSG_ERRQUEUE)");
+			return;
+		}
+		if (msg.msg_flags & MSG_CTRUNC)
+			error(1, 0, "MSG_CTRUNC\n");
+
+		for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
+			if (cm->cmsg_level != SOL_IP &&
+			    cm->cmsg_level != SOL_IPV6)
+				continue;
+			if (cm->cmsg_level == SOL_IP &&
+			    cm->cmsg_type != IP_RECVERR)
+				continue;
+			if (cm->cmsg_level == SOL_IPV6 &&
+			    cm->cmsg_type != IPV6_RECVERR)
+				continue;
+
+			serr = (void *)CMSG_DATA(cm);
+			if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
+				error(1, 0, "wrong origin %u", serr->ee_origin);
+			if (serr->ee_errno != 0)
+				error(1, 0, "wrong errno %d", serr->ee_errno);
+
+			hi = serr->ee_data;
+			lo = serr->ee_info;
+
+			fprintf(stderr, "tx complete [%d,%d]\n", lo, hi);
+			return;
+		}
+	}
+
+	error(1, 0, "did not receive tx completion");
+}
+
+static int do_client(struct memory_buffer *mem)
+{
+	char ctrl_data[CMSG_SPACE(sizeof(struct dmabuf_tx_cmsg))];
+	struct sockaddr_in6 server_sin;
+	struct sockaddr_in6 client_sin;
+	struct dmabuf_tx_cmsg ddmabuf;
+	struct ynl_sock *ys = NULL;
+	struct msghdr msg = {};
+	ssize_t line_size = 0;
+	struct cmsghdr *cmsg;
+	struct iovec iov[2];
+	char *line = NULL;
+	unsigned long mid;
+	size_t len = 0;
+	int socket_fd;
+	int ret;
+	int opt = 1;
+
+	ret = parse_address(server_ip, atoi(port), &server_sin);
+	if (ret < 0)
+		error(1, 0, "parse server address");
+
+	socket_fd = socket(AF_INET6, SOCK_STREAM, 0);
+	if (socket_fd < 0)
+		error(1, socket_fd, "create socket");
+
+	enable_reuseaddr(socket_fd);
+
+	ret = setsockopt(socket_fd, SOL_SOCKET, SO_BINDTODEVICE, ifname,
+			 strlen(ifname) + 1);
+	if (ret)
+		error(1, errno, "bindtodevice");
+
+	if (bind_tx_queue(ifindex, mem->fd, &ys))
+		error(1, 0, "Failed to bind\n");
+
+	if (client_ip) {
+		ret = parse_address(client_ip, atoi(port), &client_sin);
+		if (ret < 0)
+			error(1, 0, "parse client address");
+
+		ret = bind(socket_fd, &client_sin, sizeof(client_sin));
+		if (ret)
+			error(1, errno, "bind");
+	}
+
+	ret = setsockopt(socket_fd, SOL_SOCKET, SO_ZEROCOPY, &opt, sizeof(opt));
+	if (ret)
+		error(1, errno, "set sock opt");
+
+	fprintf(stderr, "Connect to %s %d (via %s)\n", server_ip,
+		ntohs(server_sin.sin6_port), ifname);
+
+	ret = connect(socket_fd, &server_sin, sizeof(server_sin));
+	if (ret)
+		error(1, errno, "connect");
+
+	while (1) {
+		free(line);
+		line = NULL;
+		line_size = getline(&line, &len, stdin);
+
+		if (line_size < 0)
+			break;
+
+		mid = (line_size / 2) + 1;
+
+		iov[0].iov_base = (void *)1;
+		iov[0].iov_len = mid;
+		iov[1].iov_base = (void *)(mid + 2);
+		iov[1].iov_len = line_size - mid;
+
+		provider->memcpy_to_device(mem, (size_t)iov[0].iov_base, line,
+					   iov[0].iov_len);
+		provider->memcpy_to_device(mem, (size_t)iov[1].iov_base,
+					   line + iov[0].iov_len,
+					   iov[1].iov_len);
+
+		fprintf(stderr,
+			"read line_size=%ld iov[0].iov_base=%lu, iov[0].iov_len=%lu, iov[1].iov_base=%lu, iov[1].iov_len=%lu\n",
+			line_size, (unsigned long)iov[0].iov_base,
+			iov[0].iov_len, (unsigned long)iov[1].iov_base,
+			iov[1].iov_len);
+
+		msg.msg_iov = iov;
+		msg.msg_iovlen = 2;
+
+		msg.msg_control = ctrl_data;
+		msg.msg_controllen = sizeof(ctrl_data);
+
+		cmsg = CMSG_FIRSTHDR(&msg);
+		cmsg->cmsg_level = SOL_SOCKET;
+		cmsg->cmsg_type = SCM_DEVMEM_DMABUF;
+		cmsg->cmsg_len = CMSG_LEN(sizeof(struct dmabuf_tx_cmsg));
+
+		ddmabuf.dmabuf_id = tx_dmabuf_id;
+
+		*((struct dmabuf_tx_cmsg *)CMSG_DATA(cmsg)) = ddmabuf;
+
+		ret = sendmsg(socket_fd, &msg, MSG_ZEROCOPY);
+		if (ret < 0)
+			error(1, errno, "Failed sendmsg");
+
+		fprintf(stderr, "sendmsg_ret=%d\n", ret);
+
+		if (ret != line_size)
+			error(1, errno, "Did not send all bytes");
+
+		wait_compl(socket_fd);
+	}
+
+	fprintf(stderr, "%s: tx ok\n", TEST_PREFIX);
+
+	free(line);
+	close(socket_fd);
+
+	if (ys)
+		ynl_sock_destroy(ys);
+
+	return 0;
+}
+
 int main(int argc, char *argv[])
 {
 	struct memory_buffer *mem;
@@ -729,6 +1005,8 @@ int main(int argc, char *argv[])
 
 	ifindex = if_nametoindex(ifname);
 
+	fprintf(stderr, "using ifindex=%u\n", ifindex);
+
 	if (!server_ip && !client_ip) {
 		if (start_queue < 0 && num_queues < 0) {
 			num_queues = rxq_num(ifindex);
@@ -779,7 +1057,7 @@ int main(int argc, char *argv[])
 		error(1, 0, "Missing -p argument\n");
 
 	mem = provider->alloc(getpagesize() * NUM_PAGES);
-	ret = is_server ? do_server(mem) : 1;
+	ret = is_server ? do_server(mem) : do_client(mem);
 	provider->free(mem);
 
 	return ret;
-- 
2.48.1.362.g079036d154-goog



* [PATCH net-next v3 3/6] net: add get_netmem/put_netmem support
  2025-02-03 22:39 [PATCH net-next v3 0/6] Device memory TCP TX Mina Almasry
  2025-02-03 22:39 ` [PATCH net-next v3 1/6] net: add devmem TCP TX documentation Mina Almasry
  2025-02-03 22:39 ` [PATCH net-next v3 2/6] selftests: ncdevmem: Implement devmem TCP TX Mina Almasry
@ 2025-02-03 22:39 ` Mina Almasry
  2025-02-03 22:39 ` [PATCH net-next v3 4/6] net: devmem: TCP tx netlink api Mina Almasry
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 39+ messages in thread
From: Mina Almasry @ 2025-02-03 22:39 UTC (permalink / raw)
  To: netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest
  Cc: Mina Almasry, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Jonathan Corbet,
	Andrew Lunn, Neal Cardwell, David Ahern, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja

Currently net_iovs support only pp ref counts, and do not support a
page ref equivalent.

This is fine for the RX path, as net_iovs are used exclusively with the
pp and only pp refcounting is needed there. The TX path, however, does
not use pp ref counts, so support for a get_page/put_page equivalent is
needed for netmem.

Support get_netmem/put_netmem. Check the type of the netmem before
passing it to page or net_iov specific code to obtain a page ref
equivalent.

For dmabuf net_iovs, we obtain a ref on the underlying binding. This
ensures the entire binding doesn't disappear until all the net_iovs have
been put_netmem'ed. We do not need to track the refcount of individual
dmabuf net_iovs as we don't allocate/free them from a pool similar to
what the buddy allocator does for pages.

This code is written to be extensible by other net_iov implementers.
get_netmem/put_netmem will check the type of the netmem and route it to
the correct helper:

pages -> [get|put]_page()
dmabuf net_iovs -> net_devmem_[get|put]_net_iov()
new net_iovs ->	new helpers

Signed-off-by: Mina Almasry <almasrymina@google.com>


---

v2:
- Add comment on top of refcount_t ref explaining the usage in the TX
  path.
- Fix missing definition of net_devmem_dmabuf_binding_put in this patch.
---
 include/linux/skbuff_ref.h |  4 ++--
 include/net/netmem.h       |  3 +++
 net/core/devmem.c          | 10 ++++++++++
 net/core/devmem.h          | 20 ++++++++++++++++++++
 net/core/skbuff.c          | 30 ++++++++++++++++++++++++++++++
 5 files changed, 65 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff_ref.h b/include/linux/skbuff_ref.h
index 0f3c58007488..9e49372ef1a0 100644
--- a/include/linux/skbuff_ref.h
+++ b/include/linux/skbuff_ref.h
@@ -17,7 +17,7 @@
  */
 static inline void __skb_frag_ref(skb_frag_t *frag)
 {
-	get_page(skb_frag_page(frag));
+	get_netmem(skb_frag_netmem(frag));
 }
 
 /**
@@ -40,7 +40,7 @@ static inline void skb_page_unref(netmem_ref netmem, bool recycle)
 	if (recycle && napi_pp_put_page(netmem))
 		return;
 #endif
-	put_page(netmem_to_page(netmem));
+	put_netmem(netmem);
 }
 
 /**
diff --git a/include/net/netmem.h b/include/net/netmem.h
index 1b58faa4f20f..d30f31878a09 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -245,4 +245,7 @@ static inline unsigned long netmem_get_dma_addr(netmem_ref netmem)
 	return __netmem_clear_lsb(netmem)->dma_addr;
 }
 
+void get_netmem(netmem_ref netmem);
+void put_netmem(netmem_ref netmem);
+
 #endif /* _NET_NETMEM_H */
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 3bba3f018df0..20985a570662 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -333,6 +333,16 @@ void dev_dmabuf_uninstall(struct net_device *dev)
 	}
 }
 
+void net_devmem_get_net_iov(struct net_iov *niov)
+{
+	net_devmem_dmabuf_binding_get(niov->owner->binding);
+}
+
+void net_devmem_put_net_iov(struct net_iov *niov)
+{
+	net_devmem_dmabuf_binding_put(niov->owner->binding);
+}
+
 /*** "Dmabuf devmem memory provider" ***/
 
 int mp_dmabuf_devmem_init(struct page_pool *pool)
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 76099ef9c482..8b51caff5a0e 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -27,6 +27,10 @@ struct net_devmem_dmabuf_binding {
 	 * The binding undos itself and unmaps the underlying dmabuf once all
 	 * those refs are dropped and the binding is no longer desired or in
 	 * use.
+	 *
+	 * net_devmem_get_net_iov() on dmabuf net_iovs will increment this
+	 * reference, making sure that the binding remains alive until all the
+	 * net_iovs are no longer used.
 	 */
 	refcount_t ref;
 
@@ -119,6 +123,9 @@ net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding)
 	__net_devmem_dmabuf_binding_free(binding);
 }
 
+void net_devmem_get_net_iov(struct net_iov *niov);
+void net_devmem_put_net_iov(struct net_iov *niov);
+
 struct net_iov *
 net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding);
 void net_devmem_free_dmabuf(struct net_iov *ppiov);
@@ -126,6 +133,19 @@ void net_devmem_free_dmabuf(struct net_iov *ppiov);
 #else
 struct net_devmem_dmabuf_binding;
 
+static inline void
+net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding)
+{
+}
+
+static inline void net_devmem_get_net_iov(struct net_iov *niov)
+{
+}
+
+static inline void net_devmem_put_net_iov(struct net_iov *niov)
+{
+}
+
 static inline void
 __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding)
 {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a441613a1e6c..815245d5c36b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -88,6 +88,7 @@
 #include <linux/textsearch.h>
 
 #include "dev.h"
+#include "devmem.h"
 #include "netmem_priv.h"
 #include "sock_destructor.h"
 
@@ -7290,3 +7291,32 @@ bool csum_and_copy_from_iter_full(void *addr, size_t bytes,
 	return false;
 }
 EXPORT_SYMBOL(csum_and_copy_from_iter_full);
+
+void get_netmem(netmem_ref netmem)
+{
+	if (netmem_is_net_iov(netmem)) {
+		/* Assume any net_iov is devmem and route it to
+		 * net_devmem_get_net_iov. As new net_iov types are added they
+		 * need to be checked here.
+		 */
+		net_devmem_get_net_iov(netmem_to_net_iov(netmem));
+		return;
+	}
+	get_page(netmem_to_page(netmem));
+}
+EXPORT_SYMBOL(get_netmem);
+
+void put_netmem(netmem_ref netmem)
+{
+	if (netmem_is_net_iov(netmem)) {
+		/* Assume any net_iov is devmem and route it to
+		 * net_devmem_put_net_iov. As new net_iov types are added they
+		 * need to be checked here.
+		 */
+		net_devmem_put_net_iov(netmem_to_net_iov(netmem));
+		return;
+	}
+
+	put_page(netmem_to_page(netmem));
+}
+EXPORT_SYMBOL(put_netmem);
-- 
2.48.1.362.g079036d154-goog



* [PATCH net-next v3 4/6] net: devmem: TCP tx netlink api
  2025-02-03 22:39 [PATCH net-next v3 0/6] Device memory TCP TX Mina Almasry
                   ` (2 preceding siblings ...)
  2025-02-03 22:39 ` [PATCH net-next v3 3/6] net: add get_netmem/put_netmem support Mina Almasry
@ 2025-02-03 22:39 ` Mina Almasry
  2025-02-03 22:39 ` [PATCH net-next v3 5/6] net: devmem: Implement TX path Mina Almasry
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 39+ messages in thread
From: Mina Almasry @ 2025-02-03 22:39 UTC (permalink / raw)
  To: netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest
  Cc: Mina Almasry, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Jonathan Corbet,
	Andrew Lunn, Neal Cardwell, David Ahern, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja

From: Stanislav Fomichev <sdf@fomichev.me>

Add a bind-tx netlink call to attach a dmabuf for TX; a queue is not
required, only the ifindex and the dmabuf fd are needed for attachment.
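
For illustration, a minimal sketch of a binding's lifetime with the
generated YNL C API (mirroring the documentation and selftest elsewhere
in this series; error handling omitted):

	struct netdev_bind_tx_req *req;
	struct netdev_bind_tx_rsp *rsp;
	struct ynl_error yerr;
	struct ynl_sock *ys;

	ys = ynl_sock_create(&ynl_netdev_family, &yerr);

	req = netdev_bind_tx_req_alloc();
	netdev_bind_tx_req_set_ifindex(req, ifindex);
	netdev_bind_tx_req_set_fd(req, dmabuf_fd);

	rsp = netdev_bind_tx(ys, req);	/* rsp->id identifies the binding */
	netdev_bind_tx_req_free(req);
	netdev_bind_tx_rsp_free(rsp);

	/* ... send using the returned dmabuf id ... */

	ynl_sock_destroy(ys);	/* closing the socket unbinds the dmabuf */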

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Mina Almasry <almasrymina@google.com>


---

v3:
- Fix ynl-regen.sh error (Simon).
---
 Documentation/netlink/specs/netdev.yaml | 12 ++++++++++++
 include/uapi/linux/netdev.h             |  1 +
 net/core/netdev-genl-gen.c              | 13 +++++++++++++
 net/core/netdev-genl-gen.h              |  1 +
 net/core/netdev-genl.c                  |  6 ++++++
 tools/include/uapi/linux/netdev.h       |  1 +
 6 files changed, 34 insertions(+)

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index cbb544bd6c84..93f4333e7bc6 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -711,6 +711,18 @@ operations:
             - defer-hard-irqs
             - gro-flush-timeout
             - irq-suspend-timeout
+    -
+      name: bind-tx
+      doc: Bind dmabuf to netdev for TX
+      attribute-set: dmabuf
+      do:
+        request:
+          attributes:
+            - ifindex
+            - fd
+        reply:
+          attributes:
+            - id
 
 kernel-family:
   headers: [ "linux/list.h"]
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index e4be227d3ad6..04364ef5edbe 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -203,6 +203,7 @@ enum {
 	NETDEV_CMD_QSTATS_GET,
 	NETDEV_CMD_BIND_RX,
 	NETDEV_CMD_NAPI_SET,
+	NETDEV_CMD_BIND_TX,
 
 	__NETDEV_CMD_MAX,
 	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index 996ac6a449eb..f27608d6301c 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -99,6 +99,12 @@ static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_IRQ_SUSPE
 	[NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT] = { .type = NLA_UINT, },
 };
 
+/* NETDEV_CMD_BIND_TX - do */
+static const struct nla_policy netdev_bind_tx_nl_policy[NETDEV_A_DMABUF_FD + 1] = {
+	[NETDEV_A_DMABUF_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1),
+	[NETDEV_A_DMABUF_FD] = { .type = NLA_U32, },
+};
+
 /* Ops table for netdev */
 static const struct genl_split_ops netdev_nl_ops[] = {
 	{
@@ -190,6 +196,13 @@ static const struct genl_split_ops netdev_nl_ops[] = {
 		.maxattr	= NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT,
 		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
 	},
+	{
+		.cmd		= NETDEV_CMD_BIND_TX,
+		.doit		= netdev_nl_bind_tx_doit,
+		.policy		= netdev_bind_tx_nl_policy,
+		.maxattr	= NETDEV_A_DMABUF_FD,
+		.flags		= GENL_CMD_CAP_DO,
+	},
 };
 
 static const struct genl_multicast_group netdev_nl_mcgrps[] = {
diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h
index e09dd7539ff2..c1fed66e92b9 100644
--- a/net/core/netdev-genl-gen.h
+++ b/net/core/netdev-genl-gen.h
@@ -34,6 +34,7 @@ int netdev_nl_qstats_get_dumpit(struct sk_buff *skb,
 				struct netlink_callback *cb);
 int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info);
 int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info);
+int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info);
 
 enum {
 	NETDEV_NLGRP_MGMT,
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 715f85c6b62e..0e41699df419 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -911,6 +911,12 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
 	return err;
 }
 
+/* stub */
+int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
+{
+	return 0;
+}
+
 void netdev_nl_sock_priv_init(struct list_head *priv)
 {
 	INIT_LIST_HEAD(priv);
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index e4be227d3ad6..04364ef5edbe 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -203,6 +203,7 @@ enum {
 	NETDEV_CMD_QSTATS_GET,
 	NETDEV_CMD_BIND_RX,
 	NETDEV_CMD_NAPI_SET,
+	NETDEV_CMD_BIND_TX,
 
 	__NETDEV_CMD_MAX,
 	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
-- 
2.48.1.362.g079036d154-goog



* [PATCH net-next v3 5/6] net: devmem: Implement TX path
  2025-02-03 22:39 [PATCH net-next v3 0/6] Device memory TCP TX Mina Almasry
                   ` (3 preceding siblings ...)
  2025-02-03 22:39 ` [PATCH net-next v3 4/6] net: devmem: TCP tx netlink api Mina Almasry
@ 2025-02-03 22:39 ` Mina Almasry
  2025-02-04 12:15   ` Paolo Abeni
                     ` (2 more replies)
  2025-02-03 22:39 ` [PATCH net-next v3 6/6] net: devmem: make dmabuf unbinding scheduled work Mina Almasry
                   ` (2 subsequent siblings)
  7 siblings, 3 replies; 39+ messages in thread
From: Mina Almasry @ 2025-02-03 22:39 UTC (permalink / raw)
  To: netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest
  Cc: Mina Almasry, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Jonathan Corbet,
	Andrew Lunn, Neal Cardwell, David Ahern, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja, Kaiyuan Zhang

Augment the dmabuf binding to be able to handle TX. In addition to all
the RX binding state, we also create the tx_vec needed for the TX path.

Provide API for sendmsg to be able to send dmabufs bound to this device:

- Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from.
- MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf.

Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY
implementation, while disabling instances where MSG_ZEROCOPY falls back
to copying.

We additionally pipe the binding down to the new
zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems
instead of the traditional page netmems.

We also special case skb_frag_dma_map to return the dma-address of these
dmabuf net_iovs instead of attempting to map pages.

Based on work by Stanislav Fomichev <sdf@fomichev.me>. A lot of the meat
of the implementation came from devmem TCP RFC v1 [1], which included the
TX path, but Stan did all the rebasing on top of netmem/net_iov.

Cc: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>


---

v3:
- Use kvmalloc_array instead of kcalloc (Stan).
- Fix unreachable code warning (Simon).

v2:
- Remove dmabuf_offset from the dmabuf cmsg.
- Update zerocopy_fill_skb_from_devmem to interpret the
  iov_base/iter_iov_addr as the offset into the dmabuf to send from
  (Stan).
- Remove the confusing binding->tx_iter which is not needed if we
  interpret the iov_base/iter_iov_addr as offset into the dmabuf (Stan).
- Remove check for binding->sgt and binding->sgt->nents in dmabuf
  binding.
- Simplify the calculation of binding->tx_vec.
- Check in net_devmem_get_binding that the binding we're returning
  has ifindex matching the sending socket (Willem).
---
 include/linux/skbuff.h                  | 15 +++-
 include/net/sock.h                      |  1 +
 include/uapi/linux/uio.h                |  6 +-
 net/core/datagram.c                     | 41 ++++++++++-
 net/core/devmem.c                       | 97 +++++++++++++++++++++++--
 net/core/devmem.h                       | 42 ++++++++++-
 net/core/netdev-genl.c                  | 64 +++++++++++++++-
 net/core/skbuff.c                       |  6 +-
 net/core/sock.c                         |  8 ++
 net/ipv4/tcp.c                          | 36 ++++++---
 net/vmw_vsock/virtio_transport_common.c |  3 +-
 11 files changed, 285 insertions(+), 34 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bb2b751d274a..3ff8f568c382 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1711,9 +1711,12 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
 
 void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref);
 
+struct net_devmem_dmabuf_binding;
+
 int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
 			    struct sk_buff *skb, struct iov_iter *from,
-			    size_t length);
+			    size_t length,
+			    struct net_devmem_dmabuf_binding *binding);
 
 int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
 				struct iov_iter *from, size_t length);
@@ -1721,12 +1724,14 @@ int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
 static inline int skb_zerocopy_iter_dgram(struct sk_buff *skb,
 					  struct msghdr *msg, int len)
 {
-	return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len);
+	return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len,
+				       NULL);
 }
 
 int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 			     struct msghdr *msg, int len,
-			     struct ubuf_info *uarg);
+			     struct ubuf_info *uarg,
+			     struct net_devmem_dmabuf_binding *binding);
 
 /* Internal */
 #define skb_shinfo(SKB)	((struct skb_shared_info *)(skb_end_pointer(SKB)))
@@ -3697,6 +3702,10 @@ static inline dma_addr_t __skb_frag_dma_map(struct device *dev,
 					    size_t offset, size_t size,
 					    enum dma_data_direction dir)
 {
+	if (skb_frag_is_net_iov(frag)) {
+		return netmem_to_net_iov(frag->netmem)->dma_addr + offset +
+		       frag->offset;
+	}
 	return dma_map_page(dev, skb_frag_page(frag),
 			    skb_frag_off(frag) + offset, size, dir);
 }
diff --git a/include/net/sock.h b/include/net/sock.h
index 8036b3b79cd8..09eb918525b6 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1822,6 +1822,7 @@ struct sockcm_cookie {
 	u32 tsflags;
 	u32 ts_opt_id;
 	u32 priority;
+	u32 dmabuf_id;
 };
 
 static inline void sockcm_init(struct sockcm_cookie *sockc,
diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h
index 649739e0c404..866bd5dfe39f 100644
--- a/include/uapi/linux/uio.h
+++ b/include/uapi/linux/uio.h
@@ -38,10 +38,14 @@ struct dmabuf_token {
 	__u32 token_count;
 };
 
+struct dmabuf_tx_cmsg {
+	__u32 dmabuf_id;
+};
+
 /*
  *	UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1)
  */
- 
+
 #define UIO_FASTIOV	8
 #define UIO_MAXIOV	1024
 
diff --git a/net/core/datagram.c b/net/core/datagram.c
index f0693707aece..c989606ff58d 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -63,6 +63,8 @@
 #include <net/busy_poll.h>
 #include <crypto/hash.h>
 
+#include "devmem.h"
+
 /*
  *	Is a socket 'connection oriented' ?
  */
@@ -692,9 +694,42 @@ int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
 	return 0;
 }
 
+static int
+zerocopy_fill_skb_from_devmem(struct sk_buff *skb, struct iov_iter *from,
+			      int length,
+			      struct net_devmem_dmabuf_binding *binding)
+{
+	int i = skb_shinfo(skb)->nr_frags;
+	size_t virt_addr, size, off;
+	struct net_iov *niov;
+
+	while (length && iov_iter_count(from)) {
+		if (i == MAX_SKB_FRAGS)
+			return -EMSGSIZE;
+
+		virt_addr = (size_t)iter_iov_addr(from);
+		niov = net_devmem_get_niov_at(binding, virt_addr, &off, &size);
+		if (!niov)
+			return -EFAULT;
+
+		size = min_t(size_t, size, length);
+		size = min_t(size_t, size, iter_iov_len(from));
+
+		get_netmem(net_iov_to_netmem(niov));
+		skb_add_rx_frag_netmem(skb, i, net_iov_to_netmem(niov), off,
+				       size, PAGE_SIZE);
+		iov_iter_advance(from, size);
+		length -= size;
+		i++;
+	}
+
+	return 0;
+}
+
 int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
 			    struct sk_buff *skb, struct iov_iter *from,
-			    size_t length)
+			    size_t length,
+			    struct net_devmem_dmabuf_binding *binding)
 {
 	unsigned long orig_size = skb->truesize;
 	unsigned long truesize;
@@ -702,6 +737,8 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
 
 	if (msg && msg->msg_ubuf && msg->sg_from_iter)
 		ret = msg->sg_from_iter(skb, from, length);
+	else if (unlikely(binding))
+		ret = zerocopy_fill_skb_from_devmem(skb, from, length, binding);
 	else
 		ret = zerocopy_fill_skb_from_iter(skb, from, length);
 
@@ -735,7 +772,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
 	if (skb_copy_datagram_from_iter(skb, 0, from, copy))
 		return -EFAULT;
 
-	return __zerocopy_sg_from_iter(NULL, NULL, skb, from, ~0U);
+	return __zerocopy_sg_from_iter(NULL, NULL, skb, from, ~0U, NULL);
 }
 EXPORT_SYMBOL(zerocopy_sg_from_iter);
 
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 20985a570662..5de887545f5e 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -16,6 +16,7 @@
 #include <net/netdev_queues.h>
 #include <net/netdev_rx_queue.h>
 #include <net/page_pool/helpers.h>
+#include <net/sock.h>
 #include <trace/events/page_pool.h>
 
 #include "devmem.h"
@@ -64,8 +65,10 @@ void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding)
 	dma_buf_detach(binding->dmabuf, binding->attachment);
 	dma_buf_put(binding->dmabuf);
 	xa_destroy(&binding->bound_rxqs);
+	kvfree(binding->tx_vec);
 	kfree(binding);
 }
+EXPORT_SYMBOL(__net_devmem_dmabuf_binding_free);
 
 struct net_iov *
 net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
@@ -110,6 +113,13 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
 	unsigned long xa_idx;
 	unsigned int rxq_idx;
 
+	xa_erase(&net_devmem_dmabuf_bindings, binding->id);
+
+	/* Ensure no tx net_devmem_lookup_dmabuf() are in flight after the
+	 * erase.
+	 */
+	synchronize_net();
+
 	if (binding->list.next)
 		list_del(&binding->list);
 
@@ -123,8 +133,6 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
 		WARN_ON(netdev_rx_queue_restart(binding->dev, rxq_idx));
 	}
 
-	xa_erase(&net_devmem_dmabuf_bindings, binding->id);
-
 	net_devmem_dmabuf_binding_put(binding);
 }
 
@@ -185,8 +193,9 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
 }
 
 struct net_devmem_dmabuf_binding *
-net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
-		       struct netlink_ext_ack *extack)
+net_devmem_bind_dmabuf(struct net_device *dev,
+		       enum dma_data_direction direction,
+		       unsigned int dmabuf_fd, struct netlink_ext_ack *extack)
 {
 	struct net_devmem_dmabuf_binding *binding;
 	static u32 id_alloc_next;
@@ -229,7 +238,7 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
 	}
 
 	binding->sgt = dma_buf_map_attachment_unlocked(binding->attachment,
-						       DMA_FROM_DEVICE);
+						       direction);
 	if (IS_ERR(binding->sgt)) {
 		err = PTR_ERR(binding->sgt);
 		NL_SET_ERR_MSG(extack, "Failed to map dmabuf attachment");
@@ -240,13 +249,23 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
 	 * binding can be much more flexible than that. We may be able to
 	 * allocate MTU sized chunks here. Leave that for future work...
 	 */
-	binding->chunk_pool =
-		gen_pool_create(PAGE_SHIFT, dev_to_node(&dev->dev));
+	binding->chunk_pool = gen_pool_create(PAGE_SHIFT,
+					      dev_to_node(&dev->dev));
 	if (!binding->chunk_pool) {
 		err = -ENOMEM;
 		goto err_unmap;
 	}
 
+	if (direction == DMA_TO_DEVICE) {
+		binding->tx_vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
+						 sizeof(struct net_iov *),
+						 GFP_KERNEL);
+		if (!binding->tx_vec) {
+			err = -ENOMEM;
+			goto err_free_chunks;
+		}
+	}
+
 	virtual = 0;
 	for_each_sgtable_dma_sg(binding->sgt, sg, sg_idx) {
 		dma_addr_t dma_addr = sg_dma_address(sg);
@@ -288,6 +307,8 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
 			niov->owner = owner;
 			page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
 						      net_devmem_get_dma_addr(niov));
+			if (direction == DMA_TO_DEVICE)
+				binding->tx_vec[owner->base_virtual / PAGE_SIZE + i] = niov;
 		}
 
 		virtual += len;
@@ -313,6 +334,21 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
 	return ERR_PTR(err);
 }
 
+struct net_devmem_dmabuf_binding *net_devmem_lookup_dmabuf(u32 id)
+{
+	struct net_devmem_dmabuf_binding *binding;
+
+	rcu_read_lock();
+	binding = xa_load(&net_devmem_dmabuf_bindings, id);
+	if (binding) {
+		if (!net_devmem_dmabuf_binding_get(binding))
+			binding = NULL;
+	}
+	rcu_read_unlock();
+
+	return binding;
+}
+
 void dev_dmabuf_uninstall(struct net_device *dev)
 {
 	struct net_devmem_dmabuf_binding *binding;
@@ -343,6 +379,53 @@ void net_devmem_put_net_iov(struct net_iov *niov)
 	net_devmem_dmabuf_binding_put(niov->owner->binding);
 }
 
+struct net_devmem_dmabuf_binding *net_devmem_get_binding(struct sock *sk,
+							 unsigned int dmabuf_id)
+{
+	struct net_devmem_dmabuf_binding *binding;
+	struct dst_entry *dst = __sk_dst_get(sk);
+	int err = 0;
+
+	binding = net_devmem_lookup_dmabuf(dmabuf_id);
+	if (!binding || !binding->tx_vec) {
+		err = -EINVAL;
+		goto out_err;
+	}
+
+	/* The dma-addrs in this binding are only reachable to the corresponding
+	 * net_device.
+	 */
+	if (!dst || !dst->dev || dst->dev->ifindex != binding->dev->ifindex) {
+		err = -ENODEV;
+		goto out_err;
+	}
+
+	return binding;
+
+out_err:
+	if (binding)
+		net_devmem_dmabuf_binding_put(binding);
+
+	return ERR_PTR(err);
+}
+
+struct net_iov *
+net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding,
+		       size_t virt_addr, size_t *off, size_t *size)
+{
+	size_t idx;
+
+	if (virt_addr >= binding->dmabuf->size)
+		return NULL;
+
+	idx = virt_addr / PAGE_SIZE;
+
+	*off = virt_addr % PAGE_SIZE;
+	*size = PAGE_SIZE - *off;
+
+	return binding->tx_vec[idx];
+}
+
 /*** "Dmabuf devmem memory provider" ***/
 
 int mp_dmabuf_devmem_init(struct page_pool *pool)
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 8b51caff5a0e..874e891e70e0 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -46,6 +46,12 @@ struct net_devmem_dmabuf_binding {
 	 * active.
 	 */
 	u32 id;
+
+	/* Array of net_iov pointers for this binding, sorted by virtual
+	 * address. This array is convenient to map the virtual addresses to
+	 * net_iovs in the TX path.
+	 */
+	struct net_iov **tx_vec;
 };
 
 #if defined(CONFIG_NET_DEVMEM)
@@ -70,12 +76,15 @@ struct dmabuf_genpool_chunk_owner {
 
 void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding);
 struct net_devmem_dmabuf_binding *
-net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
-		       struct netlink_ext_ack *extack);
+net_devmem_bind_dmabuf(struct net_device *dev,
+		       enum dma_data_direction direction,
+		       unsigned int dmabuf_fd, struct netlink_ext_ack *extack);
+struct net_devmem_dmabuf_binding *net_devmem_lookup_dmabuf(u32 id);
 void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding);
 int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
 				    struct net_devmem_dmabuf_binding *binding,
 				    struct netlink_ext_ack *extack);
+void net_devmem_bind_tx_release(struct sock *sk);
 void dev_dmabuf_uninstall(struct net_device *dev);
 
 static inline struct dmabuf_genpool_chunk_owner *
@@ -108,10 +117,10 @@ static inline u32 net_iov_binding_id(const struct net_iov *niov)
 	return net_iov_owner(niov)->binding->id;
 }
 
-static inline void
+static inline bool
 net_devmem_dmabuf_binding_get(struct net_devmem_dmabuf_binding *binding)
 {
-	refcount_inc(&binding->ref);
+	return refcount_inc_not_zero(&binding->ref);
 }
 
 static inline void
@@ -130,6 +139,12 @@ struct net_iov *
 net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding);
 void net_devmem_free_dmabuf(struct net_iov *ppiov);
 
+struct net_devmem_dmabuf_binding *
+net_devmem_get_binding(struct sock *sk, unsigned int dmabuf_id);
+struct net_iov *
+net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding, size_t addr,
+		       size_t *off, size_t *size);
+
 #else
 struct net_devmem_dmabuf_binding;
 
@@ -153,11 +168,17 @@ __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding)
 
 static inline struct net_devmem_dmabuf_binding *
 net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
+		       enum dma_data_direction direction,
 		       struct netlink_ext_ack *extack)
 {
 	return ERR_PTR(-EOPNOTSUPP);
 }
 
+static inline struct net_devmem_dmabuf_binding *net_devmem_lookup_dmabuf(u32 id)
+{
+	return NULL;
+}
+
 static inline void
 net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
 {
@@ -195,6 +216,19 @@ static inline u32 net_iov_binding_id(const struct net_iov *niov)
 {
 	return 0;
 }
+
+static inline struct net_devmem_dmabuf_binding *
+net_devmem_get_binding(struct sock *sk, unsigned int dmabuf_id)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
+static inline struct net_iov *
+net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding, size_t addr,
+		       size_t *off, size_t *size)
+{
+	return NULL;
+}
 #endif
 
 #endif /* _NET_DEVMEM_H */
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 0e41699df419..3ecb3a6d3913 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -854,7 +854,8 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
 		goto err_unlock;
 	}
 
-	binding = net_devmem_bind_dmabuf(netdev, dmabuf_fd, info->extack);
+	binding = net_devmem_bind_dmabuf(netdev, DMA_FROM_DEVICE, dmabuf_fd,
+					 info->extack);
 	if (IS_ERR(binding)) {
 		err = PTR_ERR(binding);
 		goto err_unlock;
@@ -911,10 +912,67 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
 	return err;
 }
 
-/* stub */
 int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	return 0;
+	struct net_devmem_dmabuf_binding *binding;
+	struct list_head *sock_binding_list;
+	struct net_device *netdev;
+	u32 ifindex, dmabuf_fd;
+	struct sk_buff *rsp;
+	int err = 0;
+	void *hdr;
+
+	if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_DEV_IFINDEX) ||
+	    GENL_REQ_ATTR_CHECK(info, NETDEV_A_DMABUF_FD))
+		return -EINVAL;
+
+	ifindex = nla_get_u32(info->attrs[NETDEV_A_DEV_IFINDEX]);
+	dmabuf_fd = nla_get_u32(info->attrs[NETDEV_A_DMABUF_FD]);
+
+	sock_binding_list = genl_sk_priv_get(&netdev_nl_family,
+					     NETLINK_CB(skb).sk);
+	if (IS_ERR(sock_binding_list))
+		return PTR_ERR(sock_binding_list);
+
+	rsp = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!rsp)
+		return -ENOMEM;
+
+	hdr = genlmsg_iput(rsp, info);
+	if (!hdr) {
+		err = -EMSGSIZE;
+		goto err_genlmsg_free;
+	}
+
+	rtnl_lock();
+
+	netdev = __dev_get_by_index(genl_info_net(info), ifindex);
+	if (!netdev || !netif_device_present(netdev)) {
+		err = -ENODEV;
+		goto err_unlock;
+	}
+
+	binding = net_devmem_bind_dmabuf(netdev, DMA_TO_DEVICE, dmabuf_fd,
+					 info->extack);
+	if (IS_ERR(binding)) {
+		err = PTR_ERR(binding);
+		goto err_unlock;
+	}
+
+	list_add(&binding->list, sock_binding_list);
+
+	nla_put_u32(rsp, NETDEV_A_DMABUF_ID, binding->id);
+	genlmsg_end(rsp, hdr);
+
+	rtnl_unlock();
+
+	return genlmsg_reply(rsp, info);
+
+err_unlock:
+	rtnl_unlock();
+err_genlmsg_free:
+	nlmsg_free(rsp);
+	return err;
 }
 
 void netdev_nl_sock_priv_init(struct list_head *priv)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 815245d5c36b..6289ffcbb20b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1882,7 +1882,8 @@ EXPORT_SYMBOL_GPL(msg_zerocopy_ubuf_ops);
 
 int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 			     struct msghdr *msg, int len,
-			     struct ubuf_info *uarg)
+			     struct ubuf_info *uarg,
+			     struct net_devmem_dmabuf_binding *binding)
 {
 	int err, orig_len = skb->len;
 
@@ -1901,7 +1902,8 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 			return -EEXIST;
 	}
 
-	err = __zerocopy_sg_from_iter(msg, sk, skb, &msg->msg_iter, len);
+	err = __zerocopy_sg_from_iter(msg, sk, skb, &msg->msg_iter, len,
+				      binding);
 	if (err == -EFAULT || (err == -EMSGSIZE && skb->len == orig_len)) {
 		struct sock *save_sk = skb->sk;
 
diff --git a/net/core/sock.c b/net/core/sock.c
index eae2ae70a2e0..353669f124ab 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2911,6 +2911,7 @@ EXPORT_SYMBOL(sock_alloc_send_pskb);
 int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
 		     struct sockcm_cookie *sockc)
 {
+	struct dmabuf_tx_cmsg dmabuf_tx;
 	u32 tsflags;
 
 	BUILD_BUG_ON(SOF_TIMESTAMPING_LAST == (1 << 31));
@@ -2964,6 +2965,13 @@ int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
 		if (!sk_set_prio_allowed(sk, *(u32 *)CMSG_DATA(cmsg)))
 			return -EPERM;
 		sockc->priority = *(u32 *)CMSG_DATA(cmsg);
+		break;
+	case SCM_DEVMEM_DMABUF:
+		if (cmsg->cmsg_len != CMSG_LEN(sizeof(struct dmabuf_tx_cmsg)))
+			return -EINVAL;
+		dmabuf_tx = *(struct dmabuf_tx_cmsg *)CMSG_DATA(cmsg);
+		sockc->dmabuf_id = dmabuf_tx.dmabuf_id;
+
 		break;
 	default:
 		return -EINVAL;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0d704bda6c41..44198ae7e44c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1051,6 +1051,7 @@ int tcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg, int *copied,
 
 int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 {
+	struct net_devmem_dmabuf_binding *binding = NULL;
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct ubuf_info *uarg = NULL;
 	struct sk_buff *skb;
@@ -1063,6 +1064,15 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 
 	flags = msg->msg_flags;
 
+	sockcm_init(&sockc, sk);
+	if (msg->msg_controllen) {
+		err = sock_cmsg_send(sk, msg, &sockc);
+		if (unlikely(err)) {
+			err = -EINVAL;
+			goto out_err;
+		}
+	}
+
 	if ((flags & MSG_ZEROCOPY) && size) {
 		if (msg->msg_ubuf) {
 			uarg = msg->msg_ubuf;
@@ -1080,6 +1090,15 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			else
 				uarg_to_msgzc(uarg)->zerocopy = 0;
 		}
+
+		if (sockc.dmabuf_id != 0) {
+			binding = net_devmem_get_binding(sk, sockc.dmabuf_id);
+			if (IS_ERR(binding)) {
+				err = PTR_ERR(binding);
+				binding = NULL;
+				goto out_err;
+			}
+		}
 	} else if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES) && size) {
 		if (sk->sk_route_caps & NETIF_F_SG)
 			zc = MSG_SPLICE_PAGES;
@@ -1123,15 +1142,6 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		/* 'common' sending to sendq */
 	}
 
-	sockcm_init(&sockc, sk);
-	if (msg->msg_controllen) {
-		err = sock_cmsg_send(sk, msg, &sockc);
-		if (unlikely(err)) {
-			err = -EINVAL;
-			goto out_err;
-		}
-	}
-
 	/* This should be in poll */
 	sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
 
@@ -1248,7 +1258,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 					goto wait_for_space;
 			}
 
-			err = skb_zerocopy_iter_stream(sk, skb, msg, copy, uarg);
+			err = skb_zerocopy_iter_stream(sk, skb, msg, copy, uarg,
+						       binding);
 			if (err == -EMSGSIZE || err == -EEXIST) {
 				tcp_mark_push(tp, skb);
 				goto new_segment;
@@ -1329,6 +1340,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 	/* msg->msg_ubuf is pinned by the caller so we don't take extra refs */
 	if (uarg && !msg->msg_ubuf)
 		net_zcopy_put(uarg);
+	if (binding)
+		net_devmem_dmabuf_binding_put(binding);
 	return copied + copied_syn;
 
 do_error:
@@ -1346,6 +1359,9 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		sk->sk_write_space(sk);
 		tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED);
 	}
+	if (binding)
+		net_devmem_dmabuf_binding_put(binding);
+
 	return err;
 }
 EXPORT_SYMBOL_GPL(tcp_sendmsg_locked);
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 7f7de6d88096..f6d4bb798517 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -107,8 +107,7 @@ static int virtio_transport_fill_skb(struct sk_buff *skb,
 {
 	if (zcopy)
 		return __zerocopy_sg_from_iter(info->msg, NULL, skb,
-					       &info->msg->msg_iter,
-					       len);
+					       &info->msg->msg_iter, len, NULL);
 
 	return memcpy_from_msg(skb_put(skb, len), info->msg, len);
 }
-- 
2.48.1.362.g079036d154-goog


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH net-next v3 6/6] net: devmem: make dmabuf unbinding scheduled work
  2025-02-03 22:39 [PATCH net-next v3 0/6] Device memory TCP TX Mina Almasry
                   ` (4 preceding siblings ...)
  2025-02-03 22:39 ` [PATCH net-next v3 5/6] net: devmem: Implement TX path Mina Almasry
@ 2025-02-03 22:39 ` Mina Almasry
  2025-02-04 12:32 ` [PATCH net-next v3 0/6] Device memory TCP TX Paolo Abeni
  2025-02-05  2:08 ` Jakub Kicinski
  7 siblings, 0 replies; 39+ messages in thread
From: Mina Almasry @ 2025-02-03 22:39 UTC (permalink / raw)
  To: netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest
  Cc: Mina Almasry, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Jonathan Corbet,
	Andrew Lunn, Neal Cardwell, David Ahern, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja

The TX path may release the dmabuf in a context where we cannot wait.
This happens when the user unbinds a TX dmabuf while there are still
references to its netmems in the TX path. In that case, the netmems will
be put_netmem'd from a context where we can't unmap the dmabuf,
resulting in a BUG like the one seen by Stan:

[    1.548495] BUG: sleeping function called from invalid context at drivers/dma-buf/dma-buf.c:1255
[    1.548741] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 149, name: ncdevmem
[    1.548926] preempt_count: 201, expected: 0
[    1.549026] RCU nest depth: 0, expected: 0
[    1.549197]
[    1.549237] =============================
[    1.549331] [ BUG: Invalid wait context ]
[    1.549425] 6.13.0-rc3-00770-gbc9ef9606dc9-dirty #15 Tainted: G        W
[    1.549609] -----------------------------
[    1.549704] ncdevmem/149 is trying to lock:
[    1.549801] ffff8880066701c0 (reservation_ww_class_mutex){+.+.}-{4:4}, at: dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.550051] other info that might help us debug this:
[    1.550167] context-{5:5}
[    1.550229] 3 locks held by ncdevmem/149:
[    1.550322]  #0: ffff888005730208 (&sb->s_type->i_mutex_key#11){+.+.}-{4:4}, at: sock_close+0x40/0xf0
[    1.550530]  #1: ffff88800b148f98 (sk_lock-AF_INET6){+.+.}-{0:0}, at: tcp_close+0x19/0x80
[    1.550731]  #2: ffff88800b148f18 (slock-AF_INET6){+.-.}-{3:3}, at: __tcp_close+0x185/0x4b0
[    1.550921] stack backtrace:
[    1.550990] CPU: 0 UID: 0 PID: 149 Comm: ncdevmem Tainted: G        W          6.13.0-rc3-00770-gbc9ef9606dc9-dirty #15
[    1.551233] Tainted: [W]=WARN
[    1.551304] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[    1.551518] Call Trace:
[    1.551584]  <TASK>
[    1.551636]  dump_stack_lvl+0x86/0xc0
[    1.551723]  __lock_acquire+0xb0f/0xc30
[    1.551814]  ? dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.551941]  lock_acquire+0xf1/0x2a0
[    1.552026]  ? dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.552152]  ? dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.552281]  ? dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.552408]  __ww_mutex_lock+0x121/0x1060
[    1.552503]  ? dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.552648]  ww_mutex_lock+0x3d/0xa0
[    1.552733]  dma_buf_unmap_attachment_unlocked+0x4b/0x90
[    1.552857]  __net_devmem_dmabuf_binding_free+0x56/0xb0
[    1.552979]  skb_release_data+0x120/0x1f0
[    1.553074]  __kfree_skb+0x29/0xa0
[    1.553156]  tcp_write_queue_purge+0x41/0x310
[    1.553259]  tcp_v4_destroy_sock+0x127/0x320
[    1.553363]  ? __tcp_close+0x169/0x4b0
[    1.553452]  inet_csk_destroy_sock+0x53/0x130
[    1.553560]  __tcp_close+0x421/0x4b0
[    1.553646]  tcp_close+0x24/0x80
[    1.553724]  inet_release+0x5d/0x90
[    1.553806]  sock_close+0x4a/0xf0
[    1.553886]  __fput+0x9c/0x2b0
[    1.553960]  task_work_run+0x89/0xc0
[    1.554046]  do_exit+0x27f/0x980
[    1.554125]  do_group_exit+0xa4/0xb0
[    1.554211]  __x64_sys_exit_group+0x17/0x20
[    1.554309]  x64_sys_call+0x21a0/0x21a0
[    1.554400]  do_syscall_64+0xec/0x1d0
[    1.554487]  ? exc_page_fault+0x8a/0xf0
[    1.554585]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[    1.554703] RIP: 0033:0x7f2f8a27abcd

Resolve this by deferring __net_devmem_dmabuf_binding_free() to a
workqueue via schedule_work().
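
In condensed form (this just restates the diff below), the final put
becomes:

	static inline void
	net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding)
	{
		if (!refcount_dec_and_test(&binding->ref))
			return;

		/* Defer the free: unmapping the dma-buf may sleep, but the
		 * last put can come from atomic context (e.g. skb freeing).
		 */
		INIT_WORK(&binding->unbind_w, __net_devmem_dmabuf_binding_free);
		schedule_work(&binding->unbind_w);
	}

with __net_devmem_dmabuf_binding_free() recovering the binding on the
workqueue side via container_of(wq, typeof(*binding), unbind_w).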

Suggested-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Mina Almasry <almasrymina@google.com>

---
 net/core/devmem.c |  4 +++-
 net/core/devmem.h | 10 ++++++----
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/core/devmem.c b/net/core/devmem.c
index 5de887545f5e..23463de19f50 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -46,8 +46,10 @@ static dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)
 	       ((dma_addr_t)net_iov_idx(niov) << PAGE_SHIFT);
 }
 
-void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding)
+void __net_devmem_dmabuf_binding_free(struct work_struct *wq)
 {
+	struct net_devmem_dmabuf_binding *binding = container_of(wq, typeof(*binding), unbind_w);
+
 	size_t size, avail;
 
 	gen_pool_for_each_chunk(binding->chunk_pool,
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 874e891e70e0..63d16dbaca2d 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -52,6 +52,8 @@ struct net_devmem_dmabuf_binding {
 	 * net_iovs in the TX path.
 	 */
 	struct net_iov **tx_vec;
+
+	struct work_struct unbind_w;
 };
 
 #if defined(CONFIG_NET_DEVMEM)
@@ -74,7 +76,7 @@ struct dmabuf_genpool_chunk_owner {
 	struct net_devmem_dmabuf_binding *binding;
 };
 
-void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding);
+void __net_devmem_dmabuf_binding_free(struct work_struct *wq);
 struct net_devmem_dmabuf_binding *
 net_devmem_bind_dmabuf(struct net_device *dev,
 		       enum dma_data_direction direction,
@@ -129,7 +131,8 @@ net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding)
 	if (!refcount_dec_and_test(&binding->ref))
 		return;
 
-	__net_devmem_dmabuf_binding_free(binding);
+	INIT_WORK(&binding->unbind_w, __net_devmem_dmabuf_binding_free);
+	schedule_work(&binding->unbind_w);
 }
 
 void net_devmem_get_net_iov(struct net_iov *niov);
@@ -161,8 +164,7 @@ static inline void net_devmem_put_net_iov(struct net_iov *niov)
 {
 }
 
-static inline void
-__net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding)
+static inline void __net_devmem_dmabuf_binding_free(struct work_struct *wq)
 {
 }
 
-- 
2.48.1.362.g079036d154-goog


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 5/6] net: devmem: Implement TX path
  2025-02-03 22:39 ` [PATCH net-next v3 5/6] net: devmem: Implement TX path Mina Almasry
@ 2025-02-04 12:15   ` Paolo Abeni
  2025-02-05 12:20   ` Pavel Begunkov
  2025-02-05 21:56   ` Willem de Bruijn
  2 siblings, 0 replies; 39+ messages in thread
From: Paolo Abeni @ 2025-02-04 12:15 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-kernel, linux-doc, kvm,
	virtualization, linux-kselftest
  Cc: Donald Hunter, Jakub Kicinski, David S. Miller, Eric Dumazet,
	Simon Horman, Jonathan Corbet, Andrew Lunn, Neal Cardwell,
	David Ahern, Michael S. Tsirkin, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Stefan Hajnoczi, Stefano Garzarella,
	Shuah Khan, sdf, asml.silence, dw, Jamal Hadi Salim,
	Victor Nogueira, Pedro Tammela, Samiullah Khawaja, Kaiyuan Zhang

On 2/3/25 11:39 PM, Mina Almasry wrote:
> Augment dmabuf binding to be able to handle TX. Additional to all the RX
> binding, we also create tx_vec needed for the TX path.
> 
> Provide API for sendmsg to be able to send dmabufs bound to this device:
> 
> - Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from.
> - MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf.
> 
> Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY
> implementation, while disabling instances where MSG_ZEROCOPY falls back
> to copying.
> 
> We additionally pipe the binding down to the new
> zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems
> instead of the traditional page netmems.
> 
> We also special case skb_frag_dma_map to return the dma-address of these
> dmabuf net_iovs instead of attempting to map pages.
> 
> Based on work by Stanislav Fomichev <sdf@fomichev.me>. A lot of the meat
> of the implementation came from devmem TCP RFC v1[1], which included the
> TX path, but Stan did all the rebasing on top of netmem/net_iov.
> 
> Cc: Stanislav Fomichev <sdf@fomichev.me>
> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> Signed-off-by: Mina Almasry <almasrymina@google.com>

Very minor nit: you unexpectedly left a lot of empty lines after the SoB.

[...]
@@ -240,13 +249,23 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
>  	 * binding can be much more flexible than that. We may be able to
>  	 * allocate MTU sized chunks here. Leave that for future work...
>  	 */
> -	binding->chunk_pool =
> -		gen_pool_create(PAGE_SHIFT, dev_to_node(&dev->dev));
> +	binding->chunk_pool = gen_pool_create(PAGE_SHIFT,
> +					      dev_to_node(&dev->dev));
>  	if (!binding->chunk_pool) {
>  		err = -ENOMEM;
>  		goto err_unmap;
>  	}
>  
> +	if (direction == DMA_TO_DEVICE) {
> +		binding->tx_vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
> +						 sizeof(struct net_iov *),
> +						 GFP_KERNEL);
> +		if (!binding->tx_vec) {
> +			err = -ENOMEM;
> +			goto err_free_chunks;

It looks like the later error paths (in the for_each_sgtable_dma_sg()
loop) could happen even for 'direction == DMA_TO_DEVICE', so I guess an
additional error label is needed to clean tx_vec on such paths.
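
E.g. something like this (untested sketch; the new label name is made
up):

	for_each_sgtable_dma_sg(binding->sgt, sg, sg_idx) {
		...
		if (err)
			goto err_free_tx_vec;	/* instead of err_free_chunks */
	}

err_free_tx_vec:
	kvfree(binding->tx_vec);	/* kvfree() is NULL-safe for the RX case */
err_free_chunks:
	/* existing chunk pool / unmap cleanup continues from here */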

/P


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 2/6] selftests: ncdevmem: Implement devmem TCP TX
  2025-02-03 22:39 ` [PATCH net-next v3 2/6] selftests: ncdevmem: Implement devmem TCP TX Mina Almasry
@ 2025-02-04 12:29   ` Paolo Abeni
  2025-02-04 16:50     ` Jakub Kicinski
  2025-02-04 17:35     ` Mina Almasry
  0 siblings, 2 replies; 39+ messages in thread
From: Paolo Abeni @ 2025-02-04 12:29 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-kernel, linux-doc, kvm,
	virtualization, linux-kselftest
  Cc: Donald Hunter, Jakub Kicinski, David S. Miller, Eric Dumazet,
	Simon Horman, Jonathan Corbet, Andrew Lunn, Neal Cardwell,
	David Ahern, Michael S. Tsirkin, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Stefan Hajnoczi, Stefano Garzarella,
	Shuah Khan, sdf, asml.silence, dw, Jamal Hadi Salim,
	Victor Nogueira, Pedro Tammela, Samiullah Khawaja

On 2/3/25 11:39 PM, Mina Almasry wrote:
> Add support for devmem TX in ncdevmem.
> 
> This is a combination of the ncdevmem from the devmem TCP series RFCv1
> which included the TX path, and work by Stan to include the netlink API
> and refactored on top of his generic memory_provider support.
> 
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>

Usually the self-tests are included towards the end of the series, to
help reviewers building-up on previous patches knowledge.

>  .../selftests/drivers/net/hw/ncdevmem.c       | 300 +++++++++++++++++-
>  1 file changed, 289 insertions(+), 11 deletions(-)

Why devmem.py is not touched? AFAICS the test currently run ncdevmem
only in server (rx) mode, so the tx path is not actually exercised ?!?

/P


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 0/6] Device memory TCP TX
  2025-02-03 22:39 [PATCH net-next v3 0/6] Device memory TCP TX Mina Almasry
                   ` (5 preceding siblings ...)
  2025-02-03 22:39 ` [PATCH net-next v3 6/6] net: devmem: make dmabuf unbinding scheduled work Mina Almasry
@ 2025-02-04 12:32 ` Paolo Abeni
  2025-02-04 17:27   ` Mina Almasry
  2025-02-05  2:08 ` Jakub Kicinski
  7 siblings, 1 reply; 39+ messages in thread
From: Paolo Abeni @ 2025-02-04 12:32 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-kernel, linux-doc, kvm,
	virtualization, linux-kselftest
  Cc: Donald Hunter, Jakub Kicinski, David S. Miller, Eric Dumazet,
	Simon Horman, Jonathan Corbet, Andrew Lunn, Neal Cardwell,
	David Ahern, Michael S. Tsirkin, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Stefan Hajnoczi, Stefano Garzarella,
	Shuah Khan, sdf, asml.silence, dw, Jamal Hadi Salim,
	Victor Nogueira, Pedro Tammela, Samiullah Khawaja

On 2/3/25 11:39 PM, Mina Almasry wrote:
> The TX path had been dropped from the Device Memory TCP patch series
> post RFCv1 [1], to make that series slightly easier to review. This
> series rebases the implementation of the TX path on top of the
> net_iov/netmem framework agreed upon and merged. The motivation for
> the feature is thoroughly described in the docs & cover letter of the
> original proposal, so I don't repeat the lengthy descriptions here, but
> they are available in [1].
> 
> Sending this series as RFC as the winder closure is immenient. I plan on
> reposting as non-RFC once the tree re-opens, addressing any feedback
> I receive in the meantime.

I guess you should drop this paragraph.

> Full outline on usage of the TX path is detailed in the documentation
> added in the first patch.
> 
> Test example is available via the kselftest included in the series as well.
> 
> The series is relatively small, as the TX path for this feature largely
> piggybacks on the existing MSG_ZEROCOPY implementation.

It looks like no additional device level support is required. That is
IMHO so good up to suspicious level :)

> Patch Overview:
> ---------------
> 
> 1. Documentation & tests to give high level overview of the feature
>    being added.
> 
> 2. Add netmem refcounting needed for the TX path.
> 
> 3. Devmem TX netlink API.
> 
> 4. Devmem TX net stack implementation.

It looks like even the above section needs some update.

/P


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 2/6] selftests: ncdevmem: Implement devmem TCP TX
  2025-02-04 12:29   ` Paolo Abeni
@ 2025-02-04 16:50     ` Jakub Kicinski
  2025-02-04 17:35     ` Mina Almasry
  1 sibling, 0 replies; 39+ messages in thread
From: Jakub Kicinski @ 2025-02-04 16:50 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Mina Almasry, netdev, linux-kernel, linux-doc, kvm,
	virtualization, linux-kselftest, Donald Hunter, David S. Miller,
	Eric Dumazet, Simon Horman, Jonathan Corbet, Andrew Lunn,
	Neal Cardwell, David Ahern, Michael S. Tsirkin, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja

On Tue, 4 Feb 2025 13:29:18 +0100 Paolo Abeni wrote:
> On 2/3/25 11:39 PM, Mina Almasry wrote:
> > Add support for devmem TX in ncdevmem.
> > 
> > This is a combination of the ncdevmem from the devmem TCP series RFCv1
> > which included the TX path, and work by Stan to include the netlink API
> > and refactored on top of his generic memory_provider support.
> > 
> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> > Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>  
> 
> Usually the self-tests are included towards the end of the series, to
> help reviewers building-up on previous patches knowledge.

I had the same reaction, but in cases where uAPI is simpler than 
the core code it may actually help the understanding to start with
the selftest. Dunno. Only concern would be that the test won't work
if someone bisects to this commit, but that's not very practical?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 0/6] Device memory TCP TX
  2025-02-04 12:32 ` [PATCH net-next v3 0/6] Device memory TCP TX Paolo Abeni
@ 2025-02-04 17:27   ` Mina Almasry
  2025-02-04 18:06     ` Stanislav Fomichev
  0 siblings, 1 reply; 39+ messages in thread
From: Mina Almasry @ 2025-02-04 17:27 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Simon Horman, Jonathan Corbet, Andrew Lunn,
	Neal Cardwell, David Ahern, Michael S. Tsirkin, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja

On Tue, Feb 4, 2025 at 4:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 2/3/25 11:39 PM, Mina Almasry wrote:
> > The TX path had been dropped from the Device Memory TCP patch series
> > post RFCv1 [1], to make that series slightly easier to review. This
> > series rebases the implementation of the TX path on top of the
> > net_iov/netmem framework agreed upon and merged. The motivation for
> > the feature is thoroughly described in the docs & cover letter of the
> > original proposal, so I don't repeat the lengthy descriptions here, but
> > they are available in [1].
> >
> > Sending this series as RFC as the winder closure is immenient. I plan on
> > reposting as non-RFC once the tree re-opens, addressing any feedback
> > I receive in the meantime.
>
> I guess you should drop this paragraph.
>
> > Full outline on usage of the TX path is detailed in the documentation
> > added in the first patch.
> >
> > Test example is available via the kselftest included in the series as well.
> >
> > The series is relatively small, as the TX path for this feature largely
> > piggybacks on the existing MSG_ZEROCOPY implementation.
>
> It looks like no additional device level support is required. That is
> IMHO so good up to suspicious level :)
>

It is correct no additional device level support is required. I don't
have any local changes to my driver to make this work. I think Stan
on-list was able to run the TX path (he commented on fixes to the test
but didn't say it doesn't work :D) and one other person was able to
run it offlist.

> > Patch Overview:
> > ---------------
> >
> > 1. Documentation & tests to give high level overview of the feature
> >    being added.
> >
> > 2. Add netmem refcounting needed for the TX path.
> >
> > 3. Devmem TX netlink API.
> >
> > 4. Devmem TX net stack implementation.
>
> It looks like even the above section needs some update.
>

Ah, I usually keep the original cover letter untouched and put the
updates under the version labels. Looks like you expect the full cover
letter to be updated. Will do. Thanks for looking.


-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 2/6] selftests: ncdevmem: Implement devmem TCP TX
  2025-02-04 12:29   ` Paolo Abeni
  2025-02-04 16:50     ` Jakub Kicinski
@ 2025-02-04 17:35     ` Mina Almasry
  2025-02-04 17:56       ` Paolo Abeni
  1 sibling, 1 reply; 39+ messages in thread
From: Mina Almasry @ 2025-02-04 17:35 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Simon Horman, Jonathan Corbet, Andrew Lunn,
	Neal Cardwell, David Ahern, Michael S. Tsirkin, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja

On Tue, Feb 4, 2025 at 4:29 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 2/3/25 11:39 PM, Mina Almasry wrote:
> > Add support for devmem TX in ncdevmem.
> >
> > This is a combination of the ncdevmem from the devmem TCP series RFCv1
> > which included the TX path, and work by Stan to include the netlink API
> > and refactored on top of his generic memory_provider support.
> >
> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> > Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
>
> Usually the self-tests are included towards the end of the series, to
> help reviewers building-up on previous patches knowledge.
>

I noticed reviewers like to go over the docs + selftests in my
previous series, so I thought I'd put them at the beginning. Looks
like the gambit was not welcome. I'll move the selftests to the end,
and may move the docs to the end as well, as is customary.

> >  .../selftests/drivers/net/hw/ncdevmem.c       | 300 +++++++++++++++++-
> >  1 file changed, 289 insertions(+), 11 deletions(-)
>
> Why devmem.py is not touched? AFAICS the test currently run ncdevmem
> only in server (rx) mode, so the tx path is not actually exercised ?!?
>

Yeah, to be honest I have a collection of local bash scripts that
invoke ncdevmem in different ways for my testing, and I have docs on
top of ncdevmem.c of how to test; I don't use devmem.py. I was going
to look at adding test cases to devmem.py as a follow up, if it's OK
with you, and Stan offered as well on an earlier revision. If not no
problem, I can address in this series. The only issue is that I have
some legwork to enable devmem.py on my test setup/distro, but the meat
of the tests is already included and passing in this series (when
invoked manually).

--
Thanks,
Mina

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 2/6] selftests: ncdevmem: Implement devmem TCP TX
  2025-02-04 17:35     ` Mina Almasry
@ 2025-02-04 17:56       ` Paolo Abeni
  2025-02-04 18:03         ` Mina Almasry
  0 siblings, 1 reply; 39+ messages in thread
From: Paolo Abeni @ 2025-02-04 17:56 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Simon Horman, Jonathan Corbet, Andrew Lunn,
	Neal Cardwell, David Ahern, Michael S. Tsirkin, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja

On 2/4/25 6:35 PM, Mina Almasry wrote:
> On Tue, Feb 4, 2025 at 4:29 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>>  .../selftests/drivers/net/hw/ncdevmem.c       | 300 +++++++++++++++++-
>>>  1 file changed, 289 insertions(+), 11 deletions(-)
>>
>> Why devmem.py is not touched? AFAICS the test currently run ncdevmem
>> only in server (rx) mode, so the tx path is not actually exercised ?!?
>>
> 
> Yeah, to be honest I have a collection of local bash scripts that
> invoke ncdevmem in different ways for my testing, and I have docs on
> top of ncdevmem.c of how to test; I don't use devmem.py. I was going
> to look at adding test cases to devmem.py as a follow up, if it's OK
> with you, and Stan offered as well on an earlier revision. If not no
> problem, I can address in this series. The only issue is that I have
> some legwork to enable devmem.py on my test setup/distro, but the meat
> of the tests is already included and passing in this series (when
> invoked manually).

I think it would be better if you could include at least a very basic
test-case for the TX path. More accurate coverage could be a follow-up.

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 2/6] selftests: ncdevmem: Implement devmem TCP TX
  2025-02-04 17:56       ` Paolo Abeni
@ 2025-02-04 18:03         ` Mina Almasry
  2025-02-04 18:07           ` Stanislav Fomichev
  0 siblings, 1 reply; 39+ messages in thread
From: Mina Almasry @ 2025-02-04 18:03 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Simon Horman, Jonathan Corbet, Andrew Lunn,
	Neal Cardwell, David Ahern, Michael S. Tsirkin, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja

On Tue, Feb 4, 2025 at 9:56 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 2/4/25 6:35 PM, Mina Almasry wrote:
> > On Tue, Feb 4, 2025 at 4:29 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >>>  .../selftests/drivers/net/hw/ncdevmem.c       | 300 +++++++++++++++++-
> >>>  1 file changed, 289 insertions(+), 11 deletions(-)
> >>
> >> Why devmem.py is not touched? AFAICS the test currently run ncdevmem
> >> only in server (rx) mode, so the tx path is not actually exercised ?!?
> >>
> >
> > Yeah, to be honest I have a collection of local bash scripts that
> > invoke ncdevmem in different ways for my testing, and I have docs on
> > top of ncdevmem.c of how to test; I don't use devmem.py. I was going
> > to look at adding test cases to devmem.py as a follow up, if it's OK
> > with you, and Stan offered as well on an earlier revision. If not no
> > problem, I can address in this series. The only issue is that I have
> > some legwork to enable devmem.py on my test setup/distro, but the meat
> > of the tests is already included and passing in this series (when
> > invoked manually).
>
> I think it would be better if you could include at least a very basic
> test-case for the TX path. More accurate coverage could be a follow-up.
>

Thanks; will do.

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 0/6] Device memory TCP TX
  2025-02-04 17:27   ` Mina Almasry
@ 2025-02-04 18:06     ` Stanislav Fomichev
  2025-02-04 18:32       ` Paolo Abeni
  2025-02-04 18:38       ` Mina Almasry
  0 siblings, 2 replies; 39+ messages in thread
From: Stanislav Fomichev @ 2025-02-04 18:06 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Paolo Abeni, netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Simon Horman, Jonathan Corbet, Andrew Lunn,
	Neal Cardwell, David Ahern, Michael S. Tsirkin, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja

On 02/04, Mina Almasry wrote:
> On Tue, Feb 4, 2025 at 4:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On 2/3/25 11:39 PM, Mina Almasry wrote:
> > > The TX path had been dropped from the Device Memory TCP patch series
> > > post RFCv1 [1], to make that series slightly easier to review. This
> > > series rebases the implementation of the TX path on top of the
> > > net_iov/netmem framework agreed upon and merged. The motivation for
> > > the feature is thoroughly described in the docs & cover letter of the
> > > original proposal, so I don't repeat the lengthy descriptions here, but
> > > they are available in [1].
> > >
> > > Sending this series as RFC as the winder closure is immenient. I plan on
> > > reposting as non-RFC once the tree re-opens, addressing any feedback
> > > I receive in the meantime.
> >
> > I guess you should drop this paragraph.
> >
> > > Full outline on usage of the TX path is detailed in the documentation
> > > added in the first patch.
> > >
> > > Test example is available via the kselftest included in the series as well.
> > >
> > > The series is relatively small, as the TX path for this feature largely
> > > piggybacks on the existing MSG_ZEROCOPY implementation.
> >
> > It looks like no additional device level support is required. That is
> > IMHO so good up to suspicious level :)
> >
> 
> It is correct no additional device level support is required. I don't
> have any local changes to my driver to make this work. I think Stan
> on-list was able to run the TX path (he commented on fixes to the test
> but didn't say it doesn't work :D) and one other person was able to
> run it offlist.

For BRCM I had shared this: https://lore.kernel.org/netdev/ZxAfWHk3aRWl-F31@mini-arch/
I have similar internal patch for mlx5 (will share after RX part gets
in). I agree that it seems like gve_unmap_packet needs some work to be more
careful to not unmap NIOVs (if you were testing against gve).

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 2/6] selftests: ncdevmem: Implement devmem TCP TX
  2025-02-04 18:03         ` Mina Almasry
@ 2025-02-04 18:07           ` Stanislav Fomichev
  0 siblings, 0 replies; 39+ messages in thread
From: Stanislav Fomichev @ 2025-02-04 18:07 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Paolo Abeni, netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Simon Horman, Jonathan Corbet, Andrew Lunn,
	Neal Cardwell, David Ahern, Michael S. Tsirkin, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja

On 02/04, Mina Almasry wrote:
> On Tue, Feb 4, 2025 at 9:56 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On 2/4/25 6:35 PM, Mina Almasry wrote:
> > > On Tue, Feb 4, 2025 at 4:29 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > >>>  .../selftests/drivers/net/hw/ncdevmem.c       | 300 +++++++++++++++++-
> > >>>  1 file changed, 289 insertions(+), 11 deletions(-)
> > >>
> > >> Why devmem.py is not touched? AFAICS the test currently run ncdevmem
> > >> only in server (rx) mode, so the tx path is not actually exercised ?!?
> > >>
> > >
> > > Yeah, to be honest I have a collection of local bash scripts that
> > > invoke ncdevmem in different ways for my testing, and I have docs on
> > > top of ncdevmem.c of how to test; I don't use devmem.py. I was going
> > > to look at adding test cases to devmem.py as a follow up, if it's OK
> > > with you, and Stan offered as well on an earlier revision. If not no
> > > problem, I can address in this series. The only issue is that I have
> > > some legwork to enable devmem.py on my test setup/distro, but the meat
> > > of the tests is already included and passing in this series (when
> > > invoked manually).
> >
> > I think it would be better if you could include at least a very basic
> > test-case for the TX path. More accurate coverage could be a follow-up.
> >
> 
> Thanks; will do.

This is what I've been using to test tx-only and tx-rx modes (shared
previously on the list as well):
https://github.com/fomichev/linux/commit/df5ef094db57f6c49603e6be5730782e379dd237

Feel free to include in the v4.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 0/6] Device memory TCP TX
  2025-02-04 18:06     ` Stanislav Fomichev
@ 2025-02-04 18:32       ` Paolo Abeni
  2025-02-04 18:47         ` Mina Almasry
  2025-02-04 18:38       ` Mina Almasry
  1 sibling, 1 reply; 39+ messages in thread
From: Paolo Abeni @ 2025-02-04 18:32 UTC (permalink / raw)
  To: Stanislav Fomichev, Mina Almasry
  Cc: netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Simon Horman, Jonathan Corbet, Andrew Lunn,
	Neal Cardwell, David Ahern, Michael S. Tsirkin, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja

On 2/4/25 7:06 PM, Stanislav Fomichev wrote:
> On 02/04, Mina Almasry wrote:
>> On Tue, Feb 4, 2025 at 4:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>>
>>> On 2/3/25 11:39 PM, Mina Almasry wrote:
>>>> The TX path had been dropped from the Device Memory TCP patch series
>>>> post RFCv1 [1], to make that series slightly easier to review. This
>>>> series rebases the implementation of the TX path on top of the
>>>> net_iov/netmem framework agreed upon and merged. The motivation for
>>>> the feature is thoroughly described in the docs & cover letter of the
>>>> original proposal, so I don't repeat the lengthy descriptions here, but
>>>> they are available in [1].
>>>>
>>>> Sending this series as RFC as the winder closure is immenient. I plan on
>>>> reposting as non-RFC once the tree re-opens, addressing any feedback
>>>> I receive in the meantime.
>>>
>>> I guess you should drop this paragraph.
>>>
>>>> Full outline on usage of the TX path is detailed in the documentation
>>>> added in the first patch.
>>>>
>>>> Test example is available via the kselftest included in the series as well.
>>>>
>>>> The series is relatively small, as the TX path for this feature largely
>>>> piggybacks on the existing MSG_ZEROCOPY implementation.
>>>
>>> It looks like no additional device level support is required. That is
>>> IMHO so good up to suspicious level :)
>>>
>>
>> It is correct no additional device level support is required. I don't
>> have any local changes to my driver to make this work. I think Stan
>> on-list was able to run the TX path (he commented on fixes to the test
>> but didn't say it doesn't work :D) and one other person was able to
>> run it offlist.
> 
> For BRCM I had shared this: https://lore.kernel.org/netdev/ZxAfWHk3aRWl-F31@mini-arch/
> I have similar internal patch for mlx5 (will share after RX part gets
> in). I agree that it seems like gve_unmap_packet needs some work to be more
> careful to not unmap NIOVs (if you were testing against gve).

What happens if a user tries to use devmem TX on a device that doesn't
really support it? Silent data corruption?

Don't we need some way for the device to opt-in (or opt-out) and avoid
such issues?

/P


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 0/6] Device memory TCP TX
  2025-02-04 18:06     ` Stanislav Fomichev
  2025-02-04 18:32       ` Paolo Abeni
@ 2025-02-04 18:38       ` Mina Almasry
  2025-02-04 19:43         ` Stanislav Fomichev
  1 sibling, 1 reply; 39+ messages in thread
From: Mina Almasry @ 2025-02-04 18:38 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Paolo Abeni, netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Simon Horman, Jonathan Corbet, Andrew Lunn,
	Neal Cardwell, David Ahern, Michael S. Tsirkin, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja

On Tue, Feb 4, 2025 at 10:06 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
>
> On 02/04, Mina Almasry wrote:
> > On Tue, Feb 4, 2025 at 4:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > >
> > > On 2/3/25 11:39 PM, Mina Almasry wrote:
> > > > The TX path had been dropped from the Device Memory TCP patch series
> > > > post RFCv1 [1], to make that series slightly easier to review. This
> > > > series rebases the implementation of the TX path on top of the
> > > > net_iov/netmem framework agreed upon and merged. The motivation for
> > > > the feature is thoroughly described in the docs & cover letter of the
> > > > original proposal, so I don't repeat the lengthy descriptions here, but
> > > > they are available in [1].
> > > >
> > > > Sending this series as RFC as the winder closure is immenient. I plan on
> > > > reposting as non-RFC once the tree re-opens, addressing any feedback
> > > > I receive in the meantime.
> > >
> > > I guess you should drop this paragraph.
> > >
> > > > Full outline on usage of the TX path is detailed in the documentation
> > > > added in the first patch.
> > > >
> > > > Test example is available via the kselftest included in the series as well.
> > > >
> > > > The series is relatively small, as the TX path for this feature largely
> > > > piggybacks on the existing MSG_ZEROCOPY implementation.
> > >
> > > It looks like no additional device level support is required. That is
> > > IMHO so good up to suspicious level :)
> > >
> >
> > It is correct no additional device level support is required. I don't
> > have any local changes to my driver to make this work. I think Stan
> > on-list was able to run the TX path (he commented on fixes to the test
> > but didn't say it doesn't work :D) and one other person was able to
> > run it offlist.
>
> For BRCM I had shared this: https://lore.kernel.org/netdev/ZxAfWHk3aRWl-F31@mini-arch/
> I have similar internal patch for mlx5 (will share after RX part gets
> in). I agree that it seems like gve_unmap_packet needs some work to be more
> careful to not unmap NIOVs (if you were testing against gve).

Hmm. I think you're right. We ran into a similar issue with the RX
path. The RX path worked 'fine' on initial merge, but it was passing
dmabuf dma-addrs to the dma-mapping API, which Jason later called out
as unsafe. The dma-mapping API calls with dmabuf dma-addrs boil down
to no-ops in a lot of setups, I think, which is why I'm not running
into any issues in testing. But upon closer look, I think yes: we need
to make sure the driver doesn't end up passing these niov dma-addrs to
functions like dma_unmap_*() and dma_sync_*().

Stan, do you run into issues (crashes/warnings/bugs) in your setup
when the driver tries to unmap niovs? Or did you implement these
changes purely for safety?

Let me take a deeper look here and suggest something for the next
version. I think we may indeed need the driver to declare that it can
handle niovs in the TX path correctly (i.e. not accidentally pass niov
dma-addrs to the dma-mapping API).
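
Roughly what I have in mind on the driver side (sketch only; the
wrapper function is hypothetical, and the check assumes the net_iov
frag helpers from the netmem work):

	static void drv_unmap_tx_frag(struct device *dev,
				      const skb_frag_t *frag,
				      dma_addr_t addr, unsigned int len)
	{
		/* net_iov frags carry dma-buf addresses that were mapped
		 * at bind time; they must never be fed back into the
		 * dma-mapping API.
		 */
		if (skb_frag_is_net_iov(frag))
			return;

		dma_unmap_page(dev, addr, len, DMA_TO_DEVICE);
	}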

--
Thanks,
Mina

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 0/6] Device memory TCP TX
  2025-02-04 18:32       ` Paolo Abeni
@ 2025-02-04 18:47         ` Mina Almasry
  2025-02-04 19:41           ` Stanislav Fomichev
  0 siblings, 1 reply; 39+ messages in thread
From: Mina Almasry @ 2025-02-04 18:47 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Stanislav Fomichev, netdev, linux-kernel, linux-doc, kvm,
	virtualization, linux-kselftest, Donald Hunter, Jakub Kicinski,
	David S. Miller, Eric Dumazet, Simon Horman, Jonathan Corbet,
	Andrew Lunn, Neal Cardwell, David Ahern, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja

On Tue, Feb 4, 2025 at 10:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 2/4/25 7:06 PM, Stanislav Fomichev wrote:
> > On 02/04, Mina Almasry wrote:
> >> On Tue, Feb 4, 2025 at 4:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >>>
> >>> On 2/3/25 11:39 PM, Mina Almasry wrote:
> >>>> The TX path had been dropped from the Device Memory TCP patch series
> >>>> post RFCv1 [1], to make that series slightly easier to review. This
> >>>> series rebases the implementation of the TX path on top of the
> >>>> net_iov/netmem framework agreed upon and merged. The motivation for
> >>>> the feature is thoroughly described in the docs & cover letter of the
> >>>> original proposal, so I don't repeat the lengthy descriptions here, but
> >>>> they are available in [1].
> >>>>
> >>>> Sending this series as RFC as the winder closure is immenient. I plan on
> >>>> reposting as non-RFC once the tree re-opens, addressing any feedback
> >>>> I receive in the meantime.
> >>>
> >>> I guess you should drop this paragraph.
> >>>
> >>>> Full outline on usage of the TX path is detailed in the documentation
> >>>> added in the first patch.
> >>>>
> >>>> Test example is available via the kselftest included in the series as well.
> >>>>
> >>>> The series is relatively small, as the TX path for this feature largely
> >>>> piggybacks on the existing MSG_ZEROCOPY implementation.
> >>>
> >>> It looks like no additional device level support is required. That is
> >>> IMHO so good up to suspicious level :)
> >>>
> >>
> >> It is correct no additional device level support is required. I don't
> >> have any local changes to my driver to make this work. I think Stan
> >> on-list was able to run the TX path (he commented on fixes to the test
> >> but didn't say it doesn't work :D) and one other person was able to
> >> run it offlist.
> >
> > For BRCM I had shared this: https://lore.kernel.org/netdev/ZxAfWHk3aRWl-F31@mini-arch/
> > I have similar internal patch for mlx5 (will share after RX part gets
> > in). I agree that it seems like gve_unmap_packet needs some work to be more
> > careful to not unmap NIOVs (if you were testing against gve).
>
> What happens if a user tries to use devmem TX on a device that doesn't
> really support it? Silent data corruption?
>

So the tx dma-buf binding netlink API will bind the dma-buf to the
netdevice. If that fails, the uapi will return failure and devmem tx
will not be enabled.

If the dma-binding succeeds, then the NIC can indeed DMA to and from
the dma-addrs of the dma-buf. The TX path will DMA from those
dma-addrs just fine, and it need not be aware that they point at
device memory rather than host memory.

The only issue Stan's patches point to is that the driver will likely
be passing these dma-buf addresses into dma-mapping APIs like the
dma_unmap_*() and dma_sync_*() functions. Those, AFAIU, will be
no-ops with dma-buf addresses in most setups, but it's not 100% safe
to pass dma-buf addresses to these dma-mapping APIs, so we should
avoid such calls entirely.

> Don't we need some way for the device to opt-in (or opt-out) and avoid
> such issues?
>

Yeah, I think the driver likely needs to declare support (i.e. that
it's not using the dma-mapping API with dma-buf addresses).

--
Thanks,
Mina

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 0/6] Device memory TCP TX
  2025-02-04 18:47         ` Mina Almasry
@ 2025-02-04 19:41           ` Stanislav Fomichev
  2025-02-05  2:06             ` Jakub Kicinski
  0 siblings, 1 reply; 39+ messages in thread
From: Stanislav Fomichev @ 2025-02-04 19:41 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Paolo Abeni, netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Simon Horman, Jonathan Corbet, Andrew Lunn,
	Neal Cardwell, David Ahern, Michael S. Tsirkin, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja

On 02/04, Mina Almasry wrote:
> On Tue, Feb 4, 2025 at 10:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On 2/4/25 7:06 PM, Stanislav Fomichev wrote:
> > > On 02/04, Mina Almasry wrote:
> > >> On Tue, Feb 4, 2025 at 4:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > >>>
> > >>> On 2/3/25 11:39 PM, Mina Almasry wrote:
> > >>>> The TX path had been dropped from the Device Memory TCP patch series
> > >>>> post RFCv1 [1], to make that series slightly easier to review. This
> > >>>> series rebases the implementation of the TX path on top of the
> > >>>> net_iov/netmem framework agreed upon and merged. The motivation for
> > >>>> the feature is thoroughly described in the docs & cover letter of the
> > >>>> original proposal, so I don't repeat the lengthy descriptions here, but
> > >>>> they are available in [1].
> > >>>>
> > >>>> Sending this series as RFC as the winder closure is immenient. I plan on
> > >>>> reposting as non-RFC once the tree re-opens, addressing any feedback
> > >>>> I receive in the meantime.
> > >>>
> > >>> I guess you should drop this paragraph.
> > >>>
> > >>>> Full outline on usage of the TX path is detailed in the documentation
> > >>>> added in the first patch.
> > >>>>
> > >>>> Test example is available via the kselftest included in the series as well.
> > >>>>
> > >>>> The series is relatively small, as the TX path for this feature largely
> > >>>> piggybacks on the existing MSG_ZEROCOPY implementation.
> > >>>
> > >>> It looks like no additional device level support is required. That is
> > >>> IMHO so good up to suspicious level :)
> > >>>
> > >>
> > >> It is correct no additional device level support is required. I don't
> > >> have any local changes to my driver to make this work. I think Stan
> > >> on-list was able to run the TX path (he commented on fixes to the test
> > >> but didn't say it doesn't work :D) and one other person was able to
> > >> run it offlist.
> > >
> > > For BRCM I had shared this: https://lore.kernel.org/netdev/ZxAfWHk3aRWl-F31@mini-arch/
> > > I have similar internal patch for mlx5 (will share after RX part gets
> > > in). I agree that it seems like gve_unmap_packet needs some work to be more
> > > careful to not unmap NIOVs (if you were testing against gve).
> >
> > What happens if a user tries to use devmem TX on a device that doesn't
> > really support it? Silent data corruption?
> >
> 
> So the tx dma-buf binding netlink API will bind the dma-buf to the
> netdevice. If that fails, the uapi will return failure and devmem tx
> will not be enabled.
> 
> If the dma-binding succeeds, then the NIC can indeed DMA to and from
> the dma-addrs of the dma-buf. The TX path will DMA from those
> dma-addrs just fine, and it need not be aware that they point at
> device memory rather than host memory.
> 
> The only issue Stan's patches point to is that the driver will likely
> be passing these dma-buf addresses into dma-mapping APIs like the
> dma_unmap_*() and dma_sync_*() functions. Those, AFAIU, will be
> no-ops with dma-buf addresses in most setups, but it's not 100% safe
> to pass dma-buf addresses to these dma-mapping APIs, so we should
> avoid such calls entirely.
> 
> > Don't we need some way for the device to opt-in (or opt-out) and avoid
> > such issues?
> >
> 
> Yeah, I think the driver likely needs to declare support (i.e. that
> it's not using the dma-mapping API with dma-buf addresses).

netif_skb_features/ndo_features_check seems like a good fit?
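
Rough sketch of what I mean (completely untested; the policy of
dropping NETIF_F_SG is illustrative):

	static netdev_features_t drv_features_check(struct sk_buff *skb,
						    struct net_device *dev,
						    netdev_features_t features)
	{
		/* A driver that hasn't audited its dma_unmap_*()/dma_sync_*()
		 * calls for net_iov frags refuses SG for unreadable skbs, so
		 * the core never hands it dma-buf backed frags.
		 */
		if (!skb_frags_readable(skb))
			features &= ~NETIF_F_SG;

		return features;
	}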

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 0/6] Device memory TCP TX
  2025-02-04 18:38       ` Mina Almasry
@ 2025-02-04 19:43         ` Stanislav Fomichev
  2025-02-05  0:47           ` Samiullah Khawaja
  0 siblings, 1 reply; 39+ messages in thread
From: Stanislav Fomichev @ 2025-02-04 19:43 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Paolo Abeni, netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Simon Horman, Jonathan Corbet, Andrew Lunn,
	Neal Cardwell, David Ahern, Michael S. Tsirkin, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja

On 02/04, Mina Almasry wrote:
> On Tue, Feb 4, 2025 at 10:06 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> >
> > On 02/04, Mina Almasry wrote:
> > > On Tue, Feb 4, 2025 at 4:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > >
> > > > On 2/3/25 11:39 PM, Mina Almasry wrote:
> > > > > The TX path had been dropped from the Device Memory TCP patch series
> > > > > post RFCv1 [1], to make that series slightly easier to review. This
> > > > > series rebases the implementation of the TX path on top of the
> > > > > net_iov/netmem framework agreed upon and merged. The motivation for
> > > > > the feature is thoroughly described in the docs & cover letter of the
> > > > > original proposal, so I don't repeat the lengthy descriptions here, but
> > > > > they are available in [1].
> > > > >
> > > > > Sending this series as RFC as the winder closure is immenient. I plan on
> > > > > reposting as non-RFC once the tree re-opens, addressing any feedback
> > > > > I receive in the meantime.
> > > >
> > > > I guess you should drop this paragraph.
> > > >
> > > > > Full outline on usage of the TX path is detailed in the documentation
> > > > > added in the first patch.
> > > > >
> > > > > Test example is available via the kselftest included in the series as well.
> > > > >
> > > > > The series is relatively small, as the TX path for this feature largely
> > > > > piggybacks on the existing MSG_ZEROCOPY implementation.
> > > >
> > > > It looks like no additional device level support is required. That is
> > > > IMHO so good up to suspicious level :)
> > > >
> > >
> > > It is correct no additional device level support is required. I don't
> > > have any local changes to my driver to make this work. I think Stan
> > > on-list was able to run the TX path (he commented on fixes to the test
> > > but didn't say it doesn't work :D) and one other person was able to
> > > run it offlist.
> >
> > For BRCM I had shared this: https://lore.kernel.org/netdev/ZxAfWHk3aRWl-F31@mini-arch/
> > I have similar internal patch for mlx5 (will share after RX part gets
> > in). I agree that it seems like gve_unmap_packet needs some work to be more
> > careful to not unmap NIOVs (if you were testing against gve).
> 
> Hmm. I think you're right. We ran into a similar issue with the RX
> path. The RX path worked 'fine' on initial merge, but it was passing
> dmabuf dma-addrs to the dma-mapping API, which Jason later called out
> as unsafe. The dma-mapping API calls with dmabuf dma-addrs boil down
> to no-ops in a lot of setups, I think, which is why I'm not running
> into any issues in testing. But upon closer look, I think yes: we need
> to make sure the driver doesn't end up passing these niov dma-addrs to
> functions like dma_unmap_*() and dma_sync_*().
> 
> Stan, do you run into issues (crashes/warnings/bugs) in your setup
> when the driver tries to unmap niovs? Or did you implement these
> changes purely for safety?

I don't run into any issues with those unmaps in place, but I'm running x86
with iommu bypass (and as you mention in the other thread, those
calls are no-ops in this case).

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 0/6] Device memory TCP TX
  2025-02-04 19:43         ` Stanislav Fomichev
@ 2025-02-05  0:47           ` Samiullah Khawaja
  2025-02-05  1:05             ` Stanislav Fomichev
  0 siblings, 1 reply; 39+ messages in thread
From: Samiullah Khawaja @ 2025-02-05  0:47 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Mina Almasry, Paolo Abeni, netdev, linux-kernel, linux-doc, kvm,
	virtualization, linux-kselftest, Donald Hunter, Jakub Kicinski,
	David S. Miller, Eric Dumazet, Simon Horman, Jonathan Corbet,
	Andrew Lunn, Neal Cardwell, David Ahern, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela

On Tue, Feb 4, 2025 at 11:43 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
>
> On 02/04, Mina Almasry wrote:
> > On Tue, Feb 4, 2025 at 10:06 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> > >
> > > On 02/04, Mina Almasry wrote:
> > > > On Tue, Feb 4, 2025 at 4:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > > >
> > > > > On 2/3/25 11:39 PM, Mina Almasry wrote:
> > > > > > The TX path had been dropped from the Device Memory TCP patch series
> > > > > > post RFCv1 [1], to make that series slightly easier to review. This
> > > > > > series rebases the implementation of the TX path on top of the
> > > > > > net_iov/netmem framework agreed upon and merged. The motivation for
> > > > > > the feature is thoroughly described in the docs & cover letter of the
> > > > > > original proposal, so I don't repeat the lengthy descriptions here, but
> > > > > > they are available in [1].
> > > > > >
> > > > > > Sending this series as RFC as the winder closure is immenient. I plan on
> > > > > > reposting as non-RFC once the tree re-opens, addressing any feedback
> > > > > > I receive in the meantime.
> > > > >
> > > > > I guess you should drop this paragraph.
> > > > >
> > > > > > Full outline on usage of the TX path is detailed in the documentation
> > > > > > added in the first patch.
> > > > > >
> > > > > > Test example is available via the kselftest included in the series as well.
> > > > > >
> > > > > > The series is relatively small, as the TX path for this feature largely
> > > > > > piggybacks on the existing MSG_ZEROCOPY implementation.
> > > > >
> > > > > It looks like no additional device level support is required. That is
> > > > > IMHO so good as to be suspicious :)
> > > > >
> > > >
> > > > It is correct no additional device level support is required. I don't
> > > > have any local changes to my driver to make this work. I think Stan
> > > > on-list was able to run the TX path (he commented on fixes to the test
> > > > but didn't say it doesn't work :D) and one other person was able to
> > > > run it offlist.
> > >
> > > For BRCM I had shared this: https://lore.kernel.org/netdev/ZxAfWHk3aRWl-F31@mini-arch/
> > > I have similar internal patch for mlx5 (will share after RX part gets
> > > in). I agree that it seems like gve_unmap_packet needs some work to be more
> > > careful to not unmap NIOVs (if you were testing against gve).
> >
> > Hmm. I think you're right. We ran into a similar issue with the RX
> > path. The RX path worked 'fine' on initial merge, but it was passing
> > dmabuf dma-addrs to the dma-mapping API which Jason later called out
> > to be unsafe. The dma-mapping API calls with dmabuf dma-addrs will
> > boil down to no-ops for a lot of setups, I think, which is why I'm not
> > running into any issues in testing, but upon closer look, I think yes,
> > we need to make sure the driver doesn't end up passing these niov
> > dma-addrs to functions like dma_unmap_*() and dma_sync_*().
> >
> > Stan, do you run into issues (crashes/warnings/bugs) in your setup
> > when the driver tries to unmap niovs? Or did you implement these
> > changes purely for safety?
>
> I don't run into any issues with those unmaps in place, but I'm running x86
> with iommu bypass (and as you mention in the other thread, those
> calls are no-ops in this case).
The dma_addr from a dma-buf should never enter the dma_* APIs. dma-buf
exporters have their own implementation of these ops, and they could be
no-ops for identity mappings or when the IOMMU is disabled (in a VM
with no IOMMU enabled, GPA=IOVA). So if we really want to map/unmap/sync
these addresses, the dma-buf APIs should be used to do that. Maybe some
glue with a memory provider is required for these net_iovs? I think
the safest option here is that mappings are never unmapped manually by
the driver until dma_buf_unmap_attachment() is called during unbinding.
But maybe that complicates things for io_uring?
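
For reference, a rough sketch of that binding-lifetime approach using
the dma-buf API. Error handling is omitted, and everything outside the
dma_buf_* calls is illustrative:

	struct dma_buf_attachment *attach;
	struct sg_table *sgt;

	/* at bind time: map once for the lifetime of the binding */
	attach = dma_buf_attach(dmabuf, dev);
	sgt = dma_buf_map_attachment(attach, DMA_TO_DEVICE);

	/* net_iov dma_addrs are derived from sgt; the driver never
	 * calls dma_unmap_*()/dma_sync_*() on them itself */

	/* at unbind time only */
	dma_buf_unmap_attachment(attach, sgt, DMA_TO_DEVICE);
	dma_buf_detach(dmabuf, attach);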

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 0/6] Device memory TCP TX
  2025-02-05  0:47           ` Samiullah Khawaja
@ 2025-02-05  1:05             ` Stanislav Fomichev
  0 siblings, 0 replies; 39+ messages in thread
From: Stanislav Fomichev @ 2025-02-05  1:05 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Mina Almasry, Paolo Abeni, netdev, linux-kernel, linux-doc, kvm,
	virtualization, linux-kselftest, Donald Hunter, Jakub Kicinski,
	David S. Miller, Eric Dumazet, Simon Horman, Jonathan Corbet,
	Andrew Lunn, Neal Cardwell, David Ahern, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela

On 02/04, Samiullah Khawaja wrote:
> On Tue, Feb 4, 2025 at 11:43 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> >
> > On 02/04, Mina Almasry wrote:
> > > On Tue, Feb 4, 2025 at 10:06 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> > > >
> > > > On 02/04, Mina Almasry wrote:
> > > > > On Tue, Feb 4, 2025 at 4:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > > > >
> > > > > > On 2/3/25 11:39 PM, Mina Almasry wrote:
> > > > > > > The TX path had been dropped from the Device Memory TCP patch series
> > > > > > > post RFCv1 [1], to make that series slightly easier to review. This
> > > > > > > series rebases the implementation of the TX path on top of the
> > > > > > > net_iov/netmem framework agreed upon and merged. The motivation for
> > > > > > > the feature is thoroughly described in the docs & cover letter of the
> > > > > > > original proposal, so I don't repeat the lengthy descriptions here, but
> > > > > > > they are available in [1].
> > > > > > >
> > > > > > > Sending this series as RFC as the winter closure is imminent. I plan on
> > > > > > > reposting as non-RFC once the tree re-opens, addressing any feedback
> > > > > > > I receive in the meantime.
> > > > > >
> > > > > > I guess you should drop this paragraph.
> > > > > >
> > > > > > > Full outline on usage of the TX path is detailed in the documentation
> > > > > > > added in the first patch.
> > > > > > >
> > > > > > > Test example is available via the kselftest included in the series as well.
> > > > > > >
> > > > > > > The series is relatively small, as the TX path for this feature largely
> > > > > > > piggybacks on the existing MSG_ZEROCOPY implementation.
> > > > > >
> > > > > > It looks like no additional device level support is required. That is
> > > > > > IMHO so good as to be suspicious :)
> > > > > >
> > > > >
> > > > > It is correct no additional device level support is required. I don't
> > > > > have any local changes to my driver to make this work. I think Stan
> > > > > on-list was able to run the TX path (he commented on fixes to the test
> > > > > but didn't say it doesn't work :D) and one other person was able to
> > > > > run it offlist.
> > > >
> > > > For BRCM I had shared this: https://lore.kernel.org/netdev/ZxAfWHk3aRWl-F31@mini-arch/
> > > > I have similar internal patch for mlx5 (will share after RX part gets
> > > > in). I agree that it seems like gve_unmap_packet needs some work to be more
> > > > careful to not unmap NIOVs (if you were testing against gve).
> > >
> > > Hmm. I think you're right. We ran into a similar issue with the RX
> > > path. The RX path worked 'fine' on initial merge, but it was passing
> > > dmabuf dma-addrs to the dma-mapping API which Jason later called out
> > > to be unsafe. The dma-mapping API calls with dmabuf dma-addrs will
> > > boil down to no-ops for a lot of setups, I think, which is why I'm not
> > > running into any issues in testing, but upon closer look, I think yes,
> > > we need to make sure the driver doesn't end up passing these niov
> > > dma-addrs to functions like dma_unmap_*() and dma_sync_*().
> > >
> > > Stan, do you run into issues (crashes/warnings/bugs) in your setup
> > > when the driver tries to unmap niovs? Or did you implement these
> > > changes purely for safety?
> >
> > I don't run into any issues with those unmaps in place, but I'm running x86
> > with iommu bypass (and as you mention in the other thread, those
> > calls are no-ops in this case).
> The dma_addr from a dma-buf should never enter the dma_* APIs. dma-buf
> exporters have their own implementation of these ops, and they could be
> no-ops for identity mappings or when the IOMMU is disabled (in a VM
> with no IOMMU enabled, GPA=IOVA). So if we really want to map/unmap/sync
> these addresses, the dma-buf APIs should be used to do that. Maybe some
> glue with a memory provider is required for these net_iovs? I think
> the safest option here is that mappings are never unmapped manually by
> the driver until dma_buf_unmap_attachment() is called during unbinding.
> But maybe that complicates things for io_uring?

Correct, we don't want to call dma_* APIs on NIOVs, but currently we
do (unmap on tx completion). I mentioned [0] in another thread: we need
something similar for gve (and eventually mlx5). skb_frag_dma_map hides
the mapping, but the unmapping unconditionally and explicitly calls the
dma_* APIs (in most drivers I've looked at).
 
0: https://lore.kernel.org/netdev/ZxAfWHk3aRWl-F31@mini-arch/

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 0/6] Device memory TCP TX
  2025-02-04 19:41           ` Stanislav Fomichev
@ 2025-02-05  2:06             ` Jakub Kicinski
  2025-02-05 19:53               ` Mina Almasry
  0 siblings, 1 reply; 39+ messages in thread
From: Jakub Kicinski @ 2025-02-05  2:06 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Mina Almasry, Paolo Abeni, netdev, linux-kernel, linux-doc, kvm,
	virtualization, linux-kselftest, Donald Hunter, David S. Miller,
	Eric Dumazet, Simon Horman, Jonathan Corbet, Andrew Lunn,
	Neal Cardwell, David Ahern, Michael S. Tsirkin, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja

On Tue, 4 Feb 2025 11:41:09 -0800 Stanislav Fomichev wrote:
> > > Don't we need some way for the device to opt-in (or opt-out) and avoid
> > > such issues?
> > >  
> > 
> > Yeah, I think likely the driver needs to declare support (i.e. it's
> > not using dma-mapping API with dma-buf addresses).  
> 
> netif_skb_features/ndo_features_check seems like a good fit?

validate_xmit_skb()
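
As a hedged sketch of the kind of check being suggested there (the
feature flag name is hypothetical, not part of this series):

	/* in validate_xmit_skb(): drop unreadable skbs on devices that
	 * have not declared devmem TX support */
	if (!skb_frags_readable(skb) &&
	    !(dev->features & NETIF_F_DEVMEM_TX))	/* hypothetical */
		goto err;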

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 0/6] Device memory TCP TX
  2025-02-03 22:39 [PATCH net-next v3 0/6] Device memory TCP TX Mina Almasry
                   ` (6 preceding siblings ...)
  2025-02-04 12:32 ` [PATCH net-next v3 0/6] Device memory TCP TX Paolo Abeni
@ 2025-02-05  2:08 ` Jakub Kicinski
  2025-02-05 19:52   ` Mina Almasry
  7 siblings, 1 reply; 39+ messages in thread
From: Jakub Kicinski @ 2025-02-05  2:08 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest, Donald Hunter, David S. Miller, Eric Dumazet,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Andrew Lunn,
	Neal Cardwell, David Ahern, Michael S. Tsirkin, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja

On Mon,  3 Feb 2025 22:39:10 +0000 Mina Almasry wrote:
> v3: https://patchwork.kernel.org/project/netdevbpf/list/?series=929401&state=*
> ===
> 
> RFC v2: https://patchwork.kernel.org/project/netdevbpf/list/?series=920056&state=*

nit: lore links are better

please stick to RFC until a driver implementation is ready and
included

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 5/6] net: devmem: Implement TX path
  2025-02-03 22:39 ` [PATCH net-next v3 5/6] net: devmem: Implement TX path Mina Almasry
  2025-02-04 12:15   ` Paolo Abeni
@ 2025-02-05 12:20   ` Pavel Begunkov
  2025-02-10 21:09     ` Mina Almasry
  2025-02-05 21:56   ` Willem de Bruijn
  2 siblings, 1 reply; 39+ messages in thread
From: Pavel Begunkov @ 2025-02-05 12:20 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-kernel, linux-doc, kvm,
	virtualization, linux-kselftest
  Cc: Donald Hunter, Jakub Kicinski, David S. Miller, Eric Dumazet,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Andrew Lunn,
	Neal Cardwell, David Ahern, Michael S. Tsirkin, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, dw, Jamal Hadi Salim,
	Victor Nogueira, Pedro Tammela, Samiullah Khawaja, Kaiyuan Zhang

On 2/3/25 22:39, Mina Almasry wrote:
...
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index bb2b751d274a..3ff8f568c382 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1711,9 +1711,12 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
...
>   int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
>   				struct iov_iter *from, size_t length);
> @@ -1721,12 +1724,14 @@ int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
>   static inline int skb_zerocopy_iter_dgram(struct sk_buff *skb,
>   					  struct msghdr *msg, int len)
>   {
> -	return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len);
> +	return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len,
> +				       NULL);

Instead of propagating it all the way down and carving a new path, why
not reuse the existing infra? You already hook into where ubuf is
allocated, you can stash the binding in there. And
zerocopy_fill_skb_from_devmem can implement ->sg_from_iter,
see __zerocopy_sg_from_iter().
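
One possible shape of that suggestion, as a sketch. It assumes the
binding is stashed in the ubuf at allocation time; the helper name and
the binding field are hypothetical, and the signature follows the
msg->sg_from_iter call quoted below:

	static int devmem_sg_from_iter(struct sk_buff *skb,
				       struct iov_iter *from, size_t length)
	{
		struct net_devmem_dmabuf_binding *binding =
			uarg_to_msgzc(skb_zcopy(skb))->binding; /* hypothetical */

		return zerocopy_fill_skb_from_devmem(skb, from, length,
						     binding);
	}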

...
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index f0693707aece..c989606ff58d 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -63,6 +63,8 @@
> +static int
> +zerocopy_fill_skb_from_devmem(struct sk_buff *skb, struct iov_iter *from,
> +			      int length,
> +			      struct net_devmem_dmabuf_binding *binding)
> +{
> +	int i = skb_shinfo(skb)->nr_frags;
> +	size_t virt_addr, size, off;
> +	struct net_iov *niov;
> +
> +	while (length && iov_iter_count(from)) {
> +		if (i == MAX_SKB_FRAGS)
> +			return -EMSGSIZE;
> +
> +		virt_addr = (size_t)iter_iov_addr(from);

Unless I missed it somewhere it needs to check that the iter
is iovec based.

> +		niov = net_devmem_get_niov_at(binding, virt_addr, &off, &size);
> +		if (!niov)
> +			return -EFAULT;
> +
> +		size = min_t(size_t, size, length);
> +		size = min_t(size_t, size, iter_iov_len(from));
> +
> +		get_netmem(net_iov_to_netmem(niov));
> +		skb_add_rx_frag_netmem(skb, i, net_iov_to_netmem(niov), off,
> +				       size, PAGE_SIZE);
> +		iov_iter_advance(from, size);
> +		length -= size;
> +		i++;
> +	}
> +
> +	return 0;
> +}
> +
>   int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
>   			    struct sk_buff *skb, struct iov_iter *from,
> -			    size_t length)
> +			    size_t length,
> +			    struct net_devmem_dmabuf_binding *binding)
>   {
>   	unsigned long orig_size = skb->truesize;
>   	unsigned long truesize;
> @@ -702,6 +737,8 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
>   
>   	if (msg && msg->msg_ubuf && msg->sg_from_iter)
>   		ret = msg->sg_from_iter(skb, from, length);

As mentioned above, you can implement this callback. The callback can
also be moved into ubuf_info ops if that's more convenient, I had
patches stashed for that.

> +	else if (unlikely(binding))
> +		ret = zerocopy_fill_skb_from_devmem(skb, from, length, binding);
>   	else
>   		ret = zerocopy_fill_skb_from_iter(skb, from, length);
>   
> @@ -735,7 +772,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
>   	if (skb_copy_datagram_from_iter(skb, 0, from, copy))
>   		return -EFAULT;

...

> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 0d704bda6c41..44198ae7e44c 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1051,6 +1051,7 @@ int tcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg, int *copied,
>   
>   int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>   {
> +	struct net_devmem_dmabuf_binding *binding = NULL;
>   	struct tcp_sock *tp = tcp_sk(sk);
>   	struct ubuf_info *uarg = NULL;
>   	struct sk_buff *skb;
> @@ -1063,6 +1064,15 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>   
>   	flags = msg->msg_flags;
>   
> +	sockcm_init(&sockc, sk);
> +	if (msg->msg_controllen) {
> +		err = sock_cmsg_send(sk, msg, &sockc);
> +		if (unlikely(err)) {
> +			err = -EINVAL;
> +			goto out_err;
> +		}
> +	}
> +
>   	if ((flags & MSG_ZEROCOPY) && size) {
>   		if (msg->msg_ubuf) {
>   			uarg = msg->msg_ubuf;
> @@ -1080,6 +1090,15 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>   			else
>   				uarg_to_msgzc(uarg)->zerocopy = 0;
>   		}
> +
> +		if (sockc.dmabuf_id != 0) {

It's better to be mutually exclusive with msg->msg_ubuf, the callers
have expectations about the buffers used. And you likely don't want
to mix it with normal MSG_ZEROCOPY in a single skb and/or ubuf_info,
you can force reallocation of ubuf_info here.
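
A minimal sketch of that exclusivity check, reusing the err/out_err
flow from the quoted hunk (placement and exact condition are
assumptions):

	if (sockc.dmabuf_id && msg->msg_ubuf) {
		err = -EINVAL;	/* don't mix devmem TX with msg_ubuf */
		goto out_err;
	}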

> +			binding = net_devmem_get_binding(sk, sockc.dmabuf_id);
> +			if (IS_ERR(binding)) {
> +				err = PTR_ERR(binding);
> +				binding = NULL;
> +				goto out_err;
> +			}
> +		}

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 0/6] Device memory TCP TX
  2025-02-05  2:08 ` Jakub Kicinski
@ 2025-02-05 19:52   ` Mina Almasry
  2025-02-06  1:45     ` Jakub Kicinski
  0 siblings, 1 reply; 39+ messages in thread
From: Mina Almasry @ 2025-02-05 19:52 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest, Donald Hunter, David S. Miller, Eric Dumazet,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Andrew Lunn,
	Neal Cardwell, David Ahern, Michael S. Tsirkin, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja

On Tue, Feb 4, 2025 at 6:08 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon,  3 Feb 2025 22:39:10 +0000 Mina Almasry wrote:
> > v3: https://patchwork.kernel.org/project/netdevbpf/list/?series=929401&state=*
> > ===
> >
> > RFC v2: https://patchwork.kernel.org/project/netdevbpf/list/?series=920056&state=*
>
> nit: lore links are better
>

Will do.

> please stick to RFC until a driver implementation is ready and
> included

For the RX path proposals I kept the driver implementation out of the
series and linked to it in the cover letter. Just to confirm, is that
OK for this series as well?

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 0/6] Device memory TCP TX
  2025-02-05  2:06             ` Jakub Kicinski
@ 2025-02-05 19:53               ` Mina Almasry
  0 siblings, 0 replies; 39+ messages in thread
From: Mina Almasry @ 2025-02-05 19:53 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stanislav Fomichev, Paolo Abeni, netdev, linux-kernel, linux-doc,
	kvm, virtualization, linux-kselftest, Donald Hunter,
	David S. Miller, Eric Dumazet, Simon Horman, Jonathan Corbet,
	Andrew Lunn, Neal Cardwell, David Ahern, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja

On Tue, Feb 4, 2025 at 6:06 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 4 Feb 2025 11:41:09 -0800 Stanislav Fomichev wrote:
> > > > Don't we need some way for the device to opt-in (or opt-out) and avoid
> > > > such issues?
> > > >
> > >
> > > Yeah, I think likely the driver needs to declare support (i.e. it's
> > > not using dma-mapping API with dma-buf addresses).
> >
> > netif_skb_features/ndo_features_check seems like a good fit?
>
> validate_xmit_skb()

I was thinking I'd check dev->features during the dmabuf tx binding
and reject the binding completely if the feature is not supported. I'm
guessing another check in validate_xmit_skb() is needed anyway for
cases such as forwarding and whatnot?
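
Roughly, as a sketch (the feature flag name is hypothetical):

	/* at dmabuf TX bind time */
	if (!(netdev->features & NETIF_F_DEVMEM_TX))	/* hypothetical */
		return -EOPNOTSUPP;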

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 5/6] net: devmem: Implement TX path
  2025-02-03 22:39 ` [PATCH net-next v3 5/6] net: devmem: Implement TX path Mina Almasry
  2025-02-04 12:15   ` Paolo Abeni
  2025-02-05 12:20   ` Pavel Begunkov
@ 2025-02-05 21:56   ` Willem de Bruijn
  2 siblings, 0 replies; 39+ messages in thread
From: Willem de Bruijn @ 2025-02-05 21:56 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-kernel, linux-doc, kvm,
	virtualization, linux-kselftest
  Cc: Mina Almasry, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Jonathan Corbet,
	Andrew Lunn, Neal Cardwell, David Ahern, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja, Kaiyuan Zhang

Mina Almasry wrote:
> Augment dmabuf binding to be able to handle TX. Additional to all the RX
> binding, we also create tx_vec needed for the TX path.
> 
> Provide API for sendmsg to be able to send dmabufs bound to this device:
> 
> - Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from.
> - MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf.
> 
> Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY
> implementation, while disabling instances where MSG_ZEROCOPY falls back
> to copying.
> 
> We additionally pipe the binding down to the new
> zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems
> instead of the traditional page netmems.
> 
> We also special case skb_frag_dma_map to return the dma-address of these
> dmabuf net_iovs instead of attempting to map pages.
> 
> Based on work by Stanislav Fomichev <sdf@fomichev.me>. A lot of the meat
> of the implementation came from devmem TCP RFC v1[1], which included the
> TX path, but Stan did all the rebasing on top of netmem/net_iov.
> 
> Cc: Stanislav Fomichev <sdf@fomichev.me>
> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> 
> 
> ---
> 
> v3:
> - Use kvmalloc_array instead of kcalloc (Stan).
> - Fix unreachable code warning (Simon).
> 
> v2:
> - Remove dmabuf_offset from the dmabuf cmsg.
> - Update zerocopy_fill_skb_from_devmem to interpret the
>   iov_base/iter_iov_addr as the offset into the dmabuf to send from
>   (Stan).
> - Remove the confusing binding->tx_iter which is not needed if we
>   interpret the iov_base/iter_iov_addr as offset into the dmabuf (Stan).
> - Remove check for binding->sgt and binding->sgt->nents in dmabuf
>   binding.
> - Simplify the calculation of binding->tx_vec.
> - Check in net_devmem_get_binding that the binding we're returning
>   has ifindex matching the sending socket (Willem).
> ---
>  include/linux/skbuff.h                  | 15 +++-
>  include/net/sock.h                      |  1 +
>  include/uapi/linux/uio.h                |  6 +-
>  net/core/datagram.c                     | 41 ++++++++++-
>  net/core/devmem.c                       | 97 +++++++++++++++++++++++--
>  net/core/devmem.h                       | 42 ++++++++++-
>  net/core/netdev-genl.c                  | 64 +++++++++++++++-
>  net/core/skbuff.c                       |  6 +-
>  net/core/sock.c                         |  8 ++
>  net/ipv4/tcp.c                          | 36 ++++++---
>  net/vmw_vsock/virtio_transport_common.c |  3 +-
>  11 files changed, 285 insertions(+), 34 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index bb2b751d274a..3ff8f568c382 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1711,9 +1711,12 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
>  
>  void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref);
>  
> +struct net_devmem_dmabuf_binding;
> +
>  int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
>  			    struct sk_buff *skb, struct iov_iter *from,
> -			    size_t length);
> +			    size_t length,
> +			    struct net_devmem_dmabuf_binding *binding);
>  
>  int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
>  				struct iov_iter *from, size_t length);
> @@ -1721,12 +1724,14 @@ int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
>  static inline int skb_zerocopy_iter_dgram(struct sk_buff *skb,
>  					  struct msghdr *msg, int len)
>  {
> -	return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len);
> +	return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len,
> +				       NULL);
>  }
>  
>  int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
>  			     struct msghdr *msg, int len,
> -			     struct ubuf_info *uarg);
> +			     struct ubuf_info *uarg,
> +			     struct net_devmem_dmabuf_binding *binding);
>  
>  /* Internal */
>  #define skb_shinfo(SKB)	((struct skb_shared_info *)(skb_end_pointer(SKB)))
> @@ -3697,6 +3702,10 @@ static inline dma_addr_t __skb_frag_dma_map(struct device *dev,
>  					    size_t offset, size_t size,
>  					    enum dma_data_direction dir)
>  {
> +	if (skb_frag_is_net_iov(frag)) {
> +		return netmem_to_net_iov(frag->netmem)->dma_addr + offset +
> +		       frag->offset;
> +	}
>  	return dma_map_page(dev, skb_frag_page(frag),
>  			    skb_frag_off(frag) + offset, size, dir);
>  }
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 8036b3b79cd8..09eb918525b6 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1822,6 +1822,7 @@ struct sockcm_cookie {
>  	u32 tsflags;
>  	u32 ts_opt_id;
>  	u32 priority;
> +	u32 dmabuf_id;
>  };
>  
>  static inline void sockcm_init(struct sockcm_cookie *sockc,
> diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h
> index 649739e0c404..866bd5dfe39f 100644
> --- a/include/uapi/linux/uio.h
> +++ b/include/uapi/linux/uio.h
> @@ -38,10 +38,14 @@ struct dmabuf_token {
>  	__u32 token_count;
>  };
>  
> +struct dmabuf_tx_cmsg {
> +	__u32 dmabuf_id;
> +};
> +

Why a wrapper struct instead of just __u32?
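
For context, a userspace usage sketch of the proposed cmsg as described
in the commit message above. The cmsg level and the variable names
(tx_dmabuf_id, fd, len) are assumptions; per this series, iov_base
carries the offset into the dmabuf:

	struct dmabuf_tx_cmsg dc = { .dmabuf_id = tx_dmabuf_id };
	struct iovec iov = { .iov_base = (void *)0, .iov_len = len };
	char ctrl[CMSG_SPACE(sizeof(dc))] = {};
	struct msghdr mh = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = ctrl, .msg_controllen = sizeof(ctrl),
	};
	struct cmsghdr *cm = CMSG_FIRSTHDR(&mh);

	cm->cmsg_level = SOL_SOCKET;	/* assumed */
	cm->cmsg_type = SCM_DEVMEM_DMABUF;
	cm->cmsg_len = CMSG_LEN(sizeof(dc));
	memcpy(CMSG_DATA(cm), &dc, sizeof(dc));
	sendmsg(fd, &mh, MSG_ZEROCOPY);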



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 0/6] Device memory TCP TX
  2025-02-05 19:52   ` Mina Almasry
@ 2025-02-06  1:45     ` Jakub Kicinski
  0 siblings, 0 replies; 39+ messages in thread
From: Jakub Kicinski @ 2025-02-06  1:45 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest, Donald Hunter, David S. Miller, Eric Dumazet,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Andrew Lunn,
	Neal Cardwell, David Ahern, Michael S. Tsirkin, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, asml.silence, dw,
	Jamal Hadi Salim, Victor Nogueira, Pedro Tammela,
	Samiullah Khawaja

On Wed, 5 Feb 2025 11:52:20 -0800 Mina Almasry wrote:
> > please stick to RFC until a driver implementation is ready and
> > included  
> 
> For the RX path proposals I kept the driver implementation out of the
> series and linked to it in the cover letter. Just to confirm, is that
> OK for this series as well?

No, the other series was large IIRC. This is just 6 patches, normal
rules. 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 5/6] net: devmem: Implement TX path
  2025-02-05 12:20   ` Pavel Begunkov
@ 2025-02-10 21:09     ` Mina Almasry
  2025-02-12 15:53       ` Pavel Begunkov
  0 siblings, 1 reply; 39+ messages in thread
From: Mina Almasry @ 2025-02-10 21:09 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Jonathan Corbet,
	Andrew Lunn, Neal Cardwell, David Ahern, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, dw, Jamal Hadi Salim,
	Victor Nogueira, Pedro Tammela, Samiullah Khawaja, Kaiyuan Zhang

On Wed, Feb 5, 2025 at 4:20 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 2/3/25 22:39, Mina Almasry wrote:
> ...
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index bb2b751d274a..3ff8f568c382 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -1711,9 +1711,12 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
> ...
> >   int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
> >                               struct iov_iter *from, size_t length);
> > @@ -1721,12 +1724,14 @@ int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
> >   static inline int skb_zerocopy_iter_dgram(struct sk_buff *skb,
> >                                         struct msghdr *msg, int len)
> >   {
> > -     return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len);
> > +     return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len,
> > +                                    NULL);
>
> Instead of propagating it all the way down and carving a new path, why
> not reuse the existing infra? You already hook into where ubuf is
> allocated, you can stash the binding in there. And

It looks like it's not possible to increase the size of ubuf_info at
all, otherwise the BUILD_BUG_ON in msg_zerocopy_alloc() fires.

It's asserting that sizeof(ubuf_info_msgzc) <= sizeof(skb->cb), and
I'm guessing increasing skb->cb size is not really the way to go.

What I may be able to do here is stash the binding somewhere in
ubuf_info_msgzc via union with fields we don't need for devmem, and/or
stashing the binding in ubuf_info_ops (very hacky). Neither approach
seems ideal, but the former may work and may be cleaner.

I'll take a deeper look here. I had looked before and concluded that
we're piggybacking devmem TX on MSG_ZEROCOPY path, because we need
almost all of the functionality there (no copying, send complete
notifications, etc), with one minor change in the skb filling. I had
concluded that if MSG_ZEROCOPY was never updated to use the existing
infra, then it's appropriate for devmem TX piggybacking on top of it
to follow that. I would not want to get into a refactor of
MSG_ZEROCOPY for no real reason.

But I'll take a deeper look here and see if I can make something
slightly cleaner work.

> zerocopy_fill_skb_from_devmem can implement ->sg_from_iter,
> see __zerocopy_sg_from_iter().
>
> ...
> > diff --git a/net/core/datagram.c b/net/core/datagram.c
> > index f0693707aece..c989606ff58d 100644
> > --- a/net/core/datagram.c
> > +++ b/net/core/datagram.c
> > @@ -63,6 +63,8 @@
> > +static int
> > +zerocopy_fill_skb_from_devmem(struct sk_buff *skb, struct iov_iter *from,
> > +                           int length,
> > +                           struct net_devmem_dmabuf_binding *binding)
> > +{
> > +     int i = skb_shinfo(skb)->nr_frags;
> > +     size_t virt_addr, size, off;
> > +     struct net_iov *niov;
> > +
> > +     while (length && iov_iter_count(from)) {
> > +             if (i == MAX_SKB_FRAGS)
> > +                     return -EMSGSIZE;
> > +
> > +             virt_addr = (size_t)iter_iov_addr(from);
>
> Unless I missed it somewhere it needs to check that the iter
> is iovec based.
>

How do we end up here with an iterator that is not iovec based? Is the
user able to trigger that somehow and I missed it?

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 5/6] net: devmem: Implement TX path
  2025-02-10 21:09     ` Mina Almasry
@ 2025-02-12 15:53       ` Pavel Begunkov
  2025-02-12 19:18         ` Mina Almasry
  0 siblings, 1 reply; 39+ messages in thread
From: Pavel Begunkov @ 2025-02-12 15:53 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Jonathan Corbet,
	Andrew Lunn, Neal Cardwell, David Ahern, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, dw, Jamal Hadi Salim,
	Victor Nogueira, Pedro Tammela, Samiullah Khawaja, Kaiyuan Zhang

On 2/10/25 21:09, Mina Almasry wrote:
> On Wed, Feb 5, 2025 at 4:20 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> On 2/3/25 22:39, Mina Almasry wrote:
>> ...
>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>> index bb2b751d274a..3ff8f568c382 100644
>>> --- a/include/linux/skbuff.h
>>> +++ b/include/linux/skbuff.h
>>> @@ -1711,9 +1711,12 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
>> ...
>>>    int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
>>>                                struct iov_iter *from, size_t length);
>>> @@ -1721,12 +1724,14 @@ int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
>>>    static inline int skb_zerocopy_iter_dgram(struct sk_buff *skb,
>>>                                          struct msghdr *msg, int len)
>>>    {
>>> -     return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len);
>>> +     return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len,
>>> +                                    NULL);
>>
>> Instead of propagating it all the way down and carving a new path, why
>> not reuse the existing infra? You already hook into where ubuf is
>> allocated, you can stash the binding in there. And
> 
> > It looks like it's not possible to increase the size of ubuf_info at
> all, otherwise the BUILD_BUG_ON in msg_zerocopy_alloc() fires.
> 
> It's asserting that sizeof(ubuf_info_msgzc) <= sizeof(skb->cb), and
> I'm guessing increasing skb->cb size is not really the way to go.
> 
> What I may be able to do here is stash the binding somewhere in
> ubuf_info_msgzc via union with fields we don't need for devmem, and/or

It doesn't need to account the memory against the user, and you
actually don't want that because dmabuf should take care of that.
So, it should be fine to reuse ->mmp.

It's also not a real sk_buff, so maybe maintainers wouldn't mind
reusing some more space out of it, if that would even be needed.
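
That is, roughly (a sketch; it assumes devmem TX never needs the
MSG_ZEROCOPY memory accounting, so ->mmp can be overlaid):

	struct ubuf_info_msgzc {
		struct ubuf_info ubuf;
		/* ... */
		union {
			struct mmpin mmp;	/* MSG_ZEROCOPY */
			struct net_devmem_dmabuf_binding *binding;
						/* devmem TX */
		};
	};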

> stashing the binding in ubuf_info_ops (very hacky). Neither approach
> seems ideal, but the former may work and may be cleaner.
> 
> I'll take a deeper look here. I had looked before and concluded that
> we're piggybacking devmem TX on MSG_ZEROCOPY path, because we need
> almost all of the functionality there (no copying, send complete
> notifications, etc), with one minor change in the skb filling. I had
> concluded that if MSG_ZEROCOPY was never updated to use the existing
> infra, then it's appropriate for devmem TX piggybacking on top of it

MSG_ZEROCOPY does use the common infra, i.e. passing ubuf_info,
but doesn't need ->sg_from_iter as zerocopy_fill_skb_from_iter()
and it's what was there first.

> to follow that. I would not want to get into a refactor of
> MSG_ZEROCOPY for no real reason.
> 
> But I'll take a deeper look here and see if I can make something
> slightly cleaner work.
> 
>> zerocopy_fill_skb_from_devmem can implement ->sg_from_iter,
>> see __zerocopy_sg_from_iter().
>>
>> ...
>>> diff --git a/net/core/datagram.c b/net/core/datagram.c
>>> index f0693707aece..c989606ff58d 100644
>>> --- a/net/core/datagram.c
>>> +++ b/net/core/datagram.c
>>> @@ -63,6 +63,8 @@
>>> +static int
>>> +zerocopy_fill_skb_from_devmem(struct sk_buff *skb, struct iov_iter *from,
>>> +                           int length,
>>> +                           struct net_devmem_dmabuf_binding *binding)
>>> +{
>>> +     int i = skb_shinfo(skb)->nr_frags;
>>> +     size_t virt_addr, size, off;
>>> +     struct net_iov *niov;
>>> +
>>> +     while (length && iov_iter_count(from)) {
>>> +             if (i == MAX_SKB_FRAGS)
>>> +                     return -EMSGSIZE;
>>> +
>>> +             virt_addr = (size_t)iter_iov_addr(from);
>>
>> Unless I missed it somewhere it needs to check that the iter
>> is iovec based.
>>
> 
> How do we end up here with an iterator that is not iovec based? Is the
> user able to trigger that somehow and I missed it?

Hopefully not, but for example io_uring passes bvecs for a number of
requests that can end up in tcp_sendmsg_locked(). Those probably
would work with the current patch, but if you check the order of some
of the checks, you'll see it will break. And once io_uring starts
passing bvecs for normal send[msg] requests, it'd definitely be
possible. And there are other in-kernel users apart from the send(2)
path, so who knows.

The api allows it and therefore should be checked, it's better to
avoid quite possible latent bugs.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 5/6] net: devmem: Implement TX path
  2025-02-12 15:53       ` Pavel Begunkov
@ 2025-02-12 19:18         ` Mina Almasry
  2025-02-13 13:18           ` Pavel Begunkov
  0 siblings, 1 reply; 39+ messages in thread
From: Mina Almasry @ 2025-02-12 19:18 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Jonathan Corbet,
	Andrew Lunn, Neal Cardwell, David Ahern, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, dw, Jamal Hadi Salim,
	Victor Nogueira, Pedro Tammela, Samiullah Khawaja, Kaiyuan Zhang

On Wed, Feb 12, 2025 at 7:52 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 2/10/25 21:09, Mina Almasry wrote:
> > On Wed, Feb 5, 2025 at 4:20 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>
> >> On 2/3/25 22:39, Mina Almasry wrote:
> >> ...
> >>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> >>> index bb2b751d274a..3ff8f568c382 100644
> >>> --- a/include/linux/skbuff.h
> >>> +++ b/include/linux/skbuff.h
> >>> @@ -1711,9 +1711,12 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
> >> ...
> >>>    int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
> >>>                                struct iov_iter *from, size_t length);
> >>> @@ -1721,12 +1724,14 @@ int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
> >>>    static inline int skb_zerocopy_iter_dgram(struct sk_buff *skb,
> >>>                                          struct msghdr *msg, int len)
> >>>    {
> >>> -     return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len);
> >>> +     return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len,
> >>> +                                    NULL);
> >>
> >> Instead of propagating it all the way down and carving a new path, why
> >> not reuse the existing infra? You already hook into where ubuf is
> >> allocated, you can stash the binding in there. And
> >
> > > It looks like it's not possible to increase the size of ubuf_info at
> > all, otherwise the BUILD_BUG_ON in msg_zerocopy_alloc() fires.
> >
> > It's asserting that sizeof(ubuf_info_msgzc) <= sizeof(skb->cb), and
> > I'm guessing increasing skb->cb size is not really the way to go.
> >
> > What I may be able to do here is stash the binding somewhere in
> > ubuf_info_msgzc via union with fields we don't need for devmem, and/or
>
> It doesn't need to account the memory against the user, and you
> actually don't want that because dmabuf should take care of that.
> So, it should be fine to reuse ->mmp.
>
> It's also not a real sk_buff, so maybe maintainers wouldn't mind
> reusing some more space out of it, if that would even be needed.
>

netmem skbs are real sk_buffs, with the modification that frags are not
readable, only in the case that the netmem is unreadable. I would not
approve of considering netmem/devmem skbs "not real skbs", and start
messing with the semantics of skb fields for devmem skbs, and having
to start adding skb_is_devmem() checks through all code in the skb
handlers that touch the fields being overwritten in the devmem case.
No, I don't think we can re-use random fields in the skb for devmem.

> > stashing the binding in ubuf_info_ops (very hacky). Neither approach
> > seems ideal, but the former may work and may be cleaner.
> >
> > I'll take a deeper look here. I had looked before and concluded that
> > we're piggybacking devmem TX on MSG_ZEROCOPY path, because we need
> > almost all of the functionality there (no copying, send complete
> > notifications, etc), with one minor change in the skb filling. I had
> > concluded that if MSG_ZEROCOPY was never updated to use the existing
> > infra, then it's appropriate for devmem TX piggybacking on top of it
>
> MSG_ZEROCOPY does use the common infra, i.e. passing ubuf_info,
> but doesn't need ->sg_from_iter as zerocopy_fill_skb_from_iter()
> and it's what was there first.
>

But MSG_ZEROCOPY doesn't set msg->msg_ubuf. And not setting
msg->msg_ubuf fails to trigger msg->sg_from_iter altogether.

And also currently sg_from_iter isn't set up to take in a ubuf_info.
We'd need that if we stash the binding in the ubuf_info.

All in all I think I wanna prototype an msg->sg_from_iter approach and
make a judgement call on whether it's cleaner than just passing the
binding through a couple of helpers just as I'm doing here. My feeling
is that the implementation in this patch may be cleaner than
refactoring the entire msg_ubuf/sg_from_iter flows so we can sort of
use it for MSG_ZEROCOPY with devmem when it currently doesn't use it.

> > to follow that. I would not want to get into a refactor of
> > MSG_ZEROCOPY for no real reason.
> >
> > But I'll take a deeper look here and see if I can make something
> > slightly cleaner work.
> >
> >> zerocopy_fill_skb_from_devmem can implement ->sg_from_iter,
> >> see __zerocopy_sg_from_iter().
> >>
> >> ...
> >>> diff --git a/net/core/datagram.c b/net/core/datagram.c
> >>> index f0693707aece..c989606ff58d 100644
> >>> --- a/net/core/datagram.c
> >>> +++ b/net/core/datagram.c
> >>> @@ -63,6 +63,8 @@
> >>> +static int
> >>> +zerocopy_fill_skb_from_devmem(struct sk_buff *skb, struct iov_iter *from,
> >>> +                           int length,
> >>> +                           struct net_devmem_dmabuf_binding *binding)
> >>> +{
> >>> +     int i = skb_shinfo(skb)->nr_frags;
> >>> +     size_t virt_addr, size, off;
> >>> +     struct net_iov *niov;
> >>> +
> >>> +     while (length && iov_iter_count(from)) {
> >>> +             if (i == MAX_SKB_FRAGS)
> >>> +                     return -EMSGSIZE;
> >>> +
> >>> +             virt_addr = (size_t)iter_iov_addr(from);
> >>
> >> Unless I missed it somewhere it needs to check that the iter
> >> is iovec based.
> >>
> >
> > How do we end up here with an iterator that is not iovec based? Is the
> > user able to trigger that somehow and I missed it?
>
> Hopefully not, but for example io_uring passes bvecs for a number of
> requests that can end up in tcp_sendmsg_locked(). Those probably
> would work with the current patch, but if you check the order of some
> of the checks, you'll see it will break. And once io_uring starts
> passing bvecs for normal send[msg] requests, it'd definitely be
> possible. And there are other in-kernel users apart from the send(2)
> path, so who knows.
>
> The api allows it and therefore should be checked, it's better to
> avoid quite possible latent bugs.
>

Sounds good.

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 5/6] net: devmem: Implement TX path
  2025-02-12 19:18         ` Mina Almasry
@ 2025-02-13 13:18           ` Pavel Begunkov
  2025-02-17 23:26             ` Mina Almasry
  0 siblings, 1 reply; 39+ messages in thread
From: Pavel Begunkov @ 2025-02-13 13:18 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Jonathan Corbet,
	Andrew Lunn, Neal Cardwell, David Ahern, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, dw, Jamal Hadi Salim,
	Victor Nogueira, Pedro Tammela, Samiullah Khawaja, Kaiyuan Zhang

On 2/12/25 19:18, Mina Almasry wrote:
> On Wed, Feb 12, 2025 at 7:52 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> On 2/10/25 21:09, Mina Almasry wrote:
>>> On Wed, Feb 5, 2025 at 4:20 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>>>
>>>> On 2/3/25 22:39, Mina Almasry wrote:
>>>> ...
>>>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>>>> index bb2b751d274a..3ff8f568c382 100644
>>>>> --- a/include/linux/skbuff.h
>>>>> +++ b/include/linux/skbuff.h
>>>>> @@ -1711,9 +1711,12 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
>>>> ...
>>>>>     int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
>>>>>                                 struct iov_iter *from, size_t length);
>>>>> @@ -1721,12 +1724,14 @@ int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
>>>>>     static inline int skb_zerocopy_iter_dgram(struct sk_buff *skb,
>>>>>                                           struct msghdr *msg, int len)
>>>>>     {
>>>>> -     return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len);
>>>>> +     return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len,
>>>>> +                                    NULL);
>>>>
>>>> Instead of propagating it all the way down and carving a new path, why
>>>> not reuse the existing infra? You already hook into where ubuf is
>>>> allocated, you can stash the binding in there. And
>>>
>>> It looks like it's not possible to increase the size of ubuf_info at
>>> all, otherwise the BUILD_BUG_ON in msg_zerocopy_alloc() fires.
>>>
>>> It's asserting that sizeof(ubuf_info_msgzc) <= sizeof(skb->cb), and
>>> I'm guessing increasing skb->cb size is not really the way to go.
>>>
>>> What I may be able to do here is stash the binding somewhere in
>>> ubuf_info_msgzc via union with fields we don't need for devmem, and/or
>>
>> It doesn't need to account the memory against the user, and you
>> actually don't want that because dmabuf should take care of that.
>> So, it should be fine to reuse ->mmp.
>>
>> It's also not a real sk_buff, so maybe maintainers wouldn't mind
>> reusing some more space out of it, if that would even be needed.
>>
> 
> netmem skbs are real sk_buffs, with the modification that frags are not

We were discussing ubuf_info allocation, take a look at
msg_zerocopy_alloc(), it has nothing to do with netmems and all that.

> readable, only in the case that the netmem is unreadable. I would not
> approve of considering netmem/devmem skbs "not real skbs", and start
> messing with the semantics of skb fields for devmem skbs, and having
> to start adding skb_is_devmem() checks through all code in the skb
> handlers that touch the fields being overwritten in the devmem case.
> No, I don't think we can re-use random fields in the skb for devmem.
> 
>>> stashing the binding in ubuf_info_ops (very hacky). Neither approach
>>> seems ideal, but the former may work and may be cleaner.
>>>
>>> I'll take a deeper look here. I had looked before and concluded that
>>> we're piggybacking devmem TX on MSG_ZEROCOPY path, because we need
>>> almost all of the functionality there (no copying, send complete
>>> notifications, etc), with one minor change in the skb filling. I had
>>> concluded that if MSG_ZEROCOPY was never updated to use the existing
>>> infra, then it's appropriate for devmem TX piggybacking on top of it
>>
>> MSG_ZEROCOPY does use the common infra, i.e. passing ubuf_info,
>> but doesn't need ->sg_from_iter as zerocopy_fill_skb_from_iter()
>> and it's what was there first.
>>
> 
> But MSG_ZEROCOPY doesn't set msg->msg_ubuf. And not setting
> msg->msg_ubuf fails to trigger msg->sg_from_iter altogether.
> 
> And also currently sg_from_iter isn't set up to take in a ubuf_info.
> We'd need that if we stash the binding in the ubuf_info.

https://github.com/isilence/linux.git sg-iter-ops

I have old patches for all of that, they even rebased cleanly. That
should do it for you, and I need to send them regardless of devmem.


> All in all I think I wanna prototype an msg->sg_from_iter approach and
> make a judgement call on whether it's cleaner than just passing the
> binding through a couple of helpers just as I'm doing here. My feeling
> is that the implementation in this patch may be cleaner than
> refactoring the entire msg_ubuf/sg_from_iter flows so we can sort of
> use it for MSG_ZEROCOPY with devmem when it currently doesn't use it.
> 
>>> to follow that. I would not want to get into a refactor of
>>> MSG_ZEROCOPY for no real reason.
>>>
>>> But I'll take a deeper look here and see if I can make something
>>> slightly cleaner work.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 5/6] net: devmem: Implement TX path
  2025-02-13 13:18           ` Pavel Begunkov
@ 2025-02-17 23:26             ` Mina Almasry
  2025-02-19 22:41               ` Pavel Begunkov
  0 siblings, 1 reply; 39+ messages in thread
From: Mina Almasry @ 2025-02-17 23:26 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Jonathan Corbet,
	Andrew Lunn, Neal Cardwell, David Ahern, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, dw, Jamal Hadi Salim,
	Victor Nogueira, Pedro Tammela, Samiullah Khawaja, Kaiyuan Zhang

On Thu, Feb 13, 2025 at 5:17 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 2/12/25 19:18, Mina Almasry wrote:
> > On Wed, Feb 12, 2025 at 7:52 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>
> >> On 2/10/25 21:09, Mina Almasry wrote:
> >>> On Wed, Feb 5, 2025 at 4:20 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>>>
> >>>> On 2/3/25 22:39, Mina Almasry wrote:
> >>>> ...
> >>>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> >>>>> index bb2b751d274a..3ff8f568c382 100644
> >>>>> --- a/include/linux/skbuff.h
> >>>>> +++ b/include/linux/skbuff.h
> >>>>> @@ -1711,9 +1711,12 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
> >>>> ...
> >>>>>     int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
> >>>>>                                 struct iov_iter *from, size_t length);
> >>>>> @@ -1721,12 +1724,14 @@ int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
> >>>>>     static inline int skb_zerocopy_iter_dgram(struct sk_buff *skb,
> >>>>>                                           struct msghdr *msg, int len)
> >>>>>     {
> >>>>> -     return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len);
> >>>>> +     return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len,
> >>>>> +                                    NULL);
> >>>>
> >>>> Instead of propagating it all the way down and carving a new path, why
> >>>> not reuse the existing infra? You already hook into where ubuf is
> >>>> allocated, you can stash the binding in there. And
> >>>
> >>> It looks like it's not possible to increase the size of ubuf_info at
> >>> all, otherwise the BUILD_BUG_ON in msg_zerocopy_alloc() fires.
> >>>
> >>> It's asserting that sizeof(ubuf_info_msgzc) <= sizeof(skb->cb), and
> >>> I'm guessing increasing skb->cb size is not really the way to go.
> >>>
> >>> What I may be able to do here is stash the binding somewhere in
> >>> ubuf_info_msgzc via union with fields we don't need for devmem, and/or
> >>
> >> It doesn't need to account the memory against the user, and you
> >> actually don't want that because dmabuf should take care of that.
> >> So, it should be fine to reuse ->mmp.
> >>
> >> It's also not a real sk_buff, so maybe maintainers wouldn't mind
> >> reusing some more space out of it, if that would even be needed.
> >>
> >
> > netmem skbs are real sk_buffs, with the modification that frags are not
>
> We were discussing ubuf_info allocation, take a look at
> msg_zerocopy_alloc(), it has nothing to do with netmems and all that.
>

Yes. My response was regarding the suggestion that we can use space in
devmem skbs however we want though.

> > readable, only in the case that the netmem is unreadable. I would not
> > approve of considering netmem/devmem skbs "not real skbs", and start
> > messing with the semantics of skb fields for devmem skbs, and having
> > to start adding skb_is_devmem() checks through all code in the skb
> > handlers that touch the fields being overwritten in the devmem case.
> > No, I don't think we can re-use random fields in the skb for devmem.
> >
> >>> stashing the binding in ubuf_info_ops (very hacky). Neither approach
> >>> seems ideal, but the former may work and may be cleaner.
> >>>
> >>> I'll take a deeper look here. I had looked before and concluded that
> >>> we're piggybacking devmem TX on MSG_ZEROCOPY path, because we need
> >>> almost all of the functionality there (no copying, send complete
> >>> notifications, etc), with one minor change in the skb filling. I had
> >>> concluded that if MSG_ZEROCOPY was never updated to use the existing
> >>> infra, then it's appropriate for devmem TX piggybacking on top of it
> >>
> >> MSG_ZEROCOPY does use the common infra, i.e. passing ubuf_info,
> >> but doesn't need ->sg_from_iter as zerocopy_fill_skb_from_iter()
> >> and it's what was there first.
> >>
> >
> > But MSG_ZEROCOPY doesn't set msg->msg_ubuf. And not setting
> > msg->msg_ubuf fails to trigger msg->sg_from_iter altogether.
> >
> > And also currently sg_from_iter isn't set up to take in a ubuf_info.
> > We'd need that if we stash the binding in the ubuf_info.
>
> https://github.com/isilence/linux.git sg-iter-ops
>
> I have old patches for all of that, they even rebased cleanly. That
> should do it for you, and I need to send them regardless of devmem.
>
>

These patches help a bit, but do not make any meaningful dent in
addressing the concern I have in the earlier emails.

The concern is that we're piggybacking devmem TX on MSG_ZEROCOPY, and
currently the MSG_ZEROCOPY code carefully avoids any code paths
setting msg->[sg_from_iter|msg_ubuf].

If we want devmem to reuse both the MSG_ZEROCOPY mechanisms and the
msg->[sg_from_iter|ubuf_info] mechanism, I have to dissect the
MSG_ZEROCOPY code carefully so that it works with and without
setting msg->[ubuf_info|msg->sg_from_iter]. Having gone through this
rabbit hole so far I see that it complicates the implementation and
adds more checks to the fast MSG_ZEROCOPY paths.

The complication could be worth it if there was some upside, but I
don't see one tbh. Passing the binding down to
zerocopy_fill_skb_from_devmem seems like a better approach to my eye
so far

I'm afraid I'm going to table this for now. If there is overwhelming
consensus that msg->sg_from_iter is the right approach here I will
revisit, but it seems to me to complicate code without a significant
upside.


--
Thanks,
Mina

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 5/6] net: devmem: Implement TX path
  2025-02-17 23:26             ` Mina Almasry
@ 2025-02-19 22:41               ` Pavel Begunkov
  2025-02-20  1:46                 ` Mina Almasry
  0 siblings, 1 reply; 39+ messages in thread
From: Pavel Begunkov @ 2025-02-19 22:41 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Jonathan Corbet,
	Andrew Lunn, Neal Cardwell, David Ahern, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, dw, Jamal Hadi Salim,
	Victor Nogueira, Pedro Tammela, Samiullah Khawaja, Kaiyuan Zhang

On 2/17/25 23:26, Mina Almasry wrote:
> On Thu, Feb 13, 2025 at 5:17 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
...
>>>>> It's asserting that sizeof(ubuf_info_msgzc) <= sizeof(skb->cb), and
>>>>> I'm guessing increasing skb->cb size is not really the way to go.
>>>>>
>>>>> What I may be able to do here is stash the binding somewhere in
>>>>> ubuf_info_msgzc via union with fields we don't need for devmem, and/or
>>>>
>>>> It doesn't need to account the memory against the user, and you
>>>> actually don't want that because dmabuf should take care of that.
>>>> So, it should be fine to reuse ->mmp.
>>>>
>>>> It's also not a real sk_buff, so maybe maintainers wouldn't mind
>>>> reusing some more space out of it, if that would even be needed.
>>>>
>>>
>>> netmem skbs are real sk_buffs, with the modification that frags are not
>>
>> We were discussing ubuf_info allocation, take a look at
>> msg_zerocopy_alloc(), it has nothing to do with netmems and all that.
>>
> 
> Yes. My response was regarding the suggestion that we can use space in
> devmem skbs however we want though.

Well, at least I didn't suggest that, assuming "devmem skbs" are skbs
filled with devmem frags. I think the confusion here is thinking
that the skb->cb you mentioned above is about "devmem skbs", while it's
a special skb without data used only to piggyback the ubuf allocation.
Functionally speaking, it'd be perfectly fine to get rid of the
warning and allocate it with kmalloc().

...
>>> But MSG_ZEROCOPY doesn't set msg->msg_ubuf. And not setting
>>> msg->msg_ubuf fails to trigger msg->sg_from_iter altogether.
>>>
>>> And also currently sg_from_iter isn't set up to take in a ubuf_info.
>>> We'd need that if we stash the binding in the ubuf_info.
>>
>> https://github.com/isilence/linux.git sg-iter-ops
>>
>> I have old patches for all of that, they even rebased cleanly. That
>> should do it for you, and I need to send then regardless of devmem.
>>
>>
> 
> These patches help a bit, but do not make any meaningful dent in
> addressing the concern I have in the earlier emails.
> 
> The concern is that we're piggybacking devmem TX on MSG_ZEROCOPY, and
> currently the MSG_ZEROCOPY code carefully avoids any code paths
> setting msg->[sg_from_iter|msg_ubuf].

Fwiw, with that branch you don't need ->msg_ubuf at all, just pass
it as an argument from tcp_sendmsg_locked() as usual, and
->sg_from_iter is gone from there as well.

> If we want devmem to reuse both the MSG_ZEROCOPY mechanisms and the
> msg->[sg_from_iter|ubuf_info] mechanism, I have to dissect the
> MSG_ZEROCOPY code carefully so that it works with and without
> setting msg->[ubuf_info|msg->sg_from_iter]. Having gone through this
> rabbit hole so far, I see that it complicates the implementation and
> adds more checks to the fast MSG_ZEROCOPY paths.

If you've already done that, maybe you can post it as a draft? At least
it'll be obvious why you say it's more complicated.

> The complication could be worth it if there was some upside, but I
> don't see one tbh. Passing the binding down to
> zerocopy_fill_skb_from_devmem seems like a better approach to my eye
> so far.

The upside is that 1) you currently add overhead to the common
path (incl. copy), 2) passing it down through all the functions also
adds overhead to the zerocopy and MSG_ZEROCOPY paths, which I'd
assume is comparable to those extra checks you have. 3) tcp would
need to know about devmem tcp and its bindings, while it all could
be in one spot under the MSG_ZEROCOPY check. 4) When you'd want
another protocol to support that, instead of a simple

ubuf = get_devmem_ubuf();

you'd need to plumb the binding through the stack there as
well (sketch below).

5) And keeping it in one place makes it easier to keep around.
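
To spell out 4) as a sketch (get_devmem_ubuf() and the dmabuf id check
are made-up names for illustration):

/* hypothetical: the binding lives inside the ubuf_info, so a new
 * protocol only picks the right ubuf up front and everything below
 * skb_zerocopy_iter_stream() stays devmem-agnostic
 */
if (sockc.dmabuf_id != DMABUF_ID_UNSET)
	uarg = get_devmem_ubuf(sk, sockc.dmabuf_id);
else
	uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
if (!uarg)
	return -ENOBUFS;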

I just don't see why it'd be complicated, but maybe I'm missing
something, which is why a draft prototype would explain it
better than any words.

> I'm afraid I'm going to table this for now. If there is overwhelming
> consensus that msg->sg_from_iter is the right approach here I will
> revisit, but it seems to me to complicate code without a significant
> upside.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 5/6] net: devmem: Implement TX path
  2025-02-19 22:41               ` Pavel Begunkov
@ 2025-02-20  1:46                 ` Mina Almasry
  2025-02-20 14:35                   ` Pavel Begunkov
  0 siblings, 1 reply; 39+ messages in thread
From: Mina Almasry @ 2025-02-20  1:46 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Jonathan Corbet,
	Andrew Lunn, Neal Cardwell, David Ahern, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, dw, Jamal Hadi Salim,
	Victor Nogueira, Pedro Tammela, Samiullah Khawaja, Kaiyuan Zhang

On Wed, Feb 19, 2025 at 2:40 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 2/17/25 23:26, Mina Almasry wrote:
> > On Thu, Feb 13, 2025 at 5:17 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> ...
> >>>>> It's asserting that sizeof(ubuf_info_msgzc) <= sizeof(skb->cb), and
> >>>>> I'm guessing increasing skb->cb size is not really the way to go.
> >>>>>
> >>>>> What I may be able to do here is stash the binding somewhere in
> >>>>> ubuf_info_msgzc via union with fields we don't need for devmem, and/or
> >>>>
> >>>> It doesn't need to account the memory against the user, and you
> >>>> actually don't want that because dmabuf should take care of that.
> >>>> So, it should be fine to reuse ->mmp.
> >>>>
> >>>> It's also not a real sk_buff, so maybe maintainers wouldn't mind
> >>>> reusing some more space out of it, if that would even be needed.
> >>>>
> >>>
> >>> netmem skb are real sk_buff, with the modification that frags are not
> >>
> >> We were discussing ubuf_info allocation, take a look at
> >> msg_zerocopy_alloc(), it has nothing to do with netmems and all that.
> >>
> >
> > Yes. My response was regarding the suggestion that we can use space in
> > devmem skbs however we want though.
>
> Well, at least I didn't suggest that, assuming "devmem skbs" are skbs
> filled with devmem frags. I think the confusion here is thinking
> that the skb->cb you mentioned above is about "devmem skbs", while it's
> a special skb without data, used only to piggyback the ubuf allocation.

Ah, I see. I still don't see how we can just increase the size of
skb->cb when it's shared between these special skbs and regular skbs.

> Functionally speaking, it'd be perfectly fine to get rid of the
> warning and allocate it with kmalloc().
>

More suggestions to refactor unrelated things to force through a
msg->sg_from_iter approach.

> ...
> >>> But MSG_ZEROCOPY doesn't set msg->msg_ubuf. And not setting
> >>> msg->msg_ubuf fails to trigger msg->sg_from_iter altogether.
> >>>
> >>> And also currently sg_from_iter isn't set up to take in a ubuf_info.
> >>> We'd need that if we stash the binding in the ubuf_info.
> >>
> >> https://github.com/isilence/linux.git sg-iter-ops
> >>
> >> I have old patches for all of that, they even rebased cleanly. That
> >> should do it for you, and I need to send them regardless of devmem.
> >>
> >>
> >
> > These patches help a bit, but do not make any meaningful dent in
> > addressing the concern I have in the earlier emails.
> >
> > The concern is that we're piggybacking devmem TX on MSG_ZEROCOPY, and
> > currently the MSG_ZEROCOPY code carefully avoids any code paths
> > setting msg->[sg_from_iter|msg_ubuf].
>
> Fwiw, with that branch you don't need ->msg_ubuf at all, just pass
> it as an argument from tcp_sendmsg_locked() as usual, and
> ->sg_from_iter is gone from there as well.
>
> > If we want devmem to reuse both the MSG_ZEROCOPY mechanisms and the
> > msg->[sg_from_iter|ubuf_info] mechanism, I have to dissect the
> > MSG_ZEROCOPY code carefully so that it works with and without
> > setting msg->[ubuf_info|msg->sg_from_iter]. Having gone through this
> > rabbit hole so far, I see that it complicates the implementation and
> > adds more checks to the fast MSG_ZEROCOPY paths.
>
> If you've already done that, maybe you can post it as a draft? At least
> it'll be obvious why you say it's more complicated.
>

I don't have anything worth sharing. I just went down this rabbit hole
and saw that a bunch of MSG_ZEROCOPY checks (!msg->msg_ubuf checks
around MSG_ZEROCOPY code) and restrictions (skb->cb size) would need to
be addressed, and new checks added. From this thread you seem to be
suggesting more changes to force in a msg->sg_from_iter approach,
adding to the complications.

> > The complication could be worth it if there was some upside, but I
> > don't see one tbh. Passing the binding down to
> > zerocopy_fill_skb_from_devmem seems like a better approach to my eye
> > so far.
>
> The upside is that 1) you currently add overhead to the common
> path (incl. copy),

You mean the unlikely() check for devmem before delegating to
zerocopy_fill_skb_from_devmem? Should be minimal.
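
i.e. something like this (a sketch of the shape, function names
approximate):

/* the only cost to the common fill path is one unlikely() branch */
if (unlikely(binding))
	return zerocopy_fill_skb_from_devmem(skb, from, length, binding);
return zerocopy_fill_skb_from_iter(skb, from, length);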

> 2) passing it down through all the functions also
> adds overhead to the zerocopy and MSG_ZEROCOPY paths, which I'd
> assume is comparable to those extra checks you have.

Complicating/refactoring existing code for devmem TCP to force in a
msg->sg_from_iter approach and save one arg passed down a couple of
functions doesn't seem like a good tradeoff IMO.

> 3) tcp would
> need to know about devmem tcp and its bindings, while it all could
> be in one spot under the MSG_ZEROCOPY check.

I don't see why this is binding to tcp somehow. If anything it makes
the devmem TX implementation follow MSG_ZEROCOPY closely, and existing
MSG_ZEROCOPY code would be easily extended for devmem TX without
having to also carry refactors to migrate to the msg->sg_from_iter
approach (just grab the binding and pass it to
skb_zerocopy_iter_stream).

> 4) When you'd want
> another protocol to support that, instead of a simple
>
> ubuf = get_devmem_ubuf();
>
> you'd need to plumb the binding through the stack there as
> well.
>

Similar to the above, I think this approach will actually extend more
easily to any protocol already using MSG_ZEROCOPY, because we follow it
closely instead of requiring refactors to force the msg->sg_from_iter
approach.


> 5) And keeping it in one place makes it easier to keep around.
>
> I just don't see why it'd be complicated, but maybe I'm missing
> something, which is why a draft prototype would explain it
> better than any words.
>
> > I'm afraid I'm going to table this for now. If there is overwhelming
> > consensus that msg->sg_from_iter is the right approach here I will
> > revisit, but it seems to me to complicate code without a significant
> > upside.
>
> --
> Pavel Begunkov
>


--
Thanks,
Mina

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v3 5/6] net: devmem: Implement TX path
  2025-02-20  1:46                 ` Mina Almasry
@ 2025-02-20 14:35                   ` Pavel Begunkov
  0 siblings, 0 replies; 39+ messages in thread
From: Pavel Begunkov @ 2025-02-20 14:35 UTC (permalink / raw)
  To: Mina Almasry
  Cc: netdev, linux-kernel, linux-doc, kvm, virtualization,
	linux-kselftest, Donald Hunter, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Jonathan Corbet,
	Andrew Lunn, Neal Cardwell, David Ahern, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Stefan Hajnoczi,
	Stefano Garzarella, Shuah Khan, sdf, dw, Jamal Hadi Salim,
	Victor Nogueira, Pedro Tammela, Samiullah Khawaja, Kaiyuan Zhang

On 2/20/25 01:46, Mina Almasry wrote:
> On Wed, Feb 19, 2025 at 2:40 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> On 2/17/25 23:26, Mina Almasry wrote:
>>> On Thu, Feb 13, 2025 at 5:17 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>> ...
>>>>>>> It's asserting that sizeof(ubuf_info_msgzc) <= sizeof(skb->cb), and
>>>>>>> I'm guessing increasing skb->cb size is not really the way to go.
>>>>>>>
>>>>>>> What I may be able to do here is stash the binding somewhere in
>>>>>>> ubuf_info_msgzc via union with fields we don't need for devmem, and/or
>>>>>>
>>>>>> It doesn't need to account the memory against the user, and you
>>>>>> actually don't want that because dmabuf should take care of that.
>>>>>> So, it should be fine to reuse ->mmp.
>>>>>>
>>>>>> It's also not a real sk_buff, so maybe maintainers wouldn't mind
>>>>>> reusing some more space out of it, if that would even be needed.
>>>>>>
>>>>>
>>>>> netmem skb are real sk_buff, with the modification that frags are not
>>>>
>>>> We were discussing ubuf_info allocation, take a look at
>>>> msg_zerocopy_alloc(), it has nothing to do with netmems and all that.
>>>>
>>>
>>> Yes. My response was regarding the suggestion that we can use space in
>>> devmem skbs however we want though.
>>
>> Well, at least I didn't suggest that, assuming "devmem skbs" are skbs
>> filled with devmem frags. I think the confusion here is thinking
>> that the skb->cb you mentioned above is about "devmem skbs", while it's
>> a special skb without data, used only to piggyback the ubuf allocation.
> 
> Ah, I see. I still don't see how we can just increase the size of
> skb->cb when it's shared between these special skbs and regular skbs.

The approach was not to increase ->cb but rather to reuse some other
sk_buff fields that are unused in this path. Though, looking at
__msg_zerocopy_callback(), maybe it's better not to entertain that, as
the skb is queued into the error queue. But again, it's not like you
need it.
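
For reference, trimmed from __msg_zerocopy_callback() in
net/core/skbuff.c: the same skb whose ->cb holds the uarg is what ends
up on the error queue as the completion notification.

	struct sk_buff *skb = skb_from_uarg(uarg);
	struct sock_exterr_skb *serr = SKB_EXT_ERR(skb);

	/* serr fields filled in, then, under the error queue lock: */
	__skb_queue_tail(q, skb);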

>> Functionally speaking, it'd be perfectly fine to get rid of the
>> warning and allocate it with kmalloc().
>>
> 
> More suggestions to refactor unrelated things to force through a
> msg->sg_from_iter approach.

Mina, you're surprising me. I neither suggested doing that (I was just
trying to help with the confusion using analogies), nor said that
it'd be welcome, nor is it somehow "unrelated". And "forcing"
is a misstatement; so far I've been offering a recommendation
on how to make it better.

...
>> If you've already done that, maybe you can post it as a draft? At least
>> it'll be obvious why you say it's more complicated.
>>
> 
> I don't have anything worth sharing. I just went down this rabbit hole
> and saw that a bunch of MSG_ZEROCOPY checks (!msg->msg_ubuf checks
> around MSG_ZEROCOPY code) and restrictions (skb->cb size) would need to
> be addressed, and new checks added. From this thread you seem to be
> suggesting more changes to force in a msg->sg_from_iter approach,
> adding to the complications.

To sum up, you haven't tried it.

>>> The complication could be worth it if there was some upside, but I
>>> don't see one tbh. Passing the binding down to
>>> zerocopy_fill_skb_from_devmem seems like a better approach to my eye
>>> so far.
>>
>> The upside is that 1) you currently add overhead to the common
>> path (incl. copy),
> 
> You mean the unlikely() check for devmem before delegating to
> zerocopy_fill_skb_from_devmem? Should be minimal.

Like keeping the binding in tcp_sendmsg_locked(). The point is,
since you brought up overhead ("adds more checks to the fast
MSG_ZEROCOPY paths"), all things considered the current approach
will be adding more of it to MSG_ZEROCOPY and to other users as well.

>> 2) passing it down through all the functions also
>> adds overhead to the zerocopy and MSG_ZEROCOPY paths, which I'd
>> assume is comparable to those extra checks you have.
> 
> Complicating/refactoring existing code for devmem TCP to force in a
> msg->sg_from_iter approach and save one arg passed down a couple of
> functions doesn't seem like a good tradeoff IMO.
> 
>> 3) tcp would
>> need to know about devmem tcp and its bindings, while it all could
>> be in one spot under the MSG_ZEROCOPY check.
> 
> I don't see why this is binding to tcp somehow. If anything it makes

I don't get what you're saying, but it refers to the devmem binding,
which you add to the TCP path, and so tcp now has to know how to work
with devmem instead of all of it being hidden behind the curtains
of ubuf_info. And it sticks out not only for tcp, but for all
zerocopy users, by virtue of being dragged down through all the
helpers.

> the devmem TX implementation follow MSG_ZEROCOPY closely, and existing

Following it closely would be passing a ubuf just like MSG_ZEROCOPY
does, not plumbing devmem all the way through all the helpers.

> MSG_ZEROCOPY code would be easily extended for devmem TX without
> having to also carry refactors to migrate to the msg->sg_from_iter

Don't be afraid of refactoring when it makes things better. We're
talking about minor changes touching only bits in the direct
vicinity of your set.

> approach (just grab the binding and pass it to
> skb_zerocopy_iter_stream).
> 
>> 4) When you'd want
>> another protocol to support that, instead of a simple
>>
>> ubuf = get_devmem_ubuf();
>>
>> you'd need to plumb the binding through the stack there as
>> well.
>>
> 
> Similar to the above, I think this approach will actually extend more
> easily to any protocol already using MSG_ZEROCOPY, because we follow it
> closely instead of requiring refactors to force the msg->sg_from_iter
> approach.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2025-02-20 14:34 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-03 22:39 [PATCH net-next v3 0/6] Device memory TCP TX Mina Almasry
2025-02-03 22:39 ` [PATCH net-next v3 1/6] net: add devmem TCP TX documentation Mina Almasry
2025-02-03 22:39 ` [PATCH net-next v3 2/6] selftests: ncdevmem: Implement devmem TCP TX Mina Almasry
2025-02-04 12:29   ` Paolo Abeni
2025-02-04 16:50     ` Jakub Kicinski
2025-02-04 17:35     ` Mina Almasry
2025-02-04 17:56       ` Paolo Abeni
2025-02-04 18:03         ` Mina Almasry
2025-02-04 18:07           ` Stanislav Fomichev
2025-02-03 22:39 ` [PATCH net-next v3 3/6] net: add get_netmem/put_netmem support Mina Almasry
2025-02-03 22:39 ` [PATCH net-next v3 4/6] net: devmem: TCP tx netlink api Mina Almasry
2025-02-03 22:39 ` [PATCH net-next v3 5/6] net: devmem: Implement TX path Mina Almasry
2025-02-04 12:15   ` Paolo Abeni
2025-02-05 12:20   ` Pavel Begunkov
2025-02-10 21:09     ` Mina Almasry
2025-02-12 15:53       ` Pavel Begunkov
2025-02-12 19:18         ` Mina Almasry
2025-02-13 13:18           ` Pavel Begunkov
2025-02-17 23:26             ` Mina Almasry
2025-02-19 22:41               ` Pavel Begunkov
2025-02-20  1:46                 ` Mina Almasry
2025-02-20 14:35                   ` Pavel Begunkov
2025-02-05 21:56   ` Willem de Bruijn
2025-02-03 22:39 ` [PATCH net-next v3 6/6] net: devmem: make dmabuf unbinding scheduled work Mina Almasry
2025-02-04 12:32 ` [PATCH net-next v3 0/6] Device memory TCP TX Paolo Abeni
2025-02-04 17:27   ` Mina Almasry
2025-02-04 18:06     ` Stanislav Fomichev
2025-02-04 18:32       ` Paolo Abeni
2025-02-04 18:47         ` Mina Almasry
2025-02-04 19:41           ` Stanislav Fomichev
2025-02-05  2:06             ` Jakub Kicinski
2025-02-05 19:53               ` Mina Almasry
2025-02-04 18:38       ` Mina Almasry
2025-02-04 19:43         ` Stanislav Fomichev
2025-02-05  0:47           ` Samiullah Khawaja
2025-02-05  1:05             ` Stanislav Fomichev
2025-02-05  2:08 ` Jakub Kicinski
2025-02-05 19:52   ` Mina Almasry
2025-02-06  1:45     ` Jakub Kicinski
