* [PATCH 0/5] add initial io_uring_cmd support for sockets
@ 2023-04-06 14:43 ` Breno Leitao
  0 siblings, 0 replies; 108+ messages in thread
From: Breno Leitao @ 2023-04-06 14:43 UTC (permalink / raw)
  To: io-uring, netdev, kuba, asml.silence, axboe
  Cc: leit, edumazet, pabeni, davem, dccp, mptcp, linux-kernel, dsahern,
	willemdebruijn.kernel, matthieu.baerts, marcelo.leitner

From: Breno Leitao <leit@fb.com>

This patchset creates the initial plumbing for an io_uring command for
sockets.

For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ
and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations
SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is
heavily based on the ioctl operations.
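
For comparison, the existing ioctl path that these two commands mirror
can be exercised from userspace as sketched below. This is only an
illustration and not part of the patchset: sock_queue_bytes() and
demo_inq_after_write() are hypothetical helper names, and an AF_UNIX
stream socketpair (which also implements SIOCINQ) is assumed purely so
that the example runs without any network setup. TCP and UDP sockets
accept the same two ioctl requests.

```c
#include <assert.h>
#include <linux/sockios.h>	/* SIOCINQ, SIOCOUTQ */
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: run SIOCINQ or SIOCOUTQ on fd.
 * Returns the byte count reported by the kernel, or -1 on error.
 */
int sock_queue_bytes(int fd, unsigned long request)
{
	int bytes = 0;

	if (ioctl(fd, request, &bytes) < 0)
		return -1;
	return bytes;
}

/*
 * Hypothetical demo: write len bytes into one end of a fresh AF_UNIX
 * stream pair and report what SIOCINQ sees on the other end.
 */
int demo_inq_after_write(size_t len)
{
	char buf[64] = { 0 };
	int sv[2];
	int pending;

	assert(len <= sizeof(buf));

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	if (write(sv[0], buf, len) != (ssize_t)len) {
		close(sv[0]);
		close(sv[1]);
		return -1;
	}

	/* Unread bytes queued on the receiving end. */
	pending = sock_queue_bytes(sv[1], SIOCINQ);

	close(sv[0]);
	close(sv[1]);
	return pending;
}
```

With the ioctl, the count comes back through a user pointer. The uring
commands in this series instead return the count as the result of the
command itself.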

In order to test this code, I created a liburing test, which is
currently located at [1], and I will create a pull request once we are
good with this patch.

I've also run test/io_uring_passthrough to make sure the first patch
didn't regress the NVMe passthrough path.

This patchset is an RFC for two different reasons:
  * It slightly changes how the io_uring command operates, i.e., we are
    now passing the whole SQE to the io_uring_cmd callback (instead of
    an opaque buffer). This seems more palatable than creating some
    custom structure just to fit small parameters, as in
    SOCKET_URING_OP_SIOC{IN,OUT}Q. Is this OK?

  * Pavel has some ideas about the SQE->cmd_op field, so we can start
    discussing it here.

This work is heavily inspired by Jens Axboe's initial implementation.

[1] https://github.com/leitao/liburing/blob/master/test/socket-io-cmd.c

Breno Leitao (4):
  net: wire up support for file_operations->uring_cmd()
  net: add uring_cmd callback to UDP
  net: add uring_cmd callback to TCP
  net: add uring_cmd callback to raw "protocol"

 include/linux/net.h      |  2 ++
 include/net/raw.h        |  3 +++
 include/net/sock.h       |  6 ++++++
 include/net/tcp.h        |  2 ++
 include/net/udp.h        |  2 ++
 include/uapi/linux/net.h |  5 +++++
 net/core/sock.c          | 17 +++++++++++++++--
 net/dccp/ipv4.c          |  1 +
 net/ipv4/af_inet.c       |  3 +++
 net/ipv4/raw.c           | 26 ++++++++++++++++++++++++++
 net/ipv4/tcp.c           | 34 ++++++++++++++++++++++++++++++++++
 net/ipv4/tcp_ipv4.c      |  1 +
 net/ipv4/udp.c           | 18 ++++++++++++++++++
 net/l2tp/l2tp_ip.c       |  1 +
 net/mptcp/protocol.c     |  1 +
 net/sctp/protocol.c      |  1 +
 net/socket.c             | 13 +++++++++++++
 17 files changed, 134 insertions(+), 2 deletions(-)

-- 
2.34.1





* [RFC PATCH 1/4] net: wire up support for file_operations->uring_cmd()
@ 2023-04-06 14:43 ` Breno Leitao
  0 siblings, 0 replies; 108+ messages in thread
From: Breno Leitao @ 2023-04-06 14:43 UTC (permalink / raw)
  To: io-uring, netdev, kuba, asml.silence, axboe
  Cc: leit, edumazet, pabeni, davem, dccp, mptcp, linux-kernel, dsahern,
	willemdebruijn.kernel, matthieu.baerts, marcelo.leitner

Create the initial plumbing to call protocol specific uring_cmd
callbacks. These are io_uring specific callbacks that implement
ioctl-like operation types, such as SIOCINQ, SIOCOUTQ and others.

In order to achieve this, create uring_cmd callback placeholders in
file_ops, proto and proto_ops structures.

Also create the functions that do the plumbing from io_uring_cmd() up
to sk->sk_prot->uring_cmd(). If the callback is not implemented,
-EOPNOTSUPP is returned.

That way, the io_uring issue path calls file_operations->uring_cmd
(sock_uring_cmd()). sock_uring_cmd() is responsible for calling the
socket-level (struct proto_ops) uring_cmd callback
(sock_common_uring_cmd()). sock_common_uring_cmd() then calls the
protocol specific (struct proto) uring_cmd function, which is
implemented in the upcoming patches.

In the end, the uring_cmd() function has access to 'struct io_uring_cmd',
which points to the whole SQE, so any SQE field can be accessed from the
callback.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 include/linux/net.h  |  2 ++
 include/net/sock.h   |  6 ++++++
 net/core/sock.c      | 17 +++++++++++++++--
 net/dccp/ipv4.c      |  1 +
 net/ipv4/af_inet.c   |  3 +++
 net/l2tp/l2tp_ip.c   |  1 +
 net/mptcp/protocol.c |  1 +
 net/sctp/protocol.c  |  1 +
 net/socket.c         | 13 +++++++++++++
 9 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index b73ad8e3c212..efcc47a57069 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -182,6 +182,8 @@ struct proto_ops {
 	int	 	(*compat_ioctl) (struct socket *sock, unsigned int cmd,
 				      unsigned long arg);
 #endif
+	int		(*uring_cmd)(struct socket *sock, struct io_uring_cmd *cmd,
+				     unsigned int issue_flags);
 	int		(*gettstamp) (struct socket *sock, void __user *userstamp,
 				      bool timeval, bool time32);
 	int		(*listen)    (struct socket *sock, int len);
diff --git a/include/net/sock.h b/include/net/sock.h
index 573f2bf7e0de..57437a1e041c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -111,6 +111,7 @@ typedef struct {
 struct sock;
 struct proto;
 struct net;
+struct io_uring_cmd;
 
 typedef __u32 __bitwise __portpair;
 typedef __u64 __bitwise __addrpair;
@@ -1247,6 +1248,9 @@ struct proto {
 
 	int			(*ioctl)(struct sock *sk, int cmd,
 					 unsigned long arg);
+	int			(*uring_cmd)(struct sock *sk,
+					     struct io_uring_cmd *cmd,
+					     unsigned int issue_flags);
 	int			(*init)(struct sock *sk);
 	void			(*destroy)(struct sock *sk);
 	void			(*shutdown)(struct sock *sk, int how);
@@ -1921,6 +1925,8 @@ int sock_common_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 			int flags);
 int sock_common_setsockopt(struct socket *sock, int level, int optname,
 			   sockptr_t optval, unsigned int optlen);
+int sock_common_uring_cmd(struct socket *sock, struct io_uring_cmd *cmd,
+			  unsigned int issue_flags);
 
 void sk_common_release(struct sock *sk);
 
diff --git a/net/core/sock.c b/net/core/sock.c
index c25888795390..1bf5e4d4ba29 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3669,6 +3669,18 @@ int sock_common_setsockopt(struct socket *sock, int level, int optname,
 }
 EXPORT_SYMBOL(sock_common_setsockopt);
 
+int sock_common_uring_cmd(struct socket *sock, struct io_uring_cmd *cmd,
+			  unsigned int issue_flags)
+{
+	struct sock *sk = sock->sk;
+
+	if (!sk->sk_prot || !sk->sk_prot->uring_cmd)
+		return -EOPNOTSUPP;
+
+	return sk->sk_prot->uring_cmd(sk, cmd, issue_flags);
+}
+EXPORT_SYMBOL(sock_common_uring_cmd);
+
 void sk_common_release(struct sock *sk)
 {
 	if (sk->sk_prot->destroy)
@@ -4009,7 +4021,7 @@ static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
 {
 
 	seq_printf(seq, "%-9s %4u %6d  %6ld   %-3s %6u   %-3s  %-10s "
-			"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
+			"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
 		   proto->name,
 		   proto->obj_size,
 		   sock_prot_inuse_get(seq_file_net(seq), proto),
@@ -4023,6 +4035,7 @@ static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
 		   proto_method_implemented(proto->disconnect),
 		   proto_method_implemented(proto->accept),
 		   proto_method_implemented(proto->ioctl),
+		   proto_method_implemented(proto->uring_cmd),
 		   proto_method_implemented(proto->init),
 		   proto_method_implemented(proto->destroy),
 		   proto_method_implemented(proto->shutdown),
@@ -4051,7 +4064,7 @@ static int proto_seq_show(struct seq_file *seq, void *v)
 			   "maxhdr",
 			   "slab",
 			   "module",
-			   "cl co di ac io in de sh ss gs se re sp bi br ha uh gp em\n");
+			   "cl co di ac io ur in de sh ss gs se re sp bi br ha uh gp em\n");
 	else
 		proto_seq_printf(seq, list_entry(v, struct proto, node));
 	return 0;
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index b780827f5e0a..47047ad05e65 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -999,6 +999,7 @@ static const struct proto_ops inet_dccp_ops = {
 	/* FIXME: work on tcp_poll to rename it to inet_csk_poll */
 	.poll		   = dccp_poll,
 	.ioctl		   = inet_ioctl,
+	.uring_cmd	   = sock_common_uring_cmd,
 	.gettstamp	   = sock_gettstamp,
 	/* FIXME: work on inet_listen to rename it to sock_common_listen */
 	.listen		   = inet_dccp_listen,
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8db6747f892f..1c54c3b59f2e 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1036,6 +1036,7 @@ const struct proto_ops inet_stream_ops = {
 	.getname	   = inet_getname,
 	.poll		   = tcp_poll,
 	.ioctl		   = inet_ioctl,
+	.uring_cmd	   = sock_common_uring_cmd,
 	.gettstamp	   = sock_gettstamp,
 	.listen		   = inet_listen,
 	.shutdown	   = inet_shutdown,
@@ -1071,6 +1072,7 @@ const struct proto_ops inet_dgram_ops = {
 	.getname	   = inet_getname,
 	.poll		   = udp_poll,
 	.ioctl		   = inet_ioctl,
+	.uring_cmd	   = sock_common_uring_cmd,
 	.gettstamp	   = sock_gettstamp,
 	.listen		   = sock_no_listen,
 	.shutdown	   = inet_shutdown,
@@ -1103,6 +1105,7 @@ static const struct proto_ops inet_sockraw_ops = {
 	.getname	   = inet_getname,
 	.poll		   = datagram_poll,
 	.ioctl		   = inet_ioctl,
+	.uring_cmd	   = sock_common_uring_cmd,
 	.gettstamp	   = sock_gettstamp,
 	.listen		   = sock_no_listen,
 	.shutdown	   = inet_shutdown,
diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index 4db5a554bdbd..cdfaaada0695 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -617,6 +617,7 @@ static const struct proto_ops l2tp_ip_ops = {
 	.getname	   = l2tp_ip_getname,
 	.poll		   = datagram_poll,
 	.ioctl		   = inet_ioctl,
+	.uring_cmd	   = sock_common_uring_cmd,
 	.gettstamp	   = sock_gettstamp,
 	.listen		   = sock_no_listen,
 	.shutdown	   = inet_shutdown,
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 3ad9c46202fc..b8182eab5ebf 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -3808,6 +3808,7 @@ static const struct proto_ops mptcp_stream_ops = {
 	.getname	   = inet_getname,
 	.poll		   = mptcp_poll,
 	.ioctl		   = inet_ioctl,
+	.uring_cmd	   = sock_common_uring_cmd,
 	.gettstamp	   = sock_gettstamp,
 	.listen		   = mptcp_listen,
 	.shutdown	   = inet_shutdown,
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index c365df24ad33..b1aaf644076f 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1127,6 +1127,7 @@ static const struct proto_ops inet_seqpacket_ops = {
 	.getname	   = inet_getname,	/* Semantics are different.  */
 	.poll		   = sctp_poll,
 	.ioctl		   = inet_ioctl,
+	.uring_cmd	   = sock_common_uring_cmd,
 	.gettstamp	   = sock_gettstamp,
 	.listen		   = sctp_inet_listen,
 	.shutdown	   = inet_shutdown,	/* Looks harmless.  */
diff --git a/net/socket.c b/net/socket.c
index 9c92c0e6c4da..c683110c1523 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -87,6 +87,7 @@
 #include <linux/xattr.h>
 #include <linux/nospec.h>
 #include <linux/indirect_call_wrapper.h>
+#include <linux/io_uring.h>
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
@@ -116,6 +117,7 @@ unsigned int sysctl_net_busy_poll __read_mostly;
 static ssize_t sock_read_iter(struct kiocb *iocb, struct iov_iter *to);
 static ssize_t sock_write_iter(struct kiocb *iocb, struct iov_iter *from);
 static int sock_mmap(struct file *file, struct vm_area_struct *vma);
+static int sock_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
 
 static int sock_close(struct inode *inode, struct file *file);
 static __poll_t sock_poll(struct file *file,
@@ -159,6 +161,7 @@ static const struct file_operations socket_file_ops = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl = compat_sock_ioctl,
 #endif
+	.uring_cmd =	sock_uring_cmd,
 	.mmap =		sock_mmap,
 	.release =	sock_close,
 	.fasync =	sock_fasync,
@@ -1319,6 +1322,16 @@ static long sock_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 	return err;
 }
 
+static int sock_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
+{
+	struct socket *sock = cmd->file->private_data;
+
+	if (!sock->ops || !sock->ops->uring_cmd)
+		return -EOPNOTSUPP;
+
+	return sock->ops->uring_cmd(sock, cmd, issue_flags);
+}
+
 /**
  *	sock_create_lite - creates a socket
  *	@family: protocol family (AF_INET, ...)
-- 
2.34.1
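
The three-level call chain described in the commit message above
(file_operations, then proto_ops, then proto) can be modelled in plain
userspace C as follows. This is a simplified sketch under stated
assumptions, not kernel code: every model_* type and function is a
hypothetical stand-in for the kernel structure named in its comment,
and issue_flags is omitted for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_cmd { unsigned int cmd_op; };	/* stand-in for struct io_uring_cmd */

struct model_sock;
struct model_proto {				/* stand-in for struct proto */
	int (*uring_cmd)(struct model_sock *sk, struct model_cmd *cmd);
};
struct model_sock {				/* stand-in for struct sock */
	const struct model_proto *prot;
	int queued;
};

struct model_socket;
struct model_proto_ops {			/* stand-in for struct proto_ops */
	int (*uring_cmd)(struct model_socket *sock, struct model_cmd *cmd);
};
struct model_socket {				/* stand-in for struct socket */
	const struct model_proto_ops *ops;
	struct model_sock *sk;
};

/* Level 3: a protocol callback, in the role of udp_uring_cmd() and friends. */
int model_proto_uring_cmd(struct model_sock *sk, struct model_cmd *cmd)
{
	(void)cmd;
	return sk->queued;
}

/* Level 2: mirrors sock_common_uring_cmd(). */
int model_common_uring_cmd(struct model_socket *sock, struct model_cmd *cmd)
{
	struct model_sock *sk = sock->sk;

	if (!sk->prot || !sk->prot->uring_cmd)
		return -EOPNOTSUPP;
	return sk->prot->uring_cmd(sk, cmd);
}

/* Level 1: mirrors sock_uring_cmd(), the file_operations entry point. */
int model_sock_uring_cmd(struct model_socket *sock, struct model_cmd *cmd)
{
	if (!sock->ops || !sock->ops->uring_cmd)
		return -EOPNOTSUPP;
	return sock->ops->uring_cmd(sock, cmd);
}

/* Build a chain with optional pieces missing and issue one command. */
int model_issue(int with_ops, int with_proto_cb, int queued)
{
	struct model_proto prot = {
		.uring_cmd = with_proto_cb ? model_proto_uring_cmd : NULL,
	};
	struct model_sock sk = { .prot = &prot, .queued = queued };
	struct model_proto_ops ops = { .uring_cmd = model_common_uring_cmd };
	struct model_socket sock = { .ops = with_ops ? &ops : NULL, .sk = &sk };
	struct model_cmd cmd = { .cmd_op = 0 };

	return model_sock_uring_cmd(&sock, &cmd);
}
```

In this model, model_issue(1, 1, 42) reaches the protocol callback,
while leaving out either the proto_ops table or the protocol callback
yields -EOPNOTSUPP, matching the fallback described above.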



* [RFC PATCH 2/4] net: add uring_cmd callback to UDP
@ 2023-04-06 14:43 ` Breno Leitao
  0 siblings, 0 replies; 108+ messages in thread
From: Breno Leitao @ 2023-04-06 14:43 UTC (permalink / raw)
  To: io-uring, netdev, kuba, asml.silence, axboe
  Cc: leit, edumazet, pabeni, davem, dccp, mptcp, linux-kernel, dsahern,
	willemdebruijn.kernel, matthieu.baerts, marcelo.leitner

This is the implementation of uring_cmd for the UDP protocol. It
basically encompasses SOCKET_URING_OP_SIOCOUTQ and
SOCKET_URING_OP_SIOCINQ, which are similar to the SIOCOUTQ and SIOCINQ
ioctls.

The return value is exactly the same as the regular ioctl (udp_ioctl()).

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 include/net/udp.h        |  2 ++
 include/uapi/linux/net.h |  5 +++++
 net/ipv4/udp.c           | 16 ++++++++++++++++
 3 files changed, 23 insertions(+)

diff --git a/include/net/udp.h b/include/net/udp.h
index de4b528522bb..c0e829dacc2f 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -283,6 +283,8 @@ void udp_flush_pending_frames(struct sock *sk);
 int udp_cmsg_send(struct sock *sk, struct msghdr *msg, u16 *gso_size);
 void udp4_hwcsum(struct sk_buff *skb, __be32 src, __be32 dst);
 int udp_rcv(struct sk_buff *skb);
+int udp_uring_cmd(struct sock *sk, struct io_uring_cmd *cmd,
+		  unsigned int issue_flags);
 int udp_ioctl(struct sock *sk, int cmd, unsigned long arg);
 int udp_init_sock(struct sock *sk);
 int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
diff --git a/include/uapi/linux/net.h b/include/uapi/linux/net.h
index 4dabec6bd957..dd8e7ced7d24 100644
--- a/include/uapi/linux/net.h
+++ b/include/uapi/linux/net.h
@@ -55,4 +55,9 @@ typedef enum {
 
 #define __SO_ACCEPTCON	(1 << 16)	/* performed a listen		*/
 
+enum {
+	SOCKET_URING_OP_SIOCINQ		= 0,
+	SOCKET_URING_OP_SIOCOUTQ,
+};
+
 #endif /* _UAPI_LINUX_NET_H */
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c605d171eb2d..d6d60600831b 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -113,6 +113,7 @@
 #include <net/sock_reuseport.h>
 #include <net/addrconf.h>
 #include <net/udp_tunnel.h>
+#include <linux/io_uring.h>
 #if IS_ENABLED(CONFIG_IPV6)
 #include <net/ipv6_stubs.h>
 #endif
@@ -1711,6 +1712,20 @@ static int first_packet_length(struct sock *sk)
 	return res;
 }
 
+int udp_uring_cmd(struct sock *sk, struct io_uring_cmd *cmd,
+		  unsigned int issue_flags)
+{
+	switch (cmd->sqe->cmd_op) {
+	case SOCKET_URING_OP_SIOCOUTQ:
+		return sk_wmem_alloc_get(sk);
+	case SOCKET_URING_OP_SIOCINQ:
+		return max_t(int, 0, first_packet_length(sk));
+	default:
+		return -ENOIOCTLCMD;
+	}
+}
+EXPORT_SYMBOL_GPL(udp_uring_cmd);
+
 /*
  *	IOCTL requests applicable to the UDP protocol
  */
@@ -2952,6 +2967,7 @@ struct proto udp_prot = {
 	.connect		= ip4_datagram_connect,
 	.disconnect		= udp_disconnect,
 	.ioctl			= udp_ioctl,
+	.uring_cmd		= udp_uring_cmd,
 	.init			= udp_init_sock,
 	.destroy		= udp_destroy_sock,
 	.setsockopt		= udp_setsockopt,
-- 
2.34.1
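
The cmd_op dispatch in udp_uring_cmd() can be tried out in userspace
with the small model below. It is a sketch under stated assumptions,
not the kernel implementation: the model_* names are hypothetical, the
two integer arguments stand in for sk_wmem_alloc_get() and
first_packet_length(), and MODEL_ENOIOCTLCMD repeats the kernel-internal
value 515 because ENOIOCTLCMD is not exported to userspace. The clamp
matters because first_packet_length() reports -1 when the receive queue
is empty.

```c
#include <assert.h>

/* Same numbering as the enum added to include/uapi/linux/net.h. */
enum {
	MODEL_OP_SIOCINQ = 0,
	MODEL_OP_SIOCOUTQ,
};

/* Kernel-internal errno value, repeated here for the model only. */
#define MODEL_ENOIOCTLCMD 515

/*
 * wmem_alloc:    stand-in for sk_wmem_alloc_get(sk)
 * first_pkt_len: stand-in for first_packet_length(sk), -1 if queue empty
 */
int model_udp_uring_cmd(int wmem_alloc, int first_pkt_len, unsigned int cmd_op)
{
	switch (cmd_op) {
	case MODEL_OP_SIOCOUTQ:
		return wmem_alloc;
	case MODEL_OP_SIOCINQ:
		/* Same effect as max_t(int, 0, ...): never report a negative size. */
		return first_pkt_len > 0 ? first_pkt_len : 0;
	default:
		return -MODEL_ENOIOCTLCMD;
	}
}
```

An empty receive queue therefore reads as 0 bytes pending rather than
as an error, and any opcode outside the two defined values is rejected.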



* [RFC PATCH 3/4] net: add uring_cmd callback to TCP
  2023-04-06 14:43 ` Breno Leitao
@ 2023-04-06 14:43 ` Breno Leitao
  -1 siblings, 0 replies; 108+ messages in thread
From: Breno Leitao @ 2023-04-06 14:43 UTC (permalink / raw)
  To: dccp

This is the implementation of uring_cmd for the TCP protocol. It
basically encompasses SOCKET_URING_OP_SIOCOUTQ and
SOCKET_URING_OP_SIOCINQ, which are similar to the SIOCOUTQ and SIOCINQ
ioctls.

The return value is exactly the same as the regular ioctl (tcp_ioctl()).

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 include/net/tcp.h   |  2 ++
 net/ipv4/tcp.c      | 32 ++++++++++++++++++++++++++++++++
 net/ipv4/tcp_ipv4.c |  1 +
 3 files changed, 35 insertions(+)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index db9f828e9d1e..4dfd6bd63261 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -342,6 +342,8 @@ void tcp_release_cb(struct sock *sk);
 void tcp_wfree(struct sk_buff *skb);
 void tcp_write_timer_handler(struct sock *sk);
 void tcp_delack_timer_handler(struct sock *sk);
+int tcp_uring_cmd(struct sock *sk, struct io_uring_cmd *cmd,
+		  unsigned int issue_flags);
 int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg);
 int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb);
 void tcp_rcv_established(struct sock *sk, struct sk_buff *skb);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 288693981b00..cf2822242e28 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -279,6 +279,7 @@
 #include <linux/uaccess.h>
 #include <asm/ioctls.h>
 #include <net/busy_poll.h>
+#include <linux/io_uring.h>
 
 /* Track pending CMSGs. */
 enum {
@@ -596,6 +597,37 @@ __poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 }
 EXPORT_SYMBOL(tcp_poll);
 
+int tcp_uring_cmd(struct sock *sk, struct io_uring_cmd *cmd,
+		  unsigned int issue_flags)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	bool slow;
+	int ret;
+
+	switch (cmd->sqe->cmd_op) {
+	case SOCKET_URING_OP_SIOCINQ:
+		if (sk->sk_state == TCP_LISTEN)
+			return -EINVAL;
+
+		slow = lock_sock_fast(sk);
+		ret = tcp_inq(sk);
+		unlock_sock_fast(sk, slow);
+		return ret;
+	case SOCKET_URING_OP_SIOCOUTQ:
+		if (sk->sk_state == TCP_LISTEN)
+			return -EINVAL;
+
+		if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))
+			ret = 0;
+		else
+			ret = READ_ONCE(tp->write_seq) - tp->snd_una;
+		return ret;
+	default:
+		return -ENOIOCTLCMD;
+	}
+}
+EXPORT_SYMBOL_GPL(tcp_uring_cmd);
+
 int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ea370afa70ed..900081fa2e1a 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3103,6 +3103,7 @@ struct proto tcp_prot = {
 	.disconnect		= tcp_disconnect,
 	.accept			= inet_csk_accept,
 	.ioctl			= tcp_ioctl,
+	.uring_cmd		= tcp_uring_cmd,
 	.init			= tcp_v4_init_sock,
 	.destroy		= tcp_v4_destroy_sock,
 	.shutdown		= tcp_shutdown,
-- 
2.34.1
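
The SIOCOUTQ branch in the patch above combines two idioms: a state
bitmask test, and an unsigned 32-bit subtraction of sequence numbers
that stays correct when write_seq has wrapped past zero. A minimal
userspace sketch of that logic follows. It is a model, not kernel code:
the model_* / MODEL_* names are hypothetical, and the state numbers are
assumed to match the kernel's TCP state values.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's TCP state numbers (assumed values). */
enum model_tcp_state {
	MODEL_TCP_ESTABLISHED = 1,
	MODEL_TCP_SYN_SENT    = 2,
	MODEL_TCP_SYN_RECV    = 3,
	MODEL_TCP_LISTEN      = 10,
};

/* One bit per state, in the spirit of the kernel's TCPF_* flags. */
#define MODEL_TCPF(state) (1u << (state))

/*
 * Model of the SIOCOUTQ case: bytes written but not yet acknowledged.
 * Returns -EINVAL for a listening socket and 0 during the handshake.
 */
static int model_outq(enum model_tcp_state state,
		      uint32_t write_seq, uint32_t snd_una)
{
	if (state == MODEL_TCP_LISTEN)
		return -EINVAL;

	if (MODEL_TCPF(state) &
	    (MODEL_TCPF(MODEL_TCP_SYN_SENT) | MODEL_TCPF(MODEL_TCP_SYN_RECV)))
		return 0;

	/* Unsigned subtraction is modulo 2^32, so sequence wraparound is harmless. */
	return (int)(write_seq - snd_una);
}
```

With write_seq = 0x10 and snd_una = 0xFFFFFFF0 the subtraction yields
32, which is the case a signed comparison of the raw values would get
wrong.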

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC PATCH 3/4] net: add uring_cmd callback to TCP
@ 2023-04-06 14:43 ` Breno Leitao
  0 siblings, 0 replies; 108+ messages in thread
From: Breno Leitao @ 2023-04-06 14:43 UTC (permalink / raw)
  To: io-uring, netdev, kuba, asml.silence, axboe
  Cc: leit, edumazet, pabeni, davem, dccp, mptcp, linux-kernel, dsahern,
	willemdebruijn.kernel, matthieu.baerts, marcelo.leitner

This is the implementation of uring_cmd for the TCP protocol. It
covers SOCKET_URING_OP_SIOCOUTQ and SOCKET_URING_OP_SIOCINQ, which are
similar to the SIOCOUTQ and SIOCINQ ioctls.

The returned value is the same one that the regular ioctl (tcp_ioctl())
copies to userspace.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 include/net/tcp.h   |  2 ++
 net/ipv4/tcp.c      | 32 ++++++++++++++++++++++++++++++++
 net/ipv4/tcp_ipv4.c |  1 +
 3 files changed, 35 insertions(+)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index db9f828e9d1e..4dfd6bd63261 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -342,6 +342,8 @@ void tcp_release_cb(struct sock *sk);
 void tcp_wfree(struct sk_buff *skb);
 void tcp_write_timer_handler(struct sock *sk);
 void tcp_delack_timer_handler(struct sock *sk);
+int tcp_uring_cmd(struct sock *sk, struct io_uring_cmd *cmd,
+		  unsigned int issue_flags);
 int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg);
 int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb);
 void tcp_rcv_established(struct sock *sk, struct sk_buff *skb);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 288693981b00..cf2822242e28 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -279,6 +279,7 @@
 #include <linux/uaccess.h>
 #include <asm/ioctls.h>
 #include <net/busy_poll.h>
+#include <linux/io_uring.h>
 
 /* Track pending CMSGs. */
 enum {
@@ -596,6 +597,37 @@ __poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 }
 EXPORT_SYMBOL(tcp_poll);
 
+int tcp_uring_cmd(struct sock *sk, struct io_uring_cmd *cmd,
+		  unsigned int issue_flags)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	bool slow;
+	int ret;
+
+	switch (cmd->sqe->cmd_op) {
+	case SOCKET_URING_OP_SIOCINQ:
+		if (sk->sk_state == TCP_LISTEN)
+			return -EINVAL;
+
+		slow = lock_sock_fast(sk);
+		ret = tcp_inq(sk);
+		unlock_sock_fast(sk, slow);
+		return ret;
+	case SOCKET_URING_OP_SIOCOUTQ:
+		if (sk->sk_state == TCP_LISTEN)
+			return -EINVAL;
+
+		if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))
+			ret = 0;
+		else
+			ret = READ_ONCE(tp->write_seq) - tp->snd_una;
+		return ret;
+	default:
+		return -ENOIOCTLCMD;
+	}
+}
+EXPORT_SYMBOL_GPL(tcp_uring_cmd);
+
 int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ea370afa70ed..900081fa2e1a 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3103,6 +3103,7 @@ struct proto tcp_prot = {
 	.disconnect		= tcp_disconnect,
 	.accept			= inet_csk_accept,
 	.ioctl			= tcp_ioctl,
+	.uring_cmd		= tcp_uring_cmd,
 	.init			= tcp_v4_init_sock,
 	.destroy		= tcp_v4_destroy_sock,
 	.shutdown		= tcp_shutdown,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC PATCH 4/4] net: add uring_cmd callback to raw "protocol"
  2023-04-06 14:43 ` Breno Leitao
@ 2023-04-06 14:43 ` Breno Leitao
  -1 siblings, 0 replies; 108+ messages in thread
From: Breno Leitao @ 2023-04-06 14:43 UTC (permalink / raw)
  To: dccp

This is the implementation of uring_cmd for the raw "protocol". It
covers SOCKET_URING_OP_SIOCOUTQ and SOCKET_URING_OP_SIOCINQ, which are
similar to the SIOCOUTQ and SIOCINQ ioctls.

The returned value is the same one that the regular ioctl (raw_ioctl())
copies to userspace.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 include/net/raw.h |  3 +++
 net/ipv4/raw.c    | 25 +++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/net/raw.h b/include/net/raw.h
index 2c004c20ed99..ba7a96dce16b 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -99,4 +99,7 @@ static inline bool raw_sk_bound_dev_eq(struct net *net, int bound_dev_if,
 #endif
 }
 
+int raw_uring_cmd(struct sock *sk, struct io_uring_cmd *cmd,
+		  unsigned int issue_flags);
+
 #endif	/* _RAW_H */
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 94df935ee0c5..3db828bc1224 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -75,6 +75,7 @@
 #include <linux/netfilter_ipv4.h>
 #include <linux/compat.h>
 #include <linux/uio.h>
+#include <linux/io_uring.h>
 
 struct raw_frag_vec {
 	struct msghdr *msg;
@@ -857,6 +858,29 @@ static int raw_getsockopt(struct sock *sk, int level, int optname,
 	return do_raw_getsockopt(sk, level, optname, optval, optlen);
 }
 
+int raw_uring_cmd(struct sock *sk, struct io_uring_cmd *cmd,
+		  unsigned int issue_flags)
+{
+	switch (cmd->sqe->cmd_op) {
+	case SOCKET_URING_OP_SIOCOUTQ:
+		return sk_wmem_alloc_get(sk);
+	case SOCKET_URING_OP_SIOCINQ: {
+		struct sk_buff *skb;
+		int amount = 0;
+
+		spin_lock_bh(&sk->sk_receive_queue.lock);
+		skb = skb_peek(&sk->sk_receive_queue);
+		if (skb)
+			amount = skb->len;
+		spin_unlock_bh(&sk->sk_receive_queue.lock);
+		return amount;
+	}
+	default:
+		return -ENOIOCTLCMD;
+	}
+}
+EXPORT_SYMBOL_GPL(raw_uring_cmd);
+
 static int raw_ioctl(struct sock *sk, int cmd, unsigned long arg)
 {
 	switch (cmd) {
@@ -925,6 +949,7 @@ struct proto raw_prot = {
 	.connect	   = ip4_datagram_connect,
 	.disconnect	   = __udp_disconnect,
 	.ioctl		   = raw_ioctl,
+	.uring_cmd	   = raw_uring_cmd,
 	.init		   = raw_sk_init,
 	.setsockopt	   = raw_setsockopt,
 	.getsockopt	   = raw_getsockopt,
-- 
2.34.1
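
For raw sockets, the SIOCINQ case above reports the length of the first
queued packet only, not the total number of queued bytes. A small
userspace sketch of that peek-and-report behaviour follows. The
model_* types are hypothetical miniatures of sk_buff and its queue, and
the locking done by spin_lock_bh() in the patch is left out because the
model is single-threaded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of a queued packet. */
struct model_skb {
	struct model_skb *next;
	int len;
};

/* Hypothetical miniature of a receive queue: a singly linked list. */
struct model_queue {
	struct model_skb *head;
};

/* Like skb_peek(): look at the first packet without dequeuing it. */
static const struct model_skb *model_peek(const struct model_queue *q)
{
	return q->head;
}

/* Model of the raw SIOCINQ case: length of the next packet, 0 if empty. */
static int model_raw_inq(const struct model_queue *q)
{
	const struct model_skb *skb = model_peek(q);

	return skb ? skb->len : 0;
}
```

A queue holding a 100-byte packet followed by a 900-byte packet reports
100 here, not 1000.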

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [RFC PATCH 4/4] net: add uring_cmd callback to raw "protocol"
@ 2023-04-06 14:43 ` Breno Leitao
  0 siblings, 0 replies; 108+ messages in thread
From: Breno Leitao @ 2023-04-06 14:43 UTC (permalink / raw)
  To: io-uring, netdev, kuba, asml.silence, axboe
  Cc: leit, edumazet, pabeni, davem, dccp, mptcp, linux-kernel, dsahern,
	willemdebruijn.kernel, matthieu.baerts, marcelo.leitner

This is the implementation of uring_cmd for the raw "protocol". It
covers SOCKET_URING_OP_SIOCOUTQ and SOCKET_URING_OP_SIOCINQ, which are
similar to the SIOCOUTQ and SIOCINQ ioctls.

The returned value is the same one that the regular ioctl (raw_ioctl())
copies to userspace.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 include/net/raw.h |  3 +++
 net/ipv4/raw.c    | 25 +++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/net/raw.h b/include/net/raw.h
index 2c004c20ed99..ba7a96dce16b 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -99,4 +99,7 @@ static inline bool raw_sk_bound_dev_eq(struct net *net, int bound_dev_if,
 #endif
 }
 
+int raw_uring_cmd(struct sock *sk, struct io_uring_cmd *cmd,
+		  unsigned int issue_flags);
+
 #endif	/* _RAW_H */
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 94df935ee0c5..3db828bc1224 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -75,6 +75,7 @@
 #include <linux/netfilter_ipv4.h>
 #include <linux/compat.h>
 #include <linux/uio.h>
+#include <linux/io_uring.h>
 
 struct raw_frag_vec {
 	struct msghdr *msg;
@@ -857,6 +858,29 @@ static int raw_getsockopt(struct sock *sk, int level, int optname,
 	return do_raw_getsockopt(sk, level, optname, optval, optlen);
 }
 
+int raw_uring_cmd(struct sock *sk, struct io_uring_cmd *cmd,
+		  unsigned int issue_flags)
+{
+	switch (cmd->sqe->cmd_op) {
+	case SOCKET_URING_OP_SIOCOUTQ:
+		return sk_wmem_alloc_get(sk);
+	case SOCKET_URING_OP_SIOCINQ: {
+		struct sk_buff *skb;
+		int amount = 0;
+
+		spin_lock_bh(&sk->sk_receive_queue.lock);
+		skb = skb_peek(&sk->sk_receive_queue);
+		if (skb)
+			amount = skb->len;
+		spin_unlock_bh(&sk->sk_receive_queue.lock);
+		return amount;
+	}
+	default:
+		return -ENOIOCTLCMD;
+	}
+}
+EXPORT_SYMBOL_GPL(raw_uring_cmd);
+
 static int raw_ioctl(struct sock *sk, int cmd, unsigned long arg)
 {
 	switch (cmd) {
@@ -925,6 +949,7 @@ struct proto raw_prot = {
 	.connect	   = ip4_datagram_connect,
 	.disconnect	   = __udp_disconnect,
 	.ioctl		   = raw_ioctl,
+	.uring_cmd	   = raw_uring_cmd,
 	.init		   = raw_sk_init,
 	.setsockopt	   = raw_setsockopt,
 	.getsockopt	   = raw_getsockopt,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-06 14:43 ` Breno Leitao
@ 2023-04-06 15:34   ` Willem de Bruijn
  -1 siblings, 0 replies; 108+ messages in thread
From: Willem de Bruijn @ 2023-04-06 15:34 UTC (permalink / raw)
  To: dccp

On Thu, Apr 6, 2023 at 10:45 AM Breno Leitao <leitao@debian.org> wrote:
>
> From: Breno Leitao <leit@fb.com>
>
> This patchset creates the initial plumbing for a io_uring command for
> sockets.
>
> For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ
> and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations
> SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is
> heavily based on the ioctl operations.

This duplicates all the existing ioctl logic of each protocol.

Can this just call the existing proto_ops.ioctl internally and translate from/to
io_uring format as needed?

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
@ 2023-04-06 15:34   ` Willem de Bruijn
  0 siblings, 0 replies; 108+ messages in thread
From: Willem de Bruijn @ 2023-04-06 15:34 UTC (permalink / raw)
  To: Breno Leitao
  Cc: io-uring, netdev, kuba, asml.silence, axboe, leit, edumazet,
	pabeni, davem, dccp, mptcp, linux-kernel, dsahern,
	willemdebruijn.kernel, matthieu.baerts, marcelo.leitner

On Thu, Apr 6, 2023 at 10:45 AM Breno Leitao <leitao@debian.org> wrote:
>
> From: Breno Leitao <leit@fb.com>
>
> This patchset creates the initial plumbing for a io_uring command for
> sockets.
>
> For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ
> and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations
> SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is
> heavily based on the ioctl operations.

This duplicates all the existing ioctl logic of each protocol.

Can this just call the existing proto_ops.ioctl internally and translate from/to
io_uring format as needed?

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-06 15:34   ` Willem de Bruijn
@ 2023-04-06 15:59   ` Breno Leitao
  -1 siblings, 0 replies; 108+ messages in thread
From: Breno Leitao @ 2023-04-06 15:59 UTC (permalink / raw)
  To: dccp

On Thu, Apr 06, 2023 at 11:34:28AM -0400, Willem de Bruijn wrote:
> On Thu, Apr 6, 2023 at 10:45 AM Breno Leitao <leitao@debian.org> wrote:
> >
> > From: Breno Leitao <leit@fb.com>
> >
> > This patchset creates the initial plumbing for a io_uring command for
> > sockets.
> >
> > For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ
> > and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations
> > SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is
> > heavily based on the ioctl operations.
> 
> This duplicates all the existing ioctl logic of each protocol.
> 
> Can this just call the existing proto_ops.ioctl internally and translate from/to
> io_uring format as needed?

This is doable, and we have two options in this case:

1) Create an ioctl core function that does not call `put_user()`, and
call it from both `udp_ioctl` and `udp_uring_cmd`, doing the proper
translations. Something like:

	int udp_ioctl_core(struct sock *sk, int cmd, unsigned long arg)
	{
		int amount;
		switch (cmd) {
		case SIOCOUTQ: {
			amount = sk_wmem_alloc_get(sk);
			break;
		}
		case SIOCINQ: {
			amount = max_t(int, 0, first_packet_length(sk));
			break;
		}
		default:
			return -ENOIOCTLCMD;
		}
		return amount;
	}

	int udp_ioctl(struct sock *sk, int cmd, unsigned long arg)
	{
		int amount = udp_ioctl_core(sk, cmd, arg);

		/* Propagate errors instead of copying them to userspace */
		if (amount < 0)
			return amount;

		return put_user(amount, (int __user *)arg);
	}
	EXPORT_SYMBOL(udp_ioctl);


2) Create a function for each "case entry". This seems a bit silly for
UDP, but it makes more sense for other protocols. The code would look
something like:

	int udp_ioctl(struct sock *sk, int cmd, unsigned long arg)
	{
		switch (cmd) {
		case SIOCOUTQ:
		{
			int amount = udp_ioctl_siocoutq(sk);
			return put_user(amount, (int __user *)arg);
		}
		...
		}
	}

What is the best approach?
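
As a rough illustration of option 1, the split between a shared core
and two thin wrappers can be modelled in userspace. Everything named
model_* or MODEL_* below is hypothetical, including the command numbers
and the error value standing in for the kernel-internal ENOIOCTLCMD;
a plain pointer store takes the place of put_user().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical numbers; the real values live in kernel headers. */
enum { MODEL_SIOCOUTQ = 1, MODEL_SIOCINQ = 2 };
enum { MODEL_URING_OP_SIOCINQ = 0, MODEL_URING_OP_SIOCOUTQ = 1 };
#define MODEL_ENOIOCTLCMD 515	/* stand-in for the kernel-internal errno */

/* Hypothetical socket state: just the two values the commands report. */
struct model_sock {
	int wmem_alloc;
	int first_packet_len;	/* negative when the queue is empty */
};

/* Shared core: computes the answer and never touches user memory. */
static int model_ioctl_core(const struct model_sock *sk, int cmd)
{
	switch (cmd) {
	case MODEL_SIOCOUTQ:
		return sk->wmem_alloc;
	case MODEL_SIOCINQ:
		return sk->first_packet_len > 0 ? sk->first_packet_len : 0;
	default:
		return -MODEL_ENOIOCTLCMD;
	}
}

/* ioctl flavour: stores the answer through a pointer and returns 0. */
static int model_ioctl(const struct model_sock *sk, int cmd, int *user_arg)
{
	int amount = model_ioctl_core(sk, cmd);

	if (amount < 0)
		return amount;
	if (!user_arg)
		return -EFAULT;
	*user_arg = amount;
	return 0;
}

/* uring_cmd flavour: translates the op and returns the answer directly. */
static int model_uring_cmd(const struct model_sock *sk, int cmd_op)
{
	switch (cmd_op) {
	case MODEL_URING_OP_SIOCINQ:
		return model_ioctl_core(sk, MODEL_SIOCINQ);
	case MODEL_URING_OP_SIOCOUTQ:
		return model_ioctl_core(sk, MODEL_SIOCOUTQ);
	default:
		return -MODEL_ENOIOCTLCMD;
	}
}
```

The only difference between the two entry points is how the result
travels back: through the pointer for the ioctl, and as the return
value for the uring command.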

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
@ 2023-04-06 15:59   ` Breno Leitao
  0 siblings, 0 replies; 108+ messages in thread
From: Breno Leitao @ 2023-04-06 15:59 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: io-uring, netdev, kuba, asml.silence, axboe, leit, edumazet,
	pabeni, davem, dccp, mptcp, linux-kernel, dsahern,
	willemdebruijn.kernel, matthieu.baerts, marcelo.leitner

On Thu, Apr 06, 2023 at 11:34:28AM -0400, Willem de Bruijn wrote:
> On Thu, Apr 6, 2023 at 10:45 AM Breno Leitao <leitao@debian.org> wrote:
> >
> > From: Breno Leitao <leit@fb.com>
> >
> > This patchset creates the initial plumbing for a io_uring command for
> > sockets.
> >
> > For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ
> > and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations
> > SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is
> > heavily based on the ioctl operations.
> 
> This duplicates all the existing ioctl logic of each protocol.
> 
> Can this just call the existing proto_ops.ioctl internally and translate from/to
> io_uring format as needed?

This is doable, and we have two options in this case:

1) Create an ioctl core function that does not call `put_user()`, and
call it from both `udp_ioctl` and `udp_uring_cmd`, doing the proper
translations. Something like:

	int udp_ioctl_core(struct sock *sk, int cmd, unsigned long arg)
	{
		int amount;
		switch (cmd) {
		case SIOCOUTQ: {
			amount = sk_wmem_alloc_get(sk);
			break;
		}
		case SIOCINQ: {
			amount = max_t(int, 0, first_packet_length(sk));
			break;
		}
		default:
			return -ENOIOCTLCMD;
		}
		return amount;
	}

	int udp_ioctl(struct sock *sk, int cmd, unsigned long arg)
	{
		int amount = udp_ioctl_core(sk, cmd, arg);

		/* Propagate errors instead of copying them to userspace */
		if (amount < 0)
			return amount;

		return put_user(amount, (int __user *)arg);
	}
	EXPORT_SYMBOL(udp_ioctl);


2) Create a function for each "case entry". This seems a bit silly for
UDP, but it makes more sense for other protocols. The code would look
something like:

	int udp_ioctl(struct sock *sk, int cmd, unsigned long arg)
	{
		switch (cmd) {
		case SIOCOUTQ:
		{
			int amount = udp_ioctl_siocoutq(sk);
			return put_user(amount, (int __user *)arg);
		}
		...
		}
	}

What is the best approach?


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-06 14:43 ` Breno Leitao
@ 2023-04-06 16:41   ` Keith Busch
  -1 siblings, 0 replies; 108+ messages in thread
From: Keith Busch @ 2023-04-06 16:41 UTC (permalink / raw)
  To: dccp

On Thu, Apr 06, 2023 at 07:43:26AM -0700, Breno Leitao wrote:
> This patchset creates the initial plumbing for a io_uring command for
> sockets.
> 
> For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ
> and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations
> SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is
> heavily based on the ioctl operations.

Do you have asynchronous operations in mind for a future patch? The io_uring
command infrastructure makes more sense for operations that return EIOCBQUEUED,
otherwise it doesn't have much benefit over ioctl.
 
> In order to test this code, I created a liburing test, which is
> currently located at [1], and I will create a pull request once we are
> good with this patch.
> 
> I've also run test/io_uring_passthrough to make sure the first patch
> didn't regress the NVMe passthrough path.
> 
> This patchset is an RFC for two different reasons:
>   * It slightly changes how the io_uring command operates. I.e., we are
>     now passing the whole SQE to the io_uring_cmd callback (instead of
>     an opaque buffer). This seems more palatable than
>     creating some custom structure just to fit small parameters, as in
>     SOCKET_URING_OP_SIOC{IN,OUT}Q. Is this OK?

I think I'm missing something from this series. Where is the io_uring_cmd
change to point to the sqe?

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
@ 2023-04-06 16:41   ` Keith Busch
  0 siblings, 0 replies; 108+ messages in thread
From: Keith Busch @ 2023-04-06 16:41 UTC (permalink / raw)
  To: Breno Leitao
  Cc: io-uring, netdev, kuba, asml.silence, axboe, leit, edumazet,
	pabeni, davem, dccp, mptcp, linux-kernel, dsahern,
	willemdebruijn.kernel, matthieu.baerts, marcelo.leitner

On Thu, Apr 06, 2023 at 07:43:26AM -0700, Breno Leitao wrote:
> This patchset creates the initial plumbing for a io_uring command for
> sockets.
> 
> For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ
> and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations
> SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is
> heavily based on the ioctl operations.

Do you have asynchronous operations in mind for a future patch? The io_uring
command infrastructure makes more sense for operations that return EIOCBQUEUED,
otherwise it doesn't have much benefit over ioctl.
 
> In order to test this code, I created a liburing test, which is
> currently located at [1], and I will create a pull request once we are
> good with this patch.
> 
> I've also run test/io_uring_passthrough to make sure the first patch
> didn't regress the NVMe passthrough path.
> 
> This patchset is an RFC for two different reasons:
>   * It slightly changes how the io_uring command operates. I.e., we are
>     now passing the whole SQE to the io_uring_cmd callback (instead of
>     an opaque buffer). This seems more palatable than
>     creating some custom structure just to fit small parameters, as in
>     SOCKET_URING_OP_SIOC{IN,OUT}Q. Is this OK?

I think I'm missing something from this series. Where is the io_uring_cmd
change to point to the sqe?

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-06 16:41   ` Keith Busch
@ 2023-04-06 16:49   ` Jens Axboe
  -1 siblings, 0 replies; 108+ messages in thread
From: Jens Axboe @ 2023-04-06 16:49 UTC (permalink / raw)
  To: dccp

On 4/6/23 10:41 AM, Keith Busch wrote:
> On Thu, Apr 06, 2023 at 07:43:26AM -0700, Breno Leitao wrote:
>> This patchset creates the initial plumbing for a io_uring command for
>> sockets.
>>
>> For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ
>> and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations
>> SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is
>> heavily based on the ioctl operations.
> 
> Do you have asynchronous operations in mind for a future patch? The
> io_uring command infrastructure makes more sense for operations that
> return EIOCBQUEUED, otherwise it doesn't have much benefit over ioctl.

Basically nothing returns EIOCBQUEUED, it's mostly sync/poll driven on
the networking side. The primary use case for this is with direct
descriptors, as you can't do get/setsockopt with those. And that means
you'd then need to instantiate a regular descriptor first and then
register it, rather than keep it all direct from the start.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
@ 2023-04-06 16:49   ` Jens Axboe
  0 siblings, 0 replies; 108+ messages in thread
From: Jens Axboe @ 2023-04-06 16:49 UTC (permalink / raw)
  To: Keith Busch, Breno Leitao
  Cc: io-uring, netdev, kuba, asml.silence, leit, edumazet, pabeni,
	davem, dccp, mptcp, linux-kernel, dsahern, willemdebruijn.kernel,
	matthieu.baerts, marcelo.leitner

On 4/6/23 10:41 AM, Keith Busch wrote:
> On Thu, Apr 06, 2023 at 07:43:26AM -0700, Breno Leitao wrote:
>> This patchset creates the initial plumbing for a io_uring command for
>> sockets.
>>
>> For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ
>> and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations
>> SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is
>> heavily based on the ioctl operations.
> 
> Do you have asynchronous operations in mind for a future patch? The
> io_uring command infrastructure makes more sense for operations that
> return EIOCBQUEUED, otherwise it doesn't have much benefit over ioctl.

Basically nothing returns EIOCBQUEUED, it's mostly sync/poll driven on
the networking side. The primary use case for this is with direct
descriptors, as you can't do get/setsockopt with those. And that means
you'd then need to instantiate a regular descriptor first and then
register it, rather than keep it all direct from the start.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 108+ messages in thread

* [PATCH RFC] io_uring: Pass whole sqe to commands
  2023-04-06 14:43 ` Breno Leitao
@ 2023-04-06 16:57 ` Breno Leitao
  -1 siblings, 0 replies; 108+ messages in thread
From: Breno Leitao @ 2023-04-06 16:57 UTC (permalink / raw)
  To: dccp

Currently the uring CMD operation relies on having large SQEs, but
future operations might want to use normal SQEs.

The io_uring_cmd currently only saves the payload (cmd) part of the SQE,
but, for commands that use the normal SQE size, it might be necessary to
access the initial SQE fields outside of the payload/cmd block.  So,
save the whole SQE rather than just the pdu.

This slightly changes how io_uring_cmd works, since the cmd
structures and callbacks are not opaque to io_uring anymore. I.e., the
callbacks can look at the SQE fields, not only at the cmd structure.

The main advantage is that we don't need to create custom structures for
simple commands.

Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 drivers/block/ublk_drv.c  | 24 ++++++++++++------------
 drivers/nvme/host/ioctl.c |  2 +-
 include/linux/io_uring.h  |  2 +-
 io_uring/opdef.c          |  2 +-
 io_uring/uring_cmd.c      | 11 ++++++-----
 5 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index d1d1c8d606c8..0e35d82eb070 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1258,7 +1258,7 @@ static void ublk_handle_need_get_data(struct ublk_device *ub, int q_id,
 
 static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 {
-	struct ublksrv_io_cmd *ub_cmd = (struct ublksrv_io_cmd *)cmd->cmd;
+	struct ublksrv_io_cmd *ub_cmd = (struct ublksrv_io_cmd *)cmd->sqe->cmd;
 	struct ublk_device *ub = cmd->file->private_data;
 	struct ublk_queue *ubq;
 	struct ublk_io *io;
@@ -1562,7 +1562,7 @@ static struct ublk_device *ublk_get_device_from_id(int idx)
 
 static int ublk_ctrl_start_dev(struct ublk_device *ub, struct io_uring_cmd *cmd)
 {
-	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->sqe->cmd;
 	int ublksrv_pid = (int)header->data[0];
 	struct gendisk *disk;
 	int ret = -EINVAL;
@@ -1624,7 +1624,7 @@ static int ublk_ctrl_start_dev(struct ublk_device *ub, struct io_uring_cmd *cmd)
 static int ublk_ctrl_get_queue_affinity(struct ublk_device *ub,
 		struct io_uring_cmd *cmd)
 {
-	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->sqe->cmd;
 	void __user *argp = (void __user *)(unsigned long)header->addr;
 	cpumask_var_t cpumask;
 	unsigned long queue;
@@ -1675,7 +1675,7 @@ static inline void ublk_dump_dev_info(struct ublksrv_ctrl_dev_info *info)
 
 static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
 {
-	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->sqe->cmd;
 	void __user *argp = (void __user *)(unsigned long)header->addr;
 	struct ublksrv_ctrl_dev_info info;
 	struct ublk_device *ub;
@@ -1838,7 +1838,7 @@ static int ublk_ctrl_del_dev(struct ublk_device **p_ub)
 
 static inline void ublk_ctrl_cmd_dump(struct io_uring_cmd *cmd)
 {
-	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->sqe->cmd;
 
 	pr_devel("%s: cmd_op %x, dev id %d qid %d data %llx buf %llx len %u\n",
 			__func__, cmd->cmd_op, header->dev_id, header->queue_id,
@@ -1857,7 +1857,7 @@ static int ublk_ctrl_stop_dev(struct ublk_device *ub)
 static int ublk_ctrl_get_dev_info(struct ublk_device *ub,
 		struct io_uring_cmd *cmd)
 {
-	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->sqe->cmd;
 	void __user *argp = (void __user *)(unsigned long)header->addr;
 
 	if (header->len < sizeof(struct ublksrv_ctrl_dev_info) || !header->addr)
@@ -1888,7 +1888,7 @@ static void ublk_ctrl_fill_params_devt(struct ublk_device *ub)
 static int ublk_ctrl_get_params(struct ublk_device *ub,
 		struct io_uring_cmd *cmd)
 {
-	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->sqe->cmd;
 	void __user *argp = (void __user *)(unsigned long)header->addr;
 	struct ublk_params_header ph;
 	int ret;
@@ -1919,7 +1919,7 @@ static int ublk_ctrl_get_params(struct ublk_device *ub,
 static int ublk_ctrl_set_params(struct ublk_device *ub,
 		struct io_uring_cmd *cmd)
 {
-	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->sqe->cmd;
 	void __user *argp = (void __user *)(unsigned long)header->addr;
 	struct ublk_params_header ph;
 	int ret = -EFAULT;
@@ -1977,7 +1977,7 @@ static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
 static int ublk_ctrl_start_recovery(struct ublk_device *ub,
 		struct io_uring_cmd *cmd)
 {
-	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->sqe->cmd;
 	int ret = -EINVAL;
 	int i;
 
@@ -2019,7 +2019,7 @@ static int ublk_ctrl_start_recovery(struct ublk_device *ub,
 static int ublk_ctrl_end_recovery(struct ublk_device *ub,
 		struct io_uring_cmd *cmd)
 {
-	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->sqe->cmd;
 	int ublksrv_pid = (int)header->data[0];
 	int ret = -EINVAL;
 
@@ -2086,7 +2086,7 @@ static int ublk_char_dev_permission(struct ublk_device *ub,
 static int ublk_ctrl_uring_cmd_permission(struct ublk_device *ub,
 		struct io_uring_cmd *cmd)
 {
-	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->sqe->cmd;
 	bool unprivileged = ub->dev_info.flags & UBLK_F_UNPRIVILEGED_DEV;
 	void __user *argp = (void __user *)(unsigned long)header->addr;
 	char *dev_path = NULL;
@@ -2165,7 +2165,7 @@ static int ublk_ctrl_uring_cmd_permission(struct ublk_device *ub,
 static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd,
 		unsigned int issue_flags)
 {
-	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->sqe->cmd;
 	struct ublk_device *ub = NULL;
 	int ret = -EINVAL;
 
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 723e7d5b778f..304da8f3200b 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -550,7 +550,7 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 		struct io_uring_cmd *ioucmd, unsigned int issue_flags, bool vec)
 {
 	struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
-	const struct nvme_uring_cmd *cmd = ioucmd->cmd;
+	const struct nvme_uring_cmd *cmd = (struct nvme_uring_cmd *)ioucmd->sqe->cmd;
 	struct request_queue *q = ns ? ns->queue : ctrl->admin_q;
 	struct nvme_uring_data d;
 	struct nvme_command c;
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 934e5dd4ccc0..650e6f12cc18 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -24,7 +24,7 @@ enum io_uring_cmd_flags {
 
 struct io_uring_cmd {
 	struct file	*file;
-	const void	*cmd;
+	const struct io_uring_sqe *sqe;
 	union {
 		/* callback to defer completions to task context */
 		void (*task_work_cb)(struct io_uring_cmd *cmd);
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index cca7c5b55208..3b9c6489b8b6 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -627,7 +627,7 @@ const struct io_cold_def io_cold_defs[] = {
 	},
 	[IORING_OP_URING_CMD] = {
 		.name			= "URING_CMD",
-		.async_size		= uring_cmd_pdu_size(1),
+		.async_size		= 2 * sizeof(struct io_uring_sqe),
 		.prep_async		= io_uring_cmd_prep_async,
 	},
 	[IORING_OP_SEND_ZC] = {
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 2e4c483075d3..9648134ccae1 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -63,14 +63,15 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_done);
 int io_uring_cmd_prep_async(struct io_kiocb *req)
 {
 	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
-	size_t cmd_size;
+	size_t size = sizeof(struct io_uring_sqe);
 
 	BUILD_BUG_ON(uring_cmd_pdu_size(0) != 16);
 	BUILD_BUG_ON(uring_cmd_pdu_size(1) != 80);
 
-	cmd_size = uring_cmd_pdu_size(req->ctx->flags & IORING_SETUP_SQE128);
+	if (req->ctx->flags & IORING_SETUP_SQE128)
+		size <<= 1;
 
-	memcpy(req->async_data, ioucmd->cmd, cmd_size);
+	memcpy(req->async_data, ioucmd->sqe, size);
 	return 0;
 }
 
@@ -96,7 +97,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		req->imu = ctx->user_bufs[index];
 		io_req_set_rsrc_node(req, ctx, 0);
 	}
-	ioucmd->cmd = sqe->cmd;
+	ioucmd->sqe = sqe;
 	ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
 	return 0;
 }
@@ -128,7 +129,7 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
 	}
 
 	if (req_has_async_data(req))
-		ioucmd->cmd = req->async_data;
+		ioucmd->sqe = req->async_data;
 
 	ret = file->f_op->uring_cmd(ioucmd, issue_flags);
 	if (ret == -EAGAIN) {
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH RFC] io_uring: Pass whole sqe to commands
@ 2023-04-06 16:57 ` Breno Leitao
  0 siblings, 0 replies; 108+ messages in thread
From: Breno Leitao @ 2023-04-06 16:57 UTC (permalink / raw)
  To: leitao
  Cc: asml.silence, axboe, davem, dccp, dsahern, edumazet, io-uring,
	kuba, leit, linux-kernel, marcelo.leitner, matthieu.baerts, mptcp,
	netdev, pabeni, willemdebruijn.kernel

Currently the uring CMD operation relies on having large SQEs, but future
operations might want to use a normal-sized SQE.

The io_uring_cmd currently only saves the payload (cmd) part of the SQE,
but, for commands that use the normal SQE size, it might be necessary to
access the initial SQE fields outside of the payload/cmd block.  So,
save the whole SQE rather than just the pdu.

This slightly changes how io_uring_cmd works, since the cmd
structures and callbacks are not opaque to io_uring anymore. I.e., the
callbacks can look at all the SQE fields, not only at the cmd structure.

The main advantage is that we don't need to create custom structures for
simple commands.
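
For illustration only, the idea can be modelled in user-space C (this is
not kernel code). The names model_sqe, model_uring_cmd, MODEL_OP_INQ,
MODEL_OP_OUTQ and the field layout are made up for this sketch, and the
return values are placeholders. Only two things are taken from the patch:
the callback reaches cmd_op through cmd->sqe, and the size copied for
async use is doubled when SQE128 is set.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct io_uring_sqe. */
struct model_sqe {
	uint8_t  opcode;
	int32_t  fd;
	uint32_t cmd_op;   /* lives outside the payload */
	uint8_t  cmd[16];  /* command payload */
};

/* After the change, the command carries a pointer to the whole SQE. */
struct model_uring_cmd {
	const struct model_sqe *sqe;
};

enum { MODEL_OP_INQ = 1, MODEL_OP_OUTQ = 2 };

/* A simple command needs no custom payload struct: cmd_op is enough. */
static int model_handler(const struct model_uring_cmd *cmd)
{
	switch (cmd->sqe->cmd_op) {
	case MODEL_OP_INQ:
		return 100;  /* placeholder result */
	case MODEL_OP_OUTQ:
		return 200;  /* placeholder result */
	default:
		return -1;   /* unknown command */
	}
}

/* Bytes to copy for async use: one SQE, or two when SQE128 is set. */
static size_t model_async_copy_size(size_t sqe_size, int sqe128)
{
	return sqe128 ? sqe_size << 1 : sqe_size;
}
```

With a 64-byte SQE, model_async_copy_size() yields 64 bytes normally and
128 bytes when SQE128 is set.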

Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 drivers/block/ublk_drv.c  | 24 ++++++++++++------------
 drivers/nvme/host/ioctl.c |  2 +-
 include/linux/io_uring.h  |  2 +-
 io_uring/opdef.c          |  2 +-
 io_uring/uring_cmd.c      | 11 ++++++-----
 5 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index d1d1c8d606c8..0e35d82eb070 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1258,7 +1258,7 @@ static void ublk_handle_need_get_data(struct ublk_device *ub, int q_id,
 
 static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 {
-	struct ublksrv_io_cmd *ub_cmd = (struct ublksrv_io_cmd *)cmd->cmd;
+	struct ublksrv_io_cmd *ub_cmd = (struct ublksrv_io_cmd *)cmd->sqe->cmd;
 	struct ublk_device *ub = cmd->file->private_data;
 	struct ublk_queue *ubq;
 	struct ublk_io *io;
@@ -1562,7 +1562,7 @@ static struct ublk_device *ublk_get_device_from_id(int idx)
 
 static int ublk_ctrl_start_dev(struct ublk_device *ub, struct io_uring_cmd *cmd)
 {
-	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->sqe->cmd;
 	int ublksrv_pid = (int)header->data[0];
 	struct gendisk *disk;
 	int ret = -EINVAL;
@@ -1624,7 +1624,7 @@ static int ublk_ctrl_start_dev(struct ublk_device *ub, struct io_uring_cmd *cmd)
 static int ublk_ctrl_get_queue_affinity(struct ublk_device *ub,
 		struct io_uring_cmd *cmd)
 {
-	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->sqe->cmd;
 	void __user *argp = (void __user *)(unsigned long)header->addr;
 	cpumask_var_t cpumask;
 	unsigned long queue;
@@ -1675,7 +1675,7 @@ static inline void ublk_dump_dev_info(struct ublksrv_ctrl_dev_info *info)
 
 static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
 {
-	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->sqe->cmd;
 	void __user *argp = (void __user *)(unsigned long)header->addr;
 	struct ublksrv_ctrl_dev_info info;
 	struct ublk_device *ub;
@@ -1838,7 +1838,7 @@ static int ublk_ctrl_del_dev(struct ublk_device **p_ub)
 
 static inline void ublk_ctrl_cmd_dump(struct io_uring_cmd *cmd)
 {
-	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->sqe->cmd;
 
 	pr_devel("%s: cmd_op %x, dev id %d qid %d data %llx buf %llx len %u\n",
 			__func__, cmd->cmd_op, header->dev_id, header->queue_id,
@@ -1857,7 +1857,7 @@ static int ublk_ctrl_stop_dev(struct ublk_device *ub)
 static int ublk_ctrl_get_dev_info(struct ublk_device *ub,
 		struct io_uring_cmd *cmd)
 {
-	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->sqe->cmd;
 	void __user *argp = (void __user *)(unsigned long)header->addr;
 
 	if (header->len < sizeof(struct ublksrv_ctrl_dev_info) || !header->addr)
@@ -1888,7 +1888,7 @@ static void ublk_ctrl_fill_params_devt(struct ublk_device *ub)
 static int ublk_ctrl_get_params(struct ublk_device *ub,
 		struct io_uring_cmd *cmd)
 {
-	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->sqe->cmd;
 	void __user *argp = (void __user *)(unsigned long)header->addr;
 	struct ublk_params_header ph;
 	int ret;
@@ -1919,7 +1919,7 @@ static int ublk_ctrl_get_params(struct ublk_device *ub,
 static int ublk_ctrl_set_params(struct ublk_device *ub,
 		struct io_uring_cmd *cmd)
 {
-	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->sqe->cmd;
 	void __user *argp = (void __user *)(unsigned long)header->addr;
 	struct ublk_params_header ph;
 	int ret = -EFAULT;
@@ -1977,7 +1977,7 @@ static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
 static int ublk_ctrl_start_recovery(struct ublk_device *ub,
 		struct io_uring_cmd *cmd)
 {
-	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->sqe->cmd;
 	int ret = -EINVAL;
 	int i;
 
@@ -2019,7 +2019,7 @@ static int ublk_ctrl_start_recovery(struct ublk_device *ub,
 static int ublk_ctrl_end_recovery(struct ublk_device *ub,
 		struct io_uring_cmd *cmd)
 {
-	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->sqe->cmd;
 	int ublksrv_pid = (int)header->data[0];
 	int ret = -EINVAL;
 
@@ -2086,7 +2086,7 @@ static int ublk_char_dev_permission(struct ublk_device *ub,
 static int ublk_ctrl_uring_cmd_permission(struct ublk_device *ub,
 		struct io_uring_cmd *cmd)
 {
-	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->sqe->cmd;
 	bool unprivileged = ub->dev_info.flags & UBLK_F_UNPRIVILEGED_DEV;
 	void __user *argp = (void __user *)(unsigned long)header->addr;
 	char *dev_path = NULL;
@@ -2165,7 +2165,7 @@ static int ublk_ctrl_uring_cmd_permission(struct ublk_device *ub,
 static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd,
 		unsigned int issue_flags)
 {
-	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->sqe->cmd;
 	struct ublk_device *ub = NULL;
 	int ret = -EINVAL;
 
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 723e7d5b778f..304da8f3200b 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -550,7 +550,7 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 		struct io_uring_cmd *ioucmd, unsigned int issue_flags, bool vec)
 {
 	struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
-	const struct nvme_uring_cmd *cmd = ioucmd->cmd;
+	const struct nvme_uring_cmd *cmd = (struct nvme_uring_cmd *)ioucmd->sqe->cmd;
 	struct request_queue *q = ns ? ns->queue : ctrl->admin_q;
 	struct nvme_uring_data d;
 	struct nvme_command c;
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 934e5dd4ccc0..650e6f12cc18 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -24,7 +24,7 @@ enum io_uring_cmd_flags {
 
 struct io_uring_cmd {
 	struct file	*file;
-	const void	*cmd;
+	const struct io_uring_sqe *sqe;
 	union {
 		/* callback to defer completions to task context */
 		void (*task_work_cb)(struct io_uring_cmd *cmd);
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index cca7c5b55208..3b9c6489b8b6 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -627,7 +627,7 @@ const struct io_cold_def io_cold_defs[] = {
 	},
 	[IORING_OP_URING_CMD] = {
 		.name			= "URING_CMD",
-		.async_size		= uring_cmd_pdu_size(1),
+		.async_size		= 2 * sizeof(struct io_uring_sqe),
 		.prep_async		= io_uring_cmd_prep_async,
 	},
 	[IORING_OP_SEND_ZC] = {
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 2e4c483075d3..9648134ccae1 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -63,14 +63,15 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_done);
 int io_uring_cmd_prep_async(struct io_kiocb *req)
 {
 	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
-	size_t cmd_size;
+	size_t size = sizeof(struct io_uring_sqe);
 
 	BUILD_BUG_ON(uring_cmd_pdu_size(0) != 16);
 	BUILD_BUG_ON(uring_cmd_pdu_size(1) != 80);
 
-	cmd_size = uring_cmd_pdu_size(req->ctx->flags & IORING_SETUP_SQE128);
+	if (req->ctx->flags & IORING_SETUP_SQE128)
+		size <<= 1;
 
-	memcpy(req->async_data, ioucmd->cmd, cmd_size);
+	memcpy(req->async_data, ioucmd->sqe, size);
 	return 0;
 }
 
@@ -96,7 +97,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		req->imu = ctx->user_bufs[index];
 		io_req_set_rsrc_node(req, ctx, 0);
 	}
-	ioucmd->cmd = sqe->cmd;
+	ioucmd->sqe = sqe;
 	ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
 	return 0;
 }
@@ -128,7 +129,7 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
 	}
 
 	if (req_has_async_data(req))
-		ioucmd->cmd = req->async_data;
+		ioucmd->sqe = req->async_data;
 
 	ret = file->f_op->uring_cmd(ioucmd, issue_flags);
 	if (ret == -EAGAIN) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-06 16:41   ` Keith Busch
@ 2023-04-06 16:58   ` Breno Leitao
  -1 siblings, 0 replies; 108+ messages in thread
From: Breno Leitao @ 2023-04-06 16:58 UTC (permalink / raw)
  To: dccp

Hello Keith,

On Thu, Apr 06, 2023 at 10:41:52AM -0600, Keith Busch wrote:
> On Thu, Apr 06, 2023 at 07:43:26AM -0700, Breno Leitao wrote:
> > This patchset creates the initial plumbing for a io_uring command for
> > sockets.
> > 
> > For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ
> > and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations
> > SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is
> > heavily based on the ioctl operations.
> 
> Do you have asynchronous operations in mind for a future patch? The io_uring
> command infrastructure makes more sense for operations that return EIOCBQUEUED,
> otherwise it doesn't have much benefit over ioctl.

I think this brings value even for synchronous operations: you can
keep using io_uring operations for network operations, rather than
using some io_uring operations and then doing a regular ioctl(2).
So, it improves the user experience.

The other benefit is issuing several operations in a single io_uring
submit. It means you can save several syscalls and get the same work
done.

>  
> > In order to test this code, I created a liburing test, which is
> > currently located at [1], and I will create a pull request once we are
> > good with this patch.
> > 
> > I've also run test/io_uring_passthrough to make sure the first patch
> > didn't regressed the NVME passthrough path.
> > 
> > This patchset is a RFC for two different reasons:
> >   * It changes slighlty on how IO uring command operates. I.e, we are
> >     now passing the whole SQE to the io_uring_cmd callback (instead of
> >     an opaque buffer). This seems to be more palatable instead of
> >     creating some custom structure just to fit small parameters, as in
> >     SOCKET_URING_OP_SIOC{IN,OUT}Q. Is this OK?
> 
> I think I'm missing something from this series. Where is the io_uring_cmd
> change to point to the sqe?

My bad, the patch was not part of the patchset. I've just submitted it
under the same RFC cover letter now.

Here is the link, if it helps:

https://lkml.org/lkml/2023/4/6/990


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
@ 2023-04-06 16:58   ` Breno Leitao
  0 siblings, 0 replies; 108+ messages in thread
From: Breno Leitao @ 2023-04-06 16:58 UTC (permalink / raw)
  To: Keith Busch
  Cc: io-uring, netdev, kuba, asml.silence, axboe, leit, edumazet,
	pabeni, davem, dccp, mptcp, linux-kernel, dsahern,
	willemdebruijn.kernel, matthieu.baerts, marcelo.leitner

Hello Keith,

On Thu, Apr 06, 2023 at 10:41:52AM -0600, Keith Busch wrote:
> On Thu, Apr 06, 2023 at 07:43:26AM -0700, Breno Leitao wrote:
> > This patchset creates the initial plumbing for a io_uring command for
> > sockets.
> > 
> > For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ
> > and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations
> > SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is
> > heavily based on the ioctl operations.
> 
> Do you have asynchronous operations in mind for a future patch? The io_uring
> command infrastructure makes more sense for operations that return EIOCBQUEUED,
> otherwise it doesn't have much benefit over ioctl.

I think this brings value even for synchronous operations: you can
keep using io_uring operations for network operations, rather than
using some io_uring operations and then doing a regular ioctl(2).
So, it improves the user experience.

The other benefit is issuing several operations in a single io_uring
submit. It means you can save several syscalls and get the same work
done.

>  
> > In order to test this code, I created a liburing test, which is
> > currently located at [1], and I will create a pull request once we are
> > good with this patch.
> > 
> > I've also run test/io_uring_passthrough to make sure the first patch
> > didn't regressed the NVME passthrough path.
> > 
> > This patchset is a RFC for two different reasons:
> >   * It changes slighlty on how IO uring command operates. I.e, we are
> >     now passing the whole SQE to the io_uring_cmd callback (instead of
> >     an opaque buffer). This seems to be more palatable instead of
> >     creating some custom structure just to fit small parameters, as in
> >     SOCKET_URING_OP_SIOC{IN,OUT}Q. Is this OK?
> 
> I think I'm missing something from this series. Where is the io_uring_cmd
> change to point to the sqe?

My bad, the patch was not part of the patchset. I've just submitted it
under the same RFC cover letter now.

Here is the link, if it helps:

https://lkml.org/lkml/2023/4/6/990



^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: io_uring: Pass whole sqe to commands: Tests Results
  2023-04-06 16:57 ` Breno Leitao
  (?)
@ 2023-04-06 17:50 ` MPTCP CI
  -1 siblings, 0 replies; 108+ messages in thread
From: MPTCP CI @ 2023-04-06 17:50 UTC (permalink / raw)
  To: Breno Leitao; +Cc: mptcp

Hi Breno,

Thank you for your modifications, that's great!

Our CI did some validations and here is its report:

- KVM Validation: normal (except selftest_mptcp_join):
  - Success! ✅:
  - Task: https://cirrus-ci.com/task/6625829522767872
  - Summary: https://api.cirrus-ci.com/v1/artifact/task/6625829522767872/summary/summary.txt

- KVM Validation: debug (except selftest_mptcp_join):
  - Success! ✅:
  - Task: https://cirrus-ci.com/task/5711035848458240
  - Summary: https://api.cirrus-ci.com/v1/artifact/task/5711035848458240/summary/summary.txt

- KVM Validation: normal (only selftest_mptcp_join):
  - Success! ✅:
  - Task: https://cirrus-ci.com/task/4585135941615616
  - Summary: https://api.cirrus-ci.com/v1/artifact/task/4585135941615616/summary/summary.txt

- KVM Validation: debug (only selftest_mptcp_join):
  - Success! ✅:
  - Task: https://cirrus-ci.com/task/5148085895036928
  - Summary: https://api.cirrus-ci.com/v1/artifact/task/5148085895036928/summary/summary.txt

Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/035a67e200f0


If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:

    $ cd [kernel source code]
    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
        auto-debug

For more details:

    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker


Please note that despite all the efforts already made to have a
stable test suite when executed on a public CI like here, it is possible some
reported issues are not due to your modifications. Still, do not hesitate to
help us improve that ;-)

Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (Tessares)

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-06 15:59   ` Breno Leitao
@ 2023-04-06 18:16   ` Willem de Bruijn
  -1 siblings, 0 replies; 108+ messages in thread
From: Willem de Bruijn @ 2023-04-06 18:16 UTC (permalink / raw)
  To: dccp

On Thu, Apr 6, 2023 at 11:59 AM Breno Leitao <leitao@debian.org> wrote:
>
> On Thu, Apr 06, 2023 at 11:34:28AM -0400, Willem de Bruijn wrote:
> > On Thu, Apr 6, 2023 at 10:45 AM Breno Leitao <leitao@debian.org> wrote:
> > >
> > > From: Breno Leitao <leit@fb.com>
> > >
> > > This patchset creates the initial plumbing for a io_uring command for
> > > sockets.
> > >
> > > For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ
> > > and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations
> > > SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is
> > > heavily based on the ioctl operations.
> >
> > This duplicates all the existing ioctl logic of each protocol.
> >
> > Can this just call the existing proto_ops.ioctl internally and translate from/to
> > io_uring format as needed?
>
> This is doable, and we have two options in this case:
>
> 1) Create a ioctl core function that does not call `put_user()`, and
> call it from both the `udp_ioctl` and `udp_uring_cmd`, doing the proper
> translations. Something as:
>
>         int udp_ioctl_core(struct sock *sk, int cmd, unsigned long arg)
>         {
>                 int amount;
>                 switch (cmd) {
>                 case SIOCOUTQ: {
>                         amount = sk_wmem_alloc_get(sk);
>                         break;
>                 }
>                 case SIOCINQ: {
>                         amount = max_t(int, 0, first_packet_length(sk));
>                         break;
>                 }
>                 default:
>                         return -ENOIOCTLCMD;
>                 }
>                 return amount;
>         }
>
>         int udp_ioctl(struct sock *sk, int cmd, unsigned long arg)
>         {
>                 int amount = udp_ioctl_core(sk, cmd, arg);
>
>                 return put_user(amount, (int __user *)arg);
>         }
>         EXPORT_SYMBOL(udp_ioctl);
>
>
> 2) Create a function for each "case entry". This seems a bit silly for
> UDP, but it makes more sense for other protocols. The code will look
> something like:
>
>          int udp_ioctl(struct sock *sk, int cmd, unsigned long arg)
>          {
>                 switch (cmd) {
>                 case SIOCOUTQ:
>                 {
>                         int amount = udp_ioctl_siocoutq();
>                         return put_user(amount, (int __user *)arg);
>                 }
>                 ...
>           }
>
> What is the best approach?

Ah, the issue is that sock->ops->ioctl directly calls put_user.

I was thinking of just having sock_uring_cmd call sock->ops->ioctl, like
sock_do_ioctl.

But that would require those callbacks to return a negative error or
positive integer, rather than calling put_user, and then moving the
put_user to sock_do_ioctl. Such a change is at least as much code
change as your series, though without ending up with code
duplication. It also works only if all ioctls only put_user a value of
integer size. That's true for TCP, UDP and RAW, but I'm not sure it is
true more broadly.

Another approach may be to pass another argument to the ioctl
callbacks, saying whether to call put_user or return the integer and let
the caller take care of the output to user. This could possibly be
embedded in a high-order bit of the cmd, so that it fails on ioctl
callbacks that do not support this mode.
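
A rough user-space model of that flag idea, as a sketch only: the names
DEMO_CMD_OUTQ and DEMO_CMD_RETURN_VALUE and their values are invented
here, a plain pointer write stands in for put_user, and -ENOTTY stands in
for the kernel-internal -ENOIOCTLCMD. A callback that does not know the
flag sees an unknown command and fails. A callback that knows it masks
the bit off and returns the value instead of writing it out.

```c
#include <errno.h>
#include <stddef.h>

#define DEMO_CMD_OUTQ          0x10u       /* made-up command number */
#define DEMO_CMD_RETURN_VALUE  (1u << 30)  /* made-up "return, don't copy" flag */

/* Callback unaware of the flag: a flagged cmd matches no case and fails. */
static int demo_legacy_ioctl(int queued, unsigned int cmd, int *uarg)
{
	switch (cmd) {
	case DEMO_CMD_OUTQ:
		*uarg = queued;  /* stand-in for put_user() */
		return 0;
	default:
		return -ENOTTY;
	}
}

/* Callback aware of the flag: strips it and picks the output mode. */
static int demo_aware_ioctl(int queued, unsigned int cmd, int *uarg)
{
	int by_value = (cmd & DEMO_CMD_RETURN_VALUE) != 0;

	switch (cmd & ~DEMO_CMD_RETURN_VALUE) {
	case DEMO_CMD_OUTQ:
		if (by_value)
			return queued;  /* caller handles the output */
		*uarg = queued;     /* stand-in for put_user() */
		return 0;
	default:
		return -ENOTTY;
	}
}
```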

Of the two approaches you suggest, I find the first preferable.
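
To make the first approach concrete, here is a small user-space model
(a sketch under assumptions, not the kernel code): model_sock, the
MODEL_* command numbers and the function names are hypothetical, a plain
pointer write stands in for put_user, and -ENOTTY stands in for the
kernel-internal -ENOIOCTLCMD. The core helper never touches user memory.
The ioctl-style wrapper checks for an error before writing the result,
and the uring-style wrapper returns the value directly.

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_SIOCOUTQ 1  /* made-up command numbers */
#define MODEL_SIOCINQ  2

/* Hypothetical stand-in for the socket state the real code inspects. */
struct model_sock {
	int wmem_queued;    /* bytes waiting to be sent */
	int first_pkt_len;  /* length of first queued packet, -1 if none */
};

/* Core: returns a non-negative amount or a negative errno. */
static int model_ioctl_core(const struct model_sock *sk, int cmd)
{
	switch (cmd) {
	case MODEL_SIOCOUTQ:
		return sk->wmem_queued;
	case MODEL_SIOCINQ:
		return sk->first_pkt_len > 0 ? sk->first_pkt_len : 0;
	default:
		return -ENOTTY;
	}
}

/* ioctl-style wrapper: propagate errors, otherwise write the amount out. */
static int model_ioctl(const struct model_sock *sk, int cmd, int *uarg)
{
	int amount = model_ioctl_core(sk, cmd);

	if (amount < 0)
		return amount;
	if (!uarg)
		return -EFAULT;
	*uarg = amount;  /* stand-in for put_user() */
	return 0;
}

/* uring-style wrapper: the amount itself is the completion result. */
static int model_uring_cmd(const struct model_sock *sk, int cmd_op)
{
	return model_ioctl_core(sk, cmd_op);
}
```

The error check in model_ioctl() matters: without it, an unknown command
would write the negative error code to the user buffer.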

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
@ 2023-04-06 18:16   ` Willem de Bruijn
  0 siblings, 0 replies; 108+ messages in thread
From: Willem de Bruijn @ 2023-04-06 18:16 UTC (permalink / raw)
  To: Breno Leitao
  Cc: io-uring, netdev, kuba, asml.silence, axboe, leit, edumazet,
	pabeni, davem, dccp, mptcp, linux-kernel, dsahern,
	willemdebruijn.kernel, matthieu.baerts, marcelo.leitner

On Thu, Apr 6, 2023 at 11:59 AM Breno Leitao <leitao@debian.org> wrote:
>
> On Thu, Apr 06, 2023 at 11:34:28AM -0400, Willem de Bruijn wrote:
> > On Thu, Apr 6, 2023 at 10:45 AM Breno Leitao <leitao@debian.org> wrote:
> > >
> > > From: Breno Leitao <leit@fb.com>
> > >
> > > This patchset creates the initial plumbing for a io_uring command for
> > > sockets.
> > >
> > > For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ
> > > and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations
> > > SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is
> > > heavily based on the ioctl operations.
> >
> > This duplicates all the existing ioctl logic of each protocol.
> >
> > Can this just call the existing proto_ops.ioctl internally and translate from/to
> > io_uring format as needed?
>
> This is doable, and we have two options in this case:
>
> 1) Create a ioctl core function that does not call `put_user()`, and
> call it from both the `udp_ioctl` and `udp_uring_cmd`, doing the proper
> translations. Something as:
>
>         int udp_ioctl_core(struct sock *sk, int cmd, unsigned long arg)
>         {
>                 int amount;
>                 switch (cmd) {
>                 case SIOCOUTQ: {
>                         amount = sk_wmem_alloc_get(sk);
>                         break;
>                 }
>                 case SIOCINQ: {
>                         amount = max_t(int, 0, first_packet_length(sk));
>                         break;
>                 }
>                 default:
>                         return -ENOIOCTLCMD;
>                 }
>                 return amount;
>         }
>
>         int udp_ioctl(struct sock *sk, int cmd, unsigned long arg)
>         {
>                 int amount = udp_ioctl_core(sk, cmd, arg);
>
>                 return put_user(amount, (int __user *)arg);
>         }
>         EXPORT_SYMBOL(udp_ioctl);
>
>
> 2) Create a function for each "case entry". This seems a bit silly for
> UDP, but it makes more sense for other protocols. The code will look
> something like:
>
>          int udp_ioctl(struct sock *sk, int cmd, unsigned long arg)
>          {
>                 switch (cmd) {
>                 case SIOCOUTQ:
>                 {
>                         int amount = udp_ioctl_siocoutq();
>                         return put_user(amount, (int __user *)arg);
>                 }
>                 ...
>           }
>
> What is the best approach?

Ah, the issue is that sock->ops->ioctl directly calls put_user.

I was thinking of just having sock_uring_cmd call sock->ops->ioctl, like
sock_do_ioctl.

But that would require those callbacks to return a negative error or
positive integer, rather than calling put_user, and then moving the
put_user to sock_do_ioctl. Such a change is at least as much code
change as your series, though without ending up with code
duplication. It also works only if all ioctls only put_user a value of
integer size. That's true for TCP, UDP and RAW, but I'm not sure it is
true more broadly.

Another approach may be to pass another argument to the ioctl
callbacks, saying whether to call put_user or return the integer and let
the caller take care of the output to user. This could possibly be
embedded in a high-order bit of the cmd, so that it fails on ioctl
callbacks that do not support this mode.

Of the two approaches you suggest, I find the first preferable.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH 2/4] net: add uring_cmd callback to UDP
  2023-04-06 14:43 ` Breno Leitao
  (?)
@ 2023-04-06 19:03 ` kernel test robot
  -1 siblings, 0 replies; 108+ messages in thread
From: kernel test robot @ 2023-04-06 19:03 UTC (permalink / raw)
  To: Breno Leitao; +Cc: oe-kbuild-all

Hi Breno,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on net/main]
[also build test ERROR on net-next/main horms-ipvs/master linus/master v6.3-rc5 next-20230406]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Breno-Leitao/net-add-uring_cmd-callback-to-UDP/20230406-234400
patch link:    https://lore.kernel.org/r/20230406144330.1932798-3-leitao%40debian.org
patch subject: [RFC PATCH 2/4] net: add uring_cmd callback to UDP
config: powerpc-allmodconfig (https://download.01.org/0day-ci/archive/20230407/202304070244.kmFlDfh5-lkp@intel.com/config)
compiler: powerpc-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/481b4799b28c2e6bc49578ebbf64f7506df13804
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Breno-Leitao/net-add-uring_cmd-callback-to-UDP/20230406-234400
        git checkout 481b4799b28c2e6bc49578ebbf64f7506df13804
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=powerpc olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=powerpc SHELL=/bin/bash net/ipv4/

If you fix the issue, kindly add the following tags where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304070244.kmFlDfh5-lkp@intel.com/

All errors (new ones prefixed by >>):

   net/ipv4/udp.c: In function 'udp_uring_cmd':
>> net/ipv4/udp.c:1718:20: error: 'struct io_uring_cmd' has no member named 'sqe'
    1718 |         switch (cmd->sqe->cmd_op) {
         |                    ^~
   net/ipv4/udp.c:1726:1: error: control reaches end of non-void function [-Werror=return-type]
    1726 | }
         | ^
   cc1: some warnings being treated as errors


vim +1718 net/ipv4/udp.c

  1714	
  1715	int udp_uring_cmd(struct sock *sk, struct io_uring_cmd *cmd,
  1716			  unsigned int issue_flags)
  1717	{
> 1718		switch (cmd->sqe->cmd_op) {
  1719		case SOCKET_URING_OP_SIOCOUTQ:
  1720			return sk_wmem_alloc_get(sk);
  1721		case SOCKET_URING_OP_SIOCINQ:
  1722			return max_t(int, 0, first_packet_length(sk));
  1723		default:
  1724			return -ENOIOCTLCMD;
  1725		}
  1726	}
  1727	EXPORT_SYMBOL_GPL(udp_uring_cmd);
  1728	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH 3/4] net: add uring_cmd callback to TCP
  2023-04-06 14:43 ` Breno Leitao
  (?)
@ 2023-04-06 20:35 ` kernel test robot
  -1 siblings, 0 replies; 108+ messages in thread
From: kernel test robot @ 2023-04-06 20:35 UTC (permalink / raw)
  To: Breno Leitao; +Cc: oe-kbuild-all

Hi Breno,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on net/main]
[also build test ERROR on net-next/main horms-ipvs/master linus/master v6.3-rc5 next-20230406]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Breno-Leitao/net-add-uring_cmd-callback-to-UDP/20230406-234400
patch link:    https://lore.kernel.org/r/20230406144330.1932798-4-leitao%40debian.org
patch subject: [RFC PATCH 3/4] net: add uring_cmd callback to TCP
config: powerpc-allmodconfig (https://download.01.org/0day-ci/archive/20230407/202304070428.kAgcjF1Z-lkp@intel.com/config)
compiler: powerpc-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/01b93921bef2f71c4dd39e76f68f6227279c6c81
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Breno-Leitao/net-add-uring_cmd-callback-to-UDP/20230406-234400
        git checkout 01b93921bef2f71c4dd39e76f68f6227279c6c81
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=powerpc olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=powerpc SHELL=/bin/bash net/ipv4/

If you fix the issue, kindly add the following tags where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304070428.kAgcjF1Z-lkp@intel.com/

All errors (new ones prefixed by >>):

   net/ipv4/tcp.c: In function 'tcp_uring_cmd':
>> net/ipv4/tcp.c:607:20: error: 'struct io_uring_cmd' has no member named 'sqe'
     607 |         switch (cmd->sqe->cmd_op) {
         |                    ^~
   net/ipv4/tcp.c:628:1: error: control reaches end of non-void function [-Werror=return-type]
     628 | }
         | ^
   cc1: some warnings being treated as errors


vim +607 net/ipv4/tcp.c

   599	
   600	int tcp_uring_cmd(struct sock *sk, struct io_uring_cmd *cmd,
   601			  unsigned int issue_flags)
   602	{
   603		struct tcp_sock *tp = tcp_sk(sk);
   604		bool slow;
   605		int ret;
   606	
 > 607		switch (cmd->sqe->cmd_op) {
   608		case SOCKET_URING_OP_SIOCINQ:
   609			if (sk->sk_state == TCP_LISTEN)
   610				return -EINVAL;
   611	
   612			slow = lock_sock_fast(sk);
   613			ret = tcp_inq(sk);
   614			unlock_sock_fast(sk, slow);
   615			return ret;
   616		case SOCKET_URING_OP_SIOCOUTQ:
   617			if (sk->sk_state == TCP_LISTEN)
   618				return -EINVAL;
   619	
   620			if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))
   621				ret = 0;
   622			else
   623				ret = READ_ONCE(tp->write_seq) - tp->snd_una;
   624			return ret;
   625		default:
   626			return -ENOIOCTLCMD;
   627		}
   628	}
   629	EXPORT_SYMBOL_GPL(tcp_uring_cmd);
   630	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
@ 2023-04-07  2:46   ` David Ahern
  0 siblings, 0 replies; 108+ messages in thread
From: David Ahern @ 2023-04-07  2:46 UTC (permalink / raw)
  To: Willem de Bruijn, Breno Leitao
  Cc: io-uring, netdev, kuba, asml.silence, axboe, leit, edumazet,
	pabeni, davem, dccp, mptcp, linux-kernel, willemdebruijn.kernel,
	matthieu.baerts, marcelo.leitner

On 4/6/23 12:16 PM, Willem de Bruijn wrote:
> On Thu, Apr 6, 2023 at 11:59 AM Breno Leitao <leitao@debian.org> wrote:
>>
>> On Thu, Apr 06, 2023 at 11:34:28AM -0400, Willem de Bruijn wrote:
>>> On Thu, Apr 6, 2023 at 10:45 AM Breno Leitao <leitao@debian.org> wrote:
>>>>
>>>> From: Breno Leitao <leit@fb.com>
>>>>
>>>> This patchset creates the initial plumbing for a io_uring command for
>>>> sockets.
>>>>
>>>> For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ
>>>> and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations
>>>> SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is
>>>> heavily based on the ioctl operations.
>>>
>>> This duplicates all the existing ioctl logic of each protocol.
>>>
>>> Can this just call the existing proto_ops.ioctl internally and translate from/to
>>> io_uring format as needed?
>>
>> This is doable, and we have two options in this case:
>>
>> 1) Create a ioctl core function that does not call `put_user()`, and
>> call it from both the `udp_ioctl` and `udp_uring_cmd`, doing the proper
>> translations. Something like:
>>
>>         int udp_ioctl_core(struct sock *sk, int cmd, unsigned long arg)
>>         {
>>                 int amount;
>>                 switch (cmd) {
>>                 case SIOCOUTQ: {
>>                         amount = sk_wmem_alloc_get(sk);
>>                         break;
>>                 }
>>                 case SIOCINQ: {
>>                         amount = max_t(int, 0, first_packet_length(sk));
>>                         break;
>>                 }
>>                 default:
>>                         return -ENOIOCTLCMD;
>>                 }
>>                 return amount;
>>         }
>>
>>         int udp_ioctl(struct sock *sk, int cmd, unsigned long arg)
>>         {
>>                 int amount = udp_ioctl_core(sk, cmd, arg);
>>
>>                 return put_user(amount, (int __user *)arg);
>>         }
>>         EXPORT_SYMBOL(udp_ioctl);
>>
>>
>> 2) Create a function for each "case entry". This seems a bit silly for
>> UDP, but it makes more sense for other protocols. The code will look
>> something like:
>>
>>          int udp_ioctl(struct sock *sk, int cmd, unsigned long arg)
>>          {
>>                 switch (cmd) {
>>                 case SIOCOUTQ:
>>                 {
>>                         int amount = udp_ioctl_siocoutq();
>>                         return put_user(amount, (int __user *)arg);
>>                 }
>>                 ...
>>           }
>>
>> What is the best approach?
> 
> Ah, the issue is that sock->ops->ioctl directly calls put_user.
> 
> I was thinking just having sock_uring_cmd call sock->ops->ioctl, like
> sock_do_ioctl.
> 
> But that would require those callbacks to return a negative error or
> positive integer, rather than calling put_user. And then move the
> put_user to sock_do_ioctl. Such a change is at least as much code
> change as your series, though without ending up with code
> duplication. It also works only if all ioctls only put_user of integer
> size. That's true for TCP, UDP and RAW, but not sure if true more
> broadly.
> 
> Another approach may be to pass another argument to the ioctl
> callbacks, whether to call put_user or return the integer and let the
> caller take care of the output to user. This could possibly be
> embedded in a high-order bit of the cmd, so that it fails on ioctl
> callbacks that do not support this mode.
> 
> Of the two approaches you suggest, I find the first preferable.

The first approach sounds better to me and it would be good to avoid
io_uring details in the networking code (i.e., cmd->sqe->cmd_op).
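
For reference, a minimal userspace sketch of the first approach: one core
helper that returns the value or a negative error, shared by an ioctl-style
wrapper and a uring_cmd-style wrapper. Everything below is a mock made up for
illustration (struct mock_sock, the MOCK_* command values, mock_put_user()
and the two queue fields); it is not the kernel's UDP code, and -ENOTTY only
stands in for the kernel-internal -ENOIOCTLCMD.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are kernel definitions. */
enum { MOCK_SIOCOUTQ = 1, MOCK_SIOCINQ = 2 };

struct mock_sock {
	int wmem_queued;	/* bytes waiting in the send queue */
	int first_pkt_len;	/* length of the first queued packet, or -1 */
};

/* Core helper: returns the amount (>= 0) or a negative errno. It never
 * touches user memory, so both entry points below can share it. */
static int mock_ioctl_core(const struct mock_sock *sk, int cmd)
{
	assert(sk != NULL);

	switch (cmd) {
	case MOCK_SIOCOUTQ:
		return sk->wmem_queued;
	case MOCK_SIOCINQ:
		return sk->first_pkt_len > 0 ? sk->first_pkt_len : 0;
	default:
		return -ENOTTY;	/* stand-in for the kernel's -ENOIOCTLCMD */
	}
}

/* Stand-in for put_user(): store the value through the caller's pointer. */
static int mock_put_user(int val, int *uptr)
{
	if (!uptr)
		return -EFAULT;
	*uptr = val;
	return 0;
}

/* ioctl-style wrapper: the result goes out through a pointer. */
static int mock_ioctl(const struct mock_sock *sk, int cmd, int *arg)
{
	int amount = mock_ioctl_core(sk, cmd);

	if (amount < 0)
		return amount;	/* do not copy an error code out as data */
	return mock_put_user(amount, arg);
}

/* uring_cmd-style wrapper: the result is simply the return value. */
static int mock_uring_cmd(const struct mock_sock *sk, int cmd)
{
	return mock_ioctl_core(sk, cmd);
}
```

Note that the ioctl-style wrapper checks for a negative return before
copying anything out, so an unknown command is reported as an error instead
of being written to the caller as data.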

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC] io_uring: Pass whole sqe to commands
@ 2023-04-07 18:51   ` Keith Busch
  0 siblings, 0 replies; 108+ messages in thread
From: Keith Busch @ 2023-04-07 18:51 UTC (permalink / raw)
  To: Breno Leitao
  Cc: asml.silence, axboe, davem, dccp, dsahern, edumazet, io-uring,
	kuba, leit, linux-kernel, marcelo.leitner, matthieu.baerts, mptcp,
	netdev, pabeni, willemdebruijn.kernel

On Thu, Apr 06, 2023 at 09:57:05AM -0700, Breno Leitao wrote:
> Currently uring CMD operation relies on having large SQEs, but future
> operations might want to use normal SQE.
> 
> The io_uring_cmd currently only saves the payload (cmd) part of the SQE,
> but, for commands that use normal SQE size, it might be necessary to
> access the initial SQE fields outside of the payload/cmd block.  So,
> save the whole SQE rather than just the pdu.
> 
> This slightly changes how io_uring_cmd works, since the cmd
> structures and callbacks are not opaque to io_uring anymore. I.e., the
> callbacks can look at the whole SQE, not only at the cmd structure.
> 
> The main advantage is that we don't need to create custom structures for
> simple commands.

This looks good to me. The only disadvantage I can see is that the async
fallback allocates just a tiny bit more data than before, but no biggie.

Reviewed-by: Keith Busch <kbusch@kernel.org>
 
> @@ -63,14 +63,15 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_done);
>  int io_uring_cmd_prep_async(struct io_kiocb *req)
>  {
>  	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
> -	size_t cmd_size;
> +	size_t size = sizeof(struct io_uring_sqe);
>  
>  	BUILD_BUG_ON(uring_cmd_pdu_size(0) != 16);
>  	BUILD_BUG_ON(uring_cmd_pdu_size(1) != 80);

One minor suggestion. The above is the only user of uring_cmd_pdu_size() now,
which is kind of a convoluted way to enforce the offset of the 'cmd' field. It
may be more clear to replace these with:

	BUILD_BUG_ON(offsetof(struct io_uring_sqe, cmd) == 48);
  
> -	cmd_size = uring_cmd_pdu_size(req->ctx->flags & IORING_SETUP_SQE128);
> +	if (req->ctx->flags & IORING_SETUP_SQE128)
> +		size <<= 1;
>  
> -	memcpy(req->async_data, ioucmd->cmd, cmd_size);
> +	memcpy(req->async_data, ioucmd->sqe, size);
>  	return 0;
>  }
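
To put a number on "a tiny bit more data", here is a small sketch of the
copy size before and after the hunk above. It assumes the sizes discussed in
this thread (a 64-byte SQE with cmd at offset 48); the MOCK_* constants and
the two helper names are invented for the sketch, and the SQE128 ring flag
is modelled as a plain bool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Sizes taken from this thread; the names are made up for the sketch. */
#define MOCK_SQE_SIZE   64u	/* sizeof(struct io_uring_sqe) */
#define MOCK_CMD_OFFSET 48u	/* offsetof(struct io_uring_sqe, cmd) */

static_assert(MOCK_SQE_SIZE > MOCK_CMD_OFFSET, "cmd must lie inside the SQE");

/* Before the patch: only the cmd payload (pdu) is copied. */
static size_t old_copy_size(bool sqe128)
{
	size_t sqe = sqe128 ? 2 * MOCK_SQE_SIZE : MOCK_SQE_SIZE;

	return sqe - MOCK_CMD_OFFSET;
}

/* After the patch: the whole SQE is copied, doubled for SQE128 rings. */
static size_t new_copy_size(bool sqe128)
{
	size_t size = MOCK_SQE_SIZE;

	if (sqe128)
		size <<= 1;
	return size;
}
```

Under these assumptions the copy grows by 48 bytes in both modes (16 -> 64
for a normal SQE, 80 -> 128 with SQE128).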

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: io_uring: Pass whole sqe to commands: Tests Results
  2023-04-06 16:57 ` Breno Leitao
                   ` (2 preceding siblings ...)
  (?)
@ 2023-04-07 19:43 ` MPTCP CI
  -1 siblings, 0 replies; 108+ messages in thread
From: MPTCP CI @ 2023-04-07 19:43 UTC (permalink / raw)
  To: Breno Leitao; +Cc: mptcp

Hi Breno,

Thank you for your modifications, that's great!

Our CI did some validations and here is its report:

- KVM Validation: normal (except selftest_mptcp_join):
  - Success! ✅:
  - Task: https://cirrus-ci.com/task/6539795959119872
  - Summary: https://api.cirrus-ci.com/v1/artifact/task/6539795959119872/summary/summary.txt

- KVM Validation: debug (only selftest_mptcp_join):
  - Success! ✅:
  - Task: https://cirrus-ci.com/task/5273158563921920
  - Summary: https://api.cirrus-ci.com/v1/artifact/task/5273158563921920/summary/summary.txt

- KVM Validation: normal (only selftest_mptcp_join):
  - Success! ✅:
  - Task: https://cirrus-ci.com/task/4710208610500608
  - Summary: https://api.cirrus-ci.com/v1/artifact/task/4710208610500608/summary/summary.txt

- KVM Validation: debug (except selftest_mptcp_join):
  - Success! ✅:
  - Task: https://cirrus-ci.com/task/5836108517343232
  - Summary: https://api.cirrus-ci.com/v1/artifact/task/5836108517343232/summary/summary.txt

Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/9999e6372112


If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:

    $ cd [kernel source code]
    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
        auto-debug

For more details:

    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker


Please note that despite all the efforts already made to have a
stable test suite when executed on a public CI like here, it is possible some
reported issues are not due to your modifications. Still, do not hesitate to
help us improve that ;-)

Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (Tessares)

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
@ 2023-04-11 12:00   ` Breno Leitao
  0 siblings, 0 replies; 108+ messages in thread
From: Breno Leitao @ 2023-04-11 12:00 UTC (permalink / raw)
  To: David Ahern
  Cc: Willem de Bruijn, io-uring, netdev, kuba, asml.silence, axboe,
	leit, edumazet, pabeni, davem, dccp, mptcp, linux-kernel,
	willemdebruijn.kernel, matthieu.baerts, marcelo.leitner

On Thu, Apr 06, 2023 at 08:46:38PM -0600, David Ahern wrote:
> On 4/6/23 12:16 PM, Willem de Bruijn wrote:
> > On Thu, Apr 6, 2023 at 11:59 AM Breno Leitao <leitao@debian.org> wrote:
> >>
> >> On Thu, Apr 06, 2023 at 11:34:28AM -0400, Willem de Bruijn wrote:
> >>> On Thu, Apr 6, 2023 at 10:45 AM Breno Leitao <leitao@debian.org> wrote:
> >>>>
> >>>> From: Breno Leitao <leit@fb.com>
> >>>>
> >>>> This patchset creates the initial plumbing for a io_uring command for
> >>>> sockets.
> >>>>
> >>>> For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ
> >>>> and SOCKET_URING_OP_SIOCINQ. They are similar to ioctl operations
> >>>> SIOCOUTQ and SIOCINQ. In fact, the code on the protocol side itself is
> >>>> heavily based on the ioctl operations.
> >>>
> >>> This duplicates all the existing ioctl logic of each protocol.
> >>>
> >>> Can this just call the existing proto_ops.ioctl internally and translate from/to
> >>> io_uring format as needed?
> >>
> >> This is doable, and we have two options in this case:
> >>
> >> 1) Create a ioctl core function that does not call `put_user()`, and
> >> call it from both the `udp_ioctl` and `udp_uring_cmd`, doing the proper
> >> translations. Something like:
> >>
> >>         int udp_ioctl_core(struct sock *sk, int cmd, unsigned long arg)
> >>         {
> >>                 int amount;
> >>                 switch (cmd) {
> >>                 case SIOCOUTQ: {
> >>                         amount = sk_wmem_alloc_get(sk);
> >>                         break;
> >>                 }
> >>                 case SIOCINQ: {
> >>                         amount = max_t(int, 0, first_packet_length(sk));
> >>                         break;
> >>                 }
> >>                 default:
> >>                         return -ENOIOCTLCMD;
> >>                 }
> >>                 return amount;
> >>         }
> >>
> >>         int udp_ioctl(struct sock *sk, int cmd, unsigned long arg)
> >>         {
> >>                 int amount = udp_ioctl_core(sk, cmd, arg);
> >>
> >>                 return put_user(amount, (int __user *)arg);
> >>         }
> >>         EXPORT_SYMBOL(udp_ioctl);
> >>
> >>
> >> 2) Create a function for each "case entry". This seems a bit silly for
> >> UDP, but it makes more sense for other protocols. The code will look
> >> something like:
> >>
> >>          int udp_ioctl(struct sock *sk, int cmd, unsigned long arg)
> >>          {
> >>                 switch (cmd) {
> >>                 case SIOCOUTQ:
> >>                 {
> >>                         int amount = udp_ioctl_siocoutq();
> >>                         return put_user(amount, (int __user *)arg);
> >>                 }
> >>                 ...
> >>           }
> >>
> >> What is the best approach?
> > 
> > Ah, the issue is that sock->ops->ioctl directly calls put_user.
> > 
> > I was thinking just having sock_uring_cmd call sock->ops->ioctl, like
> > sock_do_ioctl.
> > 
> > But that would require those callbacks to return a negative error or
> > positive integer, rather than calling put_user. And then move the
> > put_user to sock_do_ioctl. Such a change is at least as much code
> > change as your series, though without ending up with code
> > duplication. It also works only if all ioctls only put_user of integer
> > size. That's true for TCP, UDP and RAW, but not sure if true more
> > broadly.
> > 
> > Another approach may be to pass another argument to the ioctl
> > callbacks, whether to call put_user or return the integer and let the
> > caller take care of the output to user. This could possibly be
> > embedded in a high-order bit of the cmd, so that it fails on ioctl
> > callbacks that do not support this mode.
> > 
> > Of the two approaches you suggest, I find the first preferable.
> 
> The first approach sounds better to me and it would be good to avoid
> io_uring details in the networking code (i.e., cmd->sqe->cmd_op).

I am not sure if avoiding io_uring details in network code is possible.

The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd()
in the TCP case) could be somewhere else, such as in the io_uring/
directory, but, I think it might be cleaner if these implementations are
closer to function assignment (in the network subsystem).

And this function (tcp_uring_cmd() for instance) is where I am
planning to map io_uring CMDs to ioctls, such as SOCKET_URING_OP_SIOCINQ
-> SIOCINQ.
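
A rough sketch of what such a mapping could look like, written as a
userspace program so it can be compiled on its own. The enum values are the
ones proposed in patch 2/4; socket_uring_op_to_ioctl() and its -1 "no
mapping" return are hypothetical and not part of this series. It assumes a
Linux userspace where <linux/sockios.h> provides SIOCINQ and SIOCOUTQ.

```c
#include <assert.h>
#include <sys/ioctl.h>		/* FIONREAD, TIOCOUTQ */
#include <linux/sockios.h>	/* SIOCINQ, SIOCOUTQ */

/* Values copied from the enum proposed in patch 2/4 of this series. */
enum {
	SOCKET_URING_OP_SIOCINQ = 0,
	SOCKET_URING_OP_SIOCOUTQ,
};

static_assert(SIOCINQ != SIOCOUTQ, "the two ioctl numbers must be distinct");

/* Hypothetical helper: translate a uring cmd_op into the matching ioctl
 * number, or return -1 if the op has no ioctl counterpart. */
static long socket_uring_op_to_ioctl(unsigned int cmd_op)
{
	switch (cmd_op) {
	case SOCKET_URING_OP_SIOCINQ:
		return SIOCINQ;
	case SOCKET_URING_OP_SIOCOUTQ:
		return SIOCOUTQ;
	default:
		return -1;
	}
}
```

With a translation step like this, the protocol-side code would only ever
see the existing ioctl numbers, which is one way to keep cmd_op handling in
a single place.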

Please let me know if you have any other idea in mind.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC] io_uring: Pass whole sqe to commands
@ 2023-04-11 12:22   ` Breno Leitao
  0 siblings, 0 replies; 108+ messages in thread
From: Breno Leitao @ 2023-04-11 12:22 UTC (permalink / raw)
  To: Keith Busch
  Cc: asml.silence, axboe, davem, dccp, dsahern, edumazet, io-uring,
	kuba, leit, linux-kernel, marcelo.leitner, matthieu.baerts, mptcp,
	netdev, pabeni, willemdebruijn.kernel

On Fri, Apr 07, 2023 at 12:51:44PM -0600, Keith Busch wrote:
> > @@ -63,14 +63,15 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_done);
> >  int io_uring_cmd_prep_async(struct io_kiocb *req)
> >  {
> >  	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
> > -	size_t cmd_size;
> > +	size_t size = sizeof(struct io_uring_sqe);
> >  
> >  	BUILD_BUG_ON(uring_cmd_pdu_size(0) != 16);
> >  	BUILD_BUG_ON(uring_cmd_pdu_size(1) != 80);
> 
> One minor suggestion. The above is the only user of uring_cmd_pdu_size() now,
> which is kind of a convoluted way to enforce the offset of the 'cmd' field. It
> may be more clear to replace these with:

I agree with you here. Basically it is a bug if the payload (pdu) size is
different from 16 for a single SQE or from 80 for an extended SQE.

So, basically it is checking for two things:
   * the cmd offset is 48
   * the io_uring_sqe struct is 64

Since this is a uapi, I doubt that they will change at all.
I can replace the code with your suggestion.
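
The two conditions above account for both constants (64 - 48 = 16 and
2 * 64 - 48 = 80). As a sketch of how they could be stated directly at
compile time, here is a userspace mock: struct mock_sqe and
MOCK_BUILD_BUG_ON() are made up for illustration and only mimic the
polarity of the kernel macro, which breaks the build when its condition is
true.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock with only the two properties discussed above: 48 bytes of leading
 * fields, then the start of the cmd payload. Not the real uapi struct. */
struct mock_sqe {
	uint8_t head[48];
	uint8_t cmd[16];
};

/* Hypothetical stand-in for BUILD_BUG_ON(): the build fails when cond is
 * true, so cond has to describe the broken layout, not the good one. */
#define MOCK_BUILD_BUG_ON(cond) \
	static_assert(!(cond), "MOCK_BUILD_BUG_ON failed: " #cond)

MOCK_BUILD_BUG_ON(offsetof(struct mock_sqe, cmd) != 48);
MOCK_BUILD_BUG_ON(sizeof(struct mock_sqe) != 64);
```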

> 	BUILD_BUG_ON(offsetof(struct io_uring_sqe, cmd) == 48);

It should be "offsetof(struct io_uring_sqe, cmd) != 48", right?

Thanks for the review!

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC] io_uring: Pass whole sqe to commands
@ 2023-04-11 12:39   ` Pavel Begunkov
  0 siblings, 0 replies; 108+ messages in thread
From: Pavel Begunkov @ 2023-04-11 12:39 UTC (permalink / raw)
  To: Breno Leitao, Keith Busch
  Cc: axboe, davem, dccp, dsahern, edumazet, io-uring, kuba, leit,
	linux-kernel, marcelo.leitner, matthieu.baerts, mptcp, netdev,
	pabeni, willemdebruijn.kernel

On 4/11/23 13:22, Breno Leitao wrote:
> On Fri, Apr 07, 2023 at 12:51:44PM -0600, Keith Busch wrote:
>>> @@ -63,14 +63,15 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_done);
>>>   int io_uring_cmd_prep_async(struct io_kiocb *req)
>>>   {
>>>   	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
>>> -	size_t cmd_size;
>>> +	size_t size = sizeof(struct io_uring_sqe);
>>>   
>>>   	BUILD_BUG_ON(uring_cmd_pdu_size(0) != 16);
>>>   	BUILD_BUG_ON(uring_cmd_pdu_size(1) != 80);
>>
>> One minor suggestion. The above is the only user of uring_cmd_pdu_size() now,
>> which is kind of a convoluted way to enforce the offset of the 'cmd' field. It
>> may be more clear to replace these with:
> 
> I agree with you here. Basically it is a bug if the payload (pdu) size is
> different from 16 for a single SQE or from 80 for an extended SQE.
> 
> So, basically it is checking for two things:
>     * the cmd offset is 48
>     * the io_uring_sqe struct is 64
> 
> Since this is a uapi, I doubt that they will change at all.
> I can replace the code with your suggestion.
> 
>> 	BUILD_BUG_ON(offsetof(struct io_uring_sqe, cmd) == 48);
> 
> It should be "offsetof(struct io_uring_sqe, cmd) != 48", right?

Which is already checked, see io_uring_init()

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [RFC PATCH 2/4] net: add uring_cmd callback to UDP
  2023-04-06 14:43 ` Breno Leitao
@ 2023-04-11 12:54   ` Pavel Begunkov
  -1 siblings, 0 replies; 108+ messages in thread
From: Pavel Begunkov @ 2023-04-11 12:54 UTC (permalink / raw)
  To: dccp

On 4/6/23 15:43, Breno Leitao wrote:
> This is the implementation of uring_cmd for the udp protocol. It
> basically encompasses SOCKET_URING_OP_SIOCOUTQ and
> SOCKET_URING_OP_SIOCINQ, which is similar to the SIOCOUTQ and SIOCINQ
> ioctls.
> 
> The return value is exactly the same as the regular ioctl (udp_ioctl()).
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>   include/net/udp.h        |  2 ++
>   include/uapi/linux/net.h |  5 +++++
>   net/ipv4/udp.c           | 16 ++++++++++++++++
>   3 files changed, 23 insertions(+)
> 
> diff --git a/include/net/udp.h b/include/net/udp.h
> index de4b528522bb..c0e829dacc2f 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -283,6 +283,8 @@ void udp_flush_pending_frames(struct sock *sk);
>   int udp_cmsg_send(struct sock *sk, struct msghdr *msg, u16 *gso_size);
>   void udp4_hwcsum(struct sk_buff *skb, __be32 src, __be32 dst);
>   int udp_rcv(struct sk_buff *skb);
> +int udp_uring_cmd(struct sock *sk, struct io_uring_cmd *cmd,
> +		  unsigned int issue_flags);
>   int udp_ioctl(struct sock *sk, int cmd, unsigned long arg);
>   int udp_init_sock(struct sock *sk);
>   int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
> diff --git a/include/uapi/linux/net.h b/include/uapi/linux/net.h
> index 4dabec6bd957..dd8e7ced7d24 100644
> --- a/include/uapi/linux/net.h
> +++ b/include/uapi/linux/net.h
> @@ -55,4 +55,9 @@ typedef enum {
>   
>   #define __SO_ACCEPTCON	(1 << 16)	/* performed a listen		*/
>   
> +enum {
> +	SOCKET_URING_OP_SIOCINQ		= 0,
> +	SOCKET_URING_OP_SIOCOUTQ,
> +};
> +
>   #endif /* _UAPI_LINUX_NET_H */
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index c605d171eb2d..d6d60600831b 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -113,6 +113,7 @@
>   #include <net/sock_reuseport.h>
>   #include <net/addrconf.h>
>   #include <net/udp_tunnel.h>
> +#include <linux/io_uring.h>
>   #if IS_ENABLED(CONFIG_IPV6)
>   #include <net/ipv6_stubs.h>
>   #endif
> @@ -1711,6 +1712,20 @@ static int first_packet_length(struct sock *sk)
>   	return res;
>   }
>   
> +int udp_uring_cmd(struct sock *sk, struct io_uring_cmd *cmd,
> +		  unsigned int issue_flags)
> +{
> +	switch (cmd->sqe->cmd_op) {

Not particularly a problem of this series, but what bothers
me is the quite unfortunate placement of cmd_op in SQE.

struct io_uring_sqe {
	...
	union {
		__u64	d1;
		struct {
			__u32	cmd_op;
			__u32	__pad1;
		};
	};
	__u64	d2;
	__u32	d3;
	...
};

I'd much prefer it like this:

struct io_uring_sqe {
	...
	__u64 d1[2];
	__u32 cmd_op;
	...
};


We can't change it for NVMe, but at least new commands can have
a better layout. cmd_op is read in the generic cmd path, i.e.
io_uring_cmd_prep(), so it will need some refactoring to make
the placement command-specific.
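
A small userspace sketch of the two layouts, to make the difference
concrete. The struct names and the helper functions are hypothetical,
and only the fields shown in the two snippets above are modelled (the
elided "..." parts are left out), so the offsets are relative to the
start of this area and not to the real SQE.

```c
#include <assert.h>   /* static_assert (C11), assert */
#include <stddef.h>   /* offsetof, size_t */
#include <stdint.h>

/* Hypothetical model of the current placement: cmd_op shares d1. */
struct mock_area_current {
	union {
		uint64_t d1;
		struct {
			uint32_t cmd_op;
			uint32_t pad1;
		};
	};
	uint64_t d2;
	uint32_t d3;
};

/* Hypothetical model of the preferred placement: cmd_op after d1[2]. */
struct mock_area_preferred {
	uint64_t d1[2];
	uint32_t cmd_op;
};

/* In the preferred layout the two u64 slots form one contiguous run. */
static_assert(sizeof(((struct mock_area_preferred *)0)->d1) == 16,
	      "d1[2] covers 16 contiguous bytes");

size_t mock_current_cmd_op_offset(void)
{
	return offsetof(struct mock_area_current, cmd_op);
}

size_t mock_current_d2_offset(void)
{
	return offsetof(struct mock_area_current, d2);
}

size_t mock_preferred_cmd_op_offset(void)
{
	return offsetof(struct mock_area_preferred, cmd_op);
}
```

In this model the current layout puts cmd_op in front of the data,
sharing the first u64 with padding, while the preferred layout keeps
the 16 data bytes together and places cmd_op after them.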

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-11 12:00   ` Breno Leitao
@ 2023-04-11 14:36   ` David Ahern
  -1 siblings, 0 replies; 108+ messages in thread
From: David Ahern @ 2023-04-11 14:36 UTC (permalink / raw)
  To: dccp

On 4/11/23 6:00 AM, Breno Leitao wrote:
> I am not sure if avoiding io_uring details in network code is possible.
> 
> The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd()
> in the TCP case) could be somewhere else, such as in the io_uring/
> directory, but, I think it might be cleaner if these implementations are
> closer to function assignment (in the network subsystem).
> 
> And this function (tcp_uring_cmd() for instance) is the one that I am
> planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ
> -> SIOCINQ.
> 
> Please let me know if you have any other idea in mind.

I am not convinced that this io_uring_cmd is needed. This is one
in-kernel subsystem calling into another, and there are APIs for that.
All of this set is ioctl based and as Willem noted a little refactoring
separates the get_user/put_user out so that an in-kernel call can be
made with existing ops.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-11 14:36   ` David Ahern
@ 2023-04-11 14:41   ` Jens Axboe
  -1 siblings, 0 replies; 108+ messages in thread
From: Jens Axboe @ 2023-04-11 14:41 UTC (permalink / raw)
  To: dccp

On 4/11/23 8:36 AM, David Ahern wrote:
> On 4/11/23 6:00 AM, Breno Leitao wrote:
>> I am not sure if avoiding io_uring details in network code is possible.
>>
>> The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd()
>> in the TCP case) could be somewhere else, such as in the io_uring/
>> directory, but, I think it might be cleaner if these implementations are
>> closer to function assignment (in the network subsystem).
>>
>> And this function (tcp_uring_cmd() for instance) is the one that I am
>> planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ
>> -> SIOCINQ.
>>
>> Please let me know if you have any other idea in mind.
> 
> I am not convinced that this io_uring_cmd is needed. This is one
> in-kernel subsystem calling into another, and there are APIs for that.
> All of this set is ioctl based and as Willem noted a little refactoring
> separates the get_user/put_user out so that an in-kernel call can be
> made with existing ops.

How do you want to wire it up then? We can't use fops->unlocked_ioctl()
obviously, and we already have ->uring_cmd() for this purpose.

I do think the right thing to do is have a common helper that returns
whatever value you want (or sets it), and split the ioctl parts into a
wrapper around that that simply copies in/out as needed. Then
->uring_cmd() could call that, or you could have some exported function
that supports that.

This works for the basic cases, though I do suspect we'll want to go
down the ->uring_cmd() at some point for more advanced cases or cases
that cannot sanely be done in an ioctl fashion.
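
A minimal userspace sketch of the helper/wrapper split described above,
under stated assumptions: struct mock_sock, mock_sock_inq(),
mock_put_user() and both wrappers are hypothetical names invented for
illustration, and plain memcpy() stands in for the real copy-out to
user memory.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical socket state; not a real kernel structure. */
struct mock_sock {
	int rx_queued;   /* bytes waiting to be read */
	int tx_queued;   /* bytes not yet sent */
};

/* Common helper: computes the value, knows nothing about user memory. */
int mock_sock_inq(const struct mock_sock *sk)
{
	return sk->rx_queued;
}

/* Stand-in for put_user(): copies one int to the "user" pointer. */
static int mock_put_user(int val, int *uptr)
{
	if (!uptr)
		return -EFAULT;
	memcpy(uptr, &val, sizeof(val));
	return 0;
}

/* ioctl-style wrapper: calls the helper, then copies the result out. */
int mock_ioctl_inq(const struct mock_sock *sk, int *uarg)
{
	return mock_put_user(mock_sock_inq(sk), uarg);
}

/* uring_cmd-style wrapper: returns the value directly, no copy-out. */
int mock_uring_cmd_inq(const struct mock_sock *sk)
{
	return mock_sock_inq(sk);
}
```

Both entry points share mock_sock_inq(); only the ioctl-style wrapper
touches the caller-supplied pointer.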

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-11 14:41   ` Jens Axboe
@ 2023-04-11 14:51   ` Willem de Bruijn
  -1 siblings, 0 replies; 108+ messages in thread
From: Willem de Bruijn @ 2023-04-11 14:51 UTC (permalink / raw)
  To: dccp

Jens Axboe wrote:
> On 4/11/23 8:36 AM, David Ahern wrote:
> > On 4/11/23 6:00 AM, Breno Leitao wrote:
> >> I am not sure if avoiding io_uring details in network code is possible.
> >>
> >> The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd()
> >> in the TCP case) could be somewhere else, such as in the io_uring/
> >> directory, but, I think it might be cleaner if these implementations are
> >> closer to function assignment (in the network subsystem).
> >>
> >> And this function (tcp_uring_cmd() for instance) is the one that I am
> >> planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ
> >> -> SIOCINQ.
> >>
> >> Please let me know if you have any other idea in mind.
> > 
> > I am not convinced that this io_uring_cmd is needed. This is one
> > in-kernel subsystem calling into another, and there are APIs for that.
> > All of this set is ioctl based and as Willem noted a little refactoring
> > separates the get_user/put_user out so that an in-kernel call can be
> > made with existing ops.
> 
> How do you want to wire it up then? We can't use fops->unlocked_ioctl()
> obviously, and we already have ->uring_cmd() for this purpose.

Does this suggestion not work?

> > I was thinking just having sock_uring_cmd call sock->ops->ioctl, like
> > sock_do_ioctl.
 
> I do think the right thing to do is have a common helper that returns
> whatever value you want (or sets it), and split the ioctl parts into a
> wrapper around that that simply copies in/out as needed. Then
> ->uring_cmd() could call that, or you could have some exported function
> that supports that.
> 
> This works for the basic cases, though I do suspect we'll want to go
> down the ->uring_cmd() at some point for more advanced cases or cases
> that cannot sanely be done in an ioctl fashion.

Right now the two examples are ioctls that return an integer. Do you 
already have other calls in mind? That would help estimate whether
->uring_cmd() indeed will be needed and we might as well do it now.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-11 14:51   ` Willem de Bruijn
@ 2023-04-11 14:54   ` Jens Axboe
  -1 siblings, 0 replies; 108+ messages in thread
From: Jens Axboe @ 2023-04-11 14:54 UTC (permalink / raw)
  To: dccp

On 4/11/23 8:51 AM, Willem de Bruijn wrote:
> Jens Axboe wrote:
>> On 4/11/23 8:36 AM, David Ahern wrote:
>>> On 4/11/23 6:00 AM, Breno Leitao wrote:
>>>> I am not sure if avoiding io_uring details in network code is possible.
>>>>
>>>> The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd()
>>>> in the TCP case) could be somewhere else, such as in the io_uring/
>>>> directory, but, I think it might be cleaner if these implementations are
>>>> closer to function assignment (in the network subsystem).
>>>>
>>>> And this function (tcp_uring_cmd() for instance) is the one that I am
>>>> planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ
>>>> -> SIOCINQ.
>>>>
>>>> Please let me know if you have any other idea in mind.
>>>
>>> I am not convinced that this io_uring_cmd is needed. This is one
>>> in-kernel subsystem calling into another, and there are APIs for that.
>>> All of this set is ioctl based and as Willem noted a little refactoring
>>> separates the get_user/put_user out so that an in-kernel call can be
>>> made with existing ops.
>>
>> How do you want to wire it up then? We can't use fops->unlocked_ioctl()
>> obviously, and we already have ->uring_cmd() for this purpose.
> 
> Does this suggestion not work?

Not sure I follow, what suggestion?

>>> I was thinking just having sock_uring_cmd call sock->ops->ioctl, like
>>> sock_do_ioctl.
>  
>> I do think the right thing to do is have a common helper that returns
>> whatever value you want (or sets it), and split the ioctl parts into a
>> wrapper around that that simply copies in/out as needed. Then
>> ->uring_cmd() could call that, or you could have some exported function
>> that supports that.
>>
>> This works for the basic cases, though I do suspect we'll want to go
>> down the ->uring_cmd() at some point for more advanced cases or cases
>> that cannot sanely be done in an ioctl fashion.
> 
> Right now the two examples are ioctls that return an integer. Do you 
> already have other calls in mind? That would help estimate whether
> ->uring_cmd() indeed will be needed and we might as well do it now.

Right, it's a proof of concept. But we'd want to support anything that
setsockopt/getsockopt would do. This is necessary so that direct
descriptors (e.g. ones that describe a struct file that isn't in the
process file table and has no regular fd) can be used for anything that a
regular file can. Beyond that, perhaps various things necessary for
efficient zero copy rx.

I do think we can make the ->uring_cmd() hookup a bit more palatable in
terms of API. It really should be just a sub-opcode and then arguments
to support that. The grunt of the work is really refactoring the ioctl
and set/getsockopt bits so that they can be called in-kernel rather than
assuming copy in/out is needed. Once that is done, the actual uring_cmd
hookup should be simple and trivial.
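
A rough userspace sketch of the "sub-opcode plus arguments" shape,
assuming the in-kernel helpers already exist. The two opcode names
mirror the enum from the patch quoted earlier in the thread; struct
mock_sock and mock_sock_uring_cmd() are hypothetical, and returning
-EOPNOTSUPP for an unknown sub-opcode is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Mirrors the enum added by the patch quoted earlier in the thread. */
enum {
	SOCKET_URING_OP_SIOCINQ = 0,
	SOCKET_URING_OP_SIOCOUTQ,
};

/* Hypothetical socket state; not a real kernel structure. */
struct mock_sock {
	int rx_queued;   /* bytes waiting to be read */
	int tx_queued;   /* bytes not yet sent */
};

/*
 * Dispatch on the sub-opcode and return the result directly.
 * No user pointer is involved, so there is nothing to copy in or out.
 */
int mock_sock_uring_cmd(const struct mock_sock *sk, uint32_t cmd_op)
{
	switch (cmd_op) {
	case SOCKET_URING_OP_SIOCINQ:
		return sk->rx_queued;
	case SOCKET_URING_OP_SIOCOUTQ:
		return sk->tx_queued;
	default:
		/* Assumed error code for an unknown sub-opcode. */
		return -EOPNOTSUPP;
	}
}
```

Once the value-producing code no longer assumes copy in/out, the
dispatch itself stays this small.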

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-11 14:54   ` Jens Axboe
@ 2023-04-11 15:00   ` Willem de Bruijn
  -1 siblings, 0 replies; 108+ messages in thread
From: Willem de Bruijn @ 2023-04-11 15:00 UTC (permalink / raw)
  To: dccp

Jens Axboe wrote:
> On 4/11/23 8:51 AM, Willem de Bruijn wrote:
> > Jens Axboe wrote:
> >> On 4/11/23 8:36 AM, David Ahern wrote:
> >>> On 4/11/23 6:00 AM, Breno Leitao wrote:
> >>>> I am not sure if avoiding io_uring details in network code is possible.
> >>>>
> >>>> The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd()
> >>>> in the TCP case) could be somewhere else, such as in the io_uring/
> >>>> directory, but, I think it might be cleaner if these implementations are
> >>>> closer to function assignment (in the network subsystem).
> >>>>
> >>>> And this function (tcp_uring_cmd() for instance) is the one that I am
> >>>> planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ
> >>>> -> SIOCINQ.
> >>>>
> >>>> Please let me know if you have any other idea in mind.
> >>>
> >>> I am not convinced that this io_uring_cmd is needed. This is one
> >>> in-kernel subsystem calling into another, and there are APIs for that.
> >>> All of this set is ioctl based and as Willem noted a little refactoring
> >>> separates the get_user/put_user out so that an in-kernel call can be
> >>> made with existing ops.
> >>
> >> How do you want to wire it up then? We can't use fops->unlocked_ioctl()
> >> obviously, and we already have ->uring_cmd() for this purpose.
> > 
> > Does this suggestion not work?
> 
> Not sure I follow, what suggestion?
>

This quote from earlier in the thread:

I was thinking just having sock_uring_cmd call sock->ops->ioctl, like
sock_do_ioctl.
> >  
> >> I do think the right thing to do is have a common helper that returns
> >> whatever value you want (or sets it), and split the ioctl parts into a
> >> wrapper around that that simply copies in/out as needed. Then
> >> ->uring_cmd() could call that, or you could have some exported function
> >> that supports that.
> >>
> >> This works for the basic cases, though I do suspect we'll want to go
> >> down the ->uring_cmd() at some point for more advanced cases or cases
> >> that cannot sanely be done in an ioctl fashion.
> > 
> > Right now the two examples are ioctls that return an integer. Do you 
> > already have other calls in mind? That would help estimate whether
> > ->uring_cmd() indeed will be needed and we might as well do it now.
> 
> Right, it's a proof of concept. But we'd want to support anything that
> setsockopt/getsockopt would do. This is necessary so that direct
> descriptors (e.g. ones that describe a struct file that isn't in the
> process file table and has no regular fd) can be used for anything that a
> regular file can. Beyond that, perhaps various things necessary for
> efficient zero copy rx.
> 
> I do think we can make the ->uring_cmd() hookup a bit more palatable in
> terms of API. It really should be just a sub-opcode and then arguments
> to support that. The grunt of the work is really refactoring the ioctl
> and set/getsockopt bits so that they can be called in-kernel rather than
> assuming copy in/out is needed. Once that is done, the actual uring_cmd
> hookup should be simple and trivial.

That sounds like what I proposed above. That suggestion was only for
the narrow case where ioctls return an integer. The general approach
has to handle any put_user.

Though my initial skim of TCP, UDP and RAW did not bring up any other
forms.

getsockopt indeed has plenty of examples, such as receive zerocopy.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-11 15:00   ` Willem de Bruijn
@ 2023-04-11 15:06   ` Jens Axboe
  -1 siblings, 0 replies; 108+ messages in thread
From: Jens Axboe @ 2023-04-11 15:06 UTC (permalink / raw)
  To: dccp

On 4/11/23 9:00 AM, Willem de Bruijn wrote:
> Jens Axboe wrote:
>> On 4/11/23 8:51 AM, Willem de Bruijn wrote:
>>> Jens Axboe wrote:
>>>> On 4/11/23 8:36 AM, David Ahern wrote:
>>>>> On 4/11/23 6:00 AM, Breno Leitao wrote:
>>>>>> I am not sure if avoiding io_uring details in network code is possible.
>>>>>>
>>>>>> The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd()
>>>>>> in the TCP case) could be somewhere else, such as in the io_uring/
>>>>>> directory, but, I think it might be cleaner if these implementations are
>>>>>> closer to function assignment (in the network subsystem).
>>>>>>
>>>>>> And this function (tcp_uring_cmd() for instance) is the one that I am
>>>>>> planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ
>>>>>> -> SIOCINQ.
>>>>>>
>>>>>> Please let me know if you have any other idea in mind.
>>>>>
>>>>> I am not convinced that this io_uring_cmd is needed. This is one
>>>>> in-kernel subsystem calling into another, and there are APIs for that.
>>>>> All of this set is ioctl based and as Willem noted a little refactoring
>>>>> separates the get_user/put_user out so that an in-kernel call can be
>>>>> made with existing ops.
>>>>
>>>> How do you want to wire it up then? We can't use fops->unlocked_ioctl()
>>>> obviously, and we already have ->uring_cmd() for this purpose.
>>>
>>> Does this suggestion not work?
>>
>> Not sure I follow, what suggestion?
>>
> 
> This quote from earlier in the thread:
> 
> I was thinking just having sock_uring_cmd call sock->ops->ioctl, like
> sock_do_ioctl.

But that doesn't work, because sock->ops->ioctl() assumes the arg is
memory in userspace. Or do you mean change all of the sock->ops->ioctl()
to pass in on-stack memory (or similar) and have it work with a kernel
address?

>>>> I do think the right thing to do is have a common helper that returns
>>>> whatever value you want (or sets it), and split the ioctl parts into a
>>>> wrapper around that that simply copies in/out as needed. Then
>>>> ->uring_cmd() could call that, or you could add some exported function
>>>> that supports that.
>>>>
>>>> This works for the basic cases, though I do suspect we'll want to go
>>>> down the ->uring_cmd() at some point for more advanced cases or cases
>>>> that cannot sanely be done in an ioctl fashion.
>>>
>>> Right now the two examples are ioctls that return an integer. Do you 
>>> already have other calls in mind? That would help estimate whether
>>> ->uring_cmd() indeed will be needed and we might as well do it now.
>>
>> Right, it's a proof of concept. But we'd want to support anything that
>> setsockopt/getsockopt would do. This is necessary so that direct
>> descriptors (eg ones that describe a struct file that isn't in the
>> process file table or have a regular fd) can be used for anything that a
>> regular file can. Beyond that, perhaps various things necessary for
>> efficient zero copy rx.
>>
>> I do think we can make the ->uring_cmd() hookup a bit more palatable in
>> terms of API. It really should be just a sub-opcode and then arguments
>> to support that. The grunt of the work is really refactoring the ioctl
>> and set/getsockopt bits so that they can be called in-kernel rather than
>> assuming copy in/out is needed. Once that is done, the actual uring_cmd
>> hookup should be simple and trivial.
> 
> That sounds like what I proposed above. That suggestion was only for
> the narrow case where ioctls return an integer. The general approach
> has to handle any put_user.

Right

> Though my initial skim of TCP, UDP and RAW did not bring up any other
> forms.
> 
> getsockopt indeed has plenty of examples, such as receive zerocopy.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-11 14:41   ` Jens Axboe
@ 2023-04-11 15:10   ` David Ahern
  -1 siblings, 0 replies; 108+ messages in thread
From: David Ahern @ 2023-04-11 15:10 UTC (permalink / raw)
  To: dccp

On 4/11/23 8:41 AM, Jens Axboe wrote:
> On 4/11/23 8:36 AM, David Ahern wrote:
>> On 4/11/23 6:00 AM, Breno Leitao wrote:
>>> I am not sure if avoiding io_uring details in network code is possible.
>>>
>>> The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd()
>>> in the TCP case) could be somewhere else, such as in the io_uring/
>>> directory, but, I think it might be cleaner if these implementations are
>>> closer to function assignment (in the network subsystem).
>>>
>>> And this function (tcp_uring_cmd() for instance) is the one that I am
>>> planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ
>>> -> SIOCINQ.
>>>
>>> Please let me know if you have any other idea in mind.
>>
>> I am not convinced that this io_uring_cmd is needed. This is one
>> in-kernel subsystem calling into another, and there are APIs for that.
>> All of this set is ioctl based and as Willem noted a little refactoring
>> separates the get_user/put_user out so that in-kernel calls can be
>> made with existing ops.
> 
> How do you want to wire it up then? We can't use fops->unlocked_ioctl()
> obviously, and we already have ->uring_cmd() for this purpose.
> 
> I do think the right thing to do is have a common helper that returns
> whatever value you want (or sets it), and split the ioctl parts into a
> wrapper around that that simply copies in/out as needed. Then
> ->uring_cmd() could call that, or you could add some exported function
> that supports that.
> 
> This works for the basic cases, though I do suspect we'll want to go
> down the ->uring_cmd() at some point for more advanced cases or cases
> that cannot sanely be done in an ioctl fashion.
> 

My meta point is that there are uapis today to return this information
to applications (and I suspect this is just the start of more networking
changes - both data retrieval and adjusting settings). io_uring wants
to do this on behalf of the application without a syscall. That makes
io_uring yet another subsystem / component managing a socket. Any change
to the networking stack required by io_uring should be usable by all
other in-kernel socket owners or managers, i.e., there is no reason for
io_uring-specific code here.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-11 15:10   ` David Ahern
@ 2023-04-11 15:17   ` Jens Axboe
  -1 siblings, 0 replies; 108+ messages in thread
From: Jens Axboe @ 2023-04-11 15:17 UTC (permalink / raw)
  To: dccp

On 4/11/23 9:10 AM, David Ahern wrote:
> On 4/11/23 8:41 AM, Jens Axboe wrote:
>> On 4/11/23 8:36 AM, David Ahern wrote:
>>> On 4/11/23 6:00 AM, Breno Leitao wrote:
>>>> I am not sure if avoiding io_uring details in network code is possible.
>>>>
>>>> The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd()
>>>> in the TCP case) could be somewhere else, such as in the io_uring/
>>>> directory, but, I think it might be cleaner if these implementations are
>>>> closer to function assignment (in the network subsystem).
>>>>
>>>> And this function (tcp_uring_cmd() for instance) is the one that I am
>>>> planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ
>>>> -> SIOCINQ.
>>>>
>>>> Please let me know if you have any other idea in mind.
>>>
>>> I am not convinced that this io_uring_cmd is needed. This is one
>>> in-kernel subsystem calling into another, and there are APIs for that.
>>> All of this set is ioctl based and as Willem noted a little refactoring
>>> separates the get_user/put_user out so that in-kernel calls can be
>>> made with existing ops.
>>
>> How do you want to wire it up then? We can't use fops->unlocked_ioctl()
>> obviously, and we already have ->uring_cmd() for this purpose.
>>
>> I do think the right thing to do is have a common helper that returns
>> whatever value you want (or sets it), and split the ioctl parts into a
>> wrapper around that that simply copies in/out as needed. Then
>> ->uring_cmd() could call that, or you could add some exported function
>> that supports that.
>>
>> This works for the basic cases, though I do suspect we'll want to go
>> down the ->uring_cmd() at some point for more advanced cases or cases
>> that cannot sanely be done in an ioctl fashion.
>>
> 
> My meta point is that there are uapis today to return this information
> to applications (and I suspect this is just the start of more networking
> changes - both data retrieval and adjusting settings). io_uring is
> wanting to do this on behalf of the application without a syscall. That
> makes io_uring yet another subsystem / component managing a socket. Any
> change to the networking stack required by io_uring should be usable by
> all other in-kernel socket owners or managers, i.e., there is no reason
> for io_uring specific code here.

I think we are in violent agreement here: what I'm describing is exactly
that - it'd make ioctl/{set,get}sockopt call into the same helpers that
->uring_cmd() would, with the only difference being that the former
would need copy in/out and the latter would not.
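
As an illustration of that split, here is a minimal userspace sketch. It
is not kernel code and is not taken from this patch set: struct
fake_sock, fake_copy_to_user() and the other fake_* names are
hypothetical stand-ins, and memcpy() merely models the copy to user
memory. The point is only that one core helper computes the value, the
ioctl-style wrapper adds the copy-out, and the uring_cmd-style caller
uses the helper directly.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a socket; only the receive-queue depth matters. */
struct fake_sock {
	int rx_queued_bytes;
};

/* Models copy_to_user(): fails with -EFAULT on a bad destination. */
int fake_copy_to_user(void *uptr, const void *src, size_t len)
{
	if (uptr == NULL)
		return -EFAULT;
	memcpy(uptr, src, len);
	return 0;
}

/* Core helper: computes the answer into caller-owned (kernel) memory. */
int fake_sock_inq(const struct fake_sock *sk, int *karg)
{
	if (sk == NULL || karg == NULL)
		return -EINVAL;
	*karg = sk->rx_queued_bytes;
	return 0;
}

/* ioctl-style wrapper: core helper plus the copy-out to "user" memory. */
int fake_sock_ioctl_inq(const struct fake_sock *sk, void *uarg)
{
	int val;
	int ret = fake_sock_inq(sk, &val);

	if (ret)
		return ret;
	return fake_copy_to_user(uarg, &val, sizeof(val));
}

/* uring_cmd-style caller: no copy-out, the value is the completion result. */
int fake_sock_uring_cmd_inq(const struct fake_sock *sk)
{
	int val;
	int ret = fake_sock_inq(sk, &val);

	return ret ? ret : val;
}

/* Usage: both entry points report the same queue depth. */
void fake_sock_selfcheck(void)
{
	struct fake_sock sk = { .rx_queued_bytes = 42 };
	int out = 0;

	assert(fake_sock_ioctl_inq(&sk, &out) == 0);
	assert(out == fake_sock_uring_cmd_inq(&sk));
}
```

In this sketch the only code that differs between the two entry points
is the copy-out step; everything protocol-specific lives in the shared
helper.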

But let me just stress that for direct descriptors, we cannot currently
call ioctl or set/getsockopt. This means we have to instantiate a
regular descriptor first, do those things, then register it and never
use the regular file descriptor again. That's wasteful, and this is what
we want to enable (direct use of ioctl and set/getsockopt WITHOUT a
normal file descriptor). It's not just for "oh it'd be handy to also do
this from io_uring", even if that would be a worthwhile goal in itself.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-11 15:06   ` Jens Axboe
@ 2023-04-11 15:24   ` Willem de Bruijn
  -1 siblings, 0 replies; 108+ messages in thread
From: Willem de Bruijn @ 2023-04-11 15:24 UTC (permalink / raw)
  To: dccp

Jens Axboe wrote:
> On 4/11/23 9:00 AM, Willem de Bruijn wrote:
> > Jens Axboe wrote:
> >> On 4/11/23 8:51 AM, Willem de Bruijn wrote:
> >>> Jens Axboe wrote:
> >>>> On 4/11/23 8:36 AM, David Ahern wrote:
> >>>>> On 4/11/23 6:00 AM, Breno Leitao wrote:
> >>>>>> I am not sure if avoiding io_uring details in network code is possible.
> >>>>>>
> >>>>>> The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd()
> >>>>>> in the TCP case) could be somewhere else, such as in the io_uring/
> >>>>>> directory, but, I think it might be cleaner if these implementations are
> >>>>>> closer to function assignment (in the network subsystem).
> >>>>>>
> >>>>>> And this function (tcp_uring_cmd() for instance) is the one that I am
> >>>>>> planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ
> >>>>>> -> SIOCINQ.
> >>>>>>
> >>>>>> Please let me know if you have any other idea in mind.
> >>>>>
> >>>>> I am not convinced that this io_uring_cmd is needed. This is one
> >>>>> in-kernel subsystem calling into another, and there are APIs for that.
> >>>>> All of this set is ioctl based and as Willem noted a little refactoring
> >>>>> separates the get_user/put_user out so that in-kernel calls can be
> >>>>> made with existing ops.
> >>>>
> >>>> How do you want to wire it up then? We can't use fops->unlocked_ioctl()
> >>>> obviously, and we already have ->uring_cmd() for this purpose.
> >>>
> >>> Does this suggestion not work?
> >>
> >> Not sure I follow, what suggestion?
> >>
> > 
> > This quote from earlier in the thread:
> > 
> > I was thinking just having sock_uring_cmd call sock->ops->ioctl, like
> > sock_do_ioctl.
> 
> But that doesn't work, because sock->ops->ioctl() assumes the arg is
> memory in userspace. Or do you mean change all of the sock->ops->ioctl()
> to pass in on-stack memory (or similar) and have it work with a kernel
> address?

That was what I suggested indeed.

It's about as much code change as this patch series. But it avoids
the code duplication.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-11 15:17   ` Jens Axboe
@ 2023-04-11 15:27   ` David Ahern
  -1 siblings, 0 replies; 108+ messages in thread
From: David Ahern @ 2023-04-11 15:27 UTC (permalink / raw)
  To: dccp

On 4/11/23 9:17 AM, Jens Axboe wrote:
> On 4/11/23 9:10 AM, David Ahern wrote:
>> On 4/11/23 8:41 AM, Jens Axboe wrote:
>>> On 4/11/23 8:36 AM, David Ahern wrote:
>>>> On 4/11/23 6:00 AM, Breno Leitao wrote:
>>>>> I am not sure if avoiding io_uring details in network code is possible.
>>>>>
>>>>> The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd()
>>>>> in the TCP case) could be somewhere else, such as in the io_uring/
>>>>> directory, but, I think it might be cleaner if these implementations are
>>>>> closer to function assignment (in the network subsystem).
>>>>>
>>>>> And this function (tcp_uring_cmd() for instance) is the one that I am
>>>>> planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ
>>>>> -> SIOCINQ.
>>>>>
>>>>> Please let me know if you have any other idea in mind.
>>>>
>>>> I am not convinced that this io_uring_cmd is needed. This is one
>>>> in-kernel subsystem calling into another, and there are APIs for that.
>>>> All of this set is ioctl based and as Willem noted a little refactoring
>>>> separates the get_user/put_user out so that in-kernel calls can be
>>>> made with existing ops.
>>>
>>> How do you want to wire it up then? We can't use fops->unlocked_ioctl()
>>> obviously, and we already have ->uring_cmd() for this purpose.
>>>
>>> I do think the right thing to do is have a common helper that returns
>>> whatever value you want (or sets it), and split the ioctl parts into a
>>> wrapper around that that simply copies in/out as needed. Then
>>> ->uring_cmd() could call that, or you could add some exported function
>>> that supports that.
>>>
>>> This works for the basic cases, though I do suspect we'll want to go
>>> down the ->uring_cmd() at some point for more advanced cases or cases
>>> that cannot sanely be done in an ioctl fashion.
>>>
>>
>> My meta point is that there are uapis today to return this information
>> to applications (and I suspect this is just the start of more networking
>> changes - both data retrieval and adjusting settings). io_uring is
>> wanting to do this on behalf of the application without a syscall. That
>> makes io_uring yet another subsystem / component managing a socket. Any
>> change to the networking stack required by io_uring should be usable by
>> all other in-kernel socket owners or managers, i.e., there is no reason
>> for io_uring specific code here.
> 
> I think we are in violent agreement here, what I'm describing is exactly
> that - it'd make ioctl/{set,get}sockopt call into the same helpers that
> ->uring_cmd() would, with the only difference being that the former
> would need copy in/out and the latter would not.
> 
> But let me just stress that for direct descriptors, we cannot currently
> call ioctl or set/getsockopt. This means we have to instantiate a
> regular descriptor first, do those things, then register it to never use
> the regular file descriptor again. That's wasteful, and this is what we
> want to enable (direct use of ioctl set/getsockopt WITHOUT a normal file
> descriptor). It's not just for "oh it'd be handy to also do this from
> io_uring" even if that would be a worthwhile goal in itself.
> 

Christoph's patch set a few years back that removed set_fs broke the
ability to do in-kernel ioctl and {s,g}etsockopt calls. I did not
follow that change; was the intent to deliberately disallow these
in-kernel calls, or only to remove set_fs? E.g., can we add a kioctl
variant for in-kernel use of the APIs?
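
One way such an in-kernel variant could work is sketched below, as a
userspace model only. It is loosely similar in spirit to the kernel's
sockptr_t, which tags a pointer as kernel or user memory on the
setsockopt path; it is not a proposal from this thread. struct
fake_argptr, fake_copy_to_argptr(), fake_ioctl_outq() and the
fake_user_copies counter are all hypothetical names, and memcpy()
stands in for the real user-copy routines.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tagged pointer: records which address space ptr refers to. */
struct fake_argptr {
	void *ptr;
	bool is_kernel;
};

/* Counts simulated user copies so the two paths can be told apart. */
int fake_user_copies;

/* Models copy_to_user(): fails with -EFAULT on a bad destination. */
int fake_copy_to_user(void *uptr, const void *src, size_t len)
{
	if (uptr == NULL)
		return -EFAULT;
	fake_user_copies++;
	memcpy(uptr, src, len);
	return 0;
}

/* Dispatches on the tag: plain memcpy for kernel memory, user copy otherwise. */
int fake_copy_to_argptr(struct fake_argptr dst, const void *src, size_t len)
{
	if (dst.is_kernel) {
		memcpy(dst.ptr, src, len);
		return 0;
	}
	return fake_copy_to_user(dst.ptr, src, len);
}

/* A single handler body serves both the syscall and the in-kernel caller. */
int fake_ioctl_outq(int tx_queued_bytes, struct fake_argptr arg)
{
	return fake_copy_to_argptr(arg, &tx_queued_bytes,
				   sizeof(tx_queued_bytes));
}

/* Usage: the same handler fills an on-stack int and a "user" buffer. */
void fake_argptr_selfcheck(void)
{
	int kval = 0, uval = 0;
	struct fake_argptr karg = { .ptr = &kval, .is_kernel = true };
	struct fake_argptr uarg = { .ptr = &uval, .is_kernel = false };

	assert(fake_ioctl_outq(64, karg) == 0 && kval == 64);
	assert(fake_ioctl_outq(64, uarg) == 0 && uval == 64);
}
```

The trade-off against a separate core helper is that every handler
keeps one body, but each copy site has to go through the tagged-pointer
accessor rather than calling put_user directly.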

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-11 15:24   ` Willem de Bruijn
@ 2023-04-11 15:28   ` Jens Axboe
  -1 siblings, 0 replies; 108+ messages in thread
From: Jens Axboe @ 2023-04-11 15:28 UTC (permalink / raw)
  To: dccp

On 4/11/23 9:24 AM, Willem de Bruijn wrote:
> Jens Axboe wrote:
>> On 4/11/23 9:00 AM, Willem de Bruijn wrote:
>>> Jens Axboe wrote:
>>>> On 4/11/23 8:51 AM, Willem de Bruijn wrote:
>>>>> Jens Axboe wrote:
>>>>>> On 4/11/23 8:36 AM, David Ahern wrote:
>>>>>>> On 4/11/23 6:00 AM, Breno Leitao wrote:
>>>>>>>> I am not sure if avoiding io_uring details in network code is possible.
>>>>>>>>
>>>>>>>> The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd()
>>>>>>>> in the TCP case) could be somewhere else, such as in the io_uring/
>>>>>>>> directory, but, I think it might be cleaner if these implementations are
>>>>>>>> closer to function assignment (in the network subsystem).
>>>>>>>>
>>>>>>>> And this function (tcp_uring_cmd() for instance) is the one that I am
>>>>>>>> planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ
>>>>>>>> -> SIOCINQ.
>>>>>>>>
>>>>>>>> Please let me know if you have any other idea in mind.
>>>>>>>
>>>>>>> I am not convinced that this io_uring_cmd is needed. This is one
>>>>>>> in-kernel subsystem calling into another, and there are APIs for that.
>>>>>>> All of this set is ioctl based and as Willem noted a little refactoring
>>>>>>> separates the get_user/put_user out so that in-kernel calls can be
>>>>>>> made with existing ops.
>>>>>>
>>>>>> How do you want to wire it up then? We can't use fops->unlocked_ioctl()
>>>>>> obviously, and we already have ->uring_cmd() for this purpose.
>>>>>
>>>>> Does this suggestion not work?
>>>>
>>>> Not sure I follow, what suggestion?
>>>>
>>>
>>> This quote from earlier in the thread:
>>>
>>> I was thinking just having sock_uring_cmd call sock->ops->ioctl, like
>>> sock_do_ioctl.
>>
>> But that doesn't work, because sock->ops->ioctl() assumes the arg is
>> memory in userspace. Or do you mean change all of the sock->ops->ioctl()
>> to pass in on-stack memory (or similar) and have it work with a kernel
>> address?
> 
> That was what I suggested indeed.
> 
> It's about as much code change as this patch series. But it avoids
> the code duplication.

Breno, want to tackle that as a prep patch first? Should make the
functional changes afterwards much more straightforward, and will allow
support for anything really.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-11 15:27   ` David Ahern
@ 2023-04-11 15:29   ` Jens Axboe
  -1 siblings, 0 replies; 108+ messages in thread
From: Jens Axboe @ 2023-04-11 15:29 UTC (permalink / raw)
  To: dccp

On 4/11/23 9:27 AM, David Ahern wrote:
> On 4/11/23 9:17 AM, Jens Axboe wrote:
>> On 4/11/23 9:10 AM, David Ahern wrote:
>>> On 4/11/23 8:41 AM, Jens Axboe wrote:
>>>> On 4/11/23 8:36 AM, David Ahern wrote:
>>>>> On 4/11/23 6:00 AM, Breno Leitao wrote:
>>>>>> I am not sure if avoiding io_uring details in network code is possible.
>>>>>>
>>>>>> The "struct proto"->uring_cmd callback implementation (tcp_uring_cmd()
>>>>>> in the TCP case) could be somewhere else, such as in the io_uring/
>>>>>> directory, but, I think it might be cleaner if these implementations are
>>>>>> closer to function assignment (in the network subsystem).
>>>>>>
>>>>>> And this function (tcp_uring_cmd() for instance) is the one that I am
>>>>>> planning to map io_uring CMDs to ioctls. Such as SOCKET_URING_OP_SIOCINQ
>>>>>> -> SIOCINQ.
>>>>>>
>>>>>> Please let me know if you have any other idea in mind.
>>>>>
>>>>> I am not convinced that this io_uring_cmd is needed. This is one
>>>>> in-kernel subsystem calling into another, and there are APIs for that.
>>>>> All of this set is ioctl based and as Willem noted a little refactoring
>>>>> separates the get_user/put_user out so that in-kernel calls can be
>>>>> made with existing ops.
>>>>
>>>> How do you want to wire it up then? We can't use fops->unlocked_ioctl()
>>>> obviously, and we already have ->uring_cmd() for this purpose.
>>>>
>>>> I do think the right thing to do is have a common helper that returns
>>>> whatever value you want (or sets it), and split the ioctl parts into a
>>>> wrapper around that that simply copies in/out as needed. Then
>>>> ->uring_cmd() could call that, or you could have some exported function
>>>> that supports that.
>>>>
>>>> This works for the basic cases, though I do suspect we'll want to go
>>>> down the ->uring_cmd() route at some point for more advanced cases or
>>>> cases that cannot sanely be done in an ioctl fashion.
>>>>
>>>
>>> My meta point is that there are uapis today to return this information
>>> to applications (and I suspect this is just the start of more networking
>>> changes - both data retrieval and adjusting settings). io_uring is
>>> wanting to do this on behalf of the application without a syscall. That
>>> makes io_uring yet another subsystem / component managing a socket. Any
>>> change to the networking stack required by io_uring should be usable by
>>> all other in-kernel socket owners or managers. ie., there is no reason
>>> for io_uring specific code here.
>>
>> I think we are in violent agreement here, what I'm describing is exactly
>> that - it'd make ioctl/{set,get}sockopt call into the same helpers that
>> ->uring_cmd() would, with the only difference being that the former
>> would need copy in/out and the latter would not.
>>
>> But let me just stress that for direct descriptors, we cannot currently
>> call ioctl or set/getsockopt. This means we have to instantiate a
>> regular descriptor first, do those things, then register it to never use
>> the regular file descriptor again. That's wasteful, and this is what we
>> want to enable (direct use of ioctl set/getsockopt WITHOUT a normal file
>> descriptor). It's not just for "oh it'd be handy to also do this from
>> io_uring" even if that would be a worthwhile goal in itself.
>>
> 
> Christoph's patch set a few years back that removed set_fs broke the
> ability to do in-kernel ioctl and {s,g}etsockopt calls. I did not
> follow that change; was it a deliberate intent to not allow these
> in-kernel calls vs wanting to remove the set_fs? e.g., can we add a
> kioctl variant for in-kernel use of the APIs?

I think it'd be much better to cleanly split it out rather than try and
hack around it.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 108+ messages in thread

* RE: [PATCH 0/5] add initial io_uring_cmd support for sockets
@ 2023-04-12  7:39   ` David Laight
  0 siblings, 0 replies; 108+ messages in thread
From: David Laight @ 2023-04-12  7:39 UTC (permalink / raw)
  To: 'David Ahern', Jens Axboe, Breno Leitao
  Cc: Willem de Bruijn, io-uring@vger.kernel.org,
	netdev@vger.kernel.org, kuba@kernel.org, asml.silence@gmail.com,
	leit@fb.com, edumazet@google.com, pabeni@redhat.com,
	davem@davemloft.net, dccp@vger.kernel.org, mptcp@lists.linux.dev,
	linux-kernel@vger.kernel.org, willemdebruijn.kernel@gmail.com,
	matthieu.baerts@tessares.net, marcelo.leitner@gmail.com

From: David Ahern
> Sent: 11 April 2023 16:28
....
> Christoph's patch set a few years back that removed set_fs broke the
> ability to do in-kernel ioctl and {s,g}etsockopt calls. I did not
> follow that change; was it a deliberate intent to not allow these
> in-kernel calls vs wanting to remove the set_fs? e.g., can we add a
> kioctl variant for in-kernel use of the APIs?

I think that was a side effect, and with no in-tree in-kernel
users (apart from limited calls in bpf) it was deemed acceptable.
(It is a PITA for any code trying to use SCTP in kernel.)

One problem is that not all sockopt calls pass the correct length.
And some of them can have very long buffers.
Not to mention the ones that are read-modify-write.

A plausible solution is to pass a 'fat pointer' that contains
some, or all, of:
	- A userspace buffer pointer.
	- A kernel buffer pointer.
	- The length supplied by the user.
	- The length of the kernel buffer.
	- The number of bytes to copy on completion.
For simple user requests the syscall entry/exit code
would copy the data to a short on-stack buffer.
Kernel users just pass the kernel address.
Odd requests can just use the user pointer.

Probably needs accessors that add in an offset.
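
To make the idea concrete, a rough userspace sketch of what such a fat
pointer might look like. This is only an illustration under assumptions:
every name in it (struct fat_ptr, fat_ptr_write(), fat_ptr_complete()) is
invented for the example and is not an existing kernel API, and memcpy()
stands in for copy_to_user().

```c
#include <stddef.h>
#include <string.h>
#include <errno.h>

/*
 * Hypothetical "fat pointer" for sockopt/ioctl arguments.  Field and
 * function names are made up for illustration only.
 */
struct fat_ptr {
	void *user;       /* userspace buffer (modelled as a plain pointer) */
	void *kern;       /* kernel buffer, or NULL for "odd" requests */
	size_t user_len;  /* length supplied by the user */
	size_t kern_len;  /* length of the kernel buffer */
	size_t copy_len;  /* number of bytes to copy out on completion */
};

/* Accessor: write 'len' bytes at offset 'off' into the kernel buffer. */
static int fat_ptr_write(struct fat_ptr *fp, size_t off,
			 const void *src, size_t len)
{
	if (!fp->kern || off > fp->kern_len || len > fp->kern_len - off)
		return -EFAULT;
	memcpy((char *)fp->kern + off, src, len);
	if (off + len > fp->copy_len)
		fp->copy_len = off + len;
	return 0;
}

/* Exit path: copy the result back; memcpy() models copy_to_user(). */
static int fat_ptr_complete(struct fat_ptr *fp)
{
	size_t n = fp->copy_len < fp->user_len ? fp->copy_len : fp->user_len;

	if (fp->user && fp->kern)
		memcpy(fp->user, fp->kern, n);
	return (int)n;
}
```

In this model a kernel caller would simply leave 'user' NULL and read the
result from 'kern' directly, so nothing is copied on completion.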

It might also be that some of the problematic sockopt
were in decnet - now removed.

	David

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-11 15:28   ` Jens Axboe
@ 2023-04-12 13:53   ` Breno Leitao
  -1 siblings, 0 replies; 108+ messages in thread
From: Breno Leitao @ 2023-04-12 13:53 UTC (permalink / raw)
  To: dccp

On Tue, Apr 11, 2023 at 09:28:29AM -0600, Jens Axboe wrote:
> On 4/11/23 9:24 AM, Willem de Bruijn wrote:
> > Jens Axboe wrote:
> >> On 4/11/23 9:00 AM, Willem de Bruijn wrote:
> >> But that doesn't work, because sock->ops->ioctl() assumes the arg is
> >> memory in userspace. Or do you mean change all of the sock->ops->ioctl()
> >> to pass in on-stack memory (or similar) and have it work with a kernel
> >> address?
> > 
> > That was what I suggested indeed.
> > 
> > It's about as much code change as this patch series. But it avoids
> > the code duplication.
> 
> Breno, want to tackle that as a prep patch first? Should make the
> functional changes afterwards much more straightforward, and will allow
> support for anything really.

Absolutely. I just want to make sure that I understood the approach we
agreed on here.

Let me explain what I understood, taking TCP as an example:

1) Rename tcp_ioctl() to something like _tcp_ioctl(), where the 'arg'
argument is now just kernel memory (located in the caller's stack
frame).

2) Recreate tcp_ioctl() so that it basically allocates an 'arg' on the
stack and calls _tcp_ioctl(), passing that 'arg' argument. At the bottom
of the new tcp_ioctl(), call put_user(in_kernel_arg, userspace_arg).

3) Repeat it for the 20 protocols that implement ioctl:

	ag  "struct proto .* = {" -A 20 net/ | grep \.ioctl
	net/dccp/ipv6.c 	.ioctl	= dccp_ioctl,
	net/dccp/ipv4.c		.ioctl	= dccp_ioctl,
	net/ieee802154/socket.c .ioctl	= dgram_ioctl,
	net/ipv4/udplite.c	.ioctl	= udp_ioctl,
	net/ipv4/raw.c 		.ioctl	= raw_ioctl,
	net/ipv4/udp.c		.ioctl	= udp_ioctl,
	net/ipv4/tcp_ipv4.c 	.ioctl	= tcp_ioctl,
	net/ipv6/raw.c		.ioctl	= rawv6_ioctl,
	net/ipv6/tcp_ipv6.c	.ioctl	= tcp_ioctl,
	net/ipv6/udp.c	 	.ioctl	= udp_ioctl,
	net/ipv6/udplite.c	.ioctl	= udp_ioctl,
	net/l2tp/l2tp_ip6.c	.ioctl	= l2tp_ioctl,
	net/l2tp/l2tp_ip.c	.ioctl	= l2tp_ioctl,
	net/phonet/datagram.c	.ioctl	= pn_ioctl,
	net/phonet/pep.c	.ioctl	= pep_ioctl,
	net/rds/af_rds.c	.ioctl	=	rds_ioctl,
	net/sctp/socket.c	.ioctl  =	sctp_ioctl,
	net/sctp/socket.c	.ioctl	= sctp_ioctl,
	net/xdp/xsk.c		.ioctl	= sock_no_ioctl,
	net/mptcp/protocol.c	.ioctl	= mptcp_ioctl,
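
Steps 1 and 2 could be sketched roughly as below. This is a userspace
model, not kernel code: struct sock is a toy, put_user_int() is a
hypothetical stand-in for put_user(), and the SIOCINQ/SIOCOUTQ/ENOIOCTLCMD
values are defined locally only so the example is self-contained.

```c
#include <errno.h>
#include <string.h>

/* Local definitions for the model only. */
#define SIOCINQ     0x541B
#define SIOCOUTQ    0x5411
#define ENOIOCTLCMD 515	/* kernel-internal errno, not in userspace headers */

struct sock {
	int rx_bytes;	/* unread bytes queued for receive */
	int tx_bytes;	/* bytes not yet sent */
};

/* Hypothetical stand-in for put_user(); fails on a bad user pointer. */
static int put_user_int(int val, int *uarg)
{
	if (!uarg)
		return -EFAULT;
	memcpy(uarg, &val, sizeof(val));
	return 0;
}

/* Step 1: core logic; 'karg' is kernel memory owned by the caller. */
static int _tcp_ioctl(struct sock *sk, int cmd, int *karg)
{
	switch (cmd) {
	case SIOCINQ:
		*karg = sk->rx_bytes;
		return 0;
	case SIOCOUTQ:
		*karg = sk->tx_bytes;
		return 0;
	default:
		return -ENOIOCTLCMD;
	}
}

/* Step 2: the ioctl entry point keeps its role and copies the result out. */
static int tcp_ioctl(struct sock *sk, int cmd, int *uarg)
{
	int karg = 0;
	int ret = _tcp_ioctl(sk, cmd, &karg);

	if (ret)
		return ret;
	return put_user_int(karg, uarg);
}
```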

Am I missing something?

Thanks!

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-12 13:53   ` Breno Leitao
@ 2023-04-12 14:28   ` Willem de Bruijn
  -1 siblings, 0 replies; 108+ messages in thread
From: Willem de Bruijn @ 2023-04-12 14:28 UTC (permalink / raw)
  To: dccp

Breno Leitao wrote:
> On Tue, Apr 11, 2023 at 09:28:29AM -0600, Jens Axboe wrote:
> > On 4/11/23 9:24 AM, Willem de Bruijn wrote:
> > > Jens Axboe wrote:
> > >> On 4/11/23 9:00 AM, Willem de Bruijn wrote:
> > >> But that doesn't work, because sock->ops->ioctl() assumes the arg is
> > >> memory in userspace. Or do you mean change all of the sock->ops->ioctl()
> > >> to pass in on-stack memory (or similar) and have it work with a kernel
> > >> address?
> > > 
> > > That was what I suggested indeed.
> > > 
> > > It's about as much code change as this patch series. But it avoids
> > > the code duplication.
> > 
> > Breno, want to tackle that as a prep patch first? Should make the
> > functional changes afterwards much more straightforward, and will allow
> > support for anything really.
> 
> Absolutely. I just want to make sure that I understood the approach we
> agreed on here.
> 
> Let me explain what I understood, taking TCP as an example:
> 
> 1) Rename tcp_ioctl() to something like _tcp_ioctl(), where the 'arg'
> argument is now just kernel memory (located in the caller's stack
> frame).
> 
> 2) Recreate tcp_ioctl() so that it basically allocates an 'arg' on the
> stack and calls _tcp_ioctl(), passing that 'arg' argument. At the bottom
> of the new tcp_ioctl(), call put_user(in_kernel_arg, userspace_arg).
> 
> 3) Repeat it for the 20 protocols that implement ioctl:
> 
> 	ag  "struct proto .* = {" -A 20 net/ | grep \.ioctl
> 	net/dccp/ipv6.c 	.ioctl	= dccp_ioctl,
> 	net/dccp/ipv4.c		.ioctl	= dccp_ioctl,
> 	net/ieee802154/socket.c .ioctl	= dgram_ioctl,
> 	net/ipv4/udplite.c	.ioctl	= udp_ioctl,
> 	net/ipv4/raw.c 		.ioctl	= raw_ioctl,
> 	net/ipv4/udp.c		.ioctl	= udp_ioctl,
> 	net/ipv4/tcp_ipv4.c 	.ioctl	= tcp_ioctl,
> 	net/ipv6/raw.c		.ioctl	= rawv6_ioctl,
> 	net/ipv6/tcp_ipv6.c	.ioctl	= tcp_ioctl,
> 	net/ipv6/udp.c	 	.ioctl	= udp_ioctl,
> 	net/ipv6/udplite.c	.ioctl	= udp_ioctl,
> 	net/l2tp/l2tp_ip6.c	.ioctl	= l2tp_ioctl,
> 	net/l2tp/l2tp_ip.c	.ioctl	= l2tp_ioctl,
> 	net/phonet/datagram.c	.ioctl	= pn_ioctl,
> 	net/phonet/pep.c	.ioctl	= pep_ioctl,
> 	net/rds/af_rds.c	.ioctl	=	rds_ioctl,
> 	net/sctp/socket.c	.ioctl  =	sctp_ioctl,
> 	net/sctp/socket.c	.ioctl	= sctp_ioctl,
> 	net/xdp/xsk.c		.ioctl	= sock_no_ioctl,
> 	net/mptcp/protocol.c	.ioctl	= mptcp_ioctl,
> 
> Am I missing something?

The suggestion is to convert all of them to take kernel memory and do
the put_user in the caller of .ioctl, rather than creating a wrapper for
each individual instance and adding a separate .iouring_cmd for each.

"change all of the sock->ops->ioctl() to pass in on-stack memory
(or similar) and have it work with a kernel address"
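
As a rough illustration of that shape (a userspace model under
assumptions, not a patch): every protocol's .ioctl takes a kernel
pointer, and a single generic caller does the copy to userspace. The
names sock_ioctl_out() and sock_uring_cmd_model() are hypothetical here,
struct sock and struct proto are toys, and memcpy() models put_user().

```c
#include <errno.h>
#include <string.h>

#define SIOCINQ 0x541B	/* local definition for this model */

struct sock;

/* After conversion, every protocol's .ioctl takes a kernel pointer. */
struct proto {
	int (*ioctl)(struct sock *sk, int cmd, int *karg);
};

struct sock {
	const struct proto *prot;
	int rx_bytes;		/* stream: unread bytes */
	int first_dgram_len;	/* datagram: size of next pending datagram */
};

static int tcp_ioctl(struct sock *sk, int cmd, int *karg)
{
	if (cmd != SIOCINQ)
		return -EINVAL;
	*karg = sk->rx_bytes;
	return 0;
}

static int udp_ioctl(struct sock *sk, int cmd, int *karg)
{
	if (cmd != SIOCINQ)
		return -EINVAL;
	*karg = sk->first_dgram_len;
	return 0;
}

static const struct proto tcp_prot = { .ioctl = tcp_ioctl };
static const struct proto udp_prot = { .ioctl = udp_ioctl };

/* The one place that touches user memory; memcpy() models put_user(). */
static int sock_ioctl_out(struct sock *sk, int cmd, int *uarg)
{
	int karg = 0;
	int ret = sk->prot->ioctl(sk, cmd, &karg);

	if (ret)
		return ret;
	if (!uarg)
		return -EFAULT;
	memcpy(uarg, &karg, sizeof(karg));
	return 0;
}

/* An in-kernel caller (e.g. a uring_cmd handler) skips the copy. */
static int sock_uring_cmd_model(struct sock *sk, int cmd)
{
	int karg = 0;
	int ret = sk->prot->ioctl(sk, cmd, &karg);

	return ret ? ret : karg;
}
```

Compared with a per-protocol wrapper, the copy-out logic exists once, and
no protocol needs a separate io_uring callback.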

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-12 14:28   ` Willem de Bruijn
@ 2023-04-13  0:02   ` Breno Leitao
  -1 siblings, 0 replies; 108+ messages in thread
From: Breno Leitao @ 2023-04-13  0:02 UTC (permalink / raw)
  To: dccp

On Wed, Apr 12, 2023 at 10:28:41AM -0400, Willem de Bruijn wrote:
> Breno Leitao wrote:
> > On Tue, Apr 11, 2023 at 09:28:29AM -0600, Jens Axboe wrote:
> > > On 4/11/23 9:24 AM, Willem de Bruijn wrote:
> > > > Jens Axboe wrote:
> > > >> On 4/11/23 9:00 AM, Willem de Bruijn wrote:
> > > >> But that doesn't work, because sock->ops->ioctl() assumes the arg is
> > > >> memory in userspace. Or do you mean change all of the sock->ops->ioctl()
> > > >> to pass in on-stack memory (or similar) and have it work with a kernel
> > > >> address?
> > > > 
> > > > That was what I suggested indeed.
> > > > 
> > > > It's about as much code change as this patch series. But it avoids
> > > > the code duplication.
> > > 
> > > Breno, want to tackle that as a prep patch first? Should make the
> > > functional changes afterwards much more straightforward, and will allow
> > > support for anything really.
> > 
> > Absolutely. I just want to make sure that I understood the approach we
> > agreed on here.
> > 
> > Let me explain what I understood, taking TCP as an example:
> > 
> > 1) Rename tcp_ioctl() to something like _tcp_ioctl(), where the 'arg'
> > argument is now just kernel memory (located in the caller's stack
> > frame).
> > 
> > 2) Recreate tcp_ioctl() so that it basically allocates an 'arg' on the
> > stack and calls _tcp_ioctl(), passing that 'arg' argument. At the bottom
> > of the new tcp_ioctl(), call put_user(in_kernel_arg, userspace_arg).
> > 
> > 3) Repeat it for the 20 protocols that implement ioctl:
> > 
> > 	ag  "struct proto .* = {" -A 20 net/ | grep \.ioctl
> > 	net/dccp/ipv6.c 	.ioctl	= dccp_ioctl,
> > 	net/dccp/ipv4.c		.ioctl	= dccp_ioctl,
> > 	net/ieee802154/socket.c .ioctl	= dgram_ioctl,
> > 	net/ipv4/udplite.c	.ioctl	= udp_ioctl,
> > 	net/ipv4/raw.c 		.ioctl	= raw_ioctl,
> > 	net/ipv4/udp.c		.ioctl	= udp_ioctl,
> > 	net/ipv4/tcp_ipv4.c 	.ioctl	= tcp_ioctl,
> > 	net/ipv6/raw.c		.ioctl	= rawv6_ioctl,
> > 	net/ipv6/tcp_ipv6.c	.ioctl	= tcp_ioctl,
> > 	net/ipv6/udp.c	 	.ioctl	= udp_ioctl,
> > 	net/ipv6/udplite.c	.ioctl	= udp_ioctl,
> > 	net/l2tp/l2tp_ip6.c	.ioctl	= l2tp_ioctl,
> > 	net/l2tp/l2tp_ip.c	.ioctl	= l2tp_ioctl,
> > 	net/phonet/datagram.c	.ioctl	= pn_ioctl,
> > 	net/phonet/pep.c	.ioctl	= pep_ioctl,
> > 	net/rds/af_rds.c	.ioctl	=	rds_ioctl,
> > 	net/sctp/socket.c	.ioctl  =	sctp_ioctl,
> > 	net/sctp/socket.c	.ioctl	= sctp_ioctl,
> > 	net/xdp/xsk.c		.ioctl	= sock_no_ioctl,
> > 	net/mptcp/protocol.c	.ioctl	= mptcp_ioctl,
> > 
> > Am I missing something?
> 
> The suggestion is to convert all of them to take kernel memory and do
> the put_user in the caller of .ioctl, rather than creating a wrapper for
> each individual instance and adding a separate .iouring_cmd for each.
> 
> "change all of the sock->ops->ioctl() to pass in on-stack memory
> (or similar) and have it work with a kernel address"

Is it possible to do that for cases where we don't know the size of the
buffer ahead of time?

For instance, take the raw_ioctl()/rawv6_ioctl() case. The "arg" argument
is used in different ways (one for input and one for output):

  1) If cmd == SIOCOUTQ or SIOCINQ, then the result is written back to
  userspace:
  	put_user(amount, (int __user *)arg)

  2) For the default cmd, ipmr_ioctl() is called, which reads from the
  `arg` parameter:
	copy_from_user(&vr, arg, sizeof(vr))

How can these contradictory behaviours be handled ahead of time (in the
caller, where the buffers will be prepared)?
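
A minimal userspace sketch of why this is awkward, under stated
assumptions: fake_raw_ioctl(), CMD_OUTQ, CMD_GETVIF and struct vif_req are
hypothetical stand-ins for raw_ioctl(), SIOCOUTQ and an ipmr_ioctl()
command, and plain memcpy() stands in for put_user()/copy_from_user(). The
handler only ever sees kernel memory, so the wrapper needs to know the
size and direction of each command before it makes the call:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical command numbers, not the real SIOC* values. */
enum { CMD_OUTQ = 1, CMD_GETVIF = 2 };

/* Hypothetical stand-in for a struct that the handler reads and fills. */
struct vif_req { int vifi; long icount; };

/* Handler that only touches kernel memory, never a user pointer. */
static int fake_raw_ioctl(int cmd, void *karg)
{
	switch (cmd) {
	case CMD_OUTQ:
		*(int *)karg = 42;		/* pure output */
		return 0;
	case CMD_GETVIF: {
		struct vif_req *vr = karg;	/* vifi is input, icount is output */
		vr->icount = vr->vifi * 10;
		return 0;
	}
	default:
		return -ENOTTY;
	}
}

/* What the wrapper has to know about a command ahead of the call. */
struct cmd_desc { int cmd; size_t size; int copy_in; int copy_out; };

static const struct cmd_desc cmd_table[] = {
	{ CMD_OUTQ,   sizeof(int),            0, 1 },
	{ CMD_GETVIF, sizeof(struct vif_req), 1, 1 },
};

/* Wrapper: copy in, call the handler on kernel memory, copy out. */
static int fake_sock_ioctl(int cmd, void *uarg)
{
	union { int i; struct vif_req vr; } karg;
	size_t n;
	int ret;

	for (n = 0; n < sizeof(cmd_table) / sizeof(cmd_table[0]); n++) {
		const struct cmd_desc *d = &cmd_table[n];

		if (d->cmd != cmd)
			continue;
		memset(&karg, 0, sizeof(karg));
		if (d->copy_in)
			memcpy(&karg, uarg, d->size);	/* copy_from_user() analogue */
		ret = fake_raw_ioctl(cmd, &karg);
		if (ret == 0 && d->copy_out)
			memcpy(uarg, &karg, d->size);	/* put_user() analogue */
		return ret;
	}
	return -ENOTTY;
}
```

The table is the part that does not exist today: every command that is
not a simple "write one int back" needs an entry, which is the
per-command knowledge the question above is about.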

Thank you!

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC] io_uring: Pass whole sqe to commands
@ 2023-04-13  2:56   ` Ming Lei
  0 siblings, 0 replies; 108+ messages in thread
From: Ming Lei @ 2023-04-13  2:56 UTC (permalink / raw)
  To: Breno Leitao
  Cc: asml.silence, axboe, davem, dccp, dsahern, edumazet, io-uring,
	kuba, leit, linux-kernel, marcelo.leitner, matthieu.baerts, mptcp,
	netdev, pabeni, willemdebruijn.kernel, ming.lei

On Thu, Apr 06, 2023 at 09:57:05AM -0700, Breno Leitao wrote:
> Currently uring CMD operation relies on having large SQEs, but future
> operations might want to use normal SQE.
> 
> The io_uring_cmd currently only saves the payload (cmd) part of the SQE,
> but, for commands that use normal SQE size, it might be necessary to
> access the initial SQE fields outside of the payload/cmd block.  So,
> saves the whole SQE other than just the pdu.
> 
> This changes slighlty how the io_uring_cmd works, since the cmd
> structures and callbacks are not opaque to io_uring anymore. I.e, the
> callbacks can look at the SQE entries, not only, in the cmd structure.
> 
> The main advantage is that we don't need to create custom structures for
> simple commands.
> 
> Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---

...

> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> index 2e4c483075d3..9648134ccae1 100644
> --- a/io_uring/uring_cmd.c
> +++ b/io_uring/uring_cmd.c
> @@ -63,14 +63,15 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_done);
>  int io_uring_cmd_prep_async(struct io_kiocb *req)
>  {
>  	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
> -	size_t cmd_size;
> +	size_t size = sizeof(struct io_uring_sqe);
>  
>  	BUILD_BUG_ON(uring_cmd_pdu_size(0) != 16);
>  	BUILD_BUG_ON(uring_cmd_pdu_size(1) != 80);
>  
> -	cmd_size = uring_cmd_pdu_size(req->ctx->flags & IORING_SETUP_SQE128);
> +	if (req->ctx->flags & IORING_SETUP_SQE128)
> +		size <<= 1;
>  
> -	memcpy(req->async_data, ioucmd->cmd, cmd_size);
> +	memcpy(req->async_data, ioucmd->sqe, size);

With this copy, some SQE fields are READ TWICE: once in io_init_req() and
once here. The driver may therefore see a different value for a field than
the one observed in io_init_req().

Can this kind of inconsistency cause trouble for the driver?

If it isn't a problem, this patch looks fine.

But I guess any access to cmd->sqe in a driver has to be careful about a
potential SQE update after submission.
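
The READ TWICE concern can be sketched in userspace as follows. This is
only an illustration under stated assumptions: struct fake_sqe, init_req()
and prep_async() are hypothetical, heavily reduced stand-ins for the SQE,
io_init_req() and io_uring_cmd_prep_async(), and the caller modifying the
SQE between the two calls plays the role of userspace rewriting the
shared submission ring.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, reduced stand-in for struct io_uring_sqe. */
struct fake_sqe { unsigned char opcode; int fd; unsigned int cmd_op; };

struct fake_req {
	unsigned char opcode;		/* value sampled at init time */
	struct fake_sqe async_copy;	/* stand-in for req->async_data */
};

/* First read: sample a field from the shared SQE, as init would. */
static void init_req(struct fake_req *req, const struct fake_sqe *sqe)
{
	req->opcode = sqe->opcode;
}

/* Second read: copy the whole SQE again for the driver to look at. */
static void prep_async(struct fake_req *req, const struct fake_sqe *sqe)
{
	memcpy(&req->async_copy, sqe, sizeof(*sqe));
}

/* Returns 1 if the driver-visible copy disagrees with what init saw. */
static int sqe_copy_is_inconsistent(const struct fake_req *req)
{
	return req->async_copy.opcode != req->opcode;
}
```

If the shared entry changes between init_req() and prep_async(), the
check returns 1, which is the situation a driver reading cmd->sqe would
have to tolerate.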

Thanks,
Ming


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
@ 2023-04-13 14:24   ` Willem de Bruijn
  0 siblings, 0 replies; 108+ messages in thread
From: Willem de Bruijn @ 2023-04-13 14:24 UTC (permalink / raw)
  To: Breno Leitao, Willem de Bruijn
  Cc: Jens Axboe, David Ahern, Willem de Bruijn, io-uring, netdev, kuba,
	asml.silence, leit, edumazet, pabeni, davem, dccp, mptcp,
	linux-kernel, matthieu.baerts, marcelo.leitner

Breno Leitao wrote:
> On Wed, Apr 12, 2023 at 10:28:41AM -0400, Willem de Bruijn wrote:
> > Breno Leitao wrote:
> > > On Tue, Apr 11, 2023 at 09:28:29AM -0600, Jens Axboe wrote:
> > > > On 4/11/23 9:24?AM, Willem de Bruijn wrote:
> > > > > Jens Axboe wrote:
> > > > >> On 4/11/23 9:00?AM, Willem de Bruijn wrote:
> > > > >> But that doesn't work, because sock->ops->ioctl() assumes the arg is
> > > > >> memory in userspace. Or do you mean change all of the sock->ops->ioctl()
> > > > >> to pass in on-stack memory (or similar) and have it work with a kernel
> > > > >> address?
> > > > > 
> > > > > That was what I suggested indeed.
> > > > > 
> > > > > It's about as much code change as this patch series. But it avoids
> > > > > the code duplication.
> > > > 
> > > > Breno, want to tackle that as a prep patch first? Should make the
> > > > functional changes afterwards much more straightforward, and will allow
> > > > support for anything really.
> > > 
> > > Absolutely. I just want to make sure that I got the proper approach that
> > > we agreed here.
> > > 
> > > Let me explain what I understood taking TCP as an example:
> > > 
> > > 1) Rename tcp_ioctl() to something as _tcp_ioctl() where the 'arg'
> > > argument is now just a kernel memory (located in the stack frame from the
> > > callee).
> > > 
> > > 2) Recreate "tcp_ioctl()" that will basically allocate a 'arg' in the
> > > stack and call _tcp_ioctl() passing that 'arg' argument. At the bottom of
> > > this (tcp_ioctl() function) function, call `put_user(in_kernel_arg, userspace_arg)
> > > 
> > > 3) Repeat it for the 20 protocols that implement ioctl:
> > > 
> > > 	ag  "struct proto .* = {" -A 20 net/ | grep \.ioctl
> > > 	net/dccp/ipv6.c 	.ioctl	= dccp_ioctl,
> > > 	net/dccp/ipv4.c		.ioctl	= dccp_ioctl,
> > > 	net/ieee802154/socket.c .ioctl	= dgram_ioctl,
> > > 	net/ipv4/udplite.c	.ioctl	= udp_ioctl,
> > > 	net/ipv4/raw.c 		.ioctl	= raw_ioctl,
> > > 	net/ipv4/udp.c		.ioctl	= udp_ioctl,
> > > 	net/ipv4/tcp_ipv4.c 	.ioctl	= tcp_ioctl,
> > > 	net/ipv6/raw.c		.ioctl	= rawv6_ioctl,
> > > 	net/ipv6/tcp_ipv6.c	.ioctl	= tcp_ioctl,
> > > 	net/ipv6/udp.c	 	.ioctl	= udp_ioctl,
> > > 	net/ipv6/udplite.c	.ioctl	= udp_ioctl,
> > > 	net/l2tp/l2tp_ip6.c	.ioctl	= l2tp_ioctl,
> > > 	net/l2tp/l2tp_ip.c	.ioctl	= l2tp_ioctl,
> > > 	net/phonet/datagram.:	.ioctl	= pn_ioctl,
> > > 	net/phonet/pep.c	.ioctl	= pep_ioctl,
> > > 	net/rds/af_rds.c	.ioctl	=	rds_ioctl,
> > > 	net/sctp/socket.c	.ioctl  =	sctp_ioctl,
> > > 	net/sctp/socket.c	.ioctl	= sctp_ioctl,
> > > 	net/xdp/xsk.c		.ioctl	= sock_no_ioctl,
> > > 	net/mptcp/protocol.c	.ioctl	= mptcp_ioctl,
> > > 
> > > Am I missing something?
> > 
> > The suggestion is to convert all to take kernel memory and do the
> > put_cmsg in the caller of .ioctl. Rather than create a wrapper for
> > each individual instance and add a separate .iouring_cmd for each.
> > 
> > "change all of the sock->ops->ioctl() to pass in on-stack memory
> > (or similar) and have it work with a kernel address"
> 
> is it possible to do it for cases where we don't know what is the size
> of the buffer?
> 
> For instance the raw_ioctl()/rawv6_ioctl() case. The "arg" argument is
> used in different ways (one for input and one for output):
> 
>   1) If cmd == SIOCOUTQ or SIOCINQ, then the return value will be
>   returned to userspace:
>   	put_user(amount, (int __user *)arg)
> 
>   2) For default cmd, ipmr_ioctl() is called, which reads from the `arg`
>   parameter:
> 	copy_from_user(&vr, arg, sizeof(vr)
> 
> How to handle these contradictory behaviour ahead of time (at callee
> time, where the buffers will be prepared)?
> 
> Thank you!

Ah, you found a counter-example to the simple put_user pattern.

The answer perhaps depends on how many such counter-examples you
encounter in the list you gave. If this is the only one, exceptions
in the wrapper are reasonable. Not if there are many.

Is the intent for io_uring to support all cases eventually? The
current patch series targets only the more common fast-path operations.

Probably also relevant is whether/how the approach can be extended
to [gs]etsockopt, as that was another example given, with the same
challenge.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
@ 2023-04-13 14:45   ` Jakub Kicinski
  0 siblings, 0 replies; 108+ messages in thread
From: Jakub Kicinski @ 2023-04-13 14:45 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Breno Leitao, Jens Axboe, David Ahern, Willem de Bruijn, io-uring,
	netdev, asml.silence, leit, edumazet, pabeni, davem, dccp, mptcp,
	linux-kernel, matthieu.baerts, marcelo.leitner

On Thu, 13 Apr 2023 10:24:31 -0400 Willem de Bruijn wrote:
> Probably also relevant is whether/how the approach can be extended
> to [gs]etsockopt, as that was another example given, with the same
> challenge.

I had the same thought, given that BPF filtering/integration with
*etsockopt repeatedly gives us grief.
The only lesson from that I can think of is that we should perhaps
suffer through the one-by-one conversions for a while, pulling the cases
we have inspected out into common code, rather than hope we can cover
everything in one fell swoop.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* RE: [PATCH 0/5] add initial io_uring_cmd support for sockets
@ 2023-04-13 14:57   ` David Laight
  0 siblings, 0 replies; 108+ messages in thread
From: David Laight @ 2023-04-13 14:57 UTC (permalink / raw)
  To: 'Willem de Bruijn', Breno Leitao
  Cc: Jens Axboe, David Ahern, Willem de Bruijn,
	io-uring@vger.kernel.org, netdev@vger.kernel.org, kuba@kernel.org,
	asml.silence@gmail.com, leit@fb.com, edumazet@google.com,
	pabeni@redhat.com, davem@davemloft.net, dccp@vger.kernel.org,
	mptcp@lists.linux.dev, linux-kernel@vger.kernel.org,
	matthieu.baerts@tessares.net, marcelo.leitner@gmail.com

From: Willem de Bruijn
> Sent: 13 April 2023 15:25
...
> > For instance the raw_ioctl()/rawv6_ioctl() case. The "arg" argument is
> > used in different ways (one for input and one for output):
> >
> >   1) If cmd == SIOCOUTQ or SIOCINQ, then the return value will be
> >   returned to userspace:
> >   	put_user(amount, (int __user *)arg)

There is always the option of defining alternate ioctl
'cmd' codes that use IOR() and IOW() and requiring that
io_uring applications use the alternate forms.

Then have two 'ioctl' functions with a new one for IOR()
type commands and the existing one for compatibility
that might just do a translation (or return a translated
command to avoid extra stack use).

You may still want to pass through both the kernel and
user (if a user request) buffer addresses to allow for
those broken requests where the buffer direction bits
are wrong.
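
As a rough illustration of the idea, the direction and size that
_IOR()/_IOW() encode into a command number can be decoded again with the
_IOC_DIR() and _IOC_SIZE() macros from <linux/ioctl.h>. The command names
SOCK_IOC_OUTQ and SOCK_IOC_SETFOO, the 'S' type byte and struct ioc_plan
below are hypothetical and are not existing kernel definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <linux/ioctl.h>

/* Hypothetical alternate command codes carrying direction and size. */
#define SOCK_IOC_OUTQ	_IOR('S', 0x01, int)	/* kernel -> user */
#define SOCK_IOC_SETFOO	_IOW('S', 0x02, long)	/* user -> kernel */

/* Buffer handling a generic wrapper could derive before the call. */
struct ioc_plan { size_t size; int copy_in; int copy_out; };

/* Derive size and copy direction from the command number alone. */
static struct ioc_plan plan_for_cmd(unsigned int cmd)
{
	struct ioc_plan p;

	p.size = _IOC_SIZE(cmd);
	p.copy_in = !!(_IOC_DIR(cmd) & _IOC_WRITE);	/* user writes data in */
	p.copy_out = !!(_IOC_DIR(cmd) & _IOC_READ);	/* user reads data out */
	return p;
}
```

With codes of this form the caller could size and fill a kernel buffer
without knowing anything else about the command. The legacy SIOC* numbers
do not follow this encoding, hence the need for the translation step
mentioned above.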

	David

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC] io_uring: Pass whole sqe to commands
@ 2023-04-13 16:47   ` Breno Leitao
  0 siblings, 0 replies; 108+ messages in thread
From: Breno Leitao @ 2023-04-13 16:47 UTC (permalink / raw)
  To: Ming Lei
  Cc: asml.silence, axboe, davem, dccp, dsahern, edumazet, io-uring,
	kuba, leit, linux-kernel, marcelo.leitner, matthieu.baerts, mptcp,
	netdev, pabeni, willemdebruijn.kernel

Hello Ming,

On Thu, Apr 13, 2023 at 10:56:49AM +0800, Ming Lei wrote:
> On Thu, Apr 06, 2023 at 09:57:05AM -0700, Breno Leitao wrote:
> > Currently uring CMD operation relies on having large SQEs, but future
> > operations might want to use normal SQE.
> > 
> > The io_uring_cmd currently only saves the payload (cmd) part of the SQE,
> > but, for commands that use normal SQE size, it might be necessary to
> > access the initial SQE fields outside of the payload/cmd block.  So,
> > saves the whole SQE other than just the pdu.
> > 
> > This changes slighlty how the io_uring_cmd works, since the cmd
> > structures and callbacks are not opaque to io_uring anymore. I.e, the
> > callbacks can look at the SQE entries, not only, in the cmd structure.
> > 
> > The main advantage is that we don't need to create custom structures for
> > simple commands.
> > 
> > Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
> > Signed-off-by: Breno Leitao <leitao@debian.org>
> > ---
> 
> ...
> 
> > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> > index 2e4c483075d3..9648134ccae1 100644
> > --- a/io_uring/uring_cmd.c
> > +++ b/io_uring/uring_cmd.c
> > @@ -63,14 +63,15 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_done);
> >  int io_uring_cmd_prep_async(struct io_kiocb *req)
> >  {
> >  	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
> > -	size_t cmd_size;
> > +	size_t size = sizeof(struct io_uring_sqe);
> >  
> >  	BUILD_BUG_ON(uring_cmd_pdu_size(0) != 16);
> >  	BUILD_BUG_ON(uring_cmd_pdu_size(1) != 80);
> >  
> > -	cmd_size = uring_cmd_pdu_size(req->ctx->flags & IORING_SETUP_SQE128);
> > +	if (req->ctx->flags & IORING_SETUP_SQE128)
> > +		size <<= 1;
> >  
> > -	memcpy(req->async_data, ioucmd->cmd, cmd_size);
> > +	memcpy(req->async_data, ioucmd->sqe, size);
> 
> The copy will make some fields of sqe become READ TWICE, and driver may see
> different sqe field value compared with the one observed in io_init_req().

This copy only happens if the operation goes to the async path
(calling io_uring_cmd_prep_async()).  This only happens if
f_op->uring_cmd() returns -EAGAIN.

          ret = file->f_op->uring_cmd(ioucmd, issue_flags);
          if (ret == -EAGAIN) {
                  if (!req_has_async_data(req)) {
                          if (io_alloc_async_data(req))
                                  return -ENOMEM;
                          io_uring_cmd_prep_async(req);
                  }
                  return -EAGAIN;
          }

Are you saying that after this copy, the operation is still reading from
sqe instead of req->async_data?
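
The retry flow described above can be sketched in userspace. Everything
here is a simplified assumption: fake_issue(), fake_uring_cmd() and
struct fake_req are hypothetical stand-ins and not the io_uring
implementation, and repointing req->sqe at the private copy is an
assumption of this sketch. It shows the SQE being copied only on the
first -EAGAIN, with the copy size doubling for 128-byte SQEs (the
analogue of IORING_SETUP_SQE128):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define FAKE_SQE_SIZE 64	/* size of a normal SQE in bytes */

struct fake_req {
	const unsigned char *sqe;	/* points into the shared ring at first */
	unsigned char *async_data;	/* private copy, allocated on -EAGAIN */
	int sqe128;			/* nonzero when big SQEs are in use */
	int copies;			/* how many times the SQE was copied */
	int attempts;			/* how many times the handler ran */
};

/* Hypothetical handler: would block on the first attempt only. */
static int fake_uring_cmd(struct fake_req *req)
{
	return req->attempts++ == 0 ? -EAGAIN : 0;
}

/* Whole-SQE copy size, doubled for big SQEs as in the quoted diff. */
static size_t sqe_copy_size(int sqe128)
{
	size_t size = FAKE_SQE_SIZE;

	if (sqe128)
		size <<= 1;
	return size;
}

/* Issue once; on the first -EAGAIN, snapshot the SQE for the retry. */
static int fake_issue(struct fake_req *req)
{
	int ret = fake_uring_cmd(req);

	if (ret == -EAGAIN && !req->async_data) {
		size_t size = sqe_copy_size(req->sqe128);

		req->async_data = malloc(size);
		if (!req->async_data)
			return -ENOMEM;
		memcpy(req->async_data, req->sqe, size);
		req->sqe = req->async_data;	/* assumption: retries read the copy */
		req->copies++;
	}
	return ret;
}
```

In this sketch a retried request never touches the shared ring again, so
the open question is only whether the snapshot matches what was read
before the copy was taken.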

If you have an example of the two-copy flow, that would be great.

Thanks for the review,
Breno

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC] io_uring: Pass whole sqe to commands
  2023-04-13 16:47   ` Breno Leitao
@ 2023-04-14  2:12   ` Ming Lei
  -1 siblings, 0 replies; 108+ messages in thread
From: Ming Lei @ 2023-04-14  2:12 UTC (permalink / raw)
  To: dccp

On Thu, Apr 13, 2023 at 09:47:56AM -0700, Breno Leitao wrote:
> Hello Ming,
> 
> On Thu, Apr 13, 2023 at 10:56:49AM +0800, Ming Lei wrote:
> > On Thu, Apr 06, 2023 at 09:57:05AM -0700, Breno Leitao wrote:
> > > Currently uring CMD operation relies on having large SQEs, but future
> > > operations might want to use normal SQE.
> > > 
> > > The io_uring_cmd currently only saves the payload (cmd) part of the SQE,
> > > but, for commands that use normal SQE size, it might be necessary to
> > > access the initial SQE fields outside of the payload/cmd block.  So,
> > > saves the whole SQE other than just the pdu.
> > > 
> > > This changes slighlty how the io_uring_cmd works, since the cmd
> > > structures and callbacks are not opaque to io_uring anymore. I.e, the
> > > callbacks can look at the SQE entries, not only, in the cmd structure.
> > > 
> > > The main advantage is that we don't need to create custom structures for
> > > simple commands.
> > > 
> > > Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
> > > Signed-off-by: Breno Leitao <leitao@debian.org>
> > > ---
> > 
> > ...
> > 
> > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> > > index 2e4c483075d3..9648134ccae1 100644
> > > --- a/io_uring/uring_cmd.c
> > > +++ b/io_uring/uring_cmd.c
> > > @@ -63,14 +63,15 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_done);
> > >  int io_uring_cmd_prep_async(struct io_kiocb *req)
> > >  {
> > >  	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
> > > -	size_t cmd_size;
> > > +	size_t size = sizeof(struct io_uring_sqe);
> > >  
> > >  	BUILD_BUG_ON(uring_cmd_pdu_size(0) != 16);
> > >  	BUILD_BUG_ON(uring_cmd_pdu_size(1) != 80);
> > >  
> > > -	cmd_size = uring_cmd_pdu_size(req->ctx->flags & IORING_SETUP_SQE128);
> > > +	if (req->ctx->flags & IORING_SETUP_SQE128)
> > > +		size <<= 1;
> > >  
> > > -	memcpy(req->async_data, ioucmd->cmd, cmd_size);
> > > +	memcpy(req->async_data, ioucmd->sqe, size);
> > 
> > The copy will make some fields of sqe become READ TWICE, and driver may see
> > different sqe field value compared with the one observed in io_init_req().
> 
> This copy only happens if the operation goes to the async path
> (calling io_uring_cmd_prep_async()).  This only happens if
> f_op->uring_cmd() returns -EAGAIN.
> 
>           ret = file->f_op->uring_cmd(ioucmd, issue_flags);
>           if (ret == -EAGAIN) {
>                   if (!req_has_async_data(req)) {
>                           if (io_alloc_async_data(req))
>                                   return -ENOMEM;
>                           io_uring_cmd_prep_async(req);
>                   }
>                   return -EAGAIN;
>           }
> 
> Are you saying that after this copy, the operation is still reading from
> sqe instead of req->async_data?

I meant that the 2nd read is on the SQE copy (req->async_data), but the same
fields can differ between the two READs (the first is done on the original
SQE during io_init_req(), and the second on the SQE copy in the driver).

Will this kind of inconsistency cause trouble for the driver? READ
TWICE becomes possible with this patch.

> 
> If you have an example of the two copes flow, that would be great.

I don't have an example yet, but I also don't see any access to cmd->sqe
(except for cmd_op) in your patches.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC] io_uring: Pass whole sqe to commands
@ 2023-04-14  2:12   ` Ming Lei
  0 siblings, 0 replies; 108+ messages in thread
From: Ming Lei @ 2023-04-14  2:12 UTC (permalink / raw)
  To: Breno Leitao
  Cc: asml.silence, axboe, davem, dccp, dsahern, edumazet, io-uring,
	kuba, leit, linux-kernel, marcelo.leitner, matthieu.baerts, mptcp,
	netdev, pabeni, willemdebruijn.kernel, ming.lei

On Thu, Apr 13, 2023 at 09:47:56AM -0700, Breno Leitao wrote:
> Hello Ming,
> 
> On Thu, Apr 13, 2023 at 10:56:49AM +0800, Ming Lei wrote:
> > On Thu, Apr 06, 2023 at 09:57:05AM -0700, Breno Leitao wrote:
> > > Currently uring CMD operation relies on having large SQEs, but future
> > > operations might want to use normal SQE.
> > > 
> > > The io_uring_cmd currently only saves the payload (cmd) part of the SQE,
> > > but, for commands that use normal SQE size, it might be necessary to
> > > access the initial SQE fields outside of the payload/cmd block.  So,
> > > saves the whole SQE other than just the pdu.
> > > 
> > > This changes slighlty how the io_uring_cmd works, since the cmd
> > > structures and callbacks are not opaque to io_uring anymore. I.e, the
> > > callbacks can look at the SQE entries, not only, in the cmd structure.
> > > 
> > > The main advantage is that we don't need to create custom structures for
> > > simple commands.
> > > 
> > > Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
> > > Signed-off-by: Breno Leitao <leitao@debian.org>
> > > ---
> > 
> > ...
> > 
> > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> > > index 2e4c483075d3..9648134ccae1 100644
> > > --- a/io_uring/uring_cmd.c
> > > +++ b/io_uring/uring_cmd.c
> > > @@ -63,14 +63,15 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_done);
> > >  int io_uring_cmd_prep_async(struct io_kiocb *req)
> > >  {
> > >  	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
> > > -	size_t cmd_size;
> > > +	size_t size = sizeof(struct io_uring_sqe);
> > >  
> > >  	BUILD_BUG_ON(uring_cmd_pdu_size(0) != 16);
> > >  	BUILD_BUG_ON(uring_cmd_pdu_size(1) != 80);
> > >  
> > > -	cmd_size = uring_cmd_pdu_size(req->ctx->flags & IORING_SETUP_SQE128);
> > > +	if (req->ctx->flags & IORING_SETUP_SQE128)
> > > +		size <<= 1;
> > >  
> > > -	memcpy(req->async_data, ioucmd->cmd, cmd_size);
> > > +	memcpy(req->async_data, ioucmd->sqe, size);
> > 
> > The copy will make some fields of sqe become READ TWICE, and driver may see
> > different sqe field value compared with the one observed in io_init_req().
> 
> This copy only happens if the operation goes to the async path
> (calling io_uring_cmd_prep_async()).  This only happens if
> f_op->uring_cmd() returns -EAGAIN.
> 
>           ret = file->f_op->uring_cmd(ioucmd, issue_flags);
>           if (ret == -EAGAIN) {
>                   if (!req_has_async_data(req)) {
>                           if (io_alloc_async_data(req))
>                                   return -ENOMEM;
>                           io_uring_cmd_prep_async(req);
>                   }
>                   return -EAGAIN;
>           }
> 
> Are you saying that after this copy, the operation is still reading from
> sqe instead of req->async_data?

I meant that the 2nd read is on the SQE copy (req->async_data), but the same
fields can differ between the two READs (the first is done on the original
SQE during io_init_req(), and the second on the SQE copy in the driver).

Will this kind of inconsistency cause trouble for the driver? READ
TWICE becomes possible with this patch.

> 
> If you have an example of the two copes flow, that would be great.

I don't have an example yet, but I also don't see any access to cmd->sqe
(except for cmd_op) in your patches.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC] io_uring: Pass whole sqe to commands
  2023-04-14  2:12   ` Ming Lei
@ 2023-04-14 13:12   ` Pavel Begunkov
  -1 siblings, 0 replies; 108+ messages in thread
From: Pavel Begunkov @ 2023-04-14 13:12 UTC (permalink / raw)
  To: dccp

On 4/14/23 03:12, Ming Lei wrote:
> On Thu, Apr 13, 2023 at 09:47:56AM -0700, Breno Leitao wrote:
>> Hello Ming,
>>
>> On Thu, Apr 13, 2023 at 10:56:49AM +0800, Ming Lei wrote:
>>> On Thu, Apr 06, 2023 at 09:57:05AM -0700, Breno Leitao wrote:
>>>> Currently uring CMD operation relies on having large SQEs, but future
>>>> operations might want to use normal SQE.
>>>>
>>>> The io_uring_cmd currently only saves the payload (cmd) part of the SQE,
>>>> but, for commands that use normal SQE size, it might be necessary to
>>>> access the initial SQE fields outside of the payload/cmd block.  So,
>>>> saves the whole SQE other than just the pdu.
>>>>
>>>> This changes slighlty how the io_uring_cmd works, since the cmd
>>>> structures and callbacks are not opaque to io_uring anymore. I.e, the
>>>> callbacks can look at the SQE entries, not only, in the cmd structure.
>>>>
>>>> The main advantage is that we don't need to create custom structures for
>>>> simple commands.
>>>>
>>>> Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
>>>> Signed-off-by: Breno Leitao <leitao@debian.org>
>>>> ---
>>>
>>> ...
>>>
>>>> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
>>>> index 2e4c483075d3..9648134ccae1 100644
>>>> --- a/io_uring/uring_cmd.c
>>>> +++ b/io_uring/uring_cmd.c
>>>> @@ -63,14 +63,15 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_done);
>>>>   int io_uring_cmd_prep_async(struct io_kiocb *req)
>>>>   {
>>>>   	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
>>>> -	size_t cmd_size;
>>>> +	size_t size = sizeof(struct io_uring_sqe);
>>>>   
>>>>   	BUILD_BUG_ON(uring_cmd_pdu_size(0) != 16);
>>>>   	BUILD_BUG_ON(uring_cmd_pdu_size(1) != 80);
>>>>   
>>>> -	cmd_size = uring_cmd_pdu_size(req->ctx->flags & IORING_SETUP_SQE128);
>>>> +	if (req->ctx->flags & IORING_SETUP_SQE128)
>>>> +		size <<= 1;
>>>>   
>>>> -	memcpy(req->async_data, ioucmd->cmd, cmd_size);
>>>> +	memcpy(req->async_data, ioucmd->sqe, size);
>>>
>>> The copy will make some fields of sqe become READ TWICE, and driver may see
>>> different sqe field value compared with the one observed in io_init_req().
>>
>> This copy only happens if the operation goes to the async path
>> (calling io_uring_cmd_prep_async()).  This only happens if
>> f_op->uring_cmd() returns -EAGAIN.
>>
>>            ret = file->f_op->uring_cmd(ioucmd, issue_flags);
>>            if (ret == -EAGAIN) {
>>                    if (!req_has_async_data(req)) {
>>                            if (io_alloc_async_data(req))
>>                                    return -ENOMEM;
>>                            io_uring_cmd_prep_async(req);
>>                    }
>>                    return -EAGAIN;
>>            }
>>
>> Are you saying that after this copy, the operation is still reading from
>> sqe instead of req->async_data?
> 
> I meant that the 2nd read is on the sqe copy(req->aync_data), but same
> fields can become different between the two READs(first is done on original
> SQE during io_init_req(), and second is done on sqe copy in driver).
> 
> Will this kind of inconsistency cause trouble for driver? Cause READ
> TWICE becomes possible with this patch.

Right, it might happen, and I was keeping that in mind, but it's not
specific to this patch. It won't reload core io_uring bits, and all
fields that cmds use already have this problem.

Unless there is a better option, the direction we'll be moving in is
adding a preparation step that reads and stashes the parts of the SQE
it cares about, which should also make the full SQE copy
unnecessary or optional.

>> If you have an example of the two copes flow, that would be great.
> 
> Not any example yet, but also not see any access on cmd->sqe(except for cmd_op)
> in your patches too.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC] io_uring: Pass whole sqe to commands
@ 2023-04-14 13:12   ` Pavel Begunkov
  0 siblings, 0 replies; 108+ messages in thread
From: Pavel Begunkov @ 2023-04-14 13:12 UTC (permalink / raw)
  To: Ming Lei, Breno Leitao
  Cc: axboe, davem, dccp, dsahern, edumazet, io-uring, kuba, leit,
	linux-kernel, marcelo.leitner, matthieu.baerts, mptcp, netdev,
	pabeni, willemdebruijn.kernel

On 4/14/23 03:12, Ming Lei wrote:
> On Thu, Apr 13, 2023 at 09:47:56AM -0700, Breno Leitao wrote:
>> Hello Ming,
>>
>> On Thu, Apr 13, 2023 at 10:56:49AM +0800, Ming Lei wrote:
>>> On Thu, Apr 06, 2023 at 09:57:05AM -0700, Breno Leitao wrote:
>>>> Currently uring CMD operation relies on having large SQEs, but future
>>>> operations might want to use normal SQE.
>>>>
>>>> The io_uring_cmd currently only saves the payload (cmd) part of the SQE,
>>>> but, for commands that use normal SQE size, it might be necessary to
>>>> access the initial SQE fields outside of the payload/cmd block.  So,
>>>> saves the whole SQE other than just the pdu.
>>>>
>>>> This changes slighlty how the io_uring_cmd works, since the cmd
>>>> structures and callbacks are not opaque to io_uring anymore. I.e, the
>>>> callbacks can look at the SQE entries, not only, in the cmd structure.
>>>>
>>>> The main advantage is that we don't need to create custom structures for
>>>> simple commands.
>>>>
>>>> Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
>>>> Signed-off-by: Breno Leitao <leitao@debian.org>
>>>> ---
>>>
>>> ...
>>>
>>>> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
>>>> index 2e4c483075d3..9648134ccae1 100644
>>>> --- a/io_uring/uring_cmd.c
>>>> +++ b/io_uring/uring_cmd.c
>>>> @@ -63,14 +63,15 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_done);
>>>>   int io_uring_cmd_prep_async(struct io_kiocb *req)
>>>>   {
>>>>   	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
>>>> -	size_t cmd_size;
>>>> +	size_t size = sizeof(struct io_uring_sqe);
>>>>   
>>>>   	BUILD_BUG_ON(uring_cmd_pdu_size(0) != 16);
>>>>   	BUILD_BUG_ON(uring_cmd_pdu_size(1) != 80);
>>>>   
>>>> -	cmd_size = uring_cmd_pdu_size(req->ctx->flags & IORING_SETUP_SQE128);
>>>> +	if (req->ctx->flags & IORING_SETUP_SQE128)
>>>> +		size <<= 1;
>>>>   
>>>> -	memcpy(req->async_data, ioucmd->cmd, cmd_size);
>>>> +	memcpy(req->async_data, ioucmd->sqe, size);
>>>
>>> The copy will make some fields of sqe become READ TWICE, and driver may see
>>> different sqe field value compared with the one observed in io_init_req().
>>
>> This copy only happens if the operation goes to the async path
>> (calling io_uring_cmd_prep_async()).  This only happens if
>> f_op->uring_cmd() returns -EAGAIN.
>>
>>            ret = file->f_op->uring_cmd(ioucmd, issue_flags);
>>            if (ret == -EAGAIN) {
>>                    if (!req_has_async_data(req)) {
>>                            if (io_alloc_async_data(req))
>>                                    return -ENOMEM;
>>                            io_uring_cmd_prep_async(req);
>>                    }
>>                    return -EAGAIN;
>>            }
>>
>> Are you saying that after this copy, the operation is still reading from
>> sqe instead of req->async_data?
> 
> I meant that the 2nd read is on the sqe copy(req->aync_data), but same
> fields can become different between the two READs(first is done on original
> SQE during io_init_req(), and second is done on sqe copy in driver).
> 
> Will this kind of inconsistency cause trouble for driver? Cause READ
> TWICE becomes possible with this patch.

Right, it might happen, and I was keeping that in mind, but it's not
specific to this patch. It won't reload core io_uring bits, and all
fields that cmds use already have this problem.

Unless there is a better option, the direction we'll be moving in is
adding a preparation step that reads and stashes the parts of the SQE
it cares about, which should also make the full SQE copy
unnecessary or optional.

>> If you have an example of the two copes flow, that would be great.
> 
> Not any example yet, but also not see any access on cmd->sqe(except for cmd_op)
> in your patches too.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC] io_uring: Pass whole sqe to commands
  2023-04-14 13:12   ` Pavel Begunkov
@ 2023-04-14 13:59   ` Ming Lei
  -1 siblings, 0 replies; 108+ messages in thread
From: Ming Lei @ 2023-04-14 13:59 UTC (permalink / raw)
  To: dccp

On Fri, Apr 14, 2023 at 02:12:10PM +0100, Pavel Begunkov wrote:
> On 4/14/23 03:12, Ming Lei wrote:
> > On Thu, Apr 13, 2023 at 09:47:56AM -0700, Breno Leitao wrote:
> > > Hello Ming,
> > > 
> > > On Thu, Apr 13, 2023 at 10:56:49AM +0800, Ming Lei wrote:
> > > > On Thu, Apr 06, 2023 at 09:57:05AM -0700, Breno Leitao wrote:
> > > > > Currently uring CMD operation relies on having large SQEs, but future
> > > > > operations might want to use normal SQE.
> > > > > 
> > > > > The io_uring_cmd currently only saves the payload (cmd) part of the SQE,
> > > > > but, for commands that use normal SQE size, it might be necessary to
> > > > > access the initial SQE fields outside of the payload/cmd block.  So,
> > > > > saves the whole SQE other than just the pdu.
> > > > > 
> > > > > This changes slighlty how the io_uring_cmd works, since the cmd
> > > > > structures and callbacks are not opaque to io_uring anymore. I.e, the
> > > > > callbacks can look at the SQE entries, not only, in the cmd structure.
> > > > > 
> > > > > The main advantage is that we don't need to create custom structures for
> > > > > simple commands.
> > > > > 
> > > > > Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
> > > > > Signed-off-by: Breno Leitao <leitao@debian.org>
> > > > > ---
> > > > 
> > > > ...
> > > > 
> > > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> > > > > index 2e4c483075d3..9648134ccae1 100644
> > > > > --- a/io_uring/uring_cmd.c
> > > > > +++ b/io_uring/uring_cmd.c
> > > > > @@ -63,14 +63,15 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_done);
> > > > >   int io_uring_cmd_prep_async(struct io_kiocb *req)
> > > > >   {
> > > > >   	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
> > > > > -	size_t cmd_size;
> > > > > +	size_t size = sizeof(struct io_uring_sqe);
> > > > >   	BUILD_BUG_ON(uring_cmd_pdu_size(0) != 16);
> > > > >   	BUILD_BUG_ON(uring_cmd_pdu_size(1) != 80);
> > > > > -	cmd_size = uring_cmd_pdu_size(req->ctx->flags & IORING_SETUP_SQE128);
> > > > > +	if (req->ctx->flags & IORING_SETUP_SQE128)
> > > > > +		size <<= 1;
> > > > > -	memcpy(req->async_data, ioucmd->cmd, cmd_size);
> > > > > +	memcpy(req->async_data, ioucmd->sqe, size);
> > > > 
> > > > The copy will make some fields of sqe become READ TWICE, and driver may see
> > > > different sqe field value compared with the one observed in io_init_req().
> > > 
> > > This copy only happens if the operation goes to the async path
> > > (calling io_uring_cmd_prep_async()).  This only happens if
> > > f_op->uring_cmd() returns -EAGAIN.
> > > 
> > >            ret = file->f_op->uring_cmd(ioucmd, issue_flags);
> > >            if (ret == -EAGAIN) {
> > >                    if (!req_has_async_data(req)) {
> > >                            if (io_alloc_async_data(req))
> > >                                    return -ENOMEM;
> > >                            io_uring_cmd_prep_async(req);
> > >                    }
> > >                    return -EAGAIN;
> > >            }
> > > 
> > > Are you saying that after this copy, the operation is still reading from
> > > sqe instead of req->async_data?
> > 
> > I meant that the 2nd read is on the sqe copy(req->aync_data), but same
> > fields can become different between the two READs(first is done on original
> > SQE during io_init_req(), and second is done on sqe copy in driver).
> > 
> > Will this kind of inconsistency cause trouble for driver? Cause READ
> > TWICE becomes possible with this patch.
> 
> Right it might happen, and I was keeping that in mind, but it's not
> specific to this patch. It won't reload core io_uring bits, and all

It depends on whether the driver reloads core bits or not; either way,
the patch exports all fields and opens the window.

> fields cmds use already have this problem.

The driver is supposed to load the cmd fields just once too, right?

> 
> Unless there is a better option, the direction we'll be moving in is
> adding a preparation step that should read and stash parts of SQE
> it cares about, which should also make full SQE copy not
> needed / optional.

Sounds good.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC] io_uring: Pass whole sqe to commands
@ 2023-04-14 13:59   ` Ming Lei
  0 siblings, 0 replies; 108+ messages in thread
From: Ming Lei @ 2023-04-14 13:59 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Breno Leitao, axboe, davem, dccp, dsahern, edumazet, io-uring,
	kuba, leit, linux-kernel, marcelo.leitner, matthieu.baerts, mptcp,
	netdev, pabeni, willemdebruijn.kernel, ming.lei

On Fri, Apr 14, 2023 at 02:12:10PM +0100, Pavel Begunkov wrote:
> On 4/14/23 03:12, Ming Lei wrote:
> > On Thu, Apr 13, 2023 at 09:47:56AM -0700, Breno Leitao wrote:
> > > Hello Ming,
> > > 
> > > On Thu, Apr 13, 2023 at 10:56:49AM +0800, Ming Lei wrote:
> > > > On Thu, Apr 06, 2023 at 09:57:05AM -0700, Breno Leitao wrote:
> > > > > Currently uring CMD operation relies on having large SQEs, but future
> > > > > operations might want to use normal SQE.
> > > > > 
> > > > > The io_uring_cmd currently only saves the payload (cmd) part of the SQE,
> > > > > but, for commands that use normal SQE size, it might be necessary to
> > > > > access the initial SQE fields outside of the payload/cmd block.  So,
> > > > > saves the whole SQE other than just the pdu.
> > > > > 
> > > > > This changes slighlty how the io_uring_cmd works, since the cmd
> > > > > structures and callbacks are not opaque to io_uring anymore. I.e, the
> > > > > callbacks can look at the SQE entries, not only, in the cmd structure.
> > > > > 
> > > > > The main advantage is that we don't need to create custom structures for
> > > > > simple commands.
> > > > > 
> > > > > Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
> > > > > Signed-off-by: Breno Leitao <leitao@debian.org>
> > > > > ---
> > > > 
> > > > ...
> > > > 
> > > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> > > > > index 2e4c483075d3..9648134ccae1 100644
> > > > > --- a/io_uring/uring_cmd.c
> > > > > +++ b/io_uring/uring_cmd.c
> > > > > @@ -63,14 +63,15 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_done);
> > > > >   int io_uring_cmd_prep_async(struct io_kiocb *req)
> > > > >   {
> > > > >   	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
> > > > > -	size_t cmd_size;
> > > > > +	size_t size = sizeof(struct io_uring_sqe);
> > > > >   	BUILD_BUG_ON(uring_cmd_pdu_size(0) != 16);
> > > > >   	BUILD_BUG_ON(uring_cmd_pdu_size(1) != 80);
> > > > > -	cmd_size = uring_cmd_pdu_size(req->ctx->flags & IORING_SETUP_SQE128);
> > > > > +	if (req->ctx->flags & IORING_SETUP_SQE128)
> > > > > +		size <<= 1;
> > > > > -	memcpy(req->async_data, ioucmd->cmd, cmd_size);
> > > > > +	memcpy(req->async_data, ioucmd->sqe, size);
> > > > 
> > > > The copy will make some fields of sqe become READ TWICE, and driver may see
> > > > different sqe field value compared with the one observed in io_init_req().
> > > 
> > > This copy only happens if the operation goes to the async path
> > > (calling io_uring_cmd_prep_async()).  This only happens if
> > > f_op->uring_cmd() returns -EAGAIN.
> > > 
> > >            ret = file->f_op->uring_cmd(ioucmd, issue_flags);
> > >            if (ret == -EAGAIN) {
> > >                    if (!req_has_async_data(req)) {
> > >                            if (io_alloc_async_data(req))
> > >                                    return -ENOMEM;
> > >                            io_uring_cmd_prep_async(req);
> > >                    }
> > >                    return -EAGAIN;
> > >            }
> > > 
> > > Are you saying that after this copy, the operation is still reading from
> > > sqe instead of req->async_data?
> > 
> > I meant that the 2nd read is on the sqe copy(req->aync_data), but same
> > fields can become different between the two READs(first is done on original
> > SQE during io_init_req(), and second is done on sqe copy in driver).
> > 
> > Will this kind of inconsistency cause trouble for driver? Cause READ
> > TWICE becomes possible with this patch.
> 
> Right it might happen, and I was keeping that in mind, but it's not
> specific to this patch. It won't reload core io_uring bits, and all

It depends on whether the driver reloads core bits or not; either way,
the patch exports all fields and opens the window.

> fields cmds use already have this problem.

The driver is supposed to load the cmd fields just once too, right?

> 
> Unless there is a better option, the direction we'll be moving in is
> adding a preparation step that should read and stash parts of SQE
> it cares about, which should also make full SQE copy not
> needed / optional.

Sounds good.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC] io_uring: Pass whole sqe to commands
  2023-04-14 13:59   ` Ming Lei
@ 2023-04-14 14:56   ` Pavel Begunkov
  -1 siblings, 0 replies; 108+ messages in thread
From: Pavel Begunkov @ 2023-04-14 14:56 UTC (permalink / raw)
  To: dccp

On 4/14/23 14:59, Ming Lei wrote:
[...]
>>> Will this kind of inconsistency cause trouble for driver? Cause READ
>>> TWICE becomes possible with this patch.
>>
>> Right it might happen, and I was keeping that in mind, but it's not
>> specific to this patch. It won't reload core io_uring bits, and all
> 
> It depends if driver reloads core bits or not, anyway the patch exports
> all fields and opens the window.

If a driver tries to reload core bits or, even worse, modify the io_uring
request without proper helpers, it should be rooted out and thrown
into a bin. In any case, cmds are expected to exercise caution
while working with SQEs, as they may change. I'd even argue that
hiding it as void *cmd makes it much less obvious.

>> fields cmds use already have this problem.
> 
> driver is supposed to load cmds field just once too, right?

Ideally they shouldn't reload, but it's fine to reload as long as
the cmd can handle it. And it should always be READ_ONCE()
and so on.

>> Unless there is a better option, the direction we'll be moving in is
>> adding a preparation step that should read and stash parts of SQE
>> it cares about, which should also make full SQE copy not
>> needed / optional.
> 
> Sounds good.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC] io_uring: Pass whole sqe to commands
@ 2023-04-14 14:56   ` Pavel Begunkov
  0 siblings, 0 replies; 108+ messages in thread
From: Pavel Begunkov @ 2023-04-14 14:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: Breno Leitao, axboe, davem, dccp, dsahern, edumazet, io-uring,
	kuba, leit, linux-kernel, marcelo.leitner, matthieu.baerts, mptcp,
	netdev, pabeni, willemdebruijn.kernel

On 4/14/23 14:59, Ming Lei wrote:
[...]
>>> Will this kind of inconsistency cause trouble for driver? Cause READ
>>> TWICE becomes possible with this patch.
>>
>> Right it might happen, and I was keeping that in mind, but it's not
>> specific to this patch. It won't reload core io_uring bits, and all
> 
> It depends if driver reloads core bits or not, anyway the patch exports
> all fields and opens the window.

If a driver tries to reload core bits or, even worse, modify the io_uring
request without proper helpers, it should be rooted out and thrown
into a bin. In any case, cmds are expected to exercise caution
while working with SQEs, as they may change. I'd even argue that
hiding it as void *cmd makes it much less obvious.

>> fields cmds use already have this problem.
> 
> driver is supposed to load cmds field just once too, right?

Ideally they shouldn't reload, but it's fine to reload as long as
the cmd can handle it. And it should always be READ_ONCE()
and so on.

>> Unless there is a better option, the direction we'll be moving in is
>> adding a preparation step that should read and stash parts of SQE
>> it cares about, which should also make full SQE copy not
>> needed / optional.
> 
> Sounds good.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC] io_uring: Pass whole sqe to commands
  2023-04-14 14:56   ` Pavel Begunkov
@ 2023-04-16  9:51   ` Ming Lei
  -1 siblings, 0 replies; 108+ messages in thread
From: Ming Lei @ 2023-04-16  9:51 UTC (permalink / raw)
  To: dccp

On Fri, Apr 14, 2023 at 03:56:47PM +0100, Pavel Begunkov wrote:
> On 4/14/23 14:59, Ming Lei wrote:
> [...]
> > > > Will this kind of inconsistency cause trouble for driver? Cause READ
> > > > TWICE becomes possible with this patch.
> > > 
> > > Right it might happen, and I was keeping that in mind, but it's not
> > > specific to this patch. It won't reload core io_uring bits, and all
> > 
> > It depends if driver reloads core bits or not, anyway the patch exports
> > all fields and opens the window.
> 
> If a driver tries to reload core bits and even worse modify io_uring
> request without proper helpers, it should be rooted out and thrown
> into a bin. In any case cmds are expected to exercise cautiousness
> while working with SQEs as they may change. I'd even argue that
> hiding it as void *cmd makes it much less obvious.

Fair enough. If it is well documented, people will know about these
problems, and any change in this area can get careful review.


Thanks, 
Ming

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH RFC] io_uring: Pass whole sqe to commands
@ 2023-04-16  9:51   ` Ming Lei
  0 siblings, 0 replies; 108+ messages in thread
From: Ming Lei @ 2023-04-16  9:51 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Breno Leitao, axboe, davem, dccp, dsahern, edumazet, io-uring,
	kuba, leit, linux-kernel, marcelo.leitner, matthieu.baerts, mptcp,
	netdev, pabeni, willemdebruijn.kernel, ming.lei

On Fri, Apr 14, 2023 at 03:56:47PM +0100, Pavel Begunkov wrote:
> On 4/14/23 14:59, Ming Lei wrote:
> [...]
> > > > Will this kind of inconsistency cause trouble for driver? Cause READ
> > > > TWICE becomes possible with this patch.
> > > 
> > > Right it might happen, and I was keeping that in mind, but it's not
> > > specific to this patch. It won't reload core io_uring bits, and all
> > 
> > It depends if driver reloads core bits or not, anyway the patch exports
> > all fields and opens the window.
> 
> If a driver tries to reload core bits and even worse modify io_uring
> request without proper helpers, it should be rooted out and thrown
> into a bin. In any case cmds are expected to exercise cautiousness
> while working with SQEs as they may change. I'd even argue that
> hiding it as void *cmd makes it much less obvious.

Fair enough. If it is well documented, people will know about these
problems, and any change in this area can get careful review.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-13 14:24   ` Willem de Bruijn
@ 2023-04-18 13:23   ` Breno Leitao
  -1 siblings, 0 replies; 108+ messages in thread
From: Breno Leitao @ 2023-04-18 13:23 UTC (permalink / raw)
  To: dccp

On Thu, Apr 13, 2023 at 10:24:31AM -0400, Willem de Bruijn wrote:
> > How to handle these contradictory behaviour ahead of time (at callee
> > time, where the buffers will be prepared)?
> 
> Ah you found a counter-example to the simple pattern of put_user.
> 
> The answer perhaps depends on how many such counter-examples you
> encounter in the list you gave. If this is the only one, exceptions
> in the wrapper are reasonable. Not if there are many.


Hello Willem,

I spent some time on this, and the best way for me to figure out
how much work it is was to implement a PoC. You can find a basic PoC
at the link below. It is not 100% complete (4 simple ioctls still need
to be converted), but it deals with the most complicated cases. The
missing parts are straightforward if we are OK with this approach.

	https://github.com/leitao/linux/commits/ioctl_refactor

Details
=======
1)  Change the ioctl callback to use kernel memory arguments. This
changes a lot of files but most of them are trivial. This is the new
ioctl callback:

struct proto {

        int                     (*ioctl)(struct sock *sk, int cmd,
-                                        unsigned long arg);
+                                        int *karg);

	You can see the full changeset in the following commit (which is
	the last in the tree above)
	https://github.com/leitao/linux/commit/ad78da14601b078c4b6a9f63a86032467ab59bf7

2) Create a wrapper (sock_skprot_ioctl()) that should be called instead
of sk->sk_prot->ioctl(). For every exception, it calls a function
specific to that exception (basically ipmr_ioctl and ip6mr_ioctl; see
more in 3).

	This is the commit https://github.com/leitao/linux/commit/511592e549c39ef0de19efa2eb4382cac5786227

3) There are two exceptions: ip{6}mr_ioctl() and pn_ioctl().
ip{6}mr is the hardest one, and I implemented the exception flow for it.

	You can find the ipmr changes here:
	https://github.com/leitao/linux/commit/659a76dc0547ab2170023f31e20115520ebe33d9

Is this what you had in mind?

Thank you!

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
@ 2023-04-18 13:23   ` Breno Leitao
  0 siblings, 0 replies; 108+ messages in thread
From: Breno Leitao @ 2023-04-18 13:23 UTC (permalink / raw)
  To: Willem de Bruijn, kuba
  Cc: Jens Axboe, David Ahern, Willem de Bruijn, io-uring, netdev, kuba,
	asml.silence, leit, edumazet, pabeni, davem, dccp, mptcp,
	linux-kernel, matthieu.baerts, marcelo.leitner

On Thu, Apr 13, 2023 at 10:24:31AM -0400, Willem de Bruijn wrote:
> > How to handle these contradictory behaviour ahead of time (at callee
> > time, where the buffers will be prepared)?
> 
> Ah you found a counter-example to the simple pattern of put_user.
> 
> The answer perhaps depends on how many such counter-examples you
> encounter in the list you gave. If this is the only one, exceptions
> in the wrapper are reasonable. Not if there are many.


Hello Willem,

I spent some time on this, and the best way for me to figure out
how much work it is was to implement a PoC. You can find a basic PoC
at the link below. It is not 100% complete (4 simple ioctls still need
to be converted), but it deals with the most complicated cases. The
missing parts are straightforward if we are OK with this approach.

	https://github.com/leitao/linux/commits/ioctl_refactor

Details
=======

1)  Change the ioctl callback to use kernel memory arguments. This
changes a lot of files but most of them are trivial. This is the new
ioctl callback:

struct proto {

        int                     (*ioctl)(struct sock *sk, int cmd,
-                                        unsigned long arg);
+                                        int *karg);

	You can see the full changeset in the following commit (which is
	the last in the tree above)
	https://github.com/leitao/linux/commit/ad78da14601b078c4b6a9f63a86032467ab59bf7

2) Create a wrapper (sock_skprot_ioctl()) that should be called instead
of sk->sk_prot->ioctl(). For every exception, it calls a specific function
for the exception (basically ipmr_ioctl and ip6mr_ioctl) (see more on 3)

	This is the commit https://github.com/leitao/linux/commit/511592e549c39ef0de19efa2eb4382cac5786227

3) There are two exceptions, they are ip{6}mr_ioctl() and pn_ioctl().
ip{6}mr is the hardest one, and I implemented the exception flow for it.

	You could find ipmr changes here:
	https://github.com/leitao/linux/commit/659a76dc0547ab2170023f31e20115520ebe33d9
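
To make steps 1 and 2 concrete, here is a minimal userspace sketch of the
default "getter that returns an int" path. It is not the code from the
commits above: every demo_* name and DEMO_CMD_OUTQ are hypothetical,
struct sock and struct proto are cut-down stand-ins, and memcpy() stands
in for put_user().

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

struct sock;

/* Cut-down stand-in for the kernel's struct proto (new-style callback). */
struct proto {
	int (*ioctl)(struct sock *sk, int cmd, int *karg);
};

/* Cut-down stand-in for struct sock; queued_bytes is hypothetical state. */
struct sock {
	const struct proto *sk_prot;
	int queued_bytes;
};

enum { DEMO_CMD_OUTQ = 1 };	/* hypothetical command number */

/* Stand-in for put_user(): store one int at the "user" address. */
static int demo_put_user(int val, int *uptr)
{
	if (!uptr)
		return -EFAULT;
	memcpy(uptr, &val, sizeof(val));
	return 0;
}

/* Protocol handler in the new style: it only writes kernel memory. */
static int demo_proto_ioctl(struct sock *sk, int cmd, int *karg)
{
	switch (cmd) {
	case DEMO_CMD_OUTQ:
		*karg = sk->queued_bytes;
		return 0;
	default:
		return -ENOTTY;
	}
}

static const struct proto demo_proto = { .ioctl = demo_proto_ioctl };

/* Wrapper: call the protocol with a kernel int, then copy it out. */
static int demo_sock_skprot_ioctl(struct sock *sk, int cmd, int *uarg)
{
	int karg = 0;
	int ret = sk->sk_prot->ioctl(sk, cmd, &karg);

	if (ret)
		return ret;
	return demo_put_user(karg, uarg);
}
```

The point of the split is that demo_proto_ioctl() never sees a user
pointer, so an in-kernel caller such as io_uring could call it directly
and skip the copy-out step.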

Is this what you had in mind?

Thank you!

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-18 13:23   ` Breno Leitao
@ 2023-04-18 19:41   ` Willem de Bruijn
  -1 siblings, 0 replies; 108+ messages in thread
From: Willem de Bruijn @ 2023-04-18 19:41 UTC (permalink / raw)
  To: dccp

Breno Leitao wrote:
> On Thu, Apr 13, 2023 at 10:24:31AM -0400, Willem de Bruijn wrote:
> > > How to handle these contradictory behaviour ahead of time (at callee
> > > time, where the buffers will be prepared)?
> > 
> > Ah you found a counter-example to the simple pattern of put_user.
> > 
> > The answer perhaps depends on how many such counter-examples you
> > encounter in the list you gave. If this is the only one, exceptions
> > in the wrapper are reasonable. Not if there are many.
> 
> 
> Hello Williem,
> 
> I spend sometime dealing with it, and the best way for me to figure out
> how much work this is, was implementing a PoC. You can find a basic PoC
> in the link below. It is not 100% complete (still need to convert 4
> simple ioctls), but, it deals with the most complicated cases. The
> missing parts are straighforward if we are OK with this approach.
> 
> 	https://github.com/leitao/linux/commits/ioctl_refactor
> 
> Details
> =======
> 
> 1)  Change the ioctl callback to use kernel memory arguments. This
> changes a lot of files but most of them are trivial. This is the new
> ioctl callback:
> 
> struct proto {
> 
>         int                     (*ioctl)(struct sock *sk, int cmd,
> -                                        unsigned long arg);
> +                                        int *karg);
> 
> 	You can see the full changeset in the following commit (which is
> 	the last in the tree above)
> 	https://github.com/leitao/linux/commit/ad78da14601b078c4b6a9f63a86032467ab59bf7
> 
> 2) Create a wrapper (sock_skprot_ioctl()) that should be called instead
> of sk->sk_prot->ioctl(). For every exception, calls a specific function
> for the exception (basically ipmr_ioctl and ipmr_ioctl) (see more on 3)
> 
> 	This is the commit https://github.com/leitao/linux/commit/511592e549c39ef0de19efa2eb4382cac5786227
> 
> 3) There are two exceptions, they are ip{6}mr_ioctl() and pn_ioctl().
> ip{6}mr is the hardest one, and I implemented the exception flow for it.
> 
> 	You could find ipmr changes here:
> 	https://github.com/leitao/linux/commit/659a76dc0547ab2170023f31e20115520ebe33d9
> 
> Is this what you had in mind?
> 
> Thank you!

Thanks for the series, Breno. Yes, this looks very much like what I hoped for.

The series shows two cases of ioctls: getters that return an int, and
combined getter/setters that take a struct of a certain size and
return the exact same.

I would deduplicate the four ipmr/ip6mr cases that constitute the second
type, by having a single helper for this type. sock_skprot_ioctl_struct,
which takes an argument for the struct size to copy in/out.
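
As a rough illustration of such a helper, here is a userspace sketch under
stated assumptions: demo_ioctl_struct(), demo_struct_ioctl(), struct
demo_req and DEMO_MAX_ARG are all hypothetical names, malloc() stands in
for a kernel allocation, and memcpy() stands in for copy_from_user() and
copy_to_user().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_ARG 64	/* hypothetical upper bound on the struct size */

/* Hypothetical request: an index goes in, a counter comes back. */
struct demo_req {
	int index;
	unsigned long packets;
};

/* Protocol handler that works purely on the kernel copy of the argument. */
static int demo_struct_ioctl(int cmd, void *karg)
{
	struct demo_req *req = karg;

	(void)cmd;
	if (req->index < 0)
		return -EINVAL;
	req->packets = 100UL + (unsigned long)req->index;
	return 0;
}

/* Copy `size` bytes in, run the handler, copy `size` bytes back out. */
static int demo_ioctl_struct(int cmd, void *uarg, size_t size,
			     int (*handler)(int cmd, void *karg))
{
	void *karg;
	int ret;

	if (!uarg)
		return -EFAULT;
	if (size == 0 || size > DEMO_MAX_ARG)
		return -EINVAL;

	karg = malloc(size);
	if (!karg)
		return -ENOMEM;

	memcpy(karg, uarg, size);		/* copy in */
	ret = handler(cmd, karg);
	if (!ret)
		memcpy(uarg, karg, size);	/* copy out only on success */
	free(karg);
	return ret;
}
```

Because the size is a parameter, one helper can serve every ioctl of this
type; the caller only has to pass the sizeof() of the matching struct.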

Did this series cover all proto ioctls, or is this still a subset just
for demonstration purposes -- and might there still be other types
lurking elsewhere?

If this is all, this looks like a reasonable amount of code churn to me.

Three small points

* please keep the __user annotation. Use make C=2 when unsure to warn
  about mismatched annotation
* minor: special case the ipmr (type 2) ioctls in sock_skprot_ioctl
  and treat the "return int" (type 1) ioctls as the default case.
* introduce code in a patch together with its use-case, so no separate
  patches for sock_skprot_ioctl and sock_skprot_ioctl_ipmr. Either one
  patch, or two, for each type of conversion.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
@ 2023-04-18 19:41   ` Willem de Bruijn
  0 siblings, 0 replies; 108+ messages in thread
From: Willem de Bruijn @ 2023-04-18 19:41 UTC (permalink / raw)
  To: Breno Leitao, Willem de Bruijn, kuba
  Cc: Jens Axboe, David Ahern, Willem de Bruijn, io-uring, netdev, kuba,
	asml.silence, leit, edumazet, pabeni, davem, dccp, mptcp,
	linux-kernel, matthieu.baerts, marcelo.leitner

Breno Leitao wrote:
> On Thu, Apr 13, 2023 at 10:24:31AM -0400, Willem de Bruijn wrote:
> > > How to handle these contradictory behaviour ahead of time (at callee
> > > time, where the buffers will be prepared)?
> > 
> > Ah you found a counter-example to the simple pattern of put_user.
> > 
> > The answer perhaps depends on how many such counter-examples you
> > encounter in the list you gave. If this is the only one, exceptions
> > in the wrapper are reasonable. Not if there are many.
> 
> 
> Hello Williem,
> 
> I spend sometime dealing with it, and the best way for me to figure out
> how much work this is, was implementing a PoC. You can find a basic PoC
> in the link below. It is not 100% complete (still need to convert 4
> simple ioctls), but, it deals with the most complicated cases. The
> missing parts are straighforward if we are OK with this approach.
> 
> 	https://github.com/leitao/linux/commits/ioctl_refactor
> 
> Details
> =======
> 
> 1)  Change the ioctl callback to use kernel memory arguments. This
> changes a lot of files but most of them are trivial. This is the new
> ioctl callback:
> 
> struct proto {
> 
>         int                     (*ioctl)(struct sock *sk, int cmd,
> -                                        unsigned long arg);
> +                                        int *karg);
> 
> 	You can see the full changeset in the following commit (which is
> 	the last in the tree above)
> 	https://github.com/leitao/linux/commit/ad78da14601b078c4b6a9f63a86032467ab59bf7
> 
> 2) Create a wrapper (sock_skprot_ioctl()) that should be called instead
> of sk->sk_prot->ioctl(). For every exception, calls a specific function
> for the exception (basically ipmr_ioctl and ipmr_ioctl) (see more on 3)
> 
> 	This is the commit https://github.com/leitao/linux/commit/511592e549c39ef0de19efa2eb4382cac5786227
> 
> 3) There are two exceptions, they are ip{6}mr_ioctl() and pn_ioctl().
> ip{6}mr is the hardest one, and I implemented the exception flow for it.
> 
> 	You could find ipmr changes here:
> 	https://github.com/leitao/linux/commit/659a76dc0547ab2170023f31e20115520ebe33d9
> 
> Is this what you had in mind?
> 
> Thank you!

Thanks for the series, Breno. Yes, this looks very much like what I hoped for.

The series shows two cases of ioctls: getters that return an int, and
combined getter/setters that take a struct of a certain size and
return the exact same.

I would deduplicate the four ipmr/ip6mr cases that constitute the second
type, by having a single helper for this type. sock_skprot_ioctl_struct,
which takes an argument for the struct size to copy in/out.

Did this series cover all proto ioctls, or is this still a subset just
for demonstration purposes -- and might there still be other types
lurking elsewhere?

If this is all, this looks like a reasonable amount of code churn to me.

Three small points

* please keep the __user annotation. Use make C=2 when unsure to warn
  about mismatched annotation
* minor: special case the ipmr (type 2) ioctls in sock_skprot_ioctl
  and treat the "return int" (type 1) ioctls as the default case.
* introduce code in a patch together with its use-case, so no separate
  patches for sock_skprot_ioctl and sock_skprot_ioctl_ipmr. Either one
  patch, or two, for each type of conversion.


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-18 19:41   ` Willem de Bruijn
@ 2023-04-20 14:43   ` Breno Leitao
  -1 siblings, 0 replies; 108+ messages in thread
From: Breno Leitao @ 2023-04-20 14:43 UTC (permalink / raw)
  To: dccp

On Tue, Apr 18, 2023 at 03:41:24PM -0400, Willem de Bruijn wrote:
> Breno Leitao wrote:
> > On Thu, Apr 13, 2023 at 10:24:31AM -0400, Willem de Bruijn wrote:
> > > > How to handle these contradictory behaviour ahead of time (at callee
> > > > time, where the buffers will be prepared)?
> > > 
> > > Ah you found a counter-example to the simple pattern of put_user.
> > > 
> > > The answer perhaps depends on how many such counter-examples you
> > > encounter in the list you gave. If this is the only one, exceptions
> > > in the wrapper are reasonable. Not if there are many.
> > 
> > 
> > Hello Williem,
> > 
> > I spend sometime dealing with it, and the best way for me to figure out
> > how much work this is, was implementing a PoC. You can find a basic PoC
> > in the link below. It is not 100% complete (still need to convert 4
> > simple ioctls), but, it deals with the most complicated cases. The
> > missing parts are straighforward if we are OK with this approach.
> > 
> > 	https://github.com/leitao/linux/commits/ioctl_refactor
> > 
> > Details
> > =======
> > 
> > 1)  Change the ioctl callback to use kernel memory arguments. This
> > changes a lot of files but most of them are trivial. This is the new
> > ioctl callback:
> > 
> > struct proto {
> > 
> >         int                     (*ioctl)(struct sock *sk, int cmd,
> > -                                        unsigned long arg);
> > +                                        int *karg);
> > 
> > 	You can see the full changeset in the following commit (which is
> > 	the last in the tree above)
> > 	https://github.com/leitao/linux/commit/ad78da14601b078c4b6a9f63a86032467ab59bf7
> > 
> > 2) Create a wrapper (sock_skprot_ioctl()) that should be called instead
> > of sk->sk_prot->ioctl(). For every exception, calls a specific function
> > for the exception (basically ipmr_ioctl and ipmr_ioctl) (see more on 3)
> > 
> > 	This is the commit https://github.com/leitao/linux/commit/511592e549c39ef0de19efa2eb4382cac5786227
> > 
> > 3) There are two exceptions, they are ip{6}mr_ioctl() and pn_ioctl().
> > ip{6}mr is the hardest one, and I implemented the exception flow for it.
> > 
> > 	You could find ipmr changes here:
> > 	https://github.com/leitao/linux/commit/659a76dc0547ab2170023f31e20115520ebe33d9
> > 
> > Is this what you had in mind?
> > 
> > Thank you!
> 
> Thanks for the series, Breno. Yes, this looks very much what I hoped for.

Awesome. Thanks.

> The series shows two cases of ioctls: getters that return an int, and
> combined getter/setters that take a struct of a certain size and
> return the exact same.
>
> I would deduplicate the four ipmr/ip6mr cases that constitute the second
> type, by having a single helper for this type. sock_skprot_ioctl_struct,
> which takes an argument for the struct size to copy in/out.

OK, that is good advice. Thanks!

> Did this series cover all proto ioctls, or is this still a subset just
> for demonstration purposes -- and might there still be other types
> lurking elsewhere?

It does not cover all the cases. I would say it covers 80% of the cases,
and the hardest cases.  These are the missing cases, and what they do:

* pn_ioctl     (getter/setter that reads/returns an int)
* l2tp_ioctl   (getters that return an int)
* dgram_ioctl  (getters that return an int)
* sctp_ioctl   (getters that return an int)
* mptcp_ioctl  (getters that return an int)
* dccp_ioctl   (getters that return an int)
* dgram_ioctl  (getters that return an int)
* pep_ioctl    (getters that return an int)


Here is what I am using to get the full list:
 # ag  --no-filename -A 20 "struct proto \w* = {"  | grep .ioctl | cut -d "=" -f 2 | tr -d '\n'

 dccp_ioctl, dccp_ioctl, dgram_ioctl, tcp_ioctl, raw_ioctl, udp_ioctl,
 udp_ioctl, udp_ioctl, tcp_ioctl, l2tp_ioctl, rawv6_ioctl, l2tp_ioctl,
 mptcp_ioctl, pep_ioctl, pn_ioctl, rds_ioctl, sctp_ioctl, sctp_ioctl,
 sock_no_ioctl

> If this is all, this looks like a reasonable amount of code churn to me.

Should I proceed and create a final patch? I don't see a way to break up
the last patch, which changes the API, into smaller patches. I.e., the
last patch will be huge, right?

> Three small points
> 
> * please keep the __user annotation. Use make C=2 when unsure to warn
>   about mismatched annotation

ack!

> * minor: special case the ipmr (type 2) ioctls in sock_skprot_ioctl
>   and treat the "return int" (type 1) ioctls as the default case.

ack!

> * introduce code in a patch together with its use-case, so no separate
>   patches for sock_skprot_ioctl and sock_skprot_ioctl_ipmr. Either one
>   patch, or two, for each type of conversion.

I am not sure how to change the ABI (struct proto) without doing all the
protocol changes in the same patch. Otherwise compilation will be broken between
the patch that changes the "struct proto" and the patch that changes the
_ioctl for protocol X.  I mean, is it possible to break up changing
"struct proto" and the affected protocols?

Thank you for the review and suggestions!

PS: I will take some days off next week, and I am planning to send the
final patch when I come back.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
@ 2023-04-20 14:43   ` Breno Leitao
  0 siblings, 0 replies; 108+ messages in thread
From: Breno Leitao @ 2023-04-20 14:43 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: kuba, Jens Axboe, David Ahern, Willem de Bruijn, io-uring, netdev,
	asml.silence, leit, edumazet, pabeni, davem, dccp, mptcp,
	linux-kernel, matthieu.baerts, marcelo.leitner

On Tue, Apr 18, 2023 at 03:41:24PM -0400, Willem de Bruijn wrote:
> Breno Leitao wrote:
> > On Thu, Apr 13, 2023 at 10:24:31AM -0400, Willem de Bruijn wrote:
> > > > How to handle these contradictory behaviour ahead of time (at callee
> > > > time, where the buffers will be prepared)?
> > > 
> > > Ah you found a counter-example to the simple pattern of put_user.
> > > 
> > > The answer perhaps depends on how many such counter-examples you
> > > encounter in the list you gave. If this is the only one, exceptions
> > > in the wrapper are reasonable. Not if there are many.
> > 
> > 
> > Hello Williem,
> > 
> > I spend sometime dealing with it, and the best way for me to figure out
> > how much work this is, was implementing a PoC. You can find a basic PoC
> > in the link below. It is not 100% complete (still need to convert 4
> > simple ioctls), but, it deals with the most complicated cases. The
> > missing parts are straighforward if we are OK with this approach.
> > 
> > 	https://github.com/leitao/linux/commits/ioctl_refactor
> > 
> > Details
> > =======
> > 
> > 1)  Change the ioctl callback to use kernel memory arguments. This
> > changes a lot of files but most of them are trivial. This is the new
> > ioctl callback:
> > 
> > struct proto {
> > 
> >         int                     (*ioctl)(struct sock *sk, int cmd,
> > -                                        unsigned long arg);
> > +                                        int *karg);
> > 
> > 	You can see the full changeset in the following commit (which is
> > 	the last in the tree above)
> > 	https://github.com/leitao/linux/commit/ad78da14601b078c4b6a9f63a86032467ab59bf7
> > 
> > 2) Create a wrapper (sock_skprot_ioctl()) that should be called instead
> > of sk->sk_prot->ioctl(). For every exception, calls a specific function
> > for the exception (basically ipmr_ioctl and ipmr_ioctl) (see more on 3)
> > 
> > 	This is the commit https://github.com/leitao/linux/commit/511592e549c39ef0de19efa2eb4382cac5786227
> > 
> > 3) There are two exceptions, they are ip{6}mr_ioctl() and pn_ioctl().
> > ip{6}mr is the hardest one, and I implemented the exception flow for it.
> > 
> > 	You could find ipmr changes here:
> > 	https://github.com/leitao/linux/commit/659a76dc0547ab2170023f31e20115520ebe33d9
> > 
> > Is this what you had in mind?
> > 
> > Thank you!
> 
> Thanks for the series, Breno. Yes, this looks very much what I hoped for.

Awesome. Thanks.

> The series shows two cases of ioctls: getters that return an int, and
> combined getter/setters that take a struct of a certain size and
> return the exact same.
>
> I would deduplicate the four ipmr/ip6mr cases that constitute the second
> type, by having a single helper for this type. sock_skprot_ioctl_struct,
> which takes an argument for the struct size to copy in/out.

OK, that is good advice. Thanks!

> Did this series cover all proto ioctls, or is this still a subset just
> for demonstration purposes -- and might there still be other types
> lurking elsewhere?

It does not cover all the cases. I would say it covers 80% of the cases,
and the hardest cases.  These are the missing cases, and what they do:

* pn_ioctl     (getter/setter that reads/returns an int)
* l2tp_ioctl   (getters that return an int)
* dgram_ioctl  (getters that return an int)
* sctp_ioctl   (getters that return an int)
* mptcp_ioctl  (getters that return an int)
* dccp_ioctl   (getters that return an int)
* dgram_ioctl  (getters that return an int)
* pep_ioctl    (getters that return an int)


Here is what I am using to get the full list:
 # ag  --no-filename -A 20 "struct proto \w* = {"  | grep .ioctl | cut -d "=" -f 2 | tr -d '\n'

 dccp_ioctl, dccp_ioctl, dgram_ioctl, tcp_ioctl, raw_ioctl, udp_ioctl,
 udp_ioctl, udp_ioctl, tcp_ioctl, l2tp_ioctl, rawv6_ioctl, l2tp_ioctl,
 mptcp_ioctl, pep_ioctl, pn_ioctl, rds_ioctl, sctp_ioctl, sctp_ioctl,
 sock_no_ioctl

> If this is all, this looks like a reasonable amount of code churn to me.

Should I proceed and create a final patch? I don't see a way to break up
the last patch, which changes the API, into smaller patches. I.e., the
last patch will be huge, right?

> Three small points
> 
> * please keep the __user annotation. Use make C=2 when unsure to warn
>   about mismatched annotation

ack!

> * minor: special case the ipmr (type 2) ioctls in sock_skprot_ioctl
>   and treat the "return int" (type 1) ioctls as the default case.

ack!

> * introduce code in a patch together with its use-case, so no separate
>   patches for sock_skprot_ioctl and sock_skprot_ioctl_ipmr. Either one
>   patch, or two, for each type of conversion.

I am not sure how to change the ABI (struct proto) without doing all the
protocol changes in the same patch. Otherwise compilation will be broken between
the patch that changes the "struct proto" and the patch that changes the
_ioctl for protocol X.  I mean, is it possible to break up changing
"struct proto" and the affected protocols?

Thank you for the review and suggestions!

PS: I will take some days off next week, and I am planning to send the
final patch when I come back.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-04-20 14:43   ` Breno Leitao
@ 2023-04-20 16:48   ` Willem de Bruijn
  -1 siblings, 0 replies; 108+ messages in thread
From: Willem de Bruijn @ 2023-04-20 16:48 UTC (permalink / raw)
  To: dccp

Breno Leitao wrote:
> On Tue, Apr 18, 2023 at 03:41:24PM -0400, Willem de Bruijn wrote:
> > Breno Leitao wrote:
> > > On Thu, Apr 13, 2023 at 10:24:31AM -0400, Willem de Bruijn wrote:
> > > > > How to handle these contradictory behaviour ahead of time (at callee
> > > > > time, where the buffers will be prepared)?
> > > > 
> > > > Ah you found a counter-example to the simple pattern of put_user.
> > > > 
> > > > The answer perhaps depends on how many such counter-examples you
> > > > encounter in the list you gave. If this is the only one, exceptions
> > > > in the wrapper are reasonable. Not if there are many.
> > > 
> > > 
> > > Hello Williem,
> > > 
> > > I spend sometime dealing with it, and the best way for me to figure out
> > > how much work this is, was implementing a PoC. You can find a basic PoC
> > > in the link below. It is not 100% complete (still need to convert 4
> > > simple ioctls), but, it deals with the most complicated cases. The
> > > missing parts are straighforward if we are OK with this approach.
> > > 
> > > 	https://github.com/leitao/linux/commits/ioctl_refactor
> > > 
> > > Details
> > > =======
> > > 
> > > 1)  Change the ioctl callback to use kernel memory arguments. This
> > > changes a lot of files but most of them are trivial. This is the new
> > > ioctl callback:
> > > 
> > > struct proto {
> > > 
> > >         int                     (*ioctl)(struct sock *sk, int cmd,
> > > -                                        unsigned long arg);
> > > +                                        int *karg);
> > > 
> > > 	You can see the full changeset in the following commit (which is
> > > 	the last in the tree above)
> > > 	https://github.com/leitao/linux/commit/ad78da14601b078c4b6a9f63a86032467ab59bf7
> > > 
> > > 2) Create a wrapper (sock_skprot_ioctl()) that should be called instead
> > > of sk->sk_prot->ioctl(). For every exception, calls a specific function
> > > for the exception (basically ipmr_ioctl and ipmr_ioctl) (see more on 3)
> > > 
> > > 	This is the commit https://github.com/leitao/linux/commit/511592e549c39ef0de19efa2eb4382cac5786227
> > > 
> > > 3) There are two exceptions, they are ip{6}mr_ioctl() and pn_ioctl().
> > > ip{6}mr is the hardest one, and I implemented the exception flow for it.
> > > 
> > > 	You could find ipmr changes here:
> > > 	https://github.com/leitao/linux/commit/659a76dc0547ab2170023f31e20115520ebe33d9
> > > 
> > > Is this what you had in mind?
> > > 
> > > Thank you!
> > 
> > Thanks for the series, Breno. Yes, this looks very much what I hoped for.
> 
> Awesome. Thanks.
> 
> > The series shows two cases of ioctls: getters that return an int, and
> > combined getter/setters that take a struct of a certain size and
> > return the exact same.
> >
> > I would deduplicate the four ipmr/ip6mr cases that constitute the second
> > type, by having a single helper for this type. sock_skprot_ioctl_struct,
> > which takes an argument for the struct size to copy in/out.
> 
> Ok, that is a good advice. Thanks!
> 
> > Did this series cover all proto ioctls, or is this still a subset just
> > for demonstration purposes -- and might there still be other types
> > lurking elsewhere?
> 
> It does not cover all the cases. I would say it cover 80% of the cases,
> and the hardest cases.  These are the missing cases, and what they do:
> 
> * pn_ioctl     (getters/setter that reads/return an int)
> * l2tp_ioctl   (getters that return an int)
> * dgram_ioctl  (getters that return an int)
> * sctp_ioctl   (getters that return an int)
> * mptcp_ioctl  (getters that return an int)
> * dccp_ioctl   (getters that return an int)
> * dgram_ioctl  (getters that return an int)
> * pep_ioctl    (getters that return an int)

Thanks for the thorough review.

So we have io_struct, io_int and o_int variants only. And the io_int
can use the proposed io_struct helper that takes an explicit length
to copy in and out.
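
One way to picture the three variants is as a small table of copy lengths.
This is only a sketch: enum demo_arg_kind, struct demo_copy_plan and
demo_plan() are hypothetical names, not something from the series.

```c
#include <assert.h>
#include <stddef.h>

/* The three argument shapes seen across the proto ioctls. */
enum demo_arg_kind {
	DEMO_O_INT,	/* getter: nothing copied in, one int copied out */
	DEMO_IO_INT,	/* one int copied in and the same int copied out */
	DEMO_IO_STRUCT,	/* a struct copied in and copied back out */
};

/* How many bytes to copy before and after calling the protocol. */
struct demo_copy_plan {
	size_t in_len;
	size_t out_len;
};

/* struct_size is only consulted for DEMO_IO_STRUCT. */
static struct demo_copy_plan demo_plan(enum demo_arg_kind kind,
				       size_t struct_size)
{
	struct demo_copy_plan plan = { 0, 0 };

	switch (kind) {
	case DEMO_O_INT:
		plan.out_len = sizeof(int);
		break;
	case DEMO_IO_INT:
		/* Same shape as the struct case, with an int-sized length. */
		plan.in_len = sizeof(int);
		plan.out_len = sizeof(int);
		break;
	case DEMO_IO_STRUCT:
		plan.in_len = struct_size;
		plan.out_len = struct_size;
		break;
	}
	return plan;
}
```

Seen this way, io_int is just io_struct with a length of sizeof(int), so
only o_int needs its own (default) path.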

 
> Here is what I am using to get the full list:
>  # ag  --no-filename -A 20 "struct proto \w* = {"  | grep .ioctl | cut -d "=" -f 2 | tr -d '\n'
> 
>  dccp_ioctl, dccp_ioctl, dgram_ioctl, tcp_ioctl, raw_ioctl, udp_ioctl,
>  udp_ioctl, udp_ioctl, tcp_ioctl, l2tp_ioctl, rawv6_ioctl, l2tp_ioctl,
>  mptcp_ioctl, pep_ioctl, pn_ioctl, rds_ioctl, sctp_ioctl, sctp_ioctl,
>  sock_no_ioctl
> 
> > If this is all, this looks like a reasonable amount of code churn to me.
> 
> Should I proceed and create a final patch? I don't see a way to break up
> the last patch, which changes the API , in smaller patches. I.e., the
> last patch will be huge, right?

Good point. So be it, then.
 
> > Three small points
> > 
> > * please keep the __user annotation. Use make C=2 when unsure to warn
> >   about mismatched annotation
> 
> ack!
> 
> > * minor: special case the ipmr (type 2) ioctls in sock_skprot_ioctl
> >   and treat the "return int" (type 1) ioctls as the default case.
> 
> ack!
> 
> > * introduce code in a patch together with its use-case, so no separate
> >   patches for sock_skprot_ioctl and sock_skprot_ioctl_ipmr. Either one
> >   patch, or two, for each type of conversion.
> 
> I am not sure how to change the ABI (struct proto) without doing all the
> protocol changes in the same patch. Otherwise compilation will be broken between
> the patch that changes the "struct proto" and the patch that changes the
> _ioctl for protocol X.  I mean, is it possible to break up changing
> "struct proto" and the affected protocols?
> 
> Thank you for the review and suggestions!
> 
> PS: I will take some days off next week, and I am planning to send the
> final patch when I come back.


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
@ 2023-04-20 16:48   ` Willem de Bruijn
  0 siblings, 0 replies; 108+ messages in thread
From: Willem de Bruijn @ 2023-04-20 16:48 UTC (permalink / raw)
  To: Breno Leitao, Willem de Bruijn
  Cc: kuba, Jens Axboe, David Ahern, Willem de Bruijn, io-uring, netdev,
	asml.silence, leit, edumazet, pabeni, davem, dccp, mptcp,
	linux-kernel, matthieu.baerts, marcelo.leitner

Breno Leitao wrote:
> On Tue, Apr 18, 2023 at 03:41:24PM -0400, Willem de Bruijn wrote:
> > Breno Leitao wrote:
> > > On Thu, Apr 13, 2023 at 10:24:31AM -0400, Willem de Bruijn wrote:
> > > > > How to handle these contradictory behaviour ahead of time (at callee
> > > > > time, where the buffers will be prepared)?
> > > > 
> > > > Ah you found a counter-example to the simple pattern of put_user.
> > > > 
> > > > The answer perhaps depends on how many such counter-examples you
> > > > encounter in the list you gave. If this is the only one, exceptions
> > > > in the wrapper are reasonable. Not if there are many.
> > > 
> > > 
> > > Hello Williem,
> > > 
> > > I spend sometime dealing with it, and the best way for me to figure out
> > > how much work this is, was implementing a PoC. You can find a basic PoC
> > > in the link below. It is not 100% complete (still need to convert 4
> > > simple ioctls), but, it deals with the most complicated cases. The
> > > missing parts are straighforward if we are OK with this approach.
> > > 
> > > 	https://github.com/leitao/linux/commits/ioctl_refactor
> > > 
> > > Details
> > > =======
> > > 
> > > 1)  Change the ioctl callback to use kernel memory arguments. This
> > > changes a lot of files but most of them are trivial. This is the new
> > > ioctl callback:
> > > 
> > > struct proto {
> > > 
> > >         int                     (*ioctl)(struct sock *sk, int cmd,
> > > -                                        unsigned long arg);
> > > +                                        int *karg);
> > > 
> > > 	You can see the full changeset in the following commit (which is
> > > 	the last in the tree above)
> > > 	https://github.com/leitao/linux/commit/ad78da14601b078c4b6a9f63a86032467ab59bf7
> > > 
> > > 2) Create a wrapper (sock_skprot_ioctl()) that should be called instead
> > > of sk->sk_prot->ioctl(). For every exception, calls a specific function
> > > for the exception (basically ipmr_ioctl and ipmr_ioctl) (see more on 3)
> > > 
> > > 	This is the commit https://github.com/leitao/linux/commit/511592e549c39ef0de19efa2eb4382cac5786227
> > > 
> > > 3) There are two exceptions, they are ip{6}mr_ioctl() and pn_ioctl().
> > > ip{6}mr is the hardest one, and I implemented the exception flow for it.
> > > 
> > > 	You could find ipmr changes here:
> > > 	https://github.com/leitao/linux/commit/659a76dc0547ab2170023f31e20115520ebe33d9
> > > 
> > > Is this what you had in mind?
> > > 
> > > Thank you!
> > 
> > Thanks for the series, Breno. Yes, this looks very much what I hoped for.
> 
> Awesome. Thanks.
> 
> > The series shows two cases of ioctls: getters that return an int, and
> > combined getter/setters that take a struct of a certain size and
> > return the exact same.
> >
> > I would deduplicate the four ipmr/ip6mr cases that constitute the second
> > type, by having a single helper for this type. sock_skprot_ioctl_struct,
> > which takes an argument for the struct size to copy in/out.
> 
> Ok, that is a good advice. Thanks!
> 
> > Did this series cover all proto ioctls, or is this still a subset just
> > for demonstration purposes -- and might there still be other types
> > lurking elsewhere?
> 
> It does not cover all the cases. I would say it cover 80% of the cases,
> and the hardest cases.  These are the missing cases, and what they do:
> 
> * pn_ioctl     (getters/setter that reads/return an int)
> * l2tp_ioctl   (getters that return an int)
> * dgram_ioctl  (getters that return an int)
> * sctp_ioctl   (getters that return an int)
> * mptcp_ioctl  (getters that return an int)
> * dccp_ioctl   (getters that return an int)
> * dgram_ioctl  (getters that return an int)
> * pep_ioctl    (getters that return an int)

Thanks for the thorough review.

So we have io_struct, io_int and o_int variants only. And the io_int
can use the proposed io_struct helper that takes an explicit length
to copy in and out.

 
> Here is what I am using to get the full list:
>  # ag  --no-filename -A 20 "struct proto \w* = {"  | grep .ioctl | cut -d "=" -f 2 | tr -d '\n'
> 
>  dccp_ioctl, dccp_ioctl, dgram_ioctl, tcp_ioctl, raw_ioctl, udp_ioctl,
>  udp_ioctl, udp_ioctl, tcp_ioctl, l2tp_ioctl, rawv6_ioctl, l2tp_ioctl,
>  mptcp_ioctl, pep_ioctl, pn_ioctl, rds_ioctl, sctp_ioctl, sctp_ioctl,
>  sock_no_ioctl
> 
> > If this is all, this looks like a reasonable amount of code churn to me.
> 
> Should I proceed and create a final patch? I don't see a way to break up
> the last patch, which changes the API , in smaller patches. I.e., the
> last patch will be huge, right?

Good point. So be it, then.
 
> > Three small points
> > 
> > * please keep the __user annotation. Use make C=2 when unsure to warn
> >   about mismatched annotation
> 
> ack!
> 
> > * minor: special case the ipmr (type 2) ioctls in sock_skprot_ioctl
> >   and treat the "return int" (type 1) ioctls as the default case.
> 
> ack!
> 
> > * introduce code in a patch together with its use-case, so no separate
> >   patches for sock_skprot_ioctl and sock_skprot_ioctl_ipmr. Either one
> >   patch, or two, for each type of conversion.
> 
> I am not sure how to change the ABI (struct proto) without doing all the
> protocol changes in the same patch. Otherwise compilation will be broken between
> the patch that changes the "struct proto" and the patch that changes the
> _ioctl for protocol X.  I mean, is it possible to break up changing
> "struct proto" and the affected protocols?
> 
> Thank you for the review and suggestions!
> 
> PS: I will take some days off next week, and I am planning to send the
> final patch when I come back.



^ permalink raw reply	[flat|nested] 108+ messages in thread

* RE: [PATCH 0/5] add initial io_uring_cmd support for sockets
@ 2023-05-02  9:21   ` Adrien Delorme
  0 siblings, 0 replies; 108+ messages in thread
From: Adrien Delorme @ 2023-05-02  9:21 UTC (permalink / raw)
  To: dccp

From Adrien Delorme

> From: David Ahern 
> Sent: 12 April 2023 7:39 
> > Sent: 11 April 2023 16:28
> ....
> > Christoph's patch set a few years back that removed set_fs broke the
> > ability to do in-kernel ioctl and {s,g}setsockopt calls. I did not
> > follow that change; was it a deliberate intent to not allow these
> > in-kernel calls vs wanting to remove the set_fs? e.g., can we add a
> > kioctl variant for in-kernel use of the APIs?
>
> I think that was a side effect, and with no in-tree in-kernel
> users (apart from limited calls in bpf) it was deemed acceptable.
> (It is a PITA for any code trying to use SCTP in kernel.)
>
> One problem is that not all sockopt calls pass the correct length.
> And some of them can have very long buffers.
> Not to mention the ones that are read-modify-write.
>
> A plausible solution is to pass a 'fat pointer' that contains
> some, or all, of:
>       - A userspace buffer pointer.
>       - A kernel buffer pointer.
>       - The length supplied by the user.
>       - The length of the kernel buffer.
>       - The number of bytes to copy on completion.
> For simple user requests the syscall entry/exit code
> would copy the data to a short on-stack buffer.
> Kernel users just pass the kernel address.
> Odd requests can just use the user pointer.
>
> Probably needs accessors that add in an offset.
>
> It might also be that some of the problematic sockopt
> were in decnet - now removed.
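
For reference, the 'fat pointer' idea quoted above could look roughly like the following userspace C sketch. The struct name sockopt_buf, its fields and the two accessors are hypothetical and do not come from any posted patch; the user pointer is left opaque here and only the kernel-side buffer is accessed, with the offset-taking accessors doing the bounds checks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical "fat pointer" for sockopt buffers; all names are illustrative. */
struct sockopt_buf {
	void *user;		/* userspace buffer (opaque in this sketch) */
	void *kern;		/* kernel-side buffer, or NULL */
	size_t user_len;	/* length supplied by the user */
	size_t kern_len;	/* length of the kernel buffer */
	size_t copy_out;	/* number of bytes to copy on completion */
};

/*
 * Read len bytes at offset off from the kernel buffer.
 * Returns 0 on success, -1 if the range is out of bounds.
 */
static inline int sockopt_buf_read(const struct sockopt_buf *b, size_t off,
				   void *dst, size_t len)
{
	if (!b->kern || off > b->kern_len || len > b->kern_len - off)
		return -1;
	memcpy(dst, (const char *)b->kern + off, len);
	return 0;
}

/*
 * Write len bytes at offset off into the kernel buffer and record how
 * many bytes would have to be copied back on completion.
 */
static inline int sockopt_buf_write(struct sockopt_buf *b, size_t off,
				    const void *src, size_t len)
{
	if (!b->kern || off > b->kern_len || len > b->kern_len - off)
		return -1;
	memcpy((char *)b->kern + off, src, len);
	if (off + len > b->copy_out)
		b->copy_out = off + len;
	return 0;
}
```

A syscall entry path would fill kern from a short on-stack buffer, while an in-kernel caller would set kern directly and leave user NULL; that split is an assumption of this sketch, not something the thread settled on.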

Hello everyone,

I'm currently working on an implementation of {get,set}sockopt.
Since this thread is already discussing it, I hope I am replying in the right place.

My implementation is rather simple: a struct passes the necessary information through sqe->cmd.

Here is my implementation, based on kernel version 6.3:

Signed-off-by: Adrien Delorme <delorme.ade@outlook.com>

diff -uprN a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
--- a/include/uapi/linux/io_uring.h     2023-04-23 15:02:52.000000000 -0400
+++ b/include/uapi/linux/io_uring.h     2023-04-24 07:55:21.406981696 -0400
@@ -235,6 +235,25 @@ enum io_uring_op {
  */
#define IORING_URING_CMD_FIXED (1U << 0)

+/* struct io_uring_cmd->cmd_op flags for socket operations */
+#define IO_URING_CMD_OP_GETSOCKOPT 0x0
+#define IO_URING_CMD_OP_SETSOCKOPT 0x1
+
+/* Structs to pass args for IO_URING_CMD_OP_GETSOCKOPT and IO_URING_CMD_OP_SETSOCKOPT operations */
+struct args_setsockopt_uring {
+       int             level;
+       int             optname;
+       char __user     *user_optval;
+       int             optlen;
+};
+
+struct args_getsockopt_uring {
+       int             level;
+       int             optname;
+       char __user     *user_optval;
+       int __user      *optlen;
+};
+

/*
  * sqe->fsync_flags
diff -uprN a/net/socket.c b/net/socket.c
--- a/net/socket.c      2023-04-23 15:02:52.000000000 -0400
+++ b/net/socket.c      2023-04-24 08:06:44.800981696 -0400
@@ -108,6 +108,11 @@
#include <linux/ptp_clock_kernel.h>
#include <trace/events/sock.h>

+#ifdef CONFIG_IO_URING
+#include <uapi/linux/io_uring.h>
+#include <linux/io_uring.h>
+#endif
+
#ifdef CONFIG_NET_RX_BUSY_POLL
unsigned int sysctl_net_busy_read __read_mostly;
unsigned int sysctl_net_busy_poll __read_mostly;
@@ -132,6 +137,11 @@ static ssize_t sock_splice_read(struct f
                                struct pipe_inode_info *pipe, size_t len,
                                unsigned int flags);

+
+#ifdef CONFIG_IO_URING
+int socket_uring_cmd_handler(struct io_uring_cmd *cmd, unsigned int flags);
+#endif
+
#ifdef CONFIG_PROC_FS
static void sock_show_fdinfo(struct seq_file *m, struct file *f)
{
@@ -166,6 +176,9 @@ static const struct file_operations sock
        .splice_write = generic_splice_sendpage,
        .splice_read =  sock_splice_read,
        .show_fdinfo =  sock_show_fdinfo,
+#ifdef CONFIG_IO_URING
+       .uring_cmd = socket_uring_cmd_handler,
+#endif
};

static const char * const pf_family_names[] = {
@@ -2330,6 +2343,126 @@ SYSCALL_DEFINE5(getsockopt, int, fd, int
        return __sys_getsockopt(fd, level, optname, optval, optlen);
}

+#ifdef CONFIG_IO_URING
+
+/*
+ * Perform a getsockopt operation through io_uring.
+ * This function is based on __sys_getsockopt, without sockfd_lookup_light,
+ * since io_uring retrieves the socket for us.
+ */
+int uring_cmd_getsockopt(struct socket *sock, int level, int optname, char __user *optval,
+               int __user *optlen)
+{
+       int err;
+       int max_optlen;
+
+       err = security_socket_getsockopt(sock, level, optname);
+       if (err)
+               goto out_put;
+
+       if (!in_compat_syscall())
+               max_optlen = BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN(optlen);
+
+       if (level == SOL_SOCKET)
+               err = sock_getsockopt(sock, level, optname, optval, optlen);
+       else if (unlikely(!sock->ops->getsockopt))
+               err = -EOPNOTSUPP;
+       else
+               err = sock->ops->getsockopt(sock, level, optname, optval,
+                                           optlen);
+
+       if (!in_compat_syscall())
+               err = BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock->sk, level, optname,
+                                                    optval, optlen, max_optlen,
+                                                    err);
+out_put:
+       return err;
+}
+
+/*
+ * Perform a setsockopt operation through io_uring.
+ * This function is based on __sys_setsockopt, without sockfd_lookup_light,
+ * since io_uring retrieves the socket for us.
+ */
+int uring_cmd_setsockopt(struct socket *sock, int level, int optname, char __user *user_optval,
+               int optlen)
+{
+       sockptr_t optval = USER_SOCKPTR(user_optval);
+       char *kernel_optval = NULL;
+       int err;
+
+       if (optlen < 0)
+               return -EINVAL;
+
+       err = security_socket_setsockopt(sock, level, optname);
+       if (err)
+               goto out_put;
+
+       if (!in_compat_syscall())
+               err = BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock->sk, &level, &optname,
+                                                    user_optval, &optlen,
+                                                    &kernel_optval);
+       if (err < 0)
+               goto out_put;
+       if (err > 0) {
+               err = 0;
+               goto out_put;
+       }
+
+       if (kernel_optval)
+               optval = KERNEL_SOCKPTR(kernel_optval);
+       if (level == SOL_SOCKET && !sock_use_custom_sol_socket(sock))
+               err = sock_setsockopt(sock, level, optname, optval, optlen);
+       else if (unlikely(!sock->ops->setsockopt))
+               err = -EOPNOTSUPP;
+       else
+               err = sock->ops->setsockopt(sock, level, optname, optval,
+                                           optlen);
+       kfree(kernel_optval);
+out_put:
+       return err;
+}
+
+/*
+ * uring_cmd handler for the socket file_operations.
+ *
+ * Operation codes and structs are defined in include/uapi/linux/io_uring.h.
+ * The ring must be set up with the flags IORING_SETUP_SQE128 and IORING_SETUP_CQE32.
+ * Case bodies are braced: before C23 a declaration cannot directly follow a case label.
+ */
+int socket_uring_cmd_handler(struct io_uring_cmd *cmd, unsigned int flags)
+{
+       /* Retrieve socket */
+       struct socket *sock = sock_from_file(cmd->file);
+
+       if (!sock)
+               return -EINVAL;
+
+       /* Perform the requested operation */
+       switch (cmd->cmd_op) {
+       case IO_URING_CMD_OP_GETSOCKOPT: {
+               struct args_getsockopt_uring *values_get = (struct args_getsockopt_uring *)cmd->cmd;
+               return uring_cmd_getsockopt(sock,
+                                           values_get->level,
+                                           values_get->optname,
+                                           values_get->user_optval,
+                                           values_get->optlen);
+       }
+       case IO_URING_CMD_OP_SETSOCKOPT: {
+               struct args_setsockopt_uring *values_set = (struct args_setsockopt_uring *)cmd->cmd;
+               return uring_cmd_setsockopt(sock,
+                                           values_set->level,
+                                           values_set->optname,
+                                           values_set->user_optval,
+                                           values_set->optlen);
+       }
+       default:
+               break;
+       }
+       return -EINVAL;
+}
+#endif
+
/*
  *     Shutdown a socket.
  */

I would appreciate any feedback or advice you may have on this work. I hope it is of some help. Thank you for your time and consideration.

Adrien

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
  2023-05-02  9:21   ` Adrien Delorme
@ 2023-05-02 13:03   ` Pavel Begunkov
  -1 siblings, 0 replies; 108+ messages in thread
From: Pavel Begunkov @ 2023-05-02 13:03 UTC (permalink / raw)
  To: dccp

On 5/2/23 10:21, Adrien Delorme wrote:
>  From Adrien Delorme
> 
>> From: David Ahern
>> Sent: 12 April 2023 7:39
>>> Sent: 11 April 2023 16:28
>> ....
>>> Christoph's patch set a few years back that removed set_fs broke the
>>> ability to do in-kernel ioctl and {s,g}etsockopt calls. I did not
>>> follow that change; was it a deliberate intent to not allow these
>>> in-kernel calls vs wanting to remove the set_fs? e.g., can we add a
>>> kioctl variant for in-kernel use of the APIs?
>>
>> I think that was a side effect, and with no in-tree in-kernel
>> users (apart from limited calls in bpf) it was deemed acceptable.
>> (It is a PITA for any code trying to use SCTP in kernel.)
>>
>> One problem is that not all sockopt calls pass the correct length.
>> And some of them can have very long buffers.
>> Not to mention the ones that are read-modify-write.
>>
>> A plausible solution is to pass a 'fat pointer' that contains
>> some, or all, of:
>>        - A userspace buffer pointer.
>>        - A kernel buffer pointer.
>>        - The length supplied by the user.
>>        - The length of the kernel buffer.
>>        - The number of bytes to copy on completion.
>> For simple user requests the syscall entry/exit code
>> would copy the data to a short on-stack buffer.
>> Kernel users just pass the kernel address.
>> Odd requests can just use the user pointer.
>>
>> Probably needs accessors that add in an offset.
>>
>> It might also be that some of the problematic sockopt
>> were in decnet - now removed.
> 
> Hello everyone,
> 
> I'm currently working on an implementation of {get,set}sockopt.
> Since this thread is already discussing it, I hope I am replying in the right place.

Hi Adrien, I believe Breno is working on set/getsockopt as well
and has had similar patches for a while, but some problems need
to be solved first, e.g. deciding whether it copies to a pointer
as the syscall versions do, or gets/returns optval directly in
the sqe/cqe. Also where to store the bits that you pass in
struct args_setsockopt_uring, and whether to rely on SQE128
or not.


> My implementation is rather simple: a struct passes the necessary information through sqe->cmd.
> 
> Here is my implementation, based on kernel version 6.3:
> 
> Signed-off-by: Adrien Delorme <delorme.ade@outlook.com>
> 
> diff -uprN a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> --- a/include/uapi/linux/io_uring.h     2023-04-23 15:02:52.000000000 -0400
> +++ b/include/uapi/linux/io_uring.h     2023-04-24 07:55:21.406981696 -0400
> @@ -235,6 +235,25 @@ enum io_uring_op {
>    */
> #define IORING_URING_CMD_FIXED (1U << 0)
> 
> +/* struct io_uring_cmd->cmd_op flags for socket operations */
> +#define IO_URING_CMD_OP_GETSOCKOPT 0x0
> +#define IO_URING_CMD_OP_SETSOCKOPT 0x1
> +
> +/* Struct to pass args for IO_URING_CMD_OP_GETSOCKOPT and IO_URING_CMD_OP_SETSOCKOPT operations */
> +struct args_setsockopt_uring{

The name of the structure is inconsistent with the rest of the
file. Something like io_[uring_]sockopt_arg, or some variation,
would be better.

> +       int                             level;
> +       int                     optname;
> +       char __user *   user_optval;
> +       int                     optlen;

That's uapi, so there should be no __user, and field sizes
should be portable, i.e. use __u32, __u64, etc.; look through
the file for examples.
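
As a rough illustration of the point above, a fixed-width argument block might look like the sketch below. The struct name io_uring_sockopt_arg and its layout are hypothetical, and uint32_t/uint64_t stand in for the kernel's __u32/__u64 so the snippet builds in userspace. The 16-byte and 80-byte figures are my understanding of how much command space a normal and an SQE128 entry provide, so treat them as assumptions too.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical argument block; the name and layout are illustrative only.
 * The user pointer travels as a plain 64-bit value, without __user.
 */
struct io_uring_sockopt_arg {
	uint32_t level;
	uint32_t optname;
	uint64_t optval;	/* user pointer carried as a 64-bit integer */
	uint32_t optlen;
	uint32_t pad;		/* explicit padding, same size on 32/64-bit */
};

/* Assumed command bytes in a 64-byte SQE and in a 128-byte SQE. */
enum { SQE_CMD_BYTES = 16, SQE128_CMD_BYTES = 80 };

static_assert(sizeof(struct io_uring_sockopt_arg) == 24, "no hidden padding");
static_assert(offsetof(struct io_uring_sockopt_arg, optval) == 8,
	      "optval is naturally aligned");

/* Returns 1 if the argument block fits in a command area of cmd_bytes. */
static inline int sockopt_arg_fits(size_t cmd_bytes)
{
	return sizeof(struct io_uring_sockopt_arg) <= cmd_bytes;
}

/* Store a pointer in a 64-bit field the way uapi structs usually do. */
static inline uint64_t ptr_to_u64(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}
```

With this layout the block is 24 bytes, so it would not fit in the command area of a normal SQE, which is one way to see why the SQE128 question matters.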

I would need to look into the get/setsockopt implementation
before saying anything about uring_cmd_{set,get}sockopt.


> +};
> +
> +struct args_getsockopt_uring{
> +       int                             level;
> +       int                     optname;
> +       char __user *   user_optval;
> +       int      __user *       optlen;
> +};
> +
> 
> /*
>    * sqe->fsync_flags
> diff -uprN a/net/socket.c b/net/socket.c
> --- a/net/socket.c      2023-04-23 15:02:52.000000000 -0400
> +++ b/net/socket.c      2023-04-24 08:06:44.800981696 -0400
> @@ -108,6 +108,11 @@
> #include <linux/ptp_clock_kernel.h>
> #include <trace/events/sock.h>
> 
> +#ifdef CONFIG_IO_URING
> +#include <uapi/linux/io_uring.h>
> +#include <linux/io_uring.h>
> +#endif
> +
> #ifdef CONFIG_NET_RX_BUSY_POLL
> unsigned int sysctl_net_busy_read __read_mostly;
> unsigned int sysctl_net_busy_poll __read_mostly;
> @@ -132,6 +137,11 @@ static ssize_t sock_splice_read(struct f
>                                  struct pipe_inode_info *pipe, size_t len,
>                                  unsigned int flags);
> 
> +
> +#ifdef CONFIG_IO_URING
> +int socket_uring_cmd_handler(struct io_uring_cmd *cmd, unsigned int flags);
> +#endif
> +
> #ifdef CONFIG_PROC_FS
> static void sock_show_fdinfo(struct seq_file *m, struct file *f)
> {
> @@ -166,6 +176,9 @@ static const struct file_operations sock
>          .splice_write = generic_splice_sendpage,
>          .splice_read =  sock_splice_read,
>          .show_fdinfo =  sock_show_fdinfo,
> +#ifdef CONFIG_IO_URING
> +       .uring_cmd = socket_uring_cmd_handler,
> +#endif
> };
> 
> static const char * const pf_family_names[] = {
> @@ -2330,6 +2343,126 @@ SYSCALL_DEFINE5(getsockopt, int, fd, int
>          return __sys_getsockopt(fd, level, optname, optval, optlen);
> }
> 
> +#ifdef CONFIG_IO_URING
> +
> +/*
> + * Make getsockopt operation with io_uring.
> + * This fonction is based of the __sys_getsockopt without sockfd_lookup_light
> + * since io_uring retrieves it for us.
> + */
> +int uring_cmd_getsockopt(struct socket *sock, int level, int optname, char __user *optval,
> +               int __user *optlen)
> +{
> +       int err;
> +       int max_optlen;
> +
> +       err = security_socket_getsockopt(sock, level, optname);
> +       if (err)
> +               goto out_put;
> +
> +       if (!in_compat_syscall())
> +               max_optlen = BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN(optlen);
> +
> +       if (level == SOL_SOCKET)
> +               err = sock_getsockopt(sock, level, optname, optval, optlen);
> +       else if (unlikely(!sock->ops->getsockopt))
> +               err = -EOPNOTSUPP;
> +       else
> +               err = sock->ops->getsockopt(sock, level, optname, optval,
> +                                           optlen);
> +
> +       if (!in_compat_syscall())
> +               err = BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock->sk, level, optname,
> +                                                    optval, optlen, max_optlen,
> +                                                    err);
> +out_put:
> +       return err;
> +}
> +
> +/*
> + * Make setsockopt operation with io_uring.
> + * This fonction is based of the __sys_setsockopt without sockfd_lookup_light
> + * since io_uring retrieves it for us.
> + */
> +int uring_cmd_setsockopt(struct socket *sock, int level, int optname, char *user_optval,
> +               int optlen)
> +{
> +       sockptr_t optval = USER_SOCKPTR(user_optval);
> +       char *kernel_optval = NULL;
> +       int err;
> +
> +       if (optlen < 0)
> +               return -EINVAL;
> +
> +       err = security_socket_setsockopt(sock, level, optname);
> +       if (err)
> +               goto out_put;
> +
> +       if (!in_compat_syscall())
> +               err = BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock->sk, &level, &optname,
> +                                                    user_optval, &optlen,
> +                                                    &kernel_optval);
> +       if (err < 0)
> +               goto out_put;
> +       if (err > 0) {
> +               err = 0;
> +               goto out_put;
> +       }
> +
> +       if (kernel_optval)
> +               optval = KERNEL_SOCKPTR(kernel_optval);
> +       if (level == SOL_SOCKET && !sock_use_custom_sol_socket(sock))
> +               err = sock_setsockopt(sock, level, optname, optval, optlen);
> +       else if (unlikely(!sock->ops->setsockopt))
> +               err = -EOPNOTSUPP;
> +       else
> +               err = sock->ops->setsockopt(sock, level, optname, optval,
> +                                           optlen);
> +       kfree(kernel_optval);
> +out_put:
> +       return err;
> +}
> +
> +/*
> + * Handler uring_cmd socket file_operations.
> + *
> + * Operation code and struct are defined in /include/uapi/linux/io_uring.h
> + * The io_uring ring needs to be set with the flags : IORING_SETUP_SQE128 and IORING_SETUP_CQE32
> + *
> + */
> +int socket_uring_cmd_handler(struct io_uring_cmd *cmd, unsigned int flags){
> +
> +       /* Retrieve socket */
> +       struct socket *sock = sock_from_file(cmd->file);
> +
> +       if (!sock)
> +               return -EINVAL;
> +
> +       /* Does the requested operation */
> +       switch (cmd->cmd_op) {
> +               case IO_URING_CMD_OP_GETSOCKOPT:
> +                       struct args_getsockopt_uring *values_get = (struct args_getsockopt_uring *) cmd->cmd;
> +                       return uring_cmd_getsockopt(sock,
> +                                                                               values_get->level,
> +                                                                               values_get->optname,
> +                                                                               values_get->user_optval,
> +                                                                               values_get->optlen);
> +
> +               case IO_URING_CMD_OP_SETSOCKOPT:
> +                       struct args_setsockopt_uring *values_set = (struct args_setsockopt_uring *) cmd->cmd;
> +                       return uring_cmd_setsockopt(sock,
> +                                                                               values_set->level,
> +                                                                               values_set->optname,
> +                                                                               values_set->user_optval,
> +                                                                               values_set->optlen);
> +               default:
> +                       break;
> +
> +       }
> +       return -EINVAL;
> +}
> +#endif
> +
> /*
>    *     Shutdown a socket.
>    */
> 
> I would appreciate any feedback or advice you may have on this work. Hopefully it will be of some kind of help. Thank you for your time and consideration.
> 
> Adrien

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH 0/5] add initial io_uring_cmd support for sockets
@ 2023-05-02 13:03   ` Pavel Begunkov
  0 siblings, 0 replies; 108+ messages in thread
From: Pavel Begunkov @ 2023-05-02 13:03 UTC (permalink / raw)
  To: Adrien Delorme, david.laight@aculab.com
  Cc: axboe@kernel.dk, davem@davemloft.net, dccp@vger.kernel.org,
	dsahern@kernel.org, edumazet@google.com, io-uring@vger.kernel.org,
	kuba@kernel.org, leit@fb.com, leitao@debian.org,
	linux-kernel@vger.kernel.org, marcelo.leitner@gmail.com,
	matthieu.baerts@tessares.net, mptcp@lists.linux.dev,
	netdev@vger.kernel.org, pabeni@redhat.com, willemb@google.com,
	willemdebruijn.kernel@gmail.com

On 5/2/23 10:21, Adrien Delorme wrote:
>  From Adrien Delorme
> 
>> From: David Ahern
>> Sent: 12 April 2023 7:39
>>> Sent: 11 April 2023 16:28
>> ....
>>> Christoph's patch set a few years back that removed set_fs broke the
>>> ability to do in-kernel ioctl and {s,g}etsockopt calls. I did not
>>> follow that change; was it a deliberate intent to not allow these
>>> in-kernel calls vs wanting to remove the set_fs? e.g., can we add a
>>> kioctl variant for in-kernel use of the APIs?
>>
>> I think that was a side effect, and with no in-tree in-kernel
>> users (apart from limited calls in bpf) it was deemed acceptable.
>> (It is a PITA for any code trying to use SCTP in kernel.)
>>
>> One problem is that not all sockopt calls pass the correct length.
>> And some of them can have very long buffers.
>> Not to mention the ones that are read-modify-write.
>>
>> A plausible solution is to pass a 'fat pointer' that contains
>> some, or all, of:
>>        - A userspace buffer pointer.
>>        - A kernel buffer pointer.
>>        - The length supplied by the user.
>>        - The length of the kernel buffer.
>>        - The number of bytes to copy on completion.
>> For simple user requests the syscall entry/exit code
>> would copy the data to a short on-stack buffer.
>> Kernel users just pass the kernel address.
>> Odd requests can just use the user pointer.
>>
>> Probably needs accessors that add in an offset.
>>
>> It might also be that some of the problematic sockopt
>> were in decnet - now removed.
> 
> Hello everyone,
> 
> I'm currently working on an implementation of {get,set}sockopt.
> Since this thread is already discussing it, I hope I am replying in the right place.

Hi Adrien, I believe Breno is working on set/getsockopt as well
and has had similar patches for a while, but some problems need
to be solved first, e.g. deciding whether it copies to a pointer
as the syscall versions do, or gets/returns optval directly in
the sqe/cqe. Also where to store the bits that you pass in
struct args_setsockopt_uring, and whether to rely on SQE128
or not.


> My implementation is rather simple: a struct passes the necessary information through sqe->cmd.
> 
> Here is my implementation, based on kernel version 6.3:
> 
> Signed-off-by: Adrien Delorme <delorme.ade@outlook.com>
> 
> diff -uprN a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> --- a/include/uapi/linux/io_uring.h     2023-04-23 15:02:52.000000000 -0400
> +++ b/include/uapi/linux/io_uring.h     2023-04-24 07:55:21.406981696 -0400
> @@ -235,6 +235,25 @@ enum io_uring_op {
>    */
> #define IORING_URING_CMD_FIXED (1U << 0)
> 
> +/* struct io_uring_cmd->cmd_op flags for socket operations */
> +#define IO_URING_CMD_OP_GETSOCKOPT 0x0
> +#define IO_URING_CMD_OP_SETSOCKOPT 0x1
> +
> +/* Struct to pass args for IO_URING_CMD_OP_GETSOCKOPT and IO_URING_CMD_OP_SETSOCKOPT operations */
> +struct args_setsockopt_uring{

The name of the structure is quite inconsistent with the
rest. It would be better as io_[uring_]sockopt_arg or some
variation.

> +       int                             level;
> +       int                     optname;
> +       char __user *   user_optval;
> +       int                     optlen;

That's uapi, so there should not be __user, and field sizes
should be fixed-width for portability, i.e. use __u32, __u64,
etc.; look through the file for examples.

I would need to look into the get/setsockopt implementation
before saying anything about uring_cmd_{set,get}sockopt.
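
As a rough illustration of the uapi point above, here is a minimal
userspace sketch of a fixed-width layout. The struct name, the field
order and the explicit padding are all hypothetical and are not taken
from any posted patch; uint32_t/uint64_t stand in for __u32/__u64 so
that it builds outside the kernel tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: every field has a fixed width, no __user marker. */
struct io_uring_sockopt_arg_sketch {
	uint32_t level;    /* e.g. SOL_SOCKET */
	uint32_t optname;  /* e.g. SO_RCVBUF */
	uint64_t optval;   /* user address carried as a 64-bit integer */
	uint32_t optlen;   /* buffer length in bytes */
	uint32_t pad;      /* explicit padding so the size is the same everywhere */
};

/* The layout must not depend on the word size of the ABI. */
static_assert(sizeof(struct io_uring_sockopt_arg_sketch) == 24,
	      "size is fixed at 24 bytes");
static_assert(offsetof(struct io_uring_sockopt_arg_sketch, optval) == 8,
	      "optval is naturally aligned");

/* Store a pointer in the 64-bit slot, going through uintptr_t. */
static inline uint64_t sketch_ptr_to_u64(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}
```

With this layout a 32-bit and a 64-bit process would hand over the
same 24 bytes, which is the main reason for avoiding int and raw
pointers here.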


> +};
> +
> +struct args_getsockopt_uring{
> +       int                             level;
> +       int                     optname;
> +       char __user *   user_optval;
> +       int      __user *       optlen;
> +};
> +
> 
> /*
>    * sqe->fsync_flags
> diff -uprN a/net/socket.c b/net/socket.c
> --- a/net/socket.c      2023-04-23 15:02:52.000000000 -0400
> +++ b/net/socket.c      2023-04-24 08:06:44.800981696 -0400
> @@ -108,6 +108,11 @@
> #include <linux/ptp_clock_kernel.h>
> #include <trace/events/sock.h>
> 
> +#ifdef CONFIG_IO_URING
> +#include <uapi/linux/io_uring.h>
> +#include <linux/io_uring.h>
> +#endif
> +
> #ifdef CONFIG_NET_RX_BUSY_POLL
> unsigned int sysctl_net_busy_read __read_mostly;
> unsigned int sysctl_net_busy_poll __read_mostly;
> @@ -132,6 +137,11 @@ static ssize_t sock_splice_read(struct f
>                                  struct pipe_inode_info *pipe, size_t len,
>                                  unsigned int flags);
> 
> +
> +#ifdef CONFIG_IO_URING
> +int socket_uring_cmd_handler(struct io_uring_cmd *cmd, unsigned int flags);
> +#endif
> +
> #ifdef CONFIG_PROC_FS
> static void sock_show_fdinfo(struct seq_file *m, struct file *f)
> {
> @@ -166,6 +176,9 @@ static const struct file_operations sock
>          .splice_write = generic_splice_sendpage,
>          .splice_read =  sock_splice_read,
>          .show_fdinfo =  sock_show_fdinfo,
> +#ifdef CONFIG_IO_URING
> +       .uring_cmd = socket_uring_cmd_handler,
> +#endif
> };
> 
> static const char * const pf_family_names[] = {
> @@ -2330,6 +2343,126 @@ SYSCALL_DEFINE5(getsockopt, int, fd, int
>          return __sys_getsockopt(fd, level, optname, optval, optlen);
> }
> 
> +#ifdef CONFIG_IO_URING
> +
> +/*
> + * Perform a getsockopt operation through io_uring.
> + * This function is based on __sys_getsockopt, without sockfd_lookup_light,
> + * since io_uring retrieves the socket for us.
> + */
> +int uring_cmd_getsockopt(struct socket *sock, int level, int optname, char __user *optval,
> +               int __user *optlen)
> +{
> +       int err;
> +       int max_optlen;
> +
> +       err = security_socket_getsockopt(sock, level, optname);
> +       if (err)
> +               goto out_put;
> +
> +       if (!in_compat_syscall())
> +               max_optlen = BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN(optlen);
> +
> +       if (level == SOL_SOCKET)
> +               err = sock_getsockopt(sock, level, optname, optval, optlen);
> +       else if (unlikely(!sock->ops->getsockopt))
> +               err = -EOPNOTSUPP;
> +       else
> +               err = sock->ops->getsockopt(sock, level, optname, optval,
> +                                           optlen);
> +
> +       if (!in_compat_syscall())
> +               err = BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock->sk, level, optname,
> +                                                    optval, optlen, max_optlen,
> +                                                    err);
> +out_put:
> +       return err;
> +}
> +
> +/*
> + * Perform a setsockopt operation through io_uring.
> + * This function is based on __sys_setsockopt, without sockfd_lookup_light,
> + * since io_uring retrieves the socket for us.
> + */
> +int uring_cmd_setsockopt(struct socket *sock, int level, int optname, char *user_optval,
> +               int optlen)
> +{
> +       sockptr_t optval = USER_SOCKPTR(user_optval);
> +       char *kernel_optval = NULL;
> +       int err;
> +
> +       if (optlen < 0)
> +               return -EINVAL;
> +
> +       err = security_socket_setsockopt(sock, level, optname);
> +       if (err)
> +               goto out_put;
> +
> +       if (!in_compat_syscall())
> +               err = BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock->sk, &level, &optname,
> +                                                    user_optval, &optlen,
> +                                                    &kernel_optval);
> +       if (err < 0)
> +               goto out_put;
> +       if (err > 0) {
> +               err = 0;
> +               goto out_put;
> +       }
> +
> +       if (kernel_optval)
> +               optval = KERNEL_SOCKPTR(kernel_optval);
> +       if (level == SOL_SOCKET && !sock_use_custom_sol_socket(sock))
> +               err = sock_setsockopt(sock, level, optname, optval, optlen);
> +       else if (unlikely(!sock->ops->setsockopt))
> +               err = -EOPNOTSUPP;
> +       else
> +               err = sock->ops->setsockopt(sock, level, optname, optval,
> +                                           optlen);
> +       kfree(kernel_optval);
> +out_put:
> +       return err;
> +}
> +
> +/*
> + * uring_cmd handler for the socket file_operations.
> + *
> + * Operation codes and structs are defined in include/uapi/linux/io_uring.h
> + * The io_uring ring needs to be set up with the flags IORING_SETUP_SQE128 and IORING_SETUP_CQE32
> + *
> + */
> +int socket_uring_cmd_handler(struct io_uring_cmd *cmd, unsigned int flags){
> +
> +       /* Retrieve socket */
> +       struct socket *sock = sock_from_file(cmd->file);
> +
> +       if (!sock)
> +               return -EINVAL;
> +
> +       /* Does the requested operation */
> +       switch (cmd->cmd_op) {
> +               case IO_URING_CMD_OP_GETSOCKOPT:
> +                       struct args_getsockopt_uring *values_get = (struct args_getsockopt_uring *) cmd->cmd;
> +                       return uring_cmd_getsockopt(sock,
> +                                                                               values_get->level,
> +                                                                               values_get->optname,
> +                                                                               values_get->user_optval,
> +                                                                               values_get->optlen);
> +
> +               case IO_URING_CMD_OP_SETSOCKOPT:
> +                       struct args_setsockopt_uring *values_set = (struct args_setsockopt_uring *) cmd->cmd;
> +                       return uring_cmd_setsockopt(sock,
> +                                                                               values_set->level,
> +                                                                               values_set->optname,
> +                                                                               values_set->user_optval,
> +                                                                               values_set->optlen);
> +               default:
> +                       break;
> +
> +       }
> +       return -EINVAL;
> +}
> +#endif
> +
> /*
>    *     Shutdown a socket.
>    */
> 
> I would appreciate any feedback or advice you may have on this work. Hopefully it will be of some kind of help. Thank you for your time and consideration.
> 
> Adrien

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 108+ messages in thread

* RE: [PATCH 0/5] add initial io_uring_cmd support for sockets
@ 2023-05-03 13:11   ` Adrien Delorme
  0 siblings, 0 replies; 108+ messages in thread
From: Adrien Delorme @ 2023-05-03 13:11 UTC (permalink / raw)
  To: Pavel Begunkov, david.laight@aculab.com
  Cc: axboe@kernel.dk, davem@davemloft.net, dccp@vger.kernel.org,
	dsahern@kernel.org, edumazet@google.com, io-uring@vger.kernel.org,
	kuba@kernel.org, leit@fb.com, leitao@debian.org,
	linux-kernel@vger.kernel.org, marcelo.leitner@gmail.com,
	matthieu.baerts@tessares.net, mptcp@lists.linux.dev,
	netdev@vger.kernel.org, pabeni@redhat.com, willemb@google.com,
	willemdebruijn.kernel@gmail.com


From Adrien Delorme
> From : Pavel Begunkov 
> Sent : 2 May 2023 15:04
> On 5/2/23 10:21, Adrien Delorme wrote:
> >  From Adrien Delorme
> >
> >> From: David Ahern
> >> Sent: 12 April 2023 7:39
> >>> Sent: 11 April 2023 16:28
> >> ....
> >> One problem is that not all sockopt calls pass the correct length.
> >> And some of them can have very long buffers.
> >> Not to mention the ones that are read-modify-write.
> >>
> >> A plausible solution is to pass a 'fat pointer' that contains some,
> >> or all, of:
> >>        - A userspace buffer pointer.
> >>        - A kernel buffer pointer.
> >>        - The length supplied by the user.
> >>        - The length of the kernel buffer.
> >>        = The number of bytes to copy on completion.
> >> For simple user requests the syscall entry/exit code would copy the
> >> data to a short on-stack buffer.
> >> Kernel users just pass the kernel address.
> >> Odd requests can just use the user pointer.
> >>
> >> Probably needs accessors that add in an offset.
> >>
> >> It might also be that some of the problematic sockopt were in decnet
> >> - now removed.
> >
> > Hello everyone,
> >
> > I'm currently working on an implementation of {get,set} sockopt.
> > Since this thread is already talking about it, I hope that I am replying in the
> > correct place.
> 
> Hi Adrien, I believe Breno is working on set/getsockopt as well and had
> similar patches for awhile, but that would need for some problems to be
> solved first, e.g. try and decide whether it copies to a ptr as the syscall
> versions or would get/return optval directly in sqe/cqe. And also where to
> store bits that you pass in struct args_setsockopt_uring, and whether to rely
> on SQE128 or not.
> 

Hello Pavel,
That is good to hear. If possible, I would like to help.
I looked at the getsockopt implementation. From what I'm seeing, I believe that it would be easier to copy to a ptr, as the syscall does.
The length of the output is usually 4 bytes (sometimes less), but in a lot of cases this length is variable. Sometimes it can even be bigger than an SQE128 entry.

Here is a non-exhaustive list of those cases:
/net/ipv4/tcp.c : int do_tcp_getsockopt(...)
  - TCP_INFO : up to 240 bytes
  - TCP_CC_INFO and TCP_REPAIR_WINDOW : up to 20 bytes
  - TCP_CONGESTION and TCP_ULP : up to 16 bytes
  - TCP_ZEROCOPY_RECEIVE : up to 64 bytes
/net/atm/common.c : int vcc_getsockopt(...)
  - SO_ATMQOS : up to 88 bytes
  - SO_ATMPVC : up to 16 bytes
/net/ipv4/ip_sockglue.c : int do_ip_getsockopt(...)
  - MCAST_MSFILTER : up to 144 bytes
  - IP_MSFILTER : 16 bytes minimum

I will look into setsockopt, but I believe it might be the same.
If needed, I can also complete this list.
However, there are some cases where it is hard to determine a maximum number of bytes in advance.

As to where the bytes should be stored, I was thinking of either:
  - As a pointer in sqe->addr, so SQE128 is not needed
  - In sqe->cmd as a struct, but from my understanding SQE128 is then needed
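
To make the size constraint concrete, here is a small userspace sketch
that sorts option lengths by where the value could live. It assumes
the sqe->cmd area holds 16 bytes in a regular SQE and 80 bytes with
IORING_SETUP_SQE128; all names are hypothetical, and it only counts the
option value itself, so level and optname would still need room
somewhere else.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed capacities of the sqe->cmd area, in bytes. */
#define SKETCH_CMD_BYTES_SQE64   16u  /* regular 64-byte SQE */
#define SKETCH_CMD_BYTES_SQE128  80u  /* with IORING_SETUP_SQE128 */

static_assert(SKETCH_CMD_BYTES_SQE64 < SKETCH_CMD_BYTES_SQE128,
	      "SQE128 only ever adds room");

enum sketch_placement {
	SKETCH_INLINE_SQE64,   /* value fits in a regular SQE */
	SKETCH_INLINE_SQE128,  /* value fits only with SQE128 */
	SKETCH_NEEDS_POINTER,  /* value must go through a user pointer */
};

/* Decide where an option value of optlen bytes could be carried. */
static enum sketch_placement sketch_place(size_t optlen)
{
	if (optlen <= SKETCH_CMD_BYTES_SQE64)
		return SKETCH_INLINE_SQE64;
	if (optlen <= SKETCH_CMD_BYTES_SQE128)
		return SKETCH_INLINE_SQE128;
	return SKETCH_NEEDS_POINTER;
}
```

Under these assumptions a 4-byte option fits anywhere, a 64-byte
TCP_ZEROCOPY_RECEIVE argument fits only with SQE128, and the 240-byte
TCP_INFO case needs a pointer either way.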
> 
> > My implementation is rather simple, using a struct to pass the necessary
> > info through sqe->cmd.
> >
> > Here is my implementation, based on kernel version 6.3:
> > ...
> > +/* Struct to pass args for IO_URING_CMD_OP_GETSOCKOPT and IO_URING_CMD_OP_SETSOCKOPT operations */
> > +struct args_setsockopt_uring{
> 
> The name of the structure is quite inconsistent with the rest. It's better to be
> io_[uring_]_sockopt_arg or some variation.
> 
> > +       int                             level;
> > +       int                     optname;
> > +       char __user *   user_optval;
> > +       int                     optlen;
> 
> That's uapi, there should not be __user, and field sizes should be more
> portable, i.e. use __u32, __u64, etc, look through the file.
> 
> Would need to look into the get/setsockopt implementation before saying
> anything about uring_cmd_{set,get}sockopt.
> ...
> Pavel Begunkov

Thank you for the review.
Adrien Delorme
--


^ permalink raw reply	[flat|nested] 108+ messages in thread

* RE: [PATCH 0/5] add initial io_uring_cmd support for sockets
@ 2023-05-03 13:27   ` David Laight
  0 siblings, 0 replies; 108+ messages in thread
From: David Laight @ 2023-05-03 13:27 UTC (permalink / raw)
  To: 'Adrien Delorme', Pavel Begunkov
  Cc: axboe@kernel.dk, davem@davemloft.net, dccp@vger.kernel.org,
	dsahern@kernel.org, edumazet@google.com, io-uring@vger.kernel.org,
	kuba@kernel.org, leit@fb.com, leitao@debian.org,
	linux-kernel@vger.kernel.org, marcelo.leitner@gmail.com,
	matthieu.baerts@tessares.net, mptcp@lists.linux.dev,
	netdev@vger.kernel.org, pabeni@redhat.com, willemb@google.com,
	willemdebruijn.kernel@gmail.com

From: Adrien Delorme
> Sent: 03 May 2023 14:11
> 
> From Adrien Delorme
> > From : Pavel Begunkov
> > Sent : 2 May 2023 15:04
> > On 5/2/23 10:21, Adrien Delorme wrote:
> > >  From Adrien Delorme
> > >
> > >> From: David Ahern
> > >> Sent: 12 April 2023 7:39
> > >>> Sent: 11 April 2023 16:28
> > >> ....
> > >> One problem is that not all sockopt calls pass the correct length.
> > >> And some of them can have very long buffers.
> > >> Not to mention the ones that are read-modify-write.
> > >>
> > >> A plausible solution is to pass a 'fat pointer' that contains some,
> > >> or all, of:
> > >>        - A userspace buffer pointer.
> > >>        - A kernel buffer pointer.
> > >>        - The length supplied by the user.
> > >>        - The length of the kernel buffer.
> > >>        = The number of bytes to copy on completion.
> > >> For simple user requests the syscall entry/exit code would copy the
> > >> data to a short on-stack buffer.
> > >> Kernel users just pass the kernel address.
> > >> Odd requests can just use the user pointer.
> > >>
> > >> Probably needs accessors that add in an offset.
> > >>
> > >> It might also be that some of the problematic sockopt were in decnet
> > >> - now removed.
> > >
> > > Hello everyone,
> > >
> > > I'm currently working on an implementation of {get,set} sockopt.
> > > Since this thread is already talking about it, I hope that I am replying in the
> > > correct place.
> >
> > Hi Adrien, I believe Breno is working on set/getsockopt as well and had
> > similar patches for awhile, but that would need for some problems to be
> > solved first, e.g. try and decide whether it copies to a ptr as the syscall
> > versions or would get/return optval directly in sqe/cqe. And also where to
> > store bits that you pass in struct args_setsockopt_uring, and whether to rely
> > on SQE128 or not.
> >
> 
> Hello Pavel,
> That is good to hear. If possible I would like to provide some help.
> I looked at the getsockopt implementation. From what I'm seeing, I believe that it would be easier to
> copy to a ptr, as the syscall does.
> The length of the output is usually 4 bytes (sometimes less), but in a lot of cases this length is
> variable. Sometimes it can even be bigger than an SQE128 entry.
> 
> Here is a non-exhaustive list of those cases:
> /net/ipv4/tcp.c : int do_tcp_getsockopt(...)
>   - TCP_INFO : up to 240 bytes
>   - TCP_CC_INFO and TCP_REPAIR_WINDOW : up to 20 bytes
>   - TCP_CONGESTION and TCP_ULP : up to 16 bytes
>   - TCP_ZEROCOPY_RECEIVE : up to 64 bytes
> /net/atm/common.c : int vcc_getsockopt(...)
>   - SO_ATMQOS : up to 88 bytes
>   - SO_ATMPVC : up to 16 bytes
> /net/ipv4/ip_sockglue.c : int do_ip_getsockopt(...)
>   - MCAST_MSFILTER : up to 144 bytes
>   - IP_MSFILTER : 16 bytes minimum
> 
> I will look into setsockopt but I believe it might be the same.
> If needed I can also complete this list.
> However, there are some cases where it is hard to determine a maximum number of bytes in advance.

Also look at SCTP - it has some very long buffers.
Almost any code that uses SCTP needs to use the SCTP_STATUS
request to get the negotiated number of data streams
(that one is relatively short).
IIRC there are also getsockopt() calls that are read/modify/write!

There will also be user code that supplies a very long buffer
(too long to allocate in kernel) for some variable length requests.

So the generic system call code can allocate a short (eg on-stack)
buffer for short requests and then pass both the user and kernel
addresses (and lengths) through to the protocol functions.
Anything that needs a big buffer can copy directly to/from
any user buffers; kernel callers would need to pass a big enough
buffer.

But the code for small buffers would be much simplified for
both kernel and user access.
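
A rough userspace sketch of that split, for short requests only.
Everything here is hypothetical: the struct, the 64-byte threshold and
the function names are stand-ins, memcpy() stands in for
copy_to_user(), and the stand-in protocol handler simply refuses
anything it cannot serve from the short buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_SHORT_OPT 64  /* assumed threshold for the on-stack copy */

/* Hypothetical 'fat pointer' handed down to the protocol code. */
struct sketch_optbuf {
	void *user;       /* caller's buffer (the "user" side) */
	void *kern;       /* short kernel-side buffer, or NULL */
	size_t user_len;  /* length supplied by the caller */
	size_t kern_len;  /* length of the kernel-side buffer */
	size_t copy_out;  /* bytes to copy back on completion */
};

/* Stand-in protocol handler: returns one int-sized option value. */
static int sketch_proto_getopt(struct sketch_optbuf *ob, int value)
{
	if (!ob->kern || ob->kern_len < sizeof(value))
		return -1;  /* a real handler could fall back to ob->user */
	memcpy(ob->kern, &value, sizeof(value));
	ob->copy_out = sizeof(value);
	return 0;
}

/* Stand-in for the generic entry/exit code around the handler. */
static int sketch_getsockopt(void *user_buf, size_t *user_len, int value)
{
	unsigned char stack_buf[SKETCH_SHORT_OPT];
	struct sketch_optbuf ob = { .user = user_buf, .user_len = *user_len };
	int err;

	if (*user_len <= sizeof(stack_buf)) {
		ob.kern = stack_buf;  /* short request: use the stack copy */
		ob.kern_len = *user_len;
	}
	err = sketch_proto_getopt(&ob, value);
	if (err)
		return err;
	memcpy(user_buf, stack_buf, ob.copy_out);  /* copy out on completion */
	*user_len = ob.copy_out;
	return 0;
}
```

The protocol side never touches the caller's buffer in the short
case, so the same handler would work for a kernel caller that passes
its own buffer as ob.kern.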

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 108+ messages in thread

end of thread, other threads:[~2023-05-03 13:27 UTC | newest]

Thread overview: 108+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-04-06 14:43 [PATCH 0/5] add initial io_uring_cmd support for sockets Breno Leitao
2023-04-06 14:43 ` Breno Leitao
2023-04-06 15:34 ` Willem de Bruijn
2023-04-06 15:34   ` Willem de Bruijn
2023-04-06 15:59 ` Breno Leitao
2023-04-06 15:59   ` Breno Leitao
2023-04-06 16:41 ` Keith Busch
2023-04-06 16:41   ` Keith Busch
2023-04-06 16:49 ` Jens Axboe
2023-04-06 16:49   ` Jens Axboe
2023-04-06 16:58 ` Breno Leitao
2023-04-06 16:58   ` Breno Leitao
2023-04-06 18:16 ` Willem de Bruijn
2023-04-06 18:16   ` Willem de Bruijn
2023-04-07  2:46 ` David Ahern
2023-04-07  2:46   ` David Ahern
2023-04-11 11:59 ` Breno Leitao
2023-04-11 12:00   ` Breno Leitao
2023-04-11 14:36 ` David Ahern
2023-04-11 14:36   ` David Ahern
2023-04-11 14:41 ` Jens Axboe
2023-04-11 14:41   ` Jens Axboe
2023-04-11 14:51 ` Willem de Bruijn
2023-04-11 14:51   ` Willem de Bruijn
2023-04-11 14:54 ` Jens Axboe
2023-04-11 14:54   ` Jens Axboe
2023-04-11 15:00 ` Willem de Bruijn
2023-04-11 15:00   ` Willem de Bruijn
2023-04-11 15:06 ` Jens Axboe
2023-04-11 15:06   ` Jens Axboe
2023-04-11 15:10 ` David Ahern
2023-04-11 15:10   ` David Ahern
2023-04-11 15:17 ` Jens Axboe
2023-04-11 15:17   ` Jens Axboe
2023-04-11 15:24 ` Willem de Bruijn
2023-04-11 15:24   ` Willem de Bruijn
2023-04-11 15:27 ` David Ahern
2023-04-11 15:27   ` David Ahern
2023-04-11 15:28 ` Jens Axboe
2023-04-11 15:28   ` Jens Axboe
2023-04-11 15:29 ` Jens Axboe
2023-04-11 15:29   ` Jens Axboe
2023-04-12  7:39 ` David Laight
2023-04-12  7:39   ` David Laight
2023-04-12 13:53 ` Breno Leitao
2023-04-12 13:53   ` Breno Leitao
2023-04-12 14:28 ` Willem de Bruijn
2023-04-12 14:28   ` Willem de Bruijn
2023-04-13  0:02 ` Breno Leitao
2023-04-13  0:02   ` Breno Leitao
2023-04-13 14:24 ` Willem de Bruijn
2023-04-13 14:24   ` Willem de Bruijn
2023-04-13 14:45 ` Jakub Kicinski
2023-04-13 14:45   ` Jakub Kicinski
2023-04-13 14:57 ` David Laight
2023-04-13 14:57   ` David Laight
2023-04-18 13:23 ` Breno Leitao
2023-04-18 13:23   ` Breno Leitao
2023-04-18 19:41 ` Willem de Bruijn
2023-04-18 19:41   ` Willem de Bruijn
2023-04-20 14:43 ` Breno Leitao
2023-04-20 14:43   ` Breno Leitao
2023-04-20 16:48 ` Willem de Bruijn
2023-04-20 16:48   ` Willem de Bruijn
2023-05-02  9:21 ` Adrien Delorme
2023-05-02  9:21   ` Adrien Delorme
2023-05-02 13:03 ` Pavel Begunkov
2023-05-02 13:03   ` Pavel Begunkov
2023-05-03 13:11 ` Adrien Delorme
2023-05-03 13:11   ` Adrien Delorme
2023-05-03 13:27 ` David Laight
2023-05-03 13:27   ` David Laight
  -- strict thread matches above, loose matches on Subject: below --
2023-04-06 14:43 [RFC PATCH 1/4] net: wire up support for file_operations->uring_cmd() Breno Leitao
2023-04-06 14:43 ` Breno Leitao
2023-04-06 14:43 [RFC PATCH 2/4] net: add uring_cmd callback to UDP Breno Leitao
2023-04-06 14:43 ` Breno Leitao
2023-04-06 19:03 ` kernel test robot
2023-04-11 12:54 ` Pavel Begunkov
2023-04-11 12:54   ` Pavel Begunkov
2023-04-06 14:43 [RFC PATCH 3/4] net: add uring_cmd callback to TCP Breno Leitao
2023-04-06 14:43 ` Breno Leitao
2023-04-06 20:35 ` kernel test robot
2023-04-06 14:43 [RFC PATCH 4/4] net: add uring_cmd callback to raw "protocol" Breno Leitao
2023-04-06 14:43 ` Breno Leitao
2023-04-06 16:57 [PATCH RFC] io_uring: Pass whole sqe to commands Breno Leitao
2023-04-06 16:57 ` Breno Leitao
2023-04-06 17:50 ` io_uring: Pass whole sqe to commands: Tests Results MPTCP CI
2023-04-07 18:51 ` [PATCH RFC] io_uring: Pass whole sqe to commands Keith Busch
2023-04-07 18:51   ` Keith Busch
2023-04-07 19:43 ` io_uring: Pass whole sqe to commands: Tests Results MPTCP CI
2023-04-11 12:22 ` [PATCH RFC] io_uring: Pass whole sqe to commands Breno Leitao
2023-04-11 12:22   ` Breno Leitao
2023-04-11 12:39 ` Pavel Begunkov
2023-04-11 12:39   ` Pavel Begunkov
2023-04-13  2:56 ` Ming Lei
2023-04-13  2:56   ` Ming Lei
2023-04-13 16:47 ` Breno Leitao
2023-04-13 16:47   ` Breno Leitao
2023-04-14  2:12 ` Ming Lei
2023-04-14  2:12   ` Ming Lei
2023-04-14 13:12 ` Pavel Begunkov
2023-04-14 13:12   ` Pavel Begunkov
2023-04-14 13:59 ` Ming Lei
2023-04-14 13:59   ` Ming Lei
2023-04-14 14:56 ` Pavel Begunkov
2023-04-14 14:56   ` Pavel Begunkov
2023-04-16  9:51 ` Ming Lei
2023-04-16  9:51   ` Ming Lei
