From: Mina Almasry <almasrymina@google.com>
To: zijianzhang@bytedance.com
Cc: netdev@vger.kernel.org, edumazet@google.com,
willemdebruijn.kernel@gmail.com, cong.wang@bytedance.com,
xiaochun.lu@bytedance.com
Subject: Re: [PATCH net-next v7 2/3] sock: add MSG_ZEROCOPY notification mechanism based on msg_control
Date: Thu, 25 Jul 2024 21:59:39 +0000
Message-ID: <ZqLKy8OqpMi-kPQ3@google.com>
In-Reply-To: <20240708210405.870930-3-zijianzhang@bytedance.com>
On Mon, Jul 08, 2024 at 09:04:04PM +0000, zijianzhang@bytedance.com wrote:
> From: Zijian Zhang <zijianzhang@bytedance.com>
>
> The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
> However, zerocopy is not a free lunch. Apart from the management of user
> pages, the combination of poll + recvmsg to receive notifications incurs
> unignorable overhead in the applications. We try to mitigate this overhead
> with a new notification mechanism based on msg_control. Leveraging the
> general framework to copy cmsgs to the user space, we copy zerocopy
> notifications to the user upon returning of sendmsgs.
>
> Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
> Signed-off-by: Xiaochun Lu <xiaochun.lu@bytedance.com>
> ---
> arch/alpha/include/uapi/asm/socket.h | 2 ++
> arch/mips/include/uapi/asm/socket.h | 2 ++
> arch/parisc/include/uapi/asm/socket.h | 2 ++
> arch/sparc/include/uapi/asm/socket.h | 2 ++
> include/linux/socket.h | 2 +-
> include/uapi/asm-generic/socket.h | 2 ++
> include/uapi/linux/socket.h | 13 ++++++++
> net/core/sock.c | 46 +++++++++++++++++++++++++++
> 8 files changed, 70 insertions(+), 1 deletion(-)
>
> diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
> index e94f621903fe..7c32d9dbe47f 100644
> --- a/arch/alpha/include/uapi/asm/socket.h
> +++ b/arch/alpha/include/uapi/asm/socket.h
> @@ -140,6 +140,8 @@
> #define SO_PASSPIDFD 76
> #define SO_PEERPIDFD 77
>
> +#define SCM_ZC_NOTIFICATION 78
> +
> #if !defined(__KERNEL__)
>
> #if __BITS_PER_LONG == 64
> diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
> index 60ebaed28a4c..3f7fade998cb 100644
> --- a/arch/mips/include/uapi/asm/socket.h
> +++ b/arch/mips/include/uapi/asm/socket.h
> @@ -151,6 +151,8 @@
> #define SO_PASSPIDFD 76
> #define SO_PEERPIDFD 77
>
> +#define SCM_ZC_NOTIFICATION 78
> +
> #if !defined(__KERNEL__)
>
> #if __BITS_PER_LONG == 64
> diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
> index be264c2b1a11..77f5bee0fdc9 100644
> --- a/arch/parisc/include/uapi/asm/socket.h
> +++ b/arch/parisc/include/uapi/asm/socket.h
> @@ -132,6 +132,8 @@
> #define SO_PASSPIDFD 0x404A
> #define SO_PEERPIDFD 0x404B
>
> +#define SCM_ZC_NOTIFICATION 0x404C
> +
> #if !defined(__KERNEL__)
>
> #if __BITS_PER_LONG == 64
> diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
> index 682da3714686..eb44fc515b45 100644
> --- a/arch/sparc/include/uapi/asm/socket.h
> +++ b/arch/sparc/include/uapi/asm/socket.h
> @@ -133,6 +133,8 @@
> #define SO_PASSPIDFD 0x0055
> #define SO_PEERPIDFD 0x0056
>
> +#define SCM_ZC_NOTIFICATION 0x0057
> +
> #if !defined(__KERNEL__)
>
>
> diff --git a/include/linux/socket.h b/include/linux/socket.h
> index 75461812a7a3..6f1b791e2de8 100644
> --- a/include/linux/socket.h
> +++ b/include/linux/socket.h
> @@ -171,7 +171,7 @@ static inline struct cmsghdr * cmsg_nxthdr (struct msghdr *__msg, struct cmsghdr
>
> static inline bool cmsg_copy_to_user(struct cmsghdr *__cmsg)
> {
> - return 0;
> + return __cmsg->cmsg_type == SCM_ZC_NOTIFICATION;
> }
>
> static inline size_t msg_data_left(struct msghdr *msg)
> diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
> index 8ce8a39a1e5f..02e9159c7944 100644
> --- a/include/uapi/asm-generic/socket.h
> +++ b/include/uapi/asm-generic/socket.h
> @@ -135,6 +135,8 @@
> #define SO_PASSPIDFD 76
> #define SO_PEERPIDFD 77
>
> +#define SCM_ZC_NOTIFICATION 78
> +
> #if !defined(__KERNEL__)
>
> #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
> diff --git a/include/uapi/linux/socket.h b/include/uapi/linux/socket.h
> index d3fcd3b5ec53..ab361f30f3a6 100644
> --- a/include/uapi/linux/socket.h
> +++ b/include/uapi/linux/socket.h
> @@ -2,6 +2,8 @@
> #ifndef _UAPI_LINUX_SOCKET_H
> #define _UAPI_LINUX_SOCKET_H
>
> +#include <linux/types.h>
> +
> /*
> * Desired design of maximum size and alignment (see RFC2553)
> */
> @@ -35,4 +37,15 @@ struct __kernel_sockaddr_storage {
> #define SOCK_TXREHASH_DISABLED 0
> #define SOCK_TXREHASH_ENABLED 1
>
> +struct zc_info_elem {
> + __u32 lo;
> + __u32 hi;
> + __u8 zerocopy;

Some docs please on what each of these fields is, if possible. Sorry if the
repeated requests are annoying.

In particular I'm a bit confused about why the zerocopy field is there. Looking
at the code, is it always set to 1?

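To illustrate the kind of comments I have in mind, here is a rough standalone
sketch. The field descriptions are only my reading of the loop in
__sock_cmsg_send() further down (lo and hi are filled from ee_info and ee_data,
zerocopy from ee_code), so please correct them if I got it wrong. The helper
zc_info_elem_count() is hypothetical and not part of the patch; it is only
there to show how I understand the range:

```c
#include <stdint.h>

/*
 * Userspace mirror of the proposed uapi struct, with uint32_t/uint8_t
 * standing in for __u32/__u8. Comments reflect an assumed meaning.
 */
struct zc_info_elem {
	uint32_t lo;       /* first notification ID in the range (from ee_info) */
	uint32_t hi;       /* last notification ID in the range, inclusive (from ee_data) */
	uint8_t  zerocopy; /* 0 if SO_EE_CODE_ZEROCOPY_COPIED was set, i.e. the
			    * kernel fell back to copying; 1 otherwise */
};

/*
 * Hypothetical helper: number of send calls covered by one element.
 * Unsigned arithmetic keeps this correct when the 32-bit counter wraps.
 */
static inline uint32_t zc_info_elem_count(const struct zc_info_elem *e)
{
	return e->hi - e->lo + 1;
}
```

If that reading is right, zerocopy is not always 1, and saying so in a comment
would answer my question above.
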
> +};
> +
> +struct zc_info {
> + __u32 size;
> + struct zc_info_elem arr[];
> +};
> +
> #endif /* _UAPI_LINUX_SOCKET_H */
> diff --git a/net/core/sock.c b/net/core/sock.c
> index efb30668dac3..e0b5162233d3 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2863,6 +2863,52 @@ int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
> case SCM_RIGHTS:
> case SCM_CREDENTIALS:
> break;
> + case SCM_ZC_NOTIFICATION: {
> + struct zc_info *zc_info = CMSG_DATA(cmsg);
> + struct zc_info_elem *zc_info_arr;
> + struct sock_exterr_skb *serr;
> + int cmsg_data_len, i = 0;
> + struct sk_buff_head *q;
> + unsigned long flags;
> + struct sk_buff *skb;
> + u32 zc_info_size;
> +
> + if (!sock_flag(sk, SOCK_ZEROCOPY) || sk->sk_family == PF_RDS)
> + return -EINVAL;
> +
> + cmsg_data_len = cmsg->cmsg_len - sizeof(struct cmsghdr);
> + if (cmsg_data_len < sizeof(struct zc_info))
> + return -EINVAL;
> +
> + zc_info_size = zc_info->size;
> + zc_info_arr = zc_info->arr;

Annoying nit: zc_info->size isn't much longer to type than zc_info_size, so I
would not have added the local variables.

> + if (cmsg_data_len != sizeof(struct zc_info) +
> + zc_info_size * sizeof(struct zc_info_elem))
> + return -EINVAL;
> +
> + q = &sk->sk_error_queue;
> + spin_lock_irqsave(&q->lock, flags);
> + skb = skb_peek(q);
> + while (skb && i < zc_info_size) {
> + struct sk_buff *skb_next = skb_peek_next(skb, q);
> +
> + serr = SKB_EXT_ERR(skb);
> + if (serr->ee.ee_errno == 0 &&
> + serr->ee.ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
> + zc_info_arr[i].hi = serr->ee.ee_data;
> + zc_info_arr[i].lo = serr->ee.ee_info;
> + zc_info_arr[i].zerocopy = !(serr->ee.ee_code
> + & SO_EE_CODE_ZEROCOPY_COPIED);
> + __skb_unlink(skb, q);
> + consume_skb(skb);
> + i++;
> + }
> + skb = skb_next;
> + }
> + spin_unlock_irqrestore(&q->lock, flags);

I wonder if you should drop the spinlock somewhere in the middle of this loop;
otherwise you may end up spinning for a very long time with the spinlock held
and IRQs disabled.

IIRC zc_info_size is user input, right? Maybe you should limit zc_info_size to
16 entries or so, so that the user can't pass 100000 as zc_info_size and make
the kernel loop for a long time here.

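A rough sketch of the bound I mean, written as a standalone userspace check
rather than kernel code. ZC_NOTIFICATION_MAX and zc_info_check_len() are names
I made up for illustration, and 16 is just the number floated above, not a
measured value:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cap on entries per cmsg; the value is only a suggestion. */
#define ZC_NOTIFICATION_MAX 16

/* Userspace mirrors of the structs proposed in this patch. */
struct zc_info_elem {
	uint32_t lo;
	uint32_t hi;
	uint8_t  zerocopy;
};

struct zc_info {
	uint32_t size;
	struct zc_info_elem arr[];
};

/*
 * Validate the user-supplied entry count against the cmsg payload length.
 * Rejecting oversized counts before entering the loop bounds the time spent
 * with the error-queue lock held and IRQs disabled.
 */
static int zc_info_check_len(size_t cmsg_data_len, uint32_t size)
{
	if (cmsg_data_len < sizeof(struct zc_info))
		return -EINVAL;
	if (size > ZC_NOTIFICATION_MAX)
		return -EINVAL;
	if (cmsg_data_len != sizeof(struct zc_info) +
			     (size_t)size * sizeof(struct zc_info_elem))
		return -EINVAL;
	return 0;
}
```

With the cap checked first, the multiplication in the length comparison also
stays small, whatever the user passes in.
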
> + zc_info->size = i;
> + break;
> + }
> default:
> return -EINVAL;
> }
> --
> 2.20.1
>