Re: [PATCH bpf-next v5 2/3] bpf, sockmap: Fix FIONREAD for sockmap

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Jiayuan Chen" <jiayuan.chen@linux.dev>
To: "Jakub Sitnicki" <jakub@cloudflare.com>
Cc: bpf@vger.kernel.org, "John Fastabend" <john.fastabend@gmail.com>,
	"David  S. Miller" <davem@davemloft.net>,
	"Eric Dumazet" <edumazet@google.com>,
	"Jakub Kicinski" <kuba@kernel.org>,
	"Paolo Abeni" <pabeni@redhat.com>,
	"Simon Horman" <horms@kernel.org>,
	"Neal Cardwell" <ncardwell@google.com>,
	"Kuniyuki Iwashima" <kuniyu@google.com>,
	"David Ahern" <dsahern@kernel.org>,
	"Alexei Starovoitov" <ast@kernel.org>,
	"Daniel Borkmann" <daniel@iogearbox.net>,
	"Andrii Nakryiko" <andrii@kernel.org>,
	"Martin  KaFai Lau" <martin.lau@linux.dev>,
	"Eduard Zingerman" <eddyz87@gmail.com>,
	"Song Liu" <song@kernel.org>,
	"Yonghong Song" <yonghong.song@linux.dev>,
	"KP  Singh" <kpsingh@kernel.org>,
	"Stanislav Fomichev" <sdf@fomichev.me>,
	"Hao  Luo" <haoluo@google.com>, "Jiri Olsa" <jolsa@kernel.org>,
	"Shuah Khan" <shuah@kernel.org>,
	"Stefano Garzarella" <sgarzare@redhat.com>,
	"Michal  Luczaj" <mhal@rbox.co>,
	"Cong Wang" <cong.wang@bytedance.com>,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org
Subject: Re: [PATCH bpf-next v5 2/3] bpf, sockmap: Fix FIONREAD for sockmap
Date: Thu, 08 Jan 2026 11:46:46 +0000	[thread overview]
Message-ID: <66843c358b71a0dc3e57cc6ac6c0e8b1bb018e38@linux.dev> (raw)
In-Reply-To: <87ikddijrd.fsf@cloudflare.com>

January 7, 2026 at 22:23, "Jakub Sitnicki" <jakub@cloudflare.com mailto:jakub@cloudflare.com?to=%22Jakub%20Sitnicki%22%20%3Cjakub%40cloudflare.com%3E > wrote:


> 
> On Tue, Jan 06, 2026 at 01:14 PM +08, Jiayuan Chen wrote:
> 
> > 
> > A socket using sockmap has its own independent receive queue: ingress_msg.
> >  This queue may contain data from its own protocol stack or from other
> >  sockets.
> > 
> >  Therefore, for sockmap, relying solely on copied_seq and rcv_nxt to
> >  calculate FIONREAD is not enough.
> > 
> >  This patch adds a new ingress_size field in the psock structure to record
> >  the data length in ingress_msg. Additionally, we implement new ioctl
> >  interfaces for TCP and UDP to intercept FIONREAD operations. While Unix
> >  and VSOCK also support sockmap and have similar FIONREAD calculation
> >  issues, fixing them would require more extensive changes
> >  (please let me know if modifications are needed). I believe it's not
> >  appropriate to include those changes under this fix patch.
> > 
> Nit: These last two lines don't really belong in the commit message.
> Side notes for reviewers can be added after the "---" marker.
> 
> > 
> > Previous work by John Fastabend made some efforts towards FIONREAD support:
> >  commit e5c6de5fa025 ("bpf, sockmap: Incorrectly handling copied_seq")
> >  Although the current patch is based on the previous work by John Fastabend,
> >  it is acceptable for our Fixes tag to point to the same commit.
> > 
> >  FD1:read()
> >  -- FD1->copied_seq++
> >  | [read data]
> >  |
> >  [enqueue data] v
> >  [sockmap] -> ingress to self -> ingress_msg queue
> >  FD1 native stack ------> ^
> >  -- FD1->rcv_nxt++ -> redirect to other | [enqueue data]
> >  | |
> >  | ingress to FD1
> >  v ^
> >  ... | [sockmap]
> >  FD2 native stack
> > 
> >  Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
> >  Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> >  ---
> >  include/linux/skmsg.h | 67 +++++++++++++++++++++++++++++++++++++++++--
> >  net/core/skmsg.c | 3 ++
> >  net/ipv4/tcp_bpf.c | 21 ++++++++++++++
> >  net/ipv4/udp_bpf.c | 25 +++++++++++++---
> >  4 files changed, 110 insertions(+), 6 deletions(-)
> > 
> >  diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> >  index 0323a2b6cf5e..1fa03953043f 100644
> >  --- a/include/linux/skmsg.h
> >  +++ b/include/linux/skmsg.h
> >  @@ -97,6 +97,7 @@ struct sk_psock {
> >  struct sk_buff_head ingress_skb;
> >  struct list_head ingress_msg;
> >  spinlock_t ingress_lock;
> >  + ssize_t ingress_size;
> > 
> The name is not great because we also already have `ingress_bytes`.
> I suggest to rename and add a doc string. Also we don't expect the count
> to ever be negative. Why ssize_t when we store all other byte counts
> there as u32?
> 
>  /** @msg_tot_len: Total bytes queued in ingress_msg list. */
>  u32 msg_tot_len;
> 
> > 
> > unsigned long state;
> >  struct list_head link;
> >  spinlock_t link_lock;
> >  @@ -321,6 +322,27 @@ static inline void sock_drop(struct sock *sk, struct sk_buff *skb)
> >  kfree_skb(skb);
> >  }
> >  
> >  +static inline ssize_t sk_psock_get_msg_size_nolock(struct sk_psock *psock)
> >  +{
> >  + /* Used by ioctl to read ingress_size only; lock-free for performance */
> >  + return READ_ONCE(psock->ingress_size);
> >  +}
> >  +
> >  +static inline void sk_psock_inc_msg_size_locked(struct sk_psock *psock, ssize_t diff)
> >  +{
> >  + /* Use WRITE_ONCE to ensure correct read in sk_psock_get_msg_size_nolock().
> >  + * ingress_lock should be held to prevent concurrent updates to ingress_size
> >  + */
> >  + WRITE_ONCE(psock->ingress_size, psock->ingress_size + diff);
> >  +}
> >  +
> >  +static inline void sk_psock_inc_msg_size(struct sk_psock *psock, ssize_t diff)
> > 
> Not sure about this function name. "inc" usually means increment by one.
> Was that modeled after some existing interface?
> 
> If not, I'd switch rename to sk_psock_msg_len_add(..., int delta)
> 
> Following the naming convention from sk_forward_alloc_add(),
> skb_frag_size_add(), skb_len_add(), etc.
> 
> > 
> > +{
> >  + spin_lock_bh(&psock->ingress_lock);
> >  + sk_psock_inc_msg_size_locked(psock, diff);
> >  + spin_unlock_bh(&psock->ingress_lock);
> >  +}
> >  +
> >  static inline bool sk_psock_queue_msg(struct sk_psock *psock,
> >  struct sk_msg *msg)
> >  {
> >  @@ -329,6 +351,7 @@ static inline bool sk_psock_queue_msg(struct sk_psock *psock,
> >  spin_lock_bh(&psock->ingress_lock);
> >  if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
> >  list_add_tail(&msg->list, &psock->ingress_msg);
> >  + sk_psock_inc_msg_size_locked(psock, msg->sg.size);
> >  ret = true;
> >  } else {
> >  sk_msg_free(psock->sk, msg);
> >  @@ -345,18 +368,25 @@ static inline struct sk_msg *sk_psock_dequeue_msg(struct sk_psock *psock)
> >  
> >  spin_lock_bh(&psock->ingress_lock);
> >  msg = list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list);
> >  - if (msg)
> >  + if (msg) {
> >  list_del(&msg->list);
> >  + sk_psock_inc_msg_size_locked(psock, -msg->sg.size);
> >  + }
> >  spin_unlock_bh(&psock->ingress_lock);
> >  return msg;
> >  }
> >  
> >  +static inline struct sk_msg *sk_psock_peek_msg_locked(struct sk_psock *psock)
> >  +{
> >  + return list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list);
> >  +}
> >  +
> >  static inline struct sk_msg *sk_psock_peek_msg(struct sk_psock *psock)
> >  {
> >  struct sk_msg *msg;
> >  
> >  spin_lock_bh(&psock->ingress_lock);
> >  - msg = list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list);
> >  + msg = sk_psock_peek_msg_locked(psock);
> >  spin_unlock_bh(&psock->ingress_lock);
> >  return msg;
> >  }
> >  @@ -523,6 +553,39 @@ static inline bool sk_psock_strp_enabled(struct sk_psock *psock)
> >  return !!psock->saved_data_ready;
> >  }
> >  
> >  +/* for tcp only, sk is locked */
> >  +static inline ssize_t sk_psock_msg_inq(struct sock *sk)
> >  +{
> >  + struct sk_psock *psock;
> >  + ssize_t inq = 0;
> >  +
> >  + psock = sk_psock_get(sk);
> >  + if (likely(psock)) {
> >  + inq = sk_psock_get_msg_size_nolock(psock);
> >  + sk_psock_put(sk, psock);
> >  + }
> >  + return inq;
> >  +}
> >  +
> >  +/* for udp only, sk is not locked */
> >  +static inline ssize_t sk_msg_first_length(struct sock *sk)
> > 
> s/_length/_len/
> 
> > 
> > +{
> >  + struct sk_psock *psock;
> >  + struct sk_msg *msg;
> >  + ssize_t inq = 0;
> >  +
> >  + psock = sk_psock_get(sk);
> >  + if (likely(psock)) {
> >  + spin_lock_bh(&psock->ingress_lock);
> >  + msg = sk_psock_peek_msg_locked(psock);
> >  + if (msg)
> >  + inq = msg->sg.size;
> >  + spin_unlock_bh(&psock->ingress_lock);
> >  + sk_psock_put(sk, psock);
> >  + }
> >  + return inq;
> >  +}
> >  +
> >  #if IS_ENABLED(CONFIG_NET_SOCK_MSG)
> >  
> >  #define BPF_F_STRPARSER (1UL << 1)
> >  diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> >  index d73e03f7713a..c959d52a62b2 100644
> >  --- a/net/core/skmsg.c
> >  +++ b/net/core/skmsg.c
> >  @@ -455,6 +455,7 @@ int __sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg
> >  atomic_sub(copy, &sk->sk_rmem_alloc);
> >  }
> >  msg_rx->sg.size -= copy;
> >  + sk_psock_inc_msg_size(psock, -copy);
> >  
> >  if (!sge->length) {
> >  sk_msg_iter_var_next(i);
> >  @@ -819,9 +820,11 @@ static void __sk_psock_purge_ingress_msg(struct sk_psock *psock)
> >  list_del(&msg->list);
> >  if (!msg->skb)
> >  atomic_sub(msg->sg.size, &psock->sk->sk_rmem_alloc);
> >  + sk_psock_inc_msg_size(psock, -((ssize_t)msg->sg.size));
> > 
> Cast won't be needed after you switch param type to `int`.
> 
> > 
> > sk_msg_free(psock->sk, msg);
> >  kfree(msg);
> >  }
> >  + WARN_ON_ONCE(psock->ingress_size);
> >  }
> >  
> >  static void __sk_psock_zap_ingress(struct sk_psock *psock)
> >  diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> >  index 6332fc36ffe6..a9c758868f13 100644
> >  --- a/net/ipv4/tcp_bpf.c
> >  +++ b/net/ipv4/tcp_bpf.c
> >  @@ -10,6 +10,7 @@
> >  
> >  #include <net/inet_common.h>
> >  #include <net/tls.h>
> >  +#include <asm/ioctls.h>
> >  
> >  void tcp_eat_skb(struct sock *sk, struct sk_buff *skb)
> >  {
> >  @@ -332,6 +333,25 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
> >  return copied;
> >  }
> >  
> >  +static int tcp_bpf_ioctl(struct sock *sk, int cmd, int *karg)
> >  +{
> >  + bool slow;
> >  +
> >  + /* we only care about FIONREAD */
> > 
> Nit: This comment seems redundant. The expression is obvious.
> 
> > 
> > + if (cmd != SIOCINQ)
> >  + return tcp_ioctl(sk, cmd, karg);
> >  +
> >  + /* works similar as tcp_ioctl */
> >  + if (sk->sk_state == TCP_LISTEN)
> >  + return -EINVAL;
> >  +
> >  + slow = lock_sock_fast(sk);
> >  + *karg = sk_psock_msg_inq(sk);
> >  + unlock_sock_fast(sk, slow);
> >  +
> >  + return 0;
> >  +}
> >  +
> >  static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
> >  int flags, int *addr_len)
> >  {
> >  @@ -610,6 +630,7 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
> >  prot[TCP_BPF_BASE].close = sock_map_close;
> >  prot[TCP_BPF_BASE].recvmsg = tcp_bpf_recvmsg;
> >  prot[TCP_BPF_BASE].sock_is_readable = sk_msg_is_readable;
> >  + prot[TCP_BPF_BASE].ioctl = tcp_bpf_ioctl;
> >  
> >  prot[TCP_BPF_TX] = prot[TCP_BPF_BASE];
> >  prot[TCP_BPF_TX].sendmsg = tcp_bpf_sendmsg;
> >  diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
> >  index 0735d820e413..cc1156aef14d 100644
> >  --- a/net/ipv4/udp_bpf.c
> >  +++ b/net/ipv4/udp_bpf.c
> >  @@ -5,6 +5,7 @@
> >  #include <net/sock.h>
> >  #include <net/udp.h>
> >  #include <net/inet_common.h>
> >  +#include <asm/ioctls.h>
> >  
> >  #include "udp_impl.h"
> >  
> >  @@ -111,12 +112,28 @@ enum {
> >  static DEFINE_SPINLOCK(udpv6_prot_lock);
> >  static struct proto udp_bpf_prots[UDP_BPF_NUM_PROTS];
> >  
> >  +static int udp_bpf_ioctl(struct sock *sk, int cmd, int *karg)
> >  +{
> >  + /* we only care about FIONREAD */
> >  + if (cmd != SIOCINQ)
> >  + return udp_ioctl(sk, cmd, karg);
> >  +
> >  + /* works similar as udp_ioctl.
> >  + * man udp(7): "FIONREAD (SIOCINQ): Returns the size of the next
> >  + * pending datagram in the integer in bytes, or 0 when no datagram
> >  + * is pending."
> >  + */
> > 
> Not sure we need to quote man pages here.
> 
> > 
> > + *karg = sk_msg_first_length(sk);
> >  + return 0;
> >  +}
> >  +
> >  static void udp_bpf_rebuild_protos(struct proto *prot, const struct proto *base)
> >  {
> >  - *prot = *base;
> >  - prot->close = sock_map_close;
> >  - prot->recvmsg = udp_bpf_recvmsg;
> >  - prot->sock_is_readable = sk_msg_is_readable;
> >  + *prot = *base;
> >  + prot->close = sock_map_close;
> >  + prot->recvmsg = udp_bpf_recvmsg;
> >  + prot->sock_is_readable = sk_msg_is_readable;
> >  + prot->ioctl = udp_bpf_ioctl;
> >  }
> >  
> >  static void udp_bpf_check_v6_needs_rebuild(struct proto *ops)
> >
>

Thanks Jakub. All good suggestions. I've applied them in v6.

next prev parent reply	other threads:[~2026-01-08 11:46 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-06  5:14 [PATCH bpf-next v5 0/3] bpf: Fix FIONREAD and copied_seq issues Jiayuan Chen
2026-01-06  5:14 ` [PATCH bpf-next v5 1/3] bpf, sockmap: Fix incorrect copied_seq calculation Jiayuan Chen
2026-01-07 13:01   ` Jakub Sitnicki
2026-01-06  5:14 ` [PATCH bpf-next v5 2/3] bpf, sockmap: Fix FIONREAD for sockmap Jiayuan Chen
2026-01-07 14:23   ` Jakub Sitnicki
2026-01-08 11:46     ` Jiayuan Chen [this message]
2026-01-06  5:14 ` [PATCH bpf-next v5 3/3] bpf, selftest: Add tests for FIONREAD and copied_seq Jiayuan Chen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=66843c358b71a0dc3e57cc6ac6c0e8b1bb018e38@linux.dev \
    --to=jiayuan.chen@linux.dev \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=cong.wang@bytedance.com \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=dsahern@kernel.org \
    --cc=eddyz87@gmail.com \
    --cc=edumazet@google.com \
    --cc=haoluo@google.com \
    --cc=horms@kernel.org \
    --cc=jakub@cloudflare.com \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=kuba@kernel.org \
    --cc=kuniyu@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=martin.lau@linux.dev \
    --cc=mhal@rbox.co \
    --cc=ncardwell@google.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=sdf@fomichev.me \
    --cc=sgarzare@redhat.com \
    --cc=shuah@kernel.org \
    --cc=song@kernel.org \
    --cc=yonghong.song@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.