All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jakub Sitnicki <jakub@cloudflare.com>
To: Jiayuan Chen <mrpre@163.com>
Cc: bpf@vger.kernel.org,  john.fastabend@gmail.com,
	 netdev@vger.kernel.org, martin.lau@linux.dev,  ast@kernel.org,
	 edumazet@google.com, davem@davemloft.net,  dsahern@kernel.org,
	 kuba@kernel.org, pabeni@redhat.com,
	 linux-kernel@vger.kernel.org,  song@kernel.org,
	andrii@kernel.org,  mhal@rbox.co,  yonghong.song@linux.dev,
	daniel@iogearbox.net,  xiyou.wangcong@gmail.com,
	 horms@kernel.org, corbet@lwn.net,  eddyz87@gmail.com,
	 cong.wang@bytedance.com, shuah@kernel.org,  mykolal@fb.com,
	 jolsa@kernel.org,  haoluo@google.com, sdf@fomichev.me,
	 kpsingh@kernel.org,  linux-doc@vger.kernel.org,
	linux-kselftest@vger.kernel.org
Subject: Re: [PATCH bpf v8 2/5] bpf: fix wrong copied_seq calculation
Date: Tue, 21 Jan 2025 15:18:38 +0100	[thread overview]
Message-ID: <875xm8i729.fsf@cloudflare.com> (raw)
In-Reply-To: <20250121050707.55523-3-mrpre@163.com> (Jiayuan Chen's message of "Tue, 21 Jan 2025 13:07:04 +0800")

On Tue, Jan 21, 2025 at 01:07 PM +08, Jiayuan Chen wrote:
> 'sk->copied_seq' was updated in the tcp_eat_skb() function when the
> action of a BPF program was SK_REDIRECT. For other actions, like SK_PASS,
> the update logic for 'sk->copied_seq' was moved to
> tcp_bpf_recvmsg_parser() to ensure the accuracy of the 'fionread' feature.
>
> It works for a single stream_verdict scenario, as it also modified
> 'sk_data_ready->sk_psock_verdict_data_ready->tcp_read_skb'
> to remove updating 'sk->copied_seq'.
>
> However, for programs where both stream_parser and stream_verdict are
> active(strparser purpose), tcp_read_sock() was used instead of

Nit: missing space, "active (strparser purpose)"
            ^

> tcp_read_skb() (sk_data_ready->strp_data_ready->tcp_read_sock)

Nit: missing period, "… (sk_data_ready->strp_data_ready->tcp_read_sock)."
                                                                       ^

> tcp_read_sock() now still update 'sk->copied_seq', leading to duplicated

Nit: grammar "still updates"
                          ^
Please run commit descriptions through a language tool or an LLM, if
possible. This makes reviewing easier.

> updates.
>
> In summary, for strparser + SK_PASS, copied_seq is redundantly calculated
> in both tcp_read_sock() and tcp_bpf_recvmsg_parser().
>
> The issue causes incorrect copied_seq calculations, which prevent
> correct data reads from the recv() interface in user-land.
>
> We do not want to add new proto_ops to implement a new version of
> tcp_read_sock, as this would introduce code complexity [1].
>
> We add new callback for strparser for customized read operation.
>
> [1]: https://lore.kernel.org/bpf/20241218053408.437295-1-mrpre@163.com
> Fixes: e5c6de5fa025 ("bpf, sockmap: Incorrectly handling copied_seq")
> Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
> Signed-off-by: Jiayuan Chen <mrpre@163.com>
> ---
>  include/linux/skmsg.h |  2 ++
>  include/net/tcp.h     |  8 ++++++++
>  net/core/skmsg.c      |  7 +++++++
>  net/ipv4/tcp.c        | 29 ++++++++++++++++++++++++-----
>  net/ipv4/tcp_bpf.c    | 42 ++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 83 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index 2cbe0c22a32f..0b9095a281b8 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -91,6 +91,8 @@ struct sk_psock {
>  	struct sk_psock_progs		progs;
>  #if IS_ENABLED(CONFIG_BPF_STREAM_PARSER)
>  	struct strparser		strp;
> +	u32				copied_seq;
> +	u32				ingress_bytes;
>  #endif
>  	struct sk_buff_head		ingress_skb;
>  	struct list_head		ingress_msg;
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index e9b37b76e894..06affc653247 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -729,6 +729,9 @@ void tcp_get_info(struct sock *, struct tcp_info *);
>  /* Read 'sendfile()'-style from a TCP socket */
>  int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
>  		  sk_read_actor_t recv_actor);
> +int tcp_read_sock_noack(struct sock *sk, read_descriptor_t *desc,
> +			sk_read_actor_t recv_actor, bool noack,
> +			u32 *copied_seq);
>  int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor);
>  struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off);
>  void tcp_read_done(struct sock *sk, size_t len);
> @@ -2599,6 +2602,11 @@ struct sk_psock;
>  #ifdef CONFIG_BPF_SYSCALL
>  int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore);
>  void tcp_bpf_clone(const struct sock *sk, struct sock *newsk);
> +#ifdef CONFIG_BPF_STREAM_PARSER
> +struct strparser;
> +int tcp_bpf_strp_read_sock(struct strparser *strp, read_descriptor_t *desc,
> +			   sk_read_actor_t recv_actor);
> +#endif /* CONFIG_BPF_STREAM_PARSER */
>  #endif /* CONFIG_BPF_SYSCALL */
>  
>  #ifdef CONFIG_INET
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 61f3f3d4e528..0ddc4c718833 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -549,6 +549,9 @@ static int sk_psock_skb_ingress_enqueue(struct sk_buff *skb,
>  			return num_sge;
>  	}
>  
> +#if IS_ENABLED(CONFIG_BPF_STREAM_PARSER)
> +	psock->ingress_bytes += len;
> +#endif
>  	copied = len;
>  	msg->sg.start = 0;
>  	msg->sg.size = copied;
> @@ -1144,6 +1147,10 @@ int sk_psock_init_strp(struct sock *sk, struct sk_psock *psock)
>  	if (!ret)
>  		sk_psock_set_state(psock, SK_PSOCK_RX_STRP_ENABLED);
>  
> +	if (sk_is_tcp(sk)) {
> +		psock->strp.cb.read_sock = tcp_bpf_strp_read_sock;
> +		psock->copied_seq = tcp_sk(sk)->copied_seq;
> +	}
>  	return ret;
>  }
>  
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 0d704bda6c41..285678d8ce07 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1565,12 +1565,13 @@ EXPORT_SYMBOL(tcp_recv_skb);
>   *	  or for 'peeking' the socket using this routine
>   *	  (although both would be easy to implement).
>   */
> -int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
> -		  sk_read_actor_t recv_actor)
> +static int __tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
> +			   sk_read_actor_t recv_actor, bool noack,
> +			   u32 *copied_seq)
>  {
>  	struct sk_buff *skb;
>  	struct tcp_sock *tp = tcp_sk(sk);
> -	u32 seq = tp->copied_seq;
> +	u32 seq = *copied_seq;
>  	u32 offset;
>  	int copied = 0;
>  
> @@ -1624,9 +1625,12 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
>  		tcp_eat_recv_skb(sk, skb);
>  		if (!desc->count)
>  			break;
> -		WRITE_ONCE(tp->copied_seq, seq);
> +		WRITE_ONCE(*copied_seq, seq);
>  	}
> -	WRITE_ONCE(tp->copied_seq, seq);
> +	WRITE_ONCE(*copied_seq, seq);
> +
> +	if (noack)
> +		goto out;
>  
>  	tcp_rcv_space_adjust(sk);
>  
> @@ -1635,10 +1639,25 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
>  		tcp_recv_skb(sk, seq, &offset);
>  		tcp_cleanup_rbuf(sk, copied);
>  	}
> +out:
>  	return copied;
>  }
> +
> +int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
> +		  sk_read_actor_t recv_actor)
> +{
> +	return __tcp_read_sock(sk, desc, recv_actor, false,
> +			       &tcp_sk(sk)->copied_seq);
> +}
>  EXPORT_SYMBOL(tcp_read_sock);
>  
> +int tcp_read_sock_noack(struct sock *sk, read_descriptor_t *desc,
> +			sk_read_actor_t recv_actor, bool noack,
> +			u32 *copied_seq)
> +{
> +	return __tcp_read_sock(sk, desc, recv_actor, noack, copied_seq);
> +}
> +
>  int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
>  {
>  	struct sk_buff *skb;
> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> index 47f65b1b70ca..4dcf88ad8275 100644
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c
> @@ -646,6 +646,48 @@ static int tcp_bpf_assert_proto_ops(struct proto *ops)
>  	       ops->sendmsg  == tcp_sendmsg ? 0 : -ENOTSUPP;
>  }
>  
> +#if IS_ENABLED(CONFIG_BPF_STREAM_PARSER)
> +int tcp_bpf_strp_read_sock(struct strparser *strp, read_descriptor_t *desc,
> +			   sk_read_actor_t recv_actor)
> +{
> +	struct sock *sk = strp->sk;
> +	struct sk_psock *psock;
> +	struct tcp_sock *tp;
> +	int copied = 0;
> +
> +	tp = tcp_sk(sk);
> +	rcu_read_lock();
> +	psock = sk_psock(sk);
> +	if (WARN_ON(!psock)) {

WARN_ON_ONCE, please. We don't want to flood dmesg.

> +		desc->error = -EINVAL;
> +		goto out;
> +	}
> +
> +	psock->ingress_bytes = 0;
> +	/* We could easily add copied_seq and noack into desc then call
> +	 * ops->read_sock without calling symbol directly. But unfortunately
> +	 * most descriptors used by other modules are not inited with zero.
> +	 * Also it not work by replacing ops->read_sock without introducing
> +	 * new ops as ops itself is located in rodata segment.
> +	 */

Above commment explains an implementation decision and belongs in the
patch description, not in the code. It does not help with understanding
the code itself.

> +	copied = tcp_read_sock_noack(sk, desc, recv_actor, true,
> +				     &psock->copied_seq);
> +	if (copied < 0)
> +		goto out;
> +	/* recv_actor may redirect skb to another socket(SK_REDIRECT) or

Nit: missing space, "socket (SK_REDIRECT)"

> +	 * just put skb into ingress queue of current socket(SK_PASS).

Nit: missing space, "socket (SK_PASS)"

> +	 * For SK_REDIRECT, we need 'ack' the frame immediately but for

Nit: grammar, "we need to"
Nit: style, "to ack" is an accepted form of "to acknowlege", no need for
quotes around it.

> +	 * SK_PASS, the 'ack' was delay to tcp_bpf_recvmsg_parser()

Nit: grammar, "we want to delay the ack until tcp_bpf_recvmsg_parser()"

> +	 */
> +	tp->copied_seq = psock->copied_seq - psock->ingress_bytes;
> +	tcp_rcv_space_adjust(sk);
> +	__tcp_cleanup_rbuf(sk, copied - psock->ingress_bytes);
> +out:
> +	rcu_read_unlock();
> +	return copied;
> +}
> +#endif /* CONFIG_BPF_STREAM_PARSER */
> +
>  int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
>  {
>  	int family = sk->sk_family == AF_INET6 ? TCP_BPF_IPV6 : TCP_BPF_IPV4;

  reply	other threads:[~2025-01-21 14:18 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-21  5:07 [PATCH bpf v8 0/5] bpf: fix wrong copied_seq calculation and add tests Jiayuan Chen
2025-01-21  5:07 ` [PATCH bpf v8 1/5] strparser: add read_sock callback Jiayuan Chen
2025-01-21 14:19   ` Jakub Sitnicki
2025-01-21  5:07 ` [PATCH bpf v8 2/5] bpf: fix wrong copied_seq calculation Jiayuan Chen
2025-01-21 14:18   ` Jakub Sitnicki [this message]
2025-01-22  8:55     ` Jiayuan Chen
2025-01-21  5:07 ` [PATCH bpf v8 3/5] bpf: disable non stream socket for strparser Jiayuan Chen
2025-01-21 14:23   ` Jakub Sitnicki
2025-01-21  5:07 ` [PATCH bpf v8 4/5] selftests/bpf: fix invalid flag of recv() Jiayuan Chen
2025-01-21 14:26   ` Jakub Sitnicki
2025-01-21  5:07 ` [PATCH bpf v8 5/5] selftests/bpf: add strparser test for bpf Jiayuan Chen
2025-01-21 15:06   ` Jakub Sitnicki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=875xm8i729.fsf@cloudflare.com \
    --to=jakub@cloudflare.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=cong.wang@bytedance.com \
    --cc=corbet@lwn.net \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=dsahern@kernel.org \
    --cc=eddyz87@gmail.com \
    --cc=edumazet@google.com \
    --cc=haoluo@google.com \
    --cc=horms@kernel.org \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=kuba@kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=martin.lau@linux.dev \
    --cc=mhal@rbox.co \
    --cc=mrpre@163.com \
    --cc=mykolal@fb.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=sdf@fomichev.me \
    --cc=shuah@kernel.org \
    --cc=song@kernel.org \
    --cc=xiyou.wangcong@gmail.com \
    --cc=yonghong.song@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.