From: Jakub Sitnicki <jakub@cloudflare.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: daniel@iogearbox.net, ast@kernel.org, andrii@kernel.org,
bpf@vger.kernel.org, netdev@vger.kernel.org
Subject: Re: [PATCH bpf 2/3] bpf: sockmap, do not inc copied_seq when PEEK flag set
Date: Fri, 22 Sep 2023 12:23:53 +0200
Message-ID: <87a5tesd8n.fsf@cloudflare.com>
In-Reply-To: <20230920232706.498747-3-john.fastabend@gmail.com>
On Wed, Sep 20, 2023 at 04:27 PM -07, John Fastabend wrote:
> When data is peek'd off the receive queue we shouldn't consider it
> copied from tcp_sock side. When we increment copied_seq this will confuse
> tcp_data_ready() because copied_seq can be arbitrarily increased. From
> application side it results in poll() operations not waking up when
> expected.
>
> Notice tcp stack without BPF recvmsg programs also does not increment
> copied_seq.
>
> We broke this when we moved copied_seq into recvmsg to only update when
> actual copy was happening. But, it wasn't working correctly either before
> because the tcp_data_ready() tried to use the copied_seq value to see
> if data was read by user yet. See fixes tags.
>
> Fixes: e5c6de5fa0258 ("bpf, sockmap: Incorrectly handling copied_seq")
> Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
> net/ipv4/tcp_bpf.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> index 81f0dff69e0b..327268203001 100644
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c
> @@ -222,6 +222,7 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
>  				  int *addr_len)
>  {
>  	struct tcp_sock *tcp = tcp_sk(sk);
> +	int peek = flags & MSG_PEEK;
>  	u32 seq = tcp->copied_seq;
>  	struct sk_psock *psock;
>  	int copied = 0;
> @@ -311,7 +312,8 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
>  		copied = -EAGAIN;
>  	}
>  out:
> -	WRITE_ONCE(tcp->copied_seq, seq);
> +	if (!peek)
> +		WRITE_ONCE(tcp->copied_seq, seq);
>  	tcp_rcv_space_adjust(sk);
>  	if (copied > 0)
>  		__tcp_cleanup_rbuf(sk, copied);
I was surprised to see that we recalculate TCP buffer space and ACK
frames when peeking at the receive queue. But tcp_recvmsg() seems to do
the same.

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
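
For readers following the thread, the application-visible behaviour that
the patch aims to preserve can be sketched in userspace: data read with
MSG_PEEK stays queued, so poll() should keep reporting the socket as
readable until a regular recv() consumes it. This is only a rough
illustration and not the selftest from patch 3/3. It assumes an AF_UNIX
stream socketpair as a stand-in for a TCP socket in a sockmap, and the
helper names is_readable() and peek_keeps_data_readable() are hypothetical.

```c
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if a non-blocking poll() reports fd as readable, else 0. */
static int is_readable(int fd)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	return poll(&pfd, 1, 0) == 1 && (pfd.revents & POLLIN);
}

/*
 * Hypothetical helper: peek at queued data, then check that the socket is
 * still readable and that a regular recv() returns the same bytes.
 * Returns 1 when every expectation holds, 0 otherwise, -1 if the socket
 * pair cannot be created.
 */
static int peek_keeps_data_readable(void)
{
	char peeked[8] = { 0 }, got[8] = { 0 };
	int sv[2], ok = 0;

	/* Assumption: an AF_UNIX stream pair stands in for a TCP socket. */
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	if (send(sv[0], "hello", 5, 0) != 5)
		goto out;

	/* Peeking must not consume the data ... */
	if (recv(sv[1], peeked, 5, MSG_PEEK) != 5)
		goto out;
	/* ... so poll() should still report the socket as readable ... */
	if (!is_readable(sv[1]))
		goto out;
	/* ... and a regular recv() should return the same bytes. */
	if (recv(sv[1], got, 5, 0) != 5 || memcmp(peeked, got, 5) != 0)
		goto out;
	/* Only after the regular recv() is the queue expected to be empty. */
	ok = !is_readable(sv[1]);
out:
	close(sv[0]);
	close(sv[1]);
	return ok;
}
```

Calling peek_keeps_data_readable() from a small main() and checking for a
return value of 1 is enough to exercise it. Reproducing the sockmap case
discussed here would additionally need a TCP connection and an attached
BPF verdict program, which this sketch leaves out.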
Thread overview (9+ messages):
2023-09-20 23:27 [PATCH bpf 0/3] bpf, sockmap complete fixes for avail bytes John Fastabend
2023-09-20 23:27 ` [PATCH bpf 1/3] bpf: tcp_read_skb needs to pop skb regardless of seq John Fastabend
2023-09-21 21:08 ` Simon Horman
2023-09-21 21:23 ` John Fastabend
2023-09-23 14:37 ` kernel test robot
2023-09-20 23:27 ` [PATCH bpf 2/3] bpf: sockmap, do not inc copied_seq when PEEK flag set John Fastabend
2023-09-22 10:23 ` Jakub Sitnicki [this message]
2023-09-20 23:27 ` [PATCH bpf 3/3] bpf: sockmap, add tests for MSG_F_PEEK John Fastabend
2023-09-22 11:06 ` Jakub Sitnicki