From: "Pengcheng Yang" <yangpc@wangsu.com>
To: "'John Fastabend'" <john.fastabend@gmail.com>,
"'Jakub Sitnicki'" <jakub@cloudflare.com>,
"'Eric Dumazet'" <edumazet@google.com>,
"'Jakub Kicinski'" <kuba@kernel.org>, <bpf@vger.kernel.org>,
<netdev@vger.kernel.org>
Subject: Re: [PATCH bpf-next 2/3] tcp: Add the data length in skmsg to SIOCINQ ioctl
Date: Fri, 17 Nov 2023 18:59:50 +0800 [thread overview]
Message-ID: <009601da1945$2ff0d0c0$8fd27240$@wangsu.com> (raw)
In-Reply-To: <6556c2c238099_537dc208ab@john.notmuch>
John Fastabend <john.fastabend@gmail.com> wrote:
> Pengcheng Yang wrote:
> > John Fastabend <john.fastabend@gmail.com> wrote:
> > > Pengcheng Yang wrote:
> > > > SIOCINQ ioctl returns the number unread bytes of the receive
> > > > queue but does not include the ingress_msg queue. With the
> > > > sk_msg redirect, an application may get a value 0 if it calls
> > > > SIOCINQ ioctl before recv() to determine the readable size.
> > > >
> > > > Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
> > >
> > > This will break the SK_PASS case I believe. Here we do
> > > not update copied_seq until data is actually copied into user
> > > space. This also ensures tcp_epollin_ready works correctly and
> > > tcp_inq. The fix is relatively recent.
> > >
> > > commit e5c6de5fa025882babf89cecbed80acf49b987fa
> > > Author: John Fastabend <john.fastabend@gmail.com>
> > > Date: Mon May 22 19:56:12 2023 -0700
> > >
> > > bpf, sockmap: Incorrectly handling copied_seq
> > >
> > > The previous patch increments the msg_len for all cases even
> > > the SK_PASS case so you will get double counting.
> >
> > You are right, I missed the SK_PASS case of skb stream verdict.
> >
> > >
> > > I was starting to poke around at how to fix the other cases e.g.
> > > stream parser is in use and redirects but haven't got to it yet.
> > > By the way I think even with this patch epollin_ready is likely
> > > not correct still. We observe this as either failing to wake up
> > > or waking up an application to early when using stream parser.
> > >
> > > The other thing to consider is redirected skb into another socket
> > > and then read off the list increment the copied_seq even though
> > > they shouldn't if they came from another sock? The result would
> > > be tcp_inq would be incorrect even negative perhaps?
> > >
> > > What does your test setup look like? Simple redirect between
> > > two TCP sockets? With or without stream parser? My guess is we
> > > need to fix underlying copied_seq issues related to the redirect
> > > and stream parser case. I believe the fix is, only increment
> > > copied_seq for data that was put on the ingress_queue from SK_PASS.
> > > Then update previous patch to only incrmeent sk_msg_queue_len()
> > > for redirect paths. And this patch plus fix to tcp_epollin_ready
> > > would resolve most the issues. Its a bit unfortunate to leak the
> > > sk_sg_queue_len() into tcp_ioctl and tcp_epollin but I don't have
> > > a cleaner idea right now.
> > >
> >
> > What I tested was to use msg_verdict to redirect between two sockets
> > without stream parser, and the problem I encountered is that msg has
> > been queued in psock->ingress_msg, and the application has been woken up
> > by epoll (because of sk_psock_data_ready), but the ioctl(FIONREAD) returns 0.
>
> Yep makes sense.
>
> >
> > The key is that the rcv_nxt is not updated on ingress redirect, or we only need
> > to update rcv_nxt on ingress redirect, such as in bpf_tcp_ingress() and
> > sk_psock_skb_ingress_enqueue() ?
> >
>
> I think its likely best not to touch rcv_nxt. 'rcv_nxt' is used in
> the tcp stack to calculate lots of things. If you just bump it and
> then ever received an actual TCP pkt you would get some really
> odd behavior because seq numbers and rcv_nxt would be unrelated then.
>
> The approach you have is really the best bet IMO, but mask out
> the increment msg_len where its not needed. Then it should be OK.
>
I think we can add a flag to msg to identify whether msg comes from the same
sock's receive_queue. In this way, we can increase and decrease the msg_len
based on this flag when msg is queued to ingress_msg and when it is read by
the application.
And, this can also fix the case you mentioned above:
"The other thing to consider is redirected skb into another socket
and then read off the list increment the copied_seq even though
they shouldn't if they came from another sock? The result would
be tcp_inq would be incorrect even negative perhaps?"
During recv in tcp_bpf_recvmsg_parser(), we only need to increment copied_seq
when the msg comes from the same sock's receive_queue, otherwise copied_seq
may overflow rcv_nxt in this case.
> Mixing ingress redirect and TCP sending/recv pkts doesn't usually work
> very well anyway but I still think leaving rcv_nxt alone is best.
next prev parent reply other threads:[~2023-11-17 10:59 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-11-14 11:41 [PATCH bpf-next 0/3] skmsg: Add the data length in skmsg to SIOCINQ ioctl and rx_queue Pengcheng Yang
2023-11-14 11:41 ` [PATCH bpf-next 1/3] skmsg: Calculate the data length in ingress_msg Pengcheng Yang
2023-11-14 13:15 ` Eric Dumazet
2023-11-14 11:41 ` [PATCH bpf-next 2/3] tcp: Add the data length in skmsg to SIOCINQ ioctl Pengcheng Yang
2023-11-15 7:20 ` John Fastabend
2023-11-15 11:45 ` Pengcheng Yang
2023-11-17 1:32 ` John Fastabend
2023-11-17 10:59 ` Pengcheng Yang [this message]
2023-11-14 11:42 ` [PATCH bpf-next 3/3] tcp_diag: Add the data length in skmsg to rx_queue Pengcheng Yang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='009601da1945$2ff0d0c0$8fd27240$@wangsu.com' \
--to=yangpc@wangsu.com \
--cc=bpf@vger.kernel.org \
--cc=edumazet@google.com \
--cc=jakub@cloudflare.com \
--cc=john.fastabend@gmail.com \
--cc=kuba@kernel.org \
--cc=netdev@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).