From: David Gibson <david@gibson.dropbear.id.au>
To: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>,
kuba@kernel.org, passt-dev@passt.top, sbrivio@redhat.com,
lvivier@redhat.com, dgibson@redhat.com, jmaloy@redhat.com,
netdev@vger.kernel.org, davem@davemloft.net
Subject: Re: [PATCH v3] tcp: add support for SO_PEEK_OFF
Date: Wed, 14 Feb 2024 10:34:50 +1100 [thread overview]
Message-ID: <Zcv8mjlWE7F9Of93@zatzit> (raw)
In-Reply-To: <CANn89iL2FvTVYv6ym58=4L-K-kSan6R4PEv488ztyX4HsNquug@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 3366 bytes --]
On Tue, Feb 13, 2024 at 04:49:01PM +0100, Eric Dumazet wrote:
> On Tue, Feb 13, 2024 at 4:28 PM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On Tue, 2024-02-13 at 14:34 +0100, Eric Dumazet wrote:
> > > On Tue, Feb 13, 2024 at 2:02 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > > >
> > > > On Tue, 2024-02-13 at 13:24 +0100, Eric Dumazet wrote:
> > > > > On Tue, Feb 13, 2024 at 11:49 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > > >
> > > > > > > @@ -2508,7 +2508,10 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> > > > > > > WRITE_ONCE(*seq, *seq + used);
> > > > > > > copied += used;
> > > > > > > len -= used;
> > > > > > > -
> > > > > > > + if (flags & MSG_PEEK)
> > > > > > > + sk_peek_offset_fwd(sk, used);
> > > > > > > + else
> > > > > > > + sk_peek_offset_bwd(sk, used);
> > > > >
> > > > > Yet another cache miss in TCP fast path...
> > > > >
> > > > > We need to move sk_peek_off in a better location before we accept this patch.
> > > > >
> > > > > I always thought MSK_PEEK was very inefficient, I am surprised we
> > > > > allow arbitrary loops in recvmsg().
> > > >
> > > > Let me double check I read the above correctly: are you concerned by
> > > > the 'skb_queue_walk(&sk->sk_receive_queue, skb) {' loop that could
> > > > touch a lot of skbs/cachelines before reaching the relevant skb?
> > > >
> > > > The end goal here is allowing an user-space application to read
> > > > incrementally/sequentially the received data while leaving them in
> > > > receive buffer.
> > > >
> > > > I don't see a better option than MSG_PEEK, am I missing something?
> > >
> > >
> > > This sk_peek_offset protocol, needing sk_peek_offset_bwd() in the non
> > > MSG_PEEK case is very strange IMO.
> > >
> > > Ideally, we should read/write over sk_peek_offset only when MSG_PEEK
> > > is used by the caller.
> > >
> > > That would only touch non fast paths.
> > >
> > > Since the API is mono-threaded anyway, the caller should not rely on
> > > the fact that normal recvmsg() call
> > > would 'consume' sk_peek_offset.
> >
> > Storing in sk_peek_seq the tcp next sequence number to be peeked should
> > avoid changes in the non MSG_PEEK cases.
> >
> > AFAICS that would need a new get_peek_off() sock_op and a bit somewhere
> > (in sk_flags?) to discriminate when sk_peek_seq is actually set. Would
> > that be acceptable?
>
> We could have a parallel SO_PEEK_OFFSET option, reusing the same socket field.
>
> The new semantic would be : Supported by TCP (so far), and tcp
> recvmsg() only reads/writes this field when MSG_PEEK is used.
> Applications would have to clear the values themselves.
Those semantics would likely defeat the purpose of using SO_PEEK_OFF
for our use case, since we'd need an additional setsockopt() for every
non-PEEK recv() (which are all MSG_TRUNC in our case).
> BTW I see the man pages say SO_PEEK_OFF is "is currently supported
> only for unix(7) sockets"
Yes, this patch is explicitly aiming to change that.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
next prev parent reply other threads:[~2024-02-13 23:34 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-09 22:12 [PATCH v3] tcp: add support for SO_PEEK_OFF jmaloy
2024-02-11 23:17 ` Stefano Brivio
2024-02-13 10:49 ` Paolo Abeni
2024-02-13 12:24 ` Eric Dumazet
2024-02-13 13:02 ` Paolo Abeni
2024-02-13 13:34 ` Eric Dumazet
2024-02-13 15:28 ` Paolo Abeni
2024-02-13 15:49 ` Eric Dumazet
2024-02-13 18:39 ` Paolo Abeni
2024-02-13 19:31 ` Eric Dumazet
[not found] ` <20687849-ec5c-9ce5-0a18-cc80f5b64816@redhat.com>
2024-02-15 17:41 ` Paolo Abeni
2024-02-15 17:46 ` Eric Dumazet
2024-02-15 22:24 ` Jon Maloy
2024-02-16 9:14 ` Paolo Abeni
2024-02-16 9:21 ` Eric Dumazet
[not found] ` <6a9f5dec-eb0c-51ef-0911-7345f50e08f0@redhat.com>
2024-02-16 10:55 ` Eric Dumazet
2024-02-19 2:02 ` David Gibson
2024-02-13 23:34 ` David Gibson [this message]
2024-02-14 3:41 ` Eric Dumazet
2024-02-15 3:16 ` David Gibson
2024-02-15 3:21 ` David Gibson
2024-02-15 9:15 ` Eric Dumazet
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Zcv8mjlWE7F9Of93@zatzit \
--to=david@gibson.dropbear.id.au \
--cc=davem@davemloft.net \
--cc=dgibson@redhat.com \
--cc=edumazet@google.com \
--cc=jmaloy@redhat.com \
--cc=kuba@kernel.org \
--cc=lvivier@redhat.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=passt-dev@passt.top \
--cc=sbrivio@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.