From: David Gibson <david@gibson.dropbear.id.au>
To: Jon Maloy <jmaloy@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>,
Paolo Abeni <pabeni@redhat.com>,
kuba@kernel.org, passt-dev@passt.top, sbrivio@redhat.com,
lvivier@redhat.com, dgibson@redhat.com, netdev@vger.kernel.org,
davem@davemloft.net
Subject: Re: [PATCH v3] tcp: add support for SO_PEEK_OFF
Date: Mon, 19 Feb 2024 13:02:55 +1100 [thread overview]
Message-ID: <ZdK2z4U1naf_T6IM@zatzit> (raw)
In-Reply-To: <6a9f5dec-eb0c-51ef-0911-7345f50e08f0@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 6382 bytes --]
On Fri, Feb 16, 2024 at 05:13:34AM -0500, Jon Maloy wrote:
>
>
> On 2024-02-16 04:21, Eric Dumazet wrote:
> > On Fri, Feb 16, 2024 at 10:14 AM Paolo Abeni<pabeni@redhat.com> wrote:
> > > On Thu, 2024-02-15 at 17:24 -0500, Jon Maloy wrote:
> > > > On 2024-02-15 12:46, Eric Dumazet wrote:
> > > > > On Thu, Feb 15, 2024 at 6:41 PM Paolo Abeni<pabeni@redhat.com> wrote:
> > > > > > Note: please send text-only email to netdev.
> > > > > >
> > > > > > On Thu, 2024-02-15 at 10:11 -0500, Jon Maloy wrote:
> > > > > > > I wonder if the following could be acceptable:
> > > > > > >
> > > > > > > if (flags & MSG_PEEK)
> > > > > > > sk_peek_offset_fwd(sk, used);
> > > > > > > else if (peek_offset > 0)
> > > > > > > sk_peek_offset_bwd(sk, used);
> > > > > > >
> > > > > > > peek_offset is already present in the data cache, and if it has the value
> > > > > > > zero it means either that that sk->sk_peek_off is unused (-1) or actually is zero.
> > > > > > > Either way, no rewind is needed in that case.
> > > > > > I agree the above should avoid touching cold cachelines in the
> > > > > > fastpath, and looks functionally correct to me.
> > > > > >
> > > > > > The last word is up to Eric :)
> > > > > >
> > > > > An actual patch seems needed.
> > > > >
> > > > > In the current form, local variable peek_offset is 0 when !MSG_PEEK.
> > > > >
> > > > > So the "else if (peek_offset > 0)" would always be false.
> > > > >
> > > > Yes, of course. This wouldn't work unless we read sk->sk_peek_off at the
> > > > beginning of the function.
> > > > I will look at the other suggestions.
> > > I *think* that moving sk_peek_off this way:
> > >
> > > ---
> > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > index a9d99a9c583f..576a6a6abb03 100644
> > > --- a/include/net/sock.h
> > > +++ b/include/net/sock.h
> > > @@ -413,7 +413,7 @@ struct sock {
> > > unsigned int sk_napi_id;
> > > #endif
> > > int sk_rcvbuf;
> > > - int sk_disconnects;
> > > + int sk_peek_off;
> > >
> > > struct sk_filter __rcu *sk_filter;
> > > union {
> > > @@ -439,7 +439,7 @@ struct sock {
> > > struct rb_root tcp_rtx_queue;
> > > };
> > > struct sk_buff_head sk_write_queue;
> > > - __s32 sk_peek_off;
> > > + int sk_disconnects;
> > > int sk_write_pending;
> > > __u32 sk_dst_pending_confirm;
> > > u32 sk_pacing_status; /* see enum sk_pacing */
> > > ---
> > >
> > > should avoid problematic accesses,
> > >
> > > The relevant cachelines layout is as follow:
> > >
> > > /* --- cacheline 4 boundary (256 bytes) --- */
> > > struct sk_buff * tail; /* 256 8 */
> > > } sk_backlog; /* 240 24 */
> > > int sk_forward_alloc; /* 264 4 */
> > > u32 sk_reserved_mem; /* 268 4 */
> > > unsigned int sk_ll_usec; /* 272 4 */
> > > unsigned int sk_napi_id; /* 276 4 */
> > > int sk_rcvbuf; /* 280 4 */
> > > int sk_disconnects; /* 284 4 */
> > > // will become sk_peek_off
> > > struct sk_filter * sk_filter; /* 288 8 */
> > > union {
> > > struct socket_wq * sk_wq; /* 296 8 */
> > > struct socket_wq * sk_wq_raw; /* 296 8 */
> > > }; /* 296 8 */
> > > struct xfrm_policy * sk_policy[2]; /* 304 16 */
> > > /* --- cacheline 5 boundary (320 bytes) --- */
> > >
> > > // ...
> > >
> > > /* --- cacheline 6 boundary (384 bytes) --- */
> > > __s32 sk_peek_off; /* 384 4 */
> > > // will become sk_diconnects
> > > int sk_write_pending; /* 388 4 */
> > > __u32 sk_dst_pending_confirm; /* 392 4 */
> > > u32 sk_pacing_status; /* 396 4 */
> > > long int sk_sndtimeo; /* 400 8 */
> > > struct timer_list sk_timer; /* 408 40 */
> > >
> > > /* XXX last struct has 4 bytes of padding */
> > >
> > > /* --- cacheline 7 boundary (448 bytes) --- */
> > >
> > > sk_peek_off will be in the same cachline of sk_forward_alloc /
> > > sk_reserved_mem / backlog tail, that are already touched by the
> > > tcp_recvmsg_locked() main loop.
> > >
> > > WDYT?
> > I was about to send a similar change, also moving sk_rcvtimeo, and
> > adding __cacheline_group_begin()/__cacheline_group_end
> > annotations.
> >
> > I can finish this today.
> >
> There is also the following alternative:
>
> if (flags & MSG_PEEK)
> sk_peek_offset_fwd(sk, used);
> else if (flags & MSG_TRUNC)
> sk_peek_offset_bwd(sk, used);
>
> This is the way we use it, and probably the typical usage.
> It would force a user to drain the receive queue with MSG_TRUNC whenever he
> is using
> MSG_PEEK_OFF, but I don't really see that as a limitation.
I really don't like this, although it would certainly do what we need
for passt/pasta. SO_PEEK_OFF has established semantics for Unix
sockets, which includes regular recv() adjusting the offset. Having
it behave subtlety differently for TCP seems like a very bad idea.
> Anyway, if Paolo's suggestion solves the problem this shouldn't be
> necessary.
>
> ///jon
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
next prev parent reply other threads:[~2024-02-19 2:03 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-09 22:12 [PATCH v3] tcp: add support for SO_PEEK_OFF jmaloy
2024-02-11 23:17 ` Stefano Brivio
2024-02-13 10:49 ` Paolo Abeni
2024-02-13 12:24 ` Eric Dumazet
2024-02-13 13:02 ` Paolo Abeni
2024-02-13 13:34 ` Eric Dumazet
2024-02-13 15:28 ` Paolo Abeni
2024-02-13 15:49 ` Eric Dumazet
2024-02-13 18:39 ` Paolo Abeni
2024-02-13 19:31 ` Eric Dumazet
[not found] ` <20687849-ec5c-9ce5-0a18-cc80f5b64816@redhat.com>
2024-02-15 17:41 ` Paolo Abeni
2024-02-15 17:46 ` Eric Dumazet
2024-02-15 22:24 ` Jon Maloy
2024-02-16 9:14 ` Paolo Abeni
2024-02-16 9:21 ` Eric Dumazet
[not found] ` <6a9f5dec-eb0c-51ef-0911-7345f50e08f0@redhat.com>
2024-02-16 10:55 ` Eric Dumazet
2024-02-19 2:02 ` David Gibson [this message]
2024-02-13 23:34 ` David Gibson
2024-02-14 3:41 ` Eric Dumazet
2024-02-15 3:16 ` David Gibson
2024-02-15 3:21 ` David Gibson
2024-02-15 9:15 ` Eric Dumazet
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZdK2z4U1naf_T6IM@zatzit \
--to=david@gibson.dropbear.id.au \
--cc=davem@davemloft.net \
--cc=dgibson@redhat.com \
--cc=edumazet@google.com \
--cc=jmaloy@redhat.com \
--cc=kuba@kernel.org \
--cc=lvivier@redhat.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=passt-dev@passt.top \
--cc=sbrivio@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.