From: Jakub Sitnicki <jakub@cloudflare.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: Martin Lau <kafai@fb.com>,
"bpf\@vger.kernel.org" <bpf@vger.kernel.org>,
"netdev\@vger.kernel.org" <netdev@vger.kernel.org>,
"kernel-team\@cloudflare.com" <kernel-team@cloudflare.com>
Subject: Re: [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy
Date: Wed, 27 Nov 2019 23:18:44 +0100 [thread overview]
Message-ID: <87a78hnet7.fsf@cloudflare.com> (raw)
In-Reply-To: <5ddd7266c36aa_671a2b0b882605c04a@john-XPS-13-9370.notmuch>
On Tue, Nov 26, 2019 at 07:43 PM CET, John Fastabend wrote:
> Martin Lau wrote:
>> On Tue, Nov 26, 2019 at 04:54:33PM +0100, Jakub Sitnicki wrote:
>> > On Mon, Nov 25, 2019 at 11:38 PM CET, Martin Lau wrote:
>> > > On Sat, Nov 23, 2019 at 12:07:47PM +0100, Jakub Sitnicki wrote:
>> > > [ ... ]
>> > >
>> > >> @@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk,
>> > >> sk->sk_prot = psock->sk_proto;
>> > >> psock->sk_proto = NULL;
>> > >> }
>> > >> +
>> > >> + if (psock->icsk_af_ops) {
>> > >> + icsk->icsk_af_ops = psock->icsk_af_ops;
>> > >> + psock->icsk_af_ops = NULL;
>> > >> + }
>> > >> }
>> > >
>> > > [ ... ]
>> > >
>> > >> +static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk,
>> > >> + struct sk_buff *skb,
>> > >> + struct request_sock *req,
>> > >> + struct dst_entry *dst,
>> > >> + struct request_sock *req_unhash,
>> > >> + bool *own_req)
>> > >> +{
>> > >> + const struct inet_connection_sock_af_ops *ops;
>> > >> + void (*write_space)(struct sock *sk);
>> > >> + struct sk_psock *psock;
>> > >> + struct proto *proto;
>> > >> + struct sock *child;
>> > >> +
>> > >> + rcu_read_lock();
>> > >> + psock = sk_psock(sk);
>> > >> + if (likely(psock)) {
>> > >> + proto = psock->sk_proto;
>> > >> + write_space = psock->saved_write_space;
>> > >> + ops = psock->icsk_af_ops;
>> > > It is not immediately clear to me what ensure
>> > > ops is not NULL here.
>> > >
>> > > It is likely I missed something. A short comment would
>> > > be very useful here.
>> >
>> > I can see the readability problem. Looking at it now, perhaps it should
>> > be rewritten, to the same effect, as:
>> >
>> > static struct sock *tcp_bpf_syn_recv_sock(...)
>> > {
>> > const struct inet_connection_sock_af_ops *ops = NULL;
>> > ...
>> >
>> > rcu_read_lock();
>> > psock = sk_psock(sk);
>> > if (likely(psock)) {
>> > proto = psock->sk_proto;
>> > write_space = psock->saved_write_space;
>> > ops = psock->icsk_af_ops;
>> > }
>> > rcu_read_unlock();
>> >
>> > if (!ops)
>> > ops = inet_csk(sk)->icsk_af_ops;
>> > child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
>> >
>> > If psock->icsk_af_ops were NULL, it would mean we haven't initialized it
>> > properly. To double check what happens here:
>> I did not mean the init path. The init path is fine since it init
>> eveything on psock before publishing the sk to the sock_map.
>>
>> I was thinking the delete path (e.g. sock_map_delete_elem). It is not clear
>> to me what prevent the earlier pasted sk_psock_restore_proto() which sets
>> psock->icsk_af_ops to NULL from running in parallel with
>> tcp_bpf_syn_recv_sock()? An explanation would be useful.
>>
>
> I'll answer. Updates are protected via sk_callback_lock so we don't have
> parrallel updates in-flight causing write_space and sk_proto to be out
> of sync. However access should be OK because its a pointer write we
> never update the pointer in place, e.g.
>
> static inline void sk_psock_restore_proto(struct sock *sk,
> struct sk_psock *psock)
> {
> + struct inet_connection_sock *icsk = inet_csk(sk);
> +
> sk->sk_write_space = psock->saved_write_space;
>
> if (psock->sk_proto) {
> struct inet_connection_sock *icsk = inet_csk(sk);
> bool has_ulp = !!icsk->icsk_ulp_data;
>
> if (has_ulp)
> tcp_update_ulp(sk, psock->sk_proto);
> else
> sk->sk_prot = psock->sk_proto;
> psock->sk_proto = NULL;
> }
>
> +
> + if (psock->icsk_af_ops) {
> + icsk->icsk_af_ops = psock->icsk_af_ops;
> + psock->icsk_af_ops = NULL;
> + }
> }
>
> In restore case either psock->icsk_af_ops is null or not. If its
> null below code catches it. If its not null (from init path) then
> we have a valid pointer.
>
> rcu_read_lock();
> psock = sk_psock(sk);
> if (likely(psock)) {
> proto = psock->sk_proto;
> write_space = psock->saved_write_space;
> ops = psock->icsk_af_ops;
> }
> rcu_read_unlock();
>
> if (!ops)
> ops = inet_csk(sk)->icsk_af_ops;
> child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
>
>
> We should do this with proper READ_ONCE/WRITE_ONCE to make it clear
> what is going on and to stop compiler from breaking these assumptions. I
> was going to generate that patch after this series but can do it before
> as well. I didn't mention it here because it seems a bit out of scope
> for this series because its mostly a fix to older code.
+1, looking forward to your patch. Also, as I've recently learned, that
should enable KTSAN to reason about the psock code [0].
> Also I started to think that write_space might be out of sync with ops but
> it seems we never actually remove psock_write_space until after
> rcu grace period so that should be OK as well and always point to the
> previous write_space.
>
> Finally I wondered if we could remove the ops and then add it back
> quickly which seems at least in theory possible, but that would get
> hit with a grace period because we can't have conflicting psock
> definitions on the same sock. So expanding the rcu block to include
> the ops = inet_csk(sk)->icsk_af_ops would fix that case.
I see, ops = inet_csk(sk)->icsk_af_ops might read out a re-overwritten
ops after sock_map_unlink, followed by sock_map_link. Ouch.
> So in summary I think we should expand the rcu lock here to include the
> ops = inet_csk(sk)->icsk_af_ops to ensure we dont race with tear
> down and create. I'll push the necessary update with WRITE_ONCE and
> READ_ONCE to fix that up. Seeing we have to wait until the merge
> window opens most likely anyways I'll send those out sooner rather
> then later and this series can add the proper annotations as well.
Or I could leave psock->icsk_af_ops set in restore_proto, like we do for
write_space as you noted. Restoring it twice doesn't seem harmful, it
has no side-effects. Less state changes to think about?
I'll still have to apply what you suggest for saving psock->sk_proto,
though.
Thanks,
Jakub
[0] https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE
next prev parent reply other threads:[~2019-11-27 22:18 UTC|newest]
Thread overview: 39+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-11-23 11:07 [PATCH bpf-next 0/8] Extend SOCKMAP to store listening sockets Jakub Sitnicki
2019-11-23 11:07 ` [PATCH bpf-next 1/8] bpf, sockmap: Return socket cookie on lookup from syscall Jakub Sitnicki
2019-11-24 5:32 ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 2/8] bpf, sockmap: Let all kernel-land lookup values in SOCKMAP Jakub Sitnicki
2019-11-24 5:35 ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 3/8] bpf, sockmap: Allow inserting listening TCP sockets into SOCKMAP Jakub Sitnicki
2019-11-24 5:38 ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy Jakub Sitnicki
2019-11-24 5:56 ` John Fastabend
2019-11-25 22:38 ` Martin Lau
2019-11-26 15:54 ` Jakub Sitnicki
2019-11-26 17:16 ` Martin Lau
2019-11-26 18:36 ` Jakub Sitnicki
[not found] ` <87sglsfdda.fsf@cloudflare.com>
2019-12-11 17:20 ` Martin Lau
2019-12-12 11:27 ` Jakub Sitnicki
2019-12-12 19:23 ` Martin Lau
2019-12-17 15:06 ` Jakub Sitnicki
2019-11-26 18:43 ` John Fastabend
2019-11-27 22:18 ` Jakub Sitnicki [this message]
2019-11-23 11:07 ` [PATCH bpf-next 5/8] bpf: Allow selecting reuseport socket from a SOCKMAP Jakub Sitnicki
2019-11-24 5:57 ` John Fastabend
2019-11-25 1:24 ` Alexei Starovoitov
2019-11-25 4:17 ` John Fastabend
2019-11-25 10:40 ` Jakub Sitnicki
2019-11-25 22:07 ` Martin Lau
2019-11-26 14:30 ` Jakub Sitnicki
2019-11-26 19:03 ` Martin Lau
2019-11-27 21:34 ` Jakub Sitnicki
2019-11-23 11:07 ` [PATCH bpf-next 6/8] libbpf: Recognize SK_REUSEPORT programs from section name Jakub Sitnicki
2019-11-24 5:57 ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 7/8] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP Jakub Sitnicki
2019-11-24 6:00 ` John Fastabend
2019-11-25 22:30 ` Martin Lau
2019-11-26 14:32 ` Jakub Sitnicki
2019-12-12 10:30 ` Jakub Sitnicki
2019-11-23 11:07 ` [PATCH bpf-next 8/8] selftests/bpf: Tests for SOCKMAP holding listening sockets Jakub Sitnicki
2019-11-24 6:04 ` John Fastabend
2019-11-24 6:10 ` [PATCH bpf-next 0/8] Extend SOCKMAP to store " John Fastabend
2019-11-25 9:22 ` Jakub Sitnicki
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87a78hnet7.fsf@cloudflare.com \
--to=jakub@cloudflare.com \
--cc=bpf@vger.kernel.org \
--cc=john.fastabend@gmail.com \
--cc=kafai@fb.com \
--cc=kernel-team@cloudflare.com \
--cc=netdev@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.