All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jakub Sitnicki <jakub@cloudflare.com>
To: Martin Lau <kafai@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>,
	"bpf\@vger.kernel.org" <bpf@vger.kernel.org>,
	"netdev\@vger.kernel.org" <netdev@vger.kernel.org>,
	"kernel-team\@cloudflare.com" <kernel-team@cloudflare.com>
Subject: Re: [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy
Date: Thu, 12 Dec 2019 12:27:19 +0100	[thread overview]
Message-ID: <87pngtg4x4.fsf@cloudflare.com> (raw)
In-Reply-To: <20191211172051.clnwh5n5vdeovayy@kafai-mbp>

On Wed, Dec 11, 2019 at 06:20 PM CET, Martin Lau wrote:
> On Tue, Dec 10, 2019 at 03:45:37PM +0100, Jakub Sitnicki wrote:
>> John, Martin,
>>
>> On Tue, Nov 26, 2019 at 07:36 PM CET, Jakub Sitnicki wrote:
>> > On Tue, Nov 26, 2019 at 06:16 PM CET, Martin Lau wrote:
>> >> On Tue, Nov 26, 2019 at 04:54:33PM +0100, Jakub Sitnicki wrote:
>> >>> On Mon, Nov 25, 2019 at 11:38 PM CET, Martin Lau wrote:
>> >>> > On Sat, Nov 23, 2019 at 12:07:47PM +0100, Jakub Sitnicki wrote:
>> >>> > [ ... ]
>> >>> >
>> >>> >> @@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk,
>> >>> >>  			sk->sk_prot = psock->sk_proto;
>> >>> >>  		psock->sk_proto = NULL;
>> >>> >>  	}
>> >>> >> +
>> >>> >> +	if (psock->icsk_af_ops) {
>> >>> >> +		icsk->icsk_af_ops = psock->icsk_af_ops;
>> >>> >> +		psock->icsk_af_ops = NULL;
>> >>> >> +	}
>> >>> >>  }
>> >>> >
>> >>> > [ ... ]
>> >>> >
>> >>> >> +static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk,
>> >>> >> +					  struct sk_buff *skb,
>> >>> >> +					  struct request_sock *req,
>> >>> >> +					  struct dst_entry *dst,
>> >>> >> +					  struct request_sock *req_unhash,
>> >>> >> +					  bool *own_req)
>> >>> >> +{
>> >>> >> +	const struct inet_connection_sock_af_ops *ops;
>> >>> >> +	void (*write_space)(struct sock *sk);
>> >>> >> +	struct sk_psock *psock;
>> >>> >> +	struct proto *proto;
>> >>> >> +	struct sock *child;
>> >>> >> +
>> >>> >> +	rcu_read_lock();
>> >>> >> +	psock = sk_psock(sk);
>> >>> >> +	if (likely(psock)) {
>> >>> >> +		proto = psock->sk_proto;
>> >>> >> +		write_space = psock->saved_write_space;
>> >>> >> +		ops = psock->icsk_af_ops;
>> >>> > It is not immediately clear to me what ensure
>> >>> > ops is not NULL here.
>> >>> >
>> >>> > It is likely I missed something.  A short comment would
>> >>> > be very useful here.
>> >>>
>> >>> I can see the readability problem. Looking at it now, perhaps it should
>> >>> be rewritten, to the same effect, as:
>> >>>
>> >>> static struct sock *tcp_bpf_syn_recv_sock(...)
>> >>> {
>> >>> 	const struct inet_connection_sock_af_ops *ops = NULL;
>> >>>         ...
>> >>>
>> >>>         rcu_read_lock();
>> >>> 	psock = sk_psock(sk);
>> >>> 	if (likely(psock)) {
>> >>> 		proto = psock->sk_proto;
>> >>> 		write_space = psock->saved_write_space;
>> >>> 		ops = psock->icsk_af_ops;
>> >>> 	}
>> >>> 	rcu_read_unlock();
>> >>>
>> >>>         if (!ops)
>> >>> 		ops = inet_csk(sk)->icsk_af_ops;
>> >>>         child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
>> >>>
>> >>> If psock->icsk_af_ops were NULL, it would mean we haven't initialized it
>> >>> properly. To double check what happens here:
>> >> I did not mean the init path.  The init path is fine since it init
>> >> eveything on psock before publishing the sk to the sock_map.
>> >>
>> >> I was thinking the delete path (e.g. sock_map_delete_elem).  It is not clear
>> >> to me what prevent the earlier pasted sk_psock_restore_proto() which sets
>> >> psock->icsk_af_ops to NULL from running in parallel with
>> >> tcp_bpf_syn_recv_sock()?  An explanation would be useful.
>> >
>> > Ah, I misunderstood. Nothing prevents the race, AFAIK.
>> >
>> > Setting psock->icsk_af_ops to null on restore and not checking for it
>> > here was a bad move on my side.  Also I need to revisit what to do about
>> > psock->sk_proto so the child socket doesn't end up with null sk_proto.
>> >
>> > This race should be easy enough to trigger. Will give it a shot.
>>
>> I've convinced myself that this approach is racy beyond repair.
>>
>> Once syn_recv_sock() has returned it is too late to reset the child
>> sk_user_data and restore its callbacks. It has been already inserted
>> into ehash and ingress path can invoke its callbacks.
>>
>> The race can be triggered with with a reproducer where:
>>
>> thread-1:
>>
>>         p = accept(s, ...);
>>         close(p);
>>
>> thread-2:
>>
>> 	bpf_map_update_elem(mapfd, &key, &s, BPF_NOEXIST);
>> 	bpf_map_delete_elem(mapfd, &key);
>>
>> This a dead-end because we can't have the parent and the child share the
>> psock state. Even though psock itself is refcounted, and potentially we
>> could grab a reference before cloning the parent, link into the map that
>> psock holds is not.
>>
>> Two ways out come to mind. Both involve touching TCP code, which I was
>> hoping to avoid:
>>
>> 1) reset sk_user_data when initializing the child
>>
>>    This is problematic because tcp_bpf callbacks are not designed to
>>    handle sockets with no psock _and_ with overridden sk_prot
>>    callbacks. (Although, I think they could if the fallback was directly
>>    on {tcp,tcpv6}_prot based on socket domain.)
>>
>>    Also, there are other sk_user_data users like DRBD which rely on
>>    sharing the sk_user_data pointer between parent and child, if I read
>>    the code correctly [0]. If anything, clearing the sk_user_data on
>>    clone would have to be guarded by a flag.
> Can the copy/not-to-copy sk_user_data decision be made in
> sk_clone_lock()?

Yes, this could be pushed down to sk_clone_lock(), where we do similar
work (reset sk_reuseport_cb and clone bpf_sk_storage):

	/* User data can hold reference. Child must not
	 * inherit the pointer without acquiring a reference.
	 */
	if (sock_flag(sk, SOCK_OWNS_USER_DATA)) {
		sock_reset_flag(newsk, SOCK_OWNS_USER_DATA);
		RCU_INIT_POINTER(newsk->sk_user_data, NULL);
	}

I belive this would still need to be guarded by a flag.  Do you see
value in clearing child sk_user_data on clone as opposed to dealying
that work until accept() time?

-Jakub

>
>>
>> 2) Restore sk_prot callbacks on clone to {tcp,tcpv6}_prot
>>
>>    The simpler way out. tcp_bpf callbacks never get invoked on the child
>>    socket so the copied psock reference is no longer a problem. We can
>>    clear the pointer on accept().
>>
>>    So far I wasn't able poke any holes in it and it comes down to
>>    patching tcp_create_openreq_child() with:
>>
>> 	/* sk_msg and ULP frameworks can override the callbacks into
>> 	 * protocol. We don't assume they are intended to be inherited
>> 	 * by the child. Frameworks can re-install the callbacks on
>> 	 * accept() if needed.
>> 	 */
>> 	WRITE_ONCE(newsk->sk_prot, sk->sk_prot_creator);
>>
>>    That's what I'm going with for v2.
>>
>> Open to suggestions.
>>
>> Thanks,
>> Jakub
>>
>> BTW. Reading into kTLS code, I noticed it has been limited down to just
>> established sockets due to the same problem I'm struggling with here:
>>
>> static int tls_init(struct sock *sk)
>> {
>> ...
>> 	/* The TLS ulp is currently supported only for TCP sockets
>> 	 * in ESTABLISHED state.
>> 	 * Supporting sockets in LISTEN state will require us
>> 	 * to modify the accept implementation to clone rather then
>> 	 * share the ulp context.
>> 	 */
>> 	if (sk->sk_state != TCP_ESTABLISHED)
>> 		return -ENOTCONN;
>>
>> [0] https://urldefense.proofpoint.com/v2/url?u=https-3A__elixir.bootlin.com_linux_v5.5-2Drc1_source_drivers_block_drbd_drbd-5Freceiver.c-23L682&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=VQnoQ7LvghIj0gVEaiQSUw&m=z2Cz1gEcqiw-8YqVOluxlUHh_CBs6PJWQN2vgirOyFk&s=WAiM0asZN0OkqrW02xm2mCMIzWhKQCc3KiY7pzMKNg4&e=

  reply	other threads:[~2019-12-12 11:27 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-23 11:07 [PATCH bpf-next 0/8] Extend SOCKMAP to store listening sockets Jakub Sitnicki
2019-11-23 11:07 ` [PATCH bpf-next 1/8] bpf, sockmap: Return socket cookie on lookup from syscall Jakub Sitnicki
2019-11-24  5:32   ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 2/8] bpf, sockmap: Let all kernel-land lookup values in SOCKMAP Jakub Sitnicki
2019-11-24  5:35   ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 3/8] bpf, sockmap: Allow inserting listening TCP sockets into SOCKMAP Jakub Sitnicki
2019-11-24  5:38   ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit psock or its ops on copy Jakub Sitnicki
2019-11-24  5:56   ` John Fastabend
2019-11-25 22:38   ` Martin Lau
2019-11-26 15:54     ` Jakub Sitnicki
2019-11-26 17:16       ` Martin Lau
2019-11-26 18:36         ` Jakub Sitnicki
     [not found]           ` <87sglsfdda.fsf@cloudflare.com>
2019-12-11 17:20             ` Martin Lau
2019-12-12 11:27               ` Jakub Sitnicki [this message]
2019-12-12 19:23                 ` Martin Lau
2019-12-17 15:06                   ` Jakub Sitnicki
2019-11-26 18:43         ` John Fastabend
2019-11-27 22:18           ` Jakub Sitnicki
2019-11-23 11:07 ` [PATCH bpf-next 5/8] bpf: Allow selecting reuseport socket from a SOCKMAP Jakub Sitnicki
2019-11-24  5:57   ` John Fastabend
2019-11-25  1:24   ` Alexei Starovoitov
2019-11-25  4:17     ` John Fastabend
2019-11-25 10:40       ` Jakub Sitnicki
2019-11-25 22:07         ` Martin Lau
2019-11-26 14:30           ` Jakub Sitnicki
2019-11-26 19:03             ` Martin Lau
2019-11-27 21:34               ` Jakub Sitnicki
2019-11-23 11:07 ` [PATCH bpf-next 6/8] libbpf: Recognize SK_REUSEPORT programs from section name Jakub Sitnicki
2019-11-24  5:57   ` John Fastabend
2019-11-23 11:07 ` [PATCH bpf-next 7/8] selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP Jakub Sitnicki
2019-11-24  6:00   ` John Fastabend
2019-11-25 22:30   ` Martin Lau
2019-11-26 14:32     ` Jakub Sitnicki
2019-12-12 10:30     ` Jakub Sitnicki
2019-11-23 11:07 ` [PATCH bpf-next 8/8] selftests/bpf: Tests for SOCKMAP holding listening sockets Jakub Sitnicki
2019-11-24  6:04   ` John Fastabend
2019-11-24  6:10 ` [PATCH bpf-next 0/8] Extend SOCKMAP to store " John Fastabend
2019-11-25  9:22   ` Jakub Sitnicki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87pngtg4x4.fsf@cloudflare.com \
    --to=jakub@cloudflare.com \
    --cc=bpf@vger.kernel.org \
    --cc=john.fastabend@gmail.com \
    --cc=kafai@fb.com \
    --cc=kernel-team@cloudflare.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.