netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: John Fastabend <john.fastabend@gmail.com>
To: Martin KaFai Lau <kafai@fb.com>
Cc: borkmann@iogearbox.net, ast@kernel.org, netdev@vger.kernel.org
Subject: Re: [bpf-next PATCH 5/7] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data
Date: Thu, 18 Jan 2018 09:27:09 -0800	[thread overview]
Message-ID: <8fbdeff9-23d6-7453-782c-f2ca6880af8a@gmail.com> (raw)
In-Reply-To: <20180117220434.uee6seg665loc535@kafai-mbp.dhcp.thefacebook.com>

On 01/17/2018 02:04 PM, Martin KaFai Lau wrote:
> On Fri, Jan 12, 2018 at 10:11:11AM -0800, John Fastabend wrote:
>> This implements a BPF ULP layer to allow policy enforcement and
>> monitoring at the socket layer. In order to support this a new
>> program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
>> the sendmsg/sendpage hook. To attach the policy to sockets a
>> sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.


[...]

> Some minor comments/nits below:

Thanks for reviewing!

>> +
>> +static unsigned int smap_do_tx_msg(struct sock *sk,
>> +				   struct smap_psock *psock,
>> +				   struct sk_msg_buff *md)
>> +{
>> +	struct bpf_prog *prog;
>> +	unsigned int rc, _rc;
>> +
>> +	preempt_disable();
> Why preempt_disable() is needed?

If we run a BPF program with preempt enabled my read of BPF
code path is we may race with multiple programs running. For
example program A gets preempted during a map update to an
array map, program B updates same map entry, program A then
updates some piece of the entry. The result is garbage in
the array element. Just one example.

With PREEMPT-RCU the above seems possible.

> 
>> +	rcu_read_lock();
>> +
>> +	/* If the policy was removed mid-send then default to 'accept' */
>> +	prog = READ_ONCE(psock->bpf_tx_msg);
>> +	if (unlikely(!prog)) {
>> +		_rc = SK_PASS;
>> +		goto verdict;
>> +	}
>> +
>> +	bpf_compute_data_pointers_sg(md);
>> +	_rc = (*prog->bpf_func)(md, prog->insnsi);
>> +
>> +verdict:
>> +	rcu_read_unlock();
>> +	preempt_enable();
>> +
>> +	/* Moving return codes from UAPI namespace into internal namespace */
>> +	rc = ((_rc == SK_PASS) ?
>> +	      (md->map ? __SK_REDIRECT : __SK_PASS) :
>> +	      __SK_DROP);
>> +
>> +	return rc;
>> +}
>> +

[...]

>> +out:
>> +	for (i = sg_curr; i < sg_num; ++i) {
>> +		free += sg[i].length;
> free is not init.
> 

Nice catch thanks.

>> +		put_page(sg_page(&sg[i]));
>> +	}
>> +	return free;
>> +}
>> +

>> +
>> +		/* sg_size indicates bytes already allocated and sg_num
>> +		 * is last sg element used. This is used when alloc_sg
> s/alloc_sg/sk_alloc_sg/
> 
>> +		 * partially allocates a scatterlist and then is sent
>> +		 * to wait for memory. In normal case (no memory pressure)
>> +		 * both sg_nun and sg_size are zero.
> s/sg_nun/sg_num/
> 

Will fix both spelling issues.

>> +		 */
>> +		copy = copy - sg_size;
>> +		err = sk_alloc_sg(sk, copy, sg, &sg_num, &sg_size, 0);
>> +		if (err) {
>> +			if (err != -ENOSPC)
>> +				goto wait_for_memory;
>> +			copy = sg_size;
>> +		}
>> +

[...]

>> +do_redir:
>> +	/* To avoid deadlock with multiple socks all doing redirects to
>> +	 * each other we must first drop the current sock lock and release
>> +	 * the psock. Then get the redirect socket (assuming it still
>> +	 * exists), take it's lock, and finally do the send here. If the
>> +	 * redirect fails there is nothing to do, we don't want to blame
>> +	 * the sender for remote socket failures. Instead we simply
>> +	 * continue making forward progress.
>> +	 */
>> +	return_mem_sg(sk, sg, sg_num);
>> +	release_sock(sk);
>> +	smap_release_sock(psock, sk);
>> +	copied -= bpf_tcp_sendmsg_do_redirect(sg, sg_num, &md, flags);
>> +	return copied;
> For __SK_REDIRECT case, before returning, should 'msg_data_left(msg)' be checked
> first?  Or msg_data_left(msg) will always be 0 here?
> 

Interesting, yes this would probably be best. Otherwise what happens is we
return with only some of the bytes copied. The application should (must?)
handle this case correctly, but agree might be nice to send as much as
possible here. I'll see if I can work this into a v2 or perhaps I'll do it
as a follow on performance improvement.

Nice observation though for sure.

>> +}
>> +

[...]

>> +	switch (rc) {
>> +	case __SK_PASS:
>> +		lock_sock(sk);
>> +		rc = tcp_sendpage_locked(sk, page, offset, size, flags);
>> +		release_sock(sk);
>> +		break;
>> +	case __SK_REDIRECT:
>> +		smap_release_sock(psock, sk);
> smap_release_sock() is only needed in __SK_REDIRECT case?

Now that I have a test case for this I also caught this with
testing. ;)

> 
>> +		rc = bpf_tcp_sendpage_do_redirect(page, offset, size, flags,
>> +						  &md);
>> +		break;
>> +	case __SK_DROP:
>> +	default:
>> +		rc = -EACCES;
>> +	}
>> +
>> +	return rc;
>> +}

Thanks a lot! Should have a v2 tomorrow after some additional
testing.

  reply	other threads:[~2018-01-18 17:27 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-12 18:09 [bpf-next PATCH 0/7] Add BPF_PROG_TYPE_SK_MSG and attach pt John Fastabend
2018-01-12 18:10 ` [bpf-next PATCH 1/7] net: add a UID to use for ULP socket assignment John Fastabend
2018-01-12 18:10 ` [bpf-next PATCH 2/7] sock: make static tls function alloc_sg generic sock helper John Fastabend
2018-01-12 18:10 ` [bpf-next PATCH 3/7] sockmap: convert refcnt to an atomic refcnt John Fastabend
2018-01-12 18:10 ` [bpf-next PATCH 4/7] net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG John Fastabend
2018-01-12 20:10   ` Eric Dumazet
2018-01-12 20:26     ` John Fastabend
2018-01-12 20:46       ` Eric Dumazet
2018-01-12 21:11         ` John Fastabend
2018-01-12 18:11 ` [bpf-next PATCH 5/7] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data John Fastabend
2018-01-17  2:25   ` Alexei Starovoitov
2018-01-17  5:49     ` John Fastabend
2018-01-17  6:20       ` Alexei Starovoitov
2018-01-17 20:32         ` John Fastabend
2018-01-17 22:04   ` Martin KaFai Lau
2018-01-18 17:27     ` John Fastabend [this message]
2018-01-12 18:11 ` [bpf-next PATCH 6/7] bpf: add map tests for BPF_PROG_TYPE_SK_MSG John Fastabend
2018-01-12 18:11 ` [bpf-next PATCH 7/7] bpf: add verifier " John Fastabend

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8fbdeff9-23d6-7453-782c-f2ca6880af8a@gmail.com \
    --to=john.fastabend@gmail.com \
    --cc=ast@kernel.org \
    --cc=borkmann@iogearbox.net \
    --cc=kafai@fb.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).