Re: [PATCH bpf-next 3/7] bpf: Add socket assign support

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jakub Sitnicki <jakub@cloudflare.com>
To: Joe Stringer <joe@wand.net.nz>
Cc: Martin KaFai Lau <kafai@fb.com>,
	bpf@vger.kernel.org, netdev <netdev@vger.kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Alexei Starovoitov <ast@kernel.org>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	Lorenz Bauer <lmb@cloudflare.com>
Subject: Re: [PATCH bpf-next 3/7] bpf: Add socket assign support
Date: Wed, 18 Mar 2020 11:03:33 +0100	[thread overview]
Message-ID: <87h7ymx9my.fsf@cloudflare.com> (raw)
In-Reply-To: <CAOftzPjXexvng-+77b-4Yw0pEBHXchsNVwrx+h9vV+5XBQzy-g@mail.gmail.com>

On Wed, Mar 18, 2020 at 01:46 AM CET, Joe Stringer wrote:
> On Mon, Mar 16, 2020 at 11:27 PM Martin KaFai Lau <kafai@fb.com> wrote:
>>
>> On Mon, Mar 16, 2020 at 08:06:38PM -0700, Joe Stringer wrote:
>> > On Mon, Mar 16, 2020 at 3:58 PM Martin KaFai Lau <kafai@fb.com> wrote:
>> > >
>> > > On Thu, Mar 12, 2020 at 04:36:44PM -0700, Joe Stringer wrote:
>> > > > Add support for TPROXY via a new bpf helper, bpf_sk_assign().
>> > > >
>> > > > This helper requires the BPF program to discover the socket via a call
>> > > > to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
>> > > > helper takes its own reference to the socket in addition to any existing
>> > > > reference that may or may not currently be obtained for the duration of
>> > > > BPF processing. For the destination socket to receive the traffic, the
>> > > > traffic must be routed towards that socket via local route, the socket
>> > > I also missed where is the local route check in the patch.
>> > > Is it implied by a sk can be found in bpf_sk*_lookup_*()?
>> >
>> > This is a requirement for traffic redirection, it's not enforced by
>> > the patch. If the operator does not configure routing for the relevant
>> > traffic to ensure that the traffic is delivered locally, then after
>> > the eBPF program terminates, it will pass up through ip_rcv() and
>> > friends and be subject to the whims of the routing table. (or
>> > alternatively if the BPF program redirects somewhere else then this
>> > reference will be dropped).
>> >
>> > Maybe there's a path to simplifying this configuration path in future
>> > to loosen this requirement, but for now I've kept the series as
>> > minimal as possible on that front.
>> >
>> > > [ ... ]
>> > >
>> > > > diff --git a/net/core/filter.c b/net/core/filter.c
>> > > > index cd0a532db4e7..bae0874289d8 100644
>> > > > --- a/net/core/filter.c
>> > > > +++ b/net/core/filter.c
>> > > > @@ -5846,6 +5846,32 @@ static const struct bpf_func_proto bpf_tcp_gen_syncookie_proto = {
>> > > >       .arg5_type      = ARG_CONST_SIZE,
>> > > >  };
>> > > >
>> > > > +BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags)
>> > > > +{
>> > > > +     if (flags != 0)
>> > > > +             return -EINVAL;
>> > > > +     if (!skb_at_tc_ingress(skb))
>> > > > +             return -EOPNOTSUPP;
>> > > > +     if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
>> > > > +             return -ENOENT;
>> > > > +
>> > > > +     skb_orphan(skb);
>> > > > +     skb->sk = sk;
>> > > sk is from the bpf_sk*_lookup_*() which does not consider
>> > > the bpf_prog installed in SO_ATTACH_REUSEPORT_EBPF.
>> > > However, the use-case is currently limited to sk inspection.
>> > >
>> > > It now supports selecting a particular sk to receive traffic.
>> > > Any plan in supporting that?
>> >
>> > I think this is a general bpf_sk*_lookup_*() question, previous
>> > discussion[0] settled on avoiding that complexity before a use case
>> > arises, for both TC and XDP versions of these helpers; I still don't
>> > have a specific use case in mind for such functionality. If we were to
>> > do it, I would presume that the socket lookup caller would need to
>> > pass a dedicated flag (supported at TC and likely not at XDP) to
>> > communicate that SO_ATTACH_REUSEPORT_EBPF progs should be respected
>> > and used to select the reuseport socket.
>> It is more about the expectation on the existing SO_ATTACH_REUSEPORT_EBPF
>> usecase.  It has been fine because SO_ATTACH_REUSEPORT_EBPF's bpf prog
>> will still be run later (e.g. from tcp_v4_rcv) to decide which sk to
>> recieve the skb.
>>
>> If the bpf@tc assigns a TCP_LISTEN sk in bpf_sk_assign(),
>> will the SO_ATTACH_REUSEPORT_EBPF's bpf still be run later
>> to make the final sk decision?
>
> I don't believe so, no:
>
> ip_local_deliver()
> -> ...
> -> ip_protocol_deliver_rcu()
> -> tcp_v4_rcv()
> -> __inet_lookup_skb()
> -> skb_steal_sock(skb)
>
> But this will only affect you if you are running both the bpf@tc
> program with sk_assign() and the reuseport BPF sock programs at the
> same time. This is why I link it back to the bpf_sk*_lookup_*()
> functions: If the socket lookup in the initial step respects reuseport
> BPF prog logic and returns the socket using the same logic, then the
> packet will be directed to the socket you expect. Just like how
> non-BPF reuseport would work with this series today.

I'm a bit lost in argumentation. The cover letter says that the goal is
to support TPROXY use cases from BPF TC. TPROXY, however, supports
reuseport load-balancing, which is essential to scaling out your
receiver [0].

I assume that in Cilium use case, single socket / single core is
sufficient to handle traffic steered with this new mechanism.

Also, socket lookup from XDP / BPF TC _without_ reuseport sounds
okay-ish because you're likely after information that a socket (group)
is attached to some local address / port.

However, when you go one step further and assign the socket to skb
without running reuseport logic, that is breaking socket load-balancing
for applications.

That is to say that I'm with Lorenz on this one. Sockets that belong to
reuseport group should not be a valid target for assignment until socket
lookup from BPF honors reuseport.

[0] https://www.slideshare.net/lfevents/boost-udp-transaction-performance

[...]

next prev parent reply	other threads:[~2020-03-18 10:03 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-12 23:36 [PATCH bpf-next 0/7] Add bpf_sk_assign eBPF helper Joe Stringer
2020-03-12 23:36 ` [PATCH bpf-next 1/7] dst: Move skb_dst_drop to skbuff.c Joe Stringer
2020-03-12 23:36 ` [PATCH bpf-next 2/7] dst: Add socket prefetch metadata destinations Joe Stringer
2020-03-12 23:36 ` [PATCH bpf-next 3/7] bpf: Add socket assign support Joe Stringer
2020-03-16 10:08   ` Jakub Sitnicki
2020-03-16 21:23     ` Joe Stringer
2020-03-16 22:57   ` Martin KaFai Lau
2020-03-17  3:06     ` Joe Stringer
2020-03-17  6:26       ` Martin KaFai Lau
2020-03-18  0:46         ` Joe Stringer
2020-03-18 10:03           ` Jakub Sitnicki [this message]
2020-03-19  5:49             ` Joe Stringer
2020-03-18 18:48           ` Martin KaFai Lau
2020-03-19  6:24             ` Joe Stringer
2020-03-20  1:54               ` Martin KaFai Lau
2020-03-20  4:28                 ` Joe Stringer
2020-03-17 10:09       ` Lorenz Bauer
2020-03-18  1:10         ` Joe Stringer
2020-03-18  2:03           ` Joe Stringer
2020-03-12 23:36 ` [PATCH bpf-next 4/7] dst: Prefetch established socket destinations Joe Stringer
2020-03-16 23:03   ` Martin KaFai Lau
2020-03-17  3:17     ` Joe Stringer
2020-03-12 23:36 ` [PATCH bpf-next 5/7] selftests: bpf: add test for sk_assign Joe Stringer
2020-03-17  6:30   ` Martin KaFai Lau
2020-03-17 20:56     ` Joe Stringer
2020-03-18 17:27       ` Martin KaFai Lau
2020-03-19  5:45         ` Joe Stringer
2020-03-19 17:36           ` Andrii Nakryiko
2020-03-12 23:36 ` [PATCH bpf-next 6/7] selftests: bpf: Extend sk_assign for address proxy Joe Stringer
2020-03-12 23:36 ` [PATCH bpf-next 7/7] selftests: bpf: Improve debuggability of sk_assign Joe Stringer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87h7ymx9my.fsf@cloudflare.com \
    --to=jakub@cloudflare.com \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=eric.dumazet@gmail.com \
    --cc=joe@wand.net.nz \
    --cc=kafai@fb.com \
    --cc=lmb@cloudflare.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.