From: Martin KaFai Lau <martin.lau@linux.dev>
To: Kumar Kartikeya Dwivedi <memxor@gmail.com>,
Aditi Ghag <aditivghag@gmail.com>
Cc: Martin KaFai Lau <kafai@fb.com>,
netdev@vger.kernel.org, bpf@vger.kernel.org,
Daniel Borkmann <daniel@iogearbox.net>,
Yonghong Song <yhs@fb.com>, Kuniyuki Iwashima <kuniyu@amazon.com>
Subject: Re: [RFC] Socket termination for policy enforcement and load-balancing
Date: Wed, 7 Sep 2022 19:26:34 -0700 [thread overview]
Message-ID: <077d56ef-30cb-2d19-6f57-a92fd886b5f2@linux.dev> (raw)
In-Reply-To: <CAP01T76ry6etJ2Zi02a2+ZtGJxrc=rky5gMqFE7on_fuOe8A8A@mail.gmail.com>
On 9/4/22 2:24 PM, Kumar Kartikeya Dwivedi wrote:
> On Sun, 4 Sept 2022 at 20:55, Aditi Ghag <aditivghag@gmail.com> wrote:
>>
>> On Wed, Aug 31, 2022 at 4:02 PM Martin KaFai Lau <kafai@fb.com> wrote:
>>>
>>> On Wed, Aug 31, 2022 at 09:37:41AM -0700, Aditi Ghag wrote:
>>>> - Use BPF (sockets) iterator to identify sockets connected to a
>>>> deleted backend. The BPF (sockets) iterator is network namespace aware
>>>> so we'll either need to enter every possible container network
>>>> namespace to identify the affected connections, or adapt the iterator
>>>> to be without netns checks [3]. This was discussed with my colleague
>>>> Daniel Borkmann based on the feedback he shared from the LSFMMBPF
>>>> conference discussions.
>>> Being able to iterate all sockets across different netns will
>>> be useful.
>>>
>>> It should be doable to ignore the netns check. For udp, a quick
>>> thought is to have another iter target. eg. "udp_all_netns".
>>> From the sk, the bpf prog should be able to learn the netns and
>>> the bpf prog can filter the netns by itself.
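[Sketch of the in-prog netns filtering idea. The all-netns iterator target
("udp_all_netns") does not exist today; this shows an iter/udp prog that
learns the sk's netns from the sk itself and filters on it, assuming a
vmlinux.h/CO-RE build:]

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

const volatile __u64 target_netns_inum; /* set by user space before load */

SEC("iter/udp")
int dump_udp(struct bpf_iter__udp *ctx)
{
	struct udp_sock *udp_sk = ctx->udp_sk;
	struct seq_file *seq = ctx->meta->seq;
	struct sock *sk;

	if (!udp_sk)
		return 0;
	sk = (struct sock *)udp_sk;
	/* The prog reads the netns inode from the sk and filters itself. */
	if (sk->__sk_common.skc_net.net->ns.inum != target_netns_inum)
		return 0;
	BPF_SEQ_PRINTF(seq, "sk: %pK\n", sk);
	return 0;
}

char _license[] SEC("license") = "GPL";
```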
>>>
>>> The TCP side is going to have an 'optional' per netns ehash table [0] soon,
>>> not lhash2 (listening hash) though. Ideally, the same bpf
>>> all-netns iter interface should work similarly for both udp and
>>> tcp case. Thus, both should be considered and work at the same time.
>>>
>>> For udp, something more useful than plain udp_abort() could potentially
>>> be done. eg. directly connect to another backend (by bpf kfunc?).
>>> There may be some details in socket locking...etc but should
>>> be doable and the bpf-iter program could be sleepable also.
>>
>> This won't be effective for connected udp though, will it? Interesting thought
>> around using bpf kfunc
hmm... why the bpf-prog doing the udp re-connect() won't be effective?
I suspect we are talking about different things.
Regardless, for tcp, I think the user space needs to handle the tcp
aborted-error by redoing the connect(). Thus, let's stay with
{tcp,udp}_abort() for now. Try to expose {tcp,udp}_abort() as a kfunc
instead of a new bpf_helper.
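[A kernel-side sketch of that kfunc idea. tcp_abort() and udp_abort() are
already wired up as sk->sk_prot->diag_destroy; the kfunc name, flags and
prog type below are assumptions, not a settled interface:]

```c
__bpf_kfunc int bpf_sock_abort(struct sock *sk, int err)
{
	if (!sk->sk_prot->diag_destroy)
		return -EOPNOTSUPP;
	/* Dispatches to tcp_abort() or udp_abort() via diag_destroy. */
	return sk->sk_prot->diag_destroy(sk, err);
}

BTF_SET8_START(sock_abort_kfunc_ids)
BTF_ID_FLAGS(func, bpf_sock_abort, KF_TRUSTED_ARGS)
BTF_SET8_END(sock_abort_kfunc_ids)

static const struct btf_kfunc_id_set sock_abort_kfunc_set = {
	.owner = THIS_MODULE,
	.set   = &sock_abort_kfunc_ids,
};

/* registered once at init, e.g.:
 * register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &sock_abort_kfunc_set);
 */
```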
>>
>>> fwiw, we are iterating the tcp socket to retire some older
>>> bpf-tcp-cc (congestion control) on the long-lived connections
>>> by bpf_setsockopt(TCP_CONGESTION).
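[Roughly how that cc-retirement iterator looks, modeled on the
bpf_iter_setsockopt selftest in the kernel tree; the old/new cc names
here are placeholders:]

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define SOL_TCP		6
#define TCP_CONGESTION	13
#define TCP_CA_NAME_MAX	16

const char old_cc[TCP_CA_NAME_MAX] = "old_bpf_cc";	/* placeholder */
const char new_cc[TCP_CA_NAME_MAX] = "cubic";		/* placeholder */

static bool cc_eq(const char *a, const char *b)
{
	for (int i = 0; i < TCP_CA_NAME_MAX; i++) {
		if (a[i] != b[i])
			return false;
		if (!a[i])
			break;
	}
	return true;
}

SEC("iter/tcp")
int change_tcp_cc(struct bpf_iter__tcp *ctx)
{
	struct sock_common *sk_common = ctx->sk_common;
	char cur_cc[TCP_CA_NAME_MAX];
	struct tcp_sock *tp;

	if (!sk_common)
		return 0;
	tp = bpf_skc_to_tcp_sock(sk_common);
	if (!tp)
		return 0;
	if (bpf_getsockopt(tp, SOL_TCP, TCP_CONGESTION, cur_cc, sizeof(cur_cc)))
		return 0;
	/* Only touch connections still on the cc being retired. */
	if (!cc_eq(cur_cc, old_cc))
		return 0;
	bpf_setsockopt(tp, SOL_TCP, TCP_CONGESTION,
		       (void *)new_cc, sizeof(new_cc));
	return 0;
}

char _license[] SEC("license") = "GPL";
```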
>>>
>>> Also, potentially, instead of iterating all,
>>> a more selective case can be done by
>>> bpf_prog_test_run()+bpf_sk_lookup_*()+udp_abort().
>>
>> Can you elaborate more on the more selective iterator approach?
If the 4-tuple (src/dst ip/port) is known, bpf_sk_lookup_*() can look up
a sk from the tcp_hashinfo or udp_table. bpf_sk_lookup_*() also takes a
netns_id argument. However, yeah, it will still go back to the need to
cover all netns, so it may not work well in the RFC case here.
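[Sketch of that selective path: a prog run via BPF_PROG_TEST_RUN looks up
one sk by its 4-tuple with bpf_sk_lookup_tcp() and acts on it. The tuple
globals are assumed to be filled in by user space before the test run, and
the abort step is left as a comment since no such kfunc exists yet. The
netns_id argument is exactly where the all-netns limitation shows up:]

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* filled by user space (network byte order) before BPF_PROG_TEST_RUN */
__be32 src_ip, dst_ip;
__be16 src_port, dst_port;

SEC("tc")
int destroy_one(struct __sk_buff *skb)
{
	struct bpf_sock_tuple tuple = {};
	struct bpf_sock *sk;

	tuple.ipv4.saddr = src_ip;
	tuple.ipv4.daddr = dst_ip;
	tuple.ipv4.sport = src_port;
	tuple.ipv4.dport = dst_port;

	/* BPF_F_CURRENT_NETNS restricts the lookup to this netns. */
	sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
			       BPF_F_CURRENT_NETNS, 0);
	if (!sk)
		return 0;
	/* abort would happen here, e.g. via a {tcp,udp}_abort() kfunc */
	bpf_sk_release(sk);
	return 0;
}

char _license[] SEC("license") = "GPL";
```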
>>
>> On a similar note, are there better alternatives to the
>> sockets iterator approach?
>> Since we have BPF programs executed on cgroup BPF hooks (e.g.,
>> connect), we already know what client
>> sockets are connected to a backend. Can we somehow store these socket
>> pointers in a regular BPF map, and
>> when a backend is deleted, use a regular map iterator to invoke
>> sock_destroy() for these sockets? Does anyone have
>> experience using the "typed pointer support in BPF maps" APIs [0]?
>
> I am not very familiar with how socket lifetime is managed, it may not
> be possible in case lifetime is managed by RCU only,
> or due to other limitations.
> Martin will probably be able to comment more on that.
sk is the usual refcnt+rcu_reader pattern. afaik, the use case here is
the sk should be removed from the map when there is a tcp_close() or
udp_lib_close(). There is sock_map and sock_hash to store sk as the
map-value. iirc the sk will be automatically removed from the map
during tcp_close() and udp_lib_close(). The sock_map and sock_hash have
bpf iterator also. Meaning a bpf-iter-prog can iterate the sock_map and
sock_hash and then do abort on each sk, so it looks like most of the
pieces are in place.
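[Sketch of that last piece: a bpf-iter prog walking a sock_map/sock_hash
and aborting each sk. "bpf_sock_abort" is a hypothetical kfunc wrapping
tcp_abort()/udp_abort(); the iter/sockmap target and the auto-removal of
closed sks from the map exist today:]

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define ECONNABORTED 103

/* hypothetical abort kfunc, declared as a ksym */
extern int bpf_sock_abort(struct sock *sk, int err) __ksym;

SEC("iter/sockmap")
int destroy_backends(struct bpf_iter__sockmap *ctx)
{
	struct sock *sk = ctx->sk;

	if (!sk)
		return 0;
	/* Abort each sk still present in the map. */
	bpf_sock_abort(sk, ECONNABORTED);
	return 0;
}

char _license[] SEC("license") = "GPL";
```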
Thread overview: 10+ messages
2022-08-31 16:37 [RFC] Socket termination for policy enforcement and load-balancing Aditi Ghag
2022-08-31 22:02 ` sdf
2022-09-04 17:41 ` Aditi Ghag
2022-09-06 16:28 ` Stanislav Fomichev
2022-08-31 23:01 ` Martin KaFai Lau
2022-08-31 23:43 ` Kuniyuki Iwashima
2022-09-04 18:14 ` Aditi Ghag
2022-09-04 21:24 ` Kumar Kartikeya Dwivedi
2022-09-08 2:26 ` Martin KaFai Lau [this message]
2022-09-12 10:01 ` Aditi Ghag