From: Martin KaFai Lau <martin.lau@linux.dev>
To: Kumar Kartikeya Dwivedi <memxor@gmail.com>,
Aditi Ghag <aditivghag@gmail.com>
Cc: Martin KaFai Lau <kafai@fb.com>,
netdev@vger.kernel.org, bpf@vger.kernel.org,
Daniel Borkmann <daniel@iogearbox.net>,
Yonghong Song <yhs@fb.com>, Kuniyuki Iwashima <kuniyu@amazon.com>
Subject: Re: [RFC] Socket termination for policy enforcement and load-balancing
Date: Wed, 7 Sep 2022 19:26:34 -0700 [thread overview]
Message-ID: <077d56ef-30cb-2d19-6f57-a92fd886b5f2@linux.dev> (raw)
In-Reply-To: <CAP01T76ry6etJ2Zi02a2+ZtGJxrc=rky5gMqFE7on_fuOe8A8A@mail.gmail.com>
On 9/4/22 2:24 PM, Kumar Kartikeya Dwivedi wrote:
> On Sun, 4 Sept 2022 at 20:55, Aditi Ghag <aditivghag@gmail.com> wrote:
>>
>> On Wed, Aug 31, 2022 at 4:02 PM Martin KaFai Lau <kafai@fb.com> wrote:
>>>
>>> On Wed, Aug 31, 2022 at 09:37:41AM -0700, Aditi Ghag wrote:
>>>> - Use BPF (sockets) iterator to identify sockets connected to a
>>>> deleted backend. The BPF (sockets) iterator is network namespace aware
>>>> so we'll either need to enter every possible container network
>>>> namespace to identify the affected connections, or adapt the iterator
>>>> to be without netns checks [3]. This was discussed with my colleague
>>>> Daniel Borkmann based on the feedback he shared from the LSFMMBPF
>>>> conference discussions.
>>> Being able to iterate all sockets across different netns will
>>> be useful.
>>>
>>> It should be doable to ignore the netns check. For udp, a quick
>>> thought is to have another iter target. eg. "udp_all_netns".
>>> From the sk, the bpf prog should be able to learn the netns and
>>> the bpf prog can filter the netns by itself.
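[Sketch of the in-prog netns filtering idea. The all-netns iterator target
("udp_all_netns") does not exist today; this shows an iter/udp prog that
learns the sk's netns from the sk itself and filters on it, assuming a
vmlinux.h/CO-RE build:]

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

const volatile __u64 target_netns_inum; /* set by user space before load */

SEC("iter/udp")
int dump_udp(struct bpf_iter__udp *ctx)
{
	struct udp_sock *udp_sk = ctx->udp_sk;
	struct seq_file *seq = ctx->meta->seq;
	struct sock *sk;

	if (!udp_sk)
		return 0;
	sk = (struct sock *)udp_sk;
	/* The prog reads the netns inode from the sk and filters itself. */
	if (sk->__sk_common.skc_net.net->ns.inum != target_netns_inum)
		return 0;
	BPF_SEQ_PRINTF(seq, "sk: %pK\n", sk);
	return 0;
}

char _license[] SEC("license") = "GPL";
```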
>>>
>>> The TCP side is going to have an 'optional' per netns ehash table [0] soon,
>>> not lhash2 (listening hash) though. Ideally, the same bpf
>>> all-netns iter interface should work similarly for both udp and
>>> tcp case. Thus, both should be considered and work at the same time.
>>>
>>> For udp, something more useful than plain udp_abort() could potentially
>>> be done. eg. directly connect to another backend (by bpf kfunc?).
>>> There may be some details in socket locking...etc but should
>>> be doable and the bpf-iter program could be sleepable also.
>>
>> This won't be effective for connected udp though, will it? Interesting thought
>> around using bpf kfunc
hmm... why the bpf-prog doing the udp re-connect() won't be effective?
I suspect we are talking about different things.
Regardless, for tcp, I think the user space needs to handle the tcp
aborted-error by redoing the connect(). Thus, let's stay with
{tcp,udp}_abort() for now. Try to expose {tcp,udp}_abort() as a kfunc
instead of a new bpf_helper.
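[A kernel-side sketch of that kfunc idea. tcp_abort() and udp_abort() are
already wired up as sk->sk_prot->diag_destroy; the kfunc name, flags and
prog type below are assumptions, not a settled interface:]

```c
__bpf_kfunc int bpf_sock_abort(struct sock *sk, int err)
{
	if (!sk->sk_prot->diag_destroy)
		return -EOPNOTSUPP;
	/* Dispatches to tcp_abort() or udp_abort() via diag_destroy. */
	return sk->sk_prot->diag_destroy(sk, err);
}

BTF_SET8_START(sock_abort_kfunc_ids)
BTF_ID_FLAGS(func, bpf_sock_abort, KF_TRUSTED_ARGS)
BTF_SET8_END(sock_abort_kfunc_ids)

static const struct btf_kfunc_id_set sock_abort_kfunc_set = {
	.owner = THIS_MODULE,
	.set   = &sock_abort_kfunc_ids,
};

/* registered once at init, e.g.:
 * register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &sock_abort_kfunc_set);
 */
```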
>>
>>> fwiw, we are iterating the tcp socket to retire some older
>>> bpf-tcp-cc (congestion control) on the long-lived connections
>>> by bpf_setsockopt(TCP_CONGESTION).
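[Roughly how that cc-retirement iterator looks, modeled on the
bpf_iter_setsockopt selftest in the kernel tree; the old/new cc names
here are placeholders:]

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define SOL_TCP		6
#define TCP_CONGESTION	13
#define TCP_CA_NAME_MAX	16

const char old_cc[TCP_CA_NAME_MAX] = "old_bpf_cc";	/* placeholder */
const char new_cc[TCP_CA_NAME_MAX] = "cubic";		/* placeholder */

static bool cc_eq(const char *a, const char *b)
{
	for (int i = 0; i < TCP_CA_NAME_MAX; i++) {
		if (a[i] != b[i])
			return false;
		if (!a[i])
			break;
	}
	return true;
}

SEC("iter/tcp")
int change_tcp_cc(struct bpf_iter__tcp *ctx)
{
	struct sock_common *sk_common = ctx->sk_common;
	char cur_cc[TCP_CA_NAME_MAX];
	struct tcp_sock *tp;

	if (!sk_common)
		return 0;
	tp = bpf_skc_to_tcp_sock(sk_common);
	if (!tp)
		return 0;
	if (bpf_getsockopt(tp, SOL_TCP, TCP_CONGESTION, cur_cc, sizeof(cur_cc)))
		return 0;
	/* Only touch connections still on the cc being retired. */
	if (!cc_eq(cur_cc, old_cc))
		return 0;
	bpf_setsockopt(tp, SOL_TCP, TCP_CONGESTION,
		       (void *)new_cc, sizeof(new_cc));
	return 0;
}

char _license[] SEC("license") = "GPL";
```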
>>>
>>> Also, potentially, instead of iterating all,
>>> a more selective case can be done by
>>> bpf_prog_test_run()+bpf_sk_lookup_*()+udp_abort().
>>
>> Can you elaborate more on the more selective iterator approach?
If the 4-tuple (src/dst ip/port) is known, bpf_sk_lookup_*() can look up
a sk from the tcp_hashinfo or udp_table. bpf_sk_lookup_*() also takes a
netns_id argument. However, yeah, it will still go back to the need to
cover all netns, so it may not work well in the RFC case here.
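[Sketch of that selective path: a prog run via BPF_PROG_TEST_RUN looks up
one sk by its 4-tuple with bpf_sk_lookup_tcp() and acts on it. The tuple
globals are assumed to be filled in by user space before the test run, and
the abort step is left as a comment since no such kfunc exists yet. The
netns_id argument is exactly where the all-netns limitation shows up:]

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* filled by user space (network byte order) before BPF_PROG_TEST_RUN */
__be32 src_ip, dst_ip;
__be16 src_port, dst_port;

SEC("tc")
int destroy_one(struct __sk_buff *skb)
{
	struct bpf_sock_tuple tuple = {};
	struct bpf_sock *sk;

	tuple.ipv4.saddr = src_ip;
	tuple.ipv4.daddr = dst_ip;
	tuple.ipv4.sport = src_port;
	tuple.ipv4.dport = dst_port;

	/* BPF_F_CURRENT_NETNS restricts the lookup to this netns. */
	sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
			       BPF_F_CURRENT_NETNS, 0);
	if (!sk)
		return 0;
	/* abort would happen here, e.g. via a {tcp,udp}_abort() kfunc */
	bpf_sk_release(sk);
	return 0;
}

char _license[] SEC("license") = "GPL";
```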
>>
>> On a similar note, are there better alternatives to the
>> sockets iterator approach?
>> Since we have BPF programs executed on cgroup BPF hooks (e.g.,
>> connect), we already know what client
>> sockets are connected to a backend. Can we somehow store these socket
>> pointers in a regular BPF map, and
>> when a backend is deleted, use a regular map iterator to invoke
>> sock_destroy() for these sockets? Does anyone have
>> experience using the "typed pointer support in BPF maps" APIs [0]?
>
> I am not very familiar with how socket lifetime is managed, it may not
> be possible in case lifetime is managed by RCU only,
> or due to other limitations.
> Martin will probably be able to comment more on that.
sk is the usual refcnt+rcu_reader pattern. afaik, the use case here is
the sk should be removed from the map when there is a tcp_close() or
udp_lib_close(). There is sock_map and sock_hash to store sk as the
map-value. iirc the sk will be automatically removed from the map
during tcp_close() and udp_lib_close(). The sock_map and sock_hash have
bpf iterator also. Meaning a bpf-iter-prog can iterate the sock_map and
sock_hash and then do abort on each sk, so it looks like most of the
pieces are in place.
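[Sketch of that last piece: a bpf-iter prog walking a sock_map/sock_hash
and aborting each sk. "bpf_sock_abort" is a hypothetical kfunc wrapping
tcp_abort()/udp_abort(); the iter/sockmap target and the auto-removal of
closed sks from the map exist today:]

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define ECONNABORTED 103

/* hypothetical abort kfunc, declared as a ksym */
extern int bpf_sock_abort(struct sock *sk, int err) __ksym;

SEC("iter/sockmap")
int destroy_backends(struct bpf_iter__sockmap *ctx)
{
	struct sock *sk = ctx->sk;

	if (!sk)
		return 0;
	/* Abort each sk still present in the map. */
	bpf_sock_abort(sk, ECONNABORTED);
	return 0;
}

char _license[] SEC("license") = "GPL";
```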
Thread overview: 10+ messages
2022-08-31 16:37 [RFC] Socket termination for policy enforcement and load-balancing Aditi Ghag
2022-08-31 22:02 ` sdf
2022-09-04 17:41 ` Aditi Ghag
2022-09-06 16:28 ` Stanislav Fomichev
2022-08-31 23:01 ` Martin KaFai Lau
2022-08-31 23:43 ` Kuniyuki Iwashima
2022-09-04 18:14 ` Aditi Ghag
2022-09-04 21:24 ` Kumar Kartikeya Dwivedi
2022-09-08 2:26 ` Martin KaFai Lau [this message]
2022-09-12 10:01 ` Aditi Ghag