From: sdf@google.com
To: Aditi Ghag <aditivghag@gmail.com>
Cc: netdev@vger.kernel.org, bpf@vger.kernel.org,
Daniel Borkmann <daniel@iogearbox.net>
Subject: Re: [RFC] Socket termination for policy enforcement and load-balancing
Date: Wed, 31 Aug 2022 15:02:08 -0700
Message-ID: <Yw/aYIR3mBABN75G@google.com>
In-Reply-To: <CABG=zsBEh-P4NXk23eBJw7eajB5YJeRS7oPXnTAzs=yob4EMoQ@mail.gmail.com>
On 08/31, Aditi Ghag wrote:
> This is an RFC for terminating sockets with intent. We have two
> prominent use cases in Cilium [1] where we need a way to identify and
> forcefully terminate a set of sockets so that they can reconnect.
> Cilium uses eBPF cgroup hooks for load-balancing, where it translates
> a service vip to one of the service backend ip addresses at socket
> connect time for TCP and connected UDP. Client applications are likely
> to be unaware of the remote containers that they are connected to
> getting deleted, and are left hanging when the remotes go away
> (long-running UDP applications, particularly). For the policy
> enforcement use case, users may want to enforce policies on the fly
> so that all client application traffic, including established
> connections, is redirected to a subset of destinations.
> We evaluated the following ways to identify and forcefully terminate sockets:
> - The sock_destroy API added for similar Android use cases is
> effective in tearing down sockets. The API is behind the
> CONFIG_INET_DIAG_DESTROY config that's disabled by default, and
> currently exposed via SOCK_DIAG netlink infrastructure in userspace.
> The sock destroy handlers for TCP and UDP protocols send ECONNABORTED
> error code to sockets related to the abort state as mentioned in RFC
> 793.
> - Add unreachable routes for deleted backends. I experimented with
> this approach with my colleague, Nikolay Aleksandrov. We found that
> TCP and connected UDP sockets in the established state simply ignore
> the ICMP error messages, and continue to send data in the presence of
> such routes. My read is that applications are ignoring the ICMP errors
> reported on sockets [2].
[..]
> - Use BPF (sockets) iterator to identify sockets connected to a
> deleted backend. The BPF (sockets) iterator is network namespace aware
> so we'll either need to enter every possible container network
> namespace to identify the affected connections, or adapt the iterator
> to be without netns checks [3]. This was discussed with my colleague
> Daniel Borkmann based on the feedback he shared from the LSFMMBPF
> conference discussions.
Maybe that's something worth fixing as well, even if you end up using
netlink? Having to manually go over all network namespaces (if I want
to iterate over all sockets on the host) doesn't seem feasible?
> - Use INET_DIAG infrastructure to filter and destroy sockets connected
> to stale backends. This approach involves first making a query to
> filter sockets connecting to a destination ip address/port using
> netlink messages with type SOCK_DIAG_BY_FAMILY, and then use the query
> results to make another message of type SOCK_DESTROY to actually
> destroy the sockets. The SOCK_DIAG infrastructure, similar to BPF
> iterators, is network namespace aware.
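For reference, a rough sketch of the two messages that the INET_DIAG flow above implies: a SOCK_DIAG_BY_FAMILY dump request, followed by a SOCK_DESTROY request for one matching socket. It only packs the bytes (struct nlmsghdr plus struct inet_diag_req_v2) and does not open a netlink socket. The helper names and the 10.0.0.x addresses are made up for illustration, and a real tool would copy the socket id and cookie from the dump reply rather than use the no-cookie wildcard assumed here.

```python
import socket
import struct

# Values from the Linux UAPI headers linux/netlink.h and linux/sock_diag.h.
NLM_F_REQUEST = 0x001
NLM_F_ACK = 0x004
NLM_F_DUMP = 0x300  # NLM_F_ROOT | NLM_F_MATCH
SOCK_DIAG_BY_FAMILY = 20
SOCK_DESTROY = 21
TCP_ESTABLISHED = 1
INET_DIAG_NOCOOKIE = 0xFFFFFFFF  # wildcard: do not match on the socket cookie


def pad_addr(family, text):
    """Return an address padded to the 16 bytes used by inet_diag_sockid."""
    if not text:
        return bytes(16)
    return socket.inet_pton(family, text).ljust(16, b"\0")


def build_diag_msg(msg_type, flags, family, protocol, states=0,
                   src="", sport=0, dst="", dport=0, seq=1):
    """Pack struct nlmsghdr followed by struct inet_diag_req_v2 (hypothetical helper)."""
    # sdiag_family, sdiag_protocol, idiag_ext, pad, idiag_states
    payload = struct.pack("=BBBxI", family, protocol, 0, states)
    # inet_diag_sockid: ports and addresses are in network byte order.
    payload += struct.pack("!HH", sport, dport)
    payload += pad_addr(family, src) + pad_addr(family, dst)
    # Interface index (0 = any) and the two cookie words are host order.
    payload += struct.pack("=III", 0, INET_DIAG_NOCOOKIE, INET_DIAG_NOCOOKIE)
    # nlmsghdr: length, type, flags, sequence number, port id (0 = kernel).
    header = struct.pack("=IHHII", 16 + len(payload), msg_type, flags, seq, 0)
    return header + payload


# Step 1: ask for all established IPv4 TCP sockets in the current netns.
dump_req = build_diag_msg(SOCK_DIAG_BY_FAMILY, NLM_F_REQUEST | NLM_F_DUMP,
                          socket.AF_INET, socket.IPPROTO_TCP,
                          states=1 << TCP_ESTABLISHED)

# Step 2: destroy one socket found in the dump (example 4-tuple, made up).
destroy_req = build_diag_msg(SOCK_DESTROY, NLM_F_REQUEST | NLM_F_ACK,
                             socket.AF_INET, socket.IPPROTO_TCP,
                             src="10.0.0.1", sport=40000,
                             dst="10.0.0.2", dport=443, seq=2)

print(len(dump_req), len(destroy_req))
```

Both messages would be sent on an AF_NETLINK socket using the NETLINK_SOCK_DIAG protocol, and since that socket lives in one network namespace, the whole exchange has to be repeated per netns.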
> We are currently leaning towards invoking the sock_destroy API
> directly from BPF programs. This allows us to have an effective
> mechanism without having to enter every possible container network
> namespace on a node, and rely on the CONFIG_INET_DIAG_DESTROY config
> with the right permissions. BPF programs attached to cgroup hooks can
> store client sockets connected to a backend, and invoke destroy APIs
> when backends are deleted.
> To that end, I'm in the process of adding a new BPF helper for the
> sock_destroy kernel function similar to the sock_diag_destroy function
> [4], and am soliciting early feedback on the evaluated and selected
> approaches. Happy to share more context.
> [1] https://github.com/cilium/cilium
> [2] https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_ipv4.c#L464
> [3] https://github.com/torvalds/linux/blob/master/net/ipv4/udp.c#L3011
> [4] https://github.com/torvalds/linux/blob/master/net/core/sock_diag.c#L298