Date: Wed, 31 Aug 2022 15:02:08 -0700
Subject: Re: [RFC] Socket termination for policy enforcement and load-balancing
From: sdf@google.com
To: Aditi Ghag
Cc: netdev@vger.kernel.org, bpf@vger.kernel.org, Daniel Borkmann
X-Mailing-List: bpf@vger.kernel.org

On 08/31, Aditi Ghag wrote:
> This is an RFC for terminating sockets with intent. We have two
> prominent use cases in Cilium [1] where we need a way to identify and
> forcefully terminate a set of sockets so that they can reconnect.
> Cilium uses eBPF cgroup hooks for load-balancing, where it translates
> a service VIP to one of the service backend IP addresses at socket
> connect time for TCP and connected UDP. Client applications are likely
> to be unaware that the remote containers they are connected to have
> been deleted, and are left hanging when the remotes go away
> (long-running UDP applications, particularly). For the policy
> enforcement use case, users may want to enforce policies on-the-fly
> where they want all client applications' traffic, including
> established connections, to be redirected to a subset of destinations.
>
> We evaluated the following ways to identify and forcefully terminate
> sockets:
>
> - The sock_destroy API added for similar Android use cases is
>   effective in tearing down sockets. The API is behind the
>   CONFIG_INET_DIAG_DESTROY config that's disabled by default, and
>   currently exposed via SOCK_DIAG netlink infrastructure in userspace.
>   The sock destroy handlers for TCP and UDP protocols send the
>   ECONNABORTED error code to sockets, related to the abort state as
>   mentioned in RFC 793.
>
> - Add unreachable routes for deleted backends. I experimented with
>   this approach with my colleague, Nikolay Aleksandrov. We found that
>   TCP and connected UDP sockets in the established state simply ignore
>   the ICMP error messages, and continue to send data in the presence of
>   such routes. My read is that applications are ignoring the ICMP errors
>   reported on sockets [2].

[..]

> - Use BPF (sockets) iterator to identify sockets connected to a
>   deleted backend. The BPF (sockets) iterator is network namespace aware
>   so we'll either need to enter every possible container network
>   namespace to identify the affected connections, or adapt the iterator
>   to be without netns checks [3]. This was discussed with my colleague
>   Daniel Borkmann based on the feedback he shared from the LSFMMBPF
>   conference discussions.

Maybe something worth fixing as well even if you end up using netlink?
Having to manually go over all networking namespaces (if I want
to iterate over all sockets on the host) doesn't seem feasible?

> - Use INET_DIAG infrastructure to filter and destroy sockets connected
>   to stale backends. This approach involves first making a query to
>   filter sockets connecting to a destination IP address/port using
>   netlink messages with type SOCK_DIAG_BY_FAMILY, and then using the
>   query results to make another message of type SOCK_DESTROY to actually
>   destroy the sockets. The SOCK_DIAG infrastructure, similar to BPF
>   iterators, is network namespace aware.
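For reference, the two-step INET_DIAG flow described in the last bullet could be sketched roughly as below. This is only an illustration under stated assumptions, not code from the RFC: it builds the two netlink messages (dump, then destroy) and matches the destination in userspace, without opening a socket or sending anything. The names diag_request, build_dump_request, connects_to and build_destroy_request are made up for this sketch, it handles TCP over IPv4 only, and any kernel-side filtering is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <linux/netlink.h>
#include <linux/sock_diag.h>
#include <linux/inet_diag.h>

/* Numeric value of the TCP_ESTABLISHED state, defined locally so that the
 * sketch does not need a TCP header just for one constant. */
#define SKETCH_TCP_ESTABLISHED 1

/* Hypothetical wrapper: netlink header plus inet_diag request payload.
 * The same layout is used for the dump and for the destroy message. */
struct diag_request {
	struct nlmsghdr nlh;
	struct inet_diag_req_v2 req;
};

/* Step 1: ask for a dump of established TCP/IPv4 sockets. */
static void build_dump_request(struct diag_request *m)
{
	assert(m != NULL);
	memset(m, 0, sizeof(*m));
	m->nlh.nlmsg_len = sizeof(*m);
	m->nlh.nlmsg_type = SOCK_DIAG_BY_FAMILY;
	m->nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
	m->req.sdiag_family = AF_INET;
	m->req.sdiag_protocol = IPPROTO_TCP;
	m->req.idiag_states = 1U << SKETCH_TCP_ESTABLISHED;
}

/* Step 2: userspace check of one dump result against a stale backend.
 * Address and port are expected in network byte order. */
static int connects_to(const struct inet_diag_msg *d,
		       uint32_t daddr_be, uint16_t dport_be)
{
	return d->idiag_family == AF_INET &&
	       d->id.idiag_dst[0] == daddr_be &&
	       d->id.idiag_dport == dport_be;
}

/* Step 3: build a SOCK_DESTROY message for a matching dump result,
 * reusing the socket id (addresses, ports, cookie) the dump returned. */
static void build_destroy_request(struct diag_request *m,
				  const struct inet_diag_msg *d,
				  uint8_t protocol)
{
	assert(m != NULL && d != NULL);
	memset(m, 0, sizeof(*m));
	m->nlh.nlmsg_len = sizeof(*m);
	m->nlh.nlmsg_type = SOCK_DESTROY;
	m->nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
	m->req.sdiag_family = d->idiag_family;
	m->req.sdiag_protocol = protocol;
	m->req.id = d->id;
}
```

A caller would send the dump request on a NETLINK_SOCK_DIAG socket opened inside the target network namespace, run connects_to() on each returned inet_diag_msg, and send one destroy request per match. The destroy step only works on kernels built with CONFIG_INET_DIAG_DESTROY, as noted in the first bullet.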
> We are currently leaning towards invoking the sock_destroy API
> directly from BPF programs. This allows us to have an effective
> mechanism without having to enter every possible container network
> namespace on a node, and rely on the CONFIG_INET_DIAG_DESTROY config
> with the right permissions. BPF programs attached to cgroup hooks can
> store client sockets connected to a backend, and invoke destroy APIs
> when backends are deleted.
>
> To that end, I'm in the process of adding a new BPF helper for the
> sock_destroy kernel function similar to the sock_diag_destroy function
> [4], and am soliciting early feedback on the evaluated and selected
> approaches. Happy to share more context.
>
> [1] https://github.com/cilium/cilium
> [2] https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_ipv4.c#L464
> [3] https://github.com/torvalds/linux/blob/master/net/ipv4/udp.c#L3011
> [4] https://github.com/torvalds/linux/blob/master/net/core/sock_diag.c#L298
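To make the bookkeeping in the proposed approach concrete (store client sockets per backend at connect time, then collect them when the backend is deleted), here is a plain userspace C model. It is deliberately not BPF code and is not taken from the RFC: the fixed-size table stands in for whatever BPF map would be used, the 64-bit cookie stands in for a socket identifier, and every name (backend, tracked_sock, sock_table, track_connect, take_sockets_for_backend) is hypothetical. The actual destroy call is left out, since the helper is still being proposed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 1024

/* Hypothetical backend key: IPv4 address and port. */
struct backend {
	uint32_t addr;
	uint16_t port;
};

/* One tracked client socket; cookie is a stand-in socket identifier. */
struct tracked_sock {
	uint64_t cookie;
	struct backend be;
	int in_use;
};

/* Must be zero-initialized before first use. */
struct sock_table {
	struct tracked_sock slots[MAX_TRACKED];
};

/* Connect-time hook model: remember which backend a socket was sent to.
 * Returns 0 on success, -1 if the table is full. */
static int track_connect(struct sock_table *t, uint64_t cookie,
			 struct backend be)
{
	assert(t != NULL);
	for (size_t i = 0; i < MAX_TRACKED; i++) {
		if (!t->slots[i].in_use) {
			t->slots[i].cookie = cookie;
			t->slots[i].be = be;
			t->slots[i].in_use = 1;
			return 0;
		}
	}
	return -1;
}

/* Backend-deletion model: copy out and untrack every socket connected to
 * the deleted backend. Returns how many cookies were written to out
 * (at most out_len); the caller would destroy each of them. */
static size_t take_sockets_for_backend(struct sock_table *t,
				       struct backend be,
				       uint64_t *out, size_t out_len)
{
	size_t n = 0;

	assert(t != NULL && out != NULL);
	for (size_t i = 0; i < MAX_TRACKED && n < out_len; i++) {
		struct tracked_sock *s = &t->slots[i];

		if (s->in_use && s->be.addr == be.addr &&
		    s->be.port == be.port) {
			out[n++] = s->cookie;
			s->in_use = 0;
		}
	}
	return n;
}
```

The point of the model is that lookup is keyed by backend rather than by network namespace, so nothing here needs to enter a container netns. A linear scan is used only to keep the sketch short.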