From: Stanislav Fomichev
Date: Tue, 25 Apr 2023 11:42:58 -0700
Subject: Re: handling unsupported optlen in cgroup bpf getsockopt: (was
 [PATCH net-next v4 2/4] net: socket: add sockopts blacklist for BPF
 cgroup hook)
To: Kui-Feng Lee
Cc: Martin KaFai Lau, Eric Dumazet, davem@davemloft.net,
 linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
 daniel@iogearbox.net, Jakub Kicinski, Paolo Abeni, Leon Romanovsky,
 David Ahern, Arnd Bergmann, Kees Cook, Christian Brauner,
 Kuniyuki Iwashima, Lennart Poettering, linux-arch@vger.kernel.org,
 Aleksandr Mikhalitsyn, bpf
References: <20230413133355.350571-1-aleksandr.mikhalitsyn@canonical.com>
 <20230413133355.350571-3-aleksandr.mikhalitsyn@canonical.com>

On Tue, Apr 25, 2023 at 10:59 AM Kui-Feng Lee wrote:
>
>
>
> On 4/18/23 09:47, Stanislav Fomichev wrote:
> > On 04/17, Martin KaFai Lau wrote:
> >> On 4/14/23 6:55 PM, Stanislav Fomichev wrote:
> >>> On 04/13, Stanislav Fomichev wrote:
> >>>> On Thu, Apr 13, 2023 at 7:38 AM Aleksandr Mikhalitsyn
> >>>> wrote:
> >>>>>
> >>>>> On Thu, Apr 13, 2023 at 4:22 PM Eric Dumazet wrote:
> >>>>>>
> >>>>>> On Thu, Apr 13, 2023 at 3:35 PM Alexander Mikhalitsyn
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> During work on SO_PEERPIDFD, it was discovered (thanks to Christian),
> >>>>>>> that bpf cgroup hook can cause FD leaks when used with sockopts which
> >>>>>>> install FDs into the process fdtable.
> >>>>>>>
> >>>>>>> After some offlist discussion it was proposed to add a blacklist of
> >>>>>>
> >>>>>> We try to replace this word by either denylist or blocklist, even in changelogs.
> >>>>>
> >>>>> Hi Eric,
> >>>>>
> >>>>> Oh, I'm sorry about that. :( Sure.
> >>>>>
> >>>>>>
> >>>>>>> socket options those can cause troubles when BPF cgroup hook is enabled.
> >>>>>>>
> >>>>>>
> >>>>>> Can we find the appropriate Fixes: tag to help stable teams ?
> >>>>>
> >>>>> Sure, I will add next time.
> >>>>>
> >>>>> Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks")
> >>>>>
> >>>>> I think it's better to add Stanislav Fomichev to CC.
> >>>>
> >>>> Can we use 'struct proto' bpf_bypass_getsockopt instead? We already
> >>>> use it for tcp zerocopy, I'm assuming it should work in this case as
> >>>> well?
> >>>
> >>> Jakub reminded me of the other things I wanted to ask here bug forgot:
> >>>
> >>> - setsockopt is probably not needed, right? setsockopt hook triggers
> >>>   before the kernel and shouldn't leak anything
> >>> - for getsockopt, instead of bypassing bpf completely, should we instead
> >>>   ignore the error from the bpf program? that would still preserve
> >>>   the observability aspect
> >>
> >> stealing this thread to discuss the optlen issue which may make sense to
> >> bypass also.
> >>
> >> There has been issue with optlen.
> >> Other than this older post related to
> >> optlen > PAGE_SIZE:
> >> https://lore.kernel.org/bpf/5c8b7d59-1f28-2284-f7b9-49d946f2e982@linux.dev/,
> >> the recent one related to optlen that we have seen is
> >> NETLINK_LIST_MEMBERSHIPS. The userspace passed in optlen == 0 and the kernel
> >> put the expected optlen (> 0) and 'return 0;' to userspace. The userspace
> >> intention is to learn the expected optlen. This makes 'ctx.optlen >
> >> max_optlen' and __cgroup_bpf_run_filter_getsockopt() ends up returning
> >> -EFAULT to the userspace even the bpf prog has not changed anything.
> >
> > (ignoring -EFAULT issue) this seems like it needs to be
> >
> > if (optval && (ctx.optlen > max_optlen || ctx.optlen < 0)) {
> >         /* error */
> > }
> >
> > ?
> >
> >> Does it make sense to also bypass the bpf prog when 'ctx.optlen >
> >> max_optlen' for now (and this can use a separate patch which as usual
> >> requires a bpf selftests)?
> >
> > Yeah, makes sense. Replacing this -EFAULT with WARN_ON_ONCE or something
> > seems like the way to go. It caused too much trouble already :-(
> >
> > Should I prepare a patch or do you want to take a stab at it?
> >
> >> In the future, does it make sense to have a specific cgroup-bpf-prog (a
> >> specific attach type?) that only uses bpf_dynptr kfunc to access the optval
> >> such that it can enforce read-only for some optname and potentially also
> >> track if bpf-prog has written a new optval? The bpf-prog can only return 1
> >> (OK) and only allows using bpf_set_retval() instead. Likely there is still
> >> holes but could be a seed of thought to continue polishing the idea.
> >
> > Ack, let's think about it.
> >
> > Maybe we should re-evaluate 'getsockopt-happens-after-the-kernel' idea
> > as well?
> > If we can have a sleepable hook that can copy_from_user/copy_to_user,
> > and we have a mostly working bpf_getsockopt (after your refactoring),
> > I don't see why we need to continue the current scheme of triggering
> > after the kernel?
>
> Since a sleepable hook would cause some restrictions, perhaps, we could
> introduce something like the promise pattern. In our case here, BPF
> program call an async version of copy_from_user()/copy_to_user() to
> return a promise.

Having a promise might work. This is essentially what we already do
with sockets/etc. with the acquire/release pattern.

What are the sleepable restrictions you're hinting at? I feel like
with sleepable bpf, we can also remove all the temporary buffer
management / extra copies, which sounds like a win to me (we have this
ugly heuristic with BPF_SOCKOPT_KERN_BUF_SIZE). The program can
allocate temporary buffers if needed.

> >>> - or maybe we can even have a per-proto bpf_getsockopt_cleanup call that
> >>>   gets called whenever bpf returns an error to make sure protocols have
> >>>   a chance to handle that condition (and free the fd)
> >>>
> >>
> >>
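
For reference, the optlen condition quoted above can be modeled in plain
userspace C. This is only a sketch of the predicate, not the code in
__cgroup_bpf_run_filter_getsockopt(): check_optlen() and its has_optval
parameter are hypothetical names, with has_optval standing in for the
"optval" pointer test. It does not establish whether the
NETLINK_LIST_MEMBERSHIPS caller actually passes a NULL optval.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model of the quoted condition:
 *
 *   if (optval && (ctx.optlen > max_optlen || ctx.optlen < 0))
 *
 * Returns -EFAULT when the condition holds and 0 otherwise.
 */
static int check_optlen(bool has_optval, int optlen, int max_optlen)
{
	/* An out-of-range optlen is rejected only when a buffer is present. */
	if (has_optval && (optlen > max_optlen || optlen < 0))
		return -EFAULT;

	/* No buffer, or optlen within [0, max_optlen]: accepted. */
	return 0;
}
```

With this model, check_optlen(false, 16, 0) returns 0, while
check_optlen(true, 16, 0) still returns -EFAULT, so the extra optval
test only helps callers that query the length without a buffer.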