From mboxrd@z Thu Jan 1 00:00:00 1970
From: sdf@google.com
To: Daniel Borkmann
Cc: bpf@vger.kernel.org, netdev@vger.kernel.org, tj@kernel.org,
    davem@davemloft.net, m@lambda.lt, alexei.starovoitov@gmail.com,
    andrii@kernel.org
Subject: Re: [PATCH bpf 1/3] bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode
Date: Thu, 9 Sep 2021 09:47:58 -0700
In-Reply-To: <1e9ee1059ddb0ad7cd2c5f9eeaa26606f9d5fbbf.1631189197.git.daniel@iogearbox.net>
References: <1e9ee1059ddb0ad7cd2c5f9eeaa26606f9d5fbbf.1631189197.git.daniel@iogearbox.net>
On 09/09, Daniel Borkmann wrote:
> Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used.
> Back in the day, commit bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
> embedded per-socket cgroup information into sock->sk_cgrp_data and, in
> order to save 8 bytes in struct sock, made both mutually exclusive, that
> is, when cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then
> cgroup v2 falls back to the root cgroup in sock_cgroup_ptr()
> (&cgrp_dfl_root.cgrp). The assumption made was "there is no reason to mix
> the two and this is in line with how legacy and v2 compatibility is
> handled" as stated in bd1060a1d671.
>
> However, with Kubernetes more widely supporting cgroups v2 nowadays, this
> assumption no longer holds, and the possibility of hitting the v1/v2
> mixed mode with the v2 root fallback becomes a real security issue. Many
> of the cgroup v2 BPF programs are also used for policy enforcement, just
> to pick _one_ example, that is, to programmatically deny socket-related
> system calls like connect(2) or bind(2). A v2 root fallback would
> implicitly cause a policy bypass for the affected Pods.
>
> In production environments, we have recently seen this case due to
> various circumstances: i) a different 3rd-party agent and/or ii) a
> container runtime such as [0] in the user's environment configuring
> legacy cgroup v1 net_cls tags, which triggered the mentioned root
> fallback. Another case is Kubernetes projects like kind [1], which
> create Kubernetes nodes in a container and also add cgroup namespaces
> to the mix, meaning programs which are attached to the cgroup v2 root
> of the cgroup namespace get attached to a non-root cgroup v2 path from
> the init namespace's point of view. And the latter's root is out of
> reach for agents on a kind Kubernetes node to configure. Meaning, any
> entity on the node setting a cgroup v1 net_cls tag will trigger the
> bypass despite cgroup v2 BPF programs attached to the namespace root.
>
> Generally, this mutual exclusiveness does not hold anymore in today's
> user environments and makes cgroup v2 usage from the BPF side fragile
> and unreliable. This fix adds a proper struct cgroup pointer for the
> cgroup v2 case to struct sock_cgroup_data in order to address these
> issues; this implicitly also fixes the tradeoffs made back then with
> regard to races and refcount leaks as stated in bd1060a1d671, and
> removes the fallback, so that cgroup v2 BPF programs always operate as
> expected.
>
>   [0] https://github.com/nestybox/sysbox/
>   [1] https://kind.sigs.k8s.io/
>
> Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
> Signed-off-by: Daniel Borkmann
> Cc: David S. Miller
> Cc: Tejun Heo
> Cc: Martynas Pumputis
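To make the bypass concrete: the policy-enforcement programs the commit
message refers to look roughly like the following minimal sketch (a
hypothetical cgroup/connect4 program; file and function names are
illustrative, not taken from the patch):

  // deny_connect4.bpf.c -- reject all IPv4 connect(2) calls for sockets
  // in the cgroup this program is attached to. With the pre-patch root
  // fallback, a v1 net_cls tag would silently route sockets past it.
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  SEC("cgroup/connect4")
  int deny_connect4(struct bpf_sock_addr *ctx)
  {
          return 0;       /* 0 = reject with -EPERM, 1 = allow */
  }

  char _license[] SEC("license") = "GPL";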
> ---
>  include/linux/cgroup-defs.h  | 107 +++++++++--------------------------
>  include/linux/cgroup.h       |  22 +------
>  kernel/cgroup/cgroup.c       |  50 ++++------------
>  net/core/netclassid_cgroup.c |   7 +--
>  net/core/netprio_cgroup.c    |  10 +---
>  5 files changed, 41 insertions(+), 155 deletions(-)
>
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index e1c705fdfa7c..44446025741f 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -752,107 +752,54 @@ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) {}
>   * sock_cgroup_data is embedded at sock->sk_cgrp_data and contains
>   * per-socket cgroup information except for memcg association.
>   *
> - * On legacy hierarchies, net_prio and net_cls controllers directly set
> - * attributes on each sock which can then be tested by the network layer.
> - * On the default hierarchy, each sock is associated with the cgroup it was
> - * created in and the networking layer can match the cgroup directly.
> - *
> - * To avoid carrying all three cgroup related fields separately in sock,
> - * sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer.
> - * On boot, sock_cgroup_data records the cgroup that the sock was created
> - * in so that cgroup2 matches can be made; however, once either net_prio or
> - * net_cls starts being used, the area is overridden to carry prioidx and/or
> - * classid.  The two modes are distinguished by whether the lowest bit is
> - * set.  Clear bit indicates cgroup pointer while set bit prioidx and
> - * classid.
> - *
> - * While userland may start using net_prio or net_cls at any time, once
> - * either is used, cgroup2 matching no longer works.  There is no reason to
> - * mix the two and this is in line with how legacy and v2 compatibility is
> - * handled.  On mode switch, cgroup references which are already being
> - * pointed to by socks may be leaked.  While this can be remedied by adding
> - * synchronization around sock_cgroup_data, given that the number of leaked
> - * cgroups is bound and highly unlikely to be high, this seems to be the
> - * better trade-off.
> + * On legacy hierarchies, net_prio and net_cls controllers directly
> + * set attributes on each sock which can then be tested by the network
> + * layer. On the default hierarchy, each sock is associated with the
> + * cgroup it was created in and the networking layer can match the
> + * cgroup directly.
>   */
>  struct sock_cgroup_data {
> -	union {
> -#ifdef __LITTLE_ENDIAN
> -		struct {
> -			u8	is_data : 1;
> -			u8	no_refcnt : 1;
> -			u8	unused : 6;
> -			u8	padding;
> -			u16	prioidx;
> -			u32	classid;
> -		} __packed;
> -#else
> -		struct {
> -			u32	classid;
> -			u16	prioidx;
> -			u8	padding;
> -			u8	unused : 6;
> -			u8	no_refcnt : 1;
> -			u8	is_data : 1;
> -		} __packed;
> +	struct cgroup	*cgroup; /* v2 */
> +#if defined(CONFIG_CGROUP_NET_CLASSID)
> +	u32		classid; /* v1 */
> +#endif
> +#if defined(CONFIG_CGROUP_NET_PRIO)
> +	u16		prioidx; /* v1 */
>  #endif
> -		u64	val;
> -	};
>  };
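(To make the size tradeoff from the commit message concrete -- a
hypothetical, self-contained sketch mirroring the new layout, assuming a
typical 64-bit ABI with both v1 controllers enabled; names are
illustrative, not from the patch:)

  #include <stdint.h>

  struct skcd_sketch {
          void     *cgroup;   /* v2 pointer: 8 bytes */
          uint32_t  classid;  /* v1: 4 bytes */
          uint16_t  prioidx;  /* v1: 2 bytes + 2 bytes tail padding */
  };

  /* The old union packed everything into one u64, i.e. 8 bytes; the new
   * struct doubles that -- the cost bd1060a1d671 originally avoided. */
  _Static_assert(sizeof(struct skcd_sketch) == 16, "16 vs. old 8 bytes");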
>
> -/*
> - * There's a theoretical window where the following accessors race with
> - * updaters and return part of the previous pointer as the prioidx or
> - * classid.  Such races are short-lived and the result isn't critical.
> - */
>  static inline u16 sock_cgroup_prioidx(const struct sock_cgroup_data *skcd)
>  {
> -	/* fallback to 1 which is always the ID of the root cgroup */
> -	return (skcd->is_data & 1) ? skcd->prioidx : 1;
> +#if defined(CONFIG_CGROUP_NET_PRIO)
> +	return READ_ONCE(skcd->prioidx);
> +#else
> +	return 1;
> +#endif
>  }
>
>  static inline u32 sock_cgroup_classid(const struct sock_cgroup_data *skcd)
>  {
> -	/* fallback to 0 which is the unconfigured default classid */
> -	return (skcd->is_data & 1) ? skcd->classid : 0;
> +#if defined(CONFIG_CGROUP_NET_CLASSID)
> +	return READ_ONCE(skcd->classid);
> +#else
> +	return 0;
> +#endif
>  }
>
> -/*
> - * If invoked concurrently, the updaters may clobber each other.  The
> - * caller is responsible for synchronization.
> - */
>  static inline void sock_cgroup_set_prioidx(struct sock_cgroup_data *skcd,
>  					   u16 prioidx)
>  {
> -	struct sock_cgroup_data skcd_buf = {{ .val = READ_ONCE(skcd->val) }};
> -
> -	if (sock_cgroup_prioidx(&skcd_buf) == prioidx)
> -		return;
> -
> -	if (!(skcd_buf.is_data & 1)) {
> -		skcd_buf.val = 0;
> -		skcd_buf.is_data = 1;
> -	}
> -
> -	skcd_buf.prioidx = prioidx;
> -	WRITE_ONCE(skcd->val, skcd_buf.val);	/* see sock_cgroup_ptr() */
> +#if defined(CONFIG_CGROUP_NET_PRIO)
> +	WRITE_ONCE(skcd->prioidx, prioidx);
> +#endif
>  }
>
>  static inline void sock_cgroup_set_classid(struct sock_cgroup_data *skcd,
>  					   u32 classid)
>  {
> -	struct sock_cgroup_data skcd_buf = {{ .val = READ_ONCE(skcd->val) }};
> -
> -	if (sock_cgroup_classid(&skcd_buf) == classid)
> -		return;
> -
> -	if (!(skcd_buf.is_data & 1)) {
> -		skcd_buf.val = 0;
> -		skcd_buf.is_data = 1;
> -	}
> -
> -	skcd_buf.classid = classid;
> -	WRITE_ONCE(skcd->val, skcd_buf.val);	/* see sock_cgroup_ptr() */
> +#if defined(CONFIG_CGROUP_NET_CLASSID)
> +	WRITE_ONCE(skcd->classid, classid);
> +#endif
>  }
>
>  #else	/* CONFIG_SOCK_CGROUP_DATA */
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 7bf60454a313..a7e79ad7c9b0 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -829,33 +829,13 @@ static inline void cgroup_account_cputime_field(struct task_struct *task,
>   */
>
>  #ifdef CONFIG_SOCK_CGROUP_DATA
>
> -#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
> -extern spinlock_t cgroup_sk_update_lock;
> -#endif
> -
> -void cgroup_sk_alloc_disable(void);
>  void cgroup_sk_alloc(struct sock_cgroup_data *skcd);
>  void cgroup_sk_clone(struct sock_cgroup_data *skcd);
>  void cgroup_sk_free(struct sock_cgroup_data *skcd);
>
>  static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
>  {
> -#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
> -	unsigned long v;
> -
> -	/*
> -	 * @skcd->val is 64bit but the following is safe on 32bit too as we
> -	 * just need the lower ulong to be written and read atomically.
> -	 */
> -	v = READ_ONCE(skcd->val);
> -
> -	if (v & 3)
> -		return &cgrp_dfl_root.cgrp;
> -
> -	return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
> -#else
> -	return (struct cgroup *)(unsigned long)skcd->val;
> -#endif
> +	return READ_ONCE(skcd->cgroup);

Do we really need READ_ONCE here? I was always assuming it was there
because we were flipping that lower bit. Now that it's a simple pointer,
why not 'return skcd->cgroup' instead?
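(For reference, the store this read pairs with is in cgroup_sk_alloc()
later in the patch -- a minimal sketch of the pattern, reduced from the
diff:)

  /* writer, cgroup_sk_alloc(), under rcu_read_lock(): */
  WRITE_ONCE(skcd->cgroup, cset->dfl_cgrp);

  /* reader, sock_cgroup_ptr(), potentially concurrent: */
  return READ_ONCE(skcd->cgroup);

A naturally aligned pointer load is not expected to tear on the
architectures Linux supports, so arguably the pair now only documents
the lockless access (and keeps tools like KCSAN from flagging it) rather
than being needed for correctness -- which is exactly the question.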
>  }
>
>  #else	/* CONFIG_CGROUP_DATA */
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 881ce1470beb..15ad5c8b24a8 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -6572,74 +6572,44 @@ int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v)
>   */
>  #ifdef CONFIG_SOCK_CGROUP_DATA
>
> -#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
> -
> -DEFINE_SPINLOCK(cgroup_sk_update_lock);
> -static bool cgroup_sk_alloc_disabled __read_mostly;
> -
> -void cgroup_sk_alloc_disable(void)
> -{
> -	if (cgroup_sk_alloc_disabled)
> -		return;
> -	pr_info("cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation\n");
> -	cgroup_sk_alloc_disabled = true;
> -}
> -
> -#else
> -
> -#define cgroup_sk_alloc_disabled	false
> -
> -#endif
> -
>  void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
>  {
> -	if (cgroup_sk_alloc_disabled) {
> -		skcd->no_refcnt = 1;
> -		return;
> -	}
> -
>  	/* Don't associate the sock with unrelated interrupted task's cgroup. */
>  	if (in_interrupt())
>  		return;
>
>  	rcu_read_lock();
> -
>  	while (true) {
>  		struct css_set *cset;
>
>  		cset = task_css_set(current);
>  		if (likely(cgroup_tryget(cset->dfl_cgrp))) {
> -			skcd->val = (unsigned long)cset->dfl_cgrp;
> +			WRITE_ONCE(skcd->cgroup, cset->dfl_cgrp);
>  			cgroup_bpf_get(cset->dfl_cgrp);
>  			break;
>  		}
>  		cpu_relax();
>  	}
> -
>  	rcu_read_unlock();
>  }
>
>  void cgroup_sk_clone(struct sock_cgroup_data *skcd)
>  {
> -	if (skcd->val) {
> -		if (skcd->no_refcnt)
> -			return;
> -		/*
> -		 * We might be cloning a socket which is left in an empty
> -		 * cgroup and the cgroup might have already been rmdir'd.
> -		 * Don't use cgroup_get_live().
> -		 */
> -		cgroup_get(sock_cgroup_ptr(skcd));
> -		cgroup_bpf_get(sock_cgroup_ptr(skcd));
> -	}
> +	struct cgroup *cgrp = sock_cgroup_ptr(skcd);
> +
> +	/*
> +	 * We might be cloning a socket which is left in an empty
> +	 * cgroup and the cgroup might have already been rmdir'd.
> +	 * Don't use cgroup_get_live().
> +	 */
> +	cgroup_get(cgrp);
> +	cgroup_bpf_get(cgrp);
>  }
>
>  void cgroup_sk_free(struct sock_cgroup_data *skcd)
>  {
>  	struct cgroup *cgrp = sock_cgroup_ptr(skcd);
>
> -	if (skcd->no_refcnt)
> -		return;
> -
>  	cgroup_bpf_put(cgrp);
>  	cgroup_put(cgrp);
>  }
>
> diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
> index b49c57d35a88..1a6a86693b74 100644
> --- a/net/core/netclassid_cgroup.c
> +++ b/net/core/netclassid_cgroup.c
> @@ -71,11 +71,8 @@ static int update_classid_sock(const void *v, struct file *file, unsigned n)
>  	struct update_classid_context *ctx = (void *)v;
>  	struct socket *sock = sock_from_file(file);
>
> -	if (sock) {
> -		spin_lock(&cgroup_sk_update_lock);
> +	if (sock)
>  		sock_cgroup_set_classid(&sock->sk->sk_cgrp_data, ctx->classid);
> -		spin_unlock(&cgroup_sk_update_lock);
> -	}
>  	if (--ctx->batch == 0) {
>  		ctx->batch = UPDATE_CLASSID_BATCH;
>  		return n + 1;
> @@ -121,8 +118,6 @@ static int write_classid(struct cgroup_subsys_state *css, struct cftype *cft,
>  	struct css_task_iter it;
>  	struct task_struct *p;
>
> -	cgroup_sk_alloc_disable();
> -
>  	cs->classid = (u32)value;
>
>  	css_task_iter_start(css, 0, &it);
> diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
> index 99a431c56f23..8456dfbe2eb4 100644
> --- a/net/core/netprio_cgroup.c
> +++ b/net/core/netprio_cgroup.c
> @@ -207,8 +207,6 @@ static ssize_t write_priomap(struct kernfs_open_file *of,
>  	if (!dev)
>  		return -ENODEV;
>
> -	cgroup_sk_alloc_disable();
> -
>  	rtnl_lock();
>
>  	ret = netprio_set_prio(of_css(of), dev, prio);
> @@ -221,12 +219,10 @@ static ssize_t write_priomap(struct kernfs_open_file *of,
>  static int update_netprio(const void *v, struct file *file, unsigned n)
>  {
>  	struct socket *sock = sock_from_file(file);
> -	if (sock) {
> -		spin_lock(&cgroup_sk_update_lock);
> +
> +	if (sock)
>  		sock_cgroup_set_prioidx(&sock->sk->sk_cgrp_data,
>  					(unsigned long)v);
> -		spin_unlock(&cgroup_sk_update_lock);
> -	}
>  	return 0;
>  }
>
> @@ -235,8 +231,6 @@ static void net_prio_attach(struct cgroup_taskset *tset)
>  	struct task_struct *p;
>  	struct cgroup_subsys_state *css;
>
> -	cgroup_sk_alloc_disable();
> -
>  	cgroup_taskset_for_each(p, css, tset) {
>  		void *v = (void *)(unsigned long)css->id;
>
> --
> 2.21.0
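(For completeness, a sketch of how such a v2 program ends up attached to
a non-root cgroup path in the scenarios the commit message describes --
hypothetical libbpf usage; object, program, and cgroup names are
illustrative, and error handling is reduced to the bare minimum:)

  #include <bpf/libbpf.h>
  #include <fcntl.h>
  #include <unistd.h>

  int main(void)
  {
          /* names match the earlier sketch, not anything in the patch */
          struct bpf_object *obj =
                  bpf_object__open_file("deny_connect4.bpf.o", NULL);
          struct bpf_program *prog;
          struct bpf_link *link;
          int cgfd;

          if (!obj || bpf_object__load(obj))
                  return 1;
          prog = bpf_object__find_program_by_name(obj, "deny_connect4");
          cgfd = open("/sys/fs/cgroup/mypod", O_RDONLY | O_DIRECTORY);
          if (!prog || cgfd < 0)
                  return 1;
          link = bpf_program__attach_cgroup(prog, cgfd);
          if (!link)
                  return 1;
          /* In the pre-patch mixed mode, tagging tasks in this cgroup
           * with a v1 net_cls classid would flip sock_cgroup_ptr() to
           * the v2 root and silently bypass the attached program. */
          pause();
          return 0;
  }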