From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 1 Aug 2023 22:29:58 -0500
From: David Vernet
To: Alexei Starovoitov
Cc: Yafang Shao, Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
    John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
    Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf
Subject: Re: [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu
Message-ID: <20230802032958.GB472124@maniforge>
References: <20230801142912.55078-1-laoar.shao@gmail.com>
    <3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev>

On Tue, Aug 01, 2023 at 07:45:57PM -0700, Alexei Starovoitov wrote:
> On Tue, Aug 1, 2023 at 7:34 PM Yafang Shao wrote:
> > >
> > > In kernel, we have a global variable
> > > nr_cpu_ids (also in kernel/bpf/helpers.c)
> > > which is used in numerous places for per cpu data struct access.
> > >
> > > I am wondering whether we could have bpf code like
> > >
> > >         int nr_cpu_ids __ksym;

I think this would be useful in general, though any __ksym variable like
this would have to be const and mapped in .rodata, right? But yeah, being
able to R/O map global variables like this which have static lifetimes
would be nice.

It's not quite the same thing as nr_cpu_ids, but FWIW, you could
accomplish something close to this by doing something like this in your
BPF prog:

/* Set in user space to libbpf_num_possible_cpus() */
const volatile __u32 nr_cpus;

...

__u32 i;

bpf_for(i, 0, nr_cpus)
        bpf_printk("Iterating over cpu %u", i);

...
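For completeness, the user space half of that pattern is just setting the
rodata variable through the skeleton before the prog is loaded. A minimal
sketch, assuming a skeleton generated from a prog called iter.bpf.c (both
names here are placeholders):

#include <bpf/libbpf.h>
#include "iter.skel.h" /* hypothetical skeleton from iter.bpf.c */

int main(void)
{
        struct iter_bpf *skel;
        int err;

        skel = iter_bpf__open();
        if (!skel)
                return 1;

        /* const volatile globals live in .rodata and must be set before
         * load; they're frozen once the prog is loaded.
         */
        skel->rodata->nr_cpus = libbpf_num_possible_cpus();

        err = iter_bpf__load(skel);
        if (!err)
                err = iter_bpf__attach(skel);

        iter_bpf__destroy(skel);
        return err ? 1 : 0;
}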
> > > struct bpf_iter_num it;
> > > int i = 0;
> > > int *v;
> > >
> > > // nr_cpu_ids is special, we can give it a range [1, CONFIG_NR_CPUS].
> > > bpf_iter_num_new(&it, 1, nr_cpu_ids);
> > > while ((v = bpf_iter_num_next(&it))) {
> > >         /* access cpu i data */
> > >         i++;
> > > }
> > > bpf_iter_num_destroy(&it);
> > >
> > > From all existing open-coded iterator loops, it looks like the
> > > upper bound has to be a constant. We might need to extend support
> > > to a bounded scalar upper bound if it isn't there already.
> >
> > Currently the upper bound is required by both the open-coded for loop
> > and bpf_loop. I think we can extend it.
> >
> > It can't handle the cpumask case either:
> >
> >         for_each_cpu(cpu, mask)
> >
> > In the 'mask', the CPU IDs might not be contiguous. In our container
> > environment, we always use the cpuset cgroup for some critical tasks,
> > but it is not so convenient to traverse the percpu data of this cpuset
> > cgroup. We have to do it as follows for this case:
> >
> > That's why we prefer to introduce a bpf_for_each_cpu helper. It is
> > fine if it can be implemented as a kfunc.
>
> I think open-coded iterators are the only acceptable path forward here.
> Since the existing bpf_iter_num doesn't fit due to sparse cpumasks,
> let's introduce bpf_iter_cpumask and a few additional kfuncs
> that return cpu_possible_mask and others.

I agree that this is the correct way to generalize this. The only thing
that we'll have to figure out is how to generalize treating
const struct cpumask * objects as kptrs.

In sched_ext [0], we export the scx_bpf_get_idle_cpumask() and
scx_bpf_get_idle_smtmask() kfuncs to return trusted global cpumask kptrs
that can then be "released" with scx_bpf_put_idle_cpumask().
scx_bpf_put_idle_cpumask() is empty, and exists only to convince the
verifier that the trusted cpumask kptrs aren't being leaked and have had
their references "dropped".

[0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/

I'd imagine that we have two ways forward if we want to enable progs to
fetch other global cpumasks with static lifetimes (e.g.
__cpu_possible_mask or nohz.idle_cpus_mask):

1. The most straightforward thing to do would be to add a new kfunc in
   kernel/bpf/cpumask.c that's a drop-in replacement for
   scx_bpf_put_idle_cpumask():

   void bpf_global_cpumask_drop(const struct cpumask *cpumask)
   {}

2. Another would be to implement something resembling what Yonghong
   suggested in [1], where progs can link against globally allocated
   kptrs like:

   const struct cpumask *__cpu_possible_mask __ksym;

[1]: https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev/#t

In my opinion, (1) is more straightforward, but (2) is a better UX. A
sketch of how either would combine with the proposed iterator follows
below.

Note again that both approaches only work for cpumasks with static
lifetimes. I can't think of a way to treat dynamically allocated
struct cpumask * objects as kptrs, as there's nowhere to put a
reference. If someone wants to track a dynamically allocated cpumask,
they'd have to create a kptr out of its container object, and then pass
that object's cpumask as a const struct cpumask * to other BPF cpumask
kfuncs (including, e.g., the proposed iterator).
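For the static-lifetime case, here's roughly the shape I'd expect a prog
to take. To be clear, every kfunc in this snippet is hypothetical: I'm
just mirroring the existing bpf_iter_num new/next/destroy pattern, and
assuming some accessor for a trusted global mask plus the empty "drop"
kfunc from option (1):

/* All hypothetical -- modeled on the bpf_iter_num pattern. */
extern int bpf_iter_cpumask_new(struct bpf_iter_cpumask *it,
                                const struct cpumask *mask) __ksym;
extern int *bpf_iter_cpumask_next(struct bpf_iter_cpumask *it) __ksym;
extern void bpf_iter_cpumask_destroy(struct bpf_iter_cpumask *it) __ksym;

/* Also hypothetical -- fetch a trusted global cpumask kptr, paired with
 * option (1)'s empty release kfunc.
 */
extern const struct cpumask *bpf_cpu_possible_mask_get(void) __ksym;
extern void bpf_global_cpumask_drop(const struct cpumask *cpumask) __ksym;

...

struct bpf_iter_cpumask it;
const struct cpumask *mask;
int *cpu;

mask = bpf_cpu_possible_mask_get();
if (!mask)
        return 0;

bpf_iter_cpumask_new(&it, mask);
while ((cpu = bpf_iter_cpumask_next(&it))) {
        /* access per-cpu data for CPU *cpu */
}
bpf_iter_cpumask_destroy(&it);
bpf_global_cpumask_drop(mask);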
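For the dynamically allocated case, the container-object approach already
works today with struct bpf_cpumask. Here's a sketch using only existing
kfuncs and helpers (the map layout, prog type, and names are arbitrary
choices for illustration):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

/* Existing kfuncs from kernel/bpf/cpumask.c. */
extern struct bpf_cpumask *bpf_cpumask_create(void) __ksym;
extern void bpf_cpumask_release(struct bpf_cpumask *cpumask) __ksym;
extern void bpf_cpumask_set_cpu(u32 cpu, struct bpf_cpumask *cpumask) __ksym;
extern bool bpf_cpumask_test_cpu(u32 cpu, const struct cpumask *cpumask) __ksym;

struct map_value {
        struct bpf_cpumask __kptr *mask;
};

struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, u32);
        __type(value, struct map_value);
} mask_map SEC(".maps");

SEC("syscall")
int stash_mask(void *ctx)
{
        struct bpf_cpumask *mask, *old;
        struct map_value *v;
        u32 key = 0;

        mask = bpf_cpumask_create();
        if (!mask)
                return 1;

        bpf_cpumask_set_cpu(0, mask);

        /* The refcounted container object (struct bpf_cpumask) is the
         * kptr; its embedded cpumask can be passed to any kfunc taking a
         * const struct cpumask *, like the proposed iterator.
         */
        if (!bpf_cpumask_test_cpu(0, (const struct cpumask *)mask)) {
                bpf_cpumask_release(mask);
                return 1;
        }

        v = bpf_map_lookup_elem(&mask_map, &key);
        if (!v) {
                bpf_cpumask_release(mask);
                return 1;
        }

        /* Stash the container in the map, releasing any old value. */
        old = bpf_kptr_xchg(&v->mask, mask);
        if (old)
                bpf_cpumask_release(old);

        return 0;
}

char LICENSE[] SEC("license") = "GPL";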
> We already have some cpumask support in kernel/bpf/cpumask.c.
> bpf_iter_cpumask will be a natural follow-up.

Yes, this should be easy to add.

- David