From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E9D5CC433EF for ; Tue, 28 Jun 2022 17:04:01 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231295AbiF1REA (ORCPT ); Tue, 28 Jun 2022 13:04:00 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:47680 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232474AbiF1RDs (ORCPT ); Tue, 28 Jun 2022 13:03:48 -0400 Received: from mail-pl1-x631.google.com (mail-pl1-x631.google.com [IPv6:2607:f8b0:4864:20::631]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9A51511824 for ; Tue, 28 Jun 2022 10:03:47 -0700 (PDT) Received: by mail-pl1-x631.google.com with SMTP id c4so11615587plc.8 for ; Tue, 28 Jun 2022 10:03:47 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=OU8WUB6QZv3fzXCyA8VX1te4xsGWLQRtBW2oFDLfWyA=; b=eVenewKW8iLFOd58gS2pf8kiu9IT31PyyWl0pIqOay7A39me1hIT/dn429X1+DGUvG a4XO7Uu7NawQiEuEy+Fu4WDtDcGBkEJsrqx6e+rN1RJWdjfQWsOOJ87Gw5on0mUvd9X2 o1CUXjq7cS/9DbN9Vr2KTCQHKVGih4hN+2S6Nx/hjhb34cwcYTj/qQwAYvhe3seoA9Qr B+HN/qy1RyTnpSl6cQMVF6uAkuOSWLrp1k5Esb3pS6bl2Lq6Xy1De45xGNok8bDaiYXA RO+Dn8y90jzlMvb5XcW+Z3GJnVEcnUKrUdI4aR54B+ETduHp6t0nYtvILDAt7pWnSV/U DtwQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=OU8WUB6QZv3fzXCyA8VX1te4xsGWLQRtBW2oFDLfWyA=; b=BRCxb94uFNPIGU/2gEm30jkO7OhLR+/f0Jzw2kyqwfpE0Rv20GedWU9fSvuiSvhz88 19V4eKN/hcEzNjMtz84DWE7GkTV4Fp+sbp1JJDRVS9OP2xQ/WRmZ9G2pF1jjQnaS/oUa XTy+ZIUnTbBJYME5AZs+efxAVWij+pAVrdqZtkWso2KExlKD9NMudUL6blKwg/NaBNfv zHh+VjmXlfem9Un8AP5Cvcz+5fBKC4Q/w6mGVgD0SeDHxwtboMOiNNkYPhE5E8MUC2xH uqT3aDUcUtRlLae7QqRTb996eFZo42mvRZsAov0Ely//QTyZKrEXAxxrdfF8K63mrIZO uJBg== X-Gm-Message-State: AJIora+XkIFP+sggyBsxMJNUOxZj2czqd2qsXUdII3TFHW+/YV6GeE2F AFIkd02WKKwMpgiMLFv2NLU= X-Google-Smtp-Source: AGRyM1ui/q1saVihJEUNQmGeDO5B6I9PqLvtGxAYvreR9CUUZffg3O2W//t5c0Vd2JfQidUWwVszJg== X-Received: by 2002:a17:902:6bc6:b0:16a:569d:33da with SMTP id m6-20020a1709026bc600b0016a569d33damr5737146plt.59.1656435827001; Tue, 28 Jun 2022 10:03:47 -0700 (PDT) Received: from MacBook-Pro-3.local ([2620:10d:c090:500::1:50]) by smtp.gmail.com with ESMTPSA id 9-20020a170902c20900b0015e8d4eb1dfsm9607528pll.41.2022.06.28.10.03.45 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 28 Jun 2022 10:03:46 -0700 (PDT) Date: Tue, 28 Jun 2022 10:03:43 -0700 From: Alexei Starovoitov To: Christoph Lameter Cc: Christoph Hellwig , David Miller , Daniel Borkmann , Andrii Nakryiko , Tejun Heo , Martin KaFai Lau , bpf , Kernel Team , linux-mm , Pekka Enberg , David Rientjes , Joonsoo Kim , Andrew Morton , Vlastimil Babka Subject: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator. Message-ID: <20220628170343.ng46xfwi32vefiyp@MacBook-Pro-3.local> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org On Tue, Jun 28, 2022 at 03:57:54PM +0200, Christoph Lameter wrote: > On Mon, 27 Jun 2022, Alexei Starovoitov wrote: > > > On Mon, Jun 27, 2022 at 5:17 PM Christoph Lameter wrote: > > > > > > > From: Alexei Starovoitov > > > > > > > > Introduce any context BPF specific memory allocator. > > > > > > > > Tracing BPF programs can attach to kprobe and fentry. Hence they > > > > run in unknown context where calling plain kmalloc() might not be safe. > > > > Front-end kmalloc() with per-cpu per-bucket cache of free elements. > > > > Refill this cache asynchronously from irq_work. > > > > > > GFP_ATOMIC etc is not going to work for you? > > > > slab_alloc_node->slab_alloc->local_lock_irqsave > > kprobe -> bpf prog -> slab_alloc_node -> deadlock. > > In other words, the slow path of slab allocator takes locks. > > That is a relatively new feature due to RT logic support. without RT this > would be a simple irq disable. Not just RT. It's a slow path: if (IS_ENABLED(CONFIG_PREEMPT_RT) || unlikely(!object || !slab || !node_match(slab, node))) { local_unlock_irqrestore(&s->cpu_slab->lock,...); and that's not the only lock in there. new_slab->allocate_slab... alloc_pages grabbing more locks. > Generally doing slab allocation while debugging slab allocation is not > something that can work. Can we exempt RT locks/irqsave or slab alloc from > BPF tracing? People started doing lock profiling with bpf back in 2017. People do rcu profiling now and attaching bpf progs to all kinds of low level kernel internals: page alloc, etc. > I would assume that other key items of kernel logic will have similar > issues. We're _not_ asking for any changes from mm/slab side. Things were working all these years. We're making them more efficient now by getting rid of 'lets prealloc everything' approach. > > Which makes it unsafe to use from tracing bpf progs. > > That's why we preallocated all elements in bpf maps, > > so there are no calls to mm or rcu logic. > > bpf specific allocator cannot use locks at all. > > try_lock approach could have been used in alloc path, > > but free path cannot fail with try_lock. > > Hence the algorithm in this patch is purely lockless. > > bpf prog can attach to spin_unlock_irqrestore and > > safely do bpf_mem_alloc. > > That is generally safe unless you get into reetrance issues with memory > allocation. Right. Generic slab/mm/page_alloc/rcu are not ready for reentrance and are not safe from NMI either. That's why we're added all kinds of safey mechanisms in bpf layers. > Which begs the question: > > What happens if I try to use BPF to trace *your* shiny new memory 'shiny and new' is overstatement. It's a trivial lock less freelist layer on top of kmalloc. Please read the patch. > allocation functions in the BPF logic like bpf_mem_alloc? How do you stop > that from happening? here is the comment in the patch: /* notrace is necessary here and in other functions to make sure * bpf programs cannot attach to them and cause llist corruptions. */