From: Uladzislau Rezki
Date: Wed, 20 May 2026 12:01:44 +0200
To: Harry Yoo
Cc: Uladzislau Rezki, Andrew Morton, Vlastimil Babka, Christoph Lameter,
 David Rientjes, Roman Gushchin, Hao Li, Alexei Starovoitov,
 "Paul E . McKenney", Frederic Weisbecker, Neeraj Upadhyay,
 Joel Fernandes, Josh Triplett, Boqun Feng, Zqiang, Steven Rostedt,
 Mathieu Desnoyers, Lai Jiangshan, rcu@vger.kernel.org,
 linux-mm@kvack.org, bpf@vger.kernel.org
Subject: Re: [PATCH 4/8] mm/slab: introduce kfree_rcu_nolock()
References: <20260416091022.36823-1-harry@kernel.org>
 <20260416091022.36823-5-harry@kernel.org>
 <3s4jafam3la72a6y3dkfvhtzxk3fsngb2cka3bpfqrirl5m633@pz3vzizefoxb>
 <82d2145a-9b41-4ee4-b980-e7bd5d12f035@kernel.org>
In-Reply-To: <82d2145a-9b41-4ee4-b980-e7bd5d12f035@kernel.org>
X-Mailing-List: bpf@vger.kernel.org

On Tue, May 19, 2026 at 04:44:30PM +0900, Harry Yoo wrote:
> [Resending as it's rejected by mailing lists due to my broken email
> setup. Apologies for the noise.]
>
> *shows up late again after LSFMM and processing some backlog*
>
> On 4/30/26 9:10 PM, Uladzislau Rezki wrote:
> > Hello, Harry!
> >
> > >
> > > Hi Ulad. Apologies for the delayed response.
> > > I meant to reply sooner but sidetracked by other issues.
> > >
> > No problem, sometimes i also can lag because of other tasks :)
> >
> > > Your questions are fair, but let me try to clarify
> > > the current situation.
> > >
> > > And before diving into details, I would like to reiterate that
> > > there are potentially two points to discuss here:
> > >
> > > Point 1. Can we justify complicating subsystems by passing
> > > `allow_spin` parameter all over the place?
> > >
> > Yes, we can. But as i noted i see some drawbacks :)
> >
> > - all new incoming patches have to respect that new third argument;
>
> That is true :)
>
This is what I would like to avoid :)

> > - the fallback mechanism which uses irq-work is not optimal in my
> >   opinion:
>
> In most cases it would not fall back because most likely trylock would
> succeed. If most of the calls do not fall back, a bit of suboptimality
> on the fallback path is acceptable.
>
> >   a) We introduce an extra window between queuing a pointer, mark
> >      irq-work to be executed and then reenter the kfree_rcu() with
> >      no-sync flag and now we need to wait a GP for them. But the GP
> >      might be already passed for such pointers. So we potentially
> >      need more time to offload. This is rather minus.
> >
> >   b) Since it is for BPF, allow_spin is always false, thus only
> >      fallback path is used. Decoupling comes to mind.
>
> No, allow_spin == true means spinning on a lock is safe. If allow_spin
> is false, it would do a trylock instead of spinning, and it is expected
> to succeed most of the time. As long as trylock succeeds, it uses the
> same data structures as the existing kvfree_rcu batching without fallback.
> >
> >   c)
> >
> > Why should we mix those? What it is worth to do, is to prevent mixing
> > "unknown path which is for BPF/others" with generic kfree_rcu().
>
> Because we want to reuse the existing kvfree_rcu batching infrastructure
> without reinventing a new feature to do the same thing.
>
> The intent is to avoid the fallback in most cases when allow_spin is
> false, with fallback being there for correctness.
>
The problem is that the trylock behaviour is random, i.e. it is not
deterministic.
If you apply some noise you end up kicking both paths anyway. If the
idea is to reuse the "existing kvfree_rcu batching", you need to access
the array in a lock-free way. If you can do that, I would agree with it.

>
> > > Point 2. Can we avoid adding this complexity to kvfree_rcu() and
> > > let slab handle it instead? (as mentioned in [4])
> > >
> > it depends if BPF people want to free a pointer using RCU machinery?
> > Do you know if that an intention?
>
> They want to free slab objects after RCU grace period. Freeing slab
> objects without involving RCU is already supported by kfree_nolock().
> (There are other use cases as well, as recently posted in [1])
>
> I meant RCU sheaves can handle freeing slab objects after RCU grace
> period, and kfree_rcu_nolock() users don't need to handle vmalloc pages.
> So technically we don't have to add this complexity to kvfree_rcu
> batching and handle it in slab.
>
> But to do that, we shouldn't disable kfree_rcu_sheaf() completely on RT.
> Apparently Vlastimil has a suggestion to address this, and I'm going to
> digest his suggestion and explore that aspect.
>
> [1] https://lore.kernel.org/linux-mm/esepccfhqg7m6jo76ns2znj2cnuaepx2xvw5zaygtwohq4psma@563ypprp6rr3
>
> [2] https://lore.kernel.org/linux-mm/6811cc17-8ee4-48c8-8cbf-6bf4d9f98162@kernel.org
>
> > > On Point 1: IMHO it could be justified, but at the same time I hope we
> > > end up avoiding more complexity in the long term by working on Point 2.
> > >
> > > This reply focuses only on Point 1 and explains why it could be
> > > justified.
> > >
> > > On Thu, Apr 23, 2026 at 01:35:25PM +0200, Uladzislau Rezki wrote:
> > > > On Thu, Apr 23, 2026 at 01:23:25PM +0900, Harry Yoo (Oracle) wrote:
> > > > > On Wed, Apr 22, 2026 at 04:42:28PM +0200, Uladzislau Rezki wrote:
> > > > > How much performance do we sacrifice compared to
> > > > > letting them go through the kvfree_rcu() fastpath?
> > > >
> > > > Freeing an object over RCU from
> > > > NMI context is a corner case.
> > > > It is _not_ generic.
> > >
> > > First, I want to clarify that kfree_rcu_nolock() is not just for NMI
> > > context. It is intended to be used when the context is unknown (because
> > > it can be called in arbitrary code locations).
> > >
> > When we say "unknown" to me it sounds like a worst case, which is NMI :)
>
> If we say "allow_spin = false assumes the most restrictive context, such
> as NMI context", that is misleading. It sounds like we always fall back,
> but we don't. Even when the context is unknown, fallback isn't required
> most of the time.
>
> So I would like to say "the context is unknown", meaning that
> technically kvfree_rcu could be re-entered in the middle of kvfree_rcu
> and we need to be able to handle that for correctness (although in most
> cases there's no re-entry and no fallback).
>
> > > There are two kinds of problematic situations where BPF programs
> > > are attached to:
> > >
> > > - 1) a tracepoint or a function that can be invoked in a critical
> > >   section (w/ a lock held), or
> > >
> > > - 2) a function that can be called in an NMI context, which might
> > >   preempt an arbitrary context holding a lock.
> > >
> > > While 1) and 2) are not (I think) dominant use cases, and although
> > > most of users can legally call kvfree_rcu(), BPF can't use kvfree_rcu()
> > > and must consider the most restrictive contexts.
> > >
> > > > We even do not have (now
> > > > in mainline) users because we never support it from NMI,
> > > > just like call_rcu().
> > >
> > > Unfortunately, we've had this use case (of allocating memory for BPF
> > > programs) for a long time in the mainline. There are two current
> > > approaches to mitigate the limitation:
> > >
> > > - 1) Pre-allocate all memory. e.g.) allocate all hash table elements
> > >   when creating a BPF map, rather than allocating them on demand.
> > >   This ensures correctness but sacrifices memory.
> > >
> > > - 2) Use the BPF-specific memory allocator [1] [2] to allocate memory
> > >   on demand and avoid preallocation. While this wastes less memory
> > >   than 1) and also maintains performance, it is re-inventing yet
> > >   another memory allocator.
> > >
> > >   Also, the allocator reinvented kfree_rcu batching as well.
> > >
> > > Now, we're trying to avoid 1) and 2) as much as possible and use
> > > kmalloc_nolock() instead [3].
> > >
> > > > If BPF needs
> > > > it, then the first question which comes to mind is not about performance.
> > > > It is how to support this case in kfree_rcu() without adding noticeable
> > > > complexity or overhead or hacks to the generic path without making it harder
> > > > to maintain.
> > >
> > > Since there will be only few subsystems that need it, and because
> > > they already use it on production systems, I don't see much value in
> > > maintaining a simple implementation if that compromises performance
> > > (and thus make the transition harder).
> > >
> > > > Performance wise you noted, you mean:
> > > >
> > > > a) call latency (this is probably the most important for NMI)?
> > > > b) memory footprint?
> > > > c) pointer-chasing overhead?
> > >
> > > I think it's either
> > >
> > > - The performance of kfree_rcu_nolock() itself (a), or
> > > - Not disturbing workloads running on the machine (b and c)
> > >
> > > depending on what people use BPF for.
> > >
> > Are you aware of any specific workloads which we can run? To test
> > and see what we have when it comes to performance metrics? I mean
> > exact use cases with exact steps how to trigger them?
> >
> > That would be useful to see on behaviour.
>
> I'll share once I find what BPF folks are using for performance
> benchmarks. (Which means I'm not aware at the moment :D)
>
Please share.

--
Uladzislau Rezki