Date: Fri, 20 Feb 2026 14:35:41 -0300
From: Marcelo Tosatti
To: Vlastimil Babka
Cc: Michal Hocko, Leonardo Bras, linux-kernel@vger.kernel.org,
        cgroups@vger.kernel.org, linux-mm@kvack.org, Johannes Weiner,
        Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
        Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
        Hyeonggon Yoo <42.hyeyoo@gmail.com>, Thomas Gleixner, Waiman Long,
        Boqun Feng, Frederic Weisbecker
Subject: Re: [PATCH 0/4] Introduce QPW for per-cpu operations
References: <20260206143430.021026873@redhat.com> <3f2b985a-2fb0-4d63-9dce-8a9cad8ce464@suse.com>
In-Reply-To: <3f2b985a-2fb0-4d63-9dce-8a9cad8ce464@suse.com>
X-Mailing-List: cgroups@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

Hi Vlastimil,

On Fri, Feb 20, 2026 at 11:48:00AM +0100, Vlastimil Babka wrote:
> On 2/19/26 16:27, Marcelo Tosatti wrote:
> > On Mon, Feb 16, 2026 at 12:00:55PM +0100, Michal Hocko wrote:
> >
> > Michal,
> >
> > Again, I don't see how moving operations to happen at return to
> > kernel would help (assuming you are talking about
> > "context_tracking,x86: Defer some IPIs until a user->kernel transition").
> >
> > The IPIs in the patchset above can be deferred until the user->kernel
> > transition because they are TLB flushes, for addresses which do not
> > exist in the userspace address space mapping.
> >
> > What are the per-CPU objects in SLUB?
> >
> > struct slab_sheaf {
> >         union {
> >                 struct rcu_head rcu_head;
> >                 struct list_head barn_list;
> >                 /* only used for prefilled sheafs */
> >                 struct {
> >                         unsigned int capacity;
> >                         bool pfmemalloc;
> >                 };
> >         };
> >         struct kmem_cache *cache;
> >         unsigned int size;
> >         int node; /* only used for rcu_sheaf */
> >         void *objects[];
> > };
> >
> > struct slub_percpu_sheaves {
> >         local_trylock_t lock;
> >         struct slab_sheaf *main; /* never NULL when unlocked */
> >         struct slab_sheaf *spare; /* empty or full, may be NULL */
> >         struct slab_sheaf *rcu_free; /* for batching kfree_rcu() */
> > };
> >
> > Examples of local CPU operations that manipulate the data structures:
> > 1) kmalloc: allocates an object from the local per-CPU list.
> > 2) kfree: returns an object to the local per-CPU list.
> >
> > Examples of operations that perform changes on the per-CPU lists
> > remotely:
> > kmem_cache_destroy (cache shutdown), kmem_cache_shrink.
> >
> > You can't delay either kmalloc (removal of an object from the per-CPU
> > freelist), or kfree (return of an object to the per-CPU freelist), or
> > kmem_cache_destroy or kmem_cache_shrink until the return to userspace.
> >
> > Am I missing something here? (or do you have something on your mind
> > which I can't see).
>
> Let's try and analyze when we need to do the flushing in SLUB
>
> - memory offline - would anyone do that with isolcpus? if yes, they probably
> deserve the disruption

I think it's OK to avoid memory offline on such systems.

> - cache shrinking (mainly from sysfs handler) - not necessary for
> correctness, can probably skip cpu if needed, also kinda shooting your own
> foot on isolcpu systems
>
> - kmem_cache is being destroyed (__kmem_cache_shutdown()) - this is
> important for correctness. destroying caches should be rare, but can't rule
> it out
>
> - kvfree_rcu_barrier() - a very tricky one; currently has only a debugging
> caller, but that can change
>
> (BTW, see the note in flush_rcu_sheaves_on_cache() and how it relies on the
> flush actually happening on the cpu. Won't QPW violate that?)

The local free path appends to (struct kmem_cache *s)->cpu_sheaves
(percpu)->rcu_free, with the s->cpu_sheaves->lock held:

do_free:
        rcu_sheaf = pcs->rcu_free;

        /*
         * Since we flush immediately when size reaches capacity, we never reach
         * this with size already at capacity, so no OOB write is possible.
         */
        rcu_sheaf->objects[rcu_sheaf->size++] = obj;

        if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
                rcu_sheaf = NULL;
        } else {
                pcs->rcu_free = NULL;
                rcu_sheaf->node = numa_mem_id();
        }

        /*
         * we flush before local_unlock to make sure a racing
         * flush_all_rcu_sheaves() doesn't miss this sheaf
         */
        if (rcu_sheaf)
                call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);

        qpw_unlock(&s->cpu_sheaves->lock, cpu);

So if it invokes call_rcu, it sets pcs->rcu_free = NULL.
In that case, flush_rcu_sheaf, executing remotely from
flush_rcu_sheaves_on_cache, will do:

static void flush_rcu_sheaf(struct work_struct *w)
{
        struct slub_percpu_sheaves *pcs;
        struct slab_sheaf *rcu_free;
        struct slub_flush_work *sfw;
        struct kmem_cache *s;
        int cpu = qpw_get_cpu(w);

        sfw = &per_cpu(slub_flush, cpu);
        s = sfw->s;

        qpw_lock(&s->cpu_sheaves->lock, cpu);
        pcs = per_cpu_ptr(s->cpu_sheaves, cpu);

        rcu_free = pcs->rcu_free;
        pcs->rcu_free = NULL;
        qpw_unlock(&s->cpu_sheaves->lock, cpu);

        if (rcu_free)
                call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
}

It only calls rcu_free_sheaf_nobarn if pcs->rcu_free is not NULL.
So it seems safe?

> How would this work with the housekeeping on return to userspace approach?
>
> - Would we just walk the list of all caches to flush them? could be
> expensive. Would we somehow note only those that need it? That would make
> the fast paths do something extra?
>
> - If some other CPU executed kmem_cache_destroy(), it would have to wait for
> the isolated cpu returning to userspace. Do we have the means for
> synchronizing on that? Would that risk a deadlock? We used to have a
> deferred finishing of the destroy for other reasons but were glad to get rid
> of it when it was possible, now it might be necessary to revive it?

I don't think you can expect system calls to return to userspace within
a given amount of time. A task could be in kernel mode for long periods
of time.

> How would this work with QPW?
>
> - probably fast paths more expensive due to spin lock vs local_trylock_t
>
> - flush_rcu_sheaves_on_cache() needs to be solved safely (see above)
>
> What if we avoid percpu sheaves completely on isolated cpus and instead
> allocate/free using the slowpaths?
>
> - It could probably be achieved without affecting fastpaths, as we already
> handle bootstrap without sheaves, so it's implemented in a way to not affect
> fastpaths.
>
> - Would it slow the isolcpu workloads down too much when they do a syscall?
> - compared to "housekeeping on return to userspace" flushing, maybe not?
> Because in that case the syscall starts with sheaves flushed from the
> previous return, it has to do something expensive to get the initial sheaf,
> then maybe will use only one or a few objects, then on return has to flush
> everything. Likely the slowpath might be faster, unless it allocates/frees
> many objects from the same cache.
>
> - compared to QPW - it would be slower as QPW would mostly retain sheaves
> populated, the need for flushes should be very rare
>
> So if we can assume that workloads on isolated cpus make syscalls only
> rarely, and when they do they can tolerate them being slower, I think the
> "avoid sheaves on isolated cpus" would be the best way here.

I am not sure it's safe to assume that. Ask Gemini about isolcpus use
cases and you get:

1. High-Frequency Trading (HFT)

In the world of HFT, microseconds are the difference between profit and
loss. Traders use isolcpus to pin their execution engines to specific
cores.
        The Goal: Eliminate "jitter" caused by the OS moving other processes onto the same core.
        The Benefit: Guaranteed execution time and ultra-low latency.

2. Real-Time Audio & Video Processing

If you are running a Digital Audio Workstation (DAW) or a live video
encoding rig, a tiny "hiccup" in CPU availability results in an audible
pop or a dropped frame.
        The Goal: Reserve cores specifically for the Digital Signal Processor (DSP) or the encoder.
        The Benefit: Smooth, glitch-free media streams even when the rest of the system is busy.

3. Network Function Virtualization (NFV) & DPDK

For high-speed networking (like 10Gbps+ traffic), the Data Plane
Development Kit (DPDK) uses "poll mode" drivers. These drivers
constantly loop to check for new packets rather than waiting for
interrupts.
        The Goal: Isolate cores so they can run at 100% utilization just checking for network packets.
        The Benefit: Maximum throughput and zero packet loss in high-traffic environments.

4. Gaming & Simulation

Competitive gamers or flight simulator enthusiasts sometimes isolate a
few cores to handle the game's main thread, while leaving the rest of
the OS (Discord, Chrome, etc.) to the remaining cores.
        The Goal: Prevent background Windows/Linux tasks from stealing cycles from the game engine.
        The Benefit: More consistent 1% low FPS and reduced input lag.

5. Deterministic Scientific Computing

If you're running a simulation that needs to take exactly the same
amount of time every time it runs (for benchmarking or safety-critical
testing), you can't have OS interference messing with your metrics.
        The Goal: Remove the variability of the Linux scheduler.
        The Benefit: Highly repeatable, deterministic results.

===

For example, AF_XDP bypass uses system calls (and wants isolcpus):

https://www.quantvps.com/blog/kernel-bypass-in-hft?srsltid=AfmBOoryeSxuuZjzTJIC9O-Ag8x4gSwjs-V4Xukm2wQpGmwDJ6t4szuE