From: Marcelo Tosatti <mtosatti@redhat.com>
To: Vlastimil Babka <vbabka@suse.com>
Cc: Michal Hocko <mhocko@suse.com>,
	Leonardo Bras <leobras.c@gmail.com>,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	linux-mm@kvack.org, Johannes Weiner <hannes@cmpxchg.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Muchun Song <muchun.song@linux.dev>,
	Andrew Morton <akpm@linux-foundation.org>,
	Christoph Lameter <cl@linux.com>,
	Pekka Enberg <penberg@kernel.org>,
	David Rientjes <rientjes@google.com>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Hyeonggon Yoo <42.hyeyoo@gmail.com>,
	Leonardo Bras <leobras@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Waiman Long <longman@redhat.com>,
	Boqun Feng <boqun.feng@gmail.com>,
	Frederic Weisbecker <fweisbecker@suse.de>
Subject: Re: [PATCH 0/4] Introduce QPW for per-cpu operations
Date: Fri, 20 Feb 2026 14:35:41 -0300	[thread overview]
Message-ID: <aZibbYH7yrDZlnJh@tpad> (raw)
In-Reply-To: <3f2b985a-2fb0-4d63-9dce-8a9cad8ce464@suse.com>

Hi Vlastimil,

On Fri, Feb 20, 2026 at 11:48:00AM +0100, Vlastimil Babka wrote:
> On 2/19/26 16:27, Marcelo Tosatti wrote:
> > On Mon, Feb 16, 2026 at 12:00:55PM +0100, Michal Hocko wrote:
> > 
> > Michal,
> > 
> > Again, I don't see how moving the operations to happen at return to 
> > kernel would help (assuming you are talking about 
> > "context_tracking,x86: Defer some IPIs until a user->kernel transition").
> > 
> > The IPIs in the patchset above can be deferred until the user->kernel
> > transition because they are TLB flushes for addresses which are not
> > mapped into the userspace address space.
> > 
> > What are the per-CPU objects in SLUB?
> > 
> > struct slab_sheaf {
> >         union {
> >                 struct rcu_head rcu_head;
> >                 struct list_head barn_list;
> >                 /* only used for prefilled sheafs */
> >                 struct {
> >                         unsigned int capacity;
> >                         bool pfmemalloc;
> >                 };
> >         };
> >         struct kmem_cache *cache;
> >         unsigned int size;
> >         int node; /* only used for rcu_sheaf */
> >         void *objects[];
> > };
> > 
> > struct slub_percpu_sheaves {
> >         local_trylock_t lock;
> >         struct slab_sheaf *main; /* never NULL when unlocked */
> >         struct slab_sheaf *spare; /* empty or full, may be NULL */
> >         struct slab_sheaf *rcu_free; /* for batching kfree_rcu() */
> > };
> > 
> > Examples of local CPU operations that manipulate the data structures:
> > 1) kmalloc, allocates an object from local per CPU list.
> > 2) kfree, returns an object to local per CPU list.
> > 
> > Examples of operations that would perform changes on the per-CPU
> > lists remotely:
> > kmem_cache_shutdown (cache destruction), kmem_cache_shrink.
> > 
> > You can't delay kmalloc (removal of an object from the per-CPU
> > freelist), kfree (return of an object to the per-CPU freelist),
> > kmem_cache_shutdown, or kmem_cache_shrink until return to userspace.
> > 
> > Am I missing something here? (Or do you have something in mind
> > which I can't see?)
> 
> Let's try and analyze when we need to do the flushing in SLUB
> 
> - memory offline - would anyone do that with isolcpus? if yes, they probably
> deserve the disruption

I think it's OK to avoid memory offline on such systems.

> - cache shrinking (mainly from sysfs handler) - not necessary for
> correctness, can probably skip cpu if needed, also kinda shooting your own
> foot on isolcpu systems
> 
> - kmem_cache is being destroyed (__kmem_cache_shutdown()) - this is
> important for correctness. destroying caches should be rare, but can't rule
> it out
> 
> - kvfree_rcu_barrier() - a very tricky one; currently has only a debugging
> caller, but that can change
> 
> (BTW, see the note in flush_rcu_sheaves_on_cache() and how it relies on the
> flush actually happening on the cpu. Won't QPW violate that?)

The do_free path manipulates (struct kmem_cache *s)->cpu_sheaves
(percpu)->rcu_free with the s->cpu_sheaves->lock held:

do_free:

        rcu_sheaf = pcs->rcu_free;

        /*
         * Since we flush immediately when size reaches capacity, we never reach
         * this with size already at capacity, so no OOB write is possible.
         */
        rcu_sheaf->objects[rcu_sheaf->size++] = obj;

        if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
                rcu_sheaf = NULL;
        } else {
                pcs->rcu_free = NULL;
                rcu_sheaf->node = numa_mem_id();
        }

        /*
         * we flush before local_unlock to make sure a racing
         * flush_all_rcu_sheaves() doesn't miss this sheaf
         */
        if (rcu_sheaf)
                call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);

        qpw_unlock(&s->cpu_sheaves->lock, cpu);

So if it invokes call_rcu, it sets pcs->rcu_free = NULL. In that case,
flush_rcu_sheaf, executing remotely from flush_rcu_sheaves_on_cache,
will do:

static void flush_rcu_sheaf(struct work_struct *w)
{
        struct slub_percpu_sheaves *pcs;
        struct slab_sheaf *rcu_free;
        struct slub_flush_work *sfw;
        struct kmem_cache *s;
        int cpu = qpw_get_cpu(w);

        sfw = &per_cpu(slub_flush, cpu);
        s = sfw->s;

        qpw_lock(&s->cpu_sheaves->lock, cpu);
        pcs = per_cpu_ptr(s->cpu_sheaves, cpu);

        rcu_free = pcs->rcu_free;
        pcs->rcu_free = NULL;

        qpw_unlock(&s->cpu_sheaves->lock, cpu);

        if (rcu_free)
                call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
}

It only calls rcu_free_sheaf_nobarn if pcs->rcu_free is not NULL.

So it seems safe?
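To make that handoff concrete, here is a minimal userspace sketch of the
pcs->rcu_free dance above. All names are simplified stand-ins, not the
real kernel API: a pthread mutex models qpw_lock()/qpw_unlock(), a
counter models call_rcu(), and allocating a fresh sheaf when rcu_free is
NULL abstracts over the real refill path.

```c
/*
 * Userspace model of the pcs->rcu_free handoff between the local
 * do_free path and a remote flush_rcu_sheaf(). Hypothetical,
 * simplified names throughout.
 */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define CAPACITY 4

struct sheaf { int size; void *objects[CAPACITY]; };

struct pcs {
	pthread_mutex_t lock;   /* models qpw_lock/qpw_unlock */
	struct sheaf *rcu_free; /* may be NULL after a flush */
};

static int call_rcu_count;      /* how many sheaves were queued */

static void call_rcu_stub(struct sheaf *s) { call_rcu_count++; free(s); }

static struct sheaf *new_sheaf(void) { return calloc(1, sizeof(struct sheaf)); }

/* local free path: queue the sheaf only when it fills up */
static void do_free(struct pcs *pcs, void *obj)
{
	struct sheaf *rcu_sheaf;

	pthread_mutex_lock(&pcs->lock);
	if (!pcs->rcu_free)             /* a remote flush took it */
		pcs->rcu_free = new_sheaf();
	rcu_sheaf = pcs->rcu_free;
	rcu_sheaf->objects[rcu_sheaf->size++] = obj;
	if (rcu_sheaf->size < CAPACITY)
		rcu_sheaf = NULL;       /* not full: keep it */
	else
		pcs->rcu_free = NULL;   /* full: detach before queueing */
	/* queue before unlock, as in the kernel code above */
	if (rcu_sheaf)
		call_rcu_stub(rcu_sheaf);
	pthread_mutex_unlock(&pcs->lock);
}

/* remote flush path: detach under the same lock, queue only if non-NULL */
static void flush_rcu_sheaf(struct pcs *pcs)
{
	struct sheaf *rcu_free;

	pthread_mutex_lock(&pcs->lock);
	rcu_free = pcs->rcu_free;
	pcs->rcu_free = NULL;
	pthread_mutex_unlock(&pcs->lock);

	if (rcu_free)
		call_rcu_stub(rcu_free);
}
```

Both paths detach the sheaf under the same lock before queueing it, so a
given sheaf can be queued only once, and a flush that finds rcu_free
NULL simply does nothing, which is the "seems safe" property argued for
above.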

> How would this work with the housekeeping on return to userspace approach?
> 
> - Would we just walk the list of all caches to flush them? could be
> expensive. Would we somehow note only those that need it? That would make
> the fast paths do something extra?
> 
> - If some other CPU executed kmem_cache_destroy(), it would have to wait for
> the isolated cpu returning to userspace. Do we have the means for
> synchronizing on that? Would that risk a deadlock? We used to have a
> deferred finishing of the destroy for other reasons but were glad to get rid
> of it when it was possible, now it might be necessary to revive it?

I don't think you can expect system calls to return to userspace
within a given amount of time. A task could stay in kernel mode for
long periods of time.

> How would this work with QPW?
> 
> - probably fast paths more expensive due to spin lock vs local_trylock_t
> 
> - flush_rcu_sheaves_on_cache() needs to be solved safely (see above)
> 
> What if we avoid percpu sheaves completely on isolated cpus and instead
> allocate/free using the slowpaths?
> 
> - It could probably be achieved without affecting fastpaths, as we already
> handle bootstrap without sheaves, so it's implemented in a way to not affect
> fastpaths.
> 
> - Would it slow the isolcpu workloads down too much when they do a syscall?
>   - compared to "housekeeping on return to userspace" flushing, maybe not?
> Because in that case the syscall starts with sheaves flushed from previous
> return, it has to do something expensive to get the initial sheaf, then
> maybe will use only one or a few objects, then on return has to flush
> everything. Likely the slowpath might be faster, unless it allocates/frees
> many objects from the same cache.
>   - compared to QPW - it would be slower as QPW would mostly retain sheaves
> populated, the need for flushes should be very rare
> 
> So if we can assume that workloads on isolated cpus make syscalls only
> rarely, and when they do they can tolerate them being slower, I think the
> "avoid sheaves on isolated cpus" would be the best way here.

I am not sure it's safe to assume that. Ask Gemini about isolcpus use
cases and you get:

1. High-Frequency Trading (HFT)
In the world of HFT, microseconds are the difference between profit and loss. 
Traders use isolcpus to pin their execution engines to specific cores.

The Goal: Eliminate "jitter" caused by the OS moving other processes onto the same core.

The Benefit: Guaranteed execution time and ultra-low latency.

2. Real-Time Audio & Video Processing
If you are running a Digital Audio Workstation (DAW) or a live video encoding rig, a tiny "hiccup" in CPU availability results in an audible pop or a dropped frame.

The Goal: Reserve cores specifically for the Digital Signal Processor (DSP) or the encoder.

The Benefit: Smooth, glitch-free media streams even when the rest of the system is busy.

3. Network Function Virtualization (NFV) & DPDK
For high-speed networking (like 10Gbps+ traffic), the Data Plane Development Kit (DPDK) uses "poll mode" drivers. These drivers constantly loop to check for new packets rather than waiting for interrupts.

The Goal: Isolate cores so they can run at 100% utilization just checking for network packets.

The Benefit: Maximum throughput and zero packet loss in high-traffic environments.

4. Gaming & Simulation
Competitive gamers or flight simulator enthusiasts sometimes isolate a few cores to handle the game's main thread, while leaving the rest of the OS (Discord, Chrome, etc.) to the remaining cores.

The Goal: Prevent background Windows/Linux tasks from stealing cycles from the game engine.

The Benefit: More consistent 1% low FPS and reduced input lag.

5. Deterministic Scientific Computing
If you're running a simulation that needs to take exactly the same amount of time every time it runs (for benchmarking or safety-critical testing), you can't have OS interference messing with your metrics.

The Goal: Remove the variability of the Linux scheduler.

The Benefit: Highly repeatable, deterministic results.

===

For example, AF_XDP bypass uses system calls (and wants isolcpus):

https://www.quantvps.com/blog/kernel-bypass-in-hft?srsltid=AfmBOoryeSxuuZjzTJIC9O-Ag8x4gSwjs-V4Xukm2wQpGmwDJ6t4szuE


