From: Marcelo Tosatti <mtosatti@redhat.com>
To: Frederic Weisbecker <frederic@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>,
Leonardo Bras <leobras.c@gmail.com>,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
linux-mm@kvack.org, Johannes Weiner <hannes@cmpxchg.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeel.butt@linux.dev>,
Muchun Song <muchun.song@linux.dev>,
Andrew Morton <akpm@linux-foundation.org>,
Christoph Lameter <cl@linux.com>,
Pekka Enberg <penberg@kernel.org>,
David Rientjes <rientjes@google.com>,
Joonsoo Kim <iamjoonsoo.kim@lge.com>,
Vlastimil Babka <vbabka@suse.cz>,
Hyeonggon Yoo <42.hyeyoo@gmail.com>,
Leonardo Bras <leobras@redhat.com>,
Thomas Gleixner <tglx@linutronix.de>,
Waiman Long <longman@redhat.com>,
Boqun Feng <boqun.feng@gmail.com>,
Frederic Weisbecker <fweisbecker@suse.de>
Subject: Re: [PATCH 0/4] Introduce QPW for per-cpu operations
Date: Tue, 24 Feb 2026 14:23:25 -0300 [thread overview]
Message-ID: <aZ3ejedS7nE5mnva@tpad> (raw)
In-Reply-To: <aZzM_44L1vKzcOCy@pavilion.home>
On Mon, Feb 23, 2026 at 10:56:15PM +0100, Frederic Weisbecker wrote:
> Le Thu, Feb 19, 2026 at 08:30:31PM +0100, Michal Hocko a écrit :
> > On Thu 19-02-26 12:27:23, Marcelo Tosatti wrote:
> > > Michal,
> > >
> > > Again, i don't see how moving operations to happen at return to
> > > kernel would help (assuming you are talking about
> > > "context_tracking,x86: Defer some IPIs until a user->kernel transition").
> >
> > Nope, I am not talking about IPIs, although those are an example of pcp
> > state as well. I am sorry I do not have a link handy, I am pretty sure
> > Frederic will have that. Another example, though, was vmstat flushes
> > that need to be pcp. There are many other examples.
>
> Here it is:
>
> https://lore.kernel.org/all/20250410152327.24504-1-frederic@kernel.org/
>
> Thanks.
Frederic,
I think this is a valid solution, however on systems with many CPUs, in
nohz_full, performing system calls, can't there be significant increase
of lru_lock contention ? Consider 100+ CPUs performing many system calls
which add 1 or 2 folios to per-CPU LRU lists.
Note: if you are confident about the above not being a problem,
this approach looks good to me.
commit eb709b0d062efd653a61183af8e27b2711c3cf5c
Author: Shaohua Li <shaohua.li@intel.com>
Date: Tue May 24 17:12:55 2011 -0700
mm: batch activate_page() to reduce lock contention
The zone->lru_lock is heavily contented in workload where activate_page()
is frequently used. We could do batch activate_page() to reduce the lock
contention. The batched pages will be added into zone list when the pool
is full or page reclaim is trying to drain them.
For example, in a 4 socket 64 CPU system, create a sparse file and 64
processes, processes shared map to the file. Each process read access the
whole file and then exit. The process exit will do unmap_vmas() and cause
a lot of activate_page() call. In such workload, we saw about 58% total
time reduction with below patch. Other workloads with a lot of
activate_page also benefits a lot too.
...
The most significent are:
case-lru-file-readtwice -11.69%
case-mmap-pread-rand -15.26%
case-mmap-pread-seq -69.72%
Some Gemini answers (question was "list of nohz_full usecases"):
2. Scientific Simulation & Research
Research institutions (like CERN, NASA, or national labs) use nohz_full
for "tightly coupled" parallel workloads.
Workloads: Molecular dynamics, fluid dynamics (CFD), and weather forecasting (e.g., WRF models).
The "Barrier" Problem: In massive clusters using MPI (Message Passing
Interface), all CPUs often have to reach a synchronization barrier
before the next step of a simulation. If one CPU is delayed by a few
milliseconds due to a timer tick, all other thousands of CPUs sit idle
waiting for it. nohz_full prevents this "tail latency" from stalling the
entire supercomputer.
...
4. Competitive Benchmarking & Kernel Development
Performance engineers use this mode to get "clean" numbers when testing
new hardware or compilers.
Workloads: Core-to-core latency tests, cache-bandwidth benchmarks, and
standard suites like SPEC CPU.
Goal: Eliminating the "noise" of the operating system so that the
results reflect pure hardware performance.
...
Summary Table: Who uses nohz_full?
User Group Primary Workload Why they use it
Quant Firms High-Frequency Trading To prevent micro-stutter during trade execution.
Research Labs MPI-based Simulations To avoid the "slowest node" stalling the whole cluster.
Telcos/ISP 5G/Packet Processing To ensure wire-speed processing without interrupts.
Hardware Vendors Chip Validation To benchmark CPU performance without OS interference.
Here is how scientific simulations handle system calls:
1. The "Compute-Loop" (Low Syscall)
The core of a simulation (like a GROMACS molecular dynamics step) is
just raw math: fetching data from RAM, doing floating-point arithmetic
(AVX/SSE), and writing it back.
During the loop: The CPU stays in "Userspace" for millions of cycles
without ever asking the kernel for help.
Why it works: Since there are no system calls, nohz_full can
successfully turn off the timer tick, allowing the CPU to focus 100% on
the math.
2. The "Communication-Phase" (High Syscall)
System calls usually happen only at the end of a computation block, when
the simulation needs to talk to other nodes.
The Tools: MPI (Message Passing Interface) uses system calls like write,
sendmsg, or specialized RDMA calls to move data across the network.
The Pattern: These simulations follow a "Burst" pattern—long periods
of zero system calls (computation) followed by a quick burst of system
calls (synchronization).
3. When are they "Syscall Intensive"?
There are specific parts of a simulation that are intensive, but
researchers try to minimize them:
I/O Operations: Writing "checkpoints" or large trajectory files to disk
(using write()). This is why high-end HPC systems use Asynchronous I/O
or dedicated I/O nodes—to keep the compute cores from getting bogged
down in system calls.
Memory Allocation: Constantly calling malloc/free involves the brk or
mmap system calls. Optimized simulation tools pre-allocate all the
memory they need at startup to avoid this.
next prev parent reply other threads:[~2026-02-24 18:26 UTC|newest]
Thread overview: 60+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-02-06 14:34 [PATCH 0/4] Introduce QPW for per-cpu operations Marcelo Tosatti
2026-02-06 14:34 ` [PATCH 1/4] Introducing qpw_lock() and per-cpu queue & flush work Marcelo Tosatti
2026-02-06 15:20 ` Marcelo Tosatti
2026-02-07 0:16 ` Leonardo Bras
2026-02-11 12:09 ` Marcelo Tosatti
2026-02-14 21:32 ` Leonardo Bras
2026-02-06 14:34 ` [PATCH 2/4] mm/swap: move bh draining into a separate workqueue Marcelo Tosatti
2026-02-06 14:34 ` [PATCH 3/4] swap: apply new queue_percpu_work_on() interface Marcelo Tosatti
2026-02-07 1:06 ` Leonardo Bras
2026-02-26 15:49 ` Marcelo Tosatti
2026-03-08 17:35 ` Leonardo Bras
2026-02-06 14:34 ` [PATCH 4/4] slub: " Marcelo Tosatti
2026-02-07 1:27 ` Leonardo Bras
2026-02-06 23:56 ` [PATCH 0/4] Introduce QPW for per-cpu operations Leonardo Bras
2026-02-10 14:01 ` Michal Hocko
2026-02-11 12:01 ` Marcelo Tosatti
2026-02-11 12:11 ` Marcelo Tosatti
2026-02-14 21:35 ` Leonardo Bras
2026-02-11 16:38 ` Michal Hocko
2026-02-11 16:50 ` Marcelo Tosatti
2026-02-11 16:59 ` Vlastimil Babka
2026-02-11 17:07 ` Michal Hocko
2026-02-14 22:02 ` Leonardo Bras
2026-02-16 11:00 ` Michal Hocko
2026-02-19 15:27 ` Marcelo Tosatti
2026-02-19 19:30 ` Michal Hocko
2026-02-20 14:30 ` Marcelo Tosatti
2026-02-23 9:18 ` Michal Hocko
2026-03-03 10:55 ` Frederic Weisbecker
2026-02-23 21:56 ` Frederic Weisbecker
2026-02-24 17:23 ` Marcelo Tosatti [this message]
2026-02-25 21:49 ` Frederic Weisbecker
2026-02-26 7:06 ` Michal Hocko
2026-02-26 11:41 ` Marcelo Tosatti
2026-03-03 11:08 ` Frederic Weisbecker
2026-02-20 10:48 ` Vlastimil Babka
2026-02-20 12:31 ` Michal Hocko
2026-02-20 17:35 ` Marcelo Tosatti
2026-02-20 17:58 ` Vlastimil Babka
2026-02-20 19:01 ` Marcelo Tosatti
2026-02-23 9:11 ` Michal Hocko
2026-02-23 11:20 ` Marcelo Tosatti
2026-02-24 14:40 ` Frederic Weisbecker
2026-02-24 18:12 ` Marcelo Tosatti
2026-02-20 16:51 ` Marcelo Tosatti
2026-02-20 16:55 ` Marcelo Tosatti
2026-02-20 22:38 ` Leonardo Bras
2026-02-23 18:09 ` Vlastimil Babka
2026-02-26 18:24 ` Marcelo Tosatti
2026-02-20 21:58 ` Leonardo Bras
2026-02-23 9:06 ` Michal Hocko
2026-02-28 1:23 ` Leonardo Bras
2026-03-03 0:19 ` Marcelo Tosatti
2026-03-08 17:41 ` Leonardo Bras
2026-03-09 9:52 ` Vlastimil Babka (SUSE)
2026-03-11 0:01 ` Leonardo Bras
2026-03-10 21:24 ` Marcelo Tosatti
2026-03-11 0:03 ` Leonardo Bras
2026-03-11 10:23 ` Marcelo Tosatti
2026-02-19 13:15 ` Marcelo Tosatti
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aZ3ejedS7nE5mnva@tpad \
--to=mtosatti@redhat.com \
--cc=42.hyeyoo@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=boqun.feng@gmail.com \
--cc=cgroups@vger.kernel.org \
--cc=cl@linux.com \
--cc=frederic@kernel.org \
--cc=fweisbecker@suse.de \
--cc=hannes@cmpxchg.org \
--cc=iamjoonsoo.kim@lge.com \
--cc=leobras.c@gmail.com \
--cc=leobras@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=longman@redhat.com \
--cc=mhocko@suse.com \
--cc=muchun.song@linux.dev \
--cc=penberg@kernel.org \
--cc=rientjes@google.com \
--cc=roman.gushchin@linux.dev \
--cc=shakeel.butt@linux.dev \
--cc=tglx@linutronix.de \
--cc=vbabka@suse.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.