Re: [PATCH v3 06/13] tick/nohz, context_tracking: Prepare for runtime nohz_full updates

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

From: Thomas Gleixner <tglx@kernel.org>
To: Jing Wu <realwujing@gmail.com>, Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	Frederic Weisbecker <frederic@kernel.org>,
	Neeraj Upadhyay <neeraj.upadhyay@kernel.org>,
	Joel Fernandes <joelagnelf@nvidia.com>,
	Josh Triplett <josh@joshtriplett.org>,
	Boqun Feng <boqun@kernel.org>,
	Uladzislau Rezki <urezki@gmail.com>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Lai Jiangshan <jiangshanlai@gmail.com>,
	Zqiang <qiang.zhang@linux.dev>,
	Anna-Maria Behnsen <anna-maria@linutronix.de>,
	Tejun Heo <tj@kernel.org>, Jonathan Corbet <corbet@lwn.net>,
	Shuah Khan <skhan@linuxfoundation.org>,
	Shuah Khan <shuah@kernel.org>
Cc: linux-kernel@vger.kernel.org, rcu@vger.kernel.org,
	cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kselftest@vger.kernel.org, Jing Wu <realwujing@gmail.com>,
	Qiliang Yuan <yuanql9@chinatelecom.cn>
Subject: Re: [PATCH v3 06/13] tick/nohz, context_tracking: Prepare for runtime nohz_full updates
Date: Thu, 18 Jun 2026 19:27:48 +0200	[thread overview]
Message-ID: <87ik7fep2j.ffs@fw13> (raw)
In-Reply-To: <20260618-wujing-dhm-v3-6-28f1a4d83b68@gmail.com>

On Thu, Jun 18 2026 at 11:11, Jing Wu wrote:
> Remove __init from ct_cpu_track_user() and __initdata from the
> initialized flag so context tracking can be activated on CPUs that
> join nohz_full at runtime.  Drop the __ro_after_init attribute from
> the context_tracking_key static key, allowing static_branch_dec()
> when a CPU leaves nohz_full.
>
> Add ct_cpu_untrack_user() to reverse ct_cpu_track_user(), decrementing
> the static key and clearing the per-CPU tracking state.

Please do not enumerate WHAT the patch is doing. Explain the context and
the WHY

  https://docs.kernel.org/process/maintainer-tip.html#changelog


>  
>  #include <asm/irq_regs.h>
> @@ -653,11 +654,6 @@ void __init tick_nohz_init(void)
>  	if (!tick_nohz_full_running)
>  		return;
>  
> -	/*
> -	 * Full dynticks uses IRQ work to drive the tick rescheduling on safe
> -	 * locking contexts. But then we need IRQ work to raise its own
> -	 * interrupts to avoid circular dependency on the tick.
> -	 */

This comment is removed because it's not longer correct? How is this
related to $Subject?

>  	if (!arch_irq_work_has_interrupt()) {
>  		pr_warn("NO_HZ: Can't run full dynticks because arch doesn't support IRQ work self-IPIs\n");
>  		cpumask_clear(tick_nohz_full_mask);
> @@ -676,6 +672,16 @@ void __init tick_nohz_init(void)
>  		}
>  	}
>  
> +	/*
> +	 * Pre-initialize context tracking for all possible CPUs so
> +	 * ctx tracking is already active when a CPU is later added to
> +	 * nohz_full at runtime.  The tracking overhead is negligible
> +	 * because the static key is not incremented yet — only per-CPU
> +	 * tracking state is set up.
> +	 */
> +	if (IS_ENABLED(CONFIG_CONTEXT_TRACKING_USER_FORCE))
> +		context_tracking_init();

Seriously? Care to look where and when context_tracking_init() is invoked?

>  	for_each_cpu(cpu, tick_nohz_full_mask)
>  		ct_cpu_track_user(cpu);
>  
> @@ -686,6 +692,147 @@ void __init tick_nohz_init(void)
>  	pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n",
>  		cpumask_pr_args(tick_nohz_full_mask));
>  }
> +
> +static int tick_nohz_hk_validate(enum hk_type type,
> +				 const struct cpumask *cur_mask,
> +				 const struct cpumask *new_mask)
> +{
> +	if (!IS_ENABLED(CONFIG_NO_HZ_FULL))
> +		return -EOPNOTSUPP;
> +	return 0;
> +}

Why is this code even compiled when CONFIG_NO_HZ_FULL is not enabled?

> +
> +static void tick_nohz_hk_apply(enum hk_type type)
> +{
> +	static DEFINE_SPINLOCK(tick_nohz_lock);
> +	cpumask_var_t nohz_full, added, removed;
> +	bool was_running;
> +	int cpu;
> +
> +	if (!alloc_cpumask_var(&nohz_full, GFP_KERNEL))
> +		return;

This looks more than wrong. If this fails then the core code will
happily proceed with the completely wrong state.

> +	if (!alloc_cpumask_var(&added, GFP_KERNEL)) {
> +		free_cpumask_var(nohz_full);
> +		return;
> +	}
> +	if (!alloc_cpumask_var(&removed, GFP_KERNEL)) {
> +		free_cpumask_var(added);
> +		free_cpumask_var(nohz_full);
> +		return;
> +	}

        cpumask_var_t __free(free_cpumask_var) a = CPUMASK_VAR_NULL;
        cpumask_var_t __free(free_cpumask_var) b = CPUMASK_VAR_NULL;
        cpumask_var_t __free(free_cpumask_var) c = CPUMASK_VAR_NULL;

        if (!alloc_cpumask_var(&a, GFP_KERNEL))
        	return -ENOMEM;
        ....

> +
> +	/*
> +	 * Snapshot the new HK_TYPE_KERNEL_NOISE mask under an RCU read lock.
> +	 * housekeeping_update_types() completes synchronize_rcu() before
> +	 * invoking apply(), so the new pointer is stable; however the lockdep
> +	 * annotation in housekeeping_cpumask() still requires an RCU read-side
> +	 * critical section for runtime-mutable types.

This comment is explaining the obvious: housekeeping_cpumask_rcu()

> +	 */
> +	rcu_read_lock();

        scoped_guard(rcu)


> +	cpumask_andnot(nohz_full, cpu_possible_mask,
> +		       housekeeping_cpumask_rcu(HK_TYPE_KERNEL_NOISE));
> +	rcu_read_unlock();
> +
> +	/*
> +	 * When "nohz_full=" was not passed at boot, tick_nohz_full_running is
> +	 * false and the full dynticks infrastructure (sched_tick_offload_init,
> +	 * RCU nohz quiescent-state reporting, context-tracking bootstrap) was
> +	 * never initialised.  In that case restrict the update to
> +	 * tick_nohz_full_mask so the /sys/devices/system/cpu/nohz_full sysfs
> +	 * attribute reflects DHM-isolated CPUs without enabling tick
> +	 * suppression, context tracking, or timer migration – all of which
> +	 * require boot-time setup and would deadlock on the first
> +	 * synchronize_rcu() call after CPUs are offlined.

What? You tell user space that the CPUs are nohz_full by updating the
mask, which is exposed in sysfs, which is blatantly wrong.

> +	 */
> +	was_running = READ_ONCE(tick_nohz_full_running);

Q: This READ_ONCE() pairs with which WRITE_ONCE()? 
A: With none, so it's just voodoo programming.

> +	spin_lock(&tick_nohz_lock);

This lock protects against the housekeeping core code invoking the apply
callback multiple times in parallel, right?

If that happens then there are bigger problems than corrupted masks.

> +	/*
> +	 * When nohz_full= was active at boot, compute the delta and update
> +	 * context tracking for CPUs joining or leaving the nohz_full set.
> +	 * Skip when !was_running: ct_cpu_track_user() calls
> +	 * static_branch_inc() which may sleep (jump_label_update on the
> +	 * 0→1 transition) – illegal inside a spinlock.

If you remove the pointless voodoo lock then this nonsense goes away too.

> +	 */
> +	if (IS_ENABLED(CONFIG_CONTEXT_TRACKING_USER) &&
> +	    was_running &&
> +	    cpumask_available(tick_nohz_full_mask)) {

Why is this stuff even invoked when the mask is not available? If it's
not there then NOHZ full is not functional, period.

> +		cpumask_andnot(added, nohz_full, tick_nohz_full_mask);
> +		cpumask_andnot(removed, tick_nohz_full_mask, nohz_full);
> +		for_each_cpu(cpu, added)
> +			ct_cpu_track_user(cpu);
> +		for_each_cpu(cpu, removed)
> +			ct_cpu_untrack_user(cpu);
> +	}
> +
> +	/*
> +	 * Update tick_nohz_full_mask unconditionally: this is the snapshot
> +	 * read by the /sys/devices/system/cpu/nohz_full sysfs attribute and
> +	 * must reflect the current isolation set even in the DHM runtime case.
> +	 */
> +	if (cpumask_available(tick_nohz_full_mask))
> +		cpumask_copy(tick_nohz_full_mask, nohz_full);

Seriously?

> +	/*
> +	 * Only modify tick_nohz_full_running and migrate the global tick when
> +	 * nohz_full= was set at boot; without boot-time setup, setting
> +	 * tick_nohz_full_running would suppress ticks on isolated CPUs and
> +	 * prevent RCU quiescent-state reporting, causing synchronize_rcu()
> +	 * to stall permanently when a CPU is subsequently offlined.
> +	 */
> +	if (was_running) {

Again, why is any of this invoked when NOHZ full was never enabled and
initialized?

> +		tick_nohz_full_running = !cpumask_empty(nohz_full);

Brilliant. When NOHZ full was enabled on the command line, then changing
the mask can disable "running" and that makes it disabled forever. There
is no way to reenable it.

This 'was_running' check is just wrong. What you need is a
'tick_nohz_full_initialized' boolean, which is only true when nohz_full
was setup early on including the mask.

If that's not the case, then none of this code is supposed to run
ever. I.e. the callback is not installed in the first place.

> +	/*
> +	 * Ensure tick_nohz_full_mask is allocated so that tick_nohz_hk_apply()
> +	 * can update it (and the /sys/devices/system/cpu/nohz_full sysfs
> +	 * attribute) when CPUs are isolated at runtime via DHM.  If "nohz_full="
> +	 * was passed at boot the mask is already allocated; allocate an empty
> +	 * one here for the runtime-only case.

What's the runtime only case? The fake exposure in sysfs which is just
misleading the user? Not going to happen. If it's not enabled on the
command line then it's disabled, end of story.

> +	 */
> +	if (!cpumask_available(tick_nohz_full_mask) &&
> +	    !zalloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL))
> +		pr_warn("tick/nohz: failed to allocate nohz_full_mask for DHM\n");

ROTFL. If the allocation fails, then the apply callback becomes a
complete noop doing magic cpumask operations for nothing and pretending
to be successful.

Thanks,

        tglx

next prev parent reply	other threads:[~2026-06-18 17:27 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-18  3:11 [PATCH v3 00/13] Dynamic Housekeeping Management (DHM) via CPUSets Jing Wu
2026-06-18  3:11 ` [PATCH v3 01/13] sched/isolation: Replace notifier chain with explicit callback interface Jing Wu
2026-06-18  3:11 ` [PATCH v3 02/13] sched/isolation: Add housekeeping_update_types() for kernel-noise masks Jing Wu
2026-06-18  3:11 ` [PATCH v3 03/13] sched/isolation: RCU-protect all housekeeping cpumask readers Jing Wu
2026-06-18  3:11 ` [PATCH v3 04/13] sched/isolation: Fix RCU protection for runtime-mutable cpumask callers Jing Wu
2026-06-18  3:11 ` [PATCH v3 05/13] cpu/hotplug: Reserve CPUHP states for nohz_full and managed IRQ down-paths Jing Wu
2026-06-18 16:06   ` Thomas Gleixner
2026-06-18 21:01     ` Thomas Gleixner
2026-06-18  3:11 ` [PATCH v3 06/13] tick/nohz, context_tracking: Prepare for runtime nohz_full updates Jing Wu
2026-06-18 17:27   ` Thomas Gleixner [this message]
2026-06-18 19:49     ` Thomas Gleixner
2026-06-18  3:11 ` [PATCH v3 07/13] rcu/nocb: Add explicit housekeeping callback for runtime NOCB toggling Jing Wu
2026-06-18  3:11 ` [PATCH v3 08/13] genirq: Add explicit housekeeping callback for managed IRQ migration Jing Wu
2026-06-18 20:27   ` Thomas Gleixner
2026-06-18 21:11     ` Thomas Gleixner
2026-06-18  3:11 ` [PATCH v3 09/13] watchdog/lockup_detector: Register housekeeping callback for kernel-noise Jing Wu
2026-06-18  3:11 ` [PATCH v3 10/13] sched: Guard sched_tick_start/stop against uninitialized tick_work_cpu Jing Wu
2026-06-18 20:50   ` Thomas Gleixner
2026-06-18  3:11 ` [PATCH v3 11/13] cgroup/cpuset: Extend isolated partition to trigger kernel-noise isolation Jing Wu
2026-06-18 20:55   ` Thomas Gleixner
2026-06-18  3:11 ` [PATCH v3 12/13] docs: cgroup-v2: Document kernel-noise isolation via isolated partitions Jing Wu
2026-06-18  3:11 ` [PATCH v3 13/13] selftests/cgroup: Add kernel-noise isolation test to cpuset selftest Jing Wu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87ik7fep2j.ffs@fw13 \
    --to=tglx@kernel.org \
    --cc=anna-maria@linutronix.de \
    --cc=boqun@kernel.org \
    --cc=bsegall@google.com \
    --cc=cgroups@vger.kernel.org \
    --cc=corbet@lwn.net \
    --cc=dietmar.eggemann@arm.com \
    --cc=frederic@kernel.org \
    --cc=jiangshanlai@gmail.com \
    --cc=joelagnelf@nvidia.com \
    --cc=josh@joshtriplett.org \
    --cc=juri.lelli@redhat.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=neeraj.upadhyay@kernel.org \
    --cc=paulmck@kernel.org \
    --cc=peterz@infradead.org \
    --cc=qiang.zhang@linux.dev \
    --cc=rcu@vger.kernel.org \
    --cc=realwujing@gmail.com \
    --cc=rostedt@goodmis.org \
    --cc=shuah@kernel.org \
    --cc=skhan@linuxfoundation.org \
    --cc=tj@kernel.org \
    --cc=urezki@gmail.com \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=yuanql9@chinatelecom.cn \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox