public inbox for netdev@vger.kernel.org
From: Waiman Long <longman@redhat.com>
To: "Thomas Gleixner" <tglx@kernel.org>, "Tejun Heo" <tj@kernel.org>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Shuah Khan" <skhan@linuxfoundation.org>,
	"Catalin Marinas" <catalin.marinas@arm.com>,
	"Will Deacon" <will@kernel.org>,
	"K. Y. Srinivasan" <kys@microsoft.com>,
	"Haiyang Zhang" <haiyangz@microsoft.com>,
	"Wei Liu" <wei.liu@kernel.org>,
	"Dexuan Cui" <decui@microsoft.com>,
	"Long Li" <longli@microsoft.com>,
	"Guenter Roeck" <linux@roeck-us.net>,
	"Frederic Weisbecker" <frederic@kernel.org>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	"Neeraj Upadhyay" <neeraj.upadhyay@kernel.org>,
	"Joel Fernandes" <joelagnelf@nvidia.com>,
	"Josh Triplett" <josh@joshtriplett.org>,
	"Boqun Feng" <boqun@kernel.org>,
	"Uladzislau Rezki" <urezki@gmail.com>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>,
	"Lai Jiangshan" <jiangshanlai@gmail.com>,
	Zqiang <qiang.zhang@linux.dev>,
	"Anna-Maria Behnsen" <anna-maria@linutronix.de>,
	"Ingo Molnar" <mingo@kernel.org>,
	"Chen Ridong" <chenridong@huaweicloud.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Juri Lelli" <juri.lelli@redhat.com>,
	"Vincent Guittot" <vincent.guittot@linaro.org>,
	"Dietmar Eggemann" <dietmar.eggemann@arm.com>,
	"Ben Segall" <bsegall@google.com>, "Mel Gorman" <mgorman@suse.de>,
	"Valentin Schneider" <vschneid@redhat.com>,
	"K Prateek Nayak" <kprateek.nayak@amd.com>,
	"David S. Miller" <davem@davemloft.net>,
	"Eric Dumazet" <edumazet@google.com>,
	"Jakub Kicinski" <kuba@kernel.org>,
	"Paolo Abeni" <pabeni@redhat.com>,
	"Simon Horman" <horms@kernel.org>
Cc: cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-hyperv@vger.kernel.org, linux-hwmon@vger.kernel.org,
	rcu@vger.kernel.org, netdev@vger.kernel.org,
	linux-kselftest@vger.kernel.org,
	Costa Shulyupin <cshulyup@redhat.com>,
	Qiliang Yuan <realwujing@gmail.com>
Subject: Re: [PATCH 18/23] cpu/hotplug: Add a new cpuhp_offline_cb() API
Date: Tue, 21 Apr 2026 13:29:13 -0400	[thread overview]
Message-ID: <4a0ede3e-6e87-414f-a3a3-dd15c32f25ef@redhat.com> (raw)
In-Reply-To: <87o6jcb84w.ffs@tglx>

On 4/21/26 12:17 PM, Thomas Gleixner wrote:
> On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
>> Add a new cpuhp_offline_cb() API that allows us to offline a set of
>> CPUs one-by-one, run the given callback function and then bring those
>> CPUs back online again while inhibiting any concurrent CPU hotplug
>> operations from happening.
> Please provide a properly structured change log which explains the
> context, the problem and the solution in separate paragraphs and this
> order. This is not new. It's documented...
>
>> This new API can be used to enable runtime adjustment of nohz_full and
>> isolcpus boot command line options. A new cpuhp_offline_cb_mode flag
>> is also added to signal that the system is in this offline callback
>> transient state so that some hotplug operations can be optimized out
>> if we choose to.
> We chose nothing.
>
>> +#include <linux/cpumask_types.h>
> What for? This header only needs a 'struct cpumask' forward declaration
> so that the compiler can handle the pointer argument, no?
>
>> +typedef int (*cpuhp_cb_t)(void *arg);
> You couldn't come up with a more generic name for this, right?
>
>>   struct device;
>>   
>>   extern int lockdep_is_cpus_held(void);
>> @@ -29,6 +31,8 @@ void clear_tasks_mm_cpumask(int cpu);
>>   int remove_cpu(unsigned int cpu);
>>   int cpu_device_down(struct device *dev);
>>   void smp_shutdown_nonboot_cpus(unsigned int primary_cpu);
>> +int cpuhp_offline_cb(struct cpumask *mask, cpuhp_cb_t func, void *arg);
> Ditto.
>
>> +extern bool cpuhp_offline_cb_mode;
> Groan. The only users are in the cpusets code which invokes this muck
> and should therefore know what's going on, no?
>
>>   #else /* CONFIG_HOTPLUG_CPU */
>>   
>> @@ -43,6 +47,11 @@ static inline void cpu_hotplug_disable(void) { }
>>   static inline void cpu_hotplug_enable(void) { }
>>   static inline int remove_cpu(unsigned int cpu) { return -EPERM; }
>>   static inline void smp_shutdown_nonboot_cpus(unsigned int primary_cpu) { }
>> +static inline int cpuhp_offline_cb(struct cpumask *mask, cpuhp_cb_t func, void *arg)
>> +{
>> +	return -EPERM;
> -EPERM?
>
>> +/**
>> + * cpuhp_offline_cb - offline CPUs, invoke callback function & online CPUs afterward
>> + * @mask: A mask of CPUs to be taken offline and then online
>> + * @func: A callback function to be invoked while the given CPUs are offline
>> + * @arg:  Argument to be passed back to the callback function
>> + *
>> + * Return: 0 if successful, an error code otherwise
>> + */
>> +int cpuhp_offline_cb(struct cpumask *mask, cpuhp_cb_t func, void *arg)
>> +{
>> +	int off_cpu, on_cpu, ret, ret2 = 0;
>> +
>> +	if (WARN_ON_ONCE(cpumask_empty(mask) ||
>> +	   !cpumask_subset(mask, cpu_online_mask)))
>> +		return -EINVAL;
> No line break required. You have 100 characters.
>
> But what's worse is that the access to cpu_online_mask is not protected
> against a concurrent CPU hotplug operation.
>
>> +
>> +	pr_debug("%s: begin (CPU list = %*pbl)\n", __func__, cpumask_pr_args(mask));
> Tracing?
>
>> +	lock_device_hotplug();
>> +	cpuhp_offline_cb_mode = true;
>> +	/*
>> +	 * If all offline operations succeed, off_cpu should become nr_cpu_ids.
>> +	 */
>> +	for_each_cpu(off_cpu, mask) {
>> +		ret = device_offline(get_cpu_device(off_cpu));
>> +		if (unlikely(ret))
>> +			break;
>> +	}
>> +	if (!ret)
>> +		ret = func(arg);
>> +
>> +	/* Bring previously offline CPUs back online */
>> +	for_each_cpu(on_cpu, mask) {
>> +		int retries = 0;
>> +
>> +		if (on_cpu == off_cpu)
>> +			break;
>> +
>> +retry:
>> +		ret2 = device_online(get_cpu_device(on_cpu));
>> +
>> +		/*
>> +		 * With the unlikely event that CPU hotplug is disabled while
>> +		 * this operation is in progress, we will need to wait a bit
>> +		 * for hotplug to hopefully be re-enabled again. If not, print
>> +		 * a warning and return the error.
>> +		 *
>> +		 * cpu_hotplug_disabled is supposed to be accessed while
>> +		 * holding the cpu_add_remove_lock mutex. So we need to
>> +		 * use the data_race() macro to access it here.
>> +		 */
>> +		while ((ret2 == -EBUSY) && data_race(cpu_hotplug_disabled) &&
>> +		       (++retries <= 5)) {
>> +			msleep(20);
>> +			if (!data_race(cpu_hotplug_disabled))
>> +				goto retry;
>> +		}
>> +		if (ret2) {
>> +			pr_warn("%s: Failed to bring CPU %d back online!\n",
>> +				__func__, on_cpu);
> Provide a proper text and not this silly __func__ thing.
>
>> +			break;
>> +		}
>> +	}
> TBH. This is unreviewable gunk and the whole 'unlikely event that CPU
> hotplug is disabled' is just a lazy hack.
>
> All of this can be avoided including this made up callback function.
>
> It's not rocket science to provide:
>
>       1) A function which serializes against any other CPU hotplug
>          related action.
>
>       2) A function which brings the CPUs in a given CPU mask down
>
>       3) A function which brings the CPUs in a given CPU mask up
>
>       4) A function which undoes #1
>
> Yeah I know, it's more work and not convoluted enough. But see below.
>
> That brings me to that other hack namely cpuhp_offline_cb_mode, which
> you self described as such in patch 21/23:
>
>> +	/*
>> +	 * Hack: In cpuhp_offline_cb_mode, pretend all partitions are empty
>> +	 * to prevent unnecessary partition invalidation.
>> +	 */
>> +	if (cpuhp_offline_cb_mode)
>> +		return false;
>> +
> We are not merging hacks. End of story. But you knew that already, no?
>
> Let's take a step back and see what you really need to achieve:
>
>    1) Update tick_nohz_full_mask
>    2) Update the managed interrupt mask
>    3) Update CPU sets
>
> Independent of the direction of this update you need to ensure that the
> affected functionality keeps working correctly.
>
> You achieve that by bulk offlining the affected CPUs, invoking a magic
> callback and then bulk onlining the affected CPUs again, which requires
> that ill defined cpuhp_offline_cb_mode hackery and probably some more
> hacks all over the place.
>
> You can achieve the same by doing CPU by CPU operations in the right
> order without this mode hack, when you establish proper limitations for
> this:
>
>    At no point in time is it allowed to empty a CPU set or an affected
>    CPU mask, except when you completely undo the isolation of CPUs.
>
>    That can be computed upfront w/o changing anything at all. Once the
>    validity is established, the update can proceed. Or you can leave it
>    to user space which can keep the pieces if it gets it wrong.
>
> That's a reasonable limitation as there is absolutely zero justification
> to support something like:
>
>         housekeeping_cpus = [CPU 0], isolated_cpus = [CPU 1]
>    ---> housekeeping_cpus = [CPU 1], isolated_cpus = [CPU 0]
>
> just because we can with enough horrible hacks.
>
> If you get that out of the way, then a CPU by CPU update becomes the
> obvious and simplest solution. The ordering constraints can be computed
> in user space upfront and there is no reason to do any of this in the
> kernel itself except for an eventual validation step. It might be a tad
> slower, but this is all but a hotpath operation.
>
> Just for the record. I suggested exactly this more than a year ago and
> it's still the right thing to do.
>
> And of course neither your cover letter nor any of the patches give a
> proper rationale why you think that your bulk hackery is better. For the
> very simple reason that there is no rationale at all.
>
> This bulk muck is doomed when your ultimate goal is to avoid the stop
> machine dance. With a per CPU update it is actually doable without more
> ill defined hacks all over the place.
>
>     1) Bring down the CPU to CPUHP_AP_SCHED_WAIT_EMPTY, which is the last
>        state before stop machine is invoked.
>
>        At that point:
>
>           - no user space thread is running on the CPU anymore
>
>           - everything related to this CPU has been shut down or moved
>             elsewhere
>
>           - interrupt managed device queues are quiesced if the CPU was
>             the last online one in the queue affinity mask. If not the
>             interrupt might still be affine to the CPU, but there is at
>             least one other CPU available in the mask.
>
>     2) Update the tick NOHZ handover
>
>        This can be done without going into stop machine by providing a
>        hotplug callback right between CPUHP_AP_SMPBOOT_THREADS and
>        CPUHP_AP_IRQ_AFFINITY_ONLINE.
>
>        That's trivial enough to achieve and can work independently of
>        NOHZ full.
>
>     3) Rework the affinity management, so that interrupt affinities can
>        be reassigned in the CPUHP_AP_IRQ_AFFINITY_ONLINE state.
>
>        That needs a lot of thoughts, but there is no real reason why it
>        can't work.
>
>     4) Flip the housekeeping CPU masks in sched_cpu_wait_empty() after
>        balance_hotplug_wait().
>
>     5) Bring the CPU online again.
>
> For #2 and #3 to work you need a separate CPU mask which avoids touching
> CPU online mask. For #3 this needs some more work to avoid reassigning the
> interrupts once sparse_irq_lock is dropped, but the bulk is achieved
> with the separate CPU mask.
>
> No?

Thanks for the great suggestions. I will certainly look into that.

We actually have cpu_active_mask, from which a CPU's bit is cleared 
early in sched_cpu_deactivate(). In the CPUHP_AP_SCHED_WAIT_EMPTY state, 
the CPU will still have its online bit set, but its active bit will 
already be cleared. Alternatively, we could add another cpumask to 
indicate CPUs that have reached CPUHP_AP_SCHED_WAIT_EMPTY or below, if 
necessary.

Cheers,
Longman

