From: sashiko-bot@kernel.org
To: "Waiman Long" <longman@redhat.com>
Cc: linux-hwmon@vger.kernel.org, hannes@cmpxchg.org, tj@kernel.org,
mkoutny@suse.com
Subject: Re: [PATCH 18/23] cpu/hotplug: Add a new cpuhp_offline_cb() API
Date: Wed, 22 Apr 2026 03:09:05 +0000 [thread overview]
Message-ID: <20260422030905.8FB79C2BCB0@smtp.kernel.org> (raw)
In-Reply-To: <20260421030351.281436-19-longman@redhat.com>
Sashiko AI review found 4 potential issue(s):
- [High] Early loop break in the rollback path leaves successfully offlined CPUs permanently offline.
- [High] Race condition in the `device_online()` retry loop bypasses recovery if hotplug is re-enabled immediately.
- [High] Hardcoded 100ms timeout for `cpu_hotplug_disabled` is brittle due to lock dropping, leading to permanently stranded CPUs.
- [Medium] Time-of-check to time-of-use (TOCTOU) race on `cpu_online_mask` validation triggers spurious kernel warnings and API failure.
--
commit ebe2a08265b3e2e79b0d74906f6ed951ac37520b
Author: Waiman Long <longman@redhat.com>
cpu/hotplug: Add a new cpuhp_offline_cb() API
This patch introduces the cpuhp_offline_cb() API to offline a set of CPUs,
execute a callback function, and then bring them back online. It allows
runtime adjustment of command line options like nohz_full and isolcpus.
A new flag, cpuhp_offline_cb_mode, is added to optimize hotplug operations
while the system is in this transient state.
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 0d02b5d7a7baf..9b32f742cd1da 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1520,6 +1520,76 @@ int remove_cpu(unsigned int cpu)
> }
> EXPORT_SYMBOL_GPL(remove_cpu);
>
> +bool cpuhp_offline_cb_mode;
> +
> +/**
> + * cpuhp_offline_cb - offline CPUs, invoke callback function & online CPUs afterward
> + * @mask: A mask of CPUs to be taken offline and then online
> + * @func: A callback function to be invoked while the given CPUs are offline
> + * @arg: Argument to be passed back to the callback function
> + *
> + * Return: 0 if successful, an error code otherwise
> + */
> +int cpuhp_offline_cb(struct cpumask *mask, cpuhp_cb_t func, void *arg)
> +{
> + int off_cpu, on_cpu, ret, ret2 = 0;
> +
> + if (WARN_ON_ONCE(cpumask_empty(mask) ||
> + !cpumask_subset(mask, cpu_online_mask)))
> + return -EINVAL;
Since the mask validation checks cpu_online_mask before acquiring
lock_device_hotplug(), could a concurrent user-initiated sysfs CPU offline
operation remove a CPU from cpu_online_mask after the caller generates the
mask? This would trigger a spurious WARN_ON_ONCE and fail with -EINVAL.
Should this validation be moved under the hotplug lock?
> +
> + pr_debug("%s: begin (CPU list = %*pbl)\n", __func__, cpumask_pr_args(mask));
> + lock_device_hotplug();
> + cpuhp_offline_cb_mode = true;
[ ... ]
> + /* Bring previously offline CPUs back online */
> + for_each_cpu(on_cpu, mask) {
> + int retries = 0;
> +
> + if (on_cpu == off_cpu)
> + break;
> +
> +retry:
> + ret2 = device_online(get_cpu_device(on_cpu));
> +
> + /*
> + * With the unlikely event that CPU hotplug is disabled while
> + * this operation is in progress, we will need to wait a bit
> + * for hotplug to hopefully be re-enabled again. If not, print
> + * a warning and return the error.
> + *
> + * cpu_hotplug_disabled is supposed to be accessed while
> + * holding the cpu_add_remove_lock mutex. So we need to
> + * use the data_race() macro to access it here.
> + */
> + while ((ret2 == -EBUSY) && data_race(cpu_hotplug_disabled) &&
> + (++retries <= 5)) {
If device_online() returns -EBUSY, but another thread re-enables CPU hotplug
right before the while loop condition is evaluated,
data_race(cpu_hotplug_disabled) will evaluate to false. Doesn't this
short-circuit the loop entirely, leaving ret2 == -EBUSY and bypassing the
retry mechanism exactly when recovery would have succeeded?
Additionally, is the hardcoded 100ms (5 * 20ms) timeout sufficient?
Concurrent paths like PCI probing or idle injection might hold the
cpu_add_remove_lock for several seconds via cpu_hotplug_disable(). If the
timeout expires during the online phase, won't this abort the API and
permanently abandon the CPUs in an offline state?
> + msleep(20);
> + if (!data_race(cpu_hotplug_disabled))
> + goto retry;
> + }
> + if (ret2) {
> + pr_warn("%s: Failed to bring CPU %d back online!\n",
> + __func__, on_cpu);
> + break;
If device_online() fails to bring a CPU back online and breaks out of the
loop here, doesn't it prematurely exit the restoration loop? This would skip
any remaining CPUs in the mask that were successfully taken offline, leaving
them permanently offline and reducing system capacity. Should this use
continue instead of break to ensure all offlined CPUs have a chance to be
restored?
> + }
> + }
> + cpuhp_offline_cb_mode = false;
> + unlock_device_hotplug();
> + pr_debug("%s: end\n", __func__);
> + return ret ? ret : ret2;
> +}
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260421030351.281436-1-longman@redhat.com?part=18
next prev parent reply other threads:[~2026-04-22 3:09 UTC|newest]
Thread overview: 56+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-21 3:03 [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs Waiman Long
2026-04-21 3:03 ` [PATCH 01/23] sched/isolation: Add HK_TYPE_KERNEL_NOISE_BOOT & HK_TYPE_MANAGED_IRQ_BOOT Waiman Long
2026-04-21 3:03 ` [PATCH 02/23] sched/isolation: Enhance housekeeping_update() to support updating more than one HK cpumask Waiman Long
2026-04-22 3:08 ` sashiko-bot
2026-04-22 6:39 ` Chen Ridong
2026-04-21 3:03 ` [PATCH 03/23] tick/nohz: Make nohz_full parameter optional Waiman Long
2026-04-21 8:32 ` Thomas Gleixner
2026-04-21 14:14 ` Waiman Long
2026-04-22 3:08 ` sashiko-bot
2026-04-21 3:03 ` [PATCH 04/23] tick/nohz: Allow runtime changes in full dynticks CPUs Waiman Long
2026-04-21 8:50 ` Thomas Gleixner
2026-04-21 14:24 ` Waiman Long
2026-04-22 3:08 ` sashiko-bot
2026-04-21 3:03 ` [PATCH 05/23] tick: Pass timer tick job to an online HK CPU in tick_cpu_dying() Waiman Long
2026-04-21 8:55 ` Thomas Gleixner
2026-04-21 14:22 ` Waiman Long
2026-04-21 3:03 ` [PATCH 06/23] rcu/nocbs: Allow runtime changes in RCU NOCBS cpumask Waiman Long
2026-04-22 3:08 ` sashiko-bot
2026-04-21 3:03 ` [PATCH 07/23] watchdog: Sync up with runtime change of isolated CPUs Waiman Long
2026-04-22 3:08 ` sashiko-bot
2026-04-21 3:03 ` [PATCH 08/23] arm64: topology: Use RCU to protect access to HK_TYPE_TICK cpumask Waiman Long
2026-04-22 3:08 ` sashiko-bot
2026-04-22 9:34 ` Chen Ridong
2026-04-21 3:03 ` [PATCH 09/23] workqueue: Use RCU to protect access of HK_TYPE_TIMER cpumask Waiman Long
2026-04-21 3:03 ` [PATCH 10/23] cpu: " Waiman Long
2026-04-21 8:57 ` Thomas Gleixner
2026-04-21 14:25 ` Waiman Long
2026-04-21 3:03 ` [PATCH 11/23] hrtimer: " Waiman Long
2026-04-21 8:59 ` Thomas Gleixner
2026-04-22 3:09 ` sashiko-bot
2026-04-21 3:03 ` [PATCH 12/23] net: Use boot time housekeeping cpumask settings for now Waiman Long
2026-04-21 3:03 ` [PATCH 13/23] sched/core: Use RCU to protect access of HK_TYPE_KERNEL_NOISE cpumask Waiman Long
2026-04-22 3:09 ` sashiko-bot
2026-04-21 3:03 ` [PATCH 14/23] hwmon/coretemp: Use RCU to protect access of HK_TYPE_MISC cpumask Waiman Long
2026-04-22 3:09 ` sashiko-bot
2026-04-21 3:03 ` [PATCH 15/23] Drivers: hv: Use RCU to protect access of HK_TYPE_MANAGED_IRQ cpumask Waiman Long
2026-04-22 3:09 ` sashiko-bot
2026-04-21 3:03 ` [PATCH 16/23] genirq/cpuhotplug: " Waiman Long
2026-04-21 9:02 ` Thomas Gleixner
2026-04-21 14:29 ` Waiman Long
2026-04-21 3:03 ` [PATCH 17/23] sched/isolation: Extend housekeeping_dereference_check() to cover changes in nohz_full or manged_irqs cpumasks Waiman Long
2026-04-22 3:09 ` sashiko-bot
2026-04-21 3:03 ` [PATCH 18/23] cpu/hotplug: Add a new cpuhp_offline_cb() API Waiman Long
2026-04-21 16:17 ` Thomas Gleixner
2026-04-21 17:29 ` Waiman Long
2026-04-21 18:43 ` Thomas Gleixner
2026-04-22 3:09 ` sashiko-bot [this message]
2026-04-21 3:03 ` [PATCH 19/23] cgroup/cpuset: Improve check for calling housekeeping_update() Waiman Long
2026-04-21 3:03 ` [PATCH 20/23] cgroup/cpuset: Enable runtime update of HK_TYPE_{KERNEL_NOISE,MANAGED_IRQ} cpumasks Waiman Long
2026-04-22 3:09 ` sashiko-bot
2026-04-21 3:03 ` [PATCH 21/23] cgroup/cpuset: Limit the side effect of using CPU hotplug on isolated partition Waiman Long
2026-04-22 3:09 ` sashiko-bot
2026-04-21 3:03 ` [PATCH 22/23] cgroup/cpuset: Prevent offline_disabled CPUs from being used in " Waiman Long
2026-04-22 3:09 ` sashiko-bot
2026-04-21 3:03 ` [PATCH 23/23] cgroup/cpuset: Documentation and kselftest updates Waiman Long
2026-04-22 3:09 ` sashiko-bot
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260422030905.8FB79C2BCB0@smtp.kernel.org \
--to=sashiko-bot@kernel.org \
--cc=hannes@cmpxchg.org \
--cc=linux-hwmon@vger.kernel.org \
--cc=longman@redhat.com \
--cc=mkoutny@suse.com \
--cc=sashiko@lists.linux.dev \
--cc=tj@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox