From mboxrd@z Thu Jan 1 00:00:00 1970
From: sashiko-bot@kernel.org
Subject: Re: [PATCH 18/23] cpu/hotplug: Add a new cpuhp_offline_cb() API
Reply-To: sashiko@lists.linux.dev
To: "Waiman Long"
Cc: linux-hwmon@vger.kernel.org, hannes@cmpxchg.org, tj@kernel.org, mkoutny@suse.com
In-Reply-To: <20260421030351.281436-19-longman@redhat.com>
References: <20260421030351.281436-19-longman@redhat.com>
Content-Type: text/plain; charset=utf-8
Date: Wed, 22 Apr 2026 03:09:05 +0000
Message-Id: <20260422030905.8FB79C2BCB0@smtp.kernel.org>
Precedence: bulk
X-Mailing-List: linux-hwmon@vger.kernel.org

Sashiko AI review found 4 potential issues:

- [High] Early loop break in the rollback path leaves successfully offlined CPUs permanently offline.
- [High] Race condition in the `device_online()` retry loop bypasses recovery if hotplug is re-enabled immediately.
- [High] Hardcoded 100ms timeout for `cpu_hotplug_disabled` is brittle due to lock dropping, leading to permanently stranded CPUs.
- [Medium] Time-of-check to time-of-use (TOCTOU) race on `cpu_online_mask` validation triggers spurious kernel warnings and API failure.

--
commit ebe2a08265b3e2e79b0d74906f6ed951ac37520b
Author: Waiman Long

    cpu/hotplug: Add a new cpuhp_offline_cb() API

    This patch introduces the cpuhp_offline_cb() API to offline a set of
    CPUs, execute a callback function, and then bring them back online.
    It allows runtime adjustment of command line options like nohz_full
    and isolcpus. A new flag, cpuhp_offline_cb_mode, is added to optimize
    hotplug operations while the system is in this transient state.
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 0d02b5d7a7baf..9b32f742cd1da 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1520,6 +1520,76 @@ int remove_cpu(unsigned int cpu)
>  }
>  EXPORT_SYMBOL_GPL(remove_cpu);
>
> +bool cpuhp_offline_cb_mode;
> +
> +/**
> + * cpuhp_offline_cb - offline CPUs, invoke callback function & online CPUs afterward
> + * @mask: A mask of CPUs to be taken offline and then online
> + * @func: A callback function to be invoked while the given CPUs are offline
> + * @arg: Argument to be passed back to the callback function
> + *
> + * Return: 0 if successful, an error code otherwise
> + */
> +int cpuhp_offline_cb(struct cpumask *mask, cpuhp_cb_t func, void *arg)
> +{
> +        int off_cpu, on_cpu, ret, ret2 = 0;
> +
> +        if (WARN_ON_ONCE(cpumask_empty(mask) ||
> +                         !cpumask_subset(mask, cpu_online_mask)))
> +                return -EINVAL;

Since the mask validation checks cpu_online_mask before acquiring
lock_device_hotplug(), could a concurrent user-initiated sysfs CPU
offline operation remove a CPU from cpu_online_mask after the caller
generates the mask? This would trigger a spurious WARN_ON_ONCE and fail
with -EINVAL. Should this validation be moved under the hotplug lock?

> +
> +        pr_debug("%s: begin (CPU list = %*pbl)\n", __func__, cpumask_pr_args(mask));
> +        lock_device_hotplug();
> +        cpuhp_offline_cb_mode = true;

[ ... ]

> +        /* Bring previously offline CPUs back online */
> +        for_each_cpu(on_cpu, mask) {
> +                int retries = 0;
> +
> +                if (on_cpu == off_cpu)
> +                        break;
> +
> +retry:
> +                ret2 = device_online(get_cpu_device(on_cpu));
> +
> +                /*
> +                 * With the unlikely event that CPU hotplug is disabled while
> +                 * this operation is in progress, we will need to wait a bit
> +                 * for hotplug to hopefully be re-enabled again. If not, print
> +                 * a warning and return the error.
> +                 *
> +                 * cpu_hotplug_disabled is supposed to be accessed while
> +                 * holding the cpu_add_remove_lock mutex. So we need to
> +                 * use the data_race() macro to access it here.
> +                 */
> +                while ((ret2 == -EBUSY) && data_race(cpu_hotplug_disabled) &&
> +                       (++retries <= 5)) {

If device_online() returns -EBUSY, but another thread re-enables CPU
hotplug right before the while loop condition is evaluated,
data_race(cpu_hotplug_disabled) will evaluate to false. Doesn't this
short-circuit the loop entirely, leaving ret2 == -EBUSY and bypassing
the retry mechanism exactly when recovery would have succeeded?

Additionally, is the hardcoded 100ms (5 * 20ms) timeout sufficient?
Concurrent paths like PCI probing or idle injection might hold the
cpu_add_remove_lock for several seconds via cpu_hotplug_disable(). If
the timeout expires during the online phase, won't this abort the API
and permanently abandon the CPUs in an offline state?

> +                        msleep(20);
> +                        if (!data_race(cpu_hotplug_disabled))
> +                                goto retry;
> +                }
> +                if (ret2) {
> +                        pr_warn("%s: Failed to bring CPU %d back online!\n",
> +                                __func__, on_cpu);
> +                        break;

If device_online() fails to bring a CPU back online and breaks out of
the loop here, doesn't it prematurely exit the restoration loop? This
would skip any remaining CPUs in the mask that were successfully taken
offline, leaving them permanently offline and reducing system capacity.
Should this use continue instead of break to ensure all offlined CPUs
have a chance to be restored?

> +                }
> +        }
> +        cpuhp_offline_cb_mode = false;
> +        unlock_device_hotplug();
> +        pr_debug("%s: end\n", __func__);
> +        return ret ? ret : ret2;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260421030351.281436-1-longman@redhat.com?part=18