* [PATCH/for-next 0/2] cgroup/cpuset: Fix partition related locking issues
@ 2026-01-28 4:42 Waiman Long
2026-01-28 4:42 ` [PATCH/for-next 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to task_work Waiman Long
2026-01-28 4:42 ` [PATCH/for-next 2/2] cgroup/cpuset: Introduce a new top level isolcpus_update_mutex Waiman Long
0 siblings, 2 replies; 19+ messages in thread
From: Waiman Long @ 2026-01-28 4:42 UTC (permalink / raw)
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
Shuah Khan
Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long
After booting the latest cgroup for-next debug kernel with the latest
cgroup changes as well as Frederic's "cpuset/isolation: Honour kthreads
preferred affinity" patch series [1] merged on top and running the
test_cpuset_prs.sh test, a circular locking dependency lockdep splat
was reported. See patch 2 for details.
The following changes are made to resolve this locking problem:
1) Defer the housekeeping_update() call from CPU hotplug to a task_work.
2) Release cpus_read_lock before calling housekeeping_update().
With these changes, the cpuset test ran to completion with no failure
and no lockdep splat.
[1] https://lore.kernel.org/lkml/20260125224541.50226-1-frederic@kernel.org/
Waiman Long (2):
cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to
task_work
cgroup/cpuset: Introduce a new top level isolcpus_update_mutex
kernel/cgroup/cpuset.c | 124 ++++++++++++++----
kernel/sched/isolation.c | 4 +-
kernel/time/timer_migration.c | 3 +-
.../selftests/cgroup/test_cpuset_prs.sh | 9 ++
4 files changed, 111 insertions(+), 29 deletions(-)
--
2.52.0
^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH/for-next 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to task_work
2026-01-28 4:42 [PATCH/for-next 0/2] cgroup/cpuset: Fix partition related locking issues Waiman Long
@ 2026-01-28 4:42 ` Waiman Long
2026-01-28 17:44 ` Tejun Heo
2026-01-29 4:03 ` Chen Ridong
2026-01-28 4:42 ` [PATCH/for-next 2/2] cgroup/cpuset: Introduce a new top level isolcpus_update_mutex Waiman Long
1 sibling, 2 replies; 19+ messages in thread
From: Waiman Long @ 2026-01-28 4:42 UTC (permalink / raw)
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
Shuah Khan
Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long

The update_isolation_cpumasks() function can be called either directly
from a regular cpuset control file write with cpuset_full_lock() called
or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held.

As we are going to enable dynamic update of the nohz_full housekeeping
cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
allowing the CPU hotplug path to call into housekeeping_update()
directly from update_isolation_cpumasks() will cause a deadlock. So we
have to defer any call to housekeeping_update() until after the CPU
hotplug operation has finished. This can be done via the
task_work_add(..., TWA_RESUME) API where the actual
housekeeping_update() call, if needed, will happen right before exiting
back to userspace.

Since the HK_TYPE_DOMAIN housekeeping cpumask should now track the
changes in "cpuset.cpus.isolated", add a check in test_cpuset_prs.sh to
confirm that the CPU hotplug deferral, if needed, is working as
expected.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c                        | 49 ++++++++++++++++++-
 .../selftests/cgroup/test_cpuset_prs.sh       |  9 ++++
 2 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 7b7d12ab1006..98c7cb732206 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -84,6 +84,10 @@ static cpumask_var_t isolated_cpus;
  */
 static bool isolated_cpus_updating;
 
+/* Both cpuset_mutex and cpus_read_lock acquired */
+static bool cpuset_full_locked;
+static bool isolation_task_work_queued;
+
 /*
  * A flag to force sched domain rebuild at the end of an operation.
  * It can be set in
@@ -285,10 +289,12 @@ void cpuset_full_lock(void)
 {
 	cpus_read_lock();
 	mutex_lock(&cpuset_mutex);
+	cpuset_full_locked = true;
 }
 
 void cpuset_full_unlock(void)
 {
+	cpuset_full_locked = false;
 	mutex_unlock(&cpuset_mutex);
 	cpus_read_unlock();
 }
@@ -1285,25 +1291,64 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
 	return false;
 }
 
+static void __update_isolation_cpumasks(bool twork);
+static void isolation_task_work_fn(struct callback_head *cb)
+{
+	cpuset_full_lock();
+	__update_isolation_cpumasks(true);
+	cpuset_full_lock();
+}
+
 /*
- * update_isolation_cpumasks - Update external isolation related CPU masks
+ * __update_isolation_cpumasks - Update external isolation related CPU masks
+ * @twork - set if called from isolation_task_work_fn()
  *
  * The following external CPU masks will be updated if necessary:
  *  - workqueue unbound cpumask
  */
-static void update_isolation_cpumasks(void)
+static void __update_isolation_cpumasks(bool twork)
 {
 	int ret;
 
+	if (twork)
+		isolation_task_work_queued = false;
+
	if (!isolated_cpus_updating)
 		return;
 
+	/*
+	 * This function can be reached either directly from regular cpuset
+	 * control file write (cpuset_full_locked) or via hotplug
+	 * (cpus_write_lock && cpuset_mutex held). In the latter case, we
+	 * defer the housekeeping_update() call to a task_work to avoid
+	 * the possibility of deadlock. The task_work will be run right
+	 * before exiting back to userspace.
+	 */
+	if (!cpuset_full_locked) {
+		static struct callback_head twork_cb;
+
+		if (!isolation_task_work_queued) {
+			init_task_work(&twork_cb, isolation_task_work_fn);
+			if (!task_work_add(current, &twork_cb, TWA_RESUME))
+				isolation_task_work_queued = true;
+			else
+				/* Current task shouldn't be exiting */
+				WARN_ON_ONCE(1);
+		}
+		return;
+	}
+
 	ret = housekeeping_update(isolated_cpus);
 	WARN_ON_ONCE(ret < 0);
 
 	isolated_cpus_updating = false;
 }
 
+static inline void update_isolation_cpumasks(void)
+{
+	__update_isolation_cpumasks(false);
+}
+
 /**
  * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
  * @parent: Parent cpuset containing all siblings
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index 5dff3ad53867..af4a2532cb3e 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -773,6 +773,7 @@ check_isolcpus()
 	EXPECTED_ISOLCPUS=$1
 	ISCPUS=${CGROUP2}/cpuset.cpus.isolated
 	ISOLCPUS=$(cat $ISCPUS)
+	HKICPUS=$(cat /sys/devices/system/cpu/isolated)
 	LASTISOLCPU=
 	SCHED_DOMAINS=/sys/kernel/debug/sched/domains
 	if [[ $EXPECTED_ISOLCPUS = . ]]
@@ -810,6 +811,14 @@ check_isolcpus()
 	ISOLCPUS=
 	EXPECTED_ISOLCPUS=$EXPECTED_SDOMAIN
 
+	#
+	# The inverse of HK_TYPE_DOMAIN cpumask in $HKICPUS should match $ISOLCPUS
+	#
+	[[ "$ISOLCPUS" != "$HKICPUS" ]] && {
+		echo "Housekeeping isolated CPUs mismatch - $HKICPUS"
+		return 1
+	}
+
 	#
 	# Use the sched domain in debugfs to check isolated CPUs, if available
 	#
-- 
2.52.0

^ permalink raw reply related	[flat|nested] 19+ messages in thread
* Re: [PATCH/for-next 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to task_work
2026-01-28 4:42 ` [PATCH/for-next 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to task_work Waiman Long
@ 2026-01-28 17:44 ` Tejun Heo
2026-01-28 18:08 ` Waiman Long
0 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2026-01-28 17:44 UTC (permalink / raw)
To: Waiman Long
Cc: Chen Ridong, Johannes Weiner, Michal Koutný, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Anna-Maria Behnsen,
Frederic Weisbecker, Thomas Gleixner, Shuah Khan, cgroups,
linux-kernel, linux-kselftest

On Tue, Jan 27, 2026 at 11:42:50PM -0500, Waiman Long wrote:
> +static void isolation_task_work_fn(struct callback_head *cb)
> +{
> +	cpuset_full_lock();
> +	__update_isolation_cpumasks(true);
> +	cpuset_full_lock();
	^
	unlock?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [PATCH/for-next 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to task_work
2026-01-28 17:44 ` Tejun Heo
@ 2026-01-28 18:08 ` Waiman Long
0 siblings, 0 replies; 19+ messages in thread
From: Waiman Long @ 2026-01-28 18:08 UTC (permalink / raw)
To: Tejun Heo
Cc: Chen Ridong, Johannes Weiner, Michal Koutný, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Anna-Maria Behnsen,
Frederic Weisbecker, Thomas Gleixner, Shuah Khan, cgroups,
linux-kernel, linux-kselftest

On 1/28/26 12:44 PM, Tejun Heo wrote:
> On Tue, Jan 27, 2026 at 11:42:50PM -0500, Waiman Long wrote:
>> +static void isolation_task_work_fn(struct callback_head *cb)
>> +{
>> +	cpuset_full_lock();
>> +	__update_isolation_cpumasks(true);
>> +	cpuset_full_lock();
> ^
> unlock?
>
> Thanks.
>
Sorry about that, will fix it. That change was removed in the 2nd patch.
That is why the series still worked as expected.

Thanks,
Longman

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [PATCH/for-next 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to task_work 2026-01-28 4:42 ` [PATCH/for-next 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to task_work Waiman Long 2026-01-28 17:44 ` Tejun Heo @ 2026-01-29 4:03 ` Chen Ridong 2026-01-29 7:15 ` Chen Ridong 1 sibling, 1 reply; 19+ messages in thread From: Chen Ridong @ 2026-01-29 4:03 UTC (permalink / raw) To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner, Shuah Khan Cc: cgroups, linux-kernel, linux-kselftest On 2026/1/28 12:42, Waiman Long wrote: > The update_isolation_cpumasks() function can be called either directly > from regular cpuset control file write with cpuset_full_lock() called > or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held. > > As we are going to enable dynamic update to the nozh_full housekeeping > cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug, > allowing the CPU hotplug path to call into housekeeping_update() > directly from update_isolation_cpumasks() will cause deadlock. So we > have to defer any call to housekeeping_update() after the CPU hotplug > operation has finished. This can be done via the task_work_add(..., > TWA_RESUME) API where the actual housekeeping_update() call, if needed, > will happen right before existing back to userspace. > > Since the HK_TYPE_DOMAIN housekeeping cpumask should now track the > changes in "cpuset.cpus.isolated", add a check in test_cpuset_prs.sh to > confirm that the CPU hotplug deferral, if needed, is working as expected. 
> > Signed-off-by: Waiman Long <longman@redhat.com> > --- > kernel/cgroup/cpuset.c | 49 ++++++++++++++++++- > .../selftests/cgroup/test_cpuset_prs.sh | 9 ++++ > 2 files changed, 56 insertions(+), 2 deletions(-) > > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c > index 7b7d12ab1006..98c7cb732206 100644 > --- a/kernel/cgroup/cpuset.c > +++ b/kernel/cgroup/cpuset.c > @@ -84,6 +84,10 @@ static cpumask_var_t isolated_cpus; > */ > static bool isolated_cpus_updating; > > +/* Both cpuset_mutex and cpus_read_locked acquired */ > +static bool cpuset_full_locked; > +static bool isolation_task_work_queued; > + > /* > * A flag to force sched domain rebuild at the end of an operation. > * It can be set in > @@ -285,10 +289,12 @@ void cpuset_full_lock(void) > { > cpus_read_lock(); > mutex_lock(&cpuset_mutex); > + cpuset_full_locked = true; > } > > void cpuset_full_unlock(void) > { > + cpuset_full_locked = false; > mutex_unlock(&cpuset_mutex); > cpus_read_unlock(); > } > @@ -1285,25 +1291,64 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus) > return false; > } > > +static void __update_isolation_cpumasks(bool twork); > +static void isolation_task_work_fn(struct callback_head *cb) > +{ > + cpuset_full_lock(); > + __update_isolation_cpumasks(true); > + cpuset_full_lock(); > +} > + > /* > - * update_isolation_cpumasks - Update external isolation related CPU masks > + * __update_isolation_cpumasks - Update external isolation related CPU masks > + * @twork - set if call from isolation_task_work_fn() > * > * The following external CPU masks will be updated if necessary: > * - workqueue unbound cpumask > */ > -static void update_isolation_cpumasks(void) > +static void __update_isolation_cpumasks(bool twork) > { > int ret; > > + if (twork) > + isolation_task_work_queued = false; > + > if (!isolated_cpus_updating) > return; > > + /* > + * This function can be reached either directly from regular cpuset > + * control file write 
(cpuset_full_locked) or via hotplug > + * (cpus_write_lock && cpuset_mutex held). In the later case, we > + * defer the housekeeping_update() call to a task_work to avoid > + * the possibility of deadlock. The task_work will be run right > + * before exiting back to userspace. > + */ > + if (!cpuset_full_locked) { > + static struct callback_head twork_cb; > + > + if (!isolation_task_work_queued) { > + init_task_work(&twork_cb, isolation_task_work_fn); > + if (!task_work_add(current, &twork_cb, TWA_RESUME)) > + isolation_task_work_queued = true; > + else > + /* Current task shouldn't be exiting */ > + WARN_ON_ONCE(1); > + } > + return; > + } > + > ret = housekeeping_update(isolated_cpus); > WARN_ON_ONCE(ret < 0); > > isolated_cpus_updating = false; > } > The logic is not straightforward; perhaps we can simplify it as follows, maybe I missed something, just correct me. static void isolation_task_work_fn(struct callback_head *cb) { guard(mutex)(&isolcpus_update_mutex); WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0); } /* * __update_isolation_cpumasks - Update external isolation related CPU masks * @twork - set if call from isolation_task_work_fn() * * The following external CPU masks will be updated if necessary: * - workqueue unbound cpumask */ static void __update_isolation_cpumasks(bool twork) { if (!isolated_cpus_updating) return; /* * This function can be reached either directly from regular cpuset * control file write (cpuset_full_locked) or via hotplug * (cpus_write_lock && cpuset_mutex held). In the later case, we * defer the housekeeping_update() call to a task_work to avoid * the possibility of deadlock. The task_work will be run right * before exiting back to userspace. 
*/ if (twork) { static struct callback_head twork_cb; init_task_work(&twork_cb, isolation_task_work_fn); if (task_work_add(current, &twork_cb, TWA_RESUME)) /* Current task shouldn't be exiting */ WARN_ON_ONCE(1); return; } lockdep_assert_held(&isolcpus_update_mutex); /* * Release cpus_read_lock & cpuset_mutex before calling * housekeeping_update() and re-acquiring them afterward if not * calling from task_work. */ cpuset_full_unlock(); WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0); cpuset_full_lock(); isolated_cpus_updating = false; } static inline void update_isolation_cpumasks(void) { __update_isolation_cpumasks(false); } If twork is true, add it directly to the work queue — note that this requires holding isolcpus_update_mutex and then calling housekeeping_update. Otherwise, release the 'full' lock before invoking housekeeping_update. > +static inline void update_isolation_cpumasks(void) > +{ > + __update_isolation_cpumasks(false); > +} > + > /** > * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets > * @parent: Parent cpuset containing all siblings > diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh > index 5dff3ad53867..af4a2532cb3e 100755 > --- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh > +++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh > @@ -773,6 +773,7 @@ check_isolcpus() > EXPECTED_ISOLCPUS=$1 > ISCPUS=${CGROUP2}/cpuset.cpus.isolated > ISOLCPUS=$(cat $ISCPUS) > + HKICPUS=$(cat /sys/devices/system/cpu/isolated) > LASTISOLCPU= > SCHED_DOMAINS=/sys/kernel/debug/sched/domains > if [[ $EXPECTED_ISOLCPUS = . 
]] > @@ -810,6 +811,14 @@ check_isolcpus() > ISOLCPUS= > EXPECTED_ISOLCPUS=$EXPECTED_SDOMAIN > > + # > + # The inverse of HK_TYPE_DOMAIN cpumask in $HKICPUS should match $ISOLCPUS > + # > + [[ "$ISOLCPUS" != "$HKICPUS" ]] && { > + echo "Housekeeping isolated CPUs mismatch - $HKICPUS" > + return 1 > + } > + > # > # Use the sched domain in debugfs to check isolated CPUs, if available > # -- Best regards, Ridong ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH/for-next 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to task_work 2026-01-29 4:03 ` Chen Ridong @ 2026-01-29 7:15 ` Chen Ridong 2026-01-30 1:37 ` Waiman Long 0 siblings, 1 reply; 19+ messages in thread From: Chen Ridong @ 2026-01-29 7:15 UTC (permalink / raw) To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner, Shuah Khan Cc: cgroups, linux-kernel, linux-kselftest On 2026/1/29 12:03, Chen Ridong wrote: > > > On 2026/1/28 12:42, Waiman Long wrote: >> The update_isolation_cpumasks() function can be called either directly >> from regular cpuset control file write with cpuset_full_lock() called >> or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held. >> >> As we are going to enable dynamic update to the nozh_full housekeeping >> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug, >> allowing the CPU hotplug path to call into housekeeping_update() >> directly from update_isolation_cpumasks() will cause deadlock. So we >> have to defer any call to housekeeping_update() after the CPU hotplug >> operation has finished. This can be done via the task_work_add(..., >> TWA_RESUME) API where the actual housekeeping_update() call, if needed, >> will happen right before existing back to userspace. >> >> Since the HK_TYPE_DOMAIN housekeeping cpumask should now track the >> changes in "cpuset.cpus.isolated", add a check in test_cpuset_prs.sh to >> confirm that the CPU hotplug deferral, if needed, is working as expected. 
>> >> Signed-off-by: Waiman Long <longman@redhat.com> >> --- >> kernel/cgroup/cpuset.c | 49 ++++++++++++++++++- >> .../selftests/cgroup/test_cpuset_prs.sh | 9 ++++ >> 2 files changed, 56 insertions(+), 2 deletions(-) >> >> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c >> index 7b7d12ab1006..98c7cb732206 100644 >> --- a/kernel/cgroup/cpuset.c >> +++ b/kernel/cgroup/cpuset.c >> @@ -84,6 +84,10 @@ static cpumask_var_t isolated_cpus; >> */ >> static bool isolated_cpus_updating; >> >> +/* Both cpuset_mutex and cpus_read_locked acquired */ >> +static bool cpuset_full_locked; >> +static bool isolation_task_work_queued; >> + >> /* >> * A flag to force sched domain rebuild at the end of an operation. >> * It can be set in >> @@ -285,10 +289,12 @@ void cpuset_full_lock(void) >> { >> cpus_read_lock(); >> mutex_lock(&cpuset_mutex); >> + cpuset_full_locked = true; >> } >> >> void cpuset_full_unlock(void) >> { >> + cpuset_full_locked = false; >> mutex_unlock(&cpuset_mutex); >> cpus_read_unlock(); >> } >> @@ -1285,25 +1291,64 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus) >> return false; >> } >> >> +static void __update_isolation_cpumasks(bool twork); >> +static void isolation_task_work_fn(struct callback_head *cb) >> +{ >> + cpuset_full_lock(); >> + __update_isolation_cpumasks(true); >> + cpuset_full_lock(); >> +} >> + >> /* >> - * update_isolation_cpumasks - Update external isolation related CPU masks >> + * __update_isolation_cpumasks - Update external isolation related CPU masks >> + * @twork - set if call from isolation_task_work_fn() >> * >> * The following external CPU masks will be updated if necessary: >> * - workqueue unbound cpumask >> */ >> -static void update_isolation_cpumasks(void) >> +static void __update_isolation_cpumasks(bool twork) >> { >> int ret; >> >> + if (twork) >> + isolation_task_work_queued = false; >> + >> if (!isolated_cpus_updating) >> return; >> >> + /* >> + * This function can be reached either 
directly from regular cpuset >> + * control file write (cpuset_full_locked) or via hotplug >> + * (cpus_write_lock && cpuset_mutex held). In the later case, we >> + * defer the housekeeping_update() call to a task_work to avoid >> + * the possibility of deadlock. The task_work will be run right >> + * before exiting back to userspace. >> + */ >> + if (!cpuset_full_locked) { >> + static struct callback_head twork_cb; >> + >> + if (!isolation_task_work_queued) { >> + init_task_work(&twork_cb, isolation_task_work_fn); >> + if (!task_work_add(current, &twork_cb, TWA_RESUME)) >> + isolation_task_work_queued = true; >> + else >> + /* Current task shouldn't be exiting */ >> + WARN_ON_ONCE(1); >> + } >> + return; >> + } >> + >> ret = housekeeping_update(isolated_cpus); >> WARN_ON_ONCE(ret < 0); >> >> isolated_cpus_updating = false; >> } >> > > The logic is not straightforward; perhaps we can simplify it as follows, > maybe I missed something, just correct me. > > static void isolation_task_work_fn(struct callback_head *cb) > { > guard(mutex)(&isolcpus_update_mutex); > WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0); > } > > /* > * __update_isolation_cpumasks - Update external isolation related CPU masks > * @twork - set if call from isolation_task_work_fn() > * > * The following external CPU masks will be updated if necessary: > * - workqueue unbound cpumask > */ > static void __update_isolation_cpumasks(bool twork) > { > if (!isolated_cpus_updating) > return; > > /* > * This function can be reached either directly from regular cpuset > * control file write (cpuset_full_locked) or via hotplug > * (cpus_write_lock && cpuset_mutex held). In the later case, we > * defer the housekeeping_update() call to a task_work to avoid > * the possibility of deadlock. The task_work will be run right > * before exiting back to userspace. 
> */ > if (twork) { > static struct callback_head twork_cb; > > init_task_work(&twork_cb, isolation_task_work_fn); > if (task_work_add(current, &twork_cb, TWA_RESUME)) > /* Current task shouldn't be exiting */ > WARN_ON_ONCE(1); > > return; > } > > lockdep_assert_held(&isolcpus_update_mutex); > /* > * Release cpus_read_lock & cpuset_mutex before calling > * housekeeping_update() and re-acquiring them afterward if not > * calling from task_work. > */ > > cpuset_full_unlock(); > WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0); > cpuset_full_lock(); > > isolated_cpus_updating = false; > } > > static inline void update_isolation_cpumasks(void) > { > __update_isolation_cpumasks(false); > } > It can be much clearer: static void isolation_task_work_fn(struct callback_head *cb) { guard(mutex)(&isolcpus_update_mutex); WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0); } /* * __update_isolation_cpumasks - Update external isolation related CPU masks * @defer * * The following external CPU masks will be updated if necessary: * - workqueue unbound cpumask */ static void __update_isolation_cpumasks(bool defer) { if (!isolated_cpus_updating) return; /* * This function can be reached either directly from regular cpuset * control file write (cpuset_full_locked) or via hotplug * (cpus_write_lock && cpuset_mutex held). In the later case, we * defer the housekeeping_update() call to a task_work to avoid * the possibility of deadlock. The task_work will be run right * before exiting back to userspace. */ if (defer) { static struct callback_head twork_cb; init_task_work(&twork_cb, isolation_task_work_fn); if (task_work_add(current, &twork_cb, TWA_RESUME)) /* Current task shouldn't be exiting */ WARN_ON_ONCE(1); return; } lockdep_assert_held(&isolcpus_update_mutex); lockdep_assert_cpus_held(); lockdep_assert_cpuset_lock_held(); /* * Release cpus_read_lock & cpuset_mutex before calling * housekeeping_update() and re-acquiring them afterward if not * calling from task_work. 
*/ cpuset_full_unlock(); WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0); cpuset_full_lock(); isolated_cpus_updating = false; } static inline void update_isolation_cpumasks(void) { __update_isolation_cpumasks(false); } static inline void asyn_update_isolation_cpumasks(void) { __update_isolation_cpumasks(true); } The hotplug path just calls asyn_update_isolation_cpumasks(), cpuset_full_locked and isolation_task_work_queued can be removed. -- Best regards, Ridong ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH/for-next 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to task_work 2026-01-29 7:15 ` Chen Ridong @ 2026-01-30 1:37 ` Waiman Long 2026-01-30 1:39 ` Waiman Long 0 siblings, 1 reply; 19+ messages in thread From: Waiman Long @ 2026-01-30 1:37 UTC (permalink / raw) To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner, Shuah Khan Cc: cgroups, linux-kernel, linux-kselftest On 1/29/26 2:15 AM, Chen Ridong wrote: > > On 2026/1/29 12:03, Chen Ridong wrote: >> >> On 2026/1/28 12:42, Waiman Long wrote: >>> The update_isolation_cpumasks() function can be called either directly >>> from regular cpuset control file write with cpuset_full_lock() called >>> or via the CPU hotplug path with cpus_write_lock and cpuset_mutex held. >>> >>> As we are going to enable dynamic update to the nozh_full housekeeping >>> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug, >>> allowing the CPU hotplug path to call into housekeeping_update() >>> directly from update_isolation_cpumasks() will cause deadlock. So we >>> have to defer any call to housekeeping_update() after the CPU hotplug >>> operation has finished. This can be done via the task_work_add(..., >>> TWA_RESUME) API where the actual housekeeping_update() call, if needed, >>> will happen right before existing back to userspace. >>> >>> Since the HK_TYPE_DOMAIN housekeeping cpumask should now track the >>> changes in "cpuset.cpus.isolated", add a check in test_cpuset_prs.sh to >>> confirm that the CPU hotplug deferral, if needed, is working as expected. 
>>> >>> Signed-off-by: Waiman Long <longman@redhat.com> >>> --- >>> kernel/cgroup/cpuset.c | 49 ++++++++++++++++++- >>> .../selftests/cgroup/test_cpuset_prs.sh | 9 ++++ >>> 2 files changed, 56 insertions(+), 2 deletions(-) >>> >>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c >>> index 7b7d12ab1006..98c7cb732206 100644 >>> --- a/kernel/cgroup/cpuset.c >>> +++ b/kernel/cgroup/cpuset.c >>> @@ -84,6 +84,10 @@ static cpumask_var_t isolated_cpus; >>> */ >>> static bool isolated_cpus_updating; >>> >>> +/* Both cpuset_mutex and cpus_read_locked acquired */ >>> +static bool cpuset_full_locked; >>> +static bool isolation_task_work_queued; >>> + >>> /* >>> * A flag to force sched domain rebuild at the end of an operation. >>> * It can be set in >>> @@ -285,10 +289,12 @@ void cpuset_full_lock(void) >>> { >>> cpus_read_lock(); >>> mutex_lock(&cpuset_mutex); >>> + cpuset_full_locked = true; >>> } >>> >>> void cpuset_full_unlock(void) >>> { >>> + cpuset_full_locked = false; >>> mutex_unlock(&cpuset_mutex); >>> cpus_read_unlock(); >>> } >>> @@ -1285,25 +1291,64 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus) >>> return false; >>> } >>> >>> +static void __update_isolation_cpumasks(bool twork); >>> +static void isolation_task_work_fn(struct callback_head *cb) >>> +{ >>> + cpuset_full_lock(); >>> + __update_isolation_cpumasks(true); >>> + cpuset_full_lock(); >>> +} >>> + >>> /* >>> - * update_isolation_cpumasks - Update external isolation related CPU masks >>> + * __update_isolation_cpumasks - Update external isolation related CPU masks >>> + * @twork - set if call from isolation_task_work_fn() >>> * >>> * The following external CPU masks will be updated if necessary: >>> * - workqueue unbound cpumask >>> */ >>> -static void update_isolation_cpumasks(void) >>> +static void __update_isolation_cpumasks(bool twork) >>> { >>> int ret; >>> >>> + if (twork) >>> + isolation_task_work_queued = false; >>> + >>> if (!isolated_cpus_updating) 
>>> return; >>> >>> + /* >>> + * This function can be reached either directly from regular cpuset >>> + * control file write (cpuset_full_locked) or via hotplug >>> + * (cpus_write_lock && cpuset_mutex held). In the later case, we >>> + * defer the housekeeping_update() call to a task_work to avoid >>> + * the possibility of deadlock. The task_work will be run right >>> + * before exiting back to userspace. >>> + */ >>> + if (!cpuset_full_locked) { >>> + static struct callback_head twork_cb; >>> + >>> + if (!isolation_task_work_queued) { >>> + init_task_work(&twork_cb, isolation_task_work_fn); >>> + if (!task_work_add(current, &twork_cb, TWA_RESUME)) >>> + isolation_task_work_queued = true; >>> + else >>> + /* Current task shouldn't be exiting */ >>> + WARN_ON_ONCE(1); >>> + } >>> + return; >>> + } >>> + >>> ret = housekeeping_update(isolated_cpus); >>> WARN_ON_ONCE(ret < 0); >>> >>> isolated_cpus_updating = false; >>> } >>> >> The logic is not straightforward; perhaps we can simplify it as follows, >> maybe I missed something, just correct me. >> >> static void isolation_task_work_fn(struct callback_head *cb) >> { >> guard(mutex)(&isolcpus_update_mutex); >> WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0); >> } >> >> /* >> * __update_isolation_cpumasks - Update external isolation related CPU masks >> * @twork - set if call from isolation_task_work_fn() >> * >> * The following external CPU masks will be updated if necessary: >> * - workqueue unbound cpumask >> */ >> static void __update_isolation_cpumasks(bool twork) >> { >> if (!isolated_cpus_updating) >> return; >> >> /* >> * This function can be reached either directly from regular cpuset >> * control file write (cpuset_full_locked) or via hotplug >> * (cpus_write_lock && cpuset_mutex held). In the later case, we >> * defer the housekeeping_update() call to a task_work to avoid >> * the possibility of deadlock. The task_work will be run right >> * before exiting back to userspace. 
>> */ >> if (twork) { >> static struct callback_head twork_cb; >> >> init_task_work(&twork_cb, isolation_task_work_fn); >> if (task_work_add(current, &twork_cb, TWA_RESUME)) >> /* Current task shouldn't be exiting */ >> WARN_ON_ONCE(1); >> >> return; >> } >> >> lockdep_assert_held(&isolcpus_update_mutex); >> /* >> * Release cpus_read_lock & cpuset_mutex before calling >> * housekeeping_update() and re-acquiring them afterward if not >> * calling from task_work. >> */ >> >> cpuset_full_unlock(); >> WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0); >> cpuset_full_lock(); >> >> isolated_cpus_updating = false; >> } >> >> static inline void update_isolation_cpumasks(void) >> { >> __update_isolation_cpumasks(false); >> } >> > It can be much clearer: > > static void isolation_task_work_fn(struct callback_head *cb) > { > guard(mutex)(&isolcpus_update_mutex); > WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0); > } > > /* > * __update_isolation_cpumasks - Update external isolation related CPU masks > * @defer > * > * The following external CPU masks will be updated if necessary: > * - workqueue unbound cpumask > */ > static void __update_isolation_cpumasks(bool defer) > { > if (!isolated_cpus_updating) > return; > > /* > * This function can be reached either directly from regular cpuset > * control file write (cpuset_full_locked) or via hotplug > * (cpus_write_lock && cpuset_mutex held). In the later case, we > * defer the housekeeping_update() call to a task_work to avoid > * the possibility of deadlock. The task_work will be run right > * before exiting back to userspace. 
> 	 */
> 	if (defer) {
> 		static struct callback_head twork_cb;
>
> 		init_task_work(&twork_cb, isolation_task_work_fn);
> 		if (task_work_add(current, &twork_cb, TWA_RESUME))
> 			/* Current task shouldn't be exiting */
> 			WARN_ON_ONCE(1);
>
> 		return;
> 	}
>
> 	lockdep_assert_held(&isolcpus_update_mutex);
> 	lockdep_assert_cpus_held();
> 	lockdep_assert_cpuset_lock_held();
>
> 	/*
> 	 * Release cpus_read_lock & cpuset_mutex before calling
> 	 * housekeeping_update() and re-acquiring them afterward if not
> 	 * calling from task_work.
> 	 */
>
> 	cpuset_full_unlock();
> 	WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
> 	cpuset_full_lock();
>
> 	isolated_cpus_updating = false;
> }
>
> static inline void update_isolation_cpumasks(void)
> {
> 	__update_isolation_cpumasks(false);
> }
>
> static inline void asyn_update_isolation_cpumasks(void)
> {
> 	__update_isolation_cpumasks(true);
> }
>
> The hotplug path just calls asyn_update_isolation_cpumasks(), cpuset_full_locked
> and isolation_task_work_queued can be removed.
>
Thanks for the suggestions. I will adopt some of them. BTW, I am
switching to use workqueue instead of the task_work as the latter is
suitable in this use case.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [PATCH/for-next 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to task_work
  2026-01-30  1:37           ` Waiman Long
@ 2026-01-30  1:39             ` Waiman Long
  0 siblings, 0 replies; 19+ messages in thread
From: Waiman Long @ 2026-01-30  1:39 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest

On 1/29/26 8:37 PM, Waiman Long wrote:
>> The hotplug path just calls asyn_update_isolation_cpumasks(), cpuset_full_locked
>> and isolation_task_work_queued can be removed.
>>
> Thanks for the suggestions. I will adopt some of them. BTW, I am
> switching to use workqueue instead of the task_work as the latter is
> suitable in this use case.

Sorry, I mean "the latter is NOT suitable in this use case."

Cheers,
Longman

^ permalink raw reply	[flat|nested] 19+ messages in thread
* [PATCH/for-next 2/2] cgroup/cpuset: Introduce a new top level isolcpus_update_mutex
  2026-01-28  4:42 [PATCH/for-next 0/2] cgroup/cpuset: Fix partition related locking issues Waiman Long
  2026-01-28  4:42 ` [PATCH/for-next 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to task_work Waiman Long
@ 2026-01-28  4:42 ` Waiman Long
  2026-01-29  8:01   ` Chen Ridong
  1 sibling, 1 reply; 19+ messages in thread
From: Waiman Long @ 2026-01-28  4:42 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest, Waiman Long

The current cpuset partition code is able to dynamically update
the sched domains of a running system and the corresponding
HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
"isolcpus=domain,..." boot command line feature at run time.

The housekeeping cpumask update requires flushing a number of different
workqueues, which may not be safe with cpus_read_lock() held as the
workqueue flushing code may acquire cpus_read_lock() or acquire locks
which have a locking dependency with cpus_read_lock() down the chain.
Below is an example of such a circular locking problem.

======================================================
WARNING: possible circular locking dependency detected
6.18.0-test+ #2 Tainted: G S
------------------------------------------------------
test_cpuset_prs/10971 is trying to acquire lock:
ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180

but task is already holding lock:
ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130

which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:

-> #4 (cpuset_mutex){+.+.}-{4:4}:
-> #3 (cpu_hotplug_lock){++++}-{0:0}:
-> #2 (rtnl_mutex){+.+.}-{4:4}:
-> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
-> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:

Chain exists of:
  (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex

5 locks held by test_cpuset_prs/10971:
 #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
 #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
 #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
 #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
 #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130

Call Trace:
 <TASK>
 :
 touch_wq_lockdep_map+0x93/0x180
 __flush_workqueue+0x111/0x10b0
 housekeeping_update+0x12d/0x2d0
 update_parent_effective_cpumask+0x595/0x2440
 update_prstate+0x89d/0xce0
 cpuset_partition_write+0xc5/0x130
 cgroup_file_write+0x1a5/0x680
 kernfs_fop_write_iter+0x3df/0x5f0
 vfs_write+0x525/0xfd0
 ksys_write+0xf9/0x1d0
 do_syscall_64+0x95/0x520
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

To avoid such a circular locking dependency problem, we have to
call housekeeping_update() without holding the cpus_read_lock()
and cpuset_mutex. One way to do that is to introduce a new top level
isolcpus_update_mutex which will be acquired first if the set of isolated
CPUs may have to be updated. This new isolcpus_update_mutex will provide
the needed mutual exclusion without the need to hold cpus_read_lock().

As cpus_read_lock() is now no longer held when
tmigr_isolated_exclude_cpumask() is called, it needs to acquire it
directly.

The lockdep_is_cpuset_held() is also updated to check the new
isolcpus_update_mutex.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c        | 79 ++++++++++++++++++++++++-----------
 kernel/sched/isolation.c      |  4 +-
 kernel/time/timer_migration.c |  3 +-
 3 files changed, 57 insertions(+), 29 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 98c7cb732206..96390ceb5122 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -78,7 +78,7 @@ static cpumask_var_t subpartitions_cpus;
 static cpumask_var_t isolated_cpus;
 
 /*
- * isolated_cpus updating flag (protected by cpuset_mutex)
+ * isolated_cpus updating flag (protected by isolcpus_update_mutex)
  * Set if isolated_cpus is going to be updated in the current
  * cpuset_mutex crtical section.
  */
@@ -223,29 +223,46 @@ struct cpuset top_cpuset = {
 };
 
 /*
- * There are two global locks guarding cpuset structures - cpuset_mutex and
- * callback_lock. The cpuset code uses only cpuset_mutex. Other kernel
- * subsystems can use cpuset_lock()/cpuset_unlock() to prevent change to cpuset
- * structures. Note that cpuset_mutex needs to be a mutex as it is used in
- * paths that rely on priority inheritance (e.g. scheduler - on RT) for
- * correctness.
+ * CPUSET Locking Convention
+ * -------------------------
  *
- * A task must hold both locks to modify cpusets. If a task holds
- * cpuset_mutex, it blocks others, ensuring that it is the only task able to
- * also acquire callback_lock and be able to modify cpusets. It can perform
- * various checks on the cpuset structure first, knowing nothing will change.
- * It can also allocate memory while just holding cpuset_mutex. While it is
- * performing these checks, various callback routines can briefly acquire
- * callback_lock to query cpusets. Once it is ready to make the changes, it
- * takes callback_lock, blocking everyone else.
+ * Below are the three global locks guarding cpuset structures in lock
+ * acquisition order:
+ *  - isolcpus_update_mutex (optional)
+ *  - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
+ *  - cpuset_mutex
+ *  - callback_lock (raw spinlock)
  *
- * Calls to the kernel memory allocator can not be made while holding
- * callback_lock, as that would risk double tripping on callback_lock
- * from one of the callbacks into the cpuset code from within
- * __alloc_pages().
+ * The first isolcpus_update_mutex should only be held if the existing set of
+ * isolated CPUs (in isolated partition) or any of the partition states may be
+ * changed when some cpuset control files are being written into. Otherwise,
+ * it can be skipped. Holding isolcpus_update_mutex/cpus_read_lock or
+ * cpus_write_lock will ensure mutual exclusion of isolated_cpus update.
  *
- * If a task is only holding callback_lock, then it has read-only
- * access to cpusets.
+ * As cpuset will now indirectly flush a number of different workqueues in
+ * housekeeping_update() when the set of isolated CPUs is going to be changed,
+ * it may not be safe from the circular locking perspective to hold the
+ * cpus_read_lock. So cpuset_full_lock() will be released before calling
+ * housekeeping_update() and re-acquired afterward.
+ *
+ * A task must hold all the remaining three locks to modify externally visible
+ * or used fields of cpusets, though some of the internally used cpuset fields
+ * can be modified by holding cpu_hotplug_lock and cpuset_mutex only. If only
+ * reliable read access of the externally used fields are needed, a task can
+ * hold either cpuset_mutex or callback_lock.
+ *
+ * If a task holds cpu_hotplug_lock and cpuset_mutex, it blocks others,
+ * ensuring that it is the only task able to also acquire callback_lock and
+ * be able to modify cpusets. It can perform various checks on the cpuset
+ * structure first, knowing nothing will change. It can also allocate memory
+ * without holding callback_lock. While it is performing these checks, various
+ * callback routines can briefly acquire callback_lock to query cpusets. Once
+ * it is ready to make the changes, it takes callback_lock, blocking everyone
+ * else.
+ *
+ * Calls to the kernel memory allocator cannot be made while holding
+ * callback_lock which is a spinlock, as the memory allocator may sleep or
+ * call back into cpuset code and acquire callback_lock.
  *
  * Now, the task_struct fields mems_allowed and mempolicy may be changed
  * by other task, we use alloc_lock in the task_struct fields to protect
@@ -256,6 +273,7 @@ struct cpuset top_cpuset = {
  * cpumasks and nodemasks.
  */
 
+static DEFINE_MUTEX(isolcpus_update_mutex);
 static DEFINE_MUTEX(cpuset_mutex);
 
 /**
@@ -302,7 +320,7 @@ void cpuset_full_unlock(void)
 #ifdef CONFIG_LOCKDEP
 bool lockdep_is_cpuset_held(void)
 {
-	return lockdep_is_held(&cpuset_mutex);
+	return lockdep_is_held(&isolcpus_update_mutex);
 }
 #endif
 
@@ -1294,9 +1312,8 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
 static void __update_isolation_cpumasks(bool twork);
 static void isolation_task_work_fn(struct callback_head *cb)
 {
-	cpuset_full_lock();
+	guard(mutex)(&isolcpus_update_mutex);
 	__update_isolation_cpumasks(true);
-	cpuset_full_lock();
 }
 
 /*
@@ -1338,8 +1355,18 @@ static void __update_isolation_cpumasks(bool twork)
 		return;
 	}
 
+	lockdep_assert_held(&isolcpus_update_mutex);
+	/*
+	 * Release cpus_read_lock & cpuset_mutex before calling
+	 * housekeeping_update() and re-acquiring them afterward if not
+	 * calling from task_work.
+	 */
+	if (!twork)
+		cpuset_full_unlock();
 	ret = housekeeping_update(isolated_cpus);
 	WARN_ON_ONCE(ret < 0);
+	if (!twork)
+		cpuset_full_lock();
 
 	isolated_cpus_updating = false;
 }
@@ -3196,6 +3223,7 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 		return -EACCES;
 
 	buf = strstrip(buf);
+	mutex_lock(&isolcpus_update_mutex);
 	cpuset_full_lock();
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
@@ -3226,6 +3254,7 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 		rebuild_sched_domains_locked();
 out_unlock:
 	cpuset_full_unlock();
+	mutex_unlock(&isolcpus_update_mutex);
 	if (of_cft(of)->private == FILE_MEMLIST)
 		schedule_flush_migrate_mm();
 	return retval ?: nbytes;
@@ -3329,6 +3358,7 @@ static ssize_t cpuset_partition_write(struct kernfs_open_file *of, char *buf,
 	else
 		return -EINVAL;
 
+	guard(mutex)(&isolcpus_update_mutex);
 	cpuset_full_lock();
 	if (is_cpuset_online(cs))
 		retval = update_prstate(cs, val);
@@ -3502,6 +3532,7 @@ static void cpuset_css_killed(struct cgroup_subsys_state *css)
 {
 	struct cpuset *cs = css_cs(css);
 
+	guard(mutex)(&isolcpus_update_mutex);
 	cpuset_full_lock();
 	/* Reset valid partition back to member */
 	if (is_partition_valid(cs))
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 3b725d39c06e..ef152d401fe2 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -123,8 +123,6 @@ int housekeeping_update(struct cpumask *isol_mask)
 	struct cpumask *trial, *old = NULL;
 	int err;
 
-	lockdep_assert_cpus_held();
-
 	trial = kmalloc(cpumask_size(), GFP_KERNEL);
 	if (!trial)
 		return -ENOMEM;
@@ -136,7 +134,7 @@ int housekeeping_update(struct cpumask *isol_mask)
 	}
 
 	if (!housekeeping.flags)
-		static_branch_enable_cpuslocked(&housekeeping_overridden);
+		static_branch_enable(&housekeeping_overridden);
 
 	if (housekeeping.flags & HK_FLAG_DOMAIN)
 		old = housekeeping_cpumask_dereference(HK_TYPE_DOMAIN);
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 6da9cd562b20..244a8d025e78 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1559,8 +1559,6 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
 	cpumask_var_t cpumask __free(free_cpumask_var) = CPUMASK_VAR_NULL;
 	int cpu;
 
-	lockdep_assert_cpus_held();
-
 	if (!works)
 		return -ENOMEM;
 	if (!alloc_cpumask_var(&cpumask, GFP_KERNEL))
@@ -1570,6 +1568,7 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
 	 * First set previously isolated CPUs as available (unisolate).
	 * This cpumask contains only CPUs that switched to available now.
	 */
+	guard(cpus_read_lock)();
 	cpumask_andnot(cpumask, cpu_online_mask, exclude_cpumask);
 	cpumask_andnot(cpumask, cpumask, tmigr_available_cpumask);
-- 
2.52.0

^ permalink raw reply related	[flat|nested] 19+ messages in thread
* Re: [PATCH/for-next 2/2] cgroup/cpuset: Introduce a new top level isolcpus_update_mutex
  2026-01-28  4:42 ` [PATCH/for-next 2/2] cgroup/cpuset: Introduce a new top level isolcpus_update_mutex Waiman Long
@ 2026-01-29  8:01   ` Chen Ridong
  2026-01-29  8:20     ` Chen Ridong
  2026-01-29 21:16     ` Waiman Long
  0 siblings, 2 replies; 19+ messages in thread
From: Chen Ridong @ 2026-01-29  8:01 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest

On 2026/1/28 12:42, Waiman Long wrote:
> The current cpuset partition code is able to dynamically update
> the sched domains of a running system and the corresponding
> HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
> "isolcpus=domain,..." boot command line feature at run time.
>
> The housekeeping cpumask update requires flushing a number of different
> workqueues, which may not be safe with cpus_read_lock() held as the
> workqueue flushing code may acquire cpus_read_lock() or acquire locks
> which have a locking dependency with cpus_read_lock() down the chain.
> Below is an example of such a circular locking problem.
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 6.18.0-test+ #2 Tainted: G S
> ------------------------------------------------------
> test_cpuset_prs/10971 is trying to acquire lock:
> ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180
>
> but task is already holding lock:
> ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
> -> #4 (cpuset_mutex){+.+.}-{4:4}:
> -> #3 (cpu_hotplug_lock){++++}-{0:0}:
> -> #2 (rtnl_mutex){+.+.}-{4:4}:
> -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
> -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:
>
> Chain exists of:
>   (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex
>
> 5 locks held by test_cpuset_prs/10971:
>  #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
>  #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
>  #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
>  #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
>  #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
>
> Call Trace:
>  <TASK>
>  :
>  touch_wq_lockdep_map+0x93/0x180
>  __flush_workqueue+0x111/0x10b0
>  housekeeping_update+0x12d/0x2d0
>  update_parent_effective_cpumask+0x595/0x2440
>  update_prstate+0x89d/0xce0
>  cpuset_partition_write+0xc5/0x130
>  cgroup_file_write+0x1a5/0x680
>  kernfs_fop_write_iter+0x3df/0x5f0
>  vfs_write+0x525/0xfd0
>  ksys_write+0xf9/0x1d0
>  do_syscall_64+0x95/0x520
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> To avoid such a circular locking dependency problem, we have to
> call housekeeping_update() without holding the cpus_read_lock()
> and cpuset_mutex. One way to do that is to introduce a new top level
> isolcpus_update_mutex which will be acquired first if the set of isolated
> CPUs may have to be updated. This new isolcpus_update_mutex will provide
> the needed mutual exclusion without the need to hold cpus_read_lock().
>
> As cpus_read_lock() is now no longer held when
> tmigr_isolated_exclude_cpumask() is called, it needs to acquire it
> directly.
>
> The lockdep_is_cpuset_held() is also updated to check the new
> isolcpus_update_mutex.
>

I worry about the issue:

CPU1					CPU2
rmdir
css->ss->css_killed(css);
cpuset_css_killed
__update_isolation_cpumasks
cpuset_full_unlock
					css->flags |= CSS_DYING;
					css_clear_dir(css);
					...
					// offline and free do not
					// get isolcpus_update_mutex
					cpuset_css_offline
					cpuset_css_free
cpuset_full_lock
...
// UAF?

> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/cgroup/cpuset.c        | 79 ++++++++++++++++++++++++-----------
>  kernel/sched/isolation.c      |  4 +-
>  kernel/time/timer_migration.c |  3 +-
>  3 files changed, 57 insertions(+), 29 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 98c7cb732206..96390ceb5122 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -78,7 +78,7 @@ static cpumask_var_t subpartitions_cpus;
>  static cpumask_var_t isolated_cpus;
>
>  /*
> - * isolated_cpus updating flag (protected by cpuset_mutex)
> + * isolated_cpus updating flag (protected by isolcpus_update_mutex)
>   * Set if isolated_cpus is going to be updated in the current
>   * cpuset_mutex crtical section.
>   */
> @@ -223,29 +223,46 @@ struct cpuset top_cpuset = {
>  };
>
>  /*
> - * There are two global locks guarding cpuset structures - cpuset_mutex and
> - * callback_lock. The cpuset code uses only cpuset_mutex. Other kernel
> - * subsystems can use cpuset_lock()/cpuset_unlock() to prevent change to cpuset
> - * structures. Note that cpuset_mutex needs to be a mutex as it is used in
> - * paths that rely on priority inheritance (e.g. scheduler - on RT) for
> - * correctness.
> - *
> - * A task must hold both locks to modify cpusets. If a task holds
> - * cpuset_mutex, it blocks others, ensuring that it is the only task able to
> - * also acquire callback_lock and be able to modify cpusets. It can perform
> - * various checks on the cpuset structure first, knowing nothing will change.
> - * It can also allocate memory while just holding cpuset_mutex. While it is
> - * performing these checks, various callback routines can briefly acquire
> - * callback_lock to query cpusets. Once it is ready to make the changes, it
> - * takes callback_lock, blocking everyone else.
> + * Below are the three global locks guarding cpuset structures in lock
> + * acquisition order:
> + *  - isolcpus_update_mutex (optional)
> + *  - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
> + *  - cpuset_mutex
> + *  - callback_lock (raw spinlock)
>   *
> - * Calls to the kernel memory allocator can not be made while holding
> - * callback_lock, as that would risk double tripping on callback_lock
> - * from one of the callbacks into the cpuset code from within
> - * __alloc_pages().
> + * The first isolcpus_update_mutex should only be held if the existing set of
> + * isolated CPUs (in isolated partition) or any of the partition states may be
> + * changed when some cpuset control files are being written into. Otherwise,
> + * it can be skipped. Holding isolcpus_update_mutex/cpus_read_lock or
> + * cpus_write_lock will ensure mutual exclusion of isolated_cpus update.
>   *
> - * If a task is only holding callback_lock, then it has read-only
> - * access to cpusets.
> + * As cpuset will now indirectly flush a number of different workqueues in
> + * housekeeping_update() when the set of isolated CPUs is going to be changed,
> + * it may not be safe from the circular locking perspective to hold the
> + * cpus_read_lock. So cpuset_full_lock() will be released before calling
> + * housekeeping_update() and re-acquired afterward.
> + *
> + * A task must hold all the remaining three locks to modify externally visible
> + * or used fields of cpusets, though some of the internally used cpuset fields
> + * can be modified by holding cpu_hotplug_lock and cpuset_mutex only. If only
> + * reliable read access of the externally used fields are needed, a task can
> + * hold either cpuset_mutex or callback_lock.
> + *
> + * If a task holds cpu_hotplug_lock and cpuset_mutex, it blocks others,
> + * ensuring that it is the only task able to also acquire callback_lock and
> + * be able to modify cpusets. It can perform various checks on the cpuset
> + * structure first, knowing nothing will change. It can also allocate memory
> + * without holding callback_lock. While it is performing these checks, various
> + * callback routines can briefly acquire callback_lock to query cpusets. Once
> + * it is ready to make the changes, it takes callback_lock, blocking everyone
> + * else.
> + *
> + * Calls to the kernel memory allocator cannot be made while holding
> + * callback_lock which is a spinlock, as the memory allocator may sleep or
> + * call back into cpuset code and acquire callback_lock.
>   *
>   * Now, the task_struct fields mems_allowed and mempolicy may be changed
>   * by other task, we use alloc_lock in the task_struct fields to protect
> @@ -256,6 +273,7 @@ struct cpuset top_cpuset = {
>   * cpumasks and nodemasks.
>   */
>
> +static DEFINE_MUTEX(isolcpus_update_mutex);
>  static DEFINE_MUTEX(cpuset_mutex);
>
>  /**
> @@ -302,7 +320,7 @@ void cpuset_full_unlock(void)
>  #ifdef CONFIG_LOCKDEP
>  bool lockdep_is_cpuset_held(void)
>  {
> -	return lockdep_is_held(&cpuset_mutex);
> +	return lockdep_is_held(&isolcpus_update_mutex);
>  }
>  #endif
>
> @@ -1294,9 +1312,8 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
>  static void __update_isolation_cpumasks(bool twork);
>  static void isolation_task_work_fn(struct callback_head *cb)
>  {
> -	cpuset_full_lock();
> +	guard(mutex)(&isolcpus_update_mutex);
>  	__update_isolation_cpumasks(true);
> -	cpuset_full_lock();
>  }
>
>  /*
> @@ -1338,8 +1355,18 @@ static void __update_isolation_cpumasks(bool twork)
>  		return;
>  	}
>
> +	lockdep_assert_held(&isolcpus_update_mutex);
> +	/*
> +	 * Release cpus_read_lock & cpuset_mutex before calling
> +	 * housekeeping_update() and re-acquiring them afterward if not
> +	 * calling from task_work.
> +	 */
> +	if (!twork)
> +		cpuset_full_unlock();
>  	ret = housekeeping_update(isolated_cpus);
>  	WARN_ON_ONCE(ret < 0);
> +	if (!twork)
> +		cpuset_full_lock();
>
>  	isolated_cpus_updating = false;
>  }
> @@ -3196,6 +3223,7 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>  		return -EACCES;
>
>  	buf = strstrip(buf);
> +	mutex_lock(&isolcpus_update_mutex);
>  	cpuset_full_lock();
>  	if (!is_cpuset_online(cs))
>  		goto out_unlock;
> @@ -3226,6 +3254,7 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>  		rebuild_sched_domains_locked();
>  out_unlock:
>  	cpuset_full_unlock();
> +	mutex_unlock(&isolcpus_update_mutex);
>  	if (of_cft(of)->private == FILE_MEMLIST)
>  		schedule_flush_migrate_mm();
>  	return retval ?: nbytes;
> @@ -3329,6 +3358,7 @@ static ssize_t cpuset_partition_write(struct kernfs_open_file *of, char *buf,
>  	else
>  		return -EINVAL;
>
> +	guard(mutex)(&isolcpus_update_mutex);
>  	cpuset_full_lock();
>  	if (is_cpuset_online(cs))
>  		retval = update_prstate(cs, val);
> @@ -3502,6 +3532,7 @@ static void cpuset_css_killed(struct cgroup_subsys_state *css)
>  {
>  	struct cpuset *cs = css_cs(css);
>
> +	guard(mutex)(&isolcpus_update_mutex);
>  	cpuset_full_lock();
>  	/* Reset valid partition back to member */
>  	if (is_partition_valid(cs))
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 3b725d39c06e..ef152d401fe2 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -123,8 +123,6 @@ int housekeeping_update(struct cpumask *isol_mask)
>  	struct cpumask *trial, *old = NULL;
>  	int err;
>
> -	lockdep_assert_cpus_held();
> -
>  	trial = kmalloc(cpumask_size(), GFP_KERNEL);
>  	if (!trial)
>  		return -ENOMEM;
> @@ -136,7 +134,7 @@ int housekeeping_update(struct cpumask *isol_mask)
>  	}
>
>  	if (!housekeeping.flags)
> -		static_branch_enable_cpuslocked(&housekeeping_overridden);
> +		static_branch_enable(&housekeeping_overridden);
>
>  	if (housekeeping.flags & HK_FLAG_DOMAIN)
>  		old = housekeeping_cpumask_dereference(HK_TYPE_DOMAIN);
> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> index 6da9cd562b20..244a8d025e78 100644
> --- a/kernel/time/timer_migration.c
> +++ b/kernel/time/timer_migration.c
> @@ -1559,8 +1559,6 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
>  	cpumask_var_t cpumask __free(free_cpumask_var) = CPUMASK_VAR_NULL;
>  	int cpu;
>
> -	lockdep_assert_cpus_held();
> -
>  	if (!works)
>  		return -ENOMEM;
>  	if (!alloc_cpumask_var(&cpumask, GFP_KERNEL))
> @@ -1570,6 +1568,7 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
>  	 * First set previously isolated CPUs as available (unisolate).
>  	 * This cpumask contains only CPUs that switched to available now.
>  	 */
> +	guard(cpus_read_lock)();
>  	cpumask_andnot(cpumask, cpu_online_mask, exclude_cpumask);
>  	cpumask_andnot(cpumask, cpumask, tmigr_available_cpumask);
> --

Best regards,
Ridong

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [PATCH/for-next 2/2] cgroup/cpuset: Introduce a new top level isolcpus_update_mutex
  2026-01-29  8:01 ` Chen Ridong
@ 2026-01-29  8:20   ` Chen Ridong
  2026-01-29 20:57     ` Waiman Long
  0 siblings, 1 reply; 19+ messages in thread
From: Chen Ridong @ 2026-01-29  8:20 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner,
	Shuah Khan
  Cc: cgroups, linux-kernel, linux-kselftest

On 2026/1/29 16:01, Chen Ridong wrote:
>
>
> On 2026/1/28 12:42, Waiman Long wrote:
>> The current cpuset partition code is able to dynamically update
>> the sched domains of a running system and the corresponding
>> HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
>> "isolcpus=domain,..." boot command line feature at run time.
>>
>> The housekeeping cpumask update requires flushing a number of different
>> workqueues, which may not be safe with cpus_read_lock() held as the
>> workqueue flushing code may acquire cpus_read_lock() or acquire locks
>> which have a locking dependency with cpus_read_lock() down the chain.
>> Below is an example of such a circular locking problem.
>>
>> ======================================================
>> WARNING: possible circular locking dependency detected
>> 6.18.0-test+ #2 Tainted: G S
>> ------------------------------------------------------
>> test_cpuset_prs/10971 is trying to acquire lock:
>> ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180
>>
>> but task is already holding lock:
>> ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
>>
>> which lock already depends on the new lock.
>>
>> the existing dependency chain (in reverse order) is:
>> -> #4 (cpuset_mutex){+.+.}-{4:4}:
>> -> #3 (cpu_hotplug_lock){++++}-{0:0}:
>> -> #2 (rtnl_mutex){+.+.}-{4:4}:
>> -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
>> -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:
>>
>> Chain exists of:
>>   (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex
>>
>> 5 locks held by test_cpuset_prs/10971:
>>  #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
>>  #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
>>  #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
>>  #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
>>  #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130
>>
>> Call Trace:
>>  <TASK>
>>  :
>>  touch_wq_lockdep_map+0x93/0x180
>>  __flush_workqueue+0x111/0x10b0
>>  housekeeping_update+0x12d/0x2d0
>>  update_parent_effective_cpumask+0x595/0x2440
>>  update_prstate+0x89d/0xce0
>>  cpuset_partition_write+0xc5/0x130
>>  cgroup_file_write+0x1a5/0x680
>>  kernfs_fop_write_iter+0x3df/0x5f0
>>  vfs_write+0x525/0xfd0
>>  ksys_write+0xf9/0x1d0
>>  do_syscall_64+0x95/0x520
>>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> To avoid such a circular locking dependency problem, we have to
>> call housekeeping_update() without holding the cpus_read_lock()
>> and cpuset_mutex. One way to do that is to introduce a new top level
>> isolcpus_update_mutex which will be acquired first if the set of isolated
>> CPUs may have to be updated. This new isolcpus_update_mutex will provide
>> the needed mutual exclusion without the need to hold cpus_read_lock().
>>

When I reviewed Frederic's patches, I was concerned about this issue.
However, I was not certain whether any flush worker would need to acquire
cpu_hotplug_lock or cpuset_mutex. Despite this warning, I do not understand
how wq_completion would need to acquire cpu_hotplug_lock and cpuset_mutex.

The reason I want to understand how wq_completion acquires cpu_hotplug_lock
or cpuset_mutex is to determine whether isolcpus_update_mutex is truly
necessary. As I mentioned in my previous email, I am concerned about a
potential use-after-free (UAF) issue, which might imply that
isolcpus_update_mutex is required in most places that currently acquire
cpuset_mutex, with the possible exception of the hotplug path?

>> As cpus_read_lock() is now no longer held when
>> tmigr_isolated_exclude_cpumask() is called, it needs to acquire it
>> directly.
>>
>> The lockdep_is_cpuset_held() is also updated to check the new
>> isolcpus_update_mutex.
>>
>
> I worry about the issue:
>
> CPU1					CPU2
> rmdir
> css->ss->css_killed(css);
> cpuset_css_killed
> __update_isolation_cpumasks
> cpuset_full_unlock
> 					css->flags |= CSS_DYING;
> 					css_clear_dir(css);
> 					...
> 					// offline and free do not
> 					// get isolcpus_update_mutex
> 					cpuset_css_offline
> 					cpuset_css_free
> cpuset_full_lock
> ...
> // UAF?
>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  kernel/cgroup/cpuset.c        | 79 ++++++++++++++++++++++++-----------
>>  kernel/sched/isolation.c      |  4 +-
>>  kernel/time/timer_migration.c |  3 +-
>>  3 files changed, 57 insertions(+), 29 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 98c7cb732206..96390ceb5122 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -78,7 +78,7 @@ static cpumask_var_t subpartitions_cpus;
>>  static cpumask_var_t isolated_cpus;
>>
>>  /*
>> - * isolated_cpus updating flag (protected by cpuset_mutex)
>> + * isolated_cpus updating flag (protected by isolcpus_update_mutex)
>>   * Set if isolated_cpus is going to be updated in the current
>>   * cpuset_mutex crtical section.
>> */ >> @@ -223,29 +223,46 @@ struct cpuset top_cpuset = { >> }; >> >> /* >> - * There are two global locks guarding cpuset structures - cpuset_mutex and >> - * callback_lock. The cpuset code uses only cpuset_mutex. Other kernel >> - * subsystems can use cpuset_lock()/cpuset_unlock() to prevent change to cpuset >> - * structures. Note that cpuset_mutex needs to be a mutex as it is used in >> - * paths that rely on priority inheritance (e.g. scheduler - on RT) for >> - * correctness. >> + * CPUSET Locking Convention >> + * ------------------------- >> * >> - * A task must hold both locks to modify cpusets. If a task holds >> - * cpuset_mutex, it blocks others, ensuring that it is the only task able to >> - * also acquire callback_lock and be able to modify cpusets. It can perform >> - * various checks on the cpuset structure first, knowing nothing will change. >> - * It can also allocate memory while just holding cpuset_mutex. While it is >> - * performing these checks, various callback routines can briefly acquire >> - * callback_lock to query cpusets. Once it is ready to make the changes, it >> - * takes callback_lock, blocking everyone else. >> + * Below are the three global locks guarding cpuset structures in lock >> + * acquisition order: >> + * - isolcpus_update_mutex (optional) >> + * - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock) >> + * - cpuset_mutex >> + * - callback_lock (raw spinlock) >> * >> - * Calls to the kernel memory allocator can not be made while holding >> - * callback_lock, as that would risk double tripping on callback_lock >> - * from one of the callbacks into the cpuset code from within >> - * __alloc_pages(). >> + * The first isolcpus_update_mutex should only be held if the existing set of >> + * isolated CPUs (in isolated partition) or any of the partition states may be >> + * changed when some cpuset control files are being written into. Otherwise, >> + * it can be skipped. 
Holding isolcpus_update_mutex/cpus_read_lock or >> + * cpus_write_lock will ensure mutual exclusion of isolated_cpus update. >> * >> - * If a task is only holding callback_lock, then it has read-only >> - * access to cpusets. >> + * As cpuset will now indirectly flush a number of different workqueues in >> + * housekeeping_update() when the set of isolated CPUs is going to be changed, >> + * it may not be safe from the circular locking perspective to hold the >> + * cpus_read_lock. So cpuset_full_lock() will be released before calling >> + * housekeeping_update() and re-acquired afterward. >> + * >> + * A task must hold all the remaining three locks to modify externally visible >> + * or used fields of cpusets, though some of the internally used cpuset fields >> + * can be modified by holding cpu_hotplug_lock and cpuset_mutex only. If only >> + * reliable read access of the externally used fields are needed, a task can >> + * hold either cpuset_mutex or callback_lock. >> + * >> + * If a task holds cpu_hotplug_lock and cpuset_mutex, it blocks others, >> + * ensuring that it is the only task able to also acquire callback_lock and >> + * be able to modify cpusets. It can perform various checks on the cpuset >> + * structure first, knowing nothing will change. It can also allocate memory >> + * without holding callback_lock. While it is performing these checks, various >> + * callback routines can briefly acquire callback_lock to query cpusets. Once >> + * it is ready to make the changes, it takes callback_lock, blocking everyone >> + * else. >> + * >> + * Calls to the kernel memory allocator cannot be made while holding >> + * callback_lock which is a spinlock, as the memory allocator may sleep or >> + * call back into cpuset code and acquire callback_lock. 
>> * >> * Now, the task_struct fields mems_allowed and mempolicy may be changed >> * by other task, we use alloc_lock in the task_struct fields to protect >> @@ -256,6 +273,7 @@ struct cpuset top_cpuset = { >> * cpumasks and nodemasks. >> */ >> >> +static DEFINE_MUTEX(isolcpus_update_mutex); >> static DEFINE_MUTEX(cpuset_mutex); >> >> /** >> @@ -302,7 +320,7 @@ void cpuset_full_unlock(void) >> #ifdef CONFIG_LOCKDEP >> bool lockdep_is_cpuset_held(void) >> { >> - return lockdep_is_held(&cpuset_mutex); >> + return lockdep_is_held(&isolcpus_update_mutex); >> } >> #endif >> >> @@ -1294,9 +1312,8 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus) >> static void __update_isolation_cpumasks(bool twork); >> static void isolation_task_work_fn(struct callback_head *cb) >> { >> - cpuset_full_lock(); >> + guard(mutex)(&isolcpus_update_mutex); >> __update_isolation_cpumasks(true); >> - cpuset_full_lock(); >> } >> >> /* >> @@ -1338,8 +1355,18 @@ static void __update_isolation_cpumasks(bool twork) >> return; >> } >> >> + lockdep_assert_held(&isolcpus_update_mutex); >> + /* >> + * Release cpus_read_lock & cpuset_mutex before calling >> + * housekeeping_update() and re-acquiring them afterward if not >> + * calling from task_work. 
>> + */ >> + if (!twork) >> + cpuset_full_unlock(); >> ret = housekeeping_update(isolated_cpus); >> WARN_ON_ONCE(ret < 0); >> + if (!twork) >> + cpuset_full_lock(); >> >> isolated_cpus_updating = false; >> } >> @@ -3196,6 +3223,7 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of, >> return -EACCES; >> >> buf = strstrip(buf); >> + mutex_lock(&isolcpus_update_mutex); >> cpuset_full_lock(); >> if (!is_cpuset_online(cs)) >> goto out_unlock; >> @@ -3226,6 +3254,7 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of, >> rebuild_sched_domains_locked(); >> out_unlock: >> cpuset_full_unlock(); >> + mutex_unlock(&isolcpus_update_mutex); >> if (of_cft(of)->private == FILE_MEMLIST) >> schedule_flush_migrate_mm(); >> return retval ?: nbytes; >> @@ -3329,6 +3358,7 @@ static ssize_t cpuset_partition_write(struct kernfs_open_file *of, char *buf, >> else >> return -EINVAL; >> >> + guard(mutex)(&isolcpus_update_mutex); >> cpuset_full_lock(); >> if (is_cpuset_online(cs)) >> retval = update_prstate(cs, val); >> @@ -3502,6 +3532,7 @@ static void cpuset_css_killed(struct cgroup_subsys_state *css) >> { >> struct cpuset *cs = css_cs(css); >> >> + guard(mutex)(&isolcpus_update_mutex); >> cpuset_full_lock(); >> /* Reset valid partition back to member */ >> if (is_partition_valid(cs)) >> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c >> index 3b725d39c06e..ef152d401fe2 100644 >> --- a/kernel/sched/isolation.c >> +++ b/kernel/sched/isolation.c >> @@ -123,8 +123,6 @@ int housekeeping_update(struct cpumask *isol_mask) >> struct cpumask *trial, *old = NULL; >> int err; >> >> - lockdep_assert_cpus_held(); >> - >> trial = kmalloc(cpumask_size(), GFP_KERNEL); >> if (!trial) >> return -ENOMEM; >> @@ -136,7 +134,7 @@ int housekeeping_update(struct cpumask *isol_mask) >> } >> >> if (!housekeeping.flags) >> - static_branch_enable_cpuslocked(&housekeeping_overridden); >> + static_branch_enable(&housekeeping_overridden); >> >> if (housekeeping.flags & HK_FLAG_DOMAIN) >> 
old = housekeeping_cpumask_dereference(HK_TYPE_DOMAIN); >> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c >> index 6da9cd562b20..244a8d025e78 100644 >> --- a/kernel/time/timer_migration.c >> +++ b/kernel/time/timer_migration.c >> @@ -1559,8 +1559,6 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask) >> cpumask_var_t cpumask __free(free_cpumask_var) = CPUMASK_VAR_NULL; >> int cpu; >> >> - lockdep_assert_cpus_held(); >> - >> if (!works) >> return -ENOMEM; >> if (!alloc_cpumask_var(&cpumask, GFP_KERNEL)) >> @@ -1570,6 +1568,7 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask) >> * First set previously isolated CPUs as available (unisolate). >> * This cpumask contains only CPUs that switched to available now. >> */ >> + guard(cpus_read_lock)(); >> cpumask_andnot(cpumask, cpu_online_mask, exclude_cpumask); >> cpumask_andnot(cpumask, cpumask, tmigr_available_cpumask); >> > -- Best regards, Ridong ^ permalink raw reply [flat|nested] 19+ messages in thread
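The core of the patch quoted above is the drop-and-reacquire pattern in __update_isolation_cpumasks(): the new top-level isolcpus_update_mutex stays held across the whole update, while cpus_read_lock and cpuset_mutex (cpuset_full_lock()) are released around the workqueue-flushing housekeeping_update() call. The following userspace toy model sketches that invariant under stated assumptions: the "locks" are plain counters, housekeeping_update() is a stub, and all function names merely mirror the kernel ones.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock state: not real locks, just held/not-held counters. */
static int isolcpus_update_held;	/* isolcpus_update_mutex */
static int full_lock_held;		/* cpus_read_lock + cpuset_mutex */

static void cpuset_full_lock(void)   { full_lock_held++; }
static void cpuset_full_unlock(void) { full_lock_held--; }

/* Stand-in for housekeeping_update(): the flush must not run under the
 * inner locks, but exclusion is still guaranteed by the top mutex. */
static void housekeeping_update_stub(void)
{
	assert(isolcpus_update_held);	/* still serialized */
	assert(full_lock_held == 0);	/* inner locks dropped */
}

static void update_isolation_cpumasks(bool twork)
{
	assert(isolcpus_update_held);	/* lockdep_assert_held() analogue */
	if (!twork)
		cpuset_full_unlock();	/* drop before the flush... */
	housekeeping_update_stub();
	if (!twork)
		cpuset_full_lock();	/* ...re-acquire afterward */
}

/* Control-file write path: top mutex first, then the full lock. */
static void cpuset_write_path(void)
{
	isolcpus_update_held = 1;
	cpuset_full_lock();
	update_isolation_cpumasks(false);
	assert(full_lock_held == 1);	/* restored on return */
	cpuset_full_unlock();
	isolcpus_update_held = 0;
}

/* Task-work path: only the top mutex, as in isolation_task_work_fn(). */
static void task_work_path(void)
{
	isolcpus_update_held = 1;
	update_isolation_cpumasks(true);
	isolcpus_update_held = 0;
}
```

The asserts encode the two guarantees being claimed: housekeeping_update() never runs while the inner locks are held, yet every caller is still serialized by the top-level mutex.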
* Re: [PATCH/for-next 2/2] cgroup/cpuset: Introduce a new top level isolcpus_update_mutex 2026-01-29 8:20 ` Chen Ridong @ 2026-01-29 20:57 ` Waiman Long 2026-01-30 1:16 ` Chen Ridong 0 siblings, 1 reply; 19+ messages in thread From: Waiman Long @ 2026-01-29 20:57 UTC (permalink / raw) To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner, Shuah Khan Cc: cgroups, linux-kernel, linux-kselftest On 1/29/26 3:20 AM, Chen Ridong wrote: > > On 2026/1/29 16:01, Chen Ridong wrote: >> >> On 2026/1/28 12:42, Waiman Long wrote: >>> The current cpuset partition code is able to dynamically update >>> the sched domains of a running system and the corresponding >>> HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentally the >>> "isolcpus=domain,..." boot command line feature at run time. >>> >>> The housekeeping cpumask update requires flushing a number of different >>> workqueues which may not be safe with cpus_read_lock() held as the >>> workqueue flushing code may acquire cpus_read_lock() or acquiring locks >>> which have locking dependency with cpus_read_lock() down the chain. Below >>> is an example of such circular locking problem. >>> >>> ====================================================== >>> WARNING: possible circular locking dependency detected >>> 6.18.0-test+ #2 Tainted: G S >>> ------------------------------------------------------ >>> test_cpuset_prs/10971 is trying to acquire lock: >>> ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180 >>> >>> but task is already holding lock: >>> ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130 >>> >>> which lock already depends on the new lock. 
>>> >>> the existing dependency chain (in reverse order) is: >>> -> #4 (cpuset_mutex){+.+.}-{4:4}: >>> -> #3 (cpu_hotplug_lock){++++}-{0:0}: >>> -> #2 (rtnl_mutex){+.+.}-{4:4}: >>> -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}: >>> -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}: >>> >>> Chain exists of: >>> (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex >>> >>> 5 locks held by test_cpuset_prs/10971: >>> #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0 >>> #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0 >>> #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0 >>> #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130 >>> #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130 >>> >>> Call Trace: >>> <TASK> >>> : >>> touch_wq_lockdep_map+0x93/0x180 >>> __flush_workqueue+0x111/0x10b0 >>> housekeeping_update+0x12d/0x2d0 >>> update_parent_effective_cpumask+0x595/0x2440 >>> update_prstate+0x89d/0xce0 >>> cpuset_partition_write+0xc5/0x130 >>> cgroup_file_write+0x1a5/0x680 >>> kernfs_fop_write_iter+0x3df/0x5f0 >>> vfs_write+0x525/0xfd0 >>> ksys_write+0xf9/0x1d0 >>> do_syscall_64+0x95/0x520 >>> entry_SYSCALL_64_after_hwframe+0x76/0x7e >>> >>> To avoid such a circular locking dependency problem, we have to >>> call housekeeping_update() without holding the cpus_read_lock() >>> and cpuset_mutex. One way to do that is to introduce a new top level >>> isolcpus_update_mutex which will be acquired first if the set of isolated >>> CPUs may have to be updated. This new isolcpus_update_mutex will provide >>> the need mutual exclusion without the need to hold cpus_read_lock(). >>> > When I reviewed Frederic's patches, I concerned about this issue. However, I was > not certain whether any flush worker would need to acquire cpu_hotplug_lock or > cpuset_mutex. 
>
> Despite this warning, I do not understand how wq_completion would need to
> acquire cpu_hotplug_lock and cpuset_mutex.
>
> The reason I want to understand how wq_completion acquires cpu_hotplug_lock or
> cpuset_mutex is to determine whether isolcpus_update_mutex is truly necessary.
> As I mentioned in my previous email, I am concerned about a potential
> use-after-free (UAF) issue, which might imply that isolcpus_update_mutex is
> required in most places that currently acquire cpuset_mutex, with the possible
> exception of the hotplug path?

A circular lock dependency can involve more than two tasks. In this case, the
task holding wq_completion does not need to acquire cpu_hotplug_lock itself.
If one of the work items that the flush is waiting on takes cpu_hotplug_lock
(read) while another task is already queued for cpus_write_lock, that work
item will block behind the pending writer. The writer, in turn, cannot proceed
until the original task that called flush_workqueue() releases its read lock,
which it will not do until the flush completes. In essence, it is a deadlock.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 19+ messages in thread
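What lockdep reports above is a cycle in a lock dependency graph assembled from edges recorded by independent tasks over time, which is why no single worker function has to take cpu_hotplug_lock or cpuset_mutex itself. A toy sketch of that check, using the five lock classes from the splat (the adjacency matrix and names are illustrative, not lockdep's real data structures):

```c
#include <assert.h>
#include <string.h>

/* "A was held while taking (or waiting on) B" becomes edge A -> B. */
enum { CPUSET_MUTEX, CPU_HOTPLUG_LOCK, RTNL_MUTEX, WORK_COMPLETION,
       WQ_COMPLETION, NLOCKS };

static int edge[NLOCKS][NLOCKS];

static void held_while_taking(int a, int b) { edge[a][b] = 1; }

static int reachable(int from, int to, int *seen)
{
	if (from == to)
		return 1;
	if (seen[from]++)
		return 0;
	for (int next = 0; next < NLOCKS; next++)
		if (edge[from][next] && reachable(next, to, seen))
			return 1;
	return 0;
}

/* Would taking 'want' while holding 'held' close a cycle? */
static int creates_cycle(int held, int want)
{
	int seen[NLOCKS];

	memset(seen, 0, sizeof(seen));
	return reachable(want, held, seen);
}

/* Edges from the splat above, each recorded by a different context. */
static void record_splat_edges(void)
{
	held_while_taking(WQ_COMPLETION, WORK_COMPLETION);  /* flush waits on a work */
	held_while_taking(WORK_COMPLETION, RTNL_MUTEX);     /* that work took rtnl   */
	held_while_taking(RTNL_MUTEX, CPU_HOTPLUG_LOCK);
	held_while_taking(CPU_HOTPLUG_LOCK, CPUSET_MUTEX);
}
```

With those edges in place, flushing sync_wq while holding cpuset_mutex (the new edge cpuset_mutex -> wq_completion) is what closes the cycle; once the flush no longer happens under cpuset_mutex, that edge is never recorded and the graph stays acyclic.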
* Re: [PATCH/for-next 2/2] cgroup/cpuset: Introduce a new top level isolcpus_update_mutex 2026-01-29 20:57 ` Waiman Long @ 2026-01-30 1:16 ` Chen Ridong 0 siblings, 0 replies; 19+ messages in thread From: Chen Ridong @ 2026-01-30 1:16 UTC (permalink / raw) To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner, Shuah Khan Cc: cgroups, linux-kernel, linux-kselftest On 2026/1/30 4:57, Waiman Long wrote: > On 1/29/26 3:20 AM, Chen Ridong wrote: >> >> On 2026/1/29 16:01, Chen Ridong wrote: >>> >>> On 2026/1/28 12:42, Waiman Long wrote: >>>> The current cpuset partition code is able to dynamically update >>>> the sched domains of a running system and the corresponding >>>> HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentally the >>>> "isolcpus=domain,..." boot command line feature at run time. >>>> >>>> The housekeeping cpumask update requires flushing a number of different >>>> workqueues which may not be safe with cpus_read_lock() held as the >>>> workqueue flushing code may acquire cpus_read_lock() or acquiring locks >>>> which have locking dependency with cpus_read_lock() down the chain. Below >>>> is an example of such circular locking problem. >>>> >>>> ====================================================== >>>> WARNING: possible circular locking dependency detected >>>> 6.18.0-test+ #2 Tainted: G S >>>> ------------------------------------------------------ >>>> test_cpuset_prs/10971 is trying to acquire lock: >>>> ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: >>>> touch_wq_lockdep_map+0x7a/0x180 >>>> >>>> but task is already holding lock: >>>> ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: >>>> cpuset_partition_write+0x85/0x130 >>>> >>>> which lock already depends on the new lock. 
>>>> >>>> the existing dependency chain (in reverse order) is: >>>> -> #4 (cpuset_mutex){+.+.}-{4:4}: >>>> -> #3 (cpu_hotplug_lock){++++}-{0:0}: >>>> -> #2 (rtnl_mutex){+.+.}-{4:4}: >>>> -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}: >>>> -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}: >>>> >>>> Chain exists of: >>>> (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex >>>> >>>> 5 locks held by test_cpuset_prs/10971: >>>> #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0 >>>> #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: >>>> kernfs_fop_write_iter+0x260/0x5f0 >>>> #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: >>>> kernfs_fop_write_iter+0x2b6/0x5f0 >>>> #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: >>>> cpuset_partition_write+0x77/0x130 >>>> #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: >>>> cpuset_partition_write+0x85/0x130 >>>> >>>> Call Trace: >>>> <TASK> >>>> : >>>> touch_wq_lockdep_map+0x93/0x180 >>>> __flush_workqueue+0x111/0x10b0 >>>> housekeeping_update+0x12d/0x2d0 >>>> update_parent_effective_cpumask+0x595/0x2440 >>>> update_prstate+0x89d/0xce0 >>>> cpuset_partition_write+0xc5/0x130 >>>> cgroup_file_write+0x1a5/0x680 >>>> kernfs_fop_write_iter+0x3df/0x5f0 >>>> vfs_write+0x525/0xfd0 >>>> ksys_write+0xf9/0x1d0 >>>> do_syscall_64+0x95/0x520 >>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e >>>> >>>> To avoid such a circular locking dependency problem, we have to >>>> call housekeeping_update() without holding the cpus_read_lock() >>>> and cpuset_mutex. One way to do that is to introduce a new top level >>>> isolcpus_update_mutex which will be acquired first if the set of isolated >>>> CPUs may have to be updated. This new isolcpus_update_mutex will provide >>>> the need mutual exclusion without the need to hold cpus_read_lock(). >>>> >> When I reviewed Frederic's patches, I concerned about this issue. 
However, I was >> not certain whether any flush worker would need to acquire cpu_hotplug_lock or >> cpuset_mutex. >> >> Despite this warning, I do not understand how wq_completion would need to >> acquire cpu_hotplug_lock and cpuset_mutex. >> >> The reason I want to understand how wq_completion acquires cpu_hotplug_lock or >> cpuset_mutex is to determine whether isolcpus_update_mutex is truly necessary. >> As I mentioned in my previous email, I am concerned about a potential >> use-after-free (UAF) issue, which might imply that isolcpus_update_mutex is >> required in most places that currently acquire cpuset_mutex, with the possible >> exception of the hotplug path? > > A circular lock dependency can invoke more than 2 tasks/parties. In this case, > the task that hold wq_completion does not need to acquire cpu_hotplug_lock. If a > worker that flushes a work function required for the completion to finish and it > happens to acquire cpu_hotplug_lock with another task trying to acquire > cpus_write_lock in the interim, the worker will wait there for the write lock to > be released which will not happen until the original task that calls > flush_workqueue() release its read lock. In essence, it is a deadlock. > Thanks, Longman, I looked through the relevant workers: pci_probe_flush_workqueue() mem_cgroup_flush_workqueue() vmstat_flush_workqueue() However, I still haven’t found any worker function that actually acquires cpu_hotplug_lock or cpuset_mutex. Perhaps I missed something. -- Best regards, Ridong ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH/for-next 2/2] cgroup/cpuset: Introduce a new top level isolcpus_update_mutex 2026-01-29 8:01 ` Chen Ridong 2026-01-29 8:20 ` Chen Ridong @ 2026-01-29 21:16 ` Waiman Long 2026-01-30 0:56 ` Chen Ridong 1 sibling, 1 reply; 19+ messages in thread From: Waiman Long @ 2026-01-29 21:16 UTC (permalink / raw) To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner, Shuah Khan Cc: cgroups, linux-kernel, linux-kselftest On 1/29/26 3:01 AM, Chen Ridong wrote: > > On 2026/1/28 12:42, Waiman Long wrote: >> The current cpuset partition code is able to dynamically update >> the sched domains of a running system and the corresponding >> HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentally the >> "isolcpus=domain,..." boot command line feature at run time. >> >> The housekeeping cpumask update requires flushing a number of different >> workqueues which may not be safe with cpus_read_lock() held as the >> workqueue flushing code may acquire cpus_read_lock() or acquiring locks >> which have locking dependency with cpus_read_lock() down the chain. Below >> is an example of such circular locking problem. >> >> ====================================================== >> WARNING: possible circular locking dependency detected >> 6.18.0-test+ #2 Tainted: G S >> ------------------------------------------------------ >> test_cpuset_prs/10971 is trying to acquire lock: >> ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180 >> >> but task is already holding lock: >> ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130 >> >> which lock already depends on the new lock. 
>> >> the existing dependency chain (in reverse order) is: >> -> #4 (cpuset_mutex){+.+.}-{4:4}: >> -> #3 (cpu_hotplug_lock){++++}-{0:0}: >> -> #2 (rtnl_mutex){+.+.}-{4:4}: >> -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}: >> -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}: >> >> Chain exists of: >> (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex >> >> 5 locks held by test_cpuset_prs/10971: >> #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0 >> #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0 >> #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0 >> #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130 >> #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130 >> >> Call Trace: >> <TASK> >> : >> touch_wq_lockdep_map+0x93/0x180 >> __flush_workqueue+0x111/0x10b0 >> housekeeping_update+0x12d/0x2d0 >> update_parent_effective_cpumask+0x595/0x2440 >> update_prstate+0x89d/0xce0 >> cpuset_partition_write+0xc5/0x130 >> cgroup_file_write+0x1a5/0x680 >> kernfs_fop_write_iter+0x3df/0x5f0 >> vfs_write+0x525/0xfd0 >> ksys_write+0xf9/0x1d0 >> do_syscall_64+0x95/0x520 >> entry_SYSCALL_64_after_hwframe+0x76/0x7e >> >> To avoid such a circular locking dependency problem, we have to >> call housekeeping_update() without holding the cpus_read_lock() >> and cpuset_mutex. One way to do that is to introduce a new top level >> isolcpus_update_mutex which will be acquired first if the set of isolated >> CPUs may have to be updated. This new isolcpus_update_mutex will provide >> the need mutual exclusion without the need to hold cpus_read_lock(). >> >> As cpus_read_lock() is now no longer held when >> tmigr_isolated_exclude_cpumask() is called, it needs to acquire it >> directly. >> >> The lockdep_is_cpuset_held() is also updated to check the new >> isolcpus_update_mutex. 
>> > I worry about the issue: > > CPU1 CPU2 > rmdir > css->ss->css_killed(css); > cpuset_css_killed > __update_isolation_cpumasks > cpuset_full_unlock > css->flags |= CSS_DYING; > css_clear_dir(css); > ... > // offline and free do not > // get isolcpus_update_mutex > cpuset_css_offline > cpuset_css_free > cpuset_full_lock > ... > // UAF? > That is the reason why I add a new top-level isolcpus_update_mutex. cpuset_css_killed() and the update_isolation_cpumasks()'s unlock/lock sequence will have to acquire this isolcpus_update_mutex first. As long as all the possible paths (except CPU hotplug) that can call into update_isolation_cpumasks() has acquired isolcpus_update_mutex, it will block cpuset_css_killed() from completing. Note that I add a "lockdep_assert_held(&isolcpus_update_mutex);" in update_isolation_cpumasks(). Cheers, Longman >> Signed-off-by: Waiman Long <longman@redhat.com> >> --- >> kernel/cgroup/cpuset.c | 79 ++++++++++++++++++++++++----------- >> kernel/sched/isolation.c | 4 +- >> kernel/time/timer_migration.c | 3 +- >> 3 files changed, 57 insertions(+), 29 deletions(-) >> >> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c >> index 98c7cb732206..96390ceb5122 100644 >> --- a/kernel/cgroup/cpuset.c >> +++ b/kernel/cgroup/cpuset.c >> @@ -78,7 +78,7 @@ static cpumask_var_t subpartitions_cpus; >> static cpumask_var_t isolated_cpus; >> >> /* >> - * isolated_cpus updating flag (protected by cpuset_mutex) >> + * isolated_cpus updating flag (protected by isolcpus_update_mutex) >> * Set if isolated_cpus is going to be updated in the current >> * cpuset_mutex crtical section. >> */ >> @@ -223,29 +223,46 @@ struct cpuset top_cpuset = { >> }; >> >> /* >> - * There are two global locks guarding cpuset structures - cpuset_mutex and >> - * callback_lock. The cpuset code uses only cpuset_mutex. Other kernel >> - * subsystems can use cpuset_lock()/cpuset_unlock() to prevent change to cpuset >> - * structures. 
Note that cpuset_mutex needs to be a mutex as it is used in >> - * paths that rely on priority inheritance (e.g. scheduler - on RT) for >> - * correctness. >> + * CPUSET Locking Convention >> + * ------------------------- >> * >> - * A task must hold both locks to modify cpusets. If a task holds >> - * cpuset_mutex, it blocks others, ensuring that it is the only task able to >> - * also acquire callback_lock and be able to modify cpusets. It can perform >> - * various checks on the cpuset structure first, knowing nothing will change. >> - * It can also allocate memory while just holding cpuset_mutex. While it is >> - * performing these checks, various callback routines can briefly acquire >> - * callback_lock to query cpusets. Once it is ready to make the changes, it >> - * takes callback_lock, blocking everyone else. >> + * Below are the three global locks guarding cpuset structures in lock >> + * acquisition order: >> + * - isolcpus_update_mutex (optional) >> + * - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock) >> + * - cpuset_mutex >> + * - callback_lock (raw spinlock) >> * >> - * Calls to the kernel memory allocator can not be made while holding >> - * callback_lock, as that would risk double tripping on callback_lock >> - * from one of the callbacks into the cpuset code from within >> - * __alloc_pages(). >> + * The first isolcpus_update_mutex should only be held if the existing set of >> + * isolated CPUs (in isolated partition) or any of the partition states may be >> + * changed when some cpuset control files are being written into. Otherwise, >> + * it can be skipped. Holding isolcpus_update_mutex/cpus_read_lock or >> + * cpus_write_lock will ensure mutual exclusion of isolated_cpus update. >> * >> - * If a task is only holding callback_lock, then it has read-only >> - * access to cpusets. 
>> + * As cpuset will now indirectly flush a number of different workqueues in >> + * housekeeping_update() when the set of isolated CPUs is going to be changed, >> + * it may not be safe from the circular locking perspective to hold the >> + * cpus_read_lock. So cpuset_full_lock() will be released before calling >> + * housekeeping_update() and re-acquired afterward. >> + * >> + * A task must hold all the remaining three locks to modify externally visible >> + * or used fields of cpusets, though some of the internally used cpuset fields >> + * can be modified by holding cpu_hotplug_lock and cpuset_mutex only. If only >> + * reliable read access of the externally used fields are needed, a task can >> + * hold either cpuset_mutex or callback_lock. >> + * >> + * If a task holds cpu_hotplug_lock and cpuset_mutex, it blocks others, >> + * ensuring that it is the only task able to also acquire callback_lock and >> + * be able to modify cpusets. It can perform various checks on the cpuset >> + * structure first, knowing nothing will change. It can also allocate memory >> + * without holding callback_lock. While it is performing these checks, various >> + * callback routines can briefly acquire callback_lock to query cpusets. Once >> + * it is ready to make the changes, it takes callback_lock, blocking everyone >> + * else. >> + * >> + * Calls to the kernel memory allocator cannot be made while holding >> + * callback_lock which is a spinlock, as the memory allocator may sleep or >> + * call back into cpuset code and acquire callback_lock. >> * >> * Now, the task_struct fields mems_allowed and mempolicy may be changed >> * by other task, we use alloc_lock in the task_struct fields to protect >> @@ -256,6 +273,7 @@ struct cpuset top_cpuset = { >> * cpumasks and nodemasks. 
>> */ >> >> +static DEFINE_MUTEX(isolcpus_update_mutex); >> static DEFINE_MUTEX(cpuset_mutex); >> >> /** >> @@ -302,7 +320,7 @@ void cpuset_full_unlock(void) >> #ifdef CONFIG_LOCKDEP >> bool lockdep_is_cpuset_held(void) >> { >> - return lockdep_is_held(&cpuset_mutex); >> + return lockdep_is_held(&isolcpus_update_mutex); >> } >> #endif >> >> @@ -1294,9 +1312,8 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus) >> static void __update_isolation_cpumasks(bool twork); >> static void isolation_task_work_fn(struct callback_head *cb) >> { >> - cpuset_full_lock(); >> + guard(mutex)(&isolcpus_update_mutex); >> __update_isolation_cpumasks(true); >> - cpuset_full_lock(); >> } >> >> /* >> @@ -1338,8 +1355,18 @@ static void __update_isolation_cpumasks(bool twork) >> return; >> } >> >> + lockdep_assert_held(&isolcpus_update_mutex); >> + /* >> + * Release cpus_read_lock & cpuset_mutex before calling >> + * housekeeping_update() and re-acquiring them afterward if not >> + * calling from task_work. 
>> + */ >> + if (!twork) >> + cpuset_full_unlock(); >> ret = housekeeping_update(isolated_cpus); >> WARN_ON_ONCE(ret < 0); >> + if (!twork) >> + cpuset_full_lock(); >> >> isolated_cpus_updating = false; >> } >> @@ -3196,6 +3223,7 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of, >> return -EACCES; >> >> buf = strstrip(buf); >> + mutex_lock(&isolcpus_update_mutex); >> cpuset_full_lock(); >> if (!is_cpuset_online(cs)) >> goto out_unlock; >> @@ -3226,6 +3254,7 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of, >> rebuild_sched_domains_locked(); >> out_unlock: >> cpuset_full_unlock(); >> + mutex_unlock(&isolcpus_update_mutex); >> if (of_cft(of)->private == FILE_MEMLIST) >> schedule_flush_migrate_mm(); >> return retval ?: nbytes; >> @@ -3329,6 +3358,7 @@ static ssize_t cpuset_partition_write(struct kernfs_open_file *of, char *buf, >> else >> return -EINVAL; >> >> + guard(mutex)(&isolcpus_update_mutex); >> cpuset_full_lock(); >> if (is_cpuset_online(cs)) >> retval = update_prstate(cs, val); >> @@ -3502,6 +3532,7 @@ static void cpuset_css_killed(struct cgroup_subsys_state *css) >> { >> struct cpuset *cs = css_cs(css); >> >> + guard(mutex)(&isolcpus_update_mutex); >> cpuset_full_lock(); >> /* Reset valid partition back to member */ >> if (is_partition_valid(cs)) >> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c >> index 3b725d39c06e..ef152d401fe2 100644 >> --- a/kernel/sched/isolation.c >> +++ b/kernel/sched/isolation.c >> @@ -123,8 +123,6 @@ int housekeeping_update(struct cpumask *isol_mask) >> struct cpumask *trial, *old = NULL; >> int err; >> >> - lockdep_assert_cpus_held(); >> - >> trial = kmalloc(cpumask_size(), GFP_KERNEL); >> if (!trial) >> return -ENOMEM; >> @@ -136,7 +134,7 @@ int housekeeping_update(struct cpumask *isol_mask) >> } >> >> if (!housekeeping.flags) >> - static_branch_enable_cpuslocked(&housekeeping_overridden); >> + static_branch_enable(&housekeeping_overridden); >> >> if (housekeeping.flags & HK_FLAG_DOMAIN) >> 
old = housekeeping_cpumask_dereference(HK_TYPE_DOMAIN); >> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c >> index 6da9cd562b20..244a8d025e78 100644 >> --- a/kernel/time/timer_migration.c >> +++ b/kernel/time/timer_migration.c >> @@ -1559,8 +1559,6 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask) >> cpumask_var_t cpumask __free(free_cpumask_var) = CPUMASK_VAR_NULL; >> int cpu; >> >> - lockdep_assert_cpus_held(); >> - >> if (!works) >> return -ENOMEM; >> if (!alloc_cpumask_var(&cpumask, GFP_KERNEL)) >> @@ -1570,6 +1568,7 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask) >> * First set previously isolated CPUs as available (unisolate). >> * This cpumask contains only CPUs that switched to available now. >> */ >> + guard(cpus_read_lock)(); >> cpumask_andnot(cpumask, cpu_online_mask, exclude_cpumask); >> cpumask_andnot(cpumask, cpumask, tmigr_available_cpumask); >> ^ permalink raw reply [flat|nested] 19+ messages in thread
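The hunks above rely on the kernel's scope-based guard(mutex)(...) and guard(cpus_read_lock)() helpers, which release the lock automatically when the enclosing scope is left on any path. A userspace analogue can be built on the GCC/Clang cleanup attribute; in this sketch the "lock" is only a depth counter and the macro name is invented for illustration:

```c
#include <assert.h>

/* Toy lock: a depth counter standing in for a real mutex. */
static int lock_depth;

static void toy_lock(void)   { lock_depth++; }
static void toy_unlock(void) { lock_depth--; }

/* Cleanup callback invoked when the guard variable goes out of scope. */
static void scoped_unlock(int *unused)
{
	(void)unused;
	toy_unlock();
}

/* Rough userspace equivalent of "guard(toy_lock)();" for this toy lock:
 * acquire in the initializer, release via the cleanup attribute. */
#define TOY_GUARD() \
	int _guard __attribute__((cleanup(scoped_unlock))) = (toy_lock(), 0)

static int critical_section(void)
{
	TOY_GUARD();			/* acquired here...                  */
	assert(lock_depth == 1);
	return 42;			/* ...released on every return path  */
}
```

Because the unlock runs on every exit from the scope, functions with multiple return statements cannot leak the lock, which is why the patches use guard() in paths like isolation_task_work_fn() and cpuset_partition_write().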
* Re: [PATCH/for-next 2/2] cgroup/cpuset: Introduce a new top level isolcpus_update_mutex 2026-01-29 21:16 ` Waiman Long @ 2026-01-30 0:56 ` Chen Ridong 2026-01-30 1:35 ` Waiman Long 0 siblings, 1 reply; 19+ messages in thread From: Chen Ridong @ 2026-01-30 0:56 UTC (permalink / raw) To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner, Shuah Khan Cc: cgroups, linux-kernel, linux-kselftest On 2026/1/30 5:16, Waiman Long wrote: > On 1/29/26 3:01 AM, Chen Ridong wrote: >> >> On 2026/1/28 12:42, Waiman Long wrote: >>> The current cpuset partition code is able to dynamically update >>> the sched domains of a running system and the corresponding >>> HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentally the >>> "isolcpus=domain,..." boot command line feature at run time. >>> >>> The housekeeping cpumask update requires flushing a number of different >>> workqueues which may not be safe with cpus_read_lock() held as the >>> workqueue flushing code may acquire cpus_read_lock() or acquiring locks >>> which have locking dependency with cpus_read_lock() down the chain. Below >>> is an example of such circular locking problem. >>> >>> ====================================================== >>> WARNING: possible circular locking dependency detected >>> 6.18.0-test+ #2 Tainted: G S >>> ------------------------------------------------------ >>> test_cpuset_prs/10971 is trying to acquire lock: >>> ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: >>> touch_wq_lockdep_map+0x7a/0x180 >>> >>> but task is already holding lock: >>> ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: >>> cpuset_partition_write+0x85/0x130 >>> >>> which lock already depends on the new lock. 
>>>
>>> the existing dependency chain (in reverse order) is:
>>> -> #4 (cpuset_mutex){+.+.}-{4:4}:
>>> -> #3 (cpu_hotplug_lock){++++}-{0:0}:
>>> -> #2 (rtnl_mutex){+.+.}-{4:4}:
>>> -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
>>> -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:
>>>
>>> Chain exists of:
>>>   (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex
>>>
>>> 5 locks held by test_cpuset_prs/10971:
>>> #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
>>> #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at:
>>> kernfs_fop_write_iter+0x260/0x5f0
>>> #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at:
>>> kernfs_fop_write_iter+0x2b6/0x5f0
>>> #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at:
>>> cpuset_partition_write+0x77/0x130
>>> #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at:
>>> cpuset_partition_write+0x85/0x130
>>>
>>> Call Trace:
>>>  <TASK>
>>>  :
>>>  touch_wq_lockdep_map+0x93/0x180
>>>  __flush_workqueue+0x111/0x10b0
>>>  housekeeping_update+0x12d/0x2d0
>>>  update_parent_effective_cpumask+0x595/0x2440
>>>  update_prstate+0x89d/0xce0
>>>  cpuset_partition_write+0xc5/0x130
>>>  cgroup_file_write+0x1a5/0x680
>>>  kernfs_fop_write_iter+0x3df/0x5f0
>>>  vfs_write+0x525/0xfd0
>>>  ksys_write+0xf9/0x1d0
>>>  do_syscall_64+0x95/0x520
>>>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>
>>> To avoid such a circular locking dependency problem, we have to
>>> call housekeeping_update() without holding the cpus_read_lock()
>>> and cpuset_mutex. One way to do that is to introduce a new top level
>>> isolcpus_update_mutex which will be acquired first if the set of isolated
>>> CPUs may have to be updated. This new isolcpus_update_mutex will provide
>>> the needed mutual exclusion without the need to hold cpus_read_lock().
>>>
>>> As cpus_read_lock() is now no longer held when
>>> tmigr_isolated_exclude_cpumask() is called, it needs to acquire it
>>> directly.
>>>
>>> The lockdep_is_cpuset_held() is also updated to check the new
>>> isolcpus_update_mutex.
>>>
>> I worry about the issue:
>>
>> CPU1                                     CPU2
>> rmdir
>>   css->ss->css_killed(css);
>>     cpuset_css_killed
>>       __update_isolation_cpumasks
>>         cpuset_full_unlock
>>                                          css->flags |= CSS_DYING;
>>                                          css_clear_dir(css);
>>                                          ...
>>                                          // offline and free do not
>>                                          // get isolcpus_update_mutex
>>                                          cpuset_css_offline
>>                                          cpuset_css_free
>>         cpuset_full_lock
>>         ...
>>         // UAF?
>>

Hi, Longman,

In this patch, I noticed that cpuset_css_offline and cpuset_css_free do not
acquire the isolcpus_update_mutex. This could potentially lead to a UAF issue.

> That is the reason why I add a new top-level isolcpus_update_mutex.
> cpuset_css_killed() and the update_isolation_cpumasks()'s unlock/lock sequence
> will have to acquire this isolcpus_update_mutex first.
>

However, simply adding isolcpus_update_mutex to cpuset_css_killed and
update_isolation_cpumasks may not be sufficient.

As I mentioned, the path that calls __update_isolation_cpumasks may first
acquire isolcpus_update_mutex and cpuset_full_lock, but once cpuset_css_killed
is completed, it will release the “full” lock and then attempt to reacquire it
later. During this intermediate period, the cpuset may have already been freed,
because cpuset_css_offline and cpuset_css_free do not currently acquire the
isolcpus_update_mutex.

> As long as all the possible paths (except CPU hotplug) that can call into
> update_isolation_cpumasks() have acquired isolcpus_update_mutex, it will block
> cpuset_css_killed() from completing. Note that I add a
> "lockdep_assert_held(&isolcpus_update_mutex);" in update_isolation_cpumasks().
>

--
Best regards,
Ridong

^ permalink raw reply	[flat|nested] 19+ messages in thread
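The window being debated above can be modeled in userspace. This is only a sketch under stated assumptions, not the kernel implementation: pthread mutexes stand in for isolcpus_update_mutex and cpuset_mutex, rmdir_thread() plays the asynchronous offline/free side, and struct cpuset_model, isolation_update() and cs are invented names. The point it demonstrates: because teardown must also take the top-level mutex, the updater's drop/reacquire window cannot observe a freed cpuset.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t isolcpus_update_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t cpuset_mutex = PTHREAD_MUTEX_INITIALIZER;

struct cpuset_model { int alive; };
static struct cpuset_model *cs;

static void *rmdir_thread(void *arg)
{
	(void)arg;
	/* The offline/free paths take the top-level mutex too, so this
	 * blocks for as long as the updater holds it. */
	pthread_mutex_lock(&isolcpus_update_mutex);
	pthread_mutex_lock(&cpuset_mutex);
	cs->alive = 0;
	free(cs);
	cs = NULL;
	pthread_mutex_unlock(&cpuset_mutex);
	pthread_mutex_unlock(&isolcpus_update_mutex);
	return NULL;
}

static void isolation_update(void)
{
	pthread_t t;

	cs = malloc(sizeof(*cs));
	cs->alive = 1;

	pthread_mutex_lock(&isolcpus_update_mutex);	/* held across the window */
	pthread_mutex_lock(&cpuset_mutex);
	pthread_create(&t, NULL, rmdir_thread, NULL);

	/* Drop the inner lock, as the updater must before calling
	 * housekeeping_update() ... */
	pthread_mutex_unlock(&cpuset_mutex);
	assert(cs && cs->alive);	/* teardown is blocked: no UAF */
	/* ... and reacquire it afterwards. */
	pthread_mutex_lock(&cpuset_mutex);
	assert(cs && cs->alive);	/* still valid on reacquire */

	pthread_mutex_unlock(&cpuset_mutex);
	pthread_mutex_unlock(&isolcpus_update_mutex);
	pthread_join(t, NULL);		/* teardown now frees the cpuset */
}
```

The assertions are deterministic: rmdir_thread() cannot get past the top-level mutex until isolation_update() releases it, which is exactly the ordering the thread converges on.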
* Re: [PATCH/for-next 2/2] cgroup/cpuset: Introduce a new top level isolcpus_update_mutex 2026-01-30 0:56 ` Chen Ridong @ 2026-01-30 1:35 ` Waiman Long 2026-01-30 1:42 ` Chen Ridong 0 siblings, 1 reply; 19+ messages in thread From: Waiman Long @ 2026-01-30 1:35 UTC (permalink / raw) To: Chen Ridong, Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner, Shuah Khan Cc: cgroups, linux-kernel, linux-kselftest On 1/29/26 7:56 PM, Chen Ridong wrote: > > On 2026/1/30 5:16, Waiman Long wrote: >> On 1/29/26 3:01 AM, Chen Ridong wrote: >>> On 2026/1/28 12:42, Waiman Long wrote: >>>> The current cpuset partition code is able to dynamically update >>>> the sched domains of a running system and the corresponding >>>> HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentally the >>>> "isolcpus=domain,..." boot command line feature at run time. >>>> >>>> The housekeeping cpumask update requires flushing a number of different >>>> workqueues which may not be safe with cpus_read_lock() held as the >>>> workqueue flushing code may acquire cpus_read_lock() or acquiring locks >>>> which have locking dependency with cpus_read_lock() down the chain. Below >>>> is an example of such circular locking problem. >>>> >>>> ====================================================== >>>> WARNING: possible circular locking dependency detected >>>> 6.18.0-test+ #2 Tainted: G S >>>> ------------------------------------------------------ >>>> test_cpuset_prs/10971 is trying to acquire lock: >>>> ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: >>>> touch_wq_lockdep_map+0x7a/0x180 >>>> >>>> but task is already holding lock: >>>> ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: >>>> cpuset_partition_write+0x85/0x130 >>>> >>>> which lock already depends on the new lock. 
>>>> [dependency chain, held-locks list, call trace and rest of the quoted patch description snipped; see the full report earlier in the thread]
>>>> >>>> The lockdep_is_cpuset_held() is also updated to check the new >>>> isolcpus_update_mutex. >>>> >>> I worry about the issue: >>> >>> CPU1 CPU2 >>> rmdir >>> css->ss->css_killed(css); >>> cpuset_css_killed >>> __update_isolation_cpumasks >>> cpuset_full_unlock >>> css->flags |= CSS_DYING; >>> css_clear_dir(css); >>> ... >>> // offline and free do not >>> // get isolcpus_update_mutex >>> cpuset_css_offline >>> cpuset_css_free >>> cpuset_full_lock >>> ... >>> // UAF? >>> > Hi, Longman, > > In this patch, I noticed that cpuset_css_offline and cpuset_css_free do not > acquire the isolcpus_update_mutex. This could potentially lead to a UAF issue. > >> That is the reason why I add a new top-level isolcpus_update_mutex. >> cpuset_css_killed() and the update_isolation_cpumasks()'s unlock/lock sequence >> will have to acquire this isolcpus_update_mutex first. >> > However, simply adding isolcpus_update_mutex to cpuset_css_killed and > update_isolation_cpumasks may not be sufficient. > > As I mentioned, the path that calls __update_isolation_cpumasks may first > acquire isolcpus_update_mutex and cpuset_full_lock, but once cpuset_css_killed > is completed, it will release the “full” lock and then attempt to reacquire it > later. During this intermediate period, the cpuset may have already been freed, > because cpuset_css_offline and cpuset_css_free do not currently acquire the > isolcpus_update_mutex. You are right that acquisition of the new isolcpus_update_mutex should be in all the places where cpuset_full_lock() is acquired. Will update the patch to do that. That should eliminate the risk. Cheers, Longman ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH/for-next 2/2] cgroup/cpuset: Introduce a new top level isolcpus_update_mutex 2026-01-30 1:35 ` Waiman Long @ 2026-01-30 1:42 ` Chen Ridong 2026-01-30 3:53 ` Waiman Long 0 siblings, 1 reply; 19+ messages in thread From: Chen Ridong @ 2026-01-30 1:42 UTC (permalink / raw) To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner, Shuah Khan Cc: cgroups, linux-kernel, linux-kselftest On 2026/1/30 9:35, Waiman Long wrote: > On 1/29/26 7:56 PM, Chen Ridong wrote: >> >> On 2026/1/30 5:16, Waiman Long wrote: >>> On 1/29/26 3:01 AM, Chen Ridong wrote: >>>> On 2026/1/28 12:42, Waiman Long wrote: >>>>> The current cpuset partition code is able to dynamically update >>>>> the sched domains of a running system and the corresponding >>>>> HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentally the >>>>> "isolcpus=domain,..." boot command line feature at run time. >>>>> >>>>> The housekeeping cpumask update requires flushing a number of different >>>>> workqueues which may not be safe with cpus_read_lock() held as the >>>>> workqueue flushing code may acquire cpus_read_lock() or acquiring locks >>>>> which have locking dependency with cpus_read_lock() down the chain. Below >>>>> is an example of such circular locking problem. 
>>>>> [quoted lockdep splat snipped; see the full report earlier in the thread]
>>>>> >>>>> To avoid such a circular locking dependency problem, we have to >>>>> call housekeeping_update() without holding the cpus_read_lock() >>>>> and cpuset_mutex. One way to do that is to introduce a new top level >>>>> isolcpus_update_mutex which will be acquired first if the set of isolated >>>>> CPUs may have to be updated. This new isolcpus_update_mutex will provide >>>>> the need mutual exclusion without the need to hold cpus_read_lock(). >>>>> >>>>> As cpus_read_lock() is now no longer held when >>>>> tmigr_isolated_exclude_cpumask() is called, it needs to acquire it >>>>> directly. >>>>> >>>>> The lockdep_is_cpuset_held() is also updated to check the new >>>>> isolcpus_update_mutex. >>>>> >>>> I worry about the issue: >>>> >>>> CPU1 CPU2 >>>> rmdir >>>> css->ss->css_killed(css); >>>> cpuset_css_killed >>>> __update_isolation_cpumasks >>>> cpuset_full_unlock >>>> css->flags |= CSS_DYING; >>>> css_clear_dir(css); >>>> ... >>>> // offline and free do not >>>> // get isolcpus_update_mutex >>>> cpuset_css_offline >>>> cpuset_css_free >>>> cpuset_full_lock >>>> ... >>>> // UAF? >>>> >> Hi, Longman, >> >> In this patch, I noticed that cpuset_css_offline and cpuset_css_free do not >> acquire the isolcpus_update_mutex. This could potentially lead to a UAF issue. >> >>> That is the reason why I add a new top-level isolcpus_update_mutex. >>> cpuset_css_killed() and the update_isolation_cpumasks()'s unlock/lock sequence >>> will have to acquire this isolcpus_update_mutex first. >>> >> However, simply adding isolcpus_update_mutex to cpuset_css_killed and >> update_isolation_cpumasks may not be sufficient. >> >> As I mentioned, the path that calls __update_isolation_cpumasks may first >> acquire isolcpus_update_mutex and cpuset_full_lock, but once cpuset_css_killed >> is completed, it will release the “full” lock and then attempt to reacquire it >> later. 
During this intermediate period, the cpuset may have already been freed,
>> because cpuset_css_offline and cpuset_css_free do not currently acquire the
>> isolcpus_update_mutex.
>
> You are right that acquisition of the new isolcpus_update_mutex should be in all
> the places where cpuset_full_lock() is acquired. Will update the patch to do
> that. That should eliminate the risk.
>

I suggest putting isolcpus_update_mutex into cpuset_full_lock(), since this
function means that all the locks needed have been acquired.

void cpuset_full_lock(void)
{
	mutex_lock(&isolcpus_update_mutex);
	cpus_read_lock();
	mutex_lock(&cpuset_mutex);
}

void cpuset_full_unlock(void)
{
	mutex_unlock(&cpuset_mutex);
	cpus_read_unlock();
	mutex_unlock(&isolcpus_update_mutex);
}

In the __update_isolation_cpumasks function, we can pair:

```
	...
	mutex_unlock(&cpuset_mutex);
	cpus_read_unlock();
	... Actions
	cpus_read_lock();
	mutex_lock(&cpuset_mutex);
	...
```

--
Best regards,
Ridong

^ permalink raw reply	[flat|nested] 19+ messages in thread
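The suggested helpers and the drop/reacquire pairing can be sanity-checked with a small userspace model. This is a sketch, not kernel code: pthread mutexes replace the kernel locks (a plain mutex stands in for cpus_read_lock()), and the depth counter is an invented aid for asserting the acquisition order discussed above.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t isolcpus_update_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t cpu_hotplug_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t cpuset_mutex = PTHREAD_MUTEX_INITIALIZER;
static int depth;	/* tracks how many of the three locks are held */

static void cpuset_full_lock(void)
{
	pthread_mutex_lock(&isolcpus_update_mutex);	/* top-level mutex first */
	depth++;
	pthread_mutex_lock(&cpu_hotplug_lock);		/* then cpus_read_lock() */
	depth++;
	pthread_mutex_lock(&cpuset_mutex);		/* cpuset_mutex last */
	depth++;
	assert(depth == 3);
}

static void cpuset_full_unlock(void)
{
	assert(depth == 3);
	pthread_mutex_unlock(&cpuset_mutex);		/* strict reverse order */
	depth--;
	pthread_mutex_unlock(&cpu_hotplug_lock);
	depth--;
	pthread_mutex_unlock(&isolcpus_update_mutex);
	depth--;
}

/* The pairing inside __update_isolation_cpumasks(): only the two inner
 * locks are cycled; the top-level mutex stays held across the window, so
 * a concurrent teardown taking cpuset_full_lock() remains blocked. */
static void update_isolation_cpumasks_window(void)
{
	assert(depth == 3);
	pthread_mutex_unlock(&cpuset_mutex);
	pthread_mutex_unlock(&cpu_hotplug_lock);
	depth = 1;
	/* ... housekeeping_update() would run here, with neither
	 * cpus_read_lock() nor cpuset_mutex held ... */
	pthread_mutex_lock(&cpu_hotplug_lock);
	pthread_mutex_lock(&cpuset_mutex);
	depth = 3;
}
```

A full_lock / window / full_unlock sequence returns depth to 0, mirroring the intended kernel lock discipline: the new mutex bounds the whole operation while the circular-dependency-prone locks are released for the flush.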
* Re: [PATCH/for-next 2/2] cgroup/cpuset: Introduce a new top level isolcpus_update_mutex 2026-01-30 1:42 ` Chen Ridong @ 2026-01-30 3:53 ` Waiman Long 2026-01-30 6:07 ` Chen Ridong 0 siblings, 1 reply; 19+ messages in thread From: Waiman Long @ 2026-01-30 3:53 UTC (permalink / raw) To: Chen Ridong, Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner, Shuah Khan Cc: cgroups, linux-kernel, linux-kselftest On 1/29/26 8:42 PM, Chen Ridong wrote: > > On 2026/1/30 9:35, Waiman Long wrote: >> On 1/29/26 7:56 PM, Chen Ridong wrote: >>> On 2026/1/30 5:16, Waiman Long wrote: >>>> On 1/29/26 3:01 AM, Chen Ridong wrote: >>>>> On 2026/1/28 12:42, Waiman Long wrote: >>>>>> The current cpuset partition code is able to dynamically update >>>>>> the sched domains of a running system and the corresponding >>>>>> HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentally the >>>>>> "isolcpus=domain,..." boot command line feature at run time. >>>>>> >>>>>> The housekeeping cpumask update requires flushing a number of different >>>>>> workqueues which may not be safe with cpus_read_lock() held as the >>>>>> workqueue flushing code may acquire cpus_read_lock() or acquiring locks >>>>>> which have locking dependency with cpus_read_lock() down the chain. Below >>>>>> is an example of such circular locking problem. 
>>>>>> >>>>>> ====================================================== >>>>>> WARNING: possible circular locking dependency detected >>>>>> 6.18.0-test+ #2 Tainted: G S >>>>>> ------------------------------------------------------ >>>>>> test_cpuset_prs/10971 is trying to acquire lock: >>>>>> ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: >>>>>> touch_wq_lockdep_map+0x7a/0x180 >>>>>> >>>>>> but task is already holding lock: >>>>>> ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: >>>>>> cpuset_partition_write+0x85/0x130 >>>>>> >>>>>> which lock already depends on the new lock. >>>>>> >>>>>> the existing dependency chain (in reverse order) is: >>>>>> -> #4 (cpuset_mutex){+.+.}-{4:4}: >>>>>> -> #3 (cpu_hotplug_lock){++++}-{0:0}: >>>>>> -> #2 (rtnl_mutex){+.+.}-{4:4}: >>>>>> -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}: >>>>>> -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}: >>>>>> >>>>>> Chain exists of: >>>>>> (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex >>>>>> >>>>>> 5 locks held by test_cpuset_prs/10971: >>>>>> #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: >>>>>> ksys_write+0xf9/0x1d0 >>>>>> #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: >>>>>> kernfs_fop_write_iter+0x260/0x5f0 >>>>>> #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: >>>>>> kernfs_fop_write_iter+0x2b6/0x5f0 >>>>>> #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: >>>>>> cpuset_partition_write+0x77/0x130 >>>>>> #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: >>>>>> cpuset_partition_write+0x85/0x130 >>>>>> >>>>>> Call Trace: >>>>>> <TASK> >>>>>> : >>>>>> touch_wq_lockdep_map+0x93/0x180 >>>>>> __flush_workqueue+0x111/0x10b0 >>>>>> housekeeping_update+0x12d/0x2d0 >>>>>> update_parent_effective_cpumask+0x595/0x2440 >>>>>> update_prstate+0x89d/0xce0 >>>>>> cpuset_partition_write+0xc5/0x130 >>>>>> cgroup_file_write+0x1a5/0x680 >>>>>> kernfs_fop_write_iter+0x3df/0x5f0 >>>>>> vfs_write+0x525/0xfd0 >>>>>> ksys_write+0xf9/0x1d0 >>>>>> 
do_syscall_64+0x95/0x520 >>>>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e >>>>>> >>>>>> To avoid such a circular locking dependency problem, we have to >>>>>> call housekeeping_update() without holding the cpus_read_lock() >>>>>> and cpuset_mutex. One way to do that is to introduce a new top level >>>>>> isolcpus_update_mutex which will be acquired first if the set of isolated >>>>>> CPUs may have to be updated. This new isolcpus_update_mutex will provide >>>>>> the need mutual exclusion without the need to hold cpus_read_lock(). >>>>>> >>>>>> As cpus_read_lock() is now no longer held when >>>>>> tmigr_isolated_exclude_cpumask() is called, it needs to acquire it >>>>>> directly. >>>>>> >>>>>> The lockdep_is_cpuset_held() is also updated to check the new >>>>>> isolcpus_update_mutex. >>>>>> >>>>> I worry about the issue: >>>>> >>>>> CPU1 CPU2 >>>>> rmdir >>>>> css->ss->css_killed(css); >>>>> cpuset_css_killed >>>>> __update_isolation_cpumasks >>>>> cpuset_full_unlock >>>>> css->flags |= CSS_DYING; >>>>> css_clear_dir(css); >>>>> ... >>>>> // offline and free do not >>>>> // get isolcpus_update_mutex >>>>> cpuset_css_offline >>>>> cpuset_css_free >>>>> cpuset_full_lock >>>>> ... >>>>> // UAF? >>>>> >>> Hi, Longman, >>> >>> In this patch, I noticed that cpuset_css_offline and cpuset_css_free do not >>> acquire the isolcpus_update_mutex. This could potentially lead to a UAF issue. >>> >>>> That is the reason why I add a new top-level isolcpus_update_mutex. >>>> cpuset_css_killed() and the update_isolation_cpumasks()'s unlock/lock sequence >>>> will have to acquire this isolcpus_update_mutex first. >>>> >>> However, simply adding isolcpus_update_mutex to cpuset_css_killed and >>> update_isolation_cpumasks may not be sufficient. 
>>>
>>> As I mentioned, the path that calls __update_isolation_cpumasks may first
>>> acquire isolcpus_update_mutex and cpuset_full_lock, but once cpuset_css_killed
>>> is completed, it will release the “full” lock and then attempt to reacquire it
>>> later. During this intermediate period, the cpuset may have already been freed,
>>> because cpuset_css_offline and cpuset_css_free do not currently acquire the
>>> isolcpus_update_mutex.
>> You are right that acquisition of the new isolcpus_update_mutex should be in all
>> the places where cpuset_full_lock() is acquired. Will update the patch to do
>> that. That should eliminate the risk.
>>
> I suggest putting isolcpus_update_mutex into cpuset_full_lock(), since this
> function means that all the locks needed have been acquired.
>
> void cpuset_full_lock(void)
> {
> 	mutex_lock(&isolcpus_update_mutex);
> 	cpus_read_lock();
> 	mutex_lock(&cpuset_mutex);
> }
>
> void cpuset_full_unlock(void)
> {
> 	mutex_unlock(&cpuset_mutex);
> 	cpus_read_unlock();
> 	mutex_unlock(&isolcpus_update_mutex);
> }

That is what I had done.

Cheers,
Longman

> In the __update_isolation_cpumasks function, we can pair:
>
> ```
> 	...
> 	mutex_unlock(&cpuset_mutex);
> 	cpus_read_unlock();
> 	... Actions
> 	cpus_read_lock();
> 	mutex_lock(&cpuset_mutex);
> 	...
> ```
>

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [PATCH/for-next 2/2] cgroup/cpuset: Introduce a new top level isolcpus_update_mutex 2026-01-30 3:53 ` Waiman Long @ 2026-01-30 6:07 ` Chen Ridong 0 siblings, 0 replies; 19+ messages in thread From: Chen Ridong @ 2026-01-30 6:07 UTC (permalink / raw) To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner, Shuah Khan Cc: cgroups, linux-kernel, linux-kselftest On 2026/1/30 11:53, Waiman Long wrote: > On 1/29/26 8:42 PM, Chen Ridong wrote: >> >> On 2026/1/30 9:35, Waiman Long wrote: >>> On 1/29/26 7:56 PM, Chen Ridong wrote: >>>> On 2026/1/30 5:16, Waiman Long wrote: >>>>> On 1/29/26 3:01 AM, Chen Ridong wrote: >>>>>> On 2026/1/28 12:42, Waiman Long wrote: >>>>>>> The current cpuset partition code is able to dynamically update >>>>>>> the sched domains of a running system and the corresponding >>>>>>> HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentally the >>>>>>> "isolcpus=domain,..." boot command line feature at run time. >>>>>>> >>>>>>> The housekeeping cpumask update requires flushing a number of different >>>>>>> workqueues which may not be safe with cpus_read_lock() held as the >>>>>>> workqueue flushing code may acquire cpus_read_lock() or acquiring locks >>>>>>> which have locking dependency with cpus_read_lock() down the chain. Below >>>>>>> is an example of such circular locking problem. 
>>>>>>> >>>>>>> ====================================================== >>>>>>> WARNING: possible circular locking dependency detected >>>>>>> 6.18.0-test+ #2 Tainted: G S >>>>>>> ------------------------------------------------------ >>>>>>> test_cpuset_prs/10971 is trying to acquire lock: >>>>>>> ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: >>>>>>> touch_wq_lockdep_map+0x7a/0x180 >>>>>>> >>>>>>> but task is already holding lock: >>>>>>> ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: >>>>>>> cpuset_partition_write+0x85/0x130 >>>>>>> >>>>>>> which lock already depends on the new lock. >>>>>>> >>>>>>> the existing dependency chain (in reverse order) is: >>>>>>> -> #4 (cpuset_mutex){+.+.}-{4:4}: >>>>>>> -> #3 (cpu_hotplug_lock){++++}-{0:0}: >>>>>>> -> #2 (rtnl_mutex){+.+.}-{4:4}: >>>>>>> -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}: >>>>>>> -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}: >>>>>>> >>>>>>> Chain exists of: >>>>>>> (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex >>>>>>> >>>>>>> 5 locks held by test_cpuset_prs/10971: >>>>>>> #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: >>>>>>> ksys_write+0xf9/0x1d0 >>>>>>> #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: >>>>>>> kernfs_fop_write_iter+0x260/0x5f0 >>>>>>> #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: >>>>>>> kernfs_fop_write_iter+0x2b6/0x5f0 >>>>>>> #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: >>>>>>> cpuset_partition_write+0x77/0x130 >>>>>>> #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: >>>>>>> cpuset_partition_write+0x85/0x130 >>>>>>> >>>>>>> Call Trace: >>>>>>> <TASK> >>>>>>> : >>>>>>> touch_wq_lockdep_map+0x93/0x180 >>>>>>> __flush_workqueue+0x111/0x10b0 >>>>>>> housekeeping_update+0x12d/0x2d0 >>>>>>> update_parent_effective_cpumask+0x595/0x2440 >>>>>>> update_prstate+0x89d/0xce0 >>>>>>> cpuset_partition_write+0xc5/0x130 >>>>>>> cgroup_file_write+0x1a5/0x680 >>>>>>> kernfs_fop_write_iter+0x3df/0x5f0 >>>>>>> vfs_write+0x525/0xfd0 >>>>>>> 
ksys_write+0xf9/0x1d0 >>>>>>> do_syscall_64+0x95/0x520 >>>>>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e >>>>>>> >>>>>>> To avoid such a circular locking dependency problem, we have to >>>>>>> call housekeeping_update() without holding the cpus_read_lock() >>>>>>> and cpuset_mutex. One way to do that is to introduce a new top level >>>>>>> isolcpus_update_mutex which will be acquired first if the set of isolated >>>>>>> CPUs may have to be updated. This new isolcpus_update_mutex will provide >>>>>>> the need mutual exclusion without the need to hold cpus_read_lock(). >>>>>>> >>>>>>> As cpus_read_lock() is now no longer held when >>>>>>> tmigr_isolated_exclude_cpumask() is called, it needs to acquire it >>>>>>> directly. >>>>>>> >>>>>>> The lockdep_is_cpuset_held() is also updated to check the new >>>>>>> isolcpus_update_mutex. >>>>>>> >>>>>> I worry about the issue: >>>>>> >>>>>> CPU1 CPU2 >>>>>> rmdir >>>>>> css->ss->css_killed(css); >>>>>> cpuset_css_killed >>>>>> __update_isolation_cpumasks >>>>>> cpuset_full_unlock >>>>>> css->flags |= CSS_DYING; >>>>>> css_clear_dir(css); >>>>>> ... >>>>>> // offline and free do not >>>>>> // get isolcpus_update_mutex >>>>>> cpuset_css_offline >>>>>> cpuset_css_free >>>>>> cpuset_full_lock >>>>>> ... >>>>>> // UAF? >>>>>> >>>> Hi, Longman, >>>> >>>> In this patch, I noticed that cpuset_css_offline and cpuset_css_free do not >>>> acquire the isolcpus_update_mutex. This could potentially lead to a UAF issue. >>>> >>>>> That is the reason why I add a new top-level isolcpus_update_mutex. >>>>> cpuset_css_killed() and the update_isolation_cpumasks()'s unlock/lock sequence >>>>> will have to acquire this isolcpus_update_mutex first. >>>>> >>>> However, simply adding isolcpus_update_mutex to cpuset_css_killed and >>>> update_isolation_cpumasks may not be sufficient. 
>>>> >>>> As I mentioned, the path that calls __update_isolation_cpumasks may first >>>> acquire isolcpus_update_mutex and cpuset_full_lock, but once cpuset_css_killed >>>> is completed, it will release the “full” lock and then attempt to reacquire it >>>> later. During this intermediate period, the cpuset may have already been freed, >>>> because cpuset_css_offline and cpuset_css_free do not currently acquire the >>>> isolcpus_update_mutex. >>> You are right that acquisition of the new isolcpus_update_mutex should be in all >>> the places where cpuset_full_lock() is acquired. Will update the patch to do >>> that. That should eliminate the risk. >>> >> I suggest that putting isolcpus_update_mutex into cpuset_full_lock, since this >> function means that all the locks needed have been acquired. >> >> void cpuset_full_lock(void) >> { >> mutex_lock(&isolcpus_update_mutex); >> cpus_read_lock(); >> mutex_lock(&cpuset_mutex); >> } >> >> void cpuset_full_unlock(void) >> { >> mutex_unlock(&cpuset_mutex); >> cpus_read_unlock(); >> mutex_unlock(&isolcpus_update_mutex); >> } > > That is what I had done. > Great. -- Best regards, Ridong ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2026-01-30  6:07 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-28  4:42 [PATCH/for-next 0/2] cgroup/cpuset: Fix partition related locking issues Waiman Long
2026-01-28  4:42 ` [PATCH/for-next 1/2] cgroup/cpuset: Defer housekeeping_update() call from CPU hotplug to task_work Waiman Long
2026-01-28 17:44   ` Tejun Heo
2026-01-28 18:08     ` Waiman Long
2026-01-29  4:03   ` Chen Ridong
2026-01-29  7:15     ` Chen Ridong
2026-01-30  1:37       ` Waiman Long
2026-01-30  1:39     ` Waiman Long
2026-01-28  4:42 ` [PATCH/for-next 2/2] cgroup/cpuset: Introduce a new top level isolcpus_update_mutex Waiman Long
2026-01-29  8:01   ` Chen Ridong
2026-01-29  8:20     ` Chen Ridong
2026-01-29 20:57       ` Waiman Long
2026-01-30  1:16         ` Chen Ridong
2026-01-29 21:16     ` Waiman Long
2026-01-30  0:56       ` Chen Ridong
2026-01-30  1:35         ` Waiman Long
2026-01-30  1:42           ` Chen Ridong
2026-01-30  3:53             ` Waiman Long
2026-01-30  6:07               ` Chen Ridong