From: Jon Hunter <jonathanh@nvidia.com>
To: "Waiman Long" <longman@redhat.com>,
"Chen Ridong" <chenridong@huaweicloud.com>,
"Tejun Heo" <tj@kernel.org>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Michal Koutný" <mkoutny@suse.com>,
"Ingo Molnar" <mingo@redhat.com>,
"Peter Zijlstra" <peterz@infradead.org>,
"Juri Lelli" <juri.lelli@redhat.com>,
"Vincent Guittot" <vincent.guittot@linaro.org>,
"Steven Rostedt" <rostedt@goodmis.org>,
"Ben Segall" <bsegall@google.com>, "Mel Gorman" <mgorman@suse.de>,
"Valentin Schneider" <vschneid@redhat.com>,
"Frederic Weisbecker" <frederic@kernel.org>,
"Thomas Gleixner" <tglx@linutronix.de>,
"Shuah Khan" <shuah@kernel.org>
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-kselftest@vger.kernel.org,
"linux-tegra@vger.kernel.org" <linux-tegra@vger.kernel.org>
Subject: Re: [PATCH v6 7/8] cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
Date: Tue, 3 Mar 2026 15:18:16 +0000 [thread overview]
Message-ID: <1a89aceb-48db-4edd-a730-b445e41221fe@nvidia.com> (raw)
In-Reply-To: <20260221185418.29319-8-longman@redhat.com>
Hi Waiman,
On 21/02/2026 18:54, Waiman Long wrote:
> The cpuset_handle_hotplug() may need to invoke housekeeping_update(),
> for instance, when an isolated partition is invalidated because its
> last active CPU has been put offline.
>
> As we are going to enable dynamic updates to the nohz_full housekeeping
> cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug,
> allowing the CPU hotplug path to call into housekeeping_update() directly
> from update_isolation_cpumasks() will likely cause a deadlock. So we
> have to defer any call to housekeeping_update() until after the CPU
> hotplug operation has finished. This is now done via a workqueue, where
> the update_hk_sched_domains() function will be invoked via
> hk_sd_workfn().
>
> A concurrent cpuset control file write may have executed the required
> update_hk_sched_domains() function before the work function is called,
> so the work function may become a no-op when it is invoked.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> kernel/cgroup/cpuset.c | 31 ++++++++++++++++---
> .../selftests/cgroup/test_cpuset_prs.sh | 11 ++++++-
> 2 files changed, 36 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 3d0d18bf182f..2c80bfc30bbc 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1323,6 +1323,16 @@ static void update_hk_sched_domains(void)
> rebuild_sched_domains_locked();
> }
>
> +/*
> + * Work function to invoke update_hk_sched_domains()
> + */
> +static void hk_sd_workfn(struct work_struct *work)
> +{
> + cpuset_full_lock();
> + update_hk_sched_domains();
> + cpuset_full_unlock();
> +}
> +
> /**
> * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
> * @parent: Parent cpuset containing all siblings
> @@ -3795,6 +3805,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
> */
> static void cpuset_handle_hotplug(void)
> {
> + static DECLARE_WORK(hk_sd_work, hk_sd_workfn);
> static cpumask_t new_cpus;
> static nodemask_t new_mems;
> bool cpus_updated, mems_updated;
> @@ -3877,11 +3888,21 @@ static void cpuset_handle_hotplug(void)
> }
>
>
> - if (update_housekeeping || force_sd_rebuild) {
> - mutex_lock(&cpuset_mutex);
> - update_hk_sched_domains();
> - mutex_unlock(&cpuset_mutex);
> - }
> + /*
> + * Queue a work to call housekeeping_update() & rebuild_sched_domains()
> + * There will be a slight delay before the HK_TYPE_DOMAIN housekeeping
> + * cpumask can correctly reflect what is in isolated_cpus.
> + *
> + * We rely on WORK_STRUCT_PENDING_BIT to not requeue a work item that
> + * is still pending. Before the pending bit is cleared, the work data
> + * is copied out and work item dequeued. So it is possible to queue
> + * the work again before the hk_sd_workfn() is invoked to process the
> + * previously queued work. Since hk_sd_workfn() doesn't use the work
> + * item at all, this is not a problem.
> + */
> + if (update_housekeeping || force_sd_rebuild)
> + queue_work(system_unbound_wq, &hk_sd_work);
> +
> free_tmpmasks(ptmp);
> }
>
> diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> index 0c5db118f2d1..dc2dff361ec6 100755
> --- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> +++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> @@ -246,6 +246,9 @@ TEST_MATRIX=(
> " C2-3:P1 C3:P1 . . O3=0 . . . 0 A1:2|A2: A1:P1|A2:P1"
> " C2-3:P1 C3:P1 . . T:O2=0 . . . 0 A1:3|A2:3 A1:P1|A2:P-1"
> " C2-3:P1 C3:P1 . . . T:O3=0 . . 0 A1:2|A2:2 A1:P1|A2:P-1"
> + " C2-3:P1 C3:P2 . . T:O2=0 . . . 0 A1:3|A2:3 A1:P1|A2:P-2"
> + " C1-3:P1 C3:P2 . . . T:O3=0 . . 0 A1:1-2|A2:1-2 A1:P1|A2:P-2 3|"
> + " C1-3:P1 C3:P2 . . . T:O3=0 O3=1 . 0 A1:1-2|A2:3 A1:P1|A2:P2 3"
> "$SETUP_A123_PARTITIONS . O1=0 . . . 0 A1:|A2:2|A3:3 A1:P1|A2:P1|A3:P1"
> "$SETUP_A123_PARTITIONS . O2=0 . . . 0 A1:1|A2:|A3:3 A1:P1|A2:P1|A3:P1"
> "$SETUP_A123_PARTITIONS . O3=0 . . . 0 A1:1|A2:2|A3: A1:P1|A2:P1|A3:P1"
> @@ -762,7 +765,7 @@ check_cgroup_states()
> # only CPUs in isolated partitions as well as those that are isolated at
> # boot time.
> #
> -# $1 - expected isolated cpu list(s) <isolcpus1>{,<isolcpus2>}
> +# $1 - expected isolated cpu list(s) <isolcpus1>{|<isolcpus2>}
> # <isolcpus1> - expected sched/domains value
> # <isolcpus2> - cpuset.cpus.isolated value = <isolcpus1> if not defined
> #
> @@ -771,6 +774,7 @@ check_isolcpus()
> EXPECTED_ISOLCPUS=$1
> ISCPUS=${CGROUP2}/cpuset.cpus.isolated
> ISOLCPUS=$(cat $ISCPUS)
> + HKICPUS=$(cat /sys/devices/system/cpu/isolated)
> LASTISOLCPU=
> SCHED_DOMAINS=/sys/kernel/debug/sched/domains
> if [[ $EXPECTED_ISOLCPUS = . ]]
> @@ -808,6 +812,11 @@ check_isolcpus()
> ISOLCPUS=
> EXPECTED_ISOLCPUS=$EXPECTED_SDOMAIN
>
> + #
> + # The inverse of HK_TYPE_DOMAIN cpumask in $HKICPUS should match $ISOLCPUS
> + #
> + [[ "$ISOLCPUS" != "$HKICPUS" ]] && return 1
> +
> #
> # Use the sched domain in debugfs to check isolated CPUs, if available
> #
We have a CPU hotplug test that cycles through all CPUs, off-lining and
on-lining them in different combinations. Since this change was added to
-next, this test has been failing on our Tegra210 boards. Bisecting the
failure pointed to this commit, and reverting it on top of -next fixes
the issue.
The test is quite simple and is part of Thierry's tegra-tests suite [0].
$ ./tegra-tests/tests/cpu.py --verbose hotplug
cpu: hotplug: CPU#0: mask: 1
cpu: hotplug: CPU#1: mask: 2
cpu: hotplug: CPU#2: mask: 4
cpu: hotplug: CPU#3: mask: 8
cpu: hotplug: applying mask 0xf
cpu: hotplug: applying mask 0xe
cpu: hotplug: applying mask 0xd
cpu: hotplug: applying mask 0xc
cpu: hotplug: applying mask 0xb
cpu: hotplug: applying mask 0xa
...
cpu: hotplug: applying mask 0x1
Traceback (most recent call last):
File "./tegra-tests/tests/cpu.py", line 159, in <module>
runner.standalone(module)
File "./tegra-tests/tests/runner.py", line 147, in standalone
log.test(log = log, args = args)
File "./tegra-tests/tests/cpu.py", line 29, in __call__
cpus.apply_mask(mask)
File "./tegra-tests/linux/system.py", line 149, in apply_mask
cpu.set_online(False)
File "./tegra-tests/linux/system.py", line 45, in set_online
self.online = online
OSError: [Errno 16] Device or resource busy
Looking at different runs, it appears to fail at different places.
Let me know if you have any thoughts.
Thanks
Jon
[0] https://github.com/thierryreding/tegra-tests/blob/master/tests/cpu.py
--
nvpublic