public inbox for linux-kernel@vger.kernel.org
* [PATCH] cgroup/cpuset: Call rebuild_sched_domains() directly in hotplug
@ 2026-03-04 18:41 Waiman Long
  2026-03-05  6:45 ` Chen Ridong
  2026-03-05 13:54 ` Frederic Weisbecker
  0 siblings, 2 replies; 5+ messages in thread
From: Waiman Long @ 2026-03-04 18:41 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Frederic Weisbecker
  Cc: cgroups, linux-kernel, Jon Hunter, Waiman Long

Besides deferring the call to housekeeping_update(), commit 6df415aa46ec
("cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug
to workqueue") also defers the rebuild_sched_domains() call to
the workqueue. As a result, a newly offlined CPU may still appear in
a sched domain, or a newly onlined CPU may not show up in the sched
domains, for a short transition period. That could be a problem in
some corner cases and may be the cause of a reported test failure[1].
Fix it by calling
rebuild_sched_domains_cpuslocked() directly in hotplug as before. If
isolated partition invalidation or recreation is being done, the
housekeeping_update() call to update the housekeeping cpumasks will
still be deferred to a workqueue.

In commit 3bfe47967191 ("cgroup/cpuset: Move
housekeeping_update()/rebuild_sched_domains() together"),
housekeeping_update() is called before rebuild_sched_domains() because
it needs to access the HK_TYPE_DOMAIN housekeeping cpumask. That has
since been changed to use the static HK_TYPE_DOMAIN_BOOT cpumask, as
the HK_TYPE_DOMAIN cpumask is now changeable at run time. As a result,
we can move the rebuild_sched_domains() call before
housekeeping_update() with
the slight advantage that it will be done in the same cpus_read_lock
critical section without the possibility of interference by a concurrent
cpu hot add/remove operation.

As it doesn't make sense to acquire cpuset_mutex/cpuset_top_mutex after
calling housekeeping_update() and immediately release them again, move
the cpuset_full_unlock() operation inside update_hk_sched_domains()
and rename it to cpuset_update_sd_hk_unlock() to signify that it will
release the full set of locks.

Fixes: 6df415aa46ec ("cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue")
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 59 ++++++++++++++++++++++--------------------
 1 file changed, 31 insertions(+), 28 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 271bb99b1b9d..f7657b325490 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -881,7 +881,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	/*
 	 * Cgroup v2 doesn't support domain attributes, just set all of them
 	 * to SD_ATTR_INIT. Also non-isolating partition root CPUs are a
-	 * subset of HK_TYPE_DOMAIN housekeeping CPUs.
+	 * subset of HK_TYPE_DOMAIN_BOOT housekeeping CPUs.
 	 */
 	for (i = 0; i < ndoms; i++) {
 		/*
@@ -890,7 +890,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
 		 */
 		if (!csa || csa[i] == &top_cpuset)
 			cpumask_and(doms[i], top_cpuset.effective_cpus,
-				    housekeeping_cpumask(HK_TYPE_DOMAIN));
+				    housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT));
 		else
 			cpumask_copy(doms[i], csa[i]->effective_cpus);
 		if (dattr)
@@ -1331,17 +1331,22 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
 }
 
 /*
- * update_hk_sched_domains - Update HK cpumasks & rebuild sched domains
+ * cpuset_update_sd_hk_unlock - Rebuild sched domains, update HK & unlock
  *
- * Update housekeeping cpumasks and rebuild sched domains if necessary.
- * This should be called at the end of cpuset or hotplug actions.
+ * Update housekeeping cpumasks and rebuild sched domains if necessary and
+ * then do a cpuset_full_unlock().
+ * This should be called at the end of cpuset operation.
  */
-static void update_hk_sched_domains(void)
+static void cpuset_update_sd_hk_unlock(void)
+	__releases(&cpuset_mutex)
+	__releases(&cpuset_top_mutex)
 {
+	/* force_sd_rebuild will be cleared in rebuild_sched_domains_locked() */
+	if (force_sd_rebuild)
+		rebuild_sched_domains_locked();
+
 	if (update_housekeeping) {
-		/* Updating HK cpumasks implies rebuild sched domains */
 		update_housekeeping = false;
-		force_sd_rebuild = true;
 		cpumask_copy(isolated_hk_cpus, isolated_cpus);
 
 		/*
@@ -1352,22 +1357,19 @@ static void update_hk_sched_domains(void)
 		mutex_unlock(&cpuset_mutex);
 		cpus_read_unlock();
 		WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus));
-		cpus_read_lock();
-		mutex_lock(&cpuset_mutex);
+		mutex_unlock(&cpuset_top_mutex);
+	} else {
+		cpuset_full_unlock();
 	}
-	/* force_sd_rebuild will be cleared in rebuild_sched_domains_locked() */
-	if (force_sd_rebuild)
-		rebuild_sched_domains_locked();
 }
 
 /*
- * Work function to invoke update_hk_sched_domains()
+ * Work function to invoke cpuset_update_sd_hk_unlock()
  */
 static void hk_sd_workfn(struct work_struct *work)
 {
 	cpuset_full_lock();
-	update_hk_sched_domains();
-	cpuset_full_unlock();
+	cpuset_update_sd_hk_unlock();
 }
 
 /**
@@ -3232,8 +3234,7 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 
 	free_cpuset(trialcs);
 out_unlock:
-	update_hk_sched_domains();
-	cpuset_full_unlock();
+	cpuset_update_sd_hk_unlock();
 	if (of_cft(of)->private == FILE_MEMLIST)
 		schedule_flush_migrate_mm();
 	return retval ?: nbytes;
@@ -3340,8 +3341,7 @@ static ssize_t cpuset_partition_write(struct kernfs_open_file *of, char *buf,
 	cpuset_full_lock();
 	if (is_cpuset_online(cs))
 		retval = update_prstate(cs, val);
-	update_hk_sched_domains();
-	cpuset_full_unlock();
+	cpuset_update_sd_hk_unlock();
 	return retval ?: nbytes;
 }
 
@@ -3515,8 +3515,7 @@ static void cpuset_css_killed(struct cgroup_subsys_state *css)
 	/* Reset valid partition back to member */
 	if (is_partition_valid(cs))
 		update_prstate(cs, PRS_MEMBER);
-	update_hk_sched_domains();
-	cpuset_full_unlock();
+	cpuset_update_sd_hk_unlock();
 }
 
 static void cpuset_css_free(struct cgroup_subsys_state *css)
@@ -3925,11 +3924,13 @@ static void cpuset_handle_hotplug(void)
 		rcu_read_unlock();
 	}
 
-
 	/*
-	 * Queue a work to call housekeeping_update() & rebuild_sched_domains()
-	 * There will be a slight delay before the HK_TYPE_DOMAIN housekeeping
-	 * cpumask can correctly reflect what is in isolated_cpus.
+	 * rebuild_sched_domains() will always be called directly if needed
+	 * to make sure that newly added or removed CPU will be reflected in
+	 * the sched domains. However, if isolated partition invalidation
+	 * or recreation is being done (update_housekeeping set), a work item
+	 * will be queued to call housekeeping_update() to update the
+	 * corresponding housekeeping cpumasks after some slight delay.
 	 *
 	 * We rely on WORK_STRUCT_PENDING_BIT to not requeue a work item that
 	 * is still pending. Before the pending bit is cleared, the work data
@@ -3938,8 +3939,10 @@ static void cpuset_handle_hotplug(void)
 	 * previously queued work. Since hk_sd_workfn() doesn't use the work
 	 * item at all, this is not a problem.
 	 */
-	if (update_housekeeping || force_sd_rebuild)
-		queue_work(system_unbound_wq, &hk_sd_work);
+	if (force_sd_rebuild)
+		rebuild_sched_domains_cpuslocked();
+	if (update_housekeeping)
+		queue_work(system_dfl_wq, &hk_sd_work);
 
 	free_tmpmasks(ptmp);
 }
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] cgroup/cpuset: Call rebuild_sched_domains() directly in hotplug
  2026-03-04 18:41 [PATCH] cgroup/cpuset: Call rebuild_sched_domains() directly in hotplug Waiman Long
@ 2026-03-05  6:45 ` Chen Ridong
  2026-03-05 19:16   ` Waiman Long
  2026-03-05 13:54 ` Frederic Weisbecker
  1 sibling, 1 reply; 5+ messages in thread
From: Chen Ridong @ 2026-03-05  6:45 UTC (permalink / raw)
  To: Waiman Long, Chen Ridong, Tejun Heo, Johannes Weiner,
	Michal Koutný, Frederic Weisbecker
  Cc: cgroups, linux-kernel, Jon Hunter



On 2026/3/5 2:41, Waiman Long wrote:
> Besides deferring the call to housekeeping_update(), commit 6df415aa46ec
> ("cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug
> to workqueue") also defers the rebuild_sched_domains() call to
> the workqueue. So a new offline CPU may still be in a sched domain
> or new online CPU not showing up in the sched domains for a short
> transition period. That could be a problem in some corner cases and
> can be the cause of a reported test failure[1]. Fix it by calling

Missing Link [1]?

> rebuild_sched_domains_cpuslocked() directly in hotplug as before. If
> isolated partition invalidation or recreation is being done, the
> housekeeping_update() call to update the housekeeping cpumasks will
> still be deferred to a workqueue.
> 
> In commit 3bfe47967191 ("cgroup/cpuset: Move
> housekeeping_update()/rebuild_sched_domains() together"),
> housekeeping_update() is called before rebuild_sched_domains() because
> it needs to access the HK_TYPE_DOMAIN housekeeping cpumask. That is now
> changed to use the static HK_TYPE_DOMAIN_BOOT cpumask as HK_TYPE_DOMAIN
> cpumask is now changeable at run time.  As a result, we can move the
> rebuild_sched_domains() call before housekeeping_update() with
> the slight advantage that it will be done in the same cpus_read_lock
> critical section without the possibility of interference by a concurrent
> cpu hot add/remove operation.
> 

Nice.

> As it doesn't make sense to acquire cpuset_mutex/cpuset_top_mutex after
> calling housekeeping_update() and immediately release them again, move
> the cpuset_full_unlock() operation inside update_hk_sched_domains()
> and rename it to cpuset_update_sd_hk_unlock() to signify that it will
> release the full set of locks.
> 
> Fixes: 6df415aa46ec ("cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue")
> Tested-by: Jon Hunter <jonathanh@nvidia.com>
> Signed-off-by: Waiman Long <longman@redhat.com>
> [ ... unchanged hunks snipped ... ]
> @@ -3925,11 +3924,13 @@ static void cpuset_handle_hotplug(void)
>  		rcu_read_unlock();
>  	}
>  
> -
>  	/*
> -	 * Queue a work to call housekeeping_update() & rebuild_sched_domains()
> -	 * There will be a slight delay before the HK_TYPE_DOMAIN housekeeping
> -	 * cpumask can correctly reflect what is in isolated_cpus.
> +	 * rebuild_sched_domains() will always be called directly if needed
> +	 * to make sure that newly added or removed CPU will be reflected in
> +	 * the sched domains. However, if isolated partition invalidation
> +	 * or recreation is being done (update_housekeeping set), a work item
> +	 * will be queued to call housekeeping_update() to update the
> +	 * corresponding housekeeping cpumasks after some slight delay.
>  	 *
>  	 * We rely on WORK_STRUCT_PENDING_BIT to not requeue a work item that
>  	 * is still pending. Before the pending bit is cleared, the work data
> @@ -3938,8 +3939,10 @@ static void cpuset_handle_hotplug(void)
>  	 * previously queued work. Since hk_sd_workfn() doesn't use the work
>  	 * item at all, this is not a problem.
>  	 */
> -	if (update_housekeeping || force_sd_rebuild)
> -		queue_work(system_unbound_wq, &hk_sd_work);
> +	if (force_sd_rebuild)
> +		rebuild_sched_domains_cpuslocked();
> +	if (update_housekeeping)
> +		queue_work(system_dfl_wq, &hk_sd_work);
>  
>  	free_tmpmasks(ptmp);
>  }

This means that sched domain rebuilds are decoupled from HK updates, right?

I'm wondering again whether we can do the same for changes to
cpus/partition/cpus.exclusive. If we can defer the HK update, then the
cpuset_top_mutex might no longer be necessary.

This patch looks good to me.

Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>

-- 
Best regards,
Ridong



* Re: [PATCH] cgroup/cpuset: Call rebuild_sched_domains() directly in hotplug
  2026-03-04 18:41 [PATCH] cgroup/cpuset: Call rebuild_sched_domains() directly in hotplug Waiman Long
  2026-03-05  6:45 ` Chen Ridong
@ 2026-03-05 13:54 ` Frederic Weisbecker
  2026-03-05 19:27   ` Waiman Long
  1 sibling, 1 reply; 5+ messages in thread
From: Frederic Weisbecker @ 2026-03-05 13:54 UTC (permalink / raw)
  To: Waiman Long
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	cgroups, linux-kernel, Jon Hunter

On Wed, Mar 04, 2026 at 01:41:00PM -0500, Waiman Long wrote:
> Besides deferring the call to housekeeping_update(), commit 6df415aa46ec
> ("cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug
> to workqueue") also defers the rebuild_sched_domains() call to
> the workqueue. So a new offline CPU may still be in a sched domain
> or new online CPU not showing up in the sched domains for a short
> transition period. That could be a problem in some corner cases and
> can be the cause of a reported test failure[1]. Fix it by calling
> rebuild_sched_domains_cpuslocked() directly in hotplug as before. If
> isolated partition invalidation or recreation is being done, the
> housekeeping_update() call to update the housekeeping cpumasks will
> still be deferred to a workqueue.
> 
> In commit 3bfe47967191 ("cgroup/cpuset: Move
> housekeeping_update()/rebuild_sched_domains() together"),
> housekeeping_update() is called before rebuild_sched_domains() because
> it needs to access the HK_TYPE_DOMAIN housekeeping cpumask. That is now
> changed to use the static HK_TYPE_DOMAIN_BOOT cpumask as HK_TYPE_DOMAIN
> cpumask is now changeable at run time.  As a result, we can move the

But rebuild_sched_domains() will still handle the cpuset isolated partitions
somehow right? Sorry for the question, I'm a bit lost in the
partition_sched_domains() maze...

-- 
Frederic Weisbecker
SUSE Labs


* Re: [PATCH] cgroup/cpuset: Call rebuild_sched_domains() directly in hotplug
  2026-03-05  6:45 ` Chen Ridong
@ 2026-03-05 19:16   ` Waiman Long
  0 siblings, 0 replies; 5+ messages in thread
From: Waiman Long @ 2026-03-05 19:16 UTC (permalink / raw)
  To: Chen Ridong, Chen Ridong, Tejun Heo, Johannes Weiner,
	Michal Koutný, Frederic Weisbecker
  Cc: cgroups, linux-kernel, Jon Hunter


On 3/5/26 1:45 AM, Chen Ridong wrote:
>
> On 2026/3/5 2:41, Waiman Long wrote:
>> Besides deferring the call to housekeeping_update(), commit 6df415aa46ec
>> ("cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug
>> to workqueue") also defers the rebuild_sched_domains() call to
>> the workqueue. So a new offline CPU may still be in a sched domain
>> or new online CPU not showing up in the sched domains for a short
>> transition period. That could be a problem in some corner cases and
>> can be the cause of a reported test failure[1]. Fix it by calling
> Missing Link [1]?
I thought I did. Will update it with the link.
>
>> rebuild_sched_domains_cpuslocked() directly in hotplug as before. If
>> isolated partition invalidation or recreation is being done, the
>> housekeeping_update() call to update the housekeeping cpumasks will
>> still be deferred to a workqueue.
>>
>> In commit 3bfe47967191 ("cgroup/cpuset: Move
>> housekeeping_update()/rebuild_sched_domains() together"),
>> housekeeping_update() is called before rebuild_sched_domains() because
>> it needs to access the HK_TYPE_DOMAIN housekeeping cpumask. That is now
>> changed to use the static HK_TYPE_DOMAIN_BOOT cpumask as HK_TYPE_DOMAIN
>> cpumask is now changeable at run time.  As a result, we can move the
>> rebuild_sched_domains() call before housekeeping_update() with
>> the slight advantage that it will be done in the same cpus_read_lock
>> critical section without the possibility of interference by a concurrent
>> cpu hot add/remove operation.
>>
> Nice.
>
>> As it doesn't make sense to acquire cpuset_mutex/cpuset_top_mutex after
>> calling housekeeping_update() and immediately release them again, move
>> the cpuset_full_unlock() operation inside update_hk_sched_domains()
>> and rename it to cpuset_update_sd_hk_unlock() to signify that it will
>> release the full set of locks.
>>
>> Fixes: 6df415aa46ec ("cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue")
>> Tested-by: Jon Hunter <jonathanh@nvidia.com>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> [ ... patch diff snipped ... ]
> This means that sched domain rebuilds are decoupled from HK updates, right?
Yes.
> I'm wondering again whether we can do the same for changes to
> cpus/partition/cpus.exclusive. If we can defer the HK update, then the
> cpuset_top_mutex might no longer be necessary.
>
> This patch looks good to me.
>
> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
>
Thanks,
Longman



* Re: [PATCH] cgroup/cpuset: Call rebuild_sched_domains() directly in hotplug
  2026-03-05 13:54 ` Frederic Weisbecker
@ 2026-03-05 19:27   ` Waiman Long
  0 siblings, 0 replies; 5+ messages in thread
From: Waiman Long @ 2026-03-05 19:27 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	cgroups, linux-kernel, Jon Hunter


On 3/5/26 8:54 AM, Frederic Weisbecker wrote:
> On Wed, Mar 04, 2026 at 01:41:00PM -0500, Waiman Long wrote:
>> Besides deferring the call to housekeeping_update(), commit 6df415aa46ec
>> ("cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug
>> to workqueue") also defers the rebuild_sched_domains() call to
>> the workqueue. So a new offline CPU may still be in a sched domain
>> or new online CPU not showing up in the sched domains for a short
>> transition period. That could be a problem in some corner cases and
>> can be the cause of a reported test failure[1]. Fix it by calling
>> rebuild_sched_domains_cpuslocked() directly in hotplug as before. If
>> isolated partition invalidation or recreation is being done, the
>> housekeeping_update() call to update the housekeeping cpumasks will
>> still be deferred to a workqueue.
>>
>> In commit 3bfe47967191 ("cgroup/cpuset: Move
>> housekeeping_update()/rebuild_sched_domains() together"),
>> housekeeping_update() is called before rebuild_sched_domains() because
>> it needs to access the HK_TYPE_DOMAIN housekeeping cpumask. That is now
>> changed to use the static HK_TYPE_DOMAIN_BOOT cpumask as HK_TYPE_DOMAIN
>> cpumask is now changeable at run time.  As a result, we can move the
> But rebuild_sched_domains() will still handle the cpuset isolated partitions
> somehow right? Sorry for the question, I'm a bit lost in the
> partition_sched_domains() maze...

For v2, generate_sched_domains() has no dependency on housekeeping 
cpumasks other than HK_TYPE_DOMAIN_BOOT, which is used to strip out 
boot-time isolated CPUs from the effective_cpus of top_cpuset. So 
rebuild_sched_domains() will do the right thing even if the HK 
cpumasks haven't been fully updated yet. Please let me know if you can 
think of any corner cases that will still be problematic.

Cheers, Longman




end of thread

Thread overview: 5+ messages
2026-03-04 18:41 [PATCH] cgroup/cpuset: Call rebuild_sched_domains() directly in hotplug Waiman Long
2026-03-05  6:45 ` Chen Ridong
2026-03-05 19:16   ` Waiman Long
2026-03-05 13:54 ` Frederic Weisbecker
2026-03-05 19:27   ` Waiman Long
