cgroups.vger.kernel.org archive mirror
* [PATCHv4 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
       [not found] <20251028034357.11055-1-piliu@redhat.com>
@ 2025-10-28  3:43 ` Pingfan Liu
  2025-10-29  2:37   ` Chen Ridong
                     ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Pingfan Liu @ 2025-10-28  3:43 UTC (permalink / raw)
  To: linux-kernel, cgroups
  Cc: Pingfan Liu, Waiman Long, Peter Zijlstra, Juri Lelli,
	Pierre Gondois, Frederic Weisbecker, Ingo Molnar, Tejun Heo,
	Johannes Weiner, Michal Koutný, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider

*** Bug description ***
When testing kexec-reboot on a 144 cpus machine with
isolcpus=managed_irq,domain,1-71,73-143 in kernel command line, I
encounter the following bug:

[   97.114759] psci: CPU142 killed (polled 0 ms)
[   97.333236] Failed to offline CPU143 - error=-16
[   97.333246] ------------[ cut here ]------------
[   97.342682] kernel BUG at kernel/cpu.c:1569!
[   97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[...]

In essence, the issue originates from the CPU hot-removal process and
is not limited to kexec. It can be reproduced by writing a
SCHED_DEADLINE program that waits indefinitely on a semaphore, spawning
multiple instances to ensure some run on CPU 72, and then offlining
CPUs 1–143 one by one; CPU 143 then fails to go offline:
  bash -c 'taskset -cp 0 $$ && for i in {1..143}; do echo 0 > /sys/devices/system/cpu/cpu$i/online 2>/dev/null; done'

*** Issue ***
Tracking down this issue, I found that dl_bw_deactivate() returned
-EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU.
But the bandwidth was not actually over-committed; the false positive
comes from the following factors:
When a CPU is inactive, cpu_rq()->rd is set to def_root_domain. A
blocked-state deadline task (in this case, "cppc_fie") was not migrated
to CPU0, so its task_rq() information is stale and its rq->rd points to
def_root_domain instead of the root domain shared with CPU0. As a
result, its bandwidth is accounted to the wrong root domain during the
domain rebuild.

The key point is that root_domain is only tracked through active rq->rd.
To avoid using a global data structure to track all root_domains in the
system, there should be a method to locate an active CPU within the
corresponding root_domain.

*** Solution ***
To locate the active cpu, the following rules for deadline
sub-system is useful
  -1.any cpu belongs to a unique root domain at a given time
  -2.DL bandwidth checker ensures that the root domain has active cpus.

Now, let's examine the blocked-state task P.
If P is attached to a cpuset that is a partition root, it is
straightforward to find an active CPU.
If P is attached to a cpuset that has changed from 'root' to 'member',
the active CPUs are grouped into the parent root domain. Naturally, the
CPUs' capacity and reserved DL bandwidth are taken into account in the
ancestor root domain. (In practice, it may be unsafe to attach P to an
arbitrary root domain, since that domain may lack sufficient DL
bandwidth for P.) Again, it is straightforward to find an active CPU in
the ancestor root domain.

This patch groups CPUs into isolated and housekeeping sets. For the
housekeeping group, it walks up the cpuset hierarchy to find active CPUs
in P's root domain and retrieves the valid rd from cpu_rq(cpu)->rd.

Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Pierre Gondois <pierre.gondois@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
To: cgroups@vger.kernel.org
To: linux-kernel@vger.kernel.org
---
v3 -> v4:
rename function with cpuset_ prefix
improve commit log

 include/linux/cpuset.h  | 18 ++++++++++++++++++
 kernel/cgroup/cpuset.c  | 26 ++++++++++++++++++++++++++
 kernel/sched/deadline.c | 30 ++++++++++++++++++++++++------
 3 files changed, 68 insertions(+), 6 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 2ddb256187b51..d4da93e51b37b 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -12,6 +12,7 @@
 #include <linux/sched.h>
 #include <linux/sched/topology.h>
 #include <linux/sched/task.h>
+#include <linux/sched/housekeeping.h>
 #include <linux/cpumask.h>
 #include <linux/nodemask.h>
 #include <linux/mm.h>
@@ -130,6 +131,7 @@ extern void rebuild_sched_domains(void);
 
 extern void cpuset_print_current_mems_allowed(void);
 extern void cpuset_reset_sched_domains(void);
+extern void cpuset_get_task_effective_cpus(struct task_struct *p, struct cpumask *cpus);
 
 /*
  * read_mems_allowed_begin is required when making decisions involving
@@ -276,6 +278,22 @@ static inline void cpuset_reset_sched_domains(void)
 	partition_sched_domains(1, NULL, NULL);
 }
 
+static inline void cpuset_get_task_effective_cpus(struct task_struct *p,
+		struct cpumask *cpus)
+{
+	const struct cpumask *hk_msk;
+
+	hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
+	if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
+		if (!cpumask_intersects(p->cpus_ptr, hk_msk)) {
+			/* isolated cpus belong to a root domain */
+			cpumask_andnot(cpus, cpu_active_mask, hk_msk);
+			return;
+		}
+	}
+	cpumask_and(cpus, cpu_active_mask, hk_msk);
+}
+
 static inline void cpuset_print_current_mems_allowed(void)
 {
 }
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 27adb04df675d..6ad88018f1a4e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1102,6 +1102,32 @@ void cpuset_reset_sched_domains(void)
 	mutex_unlock(&cpuset_mutex);
 }
 
+/* caller hold RCU read lock */
+void cpuset_get_task_effective_cpus(struct task_struct *p, struct cpumask *cpus)
+{
+	const struct cpumask *hk_msk;
+	struct cpuset *cs;
+
+	hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
+	if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
+		if (!cpumask_intersects(p->cpus_ptr, hk_msk)) {
+			/* isolated cpus belong to a root domain */
+			cpumask_andnot(cpus, cpu_active_mask, hk_msk);
+			return;
+		}
+	}
+	/* In HK_TYPE_DOMAIN, cpuset can be applied */
+	cs = task_cs(p);
+	while (cs != &top_cpuset) {
+		if (is_sched_load_balance(cs))
+			break;
+		cs = parent_cs(cs);
+	}
+
+	/* For top_cpuset, its effective_cpus does not exclude isolated cpu */
+	cpumask_and(cpus, cs->effective_cpus, hk_msk);
+}
+
 /**
  * cpuset_update_tasks_cpumask - Update the cpumasks of tasks in the cpuset.
  * @cs: the cpuset in which each task's cpus_allowed mask needs to be changed
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 72c1f72463c75..a3a43baf4314e 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2884,6 +2884,8 @@ void dl_add_task_root_domain(struct task_struct *p)
 	struct rq_flags rf;
 	struct rq *rq;
 	struct dl_bw *dl_b;
+	unsigned int cpu;
+	struct cpumask msk;
 
 	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
 	if (!dl_task(p) || dl_entity_is_special(&p->dl)) {
@@ -2891,16 +2893,32 @@ void dl_add_task_root_domain(struct task_struct *p)
 		return;
 	}
 
-	rq = __task_rq_lock(p, &rf);
-
+	/* prevent race among cpu hotplug, changing of partition_root_state */
+	lockdep_assert_cpus_held();
+	/*
+	 * If @p is in blocked state, task_cpu() may be not active. In that
+	 * case, rq->rd does not trace a correct root_domain. On the other hand,
+	 * @p must belong to an root_domain at any given time, which must have
+	 * active rq, whose rq->rd traces the valid root domain.
+	 */
+	cpuset_get_task_effective_cpus(p, &msk);
+	cpu = cpumask_first_and(cpu_active_mask, &msk);
+	/*
+	 * If a root domain reserves bandwidth for a DL task, the DL bandwidth
+	 * check prevents CPU hot removal from deactivating all CPUs in that
+	 * domain.
+	 */
+	BUG_ON(cpu >= nr_cpu_ids);
+	rq = cpu_rq(cpu);
+	/*
+	 * This point is under the protection of cpu_hotplug_lock. Hence
+	 * rq->rd is stable.
+	 */
 	dl_b = &rq->rd->dl_bw;
 	raw_spin_lock(&dl_b->lock);
-
 	__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
-
 	raw_spin_unlock(&dl_b->lock);
-
-	task_rq_unlock(rq, p, &rf);
+	raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
 }
 
 void dl_clear_root_domain(struct root_domain *rd)
-- 
2.49.0



* Re: [PATCHv4 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
  2025-10-28  3:43 ` [PATCHv4 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug Pingfan Liu
@ 2025-10-29  2:37   ` Chen Ridong
  2025-10-29 11:18     ` Pingfan Liu
  2025-10-29 15:31   ` Waiman Long
  2025-11-05  2:23   ` Chen Ridong
  2 siblings, 1 reply; 15+ messages in thread
From: Chen Ridong @ 2025-10-29  2:37 UTC (permalink / raw)
  To: Pingfan Liu, linux-kernel, cgroups
  Cc: Waiman Long, Peter Zijlstra, Juri Lelli, Pierre Gondois,
	Frederic Weisbecker, Ingo Molnar, Tejun Heo, Johannes Weiner,
	Michal Koutný, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider



On 2025/10/28 11:43, Pingfan Liu wrote:
> [...]
> +/* caller hold RCU read lock */
> +void cpuset_get_task_effective_cpus(struct task_struct *p, struct cpumask *cpus)
> +{
> +	const struct cpumask *hk_msk;
> +	struct cpuset *cs;
> +
> +	hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
> +	if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
> +		if (!cpumask_intersects(p->cpus_ptr, hk_msk)) {
> +			/* isolated cpus belong to a root domain */
> +			cpumask_andnot(cpus, cpu_active_mask, hk_msk);
> +			return;
> +		}
> +	}
> +	/* In HK_TYPE_DOMAIN, cpuset can be applied */
> +	cs = task_cs(p);
> +	while (cs != &top_cpuset) {
> +		if (is_sched_load_balance(cs))
> +			break;
> +		cs = parent_cs(cs);
> +	}
> +
> +	/* For top_cpuset, its effective_cpus does not exclude isolated cpu */
> +	cpumask_and(cpus, cs->effective_cpus, hk_msk);
> +}
> +

It seems you may have misunderstood what Longman intended to convey.

First, you should add comments to this function, because its purpose is
not clear. When I first saw it, I thought it was supposed to retrieve
p->cpus_ptr excluding the offline CPU mask. However, I'm genuinely
confused about the function's actual purpose.

Regarding the isolated-partition concept: isolated CPUs (isolcpus) can
be included in cpusets. For example, if the system boots with isolcpus=9
and process p is in an isolated partition that contains only CPU 9
(which is listed in isolcpus), will this function return all CPUs except
CPU 9? Is that the behavior you intended?

-- 
Best regards,
Ridong



* Re: [PATCHv4 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
  2025-10-29  2:37   ` Chen Ridong
@ 2025-10-29 11:18     ` Pingfan Liu
  2025-10-30  6:44       ` Chen Ridong
  0 siblings, 1 reply; 15+ messages in thread
From: Pingfan Liu @ 2025-10-29 11:18 UTC (permalink / raw)
  To: Chen Ridong
  Cc: linux-kernel, cgroups, Waiman Long, Peter Zijlstra, Juri Lelli,
	Pierre Gondois, Frederic Weisbecker, Ingo Molnar, Tejun Heo,
	Johannes Weiner, Michal Koutný, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider

Hi Ridong,

Thank you for your review, please see the comment below.

On Wed, Oct 29, 2025 at 10:37:47AM +0800, Chen Ridong wrote:
> 
> 
> On 2025/10/28 11:43, Pingfan Liu wrote:
> > [...]
> > +/* caller hold RCU read lock */
> > +void cpuset_get_task_effective_cpus(struct task_struct *p, struct cpumask *cpus)
> > +{
> > +	const struct cpumask *hk_msk;
> > +	struct cpuset *cs;
> > +
> > +	hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
> > +	if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
> > +		if (!cpumask_intersects(p->cpus_ptr, hk_msk)) {
> > +			/* isolated cpus belong to a root domain */
> > +			cpumask_andnot(cpus, cpu_active_mask, hk_msk);
> > +			return;
> > +		}
> > +	}
> > +	/* In HK_TYPE_DOMAIN, cpuset can be applied */
> > +	cs = task_cs(p);
> > +	while (cs != &top_cpuset) {
> > +		if (is_sched_load_balance(cs))
> > +			break;
> > +		cs = parent_cs(cs);
> > +	}
> > +
> > +	/* For top_cpuset, its effective_cpus does not exclude isolated cpu */
> > +	cpumask_and(cpus, cs->effective_cpus, hk_msk);
> > +}
> > +
> 
> It seems you may have misunderstood what Longman intended to convey.
> 

Thanks for pointing that out. That is possible and please let me address
your concern.

> First, you should add comments to this function because its purpose is not clear. When I first saw

OK, I will.

> this function, I thought it was supposed to retrieve p->cpus_ptr excluding the offline CPU mask.
> However, I'm genuinely confused about the function's actual purpose.
> 

This function retrieves the active CPUs within the root domain where a specified task resides.

> Regarding the isolated partition concept: isolated CPUs (isolcpus) can be included in cpusets. For
> example, if the system boots with isolcpus=9, and when process p is in a isolated partition that
> only contains CPU 9 (which is listed in isolcpus), will this function return all CPUs except CPU 9?
> Is that the behavior you intended?
> 

First, to clarify the scope of this discussion, it should be limited to
the isolcpus=domain case, excluding other isolcpus options. If a CPU is
in an isolated domain, that domain can only be the def_root_domain,
regardless of whether it is added to a user cpuset or not. In your
example (isolcpus="domain,9"), all the other CPUs form a new root domain,
so the function is expected to return only CPU 9.

Thanks,

Pingfan



* Re: [PATCHv4 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
  2025-10-28  3:43 ` [PATCHv4 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug Pingfan Liu
  2025-10-29  2:37   ` Chen Ridong
@ 2025-10-29 15:31   ` Waiman Long
  2025-10-30 10:41     ` Pingfan Liu
  2025-11-03 13:50     ` Juri Lelli
  2025-11-05  2:23   ` Chen Ridong
  2 siblings, 2 replies; 15+ messages in thread
From: Waiman Long @ 2025-10-29 15:31 UTC (permalink / raw)
  To: Pingfan Liu, linux-kernel, cgroups
  Cc: Peter Zijlstra, Juri Lelli, Pierre Gondois, Frederic Weisbecker,
	Ingo Molnar, Tejun Heo, Johannes Weiner, Michal Koutný,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider

On 10/27/25 11:43 PM, Pingfan Liu wrote:
> [...]
> +/* caller holds the RCU read lock */
> +void cpuset_get_task_effective_cpus(struct task_struct *p, struct cpumask *cpus)
> +{
> +	const struct cpumask *hk_msk;
> +	struct cpuset *cs;
> +
> +	hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
> +	if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
> +		if (!cpumask_intersects(p->cpus_ptr, hk_msk)) {
> +			/* isolated cpus belong to a root domain */
> +			cpumask_andnot(cpus, cpu_active_mask, hk_msk);
> +			return;
> +		}
> +	}
> +	/* In HK_TYPE_DOMAIN, cpuset can be applied */
> +	cs = task_cs(p);
> +	while (cs != &top_cpuset) {
> +		if (is_sched_load_balance(cs))
> +			break;
> +		cs = parent_cs(cs);
> +	}
> +
> +	/* For top_cpuset, its effective_cpus does not exclude isolated cpu */
> +	cpumask_and(cpus, cs->effective_cpus, hk_msk);
> +}
> +

It looks like you are trying to find a set of CPUs that are definitely 
in an active sched domain. The difference between this version and the 
!CONFIG_CPUSETS version in cpuset.h is the walk up the cpuset hierarchy 
to find one with load balancing enabled. I would suggest you extract 
just this part out as a cpuset helper function and put the rest into 
deadline.c as a separate helper function without the cpuset_ prefix. In 
that way, you don't create a new housekeeping.h header file.

>   /**
>    * cpuset_update_tasks_cpumask - Update the cpumasks of tasks in the cpuset.
>    * @cs: the cpuset in which each task's cpus_allowed mask needs to be changed
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 72c1f72463c75..a3a43baf4314e 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2884,6 +2884,8 @@ void dl_add_task_root_domain(struct task_struct *p)
>   	struct rq_flags rf;
>   	struct rq *rq;
>   	struct dl_bw *dl_b;
> +	unsigned int cpu;
> +	struct cpumask msk;
>   
>   	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
>   	if (!dl_task(p) || dl_entity_is_special(&p->dl)) {
> @@ -2891,16 +2893,32 @@ void dl_add_task_root_domain(struct task_struct *p)
>   		return;
>   	}
>   
> -	rq = __task_rq_lock(p, &rf);
> -
> +	/* prevent races between cpu hotplug and partition_root_state changes */
> +	lockdep_assert_cpus_held();
> +	/*
> +	 * If @p is in blocked state, task_cpu() may not be active. In that
> +	 * case, rq->rd does not track a correct root_domain. On the other hand,
> +	 * @p must belong to a root_domain at any given time, which must have
> +	 * an active rq, whose rq->rd tracks the valid root domain.
> +	 */
> +	cpuset_get_task_effective_cpus(p, &msk);
> +	cpu = cpumask_first_and(cpu_active_mask, &msk);
> +	/*
> +	 * If a root domain reserves bandwidth for a DL task, the DL bandwidth
> +	 * check prevents CPU hot removal from deactivating all CPUs in that
> +	 * domain.
> +	 */
> +	BUG_ON(cpu >= nr_cpu_ids);
> +	rq = cpu_rq(cpu);
> +	/*
> +	 * This point is under the protection of cpu_hotplug_lock. Hence
> +	 * rq->rd is stable.
> +	 */

So you are trying to find an active sched domain with some dl bw to use 
for checking. I don't know enough about this dl bw checking code to know 
if it is valid or not. I will let Juri comment on that.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCHv4 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
  2025-10-29 11:18     ` Pingfan Liu
@ 2025-10-30  6:44       ` Chen Ridong
  2025-10-30 10:45         ` Pingfan Liu
  0 siblings, 1 reply; 15+ messages in thread
From: Chen Ridong @ 2025-10-30  6:44 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: linux-kernel, cgroups, Waiman Long, Peter Zijlstra, Juri Lelli,
	Pierre Gondois, Frederic Weisbecker, Ingo Molnar, Tejun Heo,
	Johannes Weiner, Michal Koutný, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider



On 2025/10/29 19:18, Pingfan Liu wrote:
> Hi Ridong,
> 
> Thank you for your review, please see the comment below.
> 
> On Wed, Oct 29, 2025 at 10:37:47AM +0800, Chen Ridong wrote:
>>
>>
>> On 2025/10/28 11:43, Pingfan Liu wrote:
>>> [...]
>>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>>> index 27adb04df675d..6ad88018f1a4e 100644
>>> --- a/kernel/cgroup/cpuset.c
>>> +++ b/kernel/cgroup/cpuset.c
>>> @@ -1102,6 +1102,32 @@ void cpuset_reset_sched_domains(void)
>>>  	mutex_unlock(&cpuset_mutex);
>>>  }
>>>  
>>> +/* caller hold RCU read lock */
>>> +void cpuset_get_task_effective_cpus(struct task_struct *p, struct cpumask *cpus)
>>> +{
>>> +	const struct cpumask *hk_msk;
>>> +	struct cpuset *cs;
>>> +
>>> +	hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
>>> +	if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
>>> +		if (!cpumask_intersects(p->cpus_ptr, hk_msk)) {
>>> +			/* isolated cpus belong to a root domain */
>>> +			cpumask_andnot(cpus, cpu_active_mask, hk_msk);
>>> +			return;
>>> +		}
>>> +	}
>>> +	/* In HK_TYPE_DOMAIN, cpuset can be applied */
>>> +	cs = task_cs(p);
>>> +	while (cs != &top_cpuset) {
>>> +		if (is_sched_load_balance(cs))
>>> +			break;
>>> +		cs = parent_cs(cs);
>>> +	}
>>> +
>>> +	/* For top_cpuset, its effective_cpus does not exclude isolated cpu */
>>> +	cpumask_and(cpus, cs->effective_cpus, hk_msk);
>>> +}
>>> +
>>
>> It seems you may have misunderstood what Longman intended to convey.
>>
> 
> Thanks for pointing that out. That is possible and please let me address
> your concern.
> 
>> First, you should add comments to this function because its purpose is not clear. When I first saw
> 
> OK, I will.
> 
>> this function, I thought it was supposed to retrieve p->cpus_ptr excluding the offline CPU mask.
>> However, I'm genuinely confused about the function's actual purpose.
>>
> 
> This function retrieves the active CPUs within the root domain where a specified task resides.
> 

Thank you for the further clarification.

	+	/*
	+	 * If @p is in blocked state, task_cpu() may not be active. In that
	+	 * case, rq->rd does not track a correct root_domain. On the other hand,
	+	 * @p must belong to a root_domain at any given time, which must have
	+	 * an active rq, whose rq->rd tracks the valid root domain.
	+	 */

Is it necessary to walk up to the root partition (is_sched_load_balance(cs))?

The effective_cpus of the cpuset where @p resides should contain active CPUs.
If all CPUs in cpuset.cpus are offline, it would inherit the parent's effective_cpus for v2, and it
would move the task to the parent for v1.

Could the effective_cpus of @p's current cpuset be sufficient?
What we really need is to find active CPUs that task P can be affine to, correct?

Hope I didn't miss something.

-- 
Best regards,
Ridong



* Re: [PATCHv4 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
  2025-10-29 15:31   ` Waiman Long
@ 2025-10-30 10:41     ` Pingfan Liu
  2025-11-03 13:50     ` Juri Lelli
  1 sibling, 0 replies; 15+ messages in thread
From: Pingfan Liu @ 2025-10-30 10:41 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-kernel, cgroups, Peter Zijlstra, Juri Lelli, Pierre Gondois,
	Frederic Weisbecker, Ingo Molnar, Tejun Heo, Johannes Weiner,
	Michal Koutný, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider

On Wed, Oct 29, 2025 at 11:31:23AM -0400, Waiman Long wrote:
> On 10/27/25 11:43 PM, Pingfan Liu wrote:
> > [...]
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index 27adb04df675d..6ad88018f1a4e 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -1102,6 +1102,32 @@ void cpuset_reset_sched_domains(void)
> >   	mutex_unlock(&cpuset_mutex);
> >   }
> > +/* caller hold RCU read lock */
> > +void cpuset_get_task_effective_cpus(struct task_struct *p, struct cpumask *cpus)
> > +{
> > +	const struct cpumask *hk_msk;
> > +	struct cpuset *cs;
> > +
> > +	hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
> > +	if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
> > +		if (!cpumask_intersects(p->cpus_ptr, hk_msk)) {
> > +			/* isolated cpus belong to a root domain */
> > +			cpumask_andnot(cpus, cpu_active_mask, hk_msk);
> > +			return;
> > +		}
> > +	}
> > +	/* In HK_TYPE_DOMAIN, cpuset can be applied */
> > +	cs = task_cs(p);
> > +	while (cs != &top_cpuset) {
> > +		if (is_sched_load_balance(cs))
> > +			break;
> > +		cs = parent_cs(cs);
> > +	}
> > +
> > +	/* For top_cpuset, its effective_cpus does not exclude isolated cpu */
> > +	cpumask_and(cpus, cs->effective_cpus, hk_msk);
> > +}
> > +
> 
> It looks like you are trying to find a set of CPUs that are definitely in an
> active sched domain. The difference between this version and the
> !CONFIG_CPUSETS version in cpuset.h is the walk up the cpuset hierarchy to
> find one with load balancing enabled. I would suggest you extract just this
> part out as a cpuset helper function and put the rest into deadline.c as a
> separate helper function without the cpuset_ prefix. In that way, you don't
> create a new housekeeping.h header file.
> 

A good suggestion, thanks!

Best Regards,

Pingfan



* Re: [PATCHv4 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
  2025-10-30  6:44       ` Chen Ridong
@ 2025-10-30 10:45         ` Pingfan Liu
  2025-10-31  0:47           ` Chen Ridong
  0 siblings, 1 reply; 15+ messages in thread
From: Pingfan Liu @ 2025-10-30 10:45 UTC (permalink / raw)
  To: Chen Ridong
  Cc: linux-kernel, cgroups, Waiman Long, Peter Zijlstra, Juri Lelli,
	Pierre Gondois, Frederic Weisbecker, Ingo Molnar, Tejun Heo,
	Johannes Weiner, Michal Koutný, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider

On Thu, Oct 30, 2025 at 02:44:43PM +0800, Chen Ridong wrote:
> 
> 
> On 2025/10/29 19:18, Pingfan Liu wrote:
> > Hi Ridong,
> > 
> > Thank you for your review, please see the comment below.
> > 
> > On Wed, Oct 29, 2025 at 10:37:47AM +0800, Chen Ridong wrote:
> >>
> >>
> >> On 2025/10/28 11:43, Pingfan Liu wrote:
> >>> [...]
> >>
> >> It seems you may have misunderstood what Longman intended to convey.
> >>
> > 
> > Thanks for pointing that out. That is possible and please let me address
> > your concern.
> > 
> >> First, you should add comments to this function because its purpose is not clear. When I first saw
> > 
> > OK, I will.
> > 
> >> this function, I thought it was supposed to retrieve p->cpus_ptr excluding the offline CPU mask.
> >> However, I'm genuinely confused about the function's actual purpose.
> >>
> > 
> > This function retrieves the active CPUs within the root domain where a specified task resides.
> > 
> 
> Thank you for the further clarification.
> 
> 	+	/*
> 	+	 * If @p is in blocked state, task_cpu() may not be active. In that
> 	+	 * case, rq->rd does not track a correct root_domain. On the other hand,
> 	+	 * @p must belong to a root_domain at any given time, which must have
> 	+	 * an active rq, whose rq->rd tracks the valid root domain.
> 	+	 */
> 
> Is it necessary to walk up to the root partition (is_sched_load_balance(cs))?
> 
> The effective_cpus of the cpuset where @p resides should contain active CPUs.
> If all CPUs in cpuset.cpus are offline, it would inherit the parent's effective_cpus for v2, and it
> would move the task to the parent for v1.
> 

Suppose that the parent cpuset has no active CPUs either. For a
root_domain, however, deadline bandwidth validation guarantees that
active CPUs remain.

> Could the effective_cpus of @p's current cpuset be sufficient?
> What we really need is to find active CPUs that task P can be affine to, correct?
> 

Yes, that is the purpose.

Best Regards,

Pingfan



* Re: [PATCHv4 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
  2025-10-30 10:45         ` Pingfan Liu
@ 2025-10-31  0:47           ` Chen Ridong
  2025-10-31 14:21             ` Pingfan Liu
  0 siblings, 1 reply; 15+ messages in thread
From: Chen Ridong @ 2025-10-31  0:47 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: linux-kernel, cgroups, Waiman Long, Peter Zijlstra, Juri Lelli,
	Pierre Gondois, Frederic Weisbecker, Ingo Molnar, Tejun Heo,
	Johannes Weiner, Michal Koutný, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider



On 2025/10/30 18:45, Pingfan Liu wrote:
> On Thu, Oct 30, 2025 at 02:44:43PM +0800, Chen Ridong wrote:
>>
>>
>> On 2025/10/29 19:18, Pingfan Liu wrote:
>>> Hi Ridong,
>>>
>>> Thank you for your review, please see the comment below.
>>>
>>> On Wed, Oct 29, 2025 at 10:37:47AM +0800, Chen Ridong wrote:
>>>>
>>>>
>>>> On 2025/10/28 11:43, Pingfan Liu wrote:
>>>>> [...]
>>>>> Signed-off-by: Pingfan Liu <piliu@redhat.com>
>>>>> Cc: Waiman Long <longman@redhat.com>
>>>>> Cc: Tejun Heo <tj@kernel.org>
>>>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>>>> Cc: "Michal Koutný" <mkoutny@suse.com>
>>>>> Cc: Ingo Molnar <mingo@redhat.com>
>>>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>>>> Cc: Juri Lelli <juri.lelli@redhat.com>
>>>>> Cc: Pierre Gondois <pierre.gondois@arm.com>
>>>>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>>>>> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
>>>>> Cc: Steven Rostedt <rostedt@goodmis.org>
>>>>> Cc: Ben Segall <bsegall@google.com>
>>>>> Cc: Mel Gorman <mgorman@suse.de>
>>>>> Cc: Valentin Schneider <vschneid@redhat.com>
>>>>> To: cgroups@vger.kernel.org
>>>>> To: linux-kernel@vger.kernel.org
>>>>> ---
>>>>> v3 -> v4:
>>>>> rename function with cpuset_ prefix
>>>>> improve commit log
>>>>>
>>>>>  include/linux/cpuset.h  | 18 ++++++++++++++++++
>>>>>  kernel/cgroup/cpuset.c  | 26 ++++++++++++++++++++++++++
>>>>>  kernel/sched/deadline.c | 30 ++++++++++++++++++++++++------
>>>>>  3 files changed, 68 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
>>>>> index 2ddb256187b51..d4da93e51b37b 100644
>>>>> --- a/include/linux/cpuset.h
>>>>> +++ b/include/linux/cpuset.h
>>>>> @@ -12,6 +12,7 @@
>>>>>  #include <linux/sched.h>
>>>>>  #include <linux/sched/topology.h>
>>>>>  #include <linux/sched/task.h>
>>>>> +#include <linux/sched/housekeeping.h>
>>>>>  #include <linux/cpumask.h>
>>>>>  #include <linux/nodemask.h>
>>>>>  #include <linux/mm.h>
>>>>> @@ -130,6 +131,7 @@ extern void rebuild_sched_domains(void);
>>>>>  
>>>>>  extern void cpuset_print_current_mems_allowed(void);
>>>>>  extern void cpuset_reset_sched_domains(void);
>>>>> +extern void cpuset_get_task_effective_cpus(struct task_struct *p, struct cpumask *cpus);
>>>>>  
>>>>>  /*
>>>>>   * read_mems_allowed_begin is required when making decisions involving
>>>>> @@ -276,6 +278,22 @@ static inline void cpuset_reset_sched_domains(void)
>>>>>  	partition_sched_domains(1, NULL, NULL);
>>>>>  }
>>>>>  
>>>>> +static inline void cpuset_get_task_effective_cpus(struct task_struct *p,
>>>>> +		struct cpumask *cpus)
>>>>> +{
>>>>> +	const struct cpumask *hk_msk;
>>>>> +
>>>>> +	hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
>>>>> +	if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
>>>>> +		if (!cpumask_intersects(p->cpus_ptr, hk_msk)) {
>>>>> +			/* isolated cpus belong to a root domain */
>>>>> +			cpumask_andnot(cpus, cpu_active_mask, hk_msk);
>>>>> +			return;
>>>>> +		}
>>>>> +	}
>>>>> +	cpumask_and(cpus, cpu_active_mask, hk_msk);
>>>>> +}
>>>>> +
>>>>>  static inline void cpuset_print_current_mems_allowed(void)
>>>>>  {
>>>>>  }
>>>>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>>>>> index 27adb04df675d..6ad88018f1a4e 100644
>>>>> --- a/kernel/cgroup/cpuset.c
>>>>> +++ b/kernel/cgroup/cpuset.c
>>>>> @@ -1102,6 +1102,32 @@ void cpuset_reset_sched_domains(void)
>>>>>  	mutex_unlock(&cpuset_mutex);
>>>>>  }
>>>>>  
>>>>> +/* caller hold RCU read lock */
>>>>> +void cpuset_get_task_effective_cpus(struct task_struct *p, struct cpumask *cpus)
>>>>> +{
>>>>> +	const struct cpumask *hk_msk;
>>>>> +	struct cpuset *cs;
>>>>> +
>>>>> +	hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
>>>>> +	if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
>>>>> +		if (!cpumask_intersects(p->cpus_ptr, hk_msk)) {
>>>>> +			/* isolated cpus belong to a root domain */
>>>>> +			cpumask_andnot(cpus, cpu_active_mask, hk_msk);
>>>>> +			return;
>>>>> +		}
>>>>> +	}
>>>>> +	/* In HK_TYPE_DOMAIN, cpuset can be applied */
>>>>> +	cs = task_cs(p);
>>>>> +	while (cs != &top_cpuset) {
>>>>> +		if (is_sched_load_balance(cs))
>>>>> +			break;
>>>>> +		cs = parent_cs(cs);
>>>>> +	}
>>>>> +
>>>>> +	/* For top_cpuset, its effective_cpus does not exclude isolated cpu */
>>>>> +	cpumask_and(cpus, cs->effective_cpus, hk_msk);
>>>>> +}
>>>>> +
>>>>
>>>> It seems you may have misunderstood what Longman intended to convey.
>>>>
>>>
>>> Thanks for pointing that out. That is possible and please let me address
>>> your concern.
>>>
>>>> First, you should add comments to this function because its purpose is not clear. When I first saw
>>>
>>> OK, I will.
>>>
>>>> this function, I thought it was supposed to retrieve p->cpus_ptr excluding the offline CPU mask.
>>>> However, I'm genuinely confused about the function's actual purpose.
>>>>
>>>
>>> This function retrieves the active CPUs within the root domain where a specified task resides.
>>>
>>
>> Thank you for the further clarification.
>>
>> 	+	/*
>> 	+	 * If @p is in blocked state, task_cpu() may be not active. In that
>> 	+	 * case, rq->rd does not trace a correct root_domain. On the other hand,
>> 	+	 * @p must belong to an root_domain at any given time, which must have
>> 	+	 * active rq, whose rq->rd traces the valid root domain.
>> 	+	 */
>>
>> Is it necessary to walk up to the root partition (is_sched_load_balance(cs))?
>>
>> The effective_cpus of the cpuset where @p resides should contain active CPUs.
>> If all CPUs in cpuset.cpus are offline, it would inherit the parent's effective_cpus for v2, and it
>> would move the task to the parent for v1.
>>
> 
> Suppose that the parent cpuset has no active CPUs too.
> But for a root_domain, deadline bandwidth validation can guard there are
> active CPUs remaining.
> 

I don't think this should happen. When a parent's effective_cpus is empty, it
in turn inherits its own parent's effective_cpus, so in v2 the effective_cpus
should never ultimately remain empty.

For v1, the task would be moved to an ancestor with active CPUs.

-- 
Best regards,
Ridong


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCHv4 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
  2025-10-31  0:47           ` Chen Ridong
@ 2025-10-31 14:21             ` Pingfan Liu
  2025-11-03  3:17               ` Pingfan Liu
  0 siblings, 1 reply; 15+ messages in thread
From: Pingfan Liu @ 2025-10-31 14:21 UTC (permalink / raw)
  To: Chen Ridong
  Cc: linux-kernel, cgroups, Waiman Long, Peter Zijlstra, Juri Lelli,
	Pierre Gondois, Frederic Weisbecker, Ingo Molnar, Tejun Heo,
	Johannes Weiner, Michal Koutný, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider

On Fri, Oct 31, 2025 at 08:47:14AM +0800, Chen Ridong wrote:
> 
> 
> On 2025/10/30 18:45, Pingfan Liu wrote:
> > On Thu, Oct 30, 2025 at 02:44:43PM +0800, Chen Ridong wrote:
> >>
> >>
> >> On 2025/10/29 19:18, Pingfan Liu wrote:
> >>> Hi Ridong,
> >>>
> >>> Thank you for your review, please see the comment below.
> >>>
> >>> On Wed, Oct 29, 2025 at 10:37:47AM +0800, Chen Ridong wrote:
> >>>>
> >>>>
> >>>> On 2025/10/28 11:43, Pingfan Liu wrote:
> >>>>> [...]
> >>>>
> >>>> It seems you may have misunderstood what Longman intended to convey.
> >>>>
> >>>
> >>> Thanks for pointing that out. That is possible and please let me address
> >>> your concern.
> >>>
> >>>> First, you should add comments to this function because its purpose is not clear. When I first saw
> >>>
> >>> OK, I will.
> >>>
> >>>> this function, I thought it was supposed to retrieve p->cpus_ptr excluding the offline CPU mask.
> >>>> However, I'm genuinely confused about the function's actual purpose.
> >>>>
> >>>
> >>> This function retrieves the active CPUs within the root domain where a specified task resides.
> >>>
> >>
> >> Thank you for the further clarification.
> >>
> >> 	+	/*
> >> 	+	 * If @p is in blocked state, task_cpu() may be not active. In that
> >> 	+	 * case, rq->rd does not trace a correct root_domain. On the other hand,
> >> 	+	 * @p must belong to an root_domain at any given time, which must have
> >> 	+	 * active rq, whose rq->rd traces the valid root domain.
> >> 	+	 */
> >>
> >> Is it necessary to walk up to the root partition (is_sched_load_balance(cs))?
> >>
> >> The effective_cpus of the cpuset where @p resides should contain active CPUs.
> >> If all CPUs in cpuset.cpus are offline, it would inherit the parent's effective_cpus for v2, and it
> >> would move the task to the parent for v1.
> >>

I located the code that implements your comment, and I think you are right for
v2. But for v1 there is an async nuance around
remove_tasks_in_empty_cpuset(): it is scheduled via a work_struct, so there is
no guarantee that the task has been moved to an ancestor cpuset before
rebuild_sched_domains_cpuslocked() is called in cpuset_handle_hotplug(). That
means that in dl_update_tasks_root_domain(), the task's cpuset may not have
been updated yet.


> > 
> > Suppose that the parent cpuset has no active CPUs too.
> > But for a root_domain, deadline bandwidth validation can guard there are
> > active CPUs remaining.
> > 
> 
> I don't think this should happen. When a parent's effective_cpus is empty, it should inherit its own
> parent's effective_cpus as well, meaning that in v2, the effective_cpus should not ultimately remain
> empty.
> 

You are right. I found the propagation in cpuset_for_each_descendant_pre().

Thanks for your insight. It makes things much clearer!

Best Regards,

Pingfan



* Re: [PATCHv4 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
  2025-10-31 14:21             ` Pingfan Liu
@ 2025-11-03  3:17               ` Pingfan Liu
  0 siblings, 0 replies; 15+ messages in thread
From: Pingfan Liu @ 2025-11-03  3:17 UTC (permalink / raw)
  To: Chen Ridong
  Cc: linux-kernel, cgroups, Waiman Long, Peter Zijlstra, Juri Lelli,
	Pierre Gondois, Frederic Weisbecker, Ingo Molnar, Tejun Heo,
	Johannes Weiner, Michal Koutný, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider

Hi Ridong,

I have some further findings. Your thoughts would be really
helpful!

On Fri, Oct 31, 2025 at 10:21:10PM +0800, Pingfan Liu wrote:
> On Fri, Oct 31, 2025 at 08:47:14AM +0800, Chen Ridong wrote:
> > 
> > 
> > On 2025/10/30 18:45, Pingfan Liu wrote:
> > > On Thu, Oct 30, 2025 at 02:44:43PM +0800, Chen Ridong wrote:
> > >>
> > >>
> > >> On 2025/10/29 19:18, Pingfan Liu wrote:
> > >>> Hi Ridong,
> > >>>
> > >>> Thank you for your review, please see the comment below.
> > >>>
> > >>> On Wed, Oct 29, 2025 at 10:37:47AM +0800, Chen Ridong wrote:
> > >>>>
> > >>>>
> > >>>> On 2025/10/28 11:43, Pingfan Liu wrote:
> > >>>>> [...]
> > >>>>
> > >>>> It seems you may have misunderstood what Longman intended to convey.
> > >>>>
> > >>>
> > >>> Thanks for pointing that out. That is possible and please let me address
> > >>> your concern.
> > >>>
> > >>>> First, you should add comments to this function because its purpose is not clear. When I first saw
> > >>>
> > >>> OK, I will.
> > >>>
> > >>>> this function, I thought it was supposed to retrieve p->cpus_ptr excluding the offline CPU mask.
> > >>>> However, I'm genuinely confused about the function's actual purpose.
> > >>>>
> > >>>
> > >>> This function retrieves the active CPUs within the root domain where a specified task resides.
> > >>>
> > >>
> > >> Thank you for the further clarification.
> > >>
> > >> 	+	/*
> > >> 	+	 * If @p is in blocked state, task_cpu() may be not active. In that
> > >> 	+	 * case, rq->rd does not trace a correct root_domain. On the other hand,
> > >> 	+	 * @p must belong to an root_domain at any given time, which must have
> > >> 	+	 * active rq, whose rq->rd traces the valid root domain.
> > >> 	+	 */
> > >>
> > >> Is it necessary to walk up to the root partition (is_sched_load_balance(cs))?
> > >>
> > >> The effective_cpus of the cpuset where @p resides should contain active CPUs.
> > >> If all CPUs in cpuset.cpus are offline, it would inherit the parent's effective_cpus for v2, and it
> > >> would move the task to the parent for v1.
> > >>
> 
> I located the code which implemented your comment. And I think for v2,
> you are right. But for v1, there is an async nuance about
> remove_tasks_in_empty_cpuset(). It is scheduled with a work_struct, so
> there is no gurantee that task has been moved to ancestor cpuset before
> rebuild_sched_domains_cpuslocked() is called in cpuset_handle_hotplug(),
> which means that in dl_update_tasks_root_domain(), maybe task's cpuset
> has not been updated yet.
> 

This behavior requires two sets of implementations for the newly introduced
function: one for cpuset v1, because of the async move, and one for v2, which
can fetch the active cpus from p->cpus_ptr directly.

Apart from this drawback, the call sequence matters once the logic enters the
scheduler code on the hot-removal path: the newly introduced function should
be called after the cpuset propagation has completed.


Best Regards,

Pingfan



* Re: [PATCHv4 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
  2025-10-29 15:31   ` Waiman Long
  2025-10-30 10:41     ` Pingfan Liu
@ 2025-11-03 13:50     ` Juri Lelli
  2025-11-04  3:34       ` Pingfan Liu
  1 sibling, 1 reply; 15+ messages in thread
From: Juri Lelli @ 2025-11-03 13:50 UTC (permalink / raw)
  To: Waiman Long
  Cc: Pingfan Liu, linux-kernel, cgroups, Peter Zijlstra,
	Pierre Gondois, Frederic Weisbecker, Ingo Molnar, Tejun Heo,
	Johannes Weiner, Michal Koutný, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider

On 29/10/25 11:31, Waiman Long wrote:
> On 10/27/25 11:43 PM, Pingfan Liu wrote:

...

> > @@ -2891,16 +2893,32 @@ void dl_add_task_root_domain(struct task_struct *p)
> >   		return;
> >   	}
> > -	rq = __task_rq_lock(p, &rf);
> > -
> > +	/* prevent race among cpu hotplug, changing of partition_root_state */
> > +	lockdep_assert_cpus_held();
> > +	/*
> > +	 * If @p is in blocked state, task_cpu() may be not active. In that
> > +	 * case, rq->rd does not trace a correct root_domain. On the other hand,
> > +	 * @p must belong to an root_domain at any given time, which must have
> > +	 * active rq, whose rq->rd traces the valid root domain.
> > +	 */
> > +	cpuset_get_task_effective_cpus(p, &msk);
> > +	cpu = cpumask_first_and(cpu_active_mask, &msk);
> > +	/*
> > +	 * If a root domain reserves bandwidth for a DL task, the DL bandwidth
> > +	 * check prevents CPU hot removal from deactivating all CPUs in that
> > +	 * domain.
> > +	 */
> > +	BUG_ON(cpu >= nr_cpu_ids);
> > +	rq = cpu_rq(cpu);
> > +	/*
> > +	 * This point is under the protection of cpu_hotplug_lock. Hence
> > +	 * rq->rd is stable.
> > +	 */
> 
> So you are trying to find an active sched domain with some dl bw to use
> for checking. I don't know enough about this dl bw checking code to know
> if it is valid or not. I will let Juri comment on that.

So, just to refresh my understanding of this issue: the task was
sleeping/blocked while the CPU it had been running on was turned off.
dl_add_task_root_domain() wrongly adds its bw contribution to
def_root_domain, since that is where offline CPUs are attached while off.
We instead want to attach the sleeping task's contribution to the root
domain that once also comprised the CPU it was running on before blocking.
Correct?

If that is the case, and assuming nobody touched the sleeping task
affinity (p->cpus_ptr), can't we just use another online cpu from
current task affinity to get to the right root domain? Somewhat similar
to what dl_task_offline_migration() is doing in the (!later_rq) case,
I'm thinking.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCHv4 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
  2025-11-03 13:50     ` Juri Lelli
@ 2025-11-04  3:34       ` Pingfan Liu
  2025-11-04  3:42         ` Waiman Long
  0 siblings, 1 reply; 15+ messages in thread
From: Pingfan Liu @ 2025-11-04  3:34 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Waiman Long, linux-kernel, cgroups, Peter Zijlstra,
	Pierre Gondois, Frederic Weisbecker, Ingo Molnar, Tejun Heo,
	Johannes Weiner, Michal Koutný, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider

On Mon, Nov 03, 2025 at 02:50:15PM +0100, Juri Lelli wrote:
> On 29/10/25 11:31, Waiman Long wrote:
> > On 10/27/25 11:43 PM, Pingfan Liu wrote:
> 
> ...
> 
> > > @@ -2891,16 +2893,32 @@ void dl_add_task_root_domain(struct task_struct *p)
> > >   		return;
> > >   	}
> > > -	rq = __task_rq_lock(p, &rf);
> > > -
> > > +	/* prevent races with cpu hotplug and partition_root_state changes */
> > > +	lockdep_assert_cpus_held();
> > > +	/*
> > > +	 * If @p is in blocked state, task_cpu() may not be active. In that
> > > +	 * case, rq->rd does not point to the correct root_domain. On the other
> > > +	 * hand, @p must belong to a root_domain at any given time, which must
> > > +	 * have an active rq, whose rq->rd points to the valid root domain.
> > > +	 */
> > > +	cpuset_get_task_effective_cpus(p, &msk);
> > > +	cpu = cpumask_first_and(cpu_active_mask, &msk);
> > > +	/*
> > > +	 * If a root domain reserves bandwidth for a DL task, the DL bandwidth
> > > +	 * check prevents CPU hot removal from deactivating all CPUs in that
> > > +	 * domain.
> > > +	 */
> > > +	BUG_ON(cpu >= nr_cpu_ids);
> > > +	rq = cpu_rq(cpu);
> > > +	/*
> > > +	 * This point is under the protection of cpu_hotplug_lock. Hence
> > > +	 * rq->rd is stable.
> > > +	 */
> > 
> > So you are trying to find an active sched domain with some dl bw to use
> > for checking. I don't know enough about this dl bw checking code to know
> > if it is valid or not. I will let Juri comment on that.
> 
> So, just to refresh my understanding of this issue, the task was
> sleeping/blocked while the cpu it was running on before blocking has
> been turned off. dl_add_task_root_domain() wrongly adds its bw
> contribution to def_root_domain, as that is where offline cpus are
> attached while off. We instead want to attach the sleeping task's
> contribution to the root domain that once also comprised the cpu it
> was running on before blocking. Correct?
> 

Yes, that's correct.

> If that is the case, and assuming nobody touched the sleeping task
> affinity (p->cpus_ptr), can't we just use another online cpu from

In fact, IIUC, the change will always be propagated through the cpuset
hierarchy into cpus_ptr by cpuset_update_tasks_cpumask() in cpuset v2.
(Ridong, please correct me if my understanding is wrong.)

But for cpuset v1, the update is asynchronous, so cpus_ptr is not
reliable at this point [1].

> current task affinity to get to the right root domain? Somewhat similar
> to what dl_task_offline_migration() is doing in the (!later_rq) case,
> I'm thinking.
> 

Sorry, I don't quite understand what you mean. Do you mean something
like cpumask_any_and(cpu_active_mask, p->cpus_ptr) in
dl_task_offline_migration()?

If so, that will run into the asynchronous-update problem discussed in
[1]: p->cpus_ptr becomes stale and may contain no active CPUs, even
though the root domain itself still has active CPUs.


So my plan is to follow Waiman's suggestion. Any further comments or
suggestions?

[1]: https://lore.kernel.org/all/aQge00u94JKGF9Tb@fedora/


Best Regards,

Pingfan


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCHv4 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
  2025-11-04  3:34       ` Pingfan Liu
@ 2025-11-04  3:42         ` Waiman Long
  0 siblings, 0 replies; 15+ messages in thread
From: Waiman Long @ 2025-11-04  3:42 UTC (permalink / raw)
  To: Pingfan Liu, Juri Lelli
  Cc: Waiman Long, linux-kernel, cgroups, Peter Zijlstra,
	Pierre Gondois, Frederic Weisbecker, Ingo Molnar, Tejun Heo,
	Johannes Weiner, Michal Koutný, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider

On 11/3/25 10:34 PM, Pingfan Liu wrote:
>> If that is the case, and assuming nobody touched the sleeping task
>> affinity (p->cpus_ptr), can't we just use another online cpu from
> In fact, IIUC, the change will be always propagated through the cpuset
> hier into cpus_ptr by cpuset_update_tasks_cpumask() in cpuset v2.
> (Ridong, please correct me if my understanding is wrong)

For kthreads, changes to p->cpus_ptr via cpuset are going to be handled
by a new set of kthread patches [1] being reviewed right now.

[1] https://lore.kernel.org/lkml/20251013203146.10162-1-frederic@kernel.org/

Cheers,
Longman


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCHv4 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
  2025-10-28  3:43 ` [PATCHv4 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug Pingfan Liu
  2025-10-29  2:37   ` Chen Ridong
  2025-10-29 15:31   ` Waiman Long
@ 2025-11-05  2:23   ` Chen Ridong
  2025-11-05  7:11     ` Pingfan Liu
  2 siblings, 1 reply; 15+ messages in thread
From: Chen Ridong @ 2025-11-05  2:23 UTC (permalink / raw)
  To: Pingfan Liu, linux-kernel, cgroups
  Cc: Waiman Long, Peter Zijlstra, Juri Lelli, Pierre Gondois,
	Frederic Weisbecker, Ingo Molnar, Tejun Heo, Johannes Weiner,
	Michal Koutný, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider



On 2025/10/28 11:43, Pingfan Liu wrote:
> *** Bug description ***
> When testing kexec-reboot on a 144 cpus machine with
> isolcpus=managed_irq,domain,1-71,73-143 in kernel command line, I
> encounter the following bug:
> 
> [   97.114759] psci: CPU142 killed (polled 0 ms)
> [   97.333236] Failed to offline CPU143 - error=-16
> [   97.333246] ------------[ cut here ]------------
> [   97.342682] kernel BUG at kernel/cpu.c:1569!
> [   97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
> [...]
> 
> In essence, the issue originates from the CPU hot-removal process, not
> limited to kexec. It can be reproduced by writing a SCHED_DEADLINE
> program that waits indefinitely on a semaphore, spawning multiple
> instances to ensure some run on CPU 72, and then offlining CPUs 1–143
> one by one. When attempting this, CPU 143 failed to go offline.
>   bash -c 'taskset -cp 0 $$ && for i in {1..143}; do echo 0 > /sys/devices/system/cpu/cpu$i/online 2>/dev/null; done'
> 
> *** Issue ***
> Tracking down this issue, I found that dl_bw_deactivate() returned
> -EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU.
> But that was not actually the case; the failure was caused by the
> following factors:
> When a CPU is inactive, cpu_rq()->rd is set to def_root_domain. For a
> blocked-state deadline task (in this case, "cppc_fie"), it was not
> migrated to CPU0, and its task_rq() information is stale. So its rq->rd
> points to def_root_domain instead of the one shared with CPU0. As a
> result, its bandwidth is accounted to the wrong root domain during the
> domain rebuild.
> 
> The key point is that root_domain is only tracked through active rq->rd.
> To avoid using a global data structure to track all root_domains in the
> system, there should be a method to locate an active CPU within the
> corresponding root_domain.
> 
> *** Solution ***
> To locate an active cpu, the following rules of the deadline
> subsystem are useful:
>   -1. any cpu belongs to a unique root domain at a given time
>   -2. the DL bandwidth checker ensures that the root domain has active cpus.
> 
> Now, let's examine the blocked-state task P.
> If P is attached to a cpuset that is a partition root, it is
> straightforward to find an active CPU.
> If P is attached to a cpuset that has changed from 'root' to 'member',
> the active CPUs are grouped into the parent root domain. Naturally, the
> CPUs' capacity and reserved DL bandwidth are taken into account in the
> ancestor root domain. (In practice, it may be unsafe to attach P to an
> arbitrary root domain, since that domain may lack sufficient DL
> bandwidth for P.) Again, it is straightforward to find an active CPU in
> the ancestor root domain.
> 
> This patch groups CPUs into isolated and housekeeping sets. For the
> housekeeping group, it walks up the cpuset hierarchy to find active CPUs
> in P's root domain and retrieves the valid rd from cpu_rq(cpu)->rd.
> 
> Signed-off-by: Pingfan Liu <piliu@redhat.com>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: "Michal Koutný" <mkoutny@suse.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Pierre Gondois <pierre.gondois@arm.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Valentin Schneider <vschneid@redhat.com>
> To: cgroups@vger.kernel.org
> To: linux-kernel@vger.kernel.org
> ---
> v3 -> v4:
> rename function with cpuset_ prefix
> improve commit log
> 
>  include/linux/cpuset.h  | 18 ++++++++++++++++++
>  kernel/cgroup/cpuset.c  | 26 ++++++++++++++++++++++++++
>  kernel/sched/deadline.c | 30 ++++++++++++++++++++++++------
>  3 files changed, 68 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 2ddb256187b51..d4da93e51b37b 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -12,6 +12,7 @@
>  #include <linux/sched.h>
>  #include <linux/sched/topology.h>
>  #include <linux/sched/task.h>
> +#include <linux/sched/housekeeping.h>
>  #include <linux/cpumask.h>
>  #include <linux/nodemask.h>
>  #include <linux/mm.h>
> @@ -130,6 +131,7 @@ extern void rebuild_sched_domains(void);
>  
>  extern void cpuset_print_current_mems_allowed(void);
>  extern void cpuset_reset_sched_domains(void);
> +extern void cpuset_get_task_effective_cpus(struct task_struct *p, struct cpumask *cpus);
>  
>  /*
>   * read_mems_allowed_begin is required when making decisions involving
> @@ -276,6 +278,22 @@ static inline void cpuset_reset_sched_domains(void)
>  	partition_sched_domains(1, NULL, NULL);
>  }
>  
> +static inline void cpuset_get_task_effective_cpus(struct task_struct *p,
> +		struct cpumask *cpus)
> +{
> +	const struct cpumask *hk_msk;
> +
> +	hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
> +	if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
> +		if (!cpumask_intersects(p->cpus_ptr, hk_msk)) {
> +			/* isolated cpus belong to a root domain */
> +			cpumask_andnot(cpus, cpu_active_mask, hk_msk);
> +			return;
> +		}
> +	}
> +	cpumask_and(cpus, cpu_active_mask, hk_msk);
> +}
> +
>  static inline void cpuset_print_current_mems_allowed(void)
>  {
>  }
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 27adb04df675d..6ad88018f1a4e 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1102,6 +1102,32 @@ void cpuset_reset_sched_domains(void)
>  	mutex_unlock(&cpuset_mutex);
>  }
>  
> > +/* caller holds the RCU read lock */
> +void cpuset_get_task_effective_cpus(struct task_struct *p, struct cpumask *cpus)
> +{
> +	const struct cpumask *hk_msk;
> +	struct cpuset *cs;
> +
> +	hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
> +	if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
> +		if (!cpumask_intersects(p->cpus_ptr, hk_msk)) {
> +			/* isolated cpus belong to a root domain */
> +			cpumask_andnot(cpus, cpu_active_mask, hk_msk);
> +			return;
> +		}
> +	}
> +	/* In HK_TYPE_DOMAIN, cpuset can be applied */
> +	cs = task_cs(p);
> +	while (cs != &top_cpuset) {
> +		if (is_sched_load_balance(cs))
> +			break;
> +		cs = parent_cs(cs);
> +	}
> +
> +	/* For top_cpuset, its effective_cpus does not exclude isolated cpu */
> +	cpumask_and(cpus, cs->effective_cpus, hk_msk);
> +}
> +

It seems the cpuset_cpus_allowed() function is enough for you; no new
functions need to be introduced.

-- 
Best regards,
Ridong


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCHv4 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
  2025-11-05  2:23   ` Chen Ridong
@ 2025-11-05  7:11     ` Pingfan Liu
  0 siblings, 0 replies; 15+ messages in thread
From: Pingfan Liu @ 2025-11-05  7:11 UTC (permalink / raw)
  To: Chen Ridong
  Cc: linux-kernel, cgroups, Waiman Long, Peter Zijlstra, Juri Lelli,
	Pierre Gondois, Frederic Weisbecker, Ingo Molnar, Tejun Heo,
	Johannes Weiner, Michal Koutný, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider

On Wed, Nov 5, 2025 at 10:24 AM Chen Ridong <chenridong@huaweicloud.com> wrote:
>
>
>
> On 2025/10/28 11:43, Pingfan Liu wrote:
> > ...
>
> It seems the cpuset_cpus_allowed() function is enough for you; no new
> functions need to be introduced.
>

Ah! Yes, it is good enough.

Best Regards,

Pingfan


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2025-11-05  7:12 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20251028034357.11055-1-piliu@redhat.com>
2025-10-28  3:43 ` [PATCHv4 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug Pingfan Liu
2025-10-29  2:37   ` Chen Ridong
2025-10-29 11:18     ` Pingfan Liu
2025-10-30  6:44       ` Chen Ridong
2025-10-30 10:45         ` Pingfan Liu
2025-10-31  0:47           ` Chen Ridong
2025-10-31 14:21             ` Pingfan Liu
2025-11-03  3:17               ` Pingfan Liu
2025-10-29 15:31   ` Waiman Long
2025-10-30 10:41     ` Pingfan Liu
2025-11-03 13:50     ` Juri Lelli
2025-11-04  3:34       ` Pingfan Liu
2025-11-04  3:42         ` Waiman Long
2025-11-05  2:23   ` Chen Ridong
2025-11-05  7:11     ` Pingfan Liu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).