All of lore.kernel.org
 help / color / mirror / Atom feed
From: Pingfan Liu <piliu@redhat.com>
To: Chen Ridong <chenridong@huaweicloud.com>
Cc: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	"Waiman Long" <longman@redhat.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Juri Lelli" <juri.lelli@redhat.com>,
	"Pierre Gondois" <pierre.gondois@arm.com>,
	"Frederic Weisbecker" <frederic@kernel.org>,
	"Ingo Molnar" <mingo@redhat.com>, "Tejun Heo" <tj@kernel.org>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Vincent Guittot" <vincent.guittot@linaro.org>,
	"Dietmar Eggemann" <dietmar.eggemann@arm.com>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Ben Segall" <bsegall@google.com>, "Mel Gorman" <mgorman@suse.de>,
	"Valentin Schneider" <vschneid@redhat.com>
Subject: Re: [PATCHv4 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
Date: Thu, 30 Oct 2025 18:45:29 +0800	[thread overview]
Message-ID: <aQNBydr4geMWXebC@fedora> (raw)
In-Reply-To: <44130515-725a-4f44-b064-3b396ed26159@huaweicloud.com>

On Thu, Oct 30, 2025 at 02:44:43PM +0800, Chen Ridong wrote:
> 
> 
> On 2025/10/29 19:18, Pingfan Liu wrote:
> > Hi Ridong,
> > 
> > Thank you for your review, please see the comment below.
> > 
> > On Wed, Oct 29, 2025 at 10:37:47AM +0800, Chen Ridong wrote:
> >>
> >>
> >> On 2025/10/28 11:43, Pingfan Liu wrote:
> >>> *** Bug description ***
> >>> When testing kexec-reboot on a 144 cpus machine with
> >>> isolcpus=managed_irq,domain,1-71,73-143 in kernel command line, I
> >>> encounter the following bug:
> >>>
> >>> [   97.114759] psci: CPU142 killed (polled 0 ms)
> >>> [   97.333236] Failed to offline CPU143 - error=-16
> >>> [   97.333246] ------------[ cut here ]------------
> >>> [   97.342682] kernel BUG at kernel/cpu.c:1569!
> >>> [   97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
> >>> [...]
> >>>
> >>> In essence, the issue originates from the CPU hot-removal process, not
> >>> limited to kexec. It can be reproduced by writing a SCHED_DEADLINE
> >>> program that waits indefinitely on a semaphore, spawning multiple
> >>> instances to ensure some run on CPU 72, and then offlining CPUs 1–143
> >>> one by one. When attempting this, CPU 143 failed to go offline.
> >>>   bash -c 'taskset -cp 0 $$ && for i in {1..143}; do echo 0 > /sys/devices/system/cpu/cpu$i/online 2>/dev/null; done'
> >>>
> >>> `
> >>> *** Issue ***
> >>> Tracking down this issue, I found that dl_bw_deactivate() returned
> >>> -EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU.
> >>> But that is not the fact, and contributed by the following factors:
> >>> When a CPU is inactive, cpu_rq()->rd is set to def_root_domain. For an
> >>> blocked-state deadline task (in this case, "cppc_fie"), it was not
> >>> migrated to CPU0, and its task_rq() information is stale. So its rq->rd
> >>> points to def_root_domain instead of the one shared with CPU0.  As a
> >>> result, its bandwidth is wrongly accounted into a wrong root domain
> >>> during domain rebuild.
> >>>
> >>> The key point is that root_domain is only tracked through active rq->rd.
> >>> To avoid using a global data structure to track all root_domains in the
> >>> system, there should be a method to locate an active CPU within the
> >>> corresponding root_domain.
> >>>
> >>> *** Solution ***
> >>> To locate the active cpu, the following rules for deadline
> >>> sub-system is useful
> >>>   -1.any cpu belongs to a unique root domain at a given time
> >>>   -2.DL bandwidth checker ensures that the root domain has active cpus.
> >>>
> >>> Now, let's examine the blocked-state task P.
> >>> If P is attached to a cpuset that is a partition root, it is
> >>> straightforward to find an active CPU.
> >>> If P is attached to a cpuset that has changed from 'root' to 'member',
> >>> the active CPUs are grouped into the parent root domain. Naturally, the
> >>> CPUs' capacity and reserved DL bandwidth are taken into account in the
> >>> ancestor root domain. (In practice, it may be unsafe to attach P to an
> >>> arbitrary root domain, since that domain may lack sufficient DL
> >>> bandwidth for P.) Again, it is straightforward to find an active CPU in
> >>> the ancestor root domain.
> >>>
> >>> This patch groups CPUs into isolated and housekeeping sets. For the
> >>> housekeeping group, it walks up the cpuset hierarchy to find active CPUs
> >>> in P's root domain and retrieves the valid rd from cpu_rq(cpu)->rd.
> >>>
> >>> Signed-off-by: Pingfan Liu <piliu@redhat.com>
> >>> Cc: Waiman Long <longman@redhat.com>
> >>> Cc: Tejun Heo <tj@kernel.org>
> >>> Cc: Johannes Weiner <hannes@cmpxchg.org>
> >>> Cc: "Michal Koutný" <mkoutny@suse.com>
> >>> Cc: Ingo Molnar <mingo@redhat.com>
> >>> Cc: Peter Zijlstra <peterz@infradead.org>
> >>> Cc: Juri Lelli <juri.lelli@redhat.com>
> >>> Cc: Pierre Gondois <pierre.gondois@arm.com>
> >>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> >>> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> >>> Cc: Steven Rostedt <rostedt@goodmis.org>
> >>> Cc: Ben Segall <bsegall@google.com>
> >>> Cc: Mel Gorman <mgorman@suse.de>
> >>> Cc: Valentin Schneider <vschneid@redhat.com>
> >>> To: cgroups@vger.kernel.org
> >>> To: linux-kernel@vger.kernel.org
> >>> ---
> >>> v3 -> v4:
> >>> rename function with cpuset_ prefix
> >>> improve commit log
> >>>
> >>>  include/linux/cpuset.h  | 18 ++++++++++++++++++
> >>>  kernel/cgroup/cpuset.c  | 26 ++++++++++++++++++++++++++
> >>>  kernel/sched/deadline.c | 30 ++++++++++++++++++++++++------
> >>>  3 files changed, 68 insertions(+), 6 deletions(-)
> >>>
> >>> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> >>> index 2ddb256187b51..d4da93e51b37b 100644
> >>> --- a/include/linux/cpuset.h
> >>> +++ b/include/linux/cpuset.h
> >>> @@ -12,6 +12,7 @@
> >>>  #include <linux/sched.h>
> >>>  #include <linux/sched/topology.h>
> >>>  #include <linux/sched/task.h>
> >>> +#include <linux/sched/housekeeping.h>
> >>>  #include <linux/cpumask.h>
> >>>  #include <linux/nodemask.h>
> >>>  #include <linux/mm.h>
> >>> @@ -130,6 +131,7 @@ extern void rebuild_sched_domains(void);
> >>>  
> >>>  extern void cpuset_print_current_mems_allowed(void);
> >>>  extern void cpuset_reset_sched_domains(void);
> >>> +extern void cpuset_get_task_effective_cpus(struct task_struct *p, struct cpumask *cpus);
> >>>  
> >>>  /*
> >>>   * read_mems_allowed_begin is required when making decisions involving
> >>> @@ -276,6 +278,22 @@ static inline void cpuset_reset_sched_domains(void)
> >>>  	partition_sched_domains(1, NULL, NULL);
> >>>  }
> >>>  
> >>> +static inline void cpuset_get_task_effective_cpus(struct task_struct *p,
> >>> +		struct cpumask *cpus)
> >>> +{
> >>> +	const struct cpumask *hk_msk;
> >>> +
> >>> +	hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
> >>> +	if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
> >>> +		if (!cpumask_intersects(p->cpus_ptr, hk_msk)) {
> >>> +			/* isolated cpus belong to a root domain */
> >>> +			cpumask_andnot(cpus, cpu_active_mask, hk_msk);
> >>> +			return;
> >>> +		}
> >>> +	}
> >>> +	cpumask_and(cpus, cpu_active_mask, hk_msk);
> >>> +}
> >>> +
> >>>  static inline void cpuset_print_current_mems_allowed(void)
> >>>  {
> >>>  }
> >>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> >>> index 27adb04df675d..6ad88018f1a4e 100644
> >>> --- a/kernel/cgroup/cpuset.c
> >>> +++ b/kernel/cgroup/cpuset.c
> >>> @@ -1102,6 +1102,32 @@ void cpuset_reset_sched_domains(void)
> >>>  	mutex_unlock(&cpuset_mutex);
> >>>  }
> >>>  
> >>> +/* caller hold RCU read lock */
> >>> +void cpuset_get_task_effective_cpus(struct task_struct *p, struct cpumask *cpus)
> >>> +{
> >>> +	const struct cpumask *hk_msk;
> >>> +	struct cpuset *cs;
> >>> +
> >>> +	hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
> >>> +	if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
> >>> +		if (!cpumask_intersects(p->cpus_ptr, hk_msk)) {
> >>> +			/* isolated cpus belong to a root domain */
> >>> +			cpumask_andnot(cpus, cpu_active_mask, hk_msk);
> >>> +			return;
> >>> +		}
> >>> +	}
> >>> +	/* In HK_TYPE_DOMAIN, cpuset can be applied */
> >>> +	cs = task_cs(p);
> >>> +	while (cs != &top_cpuset) {
> >>> +		if (is_sched_load_balance(cs))
> >>> +			break;
> >>> +		cs = parent_cs(cs);
> >>> +	}
> >>> +
> >>> +	/* For top_cpuset, its effective_cpus does not exclude isolated cpu */
> >>> +	cpumask_and(cpus, cs->effective_cpus, hk_msk);
> >>> +}
> >>> +
> >>
> >> It seems you may have misunderstood what Longman intended to convey.
> >>
> > 
> > Thanks for pointing that out. That is possible and please let me address
> > your concern.
> > 
> >> First, you should add comments to this function because its purpose is not clear. When I first saw
> > 
> > OK, I will.
> > 
> >> this function, I thought it was supposed to retrieve p->cpus_ptr excluding the offline CPU mask.
> >> However, I'm genuinely confused about the function's actual purpose.
> >>
> > 
> > This function retrieves the active CPUs within the root domain where a specified task resides.
> > 
> 
> Thank you for the further clarification.
> 
> 	+	/*
> 	+	 * If @p is in blocked state, task_cpu() may be not active. In that
> 	+	 * case, rq->rd does not trace a correct root_domain. On the other hand,
> 	+	 * @p must belong to an root_domain at any given time, which must have
> 	+	 * active rq, whose rq->rd traces the valid root domain.
> 	+	 */
> 
> Is it necessary to walk up to the root partition (is_sched_load_balance(cs))?
> 
> The effective_cpus of the cpuset where @p resides should contain active CPUs.
> If all CPUs in cpuset.cpus are offline, it would inherit the parent's effective_cpus for v2, and it
> would move the task to the parent for v1.
> 

Suppose that the parent cpuset has no active CPUs too.
But for a root_domain, deadline bandwidth validation can guard there are
active CPUs remaining.

> Could the effective_cpus of @p's current cpuset be sufficient?
> What we really need is to find active CPUs that task P can be affine to, correct?
> 

Yes, that is the purpose.

Best Regards,

Pingfan


  reply	other threads:[~2025-10-30 10:45 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-28  3:43 [PATCHv4 1/2] sched/isolation: Split out the housekeeping part Pingfan Liu
2025-10-28  3:43 ` [PATCHv4 2/2] sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug Pingfan Liu
2025-10-29  2:37   ` Chen Ridong
2025-10-29 11:18     ` Pingfan Liu
2025-10-30  6:44       ` Chen Ridong
2025-10-30 10:45         ` Pingfan Liu [this message]
2025-10-31  0:47           ` Chen Ridong
2025-10-31 14:21             ` Pingfan Liu
2025-11-03  3:17               ` Pingfan Liu
2025-10-29 15:31   ` Waiman Long
2025-10-30 10:41     ` Pingfan Liu
2025-11-03 13:50     ` Juri Lelli
2025-11-04  3:34       ` Pingfan Liu
2025-11-04  3:42         ` Waiman Long
2025-11-05  2:23   ` Chen Ridong
2025-11-05  7:11     ` Pingfan Liu
2025-10-29 14:56 ` [PATCHv4 1/2] sched/isolation: Split out the housekeeping part Waiman Long

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aQNBydr4geMWXebC@fedora \
    --to=piliu@redhat.com \
    --cc=bsegall@google.com \
    --cc=cgroups@vger.kernel.org \
    --cc=chenridong@huaweicloud.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=frederic@kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=longman@redhat.com \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=mkoutny@suse.com \
    --cc=peterz@infradead.org \
    --cc=pierre.gondois@arm.com \
    --cc=rostedt@goodmis.org \
    --cc=tj@kernel.org \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.