From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <4e718145-5af6-40c3-83da-a004904e29d1@redhat.com>
Date: Thu, 7 May 2026 11:52:50 -0400
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v2 2/2] cgroup/cpuset: align DL bandwidth reservation with attach target mask
To: Guopeng Zhang , Tejun Heo , Michal Koutný , Ingo Molnar , Peter Zijlstra , Juri Lelli
Cc: Chen Ridong , Johannes Weiner , Vincent Guittot , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , K Prateek Nayak , Gabriele Monaco , Will
 Deacon , linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
References: <20260507103310.35849-1-zhangguopeng@kylinos.cn> <20260507103310.35849-3-zhangguopeng@kylinos.cn>
Content-Language: en-US
From: Waiman Long
In-Reply-To: <20260507103310.35849-3-zhangguopeng@kylinos.cn>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

On 5/7/26 6:33 AM, Guopeng Zhang wrote:
> cpuset_can_attach() preallocates destination SCHED_DEADLINE bandwidth
> before the attach commit point, while set_cpus_allowed_dl() later
> subtracts bandwidth from the source root domain when the task affinity
> is actually updated.
>
> Those two decisions must be made with the same CPU mask.
> cpuset_can_attach() used the destination cpuset effective mask directly,
> but cpuset_attach_task() first builds a per-task target mask which is
> constrained by task_cpu_possible_mask() and, if needed, by walking up
> the cpuset hierarchy. On asymmetric systems, the actual target mask can
> therefore be a strict subset of cs->effective_cpus.

task_cpu_possible_mask() exists for a special class of arm64 systems where only some of the cores are able to run legacy 32-bit applications. We can argue how likely it is that a DL task will be a legacy 32-bit application, which is inherently slower than the same application compiled to native 64-bit code. Perhaps we should just disallow such a legacy 32-bit application from moving to the DL scheduling class in the first place.

I am not in favor of making the cpuset code more complex to support a corner case which may never be used. Could you strip out the task_cpu_possible_mask() part from this patch? We can revisit this with another patch if such a special use case turns out to be worth supporting in the future.
Cheers,
Longman

>
> If the source root domain intersects cs->effective_cpus only on CPUs
> outside the task's possible mask, can_attach() can skip the destination
> reservation even though set_cpus_allowed_dl() later sees a real
> root-domain move and subtracts from the source domain.
>
> Extract the root-domain bandwidth-move test used by
> set_cpus_allowed_dl() into dl_task_needs_bw_move(), and make
> cpuset_can_attach() compute the same per-task target mask that
> cpuset_attach_task() applies.
>
> Keep nr_migrate_dl_tasks counting all migrating deadline tasks for
> cpuset DL task accounting. Restrict sum_migrate_dl_bw to the subset of
> tasks that need destination root-domain bandwidth reservation, because a
> deadline task can move between cpusets without moving bandwidth between
> root domains.
>
> This keeps the existing per-attach aggregate reservation model; it only
> changes the per-task mask used to decide which tasks contribute to that
> aggregate. The broader can_attach()/attach() transaction window is left
> unchanged.
>
> Fixes: 431c69fac05b ("cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()")
> Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails")
> Signed-off-by: Guopeng Zhang
> ---
>  include/linux/sched/deadline.h  |  9 +++
>  kernel/cgroup/cpuset-internal.h |  1 +
>  kernel/cgroup/cpuset.c          | 97 ++++++++++++++++++++++-----------
>  kernel/sched/deadline.c         | 13 ++++-
>  4 files changed, 86 insertions(+), 34 deletions(-)
>
> diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
> index 1198138cb839..ddfd5216f3fc 100644
> --- a/include/linux/sched/deadline.h
> +++ b/include/linux/sched/deadline.h
> @@ -33,6 +33,15 @@ struct root_domain;
>  extern void dl_add_task_root_domain(struct task_struct *p);
>  extern void dl_clear_root_domain(struct root_domain *rd);
>  extern void dl_clear_root_domain_cpu(int cpu);
> +/*
> + * Return whether moving DL task @p to @new_mask requires moving DL
> + * bandwidth accounting between root domains. This helper is specific to
> + * DL bandwidth move accounting semantics and is shared by
> + * cpuset_can_attach() and set_cpus_allowed_dl() so both paths use the
> + * same source root-domain test.
> + */
> +bool dl_task_needs_bw_move(struct task_struct *p,
> +			   const struct cpumask *new_mask);
>
>  extern u64 dl_cookie;
>  extern bool dl_bw_visited(int cpu, u64 cookie);
> diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
> index bb4e692bea30..f7aaf01f7cd5 100644
> --- a/kernel/cgroup/cpuset-internal.h
> +++ b/kernel/cgroup/cpuset-internal.h
> @@ -167,6 +167,7 @@ struct cpuset {
>  	 */
>  	int nr_deadline_tasks;
>  	int nr_migrate_dl_tasks;
> +	/* DL bandwidth that needs destination reservation for this attach.
> +	 */
>  	u64 sum_migrate_dl_bw;
>  	/*
>  	 * CPU used for temporary DL bandwidth allocation during attach;
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index ae41736399a1..78c1a4071cc3 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -485,6 +485,30 @@ static void guarantee_active_cpus(struct task_struct *tsk,
>  	rcu_read_unlock();
>  }
>
> +/* Compute the effective CPU mask cpuset_attach_task() will apply to @tsk. */
> +static void cpuset_attach_task_cpus(struct cpuset *cs, struct task_struct *tsk,
> +				    struct cpumask *pmask)
> +{
> +	const struct cpumask *possible_mask = task_cpu_possible_mask(tsk);
> +
> +	lockdep_assert_cpuset_lock_held();
> +
> +	if (cs == &top_cpuset) {
> +		cpumask_andnot(pmask, possible_mask, subpartitions_cpus);
> +		return;
> +	}
> +
> +	if (WARN_ON(!cpumask_and(pmask, possible_mask, cpu_active_mask)))
> +		cpumask_copy(pmask, cpu_active_mask);
> +
> +	rcu_read_lock();
> +	while (!cpumask_intersects(cs->effective_cpus, pmask))
> +		cs = parent_cs(cs);
> +
> +	cpumask_and(pmask, pmask, cs->effective_cpus);
> +	rcu_read_unlock();
> +}
> +
>  /*
>   * Return in *pmask the portion of a cpusets's mems_allowed that
>   * are online, with memory. If none are online with memory, walk
> @@ -2986,6 +3010,14 @@ static void reset_migrate_dl_data(struct cpuset *cs)
>  	cs->dl_bw_cpu = -1;
>  }
>
> +/*
> + * Protected by cpuset_mutex. cpus_attach is used by the can_attach/attach
> + * paths but we can't allocate it dynamically there. Define it global and
> + * allocate from cpuset_init().
> + */
> +static cpumask_var_t cpus_attach;
> +static nodemask_t cpuset_attach_nodemask_to;
> +
>  /* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
>  static int cpuset_can_attach(struct cgroup_taskset *tset)
>  {
> @@ -2993,7 +3025,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>  	struct cpuset *cs, *oldcs;
>  	struct task_struct *task;
>  	bool setsched_check;
> -	int ret;
> +	int cpu = nr_cpu_ids, ret;
>
>  	/* used later by cpuset_attach() */
>  	cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
> @@ -3038,32 +3070,47 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>  	}
>
>  	if (dl_task(task)) {
> +		/*
> +		 * Count all migrating DL tasks for cpuset task accounting.
> +		 * Only tasks that need a root-domain bandwidth move
> +		 * contribute to sum_migrate_dl_bw.
> +		 */
>  		cs->nr_migrate_dl_tasks++;
> -		cs->sum_migrate_dl_bw += task->dl.dl_bw;
> +		cpuset_attach_task_cpus(cs, task, cpus_attach);
> +
> +		if (dl_task_needs_bw_move(task, cpus_attach)) {
> +			/*
> +			 * Keep the existing aggregate reservation model.
> +			 * Tasks in one attach enter the same destination
> +			 * cpuset, so the first CPU found for a task needing
> +			 * DL bandwidth reservation identifies the destination
> +			 * root domain.
> +			 */
> +			if (cpu >= nr_cpu_ids)
> +				cpu = cpumask_any_and(cpu_active_mask,
> +						      cpus_attach);
> +			cs->sum_migrate_dl_bw += task->dl.dl_bw;
> +		}
>  	}
>  }
>
> -	if (!cs->nr_migrate_dl_tasks)
> +	if (!cs->sum_migrate_dl_bw)
>  		goto out_success;
>
> -	if (!cpumask_intersects(oldcs->effective_cpus, cs->effective_cpus)) {
> -		int cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
> -
> -		if (unlikely(cpu >= nr_cpu_ids)) {
> -			reset_migrate_dl_data(cs);
> -			ret = -EINVAL;
> -			goto out_unlock;
> -		}
> -
> -		ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
> -		if (ret) {
> -			reset_migrate_dl_data(cs);
> -			goto out_unlock;
> -		}
> +	if (unlikely(cpu >= nr_cpu_ids)) {
> +		reset_migrate_dl_data(cs);
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
>
> -		cs->dl_bw_cpu = cpu;
> +	ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
> +	if (ret) {
> +		reset_migrate_dl_data(cs);
> +		goto out_unlock;
>  	}
>
> +	cs->dl_bw_cpu = cpu;
> +
>  out_success:
>  	/*
>  	 * Mark attach is in progress. This makes validate_change() fail
> @@ -3099,23 +3146,11 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
>  	mutex_unlock(&cpuset_mutex);
>  }
>
> -/*
> - * Protected by cpuset_mutex. cpus_attach is used only by cpuset_attach_task()
> - * but we can't allocate it dynamically there. Define it global and
> - * allocate from cpuset_init().
> - */
> -static cpumask_var_t cpus_attach;
> -static nodemask_t cpuset_attach_nodemask_to;
> -
>  static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
>  {
>  	lockdep_assert_cpuset_lock_held();
>
> -	if (cs != &top_cpuset)
> -		guarantee_active_cpus(task, cpus_attach);
> -	else
> -		cpumask_andnot(cpus_attach, task_cpu_possible_mask(task),
> -			       subpartitions_cpus);
> +	cpuset_attach_task_cpus(cs, task, cpus_attach);
>  	/*
>  	 * can_attach beforehand should guarantee that this doesn't
>  	 * fail.  TODO: have a better way to handle failure here
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index edca7849b165..7db4c87df83b 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -3107,20 +3107,18 @@ static void task_woken_dl(struct rq *rq, struct task_struct *p)
>  static void set_cpus_allowed_dl(struct task_struct *p,
>  				struct affinity_context *ctx)
>  {
> -	struct root_domain *src_rd;
>  	struct rq *rq;
>
>  	WARN_ON_ONCE(!dl_task(p));
>
>  	rq = task_rq(p);
> -	src_rd = rq->rd;
>  	/*
>  	 * Migrating a SCHED_DEADLINE task between exclusive
>  	 * cpusets (different root_domains) entails a bandwidth
>  	 * update. We already made space for us in the destination
>  	 * domain (see cpuset_can_attach()).
>  	 */
> -	if (!cpumask_intersects(src_rd->span, ctx->new_mask)) {
> +	if (dl_task_needs_bw_move(p, ctx->new_mask)) {
>  		struct dl_bw *src_dl_b;
>
>  		src_dl_b = dl_bw_of(cpu_of(rq));
> @@ -3137,6 +3135,15 @@ static void set_cpus_allowed_dl(struct task_struct *p,
>  	set_cpus_allowed_common(p, ctx);
>  }
>
> +bool dl_task_needs_bw_move(struct task_struct *p,
> +			   const struct cpumask *new_mask)
> +{
> +	if (!dl_task(p))
> +		return false;
> +
> +	return !cpumask_intersects(task_rq(p)->rd->span, new_mask);
> +}
> +
>  /* Assumes rq->lock is held */
>  static void rq_online_dl(struct rq *rq)
>  {