From: Waiman Long
To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný, Peter Zijlstra
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Aaron Tomlin, Guopeng Zhang, Waiman Long
Subject: [PATCH-next v6 1/6] cgroup/cpuset: Fix node inconsistencies between cpuset_update_tasks_nodemask() and cpuset_attach()
Date: Thu, 4 Jun 2026 11:02:24 -0400
Message-ID: <20260604150229.414135-2-longman@redhat.com>
In-Reply-To: <20260604150229.414135-1-longman@redhat.com>
References: <20260604150229.414135-1-longman@redhat.com>

Whenever the memory node mask is changed, there are 4 places where the
node mask has to be updated or used:

 1) the task's node mask via cpuset_change_task_nodemask()
 2) memory policy binding via mpol_rebind_mm()
 3) if memory migration is enabled, migration from old_mems_allowed to
    the new node mask via cpuset_migrate_mm()
 4) setting old_mems_allowed

These memory actions are done in cpuset_update_tasks_nodemask() and
cpuset_attach(). However, there are inconsistencies in which node masks
are used in these 2 functions.

In cpuset_update_tasks_nodemask(),
 - cpuset_change_task_nodemask(): guarantee_online_mems()
 - mpol_rebind_mm(): mems_allowed
 - cpuset_migrate_mm(): guarantee_online_mems()
 - old_mems_allowed: guarantee_online_mems()

In cpuset_attach(),
 - cpuset_change_task_nodemask(): guarantee_online_mems()
 - mpol_rebind_mm(): effective_mems
 - cpuset_migrate_mm(): effective_mems
 - old_mems_allowed: effective_mems

These inconsistencies date back a long time and it is hard to say what
the correct values should be.

The guarantee_online_mems() function returns a node mask from the current
or an ancestor cpuset that is a subset of node_states[N_MEMORY]. Nodes
in node_states[N_MEMORY] are all online, i.e. in node_states[N_ONLINE].
However, a node in node_states[N_ONLINE] may not have memory. So
node_states[N_MEMORY] should be a subset of node_states[N_ONLINE].

The guarantee_online_mems() function should only be useful for v1 where
mems_allowed is the same as effective_mems. With v2, the memory nodes
in effective_mems should always be a subset of node_states[N_MEMORY].
The only time that may not be true is when a memory hot-unplug operation
is in progress and a memory node is removed from node_states[N_MEMORY]
but not yet reflected in effective_mems as cpuset_handle_hotplug() has
not yet been called from cpuset_track_online_nodes().
When cpuset_handle_hotplug() is called later, the memory node setting
of the relevant cpusets and tasks will be updated. So replacing the
guarantee_online_mems() call with just cs->effective_mems should be
fine.

Let's use the following setup for both of them to make them consistent.
 - cpuset_change_task_nodemask(): guarantee_online_mems()
 - mpol_rebind_mm(): effective_mems
 - cpuset_migrate_mm(): guarantee_online_mems()
 - old_mems_allowed: guarantee_online_mems()

So for v2, it is effectively all effective_mems. For v1,
mpol_rebind_mm() uses mems_allowed which may differ from what
guarantee_online_mems() returns.

Signed-off-by: Waiman Long
---
 kernel/cgroup/cpuset.c | 37 +++++++++++++++++++++++++------------
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 6bdb68689c24..8305b5830c3c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -489,7 +489,10 @@ static void guarantee_active_cpus(struct task_struct *tsk,
  * Return in *pmask the portion of a cpusets's mems_allowed that
  * are online, with memory. If none are online with memory, walk
  * up the cpuset hierarchy until we find one that does have some
- * online mems. The top cpuset always has some mems online.
+ * online mems. The top cpuset always has some mems online. With v2,
+ * effective_mems should always contain online memory nodes except
+ * during the transition period where a memory node hotunplug operation
+ * is in progress.
  *
  * One way or another, we guarantee to return some non-empty subset
  * of node_states[N_MEMORY].
@@ -498,6 +501,10 @@ static void guarantee_active_cpus(struct task_struct *tsk,
  */
 static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask)
 {
+	if (cpuset_v2()) {
+		*pmask = cs->effective_mems;
+		return;
+	}
 	while (!nodes_and(*pmask, cs->effective_mems, node_states[N_MEMORY]))
 		cs = parent_cs(cs);
 }
@@ -2616,6 +2623,13 @@ static void *cpuset_being_rebound;
  * Iterate through each task of @cs updating its mems_allowed to the
  * effective cpuset's. As this function is called with cpuset_mutex held,
  * cpuset membership stays stable.
+ *
+ * - cpuset_change_task_nodemask(): guarantee_online_mems()
+ * - mpol_rebind_mm(): effective_mems
+ * - cpuset_migrate_mm(): guarantee_online_mems()
+ * - old_mems_allowed: guarantee_online_mems()
+ *
+ * For v2, guarantee_online_mems() should just return effective_mems.
  */
 void cpuset_update_tasks_nodemask(struct cpuset *cs)
 {
@@ -2624,7 +2638,6 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
 	struct task_struct *task;
 
 	cpuset_being_rebound = cs;		/* causes mpol_dup() rebind */
-
 	guarantee_online_mems(cs, &newmems);
 
 	/*
@@ -2650,7 +2663,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
 
 		migrate = is_memory_migrate(cs);
 
-		mpol_rebind_mm(mm, &cs->mems_allowed);
+		mpol_rebind_mm(mm, &cs->effective_mems);
 		if (migrate)
 			cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
 		else
@@ -3148,17 +3161,18 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 
 	/*
 	 * In the default hierarchy, enabling cpuset in the child cgroups
-	 * will trigger a number of cpuset_attach() calls with no change
-	 * in effective cpus and mems. In that case, we can optimize out
-	 * by skipping the task iteration and update.
+	 * will trigger a cpuset_attach() call with no change in effective cpus
+	 * and mems. In that case, we can optimize out by skipping the task
+	 * iteration and update.
 	 */
-	if (cpuset_v2() && !cpus_updated && !mems_updated) {
+	if (cpuset_v2()) {
 		cpuset_attach_nodemask_to = cs->effective_mems;
-		goto out;
+		if (!cpus_updated && !mems_updated)
+			goto out;
+	} else {
+		guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
 	}
 
-	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
-
 	cgroup_taskset_for_each(task, css, tset)
 		cpuset_attach_task(cs, task);
 
@@ -3168,7 +3182,6 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	 * if there is no change in effective_mems and CS_MEMORY_MIGRATE is
 	 * not set.
 	 */
-	cpuset_attach_nodemask_to = cs->effective_mems;
 	if (!is_memory_migrate(cs) && !mems_updated)
 		goto out;
 
@@ -3176,7 +3189,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 		struct mm_struct *mm = get_task_mm(leader);
 
 		if (mm) {
-			mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
+			mpol_rebind_mm(mm, &cs->effective_mems);
 
 			/*
 			 * old_mems_allowed is the same with mems_allowed
-- 
2.54.0
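
As an illustration only (not part of the patch), the node mask selection
described in the changelog can be modelled in user space. This is a minimal
sketch under stated assumptions: node masks are plain 64-bit bitmaps rather
than nodemask_t, and nodemask_model_t, struct cpuset_model,
guarantee_online_mems_model() and its n_memory and is_v2 parameters are
hypothetical stand-ins for the kernel's types, node_states[N_MEMORY] and
cpuset_v2().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for nodemask_t: one bit per memory node. */
typedef uint64_t nodemask_model_t;

/* Hypothetical, heavily reduced stand-in for struct cpuset. */
struct cpuset_model {
	nodemask_model_t effective_mems;
	struct cpuset_model *parent;	/* NULL for the top cpuset */
};

/*
 * Model of guarantee_online_mems() as changed by the patch.
 * @n_memory stands in for node_states[N_MEMORY] and
 * @is_v2 for the result of cpuset_v2().
 */
static nodemask_model_t
guarantee_online_mems_model(const struct cpuset_model *cs,
			    nodemask_model_t n_memory, bool is_v2)
{
	/* v2: use effective_mems as is, even during a hot-unplug window. */
	if (is_v2)
		return cs->effective_mems;

	/* v1: walk up the hierarchy until a node with memory is found. */
	while ((cs->effective_mems & n_memory) == 0) {
		/* Assumes the top cpuset always has some memory online. */
		assert(cs->parent != NULL);
		cs = cs->parent;
	}
	return cs->effective_mems & n_memory;
}
```

With a child cpuset whose effective_mems is 0x4 under a parent with 0x3, and
node 2 already dropped from the modelled N_MEMORY mask (0x3), the v1 path
falls back to the parent's 0x3 while the v2 path still returns 0x4. That is
the transition window the changelog mentions, which in the kernel lasts until
cpuset_handle_hotplug() updates the cpuset.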