From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 56615248867 for ; Tue, 20 May 2025 14:28:23 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1747751304; cv=none; b=ph7UJzMxBGyDlOHC/65jmJ7QLToFetLJcftttKqTh1ybIq85GQMWTFweo6OO/bVhiYEzB6pRdtSgi7Fkh7jX1mIckJBTfrRKAMfqnJMIZZvdG+6vZp7XvbvXvMNqBvIbY3KotcqwmcarO461+1/mGWhrcs4BO7LyGCOVYJczZxs= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1747751304; c=relaxed/simple; bh=8Qsh50lvRvt+cgISPa8hOMXFU79G9nm4ylBRlgAP3lY=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=ub9j8mX7yIvNNZTx/5H0mTEpttpHtnbeUI4pEBZ8PvfUgAatGKYatMFnP5Hs3da1C/zbMsT9Vb7vzgO2U5UYwyFj45xxHcYZdlHjEeqCu8WpzY86UHd9rZkUVEdTQAKFOao2UHr/ygGQKrSypMLr/65ZLwYu0tB3VoazsvswvIE= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=IRIj/pfZ; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="IRIj/pfZ" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 71E3EC4CEEA; Tue, 20 May 2025 14:28:23 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1747751303; bh=8Qsh50lvRvt+cgISPa8hOMXFU79G9nm4ylBRlgAP3lY=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=IRIj/pfZG1xJlPQBtlq3SNYUlnHmHaGjyH3rWZUEwUNWdP8dMCrLuDBUb46ZbaYlA TIXX6LTVjyyFJQGEgwELyMhqIw3iu4nwrx0pYrOs6/H+ywW2b8+BcZjFpCKK5s4KIk FA0pr9nrmCQXUX2QUlbeTMDrgkUm1Yl3sa/YPnvOu/KdZkwBh34AybZ02sA5bsKntv gC/3A57MyXyjp4h09xRiOTQkEUXlEMJmWQH+zjttjRkesaf7Ch+JwiHfE1dkTNr0k8 sTTaZJCrAl+fjGywsIPKGLIaDJep4WFPg+PSJIb0Ef9rsdHPeKtNAfJHJkquMcoH31 xH7GoRF5qz+mA== Date: Tue, 20 May 2025 16:28:20 +0200 From: Frederic Weisbecker To: Gabriele Monaco Cc: linux-kernel@vger.kernel.org, Thomas Gleixner , Waiman Long , Anna-Maria Behnsen Subject: Re: [PATCH v5 5/6] cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping Message-ID: References: <20250508145319.97794-8-gmonaco@redhat.com> <20250508145319.97794-13-gmonaco@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20250508145319.97794-13-gmonaco@redhat.com> (Please keep Anna-Maria Cc'ed) Le Thu, May 08, 2025 at 04:53:25PM +0200, Gabriele Monaco a écrit : > Currently the user can set up isolated cpus via cpuset and nohz_full in > such a way that leaves no housekeeping CPU (i.e. no CPU that is neither > domain isolated nor nohz full). This can be a problem for other > subsystems (e.g. the timer wheel imgration). > > Prevent this configuration by blocking any assignation that would cause > the union of domain isolated cpus and nohz_full to covers all CPUs. > > Signed-off-by: Gabriele Monaco > --- > kernel/cgroup/cpuset.c | 67 ++++++++++++++++++++++++++++++++++++++++-- > 1 file changed, 65 insertions(+), 2 deletions(-) > > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c > index 95316d39c282..2f1df6f5b988 100644 > --- a/kernel/cgroup/cpuset.c > +++ b/kernel/cgroup/cpuset.c > @@ -80,6 +80,12 @@ static cpumask_var_t subpartitions_cpus; > */ > static cpumask_var_t isolated_cpus; > > +/* > + * Housekeeping CPUs for both HK_TYPE_DOMAIN and HK_TYPE_KERNEL_NOISE > + */ > +static cpumask_var_t full_hk_cpus; > +static bool have_boot_nohz_full; Do you really need to maintain those copies? > + > /* > * Housekeeping (HK_TYPE_DOMAIN) CPUs at boot > */ > @@ -1253,10 +1259,26 @@ static void reset_partition_data(struct cpuset *cs) > static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus) > { > WARN_ON_ONCE(old_prs == new_prs); > - if (new_prs == PRS_ISOLATED) > + if (new_prs == PRS_ISOLATED) { > cpumask_or(isolated_cpus, isolated_cpus, xcpus); > - else > + cpumask_andnot(full_hk_cpus, full_hk_cpus, xcpus); > + } else { > cpumask_andnot(isolated_cpus, isolated_cpus, xcpus); > + cpumask_or(full_hk_cpus, full_hk_cpus, xcpus); > + } > +} > + > +/* > + * isolated_cpus_should_update - Returns if the isolated_cpus mask needs update > + * @prs: new or old partition_root_state > + * @parent: parent cpuset > + * Return: true if isolated_cpus needs modification, false otherwise > + */ > +static bool isolated_cpus_should_update(int prs, struct cpuset *parent) > +{ > + if (!parent) > + parent = &top_cpuset; > + return prs != parent->partition_root_state; > } > > /* > @@ -1323,6 +1345,25 @@ static bool partition_xcpus_del(int old_prs, struct cpuset *parent, > return isolcpus_updated; > } > > +/* > + * isolcpus_nohz_conflict - check for isolated & nohz_full conflicts > + * @new_cpus: cpu mask The description lacks explanation about the role of this cpu mask. > + * Return: true if there is conflict, false otherwise > + * > + * If nohz_full is enabled and we have isolated CPUs, their combination must > + * still leave housekeeping CPUs. > + */ > +static bool isolcpus_nohz_conflict(struct cpumask *new_cpus) > +{ > + if (!have_boot_nohz_full) > + return false; > + > + if (!cpumask_weight_andnot(full_hk_cpus, new_cpus)) > + return true; Do we also need to make sure that in this weight there is an online CPU? Can you allocate a temporary mask here and do: cpumask_var_t full_hk_cpus; int ret; if (!zalloc_cpumask_var(&full_hk_cpus, GFP_KERNEL)) return true; cpumask_copy(full_hk_cpus, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)); cpumask_and(full_hk_cpus, housekeeping_cpumask(HK_TYPE_DOMAIN)); cpumask_and(full_hk_cpus, cpu_online_mask)); if (!cpumask_weight_andnot(full_hk_cpus, new_cpus)) ret = true; else ret = false; free_cpumask_var(full_hk_cpus); I also realize something, what makes sure that we don't offline the last non isolated? I just did a small test: # cd /sys/fs/cgroup/ # echo +cpuset > cgroup.subtree_control # cat cpuset.cpus.effective 0-7 # mkdir test # cd test # echo +cpuset > cgroup.subtree_control # echo 0-6 > cpuset.cpus # echo isolated > cpuset.cpus.partition # cat ../cpuset.cpus.effective 7 # echo 0 > /sys/devices/system/cpu/cpu7/online [ 4590.864066] ------------[ cut here ]------------ [ 4590.866469] WARNING: CPU: 7 PID: 50 at kernel/cgroup/cpuset.c:1906 update_parent_effective_cpumask+0x770/0x8c0 [ 4590.870023] Modules linked in: [ 4590.871058] CPU: 7 UID: 0 PID: 50 Comm: cpuhp/7 Not tainted 6.15.0-rc2-g996d9d202383 #10 PREEMPT(voluntary) [ 4590.873588] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-2-gc13ff2cd-prebuilt.qemu.org 04/01/2014 [ 4590.875689] RIP: 0010:update_parent_effective_cpumask+0x770/0x8c0 [ 4590.876858] Code: 06 48 8b 0c 24 ba 05 00 00 00 48 23 85 f8 00 00 00 41 0f 95 c6 48 89 01 41 8b 84 24 34 01 00 00 45 0f b6 f6 e9 90 fe ff ff 90 <0f> 0b 90e [ 4590.880010] RSP: 0018:ffffa4ce001ebd40 EFLAGS: 00010086 [ 4590.880963] RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 0000000000000001 [ 4590.882342] RDX: 000000000000007f RSI: 0000000000000000 RDI: 0000000000000002 [ 4590.883683] RBP: ffffffffbdf52f00 R08: 0000000000000000 R09: 0000000000000000 [ 4590.885071] R10: ffffa223062d2388 R11: 0000000000000000 R12: ffffa223062d2200 [ 4590.886604] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000004 [ 4590.888309] FS: 0000000000000000(0000) GS:ffffa223bc4d6000(0000) knlGS:0000000000000000 [ 4590.890183] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4590.891385] CR2: 000055ab80ada170 CR3: 00000001084ac000 CR4: 00000000000006f0 [ 4590.892901] DR0: ffffffffbc8c8420 DR1: 0000000000000000 DR2: 0000000000000000 [ 4590.894341] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 [ 4590.895765] Call Trace: [ 4590.896400] [ 4590.896938] cpuset_update_active_cpus+0x680/0x730 [ 4590.897979] ? kvm_sched_clock_read+0x11/0x20 [ 4590.898916] ? sched_clock+0x10/0x30 [ 4590.899785] sched_cpu_deactivate+0x148/0x170 [ 4590.900812] ? __pfx_sched_cpu_deactivate+0x10/0x10 [ 4590.901925] cpuhp_invoke_callback+0x10e/0x480 [ 4590.902920] ? __pfx_smpboot_thread_fn+0x10/0x10 [ 4590.903928] cpuhp_thread_fun+0xd7/0x160 [ 4590.904818] smpboot_thread_fn+0xee/0x220 [ 4590.905716] kthread+0xf6/0x1f0 [ 4590.906471] ? __pfx_kthread+0x10/0x10 [ 4590.907297] ret_from_fork+0x2f/0x50 [ 4590.908110] ? __pfx_kthread+0x10/0x10 [ 4590.908917] ret_from_fork_asm+0x1a/0x30 [ 4590.909833] [ 4590.910465] ---[ end trace 0000000000000000 ]--- [ 4590.916786] smpboot: CPU 7 is now offline Apparently you can't trigger the same with isolcpus=0-6, for some reason. One last thing, nohz_full makes sure that we never offline the timekeeper (see tick_nohz_cpu_down()). The timekeeper also never shuts down its tick and therefore never go idle, from tmigr perspective, this way when a nohz_full CPU shuts down its tick, it makes sure that its global timers are handled by the timekeeper in last resort, because it's the last global migrator, always alive. But if the timekeeper is HK_TYPE_DOMAIN, or isolated by cpuset, it will go out of the tmigr hierarchy, breaking the guarantee to have a live global migrator for nohz_full. That one is a bit more tricky to solve. The easiest is to forbid the timekeeper from ever being made unavailable. It is also possible to migrate the timekeeping duty to another common housekeeper. We probably need to do the latter... Thanks. -- Frederic Weisbecker SUSE Labs