From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id ADE7C375AA8 for ; Tue, 21 Apr 2026 03:07:30 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776740852; cv=none; b=Ra5lc8IpcBM6RAvERu0mYhW0yMLAgkq64iTbdDwh6wZP7ZHAjt2JNLhIuX+yXuuODwKdCi3hnv3z98y6o9Ckm188ZU3+illTL4qiT7jlAtQo7xt2ocoAAEi8mTO3Hsj4Kz2ho3qHRwx4nl81SqftyqeMKtrsT4RZ5GFUoYYveXY= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776740852; c=relaxed/simple; bh=y4yP5tRZ7eQlB15SkfEhgLy4KCCMqsQOclN9c++lXS8=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=rf6nUR3JZgFuYSyTMnXz5IV8qlN/Qf2jIYnauT74C1NB6yHYOGadprG/0/ElEbyWrsblkW+tqKnNBeTr0GMVlzw42Mn4Qdd0a+xc2Y1Vtfm7X+FZTRyx/c69ifVgjpyLasYXLTrGKUuVYd2gwqvMOEwqj2E9ZQMc08a9yLfs8UA= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=V0gGjoCe; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="V0gGjoCe" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1776740849; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=rj/Vtbhr3JXvU2mTPHbZebZmch25NsVuRvhs8pJ62G8=; b=V0gGjoCeAxf1f7lR3RR0Hjtp/8kWeVjKE/s1ViBZx5D+WaT8exGD1jDzxrEVsTYcCyWgmx 4lktmZy44lrcXQ/TQrWJw0SsUUH0v9RTms+ol1Pn9/xtECt/9S/36QiX8apDuema3BNYcJ KYLN1V+qqkGaIIlw/2iESepJ90pAu+Y= Received: from mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-436-SDq97JYuPx29ycCUXIoJEQ-1; Mon, 20 Apr 2026 23:07:26 -0400 X-MC-Unique: SDq97JYuPx29ycCUXIoJEQ-1 X-Mimecast-MFC-AGG-ID: SDq97JYuPx29ycCUXIoJEQ_1776740841 Received: from mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.12]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 5189019560B9; Tue, 21 Apr 2026 03:07:21 +0000 (UTC) Received: from llong-thinkpadp16vgen1.westford.csb (unknown [10.22.65.81]) by mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 6E7F419560AB; Tue, 21 Apr 2026 03:07:14 +0000 (UTC) From: Waiman Long To: Tejun Heo , Johannes Weiner , =?UTF-8?q?Michal=20Koutn=C3=BD?= , Jonathan Corbet , Shuah Khan , Catalin Marinas , Will Deacon , "K. Y. Srinivasan" , Haiyang Zhang , Wei Liu , Dexuan Cui , Long Li , Guenter Roeck , Frederic Weisbecker , "Paul E. McKenney" , Neeraj Upadhyay , Joel Fernandes , Josh Triplett , Boqun Feng , Uladzislau Rezki , Steven Rostedt , Mathieu Desnoyers , Lai Jiangshan , Zqiang , Anna-Maria Behnsen , Ingo Molnar , Thomas Gleixner , Chen Ridong , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Ben Segall , Mel Gorman , Valentin Schneider , K Prateek Nayak , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman Cc: cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-hyperv@vger.kernel.org, linux-hwmon@vger.kernel.org, rcu@vger.kernel.org, netdev@vger.kernel.org, linux-kselftest@vger.kernel.org, Costa Shulyupin , Qiliang Yuan , Waiman Long Subject: [PATCH 20/23] cgroup/cpuset: Enable runtime update of HK_TYPE_{KERNEL_NOISE,MANAGED_IRQ} cpumasks Date: Mon, 20 Apr 2026 23:03:48 -0400 Message-ID: <20260421030351.281436-21-longman@redhat.com> In-Reply-To: <20260421030351.281436-1-longman@redhat.com> References: <20260421030351.281436-1-longman@redhat.com> Precedence: bulk X-Mailing-List: linux-hwmon@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12 One simple way to enable runtime update of HK_TYPE_KERNEL_NOISE (nohz_full) and HK_TYPE_MANAGED_IRQ cpumasks is to make use of the CPU hotplug to facilitate the transition of those CPUs that are changing states as long as CONFIG_HOTPLUG_CPU is enabled and a nohz_full boot parameter is provided. Otherwise, only HK_TYPE_DOMAIN cpumask will be updated at run time. For changes in HK_TYPE_DOMAIN cpumask, it can be done without using CPU hotplug. For changes in HK_TYPE_MANAGED_IRQ cpumask, we have to update the cpumask first and then tear down and bring up the newly isolated CPUs to migrate the managed irqs in those CPUs to other housekeeping CPUs. For changes in HK_TYPE_KERNEL_NOISE, we have to tear down all the newly isolated and de-isolated CPUs, change the cpumask and then bring all the offline CPUs back online. As it is possible that the various boot versions of the housekeeping cpumasks are different resulting in the use of different set of isolated cpumasks for calling housekeeping_update(), we may need to pre-allocate these cpumasks if necessary. Note that the use of CPU hotplug to facilitate the changing of HK_TYPE_KERNEL_NOISE and HK_TYPE_MANAGED_IRQ housekeeping cpumasks has the drawback that during the tear down of a CPU from CPUHP_TEARDOWN_CPU state to CPUHP_AP_OFFLINE, the stop_machine code will be invoked to stop all the other CPUs including all the pre-existing isolated CPUs. That will cause latency spikes on those isolated CPUs. That latency spike should only happen when the cpuset isolated partition setting is changed resulting in changes in those housekeeping cpumasks. One possible workaround that is being used right now is to pre-allocate a set of nohz_full and managed_irq CPUs at boot time. The semi-isolated CPUs are then used to create cpuset isolated partitions when needed to enable full isolation. This will likely continue even if we made the nohz_full and managed_irq CPUs runtime changeable if they can't tolerate these latency spikes. This is a problem we need to address in a future patch series. Signed-off-by: Waiman Long --- kernel/cgroup/cpuset.c | 199 +++++++++++++++++++++++++++++++++++++---- 1 file changed, 182 insertions(+), 17 deletions(-) diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 1b0c50b46a49..a927b9cd4f71 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -152,9 +152,17 @@ static cpumask_var_t isolated_cpus; /* CSCB */ static bool update_housekeeping; /* RWCS */ /* - * Copy of isolated_cpus to be passed to housekeeping_update() + * Cpumasks to be passed to housekeeping_update() + * isolated_hk_cpus - copy of isolated_cpus for HK_TYPE_DOMAIN + * isolated_nohz_cpus - for HK_TYPE_KERNEL_NOISE + * isolated_mirq_cpus - for HK_TYPE_MANAGED_IRQ */ static cpumask_var_t isolated_hk_cpus; /* T */ +static cpumask_var_t isolated_nohz_cpus; /* T */ +static cpumask_var_t isolated_mirq_cpus; /* T */ + +static bool boot_nohz_le_domain __ro_after_init; +static bool boot_mirq_le_domain __ro_after_init; /* * A flag to force sched domain rebuild at the end of an operation. @@ -1328,29 +1336,67 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus) return false; } +static int cpuset_nohz_update_cbfunc(void *arg) +{ + struct cpumask *isol_cpus = (struct cpumask *)arg; + + if (isol_cpus) + housekeeping_update(isol_cpus, BIT(HK_TYPE_KERNEL_NOISE)); + return 0; +} + /* - * cpuset_update_sd_hk_unlock - Rebuild sched domains, update HK & unlock - * - * Update housekeeping cpumasks and rebuild sched domains if necessary and - * then do a cpuset_full_unlock(). - * This should be called at the end of cpuset operation. */ -static void cpuset_update_sd_hk_unlock(void) - __releases(&cpuset_mutex) - __releases(&cpuset_top_mutex) +static void cpuset_update_housekeeping_unlock(void) { - update_housekeeping = false; + bool update_nohz, update_mirq; + cpumask_var_t cpus; + int ret; - /* force_sd_rebuild will be cleared in rebuild_sched_domains_locked() */ - if (force_sd_rebuild) - rebuild_sched_domains_locked(); + if (!tick_nohz_full_enabled()) + return; - if (cpumask_equal(isolated_hk_cpus, isolated_cpus)) { - /* No housekeeping cpumask update needed */ + update_nohz = boot_nohz_le_domain; + update_mirq = boot_mirq_le_domain; + + if (WARN_ON_ONCE(!alloc_cpumask_var(&cpus, GFP_KERNEL))) { cpuset_full_unlock(); return; } + /* + * Update isolated_nohz_cpus/isolated_mirq_cpus if necessary + */ + if (!boot_nohz_le_domain) { + cpumask_andnot(cpus, cpu_possible_mask, + housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)); + cpumask_or(cpus, cpus, isolated_cpus); + update_nohz = !cpumask_equal(isolated_nohz_cpus, cpus); + if (update_nohz) + cpumask_copy(isolated_nohz_cpus, cpus); + } + if (!boot_mirq_le_domain) { + cpumask_andnot(cpus, cpu_possible_mask, + housekeeping_cpumask(HK_TYPE_MANAGED_IRQ)); + cpumask_or(cpus, cpus, isolated_cpus); + update_mirq = !cpumask_equal(isolated_mirq_cpus, cpus); + if (update_mirq) + cpumask_copy(isolated_mirq_cpus, cpus); + } + + /* + * Compute list of CPUs to be brought offline into "cpus" + * isolated_hk_cpus - old cpumask + * isolated_cpus - new cpumask + * + * With update_nohz, we need to offline both the newly isolated + * and de-isolated CPUs. With only update_mirq, we only need to + * offline the new isolated CPUs. + */ + if (update_nohz) + cpumask_xor(cpus, isolated_hk_cpus, isolated_cpus); + else if (update_mirq) + cpumask_andnot(cpus, isolated_cpus, isolated_hk_cpus); cpumask_copy(isolated_hk_cpus, isolated_cpus); /* @@ -1360,10 +1406,103 @@ static void cpuset_update_sd_hk_unlock(void) */ mutex_unlock(&cpuset_mutex); cpus_read_unlock(); - WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus, BIT(HK_TYPE_DOMAIN))); + + if (!update_mirq) { + ret = housekeeping_update(isolated_hk_cpus, BIT(HK_TYPE_DOMAIN)); + } else if (boot_mirq_le_domain) { + ret = housekeeping_update(isolated_hk_cpus, + BIT(HK_TYPE_DOMAIN)|BIT(HK_TYPE_MANAGED_IRQ)); + } else { + ret = housekeeping_update(isolated_hk_cpus, BIT(HK_TYPE_DOMAIN)); + if (!ret) + ret = housekeeping_update(isolated_mirq_cpus, + BIT(HK_TYPE_MANAGED_IRQ)); + } + + if (WARN_ON_ONCE(ret)) + goto out_free; + + /* + * Calling cpuhp_offline_cb() is only needed if either + * HK_TYPE_KERNEL_NOISE and/or HK_TYPE_MANAGED_IRQ cpumasks + * needed to be updated. + * + * TODO: When tearing down a CPU from CPUHP_TEARDOWN_CPU state + * downward to CPUHP_AP_OFFLINE, the stop_machine code will be + * invoked to stop all the other CPUs in the system. This will + * cause latency spikes on existing isolated CPUs. We will need + * some new mechanism to enable us to not stop the existing + * isolated CPUs whenever possible. A possible workaround is + * to pre-allocate a set of nohz_full and manged_irq CPUs at + * boot time and use them to create isolated cpuset partitions + * so that CPU hotplug won't need to be used. + */ + if (update_mirq || update_nohz) { + struct cpumask *nohz_cpus; + + /* + * Calling housekeeping_update() is only needed if + * update_nohz is set. + */ + nohz_cpus = !update_nohz ? NULL : boot_nohz_le_domain + ? isolated_hk_cpus + : isolated_nohz_cpus; + /* + * Mask out offline CPUs in cpus + * If there is no online CPUs, we can call + * housekeeping_update() directly if needed. + */ + cpumask_and(cpus, cpus, cpu_online_mask); + if (!cpumask_empty(cpus)) + ret = cpuhp_offline_cb(cpus, cpuset_nohz_update_cbfunc, + nohz_cpus); + else if (nohz_cpus) + ret = housekeeping_update(nohz_cpus, BIT(HK_TYPE_KERNEL_NOISE)); + } + WARN_ON_ONCE(ret); +out_free: + free_cpumask_var(cpus); mutex_unlock(&cpuset_top_mutex); } +/* + * cpuset_update_sd_hk_unlock - Rebuild sched domains, update HK & unlock + * + * Update housekeeping cpumasks and rebuild sched domains if necessary and + * then do a cpuset_full_unlock(). + * This should be called at the end of cpuset operation. + */ +static void cpuset_update_sd_hk_unlock(void) + __releases(&cpuset_mutex) + __releases(&cpuset_top_mutex) +{ + /* force_sd_rebuild will be cleared in rebuild_sched_domains_locked() */ + if (force_sd_rebuild) + rebuild_sched_domains_locked(); + + update_housekeeping = false; + + if (cpumask_equal(isolated_cpus, isolated_hk_cpus)) { + cpuset_full_unlock(); + return; + } + + if (!tick_nohz_full_enabled()) { + /* + * housekeeping_update() is now called without holding + * cpus_read_lock and cpuset_mutex. Only cpuset_top_mutex + * is still being held for mutual exclusion. + */ + mutex_unlock(&cpuset_mutex); + cpus_read_unlock(); + WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus, + BIT(HK_TYPE_DOMAIN))); + mutex_unlock(&cpuset_top_mutex); + } else { + cpuset_update_housekeeping_unlock(); + } +} + /* * Work function to invoke cpuset_update_sd_hk_unlock() */ @@ -3700,6 +3839,29 @@ int __init cpuset_init(void) housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT)); cpumask_copy(isolated_hk_cpus, isolated_cpus); } + + /* + * If nohz_full and/or managed_irq cpu list, if present, is a subset + * of the domain cpu list, i.e. HK_TYPE_DOMAIN_BOOT cpumask is a subset + * of HK_TYPE_KERNEL_NOISE_BOOT/HK_TYPE_MANAGED_IRQ_BOOT cpumask, the + * corresponding non-boot housekeeping cpumasks will follow changes + * in the HK_TYPE_DOMAIN cpumask. So we don't need to use separate + * cpumasks to isolate them. + */ + boot_nohz_le_domain = cpumask_subset(housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT), + housekeeping_cpumask(HK_TYPE_KERNEL_NOISE_BOOT)); + boot_mirq_le_domain = cpumask_subset(housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT), + housekeeping_cpumask(HK_TYPE_MANAGED_IRQ_BOOT)); + if (!boot_nohz_le_domain) { + BUG_ON(!alloc_cpumask_var(&isolated_nohz_cpus, GFP_KERNEL)); + cpumask_andnot(isolated_nohz_cpus, cpu_possible_mask, + housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)); + } + if (!boot_mirq_le_domain) { + BUG_ON(!alloc_cpumask_var(&isolated_mirq_cpus, GFP_KERNEL)); + cpumask_andnot(isolated_mirq_cpus, cpu_possible_mask, + housekeeping_cpumask(HK_TYPE_MANAGED_IRQ)); + } return 0; } @@ -3954,7 +4116,10 @@ static void cpuset_handle_hotplug(void) */ if (force_sd_rebuild) rebuild_sched_domains_cpuslocked(); - if (update_housekeeping) + /* + * Don't need to update housekeeping cpumasks in cpuhp_offline_cb mode. + */ + if (update_housekeeping && !cpuhp_offline_cb_mode) queue_work(system_dfl_wq, &hk_sd_work); free_tmpmasks(ptmp); -- 2.53.0