Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* Re: [PATCH 16/23] genirq/cpuhotplug: Use RCU to protect access of HK_TYPE_MANAGED_IRQ cpumask
From: Waiman Long @ 2026-04-21 14:29 UTC (permalink / raw)
  To: Thomas Gleixner, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan
In-Reply-To: <87qzo8bs9m.ffs@tglx>

On 4/21/26 5:02 AM, Thomas Gleixner wrote:
> On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
>
>> As HK_TYPE_MANAGED_IRQ cpumask is going to be changeable at run time,
>> use RCU to protect access to the cpumask.
>>
>> To enable the new HK_TYPE_MANAGED_IRQ cpumask to take effect, the
>> following steps can be done.
> Can be done?
>
>>   1) Update the HK_TYPE_MANAGED_IRQ cpumask to take out the newly isolated
>>      CPUs and add back the de-isolated CPUs.
>>   2) Tear down the affected CPUs to cause irq_migrate_all_off_this_cpu()
>>      to be called on the affected CPUs to migrate the irqs to other
>>      HK_TYPE_MANAGED_IRQ housekeeping CPUs.
>>   3) Bring up the previously offline CPUs to invoke
>>      irq_affinity_online_cpu() to allow the newly de-isolated CPUs to
>>      be used for managed irqs.
> Which previously offline CPUs?
This part should go into another patch.
>
>> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
>> index 2e8072437826..8270c4de260b 100644
>> --- a/kernel/irq/manage.c
>> +++ b/kernel/irq/manage.c
>> @@ -263,6 +263,7 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask, bool
>>   	    housekeeping_enabled(HK_TYPE_MANAGED_IRQ)) {
>>   		const struct cpumask *hk_mask;
>>   
>> +		guard(rcu)();
>>   		hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
>>   
>>   		cpumask_and(tmp_mask, mask, hk_mask);
> How is this hunk related to $Subject?

The subject is actually about using RCU to protect access to 
housekeeping cpumask. There are extra info in the commit  log that 
should go to another patch.

Cheers,
Longman

>


^ permalink raw reply

* Re: [PATCH 10/23] cpu: Use RCU to protect access of HK_TYPE_TIMER cpumask
From: Waiman Long @ 2026-04-21 14:25 UTC (permalink / raw)
  To: Thomas Gleixner, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan
In-Reply-To: <87wly0bsh5.ffs@tglx>

On 4/21/26 4:57 AM, Thomas Gleixner wrote:
> On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
>> As HK_TYPE_TIMER cpumask is going to be changeable at run time, use
>> RCU to protect access to the cpumask.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   kernel/cpu.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index bc4f7a9ba64e..0d02b5d7a7ba 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -1890,6 +1890,8 @@ int freeze_secondary_cpus(int primary)
>>   	cpu_maps_update_begin();
>>   	if (primary == -1) {
>>   		primary = cpumask_first(cpu_online_mask);
>> +
>> +		guard(rcu)();
>>   		if (!housekeeping_cpu(primary, HK_TYPE_TIMER))
>>   			primary = housekeeping_any_cpu(HK_TYPE_TIMER);
> housekeeping_cpu() and housekeeping_any_cpu() can operate on two
> different CPU masks once the runtime update is enabled.
>
> Seriously?

Good point, will fix that in the next version.

Cheers,
Longman


^ permalink raw reply

* Re: [PATCH 04/23] tick/nohz: Allow runtime changes in full dynticks CPUs
From: Waiman Long @ 2026-04-21 14:24 UTC (permalink / raw)
  To: Thomas Gleixner, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan
In-Reply-To: <87340od7ev.ffs@tglx>

On 4/21/26 4:50 AM, Thomas Gleixner wrote:
> On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
>> Full dynticks can only be enabled if "nohz_full" boot option has been
>> been specified with or without parameter. Any change in the list of
>> nohz_full CPUs have to be reflected in tick_nohz_full_mask. Introduce
>> a new tick_nohz_full_update_cpus() helper that can be called to update
>> the tick_nohz_full_mask at run time. The housekeeping_update() function
>> is modified to call the new helper when the HK_TYPE_KERNEL_NOSIE cpumask
>> is going to be changed.
>>
>> We also need to enable CPU context tracking for those CPUs that
> We need nothing. Use passive voice for change logs as requested in
> documentation.
>
>> are in tick_nohz_full_mask. So remove __init from tick_nohz_init()
>> and ct_cpu_track_user() so that they be called later when an isolated
>> cpuset partition is being created. The __ro_after_init attribute is
>> taken away from context_tracking_key as well.
>>
>> Also add a new ct_cpu_untrack_user() function to reverse the action of
>> ct_cpu_track_user() in case we need to disable the nohz_full mode of
>> a CPU.
>>
>> With nohz_full enabled, the boot CPU (typically CPU 0) will be the
>> tick CPU which cannot be shut down easily. So the boot CPU should not
>> be used in an isolated cpuset partition.
>>
>> With runtime modification of nohz_full CPUs, tick_do_timer_cpu can become
>> TICK_DO_TIMER_NONE. So remove the two TICK_DO_TIMER_NONE WARN_ON_ONCE()
>> checks in tick-sched.c to avoid unnecessary warnings.
> in tick-sched.c? Describe the functions which contain that.
>
>>   static inline void tick_nohz_task_switch(void)
>> diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
>> index 925999de1a28..394e432630a3 100644
>> --- a/kernel/context_tracking.c
>> +++ b/kernel/context_tracking.c
>> @@ -411,7 +411,7 @@ static __always_inline void ct_kernel_enter(bool user, int offset) { }
>>   #define CREATE_TRACE_POINTS
>>   #include <trace/events/context_tracking.h>
>>   
>> -DEFINE_STATIC_KEY_FALSE_RO(context_tracking_key);
>> +DEFINE_STATIC_KEY_FALSE(context_tracking_key);
>>   EXPORT_SYMBOL_GPL(context_tracking_key);
>>   
>>   static noinstr bool context_tracking_recursion_enter(void)
>> @@ -674,9 +674,9 @@ void user_exit_callable(void)
>>   }
>>   NOKPROBE_SYMBOL(user_exit_callable);
>>   
>> -void __init ct_cpu_track_user(int cpu)
>> +void ct_cpu_track_user(int cpu)
>>   {
>> -	static __initdata bool initialized = false;
>> +	static bool initialized;
>>   
>>   	if (cpu == CONTEXT_TRACKING_FORCE_ENABLE) {
>>   		static_branch_inc(&context_tracking_key);
>> @@ -700,6 +700,15 @@ void __init ct_cpu_track_user(int cpu)
>>   	initialized = true;
>>   }
>>   
>> +void ct_cpu_untrack_user(int cpu)
>> +{
>> +	if (!per_cpu(context_tracking.active, cpu))
>> +		return;
>> +
>> +	per_cpu(context_tracking.active, cpu) = false;
>> +	static_branch_dec(&context_tracking_key);
>> +}
>> +
> Why is this in a patch which makes tick/nohz related changes? This is a
> preparatory change, so make it that way and do not bury it inside
> something else.
Sure, I will break out the context tracking change into its own patch.
>> +/* Get the new set of run-time nohz CPU list & update accordingly */
>> +void tick_nohz_full_update_cpus(struct cpumask *cpumask)
>> +{
>> +	int cpu;
>> +
>> +	if (!tick_nohz_full_running) {
>> +		pr_warn_once("Full dynticks cannot be enabled without the nohz_full kernel boot parameter!\n");
> That's the result of this enforced enable hackery. Make this work
> properly.
>
>> +		return;
>> +	}
>> +
>> +	/*
>> +	 * To properly enable/disable nohz_full dynticks for the affected CPUs,
>> +	 * the new nohz_full CPUs have to be copied to tick_nohz_full_mask and
>> +	 * ct_cpu_track_user/ct_cpu_untrack_user() will have to be called
>> +	 * for those CPUs that have their states changed. Those CPUs should be
>> +	 * in an offline state.
>> +	 */
>> +	for_each_cpu_andnot(cpu, cpumask, tick_nohz_full_mask) {
>> +		WARN_ON_ONCE(cpu_online(cpu));
>> +		ct_cpu_track_user(cpu);
>> +		cpumask_set_cpu(cpu, tick_nohz_full_mask);
>> +	}
>> +
>> +	for_each_cpu_andnot(cpu, tick_nohz_full_mask, cpumask) {
>> +		WARN_ON_ONCE(cpu_online(cpu));
>> +		ct_cpu_untrack_user(cpu);
>> +		cpumask_clear_cpu(cpu, tick_nohz_full_mask);
>> +	}
>> +}
> So this writes to tick_nohz_full_mask while other CPUs can access
> it. That's just wrong and I'm not at all interested in the resulting
> KCSAN warnings.
>
> tick_nohz_full_mask needs to become a RCU protected pointer, which is
> updated once the new mask is established in a separately allocated one.

I will work on that.

Cheers,
Longman

>
> Thanks,
>
>          tglx
>
>


^ permalink raw reply

* Re: [PATCH 05/23] tick: Pass timer tick job to an online HK CPU in tick_cpu_dying()
From: Waiman Long @ 2026-04-21 14:22 UTC (permalink / raw)
  To: Thomas Gleixner, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan
In-Reply-To: <87zf2wbsli.ffs@tglx>

On 4/21/26 4:55 AM, Thomas Gleixner wrote:
> On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
>> In tick_cpu_dying(), if the dying CPU is the current timekeeper,
>> it has to pass the job over to another CPU. The current code passes
>> it to another online CPU. However, that CPU may not be a timer tick
>> housekeeping CPU.  If that happens, another CPU will have to manually
>> take it over again later. Avoid this unnecessary work by directly
>> assigning an online housekeeping CPU.
>>
>> Use READ_ONCE/WRITE_ONCE() to access tick_do_timer_cpu in case the
>> non-HK CPUs may not be in stop machine in the future.
> 'may not be in the future' is yet more handwaving without
> content. Please write your change logs in a way so that people who have
> not spent months on this can follow.
>
>> @@ -394,12 +395,19 @@ int tick_cpu_dying(unsigned int dying_cpu)
>>   {
>>   	/*
>>   	 * If the current CPU is the timekeeper, it's the only one that can
>> -	 * safely hand over its duty. Also all online CPUs are in stop
>> -	 * machine, guaranteed not to be idle, therefore there is no
>> +	 * safely hand over its duty. Also all online housekeeping CPUs are
>> +	 * in stop machine, guaranteed not to be idle, therefore there is no
>>   	 * concurrency and it's safe to pick any online successor.
>>   	 */
>> -	if (tick_do_timer_cpu == dying_cpu)
>> -		tick_do_timer_cpu = cpumask_first(cpu_online_mask);
>> +	if (READ_ONCE(tick_do_timer_cpu) == dying_cpu) {
>> +		unsigned int new_cpu;
>> +
>> +		guard(rcu)();
> What's this guard for?
>
>> +		new_cpu = cpumask_first_and(cpu_online_mask, housekeeping_cpumask(HK_TYPE_TICK));
> Why has this to use housekeeping_cpumask() and does not use
> tick_nohz_full_mask?

The RCU guard is for accessing the HK_TYPE_TICK(HK_TYPE_KERNEL_NOISE) 
cpumask. tick_nohz_full_mask cpumask is actually the inverse of 
HK_TYPE_TICK cpumask. Yes, I could use cpumask_first_andnot() with 
tick_nohz_full_mask. If we make tick_nohz_full_mask an RCU protected 
pointer, we still need the guard.

Cheers,
Longman


^ permalink raw reply

* Re: [EXTERNAL] Re: [PATCH net v2] hv_sock: Report EOF instead of -EIO for FIN
From: Jakub Kicinski @ 2026-04-21 14:18 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Dexuan Cui, patchwork-bot+netdevbpf@kernel.org, KY Srinivasan,
	Haiyang Zhang, wei.liu@kernel.org, Long Li, davem@davemloft.net,
	edumazet@google.com, pabeni@redhat.com, horms@kernel.org,
	niuxuewei.nxw@antgroup.com, linux-hyperv@vger.kernel.org,
	virtualization@lists.linux.dev, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org, Ben Hillis,
	levymitchell0@gmail.com
In-Reply-To: <CAGxU2F6DVcLDLg3dT5DsDmsaOuhOcD+4VSG5dqXcFRwsN1NZ+A@mail.gmail.com>

On Tue, 21 Apr 2026 09:28:19 +0200 Stefano Garzarella wrote:
> > I'm sorry -- I just posted v3
> >     https://lore.kernel.org/linux-hyperv/20260421025950.1099495-1-decui@microsoft.com/T/#u
> > and then I realized that the v2 had been merged into the main branch :-(
> >
> > Should I post a new delta patch(with a Fixes tag against the v2) based on the main branch?  
> 
> Ehm, I'm not sure about the process but if it's merged in net tree,
> maybe we need a follow up patch.
> 
> Anyway, let's wait for Jakub's or other net maintainers' suggestions.

Yes, you have to post an incremental fix

^ permalink raw reply

* Re: [PATCH 03/23] tick/nohz: Make nohz_full parameter optional
From: Waiman Long @ 2026-04-21 14:14 UTC (permalink / raw)
  To: Thomas Gleixner, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan
In-Reply-To: <875x5kd88d.ffs@tglx>

On 4/21/26 4:32 AM, Thomas Gleixner wrote:
> On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
>> To provide nohz_full tick support, there is a set of tick dependency
>> masks that need to be evaluated on every IRQ and context switch.
> s/IRQ/interrupt/
>
> This is a changelog and not a SMS service.
>> Switching on nohz_full tick support at runtime will be problematic
>> as some of the tick dependency masks may not be properly set causing
>> problem down the road.
> That's useless blurb with zero content.
>
>> Allow nohz_full boot option to be specified without any
>> parameter to force enable nohz_full tick support without any
>> CPU in the tick_nohz_full_mask yet. The context_tracking_key and
>> tick_nohz_full_running flag will be enabled in this case to make
>> tick_nohz_full_enabled() return true.
> I kinda can crystal-ball what you are trying to say here, but that does
> not make it qualified as a proper change log.
>
>> There is still a small performance overhead by force enable nohz_full
>> this way. So it should only be used if there is a chance that some
>> CPUs may become isolated later via the cpuset isolated partition
>> functionality and better CPU isolation closed to nohz_full is desired.
> Why has this key to be enabled on boot if there are no CPUs in the
> isolated mask?
>
> If you want to manage this dynamically at runtime then enable the key
> once CPUs are isolated. Yes, it's more work, but that avoids the "should
> only be used" nonsense and makes this more robust down the road.

OK, I will try to make it fully dynamic. Of course, it will be more work.

Cheers,
Longman

> Thanks,
>
>          tglx
>
>


^ permalink raw reply

* [PATCH] RDMA/mana_ib: validate rx_hash_key_len in mana_ib_create_qp_rss
From: Junrui Luo @ 2026-04-21 10:50 UTC (permalink / raw)
  To: Long Li, Konstantin Taranov, Jason Gunthorpe, Leon Romanovsky,
	Dexuan Cui, Ajay Sharma
  Cc: linux-rdma, linux-hyperv, linux-kernel, Yuhao Jiang, stable,
	Junrui Luo

mana_ib_create_qp_rss() passes the user-supplied ucmd.rx_hash_key_len
directly to mana_ib_cfg_vport_steering(), which uses it as the length
argument to memcpy(req->hashkey, rx_hash_key, rx_hash_key_len).

A value greater than MANA_HASH_KEY_SIZE leads to an out-of-bounds read
from the kernel stack and an out-of-bounds write past req->hashkey
within the kzalloc'd struct mana_cfg_rx_steer_req_v2.

Reject any rx_hash_key_len greater than MANA_HASH_KEY_SIZE.

Fixes: 0266a177631d ("RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter")
Reported-by: Yuhao Jiang <danisjiang@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
---
 drivers/infiniband/hw/mana/qp.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
index 82f84f7ad37a..f5ab545cfd74 100644
--- a/drivers/infiniband/hw/mana/qp.c
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -151,6 +151,13 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 		return -EINVAL;
 	}
 
+	if (ucmd.rx_hash_key_len > MANA_HASH_KEY_SIZE) {
+		ibdev_dbg(&mdev->ib_dev,
+			  "RX Hash key length %u exceeds maximum %u\n",
+			  ucmd.rx_hash_key_len, MANA_HASH_KEY_SIZE);
+		return -EINVAL;
+	}
+
 	/* IB ports start with 1, MANA start with 0 */
 	port = ucmd.port;
 	ndev = mana_ib_get_netdev(pd->device, port);

---
base-commit: 7aaa8047eafd0bd628065b15757d9b48c5f9c07d
change-id: 20260421-fixes-9402b9f92e0f

Best regards,
-- 
Junrui Luo <moonafterrain@outlook.com>


^ permalink raw reply related

* Re: [PATCH 16/23] genirq/cpuhotplug: Use RCU to protect access of HK_TYPE_MANAGED_IRQ cpumask
From: Thomas Gleixner @ 2026-04-21  9:02 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long
In-Reply-To: <20260421030351.281436-17-longman@redhat.com>

On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:

> As HK_TYPE_MANAGED_IRQ cpumask is going to be changeable at run time,
> use RCU to protect access to the cpumask.
>
> To enable the new HK_TYPE_MANAGED_IRQ cpumask to take effect, the
> following steps can be done.

Can be done?

>  1) Update the HK_TYPE_MANAGED_IRQ cpumask to take out the newly isolated
>     CPUs and add back the de-isolated CPUs.
>  2) Tear down the affected CPUs to cause irq_migrate_all_off_this_cpu()
>     to be called on the affected CPUs to migrate the irqs to other
>     HK_TYPE_MANAGED_IRQ housekeeping CPUs.
>  3) Bring up the previously offline CPUs to invoke
>     irq_affinity_online_cpu() to allow the newly de-isolated CPUs to
>     be used for managed irqs.

Which previously offline CPUs?

> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 2e8072437826..8270c4de260b 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -263,6 +263,7 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask, bool
>  	    housekeeping_enabled(HK_TYPE_MANAGED_IRQ)) {
>  		const struct cpumask *hk_mask;
>  
> +		guard(rcu)();
>  		hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
>  
>  		cpumask_and(tmp_mask, mask, hk_mask);

How is this hunk related to $Subject?

Thanks,

        tglx

^ permalink raw reply

* Re: [PATCH 11/23] hrtimer: Use RCU to protect access of HK_TYPE_TIMER cpumask
From: Thomas Gleixner @ 2026-04-21  8:59 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long
In-Reply-To: <20260421030351.281436-12-longman@redhat.com>

On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
> As HK_TYPE_TIMER cpumask is going to be changeable at run time, use

As the ...

> RCU to protect access to the cpumask.
>
> The access of HK_TYPE_TIMER cpumask within hrtimers_cpu_dying() is
> protected as interrupt is disabled and all the other CPUs are stopped

interrupts are disabled

> when this function is invoked as part of the CPU tear down process.

^ permalink raw reply

* Re: [PATCH 10/23] cpu: Use RCU to protect access of HK_TYPE_TIMER cpumask
From: Thomas Gleixner @ 2026-04-21  8:57 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long
In-Reply-To: <20260421030351.281436-11-longman@redhat.com>

On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
> As HK_TYPE_TIMER cpumask is going to be changeable at run time, use
> RCU to protect access to the cpumask.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/cpu.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index bc4f7a9ba64e..0d02b5d7a7ba 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1890,6 +1890,8 @@ int freeze_secondary_cpus(int primary)
>  	cpu_maps_update_begin();
>  	if (primary == -1) {
>  		primary = cpumask_first(cpu_online_mask);
> +
> +		guard(rcu)();
>  		if (!housekeeping_cpu(primary, HK_TYPE_TIMER))
>  			primary = housekeeping_any_cpu(HK_TYPE_TIMER);

housekeeping_cpu() and housekeeping_any_cpu() can operate on two
different CPU masks once the runtime update is enabled.

Seriously?

^ permalink raw reply

* Re: [PATCH 05/23] tick: Pass timer tick job to an online HK CPU in tick_cpu_dying()
From: Thomas Gleixner @ 2026-04-21  8:55 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long
In-Reply-To: <20260421030351.281436-6-longman@redhat.com>

On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
> In tick_cpu_dying(), if the dying CPU is the current timekeeper,
> it has to pass the job over to another CPU. The current code passes
> it to another online CPU. However, that CPU may not be a timer tick
> housekeeping CPU.  If that happens, another CPU will have to manually
> take it over again later. Avoid this unnecessary work by directly
> assigning an online housekeeping CPU.
>
> Use READ_ONCE/WRITE_ONCE() to access tick_do_timer_cpu in case the
> non-HK CPUs may not be in stop machine in the future.

'may not be in the future' is yet more handwaving without
content. Please write your change logs in a way so that people who have
not spent months on this can follow.

> @@ -394,12 +395,19 @@ int tick_cpu_dying(unsigned int dying_cpu)
>  {
>  	/*
>  	 * If the current CPU is the timekeeper, it's the only one that can
> -	 * safely hand over its duty. Also all online CPUs are in stop
> -	 * machine, guaranteed not to be idle, therefore there is no
> +	 * safely hand over its duty. Also all online housekeeping CPUs are
> +	 * in stop machine, guaranteed not to be idle, therefore there is no
>  	 * concurrency and it's safe to pick any online successor.
>  	 */
> -	if (tick_do_timer_cpu == dying_cpu)
> -		tick_do_timer_cpu = cpumask_first(cpu_online_mask);
> +	if (READ_ONCE(tick_do_timer_cpu) == dying_cpu) {
> +		unsigned int new_cpu;
> +
> +		guard(rcu)();

What's this guard for?

> +		new_cpu = cpumask_first_and(cpu_online_mask, housekeeping_cpumask(HK_TYPE_TICK));

Why has this to use housekeeping_cpumask() and does not use
tick_nohz_full_mask?

Thanks,

        tglx

^ permalink raw reply

* Re: [PATCH 04/23] tick/nohz: Allow runtime changes in full dynticks CPUs
From: Thomas Gleixner @ 2026-04-21  8:50 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long
In-Reply-To: <20260421030351.281436-5-longman@redhat.com>

On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
> Full dynticks can only be enabled if "nohz_full" boot option has been
> been specified with or without parameter. Any change in the list of
> nohz_full CPUs have to be reflected in tick_nohz_full_mask. Introduce
> a new tick_nohz_full_update_cpus() helper that can be called to update
> the tick_nohz_full_mask at run time. The housekeeping_update() function
> is modified to call the new helper when the HK_TYPE_KERNEL_NOSIE cpumask
> is going to be changed.
>
> We also need to enable CPU context tracking for those CPUs that

We need nothing. Use passive voice for change logs as requested in
documentation.

> are in tick_nohz_full_mask. So remove __init from tick_nohz_init()
> and ct_cpu_track_user() so that they be called later when an isolated
> cpuset partition is being created. The __ro_after_init attribute is
> taken away from context_tracking_key as well.
>
> Also add a new ct_cpu_untrack_user() function to reverse the action of
> ct_cpu_track_user() in case we need to disable the nohz_full mode of
> a CPU.
>
> With nohz_full enabled, the boot CPU (typically CPU 0) will be the
> tick CPU which cannot be shut down easily. So the boot CPU should not
> be used in an isolated cpuset partition.
>
> With runtime modification of nohz_full CPUs, tick_do_timer_cpu can become
> TICK_DO_TIMER_NONE. So remove the two TICK_DO_TIMER_NONE WARN_ON_ONCE()
> checks in tick-sched.c to avoid unnecessary warnings.

in tick-sched.c? Describe the functions which contain that.

>  static inline void tick_nohz_task_switch(void)
> diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
> index 925999de1a28..394e432630a3 100644
> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -411,7 +411,7 @@ static __always_inline void ct_kernel_enter(bool user, int offset) { }
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/context_tracking.h>
>  
> -DEFINE_STATIC_KEY_FALSE_RO(context_tracking_key);
> +DEFINE_STATIC_KEY_FALSE(context_tracking_key);
>  EXPORT_SYMBOL_GPL(context_tracking_key);
>  
>  static noinstr bool context_tracking_recursion_enter(void)
> @@ -674,9 +674,9 @@ void user_exit_callable(void)
>  }
>  NOKPROBE_SYMBOL(user_exit_callable);
>  
> -void __init ct_cpu_track_user(int cpu)
> +void ct_cpu_track_user(int cpu)
>  {
> -	static __initdata bool initialized = false;
> +	static bool initialized;
>  
>  	if (cpu == CONTEXT_TRACKING_FORCE_ENABLE) {
>  		static_branch_inc(&context_tracking_key);
> @@ -700,6 +700,15 @@ void __init ct_cpu_track_user(int cpu)
>  	initialized = true;
>  }
>  
> +void ct_cpu_untrack_user(int cpu)
> +{
> +	if (!per_cpu(context_tracking.active, cpu))
> +		return;
> +
> +	per_cpu(context_tracking.active, cpu) = false;
> +	static_branch_dec(&context_tracking_key);
> +}
> +

Why is this in a patch which makes tick/nohz related changes? This is a
preparatory change, so make it that way and do not bury it inside
something else.

> +/* Get the new set of run-time nohz CPU list & update accordingly */
> +void tick_nohz_full_update_cpus(struct cpumask *cpumask)
> +{
> +	int cpu;
> +
> +	if (!tick_nohz_full_running) {
> +		pr_warn_once("Full dynticks cannot be enabled without the nohz_full kernel boot parameter!\n");

That's the result of this enforced enable hackery. Make this work
properly.

> +		return;
> +	}
> +
> +	/*
> +	 * To properly enable/disable nohz_full dynticks for the affected CPUs,
> +	 * the new nohz_full CPUs have to be copied to tick_nohz_full_mask and
> +	 * ct_cpu_track_user/ct_cpu_untrack_user() will have to be called
> +	 * for those CPUs that have their states changed. Those CPUs should be
> +	 * in an offline state.
> +	 */
> +	for_each_cpu_andnot(cpu, cpumask, tick_nohz_full_mask) {
> +		WARN_ON_ONCE(cpu_online(cpu));
> +		ct_cpu_track_user(cpu);
> +		cpumask_set_cpu(cpu, tick_nohz_full_mask);
> +	}
> +
> +	for_each_cpu_andnot(cpu, tick_nohz_full_mask, cpumask) {
> +		WARN_ON_ONCE(cpu_online(cpu));
> +		ct_cpu_untrack_user(cpu);
> +		cpumask_clear_cpu(cpu, tick_nohz_full_mask);
> +	}
> +}

So this writes to tick_nohz_full_mask while other CPUs can access
it. That's just wrong and I'm not at all interested in the resulting
KCSAN warnings.

tick_nohz_full_mask needs to become a RCU protected pointer, which is
updated once the new mask is established in a separately allocated one.

Thanks,

        tglx



^ permalink raw reply

* Re: [PATCH 03/23] tick/nohz: Make nohz_full parameter optional
From: Thomas Gleixner @ 2026-04-21  8:32 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Chen Ridong, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long
In-Reply-To: <20260421030351.281436-4-longman@redhat.com>

On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
> To provide nohz_full tick support, there is a set of tick dependency
> masks that need to be evaluated on every IRQ and context switch.

s/IRQ/interrupt/

This is a changelog and not a SMS service.

> Switching on nohz_full tick support at runtime will be problematic
> as some of the tick dependency masks may not be properly set causing
> problem down the road.

That's useless blurb with zero content.

> Allow nohz_full boot option to be specified without any
> parameter to force enable nohz_full tick support without any
> CPU in the tick_nohz_full_mask yet. The context_tracking_key and
> tick_nohz_full_running flag will be enabled in this case to make
> tick_nohz_full_enabled() return true.

I kinda can crystal-ball what you are trying to say here, but that does
not make it qualified as a proper change log.

> There is still a small performance overhead by force enable nohz_full
> this way. So it should only be used if there is a chance that some
> CPUs may become isolated later via the cpuset isolated partition
> functionality and better CPU isolation closed to nohz_full is desired.

Why has this key to be enabled on boot if there are no CPUs in the
isolated mask?

If you want to manage this dynamically at runtime then enable the key
once CPUs are isolated. Yes, it's more work, but that avoids the "should
only be used" nonsense and makes this more robust down the road.

Thanks,

        tglx



^ permalink raw reply

* Re: [EXTERNAL] Re: [PATCH net v2] hv_sock: Report EOF instead of -EIO for FIN
From: Stefano Garzarella @ 2026-04-21  7:28 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: patchwork-bot+netdevbpf@kernel.org, kuba@kernel.org,
	KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Long Li,
	davem@davemloft.net, edumazet@google.com, pabeni@redhat.com,
	horms@kernel.org, niuxuewei.nxw@antgroup.com,
	linux-hyperv@vger.kernel.org, virtualization@lists.linux.dev,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org, Ben Hillis, levymitchell0@gmail.com
In-Reply-To: <SA1PR21MB69214CABCA0DCD597040F849BF2C2@SA1PR21MB6921.namprd21.prod.outlook.com>

On Tue, 21 Apr 2026 at 05:13, Dexuan Cui <DECUI@microsoft.com> wrote:
>
> > From: patchwork-bot+netdevbpf@kernel.org <patchwork-
> > bot+netdevbpf@kernel.org>
> > Sent: Monday, April 20, 2026 3:00 PM
> > > [...]
> >
> > Here is the summary with links:
> >   - [net,v2] hv_sock: Report EOF instead of -EIO for FIN
> >  https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=f63152958994
>
> Hi Jakub, Stefano,
> I'm sorry -- I just posted v3
>     https://lore.kernel.org/linux-hyperv/20260421025950.1099495-1-decui@microsoft.com/T/#u
> and then I realized that the v2 had been merged into the main branch :-(
>
> Should I post a new delta patch(with a Fixes tag against the v2) based on the main branch?

Ehm, I'm not sure about the process but if it's merged in net tree,
maybe we need a follow up patch.

Anyway, let's wait for Jakub's or other net maintainers' suggestions.

Thanks,
Stefano


^ permalink raw reply

* Re: [PATCH] Drivers: hv: vmbus: Improve the logc of reserving fb_mmio on Gen2 VMs
From: Matthew Ruffell @ 2026-04-21  6:51 UTC (permalink / raw)
  To: Krister Johansen
  Cc: Dexuan Cui, kys, haiyangz, wei.liu, longli, linux-hyperv,
	linux-kernel, mhklinux, stable
In-Reply-To: <aeKW4ESwsoK5La-t@templeofstupid.com>

Thanks Dexuan for all your hard work and analysis on this patch.

I have tested this patch on Azure with:
- Standard_D4ads_v5
- Standard_D4ads_v6

with the following images:
"Ubuntu Server 22.04 LTS - x64 Gen2"
"Ubuntu Server 24.04 LTS - x64 Gen2"

with the following kernels:
- 7.1 merge window at c1f49dea2b8f335813d3b348fd39117fb8efb428
- 7.1 merge window at c1f49dea2b8f335813d3b348fd39117fb8efb428 + this patch

Without this patch, I could reproduce the issue on 22.04 + v6 based instance
types.

I can confirm that with this patch, v6 instance types can correctly kdump and
create a vmcore correctly and restart correctly without running into
MMIO issues.

I can confirm that with this patch, v5 instance types continue to operate the
same as they did previously.

Tested-by: Matthew Ruffell <matthew.ruffell@canonical.com>

On Sat, 18 Apr 2026 at 08:24, Krister Johansen <kjlx@templeofstupid.com> wrote:
>
> On Thu, Apr 16, 2026 at 11:35:29AM -0700, Dexuan Cui wrote:
> > If vmbus_reserve_fb() in the kdump kernel fails to properly reserve the
> > framebuffer MMIO range due to a Gen2 VM's screen.lfb_base being zero [1],
> > there is an MMIO conflict between the drivers hyperv_drm and pci-hyperv.
> > This is especially an issue if pci-hyperv is built-in and hyperv_drm is
> > built as a module. Consequently, the kdump kernel fails to detect PCI
> > devices via pci-hyperv, and may fail to mount the root file system,
> > which may reside in a NVMe disk.
> >
> > On Gen2 VMs, if the screen.lfb_base is 0 in the kdump kernel, fall
> > back to the low MMIO base, which should be equal to the framebuffer
> > MMIO base (Tested on x64 Windows Server 2016, and on x64 and ARM64 Windows
> > Server 2025 and on Azure) [2]. In the first kernel, screen.lfb_base
> > is not 0; if the user specifies a high resolution, it's not enough to
> > only reserve 8MB: in this case, reserve half of the space below 4GB, but
> > cap the reservation to 128MB, which is the required framebuffer size of
> > the highest resolution 7680*4320 supported by Hyper-V.
> >
> > Add the cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) check, because a CoCo
> > VM (i.e. Confidential VM) on Hyper-V doesn't have any framebuffer
> > device, so there is no need to reserve any MMIO for it.
> >
> > While at it, fix the comparison "end > VTPM_BASE_ADDRESS" by changing
> > the > to >=. Here the 'end' is an inclusive end (typically, it's
> > 0xFFFF_FFFF).
> >
> > [1] https://lore.kernel.org/all/SA1PR21MB692176C1BC53BFC9EAE5CF8EBF51A@SA1PR21MB6921.namprd21.prod.outlook.com/
> > [2] https://lore.kernel.org/all/SA1PR21MB69218F955B62DFF62E3E88D2BF222@SA1PR21MB6921.namprd21.prod.outlook.com/
> >
> > Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
> > CC: stable@vger.kernel.org
> > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > ---
> >  drivers/hv/vmbus_drv.c | 30 ++++++++++++++++++++++++++++--
> >  1 file changed, 28 insertions(+), 2 deletions(-)
>
> Thanks for the updated patch.  I tested this on the arm64 instances that
> had been failing and was able to confirm that without it present the
> failure still occurred, but with the new patch networking was able to
> attach correctly in the dump environment and kdumps were successful.
>
> Tested-by: Krister Johansen <kjlx@templeofstupid.com>
>
> -K

^ permalink raw reply

* Re: [PATCH v2 4/6] mshv: limit SynIC management to MSHV-owned resources
From: Anirudh Rayabharam @ 2026-04-21  4:46 UTC (permalink / raw)
  To: Jork Loeser
  Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
	Michael Kelley, linux-kernel, linux-arch
In-Reply-To: <20260403190613.47026-5-jloeser@linux.microsoft.com>

On Fri, Apr 03, 2026 at 12:06:10PM -0700, Jork Loeser wrote:
> The SynIC is shared between VMBus and MSHV. VMBus owns the message
> page (SIMP), event flags page (SIEFP), global enable (SCONTROL),
> and SINT2. MSHV adds SINT0, SINT5, and the event ring page (SIRBP).
> 
> Currently mshv_synic_init() redundantly enables SIMP, SIEFP, and
> SCONTROL that VMBus already configured, and mshv_synic_cleanup()
> disables all of them. This is wrong because MSHV can be torn down
> while VMBus is still active. In particular, a kexec reboot notifier
> tears down MSHV first. Disabling SCONTROL, SIMP, and SIEFP out
> from under VMBus causes its later cleanup to write SynIC MSRs while
> SynIC is disabled, which the hypervisor does not tolerate.
> 
> Restrict MSHV to managing only the resources it owns:
> - SINT0, SINT5: mask on cleanup, unmask on init
> - SIRBP: enable/disable as before
> - SIMP, SIEFP, SCONTROL: leave to VMBus when it is active (L1VH
>   and nested root partition); on a non-nested root partition VMBus
>   doesn't run, so MSHV must enable/disable them
> 
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
> ---
>  drivers/hv/mshv_synic.c | 142 ++++++++++++++++++++++++++--------------
>  1 file changed, 94 insertions(+), 48 deletions(-)
> 
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index f8b0337cdc82..7d273766bdb5 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -454,46 +454,72 @@ int mshv_synic_init(unsigned int cpu)
>  #ifdef HYPERVISOR_CALLBACK_VECTOR
>  	union hv_synic_sint sint;
>  #endif
> -	union hv_synic_scontrol sctrl;
>  	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
>  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
>  	struct hv_synic_event_flags_page **event_flags_page =
>  			&spages->synic_event_flags_page;
>  	struct hv_synic_event_ring_page **event_ring_page =
>  			&spages->synic_event_ring_page;
> +	/* VMBus runs on L1VH and nested root; it owns SIMP/SIEFP/SCONTROL */
> +	bool vmbus_active = !hv_root_partition() || hv_nested;
>  
> -	/* Setup the Synic's message page */
> +	/*
> +	 * Map the SYNIC message page. When VMBus is not active the
> +	 * hypervisor pre-provisions the SIMP GPA but may not set
> +	 * simp_enabled — enable it here.
> +	 */
>  	simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
> -	simp.simp_enabled = true;
> +	if (!vmbus_active) {
> +		simp.simp_enabled = true;
> +		hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> +	}
>  	*msg_page = memremap(simp.base_simp_gpa << HV_HYP_PAGE_SHIFT,
>  			     HV_HYP_PAGE_SIZE,
>  			     MEMREMAP_WB);
>  
>  	if (!(*msg_page))
> -		return -EFAULT;
> +		goto cleanup_simp;
>  
> -	hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> -
> -	/* Setup the Synic's event flags page */
> +	/*
> +	 * Map the event flags page. Same as SIMP: enable when
> +	 * VMBus is not active, already enabled by VMBus otherwise.
> +	 */
>  	siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
> -	siefp.siefp_enabled = true;
> +	if (!vmbus_active) {
> +		siefp.siefp_enabled = true;
> +		hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> +	}
>  	*event_flags_page = memremap(siefp.base_siefp_gpa << PAGE_SHIFT,
>  				     PAGE_SIZE, MEMREMAP_WB);
>  
>  	if (!(*event_flags_page))
> -		goto cleanup;
> -
> -	hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> +		goto cleanup_siefp;
>  
>  	/* Setup the Synic's event ring page */
>  	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
> -	sirbp.sirbp_enabled = true;
> -	*event_ring_page = memremap(sirbp.base_sirbp_gpa << PAGE_SHIFT,
> -				    PAGE_SIZE, MEMREMAP_WB);
>  
> -	if (!(*event_ring_page))
> -		goto cleanup;
> +	if (hv_root_partition()) {
> +		*event_ring_page = memremap(sirbp.base_sirbp_gpa << PAGE_SHIFT,
> +					    PAGE_SIZE, MEMREMAP_WB);
>  
> +		if (!(*event_ring_page))
> +			goto cleanup_siefp;
> +	} else {
> +		/*
> +		 * On L1VH the hypervisor does not provide a SIRBP page.
> +		 * Allocate one and program its GPA into the MSR.
> +		 */

How were things working before this?

Thanks,
Anirudh.

> +		*event_ring_page = (struct hv_synic_event_ring_page *)
> +			get_zeroed_page(GFP_KERNEL);
> +
> +		if (!(*event_ring_page))
> +			goto cleanup_siefp;
> +
> +		sirbp.base_sirbp_gpa = virt_to_phys(*event_ring_page)
> +				>> PAGE_SHIFT;
> +	}
> +
> +	sirbp.sirbp_enabled = true;
>  	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
>  
>  #ifdef HYPERVISOR_CALLBACK_VECTOR
> @@ -515,28 +541,30 @@ int mshv_synic_init(unsigned int cpu)
>  			      sint.as_uint64);
>  #endif
>  
> -	/* Enable global synic bit */
> -	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> -	sctrl.enable = 1;
> -	hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
> +	/* When VMBus is active it already enabled SCONTROL. */
> +	if (!vmbus_active) {
> +		union hv_synic_scontrol sctrl;
> +
> +		sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> +		sctrl.enable = 1;
> +		hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
> +	}
>  
>  	return 0;
>  
> -cleanup:
> -	if (*event_ring_page) {
> -		sirbp.sirbp_enabled = false;
> -		hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> -		memunmap(*event_ring_page);
> -	}
> -	if (*event_flags_page) {
> +cleanup_siefp:
> +	if (*event_flags_page)
> +		memunmap(*event_flags_page);
> +	if (!vmbus_active) {
>  		siefp.siefp_enabled = false;
>  		hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> -		memunmap(*event_flags_page);
>  	}
> -	if (*msg_page) {
> +cleanup_simp:
> +	if (*msg_page)
> +		memunmap(*msg_page);
> +	if (!vmbus_active) {
>  		simp.simp_enabled = false;
>  		hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> -		memunmap(*msg_page);
>  	}
>  
>  	return -EFAULT;
> @@ -545,16 +573,15 @@ int mshv_synic_init(unsigned int cpu)
>  int mshv_synic_cleanup(unsigned int cpu)
>  {
>  	union hv_synic_sint sint;
> -	union hv_synic_simp simp;
> -	union hv_synic_siefp siefp;
>  	union hv_synic_sirbp sirbp;
> -	union hv_synic_scontrol sctrl;
>  	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
>  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
>  	struct hv_synic_event_flags_page **event_flags_page =
>  		&spages->synic_event_flags_page;
>  	struct hv_synic_event_ring_page **event_ring_page =
>  		&spages->synic_event_ring_page;
> +	/* VMBus runs on L1VH and nested root; it owns SIMP/SIEFP/SCONTROL */
> +	bool vmbus_active = !hv_root_partition() || hv_nested;
>  
>  	/* Disable the interrupt */
>  	sint.as_uint64 = hv_get_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX);
> @@ -568,28 +595,47 @@ int mshv_synic_cleanup(unsigned int cpu)
>  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
>  			      sint.as_uint64);
>  
> -	/* Disable Synic's event ring page */
> +	/* Disable SYNIC event ring page owned by MSHV */
>  	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
>  	sirbp.sirbp_enabled = false;
> -	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> -	memunmap(*event_ring_page);
>  
> -	/* Disable Synic's event flags page */
> -	siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
> -	siefp.siefp_enabled = false;
> -	hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> +	if (hv_root_partition()) {
> +		hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> +		memunmap(*event_ring_page);
> +	} else {
> +		sirbp.base_sirbp_gpa = 0;
> +		hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> +		free_page((unsigned long)*event_ring_page);
> +	}
> +
> +	/*
> +	 * Release our mappings of the message and event flags pages.
> +	 * When VMBus is not active, we enabled SIMP/SIEFP — disable
> +	 * them. Otherwise VMBus owns the MSRs — leave them.
> +	 */
>  	memunmap(*event_flags_page);
> +	if (!vmbus_active) {
> +		union hv_synic_simp simp;
> +		union hv_synic_siefp siefp;
>  
> -	/* Disable Synic's message page */
> -	simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
> -	simp.simp_enabled = false;
> -	hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> +		siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
> +		siefp.siefp_enabled = false;
> +		hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> +
> +		simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
> +		simp.simp_enabled = false;
> +		hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> +	}
>  	memunmap(*msg_page);
>  
> -	/* Disable global synic bit */
> -	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> -	sctrl.enable = 0;
> -	hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
> +	/* When VMBus is active it owns SCONTROL — leave it. */
> +	if (!vmbus_active) {
> +		union hv_synic_scontrol sctrl;
> +
> +		sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> +		sctrl.enable = 0;
> +		hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
> +	}
>  
>  	return 0;
>  }
> -- 
> 2.43.0
> 

^ permalink raw reply

* RE: [EXTERNAL] Re: [PATCH net v2] hv_sock: Report EOF instead of -EIO for FIN
From: Dexuan Cui @ 2026-04-21  3:13 UTC (permalink / raw)
  To: patchwork-bot+netdevbpf@kernel.org, kuba@kernel.org,
	sgarzare@redhat.com
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Long Li,
	davem@davemloft.net, edumazet@google.com, pabeni@redhat.com,
	horms@kernel.org, niuxuewei.nxw@antgroup.com,
	linux-hyperv@vger.kernel.org, virtualization@lists.linux.dev,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org, Ben Hillis, levymitchell0@gmail.com
In-Reply-To: <177672238581.1802062.15838493180057695674.git-patchwork-notify@kernel.org>

> From: patchwork-bot+netdevbpf@kernel.org <patchwork-
> bot+netdevbpf@kernel.org>
> Sent: Monday, April 20, 2026 3:00 PM
> > [...]
> 
> Here is the summary with links:
>   - [net,v2] hv_sock: Report EOF instead of -EIO for FIN
>  https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=f63152958994

Hi Jakub, Stefano,
I'm sorry -- I just posted v3 
    https://lore.kernel.org/linux-hyperv/20260421025950.1099495-1-decui@microsoft.com/T/#u
and then I realized that the v2 had been merged into the main branch :-(

Should I post a new delta patch(with a Fixes tag against the v2) based on the main branch?

Thanks,
Dexuan

^ permalink raw reply

* [PATCH 23/23] cgroup/cpuset: Documentation and kselftest updates
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long
In-Reply-To: <20260421030351.281436-1-longman@redhat.com>

As CPU hotplug is now being used to enable runtime update to the list
of nohz_full and managed_irq CPUs, we should avoid using CPU 0 in the
formation of isolated partition as CPU 0 may not be able to be brought
offline like in the case of x86-64 architecture. So a number of the
test cases in test_cpuset_prs.sh will have to be updated accordingly.

A new test will also be run in offline isn't allowed in CPU 0 to verify
that using CPU 0 as part of an isolated partition will fail.

The cgroup-v2.rst is also updated to reflect the new capability of using
CPU hotplug to enable run time change to the nohz_full and managed_irq
CPU lists.

Since there is a slight performance overhead to enable runtime changes
to nohz_full CPU list, users have to explicitly opt in by adding a
"nohz_ful" kernel command line parameter with or without a CPU list.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst       | 35 +++++++---
 .../selftests/cgroup/test_cpuset_prs.sh       | 70 +++++++++++++++++--
 2 files changed, 92 insertions(+), 13 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8ad0b2781317..e97fc031eb86 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2604,11 +2604,12 @@ Cpuset Interface Files
 
 	It accepts only the following input values when written to.
 
-	  ==========	=====================================
+	  ==========	===============================================
 	  "member"	Non-root member of a partition
 	  "root"	Partition root
-	  "isolated"	Partition root without load balancing
-	  ==========	=====================================
+	  "isolated"	Partition root without load balancing and other
+		        OS noises
+	  ==========	===============================================
 
 	A cpuset partition is a collection of cpuset-enabled cgroups with
 	a partition root at the top of the hierarchy and its descendants
@@ -2652,11 +2653,29 @@ Cpuset Interface Files
 	partition or scheduling domain.  The set of exclusive CPUs is
 	determined by the value of its "cpuset.cpus.exclusive.effective".
 
-	When set to "isolated", the CPUs in that partition will be in
-	an isolated state without any load balancing from the scheduler
-	and excluded from the unbound workqueues.  Tasks placed in such
-	a partition with multiple CPUs should be carefully distributed
-	and bound to each of the individual CPUs for optimal performance.
+	When set to "isolated", the CPUs in that partition will be in an
+	isolated state without any load balancing from the scheduler and
+	excluded from the unbound workqueues as well as other OS noises.
+	Tasks placed in such a partition with multiple CPUs should be
+	carefully distributed and bound to each of the individual CPUs
+	for optimal performance.
+
+	As CPU hotplug, if supported, is used to improve the degree of
+	CPU isolation close to the "nohz_full" kernel boot parameter.
+	In some architectures, like x86-64, the boot CPU (typically CPU
+	0) cannot be brought offline, so the boot CPU should not be used
+	for forming isolated partitions.  The "nohz_full" kernel boot
+	parameter needs to be present to enable full dynticks support
+	and RCU no-callback CPU mode for CPUs in isolated partitions
+	even if the optional cpu list isn't provided.
+
+	Using CPU hotplug for creating or destroying an isolated
+	partition can cause latency spike in applications running
+	in other isolated partitions.  A reserved list of CPUs can
+	optionally be put in the "nohz_full" kernel boot parameter to
+	alleviate this problem.  When these reserved CPUs are used for
+	isolated partitions, CPU hotplug won't need to be invoked and
+	so there won't be latency spike in other isolated partitions.
 
 	A partition root ("root" or "isolated") can be in one of the
 	two possible states - valid or invalid.  An invalid partition
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index a56f4153c64d..eebb4122b581 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -67,6 +67,12 @@ then
 	echo Y > /sys/kernel/debug/sched/verbose
 fi
 
+# Enable dynamic debug message if available
+DYN_DEBUG=/proc/dynamic_debug/control
+[[ -f $DYN_DEBUG ]] && {
+	echo "file kernel/cpu.c +p" > $DYN_DEBUG
+}
+
 cd $CGROUP2
 echo +cpuset > cgroup.subtree_control
 
@@ -84,6 +90,15 @@ echo member > test/cpuset.cpus.partition
 echo "" > test/cpuset.cpus
 [[ $RESULT -eq 0 ]] && skip_test "Child cgroups are using cpuset!"
 
+#
+# If nohz_full parameter is specified and nohz_full file exists, CPU hotplug
+# will be used to modify nohz_full cpumask to include all the isolated CPUs
+# in cpuset isolated partitions.
+#
+NOHZ_FULL=/sys/devices/system/cpu/nohz_full
+BOOT_NOHZ_FULL=$(fmt -1 /proc/cmdline | grep "^nohz_full")
+[[ "$BOOT_NOHZ_FULL" = nohz_full ]] && CHK_NOHZ_FULL=1
+
 #
 # If isolated CPUs have been reserved at boot time (as shown in
 # cpuset.cpus.isolated), these isolated CPUs should be outside of CPUs 0-8
@@ -318,8 +333,8 @@ TEST_MATRIX=(
 	# Invalid to valid local partition direct transition tests
 	" C1-3:P2  X4:P2    .      .      .      .      .      .     0 A1:1-3|XA1:1-3|A2:1-3:XA2: A1:P2|A2:P-2 1-3"
 	" C1-3:P2  X4:P2    .      .      .    X3:P2    .      .     0 A1:1-2|XA1:1-3|A2:3:XA2:3 A1:P2|A2:P2 1-3"
-	" C0-3:P2    .      .    C4-6   C0-4     .      .      .     0 A1:0-4|B1:5-6 A1:P2|B1:P0"
-	" C0-3:P2    .      .    C4-6 C0-4:C0-3  .      .      .     0 A1:0-3|B1:4-6 A1:P2|B1:P0 0-3"
+	" C1-3:P2    .      .    C4-6   C1-4     .      .      .     0 A1:1-4|B1:5-6 A1:P2|B1:P0"
+	" C1-3:P2    .      .    C4-6 C1-4:C1-3  .      .      .     0 A1:1-3|B1:4-6 A1:P2|B1:P0 1-3"
 
 	# Local partition invalidation tests
 	" C0-3:X1-3:P2 C1-3:X2-3:P2 C2-3:X3:P2 \
@@ -329,8 +344,8 @@ TEST_MATRIX=(
 	" C0-3:X1-3:P2 C1-3:X2-3:P2 C2-3:X3:P2 \
 				   .      .    C4:X     .      .     0 A1:1-3|A2:1-3|A3:2-3|XA2:|XA3: A1:P2|A2:P-2|A3:P-2 1-3"
 	# Local partition CPU change tests
-	" C0-5:P2  C4-5:P1  .      .      .    C3-5     .      .     0 A1:0-2|A2:3-5 A1:P2|A2:P1 0-2"
-	" C0-5:P2  C4-5:P1  .      .    C1-5     .      .      .     0 A1:1-3|A2:4-5 A1:P2|A2:P1 1-3"
+	" C1-5:P2  C4-5:P1  .      .      .    C3-5     .      .     0 A1:1-2|A2:3-5 A1:P2|A2:P1 1-2"
+	" C1-5:P2  C4-5:P1  .      .    C2-5     .      .      .     0 A1:2-3|A2:4-5 A1:P2|A2:P1 2-3"
 
 	# cpus_allowed/exclusive_cpus update tests
 	" C0-3:X2-3 C1-3:X2-3 C2-3:X2-3 \
@@ -442,6 +457,21 @@ TEST_MATRIX=(
 	"   C0-3     .      .    C4-5   X3-5     .      .      .     1 A1:0-3|B1:4-5"
 )
 
+#
+# Test matrix to verify that using CPU 0 in isolated (local or remote) partition
+# will fail when offline isn't allowed for CPU 0.
+#
+CPU0_ISOLCPUS_MATRIX=(
+	#  old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS
+	#  ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ --------
+	"   C0-3     .      .    C4-5     P2     .      .      .     0 A1:0-3|B1:4-5 A1:P-2"
+	"   C1-3     .      .      .      P2     .      .      .     0 A1:1-3 A1:P2"
+	"   C1-3     .      .      .    P2:C0-3  .      .      .     0 A1:0-3 A1:P-2"
+	"  CX0-3   C0-3     .      .       .     P2     .      .     0 A1:0-3|A2:0-3 A2:P-2"
+	"  CX0-3 C0-3:X1-3  .      .       .     P2     .      .     0 A1:0|A2:1-3 A2:P2"
+	"  CX0-3 C0-3:X1-3  .      .       .   P2:X0-3  .      .     0 A1:0-3|A2:0-3 A2:P-2"
+)
+
 #
 # Cpuset controller remote partition test matrix.
 #
@@ -513,7 +543,7 @@ write_cpu_online()
 		}
 	fi
 	echo $VAL > $CPUFILE
-	pause 0.05
+	pause 0.10
 }
 
 #
@@ -654,6 +684,8 @@ dump_states()
 		[[ -e $PCPUS  ]] && echo "$PCPUS: $(cat $PCPUS)"
 		[[ -e $ISCPUS ]] && echo "$ISCPUS: $(cat $ISCPUS)"
 	done
+	# Dump nohz_full
+	[[ -f $NOHZ_FULL ]] && echo "nohz_full: $(cat $NOHZ_FULL)"
 }
 
 #
@@ -789,6 +821,18 @@ check_isolcpus()
 		EXPECTED_SDOMAIN=$EXPECTED_ISOLCPUS
 	fi
 
+	#
+	# Check if nohz_full match cpuset.cpus.isolated if nohz_boot parameter
+	# specified with no parameter.
+	#
+	[[ -f $NOHZ_FULL && "$BOOT_NOHZ_FULL" = nohz_full ]] && {
+		NOHZ_FULL_CPUS=$(cat $NOHZ_FULL)
+		[[ "$ISOLCPUS" != "$NOHZ_FULL_CPUS" ]] && {
+			echo "nohz_full ($NOHZ_FULL_CPUS) does not match cpuset.cpus.isolated ($ISOLCPUS)"
+			return 1
+		}
+	}
+
 	#
 	# Appending pre-isolated CPUs
 	# Even though CPU #8 isn't used for testing, it can't be pre-isolated
@@ -1070,6 +1114,21 @@ run_remote_state_test()
 	echo "All $I tests of $TEST PASSED."
 }
 
+#
+# Testing CPU 0 isolated partition test when offline is disabled
+#
+run_cpu0_isol_test()
+{
+	# Skip the test if CPU0 offline is allowed or if nohz_full kernel
+	# boot parameter is missing.
+	CPU0_ONLINE=/sys/devices/system/cpu/cpu0/online
+	[[ -f $CPU0_ONLINE ]] && return
+	grep -q -w nohz_full /proc/cmdline
+	[[ $? -ne 0 ]] && return
+
+	run_state_test CPU0_ISOLCPUS_MATRIX
+}
+
 #
 # Testing the new "isolated" partition root type
 #
@@ -1207,6 +1266,7 @@ test_inotify()
 trap cleanup 0 2 3 6
 run_state_test TEST_MATRIX
 run_remote_state_test REMOTE_TEST_MATRIX
+run_cpu0_isol_test
 test_isolated
 test_inotify
 echo "All tests PASSED."
-- 
2.53.0


^ permalink raw reply related

* [PATCH 22/23] cgroup/cpuset: Prevent offline_disabled CPUs from being used in isolated partition
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long
In-Reply-To: <20260421030351.281436-1-longman@redhat.com>

If tick_nohz_full_enabled() is true, we are going to use CPU
hotplug to enable runtime changes to the HK_TYPE_KERNEL_NOISE and
HK_TYPE_MANAGED_IRQ cpumasks. However, for some architectures, one
or maybe more CPUs will have the offline_disabled flag set in their
cpu devices. For instance, x64-64 will set this flag for its boot CPU
(typically CPU 0) to disable it from going offline. Those CPUs can't
be used in cpuset isolated partition, or we are going to have problem
in the CPU offline process.

Find out the set of CPUs with offline_disabled set in a
new cpuset_init_late() helper, set the corresponding bits in
offline_disabled_cpus cpumask and check it when isolated partitions are
being created or modified to ensure that we will not use any of those
offline disabled CPUs in an isolated partition.

An error message mentioning those offline disabled CPUs will be
constructed in cpuset_init_late() and shown in "cpuset.cpus.partition"
when isolated creation or modification fails.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset-internal.h |  1 +
 kernel/cgroup/cpuset.c          | 89 ++++++++++++++++++++++++++++++++-
 2 files changed, 88 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index fd7d19842ded..87b7411540ff 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -35,6 +35,7 @@ enum prs_errcode {
 	PERR_HKEEPING,
 	PERR_ACCESS,
 	PERR_REMOTE,
+	PERR_OL_DISABLED,
 };
 
 /* bits in struct cpuset flags field */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 5f6b4e67748f..f3af8ef6c64e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -20,6 +20,8 @@
  */
 #include "cpuset-internal.h"
 
+#include <linux/cpu.h>
+#include <linux/device.h>
 #include <linux/init.h>
 #include <linux/interrupt.h>
 #include <linux/kernel.h>
@@ -48,7 +50,7 @@ DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
  */
 DEFINE_STATIC_KEY_FALSE(cpusets_insane_config_key);
 
-static const char * const perr_strings[] = {
+static const char *perr_strings[] __ro_after_init = {
 	[PERR_INVCPUS]   = "Invalid cpu list in cpuset.cpus.exclusive",
 	[PERR_INVPARENT] = "Parent is an invalid partition root",
 	[PERR_NOTPART]   = "Parent is not a partition root",
@@ -59,6 +61,7 @@ static const char * const perr_strings[] = {
 	[PERR_HKEEPING]  = "partition config conflicts with housekeeping setup",
 	[PERR_ACCESS]    = "Enable partition not permitted",
 	[PERR_REMOTE]    = "Have remote partition underneath",
+	[PERR_OL_DISABLED] = "", /* To be set up later */
 };
 
 /*
@@ -164,6 +167,12 @@ static cpumask_var_t	isolated_mirq_cpus;	/* T */
 static bool		boot_nohz_le_domain __ro_after_init;
 static bool		boot_mirq_le_domain __ro_after_init;
 
+/*
+ * Cpumask of CPUs with offline_disabled set
+ * The cpumask is effectively __ro_after_init.
+ */
+static cpumask_var_t	offline_disabled_cpus;
+
 /*
  * A flag to force sched domain rebuild at the end of an operation.
  * It can be set in
@@ -1649,6 +1658,8 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
 	 * The effective_xcpus mask can contain offline CPUs, but there must
 	 * be at least one or more online CPUs present before it can be enabled.
 	 *
+	 * An isolated partition cannot contain CPUs with offline disabled.
+	 *
 	 * Note that creating a remote partition with any local partition root
 	 * above it or remote partition root underneath it is not allowed.
 	 */
@@ -1661,6 +1672,9 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
 	     !isolated_cpus_can_update(tmp->new_cpus, NULL)) ||
 	    prstate_housekeeping_conflict(new_prs, tmp->new_cpus))
 		return PERR_HKEEPING;
+	if (tick_nohz_full_enabled() && (new_prs == PRS_ISOLATED) &&
+	    cpumask_intersects(tmp->new_cpus, offline_disabled_cpus))
+		return PERR_OL_DISABLED;
 
 	spin_lock_irq(&callback_lock);
 	partition_xcpus_add(new_prs, NULL, tmp->new_cpus);
@@ -1746,6 +1760,16 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
 		goto invalidate;
 	}
 
+	/*
+	 * Isolated partition cannot contains CPUs with offline_disabled
+	 * bit set.
+	 */
+	if (tick_nohz_full_enabled() && (prs == PRS_ISOLATED) &&
+	    cpumask_intersects(excpus, offline_disabled_cpus)) {
+		cs->prs_err = PERR_OL_DISABLED;
+		goto invalidate;
+	}
+
 	adding   = cpumask_andnot(tmp->addmask, excpus, cs->effective_xcpus);
 	deleting = cpumask_andnot(tmp->delmask, cs->effective_xcpus, excpus);
 
@@ -1913,6 +1937,11 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 		if (tasks_nocpu_error(parent, cs, xcpus))
 			return PERR_NOCPUS;
 
+		if (tick_nohz_full_enabled() &&
+		    (new_prs == PRS_ISOLATED) &&
+		    cpumask_intersects(xcpus, offline_disabled_cpus))
+			return PERR_OL_DISABLED;
+
 		/*
 		 * This function will only be called when all the preliminary
 		 * checks have passed. At this point, the following condition
@@ -1979,12 +2008,21 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 					       parent->effective_xcpus);
 		}
 
+		/*
+		 * Isolated partition cannot contain CPUs with offline_disabled
+		 * bit set.
+		 */
+		if (tick_nohz_full_enabled() &&
+		   ((old_prs == PRS_ISOLATED) ||
+		    (old_prs == PRS_INVALID_ISOLATED)) &&
+		    cpumask_intersects(newmask, offline_disabled_cpus)) {
+			part_error = PERR_OL_DISABLED;
 		/*
 		 * TBD: Invalidate a currently valid child root partition may
 		 * still break isolated_cpus_can_update() rule if parent is an
 		 * isolated partition.
 		 */
-		if (is_partition_valid(cs) && (old_prs != parent_prs)) {
+		} else if (is_partition_valid(cs) && (old_prs != parent_prs)) {
 			if ((parent_prs == PRS_ROOT) &&
 			    /* Adding to parent means removing isolated CPUs */
 			    !isolated_cpus_can_update(tmp->delmask, tmp->addmask))
@@ -1995,6 +2033,19 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 				part_error = PERR_HKEEPING;
 		}
 
+		if (part_error) {
+			deleting = false;
+			/*
+			 * For a previously valid partition, we need to move
+			 * the exclusive CPUs back to its parent.
+			 */
+			if (is_partition_valid(cs) &&
+			    !cpumask_empty(cs->effective_xcpus)) {
+				cpumask_copy(tmp->addmask, cs->effective_xcpus);
+				adding = true;
+			}
+		}
+
 		/*
 		 * The new CPUs to be removed from parent's effective CPUs
 		 * must be present.
@@ -3829,6 +3880,7 @@ int __init cpuset_init(void)
 	BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&isolated_hk_cpus, GFP_KERNEL));
+	BUG_ON(!zalloc_cpumask_var(&offline_disabled_cpus, GFP_KERNEL));
 
 	cpumask_setall(top_cpuset.cpus_allowed);
 	nodes_setall(top_cpuset.mems_allowed);
@@ -4188,6 +4240,39 @@ void __init cpuset_init_smp(void)
 	BUG_ON(!cpuset_migrate_mm_wq);
 }
 
+/**
+ * cpuset_init_late - initialize the list of CPUs with offline_disabled set
+ *
+ * Description: Initialize a cpumask with CPUs that have the offline_disabled
+ *		bit set. It is done in a separate initcall as cpuset_init_smp()
+ *		is called before driver_init() where the CPU devices will be
+ *		set up.
+ */
+static int __init cpuset_init_late(void)
+{
+	int cpu;
+
+	if (!tick_nohz_full_enabled())
+		return 0;
+	/*
+	 * Iterate all the possible CPUs to see which one has offline disabled.
+	 */
+	for_each_possible_cpu(cpu) {
+		if (get_cpu_device(cpu)->offline_disabled)
+			__cpumask_set_cpu(cpu, offline_disabled_cpus);
+	}
+	if (!cpumask_empty(offline_disabled_cpus)) {
+		char buf[128];
+
+		snprintf(buf, sizeof(buf),
+			 "CPU %*pbl with offline disabled not allowed in isolated partition",
+			 cpumask_pr_args(offline_disabled_cpus));
+		perr_strings[PERR_OL_DISABLED] = kstrdup(buf, GFP_KERNEL);
+	}
+	return 0;
+}
+pure_initcall(cpuset_init_late);
+
 /*
  * Return cpus_allowed mask from a task's cpuset.
  */
-- 
2.53.0


^ permalink raw reply related

* [PATCH 21/23] cgroup/cpuset: Limit the side effect of using CPU hotplug on isolated partition
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long
In-Reply-To: <20260421030351.281436-1-longman@redhat.com>

CPU hotplug is used to facilitate the modification of the
HK_TYPE_KERNEL_NOISE and HK_TYPE_MANAGED_IRQ cpumasks. However, tearing
down and bringing up CPUs can impact the cpuset partition states
as well. For instance, tearing down the last CPU of a partition can
invalidate the partition with active tasks which will not happen if
CPU hotplug isn't used.

A workaround of this issue is disable the invalidation by pretending that
the partition has no task, and making the tasks within the partition
to the effective CPUs of its parent for a short while during the short
transition process where the CPUs will be teared down and the brought
up again.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a927b9cd4f71..5f6b4e67748f 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -434,6 +434,13 @@ static inline bool partition_is_populated(struct cpuset *cs,
 	struct cpuset *cp;
 	struct cgroup_subsys_state *pos_css;
 
+	/*
+	 * Hack: In cpuhp_offline_cb_mode, pretend all partitions are empty
+	 * to prevent unnecessary partition invalidation.
+	 */
+	if (cpuhp_offline_cb_mode)
+		return false;
+
 	/*
 	 * We cannot call cs_is_populated(cs) directly, as
 	 * nr_populated_domain_children may include populated
@@ -3881,6 +3888,17 @@ hotplug_update_tasks(struct cpuset *cs,
 	cs->effective_mems = *new_mems;
 	spin_unlock_irq(&callback_lock);
 
+	/*
+	 * When cpuhp_offline_cb_mode is active, valid isolated partition
+	 * with tasks may have no online CPUs available for a short while.
+	 * In that case, we fall back to parent's effective CPUs temporarily
+	 * which will be reset back to their rightful value once the affected
+	 * CPUs are online again.
+	 */
+	if (cpuhp_offline_cb_mode && cpumask_empty(new_cpus) &&
+	   (cs->partition_root_state == PRS_ISOLATED))
+		cpumask_copy(new_cpus, parent_cs(cs)->effective_cpus);
+
 	if (cpus_updated)
 		cpuset_update_tasks_cpumask(cs, new_cpus);
 	if (mems_updated)
-- 
2.53.0


^ permalink raw reply related

* [PATCH 20/23] cgroup/cpuset: Enable runtime update of HK_TYPE_{KERNEL_NOISE,MANAGED_IRQ} cpumasks
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long
In-Reply-To: <20260421030351.281436-1-longman@redhat.com>

One simple way to enable runtime update of HK_TYPE_KERNEL_NOISE
(nohz_full) and HK_TYPE_MANAGED_IRQ cpumasks is to make use of the CPU
hotplug to facilitate the transition of those CPUs that are changing
states as long as CONFIG_HOTPLUG_CPU is enabled and a nohz_full boot
parameter is provided. Otherwise, only HK_TYPE_DOMAIN cpumask will be
updated at run time.

For changes in HK_TYPE_DOMAIN cpumask, it can be done without using CPU
hotplug. For changes in HK_TYPE_MANAGED_IRQ cpumask, we have to update
the cpumask first and then tear down and bring up the newly isolated
CPUs to migrate the managed irqs in those CPUs to other housekeeping
CPUs.

For changes in HK_TYPE_KERNEL_NOISE, we have to tear down all the newly
isolated and de-isolated CPUs, change the cpumask and then bring all the
offline CPUs back online.

As it is possible that the various boot versions of the housekeeping
cpumasks are different resulting in the use of different set of isolated
cpumasks for calling housekeeping_update(), we may need to pre-allocate
these cpumasks if necessary.

Note that the use of CPU hotplug to facilitate the changing of
HK_TYPE_KERNEL_NOISE and HK_TYPE_MANAGED_IRQ housekeeping cpumasks has
the drawback that during the tear down of a CPU from CPUHP_TEARDOWN_CPU
state to CPUHP_AP_OFFLINE, the stop_machine code will be invoked to stop
all the other CPUs including all the pre-existing isolated CPUs. That
will cause latency spikes on those isolated CPUs. That latency spike
should only happen when the cpuset isolated partition setting is changed
resulting in changes in those housekeeping cpumasks.

One possible workaround that is being used right now is to pre-allocate
a set of nohz_full and managed_irq CPUs at boot time. The semi-isolated
CPUs are then used to create cpuset isolated partitions when needed
to enable full isolation. This will likely continue even if we made
the nohz_full and managed_irq CPUs runtime changeable if they can't
tolerate these latency spikes.

This is a problem we need to address in a future patch series.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 199 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 182 insertions(+), 17 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 1b0c50b46a49..a927b9cd4f71 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -152,9 +152,17 @@ static cpumask_var_t	isolated_cpus;		/* CSCB */
 static bool		update_housekeeping;	/* RWCS */
 
 /*
- * Copy of isolated_cpus to be passed to housekeeping_update()
+ * Cpumasks to be passed to housekeeping_update()
+ * isolated_hk_cpus - copy of isolated_cpus for HK_TYPE_DOMAIN
+ * isolated_nohz_cpus - for HK_TYPE_KERNEL_NOISE
+ * isolated_mirq_cpus - for HK_TYPE_MANAGED_IRQ
  */
 static cpumask_var_t	isolated_hk_cpus;	/* T */
+static cpumask_var_t	isolated_nohz_cpus;	/* T */
+static cpumask_var_t	isolated_mirq_cpus;	/* T */
+
+static bool		boot_nohz_le_domain __ro_after_init;
+static bool		boot_mirq_le_domain __ro_after_init;
 
 /*
  * A flag to force sched domain rebuild at the end of an operation.
@@ -1328,29 +1336,67 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
 	return false;
 }
 
+static int cpuset_nohz_update_cbfunc(void *arg)
+{
+	struct cpumask *isol_cpus = (struct cpumask *)arg;
+
+	if (isol_cpus)
+		housekeeping_update(isol_cpus, BIT(HK_TYPE_KERNEL_NOISE));
+	return 0;
+}
+
 /*
- * cpuset_update_sd_hk_unlock - Rebuild sched domains, update HK & unlock
- *
- * Update housekeeping cpumasks and rebuild sched domains if necessary and
- * then do a cpuset_full_unlock().
- * This should be called at the end of cpuset operation.
  */
-static void cpuset_update_sd_hk_unlock(void)
-	__releases(&cpuset_mutex)
-	__releases(&cpuset_top_mutex)
+static void cpuset_update_housekeeping_unlock(void)
 {
-	update_housekeeping = false;
+	bool update_nohz, update_mirq;
+	cpumask_var_t cpus;
+	int ret;
 
-	/* force_sd_rebuild will be cleared in rebuild_sched_domains_locked() */
-	if (force_sd_rebuild)
-		rebuild_sched_domains_locked();
+	if (!tick_nohz_full_enabled())
+		return;
 
-	if (cpumask_equal(isolated_hk_cpus, isolated_cpus)) {
-		/* No housekeeping cpumask update needed */
+	update_nohz = boot_nohz_le_domain;
+	update_mirq = boot_mirq_le_domain;
+
+	if (WARN_ON_ONCE(!alloc_cpumask_var(&cpus, GFP_KERNEL))) {
 		cpuset_full_unlock();
 		return;
 	}
 
+	/*
+	 * Update isolated_nohz_cpus/isolated_mirq_cpus if necessary
+	 */
+	if (!boot_nohz_le_domain) {
+		cpumask_andnot(cpus, cpu_possible_mask,
+			       housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
+		cpumask_or(cpus, cpus, isolated_cpus);
+		update_nohz = !cpumask_equal(isolated_nohz_cpus, cpus);
+		if (update_nohz)
+			cpumask_copy(isolated_nohz_cpus, cpus);
+	}
+	if (!boot_mirq_le_domain) {
+		cpumask_andnot(cpus, cpu_possible_mask,
+			       housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));
+		cpumask_or(cpus, cpus, isolated_cpus);
+		update_mirq = !cpumask_equal(isolated_mirq_cpus, cpus);
+		if (update_mirq)
+			cpumask_copy(isolated_mirq_cpus, cpus);
+	}
+
+	/*
+	 * Compute list of CPUs to be brought offline into "cpus"
+	 * isolated_hk_cpus - old cpumask
+	 * isolated_cpus    - new cpumask
+	 *
+	 * With update_nohz, we need to offline both the newly isolated
+	 * and de-isolated CPUs. With only update_mirq, we only need to
+	 * offline the new isolated CPUs.
+	 */
+	if (update_nohz)
+		cpumask_xor(cpus, isolated_hk_cpus, isolated_cpus);
+	else if (update_mirq)
+		cpumask_andnot(cpus, isolated_cpus, isolated_hk_cpus);
 	cpumask_copy(isolated_hk_cpus, isolated_cpus);
 
 	/*
@@ -1360,10 +1406,103 @@ static void cpuset_update_sd_hk_unlock(void)
 	 */
 	mutex_unlock(&cpuset_mutex);
 	cpus_read_unlock();
-	WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus, BIT(HK_TYPE_DOMAIN)));
+
+	if (!update_mirq) {
+		ret = housekeeping_update(isolated_hk_cpus, BIT(HK_TYPE_DOMAIN));
+	} else if (boot_mirq_le_domain) {
+		ret = housekeeping_update(isolated_hk_cpus,
+					  BIT(HK_TYPE_DOMAIN)|BIT(HK_TYPE_MANAGED_IRQ));
+	} else {
+		ret = housekeeping_update(isolated_hk_cpus, BIT(HK_TYPE_DOMAIN));
+		if (!ret)
+			ret = housekeeping_update(isolated_mirq_cpus,
+						  BIT(HK_TYPE_MANAGED_IRQ));
+	}
+
+	if (WARN_ON_ONCE(ret))
+		goto out_free;
+
+	/*
+	 * Calling cpuhp_offline_cb() is only needed if either
+	 * HK_TYPE_KERNEL_NOISE and/or HK_TYPE_MANAGED_IRQ cpumasks
+	 * needed to be updated.
+	 *
+	 * TODO: When tearing down a CPU from CPUHP_TEARDOWN_CPU state
+	 * downward to CPUHP_AP_OFFLINE, the stop_machine code will be
+	 * invoked to stop all the other CPUs in the system. This will
+	 * cause latency spikes on existing isolated CPUs. We will need
+	 * some new mechanism to enable us to not stop the existing
+	 * isolated CPUs whenever possible. A possible workaround is
+	 * to pre-allocate a set of nohz_full and manged_irq CPUs at
+	 * boot time and use them to create isolated cpuset partitions
+	 * so that CPU hotplug won't need to be used.
+	 */
+	if (update_mirq || update_nohz) {
+		struct cpumask *nohz_cpus;
+
+		/*
+		 * Calling housekeeping_update() is only needed if
+		 * update_nohz is set.
+		 */
+		nohz_cpus = !update_nohz ? NULL : boot_nohz_le_domain
+					 ? isolated_hk_cpus
+					 : isolated_nohz_cpus;
+		/*
+		 * Mask out offline CPUs in cpus
+		 * If there is no online CPUs, we can call
+		 * housekeeping_update() directly if needed.
+		 */
+		cpumask_and(cpus, cpus, cpu_online_mask);
+		if (!cpumask_empty(cpus))
+			ret = cpuhp_offline_cb(cpus, cpuset_nohz_update_cbfunc,
+					       nohz_cpus);
+		else if (nohz_cpus)
+			ret = housekeeping_update(nohz_cpus, BIT(HK_TYPE_KERNEL_NOISE));
+	}
+	WARN_ON_ONCE(ret);
+out_free:
+	free_cpumask_var(cpus);
 	mutex_unlock(&cpuset_top_mutex);
 }
 
+/*
+ * cpuset_update_sd_hk_unlock - Rebuild sched domains, update HK & unlock
+ *
+ * Update housekeeping cpumasks and rebuild sched domains if necessary and
+ * then do a cpuset_full_unlock().
+ * This should be called at the end of cpuset operation.
+ */
+static void cpuset_update_sd_hk_unlock(void)
+	__releases(&cpuset_mutex)
+	__releases(&cpuset_top_mutex)
+{
+	/* force_sd_rebuild will be cleared in rebuild_sched_domains_locked() */
+	if (force_sd_rebuild)
+		rebuild_sched_domains_locked();
+
+	update_housekeeping = false;
+
+	if (cpumask_equal(isolated_cpus, isolated_hk_cpus)) {
+		cpuset_full_unlock();
+		return;
+	}
+
+	if (!tick_nohz_full_enabled()) {
+		/*
+		 * housekeeping_update() is now called without holding
+		 * cpus_read_lock and cpuset_mutex. Only cpuset_top_mutex
+		 * is still being held for mutual exclusion.
+		 */
+		mutex_unlock(&cpuset_mutex);
+		cpus_read_unlock();
+		WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus,
+						 BIT(HK_TYPE_DOMAIN)));
+		mutex_unlock(&cpuset_top_mutex);
+	} else {
+		cpuset_update_housekeeping_unlock();
+	}
+}
+
 /*
  * Work function to invoke cpuset_update_sd_hk_unlock()
  */
@@ -3700,6 +3839,29 @@ int __init cpuset_init(void)
 			       housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT));
 		cpumask_copy(isolated_hk_cpus, isolated_cpus);
 	}
+
+	/*
+	 * If nohz_full and/or managed_irq cpu list, if present, is a subset
+	 * of the domain cpu list, i.e. HK_TYPE_DOMAIN_BOOT cpumask is a subset
+	 * of HK_TYPE_KERNEL_NOISE_BOOT/HK_TYPE_MANAGED_IRQ_BOOT cpumask, the
+	 * corresponding non-boot housekeeping cpumasks will follow changes
+	 * in the HK_TYPE_DOMAIN cpumask. So we don't need to use separate
+	 * cpumasks to isolate them.
+	 */
+	boot_nohz_le_domain = cpumask_subset(housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT),
+					     housekeeping_cpumask(HK_TYPE_KERNEL_NOISE_BOOT));
+	boot_mirq_le_domain = cpumask_subset(housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT),
+					     housekeeping_cpumask(HK_TYPE_MANAGED_IRQ_BOOT));
+	if (!boot_nohz_le_domain) {
+		BUG_ON(!alloc_cpumask_var(&isolated_nohz_cpus, GFP_KERNEL));
+		cpumask_andnot(isolated_nohz_cpus, cpu_possible_mask,
+			       housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
+	}
+	if (!boot_mirq_le_domain) {
+		BUG_ON(!alloc_cpumask_var(&isolated_mirq_cpus, GFP_KERNEL));
+		cpumask_andnot(isolated_mirq_cpus, cpu_possible_mask,
+			       housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));
+	}
 	return 0;
 }
 
@@ -3954,7 +4116,10 @@ static void cpuset_handle_hotplug(void)
 	 */
 	if (force_sd_rebuild)
 		rebuild_sched_domains_cpuslocked();
-	if (update_housekeeping)
+	/*
+	 * Don't need to update housekeeping cpumasks in cpuhp_offline_cb mode.
+	 */
+	if (update_housekeeping && !cpuhp_offline_cb_mode)
 		queue_work(system_dfl_wq, &hk_sd_work);
 
 	free_tmpmasks(ptmp);
-- 
2.53.0


^ permalink raw reply related

* [PATCH 19/23] cgroup/cpuset: Improve check for calling housekeeping_update()
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long
In-Reply-To: <20260421030351.281436-1-longman@redhat.com>

By making sure that isolated_hk_cpus matches isolated_cpus at boot time,
we can more accurately determine if calling housekeeping_update()
is needed by comparing if the two cpumasks are equal. The
update_housekeeping flag still have a use in cpuset_handle_hotplug()
to determine if a work function should be queued to invoke
cpuset_update_sd_hk_unlock() as it is not supposed to look at
isolated_hk_cpus without holding cpuset_top_mutex.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a4eccb0ec0d1..1b0c50b46a49 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1339,26 +1339,29 @@ static void cpuset_update_sd_hk_unlock(void)
 	__releases(&cpuset_mutex)
 	__releases(&cpuset_top_mutex)
 {
+	update_housekeeping = false;
+
 	/* force_sd_rebuild will be cleared in rebuild_sched_domains_locked() */
 	if (force_sd_rebuild)
 		rebuild_sched_domains_locked();
 
-	if (update_housekeeping) {
-		update_housekeeping = false;
-		cpumask_copy(isolated_hk_cpus, isolated_cpus);
-
-		/*
-		 * housekeeping_update() is now called without holding
-		 * cpus_read_lock and cpuset_mutex. Only cpuset_top_mutex
-		 * is still being held for mutual exclusion.
-		 */
-		mutex_unlock(&cpuset_mutex);
-		cpus_read_unlock();
-		WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus, BIT(HK_TYPE_DOMAIN)));
-		mutex_unlock(&cpuset_top_mutex);
-	} else {
+	if (cpumask_equal(isolated_hk_cpus, isolated_cpus)) {
+		/* No housekeeping cpumask update needed */
 		cpuset_full_unlock();
+		return;
 	}
+
+	cpumask_copy(isolated_hk_cpus, isolated_cpus);
+
+	/*
+	 * housekeeping_update() is now called without holding
+	 * cpus_read_lock and cpuset_mutex. Only cpuset_top_mutex
+	 * is still being held for mutual exclusion.
+	 */
+	mutex_unlock(&cpuset_mutex);
+	cpus_read_unlock();
+	WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus, BIT(HK_TYPE_DOMAIN)));
+	mutex_unlock(&cpuset_top_mutex);
 }
 
 /*
@@ -3692,10 +3695,11 @@ int __init cpuset_init(void)
 
 	BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL));
 
-	if (housekeeping_enabled(HK_TYPE_DOMAIN_BOOT))
+	if (housekeeping_enabled(HK_TYPE_DOMAIN_BOOT)) {
 		cpumask_andnot(isolated_cpus, cpu_possible_mask,
 			       housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT));
-
+		cpumask_copy(isolated_hk_cpus, isolated_cpus);
+	}
 	return 0;
 }
 
-- 
2.53.0


^ permalink raw reply related

* [PATCH 18/23] cpu/hotplug: Add a new cpuhp_offline_cb() API
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long
In-Reply-To: <20260421030351.281436-1-longman@redhat.com>

Add a new cpuhp_offline_cb() API that allows us to offline a set of
CPUs one-by-one, run the given callback function and then bring those
CPUs back online again while inhibiting any concurrent CPU hotplug
operations from happening.

This new API can be used to enable runtime adjustment of nohz_full and
isolcpus boot command line options. A new cpuhp_offline_cb_mode flag
is also added to signal that the system is in this offline callback
transient state so that some hotplug operations can be optimized out
if we choose to.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/cpuhplock.h |  9 +++++
 kernel/cpu.c              | 70 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 79 insertions(+)

diff --git a/include/linux/cpuhplock.h b/include/linux/cpuhplock.h
index 286b3ab92e15..37637baa32eb 100644
--- a/include/linux/cpuhplock.h
+++ b/include/linux/cpuhplock.h
@@ -9,7 +9,9 @@
 
 #include <linux/cleanup.h>
 #include <linux/errno.h>
+#include <linux/cpumask_types.h>
 
+typedef int (*cpuhp_cb_t)(void *arg);
 struct device;
 
 extern int lockdep_is_cpus_held(void);
@@ -29,6 +31,8 @@ void clear_tasks_mm_cpumask(int cpu);
 int remove_cpu(unsigned int cpu);
 int cpu_device_down(struct device *dev);
 void smp_shutdown_nonboot_cpus(unsigned int primary_cpu);
+int cpuhp_offline_cb(struct cpumask *mask, cpuhp_cb_t func, void *arg);
+extern bool cpuhp_offline_cb_mode;
 
 #else /* CONFIG_HOTPLUG_CPU */
 
@@ -43,6 +47,11 @@ static inline void cpu_hotplug_disable(void) { }
 static inline void cpu_hotplug_enable(void) { }
 static inline int remove_cpu(unsigned int cpu) { return -EPERM; }
 static inline void smp_shutdown_nonboot_cpus(unsigned int primary_cpu) { }
+static inline int cpuhp_offline_cb(struct cpumask *mask, cpuhp_cb_t func, void *arg)
+{
+	return -EPERM;
+}
+#define cpuhp_offline_cb_mode	false
 #endif	/* !CONFIG_HOTPLUG_CPU */
 
 DEFINE_LOCK_GUARD_0(cpus_read_lock, cpus_read_lock(), cpus_read_unlock())
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 0d02b5d7a7ba..9b32f742cd1d 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1520,6 +1520,76 @@ int remove_cpu(unsigned int cpu)
 }
 EXPORT_SYMBOL_GPL(remove_cpu);
 
+bool cpuhp_offline_cb_mode;
+
+/**
+ * cpuhp_offline_cb - offline CPUs, invoke callback function & online CPUs afterward
+ * @mask: A mask of CPUs to be taken offline and then online
+ * @func: A callback function to be invoked while the given CPUs are offline
+ * @arg:  Argument to be passed back to the callback function
+ *
+ * Return: 0 if successful, an error code otherwise
+ */
+int cpuhp_offline_cb(struct cpumask *mask, cpuhp_cb_t func, void *arg)
+{
+	int off_cpu, on_cpu, ret, ret2 = 0;
+
+	if (WARN_ON_ONCE(cpumask_empty(mask) ||
+	   !cpumask_subset(mask, cpu_online_mask)))
+		return -EINVAL;
+
+	pr_debug("%s: begin (CPU list = %*pbl)\n", __func__, cpumask_pr_args(mask));
+	lock_device_hotplug();
+	cpuhp_offline_cb_mode = true;
+	/*
+	 * If all offline operations succeed, off_cpu should become nr_cpu_ids.
+	 */
+	for_each_cpu(off_cpu, mask) {
+		ret = device_offline(get_cpu_device(off_cpu));
+		if (unlikely(ret))
+			break;
+	}
+	if (!ret)
+		ret = func(arg);
+
+	/* Bring previously offline CPUs back online */
+	for_each_cpu(on_cpu, mask) {
+		int retries = 0;
+
+		if (on_cpu == off_cpu)
+			break;
+
+retry:
+		ret2 = device_online(get_cpu_device(on_cpu));
+
+		/*
+		 * With the unlikely event that CPU hotplug is disabled while
+		 * this operation is in progress, we will need to wait a bit
+		 * for hotplug to hopefully be re-enabled again. If not, print
+		 * a warning and return the error.
+		 *
+		 * cpu_hotplug_disabled is supposed to be accessed while
+		 * holding the cpu_add_remove_lock mutex. So we need to
+		 * use the data_race() macro to access it here.
+		 */
+		while ((ret2 == -EBUSY) && data_race(cpu_hotplug_disabled) &&
+		       (++retries <= 5)) {
+			msleep(20);
+			if (!data_race(cpu_hotplug_disabled))
+				goto retry;
+		}
+		if (ret2) {
+			pr_warn("%s: Failed to bring CPU %d back online!\n",
+				__func__, on_cpu);
+			break;
+		}
+	}
+	cpuhp_offline_cb_mode = false;
+	unlock_device_hotplug();
+	pr_debug("%s: end\n", __func__);
+	return ret ? ret : ret2;
+}
+
 void smp_shutdown_nonboot_cpus(unsigned int primary_cpu)
 {
 	unsigned int cpu;
-- 
2.53.0


^ permalink raw reply related

* [PATCH 17/23] sched/isolation: Extend housekeeping_dereference_check() to cover changes in nohz_full or manged_irqs cpumasks
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long
In-Reply-To: <20260421030351.281436-1-longman@redhat.com>

As we are going to make HK_TYPE_KERNEL_NOISE and
HK_TYPE_MANAGED_IRQ housekeeping cpumasks run time changeable, extend
housekeeping_dereference_check() to cover changes to those cpumasks
as well.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/isolation.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 1f3f1c83dd12..1647e7b08bac 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -38,7 +38,8 @@ EXPORT_SYMBOL_GPL(housekeeping_enabled);
 
 static bool housekeeping_dereference_check(enum hk_type type)
 {
-	if (IS_ENABLED(CONFIG_LOCKDEP) && type == HK_TYPE_DOMAIN) {
+	if (IS_ENABLED(CONFIG_LOCKDEP) &&
+	   (BIT(type) & (HK_FLAG_DOMAIN | HK_FLAG_KERNEL_NOISE | HK_FLAG_MANAGED_IRQ))) {
 		/* Cpuset isn't even writable yet? */
 		if (system_state <= SYSTEM_SCHEDULING)
 			return true;
-- 
2.53.0


^ permalink raw reply related

* [PATCH 16/23] genirq/cpuhotplug: Use RCU to protect access of HK_TYPE_MANAGED_IRQ cpumask
From: Waiman Long @ 2026-04-21  3:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
	Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan, Waiman Long
In-Reply-To: <20260421030351.281436-1-longman@redhat.com>

As HK_TYPE_MANAGED_IRQ cpumask is going to be changeable at run time,
use RCU to protect access to the cpumask.

To enable the new HK_TYPE_MANAGED_IRQ cpumask to take effect, the
following steps can be done.

 1) Update the HK_TYPE_MANAGED_IRQ cpumask to take out the newly isolated
    CPUs and add back the de-isolated CPUs.
 2) Tear down the affected CPUs to cause irq_migrate_all_off_this_cpu()
    to be called on the affected CPUs to migrate the irqs to other
    HK_TYPE_MANAGED_IRQ housekeeping CPUs.
 3) Bring up the previously offline CPUs to invoke
    irq_affinity_online_cpu() to allow the newly de-isolated CPUs to
    be used for managed irqs.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/irq/cpuhotplug.c | 1 +
 kernel/irq/manage.c     | 1 +
 2 files changed, 2 insertions(+)

diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
index cd5689e383b0..86437c78f1f2 100644
--- a/kernel/irq/cpuhotplug.c
+++ b/kernel/irq/cpuhotplug.c
@@ -196,6 +196,7 @@ static bool hk_should_isolate(struct irq_data *data, unsigned int cpu)
 	if (!housekeeping_enabled(HK_TYPE_MANAGED_IRQ))
 		return false;
 
+	guard(rcu)();
 	hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
 	if (cpumask_subset(irq_data_get_effective_affinity_mask(data), hk_mask))
 		return false;
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 2e8072437826..8270c4de260b 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -263,6 +263,7 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask, bool
 	    housekeeping_enabled(HK_TYPE_MANAGED_IRQ)) {
 		const struct cpumask *hk_mask;
 
+		guard(rcu)();
 		hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
 
 		cpumask_and(tmp_mask, mask, hk_mask);
-- 
2.53.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox