* [PATCH 09/23] workqueue: Use RCU to protect access of HK_TYPE_TIMER cpumask
From: Waiman Long @ 2026-04-21 3:03 UTC
To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman
Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
Qiliang Yuan, Waiman Long
In-Reply-To: <20260421030351.281436-1-longman@redhat.com>
As the HK_TYPE_TIMER cpumask is going to be changeable at run time, use
RCU to protect access to the cpumask.
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/workqueue.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 08b1c786b463..2dab3872281a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2557,8 +2557,10 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
if (housekeeping_enabled(HK_TYPE_TIMER)) {
/* If the current cpu is a housekeeping cpu, use it. */
cpu = smp_processor_id();
- if (!housekeeping_test_cpu(cpu, HK_TYPE_TIMER))
- cpu = housekeeping_any_cpu(HK_TYPE_TIMER);
+ scoped_guard(rcu) {
+ if (!housekeeping_test_cpu(cpu, HK_TYPE_TIMER))
+ cpu = housekeeping_any_cpu(HK_TYPE_TIMER);
+ }
add_timer_on(timer, cpu);
} else {
if (likely(cpu == WORK_CPU_UNBOUND))
--
2.53.0
* [PATCH 08/23] arm64: topology: Use RCU to protect access to HK_TYPE_TICK cpumask
From: Waiman Long @ 2026-04-21 3:03 UTC
In-Reply-To: <20260421030351.281436-1-longman@redhat.com>
As the HK_TYPE_TICK cpumask is going to be changeable at run time, we
need to use RCU to protect access to the cpumask to prevent it from
going away in the middle of the operation.
Signed-off-by: Waiman Long <longman@redhat.com>
---
arch/arm64/kernel/topology.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index b32f13358fbb..48f150801689 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -173,6 +173,7 @@ void arch_cpu_idle_enter(void)
if (!amu_fie_cpu_supported(cpu))
return;
+ guard(rcu)();
/* Kick in AMU update but only if one has not happened already */
if (housekeeping_cpu(cpu, HK_TYPE_TICK) &&
time_is_before_jiffies(per_cpu(cpu_amu_samples.last_scale_update, cpu)))
@@ -187,11 +188,16 @@ int arch_freq_get_on_cpu(int cpu)
unsigned int start_cpu = cpu;
unsigned long last_update;
unsigned int freq = 0;
+ bool hk_cpu;
u64 scale;
if (!amu_fie_cpu_supported(cpu) || !arch_scale_freq_ref(cpu))
return -EOPNOTSUPP;
+ scoped_guard(rcu) {
+ hk_cpu = housekeeping_cpu(cpu, HK_TYPE_TICK);
+ }
+
while (1) {
amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
@@ -204,16 +210,21 @@ int arch_freq_get_on_cpu(int cpu)
* (and thus freq scale), if available, for given policy: this boils
* down to identifying an active cpu within the same freq domain, if any.
*/
- if (!housekeeping_cpu(cpu, HK_TYPE_TICK) ||
+ if (!hk_cpu ||
time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS))) {
struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
+ bool hk_intersects;
int ref_cpu;
if (!policy)
return -EINVAL;
- if (!cpumask_intersects(policy->related_cpus,
- housekeeping_cpumask(HK_TYPE_TICK))) {
+ scoped_guard(rcu) {
+ hk_intersects = cpumask_intersects(policy->related_cpus,
+ housekeeping_cpumask(HK_TYPE_TICK));
+ }
+
+ if (!hk_intersects) {
cpufreq_cpu_put(policy);
return -EOPNOTSUPP;
}
--
2.53.0
* [PATCH 07/23] watchdog: Sync up with runtime change of isolated CPUs
From: Waiman Long @ 2026-04-21 3:03 UTC
In-Reply-To: <20260421030351.281436-1-longman@redhat.com>
At bootup, watchdog will exclude nohz_full CPUs specified at boot time.
As we are now enabling runtime changes to nohz_full CPUs, the list of
CPUs with watchdog timer running should be updated to exclude the
current set of isolated CPUs.
Add a new watchdog_cpumask_update() helper, invoked by
housekeeping_update() when the HK_TYPE_KERNEL_NOISE (HK_TYPE_TIMER)
cpumask is being updated, to refresh watchdog_cpumask and
watchdog_allowed_mask for the soft lockup detector. The cpumask updates
are done while the affected CPUs are in the offline state. When those
CPUs are brought up later, the new cpumask will be used to determine
if any hard/soft watchdog should be enabled again.
Signed-off-by: Waiman Long <longman@redhat.com>
---
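The mask arithmetic described above can be sketched in plain userspace C.
This is only a model, not the kernel code: mask_t, struct watchdog_masks and
model_watchdog_cpumask_update() are hypothetical names, a 64-bit word stands
in for struct cpumask, and the watchdog_mutex locking is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: bit N set means CPU N is in the mask. */
typedef uint64_t mask_t;

struct watchdog_masks {
	mask_t cpumask;		/* models watchdog_cpumask */
	mask_t allowed_mask;	/* models watchdog_allowed_mask */
};

/*
 * Model of the update: the watchdog mask becomes "possible AND NOT isolated",
 * and the soft lockup detector's allowed mask is a copy of it.
 */
static void model_watchdog_cpumask_update(struct watchdog_masks *wd,
					   mask_t possible, mask_t isolated)
{
	wd->cpumask = possible & ~isolated;
	wd->allowed_mask = wd->cpumask;
}
```

With eight possible CPUs (0xFF) and CPUs 2-3 isolated (0x0C), both masks in
this model end up as 0xF3.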
include/linux/nmi.h | 2 ++
kernel/sched/isolation.c | 1 +
kernel/watchdog.c | 24 ++++++++++++++++++++++++
3 files changed, 27 insertions(+)
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index bc1162895f35..5bf941d2b168 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -17,6 +17,7 @@
void lockup_detector_init(void);
void lockup_detector_retry_init(void);
void lockup_detector_soft_poweroff(void);
+void watchdog_cpumask_update(struct cpumask *mask);
extern int watchdog_user_enabled;
extern int watchdog_thresh;
@@ -37,6 +38,7 @@ extern int sysctl_hardlockup_all_cpu_backtrace;
static inline void lockup_detector_init(void) { }
static inline void lockup_detector_retry_init(void) { }
static inline void lockup_detector_soft_poweroff(void) { }
+static inline void watchdog_cpumask_update(struct cpumask *mask) { }
#endif /* !CONFIG_LOCKUP_DETECTOR */
#ifdef CONFIG_SOFTLOCKUP_DETECTOR
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index b5635484ec69..1f3f1c83dd12 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -184,6 +184,7 @@ int housekeeping_update(struct cpumask *isol_mask, unsigned long flags)
if (flags & HK_FLAG_KERNEL_NOISE) {
tick_nohz_full_update_cpus(isol_mask);
rcu_nocb_update_cpus(isol_mask);
+ watchdog_cpumask_update(isol_mask);
}
synchronize_rcu();
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 87dd5e0f6968..498c1463b843 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -1071,6 +1071,30 @@ static inline void lockup_detector_setup(void)
}
#endif /* !CONFIG_SOFTLOCKUP_DETECTOR */
+/**
+ * watchdog_cpumask_update - update watchdog_cpumask & watchdog_allowed_mask
+ * @isol_mask: cpumask of isolated CPUs
+ *
+ * Update watchdog_cpumask and watchdog_allowed_mask to be inverse of the
+ * given isolated cpumask to disable watchdog activities on isolated CPUs.
+ * It should be called with the affected CPUs in offline state which will be
+ * brought up online later.
+ *
+ * Any changes made in watchdog_cpumask by users via the sysctl parameter will
+ * be overridden. However, proc_watchdog_update() isn't called, so the change
+ * will only take effect on CPUs that are brought up later on, which minimizes
+ * changes to the existing watchdog configuration.
+ */
+void watchdog_cpumask_update(struct cpumask *isol_mask)
+{
+ mutex_lock(&watchdog_mutex);
+ cpumask_andnot(&watchdog_cpumask, cpu_possible_mask, isol_mask);
+#ifdef CONFIG_SOFTLOCKUP_DETECTOR
+ cpumask_copy(&watchdog_allowed_mask, &watchdog_cpumask);
+#endif
+ mutex_unlock(&watchdog_mutex);
+}
+
/**
* lockup_detector_soft_poweroff - Interface to stop lockup detector(s)
*
--
2.53.0
* [PATCH 06/23] rcu/nocbs: Allow runtime changes in RCU NOCBS cpumask
From: Waiman Long @ 2026-04-21 3:03 UTC
In-Reply-To: <20260421030351.281436-1-longman@redhat.com>
We can make use of the rcu_nocb_cpu_offload()/rcu_nocb_cpu_deoffload()
APIs to enable RCU NO-CB CPU offloading of newly isolated CPUs and
deoffloading of de-isolated CPUs.
Add a new rcu_nocb_update_cpus() helper to do that and call it directly
from housekeeping_update() when the HK_TYPE_KERNEL_NOISE cpumask is
being changed.
This dynamic RCU NO-CB CPU offloading feature can only be used if either
the "rcu_nocbs" or the "nohz_full" boot command line parameter is given,
with or without an argument, so that the RCU NO-CB resources are properly
initialized at boot time.
Signed-off-by: Waiman Long <longman@redhat.com>
---
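As a rough illustration of the two loops in rcu_nocb_update_cpus(), the sets
of CPUs to offload and to deoffload are the two one-sided differences between
the new isolated mask and the current NOCB mask. The sketch below is a
userspace model under stated assumptions: mask_t, struct nocb_delta and
nocb_compute_delta() are hypothetical names, and it works on a snapshot of
both masks rather than on the live rcu_nocb_mask.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: bit N set means CPU N is in the mask. */
typedef uint64_t mask_t;

struct nocb_delta {
	mask_t to_offload;	/* newly isolated, not yet offloaded */
	mask_t to_deoffload;	/* offloaded, but no longer isolated */
};

/* Compute which CPUs need offloading or deoffloading (snapshot model). */
static struct nocb_delta nocb_compute_delta(mask_t isolated, mask_t nocb)
{
	struct nocb_delta d;

	d.to_offload = isolated & ~nocb;
	d.to_deoffload = nocb & ~isolated;
	return d;
}
```

For example, with CPUs 1-2 currently offloaded (0x06) and CPUs 1-3 isolated
(0x0E), only CPU 3 needs offloading and nothing is deoffloaded.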
include/linux/rcupdate.h | 2 ++
kernel/rcu/tree_nocb.h | 22 ++++++++++++++++++++++
kernel/sched/isolation.c | 4 +++-
3 files changed, 27 insertions(+), 1 deletion(-)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 04f3f86a4145..987e3d1d413e 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -150,6 +150,7 @@ void rcu_init_nohz(void);
int rcu_nocb_cpu_offload(int cpu);
int rcu_nocb_cpu_deoffload(int cpu);
void rcu_nocb_flush_deferred_wakeup(void);
+void rcu_nocb_update_cpus(struct cpumask *cpumask);
#define RCU_NOCB_LOCKDEP_WARN(c, s) RCU_LOCKDEP_WARN(c, s)
@@ -159,6 +160,7 @@ static inline void rcu_init_nohz(void) { }
static inline int rcu_nocb_cpu_offload(int cpu) { return -EINVAL; }
static inline int rcu_nocb_cpu_deoffload(int cpu) { return 0; }
static inline void rcu_nocb_flush_deferred_wakeup(void) { }
+static inline void rcu_nocb_update_cpus(struct cpumask *cpumask) { }
#define RCU_NOCB_LOCKDEP_WARN(c, s)
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index 2d06dcb61f37..b2daba1e5cb9 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -1173,6 +1173,28 @@ int rcu_nocb_cpu_offload(int cpu)
}
EXPORT_SYMBOL_GPL(rcu_nocb_cpu_offload);
+void rcu_nocb_update_cpus(struct cpumask *cpumask)
+{
+ int cpu, ret;
+
+ if (!rcu_state.nocb_is_setup) {
+ pr_warn_once("Dynamic RCU NOCB cannot be enabled without nohz_full/rcu_nocbs kernel boot parameter!\n");
+ return;
+ }
+
+ for_each_cpu_andnot(cpu, cpumask, rcu_nocb_mask) {
+ ret = rcu_nocb_cpu_offload(cpu);
+ if (WARN_ON_ONCE(ret))
+ return;
+ }
+
+ for_each_cpu_andnot(cpu, rcu_nocb_mask, cpumask) {
+ ret = rcu_nocb_cpu_deoffload(cpu);
+ if (WARN_ON_ONCE(ret))
+ return;
+ }
+}
+
#ifdef CONFIG_RCU_LAZY
static unsigned long
lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 48b155e0b290..b5635484ec69 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -181,8 +181,10 @@ int housekeeping_update(struct cpumask *isol_mask, unsigned long flags)
if ((housekeeping.flags & flags) != flags)
WRITE_ONCE(housekeeping.flags, housekeeping.flags | flags);
- if (flags & HK_FLAG_KERNEL_NOISE)
+ if (flags & HK_FLAG_KERNEL_NOISE) {
tick_nohz_full_update_cpus(isol_mask);
+ rcu_nocb_update_cpus(isol_mask);
+ }
synchronize_rcu();
--
2.53.0
* [PATCH 05/23] tick: Pass timer tick job to an online HK CPU in tick_cpu_dying()
From: Waiman Long @ 2026-04-21 3:03 UTC
In-Reply-To: <20260421030351.281436-1-longman@redhat.com>
In tick_cpu_dying(), if the dying CPU is the current timekeeper,
it has to pass the job over to another CPU. The current code passes
it to another online CPU. However, that CPU may not be a timer tick
housekeeping CPU. If that happens, another CPU will have to manually
take it over again later. Avoid this unnecessary work by directly
assigning an online housekeeping CPU.
Use READ_ONCE()/WRITE_ONCE() to access tick_do_timer_cpu in case
non-HK CPUs are no longer held in stop machine in the future.
Signed-off-by: Waiman Long <longman@redhat.com>
---
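The successor selection described above (first online housekeeping CPU, with
a fallback to the first online CPU) can be modelled in userspace as follows.
This is a sketch only: mask_t, MODEL_NR_CPUS, mask_first() and
pick_new_timekeeper() are hypothetical names, and the RCU read-side section
and the WARN_ON_ONCE() on the fallback path are not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: bit N set means CPU N is in the mask. */
typedef uint64_t mask_t;

#define MODEL_NR_CPUS 64	/* plays the role of nr_cpu_ids in this model */

/* Return the lowest CPU in the mask, or MODEL_NR_CPUS if the mask is empty. */
static int mask_first(mask_t m)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (m & ((mask_t)1 << cpu))
			return cpu;
	return MODEL_NR_CPUS;
}

/*
 * Prefer the first CPU that is both online and a tick housekeeping CPU;
 * if there is none, fall back to the first online CPU.
 */
static int pick_new_timekeeper(mask_t online, mask_t hk_tick)
{
	int cpu = mask_first(online & hk_tick);

	if (cpu >= MODEL_NR_CPUS)
		cpu = mask_first(online);
	return cpu;
}
```

With CPUs 0-3 online and CPUs 2-3 as housekeeping CPUs, the model picks CPU 2
rather than CPU 0, which is the hand-over the patch is aiming for.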
kernel/time/tick-common.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index d305d8521896..4834a1b9044b 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -17,6 +17,7 @@
#include <linux/profile.h>
#include <linux/sched.h>
#include <linux/module.h>
+#include <linux/sched/isolation.h>
#include <trace/events/power.h>
#include <asm/irq_regs.h>
@@ -394,12 +395,19 @@ int tick_cpu_dying(unsigned int dying_cpu)
{
/*
* If the current CPU is the timekeeper, it's the only one that can
- * safely hand over its duty. Also all online CPUs are in stop
- * machine, guaranteed not to be idle, therefore there is no
+ * safely hand over its duty. Also all online housekeeping CPUs are
+ * in stop machine, guaranteed not to be idle, therefore there is no
* concurrency and it's safe to pick any online successor.
*/
- if (tick_do_timer_cpu == dying_cpu)
- tick_do_timer_cpu = cpumask_first(cpu_online_mask);
+ if (READ_ONCE(tick_do_timer_cpu) == dying_cpu) {
+ unsigned int new_cpu;
+
+ guard(rcu)();
+ new_cpu = cpumask_first_and(cpu_online_mask, housekeeping_cpumask(HK_TYPE_TICK));
+ if (WARN_ON_ONCE(new_cpu >= nr_cpu_ids))
+ new_cpu = cpumask_first(cpu_online_mask);
+ WRITE_ONCE(tick_do_timer_cpu, new_cpu);
+ }
/* Make sure the CPU won't try to retake the timekeeping duty */
tick_sched_timer_dying(dying_cpu);
--
2.53.0
* [PATCH 04/23] tick/nohz: Allow runtime changes in full dynticks CPUs
From: Waiman Long @ 2026-04-21 3:03 UTC
In-Reply-To: <20260421030351.281436-1-longman@redhat.com>
Full dynticks can only be enabled if the "nohz_full" boot option has been
specified with or without a parameter. Any change in the list of
nohz_full CPUs has to be reflected in tick_nohz_full_mask. Introduce
a new tick_nohz_full_update_cpus() helper that can be called to update
tick_nohz_full_mask at run time. The housekeeping_update() function
is modified to call the new helper when the HK_TYPE_KERNEL_NOISE cpumask
is going to be changed.
We also need to enable CPU context tracking for those CPUs that
are in tick_nohz_full_mask. So remove __init from tick_nohz_init()
and ct_cpu_track_user() so that they can be called later when an isolated
cpuset partition is being created. The __ro_after_init attribute is
taken away from context_tracking_key as well.
Also add a new ct_cpu_untrack_user() function to reverse the action of
ct_cpu_track_user() in case we need to disable the nohz_full mode of
a CPU.
With nohz_full enabled, the boot CPU (typically CPU 0) will be the
tick CPU which cannot be shut down easily. So the boot CPU should not
be used in an isolated cpuset partition.
With runtime modification of nohz_full CPUs, tick_do_timer_cpu can become
TICK_DO_TIMER_NONE. So remove the two TICK_DO_TIMER_NONE WARN_ON_ONCE()
checks in tick-sched.c to avoid unnecessary warnings.
Signed-off-by: Waiman Long <longman@redhat.com>
---
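The pairing of ct_cpu_track_user() and ct_cpu_untrack_user() amounts to
reference counting on context_tracking_key, guarded by a per-CPU "active"
flag. A minimal userspace model of that bookkeeping is sketched below; struct
ct_model, model_track_user(), model_untrack_user() and the MODEL_* constants
are hypothetical names, and a plain integer stands in for the static key.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS		8
#define MODEL_FORCE_ENABLE	(-1)	/* models CONTEXT_TRACKING_FORCE_ENABLE */

struct ct_model {
	bool active[MODEL_NR_CPUS];	/* models per-CPU context_tracking.active */
	int key_count;			/* models the static key's enable count */
};

/* Take one key reference per newly tracked CPU, or one for a forced enable. */
static void model_track_user(struct ct_model *ct, int cpu)
{
	if (cpu == MODEL_FORCE_ENABLE) {
		ct->key_count++;
	} else if (!ct->active[cpu]) {
		ct->active[cpu] = true;
		ct->key_count++;
	}
}

/* Drop the reference only if this CPU was actually being tracked. */
static void model_untrack_user(struct ct_model *ct, int cpu)
{
	if (!ct->active[cpu])
		return;
	ct->active[cpu] = false;
	ct->key_count--;
}
```

In this model, repeated track or untrack calls for the same CPU are harmless,
and the reference taken by a forced enable is never dropped by untracking a
CPU, so the key stays enabled after the last nohz_full CPU is removed.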
include/linux/context_tracking.h | 1 +
include/linux/tick.h | 2 ++
kernel/context_tracking.c | 15 ++++++++++---
kernel/sched/isolation.c | 3 +++
kernel/time/tick-sched.c | 37 ++++++++++++++++++++++++++------
5 files changed, 48 insertions(+), 10 deletions(-)
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index a3fea7f9fef6..1a6b816f1ad6 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -17,6 +17,7 @@
#define CONTEXT_TRACKING_FORCE_ENABLE (-1)
extern void ct_cpu_track_user(int cpu);
+extern void ct_cpu_untrack_user(int cpu);
/* Called with interrupts disabled. */
extern void __ct_user_enter(enum ctx_state state);
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 738007d6f577..05586f14461c 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -274,6 +274,7 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
extern void tick_nohz_full_kick_cpu(int cpu);
extern void __tick_nohz_task_switch(void);
extern void __init tick_nohz_full_setup(cpumask_var_t cpumask);
+extern void tick_nohz_full_update_cpus(struct cpumask *cpumask);
#else
static inline bool tick_nohz_full_enabled(void) { return false; }
static inline bool tick_nohz_full_cpu(int cpu) { return false; }
@@ -299,6 +300,7 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
static inline void tick_nohz_full_kick_cpu(int cpu) { }
static inline void __tick_nohz_task_switch(void) { }
static inline void tick_nohz_full_setup(cpumask_var_t cpumask) { }
+static inline void tick_nohz_full_update_cpus(struct cpumask *cpumask) { }
#endif
static inline void tick_nohz_task_switch(void)
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 925999de1a28..394e432630a3 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -411,7 +411,7 @@ static __always_inline void ct_kernel_enter(bool user, int offset) { }
#define CREATE_TRACE_POINTS
#include <trace/events/context_tracking.h>
-DEFINE_STATIC_KEY_FALSE_RO(context_tracking_key);
+DEFINE_STATIC_KEY_FALSE(context_tracking_key);
EXPORT_SYMBOL_GPL(context_tracking_key);
static noinstr bool context_tracking_recursion_enter(void)
@@ -674,9 +674,9 @@ void user_exit_callable(void)
}
NOKPROBE_SYMBOL(user_exit_callable);
-void __init ct_cpu_track_user(int cpu)
+void ct_cpu_track_user(int cpu)
{
- static __initdata bool initialized = false;
+ static bool initialized;
if (cpu == CONTEXT_TRACKING_FORCE_ENABLE) {
static_branch_inc(&context_tracking_key);
@@ -700,6 +700,15 @@ void __init ct_cpu_track_user(int cpu)
initialized = true;
}
+void ct_cpu_untrack_user(int cpu)
+{
+ if (!per_cpu(context_tracking.active, cpu))
+ return;
+
+ per_cpu(context_tracking.active, cpu) = false;
+ static_branch_dec(&context_tracking_key);
+}
+
#ifdef CONFIG_CONTEXT_TRACKING_USER_FORCE
void __init context_tracking_init(void)
{
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index c233d55a1e95..48b155e0b290 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -181,6 +181,9 @@ int housekeeping_update(struct cpumask *isol_mask, unsigned long flags)
if ((housekeeping.flags & flags) != flags)
WRITE_ONCE(housekeeping.flags, housekeeping.flags | flags);
+ if (flags & HK_FLAG_KERNEL_NOISE)
+ tick_nohz_full_update_cpus(isol_mask);
+
synchronize_rcu();
if (flags & HK_FLAG_DOMAIN) {
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index ed877b2c9040..7baa757ca45f 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -241,9 +241,6 @@ static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
tick_cpu = READ_ONCE(tick_do_timer_cpu);
if (IS_ENABLED(CONFIG_NO_HZ_COMMON) && unlikely(tick_cpu == TICK_DO_TIMER_NONE)) {
-#ifdef CONFIG_NO_HZ_FULL
- WARN_ON_ONCE(tick_nohz_full_running);
-#endif
WRITE_ONCE(tick_do_timer_cpu, cpu);
tick_cpu = cpu;
}
@@ -629,6 +626,36 @@ void __init tick_nohz_full_setup(cpumask_var_t cpumask)
tick_nohz_full_running = true;
}
+/* Get the new set of run-time nohz CPU list & update accordingly */
+void tick_nohz_full_update_cpus(struct cpumask *cpumask)
+{
+ int cpu;
+
+ if (!tick_nohz_full_running) {
+ pr_warn_once("Full dynticks cannot be enabled without the nohz_full kernel boot parameter!\n");
+ return;
+ }
+
+ /*
+ * To properly enable/disable nohz_full dynticks for the affected CPUs,
+ * the new nohz_full CPUs have to be copied to tick_nohz_full_mask and
+ * ct_cpu_track_user/ct_cpu_untrack_user() will have to be called
+ * for those CPUs that have their states changed. Those CPUs should be
+ * in an offline state.
+ */
+ for_each_cpu_andnot(cpu, cpumask, tick_nohz_full_mask) {
+ WARN_ON_ONCE(cpu_online(cpu));
+ ct_cpu_track_user(cpu);
+ cpumask_set_cpu(cpu, tick_nohz_full_mask);
+ }
+
+ for_each_cpu_andnot(cpu, tick_nohz_full_mask, cpumask) {
+ WARN_ON_ONCE(cpu_online(cpu));
+ ct_cpu_untrack_user(cpu);
+ cpumask_clear_cpu(cpu, tick_nohz_full_mask);
+ }
+}
+
bool tick_nohz_cpu_hotpluggable(unsigned int cpu)
{
/*
@@ -1238,10 +1265,6 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
*/
if (tick_cpu == cpu)
return false;
-
- /* Should not happen for nohz-full */
- if (WARN_ON_ONCE(tick_cpu == TICK_DO_TIMER_NONE))
- return false;
}
return true;
--
2.53.0
* [PATCH 03/23] tick/nohz: Make nohz_full parameter optional
From: Waiman Long @ 2026-04-21 3:03 UTC
In-Reply-To: <20260421030351.281436-1-longman@redhat.com>
To provide nohz_full tick support, there is a set of tick dependency
masks that need to be evaluated on every IRQ and context switch.
Switching on nohz_full tick support at runtime will be problematic
as some of the tick dependency masks may not be properly set, causing
problems down the road.
Allow nohz_full boot option to be specified without any
parameter to force enable nohz_full tick support without any
CPU in the tick_nohz_full_mask yet. The context_tracking_key and
tick_nohz_full_running flag will be enabled in this case to make
tick_nohz_full_enabled() return true.
There is still a small performance overhead from force-enabling nohz_full
this way, so it should only be used if there is a chance that some
CPUs may become isolated later via the cpuset isolated partition
functionality and CPU isolation close to that of nohz_full is desired.
Signed-off-by: Waiman Long <longman@redhat.com>
---
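The parameter handling described above (accept both "nohz_full" and
"nohz_full=<cpu-list>", skipping an optional '=') can be sketched in
userspace as follows. nohz_full_arg() is a hypothetical helper written for
illustration only; it uses a plain prefix match and does not parse the CPU
list itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the argument part of a "nohz_full" boot option:
 *   "nohz_full=1-3" -> "1-3"
 *   "nohz_full"     -> ""   (force enable, no CPUs listed yet)
 * Return NULL if the option does not start with "nohz_full".
 */
static const char *nohz_full_arg(const char *opt)
{
	static const char prefix[] = "nohz_full";
	size_t n = sizeof(prefix) - 1;

	if (strncmp(opt, prefix, n) != 0)
		return NULL;

	opt += n;
	if (*opt == '=')	/* mirrors the "if (*str == '=') str++;" step */
		opt++;
	return opt;
}
```

An empty returned string corresponds to the new "no argument" case, in which
the setup code is expected to take the force-enable path instead of rejecting
the option.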
Documentation/admin-guide/kernel-parameters.txt | 15 +++++++++------
include/linux/context_tracking.h | 7 ++++++-
kernel/context_tracking.c | 4 +++-
kernel/rcu/tree_nocb.h | 2 +-
kernel/sched/isolation.c | 13 ++++++++++++-
kernel/time/tick-sched.c | 11 +++++++++--
6 files changed, 40 insertions(+), 12 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 95f97ce487a4..f0eedaebe9d6 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4550,13 +4550,16 @@ Kernel parameters
Valid arguments: on, off
Default: on
- nohz_full= [KNL,BOOT,SMP,ISOL]
- The argument is a cpu list, as described above.
+ nohz_full[=cpu-list]
+ [KNL,BOOT,SMP,ISOL]
In kernels built with CONFIG_NO_HZ_FULL=y, set
- the specified list of CPUs whose tick will be stopped
- whenever possible. The boot CPU will be forced outside
- the range to maintain the timekeeping. Any CPUs
- in this list will have their RCU callbacks offloaded,
+ the specified list of CPUs whose tick will be
+ stopped whenever possible. If the argument is
+ not specified, nohz_full will be forced enabled
+ without any CPU in the nohz_full list yet.
+ The boot CPU will be forced outside the range
+ to maintain the timekeeping. Any CPUs in this
+ list will have their RCU callbacks offloaded,
just as if they had also been called out in the
rcu_nocbs= boot parameter.
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index af9fe87a0922..a3fea7f9fef6 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -9,8 +9,13 @@
#include <asm/ptrace.h>
-
#ifdef CONFIG_CONTEXT_TRACKING_USER
+/*
+ * Pass CONTEXT_TRACKING_FORCE_ENABLE to ct_cpu_track_user() to force enable
+ * user context tracking.
+ */
+#define CONTEXT_TRACKING_FORCE_ENABLE (-1)
+
extern void ct_cpu_track_user(int cpu);
/* Called with interrupts disabled. */
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index a743e7ffa6c0..925999de1a28 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -678,7 +678,9 @@ void __init ct_cpu_track_user(int cpu)
{
static __initdata bool initialized = false;
- if (!per_cpu(context_tracking.active, cpu)) {
+ if (cpu == CONTEXT_TRACKING_FORCE_ENABLE) {
+ static_branch_inc(&context_tracking_key);
+ } else if (!per_cpu(context_tracking.active, cpu)) {
per_cpu(context_tracking.active, cpu) = true;
static_branch_inc(&context_tracking_key);
}
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index b3337c7231cc..2d06dcb61f37 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -1267,7 +1267,7 @@ void __init rcu_init_nohz(void)
struct shrinker * __maybe_unused lazy_rcu_shrinker;
#if defined(CONFIG_NO_HZ_FULL)
- if (tick_nohz_full_running && !cpumask_empty(tick_nohz_full_mask))
+ if (tick_nohz_full_running)
cpumask = tick_nohz_full_mask;
#endif
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 965d6f8fe344..c233d55a1e95 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -268,6 +268,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
}
alloc_bootmem_cpumask_var(&non_housekeeping_mask);
+
if (cpulist_parse(str, non_housekeeping_mask) < 0) {
pr_warn("Housekeeping: nohz_full= or isolcpus= incorrect CPU range\n");
goto free_non_housekeeping_mask;
@@ -277,6 +278,13 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
cpumask_andnot(housekeeping_staging,
cpu_possible_mask, non_housekeeping_mask);
+ /*
+ * Allow "nohz_full" without parameter to force enable nohz_full
+ * at boot time without any CPUs in the nohz_full list yet.
+ */
+ if ((flags & HK_FLAG_KERNEL_NOISE) && !*str)
+ goto setup_housekeeping_staging;
+
first_cpu = cpumask_first_and(cpu_present_mask, housekeeping_staging);
if (first_cpu >= nr_cpu_ids || first_cpu >= setup_max_cpus) {
__cpumask_set_cpu(smp_processor_id(), housekeeping_staging);
@@ -290,6 +298,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
if (cpumask_empty(non_housekeeping_mask))
goto free_housekeeping_staging;
+setup_housekeeping_staging:
if (!housekeeping.flags) {
/* First setup call ("nohz_full=" or "isolcpus=") */
enum hk_type type;
@@ -357,10 +366,12 @@ static int __init housekeeping_nohz_full_setup(char *str)
unsigned long flags;
flags = HK_FLAG_KERNEL_NOISE | HK_FLAG_KERNEL_NOISE_BOOT;
+ if (*str == '=')
+ str++;
return housekeeping_setup(str, flags);
}
-__setup("nohz_full=", housekeeping_nohz_full_setup);
+__setup("nohz_full", housekeeping_nohz_full_setup);
static int __init housekeeping_isolcpus_setup(char *str)
{
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9e5264458414..ed877b2c9040 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -676,8 +676,15 @@ void __init tick_nohz_init(void)
}
}
- for_each_cpu(cpu, tick_nohz_full_mask)
- ct_cpu_track_user(cpu);
+ /*
+ * Force enable context_tracking_key if tick_nohz_full_mask is empty
+ */
+ if (cpumask_empty(tick_nohz_full_mask)) {
+ ct_cpu_track_user(CONTEXT_TRACKING_FORCE_ENABLE);
+ } else {
+ for_each_cpu(cpu, tick_nohz_full_mask)
+ ct_cpu_track_user(cpu);
+ }
ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
"kernel/nohz:predown", NULL,
--
2.53.0
* [PATCH 02/23] sched/isolation: Enhance housekeeping_update() to support updating more than one HK cpumask
From: Waiman Long @ 2026-04-21 3:03 UTC
In-Reply-To: <20260421030351.281436-1-longman@redhat.com>
The housekeeping_update() function currently allows updating only the
HK_TYPE_DOMAIN cpumask. As we are going to enable dynamic modification
of the other housekeeping cpumasks, extend it to take information about
which HK cpumask(s) are to be updated. When several HK cpumasks happen
to be the same, updating them all in a single call is more efficient
than calling housekeeping_update() once per cpumask, so support that
as well.
Also add the restriction that the isolated cpumask passed to
housekeeping_update() must include all the CPUs isolated at boot
time. This is currently the case for cpuset anyway.
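The validate-then-commit flow described above can be sketched in user space. This is only a model under stated assumptions: cpumasks are reduced to 64-bit words, and every name here (struct hk_model, hk_model_update(), the HK_M_* constants) is hypothetical. The patch itself allocates each trial mask with kmalloc(), publishes it with rcu_assign_pointer() and waits in synchronize_rcu(), none of which is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical user-space model: one bit per CPU, at most 64 CPUs. */
enum hk_model_type { HK_M_DOMAIN, HK_M_MANAGED_IRQ, HK_M_KERNEL_NOISE, HK_M_CNT };

struct hk_model {
	uint64_t possible;        /* all CPUs that may exist */
	uint64_t online;          /* CPUs currently online */
	uint64_t boot[HK_M_CNT];  /* boot-time housekeeping masks */
	uint64_t cur[HK_M_CNT];   /* runtime housekeeping masks */
};

/*
 * Update every housekeeping mask selected in @flags (bit i selects type i)
 * so that it excludes the CPUs in @isol. Either all selected masks are
 * updated or none is.
 */
static int hk_model_update(struct hk_model *hk, uint64_t isol, unsigned long flags)
{
	uint64_t trial[HK_M_CNT] = { 0 };
	int i;

	assert((flags >> HK_M_CNT) == 0);

	/* Pass 1: validate every requested mask before touching any state. */
	for (i = 0; i < HK_M_CNT; i++) {
		if (!(flags & (1UL << i)))
			continue;
		trial[i] = hk->possible & ~isol;
		if (!(trial[i] & hk->online))   /* need at least one online HK CPU */
			return -EINVAL;
		if (trial[i] & ~hk->boot[i])    /* must stay a subset of the boot mask */
			return -EINVAL;
	}

	/* Pass 2: commit all selected masks. */
	for (i = 0; i < HK_M_CNT; i++)
		if (flags & (1UL << i))
			hk->cur[i] = trial[i];
	return 0;
}
```

In this model the subset check is what enforces the restriction on the isolated cpumask: if @isol omits a CPU that was isolated at boot, that CPU shows up in the trial mask but not in the boot mask, and the update is rejected.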
Signed-off-by: Waiman Long <longman@redhat.com>
---
include/linux/sched/isolation.h | 2 +-
kernel/cgroup/cpuset.c | 2 +-
kernel/sched/isolation.c | 99 +++++++++++++++++++++++----------
3 files changed, 71 insertions(+), 32 deletions(-)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index d1707f121e20..a17f16e0156e 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -51,7 +51,7 @@ extern const struct cpumask *housekeeping_cpumask(enum hk_type type);
extern bool housekeeping_enabled(enum hk_type type);
extern void housekeeping_affine(struct task_struct *t, enum hk_type type);
extern bool housekeeping_test_cpu(int cpu, enum hk_type type);
-extern int housekeeping_update(struct cpumask *isol_mask);
+extern int housekeeping_update(struct cpumask *isol_mask, unsigned long flags);
extern void __init housekeeping_init(void);
#else
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 1335e437098e..a4eccb0ec0d1 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1354,7 +1354,7 @@ static void cpuset_update_sd_hk_unlock(void)
*/
mutex_unlock(&cpuset_mutex);
cpus_read_unlock();
- WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus));
+ WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus, BIT(HK_TYPE_DOMAIN)));
mutex_unlock(&cpuset_top_mutex);
} else {
cpuset_full_unlock();
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 9ec9ae510dc7..965d6f8fe344 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -120,48 +120,87 @@ bool housekeeping_test_cpu(int cpu, enum hk_type type)
}
EXPORT_SYMBOL_GPL(housekeeping_test_cpu);
-int housekeeping_update(struct cpumask *isol_mask)
-{
- struct cpumask *trial, *old = NULL;
- int err;
+/* HK type processing table */
+static struct {
+ int type;
+ int boot_type;
+} hk_types[] = {
+ { HK_TYPE_DOMAIN, HK_TYPE_DOMAIN_BOOT },
+ { HK_TYPE_MANAGED_IRQ, HK_TYPE_MANAGED_IRQ_BOOT },
+ { HK_TYPE_KERNEL_NOISE, HK_TYPE_KERNEL_NOISE_BOOT }
+};
- trial = kmalloc(cpumask_size(), GFP_KERNEL);
- if (!trial)
- return -ENOMEM;
+#define HK_TYPE_CNT ARRAY_SIZE(hk_types)
- cpumask_andnot(trial, housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT), isol_mask);
- if (!cpumask_intersects(trial, cpu_online_mask)) {
- kfree(trial);
- return -EINVAL;
+int housekeeping_update(struct cpumask *isol_mask, unsigned long flags)
+{
+ struct cpumask *trial[HK_TYPE_CNT];
+ int i, err = 0;
+
+ for (i = 0; i < HK_TYPE_CNT; i++) {
+ int type = hk_types[i].type;
+ int boot = hk_types[i].boot_type;
+
+ trial[i] = NULL;
+ if (flags & BIT(type)) {
+ trial[i] = kmalloc(cpumask_size(), GFP_KERNEL);
+ if (!trial[i]) {
+ err = -ENOMEM;
+ goto out;
+ }
+ /*
+ * The new HK cpumask must be a subset of its boot
+ * cpumask.
+ */
+ cpumask_andnot(trial[i], cpu_possible_mask, isol_mask);
+ if (!cpumask_intersects(trial[i], cpu_online_mask) ||
+ !cpumask_subset(trial[i], housekeeping_cpumask(boot))) {
+ i++;
+ err = -EINVAL;
+ goto out;
+ }
+ }
}
if (!housekeeping.flags)
static_branch_enable(&housekeeping_overridden);
- if (housekeeping.flags & HK_FLAG_DOMAIN)
- old = housekeeping_cpumask_dereference(HK_TYPE_DOMAIN);
- else
- WRITE_ONCE(housekeeping.flags, housekeeping.flags | HK_FLAG_DOMAIN);
- rcu_assign_pointer(housekeeping.cpumasks[HK_TYPE_DOMAIN], trial);
-
- synchronize_rcu();
-
- pci_probe_flush_workqueue();
- mem_cgroup_flush_workqueue();
- vmstat_flush_workqueue();
+ for (i = 0; i < HK_TYPE_CNT; i++) {
+ int type = hk_types[i].type;
+ struct cpumask *old;
- err = workqueue_unbound_housekeeping_update(housekeeping_cpumask(HK_TYPE_DOMAIN));
- WARN_ON_ONCE(err < 0);
+ if (!trial[i])
+ continue;
+ old = NULL;
+ if (housekeeping.flags & BIT(type))
+ old = housekeeping_cpumask_dereference(type);
+ rcu_assign_pointer(housekeeping.cpumasks[type], trial[i]);
+ trial[i] = old;
+ }
- err = tmigr_isolated_exclude_cpumask(isol_mask);
- WARN_ON_ONCE(err < 0);
+ if ((housekeeping.flags & flags) != flags)
+ WRITE_ONCE(housekeeping.flags, housekeeping.flags | flags);
- err = kthreads_update_housekeeping();
- WARN_ON_ONCE(err < 0);
+ synchronize_rcu();
- kfree(old);
+ if (flags & HK_FLAG_DOMAIN) {
+ /*
+ * HK_TYPE_DOMAIN specific callbacks
+ */
+ pci_probe_flush_workqueue();
+ mem_cgroup_flush_workqueue();
+ vmstat_flush_workqueue();
+
+ WARN_ON_ONCE(workqueue_unbound_housekeeping_update(
+ housekeeping_cpumask(HK_TYPE_DOMAIN)) < 0);
+ WARN_ON_ONCE(tmigr_isolated_exclude_cpumask(isol_mask) < 0);
+ WARN_ON_ONCE(kthreads_update_housekeeping() < 0);
+ }
- return 0;
+out:
+ while (--i >= 0)
+ kfree(trial[i]);
+ return err;
}
void __init housekeeping_init(void)
--
2.53.0
* [PATCH 01/23] sched/isolation: Add HK_TYPE_KERNEL_NOISE_BOOT & HK_TYPE_MANAGED_IRQ_BOOT
From: Waiman Long @ 2026-04-21 3:03 UTC (permalink / raw)
To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman
Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
Qiliang Yuan, Waiman Long
In-Reply-To: <20260421030351.281436-1-longman@redhat.com>
Commit 4fca0e550d50 ("sched/isolation: Save boot defined domain
flags") added HK_TYPE_DOMAIN_BOOT to record the boot time
"isolcpus{=domain}" setting. As we are going to make the
HK_TYPE_MANAGED_IRQ and HK_TYPE_KERNEL_NOISE housekeeping cpumasks
runtime modifiable, we need additional cpumasks to record their boot
time settings, so that those housekeeping cpumasks always remain
a subset of their boot time equivalents.
Introduce the new HK_TYPE_KERNEL_NOISE_BOOT and HK_TYPE_MANAGED_IRQ_BOOT
housekeeping types to do that.
Signed-off-by: Waiman Long <longman@redhat.com>
---
include/linux/sched/isolation.h | 16 ++++++++++++++--
kernel/sched/isolation.c | 16 +++++++++-------
2 files changed, 23 insertions(+), 9 deletions(-)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index dc3975ff1b2e..d1707f121e20 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -14,10 +14,22 @@ enum hk_type {
* is always a subset of HK_TYPE_DOMAIN_BOOT.
*/
HK_TYPE_DOMAIN,
- /* Inverse of boot-time isolcpus=managed_irq argument */
- HK_TYPE_MANAGED_IRQ,
+
/* Inverse of boot-time nohz_full= or isolcpus=nohz arguments */
+ HK_TYPE_KERNEL_NOISE_BOOT,
+ /*
+ * A subset of HK_TYPE_KERNEL_NOISE_BOOT as it may exclude some
+ * additional isolated CPUs at run time.
+ */
HK_TYPE_KERNEL_NOISE,
+
+ /* Inverse of boot-time isolcpus=managed_irq argument */
+ HK_TYPE_MANAGED_IRQ_BOOT,
+ /*
+ * A subset of HK_TYPE_MANAGED_IRQ_BOOT as it may exclude some
+ * additional isolated CPUs at run time.
+ */
+ HK_TYPE_MANAGED_IRQ,
HK_TYPE_MAX,
/*
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index a947d75b43f1..9ec9ae510dc7 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -12,10 +12,12 @@
#include "sched.h"
enum hk_flags {
- HK_FLAG_DOMAIN_BOOT = BIT(HK_TYPE_DOMAIN_BOOT),
- HK_FLAG_DOMAIN = BIT(HK_TYPE_DOMAIN),
- HK_FLAG_MANAGED_IRQ = BIT(HK_TYPE_MANAGED_IRQ),
- HK_FLAG_KERNEL_NOISE = BIT(HK_TYPE_KERNEL_NOISE),
+ HK_FLAG_DOMAIN_BOOT = BIT(HK_TYPE_DOMAIN_BOOT),
+ HK_FLAG_DOMAIN = BIT(HK_TYPE_DOMAIN),
+ HK_FLAG_KERNEL_NOISE_BOOT = BIT(HK_TYPE_KERNEL_NOISE_BOOT),
+ HK_FLAG_KERNEL_NOISE = BIT(HK_TYPE_KERNEL_NOISE),
+ HK_FLAG_MANAGED_IRQ_BOOT = BIT(HK_TYPE_MANAGED_IRQ_BOOT),
+ HK_FLAG_MANAGED_IRQ = BIT(HK_TYPE_MANAGED_IRQ),
};
DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
@@ -315,7 +317,7 @@ static int __init housekeeping_nohz_full_setup(char *str)
{
unsigned long flags;
- flags = HK_FLAG_KERNEL_NOISE;
+ flags = HK_FLAG_KERNEL_NOISE | HK_FLAG_KERNEL_NOISE_BOOT;
return housekeeping_setup(str, flags);
}
@@ -334,7 +336,7 @@ static int __init housekeeping_isolcpus_setup(char *str)
*/
if (!strncmp(str, "nohz,", 5)) {
str += 5;
- flags |= HK_FLAG_KERNEL_NOISE;
+ flags |= HK_FLAG_KERNEL_NOISE | HK_FLAG_KERNEL_NOISE_BOOT;
continue;
}
@@ -346,7 +348,7 @@ static int __init housekeeping_isolcpus_setup(char *str)
if (!strncmp(str, "managed_irq,", 12)) {
str += 12;
- flags |= HK_FLAG_MANAGED_IRQ;
+ flags |= HK_FLAG_MANAGED_IRQ | HK_FLAG_MANAGED_IRQ_BOOT;
continue;
}
--
2.53.0
* [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs
From: Waiman Long @ 2026-04-21 3:03 UTC (permalink / raw)
To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
Shuah Khan, Catalin Marinas, Will Deacon, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Guenter Roeck,
Frederic Weisbecker, Paul E. McKenney, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
Anna-Maria Behnsen, Ingo Molnar, Thomas Gleixner, Chen Ridong,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman
Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
Qiliang Yuan, Waiman Long
The "isolcpus=domain" CPU list implied by the HK_TYPE_DOMAIN housekeeping
cpumask is dynamically updated whenever the set of cpuset isolated
partitions changes. This patch series extends the isolated partition code
to make dynamic changes to the "nohz_full" and "isolcpus=managed_irq"
CPU lists as well. Changes to these CPU lists correspond to equivalent
changes in the HK_TYPE_KERNEL_NOISE and HK_TYPE_MANAGED_IRQ housekeeping
cpumasks.
To facilitate changing these CPU lists, the CPU hotplug code, which
does a lot of the heavy lifting, is now used. For changing the
"nohz_full" list and the HK_TYPE_KERNEL_NOISE cpumask, the affected CPUs
are first taken offline. Then the housekeeping cpumask
and the corresponding cpumasks in the RCU, tick and watchdog subsystems
are modified. After that, the affected CPUs are brought online again.
The order is slightly different for the "managed_irq" list and the
HK_TYPE_MANAGED_IRQ cpumask: the cpumask is updated first, and then the
affected CPUs are torn down and brought up to move the managed
interrupts away from the newly isolated CPUs.
Using CPU hotplug does have a drawback when multiple isolated
partitions are being managed in a system. The CPU offline process uses
the stop_machine mechanism to stop all the CPUs except the one being torn
down. This will cause latency spikes on isolated CPUs in other isolated
partitions. It is not a problem if only one isolated partition is
needed. This is an issue we need to address in the near future.
Patches 1-7 enable runtime updates to the nohz_full/managed_irq
cpumasks in the sched/isolation, tick, RCU and watchdog subsystems.
Patches 8-17 modify various subsystems that need access to the
HK_TYPE_MANAGED_IRQ cpumask and the HK_TYPE_KERNEL_NOISE cpumask and its
aliases. As with the runtime modifiable HK_TYPE_DOMAIN cpumask, RCU is
used to protect access to the cpumasks to avoid potential
use-after-free (UAF) problems. Patch 17
updates housekeeping_dereference_check() to print a WARNING when
lockdep is enabled if those housekeeping cpumasks are used without the
proper lock protection.
Patch 18 introduces a new cpuhp_offline_cb() API that shuts down a
given list of CPUs, runs a callback function and then brings those
CPUs up again, while disallowing any concurrent CPU hotplug
activity.
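The offline / callback / online sequence can be illustrated with a small user-space model. It is only a sketch: the real cpuhp_offline_cb() signature does not appear in this cover letter, so model_offline_cb(), struct cpu_model and the callback type below are hypothetical, and bringing the CPUs back up even when the callback fails is an assumption of the model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU in a 64-bit word. */
struct cpu_model {
	uint64_t online;
};

typedef int (*model_cb_t)(void *arg);

/*
 * Take the CPUs in @cpus offline, run @cb while they are down, then
 * bring the same CPUs back up. Returns the callback's return value.
 */
static int model_offline_cb(struct cpu_model *m, uint64_t cpus,
			    model_cb_t cb, void *arg)
{
	uint64_t taken = m->online & cpus;  /* only online CPUs can go down */
	int ret;

	assert(cb != NULL);

	m->online &= ~taken;  /* tear down */
	ret = cb(arg);        /* e.g. update housekeeping cpumasks */
	m->online |= taken;   /* bring up again (assumed even on failure) */
	return ret;
}

/* Example callback: swap in a new housekeeping mask, record what it saw. */
struct mask_update {
	const struct cpu_model *cpus;
	uint64_t *hk_mask;
	uint64_t new_hk_mask;
	uint64_t online_during_update;
};

static int apply_mask_update(void *arg)
{
	struct mask_update *u = arg;

	u->online_during_update = u->cpus->online;
	*u->hk_mask = u->new_hk_mask;
	return 0;
}
```

This matches the order described for the nohz_full case, where the cpumasks change while the affected CPUs are down. For the managed_irq case the cover letter says the cpumask is updated before the CPUs are cycled, so a model of that path would do the update before calling model_offline_cb().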
Patches 19-23 update the cpuset code, selftest and documentation to
allow changes in the isolated partition configuration to be reflected in
the housekeeping and other cpumasks dynamically.
As there is a slight overhead in enabling dynamic updates to the nohz_full
cpumask, this new nohz_full and managed_irq runtime update feature
has to be explicitly opted into by adding a nohz_full kernel command
line parameter, with or without a CPU list, to indicate a desire to use
this feature. Another reason is that a number of subsystems explicitly
check nohz_full at boot time to adjust their behavior, which may not
be easy to modify after boot.
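As an illustration of "with or without a CPU list", here is a minimal user-space sketch of accepting both "nohz_full" and "nohz_full=<list>". The helper name nohz_full_arg() is hypothetical, and rejecting other suffixes is an assumption of the sketch; the series itself handles this inside housekeeping_nohz_full_setup() by skipping a leading '='.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: given one command-line token, return the CPU list
 * that follows "nohz_full" (an empty string when the parameter is given
 * bare), or NULL when the token is not a nohz_full parameter.
 */
static const char *nohz_full_arg(const char *token)
{
	static const char name[] = "nohz_full";
	size_t len = sizeof(name) - 1;

	assert(token != NULL);

	if (strncmp(token, name, len) != 0)
		return NULL;
	token += len;
	if (*token == '=')   /* "nohz_full=<list>" */
		return token + 1;
	if (*token == '\0')  /* bare "nohz_full": opt in with an empty list */
		return token;
	return NULL;         /* e.g. "nohz_fullx" is some other parameter */
}
```

A caller can tell the two opt-in forms apart by checking whether the returned string is empty.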
Waiman Long (23):
sched/isolation: Add HK_TYPE_KERNEL_NOISE_BOOT &
HK_TYPE_MANAGED_IRQ_BOOT
sched/isolation: Enhance housekeeping_update() to support updating
more than one HK cpumask
tick/nohz: Make nohz_full parameter optional
tick/nohz: Allow runtime changes in full dynticks CPUs
tick: Pass timer tick job to an online HK CPU in tick_cpu_dying()
rcu/nocbs: Allow runtime changes in RCU NOCBS cpumask
watchdog: Sync up with runtime change of isolated CPUs
arm64: topology: Use RCU to protect access to HK_TYPE_TICK cpumask
workqueue: Use RCU to protect access of HK_TYPE_TIMER cpumask
cpu: Use RCU to protect access of HK_TYPE_TIMER cpumask
hrtimer: Use RCU to protect access of HK_TYPE_TIMER cpumask
net: Use boot time housekeeping cpumask settings for now
sched/core: Use RCU to protect access of HK_TYPE_KERNEL_NOISE cpumask
hwmon/coretemp: Use RCU to protect access of HK_TYPE_MISC cpumask
Drivers: hv: Use RCU to protect access of HK_TYPE_MANAGED_IRQ cpumask
genirq/cpuhotplug: Use RCU to protect access of HK_TYPE_MANAGED_IRQ
cpumask
sched/isolation: Extend housekeeping_dereference_check() to cover
changes in nohz_full or manged_irqs cpumasks
cpu/hotplug: Add a new cpuhp_offline_cb() API
cgroup/cpuset: Improve check for calling housekeeping_update()
cgroup/cpuset: Enable runtime update of
HK_TYPE_{KERNEL_NOISE,MANAGED_IRQ} cpumasks
cgroup/cpuset: Limit the side effect of using CPU hotplug on isolated
partition
cgroup/cpuset: Prevent offline_disabled CPUs from being used in
isolated partition
cgroup/cpuset: Documentation and kselftest updates
Documentation/admin-guide/cgroup-v2.rst | 35 ++-
.../admin-guide/kernel-parameters.txt | 15 +-
arch/arm64/kernel/topology.c | 17 +-
drivers/hv/channel_mgmt.c | 15 +-
drivers/hv/vmbus_drv.c | 7 +-
drivers/hwmon/coretemp.c | 6 +-
include/linux/context_tracking.h | 8 +-
include/linux/cpuhplock.h | 9 +
include/linux/nmi.h | 2 +
include/linux/rcupdate.h | 2 +
include/linux/sched/isolation.h | 18 +-
include/linux/tick.h | 2 +
kernel/cgroup/cpuset-internal.h | 1 +
kernel/cgroup/cpuset.c | 292 +++++++++++++++++-
kernel/context_tracking.c | 19 +-
kernel/cpu.c | 72 +++++
kernel/irq/cpuhotplug.c | 1 +
kernel/irq/manage.c | 1 +
kernel/rcu/tree_nocb.h | 24 +-
kernel/sched/core.c | 4 +-
kernel/sched/isolation.c | 135 +++++---
kernel/time/hrtimer.c | 4 +-
kernel/time/tick-common.c | 16 +-
kernel/time/tick-sched.c | 48 ++-
kernel/watchdog.c | 24 ++
kernel/workqueue.c | 6 +-
net/core/net-sysfs.c | 2 +-
.../selftests/cgroup/test_cpuset_prs.sh | 70 ++++-
28 files changed, 747 insertions(+), 108 deletions(-)
--
2.53.0
* [PATCH net v3] hv_sock: Report EOF instead of -EIO for FIN
From: Dexuan Cui @ 2026-04-21 2:59 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, sgarzare, davem, edumazet,
kuba, pabeni, horms, niuxuewei.nxw, linux-hyperv, virtualization,
netdev, linux-kernel
Cc: stable, Ben Hillis, Mitchell Levy
Commit f0c5827d07cb inadvertently causes a regression for the FIN packet:
the final read syscall gets an error rather than 0.
Ideally, we would want to fix hvs_channel_readable_payload() so that it
could return 0 in the FIN scenario, but it's not good for the hv_sock
driver to use the VMBus ringbuffer's cached priv_read_index, which is
internal data in the VMBus driver.
Fix the regression in hv_sock by returning 0 rather than -EIO.
In case we see a malformed/short packet, we still return -EIO.
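The decision for a leftover descriptor can be modelled in user space as follows. This is only a sketch: struct model_sock, model_stale_desc_result() and MODEL_SEND_SHUTDOWN are hypothetical stand-ins for the hvsock state and the kernel's SEND_SHUTDOWN flag, and the model covers only the branch where a descriptor is still held with no unread payload.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for the kernel's SEND_SHUTDOWN flag (value assumed here). */
#define MODEL_SEND_SHUTDOWN 0x2u

struct model_sock {
	long recv_data_len;          /* unread payload bytes in current packet */
	bool have_desc;              /* a packet descriptor is still held */
	unsigned int peer_shutdown;  /* shutdown flags reported by the peer */
};

/*
 * Result when the ring buffer still reports readable data, a descriptor
 * is held, and no payload is left: 0 (EOF) if the peer has sent FIN,
 * -EIO if the packet was malformed or short.
 */
static long model_stale_desc_result(const struct model_sock *s)
{
	assert(s->have_desc && s->recv_data_len == 0);

	if (s->peer_shutdown & MODEL_SEND_SHUTDOWN)
		return 0;   /* FIN already seen: report EOF */
	return -EIO;        /* malformed or short packet */
}
```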
Fixes: f0c5827d07cb ("hv_sock: Return the readable bytes in hvs_stream_has_data()")
Cc: stable@vger.kernel.org
Reported-by: Ben Hillis <Ben.Hillis@microsoft.com>
Reported-by: Mitchell Levy <levymitchell0@gmail.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
Changes since v1:
Removed the local variable 'need_refill' to make the code more
readable. Stefano, thanks!
No other change.
Changes since v2:
Added code to test the flag SEND_SHUTDOWN. Copilot, thanks!
Updated the comment and the commit messages accordingly.
net/vmw_vsock/hyperv_transport.c | 29 +++++++++++++++++++++++++----
1 file changed, 25 insertions(+), 4 deletions(-)
diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index 069386a74557..da150de10f0d 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -694,7 +694,6 @@ static ssize_t hvs_stream_enqueue(struct vsock_sock *vsk, struct msghdr *msg,
static s64 hvs_stream_has_data(struct vsock_sock *vsk)
{
struct hvsock *hvs = vsk->trans;
- bool need_refill;
s64 ret;
if (hvs->recv_data_len > 0)
@@ -702,9 +701,31 @@ static s64 hvs_stream_has_data(struct vsock_sock *vsk)
switch (hvs_channel_readable_payload(hvs->chan)) {
case 1:
- need_refill = !hvs->recv_desc;
- if (!need_refill)
- return -EIO;
+ if (hvs->recv_desc) {
+ /* Here hvs->recv_data_len is 0, so hvs->recv_desc must
+ * be NULL unless it points to the 0-byte-payload FIN
+ * packet or a malformed/short packet: see
+ * hvs_update_recv_data().
+ *
+ * If hvs->recv_desc points to the FIN packet, here all
+ * the payload has been dequeued and the peer_shutdown
+ * flag is set, but hvs_channel_readable_payload() still
+ * returns 1, because the VMBus ringbuffer's read_index
+ * is not updated for the FIN packet:
+ * hvs_stream_dequeue() -> hv_pkt_iter_next() updates
+ * the cached priv_read_index but has no opportunity to
+ * update the read_index in hv_pkt_iter_close() as
+ * hvs_stream_has_data() returns 0 for the FIN packet,
+ * so it won't get dequeued.
+ *
+ * In case hvs->recv_desc points to a malformed/short
+ * packet, return -EIO.
+ */
+ if (hvs->vsk->peer_shutdown & SEND_SHUTDOWN)
+ return 0;
+ else
+ return -EIO;
+ }
hvs->recv_desc = hv_pkt_iter_first(hvs->chan);
if (!hvs->recv_desc)
--
2.49.0
* Re: [PATCH 1/7] mshv: Convert from page pointers to PFNs
From: Stanislav Kinsburskii @ 2026-04-20 23:45 UTC (permalink / raw)
To: Michael Kelley
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB41571DD9045771941F371750D42F2@SN6PR02MB4157.namprd02.prod.outlook.com>
On Mon, Apr 20, 2026 at 05:18:10PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, April 20, 2026 9:22 AM
> >
> > On Mon, Apr 13, 2026 at 09:08:16PM +0000, Michael Kelley wrote:
> > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, March 30, 2026 1:04 PM
> > > >
>
> [snip]
>
> > > > @@ -57,60 +58,61 @@ static int mshv_chunk_stride(struct page *page,
> > > > /**
> > > > * mshv_region_process_chunk - Processes a contiguous chunk of memory pages
> > > > * in a region.
> > > > - * @region : Pointer to the memory region structure.
> > > > - * @flags : Flags to pass to the handler.
> > > > - * @page_offset: Offset into the region's pages array to start processing.
> > > > - * @page_count : Number of pages to process.
> > > > - * @handler : Callback function to handle the chunk.
> > > > + * @region : Pointer to the memory region structure.
> > > > + * @flags : Flags to pass to the handler.
> > > > + * @pfn_offset: Offset into the region's PFNs array to start processing.
> > > > + * @pfn_count : Number of PFNs to process.
> > > > + * @handler : Callback function to handle the chunk.
> > > > *
> > > > - * This function scans the region's pages starting from @page_offset,
> > > > - * checking for contiguous present pages of the same size (normal or huge).
> > > > - * It invokes @handler for the chunk of contiguous pages found. Returns the
> > > > - * number of pages handled, or a negative error code if the first page is
> > > > - * not present or the handler fails.
> > > > + * This function scans the region's PFNs starting from @pfn_offset,
> > > > + * checking for contiguous valid PFNs backed by pages of the same size
> > > > + * (normal or huge). It invokes @handler for the chunk of contiguous valid
> > > > + * PFNs found. Returns the number of PFNs handled, or a negative error code
> > > > + * if the first PFN is invalid or the handler fails.
> > > > *
> > > > - * Note: The @handler callback must be able to handle both normal and huge
> > > > - * pages.
> > > > + * Note: The @handler callback must be able to handle valid PFNs backed by
> > > > + * both normal and huge pages.
> > > > *
> > > > * Return: Number of pages handled, or negative error code.
> > > > */
> > > > -static long mshv_region_process_chunk(struct mshv_mem_region *region,
> > > > - u32 flags,
> > > > - u64 page_offset, u64 page_count,
> > > > - int (*handler)(struct mshv_mem_region *region,
> > > > - u32 flags,
> > > > - u64 page_offset,
> > > > - u64 page_count,
> > > > - bool huge_page))
> > > > +static long mshv_region_process_pfns(struct mshv_mem_region *region,
> > > > + u32 flags,
> > > > + u64 pfn_offset, u64 pfn_count,
> > > > + int (*handler)(struct mshv_mem_region *region,
> > > > + u32 flags,
> > > > + u64 pfn_offset,
> > > > + u64 pfn_count,
> > > > + bool huge_page))
> > > > {
> > > > - u64 gfn = region->start_gfn + page_offset;
> > > > + u64 gfn = region->start_gfn + pfn_offset;
> > > > u64 count;
> > > > - struct page *page;
> > > > + unsigned long pfn;
> > > > int stride, ret;
> > > >
> > > > - page = region->mreg_pages[page_offset];
> > > > - if (!page)
> > > > + pfn = region->mreg_pfns[pfn_offset];
> > > > + if (!pfn_valid(pfn))
> > > > return -EINVAL;
> > > >
> > > > - stride = mshv_chunk_stride(page, gfn, page_count);
> > > > + stride = mshv_chunk_stride(pfn_to_page(pfn), gfn, pfn_count);
> > > > if (stride < 0)
> > > > return stride;
> > > >
> > > > /* Start at stride since the first stride is validated */
> > > > - for (count = stride; count < page_count; count += stride) {
> > > > - page = region->mreg_pages[page_offset + count];
> > > > + for (count = stride; count < pfn_count ; count += stride) {
> > > > + pfn = region->mreg_pfns[pfn_offset + count];
> > > >
> > > > - /* Break if current page is not present */
> > > > - if (!page)
> > > > + /* Break if current pfn is invalid */
> > > > + if (!pfn_valid(pfn))
> > >
> > > pfn_valid() is a relatively expensive test to be doing in a loop
> > > on what may be every single page. It does an RCU lock/unlock
> > > and makes other checks that aren't necessary here. Since
> > > mreg_pfns[] is populated from mm calls, the only invalid PFNs
> > > would be MSHV_INVALID_PFN that code in this module has
> > > explicitly put there. Just testing against MSHV_INVALID_PFN
> > > would be a lot faster here and elsewhere in this module. It's
> > > really a "pfn set/not set" test. Defining a pfn_set() macro
> > > here in this module that tests against MSHV_INVALID_PFN
> > > would accomplish the same thing more efficiently.
> > >
> >
> > Yes, we could do it the way you suggest. For completeness, I should add
> > that pfn_valid() is expensive only on 32-bit ARM and ARC, which we
> > don’t care about.
> >
>
> Could you elaborate? On x86, I'm seeing that pfn_valid() is about
> 220 bytes of code. It's the version in include/linux/mmzone.h, not
> the simple version in include/asm-generic/memory_model.h. The
> latter is used only for CONFIG_FLATMEM=y. Or is the root partition
> kernel build setting CONFIG_FLATMEM_MANUAL and hence getting
> the simple version?
>
I was wrong: this long function is indeed compiled for x86.
Still, it doesn't have much of a runtime impact, as taking the RCU lock
is cheap, but I'll simplify as proposed.
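For reference, the proposed "pfn set/not set" test can be sketched in user space like this. The sentinel value and all names (MODEL_INVALID_PFN, model_pfn_set(), model_count_set_run()) are assumptions for illustration; the actual MSHV_INVALID_PFN definition is not shown in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed sentinel for "no PFN stored in this slot". */
#define MODEL_INVALID_PFN (~0UL)

/* Cheap "set / not set" test: a single compare, no locking. */
static inline bool model_pfn_set(unsigned long pfn)
{
	return pfn != MODEL_INVALID_PFN;
}

/*
 * Walk @pfns in steps of @stride and return how many entries precede
 * the first unset slot that is visited (or @count if none is unset).
 */
static size_t model_count_set_run(const unsigned long *pfns, size_t count,
				  size_t stride)
{
	size_t n;

	assert(stride > 0);

	for (n = 0; n < count; n += stride)
		if (!model_pfn_set(pfns[n]))
			break;
	return n < count ? n : count;
}
```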
Thanks,
Stanislav
> Michael
* Re: [PATCH net v2] hv_sock: Report EOF instead of -EIO for FIN
From: patchwork-bot+netdevbpf @ 2026-04-20 21:59 UTC (permalink / raw)
To: Dexuan Cui
Cc: kys, haiyangz, wei.liu, longli, sgarzare, davem, edumazet, kuba,
pabeni, horms, niuxuewei.nxw, linux-hyperv, virtualization,
netdev, linux-kernel, stable, Ben.Hillis, levymitchell0
In-Reply-To: <20260416191433.840637-1-decui@microsoft.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Thu, 16 Apr 2026 12:14:33 -0700 you wrote:
> Commit f0c5827d07cb unluckily causes a regression for the FIN packet,
> and the final read syscall gets an error rather than 0.
>
> Ideally, we would want to fix hvs_channel_readable_payload() so that it
> could return 0 in the FIN scenario, but it's not good for the hv_sock
> driver to use the VMBus ringbuffer's cached priv_read_index, which is
> internal data in the VMBus driver.
>
> [...]
Here is the summary with links:
- [net,v2] hv_sock: Report EOF instead of -EIO for FIN
https://git.kernel.org/netdev/net/c/f63152958994
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* RE: [PATCH 1/7] mshv: Convert from page pointers to PFNs
From: Michael Kelley @ 2026-04-20 17:18 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <aeZSnjgSkm7vJxhW@skinsburskii.localdomain>
From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, April 20, 2026 9:22 AM
>
> On Mon, Apr 13, 2026 at 09:08:16PM +0000, Michael Kelley wrote:
> > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, March 30, 2026 1:04 PM
> > >
[snip]
> > > @@ -57,60 +58,61 @@ static int mshv_chunk_stride(struct page *page,
> > > /**
> > > * mshv_region_process_chunk - Processes a contiguous chunk of memory pages
> > > * in a region.
> > > - * @region : Pointer to the memory region structure.
> > > - * @flags : Flags to pass to the handler.
> > > - * @page_offset: Offset into the region's pages array to start processing.
> > > - * @page_count : Number of pages to process.
> > > - * @handler : Callback function to handle the chunk.
> > > + * @region : Pointer to the memory region structure.
> > > + * @flags : Flags to pass to the handler.
> > > + * @pfn_offset: Offset into the region's PFNs array to start processing.
> > > + * @pfn_count : Number of PFNs to process.
> > > + * @handler : Callback function to handle the chunk.
> > > *
> > > - * This function scans the region's pages starting from @page_offset,
> > > - * checking for contiguous present pages of the same size (normal or huge).
> > > - * It invokes @handler for the chunk of contiguous pages found. Returns the
> > > - * number of pages handled, or a negative error code if the first page is
> > > - * not present or the handler fails.
> > > + * This function scans the region's PFNs starting from @pfn_offset,
> > > + * checking for contiguous valid PFNs backed by pages of the same size
> > > + * (normal or huge). It invokes @handler for the chunk of contiguous valid
> > > + * PFNs found. Returns the number of PFNs handled, or a negative error code
> > > + * if the first PFN is invalid or the handler fails.
> > > *
> > > - * Note: The @handler callback must be able to handle both normal and huge
> > > - * pages.
> > > + * Note: The @handler callback must be able to handle valid PFNs backed by
> > > + * both normal and huge pages.
> > > *
> > > * Return: Number of pages handled, or negative error code.
> > > */
> > > -static long mshv_region_process_chunk(struct mshv_mem_region *region,
> > > - u32 flags,
> > > - u64 page_offset, u64 page_count,
> > > - int (*handler)(struct mshv_mem_region *region,
> > > - u32 flags,
> > > - u64 page_offset,
> > > - u64 page_count,
> > > - bool huge_page))
> > > +static long mshv_region_process_pfns(struct mshv_mem_region *region,
> > > + u32 flags,
> > > + u64 pfn_offset, u64 pfn_count,
> > > + int (*handler)(struct mshv_mem_region *region,
> > > + u32 flags,
> > > + u64 pfn_offset,
> > > + u64 pfn_count,
> > > + bool huge_page))
> > > {
> > > - u64 gfn = region->start_gfn + page_offset;
> > > + u64 gfn = region->start_gfn + pfn_offset;
> > > u64 count;
> > > - struct page *page;
> > > + unsigned long pfn;
> > > int stride, ret;
> > >
> > > - page = region->mreg_pages[page_offset];
> > > - if (!page)
> > > + pfn = region->mreg_pfns[pfn_offset];
> > > + if (!pfn_valid(pfn))
> > > return -EINVAL;
> > >
> > > - stride = mshv_chunk_stride(page, gfn, page_count);
> > > + stride = mshv_chunk_stride(pfn_to_page(pfn), gfn, pfn_count);
> > > if (stride < 0)
> > > return stride;
> > >
> > > /* Start at stride since the first stride is validated */
> > > - for (count = stride; count < page_count; count += stride) {
> > > - page = region->mreg_pages[page_offset + count];
> > > + for (count = stride; count < pfn_count ; count += stride) {
> > > + pfn = region->mreg_pfns[pfn_offset + count];
> > >
> > > - /* Break if current page is not present */
> > > - if (!page)
> > > + /* Break if current pfn is invalid */
> > > + if (!pfn_valid(pfn))
> >
> > pfn_valid() is a relatively expensive test to be doing in a loop
> > on what may be every single page. It does an RCU lock/unlock
> > and makes other checks that aren't necessary here. Since
> > mreg_pfns[] is populated from mm calls, the only invalid PFNs
> > would be MSHV_INVALID_PFN that code in this module has
> > explicitly put there. Just testing against MSHV_INVALID_PFN
> > would be a lot faster here and elsewhere in this module. It's
> > really a "pfn set/not set" test. Defining a pfn_set() macro
> > here in this module that tests against MSHV_INVALID_PFN
> > would accomplish the same thing more efficiently.
> >
>
> Yes, we could do it the way you suggest. For completeness, I should add
> that pfn_valid() is expensive only on 32-bit ARM and ARC, which we
> don’t care about.
>
Could you elaborate? On x86, I'm seeing that pfn_valid() is about
220 bytes of code. It's the version in include/linux/mmzone.h, not
the simple version in include/asm-generic/memory_model.h. The
latter is used only for CONFIG_FLATMEM=y. Or is the root partition
kernel build setting CONFIG_FLATMEM_MANUAL and hence getting
the simple version?
Michael
* Re: [PATCH] mshv: remove page order restriction to enable 1G hugepage support
From: Stanislav Kinsburskii @ 2026-04-20 16:56 UTC (permalink / raw)
To: Anirudh Rayabharam (Microsoft)
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
linux-hyperv, linux-kernel
In-Reply-To: <20260416-huge_1g-v1-1-e066738cddfb@anirudhrb.com>
On Thu, Apr 16, 2026 at 01:37:15PM +0000, Anirudh Rayabharam (Microsoft) wrote:
> The hypervisor's map GPA hypercall handles large pages intelligently,
> combining 2M pages into 1G mappings when alignment allows.
>
> Remove the PMD_ORDER check in mshv_chunk_stride() so that 1G hugepages
> and other large page orders are passed through as 2M-aligned chunks,
> letting the hypervisor promote them to 1G mappings automatically.
>
> Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> ---
> drivers/hv/mshv_regions.c | 5 +----
> 1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index fdffd4f002f6..5f617a96d97a 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -29,7 +29,7 @@
> * Uses huge page stride if the backing page is huge and the guest mapping
> * is properly aligned; otherwise falls back to single page stride.
> *
> - * Return: Stride in pages, or -EINVAL if page order is unsupported.
> + * Return: Stride in pages.
> */
> static int mshv_chunk_stride(struct page *page,
> u64 gfn, u64 page_count)
Nit: the return type of the function should now become unsigned.
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> @@ -47,9 +47,6 @@ static int mshv_chunk_stride(struct page *page,
> return 1;
>
> page_order = folio_order(page_folio(page));
> - /* The hypervisor only supports 2M huge page */
> - if (page_order != PMD_ORDER)
> - return -EINVAL;
>
> return 1 << page_order;
> }
>
> ---
> base-commit: cd9f2e7d6e5b1837ef40b96e300fa28b73ab5a77
> change-id: 20260416-huge_1g-e44461393c8f
>
> Best regards,
> --
> Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
>
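
A rough userspace model of the post-patch stride computation with the unsigned return type suggested in the nit above. The parameters is_compound_head and page_order stand in for the PageCompound()/PageHead() checks and folio_order(page_folio(page)); they, the function name, and the PTRS_PER_PMD value of 512 (x86-64 with 4K pages) are assumptions for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value: 512 entries per PMD, as on x86-64 with 4K pages. */
#define SKETCH_PTRS_PER_PMD 512u

/*
 * Model of mshv_chunk_stride() once the PMD_ORDER check is removed.
 * With no error path left, the result is always a positive page count,
 * so an unsigned return type is sufficient.
 */
static unsigned int chunk_stride(bool is_compound_head, unsigned int page_order,
				 uint64_t gfn, uint64_t page_count)
{
	/* Fall back to single-page stride unless everything is 2M aligned. */
	if (!is_compound_head ||
	    gfn % SKETCH_PTRS_PER_PMD != 0 ||
	    page_count % SKETCH_PTRS_PER_PMD != 0)
		return 1;

	return 1u << page_order;
}
```

Callers that currently test "stride < 0" would need the same adjustment if the return type changes.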
* Re: [PATCH v3 0/7] mshv: Reduce memory consumption for unpinned regions
From: Stanislav Kinsburskii @ 2026-04-20 16:54 UTC (permalink / raw)
To: Anirudh Rayabharam
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260414-broad-abstract-anaconda-1c4ab1@anirudhrb>
On Tue, Apr 14, 2026 at 04:11:22PM +0000, Anirudh Rayabharam wrote:
> On Thu, Apr 09, 2026 at 03:23:59PM +0000, Stanislav Kinsburskii wrote:
> > This series reduces memory consumption for unpinned regions by avoiding
> > PFN array allocation. A 1GB unpinned region currently wastes 2MB for an
> > unused PFN array that HMM-managed regions don't need.
>
> This series has a dependency on "mshv: Refactor memory region management
> and map pages at creation" right?
>
Yes, it does.
Thanks,
Stanislav
> Anirudh.
>
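
The 2MB figure quoted above can be checked with a short calculation. This is only a worked example; it assumes 4 KiB pages and 8-byte PFN entries (a 64-bit unsigned long), and the helper name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions: 4 KiB pages and 8-byte PFN entries (64-bit unsigned long). */
#define SKETCH_PAGE_BYTES      4096ull
#define SKETCH_PFN_ENTRY_BYTES 8ull

/* Size of a per-page PFN array covering a region of region_bytes. */
static uint64_t pfn_array_bytes(uint64_t region_bytes)
{
	uint64_t nr_pfns = region_bytes / SKETCH_PAGE_BYTES;

	return nr_pfns * SKETCH_PFN_ENTRY_BYTES;
}
```

A 1 GiB region has 262144 pages, and 262144 entries of 8 bytes each come to 2 MiB.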
* Re: [PATCH 0/7] mshv: Refactor memory region management and map pages at creation
From: Stanislav Kinsburskii @ 2026-04-20 16:40 UTC (permalink / raw)
To: Michael Kelley
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157920DE623A9B2C613D282D4242@SN6PR02MB4157.namprd02.prod.outlook.com>
On Mon, Apr 13, 2026 at 09:07:59PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, March 30, 2026 1:04 PM
> >
> > This series refactors the mshv memory region subsystem in preparation
> > for mapping populated pages into the hypervisor at movable region
> > creation time, rather than relying solely on demand faulting.
> >
> > The primary motivation is to ensure that when userspace passes a
> > pre-populated mapping for a movable memory region, those pages are
> > immediately visible to the hypervisor. Previously, all movable regions
> > were created with HV_MAP_GPA_NO_ACCESS on every page regardless of
> > whether the backing pages were already present, deferring all mapping
> > to the fault handler. This added unnecessary fault overhead and
> > complicated the initial setup of child partitions with pre-populated
> > memory.
> >
>
> This is a nice set of changes. Independent of the new functionality
> for pre-populating, it improves the code organization and makes
> it more regular.
>
> See a few comments on individual patches. I noticed that Sashiko
> wasn't able to review the series because it wouldn't apply. Hopefully
> your v2 will apply. From what I've seen so far of Sashiko, it finds some
> good issues. I did run the patch set through Co-Pilot, but that didn't
> have the benefit of the AI prompts that Sashiko provides.
>
Thank you for your time.
Indeed, hopefully Sashiko will be able to review v2.
Thanks,
Stanislav
> Michael
* Re: [PATCH 5/7] mshv: Map populated pages on movable region creation
From: Stanislav Kinsburskii @ 2026-04-20 16:35 UTC (permalink / raw)
To: Michael Kelley
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB415725768DFBF9502CE942DDD4242@SN6PR02MB4157.namprd02.prod.outlook.com>
On Mon, Apr 13, 2026 at 09:09:08PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, March 30, 2026 1:05 PM
> >
> > Map any populated pages into the hypervisor upfront when creating a
> > movable region, rather than waiting for faults. Previously, movable
> > regions were created with all pages marked as HV_MAP_GPA_NO_ACCESS
> > regardless of whether the userspace mapping contained populated pages.
> >
> > This guarantees that if the caller passes a populated mapping, those
> > present pages will be mapped into the hypervisor immediately during
> > region creation instead of being faulted in later.
> >
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> > drivers/hv/mshv_regions.c | 65 ++++++++++++++++++++++++++++++++-----------
> > drivers/hv/mshv_root.h | 1 +
> > drivers/hv/mshv_root_main.c | 10 +------
> > 3 files changed, 50 insertions(+), 26 deletions(-)
> >
> > diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> > index 133ec7771812..28d3f488d89f 100644
> > --- a/drivers/hv/mshv_regions.c
> > +++ b/drivers/hv/mshv_regions.c
> > @@ -519,7 +519,8 @@ int mshv_region_get(struct mshv_mem_region *region)
> > static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> > unsigned long start,
> > unsigned long end,
> > - unsigned long *pfns)
> > + unsigned long *pfns,
> > + bool do_fault)
> > {
> > struct hmm_range range = {
> > .notifier = &region->mreg_mni,
> > @@ -540,9 +541,12 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> > range.hmm_pfns = pfns;
> > range.start = start;
> > range.end = min(vma->vm_end, end);
> > - range.default_flags = HMM_PFN_REQ_FAULT;
> > - if (vma->vm_flags & VM_WRITE)
> > - range.default_flags |= HMM_PFN_REQ_WRITE;
> > + range.default_flags = 0;
> > + if (do_fault) {
> > + range.default_flags = HMM_PFN_REQ_FAULT;
> > + if (vma->vm_flags & VM_WRITE)
> > + range.default_flags |= HMM_PFN_REQ_WRITE;
> > + }
> >
> > ret = hmm_range_fault(&range);
> > if (ret)
> > @@ -567,26 +571,40 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> > }
> >
> > /**
> > - * mshv_region_range_fault - Handle memory range faults for a given region.
> > - * @region: Pointer to the memory region structure.
> > - * @pfn_offset: Offset of the page within the region.
> > - * @pfn_count: Number of pages to handle.
> > + * mshv_region_collect_and_map - Collect PFNs for a user range and map them
> > + * @region : memory region being processed
> > + * @pfn_offset: PFNs offset within the region
> > + * @pfn_count : number of PFNs to process
> > + * @do_fault : if true, fault in missing pages;
> > + * if false, collect only present pages
> > *
> > - * This function resolves memory faults for a specified range of pages
> > - * within a memory region. It uses HMM (Heterogeneous Memory Management)
> > - * to fault in the required pages and updates the region's page array.
> > + * Collects PFNs for the specified portion of @region from the
> > + * corresponding userspace VMA and maps them into the hypervisor. The
>
> Actually, this should be "userspace VMAs" (i.e., plural)
>
Will change.
> > + * behavior depends on @do_fault:
> > *
> > - * Return: 0 on success, negative error code on failure.
> > + * - true: Fault in missing pages from userspace, ensuring all pages in the
> > + * range are present. Used for on-demand page population.
> > + * - false: Collect PFNs only for pages already present in userspace,
> > + * leaving missing pages as invalid PFN markers.
> > + * Used for initial region setup.
> > + *
> > + * Collected PFNs are stored in region->mreg_pfns[] with HMM bookkeeping
> > + * flags cleared, then the range is mapped into the hypervisor. Present
> > + * PFNs get mapped with region access permissions; missing PFNs (zero
> > + * entries) get mapped with no-access permissions.
>
> Hmmm. The missing PFNs are just skipped and the mreg_pfns[] array
> is not updated. Is the corresponding entry in mreg_pfns[] known to
> already be set to MSHV_INVALID_PFN? When mapping a new movable
> region, that appears to be so. I'm less sure about the
> mshv_region_range_fault() case, though mshv_region_invalidate_pfns()
> does such initialization of any entries that are invalidated. At that point
> in the code, I'd add a comment about that assumption, as it took me a
> bit to figure it out.
>
This logic is called for movable regions only.
Should this be mentioned in the comment from your POV?
> So does the comment about "zero entries" refer to what is returned
> by hmm_range_fault() via mshv_region_hmm_fault_and_lock()?
> The mention of "zero entries" here is a bit confusing.
>
"Zero entries" should be changed to invalid PFN markers, which are
defined as MSHV_INVALID_PFN. I'll update the comment to clarify that.
Thanks,
Stanislav
> > + *
> > + * Return: 0 on success, negative errno on failure.
> > */
> > -static int mshv_region_range_fault(struct mshv_mem_region *region,
> > - u64 pfn_offset, u64 pfn_count)
> > +static int mshv_region_collect_and_map(struct mshv_mem_region *region,
> > + u64 pfn_offset, u64 pfn_count,
> > + bool do_fault)
> > {
> > unsigned long start, end;
> > unsigned long *pfns;
> > int ret;
> > u64 i;
> >
> > - pfns = kmalloc_array(pfn_count, sizeof(*pfns), GFP_KERNEL);
> > + pfns = vmalloc_array(pfn_count, sizeof(unsigned long));
> > if (!pfns)
> > return -ENOMEM;
> >
> > @@ -595,7 +613,7 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> >
> > do {
> > ret = mshv_region_hmm_fault_and_lock(region, start, end,
> > - pfns);
> > + pfns, do_fault);
> > } while (ret == -EBUSY);
> >
> > if (ret)
> > @@ -613,10 +631,17 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> >
> > mutex_unlock(&region->mreg_mutex);
> > out:
> > - kfree(pfns);
> > + vfree(pfns);
> > return ret;
> > }
> >
> > +static int mshv_region_range_fault(struct mshv_mem_region *region,
> > + u64 pfn_offset, u64 pfn_count)
> > +{
> > + return mshv_region_collect_and_map(region, pfn_offset, pfn_count,
> > + true);
> > +}
> > +
> > bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
> > {
> > u64 pfn_offset, pfn_count;
> > @@ -800,3 +825,9 @@ int mshv_map_pinned_region(struct mshv_mem_region *region)
> > err_out:
> > return ret;
> > }
> > +
> > +int mshv_map_movable_region(struct mshv_mem_region *region)
> > +{
> > + return mshv_region_collect_and_map(region, 0, region->nr_pfns,
> > + false);
> > +}
> > diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> > index d2e65a137bf4..02c1c11f701c 100644
> > --- a/drivers/hv/mshv_root.h
> > +++ b/drivers/hv/mshv_root.h
> > @@ -374,5 +374,6 @@ bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn);
> > void mshv_region_movable_fini(struct mshv_mem_region *region);
> > bool mshv_region_movable_init(struct mshv_mem_region *region);
> > int mshv_map_pinned_region(struct mshv_mem_region *region);
> > +int mshv_map_movable_region(struct mshv_mem_region *region);
> >
> > #endif /* _MSHV_ROOT_H_ */
> > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > index c393b5144e0b..91dab2a3bc92 100644
> > --- a/drivers/hv/mshv_root_main.c
> > +++ b/drivers/hv/mshv_root_main.c
> > @@ -1299,15 +1299,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
> > ret = mshv_map_pinned_region(region);
> > break;
> > case MSHV_REGION_TYPE_MEM_MOVABLE:
> > - /*
> > - * For movable memory regions, remap with no access to let
> > - * the hypervisor track dirty pages, enabling pre-copy live
> > - * migration.
> > - */
> > - ret = hv_call_map_ram_pfns(partition->pt_id,
> > - region->start_gfn,
> > - region->nr_pfns,
> > - HV_MAP_GPA_NO_ACCESS, NULL);
> > + ret = mshv_map_movable_region(region);
> > break;
> > case MSHV_REGION_TYPE_MMIO:
> > ret = hv_call_map_mmio_pfns(partition->pt_id,
> >
> >
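
As a compact restatement of the do_fault handling in the hunk above, the flag selection can be modelled as a pure function. The bit values below are stand-ins chosen for this sketch; the kernel defines the real HMM_PFN_REQ_* constants in include/linux/hmm.h, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in bit values for illustration only, not the kernel's constants. */
#define SKETCH_REQ_FAULT (1ul << 0)
#define SKETCH_REQ_WRITE (1ul << 1)

/*
 * Mirrors the patch: request nothing when only collecting present pages,
 * otherwise request a fault, plus write access for writable VMAs.
 */
static unsigned long hmm_default_flags(bool do_fault, bool vma_writable)
{
	unsigned long flags = 0;

	if (do_fault) {
		flags = SKETCH_REQ_FAULT;
		if (vma_writable)
			flags |= SKETCH_REQ_WRITE;
	}
	return flags;
}
```

With do_fault false the VMA's write permission has no effect on the request, which matches the "collect only present pages" mode used at region creation.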
* Re: [PATCH 3/7] mshv: Support regions with different VMAs
From: Stanislav Kinsburskii @ 2026-04-20 16:29 UTC (permalink / raw)
To: Michael Kelley
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157043516DEB3DC5E987AA6D4242@SN6PR02MB4157.namprd02.prod.outlook.com>
On Mon, Apr 13, 2026 at 09:08:52PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, March 30, 2026 1:04 PM
> >
> > Allow HMM fault handling across memory regions that span multiple VMAs
> > with different protection flags. The previous implementation assumed a
> > single VMA per region, which would fail when guest memory crosses VMA
> > boundaries.
> >
> > Iterate through VMAs within the range and handle each separately with
> > appropriate protection flags, enabling more flexible memory region
> > configurations for partitions.
> >
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> > drivers/hv/mshv_regions.c | 72 +++++++++++++++++++++++++++++++++------------
> > 1 file changed, 52 insertions(+), 20 deletions(-)
> >
> > diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> > index ed9c55841140..1bb1bfe177e2 100644
> > --- a/drivers/hv/mshv_regions.c
> > +++ b/drivers/hv/mshv_regions.c
> > @@ -492,37 +492,72 @@ int mshv_region_get(struct mshv_mem_region *region)
> > }
> >
> > /**
> > - * mshv_region_hmm_fault_and_lock - Handle HMM faults and lock the memory region
> > + * mshv_region_hmm_fault_and_lock - Handle HMM faults across VMAs and lock
> > + * the memory region
> > * @region: Pointer to the memory region structure
> > - * @range: Pointer to the HMM range structure
> > + * @start : Starting virtual address of the range to fault
> > + * @end : Ending virtual address of the range to fault (exclusive)
> > + * @pfns : Output array for page frame numbers with HMM flags
> > *
> > * This function performs the following steps:
> > * 1. Reads the notifier sequence for the HMM range.
> > * 2. Acquires a read lock on the memory map.
> > - * 3. Handles HMM faults for the specified range.
> > - * 4. Releases the read lock on the memory map.
> > - * 5. If successful, locks the memory region mutex.
> > - * 6. Verifies if the notifier sequence has changed during the operation.
> > - * If it has, releases the mutex and returns -EBUSY to match with
> > - * hmm_range_fault() return code for repeating.
> > + * 3. Iterates through VMAs in the specified range, handling each
> > + * separately with appropriate protection flags (HMM_PFN_REQ_WRITE set
> > + * based on VMA flags).
> > + * 4. Handles HMM faults for each VMA segment.
> > + * 5. Releases the read lock on the memory map.
> > + * 6. If successful, locks the memory region mutex.
> > + * 7. Verifies if the notifier sequence has changed during the operation.
> > + * If it has, releases the mutex and returns -EBUSY to signal retry.
> > + *
> > + * The function expects the range [start, end] is backed by valid VMAs.
>
> Use "[start, end)" to describe the range since end is exclusive.
>
Will do
> > + * Returns -EFAULT if any address in the range is not covered by a VMA.
> > *
> > * Return: 0 on success, a negative error code otherwise.
> > */
> > static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> > - struct hmm_range *range)
> > + unsigned long start,
> > + unsigned long end,
> > + unsigned long *pfns)
> > {
> > + struct hmm_range range = {
> > .notifier = &region->mreg_mni,
> > + };
> > int ret;
> >
> > - range->notifier_seq = mmu_interval_read_begin(range->notifier);
> > + range.notifier_seq = mmu_interval_read_begin(range.notifier);
> > mmap_read_lock(region->mreg_mni.mm);
> > - ret = hmm_range_fault(range);
> > + while (start < end) {
> > + struct vm_area_struct *vma;
> > +
> > + vma = vma_lookup(current->mm, start);
>
> The mmap_read_lock() was obtained on region->mreg_mni.mm, but the
> lookup is done against current->mm. Maybe these are the same, but
> it looks wrong. (Pointed out by a Co-Pilot AI review.)
>
Yes, they are the same, but I'll update to use the same mm for clarity.
> > + if (!vma) {
> > + ret = -EFAULT;
> > + break;
> > + }
> > +
> > + range.hmm_pfns = pfns;
> > + range.start = start;
> > + range.end = min(vma->vm_end, end);
> > + range.default_flags = HMM_PFN_REQ_FAULT;
> > + if (vma->vm_flags & VM_WRITE)
> > + range.default_flags |= HMM_PFN_REQ_WRITE;
> > +
> > + ret = hmm_range_fault(&range);
> > + if (ret)
> > + break;
> > +
> > + start = range.end + 1;
>
> Since range.end is exclusive, the +1 should not be done.
>
Is it always? I'll need to check to make sure the end passed to this
function is page aligned. If it is, then I'll remove the +1.
> > + pfns += DIV_ROUND_UP(range.end - range.start, PAGE_SIZE);
>
> Just to confirm, range.end and range.start should always be page aligned,
> right? So the ROUND_UP should never kick in.
>
Same as above: if the end passed to this function is page aligned, then
I'll remove the DIV_ROUND_UP and just do a simple division.
Thanks,
Stanislav
> > + }
> > mmap_read_unlock(region->mreg_mni.mm);
> > if (ret)
> > return ret;
> >
> > mutex_lock(&region->mreg_mutex);
> >
> > - if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
> > + if (mmu_interval_read_retry(range.notifier, range.notifier_seq)) {
> > mutex_unlock(&region->mreg_mutex);
> > cond_resched();
> > return -EBUSY;
> > @@ -546,10 +581,7 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> > static int mshv_region_range_fault(struct mshv_mem_region *region,
> > u64 pfn_offset, u64 pfn_count)
> > {
> > - struct hmm_range range = {
> > - .notifier = &region->mreg_mni,
> > - .default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
> > - };
> > + unsigned long start, end;
> > unsigned long *pfns;
> > int ret;
> > u64 i;
> > @@ -558,12 +590,12 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> > if (!pfns)
> > return -ENOMEM;
> >
> > - range.hmm_pfns = pfns;
> > - range.start = region->start_uaddr + pfn_offset * HV_HYP_PAGE_SIZE;
> > - range.end = range.start + pfn_count * HV_HYP_PAGE_SIZE;
> > + start = region->start_uaddr + pfn_offset * PAGE_SIZE;
> > + end = start + pfn_count * PAGE_SIZE;
> >
> > do {
> > - ret = mshv_region_hmm_fault_and_lock(region, &range);
> > + ret = mshv_region_hmm_fault_and_lock(region, start, end,
> > + pfns);
> > } while (ret == -EBUSY);
> >
> > if (ret)
> >
> >
>
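
To make the half-open range point concrete, here is a small userspace model of walking [start, end) across several VMAs. The struct seg type and both function names are hypothetical stand-ins (seg_lookup() plays the role of vma_lookup()); the sketch assumes 4 KiB pages and page-aligned inputs, which is the condition still to be confirmed in the replies above.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096ul

/* Hypothetical stand-in for a VMA: covers [vm_start, vm_end). */
struct seg {
	unsigned long vm_start;
	unsigned long vm_end;
};

/* Return the segment containing addr, or NULL (models vma_lookup()). */
static const struct seg *seg_lookup(const struct seg *segs, size_t n,
				    unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr >= segs[i].vm_start && addr < segs[i].vm_end)
			return &segs[i];
	return NULL;
}

/*
 * Walk [start, end) one segment at a time. Returns the number of pages
 * visited, or -1 if an address is not covered (the patch uses -EFAULT).
 */
static long walk_range(const struct seg *segs, size_t n,
		       unsigned long start, unsigned long end)
{
	long pages = 0;

	while (start < end) {
		const struct seg *s = seg_lookup(segs, n, start);
		unsigned long seg_end;

		if (!s)
			return -1;

		seg_end = s->vm_end < end ? s->vm_end : end;
		/* Aligned inputs: plain division, no rounding needed. */
		pages += (seg_end - start) / SKETCH_PAGE_SIZE;
		/* seg_end is exclusive, so it is the next start: no +1. */
		start = seg_end;
	}
	return pages;
}
```

With an exclusive end, adding 1 to the next start would make every later segment begin one byte into a page, so the simple division would undercount and a rounded one would depend on alignment by accident.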
* Re: [PATCH 2/7] mshv: Add support to address range holes remapping
From: Stanislav Kinsburskii @ 2026-04-20 16:24 UTC (permalink / raw)
To: Michael Kelley
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157D44B15BAA0F3CA8B078BD4242@SN6PR02MB4157.namprd02.prod.outlook.com>
On Mon, Apr 13, 2026 at 09:08:31PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, March 30, 2026 1:04 PM
> >
> > Consolidate memory region processing to handle both valid and invalid PFNs
> > uniformly. This eliminates code duplication across remap, unmap, share, and
> > unshare operations by using a common range processing interface.
> >
> > Holes are now remapped with no-access permissions to enable
> > hypervisor dirty page tracking for precopy live migration.
> >
> > This refactoring is a precursor to an upcoming change that will map
> > present pages in movable regions upon region creation, requiring
> > consistent handling of both mapped and unmapped ranges.
> >
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> > drivers/hv/mshv_regions.c | 108 ++++++++++++++++++++++++++++++++++++++++-----
> > 1 file changed, 95 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> > index b1a707d16c07..ed9c55841140 100644
> > --- a/drivers/hv/mshv_regions.c
> > +++ b/drivers/hv/mshv_regions.c
> > @@ -119,6 +119,57 @@ static long mshv_region_process_pfns(struct mshv_mem_region *region,
> > return count;
> > }
> >
> > +/**
> > + * mshv_region_process_hole - Handle a hole (invalid PFNs) in a memory
> > + * region
> > + * @region : Memory region containing the hole
> > + * @flags : Flags to pass to the handler function
> > + * @pfn_offset: Starting PFN offset within the region
> > + * @pfn_count : Number of PFNs in the hole
> > + * @handler : Callback function to invoke for the hole
> > + *
> > + * Invokes the handler function for a contiguous hole with the specified
> > + * parameters.
> > + *
> > + * Return: Number of PFNs handled, or negative error code.
> > + */
> > +static long mshv_region_process_hole(struct mshv_mem_region *region,
> > + u32 flags,
> > + u64 pfn_offset, u64 pfn_count,
> > + int (*handler)(struct mshv_mem_region *region,
> > + u32 flags,
> > + u64 pfn_offset,
> > + u64 pfn_count,
> > + bool huge_page))
> > +{
> > + long ret;
> > +
> > + ret = handler(region, flags, pfn_offset, pfn_count, 0);
> > + if (ret)
> > + return ret;
> > +
> > + return pfn_count;
> > +}
> > +
> > +static long mshv_region_process_chunk(struct mshv_mem_region *region,
> > + u32 flags,
> > + u64 pfn_offset, u64 pfn_count,
> > + int (*handler)(struct mshv_mem_region *region,
> > + u32 flags,
> > + u64 pfn_offset,
> > + u64 pfn_count,
> > + bool huge_page))
> > +{
> > + if (pfn_valid(region->mreg_pfns[pfn_offset]))
> > + return mshv_region_process_pfns(region, flags,
> > + pfn_offset, pfn_count,
> > + handler);
> > + else
> > + return mshv_region_process_hole(region, flags,
> > + pfn_offset, pfn_count,
> > + handler);
> > +}
> > +
> > /**
> > * mshv_region_process_range - Processes a range of PFNs in a region.
> > * @region : Pointer to the memory region structure.
> > @@ -146,33 +197,47 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
> > u64 pfn_count,
> > bool huge_page))
> > {
> > - u64 pfn_end;
> > + u64 start, end;
> > long ret;
> >
> > - if (check_add_overflow(pfn_offset, pfn_count, &pfn_end))
> > + if (!pfn_count)
> > + return 0;
> > +
> > + if (check_add_overflow(pfn_offset, pfn_count, &end))
> > return -EOVERFLOW;
> >
> > - if (pfn_end > region->nr_pfns)
> > + if (end > region->nr_pfns)
> > return -EINVAL;
> >
> > - while (pfn_count) {
> > - /* Skip non-present pages */
> > - if (!pfn_valid(region->mreg_pfns[pfn_offset])) {
> > - pfn_offset++;
> > - pfn_count--;
> > + start = pfn_offset;
> > + end = pfn_offset + 1;
> > +
> > + while (end < pfn_offset + pfn_count) {
> > + /*
> > + * Accumulate contiguous pfns with the same validity
> > + * (valid or not).
> > + */
> > + if (pfn_valid(region->mreg_pfns[start]) ==
> > + pfn_valid(region->mreg_pfns[end])) {
> > + end++;
> > continue;
> > }
> >
> > - ret = mshv_region_process_pfns(region, flags,
> > - pfn_offset, pfn_count,
> > - handler);
> > + ret = mshv_region_process_chunk(region, flags,
> > + start, end - start,
> > + handler);
> > if (ret < 0)
> > return ret;
> >
> > - pfn_offset += ret;
> > - pfn_count -= ret;
> > + start += ret;
> > }
> >
> > + ret = mshv_region_process_chunk(region, flags,
> > + start, end - start,
> > + handler);
> > + if (ret < 0)
> > + return ret;
> > +
> > return 0;
> > }
> >
> > @@ -208,6 +273,9 @@ static int mshv_region_chunk_share(struct mshv_mem_region *region,
> > u64 pfn_offset, u64 pfn_count,
> > bool huge_page)
> > {
> > + if (!pfn_valid(region->mreg_pfns[pfn_offset]))
> > + return -EINVAL;
> > +
> > if (huge_page)
> > flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> >
> > @@ -233,6 +301,9 @@ static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
> > u64 pfn_offset, u64 pfn_count,
> > bool huge_page)
> > {
> > + if (!pfn_valid(region->mreg_pfns[pfn_offset]))
> > + return -EINVAL;
> > +
> > if (huge_page)
> > flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> >
> > @@ -256,6 +327,14 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
> > u64 pfn_offset, u64 pfn_count,
> > bool huge_page)
> > {
> > + /*
> > + * Remap missing pages with no access to let the
> > + * hypervisor track dirty pages, enabling precopy live
> > + * migration.
> > + */
> > + if (!pfn_valid(region->mreg_pfns[pfn_offset]))
> > + flags = HV_MAP_GPA_NO_ACCESS;
>
> Is it OK to wipe out any other flags that might be set? Certainly, any previous
> flags in PERMISSIONS_MASK should be removed, but what about ADJUSTABLE
> and NOT_CACHED?
>
Yes, this is the right approach. The HV_MAP_GPA_NO_ACCESS flag will
immediately cause a hypervisor fault on any access to the page. So
caching and adjustability no longer matter.
Thanks,
Stanislav
> > +
> > if (huge_page)
> > flags |= HV_MAP_GPA_LARGE_PAGE;
> >
> > @@ -357,6 +436,9 @@ static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
> > u64 pfn_offset, u64 pfn_count,
> > bool huge_page)
> > {
> > + if (!pfn_valid(region->mreg_pfns[pfn_offset]))
> > + return 0;
> > +
> > if (huge_page)
> > flags |= HV_UNMAP_GPA_LARGE_PAGE;
> >
> >
> >
>
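
The "accumulate contiguous pfns with the same validity" idea from this patch can be sketched in isolation as a run splitter. This is a simplified userspace model, not the patch's loop: struct run and split_runs() are made-up names, validity is tested against the sentinel rather than with pfn_valid(), and each run is emitted whole, whereas the driver may consume a valid run in several steps when the stride changes.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MSHV_INVALID_PFN ULONG_MAX

/* A maximal run of entries that are either all set or all holes. */
struct run {
	size_t offset;
	size_t count;
	bool   valid;
};

/*
 * Split pfns[0..n) into runs of equal validity. Stores at most max_runs
 * entries in out and returns the total number of runs found.
 */
static size_t split_runs(const unsigned long *pfns, size_t n,
			 struct run *out, size_t max_runs)
{
	size_t start = 0, nr = 0;

	while (start < n) {
		bool valid = pfns[start] != MSHV_INVALID_PFN;
		size_t end = start + 1;

		/* Extend the run while validity matches the first entry. */
		while (end < n && (pfns[end] != MSHV_INVALID_PFN) == valid)
			end++;

		if (nr < max_runs)
			out[nr] = (struct run){ start, end - start, valid };
		nr++;
		start = end;
	}
	return nr;
}
```

A caller would then dispatch valid runs to the normal handler path and holes to the no-access path, which is the split that mshv_region_process_chunk() makes in the patch.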
* Re: [PATCH 1/7] mshv: Convert from page pointers to PFNs
From: Stanislav Kinsburskii @ 2026-04-20 16:21 UTC (permalink / raw)
To: Michael Kelley
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157CD26728B2D4BFD171DB7D4242@SN6PR02MB4157.namprd02.prod.outlook.com>
On Mon, Apr 13, 2026 at 09:08:16PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, March 30, 2026 1:04 PM
> >
> > The HMM interface returns PFNs from hmm_range_fault(), and the
> > hypervisor hypercalls operate on PFNs. Storing page pointers in
> > between these interfaces requires unnecessary conversions and
> > temporary allocations.
> >
> > Store PFNs directly in memory regions to match the natural data flow.
> > This eliminates the temporary PFN array allocation in the HMM fault
> > path and reduces page_to_pfn() conversions throughout the driver.
> > Convert to page structs via pfn_to_page() only when operations like
> > unpin_user_page() require them.
>
> General comment for this series: PFN fields are typed as "unsigned long".
> But pfn_offset and pfn_count are "u64". GFNs are also "u64". Any
> reason not to make PFNs also "u64"? I know that pfn_valid() takes
> an "unsigned long" input, but see comment below about pfn_valid().
>
The only reason is to keep the type consistent with the standard Linux
kernel definition of PFN as unsigned long.
> >
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> > drivers/hv/mshv_regions.c | 297 ++++++++++++++++++++++------------------
> > drivers/hv/mshv_root.h | 20 +--
> > drivers/hv/mshv_root_hv_call.c | 50 +++----
> > drivers/hv/mshv_root_main.c | 30 ++--
> > 4 files changed, 212 insertions(+), 185 deletions(-)
> >
> > diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> > index fdffd4f002f6..b1a707d16c07 100644
> > --- a/drivers/hv/mshv_regions.c
> > +++ b/drivers/hv/mshv_regions.c
> > @@ -18,12 +18,13 @@
> > #include "mshv_root.h"
> >
> > #define MSHV_MAP_FAULT_IN_PAGES PTRS_PER_PMD
> > +#define MSHV_INVALID_PFN ULONG_MAX
> >
> > /**
> > * mshv_chunk_stride - Compute stride for mapping guest memory
> > * @page : The page to check for huge page backing
> > * @gfn : Guest frame number for the mapping
> > - * @page_count: Total number of pages in the mapping
> > + * @pfn_count: Total number of pages in the mapping
>
> Nit: The colons are misaligned after this change.
>
> > *
> > * Determines the appropriate stride (in pages) for mapping guest memory.
> > * Uses huge page stride if the backing page is huge and the guest mapping
> > @@ -32,18 +33,18 @@
> > * Return: Stride in pages, or -EINVAL if page order is unsupported.
> > */
> > static int mshv_chunk_stride(struct page *page,
> > - u64 gfn, u64 page_count)
> > + u64 gfn, u64 pfn_count)
> > {
> > unsigned int page_order;
> >
> > /*
> > * Use single page stride by default. For huge page stride, the
> > * page must be compound and point to the head of the compound
> > - * page, and both gfn and page_count must be huge-page aligned.
> > + * page, and both gfn and pfn_count must be huge-page aligned.
> > */
> > if (!PageCompound(page) || !PageHead(page) ||
> > !IS_ALIGNED(gfn, PTRS_PER_PMD) ||
> > - !IS_ALIGNED(page_count, PTRS_PER_PMD))
> > + !IS_ALIGNED(pfn_count, PTRS_PER_PMD))
> > return 1;
> >
> > page_order = folio_order(page_folio(page));
> > @@ -57,60 +58,61 @@ static int mshv_chunk_stride(struct page *page,
> > /**
> > * mshv_region_process_chunk - Processes a contiguous chunk of memory pages
> > * in a region.
> > - * @region : Pointer to the memory region structure.
> > - * @flags : Flags to pass to the handler.
> > - * @page_offset: Offset into the region's pages array to start processing.
> > - * @page_count : Number of pages to process.
> > - * @handler : Callback function to handle the chunk.
> > + * @region : Pointer to the memory region structure.
> > + * @flags : Flags to pass to the handler.
> > + * @pfn_offset: Offset into the region's PFNs array to start processing.
> > + * @pfn_count : Number of PFNs to process.
> > + * @handler : Callback function to handle the chunk.
> > *
> > - * This function scans the region's pages starting from @page_offset,
> > - * checking for contiguous present pages of the same size (normal or huge).
> > - * It invokes @handler for the chunk of contiguous pages found. Returns the
> > - * number of pages handled, or a negative error code if the first page is
> > - * not present or the handler fails.
> > + * This function scans the region's PFNs starting from @pfn_offset,
> > + * checking for contiguous valid PFNs backed by pages of the same size
> > + * (normal or huge). It invokes @handler for the chunk of contiguous valid
> > + * PFNs found. Returns the number of PFNs handled, or a negative error code
> > + * if the first PFN is invalid or the handler fails.
> > *
> > - * Note: The @handler callback must be able to handle both normal and huge
> > - * pages.
> > + * Note: The @handler callback must be able to handle valid PFNs backed by
> > + * both normal and huge pages.
> > *
> > * Return: Number of pages handled, or negative error code.
> > */
> > -static long mshv_region_process_chunk(struct mshv_mem_region *region,
> > - u32 flags,
> > - u64 page_offset, u64 page_count,
> > - int (*handler)(struct mshv_mem_region *region,
> > - u32 flags,
> > - u64 page_offset,
> > - u64 page_count,
> > - bool huge_page))
> > +static long mshv_region_process_pfns(struct mshv_mem_region *region,
> > + u32 flags,
> > + u64 pfn_offset, u64 pfn_count,
> > + int (*handler)(struct mshv_mem_region *region,
> > + u32 flags,
> > + u64 pfn_offset,
> > + u64 pfn_count,
> > + bool huge_page))
> > {
> > - u64 gfn = region->start_gfn + page_offset;
> > + u64 gfn = region->start_gfn + pfn_offset;
> > u64 count;
> > - struct page *page;
> > + unsigned long pfn;
> > int stride, ret;
> >
> > - page = region->mreg_pages[page_offset];
> > - if (!page)
> > + pfn = region->mreg_pfns[pfn_offset];
> > + if (!pfn_valid(pfn))
> > return -EINVAL;
> >
> > - stride = mshv_chunk_stride(page, gfn, page_count);
> > + stride = mshv_chunk_stride(pfn_to_page(pfn), gfn, pfn_count);
> > if (stride < 0)
> > return stride;
> >
> > /* Start at stride since the first stride is validated */
> > - for (count = stride; count < page_count; count += stride) {
> > - page = region->mreg_pages[page_offset + count];
> > + for (count = stride; count < pfn_count ; count += stride) {
> > + pfn = region->mreg_pfns[pfn_offset + count];
> >
> > - /* Break if current page is not present */
> > - if (!page)
> > + /* Break if current pfn is invalid */
> > + if (!pfn_valid(pfn))
>
> pfn_valid() is a relatively expensive test to be doing in a loop
> on what may be every single page. It does an RCU lock/unlock
> and makes other checks that aren't necessary here. Since
> mreg_pfns[] is populated from mm calls, the only invalid PFNs
> would be MSHV_INVALID_PFN that code in this module has
> explicitly put there. Just testing against MSHV_INVALID_PFN
> would be a lot faster here and elsewhere in this module. It's
> really a "pfn set/not set" test. Defining a pfn_set() macro
> here in this module that tests against MSHV_INVALID_PFN
> would accomplish the same thing more efficiently.
>
Yes, we could do it the way you suggest. For completeness, I should add
that pfn_valid() is expensive only on 32-bit ARM and ARC, which we
don’t care about.
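As a rough illustration of the "pfn set/not set" test suggested above,
here is a minimal userspace sketch. It is not code from the patch: the
sentinel value ~0UL is an assumption (the real MSHV_INVALID_PFN
definition is not quoted in this thread), and count_set_run() is a
hypothetical helper that only mirrors the shape of the chunk scan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: the sentinel is all-ones; the real definition is not quoted here. */
#define MSHV_INVALID_PFN (~0UL)

/* Hypothetical helper: true if the slot holds a PFN the module stored there. */
static inline bool pfn_set(unsigned long pfn)
{
	return pfn != MSHV_INVALID_PFN;
}

/* Length of the run of set PFNs starting at offset, looking at most count entries. */
static size_t count_set_run(const unsigned long *pfns, size_t offset, size_t count)
{
	size_t i;

	assert(pfns != NULL);

	for (i = 0; i < count; i++)
		if (!pfn_set(pfns[offset + i]))
			break;
	return i;
}
```

The comparison is a single integer test per slot, which is the point of
the suggestion: the array only ever holds PFNs obtained from mm calls or
the sentinel written by this module.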
> > break;
> >
> > /* Break if stride size changes */
> > - if (stride != mshv_chunk_stride(page, gfn + count,
> > - page_count - count))
> > + if (stride != mshv_chunk_stride(pfn_to_page(pfn),
> > + gfn + count,
> > + pfn_count - count))
> > break;
> > }
> >
> > - ret = handler(region, flags, page_offset, count, stride > 1);
> > + ret = handler(region, flags, pfn_offset, count, stride > 1);
> > if (ret)
> > return ret;
> >
> > @@ -118,70 +120,73 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
> > }
> >
> > /**
> > - * mshv_region_process_range - Processes a range of memory pages in a
> > - * region.
> > - * @region : Pointer to the memory region structure.
> > - * @flags : Flags to pass to the handler.
> > - * @page_offset: Offset into the region's pages array to start processing.
> > - * @page_count : Number of pages to process.
> > - * @handler : Callback function to handle each chunk of contiguous
> > - * pages.
> > + * mshv_region_process_range - Processes a range of PFNs in a region.
> > + * @region : Pointer to the memory region structure.
> > + * @flags : Flags to pass to the handler.
> > + * @pfn_offset: Offset into the region's PFNs array to start processing.
> > + * @pfn_count : Number of PFNs to process.
> > + * @handler : Callback function to handle each chunk of contiguous
> > + * valid PFNs.
> > *
> > - * Iterates over the specified range of pages in @region, skipping
> > - * non-present pages. For each contiguous chunk of present pages, invokes
> > - * @handler via mshv_region_process_chunk.
> > + * Iterates over the specified range of PFNs in @region, skipping
> > + * invalid PFNs. For each contiguous chunk of valid PFNS, invokes
> > + * @handler via mshv_region_process_pfns.
> > *
> > - * Note: The @handler callback must be able to handle both normal and huge
> > - * pages.
> > + * Note: The @handler callback must be able to handle PFNs backed by both
> > + * normal and huge pages.
> > *
> > * Returns 0 on success, or a negative error code on failure.
> > */
> > static int mshv_region_process_range(struct mshv_mem_region *region,
> > u32 flags,
> > - u64 page_offset, u64 page_count,
> > + u64 pfn_offset, u64 pfn_count,
> > int (*handler)(struct mshv_mem_region *region,
> > u32 flags,
> > - u64 page_offset,
> > - u64 page_count,
> > + u64 pfn_offset,
> > + u64 pfn_count,
> > bool huge_page))
> > {
> > + u64 pfn_end;
>
> In Patch 2 of this series, "pfn_end" is changed to just "end", and
> the references are adjusted. Patch 2 could be a few lines smaller if it
> was named "end" here and Patch 2 didn't have to change it.
>
Sure, can do.
> > long ret;
> >
> > - if (page_offset + page_count > region->nr_pages)
> > + if (check_add_overflow(pfn_offset, pfn_count, &pfn_end))
> > + return -EOVERFLOW;
> > +
> > + if (pfn_end > region->nr_pfns)
> > return -EINVAL;
> >
> > - while (page_count) {
> > + while (pfn_count) {
> > /* Skip non-present pages */
> > - if (!region->mreg_pages[page_offset]) {
> > - page_offset++;
> > - page_count--;
> > + if (!pfn_valid(region->mreg_pfns[pfn_offset])) {
> > + pfn_offset++;
> > + pfn_count--;
> > continue;
> > }
> >
> > - ret = mshv_region_process_chunk(region, flags,
> > - page_offset,
> > - page_count,
> > - handler);
> > + ret = mshv_region_process_pfns(region, flags,
> > + pfn_offset, pfn_count,
> > + handler);
> > if (ret < 0)
> > return ret;
> >
> > - page_offset += ret;
> > - page_count -= ret;
> > + pfn_offset += ret;
> > + pfn_count -= ret;
> > }
> >
> > return 0;
> > }
> >
> > -struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
> > +struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pfns,
> > u64 uaddr, u32 flags)
> > {
> > struct mshv_mem_region *region;
> > + u64 i;
> >
> > - region = vzalloc(sizeof(*region) + sizeof(struct page *) * nr_pages);
> > + region = vzalloc(sizeof(*region) + sizeof(unsigned long) * nr_pfns);
>
> Use struct_size(region, mreg_pfns, nr_pfns) instead of open coding the arithmetic?
>
This is new to me. Sure, will do.
Thanks,
Stanislav
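For readers who have not used struct_size(): it computes the size of a
structure plus a trailing flexible array and saturates to SIZE_MAX on
overflow instead of wrapping. The userspace sketch below imitates that
behaviour with compiler builtins. The names struct region, region_size()
and region_alloc() are made up for this example and are not part of the
patch or of the kernel API.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct region {
	uint64_t nr_pfns;
	unsigned long pfns[];	/* flexible array member, like mreg_pfns[] */
};

/*
 * Hypothetical stand-in for struct_size(region, pfns, n): header size
 * plus n array elements, saturating to SIZE_MAX if the math overflows.
 */
static size_t region_size(size_t n)
{
	size_t bytes;

	if (__builtin_mul_overflow(n, sizeof(unsigned long), &bytes) ||
	    __builtin_add_overflow(bytes, sizeof(struct region), &bytes))
		return SIZE_MAX;
	return bytes;
}

/* Allocate a zeroed region with room for n PFNs; NULL on overflow or OOM. */
static struct region *region_alloc(size_t n)
{
	size_t bytes = region_size(n);
	struct region *r;

	if (bytes == SIZE_MAX)
		return NULL;

	r = calloc(1, bytes);
	if (r)
		r->nr_pfns = n;
	return r;
}
```

With the open-coded arithmetic, a very large count can wrap around and
produce a small allocation; the saturating form turns that case into an
allocation failure.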
> > if (!region)
> > return ERR_PTR(-ENOMEM);
> >
> > - region->nr_pages = nr_pages;
> > + region->nr_pfns = nr_pfns;
> > region->start_gfn = guest_pfn;
> > region->start_uaddr = uaddr;
> > region->hv_map_flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_ADJUSTABLE;
> > @@ -190,6 +195,9 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
> > if (flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
> > region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
> >
> > + for (i = 0; i < nr_pfns; i++)
> > + region->mreg_pfns[i] = MSHV_INVALID_PFN;
> > +
> > kref_init(&region->mreg_refcount);
> >
> > return region;
> > @@ -197,15 +205,15 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
> >
> > static int mshv_region_chunk_share(struct mshv_mem_region *region,
> > u32 flags,
> > - u64 page_offset, u64 page_count,
> > + u64 pfn_offset, u64 pfn_count,
> > bool huge_page)
> > {
> > if (huge_page)
> > flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> >
> > return hv_call_modify_spa_host_access(region->partition->pt_id,
> > - region->mreg_pages + page_offset,
> > - page_count,
> > + region->mreg_pfns + pfn_offset,
> > + pfn_count,
> > HV_MAP_GPA_READABLE |
> > HV_MAP_GPA_WRITABLE,
> > flags, true);
> > @@ -216,21 +224,21 @@ int mshv_region_share(struct mshv_mem_region *region)
> > u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED;
> >
> > return mshv_region_process_range(region, flags,
> > - 0, region->nr_pages,
> > + 0, region->nr_pfns,
> > mshv_region_chunk_share);
> > }
> >
> > static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
> > u32 flags,
> > - u64 page_offset, u64 page_count,
> > + u64 pfn_offset, u64 pfn_count,
> > bool huge_page)
> > {
> > if (huge_page)
> > flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> >
> > return hv_call_modify_spa_host_access(region->partition->pt_id,
> > - region->mreg_pages + page_offset,
> > - page_count, 0,
> > + region->mreg_pfns + pfn_offset,
> > + pfn_count, 0,
> > flags, false);
> > }
> >
> > @@ -239,30 +247,30 @@ int mshv_region_unshare(struct mshv_mem_region *region)
> > u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE;
> >
> > return mshv_region_process_range(region, flags,
> > - 0, region->nr_pages,
> > + 0, region->nr_pfns,
> > mshv_region_chunk_unshare);
> > }
> >
> > static int mshv_region_chunk_remap(struct mshv_mem_region *region,
> > u32 flags,
> > - u64 page_offset, u64 page_count,
> > + u64 pfn_offset, u64 pfn_count,
> > bool huge_page)
> > {
> > if (huge_page)
> > flags |= HV_MAP_GPA_LARGE_PAGE;
> >
> > - return hv_call_map_gpa_pages(region->partition->pt_id,
> > - region->start_gfn + page_offset,
> > - page_count, flags,
> > - region->mreg_pages + page_offset);
> > + return hv_call_map_ram_pfns(region->partition->pt_id,
> > + region->start_gfn + pfn_offset,
> > + pfn_count, flags,
> > + region->mreg_pfns + pfn_offset);
> > }
> >
> > -static int mshv_region_remap_pages(struct mshv_mem_region *region,
> > - u32 map_flags,
> > - u64 page_offset, u64 page_count)
> > +static int mshv_region_remap_pfns(struct mshv_mem_region *region,
> > + u32 map_flags,
> > + u64 pfn_offset, u64 pfn_count)
> > {
> > return mshv_region_process_range(region, map_flags,
> > - page_offset, page_count,
> > + pfn_offset, pfn_count,
> > mshv_region_chunk_remap);
> > }
> >
> > @@ -270,38 +278,50 @@ int mshv_region_map(struct mshv_mem_region *region)
> > {
> > u32 map_flags = region->hv_map_flags;
> >
> > - return mshv_region_remap_pages(region, map_flags,
> > - 0, region->nr_pages);
> > + return mshv_region_remap_pfns(region, map_flags,
> > + 0, region->nr_pfns);
> > }
> >
> > -static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
> > - u64 page_offset, u64 page_count)
> > +static void mshv_region_invalidate_pfns(struct mshv_mem_region *region,
> > + u64 pfn_offset, u64 pfn_count)
> > {
> > - if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
> > - unpin_user_pages(region->mreg_pages + page_offset, page_count);
> > + u64 i;
> > +
> > + for (i = pfn_offset; i < pfn_offset + pfn_count; i++) {
> > + if (!pfn_valid(region->mreg_pfns[i]))
> > + continue;
> > +
> > + if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
> > + unpin_user_page(pfn_to_page(region->mreg_pfns[i]));
> >
> > - memset(region->mreg_pages + page_offset, 0,
> > - page_count * sizeof(struct page *));
> > + region->mreg_pfns[i] = MSHV_INVALID_PFN;
> > + }
> > }
> >
> > void mshv_region_invalidate(struct mshv_mem_region *region)
> > {
> > - mshv_region_invalidate_pages(region, 0, region->nr_pages);
> > + mshv_region_invalidate_pfns(region, 0, region->nr_pfns);
> > }
> >
> > int mshv_region_pin(struct mshv_mem_region *region)
> > {
> > - u64 done_count, nr_pages;
> > + u64 done_count, nr_pfns, i;
> > + unsigned long *pfns;
> > struct page **pages;
> > __u64 userspace_addr;
> > int ret;
> >
> > - for (done_count = 0; done_count < region->nr_pages; done_count += ret) {
> > - pages = region->mreg_pages + done_count;
> > + pages = kmalloc_array(MSHV_PIN_PAGES_BATCH_SIZE,
> > + sizeof(struct page *), GFP_KERNEL);
> > + if (!pages)
> > + return -ENOMEM;
> > +
> > + for (done_count = 0; done_count < region->nr_pfns; done_count += ret) {
> > + pfns = region->mreg_pfns + done_count;
> > userspace_addr = region->start_uaddr +
> > done_count * HV_HYP_PAGE_SIZE;
> > - nr_pages = min(region->nr_pages - done_count,
> > - MSHV_PIN_PAGES_BATCH_SIZE);
> > + nr_pfns = min(region->nr_pfns - done_count,
> > + MSHV_PIN_PAGES_BATCH_SIZE);
> >
> > /*
> > * Pinning assuming 4k pages works for large pages too.
> > @@ -311,39 +331,44 @@ int mshv_region_pin(struct mshv_mem_region *region)
> > * with the FOLL_LONGTERM flag does a large temporary
> > * allocation of contiguous memory.
> > */
> > - ret = pin_user_pages_fast(userspace_addr, nr_pages,
> > + ret = pin_user_pages_fast(userspace_addr, nr_pfns,
> > FOLL_WRITE | FOLL_LONGTERM,
> > pages);
> > - if (ret != nr_pages)
> > + if (ret != nr_pfns)
> > goto release_pages;
> > +
> > + for (i = 0; i < ret; i++)
> > + pfns[i] = page_to_pfn(pages[i]);
> > }
> >
> > + kfree(pages);
> > return 0;
> >
> > release_pages:
> > if (ret > 0)
> > done_count += ret;
> > - mshv_region_invalidate_pages(region, 0, done_count);
> > + mshv_region_invalidate_pfns(region, 0, done_count);
> > + kfree(pages);
> > return ret < 0 ? ret : -ENOMEM;
> > }
> >
> > static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
> > u32 flags,
> > - u64 page_offset, u64 page_count,
> > + u64 pfn_offset, u64 pfn_count,
> > bool huge_page)
> > {
> > if (huge_page)
> > flags |= HV_UNMAP_GPA_LARGE_PAGE;
> >
> > - return hv_call_unmap_gpa_pages(region->partition->pt_id,
> > - region->start_gfn + page_offset,
> > - page_count, flags);
> > + return hv_call_unmap_pfns(region->partition->pt_id,
> > + region->start_gfn + pfn_offset,
> > + pfn_count, flags);
> > }
> >
> > static int mshv_region_unmap(struct mshv_mem_region *region)
> > {
> > return mshv_region_process_range(region, 0,
> > - 0, region->nr_pages,
> > + 0, region->nr_pfns,
> > mshv_region_chunk_unmap);
> > }
> >
> > @@ -427,8 +452,8 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> > /**
> > * mshv_region_range_fault - Handle memory range faults for a given region.
> > * @region: Pointer to the memory region structure.
> > - * @page_offset: Offset of the page within the region.
> > - * @page_count: Number of pages to handle.
> > + * @pfn_offset: Offset of the page within the region.
> > + * @pfn_count: Number of pages to handle.
> > *
> > * This function resolves memory faults for a specified range of pages
> > * within a memory region. It uses HMM (Heterogeneous Memory Management)
> > @@ -437,7 +462,7 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> > * Return: 0 on success, negative error code on failure.
> > */
> > static int mshv_region_range_fault(struct mshv_mem_region *region,
> > - u64 page_offset, u64 page_count)
> > + u64 pfn_offset, u64 pfn_count)
> > {
> > struct hmm_range range = {
> > .notifier = &region->mreg_mni,
> > @@ -447,13 +472,13 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> > int ret;
> > u64 i;
> >
> > - pfns = kmalloc_array(page_count, sizeof(*pfns), GFP_KERNEL);
> > + pfns = kmalloc_array(pfn_count, sizeof(*pfns), GFP_KERNEL);
> > if (!pfns)
> > return -ENOMEM;
> >
> > range.hmm_pfns = pfns;
> > - range.start = region->start_uaddr + page_offset * HV_HYP_PAGE_SIZE;
> > - range.end = range.start + page_count * HV_HYP_PAGE_SIZE;
> > + range.start = region->start_uaddr + pfn_offset * HV_HYP_PAGE_SIZE;
> > + range.end = range.start + pfn_count * HV_HYP_PAGE_SIZE;
> >
> > do {
> > ret = mshv_region_hmm_fault_and_lock(region, &range);
> > @@ -462,11 +487,15 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> > if (ret)
> > goto out;
> >
> > - for (i = 0; i < page_count; i++)
> > - region->mreg_pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
> > + for (i = 0; i < pfn_count; i++) {
> > + if (!(pfns[i] & HMM_PFN_VALID))
> > + continue;
> > + /* Drop HMM_PFN_* flags to ensure PFNs are valid. */
> > + region->mreg_pfns[pfn_offset + i] = pfns[i] & ~HMM_PFN_FLAGS;
> > + }
> >
> > - ret = mshv_region_remap_pages(region, region->hv_map_flags,
> > - page_offset, page_count);
> > + ret = mshv_region_remap_pfns(region, region->hv_map_flags,
> > + pfn_offset, pfn_count);
> >
> > mutex_unlock(&region->mreg_mutex);
> > out:
> > @@ -476,24 +505,24 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> >
> > bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
> > {
> > - u64 page_offset, page_count;
> > + u64 pfn_offset, pfn_count;
> > int ret;
> >
> > /* Align the page offset to the nearest MSHV_MAP_FAULT_IN_PAGES. */
> > - page_offset = ALIGN_DOWN(gfn - region->start_gfn,
> > - MSHV_MAP_FAULT_IN_PAGES);
> > + pfn_offset = ALIGN_DOWN(gfn - region->start_gfn,
> > + MSHV_MAP_FAULT_IN_PAGES);
> >
> > /* Map more pages than requested to reduce the number of faults. */
> > - page_count = min(region->nr_pages - page_offset,
> > - MSHV_MAP_FAULT_IN_PAGES);
> > + pfn_count = min(region->nr_pfns - pfn_offset,
> > + MSHV_MAP_FAULT_IN_PAGES);
> >
> > - ret = mshv_region_range_fault(region, page_offset, page_count);
> > + ret = mshv_region_range_fault(region, pfn_offset, pfn_count);
> >
> > WARN_ONCE(ret,
> > - "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, page_offset %llu, page_count %llu\n",
> > + "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, pfn_offset %llu, pfn_count %llu\n",
> > region->partition->pt_id, region->start_uaddr,
> > - region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
> > - gfn, page_offset, page_count);
> > + region->start_uaddr + (region->nr_pfns << HV_HYP_PAGE_SHIFT),
> > + gfn, pfn_offset, pfn_count);
> >
> > return !ret;
> > }
> > @@ -523,16 +552,16 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
> > struct mshv_mem_region *region = container_of(mni,
> > struct mshv_mem_region,
> > mreg_mni);
> > - u64 page_offset, page_count;
> > + u64 pfn_offset, pfn_count;
> > unsigned long mstart, mend;
> > int ret = -EPERM;
> >
> > mstart = max(range->start, region->start_uaddr);
> > mend = min(range->end, region->start_uaddr +
> > - (region->nr_pages << HV_HYP_PAGE_SHIFT));
> > + (region->nr_pfns << HV_HYP_PAGE_SHIFT));
> >
> > - page_offset = HVPFN_DOWN(mstart - region->start_uaddr);
> > - page_count = HVPFN_DOWN(mend - mstart);
> > + pfn_offset = HVPFN_DOWN(mstart - region->start_uaddr);
> > + pfn_count = HVPFN_DOWN(mend - mstart);
> >
> > if (mmu_notifier_range_blockable(range))
> > mutex_lock(&region->mreg_mutex);
> > @@ -541,12 +570,12 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
> >
> > mmu_interval_set_seq(mni, cur_seq);
> >
> > - ret = mshv_region_remap_pages(region, HV_MAP_GPA_NO_ACCESS,
> > - page_offset, page_count);
> > + ret = mshv_region_remap_pfns(region, HV_MAP_GPA_NO_ACCESS,
> > + pfn_offset, pfn_count);
> > if (ret)
> > goto out_unlock;
> >
> > - mshv_region_invalidate_pages(region, page_offset, page_count);
> > + mshv_region_invalidate_pfns(region, pfn_offset, pfn_count);
> >
> > mutex_unlock(&region->mreg_mutex);
> >
> > @@ -558,9 +587,9 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
> > WARN_ONCE(ret,
> > "Failed to invalidate region %#llx-%#llx (range %#lx-%#lx, event: %u, pages %#llx-%#llx, mm: %#llx): %d\n",
> > region->start_uaddr,
> > - region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
> > + region->start_uaddr + (region->nr_pfns << HV_HYP_PAGE_SHIFT),
> > range->start, range->end, range->event,
> > - page_offset, page_offset + page_count - 1, (u64)range->mm, ret);
> > + pfn_offset, pfn_offset + pfn_count - 1, (u64)range->mm, ret);
> > return false;
> > }
> >
> > @@ -579,7 +608,7 @@ bool mshv_region_movable_init(struct mshv_mem_region *region)
> >
> > ret = mmu_interval_notifier_insert(&region->mreg_mni, current->mm,
> > region->start_uaddr,
> > - region->nr_pages << HV_HYP_PAGE_SHIFT,
> > + region->nr_pfns << HV_HYP_PAGE_SHIFT,
> > &mshv_region_mni_ops);
> > if (ret)
> > return false;
> > diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> > index 947dfb76bb19..f1d4bee97a3f 100644
> > --- a/drivers/hv/mshv_root.h
> > +++ b/drivers/hv/mshv_root.h
> > @@ -84,15 +84,15 @@ enum mshv_region_type {
> > struct mshv_mem_region {
> > struct hlist_node hnode;
> > struct kref mreg_refcount;
> > - u64 nr_pages;
> > + u64 nr_pfns;
> > u64 start_gfn;
> > u64 start_uaddr;
> > u32 hv_map_flags;
> > struct mshv_partition *partition;
> > enum mshv_region_type mreg_type;
> > struct mmu_interval_notifier mreg_mni;
> > - struct mutex mreg_mutex; /* protects region pages remapping */
> > - struct page *mreg_pages[];
> > + struct mutex mreg_mutex; /* protects region PFNs remapping */
> > + unsigned long mreg_pfns[];
> > };
> >
> > struct mshv_irq_ack_notifier {
> > @@ -282,11 +282,11 @@ int hv_call_create_partition(u64 flags,
> > int hv_call_initialize_partition(u64 partition_id);
> > int hv_call_finalize_partition(u64 partition_id);
> > int hv_call_delete_partition(u64 partition_id);
> > -int hv_call_map_mmio_pages(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs);
> > -int hv_call_map_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
> > - u32 flags, struct page **pages);
> > -int hv_call_unmap_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
> > - u32 flags);
> > +int hv_call_map_mmio_pfns(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs);
> > +int hv_call_map_ram_pfns(u64 partition_id, u64 gpa_target, u64 pfn_count,
> > + u32 flags, unsigned long *pfns);
> > +int hv_call_unmap_pfns(u64 partition_id, u64 gpa_target, u64 pfn_count,
> > + u32 flags);
> > int hv_call_delete_vp(u64 partition_id, u32 vp_index);
> > int hv_call_assert_virtual_interrupt(u64 partition_id, u32 vector,
> > u64 dest_addr,
> > @@ -329,8 +329,8 @@ int hv_map_stats_page(enum hv_stats_object_type type,
> > int hv_unmap_stats_page(enum hv_stats_object_type type,
> > struct hv_stats_page *page_addr,
> > const union hv_stats_object_identity *identity);
> > -int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
> > - u64 page_struct_count, u32 host_access,
> > +int hv_call_modify_spa_host_access(u64 partition_id, unsigned long *pfns,
> > + u64 pfns_count, u32 host_access,
> > u32 flags, u8 acquire);
> > int hv_call_get_partition_property_ex(u64 partition_id, u64 property_code, u64 arg,
> > void *property_value, size_t property_value_sz);
> > diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> > index cb55d4d4be2e..a95f2cfc5da5 100644
> > --- a/drivers/hv/mshv_root_hv_call.c
> > +++ b/drivers/hv/mshv_root_hv_call.c
> > @@ -188,17 +188,16 @@ int hv_call_delete_partition(u64 partition_id)
> > return hv_result_to_errno(status);
> > }
> >
> > -/* Ask the hypervisor to map guest ram pages or the guest mmio space */
> > -static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> > - u32 flags, struct page **pages, u64 mmio_spa)
> > +static int hv_do_map_pfns(u64 partition_id, u64 gfn, u64 pfns_count,
> > + u32 flags, unsigned long *pfns, u64 mmio_spa)
> > {
> > struct hv_input_map_gpa_pages *input_page;
> > u64 status, *pfnlist;
> > unsigned long irq_flags, large_shift = 0;
> > int ret = 0, done = 0;
> > - u64 page_count = page_struct_count;
> > + u64 page_count = pfns_count;
> >
> > - if (page_count == 0 || (pages && mmio_spa))
> > + if (page_count == 0 || (pfns && mmio_spa))
> > return -EINVAL;
> >
> > if (flags & HV_MAP_GPA_LARGE_PAGE) {
> > @@ -227,14 +226,14 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> > for (i = 0; i < rep_count; i++)
> > if (flags & HV_MAP_GPA_NO_ACCESS) {
> > pfnlist[i] = 0;
> > - } else if (pages) {
> > + } else if (pfns) {
> > u64 index = (done + i) << large_shift;
> >
> > - if (index >= page_struct_count) {
> > + if (index >= pfns_count) {
> > ret = -EINVAL;
> > break;
> > }
> > - pfnlist[i] = page_to_pfn(pages[index]);
> > + pfnlist[i] = pfns[index];
> > } else {
> > pfnlist[i] = mmio_spa + done + i;
> > }
> > @@ -266,37 +265,37 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> >
> > if (flags & HV_MAP_GPA_LARGE_PAGE)
> > unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> > - hv_call_unmap_gpa_pages(partition_id, gfn, done, unmap_flags);
> > + hv_call_unmap_pfns(partition_id, gfn, done, unmap_flags);
> > }
> >
> > return ret;
> > }
> >
> > /* Ask the hypervisor to map guest ram pages */
> > -int hv_call_map_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
> > - u32 flags, struct page **pages)
> > +int hv_call_map_ram_pfns(u64 partition_id, u64 gfn, u64 pfn_count,
> > + u32 flags, unsigned long *pfns)
> > {
> > - return hv_do_map_gpa_hcall(partition_id, gpa_target, page_count,
> > - flags, pages, 0);
> > + return hv_do_map_pfns(partition_id, gfn, pfn_count, flags,
> > + pfns, 0);
> > }
> >
> > -/* Ask the hypervisor to map guest mmio space */
> > -int hv_call_map_mmio_pages(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs)
> > +int hv_call_map_mmio_pfns(u64 partition_id, u64 gfn, u64 mmio_spa,
> > + u64 pfn_count)
> > {
> > int i;
> > u32 flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_WRITABLE |
> > HV_MAP_GPA_NOT_CACHED;
> >
> > - for (i = 0; i < numpgs; i++)
> > + for (i = 0; i < pfn_count; i++)
> > if (page_is_ram(mmio_spa + i))
> > return -EINVAL;
> >
> > - return hv_do_map_gpa_hcall(partition_id, gfn, numpgs, flags, NULL,
> > - mmio_spa);
> > + return hv_do_map_pfns(partition_id, gfn, pfn_count, flags,
> > + NULL, mmio_spa);
> > }
> >
> > -int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
> > - u32 flags)
> > +int hv_call_unmap_pfns(u64 partition_id, u64 gfn, u64 page_count_4k,
> > + u32 flags)
> > {
> > struct hv_input_unmap_gpa_pages *input_page;
> > u64 status, page_count = page_count_4k;
> > @@ -1009,15 +1008,15 @@ int hv_unmap_stats_page(enum hv_stats_object_type type,
> > return ret;
> > }
> >
> > -int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
> > - u64 page_struct_count, u32 host_access,
> > +int hv_call_modify_spa_host_access(u64 partition_id, unsigned long *pfns,
> > + u64 pfns_count, u32 host_access,
> > u32 flags, u8 acquire)
> > {
> > struct hv_input_modify_sparse_spa_page_host_access *input_page;
> > u64 status;
> > int done = 0;
> > unsigned long irq_flags, large_shift = 0;
> > - u64 page_count = page_struct_count;
> > + u64 page_count = pfns_count;
> > u16 code = acquire ? HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS :
> > HVCALL_RELEASE_SPARSE_SPA_PAGE_HOST_ACCESS;
> >
> > @@ -1051,11 +1050,10 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
> > for (i = 0; i < rep_count; i++) {
> > u64 index = (done + i) << large_shift;
> >
> > - if (index >= page_struct_count)
> > + if (index >= pfns_count)
> > return -EINVAL;
> >
> > - input_page->spa_page_list[i] =
> > - page_to_pfn(pages[index]);
> > + input_page->spa_page_list[i] = pfns[index];
> > }
> >
> > status = hv_do_rep_hypercall(code, rep_count, 0, input_page,
> > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > index f2d83d6c8c4f..685e4b562186 100644
> > --- a/drivers/hv/mshv_root_main.c
> > +++ b/drivers/hv/mshv_root_main.c
> > @@ -619,7 +619,7 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
> >
> > hlist_for_each_entry(region, &partition->pt_mem_regions, hnode) {
> > if (gfn >= region->start_gfn &&
> > - gfn < region->start_gfn + region->nr_pages)
> > + gfn < region->start_gfn + region->nr_pfns)
> > return region;
> > }
> >
> > @@ -1221,20 +1221,20 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
> > bool is_mmio)
> > {
> > struct mshv_mem_region *rg;
> > - u64 nr_pages = HVPFN_DOWN(mem->size);
> > + u64 nr_pfns = HVPFN_DOWN(mem->size);
> >
> > /* Reject overlapping regions */
> > spin_lock(&partition->pt_mem_regions_lock);
> > hlist_for_each_entry(rg, &partition->pt_mem_regions, hnode) {
> > - if (mem->guest_pfn + nr_pages <= rg->start_gfn ||
> > - rg->start_gfn + rg->nr_pages <= mem->guest_pfn)
> > + if (mem->guest_pfn + nr_pfns <= rg->start_gfn ||
> > + rg->start_gfn + rg->nr_pfns <= mem->guest_pfn)
> > continue;
> > spin_unlock(&partition->pt_mem_regions_lock);
> > return -EEXIST;
> > }
> > spin_unlock(&partition->pt_mem_regions_lock);
> >
> > - rg = mshv_region_create(mem->guest_pfn, nr_pages,
> > + rg = mshv_region_create(mem->guest_pfn, nr_pfns,
> > mem->userspace_addr, mem->flags);
> > if (IS_ERR(rg))
> > return PTR_ERR(rg);
> > @@ -1372,21 +1372,21 @@ mshv_map_user_memory(struct mshv_partition *partition,
> > * the hypervisor track dirty pages, enabling pre-copy live
> > * migration.
> > */
> > - ret = hv_call_map_gpa_pages(partition->pt_id,
> > - region->start_gfn,
> > - region->nr_pages,
> > - HV_MAP_GPA_NO_ACCESS, NULL);
> > + ret = hv_call_map_ram_pfns(partition->pt_id,
> > + region->start_gfn,
> > + region->nr_pfns,
> > + HV_MAP_GPA_NO_ACCESS, NULL);
> > break;
> > case MSHV_REGION_TYPE_MMIO:
> > - ret = hv_call_map_mmio_pages(partition->pt_id,
> > - region->start_gfn,
> > - mmio_pfn,
> > - region->nr_pages);
> > + ret = hv_call_map_mmio_pfns(partition->pt_id,
> > + region->start_gfn,
> > + mmio_pfn,
> > + region->nr_pfns);
> > break;
> > }
> >
> > trace_mshv_map_user_memory(partition->pt_id, region->start_uaddr,
> > - region->start_gfn, region->nr_pages,
> > + region->start_gfn, region->nr_pfns,
> > region->hv_map_flags, ret);
> >
> > if (ret)
> > @@ -1424,7 +1424,7 @@ mshv_unmap_user_memory(struct mshv_partition *partition,
> > /* Paranoia check */
> > if (region->start_uaddr != mem.userspace_addr ||
> > region->start_gfn != mem.guest_pfn ||
> > - region->nr_pages != HVPFN_DOWN(mem.size)) {
> > + region->nr_pfns != HVPFN_DOWN(mem.size)) {
> > spin_unlock(&partition->pt_mem_regions_lock);
> > return -EINVAL;
> > }
> >
> >
>
* [BUG/RFE] hv_balloon: hot-add not triggered under burst memory demand
From: Boiler Plate @ 2026-04-20 15:30 UTC (permalink / raw)
To: linux-hyperv; +Cc: wei.liu, mhklinux
In-Reply-To: <CAPLehQ3cT+jEUkLnfBYFh3a=eCZcxzBeDvTX-B0MWdSJ_oEFuQ@mail.gmail.com>
Hi!
I'm seeing a consistent failure of the hv_balloon driver to respond to
burst memory demand on Alpine Linux, with VS Code Remote devcontainers
as a representative workload.
The issue has been thoroughly analyzed using PSI monitoring and kernel
configuration verification. The root cause is the absence of burst
demand support in the driver architecture, compounded by the 1-second
polling loop latency.
A detailed analysis with measurement data and proposed improvements follows.
--
Bug Report / Request for Enhancement: hv_balloon Dynamic Memory
Hot-Add Fails Under Burst Demand Workloads
SUMMARY
The Linux hv_balloon driver for Hyper-V Dynamic Memory is documented
and architecturally designed to manage guest memory in both
directions: increasing memory via hot-add when the guest needs more,
and decreasing it via balloon inflation when the guest needs less. The
original patch comment introducing the driver explicitly states this
dual purpose, and the driver's state machine contains distinct states
for DM_BALLOON_UP, DM_BALLOON_DOWN and DM_HOT_ADD.
In practice, only the downward direction works reliably. This was
demonstrated in a controlled test running Alpine Linux v3.23 (kernel
6.18.20-lts) as a Hyper-V guest with Dynamic Memory configured
(Startup RAM 1024 MB, Maximum 16384 MB). Under a representative burst
demand workload using VS Code Remote SSH with devcontainer startup,
the guest experienced 97% PSI memory stall, 176,000+ swap pages
written, and near-OOM conditions over a sustained period exceeding 150
seconds. During this entire period, MemTotal never increased by a
single kilobyte and dmesg showed zero hot-add activity. The upward
direction failed completely.
The root cause is the complete absence of burst demand support in the
driver architecture, compounded by a fixed-interval 1-second polling
loop between guest and host, sequential hot-add protocol semantics,
and a kernel default configuration (MHP_DEFAULT_ONLINE_TYPE_OFFLINE)
that leaves hot-added memory sections offline even when they do
arrive. Collectively these mean the driver cannot respond to burst
memory demand fast enough to be useful.
Proposed resolution: PSI-triggered hot-add requests independent of the
1-second polling loop, documentation of required
auto_online_blocks=online configuration, and improved diagnostics when
hot-add is not initiated despite high memory pressure.
1. VERIFIED PURPOSE AND DESIGN INTENT
The original patch introducing hv_balloon into the Linux kernel states
explicitly:
"Windows hosts dynamically manage the guest memory allocation via a
combination memory hot add and ballooning. Memory hot add is used to
grow the guest memory up to the maximum memory that can be allocated
to the guest. Ballooning is used to both shrink as well as expand up
to the max memory."
Source: K.Y. Srinivasan, [PATCH 2/2] Drivers: hv: Add Hyper-V balloon
driver, lkml.iu.edu, 2012.
The driver's state machine in the current kernel source at
drivers/hv/hv_balloon.c confirms this with explicit states
DM_BALLOON_UP, DM_BALLOON_DOWN and DM_HOT_ADD. Source:
github.com/torvalds/linux/blob/master/drivers/hv/hv_balloon.c
(verified April 2026).
The upward direction is therefore not an optional or aspirational
feature. It is the primary stated purpose of the hot-add component of
the driver.
2. SYSTEM CONFIGURATION
Host: Windows Server 2022 with Hyper-V, Dynamic Memory enabled
Guest OS: Alpine Linux v3.23, kernel 6.18.20-lts (x86_64)
Kernel config relevant to this report:
- CONFIG_MEMORY_HOTPLUG=y
- CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE=y (Alpine default)
- CONFIG_PSI=y
- CONFIG_PSI_DEFAULT_DISABLED=y
Hyper-V Dynamic Memory settings:
- Startup RAM: 1024 MB
- Minimum RAM: 512 MB
- Maximum RAM: 16384 MB
auto_online_blocks: Set to online manually after discovering the
default was offline
PSI: Enabled via psi=1 kernel parameter after discovering
CONFIG_PSI_DEFAULT_DISABLED=y
Driver module parameters verified on test system:
- /sys/module/hv_balloon/parameters/hot_add = Y (hot-add enabled)
- /sys/module/hv_balloon/parameters/pressure_report_delay = 0 (no startup delay)
3. USE CASE: VS CODE REMOTE SSH WITH DEVCONTAINER
This use case is representative of a class of developer workloads,
containerized development environments, that is increasingly common
on Linux VMs hosted on Hyper-V.
Workload profile:
VS Code Remote SSH connects to the Alpine guest and starts a Home
Assistant Add-on development container (ha-dev). The VS Code server
process (node) expands from zero to approximately 240 MB RSS within 10
seconds of connection. Multiple node processes spawn in rapid
succession as extensions and language servers load.
Expected behavior:
Hyper-V Dynamic Memory detects guest memory pressure, initiates
hot-add to expand MemTotal beyond the startup value, guest makes the
new memory available via auto_online_blocks, workload proceeds
normally.
Observed behavior:
MemTotal remained at 921764 kB (the 1024 MB startup RAM less memory
reserved by the kernel) throughout the entire session. dmesg showed
no hot-add activity whatsoever. The system
responded by swapping aggressively and reaching PSI avg10 values of
97% before becoming unresponsive.
4. MEASUREMENT DATA
All measurements were collected using a custom shell script sampling
/proc/pressure/memory, /proc/vmstat and /proc/meminfo at 1-2 second
intervals, with per-process RSS from /proc/PID/status.
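The sampling described above can be sketched in C as follows. This is a minimal illustration, not the script used for this report: it assumes the documented line format of /proc/pressure/memory, and the names psi_sample, psi_parse_line and psi_delta_per_sec are hypothetical helpers introduced only for this example.

```c
#include <stdio.h>
#include <string.h>

/* One parsed line of /proc/pressure/memory ("some" or "full"). */
struct psi_sample {
	double avg10, avg60, avg300;   /* percent of wall time stalled */
	unsigned long long total_us;   /* cumulative stall in microseconds */
};

/*
 * Parse a line such as (illustrative values)
 *   "full avg10=1.23 avg60=0.45 avg300=0.10 total=123456"
 * Returns 0 on success, -1 if the line does not start with `kind`
 * or does not match the expected format.
 */
int psi_parse_line(const char *line, const char *kind,
		   struct psi_sample *out)
{
	size_t n = strlen(kind);

	if (strncmp(line, kind, n) != 0 || line[n] != ' ')
		return -1;
	if (sscanf(line + n, " avg10=%lf avg60=%lf avg300=%lf total=%llu",
		   &out->avg10, &out->avg60, &out->avg300,
		   &out->total_us) != 4)
		return -1;
	return 0;
}

/* Stall microseconds accumulated per second between two samples. */
double psi_delta_per_sec(unsigned long long prev_us,
			 unsigned long long cur_us, double interval_s)
{
	if (interval_s <= 0.0 || cur_us < prev_us)
		return 0.0;
	return (double)(cur_us - prev_us) / interval_s;
}
```

A sampler would read the "full" line once per interval, parse it with psi_parse_line(), and feed consecutive total_us values to psi_delta_per_sec().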
Timeline of VS Code startup of remote containers (elapsed time from connection)
Time   Event                                PSI delta/s     Swap pages out  MemTotal
+0s    Baseline, docker running             0 us            0               921764 kB
+48s   VS Code server appears               0 us            0               921764 kB
+50s   node 107 MB RSS                      152,837 us      3,748           921764 kB
+55s   4 node processes, 348 MB RSS total   941,966 us      22,945          921764 kB
+92s   PSI avg10 48%                        6,357,506 us    159,934         921764 kB
+106s  PSI avg10 83%                        13,446,517 us   160,262         921764 kB
+179s  PSI avg10 97%                        20,093,449 us   165,208         921764 kB
Key observations:
- MemTotal never changed from startup value
- No hot-add lines appeared in dmesg at any point during or after the session
- PSI cumulative stall since boot at session end: 804,383,983
microseconds (804 seconds of accumulated memory stall)
5. ROOT CAUSE ANALYSIS
The fundamental design gap in hv_balloon is the complete absence of
burst demand support. The driver was designed around a polling-based,
fixed-interval model and has no mechanism to detect or respond to
rapid memory transitions. All other issues described below are
consequences or amplifications of this core architectural limitation.
Issue 1: No burst demand support in the driver architecture
The driver has no concept of burst demand, a rapid transition from low
memory pressure to near-OOM within seconds. There is no fast path, no
threshold trigger, and no priority escalation mechanism. The entire
communication model between guest and host is based on periodic status
reporting, which by design introduces latency that is structurally
incompatible with burst workloads. In the session measured in Section
4, the swap-out count had already reached 22,945 pages by +55s, roughly
seven status messages after the VS Code server first appeared at +48s,
and about 160,000 pages by +92s.
Modern workloads such as containerized development environments,
Kubernetes pod scheduling, JVM heap initialization, and Node.js
extension loading routinely demand hundreds of megabytes of memory
within a 5-10 second window. The driver architecture predates this
workload class entirely.
Issue 2: 1-second fixed-interval polling loop is too slow for burst workloads
The hv_balloon thread reports memory pressure to the host once per
second via post_status(). Source:
elixir.bootlin.com/linux/v6.14.6/source/drivers/hv/hv_balloon.c#L1381
(verified via Medium article by Shlomi Boutnaru, May 2025).
VS Code expanded from 0 to 240 MB RSS in under 10 seconds. By the time
the host received sufficient pressure signals to consider a hot-add
response, the guest had already exhausted available memory and entered
heavy swap. The polling cadence has no mechanism to accelerate or
escalate regardless of how severe or rapid the memory pressure
becomes.
Issue 3: pressure_report_delay not a factor in this case
The hv_balloon module parameter pressure_report_delay defaults to 30
seconds per the original 2013 patch. Source: K.Y. Srinivasan, [PATCH
1/2] Drivers: hv: balloon: Add a parameter to delay pressure
reporting, lkml.indiana.edu, 2013. On the test system this parameter
was verified to be 0
(/sys/module/hv_balloon/parameters/pressure_report_delay = 0), meaning
pressure reporting was not delayed. This eliminates
pressure_report_delay as a contributing factor in this specific case
and leaves Issues 1 and 2 as the most likely guest-side causes of the
observed failure.
Note: on systems where pressure_report_delay retains its default value
of 30, the failure window would be significantly wider, as the host
would receive no pressure data at all during the first 30 seconds
after driver load.
Issue 4: Sequential hot-add protocol prevents parallel responses
Per the Hyper-V Dynamic Memory protocol specification, the host must
not send a new hot-add request until the guest has responded to the
previous one. Source: quoted in QEMU developer discussion,
mail-archive.com, September 2020. Combined with the 128 MB minimum
DIMM size for Linux hot-add (source: patchew.org, verified search
result), each expansion step is large, slow and serialized.
Issue 5: MHP_DEFAULT_ONLINE_TYPE_OFFLINE leaves hot-added memory unusable
Alpine Linux ships with CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE=y.
Hot-added memory sections are registered in sysfs but remain in
offline state until explicitly brought online. Without udev (Alpine
uses mdev) there is no automatic mechanism to online new sections. The
auto_online_blocks sysfs interface defaults to offline and must be
manually set to online.
This issue was identified and resolved in this specific environment by
running echo online > /sys/devices/system/memory/auto_online_blocks
and making the setting persistent via /etc/local.d/memory-hotplug.start.
However, even with this fix applied, hot-add was never triggered by the
host during the VS Code session. This indicates that Issues 1-4 are the
primary blockers and that Issue 5 is a prerequisite that was already
satisfied.
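For systems where auto_online_blocks cannot be changed, blocks that were hot-added but left offline can also be onlined individually through their sysfs state files. The following is a minimal sketch under that assumption; online_offline_blocks is a hypothetical helper name, the base directory is passed in so the function can be exercised against a test directory, and on a real system it would be /sys/devices/system/memory and require root.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Walk `base` (normally /sys/devices/system/memory) and write "online"
 * to the state file of every memoryN block that currently reads
 * "offline". Returns the number of blocks onlined, or -1 if `base`
 * cannot be opened.
 */
int online_offline_blocks(const char *base)
{
	char path[512], state[32];
	struct dirent *de;
	int onlined = 0;
	DIR *dir = opendir(base);

	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		FILE *f;
		int ok;

		/* Only memoryN entries are memory blocks. */
		if (strncmp(de->d_name, "memory", 6) != 0)
			continue;
		if (de->d_name[6] < '0' || de->d_name[6] > '9')
			continue;

		snprintf(path, sizeof(path), "%s/%s/state", base, de->d_name);

		f = fopen(path, "r");
		if (!f)
			continue;
		if (!fgets(state, sizeof(state), f))
			state[0] = '\0';
		fclose(f);

		if (strncmp(state, "offline", 7) != 0)
			continue;

		f = fopen(path, "w");
		if (!f)
			continue;
		ok = fputs("online", f) >= 0;
		/* A rejected sysfs store surfaces when the buffer is flushed. */
		if (fclose(f) != 0)
			ok = 0;
		if (ok)
			onlined++;
	}
	closedir(dir);
	return onlined;
}
```

This only helps once the host has actually hot-added memory; it does not cause a hot-add to happen.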
6. COMPARISON WITH KNOWN SIMILAR REPORTS AND RECENT PATCHES
This failure mode is not new. A Kubernetes/minikube issue from 2017
describes an identical pattern: memory demand increases, Hyper-V
Manager shows warning status, assigned memory never increases, OOM
killer activates. Source: github.com/kubernetes/minikube/issues/1403.
The issue was closed as stale without resolution. The present report
provides significantly more detailed measurement data and kernel
configuration context than the prior report.
The driver is actively maintained. Two recent patches are relevant as context:
- A March 2024 patch by Michael Kelley fixes hot-add failures on
systems with memblock sizes larger than 128 MB, where add_memory()
would fail with error -22. Source: lore.kernel.org/lkml, March 2024.
This is a separate correctness fix and does not address burst demand.
- A January 2025 patch accepted into hyperv-next fixes an issue where
the balloon driver's global page-onlining callback blocked hot-add of
memory from GPU and vPCI device drivers. Source:
mail-archive.com/linux-hyperv, January 2025. This is again a separate
correctness fix, but both patches confirm that the driver is under active
development and that the maintainers are responsive to bug reports.
No open patches or RFC discussions on linux-hyperv@vger.kernel.org
addressing burst demand or PSI integration in hv_balloon were
identified as of April 2026.
7. PROPOSED IMPROVEMENTS
RFE 1: PSI-triggered hot-add requests to handle burst demand
The driver's current architecture is built around fixed-interval
polling: it reports memory pressure to the host once per second via
post_status() and waits for the host to initiate a hot-add sequence.
This design has no mechanism to accelerate or escalate outside the
polling cadence regardless of how severe or rapid the memory pressure
becomes.
Modern workloads have fundamentally different memory characteristics.
Containers, container runtimes (Docker, containerd, podman), JVM-based
systems, Node.js applications, Kubernetes pods, and development
environments such as VS Code devcontainers routinely exhibit burst
demand patterns: a guest transitions from low memory pressure to
near-OOM within seconds as processes spawn, images are pulled, or
runtimes initialize their heaps. Kubernetes is a particularly
illustrative case - the entire value proposition of dynamic memory
allocation in a Kubernetes node depends on the hypervisor being able
to supply memory fast enough to honor pod scheduling decisions. When a
scheduler assigns a new pod to a node, it expects memory to be
available within seconds, not after a multi-second feedback loop that
may itself be preceded by a 30-second pressure_report_delay. This
pattern is not an edge case - it is the normal startup behavior of a
significant proportion of workloads running on Linux VMs today.
The VS Code devcontainer use case presented in this report is
representative but not exceptional. Any workload that combines a
container runtime with a language server, a build system, a database
startup sequence, or a Kubernetes pod scheduling event will exhibit
similar burst demand characteristics. The 1-second fixed-interval
polling loop is structurally incapable of protecting against this
class of memory event regardless of host configuration.
The infrastructure to solve this already exists in the Linux kernel.
PSI itself has been available since kernel 4.20, and PSI threshold
triggers via poll() on /proc/pressure/memory since kernel 5.2. PSI is
already consumed by userspace daemons such as systemd-oomd and
Facebook's oomd, and threshold triggers allow a consumer to react to
memory pressure faster than periodic polling allows. The driver should
leverage this same mechanism to send an immediate out-of-band hot-add
request to the host when burst demand is detected, specifically when
the "full" memory stall exceeds a configurable threshold such as 10%
over a 500 ms window (50 ms of stall). This would allow the host to
begin the hot-add sequence at the onset of burst demand rather than
after the guest has already entered heavy swap.
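For reference, the userspace form of a PSI threshold trigger looks roughly like the sketch below, modeled on the example in the kernel's PSI documentation. It is only an illustration of the mechanism, not driver code: psi_format_trigger, psi_open_trigger and psi_wait are hypothetical helper names, and how an in-kernel consumer such as hv_balloon would hook into PSI is left open here. Registering a trigger may require elevated privileges depending on kernel version and window size.

```c
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Build a trigger string "<kind> <stall_us> <window_us>" that fires when
 * stall time reaches `pct` percent of the window, e.g. 10% of a 500 ms
 * window is 50000 us of stall. Returns 0 on success, -1 on bad input.
 */
int psi_format_trigger(char *buf, size_t len, const char *kind,
		       unsigned int pct, unsigned int window_us)
{
	unsigned long long stall_us;
	int n;

	if (pct == 0 || pct > 100)
		return -1;
	stall_us = (unsigned long long)window_us * pct / 100;
	n = snprintf(buf, len, "%s %llu %u", kind, stall_us, window_us);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Register the trigger; returns an fd to poll, or -1 on error. */
int psi_open_trigger(const char *path, const char *trig)
{
	int fd = open(path, O_RDWR | O_NONBLOCK);

	if (fd < 0)
		return -1;
	/* As in the kernel documentation example, write the NUL as well. */
	if (write(fd, trig, strlen(trig) + 1) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Wait for the trigger: 1 = fired, 0 = timeout, -1 = error. */
int psi_wait(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };
	int n = poll(&pfd, 1, timeout_ms);

	if (n <= 0)
		return n;	/* 0 on timeout, -1 on poll() failure */
	if (pfd.revents & POLLERR)
		return -1;	/* event source is gone */
	return (pfd.revents & POLLPRI) ? 1 : 0;
}
```

A monitor would format "full 50000 500000", register it on /proc/pressure/memory with psi_open_trigger(), and then loop on psi_wait(); each return value of 1 marks a window in which the threshold was crossed.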
This improvement requires no protocol-level changes if the existing
hot-add request message is used as the signal. A protocol extension
adding an explicit "burst demand" flag to the status message would be
preferable, allowing the host to prioritize the response and bypass
any queuing of normal pressure-based adjustments.
RFE 2: Document auto_online_blocks requirement
The kernel documentation and Hyper-V guest integration documentation
should explicitly state that auto_online_blocks must be set to online
(or the kernel compiled with MHP_DEFAULT_ONLINE_TYPE_ONLINE_AUTO) for
Dynamic Memory hot-add to function on distributions that do not use
udev with a memory hotplug rule. Currently this is undocumented and
discoverable only by reading kernel source or community bug reports.
RFE 3: Diagnostics when hot-add is not triggered
When PSI memory.full avg10 exceeds a significant threshold (e.g. 10%)
without a hot-add request being sent or received, the driver should
emit a pr_warn to dmesg. Currently the guest has no visibility into
whether the host is aware of pressure, whether a hot-add request was
sent, or whether it failed. This makes the failure mode completely
silent from the guest's perspective.
RFE 4: Consider proactive pressure signaling
The driver could be extended to send an out-of-band high-priority
pressure signal to the host when PSI crosses a critical threshold,
rather than waiting for the next 1-second polling cycle. This would
require a protocol-level change and coordination with the Hyper-V host
implementation but would address the fundamental latency issue.
8. WHAT IS UNKNOWN
Why the host never sent a hot-add request despite the guest reaching
near-OOM conditions is unknown. The host-side decision algorithm for
when to initiate hot-add is proprietary and not publicly documented.
It is possible the host's memory pressure thresholds were not met
because guest-side pressure reporting was too slow to accumulate
sufficient signal. It is also possible the host's algorithm is simply
not designed for burst workloads of this type.
The Hyper-V Dynamic Memory Buffer setting (configurable between 5% and
2000%, default 20%) controls how much headroom the host maintains above
current demand. Whether increasing this value would provide sufficient
buffer to absorb burst demand without requiring hot-add at all is
unknown without testing. It would not address the architectural
limitation but could serve as a partial operational mitigation.
--
^ permalink raw reply
* Re: [PATCH 11/11] Drivers: hv: Kconfig: Add ARM64 support for MSHV_VTL
From: Naman Jain @ 2026-04-20 15:24 UTC (permalink / raw)
To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
linux-riscv@lists.infradead.org
In-Reply-To: <SN6PR02MB4157FEE5578344625418BFDBD450A@SN6PR02MB4157.namprd02.prod.outlook.com>
On 4/1/2026 10:28 PM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
>>
>
> Nit: In keeping with past practice, the "Subject" prefix for this patch could
> just be "Drivers: hv:"
Acked.
I am also planning to change other subject line prefixes, based on your
earlier suggestion:
mshv_vtl_main changes - "mshv_vtl: "
arch/arm64 Hyper-V changes - "arm64: hyperv: "
arch/x86 Hyper-V changes - "x86/hyperv: "
Thank you so much for doing such a thorough review. I really appreciate
all the help and guidance.
Regards,
Naman
^ permalink raw reply
* Re: [PATCH 10/11] Drivers: hv: Add support for arm64 in MSHV_VTL
From: Naman Jain @ 2026-04-20 15:24 UTC (permalink / raw)
To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
linux-riscv@lists.infradead.org
In-Reply-To: <SN6PR02MB41576766C5FB291952CC58E8D450A@SN6PR02MB4157.namprd02.prod.outlook.com>
On 4/1/2026 10:28 PM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
>>
>> Add necessary support to make MSHV_VTL work for arm64 architecture.
>> * Add stub implementation for mshv_vtl_return_call_init(): not required
>> for arm64
>> * Remove fpu/legacy.h header inclusion, as this is not required
>> * handle HV_REGISTER_VSM_CODE_PAGE_OFFSETS register: not supported
>> in arm64
>> * Configure custom percpu_vmbus_handler by using
>> hv_setup_percpu_vmbus_handler()
>> * Handle hugepage functions by config checks
>>
>> Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>> arch/arm64/include/asm/mshyperv.h | 2 ++
>> drivers/hv/mshv_vtl_main.c | 21 ++++++++++++++-------
>> 2 files changed, 16 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/mshyperv.h
>> b/arch/arm64/include/asm/mshyperv.h
>> index 36803f0386cc..027a7f062d70 100644
>> --- a/arch/arm64/include/asm/mshyperv.h
>> +++ b/arch/arm64/include/asm/mshyperv.h
>> @@ -83,6 +83,8 @@ static inline int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, u
>> return 1;
>> }
>>
>> +static inline void mshv_vtl_return_call_init(u64 vtl_return_offset) {}
>> +
>> void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>> bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu);
>> #endif
>> diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
>> index 4c9ae65ad3e8..5702fe258500 100644
>> --- a/drivers/hv/mshv_vtl_main.c
>> +++ b/drivers/hv/mshv_vtl_main.c
>> @@ -23,8 +23,6 @@
>> #include <trace/events/ipi.h>
>> #include <uapi/linux/mshv.h>
>> #include <hyperv/hvhdk.h>
>> -
>> -#include "../../kernel/fpu/legacy.h"
>
> Was there a particular code change that made this unnecessary? Or was it
> unnecessary from the start of this source code file? Just curious ....
This was present in initial driver changes when the assembly code was
part of this driver. Then it moved to arch files and this was left here.
Just cleaning it up.
>
>> #include "mshv.h"
>> #include "mshv_vtl.h"
>> #include "hyperv_vmbus.h"
>> @@ -206,18 +204,21 @@ static void mshv_vtl_synic_enable_regs(unsigned int cpu)
>> static int mshv_vtl_get_vsm_regs(void)
>> {
>> struct hv_register_assoc registers[2];
>> - int ret, count = 2;
>> + int ret, count = 0;
>>
>> - registers[0].name = HV_REGISTER_VSM_CODE_PAGE_OFFSETS;
>> - registers[1].name = HV_REGISTER_VSM_CAPABILITIES;
>> + registers[count++].name = HV_REGISTER_VSM_CAPABILITIES;
>> + /* Code page offset register is not supported on ARM */
>> + if (IS_ENABLED(CONFIG_X86_64))
>> + registers[count++].name = HV_REGISTER_VSM_CODE_PAGE_OFFSETS;
>>
>> ret = hv_call_get_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
>> count, input_vtl_zero, registers);
>> if (ret)
>> return ret;
>>
>> - mshv_vsm_page_offsets.as_uint64 = registers[0].value.reg64;
>> - mshv_vsm_capabilities.as_uint64 = registers[1].value.reg64;
>> + mshv_vsm_capabilities.as_uint64 = registers[0].value.reg64;
>> + if (IS_ENABLED(CONFIG_X86_64))
>> + mshv_vsm_page_offsets.as_uint64 = registers[1].value.reg64;
>>
>> return ret;
>> }
>
> This function has gotten somewhat messy to handle the x86 and arm64
> differences. Let me suggest a different approach. Have this function only
> get the VSM capabilities register, as that is generic across x86 and
> arm64. Then, update x86 mshv_vtl_return_call_init() to get the
> PAGE_OFFSETS register and then immediately use the value to update
> the static call. The global variable mshv_vsm_page_offsets is no longer
> necessary.
>
> My suggestion might be a little more code because hv_call_get_vp_registers()
> is invoked in two different places. But it cleanly separates the two use
> cases, and keeps the x86 hackery under arch/x86.
>
I implemented this in my dev branch, and it works fine. Thanks for the
suggestion.
>> @@ -280,10 +281,13 @@ static int hv_vtl_setup_synic(void)
>>
>> /* Use our isr to first filter out packets destined for userspace */
>> hv_setup_vmbus_handler(mshv_vtl_vmbus_isr);
>> + /* hv_setup_vmbus_handler() is stubbed for ARM64, add per-cpu VMBus handlers instead */
>> + hv_setup_percpu_vmbus_handler(mshv_vtl_vmbus_isr);
>>
>> ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "hyperv/vtl:online",
>> mshv_vtl_alloc_context, NULL);
>> if (ret < 0) {
>> + hv_setup_percpu_vmbus_handler(vmbus_isr);
>> hv_setup_vmbus_handler(vmbus_isr);
>> return ret;
>> }
>> @@ -296,6 +300,7 @@ static int hv_vtl_setup_synic(void)
>> static void hv_vtl_remove_synic(void)
>> {
>> cpuhp_remove_state(mshv_vtl_cpuhp_online);
>> + hv_setup_percpu_vmbus_handler(vmbus_isr);
hv_setup_percpu_vmbus_handler() calls will also be removed with the
redesign.
Regards,
Naman
^ permalink raw reply
* Re: [PATCH 08/11] Drivers: hv: mshv_vtl: Move register page config to arch-specific files
From: Naman Jain @ 2026-04-20 15:23 UTC (permalink / raw)
To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
linux-riscv@lists.infradead.org
In-Reply-To: <SN6PR02MB4157CF364DA2C0CC657A6DCBD450A@SN6PR02MB4157.namprd02.prod.outlook.com>
On 4/1/2026 10:28 PM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
>>
>> Move mshv_vtl_configure_reg_page() implementation from
>> drivers/hv/mshv_vtl_main.c to arch-specific files:
>> - arch/x86/hyperv/hv_vtl.c: full implementation with register page setup
>> - arch/arm64/hyperv/hv_vtl.c: stub implementation (unsupported)
>>
>> Move common type definitions to include/asm-generic/mshyperv.h:
>> - struct mshv_vtl_per_cpu
>> - union hv_synic_overlay_page_msr
>>
>> Move hv_call_get_vp_registers() and hv_call_set_vp_registers()
>> declarations to include/asm-generic/mshyperv.h since these functions
>> are used by multiple modules.
>>
>> While at it, remove the unnecessary stub implementations in #else
>> case for mshv_vtl_return* functions in arch/x86/include/asm/mshyperv.h.
>
> Seems like this patch is doing multiple things. The reg page configuration
> changes are more substantial and should probably be in a patch by
> themselves. The other changes are more trivial and maybe are OK
> grouped into a single patch, but you could also consider breaking them
> out.
I will split this patch into 3 patches.
>
>>
>> This is essential for adding support for ARM64 in MSHV_VTL.
>>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>> arch/arm64/hyperv/hv_vtl.c | 8 +++++
>> arch/arm64/include/asm/mshyperv.h | 3 ++
>> arch/x86/hyperv/hv_vtl.c | 32 ++++++++++++++++++++
>> arch/x86/include/asm/mshyperv.h | 7 ++---
>> drivers/hv/mshv.h | 8 -----
>> drivers/hv/mshv_vtl_main.c | 49 +++----------------------------
>> include/asm-generic/mshyperv.h | 42 ++++++++++++++++++++++++++
>> 7 files changed, 92 insertions(+), 57 deletions(-)
>>
>> diff --git a/arch/arm64/hyperv/hv_vtl.c b/arch/arm64/hyperv/hv_vtl.c
>> index 66318672c242..d699138427c1 100644
>> --- a/arch/arm64/hyperv/hv_vtl.c
>> +++ b/arch/arm64/hyperv/hv_vtl.c
>> @@ -10,6 +10,7 @@
>> #include <asm/boot.h>
>> #include <asm/mshyperv.h>
>> #include <asm/cpu_ops.h>
>> +#include <linux/export.h>
>>
>> void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0)
>> {
>> @@ -142,3 +143,10 @@ void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0)
>> "v24", "v25", "v26", "v27", "v28", "v29", "v30", "v31");
>> }
>> EXPORT_SYMBOL(mshv_vtl_return_call);
>> +
>> +bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu)
>> +{
>> + pr_debug("Register page not supported on ARM64\n");
>> + return false;
>> +}
>> +EXPORT_SYMBOL_GPL(hv_vtl_configure_reg_page);
>> diff --git a/arch/arm64/include/asm/mshyperv.h
>> b/arch/arm64/include/asm/mshyperv.h
>> index de7f3a41a8ea..36803f0386cc 100644
>> --- a/arch/arm64/include/asm/mshyperv.h
>> +++ b/arch/arm64/include/asm/mshyperv.h
>> @@ -61,6 +61,8 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
>> ARM_SMCCC_OWNER_VENDOR_HYP, \
>> HV_SMCCC_FUNC_NUMBER)
>>
>> +struct mshv_vtl_per_cpu;
>> +
>> struct mshv_vtl_cpu_context {
>> /*
>> * NOTE: x18 is managed by the hypervisor. It won't be reloaded from this array.
>> @@ -82,6 +84,7 @@ static inline int hv_vtl_get_set_reg(struct hv_register_assoc *regs,
>> bool set, u
>> }
>>
>> void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>> +bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu);
>
> I think this declaration could be added in asm-generic/mshyperv.h so that it
> is shared by x86 and arm64. That also obviates the need for the forward
> ref to struct mshv_vtl_per_cpu that you've added here.
Acked.
>
>> #endif
>>
>> #include <asm-generic/mshyperv.h>
>> diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c
>> index 72a0bb4ae0c7..ede290985d41 100644
>> --- a/arch/x86/hyperv/hv_vtl.c
>> +++ b/arch/x86/hyperv/hv_vtl.c
>> @@ -20,6 +20,7 @@
>> #include <uapi/asm/mtrr.h>
>> #include <asm/debugreg.h>
>> #include <linux/export.h>
>> +#include <linux/hyperv.h>
>> #include <../kernel/smpboot.h>
>> #include "../../kernel/fpu/legacy.h"
>>
>> @@ -259,6 +260,37 @@ int __init hv_vtl_early_init(void)
>> return 0;
>> }
>>
>> +static const union hv_input_vtl input_vtl_zero;
>> +
>> +bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu)
>> +{
>> + struct hv_register_assoc reg_assoc = {};
>> + union hv_synic_overlay_page_msr overlay = {};
>> + struct page *reg_page;
>> +
>> + reg_page = alloc_page(GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL);
>> + if (!reg_page) {
>> + WARN(1, "failed to allocate register page\n");
>> + return false;
>> + }
>> +
>> + overlay.enabled = 1;
>> + overlay.pfn = page_to_hvpfn(reg_page);
>> + reg_assoc.name = HV_X64_REGISTER_REG_PAGE;
>> + reg_assoc.value.reg64 = overlay.as_uint64;
>> +
>> + if (hv_call_set_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
>> + 1, input_vtl_zero, &reg_assoc)) {
>> + WARN(1, "failed to setup register page\n");
>> + __free_page(reg_page);
>> + return false;
>> + }
>> +
>> + per_cpu->reg_page = reg_page;
>> + return true;
>
> As Sashiko AI noted, the memory allocated for the reg_page never gets freed.
This is present in the existing code; I'll address it in a separate series.
>
>> +}
>> +EXPORT_SYMBOL_GPL(hv_vtl_configure_reg_page);
>> +
>> DEFINE_STATIC_CALL_NULL(__mshv_vtl_return_hypercall, void (*)(void));
>>
>> void mshv_vtl_return_call_init(u64 vtl_return_offset)
>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
>> index d5355a5b7517..d592fea49cdb 100644
>> --- a/arch/x86/include/asm/mshyperv.h
>> +++ b/arch/x86/include/asm/mshyperv.h
>> @@ -271,6 +271,8 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg) {
>> return 0; }
>> static inline int hv_apicid_to_vp_index(u32 apic_id) { return -EINVAL; }
>> #endif /* CONFIG_HYPERV */
>>
>> +struct mshv_vtl_per_cpu;
>> +
>> struct mshv_vtl_cpu_context {
>> union {
>> struct {
>> @@ -305,13 +307,10 @@ void mshv_vtl_return_call_init(u64 vtl_return_offset);
>> void mshv_vtl_return_hypercall(void);
>> void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>> int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, u64 shared);
>> +bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu);
>
> Same as for arm64. Add a shared declaration in asm-generic/mshyperv.h.
Ditto.
>
>> #else
>> static inline void __init hv_vtl_init_platform(void) {}
>> static inline int __init hv_vtl_early_init(void) { return 0; }
>> -static inline void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
>> -static inline void mshv_vtl_return_call_init(u64 vtl_return_offset) {}
>> -static inline void mshv_vtl_return_hypercall(void) {}
>> -static inline void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
>> #endif
>>
>> #include <asm-generic/mshyperv.h>
>> diff --git a/drivers/hv/mshv.h b/drivers/hv/mshv.h
>> index d4813df92b9c..0fcb7f9ba6a9 100644
>> --- a/drivers/hv/mshv.h
>> +++ b/drivers/hv/mshv.h
>> @@ -14,14 +14,6 @@
>> memchr_inv(&((STRUCT).MEMBER), \
>> 0, sizeof_field(typeof(STRUCT), MEMBER))
>>
>> -int hv_call_get_vp_registers(u32 vp_index, u64 partition_id, u16 count,
>> - union hv_input_vtl input_vtl,
>> - struct hv_register_assoc *registers);
>> -
>> -int hv_call_set_vp_registers(u32 vp_index, u64 partition_id, u16 count,
>> - union hv_input_vtl input_vtl,
>> - struct hv_register_assoc *registers);
>> -
>> int hv_call_get_partition_property(u64 partition_id, u64 property_code,
>> u64 *property_value);
>>
>> diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
>> index 91517b45d526..c79d24317b8e 100644
>> --- a/drivers/hv/mshv_vtl_main.c
>> +++ b/drivers/hv/mshv_vtl_main.c
>> @@ -78,21 +78,6 @@ struct mshv_vtl {
>> u64 id;
>> };
>>
>> -struct mshv_vtl_per_cpu {
>> - struct mshv_vtl_run *run;
>> - struct page *reg_page;
>> -};
>> -
>> -/* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
>> -union hv_synic_overlay_page_msr {
>> - u64 as_uint64;
>> - struct {
>> - u64 enabled: 1;
>> - u64 reserved: 11;
>> - u64 pfn: 52;
>> - } __packed;
>> -};
>> -
>> static struct mutex mshv_vtl_poll_file_lock;
>> static union hv_register_vsm_page_offsets mshv_vsm_page_offsets;
>> static union hv_register_vsm_capabilities mshv_vsm_capabilities;
>> @@ -201,34 +186,6 @@ static struct page *mshv_vtl_cpu_reg_page(int cpu)
>> return *per_cpu_ptr(&mshv_vtl_per_cpu.reg_page, cpu);
>> }
>>
>> -static void mshv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu)
>> -{
>> - struct hv_register_assoc reg_assoc = {};
>> - union hv_synic_overlay_page_msr overlay = {};
>> - struct page *reg_page;
>> -
>> - reg_page = alloc_page(GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL);
>> - if (!reg_page) {
>> - WARN(1, "failed to allocate register page\n");
>> - return;
>> - }
>> -
>> - overlay.enabled = 1;
>> - overlay.pfn = page_to_hvpfn(reg_page);
>> - reg_assoc.name = HV_X64_REGISTER_REG_PAGE;
>> - reg_assoc.value.reg64 = overlay.as_uint64;
>> -
>> - if (hv_call_set_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
>> - 1, input_vtl_zero, &reg_assoc)) {
>> - WARN(1, "failed to setup register page\n");
>> - __free_page(reg_page);
>> - return;
>> - }
>> -
>> - per_cpu->reg_page = reg_page;
>> - mshv_has_reg_page = true;
>> -}
>> -
>> static void mshv_vtl_synic_enable_regs(unsigned int cpu)
>> {
>> union hv_synic_sint sint;
>> @@ -329,8 +286,10 @@ static int mshv_vtl_alloc_context(unsigned int cpu)
>> if (!per_cpu->run)
>> return -ENOMEM;
>>
>> - if (mshv_vsm_capabilities.intercept_page_available)
>> - mshv_vtl_configure_reg_page(per_cpu);
>> + if (mshv_vsm_capabilities.intercept_page_available) {
>> + if (hv_vtl_configure_reg_page(per_cpu))
>> + mshv_has_reg_page = true;
>
> As Sashiko AI noted, it doesn't work to use the global mshv_has_reg_page
> to indicate the success of configuring the reg page, which is a per-cpu
> operation. But this bug existed before this patch set, so maybe it should
> be fixed as a preliminary patch.
Acked. Will address them in a separate series.
>
>> + }
>>
>> mshv_vtl_synic_enable_regs(cpu);
>>
>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>> index b147a12085e4..b53fcc071596 100644
>> --- a/include/asm-generic/mshyperv.h
>> +++ b/include/asm-generic/mshyperv.h
>> @@ -383,8 +383,50 @@ static inline int hv_deposit_memory(u64 partition_id, u64 status)
>> return hv_deposit_memory_node(NUMA_NO_NODE, partition_id, status);
>> }
>>
>> +#if IS_ENABLED(CONFIG_MSHV_ROOT) || IS_ENABLED(CONFIG_MSHV_VTL)
>> +int hv_call_get_vp_registers(u32 vp_index, u64 partition_id, u16 count,
>> + union hv_input_vtl input_vtl,
>> + struct hv_register_assoc *registers);
>> +
>> +int hv_call_set_vp_registers(u32 vp_index, u64 partition_id, u16 count,
>> + union hv_input_vtl input_vtl,
>> + struct hv_register_assoc *registers);
>> +#else
>> +static inline int hv_call_get_vp_registers(u32 vp_index, u64 partition_id,
>> + u16 count,
>> + union hv_input_vtl input_vtl,
>> + struct hv_register_assoc *registers)
>> +{
>> + return -EOPNOTSUPP;
>> +}
>> +
>> +static inline int hv_call_set_vp_registers(u32 vp_index, u64 partition_id,
>> + u16 count,
>> + union hv_input_vtl input_vtl,
>> + struct hv_register_assoc *registers)
>> +{
>> + return -EOPNOTSUPP;
>> +}
>> +#endif /* CONFIG_MSHV_ROOT || CONFIG_MSHV_VTL */
>> +
>> #define HV_VP_ASSIST_PAGE_ADDRESS_SHIFT 12
>> +
>> #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
>> +struct mshv_vtl_per_cpu {
>> + struct mshv_vtl_run *run;
>> + struct page *reg_page;
>> +};
>> +
>> +/* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
>> +union hv_synic_overlay_page_msr {
>> + u64 as_uint64;
>> + struct {
>> + u64 enabled: 1;
>> + u64 reserved: 11;
>> + u64 pfn: 52;
>> + } __packed;
>> +};
>> +
>> u8 __init get_vtl(void);
>> #else
>> static inline u8 get_vtl(void) { return 0; }
>> --
>> 2.43.0
>>
>
> Sashiko AI noted another existing bug in mshv_vtl_init(), which is that
> the error path does kfree(mem_dev) when it should do
> put_device(mem_dev). See the comment in the header of
> device_initialize().
To avoid this series bloating up, I am thinking of taking up these fixes
in a separate series.
Regards,
Naman
^ permalink raw reply