Linux-HyperV List

* [RFC PATCH 4/8] KVM: x86: Use ktime_get_snapshot_id() for master clock
From: David Woodhouse @ 2026-05-26 23:06 UTC
  To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, John Stultz,
	Michael Kelley
  Cc: Vitaly Kuznetsov, Marcelo Tosatti, Christopher S . Hall,
	Stephen Boyd, Miroslav Lichvar, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, K . Y . Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Daniel Lezcano, kvm, linux-hyperv, x86,
	linux-kernel
In-Reply-To: <20260526230635.136914-1-dwmw2@infradead.org>

From: David Woodhouse <dwmw@amazon.co.uk>

Replace the KVM-private vgettsc()/do_kvmclock_base()/do_monotonic()/
do_realtime() timekeeping reimplementation with calls to the generic
ktime_get_snapshot_id() interface.

The snapshot provides both the system time and the raw_cycles (TSC)
atomically paired. When raw_cycles is zero, the clocksource could not
provide a raw hardware counter value, which is equivalent to the
previous vgettsc() returning VDSO_CLOCKMODE_NONE.

For kvm_get_time_and_clockread(), the kvmclock base time is
CLOCK_MONOTONIC_RAW + offs_boot. The snapshot provides the raw time
atomically paired with the TSC; offs_boot is added separately as it
only changes at suspend/resume boundaries.

This is a step towards eliminating the pvclock_gtod_data private copy
of timekeeping state and the associated notifier callback.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Assisted-by: Kiro:claude-opus-4.6-1m
---
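Reader note, not part of the commit message: the three outcomes of
kvm_snapshot_has_tsc() in the diff below can be modelled in userspace
roughly as follows. This is only a sketch. struct model_snapshot,
enum model_csid and model_snapshot_has_tsc() are hypothetical stand-ins
for the kernel types, and the enum values are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's enum clocksource_ids values. */
enum model_csid { MODEL_CSID_GENERIC, MODEL_CSID_X86_TSC, MODEL_CSID_OTHER };

/* Reduced model of struct system_time_snapshot: only the fields used here. */
struct model_snapshot {
	uint64_t cycles;          /* value read from the active clocksource */
	uint64_t raw_cycles;      /* underlying counter, 0 if unavailable */
	enum model_csid cs_id;    /* ID of the active clocksource */
	enum model_csid raw_csid; /* ID of the underlying counter */
};

/* Returns true and stores the TSC reading if the snapshot carries one. */
static bool model_snapshot_has_tsc(const struct model_snapshot *snap,
				   uint64_t *tsc)
{
	assert(snap && tsc);

	/* Case 1: the active clocksource is the TSC itself. */
	if (snap->cs_id == MODEL_CSID_X86_TSC) {
		*tsc = snap->cycles;
		return true;
	}

	/* Case 2: a derived clocksource that reported its raw TSC. */
	if (snap->raw_csid == MODEL_CSID_X86_TSC && snap->raw_cycles) {
		*tsc = snap->raw_cycles;
		return true;
	}

	/* Case 3: no TSC reading is paired with this snapshot. */
	return false;
}
```

In this model a derived clocksource whose raw_cycles is zero falls
through to case 3, which corresponds to the old VDSO_CLOCKMODE_NONE
result described above.
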
 arch/x86/kvm/x86.c | 50 ++++++++++++++++++++++++++++++++++++----------
 1 file changed, 39 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c9e17e01f82d..e6f740f95ff9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -35,6 +35,7 @@
 #include "smm.h"
 
 #include <linux/clocksource.h>
+#include <linux/timekeeping.h>
 #include <linux/interrupt.h>
 #include <linux/kvm.h>
 #include <linux/fs.h>
@@ -3137,14 +3138,34 @@ static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
  * reports the TSC value from which it do so. Returns true if host is
  * using TSC based clocksource.
  */
+static bool kvm_snapshot_has_tsc(struct system_time_snapshot *snap,
+				u64 *tsc_timestamp)
+{
+	if (snap->cs_id == CSID_X86_TSC) {
+		*tsc_timestamp = snap->cycles;
+		return true;
+	}
+
+	if (snap->raw_csid == CSID_X86_TSC && snap->raw_cycles) {
+		*tsc_timestamp = snap->raw_cycles;
+		return true;
+	}
+
+	return false;
+}
+
 static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
 {
-	/* checked again under seqlock below */
-	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
+	struct system_time_snapshot snap;
+
+	if (!ktime_get_snapshot_id(&snap, CLOCK_MONOTONIC_RAW))
+		return false;
+	if (!kvm_snapshot_has_tsc(&snap, tsc_timestamp))
 		return false;
 
-	return gtod_is_based_on_tsc(do_kvmclock_base(kernel_ns,
-						     tsc_timestamp));
+	*kernel_ns = ktime_to_ns(snap.sys) +
+		     ktime_to_ns(ktime_mono_to_any(0, TK_OFFS_BOOT));
+	return true;
 }
 
 /*
@@ -3153,12 +3174,15 @@ static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
  */
 bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
 {
-	/* checked again under seqlock below */
-	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
+	struct system_time_snapshot snap;
+
+	if (!ktime_get_snapshot_id(&snap, CLOCK_MONOTONIC))
+		return false;
+	if (!kvm_snapshot_has_tsc(&snap, tsc_timestamp))
 		return false;
 
-	return gtod_is_based_on_tsc(do_monotonic(kernel_ns,
-						 tsc_timestamp));
+	*kernel_ns = ktime_to_ns(snap.sys);
+	return true;
 }
 
 /*
@@ -3171,11 +3195,15 @@ bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
 static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
 					   u64 *tsc_timestamp)
 {
-	/* checked again under seqlock below */
-	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
+	struct system_time_snapshot snap;
+
+	if (!ktime_get_snapshot_id(&snap, CLOCK_REALTIME))
+		return false;
+	if (!kvm_snapshot_has_tsc(&snap, tsc_timestamp))
 		return false;
 
-	return gtod_is_based_on_tsc(do_realtime(ts, tsc_timestamp));
+	*ts = ktime_to_timespec64(snap.sys);
+	return true;
 }
 #endif
 
-- 
2.54.0



* [RFC PATCH 1/8] timekeeping: Add clocksource read_raw() method and raw_cycles to snapshot
From: David Woodhouse @ 2026-05-26 23:06 UTC
  To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, John Stultz,
	Michael Kelley
  Cc: Vitaly Kuznetsov, Marcelo Tosatti, Christopher S . Hall,
	Stephen Boyd, Miroslav Lichvar, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, K . Y . Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Daniel Lezcano, kvm, linux-hyperv, x86,
	linux-kernel
In-Reply-To: <to=b6d2173312b8d0469774846eb18b9799832d9cfc.camel@infradead.org>

From: David Woodhouse <dwmw@amazon.co.uk>

Add a read_raw() callback to struct clocksource which returns the
derived clocksource value while also providing the underlying hardware
counter reading. This allows ktime_get_snapshot_id() to populate a new
raw_cycles field in struct system_time_snapshot.

For clocksources that are derived from an underlying counter (e.g.,
Hyper-V TSC page scales TSC to 10MHz, kvmclock scales TSC to 1GHz), this
provides atomic access to both the derived value needed for timekeeping
calculations, and the raw hardware counter needed by consumers like
KVM's master clock and the vmclock PTP driver.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Assisted-by: Kiro:claude-opus-4.6-1m
---
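Reader note, not part of the commit message: the fallback behaviour of
the new read path can be sketched in userspace as below. All names
(struct model_clocksource, model_clock_read_raw(), plain_read(),
derived_read_raw()) and the constants are hypothetical. The sketch only
mirrors the non-inlined dispatch shown in the timekeeping.c hunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced model of struct clocksource: only the two read callbacks. */
struct model_clocksource {
	uint64_t (*read)(struct model_clocksource *cs);
	/* Optional: also reports the underlying hardware counter via *raw. */
	uint64_t (*read_raw)(struct model_clocksource *cs, uint64_t *raw);
};

/* Dispatch: *raw stays 0 unless a read_raw() callback fills it in. */
static uint64_t model_clock_read_raw(struct model_clocksource *cs,
				     uint64_t *raw)
{
	assert(cs && cs->read && raw);

	*raw = 0;
	if (cs->read_raw)
		return cs->read_raw(cs, raw);
	return cs->read(cs);
}

/* Hypothetical plain clocksource: no underlying counter to report. */
static uint64_t plain_read(struct model_clocksource *cs)
{
	(void)cs;
	return 1000;
}

/* Hypothetical derived clocksource: counter value 4000 scaled down by 4. */
static uint64_t derived_read_raw(struct model_clocksource *cs, uint64_t *raw)
{
	(void)cs;
	*raw = 4000;
	return *raw / 4;
}

static struct model_clocksource plain_cs = { .read = plain_read };
static struct model_clocksource derived_cs = {
	.read = plain_read,
	.read_raw = derived_read_raw,
};
```

A consumer that sees a zero raw value after the call cannot tell "no
read_raw() method" from "method present but counter unavailable", which
is why the snapshot carries raw_csid alongside raw_cycles.
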
 include/linux/clocksource.h |  8 ++++++++
 include/linux/timekeeping.h |  6 ++++++
 kernel/time/timekeeping.c   | 30 +++++++++++++++++++++++++++++-
 3 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 7c38190b10bf..674299e32f0c 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -37,6 +37,10 @@ struct module;
  *	This is the structure used for system time.
  *
  * @read:		Returns a cycle value, passes clocksource as argument
+ * @read_raw:		Where a clocksource such as kvmclock or the Hyper-V
+ *			scaled TSC is calculated from an underlying hardware
+ *			counter, return both a cycle value and the raw value
+ *			of the underlying counter from which it was calculated
  * @mask:		Bitmask for two's complement
  *			subtraction of non 64 bit counters
  * @mult:		Cycle to nanosecond multiplier
@@ -69,6 +73,8 @@ struct module;
  *			in certain snapshot functions to allow callers to
  *			validate the clocksource from which the snapshot was
  *			taken.
+ * @raw_csid:		If a @read_raw method exists, the clocksource_id of the
+ *			raw underlying counter
  * @flags:		Flags describing special properties
  * @base:		Hardware abstraction for clock on which a clocksource
  *			is based
@@ -97,6 +103,7 @@ struct module;
  */
 struct clocksource {
 	u64			(*read)(struct clocksource *cs);
+	u64			(*read_raw)(struct clocksource *cs, u64 *raw);
 	u64			mask;
 	u32			mult;
 	u32			shift;
@@ -109,6 +116,7 @@ struct clocksource {
 	u32			freq_khz;
 	int			rating;
 	enum clocksource_ids	id;
+	enum clocksource_ids	raw_csid;
 	enum vdso_clock_mode	vdso_clock_mode;
 	unsigned long		flags;
 	struct clocksource_base *base;
diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index f7945f1048fc..54799a9ebeb0 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -279,18 +279,24 @@ static inline bool ktime_get_aux_ts64(clockid_t id, struct timespec64 *kt) { ret
  * struct system_time_snapshot - Simultaneous time capture of CLOCK_MONOTONIC_RAW,
  *				 a selected CLOCK_* and the clocksource counter value
  * @cycles:		Clocksource counter value to produce the system times
+ * @raw_cycles:		For derived clocksources, the raw hardware counter value from
+ *			which @cycles was derived
  * @sys:		The system time of the selected CLOCK ID
  * @raw:		Monotonic raw system time
  * @cs_id:		Clocksource ID
+ * @raw_csid:		Clocksource ID of underlying raw hardware counter, set if
+ *			@raw_cycles is non-zero
  * @clock_was_set_seq:	The sequence number of clock-was-set events
  * @cs_was_changed_seq:	The sequence number of clocksource change events
  * @valid:		True if the snapshot is valid
  */
 struct system_time_snapshot {
 	u64			cycles;
+	u64			raw_cycles;
 	ktime_t			sys;
 	ktime_t			raw;
 	enum clocksource_ids	cs_id;
+	enum clocksource_ids	raw_csid;
 	unsigned int		clock_was_set_seq;
 	u8			cs_was_changed_seq;
 	u8			valid;
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index c4fd7229b7da..6c75a677fd2a 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -304,6 +304,21 @@ static __always_inline u64 tk_clock_read(const struct tk_read_base *tkr)
 	return clock->read(clock);
 }
 
+static __always_inline u64 tk_clock_read_raw(const struct tk_read_base *tkr, u64 *raw)
+{
+	struct clocksource *clock = READ_ONCE(tkr->clock);
+
+	*raw = 0;
+
+	if (static_branch_likely(&clocksource_read_inlined))
+		return arch_inlined_clocksource_read(clock);
+
+	if (clock->read_raw)
+		return clock->read_raw(clock, raw);
+	else
+		return clock->read(clock);
+}
+
 static inline void clocksource_disable_inline_read(void)
 {
 	static_branch_disable(&clocksource_read_inlined);
@@ -320,6 +335,18 @@ static __always_inline u64 tk_clock_read(const struct tk_read_base *tkr)
 
 	return clock->read(clock);
 }
+
+static __always_inline u64 tk_clock_read_raw(const struct tk_read_base *tkr, u64 *raw)
+{
+	struct clocksource *clock = READ_ONCE(tkr->clock);
+
+	*raw = 0;
+
+	if (clock->read_raw)
+		return clock->read_raw(clock, raw);
+	else
+		return clock->read(clock);
+}
 static inline void clocksource_disable_inline_read(void) { }
 static inline void clocksource_enable_inline_read(void) { }
 #endif
@@ -1243,8 +1270,9 @@ bool ktime_get_snapshot_id(struct system_time_snapshot *systime_snapshot, clocki
 		if (!tk->clock_valid)
 			return false;
 
-		now = tk_clock_read(&tk->tkr_mono);
+		now = tk_clock_read_raw(&tk->tkr_mono, &systime_snapshot->raw_cycles);
 		systime_snapshot->cs_id = tk->tkr_mono.clock->id;
+		systime_snapshot->raw_csid = tk->tkr_mono.clock->raw_csid;
 		systime_snapshot->cs_was_changed_seq = tk->cs_was_changed_seq;
 		systime_snapshot->clock_was_set_seq = tk->clock_was_set_seq;
 
-- 
2.54.0



* [RFC PATCH 5/8] KVM: x86: Compute kvmclock base without pvclock_gtod_data
From: David Woodhouse @ 2026-05-26 23:06 UTC
  To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, John Stultz,
	Michael Kelley
  Cc: Vitaly Kuznetsov, Marcelo Tosatti, Christopher S . Hall,
	Stephen Boyd, Miroslav Lichvar, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, K . Y . Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Daniel Lezcano, kvm, linux-hyperv, x86,
	linux-kernel
In-Reply-To: <20260526230635.136914-1-dwmw2@infradead.org>

From: David Woodhouse <dwmw@amazon.co.uk>

get_kvmclock_base_ns() needs CLOCK_MONOTONIC_RAW + offs_boot. Compute
this directly rather than reading offs_boot from the pvclock_gtod_data
private copy. offs_boot only changes at suspend/resume, so it does not
need to be atomically paired with the raw clock read.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Assisted-by: Kiro:claude-opus-4.6-1m
---
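Reader note, not part of the commit message: a rough userspace analogue
of "CLOCK_MONOTONIC_RAW + offs_boot" on Linux is sketched below. It
assumes that CLOCK_BOOTTIME minus CLOCK_MONOTONIC approximates
offs_boot, and the helper names (clock_ns(), approx_offs_boot_ns(),
approx_kvmclock_base_ns()) are hypothetical. As in the patch, the offset
is read separately from the raw clock rather than atomically with it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read one POSIX clock and return it as nanoseconds. */
static int64_t clock_ns(clockid_t id)
{
	struct timespec ts;
	int ret = clock_gettime(id, &ts);

	assert(ret == 0);
	(void)ret;
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Approximate offs_boot as CLOCK_BOOTTIME - CLOCK_MONOTONIC. The two
 * reads are not atomic, so the result also includes the short time that
 * passes between them.
 */
static int64_t approx_offs_boot_ns(void)
{
	int64_t mono = clock_ns(CLOCK_MONOTONIC);
	int64_t boot = clock_ns(CLOCK_BOOTTIME);

	return boot - mono;
}

/* Counts from boot, but advances at the rate of the raw clock. */
static int64_t approx_kvmclock_base_ns(void)
{
	return clock_ns(CLOCK_MONOTONIC_RAW) + approx_offs_boot_ns();
}
```

The non-atomic pairing is tolerable for the same reason given above:
the offset only moves across suspend/resume.
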
 arch/x86/kvm/x86.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e6f740f95ff9..d057f42603e4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2402,7 +2402,7 @@ static void update_pvclock_gtod(struct timekeeper *tk)
 static s64 get_kvmclock_base_ns(void)
 {
 	/* Count up from boot time, but with the frequency of the raw clock.  */
-	return ktime_to_ns(ktime_add(ktime_get_raw(), pvclock_gtod_data.offs_boot));
+	return ktime_get_raw_ns() + ktime_to_ns(ktime_mono_to_any(0, TK_OFFS_BOOT));
 }
 
 static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock, int sec_hi_ofs)
-- 
2.54.0



* [RFC PATCH 3/8] x86/kvmclock: Implement read_raw() for kvmclock clocksource
From: David Woodhouse @ 2026-05-26 23:06 UTC
  To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, John Stultz,
	Michael Kelley
  Cc: Vitaly Kuznetsov, Marcelo Tosatti, Christopher S . Hall,
	Stephen Boyd, Miroslav Lichvar, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, K . Y . Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Daniel Lezcano, kvm, linux-hyperv, x86,
	linux-kernel
In-Reply-To: <20260526230635.136914-1-dwmw2@infradead.org>

From: David Woodhouse <dwmw@amazon.co.uk>

Implement the read_raw() callback for the kvmclock clocksource.
This returns the kvmclock nanosecond value (for timekeeping) while
also providing the raw TSC value that was used to compute it.

The TSC is read inside the pvclock seqlock-protected region,
ensuring the raw TSC and derived kvmclock value are atomically
paired.

This enables ktime_get_snapshot_id() to provide the raw TSC to consumers
like the vmclock PTP driver, which currently has to do a separate call
to get_cycles() to obtain the value to feed through the vmclock
calculation.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Assisted-by: Kiro:claude-opus-4.6-1m
---
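Reader note, not part of the commit message: the version-protected read
that pairs the counter with the derived time can be modelled in
userspace as below. struct model_pvti, model_read_counter(),
model_scale_delta() and model_read_cycles_raw() are hypothetical names.
The scaling follows the usual pvclock shift-then-multiply scheme, and
unsigned __int128 is a GCC/Clang extension used for the wide multiply.
The payload fields are plain (non-atomic) for brevity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Reduced model of struct pvclock_vcpu_time_info. */
struct model_pvti {
	atomic_uint version;        /* odd while the writer is updating */
	uint64_t tsc_timestamp;     /* counter value at the last update */
	uint64_t system_time;       /* nanoseconds at the last update */
	uint32_t tsc_to_system_mul; /* multiplier, interpreted as mul / 2^32 */
	int8_t tsc_shift;           /* pre-shift applied to the delta */
};

/* Hypothetical counter source standing in for rdtsc_ordered(). */
static uint64_t model_counter_value;

static uint64_t model_read_counter(void)
{
	return model_counter_value;
}

/* Scale a counter delta to nanoseconds: shift, then multiply by mul / 2^32. */
static uint64_t model_scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;
	return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* Returns the derived time and stores the counter value it came from. */
static uint64_t model_read_cycles_raw(struct model_pvti *src, uint64_t *raw)
{
	unsigned int version;
	uint64_t tsc, ns;

	assert(src && raw);
	do {
		/* Clearing bit 0 forces a retry while an update is in flight. */
		version = atomic_load_explicit(&src->version,
					       memory_order_acquire) & ~1u;
		/* The counter is read inside the protected region. */
		tsc = model_read_counter();
		ns = src->system_time +
		     model_scale_delta(tsc - src->tsc_timestamp,
				       src->tsc_to_system_mul, src->tsc_shift);
		atomic_thread_fence(memory_order_acquire);
	} while (atomic_load_explicit(&src->version,
				      memory_order_relaxed) != version);

	*raw = tsc;
	return ns;
}
```

Because the counter is sampled inside the retry loop, the value stored
in *raw is always the one the returned time was computed from.
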
 arch/x86/kernel/kvmclock.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 74aca22dc726..ef86635433f9 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -87,6 +87,25 @@ static u64 kvm_clock_get_cycles(struct clocksource *cs)
 	return kvm_clock_read();
 }
 
+static u64 kvm_clock_get_cycles_raw(struct clocksource *cs, u64 *raw)
+{
+	struct pvclock_vcpu_time_info *src;
+	unsigned version;
+	u64 ret, tsc;
+
+	preempt_disable_notrace();
+	src = this_cpu_pvti();
+	do {
+		version = pvclock_read_begin(src);
+		tsc = rdtsc_ordered();
+		ret = __pvclock_read_cycles(src, tsc);
+	} while (pvclock_read_retry(src, version));
+	preempt_enable_notrace();
+
+	*raw = tsc;
+	return ret;
+}
+
 static noinstr u64 kvm_sched_clock_read(void)
 {
 	return pvclock_clocksource_read_nowd(this_cpu_pvti()) - kvm_sched_clock_offset;
@@ -163,6 +182,8 @@ static int kvm_cs_enable(struct clocksource *cs)
 static struct clocksource kvm_clock = {
 	.name	= "kvm-clock",
 	.read	= kvm_clock_get_cycles,
+	.read_raw = kvm_clock_get_cycles_raw,
+	.raw_csid = CSID_X86_TSC,
 	.rating	= 400,
 	.mask	= CLOCKSOURCE_MASK(64),
 	.flags	= CLOCK_SOURCE_IS_CONTINUOUS,
-- 
2.54.0



* [RFC PATCH 6/8] KVM: x86: Replace pvclock_gtod_data vclock_mode with boolean
From: David Woodhouse @ 2026-05-26 23:06 UTC
  To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, John Stultz,
	Michael Kelley
  Cc: Vitaly Kuznetsov, Marcelo Tosatti, Christopher S . Hall,
	Stephen Boyd, Miroslav Lichvar, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, K . Y . Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Daniel Lezcano, kvm, linux-hyperv, x86,
	linux-kernel
In-Reply-To: <20260526230635.136914-1-dwmw2@infradead.org>

From: David Woodhouse <dwmw@amazon.co.uk>

The remaining users of pvclock_gtod_data only need to know whether
the host clocksource is TSC-based. Replace all vclock_mode checks
with a simple kvm_host_has_tsc_clocksource boolean, updated by the
pvclock_gtod_notify callback.

This is inherently racy, as it always was: kvm_track_tsc_matching()
never held the gtod seqcount. It relies on eventual consistency: the
notifier fires on every timekeeping update and will correct any
transient inconsistency within one tick.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Assisted-by: Kiro:claude-opus-4.6-1m
---
 arch/x86/kvm/x86.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d057f42603e4..c31b19860c13 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2649,6 +2649,8 @@ static s64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
 }
 
 #ifdef CONFIG_X86_64
+static bool kvm_host_has_tsc_clocksource;
+
 static inline bool gtod_is_based_on_tsc(int mode)
 {
 	return mode == VDSO_CLOCKMODE_TSC || mode == VDSO_CLOCKMODE_HVCLOCK;
@@ -2682,7 +2684,6 @@ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
 {
 #ifdef CONFIG_X86_64
 	struct kvm_arch *ka = &vcpu->kvm->arch;
-	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
 
 	/*
 	 * Track whether all vCPUs have matching TSC offsets (for
@@ -2701,7 +2702,7 @@ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
 	 * accounts for its offset.
 	 */
 	bool use_master_clock = kvm_use_master_clock(vcpu->kvm) &&
-				gtod_is_based_on_tsc(gtod->clock.vclock_mode);
+				kvm_host_has_tsc_clocksource;
 
 	/*
 	 * Request a masterclock update if the masterclock needs to be toggled
@@ -2715,7 +2716,7 @@ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
 
 	trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
 			    atomic_read(&vcpu->kvm->online_vcpus),
-		            ka->use_master_clock, gtod->clock.vclock_mode);
+		            ka->use_master_clock, kvm_host_has_tsc_clocksource);
 #endif
 }
 
@@ -2836,7 +2837,7 @@ static inline bool kvm_check_tsc_unstable(void)
 	 * TSC is marked unstable when we're running on Hyper-V,
 	 * 'TSC page' clocksource is good.
 	 */
-	if (pvclock_gtod_data.clock.vclock_mode == VDSO_CLOCKMODE_HVCLOCK)
+	if (kvm_host_has_tsc_clocksource)
 		return false;
 #endif
 	return check_tsc_unstable();
@@ -3292,7 +3293,7 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 					   &ka->master_tsc_mul);
 	}
 
-	vclock_mode = pvclock_gtod_data.clock.vclock_mode;
+	vclock_mode = kvm_host_has_tsc_clocksource;
 	trace_kvm_update_master_clock(ka->use_master_clock, vclock_mode,
 					ka->all_vcpus_matched_freq);
 #endif
@@ -10364,12 +10365,15 @@ static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
 	update_pvclock_gtod(tk);
 
 #ifdef CONFIG_X86_64
+	kvm_host_has_tsc_clocksource =
+		gtod_is_based_on_tsc(tk->tkr_mono.clock->vdso_clock_mode);
+
 	/*
 	 * Disable master clock if host does not trust, or does not use,
 	 * TSC based clocksource. Delegate queue_work() to irq_work as
 	 * this is invoked with tk_core.seq write held.
 	 */
-	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode) &&
+	if (!kvm_host_has_tsc_clocksource &&
 	    atomic_read(&kvm_guest_has_master_clock) != 0)
 		irq_work_queue(&pvclock_irq_work);
 #endif
-- 
2.54.0



* [RFC PATCH 7/8] KVM: x86: Remove pvclock_gtod_data and private timekeeping code
From: David Woodhouse @ 2026-05-26 23:06 UTC
  To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, John Stultz,
	Michael Kelley
  Cc: Vitaly Kuznetsov, Marcelo Tosatti, Christopher S . Hall,
	Stephen Boyd, Miroslav Lichvar, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, K . Y . Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Daniel Lezcano, kvm, linux-hyperv, x86,
	linux-kernel
In-Reply-To: <20260526230635.136914-1-dwmw2@infradead.org>

From: David Woodhouse <dwmw@amazon.co.uk>

Remove the now-unused KVM-private timekeeping infrastructure:

 - struct pvclock_clock and struct pvclock_gtod_data
 - update_pvclock_gtod() and its seqcount-protected state copy
 - read_tsc() (KVM's private TSC reader with cycle_last clamping)
 - vgettsc() (KVM's private clocksource interpolation)
 - do_kvmclock_base(), do_monotonic(), do_realtime()

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Assisted-by: Kiro:claude-opus-4.6-1m
---
 arch/x86/kvm/x86.c | 175 ---------------------------------------------
 1 file changed, 175 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c31b19860c13..2c34b973fce0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2347,58 +2347,6 @@ static int do_set_msr(struct kvm_vcpu *vcpu, unsigned index, u64 *data)
 	return kvm_set_msr_ignored_check(vcpu, index, *data, true);
 }
 
-struct pvclock_clock {
-	int vclock_mode;
-	u64 cycle_last;
-	u64 mask;
-	u32 mult;
-	u32 shift;
-	u64 base_cycles;
-	u64 offset;
-};
-
-struct pvclock_gtod_data {
-	seqcount_t	seq;
-
-	struct pvclock_clock clock; /* extract of a clocksource struct */
-	struct pvclock_clock raw_clock; /* extract of a clocksource struct */
-
-	ktime_t		offs_boot;
-	u64		wall_time_sec;
-};
-
-static struct pvclock_gtod_data pvclock_gtod_data;
-
-static void update_pvclock_gtod(struct timekeeper *tk)
-{
-	struct pvclock_gtod_data *vdata = &pvclock_gtod_data;
-
-	write_seqcount_begin(&vdata->seq);
-
-	/* copy pvclock gtod data */
-	vdata->clock.vclock_mode	= tk->tkr_mono.clock->vdso_clock_mode;
-	vdata->clock.cycle_last		= tk->tkr_mono.cycle_last;
-	vdata->clock.mask		= tk->tkr_mono.mask;
-	vdata->clock.mult		= tk->tkr_mono.mult;
-	vdata->clock.shift		= tk->tkr_mono.shift;
-	vdata->clock.base_cycles	= tk->tkr_mono.xtime_nsec;
-	vdata->clock.offset		= tk->tkr_mono.base;
-
-	vdata->raw_clock.vclock_mode	= tk->tkr_raw.clock->vdso_clock_mode;
-	vdata->raw_clock.cycle_last	= tk->tkr_raw.cycle_last;
-	vdata->raw_clock.mask		= tk->tkr_raw.mask;
-	vdata->raw_clock.mult		= tk->tkr_raw.mult;
-	vdata->raw_clock.shift		= tk->tkr_raw.shift;
-	vdata->raw_clock.base_cycles	= tk->tkr_raw.xtime_nsec;
-	vdata->raw_clock.offset		= tk->tkr_raw.base;
-
-	vdata->wall_time_sec            = tk->xtime_sec;
-
-	vdata->offs_boot		= tk->offs_boot;
-
-	write_seqcount_end(&vdata->seq);
-}
-
 static s64 get_kvmclock_base_ns(void)
 {
 	/* Count up from boot time, but with the frequency of the raw clock.  */
@@ -3012,128 +2960,6 @@ static inline void adjust_tsc_offset_host(struct kvm_vcpu *vcpu, s64 adjustment)
 
 #ifdef CONFIG_X86_64
 
-static u64 read_tsc(void)
-{
-	u64 ret = (u64)rdtsc_ordered();
-	u64 last = pvclock_gtod_data.clock.cycle_last;
-
-	if (likely(ret >= last))
-		return ret;
-
-	/*
-	 * GCC likes to generate cmov here, but this branch is extremely
-	 * predictable (it's just a function of time and the likely is
-	 * very likely) and there's a data dependence, so force GCC
-	 * to generate a branch instead.  I don't barrier() because
-	 * we don't actually need a barrier, and if this function
-	 * ever gets inlined it will generate worse code.
-	 */
-	asm volatile ("");
-	return last;
-}
-
-static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
-			  int *mode)
-{
-	u64 tsc_pg_val;
-	long v;
-
-	switch (clock->vclock_mode) {
-	case VDSO_CLOCKMODE_HVCLOCK:
-		if (hv_read_tsc_page_tsc(hv_get_tsc_page(),
-					 tsc_timestamp, &tsc_pg_val)) {
-			/* TSC page valid */
-			*mode = VDSO_CLOCKMODE_HVCLOCK;
-			v = (tsc_pg_val - clock->cycle_last) &
-				clock->mask;
-		} else {
-			/* TSC page invalid */
-			*mode = VDSO_CLOCKMODE_NONE;
-		}
-		break;
-	case VDSO_CLOCKMODE_TSC:
-		*mode = VDSO_CLOCKMODE_TSC;
-		*tsc_timestamp = read_tsc();
-		v = (*tsc_timestamp - clock->cycle_last) &
-			clock->mask;
-		break;
-	default:
-		*mode = VDSO_CLOCKMODE_NONE;
-	}
-
-	if (*mode == VDSO_CLOCKMODE_NONE)
-		*tsc_timestamp = v = 0;
-
-	return v * clock->mult;
-}
-
-/*
- * As with get_kvmclock_base_ns(), this counts from boot time, at the
- * frequency of CLOCK_MONOTONIC_RAW (hence adding gtos->offs_boot).
- */
-static int do_kvmclock_base(s64 *t, u64 *tsc_timestamp)
-{
-	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
-	unsigned long seq;
-	int mode;
-	u64 ns;
-
-	do {
-		seq = read_seqcount_begin(&gtod->seq);
-		ns = gtod->raw_clock.base_cycles;
-		ns += vgettsc(&gtod->raw_clock, tsc_timestamp, &mode);
-		ns >>= gtod->raw_clock.shift;
-		ns += ktime_to_ns(ktime_add(gtod->raw_clock.offset, gtod->offs_boot));
-	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
-	*t = ns;
-
-	return mode;
-}
-
-/*
- * This calculates CLOCK_MONOTONIC at the time of the TSC snapshot, with
- * no boot time offset.
- */
-static int do_monotonic(s64 *t, u64 *tsc_timestamp)
-{
-	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
-	unsigned long seq;
-	int mode;
-	u64 ns;
-
-	do {
-		seq = read_seqcount_begin(&gtod->seq);
-		ns = gtod->clock.base_cycles;
-		ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
-		ns >>= gtod->clock.shift;
-		ns += ktime_to_ns(gtod->clock.offset);
-	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
-	*t = ns;
-
-	return mode;
-}
-
-static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
-{
-	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
-	unsigned long seq;
-	int mode;
-	u64 ns;
-
-	do {
-		seq = read_seqcount_begin(&gtod->seq);
-		ts->tv_sec = gtod->wall_time_sec;
-		ns = gtod->clock.base_cycles;
-		ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
-		ns >>= gtod->clock.shift;
-	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
-
-	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
-	ts->tv_nsec = ns;
-
-	return mode;
-}
-
 /*
  * Calculates the kvmclock_base_ns (CLOCK_MONOTONIC_RAW + boot time) and
  * reports the TSC value from which it do so. Returns true if host is
@@ -10362,7 +10188,6 @@ static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
 {
 	struct timekeeper *tk = priv;
 
-	update_pvclock_gtod(tk);
 
 #ifdef CONFIG_X86_64
 	kvm_host_has_tsc_clocksource =
-- 
2.54.0



* [RFC PATCH 2/8] clocksource/hyperv: Implement read_raw() for TSC page clocksource
From: David Woodhouse @ 2026-05-26 23:06 UTC
  To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, John Stultz,
	Michael Kelley
  Cc: Vitaly Kuznetsov, Marcelo Tosatti, Christopher S . Hall,
	Stephen Boyd, Miroslav Lichvar, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, K . Y . Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Daniel Lezcano, kvm, linux-hyperv, x86,
	linux-kernel
In-Reply-To: <20260526230635.136914-1-dwmw2@infradead.org>

From: David Woodhouse <dwmw@amazon.co.uk>

Implement the read_raw() callback for the Hyper-V TSC page
clocksource. This returns the derived 10MHz reference time (for
timekeeping) while also providing the raw TSC value that was used
to compute it.

When the TSC page is valid, hv_read_tsc_page_tsc() atomically
captures both values from a single RDTSC inside the sequence-counter
protected read. When the TSC page is invalid (sequence == 0), raw is
set to zero indicating no value is available.

This enables ktime_get_snapshot_id() to provide the raw TSC to
consumers like KVM's master clock when running nested on Hyper-V.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Assisted-by: Kiro:claude-opus-4.6-1m
---
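Reader note, not part of the commit message: the valid/invalid handling
described above can be modelled in userspace as below. The names
(struct model_tsc_page, model_read_tsc_page(), model_read_cs_raw()) are
hypothetical, the fallback time is passed in as a plain argument in
place of an MSR read, and the sequence retry loop of the real reader is
left out because the model is single-threaded. unsigned __int128 is a
GCC/Clang extension.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reduced model of the Hyper-V reference TSC page. */
struct model_tsc_page {
	uint32_t sequence; /* 0 means the page is invalid */
	uint64_t scale;    /* time = tsc * scale / 2^64 + offset */
	int64_t offset;
};

/* Returns false if the page is invalid; otherwise fills *raw and *time. */
static bool model_read_tsc_page(const struct model_tsc_page *pg,
				uint64_t tsc_now, uint64_t *raw,
				uint64_t *time)
{
	assert(pg && raw && time);

	if (!pg->sequence)
		return false;

	*raw = tsc_now;
	*time = (uint64_t)(((unsigned __int128)tsc_now * pg->scale) >> 64) +
		(uint64_t)pg->offset;
	return true;
}

/* Returns the derived time; *raw is 0 when only the fallback was usable. */
static uint64_t model_read_cs_raw(const struct model_tsc_page *pg,
				  uint64_t tsc_now, uint64_t fallback_time,
				  uint64_t *raw)
{
	uint64_t time;

	if (!model_read_tsc_page(pg, tsc_now, raw, &time)) {
		time = fallback_time;
		*raw = 0;
	}
	return time;
}
```

With scale set to 2^63 (a factor of one half), a counter value of 2000
and an offset of 7 yield a derived time of 1007 paired with raw 2000. An
invalid page yields the fallback time paired with raw 0.
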
 drivers/clocksource/hyperv_timer.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
index e9f5034a1bc8..c5ae01fdbd8e 100644
--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -444,6 +444,18 @@ static u64 notrace read_hv_clock_tsc_cs(struct clocksource *arg)
 	return read_hv_clock_tsc();
 }
 
+static u64 notrace read_hv_clock_tsc_cs_raw(struct clocksource *arg, u64 *raw)
+{
+	u64 time;
+
+	if (!hv_read_tsc_page_tsc(tsc_page, raw, &time)) {
+		time = read_hv_clock_msr();
+		*raw = 0;
+	}
+
+	return time;
+}
+
 static u64 noinstr read_hv_sched_clock_tsc(void)
 {
 	return (read_hv_clock_tsc() - hv_sched_clock_offset) *
@@ -495,6 +507,8 @@ static struct clocksource hyperv_cs_tsc = {
 	.name	= "hyperv_clocksource_tsc_page",
 	.rating	= 500,
 	.read	= read_hv_clock_tsc_cs,
+	.read_raw = read_hv_clock_tsc_cs_raw,
+	.raw_csid = CSID_X86_TSC,
 	.mask	= CLOCKSOURCE_MASK(64),
 	.flags	= CLOCK_SOURCE_IS_CONTINUOUS,
 	.suspend= suspend_hv_clock_tsc,
-- 
2.54.0



* Re: [RFC] KVM/x86: Killing kvm_get_time_and_clockread() in favour of ktime_get_snapshot()
From: David Woodhouse @ 2026-05-26 23:04 UTC
  To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, John Stultz,
	Michael Kelley
  Cc: Vitaly Kuznetsov, Marcelo Tosatti, Christopher S. Hall,
	Stephen Boyd, Miroslav Lichvar, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Daniel Lezcano, kvm, linux-hyperv, x86,
	linux-kernel
In-Reply-To: <b4895a532344ba6a879d922be8536f9000cd398c.camel@infradead.org>

On Tue, 2026-05-26 at 14:57 +0100, David Woodhouse wrote:
> 
> One simple option that occurs to me would be to add a 'cycles_raw'
> value to the system_time_snapshot, for PV clocksources like hyperv and
> kvmclock to populate with the original TSC reading.
> 
> That might actually let us clean up some of the PTP code that currently
> has to deal with TSC vs. kvmclock in counter snapshots too. I think I
> could kill the use of get_cycles() in vmclock for the kvmclock case,
> which might make Thomas happy...

I hacked that up to see what it looks like, and it kind of seems to work...

Based on merging my kvmclock branch and Thomas's ktime_get_snapshot_id():
 • https://git.infradead.org/?p=users/dwmw2/linux.git;a=shortlog;h=refs/heads/kvmclock5
 • https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git/log/?h=timers/ptp/timekeeping

I'll probably not post this for real until the above two are merged;
there's no rush but I think it's a worthwhile cleanup. For now it's at
 • https://git.infradead.org/?p=users/dwmw2/linux.git;a=shortlog;h=refs/heads/kvm-ktime-snapshot

David Woodhouse (8):
      timekeeping: Add clocksource read_raw() method and raw_cycles to snapshot
      clocksource/hyperv: Implement read_raw() for TSC page clocksource
      x86/kvmclock: Implement read_raw() for kvmclock clocksource
      KVM: x86: Use ktime_get_snapshot_id() for master clock
      KVM: x86: Compute kvmclock base without pvclock_gtod_data
      KVM: x86: Replace pvclock_gtod_data vclock_mode with boolean
      KVM: x86: Remove pvclock_gtod_data and private timekeeping code
      ptp: vmclock: Use raw_cycles from snapshot for precise TSC pairing

 arch/x86/kernel/kvmclock.c         |  21 ++++
 arch/x86/kvm/x86.c                 | 239 ++++++++-----------------------------
 drivers/clocksource/hyperv_timer.c |  14 +++
 drivers/ptp/ptp_vmclock.c          |   4 +
 include/linux/clocksource.h        |   8 ++
 include/linux/timekeeping.h        |   6 +
 kernel/time/timekeeping.c          |  30 ++++-
 7 files changed, 130 insertions(+), 192 deletions(-)




* Re: [PATCH 1/1] drm/hyperv: Replace "hyperv_" with "hvdrm_" as symbol name prefix
From: sashiko-bot @ 2026-05-26 20:59 UTC
  To: Michael Kelley; +Cc: linux-hyperv
In-Reply-To: <20260526205239.1509-1-mhklkml@zohomail.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue to consider:

Pre-existing issues:
- [High] Potential NULL pointer dereference in `hvdrm_crtc_helper_atomic_enable` when accessing `plane_state->fb->pitches`.
--

commit 9e40bbf4c669c839b0568d52b603d410d08dde6c
Author: Michael Kelley <mhklinux@outlook.com>
drm/hyperv: Replace "hyperv_" with "hvdrm_" as symbol name prefix

This commit mechanically replaces the "hyperv_" prefix with "hvdrm_" across
functions and structures in the Hyper-V DRM driver. This renaming prevents
namespace collisions with core Hyper-V and VMBus code, and addresses an
initcall conflict that prevents using initcall_blacklist.

> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c b/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c
[ ... ]
> @@ -107,14 +107,14 @@ static const uint64_t hyperv_modifiers[] = {
> -static void hyperv_crtc_helper_atomic_enable(struct drm_crtc *crtc,
> +static void hvdrm_crtc_helper_atomic_enable(struct drm_crtc *crtc,
>  					     struct drm_atomic_commit *state)
>  {
> -	struct hyperv_drm_device *hv = to_hv(crtc->dev);
> +	struct hvdrm_drm_device *hv = to_hv(crtc->dev);
>  	struct drm_plane *plane = &hv->plane;
>  	struct drm_plane_state *plane_state = plane->state;
>  	struct drm_crtc_state *crtc_state = crtc->state;
>  
> -	hyperv_hide_hw_ptr(hv->hdev);
> -	hyperv_update_situation(hv->hdev, 1,  hv->screen_depth,
> +	hvdrm_hide_hw_ptr(hv->hdev);
> +	hvdrm_update_situation(hv->hdev, 1,  hv->screen_depth,
>  				crtc_state->mode.hdisplay,
>  				crtc_state->mode.vdisplay,
>  				plane_state->fb->pitches[0]);

[Severity: High]
This is a pre-existing issue, but could this cause a NULL pointer
dereference?

In the DRM atomic API, userspace can configure a CRTC to be active without an
attached framebuffer on its primary plane, which means plane_state->fb can be
NULL.

If the CRTC is enabled while the primary plane lacks a framebuffer, it looks
like dereferencing plane_state->fb->pitches[0] here will crash the kernel.

>  
>  	drm_crtc_vblank_on(crtc);
>  }
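
As a rough illustration of the concern above, the following userspace sketch
shows the kind of guard that would avoid dereferencing a missing framebuffer.
It is not the driver's code and not a proposed fix: the sketch_* types are
simplified stand-ins for the DRM structures, and the fallback pitch parameter
is an assumption made only so the example can be compiled on its own.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not the kernel's DRM definitions. */
struct sketch_framebuffer {
	uint32_t pitches[4];
};

struct sketch_plane_state {
	struct sketch_framebuffer *fb;	/* NULL when no framebuffer is attached */
};

/*
 * Return the pitch of plane 0, or fallback_pitch when the plane state
 * or its framebuffer is missing, instead of dereferencing a NULL fb.
 */
static inline uint32_t
sketch_plane_pitch(const struct sketch_plane_state *plane_state,
		   uint32_t fallback_pitch)
{
	if (!plane_state || !plane_state->fb)
		return fallback_pitch;

	return plane_state->fb->pitches[0];
}
```

Whether a fallback value or an early return is the right response in the real
atomic_enable path depends on what the host expects and is left open here.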

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260526205239.1509-1-mhklkml@zohomail.com?part=1


* [PATCH 1/1] drm/hyperv: Replace "hyperv_" with "hvdrm_" as symbol name prefix
From: Michael Kelley @ 2026-05-26 20:52 UTC (permalink / raw)
  To: maarten.lankhorst, mripard, tzimmermann, airlied, simona, decui,
	longli, ssengar
  Cc: dri-devel, linux-kernel, linux-hyperv

From: Michael Kelley <mhklinux@outlook.com>

Function and structure names in the Hyper-V DRM driver currently
use "hyperv_" as the prefix. This conflicts with usage in core Hyper-V
and VMBus code, and incorrectly implies that functions and structures
in this driver apply generically to Hyper-V. A specific conflict arises
for "hyperv_init", which is an initcall for generic Hyper-V
initialization on arm64. The conflict prevents the use of
initcall_blacklist on the kernel boot line to skip loading this driver.

Fix this by substituting "hvdrm_" as the prefix for all functions and
structures in this driver. This prefix marries the existing "hv" prefix
for Hyper-V related code with "drm" to indicate this driver.

The changes are all mechanical text substitution in symbol names.
There are no other code or functional changes.

Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
This patch is built against linux-next 20260526.
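
For readers unfamiliar with the conflict described in the commit message, here
is a small userspace sketch of why a shared initcall name is a problem. It
assumes that a blacklist entry is compared against the initcall's function
name only; the sketch_* names and the owner labels are hypothetical and exist
just for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of an initcall: a symbol name plus an owner label. */
struct sketch_initcall {
	const char *symbol;	/* function name that a blacklist entry is matched against */
	const char *owner;	/* which code the initcall belongs to (informational) */
};

/* Count how many initcalls a name-only blacklist entry would match. */
static inline size_t
sketch_count_blacklisted(const struct sketch_initcall *calls, size_t n,
			 const char *entry)
{
	size_t i, hits = 0;

	for (i = 0; i < n; i++)
		if (strcmp(calls[i].symbol, entry) == 0)
			hits++;

	return hits;
}
```

With two initcalls both named hyperv_init, a single entry matches both, so the
DRM driver cannot be singled out. After the rename, an entry such as
initcall_blacklist=hvdrm_init would match only this driver's initcall.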

 drivers/gpu/drm/hyperv/hyperv_drm.h         |  20 ++--
 drivers/gpu/drm/hyperv/hyperv_drm_drv.c     |  88 ++++++++--------
 drivers/gpu/drm/hyperv/hyperv_drm_modeset.c | 110 ++++++++++----------
 drivers/gpu/drm/hyperv/hyperv_drm_proto.c   |  70 ++++++-------
 4 files changed, 144 insertions(+), 144 deletions(-)

diff --git a/drivers/gpu/drm/hyperv/hyperv_drm.h b/drivers/gpu/drm/hyperv/hyperv_drm.h
index 9e776112c03e..66bd8730aad2 100644
--- a/drivers/gpu/drm/hyperv/hyperv_drm.h
+++ b/drivers/gpu/drm/hyperv/hyperv_drm.h
@@ -8,7 +8,7 @@
 
 #define VMBUS_MAX_PACKET_SIZE 0x4000
 
-struct hyperv_drm_device {
+struct hvdrm_drm_device {
 	/* drm */
 	struct drm_device dev;
 	struct drm_plane plane;
@@ -39,17 +39,17 @@ struct hyperv_drm_device {
 	struct hv_device *hdev;
 };
 
-#define to_hv(_dev) container_of(_dev, struct hyperv_drm_device, dev)
+#define to_hv(_dev) container_of(_dev, struct hvdrm_drm_device, dev)
 
-/* hyperv_drm_modeset */
-int hyperv_mode_config_init(struct hyperv_drm_device *hv);
+/* hvdrm_drm_modeset */
+int hvdrm_mode_config_init(struct hvdrm_drm_device *hv);
 
-/* hyperv_drm_proto */
-int hyperv_update_vram_location(struct hv_device *hdev, phys_addr_t vram_pp);
-int hyperv_update_situation(struct hv_device *hdev, u8 active, u32 bpp,
+/* hvdrm_drm_proto */
+int hvdrm_update_vram_location(struct hv_device *hdev, phys_addr_t vram_pp);
+int hvdrm_update_situation(struct hv_device *hdev, u8 active, u32 bpp,
 			    u32 w, u32 h, u32 pitch);
-int hyperv_hide_hw_ptr(struct hv_device *hdev);
-int hyperv_update_dirt(struct hv_device *hdev, struct drm_rect *rect);
-int hyperv_connect_vsp(struct hv_device *hdev);
+int hvdrm_hide_hw_ptr(struct hv_device *hdev);
+int hvdrm_update_dirt(struct hv_device *hdev, struct drm_rect *rect);
+int hvdrm_connect_vsp(struct hv_device *hdev);
 
 #endif
diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
index b6bf6412ae34..a4456ccf340e 100644
--- a/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
+++ b/drivers/gpu/drm/hyperv/hyperv_drm_drv.c
@@ -26,7 +26,7 @@
 
 DEFINE_DRM_GEM_FOPS(hv_fops);
 
-static struct drm_driver hyperv_driver = {
+static struct drm_driver hvdrm_driver = {
 	.driver_features = DRIVER_MODESET | DRIVER_GEM | DRIVER_ATOMIC,
 
 	.name		 = DRIVER_NAME,
@@ -39,17 +39,17 @@ static struct drm_driver hyperv_driver = {
 	DRM_FBDEV_SHMEM_DRIVER_OPS,
 };
 
-static int hyperv_pci_probe(struct pci_dev *pdev,
+static int hvdrm_pci_probe(struct pci_dev *pdev,
 			    const struct pci_device_id *ent)
 {
 	return 0;
 }
 
-static void hyperv_pci_remove(struct pci_dev *pdev)
+static void hvdrm_pci_remove(struct pci_dev *pdev)
 {
 }
 
-static const struct pci_device_id hyperv_pci_tbl[] = {
+static const struct pci_device_id hvdrm_pci_tbl[] = {
 	{
 		.vendor = PCI_VENDOR_ID_MICROSOFT,
 		.device = PCI_DEVICE_ID_HYPERV_VIDEO,
@@ -60,14 +60,14 @@ static const struct pci_device_id hyperv_pci_tbl[] = {
 /*
  * PCI stub to support gen1 VM.
  */
-static struct pci_driver hyperv_pci_driver = {
+static struct pci_driver hvdrm_pci_driver = {
 	.name =		KBUILD_MODNAME,
-	.id_table =	hyperv_pci_tbl,
-	.probe =	hyperv_pci_probe,
-	.remove =	hyperv_pci_remove,
+	.id_table =	hvdrm_pci_tbl,
+	.probe =	hvdrm_pci_probe,
+	.remove =	hvdrm_pci_remove,
 };
 
-static int hyperv_setup_vram(struct hyperv_drm_device *hv,
+static int hvdrm_setup_vram(struct hvdrm_drm_device *hv,
 			     struct hv_device *hdev)
 {
 	struct drm_device *dev = &hv->dev;
@@ -102,15 +102,15 @@ static int hyperv_setup_vram(struct hyperv_drm_device *hv,
 	return ret;
 }
 
-static int hyperv_vmbus_probe(struct hv_device *hdev,
+static int hvdrm_vmbus_probe(struct hv_device *hdev,
 			      const struct hv_vmbus_device_id *dev_id)
 {
-	struct hyperv_drm_device *hv;
+	struct hvdrm_drm_device *hv;
 	struct drm_device *dev;
 	int ret;
 
-	hv = devm_drm_dev_alloc(&hdev->device, &hyperv_driver,
-				struct hyperv_drm_device, dev);
+	hv = devm_drm_dev_alloc(&hdev->device, &hvdrm_driver,
+				struct hvdrm_drm_device, dev);
 	if (IS_ERR(hv))
 		return PTR_ERR(hv);
 
@@ -119,15 +119,15 @@ static int hyperv_vmbus_probe(struct hv_device *hdev,
 	hv_set_drvdata(hdev, hv);
 	hv->hdev = hdev;
 
-	ret = hyperv_connect_vsp(hdev);
+	ret = hvdrm_connect_vsp(hdev);
 	if (ret) {
 		drm_err(dev, "Failed to connect to vmbus.\n");
 		goto err_hv_set_drv_data;
 	}
 
-	aperture_remove_all_conflicting_devices(hyperv_driver.name);
+	aperture_remove_all_conflicting_devices(hvdrm_driver.name);
 
-	ret = hyperv_setup_vram(hv, hdev);
+	ret = hvdrm_setup_vram(hv, hdev);
 	if (ret)
 		goto err_vmbus_close;
 
@@ -136,11 +136,11 @@ static int hyperv_vmbus_probe(struct hv_device *hdev,
 	 * vram location is not fatal. Device will update dirty area till
 	 * preferred resolution only.
 	 */
-	ret = hyperv_update_vram_location(hdev, hv->fb_base);
+	ret = hvdrm_update_vram_location(hdev, hv->fb_base);
 	if (ret)
 		drm_warn(dev, "Failed to update vram location.\n");
 
-	ret = hyperv_mode_config_init(hv);
+	ret = hvdrm_mode_config_init(hv);
 	if (ret)
 		goto err_free_mmio;
 
@@ -168,10 +168,10 @@ static int hyperv_vmbus_probe(struct hv_device *hdev,
 	return ret;
 }
 
-static void hyperv_vmbus_remove(struct hv_device *hdev)
+static void hvdrm_vmbus_remove(struct hv_device *hdev)
 {
 	struct drm_device *dev = hv_get_drvdata(hdev);
-	struct hyperv_drm_device *hv = to_hv(dev);
+	struct hvdrm_drm_device *hv = to_hv(dev);
 
 	vmbus_set_skip_unload(false);
 	drm_dev_unplug(dev);
@@ -183,12 +183,12 @@ static void hyperv_vmbus_remove(struct hv_device *hdev)
 	vmbus_free_mmio(hv->mem->start, hv->fb_size);
 }
 
-static void hyperv_vmbus_shutdown(struct hv_device *hdev)
+static void hvdrm_vmbus_shutdown(struct hv_device *hdev)
 {
 	drm_atomic_helper_shutdown(hv_get_drvdata(hdev));
 }
 
-static int hyperv_vmbus_suspend(struct hv_device *hdev)
+static int hvdrm_vmbus_suspend(struct hv_device *hdev)
 {
 	struct drm_device *dev = hv_get_drvdata(hdev);
 	int ret;
@@ -202,67 +202,67 @@ static int hyperv_vmbus_suspend(struct hv_device *hdev)
 	return 0;
 }
 
-static int hyperv_vmbus_resume(struct hv_device *hdev)
+static int hvdrm_vmbus_resume(struct hv_device *hdev)
 {
 	struct drm_device *dev = hv_get_drvdata(hdev);
-	struct hyperv_drm_device *hv = to_hv(dev);
+	struct hvdrm_drm_device *hv = to_hv(dev);
 	int ret;
 
-	ret = hyperv_connect_vsp(hdev);
+	ret = hvdrm_connect_vsp(hdev);
 	if (ret)
 		return ret;
 
-	ret = hyperv_update_vram_location(hdev, hv->fb_base);
+	ret = hvdrm_update_vram_location(hdev, hv->fb_base);
 	if (ret)
 		return ret;
 
 	return drm_mode_config_helper_resume(dev);
 }
 
-static const struct hv_vmbus_device_id hyperv_vmbus_tbl[] = {
+static const struct hv_vmbus_device_id hvdrm_vmbus_tbl[] = {
 	/* Synthetic Video Device GUID */
 	{HV_SYNTHVID_GUID},
 	{}
 };
 
-static struct hv_driver hyperv_hv_driver = {
+static struct hv_driver hvdrm_hv_driver = {
 	.name = KBUILD_MODNAME,
-	.id_table = hyperv_vmbus_tbl,
-	.probe = hyperv_vmbus_probe,
-	.remove = hyperv_vmbus_remove,
-	.shutdown = hyperv_vmbus_shutdown,
-	.suspend = hyperv_vmbus_suspend,
-	.resume = hyperv_vmbus_resume,
+	.id_table = hvdrm_vmbus_tbl,
+	.probe = hvdrm_vmbus_probe,
+	.remove = hvdrm_vmbus_remove,
+	.shutdown = hvdrm_vmbus_shutdown,
+	.suspend = hvdrm_vmbus_suspend,
+	.resume = hvdrm_vmbus_resume,
 	.driver = {
 		.probe_type = PROBE_PREFER_ASYNCHRONOUS,
 	},
 };
 
-static int __init hyperv_init(void)
+static int __init hvdrm_init(void)
 {
 	int ret;
 
 	if (drm_firmware_drivers_only())
 		return -ENODEV;
 
-	ret = pci_register_driver(&hyperv_pci_driver);
+	ret = pci_register_driver(&hvdrm_pci_driver);
 	if (ret != 0)
 		return ret;
 
-	return vmbus_driver_register(&hyperv_hv_driver);
+	return vmbus_driver_register(&hvdrm_hv_driver);
 }
 
-static void __exit hyperv_exit(void)
+static void __exit hvdrm_exit(void)
 {
-	vmbus_driver_unregister(&hyperv_hv_driver);
-	pci_unregister_driver(&hyperv_pci_driver);
+	vmbus_driver_unregister(&hvdrm_hv_driver);
+	pci_unregister_driver(&hvdrm_pci_driver);
 }
 
-module_init(hyperv_init);
-module_exit(hyperv_exit);
+module_init(hvdrm_init);
+module_exit(hvdrm_exit);
 
-MODULE_DEVICE_TABLE(pci, hyperv_pci_tbl);
-MODULE_DEVICE_TABLE(vmbus, hyperv_vmbus_tbl);
+MODULE_DEVICE_TABLE(pci, hvdrm_pci_tbl);
+MODULE_DEVICE_TABLE(vmbus, hvdrm_vmbus_tbl);
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Deepak Rawat <drawat.floss@gmail.com>");
 MODULE_DESCRIPTION("DRM driver for Hyper-V synthetic video device");
diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c b/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c
index 793dbbf61893..6844d085e709 100644
--- a/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c
+++ b/drivers/gpu/drm/hyperv/hyperv_drm_modeset.c
@@ -25,11 +25,11 @@
 
 #include "hyperv_drm.h"
 
-static int hyperv_blit_to_vram_rect(struct drm_framebuffer *fb,
+static int hvdrm_blit_to_vram_rect(struct drm_framebuffer *fb,
 				    const struct iosys_map *vmap,
 				    struct drm_rect *rect)
 {
-	struct hyperv_drm_device *hv = to_hv(fb->dev);
+	struct hvdrm_drm_device *hv = to_hv(fb->dev);
 	struct iosys_map dst = IOSYS_MAP_INIT_VADDR_IOMEM(hv->vram);
 	int idx;
 
@@ -44,9 +44,9 @@ static int hyperv_blit_to_vram_rect(struct drm_framebuffer *fb,
 	return 0;
 }
 
-static int hyperv_connector_get_modes(struct drm_connector *connector)
+static int hvdrm_connector_get_modes(struct drm_connector *connector)
 {
-	struct hyperv_drm_device *hv = to_hv(connector->dev);
+	struct hvdrm_drm_device *hv = to_hv(connector->dev);
 	int count;
 
 	count = drm_add_modes_noedid(connector,
@@ -58,11 +58,11 @@ static int hyperv_connector_get_modes(struct drm_connector *connector)
 	return count;
 }
 
-static const struct drm_connector_helper_funcs hyperv_connector_helper_funcs = {
-	.get_modes = hyperv_connector_get_modes,
+static const struct drm_connector_helper_funcs hvdrm_connector_helper_funcs = {
+	.get_modes = hvdrm_connector_get_modes,
 };
 
-static const struct drm_connector_funcs hyperv_connector_funcs = {
+static const struct drm_connector_funcs hvdrm_connector_funcs = {
 	.fill_modes = drm_helper_probe_single_connector_modes,
 	.destroy = drm_connector_cleanup,
 	.reset = drm_atomic_helper_connector_reset,
@@ -70,15 +70,15 @@ static const struct drm_connector_funcs hyperv_connector_funcs = {
 	.atomic_destroy_state = drm_atomic_helper_connector_destroy_state,
 };
 
-static inline int hyperv_conn_init(struct hyperv_drm_device *hv)
+static inline int hvdrm_conn_init(struct hvdrm_drm_device *hv)
 {
-	drm_connector_helper_add(&hv->connector, &hyperv_connector_helper_funcs);
+	drm_connector_helper_add(&hv->connector, &hvdrm_connector_helper_funcs);
 	return drm_connector_init(&hv->dev, &hv->connector,
-				  &hyperv_connector_funcs,
+				  &hvdrm_connector_funcs,
 				  DRM_MODE_CONNECTOR_VIRTUAL);
 }
 
-static int hyperv_check_size(struct hyperv_drm_device *hv, int w, int h,
+static int hvdrm_check_size(struct hvdrm_drm_device *hv, int w, int h,
 			     struct drm_framebuffer *fb)
 {
 	u32 pitch = w * (hv->screen_depth / 8);
@@ -92,25 +92,25 @@ static int hyperv_check_size(struct hyperv_drm_device *hv, int w, int h,
 	return 0;
 }
 
-static const uint32_t hyperv_formats[] = {
+static const uint32_t hvdrm_formats[] = {
 	DRM_FORMAT_XRGB8888,
 };
 
-static const uint64_t hyperv_modifiers[] = {
+static const uint64_t hvdrm_modifiers[] = {
 	DRM_FORMAT_MOD_LINEAR,
 	DRM_FORMAT_MOD_INVALID
 };
 
-static void hyperv_crtc_helper_atomic_enable(struct drm_crtc *crtc,
+static void hvdrm_crtc_helper_atomic_enable(struct drm_crtc *crtc,
 					     struct drm_atomic_commit *state)
 {
-	struct hyperv_drm_device *hv = to_hv(crtc->dev);
+	struct hvdrm_drm_device *hv = to_hv(crtc->dev);
 	struct drm_plane *plane = &hv->plane;
 	struct drm_plane_state *plane_state = plane->state;
 	struct drm_crtc_state *crtc_state = crtc->state;
 
-	hyperv_hide_hw_ptr(hv->hdev);
-	hyperv_update_situation(hv->hdev, 1,  hv->screen_depth,
+	hvdrm_hide_hw_ptr(hv->hdev);
+	hvdrm_update_situation(hv->hdev, 1,  hv->screen_depth,
 				crtc_state->mode.hdisplay,
 				crtc_state->mode.vdisplay,
 				plane_state->fb->pitches[0]);
@@ -118,14 +118,14 @@ static void hyperv_crtc_helper_atomic_enable(struct drm_crtc *crtc,
 	drm_crtc_vblank_on(crtc);
 }
 
-static const struct drm_crtc_helper_funcs hyperv_crtc_helper_funcs = {
+static const struct drm_crtc_helper_funcs hvdrm_crtc_helper_funcs = {
 	.atomic_check = drm_crtc_helper_atomic_check,
 	.atomic_flush = drm_crtc_vblank_atomic_flush,
-	.atomic_enable = hyperv_crtc_helper_atomic_enable,
+	.atomic_enable = hvdrm_crtc_helper_atomic_enable,
 	.atomic_disable = drm_crtc_vblank_atomic_disable,
 };
 
-static const struct drm_crtc_funcs hyperv_crtc_funcs = {
+static const struct drm_crtc_funcs hvdrm_crtc_funcs = {
 	.reset = drm_atomic_helper_crtc_reset,
 	.destroy = drm_crtc_cleanup,
 	.set_config = drm_atomic_helper_set_config,
@@ -135,11 +135,11 @@ static const struct drm_crtc_funcs hyperv_crtc_funcs = {
 	DRM_CRTC_VBLANK_TIMER_FUNCS,
 };
 
-static int hyperv_plane_atomic_check(struct drm_plane *plane,
+static int hvdrm_plane_atomic_check(struct drm_plane *plane,
 				     struct drm_atomic_commit *state)
 {
 	struct drm_plane_state *plane_state = drm_atomic_get_new_plane_state(state, plane);
-	struct hyperv_drm_device *hv = to_hv(plane->dev);
+	struct hvdrm_drm_device *hv = to_hv(plane->dev);
 	struct drm_framebuffer *fb = plane_state->fb;
 	struct drm_crtc *crtc = plane_state->crtc;
 	struct drm_crtc_state *crtc_state = NULL;
@@ -167,10 +167,10 @@ static int hyperv_plane_atomic_check(struct drm_plane *plane,
 	return 0;
 }
 
-static void hyperv_plane_atomic_update(struct drm_plane *plane,
+static void hvdrm_plane_atomic_update(struct drm_plane *plane,
 				       struct drm_atomic_commit *state)
 {
-	struct hyperv_drm_device *hv = to_hv(plane->dev);
+	struct hvdrm_drm_device *hv = to_hv(plane->dev);
 	struct drm_plane_state *old_state = drm_atomic_get_old_plane_state(state, plane);
 	struct drm_plane_state *new_state = drm_atomic_get_new_plane_state(state, plane);
 	struct drm_shadow_plane_state *shadow_plane_state = to_drm_shadow_plane_state(new_state);
@@ -185,15 +185,15 @@ static void hyperv_plane_atomic_update(struct drm_plane *plane,
 		if (!drm_rect_intersect(&dst_clip, &damage))
 			continue;
 
-		hyperv_blit_to_vram_rect(new_state->fb, &shadow_plane_state->data[0], &damage);
-		hyperv_update_dirt(hv->hdev, &damage);
+		hvdrm_blit_to_vram_rect(new_state->fb, &shadow_plane_state->data[0], &damage);
+		hvdrm_update_dirt(hv->hdev, &damage);
 	}
 }
 
-static int hyperv_plane_get_scanout_buffer(struct drm_plane *plane,
+static int hvdrm_plane_get_scanout_buffer(struct drm_plane *plane,
 					   struct drm_scanout_buffer *sb)
 {
-	struct hyperv_drm_device *hv = to_hv(plane->dev);
+	struct hvdrm_drm_device *hv = to_hv(plane->dev);
 	struct iosys_map map = IOSYS_MAP_INIT_VADDR_IOMEM(hv->vram);
 
 	if (plane->state && plane->state->fb) {
@@ -207,9 +207,9 @@ static int hyperv_plane_get_scanout_buffer(struct drm_plane *plane,
 	return -ENODEV;
 }
 
-static void hyperv_plane_panic_flush(struct drm_plane *plane)
+static void hvdrm_plane_panic_flush(struct drm_plane *plane)
 {
-	struct hyperv_drm_device *hv = to_hv(plane->dev);
+	struct hvdrm_drm_device *hv = to_hv(plane->dev);
 	struct drm_rect rect;
 
 	if (plane->state && plane->state->fb) {
@@ -218,32 +218,32 @@ static void hyperv_plane_panic_flush(struct drm_plane *plane)
 		rect.x2 = plane->state->fb->width;
 		rect.y2 = plane->state->fb->height;
 
-		hyperv_update_dirt(hv->hdev, &rect);
+		hvdrm_update_dirt(hv->hdev, &rect);
 	}
 
 	vmbus_initiate_unload(true);
 }
 
-static const struct drm_plane_helper_funcs hyperv_plane_helper_funcs = {
+static const struct drm_plane_helper_funcs hvdrm_plane_helper_funcs = {
 	DRM_GEM_SHADOW_PLANE_HELPER_FUNCS,
-	.atomic_check = hyperv_plane_atomic_check,
-	.atomic_update = hyperv_plane_atomic_update,
-	.get_scanout_buffer = hyperv_plane_get_scanout_buffer,
-	.panic_flush = hyperv_plane_panic_flush,
+	.atomic_check = hvdrm_plane_atomic_check,
+	.atomic_update = hvdrm_plane_atomic_update,
+	.get_scanout_buffer = hvdrm_plane_get_scanout_buffer,
+	.panic_flush = hvdrm_plane_panic_flush,
 };
 
-static const struct drm_plane_funcs hyperv_plane_funcs = {
+static const struct drm_plane_funcs hvdrm_plane_funcs = {
 	.update_plane		= drm_atomic_helper_update_plane,
 	.disable_plane		= drm_atomic_helper_disable_plane,
 	.destroy		= drm_plane_cleanup,
 	DRM_GEM_SHADOW_PLANE_FUNCS,
 };
 
-static const struct drm_encoder_funcs hyperv_drm_simple_encoder_funcs_cleanup = {
+static const struct drm_encoder_funcs hvdrm_drm_simple_encoder_funcs_cleanup = {
 	.destroy = drm_encoder_cleanup,
 };
 
-static inline int hyperv_pipe_init(struct hyperv_drm_device *hv)
+static inline int hvdrm_pipe_init(struct hvdrm_drm_device *hv)
 {
 	struct drm_device *dev = &hv->dev;
 	struct drm_encoder *encoder = &hv->encoder;
@@ -253,29 +253,29 @@ static inline int hyperv_pipe_init(struct hyperv_drm_device *hv)
 	int ret;
 
 	ret = drm_universal_plane_init(dev, plane, 0,
-				       &hyperv_plane_funcs,
-				       hyperv_formats, ARRAY_SIZE(hyperv_formats),
-				       hyperv_modifiers,
+				       &hvdrm_plane_funcs,
+				       hvdrm_formats, ARRAY_SIZE(hvdrm_formats),
+				       hvdrm_modifiers,
 				       DRM_PLANE_TYPE_PRIMARY, NULL);
 	if (ret)
 		return ret;
-	drm_plane_helper_add(plane, &hyperv_plane_helper_funcs);
+	drm_plane_helper_add(plane, &hvdrm_plane_helper_funcs);
 	drm_plane_enable_fb_damage_clips(plane);
 
 	ret = drm_crtc_init_with_planes(dev, crtc, plane, NULL,
-					&hyperv_crtc_funcs, NULL);
+					&hvdrm_crtc_funcs, NULL);
 	if (ret)
 		return ret;
-	drm_crtc_helper_add(crtc, &hyperv_crtc_helper_funcs);
+	drm_crtc_helper_add(crtc, &hvdrm_crtc_helper_funcs);
 
 	encoder->possible_crtcs = drm_crtc_mask(crtc);
 	ret = drm_encoder_init(dev, encoder,
-			       &hyperv_drm_simple_encoder_funcs_cleanup,
+			       &hvdrm_drm_simple_encoder_funcs_cleanup,
 			       DRM_MODE_ENCODER_NONE, NULL);
 	if (ret)
 		return ret;
 
-	ret = hyperv_conn_init(hv);
+	ret = hvdrm_conn_init(hv);
 	if (ret) {
 		drm_err(dev, "Failed to initialized connector.\n");
 		return ret;
@@ -285,25 +285,25 @@ static inline int hyperv_pipe_init(struct hyperv_drm_device *hv)
 }
 
 static enum drm_mode_status
-hyperv_mode_valid(struct drm_device *dev,
+hvdrm_mode_valid(struct drm_device *dev,
 		  const struct drm_display_mode *mode)
 {
-	struct hyperv_drm_device *hv = to_hv(dev);
+	struct hvdrm_drm_device *hv = to_hv(dev);
 
-	if (hyperv_check_size(hv, mode->hdisplay, mode->vdisplay, NULL))
+	if (hvdrm_check_size(hv, mode->hdisplay, mode->vdisplay, NULL))
 		return MODE_BAD;
 
 	return MODE_OK;
 }
 
-static const struct drm_mode_config_funcs hyperv_mode_config_funcs = {
+static const struct drm_mode_config_funcs hvdrm_mode_config_funcs = {
 	.fb_create = drm_gem_fb_create_with_dirty,
-	.mode_valid = hyperv_mode_valid,
+	.mode_valid = hvdrm_mode_valid,
 	.atomic_check = drm_atomic_helper_check,
 	.atomic_commit = drm_atomic_helper_commit,
 };
 
-int hyperv_mode_config_init(struct hyperv_drm_device *hv)
+int hvdrm_mode_config_init(struct hvdrm_drm_device *hv)
 {
 	struct drm_device *dev = &hv->dev;
 	int ret;
@@ -322,9 +322,9 @@ int hyperv_mode_config_init(struct hyperv_drm_device *hv)
 	dev->mode_config.preferred_depth = hv->screen_depth;
 	dev->mode_config.prefer_shadow = 0;
 
-	dev->mode_config.funcs = &hyperv_mode_config_funcs;
+	dev->mode_config.funcs = &hvdrm_mode_config_funcs;
 
-	ret = hyperv_pipe_init(hv);
+	ret = hvdrm_pipe_init(hv);
 	if (ret) {
 		drm_err(dev, "Failed to initialized pipe.\n");
 		return ret;
diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
index 6e09b0218df4..7c11b20a9124 100644
--- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
+++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
@@ -181,7 +181,7 @@ struct synthvid_msg {
 	};
 } __packed;
 
-static inline bool hyperv_version_ge(u32 ver1, u32 ver2)
+static inline bool hvdrm_version_ge(u32 ver1, u32 ver2)
 {
 	if (SYNTHVID_VER_GET_MAJOR(ver1) > SYNTHVID_VER_GET_MAJOR(ver2) ||
 	    (SYNTHVID_VER_GET_MAJOR(ver1) == SYNTHVID_VER_GET_MAJOR(ver2) &&
@@ -191,10 +191,10 @@ static inline bool hyperv_version_ge(u32 ver1, u32 ver2)
 	return false;
 }
 
-static inline int hyperv_sendpacket(struct hv_device *hdev, struct synthvid_msg *msg)
+static inline int hvdrm_sendpacket(struct hv_device *hdev, struct synthvid_msg *msg)
 {
 	static atomic64_t request_id = ATOMIC64_INIT(0);
-	struct hyperv_drm_device *hv = hv_get_drvdata(hdev);
+	struct hvdrm_drm_device *hv = hv_get_drvdata(hdev);
 	int ret;
 
 	msg->pipe_hdr.type = PIPE_MSG_DATA;
@@ -211,9 +211,9 @@ static inline int hyperv_sendpacket(struct hv_device *hdev, struct synthvid_msg
 	return ret;
 }
 
-static int hyperv_negotiate_version(struct hv_device *hdev, u32 ver)
+static int hvdrm_negotiate_version(struct hv_device *hdev, u32 ver)
 {
-	struct hyperv_drm_device *hv = hv_get_drvdata(hdev);
+	struct hvdrm_drm_device *hv = hv_get_drvdata(hdev);
 	struct synthvid_msg *msg = (struct synthvid_msg *)hv->init_buf;
 	struct drm_device *dev = &hv->dev;
 	unsigned long t;
@@ -223,7 +223,7 @@ static int hyperv_negotiate_version(struct hv_device *hdev, u32 ver)
 	msg->vid_hdr.size = sizeof(struct synthvid_msg_hdr) +
 		sizeof(struct synthvid_version_req);
 	msg->ver_req.version = ver;
-	hyperv_sendpacket(hdev, msg);
+	hvdrm_sendpacket(hdev, msg);
 
 	t = wait_for_completion_timeout(&hv->wait, VMBUS_VSP_TIMEOUT);
 	if (!t) {
@@ -243,9 +243,9 @@ static int hyperv_negotiate_version(struct hv_device *hdev, u32 ver)
 	return 0;
 }
 
-int hyperv_update_vram_location(struct hv_device *hdev, phys_addr_t vram_pp)
+int hvdrm_update_vram_location(struct hv_device *hdev, phys_addr_t vram_pp)
 {
-	struct hyperv_drm_device *hv = hv_get_drvdata(hdev);
+	struct hvdrm_drm_device *hv = hv_get_drvdata(hdev);
 	struct synthvid_msg *msg = (struct synthvid_msg *)hv->init_buf;
 	struct drm_device *dev = &hv->dev;
 	unsigned long t;
@@ -257,7 +257,7 @@ int hyperv_update_vram_location(struct hv_device *hdev, phys_addr_t vram_pp)
 	msg->vram.user_ctx = vram_pp;
 	msg->vram.vram_gpa = vram_pp;
 	msg->vram.is_vram_gpa_specified = 1;
-	hyperv_sendpacket(hdev, msg);
+	hvdrm_sendpacket(hdev, msg);
 
 	t = wait_for_completion_timeout(&hv->wait, VMBUS_VSP_TIMEOUT);
 	if (!t) {
@@ -272,7 +272,7 @@ int hyperv_update_vram_location(struct hv_device *hdev, phys_addr_t vram_pp)
 	return 0;
 }
 
-int hyperv_update_situation(struct hv_device *hdev, u8 active, u32 bpp,
+int hvdrm_update_situation(struct hv_device *hdev, u8 active, u32 bpp,
 			    u32 w, u32 h, u32 pitch)
 {
 	struct synthvid_msg msg;
@@ -292,7 +292,7 @@ int hyperv_update_situation(struct hv_device *hdev, u8 active, u32 bpp,
 	msg.situ.video_output[0].height_pixels = h;
 	msg.situ.video_output[0].pitch_bytes = pitch;
 
-	hyperv_sendpacket(hdev, &msg);
+	hvdrm_sendpacket(hdev, &msg);
 
 	return 0;
 }
@@ -306,11 +306,11 @@ int hyperv_update_situation(struct hv_device *hdev, u8 active, u32 bpp,
  * the msg.ptr_shape.data. Note: setting msg.ptr_pos.is_visible to 0 doesn't
  * work in tests.
  *
- * The hyperv_hide_hw_ptr() is also called in the handler of the
+ * The hvdrm_hide_hw_ptr() is also called in the handler of the
  * SYNTHVID_FEATURE_CHANGE event, otherwise the host still draws an extra
  * unwanted mouse pointer after the VM Connection window is closed and reopened.
  */
-int hyperv_hide_hw_ptr(struct hv_device *hdev)
+int hvdrm_hide_hw_ptr(struct hv_device *hdev)
 {
 	struct synthvid_msg msg;
 
@@ -322,7 +322,7 @@ int hyperv_hide_hw_ptr(struct hv_device *hdev)
 	msg.ptr_pos.video_output = 0;
 	msg.ptr_pos.image_x = 0;
 	msg.ptr_pos.image_y = 0;
-	hyperv_sendpacket(hdev, &msg);
+	hvdrm_sendpacket(hdev, &msg);
 
 	memset(&msg, 0, sizeof(struct synthvid_msg));
 	msg.vid_hdr.type = SYNTHVID_POINTER_SHAPE;
@@ -338,14 +338,14 @@ int hyperv_hide_hw_ptr(struct hv_device *hdev)
 	msg.ptr_shape.data[1] = 1;
 	msg.ptr_shape.data[2] = 1;
 	msg.ptr_shape.data[3] = 1;
-	hyperv_sendpacket(hdev, &msg);
+	hvdrm_sendpacket(hdev, &msg);
 
 	return 0;
 }
 
-int hyperv_update_dirt(struct hv_device *hdev, struct drm_rect *rect)
+int hvdrm_update_dirt(struct hv_device *hdev, struct drm_rect *rect)
 {
-	struct hyperv_drm_device *hv = hv_get_drvdata(hdev);
+	struct hvdrm_drm_device *hv = hv_get_drvdata(hdev);
 	struct synthvid_msg msg;
 
 	if (!hv->dirt_needed)
@@ -363,14 +363,14 @@ int hyperv_update_dirt(struct hv_device *hdev, struct drm_rect *rect)
 	msg.dirt.rect[0].x2 = rect->x2;
 	msg.dirt.rect[0].y2 = rect->y2;
 
-	hyperv_sendpacket(hdev, &msg);
+	hvdrm_sendpacket(hdev, &msg);
 
 	return 0;
 }
 
-static int hyperv_get_supported_resolution(struct hv_device *hdev)
+static int hvdrm_get_supported_resolution(struct hv_device *hdev)
 {
-	struct hyperv_drm_device *hv = hv_get_drvdata(hdev);
+	struct hvdrm_drm_device *hv = hv_get_drvdata(hdev);
 	struct synthvid_msg *msg = (struct synthvid_msg *)hv->init_buf;
 	struct drm_device *dev = &hv->dev;
 	unsigned long t;
@@ -383,7 +383,7 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
 		sizeof(struct synthvid_supported_resolution_req);
 	msg->resolution_req.maximum_resolution_count =
 		SYNTHVID_MAX_RESOLUTION_COUNT;
-	hyperv_sendpacket(hdev, msg);
+	hvdrm_sendpacket(hdev, msg);
 
 	t = wait_for_completion_timeout(&hv->wait, VMBUS_VSP_TIMEOUT);
 	if (!t) {
@@ -420,9 +420,9 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
 	return 0;
 }
 
-static void hyperv_receive_sub(struct hv_device *hdev, u32 bytes_recvd)
+static void hvdrm_receive_sub(struct hv_device *hdev, u32 bytes_recvd)
 {
-	struct hyperv_drm_device *hv = hv_get_drvdata(hdev);
+	struct hvdrm_drm_device *hv = hv_get_drvdata(hdev);
 	struct synthvid_msg *msg;
 	size_t hdr_size;
 	size_t need;
@@ -486,7 +486,7 @@ static void hyperv_receive_sub(struct hv_device *hdev, u32 bytes_recvd)
 		}
 		hv->dirt_needed = msg->feature_chg.is_dirt_needed;
 		if (hv->dirt_needed)
-			hyperv_hide_hw_ptr(hv->hdev);
+			hvdrm_hide_hw_ptr(hv->hdev);
 		return;
 	default:
 		return;
@@ -508,10 +508,10 @@ static void hyperv_receive_sub(struct hv_device *hdev, u32 bytes_recvd)
 	complete(&hv->wait);
 }
 
-static void hyperv_receive(void *ctx)
+static void hvdrm_receive(void *ctx)
 {
 	struct hv_device *hdev = ctx;
-	struct hyperv_drm_device *hv = hv_get_drvdata(hdev);
+	struct hvdrm_drm_device *hv = hv_get_drvdata(hdev);
 	struct synthvid_msg *recv_buf;
 	u32 bytes_recvd;
 	u64 req_id;
@@ -539,19 +539,19 @@ static void hyperv_receive(void *ctx)
 					    ret, bytes_recvd);
 		} else if (bytes_recvd > 0 &&
 			   recv_buf->pipe_hdr.type == PIPE_MSG_DATA) {
-			hyperv_receive_sub(hdev, bytes_recvd);
+			hvdrm_receive_sub(hdev, bytes_recvd);
 		}
 	} while (bytes_recvd > 0 && ret == 0);
 }
 
-int hyperv_connect_vsp(struct hv_device *hdev)
+int hvdrm_connect_vsp(struct hv_device *hdev)
 {
-	struct hyperv_drm_device *hv = hv_get_drvdata(hdev);
+	struct hvdrm_drm_device *hv = hv_get_drvdata(hdev);
 	struct drm_device *dev = &hv->dev;
 	int ret;
 
 	ret = vmbus_open(hdev->channel, VMBUS_RING_BUFSIZE, VMBUS_RING_BUFSIZE,
-			 NULL, 0, hyperv_receive, hdev);
+			 NULL, 0, hvdrm_receive, hdev);
 	if (ret) {
 		drm_err(dev, "Unable to open vmbus channel\n");
 		return ret;
@@ -561,16 +561,16 @@ int hyperv_connect_vsp(struct hv_device *hdev)
 	switch (vmbus_proto_version) {
 	case VERSION_WIN10:
 	case VERSION_WIN10_V5:
-		ret = hyperv_negotiate_version(hdev, SYNTHVID_VERSION_WIN10);
+		ret = hvdrm_negotiate_version(hdev, SYNTHVID_VERSION_WIN10);
 		if (!ret)
 			break;
 		fallthrough;
 	case VERSION_WIN8:
 	case VERSION_WIN8_1:
-		ret = hyperv_negotiate_version(hdev, SYNTHVID_VERSION_WIN8);
+		ret = hvdrm_negotiate_version(hdev, SYNTHVID_VERSION_WIN8);
 		break;
 	default:
-		ret = hyperv_negotiate_version(hdev, SYNTHVID_VERSION_WIN10);
+		ret = hvdrm_negotiate_version(hdev, SYNTHVID_VERSION_WIN10);
 		break;
 	}
 
@@ -581,8 +581,8 @@ int hyperv_connect_vsp(struct hv_device *hdev)
 
 	hv->screen_depth = SYNTHVID_DEPTH_WIN8;
 
-	if (hyperv_version_ge(hv->synthvid_version, SYNTHVID_VERSION_WIN10)) {
-		ret = hyperv_get_supported_resolution(hdev);
+	if (hvdrm_version_ge(hv->synthvid_version, SYNTHVID_VERSION_WIN10)) {
+		ret = hvdrm_get_supported_resolution(hdev);
 		if (ret)
 			drm_err(dev, "Failed to get supported resolution from host, use default\n");
 	}
-- 
2.25.1



* [PATCH 2/2] RDMA: Update the query_device() op
From: Jason Gunthorpe @ 2026-05-26 16:15 UTC (permalink / raw)
  To: Abhijit Gangurde, Allen Hubbe,
	Broadcom internal kernel review list, Bernard Metzler,
	Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
	Junxian Huang, Kai Shen, Kalesh AP, Konstantin Taranov,
	Krzysztof Czurylo, Leon Romanovsky, linux-hyperv, linux-rdma,
	Long Li, Michal Kalderon, Nelson Escobar, Satish Kharat,
	Selvin Xavier, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
	Yishai Hadas, Zhu Yanjun
  Cc: Leon Romanovsky, patches
In-Reply-To: <0-v1-922fa8e828ba+f7-ib_udata_stack_jgg@nvidia.com>

This op hasn't followed the normal pattern of passing NULL for udata when
invoked by the kernel. Instead the kernel caller creates a dummy ib_udata
on the stack and passes that in. This does not currently appear to be a bug,
but the flow should be modernized to use the new API and, in the process,
accept NULL as well.

Only mlx4 uses an input request structure, so have every other driver call
ib_is_udata_in_empty() to enforce the lack of request structs.

Use ib_respond_empty_udata() in every driver that does not use a response
struct.

Ensure a check for NULL udata before calling ib_respond_udata() in
bnxt_re, efa, and mlx5.

Make mlx4 safe to be called with NULL.
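
As a rough userspace sketch of the per-driver pattern described above (not
the kernel code itself): the struct and the two helpers below are simplified
stand-ins, and their behaviour is an assumption about what
ib_is_udata_in_empty() and ib_respond_empty_udata() do, inferred from the
checks they replace in this patch. All names ending in _sketch or starting
with sketch_ are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for struct ib_udata; only the lengths are modeled. */
struct udata_sketch {
	size_t inlen;	/* bytes of driver-private request data from userspace */
	size_t outlen;	/* bytes of driver-private response space */
};

/* Simplified stand-in for struct ib_device_attr. */
struct attr_sketch {
	int max_qp;
};

/*
 * Hypothetical analogue of ib_is_udata_in_empty() (assumed semantics):
 * a NULL udata means a kernel caller and is accepted, while a non-empty
 * request from userspace is rejected.
 */
int sketch_udata_in_empty(const struct udata_sketch *udata)
{
	if (udata && udata->inlen)
		return -EINVAL;
	return 0;
}

/*
 * Hypothetical analogue of ib_respond_empty_udata() (assumed semantics):
 * the driver has no response struct, so nothing is copied out whether
 * udata is NULL or not.
 */
int sketch_respond_empty(const struct udata_sketch *udata)
{
	(void)udata;
	return 0;
}

/* Shape of a query_device() op that tolerates a NULL udata. */
int sketch_query_device(struct attr_sketch *attr,
			const struct udata_sketch *udata)
{
	int err;

	assert(attr != NULL);

	err = sketch_udata_in_empty(udata);
	if (err)
		return err;

	attr->max_qp = 64;	/* arbitrary illustrative value */

	return sketch_respond_empty(udata);
}
```

With this shape the kernel caller can pass NULL directly, and a userspace
caller that supplies an unexpected request structure still gets -EINVAL.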

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/core/device.c                |  3 +--
 drivers/infiniband/hw/bnxt_re/ib_verbs.c        |  5 ++++-
 drivers/infiniband/hw/cxgb4/provider.c          |  8 +++++---
 drivers/infiniband/hw/erdma/erdma_verbs.c       |  9 +++++++--
 drivers/infiniband/hw/hns/hns_roce_main.c       |  7 ++++++-
 drivers/infiniband/hw/ionic/ionic_ibdev.c       |  7 ++++++-
 drivers/infiniband/hw/irdma/verbs.c             |  8 +++++---
 drivers/infiniband/hw/mana/main.c               |  7 ++++++-
 drivers/infiniband/hw/mlx4/main.c               | 13 +++++++------
 drivers/infiniband/hw/mthca/mthca_provider.c    | 13 ++++++++-----
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c     |  8 +++++---
 drivers/infiniband/hw/qedr/verbs.c              |  7 ++++++-
 drivers/infiniband/hw/usnic/usnic_ib_verbs.c    |  8 +++++---
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.c |  8 +++++---
 drivers/infiniband/sw/rdmavt/vt.c               |  9 ++++++---
 drivers/infiniband/sw/rxe/rxe_verbs.c           | 14 ++++----------
 drivers/infiniband/sw/siw/siw_verbs.c           |  8 +++++---
 17 files changed, 91 insertions(+), 51 deletions(-)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index b89efaaa81ec58..9f9662e9228186 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -1245,7 +1245,6 @@ static int assign_name(struct ib_device *device, const char *name)
  */
 static int setup_device(struct ib_device *device)
 {
-	struct ib_udata uhw = {.outlen = 0, .inlen = 0};
 	int ret;
 
 	ib_device_check_mandatory(device);
@@ -1257,7 +1256,7 @@ static int setup_device(struct ib_device *device)
 	}
 
 	memset(&device->attrs, 0, sizeof(device->attrs));
-	ret = device->ops.query_device(device, &device->attrs, &uhw);
+	ret = device->ops.query_device(device, &device->attrs, NULL);
 	if (ret) {
 		dev_warn(&device->dev,
 			 "Couldn't query the device attributes\n");
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index 94aa06e3b828ca..98d65c1b102200 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -265,7 +265,10 @@ int bnxt_re_query_device(struct ib_device *ibdev,
 		resp.packet_pacing_caps.supported_qpts =
 			1 << IB_QPT_RC;
 	}
-	return ib_respond_udata(udata, resp);
+
+	if (udata)
+		return ib_respond_udata(udata, resp);
+	return 0;
 }
 
 int bnxt_re_modify_device(struct ib_device *ibdev,
diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c
index 0e3827022c63da..e1eec37ee8222a 100644
--- a/drivers/infiniband/hw/cxgb4/provider.c
+++ b/drivers/infiniband/hw/cxgb4/provider.c
@@ -259,11 +259,13 @@ static int c4iw_query_device(struct ib_device *ibdev, struct ib_device_attr *pro
 {
 
 	struct c4iw_dev *dev;
+	int err;
 
 	pr_debug("ibdev %p\n", ibdev);
 
-	if (uhw->inlen || uhw->outlen)
-		return -EINVAL;
+	err = ib_is_udata_in_empty(uhw);
+	if (err)
+		return err;
 
 	dev = to_c4iw_dev(ibdev);
 	addrconf_addr_eui48((u8 *)&props->sys_image_guid,
@@ -298,7 +300,7 @@ static int c4iw_query_device(struct ib_device *ibdev, struct ib_device_attr *pro
 	props->max_fast_reg_page_list_len =
 		t4_max_fr_depth(dev->rdev.lldi.ulptx_memwrite_dsgl && use_dsgl);
 
-	return 0;
+	return ib_respond_empty_udata(uhw);
 }
 
 static int c4iw_query_port(struct ib_device *ibdev, u32 port,
diff --git a/drivers/infiniband/hw/erdma/erdma_verbs.c b/drivers/infiniband/hw/erdma/erdma_verbs.c
index b59c2e3a5306d1..d9eb8ae2c56fba 100644
--- a/drivers/infiniband/hw/erdma/erdma_verbs.c
+++ b/drivers/infiniband/hw/erdma/erdma_verbs.c
@@ -315,9 +315,14 @@ erdma_user_mmap_entry_insert(struct erdma_ucontext *uctx, void *address,
 }
 
 int erdma_query_device(struct ib_device *ibdev, struct ib_device_attr *attr,
-		       struct ib_udata *unused)
+		       struct ib_udata *udata)
 {
 	struct erdma_dev *dev = to_edev(ibdev);
+	int err;
+
+	err = ib_is_udata_in_empty(udata);
+	if (err)
+		return err;
 
 	memset(attr, 0, sizeof(*attr));
 
@@ -358,7 +363,7 @@ int erdma_query_device(struct ib_device *ibdev, struct ib_device_attr *attr,
 		addrconf_addr_eui48((u8 *)&attr->sys_image_guid,
 				    dev->netdev->dev_addr);
 
-	return 0;
+	return ib_respond_empty_udata(udata);
 }
 
 int erdma_query_gid(struct ib_device *ibdev, u32 port, int idx,
diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c b/drivers/infiniband/hw/hns/hns_roce_main.c
index 77bad9f5d482bb..c6f633bd5a3402 100644
--- a/drivers/infiniband/hw/hns/hns_roce_main.c
+++ b/drivers/infiniband/hw/hns/hns_roce_main.c
@@ -221,6 +221,11 @@ static int hns_roce_query_device(struct ib_device *ib_dev,
 				 struct ib_udata *uhw)
 {
 	struct hns_roce_dev *hr_dev = to_hr_dev(ib_dev);
+	int ret;
+
+	ret = ib_is_udata_in_empty(uhw);
+	if (ret)
+		return ret;
 
 	memset(props, 0, sizeof(*props));
 
@@ -274,7 +279,7 @@ static int hns_roce_query_device(struct ib_device *ib_dev,
 	if (hr_dev->caps.flags & HNS_ROCE_CAP_FLAG_XRC)
 		props->device_cap_flags |= IB_DEVICE_XRC;
 
-	return 0;
+	return ib_respond_empty_udata(uhw);
 }
 
 static int hns_roce_query_port(struct ib_device *ib_dev, u32 port_num,
diff --git a/drivers/infiniband/hw/ionic/ionic_ibdev.c b/drivers/infiniband/hw/ionic/ionic_ibdev.c
index 73a616ae350236..b0449c75f8938f 100644
--- a/drivers/infiniband/hw/ionic/ionic_ibdev.c
+++ b/drivers/infiniband/hw/ionic/ionic_ibdev.c
@@ -25,6 +25,11 @@ static int ionic_query_device(struct ib_device *ibdev,
 {
 	struct ionic_ibdev *dev = to_ionic_ibdev(ibdev);
 	struct net_device *ndev;
+	int err;
+
+	err = ib_is_udata_in_empty(udata);
+	if (err)
+		return err;
 
 	ndev = ib_device_get_netdev(ibdev, 1);
 	addrconf_ifid_eui48((u8 *)&attr->sys_image_guid, ndev);
@@ -69,7 +74,7 @@ static int ionic_query_device(struct ib_device *ibdev,
 	attr->max_fast_reg_page_list_len = dev->lif_cfg.npts_per_lif / 2;
 	attr->max_pkeys = IONIC_PKEY_TBL_LEN;
 
-	return 0;
+	return ib_respond_empty_udata(udata);
 }
 
 static int ionic_query_port(struct ib_device *ibdev, u32 port,
diff --git a/drivers/infiniband/hw/irdma/verbs.c b/drivers/infiniband/hw/irdma/verbs.c
index 3f4811bb5514c6..5ba2e63b51036e 100644
--- a/drivers/infiniband/hw/irdma/verbs.c
+++ b/drivers/infiniband/hw/irdma/verbs.c
@@ -16,9 +16,11 @@ static int irdma_query_device(struct ib_device *ibdev,
 	struct irdma_pci_f *rf = iwdev->rf;
 	struct pci_dev *pcidev = iwdev->rf->pcidev;
 	struct irdma_hw_attrs *hw_attrs = &rf->sc_dev.hw_attrs;
+	int err;
 
-	if (udata->inlen || udata->outlen)
-		return -EINVAL;
+	err = ib_is_udata_in_empty(udata);
+	if (err)
+		return err;
 
 	memset(props, 0, sizeof(*props));
 	addrconf_addr_eui48((u8 *)&props->sys_image_guid,
@@ -74,7 +76,7 @@ static int irdma_query_device(struct ib_device *ibdev,
 	if (hw_attrs->uk_attrs.hw_rev >= IRDMA_GEN_3)
 		props->device_cap_flags |= IB_DEVICE_MEM_WINDOW_TYPE_2B;
 
-	return 0;
+	return ib_respond_empty_udata(udata);
 }
 
 /**
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index 307ae01bf26f34..4dcd048d44b69a 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -549,6 +549,11 @@ int mana_ib_query_device(struct ib_device *ibdev, struct ib_device_attr *props,
 {
 	struct mana_ib_dev *dev = container_of(ibdev, struct mana_ib_dev, ib_dev);
 	struct pci_dev *pdev = to_pci_dev(mdev_to_gc(dev)->dev);
+	int err;
+
+	err = ib_is_udata_in_empty(uhw);
+	if (err)
+		return err;
 
 	memset(props, 0, sizeof(*props));
 	props->vendor_id = pdev->vendor;
@@ -576,7 +581,7 @@ int mana_ib_query_device(struct ib_device *ibdev, struct ib_device_attr *props,
 	if (!mana_ib_is_rnic(dev))
 		props->raw_packet_caps = IB_RAW_PACKET_CAP_IP_CSUM;
 
-	return 0;
+	return ib_respond_empty_udata(uhw);
 }
 
 int mana_ib_query_port(struct ib_device *ibdev, u32 port,
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index d50743f090bf21..17073e8f105aab 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -444,8 +444,9 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
 	struct mlx4_uverbs_ex_query_device cmd;
 	struct mlx4_uverbs_ex_query_device_resp resp = {};
 	struct mlx4_clock_params clock_params;
+	size_t uhw_outlen = uhw ? uhw->outlen : 0;
 
-	if (uhw->inlen) {
+	if (uhw && uhw->inlen) {
 		err = ib_copy_validate_udata_in_cm(uhw, cmd, reserved, 0);
 		if (err)
 			return err;
@@ -572,7 +573,7 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
 	props->cq_caps.max_cq_moderation_count = MLX4_MAX_CQ_COUNT;
 	props->cq_caps.max_cq_moderation_period = MLX4_MAX_CQ_PERIOD;
 
-	if (uhw->outlen >= resp.response_length + sizeof(resp.hca_core_clock_offset)) {
+	if (uhw_outlen >= resp.response_length + sizeof(resp.hca_core_clock_offset)) {
 		resp.response_length += sizeof(resp.hca_core_clock_offset);
 		if (!mlx4_get_internal_clock_params(dev->dev, &clock_params)) {
 			resp.comp_mask |= MLX4_IB_QUERY_DEV_RESP_MASK_CORE_CLOCK_OFFSET;
@@ -580,14 +581,14 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
 		}
 	}
 
-	if (uhw->outlen >= resp.response_length +
+	if (uhw_outlen >= resp.response_length +
 	    sizeof(resp.max_inl_recv_sz)) {
 		resp.response_length += sizeof(resp.max_inl_recv_sz);
 		resp.max_inl_recv_sz  = dev->dev->caps.max_rq_sg *
 			sizeof(struct mlx4_wqe_data_seg);
 	}
 
-	if (offsetofend(typeof(resp), rss_caps) <= uhw->outlen) {
+	if (offsetofend(typeof(resp), rss_caps) <= uhw_outlen) {
 		if (props->rss_caps.supported_qpts) {
 			resp.rss_caps.rx_hash_function =
 				MLX4_IB_RX_HASH_FUNC_TOEPLITZ;
@@ -611,7 +612,7 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
 				       sizeof(resp.rss_caps);
 	}
 
-	if (offsetofend(typeof(resp), tso_caps) <= uhw->outlen) {
+	if (offsetofend(typeof(resp), tso_caps) <= uhw_outlen) {
 		if (dev->dev->caps.max_gso_sz &&
 		    ((mlx4_ib_port_link_layer(ibdev, 1) ==
 		    IB_LINK_LAYER_ETHERNET) ||
@@ -625,7 +626,7 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
 				       sizeof(resp.tso_caps);
 	}
 
-	if (uhw->outlen) {
+	if (uhw_outlen) {
 		err = ib_respond_udata(uhw, resp);
 		if (err)
 			goto out;
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index afa97d3801f783..079c51003b24a4 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -55,16 +55,19 @@ static int mthca_query_device(struct ib_device *ibdev, struct ib_device_attr *pr
 {
 	struct ib_smp *in_mad;
 	struct ib_smp *out_mad;
-	int err = -ENOMEM;
+	int err;
 	struct mthca_dev *mdev = to_mdev(ibdev);
 
-	if (uhw->inlen || uhw->outlen)
-		return -EINVAL;
+	err = ib_is_udata_in_empty(uhw);
+	if (err)
+		return err;
 
 	in_mad = kzalloc_obj(*in_mad);
 	out_mad = kmalloc_obj(*out_mad);
-	if (!in_mad || !out_mad)
+	if (!in_mad || !out_mad) {
+		err = -ENOMEM;
 		goto out;
+	}
 
 	memset(props, 0, sizeof *props);
 
@@ -111,7 +114,7 @@ static int mthca_query_device(struct ib_device *ibdev, struct ib_device_attr *pr
 	props->max_total_mcast_qp_attach = props->max_mcast_qp_attach *
 					   props->max_mcast_grp;
 
-	err = 0;
+	err = ib_respond_empty_udata(uhw);
  out:
 	kfree(in_mad);
 	kfree(out_mad);
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index 383f1d9c15d151..17def9d9ce99ca 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -68,9 +68,11 @@ int ocrdma_query_device(struct ib_device *ibdev, struct ib_device_attr *attr,
 			struct ib_udata *uhw)
 {
 	struct ocrdma_dev *dev = get_ocrdma_dev(ibdev);
+	int err;
 
-	if (uhw->inlen || uhw->outlen)
-		return -EINVAL;
+	err = ib_is_udata_in_empty(uhw);
+	if (err)
+		return err;
 
 	memset(attr, 0, sizeof *attr);
 	memcpy(&attr->fw_ver, &dev->attr.fw_ver[0],
@@ -110,7 +112,7 @@ int ocrdma_query_device(struct ib_device *ibdev, struct ib_device_attr *attr,
 	attr->local_ca_ack_delay = dev->attr.local_ca_ack_delay;
 	attr->max_fast_reg_page_list_len = dev->attr.max_pages_per_frmr;
 	attr->max_pkeys = 1;
-	return 0;
+	return ib_respond_empty_udata(uhw);
 }
 
 static inline void get_link_speed_and_width(struct ocrdma_dev *dev,
diff --git a/drivers/infiniband/hw/qedr/verbs.c b/drivers/infiniband/hw/qedr/verbs.c
index 1af908275ca729..cf01078820d8cb 100644
--- a/drivers/infiniband/hw/qedr/verbs.c
+++ b/drivers/infiniband/hw/qedr/verbs.c
@@ -105,6 +105,7 @@ int qedr_query_device(struct ib_device *ibdev,
 {
 	struct qedr_dev *dev = get_qedr_dev(ibdev);
 	struct qedr_device_attr *qattr = &dev->attr;
+	int rc;
 
 	if (!dev->rdma_ctx) {
 		DP_ERR(dev,
@@ -113,6 +114,10 @@ int qedr_query_device(struct ib_device *ibdev,
 		return -EINVAL;
 	}
 
+	rc = ib_is_udata_in_empty(udata);
+	if (rc)
+		return rc;
+
 	memset(attr, 0, sizeof(*attr));
 
 	attr->fw_ver = qattr->fw_ver;
@@ -155,7 +160,7 @@ int qedr_query_device(struct ib_device *ibdev,
 	attr->max_pkeys = qattr->max_pkey;
 	attr->max_ah = qattr->max_ah;
 
-	return 0;
+	return ib_respond_empty_udata(udata);
 }
 
 static inline void get_link_speed_and_width(int speed, u16 *ib_speed,
diff --git a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
index 261f18a8368543..dc355b00f61cec 100644
--- a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
+++ b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
@@ -275,10 +275,12 @@ int usnic_ib_query_device(struct ib_device *ibdev,
 	union ib_gid gid;
 	struct ethtool_drvinfo info;
 	int qp_per_vf;
+	int err;
 
 	usnic_dbg("\n");
-	if (uhw->inlen || uhw->outlen)
-		return -EINVAL;
+	err = ib_is_udata_in_empty(uhw);
+	if (err)
+		return err;
 
 	mutex_lock(&us_ibdev->usdev_lock);
 	us_ibdev->netdev->ethtool_ops->get_drvinfo(us_ibdev->netdev, &info);
@@ -322,7 +324,7 @@ int usnic_ib_query_device(struct ib_device *ibdev,
 	 * max_qp_wr, max_sge, max_sge_rd, max_cqe */
 	mutex_unlock(&us_ibdev->usdev_lock);
 
-	return 0;
+	return ib_respond_empty_udata(uhw);
 }
 
 int usnic_ib_query_port(struct ib_device *ibdev, u32 port,
diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.c
index b9c3202b9545e3..1d29a535f76a8c 100644
--- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.c
+++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.c
@@ -67,9 +67,11 @@ int pvrdma_query_device(struct ib_device *ibdev,
 			struct ib_udata *uhw)
 {
 	struct pvrdma_dev *dev = to_vdev(ibdev);
+	int err;
 
-	if (uhw->inlen || uhw->outlen)
-		return -EINVAL;
+	err = ib_is_udata_in_empty(uhw);
+	if (err)
+		return err;
 
 	props->fw_ver = dev->dsr->caps.fw_ver;
 	props->sys_image_guid = dev->dsr->caps.sys_image_guid;
@@ -114,7 +116,7 @@ int pvrdma_query_device(struct ib_device *ibdev,
 	props->device_cap_flags |= IB_DEVICE_PORT_ACTIVE_EVENT |
 				   IB_DEVICE_RC_RNR_NAK_GEN;
 
-	return 0;
+	return ib_respond_empty_udata(uhw);
 }
 
 /**
diff --git a/drivers/infiniband/sw/rdmavt/vt.c b/drivers/infiniband/sw/rdmavt/vt.c
index 40aa6420836470..5fa3a1f3332689 100644
--- a/drivers/infiniband/sw/rdmavt/vt.c
+++ b/drivers/infiniband/sw/rdmavt/vt.c
@@ -6,6 +6,7 @@
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/dma-mapping.h>
+#include <rdma/uverbs_ioctl.h>
 #include "vt.h"
 #include "cq.h"
 #include "trace.h"
@@ -79,14 +80,16 @@ static int rvt_query_device(struct ib_device *ibdev,
 			    struct ib_udata *uhw)
 {
 	struct rvt_dev_info *rdi = ib_to_rvt(ibdev);
+	int err;
 
-	if (uhw->inlen || uhw->outlen)
-		return -EINVAL;
+	err = ib_is_udata_in_empty(uhw);
+	if (err)
+		return err;
 	/*
 	 * Return rvt_dev_info.dparms.props contents
 	 */
 	*props = rdi->dparms.props;
-	return 0;
+	return ib_respond_empty_udata(uhw);
 }
 
 static int rvt_get_numa_node(struct ib_device *ibdev)
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index 8edd4dd1f031f4..5815ce34d9704c 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -22,19 +22,13 @@ static int rxe_query_device(struct ib_device *ibdev,
 	struct rxe_dev *rxe = to_rdev(ibdev);
 	int err;
 
-	if (udata->inlen || udata->outlen) {
-		rxe_dbg_dev(rxe, "malformed udata\n");
-		err = -EINVAL;
-		goto err_out;
-	}
+	err = ib_is_udata_in_empty(udata);
+	if (err)
+		return err;
 
 	memcpy(attr, &rxe->attr, sizeof(*attr));
 
-	return 0;
-
-err_out:
-	rxe_err_dev(rxe, "returned err = %d\n", err);
-	return err;
+	return ib_respond_empty_udata(udata);
 }
 
 static int rxe_query_port(struct ib_device *ibdev,
diff --git a/drivers/infiniband/sw/siw/siw_verbs.c b/drivers/infiniband/sw/siw/siw_verbs.c
index b34f3d6547ffc7..b74ac85c1b8b8b 100644
--- a/drivers/infiniband/sw/siw/siw_verbs.c
+++ b/drivers/infiniband/sw/siw/siw_verbs.c
@@ -130,9 +130,11 @@ int siw_query_device(struct ib_device *base_dev, struct ib_device_attr *attr,
 		     struct ib_udata *udata)
 {
 	struct siw_device *sdev = to_siw_dev(base_dev);
+	int rv;
 
-	if (udata->inlen || udata->outlen)
-		return -EINVAL;
+	rv = ib_is_udata_in_empty(udata);
+	if (rv)
+		return rv;
 
 	memset(attr, 0, sizeof(*attr));
 
@@ -165,7 +167,7 @@ int siw_query_device(struct ib_device *base_dev, struct ib_device_attr *attr,
 	addrconf_addr_eui48((u8 *)&attr->sys_image_guid,
 			    sdev->raw_gid);
 
-	return 0;
+	return ib_respond_empty_udata(udata);
 }
 
 int siw_query_port(struct ib_device *base_dev, u32 port,
-- 
2.43.0



* [PATCH 1/2] RDMA/core: Don't make a dummy ib_udata on the stack in create_qp
From: Jason Gunthorpe @ 2026-05-26 16:15 UTC (permalink / raw)
  To: Abhijit Gangurde, Allen Hubbe,
	Broadcom internal kernel review list, Bernard Metzler,
	Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
	Junxian Huang, Kai Shen, Kalesh AP, Konstantin Taranov,
	Krzysztof Czurylo, Leon Romanovsky, linux-hyperv, linux-rdma,
	Long Li, Michal Kalderon, Nelson Escobar, Satish Kharat,
	Selvin Xavier, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
	Yishai Hadas, Zhu Yanjun
  Cc: Leon Romanovsky, patches
In-Reply-To: <0-v1-922fa8e828ba+f7-ib_udata_stack_jgg@nvidia.com>

Sashiko points out the udata for destruction has to be created using
uverbs_get_cleared_udata(). Move it to ib_core_uverbs.c so that the core
qp code can call it. Rework the call chain to pass the struct
uverbs_attr_bundle right up to the driver op callback.

This fixes a possible wild stack reference in drivers during error
unwinding: mlx5 can call rdma_udata_to_drv_context() from destroy_qp()
when destroying a QP.
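
To illustrate why the udata must live inside the bundle, here is a minimal
userspace sketch. It assumes the driver recovers its context by walking back
from the udata pointer to the enclosing bundle, container_of() style; the
types and function names are hypothetical stand-ins, not the kernel
definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for struct ib_udata. */
struct udata_sketch {
	size_t inlen;
	size_t outlen;
};

/* Simplified stand-in for struct uverbs_attr_bundle. */
struct bundle_sketch {
	void *context;				/* per-process driver context */
	struct udata_sketch driver_udata;	/* embedded, never on the stack */
};

/*
 * Analogue of uverbs_get_cleared_udata(): wipe the user arguments but
 * return the udata that is still embedded in the bundle.
 */
struct udata_sketch *sketch_get_cleared_udata(struct bundle_sketch *attrs)
{
	assert(attrs != NULL);
	memset(&attrs->driver_udata, 0, sizeof(attrs->driver_udata));
	return &attrs->driver_udata;
}

/*
 * Hypothetical analogue of a driver looking up its context from udata.
 * The pointer arithmetic is only valid when udata is the driver_udata
 * member of a real bundle; a standalone dummy on the stack would make
 * this read unrelated memory.
 */
void *sketch_udata_to_context(struct udata_sketch *udata)
{
	struct bundle_sketch *bundle;

	if (!udata)
		return NULL;

	bundle = (struct bundle_sketch *)((char *)udata -
			offsetof(struct bundle_sketch, driver_udata));
	return bundle->context;
}
```

The cleared udata looks empty to the driver, yet the walk back to the bundle
still lands on valid memory, which a stack-allocated dummy cannot guarantee.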

Fixes: 00a79d6b996d ("RDMA/core: Configure selinux QP during creation")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/core/core_priv.h           |  2 +-
 drivers/infiniband/core/ib_core_uverbs.c      | 12 +++++++++++
 drivers/infiniband/core/rdma_core.h           |  7 +++++++
 drivers/infiniband/core/uverbs_cmd.c          | 14 +------------
 drivers/infiniband/core/uverbs_std_types_qp.c |  3 +--
 drivers/infiniband/core/verbs.c               | 20 ++++++++++---------
 6 files changed, 33 insertions(+), 25 deletions(-)

diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index a2c36666e6fcb9..19104c542b270d 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -321,7 +321,7 @@ void nldev_exit(void);
 
 struct ib_qp *ib_create_qp_user(struct ib_device *dev, struct ib_pd *pd,
 				struct ib_qp_init_attr *attr,
-				struct ib_udata *udata,
+				struct uverbs_attr_bundle *uattrs,
 				struct ib_uqp_object *uobj, const char *caller);
 
 void ib_qp_usecnt_inc(struct ib_qp *qp);
diff --git a/drivers/infiniband/core/ib_core_uverbs.c b/drivers/infiniband/core/ib_core_uverbs.c
index b4fc693a3bd8b7..6c3bc9ca1d58ef 100644
--- a/drivers/infiniband/core/ib_core_uverbs.c
+++ b/drivers/infiniband/core/ib_core_uverbs.c
@@ -532,6 +532,18 @@ int uverbs_destroy_def_handler(struct uverbs_attr_bundle *attrs)
 }
 EXPORT_SYMBOL(uverbs_destroy_def_handler);
 
+/*
+ * When calling a destroy function during an error unwind we need to pass in
+ * the udata that is sanitized of all user arguments. Ie from the driver
+ * perspective it looks like no udata was passed.
+ */
+struct ib_udata *uverbs_get_cleared_udata(struct uverbs_attr_bundle *attrs)
+{
+	attrs->driver_udata = (struct ib_udata){};
+	return &attrs->driver_udata;
+}
+EXPORT_SYMBOL_NS_GPL(uverbs_get_cleared_udata, "rdma_core");
+
 /**
  * _uverbs_alloc() - Quickly allocate memory for use with a bundle
  * @bundle: The bundle
diff --git a/drivers/infiniband/core/rdma_core.h b/drivers/infiniband/core/rdma_core.h
index b626d3d24d087d..56121103e9f4f5 100644
--- a/drivers/infiniband/core/rdma_core.h
+++ b/drivers/infiniband/core/rdma_core.h
@@ -71,7 +71,14 @@ int uverbs_output_written(const struct uverbs_attr_bundle *bundle, size_t idx);
 
 void setup_ufile_idr_uobject(struct ib_uverbs_file *ufile);
 
+#if IS_ENABLED(CONFIG_INFINIBAND_USER_ACCESS)
 struct ib_udata *uverbs_get_cleared_udata(struct uverbs_attr_bundle *attrs);
+#else
+static inline struct ib_udata *uverbs_get_cleared_udata(struct uverbs_attr_bundle *attrs)
+{
+	return NULL;
+}
+#endif
 
 /*
  * This is the runtime description of the uverbs API, used by the syscall
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 32914007bae66f..41ad11ae1123b7 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -163,17 +163,6 @@ static int uverbs_request_finish(struct uverbs_req_iter *iter)
 	return 0;
 }
 
-/*
- * When calling a destroy function during an error unwind we need to pass in
- * the udata that is sanitized of all user arguments. Ie from the driver
- * perspective it looks like no udata was passed.
- */
-struct ib_udata *uverbs_get_cleared_udata(struct uverbs_attr_bundle *attrs)
-{
-	attrs->driver_udata = (struct ib_udata){};
-	return &attrs->driver_udata;
-}
-
 static struct ib_uverbs_completion_event_file *
 _ib_uverbs_lookup_comp_file(s32 fd, struct uverbs_attr_bundle *attrs)
 {
@@ -1462,8 +1451,7 @@ static int create_qp(struct uverbs_attr_bundle *attrs,
 		attr.source_qpn = cmd->source_qpn;
 	}
 
-	qp = ib_create_qp_user(device, pd, &attr, &attrs->driver_udata, obj,
-			       KBUILD_MODNAME);
+	qp = ib_create_qp_user(device, pd, &attr, attrs, obj, KBUILD_MODNAME);
 	if (IS_ERR(qp)) {
 		ret = PTR_ERR(qp);
 		goto err_put;
diff --git a/drivers/infiniband/core/uverbs_std_types_qp.c b/drivers/infiniband/core/uverbs_std_types_qp.c
index be0730e8509ed9..fd617903ffcf49 100644
--- a/drivers/infiniband/core/uverbs_std_types_qp.c
+++ b/drivers/infiniband/core/uverbs_std_types_qp.c
@@ -248,8 +248,7 @@ static int UVERBS_HANDLER(UVERBS_METHOD_QP_CREATE)(
 	set_caps(&attr, &cap, true);
 	mutex_init(&obj->mcast_lock);
 
-	qp = ib_create_qp_user(device, pd, &attr, &attrs->driver_udata, obj,
-			       KBUILD_MODNAME);
+	qp = ib_create_qp_user(device, pd, &attr, attrs, obj, KBUILD_MODNAME);
 	if (IS_ERR(qp)) {
 		ret = PTR_ERR(qp);
 		goto err_put;
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index bac87de9cc6735..1500bc09bdc915 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -53,6 +53,7 @@
 #include <rdma/rw.h>
 #include <rdma/lag.h>
 
+#include "rdma_core.h"
 #include "core_priv.h"
 #include <trace/events/rdma_core.h>
 
@@ -1265,10 +1266,9 @@ static struct ib_qp *create_xrc_qp_user(struct ib_qp *qp,
 
 static struct ib_qp *create_qp(struct ib_device *dev, struct ib_pd *pd,
 			       struct ib_qp_init_attr *attr,
-			       struct ib_udata *udata,
+			       struct uverbs_attr_bundle *uattrs,
 			       struct ib_uqp_object *uobj, const char *caller)
 {
-	struct ib_udata dummy = {};
 	struct ib_qp *qp;
 	int ret;
 
@@ -1301,9 +1301,10 @@ static struct ib_qp *create_qp(struct ib_device *dev, struct ib_pd *pd,
 	qp->recv_cq = attr->recv_cq;
 
 	rdma_restrack_new(&qp->res, RDMA_RESTRACK_QP);
-	WARN_ONCE(!udata && !caller, "Missing kernel QP owner");
-	rdma_restrack_set_name(&qp->res, udata ? NULL : caller);
-	ret = dev->ops.create_qp(qp, attr, udata);
+	WARN_ONCE(!uattrs && !caller, "Missing kernel QP owner");
+	rdma_restrack_set_name(&qp->res, uattrs ? NULL : caller);
+	ret = dev->ops.create_qp(qp, attr,
+				 uattrs ? &uattrs->driver_udata : NULL);
 	if (ret)
 		goto err_create;
 
@@ -1322,7 +1323,8 @@ static struct ib_qp *create_qp(struct ib_device *dev, struct ib_pd *pd,
 	return qp;
 
 err_security:
-	qp->device->ops.destroy_qp(qp, udata ? &dummy : NULL);
+	qp->device->ops.destroy_qp(
+		qp, uattrs ? uverbs_get_cleared_udata(uattrs) : NULL);
 err_create:
 	rdma_restrack_put(&qp->res);
 	kfree(qp);
@@ -1338,13 +1340,13 @@ static struct ib_qp *create_qp(struct ib_device *dev, struct ib_pd *pd,
  * @attr: A list of initial attributes required to create the
  *   QP.  If QP creation succeeds, then the attributes are updated to
  *   the actual capabilities of the created QP.
- * @udata: User data
+ * @uattrs: User ioctl attributes and udata
  * @uobj: uverbs obect
  * @caller: caller's build-time module name
  */
 struct ib_qp *ib_create_qp_user(struct ib_device *dev, struct ib_pd *pd,
 				struct ib_qp_init_attr *attr,
-				struct ib_udata *udata,
+				struct uverbs_attr_bundle *uattrs,
 				struct ib_uqp_object *uobj, const char *caller)
 {
 	struct ib_qp *qp, *xrc_qp;
@@ -1352,7 +1354,7 @@ struct ib_qp *ib_create_qp_user(struct ib_device *dev, struct ib_pd *pd,
 	if (attr->qp_type == IB_QPT_XRC_TGT)
 		qp = create_qp(dev, pd, attr, NULL, NULL, caller);
 	else
-		qp = create_qp(dev, pd, attr, udata, uobj, NULL);
+		qp = create_qp(dev, pd, attr, uattrs, uobj, NULL);
 	if (attr->qp_type != IB_QPT_XRC_TGT || IS_ERR(qp))
 		return qp;
 
-- 
2.43.0



* [PATCH 0/2] Remove stack ib_udata's
From: Jason Gunthorpe @ 2026-05-26 16:15 UTC (permalink / raw)
  To: Abhijit Gangurde, Allen Hubbe,
	Broadcom internal kernel review list, Bernard Metzler,
	Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
	Junxian Huang, Kai Shen, Kalesh AP, Konstantin Taranov,
	Krzysztof Czurylo, Leon Romanovsky, linux-hyperv, linux-rdma,
	Long Li, Michal Kalderon, Nelson Escobar, Satish Kharat,
	Selvin Xavier, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
	Yishai Hadas, Zhu Yanjun
  Cc: Leon Romanovsky, patches

Sashiko pointed out that these are dangerous, and the create_qp() one is in
fact a bug. The query_device() one is just ugly old code.

Remove the stack ib_udata's from both places.

Jason Gunthorpe (2):
  RDMA/core: Don't make a dummy ib_udata on the stack in create_qp
  RDMA: Update the query_device() op

 drivers/infiniband/core/core_priv.h           |  2 +-
 drivers/infiniband/core/device.c              |  3 +--
 drivers/infiniband/core/ib_core_uverbs.c      | 12 +++++++++++
 drivers/infiniband/core/rdma_core.h           |  7 +++++++
 drivers/infiniband/core/uverbs_cmd.c          | 14 +------------
 drivers/infiniband/core/uverbs_std_types_qp.c |  3 +--
 drivers/infiniband/core/verbs.c               | 20 ++++++++++---------
 drivers/infiniband/hw/bnxt_re/ib_verbs.c      |  5 ++++-
 drivers/infiniband/hw/cxgb4/provider.c        |  8 +++++---
 drivers/infiniband/hw/erdma/erdma_verbs.c     |  9 +++++++--
 drivers/infiniband/hw/hns/hns_roce_main.c     |  7 ++++++-
 drivers/infiniband/hw/ionic/ionic_ibdev.c     |  7 ++++++-
 drivers/infiniband/hw/irdma/verbs.c           |  8 +++++---
 drivers/infiniband/hw/mana/main.c             |  7 ++++++-
 drivers/infiniband/hw/mlx4/main.c             | 13 ++++++------
 drivers/infiniband/hw/mthca/mthca_provider.c  | 13 +++++++-----
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c   |  8 +++++---
 drivers/infiniband/hw/qedr/verbs.c            |  7 ++++++-
 drivers/infiniband/hw/usnic/usnic_ib_verbs.c  |  8 +++++---
 .../infiniband/hw/vmw_pvrdma/pvrdma_verbs.c   |  8 +++++---
 drivers/infiniband/sw/rdmavt/vt.c             |  9 ++++++---
 drivers/infiniband/sw/rxe/rxe_verbs.c         | 14 ++++---------
 drivers/infiniband/sw/siw/siw_verbs.c         |  8 +++++---
 23 files changed, 124 insertions(+), 76 deletions(-)


base-commit: fd9482545e37fb6b7e04b588ad2bd80a2779776c
-- 
2.43.0



* Re: [PATCH] uio_hv_generic: Bind to FCopy device by default
From: Naman Jain @ 2026-05-26 15:49 UTC (permalink / raw)
  To: Michael Kelley, Ben Hutchings
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Greg Kroah-Hartman, linux-hyperv@vger.kernel.org
In-Reply-To: <SN6PR02MB41574FDA377FF59597181B7BD40B2@SN6PR02MB4157.namprd02.prod.outlook.com>



On 5/26/2026 8:45 PM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Tuesday, May 26, 2026 3:10 AM
>>
>> On 5/26/2026 1:59 PM, Ben Hutchings wrote:
>>> On Tue, 2026-05-26 at 12:15 +0530, Naman Jain wrote:
>>>>
>>>> On 5/25/2026 5:34 PM, Ben Hutchings wrote:
>>>>> The Hyper-V kernel-mode fcopy driver was removed in 6.10 and the new
>>>>> fcopy daemon requires this uio driver to function.  However, by
>>>>> default the driver does not bind to any devices, and must be
>>>>> configured through the sysfs "new_id" file.
>>>>>
>>>>> Since the FCopy device is now only usable through this driver, add its
>>>>> ID to the driver's ID table so that the daemon will work "out of the
>>>>> box".
>>>>>
>>>>> Signed-off-by: Ben Hutchings <benh@debian.org>
>>>>> Fixes: ec314f61e4fc ("Drivers: hv: Remove fcopy driver")
>>>>> ---
>>>>> --- a/drivers/uio/uio_hv_generic.c
>>>>> +++ b/drivers/uio/uio_hv_generic.c
>>>>> @@ -395,9 +395,15 @@ hv_uio_remove(struct hv_device *dev)
>>>>>     	vmbus_free_ring(dev->channel);
>>>>>     }
>>>>>
>>>>> +static const struct hv_vmbus_device_id hv_uio_id_table[] = {
>>>>> +	{ HV_FCOPY_GUID },
>>>>> +	{}
>>>>> +};
>>>>> +MODULE_DEVICE_TABLE(vmbus, hv_uio_id_table);
>>>>> +
>>>>>     static struct hv_driver hv_uio_drv = {
>>>>>     	.name = "uio_hv_generic",
>>>>> -	.id_table = NULL, /* only dynamic id's */
>>>>> +	.id_table = hv_uio_id_table,
>>>>>     	.probe = hv_uio_probe,
>>>>>     	.remove = hv_uio_remove,
>>>>>     };
>>>>
>>
>> ++ recipients, assuming you mistakenly clicked reply instead of reply all.
> 
> Ben --
> 
> Regarding recipients, please include the full LKML
> (linux-kernel@vger.kernel.org) on the original patch posting, even
> though it is about a narrow Hyper-V issue. I dabble in areas beyond
> just Hyper-V so subscribe to the full LKML instead of the
> linux-hyperv mailing list. I miss patches like this one unless I happen
> to be looking through the lore.kernel.org archives for linux-hyperv.
> 
> Thx,
> 
> Michael
> 
Sashiko also misses such patches if linux-kernel@vger.kernel.org
is not included. I could not find this patch at https://sashiko.dev/#/.

Regards,
Naman


* RE: [PATCH] uio_hv_generic: Bind to FCopy device by default
From: Michael Kelley @ 2026-05-26 15:15 UTC (permalink / raw)
  To: Naman Jain, Ben Hutchings
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Greg Kroah-Hartman, linux-hyperv@vger.kernel.org
In-Reply-To: <aa420dc1-029c-408b-aef0-f02d6bfa002c@linux.microsoft.com>

From: Naman Jain <namjain@linux.microsoft.com> Sent: Tuesday, May 26, 2026 3:10 AM
> 
> On 5/26/2026 1:59 PM, Ben Hutchings wrote:
> > On Tue, 2026-05-26 at 12:15 +0530, Naman Jain wrote:
> >>
> >> On 5/25/2026 5:34 PM, Ben Hutchings wrote:
> >>> The Hyper-V kernel-mode fcopy driver was removed in 6.10 and the new
> >>> fcopy daemon requires this uio driver to function.  However, by
> >>> default the driver does not bind to any devices, and must be
> >>> configured through the sysfs "new_id" file.
> >>>
> >>> Since the FCopy device is now only usable through this driver, add its
> >>> ID to the driver's ID table so that the daemon will work "out of the
> >>> box".
> >>>
> >>> Signed-off-by: Ben Hutchings <benh@debian.org>
> >>> Fixes: ec314f61e4fc ("Drivers: hv: Remove fcopy driver")
> >>> ---
> >>> --- a/drivers/uio/uio_hv_generic.c
> >>> +++ b/drivers/uio/uio_hv_generic.c
> >>> @@ -395,9 +395,15 @@ hv_uio_remove(struct hv_device *dev)
> >>>    	vmbus_free_ring(dev->channel);
> >>>    }
> >>>
> >>> +static const struct hv_vmbus_device_id hv_uio_id_table[] = {
> >>> +	{ HV_FCOPY_GUID },
> >>> +	{}
> >>> +};
> >>> +MODULE_DEVICE_TABLE(vmbus, hv_uio_id_table);
> >>> +
> >>>    static struct hv_driver hv_uio_drv = {
> >>>    	.name = "uio_hv_generic",
> >>> -	.id_table = NULL, /* only dynamic id's */
> >>> +	.id_table = hv_uio_id_table,
> >>>    	.probe = hv_uio_probe,
> >>>    	.remove = hv_uio_remove,
> >>>    };
> >>
> 
> ++ recipients, assuming you mistakenly clicked reply instead of reply all.

Ben --

Regarding recipients, please include the full LKML
(linux-kernel@vger.kernel.org) on the original patch posting, even
though it is about a narrow Hyper-V issue. I dabble in areas beyond
just Hyper-V so subscribe to the full LKML instead of the
linux-hyperv mailing list. I miss patches like this one unless I happen
to be looking through the lore.kernel.org archives for linux-hyperv.

Thx,

Michael

> 
> 
> >> Two things worth considering before applying:
> >>
> >> 1. Please add Cc: stable@vger.kernel.org or is it that we do not want
> >> this to be ported to older kernels?
> >>
> >> 2. Every Hyper-V guest (with UIO_HV_GENERIC enabled) will now have an
> >> additional auto-bound /dev/uio0 node for FCopy.
> >
> > I don't think that's quite true.  I tested with a Windows 11 host and
> > needed to enable "Guest services" for the VM, which was disabled by
> > default.  But if that includes other features besides FCopy it might be
> > enabled for other reasons.
> >
> 
> Yes, meaning if these two conditions are satisfied (enabling guest
> services is also one time step for a Hyper-V VM), we would see uio0 by
> default for fcopy.
> 
> >> Anything that hardcodes
> >> /dev/uio0 (e.g. ad-hoc DPDK scripts that bind a NetVSC NIC via
> >> uio_hv_generic + new_id) may see its index shift, since FCopy now wins
> >> uio0 at boot.
> >
> > OK, so maybe I should implement the new_id dance in the fcopy service
> > startup, to avoid that?  I did already look at doing it in a systemd
> > unit, but it's hard to do right because adding the same ID twice is an
> > error.  Maybe the daemon itself could do it?
> 
> Implementing it in uio daemon can introduce race conditions with sysfs
> creation. I guess it's OK then to implement it here, in kernel.
> 
> >
> >> The fix for such consumers is the same thing DPDK and the
> >> in-tree daemon already do: resolve uio via
> >> /sys/bus/vmbus/devices/<guid>/uio/ rather than by number. This is not a
> >> regression in the patch, but it's a behavior change worth calling out.
> >
> > It would be a good reason *not* to make this change in stable.
> >
> > Ben.
> >
> 
> What issues are you fixing with this patch exactly? Is there any
> particular sequence of events you are targeting where traditional
> approach does not work?
> 
> Regards,
> Naman


^ permalink raw reply

* Re: [EXTERNAL] Re: [PATCH rdma-next v2] RDMA/mana_ib: hardening: Clamp adapter capability values from MANA_IB_GET_ADAPTER_CAP
From: Erni Sri Satya Vennela @ 2026-05-26 14:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Long Li, Leon Romanovsky, Konstantin Taranov,
	linux-rdma@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20260525230155.GB2487554@ziepe.ca>

On Mon, May 25, 2026 at 08:01:55PM -0300, Jason Gunthorpe wrote:
> On Mon, May 25, 2026 at 11:58:17AM -0700, Erni Sri Satya Vennela wrote:
> > > “There is no reason they should be signed, you should just fix the
> > > type.”
> > 
> > It is not allowed to change sign in props, so clamping is the best bet.
> 
> Why not? Fix the core code, it is just old junk that they are signed, they
> shouldn't ever have been.
> 
> Jason

Thanks for the feedback, Jason.

I sent v3 before seeing your comments on v2.
I'll send a v4 that drops the clamping entirely.

- Vennela

^ permalink raw reply

* [PATCH v2 1/1] mshv: Add conditional VMBus dependency
From: Michael Kelley @ 2026-05-26 14:13 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, jloeser, linux-hyperv
  Cc: linux-kernel, arnd, hamzamahfooz

From: Michael Kelley <mhklinux@outlook.com>

When the VMBus driver is not part of the kernel (CONFIG_HYPERV_VMBUS=n),
the MSHV root driver fails to link:

ERROR: modpost: "hv_vmbus_exists" [drivers/hv/mshv_root.ko] undefined!

Fix this while meeting these requirements:
* It must be possible to include the MSHV root driver without the
  VMBus driver. In such case, the MSHV root driver can be built-in
  to the kernel image, or it can be built as a separate module.
* If both the MSHV root driver and the VMBus driver are present, the
  MSHV root driver and VMBus driver can both be built-in, or they can
  both be separate modules. Or the MSHV root driver can be a module
  while the VMBus driver can be built-in, but the reverse is
  disallowed. Regardless of the build choices, the VMBus driver must
  be loaded before the MSHV driver in order for the SynIC to be
  managed properly (see comments in the MSHV SynIC code).

The fix has two parts:
* Add a Kconfig entry for MSHV_ROOT to depend on HYPERV_VMBUS if
  HYPERV_VMBUS is present. The entry disallows MSHV_ROOT being
  built-in when HYPERV_VMBUS is a module, but without requiring that
  HYPERV_VMBUS be built.
* Add a stub implementation of hv_vmbus_exists() for when the
  VMBus driver is not present so that the MSHV root driver has
  no module dependency on VMBus. When the VMBus driver *is*
  present, the module dependency ensures that the VMBus driver
  loads first when both are built as modules.

Existing code ensures that the VMBus driver loads first if it is
built-in. The VMBus driver uses subsys_initcall(), which is
initcall level 4. The MSHV root driver uses module_init(), which
becomes device_initcall() when built-in, and device_initcall() is
initcall level 6.

Reported-by: Arnd Bergmann <arnd@arndb.de>
Closes: https://lore.kernel.org/all/20260520074044.923728-1-arnd@kernel.org/
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jork Loeser <jloeser@linux.microsoft.com>
---
Changes in v2:
* Instead of putting IS_ENABLED(CONFIG_HYPERV_VMBUS) around each of
  the two calls to hv_vmbus_exists() in mshv_synic.c, provide a stub
  for hv_vmbus_exists() when CONFIG_HYPERV_VMBUS is not set. The
  effect is the same as in v1, but the code is cleaner. [Jork Loeser]

Arnd: I've kept your Ack even though I changed how hv_vmbus_exists()
is stubbed out since the effect is the same. Let me know if
you have any concerns.
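
For readers unfamiliar with the stub pattern, here is a minimal userspace
sketch of it. DEMO_HAVE_VMBUS, demo_vmbus_exists() and
demo_pick_synic_owner() are hypothetical stand-ins, not kernel symbols, and
the caller logic is an assumption for illustration rather than what
mshv_synic.c actually does.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for IS_ENABLED(CONFIG_HYPERV_VMBUS): define
 * DEMO_HAVE_VMBUS at build time to select the real declaration.
 */
#ifdef DEMO_HAVE_VMBUS
bool demo_vmbus_exists(void);	/* would be provided by another object file */
#else
/* Stub: no external symbol is referenced, so the linker has nothing to resolve. */
static inline bool demo_vmbus_exists(void) { return false; }
#endif

enum demo_synic_owner { DEMO_OWNER_MSHV, DEMO_OWNER_VMBUS };

/* Illustrative caller: compiles unchanged in both configurations. */
static enum demo_synic_owner demo_pick_synic_owner(void)
{
	return demo_vmbus_exists() ? DEMO_OWNER_VMBUS : DEMO_OWNER_MSHV;
}
```

Built without -DDEMO_HAVE_VMBUS, the stub is selected and the caller needs
no #ifdef of its own.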

 drivers/hv/Kconfig     | 1 +
 include/linux/hyperv.h | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 2d0b3fcb0ff8..aa11bcefddf2 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -74,6 +74,7 @@ config MSHV_ROOT
 	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
 	# no particular order, making it impossible to reassemble larger pages
 	depends on PAGE_SIZE_4KB
+	depends on HYPERV_VMBUS if HYPERV_VMBUS
 	select EVENTFD
 	select VIRT_XFER_TO_GUEST_WORK
 	select HMM_MIRROR
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 41a3d82f0722..734b7ef98f4d 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1304,7 +1304,11 @@ static inline void *hv_get_drvdata(struct hv_device *dev)

 struct device *hv_get_vmbus_root_device(void);

+#if IS_ENABLED(CONFIG_HYPERV_VMBUS)
 bool hv_vmbus_exists(void);
+#else
+static inline bool hv_vmbus_exists(void) { return false; }
+#endif

 struct hv_ring_buffer_debug_info {
 	u32 current_interrupt_mask;
-- 
2.25.1

^ permalink raw reply related

* [RFC] KVM/x86: Killing kvm_get_time_and_clockread() in favour of ktime_get_snapshot()
From: David Woodhouse @ 2026-05-26 13:57 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, John Stultz,
	Michael Kelley
  Cc: Vitaly Kuznetsov, Marcelo Tosatti, Christopher S. Hall,
	Stephen Boyd, Miroslav Lichvar, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Daniel Lezcano, kvm, linux-hyperv, x86,
	linux-kernel


In 2012, as part of implementing the "master clock" mode for kvmclock,
Marcelo added kvm_get_time_and_clockread() in commit d828199e8444
("KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag").

In 2016, Christopher Hall added the generic ktime_get_snapshot() in
commit 9da0f49c8767 ("time: Add timekeeping snapshot code capturing
system time and counter"), which provides the same paired read of
{ time, counter } through the core timekeeping code.

Then in 2018, Vitaly Kuznetsov added Hyper-V TSC page support in
commit b0c39dc68e3b ("x86/kvm: Pass stable clocksource to guests when
running nested on Hyper-V"), which extended vgettsc() to handle the
HVCLOCK case.

I'd quite like to kill it all with fire and make KVM use
ktime_get_snapshot() instead.

However, to correlate with the TSC provided to guests, KVM needs the
underlying host TSC counter value, *not* the cycles count from the
hyperv_clocksource_tsc_page clocksource which is scaled to 10MHz.

If we wanted to support master clock mode while nesting under KVM and
bizarrely using the kvmclock for system timing, we'd have the same
problem with the kvmclock clocksource, which similarly scales to 1GHz.

One option is to say "Don't Do That Then™": if you want to provide a
masterclock kvmclock to guests then *don't* use the silly pvclocks for
your own kernel's timekeeping, use the damn TSC. Because if the TSC
*isn't* reliable then you can't do masterclock mode for your guests
anyway.

Perhaps that should have been the response when commit b0c39dc68e3b was
submitted, but I guess we're stuck supporting that mode now. But I
really do want to kill the KVM hacks and use ktime_get_snapshot().

Reverse-engineering the original TSC reading from the clocksource
counter value doesn't look sane, without a loss of precision and/or
128-bit division.
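
To make the precision problem concrete, here is a rough userspace sketch.
It assumes the usual multiply-and-shift scaling, ref = (tsc * scale) >> 64,
and a hypothetical 3 GHz host TSC. The helper names are made up for
illustration, and unsigned __int128 is a GCC/Clang extension.

```c
#include <stdint.h>

/*
 * Scale factor mapping a TSC running at tsc_hz onto a 10 MHz reference
 * count. Assumes tsc_hz is above 10 MHz so the result fits in 64 bits.
 */
static uint64_t demo_scale_for(uint64_t tsc_hz)
{
	return (uint64_t)((((unsigned __int128)10000000) << 64) / tsc_hz);
}

/* Forward conversion: the shift discards the low 64 bits of the product. */
static uint64_t demo_tsc_to_ref(uint64_t tsc, uint64_t scale)
{
	return (uint64_t)(((unsigned __int128)tsc * scale) >> 64);
}
```

With a 3 GHz TSC, roughly 300 consecutive TSC values collapse onto each
reference tick, so the original reading cannot be recovered from the scaled
count alone.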

One simple option that occurs to me would be to add a 'cycles_raw'
value to the system_time_snapshot, for PV clocksources like hyperv and
kvmclock to populate with the original TSC reading.

That might actually let us clean up some of the PTP code that currently
has to deal with TSC vs. kvmclock in counter snapshots too. I think I
could kill the use of get_cycles() in vmclock for the kvmclock case,
which might make Thomas happy...

Any better ideas?


^ permalink raw reply

* Re: [PATCH net v2] net: mana: Optimize irq affinity for low vcpu configs
From: Shradha Gupta @ 2026-05-26 13:05 UTC (permalink / raw)
  To: Erni Sri Satya Vennela
  Cc: Yury Norov, Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Konstantin Taranov, Simon Horman, Dipayaan Roy,
	Shiraz Saleem, Michael Kelley, Long Li, Yury Norov, linux-hyperv,
	linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable
In-Reply-To: <agq5/8rUFp3ttOFz@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Mon, May 18, 2026 at 12:04:31AM -0700, Erni Sri Satya Vennela wrote:
> > > But one observation I had was that " irq_set_affinity_and_hint(*irqs++,
> > > NULL);" is essentially a no-op and we end up relying on the initial
> > > placement from pci_alloc_irq_vectors().
> > 
> > Yes you are, assuming you're not binding them before in your call chain.
> > 
> > > Even though in these tests we
> > > were not able to reproduce it, but with this distribution there is a
> > > chance we end up clustering the mana queue IRQs, while other vCPUs are
> > > not running any network load.
> > 
> > That sounds like an IRQ balancer bug which you're unable to reproduce. 
> > 
> > > It's because the placement depends on
> > > system-wide IRQ state at allocation time.
> > 
> > I don't understand this point. The 
> > 
> >         irq_set_affinity_and_hint(*irqs++, NULL);
> > 
> > simply means: I trust system IRQ balancer to pick the best CPU for my
> > IRQ at runtime. It doesn't refer any "IRQ state at allocation time".
> >   
> > > The linear approach however guarantees each queue IRQ lands on a
> > > distinct vCPU regardless of system state. Even after stressing the cpus
> > > using stress-ng, we did not observe any significant throughput drop.
> > 
> > If you just do nothing, it would lead to the same numbers, right? What
> > does that "non-significant throughput drop" mean? It sounds like the
> > linear approach is slightly worse.
> 
> The numbers are not worse; they are almost the same in both cases.
> > 
> > --
> > 
> > So, as you can't demonstrate solid benefit for the 'linear' IRQ placement,
> > I would just stick to the no-affinity logic.
> 
> Thank you, Yury.
> We are investigating more test scenarios and trying to
> capture numbers with both your proposed change and the one from this
> patch. We will keep you updated about the results.
> 
> 
> - Vennela

Hi Yury,

Vennela and I ran a number of additional tests and were able to reproduce
the clustering of MANA IRQs that we discussed earlier when using the
suggested approach (setting the affinity and hint to NULL).
In these tests, additional IRQs were allocated (apart from MANA's), which
disturbed the MANA IRQ distribution.

Environment details:
Azure SKU Standard_L4als_v5, 4 vCPUs (2 cores), 5 MANA IRQs (1 HWC + 4
queue)

"Affinity set to NULL" approach
========================================
MANA IRQ distribution	vCPU
========================================
IRQ0	HWC		0
IRQ1	mana_q1		2
IRQ3	mana_q2		3
IRQ4	mana_q3		2
IRQ5	mana_q4		3


"Affinity set linearly" approach
========================================
MANA IRQ distribution	vCPU
========================================
IRQ0	HWC		0
IRQ1	mana_q1		1
IRQ3	mana_q2		2
IRQ4	mana_q3		3
IRQ5	mana_q4		0
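
As a rough illustration of the table above only (not the patch's actual
code), a linear placement can be modelled as a round-robin over the online
vCPUs. The function name is hypothetical, and real code would go through
the kernel's cpumask and IRQ affinity APIs rather than plain integers.

```c
#include <assert.h>

/* Map the idx-th MANA IRQ (0 = HWC) to a vCPU, wrapping after the last one. */
static unsigned int demo_linear_cpu(unsigned int idx, unsigned int ncpus)
{
	assert(ncpus > 0);	/* modulo by zero is undefined */
	return idx % ncpus;
}
```

With 4 vCPUs and 5 IRQs this yields vCPUs 0, 1, 2, 3, 0, matching the
distribution listed in the table.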


Throughput (Gbps) with high TCP connection counts
========================================
connection	affinity NULL	Linear
20480		5.25		13.49
10240		5.77		13.48
8192		7.16		13.48
6144		9.33		13.53
4096		13.50		13.50


Considering these results, we would like to proceed with the linear
approach that was proposed by this patch.


Regards,
Shradha

^ permalink raw reply

* Re: [PATCH] uio_hv_generic: Bind to FCopy device by default
From: Naman Jain @ 2026-05-26 10:10 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Greg Kroah-Hartman, linux-hyperv
In-Reply-To: <afdcb1775e7a60b7824b5c540a44f0148abe3e1c.camel@debian.org>



On 5/26/2026 1:59 PM, Ben Hutchings wrote:
> On Tue, 2026-05-26 at 12:15 +0530, Naman Jain wrote:
>>
>> On 5/25/2026 5:34 PM, Ben Hutchings wrote:
>>> The Hyper-V kernel-mode fcopy driver was removed in 6.10 and the new
>>> fcopy daemon requires this uio driver to function.  However, by
>>> default the driver does not bind to any devices, and must be
>>> configured through the sysfs "new_id" file.
>>>
>>> Since the FCopy device is now only usable through this driver, add its
>>> ID to the driver's ID table so that the daemon will work "out of the
>>> box".
>>>
>>> Signed-off-by: Ben Hutchings <benh@debian.org>
>>> Fixes: ec314f61e4fc ("Drivers: hv: Remove fcopy driver")
>>> ---
>>> --- a/drivers/uio/uio_hv_generic.c
>>> +++ b/drivers/uio/uio_hv_generic.c
>>> @@ -395,9 +395,15 @@ hv_uio_remove(struct hv_device *dev)
>>>    	vmbus_free_ring(dev->channel);
>>>    }
>>>    
>>> +static const struct hv_vmbus_device_id hv_uio_id_table[] = {
>>> +	{ HV_FCOPY_GUID },
>>> +	{}
>>> +};
>>> +MODULE_DEVICE_TABLE(vmbus, hv_uio_id_table);
>>> +
>>>    static struct hv_driver hv_uio_drv = {
>>>    	.name = "uio_hv_generic",
>>> -	.id_table = NULL, /* only dynamic id's */
>>> +	.id_table = hv_uio_id_table,
>>>    	.probe = hv_uio_probe,
>>>    	.remove = hv_uio_remove,
>>>    };
>>

++ recipients, assuming you mistakenly clicked reply instead of reply all.


>> Two things worth considering before applying:
>>
>> 1. Please add Cc: stable@vger.kernel.org or is it that we do not want
>> this to be ported to older kernels?
>>
>> 2. Every Hyper-V guest (with UIO_HV_GENERIC enabled) will now have an
>> additional auto-bound /dev/uio0 node for FCopy.
> 
> I don't think that's quite true.  I tested with a Windows 11 host and
> needed to enable "Guest services" for the VM, which was disabled by
> default.  But if that includes other features besides FCopy it might be
> enabled for other reasons.
> 

Yes, meaning that if these two conditions are satisfied (enabling guest
services is also a one-time step for a Hyper-V VM), we would see uio0 by
default for fcopy.

>> Anything that hardcodes
>> /dev/uio0 (e.g. ad-hoc DPDK scripts that bind a NetVSC NIC via
>> uio_hv_generic + new_id) may see its index shift, since FCopy now wins
>> uio0 at boot.
> 
> OK, so maybe I should implement the new_id dance in the fcopy service
> startup, to avoid that?  I did already look at doing it in a systemd
> unit, but it's hard to do right because adding the same ID twice is an
> error.  Maybe the daemon itself could do it?

Implementing it in the uio daemon can introduce race conditions with sysfs
creation. I guess it's OK then to implement it here, in the kernel.

> 
>> The fix for such consumers is the same thing DPDK and the
>> in-tree daemon already do: resolve uio via
>> /sys/bus/vmbus/devices/<guid>/uio/ rather than by number. This is not a
>> regression in the patch, but it's a behavior change worth calling out.
> 
> It would be a good reason *not* to make this change in stable.
> 
> Ben.
> 

What issues exactly are you fixing with this patch? Is there any
particular sequence of events you are targeting where the traditional
approach does not work?

Regards,
Naman

^ permalink raw reply

* Re: [PATCH] uio_hv_generic: Bind to FCopy device by default
From: Naman Jain @ 2026-05-26  6:45 UTC (permalink / raw)
  To: Ben Hutchings, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li
  Cc: Greg Kroah-Hartman, linux-hyperv
In-Reply-To: <ahQ6xuhSReidmN-3@decadent.org.uk>

On 5/25/2026 5:34 PM, Ben Hutchings wrote:
> The Hyper-V kernel-mode fcopy driver was removed in 6.10 and the new
> fcopy daemon requires this uio driver to function.  However, by
> default the driver does not bind to any devices, and must be
> configured through the sysfs "new_id" file.
> 
> Since the FCopy device is now only usable through this driver, add its
> ID to the driver's ID table so that the daemon will work "out of the
> box".
> 
> Signed-off-by: Ben Hutchings <benh@debian.org>
> Fixes: ec314f61e4fc ("Drivers: hv: Remove fcopy driver")
> ---
> --- a/drivers/uio/uio_hv_generic.c
> +++ b/drivers/uio/uio_hv_generic.c
> @@ -395,9 +395,15 @@ hv_uio_remove(struct hv_device *dev)
>   	vmbus_free_ring(dev->channel);
>   }
>   
> +static const struct hv_vmbus_device_id hv_uio_id_table[] = {
> +	{ HV_FCOPY_GUID },
> +	{}
> +};
> +MODULE_DEVICE_TABLE(vmbus, hv_uio_id_table);
> +
>   static struct hv_driver hv_uio_drv = {
>   	.name = "uio_hv_generic",
> -	.id_table = NULL, /* only dynamic id's */
> +	.id_table = hv_uio_id_table,
>   	.probe = hv_uio_probe,
>   	.remove = hv_uio_remove,
>   };

Two things worth considering before applying:

1. Please add Cc: stable@vger.kernel.org, or do we not want this to be
backported to older kernels?
2. Every Hyper-V guest (with UIO_HV_GENERIC enabled) will now have an 
additional auto-bound /dev/uio0 node for FCopy. Anything that hardcodes 
/dev/uio0 (e.g. ad-hoc DPDK scripts that bind a NetVSC NIC via 
uio_hv_generic + new_id) may see its index shift, since FCopy now wins 
uio0 at boot. The fix for such consumers is the same thing DPDK and the 
in-tree daemon already do: resolve uio via 
/sys/bus/vmbus/devices/<guid>/uio/ rather than by number. This is not a 
regression in the patch, but it's a behavior change worth calling out.
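
As a minimal sketch of the lookup described above (not code taken from DPDK
or the in-tree daemon), a consumer could scan the device's uio/ directory
for its uioN entry. The function name and the sysfs_root parameter are
hypothetical; the root would normally be /sys/bus/vmbus/devices.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Find the "uioN" entry under <sysfs_root>/<guid>/uio/ and copy its name
 * into out. Returns 0 on success, or -1 if the directory or entry is
 * missing or the name does not fit in out.
 */
static int demo_find_uio_name(const char *sysfs_root, const char *guid,
			      char *out, size_t len)
{
	char path[512];
	struct dirent *de;
	DIR *dir;
	int n, ret = -1;

	n = snprintf(path, sizeof(path), "%s/%s/uio", sysfs_root, guid);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	dir = opendir(path);
	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		if (strncmp(de->d_name, "uio", 3) == 0 &&
		    strlen(de->d_name) < len) {
			strcpy(out, de->d_name);	/* length checked above */
			ret = 0;
			break;
		}
	}
	closedir(dir);
	return ret;
}
```

A caller would then open /dev/<name> using the returned name instead of
assuming /dev/uio0.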

Regards,
Naman

^ permalink raw reply

* Re: [EXTERNAL] Re: [PATCH rdma-next v2] RDMA/mana_ib: hardening: Clamp adapter capability values from MANA_IB_GET_ADAPTER_CAP
From: Jason Gunthorpe @ 2026-05-25 23:01 UTC (permalink / raw)
  To: Erni Sri Satya Vennela
  Cc: Long Li, Leon Romanovsky, Konstantin Taranov,
	linux-rdma@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <ahSbyYcq0sgfJnmZ@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Mon, May 25, 2026 at 11:58:17AM -0700, Erni Sri Satya Vennela wrote:
> > “There is no reason they should be signed, you should just fix the
> > type.”
> 
> It is not allowed to change sign in props, so clamping is the best bet.

Why not? Fix the core code, it is just old junk that they are signed, they
shouldn't ever have been.

Jason

^ permalink raw reply

* [PATCH rdma-next v3] RDMA/mana_ib: Clamp adapter capabilities at the ib_device_attr boundary
From: Erni Sri Satya Vennela @ 2026-05-25 19:01 UTC (permalink / raw)
  To: longli, kotaranov, Jason Gunthorpe, Leon Romanovsky, linux-rdma,
	linux-hyperv, linux-kernel
  Cc: Erni Sri Satya Vennela

mana_ib stores its adapter capabilities internally as u32 in
struct mana_ib_adapter_caps. The IB core, however, exposes the
corresponding device attributes through struct ib_device_attr, where
fields such as max_qp, max_qp_wr, max_send_sge, max_recv_sge,
max_sge_rd, max_cq, max_cqe, max_mr, max_pd, max_qp_rd_atom,
max_res_rd_atom and max_qp_init_rd_atom are signed int.

mana_ib_query_device() is the only place that copies the cached u32
caps into these int fields. If a cap exceeds INT_MAX, the implicit
u32-to-int narrowing yields a negative value. Clamp each cap to
INT_MAX at this boundary so the values handed to the IB core are always
non-negative.

While here, fix a related overflow in the computation of
max_res_rd_atom. It is derived as max_qp_rd_atom * max_qp, both of
which are int after the assignment above; the multiplication can
overflow an int even with the new clamps in place. Widen to s64
before multiplying and clamp the result to INT_MAX.

Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Changes in v3:
* Drop clamping from mana_ib_gd_query_adapter_caps(). The internal u32
  caps cache does not need to be clamped.
* Move all clamping exclusively to mana_ib_query_device(), which is the
  only place the cached u32 values are narrowed into the signed int
  fields of struct ib_device_attr.
* Reframe commit message: this is a u32-to-int type boundary fix, not a
  CVM/untrusted-hardware hardening patch.
Changes in v2:
* Update patch title.
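
For illustration, the two clamps can be sketched in plain userspace C. The
helper names are hypothetical and this is not the driver code; it assumes a
32-bit int and non-negative inputs to the product helper, as is the case
after the first clamp.

```c
#include <limits.h>
#include <stdint.h>

/* Clamp a u32 capability so it fits a signed int field. */
static int demo_clamp_cap(uint32_t cap)
{
	return cap > (uint32_t)INT_MAX ? INT_MAX : (int)cap;
}

/* Multiply in 64 bits, then clamp, so the product cannot overflow int. */
static int demo_clamp_product(int a, int b)
{
	int64_t prod = (int64_t)a * b;

	return prod > INT_MAX ? INT_MAX : (int)prod;
}
```

Without the clamp, a cap of 0xFFFFFFFF would typically be narrowed to -1
when stored in an int field.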
---
 drivers/infiniband/hw/mana/main.c | 33 ++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index ac5e75dd3494..ca843083140f 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -555,19 +555,28 @@ int mana_ib_query_device(struct ib_device *ibdev, struct ib_device_attr *props,
 	props->vendor_part_id = dev->gdma_dev->dev_id.type;
 	props->max_mr_size = MANA_IB_MAX_MR_SIZE;
 	props->page_size_cap = dev->adapter_caps.page_size_cap;
-	props->max_qp = dev->adapter_caps.max_qp_count;
-	props->max_qp_wr = dev->adapter_caps.max_qp_wr;
+	/*
+	 * mana_ib stores adapter capabilities internally as u32, but the
+	 * corresponding ib_device_attr fields are signed int. Clamp each
+	 * value at this boundary so a cap larger than INT_MAX is never
+	 * narrowed into a negative value visible to the IB core or
+	 * userspace.
+	 */
+	props->max_qp = min_t(u32, dev->adapter_caps.max_qp_count, INT_MAX);
+	props->max_qp_wr = min_t(u32, dev->adapter_caps.max_qp_wr, INT_MAX);
 	props->device_cap_flags = IB_DEVICE_RC_RNR_NAK_GEN;
-	props->max_send_sge = dev->adapter_caps.max_send_sge_count;
-	props->max_recv_sge = dev->adapter_caps.max_recv_sge_count;
-	props->max_sge_rd = dev->adapter_caps.max_recv_sge_count;
-	props->max_cq = dev->adapter_caps.max_cq_count;
-	props->max_cqe = dev->adapter_caps.max_qp_wr;
-	props->max_mr = dev->adapter_caps.max_mr_count;
-	props->max_pd = dev->adapter_caps.max_pd_count;
-	props->max_qp_rd_atom = dev->adapter_caps.max_inbound_read_limit;
-	props->max_res_rd_atom = props->max_qp_rd_atom * props->max_qp;
-	props->max_qp_init_rd_atom = dev->adapter_caps.max_outbound_read_limit;
+	props->max_send_sge = min_t(u32, dev->adapter_caps.max_send_sge_count, INT_MAX);
+	props->max_recv_sge = min_t(u32, dev->adapter_caps.max_recv_sge_count, INT_MAX);
+	props->max_sge_rd = min_t(u32, dev->adapter_caps.max_recv_sge_count, INT_MAX);
+	props->max_cq = min_t(u32, dev->adapter_caps.max_cq_count, INT_MAX);
+	props->max_cqe = min_t(u32, dev->adapter_caps.max_qp_wr, INT_MAX);
+	props->max_mr = min_t(u32, dev->adapter_caps.max_mr_count, INT_MAX);
+	props->max_pd = min_t(u32, dev->adapter_caps.max_pd_count, INT_MAX);
+	props->max_qp_rd_atom = min_t(u32, dev->adapter_caps.max_inbound_read_limit, INT_MAX);
+	props->max_res_rd_atom = min_t(s64,
+				       (s64)props->max_qp_rd_atom * props->max_qp,
+				       INT_MAX);
+	props->max_qp_init_rd_atom = min_t(u32, dev->adapter_caps.max_outbound_read_limit, INT_MAX);
 	props->atomic_cap = IB_ATOMIC_NONE;
 	props->masked_atomic_cap = IB_ATOMIC_NONE;
 	props->max_ah = INT_MAX;
-- 
2.34.1


^ permalink raw reply related

* Re: [EXTERNAL] Re: [PATCH rdma-next v2] RDMA/mana_ib: hardening: Clamp adapter capability values from MANA_IB_GET_ADAPTER_CAP
From: Erni Sri Satya Vennela @ 2026-05-25 18:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Long Li, Leon Romanovsky, Konstantin Taranov,
	linux-rdma@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20260413134602.GL3694781@ziepe.ca>

On Mon, Apr 13, 2026 at 10:46:02AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 10, 2026 at 10:29:45PM +0000, Long Li wrote:
> > > On Sat, Mar 21, 2026 at 12:56:39AM +0000, Long Li wrote:
> > > 
> > > > How about we rephrase it this way: the driver should not corrupt or
> > > > overflow other parts of the kernel if its device is misbehaving (or
> > > > has a bug).
> > > 
> > > If we are going to do this CC hardening stuff I think I want to see a more
> > > comprehensive approach, like if we detect an attack then the kernel instantly
> > > crashes or something. Or at least an approach in general agreed to by the CC and
> > > kernel community.
> > > 
> > > Ignoring the issue and continuing seems just wrong.
> > > 
> > > This sprinkling of random checks in this series doesn't feel comprehensive or
> > > cohesive to me.
> > > 
> > > Jason
> > 
> > Can we follow the virtio BAD_RING()/vq->broken pattern in
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/virtio/virtio_ring.c#n57.
> > 
> > Add a broken flag to mana_ib_dev. When any hardware response
> > contains out-of-range values, mark the device broken and fail the
> > operation - during probe this prevents device registration entirely,
> > at runtime all subsequent operations return -EIO.
> 
> If that's the plan I would think it should be struct device based, but
> yeah, I'm more comfortable with this sort of direction as a CC
> hardening plan.
> 
Hi Jason,

After multiple discussions, our team is not in favor of marking the
device broken, since both the values received from hardware and the
values stored by mana_ib_gd_query_adapter_caps are u32.

I'm planning to send v3 as a non-hardening patch that only clamps the
values to INT_MAX in mana_ib_query_device when they are out of bounds.

Your previous concerns:
> “I'm also not convinced clamping to such a high value has any value
> whatsoever, as it probably still triggers maths overflows elsewhere. I
> think you should clamp to reasonable limits for your device if you want
> to do this.”

We plan to clamp it to INT_MAX since it is the max in props.

> “There is no reason they should be signed, you should just fix the
> type.”

It is not allowed to change sign in props, so clamping is the best bet.

Thanks,
Vennela

^ permalink raw reply

* Re: [PATCH v5 0/2] drm/hyperv: harden host message parsing
From: Hamza Mahfooz @ 2026-05-25 15:32 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Berkant Koc, Saurabh Sengar, Dexuan Cui, Long Li,
	linux-hyperv@vger.kernel.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Thomas Zimmermann, Maarten Lankhorst, Maxime Ripard,
	Deepak Rawat
In-Reply-To: <SN6PR02MB4157F72302D2B4B86ECE553FD40A2@SN6PR02MB4157.namprd02.prod.outlook.com>

On Mon, May 25, 2026 at 02:57:24PM +0000, Michael Kelley wrote:
> From: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Sent: Monday, May 25, 2026 4:36 AM
> > Applied, thanks!
> 
> Hamza -- which tree was this applied to?

drm-misc-fixes

> 
> Michael

^ permalink raw reply
